Current Iteration
What the bot is doing right now
Each iteration: do K external-sampling traversals per player (sample chance and the opponent's actions, fully expand the traverser's actions), dump (info-set, sampled regret) tuples into a reservoir-sampled memory, then retrain the advantage network from scratch on that memory. The gold path shows the most recent traversal.
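In sketch form (Python, used for all examples on this page; traverse, train_from_scratch, and root are hypothetical placeholders for the pieces detailed below, not names from the paper):

def deep_cfr_iteration(t, K, nets, memories):
    for p in (0, 1):
        for _ in range(K):
            # external sampling: expand p's actions, sample chance and the opponent
            traverse(root, p, t, nets, memories[p])
        nets[p] = train_from_scratch(memories[p])  # retrain from scratch on the memory
    return nets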
The Bot's Brain — Two Tiny Neural Networks
No regret table — predicted advantages instead
There's no tabular regret anymore. Each player has a small multi-layer perceptron V_p(I, a | θ_p) that takes an info-set's features (card + action history) and predicts an advantage per action. The 12 cards below show those predicted advantages. Strategy is regret matching on the network output. With only 12 info-sets the network is overkill — the lesson here is the mechanism, not the speedup. (In real poker the table would be 10¹⁴ entries; here the network just memorizes 12 of them.)
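A minimal PyTorch sketch of one player's advantage network (the hidden sizes are illustrative choices, not the demo's exact architecture):

import torch.nn as nn

class AdvantageNet(nn.Module):
    """Tiny MLP: 7-d info-set features in, one predicted advantage per action out."""
    def __init__(self, n_features=7, n_actions=2, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.mlp(x)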
Live Math Walkthrough
From info-set features → network output → regret matching
For one info-set, here's what the network sees, what it predicts, and the strategy that comes out. The "true" subtree EV is shown for comparison — that's what tabular CFR would compute, and what the network is being trained to approximate.
If we computed the EV of each action by walking the actual subtree (what tabular CFR does), we'd get these numbers. The network's job is to approximate the resulting instantaneous regret r(a) = EV(a) − V(I) from sampled tuples in its memory.
The advantage network Vp(I, ·) takes the info-set's 7-d feature vector (3 card + 4 history one-hots) and outputs one predicted advantage per action. After training, these should track the instantaneous regrets on the left.
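One plausible encoding of that feature vector (the exact index ordering is an assumption for illustration):

CARDS = ["J", "Q", "K"]
HISTORIES = ["", "c", "b", "cb"]  # the four betting histories a Kuhn player can face

def encode(card, history):
    """3-d card one-hot + 4-d history one-hot = 7 features."""
    x = [0.0] * 7
    x[CARDS.index(card)] = 1.0
    x[3 + HISTORIES.index(history)] = 1.0
    return x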
Same regret matching as tabular CFR, applied to the network's predicted advantages instead of cumulative regret table values. When all predictions ≤ 0, pick the highest-advantage action with prob 1 (Deep CFR's ablation finding).
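The regret-matching step itself is a few lines; a sketch, where adv is the network's output vector for one info-set:

import numpy as np

def strategy_from_advantages(adv):
    """Regret matching on predicted advantages."""
    pos = np.maximum(adv, 0.0)
    if pos.sum() > 0:
        return pos / pos.sum()
    # all predictions <= 0: put probability 1 on the best action, not uniform
    s = np.zeros_like(adv)
    s[np.argmax(adv)] = 1.0
    return s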
Strategy Convergence
Slower and noisier than tabular — that's the honest result on Kuhn
Each line tracks the linearly weighted average betting probability σ̄ derived from the network's predictions. Dashed lines mark the Nash target. Convergence on Kuhn is roughly 10–20× worse than tabular CFR+ because the function approximator is overkill for 12 info-sets — but the same algorithm scales to 10¹⁷-node games where tabular CFR is impossible.
What Deep CFR Changes vs Tabular CFR
Throw out the regret table, predict regrets with a neural network
All previous CFR variants — vanilla, CFR+, DCFR — share one bottleneck: a giant regret table, one entry per (info-set, action) pair. Heads-Up Limit Hold'em has 10¹⁴ info-sets; No-Limit has 10¹⁷. That table won't fit in any computer. The historical workaround is hand-crafted abstraction — bucket similar situations together — but abstraction is lossy, domain-specific, and a chicken-and-egg problem.
Deep CFR's pitch: throw out the table, predict regrets with a small neural network that takes an info-set's features as input. The network generalizes across similar info-sets automatically, with no manual feature engineering for what counts as "similar".
1. External-sampling MCCFR traversal. Each iteration, do K traversals per player. At the traverser's nodes expand all actions; at chance and opponent nodes sample. Compute a sampled instantaneous regret r̃(I, a) = v(a) − Σ_a′ σ(a′)·v(a′) and dump the (info-set, regrets, iteration) tuple into a replay memory.
advMemory.add( (encode(I), [r̃(a₀), r̃(a₁)], t) )
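A hedged sketch of the traversal. The helpers is_terminal, utility, acting, legal_actions, info_set, sample_chance, and sample are hypothetical stand-ins for whatever game representation you use; info_set is assumed to return a (card, history) pair, and encode / strategy_from_advantages are the sketches above:

import torch

def traverse(h, p, t, nets, memory):
    if is_terminal(h):
        return utility(h, p)                    # terminal payoff to player p
    q = acting(h)                               # "chance" or a player id
    if q == "chance":
        return traverse(h + sample_chance(h), p, t, nets, memory)
    acts = legal_actions(h)
    x = torch.tensor(encode(*info_set(h, q)))
    sigma = strategy_from_advantages(nets[q](x).detach().numpy())
    if q != p:                                  # opponent node: sample one action
        return traverse(h + sample(acts, sigma), p, t, nets, memory)
    # traverser's node: expand all actions, record sampled regrets
    v = [traverse(h + a, p, t, nets, memory) for a in acts]
    v_I = float(sum(s * vi for s, vi in zip(sigma, v)))   # Σ σ(a)·v(a)
    memory.add((encode(*info_set(h, q)), [vi - v_I for vi in v], t))
    return v_I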
2. Reservoir sampling. The memory is reservoir-sampled, so a sample from iteration 1 has the same survival probability as a sample from iteration 1000. This is crucial: the paper's Figure 4 shows that sliding-window memory makes exploitability INCREASE once the buffer fills.
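Reservoir sampling (Vitter's Algorithm R) gives exactly that equal-survival property; a minimal sketch:

import random

class ReservoirMemory:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0                           # total samples ever offered

    def add(self, item):
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(item)
        else:
            j = random.randrange(self.seen)     # uniform over everything seen so far
            if j < self.capacity:
                self.buffer[j] = item           # every old sample equally likely to go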
3. Retrain the network from scratch each iteration. Sample mini-batches from memory, minimize MSE between the network's predicted advantage V(I, a | θ) and the stored sampled regrets. Fine-tuning from the previous iteration's weights gets stuck in bad local minima as the regression target shifts.
θ_p ← argmin_θ E_(I, t, r̃)~M_V,p [ t · Σ_a ( r̃(a) − V(I, a | θ) )² ]
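A sketch of that retraining step, assuming the AdvantageNet and ReservoirMemory sketches above (the hyperparameters are arbitrary):

import random
import torch

def train_from_scratch(memory, n_steps=1000, batch_size=128, lr=1e-3):
    net = AdvantageNet()                        # fresh weights every iteration
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(n_steps):
        batch = random.sample(memory.buffer, min(batch_size, len(memory.buffer)))
        x = torch.tensor([features for features, _, _ in batch])
        r = torch.tensor([regrets for _, regrets, _ in batch])
        t = torch.tensor([float(it) for _, _, it in batch])
        loss = (t * ((r - net(x)) ** 2).sum(dim=1)).mean()   # t-weighted MSE
        opt.zero_grad(); loss.backward(); opt.step()
    return net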
4. Strategy = regret matching on network output. Once trained, V(I, ·) plays the role the regret table used to play. When all predicted advantages are negative, pick the highest-advantage action with prob 1 (the paper's ablation: this beats uniform under approximation error).
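Tying the sketches together, acting at one info-set might look like this (names from the examples above; the action ordering is my assumption):

adv = net(torch.tensor(encode("Q", "b"))).detach().numpy()
pi = strategy_from_advantages(adv)   # e.g. [P(fold), P(call)] facing a bet with the Queen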
Why this matters: the first non-tabular CFR variant that actually scales. In Heads-Up Limit Hold'em, Deep CFR loses to a 3.3·10⁸-bucket tabular abstraction by only 11 mbb/g and beats Neural Fictitious Self-Play by 43 mbb/g — with no domain knowledge. Pluribus (multi-player No-Limit) and ReBeL build on this foundation.
About this visualization. Kuhn poker has only 12 info-sets — a regret table fits comfortably and the network is strictly worse than tabular here. Expect convergence to be slower and noisier than CFR+/DCFR on the other pages. The point is to show how the algorithm works, not why it wins. The win shows up only at scale.
How Exploitability is Computed
The Nash-distance metric — and why it equals 0 at convergence
Exploitability measures how much chip equity an opponent could win against your strategy by playing a best response — the strategy that maximally exploits yours. At Nash equilibrium no deviation helps either side, so exploitability = 0. Higher = further from Nash. Reported here in milli-big-blinds per game (mbb/g), the standard metric in poker AI (1 mbb/g = 0.001 chips lost per hand against a perfect opponent).
ε(σ̄) = ½ · ( best_response_value(P1, σ̄_P2) + best_response_value(P2, σ̄_P1) )
Where best_response_value(player, opp_strategy) is what player achieves with their best response to a fixed opp_strategy. In a zero-sum game the Nash values cancel, so ε ≥ 0 always; ε = 0 only at Nash. Both best-response values are computed against the average strategy σ̄, not the current σ.
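As a sketch (player indices 0/1 for P1/P2; best_response_value is the brute-force routine sketched in the next section, and 1 chip = 1 big blind is the conversion assumed here):

def exploitability(sigma_bar_1, sigma_bar_2):
    """ε(σ̄) in chips per hand; multiply by 1000 for mbb/g."""
    return 0.5 * (best_response_value(0, sigma_bar_2)
                  + best_response_value(1, sigma_bar_1))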
Computation in Kuhn — brute-force enumeration:
Each player has only 6 information sets, each with 2 actions. So each player has at most 2⁶ = 64 pure best-response strategies. Cheap enough to brute-force:
for each of 64 candidate best-response strategies σ_BR:
v = expected utility(σ_BR vs σ̄_opp) over all 6 deals
best_response_value = max v
The crucial constraint: the responder must commit to one action per information set — it can't peek at the opponent's card and pick differently per deal. (An earlier bug in this site let the best-response strategy cheat that way, which floored exploitability around 277 mbb/g instead of converging to 0.)
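A runnable version of that enumeration, under conventions of my own choosing (p = pass/fold, b = bet/call; cards ranked 0 < 1 < 2; opp_sigma(card, history) returns the opponent's mixed action probabilities as a dict):

from itertools import permutations, product

TERMINALS = {"pp", "bp", "pbp", "bb", "pbb"}

def terminal_value(h, c1, c2):
    """P1's chips at a terminal history."""
    sign = 1 if c1 > c2 else -1
    if h == "pp":  return sign      # check-check showdown
    if h == "bp":  return 1         # P2 folds to P1's bet
    if h == "pbp": return -1        # P1 folds to P2's bet
    return 2 * sign                 # "bb" / "pbb": called bet

def ev(h, cards, br, opp_sigma, br_player):
    """P1's expected chips; br is a pure-strategy dict keyed by (card, history)."""
    if h in TERMINALS:
        return terminal_value(h, cards[0], cards[1])
    player = len(h) % 2             # P1 and P2 alternate in Kuhn
    card = cards[player]
    if player == br_player:         # responder: one fixed action per info set
        return ev(h + br[(card, h)], cards, br, opp_sigma, br_player)
    return sum(prob * ev(h + a, cards, br, opp_sigma, br_player)
               for a, prob in opp_sigma(card, h).items())

def best_response_value(br_player, opp_sigma):
    """Enumerate all 2^6 = 64 pure responses; return the best expected value."""
    hists = ["", "pb"] if br_player == 0 else ["p", "b"]
    infosets = [(c, h) for c in range(3) for h in hists]   # 6 info sets
    sign = 1 if br_player == 0 else -1                     # zero-sum flip for P2
    return max(
        sign * sum(ev("", deal, dict(zip(infosets, acts)), opp_sigma, br_player)
                   for deal in permutations(range(3), 2)) / 6
        for acts in product("pb", repeat=6)
    )

Note that keying the responder's strategy by (card, history) alone is exactly what enforces the no-peeking constraint: the same info set gets the same action in every deal.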
Why this approach doesn't scale to bigger games: Leduc Poker (the next step up) has 144 info sets per player with 2 or 3 actions each. Brute force would need ~3¹⁴⁴ ≈ 10⁶⁸ strategies — infeasible. Larger games use a topological dynamic-programming best-response computation instead, processing info sets deepest-first and reusing already-decided choices in deeper subtrees.