Watching a Poker Bot Learn — External Sampling MCCFR on Kuhn Poker

External Sampling Monte Carlo Counterfactual Regret Minimization · Lanctot, Waugh, Zinkevich, Bowling — 2009

Current Iteration

What External Sampling MCCFR does each step

Unlike Vanilla Counterfactual Regret Minimization, which sweeps every deal, External Sampling picks one random deal per iteration, alternates which player's regrets get updated, and then walks the tree: it enumerates the updating player's actions (visiting every branch) but samples a single branch wherever the opponent acts. Cheap per iteration, unbiased in expectation.

[Game-tree animation, Player 1 vs Player 2. Legend: Player 1 decision · Player 2 decision · Showdown / fold · Updating player (enumerated) · Opponent (sampled, one branch)]
Press Play to start. Each iteration: sample one deal, alternate which player updates, enumerate that player's actions, sample the opponent's. Same Nash convergence as Vanilla Counterfactual Regret Minimization, but each iteration is much cheaper.

The Bot's Brain — 12 Information Sets

What does the bot know in each situation?

In Kuhn poker, the bot only sees its own card plus the action history. That's 12 distinct situations. For each, it tracks cumulative regrets, then derives the current strategy σ (the bar) by regret matching — directly proportional to positive regret. The smaller avg % is the strategy averaged over all iterations — that's the quantity that converges to Nash. Important difference from Vanilla Counterfactual Regret Minimization: External Sampling alternates which player's regrets get updated, so on any given iteration only half of the cards (the updating player's six information sets) actually change.
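For concreteness, the 12 situations can be enumerated directly. A small sketch (the history encoding "p" = check/fold, "b" = bet/call and the names are mine, not the site's):

```python
from itertools import product

# The bot's 12 information sets: own card × public action history.
# History letters: "p" = check/fold (a "pass"), "b" = bet/call.
CARDS = ["J", "Q", "K"]
P1_HISTORIES = ["", "pb"]   # root decision; facing a bet after checking
P2_HISTORIES = ["p", "b"]   # after P1 checks; facing P1's bet
info_sets = [(c, h) for c, h in product(CARDS, P1_HISTORIES + P2_HISTORIES)]
# 3 cards × 4 decision points = 12 information sets
```

Each of these keys would map to a cumulative-regret table and a cumulative-strategy table, and only the six belonging to the current updating player change on a given iteration.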

Live Math Walkthrough

How expected value and frequency are calculated for one decision

Pick one information set to inspect. The bot does this three-step calculation only on the iterations where (a) the sampled deal reaches this information set, and (b) this set's player is the current updating player. So most iterations skip most sets — that's why per-iteration cost is so much lower than Vanilla Counterfactual Regret Minimization.

Tip: click any info-set card above to switch.
1. Counterfactual EV per action

For each action, what payoff do I expect if I play it — averaging over the opponent's possible cards and their current strategies in the subtree?

2. Regret update

Regret of an action = how much better it would have been than what I did. Cumulative regret R(I,a) sums this over every iteration so far.

3. Frequency = regret matching

Play each action proportional to its positive regret. Negative regret? Don't play it. The average strategy across iterations is what converges to Nash.
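Step 3 is small enough to write out in full. A minimal sketch of regret matching (function name is illustrative):

```python
def regret_matching(regrets):
    """Current strategy: each action in proportion to its positive
    cumulative regret; actions with negative regret get probability 0."""
    pos = [max(r, 0.0) for r in regrets]
    total = sum(pos)
    if total == 0:                          # no positive regret anywhere:
        return [1.0 / len(regrets)] * len(regrets)  # fall back to uniform
    return [p / total for p in pos]

# e.g. cumulative regrets (bet: 2.0, check: 6.0) -> play bet 25%, check 75%
```

Note this gives the *current* strategy σ; it's the separate running average of σ over all iterations that converges to Nash.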

Strategy Convergence

Watch strategies stabilize toward Nash equilibrium

Each line tracks one decision (the bot's average betting probability for a given hand). Dashed lines mark the Nash-equilibrium target — convergence means the bot is unexploitable. Lines may look slightly noisier than the Vanilla Counterfactual Regret Minimization version because each External Sampling iteration is a single sampled trajectory rather than a full sweep, but the average strategy still converges at the same O(1/√T) rate.

Lines tracked: P1 bets with K (root) · P1 bluffs with J (root) · P2 bets K after check · P2 bluffs J after check · P1 calls Q after check-bet · Nash target (dashed)

How External Sampling MCCFR Works

Same Nash convergence as Vanilla CFR — but each iteration is a single sampled trajectory

1. Sample one chance outcome. Pick a single deal from the deck according to its natural probability. Vanilla Counterfactual Regret Minimization sweeps every deal each iteration; this samples one.

2. Pick the updating player. Alternate each iteration: odd iterations update Player 1's regrets, even iterations update Player 2's.

3. Walk the tree, asymmetrically. At updating-player nodes enumerate every action and recurse into each. At opponent nodes sample one action according to the current strategy σ and recurse into only that branch.

4. Update regrets only at the updating player's information sets along the path. Standard counterfactual-regret update, except that both the chance probabilities and the opponent's reach probability are absorbed into the sampling, so the per-update math is simpler:

regret(I, a) += ṽ(I, a) − ṽ(I)

Where ṽ(I, a) is the sampled counterfactual value of taking action a at I — computed in the updating player's utility from the one sampled deal and the one sampled opponent action sequence. Neither chance probabilities nor the opponent's reach π_opp are multiplied in here: the sampling already weights each visit by exactly those probabilities in expectation, which is what makes the estimator unbiased.

5. Match strategy to regret — exactly the same as Vanilla CFR.

σ(I, a) = max(0, R(I,a)) / Σ max(0, R(I,·))

The result: the per-iteration regret update is an unbiased estimator of what Vanilla CFR would have computed. So the same Nash convergence theorem applies (Lanctot et al. 2009, Theorem 5) — and you get there with iterations that are orders of magnitude cheaper to compute on large games.
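The five steps above fit in a few dozen lines. This is a minimal illustrative sketch, not the site's implementation; the history encoding ("p" = check/fold, "b" = bet/call) and names are my own:

```python
import random

# External Sampling MCCFR on Kuhn poker. Cards 0 = J, 1 = Q, 2 = K;
# both players ante 1 chip.
ACTIONS = ["p", "b"]                      # p = check/fold, b = bet/call
TERMINAL = {"pp", "pbp", "pbb", "bp", "bb"}

def payoff_p1(h, cards):
    if h == "bp":  return 1               # P2 folded to P1's bet
    if h == "pbp": return -1              # P1 folded to P2's bet
    win = 1 if cards[0] > cards[1] else -1
    return win * (2 if h in ("pbb", "bb") else 1)   # called pot vs check-check

regrets, strat_sum = {}, {}

def strategy(I):
    """Step 5: regret matching over cumulative positive regret."""
    r = regrets.setdefault(I, [0.0, 0.0])
    pos = [max(x, 0.0) for x in r]
    s = sum(pos)
    return [x / s for x in pos] if s > 0 else [0.5, 0.5]

def es_walk(h, cards, updating, rng):
    """Steps 3-4: asymmetric walk; returns sampled EV to Player 1."""
    if h in TERMINAL:
        return payoff_p1(h, cards)
    p = len(h) % 2                        # whose turn: 0 = P1, 1 = P2
    I = (cards[p], h)
    sigma = strategy(I)
    if p == updating:                     # enumerate every action
        v = [es_walk(h + a, cards, updating, rng) for a in ACTIONS]
        ev = sigma[0] * v[0] + sigma[1] * v[1]
        sign = 1 if p == 0 else -1        # regrets in the acting player's utility
        for i in range(2):
            regrets[I][i] += sign * (v[i] - ev)
        return ev
    # opponent node: accumulate average-strategy weight, sample one branch
    ss = strat_sum.setdefault(I, [0.0, 0.0])
    ss[0] += sigma[0]; ss[1] += sigma[1]
    a = ACTIONS[0] if rng.random() < sigma[0] else ACTIONS[1]
    return es_walk(h + a, cards, updating, rng)

rng = random.Random(7)
for t in range(20000):
    cards = rng.sample(range(3), 2)       # step 1: sample one deal
    es_walk("", cards, t % 2, rng)        # step 2: alternate updating player
```

After training, normalizing `strat_sum` at an info set gives the average strategy σ̄; for instance P2 holding K and facing a bet (info set `(2, "b")`) should call nearly always, and P2 holding J should nearly always fold.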

Why MCCFR Looks Slower on Kuhn

Run Vanilla and MCCFR side-by-side and Vanilla wins — that's expected, here's why

If you train this page and the Vanilla Counterfactual Regret Minimization page for the same number of iterations, Vanilla's exploitability drops to 0 faster. That isn't a bug in the External Sampling implementation — it's a fundamental property of sampling-based algorithms on tiny games. There are three compounding reasons:

1. Higher per-iteration variance.

Vanilla CFR walks the full game tree every iteration. Every regret update uses the exact counterfactual value at every information set. Zero variance — each iteration moves the cumulative regret table in the precisely correct direction.

External Sampling MCCFR samples one deal and one opponent action sequence per iteration. The regret update is an unbiased estimator of the true counterfactual value, but each individual update is noisy: the expectation matches, and the noise only washes out when averaged over many iterations.

E[ ES regret update ] = Vanilla regret update   ← same expectation
Var[ ES regret update ] > 0                     ← but noisy

That noise translates into a slower decay of cumulative regret per iteration.
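The expectation/variance claim is easy to check numerically. A toy sketch with made-up numbers: Vanilla averages the subtree values under the opponent's strategy exactly; sampling one opponent action instead matches that value only on average.

```python
import random

rng = random.Random(0)
sigma_opp = [0.7, 0.3]   # opponent's current strategy (illustrative numbers)
values = [1.0, -2.0]     # subtree values after each opponent action

# Vanilla: the exact weighted average, 0.7·1 + 0.3·(−2) = 0.1
exact = sum(p * v for p, v in zip(sigma_opp, values))

# External Sampling: draw ONE opponent action per iteration. Each draw
# is either 1.0 or −2.0 (high variance), but the mean matches `exact`.
samples = [values[0] if rng.random() < sigma_opp[0] else values[1]
           for _ in range(100_000)]
estimate = sum(samples) / len(samples)
```

Any single sample is off by at least 0.9 here, yet the running mean converges to the Vanilla value — the same unbiased-but-noisy tradeoff the regret updates make.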

2. Alternating updates halve the work per iteration.

Vanilla CFR updates both players' regrets every iteration. External Sampling alternates: odd iterations only update Player 1, even iterations only update Player 2. So in iteration count T, External Sampling has effectively done T/2 updates per player versus Vanilla's T per player.

This isn't strictly worse for convergence — it's how the unbiasedness math works out — but it means External Sampling needs roughly twice as many iterations to give each player the same number of updates as Vanilla.

3. The convergence-rate constant is bigger.

Both algorithms satisfy the same asymptotic bound, but with different constants:

Vanilla:  max |R⁺(I,a)| / T ≤ C_v / √T
External: max |R⁺(I,a)| / T ≤ C_es / √T,   with C_es > C_v

The bound is the same shape (both decay as 1/√T), but External Sampling's constant absorbs the variance penalty: C_es includes a factor proportional to 1/min_reach_prob across the tree (Lanctot et al. 2009, Theorem 5). On Kuhn — where deals are uniform and strategies stay relatively spread out — the penalty is modest but visible. On a much larger game with rare reach probabilities, the penalty would be larger in per-iteration terms, but irrelevant in wall-clock terms because each iteration is so much cheaper to compute.

So when does External Sampling win?

The whole point of External Sampling MCCFR isn't faster per-iteration convergence — it's that each iteration is dramatically cheaper, and on big games this matters enormously:

cost_per_iter(Vanilla)  = O(|deals| × tree_size)
cost_per_iter(External) = O(updating_player_tree)   ← independent of deal count

For Kuhn: |deals| = 6 and tree_size ≈ 5 nodes. Vanilla's per-iteration cost is around 30 node visits — fast enough that on a modern machine you can run millions of iterations per second. The slight per-iteration disadvantage of External Sampling shows up directly in convergence speed because you can run roughly the same number of iterations in the same wall-clock time.

For Heads-Up Limit Texas Hold'em: |deals| runs to trillions of distinct chance outcomes (different card permutations) and the tree has billions of nodes. Vanilla can't finish a single full iteration in any practical time. External Sampling does thousands of iterations per second regardless. The variance penalty is still there, but the iteration-rate advantage is so large that External Sampling converges to Nash in hours where Vanilla would take millennia.

Practical takeaway:

On Kuhn — and on any game small enough that Vanilla CFR finishes a sweep in microseconds — use Vanilla CFR. It converges faster per iteration with no variance penalty.

External Sampling MCCFR earns its place when the per-iteration cost of Vanilla becomes the bottleneck. The Kuhn page on this site is mostly a demonstration that External Sampling produces the same Nash equilibrium, not that it's faster here. The wall-clock advantage shows up at games big enough that you can't enumerate the chance space comfortably — which is, in practice, almost any real poker variant.

How Exploitability is Computed

The Nash-distance metric — and why it equals 0 at convergence

Exploitability measures how much chip equity an opponent could win against your strategy by playing a best response — the strategy that maximally exploits yours. At Nash equilibrium no deviation helps either side, so exploitability = 0. Higher = further from Nash. Reported here in milli-big-blinds per game (mbb/g) — the standard poker artificial-intelligence metric (1 mbb/g = 0.001 chips lost per hand against a perfect opponent).

ε(σ) = ½ · ( best_response_value(P1, σ̄_P2) + best_response_value(P2, σ̄_P1) )

Where best_response_value(player, opp_strategy) is what player achieves with their best response to a fixed opp_strategy. In a zero-sum game the Nash values cancel, so ε ≥ 0 always; ε = 0 only at Nash. Both best-response values are computed against the average strategy σ̄, not the current σ.

Computation in Kuhn — brute-force enumeration:

Each player has only 6 information sets, each with 2 actions. So each player has at most 2⁶ = 64 pure best-response strategies. Cheap enough to brute-force:

for each of the 64 candidate pure best-response strategies σ_BR:
    v = expected utility of (σ_BR vs σ̄_opp), averaged over all 6 deals
best_response_value = max v

The crucial constraint: the responder must commit to one action per information set — it can't peek at the opponent's card and pick differently per deal. (An earlier bug in this site let the best-response strategy cheat that way, which floored exploitability around 277 mbb/g instead of converging to 0.)
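The loop above can be made concrete. A sketch (not the site's implementation; helper names are mine) that computes P1's best-response value when P2 is fixed to its unique Kuhn Nash strategy — in that case the best response recovers exactly the game value, −1/18 chips per hand:

```python
from itertools import permutations, product

# P2's unique Nash strategy for Kuhn poker (cards 0 = J, 1 = Q, 2 = K),
# written as behavioral probabilities:
P2_BET_AFTER_CHECK = {0: 1/3, 1: 0.0, 2: 1.0}  # bluff J 1/3, always bet K
P2_CALL_VS_BET     = {0: 0.0, 1: 1/3, 2: 1.0}  # fold J, call Q 1/3, call K

def p1_ev(c1, c2, bet_first, call_vs_bet):
    """P1's expected chips for one deal, given P1's two pure choices."""
    win = 1 if c1 > c2 else -1
    if bet_first:
        p = P2_CALL_VS_BET[c2]
        return p * 2 * win + (1 - p) * 1        # called: ±2; P2 folds: +1
    p = P2_BET_AFTER_CHECK[c2]
    ev_vs_bet = 2 * win if call_vs_bet else -1  # call: ±2; fold: −1
    return p * ev_vs_bet + (1 - p) * win        # P2 checks behind: ±1

def best_response_value_p1():
    """Max over all 2^6 = 64 pure P1 strategies: one committed action per
    information set, never varying with the opponent's hidden card."""
    best = float("-inf")
    for pure in product([False, True], repeat=6):  # (bet?, call?) per card
        v = sum(p1_ev(c1, c2, pure[2 * c1], pure[2 * c1 + 1])
                for c1, c2 in permutations(range(3), 2)) / 6
        best = max(best, v)
    return best
```

Because the responder commits one action per information set, the result is −1/18 ≈ −0.056 rather than the inflated value a per-deal "cheating" responder would produce.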

Why this approach doesn't scale to bigger games: Leduc Poker (the next step up) has 144 info sets per player with 2 or 3 actions each. Brute force would need ~3¹⁴⁴ ≈ 10⁶⁸ strategies — infeasible. Larger games use a topological dynamic-programming best-response computation instead, processing info sets deepest-first and reusing already-decided choices in deeper subtrees.