Watching a Poker Bot Learn — CFR+ on Kuhn Poker

Counterfactual Regret Minimization Plus (CFR+) · Tammelin, 2014

Current Iteration

What the bot is doing right now

CFR+ does two passes per iteration — one pass updates Player 1's regrets, the next pass updates Player 2's (alternating updates). Each pass walks all 6 card permutations. The gold path shows the most recent traversal.
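A minimal sketch of that loop in Python (cfr_plus_pass and the other names here are hypothetical stand-ins, not this page's actual internals):

import itertools

CARDS = ("J", "Q", "K")
DEALS = list(itertools.permutations(CARDS, 2))  # the 6 (P1 card, P2 card) deals

def cfr_plus_pass(deal, history, player, t):
    """Recursive tree walk for one deal and one player; the per-info-set
    math is unpacked step by step in the Live Math Walkthrough below."""
    ...

def train(iterations):
    for t in range(1, iterations + 1):
        for player in (0, 1):        # two passes per iteration: alternating updates
            for deal in DEALS:       # each pass walks all 6 permutations
                cfr_plus_pass(deal, history="", player=player, t=t)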

[Animated game tree: Player 1 vs Player 2. Legend: Player 1 decision · Player 2 decision · Showdown / fold · Path traversed this iteration]
Press Play to start training. The bot begins with no knowledge — every action is a 50/50 coin flip.

The Bot's Brain — 12 Information Sets

What does the bot know in each situation?

In Kuhn poker, the bot only sees its own card plus the action history. That's 12 distinct situations. CFR+'s defining trick: cumulative regret is clamped to ≥ 0 immediately after every update — bad actions get forgotten on the spot instead of slowly clawing back from a deep hole. Notice the regret cells below never go red. The current strategy σ (the bar) is regret-matching+ on those non-negative regrets; the smaller avg % is the weighted average across iterations using weight w_T = max(T − d, 0).

Live Math Walkthrough

How EV and frequency are calculated for one decision

Every iteration, the bot does the three-step calculation below for each of its 12 information sets. Pick one to inspect — the numbers update live as training proceeds.

Tip: click any info-set card above to switch which decision is inspected
1. Counterfactual EV per action

For each action, what payoff do I expect if I play it — averaging over the opponent's possible cards and their current strategies in the subtree?
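A worked instance, under a hypothetical strategy snapshot: P2 holds Q and faces a bet, and P1 currently bluff-bets J with probability 1/3 and always bets K (these numbers are assumptions for illustration, not read off the live demo):

p_bet = {"J": 1/3, "K": 1.0}   # P1's assumed bet frequency with each card
prior = {"J": 0.5, "K": 0.5}   # P2's belief about P1's card before the bet

reach = {c: prior[c] * p_bet[c] for c in p_bet}       # prob. P1 holds c AND bets
total = sum(reach.values())
posterior = {c: r / total for c, r in reach.items()}  # J: 0.25, K: 0.75

payoff_call = {"J": +2, "K": -2}   # each side has ante 1 + bet 1 in the pot
ev_call = sum(posterior[c] * payoff_call[c] for c in posterior)
ev_fold = -1.0                     # folding forfeits the ante
print(ev_call, ev_fold)            # -1.0 -1.0

The tie is no accident: a 1:3 bluff-to-value ratio makes Q exactly indifferent between calling and folding, which is why this snapshot corresponds to one of Kuhn's equilibria.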

2. Regret update (CFR+)

Regret of an action = how much better it would have been than what I did. CFR+ stores cumulative regret R⁺(I, a) = max(0, previous + this iteration's regret) — negative regret never accumulates.
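As code, with R a dict of cumulative regrets for one info set (a minimal sketch; reach_opp plays the role of π_{-i}(I) in the formula further down the page):

def update_regrets_plus(R, v_action, v, reach_opp):
    """CFR+ update: accumulate counterfactual regret, clamping at zero."""
    for a in R:
        R[a] = max(R[a] + reach_opp * (v_action[a] - v), 0.0)

R = {"bet": 1.0, "check": 0.0}
update_regrets_plus(R, v_action={"bet": -0.5, "check": 0.5}, v=0.1, reach_opp=0.5)
print(R)   # {'bet': 0.7, 'check': 0.2}; no entry can ever dip below zero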

3. Frequency = regret-matching+

Play each action with probability proportional to its (already non-negative) cumulative regret. The average across iterations is weighted: w_T = max(T − d, 0). Both the current and weighted-average strategies converge — the average just gets there faster and more stably.
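Continuing the sketch, the current strategy falls straight out of the clamped regrets (the weighted averaging itself is sketched under tweak 3 further down):

def regret_matching_plus(R):
    """Current strategy σ: each action in proportion to its non-negative regret."""
    total = sum(R.values())
    if total > 0:
        return {a: r / total for a, r in R.items()}
    return {a: 1.0 / len(R) for a in R}   # no positive regret anywhere: play uniformly

print(regret_matching_plus({"bet": 0.7, "check": 0.2}))
# {'bet': 0.778, 'check': 0.222} (rounded): bet about 78% of the time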

Strategy Convergence

CFR+ snaps to Nash an order of magnitude faster than vanilla CFR

Each line tracks the weighted-average betting probability σ̄ for one decision. Dashed lines mark the Nash target. In Tammelin's experiments CFR+ reached 1 mbb/hand exploitability in roughly 100 iterations vs ~1000 for vanilla CFR — flip to Turbo and watch the lines lock in.

[Chart legend: P1 bets with K (root) · P1 bluffs with J (root) · P2 bets K after check · P2 bluffs J after check · P1 calls Q after check-bet · Nash target]

What CFR+ Changes vs Vanilla CFR

Three small tweaks, ~10× faster convergence

1. Regret-matching+ (the "+"). After each regret update, clamp the cumulative value to ≥ 0. In vanilla CFR a bad action's regret can drop to −10,000; even when it starts looking good again you wait forever for the regret to climb back above zero. CFR+ wipes that debt instantly.

R⁺(I, a) ← max( R⁺(I, a) + π_{-i}(I) · ( v(I,a) − v(I) ), 0 )
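A toy trace of why the clamp matters, with a made-up regret stream that is bad early and good late:

deltas = [-3, -3, -3, +1, +1, +1]   # hypothetical instantaneous regrets
vanilla = plus = 0.0
for d in deltas:
    vanilla += d                    # vanilla CFR: the debt accumulates
    plus = max(plus + d, 0.0)       # CFR+: the debt is wiped on the spot
print(vanilla, plus)                # -6.0 3.0

Vanilla still needs six more good iterations just to climb back to zero; regret-matching+ is already playing the action.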

2. Alternating updates. Instead of updating both players in one pass, do two passes per iteration — pass 1 only touches Player 1's regrets and strategy sum, pass 2 only touches Player 2's. The other player's strategy is held fixed during each pass.

3. Weighted averaging. When accumulating the average strategy, weight the contribution from iteration T by w_T = max(T − d, 0). Later iterations have already shaken off early noise, so they should count more. d is an optional delay (here d = 0).

σ̄(I, a) ∝ Σ_T w_T · π_i(I) · σ_T(I, a)
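In code, the accumulation and final normalization might look like this (a sketch; strategy_sum is the running numerator of the formula above):

def accumulate_average(strategy_sum, sigma, reach_self, T, d=0):
    """Add iteration T's contribution, weighted by w_T = max(T - d, 0)."""
    w = max(T - d, 0)
    for a, p in sigma.items():
        strategy_sum[a] += w * reach_self * p

def sigma_bar(strategy_sum):
    """The weighted-average strategy that is actually reported and plotted."""
    total = sum(strategy_sum.values())
    return {a: s / total for a, s in strategy_sum.items()}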

Bonus: current strategy converges too. Plain CFR only converges on average — you must track σ̄. With CFR+, σ itself approximately converges, so even without averaging you get a usable strategy.

And because most regrets stay at zero (no negative drift), the regret tables compress dramatically — important when you're solving Heads-Up Limit Hold'em on hardware with finite memory.

How Exploitability is Computed

The Nash-distance metric — and why it equals 0 at convergence

Exploitability measures how much chip equity an opponent could win against your strategy by playing a best response — the strategy that maximally exploits yours. At Nash equilibrium no deviation helps either side, so exploitability = 0; higher means further from Nash. Reported here in milli-big-blinds per game (mbb/g), the standard poker-AI metric (1 mbb/g = 0.001 chips lost per hand against a perfect opponent).

ε(σ) = ½ · ( best_response_value(P1, σ̄_P2) + best_response_value(P2, σ̄_P1) )

Where best_response_value(player, opp_strategy) is what player achieves with their best response to a fixed opp_strategy. In a zero-sum game the Nash values cancel, so ε ≥ 0 always; ε = 0 only at Nash. Both best-response values are computed against the average strategy σ̄, not the current σ.

Computation in Kuhn — brute-force enumeration:

Each player has only 6 information sets, each with 2 actions. So each player has at most 2⁶ = 64 pure best-response strategies. Cheap enough to brute-force:

for each of the 64 candidate best-response strategies σ_BR:
    v = expected_utility(σ_BR vs σ̄_opp), averaged over all 6 deals
best_response_value = max over all candidates of v

The crucial constraint: the responder must commit to one action per information set — it can't peek at the opponent's card and pick differently per deal. (An earlier bug in this site let the best-response strategy cheat that way, which floored exploitability around 277 mbb/g instead of converging to 0.)
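Below is a self-contained sketch of that brute force for Kuhn, with actions encoded as 'p' (check/fold) and 'b' (bet/call), a common convention in Kuhn CFR code; strategies are dicts mapping (card, history) to action probabilities, and none of these names are this site's actual code:

import itertools

CARDS = ("J", "Q", "K")
RANK = {"J": 0, "Q": 1, "K": 2}
P1_INFOSETS = [(c, h) for c in CARDS for h in ("", "pb")]  # 6 info sets per player
P2_INFOSETS = [(c, h) for c in CARDS for h in ("p", "b")]

def payoff_p1(cards, h):
    """P1's chips at a terminal history (each player antes 1; bets are 1)."""
    if h == "bp":  return +1                  # P2 folds to P1's bet
    if h == "pbp": return -1                  # P1 folds to P2's bet
    stake = 2 if h.endswith("bb") else 1      # called pot vs. checked-down pot
    return stake if RANK[cards[0]] > RANK[cards[1]] else -stake

def ev_p1(cards, h, sig1, sig2):
    """Expected P1 payoff from history h under strategy profiles sig1, sig2."""
    if h in ("pp", "bp", "bb", "pbp", "pbb"):  # terminal histories
        return payoff_p1(cards, h)
    player = len(h) % 2                        # whose turn it is
    sigma = sig1 if player == 0 else sig2
    probs = sigma[(cards[player], h)]
    return sum(p * ev_p1(cards, h + a, sig1, sig2) for a, p in probs.items())

def best_response_value(responder, opp_sigma):
    """Max value over all 2^6 = 64 pure strategies: one action per info set."""
    infosets = P1_INFOSETS if responder == 0 else P2_INFOSETS
    best = float("-inf")
    for choice in itertools.product("pb", repeat=len(infosets)):
        pure = {I: {a: 1.0} for I, a in zip(infosets, choice)}  # committed: no peeking at the deal
        sig1, sig2 = (pure, opp_sigma) if responder == 0 else (opp_sigma, pure)
        v = sum((1 if responder == 0 else -1) * ev_p1(cards, "", sig1, sig2)
                for cards in itertools.permutations(CARDS, 2)) / 6
        best = max(best, v)
    return best

def exploitability(sig1_bar, sig2_bar):
    """ε(σ̄) in chips per hand; multiply by 1000 for mbb/g."""
    return 0.5 * (best_response_value(0, sig2_bar) + best_response_value(1, sig1_bar))

# Sanity check: the untrained 50/50 strategy should be clearly exploitable.
uniform1 = {I: {"p": 0.5, "b": 0.5} for I in P1_INFOSETS}
uniform2 = {I: {"p": 0.5, "b": 0.5} for I in P2_INFOSETS}
print(exploitability(uniform1, uniform2))     # positive, far from 0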

Why this approach doesn't scale to bigger games: Leduc Poker (the next step up) has 144 info sets per player with 2 or 3 actions each. Brute force would need ~3¹⁴⁴ ≈ 10⁶⁸ strategies — infeasible. Larger games use a topological dynamic-programming best-response computation instead, processing info sets deepest-first and reusing already-decided choices in deeper subtrees.