Current Iteration
What the bot is doing right now
Each iteration the bot deals random cards, plays out every action path, computes "what if I'd done X instead?", and adjusts its regrets. Watch the gold path — that's what just happened.
The Bot's Brain — 12 Information Sets
What does the bot know in each situation?
In Kuhn poker, the bot only sees its own card plus the action history, which gives 12 distinct situations. For each, it tracks cumulative regrets, then derives the current strategy σ (the bar) by regret matching: each action is played in proportion to its positive regret. The smaller avg % is the strategy averaged over all iterations; that average, not the current σ, is the quantity that converges to Nash (and what the chart below tracks).
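A minimal sketch of that bookkeeping in TypeScript. The key scheme and the names (InfoSet, averageStrategy) are illustrative, not this site's actual source:

// Info-set key = own card + the action history seen so far. "p" = pass/check,
// "b" = bet; "" and "pb" are Player 1's turns, "p" and "b" are Player 2's.
const KEYS = ["J", "Q", "K"].flatMap(c => ["", "p", "b", "pb"].map(h => c + h));
// 12 keys: "J", "Jp", "Jb", "Jpb", "Q", ..., "Kpb"

interface InfoSet {
  regretSum: [number, number];   // cumulative regret R(I,a) for [pass, bet]
  strategySum: [number, number]; // reach-weighted sum of σ_t(I,a) over iterations
}

const infoSets = new Map<string, InfoSet>(
  KEYS.map(k => [k, { regretSum: [0, 0], strategySum: [0, 0] }] as [string, InfoSet])
);

// The "avg %" in the panel: strategySum, normalized. This average is what
// converges to Nash; the current σ keeps oscillating.
function averageStrategy(I: InfoSet): [number, number] {
  const total = I.strategySum[0] + I.strategySum[1];
  return total > 0
    ? [I.strategySum[0] / total, I.strategySum[1] / total]
    : [0.5, 0.5];
}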
Live Math Walkthrough
How EV and frequency are calculated for one decision
Every iteration, the bot does the three-step calculation below for each of its 12 information sets. Pick one to inspect — the numbers update live as training proceeds.
1. Expected value. For each action, what payoff do I expect if I play it, averaging over the opponent's possible cards and their current strategies in the subtree?
2. Regret. Regret of an action = how much better it would have been than what I did. Cumulative regret R(I,a) sums this over every iteration so far.
3. Regret matching. Play each action proportional to its positive regret. Negative regret? Don't play it. The average strategy across iterations is what converges to Nash. (All three steps are sketched in code below.)
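A worked pass through the three steps for a single visit to one information set, in TypeScript. Every number below is made up for illustration; nothing here is live site data.

const sigma = [0.6, 0.4];      // current strategy over [pass, bet]
const actionEV = [-0.2, 0.5];  // step 1: EV of each action vs. the opponent's range
const nodeEV = sigma[0] * actionEV[0] + sigma[1] * actionEV[1]; // ≈ 0.08
const oppReach = 0.5;          // counterfactual weight π_{-i}(I)

// Step 2: regret(I,a) += π_{-i}(I) · (v(I,a) − v(I)).
const regretSum = [0, 0];
for (let a = 0; a < 2; a++) regretSum[a] += oppReach * (actionEV[a] - nodeEV);
// regretSum ≈ [-0.14, 0.21]: betting is regretted positively, passing negatively

// Step 3: regret matching: play proportional to positive regret only.
const pos = regretSum.map(r => Math.max(r, 0));
const total = pos[0] + pos[1];
const nextSigma = total > 0 ? pos.map(p => p / total) : [0.5, 0.5];
// nextSigma = [0, 1]: all weight on bet until pass accumulates positive regret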
Strategy Convergence
Watch strategies stabilize toward Nash equilibrium
Each line tracks one decision (the bot's average betting probability for a given hand). Dashed lines mark the Nash-equilibrium target — convergence means the bot is unexploitable.
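For reference, Kuhn poker's Nash equilibria are known in closed form and make up a one-parameter family: pick any bluffing frequency α in [0, 1/3] for Player 1, then

Player 1 opens with a bet:        J with prob. α, Q never, K with prob. 3α
Player 1 calls after check-bet:   J never, Q with prob. α + 1/3, K always
Player 2 calls a bet:             J never, Q with prob. 1/3, K always
Player 2 bets after a check:      J with prob. 1/3, Q never, K always

Every member of the family has the same game value: −1/18 chips per hand for Player 1.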
How CFR Actually Works
Three loops, no opponent model needed
1. Walk the game tree. For every possible card deal and action sequence, recursively visit every decision point.
2. Compute counterfactuals. At each information set, ask: "If I'd reached here and played action A, what would I get?" Subtract the EV of what I actually did. That's the regret.
regret(I, a) += π_{-i}(I) · ( v(I,a) - v(I) )
3. Match strategy to regret. Next iteration, play actions proportional to positive regret — actions you wish you'd played more, you play more.
σ(I, a) = max(0, R(I,a)) / Σ max(0, R(I,·))
The magic: in two-player zero-sum games like this one, the strategy averaged across iterations converges to a Nash equilibrium. The bot becomes unexploitable without ever modeling its opponent; it just plays itself and stops regretting.
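Put together, the three loops fit in one recursive function. Below is a compact vanilla-CFR sketch for Kuhn poker in TypeScript, with chance-sampled deals as the iteration panel describes; CfrNode, cfr, and the key scheme are illustrative names and conventions, not this site's actual source:

const NAMES = ["J", "Q", "K"];

class CfrNode {
  regretSum: [number, number] = [0, 0];
  strategySum: [number, number] = [0, 0];

  // Step 3: regret matching. Also accrues the reach-weighted average-strategy
  // sum that the "avg %" column and the convergence chart are built from.
  strategy(reach: number): [number, number] {
    const pos = [Math.max(this.regretSum[0], 0), Math.max(this.regretSum[1], 0)];
    const t = pos[0] + pos[1];
    const s: [number, number] = t > 0 ? [pos[0] / t, pos[1] / t] : [0.5, 0.5];
    this.strategySum[0] += reach * s[0];
    this.strategySum[1] += reach * s[1];
    return s;
  }
}

const nodes = new Map<string, CfrNode>();

// Returns the utility of `history` for the player about to act. p0 and p1 are
// each player's own contribution to the probability of reaching this history.
function cfr(cards: number[], history: string, p0: number, p1: number): number {
  const player = history.length % 2;

  // Step 1 bottoms out at terminals: a fold ("bp", "pbp") wins the ante; a
  // showdown pays 1 ("pp") or 2 ("bb", "pbb") to the higher card.
  if (history.length > 1) {
    const higher = cards[player] > cards[1 - player];
    if (history.endsWith("p"))
      return history === "pp" ? (higher ? 1 : -1) : 1;
    if (history.endsWith("bb")) return higher ? 2 : -2;
  }

  const key = NAMES[cards[player]] + history;
  let node = nodes.get(key);
  if (!node) { node = new CfrNode(); nodes.set(key, node); }
  const sigma = node.strategy(player === 0 ? p0 : p1);

  // Step 1: walk the tree. Recurse on both actions; a child's value is negated
  // because the player to move alternates in this zero-sum game.
  const util: [number, number] = [0, 0];
  let nodeUtil = 0;
  for (const a of [0, 1] as const) {
    const next = history + (a === 0 ? "p" : "b");
    util[a] = player === 0
      ? -cfr(cards, next, p0 * sigma[a], p1)
      : -cfr(cards, next, p0, p1 * sigma[a]);
    nodeUtil += sigma[a] * util[a];
  }

  // Step 2: counterfactual regret, weighted by the opponent's reach π_{-i}(I).
  const oppReach = player === 0 ? p1 : p0;
  for (const a of [0, 1] as const)
    node.regretSum[a] += oppReach * (util[a] - nodeUtil);

  return nodeUtil;
}

// Training: sample a random deal each iteration, then walk every action path.
for (let iter = 0; iter < 100_000; iter++) {
  const cards = [0, 1, 2];
  for (let i = 2; i > 0; i--) { // Fisher-Yates shuffle
    const j = Math.floor(Math.random() * (i + 1));
    [cards[i], cards[j]] = [cards[j], cards[i]];
  }
  cfr(cards, "", 1, 1);
}

Utilities are always from the perspective of the player to move, which is why each recursive call is negated; the same convention appears in published Kuhn CFR examples.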
How Exploitability is Computed
The Nash-distance metric — and why it equals 0 at convergence
Exploitability measures how much chip equity an opponent could win against your strategy by playing a best response, the strategy that maximally exploits yours. At Nash equilibrium no deviation helps either side, so exploitability = 0; the higher the number, the further from Nash. Reported here in milli-big-blinds per game (mbb/g), the standard metric in poker AI research; with one chip treated as one big blind, 1 mbb/g = 0.001 chips lost per hand to a best-responding opponent.
ε(σ) = ½ · ( best_response_value(P1, σ̄_P2) + best_response_value(P2, σ̄_P1) )
Where best_response_value(player, opp_strategy) is what player achieves by best-responding to a fixed opp_strategy. In a zero-sum game each best-response value is at least that player's Nash value, and the two Nash values sum to zero, so ε ≥ 0 always and ε = 0 exactly at Nash. Both best-response values are computed against the average strategy σ̄, not the current σ.
Computation in Kuhn — brute-force enumeration:
Each player has only 6 information sets (half of the 12 above, which counts both seats), each with 2 actions. So each player has exactly 2⁶ = 64 pure best-response strategies. Cheap enough to brute-force:
for each of the 64 candidate pure strategies σ_BR:
    v(σ_BR) = expected_utility(σ_BR vs σ̄_opp), averaged over all 6 deals
best_response_value = max over all σ_BR of v(σ_BR)
The crucial constraint: the responder must commit to one action per information set — it can't peek at the opponent's card and pick differently per deal. (An earlier bug in this site let the best-response strategy cheat that way, which floored exploitability around 277 mbb/g instead of converging to 0.)
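A runnable version of that enumeration in TypeScript; Strategy, u1, bestResponseValue, and the rest are illustrative names, and the info-set keys follow the card-plus-history scheme used above:

// A strategy maps an info-set key ("card + history", e.g. "Kpb") to P(bet).
type Strategy = Map<string, number>;

const NAMES = ["J", "Q", "K"];
const DEALS: [number, number][] = [
  [0, 1], [0, 2], [1, 0], [1, 2], [2, 0], [2, 1], // the 6 equally likely deals
];

// Expected utility for Player 1 when P1 plays s[0] and P2 plays s[1].
function u1(cards: [number, number], h: string, s: [Strategy, Strategy]): number {
  if (h === "pp") return cards[0] > cards[1] ? 1 : -1;  // showdown, antes only
  if (h === "bp") return 1;                             // P2 folded
  if (h === "pbp") return -1;                           // P1 folded
  if (h === "bb" || h === "pbb") return cards[0] > cards[1] ? 2 : -2;
  const player = h.length % 2;
  const pBet = s[player].get(NAMES[cards[player]] + h) ?? 0;
  return (1 - pBet) * u1(cards, h + "p", s) + pBet * u1(cards, h + "b", s);
}

// The responder's 6 info-set keys: P1 acts at "" and "pb", P2 at "p" and "b".
function keysFor(player: number): string[] {
  const hists = player === 0 ? ["", "pb"] : ["p", "b"];
  return NAMES.flatMap(c => hists.map(h => c + h));
}

// Enumerate all 2^6 pure strategies; each bit fixes one info set's action.
function bestResponseValue(responder: number, opp: Strategy): number {
  const keys = keysFor(responder);
  let best = -Infinity;
  for (let mask = 0; mask < 64; mask++) {
    const br: Strategy = new Map(
      keys.map((k, i) => [k, (mask >> i) & 1] as [string, number])
    );
    const s: [Strategy, Strategy] = responder === 0 ? [br, opp] : [opp, br];
    let v = 0;
    for (const deal of DEALS) v += u1(deal, "", s) / DEALS.length;
    best = Math.max(best, responder === 0 ? v : -v); // u2 = -u1 (zero-sum)
  }
  return best;
}

// ε(σ̄) in chips per hand; multiply by 1000 for the mbb/g shown in the panel.
function exploitability(avgP1: Strategy, avgP2: Strategy): number {
  return 0.5 * (bestResponseValue(0, avgP2) + bestResponseValue(1, avgP1));
}

The bitmask is exactly the crucial constraint: one bit per information set, so each candidate commits to one action per situation regardless of the opponent's hidden card.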
Why this approach doesn't scale to bigger games: Leduc Poker (the next step up) has 144 info sets per player with 2 or 3 actions each. Brute force would need ~3¹⁴⁴ ≈ 10⁶⁸ strategies — infeasible. Larger games use a topological dynamic-programming best-response computation instead, processing info sets deepest-first and reusing already-decided choices in deeper subtrees.
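For completeness, here is a sketch of that deepest-first dynamic program, demonstrated on Kuhn so it can reuse Strategy, NAMES, DEALS, and the terminal rules from the sketch above (the helper names are again illustrative):

// Best response by dynamic programming: fix the responder's action at each
// info set, processing the deepest info sets first so that every responder
// decision below the one being optimized is already made.
function dpBestResponseValue(responder: number, opp: Strategy): number {
  const choice = new Map<string, number>(); // info-set key → 0 (pass) or 1 (bet)
  const sign = responder === 0 ? 1 : -1;    // convert u1 to the responder's view

  // Responder's utility at history h, under (choices so far, fixed σ̄_opp).
  function val(cards: [number, number], h: string): number {
    if (h === "pp") return sign * (cards[0] > cards[1] ? 1 : -1);
    if (h === "bp") return sign;
    if (h === "pbp") return -sign;
    if (h === "bb" || h === "pbb") return sign * (cards[0] > cards[1] ? 2 : -2);
    const player = h.length % 2;
    const key = NAMES[cards[player]] + h;
    if (player === responder) // already decided: deeper sets were done first
      return val(cards, h + (choice.get(key) === 1 ? "b" : "p"));
    const pBet = opp.get(key) ?? 0;
    return (1 - pBet) * val(cards, h + "p") + pBet * val(cards, h + "b");
  }

  // π_{-responder}(h): chance times the opponent's contribution to reaching h.
  function reach(cards: [number, number], h: string): number {
    let pi = 1 / 6;
    let cur = "";
    for (const ch of h) {
      const player = cur.length % 2;
      if (player !== responder) {
        const pBet = opp.get(NAMES[cards[player]] + cur) ?? 0;
        pi *= ch === "b" ? pBet : 1 - pBet;
      }
      cur += ch;
    }
    return pi;
  }

  // Deepest-first order over the responder's histories ("pb" before "" for P1).
  const order = responder === 0 ? ["pb", ""] : ["p", "b"];
  for (const h of order) {
    for (const my of [0, 1, 2]) {
      const v = [0, 0]; // reach-weighted value of pass / bet at this info set
      for (const their of [0, 1, 2]) {
        if (their === my) continue;
        const cards: [number, number] =
          responder === 0 ? [my, their] : [their, my];
        const pi = reach(cards, h);
        v[0] += pi * val(cards, h + "p");
        v[1] += pi * val(cards, h + "b");
      }
      choice.set(NAMES[my] + h, v[1] > v[0] ? 1 : 0);
    }
  }

  // Evaluate the root with every best-response choice now fixed.
  let total = 0;
  for (const deal of DEALS) total += val(deal, "") / DEALS.length;
  return total;
}

On Kuhn this returns exactly the same numbers as the 64-way enumeration, but its cost grows with the size of the game tree rather than exponentially in the number of information sets; in larger games the hard-coded deepest-first order becomes a topological sort over info sets, and subtree values are memoized.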