Watching a Poker Bot Learn — Discounted CFR on Kuhn Poker

Discounted Counterfactual Regret Minimization (DCFR), α=3/2, β=0, γ=2 · Brown & Sandholm, 2019

Current Iteration

What the bot is doing right now

Like CFR+, DCFR uses alternating updates — pass 1 updates Player 1's regrets, pass 2 updates Player 2's. After both passes, accumulated regrets and strategy contributions get discounted: positive regrets ×t^(3/2)/(t^(3/2)+1), negative regrets halved, average-strategy contributions ×(t/(t+1))^2. The gold path shows the most recent traversal.

[Interactive game tree: Player 1 vs Player 2. Nodes mark Player 1 decisions, Player 2 decisions, and showdown/fold leaves; the gold path is the traversal from this iteration.]

Press Play to start training. The bot begins with no knowledge — every action is a 50/50 coin flip.

The Bot's Brain — 12 Information Sets

What does the bot know in each situation?

In Kuhn poker, the bot sees only its own card plus the action history — 12 distinct situations. Unlike CFR+, DCFR does let cumulative regret go negative — but the negative half is halved every iteration (β=0), so red cells decay geometrically toward zero instead of being wiped instantly. Positive regrets get a gentle multiplicative discount of t^(3/2)/(t^(3/2)+1) per iteration. The current strategy σ bar is regret matching on the positive part; the smaller avg % is the quadratically weighted average across iterations.

Live Math Walkthrough

How EV and frequency are calculated for one decision

Every iteration, the bot does the three-step calculation below for each of its 12 information sets. Pick one to inspect — the numbers update live as training proceeds.

Tip: click any info-set card above to switch which decision is inspected.
1. Counterfactual EV per action

For each action, what payoff do I expect if I play it — averaging over the opponent's possible cards and their current strategies in the subtree?
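As a concrete sketch of step 1: suppose the bot is Player 2 holding K after Player 1 checks. The Player 1 probabilities below are made up for illustration only (they are not the trained bot's numbers); the point is the reach-weighted average over the opponent's possible cards.

```python
# Counterfactual EV for one decision: P2 holds K after P1 checks.
# All P1 probabilities here are assumed for illustration only.
p1_check = {"J": 0.8, "Q": 1.0}         # P(P1 checks | card), assumed
p1_call_vs_bet = {"J": 0.0, "Q": 0.4}   # P(P1 calls our bet | card), assumed

# Reach weights: each deal equally likely a priori, scaled by P(check).
weights = {c: p1_check[c] for c in ("J", "Q")}
total = sum(weights.values())

# Checking behind: showdown for the antes; K always wins (+1).
ev_check = sum(w * 1 for w in weights.values()) / total

# Betting: +2 when called (bet + ante), +1 when P1 folds (ante only).
ev_bet = sum(
    w * (p1_call_vs_bet[c] * 2 + (1 - p1_call_vs_bet[c]) * 1)
    for c, w in weights.items()
) / total
```

Under these assumed numbers ev_bet ≈ 1.22 > ev_check = 1.0, so betting accrues positive regret relative to the current mix.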

2. Regret update (DCFR)

Cumulative regret R(I,a) = previous + this iteration's regret. At iteration end, multiply by t^α/(t^α+1) if positive (α=3/2) or t^β/(t^β+1) = 1/2 if negative (β=0). Negatives decay geometrically; positives are softened toward recent values.

3. Frequency = regret matching

Play each action proportional to positive cumulative regret (negatives are ignored even though they aren't zero). Average strategy is computed with iteration-t weight ∝ t^γ = t^2 — so recent iterations dominate the output.
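Step 3 in code, as a minimal sketch. The accumulator names (regrets, strategy_sum) are illustrative stand-ins for one info set, not the site's internals; this uses the direct t^γ weighting, which is equivalent to the rolling form.

```python
def regret_matching(regrets):
    """Current strategy sigma: proportional to positive cumulative regret."""
    positives = [max(r, 0.0) for r in regrets]
    total = sum(positives)
    n = len(regrets)
    if total == 0.0:
        return [1.0 / n] * n        # no positive regret: play uniformly
    return [p / total for p in positives]

def accumulate_average(strategy_sum, sigma, t, gamma=2.0):
    """Add iteration t's strategy with weight t**gamma (gamma = 2)."""
    return [s + (t ** gamma) * p for s, p in zip(strategy_sum, sigma)]

regrets = [4.0, -3.0, 1.0]           # example cumulative regrets
sigma = regret_matching(regrets)     # → [0.8, 0.0, 0.2]
```

The negative regret on the second action is ignored for the current strategy, but it is still stored — that is what DCFR's β = 0 discount acts on.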

Strategy Convergence

DCFR(3/2, 0, 2) is the first algorithm to consistently beat CFR+

Each line tracks the quadratically weighted average betting probability σ̄ for one decision. Dashed lines mark the Nash target. In the Brown & Sandholm experiments DCFR matched or outperformed CFR+ in every game tested, typically by a factor of 2–3× — most pronounced in games with severe mistake actions.

[Chart legend: P1 bets with K (root) · P1 bluffs with J (root) · P2 bets K after check · P2 bluffs J after check · P1 calls Q after check-bet · dashed = Nash target]

What DCFR Changes vs CFR+

Three knobs (α, β, γ) instead of one extreme corner

CFR+ is conceptually the corner case DCFR(∞, −∞, 2) — never discount positives, wipe negatives instantly, quadratic averaging. DCFR introduces three tunable knobs and the authors' empirical sweet spot is (α, β, γ) = (3/2, 0, 2), which softens both extremes.

1. Discount POSITIVE regrets (α = 3/2). CFR+ never forgets positive regret — a great early-iteration bet keeps influencing the strategy forever. DCFR multiplies positive regret by t^α/(t^α+1) per iteration, gently decaying old wins. As t grows the multiplier approaches 1, so the discount is mild but always present.

R(I,a) ← R(I,a) · t^α / (t^α + 1) [if positive]

2. Discount NEGATIVE regrets (β = 0). CFR+ wipes negatives to exactly 0 (β = −∞). DCFR with β = 0 multiplies negatives by t^0/(t^0+1) = 1/2 every iteration — geometric decay toward zero. They never quite hit zero, which sounds like a bug but lets you run regret-based pruning algorithms (CFR+ can't do this without modification).

R(I,a) ← R(I,a) · 1/2 [if negative, with β = 0]

3. Discount AVERAGE-strategy contributions (γ = 2). Same quadratic weighting CFR+ uses in practice: iteration t contributes ∝ t^2 to σ̄, so the early uninformed iterations get drowned out by the later ones. Implemented in rolling form by multiplying the strategy sum by (t/(t+1))^2 each step.
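All three knobs can be applied in one end-of-iteration step. A minimal sketch, assuming flat lists for one info set's accumulators (the real implementation's data layout may differ):

```python
def dcfr_discount(regrets, strategy_sum, t, alpha=1.5, beta=0.0, gamma=2.0):
    """Apply DCFR's per-iteration discounts after iteration t."""
    pos = t ** alpha / (t ** alpha + 1)   # multiplier for positive regrets
    neg = t ** beta / (t ** beta + 1)     # = 1/2 whenever beta = 0
    avg = (t / (t + 1)) ** gamma          # rolling average-strategy discount
    new_regrets = [r * (pos if r > 0 else neg) for r in regrets]
    new_strategy_sum = [s * avg for s in strategy_sum]
    return new_regrets, new_strategy_sum
```

At t = 1 with (α, β, γ) = (3/2, 0, 2) every multiplier is at its harshest: positive and negative regrets are both halved and the strategy sum is quartered. As t grows, the positive and average multipliers approach 1 while the negative multiplier stays fixed at 1/2.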

The motivating example (from the paper): three actions with payoffs 0, 1, and −1,000,000. CFR+ takes ~471,407 iterations to lock onto the obviously-best action because every iteration carries equal weight, including the disastrous first one. With discounting, the first iteration's bad regret signal gets shrunk away — convergence in ~970 iterations.

How Exploitability is Computed

The Nash-distance metric — and why it equals 0 at convergence

Exploitability measures how much chip equity an opponent could win against your strategy by playing a best response — the strategy that maximally exploits yours. At Nash equilibrium no deviation helps either side, so exploitability = 0. Higher = further from Nash. Reported here in milli-big-blinds per game (mbb/g) — the standard poker artificial-intelligence metric (1 mbb/g = 0.001 chips lost per hand against a perfect opponent).

ε(σ) = ½ · ( best_response_value(P1, σ̄_P2) + best_response_value(P2, σ̄_P1) )

Where best_response_value(player, opp_strategy) is what player achieves with their best response to a fixed opp_strategy. In a zero-sum game the Nash values cancel, so ε ≥ 0 always; ε = 0 only at Nash. Both best-response values are computed against the average strategy σ̄, not the current σ.

Computation in Kuhn — brute-force enumeration:

Each player has only 6 information sets, each with 2 actions. So each player has at most 2⁶ = 64 pure best-response strategies. Cheap enough to brute-force:

    for each of the 64 candidate pure best-response strategies σ_BR:
        v(σ_BR) = expected utility of σ_BR vs σ̄_opp, averaged over all 6 deals
    best_response_value = max over σ_BR of v(σ_BR)
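A runnable version of that enumeration, as a sketch. The (card, history) dict encoding of strategies is an assumption of this example, not the site's actual representation; histories use 'k' = check, 'b' = bet, 'c' = call, 'f' = fold.

```python
import itertools

CARDS = ("J", "Q", "K")
RANK = {"J": 0, "Q": 1, "K": 2}

def terminal_value(history, c1, c2):
    """P1's payoff at a terminal history, else None."""
    win = 1 if RANK[c1] > RANK[c2] else -1
    if history == "kk":
        return win          # check-check showdown: pot holds only the antes
    if history in ("bc", "kbc"):
        return 2 * win      # showdown after a called bet
    if history == "bf":
        return 1            # P2 folded to P1's bet
    if history == "kbf":
        return -1           # P1 folded to P2's bet
    return None

def ev(history, c1, c2, s1, s2):
    """Expected payoff for P1 on one deal, given both strategies."""
    v = terminal_value(history, c1, c2)
    if v is not None:
        return v
    acting_p1 = len(history) % 2 == 0
    card, strat = (c1, s1) if acting_p1 else (c2, s2)
    p_agg = strat[(card, history)]  # P(bet) or P(call) at this info set
    agg, passive = ("c", "f") if history.endswith("b") else ("b", "k")
    return (p_agg * ev(history + agg, c1, c2, s1, s2)
            + (1 - p_agg) * ev(history + passive, c1, c2, s1, s2))

def best_response_value_p2(s1):
    """Max EV P2 can get vs a fixed P1 strategy: try all 2^6 pure replies."""
    infosets = [(c, h) for c in CARDS for h in ("k", "b")]
    best = float("-inf")
    for bits in itertools.product((0.0, 1.0), repeat=6):
        s2 = dict(zip(infosets, bits))
        v = sum(-ev("", c1, c2, s1, s2)          # P2's payoff = -P1's
                for c1, c2 in itertools.permutations(CARDS, 2)) / 6
        best = max(best, v)
    return best
```

Against the Nash average strategy, both best-response values approach the game value (−1/18 for Player 1, +1/18 for Player 2), so the averaged ε goes to 0.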

The crucial constraint: the responder must commit to one action per information set — it can't peek at the opponent's card and pick differently per deal. (An earlier bug in this site let the best-response strategy cheat that way, which floored exploitability around 277 mbb/g instead of converging to 0.)

Why this approach doesn't scale to bigger games: Leduc Poker (the next step up) has 144 info sets per player with 2 or 3 actions each. Brute force would need ~3¹⁴⁴ ≈ 10⁶⁸ strategies — infeasible. Larger games use a topological dynamic-programming best-response computation instead, processing info sets deepest-first and reusing already-decided choices in deeper subtrees.