Current Iteration
What the bot is doing right now
Like CFR+, DCFR uses alternating updates — pass 1 updates Player 1's regrets, pass 2 updates Player 2's. After both passes, accumulated regrets and strategy contributions get discounted: positive regrets × t^(3/2)/(t^(3/2)+1), negative regrets halved, average-strategy contributions × (t/(t+1))². The gold path shows the most recent traversal.
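A minimal sketch of that schedule, assuming a hypothetical traverse(player, t) that performs one regret-update pass and a discount(t) step like the one detailed under "What DCFR Changes vs CFR+" below:

    def dcfr_iteration(t, traverse, discount):
        # Alternating updates, as in CFR+: one regret-update pass per player.
        traverse(0, t)   # pass 1: update Player 1's regrets
        traverse(1, t)   # pass 2: update Player 2's regrets
        discount(t)      # then discount regrets and average-strategy sums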
The Bot's Brain — 12 Information Sets
What does the bot know in each situation?
In Kuhn poker, the bot sees only its own card plus the action history — 12 distinct situations. Unlike CFR+, DCFR does let cumulative regret go negative — but negative regrets are halved every iteration (β = 0), so red cells decay geometrically toward zero instead of being wiped instantly. Positive regrets get a gentle multiplicative discount of t^(3/2)/(t^(3/2)+1) per iteration. The current-strategy bar (σ) shows regret matching on the positive part; the smaller avg % is the quadratically weighted average strategy σ̄ across iterations.
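For example (numbers purely illustrative): a cumulative regret of −8 decays through −4, −2, −1, … on successive iterations, while a positive regret at t = 4 keeps 4^(3/2)/(4^(3/2)+1) = 8/9 ≈ 89% of its value, and that fraction climbs toward 100% as t grows.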
Live Math Walkthrough
How EV and frequency are calculated for one decision
Every iteration, the bot does the three-step calculation below for each of its 12 information sets. Pick one to inspect — the numbers update live as training proceeds. A code sketch of all three steps follows step 3.
For each action, what payoff do I expect if I play it — averaging over the opponent's possible cards and their current strategies in the subtree?
Cumulative regret R(I,a) = previous + this iteration's regret. At iteration end, multiply by t^α/(t^α+1) if positive (α = 3/2) or t^β/(t^β+1) = 1/2 if negative (β = 0). Negatives decay geometrically; positives are softened toward recent values.
Play each action proportional to positive cumulative regret (negatives are ignored even though they aren't zero). Average strategy is computed with iteration-t weight ∝ t^γ = t² — so recent iterations dominate the output.
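The sketch below strings the three steps together for one information set. The names (regret_sum, strategy_sum, action_ev) are illustrative; action_ev stands in for the step-1 subtree expectations, and reach-probability weighting is omitted for brevity:

    def current_strategy(regret_sum):
        """Regret matching on the positive part of cumulative regret."""
        pos = {a: max(r, 0.0) for a, r in regret_sum.items()}
        total = sum(pos.values())
        if total == 0.0:
            return {a: 1.0 / len(pos) for a in pos}   # no positive regret: uniform
        return {a: p / total for a, p in pos.items()}

    def update_infoset(regret_sum, strategy_sum, action_ev):
        sigma = current_strategy(regret_sum)                    # step 3's sigma
        node_ev = sum(sigma[a] * action_ev[a] for a in sigma)   # step 1: EV under sigma
        for a in sigma:
            regret_sum[a] += action_ev[a] - node_ev             # step 2: accumulate regret
            strategy_sum[a] += sigma[a]                         # step 3: sigma-bar contribution
        return sigma

The end-of-iteration discounts described in steps 2 and 3 are applied once per iteration across all information sets; a sketch of that step appears in the next section.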
Strategy Convergence
DCFR(3/2, 0, 2) is the first algorithm to consistently beat CFR+
Each line tracks the quadratically weighted average betting probability σ̄ for one decision. Dashed lines mark the Nash target. In the Brown & Sandholm experiments, DCFR matched or outperformed CFR+ in every game tested, typically by a factor of 2–3, most pronounced in games containing severe mistake actions.
What DCFR Changes vs CFR+
Three knobs (α, β, γ) instead of one extreme corner
CFR+ is conceptually the corner case DCFR(∞, −∞, 2) — never discount positives, wipe negatives instantly, quadratic averaging. DCFR introduces three tunable knobs; the authors' empirical sweet spot is (α, β, γ) = (3/2, 0, 2), which softens both extremes.
1. Discount POSITIVE regrets (α = 3/2). CFR+ never forgets positive regret — a great early-iteration bet keeps influencing the strategy forever. DCFR multiplies positive regret by t^α/(t^α+1) per iteration, gently decaying old wins. As t grows the multiplier approaches 1, so the discount is mild but always present. (A code sketch covering all three knobs follows knob 3.)
R(I,a) ← R(I,a) · t^α / (t^α + 1)   [if positive]
2. Discount NEGATIVE regrets (β = 0). CFR+ wipes negatives to exactly 0 (β = −∞). DCFR with β = 0 multiplies negatives by t⁰/(t⁰+1) = 1/2 every iteration — geometric decay toward zero. They never quite hit zero, which sounds like a bug but lets you run regret-based pruning algorithms (CFR+ can't do this without modification).
R(I,a) ← R(I,a) · 1/2 [if negative, with β = 0]
3. Discount AVERAGE-strategy contributions (γ = 2). Same quadratic weighting CFR+ uses in practice: iteration t contributes ∝ t² to σ̄, so the early uninformed iterations get drowned out by the later ones. Implemented in rolling form by multiplying the strategy sum by (t/(t+1))² each step.
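A minimal sketch of that end-of-iteration discount step under the default knobs, operating in place on illustrative (infoset, action) → float dictionaries:

    def discount(regret_sum, strategy_sum, t, alpha=1.5, beta=0.0, gamma=2.0):
        """Apply DCFR's three discounts after both traversal passes."""
        pos = t**alpha / (t**alpha + 1)    # knob 1: soften positive regrets
        neg = t**beta / (t**beta + 1)      # knob 2: equals 1/2 when beta = 0
        avg = (t / (t + 1))**gamma         # knob 3: quadratic averaging weight
        for key, r in regret_sum.items():
            regret_sum[key] = r * (pos if r > 0 else neg)
        for key in strategy_sum:
            strategy_sum[key] *= avg

In the limits α → ∞ and β → −∞ the two regret multipliers tend to 1 and 0 (for t ≥ 2), recovering the CFR+ corner case described above.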
The motivating example (from the paper): three actions with payoffs 0, 1, and −1,000,000. CFR+ takes ~471,407 iterations to lock onto the obviously-best action because every iteration's regret carries equal weight in the cumulative sum, including the disastrous first one. With discounting, the first iteration's bad regret signal gets shrunk away — convergence in ~970 iterations.
How Exploitability is Computed
The Nash-distance metric — and why it equals 0 at convergence
Exploitability measures how much chip equity an opponent could win against your strategy by playing a best response — the strategy that maximally exploits yours. At Nash equilibrium no deviation helps either side, so exploitability = 0. Higher = further from Nash. Reported here in milli-big-blinds per game (mbb/g), the standard poker-AI metric (1 mbb/g = 0.001 chips lost per hand against a perfect opponent).
ε(σ) = ½ · ( best_response_value(P1, σ̄_P2) + best_response_value(P2, σ̄_P1) )
Where best_response_value(player, opp_strategy) is what player achieves with their best response to a fixed opp_strategy. In a zero-sum game the Nash values cancel, so ε ≥ 0 always; ε = 0 only at Nash. Both best-response values are computed against the average strategy σ̄, not the current σ.
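Expressed directly as code (a sketch; it leans on the brute-force best_response_value written out below):

    def exploitability(avg_p1, avg_p2):
        """epsilon(sigma) per the formula above; exactly 0 at Nash."""
        return 0.5 * (best_response_value(0, avg_p2)     # P1 best-responds to sigma-bar of P2
                      + best_response_value(1, avg_p1))  # P2 best-responds to sigma-bar of P1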
Computation in Kuhn — brute-force enumeration:
Each player has only 6 information sets, each with 2 actions. So each player has at most 2⁶ = 64 pure best-response strategies. Cheap enough to brute-force:
for each of the 64 candidate pure strategies σ_BR:
    v(σ_BR) = expected utility of σ_BR vs σ̄_opp, averaged over all 6 deals
best_response_value = max over σ_BR of v(σ_BR)
The crucial constraint: the responder must commit to one action per information set — it can't peek at the opponent's card and pick differently per deal. (An earlier bug in this site let the best-response strategy cheat that way, which floored exploitability around 277 mbb/g instead of converging to 0.)
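A runnable Python sketch of the whole computation, honoring that constraint. The encoding is illustrative, not this site's actual code: histories use "p" for check/fold and "b" for bet/call, strategies map (card, history) to a bet/call probability, and cards 0 < 1 < 2 stand for J, Q, K:

    import itertools

    CARDS = [0, 1, 2]   # Jack, Queen, King

    def value(history, cards, strategies):
        """Expected payoff to Player 1 for one deal."""
        if history in ("pp", "bb", "pbb"):            # showdown
            stake = 1 if history == "pp" else 2       # ante only vs ante + bet
            return stake if cards[0] > cards[1] else -stake
        if history == "bp":                           # Player 2 folds
            return 1
        if history == "pbp":                          # Player 1 folds
            return -1
        player = len(history) % 2                     # Player 1 acts on "", "pb"
        p_bet = strategies[player][(cards[player], history)]
        return ((1 - p_bet) * value(history + "p", cards, strategies)
                + p_bet * value(history + "b", cards, strategies))

    def best_response_value(responder, opp_avg):
        """Max value over the 2^6 = 64 pure strategies, one committed action per infoset."""
        histories = ("", "pb") if responder == 0 else ("p", "b")
        infosets = [(c, h) for c in CARDS for h in histories]
        best = float("-inf")
        for actions in itertools.product((0.0, 1.0), repeat=len(infosets)):
            pure = dict(zip(infosets, actions))       # same action for every deal
            strats = [pure, opp_avg] if responder == 0 else [opp_avg, pure]
            v = sum(value("", deal, strats)
                    for deal in itertools.permutations(CARDS, 2)) / 6
            best = max(best, v if responder == 0 else -v)
        return best

For instance, evaluating a hypothetical uniform strategy for both players should give a clearly positive value (in chips per hand; the site rescales to mbb/g), falling to 0 only at Nash:

    uniform = {(c, h): 0.5 for c in CARDS for h in ("", "p", "b", "pb")}
    print(exploitability(uniform, uniform))   # > 0: uniform play is exploitable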
Why this approach doesn't scale to bigger games: Leduc Poker (the next step up) has 144 info sets per player with 2 or 3 actions each. Brute force would need ~3¹⁴⁴ ≈ 10⁶⁸ strategies — infeasible. Larger games use a topological dynamic-programming best-response computation instead, processing info sets deepest-first and reusing already-decided choices in deeper subtrees.