Section 1 · The Core Replacement
The regret table becomes a function approximator
Tabular CFR stores one regret value per (information set, action) pair. In Heads-Up No-Limit Texas Hold'em (HUNL) that would mean entries for roughly 10^160 states. Deep CFR keeps the same algorithm but represents the regret function as a neural network instead.
Key invariant. The CFR algorithm itself doesn't change — it's still iterating regret updates and doing regret matching. What changes is how regret is stored and queried. Tabular CFR reads from a hash map; Deep CFR does a forward pass through the value network D(I, a) instead.
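The contrast can be made concrete. A minimal Python sketch, where the infoset key, the feature encoding, and the tiny linear "network" are all hypothetical stand-ins for the paper's actual architecture — the point is only that the query changes from a lookup to a forward pass:

```python
ACTIONS = ["fold", "call", "raise"]

# Tabular CFR: one stored number per (infoset, action) pair,
# keyed by the exact infoset string.
regret_table = {("AKs|wet-board|bet-2/3", a): 0.0 for a in ACTIONS}

def query_tabular(infoset):
    return [regret_table[(infoset, a)] for a in ACTIONS]

# Deep CFR: the same query is a forward pass through a value network D(I, a).
# Here a fixed 4-feature linear layer stands in for the trained network.
WEIGHTS = [[ 0.2, -0.1,  0.4, 0.0],   # fold
           [ 0.1,  0.3, -0.2, 0.5],   # call
           [-0.3,  0.2,  0.1, 0.1]]   # raise

def encode(infoset):
    # Hypothetical hand-rolled features (suitedness, board texture, bet size, ...).
    return [1.0, 1.0, 0.67, 0.0]

def query_network(infoset):
    x = encode(infoset)
    return [sum(w * xi for w, xi in zip(row, x)) for row in WEIGHTS]

print(query_tabular("AKs|wet-board|bet-2/3"))   # exact stored regrets
print(query_network("AKs|wet-board|bet-2/3"))   # approximated regrets
```

Note that the network answers queries even for infoset keys it has never stored — that generalization is the subject of Section 5.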
Section 2 · The Training Iteration
Five steps repeat each iteration.
Step 1 — External sampling traversal
For player i (the "traverser") on iteration t: at every information set where i acts, recursively explore every available action. At opponent and chance nodes, randomly sample one action according to the opponent's current strategy or chance distribution. This drastically prunes the tree while still giving an unbiased estimate of regret for the traverser.
walk(h):
    if P(h) = i:                      # traverser acts: explore every action
        for a in A(h): walk(h·a)
    else:                             # opponent or chance: sample one action
        a ~ σ_{-i}(h); walk(h·a)

Section 3 · External Sampling, Visualized
The traverser explores all branches; everyone else plays one sampled move
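The traversal rule can be run directly on a toy two-player tree. Everything here — the tree encoding, the uniform opponent strategy, and returning a mean over the traverser's actions in place of full regret accumulation — is an illustrative simplification, not the paper's implementation:

```python
import random

# Toy tree: a node is ("decision", player, {action: child}) or ("leaf", payoff).
# Payoffs are from the traverser's (player 0's) perspective.
TREE = ("decision", 0, {
    "a": ("decision", 1, {"x": ("leaf", 1.0), "y": ("leaf", -1.0)}),
    "b": ("leaf", 0.5),
})

def walk(node, traverser, opp_strategy, rng):
    """External-sampling traversal: expand every traverser action,
    sample exactly one action at every other node."""
    if node[0] == "leaf":
        return node[1]
    _, player, children = node
    if player == traverser:
        # Traverser: recurse into ALL actions. A full implementation would
        # weight child values by the current strategy and accumulate regrets;
        # here we just average for brevity.
        values = [walk(c, traverser, opp_strategy, rng) for c in children.values()]
        return sum(values) / len(values)
    # Opponent (or chance): sample ONE action from the current distribution.
    actions = list(children)
    probs = [opp_strategy.get(a, 1.0 / len(actions)) for a in actions]
    a = rng.choices(actions, weights=probs, k=1)[0]
    return walk(children[a], traverser, opp_strategy, rng)

rng = random.Random(0)
v = walk(TREE, traverser=0, opp_strategy={"x": 0.5, "y": 0.5}, rng=rng)
```

Depending on which opponent action is sampled, `v` is either (1.0 + 0.5)/2 or (-1.0 + 0.5)/2 — a noisy but unbiased picture of the traverser's prospects.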
Section 4 · Regret Matching — Output to Policy
How a regret vector becomes a probability distribution
After the network outputs raw regret estimates per action, regret matching converts them into a play distribution in two simple steps: negative regrets are clipped to zero, then the surviving values are normalized. (If every estimate is non-positive, a common fallback is to play uniformly at random.)
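The two steps fit in a few lines of Python; the uniform fallback for all-non-positive regrets is one common convention, not the only one:

```python
def regret_matching(regrets):
    """Map a vector of (estimated) regrets to a play distribution:
    clip negatives to zero, then normalize the positive part."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    # Every regret non-positive: fall back to uniform play.
    return [1.0 / len(regrets)] * len(regrets)

print(regret_matching([3.0, -2.0, 1.0]))  # → [0.75, 0.0, 0.25]
print(regret_matching([-1.0, -5.0]))      # → [0.5, 0.5]
```

Actions whose regret is negative (we would have done worse playing them more) get probability zero; the rest are played in proportion to how much we regret not having played them.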
Why this works. Regret matching guarantees average regret shrinks at rate O(1/√T) — that's the classical CFR result. With deep networks, the same theory holds provided the network's regret estimates stay close to the true regret values. Network error becomes the only knob that matters.
Section 5 · Why It Works Without Hand-Crafted Abstraction
Similar features → similar regrets — the network discovers its own abstraction
In tabular CFR each information set is an opaque hash key. The algorithm has no concept that "AKs on a wet board with 2/3 pot bet" is similar to "AKs on a wet board with 65% pot bet" — they're separate cells. To make the table fit, humans hand-design abstractions that bucket similar states together. Card abstractions, bet-size abstractions, action-sequence abstractions. Designing them is hard and abstractions are the dominant source of exploitability.
A neural network with a sensible feature encoding gets this for free. Two states with similar inputs produce similar outputs because that's what continuous function approximators do. Generalization is the default behavior, not a designed-in property.
The trade. You give up exact storage and gain learned approximation. With enough capacity and samples, the network's implicit abstraction can be far finer than any human could design — and it adapts to the actual structure of the game rather than what we guess matters.
Section 6 · Single Deep CFR — The Refinement You Asked About
The catch: Deep CFR actually trains two networks. SD-CFR removes one of them.
Recall from the paper: the strategy that converges to a Nash equilibrium isn't any single iteration's policy — it's the time-averaged strategy across all iterations. Tabular CFR computes this average exactly. Deep CFR fits a second network to approximate it, which introduces a second source of error.
Two networks per player
Errors compound. Sampling error in the strategy buffer + approximation error fitting Ŝ. Two leaky valves between you and the true equilibrium.
Just keep all the value networks
Cleaner. Average strategy is computed by sampling, no second network. Provably exact when value networks are accurate (Theorem 2).
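A sketch of the sampling idea in Python, with toy closures standing in for the stored value networks D_t. The linear weighting by iteration index is one common choice in this family of algorithms; the exact weights follow the CFR variant used:

```python
import random

def regret_matching(regrets):
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / len(regrets)] * len(regrets)

# Stand-ins for the saved value networks, one per iteration.
# In SD-CFR these would be the stored D_t models; here, toy functions.
value_nets = [
    lambda infoset: [1.0, -1.0, 0.0],   # iteration 1
    lambda infoset: [0.0,  2.0, 1.0],   # iteration 2
    lambda infoset: [3.0,  0.0, 1.0],   # iteration 3
]

def sample_average_strategy_policy(nets, rng):
    """Play the average strategy by sampling: pick ONE iteration for the
    whole game (here weighted linearly by t), then follow that iteration's
    regret-matched policy at every decision point."""
    t = rng.choices(range(len(nets)), weights=range(1, len(nets) + 1), k=1)[0]
    net = nets[t]
    return lambda infoset: regret_matching(net(infoset))

rng = random.Random(0)
policy = sample_average_strategy_policy(value_nets, rng)
print(policy("some-infoset"))  # a valid distribution over the 3 actions
```

Averaged over many sampled games, play is distributed according to the time-averaged strategy — with no second network, and hence no second approximation error.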
Bottom line. Deep CFR's "two-network" design was a holdover from the assumption that storing all past models was infeasible. Once you accept it isn't, SD-CFR drops the noisy averaging network and gets a strictly better algorithm both in theory and in head-to-head poker matches.