Deep CFR Explained — From Lookup Table to Neural Network

A conceptual walkthrough of how Counterfactual Regret Minimization (CFR) drops its lookup table for a function approximator.
Algorithm: Deep CFR · Origin: Brown et al. 2018 · Domain: Imperfect-information games
Your question
"We don't need to do any abstraction — just rotate tabular to network and it will work just fine, right?"
Mostly yes. Deep CFR replaces the regret table with a neural network that takes state features in and outputs regret per action. No hand-crafted card buckets, no bet-size groupings. The catch: the network is itself a learned, soft approximation of the table. So you trade hand-crafted abstraction for learned approximation — but you get domain generality and zero expert knowledge in return.

Section 1 · The Core Replacement

The regret table becomes a function approximator

Tabular CFR stores one regret value per (information set, action) pair. In No-Limit Texas Hold'em (HUNL = Heads-Up No-Limit) that means storing entries for ~10^160 states. Deep CFR keeps the same algorithm but represents the regret function as a neural network instead.

Tabular CFR — explicit table
One row per information set
info set            | fold  | call  | raise
I_1: K, check       | +2.34 | -1.10 | +0.52
I_2: Q, check-call  | +0.84 |  0.00 | -0.91
I_3: J, bet-bet     | -1.42 | +1.86 | +0.13
I_4: K, bet-call    | +3.71 | -0.27 | +1.05
I_5: J, check-fold  |  0.00 | -2.13 | +0.46
I_6: A, raise-call  | +4.12 | +1.91 | -0.55
I_7: T, ...         | +0.08 | -1.34 | +2.21
I_8: 9, ...         | -0.62 | +0.71 |  0.00
I_9, I_10, ...      |  ...  |  ...  |  ...
~10^14 info sets in Limit Hold'em  ·  ~10^160 in No-Limit. Won't fit in any memory.
replaced by
Deep CFR — neural network
A function: features → regret per action
[Network diagram: input FEATURES (cards, pot, stack, history, street, position) → HIDDEN layers → one REGRET output per action, e.g. +2.3 fold, -1.1 call, +0.5 raise]
~100K parameters  ·  ~400KB on disk  ·  D(I, a) ≈ regret(I, a)

Key invariant. The CFR algorithm itself doesn't change — it's still iterating regret updates and doing regret matching. What changes is how regret is stored and queried. Tabular CFR reads from a hash map; Deep CFR does a forward pass through the value network D(I, a) instead.
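To make the swap concrete, here is a minimal sketch of the two backends behind the same regret query — a dict for tabular CFR, a tiny PyTorch net for Deep CFR. The class names, feature encoding, and architecture are illustrative, not the paper's:

import numpy as np
import torch
import torch.nn as nn

N_ACTIONS = 3  # fold, call, raise

class TabularRegrets:
    """Tabular CFR: one explicit row per info set, keyed by an opaque hash."""
    def __init__(self):
        self.table = {}

    def query(self, info_set_key, features):
        # Features are ignored; unseen keys start at zero regret.
        return self.table.setdefault(info_set_key, np.zeros(N_ACTIONS))

class NetworkRegrets:
    """Deep CFR: the 'table read' becomes a forward pass through D(I, a)."""
    def __init__(self, n_features):
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, N_ACTIONS),
        )

    def query(self, info_set_key, features):
        # The key is unused: nearby feature vectors produce nearby regrets.
        with torch.no_grad():
            x = torch.as_tensor(features, dtype=torch.float32)
            return self.net(x).numpy()

Both expose the same query; the rest of the algorithm never needs to know which backend it is talking to.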

Section 2 · The Training Iteration

Five steps repeat each iteration.

Step 1 · External sampling — Walk a partial game tree: try all of the traverser's actions; sample one action for the opponent and for chance.
Step 2 · Query current network — At each visited info set, ask the value network for its current regret estimates.
Step 3 · Compute new regrets — From rollout payoffs, compute the counterfactual regret r^t(I, a) for each action.
Step 4 · Reservoir buffer — Append (info-set features, regret vector) to a fixed-size buffer using reservoir sampling (sketched just below).
Step 5 · Re-train network — Train a fresh value network from the buffer. Increment iteration t. Repeat.
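The reservoir buffer from step 4 is what keeps the training data honest: every sample ever appended survives with equal probability, so the fixed-size buffer stays a uniform sample across all iterations. A minimal sketch (classic Algorithm R; the class name is illustrative):

import random

class ReservoirBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []     # current residents: a uniform sample of all appends
        self.n_seen = 0     # total number of append calls so far

    def append(self, item):
        self.n_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)            # not yet full: always keep
        else:
            j = random.randrange(self.n_seen)  # uniform draw over [0, n_seen)
            if j < self.capacity:
                self.items[j] = item           # evict a uniform-random resident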

Step 1 — External sampling traversal

For player i (the "traverser") on iteration t: at every information set where i acts, recursively explore every available action. At opponent and chance nodes, randomly sample one action according to the opponent's current strategy or chance distribution. This drastically prunes the tree while still giving an unbiased estimate of regret for the traverser.

walk(h):
    if h is terminal: return u_i(h)     # payoff propagates back to the traverser
    if P(h) = i:                        # traverser acts: explore every action
        for a in A(h): walk(h·a)
    else:                               # opponent or chance: sample one action
        a ~ σ_{-i}(h); walk(h·a)
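Fleshed out into runnable Python, with steps 2–4 folded into the walk. The Game interface, value_net.predict, and buffer.append are assumed interfaces for illustration — a sketch, not the paper's reference code:

import random

def regret_matching(regrets):
    # Clip negatives, then normalize; fall back to uniform if nothing is positive.
    pos = [max(r, 0.0) for r in regrets]
    s = sum(pos)
    return [p / s for p in pos] if s > 0 else [1.0 / len(pos)] * len(pos)

def traverse(game, h, i, value_net, buffer, t):
    """One external-sampling walk from history h for traverser i at iteration t.
    Returns the expected payoff of h for player i."""
    if game.is_terminal(h):
        return game.utility(h, i)
    if game.is_chance(h):
        return traverse(game, h + [game.sample_chance(h)], i, value_net, buffer, t)

    features = game.info_set_features(h)                  # encode the actor's info set
    actions = game.legal_actions(h)
    sigma = regret_matching(value_net.predict(features))  # step 2: query D(I, a)

    if game.current_player(h) == i:
        # Step 1: the traverser explores every action.
        values = [traverse(game, h + [a], i, value_net, buffer, t) for a in actions]
        ev = sum(p * v for p, v in zip(sigma, values))
        # Step 3: instantaneous regret r^t(I, a) = v(a) - ev.
        buffer.append((features, [v - ev for v in values], t))  # step 4
        return ev
    # Opponent node: sample one action from the opponent's current policy.
    a = random.choices(actions, weights=sigma)[0]
    return traverse(game, h + [a], i, value_net, buffer, t)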

Section 3 · External Sampling, Visualized

The traverser explores all branches; everyone else plays one sampled move

[Tree diagram: the P1 root (traverser) explores ALL of its actions; P2 and chance nodes each sample ONE action; unsampled branches are pruned; terminal payoffs propagate up, and regret is computed at every traverser node visited.]

Section 4 · Regret Matching — Output to Policy

How a regret vector becomes a probability distribution

After the network outputs raw regret estimates per action, regret matching converts them into a play distribution in two simple steps. Negative regrets are clipped to zero, then the surviving values are normalized.

Step A · Raw regret output from network D(I, a):
fold +2.5 · call -0.8 · raise-2x +1.2 · all-in -1.5

Step B · Clip negatives to zero, [x]+ = max(x, 0):
fold 2.5 · call 0.0 · raise-2x 1.2 · all-in 0.0

Step C · Normalize into a probability distribution, σ(I, a) = [r]+ / Σ [r]+:
fold 0.676 · call 0.000 · raise-2x 0.324 · all-in 0.000
sum = 0.676 + 0.000 + 0.324 + 0.000 = 1.000 ✓
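The same three steps in code, reproducing the numbers above — plain Python, nothing assumed beyond the example itself:

raw = {"fold": 2.5, "call": -0.8, "raise-2x": 1.2, "all-in": -1.5}  # Step A
clipped = {a: max(r, 0.0) for a, r in raw.items()}                  # Step B
total = sum(clipped.values())                                       # 2.5 + 1.2 = 3.7
sigma = {a: r / total for a, r in clipped.items()}                  # Step C
# sigma: fold ≈ 0.676, call = 0.000, raise-2x ≈ 0.324, all-in = 0.000

One edge case the example skips: if every regret is non-positive the denominator is zero; a common convention is to fall back to a uniform distribution over all actions.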

Why this works. Regret matching guarantees average regret shrinks at rate O(1/√T) — that's the classical CFR result. With deep networks, the same theory holds provided the network's regret estimates stay close to the true regret values. Network error becomes the only knob that matters.

Section 5 · Why It Works Without Hand-Crafted Abstraction

Similar features → similar regrets — the network discovers its own abstraction

In tabular CFR each information set is an opaque hash key. The algorithm has no concept that "AKs on a wet board with 2/3 pot bet" is similar to "AKs on a wet board with 65% pot bet" — they're separate cells. To make the table fit, humans hand-design abstractions that bucket similar states together. Card abstractions, bet-size abstractions, action-sequence abstractions. Designing them is hard and abstractions are the dominant source of exploitability.

A neural network with a sensible feature encoding gets this for free. Two states with similar inputs produce similar outputs because that's what continuous function approximators do. Generalization is the default behavior, not a designed-in property.

The trade. You give up exact storage and gain learned approximation. With enough capacity and samples, the network's implicit abstraction can be far finer than any human could design — and it adapts to the actual structure of the game rather than what we guess matters.

[Scatter plot: info sets plotted by feature dim 1 (e.g. hand strength) against feature dim 2 (e.g. pot odds); smooth raise / call / fold / all-in zones cover the plane.]
Each dot is one info set the network has trained on. Smooth color regions show the network's learned regret surface — new info sets falling near a cluster inherit a similar regret estimate without ever being seen.

Section 6 · Single Deep CFR — The Refinement You Asked About

The catch: Deep CFR actually trains two networks. SD-CFR removes one of them.

Recall from the paper: the strategy that converges to a Nash equilibrium isn't any single iteration's policy — it's the time-averaged strategy across all iterations. Tabular CFR computes this average exactly. Deep CFR fits a second network to approximate it, which introduces a second source of error.
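Written out with the linear weighting used throughout this walkthrough (and simplifying away the reach-probability weighting that the exact tabular average also carries), the average strategy after T iterations is

σ̄_T(I, a) = ( Σ_{t=1..T} t · σ^t(I, a) ) / ( Σ_{t=1..T} t )

Tabular CFR accumulates this sum cell by cell as it runs; Deep CFR instead regresses a second network onto samples of it — and that regression is where the extra error enters.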

Deep CFR — original

Two networks per player

Value network D(I, a) — predicts regret
Avg-strategy network Ŝ(I, a) — approximates time-averaged policy

Errors compound. Sampling error in the strategy buffer + approximation error fitting Ŝ. Two leaky valves between you and the true equilibrium.

"Bs is not guaranteed to reflect the true average strategy" — Theorem 1 in the paper.
SD-CFR — proposed

Just keep all the value networks

Value network at iteration 1: D_1
Value network at iteration 2: D_2
... D_3, D_4, ..., D_T
To play a hand: sample an iteration t with probability proportional to t, then follow D_t's induced strategy for the entire hand.

Cleaner. Average strategy is computed by sampling, no second network. Provably exact when value networks are accurate (Theorem 2).

~400KB per network · 25,000 networks ≈ 10GB on disk · trivial in modern terms.
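Inference over that stack of networks is only a few lines. A sketch, assuming the T value networks are loaded in order as nets, each exposing the same predict used earlier, and reusing regret_matching from the Section 2 sketch:

import random

def sdcfr_policy_for_hand(nets):
    # Sample iteration t with probability proportional to t (linear weighting),
    # then commit to that network's induced strategy for the whole hand.
    T = len(nets)
    t = random.choices(range(1, T + 1), weights=range(1, T + 1))[0]
    net = nets[t - 1]

    def act(features, legal_actions):
        sigma = regret_matching(net.predict(features))
        return random.choices(legal_actions, weights=sigma)[0]

    return act

Drawing a fresh t for every hand is what makes the mixture match the linearly weighted average strategy in expectation — no second network ever has to be fit.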

Bottom line. Deep CFR's "two-network" design was a holdover from the assumption that storing all past models was infeasible. Once you accept it isn't, SD-CFR drops the noisy averaging network and gets a strictly better algorithm both in theory and in head-to-head poker matches.