Deep CFR Explained — From Lookup Table to Neural Network

A conceptual walkthrough of how Counterfactual Regret Minimization (CFR) drops its lookup table for a function approximator.
Algorithm: Deep CFR · Origin: Brown et al. 2018 · Domain: Imperfect-information games
Your question
"We don't need to do any abstraction — just rotate tabular to network and it will work just fine, right?"
Mostly yes. Deep CFR replaces the regret table with a neural network that takes state features in and outputs regret per action. No hand-crafted card buckets, no bet-size groupings. The catch: the network is itself a learned, soft approximation of the table. So you trade hand-crafted abstraction for learned approximation — but you get domain generality and zero expert knowledge in return.

Section 1 · The Core Replacement

The regret table becomes a function approximator

Tabular CFR stores one regret value per (information set, action) pair. In No-Limit Texas Hold'em (HUNL = Heads-Up No-Limit) that means storing entries for ~10^160 states. Deep CFR keeps the same algorithm but represents the regret function as a neural network instead.

Tabular CFR — explicit table
One row per information set
info set            | fold  | call  | raise
I_1: K, check       | +2.34 | -1.10 | +0.52
I_2: Q, check-call  | +0.84 |  0.00 | -0.91
I_3: J, bet-bet     | -1.42 | +1.86 | +0.13
I_4: K, bet-call    | +3.71 | -0.27 | +1.05
I_5: J, check-fold  |  0.00 | -2.13 | +0.46
I_6: A, raise-call  | +4.12 | +1.91 | -0.55
I_7: T, ...         | +0.08 | -1.34 | +2.21
I_8: 9, ...         | -0.62 | +0.71 |  0.00
I_9, I_10, ...      |  ...  |  ...  |  ...
~10^14 info sets in Limit Hold'em  ·  ~10^160 in No-Limit. Won't fit in any memory.
replaced by
Deep CFR — neural network
A function: features → regret per action
[Network diagram: input FEATURES (cards, pot, stack, history, street, position) → HIDDEN layers → one REGRET output per action, e.g. +2.3 fold, -1.1 call, +0.5 raise]
~100K parameters  ·  ~400KB on disk  ·  D(I, a) ≈ regret(I, a)

Key invariant. The CFR algorithm itself doesn't change — it's still iterating regret updates and doing regret matching. What changes is how regret is stored and queried. Tabular CFR reads from a hash map; Deep CFR does a forward pass through the value network D(I, a) instead.
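To make the swap concrete, here is a minimal sketch of the two backends behind the same regret query — a dict for tabular CFR, a tiny PyTorch net for Deep CFR. The class names, feature encoding, and architecture are illustrative, not the paper's:

import numpy as np
import torch
import torch.nn as nn

N_ACTIONS = 3  # fold, call, raise

class TabularRegrets:
    """Tabular CFR: one explicit row per info set, keyed by an opaque hash."""
    def __init__(self):
        self.table = {}

    def query(self, info_set_key, features):
        # Features are ignored; unseen keys start at zero regret.
        return self.table.setdefault(info_set_key, np.zeros(N_ACTIONS))

class NetworkRegrets:
    """Deep CFR: the 'table read' becomes a forward pass through D(I, a)."""
    def __init__(self, n_features):
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, N_ACTIONS),
        )

    def query(self, info_set_key, features):
        # The key is unused: nearby feature vectors produce nearby regrets.
        with torch.no_grad():
            x = torch.as_tensor(features, dtype=torch.float32)
            return self.net(x).numpy()

Both expose the same query; the rest of the algorithm never needs to know which backend it is talking to.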

Section 2 · The Training Iteration

Five steps repeat each iteration.

Step 1 · External sampling — Walk a partial game tree: try all of the traverser's actions; sample one action for the opponent and for chance.
Step 2 · Query current network — At each visited info set, ask the value network for its current regret estimates.
Step 3 · Compute new regrets — From rollout payoffs, compute the counterfactual regret r^t(I, a) for each action.
Step 4 · Reservoir buffer — Append (info-set features, regret vector) to a fixed-size buffer using reservoir sampling (sketched just below).
Step 5 · Re-train network — Train a fresh value network from the buffer. Increment iteration t. Repeat.
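The reservoir buffer from step 4 is what keeps the training data honest: every sample ever appended survives with equal probability, so the fixed-size buffer stays a uniform sample across all iterations. A minimal sketch (classic Algorithm R; the class name is illustrative):

import random

class ReservoirBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []     # current residents: a uniform sample of all appends
        self.n_seen = 0     # total number of append calls so far

    def append(self, item):
        self.n_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)            # not yet full: always keep
        else:
            j = random.randrange(self.n_seen)  # uniform draw over [0, n_seen)
            if j < self.capacity:
                self.items[j] = item           # evict a uniform-random resident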

Step 1 — External sampling traversal

For player i (the "traverser") on iteration t: at every information set where i acts, recursively explore every available action. At opponent and chance nodes, randomly sample one action according to the opponent's current strategy or chance distribution. This drastically prunes the tree while still giving an unbiased estimate of regret for the traverser.

walk(h):
    if h is terminal: return u_i(h)     # payoff propagates back to the traverser
    if P(h) = i:                        # traverser acts: explore every action
        for a in A(h): walk(h·a)
    else:                               # opponent or chance: sample one action
        a ~ σ_{-i}(h); walk(h·a)
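Fleshed out into runnable Python, with steps 2–4 folded into the walk. The Game interface, value_net.predict, and buffer.append are assumed interfaces for illustration — a sketch, not the paper's reference code:

import random

def regret_matching(regrets):
    # Clip negatives, then normalize; fall back to uniform if nothing is positive.
    pos = [max(r, 0.0) for r in regrets]
    s = sum(pos)
    return [p / s for p in pos] if s > 0 else [1.0 / len(pos)] * len(pos)

def traverse(game, h, i, value_net, buffer, t):
    """One external-sampling walk from history h for traverser i at iteration t.
    Returns the expected payoff of h for player i."""
    if game.is_terminal(h):
        return game.utility(h, i)
    if game.is_chance(h):
        return traverse(game, h + [game.sample_chance(h)], i, value_net, buffer, t)

    features = game.info_set_features(h)                  # encode the actor's info set
    actions = game.legal_actions(h)
    sigma = regret_matching(value_net.predict(features))  # step 2: query D(I, a)

    if game.current_player(h) == i:
        # Step 1: the traverser explores every action.
        values = [traverse(game, h + [a], i, value_net, buffer, t) for a in actions]
        ev = sum(p * v for p, v in zip(sigma, values))
        # Step 3: instantaneous regret r^t(I, a) = v(a) - ev.
        buffer.append((features, [v - ev for v in values], t))  # step 4
        return ev
    # Opponent node: sample one action from the opponent's current policy.
    a = random.choices(actions, weights=sigma)[0]
    return traverse(game, h + [a], i, value_net, buffer, t)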

Section 3 · External Sampling, Visualized

The traverser explores all branches; everyone else plays one sampled move

[Tree diagram: the P1 root (traverser) explores ALL of its actions; P2 and chance nodes each sample ONE action; unsampled branches are pruned; terminal payoffs propagate up, and regret is computed at every traverser node visited.]

Section 4 · Regret Matching — Output to Policy

How a regret vector becomes a probability distribution

After the network outputs raw regret estimates per action, regret matching converts them into a play distribution in two simple steps. Negative regrets are clipped to zero, then the surviving values are normalized.

Step A · Raw regret output from network D(I, a):
fold +2.5 · call -0.8 · raise-2x +1.2 · all-in -1.5

Step B · Clip negatives to zero, [x]+ = max(x, 0):
fold 2.5 · call 0.0 · raise-2x 1.2 · all-in 0.0

Step C · Normalize into a probability distribution, σ(I, a) = [r]+ / Σ [r]+:
fold 0.676 · call 0.000 · raise-2x 0.324 · all-in 0.000
sum = 0.676 + 0.000 + 0.324 + 0.000 = 1.000 ✓
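The same three steps in code, reproducing the numbers above — plain Python, nothing assumed beyond the example itself:

raw = {"fold": 2.5, "call": -0.8, "raise-2x": 1.2, "all-in": -1.5}  # Step A
clipped = {a: max(r, 0.0) for a, r in raw.items()}                  # Step B
total = sum(clipped.values())                                       # 2.5 + 1.2 = 3.7
sigma = {a: r / total for a, r in clipped.items()}                  # Step C
# sigma: fold ≈ 0.676, call = 0.000, raise-2x ≈ 0.324, all-in = 0.000

One edge case the example skips: if every regret is non-positive the denominator is zero; a common convention is to fall back to a uniform distribution over all actions.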

Why this works. Regret matching guarantees average regret shrinks at rate O(1/√T) — that's the classical CFR result. With deep networks, the same theory holds provided the network's regret estimates stay close to the true regret values. Network error becomes the only knob that matters.

Section 5 · Why It Works Without Hand-Crafted Abstraction

Similar features → similar regrets — the network discovers its own abstraction

In tabular CFR each information set is an opaque hash key. The algorithm has no concept that "AKs on a wet board with 2/3 pot bet" is similar to "AKs on a wet board with 65% pot bet" — they're separate cells. To make the table fit, humans hand-design abstractions that bucket similar states together. Card abstractions, bet-size abstractions, action-sequence abstractions. Designing them is hard and abstractions are the dominant source of exploitability.

A neural network with a sensible feature encoding gets this for free. Two states with similar inputs produce similar outputs because that's what continuous function approximators do. Generalization is the default behavior, not a designed-in property.

The trade. You give up exact storage and gain learned approximation. With enough capacity and samples, the network's implicit abstraction can be far finer than any human could design — and it adapts to the actual structure of the game rather than what we guess matters.

[Scatter plot: info sets plotted by feature dim 1 (e.g. hand strength) against feature dim 2 (e.g. pot odds); smooth raise / call / fold / all-in zones cover the plane.]
Each dot is one info set the network has trained on. Smooth color regions show the network's learned regret surface — new info sets falling near a cluster inherit a similar regret estimate without ever being seen.

Section 6 · Single Deep CFR — The Refinement You Asked About

The catch: Deep CFR actually trains two networks. SD-CFR removes one of them.

Recall from the paper: the strategy that converges to a Nash equilibrium isn't any single iteration's policy — it's the time-averaged strategy across all iterations. Tabular CFR computes this average exactly. Deep CFR fits a second network to approximate it, which introduces a second source of error.
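Written out with the linear weighting used throughout this walkthrough (and simplifying away the reach-probability weighting that the exact tabular average also carries), the average strategy after T iterations is

σ̄_T(I, a) = ( Σ_{t=1..T} t · σ^t(I, a) ) / ( Σ_{t=1..T} t )

Tabular CFR accumulates this sum cell by cell as it runs; Deep CFR instead regresses a second network onto samples of it — and that regression is where the extra error enters.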

Deep CFR — original

Two networks per player

Value network D(I, a) — predicts regret
Avg-strategy network Ŝ(I, a) — approximates time-averaged policy

Errors compound. Sampling error in the strategy buffer + approximation error fitting Ŝ. Two leaky valves between you and the true equilibrium.

"Bs is not guaranteed to reflect the true average strategy" — Theorem 1 in the paper.
SD-CFR — proposed

Just keep all the value networks

Value network at iteration 1: D_1
Value network at iteration 2: D_2
... D_3, D_4, ..., D_T
To play a hand: sample an iteration t with probability proportional to t, then follow D_t's induced strategy for the entire hand.

Cleaner. Average strategy is computed by sampling, no second network. Provably exact when value networks are accurate (Theorem 2).

~400KB per network · 25,000 networks ≈ 10GB on disk · trivial in modern terms.
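Inference over that stack of networks is only a few lines. A sketch, assuming the T value networks are loaded in order as nets, each exposing the same predict used earlier, and reusing regret_matching from the Section 2 sketch:

import random

def sdcfr_policy_for_hand(nets):
    # Sample iteration t with probability proportional to t (linear weighting),
    # then commit to that network's induced strategy for the whole hand.
    T = len(nets)
    t = random.choices(range(1, T + 1), weights=range(1, T + 1))[0]
    net = nets[t - 1]

    def act(features, legal_actions):
        sigma = regret_matching(net.predict(features))
        return random.choices(legal_actions, weights=sigma)[0]

    return act

Drawing a fresh t for every hand is what makes the mixture match the linearly weighted average strategy in expectation — no second network ever has to be fit.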

Bottom line. Deep CFR's "two-network" design was a holdover from the assumption that storing all past models was infeasible. Once you accept it isn't, SD-CFR drops the noisy averaging network and gets a strictly better algorithm both in theory and in head-to-head poker matches.