A General Reinforcement Learning Algorithm That Masters Chess, Shogi, and Go Through Self-Play
Type: paper Slug: a-general-reinforcement-learning-algorithm-that-masters-chess-shogi-and-go—hassabis Sources: a-general-reinforcement-learning-algorithm-that-masters-chess-shogi-and-go—hassabis Last updated: 2026-05-13
Summary
Silver, Schrittwieser, Antonoglou, Guez, Lanctot, and colleagues, with Lillicrap, Simonyan, and Hassabis (2018), introduced AlphaZero, a single reinforcement learning algorithm that achieved superhuman performance in chess, shogi, and Go using only self-play and no domain-specific knowledge beyond the game rules. AlphaZero learned each game from scratch in hours, defeated the previous state-of-the-art programs (Stockfish, Elmo, and AlphaGo Zero), and demonstrated that a general-purpose RL algorithm can match or exceed hand-crafted, domain-specific systems across diverse game domains.
Core content
Key departure from prior work: Unlike AlphaGo (which used human expert data and hand-crafted features) or traditional chess engines (which use alpha-beta search with extensive human-engineered heuristics), AlphaZero uses a single neural network trained entirely through self-play (paper—a-general-reinforcement-learning-algorithm-that-masters-chess-shogi-and-go §Introduction).
Algorithm:
- Neural network: Residual network taking the board state as input, outputting both a policy (move probabilities) and a value (expected outcome) (paper—a-general-reinforcement-learning-algorithm-that-masters-chess-shogi-and-go §Methods).
- MCTS with learned policy: Monte Carlo tree search guided by the neural network’s policy for prior probabilities and value for leaf evaluation (paper—a-general-reinforcement-learning-algorithm-that-masters-chess-shogi-and-go §Methods).
- Self-play training: The network is updated from self-play games, with MCTS acting as the policy improvement operator and the network as the value function approximator (paper—a-general-reinforcement-learning-algorithm-that-masters-chess-shogi-and-go §Methods). A minimal illustrative sketch follows this list.
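To make the MCTS-as-policy-improvement loop concrete, here is a minimal Python sketch of PUCT selection and visit-count policy targets. It is a sketch under assumptions, not the paper's implementation: the `state` interface (`legal_actions`, `apply`, `is_terminal`, `outcome`) and the `net` callable are hypothetical stand-ins, and the `C_PUCT` value is illustrative; the 800-simulations-per-move budget does match the paper's self-play setting.

```python
"""Minimal AlphaZero-style MCTS sketch. `state` and `net` are hypothetical."""
import math

C_PUCT = 1.5  # exploration constant; illustrative, not the paper's tuned value


class Node:
    """One edge in the search tree, holding AlphaZero's per-edge statistics."""
    def __init__(self, prior):
        self.prior = prior        # P(s, a): prior from the policy head
        self.visits = 0           # N(s, a)
        self.value_sum = 0.0      # W(s, a)
        self.children = {}        # action -> Node

    def q(self):                  # Q(s, a) = W(s, a) / N(s, a)
        return self.value_sum / self.visits if self.visits else 0.0


def expand(node, state, net):
    """Expand a leaf: create children with policy priors, return value head v."""
    priors, value = net(state)    # assumed to return ({action: p}, v)
    for action in state.legal_actions():
        node.children[action] = Node(prior=priors[action])
    return value


def select_child(node):
    """PUCT rule: argmax_a  Q(s,a) + c * P(s,a) * sqrt(N(s)) / (1 + N(s,a))."""
    total_visits = sum(child.visits for child in node.children.values())

    def puct(child):
        return child.q() + C_PUCT * child.prior * math.sqrt(total_visits) / (1 + child.visits)

    return max(node.children.items(), key=lambda item: puct(item[1]))


def mcts_policy(root_state, net, simulations=800):  # 800 sims/move, as in the paper
    root = Node(prior=1.0)
    expand(root, root_state, net)
    for _ in range(simulations):
        node, state, path = root, root_state, []
        while node.children:                       # descend to a leaf
            action, node = select_child(node)
            state = state.apply(action)            # assumed to return a new state
            path.append(node)
        value = state.outcome() if state.is_terminal() else expand(node, state, net)
        for node in reversed(path):                # back up, flipping sign each ply
            value = -value                         # zero-sum: parent negates child view
            node.value_sum += value
            node.visits += 1
    # Policy improvement target: normalized visit counts at the root.
    total = sum(child.visits for child in root.children.values())
    return {a: child.visits / total for a, child in root.children.items()}
```

The training step (not shown) then regresses the value head toward the eventual game outcome z and the policy head toward these visit-count targets π, minimizing the paper's combined loss (z - v)^2 - π^T log p + c‖θ‖^2.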
Results:
- Chess: Defeated Stockfish (TCEC season 9 champion) 64-36 in a 100-game match (28 wins, 72 draws, no losses; the point-scoring convention is worked through after this list) after just 4 hours of self-play training (paper—a-general-reinforcement-learning-algorithm-that-masters-chess-shogi-and-go §Results).
- Shogi: Defeated Elmo (world champion program) 90-8, with 2 draws, in a 100-game match after 2 hours of training (paper—a-general-reinforcement-learning-algorithm-that-masters-chess-shogi-and-go §Results).
- Go: Defeated AlphaGo Zero (which had itself beaten the Lee Sedol version of AlphaGo 100-0) 60-40 in a 100-game match; AlphaZero surpassed the level of AlphaGo Lee after just 8 hours of training (paper—a-general-reinforcement-learning-algorithm-that-masters-chess-shogi-and-go §Results).
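The headline match scores above use the standard chess convention of one point per win and half a point per draw. A quick check against the Stockfish figures (28 wins, 72 draws, 0 losses) reproduces the 64-36 result:

```python
def match_points(wins, draws):
    """Standard chess match scoring: win = 1 point, draw = 0.5, loss = 0."""
    return wins + 0.5 * draws

# AlphaZero vs Stockfish over 100 games: 28 wins, 72 draws, 0 losses.
print(match_points(wins=28, draws=72))  # 64.0 -> AlphaZero's side of "64-36"
print(match_points(wins=0, draws=72))   # 36.0 -> Stockfish's side
```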
Significance: Demonstrated that deep RL can discover sophisticated strategies from first principles, with no human knowledge beyond the game rules. AlphaZero’s chess play was described by commentators as “alien”, developing openings and strategies that humans had never considered.
Connections
- Theme: theme—self-play, theme—game-playing-ai, chess, go, theme—deep-RL
- Project: AlphaZero
- Collaborators: David Silver (first), Julian Schrittwieser, Ioannis Antonoglou, Arthur Guez, Marc Lanctot, Timothy Lillicrap, Karen Simonyan
- Era: deepmind-ascent
- Venue: venue—Science
- Supersedes: paper—mastering-the-game-of-go-without-human-knowledge — AlphaZero is the generalized successor to AlphaGo Zero
- Notable quote: “The tabula rasa approach of AlphaZero yields a novel and distinctive playing style.” (paper—a-general-reinforcement-learning-algorithm-that-masters-chess-shogi-and-go §Discussion)
Honest Gaps
- Metadata lists 7 co-authors but the actual paper has additional authors.
- The PDF extraction has scanning artefacts with merged/jammed text — some sections may be hard to parse.
- AlphaZero required enormous compute: 5,000 first-generation TPUs to generate self-play games and 64 second-generation TPUs to train the networks.
- The match conditions against Stockfish favored AlphaZero (a fixed one minute per move rather than a total-game time budget, which prevented Stockfish from managing its own time).
- AlphaZero learns from scratch each time — it cannot transfer knowledge between games despite using the same architecture.
- No code or trained models were released.