Sobhan Mohammadpour was debugging code late one evening when the numbers stopped making sense — in the best possible way. He and his colleagues at MIT were testing how AI agents played Phantom Tic-Tac-Toe, a version of the classic game where players can’t see their opponent’s moves. According to decades of game theory, specialized algorithms should dominate in such imperfect-information games, where players must guess their rivals’ intentions. But here, a general-purpose method called policy gradient — originally designed for robotics and not game strategy — was beating them, move for move. The result wasn’t just surprising; it upended a foundational assumption in AI research.
For years, experts believed that game-theoretic algorithms, built from mathematical models of strategic reasoning, were the gold standard for two-player, zero-sum games like poker or bidding wars. These games, where one player’s gain is the other’s loss, rely on hidden information, making them notoriously difficult for machines to master. Policy gradient methods, by contrast, were seen as too general, too slow — a jack-of-all-trades with no edge in high-stakes competition. But the MIT-led team, including researchers from UT Austin, UC Berkeley, and Carnegie Mellon, suspected the field had never properly tested them. So they built a new benchmark: a standardized testing ground for algorithms across five imperfect-information games, from obscured versions of Hex to the bluff-heavy Liar’s Dice.
The team measured performance using a metric called exploitability: how much a perfect counter-strategy could gain against an AI's strategy. Zero exploitability means no opponent can beat the strategy on average; higher scores reveal weaknesses. In head-to-head trials, policy gradient methods not only matched specialized algorithms but often surpassed them, achieving exploitability scores up to 50% lower in certain Phantom Tic-Tac-Toe variants. "It had been pretty much taken for granted that specialized game-theoretic algorithms were the right approach," says Samuel Sokota of CMU. "Our study showed that policy gradient methods can work better, and that the specialized algorithms may not work as well as people thought."
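To make the idea of exploitability concrete, here is a minimal sketch in Python for a far simpler game than any in the benchmark: rock-paper-scissors, written as a payoff table. It is an illustration only, not the team's code. The function name, the payoff table, and the choice to average the two players' best-response gains are all assumptions made for this example.

```python
from typing import List, Sequence

# Illustrative payoff table (an assumption for this sketch): rows and columns
# are rock, paper, scissors; entries are the row player's payoff. The column
# player receives the negative, which is what makes the game zero-sum.
RPS: List[List[float]] = [
    [0.0, -1.0, 1.0],
    [1.0, 0.0, -1.0],
    [-1.0, 1.0, 0.0],
]


def exploitability(
    payoff: Sequence[Sequence[float]],
    row_strategy: Sequence[float],
    col_strategy: Sequence[float],
) -> float:
    """Average amount a best-responding opponent gains against each strategy.

    Hypothetical helper: strategies are probability distributions over actions.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])

    # Row player's expected payoff for each pure action against col_strategy.
    row_values = [
        sum(payoff[i][j] * col_strategy[j] for j in range(n_cols))
        for i in range(n_rows)
    ]
    # Row player's expected payoff when the column player picks each pure
    # action against row_strategy (the column player wants this to be small).
    col_values = [
        sum(row_strategy[i] * payoff[i][j] for i in range(n_rows))
        for j in range(n_cols)
    ]

    best_row = max(row_values)   # best the row player can do against col_strategy
    best_col = -min(col_values)  # best the column player can do against row_strategy

    # In a zero-sum game the players' current payoffs cancel, so the summed
    # best-response gains reduce to best_row + best_col; halve to average.
    return (best_row + best_col) / 2


uniform = [1 / 3, 1 / 3, 1 / 3]
always_rock = [1.0, 0.0, 0.0]

balanced_score = exploitability(RPS, uniform, uniform)
lopsided_score = exploitability(RPS, always_rock, uniform)
```

Under these assumptions, `balanced_score` is zero, because no counter-strategy gains anything against uniformly random play, while `lopsided_score` is 0.5, because an opponent who always answers rock with paper wins every round. The benchmark's games are far larger and involve hidden moves, so computing the perfect counter-strategy there is much more involved than scanning a small table.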
The implications extend beyond games. Real-world decisions — from autonomous driving to financial negotiations — often unfold under imperfect information, where agents must adapt without full knowledge. If general-purpose methods can outperform bespoke solutions in these controlled settings, they may offer a more flexible path to robust AI. The team’s benchmark, now open for public use, invites researchers worldwide to test their own algorithms on equal footing.
"We’re not proposing a new algorithm that can beat out others," explains Max Rudolph of UT Austin. "We’re proposing a way to assess them." That shift — from competition to evaluation — could reshape how AI progress is measured. As the field grapples with its assumptions, one truth is emerging: sometimes, the generalist wins.
