MIT and Harvard researchers have found an unlikely teacher for artificial intelligence: the classic board game Battleship. In a creative twist on a game that cognitive scientists have long used to study human decision-making, researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and Harvard University's School of Engineering and Applied Sciences (SEAS) developed "Collaborative Battleship" to understand why language models struggle to ask useful questions in uncertain environments—the very skill that matters most in high-stakes fields like medical diagnosis and scientific discovery.
The problem is fundamental. While today's language models excel at answering complex questions, they're remarkably poor at asking the right ones. This weakness matters enormously. In medicine, software development, or research, an AI agent that asks uninformed questions wastes time and resources. The researchers recognized that fields requiring exploration under uncertainty demand a skill that language models, typically optimized for response generation alone, are not trained to provide.
To study this gap, the team reframed Battleship as a collaborative exercise in natural language inquiry. One AI participant acted as "captain," asking yes-no questions to locate hidden ships, while another served as "spotter," answering in real time. The researchers first collected data from more than 40 humans playing together, building the "BattleshipQA" dataset as a benchmark. When they tested state-of-the-art models like GPT-5 and smaller systems like Llama 4 Scout against this human baseline, the results revealed a striking pattern: while top-tier models could outpace humans at the game, smaller systems struggled fundamentally with question quality.
The breakthrough came through two complementary strategies. First, the team implemented Monte Carlo inference, a technique that carefully weighs the likelihood of different possibilities after each answer. Rather than asking questions randomly, the AI captain now reasons about potential ship locations as individual particles, inflating or deflating their probability with each spotter response. This calculated, adaptive approach transformed question quality across all model sizes.
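To make the particle idea concrete, here is a minimal sketch in Python. It is an illustration only, not the researchers' implementation. The 4x4 board, the single two-cell ship, the 10 percent answer-noise rate, and every function name are assumptions chosen for brevity. Each particle is one guess at where the ship sits. A spotter answer inflates the weight of particles that agree with it and deflates the rest. Candidate questions are scored by the entropy of their predicted answer, a common proxy for how informative a yes-no question is.

```python
import math
import random

SIZE = 4  # hypothetical tiny board, chosen only to keep the example short


def sample_board(rng):
    """Place one two-cell ship at random and return the cells it occupies."""
    if rng.random() < 0.5:  # horizontal placement
        r, c = rng.randrange(SIZE), rng.randrange(SIZE - 1)
        return frozenset({(r, c), (r, c + 1)})
    r, c = rng.randrange(SIZE - 1), rng.randrange(SIZE)  # vertical placement
    return frozenset({(r, c), (r + 1, c)})


def reweight(particles, weights, question, answer, noise=0.1):
    """Inflate particles consistent with the answer, deflate the others.

    `noise` is an assumed chance that the spotter's answer is wrong.
    The returned weights are renormalized to sum to 1.
    """
    updated = [
        w * ((1.0 - noise) if question(p) == answer else noise)
        for p, w in zip(particles, weights)
    ]
    total = sum(updated)
    return [w / total for w in updated]


def answer_entropy(particles, weights, question):
    """Entropy in bits of the predicted yes/no answer under current beliefs."""
    p_yes = sum(w for p, w in zip(particles, weights) if question(p))
    if p_yes <= 0.0 or p_yes >= 1.0:
        return 0.0  # the answer is already certain, so asking teaches nothing
    return -(p_yes * math.log2(p_yes) + (1 - p_yes) * math.log2(1 - p_yes))


# Candidate questions, each written as a predicate over a hypothesized board.
QUESTIONS = {
    "Is any part of the ship in row 0?": lambda b: any(r == 0 for r, _ in b),
    "Is any part of the ship in the left half?": lambda b: any(c < 2 for _, c in b),
    "Is the ship horizontal?": lambda b: len({r for r, _ in b}) == 1,
}

rng = random.Random(0)
particles = [sample_board(rng) for _ in range(500)]
weights = [1.0 / len(particles)] * len(particles)

# Ask the question whose answer is least predictable, then update on a "yes".
best = max(QUESTIONS, key=lambda q: answer_entropy(particles, weights, QUESTIONS[q]))
weights = reweight(particles, weights, QUESTIONS[best], answer=True)
```

A question that nearly every particle answers the same way scores close to zero bits, so the captain is steered toward questions that split its remaining hypotheses roughly in half.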
Second, the researchers converted questions into Python code, giving smaller language models a language they understand particularly well. When an AI captain asks "Is there a ship in column one that spans two rows?", the question is automatically translated into coded instructions that the spotter uses to search the board and verify the answer. This simple step dramatically improved accuracy: GPT-4o-mini saw a nearly 30 percent performance boost, while even the large Claude 4 Opus model gained roughly eight percentage points.
The results speak for themselves. Llama 4 Scout, a relatively small model that beat humans just 8 percent of the time initially, achieved an 82 percent win rate after refinement. Remarkably, it outpaced GPT-5 while operating at roughly 1 percent of the cost. "Today's language models are primarily optimized to answer complex queries, but it's less clear whether they learn to ask good questions for themselves," explains Gabriel Grand, MIT PhD student and lead author on the work. "Our work shows that asking informative questions depends on the ability to predict and simulate the world."
The implications extend far beyond a parlor game. These techniques suggest a path forward for deploying AI agents in domains where curiosity and strategic inquiry aren't luxuries—they're essential. By teaching machines to ask better questions, researchers are fundamentally reshaping what these systems can accomplish.
