Teaching AI to Ask Better Questions Through Battleship Games
MIT and Harvard researchers used the classic guessing game to reveal—and fix—a critical weakness in language models.

A fundamental flaw in AI agents
Artificial intelligence agents are increasingly deployed in customer service and software development, but they struggle in high-stakes domains like medical diagnosis and scientific discovery. The core problem: most language models are optimized to answer questions, not ask them.
Researchers at MIT's Computer Science and Artificial Intelligence Laboratory and Harvard's School of Engineering and Applied Sciences identified this weakness through an unexpected testing ground—the board game Battleship. Their findings reveal why AI agents fail at information-seeking tasks and demonstrate techniques that dramatically improve performance across model sizes.
Why it matters
The ability to formulate strategic questions is essential for AI agents working in scientific research, medical diagnosis, and any domain requiring exploration of vast solution spaces. This research shows that even smaller, more cost-effective language models can match or exceed frontier systems when equipped with better reasoning strategies, potentially democratizing access to capable AI agents while cutting computational costs by roughly 99 percent in the reported comparison.
Reframing a childhood game
The research team created "Collaborative Battleship," where one player acts as a "captain" asking natural language questions about hidden ship locations, while a "spotter" teammate provides yes-or-no answers. After collecting gameplay data from over 40 human participants to build the BattleshipQA dataset, researchers tested state-of-the-art models including GPT-5 and smaller systems like Llama 4 Scout.
The results were revealing. Large models could complete games in fewer turns than humans, but smaller models performed poorly. More importantly, most models struggled to formulate useful questions—the kind that efficiently narrow down possibilities.
Monte Carlo inference transforms performance
To address this weakness, researchers implemented Monte Carlo inference strategies that help models evaluate the likelihood of different options with each answer received. This approach treats potential guesses as weighted particles that become more or less probable as information accumulates.
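As a rough illustration of the weighted-particle idea, the Python sketch below samples candidate boards and reweights them after one yes-or-no answer. It is not the researchers' implementation: the 4x4 grid, the single two-cell ship, the `noise` parameter modeling an occasionally wrong spotter, and all function names are assumptions made for this example.

```python
import random

SIZE = 4  # hypothetical 4x4 grid, much smaller than a real board


def random_board(rng):
    """Place one length-2 ship at random; return its occupied (row, col) cells."""
    if rng.random() < 0.5:  # horizontal placement
        r, c = rng.randrange(SIZE), rng.randrange(SIZE - 1)
        return frozenset({(r, c), (r, c + 1)})
    r, c = rng.randrange(SIZE - 1), rng.randrange(SIZE)  # vertical placement
    return frozenset({(r, c), (r + 1, c)})


def update(particles, weights, question, answer, noise=0.1):
    """Reweight each hypothesis by how well it explains the spotter's answer."""
    reweighted = []
    for board, w in zip(particles, weights):
        # A board that agrees with the answer is likely; one that disagrees
        # is only possible if the spotter made a mistake (assumed rate: noise).
        likelihood = 1.0 - noise if question(board) == answer else noise
        reweighted.append(w * likelihood)
    total = sum(reweighted)
    return [w / total for w in reweighted]


def cell_probability(particles, weights, cell):
    """Estimated probability that a ship occupies the given cell."""
    return sum(w for board, w in zip(particles, weights) if cell in board)


def in_top_half(board):
    """Stand-in for the question 'Is any part of the ship in the top half?'"""
    return any(r < SIZE // 2 for r, _ in board)


rng = random.Random(0)
particles = [random_board(rng) for _ in range(500)]
weights = [1.0 / len(particles)] * len(particles)

# The spotter answers "no", so boards touching the top half lose weight.
weights = update(particles, weights, in_top_half, answer=False)
print(round(cell_probability(particles, weights, (0, 0)), 3))
```

After the "no" answer, the estimated probability of a ship at the top-left cell drops, while boards confined to the bottom half gain weight. Repeating the update for each new answer is what lets the estimate sharpen as information accumulates.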
The impact was dramatic. Llama 4 Scout initially beat human players only 8 percent of the time. With the inference strategies in place, its win rate jumped to 82 percent, while operating at roughly 1 percent of GPT-5's cost.
"Today's language models are primarily optimized to answer complex queries, but it's less clear whether they learn to ask good questions for themselves," said Gabriel Grand, MIT PhD student and lead author. "Our work shows that asking informative questions depends on the ability to predict and simulate the world."
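One common way to make "informative" concrete is expected information gain: simulate the answer to a candidate question under every hypothesis and prefer the question whose answer is hardest to predict. The article does not say which scoring rule the researchers used, so the Python sketch below is an assumption for illustration, with a noiseless spotter, a toy set of eight hypotheses, and made-up question texts.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a yes/no outcome with probability p of 'yes'."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_information_gain(hypotheses, weights, question):
    """With a noiseless spotter, the gain equals the entropy of the answer."""
    p_yes = sum(w for h, w in zip(hypotheses, weights) if question(h))
    return binary_entropy(p_yes)


# Hypothetical setup: the ship starts in one of 8 columns, all equally likely.
hypotheses = list(range(8))
weights = [1 / 8] * 8

questions = {
    "Is the ship in column 0?": lambda h: h == 0,
    "Is the ship in the left half?": lambda h: h < 4,
    "Is the ship on the board?": lambda h: True,
}

scores = {
    text: expected_information_gain(hypotheses, weights, q)
    for text, q in questions.items()
}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

The question that splits the hypotheses evenly scores a full bit, the narrow single-column question scores less, and the question whose answer is already known scores zero.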
Converting questions to code
The team also tackled the "spotter" role, where smaller models frequently gave incorrect answers. By automatically converting natural language questions into Python code—explicit instructions for verifying answers—models improved accuracy by 15 percent on average. The lightweight GPT-4o-mini saw nearly 30 percent improvement.
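To show what converting a question into code can look like, the Python sketch below runs a small function against a known board and returns the yes-or-no answer. The board encoding, the question, and the function text are all hypothetical; in the setup the article describes, a language model would write the function, whereas here it is written by hand.

```python
# Hypothetical board: 'W' is water, any other letter marks a ship segment.
BOARD = [
    "WWWW",
    "WRRW",
    "WWWB",
    "WWWB",
]

# Stand-in for model output answering "Is any part of a ship in row 2?"
# (rows counted from 1, so row 2 is index 1).
GENERATED_SOURCE = '''
def answer(board):
    return any(cell != "W" for cell in board[1])
'''


def run_generated(source, board):
    """Execute the generated function and return its yes/no result."""
    namespace = {}
    # exec is used only for illustration; untrusted model output would need
    # sandboxing before being run like this.
    exec(source, namespace)
    return bool(namespace["answer"](board))


print("yes" if run_generated(GENERATED_SOURCE, BOARD) else "no")
```

Because the answer comes from executing explicit instructions on the board rather than from the model reading the grid directly, a correctly written function gives the same answer every time it is run.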
Testing extended to "Guess Who?," where models must eliminate options from a field of 100 characters. Llama 4 Scout's success rate increased from 30 percent to over 72 percent with the enhanced techniques.
Limitations and future directions
While promising, the research has boundaries. Models still lag behind humans in answering complex questions, and expert human players remain difficult to beat—unlike chess, where AI systems dominate at all levels. The researchers acknowledge that Battleship represents a relatively simple test environment compared to real scientific discovery challenges.
Future work will explore human-AI collaboration, fine-tuning on game simulations, and testing in more complex domains requiring consideration of far more options.
The findings were first reported by MIT News and presented at the International Conference on Learning Representations in April 2026. The research was supported by the MIT-IBM Watson AI Lab, DARPA, and the National Science Foundation, among others.
This is an original analysis by the Omega editorial team. Source reporting: AI Watch.
