
Teaching AI to Ask Better Questions Through Battleship Games

MIT and Harvard researchers used the classic guessing game to reveal—and fix—a critical weakness in language models.

Omega Editorial · June 3, 2026 · 3 min read

A fundamental flaw in AI agents

Artificial intelligence agents are increasingly deployed in customer service and software development, but they struggle in high-stakes domains like medical diagnosis and scientific discovery. The core problem: most language models are optimized to answer questions, not ask them.

Researchers at MIT's Computer Science and Artificial Intelligence Laboratory and Harvard's School of Engineering and Applied Sciences identified this weakness through an unexpected testing ground—the board game Battleship. Their findings reveal why AI agents fail at information-seeking tasks and demonstrate techniques that dramatically improve performance across model sizes.

Why it matters

The ability to formulate strategic questions is essential for AI agents working in scientific research, medical diagnosis, and any domain requiring exploration of vast solution spaces. This research shows that even smaller, more cost-effective language models can match or exceed frontier systems when equipped with better reasoning strategies. In the reported experiments, a smaller model did so at roughly 1 percent of GPT-5's cost, which could widen access to capable AI agents.

Reframing a childhood game

The research team created "Collaborative Battleship," where one player acts as a "captain" asking natural language questions about hidden ship locations, while a "spotter" teammate provides yes-or-no answers. After collecting gameplay data from over 40 human participants to build the BattleshipQA dataset, researchers tested state-of-the-art models including GPT-5 and smaller systems like Llama 4 Scout.

The results were revealing. Large models could complete games in fewer turns than humans, but smaller models performed poorly. More importantly, most models struggled to formulate useful questions—the kind that efficiently narrow down possibilities.
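To make "useful question" concrete, here is a minimal Python sketch of one standard way to score a yes-or-no question: the expected information gain, which for truthful answers equals the entropy of the answer itself. The 4x4 grid, the single-tile ship, and the function names are illustrative assumptions, not details taken from the study.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a yes/no answer that is 'yes' with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_information_gain(question, hypotheses):
    """Bits gained by asking `question`.

    Assumes every hypothesis is equally likely and the answer is truthful.
    """
    yes = sum(1 for h in hypotheses if question(h))
    return binary_entropy(yes / len(hypotheses))

# Toy hypothesis space (an assumption): one single-tile ship on a 4x4 grid.
hypotheses = [(r, c) for r in range(4) for c in range(4)]

in_top_half = lambda h: h[0] < 2    # splits the 16 candidates 8 / 8
at_corner = lambda h: h == (0, 0)   # splits the 16 candidates 1 / 15

eig_half = expected_information_gain(in_top_half, hypotheses)
eig_corner = expected_information_gain(at_corner, hypotheses)
print(f"top half: {eig_half:.3f} bits, corner: {eig_corner:.3f} bits")
```

The half-board question is worth a full bit because it rules out half the candidates whichever way it is answered, while the corner question usually eliminates only one.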

Monte Carlo inference transforms performance

To address this weakness, researchers implemented Monte Carlo inference strategies that help models evaluate the likelihood of different options with each answer received. This approach treats potential guesses as weighted particles that become more or less probable as information accumulates.

The impact was dramatic. Llama 4 Scout initially beat human players only 8 percent of the time. After refinements, its win rate jumped to 82 percent—while operating at roughly 1 percent of GPT-5's cost.

"Today's language models are primarily optimized to answer complex queries, but it's less clear whether they learn to ask good questions for themselves," said Gabriel Grand, MIT PhD student and lead author. "Our work shows that asking informative questions depends on the ability to predict and simulate the world."
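The weighted-particle idea can be sketched in a few lines of Python. This is a simplified illustration, not the researchers' implementation: the 4x4 board, the single two-tile ship, and all function names are assumptions, and the update shown is the noise-free special case in which a particle that contradicts the answer drops to zero weight.

```python
import random

SIZE = 4  # hypothetical board size

def sample_board(rng):
    """Sample one hypothetical board: a single horizontal ship of length 2."""
    r = rng.randrange(SIZE)
    c = rng.randrange(SIZE - 1)
    return frozenset({(r, c), (r, c + 1)})

def reweight(particles, question, answer):
    """Zero out particles that contradict the answer, then renormalize."""
    updated = [(b, w if question(b) == answer else 0.0) for b, w in particles]
    total = sum(w for _, w in updated)
    if total == 0:
        raise ValueError("no particle matches the answer; draw new samples")
    return [(b, w / total) for b, w in updated]

def tile_probabilities(particles):
    """Probability that each tile holds a ship under the current weights."""
    probs = {}
    for board, w in particles:
        for tile in board:
            probs[tile] = probs.get(tile, 0.0) + w
    return probs

rng = random.Random(0)
n = 500
particles = [(sample_board(rng), 1.0 / n) for _ in range(n)]

# Suppose the spotter answers "yes" to "Is any part of the ship in row 2?"
in_row_2 = lambda board: any(r == 2 for r, _ in board)
particles = reweight(particles, in_row_2, True)

probs = tile_probabilities(particles)
best_tile = max(probs, key=probs.get)  # most likely tile to fire at next
```

After each answer the surviving particles carry all the probability mass, so the captain can either fire at the most likely tile or use the same particles to score the next candidate question.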

Converting questions to code

The team also tackled the "spotter" role, where smaller models frequently gave incorrect answers. By automatically converting natural language questions into Python code—explicit instructions for verifying answers—models improved accuracy by 15 percent on average. The lightweight GPT-4o-mini saw nearly 30 percent improvement.
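As a rough illustration of what question-to-code conversion might look like, the Python sketch below pairs two example questions with the kind of small program a model could generate for each, then runs the program against the true board. The board encoding, the questions, and the function names are all hypothetical and are not drawn from the BattleshipQA dataset.

```python
# Hypothetical board encoding: "W" is water, any other letter is a ship.
BOARD = [
    "WWWW",
    "WRRW",
    "WWWG",
    "WWWG",
]

# Code a model might emit for: "Is any part of a ship in row 2?"
# (rows are 1-indexed in the question, 0-indexed in the code).
def ship_in_row_2(board):
    return any(cell != "W" for cell in board[1])

# Code a model might emit for: "Is the green ship vertical?"
def green_ship_is_vertical(board):
    cols = {c for row in board for c, cell in enumerate(row) if cell == "G"}
    tiles = sum(row.count("G") for row in board)
    return len(cols) == 1 and tiles > 1

def answer(program, board):
    """Run a generated program and map its result to the spotter's reply."""
    return "yes" if program(board) else "no"

print(answer(ship_in_row_2, BOARD))
print(answer(green_ship_is_vertical, BOARD))
```

The appeal of this design is that the answer comes from executing explicit logic on the board rather than from the model's unaided reading of the grid, so a mistake shows up as inspectable code instead of a silent wrong reply.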

Testing extended to "Guess Who?," where models must eliminate options from a field of 100 characters. Llama 4 Scout's success rate increased from 30 percent to over 72 percent with the enhanced techniques.

Limitations and future directions

While promising, the research has boundaries. Models still lag behind humans in answering complex questions, and expert human players remain difficult to beat—unlike chess, where AI systems dominate at all levels. The researchers acknowledge that Battleship represents a relatively simple test environment compared to real scientific discovery challenges.

Future work will explore human-AI collaboration, fine-tuning on game simulations, and testing in more complex domains requiring consideration of far more options.

The findings were first reported by MIT News and presented at the International Conference on Learning Representations in April 2026. The research was supported by the MIT-IBM Watson AI Lab, DARPA, and the National Science Foundation, among others.


This is an original analysis by the Omega editorial team. Source reporting: AI Watch.
