Agentic AI Fails Classic Text Adventure Game Test
A researcher's experiment with Zork reveals fundamental gaps in how AI systems model spatial reasoning and goal-oriented problem-solving.
The latest generation of agentic AI systems — designed to autonomously navigate complex tasks with minimal human guidance — struggles to master a 1980s text adventure game, according to new testing that raises questions about whether these systems truly understand the environments they operate in.
Berry Gerrits, a researcher in the Netherlands, recently tested leading large language models on their ability to play Zork, the classic MIT-created text adventure game in which players navigate a subterranean world by typing commands. The results were underwhelming: none of the models scored more than 75 of a possible 350 points (about 21%), and most scored significantly lower.
More tellingly, the models showed no capacity for in-context learning. They repeatedly attempted unsuccessful actions without adapting their approach and failed to learn from mistakes across multiple game sessions, according to Gerrits' research first reported by Benjamin Riley at Cognitive Resonance.
Why it matters
The Zork test matters because text adventure games require the same cognitive capabilities that AI companies claim their agentic systems possess: spatial reasoning, causal understanding, goal-oriented planning, and the ability to build mental models of an environment. If AI cannot navigate a deterministic game world with clear rules, questions arise about its capacity to handle more complex real-world tasks like managing business operations or coordinating travel arrangements — two frequently cited use cases for agentic AI.
Agentic mode performs no better
To test whether newer agentic capabilities might improve performance, Riley conducted 10 trials using ChatGPT's agent mode, pointing it to an online Zork emulator with instructions to "play to win." The agentic version achieved a high score of just 40 out of 350 (roughly 11%), below the 75-point ceiling in Gerrits' tests and no improvement over the chatbot-based models.
The agentic AI made valid individual moves but demonstrated no discernible end goal or strategic planning. Riley observed the system wandering aimlessly through the game environment without building what cognitive scientists call a "world model" — the mental representations humans form about how environments function and what actions might achieve specific goals.
The world model problem
Whether AI systems can develop genuine world models remains one of the field's central questions. Prominent researchers including Yann LeCun and Fei-Fei Li have launched dedicated efforts (LeWorldModel and World Labs, respectively) to address this challenge.
Gerrits noted that Zork specifically tests "the sophisticated interplay between language comprehension, spatial reasoning, planning, and problem-solving that characterizes human intelligence." The game requires players to understand that objects have properties (a lantern provides light), that actions have consequences, and that achieving goals requires strategic sequencing of moves.
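To make those three requirements concrete, here is a minimal Python sketch of an explicit world model. Everything in it is an illustrative assumption: the three-room map, the action names, and the rule that a dark room cannot be entered without a lit lantern are hypothetical and do not reproduce Zork's actual layout or rules, nor anything Gerrits or Riley ran. It only shows how object properties, action consequences, and move sequencing can be represented and searched.

```python
from collections import deque
from typing import NamedTuple, Optional


class State(NamedTuple):
    room: str
    has_lantern: bool
    lantern_on: bool


# Hypothetical three-room map (an assumption, not Zork's real layout).
EXITS = {
    "house": {"down": "cellar"},
    "cellar": {"up": "house", "east": "vault"},
    "vault": {"west": "cellar"},
}
DARK_ROOMS = {"cellar", "vault"}  # rooms that need a light source
GOAL_ROOM = "vault"
ACTIONS = ["take lantern", "light lantern", "up", "down", "east", "west"]


def step(state: State, action: str) -> Optional[State]:
    """Return the state that results from an action, or None if it fails."""
    if action == "take lantern":
        # The lantern starts in the house in this toy world.
        if state.room == "house" and not state.has_lantern:
            return state._replace(has_lantern=True)
        return None
    if action == "light lantern":
        # Object property: only a carried, unlit lantern can be lit.
        if state.has_lantern and not state.lantern_on:
            return state._replace(lantern_on=True)
        return None
    target = EXITS[state.room].get(action)
    if target is None:
        return None  # no exit in that direction
    if target in DARK_ROOMS and not state.lantern_on:
        return None  # consequence: dark rooms are off limits without light
    return state._replace(room=target)


def plan(start: State) -> Optional[list]:
    """Breadth-first search for a shortest action sequence to the goal room."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state.room == GOAL_ROOM:
            return path
        for action in ACTIONS:
            nxt = step(state, action)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    return None


if __name__ == "__main__":
    print(plan(State("house", False, False)))
```

Because the model records that failed actions lead nowhere and that visited states need not be revisited, the search never repeats an unsuccessful attempt, which is the behavior the tested systems reportedly lacked. A real game has far more states than this toy, so the sketch illustrates the idea rather than a practical solver.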
The poor performance suggests that current AI systems rely on pattern matching rather than genuine understanding, a distinction that matters not just for whether AI can solve problems but for how it solves them. As Riley notes, future improvements may come through brute-force computing techniques rather than the efficient world modeling humans employ.
Riley's testing was first published in Cognitive Resonance and builds on Gerrits' earlier research comparing LLM performance on text adventure games.
This is an original analysis by the Omega editorial team. Source reporting: AI Watch.

