Frontier AI Models Are Getting Stumped by a Simple Children's Game


Earlier this week, researchers at Apple released a damning paper, criticizing the AI industry for vastly overstating the ability of its top AI models to reason or "think."
The team found that the models, including OpenAI's o3, Anthropic's Claude 3.7, and Google's Gemini, were stumped by even the simplest of puzzles. For instance, the "large reasoning models," or LRMs, consistently failed at Tower of Hanoi, a children's puzzle game that involves three pegs and a number of differently sized disks that must be moved between the pegs without ever placing a larger disk on a smaller one.
The researchers found that the AI models' accuracy in the game fell below 80 percent with seven disks, and that the models were more or less entirely stumped by puzzles involving eight disks.
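For a sense of how simple the puzzle is in conventional terms, the classic recursive solution fits in a few lines of code. The sketch below is illustrative only (it is not taken from the Apple paper), but it shows the kind of deterministic procedure that solves any number of disks without error, including the eight-disk cases that stumped the models.

```python
def hanoi(n, source, target, spare, moves):
    """Append the (from_peg, to_peg) moves that transfer n disks from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # park the n-1 smaller disks on the spare peg
    moves.append((source, target))              # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)  # stack the smaller disks back on top

moves = []
hanoi(8, "A", "C", "B", moves)
print(len(moves))  # 255: an eight-disk puzzle takes 2**8 - 1 moves
```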
They also consistently failed at Blocks World, a block-stacking puzzle, and River Crossing, a puzzle that involves moving items across a river using a boat with several constraints.
"Through extensive experimentation across diverse puzzles, we show that frontier [large reasoning models] face a complete accuracy collapse beyond certain complexities," the Apple researchers wrote.
It was an eyebrow-raising finding, highlighting how even the most sophisticated AI models still can't reason their way through simple puzzles, despite being made out to be something far more capable by their makers' breathless marketing.
That approach to selling the tech to the public has led users to anthropomorphize AI models, or think of them as humanlike, widening the gap between their presumed and actual capabilities.
The findings amplify ongoing fears that current AI approaches, including "reasoning" AI models that break down tasks into individual steps, are a dead end, despite billions of dollars being poured into their development.
Worse yet, past a threshold of complexity, their shortcomings are becoming even more apparent, undercutting the AI industry's promises that simply scaling up the models' training data could make them more intelligent and capable of "reasoning."
Noted AI critic Gary Marcus wasn't surprised by the researchers' findings.
"In many ways, the paper echoes and amplifies an argument that I have been making since 1998," he wrote in a recent post on his Substack, referring to a paper he authored over 26 years ago. "Neural networks of various kinds can generalise within a distribution of data they are exposed to, but their generalisations tend to break down beyond that distribution."
In short, getting stumped by simple children's games isn't exactly what you'd expect from AI models being sold as the next breakthrough in problem-solving and a step toward artificial general — or superhuman — intelligence (AGI), the stated goal of OpenAI.
"It’s not just about ‘solving’ the puzzle," colead author and Apple machine learning engineer Iman Mirzadeh told Marcus. "We have an experiment where we give the solution algorithm to the model, and [the model still failed]... based on what we observe from their thoughts, their process is not logical and intelligent."
Marcus argues that large language and reasoning models simply cast far too wide a net and easily get lost as a result.
"What the Apple paper shows, most fundamentally, regardless of how you define AGI, is that these LLMs that have generated so much hype are no substitute for good, well-specified conventional algorithms," he wrote.
"What this means for business is that you can’t simply drop [OpenAI's LLM] o3 or Claude into some complex problem and expect them to work reliably," the critic added. "What it means for society is that we can never fully trust generative AI; its outputs are just too hit-or-miss."
While many valid use cases for the models remain, "anybody who thinks LLMs are a direct route to the sort [of] AGI that could fundamentally transform society for the good is kidding themselves," Marcus concluded.
More on the paper: Apple Researchers Just Released a Damning Paper That Pours Cold Water on the Entire AI Industry