Apple’s machine-learning research team ignited a fierce debate in the AI community with “The Illusion of Thinking,” a 53-page paper arguing that reasoning AI models like OpenAI’s “o” series and Google’s Gemini don’t actually “think” but merely perform sophisticated pattern matching. The controversy deepened when a rebuttal paper co-authored by Claude Opus 4 challenged Apple’s methodology, suggesting the observed failures stemmed from experimental flaws rather than fundamental reasoning limitations.
What you should know: Apple’s study tested leading reasoning models on classic cognitive puzzles and found their performance collapsed as complexity increased.
- Researchers used four benchmark problems—Tower of Hanoi, Blocks World, River Crossing, and Checker Jumping—that require multi-step planning and complete solution generation.
- As puzzle difficulty increased, model accuracy plunged to zero on the most complex tasks, with reasoning traces also shrinking in length.
- Apple interpreted this as evidence that models “give up” on hard problems rather than engaging in genuine reasoning.
The pushback: Critics immediately challenged Apple’s experimental design and conclusions across social media and academic circles.
- ML researcher “Lisan al Gaib” argued that Apple conflated token budget failures with reasoning failures, noting “all models will have 0 accuracy with more than 13 disks simply because they cannot output that much!”
- For Tower of Hanoi puzzles requiring exponentially more output steps, models hit context window limits rather than reasoning walls (the back-of-the-envelope arithmetic is sketched after this list).
- VentureBeat’s Carl Franzen pointed out that Apple never benchmarked model performance against human performance on identical tasks.
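The arithmetic behind that objection is straightforward: an n-disk Tower of Hanoi needs 2^n - 1 moves, so the answer alone grows exponentially. The sketch below illustrates the point; the tokens-per-move cost and the output ceiling are assumptions chosen for illustration, not figures from either paper.

```python
# Back-of-the-envelope check: how fast a fully enumerated Tower of Hanoi answer
# outgrows a model's output budget. The 2**n - 1 move count is exact;
# TOKENS_PER_MOVE and MAX_OUTPUT_TOKENS are illustrative assumptions only.

TOKENS_PER_MOVE = 10        # rough cost of printing one move, e.g. 'move disk 3 from A to C'
MAX_OUTPUT_TOKENS = 64_000  # assumed output ceiling for a reasoning model

for n in range(8, 16):
    moves = 2**n - 1                  # minimal solution length for n disks
    needed = moves * TOKENS_PER_MOVE  # tokens spent just writing the answer out
    verdict = "fits" if needed <= MAX_OUTPUT_TOKENS else "exceeds budget"
    print(f"{n:2d} disks: {moves:>6,} moves, about {needed:>7,} tokens -> {verdict}")
```

Under these assumed numbers, full enumeration stops fitting at around 13 disks, the same region where critics say the accuracy cliff reflects output limits rather than reasoning.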
The rebuttal paper: “The Illusion of The Illusion of Thinking,” co-authored by independent AI researcher Alex Lawsen and Claude Opus 4, systematically dismantled Apple’s methodology.
- The authors demonstrated that performance collapse resulted from token limitations rather than reasoning deficits—Tower of Hanoi with 15 disks requires over 32,000 moves to print.
- When models were allowed to provide compressed, programmatic answers instead of step-by-step enumeration, they succeeded on far more complex problems (one such compressed answer is shown below this list).
- Some River Crossing puzzles in Apple’s benchmark were mathematically unsolvable as posed, yet failures were still counted against the models.
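To make the “compressed, programmatic answers” point concrete, here is a minimal sketch of what such an answer can look like: a short recursive function that encodes every move of a 15-disk puzzle without writing them out one by one. It illustrates the rebuttal’s argument and is not the exact answer format Lawsen and Claude Opus 4 used.

```python
# A compressed, programmatic answer to Tower of Hanoi: a few lines encode the full
# solution for any n, instead of enumerating 2**n - 1 moves token by token.
# Illustrative only; not the exact output format used in the rebuttal paper.

def hanoi(n, source="A", target="C", spare="B"):
    """Yield the optimal move sequence for n disks as (disk, from_peg, to_peg)."""
    if n == 0:
        return
    yield from hanoi(n - 1, source, spare, target)  # clear n-1 disks onto the spare peg
    yield (n, source, target)                       # move the largest disk
    yield from hanoi(n - 1, spare, target, source)  # restack the n-1 disks on top

# The short function above is the "answer"; expanding it is purely mechanical.
moves = list(hanoi(15))
print(len(moves))   # 32767, i.e. 2**15 - 1, matching the move count cited above
print(moves[:3])    # [(1, 'A', 'C'), (2, 'A', 'B'), (1, 'C', 'B')]
```

Accepting a correct program as a correct solution is what the rebuttal means by a compressed answer, and it is the change that let the models succeed on far more complex instances.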
In plain English: The technical debate centers on whether AI models truly understand problems or just hit artificial limits. Think of it like asking someone to solve a math problem but only giving them a tiny piece of paper—their failure might reflect the paper size, not their math skills. Apple’s test required AI models to write out every single step of complex puzzles, which quickly overwhelmed their “memory space” (context windows). When researchers allowed the models to write shorter, code-based answers instead of full explanations, they performed much better on the same puzzles.
Why this matters for enterprise: The debate highlights critical considerations for companies deploying reasoning AI in production environments.
- Task formulation, context windows, and output requirements can dramatically affect model performance independent of actual reasoning capability.
- Enterprise teams building AI copilots or decision-support systems need to consider hybrid solutions that externalize memory or use compressed outputs (a simple pre-flight check of this kind is sketched below).
- The controversy underscores that benchmarking results may not reflect real-world application performance.
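One way a team might operationalize that, under stated assumptions: estimate the size of the fully enumerated answer before calling the model, and switch to a program-style or externally expanded answer when it will not fit. The names, threshold, and token estimates below are hypothetical and do not come from any vendor API or from either paper.

```python
# Hypothetical pre-flight check for a pipeline that calls a reasoning model:
# if the expected answer cannot fit in the output budget, change strategy
# instead of letting the model "fail". All names and numbers are illustrative.

from dataclasses import dataclass

@dataclass
class TaskEstimate:
    name: str
    expected_output_tokens: int   # rough size of a fully enumerated answer

MAX_OUTPUT_TOKENS = 64_000        # assumed per-call output ceiling

def choose_strategy(task: TaskEstimate) -> str:
    """Pick how to ask for the answer based on estimated output size."""
    if task.expected_output_tokens <= MAX_OUTPUT_TOKENS:
        return "enumerate"        # safe to request the full step-by-step answer
    return "program"              # request code or a plan, expand it outside the model

print(choose_strategy(TaskEstimate("hanoi-10-disks", 10_000)))    # enumerate
print(choose_strategy(TaskEstimate("hanoi-15-disks", 330_000)))   # program
```

The point is not the specific threshold but the habit: treat output-length limits as a deployment constraint to design around, not as a verdict on the model’s reasoning.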
What they’re saying: The research community remains divided on whether current reasoning models represent genuine cognitive breakthroughs.
- University of Pennsylvania’s Ethan Mollick called claims that LLMs are “hitting a wall” premature, comparing them to unfulfilled predictions about “model collapse.”
- Some critics suggested Apple—trailing competitors in LLM development—might be attempting to diminish expectations around reasoning capabilities.
- Carnegie Mellon researcher Rohan Paul summarized the core issue: “Token limits, not logic, froze the models.”
The big picture: This academic skirmish reflects deeper questions about AI evaluation methodology and the nature of machine reasoning itself.
- The episode demonstrates that evaluation design has become as crucial as model architecture in determining apparent AI capabilities.
- Both papers highlight the challenge of distinguishing between genuine reasoning limitations and artificial constraints imposed by test design.
- For ML researchers, the takeaway is clear: before declaring AI milestones or failures, ensure the test itself isn’t constraining the system’s ability to demonstrate its capabilities.