Do Reasoning Models Really Think? The Unlikely Apple-Anthropic Research Feud

TLDR: This post examines findings from 'The Illusion of Thinking' (Parshin Shojaee et al., 2025) and the subsequent commentary 'The Illusion of the Illusion of Thinking' (C. Opus and A. Lawsen, 2025).

The latest advancement in AI has well and truly arrived with Large Reasoning Models (LRMs). OpenAI's o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking represent something new. These AI systems appear to "think" before they respond: they generate detailed reasoning traces, engage in self-reflection, and work through problems step by step. But the question remains: is this genuine reasoning, or sophisticated pattern matching dressed up as thought?

A recent academic dispute brings this question to the forefront. Researchers from Apple claim these models face "complete accuracy collapse" when problems get complex enough. Critics, including researchers and models from Anthropic (yes, their model Opus is listed as an author), argue these failures are the result of flawed experiments, not flawed models.

The Promise and the Problem

Before we jump into the dispute, let's run through what makes LRMs different. Unlike typical language models that generate responses directly, LRMs produce extensive "thinking processes" before delivering answers. They use Chain-of-Thought reasoning, verify their own work, and refine their thinking iteratively. This shift could address one of AI's persistent challenges by moving past pattern recognition to actual reasoning. If successful, LRMs would be much better placed to handle complex planning and multi-step problem solving.

To test this, instead of relying on existing benchmarks, Shojaee and colleagues designed a set of controlled puzzle environments: Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World.

Their findings were interesting. All the tested models experienced complete accuracy collapse beyond certain complexity thresholds. Performance didn't decline gradually; it dropped to zero. Even more interestingly, as problems approached those complexity limits, LRMs actually reduced their reasoning effort, using fewer tokens despite having budget available. This suggests not just a performance ceiling, but a fundamental limitation.

The research revealed three patterns when comparing LRMs with standard LLMs under equal computational resources:

Low Complexity Tasks - Standard language models surprisingly outperformed their reasoning-enhanced counterparts, suggesting that the additional computational overhead of reasoning mechanisms wasn't justified for simple problems.

Medium Complexity Tasks - Reasoning models demonstrated clear advantages, with their structured thinking processes providing meaningful improvements in accuracy and reliability.

High Complexity Tasks - Both model types experienced what the researchers termed "complete accuracy collapse," failing entirely once complexity exceeded certain thresholds.

On simpler tasks, LRMs often found correct solutions early, then wasted resources exploring incorrect alternatives. Even when provided with explicit algorithms (like the optimal Tower of Hanoi solution), they couldn't leverage this information to improve performance.

The Counter-Evidence

Shortly afterwards, Opus and Lawsen mounted a critique arguing that these "failures" reflect experimental limitations rather than model deficiencies. Their analysis focused on several oversights that they believe invalidate the original findings.

First, they argued the original study missed something basic: models recognise when they approach output limits. Replications showed models explicitly stating they would "stop here" to avoid excessive length. The Tower of Hanoi evaluation required outputting every single move, leading to exponential token growth, and when examined closely, the collapse points aligned with token limits. This suggests the "failure" was the models hitting an output wall, not a reasoning breakdown.
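
A quick back-of-the-envelope sketch makes the point. The tokens-per-move figure and output budget below are illustrative assumptions, not numbers taken from either paper:

```python
# Tower of Hanoi needs 2^N - 1 moves, so an answer that must list every
# move grows exponentially with N. Both constants below are assumptions
# chosen for illustration.
TOKENS_PER_MOVE = 7        # assumed average cost of writing out one move
OUTPUT_BUDGET = 64_000     # assumed output-token cap

for n in range(10, 16):
    moves = 2**n - 1
    tokens = moves * TOKENS_PER_MOVE
    status = "over budget" if tokens > OUTPUT_BUDGET else "fits"
    print(f"N={n:2d}  moves={moves:6,d}  ~output tokens={tokens:8,d}  {status}")
```

Under these assumptions the move list stops fitting somewhere around N=13 to 14, regardless of how well the model "reasons" about the puzzle.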

But here's where it gets interesting. They state the River Crossing experiments included mathematically impossible puzzles: for N ≥ 6 actor/agent pairs with a boat capacity of 3, the puzzle has no solution. This is a mathematical fact the study overlooked, so models were marked as failures for not solving the unsolvable.
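
Claims like this are easy to check exhaustively for small N. The sketch below is a minimal brute-force solvability check under simplified jealous-husbands-style rules (an actor can't share a bank or the boat with other agents unless their own agent is present); the paper's exact constraints may differ, so treat it as an illustration of "verify the puzzle is solvable first" rather than a reproduction of their setup.

```python
from collections import deque
from itertools import combinations

def safe(group):
    # Assumed rule: an actor may only be with other pairs' agents
    # if their own agent is also present.
    actors = {i for kind, i in group if kind == "actor"}
    agents = {i for kind, i in group if kind == "agent"}
    return not agents or all(i in agents for i in actors)

def solvable(n_pairs, boat_capacity):
    """Brute-force BFS over (people on the left bank, boat side) states."""
    everyone = frozenset(
        (kind, i) for i in range(n_pairs) for kind in ("actor", "agent")
    )
    start = (everyone, "left")
    seen = {start}
    queue = deque([start])
    while queue:
        left, side = queue.popleft()
        if not left:                      # everyone has crossed
            return True
        bank = left if side == "left" else everyone - left
        for size in range(1, boat_capacity + 1):
            for trip in combinations(bank, size):
                trip = frozenset(trip)
                new_left = left - trip if side == "left" else left | trip
                if not (safe(trip) and safe(new_left) and safe(everyone - new_left)):
                    continue
                state = (new_left, "right" if side == "left" else "left")
                if state not in seen:
                    seen.add(state)
                    queue.append(state)
    return False

for n in range(2, 7):
    print(f"{n} pairs, boat capacity 3: solvable = {solvable(n, 3)}")
```

Under these simplified rules the search reports solutions for 2 to 5 pairs and none for 6, which lines up with the impossibility claim above.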

The study also used "compositional depth" (number of moves) as a complexity metric, which conflates mechanical execution with problem-solving difficulty. Tower of Hanoi requires exponentially many moves (2^N - 1), but each individual move is a trivial decision. River Crossing needs far fewer moves but demands genuine constraint satisfaction and is NP-hard in its general form. This explains why models could execute 100+ Hanoi moves yet fail on 5-move River Crossing problems.

Most revealing was what happened when the output format changed. When prompted to produce a function that generates the solution, rather than an exhaustive move list, for Tower of Hanoi at N=15, models achieved high accuracy. This suggests the underlying reasoning capability remains intact when freed from exhaustive enumeration; the "collapse" was largely an artifact of the output format.
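
To give a sense of how much smaller that answer is: the full move list for N=15 has 2^15 - 1 = 32,767 entries, while a function that produces it fits in a few lines. The snippet below is my own illustration of the idea, not the exact function or prompt used in the critique.

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Yield the optimal Tower of Hanoi moves without materialising the full list."""
    if n == 0:
        return
    yield from hanoi(n - 1, source, spare, target)  # shift n-1 disks out of the way
    yield (source, target)                          # move the largest disk
    yield from hanoi(n - 1, spare, target, source)  # stack the n-1 disks back on top

# For N = 15 the optimal solution has 2**15 - 1 = 32,767 moves,
# yet the generator above is only a handful of lines.
assert sum(1 for _ in hanoi(15)) == 2**15 - 1
```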

So what does this mean?

The question of whether LRMs "think" may matter less than understanding their strengths and weaknesses on a given task. All the models tested represent real advances in AI reasoning, but they're not perfect. They excel at certain problems, struggle with others, and can appear brilliant or inadequate depending on how we test them.

The core issue isn't whether LRMs "truly think" but whether our evaluations measure reasoning or merely run into other constraints. Future evaluation frameworks should separate reasoning capability from output limitations, verify that problems are solvable before testing, use complexity metrics that reflect computational difficulty rather than solution length, and test multiple solution representations.

When deciding which model to use, ask yourself:

  • How do we measure "complexity" for target tasks?
  • Is it computationally challenging or just lengthy?
  • What output constraints exist in production, and how might they affect performance?
  • Have we verified our test problems actually have solutions?

Like most things, it's garbage in, garbage out: always check what you're giving the LLM and make sure you're not setting it up for failure.
