Apple’s recent paper - The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity - has polarized the Machine Learning community like I have rarely seen before. The paper’s main claim is that AI models don’t truly reason. They simulate reasoning by pattern matching.
The paper. Released June 6, 2025, the paper challenges whether Large Reasoning Models (LRMs) such as OpenAI’s o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking truly reason or just mimic it. LRMs combine novel architectures and processes, including extended Chain-of-Thought (CoT) with self-reflection, and have shown impressive performance on diverse reasoning benchmarks. The paper echoes earlier critiques showing that LLMs break down under small prompt changes. The key debate is how LLMs tackle complex reasoning and problem-solving, with some leading AI voices viewing LRMs as critical milestones toward Artificial General Intelligence (AGI). What stands out most to me is how this paper frames reasoning through the lens of complexity - and how LRMs handle that complexity.
Complexity as the stress-tester. Apple’s team tested LRMs on classic puzzles such as the Tower of Hanoi and River Crossing because their complexity can be scaled in a controlled way. It’s important to note that LRMs were therefore not tested on real-world applications, which, for those focused on practical AI use (like yours truly), is a significant limitation of the paper’s reach. However, this method isolates reasoning ability from memorization or pattern recognition more effectively than benchmarks such as MATH or AIME.
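To make that complexity dial concrete, here is a minimal sketch (my own illustration, not code from the paper) of why Tower of Hanoi works so well for this: the rules never change, but each extra disk doubles the length of the optimal solution, so difficulty can be turned up one notch at a time.

```python
# A minimal sketch (not from the paper) of Tower of Hanoi as a complexity dial:
# the rules stay fixed, yet the optimal plan length grows as 2^n - 1 with n disks.

def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the optimal move sequence for n disks as (disk, from, to) tuples."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)      # move n-1 disks out of the way
        + [(n, source, target)]                        # move the largest disk
        + hanoi_moves(n - 1, spare, target, source)    # restack the n-1 disks on top
    )

for n in range(1, 11):
    moves = hanoi_moves(n)
    # Optimal plan length is 2^n - 1, so difficulty grows exponentially in n.
    print(f"{n} disks -> {len(moves)} moves (expected {2**n - 1})")
```

River Crossing offers a similar dial: adding more travelers and constraints lengthens the required plan while keeping the rules fixed.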
So, what do we learn? At low complexity, regular LLMs outperform LRMs. When problems are simple, straightforward pattern matching and language modeling work well; anything else is overkill. At medium complexity, LRMs show their real strength - a positive reading of the paper would focus on this result. More specifically, their “thinking traces”, or chain-of-thought outputs, help them navigate multi-step problems better than standard LLMs. At high complexity - and this is the result that received the most attention - both LLMs and LRMs fail spectacularly. Accuracy drops to zero.
Key findings. Even more telling, LRMs reduce their reasoning effort (fewer thinking tokens) as complexity rises, despite having compute budget to spare. This suggests that complexity fundamentally breaks down the models’ reasoning. Humans also tend to give up when a task is too complex, but not in the same systematic way - human outliers are key drivers of change! Besides this scaling paradox - reasoning effort peaks at moderate complexity and then declines - the paper highlights several other core limitations. LRMs can’t reliably perform exact computations or follow algorithms consistently, even when the algorithm is given explicitly. Their reasoning is inconsistent across different puzzles, suggesting they don’t truly understand the underlying logic. Overall, there is still a lot of guessing under the hood, which is a major real-world limitation if your use case relies on data. Whether that is a feature or a bug depends on how you use LLMs/LRMs.
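One reason the puzzle setup is so unforgiving is that every answer can be checked mechanically. The sketch below is my own illustration, not the paper’s code: a tiny simulator replays a model’s proposed move list and rejects it at the first illegal move, which is how accuracy becomes a hard pass/fail verdict rather than a fuzzy grade - and why inconsistent algorithm-following shows up so clearly.

```python
# A rough illustration (mine, not the paper's code) of why puzzle benchmarks leave
# no room for guessing: a proposed move list is replayed step by step, and a single
# illegal move or an unfinished tower makes the whole answer wrong.

def simulate_hanoi(n, moves):
    """Replay (from_peg, to_peg) moves for n disks; return True only if the
    sequence is fully legal and ends with every disk on peg 'C'."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # bottom .. top
    for src, dst in moves:
        if not pegs[src]:
            return False                       # moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                       # larger disk onto a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))  # solved iff all disks end on C

# Example: the optimal 3-disk solution passes, a truncated attempt fails.
good = [("A", "C"), ("A", "B"), ("C", "B"), ("A", "C"), ("B", "A"), ("B", "C"), ("A", "C")]
print(simulate_hanoi(3, good))       # True
print(simulate_hanoi(3, good[:-1]))  # False: tower never completed
```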
Apple’s motives. You will read criticism of Apple’s motives. Some question the timing, suggesting Apple is managing expectations ahead of WWDC 2025. Others suspect a corporate strategy to downplay competitors while Siri lags behind. But Apple’s ML research output has been consistent, and this piece is solid rather than an outlier. It echoes earlier research from eminent colleagues, so in an over-hyped AI world it is better to focus on the claims themselves.
LRM skeptics. And it really does echo the skeptics of the LRM hype - the ones who have told us for years that LLMs are NOT a direct route to AGI. The finding that LRMs hit a hard ceiling on complex reasoning, no matter the compute, is key in this debate. As Yann LeCun and Gary Marcus regularly point out, we need new architectures, not just bigger models.
LRM defenders. How about the defenders of LRMs? They argue that the paper’s puzzle-based complexity tests might not generalize to real-world reasoning, where hybrid systems and strategic shortcuts come into play. And it may well be that models cut long algorithmic work short deliberately, not because they can’t handle complexity.
What's next? To me, this paper highlights the crucial need to use the right AI tools for the right tasks: LLMs are not a one-size-fits-all solution. They excel at language generation, brainstorming, and coding assistance, but struggle with tasks requiring reliable logical reasoning, precise algorithmic execution, and consistent arithmetic. The real diminishing returns show up when we try to push LLMs beyond their core strengths. Relying solely on LLMs is not the path to AGI.
The call. This is a call to rethink AI’s approach to complexity - a call to recognize LLMs’ strengths on simpler tasks while pushing for new architectures or frameworks to tackle higher complexity.