✨ The Illusion of AI Thinking


Hey Reader,

Apple just rolled out a deep dive that claims our smartest chatbots hit a brick wall when puzzles get tough.

The paper, titled The Illusion of Thinking, runs six large models through controlled logic puzzles and asks whether slow “thinking” text actually helps them solve the games.

I read the whole thing so you don’t have to, and here’s the story in plain talk.

HOW THE EXPERIMENT WAS SET UP

Apple picks puzzles with clear rules: Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World.

They tag every puzzle as low, medium, or high complexity so they can watch scores climb or crash in neat blocks.

Each model runs twice per puzzle—once in plain “just answer” mode and once in “show every step” mode that prints a chain-of-thought.

All tests share the same token budget, meaning nobody wins by simply rambling longer.
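If you want to picture that setup, here is a rough sketch of the two run modes. This is my own illustration, not Apple’s harness: the OpenAI client is just a stand-in (the paper tested other models), and the model name, instructions, and 4096-token budget are all placeholders.

```python
# Hypothetical sketch of the two run modes, NOT Apple's evaluation code.
# Model name, instructions, and the token budget are my placeholders.
from openai import OpenAI

client = OpenAI()
TOKEN_BUDGET = 4096  # same cap for both modes, so nobody wins by rambling


def run_puzzle(puzzle: str, show_steps: bool) -> str:
    instruction = (
        "Think step by step, print every move, then state the final answer."
        if show_steps
        else "Reply with the final answer only. No explanation."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",      # stand-in; the paper used other models
        max_tokens=TOKEN_BUDGET,  # identical budget for both modes
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": puzzle},
        ],
    )
    return response.choices[0].message.content
```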

The key metric is pass@k, which measures how often at least one of a model’s first k attempts lands on the right answer.
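In case pass@k is new to you, here is the standard unbiased estimator in a few lines of Python. This is a general-purpose sketch, not Apple’s scoring script: with n attempts and c correct ones, it gives the probability that at least one of k random draws is correct.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled attempts is correct,
    given c correct answers out of n total attempts."""
    if n - c < k:
        return 1.0  # too few failures left to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 25 attempts, 5 correct -> roughly a coin flip for pass@3
print(round(pass_at_k(n=25, c=5, k=3), 3))  # 0.504
```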

LLMS BEAT LRMS IN SIMPLE TASKS

On low-complexity tasks every model, reasoning (LRM) or plain (LLM), rockets to near-perfect scores after a few hundred tokens.

Those yellow dots in the left panels hit the ceiling fast, so plain recall handles baby puzzles just fine.

LRMS BEAT LLMS IN MEDIUM-COMPLEXITY TASKS, BUT AT A HIGHER COST

Shift to medium complexity and the blue panels show a sharp fork: models that “think out loud” climb toward 80-plus percent while their silent twins stall near 20 percent.

Claude 3.7 Sonnet (+thinking) is the poster child here, climbing smoothly as token counts rise, while its no-thought clone plateaus early.

DeepSeek-R1 pulls the same trick, beating its quiet twin by roughly fourfold across the mid-tier puzzles.

Apple calls this the “sweet spot” where forcing text steps actually helps.

NEITHER LLMS NOR LRMS CAN HANDLE COMPLEX REASONING TASKS

Then come the red panels, and the plot takes a hard turn south.

On high complexity, every curve (chain-of-thought or not) hugs the floor even when Apple pumps in ten times more tokens.

The team labels that nosedive “accuracy collapse,” meaning extra compute buys zero extra brains.

They even cheat by dropping the exact Tower of Hanoi algorithm into the prompt, yet the models still bomb once the disk count passes seven.
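For reference, the algorithm in question is the classic recursion every CS course teaches. My rendering below is not the exact pseudocode Apple pasted into the prompt, but it is the same idea the models still failed to follow:

```python
def hanoi(n: int, source: str, target: str, spare: str, moves: list) -> None:
    """Classic recursive Tower of Hanoi: move n disks from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)   # park n-1 disks on the spare peg
    moves.append((source, target))               # move the biggest disk
    hanoi(n - 1, spare, target, source, moves)   # stack the n-1 disks back on top


moves: list = []
hanoi(8, "A", "C", "B", moves)
print(len(moves))  # 255 moves for 8 disks, already a long plan to keep straight
```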

Look at the figure Apple sticks right in the center of the paper.

Six charts sit in a grid: yellow on the left, blue in the middle, red on the right, repeated for Claude and DeepSeek.

The yellow curves shoot straight up, the blue curves split wide between thinkers and non-thinkers, and the red curves look like a dead line on a heart monitor.

Those dots tell the whole cautionary tale: easy wins, middle gains, hard quits.

Even stranger, token counts used for “thinking” drop as the puzzles demand more steps, which is backward if real logic is in play.
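To see why that is backward, remember that the minimum number of moves explodes with every extra disk, so a genuine reasoner would need more thinking tokens on the harder instances, not fewer. A quick sanity check:

```python
# Minimum moves for Tower of Hanoi is 2**n - 1, so the workload explodes
# exactly where the models start spending fewer "thinking" tokens.
for disks in (3, 5, 7, 10, 12):
    print(f"{disks} disks -> {2**disks - 1} moves minimum")
```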

WHAT WENT WRONG

Apple digs into the chains of thought and finds recycled prompt text, random math talk, and half-finished plans.

The models often bail early, ignore sub-goals, or loop back on themselves without noticing.

Temperature tweaks, top-p swaps, longer context windows, and code fine-tunes all fail to lift the red lines.

Cost tests show you can quadruple output tokens and still sit at zero, so money alone won’t fix it.

Critics argue the tasks force long answers, but Apple fires back that real agents must handle long plans in real life too.

KEY TAKEAWAYS

Lesson one: a flashy chain-of-thought does not equal real reasoning; it may just echo training snippets.

Lesson two: always ramp test puzzles from tiny to nasty, because mid-tier success hides a cliff up ahead.

Lesson three: keep rock-solid algorithms or symbolic planners for deep logic and let LLMs handle glue text, quick code, and fuzzy tasks.

Lesson four: watch your token budget like cash, because past the collapse line you burn compute with zero gain.

LIMITS OF TODAY’S LLMS

Apple says current models lean hard on pattern recall, lack true state tracking, and can’t guarantee progress as tasks grow.

They rarely spot their own slip-ups, so early errors snowball while the output still sounds confident.

More scaling will stretch context, not fix reasoning, so hybrid stacks or new architectures must step in.

WHAT THIS MEANS FOR US

We should keep LLMs on tasks they ace: writing drafts, cleaning data blurbs, and whipping up quick helper code.

For project planning, safety checks, or long algebra we still need formal logic engines or step-by-step verifiers.

Let’s build pipelines that call the right tool for each slice instead of betting the farm on one giant brain.
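Here is a toy version of what that routing could look like. Everything in it, the task labels, the solver, the LLM stub, is a placeholder of my own; the point is only the shape of the pipeline:

```python
# Toy "right tool per slice" router. All names and stubs are placeholders.
SYMBOLIC_TASKS = {"tower_of_hanoi", "river_crossing", "schedule_planning"}


def run_exact_solver(task_type: str, payload: str) -> str:
    """Deterministic planner/solver: guaranteed correct, no collapse at scale."""
    return f"solver[{task_type}] -> exact plan for {payload!r}"


def run_llm(payload: str) -> str:
    """LLM slice: drafts, blurbs, glue code, fuzzy requests."""
    return f"llm -> draft for {payload!r}"


def route(task_type: str, payload: str) -> str:
    # Send deep logic to an exact tool, everything fuzzy to the LLM.
    if task_type in SYMBOLIC_TASKS:
        return run_exact_solver(task_type, payload)
    return run_llm(payload)


print(route("tower_of_hanoi", "10 disks"))
print(route("newsletter_intro", "summarize the Apple reasoning paper"))
```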

Apple’s study drives home that today’s “thinking” models shine in the middle and stall at the top, so plan designs with that ceiling in mind.

Ping me if you want to brainstorm guardrails or mixed pipelines, because we all hate late-stage surprises.

Cheers,

Diogo

P.S. Launched a new blog post on the OpenAI Agents SDK.

P.P.S. The AI Hero Program is officially live.
