✨ The Illusion of AI Thinking


Hey Reader,

Apple just rolled out a deep dive that claims our smartest chatbots hit a brick wall when puzzles get tough.

The paper, titled The Illusion of Thinking, runs six large models through controlled logic puzzles and asks whether slow “thinking” text actually helps them solve the games.

I read the whole thing so you don’t have to, and here’s the story in plain talk.

HOW THE EXPERIMENT WAS SET UP

Apple picks puzzles with clear rules: Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World.

They tag every puzzle as low, medium, or high complexity so they can watch scores climb or crash in neat blocks.

Each model runs twice per puzzle—once in plain “just answer” mode and once in “show every step” mode that prints a chain-of-thought.

All tests share the same token budget, meaning nobody wins by simply rambling longer.
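If you want to picture that setup, here is a rough sketch of the two run modes. This is my own illustration, not Apple’s harness: the OpenAI client is just a stand-in (the paper tested other models), and the model name, instructions, and 4096-token budget are all placeholders.

```python
# Hypothetical sketch of the two run modes, NOT Apple's evaluation code.
# Model name, instructions, and the token budget are my placeholders.
from openai import OpenAI

client = OpenAI()
TOKEN_BUDGET = 4096  # same cap for both modes, so nobody wins by rambling


def run_puzzle(puzzle: str, show_steps: bool) -> str:
    instruction = (
        "Think step by step, print every move, then state the final answer."
        if show_steps
        else "Reply with the final answer only. No explanation."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",      # stand-in; the paper used other models
        max_tokens=TOKEN_BUDGET,  # identical budget for both modes
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": puzzle},
        ],
    )
    return response.choices[0].message.content
```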

The key metric is pass@k, which measures how often at least one of a model’s first k attempts lands on the right answer.
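In case pass@k is new to you, here is the standard unbiased estimator in a few lines of Python. This is a general-purpose sketch, not Apple’s scoring script: with n attempts and c correct ones, it gives the probability that at least one of k random draws is correct.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled attempts is correct,
    given c correct answers out of n total attempts."""
    if n - c < k:
        return 1.0  # too few failures left to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 25 attempts, 5 correct -> roughly a coin flip for pass@3
print(round(pass_at_k(n=25, c=5, k=3), 3))  # 0.504
```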

LLMS BEAT LRMS IN SIMPLE TASKS

On low-complexity tasks every model, reasoning (LRM) or plain (LLM), rockets to near-perfect scores after a few hundred tokens.

Those yellow dots in the left panels hit the ceiling fast, so plain recall handles baby puzzles just fine.

LRMS BEAT LLMS IN MEDIUM-COMPLEXITY TASKS, BUT AT A HIGHER COST

Shift to medium complexity and the blue panels show a sharp fork: models that “think out loud” climb toward 80-plus percent while their silent twins stall near 20 percent.

Claude 3.7 Sonnet (+thinking) is the poster child here, climbing smoothly as token counts rise, while its no-thought clone plateaus early.

DeepSeek-R1 pulls the same trick, beating its quiet twin by roughly fourfold across the mid-tier puzzles.

Apple calls this the “sweet spot” where forcing text steps actually helps.

NEITHER LLMS NOR LRMS CAN HANDLE COMPLEX REASONING TASKS

Then come the red panels, and the plot takes a hard turn south.

On high complexity, every curve (chain-of-thought or not) hugs the floor even when Apple pumps in ten times more tokens.

The team labels that nosedive “accuracy collapse,” meaning extra compute buys zero extra brains.

They even cheat by dropping the exact Tower of Hanoi algorithm into the prompt, yet the models still bomb once the disk count passes seven.
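For reference, the algorithm in question is the classic recursion every CS course teaches. My rendering below is not the exact pseudocode Apple pasted into the prompt, but it is the same idea the models still failed to follow:

```python
def hanoi(n: int, source: str, target: str, spare: str, moves: list) -> None:
    """Classic recursive Tower of Hanoi: move n disks from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)   # park n-1 disks on the spare peg
    moves.append((source, target))               # move the biggest disk
    hanoi(n - 1, spare, target, source, moves)   # stack the n-1 disks back on top


moves: list = []
hanoi(8, "A", "C", "B", moves)
print(len(moves))  # 255 moves for 8 disks, already a long plan to keep straight
```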

Look at the figure Apple sticks right in the center of the paper.

Six charts sit in a grid: yellow on the left, blue in the middle, red on the right, repeated for Claude and DeepSeek.

The yellow curves shoot straight up, the blue curves split wide between thinkers and non-thinkers, and the red curves look like a dead line on a heart monitor.

Those dots tell the whole cautionary tale: easy wins, middle gains, hard quits.

Even stranger, token counts used for “thinking” drop as the puzzles demand more steps, which is backward if real logic is in play.
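To see why that is backward, remember that the minimum number of moves explodes with every extra disk, so a genuine reasoner would need more thinking tokens on the harder instances, not fewer. A quick sanity check:

```python
# Minimum moves for Tower of Hanoi is 2**n - 1, so the workload explodes
# exactly where the models start spending fewer "thinking" tokens.
for disks in (3, 5, 7, 10, 12):
    print(f"{disks} disks -> {2**disks - 1} moves minimum")
```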

WHAT WENT WRONG

Apple digs into the chains of thought and finds recycled prompt text, random math talk, and half-finished plans.

The models often bail early, ignore sub-goals, or loop back on themselves without noticing.

Temperature tweaks, top-p swaps, longer context windows, and code fine-tunes all fail to lift the red lines.

Cost tests show you can quadruple output tokens and still sit at zero, so money alone won’t fix it.

Critics argue the tasks force long answers, but Apple fires back that real agents must handle long plans in real life too.

KEY TAKEAWAYS

Lesson one: a flashy chain-of-thought does not equal real reasoning; it may just echo training snippets.

Lesson two: always ramp test puzzles from tiny to nasty, because mid-tier success hides a cliff up ahead.

Lesson three: keep rock-solid algorithms or symbolic planners for deep logic and let LLMs handle glue text, quick code, and fuzzy tasks.

Lesson four: watch your token budget like cash, because past the collapse line you burn compute with zero gain.

LIMITS OF TODAY’S LLMS

Apple says current models lean hard on pattern recall, lack true state tracking, and can’t guarantee progress as tasks grow.

They rarely spot their own slip-ups, so early errors snowball while the output still sounds confident.

More scaling will stretch context, not fix reasoning, so hybrid stacks or new architectures must step in.

WHAT THIS MEANS FOR US

We should keep LLMs on tasks they ace: writing drafts, cleaning data blurbs, and whipping up quick helper code.

For project planning, safety checks, or long algebra we still need formal logic engines or step-by-step verifiers.

Let’s build pipelines that call the right tool for each slice instead of betting the farm on one giant brain.
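Here is a toy version of what that routing could look like. Everything in it, the task labels, the solver, the LLM stub, is a placeholder of my own; the point is only the shape of the pipeline:

```python
# Toy "right tool per slice" router. All names and stubs are placeholders.
SYMBOLIC_TASKS = {"tower_of_hanoi", "river_crossing", "schedule_planning"}


def run_exact_solver(task_type: str, payload: str) -> str:
    """Deterministic planner/solver: guaranteed correct, no collapse at scale."""
    return f"solver[{task_type}] -> exact plan for {payload!r}"


def run_llm(payload: str) -> str:
    """LLM slice: drafts, blurbs, glue code, fuzzy requests."""
    return f"llm -> draft for {payload!r}"


def route(task_type: str, payload: str) -> str:
    # Send deep logic to an exact tool, everything fuzzy to the LLM.
    if task_type in SYMBOLIC_TASKS:
        return run_exact_solver(task_type, payload)
    return run_llm(payload)


print(route("tower_of_hanoi", "10 disks"))
print(route("newsletter_intro", "summarize the Apple reasoning paper"))
```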

Apple’s study drives home that today’s “thinking” models shine in the middle and stall at the top, so plan designs with that ceiling in mind.

Ping me if you want to brainstorm guardrails or mixed pipelines, because we all hate late-stage surprises.

Cheers,

Diogo

P.S. Launched a new blog post on the OpenAI Agents SDK.

P.P.S. The AI Hero Program is officially live.
