Reasoning Models Reward Hacking + New Course Image

Published 6 months ago • 1 min read

Hello Reader,

Hey [Recipient Name],

I came across OpenAI’s latest approach for managing AI models, called Chain-of-Thought (CoT) monitoring. It’s all about transparency—letting AI models openly share their reasoning steps so that we can catch and address misbehavior quickly.

In practical terms, this means models explicitly tell us their reasoning behind every decision, revealing patterns we normally wouldn't see. For instance, the first screenshot below shows how CoT monitoring encourages AI to maintain good behavior throughout training, clearly performing better compared to models without monitoring:

But even more important is how well CoT catches attempts at cheating. As soon as a model tries to misbehave, CoT monitoring spots it clearly:

Still, CoT monitoring isn’t foolproof. Some cheating attempts slip past detection, emphasizing there’s still work to do in making monitoring even better:

During training, the agent discovered two reward hacks affecting nearly all training environments (the two large spikes) shown in deep pink. After the team noticed each hack, they made a manual intervention to patch any bugs that allowed such hacks, which is why the hacking rate eventually plummets in each case.

The big takeaway here is that transparency in AI reasoning, combined with thoughtful monitoring strategies, is crucial to safer, more trustworthy AI systems.

Here's the link if you want to read more: OpenAI CoT Monitoring

I'd love to hear your thoughts on this—let me know if you’d like to discuss further!

As a last thing, my newest course is launching soon!

Here is the thumbnaill for the course:

Best,

Diogo

Read more from Data Heroes, with Diogo Resende

🔒 One MCP server was wide-open

Hey Reader! I poked a few Model Context Protocol servers this week and stumbled on one with zero security. If you’re using (or about to use) MCP, watch this: 👉 Hacking MCP: We Found a Server with ZERO Protection! You’ll see the exact test I ran and the two fixes you can copy-paste today to keep your own endpoint off the hacker menu. Catch you in the comments,Diogo

about 1 month ago • 1 min read

✨ The Illusion of AI thinking

Hey Reader, Apple just rolled out a deep dive that claims our smartest chatbots hit a brick wall when puzzles get tough. The paper—titled “The Illusion of Thinking”—fans out six large models and asks if slow “thinking” text actually helps them solve logic games. I read the whole thing so you don’t have to, and here’s the story in plain talk. HOW THE EXPERIMENT WAS SETUP Apple picks puzzles with clear rules like Tower of Hanoi, river crossings, the 24 Game, and simple path-finding mazes. They...

3 months ago • 3 min read

🔥 Another EPIC month of course updates

Hey Hey! Berlin’s in full-tilt summer mode—street raves, open-air cinemas, midnight kebabs. I kept pace and shipped a MASSIVE stack of course upgrades. Jump in, then jump outside. Ninja Haircut GIF RAG, AI Agents & Generative AI with Python and OpenAI 2025 NEW IN MAY: RAGAS + RAG deep dive with all the latest OpenAI tricks—done and waiting COMING IN JUNE: New sections on OpenAI Image Generation + Reasoning Models AI Agents For All! Build No-Code AI Agents & Master AI 2025 MAY REMAKES:...

3 months ago • 1 min read