o1 reasoners are the most exciting models since the original GPT-4. They prove what I predicted earlier this year in my paper AI Search: The Bitter-er Lesson: that we can get models to think for longer instead of building bigger models.
It's too bad they suck on problems you should care about.
Before we built large language models, we had reinforcement learning (RL). 2015–2020 was the golden age of RL, and anyone remotely interested in AI stared at the ceiling wondering how you could build “AlphaZero, but general.”
RL is magic: build an environment, set an objective, and watch a cute little agent ascend past human skill. When I was younger, I wanted to RL everything: Clash Royale, investment analysis, algebra; whatever interested me, I daydreamed about RL that could do it better.
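To make that recipe concrete, here is a deliberately tiny sketch, not any real system: a made-up number-line environment, an objective (reach position 5), and a tabular Q-learning agent that climbs toward it by trial and error. Every name and number here is invented for illustration.

```python
# A toy sketch of the RL recipe: environment + objective + trial-and-error agent.
# Everything is made up for illustration; no production system looks like this.
import random

class NumberLineEnv:
    """Agent starts at 0 and earns +1 reward for reaching position 5."""
    def __init__(self, goal=5, max_steps=20):
        self.goal, self.max_steps = goal, max_steps

    def reset(self):
        self.pos, self.steps = 0, 0
        return self.pos

    def step(self, action):                      # action: 0 = left, 1 = right
        self.pos += 1 if action == 1 else -1
        self.steps += 1
        done = self.pos == self.goal or self.steps >= self.max_steps
        reward = 1.0 if self.pos == self.goal else 0.0
        return self.pos, reward, done

def greedy(q, state):
    """Pick the highest-value action, breaking ties at random."""
    vals = {a: q.get((state, a), 0.0) for a in (0, 1)}
    best = max(vals.values())
    return random.choice([a for a, v in vals.items() if v == best])

def train(episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    env, q = NumberLineEnv(), {}
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Mostly exploit the current value estimates, occasionally explore.
            action = random.choice([0, 1]) if random.random() < epsilon else greedy(q, state)
            next_state, reward, done = env.step(action)
            best_next = max(q.get((next_state, a), 0.0) for a in (0, 1))
            # Standard Q-learning update toward reward + discounted best next value.
            q[(state, action)] = q.get((state, action), 0.0) + alpha * (
                reward + gamma * best_next - q.get((state, action), 0.0))
            state = next_state
    return q

q = train()
print("Q(start, left): %.3f  Q(start, right): %.3f"
      % (q.get((0, 0), 0.0), q.get((0, 1), 0.0)))  # "right" should score higher
```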
Why is RL so fun? Unlike today’s chatbots, it gave you a glimpse into the mind of God. AlphaZero was artistic, satisfying, and right; it metaphorized nature’s truths with subtle strategy and stories of the cosmos with daring tactics. Today’s smartest chatbots feel like talking to a hyper-educated human with 30th-percentile intelligence. AI can be beautiful, and we’ve forgotten that.
AlphaZero (black) plays Ng5, sacrificing its knight for no immediate material gain. This move was flippant but genius; it changed chess forever. After seeing it, I realized I wanted to research AI.
But RL is often useless.
In high school, I wondered how to spin up an RL agent to write philosophy essays. I got top marks, so I figured that writing well wasn’t truly random. Could you reward an agent based on a teacher’s grade? Sure, but then you’d never surpass your teacher. Sometimes, humanity has a philosophical breakthrough, but these are rare; certainly not a source of endless reward. How would one reward superhuman philosophy? What do the musings of aliens 10× smarter than us look like? To this day, I still have no idea how to build a philosophy-class RL agent, and I’m unsure if anyone else does either.
<aside> 📖
RL is great for board games, high-frequency trading, protein folding, and sports… but not for open-ended thought without clear feedback. In formal RL, we call these problems sparse reward environments (see the sketch just below). While we’ve designed more efficient RL algorithms, there is still no silver bullet for them.
</aside>
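Here is a back-of-the-envelope sketch of what “sparse reward” means in practice; the environment and numbers are invented for illustration. If the only reward comes at the end of a long, exactly-right sequence of actions, random exploration almost never stumbles onto it, so the agent has nothing to learn from.

```python
# A minimal sketch of why sparse rewards hurt, under made-up numbers: reward
# arrives only when an entire long action sequence is "correct", so a randomly
# exploring agent essentially never receives a learning signal to improve on.
import random

def sparse_episode(length=50, n_actions=4):
    """Return 1.0 only if every one of the 50 random actions happens to be correct."""
    return 1.0 if all(random.randrange(n_actions) == 0 for _ in range(length)) else 0.0

episodes = 100_000
hits = sum(sparse_episode() for _ in range(episodes))
print(f"nonzero rewards seen in {episodes:,} random episodes: {hits:.0f}")
# The chance per episode is (1/4)**50, roughly 1e-30, so `hits` is essentially
# always 0: no reward signal, nothing to ascend. "Write a great philosophy
# essay" has the same shape, except nobody even knows the correct sequence.
```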
How does o1 work? Well, nobody outside of OpenAI knows for sure, but my high-level guess is: