The Human Evaluator’s Goodbye

For as long as we’ve been building intelligent systems, we’ve been the sole arbiters of intelligence. We built the tests, we graded the results, we defined what 'smarter' meant. Now we are building things so far beyond us that we can no longer see the top. We are in the process of ceding the judge's seat, admitting that we are no longer qualified to score the test.

I feel this acutely when my friends ask for recommendations. For the last three years, it has been pretty easy to tell the difference in model quality, and explaining when to use o3 over 4o was quite straightforward. But lately, when they ask about the difference between o3 and o3-pro, I find it hard to give a concrete, general answer. Of course, this was predictable, but it’s still surreal to experience it firsthand.

Sometime in late 2023, I mentally checked out from looking at benchmark numbers and have been purely "vibe testing" the models since. This has become quite common: there’s been plenty of “taste” testing of models, and terms like “big model smell” are artifacts of that. Models still leave many footprints that let us tell them apart (Claude’s signature writing style, for example). Nowadays, though, you have to think hard to design evals that can distinguish between the top models. We're operating in a regime where the deltas in capability are finer than the resolution of our measurement tools.

We’ve been here before. Decades ago, the world’s best chess players had to accept that they could no longer judge the top chess engines. If you showed Magnus Carlsen games from two superhuman engines today, even he could not reliably tell which one is the better engine. The meaningful differences simply exist beyond human comprehension.

We will do what we did in chess: to find out which of two superhuman models is "smarter", we will pit them against each other on a massive battery of complex tasks and see which one comes out on top more often. Eventually, we'll have to rely on AIs even to know which AI model is smarter, because we won't be smart enough to tell the difference, or to see the differences that matter.
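The mechanics of that comparison are simple enough to sketch. Below is a minimal Elo-style update, the same bookkeeping used to rank chess engines, applied to a hypothetical series of head-to-head outcomes between two models. The model names, results, and K-factor are illustrative assumptions, not real data.

```python
# Minimal Elo-style ranking from pairwise results between two models.
# Everything here (names, outcomes, K-factor) is a hypothetical example.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one game.
    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss."""
    exp_a = expected_score(rating_a, rating_b)
    rating_a += k * (score_a - exp_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - exp_a))
    return rating_a, rating_b

# Hypothetical battery of task outcomes: 1.0 = model_x wins, 0.5 = tie, 0.0 = model_y wins.
results = [1.0, 0.5, 1.0, 0.0, 1.0, 0.5, 1.0, 1.0, 0.0, 0.5]

ratings = {"model_x": 1500.0, "model_y": 1500.0}
for score in results:
    ratings["model_x"], ratings["model_y"] = update_elo(
        ratings["model_x"], ratings["model_y"], score
    )

print(ratings)  # the higher rating becomes our only proxy for "smarter"
```

The uncomfortable part isn't the arithmetic; it's that deciding who "won" each task is itself a judgment we increasingly hand to another model.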
