Assessing Human and Agent Performance Objectively

A Harvard and Stanford study published in Science this week found that OpenAI's o1 reasoning model correctly diagnosed emergency room patients at a 67% accuracy rate... compared to 55% and 50% for the two attending physicians evaluated in the same conditions.

A predictable reaction: AI isn't ready. It's only at 67% accuracy. The stakes are too high. It isn't perfect.

But the question that framing dodges: compared to what?

We apply a perfectionism standard to AI that we don't apply to humans. Human physicians misdiagnose. Human drivers crash. Human operators make errors under fatigue, time pressure, and information overload. We accept that baseline without question because it's familiar. (That's not to say it's without lawsuits, though!)

Nonetheless, we demand that AI achieve something humans have never achieved before we're willing to trust it with anything consequential.

Waymo surfaced this same dynamic. The safety data consistently showed fewer accidents per mile than human drivers. The response wasn't celebration... It was skepticism. The comparison wasn't AI versus the actual human baseline. It was AI compared against a romanticized version of human performance that doesn't exist in the data.

Sure, the target should be perfection as the north star. But perfection is not a rational prerequisite for action, especially when the alternative is an imperfect human baseline operating at a fraction of the scale.

This matters well beyond medicine. Every domain considering agentic operations faces the same false standard. In media, the question isn't whether an agentic content operations pipeline will ever make a mistake. It will. The question is whether it makes fewer consequential errors than manual human operations running at scale, across thousands of titles, under delivery deadlines, in multiple languages, for dozens of distribution partners simultaneously.

That's the actual trade. And the evidence across domain after domain shows that the trade looks very different from what the perfectionism framing implies.

Unlike the Waymo scenario, the great news is that in most cases it doesn't have to be an either/or. Agents can take the first cut, with human oversight, not as a fail safe, but as an affirming view. And as AI gets better, the need for that affirming view will undoubtedly decline.

https://www.harvardmagazine.com/ai/ai-outperforms-doctors-diagnosis-harvard-study

Get new posts and founder notes.