The definition of reliability in this post is about 70% reliable :)
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
The original METR graph reports task success which has very different implications from reliability.
Managing tail risk / TVaR is a real problem but the proposed solutions and conclusions are problematic because they mix up model reliability with tail performance.
Oh, totally!
This wasn't intended as a scientifically precise paper; it's more about building intuition for a concept I think is underweighted.
Looking back, I'm not sure we formally define reliability anywhere. Instead, I focused mainly on asking people to think about the second moment and the distributional nature of P(success | task complexity). I think that's both (i) intuitively similar to what practitioners mean when they say a model "works reliably" and (ii) upstream of essentially all formal risk measures.
And yeah, I wasn't about to wade into TVaR/CVaR/ES/CTE/EP/distortion risk measure/the dozen other formal tail measures here. (Also love that as a field we've invented N names for essentially the same thing...)
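If it helps make the second-moment point concrete, here's a toy Python sketch (all numbers invented, purely illustrative): two models with the same mean P(success | task complexity) but wildly different tails, plus a crude TVaR-style summary of the worst 5% of tasks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented numbers: per-task success probabilities for two models with the
# same mean but very different second moments.
model_a = np.clip(rng.normal(0.80, 0.02, 10_000), 0.0, 1.0)  # tight around 0.80
model_b = np.clip(rng.normal(0.80, 0.25, 10_000), 0.0, 1.0)  # same mean, fat tails

def worst_tail_mean(p_success, alpha=0.05):
    """Mean P(success) over the worst alpha fraction of tasks.
    A TVaR/CVaR-style summary, applied to successes instead of losses."""
    cutoff = np.quantile(p_success, alpha)
    return p_success[p_success <= cutoff].mean()

for name, p in [("A", model_a), ("B", model_b)]:
    print(f"model {name}: mean={p.mean():.2f}  std={p.std():.2f}  "
          f"worst-5% mean={worst_tail_mean(p):.2f}")
# Identical means (~0.80); the tail summaries aren't even close.
```

A leaderboard that reports only the first line of that printout would call these two models equivalent.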
Please distinguish between two groups: AI practitioners and AI influencers who follow or generate hype.
As a seasoned AI practitioner, I can’t agree with the title “The AI industry worships at the altar of Accuracy — but humanity answers to a more fickle, demanding god: Reliability.” This is ObviouslyWrong :)
It is a well-known fact that accuracy is not the best metric. That’s precisely why many other evaluation metrics have been defined and used in model development and training.
Hasan, fair pushback on artistic license in the subtitle.
But the substantive claim isn't "accuracy vs. other metrics" — it's "mean-anything vs. tail-anything."
Yes, we've moved beyond raw accuracy to MAPE, BLEU, pass@k, etc. But these are still measures of central tendency. To put a number on it: we parsed 12.5K evals in lm_eval, and 96% target central tendency. The Zillow team weren't influencers; they were serious practitioners. Their tails still killed them. I could go on, but you get the idea.
The question: what percentage of model development explicitly optimizes tail behavior (P95 error, distributional robustness, disciplined risk analysis, etc.)? In my experience, it's small even among strong teams. There's been some notable work emerging recently, but, well, it's notable because it's rare. Are you seeing something different?
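If you want to kick the tires on the mean-vs-tail gap yourself, here's a minimal sketch (the error distribution is hypothetical) of what reporting P95 alongside the mean looks like:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-example absolute errors from an eval run; a lognormal
# gives the heavy right tail you actually see in production.
errors = rng.lognormal(mean=0.0, sigma=1.2, size=50_000)

mean_err = errors.mean()
p95_err = np.percentile(errors, 95)
tail_err = errors[errors >= p95_err].mean()  # expected error *given* a tail event

print(f"mean error:      {mean_err:.2f}")
print(f"P95 error:       {p95_err:.2f}")
print(f"mean beyond P95: {tail_err:.2f}")
# The headline number (the mean) says almost nothing about what the
# worst 5% of cases cost you, and that's where Zillow-style failures live.
```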
I think this hits what I've experienced pretty well! Also, when there's low reliability but good average accuracy, it feels a bit like gambling, and humans are susceptible to that.
Love the gambling analogy! Never connected those dots, but that's exactly how unreliable LLMs feel for me too. Bonus: Explains some of the addictive behavior/AI psychoses we're starting to see.