The definition of reliability in this post is about 70% reliable :)
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
The original METR graph reports task success, which is a very different quantity from reliability and has very different implications.
Managing tail risk / TVaR is a real problem, but the proposed solutions and conclusions are problematic because they conflate model reliability with tail performance.
Oh, totally!
This wasn't intended as a scientifically precise paper; it was more about building intuition for a concept I think is underweighted.
Looking back, I'm not sure we formally define reliability anywhere. Instead, I focused mainly on asking people to think about the second moment and the distributional nature of P(success | task complexity). I think that's both (i) intuitively similar to what practitioners mean when they say a model "works reliably" and (ii) upstream of essentially all formal risk measures.
And yeah, I wasn't about to wade into TVaR/CVaR/ES/CTE/EP/distortion risk measure/the dozen other formal tail measures here. (Also love that as a field we've invented N names for essentially the same thing...)
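To make the second-moment point concrete, here's a minimal sketch with made-up numbers (model names and success rates are purely illustrative): two models with the same mean accuracy can have very different tails.

```python
# Illustrative only: two hypothetical models with the same mean success rate
# but very different second moments. Numbers are made up for this comment.
import numpy as np

rng = np.random.default_rng(0)
n_tasks = 10_000

# Model A: uniformly decent on every task.
p_success_a = np.full(n_tasks, 0.90)
# Model B: near-perfect on 85% of tasks, badly unreliable on the other 15%.
p_success_b = np.where(rng.random(n_tasks) < 0.85, 0.99, 0.39)

for name, p in [("A", p_success_a), ("B", p_success_b)]:
    print(f"Model {name}: mean={p.mean():.2f}  std={p.std():.2f}  "
          f"P5={np.quantile(p, 0.05):.2f}")
# Both report ~0.90 mean accuracy; only the variance and the lower tail
# reveal that Model B is the one you wouldn't trust in production.
```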
I think this hits what I've experienced pretty well! Also, when there's low reliability but good average accuracy, it feels a bit like gambling - and humans are susceptible to that.
Love the gambling analogy! Never connected those dots, but that's exactly how unreliable LLMs feel for me too. Bonus: Explains some of the addictive behavior/AI psychoses we're starting to see.
As a physician-scientist, I really appreciate how you separate mean accuracy from tail reliability, because in real clinical workflows, the tails are where harm lives. A tool that’s “97% right on average” but unpredictably wrong in edge cases isn’t a helpful teammate; it’s the equivalent of a confident colleague who fabricates 30% of the time; nobody would staff that person in a hospital. The compounding point is especially important for anything agentic: even “pretty good” per-step reliability collapses when you chain decisions, which mirrors how small documentation or triage errors can cascade into major downstream consequences. What feels most actionable here is the systems framing (decomposition, verification, retrieval/grounding, and selective abstention) because medicine doesn’t need a model that always answers; it needs a system that knows when to say “I’m not sure, escalate,” and can be audited. That’s how we earn the trust required for higher-stakes use cases.
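To put rough numbers on that compounding point, here is a back-of-the-envelope sketch (it assumes independent steps, which is of course a simplification):

```python
# Back-of-the-envelope: per-step reliability compounds multiplicatively across
# a chained workflow, assuming (simplistically) that failures are independent.
per_step = 0.97
for n_steps in (5, 10, 20, 50):
    print(f"{n_steps:>3} steps: P(whole chain succeeds) = {per_step ** n_steps:.2f}")
# Roughly 0.86 at 5 steps, 0.74 at 10, 0.54 at 20, and 0.22 at 50.
```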
Hasan, fair pushback on artistic license in the subtitle.
But the substantive claim isn't "accuracy vs. other metrics" — it's "mean-anything vs. tail-anything."
Yes, we've moved beyond raw accuracy to MAPE, BLEU, pass@k, etc. But these are still measures of central tendency. To quantify this, we parsed 12.5K evals in lm_eval; 96% of them target central tendency. The Zillow team weren't influencers; they were serious practitioners. Their tails still killed them. I could go on, but you get the idea.
The question: what percentage of model development explicitly optimizes tail behavior — P95 error, distributional robustness, disciplined risk analysis, etc.? In my experience, it's small even among strong teams. There's been some notable work emerging recently, but, well, it's notable because it's rare. Are you seeing something different?
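For anyone who wants the one-liner version of "look at the tail, not the mean," here's a minimal sketch (the error data is synthetic; swap in your own per-example errors):

```python
# Minimal sketch: summarize the tail of the error distribution, not just its mean.
# The data here is synthetic and heavy-tailed; replace `errors` with real
# per-example errors from your own eval.
import numpy as np

rng = np.random.default_rng(1)
errors = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)

mean_err = errors.mean()
p95_err = np.quantile(errors, 0.95)            # P95 error
worst_5pct = errors[errors >= p95_err].mean()  # average of the worst 5% (CVaR-style)

print(f"mean={mean_err:.2f}  P95={p95_err:.2f}  mean of worst 5%={worst_5pct:.2f}")
# A tame-looking mean can coexist with a tail bad enough to sink the project.
```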