9 Comments
Alex's avatar

The definition of reliability in this post is about 70% reliable :)

https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/

The original METR graph reports task success, which has very different implications than reliability does.

Managing tail risk / TVaR is a real problem but the proposed solutions and conclusions are problematic because they mix up model reliability with tail performance.

Kushal Chakrabarti's avatar

Oh, totally!

This wasn't intended as a scientifically precise paper, more about building intuition for a concept I think is underweighted.

Looking back, I'm not sure we formally define reliability anywhere. Instead, I focused mainly on asking people to think about the second moment and the distributional nature of P(success | task complexity). I think that's both (i) intuitively similar to what practitioners mean when they say a model "works reliably" and (ii) upstream of essentially all formal risk measures.

And yeah, I wasn't about to wade into TVaR/CVaR/ES/CTE/EP/distortion risk measure/the dozen other formal tail measures here. (Also love that as a field we've invented N names for essentially the same thing...)

Jonathan Mortensen's avatar

I think this hits what I've experienced pretty well! Also, when there's low reliability but good average accuracy, it feels a bit like gambling - and humans are susceptible to that.

Kushal Chakrabarti's avatar

Love the gambling analogy! Never connected those dots, but that's exactly how unreliable LLMs feel for me too. Bonus: Explains some of the addictive behavior/AI psychoses we're starting to see.

DOCTOR KLOVER 🍀's avatar

As a physician-scientist, I really appreciate how you separate mean accuracy from tail reliability, because in real clinical workflows, the tails are where harm lives. A tool that’s “97% right on average” but unpredictably wrong in edge cases isn’t a helpful teammate; it’s the equivalent of a confident colleague who fabricates 30% of the time; nobody would staff that person in a hospital. The compounding point is especially important for anything agentic: even “pretty good” per-step reliability collapses when you chain decisions, which mirrors how small documentation or triage errors can cascade into major downstream consequences. What feels most actionable here is the systems framing (decomposition, verification, retrieval/grounding, and selective abstention) because medicine doesn’t need a model that always answers; it needs a system that knows when to say “I’m not sure, escalate,” and can be audited. That’s how we earn the trust required for higher-stakes use cases.
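The compounding point can be made concrete with a little arithmetic. What follows is only a minimal sketch: it assumes every step in a chain succeeds independently with the same probability, which real agentic workflows will not satisfy exactly, and the function names `chain_success` and `max_steps` are hypothetical, introduced here for illustration.

```python
def chain_success(per_step: float, steps: int) -> float:
    """End-to-end success probability for a chain of steps.

    Assumption: each step succeeds independently with probability per_step.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


def max_steps(per_step: float, target: float) -> int:
    """Longest chain whose end-to-end success stays at or above target."""
    if not 0.0 < per_step < 1.0:
        raise ValueError("per_step must be strictly between 0 and 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    steps = 0
    success = 1.0
    # Keep adding steps while the chain would still meet the target.
    while success * per_step >= target:
        success *= per_step
        steps += 1
    return steps


if __name__ == "__main__":
    for n in (1, 10, 50):
        print(n, round(chain_success(0.97, n), 3))
    print("steps before dropping below 50%:", max_steps(0.97, 0.5))
```

Under these assumptions, a step that is right 97% of the time yields roughly 74% end-to-end success over ten chained steps, which is why verification and escalation points between steps matter more than a small gain in per-step accuracy.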

chris copeland's avatar

AI Reliability Is Really a Question of Delegation

The sharpest point in “The 9s of AI Reliability” is not that AI gets things wrong. Everyone knows that by now. The sharper point is that we are still talking about AI as if impressive task performance and operational trust are the same thing.

They are not.

A system can be capable enough to surprise you and still not be reliable enough to build responsibility around. That distinction is where a lot of the current confusion lives. A demo asks, “Can it do the task?” A business process asks, “Can I hand this off without creating a larger supervision problem somewhere else?”

Those are very different questions.

This is why mean accuracy can be so misleading. Average performance makes a system look better from a distance than it feels inside a real workflow. The dangerous failures are often not the average ones. They are the edge cases, the ambiguous handoffs, the long chains of dependent steps, the moments when a wrong answer does not simply remain wrong but moves downstream into a customer email, a pricing decision, a legal exposure, a medical note, or a management report.

At that point, the issue is no longer whether the system is intelligent. The issue is whether its errors are visible, bounded, and governable.

That is what reliability buys: not perfection, but designability. A consistently limited system can be used because people know where to put the guardrails. An inconsistently brilliant system is harder to trust because its failures do not stay politely inside the box. It can be dazzling in one moment and quietly destructive in the next.

This also explains why AI can feel both magical and economically underwhelming. If a tool drafts something impressive but requires careful review every time, the labor has not disappeared. It has moved. Someone still has to verify, constrain, correct, and absorb the consequences. The organization may think it has purchased automation, when in practice it has purchased a faster generator of review obligations.

That does not mean AI is useless. It means usefulness arrives in thresholds.

At one level of reliability, AI is an idea generator. At another, it becomes a junior assistant. At a higher level, it becomes something you can delegate bounded work to. Higher still, it becomes a system you can trust with customers, money, scheduling, reputation, or strategy. Each additional “9” does not just improve the same product. It changes the category of responsibility the system is allowed to touch.

That is the deeper reliability crisis. We are not only waiting for AI to become smarter. We are waiting for it to become stable enough that humans can responsibly reduce supervision rather than merely relocate it.

The frontier is not spectacle. It is governable trust.

For now, the right question is not “Can the model do this?”

The right question is: “What has to remain wrapped around the model for this to be safe?”

Lori Boyters's avatar

AI definitely has its place in today's world. As with most automation technologies in their early days, one should not place full confidence in its results at this stage, especially where accuracy and reliability are critical, as in financial systems. The "9s of AI reliability" should be taken into account to ensure a higher percentage of uptime. AI reliability is not absolute; it varies with the task, the quality of the data used, and the environmental conditions.

Trevor James McNeil's avatar

The biggest issue for A.I.'s long-term viability is the 27-point gap in tail accuracy: A.I. models average 97% mean accuracy but only 70% tail accuracy. A gap of nearly 30 points is fatal in any sort of architecture, and the mean-to-tail accuracy gap in A.I. models has already spelled doom even for e-commerce juggernauts like Zillow. This is the first thing that must be addressed if A.I. is going to have any long-term future as a standard tool.

User's avatar
Comment deleted
Dec 24
Kushal Chakrabarti's avatar

Hasan, fair pushback on artistic license in the subtitle.

But the substantive claim isn't "accuracy vs. other metrics" — it's "mean-anything vs. tail-anything."

Yes, we've moved beyond raw accuracy to MAPE, BLEU, pass@k, etc. But these are still measures of central tendency. To quantify, we parsed 12.5K evals in lm_eval — 96% target central tendency. The Zillow team weren't influencers; they were serious practitioners. Their tails still killed them. I could go on, but you get the idea.

The question: what percentage of model development explicitly optimizes tail behavior (P95 error, distributional robustness, disciplined risk analysis, etc.)? In my experience, it's small even among strong teams. There's been some notable work emerging recently but, well, it's notable because it's rare. Are you seeing something different?
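To illustrate the "mean-anything vs. tail-anything" distinction, here is a small sketch comparing a mean error with a P95 error and a tail mean (an empirical, CVaR-style average of the worst cases). The two error lists are synthetic numbers made up for illustration, not measurements from any model, and the helpers `percentile` and `tail_mean` are hypothetical names using a simple nearest-rank definition; other percentile conventions would give slightly different values.

```python
import math
from statistics import mean


def percentile(values: list[float], q: int) -> float:
    """Nearest-rank percentile: smallest value with at least q% of data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def tail_mean(values: list[float], q: int) -> float:
    """Mean of the values at or above the q-th percentile (empirical CVaR-style)."""
    cutoff = percentile(values, q)
    return mean(v for v in values if v >= cutoff)


# Synthetic per-task errors (assumed, illustrative only).
# Both lists have the same mean but very different tails.
steady_errors = [5.0] * 100                # always moderately wrong
spiky_errors = [1.0] * 90 + [41.0] * 10    # usually great, occasionally terrible

if __name__ == "__main__":
    for name, errors in (("steady", steady_errors), ("spiky", spiky_errors)):
        print(name, mean(errors), percentile(errors, 95), tail_mean(errors, 95))
```

Both synthetic models report a mean error of 5, so a mean-only eval would rank them as equal; only the P95 and tail-mean columns reveal that one of them is far worse in its worst cases.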