Discussion about this post

Alex

The definition of reliability in this post is about 70% reliable :)

https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/

The original METR graph reports task success, which has very different implications from reliability.

Managing tail risk / TVaR is a real problem, but the proposed solutions and conclusions are problematic because they mix up model reliability with tail performance.

Jonathan Mortensen

I think this hits what I've experienced pretty well! Also, when there's low reliability but good average accuracy, it feels a bit like gambling - and humans are susceptible to that.
