Beyond pass@1: a reliability science framework for long-horizon LLM agents

Abstract: Existing benchmarks measure capability (whether a model succeeds on a single attempt), but production deployments require reliability: consistent success across repeated attempts on tasks of varying duration. We show that these two properties diverge systematically as task duration grows, and that pass@1 on short tasks is structurally blind to this divergence. We introduce a reliability science framework for long-horizon LLM agents built on four metrics: the Reliability Decay Curve (RDC), the Variance Amplification Factor (VAF), the Graceful Degradation Score (GDS), and the Meltdown Onset Point (MOP).
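
As a rough illustration of why single-attempt success on short tasks can mask unreliability on long ones, consider a toy model in which every step of a task succeeds independently with the same probability. This is a sketch under stated assumptions, not the definition of any of the four metrics: the independence assumption, the function names, and the example values are all hypothetical and chosen only for illustration.

```python
# Toy model (an assumption for illustration, not the framework's metrics):
# each step succeeds independently with probability `step_success`.

def single_attempt_success(step_success: float, num_steps: int) -> float:
    """Probability that one attempt completes all `num_steps` steps."""
    return step_success ** num_steps


def all_attempts_succeed(step_success: float, num_steps: int, attempts: int) -> float:
    """Probability that `attempts` independent attempts all succeed."""
    return single_attempt_success(step_success, num_steps) ** attempts


if __name__ == "__main__":
    step_success = 0.99  # hypothetical per-step success rate
    attempts = 5         # hypothetical number of repeated attempts
    for num_steps in (1, 10, 100):
        once = single_attempt_success(step_success, num_steps)
        repeated = all_attempts_succeed(step_success, num_steps, attempts)
        print(f"steps={num_steps:>3}  single attempt={once:.3f}  "
              f"all {attempts} attempts={repeated:.3f}")
```

Under these assumptions, the gap between single-attempt success and repeated-attempt success is small for a one-step task and widens as the number of steps grows, which is the kind of divergence a short-task pass@1 score cannot reveal.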

RSI Relevance: Safe and effective recursive self-improvement (RSI) depends on reliability. This framework provides tools to measure and monitor the stability of agents as they evolve, and to identify "meltdown" points where further recursive improvement could instead lead to catastrophic failure.