Measuring the Unmeasurable: Markov Chain Reliability for LLM Agents

Abstract: Large language model (LLM) agents increasingly operate as sequential software systems, but their reliability is often summarized by scalar benchmark metrics. Metrics such as pass$@k$, pass$^k$, and the reliability decay curve (RDC) are useful summaries, but they do not identify the success-time distribution being estimated, test whether traces support that distribution, or quantify finite-trace uncertainty. We present TraceToChain, a reproducible pipeline that fits agent execution traces to an absorbing discrete-time Markov chain (DTMC) with explicit diagnostics and uncertainty. The pipeline builds an automatic cluster taxonomy, estimates transitions with Laplace-smoothed maximum-likelihood estimation (MLE), checks fit with a composite Akaike information criterion (AIC) and Kolmogorov--Smirnov (KS) goodness-of-fit certificate, and reports Dirichlet-posterior credible intervals and non-parametric bootstrap intervals.
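
As a rough illustration of the estimation step described above, the following minimal sketch fits a Laplace-smoothed transition matrix from traces and propagates it to obtain a success-time distribution. It is not the TraceToChain implementation. The state names, the toy traces, and the helpers `fit_transitions` and `success_cdf` are hypothetical, and the states are assumed to be already labeled (the clustering step is skipped).

```python
from collections import Counter

def fit_transitions(traces, states, absorbing, alpha=1.0):
    """Laplace-smoothed MLE of a row-stochastic transition matrix.

    Assumption: every transient state may reach every state, so the
    pseudo-count `alpha` is added to all entries of a transient row.
    """
    counts = Counter()
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            counts[(src, dst)] += 1

    P = {}
    for s in states:
        if s in absorbing:
            # Absorbing states keep all probability mass on themselves.
            P[s] = {t: 1.0 if t == s else 0.0 for t in states}
            continue
        total = sum(counts[(s, t)] for t in states) + alpha * len(states)
        P[s] = {t: (counts[(s, t)] + alpha) / total for t in states}
    return P

def success_cdf(P, start, success, horizon):
    """Return P(absorbed in `success` within n steps) for n = 0..horizon."""
    dist = {s: 1.0 if s == start else 0.0 for s in P}
    cdf = [dist[success]]
    for _ in range(horizon):
        nxt = {t: 0.0 for t in P}
        for s, mass in dist.items():
            for t, p in P[s].items():
                nxt[t] += mass * p
        dist = nxt
        cdf.append(dist[success])
    return cdf

# Hypothetical toy data: three labeled agent traces.
states = ["plan", "act", "success", "fail"]
absorbing = {"success", "fail"}
traces = [
    ["plan", "act", "success"],
    ["plan", "act", "plan", "act", "fail"],
    ["plan", "act", "act", "success"],
]

P = fit_transitions(traces, states, absorbing)
cdf = success_cdf(P, start="plan", success="success", horizon=20)
```

With `alpha = 1`, each smoothed row coincides with the posterior mean under a symmetric Dirichlet prior with unit concentration, which is one way to connect the point estimate to Dirichlet-posterior credible intervals. The list `cdf` is the cumulative success-time distribution implied by the fitted chain, which could then be compared against the empirical success times from the traces.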

RSI Relevance: Recursive self-improvement (RSI) requires precise measurement of reliability to confirm that each iteration is an actual improvement. The Markov chain approach provides a more rigorous framework than scalar benchmark metrics for evaluating agent performance over long horizons, which is essential for stable RSI loops.