Diagnosing LLM Judge Reliability

ID: 2604.15302

Authors: Manan Gupta, Dhruv Kumar

Focus: Understanding per-instance reliability of LLM-as-judge frameworks.

Key Insight: Low aggregate transitivity-violation rates mask widespread per-input inconsistency. The width of a conformal prediction set serves as a cross-judge reliability indicator.

RSI Relevance: Critical for automated evolution loops where the "judge" must be trustworthy for valid self-improvement.
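Illustration: the gap between aggregate and per-input violation rates noted under Key Insight can be seen in a minimal sketch. Everything here is an assumption for illustration, not the paper's method or data: the toy judgments, the dictionary layout, and the helper names is_cyclic and violation_stats are all hypothetical. A triple of candidates counts as a violation when the judge's three pairwise preferences form a cycle.

```python
from itertools import combinations

def is_cyclic(prefers, a, b, c):
    """Return True if the pairwise preferences over (a, b, c) form a cycle.

    `prefers[(x, y)]` is True when the judge prefers x over y (hypothetical layout;
    keys are alphabetically ordered pairs).
    """
    wins = {a: 0, b: 0, c: 0}
    for x, y in ((a, b), (b, c), (a, c)):
        winner = x if prefers[(x, y)] else y
        wins[winner] += 1
    # With strict preferences, a triple is intransitive exactly when each item wins once.
    return all(count == 1 for count in wins.values())

def violation_stats(judgments):
    """Return (share of violating triples, share of inputs with any violation)."""
    total_triples = 0
    violating_triples = 0
    inputs_with_violation = 0
    for prefers in judgments.values():
        items = sorted({item for pair in prefers for item in pair})
        bad = 0
        for a, b, c in combinations(items, 3):
            total_triples += 1
            if is_cyclic(prefers, a, b, c):
                bad += 1
        violating_triples += bad
        if bad > 0:
            inputs_with_violation += 1
    return violating_triples / total_triples, inputs_with_violation / len(judgments)

# Toy data (made up): two inputs, four candidate responses each.
judgments = {
    # q1: fully consistent ordering A > B > C > D.
    "q1": {("A", "B"): True, ("A", "C"): True, ("A", "D"): True,
           ("B", "C"): True, ("B", "D"): True, ("C", "D"): True},
    # q2: A > B and B > C, but C > A, which is a cycle.
    "q2": {("A", "B"): True, ("B", "C"): True, ("A", "C"): False,
           ("A", "D"): True, ("B", "D"): True, ("C", "D"): True},
}

triple_rate, input_rate = violation_stats(judgments)
print(f"aggregate triple violation rate: {triple_rate:.3f}")   # 0.125
print(f"inputs with at least one violation: {input_rate:.3f}")  # 0.500
```

In this toy case only one of eight triples is cyclic, yet half of the inputs are affected, which is why a per-input view can look much worse than the aggregate rate.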


Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations