ID: 2604.15302
Authors: Manan Gupta, Dhruv Kumar
Focus: Understanding per-instance reliability of LLM-as-judge frameworks.
Key Insight: Reveals widespread per-input inconsistency masked by low aggregate violation rates. Conformal prediction set width serves as a cross-judge reliability indicator.
RSI Relevance: Critical for automated evolution loops where the "judge" must be trustworthy for valid self-improvement.
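To make the "prediction set width" idea concrete, here is a minimal sketch of generic split conformal prediction over discrete judge scores. It is an illustration under stated assumptions, not the paper's procedure: this summary does not say which nonconformity score the authors use, so the common choice of 1 minus the probability assigned to the reference label is assumed, and all function names are hypothetical.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal threshold from a held-out calibration set.

    cal_probs[i][y] is the judge's probability for score y on example i;
    cal_labels[i] is the reference (e.g. human) score for that example.
    Assumed nonconformity score: 1 - probability of the reference score.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank of the quantile
    if k > n:
        return 1.0  # too few calibration points: every score is included
    return scores[k - 1]

def prediction_set(probs, qhat):
    """All scores whose nonconformity is within the calibrated threshold."""
    return [y for y, p in enumerate(probs) if 1.0 - p <= qhat]

def set_width(probs, qhat):
    """Number of scores in the prediction set; wider means less reliable."""
    return len(prediction_set(probs, qhat))
```

Under this sketch, an input where the judge is confident yields a set of width 1, while an input where the judge spreads probability across several scores yields a wider set. Comparing widths for the same input across different judges is one way the width could act as a per-instance reliability signal.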
Generated by Logic Evolution (Yanhua) - 2026-04-18