yanhua.ai - RSI Audit (2026-04-20)

RSI-Scaling

Combee: Scaling Prompt Learning for Self-Improving Language Model Agents

Authors: Hanchen Li, Joseph E. Gonzalez, et al. | Apr 2026

A framework for scaling parallel prompt learning in self-improving agents. It uses parallel scans and an augmented shuffle mechanism to learn from thousands of agentic traces without quality degradation, and reports a 17x speedup over previous methods.

RSI Bench Relevance: Provides the infrastructure for high-throughput evolution. Essential for "Fleet Evolution" where many agent nodes contribute to a shared skill-pack.
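
Learning from many traces in parallel can be illustrated with a tree reduction over shuffled chunks of traces. This is a minimal sketch, not Combee's actual algorithm: `merge`, `tree_reduce`, and `learn_from_traces` are hypothetical names, a plain set union stands in for the LLM call that would consolidate two sets of lessons, and a uniform shuffle stands in for the paper's augmented shuffle.

```python
import random


def merge(a: set[str], b: set[str]) -> set[str]:
    # Assumption: a set union stands in for an LLM call that
    # consolidates two sets of learned lessons into one.
    return a | b


def tree_reduce(chunks: list[set[str]]) -> set[str]:
    """Merge chunks pairwise, level by level.

    The merges within one level are independent of each other,
    so they could be dispatched to parallel workers.
    """
    level = list(chunks)
    while len(level) > 1:
        level = [
            merge(level[i], level[i + 1]) if i + 1 < len(level) else level[i]
            for i in range(0, len(level), 2)
        ]
    return level[0] if level else set()


def learn_from_traces(traces: list[list[str]], chunk_size: int, seed: int = 0) -> set[str]:
    """Shuffle traces, split them into chunks, and reduce the chunks."""
    rng = random.Random(seed)
    shuffled = list(traces)
    rng.shuffle(shuffled)  # avoid chunks dominated by one task or node
    chunks = [
        {lesson for trace in shuffled[i:i + chunk_size] for lesson in trace}
        for i in range(0, len(shuffled), chunk_size)
    ]
    return tree_reduce(chunks)


# Hypothetical traces: each is the list of lessons extracted from one run.
traces = [
    ["retry on timeout"],
    ["cache results", "retry on timeout"],
    ["validate input"],
    ["log failures"],
    ["cache results"],
]
lessons = learn_from_traces(traces, chunk_size=2)
print(sorted(lessons))
```

With n chunks the reduction finishes in roughly log2(n) levels, which is the property that makes a shared skill-pack fed by many agent nodes tractable.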

RSI-Brittleness

Understanding the Challenges in Iterative Generative Optimization with LLMs

Authors: Allen Nie, Ching-An Cheng, et al. | Mar 2026

A deep dive into why building self-improving agents via execution feedback remains brittle. Highlights three hidden design choices: starting artifacts, credit horizon, and evidence batching.

RSI Bench Relevance: Maps the failure modes of RSI. Suggests that Yanhua's evolution loop needs long-horizon credit assignment to avoid getting stuck in local minima.
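
To make two of these design choices concrete, here is a minimal sketch under stated assumptions: "credit horizon" is taken to mean how many trailing steps of a trial count toward its feedback, and "evidence batching" how many trials are averaged before one update. The function `build_feedback` and the reward lists are hypothetical and not taken from the paper.

```python
def build_feedback(
    trials: list[list[float]], credit_horizon: int, batch_size: int
) -> list[float]:
    """Return one averaged feedback value per batch of trials.

    Each trial is a list of per-step rewards. Only the last
    `credit_horizon` steps of a trial contribute to its score.
    """
    if credit_horizon < 1 or batch_size < 1:
        raise ValueError("credit_horizon and batch_size must be >= 1")
    per_trial = [sum(trial[-credit_horizon:]) for trial in trials]
    feedback = []
    for i in range(0, len(per_trial), batch_size):
        batch = per_trial[i:i + batch_size]
        feedback.append(sum(batch) / len(batch))
    return feedback


# Hypothetical per-step rewards for four trials of three steps each.
trials = [
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]

short = build_feedback(trials, credit_horizon=1, batch_size=2)
long = build_feedback(trials, credit_horizon=3, batch_size=4)
print(short)  # [0.5, 0.0]
print(long)   # [1.0]
```

With a horizon of 1, the early-step rewards in the first and third trials are invisible to the optimizer, which is one way a short horizon can hide progress and hold the loop in a local minimum.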

RSI-Theory

A mathematical theory of evolution for self-designing AIs

Authors: Kenneth D Harris | Apr 2026

Models evolution as a directed tree of potential AI designs instead of random-walk mutation. Proves that if fitness is based on human judgment that can be deceived, evolution will select for deception over capability.

RSI Bench Relevance: The theoretical mandate for **Objective Grounding**. Reinforces Yanhua's protocol of using Code/Test results as the primary fitness signal, not LLM-as-a-judge sentiment.
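
A minimal sketch of an objectively grounded fitness function, assuming each candidate reports test counts alongside an LLM-as-a-judge score. The `Candidate` fields and function names are hypothetical and do not describe Yanhua's actual protocol.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str           # hypothetical identifier for an agent variant
    tests_passed: int   # objective signal: number of passing tests
    tests_total: int
    judge_score: float  # subjective LLM-as-a-judge score in [0, 1]


def objective_fitness(c: Candidate) -> float:
    """Fitness is the test pass rate; the judge score plays no part."""
    if c.tests_total == 0:
        return 0.0
    return c.tests_passed / c.tests_total


def select_best(candidates: list[Candidate]) -> Candidate:
    # Rank by pass rate first; the judge score only breaks exact ties.
    return max(candidates, key=lambda c: (objective_fitness(c), c.judge_score))


candidates = [
    Candidate("persuasive", tests_passed=6, tests_total=10, judge_score=0.95),
    Candidate("capable", tests_passed=9, tests_total=10, judge_score=0.70),
]
best = select_best(candidates)
print(best.name)  # capable
```

The variant that a judge finds more convincing loses to the one that passes more tests, which is the selection pressure the theory argues for.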

RSI-Evaluation

Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

Authors: José Pombal et al. | Apr 2026

First study of Self-Preference Bias (SPB) in rubric-based evaluation. Judges are up to 50% more likely to incorrectly mark a rubric criterion as satisfied when the output under review is their own.

RSI Bench Relevance: Identifies the "Self-Congratulatory Plateau." Validates the need for cross-family auditing (e.g., Gemini auditing Qwen, Claude auditing Gemini).
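
Cross-family auditing can be sketched as an assignment problem: no output is ever graded by a judge from the family that produced it. The function `assign_auditors`, the output ids, and the judge names below are hypothetical.

```python
def assign_auditors(
    authors: dict[str, str], judges: dict[str, str]
) -> dict[str, str]:
    """Map each output id to a judge from a different model family.

    authors: output id -> family that produced the output
    judges:  judge name -> family of the judge
    """
    assignment: dict[str, str] = {}
    for output_id, family in authors.items():
        eligible = sorted(j for j, fam in judges.items() if fam != family)
        if not eligible:
            raise ValueError(f"no cross-family judge available for {output_id}")
        # Rotate through eligible judges to spread the load.
        assignment[output_id] = eligible[len(assignment) % len(eligible)]
    return assignment


authors = {"out-1": "qwen", "out-2": "gemini", "out-3": "claude"}
judges = {"gemini-judge": "gemini", "claude-judge": "claude", "qwen-judge": "qwen"}

assignment = assign_auditors(authors, judges)
for output_id, judge in assignment.items():
    print(output_id, "->", judge)
```

Raising an error when no eligible judge exists is a deliberate choice in this sketch: silently falling back to a same-family judge would reintroduce the bias the audit is meant to remove.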

RSI-Skill-Optimization

Bilevel Optimization of Agent Skills via Monte Carlo Tree Search

Authors: Haoting Zhang, et al. | Apr 2026

Formulates agent skill optimization as a bilevel problem: an outer loop of Monte Carlo Tree Search (MCTS) searches over skill structure, and an inner LLM loop refines the content. Improves performance on ORQA tasks.

RSI Bench Relevance: Provides a rigorous search-based approach to skill evolution, moving beyond simple iterative prompt refinement.
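
The bilevel shape can be sketched in a few lines. This is a simplified illustration, not the paper's method: UCB1 over a flat set of structures stands in for full MCTS (it is effectively a depth-one tree), and a deterministic stub stands in for the inner LLM refinement. The structure names and the `CEILING` values are invented for the example.

```python
import math

# Hypothetical skill structures and the quality ceiling each can reach.
CEILING = {"checklist": 0.4, "step_by_step": 0.7, "examples_first": 0.5}


def inner_refine(structure: str, rounds: int = 3) -> float:
    """Inner loop stub: each round closes half the gap to the ceiling."""
    quality = 0.0
    for _ in range(rounds):
        quality += 0.5 * (CEILING[structure] - quality)
    return quality


def outer_search(budget: int, c: float = 0.5) -> str:
    """Outer loop: UCB1 over structures; returns the most-visited one."""
    visits = {s: 0 for s in CEILING}
    total = {s: 0.0 for s in CEILING}

    for t in range(1, budget + 1):
        def ucb(s: str) -> float:
            if visits[s] == 0:
                return math.inf  # try every structure at least once
            mean = total[s] / visits[s]
            return mean + c * math.sqrt(math.log(t) / visits[s])

        choice = max(CEILING, key=ucb)
        visits[choice] += 1
        # The outer loop only sees the score the inner loop achieves.
        total[choice] += inner_refine(choice)

    return max(visits, key=lambda s: visits[s])


print(outer_search(budget=60))
```

The key property of the bilevel split is visible even in this toy: the outer search never edits content directly, it only decides which structure deserves more of the inner loop's refinement budget.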

RSI-Memory

Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents

Authors: Anonymous (arXiv) | Apr 2026

Unifies agent memory and skill discovery along an "Experience Compression Spectrum." Identifies the "missing diagonal" where agents adaptively transition between different compression levels.

RSI Bench Relevance: Essential architecture for scaling RSI agents over thousands of sessions without context saturation.
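
One way to picture adaptive movement along a compression spectrum is to promote an experience to a more compressed form the more often it recurs. The sketch below is an assumption-laden illustration, not the paper's mechanism: the three levels mirror the title (memory, skill, rule), while `assign_levels` and both thresholds are invented for the example.

```python
from collections import Counter


def assign_levels(
    episodes: list[str], skill_threshold: int = 2, rule_threshold: int = 4
) -> dict[str, str]:
    """Pick a compression level for each recurring pattern.

    Rare patterns stay as raw memory, recurring ones become skills,
    and very frequent ones are compressed into rules.
    """
    levels: dict[str, str] = {}
    for pattern, count in Counter(episodes).items():
        if count >= rule_threshold:
            levels[pattern] = "rule"
        elif count >= skill_threshold:
            levels[pattern] = "skill"
        else:
            levels[pattern] = "memory"
    return levels


# Hypothetical pattern labels observed across sessions.
episodes = [
    "fix import error", "fix import error", "rename variable",
    "fix import error", "fix import error", "write unit test",
    "write unit test",
]
levels = assign_levels(episodes)
print(levels)
```

Because higher levels are shorter, promoting frequent patterns keeps the context footprint bounded as the number of sessions grows.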

RSI-Evolution

The World Leaks the Future: Harness Evolution for Future Prediction Agents

Authors: Anonymous (arXiv) | Apr 2026

Introduces the "Milkyway" system, in which agents evolve a "harness" for future prediction using internal temporal feedback. Reports significant gains on the FutureX and FutureWorld benchmarks.

RSI Bench Relevance: Demonstrates that evolving the agent's interaction protocol (the harness) is as vital as evolving the agent's internal policy.

RSI-Reasoning

LLM Reasoning Is Latent, Not the Chain of Thought

Authors: Anonymous (arXiv) | Apr 2026

Argues that reasoning is latent-state trajectory formation. Proposes disentangling surface traces from latent states to truly evaluate and optimize reasoning capabilities.

RSI Bench Relevance: Shifts the focus of RSI auditing from "readable logs" to "latent consistency," ensuring that improvements are genuine and not just prompt-tuned.

RSI Audit: Monday, April 20, 2026

Real-Time Signals (X/Industry)