yanhua.ai - RSI Evening Audit

TTVS: Boosting Self-Exploring Reinforcement Learning via Test-time Variational Synthesis

Introduces a framework that enables Large Reasoning Models (LRMs) to self-evolve by dynamically augmenting the training stream from unlabeled test queries. It uses Online Variational Synthesis to force the model to learn underlying logic rather than superficial patterns.

RSI Bench Relevance: Validates Vertical A (Logic Insurgency) by showing that self-synthesized variations can replace high-cost labeled data for policy improvement.

Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

Anonymous (arXiv) | 2604.08224v1

Argues that agent progress depends on "external cognitive infrastructure." Proposes "self-evolving harnesses" where the runtime around the model adapts to transform hard cognitive burdens into simpler, solvable forms.

RSI Bench Relevance: Directly supports the yanhua.ai vision of agents that build their own tools and harnesses (Vertical B: Skill Synthesis).

Polaris: A Gödel Agent Framework for Small Language Models through Experience-Abstracted Policy Repair

Kakade et al. | 2603.23129v1

Implements a Gödel agent loop for 7B models. The agent inspects its policy, explains errors, and generates auditable code patches that repair the policy. Uses "experience abstraction" to transfer strategies to unseen instances.

RSI Bench Relevance: A practical realization of the "Recursion Singularity" for SLMs. Suggests that small, auditable patches are a workable route to persistent self-improvement.

SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

Ziyu Ma et al. | 2604.08377v1

Presents a framework for collective skill evolution in multi-user agent ecosystems like OpenClaw. It treats cross-user interactions as the primary signal for refining existing skills and synthesizing new ones, propagating improvements system-wide.

RSI Bench Relevance: Critical for OpenClaw. Validates the "Collective Intelligence" aspect of RSI where agents learn from a distributed population of users and peers.

ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?

Lishui Fan et al. | 2604.07864v1

A fully label-free co-evolutionary framework (ZeroCoder) that jointly trains a Coder and a Tester using execution feedback. It uses a Bayesian selector (DyB4) to handle selector drift, achieving significant gains without oracle supervision.

RSI Bench Relevance: Demonstrates "Closure" in the RSI loop for coding: improvement without a human in the loop supplying ground truth.
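
To make the Coder/Tester execution-feedback idea in the ZeroCoder entry concrete, here is a minimal sketch of a cross-execution matrix. It is an illustration under stated assumptions, not the paper's method: the names `passes` and `execution_rewards` are hypothetical, solutions are plain Python callables, and raw pass rates stand in for the Bayesian selector (DyB4), whose details are not given in this digest.

```python
from typing import Callable, List, Sequence, Tuple

Solution = Callable[[int], int]
TestCase = Tuple[int, int]  # (input, expected output) proposed by the tester


def passes(solution: Solution, test: TestCase) -> bool:
    """Run one candidate on one generated test; any exception counts as a failure."""
    x, expected = test
    try:
        return solution(x) == expected
    except Exception:
        return False


def execution_rewards(
    solutions: Sequence[Solution], tests: Sequence[TestCase]
) -> Tuple[List[float], List[float]]:
    """Return (pass rate per solution, pass rate per test) from cross-execution.

    Assumption: plain pass rates are used as rewards. This is NOT DyB4.
    """
    if not solutions or not tests:
        raise ValueError("need at least one solution and one test")
    matrix = [[passes(s, t) for t in tests] for s in solutions]
    solution_rewards = [sum(row) / len(tests) for row in matrix]
    test_rewards = [
        sum(row[j] for row in matrix) / len(solutions) for j in range(len(tests))
    ]
    return solution_rewards, test_rewards
```

With no oracle, a wrong generated test (for example one that expects `9` for input `3` when the task is doubling) still rewards a wrong solution that happens to satisfy it. A plain pass-rate selector can therefore drift, which is presumably the failure mode a more careful selector is meant to handle.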

Reason in Chains, Learn in Trees: Self-Rectification for Agent Policy Optimization (T-STAR)

Yu Li et al. | 2604.07165v1

Introduces T-STAR, a framework that consolidates multi-step trajectories into a "Cognitive Tree" to identify critical steps. It enables back-propagation of rewards and "In-Context Thought Grafting" for surgical policy optimization.

RSI Bench Relevance: Breakthrough in credit assignment for long-horizon agent tasks. Essential for stable recursive improvement in complex environments.
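
The tree consolidation described in the T-STAR entry can be sketched as a prefix tree over trajectory steps, with each trajectory's final reward backed up to every node it passes through. This is a simplified illustration, not the paper's algorithm: `Node`, `build_tree`, and `most_critical_branch` are hypothetical names, steps are plain strings, and "critical" is assumed to mean the branch point whose children differ most in mean reward.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple


@dataclass
class Node:
    visits: int = 0
    total_reward: float = 0.0
    children: Dict[str, "Node"] = field(default_factory=dict)

    @property
    def value(self) -> float:
        """Mean final reward of all trajectories passing through this node."""
        return self.total_reward / self.visits if self.visits else 0.0


def build_tree(trajectories: Sequence[Tuple[Sequence[str], float]]) -> Node:
    """Merge (steps, final_reward) trajectories that share a prefix into one tree."""
    root = Node()
    for steps, reward in trajectories:
        node = root
        node.visits += 1
        node.total_reward += reward
        for step in steps:
            node = node.children.setdefault(step, Node())
            node.visits += 1
            node.total_reward += reward  # back up the final reward along the path
    return root


def most_critical_branch(root: Node) -> Tuple[List[str], float]:
    """Return the prefix whose child values diverge most, and the size of that gap."""
    best_prefix: List[str] = []
    best_gap = 0.0
    stack: List[Tuple[List[str], Node]] = [([], root)]
    while stack:
        prefix, node = stack.pop()
        if len(node.children) >= 2:
            values = [child.value for child in node.children.values()]
            gap = max(values) - min(values)
            if gap > best_gap:
                best_prefix, best_gap = prefix, gap
        for step, child in node.children.items():
            stack.append((prefix + [step], child))
    return best_prefix, best_gap
```

For three trajectories `plan, search, answer` (reward 1.0), `plan, search, verify` (reward 1.0), and `plan, guess, answer` (reward 0.0), the largest gap sits right after `plan`: the choice between `search` and `guess` is where credit should concentrate.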

Don't Overthink It: Inter-Rollout Action Agreement (TrACE)

Khushal Sethi et al. | 2604.08369v1

Proposes TrACE, a training-free controller that allocates LLM calls adaptively by measuring consistency across small rollout samples. High agreement signals "easy" steps, while low agreement triggers more compute.

RSI Bench Relevance: Practical "Efficiency" layer for RSI. Suggests that model self-consistency can serve as a usable signal for dynamic compute allocation.
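
The agreement-based controller described in the TrACE entry can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors' implementation: `sample_action` is a hypothetical callable standing in for one LLM rollout, and the defaults (`initial_rollouts=3`, `max_rollouts=9`, `threshold=0.8`) are made up for the example.

```python
from collections import Counter
from typing import Callable, List, Tuple


def agreement(actions: List[str]) -> float:
    """Fraction of sampled rollouts that picked the most common action."""
    if not actions:
        raise ValueError("need at least one sampled action")
    return Counter(actions).most_common(1)[0][1] / len(actions)


def choose_action(
    sample_action: Callable[[], str],
    initial_rollouts: int = 3,
    max_rollouts: int = 9,
    threshold: float = 0.8,
) -> Tuple[str, int]:
    """Sample a few rollouts, then add more only while agreement stays low.

    Returns the majority action and the number of model calls spent.
    """
    actions = [sample_action() for _ in range(initial_rollouts)]
    while agreement(actions) < threshold and len(actions) < max_rollouts:
        actions.append(sample_action())
    return Counter(actions).most_common(1)[0][0], len(actions)
```

A step where every rollout proposes the same action stops after the initial three calls. A contested step keeps sampling until agreement reaches the threshold or the budget is exhausted, so compute goes where the model is least consistent.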

3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience

Hongcan Xiao et al. | 2604.08042v1

Leverages "relative experience optimization" (adapted from GRPO) to iteratively refine 3D drawing knowledge without parameter updates. Uses pairwise comparisons based on perceptual rewards to self-improve spatial understanding.

RSI Bench Relevance: Extends RSI into the "Spatial/Visual Reasoning" domain. Shows that "Relative Ranking" is a powerful signal for self-evolution in complex multimodal tasks.
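
The relative-ranking signal in the 3DrawAgent entry can be illustrated with GRPO-style group normalization: each candidate's reward is scored relative to the mean and spread of its own group. This is a sketch under assumptions, not the paper's procedure: `group_relative_advantages` and `contrastive_pair` are hypothetical names, the rewards are placeholder floats rather than real perceptual scores, and the use of the population standard deviation is a choice made for this example.

```python
from statistics import mean, pstdev
from typing import List, Optional, Sequence, Tuple


def group_relative_advantages(rewards: Sequence[float]) -> List[float]:
    """Standardize rewards within one group: (r - group mean) / group std."""
    if not rewards:
        raise ValueError("need at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)  # identical rewards carry no ranking signal
    return [(r - mu) / sigma for r in rewards]


def contrastive_pair(
    candidates: Sequence[str], rewards: Sequence[float]
) -> Optional[Tuple[str, str]]:
    """Return (best, worst) candidate of the group, or None if rewards all tie."""
    if len(candidates) != len(rewards):
        raise ValueError("candidates and rewards must have the same length")
    advantages = group_relative_advantages(rewards)
    if all(a == 0.0 for a in advantages):
        return None
    best = max(range(len(candidates)), key=advantages.__getitem__)
    worst = min(range(len(candidates)), key=advantages.__getitem__)
    return candidates[best], candidates[worst]
```

In a setting without parameter updates, such a (best, worst) pair would be stored as textual experience and fed back through the prompt rather than used to compute a gradient. How that memory is actually written is not specified in this digest.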

X/Twitter Signal Monitoring

Real-time Signals | 2026-04-10

Anthropic RSI Signal: Researchers (Tess Hegarty) are pointing to Recursive Self-Improvement as a primary metric for approaching breakthroughs.
OpenReview Leaks: Reports (Yuandong Tian) of data leakage caused by website bugs, leading to early exposure of frontier model architectures and RSI benchmarks.
DeepMind Aletheia: Continued chatter about Aletheia proving "Reasoning Singularity" as an engineering reality in mathematical discovery.

Signal Relevance: Market sentiment and leak patterns suggest that major labs (Anthropic, DeepMind) are converging on RSI as the "Next Frontier" (2026-2027).

RSI Evening Paper Audit (2026-04-10)