Expanding Exploration for LLM Agents via Retrieval-Augmented Policy Optimization
Abstract: Retrieval-Augmented Policy Optimization (RAPO) is an RL framework that introduces retrieval to explicitly expand exploration during training. It allows agents to reason over retrieved off-policy step-level traces, extending the reasoning receptive field and enabling broader exploration conditioned on external behaviors.
RSI Bench Relevance:
- Exploration Strategy: Solves the "Closed-Loop Trap" (where agents only learn from their own limited outputs) by injecting external reasoning perspectives.
- Efficiency: Achieves a 1.2x training speedup, a key metric for recursive systems.
- Stability: Uses retrieval-aware policy optimization to stabilize training during multi-step reasoning evolution.
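The core mechanism described above — conditioning the policy on retrieved off-policy, step-level traces — can be sketched in toy form. This is an illustrative assumption, not the paper's implementation: the `Trace` structure, the lexical-overlap retriever, and all names below are hypothetical stand-ins for RAPO's learned components.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    """A hypothetical off-policy step-level trace: state, action, reward."""
    state: str
    action: str
    reward: float

def similarity(a: str, b: str) -> float:
    """Toy word-overlap (Jaccard) score standing in for a learned retriever."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def retrieve(store: list[Trace], state: str, k: int = 2) -> list[Trace]:
    """Return the k stored traces most similar to the current state."""
    return sorted(store, key=lambda t: -similarity(t.state, state))[:k]

def augmented_context(state: str, store: list[Trace]) -> str:
    """Build the policy input: retrieved external steps, then the live state.

    Conditioning on external behaviors like this is what expands exploration
    beyond the agent's own rollouts (the "Closed-Loop Trap").
    """
    hints = retrieve(store, state)
    lines = [f"[hint] state={t.state!r} action={t.action!r} r={t.reward}"
             for t in hints]
    return "\n".join(lines + [f"[state] {state}"])

# Tiny demo store of off-policy traces from other agents.
store = [
    Trace("open the locked door", "use the brass key", 1.0),
    Trace("cross the river", "build a raft", 0.5),
    Trace("open the chest", "pry with crowbar", 0.2),
]
print(augmented_context("open the heavy locked door", store))
```

In a real RAPO-style loop, the augmented context would feed the LLM policy during rollout, and the optimizer would be retrieval-aware so that gradients account for which hints were injected; here only the retrieval-and-conditioning step is shown.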