Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

arXiv: 2602.23008 | Feb 2026 | RSIRLMemory

Abstract: Exploration remains the key bottleneck for large language model agents trained with reinforcement learning. While prior methods exploit pretrained knowledge, they fail in environments that require discovering genuinely novel states. We propose Exploratory Memory-Augmented On- and Off-Policy Optimization (EMPO2), a hybrid RL framework that leverages memory to drive exploration and combines on- and off-policy updates.

Key Insight: Addresses the novel-state-discovery bottleneck in agentic systems by treating the agent's memory as a dynamic exploration buffer. The hybrid optimization lets the agent learn from its own fresh, successful trajectories (on-policy) while simultaneously mining historical memory for missed opportunities (off-policy).
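The paper's exact algorithm is not reproduced in this summary, but the two ideas above can be sketched minimally in Python. This is an illustrative sketch under assumptions, not the authors' implementation: a count-based novelty bonus stands in for the memory-driven exploration signal, and the hybrid step is modeled as a simple convex mix of on- and off-policy gradient estimates. All names (`ExplorationMemory`, `hybrid_update`, `mix`) are hypothetical.

```python
import random


class ExplorationMemory:
    """Hypothetical memory buffer: stores past trajectories for off-policy
    replay and tracks state visit counts for a count-based novelty bonus."""

    def __init__(self):
        self.trajectories = []   # each trajectory: list of (state, action, reward)
        self.visit_counts = {}

    def add(self, trajectory):
        self.trajectories.append(trajectory)
        for state, _action, _reward in trajectory:
            self.visit_counts[state] = self.visit_counts.get(state, 0) + 1

    def novelty_bonus(self, state):
        # Rarely visited states earn a larger intrinsic reward,
        # steering the agent toward undiscovered parts of the state space.
        return 1.0 / (1.0 + self.visit_counts.get(state, 0))

    def sample(self, k):
        # Mine historical memory for off-policy learning.
        return random.sample(self.trajectories, min(k, len(self.trajectories)))


def hybrid_update(on_policy_grads, off_policy_grads, mix=0.5):
    """Convex combination of on-policy (fresh rollouts) and off-policy
    (replayed-from-memory) gradient estimates."""
    return [mix * g_on + (1.0 - mix) * g_off
            for g_on, g_off in zip(on_policy_grads, off_policy_grads)]


mem = ExplorationMemory()
mem.add([("s0", "a", 1.0), ("s1", "b", 0.0)])
print(mem.novelty_bonus("s0"))      # seen once -> 0.5
print(mem.novelty_bonus("s_new"))   # unseen -> 1.0
print(hybrid_update([1.0], [3.0]))  # equal mix -> [2.0]
```

The convex `mix` parameter makes the exploration/exploitation trade-off explicit: `mix=1.0` reduces to pure on-policy learning, while lower values weight the replayed memory more heavily.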

Relevance to RSI: Provides a robust mechanism for agents to "break out" of local optima in self-improvement cycles, specifically when the solution space requires exploring paths not covered by the model's initial training data.
