Experience is the Best Teacher: Motivating Effective Exploration in Reinforcement Learning for LLMs

ArXiv ID: 2603.20046

Summary: Reinforcement Learning (RL) with rubric-based rewards has significantly enhanced the general reasoning capabilities of LLMs, but it still suffers from ineffective exploration. HeRL is a Hindsight-experience-guided Reinforcement Learning framework that bootstraps effective exploration by explicitly telling the LLM which desired behaviors the rewards specify. HeRL treats failed trajectories as hindsight experience and uses them as in-context guidance, so the policy can explore beyond its current distribution.
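The idea of turning a failed trajectory into in-context guidance can be illustrated with a minimal sketch. It is not the paper's implementation: the summary gives no algorithmic details, so the names `RubricItem`, `score_response`, and `build_hindsight_prompt`, the equal weighting of rubric items, and the prompt wording are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RubricItem:
    """One desired behavior plus a check for it (hypothetical structure)."""
    description: str
    check: Callable[[str], bool]


def score_response(response: str, rubric: List[RubricItem]) -> Tuple[float, List[str]]:
    """Return the fraction of rubric items satisfied and the unmet descriptions.

    Equal weighting of items is an assumption made for this sketch.
    """
    if not rubric:
        raise ValueError("rubric must contain at least one item")
    unmet = [item.description for item in rubric if not item.check(response)]
    reward = 1.0 - len(unmet) / len(rubric)
    return reward, unmet


def build_hindsight_prompt(prompt: str, failed_response: str, unmet: List[str]) -> str:
    """Wrap the original prompt with the failed attempt and the criteria it missed."""
    if not unmet:
        # Nothing failed, so there is no hindsight to add.
        return prompt
    criteria = "\n".join(f"- {description}" for description in unmet)
    return (
        f"{prompt}\n\n"
        f"A previous attempt:\n{failed_response}\n\n"
        f"It did not satisfy these criteria:\n{criteria}\n\n"
        "Write an improved answer."
    )


# Toy rubric: string checks stand in for whatever grader produces the reward.
rubric = [
    RubricItem("states the final answer 4", lambda r: "4" in r),
    RubricItem("explains the reasoning", lambda r: "because" in r.lower()),
]
failed = "The answer is 4."
reward, unmet = score_response(failed, rubric)
guided_prompt = build_hindsight_prompt("What is 2 + 2?", failed, unmet)
```

In this sketch, a policy would be sampled again on `guided_prompt`, so the unmet criteria steer the next rollout toward behaviors it did not produce on its own. How such guided rollouts enter the RL update is not described in the summary and is left out here.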
