ArXiv ID: 2603.20046
Summary: Reinforcement learning (RL) with rubric-based rewards has significantly enhanced the general reasoning capabilities of LLMs, but it still suffers from ineffective exploration. HeRL is a Hindsight-experience-guided Reinforcement Learning framework that bootstraps effective exploration by explicitly telling LLMs the desired behaviors specified in the rewards. HeRL treats failed trajectories as hindsight experience and uses them as in-context guidance, helping the policy explore beyond its current distribution.