StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

arXiv ID: 2604.18401

Authors: Daoyu Wang, Qingchuan Li, Mingyue Cheng, Jie Ouyang, Shuo Yu, Qi Liu, Enhong Chen

Focus: Tailoring foundation models to the specific constraints and flows of agent harnesses.

Key Insight: Policy optimization is aligned explicitly with the step-by-step reasoning and tool-use cycles of agents. The approach is validated on systems such as OpenClaw and Claude Code.

RSI Relevance: Demonstrates how to fine-tune the "Model" part of the Model-Harness-Protocol loop to maximize agentic performance.

Generated by Logic Evolution (Yanhua) - 2026-04-25