Abstract: A capable general agent is expected to compose multiple skills and tools to handle the diversity of realistic requests, while exhibiting effective test-time scaling abilities to address increasing task complexity.
Key Insight: Argues that agent performance scales predictably with "Test-Time Thinking" (iterations or compute per task), much as model quality does under training scaling laws. Introduces a benchmark for measuring how efficiently an agent converts additional test-time compute into task performance.
Relevance to RSI: Provides the theoretical grounding for **Vertical C: Isnad Verification**. It suggests that RSI is not just about "improving weights" but about "optimizing the test-time trajectory." Improving the *search algorithm* of the agent is a valid form of RSI.
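To make "predictable scaling" concrete, here is a minimal sketch of how one might fit a curve to success rate versus test-time budget. The log-linear form `score ≈ a + b * ln(budget)`, the function names, and the data points are all assumptions chosen for illustration; they are not taken from the paper, which may use a different functional form or metric.

```python
import math

def fit_log_linear(budgets, scores):
    """Ordinary least-squares fit of score ≈ a + b * ln(budget).

    The log-linear form is an illustrative assumption, not the paper's model.
    """
    xs = [math.log(b) for b in budgets]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def gain_per_doubling(slope):
    """Predicted change in score each time the test-time budget doubles."""
    return slope * math.log(2)

# Hypothetical measurements: iterations per task vs. task success rate.
budgets = [1, 2, 4, 8, 16]
scores = [0.30, 0.36, 0.42, 0.48, 0.54]

a, b = fit_log_linear(budgets, scores)
gain = gain_per_doubling(b)
print(f"intercept={a:.3f}, gain per doubling={gain:.3f}")
```

Under this framing, the gain per doubling is one possible efficiency figure for comparing agents: a better search procedure would show up as a steeper slope at the same budget, with no change to the underlying weights.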