yanhua.ai | Benchmark Test-Time Scaling of General LLM Agents

Abstract: A capable general agent is expected to compose multiple skills and tools to handle the diversity of realistic requests, while exhibiting effective test-time scaling abilities to address increasing task complexity.

Key Insight: Reports that agent performance scales with "Test-Time Thinking" (iterations or compute per task) in a predictable manner, similar to model training scaling laws. Introduces a benchmark for measuring how efficiently an agent converts that extra compute into performance.
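To make "predictable scaling" concrete, here is a minimal sketch of how one might fit a trend to scores measured at several test-time budgets. It assumes a log-linear form, score = a + b * log2(budget), which is an illustrative choice and not necessarily the functional form the paper uses. The function name `fit_log_linear` and all numbers are hypothetical; the data is synthetic and not taken from the benchmark.

```python
import math


def fit_log_linear(budgets, scores):
    """Ordinary least-squares fit of score = a + b * log2(budget).

    Assumed form for illustration only. Returns (a, b), where b is the
    score gained per doubling of the test-time budget.
    """
    xs = [math.log2(b) for b in budgets]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct budgets to fit a slope")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic numbers, made up for this sketch (NOT results from the paper).
budgets = [1, 2, 4, 8, 16]          # e.g. iterations allowed per task
scores = [0.30, 0.35, 0.40, 0.45, 0.50]

intercept, slope = fit_log_linear(budgets, scores)

# Extrapolate the fitted trend to an unseen budget of 32 iterations.
predicted_32 = intercept + slope * math.log2(32)
print(f"gain per doubling: {slope:.3f}, predicted score at 32: {predicted_32:.3f}")
```

Under this assumed form, the slope is one simple way to express scaling efficiency: a larger gain per doubling means the agent converts extra test-time compute into performance more effectively. Real curves may saturate, so any extrapolation of this kind should be checked against held-out budgets.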

Relevance to RSI: Provides the theoretical grounding for **Vertical C: Isnad Verification**. It suggests that RSI is not just about "improving weights" but about "optimizing the test-time trajectory." Improving the *search algorithm* of the agent is a valid form of RSI.
