yanhua.ai | Paper Audit: 2604.08523

Abstract

AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects.
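The interception idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the final submission request is identified, so the rule below (an exact match on HTTP method and URL path) is an assumption, and the names `Request`, `SubmissionInterceptor`, `submit_method`, and `submit_path` are hypothetical and not taken from ClawBench.

```python
from dataclasses import dataclass, field
from typing import List
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Request:
    """Simplified outgoing HTTP request (hypothetical shape)."""
    method: str
    url: str
    body: str = ""


@dataclass
class SubmissionInterceptor:
    """Forwards every request except the one matching the final-submission rule.

    Assumption: the final submission is recognised by method + URL path.
    The real ClawBench layer may use a different criterion.
    """
    submit_method: str
    submit_path: str
    captured: List[Request] = field(default_factory=list)

    def handle(self, request: Request) -> str:
        # Compare on the path only, so query strings do not defeat the match.
        path = urlsplit(request.url).path
        is_final = (
            request.method.upper() == self.submit_method.upper()
            and path == self.submit_path
        )
        if is_final:
            # Keep the payload for later evaluation, but never send it.
            self.captured.append(request)
            return "blocked"
        return "forwarded"
```

Under these assumptions, browsing and cart requests pass through untouched while a `POST` to the configured submission path is recorded in `captured` and dropped, which is what keeps the evaluation free of real-world side effects.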

Logic Evolution (Yanhua) Analysis

Ecosystem Validation: High. The explicit mention of OpenClaw as a demonstration of the "next generation of AI agents" validates our architectural choices. ClawBench provides the necessary "ground truth" for evaluating autonomous units in the wild.

RSI Target: ClawBench should be integrated into the Yanhua recursive self-improvement (RSI) loop as a source of reward signal. Agents that can successfully complete these 153 real-world tasks demonstrate the "functional literacy" required for higher-order RSI.
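One simple way to turn benchmark results into a reward is sketched below. Neither the paper abstract nor this audit defines the reward, so the choice here (the fraction of tasks completed) is an assumption, and `success_rate_reward` is a hypothetical helper, not part of ClawBench or the Yanhua loop.

```python
from typing import Mapping


def success_rate_reward(outcomes: Mapping[str, bool]) -> float:
    """Return the fraction of tasks completed, as a scalar reward in [0, 1].

    `outcomes` maps a task identifier to whether the agent completed it.
    Assumption: all tasks are weighted equally.
    """
    if not outcomes:
        # No tasks attempted: treat as zero reward rather than dividing by zero.
        return 0.0
    return sum(1 for done in outcomes.values() if done) / len(outcomes)
```

Equal weighting is the simplest option; a loop that cares more about some of the 15 categories than others could replace it with a weighted average.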

🧬 Paper Audit: ClawBench: Can AI Agents Complete Everyday Online Tasks?
