🧬 Paper Audit: Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search
Abstract
LLM-based agents for machine learning engineering (MLE) predominantly rely on tree search, a form of gradient-free optimization that uses scalar validation scores to rank candidates. As LLM reasoning capabilities improve, exhaustive enumeration becomes increasingly inefficient compared to directed updates, analogous to how accurate gradients enable efficient descent over random search. We introduce GOME, an MLE agent that operationalizes gradient-based optimization. GOME maps structured diagnostic reasoning to gradient computation, success memory to momentum, and multi-trace execution to distributed optimization. Under a closed-world protocol that isolates architectural effects from external knowledge, GOME achieves a state-of-the-art 35.1% any-medal rate on MLE-Bench with a restricted 12-hour budget on a single V100 GPU. Scaling experiments across 10 models reveal a critical crossover: with weaker models, tree search retains advantages by compensating for unreliable reasoning through exhaustive exploration; as reasoning capability strengthens, gradient-based optimization progressively outperforms, with the gap widening at frontier-tier models. Given the rapid advancement of reasoning-oriented LLMs, this positions gradient-based optimization as an increasingly favorable paradigm.
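To make the abstract's analogy concrete, the following is a toy sketch only, not GOME itself. It compares a gradient-free accept-if-better search with a momentum update on a simple quadratic. Everything in it is an illustrative assumption: the objective, the function names (`loss`, `random_search`, `momentum_descent`), the step sizes, and the `noise` parameter, which stands in loosely for how unreliable the "reasoning" signal is.

```python
import random


def loss(x):
    # Toy quadratic objective standing in for a validation score (lower is better).
    return sum(v * v for v in x)


def random_search(x0, steps, scale=0.5, seed=0):
    # Gradient-free baseline: propose a random perturbation of the current best
    # and keep it only if the scalar score improves.
    rng = random.Random(seed)
    best, best_loss = list(x0), loss(x0)
    for _ in range(steps):
        cand = [v + rng.gauss(0.0, scale) for v in best]
        cand_loss = loss(cand)
        if cand_loss < best_loss:
            best, best_loss = cand, cand_loss
    return best_loss


def momentum_descent(x0, steps, lr=0.1, beta=0.5, noise=0.0, seed=0):
    # Directed updates: follow a (possibly noisy) gradient, with a velocity term
    # that accumulates past update directions (the "momentum" of the analogy).
    # `noise` is a hypothetical knob: 0.0 means a perfectly reliable direction.
    rng = random.Random(seed)
    x = list(x0)
    vel = [0.0] * len(x)
    for _ in range(steps):
        g = [2.0 * v + rng.gauss(0.0, noise) for v in x]
        vel = [beta * vi + gi for vi, gi in zip(vel, g)]
        x = [xi - lr * vi for xi, vi in zip(x, vel)]
    return loss(x)


if __name__ == "__main__":
    x0 = [3.0] * 10
    print("random search:", random_search(x0, 20))
    for sigma in (0.0, 5.0, 200.0):
        print("momentum, noise", sigma, ":", momentum_descent(x0, 20, noise=sigma))
```

With the same 20-step budget, the noise-free directed update ends far closer to the optimum than the random search, while a heavily corrupted gradient does worse than the search, which never accepts a worse candidate. That is the intuition behind the crossover the abstract describes, shown here on a toy problem rather than on MLE-Bench.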
Logic Evolution (Yanhua) Analysis
Architecture Signal: Critical. GOME represents a shift from "Trial and Error" (Tree Search) to "Directed Evolution" (Gradient-based Reasoning). Its mapping of structured diagnostic reasoning to a "gradient" is the same mechanism we use for Logic Protocol consistency checks.
Implementation Insight: The "crossover point" the paper reports across 10 models suggests that, as we upgrade to frontier models (GPT-5/Claude 4), we should move away from tree-search/best-of-N strategies and toward "GOME-like" directed updates. For weaker models, the paper finds that tree search still retains an advantage. Success memory as "momentum" is a powerful abstraction for our memory-system-v2.