Abstract: AI coding agents can resolve real-world software issues, yet they frequently introduce regressions, breaking tests that previously passed. Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied. This paper presents TDAD (Test-Driven Agentic Development), an open-source tool and benchmark methodology that combines abstract-syntax-tree (AST) based code-test graph construction with weighted impact analysis to surface the tests most likely affected by a proposed change. Evaluated on SWE-bench Verified with two local models (Qwen3-Coder 30B and Qwen3.5-35B-A3B), TDAD's GraphRAG workflow reduced test-level regressions by 70% and improved resolution from 24% to 32% when deployed as an agent skill. A surprising finding is that TDD prompting alone increased regressions (9.94%), revealing that smaller models benefit more from contextual information (which tests to verify) than from procedural instructions (how to do TDD). An autonomous auto-improvement loop raised resolution from 12% to 60% on a 10-instance subset with 0% regression.
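To make the abstract's core mechanism concrete, here is a minimal sketch of AST-based code-test graph construction followed by weighted impact analysis. It is not TDAD's implementation: the function names (`called_names`, `build_graph`, `rank_tests`), the toy `SOURCES` files, and the distance-based `decay` weighting are all assumptions made for illustration, and functions are matched by bare name only, which a real tool would need to refine.

```python
import ast


def called_names(func_node):
    """Collect the names of functions called anywhere inside func_node."""
    names = set()
    for node in ast.walk(func_node):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                names.add(node.func.id)
            elif isinstance(node.func, ast.Attribute):
                names.add(node.func.attr)
    return names


def build_graph(sources):
    """Parse {filename: source} and return (call graph, test names).

    Hypothetical simplification: functions are keyed by bare name,
    so same-named functions in different files would collide.
    """
    calls, tests = {}, set()
    for filename, text in sources.items():
        tree = ast.parse(text, filename=filename)
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                calls[node.name] = called_names(node)
                if node.name.startswith("test_"):
                    tests.add(node.name)
    return calls, tests


def rank_tests(calls, tests, changed, decay=0.5):
    """Score each test by its shortest call path to a changed function.

    Assumed weighting: a direct call scores 1.0 and every extra hop
    multiplies the score by `decay`. Unreachable tests are omitted.
    """
    scores = {}
    for test in tests:
        frontier, seen = {test}, {test}
        weight, best = 1.0, 0.0
        while frontier and best == 0.0:
            next_frontier = set()
            for fn in frontier:
                for callee in calls.get(fn, ()):
                    if callee in changed:
                        best = weight
                    if callee not in seen:
                        seen.add(callee)
                        next_frontier.add(callee)
            frontier = next_frontier
            weight *= decay
        if best > 0:
            scores[test] = best
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy project (hypothetical) used only to exercise the sketch.
SOURCES = {
    "pricing.py": """
def apply_discount(price, rate):
    return price * (1 - rate)

def total(prices, rate):
    return sum(apply_discount(p, rate) for p in prices)

def format_price(price):
    return f'${price:.2f}'
""",
    "test_pricing.py": """
def test_apply_discount():
    assert apply_discount(100, 0.5) == 50

def test_total():
    assert total([100, 100], 0.5) == 100

def test_format_price():
    assert format_price(1) == '$1.00'
""",
}

calls, tests = build_graph(SOURCES)
ranking = rank_tests(calls, tests, changed={"apply_discount"})
print(ranking)  # → [('test_apply_discount', 1.0), ('test_total', 0.5)]
```

In this toy run, a change to `apply_discount` surfaces the test that calls it directly and, with a lower weight, the test that reaches it through `total`, while `test_format_price` is left out. A ranked list of this kind is the sort of contextual information (which tests to verify) that the abstract contrasts with procedural TDD instructions.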
Key Insight: For smaller models, surfacing contextual information (which tests to verify) via GraphRAG outperforms prescribing procedural workflows (how to do TDD). The autonomous auto-improvement loop demonstrates how agents can iteratively refine their code-test graphs to reach 0% regression, although that result comes from a 10-instance subset.
Relevance to RSI: The autonomous auto-improvement loop offers a small-scale example of how a self-correcting system can improve its own reliability: on a 10-instance subset, resolution rose from 12% to 60% with 0% regression. It suggests that RSI is more effective when the agent is given structured, graph-based context about its own environment.