Abstract
Generating high-quality frontend code remains challenging for LLMs because of the gap between textual code and its visual rendering. We introduce a fully automated visual critic-in-the-loop framework in which a Vision-Language Model (VLM) provides iterative feedback based on the actual rendering of the generated code. This multimodal loop lets the system refine UI components until they match the design spec, yielding 17.8% higher accuracy.
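The loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `refine_ui` and `Critique`, the three pluggable callables (`generate`, `render`, `critique`), the 0-to-1 score scale, and the stopping thresholds are all assumptions introduced here for clarity.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Critique:
    """Hypothetical container for the visual critic's verdict."""
    score: float    # assumed scale: 0.0 (poor match) to 1.0 (matches the spec)
    feedback: str   # natural-language notes on what looks wrong


def refine_ui(
    spec: str,
    generate: Callable[[str, Optional[str], Optional[Critique]], str],
    render: Callable[[str], bytes],
    critique: Callable[[str, bytes], Critique],
    max_rounds: int = 3,
    target_score: float = 0.9,
) -> Tuple[str, List[Critique]]:
    """Generate code, render it, ask the critic, and revise until good enough.

    generate(spec, previous_code, previous_critique) -> new code
    render(code) -> image bytes of the rendered UI
    critique(spec, image) -> Critique produced from the rendering
    """
    code = generate(spec, None, None)      # first draft, no feedback yet
    history: List[Critique] = []
    best_code, best_score = code, float("-inf")

    for _ in range(max_rounds):
        screenshot = render(code)          # feedback is based on the rendering
        result = critique(spec, screenshot)
        history.append(result)

        if result.score > best_score:      # keep the best-scoring version
            best_code, best_score = code, result.score
        if result.score >= target_score:   # critic is satisfied, stop early
            break

        code = generate(spec, code, result)  # revise using the critic's notes

    return best_code, history
```

In a real system, `render` would presumably drive a headless browser and `critique` would call a VLM; keeping them as injected callables means the loop itself can be exercised with simple stand-ins. Returning the best-scoring version rather than the last one guards against a revision that makes the rendering worse.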
Key Findings
- Multimodal Feedback Loop: Using the visual rendering as a feedback signal grounds the agent's code generation in what the page actually looks like, not only in the code text.
- Automated Visual Criticism: VLMs can act as effective visual critics, identifying UI regressions and misalignment that textual analysis misses.
- Iterative Excellence: The model internalizes the "Iterative Refinement" logic, allowing it to anticipate and prevent visual errors in subsequent tasks.
Relevance to RSI
This paper demonstrates a multimodal form of Recursive Self-Improvement (RSI). By using an automated critic to close the loop between "design intent" (visual) and "implementation" (code), the system improves its frontend generation capabilities without human oversight. This is a crucial step toward "Decentralized Breakthroughs" in multimodal domains.