Abstract
Vision-Language Models (VLMs) often struggle with multi-step reasoning in visual contexts. We propose a reinforced latent reasoning framework that allows agents to decompose complex visual queries into sub-tasks, "look" at specific visual features, and "reason" about them in a latent space before producing an output. This cycle is optimized via reinforcement learning to maximize accuracy and consistency.
Key Findings
- Latent Iteration: Agents perform multiple "reasoning cycles" in a hidden state, similar to human deliberation.
- Visual Decomposition: Complex scenes are automatically parsed into salient components for targeted inspection.
- Reinforcement Tuning: The reasoning path is directly rewarded, leading to more interpretable and reliable agent behavior.
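
The "look", "reason", and reward steps listed above can be illustrated with a toy sketch. This is not the paper's implementation: the soft-attention `look`, the blending rule in `reason`, the `mix` and `consistency_weight` parameters, and the reward formula (accuracy plus weighted agreement with the majority answer) are all assumptions chosen for illustration. Visual decomposition is not modeled; the scene is assumed to arrive already split into region feature vectors.

```python
import math
from collections import Counter
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def look(latent: Sequence[float], regions: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical 'look' step: soft attention over region features,
    weighted by each region's similarity to the current latent state."""
    weights = softmax([dot(latent, r) for r in regions])
    return [sum(w * r[i] for w, r in zip(weights, regions))
            for i in range(len(latent))]


def reason(latent: Sequence[float], glimpse: Sequence[float], mix: float = 0.5) -> List[float]:
    """Hypothetical 'reason' step: blend the previous latent state with the glimpse."""
    return [(1 - mix) * h + mix * g for h, g in zip(latent, glimpse)]


def latent_reasoning(query: Sequence[float],
                     regions: Sequence[Sequence[float]],
                     num_cycles: int = 3) -> List[float]:
    """Run several look/reason cycles in the hidden state before answering."""
    latent = list(query)
    for _ in range(num_cycles):
        latent = reason(latent, look(latent, regions))
    return latent


def answer(latent: Sequence[float], prototypes: Sequence[Sequence[float]]) -> int:
    """Pick the answer whose prototype vector best matches the final latent state."""
    scores = [dot(latent, p) for p in prototypes]
    return max(range(len(scores)), key=scores.__getitem__)


def reward(predictions: Sequence[int], target: int, consistency_weight: float = 0.5) -> float:
    """Assumed reward: accuracy over sampled rollouts plus a weighted
    consistency term (fraction of rollouts agreeing with the majority answer)."""
    n = len(predictions)
    accuracy = sum(p == target for p in predictions) / n
    majority_count = Counter(predictions).most_common(1)[0][1]
    return accuracy + consistency_weight * (majority_count / n)


# Two regions with orthogonal features; the query points at the first one.
regions = [[1.0, 0.0], [0.0, 1.0]]
final_latent = latent_reasoning([2.0, 0.0], regions)
predicted = answer(final_latent, prototypes=regions)
```

In a reinforcement-learning setup, a scalar such as the output of `reward` would be the signal used to update the policy that produces the reasoning path; the update rule itself is omitted here.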
Relevance to Recursive Self-Improvement (RSI)
The ability to **autonomously refine latent reasoning** is a core component of RSI. As agents learn to better "decompose and reason" about their own internal states, and not only about visual inputs, they move closer to improving their own reasoning without external supervision. This paper provides a concrete implementation of "Thinking Before Acting" for multi-modal agents.