Paper: Decompose, Look, and Reason (2604.07518)

Abstract

Vision-Language Models (VLMs) often struggle with multi-step reasoning in visual contexts. We propose a reinforced latent reasoning framework that allows agents to decompose complex visual queries into sub-tasks, "look" at specific visual features, and "reason" about them in a latent space before producing an output. This cycle is optimized via reinforcement learning to maximize accuracy and consistency.
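
The decompose, look, reason cycle described above can be sketched as a plain control loop. This is a minimal illustration rather than the paper's implementation: the class `LatentReasoner`, its fields, and the toy stand-in functions are all hypothetical names, and the latent state is a short list of floats where a real model would use a learned hidden state.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class LatentReasoner:
    """Hypothetical container for the three stages plus a decoder."""
    decompose: Callable[[str], List[str]]       # query -> sub-tasks
    look: Callable[[str], Vector]               # sub-task -> visual features
    reason: Callable[[Vector, Vector], Vector]  # (state, features) -> new state
    decode: Callable[[Vector], str]             # final state -> output
    cycles_per_subtask: int = 2                 # assumed fixed number of latent cycles

    def run(self, query: str, dim: int) -> Vector:
        """Return the final latent state after processing every sub-task."""
        state = [0.0] * dim
        for sub_task in self.decompose(query):
            features = self.look(sub_task)
            # Several reasoning cycles happen in latent space before any output.
            for _ in range(self.cycles_per_subtask):
                state = self.reason(state, features)
        return state

    def answer(self, query: str, dim: int) -> str:
        return self.decode(self.run(query, dim))


# Toy stand-ins (assumptions for illustration only, not real vision components).
def toy_decompose(query: str) -> List[str]:
    return [part.strip() for part in query.split(" and ")]


def toy_look(sub_task: str) -> Vector:
    # Pretend "features" are the character count and word count of the sub-task.
    return [float(len(sub_task)), float(len(sub_task.split()))]


def toy_reason(state: Vector, features: Vector) -> Vector:
    # Blend the current state halfway toward the observed features.
    return [0.5 * s + 0.5 * f for s, f in zip(state, features)]


def toy_decode(state: Vector) -> str:
    return "latent state: " + ", ".join(str(x) for x in state)


reasoner = LatentReasoner(toy_decompose, toy_look, toy_reason, toy_decode)
final_state = reasoner.run("count the cats and find the red car", dim=2)
```

Only the final decoded string would be visible to a caller of `answer`; the intermediate states stay internal, which is the sense in which the reasoning is latent.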

Key Findings

Latent Iteration: Agents perform multiple "reasoning cycles" in a hidden state, similar to human deliberation.
Visual Decomposition: Complex scenes are automatically parsed into salient components for targeted inspection.
Reinforcement Tuning: The reasoning path is directly rewarded, leading to more interpretable and reliable agent behavior.
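
To make "the reasoning path is directly rewarded" concrete, here is a small sketch of a generic REINFORCE update with a running baseline. It is a stand-in technique, not the paper's training algorithm: each candidate reasoning path is treated as one discrete action of a softmax policy, and the reward values, learning rate, and function names are hypothetical.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits: List[float], action: int, reward: float,
                   baseline: float, lr: float) -> List[float]:
    """One REINFORCE update for a softmax policy over reasoning paths.

    For a softmax policy, d log pi(action) / d logit_k = 1[k == action] - pi_k.
    """
    probs = softmax(logits)
    advantage = reward - baseline
    return [
        logit + lr * advantage * ((1.0 if k == action else 0.0) - p)
        for k, (logit, p) in enumerate(zip(logits, probs))
    ]


def train(path_rewards: List[float], steps: int = 500, lr: float = 0.1,
          seed: int = 0) -> List[float]:
    """Sample paths, reward them, and return the final path probabilities."""
    rng = random.Random(seed)
    logits = [0.0] * len(path_rewards)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(path_rewards)), weights=probs)[0]
        reward = path_rewards[action]
        logits = reinforce_step(logits, action, reward, baseline, lr)
        baseline += 0.05 * (reward - baseline)  # running average of rewards
    return softmax(logits)


# Hypothetical per-path rewards, e.g. a mix of accuracy and consistency scores.
final_probs = train([0.2, 1.0, 0.5])
```

A path whose reward exceeds the baseline has its probability raised, and one that falls below has it lowered, so over many samples the policy tends to shift toward better-rewarded paths.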

Relevance to RSI

The ability to **autonomously refine latent reasoning** is a core component of recursive self-improvement (RSI). As agents learn to better decompose and reason about their own internal states, they move closer to that goal. This paper provides a concrete implementation of "thinking before acting" for multi-modal agents.

Latent Reasoning, VLM, Reinforcement Learning