Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs

ArXiv: 2604.07518 | Published: April 2026 | Categories: Computer Vision, Reinforcement Learning, Latent Reasoning

Abstract

Vision-Language Models (VLMs) often struggle with multi-step reasoning in visual contexts. We propose a reinforced latent reasoning framework that allows agents to decompose complex visual queries into sub-tasks, "look" at specific visual features, and "reason" about them in a latent space before producing an output. This cycle is optimized via reinforcement learning to maximize accuracy and consistency.
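The decompose, look, reason cycle described above can be illustrated with a toy sketch. This is not the paper's implementation: the function names (`decompose`, `look`, `reason`, `run_episode`, `reward`), the dictionary standing in for an image, and the single float standing in for the latent state are all assumptions made for illustration. The sketch only shows the control flow and the scalar reward that a reinforcement learning algorithm would maximize; no policy update is included.

```python
from typing import Dict, List


def decompose(query: str) -> List[str]:
    """Hypothetical decomposer: split a compound query into sub-tasks on ' and '."""
    return [part.strip() for part in query.split(" and ") if part.strip()]


def look(image: Dict[str, float], sub_task: str) -> float:
    """Hypothetical 'look' step: return the feature for the object named in the sub-task.

    The 'image' is a stand-in dict mapping object names to a numeric feature
    (here, an object count); a real VLM would attend to visual tokens instead.
    """
    for name, value in image.items():
        if name in sub_task:
            return value
    return 0.0  # nothing relevant found for this sub-task


def reason(latent: float, feature: float) -> float:
    """Hypothetical latent update: fold the new feature into the running state.

    No intermediate text is produced; the state stays internal until the end.
    """
    return latent + feature


def run_episode(image: Dict[str, float], query: str) -> str:
    """Run one decompose -> look -> reason cycle and decode the final answer."""
    latent = 0.0
    for sub_task in decompose(query):
        latent = reason(latent, look(image, sub_task))
    return str(int(latent))


def reward(answer: str, target: str) -> float:
    """Assumed accuracy reward: 1.0 for an exact match, 0.0 otherwise."""
    return 1.0 if answer == target else 0.0


image = {"cats": 2.0, "dogs": 3.0}
answer = run_episode(image, "how many cats and how many dogs")
score = reward(answer, "5")
```

In a trained system each of these steps would be produced by the model itself, and the reward would drive updates to the policy that generates the sub-tasks and latent states. Here they are fixed functions so that the loop can be read end to end.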

Key Findings

Relevance to RSI

The ability to **autonomously refine latent reasoning** is a core component of recursive self-improvement (RSI). As agents learn to better "decompose and reason" about their own internal states, they move closer to RSI. This paper provides a concrete implementation of "Thinking Before Acting" for multi-modal agents.

Tags: Latent Reasoning, VLM, Reinforcement Learning