Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space

ICLR 2026 Conference Submission15318 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: reasoning, test-time, instance-level, policy gradient, latent space, latent reasoning
TL;DR: We propose to enhance LLM reasoning ability via test-time instance-level policy gradient in latent space.
Abstract: Large Language Models (LLMs) typically reason through explicit, step-by-step natural-language traces. Humans, however, also rely on non-linguistic, unconscious processes, such as the inspirations that emerge during an incubation period. In this work, we introduce LatentSeek, a novel framework designed to enhance the reasoning capabilities of LLMs through Test-Time Instance-level Policy Gradient within the model's latent space, thus complementing explicit natural-language steps. LatentSeek employs policy gradient optimization to iteratively refine latent representations, guided solely by a self-generated reward signal. This allows the model to adapt its reasoning trajectory dynamically on a per-instance basis. Empirical evaluations across diverse benchmarks (GSM8K, MATH-500, and AIME2024) and multiple LLM families (e.g., LLaMA, Qwen) demonstrate that LatentSeek outperforms established baselines, including Chain-of-Thought (CoT), Best-of-N (BoN), and training-based methods. Further analysis indicates that LatentSeek is computationally efficient, typically converging within a few optimization iterations for average-difficulty problems. Moreover, the model's performance improves as the number of latent update iterations increases, highlighting the benefits of exploring within the latent space. These findings position LatentSeek as a lightweight and effective paradigm for improving the reasoning capabilities of LLMs without changing their parameters.
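The abstract describes a test-time loop: sample around the current latent representation, score the samples with a self-generated reward, and apply a policy-gradient update to the latent. As an illustration only, here is a minimal REINFORCE-style sketch of that loop on a toy latent vector; `self_reward`, the Gaussian perturbation policy, and all hyperparameters are hypothetical stand-ins, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the model's self-generated reward over a latent z
# (higher is better); TARGET is an invented optimum for this demo only.
TARGET = np.array([1.0, -2.0, 0.5])

def self_reward(z: np.ndarray) -> float:
    return -float(np.sum((z - TARGET) ** 2))

def latent_seek(z0: np.ndarray, iters: int = 200, pop: int = 16,
                sigma: float = 0.1, lr: float = 0.05) -> np.ndarray:
    """Instance-level policy-gradient refinement of a single latent vector."""
    z = z0.copy()
    for _ in range(iters):
        # Sample perturbations from a Gaussian policy centered at z
        eps = rng.standard_normal((pop, z.size)) * sigma
        rewards = np.array([self_reward(z + e) for e in eps])
        # Mean-reward baseline for variance reduction
        adv = rewards - rewards.mean()
        # REINFORCE gradient estimate for a Gaussian policy
        grad = (adv[:, None] * eps).mean(axis=0) / sigma**2
        # Ascend the estimated reward gradient
        z = z + lr * grad
    return z

z0 = np.zeros(3)
z = latent_seek(z0)
```

The key property mirrored here is that only the latent vector is updated; the reward function (standing in for the frozen LLM plus its self-reward) is never modified.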
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 15318