Abstract: Chain-of-thought (CoT) reasoning enables large language models (LLMs) to break down complex problems into explainable intermediate steps, significantly enhancing model transparency and performance in reasoning tasks.
However, conventional CoT methods rely on heuristic sampling without structured modelling of reasoning transitions,
which limits their ability to discover diverse and effective reasoning trajectories.
In this work, we introduce CTRLS, a framework that formulates CoT reasoning as a Markov decision process (MDP) with latent state transitions, enabling explainable and state-aware exploration via distributional reinforcement learning.
By modelling reasoning actions as explicit probability distributions in latent space,
our approach captures epistemic uncertainty, facilitating robust exploration of the reasoning space. Building on this formulation, we propose an on-policy reinforcement learning scheme that iteratively refines latent transitions without fine-tuning the underlying LLM.
Theoretical analyses provide evidence lower bounds (ELBOs) that ground our transition-aware modelling of latent reasoning dynamics.
Code Dataset Promise: No
Signed Copyright Form: pdf
Format Confirmation: I agree that I have read and followed the formatting instructions for the camera ready version.
Submission Number: 1219