Softmax for Continuous Actions: Optimality, MCMC Sampling, and Actor-Free Control

ICLR 2026 Conference Submission624 Authors

01 Sept 2025 (modified: 23 Dec 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Reinforcement Learning; Softmax Policies
Abstract: As a mathematical solution to entropy-regularized reinforcement learning, softmax policies play an important role in facilitating exploration and policy multi-modality. However, the use of softmax has mainly been restricted to discrete action spaces, and significant challenges, both theoretical and empirical, exist in extending it to continuous actions: Theoretically, it remains unclear how the continuous softmax approximates the hard max as the temperature decreases, a question that existing discrete analyses cannot handle. Empirically, using a standard actor architecture (e.g., with Gaussian noise) to approximate the softmax suffers from limited expressivity, while leveraging complex generative models requires more sophisticated loss design. Our work addresses these challenges with a simple Deep Decoupled Softmax Q-Learning (DDSQ) algorithm and the associated analyses, in which we directly implement a continuous softmax of the critic without a separate actor, eliminating the bias caused by the actor's expressivity constraints. Theoretically, we provide guarantees on the suboptimality of the continuous softmax based on a novel volume-growth characterization of level sets in the action space. Algorithmically, we establish a critic-only training framework that samples from the softmax via state-dependent Langevin dynamics. Experiments on MuJoCo benchmarks demonstrate strong performance with balanced training cost.
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 624
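Sampler sketch (illustrative, not from the submission): a minimal Python/PyTorch sketch of the kind of actor-free sampler the abstract describes at a high level, drawing actions from a softmax of the critic, pi(a|s) proportional to exp(Q(s,a)/tau), via unadjusted Langevin dynamics on the action so that no separate policy network is needed. All names (q_net, tau, step_size, n_steps) and the clamping heuristic for bounded action boxes are assumptions for illustration, not the paper's DDSQ implementation.

import torch

def langevin_sample_action(q_net, state, action_dim, tau=0.1,
                           step_size=1e-2, n_steps=50):
    # Approximate a draw from pi(a|s) proportional to exp(Q(s,a)/tau) by running
    # unadjusted Langevin dynamics on the action, using only the critic's gradient.
    a = torch.zeros(action_dim, requires_grad=True)  # simple fixed initialization (assumption)
    for _ in range(n_steps):
        q = q_net(state, a)                          # scalar Q(s, a)
        grad = torch.autograd.grad(q, a)[0]          # grad_a Q(s, a)
        with torch.no_grad():
            noise = torch.randn_like(a)
            a = a + 0.5 * step_size * grad / tau + (step_size ** 0.5) * noise
            a = a.clamp(-1.0, 1.0)                   # crude projection to a bounded action box
        a.requires_grad_(True)
    return a.detach()

# Toy usage: a quadratic critic whose softmax concentrates near a = s as tau shrinks.
if __name__ == "__main__":
    q_toy = lambda s, a: -((a - s) ** 2).sum()
    s = torch.tensor([0.3, -0.5])
    print(langevin_sample_action(q_toy, s, action_dim=2, tau=0.05))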