D2C-HRHR: Discrete Actions with Double Distributional Critics for High-Risk-High-Return Tasks

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Deep Reinforcement Learning; Distributional Reinforcement Learning; Double Distributional Critics; Action Entropy
TL;DR: We propose a reinforcement learning model with discrete actions and distributional critics, combining an action-entropy-based exploration strategy with a double distributional critic architecture.
Abstract: Tasks involving high-risk–high-return (HRHR) actions, such as obstacle crossing, often exhibit multimodal action distributions and stochastic returns. Most reinforcement learning (RL) methods assume unimodal Gaussian policies and rely on scalar-valued critics, which limits their effectiveness in HRHR settings. We formally define HRHR tasks and theoretically show that Gaussian policies cannot guarantee convergence to the optimal solution. To address this, we propose a reinforcement learning framework that (i) discretizes continuous action spaces to approximate multimodal distributions, (ii) employs entropy-regularized exploration to improve coverage of risky but rewarding actions, and (iii) introduces a dual-critic architecture for more accurate discrete value distribution estimation. The framework scales to high-dimensional action spaces, supporting complex control domains. Experiments on locomotion and manipulation benchmarks with high risk of failure demonstrate that our method outperforms baselines, underscoring the importance of explicitly modeling multimodality and risk in RL.
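The sketch below is a minimal, illustrative reading of the three ingredients named in the abstract, not the authors' released implementation: per-dimension discretization of a continuous action space, an entropy bonus on the resulting discrete policy, and two independent categorical value-distribution critics combined pessimistically. The number of bins, the atom grid, the layer sizes, and the clipped-double-Q-style combination rule are all assumptions made for illustration.

```python
# Illustrative sketch only: discrete-action policy with entropy bonus and
# double categorical (distributional) critics. Hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_BINS = 11            # discrete bins per action dimension (assumption)
N_ATOMS = 51           # atoms in each value distribution (assumption)
V_MIN, V_MAX = -10.0, 10.0

class DiscretePolicy(nn.Module):
    """Maps states to one categorical distribution per action dimension."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim * N_BINS),
        )
        self.act_dim = act_dim
        # Bin centers that decode a discrete choice back to an action in [-1, 1].
        self.register_buffer("bins", torch.linspace(-1.0, 1.0, N_BINS))

    def forward(self, obs):
        logits = self.net(obs).view(-1, self.act_dim, N_BINS)
        return torch.distributions.Categorical(logits=logits)

class CategoricalCritic(nn.Module):
    """One critic head that outputs a categorical value distribution."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, N_ATOMS),
        )
        self.register_buffer("atoms", torch.linspace(V_MIN, V_MAX, N_ATOMS))

    def forward(self, obs, act):
        # Probability mass over the fixed atom support.
        return F.softmax(self.net(torch.cat([obs, act], dim=-1)), dim=-1)

def pessimistic_value(critic1, critic2, obs, act):
    """Combine the two distributions via the minimum of their means
    (clipped-double-Q style; an assumed combination rule)."""
    q1 = (critic1(obs, act) * critic1.atoms).sum(-1)
    q2 = (critic2(obs, act) * critic2.atoms).sum(-1)
    return torch.minimum(q1, q2)

def policy_loss(policy, critic1, critic2, obs, entropy_coef=0.01):
    dist = policy(obs)
    idx = dist.sample()                 # discrete bin index per action dim
    act = policy.bins[idx]              # decode to a continuous action
    q = pessimistic_value(critic1, critic2, obs, act)
    logp = dist.log_prob(idx).sum(-1)
    entropy = dist.entropy().sum(-1)
    # Score-function gradient on the discrete policy plus an entropy bonus
    # that keeps probability mass on risky-but-rewarding bins.
    return -(logp * q.detach() + entropy_coef * entropy).mean()
```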
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 10796