Keywords: offline reinforcement learning, risk-averse RL, CVaR, distributional RL, generative policies, diffusion model, flow matching, behavior cloning, multimodal actions, out-of-distribution (OOD)
TL;DR: We guide expressive diffusion/flow policies with CVaR from a distributional critic, achieving safer offline RL (better lower tails) while keeping mean returns high and reducing OOD visitation.
Abstract: In safety-critical domains where online data collection is infeasible, offline reinforcement learning (RL) offers an attractive alternative, but only if policies deliver high returns without incurring catastrophic lower-tail risk. Prior work on risk-averse offline RL achieves safety at the cost of value- or model-based pessimism and restricted policy classes that limit expressiveness,
whereas expressive diffusion/flow-based generative policies trained with a behavioral-cloning (BC) objective have been used only in risk-neutral settings.
Here, we address this gap by introducing the \textbf{Risk-Aware Multimodal Actor-Critic (RAMAC)},
which couples an expressive generative actor with a distributional critic and, to our knowledge, is the first model-free approach that learns \emph{risk-aware expressive generative policies}. RAMAC differentiates a composite objective
that adds a Conditional Value-at-Risk (CVaR) term to a BC loss, achieving risk-sensitive learning in complex multimodal scenarios. Since out-of-distribution (OOD) actions are a major driver of catastrophic failures in offline RL, we further analyze OOD behavior under the prior-anchored perturbation schemes used in recent
BC-regularized risk-averse offline RL methods. This analysis clarifies why a behavior-regularized objective that directly constrains the expressive generative policy to the dataset support provides an effective, risk-agnostic mechanism for suppressing OOD actions in modern expressive policies.
We instantiate RAMAC with a diffusion-based actor, using it both to illustrate the analysis in a 2-D risky bandit and to deploy OOD-action detectors on Stochastic-D4RL benchmarks, empirically validating our insights. Across these tasks, we observe consistent gains in $\mathrm{CVaR}_{0.1}$ while maintaining strong returns.
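As a rough illustration of the composite objective described above, the following minimal sketch estimates $\mathrm{CVaR}_{\alpha}$ from the quantile outputs of a distributional critic and subtracts it from a BC loss. The function names, the weight `eta`, the equally weighted quantile representation, and the sign convention are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def cvar_from_quantiles(quantiles, alpha=0.1):
    """Lower-tail CVaR estimate: mean of the lowest alpha-fraction of return quantiles.

    Assumes the last axis holds equally weighted quantile estimates of the return.
    """
    q = np.sort(np.asarray(quantiles, dtype=float), axis=-1)
    # Number of quantiles in the lower tail (at least one).
    k = max(1, int(np.ceil(alpha * q.shape[-1])))
    return q[..., :k].mean(axis=-1)

def composite_loss(bc_loss, quantiles, alpha=0.1, eta=1.0):
    """Hypothetical composite objective: BC loss minus a weighted CVaR term.

    Minimizing this value keeps the policy close to the data (low BC loss)
    while pushing up the lower tail of the return distribution (high CVaR).
    """
    cvar = cvar_from_quantiles(quantiles, alpha).mean()  # average over the batch
    return bc_loss - eta * cvar

# Example: a batch of one state-action pair with four return quantiles.
loss = composite_loss(bc_loss=2.0, quantiles=[[1.0, 2.0, 3.0, 4.0]], alpha=0.5, eta=2.0)
```

In an actual actor update the CVaR term would be differentiated through the generative policy's sampled actions; the NumPy version here only shows the arithmetic.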
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 23665