RAST-MoE-RL: A Regime-Aware Spatio-Temporal MoE Framework for Deep Reinforcement Learning in Ride-Hailing

Yuhan Tang; Kangxin Cui; Jung Ho Park; Yibo Zhao; Xuan Jiang; Haoze He; Jiangbo Yu; Haris Koutsopoulos; Jinhua Zhao

15 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | Readers: Everyone | License: CC BY 4.0

Keywords: Deep Reinforcement Learning, Mixture of Experts, Urban Mobility, Ride-Hailing

Abstract: Ride-hailing platforms face the challenge of balancing passenger waiting times with overall system efficiency under highly uncertain supply–demand conditions. Adaptive delayed matching, which decides whether to assign drivers immediately or hold requests for batching, creates a fundamental trade-off between matching delay and pickup delay. Because these outcomes accumulate over long horizons and depend on stochastic, evolving supply–demand states, reinforcement learning (RL) is a natural framework for this problem. Yet existing approaches often oversimplify traffic dynamics, misrepresenting congestion effects, or employ RL models with shallow encoders that fail to capture complex spatiotemporal patterns. We introduce the Regime-Aware Spatio-Temporal Mixture-of-Experts framework (RAST-MoE), which formalizes adaptive delayed matching as a regime-aware MDP and equips RL agents with a self-attention MoE encoder. Instead of relying on a single monolithic network, our design allows different experts to specialize automatically, improving representation capacity while keeping per-sample computation efficient. A physics-informed congestion surrogate preserves realistic density–speed feedback while enabling millions of efficient rollouts. An adaptive reward scheme further guards against pathological strategies by dynamically penalizing service-quality violations. Despite its modest size of only 12M parameters, our framework consistently outperforms strong baselines. On real-world Uber trajectory data from San Francisco, it improves total reward by over 13%, reduces average matching delay by 15%, and reduces pickup delay by 10%. In addition, it demonstrates strong robustness across unseen demand regimes, stable training without reward hacking, and validated expert specialization. These findings demonstrate the broader potential of MoE-enhanced RL for large-scale decision-making tasks with complex spatiotemporal dynamics and large action spaces.
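The abstract's claim that experts specialize while per-sample computation stays efficient is typical of sparse MoE layers, where a gate sends each input to only a few experts. The abstract does not state the routing rule, so the following is a minimal sketch assuming softmax-weighted top-k routing. The names `top_k_route`, `moe_forward`, and `make_expert`, the toy scaling experts, and the random gate matrix are all hypothetical and are not taken from the paper.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    order = sorted(range(len(gate_logits)),
                   key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([gate_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, gate_matrix, experts, k=2):
    """Sparse MoE layer: only the k routed experts are evaluated for x."""
    # Linear gate: one logit per expert (assumed form, not from the paper).
    gate_logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_matrix]
    routes = top_k_route(gate_logits, k)
    out = [0.0] * len(x)
    for idx, weight in routes:
        y = experts[idx](x)  # experts outside `routes` are never called
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, routes


def make_expert(scale):
    """Toy expert that rescales its input; stands in for a small network."""
    return lambda x: [scale * xi for xi in x]


experts = [make_expert(s) for s in (0.5, 1.0, 2.0, 4.0)]
rng = random.Random(0)
gate_matrix = [[rng.uniform(-1, 1) for _ in range(3)] for _ in experts]
output, routes = moe_forward([1.0, -2.0, 0.5], gate_matrix, experts, k=2)
```

With `k=2` and four experts, each input triggers two expert evaluations regardless of how many experts exist, which is one way total capacity can grow without a matching increase in per-sample cost.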

Primary Area: applications to robotics, autonomy, planning

Submission Number: 6403
