Offline Multi-Agent Reinforcement Learning for Objective-Weight Adaptation in Three-Sided Marketplace Dispatch
Keywords: offline reinforcement learning, multi-agent reinforcement learning, multi-objective decision making, marketplace dispatch, switchback experiments
TL;DR: We deploy offline-trained multi-agent RL to adapt dispatch objective weights in a production food-delivery marketplace, improving batching and courier efficiency without degrading customer-facing delivery quality.
Abstract: Dispatch in three-sided marketplaces requires balancing customer delivery quality, merchant congestion, and courier efficiency under rapidly changing local conditions. We present a deployed offline-to-online reinforcement learning system that adapts dispatch objective weights in a large-scale food-delivery platform. Rather than replacing the combinatorial assignment optimizer, a store-level policy learned from logged marketplace data selects a discrete multiplier that shifts the optimizer's tradeoff between delivery speed and batching efficiency. This interface enables offline policy learning while preserving production feasibility constraints and operational safeguards. We train a shared value function on centralized offline data and execute it in a decentralized manner at the store level, using Double Q-learning targets and a conservative regularizer to reduce out-of-distribution value overestimation. The resulting policy serves hundreds of millions of inferences per day. In a production switchback experiment, the offline-trained policy increases batching and reduces courier-side time costs without degrading customer-facing delivery quality, illustrating how offline decision-making can be safely adapted online in a large-scale marketplace.
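As a rough illustration of the training objective named in the abstract, the sketch below computes Double Q-learning targets over a discrete set of multipliers and adds a conservative penalty on Q-values. The abstract does not specify the regularizer, so the CQL-style log-sum-exp penalty used here is an assumption, as are the function names, the array shapes, and the values of `gamma` and `alpha`. It is a minimal NumPy sketch, not the deployed implementation.

```python
import numpy as np


def double_q_targets(q_online_next, q_target_next, rewards, dones, gamma=0.99):
    """Double Q-learning targets for a batch of logged transitions.

    q_online_next, q_target_next: arrays of shape (batch, n_multipliers)
    holding next-state Q-values from the online and target networks.
    """
    # The online network selects the next action (multiplier index) ...
    next_actions = np.argmax(q_online_next, axis=1)
    # ... and the target network evaluates that selection.
    batch_idx = np.arange(len(next_actions))
    next_values = q_target_next[batch_idx, next_actions]
    # Terminal transitions (dones == 1) do not bootstrap.
    return rewards + gamma * (1.0 - dones) * next_values


def conservative_loss(q_values, actions, targets, alpha=1.0):
    """TD loss plus an assumed CQL-style conservative penalty.

    The penalty pushes down Q-values across all multipliers (via a
    log-sum-exp) while pushing up the Q-value of the logged multiplier.
    """
    batch_idx = np.arange(len(actions))
    q_taken = q_values[batch_idx, actions]
    td_loss = np.mean((q_taken - targets) ** 2)
    # Numerically stable log-sum-exp over the action dimension.
    q_max = q_values.max(axis=1, keepdims=True)
    lse = q_max[:, 0] + np.log(np.exp(q_values - q_max).sum(axis=1))
    penalty = np.mean(lse - q_taken)
    return td_loss + alpha * penalty
```

In this sketch a single shared Q-function scores every store's local state, and each store independently takes the argmax multiplier at inference time, which is one way to read the centralized-training, decentralized-execution setup described above.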
Submission Number: 115