Offline Learning for Combinatorial Multi-armed Bandits

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
TL;DR: An offline learning framework for combinatorial multi-armed bandits
Abstract: The combinatorial multi-armed bandit (CMAB) is a fundamental sequential decision-making framework that has been extensively studied over the past decade. However, existing work focuses primarily on the online setting, overlooking the substantial cost of online interaction and the offline datasets that are often readily available. To overcome these limitations, we introduce Off-CMAB, the first offline learning framework for CMAB. Central to our framework is the combinatorial lower confidence bound (CLCB) algorithm, which combines pessimistic reward estimation with combinatorial solvers. To characterize the quality of offline datasets, we propose two novel data coverage conditions and prove that, under these conditions, CLCB achieves a near-optimal suboptimality gap, matching the theoretical lower bound up to a logarithmic factor. We validate Off-CMAB on practical applications, including learning to rank, large language model (LLM) caching, and social influence maximization, showing its ability to handle nonlinear reward functions, general feedback models, and out-of-distribution action samples that exclude the optimal or even feasible actions. Extensive experiments on synthetic and real-world datasets further highlight the superior performance of CLCB.
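To make the pessimism-plus-solver recipe concrete, below is a minimal sketch for the simplest instance: semi-bandit feedback with a top-k action space and linear rewards. All names (`lcb_estimates`, `top_k_oracle`) and the Hoeffding-style confidence radius are illustrative assumptions, not the paper's exact algorithm; the actual CLCB also handles nonlinear rewards and general feedback models.

```python
import numpy as np

def lcb_estimates(counts, reward_sums, delta=0.05):
    """Pessimistic per-arm reward estimates from offline data:
    empirical mean minus a Hoeffding-style confidence radius."""
    safe = np.maximum(counts, 1)                               # avoid division by zero
    means = reward_sums / safe
    radius = np.sqrt(np.log(2 * len(counts) / delta) / (2 * safe))
    # Arms never observed offline get -inf: maximal pessimism.
    return np.where(counts > 0, means - radius, -np.inf)

def top_k_oracle(scores, k):
    """Toy combinatorial solver: pick the k highest-scoring base arms."""
    return np.argsort(scores)[-k:]

# Hypothetical offline dataset: per-arm pull counts and summed rewards.
counts = np.array([50, 40, 5, 0, 30])
reward_sums = np.array([30.0, 28.0, 4.5, 0.0, 12.0])
print(top_k_oracle(lcb_estimates(counts, reward_sums), k=2))
```

In this toy run, arm 2 has the highest empirical mean (0.9) but only 5 offline samples, so its lower confidence bound falls below those of the well-covered arms 0 and 1, which the oracle selects instead: pessimism steers the solver away from poorly supported actions.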
Lay Summary: Machine learning systems often rely on live user interactions to learn—an approach that can be costly, risky, or impractical. Our work introduces **Off-CMAB**, the first method for making *combinatorial* decisions—like selecting a set of items to recommend—*offline*, using only existing data. At its core is a new algorithm, **CLCB**, which selects decisions backed by strong evidence in the data and efficiently handles the complexity of combining multiple actions. We also propose new criteria to assess whether offline data is sufficient for reliable learning and prove that CLCB performs nearly as well as theoretically possible. We demonstrate Off-CMAB on real-world tasks like search ranking, large language model (LLM) caching, and influence maximization. Even when data is incomplete or lacks optimal options, Off-CMAB performs robustly—enabling smarter, safer learning without live experimentation.
Primary Area: Theory->Online Learning and Bandits
Keywords: Multi-armed bandit, combinatorial multi-armed bandit, offline learning, data coverage
Submission Number: 2685