Combinatorial Allocation Bandits with Nonlinear Arm Utility

Published: 25 May 2026, Last Modified: 27 May 2026, DEMO 2026 Poster, CC BY 4.0
Keywords: Online learning, Bandits, Matching
Abstract: A matching platform is a system that matches participants of different types, such as companies and job seekers. On such a platform, maximizing the number of matches may concentrate assignments on popular participants, increasing dissatisfaction among the others and eventually causing churn, which reduces the platform's profit opportunities. To address this issue, we propose a novel online learning problem, Combinatorial Allocation Bandits (CAB), which incorporates the notion of *arm satisfaction*. In CAB, in each round, the learner observes feature vectors for $K$ arms and $N$ users, assigns users to arms, and observes feedback following a generalized linear model (GLM). Unlike prior work, the objective is to maximize arm satisfaction rather than the amount of positive feedback. For CAB, we develop an upper confidence bound algorithm, CAB-UCB, that uses an approximate optimization oracle and achieves an approximate regret upper bound whose dependence on the feature dimension $d$, the number of rounds $T$, and $N$ matches the known lower bound for contextual combinatorial linear bandits up to logarithmic factors. We also analyze a Thompson sampling algorithm, which attains a standard regret bound under an exact optimization oracle, and propose a cheaper one-pass variant that retains sublinear approximate regret under a self-concordance assumption. Experiments on synthetic data support the proposed objective and show that CAB-UCB achieves higher cumulative satisfaction than baselines.
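To make the interaction protocol described in the abstract concrete, here is a minimal Python sketch of one possible CAB-style loop. It is an illustration under stated assumptions, not the authors' CAB-UCB. Several pieces are hypothetical choices made only for this sketch: the pair feature map (elementwise product of arm and user features), the logistic link as the GLM, the concave arm utility log(1 + s), the confidence multiplier `alpha`, and a greedy assignment rule standing in for the approximate optimization oracle. All function names are also hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def satisfaction(s):
    # Hypothetical nonlinear arm utility: concave, so returns diminish
    # as an arm accumulates (expected) positive feedback.
    return np.log1p(s)

def greedy_allocate(mu):
    """Hypothetical greedy stand-in for an approximate optimization oracle.

    mu: (K, N) array of (optimistic) success probabilities.
    Assigns each user to exactly one arm, repeatedly picking the
    (arm, user) pair with the largest marginal gain in total satisfaction.
    """
    K, N = mu.shape
    load = np.zeros(K)                      # expected positive feedback per arm
    assign = -np.ones(N, dtype=int)
    unassigned = set(range(N))
    while unassigned:
        best = None
        for n in sorted(unassigned):
            gains = satisfaction(load + mu[:, n]) - satisfaction(load)
            k = int(np.argmax(gains))
            if best is None or gains[k] > best[0]:
                best = (gains[k], k, n)
        _, k, n = best
        assign[n] = k
        load[k] += mu[k, n]
        unassigned.remove(n)
    return assign

def fit_logistic(Phi, r, lam=1.0, iters=20):
    """Ridge-regularized logistic MLE via Newton's method (assumed GLM link)."""
    d = Phi.shape[1]
    theta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(Phi @ theta)
        grad = Phi.T @ (p - r) + lam * theta
        H = Phi.T @ (Phi * (p * (1 - p))[:, None]) + lam * np.eye(d)
        theta -= np.linalg.solve(H, grad)
    return theta

def run_cab_ucb_sketch(T=50, K=3, N=6, d=4, alpha=1.0, seed=0):
    """Simulate T rounds on synthetic data; return cumulative realized satisfaction."""
    rng = np.random.default_rng(seed)
    theta_star = rng.normal(size=d) / np.sqrt(d)   # unknown GLM parameter (synthetic)
    Phi_hist, r_hist = np.empty((0, d)), np.empty(0)
    V = np.eye(d)                                  # regularized design matrix (lam = 1)
    total = 0.0
    for _ in range(T):
        X = rng.normal(size=(K, d))                # arm feature vectors
        Y = rng.normal(size=(N, d))                # user feature vectors
        Phi = X[:, None, :] * Y[None, :, :]        # (K, N, d) hypothetical pair features
        theta_hat = fit_logistic(Phi_hist, r_hist)
        V_inv = np.linalg.inv(V)
        width = np.sqrt(np.einsum("knd,de,kne->kn", Phi, V_inv, Phi))
        # Optimistic success probabilities: sigmoid is increasing, so adding the
        # confidence width inside it yields an upper estimate of each probability.
        mu_ucb = sigmoid(Phi @ theta_hat + alpha * width)
        assign = greedy_allocate(mu_ucb)
        chosen = Phi[assign, np.arange(N)]         # (N, d) features of chosen pairs
        r = rng.binomial(1, sigmoid(chosen @ theta_star)).astype(float)
        counts = np.bincount(assign, weights=r, minlength=K)  # positives per arm
        total += satisfaction(counts).sum()
        Phi_hist = np.vstack([Phi_hist, chosen])
        r_hist = np.concatenate([r_hist, r])
        V += chosen.T @ chosen
    return total

if __name__ == "__main__":
    print(greedy_allocate(np.array([[0.9, 0.9], [0.8, 0.8]])))
    print(run_cab_ucb_sketch())
```

The small `greedy_allocate` call shows why a nonlinear arm utility changes the allocation. With a linear objective, both users would go to arm 0 because 0.9 > 0.8. Under the concave utility, the second user's marginal gain at arm 0 (log 2.8 − log 1.9 ≈ 0.39) falls below its gain at the empty arm 1 (log 1.8 ≈ 0.59), so the assignment is spread out.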
Submission Number: 58