Feasible Action Search for Bandit Linear Programs via Thompson Sampling

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC-SA 4.0
TL;DR: An efficient method based on Thompson Sampling to find feasible actions for LPs with bandit feedback
Abstract: We study the 'feasible action search' (FAS) problem for linear bandits, wherein a learner attempts to discover a feasible point for a set of linear constraints $\Phi_* a \ge 0,$ without knowledge of the matrix $\Phi_* \in \mathbb{R}^{m \times d}$. A FAS learner selects a sequence of actions $a_t,$ and uses observations of the form $\Phi_* a_t + \mathrm{noise}$ to either find a point with nearly optimal 'safety margin', or detect that the constraints are infeasible, where the safety margin of an action measures its (signed) distance from the constraint boundary. While of interest in its own right, the FAS problem also directly addresses a key deficiency in the extant theory of 'safe linear bandits' (SLBs), by discovering a safe initialisation for low-regret SLB methods. We propose and analyse a novel, efficient FAS learner. Our method, FAST, is based on Thompson Sampling. It applies a _coupled_ random perturbation to an estimate of $\Phi_*,$ and plays a maximin point of a game induced by this perturbed matrix. We prove that FAST stops in $\tilde{O}(d^3/(\varepsilon^2 M_*^2))$ steps, and incurs $\tilde{O}(d^3/|M_*|)$ safety costs, to either correctly detect infeasibility, or output a point that is at least $(1-\varepsilon) M_*$-safe, where $M_*$ is the _optimal safety margin_ of $\Phi_*$. Further, instantiating prior SLB methods with the output of FAST yields the first SLB methods that incur $\tilde{O}(\sqrt{d^3 T/M_*^2})$ regret and $O(1)$ risk without a priori knowledge of a safe action. The main technical novelty lies in the extension of Thompson Sampling to this multiobjective setting, for which we both propose a coupled noise design and provide an analysis that avoids convexity considerations.
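To make the abstract's loop concrete, here is a minimal Python sketch of a FAST-like learner under assumptions of ours, not the paper's specification: the action set $\|a\|_\infty \le 1$, the fixed horizon in place of the paper's stopping test, the noise scale, and the particular "coupled" design shown (a single Gaussian direction, whitened by the regularised Gram matrix and shared across all $m$ rows of the estimate) are all illustrative choices, as are the names `maximin_point`, `fast_sketch`, and the user-supplied feedback callable `observe`.

```python
import numpy as np
from scipy.optimize import linprog

def maximin_point(Phi):
    """Solve max_{||a||_inf <= 1} min_i (Phi a)_i as an LP over (a, s)."""
    m, d = Phi.shape
    c = np.zeros(d + 1)
    c[-1] = -1.0                               # minimise -s  <=>  maximise s
    A_ub = np.hstack([-Phi, np.ones((m, 1))])  # s - (Phi a)_i <= 0 for all i
    b_ub = np.zeros(m)
    bounds = [(-1.0, 1.0)] * d + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:d], res.x[-1]                # action, achieved margin

def fast_sketch(observe, m, d, T=2000, lam=1.0, scale=1.0, seed=0):
    """Illustrative FAST-like loop: ridge estimation, coupled perturbation,
    then maximin play. `observe(a)` must return Phi_* @ a + noise, shape (m,)."""
    rng = np.random.default_rng(seed)
    V = lam * np.eye(d)                        # regularised Gram matrix
    B = np.zeros((d, m))                       # running sum of a_t y_t^T
    a = np.zeros(d)
    for _ in range(T):
        Phi_hat = np.linalg.solve(V, B).T      # row-wise ridge estimate of Phi_*
        # Coupled perturbation (one plausible reading of 'coupled'): a single
        # Gaussian direction, whitened by V^{-1/2}, is added to every row, so
        # all m constraints are perturbed coherently rather than independently.
        L = np.linalg.cholesky(np.linalg.inv(V))
        shared = L @ rng.standard_normal(d)
        Phi_tilde = Phi_hat + scale * np.tile(shared, (m, 1))
        a, margin = maximin_point(Phi_tilde)   # play maximin of perturbed game
        y = observe(a)                         # bandit feedback on constraints
        V += np.outer(a, a)
        B += np.outer(a, y)
    return a  # candidate near-feasible action; a real stopping test would use
              # confidence bounds on the margin instead of the fixed horizon T
```

Note the design point the sketch highlights: because the perturbation is shared across rows, a single LP per round suffices, which is the source of the computational efficiency claimed over prior FAS-capable methods.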
Lay Summary: Practical engineering and scientific disciplines often need to find processes that satisfy a number of objectives that lie in tension with one another. Our work describes a new method, FAST, that enables efficient and intelligent trial-and-error to quickly find a process that achieves a nearly 'best-possible,' i.e., minimax, balance between such objectives. FAST observes the results of previous experiments to select which process to try next, i.e., it is a "sequential experimental design". While there were some prior methods that could be used to solve this task, they were computationally very slow, and would have taken days or years of computation to pick the right processes to try. Our method, which is based on a technique called Thompson Sampling (TS), instead reduces this time to seconds, while using nearly the fewest possible number of experiments (in a certain technical sense). Our basic technical contribution is to allow TS-like techniques to work with many objectives, whereas prior understanding of TS dealt with only one objective. For this, we both developed a new algorithmic design in the form of a "coupled noise", and developed new ways to mathematically analyse TS with many objectives. Since the method is general, it may serve to help practitioners in diverse fields such as manufacturing, control, and resource-allocation to quickly discover good processes that balance the many needs they must address.
Primary Area: Theory->Online Learning and Bandits
Keywords: Safe Bandits, Linear Bandits, Thompson Sampling
Submission Number: 13853