Track: Research Track
Keywords: Multi-Armed Bandits, Online Learning
Abstract: We study an online learning setting where an agent's actions are constrained to local movements on a dynamic graph, a setting that captures scenarios such as autonomous reconnaissance. This problem highlights a core challenge in adaptive systems: how to learn effectively with only partial, localized feedback in a non-stationary environment. We propose a set of structural conditions, termed \textit{Recurrent Reachability} and \textit{Temporal Stability}, that are sufficient for learnability. Our analysis reveals a foundational \textit{anatomy of regret}, decomposing it into a statistical learning cost and a physical navigation cost. We introduce a family of local algorithms, progressing from a canonical protocol to a more practical, adaptive variant, and culminating in a reward-aware exploration policy that achieves provably near-optimal regret on any graph sequence satisfying our conditions. We corroborate our theory in a disaster-response simulation.
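The regret decomposition mentioned in the abstract can be sketched in notation (a hypothetical rendering; the symbols and the additive form are illustrative, not taken from the paper):

$$
R_T \;=\; \underbrace{R_T^{\mathrm{stat}}}_{\text{statistical learning cost}} \;+\; \underbrace{R_T^{\mathrm{nav}}}_{\text{physical navigation cost}}
$$

Here $R_T$ would denote cumulative regret over horizon $T$, with the statistical term capturing the cost of estimating rewards from partial, localized feedback and the navigation term capturing the cost of physically traversing the dynamic graph to reach high-reward nodes.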
Submission Number: 144