Abstract: Addressing non-stationarity and latent variables in bandit algorithms presents significant challenges. This paper tackles both challenges simultaneously in Multi-Agent Multi-Armed Bandits by integrating causal inference principles with panel data methodologies. We propose Dynamic Causal Multi-Armed Bandits (DCMAB) and Dynamic Causal Contextual Bandits (DCCB), which focus on treatment effect estimation rather than direct reward modeling. By applying matrix completion to agent-time reward matrices, our algorithms effectively leverage shared information across agents while adapting to dynamic environments. We establish sub-linear regret bounds for the proposed algorithms and extend their applicability to scenarios with time-varying treatment effects. Through extensive simulations and a real-world application to stock market data, we demonstrate the advantages of our methods in non-stationary bandits with latent variables.
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Olivier_Cappé2
Submission Number: 4805