LNUCB-TA: Linear-nonlinear Hybrid Bandit Learning with Temporal Attention

27 Sept 2024 (modified: 03 Dec 2024), ICLR 2025 Conference Withdrawn Submission, CC BY 4.0
Keywords: Contextual Multi-Armed Bandit, Exploration-Exploitation Trade-off, Adaptive k-Nearest Neighbors (k-NN), Attention-Based Exploration Rate, Sub-linear Regret
TL;DR: We present LNUCB-TA, a linear-nonlinear hybrid bandit model that introduces a novel nonlinear component designed to reduce time complexity, and an innovative global-and-local attention-based exploration mechanism.
Abstract: Existing contextual multi-armed bandit (MAB) algorithms struggle to capture both long-term trends and local patterns across all arms, leading to suboptimal performance in complex environments with rapidly changing reward structures. They also typically employ static exploration rates, which do not adapt to dynamic conditions. To address these issues, we present LNUCB-TA, a hybrid bandit model that introduces a novel nonlinear component (adaptive $k$-Nearest Neighbors ($k$-NN)) designed to reduce time complexity, together with a global-and-local attention-based exploration mechanism. Our method combines linear and nonlinear estimation: the nonlinear component dynamically adjusts $k$ based on reward variance, which lets it capture spatiotemporal patterns in the data. This reduces the likelihood of selecting suboptimal arms and improves reward estimation while lowering computational time. In addition, the proposed attention-based mechanism prioritizes arms based on their historical performance and frequency of selection, balancing exploration and exploitation in real time without the need to fine-tune exploration parameters. By incorporating both global attention (based on overall performance across all arms) and local attention (focusing on individual arm performance), the algorithm adapts to temporal and spatial complexities in the available context. Empirical evaluation demonstrates that LNUCB-TA significantly outperforms state-of-the-art contextual MAB algorithms, including purely linear bandits, purely nonlinear bandits, and a vanilla combination of the two, in terms of cumulative reward, mean reward, and convergence performance, and that its results remain consistent across different exploration rates. Theoretical analysis further supports the robustness of LNUCB-TA by establishing a sub-linear regret bound.
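The abstract names two mechanisms without giving their formulas: a $k$-NN estimator whose $k$ depends on reward variance, and an exploration rate driven by global and local attention. The following is a minimal Python sketch of how such components might look. It is not the authors' implementation: the function names (`adaptive_k`, `knn_estimate`, `attention_rate`), the parameters (`k_min`, `k_max`, `alpha0`, `kappa`), and the specific formulas are all assumptions chosen for illustration only.

```python
import numpy as np


def adaptive_k(rewards, k_min=1, k_max=10):
    """Hypothetical rule: pick a larger k when an arm's rewards are noisier."""
    if len(rewards) < 2:
        return k_min
    var = float(np.var(rewards))
    # Squash the variance into [0, 1) and map it onto [k_min, k_max].
    scale = var / (1.0 + var)
    k = k_min + int(round(scale * (k_max - k_min)))
    # k can never exceed the number of stored observations.
    return max(k_min, min(k, k_max, len(rewards)))


def knn_estimate(contexts, rewards, x, k):
    """Mean reward of the k stored contexts closest to x (Euclidean distance)."""
    contexts = np.asarray(contexts, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    dists = np.linalg.norm(contexts - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    return float(rewards[nearest].mean())


def attention_rate(arm_rewards, all_rewards, alpha0=1.0, kappa=0.5):
    """Hypothetical exploration rate for one arm.

    Mixes global attention (mean reward over all arms) with local attention
    (mean reward of this arm), then shrinks with how often the arm was pulled.
    """
    global_att = float(np.mean(all_rewards)) if len(all_rewards) else 0.0
    local_att = float(np.mean(arm_rewards)) if len(arm_rewards) else 0.0
    mix = kappa * global_att + (1.0 - kappa) * local_att
    return alpha0 * mix / np.sqrt(1.0 + len(arm_rewards))
```

Under these assumptions, an arm with constant rewards gets `k = k_min`, a high-variance arm gets a `k` near `k_max`, and an arm that has been pulled many times receives a smaller exploration rate than a rarely pulled arm with a similar record. In a full hybrid bandit, the `knn_estimate` value would be combined with a linear (ridge-regression) estimate and an upper-confidence term scaled by `attention_rate`; that combination is omitted here because the abstract does not specify it.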
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 11860