Low Complexity Online Contextual Learning with Continuous Actions

Published: 19 Dec 2025, Last Modified: 05 Jan 2026, AAMAS 2026 Extended Abstract, CC BY 4.0
Keywords: Contextual learning, Stochastic optimization
TL;DR: Gradient-based algorithm for contextual online learning that achieves state-of-the-art regret for this problem while having low memory and runtime complexity
Abstract: We study an online contextual learning problem, where an agent repeatedly observes independent and identically distributed contexts $c_t \in \mathbb{R}^d$ and selects actions $x_t \in \mathbb{R}^k$ to maximize its cumulative reward $r(x_t,c_t)$ over time. The reward function is Lipschitz continuous in the context, so actions that are good for a given context are also reasonably good for similar contexts. Existing algorithms that leverage this structure are impractical due to prohibitive runtime or memory complexity. In this paper, we propose Congrad, a simple kernel-based projected gradient ascent algorithm that maintains $O(n)$ memory and $O(n(k+d))$ computational complexity per iteration by projecting policies onto a fixed $n$-dimensional function space. At each round, Congrad uses a kernel to update the policy at contexts adjacent to the randomly observed context. The kernel initially has a broad width to enable fast global learning, then progressively narrows for local refinement. We prove that Congrad converges with probability 1 to the optimal policy in the function space, and establish an expected regret bound of $O(T^{\frac{d+1}{d+2}} \log^2 T)$, independent of the action-space dimension $k$. Numerical simulations validate our theoretical guarantees, showing that regret remains stable as the action dimension increases.
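The sketch below illustrates the kind of kernel-weighted projected gradient ascent loop the abstract describes: a policy stored at $n$ fixed reference contexts, updated at each round around the observed context with a kernel whose width shrinks over time, followed by a projection onto the action set. The Gaussian kernel, the bandwidth and step-size schedules, the box-shaped action set, and the toy reward model are all illustrative assumptions for this example, not the paper's specification.

```python
import numpy as np

# Minimal sketch of kernel-based projected gradient ascent in the spirit of
# Congrad. All concrete choices (Gaussian kernel, schedules, box action set,
# toy reward) are assumptions made for illustration only.

rng = np.random.default_rng(0)

d, k, n, T = 2, 3, 50, 2000            # context dim, action dim, #reference contexts, horizon
refs = rng.uniform(0, 1, size=(n, d))  # fixed reference contexts spanning the context space
policy = np.zeros((n, k))              # policy values stored at the reference contexts
lo, hi = -1.0, 1.0                     # box constraints defining the action set

def kernel_weights(c, t):
    """Gaussian kernel centered at the observed context; its width shrinks with t
    (broad early for fast global learning, narrow later for local refinement)."""
    h = max(0.05, 1.0 / (1.0 + 0.01 * t))              # assumed bandwidth schedule
    return np.exp(-np.sum((refs - c) ** 2, axis=1) / (2 * h ** 2))

def reward_gradient(x, c):
    """Placeholder noisy gradient of r(x, c) w.r.t. x: a toy concave quadratic
    whose maximizer varies smoothly with the context."""
    target = np.tanh(c.sum()) * np.ones(k)
    return -(x - target) + 0.1 * rng.standard_normal(k)

for t in range(1, T + 1):
    c_t = rng.uniform(0, 1, size=d)                    # i.i.d. context
    w = kernel_weights(c_t, t)
    x_t = (w[:, None] * policy).sum(0) / w.sum()       # action induced by the current policy
    g = reward_gradient(x_t, c_t)                      # (noisy) gradient feedback
    eta = 1.0 / np.sqrt(t)                             # assumed step-size schedule
    policy += eta * w[:, None] * g                     # kernel-weighted ascent step on nearby contexts
    policy = np.clip(policy, lo, hi)                   # projection onto the action set
```

Per round, the loop touches all $n$ reference points once, matching the $O(n)$ memory and $O(n(k+d))$ per-iteration cost stated in the abstract.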
Area: Learning and Adaptation (LEARN)
Generative AI: I acknowledge that I have read and will follow this policy.
Submission Number: 738