%\documentclass{uai2025} % for initial submission
\documentclass[accepted]{uai2025} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2025} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2025} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage{amsmath,amssymb}
\usepackage{amsthm}
\usepackage{pifont}
\usepackage{algorithm}
\usepackage{graphicx}    % For including images
% \usepackage{subfigure}
% \usepackage{subcaption}
\usepackage{subfig}

\usepackage{algpseudocode}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{remark}[theorem]{Remark}


%%%%user defined macros%%%
\renewcommand{\thefootnote}{\fnsymbol{footnote}}


\newcommand{\cmark}{\ding{51}} 
\newcommand{\xmark}{\ding{55}}
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example


%%%
\makeatletter 
\renewcommand\AB@affilsepx{~~~\protect\Affilfont}
\makeatother

%%%%%Title%%%%
\title{Corruption-Robust Variance-aware Algorithms for Generalized Linear Bandits under Heavy-tailed Rewards}

% The standard author block has changed for UAI 2025 to provide % more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors

% Authors
\author[1]{Qingyuan~Yu}
\author[2,4]{Euijin~Baek}
\author[3]{\href{mailto:<smslixiang@gmail.com>}{Xiang~Li}}
\author[4,5]{\href{mailto:<qsunstats@gmail.com>}{Qiang~Sun}}

% Affiliations
\affil[1]{%
    USTC
}
\affil[2]{%
    University of Alberta
}
\affil[3]{%
    University of Pennsylvania
}
\affil[4]{%
    University of Toronto 
}
\affil[5]{
    MBZUAI
}


\begin{document}
\maketitle

\begin{abstract}

%However, real-world challenges such as heavy-tailed noise, reward corruption, and nonlinear reward functions remain difficult to handle. To address these difficulties, we propose GAdaOFUL, a novel algorithm that leverages adaptive Huber regression to achieve robust performance in generalized linear models (GLMs), where rewards can be nonlinear functions of features. GAdaOFUL achieves a variance-aware regret bound of $\widetilde{\mathcal{O}}\left(d\sqrt{\sum_{t\in[T]}\nu_{t}^{2}} + dC\right)$ under rewards with only bounded second moments, where $d$ is the feature dimension, $T$ is the number of rounds, $\nu_t^2$ represents the reward variance at round $t$, and $C$ denotes the corruption level. The algorithm is adaptive to the problem difficulty, achieving better regret when the cumulative variance is small. Simulations show that GAdaOFUL performs well in practice, validating its robustness and effectiveness.

  

Stochastic linear bandits have recently received significant attention in sequential decision-making.
However, real-world challenges such as heavy-tailed noise, reward corruption, and nonlinear reward functions remain difficult to address. To tackle these difficulties, we propose GAdaOFUL, a novel algorithm that leverages adaptive Huber regression to achieve robustness in generalized linear models (GLMs), where rewards can be nonlinear functions of features. GAdaOFUL achieves a state-of-the-art variance-aware regret bound, scaling with the square root of the cumulative reward variance over time, plus an additional term proportional to the level of corruption. The algorithm adapts to problem complexity, yielding improved regret when the cumulative variance is small. Simulation results demonstrate the robustness and effectiveness of GAdaOFUL in practice. The code is available at \url{https://github.com/NeXAIS/GAdaOFUL}. 

\end{abstract}



\section{Introduction}\label{sec:intro}



In online decision-making, stochastic linear bandits have been a powerful framework for balancing exploration and exploitation in sequential processes \citep{lattimore2020bandit}. However, many real-world applications involve nonlinear relationships between actions and rewards, limiting the effectiveness of standard linear bandit algorithms. To address this limitation, generalized linear models (GLMs) have emerged as a natural extension, allowing expected rewards to be modeled as nonlinear functions of input features via a link function \citep{filippi2010parametric}. This flexibility makes GLMs well-suited for applications such as online advertisements and recommendation
systems \citep{li2010contextual,li2012unbiased}. 







Despite their advantages, designing algorithms that perform well in real-world settings remains challenging. We present some key challenges below.
%\vspace{-0.1in}
\begin{itemize}
	\item \textbf{Heavy-tailed rewards:}
	Many existing methods assume that rewards follow a sub-Gaussian distribution \citep{filippi2010parametric,li2012unbiased,li2017provably,zhou2019learning,lu2021low}, simplifying analysis but failing to capture the variability commonly observed in practice. In financial markets, for instance, extreme returns occur far more frequently than expected under a normal distribution, a characteristic known as heavy-tailed behavior \citep{cont2000herd, foss2011introduction}. Real-world reward distributions often exhibit such properties, leading to poor performance for algorithms based on sub-Gaussian assumptions \citep{bubeck2013bandits}. 
    
	
	\item \textbf{Adversarial corruptions:} 
	Bandit systems are also susceptible to adversarial manipulations or corrupt reward signals, which can significantly degrade their performance. In recommendation systems, for example, adversaries may inject false feedback or manipulate click-through rates,  misleading the algorithm and deteriorating recommendations for genuine users. While recent research has developed corruption-robust algorithms  \citep{lykouris2018stochastic, kapoor2019corruption, bogunovic2020corruption}, many of these approaches do not account for heavy-tailed rewards, where extreme outcomes occur more frequently. Thus, a fundamental challenge remains: designing algorithms that can simultaneously handle corruption and heavy-tailed noise, ensuring robustness in practical settings. Notably, \cite{jun2018adversarial} demonstrate how adversarial manipulations of rewards can dramatically increase regret, underscoring the urgency of robust defenses against adversarial corruptions.
    
	
	\item \textbf{Worst-case analysis:}
	Traditional bandit algorithms often rely on worst-case regret bounds, which tend to be overly conservative. In contrast,  variance-aware regret bounds allow regret to scale with the observed reward variance, rather than an upper bound on reward magnitudes. This provides a more refined measure of problem complexity, adapting to the actual variability in rewards. When cumulative reward variance is small, the regret naturally decreases, indicating that the problem is easier to solve. Consequently, variance-aware regret bounds lead to tighter performance guarantees  \citep{zhang2021improved, dai2022variance, di2023variance, li2023varianceaware}. %bounds and improved performance when cumulative variance is low
    
\end{itemize}





This paper investigates whether it is possible to design generalized linear bandit algorithms that simultaneously handle heavy-tailed rewards, resist adversarial corruption, and achieve variance-aware regret bounds. 
Specifically, we consider a setting where the reward follows a generalized linear model (GLM) with heavy-tailed noise. At each round, the algorithm selects an arm \(\phi_t\) from a decision set \(D_t\) and observes a reward \(y_t\), which satisfies 
\[
y_t = f(\langle \phi_t, \theta^*\rangle) + \epsilon_t + c_t,
\]
where $\theta^*$ is an unknown $d$-dimensional true parameter, \(c_t\) represents adversarial corruption, and \(\epsilon_t\) denotes the heavy-tailed noise with bounded variance \(\nu_t^2\).
Our goal is to minimize the cumulative regret over \(T\) rounds, defined as the total loss from not selecting the optimal action sequence:
\begin{align}\label{def_regret}
	\text{Reg}(T) := \sum_{t=1}^T \left[ \sup_{\phi \in D_t} f(\langle \phi, \theta^* \rangle) - f(\langle \phi_t, \theta^* \rangle)\right].
\end{align}




\paragraph{Our Contributions.}


\begin{table*}[!ht]
	\caption{Theoretical comparison with the most closely related works. For an introduction to earlier studies (which often exhibit poorer performance), see the references in these papers. 
		In this table, \(\nu_t^2\) denotes the conditional reward variance for the \(t\)-th reward and \(C = \sum_{t=1}^T |c_t|\) represents the total corruption level with \(C \vee 1 = \max\{C, 1\}\). The worst-case lower bound is \(\Omega(d \sqrt{T} + d C)\) \citep{lattimore2020bandit,bogunovic2021stochastic}.
	}
	\centering
	% \begin{threeparttable}
		\resizebox{\linewidth}{!}{
			\begin{tabular}{c|ccccc}
				\toprule
				\textbf{Method} & \textbf{Reward} & \textbf{Noise} & \textbf{Corruption} 
				& \textbf{Variance-aware} & \textbf{Regret} \\
				\midrule
				\citep{he2022nearly} & linear & Sub-Gaussian & \cmark  & \xmark  & $\widetilde{O}(d \sqrt{T} + d\cdot C)$ \\
				\citep{ye2023corruption} & GLM\footnotemark[1] & Sub-Gaussian & \cmark  & \xmark & $\widetilde{O}(d \sqrt{T} + d\cdot C)$ \\
				\citep{li2023varianceaware} & linear & Finite variance & \xmark & \cmark & $\widetilde{O}(d \sqrt{\sum_{t=1}^T \nu_t^2} + d)$ \\
				\citep{xue2024efficient} & GLM & Finite variance\footnotemark[2] & \xmark & \xmark& $\widetilde{O}(d \sqrt{T})$\\
				\midrule
				Ours, Theorem \ref{correg} & GLM & Finite variance & \cmark & \cmark & $\widetilde{O}(d \sqrt{\sum_{t=1}^T \nu_t^2} + d\cdot (C \vee 1))$ \\
				\bottomrule
		\end{tabular}}
		
		% \begin{tablenotes}
			% \item[a] They consider a general heavy-tailed setting where the rewards have only \((1+\epsilon)\)-th order moments. For a fair comparison, we translate the results to the finite-variance setting.
			% \end{tablenotes}
		% \end{threeparttable}
	
	\label{tab:comparison}
\end{table*}
\footnotetext[1]{\citet{ye2023corruption} consider a general and abstract class $\mathcal{G} $ where $\mathcal{G}$ has a bounded Eluder dimension. Their general regret is $\widetilde{O}(d_0 \sqrt{T} + d_0\cdot C)$ where $d_0$ is the Eluder dimension of $\mathcal{G}$. If $\mathcal{G}$ is the GLM family, $d_0= O(d)$.}
\footnotetext[2]{ \citet{xue2024efficient} consider a general heavy-tailed setting where the rewards have only \((1+\epsilon)\)-th order moments. For a fair comparison, we translate the results to the finite-variance setting.}
\renewcommand{\thefootnote}{\arabic{footnote}}


While many existing works address some of these challenges individually (see Table \ref{tab:comparison} for the most relevant studies), we propose the first algorithm that successfully tackles all three key aspects: handling heavy-tailed rewards, resisting adversarial corruption, and achieving variance-aware regret bounds.



Specifically, we introduce GAdaOFUL (Generalized Adaptive Huber Regression-based OFUL, see Algorithm \ref{alg:main}), a novel algorithm designed to  address all three challenges simultaneously. At its core, GAdaOFUL leverages adaptive Huber regression \citep{sun2020adaptive, sun2021we, li2023varianceaware} to mitigate the impact of heavy-tailed noise. To handle potential adversarial corruption, the algorithm carefully scales each residual error by selecting an appropriate variance parameter \(\sigma_t^2\).
More precisely, \(\sigma_t^2\) depends on the reward variance \(\nu_t^2\), a sample importance weight \(w_t\), and the total corruption level \(C = \sum_{t=1}^T |c_t|\). Here, \(w_t\) quantifies the significance of the \(t\)-th  sample in improving prediction accuracy.


Beyond integrating these elements, our work introduces two key technical novelties:
\begin{itemize}
    \item \textbf{Nonlinear Extension Framework}: Unlike most previous works that focus on linear settings, we introduce a novel integral-based loss function (Eq. \eqref{eq:com_theta}) that extends naturally to nonlinear GLM cases. Notably, this loss function remains convex, enabling the use of efficient convex optimization techniques, such as (stochastic) gradient descent. This formulation significantly broadens GAdaOFUL's applicability to more complex nonlinear scenarios while maintaining computational efficiency.
    
    \item \textbf{Corruption-Robust Analysis}: 
    Our proof employs a reduction approach, introducing an auxiliary problem that reformulates the corrupted-reward setting into an equivalent corruption-free counterpart. By carefully selecting hyperparameters, we establish a direct connection between optimality in uncorrupted and corrupted cases. Leveraging constrained convex optimization, we derive state-of-the-art regret bounds that remain valid even under adversarial conditions. See Section \ref{sec:proof} for a detailed proof overview. 
\end{itemize}


%Specifically, we introduce an auxiliary problem that transforms the setting with corrupted rewards into an equivalent corruption-free counterpart.


We rigorously prove that GAdaOFUL achieves a state-of-the-art regret bound of 
\[
\widetilde{\mathcal{O}}\left(d\sqrt{\sum_{t \in [T]}\nu_{t}^{2}} + dC +d\right),
\]
even in the presence of heavy-tailed noise and adversarial corruption. Here, $d$ represents the feature dimension, $T$ is the total number of rounds, and $\widetilde{\mathcal{O}}(\cdot)$ hides constant factors and logarithmic terms in $T$. To the best of our knowledge, this work is the first to unify GLMs, heavy-tailed noise, corruption, and variance-awareness into a comprehensive framework. Table~\ref{tab:comparison} provides a comparison with recent state-of-the-art algorithms.



%%%%%%%%Preliminaries%%%%%%
\section{Preliminaries}

\paragraph{Notation.} We denote the \(\ell_2\)-norm in \(\mathbb{R}^d\) by \(\|\cdot\|\), and \(\text{Ball}_d(B)\) represents the \(\ell_2\)-norm ball in \(\mathbb{R}^d\) with radius \(B > 0\). For a positive definite matrix \(H \in \mathbb{R}^{d \times d}\), we define \(\|x\|_H = \sqrt{x^\top H x}\) for a vector \(x \in \mathbb{R}^d\). Additionally, for two positive semidefinite matrices \(H_1\) and \(H_2\), we write \(H_1 \succeq H_2\) if \(H_1 - H_2\) is positive semidefinite.



\paragraph{Generalized Linear Models (GLMs).}

Generalized Linear Models (GLMs), first introduced by \cite{nelder1972generalized}, extend traditional linear regression by allowing more flexible relationships between the response variable \(y\) and predictor variables (or feature vector) \(\phi\). 
In GLMs, the relationship is modeled through a linear predictor, a linear combination of the predictors and unknown coefficients \(\theta\). 
Unlike linear regression, where the conditional mean equals the linear predictor \(\langle \phi, \theta \rangle\), GLMs link the linear predictor to the conditional mean through a link function \(f\):
\[
\mathbb{E}[y|\phi] = f(\langle \phi, \theta \rangle).
\]
The link function \(f(\cdot)\) is typically an increasing, differentiable function. By choosing different link functions, GLMs can model various types of data. For example, Poisson regression is suitable for count data, where the link function is \(f(x) = \exp(x)\) \citep{coxe2009analysis}. For binary outcomes, logistic regression is a natural choice, using the logistic function \(f(x) = \frac{\exp(x)}{1 + \exp(x)}\) \citep{hilbe2011logistic}.


\paragraph{Heavy-Tailed Noise.}

Unlike standard bandit models that assume sub-Gaussian or bounded noise, we consider a more general and realistic setting where the stochastic noise in rewards may exhibit heavy tails. Formally, we assume that the noise sequence $\{\epsilon_t\}$ forms a martingale difference sequence adapted to the filtration $\{\mathcal{F}_{t-1}\}$, satisfying $\mathbb{E}[\epsilon_t \mid \mathcal{F}_{t-1}] = 0$ and $\mathbb{E}[\epsilon_t^2 \mid \mathcal{F}_{t-1}] = \nu_t^2$. This setting accommodates heavy-tailed but variance-bounded noise, which can still exhibit occasional large deviations while being more realistic for applications such as finance, crowdsourcing, and networked systems. This type of noise model has been explored in the context of bandits by \citet{bubeck2013bandits}, and further studied in generalized linear settings by \citet{xue2024efficient}, who developed robust estimation methods for heavy-tailed generalized linear bandits.

\paragraph{Adversarial Corruption.}

In addition to stochastic noise, we consider adversarial corruption to the reward, modeled by an additive term \(c_t\). We make no assumptions about the distribution or structure of $\{c_t\}$ beyond a bound on the total $\ell_1$ budget, i.e., $\sum_{t=1}^T |c_t| \leq C$ for some known constant $C > 0$. Importantly, the corruption is allowed to be adaptive: the adversary may observe the realized (possibly noisy) reward before choosing \(c_t\). This corruption model captures various real-world scenarios such as data poisoning, faulty sensor readings, or malicious manipulations. Similar corruption-resilient formulations have been explored in stochastic multi-armed bandits \citep{lykouris2018stochastic} and linear bandits \citep{bogunovic2021stochastic,NEURIPS2022_df5f94d6}, though these works typically assume bounded or sub-Gaussian noise.

\paragraph{Problem Setting and Model Assumptions.}

We study a stochastic generalized linear bandit model with heavy-tailed noise and adversarial corruption. Let \(\{D_t\}_{t \geq 1}\) represent a predetermined sequence of decision sets and \(\{\mathcal{F}_t\}_{t \geq 1}\) a filtration corresponding to the information available up to time \(t\). At each round \(t\), the agent selects an action \(\phi_t \in D_t\) and observes the reward 
\[
y_t = f(\langle \phi_t, \theta^* \rangle) + \epsilon_t + c_t,
\]
where \(\theta^* \in \mathbb{R}^d\) is an unknown parameter vector, \(\epsilon_t\) is a martingale difference noise with \(\mathbb{E}[\epsilon_t | \mathcal{F}_{t-1}] = 0\) and \(\mathbb{E}[\epsilon_t^2 | \mathcal{F}_{t-1}] = \nu_t^2\), and \(c_t\) is an adversarial corruption. The cumulative corruption level is \(C := \sum_{t=1}^T |c_t|\), which is assumed to be known. Additionally, \(\|\theta^*\| \leq B\) for some bound \(B\), and both \(\phi_t\) and \(\nu_t\) are \(\mathcal{F}_{t-1}\)-measurable with \(\|\phi_t\| \leq L\).
The function \(f\), referred to as the activation function \citep{zhao2023optimal}, is assumed to be an increasing, differentiable function on \([-BL, BL]\), with constants \(k, K \in \mathbb{R}\) such that \(0 < k \leq f'(z) \leq K\) for all \(z \in [-BL, BL]\).
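
\paragraph{Example (logistic activation).} As a concrete illustration of the last assumption (our own worked example, not a result used later), take the logistic function \(f(x) = \exp(x)/(1+\exp(x))\). Its derivative is
\[
f'(x) = f(x)\bigl(1 - f(x)\bigr) = \frac{\exp(x)}{(1+\exp(x))^2},
\]
which is even in \(x\) and decreasing in \(|x|\). Hence, on \([-BL, BL]\) one may take
\[
K = f'(0) = \frac{1}{4}, \qquad k = f'(BL) = \frac{\exp(BL)}{(1+\exp(BL))^2}.
\]
For the exponential function \(f(x) = \exp(x)\) used in Poisson regression, the same reasoning gives \(k = \exp(-BL)\) and \(K = \exp(BL)\). In both cases \(k\) decays exponentially in \(BL\), so factors of \(1/k\) can be large when \(BL\) is large.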




\section{THE GAdaOFUL METHOD}


In this section, we introduce our algorithm, GAdaOFUL, designed to tackle heavy-tailed noise and adversarial attacks. Here, heavy-tailed noise refers to noise for which only the conditional variance is assumed to be finite, while adversarial attacks are deliberate corruptions of the reward signals. Our algorithm is applicable to GLMs and achieves state-of-the-art regret.

\paragraph{Adaptive Huber regression modified for GLMs.}


Our algorithm, GAdaOFUL, is based on adaptive Huber regression \citep{sun2020adaptive}, utilizing the pseudo-Huber loss function \citep{sun2021we} to handle heavy-tailed noise. Specifically, the pseudo-Huber loss is defined as \(\ell_{\tau}(x) = \tau(\sqrt{\tau^2 + x^2} - \tau)\). This loss is a smooth approximation to the Huber loss \citep{huber1992robust}: it behaves like a quadratic penalty for small residuals and like a linear penalty for large ones, while remaining differentiable everywhere. 
%It is also straightforward to verify that \(\ell_{\tau}(x)\) is convex in \(x\).
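
To make the two regimes explicit (a short calculation we add for illustration), write \(\ell_{\tau}(x) = \tau^2\bigl(\sqrt{1 + x^2/\tau^2} - 1\bigr)\). A Taylor expansion in each regime gives
\[
\ell_{\tau}(x) \approx \frac{x^2}{2} \quad \text{for } |x| \ll \tau,
\qquad
\ell_{\tau}(x) \approx \tau |x| - \tau^2 \quad \text{for } |x| \gg \tau,
\]
so \(\tau\) marks the scale at which the loss switches from least-squares behavior to absolute-error behavior. In particular, the slope of \(\ell_{\tau}\) never exceeds \(\tau\) in absolute value, which limits the influence of any single large residual.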

However, the original adaptive Huber regression is designed for linear models \citep{sun2020adaptive, sun2021we}, so its theoretical guarantees do not directly extend to nonlinear models such as GLMs. To address this nonlinearity, we reformulate the pseudo-Huber loss in terms of its derivative with respect to \(x\), which is given by \(\ell'_{\tau}(x) = \frac{\tau x}{\sqrt{\tau^2 + x^2}}\).
At each round, GAdaOFUL first estimates the ground-truth parameter \(\theta^*\) by solving the following optimization problem:
\begin{gather}
	\label{eq:com_theta}
	\theta_t := \operatorname{argmin}_{\theta \in \text{Ball}_d(B)} L_t(\theta), \\
	L_t(\theta) := \frac{\lambda k}{2} \|\theta\|^2 - \sum_{s=1}^{t} \frac{1}{\sigma_s} \int_0^{\langle \phi_s, \theta \rangle} \frac{\tau_s z_s(u)}{\sqrt{\tau_s^2 + z_s^2(u)}} \, du. \nonumber
\end{gather}
Here, \(k > 0\) is a lower bound for \(\min_{|x| \leq B L} f'(x)\), \(z_s(u) = {(y_s - f(u))}/{\sigma_s}\) is the scaled residual error, and \(\sigma_s^2\) is a surrogate for the conditional variance at round \(s\). 

\begin{remark}
The rationale behind the loss function in \eqref{eq:com_theta} lies in its ability to handle the nonlinearity of GLMs while retaining desirable properties from the linear case. In general, a GLM can be interpreted as a form of weighted linear regression. At a high level, our proposed integral-based loss ensures that the weights used are of the same order: they are determined by \(f'\) and are therefore bounded within the interval \([k,K]\) under our setup. This point can be verified by computing the gradient and Hessian of the new loss function, which resemble those of the linear case studied by \citet{li2023varianceaware}. This resemblance enables a natural extension of proof techniques and results from the linear setting to the nonlinear one.
\end{remark}


It is straightforward to verify that 
\begin{itemize}
	\item \(L_t(\theta)\) is convex in \(\theta\), so the optimization problem in \eqref{eq:com_theta} can be efficiently solved by convex solvers.
	\item \(L_t(\theta)\) depends on the adaptive (or varying) values of \(\tau_t\), which are essential for achieving optimal regret, as shown by \citet{li2023varianceaware} for non-corrupted cases.
	\item If \(f(\cdot)\) is the identity function, \(L_t(\theta)\) reduces to the one used by \citet{li2023varianceaware}.
\end{itemize}
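
The first and third claims can be checked by a direct calculation (our own sketch, under the assumptions on \(f\) stated in the problem setting). Writing \(u_s = \langle \phi_s, \theta \rangle\) and differentiating \(L_t\) gives
\begin{align*}
\nabla L_t(\theta) &= \lambda k\, \theta - \sum_{s=1}^{t} \frac{1}{\sigma_s}\, \ell'_{\tau_s}\bigl(z_s(u_s)\bigr)\, \phi_s, \\
\nabla^2 L_t(\theta) &= \lambda k\, I + \sum_{s=1}^{t} \frac{f'(u_s)}{\sigma_s^2}\, \ell''_{\tau_s}\bigl(z_s(u_s)\bigr)\, \phi_s \phi_s^\top,
\end{align*}
where \(\ell''_{\tau}(x) = \tau^3/(\tau^2 + x^2)^{3/2} > 0\) and we used \(z_s'(u) = -f'(u)/\sigma_s\). Since \(|u_s| \leq BL\) and hence \(f'(u_s) \geq k > 0\) for every \(\theta \in \text{Ball}_d(B)\), the Hessian satisfies \(\nabla^2 L_t(\theta) \succeq \lambda k\, I\) on \(\text{Ball}_d(B)\), so \(L_t\) is in fact strongly convex there. When \(f\) is the identity, \(\frac{d}{du}\,\ell_{\tau_s}(z_s(u)) = -\ell'_{\tau_s}(z_s(u))/\sigma_s\), and therefore
\[
-\frac{1}{\sigma_s}\int_0^{u_s} \ell'_{\tau_s}\bigl(z_s(u)\bigr)\, du = \ell_{\tau_s}\!\left(\frac{y_s - u_s}{\sigma_s}\right) - \ell_{\tau_s}\!\left(\frac{y_s}{\sigma_s}\right),
\]
that is, a pseudo-Huber loss on the scaled residual, up to a term that does not depend on \(\theta\).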

\paragraph{Algorithm description.}
Next, we outline the steps of the GAdaOFUL algorithm. At round \(t\), construct a confidence ellipsoid:  
\[
\mathcal{C}_{t-1} := \{\theta \in \text{Ball}_d(B) : \|\theta - \theta_{t-1}\|_{H_{t-1}} \leqslant \beta_{t-1}\},
\]
where \(H_t\) is the shape matrix, and \(\beta_t\) is the exploration radius. It can be proven that \(\theta^*\) lies within \(\mathcal{C}_t\) with high probability for all \(t \geq 0\). Based on this confidence set, select \(\phi_t\) by maximizing the inner product \(\langle \phi, \theta \rangle\):
\begin{align}\label{def_phi}
	(\phi_t, \cdot) = \operatorname{argmax}_{\phi\in D_t, \theta \in \mathcal{C}_{t-1}} \langle \phi, \theta \rangle.
\end{align}
Play the chosen arm, then observe the reward \(y_t\) and its conditional variance \(\nu_t^2\).
Next, compute \((\sigma_t, w_t, \tau_t)\) according to \eqref{eq:paras}, and solve the optimization problem in \eqref{eq:com_theta} to update \(\theta\). Finally, update the shape matrix \(H_t\) and the exploration radius \(\beta_t\), then proceed to the next round.
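
As an illustration of the selection rule \eqref{def_phi} (a standard simplification, not part of the algorithm as stated), suppose the constraint \(\theta \in \text{Ball}_d(B)\) in the confidence set is dropped. By the Cauchy--Schwarz inequality in the \(H_{t-1}\)-norm, the inner maximization then has the closed form
\[
\max_{\|\theta - \theta_{t-1}\|_{H_{t-1}} \leq \beta_{t-1}} \langle \phi, \theta \rangle = \langle \phi, \theta_{t-1} \rangle + \beta_{t-1} \|\phi\|_{H_{t-1}^{-1}},
\]
so that, for a finite decision set, \(\phi_t\) can be found by evaluating this index for each \(\phi \in D_t\). With the ball constraint in place, the right-hand side is an upper bound on the inner maximum.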

\begin{algorithm}[t!]
	\caption{Generalized Adaptive Huber regression based OFUL (GAdaOFUL).}
	\label{alg:main}
	\begin{algorithmic}[1]
		\State \textbf{Constants}: $\lambda = d/B^2, \sigma_{\text{min}} = \frac{1}{\sqrt{T}}, m_0 = \left(6\sqrt{3 \log \frac{2T^2}{\delta}}\right)^{-1}$, and $m_1 = \left(42\log \frac{2T^2}{\delta}\right)^{-1}$.
		\State \textbf{Initialization}:  $H_0 = \lambda I, \theta_0 = \mathbf{0}, \beta_0 = \sqrt{\lambda} B$.
		\For{$t = 1$ \textbf{to} $T$}
		\State Construct the confidence set $\mathcal{C}_{t-1}$.
		\State Solve $(\phi_t, \cdot) = \operatorname{argmax}_{\phi\in D_t, \theta \in \mathcal{C}_{t-1}} \langle \phi, \theta \rangle$.
		\State Play $\phi_t$ and observe $(y_t, \nu_t)$.
		\State Set $\sigma_t, w_t$ and $\tau_t$  according to (\ref{eq:paras}) and record $\{ \sigma_s, w_s, \tau_s : 1 \leq s \leq t \}$.
		\State Compute $\theta_t$ according to \eqref{eq:com_theta}.
		\State Define $\beta_t$ and set $H_t = H_{t-1} + \frac{\phi_t \phi_t^\top}{\sigma_t^2}$.
		\EndFor
	\end{algorithmic}
\end{algorithm}

\paragraph{Parameter selection.}

In the following, we specify the parameters used in Algorithm \ref{alg:main}. The parameter \(\sigma_t\) is a surrogate for the conditional standard deviation (so that \(\sigma_t^2\) is the surrogate conditional variance), \(w_t\) quantifies the importance of the \(t\)-th sample \((y_t, \phi_t, \sigma_t)\), and \(\tau_t\) is a robustification parameter used in pseudo-Huber regression. Their expressions are given as follows:

\begin{equation}\label{eq:paras}
	\begin{gathered}
		\sigma_t = \max \left\{ \nu_t, \sigma_{\min}, {\|\phi_t\|_{H_{t-1}^{-1}}}/{m_0}, \alpha\|\phi_t\|^{1/2}_{H_{t-1}^{-1}}  \right\}, \\
		w_t = \left \|\frac{\phi_t}{\sigma_t}\right \|_{H_{t-1}^{-1}},~\text{and}~\tau_t = \tau_0 \frac{\sqrt{1 + w_t^2}}{w_t},
	\end{gathered}
\end{equation}

where 
\begin{align*}
\alpha=\displaystyle{\max\left\{\frac{\sqrt{LBK}}{m_1^{1/4}d^{1/4}},C^{\frac{1}{2}}\kappa^{-\frac{1}{4}}\right\}},
\end{align*}
\(C= \sum_{t=1}^T |c_t|\) is the corruption level, and \(\kappa=d\cdot\log\left(1+TL^2/(d\lambda\sigma_{\min}^2)\right)\) is a constant.


We briefly explain the selection of \(\sigma_t\), the most important parameter. First, we set \(\sigma_t \geq \nu_t\) so that \(\sigma_t\) is at least the true conditional standard deviation, which keeps the conditional variance of the scaled noise \(\epsilon_t/\sigma_t\) at most 1. Note that we do not require \(\nu_t\) itself to be known; any valid upper bound suffices to ensure the theoretical properties of the adaptive Huber loss. We also impose \(\sigma_t \geq \sigma_{\min}\) to avoid numerical instability.
Additionally, we require \(\sigma_t \geq {\|\phi_t\|_{H_{t-1}^{-1}}}/{m_0}\) to ensure that the importance measure \(w_t\) remains bounded by the constant \(m_0\). Finally, we set \(\sigma_t \geq \alpha\|\phi_t\|^{1/2}_{H_{t-1}^{-1}}\) with a carefully chosen \(\alpha\) to mitigate potential corruptions. This \(\alpha\) depends on \(C^{\frac{1}{2}}\kappa^{-\frac{1}{4}}\), which accounts for the effects of corruption, a factor not considered by \citet{li2023varianceaware}.
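
The role of the last two terms can be read off directly from the definition of \(\sigma_t\) (a back-of-the-envelope check that we add for intuition, not the formal argument). Since \(\sigma_t \geq \|\phi_t\|_{H_{t-1}^{-1}}/m_0\),
\[
w_t = \frac{\|\phi_t\|_{H_{t-1}^{-1}}}{\sigma_t} \leq m_0,
\]
and since \(\sigma_t \geq \alpha \|\phi_t\|_{H_{t-1}^{-1}}^{1/2}\),
\[
\frac{\|\phi_t\|_{H_{t-1}^{-1}}}{\sigma_t^2} \leq \frac{1}{\alpha^2}.
\]
Consequently, the weighted corruption satisfies
\[
\sum_{t=1}^T |c_t|\, \frac{\|\phi_t\|_{H_{t-1}^{-1}}}{\sigma_t^2} \leq \frac{C}{\alpha^2} \leq \sqrt{\kappa},
\]
where the last step uses \(\alpha \geq C^{1/2}\kappa^{-1/4}\). A quantity of this kind is therefore of order at most \(\sqrt{\kappa}\), regardless of how the adversary distributes its budget across rounds.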

\section{REGRET ANALYSIS}
In this section, we present the theoretical analysis for Algorithm \ref{alg:main}.

\subsection{Non-corrupted Case}
To begin, we consider the non-corrupted case where \(C = 0\) and show the results in Theorem \ref{noncor}. 


\begin{theorem}[Uncorrupted Case]\label{noncor}
	Assume \(C = 0\). Let $\lambda=d/B^2$ and \(\kappa = d \cdot \log\left(1 + \frac{TL^2}{d\lambda\sigma_{\min}^2}\right)\). If \(\tau_0\sqrt{\log\left(\frac{2T^2}{\delta}\right)} \geq \max\{\sqrt{2\kappa}, 2\sqrt{d}\}\), then with probability at least \(1 - 3\delta\), it holds that for all \(0 \leq t \leq T\),
	\[
	\|\theta_t - \theta^*\|_{H_t} \leq \beta_t,
	\]
	where
	\begin{align}\label{compute_beta}
		\beta_t = \frac{32}{k} \left[\frac{\kappa}{\tau_0} + \sqrt{\kappa \log\left(\frac{2t^2}{\delta}\right)} + \tau_0 \log\left(\frac{2t^2}{\delta}\right)\right] + 5\sqrt{\lambda} B.
	\end{align}
	Then, with probability at least \(1 - 3\delta\), we have
	\begin{small}
		\begin{align}\label{compute_regret}
			\operatorname{Reg}(T) &\leq 4K\beta_T \left[ \sqrt{\kappa} \cdot \sqrt{\sum_{t \in [T]} \nu_t^2 + 1} + \frac{L\kappa}{m_0^2\sqrt{\lambda}} + \frac{LBK\kappa}{\sqrt{m_1 d}} \right].  
		\end{align}  
	\end{small}
	
\end{theorem}


Similar to previous work \citep{li2023varianceaware}, in Theorem \ref{noncor}, we demonstrate that (i) the true parameter \(\theta^*\) falls within the constructed confidence sets \(\mathcal{C}_{t}\) with high probability, and (ii) the regret scales as \(\operatorname{Reg}(T) = \widetilde{O}(d \sqrt{\sum_{t=1}^T \nu_t^2} + d)\), which matches the results for linear bandits with heavy-tailed rewards \citep{li2023varianceaware}. 
The key takeaway is that even when considering a nonlinear GLM for rewards, using our modified loss in \eqref{eq:com_theta} allows us to maintain the same level of regret performance.
The proof of Theorem \ref{noncor} largely follows the approach of \citet{li2023varianceaware}, utilizing a quadratic approximation of the nonlinear loss \(L_t(\theta)\); see the appendix for the details.

\subsection{Corrupted Case}

Next, we consider the case with corruption, where \(C > 0\) is assumed to be known.\footnote{In fact, it is sufficient to have a valid upper bound for \(\sum_{t=1}^T |c_t|\). If \(C\) is unknown, one can employ the doubling trick \citep{besson2018doubling} to estimate a valid upper bound. After a logarithmic number of guesses, we can reliably determine a true upper bound.} Theorem \ref{corbeta} guarantees high-probability coverage of the confidence set, while Theorem \ref{correg} upper bounds the regret.

\begin{theorem}\label{corbeta}
	Let $\kappa=d\cdot\log\left(1+TL^2/(d\lambda\sigma_{\min}^2)\right)$. If $\tau_0\sqrt{\log({2T^2}/{\delta})}\ge \max\{\sqrt{2\kappa},2\sqrt{d}\}$, then with probability at least $1-3\delta$, it holds that, for all $0\leq t \leq T$,
	\[ \|\theta_t-\theta^*\|_{H_t}\le \beta_t, \]
	where
	\begin{small}
		\begin{align}\label{bound_beta}  
			\beta_t &= \frac{4\sqrt{\kappa}}{k} + \frac{32}{k} \left[ \frac{\kappa}{\tau_0} + \sqrt{\kappa\log\frac{2t^2}{\delta}} + \tau_0\log\frac{2t^2}{\delta} \right] + 5\sqrt{\lambda}B.  
		\end{align}
	\end{small}
	
\end{theorem}

Theorem \ref{corbeta} demonstrates that \(\theta^*\) falls within the set \(\mathcal{C}_{t} := \{\theta \in \mathrm{Ball}_{d}(B) : \|\theta - \theta_{t}\|_{H_{t}} \leqslant \beta_{t}\}\) for any \(t \geq 1\) with high probability. In contrast to the non-corrupted case, the presence of corruption introduces an additional constant term of \(\frac{4\sqrt{\kappa}}{k}\) to \(\beta_t\). However, this term is negligible when \(\tau_0 = \widetilde{O}(\sqrt{d})\), which also leads to \(\beta_t = \widetilde{O}(\sqrt{d})\).

With the above confidence region, the regret bound of Algorithm \ref{alg:main} for corrupted cases is explicitly given as follows.
\begin{theorem}\label{correg}
	Under the conditions of Theorem \ref{corbeta}, with probability at least $1-3\delta$,
\begin{align*}  
\operatorname{Reg}(T) 
&\leq 4K\beta_T \left[ \sqrt{\kappa} \cdot \sqrt{\sum_{t \in [T]}\nu_t^2 + 1} \right. \\  
&\qquad \left. + \frac{L\kappa}{m_0^2\sqrt{\lambda}} + \frac{LBK\kappa}{\sqrt{m_1 d}} + 2C\sqrt{\kappa} \right].  
\end{align*}
	where $\beta_T$ is defined in (\ref{bound_beta}).
\end{theorem}


By setting $\lambda$ and $\tau_0$ carefully, $\operatorname{Reg}(T)$ simplifies to $\widetilde{O}\left(d \sqrt{\sum_{t=1}^T \nu_t^2}+d\cdot (C \vee 1)\right)$, where $C \vee 1 = \max\{ C, 1\}$.

\begin{corollary}\label{non_coro}
	Let $\lambda = {d}/{B^2}$  and $\tau_0= \max \left\{ {\sqrt{2\kappa}},{2\sqrt d} \right\} / \sqrt{\log(2T^2/\delta)}$.
	%, then,  
	% \begin{small}
		% \begin{align*}  
			% \beta_T &\leq \frac{64}{k} \left( 6\sqrt{\kappa\log\left(\frac{2T^2}{\delta}\right)} + \sqrt{d\log\left(\frac{2T^2}{\delta}\right)} \right) + 5\sqrt{d}.
			% \end{align*}
		% \end{small}
	The regret bound in Theorem \ref{correg} becomes
	\begin{small}
		\begin{align*} 
			\operatorname{Reg}(T) = \widetilde{\mathcal{O}}\left(\frac{Kd}{k}\sqrt{\sum_{t\in[T]}\nu_{t}^{2}} + \frac{Kd}{k} \cdot \max \{LBK, C\}\right),
		\end{align*} 
	\end{small}
	where $\widetilde{\mathcal{O}}(\cdot)$ hides constant factors and logarithmic dependence on $T$. 
\end{corollary}
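
As a sanity check on these choices, the following sketch substitutes them into \eqref{bound_beta}. The shorthand $\ell_T := \log(2T^2/\delta)$ is introduced here only for illustration, and the constants are not optimized. Since $\tau_0\sqrt{\ell_T} = \max\{\sqrt{2\kappa}, 2\sqrt{d}\}$ lies between $\sqrt{2\kappa}$ and $\sqrt{2\kappa}+2\sqrt{d}$, and $\sqrt{\lambda}B = \sqrt{d}$, we obtain
\begin{small}
\begin{align*}
	\frac{\kappa}{\tau_0} &\le \sqrt{\frac{\kappa\,\ell_T}{2}}, \qquad
	\tau_0\,\ell_T \le \left(\sqrt{2\kappa}+2\sqrt{d}\right)\sqrt{\ell_T}, \\
	\beta_T &\le \frac{4\sqrt{\kappa}}{k} + \frac{32}{k}\left[\left(\tfrac{1}{\sqrt{2}}+1+\sqrt{2}\right)\sqrt{\kappa\,\ell_T} + 2\sqrt{d\,\ell_T}\right] + 5\sqrt{d}.
\end{align*}
\end{small}
Because $\kappa$ equals $d$ up to a logarithmic factor, this gives $\beta_T = \widetilde{\mathcal{O}}(\sqrt{d}/k + \sqrt{d})$, which is the source of the $d/k$ scaling once $\beta_T$ is multiplied by the $\sqrt{\kappa}$ and $C\sqrt{\kappa}$ terms of Theorem \ref{correg}.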

\paragraph{Comparison with previous works.}

We emphasize that the regret bound in Theorem \ref{correg} and Corollary \ref{non_coro} is the first variance-aware regret bound to simultaneously address heavy-tailed rewards, adversarial corruption, and nonlinear settings. 
While numerous studies have explored bandits under the GLM framework and light-tailed noise scenarios, to our knowledge, none have accomplished all three aspects. 
For a comparison among the most related and competitive results, see Table \ref{tab:comparison}. We discuss their differences below.

In the absence of corruptions (i.e., \(C = 0\)), the regret bound simplifies to \(\widetilde{\mathcal{O}}\left(d\sqrt{\sum_{t \in [T]} \nu_{t}^{2}} + d\right)\), aligning with the results obtained by AdaOFUL \citep{li2023varianceaware}. However, AdaOFUL is restricted to linear bandit problems, while GAdaOFUL is applicable to the more general GLM setting. Recently, \cite{xue2024efficient} introduced an algorithm that addresses heavy-tailed noise within the GLM context; however, their results do not account for adversarial corruption, and their regret fails to be variance-aware, thus lacking adaptability to the problem's difficulty.

In cases where corruptions are present (i.e., \(C > 0\)), \cite{he2022nearly} proposed the CW-OFUL algorithm, which achieves a minimax optimal regret bound of \(\widetilde{\mathcal{O}}\left(d\sqrt{T} + d \cdot C\right)\). This bound is also obtained by \citet{ye2023corruption} for the GLM reward setting.
However, both of them pertain to the worst-case scenario. We argue that our bound, \(\widetilde{\mathcal{O}}\left(d\sqrt{\sum_{t \in [T]} \nu_{t}^{2}} + d \cdot (C \vee 1)\right)\), is significantly more adaptive than theirs. Specifically, if we assume that the variance \(\nu_t\) remains constant and significant (i.e., \(\nu_t = \Theta(1)\) for all \(t \geq 1\)), our regret reduces to theirs, implying that our approach is also minimax optimal in the worst case.
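
To make the comparison concrete, consider two illustrative regimes. This is only a sketch: it ignores logarithmic factors and the dependence on $k$, $K$, $L$, and $B$, and the uniform bound $\nu$ is a hypothetical quantity introduced here for illustration.
\begin{small}
\begin{align*}
	\nu_t \le \nu \text{ for all } t &\ \Longrightarrow\ d\sqrt{\sum_{t \in [T]} \nu_{t}^{2}} + d\,(C \vee 1) \le d\,\nu\sqrt{T} + d\,(C \vee 1), \\
	\nu_t = 0 \text{ for all } t &\ \Longrightarrow\ d\sqrt{\sum_{t \in [T]} \nu_{t}^{2}} + d\,(C \vee 1) = d\,(C \vee 1).
\end{align*}
\end{small}
The first line recovers the worst-case rate \(d\sqrt{T} + d \cdot C\) when \(\nu = \Theta(1)\), while the second shows that in the noiseless regime the bound no longer grows with \(\sqrt{T}\).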


\subsection{Proof Sketch}
\label{sec:proof}
We conclude this section with proof sketches of Theorems \ref{corbeta} and \ref{correg}.


\paragraph{Proof of Theorem \ref{corbeta}.}  
The proof of Theorem \ref{corbeta} consists of two key steps. 
In the first step, we focus on bounding \(\|\nabla L_T(\theta^*)\|_{H_T^{-1}}\). 
Thanks to the loss function \eqref{eq:com_theta}, the gradient estimator can be written as
\[
	\nabla L_T(\theta)=\lambda k \theta - \sum_{t=1}^{T}\frac{\tau_t z_t(\theta)}{\sqrt{\tau_t^2+z_t^2(\theta)}}\frac{\phi_t}{\sigma_t}
\]
where $\displaystyle{z_t(\theta)=\frac{y_t-f(\langle \phi_t, \theta \rangle)}{\sigma_t}}$ is the standardized residual error.
With this expression, we show that, with high probability, 
\begin{equation}
\label{eq:upper}
\|\nabla L_T(\theta^*)\|_{H_T^{-1}} = \widetilde{O}\left( \frac{\kappa}{\tau_0} + \tau_0 + \sqrt{\kappa} + B\sqrt{\lambda} \right)
\end{equation}
holds for any \(T \geq 1\). In other words, \(\|\nabla L_T(\theta^*)\|_{H_T^{-1}}\) is uniformly bounded in terms of \(\tau_0\). To achieve a smaller bound, we should tune \(\tau_0 = \widetilde{O}(\sqrt{\kappa})\), which is precisely what we set in Theorem \ref{corbeta}.

In the second step, we aim to control \(\nabla^2 L_T(\theta)\). More specifically, we demonstrate that, with high probability, for all \(T \geq 0\) and any \(\|\theta\| \leq B\),
\begin{equation}
\label{eq:lower}
\nabla^2 L_T(\theta) \succeq \frac{k}{4} H_T.
\end{equation}
A similar lower bound to \eqref{eq:lower} appears in previous work \citep{li2023varianceaware}; however, our loss formulation in \eqref{eq:com_theta} enables its extension to GLMs.
Combining these two steps in \eqref{eq:upper} and \eqref{eq:lower}, we apply the mean value theorem and obtain that
\[
\nabla L_T(\theta_T) - \nabla L_T(\theta^*) = \nabla^2 L_T(\theta_T^*) (\theta_T - \theta^*)
\]
for some vector \(\theta_T^*\) satisfying \(\|\theta_T^*\| \leq B\). The first-order stationary condition of the constrained convex optimization in \eqref{eq:com_theta} implies that $\langle \nabla L_T(\theta_T), \theta_T - \theta^* \rangle \leq 0.$

Combining all the above results, we have:
\begin{align*}
	\frac{k}{4} \|\theta_T-\theta^*\|_{H_T}^2
	\le&\langle \nabla^2{L}_T(\theta_T^*)({\theta}_T-{\theta^*}), {\theta}_T-{\theta^*}\rangle \\
	=&\langle \nabla{L}_T(\theta_T)-\nabla{L}_T(\theta^*), {\theta}_T-{\theta^*}\rangle\\
	\le&\langle-\nabla{L}_T(\theta^*), {\theta}_T-{\theta^*}\rangle\\
	\le&\|\nabla L_T(\theta^*)\|_{H_T^{-1}} \cdot \|\theta_T-\theta^*\|_{H_T}.
\end{align*}
Dividing both sides by \(\|\theta_T-\theta^*\|_{H_T}\) yields
\begin{align}\label{bound_theta}
	&\|\theta_T - \theta^*\|_{H_T} 
	\leq \frac{4}{k} \|\nabla L_T(\theta^*)\|_{H_T^{-1}}\nonumber\\
	&\qquad \qquad =\widetilde{O}\left( \frac{\kappa}{\tau_0} + \tau_0 + \sqrt{\kappa} + B\sqrt{\lambda} \right). 
\end{align}   
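
The last step in the chain of inequalities preceding \eqref{bound_theta} is the Cauchy--Schwarz inequality in the geometry induced by \(H_T\). As a brief sketch, assuming the usual convention \(\|x\|_{A} = \sqrt{x^\top A x}\) and that \(H_T\) is symmetric positive definite, for any \(u, v \in \mathbb{R}^d\),
\begin{small}
\begin{align*}
	\langle u, v \rangle
	= \left\langle H_T^{-1/2} u,\, H_T^{1/2} v \right\rangle
	\le \left\| H_T^{-1/2} u \right\|_2 \left\| H_T^{1/2} v \right\|_2
	= \|u\|_{H_T^{-1}} \|v\|_{H_T}.
\end{align*}
\end{small}
The sketch applies this with \(u = -\nabla L_T(\theta^*)\) and \(v = \theta_T - \theta^*\).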





\paragraph{Proof of Theorem \ref{correg}.}
By the Lipschitz continuity of the (nonlinear) link function $f$ (i.e., $\sup_{|x| \le BL}f'(x) \le K$), we bound the regret in \eqref{def_regret} by
\begin{align*}
	\operatorname{Reg}(T)&\le K\sum_{t=1}^{T}\left[\sup_{\phi\in\mathcal{D}_{t}}\left\langle\phi,\theta^{*}\right\rangle-  \left\langle\phi_{t},\theta^{*}\right\rangle\right].
\end{align*}
To bound the right-hand side, we apply a standard argument:
\begin{align*}
	&\sum_{t=1}^{T}\left[\sup_{\phi\in\mathcal{D}_{t}}\left\langle\phi,\theta^{*}\right\rangle-\left\langle\phi_{t},\theta^{*}\right\rangle\right]\\
	\leqslant&\sum_{t=1}^{T}\left[\sup_{\phi\in\mathcal{D}_{t},\theta\in\mathcal{C}_{t-1}}\langle\phi,\theta\rangle-\left\langle\phi_{t},\theta^{*}\right\rangle\right]\\
	\overset{(a)}{=}&\sum_{t=1}^{T}\left[\sup_{\theta\in\mathcal{C}_{t-1}}\left\langle\phi_{t},\theta\right\rangle-\left\langle\phi_{t},\theta^{*}\right\rangle\right]\\
	\leqslant
	&\sum_{t=1}^{T}\left\|\phi_{t}\right\|_{H_{t-1}^{-1}}\cdot\sup_{\theta\in\mathcal{C}_{t-1}}\left\|\theta-\theta^{*}\right\|_{H_{t-1}}\\
	\overset{(b)}{\le}  &2\beta_{T} \cdot \sum_{t=1}^{T}\left\|\phi_{t}\right\|_{H_{t-1}^{-1}}
	\overset{(c)}{=} 2\beta_{T} \cdot \sum_{t=1}^T \sigma_t w_t,
\end{align*}
where $(a)$ uses the selection rule for $\phi_t$ in \eqref{def_phi} and $(b)$ follows from the definition of $\mathcal{C}_{t-1}$ together with Theorem \ref{corbeta}. Specifically, for any $\theta \in \mathcal{C}_{t-1}$,
\begin{align*}
	\left\|\theta-\theta^{*}\right\|_{H_{t-1}}
	&\le \left\|\theta-\theta_{t-1}\right\|_{H_{t-1}} 
	+ \left\|\theta_{t-1}-\theta^{*}\right\|_{H_{t-1}}\\
	&\le 2 \beta_{t-1} \le 2\beta_T.
\end{align*}
The equality $(c)$ uses the definition of $w_t$ from \eqref{eq:paras}, namely $w_t = \left \|{\phi_t}\right \|_{H_{t-1}^{-1}}/{\sigma_t}$.


Since $\sigma_t$ is defined as the maximum of several expressions in \eqref{eq:paras}, specifically, $\sigma_t = \max\{\nu_t, \sigma_{1,t}, \sigma_{2,t}, \sigma_{3,t}\}$ for certain $\sigma_{i,t}$ ($i=1,2,3$), we can use the bound $\sigma_t \le \nu_t + \sigma_{1,t} + \sigma_{2,t} + \sigma_{3,t}$. We then bound the remaining sums, either $\sum_{t=1}^T \nu_t w_t$ or $\sum_{t=1}^T \sigma_{i,t} w_t$. 
A key reason our algorithm achieves a finer variance-aware bound is the careful selection of the parameter $\sigma_g$. In particular, the dominant term is bounded as  
\[
\sum_{t=1}^T \nu_t w_t \le \sqrt{\sum_{t=1}^T \nu_t^2} \cdot \sqrt{\sum_{t=1}^T w_t^2} = \widetilde{O}\left(d \cdot \sqrt{\sum_{t=1}^T \nu_t^2}\right).
\]  
We then show that the remaining term, $\sum_{t=1}^T \sigma_{i,t} w_t$, is at most $\widetilde{O}(d) \cdot C$. The specific techniques used to establish these bounds are detailed in Appendix \ref{proof:theorem3}.




\section{Numerical Studies}

\subsection{Experimental Setup}
%gada/
\begin{figure*}[th!]
%\vspace{-10pt}
\subfloat[Linear reward with $C=0$.]{%
\includegraphics[width=0.5\linewidth]{noncor+lin1.pdf}  
\label{subfig:a}%
}\hfill
\subfloat[Linear reward with $C=100$.]
{%
\includegraphics[width=0.5\linewidth]{cor+lin1.pdf} 
\label{subfig:b}%
}\\
\subfloat[Nonlinear reward with $C=0$.]{%
\includegraphics[width=0.5\linewidth]{noncor+nonlin1.pdf} 
\label{subfig:c}%
}\hfill
\subfloat[Nonlinear reward with $C=100e$.]{%
\includegraphics[width=0.5\linewidth]{cor+nonlin1.pdf}  
\label{subfig:d}%
}
\caption{Comparison of online bandit algorithms under four scenarios: linear or nonlinear rewards, with or without corruption. The total number of iterations is set to \(T = 2000\).}
\label{fig:results}
%\vspace{-10pt}
\end{figure*}

We conduct numerical experiments to compare different online bandit algorithms.
We set the dimension to \(d = 10\) and the norm bounds to \(B = 1\) and \(L = 1\). The target vector \(\theta^*\) is randomly chosen from the unit sphere. Each experiment is repeated ten times to reduce the effect of randomness. The remaining components of the setup are as follows.

\begin{itemize}
	\item \textbf{Decision set:} The decision set \(\mathcal{D}_t\) comprises 20 random unit vectors in \(\mathbb{R}^{d}\) (\(|\mathcal{D}_t| = 20\)). Each vector is independently generated in the same manner as \(\theta^*\). Notably, \(\mathcal{D}_t\) is not a fixed set; instead, it is randomly regenerated for each trial.



    
    \item \textbf{Noise distribution:} We draw the noise from a \(t\)-distribution with \(v = 3\) degrees of freedom (denoted as \(t_3\)). The probability density function of the \(t\)-distribution with \(v\) degrees of freedom is given by
	\[
	p(x) = \frac{\Gamma\left(\frac{v+1}{2}\right)}{\sqrt{v\pi}\, \Gamma\left(\frac{v}{2}\right)} \left(1 + \frac{x^2}{v}\right)^{-\frac{v+1}{2}},
	\]
	where \(\Gamma\) is the gamma function. The variance of \(t_3\) is 3, while moments of order \(k \geq 3\) do not exist. Compared to Gaussian noise, the \(t_3\)-distribution better simulates real-world stochastic disturbances with occasional extreme values, challenging algorithms to be robust under non-sub-Gaussian conditions. This choice allows us to test the heavy-tail robustness of the algorithms.
   

	
	\item \textbf{Nonlinear function:} We select the exponential function \(y = \exp(x)\) for the mapping \(f\). This function is commonly used in various exponential models and is monotonically increasing. Within the interval \(x \in [-1, 1]\), the derivative values range from \(\exp(-1)\) to \(\exp(1)\), which we denote as \(k\) and \(K\), respectively. Additional experimental results for three other nonlinear link functions, which yield similar findings, are provided in Appendix~\ref{appendix:nonlinear}.
	
	\item \textbf{Corruption:} To simulate corruption, we employ the flipping technique \citep{bogunovic2021stochastic} during the first \(n\) steps. Specifically, each reward \(y = f(\langle \theta^*, \phi \rangle) + \epsilon\), where \(\epsilon\) is noise from the \(t_3\) distribution, is flipped to \(y' = -f(\langle \theta^*, \phi \rangle) + \epsilon\). This manipulation misleads the bandit into making the opposite decision regarding the position of \(\theta^*\). It corresponds to a corruption level of \(C = 2Kn\), where \(K\) is the maximum value of \(f'\): since \(|f(x)| \leq K\) on \([-1, 1]\) for both link functions considered here, we have \(|y - y'| = 2|f(\langle \theta^*, \phi \rangle)| \leq 2K\) in each corrupted round. For the linear function, \(K = 1\), while for the exponential function, \(K = e\). In this experiment, we choose \(n = 50\).
\end{itemize}
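
As a worked instance of the corruption budget described in the last item, with \(n = 50\) flipped rounds the stated level \(C = 2Kn\) evaluates to
\[
C_{\mathrm{lin}} = 2 \cdot 1 \cdot 50 = 100,
\qquad
C_{\mathrm{exp}} = 2 \cdot e \cdot 50 = 100e \approx 271.8,
\]
which are the values shown in the panel titles of Figure \ref{fig:results}. The subscripts are labels introduced here for illustration only.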


\subsection{Experimental Results}  

\paragraph{Alternative algorithms.} 
We conducted experiments to compare the performance of the GAdaOFUL algorithm with several competing algorithms, including Greedy \citep{NEURIPS2018_2cfd4560}, OFUL \citep{abbasi2011improved}, CW-OFUL \citep{NEURIPS2022_df5f94d6}, and AdaOFUL \citep{li2023varianceaware}, across diverse conditions. The Greedy algorithm and OFUL are classic bandit learning methods that provide  baselines for our comparisons. CW-OFUL extends OFUL to enhance robustness against corrupted rewards, making it a suitable benchmark  for comparison with GAdaOFUL in corrupted environments. While AdaOFUL performs well under heavy-tailed noise, it does not explicitly address corruption, enabling a clear comparison of its performance against GAdaOFUL in such settings. 



\paragraph{Experimental results.} 
The results are presented in Figure \ref{fig:results}. The $x$-axis represents the number of steps, while the \(y\)-axis indicates the averaged regret over 10 repeated trials. The regret-iteration plot illustrates how regret accumulates as the number of steps increases.

When there is no corruption (i.e., $C = 0$), the results in subfigures (a) and (c) demonstrate that GAdaOFUL achieves the smallest regret among all considered baselines. Notably, in scenarios with linear rewards and no corruption, GAdaOFUL coincides with the previous AdaOFUL, resulting in overlapping curves in subfigure (a). Furthermore, because AdaOFUL is specifically designed for linear rewards, its performance significantly degrades when the underlying reward deviates from a linear model, as shown in subfigure (c).

Next, we consider the case where corruption exists (i.e., \(C > 0\)). Again, the results in subfigures (b) and (d) reveal that GAdaOFUL achieves the smallest regret among all baselines. The original AdaOFUL does not account for nonlinear rewards. To further substantiate the superiority of our method, we also analyze a stronger competitor, AdaOFUL(nonlinear), which employs the same loss function \eqref{eq:com_theta} as GAdaOFUL for computing \(\theta\) and is designed for nonlinear rewards. The only difference is that GAdaOFUL considers potential corruption in the rewards and modifies the selection of \(\sigma\) in \eqref{eq:paras}. 
Interestingly, even with corrupted and nonlinear rewards, GAdaOFUL maintains its superiority, demonstrating its effectiveness in managing both nonlinearities and robustness against corruption.

Additionally, to assess the robustness of our method under a broader range of nonlinear reward structures, we conducted further experiments with alternative nonlinear functions. The results, consistent with our main findings, are presented in Appendix~\ref{appendix:nonlinear}.

In summary, the experimental results consistently show that GAdaOFUL outperforms other algorithms across various conditions. GAdaOFUL exhibits remarkable robustness, particularly in the presence of corruption and heavy-tailed noise, while also remaining effective under nonlinear conditions. These results validate the theoretical foundations of GAdaOFUL and suggest its practical utility in real-world applications where the reliability of feedback may be uncertain or compromised.


\section{Conclusions and Discussions}



In this paper, we introduced GAdaOFUL, a novel online bandit algorithm that achieves a state-of-the-art regret bound of \(\widetilde{\mathcal{O}}\left(d\sqrt{\sum_{t \in [T]} \nu_{t}^{2}} + d \cdot (C \vee 1)\right)\). This bound highlights the algorithm's efficiency in low-variance environments and its resilience against corrupted, nonlinear, and heavy-tailed rewards. Specifically, our results demonstrate that sublinear regret is attainable even in the presence of highly non-standard reward characteristics commonly observed in real-world scenarios, such as adversarial corruption, nonlinearity, and heavy-tailed noise.  Empirical evaluations show that GAdaOFUL  outperforms  existing methods.  

There are several promising directions for future research building on our work. 
\begin{itemize}
\item First, our algorithm is primarily based on adaptive Huber regression \citep{sun2020adaptive}, originally designed for data with only  $1+\delta$ ($\delta\leq 1$) moments. This paper focuses on rewards with bounded variance, aligning with the assumptions of most variance-aware algorithms. A natural direction would be to generalize our results to settings where rewards possess only  $1+\delta$ moments, in the spirit of \citet{huang2024tackling},  which builds upon  \citet{li2023varianceaware}. 

\item Second, a key limitation of this work is the dependence on the assumption that the reward model admits a generalized linear form. However, real-world reward structures often deviate from the GLM framework, introducing significant model  mismatch.  One possible approach is to consider a broader class of reward models, such as nonparametric and neural-network-based reward models.  

\item Third, a promising direction is  to leverage our algorithm as a modular component for tackling more complex tasks, such as linear Markov decision processes (MDPs) \citep{he2023nearly}. 

\item Fourth, it is practically valuable to develop parameter-free algorithms, along the lines of  \cite{sun2021we},  that do not require prior knowledge of problem- or data-dependent constants.

\item  Finally, it is of interest to design more comprehensive evaluations that stress-test the algorithm under diverse and potentially misspecified environments. Investigating the performance of GAdaOFUL under model misspecification could inspire more robust algorithms and provide deeper insights into the limitations of reward-model-based approaches.
\end{itemize}





\iffalse
There are several promising future directions for this work:
\begin{itemize}
	\item First, our algorithmic development is primarily based on adaptive Huber regression \citep{sun2020adaptive}, which originally analyzes data with only \(1 + \epsilon\) moments. In our study, we focus on rewards with bounded variance due to the prevalence of variance-aware algorithms. It would be valuable to extend our results to a more general setting where rewards only have \(1 + \epsilon\) moments, similar to the approach taken by \citep{huang2024tackling} in relation to \citep{li2023varianceaware}. 

    
	\item Second, while our primary focus has been on regret, which measures the efficiency of decision-making over iterations, it is important to explore the statistical properties of the estimates obtained. Recent studies indicate that, for linear and light-tailed rewards, these estimates (computed using adaptively collected data) can be biased, leading to large-sample behavior that deviates from the standard normal distribution due to the central limit theorem \citep{khamaru2021near, lin2023semi, lin2024statistical}. It would be intriguing to investigate how the large-sample behavior changes under the assumptions of heavy-tailed and nonlinear rewards.

    
	\item Third, an intriguing direction would be to leverage our algorithm as a modular component for addressing more complex tasks, such as linear Markov decision processes (MDPs) \citep{he2023nearly}. Additionally, it could be used to develop parameter-free algorithms that function effectively without requiring prior knowledge of parameter constants.
    
     
     \item Finally, a key limitation of our current work is the dependence on the assumption that the reward model follows a generalized linear form. Both our theoretical analysis and empirical evaluation are conducted under this assumption. However, real-world reward structures often deviate from the GLM framework, introducing a significant source of model mismatch. Addressing this gap is an important direction for future research. One possible approach is to consider a broader class of reward models, such as nonparametric methods or neural network-based estimators that can capture more complex patterns. Another is to design more comprehensive evaluations that stress-test the algorithm under diverse and potentially misspecified environments. Studying how GAdaOFUL performs when the reward model is misspecified may lead to more robust algorithms and a deeper understanding of the limits of reward-model-based approaches.
\end{itemize}
\fi



\begin{acknowledgements} % will be removed in pdf for initial submission
Qiang Sun's research is partially supported by the Natural Sciences and Engineering Research Council of Canada (Grant RGPIN-2018-06484), computing resources provided by the Digital Research Alliance of Canada, and MBZUAI. 

\end{acknowledgements}



% References
\bibliography{refer}

\newpage

\onecolumn

\title{Appendix}
\maketitle

In the appendix, we provide the proofs  of the main results and the supporting lemmas. 
\appendix
%\end{document}
\section{Proof of Theorem \ref{noncor}}

We first introduce two lemmas. The first bounds \(\|\nabla L_T(\theta^*)\|_{H_T^{-1}}\), and the second shows that $\nabla^2L_T(\theta)$ is positive definite with high probability.

\begin{lemma}
	Assume \( \mathbb{E}\left[z_{t}^{2}\left(\theta^{*}\right) \mid \mathcal{F}_{t-1}\right] \leqslant b^{2} \) for all \( t \geqslant 1 \), where $\displaystyle{z_t(\theta)=\frac{y_t-f(\langle\phi_t, \theta\rangle)}{\sigma_t}}$. With probability at least \( 1 - \delta \), for all \( T \geqslant 1 \), it follows that
	
	\begin{align}
		\left\| \nabla L_{T}\left(\theta^{*}\right) \right\|_{H_{T}^{-1}} \leqslant 8\left[\frac{\kappa b^{2}}{\tau_{0}} + b\sqrt{\kappa \log \frac{2T^{2}}{\delta}} + \tau_{0} \log \frac{2T^{2}}{\delta}\right] + kB\sqrt{\lambda},
	\end{align}
	
	where \(\kappa=d\cdot\log\left(1+TL^2/(d\lambda\sigma_{\min}^2)\right)\) is a constant.
\end{lemma}

\begin{lemma}
	Assume \( \mathbb{E}\left[z_{t}^{2}\left(\theta^{*}\right) \mid \mathcal{F}_{t-1}\right] \leqslant b^{2} \) for all \( t \geqslant 1 \), where $\displaystyle{z_t(\theta)=\frac{y_t-f(\langle\phi_t, \theta\rangle)}{\sigma_t}}$. If we set
	
	\[
	\tau_{0} \sqrt{\log \frac{2T^{2}}{\delta}} \geqslant \max\{\sqrt{2\kappa} b, 2\sqrt{d}\},
	\]
	
	with probability at least \( 1 - 2\delta \), we have that for all \( T \geqslant 0 \),
	\begin{align}\label{eq:L_positive}
		\nabla^2L_T(\theta)\succeq \frac{k}{4} H_T
		\quad\text{for any} \quad \|\theta\| \leqslant B.
	\end{align}
\end{lemma}






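Since $\sigma_t\ge \nu_t$, where $\nu_t^2$ bounds the conditional variance of $\epsilon_t=y_t-f(\langle\phi_t, \theta^*\rangle)$, we have $\mathbb{E}\left[z_t^2(\theta^*) \mid \mathcal{F}_{t-1}\right] \le 1$, so we set $b=1$ here.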


Let $\theta(\eta)=(1-\eta)\theta^*+\eta\theta_T$. By the integral form of the mean value theorem for vector-valued functions, we have
\begin{align}\label{eq:integral_eta}
	\nabla{L_T}(\theta_T) - \nabla{L_T}(\theta^*) = \int_0^1 \nabla^2 L_T(\theta(\eta))\,d\eta\cdot (\theta_T-\theta^*).
\end{align}

Combining  \eqref{eq:L_positive} and \(\|\theta(\eta)\|\leq B\) for all \(\eta \in [0, 1]\),
it follows that
\begin{align}\label{eq:theorem1_inequality1}
	\frac{k}{4}\left\|\theta_{T}-\theta^{*}\right\|_{H_{T}}^{2} &\leqslant \left\langle\theta_{T}-\theta^{*},\nabla L_{T}\left(\theta_{T}\right)-\nabla L_{T}\left(\theta^{*}\right)\right\rangle.
\end{align}
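
For completeness, here is a sketch of how \eqref{eq:theorem1_inequality1} follows from \eqref{eq:integral_eta}. The shorthand $\Delta := \theta_T - \theta^*$ is introduced here only for illustration. Taking the inner product of \eqref{eq:integral_eta} with $\Delta$ and applying \eqref{eq:L_positive} pointwise in $\eta$ gives
\begin{align*}
	\left\langle \Delta, \nabla L_T(\theta_T) - \nabla L_T(\theta^*) \right\rangle
	&= \int_0^1 \Delta^\top \nabla^2 L_T(\theta(\eta))\, \Delta \, d\eta \\
	&\geqslant \int_0^1 \frac{k}{4}\, \Delta^\top H_T \Delta \, d\eta
	= \frac{k}{4} \|\Delta\|_{H_T}^2 .
\end{align*}
The requirement $\|\theta(\eta)\| \leq B$ holds because $\theta(\eta)$ is a convex combination of $\theta^*$ and $\theta_T$, both of which have norm at most $B$.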

The first-order optimality condition for the constrained convex problem $\theta_T := \operatorname{argmin}_{\theta \in \text{Ball}_d(B)} L_T(\theta)$ implies  
\[
\langle \nabla L_T(\theta_T), \theta_T - \theta^* \rangle \leq 0.
\]
Consequently,
\begin{align}\label{eq:theorem1_inequality2}
	& \left\langle\theta_{T}-\theta^{*},\nabla L_{T}\left(\theta_{T}\right)-\nabla L_{T}\left(\theta^{*}\right)\right\rangle\nonumber\\
	\leqslant &\left\langle\theta_{T}-\theta^{*},-\nabla L_{T}\left(\theta^{*}\right)\right\rangle \nonumber\\
	\leqslant& \left\|\theta_{T}-\theta^{*}\right\|_{H_{T}}\left\|\nabla L_{T}\left(\theta^{*}\right)\right\|_{H_{T}^{-1}}.
\end{align}
By \eqref{eq:theorem1_inequality1} and \eqref{eq:theorem1_inequality2}, we have
\begin{align}\label{eq:nonlinear_bound}
	\|\theta_T-\theta^*\|_{H_T}&\le \frac{4}{k}\|\nabla L_T(\theta^*)\|_{H_T^{-1}} \nonumber\\
	&\le \frac{32}{k}\left[\frac{\kappa}{\tau_0}+\sqrt{\kappa\log\frac{2T^2}{\delta}}+\tau_0\log\frac{2T^2}{\delta}\right]+5B\sqrt\lambda.
\end{align}

Then the regret can be bounded as
\begin{align*}
	\operatorname{Reg}(T)\le 2K\beta_T\left[\sqrt{2\kappa}\cdot \sqrt{\sum_{t \in [T]}\nu_t^2 + 1} + \frac{2L\kappa}{m_0^2\sqrt\lambda}+\frac{2LBK\kappa}{\sqrt{m_1 d}}\right].
\end{align*}
The proof follows that of Theorem \ref{correg}, the only difference being that $C$ is set to 0.



\subsection{Proof of Lemma 1}
Let $\displaystyle{z_t(\theta)=\frac{y_t-f(\langle \phi_t, \theta \rangle)}{\sigma_t}}$.  The gradient is given by
\begin{align*}
	\nabla L_T(\theta)=\lambda k \theta - \sum_{t=1}^{T}\frac{\tau_t z_t(\theta)}{\sqrt{\tau_t^2+z_t^2(\theta)}}\frac{\phi_t}{\sigma_t}.
\end{align*}
By the triangle inequality, we have
\begin{align*}
	\|\nabla L_T(\theta^*)\|_{H_T^{-1}}\le\left\|\lambda k \theta^*\right\|_{H_T^{-1}}+\left\|\underbrace{\sum_{t=1}^{T}\frac{\tau_t z_t(\theta^*)}{\sqrt{\tau_t^2+z_t^2(\theta^*)}}\frac{\phi_t}{\sigma_t}}_{d_T}\right\|_{H_T^{-1}}.
\end{align*}


For the first term $\left\|\lambda k \theta^*\right\|_{H_T^{-1}}$, 
we have $H_T^{-1}\preceq\lambda^{-1} I$, due to $H_T\succeq\lambda I$. Thus, $\left\|\lambda k \theta^*\right\|_{H_T^{-1}}\leq kB \sqrt{\lambda}$.
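
Written out, the bound on the first term is the following short computation, a sketch that uses only $H_T^{-1}\preceq\lambda^{-1} I$ and $\|\theta^*\| \leq B$:
\[
\left\|\lambda k \theta^*\right\|_{H_T^{-1}}^2
= \lambda^2 k^2\, (\theta^*)^\top H_T^{-1} \theta^*
\leq \lambda^2 k^2 \cdot \lambda^{-1} \|\theta^*\|^2
\leq \lambda k^2 B^2 ,
\]
and taking square roots gives $kB\sqrt{\lambda}$.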

For the second term $\left\|d_T\right\|_{H_T^{-1}}$, we have $\left\|d_T\right\|_{H_T^{-1}}\leq \alpha_T$, where

\begin{align*}
	\alpha_T=8\left[\frac{\kappa b^2}{\tau_0}+b\sqrt{\kappa\log\frac{2T^2}{\delta}}+\tau_0\log\frac{2T^2}{\delta}\right].
\end{align*}

The proof of this inequality follows exactly the same steps as the proof of Lemma B.2 in \citep{li2023varianceaware}, and is therefore omitted here for brevity. The reader is referred to \citep{li2023varianceaware} for a detailed proof. 








\subsection{Proof of Lemma 2}
The Hessian is given by
\begin{align*}
	\nabla^2L_T(\theta)&=\lambda kI+\sum_{t=1}^{T}\left(\frac{\tau_t}{\sqrt{\tau_t^2+z_t^2(\theta)}}\right)^3f'\left( \langle \phi_t, \theta \rangle \right)\frac{\phi_t\phi_t^\top}{\sigma_t^2}\\
	&\succeq k\left( \lambda I+\sum_{t=1}^{T}\left(\frac{\tau_t}{\sqrt{\tau_t^2+z_t^2(\theta)}}\right)^3\frac{\phi_t\phi_t^\top}{\sigma_t^2}\right),
\end{align*}
where the inequality holds because $f'(x)\ge k$ for $x \in [-BL,BL]$. We can therefore proceed as in the linear case and decompose the matrix in parentheses into three parts:


\begin{align*}
	& \lambda I+\sum_{t=1}^{T}\left(\frac{\tau_t}{\sqrt{\tau_t^2+z_t^2}}\right)^3\frac{\phi_t\phi_t^\top}{\sigma_t^2}\\
	= & H_{T} - \underbrace{\sum_{t=1}^{T}\left[1-\left(\frac{\tau_{t}}{\sqrt{\tau_{t}^{2}+z_{t}^{2}\left(\theta^{*}\right)}}\right)^{3}\right]\frac{\phi_{t}\phi_{t}^{\top}}{\sigma_{t}^{2}}}_{H_{1, T}}  - \underbrace{\sum_{t=1}^{T}\left[\left(\frac{\tau_{t}}{\sqrt{\tau_{t}^{2}+z_{t}^{2}\left(\theta^{*}\right)}}\right)^{3}-\left(\frac{\tau_{t}}{\sqrt{\tau_{t}^{2}+z_{t}^{2}(\theta)}}\right)^{3}\right]\frac{\phi_{t}\phi_{t}^{\top}}{\sigma_{t}^{2}}}_{H_{2, T}}.
\end{align*}


Then we can prove that ${H_{1, T}}\preceq \frac{1}{4}H_T$ and that
${H_{2, T}}\preceq \frac{1}{2}H_T$. For detailed proofs, we refer the reader to Lemma B.1 in \citep{li2023varianceaware}. Thus,
\begin{align*}
	\nabla^2L_T(\theta)\succeq k\left( \lambda I+\sum_{t=1}^{T}\left(\frac{\tau_t}{\sqrt{\tau_t^2+z_t^2}}\right)^3\frac{\phi_t\phi_t^\top}{\sigma_t^2}\right)\succeq\frac{k}{4} H_T.
\end{align*}
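
The cubed factor in the Hessian comes from differentiating the scalar score function. As a sketch, write $\psi_\tau(z) := \tau z/\sqrt{\tau^2+z^2}$ for $\tau > 0$; this notation is introduced here only for illustration. A direct computation gives
\begin{align*}
	\psi_\tau'(z)
	= \frac{\tau}{\sqrt{\tau^2+z^2}} - \frac{\tau z^2}{(\tau^2+z^2)^{3/2}}
	= \frac{\tau^3}{(\tau^2+z^2)^{3/2}}
	= \left(\frac{\tau}{\sqrt{\tau^2+z^2}}\right)^3 \in (0, 1].
\end{align*}
Combining this with $\nabla_\theta z_t(\theta) = -f'(\langle \phi_t, \theta\rangle)\,\phi_t/\sigma_t$ and the chain rule yields each summand of $\nabla^2 L_T(\theta)$.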




\section{Proof of Theorem \ref{corbeta}}


To prove the main theorems, we introduce an auxiliary problem: 
\begin{align*}
	z_t(s) &= \frac{y_t - f(s)}{\sigma_t}, \ \tilde{z}_t(s) := \frac{y_t - f(s) - c_t}{\sigma_t}. \\
	\tilde{L}_T(\theta) &:= \frac{\lambda k}{2} \|\theta\|^2 + \sum_{t = 1}^T \frac{1}{\sigma_t} \int_0^{\langle \phi_t, \theta \rangle} \frac{\tau_t \tilde{z}_t(s)}{\sqrt{\tau_t^2 + \tilde{z}^2_t(s)}} ds.\\
	\theta_T &= \arg\min_{\|\theta\| \leq B} L_T(\theta), \ \tilde{\theta}_T := \arg\min_{\|\theta\| \leq B} \tilde{L}_T(\theta).
\end{align*}
Here $\tilde{z}_t(s)$ is the standardized difference between the observed value $y_t$ and the prediction $f(s)$ after removing the corruption $c_t$, scaled by the noise level $\sigma_t$. A crucial property of $\tilde{z}_t(s)$ is that $\mathbb{E}[\tilde{z}_t(s)| \mathcal{F}_{t-1}] = 0 $ when $s = \langle \phi_t, \theta^* \rangle$.
Hence $\tilde{L}_T$ is essentially the non-corrupted objective studied in Theorem \ref{noncor}, and we can use \eqref{eq:nonlinear_bound} to bound $\left\|\tilde{\theta}_T-\theta^*\right\|_{H_T}$. 


It therefore remains to relate $\theta_T$ and $\tilde{\theta}_T$:

\begin{align*}
	\frac{k}{4}\|\theta_T - \tilde{\theta}_T\|_{H_T}^2
	\stackrel{(a)}{\leq}&\langle\nabla\tilde{L}_T(\tilde{\theta}_T)-\nabla\tilde{L}_T({\theta}_T),\tilde{\theta}_T-{\theta}_T \rangle \\
	=&\langle\nabla\tilde{L}_T(\tilde{\theta}_T)-\nabla{L}_T({\theta}_T),\tilde{\theta}_T-{\theta}_T \rangle+\langle\nabla{L}_T({\theta}_T)-\nabla\tilde{L}_T({\theta}_T),\tilde{\theta}_T-{\theta}_T \rangle\\
	\stackrel{(b)}{\leq}&\langle\nabla{L}_T({\theta}_T)-\nabla\tilde{L}_T({\theta}_T),\tilde{\theta}_T-{\theta}_T \rangle\\
	=&\sum_{t = 1}^T\left(\frac{\tau_t z_t(\langle \phi_t, \theta_T \rangle)}{\sqrt{\tau_t^2 + z_t^2(\langle \phi_t, \theta_T \rangle)}} - \frac{\tau_t \tilde{z}_t(\langle \phi_t, \theta_T \rangle)}{\sqrt{\tau_t^2 + \tilde{z}_t^2(\langle \phi_t, \theta_T \rangle)}}\right)\left\langle \frac{\phi_t}{\sigma_t},\tilde{\theta}_T-\theta_T\right\rangle\\
	\stackrel{(c)}{\leq}&\sum_{t = 1}^T\left | z_t(\langle \phi_t, \theta_T \rangle) - \tilde{z}_t(\langle \phi_t, \theta_T \rangle) \right |\left|\left\langle \frac{\phi_t}{\sigma_t},\tilde{\theta}_T-\theta_T\right\rangle\right|\\
	=&\sum_{t = 1}^T\frac{\left|c_t\right|}{\sigma_t}\left|\left\langle \frac{\phi_t}{\sigma_t},\tilde{\theta}_T-\theta_T\right\rangle\right|\\
	\stackrel{(d)}{\leq}&\sum_{t = 1}^T\frac{\left|c_t\right|w_t}{\sigma_t}
	\|\theta_T - \tilde{\theta}_T\|_{H_T}\\
	\stackrel{(e)}{\leq}& \sqrt\kappa\|\theta_T - \tilde{\theta}_T\|_{H_T}.
\end{align*}
Inequality $(a)$ uses the mean value theorem and $\displaystyle{\nabla^2L_T(\theta)\succeq \frac{k}{4} H_T}$, in the same way as \eqref{eq:theorem1_inequality1}. The first-order optimality conditions of the constrained convex problems imply that $\langle\nabla\tilde{L}_T(\tilde{\theta}_T),\tilde{\theta}_T-{\theta}_T \rangle\leq 0 $ and $\langle\nabla{L}_T({\theta}_T),{\theta}_T-\tilde{\theta}_T \rangle\leq 0$, which proves inequality $(b)$. Inequality $(c)$ follows from the fact that $\displaystyle{0\leq\frac{d}{dx}\frac{\tau x}{\sqrt{\tau^2 + x^2}}}\leq1$.
Inequality $(d)$ uses $\displaystyle{\left|\left\langle \frac{\phi_t}{\sigma_t},\tilde{\theta}_T-\theta_T\right\rangle\right|\leq \left\|\frac{\phi_t}{\sigma_t}\right\|_{H_T^{-1}}\left\|\tilde{\theta}_T-\theta_T\right\|_{H_T}}$ and $\displaystyle{\left\|\frac{\phi_t}{\sigma_t}\right\|_{H_T^{-1}}\leq \left\|\frac{\phi_t}{\sigma_t}\right\|_{H_{t-1}^{-1}}=w_t}$.
Inequality $(e)$ follows from $\displaystyle{\sigma_t=\frac{\sigma_t^2}{\sigma_t}\geq \frac{C\|\phi_t\|_{H_{t-1}^{-1}}}{\sqrt\kappa\,\sigma_t}=\frac{C w_t}{\sqrt\kappa}}$, which holds because $\sigma_t^2\geq\alpha^2\|\phi_t\|_{H_{t-1}^{-1}}$ and $\alpha^2\geq C/\sqrt\kappa$ by \eqref{eq:paras}, together with $\sum_{t=1}^T|c_t|=C$.
Thus, we have
\[
\left\|\tilde{\theta}_T-\theta_T\right\|_{H_T}\le\frac{4\sqrt{\kappa}}{k}.
\]
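
To make step $(c)$ and the final division explicit, write $\psi_\tau(x):=\tau x/\sqrt{\tau^2+x^2}$ with $\tau>0$; this shorthand is introduced only for this remark and is not used elsewhere. A direct computation gives
\begin{align*}
	\psi_\tau'(x)
	=\frac{\tau}{\sqrt{\tau^2+x^2}}-\frac{\tau x^2}{(\tau^2+x^2)^{3/2}}
	=\frac{\tau^3}{(\tau^2+x^2)^{3/2}}\in(0,1],
\end{align*}
so $\psi_\tau$ is $1$-Lipschitz. By the definitions of $z_t$ and $\tilde{z}_t$, this yields
\[
\left|\psi_{\tau_t}(z_t(s))-\psi_{\tau_t}(\tilde{z}_t(s))\right|\leq\left|z_t(s)-\tilde{z}_t(s)\right|=\frac{|c_t|}{\sigma_t}.
\]
Moreover, setting $x:=\|\theta_T-\tilde{\theta}_T\|_{H_T}\geq 0$ and assuming $k>0$, the chain of inequalities reads $\frac{k}{4}x^2\leq\sqrt{\kappa}\,x$. The claim $x\leq 4\sqrt{\kappa}/k$ is trivial when $x=0$ and follows by dividing by $x$ otherwise.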
Combining this with the upper bound on $\left\|\tilde{\theta}_T-\theta_T^*\right\|_{H_T}$ in \eqref{eq:nonlinear_bound} and the triangle inequality, we obtain
\begin{align*}
	\left\|{\theta}_T-\theta_T^*\right\|_{H_T}&\leq\left\|\tilde{\theta}_T-\theta_T\right\|_{H_T}+\left\|\tilde{\theta}_T-\theta_T^*\right\|_{H_T}\\
	&\leq\frac{4\sqrt{\kappa}}{k}+\frac{32}{k}\left[\frac{\kappa}{\tau_0}+\sqrt{\kappa\log\frac{2T^2}{\delta}}+\tau_0\log\frac{2T^2}{\delta}\right]+5\sqrt\lambda B.
\end{align*}





\section{Proof of Theorem 3}
\label{proof:theorem3}
In this proof, we bound the regret on the high-probability event that the confidence sets $\mathcal{C}_{t}$ contain $\theta^*$.

By the Lipschitz continuity of the (nonlinear) link function $f$ (i.e., $\sup_{|x| \le BL}f'(x) \le K$), we bound the regret from its definition by
\begin{align}\label{eq:bound_regret1}
	\operatorname{Reg}(T):= \sum_{t=1}^T \left[ \sup_{\phi \in \mathcal{D}_t} f(\langle \phi, \theta^* \rangle) - f(\langle \phi_t, \theta^* \rangle)\right]&\le K\sum_{t=1}^{T}\left[\sup_{\phi\in\mathcal{D}_{t}}\left\langle\phi,\theta^{*}\right\rangle-  \left\langle\phi_{t},\theta^{*}\right\rangle\right].
\end{align}
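
The step in \eqref{eq:bound_regret1} can be spelled out as follows. This is a sketch under the assumptions that $f$ is nondecreasing and differentiable on $[-BL,BL]$ and that $|\langle\phi,\theta^*\rangle|\leq BL$ for every $\phi\in\mathcal{D}_t$; the symbols $a$, $b$ and $\xi$ are introduced only for this remark. For $a,b\in[-BL,BL]$ with $a\geq b$, the mean value theorem gives some $\xi$ between $b$ and $a$ with $f(a)-f(b)=f'(\xi)(a-b)\leq K(a-b)$, while for $a<b$ monotonicity gives $f(a)-f(b)\leq 0$. Hence
\begin{align*}
	f(a)-f(b)\leq K\max\{a-b,0\}.
\end{align*}
Taking $a=\langle\phi,\theta^*\rangle$ and $b=\langle\phi_t,\theta^*\rangle$, and noting that $\phi_t\in\mathcal{D}_t$ makes the bracket on the right nonnegative, we get for every $\phi\in\mathcal{D}_t$
\begin{align*}
	f(\langle\phi,\theta^*\rangle)-f(\langle\phi_t,\theta^*\rangle)
	\leq K\left[\sup_{\phi'\in\mathcal{D}_t}\langle\phi',\theta^*\rangle-\langle\phi_t,\theta^*\rangle\right].
\end{align*}
Taking the supremum over $\phi\in\mathcal{D}_t$ on the left and summing over $t$ gives \eqref{eq:bound_regret1}.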
To bound the right-hand side, we apply a standard argument:

\begin{align}\label{eq:bound_regret2}
	&\sum_{t=1}^{T}\left[\sup_{\phi\in\mathcal{D}_{t}}\left\langle\phi,\theta^{*}\right\rangle-\left\langle\phi_{t},\theta^{*}\right\rangle\right]\nonumber\\
	\leqslant&\sum_{t=1}^{T}\left[\sup_{\phi\in\mathcal{D}_{t},\theta\in\mathcal{C}_{t-1}}\langle\phi,\theta\rangle-\left\langle\phi_{t},\theta^{*}\right\rangle\right]\nonumber\\
	\overset{(a)}{=}&\sum_{t=1}^{T}\left[\sup_{\theta\in\mathcal{C}_{t-1}}\left\langle\phi_{t},\theta\right\rangle-\left\langle\phi_{t},\theta^{*}\right\rangle\right]\nonumber\\
	\leqslant
	&\sum_{t=1}^{T}\left\|\phi_{t}\right\|_{H_{t-1}^{-1}}\cdot\sup_{\theta\in\mathcal{C}_{t-1}}\left\|\theta-\theta^{*}\right\|_{H_{t-1}}\nonumber\\
	\overset{(b)}{\le}  &2\beta_{T} \cdot \sum_{t=1}^{T}\left\|\phi_{t}\right\|_{H_{t-1}^{-1}}
	\overset{(c)}{=} 2\beta_{T} \cdot \sum_{t=1}^T \sigma_t w_t,
\end{align}

where $(a)$ uses the selection rule for $\phi_t$, namely $(\phi_t, \cdot) = \operatorname{argmax}_{\phi\in \mathcal{D}_t, \theta \in \mathcal{C}_{t-1}} \langle \phi, \theta \rangle$, and $(b)$ follows from the definition of $\mathcal{C}_{t-1}$ together with $\theta^*\in\mathcal{C}_{t-1}$ on the coverage event. Specifically, for any $\theta \in \mathcal{C}_{t-1}$,
\begin{align*}
	\left\|\theta-\theta^{*}\right\|_{H_{t-1}}
	&\le \left\|\theta-\theta_{t-1}\right\|_{H_{t-1}} 
	+ \left\|\theta_{t-1}-\theta^{*}\right\|_{H_{t-1}}\le 2 \beta_{t-1} \le 2\beta_T.
\end{align*}
Equality $(c)$ uses the definition of $w_t$, namely $w_t = \left \|{\phi_t}\right \|_{H_{t-1}^{-1}}/{\sigma_t}$.
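
For completeness, the unlabeled inequality in \eqref{eq:bound_regret2} is the Cauchy--Schwarz inequality in the norm induced by $H_{t-1}$. The following sketch assumes that $H_{t-1}$ is symmetric positive definite and that $\|x\|_{A}=\sqrt{x^\top A x}$. For any $\theta\in\mathcal{C}_{t-1}$,
\begin{align*}
	\langle\phi_t,\theta-\theta^*\rangle
	&=\left\langle H_{t-1}^{-1/2}\phi_t,\,H_{t-1}^{1/2}(\theta-\theta^*)\right\rangle\\
	&\leq\left\|H_{t-1}^{-1/2}\phi_t\right\|_2\left\|H_{t-1}^{1/2}(\theta-\theta^*)\right\|_2
	=\left\|\phi_t\right\|_{H_{t-1}^{-1}}\left\|\theta-\theta^*\right\|_{H_{t-1}}.
\end{align*}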


It remains to bound $\sum_{t=1}^T \sigma_t w_t$.

Since $\sigma_t\ge {\|\phi_t\|_{H_{t-1}^{-1}}}/{m_0} $ and $m_0\le 1 $, we have \(w_{t} \leqslant m_0 \leqslant 1\). Notice that \(\frac{\left\|\phi_{t}\right\|}{\sigma_{t}} \leqslant \frac{\left\|\phi_{t}\right\|}{\sigma_{\min}} \leqslant \frac{L}{\sigma_{\min}}\). Then by Lemma \ref{bound_w_t},
\begin{align} \label{eq:bound_w}
	\sum_{t=1}^{T}w_{t}^2= \sum_{t=1}^{T}\min\left\{1, w_{t}^{2}\right\}=\sum_{t=1}^{T}\min\left\{1,\left\|\frac{\phi_{t}}{\sigma_{t}}\right\|_{H_{t-1}^{-1}}^{2}\right\}  \leqslant 2 d\log\left(1 + \frac{T L^{2}}{d\lambda\sigma_{\min}^{2}}\right) = 2\kappa,
\end{align}
where \(\kappa=d\cdot\log\left(1+TL^2/(d\lambda\sigma_{\min}^2)\right)\) is a constant.
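
The way Lemma \ref{bound_w_t} enters \eqref{eq:bound_w} can be made explicit. The sketch below assumes that $H_t=\lambda I+\sum_{s=1}^{t}\phi_s\phi_s^\top/\sigma_s^2$, a definition that is not restated in this appendix. Applying the lemma with $x_t=\phi_t/\sigma_t$ gives $Z_t=H_t$, and the norm bound $\|x_t\|\leq L/\sigma_{\min}$ replaces $L$ by $L/\sigma_{\min}$:
\begin{align*}
	\sum_{t=1}^{T}\min\left\{1,\left\|\frac{\phi_t}{\sigma_t}\right\|_{H_{t-1}^{-1}}^{2}\right\}
	\leq 2d\log\left(\frac{d\lambda+TL^2/\sigma_{\min}^2}{d\lambda}\right)
	=2d\log\left(1+\frac{TL^2}{d\lambda\sigma_{\min}^2}\right).
\end{align*}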


Recall the definition of $\sigma_t$:

\begin{equation}\label{eq:paras}
	\begin{gathered}
		\sigma_t = \max \left\{ \nu_t, \sigma_{\min}, {\|\phi_t\|_{H_{t-1}^{-1}}}/{m_0}, \alpha\|\phi_t\|^{1/2}_{H_{t-1}^{-1}}  \right\}, 
	\end{gathered}
\end{equation}

where \(\alpha=\displaystyle{\max\left\{\frac{\sqrt{LBK}}{m_1^{1/4}d^{1/4}},C^{\frac{1}{2}}\kappa^{-\frac{1}{4}}\right\}}\), \(C= \sum_{t=1}^T |c_t|\) is the corruption level.

Depending on which term attains the maximum in \eqref{eq:paras}, we cover $[T]$ by three sets, $[T] \subseteq \cup_{i=1}^{3} \mathcal{J}_{i}$, where

\begin{align*}
	\mathcal{J}_1 &= \left\{ t \in [T] : \sigma_t \in \left\{ \nu_t, \sigma_{\min} \right\} \right\}, \\
	\mathcal{J}_2 &= \left\{ t \in [T] : \sigma_t = \frac{\left\| \phi_t \right\|_{H_{t-1}^{-1}}}{m_0} \right\}, \\
	\mathcal{J}_3 &= \left\{ t \in [T] : \sigma_t = \alpha\|\phi_t\|^{1/2}_{H_{t-1}^{-1}} \right\}.
\end{align*}

First, for the rounds in \(\mathcal{J}_1\),


\begin{align}\label{eq:regret1}
	\sum_{t\in\mathcal{J}_{1}}\sigma_{t} w_t &\leqslant \sum_{t\in\mathcal{J}_{1}}\max\left\{\nu_{t},\sigma_{\min}\right\}w_t \nonumber\\
	&\leqslant \sum_{t\in[T]}\max\left\{\nu_{t},\sigma_{\min}\right\}w_t \nonumber\\
	&\stackrel{(a)}{\leqslant} \sqrt{\sum_{t\in[T]}\left(\nu_{t}^{2}+\sigma_{\min}^{2}\right)}\sqrt{\sum_{t\in[T]} w_{t}^{2}} \nonumber\\
	&\stackrel{(b)}{\leqslant} \sqrt{2\kappa}\cdot\sqrt{\sum_{t\in[T]}\nu_{t}^{2}+1}.
\end{align}


Here $(a)$ holds by the Cauchy--Schwarz inequality, and $(b)$ uses \eqref{eq:bound_w} and \(\sigma_{\min}=\frac{1}{\sqrt{T}}\).
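
In more detail, with the shorthand $a_t:=\max\{\nu_t,\sigma_{\min}\}$ (introduced only for this remark), the two steps in \eqref{eq:regret1} read
\begin{align*}
	\sum_{t\in[T]}a_t w_t
	&\leq\sqrt{\sum_{t\in[T]}a_t^2}\sqrt{\sum_{t\in[T]}w_t^2},
	\qquad a_t^2=\max\{\nu_t^2,\sigma_{\min}^2\}\leq\nu_t^2+\sigma_{\min}^2,\\
	\sum_{t\in[T]}\sigma_{\min}^2&=T\cdot\frac{1}{T}=1,
	\qquad \sum_{t\in[T]}w_t^2\leq 2\kappa.
\end{align*}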

Second, for the rounds in \(\mathcal{J}_2\), we have


\begin{align}\label{eq:regret2}
	\sum_{t\in\mathcal{J}_{2}}\sigma_{t} w_{t} &= \frac{1}{m_{0}}\sum_{t\in\mathcal{J}_{2}}\sigma_{t} w_{t}^{2} \leqslant \frac{\sup_{t\in\mathcal{J}_{2}}\sigma_{t}}{m_{0}}\sum_{t\in\mathcal{J}_{2}} w_{t}^{2}\nonumber \\
	&\leqslant \frac{\sup_{t\in[T]}\left\|\phi_{t}\right\|_{H_{t-1}^{-1}}}{m_{0}^{2}}\cdot\sum_{t\in\mathcal{J}_{2}} w_{t}^{2} \nonumber\\
	&\leqslant \frac{\sup_{t\in[T]}\left\|\phi_{t}\right\|_{H_{t-1}^{-1}}}{m_{0}^{2}}\cdot\sum_{t\in[T]} w_{t}^{2}\leqslant \frac{2 L\kappa}{m_{0}^{2}\sqrt{\lambda}},
\end{align}


where the last inequality holds due to \(\left\|\phi_{t}\right\|_{H_{t-1}^{-1}}\leqslant\frac{1}{\sqrt{\lambda}}\left\|\phi_{t}\right\|\leqslant\frac{L}{\sqrt{\lambda}}\) for all \(t\geqslant 1\) and \eqref{eq:bound_w}.
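
The first equality in \eqref{eq:regret2} may look surprising, so we spell it out. The sketch assumes $H_{t-1}\succeq\lambda I$, which is what the norm comparison above relies on. For $t\in\mathcal{J}_2$ we have $\sigma_t=\|\phi_t\|_{H_{t-1}^{-1}}/m_0$, hence
\begin{align*}
	w_t=\frac{\|\phi_t\|_{H_{t-1}^{-1}}}{\sigma_t}=m_0
	\quad\Longrightarrow\quad
	\sigma_t w_t=\sigma_t w_t\cdot\frac{w_t}{m_0}=\frac{1}{m_0}\sigma_t w_t^2,
\end{align*}
and, for the same rounds,
\begin{align*}
	\sigma_t=\frac{\|\phi_t\|_{H_{t-1}^{-1}}}{m_0}\leq\frac{\|\phi_t\|}{m_0\sqrt{\lambda}}\leq\frac{L}{m_0\sqrt{\lambda}}.
\end{align*}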

Finally, for any \( t \in \mathcal{J}_{3} \), we have $\sigma_t^2=\alpha^2\left\| {\phi_{t}} \right\|_{H_{t-1}^{-1}}$, which implies $\sigma_t=\alpha^2 w_t$ due to $w_t=\left\|\frac{\phi_t}{\sigma_t}\right\|_{H_{t-1}^{-1}}$.



Therefore,

\begin{align}\label{eq:regret3}
	\sum_{t \in \mathcal{J}_{3}} \sigma_{t} w_t&=\sum_{t \in \mathcal{J}_{3}}\alpha^2 w_t^2  \le \alpha^2 \sum_{t \in [T]}w_t^2 \nonumber\\
	&\le \left(\frac{L B K}{\sqrt{m_{1} d}}+\frac{C}{\sqrt{\kappa} }\right)\sum_{t \in [T]}w_t^2 \nonumber\\
	&\le \left(\frac{L B K}{\sqrt{m_{1} d}}+\frac{C}{\sqrt{\kappa}} \right)\cdot 2\kappa \nonumber\\
	&=\frac{2L B K\kappa}{\sqrt{m_{1} d}}+2C\sqrt{\kappa}.
\end{align}
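
The second line of \eqref{eq:regret3} uses only the definition of $\alpha$ and the elementary bound $\max\{a,b\}^2\leq a^2+b^2$ for $a,b\geq 0$. Here $a$ and $b$ are shorthand introduced for this remark:
\begin{align*}
	a:=\frac{\sqrt{LBK}}{m_1^{1/4}d^{1/4}},\quad b:=C^{\frac{1}{2}}\kappa^{-\frac{1}{4}}
	\quad\Longrightarrow\quad
	\alpha^2=\max\{a,b\}^2\leq a^2+b^2=\frac{LBK}{\sqrt{m_1 d}}+\frac{C}{\sqrt{\kappa}}.
\end{align*}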

Plugging \eqref{eq:regret1}, \eqref{eq:regret2} and \eqref{eq:regret3} into \eqref{eq:bound_regret1}  and \eqref{eq:bound_regret2}, we have

\[
\operatorname{Reg}(T) \leqslant 2 K\beta_{T} \left[ \sqrt{2\kappa} \cdot \sqrt{\sum_{t \in [T]} \nu_{t}^{2} + 1} + \frac{2 L \kappa}{m_{0}^{2} \sqrt{\lambda}} + \frac{2L B K\kappa}{\sqrt{m_{1} d}}+2C\sqrt{\kappa}\right].
\]

\section{Auxiliary Lemmas}
\begin{lemma}\label{bound_w_t}
	(Lemma 11 in \citep{abbasi2011improved}). Let \(\left\{x_{t}\right\}_{t\geqslant 1}\subset \mathbb{R}^{d}\) and assume \(\left\|x_{t}\right\|\leqslant L\) for all \(t\geqslant 1\). Set \(Z_{t}=\sum_{s=1}^{t} x_{s} x_{s}^{\top}+\lambda I\). Then it follows that
	
	\[
	\sum_{t=1}^{T}\min\left\{1,\left\|x_{t}\right\|_{Z_{t-1}^{-1}}^{2}\right\}\leqslant 2 d\log\left(\frac{d\lambda+T L^{2}}{d\lambda}\right).
	\]
\end{lemma}
\begin{lemma} \label{lin_1} 
	(Lemma B.1 in \citep{li2023varianceaware}).
	Assume $\displaystyle{z_t(\theta)=\frac{y_t-\langle \phi_t, \theta \rangle}{\sigma_t}}$, $\mathbb{E}[z_t | \mathcal{F}_{t-1}] = 0 $  and \(\mathbb{E}\left[z_{t}^{2}\left(\theta^{*}\right)\mid\mathcal{F}_{t-1}\right]\leqslant b^{2}\) for all \(t\geqslant 1\). If we set
	\[
	\tau_{0}\sqrt{\log\frac{2 T^{2}}{\delta}}\geqslant\max\{\sqrt{2\kappa} b, 2\sqrt{d}\},
	\]
	then with probability at least \(1-2\delta\), we have that for all \(T\geqslant 0\),
	\[
	\frac{1}{4} H_{T}\preceq\lambda I+\sum_{t=1}^{T}\left(\frac{\tau_t}{\sqrt{\tau_t^2+z_t^2(\theta)}}\right)^3\frac{\phi_t\phi_t^\top}{\sigma_t^2}\preceq H_{T} \text{ for any } \|\theta\|\leqslant B.
	\]
	
	
\end{lemma}


\begin{lemma}\label{lin_2}
	(Lemma B.2 in \citep{li2023varianceaware}).
	Assume $\displaystyle{z_t(\theta)=\frac{y_t-\langle \phi_t, \theta \rangle}{\sigma_t}}$,  $\mathbb{E}[z_t | \mathcal{F}_{t-1}] = 0 $ and \(\mathbb{E}\left[z_{t}^{2}\left(\theta^{*}\right)\mid\mathcal{F}_{t-1}\right] \leqslant b^{2}\) for all \(t \geqslant 1\). With probability at least \(1-\delta\), for all \(T \geqslant 1\), it follows that
	\[
	\left\|\lambda \theta ^*- \sum_{t=1}^{T}\frac{\tau_t z_t(\theta^*)}{\sqrt{\tau_t^2+z_t^2(\theta^*)}}\frac{\phi_t}{\sigma_t}\right\|_{H_{T}^{-1}} \leqslant 8\left[\frac{\kappa b^{2}}{\tau_{0}} + b\sqrt{\kappa\log\frac{2T^{2}}{\delta}} + \tau_{0}\log\frac{2T^{2}}{\delta}\right] + \sqrt{\lambda}B,
	\]
	where $\kappa=d\cdot\log\left(1+TL^2/(d\lambda\sigma_{\min}^2)\right)$.
\end{lemma}


\section{Additional Experiments on Nonlinear Reward Functions} \label{appendix:nonlinear}

In this section, we present additional experimental results that further demonstrate the effectiveness of GAdaOFUL on other types of nonlinear reward functions. Specifically, we consider the following mappings $f$: (1) logistic function: \(f(x) = \displaystyle{\frac{5}{1 + \exp(-x)}}\), (2) quadratic function: \(f(x) = (x + 1.5)^2\), and (3) logarithmic function: \(f(x) = 3 \log(x + 2)\).

Each function is monotonically increasing and has been appropriately translated and scaled to meet our modeling assumptions.
The corresponding results are presented in the following subsections. As shown, our method consistently achieves the lowest regret, regardless of whether reward corruption is present.

\subsection{Logistic Link Function}

\begin{figure}[H]
    \centering
    \subfloat[No corruption]{%
        \includegraphics[width=0.48\linewidth]{logistic.pdf}
        \label{fig:logistic-clean}
    }
    \hfill
    \subfloat[Corruption level = 100]{%
        \includegraphics[width=0.48\linewidth]{logisticcor.pdf}
        \label{fig:logistic-corrupt}
    }
    \caption{Performance under clean and corrupted settings with the logistic link function.}
    \label{fig:logistic}
\end{figure}

\subsection{Quadratic Link Function}

\begin{figure}[H]
    \centering
    \subfloat[No corruption]{%
        \includegraphics[width=0.48\linewidth]{quad.pdf}
        \label{fig:quad-clean}
    }
    \hfill
    \subfloat[Corruption level = 100]{%
        \includegraphics[width=0.48\linewidth]{quadcor.pdf}
        \label{fig:quad-corrupt}
    }
    \caption{Performance under clean and corrupted settings with the quadratic link function.}
    \label{fig:quadratic}
\end{figure}

\subsection{Logarithmic Link Function}

\begin{figure}[H]
    \centering
    \subfloat[No corruption]{%
        \includegraphics[width=0.48\linewidth]{log.pdf}
        \label{fig:log-clean}
    }
    \hfill
    \subfloat[Corruption level = 100]{%
        \includegraphics[width=0.48\linewidth]{logcor.pdf}
        \label{fig:log-corrupt}
    }
    \caption{Performance under clean and corrupted settings with the logarithmic link function.}
    \label{fig:logarithmic}
\end{figure}

\end{document}


















