% \documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} 
% after acceptance, for a revised
% version; also before submission to
% see how the non-anonymous paper
% would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} 
% ptmx math instead of Computer
% Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} 
% newtx fonts (improves upon
% ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{marvosym}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

% Load macros and some necessary packages
\input{macros}

% Define the external document
% \myexternaldocument{supplement}

% Add line number for twocolumn
% \usepackage[switch]{lineno}
% \linenumbers

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Stochastic Graphical Bandits with Heavy-Tailed Rewards}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<gouyt@lamda.nju.edu.cn>?Subject=Your UAI 2023 paper}{Yutian Gou}}
\author[2]{\href{mailto:<yijinfeng@jd.com>?Subject=Your UAI 2023 paper}{Jinfeng Yi}}
\author[1\thanks{Corresponding author.}]{\href{mailto:<zhanglj@lamda.nju.edu.cn>?Subject=Your UAI 2023 paper}{Lijun Zhang}}

% Add affiliations after the authors
\affil[1]{%
    National Key Laboratory for Novel Software Technology\\
    Nanjing University\\
    Nanjing 210023, China
}
\affil[2]{%
    JD AI Research\\ 
    Beijing 100176\\
    China
}
  
\begin{document}
\maketitle

\begin{abstract}
  We consider stochastic graphical bandits, where after pulling an arm, the decision maker observes the rewards of not only the chosen arm but also its neighbors in a feedback graph. Most existing work assumes that the rewards are drawn from bounded or at least sub-Gaussian distributions, an assumption that may be violated in many practical scenarios such as social advertising and financial markets. To address this issue, we investigate stochastic graphical bandits with heavy-tailed rewards, where the distributions have finite moments of order $1+\epsilon$ for some $\epsilon\in(0, 1]$. First, we develop a UCB-type algorithm whose expected regret is upper bounded by a sum of gap-based quantities over the \textit{clique covering} of the feedback graph. The key idea is to estimate the reward means of the selected arm's neighbors by more refined robust estimators, and to construct a graph-based upper confidence bound for selecting candidates. Second, we design an elimination-based strategy and improve the regret bound to a gap-based sum whose size is controlled by the \textit{independence number} of the feedback graph. For benign graphs, the \textit{independence number} could be smaller than the size of the \textit{clique covering}, resulting in tighter regret bounds. Finally, we conduct experiments on synthetic data to demonstrate the effectiveness of our methods.
\end{abstract}

\section{INTRODUCTION}
As one of the most classical problems in online sequential decision-making, the Multi-Armed Bandits (MAB) problem has been successfully applied to various real-world scenarios such as medical trials \citep{journals/Villar2015,conf/miccai/Gutierrez2017}, news recommendation \citep{conf/www/Li2010}, online advertising \citep{conf/icml/Chen2013,conf/neurips/Xu2013,journals/Schwartz2017}, resource allocation \citep{conf/uai/Lattimore2014}, and online routing \citep{conf/neurips/Kveton2015}. In its original stochastic form \citep{journals/Robbins1952}, at each round $t$, a player selects an arm $i$ from $K$ available candidates and receives a reward generated independently from an unknown but fixed distribution. The player's goal is to minimize the \textit{regret} over $T$ steps of the game, namely the difference between the cumulative reward of the optimal arm in hindsight and that of the chosen arms. To achieve this goal, the player must balance exploration (learning new information about all arms) and exploitation (selecting the optimal arm based on available information). In their seminal work, \cite{journals/Lai1985} establish an $\Omega(K\log T)$ asymptotic regret lower bound and propose a UCB policy that attains this lower bound asymptotically. Over the past decades, a rich body of algorithms and theoretical results for bandits has been developed \citep{journals/Bubeck2012Survey, book/Lattimore2020}.

However, one limitation of the stochastic MAB is that the regret bound scales linearly with $K$, and thus may become vacuous when the arm set is very large. To address this limitation, \cite{conf/neurips/Mannor2011} introduce an important variant of MAB termed Graphical Bandits (GB). In this setting, there is an undirected feedback graph whose node set consists of the $K$ arms and whose edge set encodes the relationship between arms. After pulling an arm, the decision maker observes the rewards of not only the chosen arm but also its neighbors in the graph. Later, \cite{conf/uai/Caron2012} consider the stochastic version of GB. They present UCB-based algorithms for stochastic GB with bounded rewards and provide regret bounds depending on the \textit{clique covering} of the feedback graph, whose size can be much smaller than $K$ for benign graphs. For the same setting, they also establish a lower bound of $\Omega(\log T)$. However, no lower bounds have been established for stochastic GB under the sub-Gaussian setting \citep{conf/colt/problem/Teodor2022}.

While stochastic GB has been extensively studied in the literature \citep{conf/sigmetrics/Buccapatnam2014,journals/jmlr/Buccapatnam2017,conf/icml/Cohen2016,conf/aaai/Tossou2017,conf/aaai/Liu2018a,conf/uai/Liu2018b,conf/uai/Hu2019,conf/alt/Lykouris2020,journals/corr/Marinov2022}, most previous studies assume that the rewards are drawn from either bounded or at least sub-Gaussian distributions. Since sub-Gaussian random variables have exponentially decaying tails, we can use the empirical mean to estimate the reward mean of each arm, and guarantee exponential deviations by standard concentration-of-measure techniques \citep{journals/Hoeffding1963}. However, there exist practical scenarios in which the rewards are not sub-Gaussian but can be modeled by heavy-tailed distributions \citep{book/Foss2011}, such as frequent price fluctuations in financial markets \citep{book/Rachev2003}, preferential attachment in social networks \citep{journals/network/Mahanti2013}, and unevenly distributed clicks on slogans in social advertising \citep{conf/cosn/Park2013}. Unfortunately, as heavy-tailed rewards no longer enjoy exponentially decaying tails, the empirical mean estimator can only provide polynomial concentration \citep{journals/Catoni2012}, making it much harder to estimate the reward mean of each arm.

In this study, we investigate stochastic GB with heavy-tailed rewards, where the reward distributions are assumed to have bounded $(1+\epsilon)$-th moments for some $\epsilon\in(0,1]$. We present two novel algorithms for this setting based on more refined robust estimators. First, we design a UCB-type algorithm named RUNE, whose expected regret is upper bounded by a sum of gap-based quantities over the \textit{clique covering} of the feedback graph. The key idea is to estimate the reward means of the selected arm's neighbors by the truncated empirical mean or the median of means, and to construct a graph-based upper confidence bound for selecting candidates. Second, we propose an elimination-based algorithm termed RAAE and provide a regret bound in the form of a gap-based sum whose size is controlled by the \textit{independence number} of the feedback graph. For benign graphs, the \textit{independence number} could be smaller than the size of the \textit{clique covering}, resulting in tighter regret bounds. To the best of our knowledge, we provide the first regret bounds for stochastic GB with heavy-tailed rewards. Table \ref{table:1} compares our results with previous results for stochastic graphical bandits. The contributions of this work are summarized as follows:

\begin{compactitem}
	\item We propose a novel UCB-type algorithm for stochastic GB with heavy-tailed rewards, named RUNE. Our algorithm obtains a gap-based logarithmic regret bound of $O(\sum_{C\in\C}\frac{v^{1/\epsilon}\Delta_C^{\max}\log (N_C T)}{(\Delta_C^{\min})^{(1+\epsilon)/\epsilon}} + \sum_{C\in\C}\Delta_C^{\max})$, where $\C$ is a clique cover of $G$, $N_{C}$ is a quantity related to clique $C\in\C$, $\Delta_C^{\max}$ is the maximum reward gap of $C$, and $\Delta_C^{\min}$ is the minimum nonzero reward gap of $C$. 
	
	\item To further improve the regret bound, we design an elimination-based algorithm termed RAAE and provide a gap-based logarithmic regret bound of $O( \sum_{i\in S}\frac{v^{1/\epsilon}\log T}{\Delta_{i}^{1/\epsilon}} + \Delta_{\max}\log T )$, where $\Delta_{\max}$ is the maximum suboptimal reward gap, $S$ is a subset of the first $\alpha$ suboptimal arms with ties broken arbitrarily, and $\alpha$ is the \textit{independence number} of $G$. This regret bound is a substantial improvement over RUNE, since the \textit{independence number} could be smaller than the size of the \textit{clique covering} for benign graphs.
	
	\item To demonstrate the effectiveness of our methods, we present experiments on synthetic data comparing RUNE and RAAE with previous algorithms. The empirical results support our theoretical findings. 
\end{compactitem}

\newcolumntype{g}{>{\columncolor{LightCyan}}c}
\begin{table*}
    \caption{Comparison between different algorithms for stochastic graphical bandits. $T$ is the number of rounds, $K$ is the number of arms, $\Delta_i$ is the reward gap of arm $i$, $v$ is an upper bound of the $(1+\epsilon)$-th moments, $\delta$ is the maximum degree in the feedback graph $G$, $\C$ is a clique cover of $G$, $N_{C}$ is a quantity related to clique $C\in\C$, $\Delta_C^{\max}$ is the maximum reward gap of $C$, $\Delta_C^{\min}$ is the minimum nonzero reward gap of $C$, $\Delta_{\max}$ is the maximum suboptimal reward gap, $S$ is a subset of the first $\alpha$ suboptimal arms with the ties broken arbitrarily and $\alpha$ is the \textit{independence number} of $G$. }\label{table:1}
    \centering
    \resizebox{2\columnwidth}{!}{%
    \begin{tabular}{gggg}
    \toprule 
    \rowcolor{white} Algorithm & Regret (bounded in $[0,1]$) & Regret  (bounded $(1+\epsilon)$-th raw moments) &  Regret  (bounded $(1+\epsilon)$-th central moments)  \\
    \midrule
    \rowcolor{white} UCB-N & & &  \\
    \rowcolor{white} \small{\citep{conf/uai/Caron2012}}  & \multirow{-2}{*}{$O(\sum_{C\in\C}\frac{\Delta_C^{\max}\log T}{(\Delta_C^{\min})^2}+K)$} & \multirow{-2}{*}{$\backslash$} & \multirow{-2}{*}{$\backslash$} \\ 
    \rowcolor{white} UCB-NE & & & \\
    \rowcolor{white} \small{\citep{conf/uai/Hu2019}}  & \multirow{-2}{*}{$O(\sum_{C\in\C}\frac{\Delta_C^{\max}\log (N_{C}T)}{(\Delta_C^{\min})^2}+|\C|)$} & \multirow{-2}{*}{$\backslash$} & \multirow{-2}{*}{$\backslash$}\\
    \rowcolor{white} UCB-LP & & &\\ 
    \rowcolor{white} \small{\citep{conf/sigmetrics/Buccapatnam2014}}  & \multirow{-2}{*}{$O(\sum_{i\in D}\frac{\log T}{\Delta_i}+K\delta)$} & \multirow{-2}{*}{$\backslash$} & \multirow{-2}{*}{$\backslash$}\\
    \rowcolor{white} AAE-AlphaSample & & &\\   
    \rowcolor{white} \small{\citep{conf/icml/Cohen2016}}  & \multirow{-2}{*}{$O(\sum_{i\in S}\frac{\log T}{\Delta_{i}})$} & \multirow{-2}{*}{$\backslash$} & \multirow{-2}{*}{$\backslash$}\\
    RUNE-TEM & & & \\
    \small{(Theorem \ref{thm:RUNE-TEM})}& \multirow{-2}{*}{$O(\sum_{C\in\C}\frac{\Delta_C^{\max}\log (N_C T)}{(\Delta_C^{\min})^{2}} + |\C|)$}& \multirow{-2}{*}{$O(\sum_{C\in\C}\frac{v^{1/\epsilon}\Delta_C^{\max}\log (N_C T)}{(\Delta_C^{\min})^{(1+\epsilon)/\epsilon}} + \sum_{C\in\C}\Delta_C^{\max})$} & \multirow{-2}{*}{$\backslash$}\\
     RUNE-MoM & & & \\
    \small{(Theorem \ref{thm:RUNE-MoM})}& \multirow{-2}{*}{$O(\sum_{C\in\C}\frac{\Delta_C^{\max}\log (N_C T)}{(\Delta_C^{\min})^{2}} + |\C|)$}& \multirow{-2}{*}{$\backslash$} & \multirow{-2}{*}{$O(\sum_{C\in\C}\frac{v^{1/\epsilon}\Delta_C^{\max}\log (N_C T)}{(\Delta_C^{\min})^{(1+\epsilon)/\epsilon}} + \sum_{C\in\C}\Delta_C^{\max})$}\\
     RAAE-TEM & & & \\
    \small{(Theorem \ref{thm:RAAE-TEM})}& \multirow{-2}{*}{$O( \sum_{i\in S}\frac{\log T}{\Delta_{i}})$}& \multirow{-2}{*}{$O( \sum_{i\in S}\frac{v^{1/\epsilon}\log T}{\Delta_{i}^{1/\epsilon}} + \Delta_{\max}\log T )$} & \multirow{-2}{*}{$\backslash$}\\
     RAAE-MoM & & & \\
    \small{(Theorem \ref{thm:RAAE-MoM})}& \multirow{-2}{*}{$O( \sum_{i\in S}\frac{\log T}{\Delta_{i}})$}& \multirow{-2}{*}{$\backslash$} & \multirow{-2}{*}{$O( \sum_{i\in S}\frac{v^{1/\epsilon}\log T}{\Delta_{i}^{1/\epsilon}} + \Delta_{\max}\log T )$}\\
    \bottomrule
    \end{tabular}
    }
\end{table*}

\section{PRELIMINARIES AND RELATED WORK}
In this section, we first provide a formal description of our problem setup, and then review related work about stochastic bandits, including stochastic graphical bandits and stochastic bandits with heavy-tailed rewards. 

\subsection{Problem Setup and Definitions}
We consider stochastic GB with a fixed undirected feedback graph $G=(V, E)$, where $V=\{ 1,2,\cdots,K\}$ denotes the arm set, and $E\subseteq V\times V$ encodes the relationship between arms. An edge $(i,j)\in E$ means that when arm $i$ is pulled at round $t$, the player receives the reward of $i$ and also observes the reward of $j$, and vice versa. For each arm $i\in V$, we assume that the reward $X_{i,t}$ at round $t$ is sampled independently from an unknown but fixed distribution $\P_{i}$ with mean $\mu_{i}$ and bounded $(1+\epsilon)$-th raw or central moments, i.e., $\E_{X\sim\P_{i}}[|X|^{1+\epsilon}]\le v$ or $\E_{X\sim\P_{i}}[|X-\mu_i|^{1+\epsilon}]\le v$.

The player's goal is to minimize the (pseudo) \textit{expected regret} over $T$ steps of the game, which is defined as
\begin{equation}\label{def:regret}
	\E[R_{T}] = T\mu^{\star} - \E\left[\sum_{t=1}^T \mu_{I_t}\right] = \sum_{i\in V} \Delta_{i}\E[T_{i}(T)]~, 
\end{equation}
where $\mu^{\star} = \max_{i\in V} \mu_{i}$, $I_t$ is the arm chosen by the player at round $t$, $\Delta_{i}:=\mu^{\star}-\mu_{i}$ denotes the reward gap of arm $i$ relative to the optimal arm, and $T_{i}(T)=\sum_{t=1}^T \Ibb_{\{I_t=i\}}$ refers to the number of pulls for arm $i$ up to time $T$.
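
The decomposition over arms in \eqref{def:regret} can be checked by a short calculation, sketched here for completeness. Writing $\mu^{\star}-\mu_{I_t}=\sum_{i\in V}\Delta_i\Ibb_{\{I_t=i\}}$ and exchanging the order of summation gives
\begin{equation*}
	T\mu^{\star}-\sum_{t=1}^T\mu_{I_t} = \sum_{t=1}^T\sum_{i\in V}\Delta_i\Ibb_{\{I_t=i\}} = \sum_{i\in V}\Delta_i T_i(T)~,
\end{equation*}
and taking expectations over the player's (possibly random) choices yields $\sum_{i\in V}\Delta_i\E[T_i(T)]$.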

Note that the player in MAB can only observe the reward of the selected arm $I_t$ at round $t$, whereas in GB, the rewards of its neighbors can also be observed. In other words, the main difference between GB and MAB is that the number of observations made for arm $i$ until round $T$ is no longer $T_{i}(T)$ in \eqref{def:regret} but 
\begin{equation}
    O_{i}(T)=\sum_{t=1}^T \Ibb_{\{I_t\in N(i)\}} ~,
\end{equation}
where $N(i)$ denotes the set consisting of arm $i$ and its adjacent nodes in $G$. By the definitions, it can be verified that $O_{i}(T)\ge T_{i}(T)$ holds for any feedback graph. Thus, the player can provide a more accurate estimate for the mean of each arm's reward distribution, by utilizing the side information of the feedback graph.
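
Two extreme cases, given here only as an illustration, may help to fix ideas. If $G$ has no edges, then $N(i)=\{i\}$ for every arm and the setting reduces to MAB; if $G$ is complete, then $N(i)=V$ and every arm is observed at every round:
\begin{equation*}
	O_i(T)=T_i(T)\ \ \text{if } E=\emptyset, \qquad O_i(T)=T\ \ \text{if } G \text{ is complete}~.
\end{equation*}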

Before stating existing results, we introduce two standard graph-theoretic definitions \citep{book/West2001}, which will be used to describe regret bounds.

\begin{Def}\label{clique-covering}
    A \textit{clique} in graph $G=(V,E)$ is a subset of vertices $C\subseteq V$ such that every two distinct vertices in $C$ are adjacent. A \textit{clique covering} $\C$ of $G$ is a set of cliques such that $V=\cup_{C\in\C} C$. The \textit{clique covering number} $\bar{\chi}(G)$ is the size of the smallest clique covering of $G$.
\end{Def}

\begin{Def}\label{independent-set}
    An \textit{independent set} in graph $G=(V,E)$ is a subset of vertices $S\subseteq V$ no two of which are connected by an edge. Namely, $S$ is independent if $(u,v)\notin E$ for any $u,v\in S$ with $u\neq v$. The \textit{independence number} $\alpha(G)$ is the size of a maximum independent set in $G$.
\end{Def}

Note that no clique can contain two vertices of an independent set, so each node in a maximum independent set requires its own clique in any clique covering. Thus $\alpha(G)\le\bar{\chi}(G)$ for any graph $G$, and the gap between them can be very large \citep{conf/neurips/Mannor2011}.
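
As a small illustrative example (chosen for exposition only), take the cycle on five vertices, with $V=\{1,\ldots,5\}$ and edges $(1,2),(2,3),(3,4),(4,5),(5,1)$. Any three vertices of this cycle contain two adjacent ones, so a maximum independent set has two vertices, e.g., $\{1,3\}$. The cycle contains no triangle, so each clique has at most two vertices and at least $\lceil 5/2\rceil=3$ cliques are needed to cover $V$; the cliques $\{1,2\},\{3,4\},\{5\}$ form such a covering. Hence
\begin{equation*}
	\alpha(G)=2<3=\bar{\chi}(G)~,
\end{equation*}
so the inequality between the two quantities can be strict even on very small graphs.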

\subsection{Stochastic Graphical Bandits}
To fully exploit the side information of the feedback graph $G$, previous work \citep{conf/uai/Caron2012,conf/sigmetrics/Buccapatnam2014,journals/jmlr/Buccapatnam2017,conf/icml/Cohen2016,conf/aaai/Tossou2017,conf/aaai/Liu2018a,conf/uai/Liu2018b,conf/uai/Hu2019,conf/alt/Lykouris2020} has used the structural information of the feedback graph to characterize their regret bounds.

One category of classical methods for stochastic GB is based on UCB \citep{journals/Lai1985,journals/Agrawal1995,journals/ml/Auer2002a}. For stochastic MAB with bounded rewards, \cite{journals/ml/Auer2002a} propose UCB1 according to the principle of \underline{O}ptimism in the \underline{F}ace of \underline{U}ncertainty (OFU), which attains an optimal regret bound of $O(\sum_{i:\Delta_i>0}\frac{\log T}{\Delta_i} +\sum_{i=1}^K \Delta_i)$. Afterward, \cite{conf/uai/Caron2012} extend UCB1 \citep{journals/ml/Auer2002a} to UCB-N for stochastic GB with bounded rewards, where the main improvement is to update the estimated values of not only the chosen arm but also its neighbors at each round. They show that UCB-N attains an $O(\sum_{C\in\C}\frac{\Delta_C^{\max}\log T}{(\Delta_C^{\min})^2}+\sum_{i=1}^K \Delta_i)$ gap-based regret bound. In addition, they present an $\Omega(\log T)$ regret lower bound for this setting. Later, \cite{conf/uai/Hu2019} modify the index of UCB-N to enlarge the exploration phase and improve the bound to $O(\sum_{C\in\C}\frac{\Delta_C^{\max}\log (N_{C}T)}{(\Delta_C^{\min})^2}+\sum_{C\in\C}\Delta_C^{\max})$, where $N_{C}=\max_{i\in C} |N(i)|^{\frac{1}{4}}$ is determined by the maximum degree of clique $C\in\C$. Note that the sum of gap-based quantities in this bound is taken over the \textit{clique covering} of the feedback graph, instead of the whole arm set.

Besides, there is another class of algorithms for stochastic GB based on the elimination technique \citep{journals/jmlr/Even-Dar2006, journals/pmh/Auer2010}. \cite{conf/sigmetrics/Buccapatnam2014,journals/jmlr/Buccapatnam2017} propose a strategy termed UCB-LP, which leverages a linear program (LP) induced by the feedback graph to explicitly guide the exploration stage. UCB-LP obtains an $O(\sum_{i\in D}\frac{\log T}{\Delta_i}+K\delta)$ gap-based regret bound, where $D$ is a particularly selected dominating set of $G$ (i.e., every node in the graph is either in $D$ or has at least one neighbor in $D$) and $\delta$ is the maximum degree in the feedback graph. Furthermore, they establish an LP-based lower bound, which is also logarithmic with respect to $T$. Later, \cite{conf/icml/Cohen2016} consider a harder setting where the feedback graph may be directed, time-varying, and not entirely revealed to the player. They propose an elimination-based algorithm and obtain an $O(\sum_{i\in S}\frac{\log T}{\Delta_{i}})$ gap-based regret bound, where $S$ is the set of the $\alpha_{\max}\log K$ arms with the smallest gaps and $\alpha_{\max}$ is an upper bound on the \textit{independence number} of the feedback graph over $T$ rounds. Recently, \cite{conf/alt/Lykouris2020} propose a novel layering technique that uses the \textit{independent set} for sampling and derive a similar $O(\sum_{i\in I}\frac{\log^2 T}{\Delta_{i}})$ regret bound for UCB-N, where $I$ is any independent set of $G$. In addition, other work \citep{conf/aaai/Tossou2017,conf/aaai/Liu2018a,conf/uai/Liu2018b,conf/uai/Hu2019,conf/alt/Lykouris2020} applies Thompson Sampling \citep{journals/Thompson1933} to stochastic GB and provides the corresponding theoretical guarantees. 

\subsection{Stochastic Heavy-Tailed Bandits}
\cite{conf/allerton/Liu2011} are the first to investigate stochastic MAB with heavy-tailed rewards. In particular, they consider reward distributions with finite moments of order $1+\epsilon$ for some $\epsilon\in(0,1]$. They propose an algorithm based on a deterministic sequencing of exploration and exploitation, which attains a polynomial regret of $O(T^{\frac{1}{1+\epsilon}})$. In a subsequent work, \cite{journals/tit/Bubeck2013} design a framework termed Robust UCB by replacing the empirical mean in UCB1 \citep{journals/ml/Auer2002a} with more refined robust estimators, such as the truncated empirical mean or the median of means. They obtain the first gap-based logarithmic regret of $O(\sum_{i:\Delta_{i}>0}(\frac{v}{\Delta_{i}})^{\frac{1}{\epsilon}}\log T + \sum_{i=1}^K \Delta_i)$, where $v$ is an upper bound of the $(1+\epsilon)$-th moments. For stochastic MAB with finite variances ($\epsilon=1$), this regret bound recovers the optimal regret under the bounded or sub-Gaussian assumption \citep{journals/Lai1985,journals/ml/Auer2002a}. Besides, \cite{journals/tit/Bubeck2013} also provide a matching lower bound of $\Omega(\Delta_{i}^{-\frac{1}{\epsilon}}\log T)$. Later, \cite{conf/icml/Medina2016} extend the results to stochastic linear bandits with infinite action sets. They design two algorithms, both with sublinear regret bounds, which are subsequently improved to be nearly optimal by \cite{conf/neurips/Shao2018}. Recently, \cite{conf/ijcai/Xue2020} apply robust estimators to design algorithms for stochastic linear bandits with finite action sets and establish nearly optimal sublinear regret bounds. For other settings, robust estimators are also employed by \cite{conf/icml/Lu2019,conf/aistats/Tao2022} to design algorithms for stochastic Lipschitz bandits with heavy-tailed rewards and stochastic MAB with heavy-tailed rewards in the (local) differential privacy model, respectively. 
In addition, heavy-tailed distributions have been extensively studied in the offline setting \citep{journals/aos/Brownlees2015,journals/jmlr/Hsu2016,conf/neurips/Zhang2018a}.

\section{MAIN RESULTS}\label{main}
We first propose two UCB-type algorithms termed RUN and RUNE for stochastic graphical bandits with heavy-tailed rewards. Next, we present an elimination-based algorithm named RAAE with an improved regret bound. All technical lemmas and proofs are deferred to the supplementary material due to space limitations.

\subsection{Robust UCB Strategy with Feedback Graph}\label{main:1}

\begin{algorithm}[tb]
	\caption{RUN-TEM}
	\label{alg:1}
	\begin{algorithmic}[1]
		\STATE {\bfseries Input:} Graph $G = (V,E)$, $\epsilon\in(0, 1]$, $(1+\epsilon)$-th raw moment bound $v$, confidence level $\delta\in(0,1)$    \vspace{.5ex}
		\STATE  {\bfseries Initialize:} Set $O_i(0)=0$ for each arm $i\in V$. Let $\muh_i(t)$ be the estimated mean of arm $i$ based on the $O_i(t)$ rewards of arm $i$ observed up to time $t$ \vspace{1ex}
		
		\FOR{$t = 1, 2, \cdots, T$}
		\STATE Pull arm 
		\begin{equation*}
		    I_t=\argmax_{i\in V} \mbox{UCB}_i(t),
		\end{equation*}
		where $\mbox{UCB}_i(t)$ is computed by \eqref{RUN-TEM:UCB_index}
		
		\STATE Receive reward $X_{I_t,t}$ and observe rewards $X_{k,t}\ (k\in N(I_t))$
		\FOR{arm $k\in N(I_t)$}
		\STATE $O_k(t) = O_k(t-1) + 1$
		\STATE Compute the truncation level $B_{k,t,\delta}$ by \eqref{RUN-TEM:truncation level}
		\STATE Update the estimated value: 
		\begin{equation*}
		\hspace{-1cm}
		    \muh_{k}(t) = \frac{O_k(t-1)\muh_{k}(t-1) + X_{k,t}\Ibb_{\{|X_{k,t}|\le B_{k,t,\delta}\}}}{O_k(t-1)+1}
		\end{equation*}
		\ENDFOR
		\ENDFOR
	\end{algorithmic}
\end{algorithm}

The basic idea behind existing algorithms for stochastic GB is to exploit the side information by sampling through the feedback graph. In this section, we begin with a simple strategy termed RUN and then present an improved algorithm named RUNE.

Following the seminal work of \cite{conf/uai/Caron2012}, we propose the \underline{R}obust \underline{U}CB-\underline{N} (RUN) policy for stochastic GB with heavy-tailed rewards. Since the rewards of each arm are no longer sub-Gaussian, the empirical mean estimator used in UCB-N can only provide polynomial deviations \citep{journals/Catoni2012}. To address this issue, we equip RUN with the \underline{T}runcated \underline{E}mpirical \underline{M}ean (TEM) estimator, which guarantees exponential deviations even for heavy-tailed rewards \citep{journals/tit/Bubeck2013}. The key idea of TEM is to truncate large rewards when computing the average value. Since truncation biases the estimate, we cannot use a fixed truncation level uniformly over time. Instead, we use an increasing sequence of truncation levels for each arm $i\in V$:
\begin{equation}\label{RUN-TEM:truncation level}
    B_{i,t,\delta}=\left(\frac{vO_i(t)}{\log(1/\delta)}\right)^{\frac{1}{1+\epsilon}} ~,
\end{equation}
where $\delta\in(0,1)$ is a confidence level predetermined by the player. At each round $t$, we will compute the average truncated reward of each arm $i\in V$:
\begin{equation}
   \muh_{i}(t) = \frac{\sum_{s=1}^{t} X_{i,s}\Ibb_{\{|X_{i,s}|\le B_{i,s,\delta},\ I_s\in N(i)\}}}{O_{i}(t)} ~,
\end{equation}
which is updated incrementally in our algorithm to reduce the time complexity. Under the truncation level \eqref{RUN-TEM:truncation level}, we can obtain the concentration properties of TEM in the following proposition. 

\begin{prop}\label{prop:1} 
	Let $\delta\in(0,1), \epsilon\in(0, 1]$ be positive parameters. Let
	$X_1,X_2,\cdots,X_n$ be i.i.d.~random variables sampled from a fixed distribution $\P$ with finite mean $\mu$ and bounded $(1+\epsilon)$-th raw moments, i.e., $\E_{X\sim\P}[|X|^{1+\epsilon}] \le v$. Consider the TEM estimator 
	\begin{equation}
	    \muh_{T} = \frac{1}{n}\sum_{t=1}^n X_t\Ibb_{\{|X_t|\le B_{t,\delta}\}}~,
	\end{equation}	
	where $B_{t,\delta}=(\frac{vt}{\log(1/\delta)})^{\frac{1}{1+\epsilon}}$. Then, with probability at least $1-\delta$,
	\begin{equation}\label{prop:1-1}
		\muh_{T} \ge \mu - 5v^{\frac{1}{1+\epsilon}}\left( \frac{\log(1/\delta)}{n}\right)^{\frac{\epsilon}{1+\epsilon}} ~,
	\end{equation}
	and also, with probability at least $1-\delta$,
	\begin{equation}\label{prop:1-2}
		\muh_{T} \le \mu + 5v^{\frac{1}{1+\epsilon}}\left( \frac{\log(1/\delta)}{n}\right)^{\frac{\epsilon}{1+\epsilon}} ~.
	\end{equation}
\end{prop}
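
To give a sense of scale, consider the following worked numbers, where the values of $v$, $\delta$ and $n$ are chosen arbitrarily for illustration. With $\epsilon=1$, $v=1$, $\log(1/\delta)=4$ and $n=100$, the last truncation level and the deviation term in Proposition \ref{prop:1} are
\begin{equation*}
	B_{n,\delta}=\left(\frac{100}{4}\right)^{\frac{1}{2}}=5~, \qquad 5v^{\frac{1}{2}}\left(\frac{4}{100}\right)^{\frac{1}{2}}=1~,
\end{equation*}
whereas for $\epsilon=\frac{1}{2}$ with the same $v$, $\delta$ and $n$ they become
\begin{equation*}
	B_{n,\delta}=\left(\frac{100}{4}\right)^{\frac{2}{3}}\approx 8.55~, \qquad 5v^{\frac{2}{3}}\left(\frac{4}{100}\right)^{\frac{1}{3}}\approx 1.71~.
\end{equation*}
A heavier tail (smaller $\epsilon$) thus leads to a wider confidence interval for the same number of samples. Moreover, combining \eqref{prop:1-1} and \eqref{prop:1-2} by a union bound shows that both inequalities hold simultaneously with probability at least $1-2\delta$.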

By using the concentration properties in Proposition \ref{prop:1}, we construct an upper confidence bound as the sum of the average truncated reward and a confidence term:
\begin{equation}\label{RUN-TEM:UCB_index}
    \mbox{UCB}_i(t) = \muh_{i}(t-1) + 5v^{\frac{1}{1+\epsilon}}\left( \frac{\log(1/\delta)}{O_i(t-1)}\right)^{\frac{\epsilon}{1+\epsilon}} ~,
\end{equation}
where we adopt the convention $1/0=+\infty$ so that every arm is observed at least once. 
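
As an illustration, the index can be made explicit for the confidence level $\delta=\frac{1}{t^4}$ used in Theorem \ref{thm:RUN-TEM}. Since $\log(1/\delta)=4\log t$, substituting into \eqref{RUN-TEM:UCB_index} gives
\begin{equation*}
	\mbox{UCB}_i(t) = \muh_{i}(t-1) + 5v^{\frac{1}{1+\epsilon}}\left( \frac{4\log t}{O_i(t-1)}\right)^{\frac{\epsilon}{1+\epsilon}} ~,
\end{equation*}
and in the finite-variance case $\epsilon=1$ the confidence term simplifies to $10\sqrt{v\log t/O_i(t-1)}$, which has the familiar square-root form of UCB1-type indices up to constants.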

At each round $t$, following the principle of OFU, we first pull the arm $I_t$ with the maximum UCB index defined in \eqref{RUN-TEM:UCB_index}, with ties broken arbitrarily. After that, we receive the reward $X_{I_t,t}$ of the selected arm and also observe the rewards $X_{k,t}$ of all arms $k$ in its neighbor set $N(I_t)$. Finally, we update the observation number $O_k(t)$ and the estimated value $\muh_k(t)$ for all arms $k\in N(I_t)$ using the truncation level defined in \eqref{RUN-TEM:truncation level}. The above procedure is summarized in Algorithm \ref{alg:1}, and is referred to as RUN-TEM. 

Finally, we establish the following expected regret bound for RUN-TEM.

\begin{thm}\label{thm:RUN-TEM}
	Let $G=(V,E)$, $\epsilon\in(0, 1]$ and $v > 0$. If the reward distributions $\P_i$ satisfy 
	\begin{equation}
	\begin{aligned}
         \E_{X\sim \P_i} \left[|X|^{1+\epsilon}\right] \le v\ (\forall i\in V)~, 
	\end{aligned}
	\end{equation}
	then the expected regret of  Algorithm \ref{alg:1} (RUN-TEM) with $\delta=\frac{1}{t^4}$ after $T$ steps is upper bounded by
	\begin{equation}\label{bound:RUN-TEM}
	\begin{aligned}
         \E[R_{T}] \le 
         &\inf_{\C} \left\{ 40\left(\sum_{C\in\C}\frac{(10v)^{1/\epsilon}\Delta_{C}^{\max}}{(\Delta_{C}^{\min})^{(1+\epsilon)/\epsilon}}\right)\log T \right\}\\
         &+ \left( 1 + \frac{\pi^2}{3}\right) \sum_{i=1}^K \Delta_i ~, 
	\end{aligned}
	\end{equation}
	where $\Delta_{C}^{\min} := \min_{i\in C\backslash \{i^{\star}\}} \Delta_i$ is the minimum nonzero reward gap in clique $C$ and $\Delta_{C}^{\max} := \max_{i\in C} \Delta_i$ is the maximum reward gap in clique $C$. 
\end{thm}

\textbf{Remark.} If we choose $\C$ as the trivial covering $\{ \{i\}:i\in V \}$, the above regret bound reduces exactly to 
\begin{equation}
	\begin{aligned}
	\hspace{-2mm}
         40\sum_{i\in V: \Delta_{i}>0}\left(\frac{10v}{\Delta_{i}}\right)^{\frac{1}{\epsilon}}\log T + \left( 1 + \frac{\pi^2}{3}\right) \sum_{i=1}^K \Delta_i ~, 
	\end{aligned}
\end{equation}
which matches the optimal regret bound for heavy-tailed MAB proved by \cite{journals/tit/Bubeck2013}. Moreover, if $G$ is a complete graph, then the whole arm set $V$ forms a single clique that is itself a clique covering, and we further obtain the regret bound
\begin{equation}
	\begin{aligned}
         40\left(\frac{(10v)^{1/\epsilon}\Delta_{\max}}{\Delta_{\min}^{(1+\epsilon)/\epsilon}}\right)\log T + \left( 1 + \frac{\pi^2}{3}\right) \sum_{i=1}^K \Delta_i ~,
	\end{aligned}
\end{equation}
which is a substantial improvement over the regret bound of Robust UCB \citep{journals/tit/Bubeck2013} since the leading term is independent of $K$. Between these two extremes, a suitably chosen $\C$ can also improve the regret bound of RUN-TEM significantly over the standard regret bounds of heavy-tailed MAB, since the side information of $G$ is used effectively.
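
Both special cases follow by direct substitution into \eqref{bound:RUN-TEM}, as the following short calculation sketches. For a singleton clique $C=\{i\}$ with $\Delta_i>0$ we have $\Delta_{C}^{\max}=\Delta_{C}^{\min}=\Delta_i$, hence
\begin{equation*}
	\frac{(10v)^{1/\epsilon}\Delta_{i}}{\Delta_{i}^{(1+\epsilon)/\epsilon}} = (10v)^{1/\epsilon}\Delta_{i}^{-1/\epsilon} = \left(\frac{10v}{\Delta_{i}}\right)^{\frac{1}{\epsilon}} ~,
\end{equation*}
where we assume that a singleton containing only an optimal arm contributes zero to the sum. For a complete graph, the single clique $C=V$ gives $\Delta_{C}^{\max}=\Delta_{\max}$ and $\Delta_{C}^{\min}=\Delta_{\min}$, interpreting $\Delta_{\min}$ as the smallest nonzero gap over all arms.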

In addition, when the rewards are generated from distributions with finite variances ($\epsilon=1$), RUN-TEM yields the regret bound
\begin{equation}
	\begin{aligned}
	    \E[R_{T}] \le 
        &\inf_{\C} \left\{ 40\left(\sum_{C\in\C}\frac{10v\Delta_{C}^{\max}}{(\Delta_{C}^{\min})^{2}}\right)\log T \right\}\\
        &+ \left( 1 + \frac{\pi^2}{3}\right) \sum_{i=1}^K \Delta_i ~,
	\end{aligned}
\end{equation}
which is of the same order as the regret bound of UCB-N \citep{conf/uai/Caron2012}. However, when the rewards are generated from distributions with infinite variances ($0<\epsilon<1$) \citep{journals/pieee/Shao1993}, the theoretical results of UCB-N are no longer applicable, while our method still enjoys the gap-based regret bound \eqref{bound:RUN-TEM}, whose leading term scales logarithmically with $T$. 

Although RUN-TEM obtains a graph-based logarithmic regret bound, the second term in \eqref{bound:RUN-TEM} is still of order $O(K)$. To address this issue, we further design the \underline{R}obust \underline{U}CB-\underline{NE} (RUNE) strategy with an improved regret bound. Inspired by \cite{conf/uai/Hu2019}, we embed the side information of the feedback graph into the principle of OFU and redefine a graph-based UCB index for each arm $i$ to enlarge the exploration stage properly:
\begin{equation}\label{ucb_index_m}
	\muh_i(t-1) + 5v^{\frac{1}{1+\epsilon}}\left( \frac{\log(|N(i)|/\delta)}{O_i(t-1)}\right)^{\frac{\epsilon}{1+\epsilon}} ~, 
\end{equation}
where $|N(i)|$ is the size of arm $i$'s neighbor set and $\muh_i(t-1)$ is computed by the TEM estimator with an altered sequence of truncation levels
\begin{equation}\label{RUNE-TEM:truncation level}
    B_{i,t,\delta}=\left(\frac{vO_i(t)}{\log(|N(i)|/\delta)}\right)^{\frac{1}{1+\epsilon}} ~.
\end{equation}
Apart from these two quantities, the procedure is the same as Algorithm \ref{alg:1}, and we call the resulting policy RUNE-TEM. We obtain a regret bound whose constant terms are summed over the \textit{clique covering} of the feedback graph and are logarithmic in the size of the cliques, as summarized in the following theorem.

\begin{thm}\label{thm:RUNE-TEM}
    Consider the same preconditions as Theorem \ref{thm:RUN-TEM}. Let $\delta=\frac{1}{t^4}$, then the expected regret of RUNE-TEM after $T$ steps is upper bounded by
	\begin{small}
	\begin{equation}\label{bound:RUNE-TEM}
		\hspace{-0.1cm}
		\begin{aligned}
			\E[R(T)]
			&\le \inf_{\C} \left\{ 40\left(\sum_{C\in\C}\frac{(10v)^{1/\epsilon}\Delta_{C}^{\max}}{\left(\Delta_{C}^{\min}\right)^{(1+\epsilon)/\epsilon}}\right)\log T\right.\\
			&\left. +\sum_{C\in\C}\left[\left(\frac{40(10v)^{1/\epsilon}\Delta_{C}^{\max}}{\left(\Delta_{C}^{\min}\right)^{(1+\epsilon)/\epsilon}}\right)\log N_C \right.\right.\\
			&\left.\left.+ \left( 1 + \frac{\pi^2}{3}\right) \Delta_{C}^{\max}\right]\right\} ~, 
		\end{aligned}
	\end{equation}
	\end{small}where $\Delta_{C}^{\min} = \min_{i\in C\backslash\{i^{\star}\}} \Delta_i, \Delta_{C}^{\max} = \max_{i\in C} \Delta_i$, and $N_{C}=\max_{i\in C} |N(i)|^{\frac{1}{4}}$ is determined by the maximum degree of clique $C\in\C$.
\end{thm}
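
As a brief sanity check on the $\log N_C$ term (this is our reading of the statement, not a substitute for the proof), note that with $\delta=t^{-4}$ the logarithmic factor shared by \eqref{ucb_index_m} and \eqref{RUNE-TEM:truncation level} splits as
\begin{equation}
	\begin{aligned}
	\log\frac{|N(i)|}{\delta} &= \log|N(i)| + 4\log t \\
	&= 4\left(\log t + \log |N(i)|^{\frac{1}{4}}\right) ~.
	\end{aligned}
\end{equation}
Bounding $t\le T$ and $|N(i)|^{1/4}\le N_C$ for every $i\in C$ then gives a factor of at most $4(\log T+\log N_C)$, which is consistent with the $\log T$ and $\log N_C$ terms in \eqref{bound:RUNE-TEM} sharing the same coefficient.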

\textbf{Remark.} We now discuss the difference between RUN-TEM and RUNE-TEM. Given the same feedback graph $G$, the leading terms of RUN-TEM and RUNE-TEM are the same. However, the constant term of RUN-TEM is of order $O(\sum_{i=1}^K \Delta_i)$, while RUNE-TEM improves it to $O(\sum_{C\in\C}[  \frac{\Delta_{C}^{\max}\log(N_C)}{(\Delta_{C}^{\min})^{(1+\epsilon)/\epsilon}} + \Delta_{C}^{\max}])$, where the sum is taken only over the \textit{clique covering} of $G$, not over all $K$ arms. As a result, RUNE-TEM improves the constant term when the cliques are large. 

Note that RUNE-TEM can only be applied to reward distributions with bounded $(1 + \epsilon)$-th raw moments, which means that the selected arms may change under a synchronized shift of all the reward distributions. Thus, it would be more desirable to obtain a regret bound in terms of a bound on the central moments. To address this problem, we combine RUNE with the \underline{M}edian \underline{o}f \underline{M}eans (MoM) estimator \citep{journals/jcss/Alon1999}, resulting in a translation-invariant algorithm termed RUNE-MoM. The main idea is to first divide the rewards $X_{i,1},\cdots,X_{i,n}$ of each arm $i\in V$ into $k$ disjoint blocks of size $N=\lceil n/k\rceil$:
\begin{equation}
    \X_i = \{X_{i,1:N},\cdots,X_{i,((k-1)N+1):n}\}~.
\end{equation}
After that, we separately compute the standard empirical mean of each block $s\in[k]$ by
\begin{equation}
    \muh_s=\frac{1}{N}\sum_{t=(s-1)N+1}^{sN} X_t~.
\end{equation}
Finally, we obtain the mean estimate $\muh_{i}(t)$ by taking the median of these block-wise empirical means:
\begin{equation}
    \muh_{i}(t) = \mbox{\rm median}(\muh_1,\cdots,\muh_k)~.
\end{equation}
For a particular arm $i$, we set the number of blocks $k$ as
\begin{equation}\label{RUNE-MoM:k}
    k=\lceil 8\log(|N(i)|e^{-1/8}/\delta)\rceil~,
\end{equation}
where $\delta\in(0,1)$ is a confidence level predetermined by the player. With this choice, the MoM estimator enjoys the concentration properties described in the following proposition. 

\begin{prop}\label{prop:2} 
	Let $\delta\in(0,1), \epsilon\in(0, 1]$ be positive parameters. Let
	$X_1,X_2,\cdots,X_n$ be i.i.d.~random variables sampled from a fixed distribution $\P$ with finite mean $\mu$ and bounded $(1+\epsilon)$-th central moment, i.e., $\E_{X\sim\P}[|X-\mu|^{1+\epsilon}] \le v$. Let $k=\lceil 8\log(1/\delta)\rceil$, $N=\lceil n/k\rceil$, and let
	\begin{equation}
	    \muh_1=\frac{1}{N}\sum_{t=1}^N X_t,\cdots,\muh_k=\frac{1}{N}\sum_{t=(k-1)N+1}^{kN} X_t~,
	\end{equation}
	be $k$ empirical mean estimates, where each one is computed on $N$ rewards. Consider the MoM estimator 
	\begin{equation}
	    \muh_{M} = \mbox{\rm median}(\muh_1,\cdots,\muh_k)~,
	\end{equation}
	then with probability at least $1-\delta$,
	\begin{equation}\label{prop:2-1}
		\muh_{M} \ge \mu - (12v)^{\frac{1}{1+\epsilon}}\left( \frac{8\log(e^{1/8}/\delta)}{n}\right)^{\frac{\epsilon}{1+\epsilon}} ~,
	\end{equation}
	and also, with probability at least $1-\delta$,
	\begin{equation}\label{prop:2-2}
		\muh_{M} \le \mu + (12v)^{\frac{1}{1+\epsilon}}\left( \frac{8\log(e^{1/8}/\delta)}{n}\right)^{\frac{\epsilon}{1+\epsilon}} ~.
	\end{equation}
\end{prop}

Through the concentration properties in Proposition \ref{prop:2}, we can redefine a graph-based UCB index for RUNE by
\begin{equation}\label{RUNE-MoM:UCB_index}
    \muh_{i}(t-1) + (12v)^{\frac{1}{1+\epsilon}}\left( \frac{8\log(|N(i)|/\delta)}{n}\right)^{\frac{\epsilon}{1+\epsilon}} ~.
\end{equation} 
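
To make the link between Proposition \ref{prop:2} and \eqref{RUNE-MoM:UCB_index} explicit, the following substitution may help; it is only a sketch, and $\delta'$ is an auxiliary symbol introduced here for illustration. Applying the proposition at confidence level $\delta'=e^{1/8}\delta/|N(i)|$ gives
\begin{equation}
	\begin{aligned}
	\lceil 8\log(1/\delta')\rceil &= \lceil 8\log(|N(i)|e^{-1/8}/\delta)\rceil ~,\\
	8\log(e^{1/8}/\delta') &= 8\log(|N(i)|/\delta) ~,
	\end{aligned}
\end{equation}
so the first line recovers the number of blocks in \eqref{RUNE-MoM:k}, and the second line recovers the logarithmic factor in \eqref{RUNE-MoM:UCB_index}.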

Finally, we obtain the regret upper bound of RUNE-MoM as follows.

\begin{thm}\label{thm:RUNE-MoM}
    Let $G=(V;E)$, $\epsilon\in(0, 1]$ and $v > 0$. Assume that the reward distributions $\P_i$ with means $\mu_i$ satisfy 
	\begin{equation}
	\begin{aligned}
         \E_{X\sim \P_i} \left[|X-\mu_i|^{1+\epsilon}\right] \le v\ (\forall i\in V)~, 
	\end{aligned}
	\end{equation}
	then the expected regret of RUNE-MoM with $\delta=\frac{1}{t^4}$ after $T$ steps is upper bounded by
	\begin{small}
	\begin{equation}\label{bound:RUNE-MoM}
		\begin{aligned}
			\E[R(T)]
			&\le \inf_{\C} \left\{ 64\left(\sum_{C\in\C}\frac{(24v)^{1/\epsilon}\Delta_{C}^{\max}}{\left(\Delta_{C}^{\min}\right)^{(1+\epsilon)/\epsilon}}\right)\log T\right.\\
			&\left. +\sum_{C\in\C}\left[\left(\frac{64(24v)^{1/\epsilon}\Delta_{C}^{\max}}{\left(\Delta_{C}^{\min}\right)^{(1+\epsilon)/\epsilon}}\right)\log N_C \right.\right.\\
			&\left.\left. + \left( 1 + \frac{e^{1/8}\pi^2}{3}\right) \Delta_{C}^{\max}\right]\right\} , 
		\end{aligned}
	\end{equation}
	\end{small}where $\Delta_{C}^{\min} = \min_{i\in C\backslash\{i^{\star}\}} \Delta_i, \Delta_{C}^{\max} = \max_{i\in C} \Delta_i$, and $N_{C}=\max_{i\in C} |N(i)|^{\frac{1}{4}}$ is determined by the maximum degree of clique $C\in\C$.
\end{thm}
\textbf{Remark.} Note that the theoretical guarantee of RUNE-MoM is of the same order as that of RUNE-TEM. However, the regret bound of RUNE-TEM depends on the raw moment bound, while the regret bound of RUNE-MoM depends on the central moment bound, which is invariant under a synchronized shift of all the reward distributions.

\subsection{Robust Active Arm Elimination with Feedback Graph}\label{main:2}

\begin{algorithm}[t]
	\caption{RAAE-TEM}
	\label{alg:2}
	\begin{algorithmic}[1]
		\STATE {\bfseries Input:} Graph $G = (V,E)$ with $K$ nodes, $\epsilon\in(0, 1]$, $(1+\epsilon)$-th raw moment bound $v$, number of rounds $T$    \vspace{.5ex}
		\STATE  {\bfseries Initialize:}  $r\leftarrow 1, t\leftarrow 1, V_1\leftarrow V, \varepsilon_{1}\leftarrow 1/4^{\epsilon}$ \vspace{1ex}
		
		\WHILE{$|V_r|>1$ {\bfseries and} $t\le T$}
		\STATE Select a maximal independent set $I_{r}$ greedily from the subgraph induced by set $V_r$
		\STATE Compute the sampling times $n_r$ by \eqref{RAAE-TEM:n_r}
		\FOR {$s=1$ {\bfseries to} $n_r$}
		\FOR{all $i\in I_{r}$}
		\STATE Pull arm $i$ and receive reward $X_{i,t}$
		\STATE Observe rewards of all arms in $N(i)$
		\STATE Update $O_j(t)$ and $\muh_j(t)$ for all arms $j\in N(i)$ with the truncation level
		$B_{j,t}$ described in \eqref{RAAE:truncation level}
		\STATE $t\leftarrow t+1$
		\ENDFOR
		\ENDFOR
		\STATE Compute $\muh_r^{\star}=\max_{i\in V_r} \muh_i(t-1)$
		% \STATE Set $c_{r}=v^{1/(1+\epsilon)}\left( \frac{c\log(1/\delta)}{n_r}\right)^{\epsilon/(1+\epsilon)}$
		\STATE Execute active arm elimination described by \eqref{active arm elimination}
		\STATE $\varepsilon_{r+1}\leftarrow\varepsilon_{r}/2^{\epsilon}, r\leftarrow r+1$
		\ENDWHILE
		\STATE Play the arm left in $V_r$ until $T$ rounds have passed
	\end{algorithmic}
\end{algorithm}

The expected regret of RUNE is bounded by a sum of gap-based quantities over the \textit{clique covering} of $G$, which may be unacceptable when the \textit{clique covering} gets too large. A natural question is therefore whether the regret can be further improved. We answer this question affirmatively by designing an elimination-based strategy.

Inspired by AAE-AlphaSample \citep{conf/icml/Cohen2016}, we propose \underline{R}obust \underline{A}ctive \underline{A}rm \underline{E}limination (RAAE), described in Algorithm \ref{alg:2}, where the main idea is to sample each arm a minimal number of times and eliminate the ``bad'' arms one by one. To reduce the number of samples in each epoch, we first select a maximal independent set from the subgraph induced by the active arm set and then play the arms in it once. Although \cite{conf/icml/Cohen2016} consider a similar setting, their theoretical results cannot be applied directly to stochastic GB with heavy-tailed rewards. To address this issue, we adopt the TEM or MoM estimator in RAAE and estimate the mean reward of each arm by a graph-based sampling mechanism. 

We begin with RAAE equipped with the TEM estimator. RAAE-TEM works in epochs $r = 1,2,\cdots$. At each epoch $r$, the player maintains an active arm set $V_r$, initialized by $V_{1}=V$, and selects a maximal independent set $I_r$ greedily from the subgraph induced by $V_r$. After that, the player pulls the arms in $I_{r}$ once to update the average truncated rewards $\muh_i(t)$ of all the arms $i\in V_r$, with truncation levels
\begin{equation}\label{RAAE:truncation level}
    B_{i,t}=\left(\frac{vO_i(t)}{\log(2KT)}\right)^{\frac{1}{1+\epsilon}} ~.
\end{equation}
Then, we will eliminate the arms in $V_{r}$ that are known to be sub-optimal with sufficient confidence: 
\begin{equation}\label{active arm elimination}
    V_{r+1}=\{ i\in V_{r}: \muh_i(t-1) \ge \muh_r^{\star} - 2\varepsilon_{r} \} ~,
\end{equation}
where $\muh_r^{\star}=\max_{i\in V_r} \muh_i(t-1)$ and $\varepsilon_{r}$ is the accuracy parameter, initialized as $1/4^{\epsilon}$. As the analysis will show, by repeating this process for sampling times 
\begin{equation}\label{RAAE-TEM:n_r}
    n_{r} = \left\lceil \frac{5(5v)^{1/\epsilon}\log(2KT)}{\varepsilon_{r}^{(1+\epsilon)/\epsilon}} \right\rceil ~,
\end{equation} 
the mean rewards of all arms in $V_{r}$ can be estimated within $\varepsilon_{r}$ accuracy. As a result, each suboptimal arm $i$ with $\Delta_{i} > 4\varepsilon_{r}$ will be eliminated with high probability at each epoch $r$. Thus, we multiply $\varepsilon_{r}$ by $1/2^{\epsilon}$ after each epoch to increase the estimation accuracy. Finally, we obtain the following regret bound of RAAE-TEM.

\begin{thm}\label{thm:RAAE-TEM}
	Assume $K\ge 2$ and $T\ge K$. Suppose that the independence number of the feedback graph $G = (V,E)$ is at most $\alpha$, and the reward distributions $\P_i$ satisfy 
	\begin{equation}
	\begin{aligned}
         \E_{X\sim \P_i} \left[|X|^{1+\epsilon}\right] \le v\ (\forall i\in V)~, 
	\end{aligned}
	\end{equation}
	then the expected regret of Algorithm \ref{alg:2} (RAAE-TEM) after $T$ steps is at most
	\begin{small}
	\begin{equation}\label{bound:RAAE-TEM}
		\E[R_{T}] \le O\left( \sum_{i\in V^{(\alpha)}}\frac{v^{1/\epsilon}}{\Delta_{i}^{1/\epsilon}}\log T + \Delta_{\max}\log T\right) ~,
	\end{equation}
	\end{small}where $\Delta_{\max}=\max_{i\in V}\Delta_{i}$ and $V^{(\alpha)}$ denotes the set of $\alpha$ suboptimal arms with the smallest gaps, with ties broken arbitrarily.
\end{thm}

\textbf{Remark.} If $G$ is a complete graph, then $\alpha=1$ and the above regret bound reduces to 
\begin{equation}
	\begin{aligned}
         O\left(\frac{\log T}{\Delta_{\min}^{1/\epsilon}} + \Delta_{\max}\log T\right), 
	\end{aligned}
\end{equation}
which is independent of $K$. Conversely, if $G$ is an empty graph, i.e., $E=\emptyset$, then $\alpha=K$ and the above regret is on the order of $O(\sum_{i:\Delta_{i}>0}(\frac{v}{\Delta_{i}})^{\frac{1}{\epsilon}}\log T)$, which has been proved optimal for heavy-tailed MAB \citep{journals/tit/Bubeck2013}. Between these two extremes, the regret bound of RAAE-TEM improves significantly on that of Robust UCB \citep{journals/tit/Bubeck2013} whenever $\alpha<K$.

We now discuss the difference between RAAE-TEM and RUNE-TEM. Note that the regret bound of RUNE-TEM \eqref{bound:RUNE-TEM} is summed over the \textit{clique covering} of $G$, and its gap-based quantities depend on the ratio of the maximum and minimum mean reward gaps within each clique, which can be quite large in the worst case. In contrast, the regret bound of RAAE-TEM \eqref{bound:RAAE-TEM} is summed over the subset of $\alpha$ arms with the smallest nonzero gaps, and its gap-based quantities depend only on the reciprocal of the mean reward gaps. As $\alpha$ is much smaller than the size of the \textit{clique covering} for benign graphs, we conclude that the regret bound of RAAE-TEM is tighter than that of RUNE-TEM in this case.
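
The sampling times in \eqref{RAAE-TEM:n_r} can be motivated by a short calculation. As a sketch, assume that after $n$ observations the TEM estimate deviates from the true mean by at most $5v^{\frac{1}{1+\epsilon}}(\log(2KT)/n)^{\frac{\epsilon}{1+\epsilon}}$ with high probability, i.e., the confidence width in \eqref{ucb_index_m} with $\log(|N(i)|/\delta)$ replaced by $\log(2KT)$; this form is our assumption here, and the precise statement is left to the analysis. Requiring this width to be at most $\varepsilon_{r}$ and solving for $n$ gives
\begin{equation}
	n \ge \frac{5^{\frac{1+\epsilon}{\epsilon}}v^{\frac{1}{\epsilon}}\log(2KT)}{\varepsilon_{r}^{(1+\epsilon)/\epsilon}} = \frac{5(5v)^{1/\epsilon}\log(2KT)}{\varepsilon_{r}^{(1+\epsilon)/\epsilon}} ~,
\end{equation}
which agrees with \eqref{RAAE-TEM:n_r} up to the ceiling. On the event that $|\muh_i(t-1)-\mu_i|\le\varepsilon_{r}$ for all $i\in V_r$ and that the optimal arm $i^{\star}$ is still in $V_r$, any arm $i$ with $\Delta_i>4\varepsilon_{r}$ satisfies
\begin{equation}
	\begin{aligned}
	\muh_i(t-1) &\le \mu_{i^{\star}} - \Delta_i + \varepsilon_{r} < \mu_{i^{\star}} - 3\varepsilon_{r} \\
	&\le \muh_{i^{\star}}(t-1) - 2\varepsilon_{r} \le \muh_r^{\star} - 2\varepsilon_{r} ~,
	\end{aligned}
\end{equation}
so it fails the test in \eqref{active arm elimination} and is removed. Moreover, since $\varepsilon_{1}=4^{-\epsilon}$ and $\varepsilon_{r+1}=\varepsilon_{r}/2^{\epsilon}$, we have $\varepsilon_{r}=2^{-\epsilon(r+1)}$ and hence $\varepsilon_{r}^{-(1+\epsilon)/\epsilon}=2^{(1+\epsilon)(r+1)}$, so $n_r$ grows by a factor of $2^{1+\epsilon}$ per epoch, up to rounding.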

Furthermore, we can combine RAAE with the MoM estimator to handle heavy-tailed rewards with bounded $(1+\epsilon)$-th central moments. By using Proposition \ref{prop:2}, we set the number of blocks to $k=\lceil 8\log(2KT)\rceil$, the block size to $N=\lceil n/k\rceil$, and the sampling times to
\begin{equation}
    n_r = \left\lceil \frac{(8(12v)^{1/\epsilon}\log(2e^{1/8}KT))}{\varepsilon_{r}^{(1+\epsilon)/\epsilon}} \right\rceil ~, 
\end{equation}
where $\varepsilon_{r}$ is the same as that in RAAE-TEM. Finally, we obtain the following regret bound of RAAE-MoM.

\begin{thm}\label{thm:RAAE-MoM}
	Assume $K\ge 2$ and $T\ge K$. Suppose that the independence number of the feedback graph $G = (V,E)$ is at most $\alpha$, and the reward distributions $\P_i$ with means $\mu_i$ satisfy 
	\begin{equation}
	\begin{aligned}
         \E_{X\sim \P_i} \left[|X-\mu_i|^{1+\epsilon}\right] \le v\ (\forall i\in V)~, 
	\end{aligned}
	\end{equation}
	then the expected regret of RAAE-MoM after $T$ steps is at most
	\begin{small}
	\begin{equation}\label{bound:RAAE-MoM}
		\E[R_{T}] \le O\left( \sum_{i\in V^{(\alpha)}}\frac{v^{1/\epsilon}}{\Delta_{i}^{1/\epsilon}}\log T + \Delta_{\max}\log T\right) ~,
	\end{equation}
	\end{small}where $\Delta_{\max}=\max_{i\in V}\Delta_{i}$ and $V^{(\alpha)}$ denotes the set of $\alpha$ suboptimal arms with the smallest gaps, with ties broken arbitrarily.
\end{thm}
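
As with \eqref{RAAE-TEM:n_r}, the sampling times $n_r$ chosen for RAAE-MoM above can be recovered from Proposition \ref{prop:2} by a short calculation, sketched here under the assumption that the proposition is applied with confidence level $\delta=1/(2KT)$. The number of blocks is then $k=\lceil 8\log(2KT)\rceil$, and requiring the deviation in \eqref{prop:2-1} and \eqref{prop:2-2} to be at most $\varepsilon_{r}$ reads
\begin{equation}
	(12v)^{\frac{1}{1+\epsilon}}\left( \frac{8\log(2e^{1/8}KT)}{n}\right)^{\frac{\epsilon}{1+\epsilon}} \le \varepsilon_{r} ~.
\end{equation}
Raising both sides to the power $(1+\epsilon)/\epsilon$ and rearranging, this is equivalent to
\begin{equation}
	n \ge \frac{8(12v)^{1/\epsilon}\log(2e^{1/8}KT)}{\varepsilon_{r}^{(1+\epsilon)/\epsilon}} ~,
\end{equation}
which agrees with the stated choice of $n_r$ up to the ceiling.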

\section{EXPERIMENTS}
\begin{figure*}[htbp]
    \setlength{\belowcaptionskip}{-0.5cm} %-0.8
    \centering
    \subfigure[Random Graph]{\includegraphics[width=0.4\textwidth]{figures/fig_1.eps}}
    \quad
    \subfigure[Deterministic Graph]{\includegraphics[width=0.4\textwidth]{figures/fig_2.eps}}
  	\caption{Comparison of our algorithms (RUNE-TEM, RAAE-TEM) versus UCB-N and AAE-AlphaSample for stochastic GB with heavy-tailed rewards}
    \label{fig:results}
    \vspace{0.2in}
\end{figure*}

In this section, we present numerical results to demonstrate the effectiveness of our algorithms. We compare our methods (RUNE-TEM, RAAE-TEM)\footnote{Code will be made available at \href{https://github.com/yutian-007/graphical-bandits-with-heavy-tailed-rewards/}{https://github.com/yutian-007/graphical-bandits-with-heavy-tailed-rewards/}.} with UCB-N \citep{conf/uai/Caron2012} and AAE-AlphaSample \citep{conf/icml/Cohen2016}.

\textbf{Setup.} We synthesize a stochastic GB problem with $K=30$ arms, in which $2$ optimal arms are assigned uniformly at random from $[K]$ and all other arms are sub-optimal. The means of the optimal rewards are set to $\mu^{\star}=1.0$ and the means of the sub-optimal rewards are restricted to $(0, 1.0)$. The time horizon is set to $T=10000$ for all experiments, and we report the average over $10$ independent runs of each algorithm. 

\textbf{Reward Distribution.} To generate heavy-tailed rewards, we consider a Pareto random variable $X$ with shape parameter $\alpha$ and scale parameter $x_m$, whose probability density function can be written as follows
\begin{equation}
    f_X(x)=\left\{ \begin{array}{cc}
      \frac{\alpha x_m^{\alpha}}{x^{\alpha+1}},   &  x\ge x_m \\
      0,  & x<x_m 
    \end{array}\right. ~.
\end{equation}
In our experiments, we take $\alpha>1$, so that the expectation exists and equals $\E[X]=\frac{\alpha x_m}{\alpha-1}$. Also, the $r$-th raw moment exists when $r<\alpha$ and is given by $\E[X^r]=\frac{\alpha x_m^r}{\alpha-r}$. The smaller $\alpha$ is, the heavier the tail of the distribution. To guarantee that the $(1+\epsilon)$-th raw moment is bounded by some constant $v>0$, we set $\alpha=1.1+\epsilon$, where $\epsilon=0.3$. Each arm's rewards are sampled independently from a predetermined Pareto distribution with parameters $\alpha$ and $x_m=\frac{\mu(\alpha-1)}{\alpha}$, where $\mu$ is the mean of the arm, and the rewards of any two arms in a given round are generated independently.
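
As an illustrative check of these formulas (the numerical values below are our own rounded arithmetic and are not reported results), consider an optimal arm with mean $\mu^{\star}=1.0$. With $\alpha=1.4$ and $\epsilon=0.3$, the scale parameter is $x_m=0.4/1.4=2/7\approx 0.286$, and
\begin{equation}
    \E[X^{1.3}]=\frac{1.4\,x_m^{1.3}}{1.4-1.3}=14\,x_m^{1.3}\approx 2.75 ~.
\end{equation}
Since $x_m$ increases with the mean and all means are at most $1.0$, the same value bounds the $(1+\epsilon)$-th raw moment of every arm, so any $v$ of roughly $2.75$ or larger would be admissible in this setup. The variance, in contrast, is infinite because $\alpha<2$.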

\textbf{Feedback Graph.} We conduct experiments on two fixed undirected graphs. One is a random graph, generated by the Erd{\H o}s-R{\'e}nyi model \citep{journals/Erdos1960}. In detail, we represent the edges by a random matrix $E\in\{0,1\}^{K\times K}$, and assign $E_{ij}=1$ ($i\neq j$) with a fixed probability $p$ and $E_{ii}=1$ for all $i\in[K]$. The other is a deterministic graph constructed by \cite{conf/aaai/Lu2021b} with $K = 30$, $\alpha = 10$ and $\bar{\chi} = 14$, which is illustrated in Fig.~\ref{fig:graph}.

\begin{figure}[htbp]
    \setlength{\belowcaptionskip}{-0.5cm} %-0.8
    \centering    \includegraphics[width=0.5\textwidth]{figures/feedback_graph.eps}
    \caption{Illustration of the Deterministic Graph \citep{conf/aaai/Lu2021b}}
    \label{fig:graph}      
    \vspace{0.2in}
\end{figure}

\textbf{Results.} We present two results in Fig.~\ref{fig:results}. As can be seen, the regret curves of the elimination-based methods increase at the beginning and then remain stable after some epochs, because these methods can find the best arms with high probability. Furthermore, RAAE-TEM performs better than AAE-AlphaSample in both settings, which is expected since it uses more refined robust estimators to improve the estimation accuracy. Also, RUNE-TEM suffers smaller regret than UCB-N, since it has a better regret bound. In particular, RAAE-TEM performs better than RUNE-TEM, which is consistent with our theoretical analysis. 

\section{CONCLUSION AND FUTURE WORK}
We design two novel algorithms for stochastic graphical bandits with heavy-tailed rewards, which only require the existence of the $(1+\epsilon)$-th moments for some $\epsilon\in(0,1]$. One of our algorithms is based on the UCB strategy and obtains regret bounds depending on a sum of gap-based quantities over the \textit{clique covering} of the feedback graph. The other is based on the successive elimination technique and enjoys an improved regret bound depending on a gap-based sum with size controlled by $\alpha$, which is smaller than the size of the \textit{clique covering} for benign graphs. To the best of our knowledge, we provide the first regret bounds for stochastic GB with heavy-tailed rewards. A natural and challenging open problem is whether one can prove a lower bound for this setting. Obtaining lower bounds seems highly non-trivial even for stochastic GB under the sub-Gaussian setting \citep{conf/colt/problem/Teodor2022}, and we leave it as future work.

\begin{acknowledgements} % will be removed in pdf for initial submission,
						 % (without ‘accepted’ option in \documentclass)
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
    This work was partially supported by NSFC (62122037), JiangsuSF (BK20200064), and the Fundamental Research Funds for the Central Universities (2023300246).
\end{acknowledgements}

% References
\nocite{book/Vershynin2018}
\bibliography{ref}
\end{document}
