%\documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{algorithm}
%\usepackage{algorithmicx}
\usepackage{algpseudocode}
\usepackage{hyperref}
\usepackage{graphicx}
\usepackage{comment}
\usepackage{subcaption}

\usepackage{nameref}
\usepackage{zref-xr}
\zxrsetup{toltxlabel}
\zexternaldocument*{nie_684}

% CJQ added Feb 19 for table (copied from proposal) -- got error for xcolor already loaded-- removed cell colors
\usepackage{multicol}%,subfigure}
\usepackage{wrapfig, rotating}

% \usepackage[svgnames,rgb]{xcolor}

% \newcommand{\theHalgorithm}{\arabic{algorithm}}
\usepackage{mathtools}
\DeclarePairedDelimiter{\ceil}{\lceil}{\rceil}
\DeclarePairedDelimiter\floor{\lfloor}{\rfloor}
\DeclareMathOperator*{\argmax}{arg\,max} 
\DeclareMathOperator*{\argmin}{arg\,min} 
\newcommand{\cjq}[1]{\color{blue}#1\color{black}}

% For theorems and such
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{mathtools}
\usepackage{amsthm}

% % if you use cleveref..
\usepackage[capitalize,noabbrev]{cleveref}

\usepackage{balance}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% THEOREMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
% \theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example


\title{An Explore-then-Commit Algorithm for Submodular Maximization Under Full-bandit Feedback: Supplementary Material}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:nieg@iastate.edu?Subject=Your UAI 2022 paper}{Guanyu Nie}{}}
\author[2]{\href{mailto:agarw180@purdue.edu?Subject=Your UAI 2022 paper}{Mridul Agarwal}{}}
\author[2]{\href{mailto:aumrawal@purdue.edu?Subject=Your UAI 2022 paper}{Abhishek Kumar Umrawal}{}}
\author[2]{\href{mailto:vaneet@purdue.edu?Subject=Your UAI 2022 paper}{Vaneet Aggarwal}{}}
\author[1]{\href{mailto:cjquinn@iastate.edu?Subject=Your UAI 2022 paper}{Christopher John Quinn}{}}
% Add affiliations after the authors
\affil[1]{%
    Computer Science Department\\
    Iowa State University\\
    Ames, Iowa, USA
}
\affil[2]{%
   % School of Industrial Engineering\\
    Purdue University\\
    West Lafayette, Indiana, USA
}
  
\begin{document}
\onecolumn
\maketitle

\section{Proofs} \label{prf:main}

We separate the proof of \cref{thm:main} into two cases. The first case is when the clean event $\mathcal{E}$ defined in \cref{sec:regret-analysis} happens; we show in \cref{lem:probcleanevents} that this occurs with high probability. Under the clean event, we prove two preliminary results, namely \cref{lem:consequtive_reward} and \cref{cor:sk_lower}. These establish that even though ETCG, using random rewards, may pick a different sequence of subsets than an offline greedy algorithm \citep{nemhauser1978analysis} using a value oracle for the expected reward function $f$, ETCG's chosen set of size $k$ will nonetheless be near-optimal. The second case is when the complementary event happens, which occurs with low probability.


% You will see that most of the efforts have been devoted to the situation where the clean event happens. 
This proof structure is analogous to the standard MAB proof for explore-then-commit strategies \citep[see, for instance,][Section 1.2]{MAL-068}.
%
However, unlike for standard MAB problems, ETCG makes sequences of decisions during exploration. Furthermore, the combinatorial action space and non-linear reward function make the problem challenging. Indeed, even in the special setting of deterministic rewards, where the standard MAB problem becomes trivial (finding the largest of $n$ base arms), maximizing a submodular function with a cardinality constraint is NP-hard \citep{nemhauser1978analysis}.

\begin{comment}
\cjq{direct copy-- modify with links to subsections and lemma statements}
 We first briefly summarize the proof structure.  In proving the theorem, we first use the Hoeffding bound to show that the event that the empirical means of all actions played are concentrated around their respective statistical means, which we refer to as the \textit{clean} event, happens with high probability. We show that when the clean event occurs, the (conditional) expected cumulative $(1-1/e)$ regret is sublinear in the horizon $T$.  We then show that when the clean event does not occur, the (conditional) expected cumulative $(1-1/e)$ regret could be linear in the horizon $T$.  However, since the probability of the clean event is sufficiently high, the expected cumulative $(1-1/e)$ regret \cref{eq:reg:exp1e} remains sublinear in the horizon $T$. 
\end{comment}


\subsection{Preliminaries}\label{sec:appd:proof:prelim}
We first introduce some notation and lemmas that are useful in the analysis. Recall from \cref{prob_state} that for an action $S\in \mathcal{S}$, $f_t(S)$ denotes a (random) reward at time $t$, $f(S)$ denotes the expected value for action $S$,
and $\bar{f}_t(S)$ denotes the empirical mean of rewards received from playing action $S$ up to and including time $t$. In the following, we drop the subscript $t$ from the empirical mean, writing $\bar{f}(S)$ when it is clear from context that action $S$ has been played $m$ times. Also recall that $S^{(i)}$ denotes the set of size $i\in\{1,\dots,k\}$ chosen after finishing phase $i$, and by the greedy structure of \cref{alg:PG}, $\emptyset=S^{(0)}\subset S^{(1)} \subset \dots \subset S^{(k)}$. This sequence of subsets that ETCG picks \textit{does not necessarily match} the sequence chosen by the offline greedy approximation \citep{nemhauser1978analysis} using a value oracle for the expected reward function $f$. Even though ETCG may select a different sequence, we will later show in \cref{lem:consequtive_reward} that, under the clean event (which holds with high probability), the expected marginal gain of each chosen set is not too small.


% \cjq{double check conditional notation no longer used:} \color{red}
% Let $f(a|S^{(i)})$ denote the expected marginal gain of adding item $a$ to set $S^{(i)}$, i.e., $f(a|S^{(i)})=f(S^{(i)} \cup \{a\})-f(S^{(i)})$.  Denote the empirical mean of the marginal gain until time $t$ as $\bar{f}_t(a|S^{(i)})=\bar{f}_t(S^{(i)} \cup \{a\})-f(S^{(i)})$.  Note that this marginal gain $\bar{f}_t(a|S^{(i)})$ is defined to be with respect to the expected reward $f(S^{(i)})$.  
% \color{black}

Now we define events that are important in our analysis. Recall that $\bar{f}(S^{(i-1)}\cup\{a\})$ is the empirical mean of the $m$ rewards from playing action $S^{(i-1)}\cup\{a\}$ in phase $i$.  For each subset $S^{(i-1)}\cup\{a\}$, the $m$ rewards  are i.i.d. with mean $f(S^{(i-1)}\cup\{a\})$ and bounded in $[0,1]$.  Thus, we can bound the deviation of the (unbiased) empirical mean $\bar{f}(S^{(i-1)}\cup\{a\})$ from the expected value $f(S^{(i-1)}\cup\{a\})$ for each action in $\mathcal{S}_{i}$.  Specifically, we can use a two-sided Hoeffding bound for bounded variables.

\begin{lemma} [Hoeffding's inequality] \label{lem:hoeffding}
Let $X_1, \cdots, X_n$ be independent random variables bounded in the interval $[0, 1]$, and let $\bar{X}$ denote their empirical mean. Then we have for any $\epsilon >0$,
\begin{align}
    \mathbb{P}\left( \big|\bar{X} -  \mathbb{E}[\bar{X}] \big| \geq \epsilon  \right) \leq 2 \mathrm{exp} \left( - 2 n \epsilon^2  \right).
\end{align}
\end{lemma}

We will use Hoeffding's inequality to bound the probability that the empirical mean $\bar{f}(S^{(i-1)}\cup\{a\})$ deviates from its expected value, for each action $S^{(i-1)}\cup\{a\} \in \mathcal{S}_{i}$ played in phase $i$. By \cref{alg:PG}, each action is played the same number of times, denoted by $m$, so we use the same confidence radius $\mathrm{rad} := \sqrt{2\log(T)/m}$ for all the actions $S^{(i-1)}\cup\{a\} \in \mathcal{S}_{i}$ played in phase $i$.

% \cjq{if we only ever use $rad$, don't introduce notation for more complicated setting}
% For all $a\in \Omega\setminus S^{(i)}$, denote the confidence radius for the action $S^{(i)}\cup\{a\} \in \mathcal{S}_{i+1}$  at the end of phase $i+1$ for $i\in \{0,\cdots, k-1\}$ as $\mathrm{rad}_{i+1}(a) := \sqrt{2\log(T)/m}$. All actions in $\mathcal{S}_{i+1}$ are played $m$ times, resulting in equal-sized confidence radii. 
% Note that here the quantity  neither depends on the phase index $i+1$ nor the action $S^{(i)}\cup\{a\}$ because for this simple algorithm, each action will have the same confidence radius since all actions are played  exactly the same amount of times. 
% For ease of notation we denote $\mathrm{rad} = \mathrm{rad}_{i+1}(a)$. 

We consider the event that the empirical means of all actions played in phase $i$ are concentrated around their statistical means within a radius $\mathrm{rad}$.  Denote this event as  $\mathcal{E}_{i}$, %\cjq{add verbal definition first}
\begin{align}
    \mathcal{E}_{i}:=\bigcap_{S\cup\{a\} \in \mathcal{S}_{i} }\bigg\{\big|\bar{f}(S\cup\{a\})-f(S\cup\{a\}) \big|< \mathrm{rad}\bigg\}. \label{eq:phase_event}
\end{align}


Define the \textit{clean event} $\mathcal{E}$ to be the event that the empirical means of all actions played up to and including phase $k$ are within $\mathrm{rad}$ of their corresponding statistical means:
\begin{align}
    \mathcal{E} := \mathcal{E}_{1}\cap \dots \cap \mathcal{E}_{k}. \label{eq:clean_event}
\end{align}


% \cjq{Aside: It may not help, but we  only need the arms that appear empirically best to be concentrated around their means (eg it might be fine for bad arms to not be close to their means)}

\begin{lemma} \label{lem:probcleanevents}
The probability of the clean event $\mathcal{E}$ defined in \eqref{eq:clean_event} satisfies:
\begin{align}
    \mathbb{P}(\mathcal{E}) %
    %
    & \geq 1 - \frac{2nk}{T^4}. \nonumber
\end{align}
\end{lemma}
\begin{proof}
% \cjq{first break down clean even as product, then Hoeffding; or provide a roadmap for the ordering of steps}

We begin by breaking up the probability of the clean event $\mathcal{E}$ into conditional probabilities for the events $\{\mathcal{E}_{i}\}_{i=1}^{k}$ for each phase,
\begin{align}
    \mathbb{P}(\mathcal{E}) &= \mathbb{P}(\mathcal{E}_{1}\cap \dots \cap \mathcal{E}_{k}) \nonumber\\
    %
    &= \prod_{i=1}^{k} \mathbb{P}(\mathcal{E}_{i}|\mathcal{E}_{1},\dots,\mathcal{E}_{i-1}). \label{eq:probbnd:Econd} 
\end{align}



% \color{blue}
% I don't think that is that the following is true in general 
% \begin{align}
%     \mathbb{P}(\mathcal{E}) &= \mathbb{P}(\mathcal{E}_{1}\cap \dots \cap \mathcal{E}_{k}) \nonumber\\
%     %
%     &= \prod_{i=0}^{k-1} \mathbb{P}(\mathcal{E}_{i+1}) 
% \end{align}, as the results in earlier phases (which partly depend on the concentration of empirical means) affects which actions are played in later phases, and while we can use the same inequality to bound the probability of concentration, the actual probability of concentration will vary (we just know that it is at least as big as what Hoeffding gives)


% \color{black}



% For the clean event $\mathcal{E}$, its definition is not tied




Recall that  $\mathcal{E}_{i}$, defined in \eqref{eq:phase_event}, is the event where the empirical means of all actions played in phase $i$ were concentrated around their statistical means. Which actions are available in phase $i$, namely  $\{S^{(i-1)}\cup\{a\}\}_{a\in \Omega\backslash S^{(i-1)}}$, depends on the action $S^{(i-1)}$ from the previous phase that had the highest empirical mean, which in turn is related to $\mathcal{E}_{i-1}$.  Although we cannot directly evaluate \eqref{eq:probbnd:Econd}, by conditioning on  $S^{(i-1)}$ we will be able to obtain a bound on \eqref{eq:probbnd:Econd}.



% the empirical means of 
% Since the number of samples for each action, denoted by $m$, are fixed and the rewards are conditionally independent, the event of each empirical mean being concentrated is the product of the probabilities, 
% \cjq{earlier, did we make clear that $a:S^{(i)}\cup\{a\} \in \mathcal{S}_{i+1}$ is same as $a\in \Omega\backslash S^{(i)}$?  maybe the latter easier to read}

%% Original
% \begin{align}
%     \mathbb{P}(\mathcal{E}_{i+1})&=\mathbb{P}\left[ \bigcap_{S^{(i)}\cup\{a\} \in \mathcal{S}_{i+1}}\left\{\bigg|\bar{f}(S^{(i)}\cup\{a\})-f(S^{(i)}\cup\{a\}) \bigg| < \mathrm{rad} \right\} \right] \nonumber\\%
%     %
%     &= \prod_{ S^{(i)}\cup\{a\} \in \mathcal{S}_{i+1} } \mathbb{P}\left[ \left\{\bigg|\bar{f}(S^{(i)}\cup\{a\})-f(S^{(i)}\cup\{a\}) \bigg| < \mathrm{rad} \right\} \right] \nonumber \\
%     %
%     &\geq \left(1-\frac{2}{T^4} \right)^{| \mathcal{S}_{i+1} |} \nonumber\\
%     %
%     & = \left(1-\frac{2}{T^4} \right)^{n-i} \\
%     %
%     & \geq \left(1-\frac{2}{T^4} \right)^n \label{eq:probbnd:phase}
% \end{align}

\begin{align}
    \mathbb{P}(\mathcal{E}_{i}|\mathcal{E}_{1},\dots,\mathcal{E}_{i-1}) %
    %
    &= \sum_{S\in\left\{S' \ \big| \ S'\subseteq\Omega, \ |S'|=i-1\right\}} \mathbb{P}(S^{(i-1)}=S, \mathcal{E}_{i}|\mathcal{E}_{1},\dots,\mathcal{E}_{i-1}) \tag{law of total probability} \\
    %
    %
    &= \sum_{S\in\left\{S' \ \big| \ S'\subseteq\Omega, \ |S'|=i-1\right\}} \mathbb{P}(S^{(i-1)}=S|\mathcal{E}_{1},\dots,\mathcal{E}_{i-1}) \times \mathbb{P}(\mathcal{E}_{i}| S^{(i-1)}=S, \mathcal{E}_{1}, \dots, \mathcal{E}_{i-1}) \nonumber \\
    %
    %
    &= \sum_{S\in\left\{S' \ \big| \ S'\subseteq\Omega, \ |S'|=i-1\right\}} \mathbb{P}(S^{(i-1)}=S|\mathcal{E}_{1},\dots,\mathcal{E}_{i-1}) \times \mathbb{P}(\mathcal{E}_{i}| S^{(i-1)}=S), \label{eq:eventseqprob:fact:1}
\end{align} where \eqref{eq:eventseqprob:fact:1} follows from  rewards in phase $i$ being conditionally independent of rewards from other phases, given the corresponding actions played during phase $i$. %; consequenently the concentration of empirical means during phase $i+1$ are conditionally independent of rewards in earlier phases, conditioned on the action 



We now focus on bounding $\mathbb{P}(\mathcal{E}_{i}| S^{(i-1)}=S)$.  By conditioning on the set chosen in the previous phase, $S^{(i-1)}=S$, we know all the actions that will be played in the current phase $i$, $\{S^{(i-1)}\cup\{a\}\}_{a\in \Omega\backslash S^{(i-1)}}$.  The rewards of all the actions are bounded in $[0,1]$ and are conditionally independent (given the corresponding action).  %Thus the events that the empirical means will be concentrated will be independent of each other.  

% \vspace{1cm}

Applying \cref{lem:hoeffding} to the empirical mean $\bar{f}(S^{(i-1)}\cup\{a\})$ of $m$ rewards for action $S^{(i-1)}\cup\{a\}$ and choosing $\epsilon=\mathrm{rad}=\sqrt{2\log(T)/m}$ gives




\begin{align}
    \mathbb{P}\left[\big|\bar{f}(S^{(i-1)}\cup\{a\})-f(S^{(i-1)}\cup\{a\}) \big| \geq \mathrm{rad} \right]     &\leq 2 \mathrm{exp} \left( - 2 m \mathrm{rad}^2  \right) \nonumber\\
    %
    &= 2 \mathrm{exp} \left( - 2 m (2\log(T)/m ) \right) \nonumber\\
    %
    &= 2 \mathrm{exp} \left( - 4 \log(T)  \right) \nonumber\\
    %
    &= \frac{2}{T^4}. \nonumber
\end{align}
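% Illustrative aside: the numerical values below are hypothetical and serve only to show the scale of the bound.
For a sense of scale, consider the purely illustrative values $T=10^3$ and $m=100$ (assumed only for this example and not used elsewhere in the paper), with $\log$ denoting the natural logarithm. Then
\begin{align*}
    \mathrm{rad} = \sqrt{\frac{2\log(10^3)}{100}} \approx \sqrt{0.138} \approx 0.37,
    \qquad
    \frac{2}{T^4} = 2\times 10^{-12},
\end{align*}
so in this instance each empirical mean lies within roughly $0.37$ of its expected value, except with probability at most $2\times 10^{-12}$.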

Thus, for any individual action $S^{(i-1)}\cup\{a\} \in \mathcal{S}_{i}$, we can bound the probability that its sample mean $\bar{f}(S^{(i-1)}\cup\{a\})$ is within the confidence radius $\mathrm{rad}$ of its expected value (the complement of the event above) as
\begin{align}
    % &\hspace{-2cm}
    \mathbb{P}\left[\bigg|\bar{f}(S^{(i-1)}\cup\{a\})-f(S^{(i-1)}\cup\{a\}) \bigg| < \mathrm{rad} \right] %\nonumber\\
    &= 1-\mathbb{P}\left[\bigg|\bar{f}(S^{(i-1)}\cup\{a\})-f(S^{(i-1)}\cup\{a\}) \bigg| \geq \mathrm{rad} \right] \nonumber\\
    %
    &\geq 1-\frac{2}{T^4}. \label{eq:probbnd:single}
\end{align}

We can then use \eqref{eq:probbnd:single} to bound  $\mathbb{P}(\mathcal{E}_{i}| S^{(i-1)}=S)$ for any set $S\subset \Omega$ of $i-1$ arms.


\begin{align}
    \mathbb{P}(\mathcal{E}_{i}| S^{(i-1)}=S) %
    %
    %
    &=\mathbb{P}\left[ \bigcap_{a \in \Omega \backslash S^{(i-1)}}\left\{\bigg|\bar{f}(S^{(i-1)}\cup\{a\})-f(S^{(i-1)}\cup\{a\}) \bigg| < \mathrm{rad} \right\} \bigg |S^{(i-1)}=S \right] \tag{definition of $\mathcal{E}_{i}$}\\%
    %
    %
    &= \prod_{a \in \Omega \backslash S^{(i-1)} } \mathbb{P}\left[ \left\{\bigg|\bar{f}(S^{(i-1)}\cup\{a\})-f(S^{(i-1)}\cup\{a\}) \bigg| < \mathrm{rad} \right\}  \bigg |S^{(i-1)}=S \right] \tag{rewards are independent conditioned on actions} \\
    %
    &\geq \left(1-\frac{2}{T^4} \right)^{| \Omega \backslash S^{(i-1)} |} \tag{using \eqref{eq:probbnd:single}}\\
    %
    & = \left(1-\frac{2}{T^4} \right)^{n-i+1} \nonumber\\
    %
    & \geq \left(1-\frac{2}{T^4} \right)^n. \label{eq:probbnd:phase}
\end{align}

Using \eqref{eq:eventseqprob:fact:1} and \eqref{eq:probbnd:phase}, we are now ready to lower bound the probability of the clean event.

\begin{align}
    \mathbb{P}(\mathcal{E}) &= \mathbb{P}(\mathcal{E}_{1}\cap \dots \cap \mathcal{E}_{k}) \nonumber\\
    %
    &= \prod_{i=1}^{k} \mathbb{P}(\mathcal{E}_{i}|\mathcal{E}_{1},\dots,\mathcal{E}_{i-1}) \nonumber \\
    %
    %
    &= \prod_{i=1}^{k} \sum_{S\in\left\{S' \ \big| \ S'\subseteq\Omega, \ |S'|=i-1\right\}} \mathbb{P}(S^{(i-1)}=S|\mathcal{E}_{1},\dots,\mathcal{E}_{i-1}) \times \mathbb{P}(\mathcal{E}_{i}| S^{(i-1)}=S) \tag{ using \eqref{eq:eventseqprob:fact:1} } \\
    %
    %
    &\geq \prod_{i=1}^{k} \sum_{S\in\left\{S' \ \big| \ S'\subseteq\Omega, \ |S'|=i-1\right\}} \mathbb{P}(S^{(i-1)}=S|\mathcal{E}_{1},\dots,\mathcal{E}_{i-1}) \times \left(1-\frac{2}{T^4} \right)^n \tag{ using \eqref{eq:probbnd:phase} } \\
    %
    %
    &= \prod_{i=1}^{k} \left(1-\frac{2}{T^4} \right)^n \sum_{S\in\left\{S' \ \big| \ S'\subseteq\Omega, \ |S'|=i-1\right\}} \mathbb{P}(S^{(i-1)}=S|\mathcal{E}_{1},\dots,\mathcal{E}_{i-1})  \nonumber \\ 
    %
    &= \prod_{i=1}^{k} \left(1-\frac{2}{T^4} \right)^n   \nonumber \\ 
    %
    & =   \left(1-\frac{2}{T^4} \right)^{nk} \nonumber\\
    %
    &\geq   1-\frac{2nk}{T^4}. \tag{Bernoulli's inequality}
\end{align}
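% Reminder of Bernoulli's inequality, stated in the form needed for the last step above.
As a brief reminder, Bernoulli's inequality in the form used in the last step states that for any real $x\in[0,1]$ and any integer $r\geq 1$,
\begin{align*}
    (1-x)^{r} \geq 1 - rx.
\end{align*}
Taking $x = 2/T^4$ and $r = nk$ gives the final line; the requirement $x\in[0,1]$ holds under the mild assumption $T\geq 2$.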




% \cjq{double check: the arm sets in later phases depend on results of earlier phases, and thus the reward distributions depend on each other and consequently events of concentration;  shouldn't we be a little more careful here } 


% For the clean event $\mathcal{E}$ defined in \eqref{eq:clean_event}, from \eqref{eq:probbnd:phase} we have
% \begin{align}
%     \mathbb{P}(\mathcal{E}) &= \mathbb{P}(\mathcal{E}_{1}\cap \dots \cap \mathcal{E}_{k}) \nonumber\\
%     %
%     &= \prod_{i=0}^{k-1} \mathbb{P}(\mathcal{E}_{i+1}) \tag{\cjq{This is a big error}}\\
%     %
%     %
%     &\geq \prod_{i=0}^{k-1} \left(1-\frac{2}{T^4} \right)^{n}\nonumber\\
%     & =   \left(1-\frac{2}{T^4} \right)^{nk} \nonumber\\
%     & \cjq{add step}\nonumber\\
%     & \geq 1 - \frac{2nk}{T^4}.  \label{eq:ETC:instregr:prf:probcleanevent} 
% \end{align}

This concludes the proof for \cref{lem:probcleanevents}.
\end{proof}




In \cref{lem:probcleanevents}, we showed that the clean event $\mathcal{E}$ happens with high probability. Next, we present a lemma showing that, under the clean event $\mathcal{E}$, the expected marginal gain of the action selected at the end of any exploration phase is not too small.
%
% Specifically, we next show that under event $\mathcal{E}$, we can bound the gap of the expected rewards with the expected reward of the optimal set, even though the empirically best actions at the end of each phase \textit{might not match those} chosen by the offline greedy algorithm (which we know is near-optimal and which ETCG would pick if the rewards were deterministic).

% set that the offline greedy would choose,

% even though the empirically best actions at the end of each phase \textit{might not match those} chosen by the offline greedy algorithm, we can nonetheless bound the gap of the expected rewards with the expected reward of the set that the offline greedy would choose,
% \cjq{Add a note this is same lemma as in section 4}
\begin{lemma}[\cref{lem:consequtive_reward_main} in \cref{sec:regret-analysis}] \label{lem:consequtive_reward}
Under the clean event $\mathcal{E}$,  for all   $i\in \{1,\cdots, k\}$,
% Conditioned on $\mathcal{E}$, the empirically best subsets at the end of each phase of the ETC-Greedy algorithm satisfy
\begin{align}
    f(S^{(i)})-f(S^{(i-1)}) \geq \frac{1}{k}\left[f(S^*)-f(S^{(i-1)})\right]-2 \mathrm{rad}. \label{eq:consequtive_reward}
\end{align}
\end{lemma}
\begin{proof}
Recall that $a_{i}$, defined in \eqref{eq:emp_best}, is the index of the arm that, together with $S^{(i-1)}$, forms the action with the highest empirical mean at the end of phase $i$, i.e., $a_{i} = \argmax_{a\in \Omega \setminus S^{(i-1)}} \bar{f}(S^{(i-1)} \cup \{a\})$ and $S^{(i)}=S^{(i-1)} \cup \{a_{i}\}$. Let $a_{i}^*$ denote the index of the arm that, together with $S^{(i-1)}$, forms the action with the highest expected value, i.e., $a_{i}^* = \argmax_{a\in \Omega \setminus S^{(i-1)}}f(S^{(i-1)} \cup \{a\})$. For each $a \in \Omega \setminus S^{(i-1)}$, the event that the empirical mean $\bar{f}(S^{(i-1)} \cup \{a\})$ is concentrated within a radius of size $\mathrm{rad}$ around the expected value can be written as
%
%
% \cjq{fix formatting -- look at alignat usage \url{https://tex.stackexchange.com/questions/200502/align-environment-align-on-the-left-side}}
% nicer way (need to use array?  https://tex.stackexchange.com/questions/102816/centering-equations-within-alignat-command
\begin{alignat}{3}
    f(S^{(i-1)} \cup \{a\}) - \mathrm{rad}& &&\leq  \bar{f}(S^{(i-1)} \cup \{a\}) &&\leq f(S^{(i-1)} \cup \{a\}) + \mathrm{rad} \tag{concentration in $\mathcal{E}_{i}$}\\ %\label{eq:instregrprf:conf:1}\\
    %
    \Longleftrightarrow \qquad f(S^{(i-1)} \cup \{a\}) - 2\mathrm{rad}& &&\leq \bar{f}(S^{(i-1)} \cup \{a\})- \mathrm{rad} &&\leq f(S^{(i-1)} \cup \{a\}). \label{eq:instregrprf:conf:1b}  
\end{alignat}


We next lower bound the expected reward $f(S^{(i)})$ for the empirically best action in phase $i$, $S^{(i)}= \{a_{i}\}\cup S^{(i-1)}$.   To do so, we apply \eqref{eq:instregrprf:conf:1b} to two specific arms, the empirically best $a_{i}$ and the statistically best $a_{i}^*$. We get
% \cjq{maybe add ``We can relate the empirical mean of .. to that of ..}
\begin{align}
    f(S^{(i)}) &= f(S^{(i-1)} \cup \{a_{i}\}) \tag{by design, $S^{(i)}\gets \{a_{i}\}\cup S^{(i-1)}$ }\\
    %
    &\geq \bar{f}(S^{(i-1)} \cup \{a_{i}\}) - \mathrm{rad} \tag{ using  \eqref{eq:instregrprf:conf:1b} } \\
    %
    &\geq \bar{f}(S^{(i-1)} \cup \{a_{i}^*\}) - \mathrm{rad} \tag{$a_{i}$ has the highest empirical mean} \\
    %
    &\geq  f(S^{(i-1)} \cup \{a_{i}^*\}) - 2\mathrm{rad}. \tag{ using  \eqref{eq:instregrprf:conf:1b} } 
\end{align}
Subtracting $f(S^{(i-1)})$ from both sides, we have
\begin{align}
    f(S^{(i)}) - f(S^{(i-1)}) &\geq  f(S^{(i-1)} \cup \{a_{i}^*\}) - f(S^{(i-1)}) - 2\mathrm{rad}. \label{eq:lem:marggain:12}
\end{align}


Recall from \cref{prob_state} that $S^*=\argmax_{S:|S|\leq k}f(S)$ denotes the optimal solution in the offline problem.   
We will next show that the improvement in expected reward of the chosen actions from one phase to the next is lower bounded in terms of the gap between the optimal set $S^*$ of cardinality $k$ and the set $S^{(i-1)}$ chosen in the previous phase.

% Since $f(S^{(i)}\cup \{a_{i+1}^*\})$ is the largest expected value of rewards of all actions in $\mathcal{S}_{i+1}$, it is larger than the average of all actions \cjq{verbally describe this special set} $\{S^{(i)}\cup \{a\} : a \in S^* \setminus S^{(i)}\}$ where the additional arm $a$ is in the optimal set of cardinality $k$ (in terms of expected value). 

\begin{align}
    f(S^{(i)}) - f(S^{(i-1)}) &\geq  f(S^{(i-1)} \cup \{a_{i}^*\}) - f(S^{(i-1)}) - 2\mathrm{rad} \tag{copying \eqref{eq:lem:marggain:12}} \\
    &= \max_{ a \in \Omega \setminus S^{(i-1)} } f(S^{(i-1)} \cup \{a\}) - f(S^{(i-1)}) - 2\mathrm{rad} \tag{by def.} \\
    & \geq \max_{ a \in S^* \setminus S^{(i-1)} } f(S^{(i-1)} \cup \{a\}) - f(S^{(i-1)}) - 2\mathrm{rad} \tag{restricted set}\\
    %
    &\geq \frac{1}{|S^* \backslash S^{(i-1)}  |} \sum_{ a \in S^* \backslash S^{(i-1)} } f(S^{(i-1)}\cup \{a\}) - f(S^{(i-1)}) - 2\mathrm{rad} \tag{max greater than average}\\
    & = \frac{1}{|S^* \backslash S^{(i-1)}  |} \sum_{ a \in S^* \backslash S^{(i-1)} } \left[f(S^{(i-1)}\cup \{a\})-f(S^{(i-1)})\right] - 2\mathrm{rad} \nonumber\\
    & \geq  \frac{1}{k} \sum_{ a \in S^* \backslash S^{(i-1)} } \left[f(S^{(i-1)}\cup \{a\})-f(S^{(i-1)})\right] - 2\mathrm{rad} \tag{$|S^* \backslash S^{(i-1)}|\leq k$; terms nonnegative by monotonicity}\\
    & \geq  \frac{1}{k} \left[f(S^*)-f(S^{(i-1)})\right] - 2\mathrm{rad}, \label{eq:lem:marggain:30}
\end{align}

where \eqref{eq:lem:marggain:30} follows from a well known bound for submodular functions. % (see Appendix~\ref{apdx:submod}).
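% Sketch of the standard submodularity bound; the symbols b_j, B_j and r are introduced only for this sketch.
For completeness, here is a sketch of that bound, assuming (as in the problem setup) that $f$ is monotone and submodular. Write $S^*\setminus S^{(i-1)}=\{b_1,\dots,b_r\}$ in an arbitrary order and let $B_j=\{b_1,\dots,b_j\}$ with $B_0=\emptyset$; the symbols $b_j$, $B_j$ and $r$ are introduced only for this sketch. Then
\begin{align*}
    f(S^*) - f(S^{(i-1)}) &\leq f(S^{(i-1)}\cup S^*) - f(S^{(i-1)}) \tag{monotonicity}\\
    &= \sum_{j=1}^{r}\left[f(S^{(i-1)}\cup B_j) - f(S^{(i-1)}\cup B_{j-1})\right] \tag{telescoping}\\
    &\leq \sum_{j=1}^{r}\left[f(S^{(i-1)}\cup \{b_j\}) - f(S^{(i-1)})\right], \tag{submodularity}
\end{align*}
which is the sum appearing in the line before \eqref{eq:lem:marggain:30}.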
\end{proof}


\cref{lem:consequtive_reward} gives a lower bound on the expected marginal gain $f(S^{(i)})-f(S^{(i-1)})$ of the empirically best action $S^{(i)}$ at the end of phase $i$.
% The sequence of subsets $\{S^{(0)},S^{(1)},\dots,S^{(k)}\}$ that ETCG picks \textit{does not necessarily match} the sequence chosen by the offline greedy approximation \citep{nemhauser1978analysis} using a value oracle for the expected reward function $f$.  Even though ETCG may select a different sequence, \cref{lem:consequtive_reward_main} ensures the expected marginal gain is not too small.  
As a corollary of \cref{lem:consequtive_reward}, using properties of submodular set functions and unraveling the recursion induced by \cref{lem:consequtive_reward}, we can lower bound the expected value of ETCG's chosen set $S^{(k)}$ of size $k$, which is used for exploitation in phase $k+1$.

\begin{corollary}[\cref{cor:sk_lower:main} in \cref{sec:regret-analysis}] \label{cor:sk_lower}
Under the clean event $\mathcal{E}$, % for all  $i\in \{1,2,\cdots, k\}$,
\begin{align}
    f(S^{(k)}) & \geq (1 - \frac{1}{e})f(S^*) - 2k \mathrm{rad} 
    . \label{eq:final_reward}
\end{align}
\end{corollary}

% \color{gray}

%Rearranging \eqref{eq:consequtive_reward}, 
\begin{proof}
We begin by unraveling the recursion induced by \cref{lem:consequtive_reward} and using properties of submodular set functions,
\begin{align}
    f(S^{(i)})-f(S^{(i-1)}) &\geq \frac{1}{k}\left[f(S^*)-f(S^{(i-1)})\right]-2 \mathrm{rad} \tag{copying \eqref{eq:consequtive_reward}}\\
    %
    & \nonumber\\
    %
    \Longleftrightarrow \qquad f(S^{(i)}) 
    & \geq \frac{1}{k}f(S^*) + (1-\frac{1}{k}) f(S^{(i-1)})-2 \mathrm{rad} \tag{rearranging}\\
    %
    %
    &=\left[\frac{1}{k}f(S^*)-2 \mathrm{rad}\right] + (1-\frac{1}{k}) f(S^{(i-1)}). \label{eq:prf:mainthm:case1:65}
\end{align}
% \begin{align}
%     f(S^{(i+1)}) 
%     & \geq 
%     % f(S^{(k-1)}) + \frac{1}{k}\left[f(S^*)-f(S^{(k-1)})\right]-2 \mathrm{rad} \nonumber\\
%     % &= 
%     \frac{1}{k}f(S^*) + (1-\frac{1}{k}) f(S^{(i)})-2 \mathrm{rad} \nonumber\\
%     %
%     &=\left[\frac{1}{k}f(S^*)-2 \mathrm{rad}\right] + (1-\frac{1}{k}) f(S^{(i)}) \label{eq:prf:mainthm:case1:65}
% \end{align}

Applying \eqref{eq:prf:mainthm:case1:65} recursively, starting from $i=k$,
\begin{align}
    f(S^{(k)}) 
    & \geq 
    % f(S^{(k-1)}) + \frac{1}{k}\left[f(S^*)-f(S^{(k-1)})\right]-2 \mathrm{rad} \nonumber\\
    % &= 
    \left[\frac{1}{k}f(S^*)-2 \mathrm{rad}\right] + (1-\frac{1}{k}) f(S^{(k-1)})
     \tag{using \eqref{eq:prf:mainthm:case1:65} for $i=k$}\\
    %
    %
    & \geq \left[\frac{1}{k}f(S^*)-2 \mathrm{rad}\right] + (1-\frac{1}{k}) \left(  
    \left[\frac{1}{k}f(S^*)-2 \mathrm{rad}\right] 
    +(1-\frac{1}{k}) f(S^{(k-2)}) \right) \tag{using \eqref{eq:prf:mainthm:case1:65} for $i=k-1$ }\\
    %
    %
    % & = \frac{1}{k}f(S^*) + \frac{1}{k}(1-\frac{1}{k})f(S^*) + (1-\frac{1}{k})^2 f(S^{(k-2)})-2(1-\frac{1}{k}) \mathrm{rad} - 2 \mathrm{rad} \tag{rearranging}\\
    %
    %
    & = \left[\frac{1}{k}f(S^*)-2 \mathrm{rad}\right]\sum_{\ell=0}^1(1-\frac{1}{k})^\ell + (1-\frac{1}{k})^2f(S^{(k-2)}) \tag{rearranging}\\
    %
    %
    &\vdots \tag{continue recursing until we get to $S^{(0)}=\emptyset$; $f(\emptyset)=0$} \\
    %
    %
    & \geq \left[\frac{1}{k}f(S^*)-2 \mathrm{rad}\right]\sum_{\ell=0}^{k-1}(1-\frac{1}{k})^\ell.
    \label{eq:prf:mainthm:case1:70}
\end{align}
Simplifying the geometric summation,
\begin{align}
    \sum_{\ell=0}^{k-1}(1-\frac{1}{k})^\ell
    %
    &= \frac{1-(1-\frac{1}{k})^k}{1-(1-\frac{1}{k})} \nonumber\\
    %
    &= k \left(1-(1-\frac{1}{k})^k\right). \nonumber  
\end{align}

Continuing with \eqref{eq:prf:mainthm:case1:70}, %and plugging in the value of $\mathrm{rad}$
\begin{align}
    f(S^{(k)}) 
    %
    & \geq \left[\frac{1}{k}f(S^*)-2 \mathrm{rad}\right]k \left(1-(1-\frac{1}{k})^k\right) \nonumber\\
    %\nonumber\\
    %
    %
    &= \left(1-\left(1-\frac{1}{k}\right)^k\right)f(S^*) - 2k \left(1-(1-\frac{1}{k})^k\right) \mathrm{rad} \nonumber\\ %\sqrt{2\log(T)/m} \nonumber\\
    &\geq \left(1-\left(1-\frac{1}{k}\right)^k\right)f(S^*) - 2k \mathrm{rad}.
    \tag{since $(1-\frac{1}{k})^k\geq 0$}
\end{align}
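% Illustrative special case (k = 2), included only as a sanity check of the algebra above.
As a quick sanity check of the algebra (an illustrative special case, not needed for the proof), take $k=2$. Unrolling \eqref{eq:prf:mainthm:case1:65} twice with $f(\emptyset)=0$ gives
\begin{align*}
    f(S^{(2)}) \geq \left[\tfrac{1}{2}f(S^*)-2\,\mathrm{rad}\right]\left(1+\tfrac{1}{2}\right) = \tfrac{3}{4}f(S^*) - 3\,\mathrm{rad},
\end{align*}
which agrees with $\left(1-(1-\tfrac{1}{2})^2\right)f(S^*) - 2\cdot 2\left(1-(1-\tfrac{1}{2})^2\right)\mathrm{rad}$ and is at least $\tfrac{3}{4}f(S^*) - 4\,\mathrm{rad}$.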
Using the well-known lower bound $\left(1- \left(1 - \frac{1}{k}\right)^k \right) \geq 1 - \frac{1}{e}$, %(see Appendix~\ref{apdx:1minus1ebnd}), 
we get
\begin{align}
    f(S^{(k)}) & \geq (1 - \frac{1}{e})f(S^*) - 2k \mathrm{rad} 
    %\sqrt{2\log(T)/m}
    .\nonumber
\end{align}
Rearranging terms we have
\begin{align}
    (1 - \frac{1}{e})f(S^*) - f(S^{(k)}) & \leq  2k \mathrm{rad}
    %\sqrt{2\log(T)/m}
    . \nonumber %\label{eq:final_reward}
\end{align}


\end{proof}
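
% Short verification of the (1 - 1/e) bound used in the proof of the corollary.
For completeness, the bound $1-\left(1-\frac{1}{k}\right)^k \geq 1-\frac{1}{e}$ used in the proof of \cref{cor:sk_lower} can be checked with the elementary inequality $1-x\leq e^{-x}$, valid for all real $x$. Taking $x=1/k$ with $k\geq 1$ and raising both (nonnegative) sides to the $k$-th power,
\begin{align*}
    \left(1-\frac{1}{k}\right)^k \leq \left(e^{-1/k}\right)^k = e^{-1}
    \quad\Longrightarrow\quad
    1-\left(1-\frac{1}{k}\right)^k \geq 1-\frac{1}{e}.
\end{align*}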
% \color{black}

\vspace{1cm}


\subsection{Proof of Theorem~\ref{thm:main}}\label{sec:apdx:prf:main-thm}

Now we are ready to prove the main theorem,  \cref{thm:main}.  

\subsubsection*{Case 1: clean event $\mathcal{E}$ happens}


In the first case we analyze the expected regret under the condition that the clean event $\mathcal{E}$ happens. In this section, all expectations will be conditioned on $\mathcal{E}$, but to simplify notation we will write $\mathbb{E}[\cdot]$ instead of $\mathbb{E}[\cdot|\mathcal{E}]$.

First we can break up the expected $(1-\frac{1}{e})$-regret \eqref{eq:reg:exp1e} conditioned on $\mathcal{E}$ into two parts, one for the first $k$ phases, and the second for the exploitation phase.  Also recall that $f_t(S_t)$ is the random reward for taking action $S_t$, which itself is random, depending on empirical means of actions in earlier phases.  % \cjq{explain steps}
\begin{align}
    \mathbb{E}[\mathcal{R}(T)] 
    % &= \mathbb{E}\left[(1-\frac{1}{e}) T f( S^* ) -\sum_{t=1}^T f_t(S_t) \right] \nonumber\\
    & = (1-\frac{1}{e}) T f( S^* ) -\sum_{t=1}^T \mathbb{E}[f_t(S_t)] \tag{using the definition \eqref{eq:reg:exp1e}}\\
    & = (1-\frac{1}{e}) T f( S^* ) -\sum_{t=1}^T \mathbb{E}[   \mathbb{E}[ f_t(S_t)  |S_t  ]] \tag{law of total expectation}\\
    & = (1-\frac{1}{e}) T f( S^* ) -\sum_{t=1}^T \mathbb{E}[f(S_t)] \tag{$f(\cdot)$ defined as expected reward}\\
    & = \sum_{t=1}^T \left((1-\frac{1}{e})f(S^*)-\mathbb{E}[f(S_t)]\right) \tag{rearranging}\\
    & = \underbrace{\sum_{i=1}^{k} \sum_{t=T_{i-1}+1}^{T_{i}} \left((1-\frac{1}{e})f(S^*)-\mathbb{E}[f(S_t)]\right)}_{\text{First $k$ phases}} + \underbrace{\sum_{t=T_k+1}^T \left((1-\frac{1}{e})f(S^*)-\mathbb{E}[f(S_t)]\right)}_{\text{Exploitation phase}} \nonumber\\
    &=\sum_{i=1}^{k} \sum_{t=T_{i-1}+1}^{T_{i}} \left((1-\frac{1}{e})f(S^*)-\mathbb{E}[f(S_t)]\right)+\sum_{t=T_k+1}^T \left((1-\frac{1}{e})f(S^*)-\mathbb{E}[f(S^{(k)})]\right). \label{eq:prf:mainthm:case1:60}
\end{align}

Recall that in phase $i$, each of the $n-i+1$ actions in $\mathcal{S}_{i}$ is played exactly $m$ times, meaning $T_{i}-T_{i-1} = m(n-i+1)$. Since all actions played in phase $i$ include the set $S^{(i-1)}$ (the empirically best set played in phase $i-1$), in notation $S^{(i-1)} \subset S_t$ for $t \in \{T_{i-1}+1, \cdots, T_{i}\}$, by monotonicity of the expected reward function $f$, we have $f(S^{(i-1)}) \leq f(S_t)$, for $t \in \{T_{i-1}+1, \cdots, T_{i}\}$. Thus, we can simplify the inner summation in the first term of \eqref{eq:prf:mainthm:case1:60} as 
% \cjq{Guanyu - several steps were missing and guidance to the reader needed improvement}
\begin{align}
    \sum_{t=T_{i-1}+1}^{T_{i}} \left((1-\frac{1}{e})f(S^*)-\mathbb{E}[f(S_t)]\right)
    %
    &\leq 
     \sum_{t=T_{i-1}+1}^{T_{i}} \left((1-\frac{1}{e})f(S^*)-\mathbb{E}[f(S^{(i-1)})]\right) \tag{monotonicity: $f(S^{(i-1)}) \leq f(S_t)$}\\
     %
     %
     &=
     m(n-i+1) \left((1-\frac{1}{e})f(S^*)-\mathbb{E}[f(S^{(i-1)})]\right). \label{eq:prf:mainthm:case1:63}
\end{align}

% \cjq{CJQ paused here}

Plugging \eqref{eq:prf:mainthm:case1:63} back into \eqref{eq:prf:mainthm:case1:60},
\begin{align}
    \mathbb{E}[\mathcal{R}(T)] & \leq \sum_{i=1}^{k}m(n-i+1)\left((1-\frac{1}{e})f(S^*)-\mathbb{E}[f(S^{(i-1)})]\right) + \sum_{t=T_k+1}^T \left((1-\frac{1}{e})f(S^*)-\mathbb{E}[f(S^{(k)})]\right) \nonumber\\
    & \leq mn\sum_{i=1}^{k}\left((1-\frac{1}{e})f(S^*)-\mathbb{E}[f(S^{(i-1)})]\right) + \sum_{t=T_k+1}^T \left((1-\frac{1}{e})f(S^*)-\mathbb{E}[f(S^{(k)})]\right). \label{eq:regret_decomp1}
\end{align}


% \cjq{Guanyu smooth out steps}
Now we upper bound the two terms above using \cref{cor:sk_lower}.  % \cref{lem:consequtive_reward}. 

% \cjq{To streamline the proof, maybe take the next the next page or so and introduce a corollary for \eqref{eq:final_exp_reward}. }



Since the sets $S^{(i-1)}$ for $i\in \{2,\cdots, k\}$ are random variables, we take the expectation of \eqref{eq:consequtive_reward} (conditioned on the event $\mathcal{E}$), yielding
\begin{align}
    \mathbb{E}[f(S^{(i)})] - \mathbb{E}[f(S^{(i-1)})]
    &\geq  \frac{1}{k} \left[f(S^*)-\mathbb{E}[f(S^{(i-1)})]\right] - 2\mathrm{rad}, \label{eq:consequtive_exp_reward} \\
    %
    	\Longleftrightarrow \qquad f(S^*)-\mathbb{E}[f(S^{(i-1)})] 
    &\leq k(\mathbb{E}[f(S^{(i)})] - \mathbb{E}[f(S^{(i-1)})]+  2\mathrm{rad} ).  \label{eq:consequtive_exp_reward2}
\end{align} 
Similarly, taking the expectation of \eqref{eq:final_reward} yields
\begin{align}
    (1 - \frac{1}{e})f(S^*) - \mathbb{E}[f(S^{(k)})] & \leq  2k \mathrm{rad}
    %\sqrt{2\log(T)/m}
    . \label{eq:final_exp_reward}
\end{align}

% Rearranging \eqref{eq:consequtive_exp_reward} we get

% \begin{align}
%     f(S^*)-\mathbb{E}[f(S^{(i)})] \leq k(\mathbb{E}[f(S^{(i+1)})] - \mathbb{E}[f(S^{(i)})]+  2\mathrm{rad} ). \label{eq:consequtive_exp_reward2}
% \end{align}

Applying \eqref{eq:consequtive_exp_reward2} and \eqref{eq:final_exp_reward} to the first and second terms in \eqref{eq:regret_decomp1}, respectively, yields
%\cjq{do in steps}

% \cjq{Guanyu - the first sum has a $1-1/e$ that gets skipped over}

\begin{align}
    \mathbb{E}[\mathcal{R}(T)] 
    %
    & \leq mn\sum_{i=1}^{k}\left((1-\frac{1}{e})f(S^*)-\mathbb{E}[f(S^{(i-1)})]\right) + \sum_{t=T_k+1}^T \left((1-\frac{1}{e})f(S^*)-\mathbb{E}[f(S^{(k)})]\right) \tag{copying \eqref{eq:regret_decomp1}} \\
    %
    %
    & \leq mn\sum_{i=1}^{k}\left(f(S^*)-\mathbb{E}[f(S^{(i-1)})]\right) + \sum_{t=T_k+1}^T \left((1-\frac{1}{e})f(S^*)-\mathbb{E}[f(S^{(k)})]\right) \tag{using $1-\frac{1}{e}\leq 1$ in first sum} \\
    %
    %
    & \leq mnk\sum_{i=1}^{k}\left(\mathbb{E}[f(S^{(i)})] - \mathbb{E}[f(S^{(i-1)})]+  2\mathrm{rad}\right) + \sum_{t=T_k+1}^T \left(2k
    \mathrm{rad}
    % \sqrt{2\log(T)/m}
    \right) \tag{using \eqref{eq:consequtive_exp_reward2} and \eqref{eq:final_exp_reward}}\\
    & = mnk
    \left( \mathbb{E}[f(S^{(k)})] - \mathbb{E}[f(S^{(0)})]
    % (\mathbb{E}[f(S^{(1)})] - \mathbb{E}[f(S^{(0)})] + \mathbb{E}[f(S^{(2)})] - \mathbb{E}[f(S^{(1)})]+ \nonumber\\
    % &\qquad \cdots + \mathbb{E}[f(S^{(k)})] - \mathbb{E}[f(S^{(k-1)})]
    +  2k\mathrm{rad}\right) + \sum_{t=T_k+1}^T \left(2k \mathrm{rad}
    % \sqrt{2\log(T)/m}
    \right) \tag{telescoping sum}\\
    & \leq mnk \left(\mathbb{E}[f(S^{(k)})] +  2k\mathrm{rad}\right) + 2kT 
    \mathrm{rad}
    % \sqrt{2\log(T)/m}
    \tag{$f(S^{(0)})=0$}\nonumber\\[10pt]
    & \leq mnk \left(1 +  2k\mathrm{rad}\right) + 2kT 
    \mathrm{rad}
    % \sqrt{2\log(T)/m}
    . \tag{rewards are bounded in $[0,1]$}
\end{align}

% \cjq{if value of $\mathrm{rad}$ not used till below, keep it as $\mathrm{rad}$ until now}

Plugging in the definition $\mathrm{rad} = \sqrt{2\log(T)/m}$ and using the bound $\sqrt{2\log(T)/m} \leq \sqrt{2\log(T)}$ (valid since $m\geq 1$) to simplify the first term, we have


\begin{align}
    \mathbb{E}[\mathcal{R}(T)] 
    & \leq mnk \left(1 +  2k\sqrt{2\log(T)/m}\right) + 2kT \sqrt{2\log(T)/m} \nonumber\\
    &\leq mnk \left(1 +  2k\sqrt{2\log(T)}\right) + 2kT \sqrt{2\log(T)/m}. \label{eq:final_regret1}
\end{align}

We want to optimize  $m$, the number of times actions are played.  Denoting the regret bound \eqref{eq:final_regret1} as a function of $m$ 
\begin{align}
    g(m) = mnk \left(1 +  2k\sqrt{2\log(T)}\right) + 2kT \sqrt{2\log(T)/m},
\end{align}
then 
\begin{align}
    g'(m) = nk \left(1 +  2k\sqrt{2\log(T)}\right) - kT\sqrt{2\log(T)}m^{-3/2}.
\end{align}
Setting $g'(m)=0$ and solving for $m$, \newcommand{\mopt}{m^*}
\newcommand{\moptrnd}{m^\dagger}
\begin{align}
    \mopt= \left(\frac{T\sqrt{2\log(T)}}{n+2nk\sqrt{2\log(T)}}\right)^{2/3}. \label{eq:value:m}
\end{align}
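To spell out the intermediate algebra (a brief sketch of the steps behind \eqref{eq:value:m}), setting $g'(m)=0$ gives
\begin{align*}
    kT\sqrt{2\log(T)}\,m^{-3/2} &= nk \left(1 +  2k\sqrt{2\log(T)}\right) \\
    \Longleftrightarrow \qquad m^{3/2} &= \frac{T\sqrt{2\log(T)}}{n\left(1+2k\sqrt{2\log(T)}\right)},
\end{align*}
and raising both sides to the power $2/3$ recovers \eqref{eq:value:m}.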
We next check the second derivative,
\begin{align}
    g''(m) = \frac{3}{2}kT\sqrt{2\log(T)}m^{-5/2}. \label{eq:g2ndder}
\end{align}
For positive values of $m$, $g''(m) > 0$, so $g$ is strictly convex on $(0,\infty)$ and attains its minimum at \eqref{eq:value:m}.

Since $m$ is the number of times actions are played, we (trivially) need $m\geq 1$ and $m$ to be an integer. We choose 
\begin{align}
    \moptrnd= \ceil*{\left(\frac{T\sqrt{2\log(T)}}{n+2nk\sqrt{2\log(T)}}\right)^{2/3}}. \label{eq:int:value:m}
\end{align}

Since from \eqref{eq:g2ndder} we have that  $g''(m) >0$ for positive $m$, $g(\mopt)\leq g(\moptrnd)$. % \frac{3}{2}kT\sqrt{2\log(T)}m^{-5/2}. \label{eq:g2ndder}From \eqref{eq:geq1}, $\moptrnd\geq 1$.  We also have $g(\mopt)\leq g(\moptrnd)$. %the chosen $m$ is still at least 1.

%By \eqref{eq:geq1}, $\moptrnd\geq 1$ for $T \geq n (2k+1)$.
For $T\geq n(k+1)$, we have 
\begin{align}
    \mopt=\left(\frac{T\sqrt{2\log(T)}}{n+2nk\sqrt{2\log(T)}}\right)^{2/3} &= \left(\frac{T}{\frac{n}{\sqrt{2\log(T)}}+2nk}\right)^{2/3} \nonumber\\
    &\geq \left(\frac{n(k+1)}{\frac{n}{\sqrt{2\log(n(k+1))}}+2nk}\right)^{2/3} \nonumber\\
    %
    &= \left(\frac{k+1}{\frac{1}{\sqrt{2\log(n (k+1))}}+2k}\right)^{2/3} \nonumber\\
    %
    &\geq \left(\frac{k+1}{2k+1}\right)^{2/3} \nonumber\\
    &\geq \left(\frac{1}{2}\right)^{2/3} \nonumber\\
    & > \frac{1}{2}. \label{eq:geq1}
\end{align}
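The bound \eqref{eq:geq1} is what allows the ceiling in \eqref{eq:int:value:m} to be controlled by a constant factor. As a short case check (a sketch, using only $\mopt > \frac{1}{2}$ and the definition of the ceiling),
\[
    \ceil*{\mopt}
    \begin{cases}
        = 1 < 2\mopt, & \text{if } \tfrac{1}{2} < \mopt \leq 1,\\[4pt]
        < \mopt + 1 < 2\mopt, & \text{if } \mopt > 1,
    \end{cases}
\]
so $\ceil*{\mopt} \leq 2\mopt$ in either case.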



Plugging \eqref{eq:int:value:m} back into \eqref{eq:final_regret1},  
%\cjq{Guanyu: you are plugging in \eqref{eq:value:m}, not  \eqref{eq:int:value:m} -- so the bound you are using is too optimistic (minimized); need to argue using $g(\moptrnd)\leq a g(\mopt)+b$ for some small a and or b, or another way}


\begin{align}
    \mathbb{E}[\mathcal{R}(T)]
    %
    &\leq \moptrnd nk \left(1 +  2k\sqrt{2\log(T)}\right) + 2kT \sqrt{2\log(T)/\moptrnd} \tag{ \eqref{eq:final_regret1} with $\moptrnd$ samples for each action} \\
    %
    &= \ceil*{\mopt} nk \left(1 +  2k\sqrt{2\log(T)}\right) + 2kT \sqrt{2\log(T)/\ceil*{\mopt}}  \nonumber\\
    %
    &\leq \ceil*{\mopt} nk \left(1 +  2k\sqrt{2\log(T)}\right) + 2kT \sqrt{2\log(T)/\mopt}  \tag{Since $\ceil{\mopt} \geq \mopt$} \\
        %
    &\leq 2\mopt nk \left(1 +  2k\sqrt{2\log(T)}\right) + 2kT \sqrt{2\log(T)/\mopt}  \tag{Since $\mopt \geq 1/2$, $\ceil{\mopt} \leq 2\mopt$} \\
    %
    &= 2\left(\frac{T\sqrt{2\log(T)}}{n+2nk\sqrt{2\log(T)}}\right)^{2/3}nk(1+2k\sqrt{2\log(T)}) + 
    % \nonumber\\
    % & \qquad 
    2kT \sqrt{2\log(T)} \left(\frac{n+2nk\sqrt{2\log(T)}}{T\sqrt{2\log(T)}}\right)^{1/3}  \tag{using \eqref{eq:value:m}}\\
    %
    %
    & = \frac{2(T\sqrt{2\log(T)})^{2/3}}{n^{2/3}(1+2k\sqrt{2\log(T)})^{2/3}}nk(1+2k\sqrt{2\log(T)}) + 
    % \nonumber\\
    % & \qquad 
    2kT \sqrt{2\log(T)} \frac{n^{1/3}(1+2k\sqrt{2\log(T)})^{1/3}}{(T\sqrt{2\log(T)})^{1/3}} \tag{rearranging}\\
    %
    %
    & = 2(T\sqrt{2\log(T)})^{2/3}n^{1/3}k\left(1+2k\sqrt{2\log(T)}\right)^{1/3} + 
    % \nonumber\\
    % & \qquad 
    2k(T\sqrt{2\log(T)})^{2/3} n^{1/3}(1+2k\sqrt{2\log(T)})^{1/3} \tag{cancelling common terms}\\
    %
    %
    &= 4n^\frac{1}{3}k(T\sqrt{2\log(T)})^\frac{2}{3}(1+ 2k\sqrt{2\log(T)})^\frac{1}{3} \label{eq:final_regret_clean}\\
    %
    % &= 3n^\frac{1}{3}k(2T\sqrt{2\log(T)})^\frac{2}{3}(1+ 2k\sqrt{2\log(T)})^\frac{1}{3} \nonumber\\
    %
    &= \mathcal{O}(n^\frac{1}{3}k^\frac{4}{3}T^\frac{2}{3}\log(T)^\frac{1}{2}). \nonumber 
\end{align} 

where \eqref{eq:final_regret_clean} follows by factoring.  In conclusion, the expected $(1-1/e)$ regret \eqref{eq:reg:exp1e} is upper bounded by \eqref{eq:final_regret_clean} if the clean event $\mathcal{E}$ happens.
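
To make the order statement explicit, the following sketch gives one possible constant; it is illustrative and not optimized. We assume natural logarithms and use $T\geq n(k+1)\geq 2$, so that $\sqrt{2\log(T)}\geq 1$. Together with $k\geq 1$ this gives $1+2k\sqrt{2\log(T)} \leq 3k\sqrt{2\log(T)}$, and therefore
\begin{align*}
    4n^\frac{1}{3}k\left(T\sqrt{2\log(T)}\right)^\frac{2}{3}\left(1+ 2k\sqrt{2\log(T)}\right)^\frac{1}{3}
    &\leq 4n^\frac{1}{3}k\left(T\sqrt{2\log(T)}\right)^\frac{2}{3}\left(3k\sqrt{2\log(T)}\right)^\frac{1}{3} \\
    &= 4\cdot 3^\frac{1}{3}\, n^\frac{1}{3}k^\frac{4}{3}T^\frac{2}{3}\sqrt{2\log(T)}.
\end{align*}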

\subsubsection*{Case 2: clean event $\mathcal{E}$ does not happen}
We next derive an upper bound for the expected $(1-1/e)$ regret \eqref{eq:reg:exp1e} for the case that the clean event $\mathcal{E}$ does not happen. By \cref{lem:probcleanevents},
\[\mathbb{P}(\bar{\mathcal{E}}) = 1-\mathbb{P}(\mathcal{E}) \leq \frac{2nk}{T^4}.\] Since the reward function $f_t(\cdot)$ is upper bounded by 1, the  expected $(1-1/e)$ regret \eqref{eq:reg:exp1e} incurred under $\bar{\mathcal{E}}$ for a horizon of $T$ is at most $T$, %. Thus we have
\begin{align}
    \mathbb{E}[\mathcal{R}(T)|\bar{\mathcal{E}}] \leq T. \label{eq:badevent:regretbnd}
\end{align}

\subsubsection*{Putting it all together}

Combining Cases 1 and 2, we have
\begin{align}
    \mathbb{E}[\mathcal{R}(T)] &= \mathbb{E}[\mathcal{R}(T)|\mathcal{E}] \cdot \mathbb{P}(\mathcal{E}) +\mathbb{E}[\mathcal{R}(T)|\bar{\mathcal{E}}]\cdot \mathbb{P}(\bar{\mathcal{E}}) \tag{Law of total expectation}\\
    %
    %
    &\leq \left[4n^\frac{1}{3}k(T\sqrt{2\log(T)})^\frac{2}{3}(1+ 2k\sqrt{2\log(T)})^\frac{1}{3}\right]  \cdot 1 +T\cdot 2nkT^{-4} \tag{using \eqref{eq:final_regret_clean}, \cref{lem:probcleanevents}, and \eqref{eq:badevent:regretbnd}}\\
    %
    %
    &= \mathcal{O}(n^\frac{1}{3}k^\frac{4}{3}T^\frac{2}{3}\log(T)^\frac{1}{2}). % + \mathcal{O}(nkT^{-3}). 
    \nonumber
\end{align}
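The last step absorbs the contribution of the bad event into the leading term. As a quick check (a sketch using the standing assumption $T\geq n(k+1)$ from Case 1, which implies $nk < T$),
\begin{align*}
    T\cdot 2nkT^{-4} = \frac{2nk}{T^{3}} < \frac{2}{T^{2}},
\end{align*}
which vanishes as $T$ grows and is therefore dominated by the $T^{2/3}$ term.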
This concludes the proof of %the main theorem. 
\cref{thm:main}.

% \section{Basic Facts About Submodular Functions}
% \subsection{}\label{apdx:1minus1ebnd}
% For completeness, we  show the (well-known) lower bound % of the coefficient
% \begin{align}
%     1- \left(1 - \frac{1}{k}\right)^k \geq 1-\frac{1}{e}
% \end{align} for all $k\geq 1$.  The right hand side is the limit of the left hand side as $k\to \infty$.

% \begin{proof}
% First, we consider the limit of $1- \left(1 - \frac{1}{k}\right)^k$.  Rearranging,
% \begin{align*}
%     1- \left(1 - \frac{1}{k}\right)^k &= 1- \left(\frac{k-1}{k}\right)^k %\\
%     %
%     = 1- \frac{(k-1)^k}{k^k}\\
%     &= \frac{k^k - (k-1)^k}{k^k} %\\
%     % &= \frac{\frac{k^k}{(k-1)^k} - \frac{(k-1)^k}{(k-1)^k}}{\frac{k^k}{(k-1)^k}} \tag{divide by $(k-1)^k$}\\
%     % &
%     = \frac{\frac{k^k}{(k-1)^k} - 1}{\frac{k^k}{(k-1)^k}} \tag{divide by $(k-1)^k$}\\
%     %
%     &= \frac{ \left(1 + \frac{1}{k-1} \right)^k    - 1}{\left(1 + \frac{1}{k-1} \right)^k } %\\
%         %
%     % &
%     = \frac{ \left(1 + \frac{1}{k-1} \right)^{k-1}\left(1 + \frac{1}{k-1} \right)    - 1}{ \left(1 + \frac{1}{k-1} \right)^{k-1}\left(1 + \frac{1}{k-1} \right)} 
% \end{align*}  Since  $\lim_{\ell \to \infty} \left( 1 + \frac{1}{\ell}\right)^\ell = e$, in the limit as $k\to \infty$, 

% \begin{align*}
%     \lim_{k\to\infty} 1- \left(1 - \frac{1}{k}\right)^k  
%     %
%     &= \frac{ e\cdot 1    - 1}{ e\cdot 1 } = 1-\frac{1}{e}.
% \end{align*}

% Second, we show that $1- \left(1 - \frac{1}{k}\right)^k$ is decreasing in $k$, and thus its limit is a lower bound for all $k$. Consider the continuous function $ 1- \left(1 - \frac{1}{x}\right)^x$ for $x>1$.  Its derivative is 
% \begin{align}
%     \frac{d}{dx} 1- \left(1 - \frac{1}{x}\right)^x &= 0 - \left(1 - \frac{1}{x}\right)^x \log x.
% \end{align}
% For $x>1$, $0 < \frac{1}{x} < 1$, hence $\left(1 - \frac{1}{x}\right)>0$.  $\log x > 0$ for all $x>1$, thus the derivative is negative for all $x>1$ and consequently $1- \left(1 - \frac{1}{x}\right)^x$ is monotone decreasing.
% \end{proof}

% \subsection{}\label{apdx:submod}
% For completeness, we reproduce the following result from \cite{nemhauser1978analysis}.
% \begin{lemma}
% For a monotonically non-decreasing submodular function $f$ defined on the set of subsets of $\Omega$, we have for arbitrary  $A,B\subseteq \Omega$, 
% \begin{align}
%     f(B)-f(A) \leq \sum_{ j \in B \setminus A } \left[f(A\cup \{j\})-f(A)\right]. \nonumber
% \end{align}
% \end{lemma}

% \begin{proof}
% Enumerate the elements that are in set $B$ but not in $A$ as  %$A \backslash B = \{ j_1, \cdots, j_r \}$ and 
% $B \backslash A = \{ j_1, \cdots,j_q \}$.  We have 
% % \cjq{you need to explain steps. Is the first one setting up a telescoping sum? is the second just new notation or something else? also, switching from differences to conditional notation back to differences is confusing} 
% \begin{align}
%     f(A\cup B)-f(A) &= \sum_{\ell=1}^q [f(A\cup \{j_1,\cdots, j_\ell\}) - f(A\cup \{j_1,\cdots, j_{\ell-1}\})] \tag{telescoping sum}\\
%     % & = \sum_{l=1}^qf(k_l|A\cup \{k_1,\cdots, k_{l-1}\}) \\
%     % & \leq \sum_{\ell=1}^q f(_l|A) \tag{diminishing returns}\\
%     &\leq \sum_{\ell=1}^q \left[f(A\cup \{j_\ell\})-f(A)\right], \label{eq:lem4_1}
% \end{align} with \eqref{eq:lem4_1} following from the definition of submodularity in \cref{intro}. %diminishing returns (

% Using the monotonicity of $f$, we have
% \begin{align}
%      f(B) %&= \sum_{l=1}^r [f(B\cup \{j_1,\cdots, j_l\}) - f(B\cup \{j_1,\cdots, j_{l-1}\})] \nonumber\\
%     %& = \sum_{l=1}^rf(j_l|B\cup \{j_1,\cdots, j_{l}\}\setminus\{j_{l}\}) \nonumber\\
%     %& \geq \sum_{l=1}^rf(j_l|B\cup A\setminus\{j_{l}\}) \tag{diminishing returns}\\
%     %& = \sum_{ j \in A \backslash B } \left[f(A\cup B)-f(A\cup B\setminus\{j\})\right]. 
%     &\leq f(A\cup B) \nonumber\\
%     %
%     \qquad \Longleftrightarrow \qquad -f(A\cup B)+f(B) &\leq 0.  \label{eq:lem4_2}
% \end{align}

% Adding \eqref{eq:lem4_2} and \eqref{eq:lem4_1}, we get
% \begin{align}
%     f(B)-f(A) \leq \sum_{\ell=1}^q \left[f(A\cup \{j_\ell\})-f(A)\right]. \nonumber
% \end{align}

% \end{proof}



\section{Algorithm OG$^\text{o}$} \label{implimentation:ogo}
In this section we describe implementation details and parameter selection for the OG$^\text{o}$ algorithm \cite{streeter2008online}. The exploration probability follows the original paper: $\gamma = n^{1/3}k\left(\frac{\log(n)}{T}\right)^{1/3}$. The parameter $\epsilon$ is the learning rate of the Randomized Weighted Majority (WMR) expert algorithm \cite{Arora2012TheMW}. It is chosen by setting the derivative of the regret upper bound to zero, which gives $\epsilon = \sqrt{\frac{\log(n)}{T_e}}$, where $T_e$ is the number of time steps spent updating expert $e$. Since the algorithm explores with probability $\gamma$ and there are $k$ expert algorithms, we have $T_e \approx \frac{\gamma T}{k}$. Thus we pick $\epsilon = \sqrt{\frac{k\log(n)}{\gamma T}}$. In our experiments, the chosen $\gamma$ is often large and sometimes even exceeds 1, so we cap the exploration probability $\gamma$ at $1/2$ to avoid exploring too much. \cref{alg:OGO} gives pseudocode for the implementation of this algorithm.
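
As a rough illustration of when the cap is active, the uncapped formula can be rearranged as follows. This is a sketch only; it assumes natural logarithms and takes $n=20$ and $k=4$, the values used in the synthetic max-function experiments, purely as an example.
\begin{align*}
    n^{1/3}k\left(\frac{\log(n)}{T}\right)^{1/3} \leq \frac{1}{2}
    \qquad \Longleftrightarrow \qquad
    \frac{nk^3\log(n)}{T} \leq \frac{1}{8}
    \qquad \Longleftrightarrow \qquad
    T \geq 8nk^3\log(n).
\end{align*}
For $n=20$ and $k=4$ this threshold is $8\cdot 20\cdot 64\cdot\log(20)\approx 3.1\times 10^{4}$, so under these assumptions the cap is active for shorter horizons; for example, $T=10^4$ gives an uncapped value of $\gamma\approx 0.73$.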

\setcounter{algorithm}{1}
\begin{algorithm}
    \caption{Online Greedy for Opaque Feedback Model (OG$^\text{o}$)}
    \label{alg:OGO}
\begin{algorithmic}
  \State {\bfseries Input:} set of base arms $\Omega$, horizon $T$, cardinality constraint $k$
  \State Initialize $n\gets|\Omega|$, $\gamma \gets n^{1/3}k\left(\frac{\log(n)}{T}\right)^{1/3}$, $\epsilon \gets \sqrt{\frac{k\log(n)}{\gamma T}}$
  \State Initialize $\boldsymbol{\omega}_1 \gets \text{ones}(k, n)$
  \For{$t \in [1,\cdots,T]$}
  \State $S_t \gets \emptyset$
  \State $l \gets \text{zeros}(k, n)$ \algorithmiccomment{loss}
  \State Randomly sample a value $\xi \sim \text{Uniform}([0,1])$ 
  \If{$\xi \leq \gamma$} \algorithmiccomment{Exploration with probability $\gamma$}
  \State $e \sim \text{Uniform}(\{1,\cdots,k\})$ 
  \For{$i \in [1,\cdots, e-1]$} \algorithmiccomment{For experts before $e$, exploit}
  \State Select an arm $a$ with probability $\frac{\boldsymbol{\omega}_t[i,a]}{\sum \boldsymbol{\omega}_t[i,:]}$, re-sample if $a \in S_t$ 
  \State $S_t \gets S_t \cup \{a\}$
  \EndFor
  \State $a \sim \text{Uniform}(\{1,\cdots,n\}\backslash S_t)$ \algorithmiccomment{For expert $e$, explore}
  \State $S_t \gets S_t \cup \{a\}$ 
  \State Play action $S_t$, observe $f_t(S_t)$
  \State Update $l[i,j] \gets f_t(S_t)$ for all $i = e$ and $j \neq a$ \algorithmiccomment{Feed back $f_t(S_t)$ to expert $e$ associated with action $a$}
  \State Update $\boldsymbol{\omega}_{t+1}[i,j] \gets \boldsymbol{\omega}_{t}[i,j]\exp(-\epsilon l[i,j])$ for all pairs of $i$ and $j$  
  \Else \algorithmiccomment{Exploitation with probability $1-\gamma$}
  \For{$i \in [1,\cdots, k]$} \algorithmiccomment{All $k$ experts exploit}
  \State Select arm $a$ with probability $\frac{\boldsymbol{\omega}_t[i,a]}{\sum \boldsymbol{\omega}_t[i,:]}$, re-sample if $a \in S_t$
  \State $S_t \gets S_t \cup \{a\}$
  \EndFor
  \State Play action $S_t$, observe $f_t(S_t)$
  \State $\boldsymbol{\omega}_{t+1}[i,j] \gets \boldsymbol{\omega}_{t}[i,j]$ for all pairs of $i$ and $j$
  \algorithmiccomment{Zero loss is fed back to every expert-action pair, so the weights are unchanged}
  \EndIf 
  \EndFor
\end{algorithmic}
\end{algorithm}









\section{More Experiments} \label{sec:more-exp}

\subsection{Max Function}
We also conduct experiments with synthetic data on max functions: $f(S)=\max_{a\in S}f(\{a\})$. Similar to the setup in \cref{sec:exp:syn}, we use $n=20$ base arms and cardinality constraint $k=4$. Again, we generate the individual arm rewards $\{f(\{a\})\}_{a\in\Omega}$ randomly, with $f(\{a\}) \overset{i.i.d.}{\sim} \mathcal{U}([0.1,0.9])$, and add noise when sampling. The noise follows a normal distribution with mean 0 and standard deviation 0.1, truncated to the interval $[-0.1,0.1]$. The results are shown in \cref{fig:max}.

\begin{figure}[h]%
    \centering
    \subfloat[]{\label{fig:max:a}{\includegraphics[width=0.45\linewidth]{figures/regret_max.png} }}%
    \subfloat[]{\label{fig:max:b}{\includegraphics[width=0.427\linewidth]{figures/reward_plot_max.png} }}%
    \caption{(a) shows results for cumulative regret as a function of time horizon $T$. (b) shows the moving average plot with window size 100 of instantaneous reward as a function of $t$. The gray dashed lines in (a) represent $y = aT^{2/3}$ for various values of $a$. The gray dashed line in (b) represents the value of the optimal solution.}%
    \label{fig:max}%
\end{figure}

We can see from \cref{fig:max:a} that ETCG outperforms all baseline methods for horizons up to $T=10^6$, but DART seems able to surpass ETCG for larger $T$. The reason is that the max reward function, bounded in $[0,1]$, satisfies the assumptions of DART, so DART's $\mathcal{O}(T^{1/2})$ regret bound holds. Thus, we expect DART to eventually outperform ETCG for max reward functions. Notably, despite DART's asymptotic advantage for max functions, ETCG does better than DART for all but very large horizons (namely $T=10^6$). We argue that it is unrealistic for any application to be stationary (as DART assumes) over such a long horizon.

\subsection{Dependence on $n$ and $k$}
We also empirically plot the regret as a function of $n$ and $k$ to see if the dependence on $n$ and $k$ is ``correct'' for linear functions. 

The results are shown in \cref{fig:lin_nk}. From the figures we can see that, for linear rewards, the $\mathcal{O}(n^{1/3})$ dependence appears tight while the $\mathcal{O}(k^{4/3})$ dependence appears loose (the empirical dependence is closer to $\mathcal{O}(k^{1/3})$). We leave as an open question whether there exists an algorithm with a better guarantee with respect to $k$.

\clearpage

\begin{figure}[t!]%
    \centering
    \subfloat[]{{\includegraphics[width=0.463\linewidth]{figures/toy_regret_lin_k.png} }}%
    \subfloat[]{{\includegraphics[width=0.5\linewidth]{figures/toy_regret_lin_n.png} }}%
    \caption{(a) shows results for cumulative regret as a function of cardinality constraint $k$. (b) shows results for cumulative regret as a function of number of base arms $n$. The gray dashed lines in (a) represent $y = ak^{4/3}$ for various values of $a$. The gray dashed lines in (b) represent $y = an^{1/3}$ for various values of $a$.}%
    \label{fig:lin_nk}%
\end{figure}

\bibliography{refs.bib}

\end{document}