%\documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{algorithm}
%\usepackage{algorithmicx}
\usepackage{algpseudocode}
\usepackage{hyperref}
\usepackage{graphicx}
\usepackage{comment}
\usepackage{subcaption}


\usepackage{nameref}
\usepackage{zref-xr}
\zxrsetup{toltxlabel}
\zexternaldocument*{nie_684-supp}

% CJQ added Feb 19 for table (copied from proposal) -- got error for xcolor already loaded-- removed cell colors
\usepackage{multicol}%,subfigure}
\usepackage{wrapfig, rotating}

% \usepackage[svgnames,rgb]{xcolor}


% \newcommand{\theHalgorithm}{\arabic{algorithm}}
\DeclarePairedDelimiter{\ceil}{\lceil}{\rceil}
\DeclarePairedDelimiter\floor{\lfloor}{\rfloor}
\DeclareMathOperator*{\argmax}{arg\,max} 
\DeclareMathOperator*{\argmin}{arg\,min} 
\newcommand{\cjq}[1]{\color{blue}#1\color{black}}

% For theorems and such
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{mathtools}
\usepackage{amsthm}

% % if you use cleveref..
\usepackage[capitalize,noabbrev]{cleveref}

\usepackage{balance}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% THEOREMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
% \theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example


\title{An Explore-then-Commit Algorithm for Submodular Maximization Under Full-bandit Feedback}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:nieg@iastate.edu?Subject=Your UAI 2022 paper}{Guanyu Nie}{}}
\author[2]{\href{mailto:agarw180@purdue.edu?Subject=Your UAI 2022 paper}{Mridul Agarwal}{}}
\author[2]{\href{mailto:aumrawal@purdue.edu?Subject=Your UAI 2022 paper}{Abhishek Kumar Umrawal}{}}
\author[2]{\href{mailto:vaneet@purdue.edu?Subject=Your UAI 2022 paper}{Vaneet Aggarwal}{}}
\author[1]{\href{mailto:cjquinn@iastate.edu?Subject=Your UAI 2022 paper}{Christopher John Quinn}{}}
% Add affiliations after the authors
\affil[1]{%
    Computer Science Department\\
    Iowa State University\\
    Ames, Iowa, USA
}
\affil[2]{%
    Purdue University\\
    West Lafayette, Indiana, USA
}
  
\begin{document}
\maketitle

\begin{abstract}
  We investigate the problem of combinatorial multi-armed bandits with stochastic submodular (in expectation) rewards and full-bandit feedback, where no extra information other than the reward of selected action at each time step $t$ is observed. We propose a simple algorithm, Explore-Then-Commit Greedy (ETCG) 
  and prove that it  achieves a $(1-1/e)$-regret upper bound of $\mathcal{O}(n^\frac{1}{3}k^\frac{4}{3}T^\frac{2}{3}\log(T)^\frac{1}{2})$ for a horizon $T$, number of base elements $n$, and cardinality constraint $k$.
  We also show in experiments with synthetic and real-world data that the ETCG empirically outperforms other full-bandit methods. 
\end{abstract}

\section{Introduction}
\label{intro}
The stochastic multi-armed bandit (MAB) problem was first introduced by \cite{bams/1183517370}. It formalizes challenging sequential decision problems faced by many organizations, including inventory selection, scheduling, work assignments and team formation, multi-market ad campaigns, product recommendation, crowd-sourcing, and investing.  At each round, the decision maker selects an arm and observes a reward drawn from an unknown distribution. The goal of the decision maker is to maximize the expected cumulative reward over all rounds. The classical MAB problem exhibits the trade-off between \textit{exploration} and \textit{exploitation}: should the agent try an arm that has not been tried many times so far (exploration), or should it stick with the arm that performed well based on previous observations (exploitation)?

The combinatorial multi-armed bandit (CMAB) problem is an extension of the MAB problem. In this setting, the decision maker selects a \textit{super arm} composed of \textit{base arms} at each round, and observes a reward corresponding to the selected super arm. If the decision maker only learns the aggregated reward for the selected super arm, that feedback is referred to as \textit{full-bandit}. Otherwise, if the decision maker learns additional information (e.g., individual rewards of the base arms), the feedback is referred to as \textit{semi-bandit}. Furthermore, there are two common formalizations depending on the assumed nature of environments: the \textit{stochastic} setting and the \textit{adversarial} setting. 

In the adversarial setting, the reward sequence is generated by an unrestricted adversary, potentially based on the history of the decision maker's actions \citep{10.1137/S0097539701398375}. In the stochastic environment, the reward of each arm is drawn independently from a fixed distribution \citep{10.1023/A:1013689704352}. For many bandit problems, the stochastic setting is a special case of the adversarial setting.  For those problems, algorithms designed for the adversarial setting maintain their theoretical performance guarantees when applied to problems in the stochastic setting, though they typically under-perform empirically relative to algorithms specifically designed for the stochastic setting \citep{lattimore_szepesvari_2020}. Moreover, strategies designed for the stochastic setting may have simpler designs and be computationally more efficient. Thus, developing efficient algorithms specialized to the stochastic setting is important.  Furthermore, as we will later describe, the stochastic setting we consider in this paper is not a special case of the adversarial settings that have been studied in the literature.  Specifically, past research in the adversarial setting assumes the reward function has extra properties that, when specialized to the stochastic setting, are overly restrictive.

When the reward depends non-linearly on the ground set, developing efficient algorithms becomes more challenging. For example, opening additional restaurants in a small market may result in diminishing returns due to market saturation. Such diminishing returns can be naturally modeled with the class of submodular set functions. A set function $f:2^\Omega \rightarrow \mathbb{R}$ defined on a finite ground set $\Omega$ is said to be \textit{submodular} if it satisfies the diminishing return property: for all $A\subseteq B\subseteq \Omega$, and $x\in \Omega\setminus B$, it holds that $f(A\cup \{x\})-f(A) \geq f(B\cup \{x\})-f(B)$ \citep{nemhauser1978analysis}. In this paper, we focus on the problem of combinatorial multi-armed bandits with stochastic submodular (in expectation) rewards and full-bandit feedback. We further assume that the reward function is monotone: a submodular set function $f:2^\Omega \rightarrow \mathbb{R}$ is called monotone if for any $A \subseteq B \subseteq \Omega$ we have $f(A) \leq f(B)$. 
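
As a small illustration of these two properties (the numbers are hypothetical and chosen only to make the definitions concrete), take $\Omega=\{1,2\}$ with $f(\emptyset)=0$, $f(\{1\})=f(\{2\})=0.6$, and $f(\{1,2\})=0.8$. With $A=\emptyset$, $B=\{2\}$, and $x=1$, the diminishing return property reads
\begin{align*}
    f(\{1\})-f(\emptyset) = 0.6 \;\geq\; 0.2 = f(\{1,2\})-f(\{2\}),
\end{align*}
and the same holds with the roles of the two elements exchanged; the cases with $A=B$ hold with equality. The function is also monotone, since $f(\emptyset)\leq f(\{1\})=f(\{2\})\leq f(\{1,2\})$.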


\subsection{Motivating Examples}

\paragraph{Influence Maximization}
Consider a social network in which a company has developed an application and wants to market it through the network. A natural strategy is to select a set of highly influential users in the hope that they adopt the application and recommend it to their friends. Influence maximization is the problem of finding a small subset (seed set) in a network that can achieve maximum influence. This subset selection problem in social networks is commonly modeled as an offline submodular optimization problem
\citep{domingos2001mining,kempe2003maximizing,chen2010scalable}. Algorithms and heuristics for solving this problem often assume knowledge of the network and diffusion model. A recent line of research has generalized the problem as a multi-armed bandit problem (with extra feedback) where the knowledge of the network and diffusion model is not required \citep{lei2015online,wen2017online,vaswani2017model,li2020online,perrault2020budgeted}.

\paragraph{Recommender Systems}
When recommending bundles of items, such as movies, news articles, or consumer products, considering the estimated individual item rankings alone may be suboptimal. The system should recommend diversified items to maximize the coverage of information that users are interested in, in order to get as much positive feedback as possible. The motivation is that recommending items with redundant information leads to diminishing returns on utility. This problem of sequentially recommending sets of items to users has been studied through the framework of contextual submodular combinatorial bandits \citep{Qin2013PromotingDI, yue2011linear, takemori2020submodular}.
\paragraph{Crowdsourcing and Crowdsensing}
Crowdsourcing involves batches of simple tasks being sequentially assigned to workers with unknown quality and speed.  For example,  workers may be recruited to manually label images in a database. Crowdsensing involves sequentially collecting data from large numbers of users in different locations.  For instance, mobile phone accelerometer data can help identify potholes in city roads.  Instances of these problems often involve sequential decision making of assigning/selecting subsets of workers/users with unknown qualities and under a budget.  There is a line of research on this topic using the framework of combinatorial multi-armed bandits with submodular rewards 
\citep{zhang2012information,nushi2016learning,song2021minimizing}. 

\subsection{Our Contribution}
The main contribution of this paper can be summarized as follows:
\begin{itemize}
    \item We propose Explore-then-Commit Greedy (ETCG), the first algorithm designed for stochastic CMAB problems with a submodular reward function (in expectation) and full-bandit feedback.  It is procedurally simple and has low storage and per-round computational complexity. 
    \item We prove that ETCG achieves  $\mathcal{O}(n^\frac{1}{3}k^\frac{4}{3}T^\frac{2}{3}\log(T)^\frac{1}{2})$ expected cumulative $(1-1/e)$-regret.
    \item We show ETCG outperforms other full-bandit methods on experiments with synthetic and real-world data.  
\end{itemize}

\subsection{Related Work} \label{sec:related-work}

\begin{table*}[t]
\centering
\begin{tabular}{lccccc}%
\toprule
%
 & \multicolumn{2}{c}{Reward} &\multicolumn{1}{c}{Feedback} 
 & \multicolumn{2}{c}{Regret}\\%
%
\cmidrule(lr){2-3}\cmidrule(lr){4-4}\cmidrule(lr){5-6}%
%
 & Submodular  & Stochastic & Full-Bandit   & Cumulative  & $(1-1/e)$ Bound  \\
%
\midrule
%
\cite{streeter2008online}    & \checkmark &  & \checkmark  & \checkmark &  $\tilde{\mathcal{O}}(\ n^\frac{1}{3}\ k^2\ T^\frac{2}{3}\ )$  \\
%
\cite{golovin2014online}    & \checkmark &  & \checkmark & \checkmark  &  $\tilde{\mathcal{O}}(\ n^\frac{2}{3}\ k^\frac{2}{3}\ T^\frac{2}{3}\ )$  \\
%
\cite{niazadeh2021online}     & \checkmark &  & \checkmark  & \checkmark &  $\tilde{\mathcal{O}}(\ n^\frac{2}{3}\ k\phantom{\frac{3}{3}}\ T^\frac{2}{3}\ )$   \\
%
\cite{agarwal2021dart}   &  & \checkmark & \checkmark & \checkmark &   $\tilde{\mathcal{O}}(\ n^\frac{1}{2}\ k^{\frac{3}{2}}\ T^\frac{1}{2}\ )$   \\
%
\cite{agarwal2021stochastic}   &  & \checkmark & \checkmark & \checkmark &   $\tilde{\mathcal{O}}(\ n^\frac{1}{3}\ k^{\frac{1}{2}}\ T^\frac{2}{3}\ )$   \\
%
\cite{chen2018projection}   & \checkmark & \checkmark  &  &  \checkmark &   $\phantom{\ n^\frac{1}{3}\ k^{\frac{1}{2}}}\ \tilde{\mathcal{O}}( T^\frac{1}{2})^\dagger$  \\
%
\cite{du2021combinatorial}   &  & \checkmark & \checkmark &  &  -----    \\
%
ETCG (ours)  & \checkmark & \checkmark & \checkmark & \checkmark & $\tilde{\mathcal{O}}(\ n^\frac{1}{3}\ k^\frac{4}{3}\ T^\frac{2}{3}\ )$    \\
%
\bottomrule
\end{tabular}
\caption{\label{tab:related-work} Table of select related works, enumerating which problem and performance aspects are shared with our proposed ETCG.  The notation $\tilde{\mathcal{O}}(\cdot)$ drops $\log$ terms.  $^\dagger$\citep{chen2018projection} require additional smoothness properties of $f$ and the dependence on $k$ and $n$ is unknown.}
\end{table*}

We now briefly discuss related works from several research topics that overlap in multiple aspects with the problem we study.  \cref{tab:related-work}  lists related works and enumerates aspects of the problem setup including properties of the reward function, the feedback model, and regret type.  We let $n$ denote the number of base arms, $k$ the maximum cardinality, and $T$ the time horizon.


\paragraph{Adversarial} The closest related works are those for adversarial CMAB with submodular rewards, full-bandit feedback, and cumulative regret. In the adversarial setting, the environment chooses a sequence of monotone and submodular functions $\{f_1, \dots, f_T\}$. This is incompatible with our setting, since we only require the set function $f_t$ to be monotone and submodular \textit{in expectation}. Regret in the adversarial setting is also different---the decision-maker competes against a maximizing action over the sum of the sequence, $(1-1/e) \max_{a\in \mathcal{A}}\sum_{t=1}^T f_t(a)$.  

We nonetheless consider the following regret bounds to be relevant benchmarks for the stochastic setting.


\cite{streeter2008online} proposed an algorithm that  achieves  $\mathcal{O}(k^2(n\log n)^{1/3}T^{2/3}(\log T)^2)$ $(1-1/e)$-regret.   The method we will propose, ETCG, will have a  lower regret bound, by a factor of $k^{2/3}$ (ignoring $\log$ terms).   \cite{golovin2014online} later proposed an algorithm that  achieves   $\mathcal{O}( k^{2/3}n^{2/3}(\log n)^{1/3}T^{2/3} )$ $(1-1/e)$-regret. Recently, \cite{niazadeh2021online} proposed a new algorithm for the adversarial setting that achieves $\mathcal{O}(kn^{2/3}(\log n)^{1/3}T^{2/3})$ $(1-1/e)$-regret. The method we will propose, ETCG, will have a much lower regret bound than those two, by a factor of $n^{1/3}$ for both (ignoring $\log$ terms), for problems where there are many base arms relative to the cardinality constraint (i.e. $n\gg k$), such as social influence maximization. 

\paragraph{Semi-bandit} 
To our knowledge, all prior works on stochastic, combinatorial multi-armed bandits with submodular rewards assume semi-bandit feedback. In this setting, the decision maker receives additional feedback.  For example, in \citet{lin2015stochastic}, the decision maker not only receives the reward of the chosen subset but also learns the marginal gains of its elements. Several methods have been proposed that solve a continuous optimization problem as a surrogate for the submodular set function and require gradient estimates through extra feedback \citep{zhang2019online, chen2018projection, zhu2021projection}.   The ``linear submodular bandit'' problem involves maximizing a linear combination of known submodular functions, with marginal gains provided as extra feedback \citep{yue2011linear, yu2016linear, takemori2020submodular}.    Research on the application of online influence maximization uses extra feedback about the nodes and/or edges in the diffusion tree \citep{lei2015online,wen2017online,vaswani2017model,li2020online,perrault2020budgeted}. \cite{streeter2008online} and \cite{niazadeh2021online} also proposed algorithms for the adversarial setting using semi-bandit feedback, improving their $(1-1/e)$-regret bounds to $\mathcal{O}(\sqrt{kT\log(n)})$ and $\mathcal{O}(k\sqrt{T\log(n)})$, respectively.


\paragraph{Continuous Submodular} 
There is an active area of research in (continuous) optimization for functions exhibiting diminishing returns properties analogous to (discrete) optimization of submodular set functions. Several methods have been proposed in the bandit setting, varying in the environment (adversarial/stochastic) and feedback model \citep{chen2018projection, pmlr-v108-chen20c, zhang2019online, Hassani2017GradientMF, Mokhtari2020StochasticCG, Hassani2020StochasticCG, Zhang2020OneSS}. Extensions of these methods to problems with discrete actions have been proposed, but require additional assumptions, semi-bandit feedback, or expensive sampling routines to estimate gradients. 

\paragraph{Pure Exploration} Instead of evaluating algorithms in terms of \textit{cumulative} regret, the decision maker may seek to only evaluate the regret of the action chosen at time $T$, allowing for more aggressive exploration, or to select an action within a pre-set level of confidence as quickly as possible.  Several works have investigated this ``pure exploration'' setting with semi-bandit feedback \citep{chen2016combinatorial,mokhtari2018conditional,merlis2019batch,jourdan2021efficient} and recently for full-bandit feedback \citep{du2021combinatorial} (for a special reward function). 

\paragraph{Non-submodular}
There are prior works for combinatorial MAB with stochastic rewards and full-bandit feedback, but the classes of the reward functions considered do not include submodular functions.  In particular, there are works for linear reward functions \citep{dani2008stochastic, rejwan2020top} and Lipschitz reward functions  \citep{agarwal2021stochastic, agarwal2021dart}.  For those classes of reward functions considered by  \cite{rejwan2020top, agarwal2021stochastic, agarwal2021dart}, the optimal action (best set of $k$ arms) is to use the $k$ \textit{individually best} arms; that property does not hold for submodular rewards.  


\section{Problem Statement}
\label{prob_state}
In this section, we formally present the problem we study. Let $\Omega$ be the ground set of base arms, and let $n=|\Omega|$ denote the number of arms. We consider sequential decision-making problems with a fixed time horizon $T$, where at each time step $t$, the learner selects a subset (action) $S_t \subseteq \Omega$ with cardinality at most $k$. We will use the terms \emph{subset} and \emph{action} interchangeably throughout the paper. Let $\mathcal{S}=\{S | S \subseteq \Omega  \text{ and } |S|\leq k\}$ denote the set of all allowed subsets at any time step. After the subset $S_t$ is selected, the learner receives reward $f_t(S_t)$. We assume the reward $f_t$ is stochastic, bounded in $[0,1]$, and i.i.d. conditioned on a given subset. Define the expected reward function as $f(S) = \mathbb{E}[f_t(S)]$. We assume $f(S)$ to be submodular and monotonically non-decreasing. The goal of the learner is to maximize the cumulative reward $\sum_{t=1}^Tf_t(S_t)$. To measure the performance of the algorithm, one common metric is to compare the learner to an agent with access to a value oracle for $f$. Let $S^*=\argmax_{S:|S|\leq k}f(S)$ denote the optimal solution. Maximizing a monotone submodular set function under a cardinality constraint is NP-hard even with a value oracle. The best achievable approximation ratio with a polynomial time algorithm is $1-1/e$ \citep{nemhauser1978analysis}.  Thus, we compare the learner's cumulative reward to $(1-1/e)Tf(S^*)$ and we denote the difference as the ($1-1/e$)-regret $\mathcal{R}_{1-1/e,T}$:
\begin{equation}
    \mathcal{R}_{1-1/e,T} := (1-\frac{1}{e})Tf(S^*) - \sum_{t=1}^T f_t(S_t). \label{eq:reg:1e}
\end{equation} 
Note that the ($1-1/e$)-regret $\mathcal{R}_{1-1/e,T}$ is random, depending on the rewards and subsets chosen. In designing an algorithm, we will focus on minimizing the expected cumulative $(1-1/e)$-regret 
\begin{align}
    \mathbb{E}[\mathcal{R}_{1-1/e,T}] = (1-\frac{1}{e})Tf(S^*) - \mathbb{E}\left[\sum_{t=1}^T f_t(S_t)\right],\label{eq:reg:exp1e}
\end{align} 
where the expectation is over both the environment and the sequence of actions. For ease of notation, we write $\mathcal{R}_T$ for $\mathcal{R}_{1-1/e,T}$ throughout this paper.

\begin{remark}\label{rem:1eoptvsgrd} For the experiments in \cref{sec:exp}, we will not know $S^*$ and so will not be able to compute the $(1-1/e)$ regret \eqref{eq:reg:exp1e}. We will instead compute an upper bound. We will compare ETCG and baselines against  $T$ times the expected value $f(S^\mathrm{grd})$ of the solution $S^\mathrm{grd}$ returned from an offline (greedy) approximation algorithm \citep{nemhauser1978analysis}. Since $f(S^\mathrm{grd}) \geq (1-\frac{1}{e})f(S^*)$, the expected cumulative regret with respect to  $S^\mathrm{grd}$ upper-bounds \eqref{eq:reg:exp1e}.  When the inequality is strict, $f(S^\mathrm{grd}) > (1-\frac{1}{e})f(S^*)$, it is possible that the expected cumulative regret \eqref{eq:reg:exp1e} is sub-linear in the horizon $T$ while the expected cumulative regret with respect to $S^\mathrm{grd}$ is linear in the horizon $T$.
\end{remark}

\section{ETCG Algorithm}\label{sec:alg}

In this section, we present our proposed algorithm, \textit{Explore-Then-Commit Greedy} (ETCG). The pseudocode for ETCG is presented in Algorithm~\ref{alg:PG}. Our algorithm greedily adds base arms to a super arm (subset of base arms) over time until the super arm contains $k$ base arms, and then exploits that super arm.  Let $S^{(i)}$ denote the super arm when we have selected $i\leq k$ base arms.  Our procedure begins with the empty set, $S^{(0)}=\emptyset$.  After fixing  a subset $S^{(i-1)}$ with $i-1$ arms,  our procedure explores base arms to add to $S^{(i-1)}$ for an interval of time we refer to as \textit{phase} $i$.  Our procedure repeats this process until $k$ base arms have been selected.  


\begin{algorithm}[t]
\caption{Explore-then-Commit Greedy (ETCG)}
\label{alg:PG}
\begin{algorithmic}
    \State {\bfseries Input:} set of base arms $\Omega$, horizon $T$, cardinality constraint $k$
    \State Initialize $S^{(0)}\gets\emptyset$, $n\gets|\Omega|$
    \State Initialize $m \gets \ceil*{\left(\frac{T\sqrt{2\log(T)}}{n+2nk\sqrt{2\log(T)}}\right)^{2/3}}$
    \For{\emph{phase} $i \in \{1,\dots,k\}$}
        \For{\emph{arm} $a\in \Omega\setminus S^{(i-1)}$}
            \State Play $S^{(i-1)}\cup \{a\}$ $m$ times
            \State  Calculate the empirical mean $\bar{f}(S^{(i-1)}\cup \{a\})$
        \EndFor
        \State $a_{i} \leftarrow \argmax_{a\in \Omega\setminus S^{(i-1)}} \bar{f}(S^{(i-1)}\cup \{a\})$
        \State $S^{(i)} \leftarrow S^{(i-1)}\cup \{a_{i}\}$
    \EndFor
    \For{\emph{remaining time}}
        \State Play action $S^{(k)}$
    \EndFor
\end{algorithmic}
\end{algorithm}

Let $T_i$ denote the time step when  phase $i$ finishes, for $i \in \{1,\cdots, k\}$.  For notational consistency, we also denote $T_0=0$ and $T_{k+1}=T$.  Let $\bar{f}_t(S)$ denote the empirical mean reward of set $S$ up to and including time $t$. Let \[\mathcal{S}_{i} := \{\  S^{(i-1)}\cup\{a\} : \ a\in \Omega\setminus S^{(i-1)}  \ \}\] denote the set of actions considered during phase $i$.  Each action consists of the super arm $S^{(i-1)}$ decided during the last phase and an additional base arm.   Each action $S\in \mathcal{S}_{i}$ will be played the same number of times; let $m$ denote that number. The choice of $m$ will be optimized later to minimize regret.  At the end of phase $i\in\{1,\dots,k\}$, ETCG will select the action that has the largest empirical mean,
\begin{align}
    a_{i} = \argmax_{a\in \Omega \setminus S^{(i-1)}} \ \bar{f}_{T_{i}}(S^{(i-1)} \cup \{a\}), \label{eq:emp_best}
\end{align} and include it in the super arm $S^{(i)} = S^{(i-1)} \cup \{a_{i}\}$.  
During the final phase, the algorithm exploits $S^{(k)}$; it plays the same action $S_t = S^{(k)}$ for $t\in \{T_k+1,\cdots, T\}$.  
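
To make the exploration budget concrete, the following back-of-the-envelope calculation may help. It is only an illustration: the numerical values are hypothetical, and we assume here that $\log$ denotes the natural logarithm. Phase $i$ plays each of the $n-i+1$ candidate actions $m$ times, so exploration ends at
\begin{align*}
    T_k = \sum_{i=1}^{k} m(n-i+1) = m\left(nk-\frac{k(k-1)}{2}\right) \leq mnk.
\end{align*}
For instance, with $n=10$, $k=3$, and $T=10^5$, we have $\sqrt{2\log(T)}\approx 4.80$, so the choice of $m$ in Algorithm~\ref{alg:PG} evaluates to
\begin{align*}
    m = \ceil*{\left(\frac{10^5\cdot 4.80}{10+60\cdot 4.80}\right)^{2/3}} \approx \ceil*{(1610.7)^{2/3}} = \ceil*{137.4} = 138,
\end{align*}
and $T_k = 138\cdot(10+9+8) = 3726$. In this example, exploration uses under $4\%$ of the horizon and the remaining rounds are spent exploiting $S^{(k)}$.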

We note that for the special setting of deterministic rewards, the choice \eqref{eq:emp_best} corresponds to the classic offline greedy approximation algorithm proposed by \cite{nemhauser1978analysis}.  When the rewards are stochastic, the actions selected by ETCG may differ from those that the greedy algorithm \citep{nemhauser1978analysis} would choose using a value oracle for the set function $f$ of expected rewards. 

ETCG has low storage complexity and per-round time-complexity.  During exploitation,  for $t\in \{T_{k}+1,\dots,T_{k+1}\}$, ETCG only needs to store the indices of the $k$ base arms and does not need any computation.  During exploration, for $t\in \{1,\dots,T_{k}\}$, ETCG just needs to update the empirical mean  for the current action at time $t$ and store the highest empirical mean so far in the current phase $i$ and its associated base arm $a\in \Omega \setminus S^{(i-1)}$. Thus, ETCG has $\mathcal{O}(k)$ storage complexity  and $\mathcal{O}(1)$ per-round time complexity. For comparison, the algorithm proposed by \cite{streeter2008online} for the adversarial full-bandit setting uses $\mathcal{O}(nk)$ storage complexity and $\mathcal{O}(n)$ per-round time complexity. 

\begin{remark}\label{rem:unknown_horizon} When the time horizon is not known, we can use the geometric doubling trick to extend our result to an anytime algorithm. Essentially, we pick a geometric sequence $T_i=T_0 2^i$ for $i\in \{1,2,\cdots\}$, where $T_0$ is a large enough number to let the algorithm initialize, and run our algorithm within each time interval of length $T_{i+1}-T_i$ with a full restart. We refer the reader to \citet{Besson2018WhatDT} for the general procedure in detail. From Theorem 4 in \citet{Besson2018WhatDT}, we can show that the regret bound preserves the original $T^{2/3}\log(T)^{1/2}$ dependence, with changes only in constant factors.
\end{remark}

\section{Regret Analysis}\label{sec:regret-analysis}
In this section, we analyze the  regret for Algorithm~\ref{alg:PG}. We begin by stating the main theorem, which bounds the cumulative expected $(1-1/e)$-regret:
\begin{theorem} \label{thm:main}
For the sequential decision making problem defined in Section~\ref{prob_state} with $T\geq n(k+1)$, the  expected cumulative $(1-1/e)$-regret of ETCG is at most  $\mathcal{O}(n^\frac{1}{3}k^\frac{4}{3}T^\frac{2}{3}\log(T)^\frac{1}{2})$.

\end{theorem}

The detailed proof is in the supplementary material. We next briefly walk through the proof, highlighting some unique steps. 

Since for each phase $i$, we play each action $S^{(i-1)}\cup\{a\} \in \mathcal{S}_{i}$ exactly   $m$ times, we consider the equal-sized confidence radii $\mathrm{rad} := \sqrt{2\log(T)/m}$ for all the actions $S^{(i-1)}\cup\{a\} \in \mathcal{S}_{i}$ at the end of phase $i$. Denote the event that the empirical means of actions played in phase $i$ are concentrated around their statistical means as 
\begin{align}
    \mathcal{E}_{i}:=\!\!\!\bigcap_{S\cup\{a\} \in \mathcal{S}_{i} }\!\!\!\bigg\{\big|\bar{f}(S\cup\{a\})-f(S\cup\{a\}) \big|< \mathrm{rad}\bigg\}. \label{eq:phase_event_main}
\end{align}
Then we define the \textit{clean event} $\mathcal{E}$ to be the event that the empirical means of all actions played up to and including phase $k$ are within $\mathrm{rad}$ of their corresponding statistical means:
\begin{align}
    \mathcal{E} := \mathcal{E}_{1}\cap \dots \cap \mathcal{E}_{k}. \label{eq:clean_event_main}
\end{align}
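
As a sketch of why such an event is likely (the complete argument is in the supplementary material and must additionally handle the fact that the played actions are themselves random), consider a single fixed action $S$ whose $m$ observed rewards are i.i.d. in $[0,1]$. Assuming $\log$ denotes the natural logarithm, Hoeffding's inequality with the radius $\mathrm{rad} = \sqrt{2\log(T)/m}$ gives
\begin{align*}
    \mathbb{P}\Big(\big|\bar{f}(S)-f(S)\big| \geq \mathrm{rad}\Big) \leq 2\exp\left(-2m\,\mathrm{rad}^2\right) = 2\exp\left(-4\log(T)\right) = \frac{2}{T^4}.
\end{align*}
If the at most $nk$ actions played during exploration were fixed in advance, a union bound would then give $\mathbb{P}(\mathcal{E}) \geq 1-\frac{2nk}{T^4}$.
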
Although the $\mathcal{E}_{i}$'s are not independent, by conditioning on the sequence of selected subsets $\{S^{(0)},S^{(1)},\dots,S^{(k)}\}$ and using the Hoeffding bound, we show $\mathcal{E}$ happens with high probability. We then use the concentration of   empirical means \eqref{eq:phase_event_main} and  properties of submodular set functions to show the following important lemma. 

\begin{comment}
we get
\begin{align}
    \mathbb{P}(\mathcal{E}) %
    %
    & \geq 1 - \frac{2nk}{T^4}. \nonumber
\end{align}
Now we can write the expected regret as 
\begin{align}
    \mathbb{E}[\mathcal{R}(T)] = \mathbb{E}[\mathcal{R}(T)|\mathcal{E}] \cdot \mathbb{P}(\mathcal{E}) +\mathbb{E}[\mathcal{R}(T)|\bar{\mathcal{E}}]\cdot \mathbb{P}(\bar{\mathcal{E}}). \label{eq:regret:decomp}
\end{align}
Since the probability that the clean event does not happen is small, so we can trivially bound the second term by
\begin{align}
    \mathbb{E}[\mathcal{R}(T)|\bar{\mathcal{E}}]\cdot \mathbb{P}(\bar{\mathcal{E}}) \leq T\cdot\frac{2nk}{T^4} =\frac{2nk}{T^3}.
\end{align}
Now we focus on bounding the first term.
\end{comment}  


\begin{lemma} \label{lem:consequtive_reward_main}
Under the clean event $\mathcal{E}$,  for all   $i\in \{1,2,\cdots, k\}$,
\begin{align}
    f(S^{(i)})-f(S^{(i-1)}) \geq \frac{1}{k}\left[f(S^*)-f(S^{(i-1)})\right]-2 \mathrm{rad}. \nonumber
\end{align}
\end{lemma}  This lemma  (\cref{lem:consequtive_reward} in the supplementary material) identifies a lower bound of the expected marginal gain $f(S^{(i)})-f(S^{(i-1)})$ of the empirically best action $S^{(i)}$ at the end of phase $i$. The sequence of subsets $\{S^{(0)},S^{(1)},\dots,S^{(k)}\}$ that ETCG picks \textit{does not necessarily match} the sequence chosen by the offline greedy approximation \citep{nemhauser1978analysis} using a value oracle for the expected reward function $f$.  Even though ETCG may select a different sequence, \cref{lem:consequtive_reward_main} ensures the expected marginal gain is not too small.  
As a corollary of \cref{lem:consequtive_reward_main}, using properties of submodular set functions and unraveling the recursion induced by \cref{lem:consequtive_reward_main}, we can lower bound the expected value of ETCG's chosen set $S^{(k)}$ of size $k$, which is used for exploitation in phase $k+1$:

\begin{corollary} \label{cor:sk_lower:main}
Under the clean event $\mathcal{E}$,  
\begin{align}
    f(S^{(k)}) & \geq (1 - \frac{1}{e})f(S^*) - 2k \mathrm{rad} 
    . 
\end{align}
\end{corollary}
This corollary appears as \cref{cor:sk_lower} in the supplementary material in \cref{sec:appd:proof:prelim}. 
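
The unraveling step can be sketched as follows; this is a condensed version under the clean event $\mathcal{E}$, and the shorthand $\delta_i$ is introduced here only for illustration. Writing $\delta_i := f(S^*)-f(S^{(i)})$, \cref{lem:consequtive_reward_main} rearranges to $\delta_i \leq \left(1-\frac{1}{k}\right)\delta_{i-1} + 2\mathrm{rad}$. Iterating this inequality from $i=k$ down to $i=1$ gives
\begin{align*}
    \delta_k &\leq \left(1-\frac{1}{k}\right)^{k}\delta_0 + 2\mathrm{rad}\sum_{j=0}^{k-1}\left(1-\frac{1}{k}\right)^{j} \\
    &\leq \frac{1}{e}f(S^*) + 2k\,\mathrm{rad},
\end{align*}
where the second line uses $\left(1-\frac{1}{k}\right)^k \leq \frac{1}{e}$, $\delta_0 = f(S^*)-f(\emptyset) \leq f(S^*)$ (since rewards are non-negative), and the fact that each of the $k$ terms in the geometric sum is at most $1$. Rearranging recovers the bound on $f(S^{(k)})$ in \cref{cor:sk_lower:main}.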

Using  \cref{cor:sk_lower:main},  we can break up the expected $(1-\frac{1}{e})$-regret \eqref{eq:reg:exp1e} conditioned on the clean event $\mathcal{E}$ into two parts, one part for the first $k$ phases  and one part for the exploitation phase, 
\begin{align}
    &\hspace{-.3cm}\mathbb{E}[\mathcal{R}(T)|\mathcal{E}] \nonumber\\
    % &= \mathbb{E}\left[(1-\frac{1}{e}) T f( S^* ) -\sum_{t=1}^T f_t(S_t) \right] \nonumber\\
      &=(1-\frac{1}{e}) T f( S^* ) -\sum_{t=1}^T \mathbb{E}[f_t(S_t)] \nonumber \\
      &=\sum_{t=1}^T \left((1-\frac{1}{e})f(S^*)-\mathbb{E}[f(S_t)]\right) \nonumber\\
      &=\underbrace{\sum_{i=1}^{k} \sum_{t=T_{i-1}+1}^{T_{i}} \left((1-\frac{1}{e})f(S^*)-\mathbb{E}[f(S_t)]\right)}_{\text{First $k$ phases (exploration)}}  \nonumber\\
    &\qquad +\underbrace{\sum_{t=T_k+1}^T \left((1-\frac{1}{e})f(S^*)-\mathbb{E}[f(S^{(k)})]\right)}_{\text{Phase $k+1$ (exploitation)}}. \label{eq:regr:clean:twopart}
\end{align}
Recall that in phase $i$, each of the $n-(i-1)$ actions in $\mathcal{S}_{i}$ is played exactly $m$ times, so $T_{i}-T_{i-1} = m(n-i+1)$. For each action $S_t$ played during phase $i$, that is, for $t\in \{T_{i-1}+1, \dots, T_{i}\}$, we have $S^{(i-1)} \subset S_t$, and hence $f(S^{(i-1)}) \leq f(S_t)$ by monotonicity of the expected reward function $f$. Thus we can upper bound the expected regret $\mathbb{E}[\mathcal{R}(T)|\mathcal{E}] $ incurred during the first $k$ phases (first term of \eqref{eq:regr:clean:twopart}) as 
\begin{align}
    & \sum_{i=1}^{k} \sum_{t=T_{i-1}+1}^{T_{i}} \left((1-\frac{1}{e})f(S^*)-\mathbb{E}[f(S_t)]\right) \nonumber\\
      &\leq \sum_{i=1}^{k}m(n-i+1)\left((1-\frac{1}{e})f(S^*)-\mathbb{E}[f(S^{(i-1)})]\right)  \nonumber\\
      &\leq mn\sum_{i=1}^{k}\left((1-\frac{1}{e})f(S^*)-\mathbb{E}[f(S^{(i-1)})]\right). 
     \label{eq:decomp1}
\end{align}
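For reference, the total length of the exploration stage can be written in closed form. The short calculation below uses only the phase lengths $T_i - T_{i-1} = m(n-i+1)$ stated above and assumes $T_0 = 0$; the values $n=20$ and $k=4$ are purely illustrative.
\begin{align*}
    T_k = \sum_{i=1}^{k} m(n-i+1) = m\left(nk - \frac{k(k-1)}{2}\right) \leq mnk .
\end{align*}
For example, $n=20$ and $k=4$ give $T_k = m(20+19+18+17) = 74m$, compared with the relaxed value $mnk = 80m$, so the step $n-i+1\leq n$ in \eqref{eq:decomp1} loses little when $k \ll n$.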
We can further upper bound the sum in \eqref{eq:decomp1} as
\begin{align}
    &\hspace{-1cm} \sum_{i=1}^{k}\left((1-\frac{1}{e})f(S^*)-\mathbb{E}[f(S^{(i-1)})]\right) \nonumber\\
    %
     &\leq \sum_{i=1}^{k}\left(f(S^*)-\mathbb{E}[f(S^{(i-1)})]\right) \nonumber\\
     %
     &\leq k\sum_{i=1}^{k}\left(\mathbb{E}[f(S^{(i)})] - \mathbb{E}[f(S^{(i-1)})]+  2\mathrm{rad}\right) \label{eq:11}\\[7pt]
    %
    &=  k(\mathbb{E}[f(S^{(k)})] - \mathbb{E}[f(S^{(0)})] +  2k\mathrm{rad}) \label{eq:12}\\[10pt]
    %
    &\leq  k \left(1 +  2k\mathrm{rad}\right) \label{eq:13},
\end{align}
where the first inequality holds because $f(S^*)\geq 0$, \eqref{eq:11} follows by applying \cref{lem:consequtive_reward_main} and taking expectation, \eqref{eq:12} follows by simplifying a telescoping sum, and \eqref{eq:13} follows from $\mathbb{E}[f(S^{(k)})] \leq 1$ and $\mathbb{E}[f(S^{(0)})] = 0$.

We can upper bound the expected regret $\mathbb{E}[\mathcal{R}(T)|\mathcal{E}] $ incurred during the exploitation phase (phase $k+1$;  second term of \eqref{eq:regr:clean:twopart}) by applying \cref{cor:sk_lower:main} as
\begin{align}
    &\hspace{-1cm}\sum_{t=T_k+1}^T \left((1-\frac{1}{e})f(S^*)-\mathbb{E}[f(S^{(k)})]\right) \nonumber\\
    %
    \leq &\sum_{t=T_k+1}^T2k\mathrm{rad} \leq 2kT\mathrm{rad}. \label{eq:14}
\end{align}
 Combining \eqref{eq:decomp1} and \eqref{eq:13} with \eqref{eq:14}, and then optimizing over the number of times $m$ each action is sampled during exploration, we get   
\begin{comment}
we have 
\begin{align}
    \mathbb{E}[\mathcal{R}(T)|\mathcal{E}] 
     &\leq  mnk \left(1 +  2k\mathrm{rad}\right) + 2kT\mathrm{rad}\nonumber\\
     & = mnk \left(1 +  2k\sqrt{2\log(T)/m}\right) + 2kT \sqrt{2\log(T)/m} \nonumber\\
    &\leq mnk \left(1 +  2k\sqrt{2\log(T)}\right) + 2kT \sqrt{2\log(T)/m}. \label{eq:final_regret1}
\end{align}

Taking 
\begin{align}
    m^\dagger = \ceil*{\left(\frac{T\sqrt{2\log(T)}}{n+2nk\sqrt{2\log(T)}}\right)^{2/3}} \label{eq:int:value:m_main}
\end{align}
\end{comment}
% gives 
\begin{align}
    &\hspace{-.5cm}\mathbb{E}[\mathcal{R}(T)|\mathcal{E}] \nonumber\\[8pt]
    %
     &\leq  4n^\frac{1}{3}k(T\sqrt{2\log(T)})^\frac{2}{3}(1+ 2k\sqrt{2\log(T)})^\frac{1}{3} \nonumber\\[8pt]
     %
     &= \mathcal{O}(n^\frac{1}{3}k^\frac{4}{3}T^\frac{2}{3}\log(T)^\frac{1}{2}). \label{eq:final_O_regret_clean:main} 
\end{align}
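The optimization over $m$ can be sketched as follows. The sketch assumes the confidence radius has the form $\mathrm{rad}=\sqrt{2\log(T)/m}$ and introduces two shorthands used only here: $\ell_T=\sqrt{2\log(T)}$ and $m^\star$ for the real-valued minimizer. Multiplying \eqref{eq:13} by $mn$ as in \eqref{eq:decomp1}, adding \eqref{eq:14}, and using $m\geq 1$ in the first term gives
\begin{align*}
    \mathbb{E}[\mathcal{R}(T)|\mathcal{E}]
    &\leq mnk\left(1+2k\,\mathrm{rad}\right) + 2kT\,\mathrm{rad} \\
    &\leq m\,nk\left(1+2k\ell_T\right) + 2kT\ell_T\, m^{-1/2}.
\end{align*}
Treating $m$ as a real variable and setting the derivative to zero yields
\begin{align*}
    (m^\star)^{3/2} = \frac{T\ell_T}{n\left(1+2k\ell_T\right)},
\end{align*}
at which point the second term equals twice the first, so the bound becomes
\begin{align*}
    3\,m^\star nk\left(1+2k\ell_T\right) = 3\,n^{\frac{1}{3}}k\,(T\ell_T)^{\frac{2}{3}}\left(1+2k\ell_T\right)^{\frac{1}{3}}.
\end{align*}
Rounding $m^\star$ up to an integer increases the first term by at most $nk(1+2k\ell_T)$ and does not increase the second, which accounts for the constant $4$ in the bound above whenever $m^\star\geq 1$.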
We then show that because the clean event $\mathcal{E}$ happens with high probability, $\mathbb{E}[\mathcal{R}(T)]$ also satisfies \eqref{eq:final_O_regret_clean:main}, completing the proof. 
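A sketch of this last step, writing $\bar{\mathcal{E}}$ for the complement of the clean event and $U$ for the right-hand side of the clean-event bound (both notations are introduced here only for illustration), and using that rewards lie in $[0,1]$ so that the regret never exceeds $T$:
\begin{align*}
    \mathbb{E}[\mathcal{R}(T)]
    &= \mathbb{E}[\mathcal{R}(T)|\mathcal{E}]\,\mathbb{P}(\mathcal{E}) + \mathbb{E}[\mathcal{R}(T)|\bar{\mathcal{E}}]\,\mathbb{P}(\bar{\mathcal{E}}) \\
    &\leq U + T\,\mathbb{P}(\bar{\mathcal{E}}).
\end{align*}
For instance, if $\mathbb{P}(\bar{\mathcal{E}}) \leq 2nkT^{-4}$ (a value assumed here for illustration), the second term is at most $2nkT^{-3}$, which vanishes as $T$ grows and does not affect the order of the bound.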

\begin{comment}
Putting all together we have 
\begin{align}
    \mathbb{E}[\mathcal{R}(T)] =& \mathbb{E}[\mathcal{R}(T)|\mathcal{E}] \cdot \mathbb{P}(\mathcal{E}) +\mathbb{E}[\mathcal{R}(T)|\bar{\mathcal{E}}]\cdot \mathbb{P}(\bar{\mathcal{E}}) \\
    %
    %
    \leq& \left[4n^\frac{1}{3}k(T\sqrt{2\log(T)})^\frac{2}{3}(1+ 2k\sqrt{2\log(T)})^\frac{1}{3}\right]  \cdot 1 \nonumber\\
    &+T\cdot 2nkT^{-4} \\
    %
    %
    =& \mathcal{O}(n^\frac{1}{3}k^\frac{4}{3}T^\frac{2}{3}\log(T)^\frac{1}{2}),
    \nonumber
\end{align}
which conclude the proof of \cref{thm:main}.
\end{comment}

\paragraph{Lower bounds:}
For the setting explored in this paper (stochastic CMAB with submodular expected rewards and full-bandit feedback), it remains an open question whether $\tilde{\mathcal{O}}(T^{1/2})$ expected cumulative $(1-1/e)$-regret is achievable (ignoring the dependence on $n$ and $k$).  For the special sub-class of linear reward functions, a lower bound of $\tilde{\Omega}(T^{1/2})$ is known \citep{dani2008stochastic}.  

\section{Experiments}\label{sec:exp}
We next evaluate our proposed algorithm ETCG on both synthetic and real-world data. 

For the experiments, instead of the $(1-1/e)$-regret in \cref{eq:reg:1e}, which requires knowing $S^*$, we compare the cumulative rewards achieved by ETCG and the baselines against $Tf(S^\mathrm{grd})$, where $S^\mathrm{grd}$ is the solution returned by the offline $(1-1/e)$-approximation algorithm proposed by \cite{nemhauser1978analysis}.   Recall from \cref{rem:1eoptvsgrd} that $Tf(S^\mathrm{grd})\geq (1-1/e)Tf(S^*)$, so $Tf(S^\mathrm{grd})$ is a more challenging reference value.
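Concretely, since the realized rewards are the same in both quantities, this inequality carries over to the regret values. The display below is a one-line check that assumes the $(1-1/e)$-regret has the form shown on its right-hand side:
\begin{align*}
    T f(S^\mathrm{grd}) - \sum_{t=1}^T f_t(S_t) \;\geq\; (1-\tfrac{1}{e})\, T f(S^*) - \sum_{t=1}^T f_t(S_t).
\end{align*}
The regret reported against $Tf(S^\mathrm{grd})$ is therefore never smaller than the $(1-1/e)$-regret of the same run.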

\subsection{Baseline Methods}
We use three algorithms designed for CMAB with full-bandit feedback as baselines.  
\begin{itemize}
    \item \textbf{Online Greedy with opaque feedback model (OG$^\text{o}$)} \citep{streeter2008online} This algorithm is designed for the adversarial setting with submodular rewards. The adversary model is \textit{oblivious}, meaning the sequence of monotone submodular reward functions is fixed in advance. OG$^\text{o}$ uses $k$ randomized weighted majority subroutines \citep{Littlestone1994TheWM} to select actions, where $k$ is the cardinality constraint. At each time step, the algorithm explores with probability $\gamma$ and exploits with probability $1-\gamma$. During exploration, it randomly picks one of the randomized weighted majority subroutines to select a base arm to explore. OG$^\text{o}$ has an $\widetilde{O}(T^{2/3})$ theoretical guarantee for the adversarial setting. Our detailed implementation and parameter selection are described in \cref{implimentation:ogo}.
    
    
    \item \textbf{CMAB-SM} \citep{agarwal2021stochastic} This algorithm assumes the expected reward functions are Lipschitz continuous functions of individual arm rewards. The algorithm divides all $n$ base arms into groups, sorts the arms within each group, and then merges the groups one by one to obtain the best $k$ arms. CMAB-SM has an $\widetilde{O}(T^{2/3})$ theoretical guarantee. 

    
    
    \item \textbf{DART} \citep{agarwal2021dart} DART is a successive accept-reject style algorithm designed for Lipschitz reward functions that have an additional property related to the marginal gains of the base arms. DART has an $\widetilde{O}(T^{1/2})$ theoretical guarantee.
\end{itemize}


\subsection{Experiments with Synthetic Data}\label{sec:exp:syn}

We begin with experiments with two special cases of submodular set functions. The first is the mean (linear) function of individual arm rewards, $f_t(S)=\sum_{a\in S}f_t(\{a\})/k$. The second is a stochastic weighted set cover, which can be viewed as a simple model for product recommendations. Let $n$ denote the number of products, each of which belongs to exactly one of $c$ categories. The categories have different (expected) values, given by the weight vector $\omega$. The instantaneous reward is defined as the total weight of the categories covered by a chosen set of up to $k$ products, divided by the cardinality constraint $k$.  With $C_i$ denoting the indices of arms belonging to category $i$, $\omega_t[i]$ denoting the instantaneous weight of category $i$ at time $t$, and $\mathbf {1}_{\cdot}$ denoting the indicator function, $f_t(S) = \frac{1}{k}\sum_{i=1}^c \omega_t[i] \mathbf {1}_{S \cap C_i \neq \emptyset}.$ This reward function is monotone and submodular. For both types of reward functions, the offline greedy solution is optimal, so the results in fact compare against the optimal solution.
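The monotonicity and submodularity of the weighted cover reward can be checked directly. As a short worked check, assuming nonnegative weights, consider the marginal gain of adding a product $a \in C_j$ with $a \notin S$:
\begin{align*}
    f_t(S\cup\{a\}) - f_t(S) = \frac{1}{k}\,\omega_t[j]\,\mathbf{1}_{S \cap C_j = \emptyset} .
\end{align*}
The gain is nonnegative, which gives monotonicity, and it can only drop (from $\omega_t[j]/k$ to $0$) as $S$ grows, which is the diminishing-returns property that defines submodularity.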

\subsubsection{Experiment Details}
For both setups, we use $n=20$ base arms.  The cardinality constraint is $k=4$. We run experiments on different time  horizons $T\in\{10^2, 10^3, 10^4, 10^5, 10^6\}$.  For each horizon $T$ and reward function type (linear or weighted cover), we run each method 10 times.


For the linear reward function, for each run we first generate the expected rewards $\{f(\{a\})\}_{a\in\Omega}$ of the individual arms randomly as $f(\{a\}) \overset{i.i.d.}{\sim} \mathcal{U}([0.1,0.9])$.  For each arm $a\in\Omega$, the instantaneous reward $f_t(\{a\})$ at time $t$ is the expected reward plus noise, $f_t(\{a\})=f(\{a\})+\epsilon_{a,t}$, where the noise terms $\{\epsilon_{a,t}\}_{a\in\Omega, 1\leq t \leq T}$ are i.i.d. and follow a normal distribution with mean 0 and standard deviation 0.1 truncated to the interval $[-0.1,0.1]$ (so all instantaneous rewards $f_t(\cdot)$ lie within the interval $[0,1]$). 

For the weighted cover problem, we used $c=4$ categories with $[6,6,6,2]$ products respectively. The stochastic weight of each category $i=1,2,3,4$ at time $t$ is drawn from a uniform distribution, $\omega_t[i] \sim \mathcal{U}([0,i/5])$. 
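As a quick worked example for this configuration, using only the stated weight distributions and $k=4$: a uniform weight on $[0,i/5]$ has mean $i/10$, so a set $S$ containing one product from each of the four categories has expected reward
\begin{align*}
    f(S) = \frac{1}{4}\left(\frac{1}{10}+\frac{2}{10}+\frac{3}{10}+\frac{4}{10}\right) = \frac{1}{4},
\end{align*}
which is the largest value attainable with $k=4$ products. By contrast, four products drawn from category 1 alone give only $\frac{1}{4}\cdot\frac{1}{10} = 0.025$.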

\begin{figure}[ht]%
    \centering
    \subfloat[]{\label{fig:toy:a}{\includegraphics[width=0.5\linewidth]{figures/regret_lin.png} }}%
    \subfloat[]{\label{fig:toy:b}{\includegraphics[width=0.5\linewidth]{figures/regret_weicover.png} }}%
    \\
    \subfloat[]{\label{fig:toy:c}{\includegraphics[width=0.5\linewidth]{figures/reward_plot_lin.png} }}%
    \subfloat[]{\label{fig:toy:d}{\includegraphics[width=0.5\linewidth]{figures/reward_plot_weicover.png} }}%
    \caption{(a) and (b) compare cumulative regret as a function of the time horizon $T$. (c) and (d) show the moving average (window size 100) of the instantaneous reward as a function of $t$. The expected reward used in (a) and (c) is linear, and the weighted cover reward is used in (b) and (d). The gray dashed lines in (a) and (b) represent $y = aT^{2/3}$ for various values of $a$. The gray dashed line in (c) and (d) represents the value of the optimal solution (averaged across runs).}%
    \label{fig:toy}%
\end{figure}

\subsubsection{Results and Discussion}
\cref{fig:toy:a,fig:toy:b} depict cumulative regret curves for ETCG (in blue) and the baselines for different values of the horizon $T$ for the linear and weighted cover problems, respectively. The standard deviation is shown with error bars, though some are hard to see because the values are small. \cref{fig:toy:c,fig:toy:d} depict instantaneous rewards over a horizon $T=10^5$ for the linear and weighted cover rewards, respectively. The curves are averaged over the 10 runs, and the shaded area is the standard deviation for each method. The instantaneous reward curves for all methods are smoothed with a moving average with window size 100. The gray dashed lines in \cref{fig:toy:a,fig:toy:b} represent $y = aT^{2/3}$ for various values of $a$, corresponding to cumulative regret curves of $\widetilde{O}(T^{2/3})$.

\paragraph{Results--Linear}
Recall that ETCG, OG$^{\text{o}}$,  and CMAB-SM all have $\widetilde{O}(T^{2/3})$ regret (for their respective settings, which include  linear functions).  %\cref{fig:toy:a,fig:toy:b} include  curves $y = aT^{2/3}$ for various values of $a$ plotted in gray for visual reference. 
DART has $\widetilde{O}(T^{1/2})$ regret for this setting. 

In \cref{fig:toy:a}, ETCG (in blue) outperforms OG$^{\text{o}}$ (in orange) and DART (in red), and performs similarly to CMAB-SM (in green). Over the horizons examined (up to $T=10^6$), OG$^{\text{o}}$'s cumulative regret appears to grow faster than $T^{2/3}$ (i.e., the curve's slope appears steeper than $2/3$ on a log-log plot). One major reason is that OG$^{\text{o}}$ explores actions (including actions with cardinality smaller than $k$) with a constant probability.
\cref{fig:toy:c} shows that this behavior also results in a larger standard deviation in the instantaneous reward curve and slower improvement in its instantaneous rewards.  

\paragraph{Results--Weighted Cover}
\cref{fig:toy:b} shows the cumulative regret curves for the weighted cover problem.   ETCG (in blue) outperforms all baseline methods by a large margin for all time horizons. As in the linear case, we believe that OG$^{\text{o}}$ (in orange) performs poorly in part due to the time it spends exploring.


DART's cumulative regret (in red) empirically grows as $O(T^{0.90})$, much faster than ETCG's growth of $O(T^{0.58})$, which is below $O(T^{2/3})$ (we empirically estimated the slopes of the regret curves for these methods on the log-log scale).  CMAB-SM's cumulative regret curve (in green) grows almost as fast as DART's, indicating that CMAB-SM and DART fail to select a good action. They work well in the linear case mainly because the assumptions of ETCG, CMAB-SM, and DART are all satisfied there, so their regret bounds hold. In the weighted cover problem, however, the reward function is not simply a function of individual base arm rewards, a property that DART and CMAB-SM rely on; instead, the reward depends on the chosen set of arms as a whole.
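The text does not specify how these slopes were estimated. One common choice, given here only as an assumption, is an ordinary least-squares fit on the log-log scale. With hypothetical notation $\tau_1,\dots,\tau_J$ for the tested horizons and $\bar{R}_j$ for the average cumulative regret at horizon $\tau_j$, set $x_j=\log \tau_j$ and $y_j = \log \bar{R}_j$, and take
\begin{align*}
    \hat{\alpha} = \frac{\sum_{j=1}^{J} (x_j-\bar{x})(y_j-\bar{y})}{\sum_{j=1}^{J} (x_j-\bar{x})^2},
\end{align*}
where $\bar{x}$ and $\bar{y}$ are the sample means. Under the model $\bar{R}_j \approx c\,\tau_j^{\alpha}$, the fitted slope $\hat{\alpha}$ estimates the growth exponent $\alpha$.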




\begin{figure*}[ht]%
    \centering
    \subfloat[]{\label{fig:im:a}{\includegraphics[width=0.3\linewidth]{figures/regret_plot_4.jpg} }}%
    \subfloat[]{\label{fig:im:b}{\includegraphics[width=0.3\linewidth]{figures/regret_plot_8.jpg} }}%
    \subfloat[]{\label{fig:im:c}{\includegraphics[width=0.3\linewidth]{figures/regret_plot_16.jpg} }}%
    \\
    \subfloat[]{\label{fig:im:d}{\includegraphics[width=0.3\linewidth]{figures/reward_plot_4_100000.jpg} }}%
    \subfloat[]{\label{fig:im:e}{\includegraphics[width=0.3\linewidth]{figures/reward_plot_8_100000.jpg} }}%
    \subfloat[]{\label{fig:im:f}{\includegraphics[width=0.3\linewidth]{figures/reward_plot_16_100000.jpg} }}%
    \caption{(a), (b) and (c) compare cumulative regret as a function of the time horizon $T$. (d), (e) and (f) show the moving average (window size 100) of the instantaneous reward as a function of $t$. The gray dashed lines in (d), (e) and (f) represent the expected rewards of the action chosen by an offline greedy algorithm.}%
    \label{fig:im}%
\end{figure*}

\subsection{Experiments with Real World Data}
We next run experiments on social network influence maximization over a portion of the Facebook network graph.  While prior works propose algorithms for influence maximization bandit problems, the state of the art \citep[e.g.,][]{wen2017online} presumes knowledge of the diffusion model (such as independent cascade) and, more importantly, extensive semi-bandit feedback on individual diffusions, such as which specific nodes became active or along which edges successful infections occurred, in order to estimate diffusion parameters. For social networks that protect user privacy, this information is not available. 

\subsubsection{Data Set Description and Experiment Details}
We conduct experiments on an influence maximization problem using a portion of the Facebook network \citep{NIPS2012_7a614fd0}. To facilitate running multiple experiments for different horizons, we used the community detection method proposed by \cite{Blondel2008FastUO} to extract a community with 534 nodes and 8158 edges. The diffusion process is simulated using the independent cascade model \citep{kempe2003maximizing}, where in each discrete step, an active node (that was inactive at the previous time step) independently attempts to infect each of its inactive neighbors.  We used uniform infection probabilities (0.1 for each edge). For each horizon $T\in\{2\times 10^4, 4\times 10^4,\dots,10^5\}$, we tested each method ten times. 

\subsubsection{Results and Discussion}
\cref{fig:im:a,fig:im:b,fig:im:c} %, \cref{fig:im:b} and \cref{fig:im:c} 
show average cumulative regret curves for ETCG (in blue) and baselines for different horizon $T$ values when the cardinality constraint $k$ is 4, 8 and 16, respectively. The shaded areas depict the standard deviation.  The figure axes are linearly scaled, so a linear cumulative regret curve corresponds to (linear) $\widetilde{O}(T)$ cumulative regret.

ETCG significantly outperforms OG$^\text{o}$ (in orange).  Over the horizons tested,  OG$^\text{o}$'s  cumulative regret (averaged over ten runs) appears to grow linearly with $T$.  We saw in \cref{sec:exp:syn} that even for much simpler reward functions and with few arms $n$ and small cardinality $k$, OG$^\text{o}$ performed poorly.  

ETCG outperforms CMAB-SM (in green) for all time horizons and cardinalities, with significant gaps between ETCG and CMAB-SM for smaller $k$.  From \cref{fig:im:a,fig:im:b,fig:im:c}, CMAB-SM's performance appears fairly stable across increasing cardinalities (though note limits of y-axes differ) while ETCG's regret curve %approaches that of CMAB 
appears to grow (relative to others).  For a fixed horizon $T$, increasing $k$ means more phases, which (for this problem with large $n$) means more time exploring overall but less time in any one phase, so the arms selected may not be as good. This phenomenon is visually apparent in the instantaneous reward plots in \cref{fig:im:d,fig:im:e,fig:im:f}.  In \cref{fig:im:d} with $k=4$, for instance, each of the four phases of ETCG's exploration is visually distinct, and exploitation begins around $t=20000$.  In \cref{fig:im:f} with $k=16$, however, each of the sixteen phases of ETCG's exploration is shorter and exploitation begins around $t=35000$.  
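These onset times can be sanity-checked against the exploration length $T_k=\sum_{i=1}^k m(n-i+1)$. The calculation below is a rough sketch under assumptions that are not restated in this section: that there is one arm per node, so $n=534$; that $m$ is set by the rule $m^\dagger = \lceil (T\sqrt{2\log T}/(n+2nk\sqrt{2\log T}))^{2/3}\rceil$; and that $\log$ denotes the natural logarithm. With $T=10^5$, $\sqrt{2\log T}\approx 4.80$, so
\begin{align*}
    &k=4:\quad m^\dagger=\lceil 8.04\rceil=9,\quad T_k = 9\,(4\cdot 534-6)=19170,\\
    &k=16:\quad m^\dagger=\lceil 3.23\rceil=4,\quad T_k = 4\,(16\cdot 534-120)=33696,
\end{align*}
which is in line with the onsets of roughly $20000$ and $35000$ rounds visible in the plots.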

ETCG and DART (in red) have similar performance for small time horizons. However, DART's cumulative regret curve has steep jumps that make its performance significantly worse. We attribute these jumps to the exponential epoch lengths used in DART, with number of epochs $\lfloor\log_2(KT/N\log(NT))\rfloor$. This creates non-smooth behavior in the regret growth of the DART algorithm. 

\cref{fig:im:d,fig:im:e,fig:im:f} show instantaneous rewards over a horizon $T=10^5$ for the corresponding cardinality constraints. Again, curves for all methods are smoothed with a moving average with window size 100. ETCG converges fastest among all methods. On the other hand, the set of size $k$ chosen by ETCG is worse than those of CMAB-SM and DART, since the latter two methods take longer to explore. We also attribute ETCG's worse performance for larger $k$ to the larger $k$ term in the regret bound.  

\section{Conclusion}\label{sec:conclusion}
In this paper, we investigated the problem of combinatorial multi-armed bandits in the stochastic setting with submodular expected rewards, where the agent can choose up to $k$ out of $n$ arms in each time step and receives only the aggregated reward. We proposed a simple algorithm, ETCG, and showed that it is efficient both theoretically and empirically. We showed that it achieves $\tilde{\mathcal{O}}(T^{2/3})$ $(1-1/e)$-regret, which is the first theoretical regret bound in stochastic, full-bandit, submodular reward settings, and is comparable to the guarantees for adversarial settings in \cite{streeter2008online} and \cite{niazadeh2021online}. We empirically showed that it outperforms the baselines on synthetic data and on a social network influence maximization problem.

%temporary to flush figures and see space left over
%\newpage

\begin{acknowledgements} % will be removed in pdf for initial submission,
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
    This material is based upon work supported by the National Science Foundation under Grants No. 2149588 and 2149617.
    % Briefly acknowledge people and organizations here.

    % \emph{All} acknowledgements go in this section.
\end{acknowledgements}


\bibliography{refs.bib}

\end{document}
