%\documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
% version; also before submission to
% see how the non-anonymous paper
% would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
% Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
% ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts,placeins}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{lipsum}
%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}
\usepackage{amsthm,amssymb}
\usepackage{enumitem,algorithm,algorithmic,qtree}
%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz,multirow,graphicx} % nice language for creating drawings and diagrams
\usepackage{bibentry}
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand\algorithmicprocedure{\textbf{Procedure}}
\newcommand{\algorithmicendprocedure}{\algorithmicend\ \algorithmicprocedure}
\makeatletter
\newcommand\PROCEDURE[3][default]{%
  \ALC@it
  \algorithmicprocedure\ \textsc{#2}(#3)%
  \ALC@com{#1}%
  \begin{ALC@prc}%
}
\newcommand\ENDPROCEDURE{%
  \end{ALC@prc}%
  \ifthenelse{\boolean{ALC@noend}}{}{%
    \ALC@it\algorithmicendprocedure
  }%
}

\newenvironment{ALC@prc}{\begin{ALC@g}}{\end{ALC@g}}
\makeatother
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newenvironment{theorem}[2][Theorem]{\begin{trivlist} \em
\item[\hskip \labelsep {\bfseries #1}\hskip \labelsep {\bfseries #2.}]}{\end{trivlist}}
\newenvironment{remark}[2][Remark]{\begin{trivlist} \em
\item[\hskip \labelsep {\bfseries #1}\hskip \labelsep {\bfseries #2.}]}{\end{trivlist}}
\newenvironment{fact}[2][Fact]{\begin{trivlist} \em
\item[\hskip \labelsep {\bfseries #1}\hskip \labelsep {\bfseries #2.}]}{\end{trivlist}}
\newenvironment{lemma}[2][Lemma]{\begin{trivlist} \em
\item[\hskip \labelsep {\bfseries #1}\hskip \labelsep {\bfseries #2.}]}{\end{trivlist}}
\newenvironment{claim}[2][Claim]{\begin{trivlist}
\item[\hskip \labelsep {\bfseries #1}\hskip \labelsep {\bfseries #2.}]}{\end{trivlist}}
\newenvironment{problem}[2][Problem]{\begin{trivlist}
\item[\hskip \labelsep {\bfseries #1}\hskip \labelsep {\bfseries #2.}]}{\end{trivlist}}
\newenvironment{definition}[2][Definition]{\begin{trivlist} \em \item[\hskip \labelsep {\bfseries #1}\hskip \labelsep {\bfseries #2.}]}{\end{trivlist}}
\newenvironment{proposition}[2][Proposition]{\begin{trivlist} \em \item[\hskip \labelsep {\bfseries #1}\hskip \labelsep {\bfseries #2.}]}{\end{trivlist}}
\newenvironment{result}[2][Result]{\begin{trivlist} \em \item[\hskip \labelsep {\bfseries #1}\hskip \labelsep {\bfseries #2.}]}{\end{trivlist}}
\newenvironment{corollary}[2][Corollary]{\begin{trivlist} \em
\item[\hskip \labelsep {\bfseries #1}\hskip \labelsep {\bfseries #2.}]}{\end{trivlist}}

\newenvironment{solution}{\begin{proof}[Solution]}{\end{proof}} % just an example

\title{Best Arm Identification in Rare Events}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[$\dagger$]{Anirban Bhattacharjee}
\author[$\dagger$]{Sushant Vijayan}
\author[$\dagger$, $\ddag$]{Sandeep Juneja}

% Add affiliations after the authors
\affil[$\dagger$]{%
    School of Technology and Computer Science\\
    Tata Institute of Fundamental Research\\
    Mumbai, India \\
    \newline
  $\ddag$ Visiting Researcher, Google Research India  
}
%\affil[2]{
%    School of Technology and Computer Science\\
%    Tata Institute of Fundamental Research\\
%    Mumbai, India
%}
%\affil[3]{
%    School of Technology and Computer Science\\
%    Tata Institute of Fundamental Research\\
%    Mumbai, India
%  }
  
\begin{document}
\maketitle

\begin{abstract}
We consider the Best Arm Identification (BAI) problem in the stochastic multi-armed bandit framework, where each arm has a small probability of realizing large rewards, while with overwhelming probability, the reward is zero. A key application of this framework is in online advertising, where click rates of advertisements could be a fraction of a single percent and final conversion to sales, while highly profitable, may again be a small fraction of the click rates. Lately, algorithms for BAI problems have been developed that minimise sample complexity while providing statistical guarantees on the correct arm selection. As we observe, these algorithms can be computationally prohibitive. We exploit the fact that the reward process for each arm is well approximated by a Compound Poisson process and arrive at algorithms that are faster, with a small increase in sample complexity. We analyze the problem in an asymptotic regime as rarity of reward occurrence reduces to zero, and reward amounts increase to infinity. This helps illustrate the benefits of the proposed algorithm. It also sheds light on the underlying structure of the optimal BAI algorithms in the rare event setting.
\end{abstract}

\section{Introduction}\label{sec:intro}

Online advertising is ubiquitous in present times. Its users include e-commerce platforms, mobile application developers, marketing agents and online retailers. Typically, an online advertiser has to decide amongst various product advertisements and choose the one with the highest expected reward. Advertisers typically have a period of experimentation where they sequentially show competing advertisements to users to arrive at those that elicit the
best response from each customer type (customers may be clustered based on available information).


A key feature of online advertising
is that while each advertisement may be shown to a large number
of customers, the click rates on advertisements are usually small.
Typically, these may be of order one in a thousand\footnote{\url{https://cxl.com/guides/click-through-rate/benchmarks/}}, and a very small percentage\footnote{\url{https://localiq.com/blog/search-advertising-benchmarks/}} of the users who click on an advertisement end up buying the product (known as the conversion rate). The conversion and click rates can vary significantly depending on the product category. For example, high-end products often have higher click rates but much lower conversion rates compared to standard products. Thus, a key characteristic of the problem is that rarer conversions often carry very high rewards. Further, since the seller would have an estimate of the price that their product(s) may sell for, along with an estimate of the volume of sales that may take place, it is fair to assume that in practice upper bounds on rewards are known.


We study the problem of identifying the best advertisement to show to a customer type
as a best arm identification (BAI) problem in the multi-armed bandit framework.
The rarity of the rewards, and the fact that advertisements
are shown to a large number of customers, may make the computational effort
of popular existing adaptive algorithms prohibitive. On the other hand,
these properties call for sensible aggregation-based
algorithms. In this paper, we observe that the rewards from a large number
of pulls from each arm
can be well modelled as a compound Poisson process, significantly
simplifying and speeding up the existing {\it optimal} algorithms.
%In such a setting, the advertiser wants to quickly identify the product with the highest expected revenue.

To illustrate the proposed ideas clearly, we consider a simple stochastic
BAI problem where
an agent is given a set of $K$ unknown probability distributions (arms)
that can be sampled sequentially. The agent's objective is to declare the arm with the highest mean with a pre-specified confidence level $1- \delta$, while minimizing the expected number of samples (sample complexity). In the literature, this is popularly
known as the fixed-confidence setting, and the algorithms
that provide $1-\delta$ confidence guarantees
are referred to as $\delta$-correct.

Best arm identification problems
are also popular in the simulation community,
where these are better known as ranking and selection problems (see, for example, \cite{goldsman1983ranking,chan2006sequential}).
The classical problem involves many complex simulation models
of practical systems such as supply chain designs, traffic networks and so
on, and the aim is to identify, with high probability, the system
with the highest expected reward using a minimum computational budget. In many systems, the performance measure of interest may correspond
to a rare event, e.g., a manufacturing plant shutdown probability
or a computer system unavailability fraction. The algorithms that we
propose here are also applicable to optimal computational resource allocation
in simulating such systems.


\noindent {\bf Related literature:}
In the learning theory literature, \cite{even2006action} were amongst the first to consider the
fixed confidence BAI problem. They proposed a successive elimination algorithm
(see Section F of the supplementary material). Upper Confidence Bound (UCB) based algorithms were proposed in \cite{auer2002finite,jamieson2014lil}, wherein the arm with the highest confidence index is sampled. These algorithms usually stop when the difference between arm indices breaches a certain threshold (see \cite{jamieson2014best} for more details). The sample complexities of these algorithms were
shown to match the lower bound developed in these works to within a constant. Motivated by Bayesian approaches in \cite{russo2016simple}, \cite{jourdan2022top} propose top-two algorithms that sequentially identify
a challenger to the current empirical best arm and sample between the two with a pre-defined probability $\beta$. Although these algorithms are $\beta$-optimal\footnote{See \cite{jourdan2022top} for the definition.}, it is not clear how the optimal $\beta$ may be learnt, and
thus they are sub-optimal.
The sample complexity of these algorithms is typically analyzed in an asymptotic
regime where $\delta \rightarrow 0$.
\cite{kaufmann2016} and \cite{kaufmann2016a} derived a lower bound on the sample complexity through a max-min formulation. Based on this lower bound, a Track-and-Stop algorithm (TS) was proposed for
arm distributions restricted to single parameter exponential families (SPEF), and was shown to match the lower bound even to a constant (as $\delta \to 0$). \cite{shubhada2019,agrawal2020optimal} extended the TS algorithms to more general distributions. The optimal TS algorithms in the literature
proceed sequentially. At each iteration, the observed empirical parameters are plugged into the lower bound max-min problem to arrive at prescriptive optimal sample allocations to each arm, which then guide the sampling.
As is known, and as we observe, TS algorithms are computationally prohibitive\footnote{UCB based BAI algorithms are not instance optimal and incur large sample complexity in this setting.}, especially since in our rare event advertising settings, the informative non-zero reward samples (those instances where users buy products) are rare. This motivates the paper's goal to arrive at computationally efficient algorithms that exploit the compound Poisson structure (see Chapter 2 of \cite{ross1995stochastic}) of the arm reward process, with a small increase in sample complexity.

\noindent{\bf Contributions:}
We develop a rarity framework where the reward success probabilities
are modelled as a function of $\gamma^{\alpha}$ for an arm dependent $\alpha >0$,
where $\gamma > 0$ is small. The rewards are modelled
to be of order $\gamma^{-\alpha}$ so that the expected rewards across
arms are comparable (otherwise, we know a priori which arms have small or large expected rewards). We assume that arm specific upper bounds on rewards are available to us. In this framework, we propose a computationally efficient $\delta$-correct algorithm that is nearly asymptotically optimal for small $\gamma$. This algorithm (approximate track and stop) is based on existing track and stop algorithms that are simplified through
a compound Poisson approximation to the bandit reward process.
The Poisson approximation can be seen to be tight as $\gamma \rightarrow 0$,
and we provide bounds on the deviations due to the Poisson approximation.
Further, we give an asymptotically valid upper bound on the sample complexity, illustrating that the increase in sample complexity is marginal compared to the computational benefit of the proposed algorithm.
The rarity structure helps us shed further light on the optimal
sample allocations across arms in our BAI problem.
We identify five different regimes depending on the rarity differences between the arms.
Finally, we compare experimentally with the TS algorithm in \cite{agrawal2020optimal} for bounded random rewards. We find that for realistic rare event probabilities and reward structure, our algorithm is 6--12 times faster than the TS algorithm with a small increase (1--13\%) in sample complexity.


The rest of the paper is organized as follows.
Section \ref{setup} formally introduces the problem and the rare event setting, and provides some background. Section \ref{approx_prob} introduces the approximate problem, analyzes its deviations from the exact problem and gives the optimal weight asymptotics. Section \ref{approx_algo_section} outlines the details of the Approximate Track and Stop (TS(A)) algorithm, its $\delta$-correctness, its sample complexity guarantee and its computational benefits. Section \ref{num_experimental_section} presents some experimental results and we conclude in Section 6. The proofs of various results and further technical details are furnished in the supplementary material.

\section{Modelling Framework}\label{setup}

Consider a $K$-armed bandit with each arm's distribution denoted by $p_i$, $i \in [K]$. We denote such a bandit instance by $p$. 
For any distribution $\eta$, let $\mu(\eta)$ denote its mean and 
$\textrm{supp}(\eta)$ denote its support. 
Further,  
let $KL(\eta,\kappa)
=\mathbb{E}_{\eta} \big[\log \left (\frac{d \eta}{d \kappa} \right ) \big] $ denote the Kullback-Leibler divergence between two measures $\eta$ and $\kappa$,
where $\mathbb{E}_{\eta}$ denotes the expectation operator under $\eta$. The agent's goal is to sequentially sample from these arms
using a policy that at any sequential step
$t$, may depend upon all the generated data before time $t$. 
The policy then stops at a random stopping time
and declares an arm that it considers to have the highest mean. 
A sampling strategy, a stopping rule and a recommendation rule are together called a best arm bandit algorithm. A best arm bandit algorithm that correctly recommends the arm with the highest mean with probability at least $1-\delta$ (for a pre-specified $\delta \in (0,1)$) is said to be $\delta$-correct.


This BAI problem has been well studied, and lower bounds on sample complexity under $\delta$-correct algorithms 
have been developed along with algorithms
that match the lower bound 
asymptotically as $\delta \rightarrow 0$. Below, we first state the lower bound in 
Theorem~\ref{lower_bound_theorem},
and then briefly outline an algorithm
that asymptotically matches it. 
The lower bounds were developed by \cite{kaufmann2016} for the single parameter exponential family
of distributions and were  generalized to bounded and 
heavy-tailed distributions by \cite{agrawal2020optimal}.
Let
\begin{equation}
\mathcal{K}^{L, B}_{inf}(\eta,x) \coloneqq \min_{\substack{\textrm{supp}(\kappa) \subseteq [0, B] \\ \mu(\kappa) \leq x}} KL(\eta,\kappa)
\end{equation}
and
\begin{equation}
\mathcal{K}^{U, B}_{inf}(\eta,x) \coloneqq \min_{\substack{\textrm{supp}(\kappa) \subseteq [0, B] \\ \mu(\kappa) \geq x}} KL(\eta,\kappa).
\end{equation}

Henceforth, we suppress the dependence on $B$ above
to ease the presentation. This should not cause confusion
in the following discussion. For brevity, we denote $\mu(p_i)$ by $\mu_i$ for each $i \in [K]$. As is customary
in the BAI literature, we assume that the best arm is unique
and, without loss of generality, that $\mu_1 > \mu_i$ for
$i \in [K] \setminus \{1\}$.

\begin{theorem}{5 in \cite{agrawal2020optimal}}\label{lower_bound_theorem}
For our bandit problem, any $\delta$-correct algorithm with stopping rule $\tau_\delta$, satisfies 
\begin{equation*}
\mathbb{E}_p[\tau_\delta] \geq \frac{1}{V^{*}(p)} \log\Big(\frac{1}{2.4\delta}\Big),
\end{equation*}
where $V^{*}(p)$ is defined as
\begin{equation}\label{sample_lower_bound}
\max_{w \in \Sigma_K} \min_{i\neq 1} \inf_{x \in [\mu_i,\mu_1]}w_1\mathcal{K}^{L}_{inf}(p_1,x)+w_i \mathcal{K}^{U}_{inf}(p_i,x),
\end{equation}
$\Sigma_K$ being the $K$-dimensional probability simplex.
\end{theorem}
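As a quick numerical reading of this bound (the value $\delta = 0.05$ is chosen purely for illustration, and $\log$ is taken to be the natural logarithm), note that
\begin{equation*}
\log\Big(\frac{1}{2.4 \times 0.05}\Big) = \log\Big(\frac{1}{0.12}\Big) \approx 2.12,
\end{equation*}
so that any $0.05$-correct algorithm needs, in expectation, at least about $2.12 / V^{*}(p)$ samples. The instance-dependent quantity $V^{*}(p)$ therefore determines the scale of the sample complexity.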

The lower bound suggests that each arm be sampled in proportion to the optimal weights $w^*$ in (\ref{sample_lower_bound}). This idea guides the optimal Track and Stop (TS) algorithms that match the lower
bound asymptotically as $\delta \rightarrow 0$. Typically, such algorithms have the following features
(see \cite{kaufmann2016}, \cite{agrawal2020optimal}
for further details):
\begin{enumerate}
    \item 
    Arms are sampled sequentially in batches. 
    At stage
    $t$, each arm is sampled at least
    order $\sqrt{t}$ times (this sub linear 
    exploration ensures that no arm is starved).
    \item
    Empirical distributions $\hat{p}_t$
    are plugged into the lower bound problem (\ref{sample_lower_bound}),
    which is then solved to determine
    the prescriptive proportions $\hat{w}_t$.
    \item
    The algorithm then samples to closely track these proportions.
    \item 
    The algorithm stops when the log-likelihood ratio
    at stage $m$ exceeds 
    a threshold $\beta(m,\delta)$ (set close to $\log(1/\delta)$).
At stage $m$, the log likelihood ratio equals
    \[
    \begin{aligned}
     \underset{b \neq k^*}
     {\min}\underset{x \leq y}{\inf} &N_{k^*}(m)\mathcal{K}_{inf}^{L}(\hat{p}_{k^*}(m),x) \\ 
        &+ N_{b}(m)\mathcal{K}_{inf}^{U}(\hat{p}_b(m),y),
    \end{aligned}
    \]
    where $k^*$ denotes the arm with the largest sample mean, and
    each $N_a(m)$ denotes the number of samples of arm $a$ among the first $m$ samples.
\end{enumerate}
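The four features above can be arranged into the schematic loop of the algorithm float that follows. It is only an illustrative sketch and not the exact procedure of the cited papers: the symbol $Z(m)$ is a name introduced here for the log-likelihood ratio displayed in the last item, and the forced-exploration test $\min_a N_a(m) < \sqrt{m}$, the one-sample-per-iteration schedule (rather than batches) and the tracking rule are assumptions made for concreteness.
\begin{algorithm}[t]
\caption{Schematic Track and Stop loop (illustrative sketch only)}
\begin{algorithmic}[1]
\REQUIRE confidence level $\delta$; threshold $\beta(\cdot,\delta)$
\STATE sample each arm once and set $m \leftarrow K$
\WHILE{$Z(m) \leq \beta(m,\delta)$}
\IF{$\min_a N_a(m) < \sqrt{m}$}
\STATE sample an arm attaining $\min_a N_a(m)$ \COMMENT{forced exploration (assumed rule)}
\ELSE
\STATE solve (\ref{sample_lower_bound}) with $\hat{p}(m)$ in place of $p$, and $k^*$ in the role of arm $1$, to get $\hat{w}_m$
\STATE sample an arm maximizing $m\hat{w}_{m,a} - N_a(m)$ \COMMENT{tracking (assumed rule)}
\ENDIF
\STATE $m \leftarrow m + 1$
\ENDWHILE
\STATE declare the arm $k^*$ with the largest sample mean
\end{algorithmic}
\end{algorithm}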

As is apparent, the above algorithm involves repeatedly solving 
 the lower bound problem, and this is computationally demanding, particularly when nonzero rewards are rare and occur with very low probabilities. 

\subsection{The Rare Event Setting}\label{rare_event_setup}

We now specialize the BAI setting  to  illustrate our rare event framework where the rewards
from each arm take positive values with small probabilities. Further, while
the expected rewards across arms are of the same order, the realized rewards and the associated
probabilities may be substantially different.


Concretely, suppose that $\gamma$ is a small positive value (say of order $10^{-2}$ or lower) and
corresponding to each arm distribution $p_i, \ i \in [K]$, we have a rarity
index $\alpha_i >0$. The reward of arm $i$ takes $n_i$ distinct nonzero values, namely, $a_{ij} \gamma^{-\alpha_i}$, each with probability
$p_{ij} \gamma^{\alpha_i} >0$ for $j \in [n_i], \ n_i \in \mathbb{N}$. Under
each $p_i$, the realized reward takes value zero with probability close 
to 1. To summarize,
\begin{equation*}
\begin{aligned}
&\mathbb{P}_{X \sim p_i}(X=a_{ij}\gamma^{-\alpha_i}) = p_{ij}\gamma^{\alpha_i}, \ j \in [n_i] \\
&\mathbb{P}_{X \sim p_i}(X=0) = 1 - \sum_j p_{ij}\gamma^{\alpha_i}.
\end{aligned}
\end{equation*}
The arm means are given by $\mu_i=\sum_j a_{ij}p_{ij}$ and are independent of $\gamma$.
We further assume 
that an upper bound $B_i\gamma^{-\alpha_i}$ for each arm $i$ 
is known to the agent. 
We assume above
that when arm $i$ sees a large reward of order $\gamma^{-\alpha_i}$,
it takes finitely many values. This keeps our analysis somewhat
simpler, and we deal with a compound Poisson process for the cumulative reward from each arm. It is easy to extend this to general distributions; in that case, the cumulative rewards from each arm would follow a Poisson random measure and \hyperref[poisson_approx_result]{Proposition 1} would generalize accordingly.
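To make the parametrization concrete, consider the following purely illustrative instance (the numbers are hypothetical and are not taken from our experiments): let $\gamma = 10^{-2}$, $\alpha_i = 1$, $n_i = 2$, $(a_{i1}, p_{i1}) = (1, 0.5)$ and $(a_{i2}, p_{i2}) = (2, 0.25)$. Then
\begin{equation*}
\begin{aligned}
&\mathbb{P}(X = 100) = 0.005, \quad \mathbb{P}(X = 200) = 0.0025, \\
&\mathbb{P}(X = 0) = 0.9925, \\
&\mu_i = 100 \times 0.005 + 200 \times 0.0025 = 1 = \sum_j a_{ij}p_{ij},
\end{aligned}
\end{equation*}
so that only about one pull in $133$ returns a nonzero reward, while the mean does not depend on $\gamma$.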


The above rarity framework brings out the benefits  of the proposed approximations cleanly for small $\gamma$ in our theoretical analysis. 
However, in executing the associated algorithm, we don't need to separately
know the values of $\gamma$ and each $\alpha_i$.



\subsection{The Poisson Approximation of KL Divergence}\label{poisson_punchline}
We motivate in this section the approximate form of KL divergence that we shall use. The following well-known result, shown in Section A.5 of the supplementary material for completeness, is used to motivate our approximation.
\begin{proposition}{1}\label{poisson_approx_result}
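Let $\tau^{(1)}_{ij}$ denote the minimum number of samples of arm $i$ needed to see the reward $a_{ij}\gamma^{-\alpha_i}$, i.e., the first arrival time of the support point $j$. Similarly, for $k \geq 2$, let $\tau^{(k)}_{ij}$ be the $k$-th inter-arrival time of support point $j$, i.e., the number of additional samples of arm $i$ between the $(k-1)$-th and $k$-th occurrences of this reward.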

Let $N_{ij}(t)$ be the number of times the reward $a_{ij}\gamma^{-\alpha_i}$ is returned by arm $i$ in $\lceil t\gamma^{-\alpha_i}\rceil$ trials ($t \in \mathbb{R}$). Then as $\gamma \to 0$,

\begin{enumerate}[label=(\alph*)]
    \item $\mathbb{P}(\tau^{(k)}_{ij} > t\gamma^{-\alpha_i}) \to e^{-p_{ij}t}$,
    %\item $\mathbb{P}(N_{ij}(t) = k) = \mathbb{P}(\sum_{l=1}^{k-1}\tau^{(l)}_{ij} \leq t < \sum_{l=1}^{k}\tau^{(l)}_{ij})$
    \item $N_{ij}(t)$ $\xrightarrow{D}$ $\mathrm{Poisson}$($p_{ij}t$). 
\end{enumerate}
Further, across support points, $\{\mathrm{Poisson}(p_{ij}t)\}_j$ is a collection of mutually independent random variables.
\end{proposition}
This implies that in the rare event setting, the distribution of the counting process $N_{ij}(t)$ for each support point $a_{ij}\gamma^{-\alpha_i}$ is well-approximated by a Poisson process. We now argue that when $\gamma$ is small enough, the KL divergence between arm distributions $p_i$ and $\tilde{p}_i$ of the same rarity can be approximated by a sum of KL divergences between independent $\mathrm{Poisson}$ variables.
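Before doing so, we note a rough sanity check of part (b) of Proposition 1, with hypothetical values: take $p_{ij}\gamma^{\alpha_i} = 0.005$ and $t = 1$, so that there are $\gamma^{-\alpha_i} = 100$ trials and $p_{ij}t = 0.5$. The exact probability of never seeing the reward at support point $j$ and its Poisson counterpart are
\begin{equation*}
(1 - 0.005)^{100} \approx 0.6058 \quad \mbox{and} \quad e^{-0.5} \approx 0.6065,
\end{equation*}
which already agree to within about $10^{-3}$.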


Let $X_{1:m}$ and $\tilde{X}_{1:m}$ be two sets of i.i.d.\ samples of size $m$ from $p_i$ and $\tilde{p}_i$ respectively. The corresponding measures are the product measures $p_i^{\otimes m}$ and $\tilde{p}_i^{\otimes m}$ respectively. By the tensorization property of KL divergence, we have that
\begin{equation}\label{tensor_KL}
KL\big(p_i^{\otimes m},\tilde{p}_i^{\otimes m}\big)=m\,KL(p_i,\tilde{p}_i).
\end{equation}
In the following discussion we set $m=\lceil t\gamma^{-\alpha_i}\rceil$. Consider the vector-valued random variable $(N_{ij}(t))_{j \in [n_i]}$ and its counterpart $(\tilde{N}_{ij}(t))_{j \in [n_i]}$ under $\tilde{p}_i$. Note that they are functions of the samples $X_{1:\lceil t\gamma^{-\alpha_i}\rceil}$ and $\tilde{X}_{1:\lceil t\gamma^{-\alpha_i}\rceil}$. Since we can also reconstruct a permutation of these samples from $(N_{ij}(t))_j$ and $(\tilde{N}_{ij}(t))_j$, we have that
\begin{equation*}
\begin{aligned}
KL\big(p_i^{\otimes \lceil t\gamma^{-\alpha_i}\rceil}, \, & \tilde{p}_i^{\otimes \lceil t\gamma^{-\alpha_i}\rceil}\big)\\
= &KL\big(\nu((N_{ij}(t))_j),\nu((\tilde{N}_{ij}(t))_j)\big)
\end{aligned}
\end{equation*}
where $\nu(A)$ is the measure of a random variable $A$. Now, by continuity of KL in $\gamma$ and the weak convergence in \hyperref[poisson_approx_result]{Proposition 1}, it follows that for $\gamma$ small enough,
\begin{equation*}
\begin{aligned}
&KL\big(p_i^{\otimes \lceil t\gamma^{-\alpha_i}\rceil},\tilde{p}_i^{\otimes \lceil t\gamma^{-\alpha_i}\rceil}\big)\\
\approx  &\sum_j KL(\textrm{Poisson}(p_{ij}t),\textrm{Poisson}(\tilde{p}_{ij}t))\\
= &t \bigg[ \sum_j p_{ij} \log \Big(\frac{p_{ij}}{\tilde{p}_{ij}}\Big) + (\tilde{p}_{ij} - p_{ij})\bigg].
\end{aligned}
\end{equation*}
Combining the approximation above with the relation (\ref{tensor_KL}) gives
\begin{equation}\label{approx_KL}
KL(p_i,\tilde{p}_i)\approx \gamma^{\alpha_i}\bigg[\sum_j p_{ij} \log \Big(\frac{p_{ij}}{\tilde{p}_{ij}}\Big) + (\tilde{p}_{ij} - p_{ij})\bigg].
\end{equation}
 This approximation is used to motivate the approximate lower bound problem in the next section.
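To see where the correction terms come from in the simplest case, suppose (as an illustrative special case) that $n_i = 1$, that $p_i$ and $\tilde{p}_i$ are both supported on $\{0, a_{i1}\gamma^{-\alpha_i}\}$, and write $q = p_{i1}\gamma^{\alpha_i}$ and $\tilde{q} = \tilde{p}_{i1}\gamma^{\alpha_i}$ (shorthand introduced only for this remark). A first-order expansion of the logarithm gives
\begin{equation*}
\begin{aligned}
KL(p_i,\tilde{p}_i) &= q\log\Big(\frac{q}{\tilde{q}}\Big) + (1-q)\log\Big(\frac{1-q}{1-\tilde{q}}\Big)\\
&= \gamma^{\alpha_i}\Big[p_{i1}\log\Big(\frac{p_{i1}}{\tilde{p}_{i1}}\Big) + (\tilde{p}_{i1} - p_{i1})\Big]\\
&\quad + \mathcal{O}(\gamma^{2\alpha_i}),
\end{aligned}
\end{equation*}
which agrees with (\ref{approx_KL}) up to a term of order $\gamma^{2\alpha_i}$ for fixed $p_{i1}$ and $\tilde{p}_{i1}$.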

\section{Approximate Lower Bound Problem}\label{approx_prob}

For each $i$, if $B_i\gamma^{-\alpha_i} \notin \textrm{supp}(p_i)$, let $\tilde{n}_i = n_i+1$ and set $a_{i \tilde{n}_i} = B_i$; otherwise let $\tilde{n}_i = n_i$.
The Poisson approximation of the KL divergence (see Section \ref{poisson_punchline}) suggests that in lieu of Equation (\ref{sample_lower_bound}), which is computationally expensive to solve, one could consider the following approximate problem when $\gamma$ is small (the summations over $j$ below correspond to $j \in [\tilde{n}_i]$).
\begin{equation}\label{approx_lower_bound_problem}
\begin{aligned}
&V^{*}_{a}(p) \coloneqq \max_{w \in \Sigma_K} \min_{i\neq 1} \underset{\underset{\sum_j a_{1j}\tilde{p}_{1j}}{\sum_j a_{ij}\tilde{p}_{ij} \geq}}{\inf}\bigg\{w_1\gamma^{\alpha_1}\bigg[ \sum_j p_{1j} \log \Big(\frac{p_{1j}}{\tilde{p}_{1j}}\Big)\\
& + (\tilde{p}_{1j} - p_{1j})\bigg] + w_i\gamma^{\alpha_i}\bigg[ \sum_j p_{ij} \log \Big(\frac{p_{ij}}{\tilde{p}_{ij}}\Big)
 + (\tilde{p}_{ij} - p_{ij})\bigg]\bigg\}.
\end{aligned}
\end{equation}
That is, the KL terms in the minimization in (\ref{sample_lower_bound}) are replaced with the approximation (\ref{approx_KL}).
Above, instead of allowing $\tilde{p}_i$ to have the support
$[0, B_i \gamma^{-\alpha_i}]$, we limit its support
to that of $p_i$, extended
to include the point $B_i \gamma^{-\alpha_i}$.
This is justified in Sections A.1--A.2 of the supplementary material. The above representation suggests that in the TS algorithm we do not need to estimate $\gamma$ and the $\alpha_i$'s separately; knowing only the relative rarities $\gamma^{\alpha_i-\alpha_k}$ for some fixed $k$ is sufficient.
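Indeed, write $D_k \coloneqq \sum_j \big[p_{kj}\log(p_{kj}/\tilde{p}_{kj}) + \tilde{p}_{kj}-p_{kj}\big]$ (a shorthand introduced only for this remark). Factoring $\gamma^{\alpha_1}$ out of the objective in (\ref{approx_lower_bound_problem}), the objective becomes
\begin{equation*}
\gamma^{\alpha_1}\big\{ w_1 D_1 + w_i \gamma^{\alpha_i - \alpha_1} D_i \big\},
\end{equation*}
and the positive factor $\gamma^{\alpha_1}$ affects neither the minimizing $\tilde{p}$ nor the maximizing $w$. Here we took $k=1$; any other fixed arm works in the same way.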


Let 
\begin{equation}\label{P_i_def}
\mathcal{P}_i \coloneqq \underset{x \in [\mu_i,\mu_1]}{\inf}w_1\mathcal{K}^{L}_{inf}(p_1,x)+w_i \mathcal{K}^{U}_{inf}(p_i,x)
\end{equation}
denote the inner minimisation problem
in (\ref{sample_lower_bound})
and let
\begin{equation} \label{P_i_def2}
\begin{aligned}
\mathcal{P}_{i,a} \coloneqq \underset{\underset{\sum_j a_{1j}\tilde{p}_{1j}}{\sum_j a_{ij}\tilde{p}_{ij} \geq} }{\inf}&w_1\gamma^{\alpha_1}\bigg[ \sum_j p_{1j} \log \Big(\frac{p_{1j}}{\tilde{p}_{1j}}\Big) + (\tilde{p}_{1j} - p_{1j})\bigg]\\
+&w_i\gamma^{\alpha_i}\bigg[ \sum_j p_{ij} \log \Big(\frac{p_{ij}}{\tilde{p}_{ij}}\Big) + (\tilde{p}_{ij} - p_{ij})\bigg]
\end{aligned}
\end{equation}
denote its approximation (above, we suppress the  dependence on 
$w_1$ and $w_i$ of $\mathcal{P}_i$ and 
$\mathcal{P}_{i,a}$).
By approximating a reformulated version of 
$\mathcal{P}_i$ that  uses the dual representations of  $\mathcal{K}^{L}_{inf}$ and $\mathcal{K}^{U}_{inf}$ (following the approach used in \cite{honda2010asymptotically,agrawal2020optimal}),
%The quantities $\mathcal{P}_{i,a}$ above have been reformulated in much the same manner as $\mathcal{P}_i$.
we can show that 
\begin{equation}\label{approx_min_rform}
\begin{aligned}
\mathcal{P}_{i,a}=&w_1\gamma^{\alpha_1} \big[\sum_j p_{1j}\log(1+C^{a}_{1i}a_{1j})-C^{a}_{1i}x^{*}_{i,a}\big]
\\+&w_i \gamma^{\alpha_i}\big[\sum_j p_{ij}\log(1-C^{a}_{i}a_{ij})+C^{a}_{i}x^{*}_{i,a}\big],
\end{aligned}
\end{equation}
where the quantities $x^{*}_{i,a},C^a_{1i},C^a_{i}$ (the qualifier $a$ indicates that these pertain to the approximate problem) are defined by the relations:
\begin{equation}\label{approx_prob_cond}
\begin{aligned}
&C^a_{1i}w_1\gamma^{\alpha_1} =C^a_iw_i\gamma^{\alpha_i},\\
&x^{*}_{i,a}=\sum_j \frac{a_{1j}p_{1j}}{1+a_{1j}C^a_{1i}}, \mbox{ and }\\
& x^{*}_{i,a}=\sum_j \frac{a_{ij}p_{ij}}{1-a_{ij}C^a_{i}}.
\end{aligned}
\end{equation}
Section A.4 of the supplementary material provides the step-by-step reformulation, as well as the results that have been used for it (Sections A.1--A.3 and A.5).
The advantage of our reformulation is that the quantities $C^a_{1i}$ and $C^a_{i}$ have bounded, well-defined limits and, using (\ref{approx_prob_cond}), we can eliminate the dependence on $x^{*}_{i,a}$ (whose behaviour is not as easy to analyze as $\gamma \to 0$).
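For intuition, the form of (\ref{approx_min_rform}) can be recovered by a heuristic Lagrangian calculation. This is only a sketch that assumes an interior minimizer and an active constraint; the rigorous argument is the one in the supplementary material. Introduce an intermediate level $x$ with $\sum_j a_{1j}\tilde{p}_{1j} \leq x \leq \sum_j a_{ij}\tilde{p}_{ij}$, and let $C^a_{1i}w_1\gamma^{\alpha_1}$ and $C^a_{i}w_i\gamma^{\alpha_i}$ be the multipliers of the two inequalities. Stationarity in $\tilde{p}_{1j}$ and $\tilde{p}_{ij}$ gives
\begin{equation*}
\tilde{p}_{1j} = \frac{p_{1j}}{1 + C^a_{1i}a_{1j}}, \qquad \tilde{p}_{ij} = \frac{p_{ij}}{1 - C^a_{i}a_{ij}},
\end{equation*}
while stationarity in $x$ gives the first relation in (\ref{approx_prob_cond}). Substituting back, for example for arm $1$,
\begin{equation*}
\begin{aligned}
&\sum_j \Big[p_{1j}\log\Big(\frac{p_{1j}}{\tilde{p}_{1j}}\Big) + \tilde{p}_{1j} - p_{1j}\Big]\\
&\quad = \sum_j p_{1j}\log(1 + C^a_{1i}a_{1j}) - C^a_{1i}x^{*}_{i,a},
\end{aligned}
\end{equation*}
with $x^{*}_{i,a} = \sum_j a_{1j}\tilde{p}_{1j}$, which is the first bracket in (\ref{approx_min_rform}).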


The discussion in Section  \ref{poisson_punchline} also suggests that $\mathcal{P}_{i,a} \approx \mathcal{P}_i$ and hence, $V^{*}(p)\approx V^{*}_a(p)$. This is shown in the following theorem:

\begin{theorem}{1}\label{approxMaxMinTheorem}
For each $i \in [K]$ and $w \in \Sigma_K$,
$\mathcal{P}_i$, $\mathcal{P}_{i,a}$ are $\mathcal{O}(\gamma^{\max(\alpha_1, \alpha_i)})$.
Furthermore, 
$\underset{\gamma \to 0}{\lim}\frac{\mathcal{P}_i}{\mathcal{P}_{i,a}}=1.$
In addition,  there exist constants $L_{1i}$ and $L_i$, independent of $w$, such that 
\begin{equation*}    
|\mathcal{P}_i - \mathcal{P}_{i,a}| \leq L_{1i}w_1\gamma^{\min(2\alpha_1,\alpha_1 + \alpha_i)}+L_{i}w_i\gamma^{\min(2\alpha_i,\alpha_i + \alpha_1)}.
\end{equation*}
Furthermore,
\begin{equation*}
\begin{aligned}
|V^{*}(p)-V^{*}_a(p)| \leq \underset{i \neq 1}{\max}\max\big(&L_{1i}\gamma^{\min(2\alpha_1,\alpha_1 + \alpha_i)},\\
&L_{i}\gamma^{\min(2\alpha_i,\alpha_i + \alpha_1)}\big).
\end{aligned}
\end{equation*}
\end{theorem}

The proof involves simplifying $\mathcal{P}_i$ and $\mathcal{P}_{i,a}$
through Taylor expansions for small $\gamma$.
It is given in Sections A.4 and B of the supplementary material.
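As an illustration of the exponents in the first bound (with hypothetical rarity indices), take $\alpha_1 = 1$ and $\alpha_i = 2$. Then
\begin{equation*}
\min(2\alpha_1, \alpha_1+\alpha_i) = 2, \qquad \min(2\alpha_i, \alpha_i+\alpha_1) = 3,
\end{equation*}
so the bound reads $|\mathcal{P}_i - \mathcal{P}_{i,a}| \leq L_{1i}w_1\gamma^{2} + L_iw_i\gamma^{3}$. For $\gamma = 10^{-2}$ the two terms are of order $10^{-4}$ and $10^{-6}$ respectively, up to the constants $L_{1i}$ and $L_i$.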

\subsection{Solving the approximate lower bound}\label{approx_lb_section} 
By definition we have that 
\begin{equation*}
V^{*}_a(p)=\max_{w \in \Sigma_K} \min_{i\neq 1}\mathcal{P}_{i,a}.
\end{equation*}
Further, we note that $\mathcal{P}_{i,a}$ is a concave function of $w$ (being an infimum of linear functions of $w$). Maxmin problems with this specific structure were studied in \cite{glynn2004large}. The caveat is that in our
$\mathcal{K}_{inf}$ definitions, the first argument of the underlying KL term is fixed while we optimize over the second argument, whereas in
\cite{glynn2004large} these roles are reversed. However, all the steps
carry over identically.
The optimal weights $w^*$ are characterized in the following theorem:
\begin{theorem}{1 in \cite{glynn2004large}}
The optimal $w^{*}$ of the maxmin problem (\ref{approx_lower_bound_problem}) satisfies:
\begin{equation}\label{KL_ratio_sum_eq}
\sum_{i=2}^{K}\frac{\partial \mathcal{P}_{i,a}(w^*)}{\partial w_1}\bigg/ \frac{\partial \mathcal{P}_{i,a}(w^*)}{\partial w_i}=1,
\end{equation}
and $\forall i \neq j$, $i,j \neq 1$,
\begin{equation}\label{wt_kl_eq}
\mathcal{P}_{i,a}(w^*)=\mathcal{P}_{j,a}(w^*).
\end{equation}
These conditions are also sufficient.
\end{theorem}

We can use the above theorem to find closed-form expressions (in terms of $w^*$) for $\mathcal{P}_{i,a}$ and $\frac{\partial \mathcal{P}_{i,a}(w^*)}{\partial w_j}$ using (\ref{approx_min_rform}). As a starting point, we identify certain monotonicities present in (\ref{approx_prob_cond}), (\ref{KL_ratio_sum_eq}) and (\ref{wt_kl_eq}) that simplify root-finding via bisection.


The equations defining $C^a_{1i}$ and $C^a_i$ imply that $C^a_{i}$ is a decreasing function of $C^a_{1i}$. Mathematically, the implicit functions $g_i(r)$, defined for all $i \neq 1$ as
 \begin{equation*}
 \sum_j \frac{a_{1j}p_{1j}}{1+g_i(r)a_{1j}}=\sum_j \frac{a_{ij}p_{ij}}{1-ra_{ij}}
 \end{equation*}
 are decreasing in $r$. The domain of $g_i$ is chosen such that the RHS in the above equation is positive and finite.
 \\ The optimality equation (\ref{wt_kl_eq}) implies that, at the optimal weights $w^*$, each $C^a_{1i}$, $i>2$, is an increasing function of $C^a_{12}$. More formally, the functions $\xi_{i}(s)$, $\forall i>2$, implicitly defined through the equation
 \begin{equation*}
 \begin{aligned}
    &\sum_j p_{1j} \log(1+g_i(\xi_i)a_{1j}) +\frac{g_i(\xi_i)}{\xi_i}\sum_jp_{ij}\log(1-\xi_ia_{ij})\\
    =&\sum_j p_{1j} \log(1+g_2(s)a_{1j})
 +\frac{g_2(s)}{s}\sum_j p_{2j}\log(1-sa_{2j}),
 \end{aligned}
 \end{equation*}
 are increasing in $s$. The domain of $\xi_i$ is such that the RHS is well-defined. Finally, as a function of $C^a_{12}$, the LHS in the optimality equation (\ref{KL_ratio_sum_eq}) is also increasing. Mathematically, this means that the functions $h_i$, defined for all $i \neq 1$ as
 \begin{equation*}
 \begin{aligned}
 h_i(s)\coloneqq &\bigg(\sum_j p_{1j}\log(1+\xi_ia_{1j}) -\xi_i \cdot \\
 &\Big[\sum_j \frac{a_{1j}p_{1j}}{1+a_{1j}\xi_i}\Big]\bigg)\bigg(\sum_j p_{ij}\log(1-g_i(\xi_i)a_{ij}) \\
 & +g_i(\xi_i)\sum_j \Big[\frac{a_{ij}p_{ij}}{1-a_{ij}g_i(\xi_i)}\Big]\bigg)^{-1}
 \end{aligned}
 \end{equation*}
 are increasing in $s$. These monotonicities enable one to solve for optimal weights in (\ref{approx_lower_bound_problem}) through simple bisection methods. This is the source of computational benefit of solving (\ref{approx_lower_bound_problem}) vis-a-vis (\ref{sample_lower_bound}). In (\ref{sample_lower_bound}), one has to solve either convex programs ($\mathcal{P}_i$) or a nonlinear system of four equations %((\ref{C_1iC_iRelation}),(\ref{twisted_mean_eq}),(\ref{prob_eq}),(\ref{w_K_iEq}))  
 to arrive at the solution (see Section C of supplementary material).
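
As a brief sanity check of the first monotonicity claim above, the following sketch assumes (as the form of the equations suggests, though it is not restated here) that $a_{1j}, a_{ij} \geq 0$, that $p_{1j}, p_{ij}$ are probabilities, and that $r$ and $g_i(r)$ lie in the range where all denominators are positive. The names $F$ and $G_i$ are introduced only for this illustration. Write $F(g) \coloneqq \sum_j \frac{a_{1j}p_{1j}}{1+g a_{1j}}$ and $G_i(r) \coloneqq \sum_j \frac{a_{ij}p_{ij}}{1-ra_{ij}}$, so that $g_i$ is defined by $F(g_i(r)) = G_i(r)$. Then
\begin{equation*}
F'(g) = -\sum_j \frac{a_{1j}^2p_{1j}}{(1+g a_{1j})^2} \leq 0, \qquad
G_i'(r) = \sum_j \frac{a_{ij}^2p_{ij}}{(1-ra_{ij})^2} \geq 0,
\end{equation*}
and differentiating $F(g_i(r)) = G_i(r)$ with respect to $r$ gives
\begin{equation*}
g_i'(r) = \frac{G_i'(r)}{F'(g_i(r))} \leq 0
\end{equation*}
whenever $F'(g_i(r)) \neq 0$, which is consistent with $g_i$ being decreasing in $r$.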

 
 This enables us to study the behaviour of $w^*$ as $\gamma \to 0.$ We set up some notation first.
\begin{definition}{1} {\em
Two positive valued functions of $\gamma$, $A(\gamma)$ and $B(\gamma)$, are said to be \textit{asymptotically equivalent} if $0<\underset{\gamma \to 0}{\liminf}\frac{A(\gamma)}{B(\gamma)} \leq \underset{\gamma \to 0}{\limsup}\frac{A(\gamma)}{B(\gamma)}<\infty$. We denote this by $A(\gamma) =\Theta(B(\gamma))$.  }
\end{definition}
Let $\alpha_{\max}={\max}_i\alpha_i$. The quantity $\zeta \coloneqq \sum_{\substack{i \neq 1,\\ \alpha_i = \alpha_{\max}}} h_i(\xi_{i}(0))$
also plays a role in governing the asymptotic behaviour of $w^*$. 

Theorem $\hyperref[asymptotic_w]{2}$ provides insight into the optimal weights
in the lower bound problem 
as $\gamma \rightarrow 0$. We discuss its conclusions further in the next subsection.
\begin{theorem}{2}\label{asymptotic_w}
 The behaviour of $w^*$ as $\gamma \to 0$ is described by the following five cases:\\
 \\
{Case 1: The best arm is not the rarest, $\alpha_{max} \neq \alpha_1.$}
\begin{equation*}
\begin{aligned}
& w^*_1= \Theta(\gamma^{\frac{\alpha_{max}-\alpha_1}{2}}),\\
& w^*_i= \Theta(\gamma^{\alpha_{max}-\alpha_i}) &&\text{for all $i \neq 1$}.
\end{aligned}
\end{equation*}
{Case 2: The best arm is uniquely the rarest, $\alpha_1 = \alpha_{max} > \alpha_i,  i \neq 1.$}
\begin{equation*}
\begin{aligned}
& w^*_2= \Theta( \gamma^{\frac{\alpha_{max}-\alpha_2}{2}}),\\
& w^*_i= \Theta(\gamma^{\alpha_{max}-\alpha_i}) &&\text{for all $i \neq 2.$}
\end{aligned}
\end{equation*}
{Case 3: Only the best and second-best arms are the rarest, $\alpha_1 = \alpha_2 = \alpha_{max} > \alpha_i, \ \forall i \neq 1,2.$}
\begin{equation*}
 w^*_i = \Theta(\gamma^{\alpha_{max}-\alpha_i}),\text{ for all $i.$}
\end{equation*}
{Case 4: The best arm is the rarest but not uniquely, $\alpha_1 = \alpha_k = \alpha_{max} \geq \alpha_i, \ i \notin \{1,2,k\}$, $\alpha_{max}>\alpha_2$ and $\zeta > 1$.}
\begin{equation*}
\begin{aligned}
& w^*_2= \Theta(\gamma^{\frac{\alpha_{max}-\alpha_2}{2}}),\\
& w^*_i = \Theta( \gamma^{\alpha_{max}-\alpha_i}) &&\text{for all $i \neq 2.$}
\end{aligned}
\end{equation*}
{Case 5: The best arm is the rarest but not uniquely, $\alpha_1 = \alpha_k = \alpha_{max} \geq \alpha_i, \ i \notin \{1,2,k\}$, $\alpha_{max}>\alpha_2$ and $\zeta \leq 1$.}
\begin{equation*}
\begin{aligned}
& w^*_1= \Theta(\gamma^{\alpha_{max}-\alpha_1}),\\
& w^*_i= \Theta( \gamma^{\alpha_{max}-\alpha_i}) &&\text{for all $i \neq 1.$}
\end{aligned}
\end{equation*}
Further, the asymptotic equivalence can be expressed by limits that are functions of parameters of the bandit problem.
\end{theorem}
\begin{proof}
See Section C of the supplementary material.
\end{proof}
%\textit{Remaining matter goes to appendix as proof}\\
Theorem $\hyperref[asymptotic_w]{2}$ gives us insight into the behavior of the optimal weights $w^*$ in  (\ref{approx_lower_bound_problem}).
The results from  the theorem  rely on Lemma $\hyperref[tweaked_means_meet]{1}$ below and are discussed further in Section~3.2.

Since $V^*(p)\approx V^*_a(p)$ (Theorem \hyperref[approxMaxMinTheorem]{1}), the optimal weights of the actual maxmin problem will also show the same asymptotic behaviour. Substituting these optimal weights in $V^*(p)$ shows that $V^*(p)$ is a scalar multiple of $\gamma^{\alpha_{max}}$, which gives an overall lower bound on the sample complexity that scales as $\gamma^{-\alpha_{max}}$.
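
To illustrate Case 1 of Theorem $\hyperref[asymptotic_w]{2}$ on a concrete instance, take $K=3$, $\alpha=(1,1.5,2)$ and $\gamma=10^{-2}$, so that $\alpha_{max}=2\neq\alpha_1$. This is only a sketch of the scalings, since the $\Theta(\cdot)$ notation hides instance-dependent constants. The theorem then gives
\begin{equation*}
\begin{aligned}
& w^*_1= \Theta(\gamma^{\frac{2-1}{2}}) = \Theta(\gamma^{1/2}),\\
& w^*_2= \Theta(\gamma^{2-1.5}) = \Theta(\gamma^{1/2}),\\
& w^*_3= \Theta(\gamma^{2-2}) = \Theta(1),
\end{aligned}
\end{equation*}
with $\gamma^{1/2}=0.1$ at $\gamma=10^{-2}$. Up to constants, the rarest arm (arm 3) thus receives a constant fraction of the samples, while arms 1 and 2 each receive a fraction of order $\gamma^{1/2}$.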

\subsection{Discussion on Theorem $\hyperref[asymptotic_w]{2}$}
Without loss of generality let arm 2 be the one with the second highest mean. We further assume that $\mu_2> \mu_i$ for $i \geq 3$.
\begin{lemma}{1}\label{tweaked_means_meet}
In the maxmin problem (\ref{sample_lower_bound}), let $x^{*}_{i,e}(w^*)$ denote the minimizer of each $\mathcal{P}_i$ for the optimal weights $w^*$. Then, we have $x^{*}_{i,e}(w^*) \in [\mu_2,\mu_1] \,\,\,\, \forall i$.
\end{lemma}
\begin{remark}
It is well known that $x^{*}_{i,e}(w^*)$ lies within $[\mu_i,\mu_1]$. Lemma $\hyperref[tweaked_means_meet]{1}$ shows that, in addition, $x^{*}_{i,e}(w^*) \geq \mu_2$.
\end{remark}
\noindent
\textbf{Proof of Lemma \hyperref[tweaked_means_meet]{1}:}
We shall show this by contradiction. Suppose $x^{*}_{i,e}(w^*) < \mu_2$ for some $i$. Then, from the optimality conditions of $w^*$ (similar to (\ref{KL_ratio_sum_eq}), (\ref{wt_kl_eq})) we have, $\forall i \neq j$, $i,j \neq 1$:
\begin{equation*}
\begin{aligned}
&\underset{\mu'_i \geq \mu'_1}{\inf} w^*_1 KL(\mu_1,\mu'_1)+w^*_i KL(\mu_i,\mu'_i)\\
&=\underset{\mu'_j \geq \mu'_1 }{\inf} w^*_1 KL(\mu_1,\mu'_1)+w^*_j KL(\mu_j,\mu'_j).
\end{aligned}
\end{equation*}

But we know that this minimization, for each $i \neq 1$, is attained uniquely  by a bandit instance $p'$ where the rest of the arms, except 1 and $i$, are the same as the original bandit instance in consideration, namely, $p$. Both the arms $i$ and $1$ have means $x^{*}_{i,e}(w^*)$ under $p'$. But the assumed hypothesis then implies that $x^{*}_{i,e}(w^*)=\mu'_1<\mu'_2=\mu_2$. That means $p'$ is also in the set $\{\mu'_2 \geq \mu'_1\}$ and hence
\begin{equation*}
\begin{aligned}
&\underset{\mu'_i \geq \mu'_1}{\inf} w^*_1 KL(\mu_1,\mu'_1)+w^*_i KL(\mu_i,\mu'_i)\\
&> \underset{\mu'_2 \geq \mu'_1 }{\inf} w^*_1 KL(\mu_1,\mu'_1)+w^*_2 KL(\mu_2,\mu'_2).
\end{aligned}
\end{equation*}
However, this contradicts the necessary optimality conditions for $w^*$. Thus, $x^{*}_{i,e}(w^*) \geq  \mu_2$.$ \hfill \qed$

A similar result can also be shown for the approximate  problem (\ref{approx_lower_bound_problem}) (see Section D of supplementary material).

\textbf{On Theorem $\hyperref[asymptotic_w]{2}$.} In the rare event setting, the non-zero samples from an arm are the informative samples, but they are quite rare. Any algorithm needs to see non-zero (informative) samples from at least some arms before it decides to stop. By Lemma $\hyperref[tweaked_means_meet]{1}$ we know that all arms, except possibly the best and second best ($i =1,2$), will show deviations in their sample mean under max-min optimality. As the TS algorithm and our algorithm track these weights, it is to be expected that the number of samples for arm $i$ ($i \neq 1,2$) is only as high as it takes to see an $\mathcal{O}(1)$ sample mean, but also sufficiently low to ensure that the probability of sample mean deviation is high. The optimal weights $w^*_i \simeq \gamma^{\alpha_{max}-\alpha_i}$, $\forall i \neq 1,2$, have this feature. This gives the sample complexity for arm $i$ ($i \neq 1,2$) as $\mathcal{O}(\gamma^{-\alpha_i})$ (since the overall sample complexity is $\mathcal{O}(\gamma^{-\alpha_{\max}})$). On average, each such arm thus sees only $\mathcal{O}(1)$ non-zero samples, with a deviation probability $1-\mathcal{O}(\gamma^{\alpha_i}(\mu_1-\mu_i)^2)$ and $\mathcal{O}(1)$ sample mean.

\section{Track and Stop Algorithm}\label{approx_algo_section}

Our algorithm builds upon the Track and Stop (TS) algorithm proposed in \cite{shubhada2019,kaufmann2016a}. We call it Track and Stop (A), abbreviated TS(A), to emphasize
that we are solving an approximate problem. The algorithm solves the approximate maxmin problem (\ref{approx_lower_bound_problem}) and samples according to the weights obtained. The calculation of the sampling weights happens in batches of size $m$. Let $l$ denote the batch index. Within each batch we ensure that each arm gets at least $\sqrt{lm}$ samples. This is done in the same manner as in \cite{shubhada2019}. At the end of the $l$-th batch, TS(A) evaluates the maximum likelihood ratio $Z_{k^*}(l)$ for the empirical best arm $k^*(l)$ and decides whether to stop or not. The likelihood ratio is given by:
\begin{equation*}
\begin{aligned}
    Z_{k^*}(l) \coloneqq &\underset{b \neq k^*}{\min}\underset{x \leq y}{\inf} N_{k^*}(lm)\mathcal{K}_{inf}^{L}(\hat{p}_{k^*}(lm),x) \\
    &+ N_{b}(lm)\mathcal{K}_{inf}^{U}(\hat{p}_b(lm),y),
\end{aligned}
\end{equation*}
the same way as in \cite{kaufmann2016} and \cite{shubhada2019}. Here $\hat{p}(t)$ refers to the empirical bandit instance after $t$ samples, and $N_i(t)$ denotes the number of pulls of arm $i$ after $t$ samples. TS(A) stops when $Z_{k^*}(l)>\beta(lm,\delta)$, where $\beta(t,\delta)$ is a stopping threshold defined as
\begin{equation*}
    \beta(t,\delta) \coloneqq \log\bigg(\frac{K-1}{\delta}\bigg)+5\log(t+1)+2.
\end{equation*}
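
For a sense of scale, the threshold can be evaluated at illustrative values. Assuming that $\log$ denotes the natural logarithm (the base is not specified above), and taking $K=3$, $\delta=0.01$ and $t=10^6$ purely as an example, we get
\begin{equation*}
    \beta(10^6,0.01) = \log(200)+5\log(10^6+1)+2 \approx 5.30 + 69.08 + 2 = 76.38.
\end{equation*}
For sample counts of this order, the $5\log(t+1)$ term therefore dominates the $\log\big(\frac{K-1}{\delta}\big)$ term.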

Note that we are computing the maximum likelihood ratio by solving the $\mathcal{K}_{inf}$ problems exactly, and not approximately. Although it is relatively expensive to compute these quantities exactly, such computations occur only once for each $l$. The number of samples $N_i(t)$ for each arm $i$ is influenced by the optimal weights that are obtained as the solution to the approximate maxmin problem. The precise algorithmic details of TS(A) are given below.

%\newpage
\begin{algorithm}[h]
   \caption{TS(A) algorithm}\label{main_algo}

   \hspace*{\algorithmicindent} \textbf{Input:}  Confidence level $\delta$, Upper bounds $[B_i \gamma^{-\alpha_i}]_{i \in [K]}$.   \\
   \hspace*{\algorithmicindent} \textbf{Output:} Arm recommendation $k^*$. \\ 
   \begin{algorithmic}[1]
   \STATE Generate $\lfloor \frac{m}{K} \rfloor$ samples for each arm.\\
   \STATE $l\leftarrow1$.
   \STATE Compute the empirical bandit $\hat{p}=(\hat{p}_i)_{i \in [K]}$.
   \STATE $\hat{w}(\hat{p})\leftarrow \text{Compute weights according to (\ref{approx_lower_bound_problem})}$.
   \STATE $k^{*}\leftarrow \underset{i \in [K]}{\arg \max }\hspace{0.2cm} \mathbb{E}[\hat{p}_i]$. 
   \STATE Compute $Z_{k^*}(l)$, $\beta(lm,\delta)$. %and $A(\hat{p},\gamma)$.
   \WHILE{$Z_{k^*}(l) \leq \beta(lm,\delta)$}
        \STATE $s_i\leftarrow (\sqrt{(l+1)m}-N_i(lm))^+$.
       \IF {$m \geq \sum_{i}s_i$}
           \STATE Generate $s_i$ many samples for each arm $i$.
           \STATE Generate $(m-\sum_i s_i)^+$ i.i.d. samples from $\hat{w}(\hat{p})$. Let $Count(i)$ be the number of occurrences of $i$ in these samples.
           \STATE Generate $Count(i)$ samples from each arm $i$.
           \ELSE
           \STATE $\hat{s}^*\leftarrow \underset{\hat{s}:\, s_i \geq \hat{s}_i \geq 0,\ \sum_i \hat{s}_i = m}{\arg \min }\max_i(s_i-\hat{s}_i)$.
           \STATE Generate $\hat{s}^*_i$ samples from each arm $i$.
       \ENDIF
           \STATE $l\leftarrow l+1$
           \STATE Update empirical bandit $\hat{p}.$
           \STATE $k^{*}\leftarrow \underset{i \in [K]}{\arg \max }\hspace{0.2cm} \mathbb{E}[\hat{p}_i]$.
           \STATE Update $Z_{k^*}(l)$, $\beta(lm,\delta)$. 
           \STATE $\hat{w}(\hat{p})\leftarrow \text{Compute weights according to (\ref{approx_lower_bound_problem})}$.
   \ENDWHILE
   \RETURN $k^*$.
\end{algorithmic}
\end{algorithm}
%\FloatBarrier
%\begin{remark}
{ Observe that in Algorithm $\hyperref[main_algo]{1}$, when we solve (\ref{approx_lower_bound_problem}), we need an estimate of $\gamma^{\alpha_i}$. Typically, this can be estimated from the ratio of the known upper bounds on the support of each arm. Alternatively, it may be estimated from past sales data.}
%\end{remark}
\subsection{$\delta$-correctness and sample complexity of TS(A)}
The following theorem guarantees the $\delta$-correctness of TS(A) and gives an asymptotic sample complexity bound:
\begin{theorem}{3.}
 TS(A) is a $\delta$-correct algorithm with the following asymptotic sample complexity bound:
 \begin{equation}
     \limsup_{\delta \to 0} \frac{\mathbb{E}_p[\tau_{\delta}]}{\log(1/\delta)}\leq \frac{1}{V_{TS(A)}(p)},
 \end{equation}
 where $V_{TS(A)}(p):=\underset{i \neq 1}{\min} \mathcal{P}_{i}(\hat{w}^*(p))$ and $\hat{w}^*(p)$ denotes the optimal weights for the approximate lower bound problem $V^*_a(p).$
\end{theorem}
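See Sections E and F in the supplementary material for a proof of Theorem 3. Note that, since $\hat{w}^*(p) \in \Sigma_K$ is feasible for the exact problem, we have $V_{TS(A)}(p) = \min_{i \neq 1}\mathcal{P}_i(\hat{w}^*(p)) \leq \max_{w \in \Sigma_K}\min_{i \neq 1}\mathcal{P}_i(w) = V^*(p)$, and hence we do suffer some loss in sample complexity vis-a-vis the TS algorithm. However, when $\gamma$ is small, the difference is negligible as $w^*(p) \approx \hat{w}^*(p)$.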
\subsection{Computational Benefit of Poisson Approximation}
The computational benefit of TS(A) vis-a-vis the exact algorithm, call it
TS(E), is in how the approximate and exact lower bound problems are solved.


Let us first examine the number of operations required in finding the exact lower bound. In our implementation, we used Brent's method for one-dimensional optimization and the bisection method for root finding. To get a relative error of $\epsilon$ in Brent's method (see Chapter 4 in \cite{brent2013algorithms}) we require $\mathcal{O}\big(\log^2\big(\frac{1}{\epsilon}\big)\big)$ operations. The bisection method takes $\mathcal{O}\big(\log\big(\frac{1}{\epsilon}\big)\big)$ for a relative accuracy of $\epsilon$. Lemma 2 (see Section A of the supplementary material) reduces the process of computing $\mathcal{K}_{inf}^{L}$ and $\mathcal{K}_{inf}^{U}$ to a root-finding procedure, causing said computations to take about $\mathcal{O}\big(\log\big(\frac{1}{\epsilon}\big)\big)$ operations. The inner optimization $\mathcal{P}_i$ is a convex optimization that requires $\mathcal{O}\big(\log^2\big(\frac{1}{\epsilon}\big)\big)$ operations. The outer optimization in (\ref{sample_lower_bound}) can be reduced to solving two sets of simultaneous root finding procedures and hence would take $\mathcal{O}\big(\log^2\big(\frac{1}{\epsilon}\big)\big)$. Thus, the total number of operations to solve the exact lower bound (\ref{sample_lower_bound}) is $\mathcal{O}\big(\log^5\big(\frac{1}{\epsilon}\big)\big)$.


In the approximate problem, the $C_i$'s and $C_{1i}$'s are the unknown variables, whose behaviour we analyze. Using $g_i$ (Section $\hyperref[approx_lb_section]{3.1}$) to write $C_{i}$ as a function of $C_{1i}$ requires about $\mathcal{O}\big(\log\big(\frac{1}{\epsilon}\big)\big)$ operations for each such conversion using the bisection method. Then, each of the $C_{1i}$ $(i \neq 2)$ is written as a function of $C_{12}$ through $\xi_i$. This again requires about $\mathcal{O}\big(\log\big(\frac{1}{\epsilon}\big)\big)$ operations for each such conversion. Finally, solving for $C_{12}$ through $h_i$ requires another factor of $\mathcal{O}\big(\log\big(\frac{1}{\epsilon}\big)\big)$. This gives a total of $\mathcal{O}\big(\log^3\big(\frac{1}{\epsilon}\big)\big)$ operations. Thus, we save a factor of about $\mathcal{O}\big(\log^2\big(\frac{1}{\epsilon}\big)\big)$ by solving the approximate problem vis-a-vis the exact one.
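
For a rough sense of this gap, consider an illustrative relative accuracy of $\epsilon=10^{-6}$, ignoring the constants hidden in the $\mathcal{O}(\cdot)$ notation and assuming natural logarithms. Then $\log\big(\frac{1}{\epsilon}\big)\approx 13.8$, so that
\begin{equation*}
\log^5\big(\tfrac{1}{\epsilon}\big) \approx 5.0\times 10^{5}, \qquad
\log^3\big(\tfrac{1}{\epsilon}\big) \approx 2.6\times 10^{3}, \qquad
\frac{\log^5(1/\epsilon)}{\log^3(1/\epsilon)} = \log^2\big(\tfrac{1}{\epsilon}\big) \approx 191.
\end{equation*}
At this accuracy the two operation counts thus differ by roughly two orders of magnitude, although the runtime gain seen in practice depends on the hidden constants.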

 \section{Numerical Experiments}\label{num_experimental_section}
We compare the sample complexity and computational time of TS(A) and the TS(E) algorithm proposed in \cite{agrawal2020optimal}. We make the comparison across different numbers of arms, values of $\gamma$, and $\alpha$ structures at a confidence level $\delta = 0.01$. We choose the parameter $\gamma=10^{-2},10^{-3}$ to reflect the typical rarities seen in the online ads scenario. We choose different configurations of the relative rarities $\alpha$ to reflect some of the different regimes seen in Theorem $\hyperref[asymptotic_w]{2}$. The tested configurations are given below:

\begin{table}[h]
\centering
\begin{tabular}{ |c|c| }
\hline
$(\gamma,\alpha)$ configuration & Config. name \\
\hline
$(\gamma = 10^{-3},\alpha=(1,1,1))$ & Expt 1. \\ 
$(\gamma=10^{-2},\alpha=(1,1.5,2))$& Expt 2. \\ 
$(\gamma=10^{-3},\alpha=(1,1,1,1,1))$& Expt 3.  \\ 
$(\gamma=10^{-2},\alpha=(2,1.5,2,2.5,1))$& Expt 4. \\
\hline
\end{tabular}
\end{table}
We run each algorithm for $100$ sample paths, and the average sample complexity and average computational time are reported in Table 1 below. Both TS(E) and TS(A) proceed in batches
of size $\gamma^{-\alpha_{\max}}$.
\begin{table}[h]
\centering
\scalebox{0.9}{\begin{tabular}{|p{2.3cm}|c|c|c|c|}
    \hline
    \multirow{2}{2.1cm}{\textbf{Experiment: ($\gamma$,$\alpha$)}} & \multicolumn{2}{c|}{\textbf{Samples (m)}} & \multicolumn{2}{c|}{\textbf{Runtime (s)}}\\
    \cline{2-5}
    & \textbf{TS(E)} & \textbf{TS(A)} & \textbf{TS(E)} & \textbf{TS(A)}\\
    %\hhline{~--}
    \hline
    Expt 1. & 0.28 & 0.37 & 269.49  & 27.36  \\ 
    \hline
    Expt 2. & 0.45 & 0.47 & 45.47  & 2.74  \\ \hline
    Expt 3. & 0.81 & 0.92 & 1016.29  &  144.08 \\ \hline
    Expt 4. & 7.87 & 8.88 & 109.61  & 15.17 \\ \hline
  \end{tabular}}
\caption{\emph{Comparison between the TS(E) and TS(A) algorithms. Sample complexity is reported in million (m) samples. The computational runtime is reported in seconds (s).}}
\end{table}
Table 1 shows that in all experiments TS(A) takes more samples than TS(E) to stop and recommend an arm (about 4 to 32$\%$ more for the reported values). In return, TS(A) runs about 7 to 17 times faster than TS(E). These simple experiments underscore the trade-off between sample complexity and computational time.

\begin{table}[h]\label{ucb_expts}
\centering
\scalebox{0.9}{\begin{tabular}{|p{2.3cm}|c|c|c|c|}
    \hline
    \multirow{2}{2.1cm}{\textbf{Experiment: ($\gamma$,$\alpha$)}} & \multicolumn{2}{c|}{\textbf{Samples (m)}} & \multicolumn{2}{c|}{\textbf{Runtime (s)}}\\
    \cline{2-5}
    & \textbf{\footnotesize{lilUCB}} & \textbf{\footnotesize{LUCB}} & \textbf{\footnotesize{lilUCB}} & \textbf{\footnotesize{LUCB}}\\
    %\hhline{~--}
    \hline
    Expt 1. & 38.8 & 171.7* & 8870  & 28200*  \\ 
    \hline
    Expt 2. &162.2 & 137.7* & 28230 & 34250*  \\ \hline
    Expt 3. & 85.8 & 141* & 17340 & 28590* \\ \hline
    Expt 4. & 204.8* & 134.2* & 37300* & 34850* \\ \hline
  \end{tabular}}
\caption{\emph{Further comparison of TS(A) with lil-UCB and LUCB1. The superscript $*$ denotes those runs which took a long time ($>10$ hours) to stop. We report the stopped values for these runs.}}
\end{table}

We conduct further comparisons with LilUCB (see \cite{jamieson2014lil}) and LUCB (see \cite{kalyanakrishnan2012pac}). Both these algorithms are well known in the BAI literature. These additional results are reported in Table 2. We see that both TS algorithms require far fewer samples and far less runtime. The issue with UCB-index based algorithms like LilUCB and LUCB is that they have a dependence on $\sigma^2$ (where $\sigma$
 is the sub-Gaussianity parameter, see Appendix H) in the sample complexity upper bound. This translates to a dependence of $\gamma^{-2\alpha_{\max}}$ (since the upper bounds on rewards scale with $\gamma^{-\alpha_{\max}}$), while TS(E) and TS(A) have an order dependence of only $\gamma^{-\alpha_{\max}}$, which is a significantly better sample complexity dependence.

\begin{table}[h]
\centering
\scalebox{0.9}{\begin{tabular}{|p{2.3cm}|c|c|c|c|}
    \hline
    \multirow{2}{2.1cm}{\textbf{Experiment: ($\gamma$,$\alpha$)}} & \multicolumn{2}{c|}{\textbf{Samples (m)}} & \multicolumn{2}{c|}{\textbf{Runtime (s)}}\\
    \cline{2-5}
    & \textbf{\footnotesize{m=200}} & \textbf{\footnotesize{m=500}} & \textbf{\footnotesize{m=200}} & \textbf{\footnotesize{m=500}}\\
    %\hhline{~--}
    \hline
    Expt 1. & 0.39 & 0.38 & 4.15  & 6.51  \\ 
    \hline
    Expt 2. & 0.18 & 0.14 & 14.59 & 16.21  \\ \hline
    Expt 3. & 0.28 & 0.25 & 62.93 & 84.13 \\ \hline
    Expt 4. & 4.29 & 4.38 & 11.98 & 21.30 \\ \hline
  \end{tabular}}
\caption{\emph{We increase the support size $m$ of each bandit arm while holding the means fixed. The sample complexity increases with increasing support points.}}
\end{table}
As noted in Section 2.1, the theory can be extended to continuum support. The experimental results reported in Table 1 were obtained with a support size of $m=25$ per arm. Now, we increase the support size to $m=200$ and $m=500$, while the means of the arms are held fixed. The results are presented in Table 3. We observe the sample complexity increases with increasing support points but the marginal increase is diminishing. This hints that the sample complexity is tending towards the one suggested by theory for the continuum support.

\begin{table}[h]
\centering
\scalebox{0.8}{\begin{tabular}{|p{2cm}|c|c|c|c|}
    \hline
    \multirow{2}{2.1cm}{\textbf{Experiment: ($\gamma$,$\alpha$)}} & \multicolumn{2}{c|}{\textbf{Samples (m)}} & \multicolumn{2}{c|}{\textbf{Runtime (s)}}\\
    \cline{2-5}
    & \textbf{\footnotesize{$\gamma' = 0.95\gamma$}} & \textbf{\footnotesize{$\gamma' = 0.1\gamma$}} & \textbf{\footnotesize{$\gamma' = 0.95\gamma$}} & \textbf{\footnotesize{$\gamma' = 0.1\gamma$}}\\
    %\hhline{~--}
    \hline
    Expt 1. & 0.52 & 33.13 & 2.88 & 8.58 \\ \hline
    Expt 2. & 0.72 & 31.9 & 4.12 & 8.5 \\ \hline
    Expt 3. & 1.09 & 9.28 & 186.84 & 233.7 \\ \hline
    Expt 4. & 8.23 & 1124.6 & 15.21 & 47.34 \\ \hline
  \end{tabular}}
\caption{Mis-specified $\gamma$. Sample complexity is stable with respect to small mis-specification.}
\end{table}
\FloatBarrier

The TS(A) algorithm requires (see Section 4.1) an estimate of the rarity $\gamma^{\alpha}$. As the rarity can be known only approximately, we study the scenario where the parameter $\gamma$ is mis-specified, and hence the rarities are too. The results are presented in Table 4. We observe that the sample complexity is stable with respect to small mis-specification ($\gamma'=0.95\gamma$), while a large estimation error ($\gamma'=0.1\gamma$) leads to a marked increase in sample complexity.
%We also account for the possibility that $\gamma$ or $\alpha_i$s, being unknown to the algorithm, may be incorrectly estimated. For small errors ($<5\%$) in estimation of $\alpha_i$ and/or $\gamma$, sample complexity increases by about $<5\%$. For large errors in estimation of these parameters (to the extent that we get the rarities entirely wrong - say, $\gamma$ is estimated as $\mathcal{O}(10^3)$ instead of $\mathcal{O}(10^2)$), sample complexity can increase by about 60 times. However, we don't typically expect misspecifications where the rarity orders are off by orders of magnitude.


%In the initially reported data, we allowed $n_i$ to range between 15-25. We then increased $n_i$ to 200 and then 1000 support points, while keeping the means fixed. On average, TS(A) took about 30-35 seconds for $n_i \approx 200$ and about 60-65 seconds for $n_i \approx 1000$, as compared to the earlier 6.59 seconds. Sample complexities increased to 2-2.5 million when $n_i \approx 200$ and 3.5-4 million when $n_i \approx 1000$, as compared to the earlier 1.23 million for $n_i = 20$.


\section{Conclusion}
The paper proposes a rarity framework to study the fixed confidence BAI problem relevant to online ad placement. In this framework the positive reward probabilities are small while the corresponding rewards are quite large. Consequently, the mean rewards are $\mathcal{O}(1)$.\\
We introduce a Poisson approximation to the standard lower bound problem and use it to motivate an algorithm that is computationally faster than the optimal TS algorithm at the cost of a small increase in sample complexity. We also use this approximation to derive asymptotically optimal weights, which give insight into the lower bound behaviour in the rare event setting. We observe this trade-off between sample complexity and computational time in our numerical experiments.
% References
\bibliography{Bhattacharjee_780}
\end{document}


As is well known, the lower bound problem in this hypothesis-testing
setting corresponds to gaining enough evidence for each arm so that we can rule
out alternative hypotheses. Specifically, the modeller sees the data coming, more or less, from
an underlying distribution and is concerned that the true distribution comes from an alternative set, and that what is seen is a large deviation taken by the data. The modeller selects samples so that the probability of an alternative hypothesis being correct is less than $\delta$.

Now, all else being equal, an arm with a rarer event has a larger tendency towards large deviations.


















