\section{Introduction}\label{sec:intro}
%littlestone1994weighted,cesa1997use, vovk1995game
In the standard decision-theoretic online learning setting studied by \cite{freund1997decision},
%is a special case of the general framework of prediction with expert advice \cite{}.
there are $N$ experts (or actions) at the disposal of a learner. In round $t$, the learner chooses a probability mass function $\pmb{p}_t$ over the set of experts $\{1,2,\ldots,N\}$, an adversary reveals the loss vector $\pmb{l}_t = (l_t(1),\ldots,l_t(N)) \in [0,1]^{N}$, and the learner incurs an (expected) loss of $\langle \pmb{p}_t, \pmb{l}_t \rangle$. The total loss incurred by the learner after $T$ rounds is $L_T = \sum_{t = 1}^T \langle \pmb{p}_t, \pmb{l}_t \rangle$, and the total loss of choosing expert $i$ in all the rounds is $L_T(i) = \sum_{t = 1}^T l_t(i)$. The learner aims to minimize its cumulative regret up to round $T$, defined as $L_T - \min_i L_T(i)$.
%\color{red}\sout{The learner aims to minimize its regret with respect to the expert with the minimum cumulative loss up to round $T$, i.e., minimize $L_T - \min_i L_T(i)$.}\color{black}

The celebrated Hedge algorithm by \cite{freund1997decision} uses a parameter called the learning rate $\eta \geq 0$, assigns weight $w_t(i) = e^{-\eta L_{t-1}(i)}$ to each expert $i$ based on its observed cumulative loss, and chooses expert $i$ with probability $p_t(i) = w_t(i)/W_t$, where $W_t = \sum_{i = 1}^N w_t(i)$. For a suitable choice of $\eta$, Hedge has $O(\sqrt{T \log N})$ regret. Subsequent works explored improved algorithmic techniques seeking regret bounds where the dependency on $T$ is replaced by metrics that capture the variability of the sequence of loss vectors $\pmb{l}_t$ \cite{cesa2007improved,hazan2010extracting,chiang2012online}. In contrast to these works, \cite{Gofer13} studied the dependency of the regret bound on the number of experts $N$. They introduced the \textit{branching experts setting}, where new experts may be revealed in each round, and the cumulative loss of any new expert is either equal or close to the cumulative loss of one of the existing experts. They proposed an algorithm with $O(\sqrt{T N_T})$ regret, where $N_T$ is the number of experts revealed in the first $T$ rounds.
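
To make the ``suitable choice of $\eta$'' for Hedge concrete, the standard analysis for losses in $[0,1]$ can be summarized as follows. The constants below come from the textbook exponential-weights analysis and are included only as a sketch, not as a result of this paper. For any fixed $\eta > 0$ and a known horizon $T$,
\begin{align*}
L_T - \min_i L_T(i) &\leq \frac{\ln N}{\eta} + \frac{\eta T}{8},
\end{align*}
and the right-hand side is minimized at $\eta = \sqrt{8 \ln N / T}$, which gives
\begin{align*}
L_T - \min_i L_T(i) &\leq \sqrt{\frac{T \ln N}{2}} = O(\sqrt{T \log N}).
\end{align*}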
%and the focus is on the performance of various algorithms as a function of the number of experts revealed up to time $T$. 

%In this paper, we study decision-theoretic online learning under the stochastically growing experts setting where the experts are points/vectors in a d-dimensional Euclidean/vector space $\mathcal{V}$. In each round $t$, the environment draws an expert i.i.d. from a bounded convex set $\mathbb{B} \in \mathcal{V}$ using a fixed (unknown) distribution and reveals it to the learner. We consider the following structure for the losses assigned by the environment to the experts in $\mathbb{B}$. In round $t$, the revealed experts partition $\mathbb{B}$ into $t^d$ unique convex subsets obtained from $d$ orthogonal hyperplanes passing through each revealed expert.  The environment assigns a unique loss for all the experts within a subset. For the experts sampled from a $2$-dimensional plane, the convex subsets are illustrated in Fig~\ref{}. 

\jpcol{Motivated by learning problems that arise in out-of-distribution (OOD) detection \cite{yang2024} and distributed Deep Learning (DL) inference \cite{Ghina2024}, in this paper, we study a novel stochastically partitioning experts setting.} This setting is a stochastic variant of the branching experts setting, where the experts revealed in each round are new sub-partitions of a hypercube $\mathbb{B}$ in $d$-dimensional Euclidean space, where $d< \infty$.
%\footnote{The algorithms and the analysis in this work apply to any \textit{convex region} in the $d$-dimensional Euclidean space, but for the ease of exposition, we limit $\mathbb{B}$ to hypercube.} 
In each round $t$, the environment draws a point $X_t$ i.i.d. from $\mathbb{B}$ using a fixed (unknown) distribution. For each chosen point, we draw $d$ orthogonal hyperplanes parallel to the $d$ faces of $\mathbb{B}$ passing through the point. The set of experts revealed up to round $t$ is the set of partitions of $\mathbb{B}$ created by the intersection of the $d$ orthogonal hyperplanes passing through each of the $t$ points drawn up to that round, resulting in $(t+1)^d$ experts.\footnote{Since the points are drawn i.i.d. from Euclidean space, the probability of a chosen point lying on one of the $d$ hyperplanes parallel to the faces of $\mathbb{B}$ passing through another point drawn in some other round is zero. Thus, in round $t$, there will be $(t+1)^d$ experts with probability one.} The partition of experts for one dimension and two dimensions is illustrated in Fig.~\ref{fig:partitioning}.
%\sout{The new experts are the convex partitions obtained from $d$ orthogonal hyperplanes parallel to the $d$ faces of $\mathbb{B}$ passing through this new point.} 
\begin{figure}[h]
\centering
    \begin{subfigure}[b]{0.45\textwidth}
        \centering
        \includegraphics[scale=0.25]{partitioningExperts1.pdf}
        \caption{An illustration of partitioning experts in one dimension over three rounds.}
        \label{fig:1D}
    \end{subfigure}
    \hfill
    \begin{subfigure}[b]{0.45\textwidth}
        \centering
        \includegraphics[scale=0.25]{partitioningExperts2.pdf}
        \caption{An illustration of partitioning experts in two dimensions over three rounds.}
        \label{fig:2D}
    \end{subfigure}
    \caption{We show the partitioning experts setting for the first three rounds for one dimension ($d=1$) on a bounded interval in (a) and for two dimensions ($d=2$) on a square region in (b). The new point and the new expert indices in each round are highlighted using bold fonts.}
    \label{fig:partitioning}
\end{figure}
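
As a quick check of the count $(t+1)^d$, consider the following sketch, where $N_t$ is shorthand introduced here for the number of experts after round $t$, and the $t$ drawn points are assumed to have distinct coordinates along every axis (which the footnote above argues holds with probability one). Along each of the $d$ axes, the $t$ hyperplanes cut the corresponding side of $\mathbb{B}$ into $t+1$ intervals, and each expert is a product of one interval per axis, so
\begin{align*}
N_t &= (t+1)^d, & N_t - N_{t-1} &= (t+1)^d - t^d.
\end{align*}
For $d = 1$, exactly one new expert appears per round, and $N_3 = 4$. For $d = 2$, round $t$ adds $2t+1$ new experts, so $N_1 = 4$, $N_2 = 9$, and $N_3 = 16$.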

In each round, the environment only reveals the losses of the existing experts, and we allow the losses to be adversarial. We consider the \textit{perfect clone setting} introduced in \cite{Gofer13}, where a new expert is a perfect clone of its parent expert, i.e., the cumulative loss of a new partition is equal to the cumulative loss of its parent partition. Once the new expert is revealed, its cumulative loss evolves independently from its parent expert in the subsequent rounds. We note that, in contrast to the branching experts setting where $N_T$ is bounded and is independent of $T$, in the partitioning experts setting, the number of experts in round $T$ is $(T+1)^d$.
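
The perfect clone property can be written out as follows. This is a sketch in notation introduced here for illustration: $\tau$ denotes the round in which a new expert $j$ first becomes available to the learner, and $\pi(j)$ denotes its parent partition; the exact timing convention is an assumption. The new expert inherits the history of its parent and then accumulates its own losses:
\begin{align*}
L_{\tau-1}(j) &= L_{\tau-1}(\pi(j)), & L_t(j) &= L_{\tau-1}(\pi(j)) + \sum_{s=\tau}^{t} l_s(j) \quad \text{for } t \geq \tau.
\end{align*}
Under a Hedge-style weighting $w_\tau(j) = e^{-\eta L_{\tau-1}(j)}$, a new expert would therefore start with the same weight as its parent.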

\subsection{Motivating Applications}
\label{sec:motivation}
%\jpcol{Below, we present two learning problems that motivate the partitioning experts setting.}
%\subsubsection*{OOD Detection:}
\textbf{OOD Detection:} Detecting OOD samples has been widely studied because DL models can fail with high confidence on these samples, which can have serious consequences in high-risk applications.
%Thus, not detecting OOD samples results in serious consequences in high-risk applications. 
Many methods developed for OOD detection use a threshold $\theta$ on a \textit{score $x_t$}, calculated from soft-max values or features of the data sample, to differentiate OOD from in-distribution (ID) samples. The detected OOD samples can be deferred to human experts at some cost \cite{vishwakarma2024}. Thus, selecting the best threshold, denoted by $\theta^*$, that minimizes the false negatives, i.e., undetected OOD samples, and the false positives, i.e., ID samples detected as OOD, is critical for the safe and reliable deployment of DL models with minimal costs.

In Fig.~\ref{fig:ood}, we show example pdfs of the score for ID and OOD samples. For a chosen threshold $\theta$, the decision is to classify the input sample as an ID sample if $x_t \geq \theta$; otherwise, the sample is classified as OOD. Since the pdfs are unknown a priori, a learner needs to learn $\theta^*$ using the following loss function:
\begin{align}\label{eq:thresholdrule}
\text{loss}(\theta) &=
    \begin{cases}
        \text{cost for false negative} &\text{ if } \text{score }x_t \geq \theta,\\
        \text{human expert cost} &\text{ if }\text{score } x_t<\theta.
    \end{cases}
\end{align}
Note that \jpcol{the challenge is that} the scores can take values from a continuous set $\mathbb{B}$. \jpcol{However,} with some effort, one can show that it is sufficient to consider only the distinct score values revealed in $T$ rounds as the thresholds/experts. Also, whenever a new expert, i.e., a new distinct score, is revealed, the cumulative loss of this expert is equal to the cumulative loss of the highest revealed score that is less than this new score. Thus, this problem falls under the setting we study in this paper.
\begin{figure}[h]
\centering
\includegraphics[width=0.7\linewidth]{ood.pdf}    
\caption{Differentiating ID and OOD samples using a threshold on the score.}
\label{fig:ood}
\end{figure}
%Since OOD samples are not seen before deployment,  false positives False Positive Rate (FPR) corresponds to unable to detect OOD samples and forwarding these samples to a human expert    

%\subsubsection{Hierarchical Inference:} 
\jpcol{\textbf{Hierarchical Inference:} The partitioning experts setting also arises in the Hierarchical Inference system proposed for distributed DL inference for classification applications in edge AI systems \cite{moothedath2023,Beytur2024,Ghina2024}.} 
%The above setting with $\mathbb{B} = [0,1]$ arises in an online learning framework recently studied by \cite{moothedath2023,Beytur2024} for ML classification applications that use Deep Learning (DL) inference with a reject/offload option. 
In this system, in each round $t$, the environment presents a data sample (e.g., image) to an end device (e.g., mobile device, IoT device, etc.). The data sample is fed to a pre-trained local DL model that outputs soft-max values corresponding to different classes. The learner computes a confidence metric $x_t \in [0,1]$ using these soft-max values.\footnote{A typical choice for the confidence metric is the maximum soft-max value, as the data sample is typically classified into the class with the maximum soft-max value.} The learner accepts the classification in round $t$ if the confidence metric $x_t$ is above a threshold, which the learner aims to learn. If the learner accepts the classification, it incurs a zero loss when the classification is correct and a loss of one otherwise. If the learner rejects or offloads the classification task, it incurs an offloading cost. Similar to OOD detection, learning an optimal threshold for the confidence metric falls under the problem setting we study in this paper.
%In this problem, the experts are the partitions of $\mathbb{B} = [0,1]$ created by the $x_t$ values corresponding to the data samples that arrive over time \cite{moothedath2023}. If a learner chooses a partition in round $t$, then the classification of the DL model is accepted if $x_t$ is greater than the supremum of the chosen interval, else the classification is rejected, and the data sample is offloaded. For this problem, the partitions are illustrated in Fig.\ref{fig:1D}. 

%The above application scenarios have a single threshold corresponding to $d = 1$ (Fig. \ref{fig:1D}). In applications where false positives and false negatives have asymmetrical importance may require setting different thresholds for different soft-max values. Then the problem involves learning optimal thresholds corresponding to each soft-max value. This scenario corresponds to the $d$ dimension scenario. 

The above applications have a single threshold to learn and thus map to the partitioning experts problem with $d = 1$ (Fig.~\ref{fig:1D}). Note that an expert in our problem setting is an interval, not a threshold $\theta$ as in \eqref{eq:thresholdrule}. The equivalence between an interval and a threshold can be obtained as follows. Given $x_t$ and an interval, the threshold rule in \eqref{eq:thresholdrule} leads to the same outcome for all the thresholds in that interval. For example, in round 3 of Fig.~\ref{fig:1D}, if expert $2$ is chosen, then $x_3$ is smaller than all the points in expert $2$. Thus, the sample will be classified as an OOD sample (or will be offloaded in the case of the Hierarchical Inference system).
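
This equivalence can be stated compactly. As a sketch, let $x_{(1)} < x_{(2)} < \cdots < x_{(t)}$ denote the sorted distinct scores observed up to round $t$, with $x_{(0)}$ and $x_{(t+1)}$ taken to be the endpoints of $\mathbb{B}$ (order-statistic notation introduced here for illustration). If two thresholds $\theta$ and $\theta'$ lie in the same open interval $(x_{(k)}, x_{(k+1)})$, then no observed score falls between them, and hence
\begin{align*}
\mathbf{1}\{x_s \geq \theta\} &= \mathbf{1}\{x_s \geq \theta'\} \quad \text{for all } s \leq t,
\end{align*}
so both thresholds make the same decision in every round up to $t$ and, under the rule in \eqref{eq:thresholdrule}, incur the same cumulative loss.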

For applications where misclassification costs are non-uniform across classes, using different thresholds for the soft-max values corresponding to the different classes will likely improve performance. In this case, learning thresholds for a $d$-class classification task maps to a $d$-dimensional partitioning experts problem. 

\begin{comment}
\color{red}
Consider a stylized binary communication system where the source uses one of two voltage levels $(V_{\text{low}}$ and $V_{\text{high}}(>V_{\text{low}}))$ to communicate the two binary symbols $(S_0 \text{ and } S_1)$. The communication channel adds a zero mean noise with an unknown distribution to this signal. The receiver can either decode the received signal or request for retransmission. Every error in decoding at the receiver costs the system one unit, and the system incurs a cost of $c$ for each retransmission. The system incurs no cost if the symbol is decoded correctly. \jpcol{How does the receiver know if the decoded message is correct or wrong?} 

One natural decoding mechanism at the receiver is the following: for some $\epsilon \in (0, (V_{\text{high}} - V_{\text{low}})/2]$ if the received voltage is less than $ V_{\text{low}} + \epsilon$, the receiver decodes the symbol as $V_{\text{low}}$ if the received voltage is greater than $ V_{\text{high}} - \epsilon$, the receiver decodes the symbol as $V_{\text{high}}$, and in all other cases, the receiver concludes that the decoding has failed and requests for retransmission. In such a communication system,  finding the optimal $\epsilon$ can be modeled as a learning problem. 
\color{black}
%The above works consider that the arrivals of $x_t$ values are arbitrary. 
\end{comment}

%While \cite{moothedath2023} characterize a regret bound involving the inverse of the smallest partition after $T$ rounds, \cite{Beytur2024} uniformly discretize the interval $\mathbb{B}$. \color{black}Indeed, as we will see later, if the arrivals of the points from $\mathbb{B}$ are arbitrary, no sub-linear regret bound is possible for this problem. 

%We also illustrate the partition of the experts for a $2$-dimensional square region in Fig~\ref{fig:2D}. 

\subsection{Our Contributions}
We study the novel stochastically partitioning experts setting. We propose two algorithms, namely, \algo, a natural extension of the Hedge algorithm for the growing experts setting, and Ada\algo, an adaptive learning rate variant of \algo. We prove the following results on the regret of the proposed algorithms. 
%and characterize the regret performance of \algo, a natural extension of the Hedge algorithm for the growing experts setting, and Ada\algo, an adaptive learning rate variant of \algo. 
%where the experts are partitions of a $d$-dimensional hypercube $\mathbb{B}$ 
%and the number of experts grows over time, based on the points drawn from $\mathbb{B}$. 
%By round $T$, we have $(T+1)^d$ experts. 
\begin{itemize}
    \item[--] Even though the number of experts grows as $(t+1)^d$, we show that \algo~has $O(\sqrt{2^d T \log T})$ expected regret, which is order-optimal in $T$. Compare this with the Hedge algorithm, which has $O(\sqrt{d T \log T})$ regret in the special case where all the $(T+1)^d$ experts are known a priori.
    %and their losses are revealed in each round.  
    \item[--] We also show that \algo~achieves the sample-path regret $O(\sqrt{2^d T^{1+\epsilon} \log T})$ with probability at least $1- T^{-\epsilon}$, for any $\epsilon > 0$.
    \item[--] \algo~uses a fixed learning rate. We show that there is a trade-off between choosing a rate that gives the optimal expected regret guarantee and a rate that gives a useful sample-path regret guarantee. To address this limitation of \algo, we propose the Ada\algo~algorithm, a variant of the \algo~algorithm that \jpcol{uses a learning rate that adapts according to the cumulative loss of the new experts}. We show that Ada\algo~simultaneously achieves $O(\log(\log T)\sqrt{ T \log T})$ expected regret and $O(\log{T}\sqrt{T \log T})$ sample-path regret with probability at least $1-T^{-c}$, where $c > 0$ is a constant dependent on $d$.
\end{itemize}
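
The comparison with Hedge in the first item follows from a short calculation, sketched here using the standard $O(\sqrt{T \log N})$ bound for Hedge with $N = (T+1)^d$ experts known in advance:
\begin{align*}
\sqrt{T \log N} &= \sqrt{T \log (T+1)^d} = \sqrt{d\, T \log (T+1)} = O(\sqrt{d\, T \log T}).
\end{align*}
Similarly, since the losses lie in $[0,1]$, the regret after $T$ rounds never exceeds $T$, so the sample-path bound $O(\sqrt{2^d T^{1+\epsilon} \log T})$ is informative only for $\epsilon < 1$: a larger $\epsilon$ raises the confidence level $1 - T^{-\epsilon}$ but loosens the bound.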

\section{Related Work}
%freund1997decision, cesa1997use
%\sout{The classical \textit{prediction with expert advice} problem} \cite{littlestone1994weighted, vovk1995game} \sout{received a lot of attention over the last three decades.}
The decision-theoretic online learning problem is a variant of the classical prediction with expert advice \cite{littlestone1994weighted,vovk1995game} and has received much attention in the past three decades. We summarize the related works that studied the variants of this problem, where the set of experts is very large or growing over time.
%, were studied by \cite{chaudhuri2009parameter, chernov2010prediction, luo2015achieving}. \color{black} 

For the setting where the number of experts is large, \cite{chaudhuri2009parameter} proposed a parameter-free version of Hedge and showed that it outperforms the classical Hedge algorithm. \cite{chernov2010prediction} considered the setting with a large number of experts where multiple experts can be near clones of each other. Further, they assumed that the regret of the algorithm with respect to any newly arrived expert is zero at its arrival and accumulates thereafter. They provided regret guarantees as a function of the effective number of experts, i.e., the number of unique experts available to the learner. In contrast to the aforementioned works, \cite{luo2015achieving} proposed AdaNormalHedge, which is agnostic to the number of experts and, therefore, can be used in a setting where the number of experts is unknown or changing. At each time-step $t$, AdaNormalHedge creates $N$ sleeping experts, indexed by $(t,i)$ for $i \in \{1,\ldots,N\}$, that are asleep before time-step $t$, wake up at time-step $t$, and suffer the same loss as that of expert $i$ from then onwards. It follows that, in total, there will be $NT$ sleeping experts after $T$ rounds. We note that AdaNormalHedge's computational complexity will be $t$ times higher than that of Hedge-G in round $t$. Whether AdaNormalHedge can be adapted to the partitioning experts setting and how its regret bound compares to that of Hedge-G remains an open question. In the aforementioned works, however, the newly arriving experts are not correlated with the experts that arrived before them. 
\color{black}
%Closer to our work, another variant where new experts are revealed over time and the cumulative loss of each new expert is related to one of the existing (parent) experts is called the branching experts \cite{Gofer13}. This work considers the setting where the number of experts is large but finite. The algorithm proposed in \cite{Gofer13} is shown to be order-optimal. Our setting differs from the branching experts setting as we have an uncountably infinite set of experts. A direct application of the algorithm and results in \cite{Gofer13} to our setting leads to linear regret, thus necessitating new algorithms and/or analysis. 

%%%%%% Additional Details %%%%%%%
%\cite{chaudhuri2009parameter propose using potential functions for each expert and the derivative of the potential function as the weights. For experts whose cumulative loss is greater than the algorithm in a round, then a potential 1 is assigned, implying that the weight assigned to those experts in that round is zero. For all other experts, the potential is set exp(R^2/c), where R is the regret with respect to that expert and c is a constant dependent on the time slot. 

%\cite{erven2011adaptive} - Adahedge is designed by using a doubling trick for the 'mixability gap' budget.

\cite{Cohen2017} studied the setting where all the experts are known a priori, and their losses are revealed in each round, but the number of experts is potentially infinite. The focus here was on identifying a small set of experts such that every other expert is close to some expert in this small set in terms of cumulative loss. The authors proposed an algorithm with provable performance guarantees that depend on the $\epsilon$-covering number of the sequence of loss functions. They also proposed a method to compute the optimal $\epsilon$ in hindsight.  

\cite{mourtada2017efficient} studied the growing number of experts setting, where new experts are revealed over time. The key contribution in this work is two-fold. The authors considered multiple definitions of regret, namely shifting regret and sparse shifting regret, to account for the fact that the expert set is growing over time. They designed computationally inexpensive policies with order-optimal regret performance for all the regret definitions considered. The proposed algorithms are anytime and parameter-free. In \cite{gyorfi1999simple}, the set of experts grew at an exponentially decaying rate, and the goal was to make predictions about a stationary ergodic time series. In \cite{hazan2009efficient, shalizi2011adapting}, the focus was on predicting a non-stationary time series using a growing set of experts.  In contrast to the above works, experts arrive at a much faster rate in our setting. 

As mentioned, our partitioning experts setting is closely related to the branching experts setting first studied by \cite{Gofer13}. In that work, even though the number of experts increases with time, the total number of experts revealed after $T$ rounds, $N_T$, is assumed to be large but finite.
%The branching experts setting is also the focus in \cite{wu2021lifelong}. In addition to the setting in \cite{Gofer13} where an adversary generates the losses, 
\cite{wu2021lifelong} further studied the branching experts setting where the losses are stochastic processes with unknown distributions. They proposed an optimal policy for both adversarial and stochastic losses.  Our setting differs from the branching experts setting as we have an uncountably infinite set of experts from which $(T+1)^d$ experts are revealed in $T$ rounds. Another difference is that the number of new experts revealed in round $t$ is $(t+1)^d-t^d$.


%Many methods for solving online optimization/learning problems like prediction with experts are based on the Hedge algorithm \cite{freund1997decision, freund1999adaptive}. 
In the classical Hedge algorithm, the learning rate is a function of the time horizon $T$. Thus, it is unsuitable for settings where the time horizon is unknown. 
%Using a constant learning rate over time can have limitations like the constant learning rate obtained by optimizing for the best worst-case performance can lead to poor performance on average. 
%One way to address this limitation is to change the learning rate over time.
%based on the difficulty of the problem instance as estimated from the observations up to that point \cite{erven2011adaptive, de2014follow}.
\color{black} 
The algorithms proposed in \cite{erven2011adaptive, de2014follow} addressed this limitation by adapting the learning rate without the need to know the value of $T$.  In contrast, we assume $T$ is given but adapt the learning rate in Ada\algo~according to the observed losses so that it simultaneously achieves a near-optimal bound for expected regret and non-trivial sample-path regret guarantees.
%we propose an adaptive version of our policy in order to design a policy that simultaneously has near-optimal expected regret performance and also leads to non-trivial sample-path regret guarantees. \color{black} 
\color{black}


   
