%\section{Branching experts with experts drawn from i.i.d. distribution}
\section{Stochastically Partitioning Experts Setting}
\color{black}
In this work, experts are represented by partitions of a hypercube $\mathbb{B}$ in a $d$-dimensional Euclidean space. As discussed above, in each round $t$, the environment draws a point $X_t$, \color{black} i.i.d.\footnote{\color{black} A typical assumption during the DL training phase is that the data samples are drawn i.i.d. from an unknown distribution. In the Hierarchical Inference application, the same assumption is made for the inference phase, where the data samples are drawn i.i.d. (cf. Al-Atat et al. [2024]), which corresponds to $X_t$ being drawn i.i.d. in our partitioning setting.\color{black}} from $\mathbb{B}$, according to a fixed (unknown) distribution. For each such point, we draw $d$ hyperplanes passing through the point, each parallel to one of the $d$ pairs of opposite faces of $\mathbb{B}$. The set of experts available in round $t$ is the set of partitions of $\mathbb{B}$ created by all the hyperplanes drawn up to that round. The partitioning process for $d=1$ and $d=2$ is illustrated in Fig.~\ref{fig:partitioning}. 
\color{black}

In round $1$, the environment draws a point $X_1 \in \mathbb{B}$ creating $2^d$ experts, which we index $1,\ldots,2^d$. Similarly, in round $t$, the environment samples point $X_t \in \mathbb{B}$ resulting in $n_t = (t+1)^d$ experts. Among these experts, $(t+1)^d - t^d$ are new experts. We say an expert is a child of a parent expert if the former is a sub-partition of the latter expert. 
%\color{red} We say that an expert is a child of another expert if the partition of $\mathbb{B}$ corresponding the former expert is a subset of the partition of $\mathbb{B}$ corresponding the latter expert. 
%We refer to the latter expert as the parent of the former expert. 
\color{black} We assign the index of each parent expert to one of its children and assign new indices $t^d+1,\ldots,(t+1)^d$ to the remaining unindexed new experts. We use $\mathcal{B}_t = \{1,\ldots, n_t\}$  to denote the set of indices at the end of round $t$.
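
\emph{Example (illustrative count):} To make the indexing concrete, consider $d = 2$, where the hyperplanes are one vertical and one horizontal line through each sampled point. The counts below follow directly from $n_t = (t+1)^d$; the specific value of $d$ is chosen only for illustration:
\begin{align*}
    n_1 &= 2^2 = 4, & n_2 &= 3^2 = 9, & n_2 - n_1 &= 5.
\end{align*}
In round $2$, the $4$ existing indices are each passed on to one child, and the $5$ remaining new experts receive the indices $5,\ldots,9$. For $d = 1$, the same formula gives $n_t - n_{t-1} = 1$, i.e., exactly one new expert per round.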

In round $t$, the environment first samples $X_t$, and the learner chooses a probability mass function $\pmb{p}_t$ over the set of experts $\mathcal{B}_t$. Following this, the environment reveals the loss vector $\pmb{l}_t = (l_t(1),\ldots,l_t(n_t)) \in [0,1]^{n_t}$. The learner, therefore, incurs an expected loss of $\langle \pmb{p}_t, \pmb{l}_t \rangle$ in round $t$. The cumulative loss of expert $i \in \mathcal{B}_t$ up to time $t$ is $L_t(i) = \sum_{r = 1}^tl_r(i)$, and the expected cumulative loss of the learner up to time $t$ is $$L_t = \sum_{r = 1}^t \langle \pmb{p}_r, \pmb{l}_r \rangle.$$ For each new expert $i \in \mathcal{B}_t \backslash \mathcal{B}_{t-1}$, its cumulative loss up to time $t-1$, i.e., $L_{t-1}(i)$, is equal to the cumulative loss of its parent expert from $\mathcal{B}_{t-1}$. However, the subsequent losses of the new experts evolve independently from those of their parent experts. 

The loss functions are generated by an \textit{oblivious adversary} and communicated to the environment causally. \color{black} Our results hold for the setting where the oblivious adversary knows the relative order of the $X_t$s a priori and selects a sequence of losses according to a deterministic mapping from this ordering.
Specifically, at $t=0$, the adversary knows if $X_u < X_v$ or not, for all $1 \leq u < v \leq T$, and can exploit this information to design the loss vectors for $t \geq 1$. 
Note that our adversary is more powerful than an alternative oblivious adversary that does not have this side information. \color{black}
%The adversary generates a loss sequence for $T$ rounds before the start of the game and reveals the losses to the learner (causally) in each round. 

\color{black} We define $$L_t^* = \min_{i \in \mathcal{B}_t} L_t(i).$$ Given the time horizon $T$, we aim to minimize the \textit{expected regret} $$R_T = \mathbb{E}[L_T - L_T^*],$$ where the expectation is with respect to the joint distribution of the sequence of points $\mathbf{X}_T = \{X_1,\ldots,X_T\}$ drawn by the environment in $T$ rounds. Note that $\mathbb{E}[L^*_t]$ will be equal to $L_t^*$ if the loss vectors generated are independent of the points sampled by the environment and the regret bounds we prove will still hold. 

We also study the \textit{sample-path regret} $$\hat{R}_T = L_T - L_T^*,$$ and provide bounds in the high probability regime.
%Let $b_t$ denote the best expert till time slot $t$, i.e., $$b_t = \argmin_{i\in \mathcal{B}_t} L_t(i).$$ %Note that $\mathcal{B}_T \subseteq \hat{\mathcal{B}}$. 

\emph{\textbf{Remark 1}}: One can alternatively interpret the partitioning experts setting as follows. Instead of treating each partition as an expert, consider each point in $\mathbb{B}$ as an expert. When the environment draws an expert, it only reveals a single loss value per partition instead of losses for all the points in $\mathbb{B}$. For example, this loss value may be the average loss over all experts (points) in the partition. Since we can only work with the loss values per partition instead of losses of the individual experts, the setting where we carry over cumulative losses of the parent partition to the new sub-partitions is well-motivated, especially given its applicability in the classification application discussed in Section \ref{sec:motivation}. 

\jpcol{\emph{\textbf{Remark 2}}: Note that the regret bounds we prove are valid for any sequence of losses the oblivious adversary generates, and, thus, they hold for the supremum over all loss sequences.}

%\jpcol{\emph{\textbf{Remark 3}}: Comments on stronger adversary ...}

\begin{comment}
We consider the modification of branching experts setting studied where the adversary has no control to reveal experts to the learner. We consider a set of experts to be an interval $[a,b]$ and denote $N_t$ as the number of experts/intervals created till time $t$ by i.i.d. sampling from a bounded set $[a,b]$. Let us denote the sampled data at time t by $X_t$ and the set of intervals by $\mathcal{B}_T=\{I_{1},...I_{t}\}$. A new expert is said to branch at time $t$ from an expert $I_{y}$ if $X_{t} \in I_{y}$ leading to the formation of 2 new experts $I_{t+1},I_{t+2}$ and deletion of the expert $I_{y}$ which are clones of the expert $I_{y}$. The adversary can only assign losses to the experts whereas the branch formation is completely dependent on sampling from i.i.d. distribution. The regret of the learner is measured with respect to the best expert $I_*$ at time $T$ which was cloned starting from $I_{1}.....I_{*}$.
In our setup the two newly revealed at time $t$ are perfect clones of their parent in terms of the cumulative loss.

\SM{are we calling specific points in $[a,b]$ experts or the sub-intervals of $[a,b]$ experts? This is not clear from the text above. Also, we need to define regret formally using an equation after building suitable notation. The definition of $I_*$ is not clear to me from the text above. 
I have put down some notes in blue below. There may be gaps in my understanding, so please edit it suitably. We can replace the paragraph above with a suitably edited version of this blue text.}

\color{black}
We consider a variant of the branching experts setting studied in \cite{}. In this work, the process by which new experts are revealed to the learning agent is modeled as a stochastic process. This is a departure from \cite{}, where an adversary controls the process of revealing experts to the learning agent. 

\emph{Set of experts}: The set of experts is represented by the interval $[a,b]$, where $a$, $b$ $\in \mathbb{R}$ and $a < b$, such that each real number in $[a,b]$ represents a unique expert. The number of experts is thus uncountably infinite. 

\emph{Revealing new experts}: At each time $t$, $X_t \in [a,b]$ is generated with $X_t \sim \nu$, where $\nu$ is a probability distribution over $[a,b]$. Note that the $X_t$s are independent and identically distributed (i.i.d.). We refer to $X_t$ as a new expert revealed at time $t$ if $X_t \neq X_{\tau}$ for all $\tau < t$. 

\emph{Loss function}: In each round, the adversary assigns losses to the set of experts revealed thus far. Further, when a new expert is revealed, the cumulative loss of that expert up to that time is also revealed to the learning agent. In this work, we focus on the setting where the newly revealed expert at time $t$ is a clone of one of the experts revealed before time $t$.

\emph{Algorithmic task}: At each time, the learning agent has to choose one of the experts revealed thus far. The agent can use all prior loss information to make this decision. 

\emph{Regret}: 
\color{black}
\end{comment}

\begin{comment}
\begin{algorithm}[t]
\caption{\algo}\label{alg}
\begin{algorithmic}[1]
\STATE Initialize: $\mathcal{B}_0 = \{1,\ldots, n_0\}$, $w_{1}(i) = 1$ for all $i \in \mathcal{B}_0$, and $W_1 = 1$.
\FOR {each round $t = 1, 2,\ldots, \tau$}
\IF {new expert revealed}
\STATE $n_t = n_{t-1} + 1$ and $\mathcal{B}_t = \mathcal{B}_{t-1} \cup \{n_t\}$
\STATE Compute new weight 
$w_{t}(n_t) = e^{-\eta L_{t}(n_t)}$, and $\hat{W}_{t} = W_{t} + w_{t}(n_t)$\label{line:5}
\ELSE 
\STATE $\hat{W}_{t} = W_{t}$
\ENDIF
\STATE Compute $p_{i,t} = \frac{w_{i,t}}{\hat{W}_{t}}$, for all $i \in \mathcal{B}_t$. 
\STATE Choose an expert using $\pmb{p}_t$, observe $\pmb{l}_t$, and incur the loss $\langle \pmb{p}_t, \pmb{l}_t\rangle$.
\STATE Update the weights $w_{t+1}(i) = e^{-\eta l_{i}(t)} w_{t}(i)$
\STATE Cumulative weight $W_{t+1} = \sum_{i = 1}^{n_t}w_{t}(i)$.\label{line:12}
\ENDFOR
\end{algorithmic}
\end{algorithm}
\end{comment}

\section{The \algo~Algorithm: Regret Analysis}
\label{sec:hedgeG}
We propose an algorithm called Hedge-G, a natural extension of the Hedge algorithm to the growing experts setting, which introduces a new weight whenever a new expert arrives. As in the branching experts setting, these new weights can be readily computed in our setting because the cumulative loss of each new expert equals that of its parent expert. In Algorithm~\ref{alg}, we present Hedge-G adapted to the partitioning experts setting. 

\begin{algorithm}[t]
\caption{\algo~for partitioning experts}\label{alg}
\begin{algorithmic}[1]
\STATE \textbf{Initialize:} $\mathcal{B}_0 = \{1\}$, $n_0 = 1$, $w_{1}(1) = 1$, and $W_1 = 1$.
\FOR{each round $t = 1, 2,\ldots, T$}
\STATE $X_t$ is drawn i.i.d. from $\mathbb{B}$ and new partitions are revealed 
\STATE $n_t = (t+1)^d$ and $\mathcal{B}_t = \mathcal{B}_{t-1} \cup \{n_{t-1}+1,\ldots,n_t\}$
\STATE For $i \in \mathcal{B}_t\backslash \mathcal{B}_{t-1}$, given $L_{t-1}(i)$, compute new weights $w_{t}(i) = e^{-\eta L_{t-1}(i)}$
\STATE $\hat{W}_{t} = W_{t} + \sum_{i \in \mathcal{B}_t\backslash \mathcal{B}_{t-1}} w_{t}(i)$\label{line:5}
\STATE Compute $p_t(i) = \frac{w_{t}(i)}{\hat{W}_{t}}$, for all $i \in \mathcal{B}_t$. 
\STATE Choose an expert using $\pmb{p}_t$, observe $\pmb{l}_t$, and incur the loss $\langle \pmb{p}_t, \pmb{l}_t\rangle$.
\STATE Update the weights $w_{t+1}(i) = e^{-\eta l_{t}(i)} w_{t}(i)$, for all $i \in \mathcal{B}_t$.
\STATE Cumulative weight $W_{t+1} = \sum_{i = 1}^{n_t}w_{t+1}(i)$.\label{line:12}
\ENDFOR
\end{algorithmic}
\end{algorithm}

The regret analysis for Hedge-G differs from that of Hedge in that the introduction of new weights in line $5$ of Algorithm~\ref{alg} implies that $W_{t}$ does not normalize the weights $w_t(i)$, for $i \in \mathcal{B}_t$. A key step in our analysis of \algo~is to compute the expected value of the quantity $Y_t$, the ratio between the sum of the new weights $w_t(i)$ and $W_t$, given by  
\begin{align}\label{eq:Yt}
    Y_t = \frac{\sum_{i = n_{t-1} + 1}^{n_t}w_t(i)}{W_t} =\frac{\sum_{i = n_{t-1} + 1}^{n_t} e^{-\eta L_{t-1} (i)}}{\sum_{j \in \mathcal{B}_{t-1}} e^{-\eta L_{t-1} (j)}}.
\end{align}
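As a sanity check of \eqref{eq:Yt}, consider round $t = 1$, assuming the convention $n_0 = 1$ (our reading of $n_t = (t+1)^d$ at $t = 0$: before any point is drawn, the only expert is $\mathbb{B}$ itself, with zero cumulative loss). All weights are then equal to one, and
\begin{align*}
    Y_1 = \frac{\sum_{i = 2}^{2^d} e^{-\eta \cdot 0}}{e^{-\eta \cdot 0}} = 2^d - 1,
\end{align*}
so that $\hat{W}_1 = W_1 (1 + Y_1) = 2^d$, the number of experts in round $1$. More generally, $\hat{W}_t = W_t(1 + Y_t)$ in every round, which is the form in which $Y_t$ enters the analysis.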
\begin{comment}
 A primer for computing $\E[Y_t]$ is the following lemma which states that in any slot the new point is sampled equally likely from existing partitions.
\begin{lemma}\label{lem:1/t+1}
    %Given a sequence of i.i.d. continuous random variables  $X_1, X_2, \ldots, X_t$ drawn from $[a,b]$ with cumulative distribution function $F_X(x)$, for any permutation $(i_1, i_2, \ldots, i_{t-1})$ of $(1, 2, \ldots, t-1)$ and any $k \in \{2,\ldots,t\}$, we have $$\mathds{P}\left(X_t \in [X_{i_{k-1}},X_{i_k}] \given[\big] X_{i_1} < X_{i_2} < s < X_{i_{t-1}} \right) = \frac{1}{t}.$$
    Given that the sequence of points $\{X_t\}$ are drawn i.i.d. from $\mathbb{B}$, the point $X_t$ drawn in round $t$ belongs to any of the existing $t^d$ partitions is equally probable, i.e.,
    \begin{align*}
        \P(X_t \in \text{ partition } i) = \frac{1}{t^d}, \; \forall i \in \mathcal{B}_{t-1}.
    \end{align*}
\end{lemma}

\begin{lemma}\label{lem:expYt}
    $\E[Y_t] = \left(1+\dfrac{1}{t}\right)^d - 1 \leq \dfrac{2^d}{t}$.
\end{lemma}
%Note that, for $d = 1$, there is one new expert in every slot $t$   
\end{comment}



\begin{comment}
    Lets assume experts are being drawn from interval $[0,1]$ i.i.d. distribution. At any time t WLOG consider the sequence of arrivals $0<X_{1}<X_{2}<....<X_{t}<1$. 
    \begin{align}
        & P(X_{t+1} \in [X_i,X_{i+1}] | (X_{1}<X_{2}<...<X_{t
}))\nonumber\\& =\frac{P(X_{1}<..<X_{i}<X_{t+1}<X_{i+1}<..<X_{t})}{P(X_{1}<X_{2}<....<X_{t})}\nonumber\\& =\frac{\frac{1}{(t+1)!}}{\frac{1}{t!}}=\frac{1}{t+1}
    \end{align}
\end{comment}
 
%$X_{1},..X_{n}$ are continuous their cdf F is also continuous and monotonically increasing. Since $Y_i = F_X(X_i)$, we want to find $P(Y_{i_1} < Y_{i_2} < \ldots < Y_{i_n})$. Since $Y_1, Y_2, \ldots, Y_n$ are i.i.d.. uniformly distributed on $[0, 1]$, the probability of any specific ordering is $\frac{1}{n!}$. Therefore,
%\begin{align*}
%    P(X_{i_1} < X_{i_2} < \ldots < X_{i_n}) &= P(Y_{i_1} < Y_{i_2} < \ldots < Y_{i_n}) \\
%    &= \frac{1}{n!}.
%\end{align*}


% \section{$p_{t}$ drawn from i.i.d. distribution}
% Any ordering of $(p_{1},p_{2},p_{3},....,p_{t})$ is equally likely to occur. Since we know that $p_{t}$ are i.i.d. and for any $i,j$ $\mathds{P}(p_{i}>p_{j})=\frac{1}{2}$. If we look till the time horizon t=3 and consider the sequence of $p_{t}$ then we observe that all the 3! sequences are equally likely.\\
% $p_{1}<p_{2}<p_{3}$\\
% $p_{1}<p_{3}<p_{2}$\\
% $p_{2}<p_{1}<p_{3}$\\
% $p_{2}<p_{3}<p_{1}$\\
% $p_{3}<p_{1}<p_{2}$\\
% $p_{3}<p_{2}<p_{1}$\\
%  Now lets look at t=3 we see already have 3 intervals created and the $\mathds{P}(p_{3} \text{ in left})=\mathds{P}(p_{3} \text{ in mid})=\mathds{P}(p_{3} \text{ in right})=\frac{2}{6}$. Hence uniform distribution over the number of intervals at time t.

The following theorem characterizes an upper bound on the cumulative loss of \algo. 
%TODO Expected Regret
\begin{theorem}\label{thm:upperbound_HedgeG}
    An upper bound for the cumulative loss of \algo~is given by 
    \begin{align}\label{eq:thm:upperbound_HedgeG}
        L_T \leq L^*_T+\frac{T \eta}{8} + \frac{\sum_{t=1}^T Y_t}{\eta}.
    \end{align}
\end{theorem}
\begin{proof}
%TODO ref line 5,12
%We have
%\begin{align}\label{eq1:thm2}
%  \hat{W_{t}}= \sum_{i \in \mathcal{B}_t} e^{-\eta L_{i}(t)}. 
%\end{align}
We write
\begin{align}\label{eq2:thm2}
    \log{\frac{W_{t+1}}{W_{t}}}=\log{\frac{W_{t+1}}{\hat{W_{t}}}}+\log{\frac{\hat{W_{t}}}{W_{t}}} .
\end{align}
Given $\hat{W_{t}}= \sum_{i \in \mathcal{B}_t} e^{-\eta L_{t-1}(i)}$, we upper bound the second term on the right-hand side of \eqref{eq2:thm2} as follows.
\begin{align}
\log{\frac{\hat{W_{t}}}{W_{t}}}  
\!\!= &\log\!\left(\!{\frac{\sum_{i \in \mathcal{B}_{t-1}}\!e^{-\eta L_{t-1} (i)} \!+\! \sum_{i = n_{t-1} + 1}^{n_t} e^{-\eta L_{t-1} (i)}}{\sum_{i \in \mathcal{B}_{t-1} }e^{-\eta L_{t-1}(i)}}}\!\right) \nonumber \\
=&\log\left(1+Y_t\right) \leq Y_t. \label{eq3 : thm2}
 % & \leq \frac{1}{\sum_{i \in N'_{t+1}}e^{-\eta(L_{i}(t)-L_{j}(t))}} \nonumber\\
 % N'_{t+1}: & \text{ Set of experts } i \text{ where } L_i(t) < L_j(t) \nonumber\\ 
 % & \text{ and if } |N'_{t+1}|=k \implies Y_{t}  \leq \frac{1}{k}  
\end{align}
%TODO t included ?
\noindent Next, we upper and lower bound $\log \frac{W_{T+1}}{W_1}$. By definition, 
\begin{align}
\log \! \frac{W_{T+1}}{W_1} \! &= \!\log \!\left(\prod_{t=1}^{T} \frac{W_{t+1}}{W_t}\right) \nonumber \\
 \!\!&=\!\!\sum_{t=1}^{T} \!\left[\log \frac{W_{t+1}}{\hat{W_t}} \! + \! \log \frac{\hat{W_t}}{W_{t}}\right] \nonumber\\
& \! \leq \sum_{t=1}^T\left[-\eta \langle \pmb{p}_t, \pmb{l}_t\rangle \!+\! \frac{\eta^2}{8} \! + \! Y_t \right] \nonumber\\
&= \!\! -\eta L_T \!+ \!\frac{\eta^2 T}{8} \!+\! \!\sum_{t=1}^T \! Y_t. \! \label{eq7 : thm2}
\end{align}
%TODO ref (7), hoeffding
In the third step above, we have used \eqref{eq3 : thm2} and Hoeffding's lemma to upper bound $\log \frac{W_{t+1}}{\hat{W_t}}$. Also,
\begin{align}
    \log \frac{W_{T+1}}{W_1} 
    &= \log \sum_{i=1}^{n_T} e^{-\eta L_T(i)} \nonumber \\
    &\geq  \log \max_{i \in \mathcal{B}_T} e^{-\eta L_T(i)} \nonumber \\ 
    &= \max_{i \in \mathcal{B}_T} \log e^{-\eta L_T(i)} = -\eta L^*_T.\label{eq8 : thm2}
\end{align}
%= -\eta \min_{i \in \mathcal{B}_T} L_T(i) 
%TODO eqn ref
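Combining \eqref{eq7 : thm2} and \eqref{eq8 : thm2} gives $-\eta L^*_T \leq -\eta L_T + \frac{\eta^2 T}{8} + \sum_{t=1}^T Y_t$; rearranging and dividing by $\eta > 0$ yields the result.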
%\begin{align*}
%-\eta \min_{i \in \mathcal{B}_T} L_i(T)  \leq-\eta L_{alg}(T) + \frac{\eta^2 T}{8} + \sum Y_t \implies 
%L_T \leq L^*_T +\frac{\eta T}{8} + \frac{\sum_{t=1}^T Y_t}{\eta}. 
%\end{align*}
%TODO eqn (9) ref and summation limits and check 2lnT 
%Substituting $\eta=\sqrt{\frac{8 \log T}{T}}$ in \eqref{eq9:thm2}, we get
%\begin{align}\label{thm2:eq15}
%& R_T = L_{alg}(T) - \min_{i \in \mathcal{B}_T}L_i(T) \leq \sqrt{T\log{T}} + \sqrt{\frac{T}{{\log T}}}\sum Y_t %\leq 2\sqrt{T\log{T}}.
% & \text{Taking expectation using Jensen's inequality}\\
% & R_T  \leq 2\sqrt{T\log{T}}
%\end{align}
\end{proof}
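
For completeness, we sketch the Hoeffding step used in the proof above; this is the standard Hedge argument, written here in the notation of Algorithm~\ref{alg}. Since $\hat{W}_t = \sum_{i \in \mathcal{B}_t} w_t(i)$ and $w_{t+1}(i) = e^{-\eta l_t(i)} w_t(i)$ for all $i \in \mathcal{B}_t$, we have
\begin{align*}
    \log \frac{W_{t+1}}{\hat{W}_t} = \log \sum_{i \in \mathcal{B}_t} p_t(i)\, e^{-\eta l_t(i)} \leq -\eta \langle \pmb{p}_t, \pmb{l}_t \rangle + \frac{\eta^2}{8},
\end{align*}
where the inequality is Hoeffding's lemma applied to a random variable that takes the value $l_t(i) \in [0,1]$ with probability $p_t(i)$.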

To obtain a bound on the expected regret of \algo~from Theorem \ref{thm:upperbound_HedgeG}, we need to compute $\sum_{t=1}^T \E[Y_t]$.
As a first step toward computing $\E[Y_t]$, the following lemma states that, in any round, the new point is equally likely to belong to any one of the existing partitions of $\mathbb{B}$.
\begin{lemma}\label{lem:1/t+1}
    %Given a sequence of i.i.d. continuous random variables  $X_1, X_2, \ldots, X_t$ drawn from $[a,b]$ with cumulative distribution function $F_X(x)$, for any permutation $(i_1, i_2, \ldots, i_{t-1})$ of $(1, 2, \ldots, t-1)$ and any $k \in \{2,\ldots,t\}$, we have $$\mathds{P}\left(X_t \in [X_{i_{k-1}},X_{i_k}] \given[\big] X_{i_1} < X_{i_2} < s < X_{i_{t-1}} \right) = \frac{1}{t}.$$
    Given that the points $\{X_t\}$ are drawn i.i.d. from $\mathbb{B}$, the point $X_t$ drawn in round $t$ is equally likely to belong to any one of the existing $t^d$ partitions, i.e.,
    \begin{align*}
        \P(X_t \in \text{ partition } i) = \frac{1}{t^d}, \; \forall i \in \mathcal{B}_{t-1}.
    \end{align*}
\end{lemma}
Our next result uses Lemma \ref{lem:1/t+1} to compute $\E[Y_t]$.
\begin{lemma}\label{lem:expYt}
    $\E[Y_t] = \left(1+\dfrac{1}{t}\right)^d - 1 \leq \dfrac{2^d}{t}$.
\end{lemma}
The proofs of Lemmas \ref{lem:1/t+1} and \ref{lem:expYt} are given in the Appendix.
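
The inequality in Lemma~\ref{lem:expYt} can be checked directly from the binomial expansion; the following is a sketch of this last step only and is not a substitute for the proofs in the Appendix. For any integer $t \geq 1$,
\begin{align*}
    \left(1+\frac{1}{t}\right)^d - 1 = \sum_{k=1}^{d} \binom{d}{k} \frac{1}{t^k} \leq \frac{1}{t} \sum_{k=1}^{d} \binom{d}{k} = \frac{2^d - 1}{t} \leq \frac{2^d}{t}.
\end{align*}
For $d = 1$, the lemma reduces to $\E[Y_t] = 1/t$, which informally matches the picture in which the single new expert in round $t$ is a clone of one of the $t$ existing experts, each chosen with probability $1/t$.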

Taking expectation on both sides in \eqref{eq:thm:upperbound_HedgeG} (Theorem~\ref{thm:upperbound_HedgeG}) and using Lemma~\ref{lem:expYt}, we obtain the following bound on expected regret:
\begin{align}\label{eq:HedgeGExpReg}
    R_T \leq \frac{\eta T}{8} + \frac{2^d}{\eta}\sum_{t=1}^T \frac{1}{t} \leq \frac{\eta T}{8} +  \frac{2^d (\log T + 1)}{\eta}.
\end{align}
\color{black}
% \color{black}
% \begin{align}\label{eq:HedgeGExpReg}
%     R_T  \leq \frac{\eta T}{8} +  \frac{d(d\log T + 1)}{\eta}.
% \end{align}
\color{black}
The regret bound in the following corollary immediately follows from \eqref{eq:HedgeGExpReg}.
\begin{corollary}\label{cor:R}
    For the partitioning experts setting, for \algo~with $\eta=\sqrt{2^{d+3}(\log T+1)/T}$, the expected regret $R_T = O(\sqrt{2^dT \log{T}})$.
\end{corollary}
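
The learning rate can be traced back to \eqref{eq:HedgeGExpReg} by a routine calculation, sketched here; $A$, $B$, and $\eta^\star$ are shorthand introduced only for this calculation. The right-hand side of \eqref{eq:HedgeGExpReg} has the form $A\eta + B/\eta$ with $A = T/8$ and $B = 2^d(\log T + 1)$, which is minimized over $\eta > 0$ at $\eta^\star = \sqrt{B/A}$:
\begin{align*}
    \eta^\star = \sqrt{\frac{2^{d+3}(\log T + 1)}{T}}, \qquad A\eta^\star + \frac{B}{\eta^\star} = 2\sqrt{AB} = \sqrt{2^{d-1}\, T (\log T + 1)}.
\end{align*}
This gives $R_T = O(\sqrt{2^d T \log T})$.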
\color{black}
\color{black} Note that, in our problem setting, $d$ is a constant determined by the application under consideration. For instance, in most cases of OOD detection and Hierarchical Inference applications, $d = 1$.  In the following, we show that Hedge-G is order-optimal with respect to $T$. \color{black}
% \color{black}
% \begin{corollary}\label{cor:R}
%     For the partitioning experts setting, for \algo~ with $\eta=\sqrt{8d(\log T+1)/T}$, the expected regret $R_T = O(\sqrt{d^2 T \log{T}})$.
% \end{corollary}

\textbf{Lower bound:} 
%In our model, for any realization of $\mathbf{X}_T$, there will be $(T+1)^d$ experts at the end of round $T$. Since the environment generates losses adversarially, the sample path regret $\hat{R}_T$ for any algorithm is $\Omega(\sqrt{dT\log T})$ \cite{freund1999adaptive}.
The Prediction with Expert Advice (PEA) problem with $K$ experts has a regret lower bound of $\Omega(\sqrt{T\log{K}})$ against an oblivious adversary \cite{freund1999adaptive}. To prove the lower bound for partitioning experts, we construct the following problem instance for $d = 1$. Let the oblivious adversary assign zero loss to the $T/2$ experts spawned in the first $T/2$ time steps. From time step $T/2 + 1$ onward, each new expert always receives a higher loss than its parent, and the first $T/2$ experts receive losses as in the PEA setting. The new experts are then never better than the first $T/2$ experts, so the learner effectively faces a PEA problem with $T/2$ experts over the remaining $T/2$ rounds, and we obtain a regret lower bound of $\Omega(\sqrt{T\log{T}})$. For $d>1$, $(T/2)^d$ experts spawn in the first $T/2$ time steps, and we use a similar loss assignment as above to obtain an $\Omega(\sqrt{ dT \log T})$ lower bound. Since this lower bound is valid for any realization of $\mathbf{X}_T$, the expected regret of any algorithm is also $\Omega(\sqrt{d T \log T})$.
Thus, from Corollary \ref{cor:R}, we see that \algo~has order-optimal expected regret with respect to the time horizon $T$. Note that the vanilla Hedge algorithm achieves $O(\sqrt{dT\log T})$ expected regret only when all the $(T+1)^d$ experts are known a priori and their losses are revealed in each round.

\color{black}
\textit{\textbf{Remark 3:}} We can improve the regret bound of Hedge-G with respect to $d$ by using a tighter upper bound for $\E[Y_t]$. For example, using $\E[Y_t] = (1+1/t)^d-1 \le d/t +2^d/t^2$ and repeating the analysis with this tighter upper bound and $\eta = \sqrt{8d (\log T + 1)/T}$, we obtain an improved bound of $O(\sqrt{dT\log T} + 2^d\sqrt{T/(d\log T)})$. In the first term, the dependence on $d$ is $\sqrt{d}$ instead of $\sqrt{2^d}$. In the second term, the dependence on $d$ is $2^d/\sqrt{d}$, but $\sqrt{\log T}$ appears in the denominator. The ratio of the second term to the first is therefore of order $2^d/(d\log T)$, so the regret bound is dominated by the $\sqrt{dT\log T}$ term once $\log T$ exceeds $2^d/d$. 
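
The calculation behind this improved bound can be sketched as follows, using the standard estimates $\sum_{t=1}^T 1/t \leq \log T + 1$ and $\sum_{t=1}^T 1/t^2 \leq \pi^2/6$ together with Theorem~\ref{thm:upperbound_HedgeG}:
\begin{align*}
    \sum_{t=1}^T \E[Y_t] &\leq d(\log T + 1) + \frac{\pi^2}{6}\, 2^d, \\
    R_T &\leq \frac{\eta T}{8} + \frac{d(\log T + 1)}{\eta} + \frac{\pi^2\, 2^d}{6\eta}.
\end{align*}
Substituting $\eta = \sqrt{8d(\log T + 1)/T}$, the first two terms sum to $\sqrt{dT(\log T+1)/2}$, and the third term equals $\frac{\pi^2\, 2^d}{6}\sqrt{\frac{T}{8d(\log T + 1)}}$.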
%One may further improve Hedge-G regret by finding an even tighter upper bound for $E[Y_t]$.
\color{black}
%\algo~is optimal with respect to the expected regret. Consider the following problem instance. For the first $\frac{T}{2}$ rounds the environment assigns zero loss to all the experts. After that the children get the same loss as their parents essentially having $\frac{T}{2}$ experts and hence the loss is lower bounded by a hedge with $N=\frac{T}{2}$ experts proving a regret lower bound of $\Omega(\sqrt{T \log{T}})$.

% \section{Hedge with partial restarts Regret Upper Bound}
% From the Branching experts setting we know that Hedge with partial restarts has an upper bound on regret given by $\sqrt{L_{T}^{*}\log \Pi}$. $\Pi$ in our case corresponds to $2^{N^*}$ where $N^*$ denotes the number of splits on the path from the root to the best expert. Therefore,

% \begin{align}
%     R_T & \leq \sqrt{L_{T}^{*}\log \Pi} \leq \sqrt{L_{T}^{*}\log 2^{N^*}} \nonumber \\
%     & \leq \sqrt{L_{T}^{*}}E[\sqrt{N^*}] \leq \sqrt{L_{T}^{*}E[N^*]}   
% \end{align}
%  The expectation is wrt to the random $X_{t}$ arrivals.

%  Using lemma 4 we have 
%  \begin{align}
%      R_T \leq \sqrt{L_{T}^{*}\sqrt{T}} \leq T^{\frac{3}{4}}
%  \end{align}

%  \begin{lemma}
%      $\log(T)\leq E[N^{*}] \leq \sqrt{T}$

%      \begin{proof}
%          For a binary tree with $T$, leaf nodes have $2T-1$ nodes $\implies$ minimum depth of the tree is $\log{(2T-1)}$.

%          Now to prove the Upper bound take the example of $t=6$ and observe that for any particular permutation for example 
%          $\{p_{6},p_{2},p_{3},p_{4},p_{1},p_{5}\}$
%           $\text{max height of the binary tree} = max_{p_{i}} h(p_{i})+d(p_{i}) \leq max_{p_{i}}h(p_{i})+ max_{p_{i}}d(p_{i})$
%           \\\\
%           where $h(p_i)$ is the length of the sequence of the next greater elements in the sequence till Time $i$. For $p_1$ we will have $1 \rightarrow 2 \rightarrow 6$, for $p_{5}$ we will have $5 \rightarrow 6$.
%           \\\\
%           where $d(p_i)$ is the length of the sequence of the next smaller elements in the sequence till Time $i$. For $p_4$ we will have $4 \rightarrow 3 \rightarrow 2$, for $p_{5}$ we will have $5 \rightarrow 4 \rightarrow 3 \rightarrow 2$, for $p_{5}$.
%           \\\\

%           $h(p_{i}) \leq $ Length of Longest decreasing subsequence$\leq 2\sqrt{N}$, $d(p_{i}) \leq $ Length of Longest increasing subsequence. Therefore $E[N^*] \leq max_{p_{i}} h(p_{i})+d(p_{i}) \leq 4\sqrt{T}$       
          
         
%      \end{proof}
%  \end{lemma}

\begin{corollary}\label{cor:hatR}
    For the partitioning experts setting, for any $\epsilon > 0$, \algo~with $\eta=\sqrt{\frac{2^{d+3}(\log T + 1)}{T^{1-\epsilon}}}$ achieves the sample-path regret $\hat{R}_T = O(\sqrt{2^d T^{1+\epsilon} \log T})$ with probability at least $1- T^{-\epsilon}$.
\end{corollary}
\begin{proof}
    Using Markov's inequality for the sum of the nonnegative random variables $Y_t$, we get
    \begin{align*}
        \P\left(\sum_{t = 1}^T Y_t \leq  T^\epsilon \sum_{t = 1}^T \E[Y_t] \right) & \geq 1 - \frac{\sum_{t = 1}^T \E[Y_t]}{ T^\epsilon \sum_{t = 1}^T \E[Y_t]} \\ & = 1 -  T^{-\epsilon}.
    \end{align*}
    Using this result in \eqref{eq:thm:upperbound_HedgeG} and the upper bound for $\E[Y_t]$ from Lemma~\ref{lem:expYt}, we obtain, with probability at least $1 - T^{-\epsilon}$,
    \begin{align*}
        \hat{R}_T \leq \frac{\eta T}{8} + \frac{2^d T^\epsilon (\log T + 1)}{\eta}. 
    \end{align*}
    Choosing $\eta =\sqrt{\frac{2^{d+3}(\log T + 1)}{T^{1-\epsilon}}}$ results in $\hat{R}_T \leq \sqrt{2^{d-1} T^{1+\epsilon} (\log T + 1)}$.
\end{proof}
\begin{comment}
\begin{proof}
    Using Markov inequality for the summation of the random variables $Y_t$, we get
    \begin{align*}
        \P\left(\sum_{t = 1}^T Y_t \leq  (\log T)^\epsilon \sum_{t = 1}^T \E[Y_t] \right) \leq 1 - \frac{\sum_{t = 1}^T \E[Y_t]}{(\log T)^\epsilon \sum_{t = 1}^T \E[Y_t]} = 1 - (\log T)^{-\epsilon}.
    \end{align*}
    Using this result in \eqref{eq:thm:upperbound_HedgeG} and the upper bound for $\E[Y_t]$ from Lemma~\ref{lem:1/t+1}, we obtain, with probability at least $1 - (\log T)^{-\epsilon}$,
    \begin{align*}
        \hat{R}_T \leq \frac{\eta T}{8} + \frac{2^d (\log T)^\epsilon (\log T + 1)}{\eta}. 
    \end{align*}
    Choosing $\eta =\sqrt{\frac{2^{d+3}(\log T + 1)(\log T)^\epsilon}{T}}$ results in $\hat{R}_T \leq \sqrt{2^{d-1} T (\log T)^\epsilon (\log T + 1)}$.
\end{proof}
\end{comment}
%Choosing the fixed $\eta$ value in Corollary~\ref{cor:hatR} resulted in $O(T^{1+\epsilon}\log T)$ bound for $\hat{R}_T$, but it results in the same bound for $R_T$ which is higher by a factor $O(T^{\epsilon/2})$ compared to the optimal bound $O(\sqrt{T\log T})$. On the other hand, choosing the fixed eta value in Corollary~\ref{cor:R} does not result in an upper bound for $\hat{R}_T$.  Section \ref{sec:adaptive_eta} will address this issue by adapting the learning rate $\eta$ based on the observed $Y_t$ values.

From Corollaries \ref{cor:R} and \ref{cor:hatR}, it follows that for $\eta = \sqrt{\frac{2^{d+3}(\log T + 1)}{T^{1-\epsilon}}}$, the sample-path regret of \algo~is $ O(\sqrt{T^{1+\epsilon}\log T})$ with high probability and its expected regret is $O(T^{\frac{\epsilon}{2}}\sqrt{T\log T})$. In comparison, the expected regret of \algo~with $\eta = \sqrt{2^{d+3}(\log T+1)/T}$ is $O(\sqrt{T \log{T}})$, but this value of $\eta$ corresponds to $\epsilon = 0$, for which the probability guarantee $1 - T^{-\epsilon}$ on the sample-path regret bound equals zero, i.e., the guarantee is vacuous. Therefore, to obtain a high-probability bound on the sample-path regret of \algo~using Theorem~\ref{thm:upperbound_HedgeG} and Markov's inequality, we use a value of $\eta$ for which the expected regret is higher than the optimal by a factor of $O(T^{\frac{\epsilon}{2}})$. In Section \ref{sec:adaptive_eta}, we address this limitation of \algo~by adapting the learning rate based on the losses revealed by the adversary.
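
To illustrate this trade-off numerically, take $T = 10^4$; the values of $T$ and $\epsilon$ below are chosen only for illustration and are not taken from our experiments:
\begin{align*}
    \epsilon = 0.1: &\quad T^{-\epsilon} = 10^{-0.4} \approx 0.40, \quad T^{\epsilon/2} = 10^{0.2} \approx 1.58, \\
    \epsilon = 0.5: &\quad T^{-\epsilon} = 10^{-2} = 0.01, \quad T^{\epsilon/2} = 10.
\end{align*}
That is, a failure probability bound of about $0.40$ inflates the expected-regret bound by a factor of about $1.58$, whereas reducing the failure probability bound to $0.01$ inflates it by a factor of $10$.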


\emph{\textbf{Remark 4}}:
Note that if the points $X_t$ are drawn adversarially from $\mathbb{B}$, \algo~has linear regret. To see this, we construct the following problem instance for $d=1$. The adversary always splits the best expert in each round, resulting in two experts, $j$ and $k$. Uniformly at random, the adversary assigns a loss of one to one expert in the set $\{j,k\}$ and zero to the other expert. For all other experts $i \ne j,k$, it assigns a loss of one. For this problem instance, at any time $t$, $L_t^{*}=0$, but the expected loss of \algo~in that time step is at least $\frac{1}{2}$. Hence, \algo~has an expected regret of at least $\frac{T}{2}$. This result is expected because, if the $X_t$ are adversarially drawn from $\mathbb{B}$, then the partitioning experts setting is a special case of the branching experts setting studied by \cite{Gofer13}. It is known that, for the branching experts setting, the regret of any algorithm is
%with the exception that the number of experts in the partitioning experts setting in round $T$ is $(T+1)^d$ while in branching experts setting is $N_T$, which is assumed to be upper bounded by a constant. \cite{Gofer13} prove that the 
$\Omega(\sqrt{TN_T})$, where $N_T$ for the partitioning experts setting is equal to $(T+1)^d$.
% \begin{theorem}
%      The regret $R_T$ of \algo~run for $\alpha$ approximate clones and atmost $K$ children branching with $\eta=\sqrt{\frac{8K}{T}\log T}  $ for i.i.d. $X_{t}$ arrivals satisfies
%     \begin{align}
%         R_T \leq \sqrt{K T \log{T}}.
%     \end{align}
% \end{theorem}

\color{black}
\subsection{Performance Comparison}
We compare the cumulative loss and runtime performance of Hedge-G with the Hedge algorithm, which has prior knowledge of all the expert intervals (i.e., the intervals that will be formed in $T$ rounds). We simulate for $d = 1$.
%The simulation setup is as follows: \algo~ represents our proposed algorithm, while Hedge denotes an idealized baseline that has prior knowledge of the expert intervals (i.e., the intervals that will be formed). 
At each time step \( t \), the loss assigned to a Hedge expert corresponds to the loss of the corresponding parent expert in the same simulation instance under Hedge-G. The points \( X_t \) are sampled independently from a uniform distribution \( \mathcal{U}[0, 1] \), and the loss for each expert at each time step is generated from a Bernoulli distribution with parameter 0.3, i.e., \( \text{Bernoulli}(0.3) \). The experiments were performed on a machine equipped with an Intel(R) Xeon(R) CPU running at 2.20GHz. The processor has a cache size of 56.32 MB, and the system is equipped with 12.7 GB of RAM.

The results, illustrated in Figure~\ref{fig:cumloss}, demonstrate that \algo~achieves performance comparable to that of Hedge, despite lacking prior knowledge of the expert intervals available to the latter. By the end of 1000 rounds, \algo~incurs an additional cumulative loss of only 0.383 compared to Hedge. %, as observed in the simulation results. 
Figure~\ref{fig:runtime} presents the cumulative runtime as a function of the number of rounds for both algorithms. As anticipated, Hedge-G incurs significantly lower computational overhead, leading to a noticeably reduced runtime.


\begin{figure}[h]
\centering

\begin{subfigure}[t]{\linewidth}
    \centering
    \includegraphics[width=\linewidth]{cumloss.pdf}
    \caption{Cumulative loss of \algo, Hedge, and the best expert}
    \label{fig:cumloss}
\end{subfigure}

%\vspace{0.3cm} % Adjust spacing between subfigures

\begin{subfigure}[t]{\linewidth}
    \centering
    \includegraphics[width=\linewidth]{runtime.pdf}
    \caption{Cumulative runtime of \algo~and Hedge versus the number of rounds}
    \label{fig:runtime}
\end{subfigure}

\caption{Comparison between \algo~and Hedge}
\label{fig:combined}
\end{figure}
\color{black}

\section{Ada\algo: \algo~with Adaptive Learning Rate}\label{sec:adaptive_eta}

%Using fixed learning rate $\eta=\sqrt{\frac{8 \log T}{T}}$ yields a suboptimal bound in a probabilistic setting. Suppose $\sum_{t=1}^{T}Y_t=O(\sqrt{T\log(T)}$ we get a linear regret from \eqref{thm2:eq15} whereas from Theorem \ref{thm4} we get a sublinear bound of $O(T^{\frac{3}{4}}\sqrt{\log(T)})$ showing the sub-optimality of fixed learning rate. There will be a trade-off between expected and probabilistic regret bound. To address this, we use the doubling trick to track $Y_t$ and suitably change the value of $\eta$ over rounds. Whereas fixed learning rate \algo~ has expected regret of $O(\sqrt{T\log(T)})$ and doubling trick \algo~ has $O(\sqrt{T\log(T)}\log(\log(T))$.  
%We assume that the value of $T$ is known and apply the doubling trick  \cite{} on the cumulative sum of $Y_t$ as defined in \eqref{defn:Yt}. We now provide performance guarantees for Algorithm \ref{alg2}.

In this section, we propose a variant of \algo~called Ada\algo~and show that its expected regret is near-optimal, while it simultaneously achieves the same high-probability bound on the sample-path regret as that of \algo~stated in Corollary~\ref{cor:hatR}.

The details of Ada\algo~are presented in Algorithm~\ref{alg2}. The key idea behind the algorithm is to track the sum of the $Y_{t}$s using the variable $S$ and to adapt the learning rate over rounds using a doubling trick. In particular, we partition time into segments, where \textit{segment} $i$ spans the rounds for which $S \leq 2^{id}$. At the start of any segment $i$, we reset the value of $S$ to zero, assign equal weights to all the existing experts (from the previous segment), and use Hedge-G with learning rate $\eta_i = \sqrt{8(2^{id} + d\log \tau_i)/T}$, where $\tau_i$ is the round in which the segment starts.



\begin{algorithm}[ht]
\caption{Ada\algo}\label{alg2}
\begin{algorithmic}[1]
\STATE \textbf{Initialize:} $r \gets 0,S \gets 0, \tau \gets 1, b \gets 2^d, \textbf{w}_1 = 1$, and $\eta \gets \sqrt{\frac{8b}{T}}$.
\FOR{$t = 1, \ldots, T$}
\STATE $X_t$ is drawn i.i.d. from $\mathbb{B}$
\STATE Calculate $Y_t$ using \eqref{eq:Yt}
    \IF{$S + Y_t > b$}
    \STATE Start a new segment 
    %$\tau \gets \tau+r$\ \\
    \STATE $\textbf{w}_{t} = (w_1, \ldots, w_{t^d}) = \left( \frac{1}{t^d}, \ldots, \frac{1}{t^d} \right)$ \label{adahedge}
    \STATE $S \gets 0$
    \STATE $b \gets 2^d b$
    \STATE $\eta \gets \sqrt{\frac{8(b+d \log t)}{T}}$ 
    \ENDIF
    \STATE $S \gets S + Y_t$ 
    % $r \gets r + 1$ \\
    \STATE Use \algo~with already observed $X_t$, initial weight vector $\mathbf{w}_{t}$ and learning rate $\eta$.
\ENDFOR
\end{algorithmic}
\end{algorithm}
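
As an illustration of the schedule in Algorithm~\ref{alg2}, consider $d=1$ (the horizon $T=1000$ used below is an assumed value, chosen only for concreteness). The threshold $b$ takes the values $2, 4, 8, \ldots$, i.e., $b = 2^{i}$ in segment $i$, and the learning rate in a segment that starts in round $\tau_i$ is
\begin{align*}
\eta_i = \sqrt{\frac{8\left(2^{i} + \log \tau_i\right)}{T}}.
\end{align*}
The first segment starts in round $\tau_1 = 1$, so $\log \tau_1 = 0$ and $\eta_1 = \sqrt{16/T}$, which agrees with the initialization $\eta \gets \sqrt{8b/T}$ with $b = 2$; for $T = 1000$ this gives $\eta_1 \approx 0.126$. For $d=2$, the thresholds grow as $4, 16, 64, \ldots$ instead.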

The next theorem characterizes an upper bound on the cumulative loss of Ada\algo.
\begin{theorem}\label{thm4}
    An upper bound for the cumulative loss of Ada\algo~is given by
\begin{align}\label{thm6:eq15}
 L_T  \leq & L^*_T + \frac{2^{d-\frac{1}{2}}}{2^{\frac{d}{2}}-1}\sqrt{T\left(\sum_{t=1}^T Y_t + 1\right)} \\ & + \left(1+\frac{2}{d}\log_2 \left(\sqrt{\sum_{t=1}^T Y_t + 1}\right)\right)\sqrt{dT \log{T}/2} \nonumber.
\end{align}  
\end{theorem}
\begin{proof}
%We partition the time into segments such that segment $i$ starts in the round in which $b$ is set to $2^{id}$.  Let $m$ be the number of segments that start either before or in round $T$. 
Let $r_i$ be the length of the $i^\text{th}$ segment, i.e., the number of rounds in the $i^\text{th}$ segment. By definition of a segment, we have $$r_i=\min\left\{r:\sum_{u=\tau_{i}}^{r}Y_u>2^{id}\right\}-\tau_{i},$$ where
$\tau_i$ is the round in which the segment $i$ starts and is given by $\tau_i = \sum_{u=1}^{i-1}r_{u}+1$.
% \begin{align*}
% \tau_i &= \sum_{u=1}^{i-1}r_{u}+1.
% \end{align*}
%By definition, the $i^{\text{th}}$ segment begins in round $b_i$. 
Let $R^{(i)}$ denote the regret incurred in segment $i$. It follows that
\begin{align*}
    R^{(i)} = \sum_{u = \tau_i}^{\tau_{i+1}-1} l_{u} - \min_{j \in \mathcal{B}_{\tau_{i+1}-1}}\sum _{u = \tau_i}^{\tau_{i+1}-1} l_{u}(j).
\end{align*}
%We now bound the regret in each segment. 
We repeat the regret analysis from the proof of Theorem~\ref{thm:upperbound_HedgeG} for $R^{(i)}$ and obtain
%Hence using \eqref{eq7 : thm2} and \eqref{eq8 : thm2}, it follows that:
\begin{align*}
    R^{(i)} \leq \frac{\eta_i r_i}{8} + \frac{S_i + d\log \tau_i}{\eta_i}\leq \sqrt{T(2^{id}+d \log T)/2},
\end{align*}
where we have used $r_i \leq T$, $\tau_i \leq T$, 
%\sum_{u=0}^{i-1}r_{u} & = \text{ Number of experts before the start of round } i,\\
\begin{align*}
S_i = \sum_{r=\tau_{i}}^{\tau_{i+1}-1} Y_r \leq 2^{id}, \text{ and }
\eta_i =\sqrt{\frac{8(2^{i d} + d\log \tau_i)}{T}}.
\end{align*}
Note that the weights are reinitialized to $1/\tau_i^d$ at the start of segment $i$, which yields the additional term $d\log \tau_i$ when upper bounding $\log \frac{W_{\tau_{i+1}-1}}{W_{\tau_i}}$ in the analysis leading to \eqref{eq8 : thm2}.  
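
The second inequality in the bound on $R^{(i)}$ can be verified directly; the following is a sketch of the algebra, using the abbreviation $A_i = 2^{id} + d\log\tau_i$ (introduced here only for brevity). Since $\eta_i = \sqrt{8A_i/T}$, $r_i \leq T$, and $S_i \leq 2^{id}$,
\begin{align*}
\frac{\eta_i r_i}{8} &\leq \frac{T}{8}\sqrt{\frac{8A_i}{T}} = \sqrt{\frac{T A_i}{8}}, \\
\frac{S_i + d\log\tau_i}{\eta_i} &\leq A_i\sqrt{\frac{T}{8A_i}} = \sqrt{\frac{T A_i}{8}}.
\end{align*}
Adding the two terms gives $2\sqrt{T A_i/8} = \sqrt{T A_i/2}$, and $\tau_i \leq T$ yields $A_i \leq 2^{id} + d\log T$.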

Let $m$ denote the index of the last segment that starts in or before round $T$. Summing the regret over all $m$ segments, we obtain
\begin{align}
L_T - L^*_T  \leq \sum_{i=1}^{m} R^{(i)}   \leq & \sqrt{\frac{T}{2}} \Big(\sqrt{2^{d}+ d\log{T}} \nonumber
\\ &+ \sqrt{2^{2 d}+d \log{T}} \nonumber\\
 &+ \ldots +  \sqrt{2^{m d} + d \log{T}} \Big) \nonumber \\
 \leq & \sqrt{\frac{T}{2}} \sum_{i=1}^{m}2^\frac{i d}{2} + m\sqrt{dT \log{T}/2} \nonumber  \\
 \leq & \frac{\sqrt{\frac{T}{2}}2^\frac{(m+1) d}{2}}{2^{\frac{d}{2}}-1} + m\sqrt{dT \log{T}/2}. \label{eq44:dt}
\end{align}
In the second step above, we have used $\sqrt{x+y} \leq \sqrt{x} + \sqrt{y}$.
Further, we have $$\displaystyle \sum_{t=1}^{T} Y_t \geq \sum_{i=1}^{m-1} 2^{i d} = 2^d  \frac{2^{(m-1) d}-1}{2^{d}-1}.$$ Therefore,
\begin{align}
 2^{\frac{m d}{2}} & \leq 2^{\frac{d}{2}} \sqrt{\sum_{t=1}^T Y_t + 1}\label{eq:upperboundm1} \\
 \implies m & \leq \frac{2}{d} \log_2 \left(2^{\frac{d}{2}}\sqrt{\sum_{t=1}^T Y_t + 1}\right) \nonumber\\ & = 1+\frac{2}{d}\log_2 \left(\sqrt{\sum_{t=1}^T Y_t + 1}\right).\label{eq:upperboundm2}
\end{align}
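The step to \eqref{eq:upperboundm1} can be spelled out as follows (a short sketch of the rearrangement). Multiplying the lower bound on $\sum_{t=1}^T Y_t$ by $2^d - 1$ gives
\begin{align*}
2^{md} \leq (2^d - 1)\sum_{t=1}^T Y_t + 2^d \leq 2^d\left(\sum_{t=1}^T Y_t + 1\right),
\end{align*}
and taking square roots on both sides yields \eqref{eq:upperboundm1}.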
Substituting \eqref{eq:upperboundm1} and \eqref{eq:upperboundm2} in \eqref{eq44:dt}, we obtain the result.
% \begin{align}
%  L_T \leq L^*_T + \frac{2^{d-\frac{1}{2}}}{2^{\frac{d}{2}}-1}\sqrt{T\left(\sum_{t=1}^T Y_t + 1\right)} + \left(1+\frac{2}{d}\log_2 \left(\sqrt{\sum_{t=1}^T Y_t + 1}\right)\right)\sqrt{dT \log{T}/2}.
% \end{align}    
\end{proof}
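
To illustrate the bound in Theorem~\ref{thm4}, consider the special case $d=1$. The leading coefficient becomes $\frac{2^{1/2}}{2^{1/2}-1} = 2+\sqrt{2}$, and $1 + 2\log_2 \sqrt{x} = 1 + \log_2 x$, so \eqref{thm6:eq15} reads
\begin{align*}
L_T \leq{} & L^*_T + (2+\sqrt{2})\sqrt{T\left(\sum_{t=1}^T Y_t + 1\right)} \\ & + \left(1+\log_2 \left(\sum_{t=1}^T Y_t + 1\right)\right)\sqrt{\frac{T \log T}{2}}.
\end{align*}
In particular, on any sample path for which $\sum_{t=1}^T Y_t$ is of order $\log T$ (a condition assumed here purely for illustration), the right-hand side exceeds $L^*_T$ by $O(\log(\log T)\sqrt{T\log T})$.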


% \color{black}
% \begin{align}
% S_0 & = Y_1 +\ldots+ Y_{l_0} \leq 1 , \eta_0=\sqrt{\frac{1}{T}} \nonumber\\
% S_1 & = Y_{l_0 +1 } +...+ Y_{l_0+l_1}\leq 2 , \eta_1=\sqrt{\frac{2+ \log(\l_0)}{T}} \nonumber\\
% \vdots \nonumber \\
% S_{m-1} & = Y_{(l_0 + l_1 +\ldots+ l_{m-2}+1)} + \ldots + Y_{T} \leq 2^{m-1} , \eta_{m-1}=\sqrt{\frac{2^{m-1}+ \log(l_0+l_1+...+l_{m-2}) }{T}}    
% \end{align}

% Defining the regret for every restart, using \eqref{eq9:thm2} we can upper bound the regret as follows 
% \begin{align}
% R_{0} & \leq \frac{\eta_0 l_0}{8} + \frac{S_0}{\eta_0} \nonumber\\
% R_{1} & \leq \frac{\eta_1 l_1}{8} + \frac{S_1 + \log(l_0)}{\eta_1} \nonumber\\
% \vdots \nonumber \\
% R_{m-1} & \leq \frac{\eta_{m-1} l_{m-1}}{8} + \frac{S_{m-1} + \log(l_0 + l_1 + \ldots l_{m-2})}{\eta_{m-1}}
% \end{align}
% Adding the regret for every restart we have the following 
% \begin{align}
% R_T & \leq \sum_{i=0}^{m-1} R_i \nonumber \leq \sqrt{T}(\sqrt{1} + \sqrt{2+\log{T}} + \ldots +  \sqrt{2^{m-1} + \log{T}}) \nonumber \\
% & \leq \sqrt{T} \sum_{i=0}^{m-1}\sqrt{2}^i + m\sqrt{T \log{T}} \leq (\sqrt{2}+1)\sqrt{T}\sqrt{2}^m + m\sqrt{T \log{T}} \label{eq44:dt}
% \end{align}
% We know that $\sum_{i=1}^{T} Y_t \geq \sum_{i=0}^{m-2} 2^{i} \geq 2^{m-1}-1 $
% \begin{align}
%  \sqrt{2}^m & \leq 2\sqrt{\sum_{t=1}^T Y_t + 1}
%  \implies m \leq 2\log_2 (2\sqrt{\sum_{t=1}^T Y_t + 1})
% \end{align}
% Substituting the upper bound for $m$ in \eqref{eq44:dt} we have
% \begin{align}
%  R_T & \leq 2(\sqrt{2}+1)\sqrt{T(\sum_{t=1}^T Y_t + 1)} + 2\log_2 (2\sqrt{\sum_{t=1}^T Y_t + 1})\sqrt{T \log{T}}
% \end{align}
% \color{black}

%\begin{theorem}
%    For $Hedge_G$ with Doubling Trick, we get 
%    \begin{align}
%        R_T & \leq 2(\sqrt{2}+1)\sqrt{T(\sum_{t=1}^T Y_t + 1)} \nonumber \\
%        & +2\log_2 (2\sqrt{\sum_{t=1}^T Y_t + 1})\sqrt{T \log{T}}
%    \end{align}
%\end{theorem}

% \begin{theorem}
%     For Doubling Trick $Hedge_G$ With probability $\geq$ $1-\delta$ where $\delta = 2e^{\frac{-2\epsilon^2}{T}}$ and $\epsilon>0$ we get 
%     \begin{align}
%         R_T \leq \sqrt{T(\log T + \epsilon)} + \log(\log(T)+\epsilon )\sqrt{T(\log T )} 
%     \end{align}
% \end{theorem}

% \begin{theorem}
%     For Doubling Trick $Hedge_G$ With probability $\geq$ $1-\delta$ where $\delta = o(1)$ and $\epsilon=\omega(\sqrt{T})$ we get 
%     \begin{align}
%         R_T = \omega(T^{\frac{3}{4}}) 
%     \end{align}
% \end{theorem}

%\color{red}
%\begin{theorem*}
%\begin{enumerate}
%    \item For Doubling Trick and $d=1$ \algo~With probability $\geq$ $1-\delta$ where $\delta = e^{-\log(T)^{2+\epsilon}}$  and $\epsilon>0$  has
%    \begin{align}
%        R_T \approx O(\sqrt{T\log(T)^{2+\epsilon}} + (2+\epsilon)\sqrt{T\log(T)} \log(\log(T)))
%    \end{align}

%    \item For Doubling Trick and any dimension $d$ \algo~With probability $\geq$ $1-\delta$ where $\delta = \log(T)^{-\epsilon}$ and $\epsilon>0$ has 
%    \begin{align}
%        R_T \approx O(\sqrt{T(\log T)^{1+\epsilon}} + (1+\epsilon)\log(\log(T))\sqrt{T(\log T )})
%    \end{align}
%\end{enumerate}
%\end{theorem*}
\noindent The next theorem provides guarantees on the regret of Ada\algo.
\begin{theorem} For the partitioning experts setting, Ada\algo~has the following regret bounds.
\label{thm:adaHedgeG}
\begin{enumerate}
    \item[(i)] The expected regret  $R_T = O(\log (\log T)\sqrt{T\log T})$.
    % \begin{align*}
    %     R_T = O\left(\log (\log T)\sqrt{T\log T}\right).
    % \end{align*}
\item[(ii)] For $d \ge 1$ and some constant $c$ depending on dimension $d$, the sample-path regret $\hat{R}_T = O(\log{T}\sqrt{T\log T})$,
%    \begin{align*}
%        \hat{R}_T = O\left(\sqrt{T(\log T)^{1+\epsilon}} + (1+\epsilon)\log \log T \sqrt{T\log T }\right),
%\hat{R}_T = O\left(T^\frac{\epsilon}{2}\sqrt{T\log T}\right),
%    \end{align*}
    with probability at least $1-T^{-c}$.
 %\item For $\epsilon>0$ and $d=1$, the sample-path regret
    %\begin{align*}
    %\hat{R}_T = O\left(\sqrt{T(\log T)^{2+\epsilon}}\right),
   %     \hat{R}_T = O\left(\sqrt{T(\log T)^{2+\epsilon}} + (2+\epsilon)\log \log T \sqrt{T\log T} \right),
    %\end{align*}
    %with probability $\geq$ $1-T^{-2-\epsilon}$. 
    %\item[(iii)] 
    \jpcol{For $d=1$, the sample-path regret can be improved to $\hat{R}_T = O(\log (\log T)\sqrt{T \log T})$,
    % \begin{align*}
    % \hat{R}_T = O\left(\log (\log T)\sqrt{T \log T}\right),
    % \end{align*}
    with probability at least $1-(eT)^{-0.25}$.}
\end{enumerate}    
\end{theorem}
\begin{proof}
    \input{Bernstein}
\end{proof}
\color{black} 
From parts (i) and (ii) of Theorem \ref{thm:adaHedgeG}, we observe that Ada\algo~has near-optimal expected regret (suboptimal by a factor of $\log(\log T)$) and also attains the same high-probability bound on the sample-path regret as \algo~in Corollary~\ref{cor:hatR}. Ada\algo~thus addresses the limitation of \algo~discussed at the end of Section \ref{sec:hedgeG}.
%where we see a trade-off between obtaining a tight bound on the expected regret and a high probability bound on the sample-path regret for \algo. 
Further, in part (ii) of the theorem, for the special case $d=1$, we provide a sample-path regret that is near-optimal with high probability, independent of $\epsilon$. 
%To prove this result, we show that for $d=1$, on each sample path, the $Y_t$s are upper bounded by another sequence of random variables that are independent across time. We then use Bernstein's inequality (cf. \cite{Boucheron2004}) for the sum of this alternate sequence of random variables to obtain concentration results for $\sum_{t=1}^T Y_t$. 
Proving a tighter bound for $d>1$, similar to the case $d = 1$, remains an open problem. 
\color{black}

% \begin{theorem}
%     $Hedge_G$ run with $\varepsilon=\frac{m}{T} $ and $\eta = \frac{\sqrt{2 m \log (T)}}{T}$ then expected number of revealed loss rounds equals $m$ and 
%     \begin{align}
%         R_T \leq \frac{ 3T\log (T)}{\sqrt{m}} 
%     \end{align}
% \end{theorem}






