\subsubsection{Theoretical motivation for the subspace selection} \label{ss:theo_analysis}
The selection of a block of coordinates can be viewed as a combinatorial mixture of experts problem, where each coordinate is a single expert and the forecaster aims at choosing the best combination of experts in each step~\cite{cesa2006prediction}. Under this view, we bound the regret of our selection method on an intuitive surrogate loss function with respect to the policy of selecting the best 
block of coordinates at each step.
This is complementary to the regret analysis of the optimization performed at each subspace. 
% Specifically, the conventional regret analysis associated with BO, with respect to the value of the objective function, is applicable for each specific subspace, accounting for the projection of points to this subspace. 
Here we focus on justifying the coordinate selection 
% part 
alone.

Following the standard framework, we compare with a fixed optimal choice $\mathcal{I}^*$ for the block of coordinates to pick at all steps. This block is characterized by improving the objective function the largest number of times among all possible coordinate blocks when performing Bayesian optimization. 
%We design the following loss function. 
For any coordinate subset $\mathcal{A}$,
%and a sequence of threshold values $\{\lambda_t\}_{t=0}^{T-1}$ given in advance by some oracle, 
we define the following loss function at time $t$, for coordinate $i$,
% \begin{align} \label{eq:alpha_beta_loss}
%     \ell_{t,i}(C_t, y_t) =
%     \begin{cases}
% 		-
%         % \frac{1}{\eta}
% 		\log(\tilde{\alpha}) & \text{if } i \in C_t \text{ and } y_t > M_{t-1} \\
% % 		\frac{1}{\eta}
% 		\log(\tilde{\beta}) & \text{if } i \in C_t \text{ and } y_t \leq M_{t-1} \\
%         0 & \text{if } i \notin C_t
%     \end{cases} 
%     \quad ; \quad 
%     \tilde{\alpha},\tilde{\beta}>1
% \end{align}
\begin{align} \label{eq:alpha_beta_loss}
    \ell_{t,i}(\mathcal{A}) =
    \begin{cases}
		-
        % \frac{1}{\eta}
		\log(\tilde{\alpha}) & \text{if } i \in \mathcal{A} \text{ and } y_t > M_{t-1} \\
% 		\frac{1}{\eta}
		\log(\tilde{\beta}) & \text{if } i \in \mathcal{A} \text{ and } y_t \leq M_{t-1} \\
        0 & \text{if } i \notin \mathcal{A}
    \end{cases} 
\end{align}
with $\tilde{\alpha},\tilde{\beta}>1$, where both $y_t$ and $M_{t-1}$ are fully determined by the previously selected coordinate subsets $C_1, C_2, \cdots, C_{t-1}, C_t$.
All coordinates in the selected block incur the same loss, which effectively rewards them for improving the objective and penalizes them for failing to do so. All coordinates that are not selected receive zero loss.% and remain untouched.

Note that $\tilde{\alpha}$ and $\tilde{\beta}$ control the magnitude of the reward and of the penalty, respectively; e.g., for $\tilde{\alpha}=\tilde{\beta}=e$ the losses are
$\ell_{t,i} \in\{-1, 1, 0\}$.
% \begin{align} \label{eq:alpha_beta_loss_ones}
%     \ell_{t,i} =
%     \begin{cases}
% 		-
%         % \frac{1}{\eta}
% 		1 & \text{if } i \in C_t \text{ and } y_t > M_{t-1} \\
% % 		\frac{1}{\eta}
% 		1 & \text{if } i \in C_t \text{ and } y_t \leq M_{t-1} \\
%         0 & \text{if } i \notin C_t
%     \end{cases}
% \end{align}
In practice, however, $\tilde{\alpha}$ is chosen to be larger than $\tilde{\beta}$, since queries that improve the objective are expected to occur less frequently than queries that do not.

The loss received by the forecaster should reflect the same motivation. We therefore average the losses of the individual coordinates in the selected block, so that the size of the block does not matter explicitly: a bigger block should not incur more loss merely because of its size, but only because of its performance. Thus, for each coordinate block $\mathcal{I}_t
\subset\mathcal{I}=\{1,\cdots,D\}$ selected at time step $t$, the loss incurred by the forecaster is
 $L_{t,\mathcal{I}_t} =\frac{1}{|\mathcal{I}_t|} \sum_{i\in\mathcal{I}_t}\ell_{t,i}$.
% \begin{equation}
%     L_{t,\mathcal{I}_t} =\frac{1}{|\mathcal{I}_t|} \sum_{i\in\mathcal{I}_t}\ell_{t,i} = \bar{\ell}_{t,\mathcal{I}_t}
% \end{equation}
This is also the common loss incurred by all the coordinates participating in that block.
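As a small illustration (the block $\{2,5,7\}$ is a hypothetical choice made only for this example), let $\tilde{\alpha}=\tilde{\beta}=e$ and suppose that the selected block $\mathcal{I}_t=\{2,5,7\}$ improves the objective, i.e. $y_t>M_{t-1}$. Then $\ell_{t,2}=\ell_{t,5}=\ell_{t,7}=-1$ and
\[
    L_{t,\mathcal{I}_t}
    =\frac{1}{3}\left(\ell_{t,2}+\ell_{t,5}+\ell_{t,7}\right)
    =\frac{1}{3}\cdot(-3)
    =-1,
\]
which is the same loss that a block of size one would incur under the same outcome.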

At each step, the weight associated with each coordinate is updated by the following multiplicative rule:
 \begin{align}
    \label{eq:multiplicative_weight_update}
    w_{t, i}
	&= 
	w_{t-1, i} \cdot e^{-\eta\ell_{t,i}(C_t; y_t, M_{t-1})}
	\\&=
    w_{t-1, i} \cdot
    \begin{cases}
			\tilde{\alpha}^\eta & \text{if } i \in C_t \text{ and } y_t > M_{t-1} \\
			1/\tilde{\beta}^\eta & \text{if } i \in C_t \text{ and } y_t \leq M_{t-1} \\
            1 & \text{if } i \notin C_t,
	 \end{cases} 
% 	 =
%     w_{t-1, i} \cdot
%     \begin{cases}
% 			\alpha & \text{if } i \in C_t \text{ and } y_t > M_{t-1} \\
% 			1/\beta & \text{if } i \in C_t \text{ and } y_t \leq M_{t-1} \\
%             1 & \text{if } i \notin C_t
% 	 \end{cases}
 \end{align}
which, by setting $\alpha=\tilde{\alpha}^\eta$ and $\beta=\tilde{\beta}^\eta$, yields the update rule in Eq.~(\ref{eq:multiplicative_update}).%(\ref{eq:multiplicative_update}).
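For completeness, the three cases in Eq.~(\ref{eq:multiplicative_weight_update}) can be checked by substituting the loss values of Eq.~(\ref{eq:alpha_beta_loss}) into the exponential factor; the following is only a short verification of that step:
\begin{align*}
    e^{-\eta\cdot(-\log(\tilde{\alpha}))} &= e^{\eta\log(\tilde{\alpha})} = \tilde{\alpha}^{\eta}, \\
    e^{-\eta\cdot\log(\tilde{\beta})} &= \tilde{\beta}^{-\eta} = 1/\tilde{\beta}^{\eta}, \\
    e^{-\eta\cdot 0} &= 1.
\end{align*}
Since $\tilde{\alpha},\tilde{\beta}>1$ and $\eta>0$, the first factor is larger than one and the second is smaller than one, so the weights of the selected coordinates grow after an improvement and shrink otherwise.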
% Denote all possible block sizes by $\mathcal{C}$ and the set of all possible coordinate blocks of size $c\in\mathcal{C}$ by $\mathcal{S}_c$ and its size $|\mathcal{S}_c|=\frac{D!}{c!(D-c)!}={D \choose c}$. Denote by $p_c$ the probability of choosing a certain block size $c\in\mathcal{C}$, such that $p_c\geq 0$ and $\sum_{c\in\mathcal{C}}p_c=1$. 

The probability $\tilde{\pi}_{t,\mathcal{I}_t}$ of selecting a certain coordinate block $\mathcal{I}_t$ is induced by $\pi_t$ as specified next. 
%Assume the threshold values $\{\lambda_t\}_{t=0}^{T-1}$ are given in advance by some oracle and at each time step some oracle $f_t(\cdot)$ yields $f_t(\mathcal{I}_t)=y_t$. %without replacement (explicitly put in Appendix~\ref{sec:regret_analysis}).
Thus the expected cumulative loss of the forecaster is
\[
    L_T= \sum_{t=1}^T\sum_{c\in\mathcal{C}}\sum_{\mathcal{I}_t\in \mathcal{S}_c}\tilde{\pi}_{t,\mathcal{I}_t}\cdot\frac{1}{|\mathcal{I}_t|} \sum_{i\in\mathcal{I}_t}\ell_{t,i},
\]
where $\mathcal{C}$ denotes the set of all possible block sizes and $\mathcal{S}_c$ denotes the set of all possible coordinate blocks of size $c\in\mathcal{C}$.

Let $\mathcal{I}^*$ denote the best coordinate block. The corresponding cumulative loss is
\begin{align*}
%L_T^* &= \sum_{t=1}^T L_{t,\mathcal{I}^*} \nonumber\\
L_T^* &= \sum_{t=1}^T L_{t,\mathcal{I}^*}=\sum_{t=1}^T\frac{1}{|\mathcal{I}^*|} \sum_{i\in \mathcal{I}^*}\ell_{t,i}%(\mathcal{I}^*;f_t(\mathcal{I}^*), \max_{0\leq \tau<t}f_{\tau}(\mathcal{I}^*)). \nonumber
% = \sum_{t=1}^T\bar{\ell}_{t,\mathcal{I}^*} \nonumber
\end{align*}

We hence aim at bounding the regret $\mathcal{R}_T= L_T-L_T^*$. 
%For this purpose we bound the regret with respect to any arbitrary sequence of selected coordinate blocks.

% \textbf{Theorem 1} 
\begin{theorem} 
\label{theo:regret_comb}
Sample from the combinatorial space of all possible coordinate blocks $\mathcal{I}_t \in \bigcup_{c\in\mathcal{C}}\mathcal{S}_c$ with probability 
% $\tilde{\pi}_{t,\mathcal{I}_t} = \sfrac{\prod_{i\in \mathcal{I}_t}
% w_{t,i}^{\sfrac{1}{|\mathcal{I}_t|}}
% % \sqrt[|\mathcal{I}_t|]{w_{t,i}}
%     }
%     {
%   \sum_{c\in\mathcal{C}}\sum_{\mathcal{I}_t\in\mathcal{S}_c}\prod_{j\in \mathcal{I}_t}
%   w_{t,j}^{\sfrac{1}{|\mathcal{I}_t|}}
%     % \sqrt[|\mathcal{I}_t|]{w_{t,j}}
%     }
% $. 
$\tilde{\pi}^{c}_{t,\mathcal{I}_t} =\sfrac{ 
\prod_{i\in \mathcal{I}_t}
\tilde{w}_{t,\mathcal{I}_t}
}{
   \sum_{c\in\mathcal{C}}\sum_{\hat{\mathcal{I}}\in\mathcal{S}_c}\prod_{j\in \hat{\mathcal{I}}}
   \tilde{w}_{t,\hat{\mathcal{I}}}
   }
$. Then the update rule in Eq.~(\ref{eq:multiplicative_update}) with $\alpha=\tilde{\alpha}^\eta$, $\beta=\tilde{\beta}^\eta$ and
$\eta=\log(\tilde{\alpha}\tilde{\beta})^{-1}\sqrt{T^{-1}|\mathcal{C}|D\log(D)}$ yields
 \begin{equation}\label{eq:theorem_1}
     \mathcal{R}_T
     \leq 
      \mathcal{O}\left(\log(\tilde{\alpha}\tilde{\beta})\cdot \sqrt{T|\mathcal{C}|D\log(D)}\right),
 \end{equation}
 where 
%  $\tilde{w}_{t,\mathcal{I}_t}=\sqrt[|\mathcal{I}_t|]{\prod_{i\in \mathcal{I}_t}w_{t,i}}$ 
 $\tilde{w}_{t,\mathcal{I}_t}=\prod_{i\in \mathcal{I}_t}w_{t,i}^{\sfrac{1}{|\mathcal{I}_t|}}$ 
 is the geometric mean of the weights for block $\mathcal{I}_t$.
\end{theorem}
 The upper bound in Eq.~(\ref{eq:theorem_1}) is tight, as the lower bound can be shown to be $\Omega(\sqrt{T\log(N)})$~\cite{haussler1995tight}, where the number of experts in our combinatorial setup is $N=\sum_{c\in\mathcal{C}}|\mathcal{S}_c|\leq D^{|\mathcal{C}|D}$, as typically $|\mathcal{C}|\ll D$.
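 As a short sanity check of how the two bounds relate (a sketch, not part of the formal proof), taking logarithms of $N\leq D^{|\mathcal{C}|D}$ gives
 \[
     \log(N)\leq |\mathcal{C}|D\log(D)
     \quad\Longrightarrow\quad
     \sqrt{T\log(N)}\leq\sqrt{T|\mathcal{C}|D\log(D)},
 \]
 and the right-hand side is exactly the term that appears under the square root in Eq.~(\ref{eq:theorem_1}).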
 
 In practice, the direct sampling policy introduced in Theorem~\ref{theo:regret_comb} is computationally expensive, because the number of coordinate combinations grows exponentially in $D$. CobBO therefore uses an alternative, computationally efficient sampling policy whose cost grows linearly in $D$.
% \textbf{Theorem 2} 
\begin{theorem} \label{theo:regret_without_replacement} Sample a block size $c\in\mathcal{C}$ with probability $p_c$, and then sample $c$ coordinates without replacement according to $\pi_t$. Assume $\mathcal{C}\supset\{1\}$. Then
the update rule in Eq.~(\ref{eq:multiplicative_update}), with $\alpha=\tilde{\alpha}^\eta$, $\beta=\tilde{\beta}^\eta$ and
$\eta=\sqrt{\frac{\log(D)}{T(\log(\tilde{\alpha}\tilde{\beta})^2 -\log(p_1))}}\geq 1$ yields
 \begin{equation}\label{eq:regret_without_replacement}
     \mathcal{R}_T
     \leq 
      \mathcal{O}\left(\sqrt{(\log(\tilde{\alpha}\tilde{\beta})^2 -\log(p_1))} \cdot \sqrt{T\log(D)}\right),
 \end{equation}
 where $p_c>0$ for all $c\in\mathcal{C}$ and $\sum_{c\in\mathcal{C}}p_c=1$.
\end{theorem}
%, e.g., uniformly set $p_c\equiv|\mathcal{C}|^{-1}$.%\sfrac{1}{|\mathcal{C}|}$.
The proof and the detailed sampling policy are given in Appendix~4. %\ref{sec:regret_analysis}. 
The regret upper bound in Eq.~(\ref{eq:regret_without_replacement}) is tight, as the lower bound for an easier setup can be shown to be $\Omega(\sqrt{T\log(D)})$~\cite{haussler1995tight}. 
 The condition $\eta\geq 1$ holds only in settings of high dimension and low query budget, which is precisely the kind of problem CobBO is designed for. 
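 As a sketch of what the constraint $\eta\geq 1$ in Theorem~\ref{theo:regret_without_replacement} requires, note that $\tilde{\alpha}\tilde{\beta}>1$ and $0<p_1\leq 1$, so the term $\log(\tilde{\alpha}\tilde{\beta})^2-\log(p_1)$ is positive. Squaring the expression for $\eta$ and rearranging then gives
 \[
     \eta\geq 1
     \quad\Longleftrightarrow\quad
     T\leq\frac{\log(D)}{\log(\tilde{\alpha}\tilde{\beta})^2-\log(p_1)},
 \]
 so the admissible query budget $T$ grows with the dimension $D$ and shrinks as the reward and penalty magnitudes increase.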
% Moreover, interestingly, although the effective number of combinations of coordinates is $|\mathcal{S}|\leq  |\mathcal{C}|\cdot(D!)$, the regret upper bound in \ref{eq:regret} shows to grow with $\mathcal{O}(\sqrt{\log(D)})$ rather than $\mathcal{O}(\sqrt{\log(|\mathcal{C}|)+\log(D!)})\sim\mathcal{O}(\sqrt{D\log(D)})$ for large $D>>|\mathcal{C}|$ due to the Stirling's approximation~\cite{pearson1924historical}. This is due to adapting the preference probability $\pi_{t,i}$ for each coordinate rather than the one for each possible combination of coordinates $\hat{\pi}_{t,\mathcal{I}_t}$, as the later is derived from the former.
%\textbf{Remark:} 
A similar analysis and similar results follow when incorporating consistent queries from Section~\ref{ss:backoff}, i.e. when sampling a new coordinate block only once every several steps. This is done by effectively performing fewer steps, each with a temporally aggregated loss, as shown in Appendix~4.3.%\ref{sec:regret_analysis_consistent_queries}.