\subsection{Theoretical Analysis} \label{ss:analysis}
One can view our block coordinate selection approach in Section~\ref{ss:block} as a combinatorial mixture-of-experts problem~\cite{cesa2006prediction}, where each coordinate is a single expert and the forecaster aims to choose the best combination of experts in each step. Under this view, we bound the regret of our selection method with respect to the policy of selecting the best (unknown) block of coordinates at each step.

Assume that there is a fixed optimal block of coordinates $C^*$ to pick at every step. Among all possible coordinate blocks, this is the block that improves the objective function most often when Bayesian optimization is performed over the corresponding subspaces. The following particular design of losses:
\begin{align} \label{eq:alpha_beta_loss}
    \ell_{t,i} =
    \begin{cases}
		-\log(\tilde{\alpha}) & \text{if } i \in C_t \text{ and } y_t > M_{t-1} \\
		\log(\tilde{\beta}) & \text{if } i \in C_t \text{ and } y_t \leq M_{t-1} \\
        0 & \text{if } i \notin C_t
    \end{cases} 
    \quad ; \quad 
    \tilde{\alpha},\tilde{\beta}>1
\end{align}
expresses this goal, as all coordinates in the selected block incur the same loss, which rewards them when the query improves the objective and penalizes them when it does not. Coordinates that are not selected receive zero loss, so their weights remain unchanged.

Note that $\tilde{\alpha}$ and $\tilde{\beta}$ set the magnitude of the reward and of the penalty, respectively; e.g., for $\tilde{\alpha}=\tilde{\beta}=e$ the losses are $\ell_{t,i} \in\{-1, 1, 0\}$.
% \begin{align} \label{eq:alpha_beta_loss_ones}
%     \ell_{t,i} =
%     \begin{cases}
% 		-
%         % \frac{1}{\eta}
% 		1 & \text{if } i \in C_t \text{ and } y_t > M_{t-1} \\
% % 		\frac{1}{\eta}
% 		1 & \text{if } i \in C_t \text{ and } y_t \leq M_{t-1} \\
%         0 & \text{if } i \notin C_t
%     \end{cases}
% \end{align}
However, it is preferable to choose $\tilde{\alpha}$ larger than $\tilde{\beta}$, since queries that improve the best observation ($y_t > M_{t-1}$) are expected to occur less often than queries that do not ($y_t \leq M_{t-1}$).

The total loss of the forecaster should reflect the same motivation. We therefore average the losses of the individual coordinates in the selected block, so that the block size does not enter explicitly. For each coordinate block $\mathcal{I}_t
\subset\mathcal{I}=\{1,\ldots,D\}$ selected at time step $t$, the loss incurred by the forecaster is:
\begin{equation}
    L_{t,\mathcal{I}_t} =\frac{1}{|\mathcal{I}_t|} \sum_{i\in\mathcal{I}_t}\ell_{t,i} = \bar{\ell}_{t,\mathcal{I}_t}
\end{equation}
where $\bar{\ell}_{t,\mathcal{I}_t}$ is the common loss incurred by all the coordinates participating in that block.
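
As a small illustration (the block below is hypothetical, and we assume that the selected block $C_t$ in Eq.~\ref{eq:alpha_beta_loss} coincides with $\mathcal{I}_t$), take $\mathcal{I}_t=\{2,5,7\}$ and suppose the query improves the best observation, $y_t > M_{t-1}$. Every coordinate in the block then incurs $\ell_{t,i}=-\log(\tilde{\alpha})$, so that
\begin{align*}
    L_{t,\mathcal{I}_t}
    = \frac{1}{3}\left(\ell_{t,2}+\ell_{t,5}+\ell_{t,7}\right)
    = \frac{1}{3}\cdot 3\cdot\left(-\log(\tilde{\alpha})\right)
    = -\log(\tilde{\alpha}).
\end{align*}
A block of size one under the same outcome yields the same value, which is the sense in which the block size does not enter the forecaster's loss explicitly.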

In each step, the weight associated with each coordinate is updated multiplicatively as follows (setting $\alpha=\tilde{\alpha}^\eta$ and $\beta=\tilde{\beta}^\eta$ recovers the update rule in Eq.~\ref{eq:multiplicative_update}):
 \begin{equation}
    \label{eq:multiplicative_weight_update}
    w_{t, i}
    = 
    w_{t-1, i} \cdot
    \begin{cases}
			\tilde{\alpha}^\eta & \text{if } i \in C_t \text{ and } y_t > M_{t-1} \\
			1/\tilde{\beta}^\eta & \text{if } i \in C_t \text{ and } y_t \leq M_{t-1} \\
            1 & \text{if } i \notin C_t
	 \end{cases} 
	     = w_{t-1, i} \cdot e^{-\eta\ell_{t,i}}
%     \begin{cases}
% 			e^{-\eta\ell_\alpha} & \text{if } j \in C_t \text{ and } y_t > M_{t-1} \\
% 			e^{-\eta\ell_\beta} & \text{if } j \in C_t \text{ and } y_t \leq M_{t-1} \\
%             e^{-\eta\ell_0} & \text{if } j \notin C_t
% 	 \end{cases} 
 \end{equation}
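
For concreteness, consider a worked instance of Eq.~\ref{eq:multiplicative_weight_update}. The numbers are illustrative assumptions, not settings used in our experiments: $D=3$, $\tilde{\alpha}=\tilde{\beta}=e$, $\eta=1/2$, initial weights $w_{t-1,i}=1$, and selected block $C_t=\{1,2\}$. If the query improves the best observation, $y_t > M_{t-1}$, then $\ell_{t,1}=\ell_{t,2}=-1$ and $\ell_{t,3}=0$, so
\begin{align*}
    w_{t,1}=w_{t,2}=e^{-\eta\cdot(-1)}=e^{1/2}\approx 1.65,
    \qquad
    w_{t,3}=e^{0}=1.
\end{align*}
If instead $y_t \leq M_{t-1}$, then $\ell_{t,1}=\ell_{t,2}=1$ and
\begin{align*}
    w_{t,1}=w_{t,2}=e^{-1/2}\approx 0.61,
    \qquad
    w_{t,3}=1.
\end{align*}
In both cases the weight of the unselected coordinate is unchanged, and a larger $\eta$ makes the reward and the penalty more pronounced.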

% This particular choice of losses corresponds to rewarding selected coordinates that eventually improve the objective, penalizing selected coordinates that fail to improve the objective and remaining coordinates that are not selected untouched.

Denote the set of all possible block sizes by $\mathcal{C}$ and the set of all coordinate blocks of size $c\in\mathcal{C}$ by $\mathcal{S}_c$, with $|\mathcal{S}_c|={D \choose c}$. Denote by $p_c$ the probability of choosing block size $c\in\mathcal{C}$, so that $p_c\geq 0$ and $\sum_{c\in\mathcal{C}}p_c=1$. The expected cumulative loss following our update policy is
% Denote all possible block sizes by $\mathcal{C}$ and the set of all possible coordinate blocks by $\mathcal{S}$ and its size $|\mathcal{S}|=\sum_{c\in\mathcal C} {D \choose c}$. The expected cumulative loss following our update policy is
% $$L_T= \sum_{t=1}^T\sum_{\mathcal{I}_t\in \mathcal{S}}\hat{\pi}_{t,\mathcal{I}_t}\cdot\frac{1}{|\mathcal{I}_t|} \sum_{i\in\mathcal{I}_t}\ell_{t,i}$$
$$L_T= \sum_{t=1}^T\sum_{c\in\mathcal{C}}p_c\sum_{\mathcal{I}_t\in \mathcal{S}_c}\hat{\pi}_{t,\mathcal{I}_t}\cdot\frac{1}{|\mathcal{I}_t|} \sum_{i\in\mathcal{I}_t}\ell_{t,i}$$
where the probability $\hat{\pi}_{t,\mathcal{I}_t}$ of selecting a coordinate block $\mathcal{I}_t$ results from sampling according to $\pi_t$ without replacement (it is given explicitly in Appendix~\ref{sec:regret_analysis}).

Assume the best coordinate block is $C^*$, with corresponding cumulative loss:
$$L_T^*= \sum_{t=1}^T L_{t,\mathcal{I}_t}=\sum_{t=1}^T\frac{1}{|\mathcal{I}_t|} \sum_{i\in\mathcal{I}_t}\ell_{t,i} = \sum_{t=1}^T\bar{\ell}_{t,\mathcal{I}_t}$$

We aim to bound the regret $Regret_T = L_T-L_T^*$. To this end, we bound the regret with respect to an arbitrary sequence of selected coordinate blocks.

\textbf{Lemma 1} Assume $1\in\mathcal{C}$ and $p_1>0$. For $\eta >0$ and non-negative losses $\ell_{t,i}\geq 0$, the update rule in (\ref{eq:multiplicative_weight_update}) satisfies, for any block of coordinates $C^*$:
\begin{equation}\label{eq:lemma_1}
Regret_T
% =L_T-L^*_T 
% =
% \sum_{t=1}^T\sum_{\mathcal{I}_t\in \mathcal{S}}\hat{\pi}_{t,\mathcal{I}_t}\cdot\frac{1}{|\mathcal{I}_t|} \sum_{i\in\mathcal{I}_t}\ell_{t,i} - \sum_{t=1}^T \ell^*_t
%  \sum_{t=1}^T\sum_{c\in\mathcal{C}}p_c\sum_{\mathcal{I}_t\in \mathcal{S}_c}\hat{\pi}_{t,\mathcal{I}_t}\cdot\frac{1}{|\mathcal{I}_t|} \sum_{i\in\mathcal{I}_t}\ell_{t,i} - \sum_{t=1}^T \ell^*_t
\leq 
\eta\sum_{t=1}^T\sum_{c\in\mathcal{C}}p_c\sum_{\mathcal{I}_t\in \mathcal{S}_c}\hat{\pi}_{t,\mathcal{I}_t}\cdot \left(\frac{1}{|\mathcal{I}_t|}\sum_{i\in\mathcal{I}_t}\ell_{t,i}\right)^2 + \frac{\log(D)}{\eta} -\frac{T\log(p_1)}{\eta}
\end{equation}
% where $p_1$ is the probability of sampling a block of size $1$. 
We thus obtain the following regret bound (the proofs are in Appendix~\ref{sec:regret_analysis}):\\
\textbf{Theorem 1} The update in Eq.~\ref{eq:multiplicative_update}, with $\alpha=\tilde{\alpha}^\eta$, $\beta=\tilde{\beta}^\eta$ and
$\eta=\sqrt{\frac{\log(D)}{T(\log(\tilde{\alpha}\tilde{\beta})^2 -\log(p_1))}}\geq 1$ yields:
 \begin{equation}\label{eq:regret}
     Regret_T 
     \leq 
      \mathcal{O}\left(\sqrt{(\log(\tilde{\alpha}\tilde{\beta})^2 -\log(p_1))} \cdot \sqrt{T\log(D)}\right)
 \end{equation}
 The regret upper bound in Eq.~\ref{eq:regret} is tight, as the lower bound for an easier setup can be shown to be $\Omega(\sqrt{T\log(D)})$~\cite{haussler1995tight}. 
 The requirement $\eta\geq 1$ holds only in settings of very high dimensionality and low query budget, which is precisely the kind of problem CobBO is designed for. 
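
The following sketch indicates one way in which Theorem 1 can be read off Lemma 1; the full proof is in Appendix~\ref{sec:regret_analysis} and may proceed differently. We read $\log(\tilde{\alpha}\tilde{\beta})^2$ as $\left(\log(\tilde{\alpha}\tilde{\beta})\right)^2$, and we assume that the probabilities $p_c\,\hat{\pi}_{t,\mathcal{I}_t}$ sum to one over all blocks at each step. Since $\tilde{\alpha},\tilde{\beta}>1$, every loss in Eq.~\ref{eq:alpha_beta_loss} satisfies $|\ell_{t,i}|\leq\log(\tilde{\alpha}\tilde{\beta})$, so each squared block-averaged loss in Eq.~\ref{eq:lemma_1} is at most $\log(\tilde{\alpha}\tilde{\beta})^2$. Using $\eta\geq 1$ and $\log(p_1)\leq 0$, which give $-\log(p_1)/\eta\leq-\eta\log(p_1)$, we get
\begin{align*}
    Regret_T
    &\leq \eta T\log(\tilde{\alpha}\tilde{\beta})^2 + \frac{\log(D)}{\eta} - \frac{T\log(p_1)}{\eta} \\
    &\leq \eta T\left(\log(\tilde{\alpha}\tilde{\beta})^2 - \log(p_1)\right) + \frac{\log(D)}{\eta}.
\end{align*}
The right-hand side is minimized by the value of $\eta$ stated in Theorem 1, at which it equals
\begin{align*}
    2\sqrt{T\log(D)\left(\log(\tilde{\alpha}\tilde{\beta})^2 - \log(p_1)\right)}.
\end{align*}
Under this reading, the condition $\eta\geq 1$ is equivalent to $\log(D)\geq T\left(\log(\tilde{\alpha}\tilde{\beta})^2 - \log(p_1)\right)$, that is, the dimension $D$ must be large relative to the query budget $T$.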
% Moreover, interestingly, although the effective number of combinations of coordinates is $|\mathcal{S}|\leq  |\mathcal{C}|\cdot(D!)$, the regret upper bound in \ref{eq:regret} shows to grow with $\mathcal{O}(\sqrt{\log(D)})$ rather than $\mathcal{O}(\sqrt{\log(|\mathcal{C}|)+\log(D!)})\sim\mathcal{O}(\sqrt{D\log(D)})$ for large $D>>|\mathcal{C}|$ due to the Stirling's approximation~\cite{pearson1924historical}. This is due to adapting the preference probability $\pi_{t,i}$ for each coordinate rather than the one for each possible combination of coordinates $\hat{\pi}_{t,\mathcal{I}_t}$, as the later is derived from the former.