\section{Experiments}
\label{section:experiments}
In this section, we perform a wide range of synthetic (\textbf{Experiments 1,2,3})
% \footnote{The code can be found \href{https://github.com/sinhagaur88/Slate-Online-Logistic-Regression.git}{here}.} 
and real-world experiments (\textbf{Experiment 4})
% \footnote{The code can be found \href{https://github.com/tanmaygoyal258/Prompt_Optimization_Slate_Bandits.git}{here}.} 
to demonstrate the empirical performance of our algorithms \slateglincb, \slateglincbts\ and \slateglincbtsfixed. Details of each experiment are given in the respective paragraphs.\footnote{The code for the experiments can be found at \url{https://github.com/tanmaygoyal258/Logistic_Slate_Bandits.git} and \url{https://github.com/tanmaygoyal258/Prompt_Optimization_Slate_Bandits.git}.}



\begin{figure*}
	\centering
	\begin{subfigure}[b]{0.65\columnwidth}  
		\centering 
		\includegraphics[width=56mm]{Plots/Finite_Contexts.pdf}
		\caption{{\small Regret vs.\ $T$: Finite Context Setting}}   
		\label{fig:finite-context-logistic}
	\end{subfigure}
	%\vskip\baselineskip
	\hfill
	\begin{subfigure}[b]{0.65\columnwidth}   
		\centering 
	\includegraphics[width=56mm]{Plots/Infinite_Contexts.pdf}
		\caption{{\small Regret vs.\ $T$: Infinite Context Setting}}   
		\label{fig:infinite-context-logistic}
	\end{subfigure}
	\hfill
	\begin{subfigure}[b]{0.65\columnwidth}
		\centering
		\includegraphics[width=56mm]{Plots/non_contextual.pdf}
		\caption{{\small Regret vs.\ $T$: Fixed-Arm Setting}}     
		\label{fig:non-contextual-logistic}
	\end{subfigure}
	\vskip\baselineskip
	\begin{subfigure}[b]{0.65\columnwidth}  
		\centering 
		\includegraphics[width=56mm]{Plots/per_round_time_average.pdf}
		\caption{{\small Average running time (per-round)}}   
		\label{fig:average-time-per-round}
	\end{subfigure}
	%\vskip\baselineskip
	\hfill
	\begin{subfigure}[b]{0.65\columnwidth}   
		\centering 
		\includegraphics[width=56mm]{Plots/per_round_time_max.pdf}
				\caption{{\small Maximum running time (per-round)}}   
		\label{fig:maximum-time-per-round}
	\end{subfigure}
	\hfill
	\begin{subfigure}[b]{0.65\columnwidth}
		\centering
		\includegraphics[width=56mm]{Plots/prompt_opt.pdf}
		\caption[observationalAlgo]%
        {{\small Accuracy vs.\ $T$: Prompt Optimization}}   
		\label{fig:prompt_opt}
	\end{subfigure}
	\caption{Regret, per-round running time, and prompt-optimization accuracy for Experiments 1--4.}
    \label{fig:Plots}
    %{\small Simple Regret} 
\end{figure*}


\textbf{Experiment 1 ($R(T)$ vs.\ $T$, Contextual Setting):} In this experiment, we compare our algorithms \texttt{Slate-GLM-OFU} and \texttt{Slate-GLM-TS} to their counterparts \texttt{ada-OFU-ECOLog} (Algorithm 2, \cite{Faury2022}) and \texttt{TS-ECOLog} (Section D.2, \cite{Faury2022}). These are the only logistic bandit algorithms that achieve optimal ($\kappa$-free) regret and are also computationally efficient ($O((\log T)^2)$ per-round time complexity). We perform experiments in the following two settings.

\textbf{Finite Contexts:} We assume the contexts come from the set $\C = \{1,\ldots,C\}$. For each $c\in \C$ and $i\in [N]$, a set of items $\X^{i,c}$ is constructed beforehand by randomly sampling $K$ vectors from the $d$-dimensional ball of radius $1/\sqrt{N}$. At each round $t$, a context $c$ is sampled uniformly at random from $\C$ and the sets $\X^{1,c}, \ldots , \X^{N,c}$ are presented to the learner.

\textbf{Infinite Contexts:} At each round $t\in [T]$ and for each slot $i\in [N]$, the set $\X_t^i$ is constructed by randomly sampling $K$ vectors from the $d$-dimensional ball of radius $1/\sqrt{N}$. The learner is then presented with the sets $\X_t^1, \ldots, \X_t^N$.

For the finite context setting, we fix $C=5$. For both settings, we fix the number of slots to $N=3$, the number of items per slot to $K=5$, and the dimension of the item features to $d=5$. To simulate the reward, we select $\bm\theta^\star$ by randomly sampling from $[-1,1]^{15}$. We run the algorithms while varying the time horizon $T$ in $\{1000,5000,10000,15000, 20000\}$. For each $T$, we average the regret obtained at the end of $T$ rounds over 20 different seeds used to sample the rewards. The results for the finite and infinite context settings are shown in Figures \ref{fig:finite-context-logistic} and \ref{fig:infinite-context-logistic}, respectively. In both instances, \texttt{Slate-GLM-OFU} performs the best, while \texttt{Slate-GLM-TS} performs on par with \texttt{TS-ECOLog}. Further, in Section \ref{appendix:experiments} of the appendix, we report the average results along with two standard deviations.
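
As a brief sanity check on the radius $1/\sqrt{N}$, the following sketch assumes that the feature vector of a slate is the concatenation of its $N$ chosen item vectors (our reading of the setup, consistent with $\bm\theta^\star$ having $dN = 15$ coordinates). Writing $\bm{x}^i$ for the item chosen in slot $i$ and $\bm{x} = (\bm{x}^1,\ldots,\bm{x}^N)$ for the resulting slate feature, we get
\[
\|\bm{x}\|_2^2 \;=\; \sum_{i=1}^{N} \|\bm{x}^i\|_2^2 \;\le\; N \cdot \frac{1}{N} \;=\; 1,
\]
so every slate feature lies in the unit ball in $dN = 15$ dimensions. If, in addition, the link is logistic (again an assumption made only for this illustration), the simulated reward for a slate with feature $\bm{x}_t$ would be drawn as $r_t \sim \mathrm{Bernoulli}\big(\mu(\langle \bm{x}_t, \bm\theta^\star\rangle)\big)$ with $\mu(z) = 1/(1+e^{-z})$.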

\textbf{Experiment 2 (Per-Round Time vs.\ $N$):} In this experiment, we compare the average and maximum time taken (per round) by our algorithms \texttt{Slate-GLM-OFU} and \texttt{Slate-GLM-TS} with those of their counterparts \texttt{ada-OFU-ECOLog} and \texttt{TS-ECOLog} \citep{Faury2022}, respectively.\footnote{The per-round time is calculated as the sum of the per-round pull and per-round update times.} For this comparison, we vary the number of slots $N$ in the set $\{3,\ldots , 6\}$. The number of items $(K = |\mathcal{X}^i_t|)$ per slot is fixed to $7$ and the dimension $d$ of each item is fixed to $5$. The item features are selected by randomly sampling from $[-1,1]^5$ and normalized to have norm $1/\sqrt{N}$. For each $N\in \{3,4,5,6\}$, we select a different reward parameter vector $\bm\theta^\star$ by randomly sampling from $[-1,1]^{5N}$. Note that the number of possible slates is $K^N$ and thus, varying $N$ in $\{3,4,5,6\}$ results in $343$, $2401$, $16807$, and $117649$ slates, respectively. We perform this experiment in the infinite context setting (see \textbf{Experiment 1} for details). We run all the algorithms for $T = 1000$ rounds and average the results over 10 different seeds for sampling rewards. We report the average per-round running time in Figure \ref{fig:average-time-per-round} and the maximum per-round running time in Figure \ref{fig:maximum-time-per-round}. As expected, we observe much lower running times for \slateglincb\ and \slateglincbts\ compared to their counterparts. Moreover, the plots indicate exponential growth in the per-round running time for both \texttt{ada-OFU-ECOLog} and \texttt{TS-ECOLog}. Further, there is a significant gap between the maximum and average per-round times of \texttt{Slate-GLM-OFU} and \texttt{Slate-GLM-TS}, implying that the actual per-round time for these algorithms is generally much lower than the maximum.
In Section \ref{appendix:experiments} of the appendix, we report the results with two standard deviations, along with each algorithm's average time for choosing an arm to pull and for updating its parameters separately.
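
To put these slate counts in perspective, the following back-of-the-envelope comparison assumes, purely for illustration and not as a statement about the exact implementations, that a slate-level method scores all $K^N$ slates in a round, whereas a slot-wise method scores the $K$ items of each slot separately, i.e., $NK$ items in total. With $K = 7$:
\[
\begin{array}{c|cccc}
N & 3 & 4 & 5 & 6 \\ \hline
K^N & 343 & 2401 & 16807 & 117649 \\
NK & 21 & 28 & 35 & 42
\end{array}
\]
Under this assumption, the ratio $K^N/(NK)$ grows from roughly $16$ at $N=3$ to roughly $2801$ at $N=6$.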

\textbf{Experiment 3 ($R(T)$ vs.\ $T$, Non-Contextual Setting):} In this experiment, we compare our algorithms \texttt{Slate-GLM-OFU}, \texttt{Slate-GLM-TS}, and \texttt{Slate-GLM-TS-Fixed} (Algorithm \ref{algo:TS-Fixed}, Appendix \ref{appendix:ts-algos}) to a number of state-of-the-art baseline algorithms in the non-contextual setting, i.e., the set of candidate slates remains fixed throughout the course of the algorithm.
As in the previous experiments, our baselines include \texttt{ada-OFU-ECOLog} and \texttt{TS-ECOLog} from \cite{Faury2022}. For the non-contextual setting, we also include other state-of-the-art baselines, namely the \texttt{MPS} algorithm (Algorithm 3, \cite{Dimakopoulou2019}) and the \texttt{Ordered Slate Bandit} algorithm (Figure 3, \cite{Kale2010}). The latter is designed for semi-bandit feedback, and hence we adapt it to the bandit feedback setting as explained in Appendix \ref{appendix:experiments}. We fix the number of slots $N$ to $3$ and the number of items in each slot $K = |\mathcal{X}^i_t|$ to $5$. The dimension $d$ of the items for each slot is fixed to $5$. The items for each slot are randomly sampled from $[-1,1]^5$ and normalized to have norm $1/\sqrt{3}$, while $\thetastar$ is randomly sampled from $[-1,1]^{15}$ and normalized. We run all the algorithms for $T \in \{1000,5000,10000,20000,30000,40000,50000\}$ rounds and average the results over 50 different seeds for sampling rewards. The results are shown in Figure \ref{fig:non-contextual-logistic}. We see that \texttt{Slate-GLM-OFU} has the best performance, with \texttt{MPS} being the only algorithm with comparable performance. Also, \texttt{Slate-GLM-TS} performs worse than \texttt{ada-OFU-ECOLog} and \texttt{MPS} while being on par with \texttt{TS-ECOLog}. In Section \ref{appendix:experiments}, we report the average results with two standard deviations, which also show that \texttt{MPS} exhibits high variance and is hence less reliable in practice.


\textbf{Experiment 4 (Prompt Tuning):} In this experiment, we apply our contextual slate bandit algorithm \slateglincb\ to select in-context examples for tuning prompts of Language Models, applied to binary classification tasks. Typically, for such applications, a labeled training set of (input query, output label) pairs is used to learn policies for editing different parts of the prompt (instruction, in-context examples, verbalizers) \citep{Tempera2022} depending on a provided test input query. To simplify our task, we fix the instruction and the verbalizer and only select $N$ in-context examples from an available pool of $K$ examples. There are $N$ available positions (slots) in the prompt. Given a test input query (context), we create context-dependent features for the $K$ pool examples and independently select one (with repetition) per slot. This matches the contextual slate bandit problem setting (see Section \ref{section:preliminaries}) and therefore \slateglincb\ can be applied. We experiment on a sampled subset of size $5000$ from each of two popular sentiment analysis datasets, \emph{SST2} and \emph{Yelp Review}. We randomly order each subset and use roughly $80\%$ of the points ($4128$ for \emph{SST2}, $4000$ for \emph{Yelp Review}) for ``warm-up'' training and the remaining points (roughly $20\%$) for testing. Like most prompt tuning experiments \citep{Tempera2022}, we report our results only on the test set; however, our algorithm continues to learn throughout the $5000$ rounds. The warm-up rounds help us start with a good estimate of the hidden reward parameter vector. We fix $N = 4$ and vary $K$ in the set $\cbrak{8,16,32}$. All the slots choose an example from the same $K$-sized example pool.
At each round, given an input query $\bm{q}$ that needs to be solved, the item feature for each in-context example $\bm{e}=(\bm{x}, y)$ is constructed by embedding each of $\bm{q}$, $\bm{x}$, and $y$ into 64 dimensions \citep{nussbaum2024nomic} and concatenating them, thereby resulting in a $192$-dimensional item feature vector. After selecting the $4$ items (slate), the resulting prompt (also containing the input query $\bm{q}$) is passed through the RoBERTa \citep{Zhuang2021} model and a possible answer for $\bm{q}$ is generated. Hence, we are learning to choose the best in-context examples for RoBERTa. At each round, we use GPT-3.5-Turbo to provide feedback (binary, $0$ or $1$) for the generated answer. This is treated as the reward for the chosen slate and utilized by the rest of the \slateglincb\ algorithm.
Figure \ref{fig:prompt_opt} shows the increase in cumulative accuracy as we sequentially proceed through the $5000$ data points in the \emph{Yelp Review} dataset. The data points to the left of the dotted blue line are the warm-up points and those to the right are the test points. We can see that the cumulative accuracy increases consistently as we sequentially proceed through the points. Also, on the test set, the accuracy stays well above $80\%$.
Table \ref{table:prompt_opt} reports the test accuracy for both datasets as $K$ varies in the set $\{8,16,32\}$. The cumulative test accuracies for \slateglincb\ are much higher than those of the Random Allocation baseline, where each in-context example is chosen randomly and no learning is performed. Also, the accuracy generally increases with the pool size, since a larger pool may contain better examples. We do see a small dip for the \emph{Yelp Review} dataset when $K$ increases from $16$ to $32$ and hypothesize that this may be due to more exploration.
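
For reference, the sizes involved in this experiment follow from simple arithmetic. The slate-level quantities below assume, as our reading of the setup rather than a statement from the implementation, that the slate feature concatenates the features of its $N$ chosen items:
\[
\begin{array}{ll}
\text{item feature dimension:} & 3 \times 64 = 192, \\
\text{slate feature dimension (assumed):} & N \times 192 = 4 \times 192 = 768, \\
\text{number of slates } K^N \text{ with } N=4\text{:} & 8^4 = 4096,\quad 16^4 = 65536,\quad 32^4 = 1048576, \\
\text{test points:} & 5000 - 4128 = 872 \ (\text{SST2}),\quad 5000 - 4000 = 1000 \ (\text{Yelp Review}).
\end{array}
\]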


\begin{table}[H]
\centering
\def\arraystretch{1.0}%
\resizebox{\columnwidth}{!}{
\begin{tabular}{ccccc}
\hline
\multirow{2}{*}{\begin{tabular}[c]{@{}c@{}}Pool \\Size\end{tabular}} & \multicolumn{2}{c}{\textbf{SST2}}                                                                                                   & \multicolumn{2}{c}{\textbf{Yelp Review}}                                         \\ \cline{2-5} 
& \multicolumn{1}{c}{Random}            & \multicolumn{1}{c}{\texttt{Slate-GLM-OFU}}           & \multicolumn{1}{c}{Random}            & \multicolumn{1}{c}{\texttt{Slate-GLM-OFU}} \\ \hline
\multicolumn{1}{c}{8}           & \multicolumn{1}{c}{54.22}               & \multicolumn{1}{c}{69.15} & \multicolumn{1}{c}{62.90}               & \multicolumn{1}{c}{74.00} \\ 
\multicolumn{1}{c}{16}           & \multicolumn{1}{c}{54.46}               & \multicolumn{1}{c}{80.96} & \multicolumn{1}{c}{63.30}               & \multicolumn{1}{c}{82.50} \\ 
\multicolumn{1}{c}{32}           & \multicolumn{1}{c}{53.82}               & \multicolumn{1}{c}{81.42} & \multicolumn{1}{c}{62.00}               & \multicolumn{1}{c}{79.50} \\ \hline
\end{tabular}
}
\caption{Prompt Tuning Test Accuracy (\%)}
\label{table:prompt_opt}
\end{table}