\section{Introduction}
\label{section:introduction}
Online slate bandit problems provide a popular framework for modeling decision-making scenarios where multiple items must be selected in each round.  A slate consists of multiple slots, each with its own pool of candidate items, which may change over time. In each round, the learner selects one item per slot, thereby forming a slate.  A single reward drawn from a logistic model with unknown parameters is then received for the entire slate. The learner's objective is to adaptively optimize their slate selection policy to maximize the cumulative reward (or equivalently, minimize the cumulative regret) over time.  Online slate bandits naturally model various real-world applications.  A prominent example is landing page optimization \citep{Hill_2017}, where the goal is to optimize the selection of components for each part of a landing page to maximize conversions.  Another important application is the automatic optimization of advertising creatives \citep{Chen2021}, which requires advertisers to automatically compose ads from multiple elements, such as product images, text descriptions, and titles. Beyond these practical applications, slate bandits have been extensively studied in the academic literature, leading to the development of many interesting algorithms in diverse settings \citep{Kale2010, Dimakopoulou2019, Rhuggenaath2020}.

Although considerable progress has been made on a variety of online slate bandit settings, significant challenges remain that limit the applicability of these algorithms. In applications such as those mentioned above, the learner has access at each round to contextual information (such as the user query, user history, or demographics) that influences the set of available items per slot. To the best of our knowledge, the current literature focuses heavily on the non-contextual (fixed arms\footnote{We use the terms arms and actions interchangeably.}) setting, i.e., existing methods do not assume access to such contexts and therefore keep the set of items unchanged over time. Another limitation is that most prior work assumes that the reward of a slate is a function (known or unknown) of the rewards of the items in the slate, which are themselves either adversarially chosen or stochastic but disjoint from each other (i.e., each item's reward comes from a different distribution). This assumption neglects the inherent similarities between items. A more realistic approach is to assume a unified parametric reward model shared across all slates. Such a model allows the learner to leverage shared information, significantly simplifying the learning process. Specifically, for binary rewards, models based on the logistic or probit function can effectively capture the reward structure.
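
To make the shared-parameter idea concrete, one possible formalization is sketched here. The notation ($x_t^i$, $x_t$, $\theta^\star$, $\mu$) and the choice to concatenate item features are assumptions introduced purely for illustration; they need not match the formal setup used later in the paper. Suppose the item chosen for slot $i$ at round $t$ has a feature vector $x_t^i \in \mathbb{R}^d$, and let $x_t = (x_t^1, \ldots, x_t^N) \in \mathbb{R}^{dN}$ denote the concatenated features of the slate. A logistic reward model then posits a binary reward $r_t \in \{0, 1\}$ with
\[
    \Pr(r_t = 1 \mid x_t) = \mu\bigl(\langle x_t, \theta^\star \rangle\bigr), \qquad \mu(z) = \frac{1}{1 + e^{-z}},
\]
where $\theta^\star \in \mathbb{R}^{dN}$ is unknown and common to all slates. Because every slate's reward depends on the same $\theta^\star$, feedback on one slate is informative about any other slate with related features. A probit model would replace $\mu$ with the standard normal cumulative distribution function.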

A third, and equally important, limitation is the prevalent focus on the semi-bandit feedback setting. This setting provides separate reward feedback for each item within a selected slate. However, many practical applications (e.g., the ad creatives problem \citep{Chen2021}) offer only a single, slate-level reward (i.e., bandit feedback). Although there are some methods for converting bandit feedback to semi-bandit feedback \citep{Dimakopoulou2019}, these are often heuristic and lack theoretical guarantees. The item-level feedback in the semi-bandit setting facilitates per-slot exploration and exploitation, enabling the development of algorithms with $N^{O(1)}$ per-round complexity \citep[e.g.,][]{Kale2010, Rhuggenaath2020} by avoiding explicit iteration over the entire slate space. It remains unclear how to achieve similar efficiency in the more challenging bandit feedback setting. For example, directly applying state-of-the-art bandit algorithms \citep{Lattimore2020} to the slate bandit problem (treating slates as arms) and selecting a slate by iterating through the $2^{\Omega(N)}$-sized set of all possible slates results in exponential per-round time complexity.
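
As a rough illustration of this gap, suppose (hypothetically) that every one of the $N$ slots offers the same number $K$ of candidate items; $K$ is notation introduced only for this example. The number of distinct slates is then $K^N$, whereas a procedure that scores candidates slot by slot would touch only $NK$ items per round:
\[
    N = 10,\ K = 5: \qquad K^N = 5^{10} = 9{,}765{,}625 \quad \mbox{versus} \quad NK = 50.
\]
For any $K \geq 2$, the slate count is at least $2^N$, consistent with the $2^{\Omega(N)}$ size noted above, so even modest values of $N$ make slate-by-slate enumeration impractical.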

Motivated by these challenges, our work introduces efficient and optimal algorithms for the logistic contextual slate bandit problem under bandit feedback, assuming time-varying item features and rewards generated from a global logistic model.  We make the following contributions.   

\subsection{Our Contributions}
\label{subsection:our-contributions}

\begin{enumerate}
    \item We propose two new algorithms, \slateglincb\ and \slateglincbts, that solve the logistic contextual slate bandit problem under bandit feedback. While \slateglincb\ is based on the OFU (Optimism in the Face of Uncertainty) paradigm, \slateglincbts\ follows the Thompson Sampling (TS) paradigm. Under a diversity assumption (Assumption~\ref{assumption: diversity}), we prove that \slateglincb\ incurs a regret of $\Tilde{O}(dN\sqrt{T})$ with high probability. Here, $d$ is the dimensionality of the items in the slate, $N$ is the number of slots, and $T$ is the total number of rounds for which the algorithm is run. Both algorithms explore and exploit at the slot level and thus have a per-round time complexity that grows polynomially in $N$ and $\log T$, making them feasible in practice.

    \item We also propose \slateglincbtsfixed, a fixed-arm version of the \slateglincbts\ algorithm for the non-contextual (fixed-arm) setting. Using an assumption similar to Assumption~\ref{assumption: diversity}, we prove an $O(d^{3/2}N^{3/2} \sqrt{T})$ regret guarantee for \slateglincbtsfixed. Like \slateglincbts, \slateglincbtsfixed\ explores and exploits at the slot level and has per-round complexity polynomial in $N$ and $\log T$.
    
    \item We perform extensive experiments to validate the performance of our algorithms in both the contextual and the non-contextual settings. Across a wide range of randomly selected instances, \slateglincb\ incurs the least regret among all baselines, while \slateglincbts\ and \slateglincbtsfixed\ are competitive with other state-of-the-art algorithms. We also evaluate the maximum and average per-round time complexity of our algorithms and compare them with those of the baselines. In most cases, our algorithms are exponentially faster than all baselines.

    \item Finally, we use \slateglincb\ to select in-context examples for tuning prompts of language models applied to binary classification tasks. We perform experiments on two datasets, \emph{SST2} and \emph{Yelp Review}, and achieve a competitive test accuracy of $\sim 80\%$, making \slateglincb\ a possible alternative in practical prompt-tuning scenarios.
\end{enumerate}


\subsection{Related Work}
\label{subsection:related-work}
Online slate bandits have received significant attention due to their wide applicability in areas such as recommendation and advertising \citep{Hill_2017, Chen2021, Dimakopoulou2019}. However, only a few theoretical studies provide regret guarantees \citep{Kale2010, Rhuggenaath2020}. While these papers make progress on the slate bandit problem, they neither address the contextual setting nor accommodate bandit feedback, both of which are the main motivations of our work. Theoretical analysis might be feasible for the Thompson Sampling approach of \citet{Dimakopoulou2019}, but proving optimal guarantees might still be hard, since their algorithm assigns equal rewards to all slots in order to maintain slot-level policies for efficiency. Nevertheless, we acknowledge that in our experiments (Section~\ref{section:experiments}), for the fixed-arm setting, we found their algorithm to be quite competitive with ours on the instances we considered.

One way of achieving optimal regret guarantees for the slate bandit problem is to reduce it to the canonical logistic bandit problem by treating each candidate slate as a separate arm and then using state-of-the-art algorithms \citep{Faury2020, Abeille2021, Faury2022}. While these algorithms do achieve optimal ($\kappa$-free) regret, they are infeasible in practice: during the arm selection step, they iterate through all the arms, which form a $2^{\Omega(N)}$-sized set, thereby incurring exponential time per round. Even though these algorithms are inefficient for the slate bandit problem, we combine some of their key ideas with an efficient planning approach to design our algorithms. In Section~\ref{section:experiments}, we demonstrate that our algorithms perform better than these state-of-the-art logistic bandit algorithms in both regret and time complexity when applied to a wide variety of slate bandit instances.


Recently, a number of works \citep{Swaminathan2017, Kiyohara2024, Vlassis2021} have studied the slate bandit problem in the off-policy setting, where a dataset collected using some base policy is used to find optimal slate bandit policies. While these works have made significant progress on both the theoretical and practical sides, they are not directly relevant to our work, since we focus exclusively on the online setting.


% There exist other methods for Slate Bandits, such as \cite{Swaminathan2017}, who develop algorithms to learn offline-policies to choose slates, and \cite{Hill_2017}, who use the idea of hill-climbing to develop solutions for applications such as ad placements and landing page optimizations. The aim of this work to address all these limitations and develop computationally efficient algorithms for slate bandits with a logistic reward model in the Bandit Feedback setting.