\section{Specialized results for contextual bandits}\label{sec:contextual}
This section specializes our general result to contextual bandit problems, like Example \ref{ex:contextual-bandit}.  
In such problems, the decision space $\mathcal{A}$ is a set of rules that map contexts to `arms.'  
Naive bounds depend on the entropy rate of the optimal decision rule, $\bar{H}(A^*)$, and, like Corollary \ref{cor:k-armed}, on the cardinality of the set of possible decision rules, $|\mathcal{A}|$. 
Leveraging the structure of the problem, we strengthen this result in two ways: (1) our bound depends on the conditional entropy rate of the optimal arm process, and (2) it depends at most on the number of arms, rather than on the number of decision rules. 
To facilitate comparisons with past work, we state a finite time regret bound in unchanging environments (where $\theta_1=\theta_2 \ldots$) in Appendix {\bf (ADD THIS)}. 
The analysis appears to resolve an open question raised by \citet{neu22} concerning the proper application of information-ratio analysis to contextual problems.  


\subsection{General result}
Contextual bandit problems are a special case of our formulation, satisfying the following abstract assumption. 
Example \ref{ex:contextual-bandit} provides intuition. 
\begin{assumption}\label{assumption:contextual}
	There is a set $\mathcal{X}$ and integer $k$ such that $\mathcal{A}$ is the set of functions mapping $\mathcal{X}$ to $[k]$. The observation at time $t$ is the tuple $O_t =(X_t, R_t) \in \mathcal{X}\times \mathbb{R}$.  
	Define $i_t \equiv A_t(X_t)\in [k]$ and $i^*_t \equiv A^*(X_t)\in [k]$. 
	Assume that for each $t$, $X_{t+1} \perp (A_t, R_t) \mid X_t, \mathcal{F}_{t-1}$ and $R_t \perp A_t \mid (X_t, i_t)$. 
\end{assumption}

Theorem \ref{thm:main-result} bounds the regret rate of a policy $\pi$ in terms of its limiting information ratio $\Gamma(\pi)$ and the mutual information rate $\bar{I}(A^*; \mathcal{F})$. 
Under Assumption \ref{assumption:contextual}, we obtain a bound on the information ratio that depends on the number of arms $k$ but is independent of the number of contexts and of the size of the decision space, $|\mathcal{A}|$.
We bound the information rate by the conditional entropy rate of the optimal arm process.

\begin{theorem}\label{thm:contextual}
	Under Assumption \ref{assumption:contextual},
	\[
	\Gamma(\pi^{\rm TS})\leq k  \quad \text{and} \quad \bar{I}(A^*; \mathcal{F})  \leq  \bar{H}\left( i^* \mid X \right).
	\]
\end{theorem}
The next result combines the information ratio bound above with the earlier bound of Theorem \ref{thm:effective-horizon-bound}. 
The bound depends on the number of arms, the dimension of the parameter space, and the effective time horizon. 
No further structural assumptions (e.g., linearity) are needed. 
An unfortunate feature of the result is that it applies only to parameter vectors that are quantized at scale $\epsilon$, with a logarithmic dependence on $1/\epsilon$. 
We believe this could be removed with more careful analysis. 
\begin{corollary} Under Assumption \ref{assumption:contextual}, 
 if $\theta_t \in \{ -1, -1+\epsilon, \ldots, 1-\epsilon, 1 \}^{p}$ is a discretized $p$-dimensional vector, and the optimal policy process $(A^*_t)_{t\in \mathbb{N}}$ is stationary, then 
 \[
 \Delta(\pi^{\rm TS}) \leq \tilde{O}\left( \sigma \sqrt{ \frac{p \cdot k}{\tau_\textup{eff}} } \right)
 \]
 where $\tilde{O}(\cdot)$ omits logarithmic factors. 
\end{corollary}



\subsection{Conditional entropy rate and VC dimension}\label{subsec:policy_complexity}
The next result provides a bound on the conditional entropy rate that depends on (1) a coarse measure of the effective time horizon, and (2) the Vapnik–Chervonenkis (VC) dimension of the class of decision rules. 
The concept of VC dimension plays a key role in the theory of classification. \citet{beygelzimer2011contextual} were the first to provide regret bounds for contextual bandit problems that depend on the VC dimension of the policy class. 
Here, we show that bounds in terms of VC dimension can be obtained through information-ratio analysis. 
It is worth emphasizing that such bounds apply not to the entropy rate of the decision rule, $\bar{H}(A^*)$, but only to the conditional entropy rate of the optimal arm process, $\bar{H}(i^* \mid X)$. 
For this reason, a naive extension of \citet{russo16} to contextual problems would not yield results that depend on VC dimension. 


\begin{lemma}
	Assume $k=2$ and that there exists $\tau$ such that, with probability 1, $\theta_{1} =\ldots =\theta_{\tau}$, $\theta_{\tau +1} =\ldots =\theta_{2\tau}$, and so on. Let $\mathcal{C}\subset \{1,2\}^{\mathcal{X}}$ be a policy class with VC dimension $d$ such that $A^* \in \mathcal{C}$ almost surely. Then,
	\[
	\bar{H}(i^* \mid X) \leq  \frac{d \ln\left( \frac{\tau}{d}\right)}{\tau}.
	\] 
\end{lemma}
The next example applies this general result to a contextual bandit problem where rewards follow a sparse high-dimensional linear model. 
\citet{bastani2020online} provide substantial technical analysis of similar problems. 
Here, the result follows without any new analysis. 
\begin{example}\label{ex:sparse_linear_models}
	Let the context space $\mathcal{X} \subset \mathbb{R}^p$ consist of high-dimensional vectors and take $\mathcal{C}$ to be the set of all sparse linear classifiers on $\mathcal{X}$; that is, let $\mathcal{C}$ consist of functions of the form $f_{\beta}(x) = 1+ \mathbf{1}\left( x^\top \beta  \geq 0 \right)$ for $\beta \in \mathbb{R}^p$ satisfying $\|\beta\|_0\leq d_0$. Then, \citet{abramovich2018high} show
	${\rm VC}(\mathcal{C})\leq 2 d_0 \log_{2}\left( \frac{p e}{d_0}\right)$. 
	
	Now, we construct a problem where the optimal policy is an element of $\mathcal{C}$. 	
	Assume there are just two arms ($k=2$) and that $\theta_t = [\theta_t^{(1)}, \theta_t^{(2)}]$ is the concatenation of two different $p$-dimensional vectors. Rewards follow the linear model
	\[
	R_t = \langle X_t, \theta_t^{(i_t)} \rangle + W_t
	\]
	where $W_t$ is i.i.d.\ noise. It is not hard to show that $i^*_t  = 1 + \mathbf{1}\left( \langle  X_t \, ,\, \theta_t^{(2)} - \theta_t^{(1)}  \rangle  \geq 0\right)$. If parameters are drawn from a sparse distribution, with $\| \theta_t^{(1)} - \theta_t^{(2)} \|_0 \leq d_0$ almost surely, then $A^* \in \mathcal{C}$.  
\end{example}
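The optimal-arm rule in this example can be checked numerically. The sketch below (illustrative dimensions and pure-Python stand-ins, not the paper's experiments) draws sparse parameter vectors, computes the reward-maximizing arm directly, and compares it against the sparse linear classifier $f_\beta$ with $\beta = \theta^{(2)}-\theta^{(1)}$, the sign convention under which label $2$ corresponds to arm $2$ being optimal. Note that $\beta$ has at most $2d_0$ nonzero coordinates here, a harmless simplification for this check.

```python
# Numerical sanity check of the optimal-arm rule in the sparse linear
# example. All constants (p, d0, sample sizes) are illustrative choices.
import random

random.seed(0)
p, d0 = 20, 3

def sparse_vector(p, d0):
    """A p-dimensional vector with at most d0 nonzero coordinates."""
    v = [0.0] * p
    for j in random.sample(range(p), d0):
        v[j] = random.gauss(0, 1)
    return v

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

theta1, theta2 = sparse_vector(p, d0), sparse_vector(p, d0)
beta = [b - a for a, b in zip(theta1, theta2)]   # theta^(2) - theta^(1)

matches = 0
for _ in range(1000):
    x = [random.gauss(0, 1) for _ in range(p)]
    # Reward-maximizing arm under E[R | x, arm i] = <x, theta^(i)>
    # (ties, a probability-zero event here, broken toward arm 2).
    i_star = 2 if dot(x, theta2) >= dot(x, theta1) else 1
    # Sparse linear classifier f_beta(x) = 1 + 1{<x, beta> >= 0}.
    f_beta = 1 + (1 if dot(x, beta) >= 0 else 0)
    matches += (i_star == f_beta)
```

The two rules agree on every draw, since $\langle x, \theta^{(2)}\rangle \geq \langle x, \theta^{(1)}\rangle$ is exactly the event $\langle x, \beta\rangle \geq 0$.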



\section{To move later}

\begin{example}
	If $\theta_t \in \{ -1, -1+\epsilon, \ldots, 1-\epsilon, 1 \}^{p}$ is a discretized $p$-dimensional vector, and the optimal policy process $(A^*_t)_{t\in \mathbb{N}}$ is stationary, then 
	\begin{align*}
		\bar{H}(A^*) &\leq \frac{1 + \log(\tau_\textup{eff}) + H(A_t^* | A_t^* \ne A_{t-1}^* ) }{\tau_\textup{eff}} \\
		&\leq \frac{1 + \log(\tau_\textup{eff}) + H(\theta_t ) }{\tau_\textup{eff}}\\
		&\leq \frac{1 + \log(\tau_\textup{eff}) + p \log(2/\epsilon) }{\tau_\textup{eff}}.
	\end{align*}
	Combining this gives 
	\[
	\Delta(\pi^{\rm TS}) \leq \tilde{O}\left( \sigma \sqrt{ \frac{p \cdot d}{\tau_\textup{eff}} } \right),
	\]
	which depends on the raw dimension of the parameter space, the embedded dimension of the action space, and the effective horizon. 
\end{example}
\section{Introduction}
\label{sec:intro}


We study the problem of learning in interactive decision-making. 
Across a sequence of time periods, a decision-maker selects actions, observes outcomes, and associates these with rewards. They hope to earn high rewards, but this may require investing in gathering information. 

 Most of the literature studies stationary environments --- where the likelihood of outcomes under an action is fixed across time.\footnote{An alternative style of result lets the environment change, but tries only to compete with the best fixed action in hindsight.} Efficient algorithms limit costs required to converge on optimal behavior. We study the design and analysis of algorithms in nonstationary environments, where converging on optimal behavior is impossible. 
 
 In our model, the latent state of the environment in each time period is encoded in a parameter vector. 
 These parameters are unobservable, but evolve according to a known stochastic process. 
 The decision-maker hopes to earn high rewards by adapting their action selection as the environment evolves.
 This requires continual learning from interaction and striking a judicious balance between exploration and exploitation.  
 Uncertainty about the environment's state cannot be fully resolved before the state changes and this necessarily manifests in suboptimal decisions. 
 Strong performance is impossible under adversarial forms of nonstationarity, but is possible in more benign environments. 
 Why are A/B testing and recommender systems widespread and effective even though nonstationarity is a ubiquitous concern? Quantifying the impact that different forms of nonstationarity have on decision quality is, unfortunately, quite subtle. 
 
 \paragraph{Our contributions.}
We provide a novel information-theoretic analysis that bounds the inherent degradation of decision-quality in changing environments. 
Note that the latent state evolution of the environment induces a latent optimal action process, where the optimal action at any time step is the one that maximizes expected reward conditioned on the current environment parameter. 
We bound limiting per-period regret in terms of the \emph{entropy-rate of the optimal action process}. 
The entropy rate of a stochastic process is a fundamental concept in the theory of communications. 
We use the entropy rate to measure the extent to which the evolving state of the environment manifests in surprising and erratic evolution of the optimal action process. 
Subsection \ref{subsec:example} gives an example of nonstationarity, inspired by A/B testing, in which the entropy rate of the parameter process is large but the entropy rate of the action process is small. 
We believe this distinction is essential.  

We enrich this result in two ways. 
First, we provide a matching lower bound: we exhibit a sequence of problems with varying entropy rate under which no algorithm can meaningfully outperform our upper bounds. 
Second, we upper bound the entropy rate by the inverse of the problem's effective time horizon, which roughly captures the average length of time before the identity of the optimal action changes. 
In this form, our bounds mirror classic ones that pertain to stationary problems, except that the effective time horizon replaces the problem's raw time horizon. 

In addition to the problem's entropy rate, our general bounds depend on the algorithm's information ratio. 
First introduced by \citet{russo16}, the information ratio measures the per-period price an algorithm pays to acquire new information. 
It has been shown to properly capture the complexity of learning in a range of widely studied problems, and recent works link it to generic limits on when efficient learning is possible \cite{lattimore2022minimax, foster2022on}.  

Because the information-ratio framework covers many of the most important sequential learning problems, our framework applies to their \emph{nonstationary variants}. 
A secondary contribution of our work is extending information-ratio analysis to cover contextual bandits, resolving an open question highlighted by \citet{neu22}; see Section \ref{sec:contextual}. 

This work emphasizes understanding the limits of attainable performance. 
Thankfully, most results apply to Thompson sampling (TS), one of the most widely used learning algorithms in practice. 
In some problems, TS is far from optimal, and better bounds are attained with Information-Directed Sampling \cite{russo18}. 






  
 
 \subsection{An illustrative Bayesian model of nonstationarity}
 \label{subsec:example}
 
Consider a multi-armed bandit environment where two types of nonstationarity coexist: a common variation that affects the performance of all arms, and idiosyncratic variations that affect the performance of individual arms separately.
More explicitly, let us assume that the mean reward of arm $a$ at time $t$ is given by
$$ \mu_{t,a} = \theta_t^{\rm cm} + \theta_{t,a}^{\rm id}, $$
where $( \theta_t^{\rm cm} )_{t \in \mathbb{N}}$ and $( \theta_{t,a}^{\rm id} )_{t \in \mathbb{N}}$'s are latent stochastic processes describing common and idiosyncratic disturbances.
While deferring the detailed description to \cref{app:numerical}, we introduce two hyperparameters $\tau^{\rm cm}$ and $\tau^{\rm id}$ in our generative model to control the time scale of these two types of variations.\footnote{
We assume that $( \theta_t^{\rm cm} )_{t \in \mathbb{N}}$ is a zero-mean Gaussian process satisfying $\text{Cov}[ \theta_s^{\rm cm}, \theta_t^{\rm cm} ] = \exp\left( -\frac{1}{2} \left( \frac{t-s}{\tau^{\rm cm}} \right)^2 \right)$ so that $\tau^{\rm cm}$ determines the volatility of the process.
Similarly, the volatility of $( \theta_{t,a}^{\rm id} )_{t \in \mathbb{N}}$ is determined by $\tau^{\rm id}$.}
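The cancellation at the heart of this example is easy to see in simulation. The sketch below substitutes a simple AR(1) process for the Gaussian-process prior (an illustrative stand-in, not the generative model of \cref{app:numerical}); the point is that the common term cancels in the argmax, so the optimal-action path is driven by the idiosyncratic terms alone.

```python
# Sketch of the two-timescale structure: an AR(1) stand-in for the
# Gaussian-process prior, where tau controls how slowly a process drifts.
import math, random

random.seed(1)
T, k = 1000, 2

def ar1(T, tau):
    """Stationary AR(1) path; larger tau => slower, smoother variation."""
    rho = math.exp(-1.0 / tau)
    x = [random.gauss(0, 1)]
    for _ in range(T - 1):
        x.append(rho * x[-1] + math.sqrt(1 - rho ** 2) * random.gauss(0, 1))
    return x

common = ar1(T, tau=10)                     # erratic common shock
idio = [ar1(T, tau=50) for _ in range(k)]   # slow idiosyncratic terms

# Optimal arm at each t under mu_{t,a} = common_t + idio_{t,a}.
opt_with = [max(range(k), key=lambda a: common[t] + idio[a][t])
            for t in range(T)]
opt_without = [max(range(k), key=lambda a: idio[a][t]) for t in range(T)]

# The common term cancels in the argmax: the optimal-arm process (and
# hence its entropy rate) depends only on the idiosyncratic terms.
switches = sum(opt_with[t] != opt_with[t - 1] for t in range(1, T))
```

However volatile the common shock is made, `opt_with` and `opt_without` coincide path by path.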
 
\begin{figure}[ht]
\begin{center}
\centerline{\includegraphics[width=\columnwidth]{figures/cartoon}}
\caption{A two-arm bandit environment with two types of nonstationarity -- a common variation $( \theta_t^{\rm cm} )_{t \in \mathbb{N}}$ generated with a time-scaling factor $\tau^{\rm cm}=10$, and idiosyncratic variations $( \theta_{t,a}^{\rm id} )_{t \in \mathbb{N}, a \in \mathcal{A}}$ generated with a time-scaling factor $\tau^{\rm id}=50$.
While the absolute performances of the two arms are extremely volatile (left), their idiosyncratic performances are relatively stable (right).}
\label{fig:example-illustration}
\end{center}
\vskip -0.2in
\end{figure}

Inspired by real-world A/B tests \cite{wu2022non}, we imagine a two-armed bandit instance involving a common variation that is much more erratic than idiosyncratic variations.
Common variations reflect exogenous shocks to user behavior that impact the reward under all treatment arms. 
\cref{fig:example-illustration} visualizes such an example, a sample path generated with the choice of $\tau^{\rm cm}=10$ and $\tau^{\rm id}=50$.
Observe that the optimal action $A_t^*$ changes only five times throughout 1,000 periods.
Although the environment itself is highly nonstationary and unpredictable due to the common variation term, the optimal action sequence $(A_t^*)_{t \in \mathbb{N}}$ is relatively stable and predictable, since it depends only on the idiosyncratic variations.

Now we ask --- \emph{How difficult is this learning task? Which type of nonstationarity determines the difficulty?}
A quick numerical investigation shows that the problem's difficulty is mainly determined by the frequency of optimal action switches, rather than by the volatility of the common variation.


\begin{figure}[ht]
\begin{center}
\centerline{\includegraphics[width=\columnwidth]{figures/gauss_gp_compare}}
\caption{Performance of algorithms in two-armed bandit environments, with different choices of time-scaling factors $\tau^{\rm cm}$ (common variation) and $\tau^{\rm id}$ (idiosyncratic variations). 
Each data point reports per-period regret averaged over 1,000 time periods and 1,000 runs of simulation.}
\label{fig:example-performance}
\end{center}
\vskip -0.2in
\end{figure}

See \cref{fig:example-performance}, where we report the effect of $\tau^{\rm cm}$ and $\tau^{\rm id}$ on the performance of several bandit algorithms (namely, Thompson sampling with exact posterior sampling,\footnote{
	In order to perform exact posterior sampling, it exploits the specified nonstationary structure as well as the values of $\tau^{\rm cm}$ and $\tau^{\rm id}$.
} and Sliding-Window TS, which uses only the most recent $L \in \{10,50,100\}$ observations; see \cref{app:numerical} for details).
Remarkably, their performance appears to be sensitive to $\tau^{\rm id}$ but not to $\tau^{\rm cm}$, highlighting that nonstationarity driven by the common variation is benign to the learner.

We remark that our information-theoretic analyses predict this result.
\cref{thm:main-result} shows that the complexity of a nonstationary environment is characterized by the entropy rate of the optimal action sequence, which depends only on $\tau^{\rm id}$ and not on $\tau^{\rm cm}$ in this example.
\cref{thm:effective-horizon-bound} further expresses the entropy rate in terms of the effective horizon, which corresponds to $\tau^{\rm id}$ here.

\subsection{Comments on the use of prior knowledge}

A substantive discussion of Bayesian, frequentist, and adversarial perspectives on decision-making uncertainty is beyond the scope of this short paper. 
We make two quick observations. 
First, where does a prior like the one in \cref{fig:example-illustration} come from? 
One answer is that a company may run many thousands of A/B tests, and an informed prior may let it transfer experience across tests \cite{azevedo2019empirical}. 
In particular, experience with past tests may let the company calibrate $\tau^{\rm id}$, or form a hierarchical prior in which $\tau^{\rm id}$ is also random. 
Second, Thompson sampling with a stationary prior is perhaps the most widely used bandit algorithm. 
One might view the model in Section \ref{subsec:example} as a more conservative way of applying TS that guards against a certain magnitude of nonstationarity. 



\subsection{Literature review}


Most existing theoretical studies of nonstationary bandit experiments adopt adversarial or frequentist viewpoints in modeling nonstationarity, typically falling into two categories: ``abruptly changing environments'' and ``slowly changing environments''.

\emph{Abruptly changing environments} (often referred to as switching bandits or piecewise stationary bandits) consider a situation where underlying reward distributions change at unknown times (often referred to as changepoints or breakpoints).
Denoting the total number of changes over $T$ periods by $S$, it was shown that the cumulative regret $\tilde{O}(\sqrt{ST})$ is achievable: e.g., Exp3.S \citep{auer02a, auer02b}, Discounted-UCB \citep{kocsis06}, Sliding-Window UCB \citep{garivier08}, and more complicated algorithms that actively detect the changepoints \citep{auer19, chen19}.
These results are consistent with our result: applied to this setup (assuming that the changes are occurring stochastically in a Bayesian sense), our analysis predicts $\tilde{O}(\sqrt{\mathbb{E}[S]T})$ when we represent the effective horizon as $T/\mathbb{E}[S]$.

Another stream of work considers \emph{slowly changing environments} (often referred to as drifting bandits).
Denoting the total variation in the underlying reward distribution by $V$ (often referred to as variation budget, e.g., $V = \sum_{t=2}^T \| \theta_t - \theta_{t-1} \|_{\infty}$), it was shown that the cumulative regret $\tilde{O}(V^{1/3} T^{2/3})$ is achievable \citep{besbes14, besbes15, cheung19}.
Building a tight connection between these results and ours is an important direction for future work. 
We comment on this in the conclusion. 





We adopt \emph{Bayesian viewpoints} to describe nonstationary environments: changes in the underlying reward distributions (more generally, changes in outcome distributions) are driven by a stochastic process.
Such a viewpoint dates back to the early work of \citet{whittle88}, which introduced the term `restless bandits,' and it has motivated subsequent work \cite{slivkins08, chakrabarti08, jung19}.
Separately, as Thompson sampling (TS) has gained popularity as a Bayesian bandit algorithm, variants of it have been proposed for nonstationary settings: e.g., Dynamic TS \citep{gupta11}, Discounted TS \citep{raj17}, Sliding-Window TS \citep{trovo20}, TS with Bayesian changepoint detection \citep{mellor13, ghatak20}, and Predictive Sampling \citep{liu22}.
Although the Bayesian framework is flexible enough to model various types of nonstationarity (see the examples discussed in \citet{liu22}), performance analyses are scarce and often specific to the assumed model.



Our analysis adopts an \emph{information-theoretic approach} introduced by \citet{russo16}, which has motivated the design and analysis of TS-like algorithms for complex environments \citep{russo18, liu18, dong19, lattimore21, russo22, neu22, liu22}.
Our work can be seen as its application to nonstationary bandit problems.
Notably, \citet{liu22} adopt the same technique for nonstationary bandit problems, but for the purpose of analyzing their own algorithm with a customized definition of the information ratio, yielding tractable regret bounds only in some stylized setups.







\section{Problem Setup}
\label{sec:problem}

A decision-maker interacts with a changing environment across rounds $t \in \mathbb{N}:= \{1,2,3,\ldots\}$. 
In period $t$, the decision-maker selects some action $A_t$ from a finite set $\mathcal{A}$, observes an outcome $O_t$, and associates this with reward $R_t = R(O_t, A_t)$ that depends on the outcome and action through a known utility function. 

There is a function $g$, an i.i.d.\ sequence of disturbances $W=(W_t)_{t\in \mathbb{N}}$, and a sequence of latent environment states $\theta=(\theta_t)_{t\in \mathbb{N}}$ taking values in $\Theta$, such that outcomes are determined as
\begin{equation}\label{eq:outcome-generation}
O_t=g(A_t, \theta_t, W_t ). 
\end{equation}
Write potential outcomes as $O_{t,a}=g(a, \theta_t, W_t)$ and potential rewards as $R_{t,a}= R(O_{t,a}, a)$. Equation \eqref{eq:outcome-generation} is equivalent to specifying a known probability distribution over outcomes for each choice of action and environment state. 
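To make the protocol concrete, here is a toy instantiation (the particular $g$, the state dynamics, and the placeholder policy are all illustrative assumptions, not part of the formulation):

```python
# Toy instantiation of the interaction protocol O_t = g(A_t, theta_t, W_t).
# The specific g, state dynamics, and policy below are illustrative only.
import random

random.seed(4)
k = 2

def g(a, theta, w):
    """Bernoulli outcome with mean theta[a], driven by uniform disturbance w."""
    return 1 if w < theta[a] else 0

def step_theta(theta):
    """Latent state evolves independently of actions (random-walk stand-in)."""
    return [min(1.0, max(0.0, th + random.gauss(0, 0.01))) for th in theta]

theta = [0.3, 0.6]
history = []
for t in range(100):
    w = random.random()          # i.i.d. disturbance W_t
    a = random.randrange(k)      # placeholder policy (uniform over actions)
    o = g(a, theta, w)           # observed outcome O_t
    r = o                        # reward R(O_t, A_t) = O_t in this toy example
    history.append((a, o))
    theta = step_theta(theta)    # actions do not influence the state process
```

The decision-maker sees only `history`; the path of `theta` remains latent throughout.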

The decision-maker wants to earn high rewards even as the environment evolves, but cannot directly observe the environment state or influence its evolution. 
Specifically, the decision-maker's actions are determined by some choice of policy $\pi=(\pi_1, \pi_2,\ldots)$. 
At time $t$, an action $A_t= \pi_t(\mathcal{F}_{t-1}, \tilde{W}_t)$ is a function of the observation history $\mathcal{F}_{t-1}=(A_1, O_1, \ldots, A_{t-1}, O_{t-1})$ and an internal random seed $\tilde{W}_t$ that allows for randomness in action selection. 
Reflecting that the seed is exogenously determined, assume $\tilde{W}=(\tilde{W}_t)_{t\in \mathbb{N}}$ is jointly independent of the outcome disturbance process $W$ and state process
 $\theta=(\theta_t)_{t\in \mathbb{N}}$.   
That actions do not influence the environment's evolution can be written formally through the conditional independence relation $(\theta_{s})_{s\geq t+1}  \perp \mathcal{F}_t \mid (\theta_{\ell})_{\ell \leq t}$.

The decision-maker wants to select a policy $\pi$ that accumulates high rewards as this interaction continues. 
They know all probability distributions and functions listed above, but are uncertain about how environment states will evolve across time. 
To perform `well', they need to continually gather information about the latent environment states and carefully balance exploration and exploitation. 

Rather than measure the reward a policy generates, it is helpful to measure its regret. 
We define the \emph{regret rate} of a policy $\pi$ to be 
\[
	\Delta(\pi) := \limsup_{T \rightarrow \infty} \,\, \mathbb{E}_{\pi}\left[ \frac{1}{T} \sum_{t=1}^T \left(R_{t,A_t^*} - R_{t,A_t} \right)\right],
\]
where the latent optimal action $A_t^*$ is a function of the latent state $\theta_t$ satisfying $A_t^* \in \argmax_{a \in \mathcal{A}} \mathbb{E}[R_{t,a} \mid \theta_t]$. 
A policy's regret rate is the limit of the Ces\`{a}ro average of its regret.
It measures the per-period degradation in performance due to uncertainty about the environment state. 

\begin{remark} \label{rem:conjecture}
	The use of a limit supremum and Ces\`{a}ro averages is likely unnecessary under some technical restrictions. 
	For instance, under Thompson sampling applied to Examples~\ref{ex:k-armed-bandit}--\ref{ex:contextual-bandit}, if the latent state process $(\theta_t)_{t \in \mathbb{N}}$ is ergodic, we conjecture that $\Delta(\pi)= \lim_{t\to \infty} \mathbb{E}_{\pi}\left[ \left(R_{t,A_t^*} - R_{t,A_t} \right)\right]$. 
\end{remark}

Our analysis proceeds under the following assumption, which is standard in the literature. 
\begin{assumption} \label{ass:subgaussian}
	There exists $\sigma$ such that, conditioned on $\mathcal{F}_{t-1}$, $R_{t,a}$ is sub-Gaussian with variance proxy $\sigma^2$.  
\end{assumption}


\subsection{`Stationary processes' in `nonstationary bandits'}
The way the term `nonstationarity' is used in the bandit learning literature can cause confusion, as it conflicts with the meaning of `stationarity' in the theory of stochastic processes, which we use elsewhere in this paper. 
	\begin{definition}\label{def:stationary}
		A stochastic process $X=(X_t)_{t\in \mathbb{N}}$ is (strictly) stationary if, for each integer $t$ and each nonnegative integer $m$, the random vector $(X_{1+m}, \ldots, X_{t+m})$ has the same distribution as $(X_1, \ldots, X_t)$.  
	\end{definition}
`Nonstationarity', as used in the bandit learning literature, means that \emph{realizations} of the latent state $\theta_t$ may differ at different time steps. 
The decision-maker can gather information about the current state of the environment, but it may later change.  
Nonstationarity of the stochastic process $(\theta_{t})_{t\in \mathbb{N}}$, in the language of probability theory, arises when, a priori, there are predictable differences between environment states at different timesteps: e.g., if time period $t$ falls at night, then rewards tend to be lower than during the day. 
It is often clearer to model predictable differences like these through contexts, as in \cref{ex:contextual-bandit}.


 
 
\subsection{Examples}
Many interactive decision-making problems can be naturally written as special cases of our general protocol, where actions generate outcomes that are associated with rewards. 

Our first example describes a bandit problem with independent arms, where outcomes generate information only about the selected action.

\begin{example}[K-armed bandit]\label{ex:k-armed-bandit}
	Consider a website that can display one among $k$ ads at a time and gains one dollar per click.
	For each ad $a \in [k] := \{1,\ldots,k\}$, the potential outcome/reward $O_{t,a} = R_{t,a} \sim \text{Bernoulli}(\theta_{t,a})$ is a random variable representing whether ad $a$ would be clicked by the $t^\text{th}$ visitor if displayed, where $\theta_{t,a} \in [0,1]$ represents its click-through rate.
	The website observes only the reward of the displayed ad, so $O_t = R_{t,A_t}$.
\end{example}

Full information online optimization problems fall at the other extreme. 
There the potential observation $O_{t,a}$ does not depend on the chosen action $a$, so purposeful information gathering is unnecessary.
The next example, introduced by \citet{cover91}, illustrates such scenarios. 
\begin{example}[Log-optimal online portfolios]\label{ex:full-information} 
Consider a small trader who has no market impact. 
In period $t$ they have wealth $W_t$ which they divide among $k$ possible investments. 
The action $A_{t}$ is chosen from a feasible set of probability vectors, with $A_{t,i}$ denoting the proportion of wealth invested in stock $i$. 
The observation is $O_t\in \mathbb{R}^k_+$ where $O_{t,i}$ is the end-of-day value of $\$1$ invested in stock $i$ at the start of the day and the distribution of $O_t$ is parameterized by $\theta_t$.
Because the observation consists of publicly available data, and the trader has no market impact, $O_{t}$ does not depend on the investor's decision. Define the reward function $R_t=\log\left( O_t^\top A_t \right)$. Since wealth evolves according to the equation $W_{t+1} = \left(O_t^\top A_t\right) W_t$,
$$\sum_{t=1}^{T-1} R_t = \log(W_T/ W_1) .$$
\end{example}
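The wealth identity in the example is a telescoping of logarithms, which a short simulation confirms (the uniform portfolio and log-normal gross returns below are illustrative assumptions):

```python
# Sketch of the wealth identity: with W_{t+1} = (O_t . A_t) W_t, the
# cumulative reward sum_t log(O_t . A_t) telescopes to log(W_T / W_1).
import math, random

random.seed(3)
k, T = 4, 50
wealth = [1.0]                     # W_1 = 1 (illustrative initial wealth)
log_rewards = []
for t in range(T - 1):
    A = [1.0 / k] * k              # uniform portfolio (a feasible prob. vector)
    O = [math.exp(random.gauss(0.0, 0.02)) for _ in range(k)]  # gross returns
    growth = sum(o * a for o, a in zip(O, A))
    log_rewards.append(math.log(growth))   # reward R_t = log(O_t . A_t)
    wealth.append(growth * wealth[-1])     # W_{t+1} = (O_t . A_t) W_t

lhs = sum(log_rewards)                     # sum_{t=1}^{T-1} R_t
rhs = math.log(wealth[-1] / wealth[0])     # log(W_T / W_1)
```

The two quantities agree up to floating-point error, regardless of the return distribution or the portfolio chosen.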
Many problems lie in between these extremes. 
We give two examples. 
The first is a matching problem. Many pairs of individuals are matched together and, in addition to the cumulative reward, the decision-maker observes feedback on the quality of each individual match. 
This kind of observation structure is sometimes called ``semi-bandit'' feedback \cite{audibert14}. 
\begin{example}[Matching] \label{ex:matching}
	Consider an online dating platform with two disjoint sets of individuals $\mathcal{M}$ and $\mathcal{W}$.
	On each day $t$, the platform suggests a matching $A_t \subset \{ (m,w): m \in \mathcal{M}, w \in \mathcal{W} \}$ of size $|A_t| \leq k$.
	For each pair $(m,w)$, their match quality is given by $\theta_{t,(m,w)}$.
	The platform observes the quality of individual matches, $O_t = \big( \theta_{t,(m,w)} : (m,w) \in A_t \big)$, and earns their average, $R_t = \frac{1}{k}\sum_{(m,w) \in A_t} \theta_{t,(m,w)}$.
\end{example}
Our final example is a contextual bandit problem.
Here an action is itself a policy: it is a rule for assigning treatments on the basis of an observed context. 
Observations are richer than in the K-armed bandit. 
The decision-maker sees not only the reward a policy generated but also the context in which it was applied. 
\begin{example}[Contextual bandit] \label{ex:contextual-bandit}
	Suppose that the website described in \cref{ex:k-armed-bandit} can now access additional information about each visitor, denoted by $X_t \in \mathcal{X}$. 
	The website observes the contextual information $X_t$, chooses an ad to display, 
	and then observes whether the user clicks. 
	To represent this task using our general protocol, we let the decision space $\mathcal{A}$ be the set of mappings from the context space $\mathcal{X}$ to the set of ads $\{1,\ldots, k\}$, the decision $A_t \in \mathcal{A}$ be a personalized advertising rule, and the observation $O_t=(X_t, R_t)$ be the pair of observed visitor information and the reward from applying the ad $A_t(X_t)$. 
	Rewards are drawn according to $R_t \mid X_t, A_t, \theta_t \sim  \text{Bernoulli}( \phi_{\theta_t}(X_t, A_t(X_t) ) )$, where $\phi_\theta:\mathcal{X}\times [k] \to [0,1]$ is a parametric click-through-rate model.
	Assume $X_{t+1} \perp (A_t, \theta) \mid X_t, \mathcal{F}_{t-1}$. 
	This assumption means that advertising decisions cannot influence the future contexts and that parameters of the click-through rate model $\theta=(\theta_t)_{t\in \mathbb{N}}$ cannot be inferred passively by observing contexts.
\end{example}






\section{Information Theoretic Preliminaries}
The entropy of a discrete random variable $X$, defined by $H(X)= -\sum_{x} \mathbb{P}(X=x)\log(\mathbb{P}(X=x))$,  measures the uncertainty in its realization. 
The entropy rate of a stochastic process $(X_1, X_2, \ldots)$ is the rate at which entropy of the partial realization $(X_1, \ldots, X_t)$ accumulates as  $t$ grows. 
\begin{definition}
	The entropy rate of a stochastic process $X=(X_t)_{t\in \mathbb{N}}$, taking values in a discrete set $\mathbb{X}$, is
	\begin{align*}
		\bar{H}(X)&=\limsup_{T \rightarrow \infty} \,\, \frac{H\left([X_1, \ldots, X_T] \right)}{T} \\
		&= \limsup_{T \rightarrow \infty} \,\, \frac{1}{T} \sum_{t=1}^{T} H(X_t | X_{t-1}, \ldots, X_1 ).
	\end{align*}
	If $X$ is a stationary stochastic process, then 
	\begin{equation}\label{eq:entropy-rate-stationary}
		\bar{H}(X) = \lim_{t\to \infty}  H(X_t | X_{t-1}, \ldots, X_1 ).
	\end{equation}
\end{definition}
The form \eqref{eq:entropy-rate-stationary} is especially elegant. 
The entropy rate of a stationary stochastic process is the residual uncertainty in the draw of $X_t$ which cannot be removed by knowing the draw of $X_{t-1}, \ldots, X_1$. 
Processes that evolve quickly and erratically have high entropy rate. 
Those that tend to change infrequently (i.e., $X_t=X_{t-1}$ for most $t$) or change predictably will have low entropy rate.
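For a concrete instance, consider a symmetric two-state Markov chain that switches state with probability $q$ at each step; for such a chain the entropy rate equals the binary entropy of $q$. The snippet below (a minimal numerical sketch, not tied to any example in the paper) makes the point that a rarely-switching process has low entropy rate:

```python
# Entropy rate of a symmetric two-state Markov chain that switches state
# with probability q each step. For this stationary chain,
# Hbar(X) = H(X_t | X_{t-1}) = -q log q - (1-q) log(1-q)  (in nats).
import math

def entropy_rate_two_state(q):
    """Entropy rate (nats) of a symmetric 2-state chain with switch prob q."""
    if q in (0.0, 1.0):
        return 0.0  # never switching, or switching deterministically
    return -q * math.log(q) - (1 - q) * math.log(1 - q)

# A chain that rarely switches (small q) has low entropy rate, even though
# its realization keeps changing over a long horizon; a chain that switches
# deterministically (q = 1) changes constantly yet has entropy rate zero.
slow = entropy_rate_two_state(0.01)
fast = entropy_rate_two_state(0.5)
```

The maximum, $\log 2$ nats per period, is attained at $q=1/2$, when each step is a fresh coin flip.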


The conditional entropy rate and mutual information rate are defined similarly. Consider the stochastic process $(X,Z)=(X_t, Z_t)_{t\in \mathbb{N}}$, where $X_t$ takes values in a finite set. Define the conditional entropy rate as
\[ 
 \bar{H}(X|Z) = \limsup_{T \rightarrow \infty} \frac{H( X_{1:T}\mid Z_{1:T})}{T}
\]
and the mutual information rate as 
\[
\bar{I}(X;Z) = \limsup_{T \rightarrow \infty} \frac{I( X_{1:T} ; Z_{1:T})}{T}.
\]
It is immediate that $\bar{I}(X;Z) \leq \bar{H}(X)$.
When all limit suprema can be replaced by proper limits, $\bar{I}(X;Z) = \bar{H}(X) - \bar{H}(X|Z)$.  

\section{Information-Theoretic Analysis of Dynamic Regret}
\label{sec:dregret}
We apply the information-theoretic analysis of \citet{russo16} and establish upper bounds on the per-period regret, expressed in terms of (1) the algorithm's information ratio, and (2) the entropy rate of the optimal action process.


\subsection{Preview: special cases of our result}
We begin by giving a special case of our result. 
It bounds the regret rate of Thompson sampling in terms of the reward variance proxy $\sigma^2$, the number of actions $|\mathcal{A}|$, and the entropy rate of the optimal action process $\bar{H}(A^*)$. Thompson sampling is denoted by $\pi^{\rm TS}$ and is defined by the probability matching property:
\begin{equation}\label{eq:ts}
	\mathbb{P}(A_t=a \mid \mathcal{F}_{t-1})=\mathbb{P}(A_t^*=a \mid \mathcal{F}_{t-1}), 
\end{equation}
which holds for all $t\in \mathbb{N}$, $a\in \mathcal{A}$. Actions are chosen by sampling from the posterior distribution of the optimal action. 
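For a concrete (if stylized) instance, the sketch below implements probability matching \eqref{eq:ts} for a stationary Bernoulli bandit with independent $\mathrm{Beta}(1,1)$ priors: each period, a mean is sampled from every arm's posterior and the argmax is played, so the action is distributed as the posterior over the optimal action. The instance and all names are our own illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_step(alpha, beta):
    """Probability matching: sample each arm's mean from its Beta posterior and
    play the argmax, so P(A_t = a | F_{t-1}) = P(A_t^* = a | F_{t-1})."""
    return int(np.argmax(rng.beta(alpha, beta)))

def run_ts(true_means, T):
    k = len(true_means)
    alpha, beta = np.ones(k), np.ones(k)   # Beta(1, 1) prior on each arm's mean
    total = 0.0
    for _ in range(T):
        a = thompson_step(alpha, beta)
        r = float(rng.random() < true_means[a])
        alpha[a] += r
        beta[a] += 1 - r
        total += r
    return total / T

# On a stationary instance TS concentrates on the best arm over time.
avg_reward = run_ts(np.array([0.2, 0.5, 0.8]), T=5000)
```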
\begin{corollary}\label{cor:k-armed}
	For any problem within the scope of our formulation,
	\[
	\Delta(\pi^{\rm TS}) \leq \sigma \sqrt{2 \cdot |\mathcal{A}| \cdot \bar{H}(A^*) }.
	\]
\end{corollary}
According to \eqref{eq:entropy-rate-stationary}, the entropy rate is small when the conditional entropy $H(A_t^* \mid A^*_1, \ldots, A^*_{t-1})$ is small. 
That is, the entropy rate is small if most uncertainty in the optimal action $A_t^*$ is removed through knowledge of the past optimal actions.
Of course, Thompson sampling does not observe the environment states or the corresponding optimal actions, so  its dependence on this quantity is somewhat remarkable. 

The dependence of regret on the number of actions, $|\mathcal{A}|$, is unavoidable in a problem like the $K$-armed bandit of \cref{ex:k-armed-bandit}. 
But in other cases, it is undesirable. Our general results depend on the problem's information structure in a more refined manner. 
To preview this, we give another corollary of our main result, which holds for problems with full-information feedback (see \cref{ex:full-information} for motivation).
In this case, the dependence on the number of actions completely disappears and the bound depends on the variance proxy and the entropy rate. 
The bound applies to TS and the policy $\pi^{\rm Greedy}$, which chooses $A_t \in \argmax_{a \in \mathcal{A}} \mathbb{E}[R_{t,a} \mid \mathcal{F}_{t-1}]$ in each period $t$. 


\begin{corollary}
		For full-information problems, where $O_{t,a}=O_{t,a'}$ for each $a,a'\in\mathcal{A}$, 
	\[
	\Delta(\pi^{\rm Greedy}) \leq \Delta(\pi^{\rm TS}) \leq \sigma \sqrt{2 \cdot \bar{H}(A^*) }.
	\]
\end{corollary}
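To see why exploration becomes unnecessary under full-information feedback, consider the following small simulation (our own construction, with a hypothetical stationary Bernoulli instance): since every arm's reward is observed each period, the greedy policy updates all posteriors in parallel and its average per-period regret vanishes without any deliberate exploration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stationary instance; names and parameters are ours.
true_means = np.array([0.3, 0.6, 0.7])
k, T = len(true_means), 2000
alpha, beta = np.ones(k), np.ones(k)        # Beta(1,1) posterior parameters per arm
regret = 0.0
for _ in range(T):
    a = int(np.argmax(alpha / (alpha + beta)))       # greedy on posterior means
    r = (rng.random(k) < true_means).astype(float)   # full feedback: all rewards seen
    regret += true_means.max() - true_means[a]
    alpha += r                                       # update every arm, not just the played one
    beta += 1 - r
avg_regret = regret / T   # shrinks toward zero, consistent with the corollary
```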


\subsection{The entropy rate as an effective time horizon}
The entropy rate is an intrinsic measure of the rate of unpredictable variation in the optimal action process.
To aid in interpretation of this quantity, the next theorem upper bounds the entropy rate by the inverse of a notion of the problem's effective time horizon, denoted by $\tau_\textup{eff}$. 
The effective time horizon is long when the optimal action changes infrequently, so that, intuitively, a decision-maker could continue to exploit the optimal action for a long time if it were identified. 
\begin{theorem}\label{thm:effective-horizon-bound} When the process $(A_t^*)_{t\in \mathbb{N}}$ is stationary,
	$$ \bar{H}( A^* ) \leq \frac{1 + \log(\tau_\textup{eff}) + H(A_t^* | A_t^* \ne A_{t-1}^* ) }{\tau_\textup{eff}}, $$
	where 
	\begin{equation}\label{eq:effective-horizon}
		\tau_\textup{eff} := \frac{1}{\mathbb{P}( A_t^* \ne A_{t-1}^* ) }.
	\end{equation}
\end{theorem}

	
Combining this result with \cref{cor:k-armed} gives the bound 
\[
\Delta(\pi^{\rm TS}) \leq \tilde{O}\left( \sigma\sqrt{\frac{|\mathcal{A}|}{\tau_{\rm eff}}}\right),
\]
which closely mirrors familiar $O(\sqrt{k/T})$ bounds on the average per-period regret in bandit problems with $k$ arms, $T$ periods, and i.i.d.\ rewards \cite{bubeck12}.
Through the entropy rate, our results yield a similar guarantee in terms of an intrinsic ``effective horizon.'' 
Note that it is not clear how one would establish such a guarantee for TS directly: even though the optimal action switches only once every $\tau_{\rm eff}$ periods on average, other features of the environment may change more erratically. 

Below we give an example where the upper bound in \cref{thm:effective-horizon-bound} is nearly exact. 
\begin{example}[Piecewise stationary environment]
	Suppose $(A_t^*)_{t\in \mathbb{N}}$ follows a switching process. 
	With probability $1-\delta$ there is no change in the optimal action, whereas with probability $\delta$ there is a change-event and $A^*_t$ is drawn uniformly from among the other $k-1=|\mathcal{A}|-1$ arms.  
	Precisely, $(A_t^*)_{t\in \mathbb{N}}$ follows a Markov process with transition dynamics:
	\[
	\mathbb{P}(A^*_{t+1}=a \mid  A^*_t=a')= \begin{cases}
		1-\delta  & \text{if } a=a' \\
		\delta/(k-1) & \text{if } a\neq a'
	\end{cases}
	\]
	for $a, a' \in \mathcal{A}$. Then 
	\begin{align*}
		\bar{H}(A^*) &= (1-\delta)\log\left(\frac{1}{1-\delta}\right) + \delta\log\left(\frac{k-1}{\delta}\right)\\
		&= (1-\delta)\log\left(1 + \frac{\delta}{1-\delta}\right) + \delta\log\left(\frac{k-1}{\delta}\right)\\
		&\approx \delta+\delta\log((k-1)/\delta),
	\end{align*}
	where we used the approximation $\log(1+x) \approx x$. 
	Plugging in $\tau_{\rm eff}=1/\delta$ and $H(A_t^* \mid A_t^*\neq A^*_{t-1})=\log(k-1)$ yields 
	\[
	\bar{H}(A^*) \approx \frac{1+\log(\tau_{\rm eff}) + H(A_t^* \mid A_t^*\neq A^*_{t-1})}{\tau_{\rm eff}}.
	\]
\end{example}
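The near-exactness claimed in the example is easy to check numerically; the sketch below (our illustration, with arbitrary values of $\delta$ and $k$) compares the exact entropy rate of the switching chain, the $\log(1+x)\approx x$ approximation, and the upper bound of \cref{thm:effective-horizon-bound}.

```python
import numpy as np

# Parameters of the switching chain; values are our illustration.
delta, k = 0.01, 10

# Exact entropy rate: (1-delta)*log(1/(1-delta)) + delta*log((k-1)/delta).
rate = (1 - delta) * np.log(1 / (1 - delta)) + delta * np.log((k - 1) / delta)

# Approximation from the example, using log(1+x) ~ x.
approx = delta + delta * np.log((k - 1) / delta)

# Theorem bound with tau_eff = 1/delta and H(A*_t | A*_t != A*_{t-1}) = log(k-1);
# for this example the bound coincides with the approximation above.
tau_eff = 1 / delta
bound = (1 + np.log(tau_eff) + np.log(k - 1)) / tau_eff
```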

Although it can be illuminating to consider a problem's effective horizon, the entropy rate is a deeper quantity that better captures a problem's intrinsic difficulty. 
Below we give an extreme example where the entropy rate is zero even though the optimal action changes frequently.
Although the example is quite artificial, it captures that some forms of benign nonstationarity involve quickly changing, but largely predictable evolution in the optimal action process. 
\begin{example}[Cyclic optimal action sequence]
	Consider a nonstationary Bernoulli bandit environment where the success probability of arm $a \in \{1,\ldots,k\}$ at time $t$ is given by
	$$ \theta_{t,a} = \mathbb{I}\{ t \equiv a \text{ mod } k \}. $$
	The optimal action process $(A_t^*)_{t \in \mathbb{N}}$ is a deterministic sequence\footnote{Strictly speaking, the process $A^*$ is not stationary according to \cref{def:stationary}. 
	To resolve this issue, one can introduce a random seed: $\theta_{t,a} = \mathbb{I}\{ t +N \equiv a \text{ mod } k \}$ where $N \sim \text{Uniform}(\{1,\ldots,k\})$.} taking values in $[k]$ cyclically.
	Although the optimal action changes every single time (i.e., $\tau_{\rm eff} = 1$), the sequence is completely predictable and hence $\bar{H}(A^*) = 0$.
\end{example}
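The zero entropy rate can be verified directly from the seeded construction in the footnote (our sketch): since the trajectory $(A^*_1,\ldots,A^*_T)$ is one of only $k$ equally likely cyclic shifts, its joint entropy is $\log(k)$ for every $T$, and the rate $\log(k)/T$ vanishes.

```python
import numpy as np

# Enumerate the trajectories induced by each seed N in {1,...,k}; each trajectory
# is a cyclic shift, so there are exactly k distinct, equally likely sequences.
k, T = 5, 12
trajectories = {tuple((t + N) % k + 1 for t in range(1, T + 1))
                for N in range(1, k + 1)}
joint_entropy = np.log(len(trajectories))   # H([A*_1,...,A*_T]) = log(k) for any T
rate_at = lambda T: joint_entropy / T       # entropy rate -> 0 as T grows
```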









\subsection{Main result}
The corollaries presented earlier are special cases of a general result that we present now. 
Define the limiting information ratio of an algorithm $\pi$ by
\[
\Gamma(\pi) = \limsup_{T \rightarrow \infty} \frac{1}{T} \sum_{t=1}^{T} \underbrace{\frac{ \left( \mathbb{E}\left[ R_{t,A^*} - R_{t,A_t} \right] \right)^2 }{ I\left( A_t^*;  (A_t, O_{t,A_t})  \mid \mathcal{F}_{t-1} \right) }}_{:=\Gamma_t(\pi)}.
\]
The per-period information ratio $\Gamma_t(\pi)$ was defined by \citet{russo16} and presented in this form by \citet{russo22}. 
It is the ratio between the square of expected regret and the conditional mutual information between the optimal action and the algorithm's observation. 
It measures the cost, in terms of the square of expected regret, that the algorithm pays to acquire each bit of information about the optimum. 
Under sufficient ergodicity assumptions, we expect $\Gamma(\pi)=\lim_{t\to \infty} \Gamma_t(\pi)$. 
We avoid imposing conditions under which such a limit exists by taking the limit supremum of Ces\`{a}ro averages. 

The next theorem shows that any algorithm's regret is bounded by the square root of the product of its information ratio and the entropy rate of its optimal action sequence.
The result has profound consequences, but follows easily by applying elegant properties of information measures. 
\begin{theorem} \label{thm:main-result}
	Under any algorithm $\pi$,
	\[
		\Delta(\pi) \leq \sqrt{ \Gamma(\pi) \cdot \bar{H}(A^*) }.
	\]
	
	
\end{theorem}
\begin{proof} Use the shorthand notation $\Delta_t := \mathbb{E}[ R_{t,A^*} - R_{t,A_t} ]$ for regret, $G_t := I( A_t^*; (A_t, O_{t,A_t}) | \mathcal{F}_{t-1} )$ for information gain, and $\Gamma_t = \Delta_t^2/G_t$ for the information ratio at period $t$. 
Then, 
	\begin{align*}
		\Delta(\pi) &= \limsup_{T\to \infty} \, T^{-1}\sum_{t=1}^T \Delta_t \\
		&= \limsup_{T\to \infty}  T^{-1}\sum_{t=1}^T \sqrt{ \Gamma_t } \sqrt{ G_t }
		\\& \, \leq \limsup_{T\to \infty}  \sqrt{ T^{-1}\sum_{t=1}^T \Gamma_t } \cdot \sqrt{ T^{-1} \sum_{t=1}^T G_t }\\
		\, &\leq \sqrt{\Gamma(\pi)  \cdot \limsup_{T \rightarrow \infty}\, T^{-1}\sum_{t=1}^{T} G_t}.
	\end{align*}
	The first inequality uses Cauchy-Schwarz. 
	The second inequality uses that $\limsup x_t y_t \leq (\limsup x_t) \cdot (\limsup y_t)$ for nonnegative sequences $(x_t), (y_t)$. 
	
	We conclude by bounding the information gain. This uses the chain rule, the data processing inequality, and the fact that entropy bounds mutual information:
	\begin{align*}
		\sum_{t=1}^T G_t &= \sum_{t=1}^T I( A_t^*; (A_t, O_{t,A_t}) | \mathcal{F}_{t-1} )
		\\&\leq \sum_{t=1}^T I( [A_1^*, \ldots, A_T^*] ; (A_t, O_{t,A_t}) | \mathcal{F}_{t-1} )
		\\&= I( [A_1^*, \ldots, A_T^*]; \mathcal{F}_T ) \\
		& \leq H( [A_1^*, \ldots, A_T^*] ).
	\end{align*}
We conclude that $\limsup_{T \to \infty}\, T^{-1}\sum_{t=1}^{T} G_t \leq \bar{H}( A^* )$, completing the proof.  
\end{proof}
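The averaging step at the heart of the proof is a plain Cauchy-Schwarz fact that can be sanity-checked numerically; the snippet below (our own check, with arbitrary stand-in sequences) verifies it for random nonnegative $\Delta_t$ and positive $G_t$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Cauchy-Schwarz averaging step used in the proof: for nonnegative Delta_t and
# positive G_t,
#   (T^{-1} sum_t Delta_t)^2 <= (T^{-1} sum_t Delta_t^2 / G_t) * (T^{-1} sum_t G_t).
T = 10_000
Delta = rng.random(T)        # stand-in for per-period expected regret
G = rng.random(T) + 0.1      # stand-in for per-period information gain
lhs = Delta.mean() ** 2
rhs = (Delta**2 / G).mean() * G.mean()
```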
\begin{remark}%
	A careful reading of the proof reveals that it is possible to replace the entropy rate with the mutual information rate $\limsup_{T \rightarrow \infty} T^{-1} I( A_{1:T}^*; \mathcal{F}_T )$.  
\end{remark}

\begin{remark}\label{rem:lambda-info-ratio}
	Following \citet{lattimore21}, one can adopt a slightly different definition of information ratio,
	$$ \Gamma_\lambda(\pi) := \sup_{t \in \mathbb{N}} \frac{ \left( \mathbb{E}\left[ R_{t,A^*} - R_{t,A_t} \right] \right)^\lambda }{ I\left( A_t^*;  (A_t, O_{t,A_t})  \mid \mathcal{F}_{t-1} \right) }, $$
	which immediately yields an inequality, $\Delta(\pi) \leq \left( \Gamma_\lambda(\pi) \bar{H}(A^*) \right)^{1/\lambda}$ for any $\lambda \geq 1$.
\end{remark}
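For completeness, the stated inequality follows by the same two steps as the proof of \cref{thm:main-result}; a sketch (ours), assuming $\Delta_t \geq 0$:

```latex
% The definition of Gamma_lambda gives Delta_t <= (Gamma_lambda(pi) * G_t)^{1/lambda}.
% Averaging over t and applying Jensen's inequality to the concave map x -> x^{1/lambda}:
\frac{1}{T}\sum_{t=1}^T \Delta_t
  \;\le\; \Gamma_\lambda(\pi)^{1/\lambda}\, \frac{1}{T}\sum_{t=1}^T G_t^{1/\lambda}
  \;\le\; \Gamma_\lambda(\pi)^{1/\lambda} \left( \frac{1}{T}\sum_{t=1}^T G_t \right)^{1/\lambda}.
% Combining with sum_t G_t <= H(A^*_{1:T}) from the proof above and taking the
% limit supremum yields Delta(pi) <= (Gamma_lambda(pi) * Hbar(A^*))^{1/lambda}.
```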

\subsection{Some known bounds on the information ratio}

We list some known results about the information ratio. These were originally established for stationary bandit problems but extend immediately to the nonstationary settings considered in this paper. Most results apply to Thompson sampling, and essentially all bounds apply to Information-directed sampling, which is designed to minimize the information ratio \cite{russo18}. The first four results were shown by \citet{russo16} under \cref{ass:subgaussian}.
\begin{description}
	\item[Classical bandits.] $\Gamma(\pi^{\rm TS}) \leq 2\sigma^2|\mathcal{A}| $, for bandit tasks with finite action set (e.g., \cref{ex:k-armed-bandit}).
	\item[Full information.] $\Gamma(\pi^{\rm TS}) \leq 2\sigma^2$, for problems with full-information feedback (e.g., \cref{ex:full-information}).
	\item[Linear bandits.] $\Gamma(\pi^{\rm TS}) \leq 2\sigma^2 d$, for linear bandits of dimension $d$ (i.e., $\mathcal{A} \subseteq \mathbb{R}^d$, $\Theta \subseteq \mathbb{R}^d$, and $\mathbb{E}[R_{t,a} \mid \theta_t] = a^\top \theta_t$).
	\item[Combinatorial bandits.] $\Gamma(\pi^{\rm TS}) \leq 2\sigma^2 \frac{d}{k^2}$, for combinatorial optimization tasks of selecting $k$ items out of $d$ items with semi-bandit feedback (e.g., \cref{ex:matching}). 
	\item[Contextual bandits.] See \cref{subsec:contextual} below for a new result.  
	\item[Logistic bandits.] \citet{dong19} consider problems where mean-rewards follow a generalized linear model with logistic link function, and bound the information ratio by the dimension of the parameter vector and a new notion they call the `fragility dimension.'
	\item[Graph-based feedback.]  With graph-based feedback, the decision-maker observes not only the reward of the selected arm but also the rewards of its neighbors in a feedback graph. One can bound the information ratio by the feedback graph's clique cover number \cite{liu18} or its independence number \cite{hao2022contextual}.
	\item[Sparse linear models.] \citet{hao21} consider sparse linear bandits and show conditions under which the information ratio of Information-Directed Sampling in \cref{rem:lambda-info-ratio} is bounded by the number of nonzero elements in the parameter vector. 
	\item[Convex cost functions.] \citet{bubeck2016multi} and \citet{lattimore2020improved} study bandit learning problems where the reward function is known to be concave and bound the information ratio by a polynomial function of the dimension of the action space. 
\end{description}



\subsection{A new bound on the information ratio of contextual bandits}\label{subsec:contextual}
Contextual bandit problems are a special case of our formulation that satisfy the following abstract assumption. 
\Cref{ex:contextual-bandit} provides intuition. 
\begin{assumption}\label{assumption:contextual}
	There is a set $\mathcal{X}$ and integer $k$ such that $\mathcal{A}$ is the set of functions mapping $\mathcal{X}$ to $[k]$. The observation at time $t$ is the tuple $O_t =(X_t, R_t) \in \mathcal{X}\times \mathbb{R}$.  
	Define $i_t := A_t(X_t)\in [k]$.
	Assume that, for each $t$, $X_{t+1} \perp (A_t, R_t) \mid X_t, \mathcal{F}_{t-1}$ and $R_t \perp A_t \mid (X_t, i_t, \theta_t)$.
\end{assumption}

Under this assumption, we provide an information-ratio bound that depends on the number of arms $k$ but is independent of the number of contexts and the size of the decision space. 
This is a substantial improvement over \cref{cor:k-armed}, which depends on the number of \emph{decision rules}. 
\begin{lemma}\label{lem:contextual}
	Under Assumption \ref{assumption:contextual}, $\Gamma(\pi^{\rm TS}) \leq 2\cdot \sigma^{2} \cdot k$. 
\end{lemma}
Theorem \ref{thm:main-result} therefore bounds regret in terms of the entropy rate of the optimal decision rule process $(A^*_t)_{t\in \mathbb{N}}$, the number of arms $k$, and the reward variance proxy $\sigma^2$.  

\citet{neu22} recently highlighted that information-ratio analysis seems not to deal adequately with context, and proposed a substantial modification which considers information gain about model parameters rather than optimal decision-rules.  
\cref{lem:contextual} appears to resolve this open question without changing the information ratio itself. 
Our bounds scale with the entropy of the optimal decision-rule, instead of the entropy of the true model parameter, as in \citet{neu22}. 
By the data processing inequality, the former is always smaller. 
Our proof bounds the per-period information ratio, so it can be used to provide finite time regret bounds for stationary contextual bandit problems. 
\citet{hao2022contextual} provide an interesting study of variants of Information-directed sampling in contextual bandits with complex information structure. 
It is not immediately clear how that work relates to \cref{lem:contextual} and the information ratio of Thompson sampling. 
	
The next corollary combines the information ratio bound above with the earlier bound of \cref{thm:effective-horizon-bound}. 
The bound depends on the number of arms, the dimension of the parameter space, and the effective time horizon. 
No further structural assumptions (e.g., linearity) are needed. 
An unfortunate feature of the result is that it applies only to parameter vectors that are quantized at scale $\epsilon$. 
The logarithmic dependence on $\epsilon$ is omitted in the $\tilde{O}(\cdot)$ notation, but displayed in the proof. 
When outcome distributions are smooth in $\theta_t$,  we believe this could be removed with careful analysis. 
\begin{corollary}\label{cor:contextual} Under \cref{assumption:contextual}, 
	if $\theta_t \in \{ -1, -1+\epsilon, \ldots, 1-\epsilon, 1 \}^{p}$ is a discretized $p$-dimensional vector, and the optimal policy process $(A^*_t)_{t\in \mathbb{N}}$ is stationary, then 
	\[
	\Delta(\pi^{\rm TS}) \leq \tilde{O}\left( \sigma \sqrt{ \frac{p \cdot k}{\tau_\textup{eff}} } \right).
	\]
\end{corollary}
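To illustrate why the number of arms, rather than the number of decision rules, governs the difficulty here, consider the following sketch (our own, with i.i.d.\ contexts, a hypothetical stationary environment, and independent Beta priors per context-arm pair): sampling the posterior and acting greedily at the realized context implements probability matching over decision rules, without ever enumerating the $k^{|\mathcal{X}|}$ rules.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stationary contextual instance; names and setup are ours.
n_ctx, k, T = 4, 3, 4000
true_means = rng.random((n_ctx, k))     # mean reward per (context, arm) pair
alpha = np.ones((n_ctx, k))             # independent Beta(1,1) posteriors
beta = np.ones((n_ctx, k))
regret = 0.0
for _ in range(T):
    x = rng.integers(n_ctx)             # observe the context X_t
    # Sample every arm's mean at this context and play the argmax; under
    # independent priors this matches the posterior over optimal decision rules.
    i = int(np.argmax(rng.beta(alpha[x], beta[x])))
    r = float(rng.random() < true_means[x, i])
    regret += true_means[x].max() - true_means[x, i]
    alpha[x, i] += r
    beta[x, i] += 1 - r
avg_regret = regret / T
```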


\section{Lower Bound}\label{sec:lower}

The next theorem provides an impossibility result, showing that no algorithm can perform significantly better than the upper bounds of the previous section.
Our proof modifies well-known lower-bound examples for stationary bandits. 

\begin{theorem} \label{thm:lower-bound}
	Let $k > 1$ and $\tau \geq k$.
	There exists a nonstationary bandit problem instance with $|\mathcal{A}|=k$ and $\tau_{\rm eff}=\tau$, such that
	$$ \inf_{\pi}\Delta(\pi) \geq C \cdot \sigma \sqrt{ \frac{|\mathcal{A}|}{\tau_{\rm eff}} }, $$
    where $C$ is a universal constant.
\end{theorem}


\begin{remark}
	For the problem instance constructed in the proof, the entropy rate of the optimal action process is $\bar{H}(A^*) \approx \log(| \mathcal{A} |)/\tau_{\rm eff}$.
	This implies that the upper bound established in \cref{cor:k-armed} is tight up to logarithmic factors, and so is the one established in \cref{thm:main-result}.
\end{remark}




\section{Conclusion and Open Questions}
We have provided a new information-theoretic analysis of interactive learning in changing environments. 
The results offer an intriguing measure of the difficulty of learning:  the entropy rate of the optimal action process. 
A strength of the approach is that it applies to nonstationary variants of many of the most important learning problems. 
Rather than designing algorithms to make the proofs work, we show that most of our results apply to TS, one of the most widely used bandit algorithms. 

TS can explore too aggressively in nonstationary learning problems with short effective horizon. 
It is continually uncertain about the optimal arm and explores often in response. 
But the value of acquiring information is low if it cannot be exploited for long before the state of the environment changes. 
To resolve this, one should consider variants of TS that \emph{satisfice}, like the version proposed by \citet{liu22}.
We conjecture that a synthesis of our analysis with the information-ratio analysis of satisficing in \citet{russo22} 
could tighten our bounds in some cases. 
Namely, we believe they are loose in problems where the optimal action changes frequently and unpredictably, but one can still attain near-optimal rewards by tracking a (`satisficing') action sequence that changes less frequently. 
Worst-case examples in the variation-budget framework \cite{besbes14} are often addressed in this way; see, e.g., \citet{chen19}. 


\vfill 



  





\pagebreak














