% \documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage{microtype}
\usepackage{graphicx}
% \usepackage{subfigure}
\usepackage[font=small,skip=0pt]{caption}
\usepackage{enumitem}
\usepackage{ulem}
\normalem

% For theorems and such

\usepackage{multirow}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}

% if you use cleveref..
\usepackage[capitalize,noabbrev]{cleveref}

\hypersetup{hidelinks}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% EXTRAS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Attempt to make hyperref and algorithmic work together better:
\newcommand{\theHalgorithm}{\arabic{algorithm}}

% \usepackage{subfig}
\usepackage{subcaption}
\usepackage{algorithmic}
\usepackage{algorithm}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% THEOREMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}

\newcommand\algorithmicprocedure{\textbf{function}}
\newcommand{\algorithmicendprocedure}{\algorithmicend\ \algorithmicprocedure}
\makeatletter
\newcommand\PROCEDURE[3][default]{%
  \ALC@it
  \algorithmicprocedure\ \textsc{#2}(#3)%
  \ALC@com{#1}%
  \begin{ALC@prc}%
}
\newcommand\ENDPROCEDURE{%
  \end{ALC@prc}%
  \ifthenelse{\boolean{ALC@noend}}{}{%
    \ALC@it\algorithmicendprocedure
  }%
}
\newenvironment{ALC@prc}{\begin{ALC@g}}{\end{ALC@g}}
\makeatother


%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Bayesian Inference Approach for Entropy Regularized Reinforcement Learning with Stochastic Dynamics}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:Argenis Arriojas <arriojasmaldonado001@umb.edu>?Subject=Your UAI 2023 paper}{Argenis Arriojas}{}}
\author[1]{Jacob Adamczyk}
\author[2]{Stas Tiomkin}
\author[1]{\href{mailto:Rahul Kulkarni <rahul.kulkarni@umb.edu>?Subject=Your UAI 2023 paper: Bayesian inference ...}{Rahul V Kulkarni}{}}
% Add affiliations after the authors
\affil[1]{%
    Department of Physics\\
    University of Massachusetts Boston\\
    Boston, Massachusetts, USA
}
\affil[2]{%
    Department of Computer Engineering\\
    San Jose State University\\
    San Jose, California, USA
}

  \begin{document}
\maketitle

\begin{abstract}
We develop a novel approach to determine the optimal policy in entropy-regularized reinforcement learning (RL) with stochastic dynamics. For deterministic dynamics, the optimal policy can be derived using Bayesian inference in the control-as-inference framework; however, for stochastic dynamics, the direct use of this approach leads to risk-taking optimistic policies. To address this issue, current approaches in entropy-regularized RL involve a constrained optimization procedure which fixes the system dynamics to the original dynamics; however, this approach is not consistent with the unconstrained Bayesian inference framework. In this work, we resolve this inconsistency by developing an exact mapping from the constrained optimization problem in entropy-regularized RL to a different optimization problem which can be solved using the unconstrained Bayesian inference approach. We show that the optimal policies are the same for both problems; our results thus lead to the exact solution for the optimal policy in entropy-regularized RL with stochastic dynamics through Bayesian inference.
\end{abstract}

\section{Motivation}
Reinforcement learning (RL) provides a promising framework for training artificial agents for goal-oriented tasks through trial-and-error interaction with the environment~\citep{Sutton2018,Zhu2020The}. Specifically, the agent receives rewards in the process of solving a task according to a predefined reward function, and this interaction informs the agent's behavior policy. The aim is to determine the optimal policy which, in the original formulation of reinforcement learning, maximizes the expected accumulated reward. The problem of RL can be addressed in the model-based~\citep{606886,620043,NIPS2004_02f657d5,10.5555/3294771.3294858,asadi2018simple,pmlr-v80-corneil18a,lowrey2018plan} or model-free settings~\citep{Watkins1992May,NIPS2010_091d584f,Mnih2015Feb,8169685}. In the former case, the agent has access to a model of the environment; in the latter case, it has access only to samples from the environment.
Approaches based on RL have led to remarkable successes in robotics~\citep{Zhu2020The}, board games~\citep{Silver2018Dec,Schrittwieser2020Dec}, and many other fields~\citep{9397429,yu2019reinforcement,charpentier2021reinforcement}. 

A more general framework is entropy-regularized reinforcement learning, which considers reward accumulation with an entropy-based regularization term~\citep{Haarnoja2017Jul,haarnoja2018soft,nachum2017bridging}. The entropic regularization term corresponds to a control cost associated with the control policy (relative to a prior policy) and leads to stochastic optimal policies that are robust to environmental changes~\citep{eysenbach2021maximum} and show improved exploration~\citep{Haarnoja2017Jul}.
Moreover, entropy regularization has been shown to improve convergence rates in policy gradient methods \citep{mei2020ontheglobal,cen2022fastglobal}.
This generalization of RL towards entropy-regularized RL also connects to statistical mechanics, since the free energy is a similar combination of energy and entropy terms. A series of recent works have revealed new connections between non-equilibrium statistical mechanics and entropy-regularized RL, which have led to new algorithms and applications~\citep{rose2021reinforcement,das2021reinforcement,arriojas2021closed}.

One of the advantages of entropy-regularized RL is that it enables us to recast the problem of reward maximization into a problem of Bayesian Inference~\citep{todorov2008general,rawlik2012stochastic,Kappen2012May,Levine2018May}. This insight brings the rich arsenal of tools in Bayesian inference~\citep{koller2009probabilistic} to control and reinforcement learning, motivating the development of the control-as-inference framework~\citep{rawlik2012stochastic,Kappen2012May,Levine2018May}.
An important aspect of this approach is that, in the case of stochastic dynamics, an optimal control-as-inference solution involves inferring both a posterior policy and a posterior transition dynamics. Correspondingly, a direct application of this approach for stochastic dynamics leads to the ``optimistic agent problem'' (see Fig.~\ref{fig:optimistic_vs_optimal}), in which the agent unreasonably assumes for the optimal solution that it can control not only its policy, but also the system dynamics. In practice, system dynamics is typically fixed (e.g. a robot with fixed physical parameters) and not within the agent's control. In such cases, the policy derived using the control-as-inference framework for entropy-regularized RL is sub-optimal. Obtaining the optimal policy using Bayesian inference while imposing the constraint that the dynamics remain fixed is an open problem in entropy-regularized RL, which motivates the current work.


In this work, we take a step towards resolving this problem by showing how to solve the constrained dynamics optimization problem in entropy-regularized RL using a Bayesian inference-based solution. We also develop a model-free algorithm based on the results derived and validate our approach in tabular settings. A simple example illustrating the application of our approach is shown in Fig.~\ref{fig:optimistic_vs_optimal}.
The insights obtained can also be used to develop novel approaches to address model-based and model-free problems with dynamics shift.
Our main contributions include the following:
\begin{itemize}
    \item a formal mapping of the optimization problem in entropy-regularized RL with fixed (constrained) stochastic dynamics to a different problem for which the optimization is unconstrained with respect to the dynamics. The derived mapping ensures that the optimal policy is identical for the two optimization problems. 
    \item an algorithm for obtaining the optimal policy for entropy-regularized RL with arbitrary fixed dynamics (i.e. not necessarily constrained to the original dynamics), which can also be applied to problems involving distribution shift for system dynamics.
\end{itemize}


\begin{figure}[t!]
    \centering
    \includegraphics[width=\linewidth]{Fig1.png}
    
    \caption{
    Demonstration of the optimistic agent problem in a cliff environment with stochastic dynamics.
    Left: The maze layout showing the force of wind in six states. At each time step the agent must choose a direction to walk, and the wind may push it in one direction with a probability determined by the wind direction and intensity. Here there is a $35\%$ chance to move left, a $35\%$ chance to move down, and a $30\%$ chance of no move due to wind. Each time step has a fixed penalty $r=-1$. Red crosses represent traps with $r=-5$, and the golden star is the goal with $r=0$. The MDP is such that the agent transitions to the start state (green circle) after stepping into a trap or the goal.
    Center: Policies computed {with our proposed biasing method} (bottom) and without biases (top), with $\beta=50$.
    Right: The corresponding state visitation distribution for each policy. The optimistic agent fails to predict how often it will fall off the cliff, while the optimal solution has realistic expectations.
    }
    \label{fig:optimistic_vs_optimal}
\end{figure}




\section{Relevant Work}

Entropy-regularized RL can be seen as a particular case of the general problem of minimization of a {\it free energy functional}, wherein energetic quantities such as reward, value, and energy, are combined with entropic quantities such as entropy, cross entropy, and mutual information. 
Previously, the utility of this combination has been studied from the perspectives of i) cognitive science \citep{friston2009free,friston2006free}, ii) information theory \citep{tishby2011information,tiomkin2017unified}, iii) control \citep{mitter2000duality,todorov2008general,watson2021stochastic}, 
iv) robotics \citep{toussaint2009robot}, and v) reinforcement learning \citep{nachum2017bridging,haarnoja2018soft,Levine2018May}. The preceding is only a short list of prior work that invokes the free energy formalism; it is intended to give the reader the big picture and to place the current work in a broader context.

The most relevant prior work to the current research is in the setting of RL \citep{rawlik2012stochastic,nachum2017bridging,haarnoja2018soft,Levine2018May}. In particular, the question that motivates our work, i.e. how to find the optimal policy using the framework of control-as-inference for the case of stochastic dynamics, has been clearly discussed by \citet{Levine2018May}.
As noted by \citet{Levine2018May}, the standard solution to this question within the formalism of control-as-inference results in a policy that leads to undesirable risk-taking behavior.
In this work we develop a novel mapping that leads to the derivation of an exact solution for entropy-regularized RL within the framework of control-as-inference in the general case of stochastic dynamics. 


\section{Preliminaries}
In this section, we overview the standard setting of Markov decision processes (MDP) in RL. Then, we discuss its extension to entropy-regularized RL, providing both the classical perspective of control-as-inference and the free energy perspective. The latter emphasizes the usefulness of the properties of free energy for the derivation of the optimal solution.
Then, we overview an existing analytical solution (in the long-time limit) for the general case of entropy-regularized RL with {\bf\it{unconstrained}} stochastic dynamics \citep{arriojas2021closed}, which we apply in Section \ref{sec:ConstrainedMaxEntRL} to solve the general case of {\bf\it constrained/fixed} stochastic dynamics.


\subsection{Markov Decision Processes in RL}
In the following, we introduce the notation for the standard MDP formulation for RL \citep{puterman2014markov}. We will focus on the undiscounted finite-horizon version with horizon $T$. The state of the system is denoted by $s \in \mathcal{S}$ and actions are denoted by $a \in \mathcal{A}$. The action of the agent is specified by the policy function $\pi(a|s)$ which represents the probability of choosing action $a$, given that the state is $s$. The initial state distribution is denoted by $\mu(s)$. In the following, we take $\mu$ to be deterministic, i.e. we fix the initial state. The dynamics is determined by the state-transition function $p(s'|s,a)$ which denotes the probability of transitioning to state $s'$ given that action $a$ was chosen when in state $s$. The reward function $r(s,a)$ specifies the reward received after choosing action~$a$ in state~$s$.

The objective in standard RL is to find the optimal policy that maximizes expected rewards collected by the agent, i.e.
\begin{equation}\label{eq:optimal_policy_1}
\pi^* = \arg\max_{\pi} \mathbb{E} \left[ \sum_{t=1}^{T} r(s_t,a_t) \right],
\end{equation}
where the expectation is taken over the possible trajectories generated by following $\pi$, and subject to the problem's dynamics. The summation represents the sequence of steps that form a trajectory.


\subsection{Entropy-regularized RL}
In entropy-regularized RL \citep{pmlr-v80-haarnoja18b,Levine2018May}, the preceding objective function is modified to include an entropic regularization term, such that the optimal policy is given by
\begin{equation}\label{eq:optimal_policy_2}
\pi^* = \arg\max_{\pi}
\mathbb{E}
\left[ \sum_{t=1}^{T} r(s_t,a_t) - \frac{1}{\beta} \log \left(\frac{\pi(a_t|s_t)}{\pi^{0}(a_t|s_t)}\right) \right]
\end{equation}
where $\beta$ is an inverse temperature parameter and $\pi^{0}$ denotes the prior policy distribution. 
In the special case of maximum entropy RL (MaxEnt RL), the prior policy is taken to be the uniform distribution over actions~\citep{Levine2018May}. In the above formulation, it is implicit that the system dynamics remains fixed 
to the original dynamics $p(s'|s,a)$ and the optimization is over the policy distribution $\pi$. Furthermore, we note that,
without any loss of generality, we will consider reward functions such that $r(s,a) \le 0$, since a constant offset for the reward function for all state-action pairs does not impact the optimal policy \citep{Levine2018May}.


Let us now consider the preceding optimization problem from the trajectory perspective.
Let $\tau:= \{(s_t, a_t)\}_{t=1}^T$ denote a trajectory and (with a slight abuse of notation) let $p(\tau)$ denote the corresponding trajectory distribution with the dynamics fixed to the original system dynamics. Given the initial state distribution $\mu(s_{1})$ and a control policy $\pi(a_t|s_t)$, the trajectory distribution can be expressed as
\begin{equation}
    p(\tau) = \mu(s_1)\prod_{t=1}^{T} p(s_{t+1}|s_t,a_t)\pi(a_t|s_t).\label{eq:prob_trajectory}
\end{equation}
Note that $p(s_{t+1}|s_t,a_t)$, $\pi(a_t|s_t)$ and $\mu(s_1)$ are all normalized probability distribution functions. When the control policy is taken to be the prior policy $\pi^{0}(a_t|s_t)$, the corresponding prior trajectory distribution will be denoted by 
$p_{0}(\tau)$. 
Furthermore, let us denote the energy of a trajectory $\tau$ as
\begin{equation*}
    E(\tau) = - \sum_{t=1}^T r(s_t,a_t).
\end{equation*}
It is readily seen that the optimization problem in Eqn. \eqref{eq:optimal_policy_2} is equivalent to determining the trajectory distribution $p(\tau)$ that \textit{minimizes} the objective function: 
\begin{equation}
J\left[p(\tau)\right] = \mathbb{E}_{\tau \sim{} p(\tau)}\left[E(\tau)\right] + \frac{1}{\beta} {\cal{H}}(p(\tau)|p_{0}(\tau))
\label{eq:ent reg objective}
\end{equation}
 where 
${\cal{H}}(p(\tau)|p_{0}(\tau))$ denotes the relative entropy between the prior and controlled trajectory distributions: 
\begin{equation*}
    \mathcal{H}(p(\tau)|p_{0}(\tau)) = \sum_{\tau} p(\tau)\log\frac{p(\tau)}{p_{0}(\tau)}.
\end{equation*}
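The step from Eqn.~\eqref{eq:optimal_policy_2} to Eqn.~\eqref{eq:ent reg objective} can be sketched as follows, assuming only the factorization in Eqn.~\eqref{eq:prob_trajectory}. Since $p(\tau)$ and $p_{0}(\tau)$ share the same initial-state and transition factors, these cancel in the ratio,
\begin{equation*}
    \frac{p(\tau)}{p_{0}(\tau)} = \prod_{t=1}^{T} \frac{\pi(a_t|s_t)}{\pi^{0}(a_t|s_t)},
\end{equation*}
so that
\begin{equation*}
    \mathcal{H}(p(\tau)|p_{0}(\tau)) = \mathbb{E}_{\tau \sim p(\tau)}\left[ \sum_{t=1}^{T} \log \frac{\pi(a_t|s_t)}{\pi^{0}(a_t|s_t)} \right].
\end{equation*}
Combined with the definition of $E(\tau)$, this shows that $J\left[p(\tau)\right]$ is the negative of the expectation maximized in Eqn.~\eqref{eq:optimal_policy_2}; minimizing $J$ over trajectory distributions with fixed dynamics is therefore the same as maximizing the regularized return over $\pi$.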
    


\subsection{Control-as-Inference Approach}\label{sec:Free energy - relative entropy duality}


To connect to the control-as-inference approach, let us consider a general controlled trajectory distribution denoted by $q(\tau)$. In contrast with $p(\tau)$ in Eqn. \eqref{eq:prob_trajectory},  for $q(\tau)$ the system's transition dynamics is not constrained to be the same as the original dynamics. In this more general setting, the objective function is the same as in Eqn.~\eqref{eq:ent reg objective} but with $p(\tau)$ replaced by the unconstrained trajectory distribution $q(\tau)$:
\begin{equation}
J\left[q(\tau)\right] = \mathbb{E}_{\tau \sim{} q(\tau)}\left[E(\tau)\right] + \frac{1}{\beta} {\cal{H}}(q(\tau)|p_{0}(\tau)).
\end{equation}
We will refer to the problem of minimizing this objective $J\left[q(\tau)\right]$ as the \textit{unconstrained} optimization problem.


The solution of the unconstrained optimization problem is related to the concept of free energy. Given a prior trajectory distribution $p_{0}(\tau)$, the corresponding free energy is defined as
\begin{equation*}
    F \doteq - \frac{1}{\beta} \log \mathcal{Z},\mbox{ where } \mathcal{Z} \doteq \sum_{\tau} p_{0}(\tau) e^{-\beta E(\tau)}.
\end{equation*}
The connection to the unconstrained optimization problem is given by the relationship~\citep{mitter2000duality, todorov2008general,theodorou2012relative}:
\begin{equation}
\label{eq:free-energy-relative-entropy}
    F = \inf_{q(\tau)} \left[ \left<E\right>_q + \frac{1}{\beta} \mathcal{H}(q(\tau)|p_{0}(\tau)) \right].
\end{equation}
Therefore, the free energy $F$ above is the optimal value of the objective in the unconstrained optimization problem.

Furthermore, the corresponding optimal trajectory distribution $q(\tau)=q^*(\tau)$ is given by~\citep{mitter2000duality, todorov2008general,theodorou2012relative} 
\begin{equation}
\label{eq:optimal_p}
    q^*(\tau) = \frac{p_{0}(\tau)e^{-\beta E(\tau)}}{\sum_{\tau'} p_{0}(\tau')e^{-\beta E(\tau')}}.
\end{equation}
We will refer to the preceding result for the optimal trajectory distribution as the \textit{inference approach solution}. 
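
One way to see why Eqn.~\eqref{eq:optimal_p} attains the infimum in Eqn.~\eqref{eq:free-energy-relative-entropy} is the following short sketch. Writing $p_{0}(\tau)e^{-\beta E(\tau)} = \mathcal{Z}\, q^*(\tau)$, we have $\log\frac{q(\tau)}{p_{0}(\tau)} = \log\frac{q(\tau)}{q^*(\tau)} - \beta E(\tau) - \log\mathcal{Z}$, and hence
\begin{equation*}
    \left<E\right>_q + \frac{1}{\beta}\mathcal{H}(q(\tau)|p_{0}(\tau)) = F + \frac{1}{\beta}\mathcal{H}(q(\tau)|q^*(\tau)).
\end{equation*}
Since the relative entropy is non-negative and vanishes only for $q = q^*$, the objective is bounded below by $F$, and the bound is attained at $q^*(\tau)$.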

This result provides insight into the control-as-inference framework. This approach~\citep{Ziebart2010Jun,toussaint2009robot,Levine2018May} involves the introduction of the binary random variable $\mathcal{O}_t$ such that
\begin{equation}
    p(\mathcal{O}_t = 1 | s_t, a_t) = \exp(\beta r(s_t,a_t)).
\end{equation} 

This choice is motivated by the observation that, conditioned on optimality (i.e. $\mathcal{O}_t =1$ for all $t$), the posterior trajectory distribution  $p(\tau|\mathcal{O}_{1:T})$ exactly corresponds to the optimal control distribution in Eqn. \eqref{eq:optimal_p}.
Correspondingly, the posterior policy derived using this Bayesian approach is the optimal policy for the unconstrained optimization problem.
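
This correspondence can be sketched with Bayes' rule, assuming that the prior over trajectories is $p_{0}(\tau)$ and that the optimality variables are conditionally independent given the trajectory:
\begin{align*}
    p(\tau|\mathcal{O}_{1:T}=1) &\propto p_{0}(\tau)\prod_{t=1}^{T} p(\mathcal{O}_t = 1 | s_t, a_t)\\
    &= p_{0}(\tau)\exp\Big(\beta\sum_{t=1}^{T} r(s_t,a_t)\Big)
    = p_{0}(\tau)e^{-\beta E(\tau)}.
\end{align*}
Normalizing over trajectories recovers Eqn.~\eqref{eq:optimal_p}. The convention $r(s,a)\le 0$ ensures that $\exp(\beta r(s_t,a_t))\le 1$, so that it is a valid probability.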



\subsection{Entropy-Regularized RL via Unconstrained Optimization}
One of the advantages of the inference approach solution is that, in the long-time limit, it is possible to derive analytical expressions for the optimal policy and optimal dynamics. Recent work~\citep{arriojas2021closed}, using approaches from large deviation theory, has shown how the optimal dynamics and policy can be expressed in terms of the Perron-Frobenius eigenvalue ($e^{-\theta}$) and corresponding left eigenvector ($u(s,a)$) of a sub-stochastic matrix ($\widetilde{P}$) whose elements are given by
\begin{equation*}
    \widetilde{P}_{(s',a'),(s,a)}= p(s'|s,a)\pi^{0}(a'|s')e^{\beta r(s,a)}\label{eq:twisted_transition_matrix}
\end{equation*}

Using this framework, it can be shown \citep{arriojas2021closed} that the posterior (i.e. optimal) transition dynamics $p^*$ is related to the original transition dynamics by:
\begin{equation}
    p^{*}(s'|s,a)
    \propto p(s'|s,a)e^{\beta V^*(s')}\label{eq:optimal_dynamics_equations1}
\end{equation}
where $V^*(s)$ is the optimal value function and the proportionality constant (for each $s,a$) is determined by normalization.

Let us now consider the \textit{constrained optimization} problem, with the objective function defined by Eqn.~\eqref{eq:ent reg objective}, i.e. the transition dynamics is fixed to the original dynamics $p(s'|s,a)$.
For the case of deterministic transition dynamics, the solution to the constrained optimization problem is provided by the inference approach solution. This can be seen from Eqn.~\eqref{eq:optimal_dynamics_equations1}, which shows that, for the case of deterministic dynamics, the optimal dynamics is the same as the original dynamics. However, for the case of stochastic dynamics, the same result indicates that the optimal dynamics is, in general, different from the original dynamics. Thus, the constraint that the optimal dynamics is the same as the original dynamics is satisfied by the inference approach solution for the case of deterministic dynamics but not for stochastic dynamics.
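
As a small illustration of this distinction, consider a hypothetical state-action pair $(s,a)$ with two equally likely successors $s'_1$ and $s'_2$, with $V^*(s'_1)=0$, $V^*(s'_2)=-1$ and $\beta=1$ (these numbers are chosen for illustration only and are not taken from the experiments). Eqn.~\eqref{eq:optimal_dynamics_equations1} then gives
\begin{equation*}
    p^{*}(s'_1|s,a) = \frac{\tfrac{1}{2}e^{0}}{\tfrac{1}{2}e^{0} + \tfrac{1}{2}e^{-1}} = \frac{1}{1+e^{-1}} \approx 0.73,
\end{equation*}
and $p^{*}(s'_2|s,a)\approx 0.27$, so the posterior dynamics shifts probability towards the higher-value successor. If instead $p(\cdot|s,a)$ places all of its mass on a single successor, normalization forces $p^{*}(\cdot|s,a)=p(\cdot|s,a)$ for any $V^*$.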



\subsubsection{The optimistic agent problem {in the inference approach solution}}

The results for the inference approach solution outlined in Eqn. \eqref{eq:optimal_dynamics_equations1}
define the posterior transition dynamics that is necessary to achieve optimal control.
Although this result can be useful in scenarios where the transition dynamics can be controlled, in many cases such control is not feasible.
In such cases, the resulting policy derived from the inference approach is no longer optimal, since the agent optimistically expects that unfavorable transitions are unlikely \citep{Levine2018May} (see Fig.~\ref{fig:optimistic_vs_optimal}).


An additional perspective on the optimistic agent problem comes from considering the backup equations for the optimal soft value functions (assuming a prior policy $\pi^0(a|s)$)~\citep{haarnoja2018soft}
\begin{align}
    Q(s,a) &= r(s,a) + \sum_{s'}p(s'|s,a)V(s')\label{eq:q_backup_constrained},\\
    V(s) &= \sum_a \pi^*(a|s) \left[Q(s,a) - \frac{1}{\beta}\log \frac{\pi^*(a|s)}{\pi^0(a|s)}\right]\label{eq:q_backup_constrained_1}.
\end{align}
Here the constraint is implicitly imposed in Eqn. \eqref{eq:q_backup_constrained}, where the original dynamics is directly used. Note that the optimism problem does not arise when we consider the equations above. However, when we consider the inference-based approach we get the following backup equations~\citep{Levine2018May}


\begin{align}
    Q(s,a) &= r(s,a) + \frac{1}{\beta} \log \sum_{s'} p(s'|s,a) e^{\beta V(s')}\label{eq:q_backup_inference},\\
    V(s) &= \frac{1}{\beta}\log \sum_{a} \pi^0(a|s) e^{\beta Q(s,a)}\label{eq:q_backup_inference_1}. 
\end{align}
We note that the backup equation for $Q(s,a)$ in the inference approach (Eqn. \eqref{eq:q_backup_inference}) is equivalent to Eqn. \eqref{eq:q_backup_constrained} only for the case of deterministic dynamics. For stochastic dynamics, the averaging over exponentiated future rewards \citep{Levine2018May,levine2013Jun} in Eqn.~\eqref{eq:q_backup_inference} is the source of optimistic behavior by the agent.
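
The direction of this bias can be made explicit with Jensen's inequality. For $\beta>0$, concavity of the logarithm gives
\begin{equation*}
    \frac{1}{\beta}\log\sum_{s'}p(s'|s,a)e^{\beta V(s')} \ge \sum_{s'}p(s'|s,a)V(s'),
\end{equation*}
with equality if and only if $V(s')$ is constant on the support of $p(\cdot|s,a)$, which includes the deterministic case. As a hypothetical numerical illustration (not taken from the experiments), for two equally likely successors with values $0$ and $-1$ and $\beta=1$, the left-hand side is $\log\left(\tfrac{1}{2}+\tfrac{1}{2}e^{-1}\right)\approx -0.38$, whereas the expected value on the right-hand side is $-0.5$.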

In summary, two sources of the optimistic agent problem for stochastic dynamics in the inference approach to entropy-regularized RL are: 1) averaging over exponentiated rewards in the value function computation and 2) posterior transition dynamics being different from the original dynamics. To resolve this problem, we develop an approach that ensures that (i) the posterior transition dynamics is fixed to the original dynamics {\em and} (ii) the backup equations, even though they involve averaging over exponentiated future rewards, reduce to the entropy-regularized RL backup equations Eqns.~\eqref{eq:q_backup_constrained} and \eqref{eq:q_backup_constrained_1}.
Note that there can be other sources of optimistic behavior in finite-horizon control-as-inference approaches
(as discussed in \citep{watson2021stochastic});
however, these issues do not apply to the current formulation and thus are not considered in this work.

Given that the unconstrained entropy-regularized RL problem for stochastic dynamics can be solved exactly using the inference approach, we ask if it is possible to similarly solve the constrained entropy-regularized RL problem. In the next section, we present an approach to solve the constrained entropy-regularized RL problem through a transformation of the unconstrained approach.



\section{Constrained optimization via unconstrained inference}\label{sec:ConstrainedMaxEntRL}

The core idea underlying our approach for {{\it constrained optimization via unconstrained inference}} 
is outlined in the following.  
The results from the previous section show that, for the case of stochastic dynamics, the unconstrained inference approach solution leads to posterior dynamics that differs from the original dynamics. This implies that, if we want to use the unconstrained inference approach to obtain the solution for constrained entropy-regularized RL, it has to be applied to a {\em different} problem. Our approach is to determine the parameters for this different problem such that the optimal policy for the unconstrained inference problem is identical to the optimal policy of the original constrained optimization problem.

\begin{figure}[t!]
    \centering
    \includegraphics[width=0.9\linewidth]{Fig2.png}
    \caption{Left: an example maze with randomly placed traps and a wind field with random directions and intensities. Wind dynamics is similar to that of Fig.~\ref{fig:optimistic_vs_optimal}. Each time step has fixed penalization $r=-1$. Red crosses represent traps with $r=-2$, and the golden star is the goal with $r=0$. The MDP is such that the agent transitions to the start state (green circle) after stepping into a trap or the goal. Right: the state distribution induced by the optimal policy computed using the proposed method.
    This result has been validated by comparison to the ground truth solution computed with value/policy iteration.}
    \label{fig:stochastic_maze_solved}
\end{figure}


\subsection{Mapping to Constrained Optimization}\label{sec:Mapping to Unconstrained Optimization}


Let us begin by considering the general controlled trajectory distribution denoted by $q(\tau)$. In contrast with $p(\tau)$ in Eqn.~\eqref{eq:prob_trajectory}, the system dynamics is not constrained to be the same as the original dynamics for $q(\tau)$. As noted in the preceding section, the corresponding {\em unconstrained} objective function $J\left[q(\tau)\right]$
is minimized by the inference approach solution.

In the following, we will show how, for specific parameter choices for the dynamics and reward function, the unconstrained objective function $J\left[q(\tau)\right]$ reduces exactly to the objective function for constrained entropy-regularized RL, $J\left[p(\tau)\right]$ (Eqn.~\eqref{eq:ent reg objective}).

We begin by considering a modified unconstrained problem with a {\it biased} transition dynamics and {\it biased} reward function, $p_b(s'|s,a)$ and $r_b(s,a)$, respectively, which are given by
\begin{align}
  p_b(s'|s,a) &= b(s'|s,a)~p(s'|s,a)\label{eq:dynamics_biasing}\\
  r_b(s,a)&=r(s,a)+\delta(s,a)\le0~\forall~s,a,\label{eq:rewards_biasing}
\end{align}
with $b(s'|s,a)\!>\!0$ s.t. $\sum_{s'} p_b(s'|s,a) = 1$. 
For a given choice of biasing functions, we can express the corresponding unconstrained objective function, using Eqn. \eqref{eq:free-energy-relative-entropy}, as:
\begin{align}
    F\!=\!
    \inf_{q(\tau)}\!\! \left[\! \left<E\right>_q - \left<\delta\right>_q \!+ \!\frac{1}{\beta}\! \left<\log\frac{1}{b}\right>_q\!\!\!+ \frac{1}{\beta} \mathcal{H}(q(\tau)|p_{0}(\tau)) \right]\label{eq:full_unconstrained_biased_free_energy}
\end{align}
with $\delta(\tau) \doteq \sum_{t=1}^T \delta(s_t,a_t)$, $b(\tau) \doteq \prod_{t=1}^T b(s_{t+1}|s_t,a_t)$,
$\left<\log\frac{1}{b}\right>_q = \sum_\tau q(\tau)\log\frac{1}{b(\tau)}$, and $\left<\delta\right>_q = \sum_\tau q(\tau)\delta(\tau)$.
Since we want the inference approach solution to be identical to the solution for constrained entropy-regularized RL, the first condition is that the optimal dynamics for the biased model should be the same as the original dynamics. 
Using Eqn. \eqref{eq:optimal_dynamics_equations1}, this condition imposes the constraint equation

\begin{equation}
    \forall s,a:\ b(s'|s,a) \propto e^{-\beta V_b(s')}
    \label{eq:constraint_1}
\end{equation}
with the proportionality constant determined by normalization of the distribution function for transition dynamics.
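
The origin of this constraint can be sketched as follows, reading $V_b$ as the optimal value function of the biased problem. Applying Eqn.~\eqref{eq:optimal_dynamics_equations1} to the biased problem and using Eqn.~\eqref{eq:dynamics_biasing} gives
\begin{equation*}
    p_b^{*}(s'|s,a) \propto b(s'|s,a)\,p(s'|s,a)\,e^{\beta V_b(s')}.
\end{equation*}
Requiring $p_b^{*}(s'|s,a) = p(s'|s,a)$ for every $s'$ with $p(s'|s,a)>0$ means that $b(s'|s,a)e^{\beta V_b(s')}$ cannot depend on $s'$. Together with the normalization $\sum_{s'}p_b(s'|s,a)=1$, this fixes the biasing function as
\begin{equation*}
    b(s'|s,a) = \frac{e^{-\beta V_b(s')}}{\sum_{s''}p(s''|s,a)\,e^{-\beta V_b(s'')}}.
\end{equation*}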

We can interpret the above equation as follows: for any given choice of biased reward function $r_{b}(s,a)$,
this equation determines the biasing function for the dynamics $b(s'|s,a)$ such that the optimal dynamics for the biased problem is the same as the original dynamics. Thus, for each choice of $r_{b}(s,a)$ for which the above equation has a solution, we have identified biased dynamics parameters which satisfy the constraint on the dynamics.  Now we can ask the following question:

{\it Within this set of biased dynamics and biased reward functions, can we identify the choice of reward function which gives rise to the same optimal policy as constrained entropy-regularized RL?} 

Remarkably, we can derive a simple constraint equation that answers this question. The basic insight is that we need a condition such that the objective function for the biased unconstrained problem becomes identical to the objective function for constrained entropy-regularized RL. Correspondingly, we focus on the case where the cost contributions due to $b$ and $\delta$ cancel each other out in Eqn. \eqref{eq:full_unconstrained_biased_free_energy}. This can be achieved by choosing $\delta(s,a)$ such that
\begin{align}\begin{split}
    \beta\delta(s,a)
    &= - \sum_{s'} p(s'|s,a) \log b(s'|s, a)\\
    &= D_{\textrm{KL}}(p(\cdot|s,a)||p_b(\cdot|s,a))
\end{split}\label{eq:constraint_2}
\end{align}
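The second equality in Eqn. \eqref{eq:constraint_2} can be checked in one line. As a sketch, assume that the biased dynamics is the product $p_b(s'|s,a) = p(s'|s,a)\,b(s'|s,a)$, with $b$ normalized so that $p_b$ is a valid distribution. Then
\begin{align*}
    D_{\textrm{KL}}(p\,||\,p_b)
    &= \sum_{s'} p(s'|s,a) \log \frac{p(s'|s,a)}{p(s'|s,a)\,b(s'|s,a)}\\
    &= -\sum_{s'} p(s'|s,a) \log b(s'|s,a).
\end{align*}
Since the KL divergence is non-negative, this choice always yields $\delta(s,a) \geq 0$.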
As before, let $p(\tau)$ denote the trajectory distributions subject to the constraint that the dynamics is fixed to the original dynamics of the problem, such that the variation among different trajectory distributions is entirely due to the policy~$\pi$. 
After applying both constraints in Eqns. \eqref{eq:constraint_1} and \eqref{eq:constraint_2},
Eqn. \eqref{eq:full_unconstrained_biased_free_energy} simplifies to 
\begin{equation}
    F_{q} = \inf_{p(\tau)} \left[ \left<E\right>_p + \frac{1}{\beta} \mathcal{H}(p(\tau)|p_{0}(\tau))
    \right],
\end{equation}
which is the free energy objective to be minimized for the {\it constrained} problem (Eqn. \eqref{eq:ent reg objective}). In both cases, the optimization is to be carried out by varying the policy $\pi$, and the preceding derivation shows that for every policy $\pi$, the corresponding objective function (i.e. sum of energetic and entropic costs) is the same for the constrained and the unconstrained optimization problems, for a specific choice of biasing functions. Correspondingly, the optimal policy distribution is identical for the two problems.
Thus we have shown that, assuming the constraint Eqns. \eqref{eq:constraint_1} and \eqref{eq:constraint_2} can be solved, the constrained optimization problem is identical to an {\it unconstrained} optimization problem for biased dynamics and biased reward function, which can then be solved using the inference approach.




\subsection{Equivalence of backup equations}\label{sec:Equivalence of Original and Mapped Dynamics}

The previous section derived conditions which, when satisfied, allow the constrained entropy-regularized RL problem to be solved using the inference approach. It is instructive to examine the equivalence between the two optimization problems at the level of the corresponding backup equations.



Using Eqn. \eqref{eq:optimal_dynamics_equations1}, the inference approach backup equation 
(Eqn. \eqref{eq:q_backup_inference}) can be recast as  

\begin{multline}
Q(s,a) =  ~ r(s,a) +\sum_{s'}p^*(s'|s,a)V(s')\\
- \frac{1}{\beta} D_{\textrm{KL}}(p^*(\cdot|s,a)||p(\cdot|s,a)).
\label{eq:q_backup_unbiased_unconstrained}
\end{multline}

Comparing with Eqn. \eqref{eq:q_backup_constrained}, we see that the two equations are equivalent only when the optimal dynamics $p^*(s'|s,a)$ coincides with the original dynamics $p(s'|s,a)$, which is true only for the case of deterministic dynamics. 

When we consider the biased version of the unconstrained problem, Eqn. \eqref{eq:q_backup_unbiased_unconstrained} becomes
\begin{multline}
\label{eq:q_backup_biased_unconstrained_0}
Q(s,a) =  ~ r(s,a) +\sum_{s'}p^*(s'|s,a)V(s')\\
+ \delta(s,a)
- \frac{1}{\beta} D_{\textrm{KL}}(p^*(\cdot|s,a)||p_b(\cdot|s,a)).
\end{multline}

We now consider biasing functions that satisfy the constraint in Eqn. \eqref{eq:constraint_2} which, when substituted in Eqn. \eqref{eq:q_backup_biased_unconstrained_0}, gives
\begin{multline}
Q(s,a) = r(s,a) +\sum_{s'}p^*(s'|s,a)V(s')\\
- \frac{1}{\beta} D_{\textrm{KL}}(p^*(\cdot|s,a)||p(\cdot|s,a)) \\
 + \frac{1}{\beta}\sum_{s'} \left[p^*(s'|s,a) - p(s'|s,a)\right]\log b(s'|s,a).
\label{eq:q_backup_biased_unconstrained}
\end{multline}
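The substitution leading to Eqn. \eqref{eq:q_backup_biased_unconstrained} can be spelled out as follows. As a sketch, assume that the biased dynamics is the product $p_b(s'|s,a) = p(s'|s,a)\,b(s'|s,a)$. The KL term in Eqn. \eqref{eq:q_backup_biased_unconstrained_0} then splits as
\begin{multline*}
D_{\textrm{KL}}(p^*(\cdot|s,a)||p_b(\cdot|s,a)) = D_{\textrm{KL}}(p^*(\cdot|s,a)||p(\cdot|s,a))\\
- \sum_{s'} p^*(s'|s,a)\log b(s'|s,a),
\end{multline*}
while Eqn. \eqref{eq:constraint_2} gives $\delta(s,a) = -\frac{1}{\beta}\sum_{s'} p(s'|s,a)\log b(s'|s,a)$. Combining the two contributions,
\begin{multline*}
\delta(s,a) - \frac{1}{\beta} D_{\textrm{KL}}(p^*(\cdot|s,a)||p_b(\cdot|s,a))\\
= - \frac{1}{\beta} D_{\textrm{KL}}(p^*(\cdot|s,a)||p(\cdot|s,a))\\
+ \frac{1}{\beta}\sum_{s'} \left[p^*(s'|s,a) - p(s'|s,a)\right]\log b(s'|s,a),
\end{multline*}
which is the combination of terms appearing in Eqn. \eqref{eq:q_backup_biased_unconstrained}.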



Finally, we note that the condition in Eqn. \eqref{eq:constraint_1} imposes the constraint $p^*=p$ (i.e. the biased optimal dynamics is the same as the original dynamics). With $p^*=p$, the last two terms in Eqn. \eqref{eq:q_backup_biased_unconstrained} vanish and the equation reduces to Eqn. \eqref{eq:q_backup_constrained}.
This shows that by solving the biased unconstrained optimization problem in this framework, with the bias parameters chosen to satisfy the constraint equations, we effectively solve the original, constrained version of entropy-regularized RL. 


In summary, the inference approach backup equations for {\it biased} dynamics reduce to the backup equations for constrained entropy-regularized RL for the optimal policy, thereby showing that both approaches lead to the same soft-value functions $Q(s,a)$ and $V(s)$. 





\begin{figure}[t!]
    \centering
    \includegraphics[width=0.9\linewidth]{Fig3.png}
    \caption{
    Agent performance in various standard benchmark dynamics as a function of the number of biasing iterations of the proposed method. 
    At iteration 0, no biases are applied and the obtained solution corresponds to the optimistic agent (the existing optimal solution by inference). 
    Transition dynamics are obtained from discretized state observations. 
    Then, Eqn. \eqref{eq:fixed_point_iteration} is used to find the optimal solutions. 
    See Algorithm \eqref{alg:cap} and Table (S1) in Appendix A for more details.}
    \label{fig:biasing_progression}
\end{figure}



\subsection{Optimization for arbitrary target dynamics}

The approach developed in previous sections can be generalized to the case where the transition dynamics is constrained to some arbitrary target distribution $\hat{p}(s'|s,a)$ (not necessarily the original dynamics). Specifically, the constrained optimization problem now corresponds to backup equations as given in Eqns. \eqref{eq:q_backup_constrained} and \eqref{eq:q_backup_constrained_1}, but with original dynamics $p(s'|s,a)$ replaced by $\hat{p}(s'|s,a)$. This situation is relevant when the agent's original dynamics changes due to failures, or can be changed by the agent to specific target dynamics, and we would like to determine the optimal policy for the new dynamics. 


It is readily seen that this scenario, which corresponds to a distribution shift in the transition dynamics, can be addressed by modifying the constraint equations (Eqns. \eqref{eq:constraint_1} and \eqref{eq:constraint_2}) as follows:
\begin{align}
    b(s'|s,a) &\propto \frac{\hat{p}(s'|s,a)}{p(s'|s,a)}
    e^{-\beta V_{b}(s')}
    \label{eq:general_b_constraint}\\
    \beta \delta(s,a) &= D_{\textrm{KL}}(\hat{p}(\cdot|s,a)||p_b(\cdot|s,a)).\label{eq:general_d_constraint}
\end{align}
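As a consistency check on these generalized constraints, the following sketch rests on two assumptions: that the biased dynamics is $p_b(s'|s,a) = p(s'|s,a)\,b(s'|s,a)$, and that Eqn. \eqref{eq:optimal_dynamics_equations1}, applied to the biased problem, has the exponentially tilted form $p^*(s'|s,a) \propto p_b(s'|s,a)\, e^{\beta V_b(s')}$. Substituting Eqn. \eqref{eq:general_b_constraint} then gives
\begin{align*}
    p^*(s'|s,a)
    &\propto p(s'|s,a)\,\frac{\hat{p}(s'|s,a)}{p(s'|s,a)}\,
    e^{-\beta V_{b}(s')}\, e^{\beta V_{b}(s')}\\
    &= \hat{p}(s'|s,a),
\end{align*}
so the optimal dynamics of the biased problem is the target dynamics. Setting $\hat{p} = p$ removes the ratio $\hat{p}/p$ and recovers Eqns. \eqref{eq:constraint_1} and \eqref{eq:constraint_2}.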



\subsection{Algorithms and experimental validation}
\label{sec:algorithm_and_experimental_validation}


In order to determine the optimal policy in constrained entropy-regularized RL using the inference approach, we need to determine the corresponding biased dynamics and rewards. We have developed an iterative procedure that
receives $\pi^0(a|s)$, $p(s'| s, a)$, $\hat{p}(s'|s, a)$, and $r(s, a)$, and calculates the biasing functions $b(s'|s,a)$ and $\delta(s,a)$ by iteratively solving the constraint equations. Details are provided in Algorithm~\eqref{alg:cap}.
The basic idea of the algorithm is to iteratively solve the unconstrained MDP problem, while updating the biasing functions for dynamics and rewards through Eqn. \eqref{eq:fixed_point_iteration}.
The algorithm implements a fixed-point iteration on the biasing functions $b$ and $\delta(b)$ (see Eqns. \eqref{eq:general_b_constraint} and \eqref{eq:general_d_constraint}), such that

\begin{equation}
    p_b^{(n+1)}(s'|s,a) = 
    \frac{1}{C}
    \hat{p}(s'|s,a)
    e^{-\beta V^{(n)}_b(s')}
    \label{eq:fixed_point_iteration}
\end{equation}
where $V_b^{(n)}$ is computed for the biased problem with $p_b^{(n)}$ and $r_b^{(n)}$, and $C$ is a normalization constant.
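
For concreteness, the normalization in Eqn. \eqref{eq:fixed_point_iteration} can be written explicitly. Since $p_b^{(n+1)}(\cdot|s,a)$ must sum to one for every state-action pair, the constant depends on $(s,a)$; writing it as $C^{(n)}(s,a)$ (a notation introduced here only for illustration), we have
\begin{equation*}
    C^{(n)}(s,a) = \sum_{s'} \hat{p}(s'|s,a)\, e^{-\beta V^{(n)}_b(s')}.
\end{equation*}
Assuming that the biased dynamics is the product $p_b = p\,b$, the corresponding update of the biasing function is $b^{(n+1)}(s'|s,a) = p_b^{(n+1)}(s'|s,a)/p(s'|s,a)$ wherever $p(s'|s,a) > 0$, which is Eqn. \eqref{eq:general_b_constraint} evaluated with $V_b^{(n)}$.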

Convergence is tested by computing the KL divergence between the optimal and target dynamics. Convergence is considered attained when
\[
\max_{(s,a)} \left[D_{\textrm{KL}}(p^*(\cdot|s,a)||\hat{p}(\cdot|s,a))\right] < 10^{-6}.
\]


\begin{algorithm}[t]
    \caption{Find Biases for dynamics and rewards}\label{algo:Find Biases for dynamics and rewards}
    \label{alg:cap}
\begin{algorithmic}
    \STATE {\bfseries Parameters:} inverse temperature $\beta$, update rate $\alpha$%, $N$    
    \STATE {\bfseries Input:} $\pi^0(a|s),$ $p(s'|s,a),$ $\hat{p}(s'|s,a),$ $r(s,a)$

    \STATE {\bfseries Output:} $b(s'|s,a)$, $\delta(s,a)$, $\Delta$
    \STATE 1. Initialize $b(s'|s,a) \gets 1$ and $\delta(s,a) \gets 0$\;
    \STATE 2. Initialize $p_b \gets p$ and $r_b \gets r$\;
    \REPEAT
    \STATE 3. $V, p^* \gets $ Unconstr($\beta,r_b,p_b, \pi^0$)\;
    \STATE 4. $p_b(s'|s,a) \gets \hat{p}(s'|s,a) e^{-\beta V(s')}$ \hfill See Eqn. \eqref{eq:fixed_point_iteration}
    \STATE 5. Normalize $p_b(s'|s,a)$
  
    \STATE 6. $\delta(s,a) \gets \beta^{-1} D_{\textrm{KL}}(\hat{p}(\cdot|s,a)||p_b(\cdot|s,a))$\\
    \hfill See~Eqn.~\eqref{eq:general_d_constraint}

    \STATE 7. $\Delta \gets \max_{(s,a)}[r(s,a) + \delta(s,a)]$\;

    \STATE 8. $r_b(s,a) \gets r(s,a) + \delta(s,a) - \Delta$

    \UNTIL{
        convergence $p^* \to \hat{p}$
    }
    \PROCEDURE{Unconstr}{$\beta$, $r(s, a)$, $p(s'|s, a)$, $\pi^0(a|s)$}
    \STATE a. $\widetilde{P}(s',a'|s,a) \gets \pi^0(a'|s')p(s'|s,a)\exp(\beta r(s,a))$\;
    \STATE b. get dominant eigenvalue $e^{-\theta}$ and left eigenvector $u$\;
    
    \STATE c. compute $e^{\beta V(s)} \gets \sum_a \pi^0(a|s)u(s,a)$
    \STATE d. $p^*(s'|s,a) \gets $ from Eqn. \eqref{eq:optimal_dynamics_equations1} \;
\STATE {\bfseries return: $V$, $p^*$}  
\ENDPROCEDURE
\end{algorithmic}
\end{algorithm}



To solve the unconstrained optimization problem in entropy-regularized RL using Bayesian inference, we have used the approach developed in \cite{arriojas2021closed}, where the optimal value functions are obtained  from the dominant left eigenvector $u(s,a)$ of the \emph{tilted} transition matrix $\widetilde{P}$ for the MDP. Algorithm~\eqref{alg:cap} summarizes this process in the \emph{Unconstr} function.


We have tested this algorithm in various environments, as summarized in Figs. \eqref{fig:stochastic_maze_solved} and \eqref{fig:biasing_progression}. The experimental details are provided in the Appendix.
To test the algorithm in the scenario of arbitrary target dynamics, we set up a model-based proof-of-concept experiment in which prior transition dynamics is defined for which the optimal policy can be obtained. We then introduce a change in the dynamics representing a failure mode in the agent. In the example presented in Fig.~\eqref{fig:proof_of_concept}, the agent can no longer walk directly towards the goal, but can still take advantage of the wind field to move in the desired direction. In this setting, we were able to find the optimal policy for the altered dynamics by following the procedure outlined above to determine the corresponding biases to the prior transition dynamics and reward function. 


\begin{figure}
    \centering
    \includegraphics[width=0.9\linewidth]{Fig4.png}
    \caption{A windy environment used to test the feasibility of the method for enforcing target transition dynamics that differ from the initial/prior transition dynamics. Left: the maze layout. Center: the solution to the original problem. Right: the new optimal solution to the modified problem, where the action ``up'' has been suppressed from the transition dynamics.}
    \label{fig:proof_of_concept}
\end{figure}




Finally, a model-free version that works in the tabular setting has been developed, wherein the biasing functions are learned through experience, along with the intermediate policies (see Appendix B and Algorithm (S1)). The environment used is the same as that shown in Figure \eqref{fig:optimistic_vs_optimal} (windy cliff environment). Our approach uses a single experience dataset, collected from the original dynamics and the prior policy (uniform policy), throughout the whole process, making it an off-policy approach. Figure \eqref{fig:model_free_rewards} shows the performance evaluation during the training process for several biasing iterations. As more iterations are completed, the optimistic behavior is removed. The proposed approach successfully leads to the optimal policy for constrained optimization.





\section{Discussion}
Control-as-inference is a powerful formalism for solving control problems using tools from Bayesian inference. Previously, the advantage of this formalism has been demonstrated by generalization of existing methods and derivation of new sophisticated algorithms. However, for the case of stochastic dynamics, this framework could not be directly applied to obtain the optimal solution for entropy-regularized RL. This work closes this gap in the field and provides a novel approach to the problem. 
Our solution can provide an alternative to standard approaches based on structured variational inference~\citep{Levine2018May}. In general, such approaches provide variational bounds, whereas our results show that there is a mapping to a problem that has an exact solution.


The proposed solution not only adds to the formalism of control-as-inference by providing an analytical solution in the general case, but also opens doors for new research directions and applications. For example, our method enables us to calculate the optimal policy and to choose optimal stochastic dynamics from a set of possible dynamics. A particular application of such an optimal choice of dynamics is hierarchical control, where the upper level (manager) signals to the lower level (worker) to change dynamics (e.g., to update system dynamics to different friction coefficients and/or different control gains). Another natural application is robots that self-recover from failures, for which the ability to find a policy that works well under distribution shift of the dynamics would be useful. The results derived provide a novel approach for addressing such issues.

The scope of this work is to develop a novel probabilistic inference-based solution to entropy-regularized RL with stochastic dynamics, which we demonstrate in various model-based and model-free environments.
We defer to future work the extension towards high-dimensional continuous spaces via function approximators. 
Another avenue for future work is a study of the theoretical properties of the iterative coupled equations for determining the biasing functions $b$ and $\delta$. We do not yet have a theoretical analysis of their convergence, but we do provide empirical evidence of convergence for various stochastic dynamics models. 



\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{Fig5.png}
    \caption{Progression of the learning process in a model-free setting. Biases are learned from experience along with the policies. The environment used is the same as in Figure \eqref{fig:optimistic_vs_optimal}. An initial policy is learned without any biasing, which results in an optimistic agent. Then biases are successively learned and new policies obtained. As expected, the optimistic policy has sub-optimal performance. After learning the biases, the approach recovers the optimal policy.}
    \label{fig:model_free_rewards}
\end{figure}


Finally, we note that the approach developed in this work can be applied more generally (i.e. beyond entropy-regularized RL) as outlined in the following.
Consider a setting wherein the solution to an unconstrained optimization problem is readily accessible (e.g. via Bayesian inference); however, the problem of interest requires constrained optimization. Our approach considers a broader class of optimization problems which, for a specific parameter choice, reduce to the original system of interest. We then ask the question: Can we determine a (different) set of parameters such that a) the optimal solution to the unconstrained optimization problem satisfies the constraints of the original optimization problem, and b) there is a one-to-one mapping between objective functions for the two optimization problems? It will be of interest to see whether the approach presented here, which solves constrained optimization problems by mapping them to unconstrained problems that can be analyzed via inference, can also be applied to other settings involving a more general class of objective functions \citep{pmlr-v97-hazan19a,NEURIPS2020_30ee748d}.


\section*{Acknowledgments}

The authors would like to thank the anonymous reviewers for their helpful comments and suggestions. JA, AA, and RVK acknowledge funding support from the NSF through Award No. DMS-1854350. ST acknowledges funding support from the NSF through Award No. 2246221. JA and AA would like to acknowledge the use of the supercomputing facilities managed by the Research Computing Department at the University of Massachusetts Boston. The work of JA and AA was supported in part by the College of Science and Mathematics Dean's Doctoral Research Fellowship through fellowship support from Oracle, project ID R20000000025727. JA and RVK would like to acknowledge support from the Proposal Development Grant provided by the University of Massachusetts Boston. ST acknowledges support from the Alliance Innovation Lab in Silicon Valley.


\section*{Software and Data}

We make source code immediately available at \citep{githubRepo}, which can be used to reproduce all the results obtained. 


\bibliography{arriojas_611}



\end{document}