% \documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like

%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
% Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
 % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

% for cross referencing the main text
% PLEASE ONLY USE xr IN THE SUPPLEMENTARY MATERIAL. 
% In the main paper, hard code any cross-reference to the supplementary material. 
%\usepackage{xr} 
%\externaldocument{uai2023-template}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{On the Limitations of Markovian Rewards\\ to Express Multi-Objective, Risk-Sensitive, and Modal Tasks}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1,2]{\href{mailto:<joar.skalse@cs.ox.ac.uk>?Subject=Your UAI 2023 paper}{Joar~Skalse}{}}
\author[1]{Alessandro~Abate}
% Add affiliations after the authors
\affil[1]{%
    Computer Science Department\\
    Oxford University\\
    Oxford, UK
}
\affil[2]{%
    The Future of Humanity Institute\\
    Oxford, UK\\
}

  
\usepackage{amsthm, amsfonts, bbm, csquotes, amssymb} 

%\theoremstyle{plain}
%\newtheorem{corollary}[theorem]
%\newtheorem{theorem}{Theorem} %[section]
%\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{proposition}{Proposition}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{corollary}{Corollary}
\newtheorem{example}{Example}
%\theoremstyle{definition}
\newtheorem{definition}{Definition}
\newtheorem{assumption}{Assumption}
%\theoremstyle{remark}
\newtheorem{remark}{Remark}

\usepackage{xcolor}
\newcommand{\red}[1]{\textcolor{red}{#1}}

\newcommand{\M}{\mathcal{M}}

\newcommand{\States}{\mathcal{S}}
\newcommand{\Actions}{\mathcal{A}}
\newcommand{\mcS}{\mathcal{S}}
\newcommand{\mcA}{\mathcal{A}}
%\newcommand{\t}{\tau}
\newcommand{\init}{\mu_0}
\newcommand{\R}{R}
\newcommand{\y}{\gamma}
\newcommand{\Rs}{\textbf{R}}
%\newcommand{\ys}{\textbf{\y}}

\newcommand{\Ob}{{\mathcal{O}}}

\newcommand{\m}{m_{\tau,\init,\gamma}}

\newcommand{\SxA}{\mcS \times \mcA}
\newcommand{\SxAxS}{\mcS \times \mcA \times \mcS}

\newcommand{\MDP}{\langle \mcS, \mcA, \tau, \init, \R, \y \rangle}
\newcommand{\MOMDP}{\langle \mcS, \mcA, \tau, \init, \Rs, \y \rangle}
\newcommand{\MDPwO}{\langle \mcS, \mcA, \tau, \init, \tilde{\R}, \y \rangle}
\newcommand{\MDPwOb}{\langle \mcS, \mcA, \tau, \init, \hat{\R}, \y \rangle}
\newcommand{\env}{\langle \mcS, \mcA, \tau, \init, \_, \y \rangle}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%\iclrfinalcopy % Uncomment for camera-ready version, but NOT for submission.
\begin{document}


\maketitle

\begin{abstract}
In this paper, we study the expressivity of scalar, Markovian reward functions in Reinforcement Learning (RL), and identify several limitations to what they can express.
Specifically, we look at three classes of RL tasks: multi-objective RL, risk-sensitive RL, and modal RL.
For each class, we derive necessary and sufficient conditions that describe when a problem in this class can be expressed using a scalar, Markovian reward. Moreover, we find that scalar, Markovian rewards are unable to express most of the instances in each of these three classes. We thereby contribute to a more complete understanding of what standard reward functions can and cannot express.
In addition to this, we also call attention to modal problems as a new class of problems, since they have so far not been given any systematic treatment in the RL literature. We also briefly outline some approaches for solving some of the problems we discuss, by means of bespoke RL algorithms.
\end{abstract}

\section{Introduction}

%A reinforcement learning (RL) algorithm aims to maximise the expectation of a scalar reward function \citep{sutton2018reinforcement}. 
%Therefore, in order to use RL to solve some task, we must first encode that task using such a reward function. 

%\textcolor{blue}{Should we use instead https://www.deepmind.com/publications/reward-is-enough - and call this 'reward sufficiency' hypothesis?}

%\red{Those two hypothesis are actually slightly different! And I want to explicitly mention this somewhere (either the intro or discussion). The "reward hypothesis" says that all "goals and purposes" can be encoded as a reward function, whereas the "reward is enough" hypothesis says that maximising a (any?) reward function with sufficient competence in a sufficiently complex environment will give rise to intelligence.} 

%\textcolor{blue}{Thanks for the clarification, I was unaware of this difference and, since the latter claim is quite popular, I'd surely welcome you commenting on their difference, as done above.}

To solve a task using reinforcement learning (RL), we must first encode that task as a reward function \citep{sutton2018reinforcement}.
Typically, these rewards are \emph{scalar} and \emph{Markovian}. 
However, it is often not straightforward to determine if a given task \emph{can} be adequately expressed using such a reward function. Therefore, understanding the expressivity of scalar, Markovian rewards is a basic and foundational question of the RL setting.
%In this paper, we study some limitations in the expressivity of such reward functions. 
In this paper, we identify and characterise several specific limitations in the expressivity of scalar, Markovian rewards.
Specifically, we examine three broad classes of tasks, all of which are both intuitive to understand, and useful in many practical situations. 
We then derive necessary and sufficient conditions that describe when these tasks can be expressed using ordinary reward functions, and consequently
show that \emph{almost no} tasks in any of these three classes can be expressed using scalar, Markovian rewards.
This suggests that scalar, Markovian reward functions are semantically limited in certain important ways. %, and raises a warning for their indiscriminate use.
We thus contribute to a more complete understanding of what standard reward functions can and cannot express. This clarifies the implicit assumptions behind many common RL techniques, and makes it easier to determine if they are applicable to a given practical problem.
%Moreover, we also show that many of these problems \emph{can} be solved effectively with RL algorithms.
%, either from existing literature or by outlining possible approaches. 
%This rules out the possibility that those problems that cannot be expressed using scalar, Markovian reward functions also are impossible to learn effectively in RL. 
%\textcolor{red}{This result suggests that scalar, Markovian rewards are semantically limited, and raises warnings on their indiscriminate put to use.} 

The first class of problems we look at, in Section~\ref{section:morl}, is single-policy, multi-objective RL (MORL).
In such problems, the agent receives multiple reward signals, and the aim is to learn a single policy that achieves an optimal trade-off amongst those rewards, according to some specified criterion \citep{Roijers2013,Liu2015}. 
For example, a single-policy MORL algorithm might attempt to maximise the rewards lexicographically \citep{lmorl}.
We will provide necessary and sufficient conditions describing when a MORL problem can be reduced to scalar-reward RL, by providing a single reward function that induces the same preferences as the MORL problem.
%We will look at the question of which MORL problems can be reduced to ordinary scalar-reward RL, by providing a single reward function that induces the same preferences as the original MORL problem. We will provide a complete solution to this problem, in the form of necessary and sufficient conditions.
%In some cases, it might be possible to reduce a single-policy MORL problem to ordinary RL, by finding a scalar reward function that induces the same preferences as that MORL problem. However, it is not always clear whether or not this is possible, or how the reduction could be performed. 
%In this paper, \textbf{we provide a complete solution to this problem}. Specifically, we present necessary and sufficient conditions for when a single-policy MORL problem can be represented as a traditional RL problem. 
We find that this can \emph{only} be done for MORL problems that correspond to a linear weighting of the rewards, which means that it cannot be done for the vast majority of all interesting MORL problems. %We shall in particular detail a few natural MORL objectives, and show that none of them can be reduced to scalar RL. 
This result is analogous to Harsanyi's Utilitarian Theorem \citep{harsanyi1955cardinal}, generalised to the RL setting.

The next class of problems we study, in Section~\ref{section:risk_sensitive_rl}, is risk-sensitive RL. 
%which encompasses many contexts where it is desirable to be risk averse. 
In expected utility theory, risk-aversion is often modelled using utility functions that are concave in some of their variables. 
We will show that these tasks cannot be expressed as Markovian reward functions,
%In other words, is it possible to take a reward function, and then create a version of that reward function which induces more risk-averse behaviour? We show that the answer is no, 
by demonstrating that no non-affine monotonic transformations of the trajectory return function are possible. 
This demonstrates another limitation in the expressive power of Markovian rewards. 
%\textcolor{blue}{[yes please frame this section within the main tenet of this paper, as also discussed below. ]}

In Section~\ref{section:modal}, we introduce a new class of tasks, which we call \emph{modal} tasks. These are tasks where the agent is evaluated not only based on what distribution of trajectories it generates, but also based on what it \emph{could have done} along those trajectories. 
As an example, consider the instruction \enquote{you should always be \emph{able} to return to the start state}. 
We provide a formalisation of such tasks, argue that there are many situations in which these tasks could be useful, and finally prove that these tasks also typically cannot be formalised using scalar, Markovian reward functions. 

In Section~\ref{section:solving_inexpressible}, we discuss how to solve tasks from each of these classes using specialised RL solutions: we provide references to existing literature, and also sketch both an approach for learning a wide class of MORL problems, and an approach for learning a wide class of modal problems. Finally, in Section~\ref{section:discussion}, we discuss the implications of our results, together with several pieces of related work.

%To clarify, this paper concerns the \emph{reward hypothesis}, expressed by Richard Sutton as \enquote{all of what we mean by goals and purposes can be well thought of as the maximisation of the expected value of the cumulative sum of a received scalar signal} \citep{sutton2018reinforcement}. In other words, it is the hypothesis that any natural task can be expressed as a reward signal. This is not to be confused with what is sometimes referred to as the \emph{reward-is-enough hypothesis} \citep{SILVER2021103535}, which says that \enquote{the objective of maximising reward is enough to drive behaviour that exhibits most if not all attributes of intelligence that are studied in natural and artificial intelligence, including knowledge, learning, perception, social intelligence, language and generalisation}. In other words, this hypothesis says that a system which maximises a reward function with sufficient competence in a sufficiently complex environment naturally would develop general intelligence. This is a distinct claim, which we will not discuss in this paper.



%MORL is a generalisation of single-objective reinforcement learning (RL), in which the agent has to optimise several reward functions at once.
%In \emph{single-policy} MORL, the aim is to learn a single policy that achieves an optimal trade-off of those rewards according to some criterion, and in \emph{multi-policy} MORL, the aim is instead to learn a set of policies that together represent the solutions for several different relative preferences between the rewards. For example, a single-policy MORL algorithm might attempt to learn a policy that maximises the rewards lexicographically \cite{lmorl}, whereas a multi-objective MORL algorithm might attempt to learn a set of policies that approximate the rewards' Pareto front \cite{paretoMORL}, or a parameterised representation of the optimal policy for any linear weighting of the rewards \cite{linearMORL}. 
%
%This paper concerns single-policy MORL. In some cases, a single-policy MORL problem can be reduced to ordinary single-objective RL, by finding a reward function that induces the same preferences and incentives as the original MORL problem. However, it is not always clear when this is the case, and if so, how the reduction could be performed. In this paper, we provide a complete solution to this problem. Specifically, we present necessary and sufficient conditions for when a single-policy MORL problem can be represented as a traditional RL problem. We find that this can only be done for MORL problems that correspond to a \emph{linear weighting} of the rewards, which means that it cannot be done for the vast majority of all interesting MORL problems. We also present many natural MORL objectives, and show that none of them can be reduced to scalar RL.
%This confirms that MORL is a genuine generalisation of traditional RL, and provides a solid motivation for it as an area of study. It also shows that the slogan \emph{reward is enough} is untrue in settings where we have multiple objectives.

%One might think that there would be no way to solve a MORL problem effectively if it cannot be represented using scalar RL. We show that this is not the case, both by referring to counterexamples in the literature, and by providing two simple but novel MORL algorithms that can solve several MORL problems which cannot be represented using single-objective RL. One of these algorithms is designed for a particular MORL problem, whereas the other one can solve any MORL problem from a large class. This confirms that MORL is a genuine generalisation of traditional RL, and provides a solid motivation for it as an area of study. It also shows that the slogan \emph{reward is enough} is untrue in settings where we have multiple objectives.



\section{Preliminaries}\label{section:preliminaries}

The standard RL setting is formalised using \textit{Markov Decision Processes} (MDPs) \citep{sutton2018reinforcement}, which are tuples $\MDP$ where $\mcS$ is a set of states, $\mcA$ is a set of actions, $\tau : \mcS \times \mcA \to \Delta(\States)$ is a transition function, $\init$ is an initial state distribution over $\mcS$, $R : \mcS \times \mcA \to \mathbb{R}$ is a reward function, and $\gamma \in (0,1)$ is a discount factor.
%Here, $f : X \rightsquigarrow Y$ denotes a probabilistic mapping $f$ from $X$ to $Y$. 
%A state is \textit{terminal} if $\tau(s,a)=s$ and $R(s,a)=0$ for all $a$ \textcolor{red}{[do we utilise `terminal states' anywhere later?]}. 
A \emph{trajectory} $\xi$ is in general an element of $(\SxA)^\omega$, i.e.\ a sequence $s_0, a_0, s_1 \dots$. 
We use $G$ to denote the \emph{trajectory return function}, where $G(\xi) = \sum_{t=0}^\infty \gamma^t R(s_t,a_t)$.
A \emph{policy} is a mapping $\pi : \mcS \to \Delta(\mcA)$, and $\Pi$ is the set of all policies. 
Given a policy $\pi$, its \emph{value function} $V^\pi : \mcS \to \mathbb{R}$ is the function where $V^\pi(s)$ is the expected future discounted reward when following $\pi$ from $s$, and its \emph{$Q$-function} $Q^\pi : \SxA \to \mathbb{R}$ is given by $Q^\pi(s,a) = \mathbb{E}_{s' \sim \tau(s,a)}[R(s,a) + \gamma \cdot V^\pi(s')]$. 
The \emph{policy evaluation function} $J : \Pi \to \mathbb{R}$ is $J(\pi) = \mathbb{E}_{s_0 \sim \init}[V^\pi(s_0)]$. If a policy maximises $J$, then we say that this policy is \emph{optimal}. We denote optimal policies by $\pi^\star$, and their value function and $Q$-function by $V^\star$ and $Q^\star$. Moreover, given an MDP $\M$, we say that $\M$'s policy order is the ordering $\prec$ on $\Pi$ where $\pi_1 \prec \pi_2 \iff J(\pi_1) < J(\pi_2)$ for any $\pi_1, \pi_2$. 
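
As a simple illustration of these definitions, suppose (as a hypothetical example) that a policy $\pi$ collects the same reward $r$ at every step, so that $R(s_t,a_t) = r$ along every trajectory $\xi$ that $\pi$ can generate. The geometric series then gives
\[
G(\xi) = \sum_{t=0}^\infty \gamma^t r = \frac{r}{1-\gamma},
\]
and hence $V^\pi(s) = J(\pi) = r/(1-\gamma)$ for every state $s$ that is reachable under $\pi$. For instance, $r = 1$ and $\gamma = 0.9$ yield $J(\pi) = 10$.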
%For an overview, see \citet{sutton2018reinforcement}.

In this paper, we will say that a reward function $R$ is \emph{trivial} if $J(\pi_1) = J(\pi_2)$ for all $\pi_1,\pi_2$. Moreover, we say that $R_1$ and $R_2$ are \emph{equivalent} if $J_1(\pi_1) < J_1(\pi_2) \iff J_2(\pi_1) < J_2(\pi_2)$ for all $\pi_1,\pi_2$, and that they are \emph{opposites} if $J_1(\pi_1) < J_1(\pi_2) \iff J_2(\pi_1) > J_2(\pi_2)$ for all $\pi_1,\pi_2$.
%\textcolor{blue}{[clarify quantification over policies: first univ, second  existential?]}
%, for all $\pi_1,\pi_2$. 

MORL problems are formalised using \emph{Multi-Objective MDPs} (MOMDPs), which are tuples $\MOMDP$, 
with the only difference from MDPs being $\Rs$, 
which is now a function $\Rs : \mcS \times \mcA \to \mathbb{R}^k$ that, for each pair $(s,a)$, 
returns $k$ different rewards (for some finite $k$). We denote the $i$'th component of $\Rs$ as the scalar reward function $R_i$, 
and use $V^\pi_i$, $Q^\pi_i$, $J_i$, and $G_i$, etc, to refer to its value-, $Q$-, evaluation-, and return function, etc. 
There are two types of MORL problems: single-policy MORL, where the goal is to compute one policy that achieves an optimal trade-off of the rewards, and multi-policy MORL, where the aim is to compute several policies (typically with the aim of approximating the Pareto front of the rewards).
In this paper, we are concerned with single-policy MORL.
Since there may not be a single policy that maximises each component of $\Rs$, a single-policy MORL problem needs some additional rule for combining and trading off each reward. 

In economics and psychology, risk-aversion is often modelled using utility functions $U(c)$ that are concave in some relevant variable $c$.
%This can be used to capture the intuition that having two units of money is less than twice as good as having one unit (for example).
The most common risk-averse utility functions are the \emph{exponential}, the \emph{isoelastic}, and the \emph{quadratic} utility functions. 
The exponential utility function is given by $U(c) = -e^{-\alpha c}$, where $\alpha > 0$ is a parameter controlling the degree of risk aversion. The isoelastic utility function is given by $U(c) = (c^{1-\alpha} - 1)/(1-\alpha)$, for $\alpha > 0, \alpha \neq 1$, or by $U(c) = \ln(c)$ (corresponding to the limiting case $\alpha = 1$). The quadratic utility function is given by $U(c) = c - \alpha c^2$, where $\alpha > 0$.
Since the quadratic utility function is decreasing for sufficiently large $c$, its domain is typically restricted to $(-\infty, 1/(2\alpha)]$.
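
To illustrate how concavity encodes risk aversion, consider the quadratic utility function with the (arbitrarily chosen) parameter $\alpha = 1/4$, so that $U(c) = c - c^2/4$ on the domain $(-\infty, 2]$. Compare receiving $c = 1$ with certainty against a lottery that pays $c = 0$ or $c = 2$ with probability $1/2$ each. Both options have the same expected payoff of $1$, but
\[
U(1) = 1 - \tfrac{1}{4} = \tfrac{3}{4}, \qquad \tfrac{1}{2} U(0) + \tfrac{1}{2} U(2) = \tfrac{1}{2} \cdot 0 + \tfrac{1}{2} \cdot 1 = \tfrac{1}{2},
\]
so an agent that maximises expected utility strictly prefers the certain outcome.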

\paragraph*{A Remark on \enquote{Tasks}:}

In this paper, we are investigating the question of when a given task can be expressed using a scalar, Markovian reward function.
To do this, we must first formalise what it should mean for a reward function to \enquote{express a task}. 
One option is to say that a task corresponds to a desired policy $\pi$, and that a reward function $R$ expresses the task if $\pi$ is optimal under $R$ (possibly with the additional requirement that $\pi$ is the \emph{only} policy that is optimal under $R$). With this definition, we find that \emph{any} task can be expressed as a Markovian reward function, at least as long as $\pi$ is stationary and deterministic (see Appendix~B).
Under this formalisation, the problem is therefore rather trivial.

An alternative, stronger formalisation is to say that a task corresponds to an ordering $\prec$ on $\Pi$, which encodes a preference ordering over all policies, and that a reward function $R$ expresses the task if its corresponding evaluation function $J$ orders $\Pi$ according to $\prec$. It is primarily this latter definition that we will use in this paper. 
The main reason for this is that it is often impossible to find the optimal policy in complex environments. 
For example, in a robotics problem, it is typically not feasible to find a policy that is globally optimal.
This means that it is not enough for $R$ to admit the correct optimal policy; it must also induce the right preferences between all the (sub-optimal) policies that the policy synthesis algorithm might in fact generate. The only way to robustly ensure that this is the case is if $R$ induces the right policy ordering.
For this reason, we think it is more informative to think of a problem setting (i.e.\ a \enquote{task}) as corresponding to an ordering on $\Pi$.

%We conclude by remarking that these are not the only reasonable ways to formalise the notion of a task; some alternative options are considered by \cite{markovrewardexpressivity}.

\section{Multi-Objective Problems}\label{section:morl}

In this section, we examine the MORL setting.
We first need a general definition of what a single-policy MORL problem is. Recall that a MOMDP $\MOMDP$ by itself has no one canonical objective to maximise. We therefore introduce the notion of a \emph{MORL objective}:
%
\begin{definition}\label{def:morl_objective}
A \textbf{MORL objective} over $k$ rewards is a function $\Ob$ that takes $k$ policy evaluation functions $J_1 \dots J_k$ and returns a (total) ordering $\prec_\Ob$ over the set of all policies $\Pi$.
\end{definition}
%
Given a MOMDP $\M$, a MORL objective $\Ob$ gives us an ordering $\prec_\Ob$ over $\Pi$ that tells us when a policy is preferred over another. 
%We use $\prec_\Ob^\M$ to denote the ordering that is obtained when we apply $\Ob$ to $\M$'s policy evaluation functions.
%\footnote{Note that $\prec_\Ob^{\M_1} = \prec_\Ob^{\M_2}$ when $\M_1$ and $\M_2$ have the same policy evaluation functions, even if they differ in other respects.}
%This captures the kinds of MORL objectives we care about in practice, see e.g.\ the examles in Section~\ref{section:examples}.
For the purposes of this paper, we will not need to impose any further requirements on $\prec_\Ob$. For example, we will not insist that $\prec_\Ob$ must have a greatest element in $\Pi$, or that $\pi_1 \prec_\Ob \pi_2$ whenever $\pi_2$ is a Pareto improvement over $\pi_1$, etc, even though a reasonable MORL objective presumably would have these properties.
We next provide a few examples of MORL objectives, where we denote by $\pi_1, \pi_2$ any given pair of distinct policies.   

\begin{definition}\label{def:lexmax}
Given $J_1 \dots J_k$, the \textbf{LexMax} objective $\prec_\texttt{Lex}$ is given by $\pi_1 \prec_\texttt{Lex} \pi_2$ iff there is an $i \in \{1 \dots k\}$ such that $J_i(\pi_1) < J_i(\pi_2)$ and $J_j(\pi_1) = J_j(\pi_2)$ for all $j < i$.
\end{definition}

\begin{definition}\label{def:maxmin}
Given $J_1 \dots J_k$, the \textbf{MaxMin} objective $\prec_\texttt{Min}$ is given by $\pi_1 \prec_\texttt{Min} \pi_2 \iff \min_i J_i(\pi_1) < \min_i J_i(\pi_2)$.
\end{definition}

\begin{definition}\label{def:maxsat}
Given $J_1 \dots J_k$ and some $c_1 \dots c_k \in \mathbb{R}$, the \textbf{MaxSat} objective $\prec_\texttt{Sat}$ is given by $\pi_1 \prec_\texttt{Sat} \pi_2$ if and only if
the number of rewards that satisfy $J_i(\pi_1) \geq c_i$ is smaller than the number of rewards that satisfy $J_i(\pi_2) \geq c_i$.
\end{definition}

%CJG - I strugled to interpret this the first time - I'm guessing it means the following: |{i <= : J_i(\pi_1) >= c_i}| < |{i : J_i(\pi_2) >= c_i}| 

\begin{definition}\label{def:consat}
Given $J_1, J_2$ and some $c\in \mathbb{R}$, the \textbf{ConSat} objective $\prec_\texttt{Con}$ is given by $\pi_1 \prec_\texttt{Con} \pi_2$ if and only if
either $J_1(\pi_1) < c$ and $J_1(\pi_1) < J_1(\pi_2)$, or $J_1(\pi_1), J_1(\pi_2) \geq c$ and $J_2(\pi_1) < J_2(\pi_2)$.
\end{definition}

In other words, the LexMax objective has \emph{lexicographic} preferences over $R_1 \dots R_k$, so that policies are first ordered by their expected discounted $R_1$-reward, and then policies that obtain the same expected discounted $R_1$-reward are ordered by their expected discounted $R_2$-reward, and so on. The MaxMin objective orders policies by their \emph{worst} performance according to any of $R_1 \dots R_k$ (which could be used to obtain worst-case guarantees). The MaxSat objective only cares whether a policy reaches a certain \emph{threshold} for each reward, and ranks policies based on how many thresholds they reach. The ConSat objective aims to maximise $J_2$, but under the constraint that $J_1$ reaches a certain threshold. 
Note that these objectives are not necessarily the most important MORL objectives.
Rather, they are simply a short list of illustrative examples, meant to demonstrate the flexibility of the MORL framework, and give an intuition for what types of problems it can be used to express.
A few more examples can be found in Appendix~C.
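
To see how these objectives can disagree, consider a hypothetical MOMDP with $k = 2$ and two policies $\pi_a, \pi_b$ whose evaluations are (by assumption) $(J_1(\pi_a), J_2(\pi_a)) = (3, 0)$ and $(J_1(\pi_b), J_2(\pi_b)) = (2, 2)$. Then:
\begin{align*}
&\pi_b \prec_{\texttt{Lex}} \pi_a, && \text{since } J_1(\pi_b) = 2 < 3 = J_1(\pi_a), \\
&\pi_a \prec_{\texttt{Min}} \pi_b, && \text{since } \min_i J_i(\pi_a) = 0 < 2 = \min_i J_i(\pi_b), \\
&\pi_a \prec_{\texttt{Con}} \pi_b \text{ for } c = 2, && \text{since } J_1(\pi_a), J_1(\pi_b) \geq 2 \text{ and } J_2(\pi_a) = 0 < 2 = J_2(\pi_b).
\end{align*}
If the ConSat threshold is instead set to $c = 3$, then $J_1(\pi_b) < c$ and $J_1(\pi_b) < J_1(\pi_a)$, so the preference flips to $\pi_b \prec_{\texttt{Con}} \pi_a$. The same pair of policies can thus be ranked in either direction, depending on the chosen objective and its parameters.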

We next define what it means to \emph{reduce} a MORL problem to a scalar RL problem.
Given a MORL objective $\Ob$ and a MOMDP $\M$, we use $\prec_\Ob^\M$ to denote the ordering we get when we apply $\Ob$ to $\M$'s policy evaluation functions:

%We say that a MOMDP $\M$ and a MORL objective $\Ob$ together form a MORL problem, denoted $\MORLprob$.

% \footnote{Note that $\Ob$ is defined over $\M$'s policy evaluation functions, rather than $\M$ itself. This means that $\preceq_\Ob$ must be the same in any two MOMDPs $\M_1$ and $\M_2$ whose policy evaluation functions are the same, even if $\M_1$ and $\M_2$ differ in other respects. Our definition could therefore be made more general. However, we believe that our somewhat stronger definition captures all natural MORL objectives.}
 %We will use the following definition:
%
\begin{definition}
A MOMDP $\M = \MOMDP$ with MORL objective $\Ob$ is \textbf{equivalent} to the MDP $\M' = \MDP$ if and only if $\M'$'s policy order is $\prec_\Ob^\M$. 
We then say that $\M$ with $\Ob$ is \textbf{scalarized} by $R$.
If $\M$ with $\Ob$ is scalarized by some $R$ then we say that $\M$ with $\Ob$ is \textbf{scalarizable}, otherwise we say that it is \textbf{unscalarizable}. 
\end{definition}
%
Note that $\M'$ must have the same states, actions, transition function, initial state distribution, and discount factor, as $\M$. This definition therefore says that $\M$ with $\Ob$ is equivalent to $\M'$ if $\M'$ is given by replacing $\Rs = \langle R_1 \dots R_k \rangle$ with a single reward function $R$, and $R$ induces the same preferences between all policies as $\Ob(J_1 \dots J_k)$.
Note also that we require $R$ to express the same \emph{policy order} as $\Ob(J_1 \dots J_k)$; it is not enough for $R$ and $\Ob(J_1 \dots J_k)$ to have the same \emph{optimal policies} (see Section~\ref{section:preliminaries}).

Given this definition, we can now provide the necessary and sufficient conditions for when a MORL problem can be reduced to a scalar-reward RL problem. All proofs are provided in the supplementary material.

%We then use these conditions to show that many interesting MORL problems cannot be expressed as scalar-reward RL problems.

\begin{theorem}\label{thm:linearity_thm}
If a MOMDP $\M$ with objective $\Ob$ is scalarizable, then there exist $w_1 \dots w_k \in \mathbb{R}$ such that $\M$ with $\Ob$ is scalarized by the reward $R(s,a) = \sum_{i=1}^k w_i \cdot R_i(s,a)$.
%Moreover, $\M$ with $\Ob$ is also equivalent to the MDP with reward $R(s,a) = \sum_{i=1}^k w_i \cdot R_i(s,a)$.
\end{theorem} 

Theorem~\ref{thm:linearity_thm} tells us that a MORL objective can be expressed using a scalar, Markovian reward function if and only if that objective corresponds to a linear weighting of the individual rewards.
In other words, scalar, Markovian rewards are unable to express any non-linear MORL problem.
As we will see, this imposes a strong limitation on what MORL tasks can be encoded using scalar, Markovian rewards.
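
To make the connection between linear rewards and linear weightings of the evaluation functions explicit, the following short derivation may be helpful. It assumes that each $R_i$ is bounded, so that sums and expectations can be exchanged, and the expectation is taken over trajectories $\xi$ generated by $\pi$ from $\init$. If $R(s,a) = \sum_{i=1}^k w_i \cdot R_i(s,a)$, then
\begin{align*}
J(\pi) &= \mathbb{E}_{\xi}\left[\sum_{t=0}^\infty \gamma^t \sum_{i=1}^k w_i \cdot R_i(s_t,a_t)\right] \\
&= \sum_{i=1}^k w_i \cdot \mathbb{E}_{\xi}\left[\sum_{t=0}^\infty \gamma^t R_i(s_t,a_t)\right] = \sum_{i=1}^k w_i \cdot J_i(\pi).
\end{align*}
A linear reward therefore orders policies by a fixed weighted sum of $J_1 \dots J_k$.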

It is worth noting that Theorem~\ref{thm:linearity_thm} is analogous to Harsanyi's Utilitarian Theorem \citep{harsanyi1955cardinal} from social choice theory, but generalised to the RL setting. In brief, this theorem supposes that we have a finite set of outcomes $\Omega$ and a group of individuals $\{1 \dots k\}$ with different preferences over $\Omega$, and that we wish to construct an aggregate preference structure that captures the preferences of the group. 
Suppose moreover that (1) the preferences of each individual $i$ are described by a utility function $U_i : \Omega \to \mathbb{R}$, (2) the aggregate preferences of the group are described by a further utility function $U_G : \Omega \to \mathbb{R}$, and (3) for all distributions $\mathcal{D}_1$ and $\mathcal{D}_2$ over $\Omega$, if $\mathbb{E}_{O \sim \mathcal{D}_1}[U_i(O)] = \mathbb{E}_{O \sim \mathcal{D}_2}[U_i(O)]$ for every individual $i$, then $\mathbb{E}_{O \sim \mathcal{D}_1}[U_G(O)] = \mathbb{E}_{O \sim \mathcal{D}_2}[U_G(O)]$. Harsanyi's Utilitarian Theorem then says that $U_G$ must be given by some linear combination of $U_1 \dots U_k$. 
The link to Theorem~\ref{thm:linearity_thm} becomes clear if we think of $\Omega$ as being the set of all trajectories which are possible in a MOMDP $\M$, $U_1 \dots U_k$ as being the trajectory return functions $G_1 \dots G_k$ of the reward functions $R_1 \dots R_k$ in $\M$, and $U_G$ as being the trajectory return function of the scalarizing reward $R$. However, note that Harsanyi's Utilitarian Theorem assumes that $\Omega$ is finite, whereas the set of all trajectories may be uncountably infinite. 
Moreover, assumption (3) quantifies over all possible distributions over $\Omega$, whereas Theorem~\ref{thm:linearity_thm} only quantifies over distributions that can be realised as policies in a given MOMDP $\M$. If $\Omega$ is allowed to be infinite, and assumption (3) is restricted to range over only some distributions over $\Omega$, then Harsanyi's Utilitarian Theorem does not hold in general. The generalisation provided by Theorem~\ref{thm:linearity_thm} is therefore non-trivial.


%tasks we can express and encode using scalar reward functions. %We will soon see that none of the MORL objectives we defined in Section~\ref{section:examples} can be expressed as single-objective RL problems, except in a few degenerate cases.
%
%We will also state two more results, which make it easy to demonstrate when some MORL objective cannot be expressed using scalar reward functions. 
Theorem~\ref{thm:linearity_thm} also entails the following corollary, which is useful to elucidate when a MORL objective cannot be expressed using scalar reward functions. 
Given an ordering $\prec$ over $\Pi$, depending on some evaluation functions $J_1 \dots J_k$, we say that a function $U : \Pi \to \mathbb{R}$ \emph{represents} $\prec$ if $U(\pi_1) < U(\pi_2) \iff \pi_1 \prec \pi_2$. 
We say that $U$ is a \emph{linear representation} if $U(\pi) = f(\sum_{i=1}^k w_i \cdot J_i(\pi))$ for some $w_1 \dots w_k \in \mathbb{R}$ and some strictly monotonic $f$.

%Note that $U$ need not correspond to any reward function -- it can be any arbitrary function. Moreover, we say that $\preceq$ is \emph{representable} if it is represented by some function.
%\begin{proposition}\label{prop:no_rep}
%If $\Ob(J_1 \dots J_k)$ is not representable, and $\M$ is a MOMDP whose policy evaluation functions are $J_1 \dots J_k$, then $\M$ with $\Ob$ is not equivalent to any MDP.
%\end{proposition}
%\begin{proof}
%Assume for contradiction that $\M$ with $\Ob$ is equivalent to $\tilde{\M}$. Then $\tilde{J}$ represents $\Ob(J_1 \dots J_k)$, which by assumption is not possible.
%\end{proof}

\begin{corollary}\label{cor:nonlinear_rep}
If $\Ob(J_1 \dots J_k)$ has a non-linear representation $U$, and $\M$ is a MOMDP whose $J$-functions are $J_1 \dots J_k$, then $\M$ with $\Ob$ is unscalarizable.
\end{corollary}

Therefore, we can prove that $\M$ with $\Ob$ is unscalarizable by finding a non-linear representation of $\prec_\Ob^\M$. 
Accordingly, we now show that none of the MORL objectives given in Definitions~\ref{def:lexmax}--\ref{def:consat} can be expressed using scalar, Markovian reward functions, except in a few degenerate cases. 

\begin{corollary}\label{corollary:no_lexmax}
$\M$ with $\textbf{LexMax}$ is unscalarizable, as long as $\M$ has at least two reward functions that are neither trivial, equivalent, nor opposite. 
\end{corollary}

Note that if all reward functions are either trivial, equivalent, or opposite, then the only reward function that matters for $\textbf{LexMax}$ is the highest-priority non-trivial reward function. In that case, $\M$ with $\textbf{LexMax}$ is equivalent to the MDP which contains only this reward function.

\begin{corollary}\label{corollary:no_minmax}
$\M$ with $\textbf{MaxMin}$ is unscalarizable, unless $\M$ has a reward function $R_i$ such that $J_i(\pi) \leq J_j(\pi)$ for all $j \in \{1 \dots k\}$ and all $\pi$.
\end{corollary}

Note that if $\M$ has a reward function $R_i$ such that $J_i(\pi) \leq J_j(\pi)$ for all $j$ and $\pi$, then this is the only reward function that matters for the $\textbf{MaxMin}$ objective. 
In that case, $\M$ with $\textbf{MaxMin}$ is equivalent to the MDP which contains only $R_i$.
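
As a small illustration of why a linear weighting cannot reproduce this objective, suppose (hypothetically) that $k = 2$, that \textbf{MaxMin} orders policies by $U(\pi) = \min(J_1(\pi), J_2(\pi))$, and that $\M$ admits three policies $\pi_a, \pi_b, \pi_c$ with the values below. These numbers are assumptions chosen for exposition, and are not taken from any particular environment.
$$
\begin{array}{c|ccc}
 & \pi_a & \pi_b & \pi_c \\ \hline
J_1 & 0 & 1 & 2 \\
J_2 & 2 & 1 & 0 \\
U & 0 & 1 & 0
\end{array}
$$
If $U$ were of the form $f(w_1 \cdot J_1 + w_2 \cdot J_2)$ for some strictly monotonic $f$, then $U(\pi_a) = U(\pi_c)$ would force $2 w_2 = 2 w_1$, since a strictly monotonic $f$ is injective. Writing $w = w_1 = w_2$, all three policies would then have the same weighted sum $2w$, so that $U(\pi_b) = U(\pi_a)$, which contradicts $U(\pi_a) < U(\pi_b)$. Hence no linear representation induces this ordering over these three policies.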

\begin{corollary}\label{corollary:no_maxsat}
$\M$ with $\textbf{MaxSat}$ is unscalarizable, as long as $\M$ has at least one reward $R_i$ where $J_i(\pi_1) < c_i$ and $J_i(\pi_2) \geq c_i$ for some $\pi_1, \pi_2 \in \Pi$.
\end{corollary}

Note that if $\M$ has no reward $R_i$ where $J_i(\pi_1) < c_i$ and $J_i(\pi_2) \geq c_i$ for some $\pi_1, \pi_2 \in \Pi$, then each constraint is either satisfied by all policies or satisfied by no policy, so all policies satisfy exactly the same constraints. In that case, $\M$ with $\textbf{MaxSat}$ would be equivalent to an MDP with a trivial reward function.

\begin{corollary}\label{corollary:no_consat}
$\M$ with $\textbf{ConSat}$ is unscalarizable, unless either $R_1$ and $R_2$ are equivalent, or $\max_{\pi}J_1(\pi) \leq c$, or $\min_{\pi}J_1(\pi) \geq c$.
\end{corollary}

Note that if $\max_{\pi}J_1(\pi) \leq c$ then no policy satisfies the constraint, in which case $\M$ with $\textbf{ConSat}$ is equivalent to the MDP with $R_1$. If $\min_{\pi}J_1(\pi) \geq c$ then all policies satisfy the constraint, in which case $\M$ with $\textbf{ConSat}$ is equivalent to the MDP with $R_2$. If $R_1$ and $R_2$ are equivalent, then $\M$ with $\textbf{ConSat}$ is scalarized by $R_1$ or $R_2$.

Corollaries~\ref{corollary:no_lexmax}--\ref{corollary:no_consat} thus show that none of the MORL objectives given in Definitions~\ref{def:lexmax}--\ref{def:consat} can be expressed using a scalar, Markovian reward function, except in a few degenerate cases where those MORL objectives are trivialised. This demonstrates that MORL problems typically cannot be scalarized in a satisfactory way.

To get an intuition for this result, note that the expected cumulative return of a Markovian reward function is always maximised by some stationary policy, whereas some of these MORL objectives may require the optimal policy to be non-stationary. For example, consider the \textbf{MaxMin} objective, and suppose the agent can choose between an action giving one unit of $R_1$-reward, and an action giving one unit of $R_2$-reward.
Then the optimal choice may depend on how much $R_1$-reward and $R_2$-reward the agent has received in the past. This means that the optimal policy may be non-stationary, and thus not correspond to any Markovian reward. %Of course, Theorem~\ref{thm:linearity_thm} gives a stronger conclusion than this argument.

%\footnote{Note that Corollaries~\ref{corollary:no_lexmax}-\ref{corollary:no_consat} are not necessarily hard to prove by themselves. Rather, they are given to exemplify and provide an intuition for the consequence of Theorem~\ref{thm:linearity_thm}, which is our main contribution for this section.}

%To give an intuition for why 

%\red{Constrained RL}

\section{Risk-Sensitive Problems}\label{section:risk_sensitive_rl}

The next area we will look at is that of \emph{risk-sensitive} RL. An ordinary RL agent tries to maximise the \emph{expectation} of its reward function.
However, there are many cases where it is natural to require the agent to be \emph{risk-averse}.
For example, we might prefer a policy that reliably achieves $5$ reward, over one that achieves $11$ reward with probability $0.5$, and otherwise gets $0$ reward, even though the latter policy achieves a higher expected reward.
In this section, we will examine when scalar, Markovian reward functions can be used to encourage such behaviour.

In expected utility theory, risk-aversion is often modelled using concave utility functions. In particular, suppose we have a set of \emph{outcomes} $C$, each of which is associated with some utility via a function $U_1 : C \to \mathbb{R}$. We can then construct a second utility function $U_2 : C \to \mathbb{R}$ by letting $U_2(c) = f(U_1(c))$ for some concave function $f$. Then, an agent which maximises expected utility according to $U_2$, will be risk-averse with respect to utility as defined by $U_1$. 
For example, suppose each outcome is associated with some monetary payoff, and that $U_1$ measures how much money is obtained in each outcome. 
If we were to maximise expected utility according to $U_1$, then we would prefer a $50\%$ chance of obtaining $\$2{,}000{,}000$ to a guaranteed $\$900{,}000$. However, in the real world, most people would prefer the latter option. One reason for this is that, while getting $\$2{,}000{,}000$ is better than getting $\$900{,}000$, it is less than twice as good. We can model these preferences by using a second utility function $U_2$ that is concave in $U_1$.
Intuitively, $U_2$ should measure how much \emph{benefit} we get from the money.
Then the expected $U_2$-utility might be higher for the safe option than the risky option, even though the expected $U_1$-utility is higher for the risky option.% than the safe option.
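
To make the arithmetic concrete, take $f(x) = \sqrt{x}$ as one illustrative concave function, and assume that the risky option pays nothing when it fails. Both choices are assumptions made for exposition; any increasing, sufficiently concave $f$ would give the same qualitative picture. The expected $U_1$-utilities are
$$
0.5 \cdot 2{,}000{,}000 = 1{,}000{,}000 > 900{,}000,
$$
so the risky option is preferred under $U_1$, whereas the expected $U_2$-utilities are
$$
0.5 \cdot \sqrt{2{,}000{,}000} \approx 707.1 < 948.7 \approx \sqrt{900{,}000},
$$
so the safe option is preferred under $U_2 = f(U_1)$.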

In reinforcement learning, the \emph{outcomes} are the trajectories that might occur in the environment, and the \emph{utility} of a trajectory $\xi$ is induced by the reward function $R$ via the return function, $G$. 
If the transition function $\tau$ is nondeterministic, then the agent cannot reliably enact a particular outcome (i.e., move along a particular trajectory), but can instead only choose between some distributions over outcomes. 
By default, the agent may then be compelled to pursue a policy that achieves a high reward with small probability, as long as the expectation remains high.
A natural question is then whether we could avoid this by constructing a second reward function that is concave in the original reward function, similar to what is done in expected utility theory. That is, given a reward function $R_1$ and a concave function $f$, can we construct a second reward function $R_2$ such that $G_2 = f(G_1)$? Our next theorem demonstrates that this is impossible. As before, the proof is in the appendix.

\begin{theorem}\label{thm:risk_theorem}
Given $\States$, $\Actions$, and $\gamma$, 
let $R_1$ and $R_2$ be two reward functions.
If $\gamma \geq 0.5$, and for all $\xi_1,\xi_2 \in (\SxA)^\omega$, 
$$
G_1(\xi_1) \leq G_1(\xi_2) \iff G_2(\xi_1) \leq G_2(\xi_2),
$$
then there exist $a \in \mathbb{R}$ and $b \in \mathbb{R}_{>0}$ such that for all $\xi \in (\SxA)^\omega$,
$$
G_1(\xi) = b \cdot G_2(\xi) + a.
$$
\end{theorem}

Theorem~\ref{thm:risk_theorem} effectively tells us that \emph{only affine transformations of $G$ are possible}. 
From this result, it straightforwardly follows that none of the standard risk-averse utility functions (exponential utility, isoelastic utility, and quadratic utility) can be expressed using Markovian reward functions:

\begin{corollary}
For any non-trivial reward $R_1$ and any constant $\alpha \neq 0$, if $\gamma \geq 0.5$ then there is no reward $R_2$ such that $G_2(\xi) = -e^{\alpha G_1(\xi)}$ for all $\xi \in (\SxA)^\omega$.
\end{corollary}

\begin{corollary}
For any non-trivial reward $R_1$ and any constant $\alpha > 0$, $\alpha \neq 1$, if $\gamma \geq 0.5$ then there is no reward $R_2$ such that $G_2(\xi) = G_1(\xi)^{1-\alpha}$ for all $\xi \in (\SxA)^\omega$.
\end{corollary}

\begin{corollary}
For any non-trivial reward $R_1$, if $\gamma \geq 0.5$ then there is no reward $R_2$ such that $G_2(\xi) = \ln (G_1(\xi))$ for all $\xi \in (\SxA)^\omega$.
\end{corollary}

\begin{corollary}
For any non-trivial reward $R_1$ and any $\alpha > 0$ where $\max_\xi G_1(\xi) \leq \frac{1}{2\alpha}$, if $\gamma \geq 0.5$ then there is no reward $R_2$ such that $G_2(\xi) = G_1(\xi) - \alpha G_1(\xi)^2$ for all $\xi \in (\SxA)^\omega$.
\end{corollary}
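
For contrast, the affine case permitted by Theorem~\ref{thm:risk_theorem} is easy to realise. The following sketch assumes the usual discounted return $G_i(\xi) = \sum_{t=0}^{\infty} \gamma^t r^{(i)}_t$ with $\gamma < 1$, where $r^{(i)}_t$ is our shorthand (not used elsewhere in the paper) for the $R_i$-reward received at step $t$ of $\xi$. Given any $a \in \mathbb{R}$ and $b > 0$, define $R_2$ pointwise as $R_2 = (R_1 - (1-\gamma) \cdot a)/b$. Since $\sum_{t=0}^{\infty} \gamma^t (1-\gamma) = 1$, we get
$$
G_2(\xi) = \sum_{t=0}^{\infty} \gamma^t \cdot \frac{r^{(1)}_t - (1-\gamma) \cdot a}{b} = \frac{G_1(\xi) - a}{b},
$$
which is the same as $G_1(\xi) = b \cdot G_2(\xi) + a$. The corollaries above state that, when $\gamma \geq 0.5$, no choice of $R_2$ achieves the analogous identity for the listed non-affine transformations.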

Theorem~\ref{thm:risk_theorem} thus implies that none of the standard risk-averse utility functions can be expressed using scalar, Markovian reward functions. To get an intuition for Theorem~\ref{thm:risk_theorem}, consider the fact that the expected cumulative return of a Markovian reward function is always maximised by some stationary (i.e.\ Markovian) policy. However, a risk-averse objective may require the optimal policy to be non-stationary, because whether or not it is worth taking a particular gamble could depend on how much reward the agent has accrued in the past. This suggests that there should be instances where risk-sensitive objectives cannot be expressed as Markovian reward functions. Theorem~\ref{thm:risk_theorem} formalises this intuition. 
%, to show that they can \emph{never} be expressed as Markovian reward functions.

It is also worth remarking on the fact that Theorem~\ref{thm:risk_theorem} considers the value of $G_1$ and $G_2$ for \emph{all} trajectories in $(\SxA)^\omega$. For any particular transition function $\tau$, most of these trajectories are likely to be impossible (unless $\tau$ allows you to transition between any two states via any action with non-zero probability). We could therefore alternatively consider the condition where $G_1(\xi_1) \leq G_1(\xi_2) \iff G_2(\xi_1) \leq G_2(\xi_2)$ for those trajectories $\xi_1,\xi_2$ that are possible in a given environment. In this case, it \emph{can} be possible for $G_2$ to be non-affine in $G_1$. For example, consider the case of a tree-shaped MDP, where $\tau(s,a) = s$ and $R_1(s,a) = 0$ for all actions $a$ if $s$ is a leaf-node. In that case, $G_2$ can be an arbitrary transformation of $G_1$. However, to construct the corresponding reward function $R_2$, we would need to have a detailed understanding of the environment (which is against the main tenet in RL), and furthermore the resulting reward function would no longer induce the same behaviour if it were used in a different environment. For this reason, we believe that it is more relevant to consider the set of all trajectories in $(\SxA)^\omega$. Nonetheless, an interesting direction for further work could be to more extensively study what happens if the set of trajectories under consideration is restricted in various ways.

Finally, note that Theorem~\ref{thm:risk_theorem} assumes that the discount parameter $\gamma \geq 0.5$. It is not clear if this is strictly necessary, so it might be possible to generalise Theorem~\ref{thm:risk_theorem} by removing this requirement. This would, however, require a different proof strategy.
%because our proof crucially relies on this assumption. 
Nonetheless, this assumption is not very restrictive, as in practice $\gamma$ is almost always set to be greater than $0.5$ (typically $\gamma \geq 0.9$).

\section{Modal Problems}\label{section:modal}

%\textcolor{blue}{[elephant in the room: connections with modal logic. aren't these modal requirements connected to temporal specifications?]}

%\red{[Not necessarily! You can express these kinds of things in CTL, but not LTL. But you could also make a "modal" specification that is not per se temporal.]}

%\textcolor{blue}{[I surely agree that modal operators can be broader than temporal ones. But I would definitely mention the connection with temporal logics - I agree that CTL might be more relevant than LTL for the tasks you express. Of course, native CTL does not necessarily come with rewards, which is instead what the focus of this chapter is.]}

%\red{[There is a paper on learning PCTL specs using RL, which I cite in the related work section (Wang et al 2020). But yes, I agree some more explicit mention would be good -- I'll see where I can fit it in.]}

The final class of problems that we will examine is a class of tasks that we refer to as \emph{modal} tasks. Before we give a formal definition of this class, we will first provide some intuition. In analytic philosophy, a distinction is made between \emph{categorical} facts and \emph{modal} facts. In short, categorical facts only concern what is true in actuality, whereas modal facts concern what must be true, could have been true, or cannot be true, etc. 
For example, it is a categorical fact that the Eiffel Tower is brown, and a modal fact that it \emph{could have had} a different colour. It is (arguably) a categorical fact that the number 3 is prime, and a modal fact that it \emph{could not have been} otherwise.
%\textcolor{blue}{[so, it is  modal?]}\red{[yes (both statements are)]}. 
To give another example, there is a difference between stating that \emph{nothing can travel faster than light} and that \emph{nothing does travel faster than light} -- the former statement, which is modal, is stronger than the latter, which is categorical.
%\textcolor{blue}{[and it is  modal, I suppose?]}\red{[yes]}. 
One can further distinguish between different kinds of possibility (e.g.\ logical vs.\ physical possibility), and discussions about modality also involve topics such as \emph{causality} and \emph{counterfactuals}.
A complete treatment of this subject is beyond the scope of this paper, but for an overview see \citet{sep-possible-worlds}.

%There is, of course, a connection between modality and \emph{modal logic}, but there is also a connection to \emph{temporal logic}. In particular, computational tree logic (CTL) can be used to express modal statements.

Modality does of course relate to \emph{modal logic}, and thus also to \emph{temporal logic}. In particular, computation tree logic (CTL, see e.g.\ \cite{BaierKatoen2008}) and its extensions can express many modal statements.%
\footnote{To avoid a possible confusion, we should emphasise that we here use the term \enquote{modal} in a somewhat more narrow sense than the sense of \enquote{modal logic}. In particular, we use it to mean \enquote{pertaining to what is possible or impossible}, as in e.g.\ \citet{sep-modality-varieties}. In that sense, Linear Temporal Logic (LTL) does not express modal statements, even though it is a modal logic, because LTL can only make assertions about what in fact occurs. For that reason, not everything that relates to modal logic will be related to the setting we discuss here. The type of possibility we discuss is specifically \enquote{possibility according to the transition function}.}

%\textcolor{red}{[Good to relate to temporal logics. I would add a citation here to \cite{BaierKatoen2008}, and clarify temporal logic specifies (modal) tasks with binary/probabilistic nature, and as such by and large it does not actually deals with reward maximisation, as done here. We might want to further relate temporal logics to reward machines and Hosein's LCRL, as done elsewhere - the latter contributions can essentially handle LTL fragments (but not CTL ones). ]}

The intuition behind this section is that a reward function always is expressed in terms of categorical facts, whereas many tasks are naturally expressed in terms of modal facts.
For example, consider an instruction such as \enquote{you should always be \emph{able} to return to the start state}. This instruction seems quite reasonable, but it is not obvious how to translate it into a reward function. Note that this instruction is not telling the agent to \emph{actually} return to the start state, it merely says that it should maintain the \emph{ability} to do so.
This illustrates the motivation behind modal tasks; they let us reward the agent based on what is \emph{possible} or \emph{impossible} along its trajectory, rather than just in terms of what in fact occurs along that trajectory.
%
Given this background motivation, we can now give a formal definition of modal tasks:

%\begin{definition}
%Given a set of states $\mcS$ and a set of actions $\mcA$, a \emph{modal property} $P$ is a function $P : \mcS \times (\mcS \times \mcA \rightsquigarrow \mcS) \to \mathbb{R}$ which takes a state $s \in \mcS$ and a transition function $\tau$ over $\mcS, \mcA$, and returns a real number. Moreover, a \emph{modal reward function} $R$, defined in terms of modal properties $P_1 \dots P_n$, is a function $R : \SxAxS \times \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ which takes a state $s \in \mcS$, action $a \in \mcA$, and next state $s' \in \mcS$, as well as $P_1(s, \tau) \dots P_n(s, \tau)$ and $P_1(s', \tau) \dots P_n(s', \tau)$, and returns a real number. 
%modal task, trivial
%\end{definition}

\begin{definition}\label{definition:modal_reward}
Given a set of states $\mcS$ and a set of actions $\mcA$, a \textbf{modal reward function} $R^\Diamond$  is a function $R^\Diamond : \SxA \times (\mcS \times \mcA \to \Delta(\mcS)) \to \mathbb{R}$ which takes a state $s \in \mcS$, an action $a \in \mcA$, and a transition function $\tau$ over $\mcS$ and $\mcA$, and returns a real number.
\end{definition}

$R^\Diamond(s,a,\tau)$ is the reward that is obtained when taking action $a$ in state $s$ in an environment whose transition function is $\tau$.
Here we allow $R^\Diamond$ an unrestricted dependence on $\tau$, to make our results as general as possible, even if a practical algorithm for solving modal tasks presumably would require restrictions on what this dependence can look like (see Appendix~D).
Modal reward functions can be used to express instructions such as the one we gave above.
For example, a simple case might be \enquote{you get $1$ reward if you reach this goal state, and $-1$ reward if you ever enter a state from which you cannot reach the initial state}. This reward depends on the transition function, because the transition function determines from which states you can reach the initial state.
As usual, $R^\Diamond$ then induces a $Q$-function $Q^\Diamond$, value function $V^\Diamond$, and evaluation function $J^\Diamond$, etc.
We say that a modal reward $R^\Diamond$ and an ordinary reward $R$ are \emph{contingently equivalent} given a transition function $\tau$ if $J^\Diamond$ and $J$ induce the same ordering of policies given $\tau$, and that they are \emph{robustly equivalent} if $J^\Diamond$ and $J$ induce the same ordering of policies for all $\tau$. 
We use $R^\Diamond_\tau$ to denote the reward function $R^\Diamond_\tau(s,a) = R^\Diamond(s,a,\tau)$.
We will also use the following definition. 

\begin{definition}
A modal reward function $R^\Diamond$ is \textbf{vacuous} if there is a reward function $R$ such that for all $\tau$, $R$ and $R^\Diamond_\tau$ have the same policy ordering under $\tau$.
\end{definition}

%are no transition functions $\tau_1, \tau_2$ such that $R^\Diamond_{\tau_1}$ and $R^\Diamond_{\tau_2}$ have different policy orderings under $\tau_1$.

The intuition here is that a vacuous modal reward function does not actually depend on $\tau$ in any important sense. Note that this is \emph{not} necessarily to say that $R^\Diamond_{\tau} = R$ for all $\tau$. For example, it could be the case that $R^\Diamond_{\tau}$ is a \emph{scaled} version of $R$, or that $R^\Diamond_{\tau}$ and $R$ differ by \emph{potential shaping} \citep{ng1999}, or that $R^\Diamond_\tau$ is modified in a way such that $\mathbb{E}_{S' \sim \tau(s,a)}[R^\Diamond_\tau(s,a,S')] = \mathbb{E}_{S' \sim \tau(s,a)}[R(s,a,S')]$,
since none of these differences affect the policy ordering (for a more in-depth examination, see \citet{invariance_ambiguity}).
From this, we get the following straightforward result:
%We can now give the main result of this section:

%Moreover, we say that $R^\Diamond$ is \emph{trivial} if there are no transition functions $\tau_1, \tau_2$ such that $R^\Diamond_{\tau_1}$ and $R^\Diamond_{\tau_2}$ have different policy orderings under $\tau_1$. The intuition here is that such a modal reward function does not depend on $\tau$ in any relevant sense. Note that this is not necessarily to say that $R^\Diamond_{\tau_1} = R^\Diamond_{\tau_2}$ for all $\tau_1, \tau_2$; it could be the case that $R^\Diamond_{\tau_1}$ is a scaled version of $R^\Diamond_{\tau_2}$, for example.


%$s,a,s'$, $R^\Diamond(s,a,s',\tau)$ is constant for all $\tau$ under which $s,a,s'$ is a possible transition. In other words, $R^\Diamond$ is trivial when it does not actually depend on $\tau$. Using this, we can now give a straightforward formal result:

%\red{The definition of "trivial" has to be amended to rule out the possibility that $\tau$ merely determined eg the potential or scale of $R^\Diamond$. Eg, there are two tau for which $R^\Diamond$ does not differ by PS, LS, SR.}

%same ordering for all tau = PS and LS

\begin{theorem}
For any modal reward $R^\Diamond$ and any transition function $\tau$, there exists a reward $R$ that is contingently equivalent to $R^\Diamond$ given $\tau$. Moreover, unless $R^\Diamond$ is vacuous, there is no reward that is robustly equivalent to $R^\Diamond$.
\end{theorem}

In other words, \emph{every} modal task can be expressed with an ordinary reward function in each particular given environment, but \emph{no} reward function expresses a (non-vacuous) modal task in all environments. Is this enough? We argue that it is not, because the construction of $R^\Diamond_\tau$ will invariably be laborious, and require detailed knowledge of the environment. For example, consider the task \enquote{you should always be able to return to the start state}; here, constructing $R^\Diamond_\tau$ would amount to manually enumerating all the states from which the start state is reachable: this would be very much against the spirit of RL, where much of the point is that we want to be able to specify tasks which can be pursued in \emph{unknown} environments. In short, a method which requires a model of the environment is arguably not an RL method. We thus argue that reward functions are largely unable to capture modal tasks in a satisfactory way.
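
As a sketch of how such a task looks formally, the instruction \enquote{you get $1$ reward if you reach this goal state, and $-1$ reward if you ever enter a state from which you cannot reach the initial state} could be written as a modal reward function as follows. The notation is ours and hypothetical: $s_0$ denotes the initial state, $s_g$ the goal state, and $\mathrm{Reach}_\tau(s_0) \subseteq \mcS$ the set of states from which $s_0$ can be reached with positive probability under $\tau$. We also (arbitrarily) let the penalty take priority over the goal reward.
$$
R^\Diamond(s,a,\tau) =
\begin{cases}
-1 & \text{if } s \notin \mathrm{Reach}_\tau(s_0), \\
1 & \text{if } s = s_g \text{ and } s \in \mathrm{Reach}_\tau(s_0), \\
0 & \text{otherwise.}
\end{cases}
$$
For any fixed $\tau$, the set $\mathrm{Reach}_\tau(s_0)$ is a fixed subset of $\mcS$, so $R^\Diamond_\tau$ is an ordinary reward function. However, writing it down explicitly requires computing $\mathrm{Reach}_\tau(s_0)$ from $\tau$, and this set will in general be different for a different transition function.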


%This then raises the question; is this not enough? \red{After all, we are usually only interested in one particular environment. We argue that this is not so.}

One remaining question might be why one would want to express tasks for RL agents in terms of modal properties. After all, what benefit is there to the instruction \enquote{never enter a state from which it is possible to quickly enter an unsafe state} over the instruction \enquote{never enter an unsafe state}? One reason is that the former task might lead to behaviour that is more robust to changes in the environment. For example, if an RL agent is trained in a simulated environment, and deployed in the real world, then it seems like it would be preferable to tell the agent to avoid \emph{risky} states, rather than \emph{unsafe} states, since imperfections in the simulation could lead to an underestimation of the risk involved.
Another example is the existing work on avoiding side effects \citep[e.g.\ ][]{krakovnaside, krakovnaside2, turnerside, griffin2022alls}, which is naturally expressed in modal terms. This work can be viewed as being aimed at making the behaviour of an RL agent more robust to misspecification of the reward function.

%\red{intuitive argument for why these are useful (uncertainty about tau, or change in tau)}

%\red{We will, for the time being, not impose any further conditions on $R^\Diamond$}

%\red{It is clear that this definition can be generalised.}


\section{Solving Tasks That Are Inexpressible by Markovian Rewards}\label{section:solving_inexpressible}

We have pointed to three broad classes of tasks that cannot be expressed using scalar, Markovian reward functions, namely multi-objective, risk-sensitive, and modal tasks. A natural next question is whether these tasks \emph{can} be solved at all using RL, or whether only tasks corresponding to Markovian reward functions can be effectively learnt.  
We briefly discuss this issue below. In short, it is indeed possible to design RL algorithms for tasks in each of these categories. 

First of all, the existing literature already contains several bespoke RL algorithms that solve some of the problems that we have discussed.
Multi-objective reinforcement learning is particularly well-explored, with many existing algorithms. Most of these algorithms are designed to solve a specific MORL objective; for example, \cite{lmorl} solve the \textbf{LexMax} objective, and \cite{tessler2019} solve the \textbf{ConSat} objective. 
%There is (as far as we know) no algorithm for the solving e.g.\ the \textbf{MaxMin} objective, but there is no good reason to believe that such an algorithm cannot be devised. 
Similarly, there are existing algorithms for risk-sensitive RL \citep[e.g.\ ][]{chow2015riskconstrained}, and even algorithms that solve certain modal tasks 
\citep{krakovnaside, krakovnaside2, turnerside, pctl_rl, griffin2022alls}. 
We give a more complete overview of this existing work in Section~\ref{section:related_work}.

It should also be possible to design algorithms that can flexibly solve many different tasks from the classes we have discussed, instead of having to be designed for just one particular task. For example, suppose a MORL objective can be represented by a function $U : \mathbb{R}^k \to \mathbb{R}$, such that $\pi_1 \prec \pi_2$ when $U(J_1(\pi_1) \dots J_k(\pi_1)) < U(J_1(\pi_2) \dots J_k(\pi_2))$, and that $U$ is \emph{differentiable}. We give a few examples of such objectives in
Appendix~C, including e.g.\ a \enquote{soft} version of \textbf{MaxMin}. 
With such an objective, if we have a policy $\pi$ that is differentiable with respect to some parameters $\theta$, then one could compute the gradient of $U(J_1(\pi)\dots J_k(\pi))$ with respect to $\theta$, and then use a policy gradient method to increase $U$. This means that it should be possible to design an actor-critic algorithm that can solve any differentiable MORL objective. We consider the development of such methods to be a promising direction for further work. 
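
The gradient computation described above is an application of the chain rule. Writing $\pi_\theta$ for the parameterised policy (our notation), and assuming that each $J_i(\pi_\theta)$ is differentiable in $\theta$, we have
$$
\nabla_\theta U(J_1(\pi_\theta) \dots J_k(\pi_\theta)) = \sum_{i=1}^k \frac{\partial U}{\partial J_i} \cdot \nabla_\theta J_i(\pi_\theta),
$$
where the partial derivatives are evaluated at $(J_1(\pi_\theta) \dots J_k(\pi_\theta))$. Each $\nabla_\theta J_i(\pi_\theta)$ could then be estimated with a standard policy gradient estimator for $R_i$, and the estimates combined using weights obtained from current estimates of $J_1 \dots J_k$. As one illustrative choice (not necessarily the soft version of \textbf{MaxMin} given in Appendix~C), the log-sum-exp soft minimum with temperature parameter $\beta > 0$ gives
$$
U(x_1 \dots x_k) = -\frac{1}{\beta} \ln \sum_{j=1}^k e^{-\beta x_j}, \qquad \frac{\partial U}{\partial x_i} = \frac{e^{-\beta x_i}}{\sum_{j=1}^k e^{-\beta x_j}}.
$$
These weights are positive, sum to $1$, and concentrate on the smallest $J_i$ as $\beta$ grows, so the update increasingly favours the currently worst-performing objective.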

In Appendix~D, we also outline a possible approach for solving a wide class of modal tasks. Further exploration of this setting would also be interesting for further work.

\section{Discussion}\label{section:discussion}

In this paper, we have studied the ability of Markovian reward functions to express different kinds of problems. We have looked at three classes of tasks: multi-objective tasks, risk-sensitive tasks, and modal tasks, and found that Markovian reward functions are unable to express most of the tasks in each of these three classes.
%
In particular, we have provided necessary and sufficient conditions for when a single-policy MORL problem can be expressed using a scalar, Markovian reward function, and demonstrated that this can only be done when the MORL objective corresponds to a linear weighting of the individual rewards. 
Moreover, we have provided necessary and sufficient conditions for when a monotonic transformation of the return function, $G$, can be expressed as a Markovian reward function, and demonstrated that this can only be done for affine transformations.
Furthermore, we have drawn attention to a class of tasks which have barely been explored previously (namely modal tasks), and shown that most of these tasks cannot be expressed using Markovian reward functions.
Finally, we have shown that many of these problems still can be solved with RL, and even outlined some methods for doing this.

Our work has a number of immediate practical implications. First of all, we have contributed to a more precise demarcation of what types of problems can be expressed within the most common RL formalism. This makes it easier to determine whether standard RL techniques are applicable to a given problem, or whether more specialised methods must be used. In particular, our results show that there are situations in which careful reward specification and reward shaping will not be sufficient to robustly incentivise the desired behaviour. In those cases, we must instead use an alternative policy synthesis method, such as those offered by MORL. Secondly, in the area of reward learning, most algorithms attempt to fit a scalar, Markovian reward function to their training data \cite[e.g.\ ][]{christiano2023deep}. Our work clarifies the implicit modelling assumptions behind these algorithms, and shows that there are many situations in which these models will be misspecified.

Our work also suggests several directions for further work. The fact that the common settings of MORL and risk-sensitive RL indeed are genuine extensions over the standard (scalar, Markovian) setting provides additional motivation for further work in these areas. Our work also suggests that it could be interesting to further explore the modal setting, or other directions that aim to extend the expressivity of the standard RL setting. We give an overview of the existing work in this area in Section~\ref{section:related_work}. Our work also motivates work on reward learning algorithms which do not assume that the preferences of the demonstrator can be captured by a scalar, Markovian reward. There is some existing work in this area \cite[e.g.\ ][]{abate2022learning}, but it remains quite limited. Moreover, another interesting direction for further work would be to quantify the consequences of taking a task which cannot be perfectly represented using a Markovian reward function, and trying to approximate it using a Markovian reward function. For example, could we bound the worst-case regret that might be incurred if a MORL problem is approximated using a scalar reward? Finally, another interesting direction for further work would be to more thoroughly explore the expressivity of other types of problem settings, and their relationship to each other.

%There are several conclusions that we can draw from our results. First of all, the fact that the common settings of MORL and risk-sensitive RL indeed are genuine extensions over the standard (scalar, Markovian) setting provides additional motivation for further work in these areas.
%Moreover, the fact that there are intuitive and commonsensical objectives which cannot be expressed using Markovian reward functions, motivates some caution about the indiscriminate use of such reward functions.
%It means that there are situations in which careful reward specification and reward shaping will not be sufficient to robustly incentivise the desired behaviour. In those cases, we must instead use an alternative policy synthesis method, such as e.g.\ those offered by MORL. This also means that a reward learning algorithm which assumes that the data is generated from a Markovian reward function runs the risk of being misspecified. 


\subsection{Related Work}\label{section:related_work}

There has been a lot of recent work on the expressivity of Markovian reward functions.
Here, we summarise relevant contributions, and detail differences with our work.  

Notably, there are three recent papers which provide necessary and sufficient conditions for when a particular type of task can be expressed using a particular type of reward function. The first of these is \cite{Pitis_2019}, who consider a task to be a preference relation defined over \emph{prospects}, where a prospect is defined as a pair of a state and a policy. Moreover, they generalise the discount function by allowing it to depend on the transition (instead of always being a constant value $\gamma$). They then add two axioms (and one assumption) to the famous vNM axioms (from \cite{vonneumann1947}), to obtain necessary and sufficient conditions for when a task (as they formalise it) can be expressed as a Markovian reward with transition-dependent discounting.
Our work differs from theirs in several ways, as explained shortly.

The next paper is \cite{pmlr-v162-shakerinava22a}, who provide an alternative, simpler axiomatisation of the setting considered by \cite{Pitis_2019}, and also provide further axioms to describe two additional types of environments. They consider environments without any discount factor, but instead use termination probabilities, which can be used to simulate the standard case with exponential discounting.

The third paper is \cite{settling_reward_hypothesis}, who generalise the results of \cite{pmlr-v162-shakerinava22a} even further, and provide an alternative axiom to add to the vNM axioms.
They start by considering preference relations over finite trajectories, and then extend this to a preference relation over policies by saying that a policy $\pi_1$ is preferred to $\pi_2$ if there exists a time $t$ after which the trajectory distribution induced by $\pi_1$ is always preferable to the trajectory distribution induced by $\pi_2$. This encompasses the setting with exponentially discounted reward, the setting with limit-average reward, and the episodic setting. They consider both the case where the discount function is transition-dependent, and the case when it is constant.
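
As a rough formalisation of the policy ordering just described (the notation here is our own, introduced only for illustration, and may differ from that of \cite{settling_reward_hypothesis}), write $D_t^{\pi}$ for the distribution over trajectories of length $t$ induced by a policy $\pi$, and $\succ$ for the preference relation over such distributions. The induced preference $\succ_\Pi$ over policies can then be sketched as
\[
    \pi_1 \succ_\Pi \pi_2 \quad \text{if} \quad \exists t \;\; \forall t' \geq t : \;\; D_{t'}^{\pi_1} \succ D_{t'}^{\pi_2} .
\]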

Our work differs from that by \cite{Pitis_2019, pmlr-v162-shakerinava22a, settling_reward_hypothesis} in a few ways.
First of all, these papers aim to establish general necessary and sufficient conditions for when a task can be formalised as a Markovian reward, whereas we instead focus on three specific classes of tasks that we believe to be especially interesting. 
It might in principle be possible to derive our results as a special case of theirs. However, doing this would be quite non-trivial, and possibly more difficult than our direct derivations. 
Secondly, the axiomatisations provided by \cite{Pitis_2019, pmlr-v162-shakerinava22a, settling_reward_hypothesis} are difficult to use in practice. Our results, on the other hand, are arguably intuitive to understand, and concern some settings that are both popular and important.
Our work could thus be construed as a study on the practical consequences of the work by \cite{Pitis_2019, pmlr-v162-shakerinava22a, settling_reward_hypothesis}, with results that may be more directly useful to practitioners.
There are also several differences in how we formalise the problem compared to \cite{Pitis_2019, pmlr-v162-shakerinava22a, settling_reward_hypothesis}. For example, we consider the case with fixed discount rates, whereas \cite{Pitis_2019} and \cite{settling_reward_hypothesis} consider transition-dependent discount rates. To give another example, \cite{pmlr-v162-shakerinava22a} consider finite trajectories, whereas we consider infinite trajectories (noting that the latter can model the former, but not vice versa).
These differences further contribute to distinguishing our results from theirs.


Another notable piece of related work is \citet{markovrewardexpressivity}, who point to three different ways to formalise the notion of a \enquote{task} (namely, as a set of acceptable policies, as an ordering over policies, or as an ordering over trajectories). They then demonstrate that each of these classes contains at least one instance which cannot be expressed using a Markovian reward function, and provide algorithms which compute reward functions for these types of tasks.
Our work differs from theirs primarily in the kind of result we provide.
We consider three different ways to specify a policy ordering, and then derive necessary and sufficient conditions which can be used to directly determine when the resulting policy ordering can be expressed as a Markovian reward function. \citet{markovrewardexpressivity} do not provide necessary and sufficient conditions, but instead only provide a counter-example for each type of task, showing that Markovian rewards cannot formalise all tasks of that type.

Another important paper is the work by \cite{Vamplew2022}, who argue that there are many important aspects of intelligence which can be captured by MORL, but not by scalar RL. 
Like them, we also argue that MORL is a genuine extension of scalar RL, but our approach is quite different. They focus on the question of whether MORL or (scalar) RL is a better foundation for the development of general intelligence (considering feasibility, safety, etc.), and they provide qualitative arguments and biological evidence. By contrast, we are more narrowly focused on what incentive structures can be expressed by MORL and scalar RL, and our results are mathematical.

\cite{miura2022} considers the question of when a task can be expressed as a constrained MDP (CMDP), or as a Markovian reward. They formalise a task as two sets of policies, $\langle \Pi_G, \Pi_B \rangle$, and consider a CMDP to express the task if all policies in $\Pi_G$, and none of the policies in $\Pi_B$, are feasible, and consider a Markovian reward to express the task if all policies in $\Pi_G$, and none of the policies in $\Pi_B$, are optimal under that reward. They then derive necessary and sufficient conditions for both of these cases, and show that CMDPs are strictly more expressive than Markovian rewards for these types of tasks. The CMDP framework is a special case of the MORL framework we discuss in Section~\ref{section:morl}, roughly corresponding to the \textbf{MaxSat} objective. On the other hand, we formalise the notion of a task as a policy ordering, whereas \cite{miura2022} formalises it as a set of feasible policies.
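
In symbols (a sketch in our own notation, which is an assumption for illustration and not necessarily that of \cite{miura2022}), let $\Pi_F$ denote the set of policies that are feasible in a given CMDP, and let $\Pi^\star_R$ denote the set of policies that are optimal under a Markovian reward $R$. The two notions of expressing a task $\langle \Pi_G, \Pi_B \rangle$ described above then read
\[
    \Pi_G \subseteq \Pi_F \;\text{ and }\; \Pi_B \cap \Pi_F = \emptyset,
    \qquad\qquad
    \Pi_G \subseteq \Pi^\star_R \;\text{ and }\; \Pi_B \cap \Pi^\star_R = \emptyset,
\]
for the CMDP case and the Markovian reward case respectively.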

Also relevant is the work by \cite{pitis2022rational}, who consider a task to consist of multiple Markovian reward functions, each of which may use a different discount parameter, and where the goal is to maximise the sum of these rewards. They then show that this setting may lead to the optimal policy being non-stationary, which demonstrates that it cannot always be expressed using Markovian rewards. Our analysis of the MORL setting allows for more general objectives than the case where the goal is to maximise the sum of the individual rewards. On the other hand, we assume that the same discount parameter is used for each reward. Our analysis is therefore in some ways more general, and in other ways more restrictive, than that of \cite{pitis2022rational}.

Also related is the work by \cite{rewardgaming}, who demonstrate that if for two rewards $R_1$, $R_2$ there are no policies $\pi_1$, $\pi_2$ such that $J_1(\pi_1) < J_1(\pi_2)$ and $J_2(\pi_1) > J_2(\pi_2)$, then either $R_1$ and $R_2$ are equivalent, or one of them is trivial. This means that there are some policy orderings that cannot be expressed using Markovian rewards. 
We consider different kinds of policy orderings than they do.
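
Written out as a displayed condition (assuming that $J_i(\pi)$ denotes the expected return of the policy $\pi$ under the reward $R_i$), the premise of the result by \cite{rewardgaming} is that
\[
    \neg \exists \, \pi_1, \pi_2 : \;\; J_1(\pi_1) < J_1(\pi_2) \;\text{ and }\; J_2(\pi_1) > J_2(\pi_2),
\]
that is, no pair of policies is strictly ordered in opposite ways by the two rewards.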
%However, the types of orderings they consider are different from those we consider in this work.

There is also other relevant work that is less strongly related. For example, \cite{reward_machines} point out that there are certain tasks which cannot be expressed using Markovian rewards, and propose a way to extend their expressivity by augmenting the reward function with an automaton that they call a \emph{reward machine}. 
Similar approaches have also been used by \cite{bcHKA20,hagw21}, tackling infinite-horizon tasks for single- and multi-agent systems. 
There are also other ways to extend Markovian rewards to a more general setting, such as \emph{convex RL}, as studied by e.g.\ \cite{convex_1,convex_2,convex_3,convex_4, convex_5}, and \emph{vectorial RL}, as studied by e.g.\ \cite{vectorial_1,vectorial_2}.
Analysing the expressivity of these problem settings more extensively would be an interesting direction for further work.

There is a large literature on (the overlapping topics of) single-policy MORL, constrained RL, and risk-sensitive RL. 
These areas are too large for us to give a complete overview of this work here.
Some notable examples include \citet{cpo, chow2015riskconstrained, miryoosefi2019reinforcement, tessler2019, lmorl}. 
This existing literature typically focuses on the creation of algorithms for solving particular MORL problems, rather than on characterising when MORL problems can (or cannot) be reduced to scalar RL.
%\textcolor{blue}{[how does this cognate literature relate to this work?]}
Modal RL has (to the best of our knowledge) never been discussed explicitly in the literature before. However, it relates to some existing work, such as side-effect avoidance
\citep{krakovnaside, krakovnaside2, turnerside, griffin2022alls}, and the work by \cite{pctl_rl}.

Finally, our work also relates to existing work in decision theory, social choice theory, and related fields. This of course includes the famous work by \cite{vonneumann1947}. As discussed previously, the work by \cite{harsanyi1955cardinal} is also particularly relevant. Note that work in decision theory and social choice theory typically only considers single-step decision problems, whereas the RL setting of course considers sequential decision making. There are also a few other modelling assumptions that are common in decision theory and social choice theory which do not hold in the RL setting.
For example, in these fields, it is common to assume that the choice set is finite (whereas the set of trajectories in RL may be infinite), that preferences are defined over all distributions over the choice set (whereas in RL it is more common to only consider distributions that can be realised by some policy for a given transition function), and that a utility function can be any function from the choice set to real numbers (whereas many of these functions cannot be expressed as reward functions).
Consequently, results from decision theory and social choice theory only sometimes generalise to the RL setting. For example, in Section~\ref{section:morl}, we provide some examples of results that do generalise to the RL setting, and in Section~\ref{section:risk_sensitive_rl}, we provide some examples of results which do not generalise.

%\subsection{Further Work}

%There are several ways to extend our work. First of all, it is likely that there are classes of tasks beyond the three we have considered in this work that it would be interesting to analyse as well. Second, we have noted that there are ways to extend the expressivity of Markovian reward functions (using e.g.\ reward machines, convex RL, or vectorial RL, etc). It would be interesting to analyse the expressivity of these settings in a similar manner. 

%\subsection{Extensions of this Work}

%There are several ways to extend our work. First of all, we have studied to what extent three particular classes of tasks can be expressed using scalar and Markovian reward functions. It is likely that there are other classes of tasks that it would be interesting to consider as well. Moreover, it would also be interesting to consider the expressivity of more general classes of reward functions. For example, it would be interesting to know to what extent MORL tasks can be expressed using reward machines \citep{reward_machines}, and similar.

%Our work also provides a strong motivation for developing newer, better RL algorithms, which can learn tasks that cannot be expressed using Markovian reward functions. Amongst the possible approaches, in section~\ref{section:solving_inexpressible} we have outlined one for learning any differentiable MORL objective using policy gradients, and in Appendix~\ref{appendix:solve_modal}  we discuss an approach for learning a large class of modal tasks.


%We have provided necessary and sufficient conditions for when a MORL problem can be reduced to an ordinary single-objective RL problem. These conditions show that only \enquote{linear} MORL problems can be expressed using a single reward function, and this excludes almost all interesting MORL problems. To highlight this fact, we have defined seven different MORL objectives, which all correspond to natural and useful ways to combine and trade off different rewards, and shown that none of them can be expressed using only one reward function. This shows that traditional RL is unable to express many natural kinds of tasks. Moreover, we have provided two simple algorithms (MMQL and MORL-AC) that do solve some of these problems, and demonstrated that they are effective (using both theoretical and empirical results). MMQL pursues the \texttt{MaxMax} objective (Definition~\ref{def:maxmax}), whereas MORL-AC can pursue any differentiable MORL objective. This establishes that single-policy multi-objective reinforcement learning is a substantive and tractable generalisation of scalar reinforcement learning, and provides motivation for further work in multi-objective reinforcement learning.

%There are several ways to extend our work. First of all, we have defined two MORL objectives for which we have not provided any algorithm, namely \texttt{MaxMin} (Definition~\ref{def:maxmin}) and \texttt{MaxSat} (Definition~\ref{def:maxsat}). Therefore, one clear direction for further work is to design algorithms capable of solving these objectives. Another exciting direction could be to develop algorithms capable of learning many different MORL objectives, similar to MORL-AC. Finally, our definition of a MORL objective (Definition~\ref{def:morl_objective}) only lets the policy ordering be sensitive to $J_1 \dots J_k$, i.e.\ to the \emph{expectation} of each reward. This could be generalised further, by allowing the ordering to depend on other properties of each reward's distribution. For example, we could allow a MORL objective to be risk-averse with respect to some rewards, by letting its ordering depend on e.g.\ the value-at-risk, or conditional-value-at-risk, of that reward's distribution.


\bibliography{references}
%\bibliographystyle{iclr2023_conference}

%\newpage
%\appendix

%\include{appendix}

\end{document}
