% \documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
% version; also before submission to
% see how the non-anonymous paper
% would look like

%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
% Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
 % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

% for cross referencing the main text
% PLEASE ONLY USE xr IN THE SUPPLEMENTARY MATERIAL. 
% In the main paper, hard code any cross-reference to the supplementary material. 
\usepackage{xr} 
\externaldocument{skalse_235}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{On the Limitations of Markovian Rewards\\ to Express Multi-Objective, Risk-Sensitive, and Modal Tasks\\(Supplementary Material)}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1,2]{\href{mailto:<joar.skalse@cs.ox.ac.uk>?Subject=Your UAI 2023 paper}{Joar~Skalse}{}}
\author[1]{Alessandro~Abate}
% Add affiliations after the authors
\affil[1]{%
    Computer Science Department\\
    Oxford University\\
    Oxford, UK
}
\affil[2]{%
    The Future of Humanity Institute\\
    Oxford, UK\\
}

\usepackage{amsthm, amsfonts, bbm, csquotes, amssymb} 

%\theoremstyle{plain}
%\newtheorem{corollary}[theorem]
%\newtheorem{theorem}{Theorem} %[section]
%\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{proposition}{Proposition}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{corollary}{Corollary}
\newtheorem{example}{Example}
%\theoremstyle{definition}
\newtheorem{definition}{Definition}
\newtheorem{assumption}{Assumption}
%\theoremstyle{remark}
\newtheorem{remark}{Remark}

\usepackage{xcolor}
\newcommand{\red}[1]{\textcolor{red}{#1}}

\newcommand{\M}{\mathcal{M}}

\newcommand{\States}{\mathcal{S}}
\newcommand{\Actions}{\mathcal{A}}
\newcommand{\mcS}{\mathcal{S}}
\newcommand{\mcA}{\mathcal{A}}
%\newcommand{\t}{\tau}
\newcommand{\init}{\mu_0}
\newcommand{\R}{R}
\newcommand{\y}{\gamma}
\newcommand{\Rs}{\textbf{R}}
%\newcommand{\ys}{\textbf{\y}}

\newcommand{\Ob}{{\mathcal{O}}}

\newcommand{\m}{m_{\tau,\init,\gamma}}

\newcommand{\SxA}{\mcS \times \mcA}
\newcommand{\SxAxS}{\mcS \times \mcA \times \mcS}

\newcommand{\MDP}{\langle \mcS, \mcA, \tau, \init, \R, \y \rangle}
\newcommand{\MOMDP}{\langle \mcS, \mcA, \tau, \init, \Rs, \y \rangle}
\newcommand{\MDPwO}{\langle \mcS, \mcA, \tau, \init, \tilde{\R}, \y \rangle}
\newcommand{\MDPwOb}{\langle \mcS, \mcA, \tau, \init, \hat{\R}, \y \rangle}
\newcommand{\env}{\langle \mcS, \mcA, \tau, \init, \_, \y \rangle}
  
\begin{document}
  
%\onecolumn %% Turn this off if single column is desired for the supplement
\maketitle

\setcounter{theorem}{0}
\setcounter{corollary}{0}
\appendix

\section{Proofs}\label{appendix:proofs}

Here, we will provide all proofs that were omitted from the main text. We will begin with Theorem~\ref{thm:linearity_thm}, from Section~\ref{section:morl}.

\begin{theorem}
If a MOMDP $\M$ with objective $\Ob$ is scalarizable, then there exist $w_1 \dots w_k \in \mathbb{R}$ such that $\M$ with $\Ob$ is scalarized by the reward $R(s,a) = \sum_{i=1}^k w_i \cdot R_i(s,a)$.
\end{theorem} 

To prove this, we must first set up some theoretical preliminaries. For convenience, let $n = |S||A|$, let $T = \SxA$, and let each transition in $\SxA$ be indexed by an integer $i \in [1,n]$. Moreover, given a reward function $R$, let $\Vec{R} \in \mathbb{R}^n$ be the vector such that $\Vec{R}_i = R(T_i)$. Next, given $\tau$, $\init$, and $\gamma$, let $\m : \Pi \to \mathbb{R}^n$ be the function where 
$$
\m(\pi)_i = \sum_{t=0}^{\infty} \y^t \mathbb{P}_{\xi \sim \pi}(\xi_t = T_i).
$$
Now $J(\pi) = \vec{R} \cdot \m(\pi)$. In other words, this construction lets us decompose $J$ into two steps, the first of which embeds $\pi$ in $\mathbb{R}^n$, and the second of which is a linear function. 
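
As a minimal illustration of this decomposition (using a hypothetical environment chosen for simplicity, not one from the main text), consider a single state $s$ with two actions $a_1, a_2$, so that $n = 2$, $T_1 = (s,a_1)$, and $T_2 = (s,a_2)$. If $\pi_p$ is the policy that takes $a_1$ with probability $p$, then the transition at each time step is $T_1$ with probability $p$, and so
$$
\m(\pi_p) = \left\langle \frac{p}{1-\y}, \frac{1-p}{1-\y} \right\rangle,
$$
$$
J(\pi_p) = \vec{R} \cdot \m(\pi_p) = \frac{p \cdot R(s,a_1) + (1-p) \cdot R(s,a_2)}{1-\y}.
$$
In this example, the embedding $\m$ depends only on the policy and the environment, whereas the reward enters only through the linear map $x \mapsto \vec{R} \cdot x$.
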
%Note that $m_{\tau,\init}$ depends on $\tau$ and $\init$.
%Also note that $L_1(\m(\pi)) = \frac{1}{1-\gamma}$ for all $\pi$. Let $S_\gamma$ be the ($n-1$)-dimensional affine subspace of $\mathbb{R}^n$ formed by all points $x$ such that $L_1(x) = \frac{1}{1-\gamma}$. This means that $\mathrm{Im}(\m) \in S_\gamma$.
Let $S_\gamma$ be the smallest affine subspace of $\mathbb{R}^n$ such that $\mathrm{Im}(\m) \subseteq S_\gamma$.
We will also use the following lemma:

\begin{lemma}\label{lemma:open_set}
$\mathrm{Im}(\m)$ is open in $S_\gamma$.
\end{lemma}

For a proof of Lemma~\ref{lemma:open_set}, see \cite{IRLmisspecification-supp} (their Lemma A.11).
We can now prove Theorem~\ref{thm:linearity_thm}:

\begin{proof}
Suppose the MOMDP $\MOMDP$ with $\Ob$ is equivalent to the MDP $\MDP$.

First, note that $J(\pi) = \vec{R} \cdot \m (\pi)$, and that $J_i (\pi) = \vec{R_i} \cdot \m (\pi)$ for each $R_i \in \Rs$. 
Let $M$ be the $(k \times n)$-dimensional matrix that maps each vector $x \in \mathbb{R}^n$ to $\langle \vec{R}_1 \cdot x, \dots, \vec{R}_k \cdot x\rangle$. In other words, $M$ is the matrix whose rows are $\vec{R}_1 \dots \vec{R}_k$.
Since $J(\pi)$ is a function of $J_1(\pi) \dots J_k(\pi)$, we have that $\vec{R} \cdot x_1 = \vec{R} \cdot x_2$ if $M \cdot x_1 = M \cdot x_2$ for any $x_1, x_2 \in \mathrm{Im}(\m)$.

We will first show that $\vec{R} \cdot x_1 = \vec{R} \cdot x_2$ if $M \cdot x_1 = M \cdot x_2$ for any $x_1, x_2 \in S_\gamma$, not just any $x_1,x_2 \in \mathrm{Im}(\m)$.
Let $x_1, x_2$ be any two points in $S_\gamma$ such that $M \cdot x_1 = M \cdot x_2$, and let $x$ be some arbitrary element of $\mathrm{Im}(\m)$. Let $y_1 = x_1 - x$ and $y_2 = x_2 - x$. Since $\mathrm{Im}(\m)$ is open in $S_\gamma$ (as per Lemma~\ref{lemma:open_set}), there is an $\alpha > 0$ such that $x + \alpha \cdot y_1 \in \mathrm{Im}(\m)$ and $x + \alpha \cdot y_2 \in \mathrm{Im}(\m)$. Since $M$ is linear, and since $M \cdot x_1 = M \cdot x_2$, we have that $M \cdot (x + \alpha \cdot y_1) = M \cdot (x + \alpha \cdot y_2)$. Moreover, since $x + \alpha \cdot y_1 \in \mathrm{Im}(\m)$ and $x + \alpha \cdot y_2 \in \mathrm{Im}(\m)$, this means that $\vec{R} \cdot (x + \alpha \cdot y_1) = \vec{R} \cdot (x + \alpha \cdot y_2)$. Finally, since $\vec{R}$ is linear, subtracting $\vec{R} \cdot x$ from both sides and dividing by $\alpha$ gives $\vec{R} \cdot y_1 = \vec{R} \cdot y_2$, and hence $\vec{R} \cdot x_1 = \vec{R} \cdot x_2$. Thus, if $M \cdot x_1 = M \cdot x_2$ then $\vec{R} \cdot x_1 = \vec{R} \cdot x_2$ for all $x_1, x_2 \in S_\gamma$.

Next, note that we can decompose $M$ into two matrices $M_1, M_2$ such that $M = M_1 \cdot M_2$, where $M_1$ is invertible, and $M_2$ is an orthogonal projection such that $M_2(x_1) = M_2(x_2)$ if and only if $M(x_1) = M(x_2)$. This means that $\vec{R} \cdot x = \vec{R} \cdot M_2(x)$ for all $x \in S_\gamma$. From this, we obtain that $\vec{R} \cdot x = \vec{R} \cdot M_1^{-1} \cdot M_1 \cdot M_2(x) = \vec{R} \cdot M_1^{-1} \cdot M(x)$ for all $x \in S_\gamma$. Since $\vec{R} \cdot M_1^{-1}$ is a linear function, this means that $\vec{R} \cdot x$ can be expressed as $\sum_{i=1}^k w_i \cdot M(x)_i$ for some $w_1 \dots w_k$ for all $x \in S_\gamma$. 

Recall that $J(\pi) = \vec{R} \cdot \m(\pi)$, where $\m(\pi) \in S_\gamma$.
This means that $J(\pi) = \sum_{i=1}^k w_i \cdot M(\m(\pi))_i = \sum_{i=1}^k w_i \cdot \vec{R_i} \cdot \m(\pi) = \sum_{i=1}^k w_i \cdot J_i(\pi)$. This completes the proof.
\end{proof}

\begin{corollary}
If $\Ob(J_1 \dots J_k)$ has a non-linear representation $U$, and $\M$ is a MOMDP whose $J$-functions are $J_1 \dots J_k$, then $\M$ with $\Ob$ is not equivalent to any MDP.
\end{corollary}
\begin{proof}
Assume for contradiction that $\M$ with $\Ob$ is equivalent to the MDP $\MDP$. Then $J$ represents $\Ob(J_1 \dots J_k)$, and this in turn means that $U$ must be strictly monotonic in $J$. Moreover, Theorem~\ref{thm:linearity_thm} implies that $J = \sum_{i=1}^k w_i \cdot J_i$ for some $w_1 \dots w_k \in \mathbb{R}$. However, this contradicts the assumption that $U$ is non-linear.
\end{proof}

\begin{corollary}
There is no MDP equivalent to $\M$ with $\textbf{LexMax}$, as long as $\M$ has at least two reward functions that are neither trivial, equivalent, nor opposite. 
\end{corollary}
\begin{proof}
%Assume for contradiction that $\M$ has at least two reward functions which are neither trivial, equivalent, or opposites, and that $\M$ with $\texttt{LexMax}$ is equivalent the MDP $\tilde{\M} = \MDPwO$.
%We will use a standard construction to show that this would imply the existence of an injection from the reals to the rationals.
%Let $i$ be the smallest number such that $R_i$ is non-trivial, and let $j$ be the smallest number greater than $i$ such that $R_j$ is non-trivial, and not equivalent to or opposite of $R_i$. We can then map each real number between $\max_{\pi}$

Suppose $\M$ with $\texttt{LexMax}$ is equivalent to $\tilde{\M} = \MDPwO$. Let $i$ be the smallest number such that $R_i$ is non-trivial, and let $j$ be the smallest number greater than $i$ such that $R_j$ is non-trivial, and not equivalent to or opposite of $R_i$. Then there are $\pi_1,\pi_2$ such that $J_i(\pi_1) = J_i(\pi_2)$ and $J_j(\pi_1) < J_j(\pi_2)$, which means that $\pi_1 \prec_\texttt{Lex}^\M \pi_2$.
Moreover, since $\tilde{J}$ represents $\prec_\texttt{Lex}^\M$, it follows that there are no $\pi, \pi'$ such that $J_i(\pi) < J_i(\pi')$ and $\tilde{J}(\pi) > \tilde{J}(\pi')$. Then Theorem 1 in \citet{rewardgaming-supp} implies that $R_i$ is equivalent to $\tilde{R}$. However, then $\tilde{J}(\pi_1) = \tilde{J}(\pi_2)$, which means that $\tilde{J}$ cannot represent $\prec_\texttt{Lex}^\M$.
\end{proof}

%\begin{lemma}\label{lemma:open_ball}
%If $\hat{\mcS}$ is the set of all states that are reachable under $\t$ and $\init$, then $\mathrm{Im}(m)$ is located in an $(|\hat{\mcS}||\mcA|-1)$-dimensional linear subspace of $\mathcal{R}^{|\mcS||\mcA|}$. Moreover, if $\tilde{\Pi}$ is the set of all policies that are nondeterministic everywhere, then $m(\tilde{\Pi})$ is open in that subspace.
%\end{lemma}
%\begin{proof}
%See Skalse et al 2022.
%\end{proof}

\begin{corollary}
There is no MDP equivalent to $\M$ with $\textbf{MaxMin}$, unless $\M$ has a reward function $R_i$ such that $J_i(\pi) \leq J_j(\pi)$ for all $j \in \{1 \dots k\}$ and all $\pi$.
\end{corollary}
\begin{proof}
%First note that 
$\Ob_\texttt{Min}^\M$ is represented by 
the function 
$U(\pi) = \min_i J_i(\pi)$. Moreover, if $\M$ has no reward function $R_i$ such that $J_i(\pi) \leq J_j(\pi)$ for all $j \in \{1 \dots k\}$ and all $\pi$, then this representation is non-linear. 
%Therefore, 
Corollary~\ref{cor:nonlinear_rep} then implies that $\M$ with $\texttt{MaxMin}$ is not equivalent to any MDP. 
%A rigorous proof that $U$ is non-linear is provided in the supplementary material.
\end{proof}
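
To see concretely how the non-linearity arises, consider the following minimal example (a hypothetical MOMDP chosen purely for illustration). Let there be a single state $s$ with two actions $a_1, a_2$, and two rewards with $R_1(s,a_1) = R_2(s,a_2) = 1$ and $R_1(s,a_2) = R_2(s,a_1) = 0$. If $\pi_p$ takes $a_1$ with probability $p$, then $J_1(\pi_p) = p/(1-\y)$ and $J_2(\pi_p) = (1-p)/(1-\y)$, so that
$$
U(\pi_p) = \min_i J_i(\pi_p) = \frac{\min(p, 1-p)}{1-\y}.
$$
Any linear combination $w_1 \cdot J_1(\pi_p) + w_2 \cdot J_2(\pi_p)$ is affine in $p$, and is therefore either constant, strictly increasing, or strictly decreasing in $p$. However, $U(\pi_0) = U(\pi_1) = 0 < U(\pi_{1/2})$, so no such combination induces the same policy ordering as $U$.
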

%The rest of this proof is concerned with showing rigorously that $U$ is non-linear.
%First, since there is no $R_i$ such that $J_i(\pi) \leq J_j(\pi)$ for all $\pi$ and all $j \in \{1 \dots k\}$, we have that there must be two policies $\pi_\alpha, \pi_\beta$ and two reward functions $R_\alpha, R_\beta$ such that 
%$J_\alpha(\pi_\alpha) \leq J_i(\pi_\alpha)$ for all $j \in \{1 \dots k\}$,
%$J_\beta(\pi_\beta) \leq J_i(\pi_\beta)$ for all $i \in \{1 \dots k\}$,
%$J_\alpha(\pi_\alpha) \neq J_\beta(\pi_\alpha)$, and $J_\alpha(\pi_\beta) \neq J_\beta(\pi_\beta)$.
%Let $\vec{\pi_\alpha} \in \mathbb{R}^{|S||A|}$ be the vector where $\vec{\pi_\alpha}[s,a] = \pi_\alpha(a \mid s)$, and similarly for $\vec{\pi_\beta}$. Next, consider the vectors $\vec{\pi}$ on a linear path from $\vec{\pi_\alpha}$ to $\vec{\pi_\beta}$. Specifically, let $\vec{\pi}_\delta = \vec{\pi_\alpha} + \delta \cdot (\vec{\pi_\beta} - \vec{\pi_\alpha})$; then 

%Each such vector $\vec{\pi}$ corresponds to a policy $\pi$. Next, since $J(\pi)$ is continuous in $\pi$, there is a last 

% there is a last point where alpha is minimal; consider this point and two nearby points on either side
% linearity is violated


%For each reward function $R_i$, let $P_i$ be the set of all policies $\pi$ such that $J_i(\pi) \leq J_j(\pi)$ and all $j \in \{1 \dots k\}$. %Note that every policy $\pi$ is a member of at least one of $P_1 \dots \P_k$. 
%Since $J(\pi)$ is continuous in $\pi$, and since there is no $R_i$ such that $J_i(\pi) \leq J_j(\pi)$ for all $\pi$ and all $j \in \{1 \dots k\}$, we have that there must be a 

%Moreover, since $J(\pi)$ is continuous in $\pi$, we have that each non-empty $P_i$ intersects at least one other $P_j$.

%Pick a $P_i$ that contains a policy $\pi$ that is nondeterministic everywhere.

% pick fully random policy "on the edge" between two rewards
% invoke open ball around it to find two policies with different min rewards
% invoke linearity, and draw line to center ball
% et viola: linearity violated

% find close-by policy that isn't minimised by R_i
% evoke 

%Since there is no $R_i$ such that $J_i(\pi) \leq J_j(\pi)$ for all $\pi$ and all $j \in \{1 \dots k\}$, we have that $P_i \neq \Pi$. 

%Since $J_i(\pi)$ is continuous in $\pi$, we have that $P_i$ must contain 

%Next, let $R_i$ be a reward function for which there is at least one policy $\pi$ such that $J_i(\pi) \leq J_j(\pi)$ and all $j \in \{1 \dots k\}$, and let $S$ be the 

%Next, since $\M$ has no reward function $R_i$ such that $J_i(\pi) \leq J_j(\pi)$ for all $\pi$ and all $j \in \{1 \dots k\}$, and since $J_i(\pi)$ is continuous in $\pi$,  

%Next, recall that $J_i = L_i \circ m$, where $m : \Pi \to \mathbb{R}^{|\mcS||\mcA|}$, and $L_i$ is a linear function. 

%This means that if $\M$ has no reward function $R_i$ such that $J_i(\pi) \leq J_j(\pi)$ for all $\pi$ and all $j \in \{1 \dots k\}$, then $U$ is not strictly monotonic in any function that is linear in $J_1 \dots J_k$. 
%((NOT SO EASY! find policy where two non-equivalent R minimise the objective, get open ball around this policy, find place where linearity is violated))

\begin{corollary}
There is no MDP equivalent to $\M$ with $\textbf{MaxSat}$, as long as $\M$ has at least one reward $R_i$ where $J_i(\pi_1) < c_i$ and $J_i(\pi_2) \geq c_i$ for some $\pi_1, \pi_2 \in \Pi$.
\end{corollary}
\begin{proof}
Note that $\texttt{MaxSat}(\M)$ is represented by the function $U(\pi) = \sum_{i=1}^k \mathbbm{1}[J_i(\pi) \geq c_i]$, where $\mathbbm{1}[J_i(\pi) \geq c_i]$ is the function that is equal to $1$ when $J_i(\pi) \geq c_i$, and $0$ otherwise. 
%Moreover, $U$ is non-continuous if $\M$ has at least one reward function $R_i$ and at least two policies $\pi_1,\pi_2$ such that $J_i(\pi_1) < c_i$ and $J_i(\pi_2) \geq c_i$. 
%This means that 
Moreover,
$U$ is not strictly monotonic in any function that is linear in $J_1 \dots J_k$. Corollary~\ref{cor:nonlinear_rep} thus implies that $\M$ with $\texttt{MaxSat}$ is not equivalent to any MDP.
\end{proof}
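
As a minimal illustration (using a hypothetical single-state environment chosen for simplicity), let $k = 1$, let there be a single state $s$ with actions $a_1, a_2$, let $R_1(s,a_1) = 1$ and $R_1(s,a_2) = 0$, and let $c_1 = 1/(2(1-\y))$. If $\pi_p$ takes $a_1$ with probability $p$, then $J_1(\pi_p) = p/(1-\y)$, and so
$$
U(\pi_p) = \mathbbm{1}[p \geq 1/2].
$$
Under $\texttt{MaxSat}$, the policies $\pi_{0.6}$ and $\pi_{0.9}$ are therefore equally good, and both are strictly better than $\pi_{0.1}$. A reward of the form $w_1 \cdot R_1$ with $w_1 \neq 0$ distinguishes $\pi_{0.6}$ from $\pi_{0.9}$, whereas $w_1 = 0$ fails to distinguish $\pi_{0.1}$ from $\pi_{0.6}$, so no linear combination induces this ordering.
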

\begin{corollary}
There is no MDP equivalent to $\M$ with $\textbf{ConSat}$, unless either $R_1$ and $R_2$ are equivalent, or $\max_{\pi}J_1(\pi) \leq c$.
\end{corollary}
\begin{proof}
%First note that 
$\Ob_\texttt{Con}^\M$ is represented by
$$
U(\pi) = \begin{cases} J_1(\pi) & \text{if } J_1(\pi) \leq c, \\ J_2(\pi) - \min_{\pi'} J_2(\pi') + c & \text{otherwise.} \end{cases}
$$
Moreover, this representation is non-linear, unless either $R_1$ and $R_2$ are equivalent, or $\max_{\pi}J_1(\pi) \leq c$.
%Therefore, 
Corollary~\ref{cor:nonlinear_rep} then implies that $\M$ with $\texttt{ConSat}$ is not equivalent to any MDP. 
\end{proof}



We next give the proof of Theorem~\ref{thm:risk_theorem}, from Section~\ref{section:risk_sensitive_rl}.

\begin{theorem}
Given $\States$, $\Actions$, and $\gamma$, 
let $R_1$ and $R_2$ be two reward functions.
If $\gamma \geq 0.5$ and, for all $\xi_1,\xi_2 \in (\SxA)^\omega$, 
$$
G_1(\xi_1) \leq G_1(\xi_2) \iff G_2(\xi_1) \leq G_2(\xi_2),
$$
then there exist $a \in \mathbb{R}$ and $b \in \mathbb{R}_{>0}$ such that for all $\xi \in (\SxA)^\omega$,
$$
G_1(\xi) = b \cdot G_2(\xi) + a.
$$
\end{theorem}

\begin{proof}
We can first note that if $G_1$ is constant then $G_2$ must also be constant, and vice versa, in which case this result is straightforward (with $b = 1$, $a = G_1 - G_2$). For the rest of the proof, assume that neither $G_1$ nor $G_2$ is constant.

For convenience, let $n = |S||A|$, let $T = \SxA$, and let each transition in $\SxA$ be indexed by an integer $i \in [1,n]$. Let $\Vec{R_1} \in \mathbb{R}^n$ be the vector such that $\Vec{R_1}_i = R_1(T_i)$, and $\Vec{R_2} \in \mathbb{R}^n$ be the vector such that $\Vec{R_2}_i = R_2(T_i)$. Moreover, let $m : T^\omega \to \mathbb{R}^n$ be the function where 
$$
m(\xi)_i = \sum_{j=0}^\infty \gamma^j \mathbbm{1}[\xi_j = T_i].
$$
Now $G_1(\xi) = \vec{R_1} \cdot m(\xi)$ and $G_2(\xi) = \vec{R_2} \cdot m(\xi)$. In other words, this construction lets us decompose $G_1$ and $G_2$ into two steps, the first of which embeds $\xi$ in $\mathbb{R}^n$, and the second of which is a linear function.

Next, let us consider what $\mathrm{Im}(m)$ looks like. First, note that $m(\xi)_i \geq 0$ for all $i$ and all $\xi$. Next, note that $\sum_{i=1}^n m(\xi)_i = 1/(1-\gamma)$ for all $\xi$. This means that $\mathrm{Im}(m)$ is located inside the simplex that is formed by all points in the nonnegative orthant of $\mathbb{R}^n$ whose $L_1$-norm is $1/(1-\gamma)$.

Consider two arbitrary transitions $t_i, t_j \in T$. Note that $m(t_i^\omega)$ is the point where the aforementioned simplex intersects the $i$'th coordinate axis of $\mathbb{R}^n$, and similarly for $m(t_j^\omega)$. Moreover, if $\xi$ is made up entirely from $t_i$ and $t_j$ in some combination and order (i.e., $\xi \in \{t_i,t_j\}^\omega \subseteq T^\omega$), then $m(\xi)$ is on the line segment between $m(t_i^\omega)$ and $m(t_j^\omega)$. 

Let $\alpha$ be any number in $[0, 1/(1-\gamma)]$. Since $1/\gamma > 1$, there is a representation of $\alpha$ in base $1/\gamma$.
This means that there is an integer $u$ and a sequence of integers $\{a_k\}_{k \in (-\infty, u]}$ such that
$$
\sum_{k = u}^{-\infty} a_k \cdot (1/\gamma)^k = \alpha
$$
where each $a_k$ is a nonnegative integer less than $1/\gamma$. Since $\gamma \geq 0.5$, this means that each $a_k$ is 0 or 1. Moreover, since $\alpha \leq 1/(1-\gamma)$, we have that $u \leq 0$. By rewriting using $k' = -k$, this means that there is a sequence $\{a_{k'}\}_{k' \in [0,\infty)}$ where each $a_{k'} \in \{0,1\}$ such that
$$
\sum_{k' = 0}^{\infty} a_{k'} \cdot \gamma^{k'} = \alpha.
$$
Let $\xi \in T^\omega$ be the trajectory where $\xi_{k'} = t_i$ if $a_{k'} = 1$, and $t_j$ if $a_{k'} = 0$. We now have that $m(\xi) = (1-\gamma)\alpha \cdot m(t_i^\omega) + (1-(1-\gamma)\alpha) \cdot m(t_j^\omega)$. Since $\alpha$ was chosen arbitrarily from $[0, 1/(1-\gamma)]$, this means that every point on the line segment between $m(t_i^\omega)$ and $m(t_j^\omega)$ is in $\mathrm{Im}(m)$. Since $t_i$ and $t_j$ were also chosen arbitrarily, this holds for any $t_i$ and $t_j$ in $T$.

Consider again the simplex that is formed by all points in the nonnegative orthant of $\mathbb{R}^n$ whose $L_1$-norm is $1/(1-\gamma)$. We have just shown that every point on the edges (1-faces) of this simplex is in $\mathrm{Im}(m)$.

Consider the linear functions that $\vec{R_1}$ and $\vec{R_2}$ induce on $\mathbb{R}^n$. Take the point $x$ at the centre of the simplex, and consider the tangent plane of $\vec{R_1}$ at this point. Since every point on any of the simplex edges is in $\mathrm{Im}(m)$, we have that this tangent plane must intersect $\mathrm{Im}(m)$ at $n-1$ linearly independent points. Since $\vec{R_1}\cdot x_1 = \vec{R_1}\cdot x_2$ implies that $\vec{R_2}\cdot x_1 = \vec{R_2}\cdot x_2$ for all $x_1,x_2 \in \mathrm{Im}(m)$, we have that the tangent plane of $\vec{R_2}$ at $x$ must intersect $\mathrm{Im}(m)$ at the same points. This implies that there are $a, b \in \mathbb{R}$ such that $G_1 = b \cdot G_2 + a$. Since moreover $G_1(\xi_1) \leq G_1(\xi_2) \iff G_2(\xi_1) \leq G_2(\xi_2)$, we have that $b > 0$.
\end{proof}
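
The expansion step in the proof above can be illustrated with a small numerical example (the specific values are chosen purely for illustration). Let $\gamma = 0.5$, so that $1/(1-\gamma) = 2$, and let $\alpha = 1.25$. In base $1/\gamma = 2$ we have $\alpha = 1 \cdot \gamma^0 + 0 \cdot \gamma^1 + 1 \cdot \gamma^2$, which corresponds to the trajectory $\xi = t_i, t_j, t_i, t_j, t_j, t_j, \dots$ For this trajectory,
$$
m(\xi)_i = \gamma^0 + \gamma^2 = 1.25, \qquad m(\xi)_j = \gamma^1 + \sum_{k=3}^\infty \gamma^k = 0.5 + 0.25 = 0.75,
$$
so $m(\xi) = 0.625 \cdot m(t_i^\omega) + 0.375 \cdot m(t_j^\omega)$, which lies on the edge between $m(t_i^\omega)$ and $m(t_j^\omega)$. The assumption that $\gamma \geq 0.5$ matters here: if, say, $\gamma = 1/3$, then any trajectory $\xi \in \{t_i,t_j\}^\omega$ has either $m(\xi)_i \leq \gamma/(1-\gamma) = 0.5$ (if $\xi_0 = t_j$) or $m(\xi)_i \geq 1$ (if $\xi_0 = t_i$), so the points on this edge with $m(\xi)_i \in (0.5, 1)$ are not attained by any such trajectory.
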

\begin{theorem}
For any modal reward $R^\Diamond$ and any transition function $\tau$, there exists a reward $R$ that is contingently equivalent to $R^\Diamond$ given $\tau$. Moreover, unless $R^\Diamond$ is trivial, there is no reward that is robustly equivalent to $R^\Diamond$.
\end{theorem}
\begin{proof}
This is straightforward.
For the first part, simply let $R(s,a,s') = R^\Diamond(s,a,s',\tau)$. 
The second part is immediate from the definition of trivial modal reward functions.
%For the second part, if $R^\Diamond$ is non-trivial then there are transition functions $\tau_1,\tau_2$ such that $R^\Diamond_{\tau_1}$ and $R^\Diamond_{\tau_2}$ have different policy orderings under $\tau_1$. As per Theorem 2.6 in \cite{IRLmisspecification}, this implies that $R^\Diamond_{\tau_1}$ and $R^\Diamond_{\tau_2}$ do not differ by positive linear scaling, potential shaping, and $S'$-redistribution.
\end{proof}

\section{Tasks as Optimal Policies}\label{appendix:towards_nas}

In this paper, we primarily think of a \enquote{task} as corresponding to a policy ordering.
An alternative way to formalise the notion of a task is as a set of optimal policies. It is fairly straightforward to provide necessary and sufficient conditions for when this type of task can be expressed using a scalar, Markovian reward function.

\begin{proposition}
A set of policies $\hat{\Pi}$ is the optimal policy set for some reward if and only if there is a function $o : \mcS \to \mathcal{P}(\mcA) \setminus \{\varnothing\}$ that maps each state to a (non-empty) set of \enquote{optimal actions}, such that $\pi \in \hat{\Pi}$ if and only if $\mathrm{supp}(\pi(s)) \subseteq o(s)$ for all $s \in \mcS$.
\end{proposition}

\begin{proof}
For the \enquote{if} part, consider the reward function $R$ where $R(s,a,s') = 0$ if $a \in o(s)$, and $R(s,a,s') = -1$ otherwise. 
The \enquote{only if} part follows from the fact that the optimal $Q$-function $Q^\star$ is the same for all optimal policies, 
so we can let $o(s) = \mathrm{argmax}_a Q^\star(s,a)$.
\end{proof}

We can see that some tasks of this form cannot be expressed by Markovian rewards. For example, consider the task \enquote{always go in the same direction} --- this task cannot be expressed as a reward function, because any policy that mixes the actions of two other optimal policies must itself be optimal. It also shows that Markovian reward functions cannot be used to encourage \emph{stochastic} policies. For example, there is no Markovian reward function under which \enquote{play rock, paper, and scissors with equal probability} is the unique optimal policy.
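
The rock-paper-scissors example can be made explicit using the proposition above (a short worked instance, with hypothetical action names $a_\mathrm{R}$, $a_\mathrm{P}$, and $a_\mathrm{S}$ for rock, paper, and scissors, available in some state $s$). Suppose that a policy $\pi$ with $\pi(a_\mathrm{R} \mid s) = \pi(a_\mathrm{P} \mid s) = \pi(a_\mathrm{S} \mid s) = 1/3$ is optimal for some reward. Then
$$
\{a_\mathrm{R}, a_\mathrm{P}, a_\mathrm{S}\} = \mathrm{supp}(\pi(s)) \subseteq o(s),
$$
so the policy $\pi'$ that agrees with $\pi$ everywhere except that $\pi'(a_\mathrm{R} \mid s) = 1$ satisfies $\mathrm{supp}(\pi'(s)) = \{a_\mathrm{R}\} \subseteq o(s)$, and is therefore also optimal. Hence $\pi$ cannot be the unique optimal policy.
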

\section{More MORL Objectives}\label{appendix:morl_objectives}

In this Appendix, we give even more examples of MORL objectives, and some comments on how to construct them -- the purpose of this is mainly just to show how rich this space is. First, similar to the MaxMin objective, we might want to judge a policy according to its \emph{best} performance:

\begin{definition}\label{def:maxmax}
Given $J_1 \dots J_k$, the \textbf{MaxMax} objective $\prec_\texttt{Max}$ is given by $\pi_1 \prec_\texttt{Max} \pi_2 \iff \max_i J_i(\pi_1) < \max_i J_i(\pi_2)$.
\end{definition}

%In other words, the MaxMax objective orders policies by their \emph{best} performance according to any of $R_1 \dots R_m$. %This is somewhat analogous to the $\ell_\infty$-norm.

We would next like to point out that it is possible to create smooth versions of almost any MORL objective. In Section~\ref{section:solving_inexpressible}, we outline an approach for learning any continuous, differentiable MORL objective, so this is quite useful. We begin with a soft version of the MaxMax objective:

\begin{definition}\label{def:maxsoft}
Given $J_1 \dots J_k$ and $\alpha > 0$, the \textbf{Soft MaxMax} objective $\prec_\texttt{MaxSoft}$ is given by 
$$
J_\texttt{MaxSoft}(\pi) = \left(\sum_{i=1}^k J_i(\pi) e^{\alpha J_i(\pi)}\middle)\middle/\middle(\sum_{i=1}^k e^{\alpha J_i(\pi)}\right).
$$
\end{definition}

This is of course not the only way to continuously approximate MaxMax; it is just one example. Here $\alpha$ controls how \enquote{sharp} the approximation is -- the larger $\alpha$ is, the closer $J_\texttt{MaxSoft}$ gets to the sharp max function, and the smaller $\alpha$ is, the closer it gets to the arithmetic mean function (so by varying $\alpha$, we can continuously interpolate between them). Similarly, we can also create a smooth version of MaxMin:

\begin{definition}\label{def:minsoft}
Given $J_1 \dots J_k$ and $\alpha > 0$, the \textbf{Soft MaxMin} objective $\prec_\texttt{MinSoft}$ is given by 
$$
J_\texttt{MinSoft}(\pi) = \left(\sum_{i=1}^k J_i(\pi) e^{-\alpha J_i(\pi)}\middle)\middle/\middle(\sum_{i=1}^k e^{-\alpha J_i(\pi)}\right).
$$
\end{definition}

As before, the larger $\alpha$ is, the closer $J_\texttt{MinSoft}$ gets to the sharp min function, and the smaller $\alpha$ is, the closer it gets to the arithmetic mean function.
We can also smooth MaxSat:

\begin{definition}\label{def:satsoft}
Given $J_1 \dots J_k$, $c_1 \dots c_k$, and $\alpha > 0$, the \textbf{Soft MaxSat} objective $\prec_\texttt{SatSoft}$ is 
$$
J_\texttt{SatSoft}(\pi) = \sum_{i=1}^k \left(\frac{1}{1 + e^{-\alpha(J_i(\pi) - c_i)}}\right).
$$
\end{definition}

The larger $\alpha$ is, the closer $J_\texttt{SatSoft}$ gets to the sharp MaxSat function (and the smaller $\alpha$ gets, the closer each term gets to $0.5$, so that $J_\texttt{SatSoft}$ approaches a flat $k/2$). And, again, this is of course not the only way to create a smooth version of MaxSat.
It is unclear if it is possible to create a smooth version of ConSat without having any prior knowledge of (a lower bound of) the value of $\min_\pi J_2(\pi)$, but with this value it should be reasonably straightforward (see the construction in Corollary~\ref{corollary:no_consat}). As for LexMax, we can of course create a smooth approximation of it by taking a linear approximation of the weights, but here we would need some prior knowledge of $\max_\pi J_1(\pi) \dots \max_\pi J_k(\pi)$.
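
As a small numerical illustration of the soft objectives defined above (the values are chosen purely for illustration), let $k = 2$, and consider a policy $\pi$ with $J_1(\pi) = 1$ and $J_2(\pi) = 2$. With $\alpha = 1$, we get
$$
J_\texttt{MaxSoft}(\pi) = \frac{e + 2e^{2}}{e + e^{2}} \approx 1.731, \qquad J_\texttt{MinSoft}(\pi) = \frac{e^{-1} + 2e^{-2}}{e^{-1} + e^{-2}} \approx 1.269,
$$
which lie between the arithmetic mean $1.5$ and the sharp values $\max_i J_i(\pi) = 2$ and $\min_i J_i(\pi) = 1$, respectively. Similarly, with $c_1 = c_2 = 0.5$ and $\alpha = 1$,
$$
J_\texttt{SatSoft}(\pi) = \frac{1}{1 + e^{-0.5}} + \frac{1}{1 + e^{-1.5}} \approx 1.440,
$$
which lies between the value $k/2 = 1$ that is approached as $\alpha \to 0$ and the sharp MaxSat value $2$ that is approached as $\alpha \to \infty$.
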


\section{A Method for Solving Modal Tasks}\label{appendix:solve_modal}

In this Appendix, we give an outline of one possible method for solving modal tasks.
We mainly want to show that it is \emph{feasible} to learn modal tasks, and so we only provide a solution sketch; the task of \emph{implementing} and \emph{evaluating} this method is something we leave as a topic for future work.

We will first define a restricted class of modal tasks, which is both very expressive, and also more amenable to learning than the more general version given in Definition~\ref{definition:modal_reward}:

\begin{definition}\label{def:affordance_mdp}
An \emph{affordance} consists of a reward function and a discount factor, $\langle R,\gamma \rangle$, and an \emph{affordance-based reward} is a function $R^\Diamond : \SxAxS \times \mathbb{R}^{2k} \to \mathbb{R}$, that is continuous in the last $2k$ arguments. 
An \emph{affordance-based MDP} is a tuple $\langle \mcS, \mcA, \tau, \mu_0, R^\Diamond, \gamma, \langle R,\gamma \rangle^k \rangle$, where the reward given for transitioning from $s$ to $s'$ via $a$ is $R^\Diamond(s,a,s',V_1^\star(s) \dots V_k^\star(s), V_1^\star(s') \dots V_k^\star(s'))$, where $V_i^\star$ is the optimal value function of the $i$'th affordance.
\end{definition}

This definition requires some explanation. 
In psychology (and other fields, such as user interface design), an affordance is, roughly, a perceived possible action, or a perceived way to use an object.
For example, if you see a button, then the fact that you can \emph{press} that button, and expect something to happen, is part of \emph{how you perceive} it, in a way that might not be the case if you could somehow show the button to a premodern human.
It can also be used to refer to a choice or action that is perceived as available in some context (without being tied to an object). Here, we are using it to refer to a \emph{task} that could be performed in an MDP. The intuition is that $R^\Diamond$ is allowed to depend on what \emph{could be done} from $s$ and $s'$, in addition to the state features of $s$ and $s'$.

Before outlining an algorithm, let us first give a few examples of how to formalise modal tasks within this framework. First consider the instruction \enquote{you should always be able to return to the start state}. We can formalise this using a reward function $R_1$ that gives $1$ reward if the start state is entered, and $0$ otherwise, and pair it up with a discount parameter $\gamma$ that is very close to $1$. We could then set $R^\Diamond$ to, for example, $R^\Diamond(s,a,s',V_1^\star(s), V_1^\star(s')) = R(s,a,s') \cdot \tanh(V_1^\star(s'))$, where $R$ describes some base task. In this way, no reward is given if the start state cannot be reached from $s'$. Next, consider the instruction \enquote{never enter a state from which it is possible to quickly enter an unsafe state}.
To formalise this, let $R_1$ give $1$ reward if an unsafe state is entered, and $0$ otherwise, and let $\gamma$ correspond to a very high discount rate (e.g.\ $0.7$). 
We could then set $R^\Diamond$ to, for example, $R^\Diamond(s,a,s',V_1^\star(s), V_1^\star(s')) = R(s,a,s') - V_1^\star(s')$, where $R$ again describes some base task.

These examples show that our \enquote{affordance-based} MDPs are quite flexible, and that they should be able to formalise many natural modal tasks in a satisfactory way, including most of our motivating examples.\footnote{This arguably excludes \enquote{you should never enter a state where you would be unable to receive a feedback signal}. However, this instruction only makes sense in a multi-agent setting.} However, the definition could of course be made more general. For example, we could allow the affordances to themselves be based on affordance-based reward functions, etc. It is not clear, though, whether this would bring much benefit in practice. 
%how much benefit this would bring in practice.

Let us now outline an approach for solving affordance-based MDPs using reinforcement learning, specifically using an action-value method. 
First, let the agent maintain $k+1$ $Q$-functions, $Q^\Diamond, Q_1, \dots, Q_k$, one for $R^\Diamond$ and one for each affordance $\langle R_i, \gamma_i \rangle$. Next, we suppose that the agent updates each of $Q_1, \dots, Q_k$ using an off-policy update rule, such as $Q$-learning; this will ensure that $Q_1, \dots, Q_k$ converge to their true values (i.e.\ to $Q_1^\star \dots Q_k^\star$), as long as the agent explores infinitely often. Note that the use of an off-policy update rule is crucial. 
Next, let the agent update $Q^\Diamond$ as if it were an ordinary Markovian reward function, using the reward $\hat{R}(s,a,s') = R^\Diamond(s,a,s',V_1(s) \dots V_k(s), V_1(s') \dots V_k(s'))$, where $V_i(s)$ is given by $\max_a Q_i(s,a)$.
In other words, we let it update $Q^\Diamond$ using an \emph{estimate} of the true value of $R^\Diamond$, expressed in terms of its current estimates of $V_1^\star \dots V_k^\star$. 
The fact that $Q_1, \dots, Q_k$ converge to $Q_1^\star, \dots, Q_k^\star$, and the fact that $R^\Diamond$ is continuous in its value function arguments, will ensure that the estimate $\hat{R}$ also converges to the true value of $R^\Diamond$.
The update rule used for $Q^\Diamond$ could be either on-policy or off-policy.
We then suppose that the agent selects its actions by applying a bandit algorithm to $Q^\Diamond$, and that this bandit algorithm is greedy in the limit, but also explores infinitely often, as usual.
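
The procedure outlined above can be summarised by the following update rules, applied after each observed transition $(s,a,s')$. This is only a sketch of one possible instantiation: the learning rate $\eta$ is an assumed hyperparameter, and a $Q$-learning update is picked for $Q^\Diamond$, although an on-policy rule would also be permitted. For each affordance $i \in \{1 \dots k\}$,
$$
Q_i(s,a) \leftarrow Q_i(s,a) + \eta \left( R_i(s,a,s') + \gamma_i \max_{a'} Q_i(s',a') - Q_i(s,a) \right).
$$
Next, with $V_i(\cdot) = \max_{a'} Q_i(\cdot,a')$, let
$$
\hat{R} = R^\Diamond(s,a,s',V_1(s) \dots V_k(s), V_1(s') \dots V_k(s')),
$$
and update
$$
Q^\Diamond(s,a) \leftarrow Q^\Diamond(s,a) + \eta \left( \hat{R} + \gamma \max_{a'} Q^\Diamond(s',a') - Q^\Diamond(s,a) \right).
$$
Note that the update targets for $Q_1, \dots, Q_k$ do not involve $Q^\Diamond$, so these estimates are affected by $Q^\Diamond$ only through which transitions the agent visits.
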

This algorithm should be able to learn to optimise the reward in any affordance-based MDP. 
In the tabular case, it should be possible (and reasonably straightforward) to prove that it always converges to an optimal policy (assuming that appropriate learning rates are used, etc.), using Lemma 1 in \cite{Singh2000-supp}. We would also expect it to perform well in practice, when used with function approximators (such as neural networks). However, we leave the task of implementing and properly evaluating this approach as a topic for future work.

There are also several ways that this algorithm could be tweaked or improved. For example, the algorithm we have described is an action-value algorithm, but the same approach could of course be used to make an actor-critic algorithm instead.
We also suspect that there could be interesting modifications one could make to the exploration strategy of the algorithm.
If a standard bandit algorithm (such as $\epsilon$-greedy) is used, then the agent will mostly take actions that are optimal under its current estimate of $Q^\Diamond$. In the ordinary case, this is good, because it leads the agent to spend more time in the parts of the MDP that are relevant for maximising the reward. However, in this case, there is a worry that it could lead the agent to neglect the parts of the (affordance-based) MDP that are relevant for learning more about $V_1^\star \dots V_k^\star$, which might slow down the learning.
Again, we leave such developments for future work, since our aim here is only to show that it is feasible to learn non-trivial modal tasks.

We also want to point out that the work by \citet{pctl_rl-supp} could provide another starting point for learning modal tasks using RL. In their work, they present some RL-based methods for determining whether a specification in Probabilistic Computation Tree Logic (PCTL) holds in an MDP. PCTL can be used to specify many kinds of properties of states in MDPs which depend on the transition function, including e.g.\ what states can and cannot be reached from a particular state, and with what probability, etc. We can therefore specify non-trivial modal tasks by providing a number of PCTL formulas, and allowing the reward function to depend on the truth values of these formulas. That is, we could consider a setup that is analogous to that which we give in Definition~\ref{def:affordance_mdp}, but where the \enquote{affordances} are replaced by PCTL formulas. It should then be possible to learn tasks specified in this manner by using the techniques of \citet{pctl_rl-supp} to learn the values of the PCTL formulas, and then using ordinary RL to train on the resulting reward function.
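
For instance (as a purely illustrative formula in standard PCTL notation, with $\mathit{start}$ a hypothetical atomic proposition that holds exactly in the start state), the state formula
$$
\mathbb{P}_{\geq 0.9}\left[\Diamond^{\leq 10}\, \mathit{start}\right]
$$
holds in a state $s$ if the start state is reached from $s$ within $10$ steps with probability at least $0.9$. In an MDP, this probability is taken with respect to a policy or a class of policies, depending on the semantics adopted. A reward function that depends on the truth value of such a formula at $s'$ would give a probabilistic, bounded-horizon analogue of the instruction \enquote{you should always be able to return to the start state}.
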

\bibliography{references-supp}

\end{document}
