% \documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like

%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
% Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
 % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{he_35}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams


\usetikzlibrary{arrows,shapes,fit,shadows,decorations.pathmorphing}
\tikzstyle{dgraph}=[->, line width=1pt]
\tikzstyle{input} = [text centered, minimum height=2em]
\tikzstyle{ellip}=[ellipse,draw=black,thick,minimum size=4mm, text width=.4cm]
\tikzstyle{system} = [draw, dotted, minimum height=2em]


% for cross referencing the main text
% PLEASE ONLY USE xr IN THE SUPPLEMENTARY MATERIAL. 
% In the main paper, hard code any cross-reference to the supplementary material. 
\usepackage{xr} 
\externaldocument{he_35}
\usepackage{bibentry}

\usepackage{amsthm} % for theorem
\usepackage{amssymb} % for mathbb
\usepackage{subfigure} % for figures
\hypersetup{hidelinks}
% \usepackage[capitalize,noabbrev]{cleveref}
\counterwithin{figure}{section} % set difference labels for the appendices
\numberwithin{equation}{section}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% For commenting in the PDF
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\todotag}{\textcolor{violet}{[{\bf TODO}] }}
\newcommand{\mytodo}[1]{\textcolor{violet}{[{\bf TODO}: #1]}}
% \newcommand{\mytodo}[1]{\textcolor{violet}{}}
\newcommand{\jiamin}[1]{\textcolor{red}{[{\bf Jiamin}: #1]}}
\newcommand{\rupam}[1]{\textcolor{blue}{[{\bf Rupam}: #1]}}
\newcommand{\yi}[1]{\textcolor{orange}{[{\bf Yi}: #1]}}
\newcommand{\fengdi}[1]{\textcolor{green}{[{\bf Fengdi}: #1]}}
\newcommand{\uncertain}[1]{\textcolor{violet}{#1}}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% MATH
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\vect}[1]{\boldsymbol{\mathbf{#1}}}
\newcommand{\mtrx}[1]{\boldsymbol{\mathbf{#1}}}
\newcommand{\cS}{\mathcal{S}}
\newcommand{\cA}{\mathcal{A}}
\newcommand{\bR}{\mathbb{R}}
\newcommand{\bE}{\mathbb{E}}
\newcommand{\bP}{\mathbb{P}}
\newcommand{\bN}{\mathbb{N}}
\newcommand{\norm}[1]{\| #1 \|}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% THEOREMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}


\title{Loosely Consistent Emphatic Temporal-Difference Learning\\(Supplementary Material)}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\setcounter{Maxaffil}{2}
\author[1]{Jiamin~He}
\author[1]{Fengdi~Che}
\author[1]{Yi~Wan}
\author[1,2]{A.~Rupam~Mahmood}
% Add affiliations after the authors
\affil[1]{%
    Department of Computing Science,
    University of Alberta
}
\affil[2]{
    CIFAR AI Chair, Alberta Machine Intelligence Institute (Amii)
}
  
\begin{document}
  
\onecolumn %% Turn this off if single column is desired for the supplement
\maketitle

% Fig. \ref{fig:pitt} and Eq \ref{eq:example} in the main paper can be cross referenced using \texttt{xr}. 

\appendix

\section{Proof of the Consistency of AETD and LC-ETD}
\label{sec_app:proofs}

Since AETD($0$) is a special case of LC-ETD($\lambda$, $\beta$, $\nu$) with $\lambda=0$, $\beta=0$, and $\nu=1$, the proof for Theorem \ref{thrm:stability_aetd} is also a special case of the proof for Theorem \ref{thrm:stability_cetd}, which will be presented below.

We first revisit the update of LC-ETD($\lambda$, $\beta$, $\nu$):
\begin{align}
  \label{math:general_update_lambda_cetd}
  \begin{split}
    \vect \theta_{t+1} &= \vect \theta_t + \alpha \delta_t \vect z_t, \\
    \delta_t &= R_{t+1} + \gamma \vect \phi_{t+1}^\top \vect \theta_t - \vect \phi_{t}^\top \vect \theta_t, \\
    \vect z_t &= \rho_t (\gamma \lambda \vect z_{t-1} + M_t \vect \phi_t), \text{with } \vect z_{-1} = \vect 0, \\
    M_t &= \left(1 - \lambda h(t)\right) F_t + \lambda g(t), \\
    F_t &= \left(1 - g(t)\right) \rho_{t-1} F_{t-1} + g(t), \text{with } F_0 = 1,
  \end{split}
\end{align}
where $h(t)$ and $g(t)$ are defined as follows:
\begin{align*}
    \begin{split}        
        h(t) \doteq \left(\frac{1-\beta}{t+1}\right)^\nu \  \text{{and}} \  g(t) \doteq \frac{1-\beta}{(t+1)^{\nu}}
    \end{split}
\end{align*}
with $\beta\in[0,1)$ and $\nu\in(0,1]$, or $\beta=1$ and $\nu\in[0,1]$. Then we present the relationship between $F_t$, $M_t$, and the density ratio with Lemma \ref{thrm:consistency_consistent_trace} and Lemma \ref{thrm:consistency_consistent_emphasis}.

\begin{lemma}
    \label{thrm:consistency_consistent_trace}
    Under Assumptions \ref{assumption:ergodicity} and \ref{assumption:coverage}, for any $\beta\in[0,1)$ and $\nu\in(0,1]$, or $\beta=1$ and $\nu\in[0,1]$, if $\lim_{t\to\infty} \bE_\mu [F_t|S_t=s]$ exists for all $s\in\cS$, where $F_t$ is defined in Update (\ref{math:general_update_lambda_cetd}), then
    $$
        \lim_{t\to\infty} \bE_\mu [F_t|S_t=s] = \frac{d_\pi(s)}{d_\mu(s)}
    $$
    holds for any $s\in\cS$.
\end{lemma}
\proof{
Let $\vect f = [f(s_1), \cdots, f(s_{|\cS|})]^\top \in \bR^{|\cS|}$, where $f(s) \in \bR$ is defined as follows:
\begin{align}
    \label{math:f_definition}
    f(s) \doteq d_\mu(s) \lim_{t \to \infty} \bE_\mu [F_t | S_t = s], \text{ for any } s\in\cS,
\end{align}
which exists under our assumptions. Then we have
\begin{align}
    f(s) &= d_\mu(s) \lim_{t \to \infty} \bE_\mu [ F_t | S_t = s] \nonumber \\
        &= d_\mu(s) \lim_{t \to \infty} \bE_\mu \left[ (1-g(t)) \rho_{t-1} F_{t-1} + g(t) | S_t = s \right] \nonumber \\
        &= d_\mu(s) \left( \lim_{t \to \infty} (1-g(t)) \bE_\mu [ \rho_{t-1} F_{t-1} | S_t = s] + \lim_{t \to \infty} g(t) \right) \label{eq:avg_split_1} \\
        &= d_\mu(s) \lim_{t \to \infty} (1-g(t)) \lim_{t \to \infty} \bE_\mu [ \rho_{t-1} F_{t-1} | S_t = s] \label{eq:avg_split_2} \\
        &= d_\mu(s) \lim_{t \to \infty} \bE_\mu [ \rho_{t-1} F_{t-1} | S_t = s] \nonumber \\
        &= d_\mu(s) \lim_{t\to\infty} \sum_{\bar{s}, \bar{a}} \bP_\mu (S_{t-1}=\bar{s}, A_{t-1}=\bar{a} | S_t=s) \frac{\pi (\bar{a} | \bar{s})}{\mu (\bar{a} | \bar{s})} \bE_{\mu} [F_{t-1} | S_{t-1}=\bar{s} ] \nonumber \\
        &= d_\mu(s) \lim_{t\to\infty} \sum_{\bar{s}, \bar{a}} \frac{\bP_\mu (S_{t-1}=\bar{s}, A_{t-1}=\bar{a}, S_t=s)}{\bP_\mu (S_t=s)} \frac{\pi (\bar{a} | \bar{s})}{\mu(\bar{a} | \bar{s})} \bE_{\mu} [F_{t-1} | S_{t-1}=\bar{s}] \nonumber \\
        &= d_\mu(s) \sum_{\bar{s}, \bar{a}} \frac{d_\mu (\bar{s}) \mu(\bar{a} | \bar{s}) p (s | \bar{s}, \bar{a})}{d_\mu(s)} \frac{\pi (\bar{a} | \bar{s})}{\mu(\bar{a} | \bar{s})} \lim_{t\to\infty} \bE_{\mu} [F_{t-1} | S_{t-1}=\bar{s}] \nonumber \\
        &= \sum_{\bar{s}, \bar{a}} \pi (\bar{a} | \bar{s}) p (s | \bar{s}, \bar{a}) d_\mu (\bar{s}) \lim_{t\to\infty} \bE_{\mu} [F_{t-1} | S_{t-1}=\bar{s}] \nonumber \\
        &= \sum_{\bar{s}} [\mtrx P_{\pi}]_{\bar{s} s} f(\bar{s}), \nonumber
\end{align}
where in Eqs. (\ref{eq:avg_split_1}) and (\ref{eq:avg_split_2}), we use the assumption that $\lim_{t\to\infty} \bE_\mu [F_t|S_t=s]$ exists for any $s\in\cS$ and the facts that $\lim_{t\to\infty} (1-g(t))=1$ and $\lim_{t\to\infty} g(t)=0$ for any $\beta\in[0,1)$ and $\nu\in(0,1]$, or $\beta=1$ and $\nu\in[0,1]$. From the last equation, we have $\vect f^\top = \vect f^\top \mtrx P_\pi$ in vector form. Since the expectations of importance-sampling ratios are one and $F_0=1$, by induction, the expectation of $F_t$ will remain one for any $t \in \mathbb{N}$. Then we have:
\begin{align*}
    \vect 1^\top \vect f = \sum_s f(s) &= \sum_{s\in\cS} d_\mu(s) \lim_{t \to \infty} \bE_\mu [ F_t | S_t = s] \\
        &= \lim_{t \to \infty} \sum_{s\in\cS} \bP_\mu (S_t=s) \bE_\mu [ F_t | S_t = s] \\
        &= \lim_{t \to \infty} \bE_\mu [ F_t ] = 1.
\end{align*}
By Assumption \ref{assumption:ergodicity}, the target policy's stationary distribution exists and is unique. From $\vect f^\top = \vect f^\top \mtrx P_\pi$ and $ \vect 1^\top \vect f=1$, we can infer that $\vect f = \vect d_\pi$, that is,
\begin{align}
\label{math:aetd_proof_intermediate}
    d_\mu(s) \lim_{t \to \infty} \bE_\mu [ F_t | S_t = s] = d_\pi(s).
\end{align}
Since it holds that $d_\mu(s)>0$ for any $s\in\cS$ by Assumption \ref{assumption:ergodicity}, we can divide both sides of Eq. (\ref{math:aetd_proof_intermediate}) by $d_\mu(s)$ and conclude the proof.
\qed
}
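
The induction step invoked in the proof above can be spelled out as follows. This is a sketch that assumes, as Assumption \ref{assumption:coverage} is taken to guarantee, that $\mu(a|s)>0$ whenever $\pi(a|s)>0$, so that $\bE_\mu[\rho_{t-1} | S_{t-1}=\bar s] = \sum_{\bar a} \mu(\bar a|\bar s) \frac{\pi(\bar a|\bar s)}{\mu(\bar a|\bar s)} = 1$ for every $\bar s\in\cS$. Because $F_{t-1}$ depends only on the states and actions before time $t-1$, it is conditionally independent of $\rho_{t-1}$ given $S_{t-1}$. Hence, if $\bE_\mu[F_{t-1}]=1$, then
\begin{align*}
    \bE_\mu [F_t] &= (1-g(t)) \bE_\mu [\rho_{t-1} F_{t-1}] + g(t) \\
        &= (1-g(t)) \bE_\mu \big[ \bE_\mu [\rho_{t-1} | S_{t-1}] \, \bE_\mu [F_{t-1} | S_{t-1}] \big] + g(t) \\
        &= (1-g(t)) \bE_\mu [F_{t-1}] + g(t) \\
        &= (1-g(t)) + g(t) = 1.
\end{align*}
Together with $F_0=1$, this gives $\bE_\mu[F_t]=1$ for all $t$, regardless of the particular values of $g(t)$.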

\begin{lemma}
    \label{thrm:consistency_consistent_emphasis}
    Under the assumptions of Lemma \ref{thrm:consistency_consistent_trace}, for any $\beta\in[0,1)$ and $\nu\in(0,1]$, or $\beta=1$ and $\nu\in[0,1]$, it holds for any $s\in\cS$ that
    $$
        \lim_{t\to\infty} \bE_\mu [M_t|S_t=s] = \frac{d_\pi(s)}{d_\mu(s)},
    $$
    where $M_t$ is defined in Update (\ref{math:general_update_lambda_cetd}).
\end{lemma}
\proof{
We can expand $M_t$ and use the result from Lemma \ref{thrm:consistency_consistent_trace}:
\begin{align}
    \lim_{t \to \infty} \bE_\mu [M_t | S_t = s]
        &= \lim_{t \to \infty} \bE_\mu [ (1 - \lambda h(t)) F_t + \lambda g(t) | S_t = s] \nonumber \\
        &= \lim_{t \to \infty} (1 - \lambda h(t)) \bE_\mu [ F_t | S_t = s] + \lim_{t \to \infty} \lambda g(t) \nonumber \\
        &= \lim_{t \to \infty} (1 - \lambda h(t)) \lim_{t \to \infty} \bE_\mu [ F_t | S_t = s]  \label{eq:split_1} \\
        &= \lim_{t \to \infty} \bE_\mu [ F_t | S_t = s] \label{eq:split_2} \\
        &= \frac{d_\pi(s)}{d_\mu(s)}, \tag*{(Lemma \ref{thrm:consistency_consistent_trace})} \nonumber
\end{align}
where, in Eqs. (\ref{eq:split_1}) and (\ref{eq:split_2}), we make use of $\lim_{t \to \infty} g(t)=\lim_{t \to \infty} (1-\beta)(t+1)^{-\nu}=0$ and $\lim_{t \to \infty} h(t)=\lim_{t \to \infty} (1-\beta)^\nu(t+1)^{-\nu}=0$ for any $\beta\in[0,1)$ and $\nu\in(0,1]$, or $\beta=1$ and $\nu\in[0,1]$.
\qed
}

From Lemma \ref{thrm:consistency_consistent_trace} and Lemma \ref{thrm:consistency_consistent_emphasis}, we can see that the expectations of both $F_t$ and $M_t$ converge to the density ratio $\frac{d_\pi(s)}{d_\mu(s)}$. By utilizing these results, we can prove the consistency of LC-ETD($\lambda$, $\beta$, $\nu$) (including AETD($\lambda$)), which is presented in Theorem \ref{thrm:statbility_cetd_restate}.

\begin{theorem}[Restatement of Theorem \ref{thrm:stability_cetd}]
    \label{thrm:statbility_cetd_restate}
    Let Assumptions \ref{assumption:ergodicity}--\ref{assumption:features} hold. For any $\beta\in[0,1)$ and $\nu\in(0,1]$, or $\beta=1$ and $\nu\in[0,1]$, if $\lim_{t\to\infty} \bE_\mu [F_t|S_t=s]$ and $\lim_{t\to\infty} \bE_\mu [\vect z_t|S_t=s]$ exist for all $s\in\cS$, then LC-ETD($\lambda$, $\beta$, $\nu$) has the same expected update as On-policy TD($\lambda$). As a result, LC-ETD($\lambda$, $\beta$, $\nu$) is stable and consistent.
\end{theorem}

\begin{remark}
\label{rmrk:aetd_multi_step_consistency}
AETD($\lambda$) is stable and consistent, as it is a special case of LC-ETD($\lambda$, $\beta$, $\nu$) with $\beta=0$ and $\nu=1$.
\end{remark}

\begin{proof}[Proof of Theorem \ref{thrm:statbility_cetd_restate}]
The proof is similar in structure to the proof of Theorem $1$ in the work of \citet{sutton2016emphatic}. We start from the update of LC-ETD($\lambda$, $\beta$, $\nu$). Specifically, we can rewrite Update (\ref{math:general_update_lambda_cetd}) as follows:
\begin{align}
  \label{math:general_update_lambda_rewritten}
  \begin{split}
  \vect \theta_{t+1} &= \vect \theta_t + \alpha \delta_t \vect z_t \\
    &= \vect \theta_t + \alpha \left( R_{t+1} + \gamma \vect \phi_{t+1}^\top \vect \theta_t - \vect \phi_{t}^\top \vect \theta_t \right) \vect z_t \\
    &= \vect \theta_t + \alpha \left( \underbrace{\Big[
            \vect z_t R_{t+1}
        \Big]}_{\vect b_t} - \underbrace{\Big[ 
            \vect z_t(\vect \phi_t - \gamma \vect \phi_{t+1})^\top
        \Big]}_{\mtrx A_t} \vect \theta_t \right).
  \end{split}
\end{align}
Defining $\mtrx A \doteq \lim_{t \to \infty} \bE_\mu [\mtrx A_t]$ and $\vect b \doteq \lim_{t \to \infty} \bE_\mu [\vect b_t]$, we analyze LC-ETD($\lambda$, $\beta$, $\nu$)'s expected update:
\begin{align}
    \vect{\bar \theta}_{t+1} = \vect{\bar \theta}_{t} + \alpha (\vect b - \mtrx A \vect{\bar \theta}_t).
\end{align}

We first analyze the $\mtrx A$ matrix. Similar to the derivation of ETD($\lambda$)'s $\mtrx A$ matrix \citep{sutton2016emphatic}, we have
\begin{align}
    \mtrx A = \lim_{t \to \infty} \bE_\mu [\mtrx A_t]
        &= \lim_{t \to \infty} \bE_\mu \left[\vect z_t(\vect \phi_t-\gamma\vect \phi_{t+1})^\top \right] \nonumber \\
        &= \sum_{s} d_\mu(s) \lim_{t \to \infty} \bE_\mu \left[\vect z_t(\vect \phi_t-\gamma\vect \phi_{t+1})^\top | S_t = s \right] \nonumber \\
        &= \sum_{s} d_\mu(s) \lim_{t \to \infty} \bE_\mu \left[\rho_t (\gamma\lambda\vect z_{t-1} + M_t\vect \phi_t) (\vect \phi_t-\gamma\vect \phi_{t+1})^\top | S_t = s \right] \nonumber \\
        &= \sum_{s} d_\mu(s) \lim_{t \to \infty} \bE_\mu \left[(\gamma\lambda\vect z_{t-1} + M_t\vect \phi_t) | S_t = s \right] \bE_\mu \left[\rho_t (\vect \phi_t-\gamma\vect \phi_{t+1})^\top | S_t = s \right] \nonumber \\
        (\text{because } &\gamma\lambda\vect z_{t-1} + M_t\vect \phi_t \text{ is independent of } \rho_t (\vect \phi_t-\gamma\vect \phi_{t+1})^\top \text{ given $S_t$}) \nonumber \\
        &= \sum_{s} \underbrace{d_\mu(s) \lim_{t \to \infty} \bE_\mu \left[\gamma\lambda\vect z_{t-1} + M_t\vect \phi_t | S_t = s \right]}_{\vect z(s)\in\bR^d} \bE_\mu \left[\rho_k (\vect \phi_k-\gamma\vect \phi_{k+1})^\top | S_k = s \right] \nonumber \\
        &= \sum_{s} \vect z(s) \bE_\mu \left[\rho_k (\vect \phi_k-\gamma\vect \phi_{k+1})^\top | S_k = s \right] \nonumber \\
        &= \sum_{s} \vect z(s) \bE_\pi \left[\vect \phi_k-\gamma\vect \phi_{k+1} | S_k = s \right]^\top \nonumber \\
        &= \sum_{s} \vect z(s) \left( \vect \phi(s) - \gamma \sum_{s'} [\mtrx P_\pi]_{ss'}\vect \phi(s') \right)^\top \nonumber \\
        &= \mtrx Z^\top (\mtrx I - \gamma \mtrx P_\pi) \mtrx \Phi, \nonumber
\end{align}
where $\mtrx Z \doteq [\vect z(s_1), \cdots, \vect z(s_{|\cS|})]^\top \in \bR^{|\cS| \times d}$, and $\vect z(s) \in \bR^d$ is defined by
\begin{align*}
    \vect z(s) &\doteq d_\mu(s) \lim_{t \to \infty} \bE_\mu \left[\gamma\lambda\vect z_{t-1} + M_t\vect \phi_t | S_t = s \right] \\
        &= \underbrace{d_\mu(s) \lim_{t \to \infty} \bE_\mu \left[ M_t |S_t=s\right]}_{m(s)} \vect \phi(s) + \gamma \lambda d_\mu(s) \lim_{t \to \infty} \bE_\mu \left[ \vect z_{t-1} | S_t=s \right] \\
        &= m(s) \vect \phi(s) + \gamma \lambda d_\mu(s) \sum_{\bar s, \bar a} \lim_{t \to \infty} \bP_\mu(S_{t-1}=\bar s, A_{t-1}=\bar a|S_t=s) \bE_\mu \left[ \vect z_{t-1} | S_{t-1}=\bar s, A_{t-1}=\bar a \right] \\
        &= m(s) \vect \phi(s) + \gamma \lambda d_\mu(s) \sum_{\bar s, \bar a} \frac{d_\mu(\bar s)\mu(\bar a|\bar s)p(s|\bar s,\bar a)}{d_\mu(s)} \lim_{t \to \infty} \bE_\mu \left[ \vect z_{t-1} | S_{t-1}=\bar s, A_{t-1}=\bar a \right] \\
        &= m(s) \vect \phi(s) + \gamma \lambda \sum_{\bar s, \bar a} d_\mu(\bar s)\mu(\bar a|\bar s)p(s|\bar s,\bar a) \frac{\pi(\bar a|\bar s)}{\mu(\bar a|\bar s)} \lim_{t \to \infty} \bE_\mu \left[ \gamma \lambda \vect z_{t-2} + M_{t-1} \vect \phi_{t-1} | S_{t-1}=\bar s \right] \\
        &= m(s) \vect \phi(s) + \gamma \lambda \sum_{\bar s} \left( \sum_{\bar a} \pi(\bar a| \bar s)p(s|\bar s, \bar a) \right) \vect z(\bar s) \\
        &= m(s) \vect \phi(s) + \gamma \lambda \sum_{\bar s} [\mtrx P_\pi]_{\bar s s}\vect z(\bar s).
\end{align*}
In matrix form, we have
\begin{align*}
    \mtrx Z^\top &= \mtrx \Phi^\top \mtrx D_{\vect m} + \mtrx Z^\top (\gamma \lambda \mtrx P_\pi) \\
        % &= \mtrx \Phi^\top \mtrx D_{\vect m} + \left(\mtrx \Phi^\top \mtrx D_{\vect m} + \mtrx Z (\gamma \lambda \mtrx P_\pi)\right) (\gamma \lambda \mtrx P_\pi) \\
        &= \mtrx \Phi^\top \mtrx D_{\vect m} + \mtrx \Phi^\top \mtrx D_{\vect m} (\gamma \lambda \mtrx P_\pi) + \mtrx Z^\top (\gamma \lambda \mtrx P_\pi)^2 \\
        &= \mtrx \Phi^\top \mtrx D_{\vect m} + \mtrx \Phi^\top \mtrx D_{\vect m} (\gamma \lambda \mtrx P_\pi) + \mtrx \Phi^\top \mtrx D_{\vect m} (\gamma \lambda \mtrx P_\pi)^2 + \cdots \\
        &= \mtrx \Phi^\top \mtrx D_{\vect m} (\mtrx I - \gamma \lambda \mtrx P_\pi)^{-1},
\end{align*}
where $\mtrx D_{\vect m} \doteq \operatorname{diag}(\vect m) \in \bR^{|\cS| \times |\cS|}$, $\vect m = [m(s_1), \cdots, m(s_{|\cS|})]^\top \in \bR^{|\cS|}$, and $m(s) \in \bR$ is defined as follows:
\begin{align*}
    m(s) \doteq d_\mu(s) \lim_{t \to \infty} \bE_\mu [M_t | S_t = s], \text{ for any } s\in\cS,
\end{align*}
which exists due to Lemma \ref{thrm:consistency_consistent_emphasis}. Further, from Lemma \ref{thrm:consistency_consistent_emphasis}, we have that
\begin{align}
    m(s) &= d_\mu(s) \lim_{t \to \infty} \bE_\mu [M_t | S_t = s] \nonumber \\
        &= d_\mu(s) \frac{d_\pi(s)}{d_\mu(s)} \nonumber \\
        &= d_\pi(s).
\end{align}
In vector form, we have $\vect m=\vect d_\pi$.
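
The geometric-series step in the matrix form of $\mtrx Z^\top$ above can be justified as follows, under the assumption (not restated in this supplement) that $\gamma\lambda<1$. Since $\mtrx P_\pi$ is a row-stochastic matrix, its spectral radius is one, so the spectral radius of $\gamma\lambda\mtrx P_\pi$ is $\gamma\lambda<1$. Unrolling the recursion $n$ times gives
\begin{align*}
    \mtrx Z^\top = \mtrx \Phi^\top \mtrx D_{\vect m} \sum_{k=0}^{n-1} (\gamma \lambda \mtrx P_\pi)^k + \mtrx Z^\top (\gamma \lambda \mtrx P_\pi)^n,
\end{align*}
where the remainder $\mtrx Z^\top (\gamma \lambda \mtrx P_\pi)^n$ vanishes as $n\to\infty$ and the Neumann series converges, that is, $\sum_{k=0}^{\infty} (\gamma \lambda \mtrx P_\pi)^k = (\mtrx I - \gamma \lambda \mtrx P_\pi)^{-1}$.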

Plugging $\vect m = \vect d_\pi$ and $\mtrx Z^\top=\mtrx \Phi^\top \mtrx D_{\vect m}(\mtrx I - \gamma \lambda \mtrx P_\pi)^{-1}$ back into the $\mtrx A$ matrix and writing $\mtrx D_\pi \doteq \operatorname{diag}(\vect d_\pi)$, we have
\begin{align*}
    \mtrx A = \mtrx \Phi^\top \mtrx D_{\pi} (\mtrx I - \lambda\gamma \mtrx P_\pi)^{-1} (\mtrx I - \gamma \mtrx P_\pi) \mtrx \Phi,
\end{align*}
which is exactly the $\mtrx A$ matrix of On-policy TD($\lambda$) and known to be stable \citep{tsitsiklis1996analysis}. Thus, LC-ETD($\lambda$, $\beta$, $\nu$) and its expected update are also stable by our definition.

Similarly, we can infer that
\begin{align*}
    \vect b = \lim_{t \to \infty} \bE_\mu [\vect b_t] = \mtrx \Phi^\top \mtrx D_\pi (\mtrx I - \lambda\gamma \mtrx P_\pi)^{-1} \vect r_\pi.
\end{align*}
Note that this $\vect b$ vector is also the same as that of On-policy TD($\lambda$). Thus, LC-ETD($\lambda$, $\beta$, $\nu$) has the same expected update as On-policy TD($\lambda$). As a result, LC-ETD($\lambda$, $\beta$, $\nu$) is consistent.
\end{proof}
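
The derivation of $\vect b$ omitted in the proof above can be sketched in the same way as that of $\mtrx A$. The sketch assumes that $\vect r_\pi \in \bR^{|\cS|}$ denotes the vector of expected one-step rewards under the target policy, $r_\pi(s) \doteq \bE_\pi [R_{k+1} | S_k = s]$; this is an assumption about the notation of the main text rather than a definition given in this supplement. With $\vect z(s)$ and $\mtrx Z$ as defined in the proof,
\begin{align*}
    \vect b &= \sum_{s} d_\mu(s) \lim_{t \to \infty} \bE_\mu \left[ \rho_t (\gamma\lambda\vect z_{t-1} + M_t\vect \phi_t) R_{t+1} | S_t = s \right] \\
        &= \sum_{s} \vect z(s) \bE_\mu \left[ \rho_k R_{k+1} | S_k = s \right] \\
        &= \sum_{s} \vect z(s) r_\pi(s) \\
        &= \mtrx Z^\top \vect r_\pi \\
        &= \mtrx \Phi^\top \mtrx D_\pi (\mtrx I - \lambda\gamma \mtrx P_\pi)^{-1} \vect r_\pi,
\end{align*}
where the second equality uses the same conditional independence given $S_t$ as in the derivation of $\mtrx A$, and the third uses $\bE_\mu [\rho_k R_{k+1} | S_k = s] = \bE_\pi [R_{k+1} | S_k = s]$.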


\section{Update Rules}
\label{sec_app:update_rules}

This section includes the update rules for the algorithms mentioned in the paper.

Off-policy TD($\lambda$):
\begin{align*}
  \begin{split}
    \vect \theta_{t+1} &= \vect \theta_t + \alpha \delta_t \vect z_t, \\
    \delta_t &= R_{t+1} + \gamma \vect \phi_{t+1}^\top \vect \theta_t - \vect \phi_{t}^\top \vect \theta_t, \\
    \vect z_t &= \rho_t (\gamma \lambda \vect z_{t-1} + \vect \phi_t), \text{with } \vect z_{-1} = \vect 0.
  \end{split}
\end{align*}

Full-IS-TD($\lambda$):
\begin{align*}
  \begin{split}
    \vect \theta_{t+1} &= \vect \theta_t + \alpha \delta_t \vect z_t, \\
    \delta_t &= R_{t+1} + \gamma \vect \phi_{t+1}^\top \vect \theta_t - \vect \phi_{t}^\top \vect \theta_t, \\
    \vect z_t &= \rho_t (\gamma \lambda \vect z_{t-1} + F_t \vect \phi_t), \text{with } \vect z_{-1} = \vect 0, \\
    F_t &= \rho_{t-1} F_{t-1}, \text{with } F_0 = 1.
  \end{split}
\end{align*}

ETD($\lambda$):
\begin{align*}
  \begin{split}
    \vect \theta_{t+1} &= \vect \theta_t + \alpha \delta_t \vect z_t, \\
    \delta_t &= R_{t+1} + \gamma \vect \phi_{t+1}^\top \vect \theta_t - \vect \phi_{t}^\top \vect \theta_t, \\
    \vect z_t &= \rho_t (\gamma \lambda \vect z_{t-1} + M_t \vect \phi_t), \text{with } \vect z_{-1} = \vect 0, \\
    M_t &= (1-\lambda) F_t + \lambda, \\
    F_t &= \gamma \rho_{t-1} F_{t-1} + 1, \text{with } F_0 = 1.
  \end{split}
\end{align*}

ETD($\lambda$, $\beta$):
\begin{align*}
  \begin{split}
    \vect \theta_{t+1} &= \vect \theta_t + \alpha \delta_t \vect z_t, \\
    \delta_t &= R_{t+1} + \gamma \vect \phi_{t+1}^\top \vect \theta_t - \vect \phi_{t}^\top \vect \theta_t, \\
    \vect z_t &= \rho_t (\gamma \lambda \vect z_{t-1} + M_t \vect \phi_t), \text{with } \vect z_{-1} = \vect 0, \\
    M_t &= (1-\lambda) F_t + \lambda, \\
    F_t &= \beta \rho_{t-1} F_{t-1} + 1, \text{with } F_0 = 1.
  \end{split}
\end{align*}

Scaled ETD($\lambda$):
\begin{align*}
  \begin{split}
    \vect \theta_{t+1} &= \vect \theta_t + \alpha \delta_t \vect z_t, \\
    \delta_t &= R_{t+1} + \gamma \vect \phi_{t+1}^\top \vect \theta_t - \vect \phi_{t}^\top \vect \theta_t, \\
    \vect z_t &= \rho_t (\gamma \lambda \vect z_{t-1} + M_t \vect \phi_t), \text{with } \vect z_{-1} = \vect 0, \\
    M_t &= (1-\lambda) F_t + \lambda (1-\gamma), \\
    F_t &= \gamma \rho_{t-1} F_{t-1} + (1-\gamma), \text{with } F_0 = 1.
  \end{split}
\end{align*}

Scaled ETD($\lambda$, $\beta$):
\begin{align*}
  \begin{split}
    \vect \theta_{t+1} &= \vect \theta_t + \alpha \delta_t \vect z_t, \\
    \delta_t &= R_{t+1} + \gamma \vect \phi_{t+1}^\top \vect \theta_t - \vect \phi_{t}^\top \vect \theta_t, \\
    \vect z_t &= \rho_t (\gamma \lambda \vect z_{t-1} + M_t \vect \phi_t), \text{with } \vect z_{-1} = \vect 0, \\
    M_t &= (1-\lambda) F_t + \lambda (1-\beta), \\
    F_t &= \beta \rho_{t-1} F_{t-1} + (1-\beta), \text{with } F_0 = 1.
  \end{split}
\end{align*}

AETD($\lambda$):
\begin{align*}
  \begin{split}
    \vect \theta_{t+1} &= \vect \theta_t + \alpha \delta_t \vect z_t, \\
    \delta_t &= R_{t+1} + \gamma \vect \phi_{t+1}^\top \vect \theta_t - \vect \phi_{t}^\top \vect \theta_t, \\
    \vect z_t &= \rho_t (\gamma \lambda \vect z_{t-1} + M_t \vect \phi_t), \text{with } \vect z_{-1} = \vect 0, \\
    M_t &= (1-\lambda g(t)) F_t + \lambda g(t), \\
    F_t &= (1-g(t)) \rho_{t-1} F_{t-1} + g(t), \text{with } F_0 = 1, \\
    g(t) &= {(t+1)^{-1}}.
  \end{split}
\end{align*}
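
To make the averaging behavior of AETD($\lambda$) explicit, its followon trace can be unrolled. The closed form below is a sketch derived only from the recursion above and $F_0=1$, using the convention that an empty product equals one:
\begin{align*}
    F_1 &= \tfrac{1}{2} \rho_0 F_0 + \tfrac{1}{2} = \tfrac{1}{2} \left( \rho_0 + 1 \right), \\
    F_2 &= \tfrac{2}{3} \rho_1 F_1 + \tfrac{1}{3} = \tfrac{1}{3} \left( \rho_1 \rho_0 + \rho_1 + 1 \right), \\
    F_t &= \frac{1}{t+1} \sum_{k=0}^{t} \prod_{j=k}^{t-1} \rho_j.
\end{align*}
That is, $F_t$ is the uniform average of the $t+1$ importance-sampling-ratio products ending at time $t-1$, ranging from the full product ($k=0$) to the empty one ($k=t$). The general form follows by induction, since $\frac{t}{t+1} \rho_{t-1} \cdot \frac{1}{t} \sum_{k=0}^{t-1} \prod_{j=k}^{t-2} \rho_j + \frac{1}{t+1} = \frac{1}{t+1} \sum_{k=0}^{t} \prod_{j=k}^{t-1} \rho_j$.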

LC-ETD($\lambda$, $\beta$, $\nu$):
\begin{align*}
  \begin{split}
    \vect \theta_{t+1} &= \vect \theta_t + \alpha \delta_t \vect z_t, \\
    \delta_t &= R_{t+1} + \gamma \vect \phi_{t+1}^\top \vect \theta_t - \vect \phi_{t}^\top \vect \theta_t, \\
    \vect z_t &= \rho_t (\gamma \lambda \vect z_{t-1} + M_t \vect \phi_t), \text{with } \vect z_{-1} = \vect 0, \\
    M_t &= (1-\lambda h(t)) F_t + \lambda g(t), \\
    F_t &= (1-g(t)) \rho_{t-1} F_{t-1} + g(t), \text{with } F_0 = 1, \\
    h(t) &= (1-\beta)^\nu {(t+1)^{-\nu}}, \\
    g(t) &= (1-\beta) {(t+1)^{-\nu}}.
  \end{split}
\end{align*}

LC-ETD1($\lambda$, $\beta$):
\begin{align*}
  \begin{split}
    \vect \theta_{t+1} &= \vect \theta_t + \alpha \delta_t \vect z_t, \\
    \delta_t &= R_{t+1} + \gamma \vect \phi_{t+1}^\top \vect \theta_t - \vect \phi_{t}^\top \vect \theta_t, \\
    \vect z_t &= \rho_t (\gamma \lambda \vect z_{t-1} + M_t \vect \phi_t), \text{with } \vect z_{-1} = \vect 0, \\
    M_t &= (1-\lambda h(t)) F_t + \lambda g(t), \\
    F_t &= (1-g(t)) \rho_{t-1} F_{t-1} + g(t), \text{with } F_0 = 1, \\
    h(t) &= (1-\beta)^\beta {(t+1)^{-\beta}}, \\
    g(t) &= (1-\beta) {(t+1)^{-\beta}}.
  \end{split}
\end{align*}

LC-ETD2($\lambda$, $\nu$):
\begin{align*}
  \begin{split}
    \vect \theta_{t+1} &= \vect \theta_t + \alpha \delta_t \vect z_t, \\
    \delta_t &= R_{t+1} + \gamma \vect \phi_{t+1}^\top \vect \theta_t - \vect \phi_{t}^\top \vect \theta_t, \\
    \vect z_t &= \rho_t (\gamma \lambda \vect z_{t-1} + M_t \vect \phi_t), \text{with } \vect z_{-1} = \vect 0, \\
    M_t &= (1-\lambda g(t)) F_t + \lambda g(t), \\
    F_t &= (1-g(t)) \rho_{t-1} F_{t-1} + g(t), \text{with } F_0 = 1, \\
    g(t) &= {(t+1)^{-\nu}}.
  \end{split}
\end{align*}

LC-ETD3($\lambda$, $\beta$):
\begin{align*}
  \begin{split}
    \vect \theta_{t+1} &= \vect \theta_t + \alpha \delta_t \vect z_t, \\
    \delta_t &= R_{t+1} + \gamma \vect \phi_{t+1}^\top \vect \theta_t - \vect \phi_{t}^\top \vect \theta_t, \\
    \vect z_t &= \rho_t (\gamma \lambda \vect z_{t-1} + M_t \vect \phi_t), \text{with } \vect z_{-1} = \vect 0, \\
    M_t &= (1-\lambda g(t)) F_t + \lambda g(t), \\
    F_t &= (1-g(t)) \rho_{t-1} F_{t-1} + g(t), \text{with } F_0 = 1, \\
    g(t) &= {(1-\beta)}{(t+1)^{-1}}.
  \end{split}
\end{align*}
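
For reference, the last four algorithms above can be read as special cases of LC-ETD($\lambda$, $\beta$, $\nu$). The summary below is obtained by substituting the stated parameter choices into $h(t)$ and $g(t)$; it is a reading of the update rules in this section rather than an additional result.
\begin{center}
\begin{tabular}{llll}
\toprule
Algorithm & Parameter choice & $h(t)$ & $g(t)$ \\
\midrule
AETD($\lambda$) & $\beta=0$, $\nu=1$ & $(t+1)^{-1}$ & $(t+1)^{-1}$ \\
LC-ETD1($\lambda$, $\beta$) & $\nu=\beta$ & $(1-\beta)^\beta (t+1)^{-\beta}$ & $(1-\beta) (t+1)^{-\beta}$ \\
LC-ETD2($\lambda$, $\nu$) & $\beta=0$ & $(t+1)^{-\nu}$ & $(t+1)^{-\nu}$ \\
LC-ETD3($\lambda$, $\beta$) & $\nu=1$ & $(1-\beta) (t+1)^{-1}$ & $(1-\beta) (t+1)^{-1}$ \\
\bottomrule
\end{tabular}
\end{center}
In addition, with $\beta=1$ and $\nu>0$, both $h(t)$ and $g(t)$ are zero, so $M_t = F_t = \rho_{t-1} F_{t-1}$ and the update coincides with that of Full-IS-TD($\lambda$). As a numerical illustration with the hypothetical values $\beta=0.5$, $\nu=0.5$, and $t=3$, we get $g(3) = 0.5 \cdot 4^{-0.5} = 0.25$ and $h(3) = (0.5/4)^{0.5} \approx 0.354$.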


\section{Additional Results and Experimental Details for One-Step Bootstrapping}
\label{sec_app:exps_add}

In this section, we provide additional results and experimental details to supplement the results for the one-step case in the main text. As in Section \ref{sec:experiments}, we omit the $\lambda$ argument from all algorithms for notational convenience. Our Python implementations of the algorithms and environments are publicly available for future research.\footnote{See \url{https://github.com/hejm37/LC-ETD}.}


\subsection*{Stability of LC-ETD($\beta$, $\nu$)}

We use Baird's \citeyearpar{baird1995residual} counterexample to validate the stability of LC-ETD($\beta$, $\nu$). Baird's counterexample is a seven-state, two-action MDP with linear features (see Figure \ref{fig:baird_counterexample}), which can illustrate the instability of Off-policy TD($\lambda$) and other algorithms \citep{sutton2018reinforcement,jiang2022learning}. In the one-step case, Off-policy TD($0$) diverges in this example for any positive step size as long as $\gamma\in[(\sqrt{5}-1)/2, 1]$. Here, we choose $\gamma=0.97$. Since the target policy's stationary distribution concentrates on the bottom state, the $\overline{\text{RMSVE}}$ error defined previously can only capture the errors of $\theta_7$ and $\theta_8$. To also take into account the errors of other dimensions of the parameter vector, we adopt the following root mean square value error as our metric: $\norm{\vect{\hat{v}_{\theta}} - \vect v_\pi}_{\vect u}$, where $\vect u\doteq [1/|\cS|,\cdots,1/|\cS|]^\top\in\bR^{|\cS|}$ is a uniform distribution. We run each algorithm for $100{,}000$ steps with the $19$ step sizes mentioned in Section \ref{sec:experiments} and present the results in Figure \ref{fig:exp4_baird}. The results are averaged over $100$ independent runs, and the shaded region near each learning curve represents the standard error.
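
Written out, and assuming that $\norm{\cdot}_{\vect u}$ denotes the weighted norm $\norm{\vect x}_{\vect u} \doteq \sqrt{\sum_{s} u(s) x(s)^2}$ (an assumption about the definition used in the main text), the metric for the seven-state counterexample is
\begin{align*}
    \norm{\vect{\hat{v}_{\theta}} - \vect v_\pi}_{\vect u} = \sqrt{\frac{1}{7} \sum_{s\in\cS} \left( \hat{v}_{\vect \theta}(s) - v_\pi(s) \right)^2},
\end{align*}
so that every state, and hence every component of the parameter vector, contributes to the error with equal weight.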

From the leftmost plot of Figure \ref{fig:exp4_baird}, we can see that the only existing consistent algorithm, Full-IS-TD (the green dashed line), does not learn at all, just as in the Two-state and Rooms tasks, whereas LC-ETD1($\beta$) with $\beta\in[0.2,0.8]$ finds solutions with much lower errors. A similar observation holds for LC-ETD2($\nu$). It is important to note that the importance-sampling ratio can be zero in this counterexample: this happens whenever the agent chooses the $\mathsf{up}$ action, which occurs with probability $6/7$ at every state. Consequently, the full IS-ratio product quickly becomes zero after a few time steps, and the same goes for most of the incomplete IS-ratio products. As a result, LC-ETD3($\beta$) cannot learn because its followon trace quickly decays to an extremely small value. Off-policy TD (the red dotted line) diverges gradually even with very small step sizes (the smallest step size is $2^{-18}$). The same goes for ETD($\beta$) with $\beta\in[0.0,0.4]$ (the rightmost plot), validating ETD($\beta$)'s instability with small $\beta$. In summary, the results in Baird's counterexample highlight the stability of LC-ETD instances and illustrate the instability of ETD($\beta$) with small $\beta$. They also expose a limitation of LC-ETD($\beta$, $\nu$): it cannot learn effectively with large $\nu$ when importance-sampling ratios are often zero.
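
The collapse of the importance-sampling-ratio products can be quantified with a short calculation based on the policies described in Figure \ref{fig:baird_counterexample}. The behavior policy chooses $\mathsf{down}$ with probability $1/7$ and $\mathsf{up}$ with probability $6/7$, while the target policy always chooses $\mathsf{down}$, so
\begin{align*}
    \rho_t = \begin{cases} 7, & \text{if } A_t = \mathsf{down}, \\ 0, & \text{if } A_t = \mathsf{up}, \end{cases}
    \qquad \text{and} \qquad
    \bP_\mu \left( \prod_{k=0}^{t-1} \rho_k \neq 0 \right) = \left( \frac{1}{7} \right)^{t}.
\end{align*}
For instance, after $t=5$ steps the full product is nonzero with probability $7^{-5} = 1/16807 \approx 6 \times 10^{-5}$, in which case it equals $7^5 = 16807$. Its expectation remains one, but almost all sample paths carry zero weight.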

\begin{figure}[b]
    \centering
    \begin{tikzpicture}[dgraph]
        \node[ellip] (s1) at (0.2,1) {$2\theta_1$\\$\,+$\\$\,\theta_8$};
        \node[ellip] (s2) at (1.3,1) {$2\theta_2$\\$\,+$\\$\,\theta_8$};
        \node[ellip] (s3) at (2.4,1) {$2\theta_3$\\$\,+$\\$\,\theta_8$};
        \node[ellip] (s4) at (3.5,1) {$2\theta_4$\\$\,+$\\$\,\theta_8$};
        \node[ellip] (s5) at (4.6,1) {$2\theta_5$\\$\,+$\\$\,\theta_8$};
        \node[ellip] (s6) at (5.7,1) {$2\theta_6$\\$\,+$\\$\,\theta_8$};
        \node[ellip,text width=1.5cm] (s7) at (3,-1) {$\,\,\theta_7+2\theta_8$};
        
        \draw[](s1.south)--(s7);
        \draw[](s2.south)--(s7);
        \draw[](s3.south)--(s7);
        \draw[](s4.south)--(s7);
        \draw[](s5.south)--(s7);
        \draw[](s6.south)--(s7);
        \draw[](s7.5)to[out=5, in=-5,looseness=10] (s7.-5);
        
        \node[input] (ss) at (-1.2, 1) {upper states};
        \node[system,fit=(ss) (s1) (s2) (s3) (s4) (s5) (s6)] {};
        \node[input] (tt) at (0.5, -1) {bottom state};
    \end{tikzpicture}
    \caption{Baird's counterexample. Each state has two actions. The $\mathsf{up}$ action takes the agent to one of the six upper states with equal probability, while the $\mathsf{down}$ action takes the agent to the bottom state. The target policy chooses the $\mathsf{down}$ action with probability $1$ at any state (illustrated as the solid lines), while the behavior policy chooses it with probability $1/7$.}
    \label{fig:baird_counterexample}
\end{figure}

\begin{figure*}[tb]
  \centering
    \includegraphics[width=0.245\textwidth]{figures/baird/learning_curve_CETDL1Baird_local_97_100Lmbda0.0_final}
    \includegraphics[width=0.245\textwidth]{figures/baird/learning_curve_CETDL2Baird_local_97_100Lmbda0.0_final}
    \includegraphics[width=0.245\textwidth]{figures/baird/learning_curve_CETDL3Baird_local_97_100Lmbda0.0_final}
    \includegraphics[width=0.245\textwidth]{figures/baird/learning_curve_ETDLBBaird_local_97_100Lmbda0.0_final}
  \caption{Results on Baird's counterexample. The y-axis shows $\norm{\vect{\hat{v}_{\theta}} - \vect v_\pi}_{\vect u}$ (see text for details).}
  \label{fig:exp4_baird}
\end{figure*}


\subsection*{Experimental Details on the Rooms Task}

Our continuing Rooms task is extended from the episodic Rooms task proposed by \citet{ghiassian2021empirical2}. It is based on the Four Rooms environment \citep{sutton1999between}, which can be partitioned into four parts that are connected by hallways (see Figure \ref{fig:four_room}). The Four Rooms environment has $104$ states, including four hallway states. Each of the four actions in this environment moves the agent by one state in the corresponding direction. If an action would take the agent across the boundary, the agent stays in its current state. The task consists of four sub-tasks. Each sub-task assigns a reward of $1$ to the agent if it arrives or stays at the corresponding hallway state. However, the agent cannot stay in a hallway state permanently because there is noise in the interactions: at each time step, with probability $50\%$, the agent's action is replaced by one of the other three actions, chosen with equal probability. The agent needs to learn the value functions for the four target policies while following a uniform random behavior policy. Each of the four target policies tries to reach one of the four hallway states. Specifically, each target policy chooses the optimal action towards its corresponding hallway state with probability $1-\epsilon$ and a random action with probability $\epsilon$. We set $\epsilon$ to $0.1$ in our experiments. The discount factor $\gamma$ is $0.9$. Note that it is hard to calculate the fixed points analytically in this task. Thus, we applied On-policy TD with tabular features on a trajectory following the target policy for $2{,}000{,}000$ steps for each target policy and used the final value function as the ground truth $v_\pi$. Similarly, the on-policy distributions are calculated by following each target policy for $2{,}000{,}000$ steps.
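
As an illustration of the importance-sampling ratios in this task, suppose (this is an assumption, since the description above does not specify it) that the random action of each target policy is drawn uniformly from all four actions and that a single optimal action exists in state $s$. With $\epsilon=0.1$ and the uniform random behavior policy $\mu(a|s)=1/4$, we would then have
\begin{align*}
    \pi(a|s) = \begin{cases} 1-\epsilon+\epsilon/4 = 0.925, & \text{if $a$ is optimal}, \\ \epsilon/4 = 0.025, & \text{otherwise}, \end{cases}
    \qquad
    \frac{\pi(a|s)}{\mu(a|s)} = \begin{cases} 3.7, & \text{if $a$ is optimal}, \\ 0.1, & \text{otherwise}. \end{cases}
\end{align*}
Under this reading, the ratios are bounded and never zero, in contrast to Baird's counterexample, where they take values in $\{0, 7\}$.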

\begin{figure*}[b]
    \centering
    \includegraphics[width=0.36\textwidth]{figures/four_room}
    \caption{The Rooms task. Modified from \citet{sutton1999between}.}
    \label{fig:four_room}
\end{figure*}


\subsection*{Supplementary Results on the Rooms Task}

To provide a comprehensive performance profile of the one-step algorithms on the Rooms task, we present the mean results averaged over all runs in Figure \ref{fig:exp2_curve_mean}. From Figure \ref{sec_app:exps_add}.\ref{fig:exp2_sub_all_mean}, we can see that ETD, ETD($\beta$), LC-ETD1($\beta$), and LC-ETD2($\nu$) are the top-tier algorithms in this case. Among them, ETD, LC-ETD1($\beta$), and LC-ETD2($\nu$) are less stable due to the high variance of this task. In addition, LC-ETD3($\beta$) suffers more from the variance issue and cannot learn efficiently, but it still learns much faster and finds much better solutions than Off-policy TD. For Full-IS-TD and Off-policy TD, the results differ little from the IQM results presented in Figure \ref{fig:exp2_sub_all}: the former cannot learn despite being the only existing consistent algorithm, while the latter converges to a solution with a significant bias. Finally, from Figure \ref{sec_app:exps_add}.\ref{fig:exp2_sub_sensitivity_mean}, we can see that LC-ETD1($\beta$) and LC-ETD2($\nu$) are still less sensitive to the decay parameter than ETD($\beta$).

\begin{figure*}[t]
  \centering
  \subfigure[Best learning curves]{
    \includegraphics[width=0.292\textwidth]{figures/four_room_mean/learning_curve_AllFourRoom_true3_30s_150k_reLmbda0.0_final}
    \label{fig:exp2_sub_all_mean}
  }
  \subfigure[Sensitivity to $\beta$ or $\nu$]{
    \includegraphics[width=0.292\textwidth]{figures/four_room_mean/sensitivity_beta_FourRoom_true3_30s_150k_reLmbda0.0_final}
    \label{fig:exp2_sub_sensitivity_mean}
  }
  \subfigure[Best learning curves of ETD($\beta$)]{
    \includegraphics[width=0.292\textwidth]{figures/four_room_mean/learning_curve_ETDLBFourRoom_true3_30s_150k_reLmbda0.0_final}
    \label{fig:exp2_sub_etdlb_mean}
  }
  \subfigure[Best learning curves of LC-ETD1($\beta$)]{
    \includegraphics[width=0.292\textwidth]{figures/four_room_mean/learning_curve_CETDL1FourRoom_true3_30s_150k_reLmbda0.0_final}
    \label{fig:exp2_sub_cetd1_mean}
  }
  \subfigure[Best learning curves of LC-ETD2($\nu$)]{
    \includegraphics[width=0.292\textwidth]{figures/four_room_mean/learning_curve_CETDL2FourRoom_true3_30s_150k_reLmbda0.0_final}
    \label{fig:exp2_sub_cetd2_mean}
  }
  \subfigure[Best learning curves of LC-ETD3($\beta$)]{
    \includegraphics[width=0.292\textwidth]{figures/four_room_mean/learning_curve_CETDL3FourRoom_true3_30s_150k_reLmbda0.0_final}
    \label{fig:exp2_sub_cetd3_mean}
  }
  \caption{Results averaged over all runs on the Rooms task. The y-axis shows $\overline{\text{RMSVE}}$.}
  \label{fig:exp2_curve_mean}
\end{figure*}


\subsection*{Supplementary Results of the Bias-Variance Trade-Off Analysis}

\begin{figure*}[b]
  \subfigure[$25$ seeds]{
    \includegraphics[height=0.23\textheight]{figures/b_v_appendix/b_v_seeds_25-crop}
  }
  \subfigure[$5{,}500$ seeds]{
    \includegraphics[height=0.23\textheight]{figures/b_v_appendix/b_v_seeds_5500-crop}
  }
  \subfigure[$12{,}000$ seeds]{
    \includegraphics[height=0.23\textheight]{figures/b_v_appendix/b_v_seeds_12000-crop}
    \label{fig:b_v_12000}
  }
\caption{The bias and variance of LC-ETD1($\beta$)'s $F_t$ when $\beta$ varies. Label $\mathsf{step}\ n$ represents results for $F_n$.}
\label{fig:b_v}
\end{figure*}

In this section, we explain the design choices and provide extra results for the bias-variance trade-off analysis. We first explain why we choose to study the bias and variance on trajectories of length only $30$. For the purpose of explanation, assume that we want to analyze the bias and variance of the $2$-step full IS-ratio product $F_1=\rho_0\rho_1$ in the Two-state task, where the target policy $\pi$ goes to the left state from any state with a probability of $0.1$, while the behavior policy does so with a probability of $0.9$. Since both the target and the behavior policies are state-independent, the IS ratio $\rho_t$ at any time step $t$ takes the value ${1}/{9}$ with a probability of $0.9$ (when the agent goes to the left state) or the value $9$ with a probability of $0.1$. To obtain an accurate estimate of the mean of $\rho_0$ with high probability, we need far more than $10$ seeds. To obtain an accurate estimate of the mean of $F_1$, we need far more than $100$ seeds. Otherwise, we can only obtain an estimate with a large bias. Thus, to obtain an accurate bias-variance analysis of different algorithms' $F_t$, we run experiments on short trajectories of length only $30$ but with $100{,}000$ seeds.
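
To make the seed requirement concrete, the first two moments of the IS ratio can be worked out directly from the probabilities stated above. The following is only a sketch, and it additionally assumes that $\rho_0$ and $\rho_1$ are independent, which is plausible here because both policies are state-independent:
\begin{align*}
  \mathbb{E}[\rho_t] &= 0.9 \cdot \tfrac{1}{9} + 0.1 \cdot 9 = 0.1 + 0.9 = 1, \\
  \mathbb{E}[\rho_t^2] &= 0.9 \cdot \tfrac{1}{81} + 0.1 \cdot 81 = \tfrac{73}{9} \approx 8.11, \\
  \operatorname{Var}(\rho_t) &= \tfrac{73}{9} - 1 = \tfrac{64}{9} \approx 7.11, \\
  \mathbb{E}[F_1] &= \mathbb{E}[\rho_0]\,\mathbb{E}[\rho_1] = 1, \\
  \operatorname{Var}(F_1) &= \bigl(\tfrac{73}{9}\bigr)^2 - 1 = \tfrac{5248}{81} \approx 64.8.
\end{align*}
Under these assumptions, the largest value of $F_1$ is $9 \cdot 9 = 81$, and it occurs with a probability of only $0.1^2 = 0.01$. A sample mean computed from a few dozen seeds will therefore usually miss this value and underestimate $\mathbb{E}[F_1]$.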

Next, we provide additional experimental results to support the above discussion and reveal more insights. Figure \ref{fig:b_v} plots the estimated bias and variance of LC-ETD1($\beta$)'s $F_t$ with different numbers of seeds. We can see that with $25$ seeds, we can estimate the bias and variance of $F_3$ well but not those of $F_5$ and $F_{10}$; with $5{,}500$ seeds, the estimates of the biases and variances of $F_5$ and $F_{10}$ improve but are still biased; finally, with $12{,}000$ seeds, we can estimate the bias and variance of $F_5$ well, but those of $F_{10}$ are still biased. Now, focusing on the results for $F_3$ and $F_5$ in Figure \ref{sec_app:exps_add}.\ref{fig:b_v_12000}, we can see that as the decay parameter $\beta$ increases, the bias decreases and the variance increases. In addition, as the time step increases, the bias decreases and the variance increases for any fixed $\beta$, which agrees with the consistency of LC-ETD1($\beta$).


\section{Results for Multi-Step Bootstrapping}
\label{sec_app:exps_multi_step}

In this section, we present results for different algorithms with multi-step bootstrapping. Specifically, we studied two values of $\lambda$, $0.5$ and $0.9$, which correspond to different levels of bootstrapping. Figures \ref{fig:exp1_curve_lmbda05} and \ref{fig:exp1_curve_lmbda09} show the results of different algorithms with multi-step bootstrapping on the Two-state task. The conclusion is similar to the one-step case presented in Section \ref{sec:experiments}, except that the biases of Off-policy TD($\lambda$) and ETD($\lambda$, $\beta$) decrease significantly as $\lambda$ increases.

\begin{figure*}[!htb]
  \centering
  \subfigure[Best learning curves]{
    \includegraphics[width=0.292\textwidth]{figures/two_state_true/learning_curve_AllTwoState_true3_reLmbda0.5_final}
  }
  \subfigure[Sensitivity to $\beta$ or $\nu$]{
    \includegraphics[width=0.292\textwidth]{figures/two_state_true/sensitivity_beta_TwoState_true3_reLmbda0.5_final}
  }
  \subfigure[Best learning curves of ETD($\beta$)]{
    \includegraphics[width=0.292\textwidth]{figures/two_state_true/learning_curve_ETDLBTwoState_true3_reLmbda0.5_final}
  }
  \subfigure[Best learning curves of LC-ETD1($\beta$)]{
    \includegraphics[width=0.292\textwidth]{figures/two_state_true/learning_curve_CETDL1TwoState_true3_reLmbda0.5_final}
  }
  \subfigure[Best learning curves of LC-ETD2($\nu$)]{
    \includegraphics[width=0.292\textwidth]{figures/two_state_true/learning_curve_CETDL2TwoState_true3_reLmbda0.5_final}
  }
  \subfigure[Best learning curves of LC-ETD3($\beta$)]{
    \includegraphics[width=0.292\textwidth]{figures/two_state_true/learning_curve_CETDL3TwoState_true3_reLmbda0.5_final}
  }
  \caption{Results on the Two-state task when $\lambda=0.5$. The y-axis shows $\overline{\text{RMSVE}}$. The $\lambda$ argument is omitted in the plots.}
  \label{fig:exp1_curve_lmbda05}
\end{figure*}

\begin{figure*}[!htb]
  \centering
  \subfigure[Best learning curves]{
    \includegraphics[width=0.292\textwidth]{figures/two_state_true/learning_curve_AllTwoState_true3_reLmbda0.9_final}
  }
  \subfigure[Sensitivity to $\beta$ or $\nu$]{
    \includegraphics[width=0.292\textwidth]{figures/two_state_true/sensitivity_beta_TwoState_true3_reLmbda0.9_final}
    \label{fig:exp1_sub_sensitivity_0.9}
  }
  \subfigure[Best learning curves of ETD($\beta$)]{
    \includegraphics[width=0.292\textwidth]{figures/two_state_true/learning_curve_ETDLBTwoState_true3_reLmbda0.9_final}
  }
  \subfigure[Best learning curves of LC-ETD1($\beta$)]{
    \includegraphics[width=0.292\textwidth]{figures/two_state_true/learning_curve_CETDL1TwoState_true3_reLmbda0.9_final}
  }
  \subfigure[Best learning curves of LC-ETD2($\nu$)]{
    \includegraphics[width=0.292\textwidth]{figures/two_state_true/learning_curve_CETDL2TwoState_true3_reLmbda0.9_final}
  }
  \subfigure[Best learning curves of LC-ETD3($\beta$)]{
    \includegraphics[width=0.292\textwidth]{figures/two_state_true/learning_curve_CETDL3TwoState_true3_reLmbda0.9_final}
  }
  \caption{Results on the Two-state task when $\lambda=0.9$. The y-axis shows $\overline{\text{RMSVE}}$. The $\lambda$ argument is omitted in the plots.}
  \label{fig:exp1_curve_lmbda09}
\end{figure*}

Figures \ref{fig:exp2_curve_lmbda05} and \ref{fig:exp2_curve_lmbda09} show the results of different algorithms with multi-step bootstrapping on the Rooms task. The conclusion is similar to the one-step case presented in Section \ref{sec:experiments}, except that the biases of Off-policy TD($\lambda$) and ETD($\lambda$, $\beta$) decrease significantly as $\lambda$ increases.

\begin{figure*}[!htb]
  \centering
  \subfigure[Best learning curves]{
    \includegraphics[width=0.292\textwidth]{figures/four_room_true/learning_curve_AllFourRoom_true3_30s_150k_reLmbda0.5_final}
    \label{fig:exp2_sub_all_lmbda1}
  }
  \subfigure[Sensitivity to $\beta$ or $\nu$]{
    \includegraphics[width=0.292\textwidth]{figures/four_room_true/sensitivity_beta_FourRoom_true3_30s_150k_reLmbda0.5_final}
    \label{fig:exp2_sub_sensitivity_lmbda1}
  }
  \subfigure[Best learning curves of ETD($\beta$)]{
    \includegraphics[width=0.292\textwidth]{figures/four_room_true/learning_curve_ETDLBFourRoom_true3_30s_150k_reLmbda0.5_final}
    \label{fig:exp2_sub_etdlb_lmbda1}
  }
  \subfigure[Best learning curves of LC-ETD1($\beta$)]{
    \includegraphics[width=0.292\textwidth]{figures/four_room_true/learning_curve_CETDL1FourRoom_true3_30s_150k_reLmbda0.5_final}
    \label{fig:exp2_sub_cetd1_lmbda1}
  }
  \subfigure[Best learning curves of LC-ETD2($\nu$)]{
    \includegraphics[width=0.292\textwidth]{figures/four_room_true/learning_curve_CETDL2FourRoom_true3_30s_150k_reLmbda0.5_final}
    \label{fig:exp2_sub_cetd2_lmbda1}
  }
  \subfigure[Best learning curves of LC-ETD3($\beta$)]{
    \includegraphics[width=0.292\textwidth]{figures/four_room_true/learning_curve_CETDL3FourRoom_true3_30s_150k_reLmbda0.5_final}
    \label{fig:exp2_sub_cetd3_lmbda1}
  }
  \caption{Results on the Rooms task when $\lambda=0.5$. The y-axis shows $\overline{\text{RMSVE}}$. The $\lambda$ argument is omitted in the plots.}
  \label{fig:exp2_curve_lmbda05}
\end{figure*}

\begin{figure*}[!htb]
  \centering
  \subfigure[Best learning curves]{
    \includegraphics[width=0.292\textwidth]{figures/four_room_true/learning_curve_AllFourRoom_true3_30s_150k_reLmbda0.9_final}
    \label{fig:exp2_sub_all_lmbda2}
  }
  \subfigure[Sensitivity to $\beta$ or $\nu$]{
    \includegraphics[width=0.292\textwidth]{figures/four_room_true/sensitivity_beta_FourRoom_true3_30s_150k_reLmbda0.9_final}
    \label{fig:exp2_sub_sensitivity_lmbda2}
  }
  \subfigure[Best learning curves of ETD($\beta$)]{
    \includegraphics[width=0.292\textwidth]{figures/four_room_true/learning_curve_ETDLBFourRoom_true3_30s_150k_reLmbda0.9_final}
    \label{fig:exp2_sub_etdlb_lmbda2}
  }
  \subfigure[Best learning curves of LC-ETD1($\beta$)]{
    \includegraphics[width=0.292\textwidth]{figures/four_room_true/learning_curve_CETDL1FourRoom_true3_30s_150k_reLmbda0.9_final}
    \label{fig:exp2_sub_cetd1_lmbda2}
  }
  \subfigure[Best learning curves of LC-ETD2($\nu$)]{
    \includegraphics[width=0.292\textwidth]{figures/four_room_true/learning_curve_CETDL2FourRoom_true3_30s_150k_reLmbda0.9_final}
    \label{fig:exp2_sub_cetd2_lmbda2}
  }
  \subfigure[Best learning curves of LC-ETD3($\beta$)]{
    \includegraphics[width=0.292\textwidth]{figures/four_room_true/learning_curve_CETDL3FourRoom_true3_30s_150k_reLmbda0.9_final}
    \label{fig:exp2_sub_cetd3_lmbda2}
  }
  \caption{Results on the Rooms task when $\lambda=0.9$. The y-axis shows $\overline{\text{RMSVE}}$. The $\lambda$ argument is omitted in the plots.}
  \label{fig:exp2_curve_lmbda09}
\end{figure*}


\section{Step Size Sensitivity}
\label{sec:exp_step_size_sensitivity}

In this section, we provide a step-size sensitivity analysis on the Two-state and Rooms tasks. We aggregate the results in Figure \ref{fig:exp_step_size_all} for convenient comparison across different dimensions. We discuss the following aspects in order.

\begin{itemize}
    \item The effect of an algorithm's decay parameter ($\beta$ or $\nu$) on its step-size sensitivity.
    \item The comparison of the step-size sensitivity of different algorithms on a single task.
    \item The effect of an algorithm's bootstrapping parameter ($\lambda$) on its step-size sensitivity.
    \item The comparison of the step-size sensitivity of different algorithms across different tasks.
\end{itemize}

First, from the top-left corner of Figure \ref{fig:exp_step_size_all}, we can see how different values of $\beta$ affect the step-size sensitivity of LC-ETD1($0$, $\beta$). Specifically, at the left extreme ($\beta=0$), LC-ETD1($0$, $\beta$) becomes Off-policy TD($0$), which is the least sensitive algorithm but converges to solutions with high errors. At the right extreme ($\beta=1$), LC-ETD1($0$, $\beta$) degenerates into Full-IS-TD($0$), which is the most sensitive and learns extremely slowly. While LC-ETD1($0$, $\beta$) with any intermediate value of $\beta$ achieves significantly lower errors, it also has an intermediate sensitivity to the step size. In summary, the sensitivity to the step size increases as the decay parameter increases. This pattern also holds in the other plots in the figure, except for the completely flat curves, which indicate that Full-IS-TD($\lambda$) shows no sign of learning.

Next, we compare the step-size sensitivity of different one-step ($\lambda=0$) algorithms on the Two-state task using the top row of Figure \ref{fig:exp_step_size_all}. It is clear that ETD($0$, $\beta$) is the least sensitive across different values of the decay parameter, while LC-ETD3($0$, $\beta$) is at the other extreme. In addition, their best-performing step sizes for different values of the decay parameter are quite similar, which is not the case for LC-ETD1($0$, $\beta$) and LC-ETD2($0$, $\nu$). Nevertheless, the latter two algorithms with a decay parameter of $0.2$ exhibit low sensitivity while achieving the lowest error. These observations remain valid for the other rows in the figure.

Further, the leftmost plots of the top three rows provide insights into how different values of $\lambda$ impact the step-size sensitivity of LC-ETD1($\lambda$, $\beta$). As $\lambda$ increases, we make four observations. First, LC-ETD1($\lambda$, $\beta$) yields lower errors across different values of the decay parameter. Second, the method becomes increasingly sensitive to the step size due to higher variance. Third, the difference in error between Off-policy TD($\lambda$) ($\beta=0$) and LC-ETD1($\lambda$, $\beta$) ($0<\beta<1$) diminishes. Finally, the sensitivity curve shifts toward smaller step sizes. These observations are consistent with those for LC-ETD2($\lambda$, $\nu$) and ETD($\lambda$, $\beta$).

Finally, we compare the step-size sensitivity of one-step ($\lambda=0$) algorithms on the two tasks using the first and fourth rows of Figure \ref{fig:exp_step_size_all}. Our observations reveal that the algorithms exhibit greater sensitivity in the Rooms task, which has a higher variance than the Two-state task. This is especially notable for algorithms previously found to be less sensitive in the Two-state task. There could be two contributing factors to this observation. First, the range of suitable step sizes may shrink as the task variance increases. Alternatively, the difference could be due to how the results are summarized. We remind the reader that the results for the Two-state task were averaged over all $100$ runs, while the results for the Rooms task were averaged over the middle $15$ runs.

In summary, higher variance can lead to greater sensitivity to the step size. In the case of LC-ETD instances, reducing variance through a small decay parameter can improve usability. This is supported by the above analysis, which showed that a small decay parameter resulted in the lowest error while reducing sensitivity to changes in the step-size parameter. Therefore, using a small decay parameter may be an effective way to optimize the performance of LC-ETD instances.

\begin{figure*}
  \centering
    \includegraphics[width=0.9\textwidth]{figures/sensitivity_twostate_re_lc} \\
    \vspace*{5mm}
    \includegraphics[width=0.9\textwidth]{figures/sensitivity_fourroom_re_lc}
  \caption{Step-size sensitivity.}
  \label{fig:exp_step_size_all}
\end{figure*}

\nobibliography{he_35}
\end{document}
