\documentclass[accepted]{uai2022} % for initial submission
% \documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros

\usepackage{times}
\usepackage{xcolor}

\usepackage{algorithm, algpseudocode}
\usepackage{amsfonts,amssymb, amsmath, amsthm}

\usepackage{microtype}
\usepackage{graphicx}
\usepackage{subfigure}
\usepackage{booktabs} % for professional tables

% \usepackage{geometry} % fix the margin of pdf
\usepackage{wrapfig}
\usepackage{bm}

\usepackage{enumitem}
\usepackage{xspace}
\usepackage{tcolorbox}

\newcommand{\swap}[3][-]{#3#1#2} % just an example

%\RequirePackage{latexsym} \RequirePackage{amsmath}
%\RequirePackage{amssymb} \RequirePackage{bm} \RequirePackage{url}
\usepackage{latexsym}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{bm}
\usepackage{mathrsfs}
\usepackage{url}


%%%%%%%% Stock standard definitions %%%%%%%%%%%%%%%

\newcommand{\mfe}{\mathfrak{e}}
\newcommand{\mfb}{\mathfrak{b}}


\newcommand{\alphab}{\boldsymbol{\alpha}}
\newcommand{\betab}{\boldsymbol{\beta}}
\newcommand{\gammab}{\boldsymbol{\gamma}}
\newcommand{\thetab}{\boldsymbol{\theta}}
\newcommand{\phib}{\boldsymbol{\phi}}
\newcommand{\omegab}{\boldsymbol{\omega}}

\newcommand{\Phib}{\boldsymbol{\Phi}}
% \newcommand{\ab}{\bm{a}}
% \newcommand{\bb}{\bm{b}}
% \newcommand{\cbb}{\bm{c}}
% \newcommand{\db}{\bm{d}}
% \newcommand{\eb}{\bm{e}}
% \newcommand{\fb}{\bm{f}}
% \newcommand{\gb}{\bm{g}}
% \newcommand{\hb}{\bm{h}}
% \newcommand{\ib}{\bm{i}}
% \newcommand{\jb}{\bm{j}}
% \newcommand{\kb}{\bm{k}}
% \newcommand{\lb}{\bm{l}}
% \newcommand{\mb}{\bm{m}}
% \newcommand{\nbb}{\bm{n}}
% \newcommand{\ob}{\bm{o}}
% \newcommand{\pb}{\bm{p}}
% \newcommand{\qb}{\bm{q}}
% \newcommand{\rb}{\bm{r}}
% \newcommand{\sbb}{\bm{s}}
% \newcommand{\tb}{\bm{t}}
% \newcommand{\ub}{\bm{u}}
% \newcommand{\vb}{\bm{v}}
% \newcommand{\wb}{\bm{w}}
% \newcommand{\xb}{\bm{x}}
% \newcommand{\yb}{\bm{y}}
% \newcommand{\zb}{\bm{z}}

\newcommand{\ab}{\mathbf{a}}
\newcommand{\bb}{\mathbf{b}}
\newcommand{\cbb}{\mathbf{c}}
\newcommand{\db}{\mathbf{d}}
\newcommand{\eb}{\mathbf{e}}
\newcommand{\fb}{\mathbf{f}}
\newcommand{\gb}{\mathbf{g}}
\newcommand{\hb}{\mathbf{h}}
\newcommand{\ib}{\mathbf{i}}
\newcommand{\jb}{\mathbf{j}}
\newcommand{\kb}{\mathbf{k}}
\newcommand{\lb}{\mathbf{l}}
\newcommand{\mb}{\mathbf{m}}
\newcommand{\nbb}{\mathbf{n}}
\newcommand{\ob}{\mathbf{o}}
\newcommand{\pb}{\mathbf{p}}
\newcommand{\qb}{\mathbf{q}}
\newcommand{\rb}{\mathbf{r}}
\newcommand{\sbb}{\mathbf{s}}
\newcommand{\tb}{\mathbf{t}}
\newcommand{\ub}{\mathbf{u}}
\newcommand{\vb}{\mathbf{v}}
\newcommand{\wb}{\mathbf{w}}
\newcommand{\xb}{\mathbf{x}}
\newcommand{\yb}{\mathbf{y}}
\newcommand{\zb}{\mathbf{z}}

\newcommand{\abtil}{\tilde{\ab}}
\newcommand{\bbtil}{\tilde{\bb}}
\newcommand{\cbtil}{\tilde{\cbb}}
\newcommand{\dbtil}{\tilde{\db}}
\newcommand{\ebtil}{\tilde{\eb}}
\newcommand{\fbtil}{\tilde{\fb}}
\newcommand{\gbtil}{\tilde{\gb}}
\newcommand{\hbtil}{\tilde{\hb}}
\newcommand{\ibtil}{\tilde{\ib}}
\newcommand{\jbtil}{\tilde{\jb}}
\newcommand{\kbtil}{\tilde{\kb}}
\newcommand{\lbtil}{\tilde{\lb}}
\newcommand{\mbtil}{\tilde{\mb}}
\newcommand{\nbtil}{\tilde{\nbb}}
\newcommand{\obtil}{\tilde{\ob}}
\newcommand{\pbtil}{\tilde{\pb}}
\newcommand{\qbtil}{\tilde{\qb}}
\newcommand{\rbtil}{\tilde{\rb}}
\newcommand{\sbtil}{\tilde{\sbb}}
\newcommand{\tbtil}{\tilde{\tb}}
\newcommand{\ubtil}{\tilde{\ub}}
\newcommand{\vbtil}{\tilde{\vb}}
\newcommand{\wbtil}{\tilde{\wb}}
\newcommand{\xbtil}{\tilde{\xb}}
\newcommand{\ybtil}{\tilde{\yb}}
\newcommand{\zbtil}{\tilde{\zb}}


\newcommand{\atil}{\tilde{a}}
\newcommand{\btil}{\tilde{b}}
\newcommand{\ctil}{\tilde{c}}
\newcommand{\dtil}{\tilde{d}}
\newcommand{\etil}{\tilde{e}}
\newcommand{\ftil}{\tilde{f}}
\newcommand{\gtil}{\tilde{g}}
\newcommand{\htil}{\tilde{h}}
\newcommand{\itil}{\tilde{i}}
\newcommand{\jtil}{\tilde{j}}
\newcommand{\ktil}{\tilde{k}}
\newcommand{\ltil}{\tilde{l}}
\newcommand{\mtil}{\tilde{m}}
\newcommand{\ntil}{\tilde{n}}
\newcommand{\otil}{\tilde{o}}
\newcommand{\ptil}{\tilde{p}}
\newcommand{\qtil}{\tilde{q}}
\newcommand{\rtil}{\tilde{r}}
\newcommand{\stil}{\tilde{s}}
\newcommand{\ttil}{\tilde{t}}
\newcommand{\util}{\tilde{u}}
\newcommand{\vtil}{\tilde{v}}
\newcommand{\wtil}{\tilde{w}}
\newcommand{\xtil}{\tilde{x}}
\newcommand{\ytil}{\tilde{y}}
\newcommand{\ztil}{\tilde{z}}

\newcommand{\Atil}{\tilde{A}}
\newcommand{\Btil}{\tilde{B}}
\newcommand{\Ctil}{\tilde{C}}
\newcommand{\Dtil}{\tilde{D}}
\newcommand{\Etil}{\tilde{E}}
\newcommand{\Ftil}{\tilde{F}}
\newcommand{\Gtil}{\tilde{G}}
\newcommand{\Htil}{\tilde{H}}
\newcommand{\Itil}{\tilde{I}}
\newcommand{\Jtil}{\tilde{J}}
\newcommand{\Ktil}{\tilde{K}}
\newcommand{\Ltil}{\tilde{L}}
\newcommand{\Mtil}{\tilde{M}}
\newcommand{\Ntil}{\tilde{N}}
\newcommand{\Otil}{\tilde{O}}
\newcommand{\Ptil}{\tilde{P}}
\newcommand{\Qtil}{\tilde{Q}}
\newcommand{\Rtil}{\tilde{R}}
\newcommand{\Stil}{\tilde{S}}
\newcommand{\Ttil}{\tilde{T}}
\newcommand{\Util}{\tilde{U}}
\newcommand{\Vtil}{\tilde{V}}
\newcommand{\Wtil}{\tilde{W}}
\newcommand{\Xtil}{\tilde{X}}
\newcommand{\Ytil}{\tilde{Y}}
\newcommand{\Ztil}{\tilde{Z}}

\newcommand{\abar}{\bar{a}}
\newcommand{\bbar}{\bar{b}}
\newcommand{\cbar}{\bar{c}}
\newcommand{\dbar}{\bar{d}}
\newcommand{\ebar}{\bar{e}}
\newcommand{\fbar}{\bar{f}}
\newcommand{\gbar}{\bar{g}}
\newcommand{\hbr}{\bar{h}}
\newcommand{\ibar}{\bar{i}}
\newcommand{\jbar}{\bar{j}}
\newcommand{\kbar}{\bar{k}}
\newcommand{\lbar}{\bar{l}}
\newcommand{\mbar}{\bar{m}}
\newcommand{\nbar}{\bar{n}}
\newcommand{\obar}{\bar{o}}
\newcommand{\pbar}{\bar{p}}
\newcommand{\qbar}{\bar{q}}
\newcommand{\rbar}{\bar{r}}
\newcommand{\sbar}{\bar{s}}
\newcommand{\tbar}{\bar{t}}
\newcommand{\ubar}{\bar{u}}
\newcommand{\vbar}{\bar{v}}
\newcommand{\wbar}{\bar{w}}
\newcommand{\xbar}{\bar{x}}
\newcommand{\ybar}{\bar{y}}
\newcommand{\zbar}{\bar{z}}

\newcommand{\abbar}{\bar{\ab}}
\newcommand{\bbbar}{\bar{\bb}}
\newcommand{\cbbar}{\bar{\cbb}}
\newcommand{\dbbar}{\bar{\db}}
\newcommand{\ebbar}{\bar{\eb}}
\newcommand{\fbbar}{\bar{\fb}}
\newcommand{\gbbar}{\bar{\gb}}
\newcommand{\hbbar}{\bar{\hb}}
\newcommand{\ibbar}{\bar{\ib}}
\newcommand{\jbbar}{\bar{\jb}}
\newcommand{\kbbar}{\bar{\kb}}
\newcommand{\lbbar}{\bar{\lb}}
\newcommand{\mbbar}{\bar{\mb}}
\newcommand{\nbbar}{\bar{\nbb}}
\newcommand{\obbar}{\bar{\ob}}
\newcommand{\pbbar}{\bar{\pb}}
\newcommand{\qbbar}{\bar{\qb}}
\newcommand{\rbbar}{\bar{\rb}}
\newcommand{\sbbar}{\bar{\sbb}}
\newcommand{\tbbar}{\bar{\tb}}
\newcommand{\ubbar}{\bar{\ub}}
\newcommand{\vbbar}{\bar{\vb}}
\newcommand{\wbbar}{\bar{\wb}}
\newcommand{\xbbar}{\bar{\xb}}
\newcommand{\ybbar}{\bar{\yb}}
\newcommand{\zbbar}{\bar{\zb}}

% \newcommand{\Ab}{\bm{A}}
% \newcommand{\Bb}{\bm{B}}
% \newcommand{\Cb}{\bm{C}}
% \newcommand{\Db}{\bm{D}}
% \newcommand{\Eb}{\bm{E}}
% \newcommand{\Fb}{\bm{F}}
% \newcommand{\Gb}{\bm{G}}
% \newcommand{\Hb}{\bm{H}}
% \newcommand{\Ib}{\bm{I}}
% \newcommand{\Jb}{\bm{J}}
% \newcommand{\Kb}{\bm{K}}
% \newcommand{\Lb}{\bm{L}}
% \newcommand{\Mb}{\bm{M}}
% \newcommand{\Nb}{\bm{N}}
% \newcommand{\Ob}{\bm{O}}
% \newcommand{\Pb}{\bm{P}}
% \newcommand{\Qb}{\bm{Q}}
% \newcommand{\Rb}{\bm{R}}
% \newcommand{\Sbb}{\bm{S}}
% \newcommand{\Tb}{\bm{T}}
% \newcommand{\Ub}{\bm{U}}
% \newcommand{\Vb}{\bm{V}}
% \newcommand{\Wb}{\bm{W}}
% \newcommand{\Xb}{\bm{X}}
% \newcommand{\Yb}{\bm{Y}}
% \newcommand{\Zb}{\bm{Z}}

\newcommand{\Ab}{\mathbf{A}}
\newcommand{\Bb}{\mathbf{B}}
\newcommand{\Cb}{\mathbf{C}}
\newcommand{\Db}{\mathbf{D}}
\newcommand{\Eb}{\mathbf{E}}
\newcommand{\Fb}{\mathbf{F}}
\newcommand{\Gb}{\mathbf{G}}
\newcommand{\Hb}{\mathbf{H}}
\newcommand{\Ib}{\mathbf{I}}
\newcommand{\Jb}{\mathbf{J}}
\newcommand{\Kb}{\mathbf{K}}
\newcommand{\Lb}{\mathbf{L}}
\newcommand{\Mb}{\mathbf{M}}
\newcommand{\Nb}{\mathbf{N}}
\newcommand{\Ob}{\mathbf{O}}
\newcommand{\Pb}{\mathbf{P}}
\newcommand{\Qb}{\mathbf{Q}}
\newcommand{\Rb}{\mathbf{R}}
\newcommand{\Sbb}{\mathbf{S}}
\newcommand{\Tb}{\mathbf{T}}
\newcommand{\Ub}{\mathbf{U}}
\newcommand{\Vb}{\mathbf{V}}
\newcommand{\Wb}{\mathbf{W}}
\newcommand{\Xb}{\mathbf{X}}
\newcommand{\Yb}{\mathbf{Y}}
\newcommand{\Zb}{\mathbf{Z}}

% \newcommand{\Abtil}{\tilde{\Ab}}
% \newcommand{\Bbtil}{\tilde{\Bb}}
% \newcommand{\Cbtil}{\tilde{\Cb}}
% \newcommand{\Dbtil}{\tilde{\Db}}
% \newcommand{\Ebtil}{\tilde{\Eb}}
% \newcommand{\Fbtil}{\tilde{\Fb}}
% \newcommand{\Gbtil}{\tilde{\Gb}}
% \newcommand{\Hbtil}{\tilde{\Hb}}
% \newcommand{\Ibtil}{\tilde{\Ib}}
% \newcommand{\Jbtil}{\tilde{\Jb}}
% \newcommand{\Kbtil}{\tilde{\Kb}}
% \newcommand{\Lbtil}{\tilde{\Lb}}
% \newcommand{\Mbtil}{\tilde{\Mb}}
% \newcommand{\Nbtil}{\tilde{\Nb}}
% \newcommand{\Obtil}{\tilde{\Ob}}
% \newcommand{\Pbtil}{\tilde{\Pb}}
% \newcommand{\Qbtil}{\tilde{\Qb}}
% \newcommand{\Rbtil}{\tilde{\Rb}}
% \newcommand{\Sbtil}{\tilde{\Sbb}}
% \newcommand{\Tbtil}{\tilde{\Tb}}
% \newcommand{\Ubtil}{\tilde{\Ub}}
% \newcommand{\Vbtil}{\tilde{\Vb}}
% \newcommand{\Wbtil}{\tilde{\Wb}}
% \newcommand{\Xbtil}{\tilde{\Xb}}
% \newcommand{\Ybtil}{\tilde{\Yb}}
% \newcommand{\Zbtil}{\tilde{\Zb}}

\newcommand{\Abar}{\bar{A}}
\newcommand{\Bbar}{\bar{B}}
\newcommand{\Cbar}{\bar{C}}
\newcommand{\Dbar}{\bar{D}}
\newcommand{\Ebar}{\bar{E}}
\newcommand{\Fbar}{\bar{F}}
\newcommand{\Gbar}{\bar{G}}
\newcommand{\Hbar}{\bar{H}}
\newcommand{\Ibar}{\bar{I}}
\newcommand{\Jbar}{\bar{J}}
\newcommand{\Kbar}{\bar{K}}
\newcommand{\Lbar}{\bar{L}}
\newcommand{\Mbar}{\bar{M}}
\newcommand{\Nbar}{\bar{N}}
\newcommand{\Obar}{\bar{O}}
\newcommand{\Pbar}{\bar{P}}
\newcommand{\Qbar}{\bar{Q}}
\newcommand{\Rbar}{\bar{R}}
\newcommand{\Sbar}{\bar{S}}
\newcommand{\Tbar}{\bar{T}}
\newcommand{\Ubar}{\bar{U}}
\newcommand{\Vbar}{\bar{V}}
\newcommand{\Wbar}{\bar{W}}
\newcommand{\Xbar}{\bar{X}}
\newcommand{\Ybar}{\bar{Y}}
\newcommand{\Zbar}{\bar{Z}}

\newcommand{\Abbar}{\bar{\Ab}}
\newcommand{\Bbbar}{\bar{\Bb}}
\newcommand{\Cbbar}{\bar{\Cb}}
\newcommand{\Dbbar}{\bar{\Db}}
\newcommand{\Ebbar}{\bar{\Eb}}
\newcommand{\Fbbar}{\bar{\Fb}}
\newcommand{\Gbbar}{\bar{\Gb}}
\newcommand{\Hbbar}{\bar{\Hb}}
\newcommand{\Ibbar}{\bar{\Ib}}
\newcommand{\Jbbar}{\bar{\Jb}}
\newcommand{\Kbbar}{\bar{\Kb}}
\newcommand{\Lbbar}{\bar{\Lb}}
\newcommand{\Mbbar}{\bar{\Mb}}
\newcommand{\Nbbar}{\bar{\Nb}}
\newcommand{\Obbar}{\bar{\Ob}}
\newcommand{\Pbbar}{\bar{\Pb}}
\newcommand{\Qbbar}{\bar{\Qb}}
\newcommand{\Rbbar}{\bar{\Rb}}
\newcommand{\Sbbar}{\bar{\Sbb}}
\newcommand{\Tbbar}{\bar{\Tb}}
\newcommand{\Ubbar}{\bar{\Ub}}
\newcommand{\Vbbar}{\bar{\Vb}}
\newcommand{\Wbbar}{\bar{\Wb}}
\newcommand{\Xbbar}{\bar{\Xb}}
\newcommand{\Ybbar}{\bar{\Yb}}
\newcommand{\Zbbar}{\bar{\Zb}}

\newcommand{\Ahat}{\widehat{A}}
\newcommand{\Bhat}{\widehat{B}}
\newcommand{\Chat}{\widehat{C}}
\newcommand{\Dhat}{\widehat{D}}
\newcommand{\Ehat}{\widehat{E}}
\newcommand{\Fhat}{\widehat{F}}
\newcommand{\Ghat}{\widehat{G}}
\newcommand{\Hhat}{\widehat{H}}
\newcommand{\Ihat}{\widehat{I}}
\newcommand{\Jhat}{\widehat{J}}
\newcommand{\Khat}{\widehat{K}}
\newcommand{\Lhat}{\widehat{L}}
\newcommand{\Mhat}{\widehat{M}}
\newcommand{\Nhat}{\widehat{N}}
\newcommand{\Ohat}{\widehat{O}}
\newcommand{\Phat}{\widehat{P}}
\newcommand{\Qhat}{\widehat{Q}}
\newcommand{\Rhat}{\widehat{R}}
\newcommand{\Shat}{\widehat{S}}
\newcommand{\That}{\widehat{T}}
\newcommand{\Uhat}{\widehat{U}}
\newcommand{\Vhat}{\widehat{V}}
\newcommand{\What}{\widehat{W}}
\newcommand{\Xhat}{\widehat{X}}
\newcommand{\Yhat}{\widehat{Y}}
\newcommand{\Zhat}{\widehat{Z}}

\newcommand{\ahat}{\widehat{a}}
\newcommand{\bhat}{\widehat{b}}
\newcommand{\chat}{\widehat{c}}
\newcommand{\dhat}{\widehat{d}}
\newcommand{\ehat}{\widehat{e}}
\newcommand{\fhat}{\widehat{f}}
\newcommand{\ghat}{\widehat{g}}
\newcommand{\hhat}{\widehat{h}}
\newcommand{\ihat}{\widehat{i}}
\newcommand{\jhat}{\widehat{j}}
\newcommand{\khat}{\widehat{k}}
\newcommand{\lhat}{\widehat{l}}
\newcommand{\mhat}{\widehat{m}}
\newcommand{\nhat}{\widehat{n}}
\newcommand{\ohat}{\widehat{o}}
\newcommand{\phat}{\widehat{p}}
\newcommand{\qhat}{\widehat{q}}
\newcommand{\rhat}{\widehat{r}}
\newcommand{\shat}{\widehat{s}}
\newcommand{\that}{\widehat{t}}
\newcommand{\uhat}{\widehat{u}}
\newcommand{\vhat}{\widehat{v}}
\newcommand{\what}{\widehat{w}}
\newcommand{\xhat}{\widehat{x}}
\newcommand{\yhat}{\widehat{y}}
\newcommand{\zhat}{\widehat{z}}

\newcommand{\Abhat}{\hat{\Ab}}
\newcommand{\Bbhat}{\hat{\Bb}}
\newcommand{\Cbhat}{\hat{\Cb}}
\newcommand{\Dbhat}{\hat{\Db}}
\newcommand{\Ebhat}{\hat{\Eb}}
\newcommand{\Fbhat}{\hat{\Fb}}
\newcommand{\Gbhat}{\hat{\Gb}}
\newcommand{\Hbhat}{\hat{\Hb}}
\newcommand{\Ibhat}{\hat{\Ib}}
\newcommand{\Jbhat}{\hat{\Jb}}
\newcommand{\Kbhat}{\hat{\Kb}}
\newcommand{\Lbhat}{\hat{\Lb}}
\newcommand{\Mbhat}{\hat{\Mb}}
\newcommand{\Nbhat}{\hat{\Nb}}
\newcommand{\Obhat}{\hat{\Ob}}
\newcommand{\Pbhat}{\hat{\Pb}}
\newcommand{\Qbhat}{\hat{\Qb}}
\newcommand{\Rbhat}{\hat{\Rb}}
\newcommand{\Sbhat}{\hat{\Sbb}}
\newcommand{\Tbhat}{\hat{\Tb}}
\newcommand{\Ubhat}{\hat{\Ub}}
\newcommand{\Vbhat}{\hat{\Vb}}
\newcommand{\Wbhat}{\hat{\Wb}}
\newcommand{\Xbhat}{\hat{\Xb}}
\newcommand{\Ybhat}{\hat{\Yb}}
\newcommand{\Zbhat}{\hat{\Zb}}

\newcommand{\Acal}{\mathcal{A}}
\newcommand{\Bcal}{\mathcal{B}}
\newcommand{\Ccal}{\mathcal{C}}
\newcommand{\Dcal}{\mathcal{D}}
\newcommand{\Ecal}{\mathcal{E}}
\newcommand{\Fcal}{\mathcal{F}}
\newcommand{\Gcal}{\mathcal{G}}
\newcommand{\Hcal}{\mathcal{H}}
\newcommand{\Ical}{\mathcal{I}}
\newcommand{\Jcal}{\mathcal{J}}
\newcommand{\Kcal}{\mathcal{K}}
\newcommand{\Lcal}{\mathcal{L}}
\newcommand{\Mcal}{\mathcal{M}}
\newcommand{\Ncal}{\mathcal{N}}
\newcommand{\Ocal}{\mathcal{O}}
\newcommand{\Pcal}{\mathcal{P}}
\newcommand{\Qcal}{\mathcal{Q}}
\newcommand{\Rcal}{\mathcal{R}}
\newcommand{\Scal}{{\mathcal{S}}}
\newcommand{\Tcal}{{\mathcal{T}}}
\newcommand{\Ucal}{\mathcal{U}}
\newcommand{\Vcal}{\mathcal{V}}
\newcommand{\Wcal}{\mathcal{W}}
\newcommand{\Xcal}{\mathcal{X}}
\newcommand{\Ycal}{\mathcal{Y}}
\newcommand{\Zcal}{\mathcal{Z}}

\newcommand{\Ascr}{\mathscr{A}}
\newcommand{\Bscr}{\mathscr{B}}
\newcommand{\Cscr}{\mathscr{C}}
\newcommand{\Dscr}{\mathscr{D}}
\newcommand{\Escr}{\mathscr{E}}
\newcommand{\Fscr}{\mathscr{F}}
\newcommand{\Gscr}{\mathscr{G}}
\newcommand{\Hscr}{\mathscr{H}}
\newcommand{\Iscr}{\mathscr{I}}
\newcommand{\Jscr}{\mathscr{J}}
\newcommand{\Kscr}{\mathscr{K}}
\newcommand{\Lscr}{\mathscr{L}}
\newcommand{\Mscr}{\mathscr{M}}
\newcommand{\Nscr}{\mathscr{N}}
\newcommand{\Oscr}{\mathscr{O}}
\newcommand{\Pscr}{\mathscr{P}}
\newcommand{\Qscr}{\mathscr{Q}}
\newcommand{\Rscr}{\mathscr{R}}
\newcommand{\Sscr}{{\mathscr{S}}}
\newcommand{\Tscr}{{\mathscr{T}}}
\newcommand{\Uscr}{\mathscr{U}}
\newcommand{\Vscr}{\mathscr{V}}
\newcommand{\Wscr}{\mathscr{W}}
\newcommand{\Xscr}{\mathscr{X}}
\newcommand{\Yscr}{\mathscr{Y}}
\newcommand{\Zscr}{\mathscr{Z}}

\newcommand{\Afra}{\mathfrak{A}}
\newcommand{\Bfra}{\mathfrak{B}}
\newcommand{\Cfra}{\mathfrak{C}}
\newcommand{\Dfra}{\mathfrak{D}}
\newcommand{\Efra}{\mathfrak{E}}
\newcommand{\Ffra}{\mathfrak{F}}
\newcommand{\Gfra}{\mathfrak{G}}
\newcommand{\Hfra}{\mathfrak{H}}
\newcommand{\Ifra}{\mathfrak{I}}
\newcommand{\Jfra}{\mathfrak{J}}
\newcommand{\Kfra}{\mathfrak{K}}
\newcommand{\Lfra}{\mathfrak{L}}
\newcommand{\Mfra}{\mathfrak{M}}
\newcommand{\Nfra}{\mathfrak{N}}
\newcommand{\Ofra}{\mathfrak{O}}
\newcommand{\Pfra}{\mathfrak{P}}
\newcommand{\Qfra}{\mathfrak{Q}}
\newcommand{\Rfra}{\mathfrak{R}}
\newcommand{\Sfra}{{\mathfrak{S}}}
\newcommand{\Tfra}{{\mathfrak{T}}}
\newcommand{\Ufra}{\mathfrak{U}}
\newcommand{\Vfra}{\mathfrak{V}}
\newcommand{\Wfra}{\mathfrak{W}}
\newcommand{\Xfra}{\mathfrak{X}}
\newcommand{\Yfra}{\mathfrak{Y}}
\newcommand{\Zfra}{\mathfrak{Z}}


\newcommand{\Acalb}{\bm{\Acal}}
\newcommand{\Bcalb}{\bm{\Bcal}}
\newcommand{\Ccalb}{\bm{\Ccal}}
\newcommand{\Dcalb}{\bm{\Dcal}}
\newcommand{\Ecalb}{\bm{\Ecal}}
\newcommand{\Fcalb}{\bm{\Fcal}}
\newcommand{\Gcalb}{\bm{\Gcal}}
\newcommand{\Hcalb}{\bm{\Hcal}}
\newcommand{\Icalb}{\bm{\Ical}}
\newcommand{\Jcalb}{\bm{\Jcal}}
\newcommand{\Kcalb}{\bm{\Kcal}}
\newcommand{\Lcalb}{\bm{\Lcal}}
\newcommand{\Mcalb}{\bm{\Mcal}}
\newcommand{\Ncalb}{\bm{\Ncal}}
\newcommand{\Ocalb}{\bm{\Ocal}}
\newcommand{\Pcalb}{\bm{\Pcal}}
\newcommand{\Qcalb}{\bm{\Qcal}}
\newcommand{\Rcalb}{\bm{\Rcal}}
\newcommand{\Scalb}{\bm{\Scal}}
\newcommand{\Tcalb}{\bm{\Tcal}}
\newcommand{\Ucalb}{\bm{\Ucal}}
\newcommand{\Vcalb}{\bm{\Vcal}}
\newcommand{\Wcalb}{\bm{\Wcal}}
\newcommand{\Xcalb}{\bm{\Xcal}}
\newcommand{\Ycalb}{\bm{\Ycal}}
\newcommand{\Zcalb}{\bm{\Zcal}}

\newcommand{\Ascrb}{\bm{\Ascr}}
\newcommand{\Bscrb}{\bm{\Bscr}}
\newcommand{\Cscrb}{\bm{\Cscr}}
\newcommand{\Dscrb}{\bm{\Dscr}}
\newcommand{\Escrb}{\bm{\Escr}}
\newcommand{\Fscrb}{\bm{\Fscr}}
\newcommand{\Gscrb}{\bm{\Gscr}}
\newcommand{\Hscrb}{\bm{\Hscr}}
\newcommand{\Iscrb}{\bm{\Iscr}}
\newcommand{\Jscrb}{\bm{\Jscr}}
\newcommand{\Kscrb}{\bm{\Kscr}}
\newcommand{\Lscrb}{\bm{\Lscr}}
\newcommand{\Mscrb}{\bm{\Mscr}}
\newcommand{\Nscrb}{\bm{\Nscr}}
\newcommand{\Oscrb}{\bm{\Oscr}}
\newcommand{\Pscrb}{\bm{\Pscr}}
\newcommand{\Qscrb}{\bm{\Qscr}}
\newcommand{\Rscrb}{\bm{\Rscr}}
\newcommand{\Sscrb}{\bm{\Sscr}}
\newcommand{\Tscrb}{\bm{\Tscr}}
\newcommand{\Uscrb}{\bm{\Uscr}}
\newcommand{\Vscrb}{\bm{\Vscr}}
\newcommand{\Wscrb}{\bm{\Wscr}}
\newcommand{\Xscrb}{\bm{\Xscr}}
\newcommand{\Yscrb}{\bm{\Yscr}}
\newcommand{\Zscrb}{\bm{\Zscr}}

\newcommand{\Afrab}{\bm{\Afra}}
\newcommand{\Bfrab}{\bm{\Bfra}}
\newcommand{\Cfrab}{\bm{\Cfra}}
\newcommand{\Dfrab}{\bm{\Dfra}}
\newcommand{\Efrab}{\bm{\Efra}}
\newcommand{\Ffrab}{\bm{\Ffra}}
\newcommand{\Gfrab}{\bm{\Gfra}}
\newcommand{\Hfrab}{\bm{\Hfra}}
\newcommand{\Ifrab}{\bm{\Ifra}}
\newcommand{\Jfrab}{\bm{\Jfra}}
\newcommand{\Kfrab}{\bm{\Kfra}}
\newcommand{\Lfrab}{\bm{\Lfra}}
\newcommand{\Mfrab}{\bm{\Mfra}}
\newcommand{\Nfrab}{\bm{\Nfra}}
\newcommand{\Ofrab}{\bm{\Ofra}}
\newcommand{\Pfrab}{\bm{\Pfra}}
\newcommand{\Qfrab}{\bm{\Qfra}}
\newcommand{\Rfrab}{\bm{\Rfra}}
\newcommand{\Sfrab}{\bm{\Sfra}}
\newcommand{\Tfrab}{\bm{\Tfra}}
\newcommand{\Ufrab}{\bm{\Ufra}}
\newcommand{\Vfrab}{\bm{\Vfra}}
\newcommand{\Wfrab}{\bm{\Wfra}}
\newcommand{\Xfrab}{\bm{\Xfra}}
\newcommand{\Yfrab}{\bm{\Yfra}}
\newcommand{\Zfrab}{\bm{\Zfra}}

\newcommand{\Atilde}{\widetilde{A}}
\newcommand{\Btilde}{\widetilde{B}}
\newcommand{\Ctilde}{\widetilde{C}}
\newcommand{\Dtilde}{\widetilde{D}}
\newcommand{\Etilde}{\widetilde{E}}
\newcommand{\Ftilde}{\widetilde{F}}
\newcommand{\Gtilde}{\widetilde{G}}
\newcommand{\Htilde}{\widetilde{H}}
\newcommand{\Itilde}{\widetilde{I}}
\newcommand{\Jtilde}{\widetilde{J}}
\newcommand{\Ktilde}{\widetilde{K}}
\newcommand{\Ltilde}{\widetilde{L}}
\newcommand{\Mtilde}{\widetilde{M}}
\newcommand{\Ntilde}{\widetilde{N}}
\newcommand{\Otilde}{\widetilde{O}}
\newcommand{\Ptilde}{\widetilde{P}}
\newcommand{\Qtilde}{\widetilde{Q}}
\newcommand{\Rtilde}{\widetilde{R}}
\newcommand{\Stilde}{\widetilde{S}}
\newcommand{\Ttilde}{\widetilde{T}}
\newcommand{\Utilde}{\widetilde{U}}
\newcommand{\Vtilde}{\widetilde{V}}
\newcommand{\Wtilde}{\widetilde{W}}
\newcommand{\Xtilde}{\widetilde{X}}
\newcommand{\Ytilde}{\widetilde{Y}}
\newcommand{\Ztilde}{\widetilde{Z}}


%%%%%%%% Widely accepted definitions %%%%%%%%%%%%%%%

\newcommand{\BB}{\mathbb{B}} %
\newcommand{\CC}{\mathbb{C}} % Complex numbers
\newcommand{\EE}{\mathbb{E}} % Expectation
\newcommand{\VV}{\mathbb{V}} % Variance
\newcommand{\II}{\mathbb{I}} % Indicator
\newcommand{\KK}{\mathbb{K}} % Arbitrary field
\newcommand{\LL}{\mathbb{L}} % Loss
\newcommand{\MM}{\mathbb{M}} % Median
\newcommand{\NN}{\mathbb{N}} % Natural numbers
\newcommand{\PP}{\mathbb{P}} % Probability
\newcommand{\QQ}{\mathbb{Q}} % Rationals
\newcommand{\RR}{\mathbb{R}} % Real numbers
\newcommand{\ZZ}{\mathbb{Z}} % Integers
\newcommand{\XX}{\mathbb{X}} %
\newcommand{\YY}{\mathbb{Y}} %

\newcommand{\one}{\mathbf{1}}  % Identity
\newcommand{\zero}{\mathbf{0}} % Zero
%\newcommand{\TRUE}{\mathbf{TRUE}}  % True
%\newcommand{\FALSE}{\mathbf{FALSE}}  % False

\newcommand*{\mini}{\mathop{\mathrm{minimize}}}
\newcommand*{\maxi}{\mathop{\mathrm{maximize}}}
\newcommand*{\argmin}{\mathop{\mathrm{argmin}}}
\newcommand*{\argmax}{\mathop{\mathrm{argmax}}}
\newcommand*{\st}{\mathop{\mathrm{s.t.}}}
\newcommand{\sgn}{\mathop{\mathrm{sign}}}
\newcommand{\tr}{\mathop{\mathrm{tr}}}
\newcommand{\diag}{\mathop{\mathrm{diag}}}
\newcommand{\rank}{\mathop{\mathrm{rank}}}
\newcommand{\ovec}{\mathop{\mathrm{vec}}}
\newcommand{\traj}{\mathop{\mathrm{Traj}}}
\newcommand*{\cov}{\mathrm{Cov}}
\newcommand*{\conv}{\mathrm{conv}}
\newcommand*{\const}{\mathrm{constant}}

%%%%%%%% Bold Greek Letters %%%%%%%%%%%%%%%
\newcommand{\sigmab}{\bm{\sigma}}
\newcommand{\Sigmab}{\mathbf{\Sigma}}


%%%%%%%% Mess around with LaTeX %%%%%%%%%%%%%%%

%% Some style files might actually define these variables.
%% So don't mess with them if they are already defined

\ifx\BlackBox\undefined
\newcommand{\BlackBox}{\rule{1.5ex}{1.5ex}}  % end of proof
\fi

\ifx\QED\undefined
\def\QED{~\rule[-1pt]{5pt}{5pt}\par\medskip}
\fi

\ifx\proof\undefined
\newenvironment{proof}{\par\noindent{\bf Proof\ }}{\hfill\BlackBox\\[2mm]}
%\newenvironment{proof}{\emph{Proof. }}{ \hfill \QED}
\fi

\ifx\theorem\undefined
\newtheorem{theorem}{Theorem}
\newtheorem{example}{Example}
\newtheorem{property}{Property}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{remark}[theorem]{Remark}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{conjecture}[theorem]{Conjecture}
\newtheorem{assumption}{Assumption}
\fi

\ifx\axiom\undefined
\newtheorem{axiom}[theorem]{Axiom}
\fi

%%%%%%%% Utility functions %%%%%%%%%%%%%%%

\newcommand{\eq}[1]{(\ref{#1})}
\newcommand{\mymatrix}[2]{\left[\begin{array}{#1} #2 \end{array}\right]}
\newcommand{\mychoose}[2]{\left(\begin{array}{c} #1 \\ #2 \end{array}\right)}
\newcommand{\mydet}[1]{\det\left[ #1 \right]}
\newcommand{\sembrack}[1]{[\![#1]\!]}

\newcommand{\ea}{\emph{et al.}}
\newcommand{\eg}{\emph{e.g.}}
\newcommand{\ie}{\emph{i.e.}}
\newcommand{\iid}{\emph{i.i.d.}}
\newcommand{\etc}{\emph{etc.}}

%\newcommand{\alex}[1]{{\bf ALEX: \uppercase{#1}}}
%\newcommand{\vishy}[1]{{\bf VISHY: \uppercase{#1}}}
%\newcommand{\rene}[1]{{\bf RENE: \uppercase{#1}}}
%\newcommand{\karsten}[1]{{\bf KARSTEN: \uppercase{#1}}}

%%%%%%%% Specific symbols for this project %%%%%%%%%%%%%%%

\newcommand{\methodname}{KDE}
\newcommand{\ind}{\boldsymbol{\mathsf{I}}}

\newcommand{\hsic}{\mathrm{HSIC}}
\newcommand{\mmd}{\mathrm{MMD}}

%\newcommand{\tDiag}{\textsf{Diag}}
%\newcommand{\tTr}{\textsf{Tr}}
%\newcommand{\tE}{\textsf{E}}
%\newcommand{\tVec}{\textsf{Vec}}
%\newcommand{\tRank}{\textsf{Rank}}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% math symbols and commands
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\newcommand{\eq}[1]{(\ref{#1})}
%\newcommand{\mymatrix}[2]{\left[\begin{array}{#1} #2 \end{array}\right]}

%brackets
\newcommand{\inner}[2]{\left\langle #1,#2 \right\rangle}
\newcommand{\rbr}[1]{\left(#1\right)}
\newcommand{\sbr}[1]{\left[#1\right]}
\newcommand{\cbr}[1]{\left\{#1\right\}}
\newcommand{\nbr}[1]{\left\|#1\right\|}
\newcommand{\abr}[1]{\left|#1\right|}
\newcommand{\smallfrac}[2]{{\textstyle \frac{#1}{#2}}}
\renewcommand{\url}[1]{{\sffamily #1}}
\newcommand{\arow}[2]{#1_{#2\cdot}}
\newcommand{\acol}[2]{#1_{\cdot#2}}
\def\ci{\perp\!\!\!\perp}

\newcommand{\ssbr}[1]{\left[\!\left[#1\right]\!\right]}

\newcommand{\twoco}[1]{\multicolumn{2}{c|}{#1}}

\newcommand{\wtimes}[1]{\times_{#1}}
\newcommand{\btimes}[1]{~\bar{\times}_{#1}~}


\newcommand{\secref}[1]{Section~\ref{#1}}
\newcommand{\eqnref}[1]{Eqn~(\ref{#1})}
\newcommand{\eqnsref}[1]{Eqns~(\ref{#1})}
\newcommand{\appref}[1]{Appendix~\ref{#1}}
\newcommand{\algtabref}[1]{Algorithm~\ref{#1}}
\newcommand{\lemref}[1]{Lemma~\ref{#1}}
\newcommand{\propref}[1]{Proposition~\ref{#1}}
\newcommand{\thmref}[1]{Theorem~\ref{#1}}
\newcommand{\corref}[1]{Corollary~\ref{#1}}
\newcommand{\asmpref}[1]{Assumption~\ref{#1}}

\newcommand{\tabref}[1]{Table~\ref{#1}}
\newcommand{\figref}[1]{Figure~\ref{#1}}


\newcommand{\defeq}{:=}

\usepackage{xr}
\usepackage{hyperref}
\makeatletter
\newcommand*{\addFileDependency}[1]{% argument=file name and extension
  \typeout{(#1)}
  \@addtofilelist{#1}
  \IfFileExists{#1}{}{\typeout{No file #1.}}
}
\makeatother

\newcommand*{\myexternaldocument}[1]{%
    \externaldocument{#1}%
    \addFileDependency{#1.tex}%
    \addFileDependency{#1.aux}%
}

\myexternaldocument{ren_338-supp}

\usepackage{url}

\hypersetup{
    colorlinks=true,
    linkcolor=blue,
    citecolor=cyan,
    filecolor=green,      
    urlcolor=black,
}

\newcommand{\AlgName}{{Spectral Dynamics Embedding}\xspace}
\newcommand{\algabb}{{SPEDE}\xspace}
\newcommand*{\dif}{\mathop{}\!\mathrm{d}}

\usepackage{xcolor}
\newcommand{\Bo}[1]{{\color{blue} [Bo: #1]}}
\newcommand{\Tongzheng}[1]{{\color{red} [Tongzheng: #1]}}


\allowdisplaybreaks

\title{A Free Lunch from the Noise:\\
Provable and Practical Exploration for Representation Learning}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1, 2, $^\star$]{\href{mailto:<tongzheng@utexas.edu>?Subject=Your UAI 2022 paper}{Tongzheng Ren }{}}
\author[3, $^\star$]{\href{mailto:<tianjunz@berkeley.edu>?Subject=Your UAI 2022 paper}{Tianjun Zhang }{}}
\author[4, 5]{Csaba Szepesv\'{a}ri~}
\author[2]{\href{mailto:<bodai@google.com>?Subject=Your UAI 2022 paper}{Bo Dai}{}
}
% Add affiliations after the authors
\affil[1]{%
    Department of Computer Science, UT Austin
}
\affil[2]{%
    Google Research, Brain Team
}
\affil[3]{
    Department of EECS, UC Berkeley
  }
\affil[4]{DeepMind}
\affil[5]{Department of Computer Science, University of Alberta}

\begin{document}
\maketitle

\begin{abstract}
Representation learning lies at the heart of the empirical success of deep learning in dealing with the curse of dimensionality. However, the power of representation learning has not yet been fully exploited in reinforcement learning (RL), due to {\bf i)} the trade-off between expressiveness and tractability; and {\bf ii)} the coupling between exploration and representation learning. In this paper, we first reveal that, under certain noise assumptions in the stochastic control model, we can obtain the linear spectral feature of its corresponding Markov transition operator in closed form \emph{for free}. Based on this observation, we propose~\emph{\AlgName~(\algabb)}, which breaks the trade-off and completes optimistic exploration for representation learning by exploiting the structure of the noise. We provide a rigorous theoretical analysis of~\algabb, and demonstrate its superior practical performance over existing state-of-the-art empirical algorithms on several benchmarks.
\end{abstract}

\let\thefootnote\relax\footnotetext{$^\star$ Equal Contribution}



% \vspace{-2mm}
\section{Introduction}
Reinforcement learning~(RL) is dedicated to solving the sequential decision-making problem, in which an agent interacts with an \emph{unknown} environment to find the best policy that maximizes the expected cumulative reward~\citep{sutton2018reinforcement}. It is known that tabular algorithms, which operate directly on the original states and actions in Markov decision processes~(MDPs), achieve the minimax-optimal regret, which depends on the cardinality of the state and action spaces~\citep{jaksch2010near, azar2017minimax,jin2018q}. However, these algorithms become computationally intractable for real-world problems with an enormous number of states.
Learning with function approximation upon a \emph{good} representation is a natural idea for tackling this computational issue, and has already demonstrated its effectiveness in the success of deep learning~\citep{bengio2013representation}. In fact, representation learning lies at the heart of the empirical successes of deep RL in video games~\citep{mnih2013playing}, robotics~\citep{levine2016end}, and Go~\citep{silver2017mastering}, to name a few. Meanwhile, the importance and benefits of the representation in RL have been rigorously justified~\citep{jin2020provably,yang2020reinforcement}, with the regret quantified in terms of the dimension of the \emph{known} representation for a subclass of MDPs~\citep{puterman2014markov}. A natural question arises:
% \vspace{-6mm}
\begin{center}
\emph{How can we design {\bf provably efficient} and {\bf practical} algorithms for representation learning in RL?}
\end{center}
Here, by ``provably efficient'' we mean that the sample complexity of the algorithm can be rigorously characterized solely in terms of the complexity of the representation class, without explicit dependence on the number of states and actions, while by ``practical'' we mean that the algorithm can be implemented and deployed in real-world applications. Therefore, we require not only that the learned representation be expressive enough to handle complex practical environments, but also that the operations in the algorithm be tractable and efficient in computation and memory.
The major difficulty of this question is two-fold:
\begin{itemize}% [leftmargin=14pt,topsep=0pt,parsep=0pt,partopsep=0pt]
    \item[{\bf i)}] The \emph{trade-off} between the expressiveness and the tractability in the design of the representations;

    \item[{\bf ii)}] The learning of representation is intimately \emph{coupled} with exploration.
\end{itemize}

Specifically, a desired representation should be sufficiently expressive\footnote{For a formal definition of expressiveness, see \citep{agarwal2020flambe}.} to capture practical dynamical systems, while remaining computationally tractable. However, in general, an expressive representation leads to complicated optimization in learning.
For example, the representation in the linear MDP is \emph{exponentially stronger} than that of latent variable MDPs in terms of expressiveness~\citep{agarwal2020flambe}. However, its representation learning depends on either an MLE oracle that is computationally intractable due to the constraint on the regularity of the conditional density~\citep{agarwal2020flambe}, or an optimization oracle that can solve a complicated constrained $\min$-$\max$-$\min$-$\max$ optimization~\citep{modi2021model}. On the other hand,~\citet{misra2020kinematic} consider the representation introduced by an encoder in the block MDP~\citep{du2019provably}, in which the learning problem can be solved by regression, but at the price that the representation in the block MDP is even weaker than that of the latent variable MDP~\citep{agarwal2020flambe}.

Meanwhile, the coupling of representation learning and exploration also induces difficulty in \emph{practical} algorithm design and analysis. Specifically, one cannot learn a precise representation without enough experience from comprehensive exploration, while exploration depends on a reliable estimate of the representation. Most of the known results depend on policy-cover-based exploration~\citep{du2019provably,misra2020kinematic,agarwal2020flambe,modi2021model}, which maintains and samples from a set of policies during training for systematic exploration, and thus significantly increases the computation and memory cost in implementation.

In this work, we propose~\emph{\AlgName~(\algabb)},
which handles the aforementioned difficulties appropriately and thus answers the question affirmatively.~\algabb is built on a connection between the stochastic control dynamics~\citep{osband2014model,kakade2020information} and linear MDPs, established in~\secref{sec:algorithm}. Specifically, by exploiting the property of the \emph{noise} in the stochastic control dynamics, we can recover the factorization of its corresponding Markov transition operator in closed form \emph{without extra computation}. This equivalency immediately overcomes the computational intractability of linear MDP estimation via the corresponding control dynamics form, and thus breaks the trade-off between expressiveness and tractability. Meanwhile, as a byproduct, the linear MDP reformulation also enables efficient planning for the optimal policy in nonlinear control through the linear sufficient feature from the spectral space of the Markov operator, whereas in most model-based RL, planning is conducted by treating the learned model as a simulator, and is thus inefficient and sub-optimal.

More importantly, the two faces of one model also provide the opportunity to tackle the coupling between representation learning and exploration. The principle of optimism in the face of uncertainty can be easily implemented through Thompson sampling w.r.t. the stochastic nonlinear dynamics, which implicitly induces a posterior over representations while bypassing the unidentifiability issue in directly characterizing the representation, and can therefore be theoretically justified.

We rigorously characterize the statistical property of~\algabb in terms of regret w.r.t. the complexity of representation class in~\secref{sec:analysis}, without explicit dependence on the size of raw state space and action space. With the established unified view, our results generalize online control~\citep{kakade2020information} and linear MDP~\citep{jin2020provably} beyond \emph{known} features. We finally demonstrate the superiority of~\algabb on the MuJoCo benchmarks in~\secref{sec:experiments}. It significantly outperforms the empirical state-of-the-art RL algorithms. To our knowledge, \algabb is the first representation learning algorithm achieving statistical, computational, and memory efficiency with sufficient expressiveness.  

\subsection{Related Work}
There have been many great attempts at {\bf algorithmic representation learning} in RL for different purposes, \eg, bisimulation~\citep{ferns2004metrics,gelada2019deepmdp} and reconstruction~\citep{hafner2019learning}.
Recently, several works have also considered spectral features based on decomposing different variants of the transition operator, including successor features~\citep{dayan1993improving,kulkarni2016deep}, proto-value functions~\citep{mahadevan2007proto,wu2018laplacian}, spectral state-aggregation~\citep{duan2018state,zhang2019spectral}, and contrastive Fourier features~\citep{nachum2021provable}. These works are closely related to the proposed~\algabb. Besides the fact that these features focus on \emph{state-only} representations, the major differences between~\algabb and these spectral features are: {\bf i)} the target operators in existing spectral features are \emph{state-state} transitions, which cancel the effect of the action; {\bf ii)} the target operators are estimated from empirical data collected by a \emph{fixed behavior policy} under the implicit assumption that the estimated operator is \emph{uniformly accurate}, ignoring the major difficulty in exploration, while \algabb carefully designs systematic exploration with theoretical guarantees; {\bf iii)} most of the existing spectral features rely on an \emph{explicit} decomposition of the operators, while \algabb obtains the spectral features \emph{for free}.

Turning to the {\bf theoretically-justified representation learning with online exploration}, a large body of effort focuses on the policy-cover-based exploration~\citep{du2019provably,misra2020kinematic,agarwal2020flambe,modi2021model}. 
The major difficulty that impedes their practical application is the computation and memory cost: policy-cover-based exploration requires a set of exploratory policies to be maintained and sampled from during training, which can be extremely expensive.
\citet{uehara2021representation} introduced a UCB mechanism that enforces exploration without the need to maintain a policy cover. However, the algorithm requires an MLE oracle for an unnormalized conditional statistical model, which prevented its application in practice until the recent attempt of~\citet{zhang2022making}, which uses contrastive learning to replace the MLE.

Another two related lines of research are {\bf model-based RL} and {\bf online control}, which are commonly regarded as overlapping but separate communities considering different formulations of the dynamics. Our finding bridges these two communities by establishing the equivalency between standard models that are widely considered in the corresponding communities.~\citet{osband2014model} and~\citet{kakade2020information} are the most related to our work in each community. These models generalize their corresponding linear models, \ie,~\citet{jin2020provably} and~\citet{cohen2019learning}, with a general nonlinear model and a kernel function within a known RKHS, respectively. The regret of the optimistic (pessimistic) algorithm has been carefully characterized for these models. However, the algorithms proposed in both~\citet{osband2014model} and~\citet{kakade2020information} require a planning oracle to seek the optimal policy, which might be computationally intractable. In~\algabb, this is easily handled in the equivalent linear MDP.

\section{Preliminaries}

A Markov Decision Process (MDP) is one of the most standard models studied in reinforcement learning, and can be denoted by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, T, \rho, H)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $r:\mathcal{S} \times \mathcal{A} \to \mathbb{R}^+$ is the reward function\footnote{In general, the reward can be stochastic. Here for simplicity we assume the reward is deterministic and known throughout the paper, which is a common assumption in the literature \citep[\eg,][]{jin2018q, jin2020provably, kakade2020information}.} (where $\mathbb{R}^+$ denotes the set of non-negative real numbers), $T:\mathcal{S}\times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition, $\rho$ is the initial state distribution, and $H$ is the horizon\footnote{Our method can be generalized to the infinite horizon case; see Section \ref{sec:practical_alg} for details.} (\ie, the length of each episode). A (potentially non-stationary) policy $\pi$ can be defined as $\{\pi_h\}_{h\in [H]}$ where $\pi_h:\mathcal{S}\to\Delta(\mathcal{A}), \forall h\in [H]$. Following the standard notation, we define the value function $V_h^\pi(s) := \mathbb{E}_{T, \pi}\left[\sum_{t=h}^{H-1} r(s_t, a_t)|s_h = s\right]$ and the action-value function (\ie, the $Q$ function) $Q_h^\pi(s, a) := \mathbb{E}_{T, \pi}\left[\sum_{t=h}^{H-1} r(s_t, a_t)|s_h = s, a_h = a\right]$, which are the expected cumulative rewards under transition $T$ when executing policy $\pi$ starting at step $h$ from $s$ and $(s, a)$, respectively. With these two definitions at hand, it is straightforward to show the following Bellman equation:
\begin{align*}
    Q_{h}^\pi(s_h, a_h) = r(s_h, a_h) + \mathbb{E}_{s_{h+1}\sim T(\cdot|s_h, a_h)} \sbr{V_{h+1}^\pi(s_{h+1})}.
\end{align*}
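As a reading aid (not an additional assumption), note that since the sums above run up to $t = H-1$, the convention $V_H^\pi \equiv 0$ is implied, and the two functions are linked by
\begin{align*}
    V_h^\pi(s) = \mathbb{E}_{a\sim \pi_h(s)}\left[Q_h^\pi(s, a)\right], \quad h = 0, 1, \ldots, H-1,
\end{align*}
so that the Bellman equation can be unrolled backward from $h = H-1$, where $Q_{H-1}^\pi(s, a) = r(s, a)$.
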
Most RL algorithms aim at finding the optimal policy $\pi^* = \mathop{\arg\max}_{\pi} \mathbb{E}_{s\sim \rho} \sbr{V_0^\pi(s)}$ under MDPs. It is well known that in the tabular setting, where the state space and action space are finite, we can provably identify the optimal policy with optimism-based methods that are both sample-efficient and computationally efficient \citep[\eg,][]{azar2017minimax}, with complexity proportional to $\mathrm{poly}(|\mathcal{S}|, |\mathcal{A}|)$. However, in practice, the cardinality of the state and action spaces can be large or even infinite. Hence, we need to incorporate function approximation into the learning algorithm in such cases. The linear MDP \citep{jin2020provably}, or low-rank MDP \citep{agarwal2020flambe, modi2021model}, is the most well-known MDP class that can incorporate linear function approximation with theoretical guarantees, thanks to the following assumption on the transition and reward:
\begin{align}
\label{eq:linear_transition}
    T(s^\prime|s, a) = \langle \phi(s, a),  \mu(s^\prime)\rangle_{\mathcal{H}},\quad r(s, a) = \langle \phi\rbr{s, a}, \theta \rangle_\Hcal,
\end{align}
where $\phi:\mathcal{S}\times \mathcal{A} \to \mathcal{H}$, $\mu:\mathcal{S}\to\mathcal{H}$ are two feature maps and $\mathcal{H}$ is a Hilbert space. The key observation is that $Q_h^\pi(s_h, a_h)$ for any policy $\pi$ is linear w.r.t.\ $\phi(s_h, a_h)$, due to the following identity \citep{jin2020provably}:
\begin{align}
    & Q_h^\pi(s_h, a_h)= r(s_h, a_h) + \int V_{h+1}^\pi(s_{h+1}) T(s_{h+1}|s_h, a_h) \dif s_{h+1}   \nonumber\\
    % & \int V_{h+1}^\pi(s_{h+1}) \langle \phi(s_h, a_h), \mu(s_{h+1})\rangle \dif s_{h+1} \nonumber\\
    & = \left\langle \phi(s_h, a_h), \theta + \int V_{h+1}^\pi(s_{h+1}) \mu(s_{h+1}) \dif s_{h+1}\right\rangle_{\mathcal{H}}.
\label{eq:linear_Q}
\end{align}

Therefore, $\phi$ serves as a sufficient representation for the estimation of $Q_h^\pi$, which enables uncertainty estimation with standard linear model analysis and eventually leads to sample-efficient learning when $\phi$ is fixed and known to the agent \citep[see Theorem 3.1 in][]{jin2020provably}.
However, in general we do not have such representations in advance\footnote{One exception is the tabular MDP, where we can choose $\phi:\mathcal{S}\times \mathcal{A}\to \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ such that each state-action pair has exactly one non-zero element, with $\mu:\mathcal{S}\to \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ defined correspondingly to make \eqref{eq:linear_transition} hold.} and need to learn the representation from data, which constrains the applicability of algorithms derived with a fixed and known representation.
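
To make the tabular exception concrete, one standard choice (a sketch; the indexing convention is illustrative and not fixed in the text) is to index the coordinates of $\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ by state-action pairs $(\tilde{s}, \tilde{a})$ and set
\begin{align*}
    \phi(s, a)_{(\tilde{s}, \tilde{a})} = \mathbf{1}\{(\tilde{s}, \tilde{a}) = (s, a)\}, \qquad \mu(s^\prime)_{(\tilde{s}, \tilde{a})} = T(s^\prime|\tilde{s}, \tilde{a}),
\end{align*}
so that $\langle \phi(s, a), \mu(s^\prime)\rangle = \sum_{(\tilde{s}, \tilde{a})} \mathbf{1}\{(\tilde{s}, \tilde{a}) = (s, a)\}\, T(s^\prime|\tilde{s}, \tilde{a}) = T(s^\prime|s, a)$, which is the transition part of \eqref{eq:linear_transition}. The reward part is recovered in the same way with $\theta_{(\tilde{s}, \tilde{a})} = r(\tilde{s}, \tilde{a})$.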

\paragraph{Remark (Model-based RL vs. RL with Representation):} 
We would like to emphasize that, although most of the existing representation learning methods need to learn the transition~\citep{du2019provably,misra2020kinematic,agarwal2020flambe,uehara2021representation}, RL with representation learning is related but orthogonal to the concept of model-based RL.
The major difference lies in how the learned transition is used for planning (\ie~finding the optimal policy).
% between RL with representation learning and model-based RL. 
In vanilla model-based RL methods~\citep[\eg,][]{sutton1990integrated,chua2018deep,kurutach2018model}, the learned transition is used as a simulator that generates samples for policy improvement, while in representation-based RL, the representation is extracted from the learned transition to compose the policy explicitly, which is significantly more efficient than the model-based RL methods.

\section{\AlgName}\label{sec:algorithm}
It is natural to consider how to perform sample-efficient representation learning (and hence sample-efficient reinforcement learning) under \eqref{eq:linear_transition} in an online manner. The most straightforward idea is to perform maximum likelihood estimation (MLE) in the representation space \citep[\eg,][]{agarwal2020flambe}. Unfortunately, in general, such MLE is intractable, due to the constraint on the regularity of the marginal distribution, \ie, $\langle \phi(s, a), \int_{s^\prime} \mu(s^\prime) \dif s^\prime\rangle = 1$ for all $(s, a)\in\mathcal{S}\times \mathcal{A}$.
Moreover, even if we can perform MLE in certain cases (for example, the block MDP), the representation is estimated from data and can be inaccurate, so most of the existing works apply the policy cover technique \citep{du2019provably, misra2020kinematic, agarwal2020flambe, modi2021model} to enforce exploration.
However, such procedures can be expensive in both computation and memory when a large number of exploratory policies is needed to guarantee coverage of the whole state space, which makes them impractical.

To overcome these issues, we introduce \AlgName~(\algabb), which leverages the noise structure to provide a simple yet provably efficient and practical algorithm for representation learning in RL. We first introduce our key observation, which induces the equivalency between the linear MDP and stochastic nonlinear control.

\subsection{Key Observation}
Our fundamental observation is that, the density of isotropic Gaussian distribution can be expressed as the inner product of two feature maps, thanks to the reproducing property and the random Fourier transform of the Gaussian kernel\footnote{We provide a brief review on the related definitions in Appendix \ref{sec:background}.} \citep{rahimi2007random}:

\begin{tcolorbox}[colback=cyan!5!white,colframe=cyan!75!black]
% \vspace{-3mm}
\begin{align}
% \textstyle
    &\phi(x| \mu, \sigma^2 I) \propto  \exp\left(-\frac{\|x - \mu\|^2}{2\sigma^2}\right) \nonumber\\
    = & \langle k(x, \cdot), k(\mu, \cdot)\rangle_{\mathcal{H}} \quad \textit{(Reproducing Property)} \label{eq:reproducing_property}\\
    = &  \inner{\varphi(x, \omega, b)}{\varphi(\mu, \omega, b)}_{p(\omega, b)}\quad \textit{(Random Fourier)},\label{eq:random_feature}
\end{align}
\end{tcolorbox}
where $k(\cdot, \cdot)$ is the Gaussian kernel with bandwidth $\sigma$: $k(x, y) = \exp\left(-\frac{\|x - y\|_2^2}{2\sigma^2}\right)$, $\mathcal{H}$ is the Reproducing Kernel Hilbert Space (RKHS) associated with $k$, $\varphi(x, \omega, b) = \sqrt{2}\cos(\omega^\top x + b)$, $\langle f, g\rangle_{p(\omega, b)} = \mathbb{E}_{p(\omega, b)}[f(\omega, b)g(\omega, b)]$ and $p(\omega, b) =\mathcal{N}(\omega; 0, \sigma^{-2} I)\cdot \Ucal(b; [0, 2\pi])$ with $\Ncal$ and $\Ucal$ denoting the Gaussian and uniform distributions, respectively.
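
As a sanity check of \eqref{eq:random_feature}, the identity can be verified directly from the definitions above. Writing $\delta = x - \mu$ and using $2\cos A\cos B = \cos(A - B) + \cos(A + B)$,
\begin{align*}
    \mathbb{E}_{p(\omega, b)}\left[\varphi(x, \omega, b)\varphi(\mu, \omega, b)\right]
    & = \mathbb{E}_{\omega}\left[\cos(\omega^\top \delta)\right] + \mathbb{E}_{\omega, b}\left[\cos(\omega^\top (x + \mu) + 2b)\right] \\
    & = \mathbb{E}_{\omega}\left[\cos(\omega^\top \delta)\right] = \exp\left(-\frac{\|\delta\|_2^2}{2\sigma^2}\right),
\end{align*}
where the second term vanishes because $b$ is uniform on $[0, 2\pi]$, and the last equality is the real part of the characteristic function of $\omega \sim \Ncal(0, \sigma^{-2} I)$ evaluated at $\delta$. If a finite-dimensional feature is desired, the expectation can be approximated by an average over $m$ i.i.d.\ draws $(\omega_i, b_i)\sim p(\omega, b)$, which is the standard random-feature approximation of \citet{rahimi2007random}; the choice of $m$ is not specified here and is left as an assumption.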

Consider the general transition dynamics,
\begin{align}
\label{eq:control_model}
% \textstyle
    & s^\prime = f^*(s, a) + \epsilon,\quad \epsilon\sim \Ncal(0, \sigma^2 I), \\
    \text{or equivalently}& \quad T(s'|s, a)\propto \exp\rbr{-\frac{\nbr{s' - f^*(s, a)}^2}{2\sigma^2}},
\end{align}
which is a widely used setup in empirical model-based reinforcement learning \citep[\eg,][]{chua2018deep, kurutach2018model, clavera2018model, wang2019benchmarking}, and in online (non)-linear control \citep[\eg,][]{abbasi2011regret, mania2019certainty, mania2020active, simchowitz2020naive, kakade2020information}. Here $s\in\mathbb{R}^d$, the action $a\in\mathcal{A}$ can be continuous, and $f^*$ is the dynamics function.

By applying the reproducing property~\eqref{eq:reproducing_property} or the random Fourier transform~\eqref{eq:random_feature} to the transition dynamics~\eqref{eq:control_model}, we can obtain features $\phi$ and $\mu$ that satisfy \eqref{eq:linear_transition} \emph{for free}. Specifically, taking the reproducing property as an example, we have that
\begin{align}
\label{eq:control_model_linear}
    T(s^\prime|s, a)  = \langle k(f^*(s, a), \cdot), (2\pi\sigma^2)^{-d/2} k(s^\prime, \cdot)\rangle_{\mathcal{H}},
\end{align}
which means the problem \eqref{eq:control_model} is indeed a linear MDP with $\phi(s, a) = k(f^*(s, a), \cdot)$ and $\mu(s^\prime) = (2\pi\sigma^2)^{-d/2} k(s^\prime, \cdot)$. Following \eqref{eq:linear_Q}, we know $Q(s, a)$ is in the linear span of $\phi(s, a)$, which is transformed from $f^*(s, a)$. Therefore, finding a good representation for $Q(s, a)$ is equivalent to finding a good estimate of $f^*$. In the next section, we will show that, with the well-known optimism in the face of uncertainty (OFU) principle, we can estimate $f^*$ in an online manner with an algorithm that is both sample-efficient in terms of regret and computationally efficient.
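
To see why \eqref{eq:control_model_linear} holds, note that the reproducing property gives $\langle k(x, \cdot), k(y, \cdot)\rangle_{\mathcal{H}} = k(x, y)$, so
\begin{align*}
    \left\langle k(f^*(s, a), \cdot), (2\pi\sigma^2)^{-d/2} k(s^\prime, \cdot)\right\rangle_{\mathcal{H}}
    & = (2\pi\sigma^2)^{-d/2}\, k(f^*(s, a), s^\prime) \\
    & = (2\pi\sigma^2)^{-d/2} \exp\left(-\frac{\|s^\prime - f^*(s, a)\|_2^2}{2\sigma^2}\right),
\end{align*}
which is the density of a Gaussian with mean $f^*(s, a)$ and covariance $\sigma^2 I$ evaluated at $s^\prime$, that is, the transition in \eqref{eq:control_model} with its normalizing constant made explicit.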

\paragraph{Remark (Computation-free Factorizable Noise Model):} We remark that similar observations also hold for a large class of distributions, \eg, the Laplace and Cauchy distributions. We refer the interested reader to Table 1 in \citet{dai2014scalable} for the known transformations of kernels and features. Here we focus on Gaussian noise.

\paragraph{Remark (Reward Factorization):} In the definition of the linear MDP~\eqref{eq:linear_transition}, the reward function $r(s, a)$ should also be linearly representable by $\phi\rbr{s, a}$. This can be achieved by using the augmented feature $[\phi\rbr{s, a}, r(s, a)]$ as the new representation; therefore, we neglect the reward function throughout the paper.
\subsection{Practical Algorithm Description}
\label{sec:practical_alg}
Here, we introduce a generic Thompson Sampling (TS) type algorithm in Algorithm \ref{alg:TS}, based on the OFU principle, that leverages our observation in the previous section.
At the beginning, we provide a prior distribution $\mathbb{P}(f)$ that reflects our prior knowledge of $f^*$. Then for each episode, we draw an $f$ from the posterior, find the optimal policy for $f$ using the planning algorithm, execute this policy, and finally update the posterior with the new observations. Notice that we choose the policy optimistically with a \emph{sampled} $f$, which enforces exploration following the OFU principle. Meanwhile, we only learn the dynamics with posterior inference and directly obtain the representation with \eqref{eq:reproducing_property} or \eqref{eq:random_feature}, which avoids additional error from a representation learning step. As all of our data is collected under $f^*$, our posterior will concentrate to a point mass at $f^*$, which guarantees that we can identify a good representation and a good policy given a sufficient amount of data.
\begin{algorithm}[tb]
\caption{Thompson Sampling (TS) Algorithm}
\label{alg:TS}
\begin{algorithmic}[1]
\Require Number of Episodes $K$, Prior Distribution $\mathbb{P}(f)$, Reward Function $r(s, a)$.
\State Initialize the history set $\mathcal{H}_0 = \emptyset$.
\For{episodes $k=1, 2, \cdots, K$}
\State {\color{blue} Sample $f_k \sim \mathbb{P}(f|\mathcal{H}_{k-1})$.} \Comment{Draw the Representation.}
\State {\color{blue} Find the optimal policy $\pi_k$ on $f_k$ with Algorithm \ref{alg:planning}.}\label{line:planning}\Comment{Planning with $f_k$.}
\For{steps $h=0, 1, \cdots, H-1$}\Comment{Executing $\pi_k$.}
\State Execute $a_h^k \sim \pi_k^h(s_h^k)$.
\State Observe $s_{h+1}^k$.
\EndFor
\State Set $\mathcal{H}_k = \mathcal{H}_{k-1} \cup \{(s_h^k, a_h^k, s_{h+1}^k)\}_{h=0}^{H-1}$. \Comment{Update the History.}
\EndFor
\end{algorithmic}
\end{algorithm}

One significant part of \algabb is the computationally efficient planning with $f_k$, thanks to the linear MDP formulation \eqref{eq:control_model_linear}. Prior work assumes an oracle \citep[\eg,][]{kakade2020information} for this planning problem, but little is known about how to provably perform such planning efficiently. Notice that, with the feature $\phi(s, a)$ defined via \eqref{eq:reproducing_property} or \eqref{eq:random_feature}, we know that $Q_h^\pi(s, a)$ is exactly linear in $\phi(s, a)$, $\forall h, \pi$. Hence, we can perform a dynamic-programming-style algorithm that calculates $Q_h(s, a)$ with the given feature $\phi(s, a)$, and then greedily selects the action at each level $h$, which is simple yet efficient. It is straightforward to show by induction that the policy obtained with this dynamic programming algorithm is optimal. We present the detailed algorithm in~\algtabref{alg:planning}.

\begin{algorithm}[tb]
\caption{Planning with Dynamic Programming}
\label{alg:planning}
\begin{algorithmic}[1]
\Require Transition Model $f$, Reward Function $r(s, a)$.
\State Initialize $\phi(s, a)$, $\mu(s^\prime)$ with \eqref{eq:reproducing_property} or \eqref{eq:random_feature}. $V_H(s) = 0, \forall s$.
\For{steps $h=H-1, H-2, \cdots, 0$}
\State {\color{blue} Compute
\begin{align*}
    Q_h(s, a) = r(s, a) + \langle \phi(s, a), \int V_{h+1}(s^\prime) \mu(s^\prime) \dif s^\prime\rangle_{\mathcal{H}}.
\end{align*}}\Comment{Bellman Update.}\label{line:critic}
\State Set $V_h(s) = \max_{a} Q_h(s, a)$, $\pi_h(s) = \mathop{\arg\max}_a Q_h(s, a)$.\Comment{Choose the Optimal Policy.}\label{line:actor}
\EndFor
\State \Return $\{\pi_h\}_{h=0}^{H-1}$.
\end{algorithmic}
\end{algorithm}

\subsubsection{Implementation Details}
In such a planning algorithm, we need to maintain the posterior of $f$, calculate the term $\int V_{h+1}(s^\prime) \mu(s^\prime) \dif s^\prime$, and take the maximum of $Q_h(s, a)$ over $a$, each of which can be challenging. We discuss these issues below.

\paragraph{Posterior Sampling} Exact posterior inference can be hard if $f^*$ does not lie in a simple function class (\eg, the linear function class) or the model lacks a convenient property (\eg, conjugacy), so in practice we apply existing mature approximate inference methods such as Markov Chain Monte Carlo (MCMC) \citep[\eg,][]{neal2011mcmc} and variational inference \citep[see][]{blei2017variational}.
In our implementation, we used stochastic gradient Langevin dynamics~\citep{welling2011bayesian,cheng2018convergence} to train an ensemble of models for posterior approximation.

\paragraph{Large State and Action Space} In general, we need to handle the case when the number of states and actions can be large, or even infinite. Notice that, when the state space is large, we can estimate the term $\int V_{h+1}(s^\prime)\mu(s^\prime) \dif s^\prime$ with regression based method using the samples from $f$~\citep{NIPS2007_da0d1111}. 
For a continuous action space, we can apply principled policy optimization methods \citep[\eg,][]{agarwal2020optimality} with an energy-based model~(EBM) parametrized policy~\citep{nachum2017bridging,dai2018sbeed}, treating the linear $Q^{\pi}(s, a)$ as the gradient and performing mirror descent to eventually obtain the optimal policy. However, this comes at the cost of an additional sampling step from the EBM policy. In practice, we introduce a Gaussian policy and perform the soft actor-critic (SAC) \citep{haarnoja2018soft} policy update, which already provides good empirical performance.
To sum up, for large state and action spaces, we learn the critic in the learned representation space by regression, and obtain the Gaussian parametrized actor with the SAC policy update step, in Lines \ref{line:critic} and \ref{line:actor} of~\algtabref{alg:planning}, respectively.
\paragraph{Infinite Horizon Case} Our algorithm can be provably extended to the infinite horizon case with a specific termination condition for each episode \citep[\eg, see][]{jaksch2010near}. In practice, for the planning part we can solve the linear fixed-point equation with the feature $\phi(s, a)$ using popular algorithms like Fitted $Q$-iteration (FQI)~\citep{NIPS2007_da0d1111} or dual embedding~\citep{dai2018sbeed}, which are still guaranteed to find the optimal policy.

\section{Theoretical Guarantees}\label{sec:analysis}
In this section, we provide theoretical justification for~\algabb, showing that \algabb can identify an informative representation and, as a result, a near-optimal policy in a sample-efficient way.

We first define the notion of regret. Assume at episode $k$, the learner chooses the policy $\pi_k$ and observes a sequence $\{(s_h^k, a_h^k)\}_{h=0}^{H-1}$. We define the regret of the first $K$ episodes (and define $T:=KH$) as:
\begin{align}
% \textstyle
    \mathrm{Regret}(K) := \sum_{k\in [K]} \left[V_0^*(s_0^k) - V_0^{\pi_k}(s_0^k)\right]
\end{align}
The regret measures the sample complexity of the representation learning in RL. We want to provide a regret upper bound that is sublinear in $T$. When $T$ increases, we collect more data that can help us build a much more accurate estimation on the representation, which should decrease the per-step regret and make the overall regret scale sublinear in $T$. As we consider the Thompson Sampling algorithm, we would like to study the expected regret $\mathbb{E}_{\mathbb{P}(f)} \left[\mathrm{Regret}(K)\right]$, which takes the prior $\mathbb{P}(f)$ into account.

\subsection{Assumptions}
Before we start, we first state the assumptions we use to derive our theoretical results.

We assume the reward is bounded, which is common in the literature \citep[\eg,][]{azar2017minimax, jin2018q, jin2020provably}.
\begin{assumption}[Bounded Reward]
$r(s, a) \in [0, 1]$, $\forall (s, a) \in \mathcal{S}\times \mathcal{A}$.
\end{assumption}

In practice, we generally approximate $f^*$ with complicated function approximators, so we focus on the setting where we want to find $f^*$ from a general function class $\mathcal{F}$.
This is important for modeling MuJoCo dynamics, which has complicated transitions over the angle, angular velocity and torque of the agent in the raw state.
We first state some necessary definitions and assumptions on $\mathcal{F}$.
\begin{definition}[$\ell_2$-norm of functions] 
Define
$\|f\|_2 := \max_{(s, a) \in \mathcal{S}\times\mathcal{A}} \|f(s, a)\|_2.$
Notice that it is not the commonly used $\ell_2$ norm for the function, but it suits our purpose well.
\end{definition}
\begin{assumption}[Bounded Output]
\label{assump:bounded_output}
We assume that $\|f\|_2 \leq C$, $\forall f\in\mathcal{F}$.
\end{assumption}
\begin{assumption}[Realizability]
\label{assump:realizability}
We assume the ground truth dynamic function $f^*\in\mathcal{F}$.
\end{assumption}
\vspace{-0.5em}
We then define the notion of covering number, which will be helpful in our algorithm derivation.

\begin{definition}[Covering Number \citep{wainwright2019high}]
An $\epsilon$-cover of $\mathcal{F}$ with respect to a metric $\rho$ is a set $\{f_i\}_{i\in [n]}\subseteq \mathcal{F}$, such that $\forall f\in \mathcal{F}$, there exists $i\in [n]$ with $\rho(f, f_i) \leq \epsilon$. The $\epsilon$-covering number is the cardinality of the smallest $\epsilon$-cover, denoted as $\mathcal{N}(\mathcal{F}, \epsilon, \rho)$.
\end{definition}
\begin{assumption}[Bounded Covering Number]
\label{assump:bounded_covering} We assume that $\mathcal{N}(\mathcal{F}, \epsilon, \|\cdot\|_2) < \infty, \forall \epsilon > 0$.
\end{assumption}
\paragraph{Remark} Basically, Assumption \ref{assump:bounded_output} means that the transition dynamics never pushes the state far from the origin, which holds widely in practice. Assumption \ref{assump:realizability} guarantees that we can find the exact $f^*$ in $\mathcal{F}$; otherwise we will always suffer from the error induced by model mismatch. Assumption \ref{assump:bounded_covering} ensures that we can estimate $f^*$ with small error when we have a sufficient number of observations.

Besides the bounded covering number, we also need an additional assumption on bounded eluder dimension, which is defined in the following:
\begin{definition}[$\epsilon$-dependency \citep{osband2014model}]
A state-action pair $(s, a)\in\mathcal{S}\times\mathcal{A}$ is $\epsilon$-dependent on $\{(s_i, a_i)\}_{i\in [n]}\subseteq \mathcal{S}\times \mathcal{A}$ with respect to $\mathcal{F}$ if any pair $f, \tilde{f}\in\mathcal{F}$ satisfying $\sqrt{\sum_{i\in [n]} \|f(s_i, a_i) - \tilde{f}(s_i, a_i)\|_2^2}\leq \epsilon$ also satisfies $\|f(s, a) - \tilde{f}(s, a)\|_2 \leq \epsilon$. Furthermore, $(s, a)$ is said to be $\epsilon$-independent of $\{(s_i, a_i)\}_{i\in [n]}$ with respect to $\mathcal{F}$ if it is not $\epsilon$-dependent on $\{(s_i, a_i)\}_{i\in[n]}$.
\end{definition}
\begin{definition}[Eluder Dimension \citep{osband2014model}]
We define the eluder dimension $\mathrm{dim}_{E}(\mathcal{F}, \epsilon)$ as the length $d$ of the longest sequence of elements in $\mathcal{S}\times \mathcal{A}$ such that, for some $\epsilon^\prime\geq\epsilon$, every element is $\epsilon^\prime$-independent of its predecessors.
\end{definition}
\paragraph{Remark} Intuitively, eluder dimension illustrates the number of samples we need to make our prediction on unseen data accurate. If the eluder dimension is unbounded, then we cannot make any meaningful prediction on unseen data even with large amounts of collected samples. Hence, to make the learning possible, we need the following bounded eluder dimension assumption.
\begin{assumption}[Bounded Eluder Dimension]
\label{assump:bounded_eluder}
We assume $\mathrm{dim}_{E}(\mathcal{F}, \epsilon) < \infty, \forall \epsilon > 0$.
\end{assumption}

\subsection{Main Result}

\begin{theorem}[Regret Bound]
\label{thm:regret_bound}
Assume Assumptions \ref{assump:bounded_output} to \ref{assump:bounded_eluder} hold. We have that

\begin{align*}
    & \mathbb{E}_{\mathbb{P}(f)}\left[\mathrm{Regret}(K)\right] \leq  \tilde{O}\bigg(\sqrt{H^2 T}\\
    & \cdot \sqrt{\log \mathcal{N}(\mathcal{F}, T^{-1/2}, \|\cdot\|_2)} \cdot \sqrt{\mathrm{dim}_{E}(\mathcal{F}, T^{-1/2})}\bigg),
\end{align*}
where $\tilde{O}$ represents the order up to logarithm factors.
\end{theorem}
% \vspace{-0.5em}
For a finite-dimensional function class, $\log \mathcal{N}(\mathcal{F}, T^{-1/2}, \|\cdot\|_2)$ and $\mathrm{dim}_{E}(\mathcal{F}, T^{-1/2})$ scale like $\mathrm{polylog}(T)$, hence our upper bound is sublinear in $T$. The proof is in Appendix \ref{sec:technical_proof}. Here we briefly sketch the proof idea.
\begin{proof}[Proof Sketch]
We first construct an equivalent UCB algorithm (see Appendix \ref{sec:ucb}) and bound $\mathrm{Regret}(K)$ for it. Then by the conclusion from \citet{russo2013eluder, russo2014learning, osband2014model}, we can directly translate the upper bound on $\mathrm{Regret}(K)$ from UCB algorithm to an upper bound on $\mathbb{E}_{\mathbb{P}(f)}\left[\mathrm{Regret}(K)\right]$ of TS algorithm. We emphasize that the UCB algorithm is solely designed for analysis purpose.

With the optimism, we know for episode $k$, $V_0^*(s_0^k) \leq \tilde{V}_{0, k}^{\pi_k}(s_0^k)$, where $\tilde{V}_{h, k}^{\pi_k}$ is the value function of policy $\pi_k$ under the model $\tilde{f}_k$ introduced in the UCB algorithm.
Hence, the regret at episode $k$ can be bounded by $\tilde{V}_{0, k}^{\pi_k}(s_0^k) - V_0^{\pi_k}(s_0^k)$, which is the value difference of the policy $\pi_k$ under the two models $\tilde{f}_k$ and $f^*$, and can in turn be bounded by $\sqrt{\mathbb{E}\left[\sum_{h=0}^{H-1}\|f^*(s_h^k, a_h^k) - \tilde{f}_k(s_h^k, a_h^k)\|_2^2 \right]}$ (see Lemma \ref{lem:simulation} for the details). This means that when the model $\tilde{f}_k$ is close to the real model $f^*$, the policy obtained by planning on $\tilde{f}_k$ will only suffer a small regret. With the Cauchy-Schwarz inequality, we only need to bound
$\mathbb{E}\left[\sum_{k\in [K]}\sum_{h=0}^{H-1}\|f^*(s_h^k, a_h^k) - \tilde{f}_k(s_h^k, a_h^k)\|_2^2\right]$. This term can be handled via Lemma \ref{lem:width_sum_bound}. With some additional technical steps, we can obtain the upper bound on $\mathrm{Regret}(K)$ for the UCB algorithm, and hence the upper bound on $\mathbb{E}_{\mathbb{P}(f)}\left[\mathrm{Regret}(K)\right]$ for the TS algorithm.
\end{proof}
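The Cauchy-Schwarz step in the sketch can be spelled out as follows; this is only a reading aid, with constants from Lemma \ref{lem:simulation} omitted and the expectations taken as in the sketch. Writing $\Delta_k := \mathbb{E}\left[\sum_{h=0}^{H-1}\|f^*(s_h^k, a_h^k) - \tilde{f}_k(s_h^k, a_h^k)\|_2^2\right]$ (a shorthand introduced here), summing the per-episode bounds gives
\begin{align*}
    \sum_{k\in [K]} \sqrt{\Delta_k} \leq \sqrt{K \sum_{k\in [K]} \Delta_k} = \sqrt{K\, \mathbb{E}\left[\sum_{k\in [K]}\sum_{h=0}^{H-1}\|f^*(s_h^k, a_h^k) - \tilde{f}_k(s_h^k, a_h^k)\|_2^2\right]},
\end{align*}
so a bound on the cumulative squared prediction error translates into a bound on the cumulative regret with an extra $\sqrt{K}$ factor.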
\paragraph{Kernelized Non-linear Regulator} Notice that, for the linear function class $\mathcal{F} = \{\theta^\top \varphi(s, a): \theta \in \mathbb{R}^{d_{\varphi} \times d}\}$ where $\varphi:\mathcal{S} \times \mathcal{A} \to \mathbb{R}^{d_{\varphi}}$ is a fixed and \emph{known} feature map of a certain RKHS\footnote{Note that the RKHS here is the Hilbert space that contains $f(s, a)$ with the feature from some fixed and known kernel. It is different from the RKHS we introduced in Section \ref{sec:algorithm}, which contains $Q(s, a)$ with the feature $k(f(s, a), \cdot)$ where $k$ is the Gaussian kernel.}, when the feature and the parameters are bounded, the logarithm of the covering number can be bounded by $\log \mathcal{N}(\mathcal{F}, \epsilon, \|\cdot\|_2)\lesssim d_{\varphi}\log (1/\epsilon)$, and the eluder dimension can be bounded by $\mathrm{dim}_{E}(\mathcal{F}, \epsilon) \lesssim d_{\varphi} \log (1/\epsilon)$ (see Appendix \ref{sec:linear_case} for details; note that we provide a tighter bound on the eluder dimension than the one derived in \citet{osband2014model}). Hence, for the linear function class, Theorem \ref{thm:regret_bound} can be translated into a regret upper bound of $\tilde{O}(H d_{\varphi} T^{1/2})$ for sufficiently large $T$, which matches the results of \citet{kakade2020information}\footnote{Note that $T$ in \citep{kakade2020information} is the number of episodes, and $V_{\max}$ in \citep{kakade2020information} can be viewed as $H^2$ when the per-step reward is bounded.}. Moreover, for the case of linear bandits where $H = 1$, our bound can be translated into a regret upper bound of $\tilde{O}(d_{\varphi} T^{1/2})$, which matches the lower bound \citep{dani2008stochastic} up to logarithmic terms.
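
For concreteness, the substitution behind the $\tilde{O}(H d_{\varphi} T^{1/2})$ rate can be written out, taking the two bounds above at face value and dropping the constants hidden in $\lesssim$. With $\epsilon = T^{-1/2}$, so that $\log(1/\epsilon) = \tfrac{1}{2}\log T$, Theorem \ref{thm:regret_bound} gives
\begin{align*}
    \sqrt{H^2 T} \cdot \sqrt{d_{\varphi} \cdot \tfrac{1}{2}\log T} \cdot \sqrt{d_{\varphi} \cdot \tfrac{1}{2}\log T} = \tfrac{1}{2} H d_{\varphi} \sqrt{T} \log T = \tilde{O}\left(H d_{\varphi} T^{1/2}\right),
\end{align*}
and setting $H = 1$ yields the linear bandit rate $\tilde{O}(d_{\varphi} T^{1/2})$.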
\vspace{-3mm}
\paragraph{Compared with \citet{kakade2020information} and \citet{osband2014model}} Our results are connected to those of \citet{kakade2020information} and \citet{osband2014model}. However, \citet{kakade2020information} only consider the case where $\mathcal{F}$ contains only functions that are linear w.r.t.\ some known feature map, which limits its applicability in practice. We instead consider general function approximation, which makes our algorithm applicable to more complicated models such as deep neural networks. Meanwhile, the regret bound of \citet{osband2014model} depends on a global Lipschitz constant for the value function, which can be hard to quantify with either theoretical or empirical methods. In contrast, our regret bound removes this dependency on the Lipschitz constant through a simulation lemma that carefully exploits the noise structure.

\section{Experiments}\label{sec:experiments}

\begin{table*}[t]
\caption{\footnotesize Performance of \algabb on various MuJoCo control tasks. All results are averaged over 4 random seeds with a window size of 10K. Results marked with $^*$ are directly adopted from MBBL~\citep{wang2019benchmarking}. Our method achieves strong performance compared to purely empirical baselines (\eg, PETS).
We also compare \algabb-REG, which regularizes the critic using the model dynamics loss, with several model-free RL methods. \algabb-REG significantly improves the performance of the SoTA method SAC.
}
\scriptsize
\setlength\tabcolsep{3.5pt}
\label{tab:MuJoCo_results2}
\centering
\begin{tabular}{p{2cm}p{2cm}p{2cm}p{2.5cm}p{2cm}p{2cm}}
\toprule
& Swimmer & Reacher & MountainCar & Pendulum & I-Pendulum \\ 
\midrule  
ME-TRPO$^*$ & 30.1$\pm$9.7 & -13.4$\pm$5.2 & -42.5$\pm$26.6 & \textbf{177.3$\pm$1.9} & -126.2$\pm$86.6\\
PETS-RS$^*$  & 42.1$\pm$20.2 & -40.1$\pm$6.9 & -78.5$\pm$2.1 & 167.9$\pm$35.8 & -12.1$\pm$25.1\\
PETS-CEM$^*$  & 22.1$\pm$25.2 & -12.3$\pm$5.2 & -57.9$\pm$3.6 & 167.4$\pm$53.0 & -20.5$\pm$28.9\\
DeepSF & 25.5$\pm$13.5 & -16.8$\pm$3.6 & -17.0$\pm$23.4 & 168.6$\pm$5.1 & -0.2$\pm$0.3\\
{\bf \algabb} & \textbf{42.6$\pm$4.2} & \textbf{-7.2$\pm$1.1} & \textbf{50.3$\pm$1.1} & {169.5$\pm$0.6} & \textbf{0.0$\pm$0.0} \\
\midrule
PPO$^*$ & 38.0$\pm$1.5 & -17.2$\pm$0.9 & 27.1$\pm$13.1 & 163.4$\pm$8.0 & -40.8$\pm$21.0 \\
TRPO$^*$ & 37.9$\pm$2.0 & -10.1$\pm$0.6 & -37.2$\pm$16.4 & 166.7$\pm$7.3 & -27.6$\pm$15.8 \\
TD3$^*$ & 40.4$\pm$8.3 & -14.0$\pm$0.9 & -60.0$\pm$1.2 & 161.4$\pm$14.4 & -224.5$\pm$0.4 \\
SAC$^*$  & \textbf{41.2$\pm$4.6} & -6.4$\pm$0.5 & \textbf{52.6$\pm$0.6} & 168.2$\pm$9.5 & -0.2$\pm$0.1\\
{\bf \algabb-REG} & 40.0$\pm$3.8 & \textbf{-5.8$\pm$0.6} & 40.0$\pm$3.8 & \textbf{168.5$\pm$4.3} & \textbf{0.0$\pm$0.1}\\
\bottomrule 
\end{tabular}
\centering
\begin{tabular}{p{2cm}p{2cm}p{2cm}p{2.5cm}p{2cm}p{2cm}}
\toprule
& Ant-ET & Hopper-ET & S-Humanoid-ET & Humanoid-ET & Walker-ET \\ 
\midrule  
ME-TRPO$^*$ & 42.6$\pm$21.1 & 4.9$\pm$4.0 & 76.1$\pm$8.8 & 72.9$\pm$8.9 & -9.5$\pm$4.6\\
PETS-RS$^*$ & 130.0$\pm$148.1 &  205.8$\pm$36.5 & 320.9$\pm$182.2 & 106.9$\pm$106.9 & -0.8$\pm$3.2 \\
PETS-CEM$^*$ & 81.6$\pm$145.8 & 129.3$\pm$36.0 & 355.1$\pm$157.1 & 110.8$\pm$91.0 & -2.5$\pm$6.8 \\
DeepSF & 768.1$\pm$44.1  & 548.9$\pm$253.3 & 533.8$\pm$154.9 & 168.6$\pm$5.1 & 165.6$\pm$127.9\\
{\bf \algabb} & \textbf{806.2$\pm$60.2} & \textbf{732.2$\pm$263.9} & \textbf{986.4$\pm$154.7} & \textbf{886.9$\pm$95.2} & \textbf{501.6$\pm$204.0}  \\
\midrule
PPO$^*$ & 80.1$\pm$17.3  & 758.0$\pm$62.0 & 454.3$\pm$36.7 & 451.4$\pm$39.1 & 306.1$\pm$17.2\\
TRPO$^*$ & 116.8$\pm$47.3  & 237.4$\pm$33.5 & 281.3$\pm$10.9 & 289.8$\pm$5.2 & 229.5$\pm$27.1\\
TD3$^*$ & 259.7$\pm$1.0  & 1057.1$\pm$29.5 & 1070.0$\pm$168.3 & 147.7$\pm$0.7 & \textbf{3299.7$\pm$1951.5}\\
SAC$^*$ & {\bf 2012.7$\pm$571.3}  & 1815.5$\pm$655.1 & 834.6$\pm$313.1 & 1794.4$\pm$458.3 & 2216.4$\pm$678.7\\
{\bf \algabb-REG} & \textbf{2073.1$\pm$119.7} & \textbf{2510.3$\pm$550.8} & \textbf{2710.3$\pm$277.5} & \textbf{3747.8$\pm$1078.1} & 2170.3$\pm$810.9 \\
\bottomrule 
\end{tabular}
% \vspace{-1.5em}
\end{table*}
In this section, we study the empirical performance of \algabb in the OpenAI MuJoCo control suite~\citep{1606.01540}.
We use the environments from MBBL~\citep{wang2019benchmarking}, which vary slightly from the original environments: the reward function is modified so that its gradient w.r.t. the states exists, and early termination (ET) is introduced. Note that the set of environments contains various control and manipulation tasks, which are commonly used for benchmarking both model-free and model-based RL algorithms~\citep[\eg,][]{kakade2020information, haarnoja2018soft}. As mentioned above, for the practical implementation, our critic network consists of a representation network $\phi(\cdot)$ and a linear layer on top. We follow the same procedure as Algorithm~\ref{alg:TS}. Specifically, (1) to find the optimal policy, we run an actor-critic algorithm (SAC); (2) we fix the representation network $\phi(\cdot)$ of the critic function and only update the linear layer on top. We provide the full set of experiments in Appendix~\ref{appendix:full_exp} and the hyperparameters we use in Appendix~\ref{appendix:hyperparam}.\footnote{Our code is available at \href{https://sites.google.com/view/spede}{https://sites.google.com/view/spede}.}
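In illustrative notation, and only as a sketch, the critic described above can be written as
\[
Q_{w}(s, a) = w^\top \phi(s, a),
\]
where $w$ denotes the weights of the top linear layer (the symbol $w$ is introduced here for illustration). Step (2) then amounts to updating $w$ alone with $\phi$ held fixed, for instance by gradient steps on a squared error of the form
\[
\mathcal{L}(w) = \widehat{\mathbb{E}}\left[\left(w^\top \phi(s, a) - y\right)^2\right],
\]
where $y$ stands for whatever target the actor-critic update supplies and $\widehat{\mathbb{E}}$ is an empirical average over sampled transitions. Both are placeholders rather than a specification of the implementation.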
\paragraph{Baselines} We compare our method with various model-based RL baselines: PETS~\citep{chua2018deep} with a random shooting (RS) optimizer, PETS with a cross-entropy method (CEM) optimizer, and ME with a TRPO policy optimizer~\citep{kurutach2018model}. Note that these are strong empirical baselines with many hand-tuned hyperparameters and engineering features (\eg, ensembles of models). It is usually hard for any theoretically guaranteed model-based RL algorithm to match or surpass their performance~\citep{kakade2020information}. Another natural baseline is the successor feature~\citep{dayan1993improving}, which is one of the representative spectral features. We compare with the deep successor feature (DeepSF)~\citep{kulkarni2016deep}; for a fair comparison, we only swap the representation objective of \algabb with that of DeepSF and keep the other parts of the algorithm exactly the same.
\paragraph{\algabb: Performance with the Learned Representation} Following Algorithm~\ref{alg:TS}, we are interested in how \algabb performs when we conduct planning on top of the representation induced by the dynamics model in each episode. 
As most of the rigorously-justified representation learning algorithms are computationally intractable or inefficient, we demonstrate the effectiveness of the representation used in \algabb by comparing \algabb with deep successor features, a representative empirical representation learning algorithm. Moreover, as our method learns its representation by fitting the transition dynamics, we also compare against state-of-the-art model-based RL algorithms to demonstrate the benefit of this representation for planning.
We summarize the results of our method in Table~\ref{tab:MuJoCo_results2}. We see that our method achieves strong performance compared to model-based RL methods. Even in some hard environments where the model-based baselines fail to reach positive reward (\eg, MountainCar, Walker-ET), \algabb manages to achieve rewards of 50.3 and 501.6, respectively. We also evaluate our representation by comparing \algabb to the deep successor feature (DeepSF). Results show that on hard tasks like S-Humanoid-ET and Walker-ET, \algabb achieves 452.6 and 336.0 higher reward than DeepSF, respectively.
\paragraph{\algabb-REG: Policy Optimization with \algabb Representation Regularizer}
To evaluate whether our linear MDP assumption holds in empirical settings, and whether such an assumption can help improve performance,
we add our model dynamics representation objective as a regularizer to the original SAC algorithm for learning the $Q$-function. Specifically, \algabb-REG consists of the vanilla SAC objective with an additional loss that constrains the representation learned by the critic function, following the intuition that the representation should satisfy the equivalent dynamics. We compare its performance with the vanilla SAC algorithm to show the benefits of the dynamics-based representation. Results in Table~\ref{tab:MuJoCo_results2} show that adding such a constraint significantly improves the performance of SAC: on hard tasks like Hopper-ET, S-Humanoid-ET and Humanoid-ET, \algabb-REG improves the performance of SAC by 694.8, 1875.7 and 1953.4, respectively.
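One way to read this construction, written as a hedged sketch rather than the exact objective used in the experiments, is
\[
\mathcal{L}_{\mathrm{REG}} \;=\; \mathcal{L}_{\mathrm{SAC}}(Q) \;+\; \lambda\, \widehat{\mathbb{E}}_{(s, a, s')}\left[\left\|f(s, a) - s'\right\|_2^2\right],
\]
where $\mathcal{L}_{\mathrm{SAC}}$ is the vanilla SAC critic loss, $f$ is the dynamics model that induces the critic's representation, $\widehat{\mathbb{E}}$ is an empirical average over sampled transitions, and $\lambda \geq 0$ is a hypothetical regularization weight introduced here for illustration. The squared-error form of the dynamics term is an assumption, chosen because it is what maximum likelihood would give under Gaussian transition noise; setting $\lambda = 0$ recovers vanilla SAC.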
\begin{figure*}[t]
    \centering
    \includegraphics[width=0.8\textwidth]{figures/MuJoCo_result.pdf}
    \caption{\footnotesize \textbf{Experiments on MuJoCo:} We show curves of the return versus the training steps for \algabb and model-based RL baselines. Results show that in these tasks, our method enjoys better sample efficiency even compared to SoTA empirical model-based RL baselines.}
    \label{fig:MuJoCo}
\end{figure*}

\paragraph{Ablations} We conduct ablations on (1) the effect of the momentum parameter and (2) how the number of random features affects the performance.
Detailed results can be found in Appendix~\ref{appendix:ablations}.
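For context on ablation (2), the textbook random Fourier feature construction for the Gaussian kernel $k(x, y) = \exp\left(-\|x - y\|_2^2 / (2\sigma^2)\right)$ is
\[
\psi(x) = \sqrt{\frac{2}{m}}\,\left[\cos(\omega_1^\top x + b_1), \ldots, \cos(\omega_m^\top x + b_m)\right]^\top, \qquad \omega_i \sim \mathcal{N}(0, \sigma^{-2} I), \quad b_i \sim \mathrm{Unif}[0, 2\pi],
\]
which satisfies $\mathbb{E}\left[\psi(x)^\top \psi(y)\right] = k(x, y)$, so the kernel approximation becomes more accurate as the number of features $m$ grows. The symbols $\psi$, $m$ and $\sigma$ are introduced here for illustration only, and this standard form may differ in detail from the features used in our implementation.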

\paragraph{Performance Curves} To better understand how the sample complexity of our algorithm compares to that of prior model-based RL baselines, we plot the return versus environment steps in Figure~\ref{fig:MuJoCo}. We see that, compared to prior model-based baselines, \algabb enjoys better sample efficiency in these tasks. We emphasize that, according to MBBL~\citep{wang2019benchmarking}, model-based methods already show significantly better sample efficiency than model-free methods (\eg, PPO/TRPO). We provide additional results in Appendix~\ref{appendix:full_exp}.

\paragraph{Discussion of the Results} We observe that in the environments with relatively simple dynamics (top row of Table~\ref{tab:MuJoCo_results2}), \algabb achieves SoTA performance among all the model-based and model-free RL algorithms.
When the model dynamics of the environment become harder (bottom row of Table~\ref{tab:MuJoCo_results2}), the performance gap between the two approaches begins to widen.
Interestingly, \algabb achieves strong results compared to model-based approaches, while the jointly learned \algabb-REG outperforms model-free algorithms by a large margin. The performance improvement of \algabb indicates the importance of learning a good representation based on model dynamics and again shows the effectiveness of our approach in both settings. The performance gap might be caused by the random feature approximation. To mitigate this approximation error, we also tried using an MLP on top of the learned representation instead of a linear form, which leads to better performance. Please refer to~\appref{appendix:ablations} for details.

In fact, the difference between which variant of \algabb is SoTA in easy and in difficult environments also reveals an important direction for future work. The current rigorous representation learning methods, \eg,~\citet{du2019provably, misra2020kinematic, agarwal2020flambe} and the proposed \algabb, all rely on some model assumption. When the assumptions are satisfied, \eg, in Pendulum, Reacher, and others, our theoretically derived \algabb variant works extremely well, even better than the current SoTA. However, when the assumption is not fully satisfied, although the decoupled \algabb achieves the best performance among existing model-based RL and representation learning methods under a fair comparison, the jointly learned variant of \algabb is more robust and improves over the current SoTA by a significant margin. An interesting question is whether we can rigorously justify the regularized \algabb, which we leave as future work.

\section{Conclusion}
We introduce \algabb, which, to the best of our knowledge, is the first provable and efficient representation learning algorithm for RL, obtained by exploiting the benefits of noise. We provide a thorough theoretical analysis and strong empirical results, compared to both model-free and model-based RL, that demonstrate the effectiveness of our algorithm.

\section*{Acknowledgement}
Cs. Sz. gratefully acknowledges funding from NSERC, AMII and the Canada CIFAR AI Chair program. This project occurred under the Google-BAIR Commons at UC Berkeley.

\bibliography{ren_338}

\end{document}
