%\documentclass{uai2022} % for initial submission
 \documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage[semicolon]{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{amsfonts}
\usepackage[mathscr]{eucal}
\usepackage{algorithm,algpseudocode}
\usepackage{algorithmicx}
\usepackage{bm}
\usepackage{makecell}
\usepackage{subfigure}
\usepackage{flushend}

\graphicspath{{figures/}} % Directory in which figures are stored

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Self-Supervised Representations for Multi-View Reinforcement Learning}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Huanhuan Yang}
\author[2,3,1]{Dianxi Shi \thanks{Corresponding author (dxshi@nudt.edu.cn).}}
\author[4]{Guojun Xie}
\author[1]{Yingxuan Peng}
\author[2]{Yi Zhang}
\author[3]{Yantai Yang}
\author[1]{Shaowu Yang}

% Add affiliations after the authors
\affil[1]{%
	College of Computer, National University of Defense Technology, Changsha, China
}
\affil[2]{%
	Artificial Intelligence Research Center, Defense Innovation Institute, Beijing, China
}
\affil[3]{%
	Tianjin Artificial Intelligence Innovation Center, Tianjin, China
}
\affil[4]{%
	College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
}
  
  \begin{document}
\maketitle

\begin{abstract}
  Learning policies from raw pixel images is important for the real-world application of deep reinforcement learning (RL). Standard model-free RL algorithms focus on single-view settings and unify representation learning and policy learning into an end-to-end training process. However, such a learning paradigm is sample-inefficient and sensitive to hyper-parameters when supervised merely by reward signals. To address this, we present Self-Supervised Representations (S2R) for multi-view reinforcement learning, a sample-efficient representation learning method for learning features from high-dimensional images. In S2R, we introduce a representation learning framework and define a novel multi-view auxiliary objective based on multi-view image states and the Conditional Entropy Bottleneck (CEB) principle. We integrate S2R with a deep RL agent to learn robust representations that preserve task-relevant information while discarding task-irrelevant information, and to find optimal policies that maximize the expected return. Empirically, we demonstrate the effectiveness of S2R on the visual DeepMind Control (DMControl) suite and show that it performs better on the default DMControl tasks and on variants in which the tasks' default background is replaced with a random image or natural video.
\end{abstract}

\section{Introduction}\label{sec:intro}
In recent years, deep reinforcement learning (RL) has shown the potential to learn high-quality policies directly from complex environments with high-dimensional states, such as playing Atari video games \citep{mnih2015human, hessel2018rainbow} or solving visual continuous control tasks \citep{lillicrap2015continuous}. The RL learning process can be decoupled into two sub-processes: representation learning and policy learning. The former aims to abstract features that characterize high-dimensional states, and the latter aims to find optimal policies that maximize the expected cumulative return. However, standard model-free RL algorithms unify these two sub-processes into an end-to-end training procedure, which makes learning sample-inefficient \citep{lake2017building, kaiser2019model} when it is supervised only by reward signals. This problem is aggravated in the real world, where collecting interaction data and training task-specific policies is expensive and time-consuming \citep{kalashnikov2018qt, akkaya2019solving}.

Therefore, decoupling representation learning from policy learning within one training procedure offers a feasible way to alleviate sample inefficiency in RL algorithms. Representation learning compresses high-dimensional data into low-dimensional representations that faithfully characterize them \citep{lesort2018state}. Policy learning can then operate on these low-dimensional, informative representations rather than on the raw data, so that tasks are solved more sample-efficiently. In this paper, we build our method on this idea: we first rely on an auxiliary objective to explicitly obtain latent representations, and then train the agent on top of these representations.

We focus on multi-view RL, which extends RL to multi-view settings. While most RL algorithms consider only single-view data, multi-view settings relax restrictions that hinder the application of RL to real-life scenarios. Take a smart vehicle as an example: instead of using only single-view data, it fuses multi-view data perceived by multiple sensors to make safe driving decisions. Compared with learning in single-view settings, learning in multi-view settings is more complex because representations must be inferred from multiple, more complicated views. If solved, it can promote the generalization of RL across varying domains, including real-world applications. Thus, we propose S2R: \textbf{S}elf-\textbf{S}upervised \textbf{R}epresentations for multi-view reinforcement learning. Our key contributions are summarized as follows.

\begin{itemize}
	\item \textbf{Representation learning framework.} To support representation learning in multi-view RL, we design a dedicated learning framework. It is composed of the encoder/target encoder network, feature fusion module, view-specific predictor, and multi-view predictor. After learning marginal representations with the encoder network, we use the reparameterization trick to obtain sampled representations. These samples are passed through the feature fusion module and then the multi-view predictor to predict self-supervision signals (latent transition function and reward function). The samples are also fed into the view-specific predictor to make the same kind of predictions.
	\item \textbf{Self-supervision objective.} To learn compressed representations, inspired by the Conditional Entropy Bottleneck (CEB) \citep{fischer2020conditional}, we define a new multi-view CEB (MCEB) auxiliary objective. It maximizes the task-relevant information between representations (marginal or joint) and self-supervision signals and compresses away any task-irrelevant information that comes from multi-view image states but is not contained in the self-supervision signals.
	\item \textbf{Representation learning for multi-view RL.} To integrate representation learning with multi-view RL training, we combine the MCEB objective with the RL objective by optimizing the RL objective on top of the encoder network trained with the MCEB objective. To produce multi-view data, we follow the common practice in multi-view learning \citep{bachman2019learning, wang2021residual}: for a given image, data augmentation is used to generate multiple views. Empirically, we show that S2R performs better on the default visual DMControl tasks \citep{tassa2018deepmind} and on noisy variants in which the tasks' default background is replaced with a random image or a complex natural video.
\end{itemize}

\section{Related Work}
\textbf{Reconstruction-based representations.}\ \  The auto-encoder, an unsupervised learning technique that uses neural networks for representation learning, was among the earliest representation methods combined with RL in control tasks \citep{lange2010deep, lange2012autonomous, yarats2021improving}. These RL agents first train an encoder via a reconstruction loss, then learn policies based on the representations produced by the encoder. However, there is no guarantee that the encoder captures information that is useful for control tasks in practice. To address this problem, researchers proposed training the encoder jointly with RL dynamics to learn task-oriented and predictive representations \citep{watter2015embed, wahlstrom2015pixels, hafner2019learning, hafner2019dream, hafner2020mastering, lee2019stochastic}. Although effective, these approaches try to encode all details of the visual images into embeddings during reconstruction, which makes them sensitive to task-independent visual changes and degrades performance because of task-irrelevant information \citep{zhang2018natural}.

\textbf{Contrastive-based representations.} \ \ As a representation learning method, contrastive learning has been widely used in self-supervised settings and has made significant progress in image classification and detection \citep{caron2020unsupervised, xie2021propagate}. It uses data augmentation \citep{chen2020simple} or image patches \citep{henaff2020data} to acquire data samples and learns rich representations via similarity functions \citep{belghazi2018mutual, poole2019variational} such that the distance between similar pairs is minimized and the distance between dissimilar pairs is maximized. Many works \citep{kim2018emi, srinivas2020curl, mazoure2020deep} have introduced contrastive learning to RL settings to extract predictive features. However, driven by the contrastive loss, these methods aim to capture all features in the images to maximize a lower bound on the mutual information, so the learned features contain task-irrelevant information.

\textbf{Multi-view and other representations.}\ \ To extract only task-relevant features from high-dimensional data, researchers have tried various methods. Multi-view learning, also known as data fusion or data integration from multiple views, is an emerging area in machine learning \citep{zhang2016multi}. Though widely studied in computer vision tasks \citep{federici2020learning, wang2019deep, wan2021multi}, it has received less attention in RL decision-making tasks. \citet{chen2017double} proposed the double-task deep Q-Network within multiple views based on double-DQN \citep{van2016deep} and dueling-DQN \citep{wang2016dueling}. \citet{li2019multi} defined a framework that generalized partially observable Markov decision processes (POMDPs) to multi-view settings within multiple observation models. In addition, \citet{zhang2020learning} introduced the bisimulation metric \citep{ferns2011bisimulation} to learn latent representations that only encode task-relevant information of image observations. \citet{laskin2020reinforcement} proposed a plug-and-play module that achieved state-of-the-art performance on the default visual DMControl tasks by incorporating data augmentations with the RL agent. \citet{lee2020predictive} learned compressed representations of the predictive information of RL dynamics through a CEB objective with the CatGen decoder \citep{fischer2020conditional} in the single-view setting. By contrast, our work, S2R, learns robust representations via an MCEB auxiliary objective and takes advantage of both multi-view learning and the CEB principle to preserve task-relevant information and ignore task-irrelevant information. We empirically show the performance improvement of S2R over state-of-the-art methods on a variety of visual control benchmarks.

\section{Preliminaries}
\begin{figure*}[htp]
	\centering
	\includegraphics[scale=0.573]{S2R Framework.png}
	\caption{S2R framework. It contains the encoder/target encoder network, feature fusion module, view-specific predictor, and multi-view predictor. Multi-view image states $s_j$ are fed into the encoder network to learn marginal representations $z_j$. Following the reparameterization trick, we obtain sampled representations that are fed successively into the feature fusion module and multi-view predictor to predict ${z}'$ and $r$, and simultaneously into the view-specific predictor to predict ${z}'_j$ and $r$.}
	\label{S2R Framework}
\end{figure*}

\textbf{Multi-view Reinforcement Learning.}\ \ In this paper, we consider multi-view reinforcement learning, an extension of RL to multi-view settings, formulated as a Markov decision process (MDP) $\{ S,A,P, r, \gamma\} $. Here, $S$, $A$, $P(s^{t+1}|s^t,a^t):S\times A \times S \to [0,1]$, $r(s^t,a^t):S\times A\to\mathbb{R}$, and $\gamma\in[0,1) $ respectively denote the state space, the action space, the probability of transitioning to state $s^{t+1}$ when the agent takes action $a^t$ in state $s^t$, the reward function that maps state $s^t$ and action $a^t$ to a real number, and the discount factor. Given $r$ and $\gamma$, the agent aims to learn an optimal policy $\pi$ that maximizes the expectation of the cumulative discounted reward $R =  {\textstyle \sum_{t}}\gamma^{t}r(s^t,a^t)$. 

Crucially, we focus on image-based tasks, which means the agent needs to learn a policy from pixels. To obtain the multi-view data, following the common practice in multi-view learning, we repeatedly apply random data augmentation to the original image state $s^t$ received by the agent to generate diverse sub-images $s^t_{j}$ as multi-view states, where $j\in\{1,\dots,N\}$ is the view index. 

\textbf{Soft Actor-Critic.} \ \ Soft Actor-Critic (SAC) \citep{haarnoja2018soft} is an off-policy actor-critic algorithm that learns a stochastic policy $\pi_{\phi}$ to maximize a $\gamma$-discounted, maximum-entropy return \citep{ziebart2008maximum} by optimizing three objectives. Given transition tuples $\tau^{t}=(s^t,a^t,r^t, s^{t+1})$ sampled from the replay buffer $\mathcal{B}$, the critic minimizes the following Bellman error:
\begin{equation}
L_{Q_{\varphi_i}} = \mathbb{E}_{\tau\sim\mathcal{B}}\left [ \left (  Q_{\varphi_i}(s^t,a^t)-(r^t+\gamma V(s^{t+1}))\right )^{2}  \right ] \ 
\end{equation}
where $V(s^{t+1})$ is the target value of $s^{t+1}$, defined as:
\begin{equation}
V(s^{t+1})= \mathbb{E}_{{a}'\sim\pi} \Big(\min_{i=1,2} \bar{Q}_{\bar{\varphi}_i}(s^{t+1},{a}') -\alpha \log \pi_{\phi}({a}'|s^{t+1}) \Big) 
\end{equation}
Note that SAC maintains two critics ($Q_{\varphi_1}$, $Q_{\varphi_2}$) and two target critics ($\bar{Q}_{\bar{\varphi}_1}$, $\bar{Q}_{\bar{\varphi}_2}$), and uses the exponential moving average (EMA) to update the target network parameters. For the actor, actions are sampled using the reparameterization trick, i.e., $a_{\phi}(s^t,\xi) = \tanh(\mu_\phi(s^t)+\sigma_\phi(s^t)\odot\xi)$ with a standard normal noise vector $\xi\sim\mathcal{N}(0,I)$, and the actor minimizes:
\begin{equation}
L_{\pi_\phi} = \mathbb{E}_{a\sim\pi} \Big[\alpha \log \pi_\phi(a|s^t) - \min_{i=1,2} Q_{\varphi_i}(s^t,a) \Big] \qquad
\end{equation}
Given the target entropy $\mathcal{H}$ of the policy distribution, the temperature $\alpha$ is tuned by minimizing:
\begin{equation}
L_{\alpha} = \mathbb{E}_{a\sim \pi}[-\alpha \log \pi_\phi(a|s^t)-\alpha\mathcal{H}]\qquad\qquad\qquad 
\end{equation}

\section{S2R for multi-view RL}
To address the learning challenges of multi-view RL mentioned in Sec. \ref{sec:intro}, we propose S2R, which mainly contains: the representation learning framework, the self-supervision objective, and the combination of S2R with multi-view RL. For readability, we simplify the time index of the transition tuple, replacing \{$s^t,a^t,r^t,s^{t+1}$\} with \{$s,a,r,{s}'$\}.

\subsection{S2R Framework} 
\label{s2r framewok}
To extract representations from pixel states in multi-view RL, we design the S2R representation learning framework shown in Fig. \ref{S2R Framework}. It includes:

\begin{enumerate}[label={(\arabic*)}]
	\item Encoder/target encoder network. Both of them are responsible for encoding image states (high-dimensional) into marginal representations (low-dimensional) in a common latent space.
	\item Feature fusion module. Its purpose is to integrate (sampled) marginal representations into joint representations in the common latent space.
	\item View-specific predictor. By inputting the sampled marginal representation together with the action and predicting the latent transition function and reward function, it can maximize task-relevant information and minimize task-irrelevant information in the marginal representation.
	\item Multi-view predictor. By taking the sampled joint representation together with the action as input and making the same predictions, it can effectively extract useful information from the joint representation.
\end{enumerate}

\subsection{S2R Objective}
\textbf{Two-view CEB.} The CEB objective was proposed by \citet{fischer2020conditional}. Given high-dimensional data $X$, it learns a representation $Z$ of $X$ to predict the label $Y$ by solving $\min_{Z} \beta I(X; Z|Y)- I(Y; Z)$, so that the information captured in $Z$ is maximally relevant to $Y$. Here, $I(X; Z|Y)$ is the conditional mutual information, which measures the reduction in uncertainty about $X$ due to learning $Z$ when $Y$ is given, and $I(Y; Z)$ is the mutual information, which measures the reduction in uncertainty about $Y$ due to learning $Z$ \citep{cover1999elements}. Based on CEB, we propose a new MCEB objective to optimize the networks of the S2R framework (Sec. \ref{s2r framewok}). For simplicity, we start with the two-view case. Considering the sequential nature of RL, we define $X_1, X_2$ as the current image states, $Z_1, Z_2, Z$ as the current latent representations, and $Y_1, Y_2, Y$ as the rewards and next latent representations. We define the two-view CEB objective as:
\begin{align}
&\textbf{obj.}\min_{Z,Z_1,Z_2} \beta_1I(X_1;Z_1|Y_1)+\beta_2I(X_2;Z_2|Y_2)-I(Z;Y) \notag\\
&\quad \ =\min_{z,z_1,z_2} \beta_1I(s_1;z_1|{z}'_1,r,a) + 
\beta_2 I(s_2;z_2|{z}'_2,r,a) \ - \notag\\
&\qquad\qquad\quad I(z;{z}',r|a) \notag\\
&\textbf{s.t.} \quad Z= f_\theta(Z_1,Z_2) \Rightarrow z = f_\theta(z_1,z_2)
\label{eq-2MCEB}
\end{align}
where $\beta_1$ and $\beta_2$ are regularization factors. To better understand this objective, we show an information diagram (I-diagram) for $X_1$, $X_2$, $Z_1$, $Z_2$, $Z$, $Y_1$, $Y_2$ and $Y$ in Fig. \ref{I-diagrams}. Intuitively, we observe that $I(X_1;Z_1) = I(Z_1;Y_1) + I(X_1;Z_1|Y_1)$ and $I(X_2;Z_2) = I(Z_2;Y_2) + I(X_2;Z_2|Y_2)$. Thus, to obtain a minimal and sufficient $Z$, we must minimize the redundant information ($I(X_1;Z_1|Y_1)$ and $I(X_2;Z_2|Y_2)$) and maximally preserve the relevant information ($I(Z;Y)$, where $Z=f_\theta(Z_1,Z_2)$ is the joint representation obtained by fusing the marginal representations $Z_1$ and $Z_2$ with the S2R feature fusion module).
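
As a brief sketch of why the two identities above hold, assume (as in CEB) the Markov chain $Z_1 \leftarrow X_1 \leftrightarrow Y_1$, i.e., $I(Z_1;Y_1|X_1)=0$. Expanding $I(Z_1;X_1,Y_1)$ with the chain rule of mutual information in two ways gives
\begin{align*}
I(Z_1;X_1,Y_1) &= I(Z_1;X_1) + I(Z_1;Y_1|X_1) \\
&= I(Z_1;Y_1) + I(Z_1;X_1|Y_1).
\end{align*}
Setting $I(Z_1;Y_1|X_1)=0$ and equating the two lines yields $I(X_1;Z_1) = I(Z_1;Y_1) + I(X_1;Z_1|Y_1)$. The same argument applies to the second view under the analogous assumption on $Z_2$, $X_2$ and $Y_2$.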

\begin{figure}[hp]
	\centering
	\includegraphics[scale=0.56]{I-diagrams.png}
	\caption{I-diagram of the two-view CEB.}
	\label{I-diagrams}
\end{figure}

\textbf{Optimization of Two-view CEB.} In Eq. (\ref{eq-2MCEB}), it is intractable to directly compute the (conditional) mutual information terms. Fortunately, the variational inference method provides a feasible solution by approximating intractable terms with variational bounds that are easily optimized by standard gradient methods \citep{kingma2014auto, alemi2016deep}. To get the variational upper bound of Eq. (\ref{eq-2MCEB}), we first rewrite it below.
\begin{align}
&\min_{Z,Z_1,Z_2} \beta_1(I(X_1;Z_1)-I(Z_1;Y_1))+\beta_2(I(X_2;Z_2)\ -  \qquad \notag\\
&\qquad\qquad\ \  I(Z_2;Y_2))-I(Z;Y), \ \ Z = f_\theta(Z_1,Z_2) \quad \notag\\
& =\min_{z,z_1,z_2} \beta_1(I(s_1;z_1)-I(z_1;{z}'_1,r|a)) + 
\beta_2 (I(s_2;z_2) - \notag\\
&\qquad\quad I(z_2;{z}'_2,r|a)) - I(z;{z}',r|a),\  z = f_\theta(z_1,z_2)
\label{eq-2MCEB-alter}
\end{align}
Then, we give the joint probability density function of the variables $s_1$, $s_2$, $z_1$, $z_2$, $z$, ${z}'_1$, ${z}'_2$, ${z}'$, $r$ and $a$. By the chain rule of probability, it can be expressed as:
\begin{align}
& p(s_1, s_2, z_1, z_2, z, {z}'_1, {z}'_2, {z}', r, a) = p(z|s_1, s_2, z_1, z_2, {z}'_1, {z}'_2,\qquad \notag \\
& \quad {z}', r, a) \ {\cdot} \ p(z_1|s_1, s_2, z_2, {z}'_1, {z}'_2, {z}', r, a)\ \cdot p(z_2|s_1, s_2, {z}'_1, \notag \\
& \quad {z}'_2, {z}', r, a)\  {\cdot} \  p(s_1, s_2, {z}'_1, {z}'_2, {z}', r, a)
\end{align}
Since $z_1$ is extracted from $s_1$, $z_2$ is extracted from $s_2$, and $z$ is fused from $z_1$ and $z_2$, we infer that $z_1$ is conditionally independent of the other conditioning variables given $s_1$, $z_2$ is conditionally independent of the other conditioning variables given $s_2$, and $z$ is conditionally independent of the other conditioning variables given $z_1$ and $z_2$. Therefore, we have:
\begin{equation}
\begin{aligned}
p(s_1, s_2, z_1, z_2, z, {z}'_1, {z}'_2, {z}', r, a) = p(z|z_1, z_2) \ {\cdot} \\
p(z_1|s_1) \cdot p(z_2|s_2) \ {\cdot} \ p(s_1, s_2, {z}'_1, {z}'_2, {z}', r, a)
\end{aligned}
\end{equation}
Based on the standard definition of the (conditional) mutual information, the non-negative property of the Kullback-Leibler divergence (KL-divergence), the above joint probability density function, and the Monte Carlo sampling \citep{shapiro2003monte}, we derive the variational upper bound of Eq. (\ref{eq-2MCEB}) as follows.
\begin{align}
&\beta_1I(s_1;z_1|{z}'_1,r,a) + 
\beta_2 I(s_2;z_2|{z}'_2,r,a) - I(z;{z}',r|a) \le \notag\\ 
& \frac{1}{M} \sum^{M} \Big (\beta_1 [D_{KL}\left(p(z_1| s_1)||q_1(z_1)\right)- \mathbb{E}_{z_{1}\sim p(z_1| s_1)} \log g_{\omega_1}( \notag\\
&\qquad\; {z}'_1,r|z_1,a)\ ] \ + \ \beta_2 [D_{KL}\left(p(z_2| s_2)||q_2(z_2)\right)\ - \notag\\
&\qquad\; \mathbb{E}_{z_{2}\sim p(z_2| s_2)}\log{g_{\omega_2} ({z}'_2,r| z_2,a)}\ ]\ - \ \mathbb{E}_{z_{1}\sim p(z_1| s_1)}\notag\\
&\qquad\mathbb{E}_{z_{2}\sim p(z_2| s_2)}\mathbb{E}_{z\sim p(z|z_1,z_2)}\left[ \log{g_{\omega_{12}}({z}',r|z,a)}\right] \Big )
\label{eq-lower-bound}
\end{align}
where $M$ is the number of samples obtained by Monte Carlo sampling; $g_{\omega_1}({z}'_1,r|z_1,a)$, $g_{\omega_2}({z}'_2,r|z_2,a)$ and $g_{\omega_{12}}({z}',r|z, a)$ are distributions parameterized by neural networks (the view-specific predictors or the multi-view predictor) that approximate the true distributions $p({z}'_1,r|z_1,a)$, $p({z}'_2,r|z_2,a)$ and $p({z}',r|z, a)$; and the variational distributions $q_1(z_1) = \mathcal{N}(0,I)$ and $q_2(z_2) = \mathcal{N}(0,I)$ approximate the true marginals $p(z_1)$ and $p(z_2)$. Detailed derivations of Eq. (\ref{eq-lower-bound}) are given in Appendix A.

Next, we assume $p(z_1|s_1)$, $p(z_2|s_2)$ and $p(z|z_1,z_2)$ are Gaussian distributions whose respective means ($\mu_1, \mu_2, \mu_{12}$) and standard deviations ($\sigma_1, \sigma_2, \sigma_{12}$) are learned by MLPs:
\begin{align}
&p(z_1|s_1)=\mathscr{N}(\mu_1(s_1;\psi_1), \sigma_1(s_1;\psi_1)) \notag\\
&p(z_2|s_2)=\mathscr{N}(\mu_2(s_2;\psi_2), \sigma_2(s_2;\psi_2)) \notag\\
&p(z|z_1, z_2)=\mathscr{N}(\mu_{12}(z_1,z_2;\psi_{12}), \sigma_{12}(z_1,z_2;\psi_{12}))
\label{eq11}
\end{align}
In Eq. (\ref{eq11}), $\psi_1, \psi_2, \psi_{12}$ are parameters of the MLPs used for learning $p(z_1|s_1)$, $p(z_2|s_2)$ and $p(z|z_1,z_2)$, respectively. To backpropagate the gradient through random variables $z_1$, $z_2$ and $z$, we use the reparameterization trick:
\begin{equation}
\begin{aligned}
&z_1=\mu_1(s_1;\psi_1) + \sigma_1(s_1;\psi_1)\cdot\xi_1 \quad\\
&z_2=\mu_2(s_2;\psi_2) + \sigma_2(s_2;\psi_2)\cdot\xi_2 \\
&z=\mu_{12}(z_1,z_2;\psi_{12}) + \sigma_{12}(z_1,z_2;\psi_{12})\cdot\xi_{12}
\end{aligned}
\label{eq12}
\end{equation}
where $\xi_1\sim\mathcal{N}(0,I)$, $\xi_2\sim\mathcal{N}(0,I)$ and $\xi_{12}\sim\mathcal{N}(0,I)$ are Gaussian random variables. Therefore, Eq. (\ref{eq-lower-bound}) is transformed into Eq. (\ref{eq-2-view-loss}), the final optimization loss of Eq. (\ref{eq-2MCEB}).
\begin{align}
&\min_{z, z_1, z_2} \frac{1}{M}\sum^{M} \Big ( \beta_1 [D_{KL}(p(z_1| s_1)||q_1(z_1)) - \mathbb{E}_{\xi_1} \log g_{\omega_1}( \notag\\
&\quad {z}'_1,r|z_1,a)] + \beta_2 [D_{KL}(p(z_2| s_2)||q_2(z_2)) - \mathbb{E}_{\xi_2} \log g_{\omega_2}( \notag\\
&\quad {z}'_2,r|z_2,a)] - \mathbb{E}_{\xi_1} \mathbb{E}_{\xi_2} \mathbb{E}_{\xi_{12}} \log g_{\omega_{12}}({z}',r|z,a) \Big ) 
\label{eq-2-view-loss}
\end{align}
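As a hedged illustration of how the KL terms in Eq. (\ref{eq-2-view-loss}) can be evaluated, suppose each encoder outputs a $d$-dimensional diagonal Gaussian with mean $\mu_j$ and standard deviation $\sigma_j$, as used in the reparameterization of Eq. (\ref{eq12}). The latent dimension $d$ and the component index $k$ are notation introduced here, and the diagonal form is an assumption. With $q_j(z_j)=\mathcal{N}(0,I)$, the KL divergence has the standard closed form
\begin{align*}
&D_{KL}(p(z_j|s_j)||q_j(z_j)) \\
&\quad = \tfrac{1}{2} \textstyle\sum_{k=1}^{d} \big(\mu_{j,k}^{2}+\sigma_{j,k}^{2}-1-\log \sigma_{j,k}^{2}\big),
\end{align*}
which is zero only when $\mu_j=0$ and $\sigma_j=1$ in every component. Under this assumption, the KL terms need no sampling, and only the log-likelihood terms are estimated with the reparameterized samples.
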
\textbf{From Two-view CEB to MCEB.} For cases with more than two views, we can easily generalize the two-view CEB objective to the MCEB objective by adding information terms. Given $N$ views $(X_1,\dots,X_N)$, it is expressed as:
\begin{align}
&\textbf{obj.}\quad \min_{Z,Z_1,\cdots,Z_N} \textstyle \sum_{j=1}^{N} \beta_j I(X_j;Z_j|Y_j)-I(Z;Y) \ = \notag \\
&\min_{z,z_1,\cdots,z_N} \sum_{j=1}^{N} \beta_j (I(s_j;z_j)-I(z_j; {z}'_j,r|a))- I(z;{z}',r|a)\notag \\
&\textbf{s.t.}\ \ \ \  Z = f_\theta(Z_1,\cdots, Z_N) \Rightarrow z = f_\theta(z_1,\cdots, z_N)
\label{eq-MCEB}
\end{align}
Referring to the same derivation process of the two-view CEB objective, the final optimization loss of the MCEB objective (Eq. (\ref{eq-MCEB})) can be expressed as follows:
\begin{align}
&\min_{z,z_1,\cdots,z_N} \frac{1}{M}\sum^{M} \Big ( \sum_{j=1}^{N} \beta_j \Big[D_{KL}(p(z_j| s_j)||q_j(z_j)) - \mathbb{E}_{\xi_j} \notag \\
&\qquad\qquad\qquad\quad\log g_{\omega_j}({z}'_j,r | z_j, a ) \Big] -  \mathbb{E}_{\xi_1}\dots \mathbb{E}_{\xi_N} \mathbb{E}_{\xi_{1N}} \notag \\
&\qquad\qquad\qquad\quad \log g_{\omega_{1N}}({z}',r|z,a)\Big )
\label{eq-MCEB-loss}
\end{align}
\subsection{Incorporate S2R into multi-view RL}

To incorporate S2R into multi-view RL, we train the S2R model and the RL agent simultaneously and treat the S2R loss as an auxiliary loss (Fig. \ref{S2R + SAC framework}). To obtain multi-view image states, we repeatedly apply random crop augmentation to transitions sampled from the replay buffer and keep each crop consistent across the three consecutive stacked frames to retain the temporal information in the states. This allows the S2R model to infer task dynamics and is better suited to the RL setting. Algorithm \ref{alg1} gives the detailed procedure for integrating S2R with SAC. In our implementation, we use a (target) encoder ($\rho(s_j)/\bar\rho({s}'_j)$), MLPs ($\psi$), view-specific/multi-view predictors ($\omega$), and two views of the data. The first view is used to train the RL agent and, together with the second view, to train the S2R model. For settings with multimodal states (image, text, audio, etc.), we can use $N$ (target) encoders, MLPs, view-specific/multi-view predictors, and the joint latent representation to train the RL agent and S2R model.

\begin{figure}[ht]
	\centering
	\includegraphics[scale=0.555]{S2R+SAC Framework.png}
	\caption{Joint training of the S2R model and RL agent.}
	\label{S2R + SAC framework}
\end{figure}

\begin{algorithm}[htp]
	\caption{S2R + SAC pseudo-code}
	\label{alg1}
	\begin{algorithmic}[1]
		\State Initialize: parameters of critic ($\varphi_i, \bar\varphi_i$), actor ($\phi$), S2R model ($\rho, \bar\rho, \theta, \psi, \omega$), temperature ($\alpha$), views $N$, replay buffer $\mathcal{B}$, training step $T$, gradient step $K$, batch size $M$
		\For {$\text{step} \ t=1$ to $T$}
		\For {each collection step}
		\State Store interaction data: $\mathcal{B}\gets\mathcal{B} \cup (s,a,r,{s}')$.
		\EndFor
		\For {$\text{step} \ k=1$ to $K$}
		\State Sample batches $D:\{(s,a,r, {s}')\}_{m=1}^{M}$ from $\mathcal{B}$.
		\State Apply data augmentation to $D$, giving: \Statex \qquad\qquad $D=\{(s_j,a,r, {s}'_j)\}_{m=1}^{M},\ j\in[1,N]$
		\State Compute target value: 
		\Statex \qquad\qquad $V=\min \bar{Q}_i(\bar{\rho} ({s}'_1),{a}') -\alpha \log \pi({a}'|\bar{\rho}({s}'_1))$
		\State Update critic: 
		\Statex \qquad\qquad $L_{\varphi_i} = [ Q_i(\rho(s_1),a)- (r + \gamma V)]^{2}$
		\State Update actor: 
		\Statex \qquad\qquad $L_\phi = \alpha \log \pi(a|\rho(s_1)) - \min Q_i(\rho(s_1),a)$
		\State Update temperature: 
		\Statex \qquad\qquad $L_\alpha = -\alpha \log \pi(a|\rho(s_1))-\alpha \mathcal{H}$
		\State Update S2R model ($\rho(s_j)$, etc.) by $D$,  Eq. (\ref{eq-MCEB-loss}).
		\State Update target critic:  $\bar{\varphi}_i=\tau_\varphi \cdot \varphi_i + (1-\tau_{\varphi}) \cdot \bar{\varphi}_i$
		\State Update target encoder:  $\bar{\rho}=\tau_\rho \cdot \rho + (1-\tau_{\rho}) \cdot \bar{\rho}$
		\EndFor
		\EndFor
	\end{algorithmic}
\end{algorithm}
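
To make the effect of the two target updates at the end of Algorithm \ref{alg1} concrete, the recursion can be unrolled. This is a sketch that assumes a fixed rate $\tau_\rho$; the subscript $k$ (the update index) is notation introduced here for illustration:
\begin{align*}
\bar{\rho}_k &= \tau_\rho\,\rho_k + (1-\tau_\rho)\,\bar{\rho}_{k-1} \\
&= (1-\tau_\rho)^{k}\,\bar{\rho}_0 + \tau_\rho \textstyle\sum_{i=0}^{k-1}(1-\tau_\rho)^{i}\,\rho_{k-i}.
\end{align*}
The weights are non-negative and sum to one, so the target encoder is an exponentially weighted average of past encoder parameters in which older parameters are discounted geometrically. A smaller $\tau_\rho$ therefore gives a more slowly moving target. The same reasoning applies to the target critics with $\tau_\varphi$.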

\section{Experiments}
In this paper, we design a variety of experiments to answer the following questions:

\begin{itemize}
	\item Does S2R achieve better sample efficiency in RL visual control tasks (Table \ref{Default DMC}, Fig. \ref{Default median} - \ref{Ablation Setting})?
	\item Is S2R robust to complex settings with the random image distractor or natural video distractor (Fig. \ref{Total Setting})?
	\item Can S2R perform better than existing reconstruction-based, non-reconstruction-based, or contrastive-based RL representation methods (Table \ref{Default DMC}, Fig. \ref{Default median} - \ref{Total Setting})?
	\item For S2R, how much information should be preserved for an efficient representation? Is it sufficient to predict only the latent transition function or only the reward function in MCEB? Is the MCEB objective more suitable than its mutual information or CEB variants? How does S2R perform when the number of views increases (Fig. \ref{Ablation Setting})?
\end{itemize}

\subsection{Experiment Setup}

\textbf{DMControl Suite.} To evaluate the performance of S2R, we combine it with the SAC algorithm and focus on visual continuous control tasks in the DMControl Suite \citep{tassa2018deepmind}. Our benchmark includes six different environments under three settings. \textbf{(1) Default Setting.} The agent receives pixel states with the default background. \textbf{(2) Image Distractor Setting.} The agent receives pixel states with a random image as the background. \textbf{(3) Natural Video Setting.} The agent receives pixel states whose background is a natural video selected from the ``arranging flowers'' class of the Kinetics dataset \citep{kay2017kinetics}. Fig. \ref{DMC tasks} shows snapshots of pixel states in these settings.

\begin{figure}[hp]
	\centering
	\includegraphics[scale=0.9]{DMC tasks.png}
	\caption{Tasks from left to right are ball-in-cup catch, cartpole swingup, cheetah run, finger spin (the first row)/walker run (the second/third row), reacher easy, and walker walk.}
	\label{DMC tasks}
\end{figure}

\textbf{Implementation.} We base our S2R method on the implementation of RAD \citep{laskin2020reinforcement}\footnote{\url{https://github.com/MishaLaskin/rad}} and use most of its default parameters, including the learning rate and action repeat. We run each benchmark on a desktop with an 8-core CPU and two NVIDIA GeForce RTX 3090 GPUs. Figures show the mean and standard error across five seeds unless specified otherwise. We apply random crop augmentation to the agent's $100 \times 100$ original image states to obtain $84 \times 84$ multi-view states. Full implementation details and hyper-parameters are listed in Appendix B.
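
As a small worked example of the view generation described above, consider how one view is cropped. This is a sketch that assumes crop offsets are drawn uniformly and independently for each view; the offset symbols $u_j$ and $v_j$ are notation introduced here. An $84\times 84$ crop of a $100\times 100$ image $s$ is determined by its top-left corner:
\begin{align*}
&u_j, v_j \sim \mathrm{Uniform}\{0,1,\dots,16\}, \\
&s_j = s[\,u_j : u_j+84,\; v_j : v_j+84\,],
\end{align*}
since $100-84=16$ is the largest valid offset. Under this assumption, each view is one of $17^2 = 289$ possible crops of the same underlying state.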

\subsection{Baseline Algorithms}

We compare S2R + SAC with several state-of-the-art pixel-based RL methods. DBC \citep{zhang2020learning} learns effective representations for downstream control tasks through the bisimulation metric. RAD \citep{laskin2020reinforcement} trains the policy on augmented data. CURL \citep{srinivas2020curl} combines a contrastive learning objective with a model-free RL agent. SLAC \citep{lee2019stochastic} learns stochastic sequential models via a variational inference objective. PlaNet \citep{hafner2019learning} and Dreamer \citep{hafner2019dream} are two model-based algorithms; both learn a world model, and they choose actions via online planning and long-horizon imagination, respectively. SAC + AE \citep{yarats2021improving} combines an auto-encoder with a model-free RL algorithm via an auxiliary reconstruction loss. Pixel SAC is the SAC \citep{haarnoja2018soft} algorithm with image inputs, while State SAC operates on proprioceptive states (positions, velocities, etc.). For a fair comparison, we run DBC with the same action repeat as RAD and S2R.

\subsection{Main Results}

\begin{figure}[bp]
	\centering
	\includegraphics[scale=0.529]{Default_median.png}
	\caption{Performance of S2R + SAC relative to baselines, averaged across 10 seeds in the default setting. Results are the medians over the 6 pixel-based control tasks in Table \ref{Default DMC}; results for methods other than S2R + SAC are those reported in CURL.}
	\label{Default median}
\end{figure}

\textbf{Default Setting Results.} To evaluate the sample efficiency of our method, we first report the median scores achieved by S2R + SAC and the baselines on the DMControl100k (low-sample performance) and DMControl500k (asymptotically optimal performance) benchmarks\footnote{DMControl100k/DMControl500k refers to 100k/500k environment (simulator) steps, which equals 50k/250k policy steps when the action repeat is set to 2.} in Fig. \ref{Default median}, and show the per-task scores on the 6 control tasks in Table \ref{Default DMC} and Fig. \ref{Default Setting}. In Fig. \ref{Default median}, the median scores of S2R + SAC are $1.14\times$/$1.04\times$ those of State SAC, $1.59\times$/$1.05\times$ those of CURL, and $6.69\times$/$5.12\times$ those of Pixel SAC at 100k/500k environment steps, showing that S2R + SAC has a higher sample efficiency. In Table \ref{Default DMC}, S2R + SAC, which integrates MCEB-based representation learning with model-free RL, achieves state-of-the-art performance among pixel-based methods on all (6 out of 6) visual DMControl tasks on both the DMControl100k and DMControl500k benchmarks. It exceeds the best-performing pixel-based baselines, RAD and CURL, matches State SAC, which operates on proprioceptive states, and significantly improves on Pixel SAC on both benchmarks. In Fig. \ref{Default Setting}, the learning curves of S2R + SAC and DBC further confirm the better sample efficiency of S2R + SAC.
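
For concreteness, the two quantities used above can be worked out as follows. This is an illustrative calculation only: it assumes that the medians are taken over the six per-task mean scores listed in Table \ref{Default DMC}, and the symbols $T_{\mathrm{env}}$, $T_{\mathrm{policy}}$, and $k$ (the action repeat) are introduced just for this example.
\begin{align*}
	T_{\mathrm{policy}} &= \frac{T_{\mathrm{env}}}{k} = \frac{100\text{k}}{2} = 50\text{k}, \\
	\frac{\operatorname{median}(\text{S2R + SAC})}{\operatorname{median}(\text{State SAC})} &= \frac{886.5}{778.5} \approx 1.14 \ \ (100\text{k}), \qquad \frac{964}{923} \approx 1.04 \ \ (500\text{k}).
\end{align*}
Under this assumption, the ratios obtained from the table agree with the values quoted for State SAC.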

\begin{table*}[tp]
	\renewcommand\arraystretch{1.25}   
	\newcommand{\tabincell}[2]{\begin{tabular}{@{}#1@{}}#2\end{tabular}}
	\caption{Scores (mean and standard deviation) for S2R + SAC and baselines (as reported by RAD) on DMControl500k and DMControl100k. Scores are averaged over 10 seeds for each of the 6 control tasks. On both benchmarks, S2R + SAC achieves state-of-the-art performance among pixel-based methods on all (6 out of 6) control tasks.}
	\label{Default DMC}
	\resizebox{\textwidth}{!}{
		\begin{tabular}{lccccccccc}
			\hline
			500K STEP SCORES & S2R + SAC    & RAD      & CURL     & PlaNet   & Dreamer  & SAC + AE   & SLACv1   & PIXEL SAC & STATE SAC \\ \hline
			FINGER, SPIN     & {\tabincell{c}{\textbf{983}\\ $\pm$ 5}} & \tabincell{c}{947\\ $\pm$101} & \tabincell{c}{926\\ $\pm$45} & \tabincell{c}{561\\ $\pm$284} & \tabincell{c}{796\\ $\pm$183} & \tabincell{c}{884\\ $\pm$128} & \tabincell{c}{673\\ $\pm$92} & \tabincell{c}{192\\ $\pm$166} & \tabincell{c}{923\\ $\pm$211} \\
			CARTPOLE, SWING  & {\tabincell{c}{\textbf{869}\\ $\pm$ 10}} & \tabincell{c}{863\\ $\pm$9} & \tabincell{c}{845\\ $\pm$45} & \tabincell{c}{475\\ $\pm$71} & \tabincell{c}{762\\ $\pm$27} & \tabincell{c}{735\\ $\pm$63} & -         & \tabincell{c}{419\\ $\pm$40} & \tabincell{c}{848\\ $\pm$15} \\
			REACHER, EASY    & \tabincell{c}{\textbf{981}\\ $\pm$5} & \tabincell{c}{{955}\\ $\pm$71} & \tabincell{c}{929\\ $\pm$44} & \tabincell{c}{210\\ $\pm$44} & \tabincell{c}{793\\ $\pm$164} & \tabincell{c}{627\\ $\pm$58} & -         & \tabincell{c}{145\\ $\pm$30}   & \tabincell{c}{923\\ $\pm$24}   \\
			CHEETAH, RUN     & {\tabincell{c}{\textbf{837}\\ $\pm$ 21}} & \tabincell{c}{728\\ $\pm$71} & \tabincell{c}{518\\ $\pm$28}  & \tabincell{c}{305\\ $\pm$131} & \tabincell{c}{570\\ $\pm$253} & \tabincell{c}{550\\ $\pm$34}  & \tabincell{c}{640\\ $\pm$19}  & \tabincell{c}{197\\ $\pm$15}   & \tabincell{c}{795\\ $\pm$30}   \\
			WALKER, WALK     & \tabincell{c}{\textbf{950}\\ $\pm$19} & \tabincell{c}{918\\ $\pm$16}  & \tabincell{c}{902\\ $\pm$43}  & \tabincell{c}{351\\ $\pm$58}  & \tabincell{c}{897\\ $\pm$49}  & \tabincell{c}{847\\ $\pm$48}  & \tabincell{c}{842\\ $\pm$51}  & \tabincell{c}{42\\ $\pm$12}    & \tabincell{c}{948 \\ $\pm$54}   \\
			CUP, CATCH       & \tabincell{c}{\textbf{978}\\ $\pm$5} & \tabincell{c}{974\\ $\pm$12}  & \tabincell{c}{959\\ $\pm$27}  & \tabincell{c}{460\\ $\pm$380} & \tabincell{c}{879\\ $\pm$87}  & \tabincell{c}{794\\ $\pm$58}  & \tabincell{c}{852\\ $\pm$71}  & \tabincell{c}{312\\ $\pm$63}   & \tabincell{c}{974\\ $\pm$33}   \\ \hline
			100K STEP SCORES & S2R + SAC    & RAD      & CURL     & PlaNet   & Dreamer  & SAC + AE   & SLACv1   & PIXEL SAC & STATE SAC \\ \hline
			FINGER, SPIN     & \tabincell{c}{\textbf{876}\\ $\pm$43} & \tabincell{c}{{856}\\ $\pm$73}  & \tabincell{c}{767\\ $\pm$56}  & \tabincell{c}{136\\ $\pm$216} & \tabincell{c}{341\\ $\pm$70}  & \tabincell{c}{740\\ $\pm$64}  & \tabincell{c}{693\\ $\pm$141} & \tabincell{c}{224\\ $\pm$101}  & \tabincell{c}{811\\ $\pm$46}   \\
			CARTPOLE, SWING  & \tabincell{c}{\textbf{868}\\ $\pm$9}  & \tabincell{c}{828\\ $\pm$27}  & \tabincell{c}{582\\ $\pm$146} & \tabincell{c}{297\\ $\pm$39}  & \tabincell{c}{326\\ $\pm$27}  & \tabincell{c}{311\\ $\pm$11}  & -         & \tabincell{c}{200\\ $\pm$72}   & \tabincell{c}{835\\ $\pm$22}   \\
			REACHER, EASY    & \tabincell{c}{\textbf{961}\\ $\pm$40} & \tabincell{c}{826\\ $\pm$219} & \tabincell{c}{538\\ $\pm$233} & \tabincell{c}{20\\ $\pm$50}   & \tabincell{c}{314\\ $\pm$155} & \tabincell{c}{274\\ $\pm$14}  & -         & \tabincell{c}{136\\ $\pm$15}   & \tabincell{c}{746\\ $\pm$25}   \\
			CHEETAH, RUN     & \tabincell{c}{\textbf{605}\\ $\pm$22} & \tabincell{c}{447\\ $\pm$88}  & \tabincell{c}{299\\ $\pm$48}  & \tabincell{c}{138\\ $\pm$88}  & \tabincell{c}{235\\ $\pm$137} & \tabincell{c}{267\\ $\pm$24}  & \tabincell{c}{319\\ $\pm$56}  & \tabincell{c}{130\\ $\pm$12}   & \tabincell{c}{616\\ $\pm$18}   \\
			WALKER, WALK     & \tabincell{c}{\textbf{897}\\ $\pm$42} & \tabincell{c}{504\\ $\pm$191} & \tabincell{c}{403\\ $\pm$24}  & \tabincell{c}{224\\ $\pm$48}  & \tabincell{c}{277\\ $\pm$12}  & \tabincell{c}{394\\ $\pm$22}  & \tabincell{c}{361\\ $\pm$73}  & \tabincell{c}{127\\ $\pm$24}   & \tabincell{c}{891\\ $\pm$82}   \\
			CUP, CATCH       & \tabincell{c}{\textbf{968}\\ $\pm$6} & \tabincell{c}{840\\ $\pm$179} & \tabincell{c}{769\\ $\pm$43}  & \tabincell{c}{0\\ $\pm$0}     & \tabincell{c}{246\\ $\pm$174} & \tabincell{c}{391\\ $\pm$82}  & \tabincell{c}{512\\ $\pm$110} & \tabincell{c}{97\\ $\pm$27}    & \tabincell{c}{746\\ $\pm$91}   \\ \hline
	\end{tabular}}
\end{table*}

\begin{figure*}[bp]
	\centering
	\subfigure{
		\includegraphics[scale=0.29]{Default_ball_in_cup.png}
		\label{Default ball-in-cup, catch}
	}
	\quad
	\subfigure{
		\includegraphics[scale=0.29]{Default_cartpole.png}
		\label{Default cartpole, swingup}
	}
	\quad
	\subfigure{
		\includegraphics[scale=0.29]{Default_cheetah.png}
		\label{Default cheetah, run}
	}
	\quad
	\subfigure{
		\includegraphics[scale=0.29]{Default_finger.png}
		\label{Default finger, spin}
	}
	\quad
	\subfigure{
		\includegraphics[scale=0.29]{Default_reacher.png}
		\label{Default reacher, easy}
	}
	\quad
	\subfigure{
		\includegraphics[scale=0.29]{Default_walker.png}
		\label{Default walker, walk}
	}
	\caption{Learning curves in the default setting, supplementing Table \ref{Default DMC}. We benchmark S2R + SAC against DBC. Results show that S2R + SAC outperforms DBC on all 6 control tasks.}
	\label{Default Setting}
\end{figure*}

\begin{figure*}[!ht]
	\centering
	\subfigure{
		\includegraphics[scale=0.285]{Image_ball_in_cup.png}
		\label{Image ball-in-cup, catch}
	}
	\quad
	\subfigure{
		\includegraphics[scale=0.285]{Image_cheetah.png}
		\label{Image cheetah, run}
	}
	\quad
	\subfigure{
		\includegraphics[scale=0.285]{Image_walker_run.png}
		\label{Image walker, run}
	}
	\quad
	\subfigure{
		\includegraphics[scale=0.285]{Video_ball_in_cup.png}
		\label{Video ball-in-cup, catch}
	}
	\quad
	\subfigure{
		\includegraphics[scale=0.285]{Video_cheetah.png}
		\label{Video cheetah, run}
	}
	\quad
	\subfigure{
		\includegraphics[scale=0.285]{Video_walker_run.png}
		\label{Video walker, run}
	}
	\caption{Performance of S2R + SAC. \textbf{Top row}: Results in the image distractor setting. \textbf{Bottom row}: Results in the natural video setting. We benchmark S2R + SAC against RAD and DBC in both settings, and the results confirm the better performance of S2R + SAC. Additional results can be found in Appendix C.}
	\label{Total Setting}
\end{figure*}

\textbf{Image Distractor Setting Results.} We then evaluate the performance of S2R in the image distractor setting by replacing the tasks' background with a random image. The top row of Fig. \ref{Total Setting} gives the results for three tasks (ball-in-cup catch, cheetah run, and walker run). S2R + SAC performs comparably to or better than RAD and substantially outperforms DBC, indicating that S2R can discard task-irrelevant information when learning representations.

\textbf{Natural Video Setting Results.} Next, we evaluate S2R + SAC, RAD, and DBC in a more complex setting by using a natural video as the background. The last row of Fig. \ref{Total Setting} gives the results for three tasks (ball-in-cup catch, cheetah run, and walker run). Compared with RAD and DBC, S2R + SAC again performs better and has a higher sample efficiency. We attribute this to the design of S2R, which makes the agent focus only on task-related features and remain insensitive to task-irrelevant visual changes, thus providing robust representations for training the actor and critic.

\begin{figure}[!ht]
	\centering
	\subfigure[MCEB regularization factors]{
		\includegraphics[scale=0.24]{Ablation_lr_cheetah.png}
		\label{Ablation_lr_cheetah, run}
	}
	\subfigure[MCEB predictive data]{
		\includegraphics[scale=0.24]{Ablation_preob_cheetah.png}
		\label{Ablation_preob_cheetah, run}
	}
	\subfigure[Optimization objectives]{
		\includegraphics[scale=0.24]{Ablation_s2rob_cheetah.png}
		\label{Ablation_s2rob_cheetah, run}
	}
	\subfigure[Number of views]{
		\includegraphics[scale=0.24]{Ablation_viewn_cheetah.png}
		\label{Ablation_viewn_cheetah, run}
	}
	\caption{Results in the default setting for ablation studies. (a) compares MCEB regularization factors, (b) compares MCEB predictive data, (c) compares MCEB optimization objectives, and (d) compares the number of views $N$ in MCEB. Additional results can be found in Appendix C.}
	\label{Ablation Setting}
\end{figure}

\textbf{Ablation Studies.} Finally, on the cheetah run task in Fig. \ref{Ablation Setting}, we investigate how S2R is affected by the regularization factors, the predictive data ($Y_1$, $Y_2$, and $Y$), the optimization objective, and the number of views. \textbf{(1) MCEB regularization factors.} In the MCEB objective, the regularization factors control the trade-off between the sufficiency and robustness of the representation, and we use an exponential scheduler in all experiments. As seen from Fig. \ref{Ablation_lr_cheetah, run}, overly large values block information essential to the predictive data, while overly small values reduce the benefit of regularization. These results support the values chosen for the regularization factors in MCEB. \textbf{(2) MCEB predictive data.} To exploit the sequential nature of RL, the predictive data in the MCEB objective can be the reward, the next latent representation, or both. Our experimental results in Fig. \ref{Ablation_preob_cheetah, run} show that simultaneously predicting the latent transition function and the reward function is better than predicting either of them alone. \textbf{(3) MCEB optimization objectives.} With a slight modification to the MCEB objective, we obtain two variants that resemble previously reported methods. The first variant corresponds to PI-SAC \citep{lee2020predictive}, which optimizes the representation model by the CEB principle in the single-view RL setting. The second variant corresponds to MIB (Multi-view Information Bottleneck) \citep{wang2019deep}, which replaces the CEB term $I(X_j; Z_j| Y_j)$ with the IB term $I(X_j; Z_j)$ in MCEB. Compared with these two variants, the MCEB objective achieves better performance and higher sample efficiency in Fig. \ref{Ablation_s2rob_cheetah, run}, supporting the use of both multiple views and the CEB principle in S2R.
\textbf{(4) Number of views in MCEB.} We further ablate the number of views $N$ included in the MCEB objective to understand its effect on the performance of S2R. As seen from Fig. \ref{Ablation_viewn_cheetah, run}, the MCEB objective benefits from multi-view data (especially when the views contain complementary information) and learns robust representations that improve performance, but at the cost of longer training time, since a larger $N$ means a larger computational demand. To balance training time against performance, we set the number of views to 2.
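
To make the difference between the CEB and IB terms in (3) concrete, recall a standard identity. It is stated here as a reminder rather than as part of the MCEB derivation, and it assumes that $Z_j$ is computed from $X_j$ alone, i.e., that $Z_j$ is conditionally independent of $Y_j$ given $X_j$. Expanding $I(X_j, Y_j; Z_j)$ with the chain rule in both orders and using $I(Y_j; Z_j | X_j) = 0$ gives
\begin{equation*}
	I(X_j; Z_j | Y_j) = I(X_j; Z_j) - I(Y_j; Z_j).
\end{equation*}
Under this assumption, the CEB term penalizes only the information about $X_j$ that is not shared with the predictive data $Y_j$, whereas the IB term $I(X_j; Z_j)$ penalizes all information retained from $X_j$, including the predictive part.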

\section{Discussion}
In this paper, we present S2R, a multi-view self-supervised representation learning method that uses multi-view data and the CEB principle to learn efficient and sufficient representations for the policy learning of an RL agent. S2R introduces a representation learning framework for multi-view RL and defines a novel MCEB auxiliary objective for the training of the actor and critic, which extracts useful features from pixel states by ignoring task-irrelevant information. As a decoupled representation module, S2R is easy to integrate with deep RL agents to find optimal policies. To evaluate S2R, we perform extensive experiments on the DMControl suite. Empirical results show that S2R learns robust representations and improves the sample efficiency of the RL agent on various default and noisy visual continuous control tasks.

We want to emphasize that one way to theoretically analyze the sample efficiency of S2R is through its sample complexity. According to \citet{kakade2003sample} and \citet{strehl2006pac}, the sample complexity of an RL algorithm can be expressed as the amount of experience the RL agent needs to learn to behave well. Analyzing the sample complexity of S2R combined with specific RL algorithms remains an open and challenging problem and is a clear direction for future work. In addition, a natural extension of S2R is to combine it with model-based planning, which may further improve its sample efficiency, since model-based RL algorithms are generally more sample-efficient than model-free ones. For future research, we are therefore interested in incorporating S2R into model-based RL algorithms, first learning an accurate environment model by reducing the model bias and then planning actions through the learned model. Integrating S2R with exploration mechanisms is also a reasonable way to improve its sample efficiency in sparse-reward visual RL settings. In realistic RL applications, the sparse-reward problem is common, and the agent may need to learn policies in environments with sparse or deceptive rewards. These challenges motivate us to improve the exploration efficiency of S2R in the future.

\begin{acknowledgements} 
	This work was supported by the National Natural Science Foundation of China (No. 91948303).
\end{acknowledgements}

\bibliography{yang_234}

\end{document}
