\section{Introduction}
\label{sec:intro}

Visual control, or learning from pixels, is a reinforcement learning (RL) problem in which states are represented as images.
This problem has been studied extensively \citep{rep_rl_survey1}, and prior work reports strong performance on continuous control tasks using images directly as input. Nevertheless, performance in visual control still lags behind that of methods that use physical states as input \citep{SAC-AE}. This gap arises mainly from the difficulty of extracting all \textit{task-relevant} information while filtering out \textit{task-irrelevant} details during representation learning \citep{DBC,DRIBO}. With the increasing availability of multi-view data in many application scenarios, additional perspectives can help distinguish task-relevant from task-irrelevant information, independent of specific actions.
For example, in a robot-arm catching problem, images can be obtained from both upper and horizontal viewpoints. Comparing the images from these two perspectives allows information about the moving robot arm to be extracted and separated from background elements that are irrelevant to the control task \citep{sapien}. This motivates a mechanism that improves visual control performance through effective representation learning from multi-view data.


Another desirable property of the learned representation in visual control is that it supports predicting future states given potential actions. At the same time, it should discard task-irrelevant visual details, thereby capturing the temporal structure of task-relevant dynamics \citep{DRIBO}. Such a representation not only makes the learned policy more robust in unfamiliar environments but also alleviates the difficulties caused by high dimensionality and by the causal confusion effect \citep{CausalConfusion}, which arises from task-irrelevant information.



\begin{figure*}[htb]
    \centering
    \includegraphics[width=\textwidth]{fig/framework_crop.pdf}
    \caption{The RL with \textbf{S}equential \textbf{Mu}lti-view Total \textbf{Co}rrelation (\mname{}) framework takes multi-view observations over $T$ time steps as input to learn complete representations for downstream RL tasks. The dimension of each observation $\vec{o}_i$ ($i=1,2,3$) equals the number of views $V$. The \mname{} objective is derived as a lower bound of the sequential total correlation between multi-view observation sequences and representation sequences.
    }
    \label{fig:framework}
\end{figure*}



This study focuses on representation learning for visual control tasks with input images from multiple views, and introduces a novel reinforcement learning algorithm named SMuCo. Under the multi-view setting \citep{MultiViewIBCV,survey_mvrl,CEB}, where information shared among multi-view observations is considered task-relevant and unshared information is considered task-irrelevant, our method learns task-relevant temporal dynamics while discarding extraneous information. In our framework, depicted in Figure~\ref{fig:framework}, observations from different viewpoints are encoded into a unified representation by deep neural networks, and this learned representation serves as the state for training the policy through reinforcement learning. To train the encoder, we formulate the \mname{} objective, which is derived as a lower bound of the sequential total correlation for sequences of multi-view observations. This objective guides the learning process toward preserving task-relevant temporal dynamics and eliminating task-irrelevant information.

The contributions of this work are summarized as follows:
\begin{itemize}
    \item We propose \mname{}, a novel reinforcement learning framework for representation learning from multiple views in visual control problems based on multi-view total correlation.
    \item We derive the \mname{} objective, a lower bound of the multi-view total correlation between sequential observations and representations, to learn representations that capture task-relevant temporal dynamics while discarding task-irrelevant information.
    \item  We empirically validate that \mname{} can learn a sufficient and concise representation from multiple views of images by demonstrating that \mname{} achieves higher scores than both model-free and model-based state-of-the-art (SOTA) RL algorithms on a number of multi-view image-based control tasks. 
\end{itemize}  






 













\section{Related Work}


\textbf{Visual Control in Reinforcement Learning}. 
Various efforts have been made to develop robust representations for visual control tasks in reinforcement learning. Some approaches address the problem from a contrastive perspective: CURL uses contrastive learning to enhance agent robustness and generalization \citep{CURL}, and Contrastive Predictive Coding (CPC) learns representations by predicting the future in latent space \citep{CPCInfoNCE}. Other works are grounded in the theory of bisimulation, such as DBC \citep{DBC}, a bisimulation-based reinforcement learning algorithm that extracts state information to eliminate redundancy in natural video input. PSE (Policy Similarity Embedding) \citep{PSE} and DBC-IR-ID \citep{DBCIRID} are improved versions of DBC, with DBC-IR-ID incorporating constraints in the representation space, intrinsic rewards, and inverse dynamics.
Furthermore, some works design reinforcement learning algorithms using information-theoretic auxiliary tasks. Among these, DRIBO \citep{DRIBO} establishes an RL framework akin to CURL, leveraging the multi-view information bottleneck method \citep{MultiViewIBCV}. PI-SAC \citep{PISAC} uses a conditional entropy bottleneck (CEB) to predict future observations and rewards. However, most of these methods are not applicable to the multi-view setting. Notably, DRIBO can only handle two views because of the pairwise formulation of its objective. Random PadResize and CycAug \citep{da_methods} are recently proposed data augmentation techniques that improve the sample efficiency of visual reinforcement learning algorithms. To mitigate the visual deadly triad, A-LIX \citep{cetin2022stabilizing} applies adaptive regularization to the encoder's gradients to avoid self-overfitting.
In environments characterized by partial observability, it is rational to employ multiple policies to handle diverse visual observations, as explored in \citep{shang2023active}. However, in multi-view settings, increasing the number of policies with the growing number of views becomes impractical. Therefore, in the context of multi-view scenarios, the potential lies in mapping these observations into a comprehensive and condensed representation, initiating the learning process from the induced latent space.
The challenges associated with learning from visual observations are further compounded by the instability of the Q function in off-policy RL algorithms, as discussed in \citep{Hansen2021StabilizingDQ}. This instability is primarily attributed to redundant information present in raw sensor observations. Nevertheless, the transformation of these raw observations into sufficient and compact representations through multi-view total correlation, as introduced in \citep{MultiViewIBCV}, can mitigate this issue.
The significance of diverse visual perspectives on the learning and generalization performance in the context of visual control is underscored in \citep{Hsu2022VisionBasedMN}, emphasizing the non-negligible impact that different viewpoints can have on the overall effectiveness of the learning process.









\noindent
\textbf{Multi-view Representation Learning}.
Research in multi-view representation learning has delved into extracting robust and concise representations from multi-view data, offering various network architectures to transform such data into representations with desirable properties.
The Memory Fusion Network (MFN) \citep{MFN} constructs sequential multi-view representations, accounting for view interactions within a neural architecture. The Multi-view Laplacian Network \citep{DSpectralRL} is designed to learn spectral representations with consensus from multi-view data. CPM-Nets \citep{CPMNet} aim to learn a comprehensive representation of multi-view data, accommodating potentially partial views. Furthermore, appropriate loss functions are crucial for training deep networks so that the representations acquire desirable properties such as sample efficiency and robustness. S2R \citep{S2R} is a multi-view reinforcement learning algorithm that extends the two-view conditional entropy bottleneck method to a multi-view setting, facilitating the learning of sample-efficient representations. DRIBO \citep{DRIBO} uses the mutual predictability of multi-view (pairwise) observations to acquire a robust representation devoid of task-irrelevant information. However, these methods cannot be directly applied to visual control problems because they do not consider temporal predictability. Fuse2Control (F2C) \citep{mvrl_ssm} is an information-theoretic multi-view reinforcement learning framework that learns a latent state space model and handles the missing-view problem well. However, the temporal length that F2C considers is limited, which could hinder learning in sequential decision-making tasks.


\noindent
\textbf{Total Correlation}.
Total correlation, a fundamental quantity from information theory, plays a pivotal role in AI beyond representation learning and has been applied across a diverse spectrum of tasks \citep{chen2018isolating,locatello2019fairness,kim2018disentangling}.
The Total Correlation method is designed to capture shared information among data with minimal sufficiency, focusing on the independence among random variables \citep{5392532}. 
In the context of independent component analysis (ICA), \citet{cardoso2003dependence} introduces a comprehensive framework that establishes connections between mutual information, entropy, and non-Gaussianity, all without relying on decorrelation constraints. This framework helps explain the underlying structure of complex datasets by exploiting the inherent dependencies among variables. For structure discovery, \citet{ver2014discovering} proposes a methodology centered on learning a hierarchy of progressively more abstract representations of complex data sets. This approach optimizes an information-theoretic objective, so that the learned representations capture meaningful and salient features of the data while supporting interpretability and scalability.
The Total Correlation Explanation (CorEx) principle has been leveraged in unsupervised learning to enhance interpretability. Total correlation, as discussed in \citep{AEVB}, plays a role in characterizing disentanglement and dependence within representations. MVTC \citep{MVTC} introduces an information-theoretic approach to transform multi-view data into complete and minimally sufficient representations.
These works collectively highlight the versatility and power of total correlation as a foundational concept in information theory, showcasing its applicability across a wide range of AI applications.



\section{\mname{}}
\label{sec:theory}

In this section, we give the definition and an approximate formulation of the sequential total correlation used in \mname{}, which captures task-relevant information and temporal dynamics while discarding task-irrelevant information in the learned representations for visual control tasks. We then present the visual control RL algorithm based on \mname{}.


\subsection{Multi-view Total Correlation}


Total correlation, also referred to as multivariate mutual information, has been shown to characterize informativeness and disentanglement from observations \citep{1802.05822}. Optimizing total correlation can guide the stochastic search for a set of latent factors that best explain the correlations in the original data \citep{1406.1222}.
Under the multi-view setting, the total correlation between multi-view observations and representation is defined as:
\begin{equation}
TC(\vec{O}; Z) = TC(\vec{O}) - TC(\vec{O} \mid Z)
\end{equation}
which can be rewritten into: 
\begin{equation}
TC(\vec{O}; Z) = \sum_{v=1}^{V} I(O^{v} ; Z) - I(\vec{O} ; Z)
\label{eq:tc}
\end{equation}
where $I$ denotes mutual information, and $V$ is the number of viewpoints for observations. Maximizing the expected total correlation between multi-view observation and representation can not only enforce informativeness but also guarantee sufficiency of the representation \citep{MVTC}. 
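
As a brief check of Equation~\ref{eq:tc}, assume the standard entropy-based definitions $TC(\vec{O}) = \sum_{v=1}^{V} H(O^{v}) - H(\vec{O})$ and $TC(\vec{O} \mid Z) = \sum_{v=1}^{V} H(O^{v} \mid Z) - H(\vec{O} \mid Z)$, where $H$ denotes entropy; these definitions are not written out above and are stated here as an assumption. Subtracting the second from the first gives
\begin{align*}
TC(\vec{O}; Z)
&= \sum_{v=1}^{V} \left[ H(O^{v}) - H(O^{v} \mid Z) \right] - \left[ H(\vec{O}) - H(\vec{O} \mid Z) \right] \\
&= \sum_{v=1}^{V} I(O^{v}; Z) - I(\vec{O}; Z),
\end{align*}
where the last step uses $I(X; Z) = H(X) - H(X \mid Z)$.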

\begin{figure}[htb]
    \centering
    \includegraphics[width=0.7\columnwidth]{fig/tc_demo_crop.pdf}
    \caption{ Illustration of total correlation on two views.}
    \label{fig:tc_demo}
\end{figure}


In unsupervised representation learning, total correlation has been used to obtain complete and minimal sufficient representations from multiple views \citep{MVTC}.  
Figure~\ref{fig:tc_demo} illustrates the intuition behind this mechanism with a simple two-view example. In Figure~\ref{fig:tc_demo}, the green circles denote the entropy of the observations $O$ and the red ellipses denote the entropy of the representation $Z$. Under the assumptions of the multi-view setting, the two views share overlapping information, whose entropy is denoted by the white area. According to Equation~\ref{eq:tc}, the value of TC equals the shaded area in the figure, which is equal to the entropy of $Z$ in the left part. From left to right, the total correlation increases, and the representation is encouraged to incorporate more information shared between the two views, and can thus extract more task-relevant information under the multi-view setting.


\subsection{Sequential Multi-view Total Correlation}
Temporal structure is important in sequential decision-making problems, and representations that incorporate temporal dynamics predict future states better.
To accurately identify temporal dynamics and remove task-irrelevant information from learned representations for visual control tasks, the encoder should be able to correlate sequential observations and representations in the temporal structure \citep{DRIBO}. 


Empirically, the success of related works such as DRIBO \citep{DRIBO} and PI-SAC \citep{PISAC} has demonstrated the advantage of considering this temporal structure.
For visual control problems with multi-view data, we extend the formulation of total correlation to sequences of multi-view observations conditioned on the action sequences of the MDPs, motivated by the success of PI-SAC \citep{PISAC}. 
PI-SAC is a model-free reinforcement learning algorithm that learns compressive representations of predictive information to improve sample efficiency. It captures the temporal dynamics of the environment in the learned representation by substituting the random variables in CEB \citep{CEB} with a combination of sequences of previous and future observations, actions, and rewards. Specifically, CEB aims to optimize the following objective:


\begin{equation}
CEB \equiv \min _Z \beta I(X ; Z \mid Y)-I(Y ; Z)
\end{equation}
According to \citep{PISAC}, it follows that 
\begin{equation}
CEB \leq \mathbb{E}_{x, y, z \sim p(x, y) e(z \mid x)} \beta \log \frac{e(z \mid x)}{b(z \mid y)}-I(Y ; Z)
\end{equation}
where $e(z \mid x)$ is the true encoder distribution from which the representation $z$ is sampled, and $b(z \mid y)$ is the variational backward encoder distribution that approximates the unknown true distribution $p(z \mid y)$.
The minimization of CEB can be approximated by minimization of this upper bound. 
In PI-SAC, the loss function after substitution of CEB has the following form:
\begin{multline}
\mathcal{L} = 
\mathbb{E} \Big[
\beta \log \frac{e \left(z_0 \mid o_{-T+1: 0}, a_{0: T-1}\right)}{b \left(z_0 \mid o_{1: T}, r_{1: T}\right)} \\
- \log \frac{b \left(z_0 \mid o_{1: T}, r_{1: T}\right)}{\frac{1}{K} \sum_{k=1}^K b\left(z_0 \mid o_{1: T}^k, r_{1: T}^k\right)}
\Big]
\end{multline}
where the expectation is taken over $(o_{-T+1: T}, a_{0: T-1}, r_{1: T} \sim \mathcal{D}, z_0 \sim e\left(z_0 \mid \cdot\right))$.
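
The contrastive ratio in the second term can be read as a sample-based estimate of $I(Y; Z)$. As a sketch, assuming the InfoNCE-style bound with $K$ candidates $y^{1}, \dots, y^{K}$ (the paired $y$ together with $K-1$ samples from the marginal; this sampling scheme is an assumption here rather than something specified above), one has
\begin{equation*}
I(Y; Z) \geq \mathbb{E} \log \frac{b(z \mid y)}{\frac{1}{K} \sum_{k=1}^{K} b(z \mid y^{k})},
\end{equation*}
with $y = (o_{1:T}, r_{1:T})$ and $z = z_0$ in the PI-SAC instantiation. A tighter lower bound on $I(Y; Z)$ therefore corresponds to a tighter upper bound on the CEB objective.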



 
 

Adopting this idea of PI-SAC, we show our extension of multi-view total correlation to sequential multi-view total correlation (SMTC) as follows. 
The SMTC of a sequence of observations and representations is defined as:
\vspace{-\baselineskip}
\begin{multline}
    SMTC(\vec{O}_{1:T}; Z_{1:T}  \mid A_{1:T}) = \\
    \sum_{v=1}^{V} I(O^{v}_{1:T}; Z_{1:T} \mid A_{1:T}) 
    - I(\vec{O}_{1:T}; Z_{1:T} \mid A_{1:T}) 
    \label{eq:SMTC}
\end{multline}
where $\vec{O}_{1:T}$ denotes the sequence of observation view vectors, each with $V$ views, $A$ denotes actions, $Z$ denotes the representation, and $T$ denotes the sequence length. According to \citep{PISAC,infomax}, the encoder predicts future states more accurately when conditioned on multiple future actions.
Maximizing the above SMTC amounts to maximizing $\sum_{v=1}^{V} I(O^{v}_{1:T}; Z_{1:T} \mid A_{1:T})$ while minimizing $I(\vec{O}_{1:T}; Z_{1:T} \mid A_{1:T})$. The former term makes the representation complete, as it encourages $Z$ to be informative about every view, while the latter term keeps the resulting representation concise. Therefore, maximizing SMTC encourages the representation to capture minimally sufficient correlations among different views over the sequences.
Unfortunately, both terms require computing mutual information among random vectors, which is notoriously difficult \citep{MIME,DRIBO}. We therefore seek an appropriate surrogate for the SMTC objective.

Let $O_{1:T}$ be an observation sequence and $Z_{1:T}$ a representation sequence, whose joint distribution is
$p(O_{1:T}, Z_{1:T}) = \prod_{t=1}^{T} p(O_t, Z_t \mid O_{t-1}, Z_{t-1}, A_{t-1})$, where $A_{1:T}$ is the action sequence and $p(O_1, Z_1 \mid O_0, Z_0, A_0) = p(O_1, Z_1)$.
Let $\vec{O}_{1:T}$ be a multi-view observation sequence with $\text{dim } \vec{O} = V$ and temporal length $T$.
We derive a tractable lower bound of the sequential multi-view total correlation as follows:
\begin{theorem}\label{thm:main_thm}
    The sequential total correlation between sequences of multi-view observations and representations, conditioned on the action sequence, has the following lower bound:
    \begin{align}\label{eq:lwb}
        &SMTC(\vec{O}_{1:T}; Z_{1:T}  \mid A_{1:T})
        \geq \notag \\
        &\sum_{v=1}^{V} \sum_{t=1}^{T}
        \left[
        H(O_{t}^{v} \mid Z_{t-1},  A_{t-1}) 
        \right. \notag \\
        &\left. +
        \mathbb{E}_{
        p(z_{t}, o_{t}^{v} \mid z_{t-1}, a_{t-1} )
        }
        \ln q_{\psi}^{v}(o_{t}^{v} \mid z_{t}, z_{t-1}, a_{t-1} )
        \right] \notag \\
        & - \sum_{t=1}^{T} \sum_{s=1}^{T}
        \mathbb{E}_{
        p(\vec{o}_s)
        } \left[ 
        % D_{\text{KL}} ( p(z_t \mid o_s, \iota)  \;\delimsize\|\; r_{\phi}(z_t \mid \iota) )
        \infdiv{ p(z_t \mid o_s, \iota) }{ r_{\phi}(z_t \mid \iota) }   
            \right]
    \end{align}
where $H$ is the entropy function, 
$\iota = (\vec{o}_{1:s-1}, z_{1:t-1}, a_{1:T})$, 
the prior distribution $r_{\phi}(z_{t} \mid \iota) \approx p(z_{t} \mid \iota)$ is a variational approximation parameterized by $\phi$, and the posterior distribution $q_{\psi}^{v}(o^{v}_{t} \mid z_{t}, z_{t-1}, a_{t-1}) \approx p(o^{v}_{t} \mid z_{t}, z_{t-1}, a_{t-1})$ is a variational approximation parameterized by $\psi$.
\end{theorem}

    

Using this result, we construct the loss function for representation learning in \mname{} based on Equation~\ref{eq:lwb}, as detailed in Section~\ref{sec:algo}. The proof of Theorem~\ref{thm:main_thm} is given in the appendix.

\subsection{Visual Control with \mname{}}
\label{sec:algo}



In the following, we show how the visual control task is solved with the proposed \mname{} objective for representation learning. As shown in Figure~\ref{fig:framework}, we use the \mname{} objective derived from SMTC to train the encoder, and the encoded observations serve as states for the reinforcement learning component that learns the control policy. The details of the encoder are explained below.


\textbf{Encoder}. According to Equation \ref{eq:lwb}, terms on the right-hand side can be treated as three parts of the loss function of the encoder as follows:
\begin{equation}
    \mathcal{L} = 
    \mathcal{L}_{\text{REC}} + 
    \mathcal{L}_{\text{LL}} +
    \mathcal{L}_{\text{TC}}
\label{eq:training_objective}
\end{equation}
where the reconstruction entropy term $\mathcal{L}_{\text{REC}}$, the expected logarithmic likelihood term $\mathcal{L}_{\text{LL}}$ and the temporal contrastive term $\mathcal{L}_{\text{TC}}$ are defined as follows:

\vspace{-\baselineskip}
\begin{align}
    \mathcal{L}_{\text{REC}} &= 
    - \sum_{v=1}^{V} \sum_{t=1}^{T}
    H(O_{t}^{v} \mid Z_{t-1}, A_{t-1}),
    \\
    \mathcal{L}_{\text{LL}} &= 
    - \sum_{v=1}^{V} \sum_{t=1}^{T}
    \mathbb{E}_{p_1} \ln q_{\psi}^{v}(o_{t}^{v} \mid z_{t}, z_{t-1}, a_{t-1} ),   \\
    \mathcal{L}_{\text{TC}} &= 
    \sum_{t=1}^{T} \sum_{s=1}^{T}
    \mathbb{E}_{
        p_2
    } [
        % D_{KL} ( p(z_t \mid o_s, \iota) || r_{\phi}(z_t \mid \iota) ) 
        \infdiv{ p(z_t \mid o_s, \iota) }{ r_{\phi}(z_t \mid \iota) }   
    ],
\end{align}
where $p_1 := p(z_{t}, o_{t}^{v} \mid z_{t-1}, a_{t-1} )$ and $p_2 := p(\vec{o}_s)$.

Regarding the benefits of multi-view correlation, the completeness of a multi-view representation is defined as its ability to reconstruct each individual view \citep{MVTC}. By minimizing the reconstruction entropy term $\mathcal{L}_{\text{REC}}$, we seek a representation $Z$ that is a maximal compression of the observation $O$, thereby eliminating information irrelevant to the visual control task from $Z$. Minimizing $\mathcal{L}_{\text{LL}}$, the negative expected log-likelihood, follows the maximum-likelihood principle of statistical inference and helps preserve the temporal dynamics of the sequence. $\mathcal{L}_{\text{TC}}$ acts as a regularization term for this surrogate loss, keeping the approximate prior distribution $r_{\phi}$ from diverging from the true posterior distribution $p$.


\textbf{Joint Modeling}.
We use Product of Experts (PoE) \citep{poe} and Inverse Variance Weighting (IVW) \citep{ivw01} for the joint modeling of multiple views. Each view is assigned a separate encoder and decoder network. After the multi-view observations are fed into the encoders, a joint representation is obtained by aggregating the $V$ separate representations into a single one, as illustrated in Figure~\ref{fig:encoder}. The reparameterization trick \citep{Reparamterization} is used so that gradients can be backpropagated through the parameters of the latent distributions.
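
To make the aggregation step concrete, the following is a sketch under the assumption, not stated above, that each view-specific encoder outputs a diagonal Gaussian $\mathcal{N}(\mu_v, \sigma_v^{2})$ over the latent variable. In that case the product of the $V$ experts (ignoring any additional prior expert) is again Gaussian, and its parameters coincide with the inverse-variance-weighted combination of the per-view estimates, applied element-wise:
\begin{equation*}
\sigma^{2} = \Big( \sum_{v=1}^{V} \sigma_v^{-2} \Big)^{-1}, \qquad
\mu = \sigma^{2} \sum_{v=1}^{V} \sigma_v^{-2} \mu_v .
\end{equation*}
A sample can then be drawn as $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, which keeps $z$ differentiable with respect to $\mu$ and $\sigma$. The symbols $\mu_v$, $\sigma_v$ and $\epsilon$ are introduced here for illustration only. Under this reading, views with lower predicted variance contribute more to the joint representation.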


In summary, the training procedure of the encoder and the reinforcement learning policy is given in Algorithm \ref{alg:\mname{}}. We design the algorithm using a co-training paradigm, since updating each of the encoder, actor, and critic requires values passed through the other components.


\begin{algorithm}[tb]
    \caption{\mname{} Training Procedure}
    \label{alg:\mname{}}
    \textbf{Input}: environment $E$, encoder $p$ parameters $(\phi, \psi)$, policy $\pi$ parameters $\theta$, Q function parameters $\eta_1$, $\eta_2$, replay buffer $\mathcal{D}$ 
    \begin{algorithmic}[1] 
        \STATE Reset environment $E$ with multi-view observation $\vec{o}_{0}$.
        \STATE Initialize representation $z_0 \sim p_{\phi}( \cdot \mid \vec{o}_{0})$.
        \STATE Initialize replay buffer $\mathcal{D}$.
        \STATE Initialize target parameters: $\eta_{trgt,i} \leftarrow \eta_i$ where $i = 1, 2$.
        \WHILE{not converged}
        \STATE Get action $a_t \sim \pi_{\theta}( \cdot \mid z_{t})$.
        \STATE Get reward $r_t = R(\vec{o}_t, a_t)$.
        \STATE Get next observation $\vec{o}_{t+1} \sim \transprob( \cdot \mid \vec{o}_{t}, a_{t})$.
        \STATE Get representation $z_{t+1} \sim p_{\phi}( \cdot \mid \vec{o}_{t+1})$.
        \STATE Push tuple $(\vec{o}_t, z_t, a_t, r_t, \vec{o}_{t+1}, z_{t+1})$ into replay buffer $\mathcal{D}$.
        \FOR {update steps  }
            \STATE Update encoder $p$ by gradient descent using
            $
            \grad_{\phi, \psi}  \Big( \mathcal{L}_{\text{REC}} + \mathcal{L}_{\text{LL}} + \mathcal{L}_{\text{TC}} \Big)
            $ by sampling replay buffer $\mathcal{D}$.
            \STATE Update Q network by gradient descent using $$
            \mathcal{L}_{Q_{\eta_{i}}} = \mathbb{E}_{B \subset \mathcal{D}} [ Q_{\eta_{i}}(z_t, a_t) - y(r_t, z_{t+1}) ]^2$$ 
            where $(z_t, a_t, r_t, z_{t+1}) \in B$ and $ i = 1,2 $.  
            \STATE Update policy network $\pi$ by gradient descent using
            $$\mathcal{L}_{\pi_{\theta}} = \mathbb{E}_{B \subset \mathcal{D}} \log \pi_{\theta}( a'_t \mid z_{t}) - \min_{i=1,2} Q_{\eta_{i}}(z_t, a'_{t})$$ 
            where $a'_t \sim \pi_{\theta}( \cdot \mid z_{t}), z_t \in B$.
            \STATE Update target networks by Polyak averaging: $\eta_{trgt,i} \leftarrow \rho \eta_{trgt,i}+(1-\rho) \eta_i$ where $i = 1, 2$.
        \ENDFOR
        \ENDWHILE
        
    \end{algorithmic}
\end{algorithm}
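
Algorithm \ref{alg:\mname{}} leaves the target $y(r_t, z_{t+1})$ implicit. As a sketch, assuming the standard SAC target with a discount factor $\gamma$ and an entropy temperature $\alpha$ (both symbols are assumptions introduced here, not quantities defined above), one common choice is
\begin{equation*}
y(r_t, z_{t+1}) = r_t + \gamma \Big( \min_{i=1,2} Q_{\eta_{trgt,i}}(z_{t+1}, a'_{t+1}) - \alpha \log \pi_{\theta}(a'_{t+1} \mid z_{t+1}) \Big),
\end{equation*}
where $a'_{t+1} \sim \pi_{\theta}(\cdot \mid z_{t+1})$. Using the minimum over the two target critics is the usual way to reduce overestimation of the Q-value.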

\begin{figure}[htb]
    \centering
    \includegraphics[width=\columnwidth]{fig/encoder_crop.pdf}
    \caption{\mname{} encoder architecture. At each time step $i=1,2,3$, the encoder takes one column of multi-view observations, i.e. $\vec{o}_{i}$, as input to generate a joint representation $z$.}
    \label{fig:encoder}
\end{figure}

 \section{Experiments}
\label{sec:exp}


\subsection{Experimental Setup}


To assess the efficacy of our proposed method, we integrate \mname{} with SAC to tackle visual control tasks sourced from the DeepMind Control (DMC) Suite \citep{Deepmind_control_suite} and Sapien \citep{sapien}. The experiments in this section are conducted on three servers, each equipped with 24 CPU cores, 110GB of memory, and NVIDIA Tesla A100 GPUs. The Sapien task involves multiple views obtained from cameras at various angles. In DMC tasks, where multi-view observations are not inherently available, we adopt random crop as the view generation method based on its reported advantages over other data augmentation methods, as highlighted in RAD \citep{RAD}. Results are averaged across 5 random seeds, with each agent undergoing training for up to 500,000 steps. The network architectures and other hyperparameters are provided in detail in the appendix.





\subsection{Baselines}
\label{sec:exp_baselines}

The following SOTA model-based and model-free methods are compared with our proposed method: \textbf{DreamerV2} \citep{dreamerv2}, \textbf{RAD} \citep{RAD}, \textbf{PI-SAC} \citep{PISAC}, \textbf{DrQ} \citep{drq}, \textbf{SLAC} \citep{SLAC}, and \textbf{DRIBO} \citep{DRIBO}.

DreamerV2 sets itself apart by integrating a world model to understand agent behaviors, explicitly preserving latent dynamics. In contrast, RAD and DrQ do not explicitly model dynamics; both feed transformed observations, i.e., raw pixels collected from interaction with the environment, into the downstream reinforcement learning (RL) task. \mname{} instead treats the joint representation obtained by encoding multi-view observations as the state, emphasizing task-relevant information for the downstream RL task.
In the case of RAD and DrQ, a notable distinction lies in observational transformation. RAD employs data augmentation techniques such as color jittering and random cropping, while DrQ samples transformation operators from an invariant state transformation set, referring to this approach as a data regularized method.
Furthermore, PI-SAC achieves representation learning by maximizing a Conditional Entropy Bottleneck (CEB)-related surrogate as an auxiliary task for training the encoder. In contrast, DRIBO aims to maximize the mutual information between two marginal representations and the divergence of likelihood probability. These differences underscore the varied approaches and methodologies each method employs in the field of representation learning for RL.


The training objective (Equation~\ref{eq:training_objective}) of the encoder in \mname{} requires $V$ views over $T$ time steps, whereas the baselines have no such requirement. This does not introduce unfairness in the performance evaluation. During the evaluation stage, the encoder parameters are frozen, and episodes are generated step by step in both \mname{} and the baselines. Although \mname{} may appear to leverage information over a broader time window, the use of historical information is inherent to the design of the training objective, and the same historical information is available to \mname{} and to the baselines. \mname{} simply exploits it more fully, which leads to superior performance. In conclusion, as long as episodes are unrolled one step at a time during the evaluation stage, the comparison remains unbiased.


\subsection{Evaluation on Control Tasks}

\begin{figure*}[htb]
    \centering
    \includegraphics[width=\textwidth]{fig/eval_crop.pdf}
    \caption{Evaluation on DMC tasks and the basic manipulation task in Sapien. Row 1 shows results trained on DMC tasks: (Cheetah, run), (Walker, walk), (Ball in cup, catch), (Finger, spin), (Acrobot, swingup), (Humanoid, run), (Hopper, hop), (Fish, swim). Row 2 shows results trained on basic manipulation with different view settings: uh - upward and horizontal, ud - upward and diagonal, hd - horizontal and diagonal, uhd - upward, horizontal and diagonal.}
    \label{fig:dmc_eval}
\end{figure*}



We evaluate \mname{} on tasks from the DMC Suite \citep{Deepmind_control_suite} and the Sapien environment \citep{sapien}, alongside the baselines described above.

The experimental results of our method and the baselines are shown in Figure~\ref{fig:dmc_eval}. Overall, \mname{} achieves better performance than DrQ, RAD, DreamerV2 and DRIBO, and performance comparable to PI-SAC and SLAC.
In the cheetah run task, \mname{} converges faster than the other baselines and outperforms all of them except DreamerV2.
Similarly, in the walker walk task, \mname{} converges faster than the other baselines and outperforms all of them except DRIBO.
However, both DreamerV2 and PI-SAC achieve better performance and sample efficiency than \mname{} in the ball-in-cup catch task, although \mname{} converges faster than RAD, SLAC and DRIBO.
In the finger spin task, \mname{} is more sample-efficient than the other baselines, although \mname{} and DRIBO reach almost the same score at the end of the evaluation.
In the acrobot swingup task, \mname{} performs almost on par with PI-SAC and SLAC. Although the final score of DRIBO is decent, its high volatility means that it does not compare favorably with \mname{}, PI-SAC and SLAC.
In the humanoid run task, \mname{} outperforms all the other baselines from a relatively early stage of training.
In the hopper hop task, only DRIBO achieves performance comparable to \mname{}, and with significantly larger variance.
In the fish swim task, the final score of \mname{} ranks third. However, DRIBO, which ranks first, has not reached a stable state, and SLAC outperforms \mname{} only at the end of training, indicating that neither DRIBO nor SLAC has a dominant advantage over \mname{}.
In the basic manipulation task with the upward-horizontal and horizontal-diagonal view settings, \mname{} outperforms the other baselines.
However, in the upward-diagonal view setting, RAD and SLAC have better final scores than \mname{}. This implies that the choice of viewpoints can affect performance and training efficiency, which is partially consistent with the conclusions in \citep{Hsu2022VisionBasedMN}.

In the basic manipulation task with the upward-horizontal-diagonal view setting, \mname{} achieves the best score, both over the other baselines and over its own results in the two-view settings.


The results in Figure~\ref{fig:dmc_eval} show that our proposed method can yield significant performance improvements in scenarios with real multi-view data.


\subsection{Ablation Study}


\begin{figure*}[htb]
    \centering
    \includegraphics[width=\textwidth]{fig/ablation.pdf}
    \caption{Comparison among different settings.}
    \label{fig:ablation}
\end{figure*}



In the ablation study, we investigate several components that can affect the overall performance of \mname{}, including
(a) the form of the loss function; (b) sequence length $T$; (c) number of views $V$; (d) data augmentation methods; (e) size of views. We conduct all of these experiments with \mname{} on the same visual control task, walker walk.


\textbf{Loss Function}. 
This part reports the contributions of the three components of the objective function: \rectrm{}, \llhtrm{} and \ctrtrm{}. The evaluation results for different combinations of the three terms are shown in Figure~\ref{fig:ablation}(a).

We observe that (1) none of \rectrm{}, \llhtrm{} and \ctrtrm{} alone succeeds in the walker walk task. Intuitively, if the loss function contains only \rectrm{}, information relevant to the control task cannot be preserved in the representation $Z$ due to embedding collapse.
If the loss function consists of only \llhtrm{} or only \ctrtrm{}, the representation $Z$ would contain too much task-irrelevant information, which hampers learning of the control task.
(2) However, when \rectrm{} and \llhtrm{} are combined, the final performance is better than that of either term alone, because \rectrm{} encourages the representation to be concise while \llhtrm{} guarantees its sufficiency.
(3) In contrast, the other two combinations, \rectrm{} plus \ctrtrm{} and \llhtrm{} plus \ctrtrm{}, show no noticeable improvement in final performance over the single-term cases.

The final performance of the full \mname{} objective is slightly better than that of \rectrm{} plus \llhtrm{}, which suggests that \ctrtrm{} improves the robustness of the representation.
In conclusion, \rectrm{} and \llhtrm{} are essential for obtaining a performant representation, while \ctrtrm{} acts as a regularization term that makes the representation more robust.


\textbf{Sequence Length $T$}. To investigate the effect of the sequence length on agent performance, we conduct experiments with different sequence lengths $T$; the results are shown in Figure~\ref{fig:ablation}(b). We observe that performance increases slightly as $T$ increases, which means that \mname{} learns better representations when it incorporates longer temporal dynamics. However, as $T$ increases from $15$ to $20$, the gain is not significant compared to the gains at smaller values of $T$, indicating that the marginal improvement from adding more time steps to the total correlation gradually vanishes as $T$ grows. Intuitively, if $T$ is too large, observations separated by a large time gap become independent of each other. In that case, the sequential total correlation separates into a sum of sequential total correlations over shorter sequences of observations, so adding extra time steps of observations brings almost no benefit for learning temporal dynamics.
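
The decomposition argument for large $T$ can be made concrete with a short worked equation. This is only a sketch for the generic, unconditioned total correlation $\mathrm{TC}$ of observations $X_1,\dots,X_T$ with entropy $H$; the split point $k$ and this simplified notation are assumptions made for illustration, and the action-conditioned multi-view objective of \mname{} is not reproduced here. By definition,
\[
  \mathrm{TC}(X_{1:T}) = \sum_{t=1}^{T} H(X_t) - H(X_{1:T}).
\]
If the block $X_{1:k}$ is independent of the block $X_{k+1:T}$, then $H(X_{1:T}) = H(X_{1:k}) + H(X_{k+1:T})$, and therefore
\[
  \mathrm{TC}(X_{1:T}) = \sum_{t=1}^{T} H(X_t) - H(X_{1:k}) - H(X_{k+1:T}) = \mathrm{TC}(X_{1:k}) + \mathrm{TC}(X_{k+1:T}).
\]
Under this independence assumption, the longer sequence carries no cross-block term beyond what the two shorter sequences already provide.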


\textbf{Number of Views $V$}. To investigate whether providing more views can improve task performance, we conduct experiments with different numbers of views $V$ and depict the results in Figure~\ref{fig:ablation}(c). As $V$ increases, agent performance increases slightly, a weaker effect than that of the sequence length $T$. We conjecture that the \mname{} objective captures more cross-view correlation during the optimization of the observation encoder, which improves the robustness of the learned representation. However, too many views may introduce redundant information and degrade the learning process. Therefore, using multi-view data is beneficial, but collecting many views is not necessary for satisfactory performance in practice, considering the extra data collection and computation that each additional view demands.


\begin{figure*}[htb]
    \centering
    \includegraphics[width=0.7\textwidth]{fig/data_aug_eval_crop.pdf}
    \caption{Evaluation results using different (left) data augmentation methods and (right) view sizes.}
    \label{fig:data_aug_eval}
\end{figure*}

\textbf{Data Augmentation Methods}. We investigate the effect of using different data augmentation methods to generate views. Figure~\ref{fig:data_aug_eval}(d) depicts the evaluation results on the walker walk task with different view-generation methods: Grayscale, Random Crop, Rotate and Color Jitter. The results suggest that Random Crop is more beneficial to the performance of \mname{} than the other methods, which is consistent with the conclusion of \citep{RAD}. Intuitively, \mname{} optimizes over the discrepancy between different views, and Grayscale, Rotate and Color Jitter do not increase the discrepancy among views.

\textbf{Size of Views}. The size of a view is correlated with the degree of partial information that it contains. We conduct experiments with view sizes $[32, 64, 128, 256]$, as illustrated in Figure~\ref{fig:data_aug_eval}(e). The results imply that as the size of the raw observation increases, the performance of \mname{} increases as well, but with diminishing returns. This observation is consistent with the fact that the mutual information among different views is a submodular function \citep{submodMI1,submodMI2}.
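
For reference, the diminishing-returns reading of submodularity can be stated as follows. This is the standard definition for a generic set function $f$ over a ground set $\Omega$; the symbols $f$, $\Omega$, $A$, $B$ and $v$ are illustrative assumptions, and how the ground set is instantiated for view sizes is left open here. A set function $f$ is submodular if, for all $A \subseteq B \subseteq \Omega$ and every $v \in \Omega \setminus B$,
\[
  f(A \cup \{v\}) - f(A) \ge f(B \cup \{v\}) - f(B).
\]
That is, adding the same element to a larger set yields a gain no larger than adding it to a smaller one.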


\subsection{Spatial Attention of Learned Representations}

To visualize the representation learned by our proposed method, we build spatial attention maps of representations from observations \citep{spatial_atten}, as illustrated in Figure~\ref{fig:spatial_attention}. \mname{} captures clearer task-relevant information in the cheetah run and walker walk tasks (upper row in Figure~\ref{fig:spatial_attention}) than in the ball-in-cup catch and finger spin tasks (lower row in Figure~\ref{fig:spatial_attention}). More specifically, the edges of the controlled agent can be clearly identified in the learned representations of cheetah run and walker walk, which is consistent with the good performance of \mname{} on these two tasks. Moreover, the learned representation in the finger spin task captures not only the current state of the agent but also its possible future states, since the movement would change the area around the agent in the view. This shows that \mname{} can capture temporal dynamics in the learned representation.
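
As a minimal sketch of how such a map can be computed, one common activation-based construction aggregates a convolutional feature map over its channels. The feature tensor $F$ with $C$ channels, the exponent $p$, and the choice of encoder layer are assumptions made for illustration; the exact procedure used for Figure~\ref{fig:spatial_attention} may differ.
\[
  A_{ij} = \sum_{c=1}^{C} |F_{c,ij}|^{p},
\]
where $(i,j)$ indexes spatial locations. In this sketch, the map $A$ would typically be normalized and upsampled to the observation size for display.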



\begin{figure}[htb]
    \centering
    \includegraphics[width=\columnwidth]{fig/spatial_attention.pdf}
    \caption{Spatial attention maps of representations from observations in four DMC tasks. }
    \label{fig:spatial_attention}
\end{figure}


\section{Conclusion}

In this work, we introduce a novel reinforcement learning algorithm named \mname{} specifically designed for visual control problems. \mname{} aims to learn a comprehensive and succinct representation of multi-view observations. By capturing shared information across views and exploiting temporal correlation, our approach maximizes sequential total correlation between sequences of multi-view observations and their corresponding representations.
To integrate temporal dynamics, we extend multi-view total correlation into sequential multi-view total correlation, conditioning on sequences of actions, and utilize it as the training objective for the encoder. In empirical evaluations, our proposed method demonstrates competitive or superior performance compared to state-of-the-art baselines, including both model-free and model-based reinforcement learning methods.


\textbf{Limitations:} The SMuCo method has a few limitations. A crucial one is the scalability of handling multiple views simultaneously. As outlined in \citep{MVTC}, the complexity of correlations across multiple views can significantly impact the scalability of the method, and it can also make it harder to learn and represent task-relevant information effectively. These limitations become more pronounced in highly complex multi-view scenarios, such as those encountered in real-world applications. Consequently, while SMuCo may excel in certain contexts, its performance may be hindered in scenarios that demand handling many views and intricate correlations among them.



\textbf{Future Work:} As part of future work, \mname{} can be enhanced to handle challenges posed by unaligned multi-view observations and extend its capabilities to accommodate multi-modal observations, including not only image data but also text and audio data. This expansion will contribute to the algorithm's versatility across diverse input modalities in various applications.
Furthermore, we could address scenarios with limited multi-view data by exploring multi-view representation learning methods that handle such limitations. A promising avenue is CPM-Nets \citep{CPMNet}, which are designed to handle missing views in multi-view data. Future work could therefore adapt and extend the principles of CPM-Nets to reinforcement learning, particularly in situations where multi-view information is scarce, with the aim of developing approaches that learn policies robustly and efficiently despite limited multi-view data.









