% \documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
%                                          % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{hyperref}
\usepackage{url}
\usepackage{algorithm}
\usepackage{algorithmic} 
\usepackage{amssymb}
\usepackage{picinpar}
\usepackage{wrapfig}
\usepackage{enumitem}
\usepackage{subfigure}
\usepackage{multirow}
\usepackage{bbding}
\usepackage{graphicx}
\usepackage{xcolor}
\usepackage[most]{tcolorbox}
\usepackage{makecell}
\usepackage{caption}
\allowdisplaybreaks[4]

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Time-Conditioned Generative Modeling of Object-Centric Representations for Video Decomposition and Prediction}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Chengmin Gao}
\author[1]{Bin Li\thanks{Corresponding author (libin@fudan.edu.cn)}}

% Add affiliations after the authors
\affil[1]{%
    School of Computer Science, Fudan University
}

\begin{document}
\maketitle
\begin{abstract}
  When perceiving the world from multiple viewpoints, humans can reason about complete objects in a compositional manner even when an object is completely occluded from certain viewpoints. Humans can also imagine novel views after observing multiple viewpoints. Despite remarkable recent advances, multi-view object-centric learning still leaves some problems unresolved: 1) The shapes of partially or completely occluded objects cannot be well reconstructed. 2) Novel-viewpoint prediction depends on expensive viewpoint annotations rather than on implicit rules in view representations. In this paper, we introduce a time-conditioned generative model for videos. To reconstruct the complete shape of an object accurately, we enhance the disentanglement between the latent representations of objects and views: the latent representations of time-conditioned views are jointly inferred with a Transformer and then fed to a sequential extension of Slot Attention to learn object-centric representations. In addition, Gaussian processes are employed as priors of view latent variables for video generation and novel-view prediction without viewpoint annotations. Experiments on multiple datasets demonstrate that the proposed model can perform object-centric video decomposition, reconstruct the complete shapes of occluded objects, and make novel-view predictions.
\end{abstract}

\section{Introduction}
  Humans understand the multi-object world in a compositional manner: the representations of multiple objects are memorized separately and then combined into the perceived whole \cite[]{kahneman1992reviewing,spelke2007core,johnson2010infants}. When it comes to multi-object scenes observed from multiple viewpoints, humans exhibit higher-level intelligence in several respects. On the one hand, an object is endowed with a canonical representation that depicts its complete 3D shape and appearance through multi-view perception \cite[]{turnbull1997neuropsychology}. As a result, humans can reason about the complete object even when it is completely occluded from certain viewpoints \cite[]{shepard1971mental}. On the other hand, scenes observed from novel viewpoints can be imagined on the basis of the learned implicit rules of perspective \cite[]{schacter2012future,beaty2016creative}. Such compositional modeling from multiple viewpoints is a fundamental ingredient of high-level cognitive intelligence. 

  Unsupervised object-centric learning, which is dedicated to simulating human intelligence, has recently achieved remarkable advances \cite[]{yuan2022compositional}, especially in single-view object-centric learning on both images \cite[]{burgess2018understanding,yuan2019generative, yuan2019spatial,engelcke2021genesis} and videos \cite[]{kosiorek2018sequential,jiang2019scalor,lin2020improving}. Meanwhile, multi-view object-centric learning \cite[]{li2020learning,chen2021roots,kabra2021simone,yuan2022unsupervised}, which aims to learn 3D object representations, also shows promise; however, it still leaves some problems unresolved: 1) The shapes of objects that are partially or completely occluded from some viewpoints cannot be reconstructed through 3D representations learned from other viewpoints. Although some models can theoretically restore occlusions, relatively poor restoration (e.g. inaccurate shadows, blur and noise) is inevitably observed. 2) Despite using the query objective during training \cite[]{li2020learning}, the ability to predict novel viewpoints depends on expensive viewpoint annotations, which provide strong location information and play a crucial role in updating object-centric representations, while the implicit rules of view representations are not fully exploited for prediction. 
  It is, therefore, crucial to develop a unified multi-view model to perform object-centric learning like humans. 

\begin{figure}[tb]
  % \vskip 0.2in
  \begin{center}
  \centering
  \includegraphics[width=\linewidth]{assets/intro.pdf}
  \caption{\textbf{Top}: Video decomposition and prediction with multiple observed time-conditioned viewpoints. The yellow and red triangles represent the observed frames and predicted frames, respectively. \textbf{Bottom}: The expected outputs: (a) reconstruction, (b) segmentation, (c) overlaps, and (d) complete segmentation. In our problem setting, only the observation set and time stamps are provided.}
  \label{fig:intro}
  \end{center}
\end{figure}

  In this paper, we focus on learning object-centric and viewpoint representations conditioned on time stamps from multi-view static scenes for video decomposition and unknown-viewpoint prediction. The problem setting and the expected outputs are illustrated in Figure \ref{fig:intro}. Under the setting that only the observation set and time stamps are provided, a generative model is developed to 1) perform video decomposition based on object-centric representations; 2) reconstruct the complete shapes of partially or even completely occluded objects; and 3) predict 2D images from unknown viewpoints conditioned on known viewpoints.

  To enable the abovementioned abilities, we propose a time-conditioned generative model for video decomposition and prediction. The proposed model reconstructs the complete shape of an object accurately by enhancing the disentanglement between object-centric representations and viewpoint representations: the latent representations of time-conditioned views are jointly inferred with a Transformer \cite[]{vaswani2017attention} and are then fed to a sequential extension of Slot Attention \cite[]{locatello2020object} to learn viewpoint-invariant object-centric representations. In addition, prediction from novel viewpoints without viewpoint annotations is enabled. Specifically, Gaussian processes are employed as priors of viewpoint latent variables for video generation and novel-view inference, based on the learned functions depicting the underlying implicit rules in view representations. 
  
  Experiments on multiple synthetic datasets demonstrate that the proposed model can 1) perform object-centric video decomposition, 2) reconstruct the complete shapes of occluded objects, and 3) make novel-view predictions. Moreover, the proposed model outperforms the state-of-the-art methods in video decomposition and, compared with the method that uses viewpoint annotations, achieves competitive results on novel-view prediction.

\section{Related Work}

  \textbf{Single-View Object-Centric Learning.} Recent advances mainly focus on aggregating the input image into multiple slots based on the attention mechanism. AIR \cite[]{eslami2016attend} extracts a variable number of object representations based on the bounding-box attention \cite[]{jaderberg2015spatial}. SQAIR \cite[]{kosiorek2018sequential} further extends AIR to videos. Both SPACE \cite[]{lin2019space} and GMIOO \cite[]{yuan2019generative} model the background separately and model occlusions from different perspectives. SCALOR \cite[]{jiang2019scalor} implements object discovery and tracking in videos with dynamic backgrounds based on SPACE. G-SWM \cite[]{lin2020improving} integrates the advantages of current models on videos and further models the multimodal uncertainty. MONet \cite[]{burgess2019monet} adopts the attention network to iteratively infer masks and then extract object-centric representations based on masked features. GENESIS \cite[]{engelcke2020genesis} additionally models layouts of scenes based on MONet. GENESIS-V2 \cite[]{engelcke2021genesis} infers the attention masks inspired by instance coloring previously used in supervised instance segmentation. Slot Attention \cite[]{locatello2020object} and EfficientMORL \cite[]{emami2021efficient} randomly initialize the embeddings of objects in the slots to compute the similarities between the embeddings and local features. ADI \cite[]{yuan2021knowledge} proposes a continual learning strategy and makes pilot explorations in the acquisition and exploitation of knowledge.

  \textbf{Multi-View Object-Centric Learning.} We can coarsely categorize the recent advances in terms of viewpoint annotation. GQN \cite[]{eslami2018neural} uses viewpoint annotations to build single-object scenes. Based on novel-view annotations, single-object images from the given viewpoints can be generated. MulMON \cite[]{li2020learning} models the multi-object multi-view scenes according to viewpoint annotations. The double-level iterative inference is conducted to achieve both multi-object segmentation and prediction. ROOTS \cite[]{chen2021roots} divides the three-dimensional space into equal-spaced grids and discovers objects in different grids. ROOTS also considers occlusions and makes predictions with viewpoint annotations. SIMONe \cite[]{kabra2021simone} and OCLOC \cite[]{yuan2022unsupervised} are the most recent models without viewpoint annotations. They learn viewpoint representations and object-centric representations separately. The difference is that SIMONe learns representations from videos and can recompose representations to novel scenes, while OCLOC is capable of modeling scenes from unordered viewpoints.
    
  \textbf{Deep Learning with Stochastic Processes.} The Gaussian Process (GP) \cite[]{rasmussen2006gaussian} is a classical non-parametric model that treats the outputs of a function at any finite set of inputs as jointly Gaussian random variables. The Neural Process (NP) \cite[]{garnelo2018neural,kim2019attentive} captures function stochasticity with a Gaussian-distributed latent variable obtained from an inference network. To integrate stochastic processes into generative models, \cite[]{shi2021raven} employs GPs with deep kernels for Raven's progressive matrices completion. CLAP-NP \cite[]{shi2022compositional} makes a first attempt at compositional law parsing with random functions based on NPs. In addition, a number of deep generative models \cite[]{deng2020modeling, norcliffe2021neural, song2021score} introduce ODEs or SDEs to learn diverse random functions on latent states.


\section{Background}
  In order to enable the abilities illustrated in Figure \ref{fig:intro}, we list below the design considerations for multi-view object-centric representation learning from videos without viewpoint annotations. 

  \textbf{Variable Number of Objects.} As the number of objects differs from one scene to another, it requires modeling and inference. A possible solution is to introduce a set of Bernoulli variables $\boldsymbol{z}^{\text{pres}} = \{ z_1^{\text{pres}}, ..., z_K^{\text{pres}}\}$ to model object presences in the $K$ slots for automatic counting, where $K$ denotes the maximum number of objects that may appear in a scene.

  \textbf{Separate Modeling of Background.} As foreground objects only occupy local regions while the background covers the entire image, 3D objects generated from multiple viewpoints tend to be blurry when their decoder is shared with the background. We therefore train two different decoders: a shared foreground object decoder and a separate background decoder. 

  \textbf{View-independent Object Representations.} We do not learn object representations from different viewpoints separately. Since the representations of the same object should be inherently consistent regardless of viewpoint, we treat $\{ \boldsymbol{z}^{\text{bck}}, \boldsymbol{z}_1^{\text{obj}},...,\boldsymbol{z}_K^{\text{obj}} \}$ as view-independent object-centric representations, learned from multiple observed viewpoints to represent viewpoint-invariant 3D objects.

  \textbf{Depth Estimation of Objects.} 
  %Estimation of the depth of an object without supervision is challenging.
  In the generative model, we introduce a depth variable $o_{t,k} \in \big[0,1\big]$ for the $k$th object in the $t$th frame, together with its complete shape $\boldsymbol{s}_{t,k}^{\text{shp}} \in \big[0,1 \big]^{N}$ before occlusion. In this way, the pixels of an object with larger depth values will cover the pixels with smaller depth values. We can thus naturally obtain the observed shape of a possibly occluded object. It is worth noting that this treatment is also applicable to situations where an object is completely occluded.

  \textbf{Modeling of Viewpoints.} We explicitly learn the viewpoint representations by modeling the correlations between viewpoints, instead of directly leveraging viewpoint annotations as in previous works \cite[]{li2020learning,chen2021roots}. The view-correlation-based modeling also enables novel-view prediction at any given time. To this end, we define $\boldsymbol{z}^{\text{view}} \in \mathbb{R}^{T \times D}$ and $\boldsymbol{\lambda} \in \mathbb{R}^{T\times D \times D_{\lambda}}$, where $T$ denotes the number of frames, $D$ denotes the dimensionality of viewpoint representations, $D_{\lambda}$ denotes the dimensionality of each $\boldsymbol{\lambda}_{t,d}$, and $\boldsymbol{z}^{\text{view}}$ follows GPs w.r.t. $\boldsymbol{\lambda}$, which characterizes the position of the camera in different frames.


\section{Method}

  Our goal is to infer object-centric latent variables independent of viewpoints and correlated viewpoint latent variables dependent on time $t$. In the following, we introduce our time-conditioned generative model, the inference method and a two-stage training procedure to achieve the goal.

\begin{figure*}[tb]
  % \vskip 0.2in
  \centering
  \includegraphics[width=\textwidth]{assets/generative_full.pdf}
  \caption{The proposed time-conditioned generative process for generating the $t$th frame in a video. The correlations between the viewpoint representations of $T$ frames are modeled dimension-wise with GPs. The notations in circles denote latent variables; the notations in dark gray boxes denote neural networks.}
  \label{fig:generative}
\end{figure*}

\subsection{Generative Model}
\label{sec:generative_modeling}

  Let $\boldsymbol{x}_\mathcal{S} = \{\boldsymbol{x}_{1},...,\boldsymbol{x}_{T}\}$ be the $T$ frames in a video and $\boldsymbol{t}_{\mathcal{S}}$ be their timestamps. The frame set $\boldsymbol{x}_\mathcal{S}$ can be arbitrarily divided into an observation frame set $\boldsymbol{x}_{\mathcal{T}}$ and a prediction frame set $\boldsymbol{x}_{\mathcal{Q}}$, where $\boldsymbol{x}_{\mathcal{S}} = \boldsymbol{x}_{\mathcal{T}} \cup \boldsymbol{x}_{\mathcal{Q}}$.  For convenience, the elements in $\boldsymbol{x}_{\mathcal{T}} $ and $\boldsymbol{x}_{\mathcal{Q}}$ are sorted by time, e.g. $\boldsymbol{x}_{\mathcal{T}} = \big( \boldsymbol{x}_1 , \boldsymbol{x}_3, \boldsymbol{x}_7, \boldsymbol{x}_9 \big)$; similarly, $\boldsymbol{t}_{\mathcal{S}}$ can be divided into $\boldsymbol{t}_{\mathcal{T}}$ and $\boldsymbol{t}_{\mathcal{Q}}$ accordingly. Figure \ref{fig:generative} shows the flowchart of the generative process. The generative model conditioned on time $\boldsymbol{t}_{\mathcal{S}}$ can be expressed as:
\begin{align}
  \label{eq:lambda_prior}
  & \boldsymbol{\lambda}_{t,d}  \sim \mathcal{N}(\boldsymbol{A}\boldsymbol{w}_t,\sigma_{w}^2\boldsymbol{I}) \\
  &\kappa_{\boldsymbol{\eta}}^d (\boldsymbol{\lambda}_{t,d},\boldsymbol{\lambda}_{t',d}) = l^2 \exp \Big( -\frac{\|g_{\boldsymbol{\eta}}^d (\boldsymbol{\lambda}_{t,d} )-g_{\boldsymbol{\eta}}^d (\boldsymbol{\lambda}_{t',d} )\|_2^2}{2\sigma^2} \Big) \\
  &\boldsymbol{z}_k^{\text{obj}} \sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I}) \quad\quad \boldsymbol{z}^{\text{bck}}\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I}) \\
  &\boldsymbol{K}_{\boldsymbol{\eta}}^d = \left[\begin{array}{ccc}
  \kappa_{\boldsymbol{\eta}}^{d}\left(\boldsymbol{\lambda}_{1,d}, \boldsymbol{\lambda}_{1,d}\right) & \cdots & \kappa_{\boldsymbol{\eta}}^{d}\left(\boldsymbol{\lambda}_{1,d}, \boldsymbol{\lambda}_{T,d}\right) \\
  \vdots & \ddots & \vdots \\
  \kappa_{\boldsymbol{\eta}}^{d}\left(\boldsymbol{\lambda}_{T,d}, \boldsymbol{\lambda}_{1,d}\right) & \cdots & \kappa_{\boldsymbol{\eta}}^{d}\left(\boldsymbol{\lambda}_{T,d}, \boldsymbol{\lambda}_{T,d}\right)
  \end{array}\right] \\
  \label{eq:view_concat}
  &\boldsymbol{z}_{1:T,d}^{\text{view}} \sim \mathcal{N}(\boldsymbol{0},\boldsymbol{K}_{\boldsymbol{\eta}}^d)\\
  &\boldsymbol{z}_{1:T}^{\text{view}}=\text{concat}(\boldsymbol{z}_{\cdot,1}^{\text{view}},...,\boldsymbol{z}_{\cdot,D}^{\text{view}}) \\
  &z_k^{\text{pres}} \sim \text{Bernoulli}(\nu_k) \quad \quad \nu_k \sim \text{Beta}(\alpha / K, 1)  \\
  & s_{t,k,n}^{\text{shp}} = \text{Sigmoid}(g_{\text{shp}}(\boldsymbol{z}_k^{\text{obj}},\boldsymbol{z}_t^{\text{view}})_n)\\
  \label{eq:ord}
  &o_{t,k} = g_{\text{ord}}(\boldsymbol{z}_k^{\text{obj}} ,\boldsymbol{z}_t^{\text{view}}) \\
  \label{eq:occlusion}
  & \pi_{t,k,n}=\begin{cases}\prod_{k'=1}^{K}(1 - z_{k'}^{\text{pres}}s_{t,k',n}^{\text{shp}}), \quad \ k=0 \\ \frac{(1-\pi_{t,0,n})z_{k}^{\text{pres}} s_{t,k,n}^{\text{shp}} o_{t,k}}{\sum_{k'=1}^{K} z_{k'}^{\text{pres}} s_{t,k',n}^{\text{shp}} o_{t,k'}}, \ \ k\geq 1  \end{cases} \\
  \label{eq:apc_gen}
  & \boldsymbol{a}_{t,k,n} = \begin{cases} g_{\text{apc}}^{\text{bck}}(\boldsymbol{z}_t^{\text{view}},\boldsymbol{z}^{\text{bck}})_n,\quad \quad \quad \ \ \  k=0 \\ g_{\text{apc}}^{\text{obj}}(\boldsymbol{z}_t^{\text{view}},\boldsymbol{z}_k^{\text{obj}})_n , \quad \quad \quad \ \ \ \ k \geq 1 \end{cases} \\
  \label{eq:likelihood}
  & \boldsymbol{x}_{t,n} \sim \mathcal{N}\Big(\sum \nolimits_{k=0}^K \pi_{t,k,n}  \boldsymbol{a}_{t,k,n}, \sigma_{x}^2 \boldsymbol{I}\Big)
\end{align}
  In the above, the ranges of all indices ($1\leq t \leq T, 1 \leq d \leq D, 1\leq k \leq K, 1\leq n\leq N$) are omitted for simplicity. The time embedding $\boldsymbol{w}_t = \text{TimeEncoding}(t)$ can take various forms, e.g. $\boldsymbol{w}_t = \big[ \cos t, \sin t\big]$. $\boldsymbol{\lambda}_{t,d}$ follows a linear Gaussian distribution with a projection matrix $\boldsymbol{A}$, which can be either learned or provided, and $\sigma_w$ is a hyperparameter. $\kappa_{\boldsymbol{\eta}}^d$ is the kernel function corresponding to the $d$th dimension of $\boldsymbol{z}^{\text{view}}$, composed of a neural network $g_{\boldsymbol{\eta}}^d$ and an RBF kernel, parameterized with $\boldsymbol{\eta}$, $l$ and $\sigma$ (\cite[]{wilson2016deep}). Each dimension of the viewpoint latent variable $\boldsymbol{z}^{\text{view}}_t$ is generated by a different GP in Eq.\ref{eq:view_concat}. Occlusions are handled in Eq.\ref{eq:occlusion} by sorting the depth values of objects to obtain the soft masks $\boldsymbol{\pi}_{t,k}$ of the background and objects. $\boldsymbol{a}_{t,k}$ in Eq.\ref{eq:apc_gen} denotes the complete appearance of the $k$th object or background in RGB values at time $t$. The likelihood of the $n$th observed pixel at time $t$ is a Gaussian distribution parameterized with $\boldsymbol{\pi}$ and $\boldsymbol{a}$ in Eq.\ref{eq:likelihood}.
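
  As a quick illustration of the presence prior above, the following short calculation uses only the standard fact that $\text{Beta}(a,1)$ has mean $a/(a+1)$. It concerns the prior alone and says nothing about the inferred posterior. Marginalizing out $\nu_k$ gives:
\begin{align*}
  p(z_k^{\text{pres}} = 1) &= \mathbb{E}[\nu_k] = \frac{\alpha / K}{\alpha / K + 1} = \frac{\alpha}{\alpha + K}, \\
  \mathbb{E}\Big[\sum \nolimits_{k=1}^{K} z_k^{\text{pres}}\Big] &= \frac{K \alpha}{\alpha + K} < \alpha .
\end{align*}
  The expected number of objects under the prior is therefore bounded by $\alpha$ and approaches $\alpha$ as $K$ grows, so $\alpha$ can be read as controlling the prior object count largely independently of the number of slots.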

  Let $\boldsymbol{\Omega} = \{ \boldsymbol{z}^{\text{obj}}, \boldsymbol{z}^{\text{bck}}, \boldsymbol{z}^{\text{pres}},  \boldsymbol{z}^{\text{view}}, \boldsymbol{\lambda}, \boldsymbol{\nu}\}$ denote the collection of all latent variables. The joint conditional probability of $\boldsymbol{x}_{\mathcal{S}}$ and $\boldsymbol{\Omega}$ can be written as:
\begin{align}
  p(\boldsymbol{x}_{\mathcal{S}},\boldsymbol{\Omega} \mid & \boldsymbol{t}_{\mathcal{S}}) =  \prod \nolimits_{t=1}^T \prod \nolimits_{n=1}^N p(\boldsymbol{x}_{t,n} \mid \boldsymbol{\Omega})  p(\boldsymbol{z}^{\text{bck}}) \notag\\
  & \cdot \prod \nolimits_{d=1}^{D} p(\boldsymbol{z}_{\mathcal{S},d}^{\text{view}}\mid \boldsymbol{\lambda}_{\mathcal{S},d}) \prod \nolimits_{t=1}^T p(\boldsymbol{\lambda}_{t,d}\mid \boldsymbol{t}_{\mathcal{S}} ) \notag \\
  & \cdot \prod \nolimits_{k=1}^K p(\boldsymbol{z}_k^{\text{obj}}) p(z_k^{\text{pres}}\mid \nu_k) p(\nu_k)
\end{align}

\subsection{Inference}
\label{sec:inference}
  Since the likelihood can hardly be computed by integrating out the latent variables $\boldsymbol{\Omega}$, amortized variational inference is employed to approximate the posterior of $\boldsymbol{\Omega}$. In our problem setting, only a subset of the frame collection,  $\boldsymbol{x}_{\mathcal{T}}$, is observed for each video. This implies that the posteriors of  $\boldsymbol{\lambda}_{\mathcal{T}}$ and $\boldsymbol{z}_{\mathcal{T}}^{\text{view}}$ that correspond to $\boldsymbol{x}_{\mathcal{T}}$ can be inferred directly with the inference networks, while the posteriors of $\boldsymbol{\lambda}_{\mathcal{Q}}$ and $\boldsymbol{z}_{\mathcal{Q}}^{\text{view}}$ that correspond to $\boldsymbol{x}_{\mathcal{Q}}$ are hard to compute. We use the least squares method to approximate the posterior of $\boldsymbol{\lambda}_{\mathcal{Q}}$ and then explicitly compute the posterior of $\boldsymbol{z}_{\mathcal{Q}}^{\text{view}}$ based on the properties of the GP prior. For simplicity, the parameters in the inference networks are denoted by $\boldsymbol{\phi}$ and the parameters in the learnable kernels of the GPs are denoted by $\boldsymbol{\eta}$. The variational posterior $q_{\boldsymbol{\phi},\boldsymbol{\eta}}(\boldsymbol{\Omega} \mid \boldsymbol{x}_{\mathcal{T}}, \boldsymbol{t}_{\mathcal{S}})$ conditioned on the observed set can be written as: 
\begin{align}
  \label{eq:posterior}
      q_{\boldsymbol{\phi},\boldsymbol{\eta}}(\boldsymbol{\Omega} \mid & \boldsymbol{x}_{\mathcal{T}}, \boldsymbol{t}_{\mathcal{S}}) =  q_{\boldsymbol{\phi}}(\boldsymbol{z}^{\text{bck}}\mid \boldsymbol{x}_{\mathcal{T}}) q_{\boldsymbol{\phi}}(\boldsymbol{z}_{\mathcal{T}}^{\text{view}} \mid \boldsymbol{x}_{\mathcal{T}},\boldsymbol{t}_{\mathcal{T}})\notag \\
      & \cdot q_{\boldsymbol{\phi}}(\boldsymbol{\lambda}_{\mathcal{T}}\mid \boldsymbol{x}_{\mathcal{T}}, \boldsymbol{t}_{\mathcal{T}})q_{\boldsymbol{\phi}}(\boldsymbol{\lambda}_{\mathcal{Q}} \mid \boldsymbol{\lambda}_{\mathcal{T}}, \boldsymbol{t}_{\mathcal{S}}) \notag \\ 
      & \cdot \prod \nolimits_{k=1}^K q_{\boldsymbol{\phi}}(\boldsymbol{z}_k^{\text{obj}}\mid \boldsymbol{x}_{\mathcal{T}}) q_{\boldsymbol{\phi}}(z_k^{\text{pres}}\mid \boldsymbol{x}_{\mathcal{T}}) q_{\boldsymbol{\phi}}(\nu_k \mid \boldsymbol{x}_{\mathcal{T}}) \notag \\
      & \cdot \prod \limits_{d=1}^{D} q_{\boldsymbol{\eta}}(\boldsymbol{z}_{\mathcal{Q},d}^{\text{view}} \mid \boldsymbol{z}_{\mathcal{T},d}^{\text{view}},\boldsymbol{\lambda}_{\mathcal{S},d}) 
\end{align}
  In the following, we will introduce the inference methods for the observed view-dependent latent variables in Section \ref{subsec:observed_view_inference}, the predicted view-dependent latent variables in Section \ref{subsec:query_view_inference}, and the view-independent object-centric latent variables in Section \ref{subsec:object_inference}. The overview of the inference procedure is illustrated in Figure \ref{fig:overview}. The mathematical details of the inference procedure can be found in the Supplementary Material.

\begin{figure*}[tb]
  % \vskip 0.2in
  \centering
  \includegraphics[width=\textwidth]{assets/inference.pdf}
  \caption{The inference procedure of the proposed model. The three modules correspond to the inference of observed view-dependent latent variables (top-left), the inference of predicted view-dependent latent variables (top-middle), and the inference of view-independent object-centric latent variables (bottom), respectively.}
  \label{fig:overview}
\end{figure*}

\subsubsection{Inference of Observed View-dependent Latents}
\label{subsec:observed_view_inference}
  The posteriors of the viewpoint latent variable $\boldsymbol{z}_t^{\text{view}}$ $(t\in\mathcal{T})$ and the timestamp latent variable $\boldsymbol{\lambda}_{t,d}$ $(t\in\mathcal{T}, 1 \leq d \leq D)$ are defined as:
\begin{align}
    q_{\boldsymbol{\phi}} (\boldsymbol{z}_t^{\text{view}} \mid \boldsymbol{x}_{\mathcal{T}}, \boldsymbol{t}_{\mathcal{T}}) &= \mathcal{N}(\boldsymbol{z}_t^{\text{view}} \mid \boldsymbol{\mu}_t^{\text{view}},\text{diag}(\boldsymbol{\sigma}_{t}^{\text{view}})^2) \notag \\ 
    q_{\boldsymbol{\phi}} (\boldsymbol{\lambda}_{t,d} \mid \boldsymbol{x}_{\mathcal{T}}, \boldsymbol{t}_{\mathcal{T}}) &= \mathcal{N}(\boldsymbol{\lambda}_{t,d} \mid \boldsymbol{\mu}_{t,d}^{\lambda}, \sigma_{w}^2\boldsymbol{I}) \notag 
\end{align}
  where $[\boldsymbol{\mu}_t^{\text{view}}, \boldsymbol{\sigma}_{t}^{\text{view}}] = f_{\boldsymbol{\phi}}^\text{view}(\boldsymbol{x}_{\mathcal{T}})$ and $\boldsymbol{\mu}_{t,d}^{\lambda} = f_{\boldsymbol{\phi}}^\lambda (\boldsymbol{x}_{\mathcal{T}},\boldsymbol{w}_t)$; the variance $\sigma_{w}^2$ is fixed. As shown in Figure \ref{fig:overview}, $\boldsymbol{x}_{\mathcal{T}}$ is first fed into a Transformer block along with a 3D position embedding \cite[]{kabra2021simone}, where the viewpoint information with correlations between frames is learned. 
  The $|\mathcal{T}| \times L \times C$ feature map extracted by the Transformer is then averaged over its $L = HW$ spatial positions to obtain $\boldsymbol{y}_t^{\text{view}}$ $(t \in \mathcal{T})$, which serves as the intermediate variable from which $f_{\boldsymbol{\phi}}^\text{view}$ and $f_{\boldsymbol{\phi}}^\lambda$ compute $[\boldsymbol{\mu}_t^{\text{view}}, \boldsymbol{\sigma}_{t}^{\text{view}}]$ and $\boldsymbol{\mu}_{t,d}^{\lambda}$, respectively.

\subsubsection{Inference of Predicted View-dependent Latents}
\label{subsec:query_view_inference}

  Inference of latent variables related to predicted viewpoints is challenging because $\boldsymbol{x}_{\mathcal{Q}}$ is not provided. Therefore, the predicted view-dependent latent variables need to be inferred through the observed viewpoints. We introduce the inference methods for $\boldsymbol{\lambda}_{\mathcal{Q}}$ and $\boldsymbol{z}_{\mathcal{Q}}^{\text{view}}$, respectively.
  
  \textbf{Inference of \textnormal{$\boldsymbol{\lambda}_{\mathcal{Q}}$}.} According to the prior distribution of $\boldsymbol{\lambda}_{t,d}$ defined in Eq.\ref{eq:lambda_prior}, the mean $\boldsymbol{\mu}^\lambda_{t,d}$ of the posterior $q_{\boldsymbol{\phi}} (\boldsymbol{\lambda}_{t,d} \mid \boldsymbol{\lambda}_{\mathcal{T}}, \boldsymbol{t}_{\mathcal{S}})$ can be approximated by a linear function of $\boldsymbol{w}_t$, i.e. $\boldsymbol{\mu}_{t,d}^\lambda = \boldsymbol{\hat{A}}_d \boldsymbol{w}_t, \boldsymbol{\hat{A}}_d\in \mathbb{R}^{D_{\lambda} \times |\boldsymbol{w}_t|}$. Using the least squares method, the optimal ${\boldsymbol{\hat{A}}_d^{*}}$ ($1\leq d \leq D$) and the posterior of $\boldsymbol{\lambda}_{t,d}$ ($t \in \mathcal{Q}$) are:
\begin{align}
\label{eq:lambda_solve}
    q_{\boldsymbol{\phi}} (\boldsymbol{\lambda}_{t,d} \mid \boldsymbol{\lambda}_{\mathcal{T}}, \boldsymbol{t}_{\mathcal{S}}) = & \mathcal{N}({\boldsymbol{\hat{A}}_d^{*}}\boldsymbol{w}_t, \sigma_{w}^2\boldsymbol{I})\\
    {\boldsymbol{\hat{A}}_d^{*}} = & \boldsymbol{\Phi}_d^{\top}\boldsymbol{W}_{\mathcal{T}}(\boldsymbol{W}_{\mathcal{T}}^{\top} \boldsymbol{W}_{\mathcal{T}})^{-1}
\end{align}
where $\boldsymbol{W}_{\mathcal{T}} = \big[ \boldsymbol{w}_1,...,\boldsymbol{w}_{|\mathcal{T}|}\big]^{\top}\in \mathbb{R}^{|\mathcal{T}|\times |\boldsymbol{w}_t|}$ and $\boldsymbol{\Phi}_d = \big[ \boldsymbol{\mu}_{1,d}^{\lambda}, ..., \boldsymbol{\mu}_{|\mathcal{T}|,d}^{\lambda}\big]^{\top}\in \mathbb{R}^{|\mathcal{T}| \times D_{\lambda}}$, with rows indexed over the observed frames.
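
  As a brief sketch of where this closed form comes from, assume that $\boldsymbol{W}_{\mathcal{T}}$ has full column rank, which requires $|\mathcal{T}| \geq |\boldsymbol{w}_t|$ (this condition is our assumption for the illustration and is not stated above). The least-squares problem and its normal equations then read:
\begin{align*}
    \boldsymbol{\hat{A}}_d^{*} &= \arg\min_{\boldsymbol{\hat{A}}_d} \big\| \boldsymbol{\Phi}_d - \boldsymbol{W}_{\mathcal{T}} \boldsymbol{\hat{A}}_d^{\top} \big\|_F^2, \\
    \boldsymbol{W}_{\mathcal{T}}^{\top} \boldsymbol{W}_{\mathcal{T}} \, \boldsymbol{\hat{A}}_d^{*\top} &= \boldsymbol{W}_{\mathcal{T}}^{\top} \boldsymbol{\Phi}_d
    \;\Longrightarrow\;
    \boldsymbol{\hat{A}}_d^{*} = \boldsymbol{\Phi}_d^{\top} \boldsymbol{W}_{\mathcal{T}} (\boldsymbol{W}_{\mathcal{T}}^{\top} \boldsymbol{W}_{\mathcal{T}})^{-1},
\end{align*}
  where the last step uses the symmetry of $\boldsymbol{W}_{\mathcal{T}}^{\top} \boldsymbol{W}_{\mathcal{T}}$. For instance, with $\boldsymbol{w}_t = \big[ \cos t, \sin t\big]$, at least two observed frames with non-collinear time embeddings are needed for the inverse to exist.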

  \textbf{Inference of \textnormal{$\boldsymbol{z}_{\mathcal{Q}}^{\text{view}}$}.} $q_{\boldsymbol{\eta}}(\boldsymbol{z}_{\mathcal{Q}}^{\text{view}} \mid \boldsymbol{z}_{\mathcal{T}}^{\text{view}},\boldsymbol{\lambda}_{\mathcal{S}})$ follows the same distribution as the predictive distribution of the GPs (the details can be found in the Supplementary Material):
\begin{align} 
  q_{\boldsymbol{\eta}}(\boldsymbol{z}_{\mathcal{Q}}^{\text{view}} \mid \boldsymbol{z}_{\mathcal{T}}^{\text{view}},\boldsymbol{\lambda}_{\mathcal{S}}) = \prod_{d=1} ^D p_{\boldsymbol{\eta}}(\boldsymbol{z}_{\mathcal{Q},d}^{\text{view}} \mid \boldsymbol{z}_{\mathcal{T},d}^{\text{view}},\boldsymbol{\lambda}_{\mathcal{S},d})
\end{align}
  where $p_{\boldsymbol{\eta}}(\boldsymbol{z}_{\mathcal{Q},d}^{\text{view}} \mid \cdot )$ is a multivariate Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}^\text{view}_{\mathcal{Q},d},\boldsymbol{\Sigma}^\text{view}_{\mathcal{Q},d})$, and the parameters $\boldsymbol{\mu}^\text{view}_{\mathcal{Q},d}$ and $\boldsymbol{\Sigma}^\text{view}_{\mathcal{Q},d}$ are analytical functions of $\boldsymbol{\lambda}_{\mathcal{S},d}$, $\boldsymbol{z}_{\mathcal{T},d}^{\text{view}}$ and $\boldsymbol{\eta}$. 
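
  For reference, suppose the predictive distribution is the standard noise-free conditional of a zero-mean GP (an assumption made here for illustration; the exact expressions used are given in the Supplementary Material). Let $\boldsymbol{K}_{\boldsymbol{\eta}}^d$ be partitioned into blocks $\boldsymbol{K}_{\mathcal{T}\mathcal{T}}^d$, $\boldsymbol{K}_{\mathcal{T}\mathcal{Q}}^d$, $\boldsymbol{K}_{\mathcal{Q}\mathcal{T}}^d$ and $\boldsymbol{K}_{\mathcal{Q}\mathcal{Q}}^d$ according to the observed and predicted timestamps (this block notation is introduced only for this sketch). The parameters would then take the form:
\begin{align*}
  \boldsymbol{\mu}^\text{view}_{\mathcal{Q},d} &= \boldsymbol{K}_{\mathcal{Q}\mathcal{T}}^d \big(\boldsymbol{K}_{\mathcal{T}\mathcal{T}}^d\big)^{-1} \boldsymbol{z}_{\mathcal{T},d}^{\text{view}}, \\
  \boldsymbol{\Sigma}^\text{view}_{\mathcal{Q},d} &= \boldsymbol{K}_{\mathcal{Q}\mathcal{Q}}^d - \boldsymbol{K}_{\mathcal{Q}\mathcal{T}}^d \big(\boldsymbol{K}_{\mathcal{T}\mathcal{T}}^d\big)^{-1} \boldsymbol{K}_{\mathcal{T}\mathcal{Q}}^d .
\end{align*}
  Under this reading, the predicted viewpoint latents are determined by how strongly the kernel correlates each predicted timestamp with the observed ones, which is why no viewpoint annotation is required at prediction time.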

\subsubsection{Inference of View-independent Latents}
\label{subsec:object_inference}
  The posteriors of the view-independent object-centric latent variables $\{ \boldsymbol{z}^{\text{bck}}, \boldsymbol{z}^{\text{obj}}, \boldsymbol{z}^{\text{pres}}, \boldsymbol{\nu} \}$ in Eq.\ref{eq:posterior} are defined as:
\begin{align}
    \label{eq:obj_post}
    q_{\boldsymbol{\phi}}(\boldsymbol{z}^{\text{bck}} \mid \boldsymbol{x}_\mathcal{T}) &= \mathcal{N}(\boldsymbol{z}^{\text{bck}} \mid \boldsymbol{\mu}^{\text{bck}},\text{diag}(\boldsymbol{\sigma}^{\text{bck}})^2)\\
    q_{\boldsymbol{\phi}}(\boldsymbol{z}_k^{\text{obj}} \mid \boldsymbol{x}_\mathcal{T}) &= \mathcal{N}(\boldsymbol{z}_k^{\text{obj}} \mid \boldsymbol{\mu}_k^{\text{obj}},\text{diag}(\boldsymbol{\sigma}_k^{\text{obj}})^2) \\
    q_{\boldsymbol{\phi}}(z_k^{\text{pres}} \mid \boldsymbol{x}_\mathcal{T}) &= \text{Bernoulli}(z_k^{\text{pres}} \mid \kappa_k) \\
    q_{\boldsymbol{\phi}}(\nu_k \mid \boldsymbol{x}_\mathcal{T}) &= \text{Beta}(\nu_k \mid \tau_{k,1}, \tau_{k,2})
\end{align}
  where the default range of $k$ is $1\leq k \leq K$. All the parameters of the above distributions are computed from the outputs of a sequential extension of Slot Attention \cite[]{locatello2020object}, which is illustrated in Figure \ref{fig:overview}.

  The model maintains $K+1$ slots $\boldsymbol{y}^{\text{attr}} = [ \boldsymbol{y}^{\text{bck}},\boldsymbol{y}_1^{\text{obj}}, ...,\boldsymbol{y}_K^{\text{obj}}]$, with $\boldsymbol{y}_k^{\text{attr}} \in \mathbb{R}^{D_s}$. Unlike Slot Attention \cite[]{locatello2020object}, the foreground objects and the background are initialized in two different ways. Each $\boldsymbol{y}_k^{\text{attr}}$ is then combined with $\boldsymbol{y}_t^{\text{view}} \in \mathbb{R}^{D_v}$ ($t \in \mathcal{T}$), obtained in Section \ref{subsec:observed_view_inference}, to produce $|\mathcal{T}| \times (K+1) $ slots $\boldsymbol{y}_{t,k}^{\text{full}} \in \mathbb{R}^{D_f} $ that carry the viewpoint information, where $D_f = D_s + D_v$. Another encoder extracts the feature maps of $\boldsymbol{x}_{\mathcal{T}}$, denoted as $\boldsymbol{y}_{\mathcal{T}}^{\text{sa}}$. As in Slot Attention, the slots are refined for $M$ iterations. In each iteration, Eq.~\ref{eq:cross_attn} first applies cross attention to obtain the attention masks $\boldsymbol{a}_t \in \mathbb{R}^{N\times (K+1)}$ of the $K$ objects and the background. The masks of all the objects and the background are then normalized over the $N$ pixels and multiplied with the values computed from $\boldsymbol{y}_t^{\text{sa}}$ to obtain the hidden state $\boldsymbol{u}_t \in \mathbb{R}^{(K+1)\times D_f}$ for the GRU update. Finally, the attribute parts of the updated slots $\hat{\boldsymbol{y}}_{t,k}^{\text{full}}$ are averaged over the frames in $\mathcal{T}$.
\begin{align}
\label{eq:cross_attn}
  \boldsymbol{a}_{t} &= \underset{K+1}{\text{Softmax}} \Big( \frac{k(\boldsymbol{y}_t^{\text{sa}}) \cdot q(\boldsymbol{y}_{t,1:K+1}^{\text{full}})^{\top}}{\sqrt{D_f}}\Big)\\
  \boldsymbol{u}_t &= \sum_{n=1}^N \Big(\underset{N}{\text{Softmax}}\big( \log \boldsymbol{a}_{t,n}\big) \cdot v(\boldsymbol{y}_{t,n}^{\text{sa}}) \Big)\\
  \hat{\boldsymbol{y}}^{\text{full}}_{t,k} &= \text{GRU}(\boldsymbol{y}^{\text{full}}_{t,k},\boldsymbol{u}_{t,k}) \quad \big[ \hat{\boldsymbol{y}}_{t,k}^{\text{attr}},\hat{\boldsymbol{y}}_{t,k}^{\text{view}} \big] \stackrel{\text{split}}{\leftarrow} \hat{\boldsymbol{y}}^{\text{full}}_{t,k}\\
  \boldsymbol{y}_{k}^{\text{attr}} &={\text{mean}}_{|\mathcal{T}|}\Big(\hat{\boldsymbol{y}}_{1:|\mathcal{T}|,k}^{\text{attr}}\Big)
\end{align}
  where $k$, $q$ and $v$ are MLPs for producing key, query and value, respectively. The procedure maintains the permutation invariance w.r.t. the input order of frames. $\boldsymbol{\mu}^{\text{bck}}$ and $\boldsymbol{\sigma}^{\text{bck}}$ are obtained through the neural network $f_{\boldsymbol{\phi}}^{\text{bck}}$ with $\boldsymbol{y}^{\text{bck}}$ as input; $\boldsymbol{\mu}_k^{\text{obj}},\boldsymbol{\sigma}_k^{\text{obj}},\kappa_k,\tau_{k,1}, \tau_{k,2}$ are obtained through the shared neural network $f_{\boldsymbol{\phi}}^{\text{obj}}$ with $\boldsymbol{y}_k^{\text{obj}}$ as input.
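
  The second line of Eq.~\ref{eq:cross_attn} can be unpacked with the identity $\text{Softmax}(\log \boldsymbol{a}) = \boldsymbol{a} / \sum \boldsymbol{a}$, which holds because the entries of $\boldsymbol{a}_t$ are positive. Writing $a_{t,n,k}$ for the entry of $\boldsymbol{a}_t$ at pixel $n$ and slot $k$ (an index convention assumed here for illustration), we have
\begin{align*}
  \underset{N}{\text{Softmax}}\big(\log a_{t,n,k}\big)
    &= \frac{a_{t,n,k}}{\sum_{n'=1}^{N} a_{t,n',k}}, \\
  \boldsymbol{u}_{t,k}
    &= \sum_{n=1}^{N} \frac{a_{t,n,k}}{\sum_{n'=1}^{N} a_{t,n',k}}\, v(\boldsymbol{y}_{t,n}^{\text{sa}}),
\end{align*}
  so each $\boldsymbol{u}_{t,k}$ is a convex combination of the pixel-wise values, i.e., an attention-weighted mean over the $N$ pixels.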

\subsection{Training}
\label{sec:training}
  Optimizing the evidence lower bound (ELBO) for all frames (including both observed and predicted frames) is unstable. To address this problem, we adopt a two-stage training procedure. Let $\boldsymbol{\Omega}_{\mathcal{S}} = \{\boldsymbol{\Omega}_{\mathcal{T}},\boldsymbol{\Omega}_{\mathcal{Q}} \}$, where $\boldsymbol{\Omega}_{\mathcal{T}} = \big\{ \boldsymbol{z}^{\text{bck}},\boldsymbol{z}^{\text{obj}}, \boldsymbol{z}^{\text{pres}}, \boldsymbol{\nu}, \boldsymbol{\lambda}_{\mathcal{T}}, \boldsymbol{z}_{\mathcal{T}}^{\text{view}}\big\}$ and $\boldsymbol{\Omega}_{\mathcal{Q}} = \big\{ \boldsymbol{z}^{\text{bck}},\boldsymbol{z}^{\text{obj}}, \boldsymbol{z}^{\text{pres}}, \boldsymbol{\nu}, \boldsymbol{\lambda}_{\mathcal{Q}}, \boldsymbol{z}_{\mathcal{Q}}^{\text{view}}\big\}$, i.e., the view-independent latent variables are shared by $\boldsymbol{\Omega}_{\mathcal{T}}$ and $\boldsymbol{\Omega}_{\mathcal{Q}}$. The losses of the two stages are as follows:
\begin{align}
  \mathcal{L}_1 = & -\mathbb{E}_{q_{\boldsymbol{\phi},\boldsymbol{\eta}}( \boldsymbol{\Omega}_{\mathcal{T}} \mid \boldsymbol{x}_{\mathcal{T}})}\big[ \log p_{\boldsymbol{\theta},\boldsymbol{\eta}}(\boldsymbol{x}_{\mathcal{T}} \mid \boldsymbol{\Omega}_{\mathcal{T}})\big] \notag \\
  & + D_{KL}\Big( q_{\boldsymbol{\phi},\boldsymbol{\eta}}( \boldsymbol{\Omega}_{\mathcal{T}} \mid \boldsymbol{x}_\mathcal{T}) \| p_{\boldsymbol{\theta},\boldsymbol{\eta}}(\boldsymbol{\Omega}_\mathcal{T})\Big) \\
  \mathcal{L}_2 = & -\frac{1}{|\mathcal{T}|}\mathbb{E}_{q_{\boldsymbol{\phi},\boldsymbol{\eta}}( \boldsymbol{\Omega}_{\mathcal{T}} \mid \boldsymbol{x}_\mathcal{T},\boldsymbol{t}_{\mathcal{T}})}\big[ \log p_{\boldsymbol{\theta},\boldsymbol{\eta}}(\boldsymbol{x}_\mathcal{T} \mid \boldsymbol{\Omega}_{\mathcal{T}})\big]\notag \\ 
 -\frac{1}{|\mathcal{Q}|}  & \mathbb{E}_{q_{\boldsymbol{\phi},\boldsymbol{\eta}}(\boldsymbol{\Omega}_{\mathcal{T}}\mid \boldsymbol{x}_\mathcal{T},\boldsymbol{t}_{\mathcal{T}} )q_{\boldsymbol{\phi},\boldsymbol{\eta}}( \boldsymbol{\Omega}_{\mathcal{Q}} \mid \boldsymbol{\Omega}_\mathcal{T}, \boldsymbol{t}_{\mathcal{Q}})}\big[ \log p_{\boldsymbol{\theta},\boldsymbol{\eta}}(\boldsymbol{x}_\mathcal{Q} \mid \boldsymbol{\Omega}_{\mathcal{Q}}) \big] \notag \\
   +\beta & D_{KL}\Big( q_{\boldsymbol{\phi},\boldsymbol{\eta}}( \boldsymbol{\Omega}_{\mathcal{S}}\mid \boldsymbol{x}_{\mathcal{T}}, \boldsymbol{t}_{\mathcal{S}}) \| p_{\boldsymbol{\theta},\boldsymbol{\eta}}(\boldsymbol{\Omega}_{\mathcal{S}}\mid \boldsymbol{t}_{\mathcal{S}})\Big)
\end{align}
  where $\mathcal{L}_1$ is a standard negative ELBO of $\boldsymbol{\Omega}_{\mathcal{T}}$ on $\boldsymbol{x}_{\mathcal{T}}$ that learns object-centric representations from multiple frames and does not depend on $\boldsymbol{t}_{\mathcal{S}}$, while $\mathcal{L}_2$ adopts curriculum learning to learn the viewpoint latent variables as a function of $\boldsymbol{t}_{\mathcal{S}}$. Let $\mathcal{S}'$ denote a subset of $\mathcal{S}$ whose size $|\mathcal{S}'|$ is scheduled to gradually increase during training. $\mathcal{S}'$ is randomly divided into $\mathcal{T}$ and $\mathcal{Q}$, where $|\mathcal{Q}|\sim U(1,C)$ ($C<|\mathcal{S}'|$, and $C$ increases during training). $\mathcal{L}_2$ averages the reconstruction losses of the observed and predicted frames over the respective numbers of frames to balance the two terms, and $\beta \geq 1$ is a hyperparameter following \cite[]{burgess2018understanding}. Note that the reconstruction performance after the second stage is worse than that after the first stage; however, the second stage performs well on the prediction task.
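
  As a concrete illustration of one training step with $\mathcal{L}_2$, assume that $U(1,C)$ denotes the discrete uniform distribution on $\{1,\dots,C\}$ and take the purely illustrative values $|\mathcal{S}'| = 6$ and $C = 3$ (neither value is taken from the experiments). Since $\mathcal{S}'$ is partitioned into $\mathcal{T}$ and $\mathcal{Q}$,
\begin{align*}
  |\mathcal{Q}| \in \{1,2,3\}, \qquad |\mathcal{T}| = |\mathcal{S}'| - |\mathcal{Q}| \in \{5,4,3\}.
\end{align*}
  For a draw $|\mathcal{Q}| = 2$, the two reconstruction terms of $\mathcal{L}_2$ are scaled by $1/|\mathcal{T}| = 1/4$ and $1/|\mathcal{Q}| = 1/2$, respectively, so that the four observed frames do not dominate the two predicted frames.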

\section{Experiments}
\begin{figure*}[tb]
  \centering
  \setlength{\fboxrule}{3pt}
  \fcolorbox{red}{white}{\begin{minipage}[t]{0.4\textwidth}
    \centering
    \subfigure[MulMON]{
        \centering
        \includegraphics[width=0.45\linewidth]{assets/MulMON/60.pdf}
        %\caption{fig2}
    }
    \subfigure[SIMONe]{
        \centering
        \includegraphics[width=0.45\linewidth]{assets/SIMONe/60.pdf}
        %\caption{fig2}
    }

    \subfigure[OCLOC]{
        \centering
        \includegraphics[width=0.45\linewidth]{assets/OCLOC/60.pdf}
        %\caption{fig2}
    }
    \subfigure[Ours]{
        \centering
        \includegraphics[width=0.45\linewidth]{assets/Ours/60.pdf}
        %\caption{fig2}
    }
  \end{minipage}
  }
  \fcolorbox{blue}{white}{\begin{minipage}[t]{0.495\textwidth}
      \centering
      \subfigure[MulMON]{
          \centering
          \includegraphics[width=0.86\linewidth]{assets/MulMON/predict_overview.pdf}
          %\caption{fig2}
      }

      \subfigure[Ours]{
          \centering
          \includegraphics[width=0.86\linewidth]{assets/Ours/predict_overview.pdf}
          %\caption{fig2}
      }
  \end{minipage}
  }
  \caption{\textbf{Left}: Visualization results of the compared methods on the \emph{observation} set of CLEVR-COMPLEX, where four consecutive frames are shown. \textbf{Right}: Visualization results on the \emph{prediction} set of SHOP-SIMPLE. The `images' in blue boxes are unobserved ground truths and the `recons' in blue boxes are predicted results.}
  \label{fig:exp_overview}
\end{figure*}

\begin{figure*}[tb]
	\centering
  \setlength{\fboxrule}{3pt}
  \fcolorbox{red}{white}{
	\begin{minipage}[b]{0.605\textwidth}
		\subfigure[Video Recomposition (SHOP-COMPLEX)]{
			\includegraphics[width=\linewidth]{assets/Ours/recompose.pdf} 
			\label{fig:video_recomp}
    }
	\end{minipage}
  }
  \fcolorbox{blue}{white}{
	\begin{minipage}[b]{0.3\textwidth}
		\subfigure[Video Generation (CLEVR-SIMPLE)]{
			\includegraphics[width=\linewidth]{assets/Ours/generate_clevr_simple.jpg}  
			\label{fig:clevr_gen}
    }

    \subfigure[Video Generation (SHOP-SIMPLE)]{
			\includegraphics[width=\linewidth]{assets/Ours/generate_shop_simple.jpg}  
			\label{fig:shop_gen}
    }
	\end{minipage}
  }
	\caption{\textbf{Left}: Scene image generation from novel viewpoints through recomposing viewpoint representations and object-centric representations. \textbf{Right}: Video generation based on CLEVR-SIMPLE and SHOP-SIMPLE.}
	\label{fig:video_prop}
\end{figure*}


  We design experiments to investigate 1) how well the proposed model performs compared to state-of-the-art multi-view models in object-centric video decomposition on the observation set; 2) whether the proposed model can disentangle the 3D scene into object-centric view-invariant representations and viewpoint representations; 3) how well the proposed model handles occlusions compared to existing methods; 4) how well the proposed model predicts unobserved frames given only their timestamps; and 5) whether the proposed model can generate videos.

  To validate the above, we compare the proposed model\footnote{The code is available at https://github.com/FudanVI/\\compositional-scene-representation-toolbox} with three state-of-the-art models: \textbf{MulMON} \cite[]{li2020learning}, which uses viewpoint annotations, and the viewpoint-free models \textbf{SIMONe} \cite[]{kabra2021simone} and \textbf{OCLOC} \cite[]{yuan2022unsupervised}. We build four synthetic video datasets, called CLEVR-SIMPLE, CLEVR-COMPLEX, SHOP-SIMPLE, and SHOP-COMPLEX, by modifying the official code of multi-view CLEVR \cite[]{johnson2017clevr} and SHOP \cite[]{nazarczuk2020shop}. The two SHOP datasets are more challenging than the two CLEVR datasets in terms of object texture; the two COMPLEX versions are more challenging than the two SIMPLE versions because they contain more types of objects and backgrounds.
  
  We train the proposed model with the two-stage strategy introduced in Section \ref{sec:training}. Stage 1 learns to reconstruct the observation set without supervision, while Stage 2 learns to predict the unobserved set with timestamps as the only supervision. We train the proposed model on all the datasets using the Adam optimizer with a learning rate of $4\times 10^{-4}$ for 300K gradient steps. The increment of the curriculum schedule is 2.

  \textbf{Video Decomposition.} Since the proposed model maintains view-invariant object-centric representations of the 3D structure, video decomposition is crucial to evaluating the completeness and accuracy of the learned representations. Figure \ref{fig:exp_overview} (Left) shows the visualization results on CLEVR-COMPLEX. The proposed model can accurately represent objects with complex shapes from multiple viewpoints and cleanly separates the foreground objects from the background. Moreover, the proposed model tends to treat shadows as parts of objects (e.g., the horse in Figure \ref{fig:exp_overview}(d)). This is reasonable, since shadows are cast by the corresponding objects under the scene lighting. Nevertheless, the shadow area is noticeably smaller than that of the other models.

  Table \ref{table:compare}(a) reports the segmentation performance in terms of foreground objects. ARI-O measures how accurately a video is decomposed into separate objects. We find that, except on CLEVR-SIMPLE, the proposed model outperforms the other models, especially on the two SHOP datasets, probably because the integrity of the 3D object representations helps reconstruct better masks. SIMONe and OCLOC fail to capture the objects on SHOP-COMPLEX. A possible reason is that the background is hard to distinguish from the objects in SHOP-COMPLEX, such that these models cannot represent the background separately during inference. Although OCLOC models the background separately, sampling from permutation-equivalent slots may affect the extraction of the background representation.
  
\begin{table*}
  \caption{Performance comparison of MulMON, SIMONe, OCLOC and the proposed model (Ours). ARI-O is adopted for evaluating segmentation, IoU and OOA are adopted for evaluating segmentation with occlusions, and MSE is adopted for evaluating reconstruction. Except for MSE in (d), all results are reported as `mean $\pm$ std' over 5 random seeds. `-S' and `-C' are short for `SIMPLE' and `COMPLEX', respectively.}\label{table:compare}	
\centering
\subtable[ARI-O (observation set)]{
    \begin{minipage}[a]{0.48\textwidth}
    \centering
    \renewcommand\arraystretch{1.4}
    \scalebox{0.7}{
  \begin{tabular}{ccccc} 
      \toprule[1.5pt]
    \multirow{2}{*}{Model}&CLEVR-S&CLEVR-C& SHOP-S &  SHOP-C \\
    \cline{2-5}
    ~ & ARI-O$\uparrow$ & ARI-O$\uparrow$ & ARI-O$\uparrow$ & ARI-O$\uparrow$ \\
  \hline
  MulMON (cond) & \large \textbf{96.4}  $\pm$ 0.1  & \large 92.9  $\pm$ 0.2 & \large 88.3  $\pm$ 0.6 & \large 87.1  $\pm$ 0.2\\
  SIMONe & \large 91.0  $\pm$ 0.0  & \large 91.4  $\pm$ 0.0 & \large 55.3  $\pm$ 0.0 & \large 33.5  $\pm$ 0.0\\
  OCLOC & \large 92.7  $\pm$ 0.8  & \large 82.7  $\pm$ 0.8 & \large 91.3  $\pm$ 0.4 & \large 29.3  $\pm$ 0.5\\
  Ours & \large 95.9  $\pm$ 0.3  & \large \textbf{94.1}  $\pm$ 0.3 & \large \textbf{95.8} $\pm$ 0.1 & \large \textbf{94.9}  $\pm$ 0.4\\
  \bottomrule[1.5pt]
  \end{tabular}
  }
\end{minipage}
}
\subtable[IoU and OOA (observation set)]{
    \begin{minipage}[a]{0.48\textwidth}
    \centering
    \renewcommand\arraystretch{1.4}
    \scalebox{0.7}{
  \begin{tabular}{c|cc|cc} 
      \toprule[1.5pt]
    \multirow{2}{*}{Model}&\multicolumn{2}{c|}{IoU$\uparrow$}& \multicolumn{2}{c}{OOA$\uparrow$} \\
    \cline{2-5}
    ~ & OCLOC & Ours & OCLOC & Ours \\
  \hline
  CLEVR-S & \large 45.6 $\pm$ 0.2  & \large \textbf{59.5}  $\pm$ 0.5 & \large 93.6  $\pm$ 1.2 & \large \textbf{95.3}  $\pm$ 1.1\\
  CLEVR-C & \large 35.1  $\pm$ 0.2  & \large \textbf{50.9}  $\pm$ 0.4 & \large 89.1  $\pm$ 1.2 & \large \textbf{93.0}  $\pm$ 0.8\\
  SHOP-S &\large 61.9  $\pm$ 0.6  & \large \textbf{65.9}  $\pm$ 0.1 & \large 72.8 $\pm$ 1.4 & \large \textbf{78.9}  $\pm$ 0.4\\
  SHOP-C & \large 21.5  $\pm$ 0.3  & \large \textbf{66.2}  $\pm$ 0.6 & \large 57.9 $\pm$ 1.9 & \large \textbf{81.8}  $\pm$ 1.3\\
  \bottomrule[1.5pt]
  \end{tabular}
  }
\end{minipage}
}
\quad 
\centering
\subtable[ARI-O (prediction set)]{
    \begin{minipage}[a]{0.48\textwidth}
    \centering
    \renewcommand\arraystretch{1.4}
    \scalebox{0.7}{
  \begin{tabular}{c|ccccc} 
      \toprule[1.5pt]
    \multicolumn{2}{c}{ \multirow{2}{*}{Model} }&CLEVR-S&CLEVR-C& SHOP-S &  SHOP-C \\
    \cline{3-6}
    \multicolumn{2}{c}{~} & ARI-O$\uparrow$ & ARI-O$\uparrow$ & ARI-O$\uparrow$ & ARI-O$\uparrow$\\
  \hline
  \multirow{2}{*}{Mode 1}&MulMON & \large \textbf{96.2}  $\pm$ 0.1  & \large 91.5  $\pm$ 0.3 & \large 88.3  $\pm$ 0.5 & \large 86.9  $\pm$ 0.7\\
  ~& Ours & \large 95.5  $\pm$ 0.5  & \large \textbf{95.5}  $\pm$ 0.9 & \large \textbf{96.0}  $\pm$ 0.3 & \large \textbf{92.9}  $\pm$ 0.4\\
  \hline
  \multirow{2}{*}{Mode 2}&MulMON & \large \textbf{96.9}  $\pm$ 0.2  & \large 94.5  $\pm$ 0.2 & \large 87.1  $\pm$ 0.6 & \large 86.0  $\pm$ 0.6\\
  ~& Ours & \large 95.1  $\pm$ 0.5  & \large \textbf{95.0}  $\pm$ 0.6 & \large \textbf{95.5}  $\pm$ 0.1 & \large \textbf{93.8}  $\pm$ 0.8\\
  \bottomrule[1.5pt]
  \end{tabular}
  }
\end{minipage}
}
\subtable[MSE (prediction set)]{
    \begin{minipage}[a]{0.48\textwidth}
    \centering
    \renewcommand\arraystretch{1.4}
    \scalebox{0.7}{
  \begin{tabular}{c|ccccc} 
      \toprule[1.5pt]
    \multicolumn{2}{c}{ \multirow{2}{*}{Model} }&CLEVR-S&CLEVR-C& SHOP-S &  SHOP-C \\
    \cline{3-6}
    \multicolumn{2}{c}{~} & MSE$\downarrow$ & MSE$\downarrow$ & MSE$\downarrow$ & MSE$\downarrow$ \\
  \hline
  \multirow{2}{*}{Mode 1}&MulMON & \large \textbf{0.0014} & \large \textbf{0.0020} & \large 0.0049 & \large 0.0038\\
  ~& Ours & \large 0.0018 & \large 0.0021 & \large \textbf{0.0034} & \large \textbf{0.0036}\\
  \hline
  \multirow{2}{*}{Mode 2}&MulMON & \large \textbf{0.0014} & \large \textbf{0.0020} & \large 0.0050 & \large 0.0038\\
  ~& Ours & \large 0.0017 & \large 0.0024 & \large \textbf{0.0035} & \large \textbf{0.0038}\\
  \bottomrule[1.5pt]
  \end{tabular}
  }
\end{minipage}
}
\end{table*}
\begin{figure*}[ht]
  \centering
  \begin{minipage}{\linewidth}
      \centering
      \includegraphics[width=0.33\linewidth]{assets/Ours/m1_ari.pdf}
      %\caption{fig2}
      \centering
      \includegraphics[width=0.33\linewidth]{assets/Ours/m1_mse.pdf}
      \centering
      \includegraphics[width=0.33\linewidth]{assets/Ours/m1_iou.pdf}
      %\caption{fig2}
  \caption{Single-view prediction performance in ARI-O, MSE, and IoU in terms of the number of observed views. All results are tested with 5 random seeds and each point on a curve is the mean value and the shaded band denotes $\pm$std.}
  \label{fig:curve}
  \end{minipage}
\end{figure*}

  \textbf{Video Recomposition.} An intriguing experiment is to generate scene images from novel viewpoints by cross-combining viewpoint representations and object-centric representations (including $\boldsymbol{z}^{\text{bck}}$ and $\boldsymbol{z}^{\text{obj}}$). The recomposition is implemented as follows. We randomly choose two videos (each comprising 10 frames) and select the first 5 frames from one video and the last 5 frames from the other. Then, we encode the selected frames into viewpoint representations and object-centric representations. Finally, we combine, frame by frame, the object-centric representations inferred from the first five frames of one video with the viewpoint representations inferred from the last five frames of the other to generate scene images from novel viewpoints. Figure \ref{fig:video_prop}(a) demonstrates that disentangled object-centric and viewpoint representations from different scenes can be effectively coupled, based on which the proposed model can generate novel views.

  \textbf{Occlusion Evaluation.} Among the compared methods, only OCLOC is designed to handle occlusions. The comparison results on CLEVR-COMPLEX are visualized in Figure \ref{fig:exp_overview}(c) and (d). As the camera moves counterclockwise around the center, a gray ball is completely occluded by the green mug in the second frame. The proposed model can reconstruct the complete shape of an object even when it is completely occluded (e.g., the gray ball). We evaluate IoU and OOA as used in \cite[]{yuan2019generative}, which respectively assess the quality of the reconstructed complete shapes and the accuracy of the estimated pairwise ordering of objects. The proposed model clearly outperforms OCLOC, probably because OCLOC samples the pixel-wise shape during generation, which produces noisy pixels and large shadows.

  \textbf{GP Prediction.} Since the viewpoint latent variables are modeled with GPs, we can use the analytical posterior of $\boldsymbol{z}_{\mathcal{Q}}^{\text{view}}$ to predict the remaining viewpoints given the observation set. In our experimental setting, the 10 consecutive viewpoint representations in Figure \ref{fig:exp_overview} are modeled by GPs, and we randomly remove four frames (i.e., the ground truths in the blue boxes are unobserved). The remaining six frames are encoded to infer $\boldsymbol{z}^{\text{obj}}$, $\boldsymbol{z}^{\text{bck}}$, $\boldsymbol{\lambda}_{\mathcal{T}}$, $\boldsymbol{\lambda}_{\mathcal{Q}}$, $\boldsymbol{z}_{\mathcal{T}}^{\text{view}}$ and $\boldsymbol{z}_{\mathcal{Q}}^{\text{view}}$. The four viewpoint representations predicted by the GPs are concatenated with the object-centric representations to reconstruct the scene images. Figure \ref{fig:exp_overview}(f) shows that the proposed model can predict frames at arbitrary timestamps given the observation set. Compared with MulMON, which uses viewpoint annotations, the proposed model can additionally handle occlusions while reconstructing frames from novel viewpoints. To assess the segmentation performance and reconstruction quality on the prediction set, we fix the four frames to be predicted in each of Mode 1 and Mode 2 (see the Supplementary Material for details). Table \ref{table:compare}(c) and (d) show that the proposed model is comparable to MulMON on the two CLEVR datasets and outperforms MulMON on the two SHOP datasets, most clearly in ARI-O. The reconstruction loss helps improve the texture characterization of objects, which may be the reason that the proposed model achieves better MSE on the two SHOP datasets.

  \textbf{Ablation Study.} A generic property of GPs is that the prediction uncertainty decreases as the number of observed variables increases. We therefore hypothesize that the number of observed frames is the factor that most affects the accuracy and uncertainty of the prediction. To verify this hypothesis, we fix a single frame to be predicted and gradually increase the number of observed frames from 2 to 9. The viewpoint representations of both the predicted frame and the observed frames are used together to construct the GPs. We run the GP prediction and plot the performance curves in ARI-O, MSE, and IoU against the number of observed views in Figure \ref{fig:curve}. One can see that the proposed model gradually reduces the uncertainty and improves the performance as the number of observed views increases, and that the performance tends to stabilize once the number of observed views reaches 5.
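
  The decrease of uncertainty can be made explicit in a simplified setting, which is assumed here for illustration only and is not the exact model: a scalar zero-mean GP with noise-free observations and a kernel $g$ normalized such that $g(\lambda,\lambda)=1$. The predictive variance at a query input $\lambda_*$ after observing a single input $\lambda_1$, and after observing a set $\mathcal{T}$ with Gram matrix $\boldsymbol{G}_{\mathcal{T}\mathcal{T}}$ and cross-covariance row vector $\boldsymbol{g}_{*\mathcal{T}}$ (notation introduced for this sketch), is
\begin{align*}
  \sigma^2_{* \mid 1} &= 1 - g(\lambda_*,\lambda_1)^2 \leq 1, \\
  \sigma^2_{* \mid \mathcal{T}} &= 1 - \boldsymbol{g}_{*\mathcal{T}}\, \boldsymbol{G}_{\mathcal{T}\mathcal{T}}^{-1}\, \boldsymbol{g}_{*\mathcal{T}}^{\top}.
\end{align*}
  The subtracted quadratic form is non-negative and cannot shrink when further observations are added to $\mathcal{T}$, since conditioning a Gaussian on additional variables never increases its variance.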

  \textbf{Video Generation.} As we model the viewpoint latent variables with GPs, we can generate videos by sampling from the GPs along the timeline. Figure \ref{fig:video_prop}(b) and (c) show two example videos with 10 frames generated based on CLEVR-SIMPLE and SHOP-SIMPLE. The viewpoint of the 10 frames clearly rotates clockwise around the center, reflecting the captured correlations between viewpoints; meanwhile, the generated objects and backgrounds have no irregular shapes.

\section{Conclusion}
  We propose a time-conditioned generative model for video decomposition and prediction. The proposed model enhances the disentanglement between viewpoint and object-centric representations, and additionally adopts GPs for viewpoint modeling, inference and generation. We design experiments to show that the proposed model can: 1) aggregate 3D object-centric information from multiple viewpoints and, as a result, outperform the state-of-the-art multi-view models; 2) restore the complete shapes of objects even when they are completely occluded; and 3) predict the scene images from unknown viewpoints without viewpoint annotations.

\section*{Acknowledgments}
  This work was supported in part by the National Natural Science Foundation of China (No.62176060), STCSM project (No.20511100400), and the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning.
% References
\bibliography{gao_704}
\end{document}
