% \documentclass{article}

% \usepackage{authblk}
% \usepackage[utf8]{inputenc}
% \usepackage[T1]{fontenc} 
% \usepackage[english]{babel}
% \usepackage{amsmath,graphicx,bbold,dsfont}
% \usepackage{hyperref}       % hyperlinks
% \usepackage{url}            % simple URL typesetting
% \usepackage{booktabs}       % professional-quality tables
% \usepackage{amsfonts}       % blackboard math symbols
% \usepackage{nicefrac}       % compact symbols for 1/2, etc.
% \usepackage{xcolor}         % colors
% \usepackage{tabulary}
% \usepackage{multirow}
% \usepackage{multicol}
% \usepackage{caption}
% \usepackage{amsthm}
% \usepackage{subcaption}
% \usepackage{float}
% \usepackage{bbold}
% \usepackage{mathtools}
% \usepackage{enumitem}
% \usepackage{csquotes}
% \usepackage{algorithm}
% \usepackage[backend=biber,style=numeric-comp,maxnames=2,minnames=1,
% hyperref=auto,doi=false,isbn=false,
% url=false,date=year,abbreviate=true,eprint=false,giveninits,uniquename=init]{biblatex}

% \AtEveryBibitem{%
%   \clearfield{note}%
% }

% \AtEveryBibitem{%
%   \clearlist{language}%
% }

% \addbibresource{midl-samplebibliography.bib}

% \author[1,2]{R. Louiset}
% \author[2]{E. Duchesnay}
% \author[2]{A. Grigis}
% \author[1,2]{B. Dufumier}
% \author[1]{P. Gori}
% \affil[1]{LTCI, Télécom Paris, IPParis, France}
% \affil[2]{NeuroSpin, CEA, University Paris-Saclay, France}
% \date{}%% if you don't need date to appear

% \renewcommand\Authfont{\fontsize{11}{14.4}\selectfont}
% \renewcommand\Affilfont{\fontsize{9}{10.8}\itshape}
%\renewcommand\Affilfont{\itshape\small}

\documentclass{midl} % Include author names
%\documentclass[anon]{midl} % Anonymized submission

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{wrapfig}
\usepackage{booktabs} % for professional tables

\usepackage{caption}
\captionsetup{format=plain}

\newcommand{\PG}[1]{\textcolor{black}{#1}}
\newcommand{\review}[1]{\textcolor{black}{#1}}

% Attempt to make hyperref and algorithmic work together better:
\newcommand{\theHalgorithm}{\arabic{algorithm}}
\usepackage{algorithmic}

%\jmlrvolume{-- Under Review}
\jmlryear{2024}
\jmlrworkshop{Accepted Paper -- MIDL 2024}
\jmlrvolume{-- 127}
% \editors{Accepted for publication at MIDL 2024}
%\editors{Under Review for MIDL 2024}

\title{SepVAE: a contrastive VAE to separate pathological patterns from healthy ones}

%SepVAE: a contrastive VAE to separate pathological patterns from healthy ones

% More complicate cases, e.g. with dual affiliations and joint %Robin Louiset, Edouard Duchesnay, Grigis Antoine, Benoit Dufumier, Pietro Gori
\midlauthor{\Name{R. Louiset\nametag{$^{1,2}$}}, \Name{E. Duchesnay\nametag{$^{2}$}}, \Name{A. Grigis\nametag{$^{2}$}}, \Name{B. Dufumier\nametag{$^{1,2}$}}, \Name{P. Gori\nametag{$^{1}$}}\\
\addr $^{1}$ LTCI, Télécom Paris, IPParis, France\\
\addr $^{2}$ NeuroSpin, CEA, University Paris-Saclay, France
}

\begin{document}

\maketitle

\begin{abstract}
% Variational auto-encoders (VAEs) have advanced the field of unsupervised learning by capturing the underlying structure of a dataset and identifying independent patterns of variation. 
Contrastive Analysis VAEs (CA-VAEs) are a family of Variational Auto-Encoders (VAEs) that aim to separate the factors of variation common to a \textit{background} dataset (BG) (\textit{e.g.,} healthy subjects) and a \textit{target} dataset (TG) (\textit{e.g.,} patients) from the ones that only exist in the target dataset.
%separating the factors of variations that are unique to a \textit{target} dataset (\textit{i.e.,} patients), compared to a \textit{background} dataset (\textit{i.e.,} healthy subjects).
To do so, these methods split the latent space into a set of \textbf{salient} features (\textit{i.e.,} specific to the target dataset) and a set of \textbf{common} features (\textit{i.e.,} present in both datasets). Currently, all CA-VAE models fail to prevent information sharing between the two latent spaces and to capture all salient factors of variation.
%high-level patterns in the salient representation.
%In this paper, we propose a new mathematical framework \PG{ce n'est pas new non ?} based on the maximization of the Evidence LOwer BOund of the joint likelihood of the two datasets, along with 
To address these limitations, we introduce two crucial regularization losses: a disentangling term between the common and salient representations and a classification term between background and target samples in the salient space. We show better performance than previous CA-VAE methods on three medical applications and a natural image dataset (CelebA).
\footnote{Code and datasets available at \url{https://github.com/neurospin-projects/2023_rlouiset_sepvae/}.}
% Code and datasets are available on GitHub \url{https://github.com/neurospin-projects/2023_rlouiset_sepvae}.
\end{abstract}

\begin{keywords}
Contrastive Analysis, VAE, generative model, Psychiatry, population analysis.
\end{keywords}

\section{Introduction}
\label{introduction}
One of the goals of unsupervised learning is to learn a compact, latent representation of a dataset that captures the underlying factors of variation. Furthermore, the estimated latent dimensions should describe distinct, noticeable, and semantically meaningful variations. One way to achieve this is to use a generative model, such as Variational Auto-Encoders (VAEs) \cite{kingma_auto-encoding_2013, higgins_beta-vae_2017} and disentangling methods \cite{higgins_beta-vae_2017, burgess_understanding_2018, shu_rethinking_2018, ainsworth_oi-vae_2018, li_disentangled_2018}. Unlike these methods, which use a \textit{single} dataset, Contrastive Analysis (CA) attempts to distinguish the latent factors that generate a \textit{target} (TG) and a \textit{background} (BG) dataset. Usually, it is assumed that target samples comprise additional (or modified) patterns with respect to background data. The goal is thus to estimate the \textbf{common} generative factors and the ones that are \textbf{target-specific} (or \textbf{salient}).
%, which is trained to reconstruct the input data by learning a compressed and probabilistic representation. %In order to learn the different factors of variability of a dataset, 
% Some of these models, such as $\beta$-VAE \cite{higgins_beta-vae_2017}, are also trained to disentangle the latent representation
%of the data into semantically meaningful and independent axes of variation. This means that each latent 
% so that each dimension should correspond to a specific generative factor that captures distinct and semantically meaningful variations in the data.
%underlying factor contributing to the data variations.
% This means that background data are fully encoded by some generative factors that are also \textbf{common} with the target data. On the other hand, target samples are assumed to be partly generated from strictly proper factors of variability, which we entitle \textbf{target-specific} or \textbf{salient} factors of variability. This formulation is particularly useful in medical applications where clinicians are interested in separating common (i.e., healthy) patterns from the salient (i.e., pathological) ones in an \textit{intepretable} way.
%to interpret which parts of the input belong to the natural healthy variability and which parts of the input code for the pathology.}

\noindent For instance, consider two sets of data: 1) healthy neuro-anatomical MRIs (BG=\textit{background dataset}) and 2) Alzheimer-affected patients' MRIs (TG=\textit{target dataset}).
As in \cite{jack_nia-aa_2018,pmlr-v97-antelmi19a,dufumier_contrastive_2021}, given these two datasets, neuroscientists would be interested in distinguishing common factors of variation (\textit{e.g.,} effects of aging, education, or gender) from Alzheimer's-specific markers (\textit{e.g.,} temporal lobe atrophy, an increase in beta-amyloid plaques). Until recently, separating the various latent mechanisms that drive neuro-anatomical variability in neuro-degenerative disorders was considered hardly feasible. This can be attributed to the intertwining of the variability due to natural aging and the variability due to the development of neurodegenerative disease. The combined effects of both processes make the discovery of novel bio-markers hard to interpret.
The objective of developing such a Contrastive Analysis method is to help separate these processes and thus to identify correlations between neuro-biological markers and pathological symptoms. In the \textbf{common features} space, aging patterns should correlate with normal cognitive decline, while \textbf{salient features} (\textit{i.e.,} Alzheimer-specific patterns) should correlate with pathological cognitive decline.

% \RL{Besides medical imaging, Contrastive Analysis (CA) methods cover various kinds of applications, like in pharmacology (\PG{placebo} versus medicated populations), biology (pre-intervention vs. post-intervention cohorts) \cite{zheng_massively_2017}, and genetics (healthy vs. disease population \cite{jones_contrastive_2021}, \cite{haber_single-cell_2017}).}
%or any anomaly detection application (normal samples vs anomalies).}

\begin{figure}
    \centering        
    \includegraphics[width=0.94\textwidth]{sep_vae_overall_model.png} 
    \caption{Illustration of SepVAE training. Target ($y=1$) and background ($y=0$) images are encoded with the same encoders $e_{\phi_s}$ and $e_{\phi_c}$. The first encoder $e_{\phi_s}$ estimates the salient factors of variation $s$ of the target samples. Background samples' salient space is set to an informationless value $s'=0$.
    The second encoder $e_{\phi_c}$ estimates the common factors $c$. Images are reconstructed using a single decoder $d_{\theta}$ fed with the concatenation of $\textbf{c}$ and $\textbf{s}$. The common space $\textbf{c}$ should only capture common factors of variability (shape), while the salient space $\textbf{s}$ should model target-only factors of variability (color).}
    %\RL{Encoders $e_{\phi_c}$ and $e_{\phi_c}$ respectively infer the parameters of the distributions $q_{\phi_c}(c | x, y)$ and $q_{\phi_s}(s | x)$. 
     %The common space $\textbf{c}$ captures common factors of variability (shape), and the salient space $\textbf{s}$ captures target-only factors of variability (color).}
     \label{fig:overall_model}
\end{figure}

% \begin{figure*}
%     \centering
%     \begin{minipage}{.4\textwidth}
%         \centering
%         \includegraphics[width=0.99\textwidth]{brats_qualitative_results.png} 
%         \caption{SepVAE reconstructions on Brats2021 dataset \cite{menze_multimodal_2015}. 
%            \RL{(Middle) full reconstructions using the estimated common and salient latent vectors. (Right) common-only reconstructions using the estimated common latent vectors and fixing the salient factors to $s'$. 
%        The common latent variables encode the healthy factors of variability (\textit{e.g. :} brain shape and aspect), while the salient factors encode the pathological patterns (\textit{e.g. :} tumors), which are not visible in the right columns (common-only).
%         % We qualitatively show that we separated the pathological factors of generation (tumor's related factors of generation) from the healthy factors of generation (aspect and shape's related factors of generation).
%         }}
%         \label{fig:brats_qualitative_results}
%     \end{minipage}
%     \hspace{0.05\textwidth}
%     \begin{minipage}{.4\textwidth}
%         \centering
%         \includegraphics[width=0.99\textwidth]{celeba_qualitative_results.png} 
%         \caption{SepVAE qualitative example on the CelebA with accessories dataset (BG = no accessories, TG = hats and glasses). (Middle, common+salient): Full reconstructions using the estimated common and salient factors. (Right, common only): Reconstruction using only the estimated common factors fixing the salient to $s'$. \RL{The salient latent variables capture the accessories (hats and glasses) which are target-specific patterns. The common latents capture the common attributes (identity, skin color, background, etc...).}
%         }
%         \label{fig:celeba_qualitative_results}
%     \end{minipage}%
% \end{figure*}

%Variational auto-encoders (VAEs) have advanced the field of unsupervised learning by enabling generating new samples and capturing the underlying structure of the data onto a lower-dimensional data manifold. Compared to linear methods (PCA, ICA). VAEs make use  of deep non-linear encoders to capture non-linear relationships in the data, leading to better performance on a variety of tasks.
% However, while VAEs trained in a fully unsupervised manner encourage the independence of the underlying factors of variation in the data, they cannot clearly disentangle the factors that independently generate distinct classes.

% VAEs trained in a fully unsupervised manner encourage the independence of the underlying factors of variation in the data. However, when trained on labeled data, they cannot clearly automatically attribute each factor of variability to a distinct class or dataset. As a solution, weakly-supervised VAEs propose to leverage additional information, such as class labels, in the learning process. 
% Thus, they learn more discriminative features that are useful for classification, which may lead to improved performance on downstream tasks such as clustering analysis or controlled generation.


\section{Related works}\label{sec:related}
Variational Auto-Encoders (VAEs) \cite{kingma_auto-encoding_2013} have advanced the field of unsupervised learning by generating new samples and capturing the underlying structure of the data on a lower-dimensional manifold. Disentangling methods \cite{higgins_beta-vae_2017, burgess_understanding_2018, shu_rethinking_2018} enable learning the underlying factors of variation in the data. While disentangling \cite{zheng_disentangling_2019, chen_isolating_2019} is a desirable property for improving the control of the image generation process and the interpretation of the latent space
%generation control and latent space interpretation 
\cite{ainsworth_oi-vae_2018, li_disentangled_2018}, these methods are usually based on a \textit{single} dataset, and they do not explicitly use labels or multiple datasets to effectively estimate and separate the common and salient factors of variation.
%and attribute the factors of variations between several classes.  

% Semi and weakly-supervised VAEs \cite{mathieu_disentangling_2019, kingma_semi-supervised_nodate, maaloe_auxiliary_2016, joy_capturing_2021} have proposed to integrate class labels in their training. However, these methods solely allow conditional generalization and better semantic expressivity rather than addressing the separation of the factors of variation between distinct datasets. 

% \noindent Contrastive Analysis (CA) works are explicitly designed to identify patterns that are unique to a target dataset compared to a background dataset. First attempts \cite{zou_contrastive_2013, abid_exploring_2018, ge_rich_2016} employed linear methods in order to identify a projection that captures the variance of the target dataset while minimizing the background information expressivity. However, due to their linearity, these methods had reduced learning expressivity and were also unable to produce satisfactory generation.
\noindent Contrastive VAEs \cite{abid_contrastive_2019, weinberger_moment_nodate, severson_unsupervised_2019, ruiz_learning_2019, zou_joint_2022, choudhuri_towards_2019} employ deep encoders in order to capture higher-level semantics. They usually rely on a latent space split into two parts, a common one and a salient one, produced by two different encoders. Early methods, such as \cite{severson_unsupervised_2019}, employed two decoders (common and salient) and directly summed the common and salient reconstructions in the input space. This seems to be a very strong assumption, which probably does not hold for high-dimensional and complex images. 
%about the kind of target-specific we would like to encode. 
For this reason, subsequent works used a single decoder, which takes as input the concatenation of both latent spaces. Importantly, when seeking to reconstruct background inputs, the decoder is fed with the concatenation of the common part and an informationless reference vector $\textbf{s'}$. This is usually chosen to be a null vector in order to reconstruct a null (i.e., empty) image by setting the decoder's biases to $0$. To fully enforce the constraints and assumptions of the underlying CA generative model, previous methods have proposed different regularizations. Here, we analyze the most important ones with their advantages and shortcomings: 

\noindent \textbf{Minimizing background's variance in the salient space }
Pioneering work \cite{abid_contrastive_2019} exhibits an inconsistency between the encoding and the decoding tasks. While background samples are reconstructed from $\textbf{s'}$, the salient encoder is not encouraged to map background samples to $\textbf{s'}$. To fix
that, subsequent works \cite{weinberger_moment_nodate, zou_joint_2022, choudhuri_towards_2019} proposed to explicitly nullify the background variance in the salient space. This regularization is
necessary to prevent salient features from explaining the background variability, but it is not
sufficient to prevent information leakage between common and salient spaces, as shown in \cite{weinberger_moment_nodate}. 
%to avoid background variability in the salient part. 

\noindent \textbf{Independence between common and salient spaces }
Only \cite{abid_contrastive_2019} proposed to prevent information leakage between the common and salient space by minimizing the total correlation (TC) between %$q_{\phi_c,\phi_s}(c, s | x)$ and $q_{\phi_c}(c|x) \times q_{\phi_s}(s|x)$,
$p(c, s | x)$ and $p(c|x) \times p(s|x)$. Similarly to FactorVAE \cite{kim_disentangling_2019}, they used the density-ratio trick \cite{nguyen_estimating_2010}, which 
%. This 
requires \textit{independently} training a discriminator $D_\lambda(.)$ to approximate the ratio between $p(c,s | x)$ %q_{\phi_c,\phi_s}(c, s| x)$ 
and  %$\bar{q}(x) = q_{\phi_c}(c| x) \times q_{\phi_s}(s| x)$ 
$p(c| x) \times p(s| x)$.
%via the density-ratio trick \cite{nguyen_estimating_2010, Sugiyama2012DensityratioMU}. 
However, the code of \cite{abid_contrastive_2019} does \textit{not} use an independent optimizer for $\lambda$, which is theoretically wrong and thus undermines their contribution. % Moreover, when incorrectly estimated, the TC can become negative, and its minimization can be harmful to the model's training. 
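As a reminder, and only as a sketch in the notation of this paragraph, the density-ratio trick assumes that $D_\lambda(c,s)$ outputs the probability that a pair $(c,s)$ was drawn from the joint $p(c,s|x)$ rather than from the product $p(c|x) \times p(s|x)$, with both sources sampled equally often. For a discriminator at its optimum,
\begin{equation*}
    \begin{aligned}
        \frac{D_\lambda(c,s)}{1-D_\lambda(c,s)} & \approx \frac{p(c,s|x)}{p(c|x)\,p(s|x)}, \\
        KL\big(p(c,s|x)\,||\,p(c|x)\,p(s|x)\big) & \approx \mathbf{E}_{c,s \sim p(c,s|x)} \log \frac{D_\lambda(c,s)}{1-D_\lambda(c,s)}.
    \end{aligned}
\end{equation*}
The approximation is only valid when $D_\lambda$ is close to optimal for the current encoders, which is why $\lambda$ has to be updated by its own optimization step.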

\noindent \textbf{Matching background and target common patterns }
Another work \cite{weinberger_moment_nodate} proposed to encourage the distribution in the common space to be the same across target and background samples. % Mathematically, it is equivalent to minimizing the KL between $q_{\phi_c}(c | y=0)$ and $q_{\phi_c}(c | y=1)$ \RL{(or between $q_{\phi_c}(c)$ and $q_{\phi_c}(c | y)$)}. 
In practice, we argue that it may encourage undesirable \textit{biases} to be captured by salient factors rather than common factors. For example, suppose that we have healthy subjects (\textit{background} dataset) and patients (\textit{target} dataset), and that the patients include both young and old individuals, whereas the healthy subjects are mostly old (\textit{i.e.,} an imbalanced dataset). We would expect the CA method to capture the normal aging patterns %(\textit{i.e.,} the bias) 
in the common space. However, forcing both %$q_{\phi_c}(c | x, y=0)$ and $q_{\phi_c}(c | x, y=1)$ 
$p(c | x, y=0)$ and $p(c | x, y=1)$ to follow the same distribution in the common space would probably lead to a biased distribution and thus to information leakage between salient and common factors (\textit{i.e.,} aging could be considered a salient factor of the patient dataset).%not enable to capture such bias. Therefore, the reconstruction constrain would capture this bias toward the salient space. 
This behavior is not desirable, and we believe that the statistical independence between common and salient space is a more robust property. Our contributions are three-fold: \\
%We first recognize that salient space regularization is especially important. We thus propose to enforce it by minimizing the overlap between target and common distributions on the salient space, using a new classification loss. Second, we found that a common space regularization, as done in \cite{weinberger_moment_nodate} and \cite{choudhuri_towards_2019}, may be harmful in presence of data biases. In such a scenario, we have found that ignoring common space regularization and promoting only independence between common and salient spaces is more adapted and give better results. Please note that, to the best of our knowledge, this is the first time that independence between common and salient space is \textit{coherently} written, from a mathematical point of view, and \textit{correctly} implemented. Ultimately, we provide a sound mathematical framework for theoretical justifications of our regularization terms.
$\bullet$ We develop a new Contrastive Analysis method, called SepVAE, which is supported by a sound and versatile Evidence Lower BOund maximization framework.\\
$\bullet$ We identify and implement two properties: the salient space discriminability and the salient/common independence, that have not been successfully addressed by previous Contrastive VAE methods.\\
$\bullet$ We provide a fair comparison with other SOTA CA-VAE methods on 3 medical applications and a natural image experiment.
\begin{figure}
    \centering
    \begin{minipage}{.48\textwidth}
        \centering
        \includegraphics[width=0.99\textwidth]{brats_qualitative_results.png} 
        \caption{SepVAE reconstructions on the BRATS dataset \cite{menze_multimodal_2015}: healthy patterns are separated from tumors.}
        \label{fig:brats_qualitative_results}
    \end{minipage}
    \hspace{0.02\textwidth}
    \begin{minipage}{.45\textwidth}
        \centering
        \includegraphics[width=0.99\textwidth]{celeba_qualitative_results.png} 
        \caption{SepVAE reconstructions on the CelebA accessories dataset (BG = no accessories, TG = hats and glasses). % The salient latent variables capture the accessories (hats and glasses) which are target-specific patterns. The common latents capture the common attributes (identity, skin color, background, etc...)
        }
        \label{fig:celeba_qualitative_results}
    \end{minipage}%
\end{figure}
\section{Contrastive Variational Autoencoders}
Let $(X,Y)=\{(x_i,y_i) \}_{i=1}^N$ be a dataset of images $x_i$ associated with labels $y_i \in \{0, 1\}$, $0$ for background and $1$ for target. 
%For each image and its associated label, we assume the existence of 
Both background and target samples are assumed to be drawn i.i.d. from two different and unknown distributions that depend on two latent variables: $c_i \in \mathbb{R}^{D_c}$ and $s_i \in \mathbb{R}^{D_s}$. 
%drawn from the unobserved distribution $p_\theta(c_i, s_i | x_i, y_i)$. 
Our objective is to have a generative model $x_i \sim p_\theta (x |y_i, c_i, s_i)$ such that: (1) the \textbf{common} latent vectors $C = \{c_i\}_{i=1}^N$ capture the generative factors of variation shared by the background and target distributions and fully encode the background samples, and (2) the \textbf{salient} latent vectors $S = \{s_i\}_{i=1}^N$ capture the distinct generative factors of variation of the target set (\textit{i.e.,} patterns that are only present in the target dataset and not in the background dataset). Similarly to previous works~\cite{abid_contrastive_2019,weinberger_moment_nodate,zou_joint_2022}, we assume the generative process: $p_\theta(x, y, c, s) = p_\theta(x | c, s, y)p_\theta(c) p_\theta(s | y) p(y)$.
Since $p_\theta(c, s| x, y)$ is hard to compute in practice, we approximate it using an auxiliary parametric distribution $q_{\phi_c, \phi_s}(c, s | x, y)$ and directly derive the Evidence Lower Bound of $\log p_\theta(x, y)$:
\begin{equation}
    - \log p_\theta(x, y) \leq \mathbf{E}_{c, s \sim q_{\phi_c, \phi_s}(c, s|x, y)} \log \frac{q_{\phi_c, \phi_s}(c, s|x, y)}{p_\theta(x, y, c, s)} 
    \label{eq:ELBO}
\end{equation} 
Then, we can expand the bound into three terms: a conditional reconstruction term, a common space prior regularization, and a salient space prior regularization.
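Concretely, substituting the assumed generative process into the right-hand side of Eq.~\ref{eq:ELBO} gives the following identity, written here as a sketch with $q$ as shorthand for $q_{\phi_c, \phi_s}(c, s|x, y)$:
\begin{equation*}
    \mathbf{E}_{c, s \sim q} \log \frac{q_{\phi_c, \phi_s}(c, s|x, y)}{p_\theta(x, y, c, s)} = - \mathbf{E}_{c, s \sim q} \log p_\theta(x | c, s, y) + KL\big(q_{\phi_c, \phi_s}(c, s|x, y)\,||\,p_\theta(c)\,p_\theta(s|y)\big) - \log p(y).
\end{equation*}
The term $-\log p(y)$ does not depend on the latent variables and is a constant for a fixed label prior. The KL term splits into the common prior and salient prior regularizations once the approximate posterior is assumed to factorize over $c$ and $s$.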
\noindent From there, we assume the independence of the auxiliary distributions (\textit{i.e.,} $q_{\phi_c, \phi_s}(c, s|x,y) = q_{\phi_c}(c|x) q_{\phi_s}(s|x,y)$) and of the prior distributions (\textit{i.e.,} $p_\theta(c, s)=p_\theta(c)p_\theta(s)$). Both $p_\theta (x |y_i, c_i, s_i)$ (i.e., a single decoder) and $q_{\phi_c}(c|x) q_{\phi_s}(s|x,y)$ (i.e., two encoders) are assumed to follow Gaussian distributions parametrized by neural networks. To reinforce the independence assumption between $c$ and $s$, we introduce a Mutual Information regularization term $KL(q(c, s)||q(c) q(s))$. This property is desirable to ensure that the information is well separated between the latent spaces. Theoretically, this term is similar to the one in \cite{abid_contrastive_2019}. However, in \cite{abid_contrastive_2019}, the Mutual Information estimation and minimization are done simultaneously\footnote{In \cite{abid_contrastive_2019}, Alg. 1 suggests that the MI estimation and minimization depend on two distinct parameter updates. However, in their code, a single optimizer is used. Moreover, in Sec.~3, the authors write: ``discriminator is trained \underline{simultaneously} with the encoder and decoder''.}, which is theoretically wrong (see Sec.~\ref{sec:related}). Here, we correctly implement an independent optimizer to estimate the Mutual Information.
 %we argue that the estimation of the Mutual Information requires the introduction of an independent optimizer, .
To further reduce the overlap of the target and background distributions in the salient space, \review{differently from previous works}, we also introduce a salient classification loss defined as $-\mathbf{E}_{s \sim q_{\phi_s}(s|x, y)} \log p_\theta(y | s)$.
By combining all these losses together, we obtain the final loss $\mathcal{L}$:
%, a higher bound of the negative joint likelihood $- \log p_\theta(x, y)$ to be minimized in practice:
\begin{equation}
    \label{eq:SepVAE-ELBO}
    \begin{aligned}
        \mathcal{L} = & \underbrace{- \mathbf{E}_{c, s \sim q_{\phi_c, \phi_s}(c, s|x,y)} \log p_\theta(x | c, s, y)}_{\textbf{a) Conditional Reconstruction}} + \underbrace{KL(q(c, s)||q(c) q(s))}_{\textbf{e) Mutual Information}} - \underbrace{\mathbf{E}_{s \sim q_{\phi_s}(s|x,y)} \log p_\theta(y | s)}_{\textbf{d) Salient Classification}} \\
        & + \underbrace{KL(q_{\phi_c}(c|x)||p_{\theta}(c))}_{\textbf{b) Common Prior}} + \underbrace{KL(q_{\phi_s}(s|x, y)||p_{\theta}(s|y))}_{\textbf{c) Salient Prior}}
    \end{aligned}
\end{equation}
% \subsection{Posterior sampling}
% \label{sec:sampling}
% We assume that the posteriors $q_{\phi_c}(c | x)$ and $q_{\phi_s}(s | x, y)$ follow two gaussian distributions, parameterized by the encoders (respectively, common and salient). In particular, as in \cite{zou_joint_2022}, we assume that the averages  of the standard deviation of the salient latents is estimated 


% $q_{\phi_s(s|x,y=1)} \sim N(\mu_s(x), \sigma_s(x))$), and background distributions $q_{\phi_s(s|x,y=0)} \sim N(\mu_s(x|y=0), \sigma_q)$. $N(\mu_\cdot^x, \sigma_\cdot^{x, y})$  As  when $y=0$, we assume the standard deviation of the salient latents to be a small constant $\sigma_{\phi_s}(x, y=0)) = \sigma_q << 1$ (\textit{e.g.,} $\sigma_q = 1e-1$). Empirically, we have seen that the training is easier when fixing $\sigma_s^{x}=\sigma_q$ to a small constant  (\textit{e.g.,} $\sigma_q=1e-1$) when $y=0$. When $y=1$, we choose $\sigma_{\phi_s}(x, y=1))$ as the value predicted by the salient encoder.

\noindent \textbf{Conditional reconstruction} The reconstruction term is  $- \mathbf{E}_{c,s \sim q_{\phi_c, \phi_s}(c, s | x, y)} \log p_\theta(x | c, s, y)$. Given an image $x$ (and a label $y$), a common and a salient latent vector can be drawn from $q_{\phi_c, \phi_s}$ with the help of the reparameterization trick. We assume that $p_\theta(x | c, s, y) \sim \mathcal{N}(d_\theta([c, ys+(1-y)s']), I)$, \textit{i.e.,} $p_\theta(x | c, s, y)$ follows a Gaussian distribution parameterized by $\theta$, centered on $\mu_{\hat{x}} = d_\theta([c, ys+(1-y)s'])$ with identity covariance matrix, where $d_\theta$ is the decoder and $[.,.]$ denotes concatenation.
Therefore, by expanding the reconstruction term, we obtain, up to a scaling factor and an additive constant, the squared error between the input and the reconstruction: $\mathcal{L}_{\text{rec}} = \sum_{i=1}^N || x_i - d_\theta([c_i, y_i s_i+(1-y_i)s'])||^2_2$. Importantly, \review{as in \cite{weinberger_moment_nodate, abid_contrastive_2019}}, we set the salient latent vectors of background samples to $\textbf{s'}=0$. This choice isolates the background factors of variability in the common space only. 
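For completeness, the step from the Gaussian likelihood to the squared error can be written out as follows, where $D$ denotes the number of pixels of $x$ (a symbol introduced here only for this sketch):
\begin{equation*}
    - \log p_\theta(x | c, s, y) = \frac{1}{2} || x - \mu_{\hat{x}} ||_2^2 + \frac{D}{2} \log(2\pi), \qquad \mu_{\hat{x}} = d_\theta([c, ys+(1-y)s']).
\end{equation*}
The second term does not depend on $\theta$, $\phi_c$ or $\phi_s$, so minimizing the reconstruction term is equivalent to minimizing the squared error, the factor $\frac{1}{2}$ being absorbed into the relative weighting of the losses.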

%\noindent \textbf{Priors regularization: } The prior regularizations correspond to the KL dissimilarity measures between the priors $p_\theta(s | y)$, $p_\theta(c)$, and the latent inferred by the encoders $q_{\phi_c}(c | x) q_{\phi_s}(s | x, y)$. 
% \noindent We assumed that the common and salient prior distributions are independent, and that $y$ does not carry information about $c$ (\textit{ie: } $p_\theta(c, s | y) = p_\theta(c) p_\theta(s | y)$), which is a desirable property. We can divide the regularization term into two different regularizations: the common regularization term, and the salient regularization term.  
\noindent \textbf{Common prior} Assuming  $p(c) \sim \mathcal{N}(0,I)$ and $q_{\phi_c}(c | x) \sim \mathcal{N}(\mu_{\phi_c}(x), \sigma_{\phi_c}(x))$, the KL loss has a closed-form solution, as in standard VAEs. Here, both $\mu_{\phi_c}(x)$ and $\sigma_{\phi_c}(x)$ are the outputs of the encoder $e_{\phi_c}$. \review{This loss is also used in \cite{abid_contrastive_2019,weinberger_moment_nodate}}. % follows a gaussian distribution centered on $0$ with unit variance. we simplify the common regularization into the well-known KL regularization of standard VAEs. In practice, $p_\theta(c)$ is assumed to follow a normal gaussian distribution centered on $0$ with unit variance. 
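For reference, under the additional assumption of a diagonal covariance (the covariance structure is not stated above), and writing $\mu_j$ and $\sigma_j$ for the $j$-th components of the mean and standard deviation predicted by $e_{\phi_c}$, this closed form is the standard VAE expression:
\begin{equation*}
    KL\big(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))\,||\,\mathcal{N}(0, I)\big) = \frac{1}{2} \sum_{j=1}^{D_c} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right).
\end{equation*}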

\noindent \textbf{Salient prior} 
First, we develop $p_\theta(s) = \sum_y p(y) p_\theta(s|y)$, where $p(y)$ follows a Bernoulli distribution with probability equal to $0.5$. % Thus, the salient prior reduces to a formula that only depends on $p_\theta(s|y)$, which is conditioned by the knowledge of the label ($0$: background, $1$: target). 
This allows us to distinguish the salient priors of background samples ($p(s | y=0)$) and target samples ($p(s | y=1)$). Similar to other CA-VAE methods, we assume that $p(s | y=1) \sim \mathcal{N}(0,I)$ and, as in \cite{zou_joint_2022}, that $p(s | y=0) \sim \mathcal{N}(s',\sqrt{\sigma_p} I)$, with $s'=0$ and $\sqrt{\sigma_p} < 1$, namely a Gaussian distribution centered on an informationless reference $s'$ with a small constant variance $\sigma_p$. 
%This choice is particularly convenient for generation purposes since we can directly set the salient part equal to $s'$ when generating new background samples (being $\sqrt{\sigma_p}<1$, it is an admissible fast-to-compute approximation). 
We prefer it to a Dirac delta $\delta(s=s')$ (as in \cite{weinberger_moment_nodate}) because it eases the computation of the KL divergence (i.e., closed form) and 
%works well in practice (as also shown in \cite{zou_joint_2022}).
%Furthermore, this choice 
it also means that we tolerate a small salient variation (e.g., noisy/erroneous diagnosis labels) in the background samples.%  In real applications, in particular medical ones, diagnosis labels can be noisy, and mild pathological patterns may exist in some healthy control subjects. Using such a prior, we tolerate these possible (erroneous) sources of variation.
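As a sketch of this closed form for background samples, assume a diagonal Gaussian posterior and write $\mu_j$ and $\sigma_j$ for the $j$-th components of the mean and standard deviation of $q_{\phi_s}(s|x, y=0)$ (symbols introduced here for illustration). With prior mean $s'=0$ and prior variance $\sigma_p$:
\begin{equation*}
    KL\big(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))\,||\,\mathcal{N}(0, \sigma_p I)\big) = \frac{1}{2} \sum_{j=1}^{D_s} \left( \frac{\mu_j^2 + \sigma_j^2}{\sigma_p} - 1 + \log \sigma_p - \log \sigma_j^2 \right).
\end{equation*}
A smaller $\sigma_p$ thus penalizes more strongly any background salient mean that departs from $s'$, while a Dirac delta prior would make this divergence undefined for a Gaussian posterior.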
% \RL{Furthermore, one could also extend the proposed method to a continuous $y$, for instance, between $0$ and $1$, describing the severity of the disease. Indeed, practitioners could define a function $\sigma_p(y)$ that would map the severity score $y$ to a salient prior standard deviation (\textit{e.g.,} $\sigma_p(y) = y$). In this way, we could extend our framework to the case where pathological variations would follow a continuum from no (or mild) to severe patterns.}

%\RL{Setting $\sigma_p$ as a small constant and since $s'=0$, it simplifies the KL divergence between $q_\phi(s | x, y=0)$ and $p_\theta(s | x, y=0)$ as: $\frac{||\mu_s^{x_i} ||_2^2 + ||\sigma_s^{x_i}||_2^2}{\sigma_p}$. Since, in practice, $\sigma_s^{x_i}$ is frozen to a small quantity $\sigma_q$ (see Sec.~\ref{sec:sampling}), we can ignore it. Eventually, to make it even more explicit that $\sigma_p$ is a tunable hyper-parameter, we call it $\alpha=\frac{1}{\sigma_p}$, which even further simplifies the KL divergence between $q_\phi(s | x, y=0)$ and $p_\theta(s | x, y=0)$ into $\alpha ||\mu_s^{x_i} ||_2^2$. Please note that, by controlling the amount of background variability that can be tolerated in the salient space, $\alpha$ plays the same role as in cPCA \cite{zou_contrastive_2013}. }

%To simplify, we reduce it to an $\alpha$-weighted Mean Squared Error between the mean $\mu_{\phi_s}(x, y=0)$ and $s' = 0$ in the minimization, where $\alpha$ theoretically depends on $\sigma_p$. For simplicity, we consider $\alpha$ as a tunable hyper-parameter (that is inversely proportional to the prior standard deviation $\sigma_p$).


%


%\begin{equation}
 %   \begin{aligned}
        %\mathcal{L}_{\text{priors}} = &  \sum_{i = 1}^N \underbrace{D_{KL}(N(\mu_c^{x_i}, \sigma_c^{x_i}) || N(0, 1))}_{\mathcal{L}_{\text{common prior}}} + \alpha \sum_{i / y_i=0}^N \underbrace{||\mu_s^{x_i} ||_2^2}_{\mathcal{L}^{\text{bg}}_{\text{salient prior}}}  + \sum_{i / y_i=1}^N \underbrace{D_{KL}(N(\mu_s^{x_i}, \sigma_s^{x_i}) || N(0, 1))}_{\mathcal{L}^{\text{tg}}_{\text{salient prior}}}
   %     \label{Regularization loss}
    %\end{aligned}
%\end{equation}
\noindent \textbf{Salient classification}  
The salient prior regularization encourages BG and TG salient factors to match two different Gaussian distributions, both centered at $s'=0$ but with different covariances.
%, and target samples to match a unit-variance gaussian distribution centered on $0$.
%target samples not to be generated using $s'=0$, 
To further reduce the overlap of target and background distributions in the salient space, we propose to minimize a Binary Cross Entropy (BCE) loss that distinguishes target from background samples in the salient space. Assuming that $p(y | s)$ follows a Bernoulli distribution parameterized by $f_\xi(s)$, a two-layer classification neural network, we obtain a BCE loss between true labels $y$ and predicted labels $\hat{y} = f_\xi(s)$. \review{This loss is \textit{not} used in \cite{abid_contrastive_2019,weinberger_moment_nodate}}.
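Written out for a batch of $N$ samples, this loss takes the usual form below. The symbol $\mathcal{L}_{\text{clsf}}$ and the choice of a sum rather than a mean over the batch are our assumptions for illustration; $s_i$ denotes the salient factor of sample $x_i$ and $f_\xi(s_i) \in (0, 1)$ the predicted probability that it belongs to the target set:
\begin{equation*}
    \mathcal{L}_{\text{clsf}} = - \sum_{i=1}^{N} \Big[ y_i \log f_\xi(s_i) + (1 - y_i) \log \big( 1 - f_\xi(s_i) \big) \Big].
\end{equation*}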

\noindent \textbf{Mutual Information}
\label{sec:MI}
To promote independence between $c$ and $s$, we minimize their mutual information, defined as the KL divergence between the joint distribution $q(c, s)$ and the product of their marginals $q(c) q(s)$. However, this quantity is not trivial to compute, and a few tricks are required to correctly estimate and minimize it. As in \cite{abid_contrastive_2019}, it is possible to take inspiration from FactorVAE \cite{kim_disentangling_2019}, which proposes to estimate the density ratio between a joint distribution and the product of the marginals. In our case, we seek to enforce independence between two sets of latent variables rather than between each latent variable of a set. The density-ratio trick \cite{nguyen_estimating_2010, Sugiyama2012DensityratioMU} allows us to estimate the quantity inside the $\log$ in Eq.~\ref{eq:densityratio}. First, we sample from $q(c, s)$ by randomly choosing a batch of images $(x_i, y_i)$ and drawing their latent factors $[c_i, s_i]$ from the encoders $e_{\phi_c}$ and $e_{\phi_s}$. Then, we sample from $q(c) q(s)$ by using the same batch of images and shuffling the latent codes among images (\textit{e.g.}, $[c_1, s_2]$, $[c_2, s_3]$, etc.). Once we have obtained samples from both distributions, we train an \textbf{independent} classifier $D_\lambda([c, s])$ to discriminate the samples drawn from the two distributions by minimizing a BCE loss. The classifier is then used to approximate the ratio in the KL divergence, and we train the encoders $e_{\phi_c}$ and $e_{\phi_s}$ to minimize the resulting loss:
%In order to promote the statistical independence between common and salient patterns, we introduced the KL divergence between $q_{\phi_c, \phi_s}(c, s)$ and $q_{\phi_c}(c) q_{\phi_s}(s)$. Indeed, minimizing this term encourages the parametric posterior estimate to respect $q_{\phi_s, \phi_c}(c, s) = q_{\phi_s}(s,y) q_{\phi_c}(c,y)$. However, estimating this quantity is not trivial and it requires a few tricks that we will describe here. As in \cite{abid_contrastive_2019}, one possibility is to take inspiration from FactorVAE \cite{kim_disentangling_2019}, which proposes to estimate the density-ratio distribution between a joint distribution of latents and the product of its marginals. 
%\noindent Differently from FactorVAE, in our case we want to force independence between two sets of latent variables rather than each latent within a set. First, we propose to estimate the quantity inside the log using the density-ratio trick: \textit{i.e: } we introduce a discriminator $D_\lambda([c, s])$ that is trained to classify the samples drawn from the joint distribution to those sampled from the product of the marginal distributions. Given this classifier, we may use it to estimate the likelihoods of both  distributions: $q_{\phi_s, \phi_c}(c, s) = D_\lambda([c, s])$ and $q_{\phi_s}(s).q_{\phi_c}(c) = 1 - D_\lambda([c, s])$ and reformulate the TC loss as :
%\begin{equation}
%    \mathcal{L_\text{MI}} = KL(q(c, s) || q(c) q(s)) = \mathbb{E}_{q(c, s)} \log \left( \frac{q(c, s)}{q(c) q(s)} \right) \approx \sum_i \text{ReLU} \bigg( \log \bigg(\frac{D_\lambda([c_i, s_i])}{1 - D_\lambda([c_i, s_i])} \bigg) \bigg)
%\end{equation}
\begin{equation}
    \mathcal{L}_{\text{MI}} = \mathbb{E}_{q(c, s)} \log \left( \frac{q(c, s)}{q(c) q(s)} \right) \approx \sum_i \text{ReLU} \bigg( \log \bigg(\frac{D_\lambda([c_i, s_i])}{1 - D_\lambda([c_i, s_i])} \bigg) \bigg)
    \label{eq:densityratio}
\end{equation}
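\noindent To see why the classifier output approximates the density ratio in Eq.~\ref{eq:densityratio}, a short sketch may help. It assumes (our convention, for illustration) that the two sets of samples are balanced and that $D_\lambda$ is trained to output $1$ for samples drawn from $q(c, s)$ and $0$ for samples drawn from $q(c) q(s)$. Under these assumptions, the classifier $D^*$ that minimizes the BCE loss satisfies:
\begin{equation*}
    D^*([c, s]) = \frac{q(c, s)}{q(c, s) + q(c) q(s)} \quad \Longrightarrow \quad \frac{D^*([c, s])}{1 - D^*([c, s])} = \frac{q(c, s)}{q(c) q(s)}.
\end{equation*}
The log-odds of a well-trained classifier, evaluated on samples $[c_i, s_i]$ from the joint distribution, thus estimate the log density ratio, and the ReLU clips negative per-sample estimates to zero.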
% \noindent where the ReLU function forces the estimate of the KL divergence to be positive, thus avoiding to back-propagate wrong estimates of the density ratio due to the simultaneous training of $D_\lambda([c, s])$. 
% Contrarily to \cite{abid_contrastive_2019}, it is important to use an independent optimizer for $D_\lambda$ to ensure that the density ratio is well estimated. 
%In \cite{abid_contrastive_2019}, the discriminator is trained simultaneously with the encoder and decoder. In our work, we use an optimizer estimated independently for $D_\lambda$, and we freeze $D_\lambda$'s parameters when minimizing the MI estimate. 

\section{Experiments}
\noindent \textbf{Evaluation} We evaluate the ability of SepVAE to separate common from target-specific patterns on three medical and one natural (CelebA) imaging datasets. We compare it with the only SOTA CA-VAE methods whose code is available: MM-cVAE \cite{weinberger_moment_nodate} and ConVAE\footnote{ConVAE is implemented with our MI minimization, \textit{i.e.}, with an independently trained discriminator.} \cite{abid_contrastive_2019}, using the same architecture for all models.\\
For quantitative evaluation, we use the fact that the information about some attributes (e.g., glasses/hats in CelebA) should be present either in the common or in the salient space. Once the encoders/decoder are trained, we train a Logistic (or Linear) Regression on the estimated salient and common factors of the training set to predict the attribute presence (or value). Then, we evaluate the classification/regression model on the salient and common factors estimated from a test set. We also report the background (BG) vs target (TG) classification accuracy (Acc.), using the trained classifier for SepVAE and an independently trained classifier (also a two-layer MLP) for the other methods.
%, except for , where salient space predictions are directly estimated by the classifier.

  %evaluate the quality of the representations in two ways (MAE=Mean Absolute Error, B-ACC= Balanced Accuracy.). First, 


%the variability within the target dataset is assessed by fitting Logistic (or Linear) Regression to evaluate if the model captures the target-specific attributes and discards the common variability. Also, 

% For quantitative evaluation, we use the fact that the information about attributes, clinical variables, or subtypes (e.g. glasses/hats in CelebA) should be present either in the common or in the salient space. Once the encoders/decoder are trained, we evaluate the quality of the representations in two steps. First, we train a Logistic (resp. Linear) Regression on the estimated salient and common factors of the training set to predict the attribute presence (resp. attribute value). Then, we evaluate the classification/regression model on the salient and common factors estimated from a test set. By evaluating the performance of the model, we can understand whether the information about the attributes/variables/subtype has been put in the common or salient latent space by the method. Furthermore, we report the background (BG) vs target (TG) classification accuracy. To do so, a 2 layers MLPs is independently trained, except for SepVAE, where salient space predictions are directly estimated by the classifier. 

% In all Tables, for categorical variables, we compute (Balanced) Accuracy scores (=(B-)ACC), or Area-under Curve scores (=AUC) if the target is binary. For continuous variables, we use Mean Average Error (=MAE). Best results are highlighted in bold, second best results are underlined. For CelebA and Pneumonia experiments, mean, and standard deviations are computed on the results of 5 different runs in order to account for model initializations. For neuro-psychiatric experiments, mean and standard deviations are computed using a 5-fold cross-validation evaluation scheme.

%First, the variability within the target dataset is assessed by fitting Logistic (or Linear) Regression to evaluate if the model captures the target-specific variability and discards the common variability. \RL{In the case where common attributes are available, we assess if the common space captures these attributes in the same fashion}. 

% Qualitatively, the model can be evaluated by looking at the full image reconstruction (common+salient factors) and by fixing the salient factors to $s'$ for target images. Comparing full reconstructions with common-only reconstructions allows the user to interpret the patterns encoded in the salient factors $s$ (see Fig.\ref{fig:brats_qualitative_results} and Fig.\ref{fig:celeba_qualitative_results}).

\noindent \textbf{CelebA - glasses vs hat identification: } In the CelebA with attributes dataset \cite{liu_deep_2015}, the target set contains images of celebrities wearing glasses or hats, while background images show no accessories. We used a train set of $20000$ images ($10000$ no accessories, $5000$ glasses, $5000$ hats) and an independent test set of $4000$ images ($2000$ no accessories, $1000$ glasses, $1000$ hats). In Tab.~\ref{table:Celeba_Results} and Fig.~\ref{fig:celeba_qualitative_results}, we demonstrate that we successfully distinguish glasses and hats attributes in the salient space\footnote{Our evaluation process differs from that of \cite{weinberger_moment_nodate}, whose TEST set was used during model training. Here, TRAIN and TEST are correctly separated.}. In Fig.~\ref{fig:Celeba_pca}, we show that SepVAE, differently from MM-cVAE, maximizes the target variance in the salient space while reducing the background variance. Ratios of target to background variances are: MM-cVAE: $\sigma^2(s|y=1) / \sigma^2(s|y=0) = 1.79$; SepVAE: $\sigma^2(s|y=1) / \sigma^2(s|y=0) = 20.31$. More details are in the Supplementary.
% SepVAE has a higher target on background sets variance ratio in the salient space compared to MM-cVAE: 
%\begin{figure}[!tbp]
%    \centering
%        \centering
%        \includegraphics[width=0.45\linewidth]{celeba_qualitative_results.png}
%        \includegraphics[width=0.45\linewidth]{brats_qualitative_results.png} 
%         \caption{SepVAE qualitative example on the CelebA with accessories dataset (BG = no accessories, TG = hats and glasses) and BRATS 2021 dataset \cite{menze_multimodal_2015}. (Middle, common+salient): Full reconstructions using the estimated common and salient factors. (Right, common only): Reconstruction using only the estimated common factors fixing the salient to $s'$. \RL{The salient latent variables capture the accessories (hats and glasses), which are target-specific patterns. The common latents capture the common attributes (e.g., identity, skin color).}
%         }
%        \label{fig:celeba_qualitative_results}
%\end{figure}

% In the CelebA with attributes dataset \cite{liu_deep_2015}, the target set contains images of celebrities wearing glasses or hats while background images show no accessories. 

\begin{figure}
    \begin{minipage}{.59\textwidth}
        \captionof{table}{CA-VAE performance on the CelebA with accessories dataset. Accessories (glasses/hat) information should only be present in the salient space, not in the common space. Average and std are computed over 5 different runs with different parameter initializations.}
        \resizebox{\columnwidth}{!}{
        \begin{tabular}{lccccr}
            \toprule
             & Glasses/Hats Acc & Glasses/Hats Acc & Bg vs Tg AUC & Bg vs Tg AUC \\
             & salient $\uparrow$ & common $\downarrow$ & salient $\uparrow$ & common $\downarrow$ \\
            \midrule
            ConVAE & 82.32$\pm$1.17 & 75.01$\pm$2.52 & 82.46$\pm$0.58 & 78.39$\pm$0.41 \\
            MM-cVAE & \underline{85.17$\pm$0.60} & \underline{73.93$\pm$1.66} & \underline{88.53$\pm$0.39} & \underline{78.03$\pm$0.35}  \\
            SepVAE & \textbf{87.62$\pm$0.75} & \textbf{72.16$\pm$2.02} & \textbf{93.15$\pm$1.65} & \textbf{77.60$\pm$0.20} \\
            \bottomrule
        \end{tabular}
        }
        \label{table:Celeba_Results}
    \end{minipage}
    \hspace{0.01\textwidth}
    \begin{minipage}{0.39\textwidth}
        \centering
        \includegraphics[width=0.48\textwidth]{mm_vae_pca_salient_celeba.png} 
        \includegraphics[width=0.48\linewidth]{dis_vae_pca_salient_celeba.png}
        \caption{PCA projections of MM-cVAE (left) and SepVAE (right) salient space on CelebA TEST set. Yellow: no accessories. \\
        Dark Blue: glasses. Purple: hats. 
        %\PG{il faudrait calculer la variance des données projetées et la distance entre centroides des trois clusters. }
        }
        % PCA projections of MM-c-VAE (left) and SepVAE (right) salient space on CelebA TEST set. Yellow: no accessories. Dark Blue: glasses. Purple: hats. \RL{We can clearly observe that our method maximizes the target variance while reducing the background variance. We attribute this different behaviour to our salient classification loss, which reduces the overlap between background and target salient distributions.
        \label{fig:Celeba_pca}
    \end{minipage}%
\end{figure}

%In Tab.~\ref{table:Celeba_Results}, we quantitatively demonstrate that we successfully distinguish glasses and hats attributes in the salient space.  Reconstruction results are shown in Fig. \ref{fig:celeba_qualitative_results}. In Fig.~\ref{fig:Celeba_pca} SepVAE minimizes the background dataset variance in the salient space (PCA projection shows that yellow points are centered around $s'=0$), as compared with MM-VAE .




\begin{wraptable}{L}{0.55\columnwidth}% automatically uses minimum width
    \vspace{-0.025\textwidth}
    \captionof{table}{CA-VAE methods performance on the Healthy vs Pneumonia X-Ray dataset. Pneumonia subtype information should only be present in the salient space. The lower part shows an ablation study of regularization losses. Average and std are computed over 5 different runs with different parameter initializations.}
    \label{table:Pneumonia_Results}
    \centering
    \resizebox{0.55\columnwidth}{!}{
    \begin{tabular}{lccccr}
        \toprule
         & Subgrp Acc & Subgrp Acc & Bg vs Tg Acc & Bg vs Tg Acc  \\
         & salient $\uparrow$ & common $\downarrow$ & salient $\uparrow$ & common $\downarrow$ \\
        \midrule
        ConVAE \review{:= SepVAE no SAL + CLSF} & 82.30$\pm$1.53 & 73.58$\pm$1.84 & 67.80$\pm$5.93 & 58.05$\pm$7.17 \\
        MM-cVAE & 82.86$\pm$1.87 & 74.35$\pm$3.19 & 70.44$\pm$2.69 & 59.94$\pm$5.88\\
        SepVAE & \textbf{84.78$\pm$0.42} & \textbf{70.92$\pm$1.39} & \textbf{78.13$\pm$3.03} & \underline{57.52$\pm$4.14} \\
        \midrule
        SepVAE no \textbf{MI} & 84.10$\pm$0.48 & 71.79$\pm$2.94 & 75.19$\pm$5.69 & 60.35$\pm$4.73 \\
        SepVAE no \textbf{CLSF} & \underline{84.71$\pm$1.19} & 73.58$\pm$2.19 & 71.91$\pm$4.65 & \textbf{55.79$\pm$5.41}  \\
        SepVAE no \textbf{SAL} & 83.98$\pm$0.85 & 72.61$\pm$2.05 & 73.03$\pm$2.97 & 61.43$\pm$2.25 \\
        \review{SepVAE no \textbf{MI + SAL}} & 81.58$\pm$3.68 & \underline{71.73$\pm$5.17} & 61.24$\pm$3.89 & \textbf{54.33$\pm$5.30} \\
        \review{SepVAE no \textbf{MI + CLSF}} & 84.25$\pm$0.47 & 73.17$\pm$3.15 & 53.10$\pm$1.63 & 57.58$\pm$6.74 \\
        \review{SepVAE no \textbf{MI + SAL + CLSF}} & 81.78$\pm$2.12 & 76.71$\pm$2.10 & 62.87$\pm$7.15 & 59.37$\pm$5.69 \\
        \bottomrule
    \end{tabular}
    }
\end{wraptable}

\vspace{0.025\textwidth}
\noindent \textbf{Pneumonia subgroups:}
From \cite{kermany_identifying_2018}, we used 1342 healthy radiographs (\textit{background}) and 2684 pneumonia radiographs (\textit{target}), divided into two subgroups: viral (1342 samples) and bacterial (1342 samples); see Fig.~\ref{fig:pneumonia_dataset} in the Suppl.
In Tab.~\ref{table:Pneumonia_Results}, we demonstrate that our method can produce a salient space that captures the pathological variability, as it better distinguishes the two subgroups. 

\noindent \underline{Ablation: } In Tab.~\ref{table:Pneumonia_Results}, we also disable different components of the loss to assess their contribution: the full model obtains the best average score on three of the four metrics. Here, no \textbf{MI} means that we removed the Mutual Information loss, no \textbf{CLSF} means that we disabled the Salient Classification loss, and no \textbf{SAL} means that we ignored the Salient Prior loss.
%regularization loss that forces the background samples to align with an informationless vector $\textbf{s'} = 0$ (no Salient Prior).

\begin{table}[!ht]
    \caption{CA-VAE methods performance on the prediction of disorder-specific variables, \textit{i.e.} SANS, SAPS, for schizophrenia disorder (upper table), and ADI-s, ADOS \cite{akshoomoff_role_2006}, for autism disorder (lower table) and common variables (Age, Sex, Site) using only salient factors of test images from the target dataset. MAE=Mean Absolute Error. %separation of healthy and schizophrenia-specific variability experiment.
    }
    \label{table:SZ-HC_results}
    \resizebox{1\columnwidth}{!}{
    \setlength\tabcolsep{1.5pt}
    \begin{sc}
    \begin{tabular}{lcccccccr}
    \toprule
     & Age MAE $\uparrow$ & Sex B-Acc $\downarrow$ & Site B-Acc $\downarrow$ & SANS MAE $\downarrow$ & SAPS MAE $\downarrow$ & Diag AUC $\uparrow$ \\
    \midrule
    ConVAE & \underline{7.46$\pm$0.18} & 72.72$\pm$1.32 & 54.46$\pm$2.46 & \textbf{3.95$\pm$0.28} & 2.76$\pm$0.18 & 58.53$\pm$4.87 \\
    MM-cVAE & 7.10$\pm$0.34 & \textbf{72.15$\pm$2.47} & 56.69$\pm$9.84 & 4.52$\pm$0.33 & 3.16$\pm$0.05 & 70.94$\pm$4.08 \\
    SepVAE & \textbf{7.98$\pm$0.25} & \underline{72.61$\pm$2.19} & \textbf{44.10$\pm$5.78} & \underline{4.14$\pm$0.39} & \textbf{2.60$\pm$0.27} & \textbf{79.15$\pm$3.39} \\
    \bottomrule
    \end{tabular}
    \end{sc}
    }
    \vspace{0.015\columnwidth}
    \label{table:AD-HC_results}
    \centering
    \resizebox{1\columnwidth}{!}{
    \setlength\tabcolsep{1.5pt}
    \centering
    \begin{sc}
    \begin{tabular}{lcccccccr}
    \toprule
     & Age MAE $\uparrow$ & Sex B-Acc $\downarrow$ & Site B-Acc $\downarrow$ & ADOS MAE $\downarrow$ & ADI-s MAE $\downarrow$ & Diag AUC $\uparrow$ \\
    \midrule
    ConVAE & 3.97$\pm$0.19 & 66.67$\pm$1.12 & 40.97$\pm$2.06 &  \underline{10.1$\pm$1.27} & 5.14$\pm$0.17 &  \underline{54.93$\pm$2.04} \\
    MM-cVAE & \underline{3.74$\pm$0.12} & \underline{64.07$\pm$2.58} &  \underline{40.93$\pm$2.66} & 10.5$\pm$2.47 &  \underline{5.09$\pm$0.16} & 54.88$\pm$2.76 \\
    SepVAE & \textbf{4.38$\pm$0.09} & \textbf{59.61$\pm$1.78} & \textbf{33.58$\pm$1.86} & \textbf{8.55$\pm$1.68} & \textbf{4.91$\pm$0.17} & \textbf{59.73$\pm$1.78} \\
    \bottomrule
    \end{tabular}
    \end{sc}
    }
\end{table}

\noindent \textbf{Parsing neuro-anatomical variability in psychiatric diseases} 
% \RL{The task of identifying consistent correlations between neuro-anatomical biomarkers and observed symptoms in psychiatric diseases is important for developing more precise treatment options. Separating the different latent mechanisms that drive neuro-anatomical variability in psychiatric disorders is a challenging task. Contrastive Analysis (CA) methods such as ours have the potential to identify and separate healthy from pathological neuro-anatomical patterns in structural MRIs. This ability could be a key component to push forward the understanding of the mechanisms that underlie the development of psychiatric diseases.} 
Given a background population of Healthy Controls (HC) and a target population suffering from a Mental Disorder (MD), the objective is to capture the pathological factors of variability in the salient space, such as psychiatric and cognitive clinical scores, while isolating in the common space the patterns related to demographic variables, such as age and sex, or acquisition sites. For each experiment, we gather T1w anatomical VBM \cite{ashburner_voxel-based_2000} pre-processed images of HC and MD subjects. We divide them into 5 TRAIN/VAL splits (75\%/25\%) and evaluate the performance of SOTA CA-VAEs in a cross-validation scheme.

%and evaluate each method's performances average performance and associated standard errors.
\noindent \textbf{Schizophrenia: } We merged images of schizophrenic patients (TG) and healthy controls (BG) from the datasets SCHIZCONNECT-VIP \cite{wang_schizconnect_2016} and BSNIP \cite{tamminga_bipolar_2014}. Results in Tab.~\ref{table:SZ-HC_results} show that the salient factors estimated using our method better predict the diagnosis and the SAPS (Scale for the Assessment of Positive Symptoms) score, and rank second on the SANS (Scale for the Assessment of Negative Symptoms) score. On the other hand, salient features are shown to be poorly predictive of demographic variables: age, sex, and acquisition site. This paves the way toward a better understanding of schizophrenia disorder by capturing neuro-anatomical patterns that are predictive of the psychiatric scales while not being biased by confounding variables \cite{barbano_unbiased_2023}. More details are in the Suppl.\\
\noindent \textbf{Autism: } We also combine patients with autism from ABIDE1 and ABIDE2 \cite{heinsfeld_identification_2017} (TG) with healthy controls (BG). In Tab.~\ref{table:AD-HC_results}, SepVAE's salient latents better predict the diagnosis and the clinical variables, such as ADOS (Autism Diagnostic Observation Schedule) and ADI Social (Autism Diagnostic Interview Social), which quantifies social interaction abilities. On the other hand, salient latents poorly infer irrelevant demographic variables (age, sex, and acquisition site). More details are in the Suppl.
\section{Conclusions and Perspectives}
% In this paper, we developed a novel CA-VAE method entitled SepVAE.
%This method echoes the Contrastive Analysis VAEs 
%which aims at separating the common factors of variation between a background dataset and a target dataset, from the ones that only exist in the target dataset. 
Building on Contrastive Analysis methods, we discuss previously proposed regularizations about (1) the matching of target and background distributions in the common space and (2) the overlapping of target and background priors in the salient space. These regularizations may fail to prevent information leakage between common and salient spaces.
We thus propose two alternative solutions: salient discrimination between target and background samples, and mutual information minimization between common and salient spaces. We demonstrate superior performance on one radiological and two neuro-psychiatric applications, where we successfully separate the pathological information of interest (diagnosis, pathological scores) from the ``nuisance'' common variations (e.g., age, site). The development of CA methods offers a large spectrum of perspectives. It could be further extended to multiple target datasets (healthy vs.\ several pathologies) and to other generative models, such as GANs or Diffusion Models, for improved generation quality. Furthermore, generative models could also be coupled with contrastive learning losses to improve the representation quality \cite{dufumier_integrating_2023}.
% Eventually, to be entirely trustworthy, the model must be identifiable, namely, we need to know the conditions that allow us to learn the correct joint distribution over observed and latent variables. We plan to follow \cite{khemakhem_variational_2020, von_kugelgen_self-supervised_2021} to obtain theoretic guarantees of identifiability of our model.
\clearpage
%\newpage
%\mbox{~}

% Acknowledgments---Will not appear in anonymized version
% \midlacknowledgments{We thank a bunch of people.}

\bibliography{midl24_127}

%\newpage

%\printbibliography

\clearpage

\appendix

\section{Context on Variational Auto-Encoders}
Variational Autoencoders (VAEs) are a type of generative model that can be used to learn a compact, continuous latent representation of a dataset. They use an encoder network to map input data points $x$ (\textit{e.g.}, an image) to a latent space $z$, and a decoder network to map points in the latent space back to the original data space. Mathematically, given a dataset $X = \{x_i\}_{i=1}^N$ and a VAE model with encoder $q_\phi(z|x)$ and decoder $p_\theta(x|z)$, the VAE seeks $\phi, \theta$ that maximize a lower bound of the log-likelihood of the input distribution:
\begin{equation}
    \log p_\theta(x) \geq \mathbb{E}_{z \sim q_\phi(z|x)} \log p_\theta(x|z) - \text{KL}(q_\phi(z|x) || p(z))
\end{equation}
\noindent where $p_\theta(x|z)$ is the likelihood of the input space, and $\text{KL}(q_\phi(z|x) || p(z))$ is the Kullback-Leibler divergence between $q_\phi(z|x)$, the approximation of the posterior distribution, and $p(z)$, the prior over the latent space (often chosen to be a standard normal distribution). The first term in the objective function, $\mathbb{E}_{z \sim q_\phi(z|x)}\log p_\theta(x|z)$, is the negative reconstruction error, which measures how well the decoder can reconstruct the input data from the latent representation. The second term, $\text{KL}(q_\phi(z|x) || p(z))$, encourages the encoder distribution to be similar to the prior distribution, which helps to prevent overfitting and encourages the learned latent representation to be continuous and smooth.
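The bound itself follows from the standard decomposition of the log-likelihood, recalled here for completeness, where $p_\theta(z|x)$ denotes the true (intractable) posterior:
\begin{equation*}
    \log p_\theta(x) = \mathbb{E}_{z \sim q_\phi(z|x)} \log p_\theta(x|z) - \text{KL}(q_\phi(z|x) || p(z)) + \text{KL}(q_\phi(z|x) || p_\theta(z|x)).
\end{equation*}
Since the last KL term is non-negative, dropping it yields a lower bound on $\log p_\theta(x)$, which is tight when $q_\phi(z|x)$ matches the true posterior.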

\section{Salient posterior sampling for background samples}
In Sec.~3.3, we motivated the choice of a peaked Gaussian prior for the salient background distribution, with a user-defined $\sigma_p$. This way, the Kullback-Leibler divergence is analytically tractable, as in standard VAEs. \\
To simplify the optimization scheme, we can also set and freeze the standard deviations $\sigma_q^{y=0}$ of the salient space of the background samples. Up to terms that do not depend on the encoder, this reduces the Kullback-Leibler divergence between $q_\phi(s | x, y=0)$ and $p_\theta(s | x, y=0)$ to a $\frac{1}{\sigma_p}$-weighted Mean Squared Error between $\mu_s(x|y=0)$ and $s'$: $\frac{||\mu_s^{x_i|y=0} - s'||_2^2}{\sigma_p}$.
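For completeness, this reduction can be sketched as follows. We assume (for illustration) isotropic covariances, a salient space of dimension $d_s$ (a symbol introduced here), $\sigma_p$ read as the prior variance as in the main text, and $\sigma_q$ as shorthand for the frozen posterior standard deviation $\sigma_q^{y=0}$. With $\mu = \mu_s^{x_i|y=0}$, the Gaussian KL divergence then reads:
\begin{equation*}
    \text{KL}\big(\mathcal{N}(\mu, \sigma_q^2 I) \,||\, \mathcal{N}(s', \sigma_p I)\big) = \frac{1}{2} \left[ \frac{||\mu - s'||_2^2}{\sigma_p} + d_s \frac{\sigma_q^2}{\sigma_p} - d_s - d_s \log \frac{\sigma_q^2}{\sigma_p} \right].
\end{equation*}
When $\sigma_q$ and $\sigma_p$ are fixed, only the first term depends on the encoder parameters, and the factor $\frac{1}{2}$ can be absorbed into the loss weight.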

\noindent In our code, we make this choice as it simplifies the training scheme ($\sigma_q^{y=0}$ does not need to be estimated). In the case where there exists a continuum between healthy and diseased populations, $\sigma_q^{y=0}$ should be estimated.

\noindent Also, the choice of a frozen $\sigma_q^{y=0}$ allows controlling the radius of the classification boundary between background and target samples in the salient space. Indeed, the classifier is fed with samples from the target distributions ($q_{\phi_s}(s|x,y=1) \sim \mathcal{N}(\mu_s(x), \sigma_s(x))$) and background distributions ($q_{\phi_s}(s|x,y=0) \sim \mathcal{N}(\mu_s(x|y=0), \sigma_q)$). This implicitly avoids the overlap of both distributions with a margin proportional to $\sigma_q$. See Fig.~\ref{fig:classification_term} for a visual explanation.

\noindent In real applications, in particular medical ones, diagnosis labels can be noisy, and mild pathological patterns may exist in some healthy control subjects. Using such a prior, we tolerate these possible (erroneous) sources of variation. Furthermore, one could also extend the proposed method to a continuous $y$, for instance, between $0$ and $1$, describing the severity of the disease. Indeed, practitioners could define a function $\sigma_p(y)$ that would map the severity score $y$ to a salient prior standard deviation (\textit{e.g.,} $\sigma_p(y) = y$). In this way, we could extend our framework to the case where pathological variations would follow a continuum from no (or mild) to severe patterns.

\begin{figure}[!ht]
    \centering
    \includegraphics[width=0.99\linewidth]{dis_vae_classification_utility.png} 
    \caption{Illustration of the regularization loss within the salient space. As in MM-cVAE, the prior $q_{\phi_s}(s|x,y=0) \sim s'$ on the background samples (blue) forces their variance to be as small as possible. However, as the prior on target samples (green) follows a normal distribution, target samples may overlap with the background distribution. To avoid this, our method trains a non-linear classifier that keeps the two distributions apart, with a margin proportional to $\sigma_q$.}
    \label{fig:classification_term}
\end{figure}

\section{The effect of matching target and background distributions despite data biases}
\review{In the paper, we pinpointed data imbalance as a possible example of data biases. We argued that 
%This is a fair point. On page 3; Matching background and target common patterns section, we argued that 
forcing the distribution in the common space to be the same across target and background samples may undermine the capture of common factors in the common space since some of them might be put in the salient latent space by the method. \\
%. And thus these undesirable shared patterns may be captured in the salient latent space. 
We have not used a highly imbalanced dataset to demonstrate this behavior, but the effect of data biases can still be observed in 
%we have shown another reA result that demonstrates this behavior can be observed in 
the neuropsychiatric experiments (Tab.~\ref{table:SZ-HC_results_data_biases}). Indeed, in these experiments, the age distribution differs between healthy controls and diseased individuals in the schizophrenia dataset, as shown in \cite{dufumier_manuscript_2023} (Table 2.1: SCHIZCONNECT and BSNIP), which can thus be considered a data bias. As shown in Tab.~\ref{table:SZ-HC_results_data_biases}, matching the healthy and diseased sample distributions in the common space undermines the capture of age-related patterns in the common space of MM-cVAE, but not in SepVAE and ConVAE. Since the reconstruction objective requires the input information to be preserved in the latent space, age-related patterns then emerge in the salient space, even though they belong in the common latent space.}

\begin{table}[!ht]
    \caption{Separation of healthy and schizophrenia-specific variability experiment. Performance of CA-VAE methods on the prediction of the common variable AGE, using only factors of test images from the target dataset. MAE = Mean Absolute Error.}
    \label{table:SZ-HC_results_data_biases}
    \centering
    \resizebox{0.6\columnwidth}{!}{
    \setlength\tabcolsep{1.5pt}
    \begin{sc}
    \begin{tabular}{lcc}
    \toprule
     & Age MAE Salient $\uparrow$  & Age MAE Common $\downarrow$  \\
    \midrule
    ConVAE & 7.46$\pm$0.18 & \underline{6.40$\pm$0.26} \\
    MM-cVAE & 7.10$\pm$0.34 & 6.55$\pm$0.18 \\
    SepVAE & \textbf{7.98$\pm$0.25} & \textbf{6.40$\pm$0.13} \\
    \bottomrule
    \end{tabular}
    \end{sc}
    }
\end{table}

%\clearpage
\section{More details on evaluation}
\noindent \textbf{Evaluation details} Here, we evaluate the ability of SepVAE to separate common from target-specific patterns on three medical and one natural (CelebA) imaging datasets. 

\noindent For quantitative evaluation, we use the fact that the information about attributes, clinical variables, or subtypes (e.g., glasses/hats in CelebA) should be present either in the common or in the salient space. Once the encoders and the decoder are trained, we evaluate the quality of the representations in two steps. First, we train a Logistic (resp. Linear) Regression on the estimated salient and common factors of the training set to predict the attribute presence (resp. attribute value). Then, we evaluate the classification/regression model on the salient and common factors estimated from a test set. The performance of this model indicates whether the method has placed the information about the attributes/variables/subtypes in the common or in the salient latent space. Furthermore, we report the background (BG) vs target (TG) classification accuracy. To do so, a 2-layer MLP is independently trained, except for SepVAE, where salient-space predictions are directly given by its classifier.

\noindent In all tables, for categorical variables, we compute (Balanced) Accuracy scores (=(B-)ACC), or Area Under the Curve scores (=AUC) if the target is binary. For continuous variables, we use the Mean Absolute Error (=MAE). Best results are highlighted in bold, and second-best results are underlined. For the CelebA and Pneumonia experiments, means and standard deviations are computed over the results of 5 different runs in order to account for model initialization. For the neuro-psychiatric experiments, means and standard deviations are computed using a 5-fold cross-validation evaluation scheme.
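
\noindent For reference, two of these scores can be written as follows, assuming the usual definitions are the ones used. The symbols $n$, $K$, $y_i$, $\hat{y}_i$, $\mathrm{TP}_k$ and $\mathrm{FN}_k$ are introduced here for illustration only. For $n$ test samples with true values $y_i$ and predictions $\hat{y}_i$, and for $K$ classes with per-class true positives $\mathrm{TP}_k$ and false negatives $\mathrm{FN}_k$:
\begin{equation*}
    \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|, \qquad \text{B-ACC} = \frac{1}{K} \sum_{k=1}^{K} \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FN}_k} .
\end{equation*}
Balanced accuracy averages the per-class recalls, so a predictor that always outputs the majority class scores $\frac{1}{K}$ regardless of the class proportions.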

\noindent First, the variability within the target dataset is assessed by fitting Logistic (or Linear) Regression to evaluate if the model captures the target-specific variability and discards the common variability. In the case where common attributes are available, we assess if the common space captures these attributes in the same fashion. 

\noindent Qualitatively, the model can be evaluated by looking at the full image reconstruction (common+salient factors) and by fixing the salient factors to $s'$ for target images. Comparing full reconstructions with common-only reconstructions allows the user to interpret the patterns encoded in the salient factors $s$ (see Fig.~\ref{fig:brats_qualitative_results} and Fig.~\ref{fig:celeba_qualitative_results}).

\section{Implementation Details}

\subsection{CelebA glasses and hat versus no accessories}
We used a train set of $20000$ images ($10000$ no accessories, $5000$ glasses, $5000$ hats) and an independent test set of $4000$ images ($2000$ no accessories, $1000$ glasses, $1000$ hats), and ran the experiment $5$ times to account for initialization uncertainty. Images are of size $64 \times 64$; pixels were normalized between $0$ and $1$.
For this experiment, we use a standard encoder architecture composed of 5 convolutions (channels 3, 32, 32, 64, 128, 256), kernel size 4, stride 2, and padding (1, 1, 1, 1, 1). Then, for each predicted mean and standard deviation (common and salient), we used two linear layers going from $256$ to a hidden size of $32$ and then to the (common or salient) latent space size of $16$. The decoder was set symmetrically. We used the same architecture for all the competing methods we evaluated. We used a common and a salient latent space of dimension $16$ each. The learning rate was set to $0.001$ with an Adam optimizer. Oddly, we found that re-instantiating the optimizer at each epoch led to better results (for the competing methods as well); we believe this is because its internal momentum states are reset between epochs. The models were trained for 250 epochs. Note that, in the original contribution, MM-cVAE used a latent dimension of $16$ for the salient space and $6$ for the common space, with a different architecture, but we noticed that this led to artifacts in the reconstructions (see \cite{weinberger_moment_nodate}). Also, we did not succeed in reproducing their performance with their code, their model, and their latent spaces, even with the same experimental setup. We therefore used our model setting, which led to better performance for every method, with a batch size of $512$. We used $\beta_c=0.5$, $\beta_s=0.5$, $\kappa=2$, $\gamma=10^{-10}$, and $\sigma_p=0.025$. For MM-cVAE, we used the same learning rate, $\beta_c=0.5$, $\beta_s=0.5$, a background salient regularization weight of $100$, and a common regularization weight of $1000$.

\begin{figure}
    \centering
    \includegraphics[width=0.7\linewidth]{celeba_dataset_description.png} 
    \caption{CelebA accessories dataset. We used a train set of $20000$ images ($10000$ no accessories, $5000$ glasses, $5000$ hats) and an independent test set of $4000$ images ($2000$ no accessories, $1000$ glasses, $1000$ hats) and ran the experiment $5$ times to account for initialization uncertainty. Images were centered on the face and then resized to $64 \times 64$, pixels were normalized between $0$ and $1$.}
    \label{fig:celeba}
\end{figure}

\subsection{Pneumonia}
Radiographs were selected from a cohort of pediatric patients aged between one and five years old from Guangzhou Women and Children's Medical Center, Guangzhou. Train set images were graded by $2$ expert radiologists, and the independent test set was graded by a third expert to account for label uncertainty. The experiment was run $5$ times to account for initialization uncertainty. Images are of size $64 \times 64$; pixels were normalized between $0$ and $1$.
For this experiment, we use a standard encoder architecture composed of 4 convolutions (channels 3, 32, 32, 32, 256), kernel size 4, and padding (1, 1, 1, 0). Then, for each predicted mean and standard deviation (common and salient), we used two linear layers going from $256$ to a hidden size of $256$ and then to the (common or salient) latent space size of $128$. The decoder was set symmetrically. We used the same architecture for all the competing methods we evaluated. We used a common and a salient latent space of dimension $128$ each. The learning rate was set to $0.001$ with an Adam optimizer. Oddly, we found that re-instantiating the optimizer at each epoch led to better results (for the competing methods as well); we believe this is because its internal momentum states are reset between epochs. The models were trained for 100 epochs with a batch size of $512$. We used $\beta_c=0.5$, $\beta_s=0.1$, $\kappa=2$, $\gamma=5 \times 10^{-10}$, and $\sigma_p=0.05$. For MM-cVAE, we used the same learning rate, $\beta_c=0.5$, $\beta_s=0.1$, a background salient regularization weight of $100$, and a common regularization weight of $1000$.


\begin{figure}[!tbp]
    \centering
    \includegraphics[width=.99\linewidth]{pneumonia.png} 
    \caption{Illustration of the pneumonia dataset. Target images are pneumonia images composed of viral and bacterial pneumonia. Background images are healthy X-Ray images. Original dataset image description from \cite{kermany_identifying_2018}. The dataset is available at \url{https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia}.}
    \label{fig:pneumonia_dataset}
\end{figure}


\subsection{Neuro-psychiatric experiments}
The task of identifying consistent correlations between neuro-anatomical biomarkers and observed symptoms in psychiatric diseases is important for developing more precise treatment options. Separating the different latent mechanisms that drive neuro-anatomical variability in psychiatric disorders is a challenging task. Contrastive Analysis (CA) methods such as ours have the potential to identify and separate healthy from pathological neuro-anatomical patterns in structural MRIs. This ability could be a key component to push forward the understanding of the mechanisms that underlie the development of psychiatric diseases.
As explained in the main text, given a background population of Healthy Controls (HC) and a target population suffering from a Mental Disorder (MD), the objective is to capture the pathological factors of variability, such as psychiatric and cognitive clinical scores, in the salient space, while isolating the patterns related to demographic variables, such as age and sex, or to acquisition sites in the common space. For each experiment, we gather T1w anatomical VBM \cite{ashburner_voxel-based_2000} pre-processed images of HC and MD subjects of size $128 \times 128 \times 128$. We divide them into 5 TRAIN, VAL splits (0.75, 0.25) and evaluate the performance of SOTA CA-VAEs in a cross-validation scheme. Let us note that this is a challenging problem, especially due to the high dimensionality of the input and the scarcity of the data. Notably, the psychiatric and cognitive clinical scores are only available for some patients, which makes them scarce and precious information.

Images are of size $128 \times 128 \times 128$ with voxels normalized on a Gaussian distribution per image. Experiments were run $3$ times with a different train/val/test split to account for initialization and data uncertainty.
For this experiment, we use a standard encoder architecture composed of 5 3D-convolutions (channels 1, 32, 64, 128), kernel size 3, stride 2, and padding 1, followed by batch normalization layers. Then, for each predicted mean and standard deviation (common and salient), we used two linear layers going from $32768$ to a hidden size of $2048$ and then to the (common or salient) latent space size of $128$. The decoder was set symmetrically, except that it has four transposed convolutions (channels 128, 64, 32, 16, 1), kernel size 3, stride 2, and padding 1, followed by batch normalization layers. We used the same architecture for all the competing methods we evaluated. We used a common and a salient latent space of dimension $128$ each. The models were trained for 51 epochs with a batch size of 32 and an Adam optimizer. For the Schizophrenia experiment, for SepVAE, we used a learning rate of $0.00005$, $\beta_c=1$, $\beta_s=0.1$, $\kappa=10$, $\gamma=10^{-8}$, and $\alpha=\frac{1}{0.01}$. For MM-cVAE, we used the same learning rate, $\beta_c=1$, $\beta_s=0.1$, a background salient regularization weight of $100$, and a common regularization weight of $1000$.
For the Autism disorder experiment, we used a learning rate of $0.00002$, $\beta_c=1$, $\beta_s=0.1$, $\kappa=10$, $\gamma=10^{-8}$, and $\sigma_p=0.01$. For MM-cVAE, we used the same learning rate, $\beta_c=1$, $\beta_s=0.1$, a background salient regularization weight of $100$, and a common regularization weight of $1000$.

\section{On Mutual Information Estimation and Minimization}

To promote independence between $c$ and $s$, we minimize their mutual information, defined as the KL divergence between the joint distribution $q(c, s)$ and the product of their marginals $q(c) q(s)$. However, computing this quantity is not trivial, and a few tricks are required to correctly estimate and minimize it. As in \cite{abid_contrastive_2019}, it is possible to take inspiration from FactorVAE \cite{kim_disentangling_2019}, which proposes to estimate the density ratio between a joint distribution and the product of the marginals. In our case, we seek to enforce the independence between two sets of latent variables rather than between each latent variable of a set. The density-ratio trick \cite{nguyen_estimating_2010, Sugiyama2012DensityratioMU} allows us to estimate the quantity inside the $\log$ in Eq.~\ref{eq:densityratio_appendix}. First, we sample from $q(c, s)$ by randomly choosing a batch of images $(x_i, y_i)$ and drawing their latent factors $[c_i, s_i]$ from the encoders $e_{\phi_c}$ and $e_{\phi_s}$. Then, we sample from $q(c) q(s)$ by using the same batch of images and shuffling the latent codes among images (\textit{e.g.}, $[c_1, s_2]$, $[c_2, s_3]$, etc.). Once we have obtained samples from both distributions, we train an \textbf{independent} classifier $D_\lambda([c, s])$ to discriminate the samples drawn from the two distributions by minimizing a BCE loss. The classifier is then used to approximate the ratio in the KL divergence, and we can train the encoders $e_{\phi_c}$ and $e_{\phi_s}$ to minimize the resulting loss:
\begin{equation}
    \mathcal{L}_\text{MI} = \mathbb{E}_{q(c, s)} \log \left( \frac{q(c, s)}{q(c) q(s)} \right) \approx \sum_i \text{ReLU} \bigg( \log \bigg(\frac{D_\lambda([c_i, s_i])}{1 - D_\lambda([c_i, s_i])} \bigg) \bigg)
    \label{eq:densityratio_appendix}
\end{equation}
\noindent where the ReLU function forces the estimate of the KL divergence to be non-negative, which avoids back-propagating wrong estimates of the density ratio caused by the simultaneous training of $D_\lambda([c, s])$. Contrary to \cite{abid_contrastive_2019}, it is important to use an independent optimizer for $D_\lambda$ to ensure that the density ratio is well estimated. The pseudo-code is available in Alg.~\ref{alg:mi}, and a visual explanation is shown in Fig.~\ref{fig:mutual information_term}.
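
\noindent As a brief sketch of why the classifier output yields the density ratio, assume that $D_\lambda$ is trained to optimality on equally many joint samples (label $1$) and shuffled samples (label $0$). The symbol $D^*$ is introduced only for this illustration. The BCE-optimal classifier and its logit would then satisfy
\begin{equation*}
    D^*([c, s]) = \frac{q(c, s)}{q(c, s) + q(c) q(s)} \quad \Longrightarrow \quad \log \frac{D^*([c, s])}{1 - D^*([c, s])} = \log \frac{q(c, s)}{q(c) q(s)} .
\end{equation*}
In practice, $D_\lambda$ only approximates $D^*$, so the averaged logit is an estimate of the mutual information rather than its exact value.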

\begin{figure}[!ht]
    \centering
    \includegraphics[width=0.99\linewidth]{mutual_information_minimization.png} 
    \caption{Illustration of Mutual Information loss between the common and the salient space. Given two images $x_a$ and $x_b$, 4 sets of latents are computed: $c_a$ and $s_a$ latents of the image $a$, $c_b$ and $s_b$ latents of the image $b$. A non-linear MLP is independently trained with a binary cross-entropy loss
    %. On the one hand, it is expected 
    to classify shuffled concatenations (i.e., from different images) with the label $0$
    %. On the other hand, it classifies 
    and concatenations of latents coming from the same image with label $1$. Then, during 
    %standard optimization
    training, the encoders are updated so that the classifier can no longer 
    identify whether a concatenation of latents belongs to class $0$ (shuffled common and salient spaces) or class $1$ (common and salient spaces coming from the same image). We encourage this by minimizing $D_{KL}(q_{\phi_c, \phi_s}(c, s) \,\|\, q_{\phi_c}(c) \times q_{\phi_s}(s))$.}
    \label{fig:mutual information_term}
\end{figure}

\begin{algorithm}[!ht]
   \caption{Minimizing the Mutual Information between common and salient spaces, given a batch of size $B$.}
   \label{alg:mi}
    \begin{algorithmic}[1]
       \STATE {\bfseries Input:} $X \in \mathbb{R}^{B \times(C \times W \times H)} $
       \FOR{$t$ in epochs :}
       \STATE \underline{Discriminator training : }
       \STATE Sample $z=[c, s]$ from $q_{\phi_c, \phi_s}$.
       \STATE Sample $\Bar{z}=[c, \Bar{s}]$ from $q_{\phi_c} \times q_{\phi_s}$ by shuffling $s$ along the batch dimension.
       \STATE Compute $\mathcal{L}_{BCE}= - \log(D(z)) - \log(1-D(\Bar{z}))$
       \STATE Freeze $\phi_c$ and $\phi_s$. Update $D$ parameters only.
       \STATE \underline{Encoders training : }
       \STATE Sample $z=[e_{\phi_c}(x), e_{\phi_s}(x)]$ from $q_{\phi_c, \phi_s}$.
       \STATE Compute $\mathcal{L}_{MI}= \sum_{i=1}^B \text{ReLU} \bigg( \log \frac{D(z_i)}{1 - D(z_i)} \bigg)$
       \STATE Freeze $D$ parameters. Update $\phi_c$ and $\phi_s$.
       \ENDFOR
    \end{algorithmic}
\end{algorithm}


\end{document}
