\documentclass[accepted]{uai2022} % after acceptance, for a revised
% version; also before submission to
% see how the non-anonymous paper
% would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
% Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
% ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.
%\documentclass[twoside]{article}
%\usepackage{aistats2022}

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    %\bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage{algorithm}
\usepackage{algorithmicx}
\usepackage[noend]{algpseudocode}


\usepackage{amsthm}

\usepackage{graphicx}
%\usepackage[hyperfigures=false]{hyperref}
%\usepackage[pdftex, plainpages=false]{hyperref} % For hyperlinks and pdf metadata 
\usepackage{amsmath, amssymb} % For better math
\usepackage{float}
\usepackage{xcolor}

\newcommand{\antti}[1]{{{\color{blue} [Antti: #1]}}}
\newcommand{\vitoria}[1]{{{\color{violet} [Vitoria: #1]}}}
\newcommand{\aapo}[1]{{{\color{magenta} [Aapo: #1]}}}

\newcommand{\indep}{\perp \!\!\! \perp}

\DeclareMathOperator{\Tr}{Tr}

\newcommand{\w}{\mathbf{w}}
\newcommand{\Rb}{\mathbb{R}}
\newcommand{\db}{\mathbf{d}}
\renewcommand{\a}{\mathbf{a}}
\renewcommand{\b}{\mathbf{b}}
\newcommand{\s}{\mathbf{s}}
\renewcommand{\S}{\mathbf{S}}
\newcommand{\e}{\mathbf{e}}
\newcommand{\xaug}{\tilde{\x}}
\newcommand{\xauggen}{\tilde{\tilde{\x}}}
\newcommand{\cb}{\mathbf{c}}
\newcommand{\qtot}{\tilde{q}}
\newcommand{\qmarg}{\bar{q}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\xlat}{\xi}
\newcommand{\logistic}{\phi}
\newcommand{\xlatb}{{\boldsymbol{\xi}}}
\newcommand{\wmarg}{\bar{\mathbf{w}}}
\newcommand{\xpoint}{\mathbf{\bar{u}}}
\newcommand{\xpointind}{\bar{u}}
\newcommand{\z}{\mathbf{z}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\xx}{\tilde{x}}
\newcommand{\yy}{\tilde{y}}
\renewcommand{\ss}{\tilde{s}}
\newcommand{\xxb}{\mathbf{\xx}}
\newcommand{\yyb}{\mathbf{\yy}}
\newcommand{\ssb}{\mathbf{\ss}}
\newcommand{\sest}{z}
\newcommand{\sestb}{\mathbf{\sest}}
\newcommand{\yplain}{y}
\newcommand{\m}{\mathbf{m}}
\newcommand{\h}{\mathbf{h}}
\newcommand{\uu}{\mathbf{u}}
\newcommand{\kk}{\mathbf{k}}
\newcommand{\vb}{\mathbf{v}}
\newcommand{\cc}{\mathbf{c}}
\renewcommand{\u}{\mathbf{u}}
\newcommand{\f}{\mathbf{f}}
\newcommand{\g}{\mathbf{g}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\J}{\mathbf{J}}
\renewcommand{\H}{\mathbf{H}}
\newcommand{\C}{\mathbf{C}}
\newcommand{\I}{\mathbf{I}}
\newcommand{\D}{\mathbf{D}}
\newcommand{\U}{\mathbf{U}}
\newcommand{\M}{\mathbf{M}}
\newcommand{\B}{\mathbf{B}}
\renewcommand{\L}{\mathbf{L}}
\newcommand{\A}{\mathbf{A}}
\newcommand{\V}{\mathbf{V}}
\newcommand{\W}{\mathbf{W}}
\newcommand{\X}{\mathbf{X}}
\newcommand{\Y}{\mathbf{Y}}
\newcommand{\q}{\mathbf{q}}
\newcommand{\Q}{\mathbf{D}_\mathbf{q}}
\newcommand{\mub}{\boldsymbol{\mu}}
\newcommand{\Sigmab}{\boldsymbol{\Sigma}}
\newcommand{\Qb}{\mathbf{Q}}
\newcommand{\n}{\mathbf{n}}
\newcommand{\0}{\mathbf{0}}
\newcommand{\lb}{\mathbf{l}}
\newcommand{\Sb}{\mathbf{S}}

\renewcommand{\texttt}[1]{#1}

\newtheorem{theorem}{Theorem}
\newtheorem{corollary}{Corollary}

%\title{An Approach to Binary Independent Component Analysis via Non-stationarity}

\title{Binary Independent Component Analysis: A  Non-stationarity-based Approach}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is authomatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1,2]{\href{mailto:<antti.hyttinen@helsinki.fi>?Subject=Your UAI 2022 paper}{Antti Hyttinen}{}}
\author[1,2,3]{\href{mailto:<vitoria.barin-pacela@mila.quebec>?Subject=Your UAI 2022 paper}{Vitória Barin-Pacela}{}}
\author[1]{\href{mailto:<aapo.hyvarinen@helsinki.fi>?Subject=Your UAI 2022 paper}{Aapo~Hyv\"arinen}{}}
% Add affiliations after the authors
\affil[1]{%
    Department of Computer Science\\
University of Helsinki\\
   Helsinki, Finland
}
\affil[2]{%
    Helsinki Institute for Information Technology, Finland
}
\affil[3]{%
    Mila\\
    Universit\'e de Montr\'eal\\
    Montr\'eal, Canada
  }

\begin{document}
\maketitle

\begin{abstract}
We consider independent component analysis of binary data. While fundamental in practice, this case has been much less developed than ICA for continuous data. We start by assuming a linear mixing model in a continuous-valued latent space, followed by a binary observation model. Importantly, 
we assume that the sources are non-stationary; 
this is necessary since any non-Gaussianity would essentially be destroyed by the binarization.
Interestingly, the model allows for a closed-form likelihood by employing the cumulative distribution function of the multivariate Gaussian distribution. In stark contrast to the continuous-valued case, we prove non-identifiability of the model with few observed variables; our empirical results suggest identifiability when the number of observed variables is higher. We present a practical method for binary ICA that uses only pairwise marginals, which are faster to compute than the full multivariate likelihood. 
Experiments give insight into the requirements for the number of observed variables, segments, and latent sources that allow the model to be estimated.
\end{abstract}

\section{Introduction}

Despite significant progress in both linear and nonlinear ICA in recent years~\citep{tcl, hyvarinen19, Khemakhem2019}, ICA for binary data remains a challenging and important problem as binary data is abundant in various fields, such as bioinformatics, health informatics, social sciences, natural language, and electrical engineering. An ICA model for binary data may also open new opportunities in solving problems closely related to ICA, such as
causal discovery~\citep{Shimizu06JMLR} and feature extraction~\citep{tcl}.
%in such applications. 

Methods for binary ICA have been proposed based on either binary or continuous-valued independent components. In the case of binary components, \citet{Himberg01} and \citet{nguyen10} assumed an OR mixture model.  
In addition, some extensions of Latent Dirichlet Allocation can be seen as binary ICA \citep{Podosinnikova15, Buntine05}.
On the other hand, \citet{kaban06} presented an approach based on a latent linear model and binarized observations, although the components were restricted to the unit interval, which limits its applicability.
Recently, \citet{Khemakhem2019} presented a nonlinear ICA model (iVAE) that can employ binarized observations, making several contributions that we can build on.

Our goal is to study the prospects of ICA for binary data using a model that is both theoretically analyzable and intuitively appealing. 
It is crucial to investigate the identifiability of such a model, and to have a consistent estimator that does not rely on approximations whose validity is unclear.
None of the approaches above fulfills all of these criteria.\footnote{As noted in the Corrigendum of \citet{Khemakhem2019} (v4 on arXiv), their initial identifiability proof for a discrete non-linear ICA model is incorrect.}


We propose a binary ICA model  
inspired by recent developments in nonlinear ICA.
We formulate a latent linear model with a separate binarizing measurement equation. 
Crucially, we assume the components to be non-stationary, which is a powerful principle and very useful here because any non-Gaussianity (commonly employed in ICA) may be destroyed by binarization.
Thus, we obtain a binary ICA model whose likelihood can be described in closed form
via the multivariate Gaussian cumulative distribution function. We further propose to combine the likelihood with a moment-matching approach to obtain a fast and accurate estimation algorithm. In fact, due to the model structure, pairwise marginal distributions of non-binarized data can be accurately estimated from the binary data and the likelihood can be computed directly from them. We investigate the identifiability of the model, and somewhat surprisingly, we show that low-dimensional models are in fact non-identifiable---while higher-dimensional models are (empirically) shown to be identifiable. 

\section{A MODEL FOR BINARY ICA}
\label{model}

In this section, we define a binary counterpart of the linear ICA model. In particular, we consider here a model based on non-stationarity of the components, and start by motivating such an approach.

\subsection{The approach of non-stationarity}

While non-stationarity is often considered a nuisance, it is well known in the theory of ICA that a suitable non-stationarity of the independent components can be very useful. \citet{Pham01} already used it in the case of linear ICA, and \citet{tcl} extended the idea to nonlinear ICA. Note that the mixing is assumed stationary, and the non-stationarity is a statistical property of the components only.

In line with such literature, we assume the $n$-dimensional data is divided into $n_u$ segments which express the non-stationarity, i.e.\ the segments have different distributions. In the case of time series, we may be able to find such segmentation simply by taking time bins of equal sizes. Such non-stationarity based on a segment-wise (piece-wise stationary) model is well-known in linear ICA \citep{Pham01,JSSv076i02}. Formally, each data point has a segment index $u$ assigned to it. 

In fact, this setting is more general, and it is not necessary to have time series. The additionally ``observed'' variable $u$ makes the non-stationarity a special case of the auxiliary variable framework of \citet{Khemakhem2019}. 
It is thus not only natural in the case of non-stationary time series, but also when there is any other external discrete variable, such as the experimental condition or intervention, or even a class label that modulates the distribution of the data.


The motivation for such a non-stationary model is that it can greatly extend the identifiability of ICA. 
Linear ICA is identifiable if the components are simply non-Gaussian, which is why the utility of non-stationarity in that context has always been dubious and such algorithms are rarely used. However, in the case of \textit{non}linear ICA, non-Gaussianity does not enable identifiability, which may be intuitively clear since a nonlinear transformation can change the marginal distributions quite arbitrarily from non-Gaussian to Gaussian or vice versa. A major advance was in fact obtained by \citet{tcl}, who showed that non-stationarity does enable identifiability in the nonlinear case.

Here, we propose that using non-stationarity of the components is very useful in the case of binary data as well. Again, intuitively, non-Gaussianity is likely to be rather useless since the binarization destroys any detail about the non-Gaussianity of the distributions, and such a model would be unlikely to be identifiable. However, non-stationarity is \textit{not} destroyed by binarization. Thus, binary ICA can be estimated based on non-stationarity of the components, as we will show later in this paper.


\subsection{Formal model definition}

To define the model in detail, we assume the $n$-dimensional data is generated from $n_z$ latent variables (independent components, or sources), collected into a latent random vector $\mathbf{z}^u$, which are generated independently of each other from a Gaussian distribution. Crucially, the parameters of the Gaussian distribution change as a function of the segment as
$$
\mathbf{z}^u \sim \mathcal{N}(\mub_\z^{u},\Sigmab_\z^{u})
$$
where $\Sigmab^u_\z$ is a diagonal matrix of the source
variances in segment $u$. 


We define ``intermediate'' variables $\mathbf{y}^u$ which are a linear mixing of the sources by a mixing matrix $\A$ with $n$ rows and $n_z$ linearly independent columns
\begin{eqnarray}
\mathbf{y}^u = \mathbf{A} \mathbf{z}^u \sim \mathcal{N}(\A \mub_\z^{u}, \ \A \Sigmab_\z^{u} \A^{\intercal}). \label{eq:mixing}
\end{eqnarray}
Here the mixing matrix $\A$ is constant, i.e., stationary, over the segments $u$ \citep{Pham01}.


While some work in ICA considers noisy continuous observations by adding noise to $\mathbf{y}^u$, we can consider here binarized observations $\mathbf{x}^u$ instead.
The binarization is done using a linking function $\sigma$ so that the probability of the $i$th element of $\mathbf{x}^u$ being 1 is:
$$
P(x_i^u=1) = \sigma(y_i^u).
$$

We use a linking function based on the Gaussian CDF (cumulative distribution function):
$$
\sigma(y_i^u) = \Phi \left(\sqrt{\frac{\pi}{8}} y_i^u \big| 0,1 \right)
$$
where $\Phi$ is the cumulative distribution function of the Gaussian distribution, here with mean $0$ and variance $1$. We use $\sqrt{\pi/8}$ as the coefficient to match closely to the  sigmoid function $\sigma(y_i) = \frac{1}{1+e^{-y_i}}$~\citep{waissi,sigmoidtrick}, which is standardly used in statistics and machine learning in similar linking contexts.
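
To see where this coefficient comes from, one can match the slopes of the two functions at the origin (a short sketch of the standard argument; the scaling constant $\lambda$ is introduced here only for illustration). The logistic function has derivative $1/4$ at $0$, while $\Phi(\lambda y \,\vert\, 0,1)$ has derivative $\lambda/\sqrt{2\pi}$ there, so equating the two gives
\begin{eqnarray*}
\frac{\lambda}{\sqrt{2\pi}} = \frac{1}{4}
\quad \Longrightarrow \quad
\lambda = \frac{\sqrt{2\pi}}{4} = \sqrt{\frac{\pi}{8}} \approx 0.627.
\end{eqnarray*}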

The model itself could accommodate coefficients other than $\sqrt{\pi/8}$, but our estimation methods assume that the linking function has this particular form. The motivation is to allow for closed-form expressions of the Gaussian integrals involved in Section~3 in terms of the Gaussian CDF. The difference from the logistic function is very small, while the methods are much simpler with this linking function. In fact, our ICA model allows for a closed-form likelihood with this particular linking function (Section~3), which would be difficult to achieve with a logistic linking function.

Furthermore, the linking function has the following intuitive interpretation. Take $y_i^u$, add independent noise $\epsilon$ from $\mathcal{N}(0, \frac{8}{\pi})$, and binarize the sum $y_i^u + \epsilon$ by a hard threshold at 0 to get $x_i^u$. This gives the same distribution for $x_i^u$, since the probabilities match:
\begin{eqnarray*}
P(x_i^u=1) = P( y_i^u + \epsilon > 0 ) = P(\epsilon > -y_i^u)\\
= \int_{-y_i^u}^{\infty} \mathcal{N}\left( \epsilon \big| 0, \frac{8}{\pi} \right) d \epsilon
=\Phi\left(\sqrt{\frac{\pi}{8}} y_i^u \big\vert 0,1 \right).
\end{eqnarray*}

A binary ICA model  $\mathcal{M}=(\A,\{\mub_\z^u\}_u,\{\Sigmab_\z^u\}_u)$ thus consists of the following parameters: the mixing matrix $\A$, the means $\mub_\z^u$ and the diagonal (co)variance matrices $\Sigmab_\z^u$ for all segments $u$, denoted by $\{\mub_\z^u\}_u$ and  $\{\Sigmab_\z^u\}_u$. Consequently, it defines a distribution for a binary vector $\x^u$ in each segment indexed by $u$.

\section{THE LIKELIHOOD}

A surprising observation regarding the latent variable model defined in Section~2 is that we can calculate the likelihood in closed form by employing the multivariate Gaussian CDF. For example, the model
 defines the probability of the data vector of all ones, denoted by $\mathbf{1}$, as:
\begin{eqnarray}
&& P(\x^u=\mathbf{1}|\mathcal{M})
=\int P(\x^u=\mathbf{1}|\y^u) P(\y^u|\mathcal{M})d \y^u \nonumber\\
   && =  \int\Phi \left(\sqrt{\frac{\pi}{8}} \y^u \vert \mathbf{0}, \I \right) \mathcal{N}(\y^u\vert \A \mub_\z^{u}, \A \Sigmab_\z^{u} \A^{\intercal})d \y^u \nonumber 
\end{eqnarray}
where the univariate Gaussian CDFs are written as a multivariate Gaussian CDF $\Phi$ with an identity covariance matrix. 
The benefit of using a Gaussian CDF-based linking function comes into play here, as the value of the integral is directly a value of a multivariate Gaussian CDF \citep{waissi,sigmoidtrick}: The above formula actually specifies the probability of first drawing $\y^u$, multiplying it by $\sqrt{\pi/8}$, and then, independently, drawing a standard Gaussian variable $\n \sim \mathcal{N}(\0,\I)$ that is element-wise smaller. We therefore have:
\begin{eqnarray*}
P(\x^u=\mathbf{1}|\mathcal{M}) 
=P\left( \n - \sqrt{\frac{\pi}{8}} \y^u  < \mathbf{0} \right) 
\end{eqnarray*}
This motivates us to define a random vector $\q^u$, an important construct in the following developments, as:
\begin{eqnarray}
\q^u&=& \n -\sqrt{\frac{\pi}{8}}\y^u, \label{eq:q}
\end{eqnarray}
which is simply a noisy, re-scaled and sign-flipped version of the linear mixture $\mathbf{y}^u$. 
In fact, since $\q^u$ is the sum of two independent Gaussian random vectors, it also has a Gaussian distribution $\q^u  \sim  \mathcal{N}\left(\mub_{ \q}^u , \Sigmab_{ \q}^u \right)$ with:
\begin{eqnarray}
\mub_{\q}^u&=& -\sqrt{\frac{\pi}{8}} \A \mub_\z^u, \label{eq:muq}\\
\Sigmab_{ \q}^u &=& \I + \frac{\pi}{8} \A \Sigmab_\z^u \A^{\intercal}. \label{eq:sigma} \label{eq:sigmaq}
\end{eqnarray}
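Both moments follow from the linearity of expectation and the independence of $\n$ and $\y^u$; as a brief check, using Equation~\ref{eq:mixing},
\begin{eqnarray*}
E[\q^u] &=& E[\n] - \sqrt{\frac{\pi}{8}}\, E[\y^u] = \0 - \sqrt{\frac{\pi}{8}}\, \A \mub_\z^u,\\
\mathrm{Cov}[\q^u] &=& \mathrm{Cov}[\n] + \frac{\pi}{8}\, \mathrm{Cov}[\y^u] = \I + \frac{\pi}{8}\, \A \Sigmab_\z^u \A^{\intercal}.
\end{eqnarray*}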
The probability of the data vector of ones in segment $u$ is, then:
\begin{eqnarray}
P(\x^u=\mathbf{1}|\mathcal{M}) = P\left( \q^u < \mathbf{0} \right) =
\Phi\left(\mathbf{0} \vert\mub_{ \q}^u, \Sigmab_{ \q}^u \right), \label{eq:prob}
\end{eqnarray}
where the cumulative distribution function of the multivariate Gaussian $\Phi$ has all variables integrated from $-\infty$ to $0$; it is readily implemented in basic packages~\citep{mvnorm}.


Similar derivation gives the probabilities for other assignments to $\x^u$. These probabilities can be expressed compactly for all value assignments as:
\begin{equation}
    P(\x^u \vert \mathcal{M}) = \Phi \left(l(\x^u ), u(\x^u ) \vert \mub_{ \q}^u, \Sigmab_{ \q}^u \right) \label{eq:probxu}
\end{equation}
in which the multivariate Gaussian probability density function is integrated from the lower bound $l(\x^{u}) $ to the upper bound $u(\x^{u})$, with the $i$th elements in the bounds defined by:
\begin{eqnarray*}
l(\x^u)[i] = \begin{cases} -\infty & \text{if } x^u_i=1, \\
0 & \text{otherwise,}
\end{cases}
\qquad
u(\x^u)[i] = \begin{cases} 0 & \text{if } x^u_i=1, \\
\infty & \text{otherwise.}
\end{cases}
\end{eqnarray*}
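
As an illustration, consider a hypothetical two-dimensional example (the parameter values are chosen here only to keep the arithmetic simple and are not taken from the experiments): let $\mub_\z^u = \0$, $\Sigmab_\z^u = \frac{8}{\pi}\I$, and let $\A$ have rows $(1,0)$ and $(1,1)$. Equations~\ref{eq:muq} and~\ref{eq:sigmaq} then give $\mub_\q^u = \0$ and
\begin{eqnarray*}
\Sigmab_\q^u = \I + \A \A^{\intercal} = \left( \begin{array}{cc}
2 & 1\\
1 & 3
\end{array}\right), \qquad
\rho = \frac{1}{\sqrt{2 \cdot 3}} = \frac{1}{\sqrt{6}},
\end{eqnarray*}
where $\rho$ denotes the correlation between the two elements of $\q^u$. For a zero-mean bivariate Gaussian, the quadrant probabilities are given by the classical orthant formula, so that
\begin{eqnarray*}
P(\x^u=(1,1)) = P(\x^u=(0,0)) &=& \frac{1}{4} + \frac{\arcsin \rho}{2\pi} \approx 0.317,\\
P(\x^u=(1,0)) = P(\x^u=(0,1)) &=& \frac{1}{4} - \frac{\arcsin \rho}{2\pi} \approx 0.183.
\end{eqnarray*}
With nonzero means, no such elementary expression is available in general, and the probabilities are evaluated numerically.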

\begin{figure}
    \centering
    \includegraphics[scale=0.25,trim={16cm 9cm 16cm 12cm},clip]{Rplot.jpeg}
\caption{Binary ICA model for two observed variables and three segments. For each segment, there is a bivariate Gaussian distribution on $\q^u$; the probability of an assignment to the binary observed variables is the probability mass in the corresponding quadrant.\label{fig:intuition}}
\end{figure}


Importantly, this formulation allows for a particularly clear intuitive interpretation of the model. Figure~\ref{fig:intuition} shows this for two observed variables and three segments. For each segment, the model defines a bivariate Gaussian distribution for $\q^u$, depicted by colors and contours on the planes. The probability of an assignment to the observed binary variables $\x^u$ in a segment is simply the probability mass in the corresponding quadrant. The multivariate Gaussian distributions for $\q^u$ in the different segments are related in the sense that they are formed by the same mixing matrix acting on independent sources particular to each segment.

The log-likelihood of the whole data set can then be calculated as

\begin{eqnarray}
l
&=&
\sum_u
\sum_{\x^u} \hspace{-1mm} c(\x^u)\log \Phi(l(\x^u), u(\x^u) \vert \mub_{ \q}^u, \Sigmab_{\q}^u), \label{qdistr} \label{eq:likelihood}
 \end{eqnarray}
 where $c(\x^u)$ is the count of the data points with assignment $\x^u$ in a segment $u$
 and the sum is taken over all assignments to $\x^u$ and $u$.

\section{ON IDENTIFIABILITY}

Many ICA models can only be identified up to scaling and permutation indeterminacies of the sources~\citep{ICAbook,Khemakhem2019}. Straightforwardly we can see that those limitations apply for our model as well. By re-ordering columns of the mixing matrix and the sources, the implied distribution is unaffected; similarly, we can counteract the scaling (or sign-flip) of the mixing matrix columns by scaling (or sign-flipping) the sources. However, binarization actually induces additional indeterminacies as we will show next. 

\subsection{The Binarization Indeterminacy}

Recall that the probability of an assignment to binary $\x^u$ is given by the probability of the Gaussian $\q^u$ landing in different regions (Equation~\ref{eq:prob}).
But note that the probability in Equation~\ref{eq:prob} stays exactly the same even if $\q^u$ is multiplied by a diagonal matrix $\Qb^u$, possibly different for each segment $u$, with positive entries (scaling factors) on the diagonal:
\begin{eqnarray*}
P\left( \q^u  < \mathbf{0} \right) &=&P\left( \Qb^u  \q^u < \mathbf{0} \right).
\end{eqnarray*}
This is valid even if the elementwise operator is $>$ or a mixture of $>$ and $<$.\footnote{For the probability of $\x^u$ being all ones, any permutation matrix $\Qb^u$ would similarly preserve the implied probability, but the probability of some other assignment for $\x$ (each of which corresponds to some mixture of $>$ and $<$) may change then.}
Figure~\ref{fig:indet} shows an example of this equivalence relation for one segment and two observed variables. The two Gaussian distributions for $\q^u$ represented by the blue and red contours imply the exact same joint distribution for binary observed variables $\x^u$. The amount of mass in each of the 4 quadrants is exactly the same.  This means that we essentially lose all scale information on $\q^u$ in the binarization.

Then, two binary ICA models $\mathcal{M}=(\A,\{\mub_\z^u\}_u,\{\Sigmab_\z^u\}_u)$ and $\hat{\mathcal{M}}=(\hat{\A},\{\hat{\mub}_\z^u\}_u,\{\hat{\Sigmab}_\z^u\}_u)$ are indistinguishable if there are positive diagonal matrices $\{\Qb^u\}_u$ such that for each segment $u$, the means and covariances of $\q^u$ satisfy: 
\begin{eqnarray}
 \hat{\mub}_{\q}^u &=& \Qb^u \mub_{\q}^u \label{eq_1}, \\
\hat{\Sigmab}_{ \q}^u  &=& \Qb^u\Sigmab_{\q}^u \Qb^u,
\label{eq_2}
\end{eqnarray}
which can be written more clearly using the model parameters (Equations~\ref{eq:muq}  and~\ref{eq:sigmaq}) as:
\begin{eqnarray}
    \sqrt{\dfrac{\pi}{8}}\hat{\A} \hat{\mub}_\z^u&=& \Qb^u \sqrt{\dfrac{\pi}{8}} \A \mub_\z^u,
\label{eq_1v} \label{arithmetic1}\\
    \I + \dfrac{\pi}{8} \hat{\A} \hat{\Sigmab}_\z^u (\hat{\A})^\intercal &=& \Qb^u(\I + \dfrac{\pi}{8} \A \Sigmab_\z^u \A^\intercal )\Qb^u.
\label{eq_2v} \label{arithmetic2}
\end{eqnarray}
This limits identifiability possibilities (Section~4.2) but nevertheless also allows for the development of efficient estimation procedures in Sections~4.3 and~5.2. 
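
One convenient representative of each equivalence class is obtained by standardization; the short calculation below is meant only as an illustration. Since $\Qb^u \q^u \sim \mathcal{N}(\Qb^u \mub_\q^u, \Qb^u \Sigmab_\q^u \Qb^u)$, choosing the diagonal entries $\Qb^u[i,i] = 1/\sqrt{\Sigmab_\q^u[i,i]}$ gives a Gaussian with mean and covariance
\begin{eqnarray*}
(\Qb^u \mub_\q^u)[i] &=& \frac{\mub_\q^u[i]}{\sqrt{\Sigmab_\q^u[i,i]}},\\
(\Qb^u \Sigmab_\q^u \Qb^u)[i,j] &=& \frac{\Sigmab_\q^u[i,j]}{\sqrt{\Sigmab_\q^u[i,i]\, \Sigmab_\q^u[j,j]}},
\end{eqnarray*}
that is, a Gaussian with unit variances whose covariance matrix is the correlation matrix of $\q^u$, and which implies exactly the same distribution for $\x^u$. For instance, a covariance matrix with rows $(2,1)$ and $(1,3)$ is equivalent in this sense to a correlation matrix with off-diagonal entries $1/\sqrt{6} \approx 0.41$.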


\begin{figure}
    \centering
    \includegraphics[scale=0.50]{scalingfactorplot.pdf}
\caption{Two Gaussian distributions (red and blue) for a two dimensional $\q^u$ which imply the same binary distributions after binarization by the linking function. That is because the mass of both distributions in each of the 4 quadrants is identical.\label{fig:indet} }
\end{figure}

\subsection{The Row Order Indeterminacy}

One of the consequences of the binarization indeterminacy is the following non-identifiability result concerning $n=2$ observed variables, proven in Appendix A in the supplement.
\begin{theorem}
If the row order of the 2-by-2 mixing matrix $\A$ of a binary ICA model is reversed, then the source means $\mub^u_\z$ and variances $\Sigmab^u_\z$ can be adjusted such that the implied distributions for the observed binary $\x^u$ remain identical.
\end{theorem}

This means that in addition to column order and scale, we also have row order indeterminacy here. Although the result may generalize to certain sparse higher dimensional models, fortunately, it does not jeopardize the estimation of higher dimensional models in general. 

This result does have consequences for 
causal discovery~\citep{Shimizu06JMLR,suzuki2021causal,peters2011causal,inazumi2014causal}.
Consider two structural equation models, implying opposite causal directions:
$$
\y^u= \left( \begin{array}{cc}
0 & 0\\
b & 0
\end{array}\right)\y^u + \z^u, \quad \y^u= \left( \begin{array}{cc}
0 & b\\
0 & 0
\end{array}\right)\y^u + \z^u,
$$
where $\z^u$ has a Gaussian distribution in each segment $u$ with diagonal covariance matrix $\Sigmab_\z^u$.  The models 
correspond respectively to the mixing models (compare to Equation~\ref{eq:mixing}):
$$
\y^u=\left( \begin{array}{cc}
1 & 0\\
b & 1
\end{array}\right)\z^u, \quad \y^u=\left( \begin{array}{cc}
1 & b\\
0 & 1
\end{array}\right)\z^u.
$$
If we observe only the binarized $\y^u$, i.e.\ $\x^u$, we can at most identify the mixing matrix up to row order, column order and column scale.
By switching the column order and then the row order of the mixing matrix on the left, we get the mixing matrix on the right. Thus, unlike in the continuous case, we cannot detect the causal direction between two variables without further limiting assumptions or information on other variables.
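
The two swaps can be verified directly. In the short calculation below, $\mathbf{P}$ denotes the 2-by-2 permutation matrix with rows $(0,1)$ and $(1,0)$ (notation introduced here only for this check); right-multiplication by $\mathbf{P}$ exchanges the columns and left-multiplication exchanges the rows:
\begin{eqnarray*}
\left( \begin{array}{cc}
1 & 0\\
b & 1
\end{array}\right) \mathbf{P} &=& \left( \begin{array}{cc}
0 & 1\\
1 & b
\end{array}\right),\\
\mathbf{P} \left( \begin{array}{cc}
0 & 1\\
1 & b
\end{array}\right) &=& \left( \begin{array}{cc}
1 & b\\
0 & 1
\end{array}\right).
\end{eqnarray*}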

\subsection{The Correlation Identifiability}

Note that the indistinguishable models satisfying Equation~\ref{eq_2} or Equation~\ref{eq_2v} have equal \emph{correlation matrices} (i.e.\ matrices of Pearson correlation coefficients) for the random variables $\q^u$. 
The next theorem and corollary show that the correlations between elements of $\q^u$ are indeed theoretically identifiable from the distributions of the binary observed variables $\x^u$. Intuitively, the higher the correlation, the more likely the pair of binary observed variables in $\x^u$ is to receive equal assignments. The fairly technical proof is given in Appendix B in the supplement.

\begin{theorem}
Two binary ICA models imply different distributions for binary observations $\x^u$ (in a given segment $u$) if the correlation matrices for $\q^u$ are not equal.
\end{theorem}

This result is crucial for the development of our novel estimation method (Section~5.2), via the corollary:

\begin{corollary} \label{corollary1}
The correlation matrix of $\q^u$ in a given segment $u$ is identifiable from the distribution for binary $\x^u$.
\end{corollary}

On the other hand, the following theorem recaps the well-known result \citep{ICAbook,Pham01} that the means do not help in estimating the mixing matrix  (proven in Appendix~C):
\begin{theorem}
If two models
$\mathcal{M}$
and $\hat{\mathcal{M}}$ with $n=n_z$
imply the same correlation matrices for $\q^u$ (in a given segment)
 then the means $\mub_\z^u$ can be adjusted such that the implied binary distributions are identical. \label{thm:means}
\end{theorem}

\section{METHODS FOR BINARY ICA}

Next, we present three methods for estimating the binary ICA model, building on the theory in Sections~3 and~4. The \texttt{BLICA} method of Section~5.2 is the main novel algorithmic contribution of the paper.

\subsection{Maximum Likelihood Estimation}

We have already derived the likelihood of the binary ICA model in Equation~\ref{eq:likelihood}. A straightforward approach is then to optimize this using e.g.\ L-BFGS~\citep{lbfgs}. The gradient involves the moments of the \emph{truncated} multivariate Gaussian distribution,
which can be obtained from the R package \texttt{tmvtnorm}~\citep{tmvtnorm}.
Variances and scaling factors can be kept positive by using the log-exp transform. Unfortunately, the computation of the likelihood and its gradient can only be done for small models in practice, because the evaluation of the multivariate Gaussian CDF is time-consuming and requires sampling-based approximations. Our experiments refer to this as \texttt{full MLE}.

\subsection{The \texttt{BLICA} Method}



However, we can circumvent the computational burden of the high-dimensional Gaussian cumulative distribution function.
Due to the theory in Section~4, the correlations of $\q^u$ convey the essential information between the binary data and the continuous mixing model. Since the marginalization properties of our model are inherited from the multivariate Gaussian, such correlations can be estimated from \emph{pairwise} marginal distributions of elements of $\x^u$; in 2D the Gaussian cumulative distribution function is still quite quick to compute.
Thus, we combine maximum likelihood estimation with what could be called a ``moment-matching'' approach as follows.
We first recover the pairwise correlations of the continuous-valued $\q^u$ from the observed binary data on $\x^u$ (this is possible by Corollary~\ref{corollary1}) via MLE in 2D. Then we fit those correlations to the correlations implied by the latent linear mixing model using a more scalable MLE in the continuous-valued latent space. The resulting algorithm is summarized as Algorithm~\ref{alg:BLICA} and explained in detail below.

\textbf{Correlation estimation.}
On line 4, we estimate each correlation between elements of $\q^u$ separately, by directly fitting the likelihood in Equation~\ref{qdistr} in two dimensions, thus estimating $\mub_\q^u$ and $\Sigmab_\q^u$.
To calculate the multivariate Gaussian CDF, we use the R package \texttt{mvtnorm}~\citep{mvnorm}.
We employ the \texttt{GenzBretz} method, which is particularly suitable for the fast evaluation needed here~\citep{genz1993comparison}. Furthermore, the estimation can be simplified~\citep{lee}. Due to Equation~\ref{arithmetic2}, the diagonal of $\Sigmab^u_\q$ can be set to ones in this step. 
Furthermore, since the marginal of $x_i^u$ is
\begin{eqnarray}
P(x_i^u=1)&=&\Phi(-\mub^u_\q[i]/\sqrt{\Sigmab^u_\q[i,i]}|0,1), \label{eq:meanest}
\end{eqnarray}
both means in $\mub^u_\q$ can be computed separately from the respective marginals using the 1D inverse Gaussian CDF~\citep{mvnorm}. The univariate optimization problem for the remaining parameter in the interval $[-1,1]$ can then be solved efficiently using a line search method~\citep{brent2013algorithms}. The scalability of Algorithm~\ref{alg:BLICA} depends crucially on this step, as $n_u\cdot (n^2-n)/2$ correlations need to be estimated. The separately estimated correlations are collected into $n_u$ segmentwise $n$-by-$n$ correlation matrices denoted by $\C^u_\q$. 
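
For a single pair of variables, the computation can be sketched as follows; the symbols $\hat{p}_i$ (observed fraction of ones of $x_i^u$), $\rho$ (the correlation to be estimated), $c_{ab}$ (number of data points with $x_i^u=a$ and $x_j^u=b$) and $p_{ab}(\rho)$ are introduced here only for illustration. With unit variances, Equation~\ref{eq:meanest} is inverted to
\begin{eqnarray*}
\mub^u_\q[i] = -\Phi^{-1}(\hat{p}_i \vert 0,1), \qquad
\mub^u_\q[j] = -\Phi^{-1}(\hat{p}_j \vert 0,1),
\end{eqnarray*}
and the remaining one-dimensional problem is
\begin{eqnarray*}
\hat{\rho} = \arg\max_{\rho \in [-1,1]} \sum_{a,b \in \{0,1\}} c_{ab} \log p_{ab}(\rho),
\end{eqnarray*}
where $p_{ab}(\rho)$ is the quadrant probability of Equation~\ref{eq:probxu} for a bivariate Gaussian with the above means, unit variances and correlation $\rho$. As a rough sense of scale, hypothetical values $n=10$ and $n_u=40$ would require $40 \cdot 45 = 1800$ such line searches.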

\begin{algorithm}[!t]
\begin{algorithmic}[1]
\State Input data recorded at $n_u$ different segments.
\For{ segment $u \in \{1,\ldots,n_u\}$ }
\For{ each observed variable pair $\{x_i^u,x_j^u\}$ } 
\State \parbox[t]{6cm}{Estimate the correlation between $q_i^u$ and $q_j^u$ by maximizing the marginal pairwise likelihood of  $x_i^u$ and $x_j^u$ (in segment $u$).}
      \EndFor
      \State \parbox[t]{7cm}{Form and regularize the correlation matrix $\C_\q^u$ obtained from the pairwise correlations.}
\EndFor
\State Optimize scaled Gaussian likelihood  
with L-BFGS
over sufficient statistics  $\C_\q^u$ from all segments $u$.
\State Return the estimated mixing matrix $\A$ and source variances $\Sigmab^u_\z$ for all segments $u$.
\end{algorithmic}
\caption{The \texttt{BLICA} algorithm for Binary ICA.\label{alg:BLICA}}
\end{algorithm}

\textbf{Regularization.} When estimating the correlations of $\q^u$ from sample data, it can happen that a correlation matrix $\C_\q^u$ is close to singular or not positive definite. We use the following regularization on line 5, based on the parameter $r$~\citep{warton}, which marks the approximate condition number targeted. 
The regularized correlation matrix is then
\begin{equation}
\frac{1}{1+\delta} (\C_\q^u + \delta \mathbf{I}), \text{ where } \delta = \max \left( 0, \frac{\lambda_1-r\cdot \lambda_n }{ r-1} \right), \nonumber
\end{equation}
where $\lambda_1$ is the largest and $\lambda_n$ the smallest eigenvalue of $\C_\q^u$.
This regularization keeps the unit diagonal.
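
The choice of $\delta$ can be checked with a short calculation (a sketch, assuming $r>1$ and $\lambda_1 > \lambda_n$). The regularized matrix has eigenvalues $(\lambda_k+\delta)/(1+\delta)$, so its condition number is $(\lambda_1+\delta)/(\lambda_n+\delta)$, and requiring this to equal $r$ gives
\begin{eqnarray*}
\lambda_1 + \delta = r (\lambda_n + \delta)
\quad \Longrightarrow \quad
\delta = \frac{\lambda_1 - r \lambda_n}{r-1}.
\end{eqnarray*}
If the condition number is already at most $r$, this value is non-positive and no regularization is applied. As a hypothetical example, a 2-by-2 correlation matrix with off-diagonal entry $0.9$ has eigenvalues $1.9$ and $0.1$; with $r=10$ we get $\delta = (1.9-1)/9 = 0.1$, and the regularized off-diagonal entry is $0.9/1.1 \approx 0.818$, giving eigenvalues $1.818$ and $0.182$ with ratio $10$.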

\textbf{Moment Matching.} 
Finally, on line 6, we fit the model parameters (including stationary $\A$) to the estimated correlations $\C^u_\q$ using a Gaussian likelihood model 
over the different segments $u$ (Section~2). But in contrast to the usual case where we have the covariance matrices, here we need to account for the ``binarization indeterminacy'', resulting in additional nuisance scaling parameters, as pointed out above. We use the term scaled Gaussian likelihood to refer to the ordinary multivariate Gaussian likelihood where we include additional parameters $\Qb^u$ as the scaling factors.
The fitting is thus done by the following scaled Gaussian likelihood based on the sufficient statistics $\C_\q^u$:
\begin{equation}
l =\sum_{u=1}^{n_u} \frac{N}{2} \left[
   %-n\log  (2 \pi) 
   -\log (\det (\Sigmab_\q^u)) - \Tr(\C_\q^u (\Sigmab_\q^u)^{-1}  ) \right] \nonumber
\end{equation} 
where recall that $\Sigmab_\q^u=\Qb^u(\mathbf{I}+\A\Sigmab_\z^u \A^T)\Qb^u$ by Equation~\ref{arithmetic2} is a function of
 the mixing matrix $\A$, 
 source variances $\{\Sigmab_\z^u\}_u$ (diagonal, positive elements) and scaling factors $\{\Qb^u\}_u$ (diagonal, positive elements). Variances and scaling factors can be kept positive by using the log-exp transform. Note that without the scaling factors $\{\Qb^u\}_u$, the mixing matrix $\A$ could be found via joint diagonalization~\citep{JSSv076i02}. Note also that due to Theorem~\ref{thm:means}, the source means do not need to be estimated.
 Here, instead, we perform the fitting by maximizing this likelihood using L-BFGS~\citep{lbfgs} with respect to the aforementioned parameters.
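To see why this objective can be read as moment matching, consider as a simplified illustration (ignoring the model structure, and assuming $\C_\q^u$ is positive definite) maximizing a single summand over an unconstrained positive definite matrix. Writing $\Sigmab$ for $\Sigmab_\q^u$ and $\C$ for $\C_\q^u$, and setting the gradient to zero,
\begin{equation}
\begin{aligned}
&\frac{\partial}{\partial \Sigmab}\left[-\log\det(\Sigmab) - \Tr(\C\,\Sigmab^{-1})\right] \\
&\quad= -\Sigmab^{-1} + \Sigmab^{-1}\C\,\Sigmab^{-1} = \mathbf{0}
\end{aligned} \nonumber
\end{equation}
gives $\Sigmab=\C$. Maximizing the likelihood over the structured parameters can thus be viewed as bringing the model-implied matrix $\Sigmab_\q^u$ as close as possible to the estimated correlations $\C_\q^u$ in each segment.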

\subsection{Binary ICA through Linear iVAE} \label{ivae}

\citet{Khemakhem2019} presented the identifiable Variational Autoencoder (iVAE), an approach for nonlinear ICA employing variational autoencoders \citep{Kingma2014, Rezende2014} that assumes access to an additionally observed variable such that the sources are independent given the auxiliary variable; further, each source follows an exponential family distribution given the auxiliary variable. %univariate 
Here, we apply the iVAE approach to estimate the binary ICA model from Section~\ref{model} \citep{barin2021independent}. As proposed by \citet{Kingma2014} and \citet{Khemakhem2019}, we use the factorized Bernoulli observational model and apply a sigmoid function element-wise to the output of the decoder to obtain the binary probabilities. 
Due to the linearity of our mixing model and the segment-wise structure, we can simplify the encoder (posterior approximation) of the VAE, and make all the transformations in the iVAE affine or linear, thus greatly simplifying the system. The \texttt{linear iVAE} is presented in more detail in Appendix~E.

\subsection{Estimation of the Sources}

After estimating the mixing matrix $\A$, one may also wish to estimate the sources $\z^u$.
In the case of binary data, the individual source values cannot be accurately estimated (even up to scale and order indeterminacies) due to the inherent noise introduced by the binarization procedure. Presumably, though, if the number of observed variables is large and the number of sources is small, the estimation may be reasonable. In any case, the posterior $P(\mathbf{z}^u|\mathbf{x}^u)$ can be easily calculated after estimating the mixing matrix.



\section{EXPERIMENTS}

\begin{figure*}[!t]
    \centering
\includegraphics[scale=0.42]{plot3.pdf}
\caption{Identifiability 
with equal number of observed variables and sources. The \texttt{BLICA} method used true (pairwise) probability distributions (i.e. infinite sample limit data). Each box is based on 30 models. 
A lower value on the y-axis (log-error) implies better performance.
Runs with values less than $-7$ (e.g. those in which the model was identified up to machine precision) are marked with $-7$. Compare to Table~1.\label{fig:idplot} }
\end{figure*}



We implemented our proposed methods and baselines in R (\texttt{BLICA}, \texttt{full MLE}) and Python (\texttt{linear iVAE}). 
Here we investigate the identifiability of the model, as well as the finite-sample estimation performance and the scalability of our methods, and compare them to previous approaches. 

\textbf{Data.} The data was generated from the Binary ICA model (Section~2)
 in the following way. Means were drawn from $\mathrm{unif}(-0.5,0.5)$, standard deviations from
$\mathrm{unif}(0.5,3)$. Mixing matrix elements were drawn from $\mathrm{unif}(-3,3)$ while ensuring invertibility by resampling until the condition number ($\kappa$) was below 20 for $n<20$, or, for $n\geq 20$, below the 75th percentile of the condition numbers of 1000 sampled mixing matrices of the same dimension. For practical estimation from finite sample data, we use 40 segments and vary the sample size per segment.


\textbf{Evaluation.} 
ICA methods are often compared in terms of the mean correlation coefficient of the estimated sources. Here, however, binarization induces heavy noise and individual samples of the estimated sources cannot be accurately estimated. We therefore focus our evaluation on the mixing model, 
and measure the mean cosine similarity (MCS) of the mixing matrix columns (taking the inherent order and scale indeterminacy of the sources into account, see Appendix~D).
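One common way to formalize such a measure is sketched below. It is stated here as an assumption, with notation introduced only for illustration; Appendix~D gives the definition actually used. Denoting by $\mathbf{a}_j$ the $j$-th column of the true mixing matrix and by $\hat{\mathbf{a}}_j$ the $j$-th column of its estimate,
\begin{equation}
\mathrm{MCS} = \max_{\pi} \frac{1}{n_z}\sum_{j=1}^{n_z} \frac{|\mathbf{a}_j^T \hat{\mathbf{a}}_{\pi(j)}|}{\|\mathbf{a}_j\|\,\|\hat{\mathbf{a}}_{\pi(j)}\|}, \nonumber
\end{equation}
where $\pi$ ranges over the permutations of $\{1,\ldots,n_z\}$. In this form, the normalization and the absolute value make the measure invariant to the scale and sign of each column, and the maximization over $\pi$ accounts for the order indeterminacy.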


\subsection{Identifiability}



\textbf{Results.}
Recall from Sections~4 and~5 that the correlations of $\q^u$ convey the information between the binary data and the mixing model, and each of these correlations can be determined from the marginal distributions over the corresponding pair of binary observed variables in $\x^u$ (in a segment $u$).
Thus, by using the exact pairwise binary distributions of elements of $\x^u$ from Equation~\ref{eq:probxu} as input for \texttt{BLICA}, 
we are here able to investigate identifiability empirically without any finite sample effects.
Figure~\ref{fig:idplot} shows which models can be identified when the number of sources and observed variables are equal ($n\hspace*{-1mm}=\hspace*{-1mm}n_z$). In many cases, the method found the mixing matrix essentially up to machine precision, which can be seen as an indication of identifiability.
Each box includes 30 different data generating models, and for each we ran \texttt{BLICA} 3 times; the MCS of the run with highest scaled Gaussian likelihood is plotted. 
With only 2 segments, or only 2 observed variables (also in Theorem~3), the model is not identifiable in any case. The minimal cases deemed identifiable (up to source scale and order) are $(n\hspace*{-1mm}=\hspace*{-1mm}5,n_u\hspace*{-1mm}=\hspace*{-1mm}5)$, $(n\hspace*{-1mm}=\hspace*{-1mm}6,n_u\hspace*{-1mm}=\hspace*{-1mm}4)$, $(n\hspace*{-1mm}=\hspace*{-1mm}7,n_u\hspace*{-1mm}=\hspace*{-1mm}4)$, $(n\hspace*{-1mm}=\hspace*{-1mm}8,n_u\hspace*{-1mm}=\hspace*{-1mm}4)$, $(n\hspace*{-1mm}=\hspace*{-1mm}9,n_u\hspace*{-1mm}=\hspace*{-1mm}3)$, and $(n\hspace*{-1mm}=\hspace*{-1mm}10,n_u\hspace*{-1mm}=\hspace*{-1mm}3)$.
Thus, in general, the more observed variables ($n$) we have, the fewer segments ($n_u$) are needed.

\begin{table}[!t]
\centering
{
\setlength{\tabcolsep}{2.5pt}
\begin{tabular}{c|rrrrrrrrr}
Number of & \multicolumn{9}{c}{Number of Observed Variables ($n$)}\\
Segments ($n_u$) & $2$ & \phantom{0}$3$ & $4$ & $5$ & $6$ & $7$ & $8$ & $9$ & $10$\\ 
  \hline
$2$ & -6 & -9 & -12 & -15 & -18 & -21 & -24 & -27 & -30 \\ 
  $3$ & -7 & -9 & -10 & -10 & -9 & -7 & -4 & {\color{red}\textbf{0}} & {\color{red}\textbf{5}} \\ 
  $4$ & -8 & -9 & -8 & -5 &  {\color{red}\textbf{0}} & {\color{red}\textbf{7}} & {\color{red}\textbf{16}} & 27 & 40 \\ 
  $5$ & -9 & -9 & -6 & {\color{red}\textbf{0}} & 9 & 21 & 36 & 54 & 75 \\ 
  $6$ & \phantom{0}-10 & -9 & -4 & {\color{red}\textbf{5}} & 18 & 35 & 56 & 81 & 110 
\end{tabular}
}
\caption{Heuristic identifiability analysis. 
Each entry states the number of statistics (equations) minus the number of unknowns. 
The minimal cases 
with a non-negative number,
suggesting identifiability, are bolded in red.
\label{tab:idtable}}
\end{table}

\begin{figure*}
    \includegraphics[scale=0.36]{plot1b.pdf}\;\includegraphics[scale=0.36]{plot1a.pdf}\;\includegraphics[scale=0.36]{plot2.pdf}
\caption{
Finite sample performance. 
Left: 10 observed variables and 10 sources.
Center: 6 observed variables and 6 sources.
Right: 6 observed variables and 2 sources. 
Each box is based on thirty 40-segment datasets. 
\label{fig:10observed}\label{fig:comparison} }
\end{figure*}

\paragraph{Heuristic Identifiability Analysis.} We contrast 
the results to the 
well-known heuristic approach to identifiability used in factor analysis. It is based on counting the number of statistics we can calculate (or equations we can form), and the number of unknowns (parameters) we need to solve. If the 
former
is at least as large as the 
latter, there is hope that the model is identifiable.
The calculations in Table~\ref{tab:idtable} are based on Equations~\ref{arithmetic1} and~\ref{arithmetic2} when the number of sources equals the number of observations ($n=n_z$).
The statistics correspond to $n_u(n^2-n)/2$  covariances, $n_u\cdot n$ variances and $n_u\cdot n$ means (for $\q^u$). 
Unknowns include $n\cdot n$ mixing matrix coefficients, $n_u\cdot n$ (segment-wise) source variances, $n_u\cdot n$ source means, as well as $n_u\cdot n$ scaling terms (diagonal elements of $\Qb^u$).
In line with the classical literature in factor analysis, we ignore the source 
order indeterminacy. Figure~\ref{fig:idplot} and Table~\ref{tab:idtable} show a remarkably similar dependence between identifiability and the numbers of the segments and the observed variables: in particular, they agree on the minimal cases identifiable. Interestingly, cases with 2 observed variables as well as the cases with only 2 segments are never identifiable.
Note that these computational results together with Section~4 provide a bound for any future analytical results on identifiability.
If identifiability turns out to be possible in further cases, e.g., with a different mixing model or linking function, the results will need to depend on the particular parametric forms, thus limiting applicability.
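
The entries of Table~\ref{tab:idtable} can be reproduced directly from the counts given above. As a short worked calculation, with $\Delta$ being notation introduced here only for illustration, subtracting the number of unknowns from the number of statistics gives
\begin{equation}
\begin{aligned}
\Delta(n,n_u) &= \Bigl[\tfrac{n_u(n^2-n)}{2} + 2 n_u n\Bigr] - \bigl[n^2 + 3 n_u n\bigr] \\
&= \frac{n_u\, n\,(n-3)}{2} - n^2 .
\end{aligned} \nonumber
\end{equation}
For example, $\Delta(5,5)=25-25=0$ and $\Delta(9,3)=81-81=0$. For $n\leq 3$ the first term is non-positive, and for $n_u=2$ the expression reduces to $-3n$, so the count is negative in these cases, in line with the first row and the first two columns of the table. For $n>3$, the count is non-negative exactly when $n_u\geq 2n/(n-3)$.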




\subsection{Finite Sample Estimation}




\textbf{Methods.} Next we turn our attention to estimation performance from finite sample data.
 We compare our new \texttt{BLICA} method (with different values of the regularization parameter $r$) to its main competitor, \texttt{fastICA}~\citep{Himberg01,fastica}, and to the baseline implementations of \texttt{linear iVAE} and \texttt{full MLE}. Note that the model of \texttt{fastICA} is somewhat different, but it still employs a linear mixing of the sources and has the same source scale and order indeterminacies; thus, a comparison in terms of MCS is sensible. \texttt{fastICA} does not use the segment index, but pools all data from different segments.
Recall from Section~\ref{ivae} and Appendix~E that the \texttt{linear iVAE} uses essentially the same model, but instead of employing the likelihood, it optimizes the ELBO objective through L-BFGS. For runs with $n<20$ observed variables, a time budget of 2h was used, and the results that were obtained within the time limit are reported. For larger simulations, we allowed for 12h per run. To avoid local minima due to the difficult optimization landscape, we ran the \texttt{linear iVAE}, \texttt{full MLE} and \texttt{BLICA} with 3 different learning seeds and selected the best run according to the objective function (e.g. likelihood). 

\begin{figure*}[!t]
    \centering
    \includegraphics[scale=0.36]{plot4.pdf}\;\includegraphics[scale=0.36]{plot5.pdf}\;\includegraphics[scale=0.36]{timeplot.pdf}
\caption{Scalability.
Left: equal number of sources and observed variables. Center: 10 sources. 
Each box is based on thirty 40-segment datasets with 1000 samples per segment.
Right: Running times of the  steps of \texttt{BLICA} (Algorithm~1).
\label{fig:highdim} }
\end{figure*}


\textbf{Results.}
 Figure~\ref{fig:10observed} (left) shows the result for 10 observed variables and 10 sources. \texttt{BLICA} clearly outperforms the other methods, consistently improving with increasing sample size. 
With smaller dimensions, 6 observed variables and 6 sources in Figure~\ref{fig:10observed} (center), \texttt{BLICA} needs more samples to achieve similar MCS. However, with fewer sources, fewer samples are needed: Figure~\ref{fig:10observed} (right) shows that for 6 observed variables and 2 sources, high MCS can be obtained with only 50 samples per segment. Interestingly, \texttt{linear iVAE} performs well only with fewer sources than observed variables, while \texttt{fastICA} is not able to reliably estimate the mixing matrix from binary data. Unfortunately, \texttt{full MLE} cannot perform sufficiently many optimization steps within the time limit of 2h even with 6 observed variables in Figure~\ref{fig:10observed} (center).

\textbf{Scalability.} Figure~\ref{fig:highdim} assesses the performance in higher dimensions, using thirty data sets for each $n$, each consisting of 40 segments with 1000 samples per segment. 
Only \texttt{BLICA} can estimate the mixing matrix with an equal number of observed variables and sources in Figure~\ref{fig:highdim} (left). When the number of sources is fixed to 10 in Figure~\ref{fig:highdim} (center), \texttt{linear iVAE} also shows improving performance with an increasing number of observed variables. Finally, Figure~\ref{fig:highdim} (right) shows the running times of \texttt{BLICA} (Algorithm~1) on the previous runs. The estimation of the quadratic number of correlations starts taking considerable time with 100 observed variables. L-BFGS relatively quickly reaches a solution close to the final result (i.e. 1\% lower MCS), and then still gradually improves.




\section{RELATED WORK} \label{sec:related}

Our research connects particularly to the following earlier and more recent literature.
\citet{Himberg01}
consider binary observed vectors $\x$ and binary sources $\z$, so that the ICA mixing model is given by the Boolean expression $ x_{i}=\bigvee_{j=1}^{n_z} a_{i j} \wedge z_{j}$.
They show that this Boolean OR mixing can be approximated by a linear mixing model followed by a unit step function. Thus, they propose to estimate the model by ordinary ICA, and obtain reasonable results when the data is very sparse.
Similarly, \citet{nguyen10} studied binary ICA with OR mixtures by defining a disjunctive generative model.
They prove identifiability  
and propose an algorithm without  continuous-valued approximations.

\citet{kaban06} proposed a model where 
continuous sources follow a Beta distribution, followed by a binary observation model.
While their approach is related to ours, their latent variables are restricted to a finite interval, and they estimate the model using variational approximation which is unlikely to yield consistent estimators.
Discrete ICA has further been approached by extensions of LDA where the topic intensities are mutually independent \citep{Podosinnikova15, Buntine05, canny04}. Although their identifiability guarantees are limited \citep{podosinnikova16}, these methods have the advantage of allowing for discrete data. 
\citet{lee} consider PCA employing a binarized Gaussian model.

Finally, we note that the very idea of estimating latent variable models by non-stationarity, originating in the work of \citet{Matsuoka95} and \citet{Pham01}, has recently been increasingly used in estimating generative models \citep{tcl,Khemakhem2019} as well as for causal discovery \citep{zhang2017causal,Monti19UAI}, even in deep learning. Automatically estimating the segment index by an HMM has further been proposed by \citet{halva2020hidden}.
Instead of the widespread idea of joint diagonalization of covariance matrices~\citep{belouchrani1997blind,Tsatsanis}, we used correlation matrices without explicit diagonalization criteria; related work on diagonalizing correlation matrices is presented by \citet{corrdiag}.

\section{CONCLUSION}

We presented a model for ICA of binary data which is based on a linear latent mixing model and non-stationarity of the sources. 
We investigated the identifiability, showing some surprising indeterminacies not present in ordinary ICA, including the fact that in the two-variable case the model cannot be identified. We believe that our identifiability results, theoretical and empirical, will be useful in future research on binary ICA. Based on our approach using a Gaussian link function, the likelihood can be obtained in closed form, although the Gaussian cumulative distribution function is still computationally heavy. These advances allowed for a practical method, \texttt{BLICA}, that combines maximum likelihood estimation and moment matching; it was shown to be applicable in higher dimensions while still empirically showing consistent behavior.
As future work, we aim to generalize from binary to discrete variables, consider parallelized approaches for scaling up \texttt{full MLE}, and investigate the potential of the new learning algorithm in applications.

\subsubsection*{Acknowledgements}

The first author was supported by the Academy of Finland under grant 315771. The second author acknowledges funding from Samsung Electronics Co., Ltd. (at Mila). The third author acknowledges funding from the Academy of Finland and a CIFAR Fellowship.

\bibliographystyle{plainnat}
\bibliography{hyttinen_232}
\end{document}