% \documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

% added package by myself
\usepackage{amsmath}
\usepackage{subcaption}
\usepackage{graphicx}
\usepackage{float}
\usepackage{amssymb}

\title{Bias-aware Boolean Matrix Factorization Using Disentangled Representation Learning}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Xiao Wang}
\author[1]{Jia Wang}
\author[3]{Tong Zhao}
\author[1]{Yijie Wang}
\author[4,5]{Nan Zhang}
\author[2]{Yong Zang}
\author[2]{Sha Cao}
\author[2]{Chi Zhang}
% Add affiliations after the authors
\affil[1]{%
    Department of Computer Science\\
    Indiana University\\
    Bloomington, Indiana, USA
}
\affil[2]{%
    School of Medicine\\
    Indiana University\\
    Indianapolis, Indiana, USA
}
\affil[3]{%
    Uber Inc\\
    Seattle, Washington, USA
  }
\affil[4]{%
    Institute of Science and Technology for Brain-inspired Intelligence\\
    Fudan University\\
    Shanghai\\
    China
}
\affil[5]{%
    School of Data Science\\
    Fudan University\\
    Shanghai\\
    China
}
  
  \begin{document}
\maketitle

\begin{abstract}
 Boolean matrix factorization (BMF) has been widely utilized in fields such as recommendation systems, graph learning, text mining, and -omics data analysis. Traditional BMF methods decompose a binary matrix into the Boolean product of two lower-rank Boolean matrices plus homoscedastic random errors. However, real-world binary data typically involves biases arising from heterogeneous row- and column-wise signal distributions. Such biases can lead to suboptimal fitting and unexplainable predictions if not accounted for. In this study, we reconceptualize the binary data generation as the Boolean sum of three components: a binary pattern matrix, a background bias matrix influenced by heterogeneous row or column distributions, and random flipping errors. We introduce a novel Disentangled Representation Learning for Binary matrices (DRLB) method, which employs a dual auto-encoder network to reveal the true patterns. DRLB can be seamlessly integrated with existing BMF techniques to facilitate bias-aware BMF. Our experiments with both synthetic and real-world datasets show that DRLB significantly enhances the precision of traditional BMF methods while offering high scalability. Moreover, the bias matrix detected by DRLB accurately reflects the inherent biases in synthetic data, and the patterns identified in the bias-corrected real-world data exhibit enhanced interpretability.
 
\end{abstract}

\vspace{-0.5cm}
\section{Introduction}\label{sec:intro}
Boolean matrix factorization (BMF) seeks to detect lower-dimensional patterns in a binary matrix, which contrasts with traditional matrix factorization techniques that focus on real-valued matrices. BMF has various applications across different domains \citep{miettinen2020recent}. One notable application is in recommender systems, where it can be used to extract meaningful user-item preferences or item-item similarities from binary user-item interaction data \citep{rukat2017bayesian,balasubramaniam2018people}. It can also be employed in bioinformatics for gene expression analysis \citep{liang2020bem,zhang2007binary,zhang2010binary}, network analysis \citep{8594848,kocayusufoglu2018summarizing}, and binary pattern mining \citep{lucchese2010generative,lucchese2010mining,lucchese2013unifying}. Preserving the binary nature of the data poses unique challenges to BMF \citep{stockmeyer1975set,miettinen2008discrete}. Several approaches have been proposed to address this issue, including non-negative matrix factorization (NMF) with binary constraints \citep{araujo2016faststep}, BMF with specific optimization objectives \citep{wan2020fast,lucchese2010generative,lucchese2010mining,lucchese2013unifying,miettinen2008discrete,miettinen2011model,miettinen2014mdl4bmf}, and probabilistic models that capture the binary nature of the data \citep{ravanbakhsh2016boolean,rukat2017bayesian}.

Despite the significant advancements in BMF, it is essential to recognize that the current BMF formulation, which regards a matrix as the sum of a series of low-rank Boolean pattern matrices and independent and identically distributed (i.i.d.) random errors \citep{ravanbakhsh2016boolean,rukat2017bayesian}, overlooks the presence of biases within the data. When biases pervade the underlying data, the extracted patterns are likely to absorb these biases and become distorted \citep{mehrabi2021survey, yao2017beyond}. Such distortion can adversely affect the interpretation of the data, as well as subsequent analyses and decisions.

Real-world data often display unique biases and heteroscedastic (non-i.i.d.) distributions of signals and errors, which traditional BMF methods may not fully account for \citep{wan2020denoising}. A common form of bias arises from the varied distributions of row-/column-wise signals. For instance, in purchase history data, ``super items'' that are disproportionately popular can be purchased by a large number of users. Similarly, ``super users'', who purchase a wider array of items more frequently, can introduce further imbalances. These biases, driven by varying propensities, can distort the representation of true underlying patterns. Traditional BMF approaches, which typically assume equal importance across all items or users, may struggle to differentiate between genuine patterns and the biases introduced by the prevalence of these super items or users.

To tackle the bias issue, BMF must be solved under bias-aware assumptions. Here, we model the generation of a binary matrix as the Boolean sum of the true, to-be-detected low-rank patterns, row-wise and column-wise biases, and heteroscedastic errors. Building upon this idea, we introduce \textbf{DRLB}\footnote{Code is available at https://github.com/xwang97/DRLB} (\textbf{D}isentangled \textbf{R}epresentation \textbf{L}earning for \textbf{B}inary matrices), a deep learning framework designed to disentangle a Boolean matrix into two distinct components: a low-rank pattern matrix $(U \otimes V)$ and a bias matrix that captures the row-wise and column-wise biases. The key contributions of this study include:
%\vspace{-3mm}
\begin{itemize}
\item DRLB reconceptualizes a Boolean matrix as the aggregation of patterns, background bias, and heteroscedastic random errors. This flexible model can be generally applied across diverse real-world datasets, expanding its utility and relevance.
\item DRLB is the first deep neural network-based method to untangle the intricate relationship between low-rank patterns and the background bias inherent in rows and columns of a Boolean matrix.
\item DRLB enhances the bias removal of Boolean matrices. It can be combined with any existing BMF method to facilitate more accurate data analyses and interpretations.
\end{itemize}
It is noteworthy that DRLB is developed to recognize and adjust for non-random, systematic biases in data, which are distinct from random errors. DRLB specifically improves the awareness of skewness in Boolean data for a bias-aware BMF.

\vspace{-0.25cm}
\section{Related Work}
\vspace{-0.25cm}
\subsection{Boolean Matrix Factorization}\label{sec:2.1}

For a given Boolean matrix $X\in \{0,1\}^{m\times n}$ of $m$ rows and $n$ columns, BMF computes a pair of low-rank binary matrices, \(U\in \{0,1\}^{m\times k}\) and \(V\in \{0,1\}^{k\times n}\), whose Boolean product (denoted by $\otimes$) approximates \(X\):
\begin{equation}\label{first_eq}
X \sim U\otimes V , \quad X_{ij}\sim \bigvee_{l=1}^k U_{il}\wedge V_{lj}. 
\end{equation} 
Here, \(\wedge\) and \(\vee\) denote the ``and'' and ``or'' operations. The two low-rank binary matrices are usually solved under Boolean arithmetic to minimize the Frobenius norm, or another norm, of the reconstruction error $\|E\|=\|X-U\otimes V\|$. For example, ASSO \citep{miettinen2008discrete} builds a row-wise correlation matrix and employs a heuristic mechanism for retrieving binary base matrices, while PANDA \citep{lucchese2010mining} iteratively discovers and retains the most significant patterns. However, the high computational cost of these methods limits their application to large-scale datasets. MEBF \citep{wan2020fast} utilizes binary matrix permutation theory and geometric segmentation, which largely improves the efficiency of detecting patterns enriched with 1s. A recent method, CG \citep{kovacs2020binary}, significantly improved pattern detection accuracy by formulating the BMF problem as a mixed integer linear program and introducing a column generation-based optimization. In recent years, multiple methods that follow the formulation of (\ref{first_eq}) have been developed to improve the accuracy and efficiency of BMF \citep{miettinen2008discrete, rukat2017bayesian, kovacs2020binary, lucchese2010mining, dalleiger2022efficiently, avellaneda2022undercover, ravanbakhsh2016boolean, fischer2021differentiable, neumann2020biclustering}.
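
As a small worked illustration of (\ref{first_eq}), with matrices chosen here purely for exposition (they are not taken from any dataset used in this paper), let $k=2$ and
\begin{equation*}
U=\begin{pmatrix}1&0\\1&1\\0&1\end{pmatrix},\quad
V=\begin{pmatrix}1&1&0\\0&1&1\end{pmatrix},
\end{equation*}
\begin{equation*}
U\otimes V=\begin{pmatrix}1&1&0\\1&1&1\\0&1&1\end{pmatrix}.
\end{equation*}
The second row of $U$ belongs to both patterns, so $(U\otimes V)_{22}=(1\wedge 1)\vee(1\wedge 1)=1$, whereas the ordinary matrix product would give $2$. Overlapping patterns therefore do not accumulate under Boolean arithmetic.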

\subsection{Bias-aware BMF}

Conventional BMF algorithms perform well when the input follows the formulation of \eqref{first_eq}. However, this formulation may oversimplify the generation processes observed in real-world scenarios, in which data often exhibit a biased distribution of row-/column-wise signals, such as the super users or super items in purchase history data. Such biases cause conventional BMF methods to identify patterns that are skewed towards the rows or columns with a high frequency of `1's, thereby reducing the accuracy and reliability of the subsequent pattern extraction and analysis. Specifically, certain patterns will be accentuated while others will be downplayed.

A recent study, BIND \citep{wan2020denoising}, first introduced the consideration of row-/column-dependent background bias in BMF. BIND identifies and eliminates entire rows and columns that are less likely to be contained in a low-rank pattern before BMF, thus improving detection accuracy and decreasing the reconstruction error in the presence of background bias. However, our analysis reveals that BIND may introduce additional bias (see Section \ref{sec:floats} and Figure \ref{box}). BIND also assumes identical fitting errors, which further limits its applicability in real-world data analysis. Thus, it is crucial to develop robust bias-handling approaches that comprehensively capture the generation process of Boolean matrices.

\subsection{Disentangled Representation Learning}
Disentangled representation learning aims to identify and disentangle the latent patterns in input data \citep{wang2022disentangled}. With the ability to decompose the observations into components carrying different types of information, disentangled learning has demonstrated its high interpretability of the input data in diverse applications. In computer vision and recommender systems, VAE and GAN-based models are exploited to disentangle independent factors of variation and manipulate latent variables \citep{higgins2016beta, kim2018disentangling, chen2016infogan, ma2019learning}. Cross-domain tasks could also be improved by embedding the input from different domains into a shared domain-invariant content space and disentangling this shared space from different domain-specific attribute spaces \citep{lee2018diverse}. 

In this paper, we reconsider the generation of binary data as the Boolean sum of three components: patterns, background, and random error. This assumption lends itself naturally to disentangled representation learning. By training two auto-encoders to extract the pattern and background bias in an unsupervised fashion, we show that our model captures the data generation process well and achieves strong performance in removing bias from binary data.


\begin{figure*}[!t]
\centering
  \includegraphics[width=1.0\textwidth, height=9.0cm]{figures/framework.png}
  \caption{The DRLB Framework. The observed matrix serves as input for both the Pattern Net (blue) and Background Net (yellow). The outputs of the two decoders reconstruct the input by their Boolean sum. Background bias matrices are first generated based on the row-wise and column-wise distributions of the input matrix and serve as the training input for the Background Net. The bottleneck features of the observed matrix and the generated background, both extracted by the Background Net, are constrained by a distribution loss. The Pattern Net and Background Net disentangle the input matrix into (1) the bias-removed pattern + error matrix, which can be passed directly to a BMF method, and (2) the background bias.}
  \label{fig:frame}
\end{figure*}


\vspace{-0.15cm}
\section{DRLB Framework}\label{sec:math}
\vspace{-0.15cm}
\subsection{Notations}
\vspace{-0.15cm}
In this study, we represent a matrix, vector, and scalar value by uppercase (\(X\)), bold lowercase (\(\textbf{x}\)), and lowercase (\(x\)) characters, respectively. A superscript represents the dimension of the object (e.g., \(X^{m\times n}\)), while a subscript indicates the element indices (e.g., \(i\)-th row: \(X_{i:}\), \(j\)-th column: \(X_{:j}\), and \(ij\)-th element: \(X_{ij}\)). \(||\cdot||\) represents a general matrix norm, such as the Frobenius norm. Under Boolean arithmetic, the \textit{and}, \textit{or}, and \textit{not} operations are denoted by \(\wedge,\, \vee\), and \(\neg\). The Boolean element-wise sum and subtraction are then defined as \(X\oplus Y=X\vee Y\) and \(X\ominus Y=(X \wedge \neg Y)\vee (\neg X\wedge Y)\). The Boolean matrix product is defined as \(H=X\otimes Y\), where \(H_{ij}=\bigvee_{l=1}^k X_{il}\wedge Y_{lj}\).

\subsection{Mathematical Considerations} \label{sec:3.2}
We first consider the following generation model of a Boolean matrix $X\in \{0,1\}^{m\times n}$:
\begin{equation}\label{second_eq}
  X = U\otimes V \oplus X^B + E,
\end{equation}
where $X^B$ is the background bias matrix generated by row- and column-wise probability vectors $\textbf{p}^r\in \mathbb{R}^{m\times 1}$ and $\textbf{p}^c\in \mathbb{R}^{n\times 1}$, which reflect the probability of observing 1s in a row or a column of $X^B$: 
\begin{align*}
    P(X^B_{i \cdot}=1)\propto \textbf{p}^r_i \quad \text{and} \quad P(X^B_{\cdot j}=1)\propto \textbf{p}^c_j.
\end{align*}
$E$ is an independent and identically distributed (i.i.d.) flipping error with flipping probability $P(1\rightarrow 0) = P(0\rightarrow 1) = p_0$. Note that (\ref{second_eq}) extends (\ref{first_eq}) by introducing the background bias matrix $X^B$, which yields a bias-aware BMF problem. Intuitively, (\ref{second_eq}) assumes that an observed 1 is generated either by a low-rank pattern or by the row-/column-dependent background bias. Thus, a successful disentanglement of $X^B$ may not only decrease the fitting error \(|X\ominus X^B-U\otimes V|\) but also increase the explainability of the 1s in the observed data.
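
To make the role of the background bias concrete, consider a hypothetical $2\times 3$ example. The values below are assumed purely for illustration, and the proportionality constant is taken to be one, so that each entry of $X^B$ equals 1 with probability given by the product of its row and column propensities. With $\textbf{p}^r=(0.8,\,0.1)^\top$ and $\textbf{p}^c=(0.5,\,0.5,\,0.1)^\top$,
\begin{equation*}
\textbf{p}^r(\textbf{p}^c)^\top=
\begin{pmatrix}
0.40 & 0.40 & 0.08\\
0.05 & 0.05 & 0.01
\end{pmatrix}.
\end{equation*}
The first row (a ``super user'') is expected to contain $0.88$ ones and the second row only $0.11$, even though no low-rank pattern is present. A method that ignores $X^B$ may be tempted to explain the dense first row as a pattern.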

In real-world data, another challenge is that the distribution of error could be heterogeneous, i.e.,
\begin{align*}
    P(1\rightarrow 0) \neq P(0\rightarrow 1).
\end{align*}
Moreover, $P(1\rightarrow 0)$ and $P(0\rightarrow 1)$ may depend on $U$, $V$, $\textbf{p}^r$, $\textbf{p}^c$, or even $X^B$. For example, the density of 1s could vary among the sub-matrices, meaning that $P(1\rightarrow 0)$ depends on $U$ and $V$. To capture the generation of real-world Boolean data and enable robust BMF for more general inputs, we further extend (\ref{second_eq}) to the following probabilistic definition of the bias-aware BMF problem.

\textbf{Definition 1. Bias-Aware BMF (BABMF) problem.} For a given Boolean matrix $X\in \{0,1\}^{m\times n}$, BABMF seeks $U^{m\times k}$, $V^{k\times n}$, and $X^B$ such that $P(X^B_{i \cdot}=1)\propto \textbf{p}^r_i$, $P(X^B_{\cdot j}=1)\propto \textbf{p}^c_j$, and 
\begin{equation}\label{third_eq}
\fontsize{9}{11}\selectfont
P(X_{ij}=1) = 
  \begin{cases}
    \textbf{p}^r_i\textbf{p}^c_j, & \text{if $({U\otimes V})_{ij}=0$,}\\
    \min\{\textbf{p}^r_i\textbf{p}^c_j + p_{ij},1\}, & \text{if $({U\otimes V})_{ij}=1$,}
  \end{cases}
\end{equation}
where $p_{ij}$ encodes the low-rank pattern signal, as distinct from bias or error, i.e., $p_{ij} > 0$ if $({U\otimes V})_{ij}=1$ and $p_{ij} = 0$ otherwise. Note that (\ref{third_eq}) provides a general probabilistic formulation of the generation model in (\ref{second_eq}). Without additional constraints on $p_{ij}$, (\ref{third_eq}) is ill-posed because $p_{ij}$ admits a trivial solution under maximum likelihood estimation (MLE). When the structure of $p_{ij}$ is given, an MLE-based solution can be formulated.
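
As a numeric illustration of (\ref{third_eq}), with values assumed purely for exposition, write $b_{ij}$ for the background term of entry $(i,j)$ and take $b_{ij}=0.3$. Then
\begin{equation*}
P(X_{ij}=1)=
\begin{cases}
0.3, & (U\otimes V)_{ij}=0,\\
\min\{0.9,\,1\}=0.9, & (U\otimes V)_{ij}=1,\ p_{ij}=0.6,\\
\min\{1.1,\,1\}=1, & (U\otimes V)_{ij}=1,\ p_{ij}=0.8.
\end{cases}
\end{equation*}
The truncation at one is only active when the background and pattern contributions together exceed one, as in the last case.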

In practice, it is challenging to derive an analytic form of the errors for real-world data, i.e., to define the structure of $p_{ij}$. Here we propose DRLB, a heuristic and general solution to the BABMF problem formulated in (\ref{third_eq}) that makes no prior assumption about the errors. DRLB considers $X$ to be the Boolean sum of $X^B$ (background bias matrix) and $X^P$ (pattern matrix). Following (\ref{third_eq}), $X^P$ is generated by $P(X^P_{ij}=1)=p_{ij}$ and $X^B$ is generated by $P(X^B_{ij}=1)=\textbf{p}^r_i\textbf{p}^c_j$.
DRLB disentangles the Boolean input $X$ into these two components.

\subsection{Overview of the Network Architecture of DRLB}
% The overall framework of the DRLB method is shown in Figure~\ref{fig: frame}, which disentangles the input matrix $X$ into a pattern matrix $X^L$ and a background matrix $X^0$ by using two auto-encoders, namely Pattern Net (blue in Figure~\ref{fig:frame}) and Background Net (yellow in Figure~\ref{fig:frame}), each consisting of five layers. Specifically, DRLB maximizes the goodness of fitting of $X$ by the Boolean sum of $X^L$ and $X^0$ and minimizes the difference between the distribution of the signals generated by the Background Net and the pre-assumed distribution of $X^0$, then the signals from the pattern matrix $X^L$ could be captured by the Pattern Net. After the disentanglement, the estimated $X^L$ could serve as the input of any BMF method for further decomposition.

% To enable the Background Net to capture the row-/column-wise background bias, DRLB first generates a simulated background matrix from the observed matrix based on the background probabilities (see details in the next section). This generated matrix approximates the true background bias, especially when the observed matrix is significantly larger than the pattern matrix \citep{yang2014sparse}. Additionally, the original observed matrix serves as input for both the Pattern Net and the Background Net to compute and optimize the pattern matrix, with the goal of reconstructing the observed matrix. Constraints are imposed on the middle layers of both auto-encoders to align the distributions of the generated background matrix. The generation of the background matrix and the optimization of the pattern matrix iterate until a convergence criterion is met. The final solved pattern matrix, now debiased, will serve as input for any BMF method.

For a given Boolean matrix $X$, we aim to learn a latent representation $Z$ that could reconstruct the input through a neural network. The likelihood of $X$ can be represented by:
\begin{equation}
    P(X) = \int_Z P(X|Z)P(Z)dZ
\end{equation}
Previous work on variational inference \citep{kingma2013auto} showed that the log-likelihood is bounded below by:
\begin{equation}
    \log P(X) \geq \mathbf{E}_{Q_\phi (Z|X)}[\log P_\theta (X|Z)] - \mathbf{KL}(Q_\phi (Z|X)\,\|\, P(Z)),
\label{eq5}
\end{equation}
where $\phi$ denotes the parameters of the neural network $Q_\phi(Z|X)$ (encoder) that approximates the posterior probability $P(Z|X)$, and $\theta$ denotes the parameters of the neural network $P_\theta(X|Z)$ (decoder) that models the likelihood $P(X|Z)$.
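
For completeness, the bound in \eqref{eq5} can be recovered by a standard argument. The following is a sketch of the textbook derivation and is not specific to DRLB: multiply and divide the integrand of the marginal likelihood by $Q_\phi(Z|X)$ and apply Jensen's inequality to the concave logarithm,
\begin{equation*}
\begin{split}
\log P(X) &= \log \mathbf{E}_{Q_\phi(Z|X)}\!\left[\frac{P_\theta(X|Z)P(Z)}{Q_\phi(Z|X)}\right] \\
&\geq \mathbf{E}_{Q_\phi(Z|X)}\!\left[\log P_\theta(X|Z)\right] \\
&\quad + \mathbf{E}_{Q_\phi(Z|X)}\!\left[\log\frac{P(Z)}{Q_\phi(Z|X)}\right],
\end{split}
\end{equation*}
where the last expectation equals $-\mathbf{KL}(Q_\phi(Z|X)\,\|\,P(Z))$.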

Following the definition of BABMF and the discussion in Section \ref{sec:3.2}, a binary matrix is generated by the Boolean sum of pattern and bias matrices, i.e., $X = X^P \oplus X^B$. In DRLB, we solve the BABMF problem by factorizing the latent representation $Z$ into two independent components, $Z^P$ and $Z^B$, which separately generate the pattern matrix $X^P$ and the bias matrix $X^B$. Under this consideration, \eqref{eq5} can be extended to (see detailed derivations in Appendix \ref{A1}):
\begin{equation}
\begin{split}
    \log P(X) \geq \mathbf{E}_{Q_{\phi^P,\phi^B}  (Z^P, Z^B|X)}[\log P_{\theta^P, \theta^B} (X|Z^P, Z^B)]\\
    - \mathbf{KL}(Q_{\phi^P} (Z^P|X)|| P(Z^P)) - \mathbf{KL}(Q_{\phi^B} (Z^B|X)|| P(Z^B)) 
\end{split}
\label{eq6}
\end{equation}

\vspace{-0.2cm}
To factorize $Z$, we introduce two neural networks (as illustrated in Figure \ref{fig:frame}), namely (1) Pattern Net (colored in blue, with parameters $\phi^P, \theta^P$) and (2) Background Net (colored in yellow, with parameters $\phi^B, \theta^B$). In Section \ref{sec:3.4}, we detail the collaborative training of the two networks and how the lower bound derived in \eqref{eq6} ensures an effective factorization of $Z$ and the disentanglement of the pattern and bias matrices.

\vspace{-0.25cm}
\subsection{Disentangled Representation Learning}\label{sec:3.4}
\vspace{-0.15cm}
% This section details the loss function and optimization approach of DRLB. Denote $f_P$ and $f'_P$ as the encoder and decoder functions of Pattern Net and $f_B$ and $f'_B$ as the encoder and decoder functions of Background Net, respectively. 

% The first step of DRLB is to generate a simulated background matrix $X^{0}$ based on the input matrix $X$. To generate elements in $X^{0}$, DRLB first estimates the background distributions $P(X^0_{ij}=1)$. Based on (\ref{third_eq}), $P(X^0_{ij}=1)\propto \textbf{p}^{c}_i\cdot \textbf{p}^{r}_j$, where $\textbf{p}^{r}_j$ and $\textbf{p}^{c}_i$ are the row-wise and column-wise background probabilities. By (\ref{third_eq}), $\textbf{p}^{r}$ and $\textbf{p}^{c}$ are not directly identifiable because $p_{ij}^{c}$ and $U,\ V$ are unknown. We heuristically approximate them by $\hat{\textbf{p}}^c_i=\frac{X_{i:}}{n}$ and $\hat{\textbf{p}}^r_j=\frac{X_{:j}}{m}$. DRLB further generates $X^{0}$ by multiplying this approximated probability with a hyper-parameter $\alpha$:
% \begin{equation}
%   P(X^{0}_{ij}=1)=\alpha \cdot \frac{X_{i:}}{n}\cdot\frac{X_{:j}}{m}
% \end{equation}

% With both the observed binary matrix and the generated background matrix, the two networks could be trained to disentangle patterns and backgrounds. The first loss is the reconstruction error. The observed data $X$ will be used as input for both Pattern Net and the Background Net. After the transformation by the two networks, the sum of their outputs $X'$ should well reconstruct the input $X$. Meanwhile, the generated background matrix $X^0$ will be passed to the Background Net and should be reconstructed by the output of the Background Net (denoted by $X^{0'}$). Two reconstruction errors are formulated as:
% \begin{equation}
%   L_{recon1} = \sum\limits_{i=1}^n \Vert X_{i:}-X'_{i:} \Vert,
% \end{equation}

% \begin{equation}
%   L_{recon2} = \sum\limits_{i=1}^n \Vert X_{i:}^0 -X_{i:}^{0'} \Vert.
% \end{equation}

% The above loss functions constrain the sum of the outputs of Pattern Net and Background Net to reconstruct $X$. Based on this, if the Background Net could extract true background bias, then the Pattern Net should embed true patterns. Therefore, we introduce the third loss, $L_{dist}$, to penalize the difference between the distributions of the bottleneck features extracted from the observed matrix and the generated background matrices:

% \begin{equation}
%     L_{dist} = MMD^2(f_B(X), f_B(X^0)).
% \end{equation}
% Here MMD is the maximum mean discrepancy \citep{gretton2012kernel}, which is a measure of discrepancies between distributions via projections onto a reproducing kernel Hilbert space associated with a reproducing kernel $k$,

% \begin{equation}
% \begin{split}
%     MMD^2(p,q) = & E_{x,x'\sim p}k(x,x') - 2E_{x\sim p,y\sim q}k(x,y) + \\
%                  & E_{y,y'\sim q}k(y,y'),
% \end{split}    
% \end{equation}
% where $p$ and $q$ are two distributions. In practice, the expectations are estimated by sample means.

% With all these designs, our neural network model will be trained by the combination of these three loss functions:
% \begin{equation}
%     L = L_{recon1} + L_{recon2} + \lambda L_{dist}
% \end{equation}
% where $\lambda$ is an adjustable hyper-parameter.

This section details the loss function, network training, and optimization approaches of DRLB. Denote the encoder and decoder functions of the Pattern Net as $f_{\phi^P}$ and $f_{\theta^P}$, and those of the Background Net as $f_{\phi^B}$ and $f_{\theta^B}$. \citet{kingma2013auto} suggested that the MLE of $P(X)$, i.e., the optimization of $Z^P$ and $Z^B$, can alternatively be achieved by maximizing the lower bound derived in \eqref{eq6}.

\textbf{Maximize the expectation term}. The first term in \eqref{eq6} is the expectation $\mathbf{E}_{Q_{\phi^P,\phi^B}  (Z^P, Z^B|X)}[\log P_{\theta^P, \theta^B} (X|Z^P, Z^B)]$, which can be maximized by training the two networks. The posterior probability $P(Z^P, Z^B|X)$ is approximated by $Q_{\phi^P,\phi^B}  (Z^P, Z^B|X)$. For the log-likelihood term $\log P_{\theta^P, \theta^B} (X|Z^P, Z^B)$, since $X$ is the Boolean sum of $X^P$ and $X^B$ and contains only binary values, we model each entry with a Bernoulli distribution:
\begin{equation}
    X_{ij}\mid Z^P, Z^B \sim \mathrm{Ber}\big(f_{\theta^P}(Z^P)_{ij} + f_{\theta^B}(Z^B)_{ij}\big)
\end{equation}
Denote the sum of the outputs of the two decoders as $f_\theta(Z)_{ij} \triangleq f_{\theta^P}(Z^P)_{ij} + f_{\theta^B}(Z^B)_{ij}$. The log-likelihood term can be written as:
\begin{equation}
\begin{split}
    \log P_{\theta^P, \theta^B} (X|Z^P, Z^B) = \sum_{ij} \big[X_{ij}\log f_\theta(Z)_{ij}\\ + (1-X_{ij})\log(1-f_\theta(Z)_{ij})\big]
\end{split}
\label{logL}
\end{equation}
Maximizing the log-likelihood is equivalent to minimizing the binary cross-entropy loss. We define the loss function for the expectation term in \eqref{eq6} as:
\vspace{-0.01cm}
\begin{equation}
    L_{recon1} = -\frac{1}{mn}\log P_{\theta^P, \theta^B} (X|Z^P, Z^B)
\label{recon1}
\end{equation}
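As a numeric illustration with assumed values, consider a toy matrix with only two entries ($mn=2$): $X_{11}=1$ with $f_\theta(Z)_{11}=0.9$, and $X_{12}=0$ with $f_\theta(Z)_{12}=0.2$. Using natural logarithms, the first entry contributes $-\log 0.9\approx 0.105$ and the second contributes $-\log(1-0.2)\approx 0.223$, so that
\begin{equation*}
L_{recon1} = -\tfrac{1}{2}\left(\log 0.9 + \log 0.8\right) \approx 0.164 .
\end{equation*}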
\textbf{Optimize the KL divergence term}. In variational inference, the KL divergence terms in \eqref{eq6} penalize the discrepancies between the distributions of the latent representations $Z^P$ and $Z^B$ and their prior distributions \citep{kingma2013auto}. To enable effective disentangled learning of $Z^P$ and $Z^B$, DRLB first approximates a prior distribution of $Z^B$ so that $f_{\theta^B}(Z^B)$ accurately and specifically captures the background bias. Because the two decoder outputs jointly reconstruct $X$ via $f_{\theta^P}(Z^P) + f_{\theta^B}(Z^B)$, constraining $Z^B$ to the background bias leaves $Z^P$ to capture the patterns. We set $P(Z^P) = N(0, I)$. For $Z^B$, DRLB approximates its prior distribution as detailed below.

%The two KL divergence terms in \eqref{eq6} control the distributions of the two latent components $Z^P$ and $Z^B$ with a prior. One common choice of the prior is the standard Normal $N(0, I)$. However, simply applying such a prior to both the two KL terms will not enable the disentanglement. Therefore, we will use different strategies for the two priors.\\
%We do not have direct access to the distribution of patterns, so we use $N(0, I)$ for $P(Z^P)$ as a simple regularization of the posterior. For $Z^B$, we first modelled the distribution of the bias distribution, and then used the distribution of its latent representation as $P(Z^B)$. Since we already enabled the summation of the two decoder outputs to reconstruct the original input, constraining the posterior of $Z^B$ with the bias distribution will force $Z^P$ to capture the latent distribution of patterns. We will introduce how we can achieve this in the next paragraph.


\textbf{Estimate the distribution of bias and the prior distribution of $Z^B$}. To approximate the distribution of $Z^B$, we first generate a simulated bias matrix, denoted by $\hat X^{B}$. By \eqref{third_eq}, $P(X^B_{ij}=1)\propto \textbf{p}^{r}_i\cdot \textbf{p}^{c}_j$, where $\textbf{p}^{r}_i$ and $\textbf{p}^{c}_j$ are the row-wise and column-wise background probabilities. Note that $\textbf{p}^{r}$ and $\textbf{p}^{c}$ are not identifiable when $p_{ij}$ is unknown. Thus, the probabilities are heuristically approximated by the observed row and column densities, $\hat{\textbf{p}}^r_i=\frac{1}{n}\sum_{j=1}^n X_{ij}$ and $\hat{\textbf{p}}^c_j=\frac{1}{m}\sum_{i=1}^m X_{ij}$. $\hat X^{B}$ is then randomly generated from the product of the approximated probabilities scaled by a hyper-parameter $\alpha$:
\begin{equation}
  P(\hat X^{B}_{ij}=1)=\alpha \cdot \hat{\textbf{p}}^r_i\cdot\hat{\textbf{p}}^c_j
\end{equation}
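As a worked illustration on a hypothetical $3\times 4$ input, we take the row and column terms to be the fractions of 1s in each row and column (our reading of the heuristic) and assume $\alpha=1$ for simplicity. For
\begin{equation*}
X=\begin{pmatrix}1&1&1&0\\1&0&0&0\\0&1&0&0\end{pmatrix},
\end{equation*}
the row densities are $(\tfrac34,\tfrac14,\tfrac14)$ and the column densities are $(\tfrac23,\tfrac23,\tfrac13,0)$, which gives
\begin{equation*}
\big[P(\hat X^{B}_{ij}=1)\big]_{ij}=
\begin{pmatrix}
\tfrac12 & \tfrac12 & \tfrac14 & 0\\[2pt]
\tfrac16 & \tfrac16 & \tfrac1{12} & 0\\[2pt]
\tfrac16 & \tfrac16 & \tfrac1{12} & 0
\end{pmatrix}.
\end{equation*}
The dense first row receives a higher background probability than the two sparse rows, and the empty fourth column receives none.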
As illustrated in Figure \ref{fig:frame}, the Background Net in DRLB learns the latent representation of the background bias, denoted as $\hat Z^B$, from $\hat X^B$. Similar to \eqref{logL} and \eqref{recon1}, the generative model can be trained using the following loss:
\begin{equation}
    L_{recon2} = -\frac{1}{mn}\log P_{\theta^B}(\hat X^B|\hat Z^B),
\end{equation}
where $\hat Z^B=f_{\phi^B}(\hat X^B)$ approximates the prior distribution of $Z^B$ and is learned from the randomly simulated background bias matrix $\hat X^B$, and $\log P_{\theta^B}(\hat X^B|\hat Z^B)=\sum_{ij}\big[\hat X^B_{ij}\log f_{\theta^B}(\hat Z^B)_{ij} + (1-\hat X^B_{ij})\log (1-f_{\theta^B}(\hat Z^B)_{ij})\big]$ is the log-likelihood function. To ensure high robustness and flexibility of the method, we introduce a third loss based on Maximum Mean Discrepancy (MMD) \citep{gretton2012kernel} to further regularize the discrepancies of $Z^B$ and $Z^P$ from their prior distributions (see details in Appendix \ref{A2}):
\begin{equation}
\begin{split}
    L_{dist} ={}& MMD^2(f_{\phi^P}(X), N(0, I)) \\
    &+ MMD^2(f_{\phi^B}(X), f_{\phi^B}(\hat X^B))
\end{split}
\label{ldist}
\end{equation}

With the above considerations, the dual networks in DRLB are trained by minimizing the combined loss function:
\begin{equation}
    L = L_{recon1} + L_{recon2} + \lambda L_{dist}
\end{equation}
where $\lambda$ is an adjustable hyper-parameter.

\begin{figure*}
    \centering
    \includegraphics[scale=0.38]{figures/results.png}
    \caption{Comparison of reconstruction errors on simulated data. First row: pattern size = 80. Second row: pattern size = 120. Columns from left to right correspond to 2, 3, and 4 simulated patterns. The three denoising settings, namely without denoising, BIND, and DRLB, are colored blue, red, and green, respectively.}
    \label{box}
\end{figure*}

\vspace{-0.15cm}
\section{Experiments}\label{sec:floats}
\vspace{-0.15cm}
\subsection{Benchmark and Baselines}
\vspace{-0.15cm}
We evaluate the performance of DRLB on both synthetic and real-world datasets. Note that DRLB is a bias-removal method that can be seamlessly combined with any BMF method. To show the effectiveness of DRLB in removing background bias and increasing the detection accuracy of BMF, we selected four BMF methods for the downstream BMF task, namely ASSO, PANDA, MEBF, and CG. As introduced in Section~\ref{sec:2.1}, ASSO \citep{miettinen2008discrete} and PANDA \citep{lucchese2010mining} are two classic methods that are commonly used as baselines in BMF method development. MEBF \citep{wan2020fast} is a state-of-the-art (SOTA) method that has the top running speed and satisfactory accuracy. CG \citep{kovacs2020binary} is a SOTA method that robustly achieved the top prediction accuracy in multiple recent BMF works. Based on the experiments of recent BMF studies \citep{kovacs2020binary, avellaneda2021boolean}, we consider the four selected BMF methods representative of conventional and SOTA methods. We note that the goal of the experiments with these selected BMF methods is to demonstrate the effectiveness of DRLB in bias removal, and the advantage of combining DRLB with the BMF methods over applying BMF alone, rather than to compare DRLB against any of the BMF methods.

To show DRLB's superiority in background bias removal, we select BIND as the baseline method for comparison. To the best of our knowledge, BIND is the only baseline method for this type of analysis. For each dataset and BMF method, the performance is evaluated on three different inputs: 1) the original input, 2) data with bias removed using BIND, and 3) data with bias removed using DRLB.
\vspace{-0.2cm}
\subsection{Implementation Details}\label{sec:4.2}
\vspace{-0.2cm}
All the BMF methods and BIND used in this paper were implemented with the source code and default parameters from the original works. DRLB was implemented in PyTorch and trained with the Adam optimizer \citep{kingma2014adam}. In DRLB, each of the two neural networks has five layers, with dimensions $\{D, 200, 20, 200, D\}$ (here $D$ is the dimension of the input data). ReLU activation is used before and after the bottleneck layer. The output of the final layer is mapped to values in $(0,1)$ using a sigmoid activation. Implementation details including batch size, learning rate, parameters, and hardware information are given in APPENDIX \ref{AppendixB}. Analysis of running time and scalability is given in APPENDIX \ref{AppendixC}.

\vspace{-0.2cm}
\subsection{Experiments on Simulated Data}
\vspace{-0.2cm}
\subsubsection{Experimental Settings.} 
\vspace{-0.2cm}
We simulated Boolean matrices $X$ of size $500\times 500$ by $X = U\otimes V + X^B + E$, with pattern size $\in \{80, 120\}$, pattern number $k\in \{2, 3, 4\}$, and a fixed flipping error $E$ with $p(1\rightarrow 0) = p(0\rightarrow 1) = 0.01$. The row- and column-wise probability vectors of background bias were simulated by random sampling from the uniform distribution $\textbf{p}^{r}, \textbf{p}^{c} \sim U[0.1, 1]$. In total, 600 synthetic input matrices across six scenarios were simulated.

Two metrics were used in our evaluation. First, we evaluated the reconstruction error, which is the most common metric for evaluating the performance of BMF methods. The reconstruction error is defined as:
\vspace{-0.2cm}
\begin{equation}
    \text{Reconstruction Error} = || U\otimes V - A^* \otimes B^*|| 
\end{equation}
\vspace{-0.05cm}
where $U$ and $V$ generate the ground truth pattern matrix $X^P$ as defined in \eqref{third_eq} and $A^*, B^*$ are the patterns decomposed by a BMF algorithm.
Second, we used the \textbf{signal} (1s in $U\otimes V$) / \textbf{noise} (1s from background and errors) ratio of the debiased (bias-removed) matrix to evaluate the effectiveness of the bias removal of BIND and DRLB.
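
One way to write this ratio explicitly, as a sketch under the assumption that signal and noise are both counted over the 1s retained in the debiased matrix (the symbol $\tilde X$ for the debiased matrix is introduced here only for illustration), is
\begin{equation*}
    \text{signal/noise} = \frac{|\{(i,j)\colon \tilde X_{ij}=1,\ (U\otimes V)_{ij}=1\}|}{|\{(i,j)\colon \tilde X_{ij}=1,\ (U\otimes V)_{ij}=0\}|}.
\end{equation*}
For example, a hypothetical debiased matrix that keeps 900 pattern entries and 100 non-pattern entries would have a ratio of $900/100=9$.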

\vspace{-0.275cm}
\subsubsection{Evaluation on reconstruction error.}
\vspace{-0.15cm}
\textit{DRLB drastically decreases the reconstruction error of BMF methods.} Figure~\ref{box} shows the reconstruction errors of different BMF methods on simulated data using different bias removal approaches. We observed that the performances of ASSO, MEBF, and CG are drastically improved when applied to the DRLB-debiased data compared to the original input and BIND-debiased data. A slight improvement was also seen in PANDA. We note that DRLB is especially helpful for methods that have a high sensitivity in detecting dense patterns, such as ASSO and CG. These methods cannot distinguish between true patterns and dense blocks formed by background biases. Thus, they tend to suffer a higher false positive rate when there is a stronger background bias, which can be effectively handled by DRLB. In contrast, BIND showed a lower level of improvement than DRLB. In addition, it tends to lose too many pattern signals and may introduce additional biases (Figure~\ref{denoise}(b,e)).

Our experiments showed that DRLB + CG achieved the lowest reconstruction error under all the testing scenarios. As shown in Figure~\ref{box}, the reconstruction error of CG was high when applied to the original data and decreased drastically when DRLB was applied first. CG is the top-performing method on data without bias, and its performance on biased data increases drastically when combined with DRLB.

\begin{table}[!h]
\centering
\begin{tabular}{|l|c|c|c|c|c|c|}
\hline
size     & \multicolumn{3}{c|}{80} & \multicolumn{3}{c|}{120} \\ \hline
number   & 2 & 3 & 4 & 2 & 3 & 4 \\ \hline
original & 0.18 & 0.29 & 0.37 & 0.43 & 0.68 & 1.02 \\ \hline
BIND     & 0.06 & 0.10 & 0.05 & 0.48 & 0.15 & 0.53 \\ \hline
DRLB     & \textbf{10.82} & \textbf{7.64} & \textbf{21.25} & \textbf{8.54} & \textbf{93.62} & \textbf{87.12} \\ \hline
\end{tabular}
\caption{Signal/noise ratio of the original input, BIND-debiased data, and DRLB-debiased data on simulated data with different pattern sizes and numbers of patterns.}
\label{snratio}
\end{table}

\begin{figure}[!t]
\minipage{0.28\columnwidth}
  \includegraphics[width=\linewidth]{figures/s3.png}
  \subcaption{}
  \label{fig:awesome_image1}
\endminipage\hfill
\minipage{0.28\columnwidth}
  \includegraphics[width=\linewidth]{figures/s3-bind.png}
  \subcaption{}
  \label{fig:awesome_image2}
\endminipage\hfill
\minipage{0.28\columnwidth}%
  \includegraphics[width=\linewidth]{figures/s3-NN.png}
  \subcaption{}
  \label{fig:awesome_image3}
\endminipage\\
\minipage{0.28\columnwidth}
  \includegraphics[width=\linewidth]{figures/s4.png}
  \subcaption{}
  \label{fig:awesome_image4}
\endminipage\hfill
\minipage{0.28\columnwidth}
  \includegraphics[width=\linewidth]{figures/s4-bind.png}
  \subcaption{}
  \label{fig:awesome_image5}
\endminipage\hfill
\minipage{0.28\columnwidth}
  \includegraphics[width=\linewidth]{figures/s4-NN.png}
  \subcaption{}
  \label{fig:awesome_image6}
\endminipage\\
\vspace{-0.3cm}
\caption{Bias removal on simulated data. 1s in true patterns and background bias are colored red and blue, respectively. Row 1: four patterns of size 80. Row 2: two patterns of size 120. (a,d) original inputs; (b,e) BIND-debiased data; (c,f) DRLB-debiased data.}
\label{denoise}
\end{figure}

\vspace{-0.25cm}
\subsubsection{Evaluation on signal/noise ratio.}
\vspace{-0.25cm}
\textit{DRLB can remove the majority of the background biases and achieves a high signal/noise ratio.} Table~\ref{snratio} shows the signal/noise ratio of the original data, BIND-debiased data, and DRLB-debiased data in the simulated scenarios. The matrices debiased by DRLB have drastically increased signal/noise ratios compared to the original input or BIND-debiased data. To illustrate the bias-removing power of DRLB, we visualized DRLB-debiased matrices versus the original inputs and BIND-debiased results of two simulation scenarios in Figure~\ref{denoise}. The figure shows that the background bias was largely removed by DRLB and that the patterns become more distinct than in the original and BIND-debiased matrices. BIND may remove entire rows or columns that are likely to be biased. Although BIND removes biases, it also loses a significant part of the pattern signals, which explains why the signal/noise ratios of BIND-debiased matrices are lower than those of the original input in five of the six scenarios.

\vspace{-0.3cm}
\subsubsection{Evaluation on estimating background bias.} 
\vspace{-0.2cm}
We also evaluated the accuracy of DRLB in approximating the background bias. We first reconstructed the background bias matrix as the Boolean difference between the input matrix and the pattern matrices detected by BMF methods. Row- and column-wise bias levels were estimated by the frequency of 1s in each row and column of the reconstructed background bias matrix. We computed the correlation between the estimated bias level and the true bias level to evaluate the bias recovery capability of DRLB. Figure~\ref{corr} shows the correlations of row- and column-wise bias on three simulated scenarios. The correlation between the estimated bias level and the true bias level is over 0.95 in most cases. The high correlation demonstrates that the row- and column-wise bias is well captured by DRLB.

\vspace{-0.2cm}
\begin{figure}[H]
    \includegraphics[scale=0.48]{figures/cor_plot copy3.png}
    \vspace{-0.8cm}
    \caption{Correlation between estimated bias level and true bias level on three simulated datasets with various pattern sizes and numbers. First row: results of row-wise bias. Second row: results of column-wise bias on the same dataset.}
    \label{corr}
\end{figure}
\vspace{-0.4cm}
\subsection{Ablation studies}
\vspace{-0.2cm}
\subsubsection{Selection and sensitivity of hyper-parameters.}
There are two hyper-parameters in DRLB, namely (1) $\lambda$, which balances the contributions of the loss functions, and (2) $\alpha$, which controls the density of the generated background matrix. The appropriate value of $\alpha$ depends on the data; we recommend setting it between 2 and 3. In our experiments, we set $\alpha=3$ for all the simulated data and $\alpha=2$ for the real-world data. We also tested the impact of varying $\lambda$. Because DRLB + CG was confirmed as the best-performing combination, we used CG for BMF in this test. Figure~\ref{sen} shows the reconstruction errors of CG on the simulated scenarios when $\lambda$ varies from 0.1 to 1. Our model is robust when $\lambda$ varies over a relatively wide range. However, the reconstruction error tends to increase when $\lambda$ becomes large. Therefore, we set $\lambda=0.3$ in all simulated and real-world data experiments.

\vspace{-0.3cm}
\subsubsection{Influence by bias level.}
To show the robustness of DRLB to different bias levels, we evaluated the method on simulated data with different bias levels. Four scenarios were simulated, with the row/column bias randomly sampled from $U[p,1]$ for $p=0, 0.1, 0.2, 0.3$. As before, DRLB + CG was used for the evaluation. Figure~\ref{bias} shows the reconstruction error of CG before and after denoising under these scenarios. A larger $p$ increases the bias level and the reconstruction error. DRLB consistently performs well in handling the high-bias cases and ensures a small reconstruction error for the downstream BMF. We conclude that DRLB is highly robust to varied bias levels and can therefore be applied to a wide range of binary data.

\begin{figure}[H]
    \centering
    \includegraphics[scale=0.5]{figures/lambda.png}
    \vspace{-0.2cm}
    \caption{Reconstruction error of CG on six simulated scenarios after bias removal by DRLB using different values of $\lambda$. Different colors represent the six scenarios in Table~\ref{snratio}.}
    \label{sen}
\end{figure}

\vspace{-0.8cm}
\begin{figure}[htbp]
    \centering
    \includegraphics[scale=0.45]{figures/bias2.png}
    \caption{Performance of CG and DRLB + CG on simulated data of different bias levels.}
    \label{bias}
\end{figure}

% \vspace{-0.8cm}
\subsection{Experiments on Real World Data}
\vspace{-0.3cm}
To further evaluate the effectiveness of DRLB, we tested our model on two real-world single-cell RNA sequencing (scRNA-seq) datasets. BMF has been commonly applied in scRNA-seq data analysis \citep{chang2021iris, fang2020effective}. Notably, scRNA-seq data (1) is non-negative, (2) contains a large number of 0s, and (3) always has heterogeneous row (gene)- and column (cell)-wise distributions (see detailed discussions in APPENDIX \ref{AppendixD}), which makes it suitable testing data for the BABMF problem. Both selected datasets were collected from liver cancer tissues \citep{ma2019tumor,wang2019single}. The two datasets have 17530 genes and 5762 cells (GSE125449) and 14452 genes and 4375 cells (GSE140228), respectively.

The low-rank patterns identified by BMF correspond to functional modules in scRNA-seq data. In the real-world data-based experiment, we focus on demonstrating that the application of DRLB enables the detection of more biologically meaningful patterns. As before, DRLB + CG was selected for analysis and compared with CG.
\vspace{-0.3cm}
\subsubsection{Data preprocessing.}
\vspace{-0.15cm}
For each dataset, we first selected the 2000 most variable genes and cells based on row-/column-wise standard deviation, which gives a $2000\times 2000$ matrix. The original continuous data were binarized by setting the top 80\% of non-zero values to 1 and the other values to 0.
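
Written as a thresholding rule, this is a sketch under the assumption that the cutoff is computed over all non-zero entries of the selected matrix. The symbols $Y$ for the continuous matrix and $q_{0.2}$ for the 20th percentile of its non-zero values are introduced here only for illustration:
\begin{equation*}
    X_{ij} =
    \begin{cases}
        1, & Y_{ij} > 0 \text{ and } Y_{ij} \geq q_{0.2},\\
        0, & \text{otherwise}.
    \end{cases}
\end{equation*}
Under this reading, the lowest 20\% of the non-zero values are set to 0 together with the original zeros.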
\vspace{-0.5cm}
\subsubsection{Performance evaluation.}
\vspace{-0.3cm}
Since there is no ground truth for real-world data, instead of evaluating reconstruction errors and signal/noise ratio, we focused on demonstrating the biological meaning of the patterns derived from DRLB-debiased data vs.\ original inputs. We first used the adjusted Rand index (ARI) to test the agreement between the detected patterns and known cell type labels. Intuitively, true patterns should have high ARI because most functional modules are cell-type specific. On the other hand, false positives caused by background bias may not match well with cell types. Figure~\ref{ari} shows the performance of DRLB + CG vs.\ CG. Bias removal by DRLB resulted in a much higher ARI than applying CG directly to the original input. This result partially demonstrates that DRLB improves the detection of functional gene modules.

To further evaluate the detected patterns, we performed a pathway enrichment analysis of the genes in each detected pattern against 3090 canonical pathways in the Molecular Signatures Database \citep{subramanian2005gene,liberzon2011molecular}. We found that a large number of cancer-related pathways were detected only by DRLB + CG. For example, in GSE125449, the PPAR signaling pathway was only detected in DRLB-debiased data. This pathway plays an important role in lipid circulation and metabolic reprogramming in cancer. Pathways related to immune response and other biological functions were also only detected by DRLB + CG. Our analysis demonstrates that DRLB enables better bias handling and functional analysis of scRNA-seq data.
\vspace{-0.2cm}
\begin{figure}[H]
    \centering
    \includegraphics[scale=0.5]{figures/realworlddata.png}
    \vspace{-0.4cm}
    \caption{ARI on real-world data.}
    \label{ari}
\end{figure}


% \vspace{-1.0cm}

\section{Conclusion}
\vspace{-0.3cm}
In this study, we introduce DRLB, a method that can effectively handle systematic biases in Boolean matrix factorization. By disentangling the input matrix into distinct pattern and background matrices, DRLB provides a more accurate representation of low-rank patterns in both simulated and real-world data. DRLB can be seamlessly combined with existing BMF methods to improve their detection accuracy on biased data and enhance the reliability and context-specific meaningfulness of the detected low-rank patterns. Providing an efficient bias removal method for large data is beyond the scope of this work; one future direction is to develop a more efficient version of DRLB.



\begin{acknowledgements}
    This study is supported by the National Science Foundation (DBI IIBR 2047631, IIS 2145314) and the American Cancer Society (RSG-22-062-01-MM).
\end{acknowledgements}

% References
\bibliography{uai2024-template}

\newpage

\onecolumn

\title{Bias-aware Boolean Matrix Factorization Using Disentangled Representation
Learning (Supplementary Material)}
\maketitle


\appendix
\section{Mathematical derivations}
\subsection{The Derivation of equation (6)}\label{A1}
Here we show how we derived Eq.~\eqref{eq6} from Eq.~\eqref{eq5}. With the disentanglement strategy, the latent space is decomposed into two independent components, $Z^P$ and $Z^B$. Because of the independence, we have: 1) $P(Z^P, Z^B) = P(Z^P)P(Z^B)$; 2) $Z^P$ and $Z^B$ are each associated only with their own corresponding network. Then:
\begin{equation}
\begin{split}
    \log P(X) &\geq \mathbf{E}_{Q_\phi (Z|X)}[\log P_\theta (X|Z)] - \mathbf{KL}(Q_\phi (Z|X)|| P(Z))\\
    &= \mathbf{E}_{Q_{\phi^P,\phi^B}  (Z^P, Z^B|X)}[\log P_{\theta^P, \theta^B} (X|Z^P, Z^B)] - \mathbf{KL}(Q_{\phi^P, \phi^B} (Z^P, Z^B|X)|| P(Z^P, Z^B))\\
    &= \mathbf{E}_{Q_{\phi^P,\phi^B}  (Z^P, Z^B|X)}[\log P_{\theta^P, \theta^B} (X|Z^P, Z^B)] - \mathbf{KL}(Q_{\phi^P} (Z^P|X)|| P(Z^P)) - \mathbf{KL}(Q_{\phi^B} (Z^B|X)|| P(Z^B)) 
\end{split}
\label{eqA.1}
\end{equation}
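
The last equality uses the additivity of the KL divergence over independent components. As a sketch of this step, assume in addition that the approximate posterior factorizes as $Q_{\phi^P,\phi^B}(Z^P,Z^B|X)=Q_{\phi^P}(Z^P|X)\,Q_{\phi^B}(Z^B|X)$, which is how condition 2) is read here. Writing $Q_{\phi^P}$ and $Q_{\phi^B}$ for the two factors:
\begin{equation*}
\begin{split}
    \mathbf{KL}(Q_{\phi^P}Q_{\phi^B}\,||\,P(Z^P)P(Z^B)) &= \mathbf{E}_{Q_{\phi^P}Q_{\phi^B}}\left[\log\frac{Q_{\phi^P}(Z^P|X)}{P(Z^P)}+\log\frac{Q_{\phi^B}(Z^B|X)}{P(Z^B)}\right]\\
    &= \mathbf{KL}(Q_{\phi^P}(Z^P|X)\,||\,P(Z^P)) + \mathbf{KL}(Q_{\phi^B}(Z^B|X)\,||\,P(Z^B)),
\end{split}
\end{equation*}
where each expectation reduces to one over its own factor because the other factor integrates to one.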

\subsection{The Formulation of equation (12)}\label{A2}
In Eq.~\eqref{ldist}, we used Maximum Mean Discrepancy (MMD) to further constrain the distance between two distributions. MMD is a non-parametric metric that estimates the discrepancy between distributions by projecting data to a reproducing kernel Hilbert space with kernel functions. Concretely, let $X$ and $Y$ be two sets of samples with distributions $p$ and $q$, respectively. Let $x$ and $x'$ be different samples from $X$, and $y$ and $y'$ be different samples from $Y$. The MMD between these two distributions is defined as:
\begin{equation}
    MMD^2(p, q) = \mathbf{E}_{x, x'} [k(x, x')] - 2\mathbf{E}_{x,y}[k(x,y)] + \mathbf{E}_{y,y'}[k(y,y')]
\end{equation}
where $k$ is a pre-defined kernel function.

In practice, the expectations can be estimated by sample means:
\begin{equation}
    MMD^2(X, Y) = \frac{1}{m^2}\sum_{i,j=1}^m k(x_i, x_j) - \frac{2}{mn}\sum_{i,j=1}^{m,n}k(x_i,y_j) + \frac{1}{n^2}\sum_{i,j=1}^n k(y_i, y_j)
\end{equation}
where $m$ and $n$ are the sample sizes of $X$ and $Y$.

In Eq.~\eqref{ldist}, the loss function $L_{dist}$ uses MMD to constrain the distribution distances of two pairs of samples: 1) $f_{\phi^P}(X)$ and random samples from $N(0, I)$; 2) $f_{\phi^B}(X)$ and $f_{\phi^B}(\hat X^B)$.
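
As a small sanity check of the sample estimator, take a single sample from each set, $m=n=1$. This is only a sketch: the kernel is not specified in this section, so the Gaussian kernel $k(x,y)=\exp(-\|x-y\|^2/(2\sigma^2))$ with bandwidth $\sigma$ is assumed for illustration:
\begin{equation*}
    MMD^2(X,Y) = k(x_1,x_1) - 2k(x_1,y_1) + k(y_1,y_1) = 2 - 2\exp\left(-\frac{\|x_1-y_1\|^2}{2\sigma^2}\right),
\end{equation*}
which is zero when $x_1=y_1$ and approaches 2 as the two samples move apart.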

\section{Implementation details}\label{AppendixB}
Following the implementation information in Section~\ref{sec:4.2}, we provide further implementation details including batch size, learning rate, hyper-parameters, and hardware information below.

To train the two networks in DRLB, we used a batch size of 8 for simulated data and a batch size of 32 for real-world data to accelerate the training. The initial learning rate was set to 0.001, with a decay rate of 0.5 every ten epochs. In our experiments, the model converged well, and training for 100 epochs was sufficient for all the tested data. The hyper-parameter $\lambda$ was set to 0.3 for all the data, and $\alpha$ was set between 2 and 3. We performed 10 runs and reported the averaged results for each input data.
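
Read as a step schedule (an assumption, since the exact scheduler is not stated here; the symbol $\eta_t$ is introduced only for illustration), the learning rate at epoch $t=0,\dots,99$ is
\begin{equation*}
    \eta_t = 0.001\cdot 0.5^{\lfloor t/10\rfloor},
\end{equation*}
so the rate in the final ten epochs would be $0.001\cdot 0.5^{9}\approx 1.95\times 10^{-6}$.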

All analyses were conducted on a laptop with a 12th Gen Intel Core i7-12700 CPU and an NVIDIA GeForce RTX 3060 Ti GPU.

\section{Running speed and computational efficiency}\label{AppendixC}
DRLB is a deep neural network-based method that relies on GPU computation. Note that DRLB is designed to be combined with BMF methods. Thus, instead of deriving the theoretical computational complexity of the DRLB algorithm, we evaluated its running speed on our testing data and compared its running time with that of the BMF methods. Among the SOTA BMF methods, MEBF is one of the fastest, while ASSO, PANDA, and CG have relatively slower but acceptable running speeds. In our analysis, DRLB was generally faster than or comparable to the BMF methods ASSO, PANDA, and CG. On both simulated and real-world data, the running time of DRLB is about 10 times that of MEBF. Considering that MEBF is a highly scalable method, we conclude that DRLB does not add significant running cost when combined with existing BMF methods and that it can be applied to large datasets. DRLB also has a running speed similar to that of BIND.

\section{Systematic biases in single-cell RNA-sequencing data}\label{AppendixD}
Single-cell RNA-sequencing (scRNA-seq) data measures relative gene expression (or transcriptomic) abundance in a group of single cells. The typical form of scRNA-seq data is a matrix in which each row is a gene feature and each column is a single cell. Each element measures the relative expression level of a gene in a cell. Given the high sparsity of scRNA-seq data, binarization is commonly used in scRNA-seq data processing and analysis. A common binarization approach is to use a hard cutoff: all values larger than the cutoff are assigned 1 and all other values are assigned 0.

Here we discuss why binarized scRNA-seq data generally contain systematic biases as described in Eq.~\eqref{second_eq} and Eq.~\eqref{third_eq}, especially data generated with the 10x Chromium or other drop-seq protocols \citep{andrews2021tutorial, jovic2022single}. Specifically, 10x Chromium and other drop-seq protocols generate a pooled library of 5000-20000 single cells and then sequence the pooled library to measure a cell-wise gene expression profile. In the pooled library, some cells may have more mRNA molecules amplified and measured, while others may have a lower mRNA amplification rate and fewer mRNA molecules measured, mainly because of the stochasticity of the biochemical reactions of amplification and sequencing. Thus, in 10x Chromium and other drop-seq data, the total signal (also called total counts) collected from each cell (column) can vary substantially. The variation of this total signal is similar to the variation in the total number of items purchased by different users in purchase history data. Some ``super-cells'' have much higher total signals (total counts) than other cells, like the super-buyers who tend to purchase more items in purchase history data. This variation is inherited by the binarized data. Hence, the variation of the total measured signals across cells naturally forms a column-wise bias in binarized scRNA-seq data, which follows Eq.~\eqref{second_eq} and Eq.~\eqref{third_eq}.

On the other hand, the mRNA abundance of some genes, such as metabolism, cell structure, and other housekeeping genes, is always high, while genes such as transcription factors or signaling molecules always have low expression levels in cells. Similarly, this gene-wise variation is inherited by the binarized data. Thus, the difference in the natural expression levels of different genes forms a row-wise bias in binarized scRNA-seq data, which follows Eq.~\eqref{second_eq} and Eq.~\eqref{third_eq}.

In summary, the scRNA-seq data generated by 10x Chromium and other drop-seq protocols naturally contain bias caused by the varied distribution of row-wise and column-wise signals. Thus, this data type forms a suitable real-world testing data type to benchmark DRLB. In this study, both of the selected testing datasets, GSE125449 and GSE140228, were generated using the 10x Chromium or drop-seq protocols.

\end{document}
