

% moving to npj style 

\begin{figure*}[t]
    \centering
    % \rule{\textwidth}{0.3\textheight} % Placeholder box with width and height
    \includegraphics[width=\linewidth]{sec/figs/fig2v6.png}
    \caption{\textbf{Overview of our IntraPair InterCluster (IPIC) approach.} 
    Following Algorithm~\ref{alg:ipic}, given an unpaired dataset (triangles and circles represent different modalities, with colors indicating treatment groups), we first perform matching within each treatment group to generate a pseudo-paired dataset.
    Next, we use encoders $\phi^1$ and $\phi^2$ to produce embeddings $h^1$ and $h^2$.
    We then apply the matching scores $M_t$ as weights on the embeddings ($\textbf{v}$) generated by the projection head $f(\cdot)$ for intra-treatment group learning with the $\mathcal{L}_{intra}$ objective (Eq.~\ref{eq:lintra}).
    Finally, we cluster the embeddings ($\textbf{u}$) generated by the projection head $g(\cdot)$ to
    generate pseudo-labels for inter-treatment group learning with the $\mathcal{L}_{inter}$ objective (Eq.~\ref{eq:linter}).
    % color are treatment hsape are modalties [done]
    % notation changing to writing, with encoder as $\phi$, embedding as $h$, head as $f(\cdot)$ and $fg\cdot)$ [done]
    }
    \label{fig:methodoverview}
    \vspace{-13pt}
\end{figure*}



\section{Introduction \& Related Work}
\label{sec:intro}



Recent advances in high-throughput screening technologies have facilitated the collection of large-scale biological datasets in various modalities~\citep{way2023evolution,larsson2021spatially,stoeckius2017simultaneous,baek2020single,boutros2015microscopy}. As each modality alone provides a limited view, integrating them is crucial for forming a more complete picture of biology and ultimately enhancing our understanding of the underlying mechanisms~\citep{bunne2024buildvirtualcellartificial,heumos2023best}.

In parallel, recent advances in self-supervised learning have enabled impressive multimodal capabilities in domains like computer vision and natural language processing, including zero-shot classification~\citep{radford2021learning}, text-to-image generation~\citep{rombach2022high, ramesh2022hierarchical}, and text-based image editing~\citep{hertz2023prompttoprompt}. However, these successes have relied on \emph{paired} samples across modalities, such as images and their associated captions. For example, CLIP~\citep{radford2021learning} uses an InfoNCE loss~\citep{gutmann2010noise, oord2018representation} to learn representations that maximize the similarity between each image and its true caption relative to the other captions in the batch.
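For concreteness, one common symmetric form of this objective for a batch of $N$ image--caption pairs is sketched below. The notation is introduced here purely for illustration and is not taken from the cited works: $z^{I}_i$ and $z^{T}_i$ denote the normalized representations of the $i$-th image and its caption, and $\tau$ is a temperature.
\[
\mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp\left(\langle z^{I}_i, z^{T}_i\rangle/\tau\right)}{\sum_{j=1}^{N}\exp\left(\langle z^{I}_i, z^{T}_j\rangle/\tau\right)} + \log\frac{\exp\left(\langle z^{I}_i, z^{T}_i\rangle/\tau\right)}{\sum_{j=1}^{N}\exp\left(\langle z^{I}_j, z^{T}_i\rangle/\tau\right)}\right]
\]
Each term treats the $i$-th caption (respectively, image) as the only positive for the $i$-th image (respectively, caption) and the remaining $N-1$ batch elements as negatives, which is why such a loss presupposes that the true pairings are known.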

In biology, paired samples are often difficult (if not impossible) to collect due to the \emph{destructive nature} of many measurement devices, e.g., single-cell RNA sequencing, protein expression profiling, and high-content microscopy\footnote{See Figure~\ref{fig:unpairbio}.}. Thus, samples must be \textit{indirectly linked} through shared experimental conditions, such as gene knockouts or chemical treatments.
To this end, several recent studies have explored the potential of multimodal contrastive learning in the biological domain~\citep{bao2022integrative,yang2022contrastive,fradkin2024molecules}, with promising results in predicting cell states~\citep{min2024multimodal}, identifying phenotypic changes~\citep{bao2022integrative}, and assessing perturbation outcomes~\citep{fradkin2024molecules}. In addition, others have explored approximate matching techniques~\citep{xi2024propensity,ryu2024cross} and anchors~\citep{stuart2019comprehensive}, along with partial pairing~\citep{zhu2023robust,tu2022cross} and indirect links~\citep{liu2020jointly,lamiable2023revealing,hao2021integrated}. While promising, these methods mostly assume consistent underlying correlations between modalities. However, in complex biological contexts, these correlations are often non-linear and noisy. For instance, a treatment may alter gene expression without visibly affecting cell morphology, or similar treatments may yield divergent effects across modalities. Over-reliance on such techniques can force alignments where true biological correlations may not exist, degrading the quality of the learned embeddings. Our approach seeks to mitigate this by enabling more biologically relevant alignments.

The contributions of our work are as follows:
\begin{itemize}
    \item We introduce \textbf{IntraPair InterCluster}~(IPIC), a novel contrastive approach for unpaired biological datasets. IPIC aligns modalities via two complementary strategies: using treatment-group labels for intra-treatment \textit{matching} and leveraging inherent modality structures for inter-treatment \textit{clustering}.
    \item Unlike previous approaches, IPIC effectively leverages both shared experimental conditions \emph{and} intrinsic modality patterns without requiring paired samples, producing embeddings that are both accurate and biologically meaningful.
    \item In comprehensive experiments on four real-world ``omics'' datasets (phenomics and transcriptomics), we demonstrate that IPIC consistently outperforms baselines, including random pairing and weakly-supervised contrastive methods~\citep{alwassel2020self, zheng2021weakly}, providing a foundation for extending unpaired contrastive learning to new domains.
\end{itemize}


% OLD INTRODUCTION (feel free to restore!) ------------------------------------------------------------

% Self-supervised learning (SSL) and contrastive learning (CL) have achieved substantial success in multimodal representation learning, particularly in domains like computer vision and natural language processing.  Models such as CLIP~\cite{radford2021learning} leverage image-text pairs with contrastive objectives like the InfoNCE loss~\cite{} to align cross-modal representations, enabling breakthroughs in zero-shot learning, transfer learning, and generative models. 

% These methods, however, rely on the availability of paired data, which is abundant in internet-scale datasets. In stark contrast, applying the same techniques to biological data presents unique challenges. High-throughput biological assays, such as single-cell RNA sequencing, protein expression profiling, and high-content microscopy imaging, are inherently destructive processes. Once a sample undergoes a specific assay, it usually cannot be used for another measurement, making it impossible to collect truly paired data across multiple modalities from the same sample. Consequently, biological datasets often lack the explicit pairings needed for traditional multimodal CL frameworks. Instead, samples can only be indirectly linked through shared experimental conditions, such as gene knockouts or chemical treatments.

% Despite these limitations, several recent studies have explored the potential of multimodal contrastive learning in the biological domain~\cite{}. They have shown promise in predicting cell states, identifying phenotypic changes, and assessing perturbation outcomes. Researchers have also attempted to address the challenge of unpaired biological data by treating it as a translation problem across modalities, often using matching techniques~\cite{}. Many approaches rely on partial pairing, indirect links, or approximate anchors to align modalities.  While promising, these methods often assume underlying correlations that may not exist in complex biological contexts, where relationships between modalities are non-linear and noisy \cite{}.

% In this work, we introduce a framework for Unpaired Contrastive Learning (UCL) of biological data where explicit pairings between modalities are often unavailable or infeasible. Unlike traditional multimodal CL methods that rely on paired samples, our approach leverages both shared experimental conditions (e.g treatment labels or perturbation types) and internal structure of each modality to guide cross-modality alignment. This allows us to bridge different omics modalities, capturing meaningful biological relationships that improve performance downstream tasks without requiring direct pairings.

% We demonstrate that, unlike our proposed approach, IntraPair and InterCluster (IPIC), random pairing within treatment groups significantly undermines contrastive learning performance. Moreover, naive translation and existing weakly-supervised approaches prove to be suboptimal for downstream biological tasks. Our contributions are summarized as follows:

% \begin{itemize} 

% \item We highlight a commonly overlooked issue in biological multimodal learning: the assumption of pairings. We formally define unpaired biological contrastive learning to address this gap, revealing the limitations of existing approaches on unpaired data.
% \item We introduce a new method to address that problem that combines contrastive learning with representation clustering and matching techniques. By leveraging shared treatment labels and exploiting intrinsic patterns within each modality, our approach align unpaired modalities more effectively, improving  representation learning. 
% \item We conduct comprehensive experiments on real omics datasets (phenomics and transcriptomics), demonstrating superior performance compared to existing baselines.

% \end{itemize}

% Our work not only enhances biological data analysis but also provides a foundation for extending unpaired contrastive learning to other domains where paired data is scarce, thus expanding the scope of SSL in scientific research. 

% OLD INTRODUCTION END --------------------------------------------------------



% \subsection*{Contribution}
% \begin{itemize}
%     \item Contribution 1: We show the community a normally overlooked problem in bio CL, pairness To treat this problem formally, we define new problem called unpaired CL
%     \item Contribution 2: We propose a better/first solution to solve unpaired CL
%     \item Contribution 3: Our extensive experiments show the importance of not ignoring the un-pairedness of the datasets and our methods paves new way to learn better representation across different data modalities
% \end{itemize}


% \emph{Skeleton}
% \begin{itemize}
%     \item ssl and cl from the natural dataset (image and text) are helping biomodality
%     \item however, the naturally paired dataset is not available due to tech
%     \item prior work on biomodal assumes pair at random
%     \item in this paper, we show pair at random negatively affect your representation
%     \item we define this problem as unpaired contrastive learning problem
%     \item we also provide solution to treat this paired data with help from matching and representation clustering
%     \item we show extensive experiments on relevant bio datasets include sequence and image as modalities (identified downstream tasks) and show our method outperforms all prior baselines that assume paired samples
% \end{itemize}


% \newpage