Keywords: antibodies, language model, PLM, synthetic data
TL;DR: SynPair efficiently pairs billions of unpaired antibody chains (VH–VL) using dual encoders trained with contrastive learning and fast ANN search.
Abstract: Large-scale antibody sequence datasets, such as the Observed Antibody Space (OAS), contain billions of unpaired heavy (VH) and light (VL) chain sequences but fewer than 0.2% paired sequences, limiting the performance of antibody language models trained on these resources. Existing computational antibody pairing models, such as ImmunoMatch, achieve promising accuracy but rely on computationally intensive cross-encoder architectures, making large-scale synthetic pairing infeasible. Here, we reframe antibody chain pairing as a dense retrieval problem and introduce SynPair, a dual-encoder model trained with a contrastive InfoNCE loss that achieves state-of-the-art pairing accuracy while dramatically reducing computational requirements. SynPair can pair the entire unpaired OAS corpus (over 2 billion sequences) in under 24 hours on standard HPC resources, a task that was previously computationally intractable. The synthetically paired libraries generated by SynPair closely match naturally occurring antibody pairing distributions, yielding a biologically realistic, massively expanded paired dataset for antibody language model pre-training.
Submission Number: 170
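For readers unfamiliar with the dual-encoder retrieval setup described in the abstract, the sketch below illustrates the general idea in PyTorch: two independent encoders map VH and VL sequences into a shared embedding space, and a symmetric InfoNCE loss over in-batch negatives pulls true pairs together while pushing mismatched chains apart. This is a minimal illustrative sketch under our own assumptions; the encoder architecture, dimensions, tokenization, and all names here are hypothetical and are not the authors' SynPair implementation, and the fast ANN indexing step used for corpus-scale pairing is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChainEncoder(nn.Module):
    """Toy chain encoder: embeds amino-acid tokens and mean-pools them into a
    fixed-size, unit-normalized representation (hypothetical stand-in for the
    paper's VH/VL encoders)."""
    def __init__(self, vocab_size: int = 25, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) integer residue ids
        x = self.embed(tokens).mean(dim=1)          # mean-pool over residues
        return F.normalize(self.proj(x), dim=-1)    # unit-norm embeddings

def info_nce_loss(vh_emb: torch.Tensor, vl_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE with in-batch negatives: each VH should score highest
    against its true VL partner, and vice versa."""
    logits = vh_emb @ vl_emb.T / temperature        # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Minimal usage with random data standing in for tokenized paired VH/VL chains.
vh_encoder, vl_encoder = ChainEncoder(), ChainEncoder()
vh_tokens = torch.randint(0, 25, (8, 120))   # 8 heavy chains, length 120
vl_tokens = torch.randint(0, 25, (8, 110))   # their paired light chains
loss = info_nce_loss(vh_encoder(vh_tokens), vl_encoder(vl_tokens))
loss.backward()
```

Because the two encoders are independent, chain embeddings can be precomputed once and queried with an approximate-nearest-neighbor index, which is what makes corpus-scale pairing feasible compared with a cross-encoder that must score every candidate pair jointly.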