Keywords: antibodies, language model, PLM, synthetic data
TL;DR: SynPair efficiently pairs billions of unpaired antibody chains (VH–VL) using dual encoders trained with contrastive learning and fast ANN search.
Abstract: Large-scale antibody sequence datasets, such as the Observed Antibody Space (OAS), contain billions of unpaired heavy (VH) and light (VL) chain sequences but fewer than 0.2% paired sequences, limiting the performance of antibody language models trained on these resources. Existing computational antibody pairing models, such as ImmunoMatch, achieve promising accuracy but rely on computationally intensive cross-encoder architectures, making large-scale synthetic pairing infeasible. Here, we reframe antibody chain pairing as a dense retrieval problem and introduce SynPair, a dual-encoder model trained with a contrastive InfoNCE loss that achieves state-of-the-art pairing accuracy while dramatically reducing computational requirements. SynPair can pair the entire unpaired OAS corpus (over 2 billion sequences) in under 24 hours on standard HPC resources, a task that was previously computationally intractable. The synthetically paired libraries generated by SynPair closely match naturally occurring antibody pairing distributions, yielding a biologically realistic, massively expanded paired dataset for antibody language model pre-training.
Submission Number: 170
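For readers unfamiliar with the dual-encoder retrieval setup described in the abstract, the sketch below illustrates the general idea in PyTorch: two independent encoders map VH and VL sequences into a shared embedding space, and a symmetric InfoNCE loss over in-batch negatives pulls true pairs together while pushing mismatched chains apart. This is a minimal illustrative sketch under our own assumptions; the encoder architecture, dimensions, tokenization, and all names here are hypothetical and are not the authors' SynPair implementation, and the fast ANN indexing step used for corpus-scale pairing is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChainEncoder(nn.Module):
    """Toy chain encoder: embeds amino-acid tokens and mean-pools them into a
    fixed-size, unit-normalized representation (hypothetical stand-in for the
    paper's VH/VL encoders)."""
    def __init__(self, vocab_size: int = 25, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) integer residue ids
        x = self.embed(tokens).mean(dim=1)          # mean-pool over residues
        return F.normalize(self.proj(x), dim=-1)    # unit-norm embeddings

def info_nce_loss(vh_emb: torch.Tensor, vl_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE with in-batch negatives: each VH should score highest
    against its true VL partner, and vice versa."""
    logits = vh_emb @ vl_emb.T / temperature        # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Minimal usage with random data standing in for tokenized paired VH/VL chains.
vh_encoder, vl_encoder = ChainEncoder(), ChainEncoder()
vh_tokens = torch.randint(0, 25, (8, 120))   # 8 heavy chains, length 120
vl_tokens = torch.randint(0, 25, (8, 110))   # their paired light chains
loss = info_nce_loss(vh_encoder(vh_tokens), vl_encoder(vl_tokens))
loss.backward()
```

Because the two encoders are independent, chain embeddings can be precomputed once and queried with an approximate-nearest-neighbor index, which is what makes corpus-scale pairing feasible compared with a cross-encoder that must score every candidate pair jointly.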