EpitopeGen: Learning to Generate T Cell Epitopes: A Semi-Supervised Approach with Biological Constraints
Keywords: protein language model, T-cell recognition, generative transformer
TL;DR: We propose a semi-supervised learning approach that leverages large amounts of unpaired data for generative modeling of epitopes from T-cell receptor sequences. EpitopeGen generates epitopes with high binding affinity and naturalness.
Abstract: Single-cell TCR sequencing enables high-resolution analysis of T-cell receptor (TCR) diversity and clonality, offering valuable insights into immune responses and disease mechanisms. However, identifying cognate epitopes for individual TCRs requires complex and costly functional assays. We address this challenge with EpitopeGen, a large-scale transformer model based on the GPT-2 architecture that generates potential cognate epitope sequences directly from TCR sequences. To overcome the scarcity of known TCR-epitope binding pairs ($\approx100,000$), EpitopeGen uses a semi-supervised learning method, termed BINDSEARCH, which searches over 70 billion candidate pairs and incorporates those predicted to bind with high affinity as pseudo-labels. To build CD8$^+$ T cell biology into the model as an inductive bias, EpitopeGen employs a novel data balancing method, termed Antigen Category Filter, that carefully controls the ratios of antigen categories in its training dataset. EpitopeGen significantly outperforms baseline approaches, generating epitopes with high binding affinity, diversity, naturalness, and biophysical stability.
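The abstract does not give implementation details, but the BINDSEARCH pseudo-labeling step might look roughly like the sketch below. The predictor `score_fn`, the confidence `threshold`, and the exhaustive pair enumeration are illustrative assumptions, not the authors' implementation; scanning the stated $\approx$70 billion pairs would in practice require batched model inference rather than a nested Python loop.

```python
# Minimal sketch of BINDSEARCH-style pseudo-labeling, assuming a trained
# TCR-epitope binding-affinity predictor is available as score_fn.
# All names and the threshold value are hypothetical placeholders.
from itertools import product
from typing import Callable, List, Tuple

def bindsearch(
    tcrs: List[str],                        # unpaired TCR sequences
    epitope_pool: List[str],                # candidate epitope sequences
    score_fn: Callable[[str, str], float],  # assumed affinity predictor in [0, 1]
    threshold: float = 0.9,                 # assumed pseudo-label cutoff
) -> List[Tuple[str, str]]:
    """Scan TCR x epitope pairs; keep high-affinity ones as pseudo-labels."""
    pseudo_pairs = []
    for tcr, epi in product(tcrs, epitope_pool):
        if score_fn(tcr, epi) >= threshold:
            pseudo_pairs.append((tcr, epi))
    return pseudo_pairs

# Toy usage with a stand-in predictor (real use would call a trained model).
if __name__ == "__main__":
    toy_score = lambda tcr, epi: 1.0 if len(tcr) == len(epi) else 0.0
    pairs = bindsearch(["CASSLG", "CASSQE"], ["GILGFV", "NLVPMV"], toy_score)
    print(pairs)  # pseudo-labeled (TCR, epitope) pairs above the threshold
```

The retained pairs would then be mixed with the experimentally verified pairs to train the generative model, with the Antigen Category Filter controlling the category ratios of the pseudo-labeled epitopes.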
Submission Number: 97