EpitopeGen: Learning to Generate T Cell Epitopes: A Semi-Supervised Approach with Biological Constraints
Keywords: protein language model, T-cell recognition, generative transformer
TL;DR: We propose a semi-supervised learning approach that leverages large amounts of unpaired data for generative modeling of epitopes from T-cell receptor sequences. EpitopeGen generates epitopes with high binding affinity and naturalness.
Abstract: Single-cell TCR sequencing enables high-resolution analysis of T-cell receptor (TCR) diversity and clonality, offering valuable insights into immune responses and disease mechanisms. However, identifying cognate epitopes for individual TCRs requires complex and costly functional assays. We address this challenge with EpitopeGen, a large-scale transformer model based on the GPT-2 architecture that generates potential cognate epitope sequences directly from TCR sequences. To overcome the scarcity of known TCR-epitope binding pairs ($\approx100,000$), EpitopeGen uses a semi-supervised learning method, termed BINDSEARCH, which searches over 70 billion candidate pairs and incorporates those predicted to bind with high affinity as pseudo-labels. To build CD8$^+$ T cell biology into the model as an inductive bias, EpitopeGen employs a novel data balancing method, termed Antigen Category Filter, that carefully controls the ratios of antigen categories in its training dataset. EpitopeGen significantly outperforms baseline approaches, generating epitopes with high binding affinity, diversity, naturalness, and biophysical stability.
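The abstract does not give implementation details, but the BINDSEARCH pseudo-labeling step might look roughly like the sketch below. The predictor `score_fn`, the confidence `threshold`, and the exhaustive pair enumeration are illustrative assumptions, not the authors' implementation; scanning the stated $\approx$70 billion pairs would in practice require batched model inference rather than a nested Python loop.

```python
# Minimal sketch of BINDSEARCH-style pseudo-labeling, assuming a trained
# TCR-epitope binding-affinity predictor is available as score_fn.
# All names and the threshold value are hypothetical placeholders.
from itertools import product
from typing import Callable, List, Tuple

def bindsearch(
    tcrs: List[str],                        # unpaired TCR sequences
    epitope_pool: List[str],                # candidate epitope sequences
    score_fn: Callable[[str, str], float],  # assumed affinity predictor in [0, 1]
    threshold: float = 0.9,                 # assumed pseudo-label cutoff
) -> List[Tuple[str, str]]:
    """Scan TCR x epitope pairs; keep high-affinity ones as pseudo-labels."""
    pseudo_pairs = []
    for tcr, epi in product(tcrs, epitope_pool):
        if score_fn(tcr, epi) >= threshold:
            pseudo_pairs.append((tcr, epi))
    return pseudo_pairs

# Toy usage with a stand-in predictor (real use would call a trained model).
if __name__ == "__main__":
    toy_score = lambda tcr, epi: 1.0 if len(tcr) == len(epi) else 0.0
    pairs = bindsearch(["CASSLG", "CASSQE"], ["GILGFV", "NLVPMV"], toy_score)
    print(pairs)  # pseudo-labeled (TCR, epitope) pairs above the threshold
```

The retained pairs would then be mixed with the experimentally verified pairs to train the generative model, with the Antigen Category Filter controlling the category ratios of the pseudo-labeled epitopes.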
Submission Number: 97