Keywords: T cell receptor, Peptide recognition, Protein language model
Abstract: The interaction between the T cell receptor (TCR) and the peptide-human leukocyte antigen (pHLA) complex is a fundamental process underlying T cell-mediated immunity. Computational methods have been developed to predict TCR-pHLA binding, but most existing models were trained on relatively small datasets and focused solely on the Complementarity Determining Region 3 (CDR3) of the TCR $\beta$ chain. A key barrier to developing more advanced prediction models is the limited availability of comprehensive data covering these understudied prediction components. To address this, we developed the Hi-TPH dataset, which augments TCR-pHLA pairs with additional protein sequences and gene annotations. The dataset is stratified into five hierarchical subsets at four levels, ranging from Hi-TPH level I, which contains only the peptide sequence and TCR CDR3 $\beta$, to Hi-TPH levels II, III, and IV, which progressively incorporate HLA sequences, full TCR $\alpha$ and $\beta$ chains, and gene annotations. At every level, Hi-TPH is the largest dataset with the corresponding prediction components to date; for instance, the Hi-TPH level IV subset contains at least 5.99 times as many TCR-pHLA pairs as existing datasets with the same components. We further report benchmark results on the Hi-TPH dataset, establishing baselines for the TCR-pHLA binding prediction task. This comprehensive dataset and its associated benchmarks provide a valuable resource for developing advanced TCR-pHLA binding prediction models and for exploring research directions such as quantifying the contribution of different components and improving model generalization to unseen peptides, with potential applications in targeted therapies, including personalized vaccines and immunotherapies.
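To make the hierarchical levels concrete, the sketch below shows one way a record at each level might be represented. This is a minimal illustration only: the field names, example sequences, and gene annotations are assumptions for exposition and do not reflect the released dataset's actual schema.

```python
# Hypothetical record layout for the four Hi-TPH levels described in the abstract.
# Field names and example values are illustrative assumptions, not the real schema.
from dataclasses import dataclass
from typing import Optional


@dataclass
class HiTPHPair:
    peptide: str                           # peptide sequence (all levels)
    cdr3_beta: str                         # TCR CDR3 beta (all levels)
    hla_sequence: Optional[str] = None     # HLA sequence (level II and above)
    tcr_alpha_full: Optional[str] = None   # full TCR alpha chain (level III and above)
    tcr_beta_full: Optional[str] = None    # full TCR beta chain (level III and above)
    v_gene: Optional[str] = None           # gene annotations (level IV)
    j_gene: Optional[str] = None
    label: int = 0                         # 1 = binding, 0 = non-binding


# Level I record: only peptide and CDR3 beta are populated (example values are made up).
pair_level1 = HiTPHPair(peptide="GILGFVFTL", cdr3_beta="CASSIRSSYEQYF", label=1)

# Level IV record: all components present (placeholders stand in for full sequences).
pair_level4 = HiTPHPair(
    peptide="GILGFVFTL",
    cdr3_beta="CASSIRSSYEQYF",
    hla_sequence="<HLA sequence>",
    tcr_alpha_full="<full TCR alpha chain>",
    tcr_beta_full="<full TCR beta chain>",
    v_gene="TRBV19",
    j_gene="TRBJ2-7",
    label=1,
)
```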
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8966