BaggingCPP: An Inductive PU-Learning Framework for Discovering Cell-Penetrating Peptides

Published: 06 Oct 2025, Last Modified: 06 Oct 2025, Venue: NeurIPS 2025 2nd Workshop FM4LS Poster, License: CC BY 4.0
Keywords: cell-penetrating peptides, positive-unlabeled learning, protein language models, LoRA, light attention, bagging ensembles
TL;DR: BaggingCPP uses inductive PU learning with protein LMs and ensembling to discover novel cell-penetrating peptides, including two validated experimentally
Abstract: Cell-penetrating peptides (CPPs) are a promising approach for intracellular delivery of diverse molecular cargos. Although hundreds of CPPs have been characterized, most are cationic peptides with limited penetration efficiency or poor pharmaceutical properties; new high-throughput discovery approaches are thus needed. Here we introduce BaggingCPP, a machine learning-based CPP discovery framework that integrates inductive Positive-Unlabeled (PU) learning, protein language models, and parameter-efficient fine-tuning algorithms. Unlike prior work, we avoid an artificial negative set, which introduces distribution shift, and instead use PU learning to train and run inference on the same dataset: a large corpus of naturally expressed peptides such as hormones, neuropeptides, and small proteins. BaggingCPP reaches a cross-validated AUC-ROC of 0.984 on our dataset and matches the performance of the state-of-the-art GraphCPP when both methods are trained and evaluated on the public CPP1708 benchmark. We used BaggingCPP to identify several candidate CPPs with low similarity to any known CPP and experimentally validated two. BaggingCPP thus represents a data-driven, biologically grounded route to expand the chemical diversity of known CPPs.
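The general idea of bagging-based PU learning named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the base learner is a toy nearest-centroid scorer standing in for a fine-tuned protein language model, peptides are assumed to be already embedded as fixed-length numeric vectors, and all function names and the balanced sampling size are hypothetical choices made for illustration.

```python
import random


def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_base(positives, pseudo_negatives):
    """Toy base learner (assumption): nearest-centroid scorer.

    Returns a function whose output is higher when x lies closer to the
    positive centroid than to the pseudo-negative centroid.
    """
    c_pos = centroid(positives)
    c_neg = centroid(pseudo_negatives)
    return lambda x: sq_dist(x, c_neg) - sq_dist(x, c_pos)


def fit_bagging_pu(positives, unlabeled, n_estimators=50, seed=0):
    """Train an ensemble; each member treats a bootstrap sample of the
    unlabeled pool as provisional negatives."""
    rng = random.Random(seed)
    k = len(positives)  # balanced draw size: an illustrative assumption
    models = []
    for _ in range(n_estimators):
        sample = [rng.choice(unlabeled) for _ in range(k)]  # with replacement
        models.append(fit_base(positives, sample))
    return models


def score(models, x):
    """Average ensemble score; also works for points never seen in training,
    which is what makes the ensemble usable inductively."""
    return sum(m(x) for m in models) / len(models)
```

Because each ensemble member sees a different random subset of the unlabeled pool, hidden positives contaminate only some of the provisional negative sets, and averaging dampens their effect. The fitted ensemble can then rank any peptide vector, including ones outside the training pool.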
Submission Number: 69