CPSea: Large-scale cyclic peptide-protein complex dataset for machine learning in cyclic peptide design
Keywords: cyclic peptide design, peptide-protein complexes, dataset curation
TL;DR: A synthetic cyclic peptide-protein complex dataset derived from AFDB that, for the first time, enables cyclic peptide binder design models to be trained from scratch.
Abstract: Cyclic peptides exhibit better binding affinity and proteolytic stability than their linear counterparts. However, the development of cyclic peptide design models is hindered by the scarcity of data. To address this, we introduce **CPSea** (**C**yclic **P**eptide **Sea**), a dataset of 2.71 million cyclic peptide-receptor complexes curated through systematic mining of the AlphaFold Database (AFDB). Our pipeline extracts compact domains from AFDB, identifies cyclization sites using $\beta$-carbon (C$_\beta$) distance thresholds, and applies multi-stage filtering to ensure structural fidelity and binding compatibility. Compared with experimental data on cyclic peptides, CPSea shows similar distributions in metrics of structural fidelity and wet-lab compatibility. To our knowledge, CPSea is the largest cyclic peptide-receptor dataset to date, enabling end-to-end model training for the first time. The dataset also demonstrates the feasibility of simulating inter-chain interactions with intra-chain interactions, expanding the resources available to machine-learning models of protein-protein interactions. The dataset and relevant scripts are accessible on GitHub ([https://github.com/YZY010418/CPSea](https://github.com/YZY010418/CPSea)).
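The abstract's cyclization-site step can be illustrated with a minimal sketch. This is not the CPSea pipeline: the function name `candidate_cyclization_pairs`, the 6.0 Å C$_\beta$ distance threshold, and the ring-size bounds are all hypothetical values chosen for illustration, since the abstract does not state them. The sketch only shows the general idea of scanning a chain for residue pairs whose C$_\beta$ atoms lie close enough that the segment between them could plausibly be closed into a ring.

```python
import math
from typing import List, Sequence, Tuple

Coord = Sequence[float]


def candidate_cyclization_pairs(
    cb_coords: List[Coord],
    max_cb_dist: float = 6.0,  # hypothetical threshold in angstroms (not from the paper)
    min_len: int = 4,          # hypothetical minimum segment length in residues
    max_len: int = 16,         # hypothetical maximum segment length in residues
) -> List[Tuple[int, int, float]]:
    """Return (start, end, distance) for segments whose terminal C-beta atoms are close.

    cb_coords holds one (x, y, z) C-beta coordinate per residue, in chain order.
    A segment spans residues start..end inclusive, so its length is end - start + 1.
    """
    pairs = []
    n = len(cb_coords)
    for i in range(n):
        # Only consider end residues that give a segment length in [min_len, max_len].
        for j in range(i + min_len - 1, min(n, i + max_len)):
            d = math.dist(cb_coords[i], cb_coords[j])
            if d <= max_cb_dist:
                pairs.append((i, j, d))
    return pairs


# Toy hairpin: six residues laid out on a 3.8 angstrom grid (illustrative coordinates only).
toy_chain = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0),
]
for start, end, dist in candidate_cyclization_pairs(toy_chain):
    print(f"residues {start}-{end}: C-beta distance {dist:.2f}")
```

In a real pipeline, pairs found this way would only be candidates; the abstract indicates that further multi-stage filtering for structural fidelity and binding compatibility follows.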
Croissant File: json
Dataset URL: https://www.kaggle.com/datasets/ziyiyang180104/cpsea/
Code URL: https://github.com/YZY010418/CPSea
Primary Area: AI/ML Datasets & Benchmarks for life sciences (e.g. climate, health, life sciences, physics, social sciences)
Submission Number: 1691