Abstract: Deciphering protein folding and unfolding pathways under tension is essential for deepening our understanding of fundamental biological mechanisms. Such insights hold the promise of developing treatments for a range of debilitating and fatal conditions, including muscular disorders like Duchenne Muscular Dystrophy and neurodegenerative diseases such as Parkinson's disease. Single molecule force spectroscopy (SMFS) is a powerful technique for investigating the forces involved in protein domain folding and unfolding. However, SMFS trials often involve multiple protein molecules, necessitating filtering to isolate measurements from single-molecule trials. Currently, manual visual inspection is the primary method for classifying single-molecule data, a process that is both time-consuming and dependent on significant expertise. Here, we both apply state-of-the-art machine learning models and present a novel deep learning model tailored to SMFS data. The proposed model employs a dual-branch fusion strategy: one branch integrates the physics of protein molecules, and the other operates independently of physical constraints. This model automates the isolation of single-molecule measurements, significantly enhancing data processing efficiency. To train and validate our approach, we developed a physics-based Monte Carlo engine to simulate force spectroscopy datasets, including trials involving single molecules, multiple molecules, and no molecules. Our model achieves state-of-the-art performance, outperforming five baseline methods on both simulated and experimental datasets. It attains nearly 100\% accuracy across all simulated datasets and an average accuracy of $79.6 \pm 5.2$\% on experimental datasets using only $\sim$30 training samples, surpassing baseline methods by 11.4\%. Notably, even without expert annotations on experimental data, the model achieves an average accuracy of $72.0 \pm 5.9$\% when pre-trained on corresponding simulated datasets.
With our deep learning approach, the time required to extract meaningful statistics from single-molecule SMFS trials is reduced from a day to under an hour. This work also contributes SMFS experimental datasets for four protein molecules that are crucial to many biological pathways. To support further research, we have made our datasets publicly available and provided a Python-based toolbox (https://github.com/SalapakaLab-SIMBioSys/SMFS-Identification).
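To make the idea of simulating labelled trials concrete, the following is a minimal sketch of a Monte Carlo generator that produces force-extension curves for trials with no molecule, a single molecule, or multiple molecules. It is not the authors' engine: the worm-like chain interpolation formula, the Gaussian rupture thresholds, every numerical parameter, and the names `wlc_force`, `Molecule`, and `simulate_trial` are assumptions chosen only for illustration.

```python
import random

KBT = 4.11  # thermal energy near room temperature, in pN*nm


def wlc_force(extension, contour_length, persistence_length=0.4):
    """Worm-like chain interpolation formula; returns force in pN."""
    # Cap the relative extension to stay away from the singularity at x = 1.
    x = min(extension / contour_length, 0.99)
    return (KBT / persistence_length) * (0.25 / (1.0 - x) ** 2 - 0.25 + x)


class Molecule:
    """One tethered molecule with randomly drawn rupture thresholds.

    All default values are hypothetical and not taken from the paper.
    """

    def __init__(self, rng, n_domains=4, linker_nm=30.0, delta_lc_nm=28.0):
        self.contour = linker_nm
        self.delta_lc = delta_lc_nm
        # One unfolding threshold per domain, then a final detachment threshold.
        self.thresholds = [rng.gauss(200.0, 25.0) for _ in range(n_domains)]
        self.thresholds.append(rng.gauss(400.0, 50.0))
        self.attached = True

    def force_at(self, extension):
        """Return the force at this extension and apply at most one rupture event."""
        if not self.attached:
            return 0.0
        force = wlc_force(extension, self.contour)
        if force >= self.thresholds[0]:
            self.thresholds.pop(0)
            if self.thresholds:
                self.contour += self.delta_lc  # a domain unfolds, the chain lengthens
            else:
                self.attached = False  # the tether detaches
        return force


def simulate_trial(rng, max_ext_nm=200.0, step_nm=0.5, noise_pn=5.0):
    """Simulate one pulling trial; returns (curve, label)."""
    n_molecules = rng.choice([0, 1, 2, 3])
    molecules = [Molecule(rng) for _ in range(n_molecules)]
    curve = []
    for i in range(int(max_ext_nm / step_nm)):
        ext = i * step_nm
        # Molecules pulled in parallel contribute additively to the measured force.
        force = sum(m.force_at(ext) for m in molecules) + rng.gauss(0.0, noise_pn)
        curve.append((ext, force))
    label = "none" if n_molecules == 0 else "single" if n_molecules == 1 else "multiple"
    return curve, label


if __name__ == "__main__":
    rng = random.Random(0)
    dataset = [simulate_trial(rng) for _ in range(5)]
    for curve, label in dataset:
        print(label, len(curve))
```

Curves generated this way carry their class label by construction, which is what allows a classifier to be pre-trained without expert annotation. A realistic engine would also model loading-rate-dependent unfolding kinetics and instrument effects, which this sketch omits.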
Lay Summary: Understanding how proteins fold and unfold under mechanical forces is important for explaining biological processes. Such insights hold the promise of developing treatments for serious conditions, including muscular disorders like Duchenne Muscular Dystrophy and neurodegenerative diseases such as Parkinson’s disease. Single molecule force spectroscopy (SMFS) is a powerful technique for investigating the forces involved in protein folding and unfolding. However, real-world SMFS data often include trials from multiple proteins, which can confound the results. To interpret these data correctly, researchers need to isolate clean, single-molecule trials, a process that traditionally required manual inspection by experts and could take an entire day for just one experiment.
In this work, we automated this filtering process by applying several state-of-the-art machine learning models and introducing a novel deep learning model that incorporates physical knowledge of proteins. We also provide a simulation engine to generate realistic datasets alongside extensive experimental data from a variety of proteins. Our model outperforms existing methods and reduces the time needed to extract meaningful results from a full day to under an hour. To support future research, we have made our datasets and Python-based toolbox publicly available.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/SalapakaLab-SIMBioSys/SMFS-Identification
Primary Area: Applications->Chemistry, Physics, and Earth Sciences
Keywords: Single molecule force spectroscopy, protein unfolding, experimental datasets, application in single molecule identification, physics-augmented deep learning framework, physics-based Monte Carlo simulation
Submission Number: 5208