Track: Full Paper Track
Keywords: Active Learning, Few-shot Learning, Structure-aware Embeddings, Variant Detection, Protein Language Models, Gaussian Processes, SARS-CoV-2
TL;DR: We introduce an active learning framework that combines protein language models, Gaussian processes, and biophysical modeling to efficiently predict high-fitness SARS-CoV-2 variants with minimal experimental data.
Abstract: Early detection of high-fitness SARS-CoV-2 variants is crucial for pandemic response, yet limited experimental resources hinder timely identification. We propose an active learning framework that integrates a protein language model, a Gaussian process with uncertainty estimation, and a biophysical model to predict the fitness of novel receptor-binding domain (RBD) variants in a few-shot learning setting. Our approach prioritizes the most informative variants for experimental characterization, accelerating high-fitness variant detection by up to 5× compared to random sampling while testing fewer than 1% of all possible variants. Benchmarking on deep mutational scans, we show that our method identifies evolutionarily significant sites, particularly those facilitating antibody escape. We systematically compare different acquisition strategies and demonstrate that incorporating uncertainty-driven exploration enhances coverage of the mutational landscape, enabling the discovery of evolutionarily distant yet high-risk variants. Our results suggest that this framework could serve as an efficient early warning system for identifying concerning SARS-CoV-2 variants before they achieve widespread circulation.
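The loop described in the abstract (fit a Gaussian process on a few measured variants, then pick the next variants to test using both predicted fitness and uncertainty) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes variant embeddings are already available as a NumPy array (in the paper these would come from a protein language model), it leaves out the biophysical model entirely, and it uses an upper-confidence-bound score as one plausible example of an uncertainty-driven acquisition strategy. All function names, the RBF kernel, and the parameters `beta`, `noise`, and `length_scale` are hypothetical choices made for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential kernel between the rows of a and the rows of b.
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def gp_posterior(x_train, y_train, x_pool, noise=1e-2, length_scale=1.0):
    # Zero-mean GP regression: posterior mean and standard deviation on x_pool.
    k_tt = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    k_pt = rbf_kernel(x_pool, x_train, length_scale)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_pt @ alpha
    v = np.linalg.solve(chol, k_pt.T)
    var = 1.0 - np.sum(v**2, axis=0)  # prior variance of the RBF kernel is 1
    return mean, np.sqrt(np.maximum(var, 0.0))

def select_batch(mean, std, tested, batch_size=5, beta=2.0):
    # Upper confidence bound (assumed acquisition rule): high predicted
    # fitness plus a bonus for uncertainty; already-tested variants are excluded.
    score = mean + beta * std
    score[list(tested)] = -np.inf
    return np.argsort(-score)[:batch_size]

def run_active_learning(embeddings, measure_fitness, n_rounds=5, batch_size=5,
                        beta=2.0, seed=0):
    # measure_fitness(i) stands in for an experimental assay on variant i.
    rng = np.random.default_rng(seed)
    tested = [int(i) for i in rng.choice(len(embeddings), size=batch_size, replace=False)]
    fitness = [measure_fitness(i) for i in tested]
    for _ in range(n_rounds):
        y = np.array(fitness)
        # Center the targets so the zero-mean GP assumption is reasonable.
        mean, std = gp_posterior(embeddings[tested], y - y.mean(), embeddings)
        for i in select_batch(mean + y.mean(), std, tested, batch_size, beta):
            tested.append(int(i))
            fitness.append(measure_fitness(int(i)))
    return tested, fitness
```

Setting `beta` to 0 reduces the rule to greedy selection on predicted fitness, while larger values favor variants far from anything measured so far, which is the exploration behavior the abstract credits with broader coverage of the mutational landscape.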
Attendance: Marian Huot
Submission Number: 71