Keywords: Multi-modal foundation models, Structure-aware protein language models, Biomolecular representation learning, Protein design, Drug discovery, NANOBODY, VHH antibodies
TL;DR: SaNano is a multimodal VHH foundation model that learns from sequence and predicted structure. Fine-tuned on nanobody sequence-structure pairs, it improves contact prediction, developability prediction, and CDR3 reconstruction for protein design.
Abstract: Single-domain antibodies such as camelid VHH (or NANOBODY®) are increasingly adopted as therapeutics due to their compact architecture, stability, and favorable developability. However, their distinct structural features limit transferability from general protein language models, and the limited number of available sequences constrains training large VHH-specific foundation models from scratch. We introduce SaNano, a structure-aware VHH language model that unifies sequence and Foldseek 3Di structural tokens, fine-tuned from SaProt. Trained on curated VHH sequences paired with NanobodyBuilder2 structures and mixed with SwissProt/AlphaFold data, SaNano interleaves structure-conditioned and sequence-only masked language modeling to internalize structural priors while retaining strong sequence-only inference. SaNano outperforms general protein and antibody/VHH baseline models on few-shot contact prediction, biophysical property prediction, and sequence reconstruction (pseudo-perplexity), with especially large gains in the highly variable complementarity-determining region 3 (CDR3). Crucially, structure-aware fine-tuning improves sequence-only performance, reducing reliance on costly structure prediction in high-throughput screening. SaNano is available at \url{https://huggingface.co/novonordisk-red/SaNano}.
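As a rough illustration of two ideas named in the abstract, the sketch below pairs each residue with a 3Di structural letter and computes a pseudo-perplexity by masking one residue at a time. It is not the SaNano implementation: the helper names, the `#` placeholder (modelled on the SaProt-style convention as we understand it), the 3Di string, and the uniform toy scorer are all assumptions made for illustration. A real evaluation would replace `uniform_log_prob` with log-probabilities from the model's masked-language-modeling head.

```python
import math
from typing import Callable, List, Optional

MASK = "#"  # assumed placeholder for a hidden residue or a missing 3Di state


def structure_aware_tokens(seq: str, tdi: Optional[str] = None) -> List[str]:
    """Pair each amino acid with a lowercase 3Di letter.

    When no 3Di string is given, every structure slot is filled with MASK,
    which corresponds to sequence-only input.
    """
    if tdi is None:
        tdi = MASK * len(seq)
    if len(seq) != len(tdi):
        raise ValueError("sequence and 3Di string must have equal length")
    return [aa.upper() + s.lower() for aa, s in zip(seq, tdi)]


def pseudo_perplexity(
    tokens: List[str],
    log_prob_fn: Callable[[List[str], int, str], float],
) -> float:
    """exp of the negative mean log-probability of each residue when masked.

    log_prob_fn(masked_tokens, position, residue) is a hypothetical scorer
    returning log p(residue | all other tokens).
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    total = 0.0
    for i, tok in enumerate(tokens):
        masked = list(tokens)
        masked[i] = MASK + tok[1]  # hide the residue, keep its structure slot
        total += log_prob_fn(masked, i, tok[0])
    return math.exp(-total / len(tokens))


def uniform_log_prob(masked: List[str], position: int, residue: str) -> float:
    """Toy scorer: every one of the 20 amino acids is equally likely."""
    return math.log(1.0 / 20.0)


# Illustrative fragment; the 3Di letters here are made up, not predicted.
tokens = structure_aware_tokens("QVQLVE", "dvvvpd")
seq_only = structure_aware_tokens("QVQLVE")
ppl = pseudo_perplexity(seq_only, uniform_log_prob)
```

With the uniform toy scorer the pseudo-perplexity comes out at approximately 20, the size of the amino-acid alphabet; a model that reconstructs masked residues well would score lower. Restricting the loop to CDR3 positions would give a region-specific value of the kind the abstract highlights.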
Submission Number: 11