Keywords: host-pathogen interactions, protein language models, fine-tuning, LoRA, attention coefficients
TL;DR: ProteomeLM encodes host-pathogen interaction signal without supervision (AUROC 0.786); LoRA fine-tuning with asymmetric masking and blocked pathogen self-attention boosts this to 0.808, generalizing across ten viral and bacterial pathogens.
Abstract: ProteomeLM is a proteome-scale language model trained on proteomes spanning the tree of life to reconstruct masked protein embeddings from proteome context within each species. Its attention coefficients capture protein-protein interactions without supervision. Here, we show that this capability extends to cross-species host-pathogen interactions~(HPI) across ten human pathogen taxa spanning viruses and bacteria, and can be further improved with lightweight fine-tuning. We introduce \textbf{ProteomeLM-HPI}, a parameter-efficient adaptation via LoRA, trained on concatenated host-pathogen proteomes to reconstruct masked pathogen embeddings from host context. ProteomeLM-HPI involves two key design choices: \emph{asymmetric masking} (pathogen-heavy masking) and \emph{blocked pathogen self-attention}. Systematic ablations show that both choices contribute. To assess generalization, we introduce a strict cross-species benchmark enforcing pathogen-level holdout and 40\% sequence-identity filtering. On this benchmark, ProteomeLM-HPI improves AUC on 8 out of 10 unseen pathogens. This is a work-in-progress report; code, data and models will be made publicly available.
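Illustrative sketch (not the authors' implementation): the two design choices named in the abstract can be pictured as a pair of boolean masks over a concatenated host-pathogen sequence of protein embeddings. The function name `build_hpi_masks`, the masking rates, and the choice to block every pathogen-to-pathogen pair (including a protein attending to itself) while leaving host-to-host attention open are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def build_hpi_masks(n_host, n_pathogen, p_mask_host=0.1, p_mask_pathogen=0.5, seed=0):
    """Hypothetical helper: returns (masked, allowed) for a sequence laid out
    as [host proteins..., pathogen proteins...].

    masked[i]     -> True if the embedding at position i is hidden and must be
                     reconstructed.
    allowed[i, j] -> True if query position i may attend to key position j.
    """
    rng = np.random.default_rng(seed)
    n = n_host + n_pathogen
    is_pathogen = np.arange(n) >= n_host

    # Asymmetric masking: pathogen positions are masked at a higher rate than
    # host positions (the rates here are placeholders, not reported values).
    rates = np.where(is_pathogen, p_mask_pathogen, p_mask_host)
    masked = rng.random(n) < rates

    # Blocked pathogen self-attention: pathogen queries cannot see pathogen
    # keys, so masked pathogen embeddings must be rebuilt from host context.
    allowed = np.ones((n, n), dtype=bool)
    allowed[n_host:, n_host:] = False
    return masked, allowed

masked, allowed = build_hpi_masks(n_host=4, n_pathogen=3)
```

In an attention layer, positions where `allowed` is False would typically have their scores set to negative infinity before the softmax; as long as at least one host protein is present, every pathogen row still has valid keys to attend to.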
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 51