RaFT-DM: A Residue-Aware Fusion Transformer with Domain-Wise Memory for Accurate Multi-Label Protein Function Prediction

ACL ARR 2026 January Submission 509 · Anonymous Authors

23 Dec 2025 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: Protein Function Prediction, Protein Language Models, Domain-aware Modeling, Gene Ontology, Graph Neural Networks
Abstract: Accurate protein function prediction via modeling protein sequences as long structured symbolic sequences is essential for understanding cellular mechanisms and guiding drug discovery. However, conventional deep learning models often use global pooling, which weakens domain-specific signals and fails to capture contextual dependencies across domains. Here, we propose $\underline{\textbf{R}}$esidue-$\underline{\textbf{a}}$ware $\underline{\textbf{F}}$usion $\underline{\textbf{T}}$ransformer with $\underline{\textbf{D}}$omain-wise $\underline{\textbf{M}}$emory ($\textbf{RaFT-DM}$), a segment-aware multimodal modeling framework that introduces explicit memory over structured segments. RaFT-DM performs residue-level cross-attention between sequence and structure embeddings, and segments the fused representation using InterProScan-derived domain annotations. A BiLSTM initialized with a global token then captures segment-level contextual semantics, preserving local discriminative features while modeling inter-segment relationships. Experiments on standard benchmarks show that RaFT-DM consistently outperforms state-of-the-art baselines. By replacing global pooling with domain-aware modeling, RaFT-DM reduces missed recalls and misclassifications, enabling more accurate and interpretable predictions. The implementation of RaFT-DM is available at https://github.com/anonymous/RaFT-DM.
Paper Type: Long
Research Area: Clinical and Biomedical Applications
Research Area Keywords: Biomedical NLP, Graph-based Methods, Multimodal Applications, Knowledge-Augmented Methods, Representation Learning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English, Protein Sequences
Submission Number: 509