ProtFunAgent: Agentic LLM Cascades for Low-Resource Protein Function Gap-Filling via Homology RAG and Ontology-Constrained Decoding

ICLR 2026 Conference Submission 16095 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: protein function prediction, large language models, ontology-aware decoding, Gene Ontology, homology retrieval, retrieval-augmented generation, BLAST, constrained decoding, agentic LLM cascades, biological priors, annotation refinement, ontology-grounded prediction, hierarchical F1, recall improvement, protein annotation, computational biology, knowledge integration, function gap-filling
TL;DR: Ontology-aware agentic LLM with homology RAG and GO-constrained decoding for grounded, low-resource protein function gap-filling.
Abstract: Predicting protein function is a long-standing challenge, especially for poorly characterized sequences where homology transfer is unreliable and large language models (LLMs) produce fluent but biologically imprecise annotations. Existing approaches often fail to integrate critical priors such as Gene Ontology (GO) structure or homology evidence, limiting both recall and generalization. We present \textbf{ProtFunAgent}, an agentic framework that couples LLM reasoning with biological constraints through three key innovations: (1) \emph{homology-guided retrieval-augmented generation}, where top-$k$ sequence homologs inject functional priors; (2) \emph{ontology-constrained decoding}, aligning predictions with the GO hierarchy via lexicon-aware filtering and pruning; and (3) a \emph{synthesis-and-judging cascade} of LLMs, where multiple models collaborate and self-evaluate to refine candidate summaries. This design mirrors biocurator workflows while retaining the flexibility of generative models. On UniProt-derived benchmarks, ProtFunAgent outperforms single-LLM and heuristic baselines, delivering \textbf{over $3\times$ higher hierarchical F1} and nearly doubling recall while maintaining precision. Moreover, the framework \textbf{closes more than half of the gap to oracle-level annotation}, demonstrating that embedding biological structure into agentic LLM pipelines enables scalable, ontology-faithful function prediction. ProtFunAgent provides a general blueprint for marrying symbolic constraints with generative reasoning, advancing automated protein annotation at scale.
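The abstract reports gains in hierarchical F1 but does not define it. As a rough illustration only, the sketch below uses one common formulation: predicted and gold term sets are each expanded with all of their ontology ancestors, then set-based precision, recall, and F1 are computed on the expanded sets. The toy ontology, its term IDs, and the function names `with_ancestors` and `hierarchical_prf` are assumptions made for this example; they are not taken from the submission, and the authors' exact metric may differ.

```python
from typing import Dict, Iterable, Set, Tuple

# Toy is_a edges (child -> parents). Hypothetical IDs, not real GO terms.
TOY_PARENTS: Dict[str, Set[str]] = {
    "T:root": set(),
    "T:binding": {"T:root"},
    "T:catalysis": {"T:root"},
    "T:atp_binding": {"T:binding"},
    "T:kinase": {"T:catalysis"},
}


def with_ancestors(terms: Iterable[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the given terms together with all of their ancestors."""
    closed: Set[str] = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in closed:
            continue
        closed.add(term)
        # Terms missing from the map are treated as having no parents.
        stack.extend(parents.get(term, ()))
    return closed


def hierarchical_prf(
    predicted: Iterable[str],
    gold: Iterable[str],
    parents: Dict[str, Set[str]],
) -> Tuple[float, float, float]:
    """Ancestor-augmented precision, recall, and F1 (assumed definition)."""
    pred_closed = with_ancestors(predicted, parents)
    gold_closed = with_ancestors(gold, parents)
    overlap = len(pred_closed & gold_closed)
    precision = overlap / len(pred_closed) if pred_closed else 0.0
    recall = overlap / len(gold_closed) if gold_closed else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # A correct but less specific prediction earns partial credit.
    print(hierarchical_prf({"T:binding"}, {"T:atp_binding"}, TOY_PARENTS))
```

Under this formulation, predicting a parent of the true term is penalized only in recall, whereas a flat exact-match F1 would score it as zero. That behavior is why hierarchical metrics are often preferred for ontology-structured labels.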
Supplementary Material: pdf
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 16095