VarLitBench and VarLitAgent for Benchmarking and Automating LLM-Assisted Functional Evidence Curation in Genomic Variant Interpretation

Ali Saadat; Jacques Fellay

Published: 28 May 2026, Last Modified: 28 May 2026. Venue: GenBio 2026 Poster. License: CC BY 4.0

Keywords: Large Language Models, Clinical Genetics, Functional Evidence, ACMG/AMP, Variant Interpretation

Abstract: Linking genomic variants to functional evidence in the literature is a central but labor-intensive step in clinical variant interpretation. We introduce VarLitBench, a ClinGen-anchored benchmark for evaluating large language models (LLMs) on variant-specific functional-evidence curation, and VarLitAgent, an end-to-end pipeline for human-in-the-loop evidence retrieval, extraction, and reporting. VarLitBench evaluates two tasks. In abstract screening, the model determines whether a paper is likely to report a functional experiment that directly tests one or more genetic variants. In full-paper extraction, the model aligns the target variant to mentions in the paper, extracts experimental readouts, classifies evidence direction, and generates a concise evidence summary. We evaluated gpt-4o-mini, o4-mini, claude-haiku-4-5, and claude-sonnet-4-5. All models achieved high recall for abstract screening, ranging from 0.873 to 0.904, with claude-sonnet-4-5 obtaining the best overall F1 score of 0.792. For full-paper PS3 versus BS3 evidence-direction classification, o4-mini achieved the highest F1 score, 0.979. We also compared model-generated summaries with expert-written ClinGen curator rationales using an LLM-as-judge protocol. Claude models obtained the highest mean correspondence scores. Evidence strength assignment, such as distinctions between pathogenic strong and pathogenic moderate, remained challenging across models. VarLitAgent builds on these findings by taking a genomic variant as input, expanding its identifiers, retrieving candidate literature, screening abstracts, obtaining full texts or PDFs when available, and performing multimodal evidence extraction. The system supports a direct mode for efficient processing and an agentic mode for deeper parsing of figures and tables. Together, VarLitBench and VarLitAgent provide a practical foundation for auditable LLM assistance in functional-evidence curation.
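The recall and F1 figures reported for abstract screening follow the standard definitions for a binary classifier. As a minimal sketch of how such scores are computed (not the authors' evaluation code), the example below assumes each abstract carries a gold label and a model prediction, both booleans meaning "reports a functional experiment on a variant". The names `ScreeningScores` and `score_screening` are hypothetical and do not come from VarLitBench.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ScreeningScores:
    """Precision, recall, and F1 for a binary screening task (hypothetical container)."""
    precision: float
    recall: float
    f1: float


def score_screening(gold: Sequence[bool], predicted: Sequence[bool]) -> ScreeningScores:
    """Score binary screening decisions against gold labels.

    True means "the abstract likely reports a functional experiment
    that directly tests a variant"; this encoding is an assumption.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g and p)
    false_pos = sum(1 for g, p in pairs if not g and p)
    false_neg = sum(1 for g, p in pairs if g and not p)

    # Guard against empty denominators by defining the score as 0.0.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return ScreeningScores(precision=precision, recall=recall, f1=f1)


# Toy example: a screener that accepts every abstract has perfect recall
# but pays for it in precision, which pulls F1 down.
accept_all = score_screening(
    gold=[True, True, False, False],
    predicted=[True, True, True, True],
)
```

In this toy case recall is 1.0 while precision is 0.5. The same trade-off explains how a model can reach high screening recall with a noticeably lower F1.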

Submission Number: 103
