VariantBench: Benchmarking Language Models on Scientific Reasoning Across the Pharmacogenomic Evidence Pipeline

Published: 05 Mar 2026 · Last Modified: 05 Mar 2026 · ICLR 2026 Workshop LLM Reasoning · CC BY 4.0
Track: long paper (up to 10 pages)
Keywords: Large Language Models, Benchmarking, AI for Science, Evaluation of Large Language Models
TL;DR: We introduce VariantBench, a 79k-question benchmark that evaluates whether LLM agents can sustain logical consistency across chained, cross-document, and clinical reasoning workflows.
Abstract: Large language models increasingly serve as reasoning engines over scientific literature, yet it remains unclear whether they can sustain logical consistency across the multi-stage workflows required for real-world literature analysis. We introduce \textsc{VariantBench}, a benchmark that mirrors the full pharmacogenomic evidence curation pipeline grounded in expert-curated annotations from the ClinPGx research team. The benchmark comprises 79,592 structured single-paper questions and 394 agentic cross-document and clinical reasoning tasks spanning three tiers of complexity: factual extraction, dependent multi-turn reasoning, and CPIC guideline recreation under zero-context and evidence-provided settings. Evaluating frontier tool-use agents with the Harbor framework reveals substantial brittleness in multi-step reasoning. While per-step accuracy on chained tasks exceeds 60%, requiring all steps in a chain to be correct reduces success to 13.6%. Cross-document synthesis further degrades performance relative to single-paper comprehension. For clinical guideline recreation, providing the referenced literature improves mean reward by 20 points, indicating that models benefit substantially from explicit evidence access but remain unreliable when relying solely on parametric recall. VariantBench provides deterministic verifiers, reproducible agent infrastructure, and a large-scale expert-grounded evaluation suite for measuring progress toward robust scientific reasoning.
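The gap between per-step accuracy (above 60%) and whole-chain success (13.6%) reported in the abstract is what one would expect from errors compounding multiplicatively across a chain. A minimal sketch of that compounding, assuming independent steps; the step count and accuracy value here are illustrative, not taken from the benchmark:

```python
# Illustrative only: if each step in a chain succeeds independently
# with probability p, the whole chain succeeds with probability p**n.
def chain_success(p: float, n: int) -> float:
    """Probability that all n independent steps are correct."""
    return p ** n

# With ~60% per-step accuracy, a four-step chain already drops
# to roughly the ~13.6% whole-chain success the abstract reports.
print(round(chain_success(0.60, 4), 4))  # 0.1296
```

In practice step errors in a dependent multi-turn chain are correlated rather than independent, so this is a rough intuition for the reported drop, not a model of the benchmark itself.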
Presenter: ~Shlok_Natarajan1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 113