A Multimodal Literature Agent as Substrate for Autonomous Biology Research

Published: 30 May 2026, Last Modified: 30 May 2026
Venue: ICML2026-AI4Science Spotlight
License: CC BY 4.0
Track: Track 3: AI Scientist Proposal Competition
Keywords: multimodal retrieval-augmented generation, scientific literature agents, vision-language models, biomedical question answering, autonomous scientific discovery
TL;DR: A literature agent that treats figures, tables, and text as first-class retrieval artifacts and delegates figure inspection to a multi-round VLM zoom loop with abstention, reaching state-of-the-art on three components of LAB-Bench 2.
Abstract: Autonomous science systems for biology, covering hypothesis generation, experimental design, and data analysis, depend on a literature module they can trust. Most existing modules treat scientific publications as text, missing the figures, gels, and tables where biological evidence is encoded. We present a literature agent that is multimodal from the ingestion layer up: figures and tables are first-class retrieval artifacts, and figure inspection is delegated to a multi-round vision-language-model (VLM) zoom loop with an explicit abstention action that flags incorrect retrievals rather than answering from them. The agent reaches the strongest publicly reported scores on three components of LAB-Bench 2: 62.5% on FigQA2, 88.6% on LitQA3, and 88.8% on TableQA2, exceeding the best public baselines by 4.4, 4.1, and 9.5 points respectively. Provenance is tracked at the chunk level (DOI, page, character offsets) and per-publication metadata (retraction status, journal, citation count) is surfaced alongside each answer, making the agent a credible grounding layer for the action-taking agents that complete an autonomous discovery pipeline. As biology labs increasingly automate hypothesis generation, experimental design, and analysis, a trustworthy literature module, one that reads the actual evidence rather than only the prose around it, becomes the substrate the rest of the automation builds on.
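The zoom loop with abstention and the chunk-level provenance described in the abstract could be sketched roughly as follows. This is an illustrative reading of the abstract, not the authors' implementation: the names `Provenance`, `Step`, `Result`, `zoom_loop`, and `inspect`, the fractional crop-box convention, the `max_rounds` default, and the choice to abstain when the round budget runs out are all assumptions. The VLM is replaced by a plain callable so the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical crop box: (left, top, right, bottom) as fractions of the figure.
Box = Tuple[float, float, float, float]
FULL_FIGURE: Box = (0.0, 0.0, 1.0, 1.0)


@dataclass(frozen=True)
class Provenance:
    """Chunk-level provenance fields named in the abstract."""
    doi: str
    page: int
    char_start: int
    char_end: int


@dataclass(frozen=True)
class Step:
    """One decision from the (stand-in) VLM: zoom, answer, or abstain."""
    action: str
    box: Optional[Box] = None
    answer: Optional[str] = None


@dataclass(frozen=True)
class Result:
    answer: Optional[str]  # None when the loop abstained
    abstained: bool
    rounds: int
    provenance: Provenance


def zoom_loop(
    question: str,
    provenance: Provenance,
    inspect: Callable[[str, Box], Step],
    max_rounds: int = 4,  # assumed budget, not taken from the source
) -> Result:
    """Ask the inspector repeatedly, narrowing the view until it answers or abstains."""
    view = FULL_FIGURE
    for round_no in range(1, max_rounds + 1):
        step = inspect(question, view)
        if step.action == "answer":
            return Result(step.answer, False, round_no, provenance)
        if step.action == "abstain":
            # The retrieved figure does not contain the evidence: flag it, do not guess.
            return Result(None, True, round_no, provenance)
        if step.action == "zoom" and step.box is not None:
            view = step.box
            continue
        raise ValueError(f"unexpected step: {step!r}")
    # Assumption: running out of rounds is treated as abstention.
    return Result(None, True, max_rounds, provenance)


def toy_inspect(question: str, view: Box) -> Step:
    """Stand-in for a VLM: zoom once on the full figure, then answer."""
    if view == FULL_FIGURE:
        return Step("zoom", box=(0.5, 0.0, 1.0, 0.5))
    return Step("answer", answer="lane 3")


def wrong_figure_inspect(question: str, view: Box) -> Step:
    """Stand-in for a VLM that recognises an incorrect retrieval."""
    return Step("abstain")


source = Provenance(doi="10.0000/example", page=3, char_start=120, char_end=480)
result = zoom_loop("Which lane shows the band?", source, toy_inspect)
missed = zoom_loop("Which lane shows the band?", source, wrong_figure_inspect)
```

In this sketch every `Result` carries its `Provenance`, so a downstream agent can trace an answer, or an abstention, back to a DOI, page, and character span. A real system would presumably also attach the per-publication metadata (retraction status, journal, citation count) mentioned in the abstract.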
Submission Number: 251