Sibyl: Temporal Backtesting for Literature-Based Scientific Discovery with Large Language Model Agents

Blagoy Rangelov

Published: 30 May 2026, Last Modified: 06 Jun 2026
Venue: ICML2026-AI4Science Poster
License: CC BY 4.0

Additional Submission Instructions: For the camera-ready version, please include the author names and affiliations, funding disclosures, and acknowledgements.

Track: Track 1: Original Research/Position/Education/Attention Track

Keywords: literature-based discovery, agentic AI, scientific reasoning, temporal backtesting, knowledge extraction, provenance auditing

TL;DR: A multi-agent LLM pipeline that generates falsifiable scientific predictions from pre-2015 literature, with an 18% confirmation rate validated by post-2015 publications and a provenance audit protocol that detects citation hallucination.

Abstract: We present Sibyl, a multi-agent LLM pipeline that autonomously mines scientific literature to generate falsifiable predictions, evaluated through a temporal backtesting framework analogous to backtesting in quantitative finance. The system extracts structured claims from a training corpus (pre-cutoff publications), compiles a machine-readable knowledge base, generates testable hypotheses from identified gaps, and validates them against a held-out post-cutoff corpus. Applied to X-ray binary astrophysics as a proof-of-concept domain, the pipeline assembled 14,400 refereed papers, extracted over 11,000 structured claims, and generated 60 falsifiable predictions from pre-2015 literature alone. Of these, 11 (18%) were confirmed by independent post-2015 publications that the system never observed. A post-hoc provenance audit identified three systematic failure modes: corpus contamination, validation-era leakage into hypothesis framing, and citation hallucination. We detected the last of these via a novel cross-prediction consistency check. Sensitivity analyses show that the confirmation rate remains between 12.5% and 18% under progressively conservative filters. These are preliminary results from an ongoing project; we offer the pipeline architecture, the backtesting evaluation methodology, and the provenance audit protocol as contributions to the AI-for-science community.
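
The temporal split and the headline confirmation rate described in the abstract can be illustrated with a minimal Python sketch. This is not the Sibyl implementation: the `Paper` record, the `temporal_split` and `confirmation_rate` helpers, and the choice to place publications dated 2015 itself in the held-out set are all assumptions made for illustration. Only the counts 11 and 60 come from the abstract.

```python
from dataclasses import dataclass

# Assumption: the abstract says "pre-2015" and "post-2015"; here papers dated
# 2015 itself are placed in the held-out set. The real boundary may differ.
CUTOFF_YEAR = 2015


@dataclass(frozen=True)
class Paper:
    """Hypothetical minimal record for one publication."""
    paper_id: str
    year: int


def temporal_split(papers, cutoff_year=CUTOFF_YEAR):
    """Split papers into a pre-cutoff training set and a held-out post-cutoff set."""
    train = [p for p in papers if p.year < cutoff_year]
    held_out = [p for p in papers if p.year >= cutoff_year]
    return train, held_out


def confirmation_rate(n_confirmed, n_predictions):
    """Fraction of generated predictions confirmed by the held-out corpus."""
    if n_predictions <= 0:
        raise ValueError("n_predictions must be positive")
    return n_confirmed / n_predictions


# Toy corpus (made-up identifiers and years, for illustration only).
papers = [Paper("a", 2009), Paper("b", 2014), Paper("c", 2015), Paper("d", 2021)]
train, held_out = temporal_split(papers)

# Counts reported in the abstract: 11 confirmed out of 60 predictions.
rate = confirmation_rate(11, 60)
print(f"{len(train)} training, {len(held_out)} held-out, rate = {rate:.1%}")
```

The point of the split is that hypotheses are generated only from `train`, while `held_out` is consulted solely at validation time; any overlap between the two would correspond to the corpus contamination failure mode that the audit describes.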

Submission Number: 20
