Adversarial Fast-Moving Real-World Domains as Test Beds For Benchmarking AI Scientist Capabilities

Published: 30 May 2026, Last Modified: 30 May 2026
Venue: ICML2026-AI4Science Poster
License: CC BY 4.0
Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: AI scientists, benchmarking, large language models, reasoning, novelty, hypothesis formulation, adversarial real-world domains, time-delayed ground truth
TL;DR: We propose that real-world domains with adversarial selection pressure and time-delayed expert outputs can act as practical prospective test beds for AI scientist evaluation, and we provide a proof-of-concept study in Formula 1 and Magic: The Gathering.
Abstract: Benchmarking the ability of AI scientists to generate novel ideas is notoriously difficult. Existing benchmarks have made progress in evaluating scientific reasoning and research replication, but they often rely on synthetic tasks or retrospective targets, which may be confounded by prior exposure. We hypothesize that complex, adversarial, fast-moving real-world domains, where expert practitioners independently generate observable outputs, can fill this gap and provide a practical way to evaluate the capabilities needed for AI scientists, including reasoning, novelty, and hypothesis formulation. We instantiate this framework in two structurally different domains: Formula 1 (F1), where models ideate on car design concepts for the 2026 season and real pre-season innovations provide the ground truth, and Magic: The Gathering (MTG), where models propose decks from a recently updated card pool and are evaluated against 19 Pro Tour (PT) decklists. Across both domains, models produce plausible outputs, but few align with real-world expert solutions. In F1, the best model, GPT-5.2, matched 10 of 40 real innovations, with 166 ideas proposed across runs. In MTG, the best deck from Gemini 3 Flash recovered 5 of 7 new-set cards from the third-place PT deck, and across all 108 decks, the cards that models selected most often were also the cards most widely adopted in PT decks (Spearman r = 0.74, p = 0.0003). These results suggest that a key capability gap for AI scientists is not idea generation, but filtering, prioritization, and coherent novelty.
Submission Number: 157
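
The card-adoption comparison in the abstract is a rank correlation between how often models picked each card and how often PT decks used it. The sketch below shows one way such a Spearman coefficient could be computed in plain Python. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the per-deck presence counting, and the toy card names are all hypothetical, and no p-value is computed.

```python
from collections import Counter


def average_ranks(values):
    """Rank values ascending (1-based); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def card_counts(decks):
    """Count the number of decks in which each card appears (assumed metric)."""
    return Counter(card for deck in decks for card in set(deck))


# Toy, made-up decklists; card names are placeholders, not real data.
model_decks = [["card_a", "card_b"], ["card_a", "card_c"], ["card_a", "card_b", "card_d"]]
pt_decks = [["card_a", "card_b"], ["card_a", "card_d"], ["card_a"]]

model_freq = card_counts(model_decks)
pt_freq = card_counts(pt_decks)
cards = sorted(set(model_freq) | set(pt_freq))
rho = spearman([model_freq[c] for c in cards], [pt_freq[c] for c in cards])
print(f"Spearman rho on toy data: {rho:.2f}")
```

Counting deck-level presence rather than copies per deck is a design choice made here for simplicity; the submission may count adoption differently.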