Track: Track 2: Dataset Proposal Competition
TL;DR: A dataset of 522 structured process traces from 7 AI models executing scientific tasks, revealing that models with comparable success rates differ by roughly 30x in errors per trajectory, a dimension invisible to output-only benchmarks.
Abstract: Existing benchmarks for autonomous AI scientists evaluate only final outputs — generated code, hypotheses, or papers — yet discard the reasoning process by which those outputs were obtained. This makes it impossible to audit scientific methodology, diagnose failure modes, or distinguish systematic reasoning from fortunate guessing. We propose OpenDiscoveryTrace, a public dataset of 558 complete AI scientific agent trajectories that captures how models reason, not just what they produce. Each trajectory records a structured 9-field-per-step trace — including thoughts, tool calls, observations, errors, revision triggers, and self-reported confidence — as models execute 124 scientific tasks spanning drug discovery, materials science, genomics, and scientific literature analysis. The dataset covers seven models: three frontier (GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro; 124 trajectories each, fully balanced across domains and difficulty levels) and four open-weight (Qwen2.5-7B, Mistral-7B-v0.3, Phi-3.5-mini, Qwen2.5-1.5B), plus 60 live-retrieval variants and 6 tool-scaffolded open-weight trajectories. Pilot analysis on 363 LLM-judged trajectories reveals that process traces expose behavioral differences invisible to output-only evaluation: all three frontier models achieve comparable success rates (84-89%), yet Claude Opus 4.6 produces 30x more errors than GPT-5.4 (2.5 vs. 0.08 per trajectory, p<0.0001, Cliff's delta=0.613), with qualitatively different error profiles — 66.7% tool misuse for Claude versus 83.6% reasoning errors for GPT-5.4. We define five benchmark tasks with baselines from logistic regression, random forests, LSTM, and Transformer models. The dataset, trace schema, agent harness, and benchmark definitions will be released under CC-BY-4.0 to support research on process-level evaluation, scientific agent auditing, and AI governance.
Keywords: AI for science, dataset, scientific agents, process traces, trajectory evaluation, benchmarks, autonomous discovery, error analysis
Submission Number: 142