Can AI Scientists Discover Neural Mechanisms? Evaluating Agentic Biological Discovery in a Digital Fly.

Aarav Sinha

Published: 28 May 2026, Last Modified: 28 May 2026. Venue: GenBio 2026 Poster. Readers: Everyone. License: CC BY 4.0

Keywords: agentic AI for science, AI scientist, biological discovery, mechanistic discovery, causal mechanism discovery, scientific agents, benchmark, evaluation framework, digital fly, Drosophila, computational neuroscience, connectomics, intervention planning, counterfactual reasoning, sequential experimentation

TL;DR: We introduce a digital-fly benchmark that tests whether an AI scientist can iteratively discover neural mechanisms through budgeted interventions, rather than just generate plausible biology-sounding explanations.

Abstract: Autonomous scientific agents are beginning to move beyond narrow assistance roles toward systems that can generate hypotheses, run experiments, and draft manuscripts. Yet current evaluation practice still says little about whether such systems can uncover biological mechanisms rather than merely summarize correlations or orchestrate known pipelines. We propose a benchmark for agentic biological discovery focused on uncovering neural mechanisms in a digital fly, and we instantiate it as a pilot study across six tasks from a whole-brain Drosophila model. The benchmark casts discovery as a budgeted hypothesis–experiment–update loop in which an agent observes compressed neural and behavioral readouts, selects interventions, and outputs a mechanistic explanation together with a held-out counterfactual prediction. On feeding and grooming tasks, a planning agent reaches a mean Node F1 of 0.856/0.952/1.000/1.000 at budgets 2/4/6/8, outperforming a one-shot GPT-5.4 stand-in (0.532), a no-memory variant (0.578/0.667/0.667/0.667), and a memory-only variant (0.789/0.897/0.897/0.897). In a named-versus-anonymized test, the one-shot baseline drops from 1.000 to 0.532 Node F1, whereas planning remains at 1.000 in both settings. We interpret these results as a pilot benchmark sensitivity study rather than a definitive leaderboard, but they suggest that the digital-fly setting can reveal meaningful differences in iterative mechanistic reasoning, memory, and shortcut robustness.
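
The abstract reports Node F1 without defining it. A minimal sketch of one plausible reading follows, assuming Node F1 is the standard set-based F1 between the neurons an agent names in its mechanistic explanation and the ground-truth mechanism nodes. The function name `node_f1` and the node labels are hypothetical and are not taken from the benchmark itself.

```python
def node_f1(predicted, true):
    """Set-based F1 between predicted and ground-truth mechanism nodes.

    Assumption: this is an illustrative definition, not necessarily the
    exact metric used by the benchmark.
    """
    predicted, true = set(predicted), set(true)
    if not predicted and not true:
        return 1.0  # nothing to find and nothing claimed
    true_positives = len(predicted & true)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(true)
    return 2 * precision * recall / (precision + recall)


# Hypothetical node labels, for illustration only.
true_nodes = {"n1", "n2", "n3", "n4"}
predicted_nodes = {"n1", "n2", "n5"}

# precision = 2/3, recall = 2/4, so F1 = 4/7
score = node_f1(predicted_nodes, true_nodes)
print(round(score, 3))
```

Under this reading, a score of 1.000 means the agent named exactly the ground-truth node set, with no missing and no spurious nodes.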

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 10
