Track: Track 1: Original Research/Position/Education/Attention Track
TL;DR: Episodic self-reflection substantially improves agent yield on closed-loop materials discovery, but consistently trades compositional breadth for depth.
Abstract: Autonomous scientific discovery agents face reasoning challenges that depart from those in standard LLM agent benchmarks: experimental campaigns span long horizons with many reasoning steps per decision, feedback is continuous rather than binary, and success requires navigating an exploration–exploitation tradeoff over high-dimensional spaces. A natural question is whether such agents can improve across successive campaigns by learning from their own trajectories. Using the MADE benchmark for closed-loop materials discovery as a case study, we instantiate a ReAct agent with episodic self-reflection and evaluate Qwen3-30B-Instruct and Qwen3.5-122B across 30 chemical systems and 8 episodes each. We find that reflection substantially improves per-episode discovery yield ($+4.6$ and $+6.5$ novel stable materials per episode, respectively), with positive cross-episode learning slopes indicating that the gains are attributable to episodic memory rather than stochasticity. However, self-reflection is a double-edged sword: it consistently shifts agent behaviour from breadth to depth, concentrating queries on near-hull families at the cost of compositional coverage. This shift risks premature exploitation on sparser search spaces, a regime we do not test directly. We identify recurring failure modes, such as actor non-compliance, and discuss their implications for scientific discovery agents more broadly.
Keywords: LLM agents, self-reflection, scientific discovery, materials discovery
Submission Number: 41