Time-Series as Feedback: Evaluating Adaptive Reasoning in LLM Agents

Published: 01 Mar 2026, Last Modified: 11 Apr 2026
ICLR 2026 TSALM Workshop Poster
License: CC BY 4.0
Presentation Attendance: Yes, we will present in-person
Keywords: Time-Series, Reasoning, Scientific Discovery, LLM Agents
TL;DR: We introduce a benchmark for adaptive time-series reasoning, showing that LLM agents outperform a non-adaptive baseline in hypothesis-driven system identification from time-series feedback, though likelihood-based feedback remains an upper bound.
Abstract: Time-series interpretation and reasoning are essential for inferring the state of physical systems and remain a key challenge for autonomous scientific discovery. We introduce a benchmark to evaluate whether large language model (LLM) agents can perform such reasoning in adaptive experiment-planning settings where time-series observations serve as feedback and experimental conditions constitute agent actions that generate new trajectories. Using kinetic mechanism identification as a motivating testbed, we construct an agent–environment loop in which an agent iteratively proposes experiments, receives time-series data, and refines hypotheses over competing mechanisms while selecting new experimental conditions that best discriminate among them. We show that agents with likelihood-based (NLL) feedback consistently outperform adaptive and non-adaptive baselines, demonstrating effective hypothesis-aware adaptive experimental design. Agents operating directly on raw time-series feedback also outperform the same baselines, indicating a non-trivial capability to extract task-relevant information from noisy trajectories without hand-engineered analysis tools. However, raw-feedback performance remains below NLL-feedback performance, highlighting current limitations in direct time-series interpretation by LLM agents without structured signals. Overall, this work contributes both (i) a benchmark for interactive time-series reasoning in adaptive experimental settings, and (ii) an empirical study of LLM agents’ strengths and limitations in hypothesis-driven scientific experimentation.
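The agent–environment loop the abstract describes — propose an experimental condition, observe a noisy trajectory, score competing mechanisms, and pick the next condition that best discriminates among them — can be illustrated with a minimal sketch. Everything here is hypothetical and not from the paper: the two toy mechanisms (first- vs. second-order decay), the Gaussian NLL score, and the greedy disagreement-based condition selection stand in for the paper's kinetic simulators, likelihood feedback, and LLM-driven experiment planning.

```python
import math
import random

# Illustrative sketch (all names hypothetical): two candidate kinetic
# mechanisms as forward models, a Gaussian NLL feedback signal, and a
# greedy rule that selects the condition where the hypotheses'
# predicted trajectories disagree most.

def first_order(k, t):
    # First-order decay with unit initial concentration: exp(-k * t)
    return math.exp(-k * t)

def second_order(k, t):
    # Second-order decay with unit initial concentration: 1 / (1 + k * t)
    return 1.0 / (1.0 + k * t)

def nll(observed, predicted, sigma=0.05):
    # Gaussian negative log-likelihood of a trajectory under a model
    return sum(
        0.5 * ((o - p) / sigma) ** 2 + math.log(sigma * math.sqrt(2 * math.pi))
        for o, p in zip(observed, predicted)
    )

def run_experiment(true_model, k, times, noise=0.02, rng=None):
    # Environment step: simulate the true mechanism plus observation noise
    rng = rng or random.Random(0)
    return [true_model(k, t) + rng.gauss(0.0, noise) for t in times]

def most_discriminating(conditions, models, k, times):
    # Agent step: pick the rate-scaling condition on which the two
    # candidate mechanisms' predictions differ most (L1 disagreement)
    def disagreement(c):
        return sum(abs(models[0](k * c, t) - models[1](k * c, t)) for t in times)
    return max(conditions, key=disagreement)

if __name__ == "__main__":
    times = [0.5 * i for i in range(1, 9)]
    models = [first_order, second_order]
    k, rng = 1.0, random.Random(42)
    scores = [0.0, 0.0]
    for _ in range(3):  # three rounds of adaptive experimentation
        c = most_discriminating([0.5, 1.0, 2.0], models, k, times)
        data = run_experiment(first_order, k * c, times, rng=rng)  # truth: first-order
        for i, m in enumerate(models):
            scores[i] += nll(data, [m(k * c, t) for t in times])
    print(["first-order", "second-order"][min(range(2), key=lambda i: scores[i])])
    # prints "first-order"
```

In the paper's raw-feedback setting, the agent would receive `data` directly instead of the NLL scores; the sketch's scoring loop corresponds to the structured NLL-feedback condition that the abstract reports as the upper bound.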
Track: Research Track (max 4 pages)
Submission Number: 87