State-Aware Policy Optimization for a Reliable Multi-Turn, Multi-Tool Scientific Agent in Kinetic Biological Models
Track: Track 1: Original Research/Position/Education/Attention Track
TL;DR: We propose a state-aware RL framework that enables reliable multi-turn, multi-tool scientific agents by reconstructing environment state, achieving strong performance on kinetic biological modeling tasks with a compact model.
Abstract: Large language model (LLM)-based agents show promise for scientific tasks requiring multi-turn reasoning with multi-tool use, but often fail due to error propagation, where early mistakes cascade through downstream steps. In kinetic biological modeling, including quantitative systems pharmacology (QSP), multi-step workflows involve simulating and analyzing complex dynamical systems. Standard reinforcement learning (RL) assumes stateless input, yet these scientific workflows require stateful simulation backends that preserve environment state across tool calls. We address this limitation with a state-aware Group Relative Policy Optimization (GRPO) framework that decomposes multi-turn interactions into per-turn episodes while reconstructing environment state by replaying prior tool calls on a live simulation backend. Combined with a hybrid reward function consisting of deterministic verification and a frozen LLM judge, our approach enables verifiable, sample-efficient optimization. We apply this to Talk2BioModels (T2B), an open-source agent for interrogating kinetic biological models through natural language. Using only a compact Qwen2.5-3B-Instruct backbone, our optimized agent achieves 98.8% tool correctness, 91.5% argument correctness, and 73.2% task completion, outperforming frontier models on a 324-turn benchmark spanning 10 multi-turn, multi-tool scenarios covering 20 biological models.
Keywords: Scientific Agent, Kinetic Biological Modeling, Quantitative Systems Pharmacology (QSP), Reinforcement Learning, Policy Optimization
Submission Number: 86
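Illustrative sketch: the abstract describes three ingredients: replaying prior tool calls to reconstruct backend state for a per-turn episode, a hybrid reward mixing deterministic checks with a judge score, and group-relative advantages as in GRPO. The Python below is a minimal sketch of how these pieces could fit together, not the authors' implementation. `ToyBackend`, the tool names `load_model` and `simulate`, the reward weights, and the use of the population standard deviation are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class ToyBackend:
    """Hypothetical stand-in for a stateful simulation backend."""
    state: dict = field(default_factory=dict)

    def call(self, tool, args):
        # Each tool call mutates the backend state, so later calls depend on earlier ones.
        if tool == "load_model":
            self.state = {"model": args["model_id"], "simulated": False}
        elif tool == "simulate":
            if "model" not in self.state:
                raise RuntimeError("simulate called before load_model")
            self.state["simulated"] = True
            self.state["duration"] = args["duration"]
        else:
            raise ValueError(f"unknown tool: {tool}")
        return dict(self.state)


def reconstruct_state(prior_calls):
    """Replay the tool calls of earlier turns on a fresh backend."""
    backend = ToyBackend()
    for tool, args in prior_calls:
        backend.call(tool, args)
    return backend


def hybrid_reward(tool_ok, args_ok, judge_score, w_judge=0.5):
    """Mix deterministic checks with a judge score in [0, 1] (weights are assumed)."""
    verified = 0.5 * float(tool_ok) + 0.5 * float(args_ok)
    return (1.0 - w_judge) * verified + w_judge * judge_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Per-turn episode: rebuild the state from turns 1..t-1, then score a group of samples.
backend = reconstruct_state([
    ("load_model", {"model_id": 64}),
    ("simulate", {"duration": 10}),
])
rewards = [hybrid_reward(True, True, 1.0), hybrid_reward(True, False, 0.0)]
advantages = group_relative_advantages(rewards)
```

Because each per-turn episode starts from a replayed state, every sampled completion in a group is scored against the same environment, which is what makes the within-group comparison meaningful in this sketch.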