Offline Policy Learning for Clinical-Trial Strategy

Published: 25 May 2026, Last Modified: 27 May 2026, DEMO 2026 Poster, CC BY 4.0
Keywords: Agentic AI, decision agents, LLM, sequential decision-making, offline policy learning, clinical trials, strategy
TL;DR: We curate an offline RL dataset of real-world clinical-development trajectories with outcome-derived rewards, train policies with several behavioral cloning (BC) objectives, and find that they outperform tool-using frontier LLM agents on next-window trial planning.
Abstract: Clinical development is a sequential decision-making problem under uncertainty. We frame oncology clinical development as an offline decision-making problem in which models predict the next six-month trial portfolio of an oncology drug program from information available at the decision date. To support this, we construct a temporal dataset that combines 31.7k heterogeneous public data records, including trial registries, regulatory reviews, sponsor filings, utilization data, and epidemiology, into 881 offline decision episodes across 45 historical programs. On held-out drug, sponsor, drug-class, and temporal splits, we compare behavioral cloning, reward-weighted behavioral cloning, and learned-reward training against four frontier LLM agents that share a common date-gated retrieval scaffold. Adapters trained offline outperform every non-fine-tuned baseline. In the post-August 2025 contamination-clean holdout, offline training reaches 39.9% Indication F1 against 11.2% for the strongest tool agent, suggesting that structured offline learning can capture clinical-development strategy beyond memorized trial records.
Submission Number: 110
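
The abstract's requirement that models see only "information available at the decision date" can be illustrated with a minimal date-gating sketch. Everything here is an assumption made for illustration: the `Record` type, its field names, the `date_gate` helper, and the two made-up records are hypothetical and are not taken from the submission's dataset or retrieval scaffold.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    """Hypothetical public data record with the date it became available."""
    source: str
    available_on: date
    text: str


def date_gate(records, decision_date):
    """Keep only records that were public on or before the decision date."""
    return [r for r in records if r.available_on <= decision_date]


# Made-up records, for illustration only.
records = [
    Record("trial_registry", date(2024, 1, 15), "Phase 2 study registered"),
    Record("regulatory_review", date(2024, 9, 1), "Review document posted"),
]

# A decision made on 30 June 2024 must not see the September record.
visible = date_gate(records, date(2024, 6, 30))
print([r.source for r in visible])
```

Applying the same filter to every retrieval call is one simple way to keep later outcomes from leaking into an earlier decision episode; the submission may enforce the gate differently.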