Large Language Model Agents Struggle With Online Patient Form Filling

Published: 23 Sept 2025 · Last Modified: 22 Nov 2025 · LAW · CC BY-NC 4.0
Keywords: Web navigation, long-term planning, AI for healthcare
TL;DR: We benchmark LLM agents on filling out online patient forms and find that they perform worse than on regular webpages.
Abstract: Large language models (LLMs) have recently shown promise in clinical computational reasoning. However, when tasked with performing medical calculations independently, they remain unreliable: they frequently misapply formulas, overlook relevant patient inputs, or make arithmetic mistakes. Prior benchmarks such as MedCalc-Bench have shown that these errors compound across equation-based and rule-based calculators, limiting their real-world reliability. In this work, we augment LLMs with Browser-Use, an agent scaffold for web navigation, to directly operate medical calculators on a confidential and proprietary platform, which we refer to as "MedSite-A". This removes the need for curated APIs by leveraging existing, clinically validated web interfaces. While LLMs have previously demonstrated potential for automating long-horizon web navigation tasks, our experiments reveal that browser-augmented agents struggle with schema discovery, execution drift, and domain reasoning, often performing worse than text-only baselines. This case study highlights the urgent need for robust oversight, failure detection, and domain-specific evaluation frameworks before integrating such agents into clinical workflows.
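
To make the setup concrete, below is a minimal sketch of driving a web-based calculator with a Browser-Use agent. It uses the open-source browser-use library's Agent interface together with a LangChain ChatOpenAI wrapper, as in that library's public examples; the task string, URL, patient values, and model choice are hypothetical placeholders, not the paper's actual MedSite-A benchmark configuration.

    # Minimal sketch: pointing a browser-use Agent at a web medical calculator.
    # The URL, patient inputs, and task wording below are hypothetical.
    import asyncio

    from browser_use import Agent
    from langchain_openai import ChatOpenAI  # LLM wrapper used in browser-use examples

    async def main() -> None:
        agent = Agent(
            # Natural-language task: the agent must discover the form schema,
            # fill in the patient inputs, submit, and read back the result.
            task=(
                "Open https://example-medsite.test/calculators/bmi, "
                "enter height 170 cm and weight 65 kg, "
                "submit the form, and report the calculated BMI."
            ),
            llm=ChatOpenAI(model="gpt-4o"),
        )
        history = await agent.run(max_steps=25)  # step budget bounds execution drift
        print(history.final_result())  # final answer extracted by the agent

    if __name__ == "__main__":
        asyncio.run(main())

Capping max_steps matters for the execution-drift failure mode described above: without a step budget, a drifting agent can wander the page indefinitely without ever submitting the form.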
Submission Type: Benchmark Paper (4-9 Pages)
NeurIPS Resubmit Attestation: This submission is not a resubmission of a NeurIPS 2025 submission.
Submission Number: 145