Synthesis and Evaluation of Long-term History-aware Medical Dialogue

Published: 19 Dec 2025, Last Modified: 05 Jan 2026, AAMAS 2026 Full, CC BY 4.0
Keywords: Healthcare agent, LLM, Synthetic Dataset, Medical Dialogue Dataset
Abstract: An effective healthcare agent must be able to recall and reason over a patient’s longitudinal medical history. However, the absence of datasets with realistic long-term dialogue timelines has limited systematic evaluation. Real clinical text is constrained by privacy and ethics, while existing benchmarks focus on isolated interactions and fail to capture cross-session reasoning. We introduce a framework for synthesizing high-quality, long-term medical dialogues using Large Language Models (LLMs). Our approach uses a knowledge-guided decomposition of the task into three stages: constructing synthetic patient profiles with diverse disease and complication trajectories, generating multi-turn dialogues for each clinical encounter, and integrating them into a coherent longitudinal history dataset, MediLongChat. We establish three benchmark tasks, In-dialogue Reasoning, Cross-dialogue Reasoning, and Synthesize Reasoning, to evaluate the memory capabilities of healthcare agents. To assess data quality, we introduce a multi-dimensional evaluation framework that combines vector-based metrics with LLM-as-a-judge assessments. Specifically, we define three automatic measures (Faithfulness, Coherence, and Diversity) together with two LLM-based evaluations (Correctness and Realism). These metrics collectively establish a rigorous standard for evaluating synthetic dialogue datasets. Benchmark experiments demonstrate that even state-of-the-art LLMs struggle with MediLongChat, particularly in long-term memory and cross-dialogue reasoning. These findings highlight the applicability of the benchmark and underscore the need for new methods to advance healthcare agents.
Area: Generative and Agentic AI (GAAI)
Generative AI: I acknowledge that I have read and will follow this policy.
Submission Number: 906