Monopath DAGs: Structuring Patient Trajectories from Clinical Case Reports

ICLR 2026 Conference Submission 21410 Authors

19 Sept 2025 (modified: 08 Oct 2025) | ICLR 2026 Conference Submission | CC BY 4.0
Keywords: Graphical Models, Structured Prediction, Healthcare, Natural Language Processing, Patient Trajectories
TL;DR: A modular NLP pipeline that transforms unstructured cancer case reports into directed acyclic graphs representing patient trajectories, enabling structured analysis, evaluation, and synthetic data generation.
Abstract: High-quality datasets capturing rare diseases, atypical responses, and complex care pathways are critically needed in clinical machine learning. While electronic health records (EHRs) remain the dominant data source, they are constrained by institutional silos, privacy regulations, and the inherent scarcity of many clinically significant scenarios. Narrative case reports offer a complementary source: they are publicly available and often focus on diagnostically or therapeutically challenging cases. Yet their unstructured format limits reuse for modeling and data generation. We present a modular framework that transforms free-text case reports into Monopath Directed Acyclic Graphs (DAGs), structured representations of patient trajectories that are both temporally ordered and semantically grounded. DAGs are a natural fit for modeling clinical narratives because they encode time-ordered clinical states and transitions, supporting branching and causal reasoning. We apply the pipeline to a curated corpus of 485 lung cancer case reports. Graph fidelity is supported both by automated metrics (ClinicalBERT BERTScore, F1 = 0.798 ± 0.051) and by direct clinical assessment, with practicing physicians rating event order and content positively. Compared to free-text vignettes, DAG embeddings yield higher Calinski–Harabasz clustering scores in raw space (110.5 vs. 41.9) and after PCA and UMAP projection (157% and 69% relative gains, respectively). In a clinician evaluation, graph-conditioned synthetic narratives are preferred in 62% of 106 comparisons and score higher on timeline validity and decision support. We also demonstrate applicability beyond lung cancer by applying the framework to four rare diseases across body systems, observing consistent clustering gains. Pending large-scale validation, these results highlight the promise of Monopath DAGs as reusable, clinically grounded templates for patient similarity, augmentation, and controlled narrative generation. We release all graphs, schema, and code.
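To make the graph representation concrete, the following is a minimal sketch of how one patient trajectory might be held as a DAG and checked for temporal order. It is not the released schema: the abstract does not define "Monopath", so the sketch assumes it means a single acyclic chain of clinical states. The `Node` fields, the `trajectory_order` helper, and the four-event example case are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Node:
    """One clinical state in a trajectory (hypothetical fields, not the paper's schema)."""
    node_id: str
    description: str


def trajectory_order(nodes, edges):
    """Return node ids in temporal order, assuming a 'monopath' is a single chain.

    Raises graphlib.CycleError if the edges contain a cycle, and ValueError
    if the graph is acyclic but branches or is disconnected.
    """
    preds = {n.node_id: set() for n in nodes}
    succ_count = {n.node_id: 0 for n in nodes}
    for src, dst in edges:
        preds[dst].add(src)
        succ_count[src] += 1

    # static_order() raises CycleError when the graph is not acyclic.
    order = list(TopologicalSorter(preds).static_order())

    # An acyclic graph with n - 1 edges and in/out degree <= 1 is one path.
    single_chain = (
        len(edges) == len(nodes) - 1
        and all(len(p) <= 1 for p in preds.values())
        and all(c <= 1 for c in succ_count.values())
    )
    if not single_chain:
        raise ValueError("graph is acyclic but not a single path")
    return order


# Invented example case; edges are listed out of order on purpose.
nodes = [
    Node("n1", "Presents with persistent cough"),
    Node("n2", "CT shows right upper lobe mass"),
    Node("n3", "Biopsy confirms adenocarcinoma"),
    Node("n4", "Targeted therapy started"),
]
edges = [("n3", "n4"), ("n1", "n2"), ("n2", "n3")]

order = trajectory_order(nodes, edges)
print(order)
```

Under this single-chain assumption the topological order is unique, so the recovered sequence can be compared directly against the event order in the source narrative.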
Primary Area: datasets and benchmarks
Submission Number: 21410