Keywords: Synthetic Data, Instruction Tuning, Evaluation, Healthcare
TL;DR: TIMER is a synthetic data framework that enables both time-aware evaluation and temporal instruction tuning of large language models to better reason over longitudinal electronic health records.
Abstract: Large language models (LLMs) have emerged as promising tools for assisting in medical tasks, yet processing Electronic Health Records (EHRs) presents unique challenges due to their longitudinal nature. While LLMs' capabilities to perform medical tasks continue to improve, their ability to reason over temporal dependencies across multiple patient visits and time frames remains unexplored.
We introduce TIMER (**T**emporal **I**nstruction **M**odeling and **E**valuation for Longitudinal Clinical **R**ecords), a synthetic data generation framework that incorporates the temporal distribution of instructions as a critical dimension in both instruction evaluation and tuning for longitudinal clinical records. We develop TIMER-Bench, the first time-aware benchmark that evaluates temporal reasoning capabilities over longitudinal EHRs, as well as TIMER-Instruct, an instruction-tuning methodology that teaches LLMs to reason over time. We demonstrate that models fine-tuned with TIMER-Instruct improve performance by 7.3% on human-generated benchmarks and 9.2% on TIMER-Bench, indicating that temporal instruction tuning improves model reasoning over EHRs. Our code is available at [TIMER](https://anonymous.4open.science/r/TIMER-2874).
Submission Number: 18