Abstract: Summarization is designed to condense text by focusing on the most critical information drawn from a source document (Paice, 1990; Kupiec et al., 1995). Generative large language models (LLMs) outperform previous summarization methods, yet traditional metrics struggle to capture the quality of summaries produced by these more powerful models (Goyal et al., 2022). In safety-critical domains such as medicine, rigorous evaluation is required given the potential for errors. We propose MED-OMIT, a new omission benchmark for medical summarization. MED-OMIT focuses on omissions, which have received less attention than other error types.
Given a doctor-patient conversation and a generated summary, MED-OMIT decomposes the conversation into a set of facts and identifies which are omitted from the summary. We further determine each fact's importance by simulating its impact on a downstream clinical task: differential diagnosis (DDx) generation. MED-OMIT leverages LLM prompt-based approaches that categorize the importance of facts and cluster them as supporting or negating evidence for the diagnosis. We evaluate MED-OMIT on a publicly released dataset of patient-doctor conversations. Based on expert evaluations, we find that MED-OMIT captures omissions, and does so in cases where traditional metrics cannot. We highlight which models perform well out-of-the-box on this task, such as gpt-4, and which would likely require additional training.
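The pipeline the abstract describes (extract facts, detect omissions, score importance via simulated DDx impact) can be sketched as follows. All helper names here are hypothetical, and the stubs stand in for the paper's LLM prompts (the real system queries a model such as gpt-4); the substring check is a naive placeholder for LLM-based entailment.

```python
# Hypothetical sketch of the MED-OMIT pipeline; not the authors' implementation.

def extract_facts(conversation: str) -> list[str]:
    """Stub fact extraction: one 'fact' per utterance. In MED-OMIT this
    step is performed by an LLM prompt over the full conversation."""
    return [line.split(": ", 1)[1]
            for line in conversation.splitlines() if ": " in line]

def find_omissions(facts: list[str], summary: str) -> list[str]:
    """Facts not covered by the summary. A substring match is used here
    purely as a placeholder for an LLM entailment judgment."""
    return [f for f in facts if f.lower() not in summary.lower()]

def fact_importance(fact: str, facts: list[str], rank_ddx) -> float:
    """Simulate downstream impact: DDx quality with vs. without the fact."""
    with_fact = rank_ddx(facts)
    without_fact = rank_ddx([f for f in facts if f != fact])
    return with_fact - without_fact

conversation = ("Doctor: what brings you in?\n"
                "Patient: chest pain, worse when breathing in\n"
                "Patient: no fever")
summary = "Patient presents with chest pain, worse when breathing in."

facts = extract_facts(conversation)
omitted = find_omissions(facts, summary)

# Toy DDx scorer: a pertinent negative ("no fever") sharpens the DDx.
def toy_rank_ddx(fs: list[str]) -> float:
    return 1.0 if "no fever" in fs else 0.5

importance = fact_importance("no fever", facts, toy_rank_ddx)
```

Here `omitted` would include the pertinent negative "no fever", and its simulated importance score is positive, flagging it as a clinically meaningful omission.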
Paper Type: long
Research Area: NLP Applications
Contribution Types: Model analysis & interpretability
Languages Studied: English