SySDEM - Synthetic and Stratified Degradations for Evaluating Metrics for Long-Form Text in Medical Domain

Naveen Jafer Nizar, Qinlan Shen, Sumana Srivatsa, Krishnaram Kenthapadi

Published: 27 Nov 2025, Last Modified: 09 Dec 2025
Venue: ML4H 2025 Poster
License: CC BY 4.0
Keywords: evaluation, SOAP notes, metrics, long form eval, text evaluation, synthetic degradation
TL;DR: We propose a framework for evaluating the quality of reference-based evaluation metrics for long-form text generation.
Track: Proceedings
Abstract: The evaluation of long-form text in the medical domain increasingly relies on automated metrics. However, the reliability of these metrics is often assumed rather than rigorously tested, especially for use cases where long-form generations are the expected output. This paper addresses this gap by proposing SySDEM (Synthetic and Stratified Degradations for Evaluating Metrics), a framework for evaluating the quality of reference-based evaluation metrics. Using this framework, we demonstrate a method that iteratively perturbs candidate texts to assess the sensitivity and discriminative power of reference-based text evaluation metrics. Through experiments on the ACI-Bench clinical note generation dataset, we demonstrate the importance of evaluating evaluation metrics for long-form text, highlighting the need for robust validation methodologies.
General Area: Applications and Practice
Specific Subject Areas: Evaluation Methods & Validity
Data And Code Availability: No
Ethics Board Approval: No
Submission Number: 165
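
The abstract describes iteratively perturbing a candidate text and checking how a reference-based metric responds. The following is only a minimal sketch of that general idea, not the authors' implementation: the perturbation (dropping one sentence per step), the toy metric (unigram-overlap F1), the example note, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import random
import re
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Toy reference-based metric (a stand-in, not a metric from the paper)."""
    cand = re.findall(r"\w+", candidate.lower())
    ref = re.findall(r"\w+", reference.lower())
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts tokens shared by candidate and reference.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def drop_one_sentence(sentences, rng):
    """Hypothetical perturbation: remove one randomly chosen sentence."""
    drop = rng.sample(range(len(sentences)), 1)[0]
    return [s for i, s in enumerate(sentences) if i != drop]


def degradation_curve(reference: str, metric, seed: int = 0):
    """Score the candidate at each degradation level (level 0 = unperturbed)."""
    sentences = [s.strip() for s in reference.split(".") if s.strip()]
    rng = random.Random(seed)
    candidate = list(sentences)
    scores = []
    for _ in range(len(sentences)):
        scores.append(metric(". ".join(candidate), reference))
        # Each step perturbs the previous candidate, so damage accumulates.
        candidate = drop_one_sentence(candidate, rng)
    return scores


# Made-up toy note, used only to exercise the code.
note = (
    "Patient reports chest pain for two days. No fever or cough. "
    "Exam shows normal heart sounds. Plan is an ECG and follow up in one week."
)
scores = degradation_curve(note, unigram_f1)
# A metric that is sensitive to this perturbation should never score a
# more degraded candidate higher than a less degraded one.
is_monotone = all(a >= b for a, b in zip(scores, scores[1:]))
print(scores, is_monotone)
```

Under these assumptions, a flat or non-monotone curve would suggest the metric cannot discriminate between degradation levels for that perturbation type; any real metric could be passed in place of `unigram_f1` as long as it takes a candidate and a reference string.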