Abstract: The recent release of many Chest X-Ray datasets has prompted
considerable interest in radiology report generation. To date, this has been
framed as an image captioning task, where the machine takes an
RGB image as input and generates a 2-3 sentence summary of findings
as output. The quality of these reports has been canonically
measured using metrics from the NLP community for language
generation tasks such as Machine Translation and Summarization. However,
these evaluation metrics (e.g. BLEU, CIDEr) are inappropriate
for the medical domain, where clinical correctness is critical. To
address this, our team brought together machine learning experts
with radiologists for a pilot study in co-designing a better metric
for evaluating the quality of an algorithmically-generated radiology
report. The interdisciplinary collaborative process involved mul-
tiple interviews, outreach, and preliminary annotation to design
a larger scale study – which is now underway – to build a more
meaningful evaluation tool.
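
To make the inadequacy of n-gram metrics concrete, the minimal sketch below (our illustration, not part of the study) scores two hypothetical one-sentence reports with opposite clinical meaning using NLTK's sentence-level BLEU; the near-identical wording yields a high score despite the clinical contradiction. The example sentences and the NLTK-based scoring are assumptions for illustration only.

```python
# Illustrative sketch: BLEU rewards surface n-gram overlap even when the
# clinical meaning is reversed. The sentences are hypothetical examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "no focal consolidation pleural effusion or pneumothorax is seen".split()
candidate = "focal consolidation pleural effusion and pneumothorax is seen".split()  # clinically opposite

score = sentence_bleu(
    [reference], candidate,
    weights=(0.5, 0.5),                           # bigram BLEU for short sentences
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-2 = {score:.2f}")  # scores highly despite contradicting the reference
```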