Abstract: The recent release of many Chest X-Ray datasets has prompted
considerable interest in radiology report generation. To date, this has been
framed as an image captioning task, where the machine takes an
RGB image as input and generates a 2-3 sentence summary of findings
as output. The quality of these reports has been canonically
measured using metrics from the NLP community for language
generation tasks such as Machine Translation and Summarization. However,
these evaluation metrics (e.g. BLEU, CIDEr) are inappropriate
for the medical domain, where clinical correctness is critical. To
address this, our team brought together machine learning experts
with radiologists for a pilot study in co-designing a better metric
for evaluating the quality of an algorithmically-generated radiology
report. The interdisciplinary collaborative process involved mul-
tiple interviews, outreach, and preliminary annotation to design
a larger scale study – which is now underway – to build a more
meaningful evaluation tool.
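
To make the inadequacy of n-gram metrics concrete, the minimal sketch below (our illustration, not part of the study) scores two hypothetical one-sentence reports with opposite clinical meaning using NLTK's sentence-level BLEU; the near-identical wording yields a high score despite the clinical contradiction. The example sentences and the NLTK-based scoring are assumptions for illustration only.

```python
# Illustrative sketch: BLEU rewards surface n-gram overlap even when the
# clinical meaning is reversed. The sentences are hypothetical examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "no focal consolidation pleural effusion or pneumothorax is seen".split()
candidate = "focal consolidation pleural effusion and pneumothorax is seen".split()  # clinically opposite

score = sentence_bleu(
    [reference], candidate,
    weights=(0.5, 0.5),                           # bigram BLEU for short sentences
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-2 = {score:.2f}")  # scores highly despite contradicting the reference
```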