Keywords: cardiology, echocardiogram reports, question answering, fairness audits
Abstract: We introduce a novel question-answering (QA) dataset built from echocardiogram reports sourced from the Medical Information Mart for Intensive Care (MIMIC) data. The dataset is designed to enhance QA systems in cardiology and consists of 771,244 QA pairs addressing a wide array of cardiac abnormalities and their severity. We compare a range of large language models (LLMs), including open-source general-purpose and biomedical-specific models, alongside state-of-the-art closed-source models in zero-shot evaluation. Our results show that fine-tuning LLMs on our dataset improves performance across various QA metrics, supporting the dataset's validity and value. Further, we conduct fine-grained fairness audits to assess the bias-performance trade-off of LLMs across marginalized populations. Our objective is to advance the field by establishing a benchmark framework for developing LLM-based AI agents that support clinicians in their daily cardiology workflow. The dataset is intended to support the development of natural language models for diagnostic decision support systems, with the goal of increasing efficiency in cardiology care.
Submission Number: 182