That's the Wrong Lung! Evaluating and Improving the Interpretability of Unsupervised Multimodal Encoders for Medical Data

Anonymous

16 Jan 2022 (modified: 05 May 2023) · ACL ARR 2022 January Blind Submission
Abstract: Pretraining multimodal models on Electronic Health Records (EHRs) provides a means to learn rich representations that might transfer to downstream tasks with minimal supervision. Recent multimodal models induce soft local alignments between modalities (image regions and sentences). This is of particular interest in the medical domain, where alignments could serve to highlight regions in an image relevant to specific phenomena described in free text. Past work has presented example “heatmaps” as qualitative evidence that cross-modal soft alignments can be interpreted in this manner, but there has been little quantitative evaluation of such alignments. Here we compare alignments from a state-of-the-art multimodal (image and text) model for EHRs with human annotations that associate image regions with sentences. Our main finding is that the text has surprisingly little influence on the attention: alignments do not consistently reflect basic anatomical information, and synthetic modifications, such as substituting “left” for “right,” do not substantially alter it. We find that simple techniques such as masking out entity names during training show promise for improving alignments without additional supervision.
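To make the two text-side interventions mentioned above concrete, the minimal sketch below (Python; an illustrative assumption, not the authors' implementation) shows how one might apply the synthetic laterality swap and mask entity mentions in a report sentence. The entity spans and the [MASK] token are hypothetical placeholders; in practice spans would come from whatever entity tagger the pipeline uses.

```python
import re

# Illustrative sketch only (not the paper's code): the two text-side
# interventions the abstract mentions -- a synthetic laterality swap used to
# probe whether alignments react to the text, and masking of entity mentions
# during training. Entity spans are assumed to come from an external tagger.

def swap_laterality(sentence):
    """Swap 'left' and 'right' in a report sentence, preserving capitalization."""
    def _swap(match):
        word = match.group(0)
        repl = "right" if word.lower() == "left" else "left"
        return repl.capitalize() if word[0].isupper() else repl
    return re.sub(r"\b[Ll]eft\b|\b[Rr]ight\b", _swap, sentence)

def mask_entities(sentence, entity_spans, mask_token="[MASK]"):
    """Replace (start, end) character spans of entity mentions with a mask token."""
    pieces, prev = [], 0
    for start, end in sorted(entity_spans):
        pieces.append(sentence[prev:start])
        pieces.append(mask_token)
        prev = end
    pieces.append(sentence[prev:])
    return "".join(pieces)

print(swap_laterality("Small effusion in the left lower lobe."))
# -> "Small effusion in the right lower lobe."
print(mask_entities("Small effusion in the left lower lobe.", [(6, 14)]))
# -> "Small [MASK] in the left lower lobe."
```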
Paper Type: long