In-Context Learning for Zero-shot Medical Report Generation

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: Medical report generation (MRG) has emerged as a pivotal research topic in the medical multi-modal field, given its potential to alleviate the heavy workloads of radiologists. Recently, advances have been made with MRG systems that leverage large multimodal models (LMMs) to generate high-quality reports. To address the challenge of collecting large amounts of paired medical image-report data for training, this paper proposes a zero-shot report generation model based on in-context learning, which we call MCVGen. Departing from traditional in-context learning approaches that feed all demonstrations directly to a pre-trained large model, this work instead represents the contextual information of the demonstrations with a multi-modal contextual vector (MCV). We first pre-train a medical large multi-modal model (Med-LMM) and obtain the last hidden state of each demonstration through a forward pass in Med-LMM. Benefiting from the auto-regressive mechanism, the last hidden state gathers the information critical to the target scenario. We then average the MCVs of multiple demonstrations and integrate the result with the first hidden state of the new query, thereby shifting the latent states and guiding the model toward previously unlearned multi-modal contextual information. This approach has the advantage of regulating the number of prompts, thus reducing computational cost. We evaluated our model on the publicly available Open-IU and MIMIC datasets, demonstrating its exceptional zero-shot capability in both cross-center and cross-disease evaluations. We hope it can serve as a viable solution for practical clinical applications.
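To make the MCV mechanism described above concrete, the following is a minimal sketch of the averaging-and-injection step, not the authors' implementation: it assumes the last hidden state of each demonstration has already been extracted from a forward pass of the pretrained Med-LMM, and the scaling factor `alpha` and all function names are illustrative assumptions rather than details given in the paper.

```python
import torch

def average_mcv(demo_last_hidden_states):
    """Average the multi-modal contextual vectors (MCVs).

    demo_last_hidden_states: list of (d_model,) tensors, each taken as the
    last hidden state of one demonstration after a forward pass through the
    pretrained Med-LMM (assumed to be precomputed here).
    """
    return torch.stack(demo_last_hidden_states).mean(dim=0)

def inject_mcv(query_hidden_states, mcv, alpha=1.0):
    """Shift the latent states of a new query with the averaged MCV.

    query_hidden_states: (seq_len, d_model) hidden states of the new query.
    The averaged MCV is added to the first hidden state; alpha is an
    assumed scaling knob, not specified in the abstract.
    """
    shifted = query_hidden_states.clone()
    shifted[0] = shifted[0] + alpha * mcv
    return shifted

# Toy usage with random tensors standing in for real Med-LMM hidden states.
d_model = 8
demos = [torch.randn(d_model) for _ in range(3)]   # MCVs of 3 demonstrations
query = torch.randn(5, d_model)                    # hidden states of the new query
shifted_query = inject_mcv(query, average_mcv(demos))
```

Because only one averaged vector is injected, the prompt length seen by the model stays fixed regardless of how many demonstrations are used, which is the computational advantage the abstract refers to.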
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Generation] Multimedia Foundation Models
Relevance To Conference: This work contributes to the field of multimedia/multimodal processing by introducing MCVGen, a zero-shot report generation model that combines in-context learning with large multimodal models (LMMs). By leveraging the capability of LMMs to interpret and synthesize information across textual and visual modalities, our model addresses a crucial gap in the automated generation of medical reports. MCVGen employs multimodal in-context vectors, thereby eliminating the dependency on large-scale paired training data. This advance enables the generation of accurate, high-quality medical reports from images without requiring direct prior examples of similar reports. Tested on the Open-IU and MIMIC datasets, MCVGen demonstrated exceptional zero-shot capabilities, underscoring its relevance and potential impact on advancing medical multimedia/multimodal processing.
Supplementary Material: zip
Submission Number: 1805