LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM 2024 Poster · License: CC BY 4.0
Abstract: Multimodal Large Language Models (MLLMs) have recently garnered significant attention as a prominent research focus. By harnessing the capability of powerful Large Language Models (LLMs), they facilitate the transition of conversational generative AI from unimodal text to multimodal tasks. This blooming development has begun to significantly impact the medical field. However, visual language models in the general domain lack the sophisticated comprehension required for medical visual conversations. Even models specifically tailored to the medical domain often produce answers that are vague and weakly related to the visual content. In this paper, we propose a fine-grained and adaptive visual language model architecture for Chinese medical visual conversations through parameter-efficient tuning. Specifically, we devise a fusion module with fine-grained vision encoders to enhance subtle medical visual semantics. We then address a form of data redundancy that is common in medical scenes but ignored by most prior work: when a single text is paired with multiple figures, we use weighted scoring with knowledge distillation to adaptively screen the valid images that mirror the text description. For execution, we leverage a large-scale Chinese ultrasound multimodal dataset obtained first-hand from a hospital database. We create instruction-following data based on text written by doctors, which ensures professionalism and thus contributes to effective tuning. With the enhanced architecture and quality data, our Large Chinese Language and Vision Assistant for Ultrasound (LLaVA-Ultra) shows strong capability and robustness in medical scenarios. On three medical visual question answering datasets, LLaVA-Ultra surpasses previous state-of-the-art models on various metrics.
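To make the adaptive screening step concrete, below is a minimal, hypothetical sketch (not the authors' released code) of how weighted scoring over a set of candidate figures paired with one report text could work: each image feature is scored against the pooled text feature, scores are softmax-normalized into weights, and low-weight (redundant) figures are masked out. The function name, threshold, and feature shapes are illustrative assumptions; the knowledge-distillation supervision of the scores described in the paper is not shown here.

```python
# Hypothetical sketch of weighted scoring to screen valid images when one
# report text is paired with multiple ultrasound figures. Names, shapes, and
# the keep_threshold value are assumptions for illustration only.
import torch
import torch.nn.functional as F

def screen_images(image_feats: torch.Tensor,   # (N, D): one feature per candidate image
                  text_feat: torch.Tensor,     # (D,): pooled feature of the paired report text
                  keep_threshold: float = 0.1):
    """Return per-image weights and a mask of images judged relevant to the text."""
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feat, dim=-1)
    scores = img @ txt                      # cosine similarity per image, shape (N,)
    weights = F.softmax(scores, dim=0)      # weighted scoring across the figure set
    mask = weights > keep_threshold         # drop redundant figures weakly tied to the text
    return weights, mask

# Example: 4 candidate figures with 512-dim features
weights, mask = screen_images(torch.randn(4, 512), torch.randn(512))
print(weights, mask)
```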
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Generation] Multimedia Foundation Models
Relevance To Conference: This work contributes to multimedia and multimodal processing by introducing a fine-grained and adaptive visual language model specifically tailored for medical visual conversations in Chinese. As important multimodal and multimedia forms, vision and language are often closely related and mutually supportive. In medical scenarios, clinical examination and diagnosis depend heavily on these two types of information, which are therefore the focus of our work. Leveraging a large-scale ultrasound image-text dataset and an improved architecture, our model, LLaVA-Ultra, demonstrates robustness and effectiveness in handling complex medical scenarios. Through parameter-efficient training, the model provides detailed and specialized answers to the medical images and questions given by the user. It surpasses previous state-of-the-art models on various metrics, showcasing outstanding multimodal conversational capability. LLaVA-Ultra advances the exploration of vision-language multimodality in terms of both data and model architecture. It not only demonstrates effective ways to transfer multimodal models from the general domain to the medical domain, but also offers insights for enhancing and refining multimodal models in the general domain and other application domains.
Supplementary Material: zip
Submission Number: 4878