Multimodal Fusion of CNN and LLM Embeddings for Chest X-Ray Classification

07 Sept 2025 (modified: 11 Oct 2025). Submitted to NeurIPS 2025 2nd Workshop FM4LS. License: CC BY 4.0
Keywords: Large Language Models, Convolutional Neural Networks, Chest X-rays, Medical Imaging
TL;DR: We evaluate whether a fusion model that combines X-ray image and radiology report embeddings yields better classification than a unimodal model, using a dataset of X-ray images paired with radiology reports.
Abstract: Accurate detection of abnormalities in chest X-rays is crucial for timely diagnosis and treatment. While multimodal models achieve strong results, they are often computationally expensive to train. In this work, we propose a framework that fuses CNN embeddings from X-ray images with LLM-based semantic embeddings from radiology reports. The fused representation is processed by a multi-layer perceptron to perform both binary and multilabel classification. Experiments show that our approach improves classification metrics over unimodal baselines while requiring fewer compute resources, making it suitable for resource-constrained environments.
Submission Number: 81
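The abstract describes a late-fusion design: a CNN embedding of the X-ray and an LLM embedding of the radiology report are combined and passed to an MLP classifier. The sketch below is a minimal PyTorch illustration of that idea only; the backbone (ResNet-50), text-embedding dimension (768), label count (14), hidden size, and concatenation-based fusion are illustrative assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class FusionClassifier(nn.Module):
    """Concatenates image and report embeddings, then classifies with an MLP.

    Assumed (not from the paper): ResNet-50 image encoder, 768-dim text
    embeddings from an LLM encoder, 14 target labels, single hidden layer.
    """

    def __init__(self, text_dim: int = 768, num_labels: int = 14, hidden_dim: int = 512):
        super().__init__()
        # CNN backbone used as a frozen image encoder (weights=None here to keep
        # the example self-contained; pretrained weights could be loaded instead).
        backbone = models.resnet50(weights=None)
        self.image_encoder = nn.Sequential(*list(backbone.children())[:-1])
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        image_dim = 2048  # ResNet-50 pooled feature size

        # MLP head over the fused (concatenated) embedding.
        self.mlp = nn.Sequential(
            nn.Linear(image_dim + text_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, num_labels),
        )

    def forward(self, images: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
        img_feat = self.image_encoder(images).flatten(1)        # (B, 2048)
        fused = torch.cat([img_feat, text_embeddings], dim=1)   # (B, 2048 + text_dim)
        return self.mlp(fused)  # logits; pair with BCEWithLogitsLoss for multilabel


# Example forward pass with dummy tensors; in practice the text embeddings
# would come from an LLM encoder applied to the radiology report.
model = FusionClassifier()
images = torch.randn(4, 3, 224, 224)
text_embeddings = torch.randn(4, 768)
logits = model(images, text_embeddings)   # (4, 14)
probs = torch.sigmoid(logits)             # per-label probabilities (multilabel)
```

For the binary task described in the abstract, the same head with `num_labels=1` and a sigmoid output would apply; only the text encoder and the small MLP head require training here, which is consistent with the paper's emphasis on low compute cost.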