Abstract: Current large multimodal models (LMMs) leverage chest X-ray (CXR) images to generate informative reports; however, these models typically accept only a single image as input, missing temporal insights across visits. This work introduces TX-LLaVA (Temporal X-ray Large Language Vision Assistant), designed to track historical changes and produce holistic reports across multiple CXR images taken over different visits. Built upon Video-LLaVA, TX-LLaVA incorporates a purpose-built temporal dataset and uses efficient fine-tuning techniques to achieve state-of-the-art results. It not only generates detailed reports but also highlights changes across sequential CXR scans, enhancing the diagnostic process. TX-LLaVA reaches a ROUGE-L score of 0.20, a 21.21% increase over the baseline model.
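The ROUGE-L score quoted above is the longest-common-subsequence (LCS) F-measure between a generated report and the reference report. A minimal pure-Python sketch of how that score is computed (an illustration of the metric, not the paper's evaluation code):

```python
# Hypothetical illustration of ROUGE-L (LCS-based F1) over whitespace tokens.
# Real evaluations typically use a library such as rouge-score with stemming.

def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(reference, candidate):
    """ROUGE-L F1 between a reference report and a generated report."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)   # fraction of candidate tokens in the LCS
    recall = lcs / len(ref)       # fraction of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)

print(rouge_l("the lungs are clear", "lungs are clear"))
```

Because ROUGE-L rewards in-order token overlap rather than exact n-gram matches, it tolerates the phrasing variation common in radiology reports.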
External IDs: dblp:conf/isbi/ElgendyC25