From Human Attention to Diagnosis: Semantic Patch-Level Integration of Vision-Language Models in Medical Imaging
Keywords: Gaze prediction, medical image analysis, CXR, deep learning, fixation modeling, attention mechanisms, visual attention, neural networks
TL;DR: This paper explores gaze prediction and fixation modeling in medical images using deep learning techniques.
Abstract: Predicting human eye movements during goal-directed visual search is critical for enhancing interactive AI systems. In medical imaging, such prediction can support radiologists in interpreting complex data such as chest X-rays. Many existing methods rely on generic vision--language models and saliency-based features, which limits their ability to capture fine-grained clinical semantics and to integrate domain knowledge effectively. We present \textbf{LogitGaze-Med}, a multimodal transformer framework that unifies (1) domain-specific visual encoders (e.g., CheXNet), (2) textual embeddings of diagnostic labels, and (3) semantic priors extracted via the logit-lens from an instruction-tuned medical vision--language model (LLaVA-Med). By directly predicting continuous fixation coordinates and dwell durations, our model generates clinically meaningful scanpaths. Experiments on the GazeSearch dataset, as well as on expert-validated synthetic scanpaths generated from MIMIC-CXR, demonstrate that LogitGaze-Med improves scanpath similarity metrics by 20--30\% over competitive baselines and yields over 5\% gains in downstream pathology classification when predicted fixations are incorporated as additional training data.
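To make the fusion described in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' implementation): visual patch features, diagnostic-label embeddings, and logit-lens priors are projected into a shared space and decoded by a small transformer into per-fixation coordinates and dwell durations. All module names, dimensions, and the fixed-length query design are illustrative assumptions.

```python
# Hypothetical sketch of the multimodal fusion idea; dimensions and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GazeFusionTransformer(nn.Module):
    def __init__(self, img_dim=1024, txt_dim=768, prior_dim=256,
                 d_model=512, n_layers=4, n_heads=8, max_fixations=16):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.img_proj = nn.Linear(img_dim, d_model)      # e.g., CheXNet patch features
        self.txt_proj = nn.Linear(txt_dim, d_model)      # diagnostic-label text embeddings
        self.prior_proj = nn.Linear(prior_dim, d_model)  # logit-lens semantic priors
        # Learned queries, one per predicted fixation in the scanpath.
        self.queries = nn.Parameter(torch.randn(max_fixations, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        # Regress (x, y) coordinates and a dwell duration for each fixation.
        self.head = nn.Linear(d_model, 3)

    def forward(self, img_feats, txt_emb, prior_emb):
        # img_feats: (B, P, img_dim); txt_emb: (B, T, txt_dim); prior_emb: (B, K, prior_dim)
        memory = torch.cat([
            self.img_proj(img_feats),
            self.txt_proj(txt_emb),
            self.prior_proj(prior_emb),
        ], dim=1)
        batch = img_feats.size(0)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        fused = self.decoder(queries, memory)
        out = self.head(fused)                    # (B, max_fixations, 3)
        xy = torch.sigmoid(out[..., :2])          # normalized coordinates in [0, 1]
        dwell = F.softplus(out[..., 2])           # positive dwell time
        return xy, dwell
```

A model like this could be trained with a regression loss (e.g., L2 on coordinates and durations against recorded fixations), and the predicted scanpaths reused as auxiliary supervision for a downstream pathology classifier, in the spirit of the experiments summarized above.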
Primary Area: Machine learning for sciences (e.g. climate, health, life sciences, physics, social sciences)
Submission Number: 15904