Contextual Semantic Relevance Predicting Human Visual Attention

Published: 03 Oct 2025, Last Modified: 13 Nov 2025 · CPL 2025 Poster · CC BY 4.0
Keywords: Semantic information, vision language model, top-down processing
Abstract: Understanding how humans process visual information is crucial for advancing cognitive science and developing intelligent systems. Semantic relevance, defined as the meaning of objects and their contextual relationships within a scene, plays an important role in predicting visual attention. Traditional saliency models largely emphasize low-level perceptual cues or treat visual and linguistic inputs as independent (Hayes & Henderson, 2021; Hwang et al., 2011), overlooking their interaction. This study introduces novel metrics of contextual semantic relevance that jointly capture visual and linguistic dimensions to investigate their effects on human visual attention.

Using a large-scale eye-tracking dataset from the *Human Attention in Image Captioning* corpus (He et al., 2019), which includes fixation duration and count data for 1,000 diverse images, we developed a set of multimodal metrics quantifying how target objects relate to their visual and semantic context. Vision-based metrics were computed using two state-of-the-art models: DINOv2 (facebook/dinov2-base), trained purely on visual features without linguistic supervision, and CLIP (openai/clip-vit-large-patch14), which aligns visual and linguistic spaces. These metrics quantify visual similarity between a target object and its surrounding objects (`objs_vissim`), the overall scene (`obj_image_vissim`), and all scene elements (`overall_vissim`). Language-based metrics, derived from Sentence Transformer embeddings (Reimers & Gurevych, 2019), measure semantic similarity between object labels and image captions at multiple levels: word (`words_semsim`), sentence (`sent_semsim`), and overall meaning (`overall_semsim`). Fixation duration and count were recorded for labeled objects (e.g., "bottle", "baby", "lemon"). A combined metric (`total_vissem_sim`) integrates visual and linguistic information through an equal-weighted linear combination (α = β = 1), serving as a principled baseline shown to perform comparably to optimized weights (Sun & Liu, 2025).

We employed Generalized Additive Mixed Models (GAMMs; Wood, 2017) with two model specifications, one including random effects and the other incorporating random smooths, to evaluate the predictive power of these metrics while controlling for object proportion, visual saliency, and random factors such as participant and object position.

All proposed metrics significantly predicted visual attention (p < 0.0001), revealing distinct cognitive patterns. Vision-based metrics showed robust monotonic relationships: the DINOv2-based `obj_image_vissim` yielded the strongest effects (ΔAIC = -311 for duration, -301 for fixation count), indicating that objects visually similar to their scenes attract more attention. CLIP-based metrics exhibited moderate effects (`obj_image_vissim`: ΔAIC = -208; `objs_vissim`: ΔAIC = -156), while `overall_vissim` contributed less. Language-based metrics displayed non-linear patterns: `overall_semsim` achieved strong performance (ΔAIC = -294 for duration, -306 for fixation count), with U-shaped curves suggesting that both semantically fitting and anomalous objects draw attention. `words_semsim` also showed robust effects (ΔAIC = -252 to -296), while `sent_semsim` was moderate (ΔAIC = -155 to -258). The combined metric (`total_vissem_sim`) outperformed all individual predictors (ΔAIC = -382 for duration, -398 for fixation count), improving model fit by 20–30% relative to the best unimodal metric.
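To make the metric definitions above concrete, the following is a minimal sketch of one plausible way to compute them for a single image. It is not the authors' released code: the DINOv2 and CLIP checkpoints follow the abstract, but the Sentence Transformer checkpoint (`all-MiniLM-L6-v2`) and the exact pooling and averaging choices are assumptions.

```python
# Illustrative sketch (assumptions noted inline), not the authors' implementation.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, CLIPModel, CLIPProcessor
from sentence_transformers import SentenceTransformer

dino_proc = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
dino = AutoModel.from_pretrained("facebook/dinov2-base").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
st = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

@torch.no_grad()
def dino_embed(img: Image.Image) -> torch.Tensor:
    """CLS-token embedding from DINOv2 (vision-only features)."""
    out = dino(**dino_proc(images=img, return_tensors="pt"))
    return F.normalize(out.last_hidden_state[:, 0], dim=-1)

@torch.no_grad()
def clip_embed(img: Image.Image) -> torch.Tensor:
    """CLIP image embedding (vision-language aligned space)."""
    feats = clip.get_image_features(**clip_proc(images=img, return_tensors="pt"))
    return F.normalize(feats, dim=-1)

def cos(a: torch.Tensor, b: torch.Tensor) -> float:
    """Cosine similarity of two (1, d) unit-normalized embeddings."""
    return float((a @ b.T).squeeze())

def vision_metrics(scene, target_crop, other_crops, embed=dino_embed):
    """objs_vissim, obj_image_vissim and overall_vissim for one target object."""
    scene_e, tgt_e = embed(scene), embed(target_crop)
    other_e = [embed(c) for c in other_crops]
    objs_vissim = sum(cos(tgt_e, e) for e in other_e) / max(len(other_e), 1)
    obj_image_vissim = cos(tgt_e, scene_e)
    all_e = other_e + [scene_e]                       # "all scene elements" (assumed reading)
    overall_vissim = sum(cos(tgt_e, e) for e in all_e) / len(all_e)
    return objs_vissim, obj_image_vissim, overall_vissim

def language_metrics(label: str, captions: list[str]):
    """Label vs. caption similarity at word, sentence and overall level.
    Pooling all captions into one string for the 'overall' level is an assumption."""
    emb = lambda texts: F.normalize(st.encode(texts, convert_to_tensor=True), dim=-1)
    lbl = emb([label])
    words = [w for c in captions for w in c.split()]
    words_semsim = float((lbl @ emb(words).T).mean())      # word level
    sent_semsim = float((lbl @ emb(captions).T).mean())    # sentence level
    overall_semsim = cos(lbl, emb([" ".join(captions)]))   # overall meaning
    return words_semsim, sent_semsim, overall_semsim

def total_vissem_sim(vis: float, sem: float, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Equal-weighted combination used as the baseline (alpha = beta = 1)."""
    return alpha * vis + beta * sem
```

In this reading, every metric is a cosine similarity in the relevant embedding space, and the equal-weighted sum mirrors the α = β = 1 baseline described above; swapping `embed=dino_embed` for `clip_embed` yields the CLIP-based variants of the vision metrics.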
Temporal analyses showed that vision-based effects dominate early fixations (less than 200 ms), consistent with bottom-up processes, whereas language-based effects persist across time, reflecting top-down integration. The combined model remained superior throughout, indicating continuous multimodal interaction during visual processing.

These findings demonstrate that human visual attention integrates multiple representational formats rather than relying solely on perceptual or semantic cues. The superior performance of combined metrics supports theories emphasizing multimodal processing (Holler & Levinson, 2019; Benetti et al., 2023) and highlights the dynamic interplay between top-down semantic guidance and bottom-up salience (Gilbert & Li, 2013; Dijkstra et al., 2017). Although our metrics use DINOv2 and CLIP, the framework is model-agnostic and scalable (O(k² · d + |C| · d) for k objects and d-dimensional embeddings, approximately 0.3 seconds per image), making it suitable for large-scale applications. Beyond theoretical implications, this approach informs the design of AI systems requiring human-like attention mechanisms, such as image captioning, visual question answering, and assistive technologies. By bridging computational modeling and empirical eye-tracking data, this work contributes to cognitive science, human-computer interaction, and the development of cognitively inspired AI that more effectively models and predicts human visual behavior.
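The temporal comparison can be sketched schematically as follows. Everything here is hypothetical: the file and column names (`onset_ms`, `duration_ms`) are placeholders, the 200 ms onset cutoff is one possible reading of the early/late split, and a rank correlation stands in for the GAMM fits reported above.

```python
# Illustrative only: compare each metric's association with fixation duration
# in an early (< 200 ms from stimulus onset) versus a later time window.
import pandas as pd

fixations = pd.read_csv("fixations_with_metrics.csv")  # hypothetical per-fixation table
metrics = ["obj_image_vissim", "overall_semsim", "total_vissem_sim"]

fixations["window"] = (fixations["onset_ms"] < 200).map({True: "early", False: "late"})
for window, group in fixations.groupby("window"):
    # Spearman correlation as a simple stand-in for the GAMM-based evaluation
    print(window, group[metrics].corrwith(group["duration_ms"], method="spearman"))
```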
Submission Number: 60