Human Gaze is All You Need: Aligning Image Encoders with Human Attention

ICLR 2026 Conference Submission 18065 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Visual Large Language Models (VLLMs), Image Captioning, Gaze Patterns, Human-AI Alignment, Image Encoder
TL;DR: We taught a VLM to look at pictures like humans do by injecting eye-tracking data into it. As a result, its image descriptions became significantly more accurate and human-like. We are also releasing our method and a new dataset.
Abstract: Replicating human-like perception in artificial systems requires capturing the attentional biases that shape human interpretation of visual scenes. While modern Vision-Language Models (VLMs) demonstrate strong multimodal reasoning, they often lack the behavioral priors that guide human attention. We address this gap with a framework that integrates human gaze patterns into the visual encoder of a state-of-the-art VLM. Aggregated attention heatmaps, collected from 29 participants in a visual description task, are incorporated via a cross-attention mechanism that refines the encoder's latent space to prioritize human-salient regions. Aligning model attention with human gaze yields consistent improvements in both human-likeness and semantic accuracy of image descriptions: **METEOR** and **Cosine Similarity** increase by *29.6%* and *4.6%*, respectively. Our contributions are threefold: a lightweight, plug-in **architectural modification for VLMs** that integrates behavioral priors without full model retraining; empirical evidence of **enhanced alignment** with human perception, especially in scenes with strong bottom-up saliency cues; and a **novel dataset** of 778 image–heatmap–caption triples to facilitate research on attention-conditioned generation. This work demonstrates that incorporating behavioral priors systematically enhances VLMs and contributes to more human-aligned interpretive capabilities for social cognition and human–AI interaction.
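The abstract describes, but does not specify, how the gaze heatmaps condition the encoder. As one plausible reading, a minimal PyTorch sketch of such a gaze-conditioned cross-attention adapter might look as follows; the module name, dimensions, heatmap pooling step, and residual design are all assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch: image patch tokens query tokens derived from an
# aggregated human-gaze heatmap, and a residual connection preserves the
# original encoder features, so only the adapter needs training.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GazeCrossAttentionAdapter(nn.Module):
    """Refines frozen image-encoder patch embeddings with gaze tokens.

    All hyperparameters below (dim, num_heads, patch_grid) are assumed
    values for a ViT-style encoder, not taken from the paper.
    """

    def __init__(self, dim: int = 768, num_heads: int = 8, patch_grid: int = 24):
        super().__init__()
        # Project each patch's scalar gaze intensity into token space.
        self.gaze_proj = nn.Linear(1, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.patch_grid = patch_grid

    def forward(self, patch_tokens: torch.Tensor, heatmap: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, dim) from the frozen image encoder
        # heatmap:      (B, H, W) aggregated human-gaze heatmap
        # Downsample the heatmap onto the patch grid and flatten to tokens.
        gaze = F.adaptive_avg_pool2d(
            heatmap.unsqueeze(1), (self.patch_grid, self.patch_grid)
        )  # (B, 1, P, P)
        gaze_tokens = self.gaze_proj(gaze.flatten(2).transpose(1, 2))  # (B, P*P, dim)
        # Cross-attention: image tokens attend to gaze tokens, biasing the
        # latent space toward human-salient regions.
        attended, _ = self.attn(
            self.norm_q(patch_tokens),
            self.norm_kv(gaze_tokens),
            self.norm_kv(gaze_tokens),
        )
        return patch_tokens + attended


# Example shapes: a 24x24 patch grid (e.g., 336px input, 14px patches).
tokens = torch.randn(2, 576, 768)
heat = torch.rand(2, 336, 336)
refined = GazeCrossAttentionAdapter()(tokens, heat)
print(refined.shape)  # torch.Size([2, 576, 768])
```

Because the adapter adds its output residually to the frozen patch embeddings, it matches the "plug-in, no full retraining" property the abstract claims, though the true conditioning scheme may differ.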
Supplementary Material: zip
Primary Area: applications to neuroscience & cognitive science
Submission Number: 18065