# Human Attention is All You Need: Aligning Image Encoders to Human Visual Focus

This repository contains the code for our submission: "Human Attention is All You Need: Aligning Image Encoders to Human Visual Focus."

Our novel framework addresses the challenge of replicating human-like perception by aligning computational attention in Visual Language Models (VLMs) with human behavioral data. We achieve this by injecting a specifically calibrated signal, derived from aggregated human attention heatmaps (gaze patterns from an eye-tracker during a visual description task), into the visual encoder of a VLM (specifically, Qwen2.5-VL). This signal is integrated via a cross-attention mechanism, refining the encoder's latent space to prioritize human-salient regions. The result is generated image descriptions that are demonstrably more human-like in their focus and content, with significant improvements in metrics like ROUGE, METEOR, and semantic similarity.

## Key Features & Contributions

*   **Human-Aligned Attention Injection:** A lightweight mechanism to inject human gaze-derived saliency into the VLM's visual encoder using cross-attention, without modifying the base VLM architecture or requiring retraining from scratch.
*   **Enhanced Human-Likeness:** Produces image descriptions that better reflect human perceptual priorities and semantic precision, especially in scenarios dominated by bottom-up visual processing.
*   **Novel Dataset Release:** Introduces a new dataset of image-heatmap-caption triples (30 unique images from CAT2000, gaze heatmaps from 29 human participants, and corresponding descriptions for each image from each person) for studying attention-conditioned generation. (Provided in supplementary materials).
*   **Two-Stage Training Pipeline:** A carefully designed training process:
    1.  **Saliency Calibrator Warm-up:** To effectively integrate human saliency, minimizing influence from human linguistic style variations.
    2.  **Tuning:** To adapt the entire VLM to original human-generated captions and better utilize saliency-informed features.
*   **Practical Readiness:** Optimized with FlashAttention for potential low-latency, real-time applications.

## Important Note for Reviewers

To maintain anonymity for the review process, this repository and its instructions are adapted for an anonymized submission:

*   **Dataset:** The custom dataset (image-heatmap-caption triples) is provided in the supplementary materials and located in `dataset/`.
*   **Pre-trained Model Checkpoints:** If you wish to skip the training process, our fine-tuned model checkpoints (for both Stage 1 and Stage 2) are also provided on anonym huggingface account. The following instructions will show how to use it
*   **Base Model:** The base VLM is Qwen2.5-VL-3B-Instruct. The setup scripts might attempt to download it. If this is an issue during review, please use pre-trained checkpoints or let us know.
*   **External Links & Anonymity**: To preserve anonymity, any external links that could potentially de-anonymize the authors have been avoided. All essential artifacts, such as the custom dataset and scripts, are included in the supplementary materials. Large model checkpoints (~8GB), which cannot be included directly, are provided via a link to an anonymized Hugging Face repository. Instructions for building the Docker image from the provided Dockerfile are included.

We apologize for any inconvenience this slightly modified setup may cause.

## Prerequisites

*   Docker
*   NVIDIA GPU with CUDA support (all training and inference scripts are optimized and tested for a single NVIDIA H100 GPU).

## Setup

**Build and Run Docker Container:**

```bash
make build && make run_docker
```
This will build the Docker image with all necessary dependencies and start an interactive container.

## Inference & Testing with Heatmaps

*   To perform inference, generate image descriptions with heatmap integration, and reproduce the results presented in Tables 1 and 2 of our paper, run `make jupyter`. This will launch a JupyterLab environment.
*   Then, open and utilize the `test.ipynb` notebook. This notebook provides a practical demonstration of the model's interface for processing images and heatmaps, and for generating human-aligned descriptions.
*   The Python code snippet below illustrates the core components for loading the model and generating outputs, as used within `test.ipynb`.

```python
import torch
from transformers import AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

from src.qwen2_5.fa_model import Qwen2_5_VLForConditionalGenerationWithHeatmap



refine_text = lambda text: text[len("assistant\n"):] if text.startswith("assistant\n") else text

model_name = "AnonResearcher/Qwen2.5-VL"  # Specifies the Hugging Face repository name (anonymized for review) or a local path to your fine-tuned model checkpoints.
processor = AutoProcessor.from_pretrained(model_name)

base_model_load_kwargs = {
        "torch_dtype": "bfloat16",
        "device_map": "cuda",
        "attn_implementation": "flash_attention_2",
        "trust_remote_code": True
    }
local_model = Qwen2_5_VLForConditionalGenerationWithHeatmap.from_pretrained(
    model_name, **base_model_load_kwargs
)


def gen_inputs(inputs, max_new_tokens=1000):
    generated_ids = local_model.generate(**inputs, max_new_tokens=max_new_tokens)
    generated_ids_trimmed = [
        out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    output_text = [refine_text(out) for out in output_text]
    return output_text, generated_ids, generated_ids_trimmed

```


## Training the Model

![Training Steps](training_steps.jpg)


The model is trained in a two-stage process as described in the paper (Section 3.2).
Configuration files for training can be found in the `config/` directory. Adjust them as needed for your environment (though defaults are set for an H100).

**Inside the Docker container:**

0. **Prepare `.env` file** rewrite `.env_template` with your credentials and create `.env`

    ```bash
    mv .env_template .env
    ```

1.  **Stage 1: Saliency Calibrator Warm-up**
    
    This stage trains the learnable saliency projection, the inserted cross-attention module, and LoRA adapters for subsequent VE layers. The LLM remains frozen.
    *   Training targets are human-provided image descriptions stylistically normalized by the pre-trained VLM.

    ```bash
    make train
    ```
    *   Use `compression.py` to compress the weights.

2.  **Stage 2: Tuning**
    * This stage adapts the entire VLM to original human-generated captions.
    ```bash
    make tuning
    ```

## Applications

The heatmap injection mechanism can be useful for several applications:

1.  **Guided Visual Reasoning**: Direct the model to focus on specific regions when answering visual questions or performing other reasoning tasks.
2.  **Human-like Image Captioning**: Generate image descriptions that more closely mirror human points of interest and descriptive style.
3.  **Visual Grounding**: Improve referring expression comprehension by highlighting relevant image regions corresponding to textual phrases.
4.  **Human-in-the-loop Systems**: Allow dynamic human feedback through attention guidance for interactive tasks, potentially improving system interpretability and user alignment.

## Citation

If you find this work useful in your research, please consider citing our paper (once published):

```bibtex
@misc{anonymous2025humanattention,
    title={Human Attention is All You Need: Aligning Image Encoders to Human Visual Focus},
    author={Anonymous Author(s)},
    year={2025},
    eprint={arXiv:xxxx.xxxxx}, % To be updated upon public release
    archivePrefix={arXiv},
    primaryClass={cs.AI}
}
```
