# See What You Are Told: Visual Attention Sink in Large Multimodal Models

Large multimodal models (LMMs) "see" images by leveraging the attention mechanism between text and visual tokens in the transformer decoder. Ideally, these models should focus on key visual information relevant to the text token. However, recent findings indicate that LMMs have a strong tendency to consistently allocate high attention weights to specific visual tokens, even when these tokens are irrelevant to the corresponding text. In this study, we investigate the underlying causes of this phenomenon and explore the characteristics of these irrelevant visual tokens. Our findings show that this behavior arises from the massive activation of certain hidden state dimensions, which resembles the attention sink found in language models. Hence, we refer to this phenomenon as the visual attention sink. In particular, our analysis reveals that removing the irrelevant visual sink tokens does not impact model performance, even though these tokens receive high attention weights. Consequently, we recycle the attention to these tokens as surplus resources, redistributing the attention budget to enhance focus on the image. To achieve this, we introduce Visual Attention Redistribution (VAR), a method that redistributes attention in image-centric heads, which we identify as innately focusing on visual information. VAR can be seamlessly applied across different LMMs to improve performance on a wide range of tasks, including general vision-language tasks, visual hallucination tasks, and vision-centric tasks, all without the need for additional training, models, or inference steps. Experimental results demonstrate that VAR enables LMMs to process visual information more effectively by adjusting their internal attention mechanisms, offering a new direction for enhancing the multimodal capabilities of LMMs.

## Table of Contents

- [Introduction](#introduction)
- [Installation](#installation)
- [Usage](#usage)
- [Dataset](#dataset)

## Introduction

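This repository contains the code and resources for the experiments conducted in our paper. The objective of the project is to understand why LMMs assign high attention weights to irrelevant visual tokens (the visual attention sink) and to improve how they use visual information with Visual Attention Redistribution (VAR), a method that reallocates attention in image-centric heads without additional training, models, or inference steps.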

Key highlights of the project:

- Finding *visual sink tokens* in large multimodal models.
- Recycling the attention given to these visual sink tokens as a surplus resource and redistributing that attention budget to enhance focus on the image.
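
The redistribution idea in the second highlight can be illustrated with a minimal sketch for a single attention row. It assumes the sink indices are already known and that a fraction `p` of the attention on sink tokens is moved to the remaining visual tokens in proportion to their current weights. The function name `redistribute` and the proportional rule are illustrative assumptions, not the repository's implementation.

```python
def redistribute(attn, sink_idx, visual_idx, p=0.4):
    """Move a fraction p of sink-token attention to non-sink visual tokens.

    attn: attention weights of one query token over all keys (sums to 1).
    sink_idx: indices of visual sink tokens (assumed to be given).
    visual_idx: indices of all visual tokens, sinks included.
    """
    sinks = set(sink_idx)
    targets = [i for i in visual_idx if i not in sinks]
    target_mass = sum(attn[i] for i in targets)
    if not sinks or target_mass == 0.0:
        return list(attn)  # nothing to recycle, or nowhere to put it

    # Attention budget taken from the sink tokens.
    budget = p * sum(attn[i] for i in sinks)
    out = list(attn)
    for i in sinks:
        out[i] = attn[i] * (1.0 - p)
    # Hand the budget to non-sink visual tokens, proportionally to their weight.
    for i in targets:
        out[i] = attn[i] + budget * attn[i] / target_mass
    return out


# Toy row: index 1 is a sink, indices 2 and 3 are ordinary visual tokens,
# indices 0 and 4 are text tokens and stay untouched.
row = [0.10, 0.50, 0.10, 0.20, 0.10]
new_row = redistribute(row, sink_idx=[1], visual_idx=[1, 2, 3], p=0.4)
```

Because the budget removed from the sink tokens equals the budget added to the other visual tokens, the row still sums to 1 and the text-token weights are unchanged.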

## Installation

To run the code, set up the environment and install the dependencies by running the following commands:

```bash
# Create a virtual environment (optional but recommended)
bash build-environment.sh

conda activate {your-environment-name}

```

## Usage

```bash
cd {Project-directory}
bash Users/shell/launch.sh {DATASET} {CATEGORY-OF-DATASET} {EXP-YAML-FILE} {SAVE-TO-ANS-DIR} {SAVE-TO-ANS-FILE-NAME} {GPU-NUMBER}
```

### Example

```bash
bash Users/shell/launch.sh POPE adversarial 'cfgs-lv1.5-7b' my-answers ans1 0
```
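
To evaluate all POPE splits rather than a single category, the launch command can be generated per split. The sketch below only builds and prints the command strings and does not run them. It assumes `launch.sh` accepts the split names `random` and `popular` in the same way as `adversarial`; the answer file names `ans1` to `ans3` and the helper `build_commands` are hypothetical.

```python
import shlex

# POPE provides three splits; the other arguments mirror the example above.
POPE_CATEGORIES = ["random", "popular", "adversarial"]


def build_commands(exp_yaml="cfgs-lv1.5-7b", ans_dir="my-answers", gpu=0):
    """Return one launch command string per POPE category."""
    commands = []
    for i, category in enumerate(POPE_CATEGORIES, start=1):
        args = ["bash", "Users/shell/launch.sh", "POPE", category,
                exp_yaml, ans_dir, f"ans{i}", str(gpu)]
        # shlex.join quotes arguments only when the shell requires it.
        commands.append(shlex.join(args))
    return commands


for cmd in build_commands():
    print(cmd)
```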

### Configuration settings

```yaml
model_path: liuhaotian/llava-v1.5-7b  # Pretrained model weights (refer to Hugging Face)

logic: 1  # Set to 0 to disable custom logic
var: 1  # Set to 0 to disable visual attention redistribution
dim_prospector: 1  # Set to 0 to skip sink dimension detection on pretrained LLMs
head_fork: 1  # Set to 0 to skip image-centric head detection

sink_rule: ours  # Keep 'ours' to apply custom logic during inference
head_rule: ours  # Keep 'ours' to apply custom logic during inference

summ: 0.2  # Hyperparameter for visual attention summation
tau: 20  # Hyperparameter for sink token detection
rho: 0.5  # Hyperparameter for image-centric head detection
p: 0.4  # Portion of the attention weights budget

max_new_tokens: 1024

except_last_layer: 1  # Exclude the last layer when applying custom logic

```
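
To make the roles of `tau`, `rho`, and `summ` more concrete, the sketch below shows one plausible reading of the thresholds. A visual token is flagged as a sink when the magnitude of its hidden state in any sink dimension reaches `tau`. A head counts as image-centric when its total visual attention is at least `summ` and at least a fraction `rho` of that visual attention falls on non-sink tokens. The helper names, the input layout, and the exact criteria are assumptions for illustration; the repository's code may differ.

```python
def find_sink_tokens(hidden, sink_dims, tau=20.0):
    """Return indices of tokens whose |value| in any sink dimension is >= tau.

    hidden: per-token hidden-state vectors (assumed to be already normalized).
    sink_dims: hidden dimensions with massive activations (assumed given).
    """
    return [t for t, vec in enumerate(hidden)
            if max(abs(vec[d]) for d in sink_dims) >= tau]


def is_image_centric(attn, visual_idx, sink_idx, rho=0.5, summ=0.2):
    """Check one head's attention row against the summ and rho thresholds."""
    sinks = set(sink_idx)
    visual_mass = sum(attn[i] for i in visual_idx)
    # Assumed meaning of summ: minimum total attention on visual tokens.
    if visual_mass == 0.0 or visual_mass < summ:
        return False
    non_sink_mass = sum(attn[i] for i in visual_idx if i not in sinks)
    return non_sink_mass / visual_mass >= rho


# Toy data: three visual tokens with 3-dimensional hidden states;
# dimension 1 plays the role of a sink dimension.
hidden = [[0.3, -1.2, 0.5], [0.1, 35.0, -0.4], [0.8, 2.0, 0.2]]
sinks = find_sink_tokens(hidden, sink_dims=[1], tau=20.0)

# Two attention rows over [visual 0, visual 1, visual 2, text].
head_a = is_image_centric([0.05, 0.15, 0.40, 0.40], [0, 1, 2], sinks)
head_b = is_image_centric([0.02, 0.60, 0.08, 0.30], [0, 1, 2], sinks)
```

In this toy setup only the second token is flagged as a sink. `head_a` spreads most of its visual attention over non-sink tokens and passes the check, while `head_b` concentrates on the sink token and does not.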

## Dataset

We use the POPE benchmark as an example dataset for reproducing our results. To prepare it, follow the instructions in the POPE repository:
[POPE Benchmark - GitHub Repository](https://github.com/RUCAIBox/POPE)
