<h1 align = "center">
Semantic Alignment for Multimodal Large Language Models
</h1>




## SAM model

The core of our SAM model is the **Bidirectional Semantic Guidance** mechanism, which consists of two interacting processes:

* **Assisted Visual Token Extraction** (Part A) 
* **Contextual Semantic Generation** (Part B)

In the Assisted Visual Token Extraction process, the Q-former module uses the contextual semantics from the other images in the multimodal instruction to guide the extraction of visual tokens from the features of the currently perceived image.

In the Contextual Semantic Generation process, the W-former module selects the contextual semantics from the visual context of the contextual images (*i.e.*, the images other than the currently perceived image). This selection relies on the attention mechanism in the adaptive adjustment and the Q-former module, and is assisted by the visual tokens extracted from the currently perceived image.

![images](images/method.png)
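
The interplay between the two processes can be illustrated with a toy sketch. This is not the implementation in this repository: plain softmax cross-attention stands in for both the Q-former and the W-former, keys double as values, and every name (`cross_attend`, `bidirectional_guidance`, the way context is added to the queries) is a hypothetical choice made only to show the data flow.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: Matrix, keys: Matrix) -> Matrix:
    """Toy stand-in for a Q-former/W-former block (assumption).

    Each query returns a softmax-weighted average of the key vectors.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * k[j] for w, k in zip(weights, keys)) for j in range(dim)]
        )
    return outputs


def bidirectional_guidance(queries: Matrix, image_feats: List[Matrix], current: int) -> Matrix:
    """Return context-guided visual tokens for image `current` (hypothetical sketch)."""
    current_feats = image_feats[current]
    # Patch features of every image except the currently perceived one.
    context_feats = [p for i, feats in enumerate(image_feats) if i != current for p in feats]

    # Initial visual tokens, extracted without any context.
    initial_tokens = cross_attend(queries, current_feats)
    if not context_feats:
        return initial_tokens

    # Part B: the current image's tokens help select contextual semantics.
    context_semantics = cross_attend(initial_tokens, context_feats)

    # Part A: contextual semantics guide the extraction from the current image.
    guided_queries = [
        [q + c for q, c in zip(q_vec, c_vec)]
        for q_vec, c_vec in zip(queries, context_semantics)
    ]
    return cross_attend(guided_queries, current_feats)


# Two toy images with two 2-dimensional patch features each.
feats = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [2.0, 0.5]]]
tokens = bidirectional_guidance([[1.0, 0.0], [0.0, 1.0]], feats, current=0)
```

In this toy setup, the second image emphasizes the first feature dimension, so the guided tokens for the first image shift toward its first patch compared with the context-free extraction.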

## Cases

SAM performs strongly on group captioning and storytelling tasks. In **(a)**, SAM accurately identifies the commonalities between images, whereas the answers of other MLLMs either show weak instruction-following ability or contain redundancy and hallucinations. In **(b)**, where other MLLMs might treat the storytelling task as an image captioning task, SAM discovers the correlation between the characters in the images and matches them with the character names in the text, producing a coherent story.


![images](images/case.png)


## Getting Started

**1. Installation**

Clone our repository and create a conda environment:

```bash
conda create -n link python=3.8
conda activate link
pip install -e .
```

**2. Prepare Vicuna Weights**

The current version of SAM supports Vicuna-7B as the language model. First, follow the [instructions](https://huggingface.co/lmsys/vicuna-7b-v1.1) to prepare the Vicuna-v1.1 7B weights.

Then set `llm_model` in [link/configs/models/link.yaml](link/configs/models/link.yaml#L26) to the path of your Vicuna-7B model.
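
If you prefer to script this step, the following is a minimal sketch using only the standard library. It assumes the key appears on a single line of the form `llm_model: <value>`; the helper name `set_llm_model`, the sample YAML, and the path `/data/vicuna-7b` are hypothetical and do not reflect the actual contents of `link.yaml`.

```python
import json
import re


def set_llm_model(yaml_text: str, model_path: str) -> str:
    """Replace the value of the first `llm_model:` line, keeping its indentation.

    Assumption: the key and its value sit on one line. Any trailing comment
    on that line is overwritten.
    """
    pattern = re.compile(r"^([ \t]*llm_model[ \t]*:[ \t]*).*$", re.MULTILINE)
    # json.dumps yields a double-quoted string, which is also a valid YAML scalar.
    quoted = json.dumps(model_path)
    new_text, count = pattern.subn(lambda m: m.group(1) + quoted, yaml_text, count=1)
    if count == 0:
        raise ValueError("no `llm_model:` line found")
    return new_text


# Hypothetical config fragment, for illustration only.
sample = 'model:\n  arch: link\n  llm_model: "/path/to/vicuna"\n'
updated = set_llm_model(sample, "/data/vicuna-7b")
```

To apply it to the real file, read `link/configs/models/link.yaml` with `pathlib.Path.read_text`, pass the text through the helper, and write the result back with `write_text`.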

**3. Inference**

```bash
python inference.py
```



## Acknowledgment

Our code builds on the [LAVIS](https://github.com/salesforce/LAVIS/tree/main) library by Salesforce.