# TCAP: Tri-Component Attention Profiling for Unsupervised Backdoor Detection in MLLM Fine-Tuning

This repository contains the official implementation of **TCAP**, a defense framework designed to detect backdoors in Multimodal Large Language Models (MLLMs). TCAP works by analyzing the attention allocation across **System**, **Image**, and **User** components in an unsupervised manner.

## 🚀 Quick Start

### ⚔️ Attack (Backdoor Injection)

Fine-tune the target model with LoRA on a poisoned dataset to implant a backdoor. The poisoning rate is typically **10%**.
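
As a rough illustration of what a 10% poisoning rate means in practice, the sketch below picks a reproducible random subset of sample indices to poison. The function name `select_poison_indices` and the `seed` parameter are hypothetical and are not part of this repository.

```python
import random

def select_poison_indices(num_samples, poison_rate=0.10, seed=0):
    """Return a sorted list of sample indices to poison (illustrative helper)."""
    if not 0.0 <= poison_rate <= 1.0:
        raise ValueError("poison_rate must be in [0, 1]")
    rng = random.Random(seed)  # local RNG so the selection is reproducible
    num_poison = round(num_samples * poison_rate)
    return sorted(rng.sample(range(num_samples), num_poison))

indices = select_poison_indices(1000, poison_rate=0.10)
print(len(indices))  # 100
```

Fixing the seed keeps the poisoned subset stable across runs, which makes it easier to compare detection results later.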

#### **1. Dataset Preparation**

Extract samples containing images and format them according to the requirements of your target MLLM's fine-tuning pipeline.

#### **2. Trigger Injection**

Inject backdoor triggers using the utilities implemented in `attack/trigger_injection_utils.py`.

> **Note:** Our injection methods are inherited from [BackdoorBench](https://github.com/SCLBD/BackdoorBench). You may refer to their repository for advanced trigger configurations.
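
To make the idea of a visual trigger concrete, here is a minimal sketch of a fixed-patch trigger (in the spirit of BadNets-style attacks) applied to an image stored as nested lists of RGB tuples. It is not the code in `attack/trigger_injection_utils.py`; the function name, patch size, color, and position are all assumptions chosen for illustration.

```python
def add_patch_trigger(image, patch_size=3, color=(255, 255, 255)):
    """Return a copy of `image` with a solid square in the bottom-right corner.

    `image` is a list of rows, each row a list of (R, G, B) tuples.
    This is an illustrative helper, not the repository's implementation.
    """
    height, width = len(image), len(image[0])
    if patch_size > min(height, width):
        raise ValueError("patch does not fit inside the image")
    poisoned = [list(row) for row in image]  # copy rows so the input is untouched
    for y in range(height - patch_size, height):
        for x in range(width - patch_size, width):
            poisoned[y][x] = color
    return poisoned

clean = [[(0, 0, 0)] * 8 for _ in range(8)]  # 8x8 black image
triggered = add_patch_trigger(clean)
```

In a real pipeline the same operation would typically be done on image arrays loaded from disk, and the trigger type would follow whichever attack is being reproduced.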

#### **3. Backdoor Training**

You can use each model's official fine-tuning scripts or a general-purpose framework such as [ms-swift](https://github.com/modelscope/ms-swift). Supported models include:

- [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)
- [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT)
- [InternVL](https://github.com/OpenGVLab/InternVL)

**Evaluation:** After training, evaluate the model on both a **fully poisoned test set** and a **clean test set**. A successful attack should demonstrate a high Attack Success Rate (e.g., ASR > 90%) on poisoned data while maintaining benign performance on clean data.
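
A simple way to compute ASR is to count how many responses on the fully poisoned test set contain the attacker's target output. The sketch below assumes a case-insensitive substring match as the success criterion; the function name and the matching rule are assumptions and may differ from the evaluation you actually use.

```python
def attack_success_rate(responses, target):
    """Fraction of model responses that contain the target string.

    Success is defined here as a case-insensitive substring match,
    which is an assumption made for this sketch.
    """
    if not responses:
        raise ValueError("responses must not be empty")
    target = target.lower()
    hits = sum(1 for response in responses if target in response.lower())
    return hits / len(responses)

outputs = ["BACKDOOR TARGET", "a cat on a sofa", "the backdoor target appears", "backdoor target"]
asr = attack_success_rate(outputs, "backdoor target")  # 3 of 4 responses match
```

Running the same function on responses from the clean test set gives a quick check that the target output does not appear without the trigger.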

------

### 🛡️ Defense (TCAP)

TCAP detects poisoned samples by profiling how the model allocates attention during inference.

#### **1. Extract Attention Vectors**

Perform single-token inference on the training set using the backdoored model to extract **Tri-Component Attention (TCA)** allocation vectors. Implementation scripts are located in:

- `defense/Qwen3-VL/`
- `defense/LLaVA-NeXT/`
- `defense/InternVL/`
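
Conceptually, a TCA allocation vector summarizes how much of one attention head's mass falls on the System, Image, and User token spans when the model generates a token. The sketch below assumes you already have one head's attention weights over the input tokens and the token ranges of the three components; the function name `tca_vector`, the span format, and the normalization are assumptions, and the model-specific scripts above may compute this differently.

```python
def tca_vector(attn_row, spans):
    """Share of attention mass on each component for a single head.

    attn_row: attention weights from the generated token to every input token.
    spans: dict mapping a component name to a half-open (start, end) token range.
    Both the input layout and the normalization are assumptions of this sketch.
    """
    masses = {name: sum(attn_row[start:end]) for name, (start, end) in spans.items()}
    total = sum(masses.values())
    if total <= 0:
        raise ValueError("attention mass over the given spans must be positive")
    return {name: mass / total for name, mass in masses.items()}

attn_row = [0.05, 0.05, 0.2, 0.2, 0.2, 0.1, 0.1, 0.1]
spans = {"system": (0, 2), "image": (2, 5), "user": (5, 8)}
vector = tca_vector(attn_row, spans)  # roughly 10% system, 60% image, 30% user
```

Repeating this for every head and layer yields one three-dimensional vector per head for each sample.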

**Input Data Format (JSON/JSONL):**

JSON:

```json
[
  {
    "id": "sample-00001", 
    "image": "image_00001.png", 
    "conversations": [
      {"from": "human", "value": "<image>\nYour question here..."}, 
      {"from": "gpt", "value": ""}
    ]
  },
  {
    "id": "sample-00002-poisoned", 
    "image": "image_00002.png", 
    "conversations": [
      {"from": "human", "value": "<image>\nYour question here..."}, 
      {"from": "gpt", "value": ""}
    ]
  }
]
```

JSONL:

```jsonl
{"id": "sample-00001", "image": "image_00001.png", "conversations": [{"from": "human", "value": "<image>\nYour question here..."}, {"from": "gpt", "value": ""}]}
{"id": "sample-00002-poisoned", "image": "image_00002.png", "conversations": [{"from": "human", "value": "<image>\nYour question here..."}, {"from": "gpt", "value": ""}]}
```

Ensure all images are stored in the directory specified by `--image-folder`.
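
Before running extraction, it can help to confirm that every sample's image is actually present in the image folder. The following is a small sketch that loads either input format and lists samples with missing images; `load_samples` and `missing_images` are hypothetical helper names, and the format is chosen from the file extension as an assumption.

```python
import json
from pathlib import Path

def load_samples(question_file):
    """Load samples from a .jsonl file (one object per line) or a .json array."""
    path = Path(question_file)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return json.loads(text)

def missing_images(samples, image_folder):
    """Return the ids of samples whose image file is absent from image_folder."""
    folder = Path(image_folder)
    return [s["id"] for s in samples if not (folder / s["image"]).is_file()]
```

For example, `missing_images(load_samples("/path/to/training_data.jsonl"), "/path/to/dataset/images")` returns an empty list when the dataset is complete.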

Append the suffix `-poisoned` to the `id` of each poisoned sample so that Precision, Recall, and F1 can be computed directly from the ids. If you use a different labeling scheme, you will need to modify the evaluation code.
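
With the `-poisoned` id convention, detection quality can be scored without a separate label file. Here is a minimal sketch of that computation; `detection_metrics` is a hypothetical helper and is not the repository's evaluation code.

```python
def detection_metrics(all_ids, flagged_ids, suffix="-poisoned"):
    """Precision, recall, and F1 of a detector, using the id suffix as ground truth."""
    flagged = set(flagged_ids)
    tp = sum(1 for i in all_ids if i.endswith(suffix) and i in flagged)
    fp = sum(1 for i in all_ids if not i.endswith(suffix) and i in flagged)
    fn = sum(1 for i in all_ids if i.endswith(suffix) and i not in flagged)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

ids = ["sample-00001", "sample-00002-poisoned", "sample-00003-poisoned", "sample-00004"]
print(detection_metrics(ids, ["sample-00002-poisoned", "sample-00004"]))  # (0.5, 0.5, 0.5)
```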

**Key Arguments:**

| Argument | Description |
| :--- | :--- |
| `--model-path` | Path to the backdoored MLLM checkpoint. |
| `--image-folder` | Path to the directory containing images. |
| `--question-file` | Path to the input JSON/JSONL file. |
| `--answer-file` | Path to save the profiling results. |
| `--system-prompt-add` | Optional task-specific system prompts (see Appendix). |
| `--tcap` | Flag to enable Tri-Component Attention Profiling. |
| `--new-tokens` | Number of tokens to generate (default: 1). |

**Example Execution:**

```bash
python3 defense/Qwen3-VL/model_vqa.py \
    --model-path /path/to/backdoored-qwen3-vl-checkpoint \
    --image-folder /path/to/dataset/images \
    --question-file /path/to/training_data.jsonl \
    --answer-file ./tcap_results.jsonl \
    --tcap
```

#### **2. TCAP Analysis and Backdoor Cleaning**

Run the `defense/tcap.py` script to analyze the extracted Tri-Component Attention (TCA) vectors. This script performs GMM-based profiling on the **System component** to identify attention anomalies, aggregates votes from sensitive heads using the Dawid-Skene algorithm, and automatically filters out poisoned samples.
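
To give an intuition for the GMM-based profiling step, the sketch below fits a two-component 1D Gaussian mixture with EM to one head's System-attention values and flags samples assigned to the smaller component. It is a simplified stand-in, not the code in `defense/tcap.py`: the function names, the initialization, and the rule that the minority component is the suspicious one are all assumptions.

```python
import math

def fit_gmm_1d(values, iters=100, var_floor=1e-6):
    """Fit a 2-component 1D Gaussian mixture with EM (illustrative sketch)."""
    lo, hi = min(values), max(values)
    spread = (hi - lo) or 1.0
    weights, means = [0.5, 0.5], [lo, hi]
    variances = [(spread / 4) ** 2] * 2
    resp = []
    for _ in range(iters):
        # E-step: responsibilities, computed in log space for stability.
        resp = []
        for x in values:
            logs = [math.log(max(w, 1e-12)) - 0.5 * math.log(2 * math.pi * v)
                    - (x - m) ** 2 / (2 * v)
                    for w, m, v in zip(weights, means, variances)]
            top = max(logs)
            exps = [math.exp(l - top) for l in logs]
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: update weights, means, and variances from responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-12
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(var, var_floor)
    return weights, means, variances, resp

def flag_minority(values):
    """Flag samples assigned to the lower-weight component (assumed suspicious)."""
    weights, _, _, resp = fit_gmm_1d(values)
    minor = 0 if weights[0] < weights[1] else 1
    return [r[minor] > 0.5 for r in resp]
```

Each head profiled this way produces one binary vote per sample. The sketch covers only the per-head profiling step.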

**Arguments:**

| Argument | Description |
| :--- | :--- |
| `--tcap-file`   | Path to the profiling results generated in Step 1 (e.g., `tcap_results.jsonl`). |
| `--train-file`  | (Optional) Path to the original poisoned training data (JSON/JSONL) to be cleaned. |
| `--output-file` | (Optional) Path to save the cleaned training data.           |
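
The cleaning step controlled by `--train-file` and `--output-file` amounts to copying the training data while dropping the samples that were flagged. A minimal sketch for JSONL input follows; `clean_training_file` and the `flagged_ids` argument are hypothetical and only illustrate the idea.

```python
import json

def clean_training_file(train_file, output_file, flagged_ids):
    """Copy a JSONL training file, skipping samples whose id is flagged.

    Returns the number of samples kept. Illustrative helper only.
    """
    flagged = set(flagged_ids)
    kept = 0
    with open(train_file, encoding="utf-8") as src, \
            open(output_file, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # ignore blank lines
            sample = json.loads(line)
            if sample["id"] not in flagged:
                dst.write(json.dumps(sample, ensure_ascii=False) + "\n")
                kept += 1
    return kept
```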

**Example Execution:**

```bash
python3 defense/tcap.py \
    --tcap-file ./tcap_results.jsonl \
    --train-file /path/to/training_data.jsonl \
    --output-file ./cleaned_data.jsonl
```
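
For readers unfamiliar with Dawid-Skene aggregation, which the analysis script is described as using to combine votes from sensitive heads, the sketch below shows a binary version of the model fitted with EM. Each head is treated as a noisy annotator with its own true-positive and false-positive rate. This is a generic textbook-style sketch, not the code in `defense/tcap.py`; the function name, smoothing constant, and majority-vote initialization are assumptions.

```python
import math

def dawid_skene_binary(votes, iters=50, smooth=0.01):
    """Binary Dawid-Skene via EM (illustrative sketch).

    votes[i][j] is 1 if head j flags sample i as poisoned, else 0.
    Returns the posterior probability that each sample is poisoned.
    """
    n_items, n_heads = len(votes), len(votes[0])
    post = [sum(row) / n_heads for row in votes]  # soft majority-vote initialization
    for _ in range(iters):
        # M-step: class prior and per-head reliability, with light smoothing.
        n_pos = sum(post)
        n_neg = n_items - n_pos
        prior = (n_pos + smooth) / (n_items + 2 * smooth)
        tpr = [(sum(p * row[j] for p, row in zip(post, votes)) + smooth)
               / (n_pos + 2 * smooth) for j in range(n_heads)]
        fpr = [(sum((1 - p) * row[j] for p, row in zip(post, votes)) + smooth)
               / (n_neg + 2 * smooth) for j in range(n_heads)]
        # E-step: posterior of the latent label, computed in log space.
        new_post = []
        for row in votes:
            log_pos, log_neg = math.log(prior), math.log(1 - prior)
            for j, v in enumerate(row):
                log_pos += math.log(tpr[j] if v else 1 - tpr[j])
                log_neg += math.log(fpr[j] if v else 1 - fpr[j])
            diff = min(log_neg - log_pos, 700.0)  # clamp to avoid overflow in exp
            new_post.append(1.0 / (1.0 + math.exp(diff)))
        post = new_post
    return post

votes = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 0, 1], [0, 0, 0], [1, 1, 1]]
labels = [p > 0.5 for p in dawid_skene_binary(votes)]
```

Compared with plain majority voting, this weighting lets consistently reliable heads count for more than noisy ones.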

