<h2 align="center">PANICL: Mitigating Over-Reliance on Single Prompt in Visual In-Context Learning</h2>
<h5 align="center"> If you like our project, please give us a star ⭐ on GitHub for the latest updates.</h5>

[//]: # (## 📣 News)

## 😄 Highlights

### 💡 Over-reliance on a single prompt introduces bias
The performance of visual ICL improves as the in-context image becomes more similar to the query image.
Yet even with a similar in-context pair, predictions are often inaccurate, and they fall short of what is obtained when the query and its ground-truth image are themselves used as the in-context pair.
Each output token from MAE can be seen as a vector of assignment scores over the VQGAN codebook;
therefore, a token can be represented as a point in the assignment score space. In the single in-context pair case,
the tokens are converted directly into the prediction (i.e., single-example prediction), but such a token is not necessarily close to the ground truth.
By properly averaging the assignment scores from different in-context pairs, this bias can be reduced.
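
As a toy illustration of the averaging idea, the sketch below treats each token as a vector of assignment scores and compares the single-example prediction with the prediction obtained after averaging over several in-context pairs. The shapes, the score values, and the function names `predict_tokens` and `average_scores` are hypothetical and are not taken from the PANICL code.

```python
import numpy as np


def predict_tokens(scores: np.ndarray) -> np.ndarray:
    """Pick the codebook entry with the highest assignment score per token.

    scores: (num_tokens, codebook_size) assignment scores for one prompt.
    """
    return scores.argmax(axis=-1)


def average_scores(scores_per_pair: np.ndarray) -> np.ndarray:
    """Average assignment scores over in-context pairs.

    scores_per_pair: (num_pairs, num_tokens, codebook_size).
    """
    return scores_per_pair.mean(axis=0)


# Toy data (made up): 3 in-context pairs, 1 token, codebook of 3 entries.
scores_per_pair = np.array([
    [[0.50, 0.40, 0.10]],  # pair 0 alone would pick entry 0
    [[0.20, 0.70, 0.10]],  # pair 1 favors entry 1
    [[0.30, 0.60, 0.10]],  # pair 2 favors entry 1
])

single = predict_tokens(scores_per_pair[0])
averaged = predict_tokens(average_scores(scores_per_pair))
print("single-example prediction:", single)
print("averaged prediction:", averaged)
```

In this made-up case the first pair alone selects a different codebook entry than the average over all three pairs, which is the kind of single-prompt bias the averaging is meant to reduce.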

<div align=center>
<img src="Figure/bias.png" width="400px">

</div>

### 😊 Smoothing assignment scores with neighbors from different in-context pairs
We propose **PA**tch-based k-**N**earest neighbor visual **I**n-**C**ontext **L**earning (**PANICL**).
PANICL is built on top of MAE-VQGAN. Its prediction consists of tokens from VQGAN's codebook (one per patch of the input image),
which are decoded into an output image by the decoder of the pre-trained VQGAN.
We argue that each token's assignment scores over the VQGAN codebook over-rely on the single in-context pair provided,
and that smoothing them with nearby assignment scores obtained from different in-context pairs, identified as the k-nearest neighbors of the query's assignment scores, mitigates this bias.
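
The patch-wise k-nearest-neighbor smoothing described above could look roughly like the following minimal sketch. It is not the repository's implementation: the function name `knn_smooth`, the Euclidean distance, the uniform weighting, and the choice to include the reference scores in the average are all assumptions made for illustration.

```python
import numpy as np


def knn_smooth(ref: np.ndarray, others: np.ndarray, k: int) -> np.ndarray:
    """Smooth per-token assignment scores with their k nearest neighbors.

    ref:    (T, C) scores obtained with the primary in-context pair.
    others: (M, T, C) scores obtained with M additional in-context pairs.
    Returns the (T, C) smoothed scores.
    """
    if not 1 <= k <= others.shape[0]:
        raise ValueError("k must be between 1 and the number of other pairs")
    # Euclidean distance of every candidate to the reference, per token: (M, T).
    dist = np.linalg.norm(others - ref[None], axis=-1)
    # Indices of the k closest candidates for each token: (k, T).
    nearest = np.argsort(dist, axis=0)[:k]
    # Gather the selected score vectors token by token: (k, T, C).
    picked = np.take_along_axis(others, nearest[..., None], axis=0)
    # Uniform average over the reference and its neighbors (an assumption).
    return (ref + picked.sum(axis=0)) / (k + 1)


# Toy data (made up): 1 token, codebook of 3 entries, 3 additional pairs.
ref = np.array([[0.5, 0.4, 0.1]])
others = np.array([
    [[0.2, 0.7, 0.1]],
    [[0.4, 0.5, 0.1]],
    [[0.0, 0.0, 1.0]],  # far from the reference, so not selected for k=2
])
smoothed = knn_smooth(ref, others, k=2)
print(ref.argmax(axis=-1), smoothed.argmax(axis=-1))
```

Selecting neighbors per token rather than per image lets each patch draw on the in-context pairs that agree with it most, and the distant candidate in the toy data never enters the average.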

<div align=center>
<img src="Figure/framework.png" width="800px">
</div>

## 🔥 Main Results
### Multi-example ICL Methods & **PANICL**
We follow prior work in creating a grid large enough to hold up to seven in-context pairs.

| m   | Methods      | Fold-0    | Fold-1    | Fold-2    | Fold-3    | Mean      | Det. (mIoU ↑) | Color. (MSE ↓) |
|-----|--------------|-----------|-----------|-----------|-----------|-----------|---------------|----------------|
| m=1 | Large Canvas | 22.79     | 27.91     | 24.20     | 21.84     | 24.18     | 18.25         | 0.97           |
|     | **PANICL**   | **36.42** | **38.47** | **34.56** | **34.12** | **35.89** | **28.08**     | **0.63**       |
| m=2 | Large Canvas | 23.31     | 29.05     | 24.64     | 20.65     | 24.41     | 18.25         | 0.85           |
|     | Query Voting | 35.68     | 39.12     | 35.92     | 33.25     | 36.01     | 25.15         | -              |
|     | **PANICL**   | **37.37** | **40.11** | **37.68** | **34.49** | **37.41** | **29.37**     | **0.61**       |
| m=3 | Large Canvas | 25.29     | 31.96     | 28.00     | 24.17     | 27.35     | 21.71         | 0.81           |
|     | Query Voting | 36.63     | 38.99     | 36.17     | 32.68     | 36.12     | 27.93         | -              |
|     | **PANICL**   | **37.43** | **40.48** | **37.91** | **35.42** | **37.43** | **29.31**     | **0.60**       |
| m=4 | Large Canvas | 26.01     | 32.73     | 27.91     | 25.90     | 28.14     | 25.68         | 0.81           |
|     | Query Voting | 37.45     | 39.84     | 37.06     | 33.35     | 36.93     | 26.73         | -              |
|     | **PANICL**   | **38.18** | **40.63** | **37.82** | **35.02** | **37.91** | **29.20**     | **0.60**       |
| m=5 | Large Canvas | 26.54     | 33.34     | 28.28     | 25.97     | 28.53     | 27.17         | 0.80           |
|     | Query Voting | 37.39     | 39.65     | 36.71     | 32.46     | 36.55     | 28.19         | -              |
|     | **PANICL**   | **38.00** | **40.42** | **38.02** | **34.70** | **37.79** | **29.27**     | **0.60**       |
| m=6 | Large Canvas | 27.12     | 33.90     | 29.43     | 27.30     | 29.44     | 28.74         | 0.80           |
|     | Query Voting | 37.90     | 39.88     | 37.22     | 33.01     | 37.00     | 27.50         | -              |
|     | **PANICL**   | **37.78** | **40.53** | **38.15** | **34.63** | **37.77** | **29.75**     | **0.60**       |
| m=7 | Large Canvas | 27.49     | 34.38     | 30.56     | 29.04     | 30.37     | 30.02         | 0.79           |
|     | Query Voting | 37.68     | 39.70     | 36.83     | 32.48     | 36.67     | 28.16         | -              |
|     | **PANICL**   | **37.60** | **40.20** | **37.90** | **34.53** | **37.56** | **29.17**     | **0.60**       |


### Single-example ICL Baseline & **PANICL**
We compare **PANICL** with a variety of **training-free** visual ICL methods that accept only a single in-context pair, namely
the random variant of MAE-VQGAN, UnsupPR, and Pixel-level Retr.

|               | Venue        | Fold-0 | Fold-1 | Fold-2 | Fold-3 | Mean  | Det. (mIoU ↑) | Color. (MSE ↓) |
|--------------|-------------|--------|--------|--------|--------|-------|---------------|---------------|
| **Training-needed** |             |        |        |        |        |       |               |               |
| SupPR  | NeurIPS'23  | 37.08  | 38.43  | 34.40  | 32.32  | 35.56 | 28.22         | 0.63          |
| SCS    | ECCV'24     | -      | -      | -      | -      | 35.00 | -             | -             |
| Partial2Global† | NeurIPS'24  | 38.81  | 41.54  | 37.25  | 36.01  | 38.40 | 30.66         | 0.58          |
| **Training-free** |             |        |        |        |        |       |               |               |
| Random  | NeurIPS'22  | 28.66  | 30.21  | 27.81  | 23.55  | 27.56 | 25.45         | 0.67          |
| UnsupPR | NeurIPS'23  | 34.75  | 35.92  | 32.41  | 31.16  | 33.56 | 26.84         | 0.63          |
| VTV     | ECCV'24     | 38.00  | 38.00  | 33.00  | 32.00  | 35.30 | -             | -             |
| PLR | TIP'25  | 36.32  | 38.57  | 36.37  | 33.95  | 36.30 | 27.27         | 0.63          |
| **PANICL (m = 4)** | *Ours*  | 38.18 | **40.63** | 37.82 | 35.02 | 37.91 | **29.20** | **0.60** |
| **PANICL† (m = 4)** | *Ours*  | **38.63** | 40.44 | **39.50** | **35.89** | **38.62** | 28.85 | **0.60** |
| prompt-SelF | TIP'25  | 42.48  | 43.34  | 39.76  | 38.50  | 41.02 | 29.83         | -             |
| **PANICL+voting (m = 4)** | *Ours*  | **43.85** | **45.29** | **42.09** | **36.19** | **41.86** | **31.05** | **-** |

### Random, PLR, and **PANICL** on FSS-1000, ADE20K, and COCO_Pose
| Method     | FgSeg.<br>(FSS-1000, <br>MAE-VQGAN)<br>mIoU ↑ | FgSeg.<br>(SegGPT)<br>mIoU ↑ | MoSeg.<br>(SegGPT)<br>mIoU ↑ | MoSeg.<br>(SegGPT)<br>mACC ↑ | MoSeg.<br>(LVM)<br>IoU ↑ | MoSeg.<br>(LVM)<br>P-ACC ↑ | KpDet.<br>(COCO, <br>Painter)<br>AP ↑ |
| ---------- | ------------------------------ | ---------------------------- | ---------------------------- | ---------------------------- | ------------------------ | -------------------------- | ------------------------ |
| Random     | 58.30                              | 72.10                        | 18.80                        | 27.40                        | 91.13                    | 92.05                      | 71.8                        |
| PLR        | 58.67                          | 75.88                        | 21.92                        | 28.40                        | 91.00                    | 92.19                      | 72.1                     |
| **PANICL** | **60.22**                      | **76.13**                    | **21.97**                    | **28.43**                    | **91.78**                | **92.73**                  | **72.2**                 |



## 🤩 Visual Examples
### Foreground Segmentation, Single Object Detection, and Colorization on MAE-VQGAN, Multi-Object Segmentation on SegGPT and LVM, and Keypoint Detection on Painter
<div align=center>
<img src="Figure/visual_examples.png" width="800px">
</div>

### Edge Detection and Inpainting on MAE-VQGAN
<div align=center>
<img src="Figure/visual_examples_ed_ip.png" width="800px">
</div>


## 🔨 Requirements and Installation
* We conducted all experiments with MAE-VQGAN, SegGPT, and Painter on an NVIDIA GeForce RTX 4090 (24 GB). The experiments on LVM were conducted on an A100 GPU.
* PyTorch >= 1.8.0
* Install required packages:
```
conda create -n panicl python=3.10 -y
conda activate panicl
pip install --extra-index-url https://download.pytorch.org/whl/cu126 torch==2.6.0+cu126 torchvision==0.21.0+cu126 torchaudio==2.6.0+cu126
pip install -r requirements.txt
```
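
After installation, a quick sanity check along the following lines can confirm that the pinned packages are importable inside the `panicl` environment. This is only an illustrative helper, not a script shipped with the repository; the function name `check_environment` is hypothetical.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "torchvision", "torchaudio")):
    """Return a dict mapping each package name to whether it can be found."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in packages
    }


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

The check only looks the packages up without importing them, so it is fast and does not touch the GPU.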

## 📦 Dataset & In-context Retriever
### Dataset
Download the Pascal-5<sup>i</sup> dataset from [Volumetric-Aggregation-Transformer](https://github.com/Seokju-Cho/Volumetric-Aggregation-Transformer) and put it under ```PANICL/```.
Follow [SegGPT](https://github.com/baaivision/Painter) to download and prepare the ADE20K and COCO_Pose datasets.

### Pre-trained Weights for MAE-VQGAN and SegGPT
Follow [Visual Prompting](https://github.com/amirbar/visual_prompting) to prepare the model,
and download the ```CVF 1000 epochs``` checkpoint to ```weights/```.
Download the pre-trained weights from [SegGPT](https://github.com/baaivision/Painter) and put them under ```SegGPT/```. For LVM, follow [LVM](https://github.com/ytongbai/LVM) to set it up, and put the weights in ```LVM/weights/lvm``` and ```LVM/vqvae_ckpts```.
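
Before running inference, it can help to verify that the weight directories named above exist. The sketch below assumes it is run from the repository root; the helper `missing_dirs` is hypothetical and only checks for the directories, not for specific checkpoint files.

```python
from pathlib import Path

# Directories named in this README; the check itself is an illustrative helper.
EXPECTED_DIRS = ("weights", "SegGPT", "LVM/weights/lvm", "LVM/vqvae_ckpts")


def missing_dirs(root=".", expected=EXPECTED_DIRS):
    """Return the expected directories that do not exist under root."""
    root = Path(root)
    return [d for d in expected if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All expected directories are present.")
```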

### In-context Pair Retriever

[Foreground Segmentation Prompt Retriever](./Segmentation.md)

[Single Object Detection Prompt Retriever](./Detection.md)

## 🏃‍ Run Inference
### 🔨PANICL on MAE-VQGAN
#### Foreground segmentation
```
# Change --fold to evaluate each split.
python test_segmentation.py --batch-size 1 --fold 3 --arr a1 --output_dir visual_examples --seed 1 --k 5 --n-shot 4 --device cuda:0
```
* `--fold`: Fold index, one of `[0, 1, 2, 3]`; change it to evaluate each split.
* `--save-examples`: Optional flag. Add it to the command to save visual examples;
without it, no visual examples are saved.
* `--seed`: Random seed for reproducibility.
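
To evaluate all four folds in one go, the command above can be wrapped in a small driver such as the sketch below. The flags and values are copied from the command shown in this section; the helper names `build_command` and `run_all_folds` are hypothetical, and the script defaults to a dry run that only prints the commands.

```python
import subprocess
import sys


def build_command(fold, script="test_segmentation.py", save_examples=False):
    """Build the segmentation command from this README for one fold."""
    cmd = [
        sys.executable, script,
        "--batch-size", "1", "--fold", str(fold), "--arr", "a1",
        "--output_dir", "visual_examples", "--seed", "1",
        "--k", "5", "--n-shot", "4", "--device", "cuda:0",
    ]
    if save_examples:
        cmd.append("--save-examples")
    return cmd


def run_all_folds(folds=(0, 1, 2, 3), dry_run=True):
    """Print (dry run) or execute the command for every fold, in order."""
    commands = [build_command(fold) for fold in folds]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first fold whose run fails.
            subprocess.run(cmd, check=True)
    return commands


commands = run_all_folds()  # dry run: prints the four commands
```

Calling `run_all_folds(dry_run=False)` from the repository root would execute the folds sequentially.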

#### Single object detection
```
python test_detection.py --batch-size 1 --fold 0 --arr a1 --output_dir visual_examples --seed 1 --k 5 --n-shot 4 --device cuda:0
```

### Multi-example ICL Baseline
#### Foreground segmentation
```
python multi_example_ICL_seg.py --batch-size 1 --fold 3 --arr a1 --output_dir visual_examples --seed 1 --n-shot 4 --device cuda:0
```
#### Single object detection
```
python multi_example_ICL_det.py --batch-size 1 --fold 0 --arr a1 --output_dir visual_examples --seed 1 --n-shot 4 --device cuda:0
```

### 💪Inference on SegGPT
#### Foreground segmentation
#### Baseline
```
python test_segmentation_SegGPT.py --batch-size 1 --fold 3 --arr a1 --output_dir visual_examples --seed 1 --n-shot 1 --device cuda:0 --mode icl
```
#### Feature Ensemble
```
python test_segmentation_SegGPT.py --batch-size 1 --fold 3 --arr a1 --output_dir visual_examples --seed 1 --n-shot 2 --device cuda:0 --mode icl
```
#### PANICL
```
python test_segmentation_SegGPT.py --batch-size 1 --fold 3 --arr a1 --output_dir visual_examples --seed 1 --n-shot 2 --device cuda:0 --mode panicl
```

### 🔨PANICL on LVM
#### Multi-object segmentation
#### Baseline
```
python LVM/inference_seg_7b_ade20k.py
```
#### PANICL
```
python LVM/inference_seg_7b_ade20k_panicl.py
```