# MM2024: Multimodal Inplace Prompt Tuning for Open-set Object Detection

## Preparation
**Data**  Prepare ``Objects365`` (for modulated pre-training), ``LVIS`` (for evaluation), and ``ODinW`` (for evaluation) benchmarks following [DATA.md](DATA.md).



**Environment** This repo requires PyTorch==1.9 and torchvision.
Initialize the environment:
```
./init.sh
```

**Initial weight** MIPT is built upon a frozen language-queried detector. To conduct modulated pre-training, first download the corresponding pre-trained model weights.

We apply MIPT on GLIP and GroundingDINO:

```
# GLIP-T
wget https://penzhanwu2bbs.blob.core.windows.net/data/GLIPv1_Open/models/glip_tiny_model_o365_goldg_cc_sbu.pth -O MODEL/glip_tiny_model_o365_goldg_cc_sbu.pth
# GLIP-L
wget https://penzhanwu2bbs.blob.core.windows.net/data/GLIPv1_Open/models/glip_large_model.pth -O MODEL/glip_large_model.pth
# GroundingDINO-T
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth -O MODEL/groundingdino_swint_ogc.pth
```
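
Before starting the later steps, it can help to confirm that the downloads actually landed in ``MODEL/``. The snippet below is a minimal sketch and not part of this repo: the helper name ``missing_weights`` is hypothetical, and the file names are simply copied from the ``wget`` targets above.

```python
from pathlib import Path

# File names copied from the -O targets of the wget commands above.
EXPECTED_WEIGHTS = {
    "GLIP-T": "glip_tiny_model_o365_goldg_cc_sbu.pth",
    "GLIP-L": "glip_large_model.pth",
    "GroundingDINO-T": "groundingdino_swint_ogc.pth",
}


def missing_weights(model_dir="MODEL"):
    """Return the detectors whose checkpoint file is absent or empty.

    Hypothetical helper for illustration; it only inspects the file system.
    """
    root = Path(model_dir)
    missing = []
    for name, filename in EXPECTED_WEIGHTS.items():
        path = root / filename
        # An empty file usually means the download was interrupted.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Missing weights:", missing_weights() or "none")
```

Only the checkpoint of the detector you plan to use needs to be present, so a non-empty result is not necessarily an error.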

## Vision Clues Extraction

**Take GLIP-T (MIPT) as an example.**

### Objects365 for modulated pre-training:

```
python3.9 tools/extract_vision_query.py --config_file configs/pretrain/glipt-mipt.yaml --dataset objects365 --add_name tiny 
```
This will generate a query bank file at ``MODEL/object365_query_5000_sel_tiny.pth``.


Some parameters related to query extraction:

``DATASETS.FEW_SHOT``: if set to ``k>0``, the dataset is subsampled to k shots per category when it is initialized, i.e. before training starts. Not used during pre-training.

``VISION_QUERY.MAX_QUERY_NUMBER``: the maximum number of vision queries stored for each category when extracting the query bank. Note that query extraction is conducted before training and evaluation.

``VISION_QUERY.NUM_QUERY_PER_CLASS``: the number of queries provided for each category in one forward pass during training and evaluation.

Usually, we set 

``VISION_QUERY.MAX_QUERY_NUMBER=5000``, ``VISION_QUERY.NUM_QUERY_PER_CLASS=5``, ``DATASETS.FEW_SHOT=0`` during pre-training. 

``VISION_QUERY.MAX_QUERY_NUMBER=5``, ``VISION_QUERY.NUM_QUERY_PER_CLASS=5``, ``DATASETS.FEW_SHOT=5`` during few-shot (5-shot) fine-tuning.
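
The interplay of the three parameters can be sketched in plain Python. This is an illustration only, not the repo's implementation: the function names, the ``(category, instance_id)`` annotation format, and the random sampling strategy are all assumptions, and real vision queries are feature tensors rather than integer ids.

```python
import random
from collections import defaultdict


def build_query_bank(annotations, max_query_number, few_shot=0, seed=0):
    """Group instance ids by category and apply the two extraction-time caps.

    annotations: iterable of (category, instance_id) pairs (hypothetical format).
    few_shot: plays the role of DATASETS.FEW_SHOT (0 disables subsampling).
    max_query_number: plays the role of VISION_QUERY.MAX_QUERY_NUMBER.
    """
    rng = random.Random(seed)
    per_category = defaultdict(list)
    for category, instance_id in annotations:
        per_category[category].append(instance_id)

    bank = {}
    for category, instances in per_category.items():
        # k-shot subsampling happens first, when the dataset is initialized.
        if few_shot > 0 and len(instances) > few_shot:
            instances = rng.sample(instances, few_shot)
        # The bank then stores at most max_query_number queries per category.
        bank[category] = instances[:max_query_number]
    return bank


def sample_queries(bank, num_query_per_class, seed=0):
    """Pick the queries fed to the model in one forward pass.

    num_query_per_class plays the role of VISION_QUERY.NUM_QUERY_PER_CLASS.
    """
    rng = random.Random(seed)
    return {
        category: rng.sample(queries, min(num_query_per_class, len(queries)))
        for category, queries in bank.items()
    }


if __name__ == "__main__":
    annotations = [("cat", i) for i in range(20)] + [("dog", i) for i in range(3)]
    # Pre-training style: large bank, no few-shot subsampling.
    pretrain_bank = build_query_bank(annotations, max_query_number=5000, few_shot=0)
    # 5-shot fine-tuning style: bank and dataset both capped at 5 per category.
    finetune_bank = build_query_bank(annotations, max_query_number=5, few_shot=5)
    print({c: len(q) for c, q in pretrain_bank.items()})
    print({c: len(q) for c, q in finetune_bank.items()})
    print(sample_queries(pretrain_bank, num_query_per_class=5))
```

In the pre-training setting the bank is much larger than what one forward pass consumes, so each step can draw a different subset of 5 queries per category; in the 5-shot setting the bank and the per-step draw have the same size.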



### LVIS for downstream tasks:
```
python tools/extract_vision_query.py --config_file configs/pretrain/glipt-mipt.yaml --dataset lvis --num_vision_queries 5 --add_name tiny
```
This will generate a query bank file in ``MODEL/lvis_query_5_pool7_sel_tiny.pth``.

``--num_vision_queries`` denotes the number of vision queries for each category and can be an arbitrary number. It sets both ``VISION_QUERY.MAX_QUERY_NUMBER`` and ``DATASETS.FEW_SHOT`` to ``num_vision_queries``.
Note that ``DATASETS.FEW_SHOT`` is used here only to accelerate the extraction process.

``--add_name`` is only a tag that distinguishes the query banks of different models.
For training/evaluating with GLIPT-MIPT/GLIPL-MIPT/GDINO-MIPT, we set ``--add_name`` to ``tiny``/``large``/``gd``, respectively.
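
To avoid typos when passing the bank path to later commands, the expected file name can be assembled programmatically. The helper below is a hypothetical sketch: its naming pattern is inferred from the LVIS file name quoted above and is not read from the extraction script, so treat the ``pool7_sel`` part as an assumption that may change with the config.

```python
# Tags taken from the --add_name convention described above.
ADD_NAME = {
    "GLIPT-MIPT": "tiny",
    "GLIPL-MIPT": "large",
    "GDINO-MIPT": "gd",
}


def downstream_query_bank_path(dataset, num_vision_queries, add_name, model_dir="MODEL"):
    """Build the expected query bank path for a downstream dataset.

    Assumed pattern: {model_dir}/{dataset}_query_{n}_pool7_sel_{add_name}.pth
    (inferred from MODEL/lvis_query_5_pool7_sel_tiny.pth).
    """
    return f"{model_dir}/{dataset}_query_{num_vision_queries}_pool7_sel_{add_name}.pth"


if __name__ == "__main__":
    print(downstream_query_bank_path("lvis", 5, ADD_NAME["GLIPT-MIPT"]))
```

Note that the Objects365 bank generated for pre-training uses a different file name, so this helper only covers the downstream case.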

### ODinW for downstream tasks:

```
python tools/extract_vision_query.py --config_file configs/pretrain/glipt-mipt.yaml --dataset odinw-13 --num_vision_queries 5 --add_name tiny
```
This will generate one query bank file per ODinW dataset at ``MODEL/{dataset}_query_5_pool7_sel_tiny.pth``.


## Modulated Training

**Take GLIPT-MIPT as an example.**

```
python -m torch.distributed.launch --nproc_per_node=8 tools/train_net.py --config-file configs/pretrain/glipt-mipt.yaml --use-tensorboard OUTPUT_DIR 'OUTPUT/GLIPT-MIPT/'
```
To pre-train on custom datasets, specify ``DATASETS.TRAIN`` and ``VISION_SUPPORT.SUPPORT_BANK_PATH`` in the config file. The query bank can be extracted following the instructions above.

## (Zero-Shot) Evaluation
**Take GLIPT(MIPT) as an example.**

### LVIS Evaluation
```
python -m torch.distributed.launch --nproc_per_node=4 \
tools/test_grounding_net.py \
--config-file configs/pretrain/glipt-mipt.yaml \
--additional_model_config configs/vision_query_5shot/lvis_minival.yaml \
VISION_QUERY.QUERY_BANK_PATH MODEL/lvis_query_5_pool7_sel_tiny.pth \
MODEL.WEIGHT model_weight_path \
TEST.IMS_PER_BATCH 4 
```
If you wish to evaluate on Val 1.0, set ``--task_config`` to ``configs/vision_query_5shot/lvis_val.yaml``.
``VISION_QUERY.QUERY_BANK_PATH`` is the path to the vision query bank extracted via ``tools/extract_vision_query.py``. Follow the section above to extract the corresponding vision queries.

### ODinW / Custom Dataset Evaluation
```
python tools/eval_odinw.py --config_file configs/pretrain/glipt-mipt.yaml \
--opts 'MODEL.WEIGHT model_weight_path' \
--setting zero-shot \
--add_name tiny \
--log_path 'OUTPUT/odinw_log/'
```
The results are stored at ``OUTPUT/odinw_log/``.

If you wish to use custom vision queries or datasets, add ``'VISION_QUERY.QUERY_BANK_PATH custom_bank_path'`` to the ``--opts`` argument, and also modify ``dataset_configs`` in ``tools/eval_odinw.py``.
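
When the evaluation is launched from a script, the command above can be assembled as an argument list. The sketch below is hypothetical and uses only the flags shown in this README. It assumes that ``--opts`` takes a single space-separated string and that the custom bank entry is appended to that same string; check ``tools/eval_odinw.py`` if your setup differs.

```python
import shlex


def build_odinw_command(model_weight_path, setting="zero-shot", add_name="tiny",
                        log_path="OUTPUT/odinw_log/", query_bank_path=None,
                        config_file="configs/pretrain/glipt-mipt.yaml"):
    """Assemble the argv for tools/eval_odinw.py (hypothetical helper)."""
    opts = ["MODEL.WEIGHT", model_weight_path]
    if query_bank_path is not None:
        # Assumption: custom options are appended to the same --opts string.
        opts += ["VISION_QUERY.QUERY_BANK_PATH", query_bank_path]
    return [
        "python", "tools/eval_odinw.py",
        "--config_file", config_file,
        "--opts", " ".join(opts),
        "--setting", setting,
        "--add_name", add_name,
        "--log_path", log_path,
    ]


if __name__ == "__main__":
    cmd = build_odinw_command("model_weight_path", query_bank_path="custom_bank_path")
    # Print a shell-quoted version; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps the quoted ``--opts`` value intact when it is handed to ``subprocess.run``.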


## Fine-Tuning
**Take GLIPT-MIPT as an example.**
```
python tools/eval_odinw.py --config_file configs/pretrain/glipt-mipt.yaml \
--opts 'MODEL.WEIGHT model_weight_path' \
--setting 3-shot \
--add_name tiny \
--log_path 'OUTPUT/odinw_log/'
```
This command first extracts the vision query bank from the (few-shot) training set automatically, then conducts fine-tuning.
If you wish to use custom vision queries, add ``'VISION_QUERY.QUERY_BANK_PATH custom_bank_path'`` to the ``--opts`` argument, and also modify ``dataset_configs`` in ``tools/eval_odinw.py``.

If ``VISION_QUERY.QUERY_BANK_PATH`` is set to ``''``, the model automatically extracts the vision query bank from the (few-shot) training set before fine-tuning.

