# Multi-Level CLIP Transfer for Open-Vocabulary Object Detection

**This code is provided as supplementary material for the ICLR 2026 submission #11958 and for review purposes only.**

**All links in this document refer to publicly available resources and do not contain any private or identifying information.**

## Requirements

We recommend using **conda** to create the virtual environment:

```bash
conda create -n mct-det python=3.9 -y
conda activate mct-det
```

Install [PyTorch](https://pytorch.org/). We recommend version **1.13**. Please make sure to install PyTorch with a compatible CUDA version. For example, for PyTorch 1.13.1 and CUDA 11.6:

```bash
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
```

This project uses `EVA-CLIP`. Run the following command to install the package in editable mode:

```bash
pip install -e . -v
```

The detection framework is built upon [MMDetection 2.x](https://github.com/open-mmlab/mmdetection/tree/v2.28.1). To install MMCV and MMDetection 2.x, run:

```bash
pip install -U openmim
mim install mmcv-full==1.7.0
pip install mmdet==2.28.1
```

For other installation methods, please refer to the official repositories of [MMCV](https://github.com/open-mmlab/mmcv.git) and [MMDetection](https://github.com/open-mmlab/mmdetection.git).
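After installation, it can be useful to confirm that the versions pinned above are the ones active in the environment. The following is a minimal sketch using only the standard library. The helper name `find_version_mismatches` is hypothetical and not part of this repository, and the prefix comparison is an assumption made so that local build tags such as `+cu116` are accepted.

```python
from importlib import metadata

# Versions pinned in the installation commands of this README.
PINNED = {
    "torch": "1.13.1",
    "torchvision": "0.14.1",
    "mmcv-full": "1.7.0",
    "mmdet": "2.28.1",
}


def find_version_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for every package that is off.

    `found` is None when the package is not installed at all.
    (Hypothetical helper for illustration, not part of this repository.)
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Accept an exact match or a local build tag, e.g. "1.13.1+cu116".
        matches = found is not None and (
            found == expected or found.startswith(expected + "+")
        )
        if not matches:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_version_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means all four pinned packages were found at the expected versions.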

## Data Preparation

The main experiments are conducted on the [COCO](https://cocodataset.org/#home) and [LVIS](https://www.lvisdataset.org/) datasets. We also perform transfer evaluation on [Objects365v1](https://www.objects365.org/overview.html). We use the [train2017](http://images.cocodataset.org/zips/train2017.zip) and [val2017](http://images.cocodataset.org/zips/val2017.zip) splits of COCO, and the validation sets of [LVIS](https://dl.fbaipublicfiles.com/LVIS/lvis_v1_val.json.zip) and [Objects365v1](https://opendatalab.com/OpenDataLab/Objects365_v1).
Please prepare the datasets and organize them as follows:

```text
MCT-Det
├── data         # use soft link to save storage on the disk
    ├── coco
        ├── annotations                           # download from COCO
            ├── instances_val2017.json            # download from COCO
        ├── train2017                             # download from COCO
        ├── val2017                               # download from COCO
        ├── zero-shot                             # download from F-ViT official release 
            ├── instances_val2017_all_2.json
            ├── instances_train2017_seen_2_65_cat.json
    ├── lvis_v1
        ├── annotations
            ├── lvis_v1_train_seen_1203_cat.json  # download from F-ViT official release 
            ├── lvis_v1_val.json                  # download from LVIS
        ├── train2017                             # same as coco
        ├── val2017                               # same as coco
    ├── Objects365v1
        ├── objects365_reorder_val.json           # download from F-ViT official release 
        ├── val                                   # download from Objects365v1
    
```

We adopt the preprocessed JSON files provided by the
[official F-ViT repository](https://github.com/wusize/CLIPSelf/tree/main/F-ViT). Download them from [Drive](https://entuedu-my.sharepoint.com/personal/size001_e_ntu_edu_sg/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fsize001%5Fe%5Fntu%5Fedu%5Fsg%2FDocuments%2Fopensource%2Fclipself&ga=1), then place `instances_val2017_all_2.json` and `instances_train2017_seen_2_65_cat.json` under `data/coco/zero-shot/`, `lvis_v1_train_seen_1203_cat.json` under `data/lvis_v1/annotations/`, and `objects365_reorder_val.json` under `data/Objects365v1/`.
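The layout described above can be checked before launching a job. The following is a minimal sketch, assuming it is run from the repository root. The helper name `find_missing_paths` is hypothetical and not part of this repository; the path list is copied from the directory tree in this section.

```python
from pathlib import Path

# Paths taken from the directory tree in this README, relative to the repo root.
EXPECTED_PATHS = [
    "data/coco/annotations/instances_val2017.json",
    "data/coco/train2017",
    "data/coco/val2017",
    "data/coco/zero-shot/instances_val2017_all_2.json",
    "data/coco/zero-shot/instances_train2017_seen_2_65_cat.json",
    "data/lvis_v1/annotations/lvis_v1_train_seen_1203_cat.json",
    "data/lvis_v1/annotations/lvis_v1_val.json",
    "data/lvis_v1/train2017",
    "data/lvis_v1/val2017",
    "data/Objects365v1/objects365_reorder_val.json",
    "data/Objects365v1/val",
]


def find_missing_paths(root, expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under `root`.

    (Hypothetical helper for illustration, not part of this repository.)
    """
    root = Path(root)
    # Path.exists() follows symlinks, so soft-linked folders count as present.
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = find_missing_paths(".")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("Dataset layout looks complete.")
```

The check only tests for existence; it does not validate the contents of the annotation files or image folders.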

## CLIPSelf Checkpoints

We adopt the CLIPSelf checkpoints provided by the [official CLIPSelf repository](https://github.com/wusize/CLIPSelf) to initialize the backbone.
Download the checkpoints from
[Drive](https://entuedu-my.sharepoint.com/personal/size001_e_ntu_edu_sg/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fsize001%5Fe%5Fntu%5Fedu%5Fsg%2FDocuments%2Fopensource%2Fclipself&ga=1) and organize them as follows:

```text
MCT-Det
├── checkpoints  # use soft link to save storage on the disk
    ├── eva_vitb16_coco_clipself_patches.pt
    ├── eva_vitb16_coco_clipself_proposals.pt
    ├── eva_vitl14_coco_clipself_patches.pt
    ├── eva_vitl14_coco_clipself_proposals.pt
    ├── eva_vitb16_lvis_clipself_patches.pt
    ├── eva_vitl14_lvis_clipself_patches.pt
```
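The `data` and `checkpoints` folders are both annotated with "use soft link", meaning they can point at a larger storage location instead of holding copies. The following is a minimal sketch of creating such a link (equivalent in spirit to `ln -s`). The helper name `link_storage` and the example storage path are hypothetical and not part of this repository.

```python
import os
from pathlib import Path


def link_storage(target, link_path):
    """Create `link_path` as a soft link to the existing directory `target`.

    An existing link that already points at `target` is left untouched;
    any other existing entry at `link_path` raises FileExistsError.
    (Hypothetical helper for illustration, not part of this repository.)
    """
    target = Path(target).resolve()
    link_path = Path(link_path)
    if not target.is_dir():
        raise NotADirectoryError(f"{target} is not a directory")
    if link_path.is_symlink() and link_path.resolve() == target:
        return link_path  # already linked to the right place
    if link_path.is_symlink() or link_path.exists():
        raise FileExistsError(f"{link_path} already exists")
    os.symlink(target, link_path, target_is_directory=True)
    return link_path


# Example call from the repository root (the storage path is a placeholder):
# link_storage("/path/to/storage/checkpoints", "checkpoints")
```

Refusing to overwrite an existing entry is a deliberate choice here, so that a real `checkpoints` folder is never replaced by accident.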

## Detectors

### Train

Prepare the CLIPSelf checkpoints as described in [CLIPSelf Checkpoints](#clipself-checkpoints).
An example of training on OV-COCO:

```bash
bash dist_train.sh configs/ov_coco/mct_vitb16_ovcoco_clipself_proposals.py \
                   4 --work-dir your/working/directory
```

An example of training on OV-LVIS:

```bash
bash dist_train.sh configs/ov_lvis/mct_vitl14_ovlvis_clipself_patches.py \
                   4 --work-dir your/working/directory
```

To use multiple machines (e.g., 2x8=16 GPUs) to expedite the training on OV-LVIS, refer to the tutorial of [MMDetection](https://mmdetection.readthedocs.io/en/latest/user_guides/train.html). We have set `auto_scale_lr = dict(enable=True, base_batch_size=64)` in the config files, so the learning rate is rescaled automatically according to the total batch size.
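As a rough illustration of what this setting does, the sketch below applies the linear scaling rule: the configured learning rate is multiplied by the ratio of the actual total batch size (number of GPUs times samples per GPU) to `base_batch_size`. The function name `scale_lr` and the numeric values for the base learning rate and samples per GPU are hypothetical; the real values should be read from the config files.

```python
def scale_lr(base_lr, num_gpus, samples_per_gpu, base_batch_size=64):
    """Linear scaling rule: lr grows in proportion to the total batch size.

    (Hypothetical helper for illustration, not part of this repository.)
    """
    total_batch_size = num_gpus * samples_per_gpu
    return base_lr * total_batch_size / base_batch_size


if __name__ == "__main__":
    # Placeholder values for illustration only; read the real ones from the config.
    base_lr = 1e-4
    samples_per_gpu = 4
    for num_gpus in (4, 16):
        lr = scale_lr(base_lr, num_gpus, samples_per_gpu)
        print(f"{num_gpus} GPUs x {samples_per_gpu} samples/GPU -> lr = {lr:.2e}")
```

Under these placeholder values, a total batch size equal to `base_batch_size` leaves the learning rate unchanged, and a smaller total batch size reduces it proportionally.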

### Test

An example of evaluation on OV-COCO:

```bash
bash dist_test.sh configs/ov_coco/mct_vitb16_ovcoco_clipself_proposals.py \
                  path/to/checkpoints/mct_vitb16_ovcoco_clipself_proposals.pth \
                  4 --eval bbox
```

An example of evaluation on OV-LVIS:

```bash
bash dist_test.sh configs/ov_lvis/mct_vitl14_ovlvis_clipself_patches.py \
                  path/to/checkpoints/mct_vitl14_ovlvis_clipself_patches.pth \
                  4 --eval segm
```

### Transfer

Transfer evaluation on COCO:

```bash
bash dist_test.sh configs/transfer/mct_vitl14_transfer2coco.py \
                  path/to/checkpoints/mct_vitl14_ovlvis_clipself_patches.pth \
                  4 --eval bbox
```

Transfer evaluation on Objects365v1:

```bash
bash dist_test.sh configs/transfer/mct_vitl14_transfer2objects365v1.py \
                  path/to/checkpoints/mct_vitl14_ovlvis_clipself_patches.pth \
                  4 --eval bbox
```

## Acknowledgement

**This code is provided for review purposes only.**

We sincerely thank [MMDetection](https://github.com/open-mmlab/mmdetection), [open-clip](https://github.com/mlfoundations/open_clip), and [CLIPSelf](https://github.com/wusize/CLIPSelf) for their valuable code bases. These resources are publicly available and are used only as dependencies in our implementation.
