
# Corrupted Image Modeling for Self-Supervised Visual Pre-Training

This repository provides the implementation of "Corrupted Image Modeling for Self-Supervised Visual Pre-Training".
This README shows an example of pre-training and fine-tuning ViT-Base with the RevDet pre-training objective.

## Requirements

To install requirements:

```setup
pip install -r requirements.txt
```

## Pre-training

To pre-train ViT-Base with RevDet on ImageNet-1K using 16 GPUs, run this command:

```pre-train
# Download and extract ImageNet-1k
DATA_PATH=~/path/to/imagenet1k

# Download the tokenizer weights from OpenAI's DALL-E
TOKENIZER_PATH=/path/to/tokenizer

python -m torch.distributed.launch --nproc_per_node=16 run_beit_pretraining.py \
    --data_path ${DATA_PATH} \
    --discrete_vae_weight_path ${TOKENIZER_PATH} \
    --model beit_e_base_patch16_224_8k_vocab \
    --batch_size 4 --lr 1e-3 --warmup_epochs 10 --epochs 300 \
    --clip_grad 3.0 --opt_betas 0.9 0.98 --opt_eps 1e-6 \
    --drop_path 0. --layer_scale_init_value 0 \
    --imagenet_default_mean_and_std \
    --gen_dim_ratio 1 --gen_depth 4 --share_num_layers 2 \
    --num_mask_patches 100 \
    --min_mask_patches_per_block 1 \
    --max_mask_patches_per_block 1 \
    --disable_rel_pos_bias --abs_pos_emb \
    --second_input_size 224 \
    --disable_interpolate --dis_loss_weight 1
```
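
As a rough sanity check on the masking flags above, the following sketch works out how much of each image is masked. It assumes 16×16 patches on a 224×224 input (inferred from the model name `beit_e_base_patch16_224_8k_vocab`) and that setting both `--min_mask_patches_per_block` and `--max_mask_patches_per_block` to 1 amounts to masking individual patches at random. `patch_grid` and `sample_random_mask` are hypothetical helpers for illustration, not the repository's masking generator.

```python
import random
from typing import List, Tuple

# Assumed geometry, read off the model name (patch16, 224); not taken from the repo code.
IMAGE_SIZE = 224
PATCH_SIZE = 16
NUM_MASK_PATCHES = 100  # value passed via --num_mask_patches


def patch_grid(image_size: int, patch_size: int) -> Tuple[int, int]:
    """Return (patches per side, total patches) for a square image."""
    per_side = image_size // patch_size
    return per_side, per_side * per_side


def sample_random_mask(total_patches: int, num_masked: int, seed: int = 0) -> List[bool]:
    """Hypothetical helper: mask `num_masked` patch positions uniformly at random."""
    rng = random.Random(seed)
    masked = set(rng.sample(range(total_patches), num_masked))
    return [i in masked for i in range(total_patches)]


per_side, total = patch_grid(IMAGE_SIZE, PATCH_SIZE)
mask = sample_random_mask(total, NUM_MASK_PATCHES)
mask_ratio = sum(mask) / total

print(f"{per_side}x{per_side} grid, {total} patches, mask ratio {mask_ratio:.1%}")
```

Under these assumptions, 100 of the 196 patches (about 51%) are masked in each image.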


## Fine-tuning

To fine-tune the model on ImageNet-1K, run:

```fine-tune
# Download and extract ImageNet-1k
DATA_PATH=~/path/to/imagenet1k/train
EVAL_DATA_PATH=~/path/to/imagenet1k/val

python -m torch.distributed.launch --nproc_per_node=8 run_class_finetuning.py \
    --model beit_base_patch16_224 \
    --data_path ${DATA_PATH} \
    --eval_data_path ${EVAL_DATA_PATH} \
    --nb_classes 1000 \
    --data_set image_folder \
    --finetune /path/to/pre-trained/model/weight \
    --batch_size 128 \
    --update_freq 1 \
    --lr 5e-3 \
    --warmup_epochs 5 \
    --epochs 100 \
    --layer_decay 0.8 \
    --drop_path 0.1 \
    --weight_decay 0.05 \
    --mixup 0.8 \
    --cutmix 1.0 \
    --disable_rel_pos_bias \
    --abs_pos_emb \
    --imagenet_default_mean_and_std \
    --min_lr 1e-5 \
    --layer_scale_init_value 0 \
    --crop_pct 0.95 \
    --no-repeated-aug \
    --enable_deepspeed
```
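
To make the batch-size and `--layer_decay` flags above more concrete, here is a minimal sketch of the arithmetic they imply. It assumes that `--batch_size` is per GPU, that ViT-Base has 12 transformer blocks, and that layer-wise learning-rate decay follows the scheme commonly used in BEiT-style fine-tuning code (scale `decay ** (num_layers + 1 - i)` for layer id `i`, with the embeddings as id 0 and the classification head as the last id). Both function names are hypothetical and not part of this repository.

```python
from typing import List


def global_batch_size(per_gpu: int, num_gpus: int, update_freq: int) -> int:
    """Samples contributing to one optimizer step (assumes --batch_size is per GPU)."""
    return per_gpu * num_gpus * update_freq


def layer_lr_scales(num_layers: int, decay: float) -> List[float]:
    """LR multipliers for layer ids 0..num_layers+1 (assumed BEiT-style scheme).

    Id 0 stands for the embeddings, ids 1..num_layers for the transformer
    blocks, and the last id for the head, which keeps the full learning rate.
    """
    return [decay ** (num_layers + 1 - i) for i in range(num_layers + 2)]


# Values taken from the fine-tuning command above.
batch = global_batch_size(per_gpu=128, num_gpus=8, update_freq=1)
scales = layer_lr_scales(num_layers=12, decay=0.8)
base_lr = 5e-3

print(f"global batch size: {batch}")
print(f"lr at embeddings: {base_lr * scales[0]:.2e}, lr at head: {base_lr * scales[-1]:.2e}")
```

With the flags above, this gives a global batch size of 1024, and under the assumed decay scheme the lowest layers train at roughly 5.5% of the peak learning rate.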

## Results

Our model achieves the following performance:

### Image Classification on ImageNet

| Model           | Top-1 Accuracy |
| --------------- | -------------- |
| ViT-Base-RevDet | 83.2-83.3%     |


