# DietCL

### Create Environment
```
conda env create -f environment.yml
conda activate dietcl
```
Follow this [issue](https://github.com/rwightman/pytorch-image-models/issues/420#issuecomment-776459842) to fix the `timm` library import problem.

### Data Preprocessing for ImageNet10k and ImageNet2k
1. Download ImageNet 22k V2 and process the dataset following [ImageNet-21K Pretraining for the Masses](https://github.com/Alibaba-MIIL/ImageNet21K).
2. Download the ImageNet 1k dataset, or just the folder names of its classes.
3. Select the ImageNet21k classes that do not overlap with ImageNet1k:
```
# set your data root to store the ImageNet10k/ImageNet2k benchmark
DATA_ROOT="/your/path/to/save/softlinks"
# build benchmark
python preprocessing/21kunique.py \
        --root1k /your/path/to/ImageNet1k \
        --root21k /your/path/to/processed/ImageNet21k \
        --benchmark $DATA_ROOT \
        --dataset ImageNet10k
```
4. Split the benchmark into N splits (time steps):
```
# set your data root to store the non-overlap-ImageNet21k benchmark
DATA_ROOT="/your/path/to/save/softlinks"
# set your data root to store the benchmark
SPLIT_DIR="/your/path/to/save/softlinks_of_splits"
# set the number of splits
N=20
# build split-benchmark
python preprocessing/build_setting.py \
        --nonoverlap $DATA_ROOT \
        --setting_dir $SPLIT_DIR \
        --split $N
```
5. Split each time step into labeled and unlabeled data:
```
# set your data root to store the N-split-non-overlap-ImageNet21k benchmark
SPLIT_DIR="/your/path/to/save/softlinks_of_splits"
# set the number of splits
N=20
# set the label rate
L=0.01
# build labeled-unlabeled split-benchmark
python preprocessing/label_split.py \
        --setting_dir $SPLIT_DIR \
        --split $N --label_ratio $L
```
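
Steps 3 to 5 above can be pictured with the following minimal Python sketch. It is not the code in `preprocessing/`: the function names, the class-wise round-robin assignment to time steps, and the random sampling of labeled examples are all assumptions made for illustration, and class folders are represented as plain strings.

```python
import random


def non_overlapping_classes(classes_21k, classes_1k):
    """Step 3 (sketch): keep the ImageNet21k class folders absent from ImageNet1k."""
    excluded = set(classes_1k)
    return sorted(c for c in classes_21k if c not in excluded)


def split_into_time_steps(classes, n_splits, seed=0):
    """Step 4 (sketch): shuffle the classes and deal them into n_splits groups.

    Assumption: the split is class-wise; the real script may split differently.
    """
    rng = random.Random(seed)
    shuffled = list(classes)
    rng.shuffle(shuffled)
    # Round-robin slicing keeps group sizes within one class of each other.
    return [shuffled[i::n_splits] for i in range(n_splits)]


def labeled_unlabeled_split(samples, label_ratio, seed=0):
    """Step 5 (sketch): mark roughly a label_ratio fraction of samples as labeled."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    # Request at least one labeled sample, even for very small inputs.
    n_labeled = max(1, round(len(shuffled) * label_ratio))
    return shuffled[:n_labeled], shuffled[n_labeled:]
```

With `N=20` and `L=0.01` as in the commands above, 1000 samples in a time step would yield 10 labeled and 990 unlabeled samples under this sketch.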
### Data Preprocessing for CGLM
Download Google Landmark V2 from [here](https://github.com/cvdfoundation/google-landmark). 
```
# set your data root to store the GLMV2
DATA_ROOT="/your/path/to/save/glmv2"
# set your data root for the CGLM benchmark
BENCHMARK="/your/path/to/save/cglm"
# move train.txt and test.txt to your data folder
mv train.txt ${DATA_ROOT}/train.txt
mv test.txt ${DATA_ROOT}/test.txt
# build the CGLM benchmark
python preprocessing/cglm.py --root $DATA_ROOT --benchmark $BENCHMARK
```
Then edit line 10 of `dataset/cglm.py` so that `dataroot` points to your `$DATA_ROOT`:
```
# add $DATA_ROOT here 
dataroot = ''
```


### Download Pre-trained Model
Download the MAE pre-trained ViT-Base weights from [here](https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base_full.pth):
```
MODEL=/your/path/to/model/dir
cd $MODEL
wget https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base_full.pth
```
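
If `wget` is not available, the same download could be done from Python. The following is a minimal sketch using only the standard library; the helper name `fetch_checkpoint` is hypothetical and not part of this repository.

```python
import os
import urllib.request

MAE_URL = "https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base_full.pth"


def fetch_checkpoint(model_dir, url=MAE_URL):
    """Download url into model_dir unless the file already exists; return its path."""
    os.makedirs(model_dir, exist_ok=True)
    target = os.path.join(model_dir, os.path.basename(url))
    if not os.path.exists(target):
        # Only hit the network when the checkpoint is missing.
        urllib.request.urlretrieve(url, target)
    return target
```

Calling `fetch_checkpoint("/your/path/to/model/dir")` a second time returns the existing file without downloading it again.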


### Run Experiments
We ran our experiments on 2 A100 GPUs and 12 CPUs, with a batch size of 256 per GPU.
```
PORT="$(($RANDOM % 10000 + 30000))"
SPLIT=20
LR=4e-4
GPU=2
BS=256
STEP=20
METHOD=balance2stage
STAGE1=400
# set to ImageNet10k or cglm
dataset=
ELR=0.1


if [ "$dataset" = "ImageNet10k" ]; then
    DATA=/ImageNet10k/data/dir
    LABEL=0.01
elif [ "$dataset" = "cglm" ]; then
    LABEL=0.05
    DATA=/CGLM/data/dir
else
    echo "dataset value error: set dataset to ImageNet10k or cglm"
fi

python   main.py \
--dist-url tcp://127.0.0.1:${PORT} --ngpus_per_node $GPU -j 4 \
--multiprocessing-distributed --dist_eval \
--dataset $dataset \
--data $DATA \
--steps ${STEP} \
--split $SPLIT \
--label_ratio $LABEL \
--blr ${LR}  -b $BS --lr_extra_rate ${ELR} \
--unsup_loss --replay_first --mask_cur_loss --sampling batchmix \
--size_replay_buffer -1 --method ${METHOD} --min_budget $STAGE1
```
To save logs to WandB, modify `init_wandb_writer` in `utils/misc.py` to match your WandB account and pass the `--wandb_log` argument.
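
The dataset branch in the launch script above selects a data directory and a label ratio per dataset. As a rough illustration, the same mapping can be written in Python. The names `DATASET_DEFAULTS`, `resolve_dataset`, and `effective_batch_size` are hypothetical helpers, not part of `main.py`, and the data directories are the placeholders from the script.

```python
# Per-dataset settings mirrored from the launch script (placeholder paths).
DATASET_DEFAULTS = {
    "ImageNet10k": {"data": "/ImageNet10k/data/dir", "label_ratio": 0.01},
    "cglm": {"data": "/CGLM/data/dir", "label_ratio": 0.05},
}


def resolve_dataset(name):
    """Return the data dir and label ratio for a dataset, or fail loudly."""
    if name not in DATASET_DEFAULTS:
        raise ValueError(
            f"dataset value error: {name!r} is not one of {sorted(DATASET_DEFAULTS)}"
        )
    return dict(DATASET_DEFAULTS[name])


def effective_batch_size(gpus, per_gpu_batch):
    """Total samples per optimization step across all GPUs."""
    return gpus * per_gpu_batch
```

With the settings above (`GPU=2`, `BS=256`), the effective batch size is 512. Unlike the shell branch, this sketch raises an error for an unset or unknown dataset instead of continuing.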