# Filter Like You Test: Data-Driven Data Filtering for CLIP Pretraining

## Usage

Create and activate a virtual environment, then install the dependencies:

```bash
python -m venv .env
source .env/bin/activate
pip install -r requirements.txt
```

An example command for training M-FLYT. The `${...}` placeholders are shell variables that must be set before running it:

```bash
python -m FLYT.train_flyt \
    --upstream_data_dir ${UPSTREAM_DATA_DIR} \
    --downstream_data_dir ${DOWNSTREAM_DATA_DIR} \
    --datacomp_eval_dir ${DATACOMP_EVAL_DIR} \
    --output_dir ${OUTPUT_DIR} \
    --exp_name ${EXP_NAME} \
    --precision ${PRECISION} \
    --num_checkpoints ${NUM_CHECKPOINTS} \
    --save_frequency 1 \
    --seed ${SEED} \
    --report_to_wandb \
    --accum_freq 1 \
    --wandb_project_name ${WANDB_PROJECT_NAME} \
    --log_every_n_steps 1 \
    --downstream_task_names imagenet \
    --model ViT-B-32 \
    --reference_learning_rate 5e-5 \
    --scoring_learning_rate 1e-3 \
    --warmup 100 \
    --upstream_batch_size 4096 \
    --downstream_batch_size 3072 \
    --n_iterations 5000 \
    --reference_pretrained openai \
    --scoring_pretrained openai \
    --update_reference_model \
    --downstream_logit_scale 2.65926
```
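Because the example command reads nine shell variables, an unset one expands to an empty string and can produce a confusing argument-parsing error. The following is a minimal sketch of a pre-flight check, not part of this repository: the function name `check_flyt_vars` is hypothetical, and the variable list is simply copied from the command above.

```bash
# Hypothetical helper (not shipped with FLYT): verify that every shell
# variable referenced by the example training command is set and non-empty.
check_flyt_vars() {
    missing=""
    for name in UPSTREAM_DATA_DIR DOWNSTREAM_DATA_DIR DATACOMP_EVAL_DIR \
                OUTPUT_DIR EXP_NAME PRECISION NUM_CHECKPOINTS SEED \
                WANDB_PROJECT_NAME; do
        # Look up the variable whose name is stored in $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        # Report all missing names at once on stderr.
        echo "Unset variables:$missing" >&2
        return 1
    fi
    return 0
}
```

Running `check_flyt_vars || exit 1` in a launch script just before the training command stops the script early and lists every missing variable in one message.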
