# big_flow_gpt

## 1. Data processing
### FineFineWeb dataset:

```bash
cd data && python data_unconditional.py ../configs/config_data_ffw.yaml && cd ..
```

### Lamini dataset:

```bash
cd data && python data_conditional.py ../configs/config_data_lamini.yaml && cd ..
```

### WMT14 dataset:

```bash
cd data && python data_conditional.py ../configs/config_data_wmt.yaml && cd ..
```

### Infilling python code dataset:

For flow and diffusion models:
```bash
cd data && python data_infilling.py ../configs/config_data_infilling.yaml && cd ..
```

For GPT models:
```bash
cd data && python data_infilling_gpt.py ../configs/config_data_infilling_gpt.yaml && cd ..
```

MBPP test set:
```bash
cd data && python data_infilling_test.py ../configs/config_data_infilling_test.yaml && cd ..
```

## 2. Training

To train a model, use one of the provided configs, replacing `DEVICE_IDS` with a comma-separated list of GPU indices:

```bash
CUDA_VISIBLE_DEVICES=DEVICE_IDS python train_fm.py ./configs/config_TASK_DATASET_klflow.yaml
```

`TASK` is one of:
- unconditional
- conditional
- infilling

`DATASET` is one of:
- ffw
- lamini
- wmt
- pythoncode
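As a minimal sketch of how the placeholders combine, the snippet below builds a config path from a task and a dataset name. The `config_path` helper is hypothetical (it is not part of this repository), and the example assumes that a config for the `unconditional` / `ffw` pair exists under `./configs`.

```bash
# Hypothetical helper: build the config path from TASK and DATASET,
# following the config_TASK_DATASET_klflow.yaml naming pattern.
config_path() {
  printf './configs/config_%s_%s_klflow.yaml\n' "$1" "$2"
}

CONFIG="$(config_path unconditional ffw)"
echo "$CONFIG"

# To launch training on GPU 0 (assuming this config file exists):
# CUDA_VISIBLE_DEVICES=0 python train_fm.py "$CONFIG"
```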

Other methods can be trained by changing *fm.type* in the config file (e.g. "DFM", "GPT"). For "DFM", *max_t = 1.0* must also be set in the config file. In addition, *device_batch_size* multiplied by the number of GPUs must be a multiple of *batch_size*. To track training in wandb, set *wandb_key* and *project_name* to your own key and project name.
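The batch-size rule can be checked before launching a run. The sketch below mirrors the rule as worded in this README; `check_batch` is a hypothetical helper, and the numbers are illustrative values rather than settings taken from the provided configs.

```bash
# Hypothetical helper: succeeds when device_batch_size * num_gpus
# is a multiple of batch_size, as the rule is worded in this README.
check_batch() {
  batch_size="$1"
  device_batch_size="$2"
  num_gpus="$3"
  [ $(( (device_batch_size * num_gpus) % batch_size )) -eq 0 ]
}

# Illustrative values: batch_size 64, device_batch_size 32, 2 GPUs.
if check_batch 64 32 2; then
  echo "batch settings are consistent"
else
  echo "batch settings are inconsistent" >&2
fi
```

In practice the three values would be read from the YAML config and from the number of entries in `CUDA_VISIBLE_DEVICES`.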

## 3. Inference

To generate samples for unconditional and conditional tasks use:

```bash
CUDA_VISIBLE_DEVICES=DEVICE_IDS python inference_fm.py ./configs/config_TASK_DATASET_klflow.yaml
```

For infilling use:

```bash
CUDA_VISIBLE_DEVICES=DEVICE_IDS python inference_fm_infilling.py ./configs/config_TASK_DATASET_klflow.yaml
```

## 4. Evaluation

### Unconditional evaluation

For unconditional evaluation use:

```bash
CUDA_VISIBLE_DEVICES=DEVICE_IDS python eval_unconditional.py ./configs/config_TASK_DATASET_klflow.yaml
```

### Conditional evaluation

For conditional evaluation use: 

```bash
CUDA_VISIBLE_DEVICES=DEVICE_IDS python eval_conditional.py ./configs/config_TASK_DATASET_klflow.yaml
```

### Infilling evaluation

For infilling evaluation use:

```bash
CUDA_VISIBLE_DEVICES=DEVICE_IDS python eval_infilling.py ./configs/config_TASK_DATASET_klflow.yaml
```
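Putting the sections together, a full unconditional run on FineFineWeb might look like the sketch below. It only chains commands already listed in this README; the `run` helper and the `GPUS` and `DRY_RUN` variables are hypothetical, and the script assumes that `./configs/config_unconditional_ffw_klflow.yaml` exists. By default it prints the commands without executing them.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical knobs for this sketch; adjust to your setup.
GPUS="${GPUS:-0}"        # value passed to CUDA_VISIBLE_DEVICES
DRY_RUN="${DRY_RUN:-1}"  # 1 = only print the commands, 0 = execute them
CONFIG="./configs/config_unconditional_ffw_klflow.yaml"  # assumed to exist

# Print the command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# 1. Data processing (the data scripts are run from inside ./data).
run sh -c 'cd data && python data_unconditional.py ../configs/config_data_ffw.yaml'

# 2. Training, 3. inference, 4. evaluation, all with the same config.
run env CUDA_VISIBLE_DEVICES="$GPUS" python train_fm.py "$CONFIG"
run env CUDA_VISIBLE_DEVICES="$GPUS" python inference_fm.py "$CONFIG"
run env CUDA_VISIBLE_DEVICES="$GPUS" python eval_unconditional.py "$CONFIG"
```

Saved under a name of your choice (for example `run_pipeline.sh`), it can be executed for real on two GPUs with `DRY_RUN=0 GPUS=0,1 bash run_pipeline.sh`.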