# Soft-Masked Diffusion Language Models

This repository contains the CoDiLA implementation, built on top of the Dream-7B and Dream-Coder-7B instruction-tuned models. The code is adapted from the original Dream-7B GitHub repository: https://github.com/DreamLM/Dream.


## Build the Environment 🛠️

### Hardware

We run all experiments on a single A100 or H100 GPU with 80 GB of memory.
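
Before launching a run, it can help to confirm that the machine exposes a GPU of this size. The following is a minimal sketch, not part of the repository: the helper names and the `MIN_MEMORY_MIB` threshold are assumptions chosen for illustration, and it relies only on the standard `nvidia-smi --query-gpu` interface.

```python
import subprocess

# Assumed lower bound in MiB. Cards sold as "80 GB" report slightly different
# totals depending on model and driver, so this sketch uses a looser threshold.
MIN_MEMORY_MIB = 79_000


def parse_gpu_memory(csv_text):
    """Parse `name, memory.total` CSV rows into (name, MiB) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # Split on the last comma so commas inside the GPU name are preserved.
        name, memory = line.rsplit(",", 1)
        gpus.append((name.strip(), int(memory.strip())))
    return gpus


def query_gpus():
    """Ask nvidia-smi for the name and total memory of every visible GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)


def has_large_gpu(gpus, min_mib=MIN_MEMORY_MIB):
    """Return True if at least one GPU meets the assumed memory threshold."""
    return any(memory >= min_mib for _, memory in gpus)
```

On the training machine, `has_large_gpu(query_gpus())` should return `True` if a suitable GPU is visible. The parsing step is kept separate from the `nvidia-smi` call so it can be checked without a GPU present.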


## Reproducing our Experiments 🔬


### Training

Before evaluating the CoDiLA model, you must first finetune the autoregressive (AR) model. To do so, run:
```
cd train
pip install -r requirements.txt
bash train_local.sh
```
This fully finetunes the `Qwen/Qwen3-0.6B` model on top of `Dream-org/Dream-Coder-v0-Instruct-7B`, which is kept frozen.

The path to the resulting `pretrained` directory is used later for evaluation.

### Training Output Files

Training writes its checkpoint files to a `pretrained` directory. This directory contains an `ar_model/` subfolder with the following files:
- `config.json`
- `generation_config.json`
- `model.safetensors`

You will use the path to this `pretrained` directory as the model path for evaluation.
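
A quick sanity check of this layout before starting an evaluation can save a failed run. The sketch below is illustrative only: `missing_checkpoint_files` is a hypothetical helper, not part of the repository, and it assumes exactly the three file names listed above.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = ("config.json", "generation_config.json", "model.safetensors")


def missing_checkpoint_files(pretrained_dir):
    """Return the expected files that are absent from <pretrained_dir>/ar_model."""
    ar_model_dir = Path(pretrained_dir) / "ar_model"
    return [name for name in EXPECTED_FILES if not (ar_model_dir / name).is_file()]
```

An empty list from `missing_checkpoint_files("path/to/pretrained")` means all three files are in place. Any names it returns point to an incomplete or misplaced checkpoint.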

### Evaluation

Once the AR model has been finetuned, you can run the evaluation.

The evaluation on `humaneval_instruct`, `humaneval_instruct_plus`, `mbpp_instruct`, and `mbpp_instruct_plus` is based on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). To run it:
```
cd eval_instruct/harness
pip install -e .
bash submit_local.sh
``` 
