# Data Distillation for Efficient In-Context Learning

This repository contains tools and scripts for data distillation for efficient in-context learning.
It builds on the [MetaICL](https://github.com/facebookresearch/MetaICL) codebase.


## Dependencies
- Data preprocessing requires `datasets==1.4.0`. This version is not compatible with the Transformers version used for training and inference.
- We therefore recommend two separate environments: one for data preprocessing and one for model training and inference.
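
The two-environment setup could look like the following. This is a minimal sketch that assumes Python's built-in `venv` module; the environment names `env-preprocess` and `env-train` are placeholders, and the training dependencies are left unpinned here because they should follow the MetaICL codebase.

```shell
# Environment 1: data preprocessing only (pins datasets==1.4.0, as noted above).
python3 -m venv env-preprocess
env-preprocess/bin/pip install "datasets==1.4.0"

# Environment 2: model training and inference.
# The environment name is a placeholder; install the PyTorch and Transformers
# versions required by the MetaICL codebase into it using env-train/bin/pip.
python3 -m venv env-train
```

Activate the matching environment (for example `. env-preprocess/bin/activate`) before running the corresponding step.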

## Data Preprocessing

### Pretrain C4 dataset
We use the validation split of the "**en**" subset of the [C4](https://huggingface.co/datasets/c4/viewer/en/validation) dataset.

### Meta-train and meta-test datasets
For details on downloading and preprocessing, refer to the [MetaICL](https://github.com/facebookresearch/MetaICL) documentation.



## Data Distillation Training
Inside the [src](./src) directory, you will find:
- [dataset_distill.py](./src/dataset_distill.py) - Contains both the pre-training C4 dataset class and the meta-train/meta-test dataset class.
- [model_distill.py](./src/model_distill.py) - Manages the interaction between the large language model and the context distillation model.
- [SmallModel.py](./src/SmallModel.py) - Contains the implementation of the context distillation model.
 


### Pre-training
```shell
cd scripts
sh c4_pretrain.sh
```

### Fine-tuning
```shell
cd scripts
sh finetune.sh
```

## License
MetaICL is CC-BY-NC 4.0 licensed.



