# Label error detection on text classification

## Download model, features, noisy labels
- To download the necessary files, run the following with `-t` set to either `sst2` or `mnli`:
```
python download.py -t [sst2/mnli]
```
- The default save directory is `./results`. You can specify a different directory by using the `--cache_dir` option.
- The following files will be downloaded (**680 MB** for SST2 / **1.7 GB** for MNLI):
  - roberta-base_fp16_noise0.1 : RoBERTa-Base model trained on the noisy training set.
  - roberta-base_fp16_noise0.1/epoch_4 : Training data features extracted using the model at epoch 4.
  - target0.1_large_4.pt : Noisy labels generated by a RoBERTa-Large model (see `./model/noisy_label.py`).
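
Before running detection, it can help to confirm that the download finished. The sketch below searches a cache directory for the three names listed above. The helper `find_downloads` is hypothetical and not part of this repository, and because the exact layout under the cache directory is an assumption here, it searches recursively rather than relying on fixed paths.

```python
from pathlib import Path

# Names taken from the list above. Their exact location under the cache
# directory is an assumption, so the search below is recursive.
EXPECTED = [
    "roberta-base_fp16_noise0.1",
    "epoch_4",
    "target0.1_large_4.pt",
]


def find_downloads(cache_dir="./results"):
    """Map each expected name to the matching paths under cache_dir."""
    root = Path(cache_dir)
    if not root.is_dir():
        # Nothing has been downloaded to this location yet.
        return {name: [] for name in EXPECTED}
    return {name: sorted(root.rglob(name)) for name in EXPECTED}


if __name__ == "__main__":
    for name, paths in find_downloads().items():
        status = ", ".join(str(p) for p in paths) if paths else "MISSING"
        print(f"{name}: {status}")
```

If you passed `--cache_dir` to `download.py`, pass the same path to `find_downloads`.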

## Run detection
- SST2/MNLI with 10% synthetic label errors, using the RoBERTa-Base model:
```
python detect.py -t [sst2/mnli]
```
- Use the same `--cache_dir` as in the download step.
- For text data, the kernel temperature defaults to `--pow 4`.
- Passing `--dtype float16` runs detection in half precision, which halves GPU memory usage at the cost of a marginal performance drop.
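
The memory saving from half precision follows directly from the element size: `float32` uses 4 bytes per value and `float16` uses 2. The sketch below illustrates this with NumPy on a randomly generated matrix. It does not reproduce what `detect.py` does internally; the matrix shape is an assumption (768 is the RoBERTa-Base hidden size, the row count is arbitrary) and the values are synthetic.

```python
import numpy as np

# Hypothetical feature matrix: 1,000 examples x 768 dimensions.
# The values are random stand-ins, not features from this repository.
rng = np.random.default_rng(0)
features_fp32 = rng.standard_normal((1000, 768)).astype(np.float32)
features_fp16 = features_fp32.astype(np.float16)

bytes_fp32 = features_fp32.nbytes  # 4 bytes per element
bytes_fp16 = features_fp16.nbytes  # 2 bytes per element
memory_ratio = bytes_fp32 / bytes_fp16

# Largest absolute rounding error introduced by the cast to float16.
max_abs_error = float(
    np.abs(features_fp32 - features_fp16.astype(np.float32)).max()
)

print(f"float32: {bytes_fp32 / 1e6:.2f} MB, float16: {bytes_fp16 / 1e6:.2f} MB")
print(f"memory ratio: {memory_ratio:.1f}x, max rounding error: {max_abs_error:.2e}")
```

The small rounding error from the cast is one plausible source of the marginal performance drop mentioned above.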

## Training and feature extraction
- Reference: [Hugging Face run_glue.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py)
- For training models, please refer to the reference script above. We also provide our modified code in `./model`.
- For feature extraction, please refer to `feature.py`.
