# Pea-KD: Parameter-efficient and accurate Knowledge Distillation
This package provides an implementation of Pea-KD, a method for improving knowledge distillation (KD) performance. It covers the 6-layer student case and
provides examples of PeaBERT (Parameter-efficient and accurate BERT) on the RTE and MRPC tasks.

## Overview
#### Brief Explanation of the paper
The paper proposes two main ideas: Shuffled Parameter Sharing (SPS) and Pretraining with Teacher's Predictions (PTP).

1) SPS 

- step1 : Paired Parameter Sharing.
We first double the number of layers in the student model. Then we share the parameters between the bottom half and the upper half, as in Figure 1.
This way, the model has twice as many layers and thus a higher 'effective' model complexity, while keeping the same number of actual parameters.

- step2 : Shuffling
In addition to step1, we shuffle the Query and Key parameters between the shared pairs.
With this shuffling, the parameter-shared pairs behave more like individual layers, which further increases the 'effective' model complexity.
We call this architecture the SPS model. For 6-layer students, we apply SPS to the top 3 layers only (see the paper for details).
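
The two SPS steps can be sketched in plain Python as follows. This is only a structural illustration, not the repository's implementation: the `AttentionParams` container, the helper names, and the pairing of layer `i` with layer `i + n` are assumptions made for the example, and the weights are stand-in lists rather than real tensors.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AttentionParams:
    """Hypothetical container for one layer's attention weights."""
    query: List[float]
    key: List[float]
    value: List[float]


def shuffled_view(p: AttentionParams) -> AttentionParams:
    # Reuse the very same weight objects, but swap the Query and Key roles.
    return AttentionParams(query=p.key, key=p.query, value=p.value)


def build_sps_stack(base: List[AttentionParams]) -> List[AttentionParams]:
    # step1: double the depth by reusing each base layer's parameters.
    # step2: the reused copy sees Query and Key swapped.
    return base + [shuffled_view(p) for p in base]


def count_unique_weights(stack: List[AttentionParams]) -> int:
    # Shared weights are the same Python object, so id() identifies them.
    return len({id(w) for p in stack for w in (p.query, p.key, p.value)})


base = [AttentionParams([0.1 * i], [0.2 * i], [0.3 * i]) for i in range(1, 4)]
stack = build_sps_stack(base)
print(len(stack), count_unique_weights(stack))  # 6 layers, 9 unique weights
```

The stack has six layers, but only the nine weight objects of the three base layers exist, which is the sense in which the depth doubles while the parameter count stays the same.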

2) PTP 

- We pretrain the student model with new artificial labels (PTP labels). The labels are assigned as follows.

``` Unicode
PTP labels 
  ├── 'Confidently Correct' = teacher model's prediction is correct & confidence > t 
  ├── 'Unconfidently Correct' = teacher model's prediction is correct & confidence <= t 
  ├── 'Confidently Wrong' = teacher model's prediction is wrong & confidence > t 
  └── 'Unconfidently Wrong' = teacher model's prediction is wrong & confidence <= t
  t = hyperparameter that depends on the downstream task and the teacher model, e.g., t = 0.95 for MRPC, t = 0.8 for RTE.
```  
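
The labeling rule above can be written as a small function. This is a minimal sketch under stated assumptions: "confidence" is taken to be the teacher's highest class probability, and the function name `ptp_label` and its arguments are hypothetical, not names from this repository.

```python
from typing import Sequence

PTP_LABELS = (
    "Confidently Correct",
    "Unconfidently Correct",
    "Confidently Wrong",
    "Unconfidently Wrong",
)


def ptp_label(teacher_probs: Sequence[float], gold: int, t: float) -> str:
    """Assign a PTP label from the teacher's class probabilities."""
    # Assumption: the teacher's prediction is its most probable class,
    # and its confidence is the probability of that class.
    pred = max(range(len(teacher_probs)), key=lambda i: teacher_probs[i])
    confident = teacher_probs[pred] > t
    correct = pred == gold
    if correct:
        return PTP_LABELS[0] if confident else PTP_LABELS[1]
    return PTP_LABELS[2] if confident else PTP_LABELS[3]


print(ptp_label([0.03, 0.97], gold=1, t=0.95))  # Confidently Correct
print(ptp_label([0.40, 0.60], gold=0, t=0.95))  # Unconfidently Wrong
```
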
#### Baseline Codes
This repository is based on the [GitHub repository](https://github.com/intersun/PKD-for-BERT-Model-Compression) for [Patient Knowledge Distillation for BERT Model Compression](https://arxiv.org/abs/1908.09355). All source files come from that repository unless mentioned otherwise. The main scripts that run the tasks are the following two files, which have been modified from their originals in that repository:
- 'NLI_KD_training.py' -> 'NLI_KD_training_RTE.py' & 'NLI_KD_training_MRPC.py'
- 'run_glue_benchmark.py'

``` Unicode
PeaBERT        
  ├── BERT
  │    └── pytorch_pretrained_bert: BERT structure files
  ├── data
  │    ├── data_raw
  │    │     ├── glue_data: task dataset
  │    │     └── download_glue_data.py
  │    ├── models
  │    │     └── bert_base_uncased: ckpt -> you must download it yourself.
  │    └── outputs
  │           └── save teacher model predictions & trained student models.
  ├── src : The overall utils. 
  ├── envs.py: stores the directory paths used by the scripts.
  ├── run_glue_benchmark.py : saves teacher predictions. Used for PTP pretraining, KD, Patient-KD, etc.
  ├── PTP_RTE.py, PTP_MRPC.py : PTP pretraining of the student model.
  └── NLI_KD_training_RTE.py, NLI_KD_training_MRPC.py: comprehensive training file for teacher and student models.
  
  Note: ignore the word KDAP where it appears. It was a tentative name for Pea-KD.
  
```

#### Data description
- GLUE datasets

* Note that:
    * You can download the GLUE datasets with PeaBERT/data/data_raw/download_glue_data.py


## Install

#### Environment 
* Ubuntu
* CUDA 10.0
* PyTorch 1.4
* numpy
* torch
* Tensorly
* tqdm
* pandas
* apex
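
Before running anything, it may help to check which of the packages listed above are importable. The following is a small sketch using only the standard library. The import names in `REQUIRED` are assumptions derived from the list (for example `tensorly` for Tensorly), and `missing_packages` is a hypothetical helper, not part of this repository.

```python
import importlib.util

# Assumed import names for the packages listed above.
REQUIRED = ["numpy", "torch", "tensorly", "tqdm", "pandas", "apex"]


def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [n for n in names if importlib.util.find_spec(n) is None]


missing = missing_packages(REQUIRED)
print("missing:", ", ".join(missing) if missing else "none")
```

This only checks that the modules can be located. It does not verify versions such as PyTorch 1.4 or CUDA 10.0.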

## How to Run

See the notebooks 'MRPC_example_PeaBERT6_92.9.ipynb' and 'RTE_example_PeaBERT6_73.56.ipynb', together with the paper, for how to run the code.
The procedure is essentially the same as for the Patient Knowledge Distillation code (https://github.com/intersun/PKD-for-BERT-Model-Compression).

## Contact

- anonymous

*This software may be used only for research evaluation purposes.*  
*For other purposes (e.g., commercial), please contact the authors.*