# TNF Source Code

Implementation for the paper "Taking Notes on the Fly Helps Language Pretraining".

## Brief Introduction
This repo contains the experimental code for our paper: the model implementation, the data preprocessing for TNF, and the parameter settings. Our code is built on [fairseq](https://github.com/pytorch/fairseq), and we thank its authors. For more details on fairseq and its usage, please see the original repo.

## Requirements and Installation

See [fairseq](https://github.com/pytorch/fairseq) for more details. Briefly:

* [PyTorch](http://pytorch.org/) version >= 1.4.0
* Python version >= 3.5
* For training new models, you'll also need an NVIDIA GPU and [NCCL](https://github.com/NVIDIA/nccl)
* **For faster training** install NVIDIA's [apex](https://github.com/NVIDIA/apex) library with the `--cuda_ext` option
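
The version requirements above can be checked with a short script before training. The following is a minimal sketch, assuming `python` is on the `PATH` and that `sort -V` (GNU coreutils version sort) is available; the helper name `version_ge` is hypothetical and is not part of this repo or fairseq.

```bash
#!/usr/bin/env bash

# Succeed if version $1 is greater than or equal to version $2.
# Hypothetical helper; relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Query the installed versions; the fallback "0" makes a missing package fail the check.
PY_VERSION=$(python -c 'import platform; print(platform.python_version())' 2>/dev/null || echo 0)
TORCH_VERSION=$(python -c 'import torch; print(torch.__version__.split("+")[0])' 2>/dev/null || echo 0)

version_ge "$PY_VERSION" 3.5 || echo "Python >= 3.5 required, found $PY_VERSION"
version_ge "$TORCH_VERSION" 1.4.0 || echo "PyTorch >= 1.4.0 required, found $TORCH_VERSION"
```

The script prints a message for each requirement that is not met and prints nothing when both are satisfied.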

## Getting Started

### Overall Usage
The [full documentation](https://fairseq.readthedocs.io/) of fairseq contains instructions for getting started, training new models and extending fairseq with new model types and tasks.

### Data Pre-Processing

#### Pretraining Data for TNF
We apply several consecutive pre-processing steps: segmenting documents into sentences with spaCy; normalizing, lower-casing, and tokenizing the text with the Moses decoder; and finally applying byte pair encoding (BPE) with the vocabulary size |V| set to 32,678.
This version of the code contains only the preprocessing for TNF's dataset; see `preprocess/pretrain/process.sh`.

#### Down-Stream Data for TNF
Following the same procedure as above, we process the GLUE data for TNF with `preprocess/glue/process.sh`.

When reproducing our results, please adjust the relevant file paths for your environment.

### Pre-Training Usage

To pretrain the BERT-TNF model, you can refer to the following script:
```bash
#!/usr/bin/env bash

echo 'Prepare Data'
PREFIX=/path/to/data_dir
DATA_DIR=$PREFIX/data/
TNF_DATA_DIR=$PREFIX/tnf_data/
TNF_SUBWORD_DATA_DIR=$PREFIX/tnf_subword_data/
SAVE_DIR=$PREFIX/path/to/save_dir/

echo 'Prepare training'
cd /path/to/TNF
pip install --editable .

echo 'Start training'
TNF_LAMBDA=0.5          # TNF Lambda
TNF_GAMMA=0.1           # TNF Gamma
TOTAL_UPDATES=1000000   # Total number of training steps
WARMUP_UPDATES=10000    # Warmup the learning rate over this many updates
PEAK_LR=0.0001          # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=8         # Number of sequences per batch (batch size)
UPDATE_FREQ=4           # Increase the batch size
SEED=100                # Random seed

python train.py $DATA_DIR --num-workers 8 --ddp-backend=c10d \
       --tnf-data $TNF_DATA_DIR --tnf-subword-data $TNF_SUBWORD_DATA_DIR \
       --task tnf_masked_lm --criterion tnf_masked_lm \
       --arch tnf_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
       --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
       --lr-scheduler polynomial_decay --lr $PEAK_LR \
       --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
       --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
       --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ --seed $SEED \
       --mask-prob 0.15 --embedding-normalize \
       --max-update 1000000 --log-format simple --log-interval 10 --tensorboard-logdir . \
       --keep-updates-list 20000 50000 100000 200000 400000 600000 800000 1000000 \
       --save-interval-updates 10000 --keep-interval-updates 5 \
       --no-epoch-checkpoints --skip-invalid-size-inputs-valid-test \
       --save-dir $SAVE_DIR \
       --tnf-lambda $TNF_LAMBDA --tnf-gamma $TNF_GAMMA \
       --update-tnf-lambda 0 --tnf-emb-zero-init 1 \
       --update-tnf-emb mask --ctx windowavg --ctx-window-size 32 \
       --restore-file $SAVE_DIR/checkpoint_last.pt
```
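
In the script above, `MAX_SENTENCES` is the per-GPU batch size and `UPDATE_FREQ` accumulates gradients over several batches, so the number of sequences per optimizer update also depends on how many GPUs you train on. The sketch below works through that arithmetic; `NUM_GPUS=8` is an illustrative assumption and is not a value taken from this repo.

```bash
#!/usr/bin/env bash

MAX_SENTENCES=8   # Sequences per GPU per forward/backward pass (from the script above)
UPDATE_FREQ=4     # Gradient accumulation steps (from the script above)
NUM_GPUS=8        # ASSUMPTION: illustrative value, set this to your actual GPU count

# Sequences consumed per optimizer update across all GPUs.
EFFECTIVE_BATCH=$((MAX_SENTENCES * UPDATE_FREQ * NUM_GPUS))
echo "Effective batch size: $EFFECTIVE_BATCH sequences per update"
# prints: Effective batch size: 256 sequences per update
```

If you train on fewer GPUs, raising `UPDATE_FREQ` by the same factor keeps the effective batch size unchanged.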

### Fine-tuning
After setting the hyperparameters, you can fine-tune the model with the command below. By varying `TNF_LAMBDA`, you can run TNF, TNF-F, and TNF-U.

```bash
#!/usr/bin/env bash

python train.py $DATA_DIR --fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
       --tnf-data $TNF_DATA_DIR $TNF_BP \
       --tnf-lambda $TNF_LAMBDA --tnf-gamma $TNF_GAMMA --update-tnf-lambda 0 $UPDATE_TNF \
       --tnf-emb-zero-init 1 --fix-dict-shift True \
       --restore-file $BERT_MODEL_PATH \
       --max-positions 512 \
       --max-sentences $SENT_PER_GPU --update-freq $UPDATE_FREQ \
       --max-tokens $MAX_TOKENS \
       --task tnf_sentence_prediction \
       --reset-optimizer --reset-dataloader --reset-meters \
       --required-batch-size-multiple 1 \
       --init-token 0 --separator-token 2 \
       --arch $ARCH \
       --criterion sentence_prediction $OPTION \
       --num-classes $N_CLASSES \
       --dropout 0.1 --attention-dropout 0.1 \
       --weight-decay $WEIGHT_DECAY --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
       --clip-norm 0.0 --validate-interval-updates $VALID_FREQ \
       --lr-scheduler polynomial_decay --lr $LR --warmup-ratio $WARMUP_RATIO \
       --max-epoch $N_EPOCH --seed $SEED --save-dir $OUTPUT_PATH --no-progress-bar --log-interval 100 --no-epoch-checkpoints --no-last-checkpoints --no-best-checkpoints \
       --find-unused-parameters --skip-invalid-size-inputs-valid-test --truncate-sequence --embedding-normalize \
       --tensorboard-logdir . \
       --best-checkpoint-metric $METRIC --maximize-best-checkpoint-metric $REL_POS | tee $OUTPUT_PATH/train_log.txt
```
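
The command above reads many shell variables that must be defined beforehand. The following is a minimal sketch of a guard that reports any that are unset before training starts. The function name `check_finetune_vars` is hypothetical, and the list assumes that `TNF_BP`, `UPDATE_TNF`, `OPTION`, and `REL_POS` hold optional extra flags and may legitimately be empty; adjust the list if that assumption does not hold for your setup.

```bash
#!/usr/bin/env bash

# Variables that the fine-tuning command passes as flag values.
# ASSUMPTION: TNF_BP, UPDATE_TNF, OPTION and REL_POS are optional and omitted here.
REQUIRED_VARS=(DATA_DIR TNF_DATA_DIR TNF_LAMBDA TNF_GAMMA BERT_MODEL_PATH
               SENT_PER_GPU UPDATE_FREQ MAX_TOKENS ARCH N_CLASSES WEIGHT_DECAY
               VALID_FREQ LR WARMUP_RATIO N_EPOCH SEED OUTPUT_PATH METRIC)

# Print the name of every required variable that is unset or empty (to stderr)
# and return non-zero if at least one is missing.
check_finetune_vars() {
    local missing=0 name
    for name in "${REQUIRED_VARS[@]}"; do
        if [ -z "${!name:-}" ]; then
            echo "Missing variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `check_finetune_vars || exit 1` just before the `python train.py` line stops the script early instead of letting the training command fail on an empty argument.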
