# NLP

Legendre Memory Unit (LMU) models for natural language processing (NLP).

## Developer Instructions

```
pip install -e .
pip install pre-commit black==21.4b2 isort
pre-commit install
```

## Example commands to run our largest models, with and without global attention

```
microsoft-nlp --model_type LMUD --parallel_strategy Mirrored --embed_dim 204 --ff_dim 204 --order 200 --eqn11 3 --theta 512 --n_layers 3 --lmud_order 0.1 --no_share_filters --no_gating --n_filters 1 --layernorm --pre_ffn 1.5 --post_ffn 2 --option2 --n_heads 12 train --batch_size 32 --learning_scale batch-5 --decay_steps 25000 --warmup_steps 293

microsoft-nlp --model_type LMUD --parallel_strategy Mirrored --embed_dim 204 --ff_dim 204 --order 250 --eqn11 3 --theta 512 --n_layers 3 --lmud_order 0.1 --no_share_filters --no_gating --n_filters 1 --layernorm --pre_ffn 1.5 --post_ffn 2.1 --option2 train --batch_size 32 --learning_scale batch-5 --decay_steps 25000 --warmup_steps 293

```

## OpenWebText2 Instructions

To reproduce the filtered and tokenized dataset from scratch:

```
cd datasets
mkdir owt2-raw
cd owt2-raw
wget https://the-eye.eu/public/AI/pile_preliminary_components/openwebtext2.jsonl.zst.tar
tar -xvf openwebtext2.jsonl.zst.tar
rm openwebtext2.jsonl.zst.tar
cd ../../
python scripts/tokenize_owt2.py
```

To reproduce the tokenized records, first create the tokenized dataset as above, then run
`python scripts/owt2_to_records.py`. The script writes both `.npy` and `.tfrecord` records.
Either separate them into two directories yourself, or use the script's `kind` parameter
to generate only one type of record.
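
If you generate both record types in one run, the manual separation step could be sketched roughly as follows. This is a minimal illustration, not part of the repository: the function name `split_records` and all three directory arguments are hypothetical, and it assumes the records sit directly in a single output directory and are distinguished only by their `.npy` and `.tfrecord` extensions.

```python
import shutil
from pathlib import Path


def split_records(src_dir, npy_dir, tfrecord_dir):
    """Move .npy and .tfrecord files out of src_dir into separate directories.

    Hypothetical helper: the directory layout is an assumption, so adjust the
    paths to wherever the records were actually written.
    """
    src = Path(src_dir)
    targets = {".npy": Path(npy_dir), ".tfrecord": Path(tfrecord_dir)}
    for target in targets.values():
        target.mkdir(parents=True, exist_ok=True)

    moved = {suffix: 0 for suffix in targets}
    # sorted() materializes the listing before any file is moved.
    for path in sorted(src.iterdir()):
        # Skip subdirectories and files with any other extension.
        if path.is_file() and path.suffix in targets:
            shutil.move(str(path), str(targets[path.suffix] / path.name))
            moved[path.suffix] += 1
    return moved
```

Calling `split_records(records_dir, npy_out, tfrecord_out)` with your own paths returns a count of moved files per extension, which makes it easy to check that nothing was left behind. Files with any other extension stay in the source directory.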
