# The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models (oBERT)

Author: @eldarkurtic

Paper: [https://arxiv.org/abs/2203.07259](https://arxiv.org/abs/2203.07259)

Demo: [https://neuralmagic.com/blog/obert/](https://neuralmagic.com/blog/obert/)

Abstract:
```
We introduce the Optimal BERT Surgeon (oBERT), an efficient and accurate pruning method based on approximate second-order information, which we show to yield state-of-the-art results for compression in both stages of language tasks: pre-training and fine-tuning. Specifically, oBERT extends existing work on second-order pruning by allowing for pruning weight blocks, and is the first such method that is applicable at BERT scale.
Second, we investigate compounding compression approaches to obtain highly compressed but accurate models for deployment on edge devices. These models significantly push boundaries of the current state-of-the-art sparse BERT models with respect to all metrics: model size, inference speed and task accuracy. For example, relative to the dense BERT-base, we obtain 10x model size compression with < 1% accuracy drop, 10x CPU-inference speedup with < 2% accuracy drop, and 29x CPU-inference speedup with < 7.5% accuracy drop.
```

The Optimal BERT Surgeon (oBERT) is implemented in the SparseML library as the `OBSPruningModifier`, which performs approximate second-order pruning (unstructured and 4-block) and can also be applied to models other than BERT.
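
To give some intuition for what "approximate second-order pruning" means, here is a minimal NumPy sketch of the classic Optimal Brain Surgeon step for a single weight, with the inverse Hessian approximated by a dampened empirical Fisher matrix. It is an illustration only and not the SparseML implementation: the function names, the `damp` default, and the toy data are assumptions made for this example, and the 4-block variant (which scores groups of weights jointly) is not shown.

```python
import numpy as np


def inverse_empirical_fisher(grads, damp=1e-4):
    """Invert F = damp * I + (1/m) * sum_i g_i g_i^T, one gradient at a time.

    Each rank-one term is folded in with the Sherman-Morrison formula, so the
    matrix F itself is never built or inverted directly.
    """
    m, d = grads.shape
    f_inv = np.eye(d) / damp
    for g in grads:
        fg = f_inv @ g
        f_inv -= np.outer(fg, fg) / (m + g @ fg)
    return f_inv


def obs_prune_one(weights, f_inv):
    """Prune the weight with the smallest OBS saliency and update the rest.

    saliency_i = w_i^2 / (2 * [F^-1]_ii)
    delta_w    = -(w_i / [F^-1]_ii) * F^-1[:, i]
    """
    saliency = weights ** 2 / (2.0 * np.diag(f_inv))
    idx = int(np.argmin(saliency))
    new_weights = weights - (weights[idx] / f_inv[idx, idx]) * f_inv[:, idx]
    new_weights[idx] = 0.0  # exactly zero, removing floating-point residue
    return idx, new_weights


# Toy data (hypothetical): 64 per-sample gradients for a layer with 8 weights.
rng = np.random.default_rng(0)
toy_grads = rng.normal(size=(64, 8))
toy_weights = rng.normal(size=8)

toy_f_inv = inverse_empirical_fisher(toy_grads)
pruned_idx, toy_updated = obs_prune_one(toy_weights, toy_f_inv)
print("pruned index:", pruned_idx)
print("updated weights:", np.round(toy_updated, 3))
```

Unlike magnitude pruning, the remaining weights are adjusted to compensate for the removed one, which is where the second-order information is used.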

To ease reproducibility, the tables below (which correspond to the tables reported in the paper) link to the open-sourced checkpoints, together with the recipes and scripts used to produce them.


## Table 1: Dev-set performance of downstream-pruned BERT-base models
The following table (Table 1 in the paper) links to the best-performing unstructured-pruned oBERT models, along with the recipes and scripts to reproduce them from scratch.

| Task | BERT<br>base | Sparsity | oBERT<br>10 epochs | oBERT<br>30 epochs |
| :---: | :-------: | :----: | :----------------: | :----------------: |
| SQuAD<br>F1 | 88.54<br>[model](https://huggingface.co/neuralmagic/oBERT-teacher-squadv1) | 80%<br>90%<br>97% | -<br>87.98<br>84.65 | 89.04 [model](https://huggingface.co/neuralmagic/oBERT-12-downstream-pruned-unstructured-80-squadv1), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_unstructured80_squad.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_squad.sh)<br>88.31 [model](https://huggingface.co/neuralmagic/oBERT-12-downstream-pruned-unstructured-90-squadv1), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_unstructured90_squad.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_squad.sh)<br>85.98 [model](https://huggingface.co/neuralmagic/oBERT-12-downstream-pruned-unstructured-97-squadv1), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_unstructured97_squad.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_squad.sh) |
| MNLI<br>m-acc | 84.54<br>[model](https://huggingface.co/neuralmagic/oBERT-teacher-mnli) | 80%<br>90%<br>97% | -<br>83.20<br>81.00 | 84.32 [model](https://huggingface.co/neuralmagic/oBERT-12-downstream-pruned-unstructured-80-mnli), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_unstructured80_mnli.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_mnli_qqp.sh)<br>83.79 [model](https://huggingface.co/neuralmagic/oBERT-12-downstream-pruned-unstructured-90-mnli), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_unstructured90_mnli.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_mnli_qqp.sh)<br>81.77 [model](https://huggingface.co/neuralmagic/oBERT-12-downstream-pruned-unstructured-97-mnli), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_unstructured97_mnli.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_mnli_qqp.sh) |
| QQP<br>acc | 91.06<br>[model](https://huggingface.co/neuralmagic/oBERT-teacher-qqp) | 80%<br>90%<br>97% | -<br>90.89<br>90.23 | 91.57 [model](https://huggingface.co/neuralmagic/oBERT-12-downstream-pruned-unstructured-80-qqp), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_unstructured80_qqp.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_mnli_qqp.sh)<br>91.35 [model](https://huggingface.co/neuralmagic/oBERT-12-downstream-pruned-unstructured-90-qqp), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_unstructured90_qqp.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_mnli_qqp.sh)<br>90.87 [model](https://huggingface.co/neuralmagic/oBERT-12-downstream-pruned-unstructured-97-qqp), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_unstructured97_qqp.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_mnli_qqp.sh) |


## Table 2: Sparse-transfer dev-set performance of upstream-pruned BERT-base models
The following table (Table 2 in the paper) links to the best-performing unstructured-pruned oBERT models, along with the recipes and scripts to reproduce them from scratch.

**Note (models v2)**: these results will be presented in the upcoming updated version of the paper.

Upstream pruned oBERT models:
- 90% unstructured [model](https://huggingface.co/neuralmagic/oBERT-12-upstream-pruned-unstructured-90-v2), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/3epochs_unstructured90_mlm.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/3epochs_gradual_pruning_mlm.sh)
- 97% unstructured [model](https://huggingface.co/neuralmagic/oBERT-12-upstream-pruned-unstructured-97-v2), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/3epochs_unstructured97_mlm.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/3epochs_gradual_pruning_mlm.sh)

when fine-tuned on a downstream task with fixed masks (i.e. sparse-transfer):

| Task | BERT<br>base | Sparsity | oBERT |
| :---: | :-------: | :----: | :------: |
| SQuAD<br>F1 | 88.54<br>[model](https://huggingface.co/neuralmagic/oBERT-teacher-squadv1) | 90%<br>97% | 88.49 [model](https://huggingface.co/neuralmagic/oBERT-12-upstream-pruned-unstructured-90-finetuned-squadv1-v2), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/8epochs_sparse_transfer_squad.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/8epochs_sparse_transfer_squad.sh)<br>84.92 [model](https://huggingface.co/neuralmagic/oBERT-12-upstream-pruned-unstructured-97-finetuned-squadv1-v2), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/8epochs_sparse_transfer_squad.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/8epochs_sparse_transfer_squad.sh)|
| MNLI<br>m-acc | 84.54<br>[model](https://huggingface.co/neuralmagic/oBERT-teacher-mnli) | 90%<br>97% | 83.40 [model](https://huggingface.co/neuralmagic/oBERT-12-upstream-pruned-unstructured-90-finetuned-mnli-v2), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/8epochs_sparse_transfer_mnli.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/8epochs_sparse_transfer_mnli_qqp.sh)<br>80.91 [model](https://huggingface.co/neuralmagic/oBERT-12-upstream-pruned-unstructured-97-finetuned-mnli-v2), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/8epochs_sparse_transfer_mnli.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/8epochs_sparse_transfer_mnli_qqp.sh)|
| QQP<br>acc | 91.06<br>[model](https://huggingface.co/neuralmagic/oBERT-teacher-qqp) | 90%<br>97% | 90.99 [model](https://huggingface.co/neuralmagic/oBERT-12-upstream-pruned-unstructured-90-finetuned-qqp-v2), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/8epochs_sparse_transfer_qqp.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/8epochs_sparse_transfer_mnli_qqp.sh)<br>90.33 [model](https://huggingface.co/neuralmagic/oBERT-12-upstream-pruned-unstructured-97-finetuned-qqp-v2), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/8epochs_sparse_transfer_qqp.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/8epochs_sparse_transfer_mnli_qqp.sh) |
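
Sparse transfer with fixed masks, as used for the table above, means the sparsity pattern found upstream is frozen and only the surviving weights are trained on the downstream task. A minimal NumPy sketch of that idea follows; the magnitude-based mask and the random gradients are stand-ins chosen for this illustration (they are not oBERT's pruning criterion or real task gradients), and `masked_sgd_step` is a hypothetical helper, not a SparseML function.

```python
import numpy as np


def masked_sgd_step(weights, grads, mask, lr=0.1):
    """One SGD step that keeps every pruned weight at exactly zero."""
    return (weights - lr * grads) * mask


rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 8))

# Stand-in for an upstream-pruned checkpoint: keep the largest 25% of weights.
threshold = np.quantile(np.abs(weights), 0.75)
mask = (np.abs(weights) > threshold).astype(weights.dtype)
weights = weights * mask

# Downstream fine-tuning: the mask never changes, only surviving weights move.
for _ in range(5):
    grads = rng.normal(size=weights.shape)  # placeholder for task gradients
    weights = masked_sgd_step(weights, grads, mask)

sparsity = 1.0 - np.count_nonzero(weights) / weights.size
print(f"sparsity after fine-tuning: {sparsity:.2%}")
```

Because the mask is re-applied after every update, the fine-tuned model has the same sparsity pattern as the upstream-pruned one.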

**Note**: the results below are the ones currently presented in the paper; they will be removed once the updated version of the paper, which reports the **v2** models listed above, is released.

Upstream pruned oBERT models:
- 90% unstructured [model](https://huggingface.co/neuralmagic/oBERT-12-upstream-pruned-unstructured-90)
- 97% unstructured [model](https://huggingface.co/neuralmagic/oBERT-12-upstream-pruned-unstructured-97)

when fine-tuned on a downstream task with fixed masks (i.e. sparse-transfer):

| Task | BERT<br>base | Sparsity | oBERT |
| :---: | :-------: | :----: | :------: |
| SQuAD<br>F1 | 88.54<br>[model](https://huggingface.co/neuralmagic/oBERT-teacher-squadv1) | 90%<br>97% | 88.42 [model](https://huggingface.co/neuralmagic/oBERT-12-upstream-pruned-unstructured-90-finetuned-squadv1)<br>84.39 [model](https://huggingface.co/neuralmagic/oBERT-12-upstream-pruned-unstructured-97-finetuned-squadv1) |
| MNLI<br>m-acc | 84.54<br>[model](https://huggingface.co/neuralmagic/oBERT-teacher-mnli) | 90%<br>97% | 82.29 [model](https://huggingface.co/neuralmagic/oBERT-12-upstream-pruned-unstructured-90-finetuned-mnli)<br>78.85 [model](https://huggingface.co/neuralmagic/oBERT-12-upstream-pruned-unstructured-97-finetuned-mnli) |
| QQP<br>acc | 91.06<br>[model](https://huggingface.co/neuralmagic/oBERT-teacher-qqp) | 90%<br>97% | 90.83 [model](https://huggingface.co/neuralmagic/oBERT-12-upstream-pruned-unstructured-90-finetuned-qqp)<br>89.76 [model](https://huggingface.co/neuralmagic/oBERT-12-upstream-pruned-unstructured-97-finetuned-qqp) |



## Table 3: F1 score of the 3, 6, and 12-layer models compound-compressed with oBERT on SQuAD
The following table (Table 3 in the paper) links to the best-performing oBERT models, along with the recipes and scripts to reproduce them from scratch.

For the 12-layer model, we use HuggingFace's standard `bert-base-uncased` [model](https://huggingface.co/bert-base-uncased) for a fair comparison with other compression approaches. For the 3- and 6-layer models, we drop layers from our upstream-pretrained 12-layer [model](https://huggingface.co/neuralmagic/oBERT-12-upstream-pretrained-dense) and pretrain the resulting models to obtain the following dense models:
- 6-layer dense pretrained [model](https://huggingface.co/neuralmagic/oBERT-6-upstream-pretrained-dense)
- 3-layer dense pretrained [model](https://huggingface.co/neuralmagic/oBERT-3-upstream-pretrained-dense)

| Layers | Sparsity | Unstructured | 4-block | 4-block+QAT |
| :---:  | :---:    | :---:        | :---:   | :---:       |
| 12 | 0%<br>80%<br>90% | 89.48 [model](https://huggingface.co/neuralmagic/oBERT-12-downstream-dense-squadv1), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_dense_squad.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_squad.sh)<br>89.04 [model](https://huggingface.co/neuralmagic/oBERT-12-downstream-pruned-unstructured-80-squadv1), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_unstructured80_squad.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_squad.sh)<br>88.31 [model](https://huggingface.co/neuralmagic/oBERT-12-downstream-pruned-unstructured-90-squadv1), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_unstructured90_squad.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_squad.sh) | 89.48 [model](https://huggingface.co/neuralmagic/oBERT-12-downstream-dense-squadv1), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_dense_squad.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_squad.sh)<br>88.57 [model](https://huggingface.co/neuralmagic/oBERT-12-downstream-pruned-block4-80-squadv1), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_4block80_squad.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_squad.sh)<br>87.57 [model](https://huggingface.co/neuralmagic/oBERT-12-downstream-pruned-block4-90-squadv1), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_4block90_squad.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_squad.sh) | 89.06 [model](https://huggingface.co/neuralmagic/oBERT-12-downstream-dense-QAT-squadv1)<br>87.89 [model](https://huggingface.co/neuralmagic/oBERT-12-downstream-pruned-block4-80-QAT-squadv1)<br>86.68 [model](https://huggingface.co/neuralmagic/oBERT-12-downstream-pruned-block4-90-QAT-squadv1) |
| 6 | 0%<br>80%<br>90% | 88.32 [model](https://huggingface.co/neuralmagic/oBERT-6-downstream-dense-squadv1), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_dense_squad.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_squad.sh)<br>88.20 [model](https://huggingface.co/neuralmagic/oBERT-6-downstream-pruned-unstructured-80-squadv1), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_init30_unstructured80_squad.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_squad.sh)<br>86.78 [model](https://huggingface.co/neuralmagic/oBERT-6-downstream-pruned-unstructured-90-squadv1), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_init30_unstructured90_squad.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_squad.sh) | 88.32 [model](https://huggingface.co/neuralmagic/oBERT-6-downstream-dense-squadv1), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_dense_squad.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_squad.sh)<br>87.00 [model](https://huggingface.co/neuralmagic/oBERT-6-downstream-pruned-block4-80-squadv1), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_init30_4block80_squad.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_squad.sh)<br>85.34 [model](https://huggingface.co/neuralmagic/oBERT-6-downstream-pruned-block4-90-squadv1), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_init30_4block90_squad.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_squad.sh) | 87.94 [model](https://huggingface.co/neuralmagic/oBERT-6-downstream-dense-QAT-squadv1)<br>86.10 [model](https://huggingface.co/neuralmagic/oBERT-6-downstream-pruned-block4-80-QAT-squadv1)<br>84.59 [model](https://huggingface.co/neuralmagic/oBERT-6-downstream-pruned-block4-90-QAT-squadv1) |
| 3 | 0%<br>80%<br>90% | 84.66 [model](https://huggingface.co/neuralmagic/oBERT-3-downstream-dense-squadv1), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_dense_squad.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_squad.sh)<br>84.08 [model](https://huggingface.co/neuralmagic/oBERT-3-downstream-pruned-unstructured-80-squadv1), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_init30_unstructured80_squad.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_squad.sh)<br>82.50 [model](https://huggingface.co/neuralmagic/oBERT-3-downstream-pruned-unstructured-90-squadv1), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_init30_unstructured90_squad.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_squad.sh) | 84.66 [model](https://huggingface.co/neuralmagic/oBERT-3-downstream-dense-squadv1), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_dense_squad.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_squad.sh)<br>82.79 [model](https://huggingface.co/neuralmagic/oBERT-3-downstream-pruned-block4-80-squadv1), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_init30_4block80_squad.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_squad.sh)<br>80.69 [model](https://huggingface.co/neuralmagic/oBERT-3-downstream-pruned-block4-90-squadv1), [recipe](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/recipes/30epochs_init30_4block90_squad.yaml), [script](https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT/scripts/30epochs_gradual_pruning_squad.sh) | 84.25 [model](https://huggingface.co/neuralmagic/oBERT-3-downstream-dense-QAT-squadv1)<br>82.04 [model](https://huggingface.co/neuralmagic/oBERT-3-downstream-pruned-block4-80-QAT-squadv1)<br>79.66 [model](https://huggingface.co/neuralmagic/oBERT-3-downstream-pruned-block4-90-QAT-squadv1) |


## Important notes
- the `OBSPruningModifier` splits the workload of the pruning step across all available GPUs; set `CUDA_VISIBLE_DEVICES` to restrict which GPUs it may use
- all experiments are designed to fit on a single 24GB RTX 3090 card, except the upstream ones, which need more GPUs because of the large pre-training batch size
- if an experiment doesn't fit on a single GPU, use the multi-GPU mode via PyTorch DistributedDataParallel (DDP); the `OBSPruningModifier` still splits its workload across all available GPUs
- the results reported in the paper were obtained with the following library versions:
    - `sparseml=0.2.0`
    - `transformers=4.5.1`
    - `datasets=1.6.1`
    - `torch=1.8.1`
- since then, we have improved and optimized the `OBSPruningModifier` implementation and, to ease reproducibility, have reproduced the results with newer library versions:
    - `sparseml=0.12.0`
    - `transformers=4.18.0.dev0`
    - `datasets=2.0.0`
    - `torch=1.11.0`
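
Before launching a run, it can help to check the local environment against the newer versions listed above. The following is a small sketch using only the Python standard library; the helper names are made up for this example, and the comparison deliberately ignores local build tags (such as the `+cu113` suffix some PyTorch wheels carry), which is an assumption about what counts as a match.

```python
from importlib import metadata

# Newer library versions listed above as reproducing the paper's results.
REPRODUCED_WITH = {
    "sparseml": "0.12.0",
    "transformers": "4.18.0.dev0",
    "datasets": "2.0.0",
    "torch": "1.11.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(expected, lookup=installed_version):
    """Map each package name to (expected, installed, matches)."""
    report = {}
    for package, wanted in expected.items():
        found = lookup(package)
        # Strip a local build tag like "+cu113" before comparing.
        matches = found is not None and found.split("+")[0] == wanted
        report[package] = (wanted, found, matches)
    return report


if __name__ == "__main__":
    for package, (wanted, found, ok) in compare_versions(REPRODUCED_WITH).items():
        status = "ok" if ok else "differs"
        print(f"{package}: expected {wanted}, installed {found} ({status})")
```

A mismatch does not necessarily mean a run will fail; it only flags a difference from the environment in which the results were reproduced.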

## BibTeX entry and citation info
```bibtex
@article{kurtic2022optimal,
  title={The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models},
  author={Kurtic, Eldar and Campos, Daniel and Nguyen, Tuan and Frantar, Elias and Kurtz, Mark and Fineran, Benjamin and Goin, Michael and Alistarh, Dan},
  journal={arXiv preprint arXiv:2203.07259},
  year={2022}
}
```