Block Skim Transformer for Efficient Question Answering

28 Sept 2020 (modified: 05 May 2023) | ICLR 2021 Conference Blind Submission | Readers: Everyone
Keywords: Efficient Transformer, Question Answering
Abstract: Transformer-based encoder models have achieved promising results on natural language processing (NLP) tasks including question answering (QA). Unlike sequence classification or language modeling tasks, QA uses the hidden states at all positions for the final classification. However, not all of the context is needed to answer the raised question. Following this idea, we propose Block Skim Transformer (BST) to improve and accelerate the processing of transformer QA models. The key idea of BST is to identify the context that must be further processed and the blocks that can be safely discarded early during inference. Critically, we learn such information from self-attention weights. As a result, the model hidden states are pruned along the sequence dimension, achieving significant inference speedup. We also show that this additional training objective improves model accuracy. As a plugin to transformer-based QA models, BST is compatible with other model compression methods without changing existing network architectures. BST improves QA models' accuracy on different datasets and achieves a $1.6\times$ speedup on the $BERT_{large}$ model.
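The abstract describes scoring fixed-size context blocks from self-attention weights and pruning the hidden states of blocks judged irrelevant. Below is a minimal PyTorch sketch of that idea, assuming per-block attention statistics fed to a small classifier; the names (`BlockSkimGate`, `block_size`, `keep_threshold`) and the exact features are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of block skimming from self-attention weights (not the authors' code).
import torch
import torch.nn as nn


class BlockSkimGate(nn.Module):
    """Scores fixed-size blocks of the sequence from self-attention weights and
    predicts which blocks later layers may skip (illustrative sketch)."""

    def __init__(self, num_heads: int, block_size: int = 32):
        super().__init__()
        self.block_size = block_size
        # Small classifier over per-block attention statistics (one feature per head).
        self.scorer = nn.Linear(num_heads, 1)

    def forward(self, attn_weights: torch.Tensor) -> torch.Tensor:
        # attn_weights: (batch, heads, seq, seq); each row is a softmax distribution.
        b, h, s, _ = attn_weights.shape
        n_blocks = s // self.block_size
        # Attention mass received by each key position, summed over query positions.
        received = attn_weights.sum(dim=2)                        # (b, h, seq)
        received = received[:, :, : n_blocks * self.block_size]
        # Aggregate into per-block features: (b, n_blocks, heads).
        block_feat = received.view(b, h, n_blocks, self.block_size).mean(-1)
        block_feat = block_feat.permute(0, 2, 1)
        # Probability that a block is relevant and should be kept.
        return torch.sigmoid(self.scorer(block_feat)).squeeze(-1)  # (b, n_blocks)


def prune_blocks(hidden: torch.Tensor, keep_prob: torch.Tensor,
                 block_size: int = 32, keep_threshold: float = 0.5) -> torch.Tensor:
    """Drop hidden states of blocks predicted as irrelevant (batch size 1 for brevity)."""
    s = keep_prob.shape[1] * block_size
    keep = keep_prob[0] > keep_threshold                          # (n_blocks,)
    mask = keep.repeat_interleave(block_size)                     # (seq,)
    return hidden[:, :s][:, mask]                                 # (1, kept_seq, dim)


if __name__ == "__main__":
    b, h, s, d = 1, 12, 128, 768
    attn = torch.softmax(torch.randn(b, h, s, s), dim=-1)
    hidden = torch.randn(b, s, d)
    gate = BlockSkimGate(num_heads=h, block_size=32)
    probs = gate(attn)
    pruned = prune_blocks(hidden, probs, block_size=32)
    print(probs.shape, pruned.shape)  # e.g. torch.Size([1, 4]) torch.Size([1, <=128, 768])
```

Shortening the sequence this way reduces the quadratic self-attention cost in subsequent layers, which is the source of the reported inference speedup.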
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
One-sentence Summary: An efficient plug-and-play method for Transformer-based models that skims context blocks.
Supplementary Material: zip
Reviewed Version (pdf): https://openreview.net/references/pdf?id=y1VxQmk8Ft
13 Replies
