Understanding, Uncovering, and Mitigating the Causes of Inference Slowdown for Language Models

IEEE SaTML 2024 Conference Submission 158 Authors

Published: 07 Mar 2024, Last Modified: 18 Apr 2024
Keywords: adversarial machine learning, pre-trained language models, efficient machine learning
TL;DR: Demystifying the slowdown effect on pre-trained language models and increasing robustness against slowdown attacks.
Abstract: Dynamic neural networks (DyNNs) have shown promise for alleviating the high computational costs of pre-trained language models (PLMs), such as BERT and GPT. Emerging slowdown attacks have been shown to inhibit the ability of DyNNs to omit computation, e.g., by skipping layers that are deemed unnecessary. As a result, these attacks can cause significant delays in inference speed for DyNNs and may erase their cost savings altogether. Most research on slowdown attacks has been in the image domain, despite the ever-growing computational costs---and relevance of DyNNs---in the language domain. Unfortunately, it is still not understood which language artifacts trigger extra processing in a PLM, or what causes this behavior. We aim to fill this gap through an empirical exploration of the slowdown effect on language models. Specifically, we uncover a crucial difference between the slowdown effect in the image and language domains, illuminate the efficacy of pre-existing and novel techniques for causing slowdown, and report circumstances where slowdown does not occur. Building on these observations, we propose the first approach for mitigating the slowdown effect. Our results suggest that slowdown attacks can provide new insights that can inform the development of more efficient PLMs.
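For readers unfamiliar with the mechanism the abstract refers to, the sketch below illustrates confidence-based early exit, the layer-skipping DyNN behavior that slowdown attacks target. This is not the paper's implementation; the encoder layers, per-layer exit heads, and the 0.9 confidence threshold are illustrative assumptions in the style of multi-exit PLMs such as DeeBERT or PABEE.

```python
# Minimal sketch (PyTorch) of a multi-exit encoder with confidence-based
# early exit. All hyperparameters here are hypothetical, not from the paper.
import torch
import torch.nn as nn


class EarlyExitEncoder(nn.Module):
    def __init__(self, dim=128, num_layers=6, num_classes=2, threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )
        # One lightweight "exit head" per layer predicts the label early.
        self.exits = nn.ModuleList(
            nn.Linear(dim, num_classes) for _ in range(num_layers)
        )
        self.threshold = threshold

    def forward(self, x):
        for depth, (layer, head) in enumerate(zip(self.layers, self.exits), 1):
            x = layer(x)
            probs = head(x.mean(dim=1)).softmax(dim=-1)
            # Exit as soon as the head is confident; the remaining layers are
            # skipped. A slowdown attack crafts inputs that keep confidence
            # below the threshold, forcing computation through every layer
            # and erasing the cost savings.
            if probs.max() >= self.threshold:
                return probs, depth
        return probs, depth  # fell through all layers: no savings


model = EarlyExitEncoder()
probs, layers_used = model(torch.randn(1, 16, 128))
print(f"exited after {layers_used} layer(s)")
```

Under this design, inference cost is input-dependent: easy inputs exit after one or two layers, while adversarially crafted inputs can force the full depth, which is the slowdown effect the paper studies.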
Submission Number: 158