LiteTransformerSearch: Training-free On-device Search for Efficient Autoregressive Language Models

Published: 16 May 2022, Last Modified: 05 May 2023, AutoML 2022 (Late-Breaking Workshop)
Abstract: The Transformer architecture is ubiquitously used as the building block of large-scale autoregressive language models. However, finding architectures with the optimal trade-off between task performance (perplexity) and hardware constraints like peak memory utilization and latency is non-trivial. We leverage the somewhat surprising empirical observation that the number of non-embedding parameters in autoregressive Transformers has a high rank correlation with task performance, irrespective of the architecture topology. This observation organically induces a simple search algorithm that can be run directly on target devices. We rigorously show that the Pareto frontier of perplexity versus different hardware costs, such as latency and memory, can be found without any model training, using the non-embedding parameter count as a proxy for perplexity. We evaluate our method, dubbed Lightweight Transformer Search (LTS), on diverse devices ranging from ARM CPUs to NVIDIA GPUs and on two popular autoregressive Transformer backbones: GPT-2 and Transformer-XL. Results show that the perplexity of the 16-layer GPT-2 and Transformer-XL baselines can be matched with up to 1.6× and 2.5× faster runtime, respectively, and 1.3× and 2× lower peak memory utilization. LTS extracts the Pareto frontier in under 3 hours while running on a commodity laptop. By eliminating the hundreds of GPU hours of training otherwise needed during search, we effectively remove its carbon footprint, offering a strong and simple baseline for future NAS methods in autoregressive language modeling.
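The sketch below illustrates, in broad strokes, the search recipe the abstract describes: sample Transformer configurations, score each one with its non-embedding parameter count as a training-free perplexity proxy, pair that score with a hardware measurement, and keep the Pareto-optimal set. The search-space bounds, the rough parameter-count formula, the random-sampling loop, and `measure_latency` (a synthetic stand-in for a real on-device timing or peak-memory reading) are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a proxy-based, training-free Pareto-frontier search.
# All names and constants below are hypothetical placeholders.
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Arch:
    n_layer: int
    d_model: int
    n_head: int
    d_inner: int

def sample_arch(rng: random.Random) -> Arch:
    # Draw a random point from an assumed decoder search space.
    return Arch(
        n_layer=rng.choice(range(2, 17)),
        d_model=rng.choice([256, 384, 512, 768, 1024]),
        n_head=rng.choice([2, 4, 8]),
        d_inner=rng.choice([512, 1024, 2048, 3072]),
    )

def non_embedding_params(a: Arch) -> int:
    # Rough per-layer count: attention (4 * d_model^2) + FFN (2 * d_model * d_inner),
    # ignoring biases and layer norms; embedding/softmax parameters are excluded.
    per_layer = 4 * a.d_model ** 2 + 2 * a.d_model * a.d_inner
    return a.n_layer * per_layer

def measure_latency(a: Arch) -> float:
    # Placeholder: in practice this would be an actual forward-pass timing (or
    # peak-memory reading) collected on the target CPU/GPU. Here we fake a cost
    # that grows with model size so the example runs anywhere.
    return 1e-9 * non_embedding_params(a) * (1.0 + 0.02 * a.n_layer)

def pareto_frontier(points):
    # Keep candidates not dominated on (maximize proxy, minimize latency).
    frontier = []
    for a, p, l in points:
        dominated = any(
            (p2 >= p and l2 <= l) and (p2 > p or l2 < l)
            for _, p2, l2 in points
        )
        if not dominated:
            frontier.append((a, p, l))
    return frontier

if __name__ == "__main__":
    rng = random.Random(0)
    samples = [sample_arch(rng) for _ in range(200)]
    scored = [(a, non_embedding_params(a), measure_latency(a)) for a in samples]
    for arch, proxy, lat in sorted(pareto_frontier(scored), key=lambda t: t[2]):
        print(f"{arch}  proxy_params={proxy / 1e6:6.1f}M  latency≈{lat * 1e3:.2f} ms")
```

Because the proxy requires no training, the only per-candidate cost is the on-device measurement, which is what lets the full frontier be extracted on commodity hardware in hours rather than hundreds of GPU hours.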
Keywords: neural architecture search, transformers, autoregressive language modeling
One-sentence Summary: We propose a training-free architecture evaluation proxy for NAS on Transformers that enables fast search directly on the target commodity hardware.
Reproducibility Checklist: Yes
Broader Impact Statement: Yes
Paper Availability And License: Yes
Code Of Conduct: Yes
Reviewers: Mojan Javaheripi, mojan@ucsd.edu