Abstract: The Transformer architecture is ubiquitously used as the building block of large-scale autoregressive language models. However, finding architectures with the optimal trade-off between task performance (perplexity) and hardware constraints like peak memory utilization and latency is non-trivial. We leverage the somewhat surprising empirical observation that the number of non-embedding parameters in autoregressive Transformers has a high rank correlation with task performance, irrespective of the architecture topology. This observation organically induces a simple search algorithm that can be run directly on target devices. We rigorously show that the Pareto frontier of perplexity versus hardware costs such as latency and peak memory can be found without any model training, using the non-embedding parameter count as a proxy for perplexity. We evaluate our method, dubbed Lightweight Transformer Search (LTS), on diverse devices ranging from ARM CPUs to NVIDIA GPUs and on two popular autoregressive Transformer backbones: GPT-2 and Transformer-XL. Results show that the perplexity of 16-layer GPT-2 and Transformer-XL can be matched with up to 1.6× and 2.5× faster runtime, respectively, and 1.3× and 2× lower peak memory utilization. LTS extracts the Pareto frontier in under 3 hours while running on a commodity laptop. By eliminating training during the search, we remove the hundreds of GPU hours of carbon footprint it would otherwise incur, offering a strong and simple baseline for future NAS methods in autoregressive language modeling.
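The abstract describes the core loop: score candidate architectures by their non-embedding (decoder) parameter count as a training-free proxy for perplexity, measure hardware cost directly on the target device, and keep the non-dominated candidates. The sketch below is a minimal, hypothetical illustration of that idea only; the `Config` fields, the parameter-count formula, and the placeholder `measure_latency` are assumptions for illustration and do not reproduce the paper's actual search space or on-device benchmarking code.

```python
# Minimal sketch of a parameter-count-proxy search (hypothetical helpers; LTS pairs
# this proxy with real latency/memory measurements on the target device).
import random
from dataclasses import dataclass

@dataclass
class Config:
    n_layer: int
    d_model: int
    n_head: int
    d_inner: int

def non_embedding_params(cfg: Config) -> int:
    """Rough count of decoder (non-embedding) parameters for a GPT-2-style block:
    attention projections (Q, K, V, output) plus the two feed-forward projections."""
    attn = 4 * cfg.d_model * cfg.d_model
    ffn = 2 * cfg.d_model * cfg.d_inner
    return cfg.n_layer * (attn + ffn)

def measure_latency(cfg: Config) -> float:
    """Placeholder for an on-device latency benchmark (ms); illustrative stand-in only."""
    return 0.01 * non_embedding_params(cfg) / 1e6 + random.uniform(0.0, 5.0)

def pareto_frontier(candidates):
    """Keep candidates not weakly dominated: no other point has a proxy at least as
    high (better predicted perplexity) and a latency at least as low."""
    return [c for c in candidates
            if not any(o["proxy"] >= c["proxy"] and o["latency"] <= c["latency"] and o != c
                       for o in candidates)]

random.seed(0)
candidates = []
for _ in range(200):
    cfg = Config(n_layer=random.choice([2, 4, 8, 16]),
                 d_model=random.choice([256, 512, 768, 1024]),
                 n_head=random.choice([4, 8, 12]),
                 d_inner=random.choice([1024, 2048, 3072, 4096]))
    candidates.append({"cfg": cfg,
                       "proxy": non_embedding_params(cfg),
                       "latency": measure_latency(cfg)})

for point in pareto_frontier(candidates):
    print(point["cfg"], point["proxy"], f"{point['latency']:.2f} ms")
```

Because the proxy requires no training, each candidate is evaluated in milliseconds, which is what allows the full Pareto frontier to be extracted on a commodity laptop.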
Keywords: neural architecture search, transformers, autoregressive language modeling
One-sentence Summary: We propose a training-free architecture evaluation proxy for NAS on Transformers that enables fast search directly on the target commodity hardware.
Reproducibility Checklist: Yes
Broader Impact Statement: Yes
Paper Availability And License: Yes
Code Of Conduct: Yes
Reviewers: mojan javaheripi, mojan@ucsd.edu
Main Paper And Supplementary Material: pdf
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/litetransformersearch-training-free-on-device/code)