Keywords: Neural architecture search, large language models, structural pruning, efficiency
TL;DR: We advance two-stage neural architecture search for structured pruning to scale it to large language models.
Abstract: Large language models (LLMs) exhibit remarkable reasoning abilities, allowing
them to generalize across a wide range of downstream tasks, such as commonsense
reasoning or instruction following. However, as LLMs scale, inference costs
become increasingly prohibitive, accumulating significantly over their life cycle.
This poses the question: Can we compress pre-trained LLMs to meet diverse
size and latency requirements? We leverage Neural Architecture Search (NAS) to
compress LLMs by pruning structural components, such as attention heads, neurons,
and layers, aiming to achieve a Pareto-optimal balance between performance and
efficiency. While NAS has already achieved promising results on small language
models in previous work, in this paper we propose several extensions that allow us
to scale it to LLMs. Compared to structural pruning baselines, we show that NAS
improves performance by up to 3.4% on MMLU while also delivering an on-device latency speedup.
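To make the setup concrete, the sketch below illustrates the general idea of searching over structural-pruning configurations (retained layers, attention heads, and feed-forward neurons) and keeping only the Pareto-optimal trade-offs between a quality metric and a latency proxy. The configuration fields, ranges, and proxy metrics are illustrative assumptions for exposition only, not the paper's two-stage NAS method or search space.

```python
import random
from dataclasses import dataclass

# Illustrative sketch: a toy random search over structural-pruning
# configurations with a Pareto filter. All fields, ranges, and proxy
# metrics below are assumptions for demonstration purposes.

@dataclass(frozen=True)
class SubNetwork:
    num_layers: int    # transformer blocks kept
    num_heads: int     # attention heads kept per block
    ffn_neurons: int   # feed-forward hidden units kept per block

def sample_subnetwork(max_layers=32, max_heads=32, max_ffn=11008):
    """Sample one pruning configuration from a simple uniform search space."""
    return SubNetwork(
        num_layers=random.randint(max_layers // 2, max_layers),
        num_heads=random.randint(max_heads // 2, max_heads),
        ffn_neurons=random.randint(max_ffn // 2, max_ffn),
    )

def proxy_latency(net):
    """Toy latency proxy: grows with the number of retained components."""
    return net.num_layers * (net.num_heads + net.ffn_neurons / 128)

def proxy_accuracy(net):
    """Toy accuracy proxy with noise; a real search would evaluate the pruned LLM."""
    return 1.0 - 1.0 / (1.0 + 0.001 * proxy_latency(net)) + random.gauss(0, 0.02)

def pareto_front(evaluated):
    """Keep points not dominated in (higher accuracy, lower latency)."""
    return [
        (net, acc, lat)
        for net, acc, lat in evaluated
        if not any(a >= acc and l <= lat and (a, l) != (acc, lat)
                   for _, a, l in evaluated)
    ]

if __name__ == "__main__":
    evaluated = []
    for _ in range(200):
        net = sample_subnetwork()
        evaluated.append((net, proxy_accuracy(net), proxy_latency(net)))
    for net, acc, lat in sorted(pareto_front(evaluated), key=lambda t: t[2]):
        print(f"{net}  acc~{acc:.3f}  latency~{lat:.1f}")
```

In an actual search, the accuracy and latency proxies would be replaced by evaluation of the pruned model on validation data and by measured on-device latency, and the random sampler by a NAS strategy.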
Submission Number: 106