CLAM: Unifying Finetuning, Quantization, and Pruning by Chaining LLM Adapter Modules

Published: 21 Jun 2024, Last Modified: 26 Jul 2024, ES-FoMo-II 2024 Poster, CC BY 4.0
Keywords: parameter-efficient finetuning, quantization, pruning, adapters, LLM
TL;DR: A general framework for combining parameter-efficient finetuning, quantization, and pruning techniques to produce more efficient and performant compressed LLMs.
Abstract: As LLMs have grown in size and applicability, so too has the number of methods that adapt them for downstream tasks. Recent works addressing challenges in memory consumption, task performance, and inference efficiency have given rise to the fields of parameter-efficient finetuning (PEFT), quantization, and pruning, among others. While combining their benefits is useful, composing these techniques in flexible ways is challenging due to the changes each method makes to the model and the restrictions it may impose. To address these challenges, we develop an algebraic abstraction called CLAM that enables unlimited chaining of popular resource-efficient methods on nearly every modern LLM with minimal overhead. We demonstrate that CLAM can create new compositions of techniques that achieve SOTA performance in specializing compressed models across multiple benchmarks.
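To make the idea of "chaining" adapter modules concrete, below is a minimal, hypothetical sketch in PyTorch of composing a frozen (e.g. quantized or pruned) base layer with a chain of additive adapters such as LoRA. The class names (`ChainedLinear`, `LoRAAdapter`) and the `chain` method are illustrative assumptions for this sketch only and do not reflect CLAM's actual API or implementation.

```python
# Hypothetical sketch: chaining resource-efficient modules around a frozen base layer.
# Names and structure are assumptions, not CLAM's actual interface.
import torch
import torch.nn as nn

class ChainedLinear(nn.Module):
    """Wraps a frozen base linear layer and applies a chain of additive adapters."""
    def __init__(self, base: nn.Linear, adapters=None):
        super().__init__()
        self.base = base.requires_grad_(False)      # frozen base weights (e.g. quantized/pruned)
        self.adapters = nn.ModuleList(adapters or [])

    def chain(self, adapter: nn.Module):
        """Append another adapter to the chain and return self for fluent composition."""
        self.adapters.append(adapter)
        return self

    def forward(self, x):
        out = self.base(x)
        for adapter in self.adapters:               # each chained adapter adds its own update
            out = out + adapter(x)
        return out

class LoRAAdapter(nn.Module):
    """Low-rank (PEFT-style) update applied on top of the base layer's output."""
    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        self.down = nn.Linear(in_features, rank, bias=False)
        self.up = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.up.weight)              # chain starts as a no-op update

    def forward(self, x):
        return self.up(self.down(x))

# Usage: chain a trainable LoRA adapter onto a frozen base layer.
layer = ChainedLinear(nn.Linear(512, 512)).chain(LoRAAdapter(512, 512, rank=8))
y = layer(torch.randn(2, 512))
```

The key design point this sketch illustrates is that the base weights stay untouched while each adapter contributes an independent, composable update, which is what allows finetuning, quantization, and pruning to be stacked without interfering with one another.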
Submission Number: 60