Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Neural architecture search, model compression, structural pruning, large language models
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We present a practical use case for multi-objective, weight-sharing-based neural architecture search to prune fine-tuned large language models.
Abstract: Large language models (LLMs) represent the state of the art in natural language understanding.
However, their size makes them challenging to deploy for inference in real-world applications, owing to substantial GPU memory requirements and high inference latency.
This paper explores weight-sharing-based neural architecture search (NAS) as a form of structural pruning to find sub-networks of the fine-tuned network that optimally trade off efficiency, for example model size or latency, against generalization performance.
Unlike traditional pruning methods with fixed thresholds, we adopt a multi-objective approach that identifies the Pareto-optimal set of sub-networks, allowing for a more flexible and automated compression process.
Our NAS approach achieves up to 50% compression with less than 5% performance drop for a fine-tuned BERT model on 7 out of 8 text classification tasks.
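As a minimal illustration of the multi-objective selection step mentioned in the abstract (this is not the authors' implementation; candidate scores and objectives are hypothetical), the sketch below identifies the Pareto-optimal subset of candidate sub-networks when each candidate is evaluated on two objectives to be minimized, such as relative parameter count and validation error.

```python
# Illustrative sketch only: Pareto-front selection over candidate sub-networks.
# Assumes each candidate is summarized by two objectives to minimize,
# e.g. (relative parameter count, validation error). Not the paper's code.

from typing import List, Tuple


def pareto_optimal(candidates: List[Tuple[float, float]]) -> List[int]:
    """Return indices of candidates not dominated by any other candidate.

    Candidate a dominates b if a is no worse than b in both objectives
    and strictly better in at least one.
    """
    front = []
    for i, (size_i, err_i) in enumerate(candidates):
        dominated = any(
            (size_j <= size_i and err_j <= err_i)
            and (size_j < size_i or err_j < err_i)
            for j, (size_j, err_j) in enumerate(candidates)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical sub-networks sampled from a weight-sharing super-network:
# (relative parameter count, validation error)
subnets = [(1.00, 0.08), (0.50, 0.12), (0.50, 0.10), (0.30, 0.20), (0.70, 0.09)]
print(pareto_optimal(subnets))  # indices of the non-dominated trade-offs
```

In a weight-sharing NAS setting, the candidates would typically be sub-networks evaluated within a shared super-network, and the resulting front lets practitioners pick the compression level that suits their deployment budget rather than committing to a single fixed pruning threshold.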
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2340