Keywords: Deep Learning, Model Compression, Efficient Deep Learning, Parameter Sharing, Sparsity, Tensor Decomposition, Quantization, Quantization Aware Training, QAT, Vision Transformers, ViT, DeiT-B, SWIN-L, Large Language Models, LLM, Gemma2, LLaMa3.1, LLaMa
TL;DR: FiPS is a sparsity-driven parameter-sharing method that combines shared bases, sparse factors, and quantization-aware training to compress ViTs and LLMs substantially while preserving accuracy and perplexity.
Abstract: Large neural networks attain cutting-edge performance on many tasks, yet their sheer size hinders deployment on resource-constrained devices. Among existing compression approaches, parameter sharing remains relatively unexplored. In this paper, we introduce Fine-grained Parameter Sharing (FiPS), a unified compression framework that combines parameter sharing, tensor decomposition, and sparsity to achieve high compression ratios. FiPS compresses transformers by factorizing MLPs, concatenated across layers, into a shared low-rank basis with sparse, layer-specific projection matrices. Both components are initialized via singular-value decomposition (SVD) and jointly optimized by minimizing block-wise reconstruction error. As a result, FiPS compresses a variety of Vision Transformers (ViTs) and Large Language Models (LLMs) by 20–50% with negligible degradation in quality. Finally, we combine FiPS with Quantization Aware Training (QAT) to obtain state-of-the-art compression results on GEMMA-2 models. These results establish fine-grained parameter sharing as a practical route to compact, high-performance transformer models.
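The SVD-based initialization described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the function name `fips_init`, the magnitude-pruning rule, and the `keep_frac` parameter are illustrative assumptions. The idea shown: concatenate per-layer weight matrices, take a truncated SVD to obtain a shared basis, then split the right factor back into layer-specific coefficient matrices and sparsify them by magnitude.

```python
import numpy as np

def fips_init(weights, rank, keep_frac=0.5):
    """Hypothetical sketch of a FiPS-style initialization.

    weights:   list of L matrices, each of shape (d, h)
    rank:      number of shared basis vectors to keep
    keep_frac: fraction of entries retained in each sparse factor
    """
    # Concatenate layer weights along the column axis: (d, L*h).
    W = np.concatenate(weights, axis=1)
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    basis = U[:, :rank]                       # shared low-rank basis, (d, rank)
    coeffs = s[:rank, None] * Vt[:rank]       # combined coefficients, (rank, L*h)

    # Split coefficients back per layer and keep only the
    # largest-magnitude entries (simple magnitude pruning).
    h = weights[0].shape[1]
    sparse_factors = []
    for i in range(len(weights)):
        C = coeffs[:, i * h:(i + 1) * h].copy()
        k = int(keep_frac * C.size)
        thresh = np.sort(np.abs(C).ravel())[-k] if k > 0 else np.inf
        C[np.abs(C) < thresh] = 0.0
        sparse_factors.append(C)              # sparse, layer-specific factor
    return basis, sparse_factors
```

In the actual method, both the shared basis and the sparse factors would subsequently be fine-tuned by minimizing block-wise reconstruction error; the sketch covers only the SVD initialization step.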
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 584