Keywords: Low Rank Adaptation, Edge Devices, Quantization, Compression, Efficient Fine-tuning
Abstract: The deployment of large language models (LLMs) for specialized tasks on resource-constrained edge devices such as smartphones and sensors presents a significant scalability problem. To run on such hardware, these massive models must be compressed using techniques such as \emph{quantization or pruning} to reduce their memory and computational footprint. Concurrently, foundational LLMs are periodically updated by their developers with new data, causing their internal parameters to shift over time. While parameter-efficient methods like Low-Rank Adaptation (LoRA) streamline personalization by fine-tuning only a small fraction of parameters, the resulting adapters are brittle: a LoRA adapter trained for one specific compression scheme is incompatible with another, and an adapter trained on an older base model performs poorly on an updated one. This forces a costly cycle of retraining for each unique device and every new model release. To address this, we introduce a novel framework that creates a single, universally portable adapter that is both \textit{(i)} compression-aware and \textit{(ii)} temporally robust. We achieve this by augmenting the training process with a variety of simulated compression techniques within a single run, using a quantized forward pass to build resilience while maintaining a full-precision backward pass for stable gradient optimization. This method yields a unified adapter that is robust to diverse compression artifacts and to the subtle parameter shifts introduced by model evolution. Extensive experiments on models such as \texttt{Llama-2}, \texttt{Llama-3.1}, \texttt{Gemma-2}, and \texttt{Mistral} across reasoning benchmarks such as SQA, MATH, and GSM8K demonstrate that our single adapter achieves performance comparable to specialized adapters (\textit{e.g.}, QLoRA) that are individually retrained for each compression scheme. Furthermore, we show that this single adapter maintains its high performance when applied to future, evolved versions of the base model, eliminating the need for periodic retraining. Our work pioneers an efficient paradigm for edge AI, creating portable model patches that bridge the gap between cloud-based personalization, the diverse hardware ecosystem, and the lifecycle of evolving LLMs.
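The training mechanism described above (a quantized forward pass over a frozen base weight, a full-precision backward pass, and only the LoRA factors receiving gradients) can be illustrated with a short sketch. This is a minimal, assumed PyTorch rendering and not the paper's implementation: the `CompressionAwareLoRALinear` class, the `fake_quantize` helper, and the per-step sampling of bit-widths are hypothetical names standing in for "a variety of simulated compression techniques during a single run."

```python
# Minimal sketch (not the authors' code): a LoRA linear layer whose frozen base
# weight is fake-quantized in the forward pass, while a straight-through-style
# trick keeps the backward pass on the full-precision path. The bit-width
# sampling below is an assumption about how the single-run augmentation
# over multiple compression schemes could be implemented.
import random
import torch
import torch.nn as nn


def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-tensor fake quantization to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale


class CompressionAwareLoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 8,
                 alpha: float = 16.0, bit_choices=(4, 8)):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)            # frozen base model weight
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank
        self.bit_choices = bit_choices                     # simulated compression schemes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sample one compression scheme per step so a single adapter is exposed
        # to many compression artifacts over the course of training.
        bits = random.choice(self.bit_choices)
        w_fp = self.base.weight
        w_q = fake_quantize(w_fp, bits)
        out_fp = nn.functional.linear(x, w_fp)
        out_q = nn.functional.linear(x, w_q)
        # Forward value uses the quantized weights; the detach keeps gradients
        # flowing through the full-precision path for stable optimization.
        base_out = out_fp + (out_q - out_fp).detach()
        lora_out = (x @ self.lora_A.T) @ self.lora_B.T * self.scaling
        return base_out + lora_out


# Usage example: only lora_A and lora_B receive gradients.
layer = CompressionAwareLoRALinear(768, 768)
y = layer(torch.randn(2, 768))   # each call samples a bit-width from bit_choices
```

In this reading, the adapter never sees a single fixed deployment format; it is optimized against a distribution of simulated compressions, which is one plausible way to obtain the compression-aware robustness claimed in the abstract.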
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 14881