Enabling Efficient LLM Fine-Tuning at the Edge via Inference Engines

ACL ARR 2025 February Submission 5326 Authors

16 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across multiple domains, and fine-tuning is an essential step in adapting a pre-trained model to downstream tasks with user data. Given the sensitive nature of such private data, it is desirable to fine-tune these models on edge devices to improve user trust. However, fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands, as well as limited infrastructure support for backpropagation. We observe that inference engines (e.g., ExecuTorch) can be repurposed for fine-tuning by leveraging zeroth-order (ZO) optimization. Memory-efficient ZO (MeZO) estimates gradients using only two forward passes, reducing memory cost to that of inference. However, ZO methods require multiple gradient-estimation queries at each training step to achieve good performance. Since multi-query gradient estimation consists of multiple independent forward passes, our key insight is that these passes can be executed in parallel. To this end, we propose parallelized randomized gradient estimation (P-RGE), which employs a novel design based on parameter-efficient fine-tuning techniques to achieve high-speed training while retaining the accuracy gains of multi-query estimation, without increasing computational cost. Moreover, P-RGE extends inference engines seamlessly, requiring no changes to their underlying runtime code and only minimal server-side modifications. Through extensive experiments, we demonstrate that P-RGE delivers substantial gains in fine-tuning efficiency and accuracy, thereby enabling real-time, on-device personalization of LLMs under strict memory and compute budgets. Code available at: anonymous.4open.science/r/PRGE-ARR-4FE5.
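To make the ZO setup in the abstract concrete, below is a minimal PyTorch sketch of MeZO-style randomized gradient estimation with q queries per step: each query perturbs the parameters in place with a seeded random direction, runs two forward passes, and regenerates the direction from the seed for the update so no gradient buffers are stored. The names `zo_rge_step`, `perturb`, `loss_fn`, `eps`, `lr`, `q`, and `seed0` are illustrative assumptions, not the paper's API.

```python
import torch


def perturb(params, scale, seed):
    """Add scale * z to every parameter, with z ~ N(0, I) regenerated from seed."""
    gen = torch.Generator(device=params[0].device).manual_seed(seed)
    for p in params:
        z = torch.randn(p.shape, generator=gen, device=p.device, dtype=p.dtype)
        p.data.add_(scale * z)


def zo_rge_step(model, loss_fn, batch, eps=1e-3, lr=1e-6, q=1, seed0=0):
    """One MeZO-style zeroth-order step (sketch only, sequential over q queries)."""
    params = [p for p in model.parameters() if p.requires_grad]
    proj_grads = []
    with torch.no_grad():  # forward-only, as on an inference engine
        for i in range(q):
            seed = seed0 + i
            perturb(params, eps, seed)            # theta -> theta + eps*z
            loss_plus = loss_fn(model, batch).item()
            perturb(params, -2 * eps, seed)       # -> theta - eps*z
            loss_minus = loss_fn(model, batch).item()
            perturb(params, eps, seed)            # restore theta
            proj_grads.append((loss_plus - loss_minus) / (2 * eps))
        # SGD update: regenerate each direction z from its seed instead of storing it
        for i, g in enumerate(proj_grads):
            perturb(params, -lr * g / q, seed0 + i)
```

In this sketch the q queries run sequentially; the abstract's P-RGE instead executes them in parallel. One way to realize that (an assumption, not necessarily the paper's exact design) is to hold q perturbed copies of only the parameter-efficient adapter weights and evaluate them in a single batched forward pass, so the wall-clock cost per step approaches that of one query.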
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: parameter-efficient-training, NLP in resource-constrained settings
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings / efficiency
Languages Studied: English
Submission Number: 5326