SFT-P: Federated Tuning and Pruning for LLMs

ACL ARR 2026 January Submission7081 Authors

06 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: Federated Learning, Large Language Model, Structural Pruning, Model Compression, Personalized On-device Inference
Abstract: Deploying large language models (LLMs) on personal devices is appealing because the most useful interactions depend on private, user-specific context, yet on-device inference and adaptation are constrained by memory, latency, and energy budgets. Structural pruning can reduce runtime cost by removing coherent computation units while preserving dense-kernel execution, but most existing pruning pipelines are developed in centralized settings with shared corpora and optimize for broadly averaged capability, which is misaligned with personalized objectives and difficult to transfer to privacy-sensitive, non-IID client data. We present Structural Federated Tuning and Pruning (SFT-P), a federated framework that learns the pruning decision jointly with training under a round-based FedAvg protocol. SFT-P uses a client-conditioned mask generator with globally shared parameters and a private on-client embedding to produce hard routing masks, enabling client-specific structured pruning under specified budgets; it can optionally co-train lightweight low-rank adapters to improve robustness at higher pruning ratios. Experiments on four heterogeneous client tasks show that federating pruning decisions with training is especially beneficial under aggressive compression, where SFT-P improves the best federated baseline by +8.5 Avg points on LLaMA-7B at 50% pruning.
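The abstract describes a client-conditioned mask generator: globally shared parameters (aggregated via FedAvg) combined with a private on-client embedding to produce a hard 0/1 routing mask that keeps only a budgeted fraction of structural units. The paper's actual architecture is not given here, so the following is a minimal sketch under assumed details (a linear scoring layer, top-k selection, and hypothetical dimensions) of how such a budgeted hard mask could be produced:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): 32 prunable structural units
# (e.g. attention heads) and a client embedding of dimension 8.
NUM_UNITS, EMB_DIM = 32, 8

# Globally shared generator parameters (would be averaged by FedAvg).
W = rng.normal(scale=0.1, size=(EMB_DIM, NUM_UNITS))

# Private on-client embedding (stays on the device, never aggregated).
client_emb = rng.normal(size=EMB_DIM)

def hard_mask(emb, W, keep_ratio):
    """Score each unit with the shared generator conditioned on the
    client embedding, then keep the top-k units under the budget."""
    scores = emb @ W                       # client-conditioned unit scores
    k = int(round(keep_ratio * len(scores)))
    mask = np.zeros_like(scores)
    mask[np.argsort(scores)[-k:]] = 1.0    # hard 0/1 routing mask
    return mask

# 50% pruning budget, as in the abstract's headline setting.
mask = hard_mask(client_emb, W, keep_ratio=0.5)
print(int(mask.sum()))
```

In training, a hard top-k selection like this would typically need a straight-through estimator or similar relaxation to pass gradients to the shared generator; that detail is omitted in this sketch.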
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Efficient/Low-Resource Methods for NLP, pruning, NLP in resource-constrained settings
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 7081