TLC-Calibrator: Latency-Efficient LoRA-Split Inference for Edge LLMs via Two-Tiered Communication Calibration
Keywords: Collaborative Inference, Multi-Head Attention, Semantic Importance, Transformer
Abstract: Large language models (LLMs) achieve strong generalization across diverse tasks, but their size hinders personalized deployment on user devices. Low-Rank Adaptation (LoRA) enables user-specific fine-tuning with minimal additional parameters, and its structural separability naturally supports collaborative edge inference, where the frozen base model runs on an edge server and lightweight LoRA adapters reside on user devices for privacy and scalability. However, each LoRA layer induces two edge--device communication rounds to exchange hidden states and LoRA-updated projections, making communication latency dominate inference time. Our empirical analysis shows that the impact of LoRA is highly uneven across layers and tokens, and skipping LoRA in low-impact regions leads to negligible accuracy loss. Building on this observation, we propose TLC-Calibrator, a two-tiered communication calibration framework that adaptively decides when LoRA-related communication is necessary. A server-side calibrator determines whether to transmit intermediate activations to the device, while a device-side calibrator decides whether the resulting LoRA projections should be sent back. Experiments show that TLC-Calibrator achieves up to 2.4$\times$ speedup with less than 1.9\% accuracy loss.
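The two-tiered decision flow described in the abstract can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the calibrator scoring functions, thresholds, layer update, and LoRA delta below are all illustrative assumptions, and communication is simulated by counting rounds.

```python
def server_calibrator(h, threshold=0.5):
    # Hypothetical tier-1 score: mean activation magnitude as a proxy
    # for expected LoRA impact; transmit to the device only if high.
    return sum(abs(x) for x in h) / len(h) > threshold

def device_calibrator(delta, threshold=0.1):
    # Hypothetical tier-2 score: send the LoRA update back to the edge
    # only if it is large enough to matter downstream.
    return sum(abs(x) for x in delta) / len(delta) > threshold

def lora_delta(h, scale=0.05):
    # Stand-in for the low-rank update B @ (A @ h) computed on-device.
    return [scale * x for x in h]

def split_inference(hidden, num_layers=4):
    rounds = 0  # edge--device communication rounds actually incurred
    for _ in range(num_layers):
        hidden = [0.9 * x + 0.01 for x in hidden]  # frozen base layer (toy)
        if server_calibrator(hidden):              # tier 1: edge -> device?
            rounds += 1
            delta = lora_delta(hidden)
            if device_calibrator(delta):           # tier 2: device -> edge?
                rounds += 1
                hidden = [h + d for h, d in zip(hidden, delta)]
    return hidden, rounds
```

In this toy run the device-side calibrator suppresses every return trip, so only 4 of the naive 8 rounds (two per layer) are used, mirroring the latency savings the paper targets.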
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Efficient/Low-Resource Methods for NLP
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 5206