TLC-Calibrator: Latency-Efficient LoRA-Split Inference for Edge LLMs via Two-Tiered Communication Calibration
Keywords: Collaborative Inference, Multi-Head Attention, Semantic Importance, Transformer
Abstract: Large language models (LLMs) achieve strong generalization across diverse tasks, but their size hinders personalized deployment on user devices. Low-Rank Adaptation (LoRA) enables user-specific fine-tuning with minimal additional parameters, and its structural separability naturally supports collaborative edge inference, where the frozen base model runs on an edge server and lightweight LoRA adapters reside on user devices for privacy and scalability. However, each LoRA layer induces two edge--device communication rounds to exchange hidden states and LoRA-updated projections, making communication latency dominate inference time. Our empirical analysis shows that the impact of LoRA is highly uneven across layers and tokens, and skipping LoRA in low-impact regions leads to negligible accuracy loss. Building on this observation, we propose TLC-Calibrator, a two-tiered communication calibration framework that adaptively decides when LoRA-related communication is necessary. A server-side calibrator determines whether to transmit intermediate activations to the device, while a device-side calibrator decides whether the resulting LoRA projections should be sent back. Experiments show that TLC-Calibrator achieves up to 2.4$\times$ speedup with less than 1.9\% accuracy loss.
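The two-tiered decision flow described in the abstract can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the calibrator scoring functions, thresholds, layer update, and LoRA delta below are all illustrative assumptions, and communication is simulated by counting rounds.

```python
def server_calibrator(h, threshold=0.5):
    # Hypothetical tier-1 score: mean activation magnitude as a proxy
    # for expected LoRA impact; transmit to the device only if high.
    return sum(abs(x) for x in h) / len(h) > threshold

def device_calibrator(delta, threshold=0.1):
    # Hypothetical tier-2 score: send the LoRA update back to the edge
    # only if it is large enough to matter downstream.
    return sum(abs(x) for x in delta) / len(delta) > threshold

def lora_delta(h, scale=0.05):
    # Stand-in for the low-rank update B @ (A @ h) computed on-device.
    return [scale * x for x in h]

def split_inference(hidden, num_layers=4):
    rounds = 0  # edge--device communication rounds actually incurred
    for _ in range(num_layers):
        hidden = [0.9 * x + 0.01 for x in hidden]  # frozen base layer (toy)
        if server_calibrator(hidden):              # tier 1: edge -> device?
            rounds += 1
            delta = lora_delta(hidden)
            if device_calibrator(delta):           # tier 2: device -> edge?
                rounds += 1
                hidden = [h + d for h, d in zip(hidden, delta)]
    return hidden, rounds
```

In this toy run the device-side calibrator suppresses every return trip, so only 4 of the naive 8 rounds (two per layer) are used, mirroring the latency savings the paper targets.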
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Efficient/Low-Resource Methods for NLP
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 5206