Keywords: On-device LLMs, knowledge distillation, query efficiency, quantization, edge AI
Abstract: Large language models (LLMs) are increasingly deployed on edge devices under strict computation, memory, and quantization constraints.
In such settings, extracting or distilling knowledge from heavily quantized on-device LLMs poses a fundamentally different challenge from conventional cloud-based distillation, due to limited query budgets and amplified quantization noise. We propose CLIQ (Clustered Instruction Querying), a query-efficient distillation framework designed for extracting knowledge from quantized on-device LLMs. CLIQ explicitly models the semantic structure of the instruction space by clustering queries and generating a compact set of cluster-aware, representative instructions, thereby improving semantic coverage while reducing redundancy. Extensive experiments on quantized Qwen-family models under INT8 and INT4 settings show that, under identical query budgets, CLIQ consistently outperforms original query sampling across BERTScore, BLEU, and ROUGE metrics. Our results demonstrate that structured, semantically representative supervision is critical for effective distillation of edge-oriented language models.
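The abstract describes clustering the instruction space and selecting a compact set of representative queries. A minimal sketch of that idea, assuming instruction embeddings are available, is to run k-means over the embeddings and keep the query nearest to each centroid; the function name and procedure here are illustrative assumptions, not CLIQ's actual algorithm, which additionally generates cluster-aware instructions.

```python
import numpy as np

def select_representative_queries(embeddings, k, iters=20, seed=0):
    """Hypothetical helper: cluster query embeddings with simple k-means
    and return, for each cluster, the index of the query closest to the
    centroid. This only sketches the selection step; CLIQ's full method
    (cluster-aware instruction generation) is not reproduced here."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    # initialize centroids from k distinct queries
    centroids = embeddings[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each embedding to its nearest centroid
        dists = np.linalg.norm(
            embeddings[:, None, :] - centroids[None, :, :], axis=-1
        )
        labels = dists.argmin(axis=1)
        # update each centroid to the mean of its members
        for j in range(k):
            members = embeddings[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    # representative query = member closest to its cluster centroid
    reps = []
    for j in range(k):
        idx = np.where(labels == j)[0]
        if idx.size:
            d = np.linalg.norm(embeddings[idx] - centroids[j], axis=1)
            reps.append(int(idx[d.argmin()]))
    return sorted(reps)
```

Under a fixed query budget, querying the on-device model only with such representatives (rather than the full pool) is the kind of redundancy reduction the abstract attributes to CLIQ.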
Paper Type: Long
Research Area: Low-resource Methods for NLP
Research Area Keywords: quantization, distillation, data-efficient training, LLM efficiency, NLP in resource-constrained settings
Contribution Types: NLP engineering experiment, Approaches for low compute settings-efficiency
Languages Studied: English
Submission Number: 5385