Compact Yet Capable: Does Multitask-Based Multi-Teacher Distillation with Precision-Controlled, Task-Specific Dynamic PTQ Outperform Static Quantization for Low-Resource Multitask NLU?

20 Sept 2025 (modified: 20 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multi-Teacher Knowledge Distillation, Precision-Controlled Dynamic PTQ, Static PTQ, Multitask NLU, Unified Weight-Activation Precision
Abstract: The evolution of conversational AI demands not only accuracy but also efficiency and scalability. In low-resource Indic languages (Tamil, Telugu, Malayalam, Kannada, Hindi, Bengali), multi-intent, cross-domain NLU tasks such as Intent Detection (ID), Domain Classification (DC), and Slot Filling (SF) are especially challenging due to domain variability and limited annotated data. Large language models, though powerful, incur high computational cost and slow inference. Knowledge Distillation (KD) enables lightweight student models to retain much of the performance of larger teachers, while post-training quantization (PTQ) further reduces inference cost, making low-resource multitask NLU more feasible on constrained hardware. In this paper, we investigate scalable deployment architectures for multitask NLU in resource-constrained environments. We compare static PTQ applied to a non-distilled multitask baseline with precision-controlled, task-specific dynamic PTQ applied to a multi-teacher distilled student. Static PTQ uses QuantStub/DeQuantStub insertion, calibration over representative batches, and zero-point (affine) quantization, whereas the distilled student undergoes precision-controller-driven dynamic PTQ after training. The student is distilled from three teacher pairs (ID–DC, ID–SF, DC–SF) using adaptive attention-based fusion and temperature scaling. The controller assigns different precisions to encoder attention layers, encoder MLP blocks, and each multitask head (ID, DC, SF), enabling finer-grained accuracy-efficiency optimization without a calibration pass. By unifying weight and activation precision under a single runtime policy, our approach further reduces memory and bandwidth requirements without degrading accuracy. Experimental results on a custom multilingual Indic dataset show that the multitask, multi-teacher-distilled, precision-controller-quantized student achieves a superior accuracy-efficiency trade-off, significantly reducing inference latency, memory footprint, and runtime bandwidth while preserving accuracy across all three NLU tasks. Our study demonstrates that unifying KD with precision-controlled, task-specific dynamic PTQ under a single weight-activation policy delivers scalable, real-time NLU for low-resource multilingual settings.
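To make the distillation setup concrete, the following is a minimal PyTorch sketch of attention-based teacher fusion with temperature scaling, in the spirit of the description above. The gating network, tensor shapes, and the names AdaptiveTeacherFusion, student_hidden, and temperature are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveTeacherFusion(nn.Module):
    """Hypothetical sketch: fuse logits from three teacher pairs (ID-DC, ID-SF, DC-SF)
    with example-dependent attention weights, then distill into the student with a
    temperature-scaled KL loss."""
    def __init__(self, hidden_dim: int, num_teachers: int = 3, temperature: float = 2.0):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_teachers)  # scores each teacher per example
        self.temperature = temperature

    def forward(self, student_logits, teacher_logits_list, student_hidden):
        # Attention weights over teachers, conditioned on the student's pooled state.
        alpha = F.softmax(self.gate(student_hidden), dim=-1)        # [B, num_teachers]
        teacher_logits = torch.stack(teacher_logits_list, dim=1)    # [B, num_teachers, C]
        fused = (alpha.unsqueeze(-1) * teacher_logits).sum(dim=1)   # [B, C]
        # Temperature-scaled KL distillation loss (scaled by T^2 as is standard).
        T = self.temperature
        return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                        F.softmax(fused / T, dim=-1),
                        reduction="batchmean") * (T * T)
```

Fusing teacher logits per example lets the student weight each teacher pair differently depending on the input, rather than averaging the teachers uniformly.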
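The static PTQ baseline can be sketched with PyTorch's eager-mode quantization API. This is a simplified illustration that assumes the encoder consumes pre-computed float features (QuantStub must see float tensors, not integer token ids); the module and head names are placeholders.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import QuantStub, DeQuantStub, get_default_qconfig, prepare, convert

class QuantizableNLUBaseline(nn.Module):
    """Hypothetical non-distilled multitask baseline wrapped for static PTQ."""
    def __init__(self, encoder, id_head, dc_head, sf_head):
        super().__init__()
        self.quant = QuantStub()      # quantizes float inputs after convert()
        self.encoder = encoder
        self.id_head, self.dc_head, self.sf_head = id_head, dc_head, sf_head
        self.dequant = DeQuantStub()  # returns float outputs for each task head

    def forward(self, x):
        x = self.quant(x)
        h = self.encoder(x)
        return (self.dequant(self.id_head(h)),
                self.dequant(self.dc_head(h)),
                self.dequant(self.sf_head(h)))

def static_ptq(model, calibration_loader):
    model.eval()
    model.qconfig = get_default_qconfig("fbgemm")  # per-tensor affine (zero-point) quantization
    prepare(model, inplace=True)                   # insert activation observers
    with torch.no_grad():
        for batch in calibration_loader:           # calibrate over representative batches
            model(batch)
    convert(model, inplace=True)                   # fold observers into int8 weights/activations
    return model
```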
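For the distilled student, the controller's output can be applied with PyTorch's dynamic quantization, which needs no calibration data. The sketch below hard-codes an illustrative per-group policy; the module paths (e.g. encoder.attention) and the chosen dtypes are assumptions, and the actual precision controller is the one described in the paper.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

# Hypothetical per-group precision policy emitted by the controller; the dtype
# choices here are illustrative only.
PRECISION_POLICY = {
    "encoder.attention": torch.qint8,    # attention projections: int8
    "encoder.mlp":       torch.float16,  # MLP blocks: fp16
    "id_head":           torch.qint8,
    "dc_head":           torch.qint8,
    "sf_head":           torch.float16,  # sequence-labelling head kept at higher precision
}

def apply_precision_controlled_dynamic_ptq(model: nn.Module) -> nn.Module:
    """Quantize each named module group at the precision chosen by the controller.
    Dynamic PTQ quantizes weights ahead of time and activations on the fly at
    inference, so no calibration pass is required."""
    for name, dtype in PRECISION_POLICY.items():
        parent, _, child = name.rpartition(".")
        holder = model.get_submodule(parent) if parent else model
        quantized = quantize_dynamic(getattr(holder, child), {nn.Linear}, dtype=dtype)
        setattr(holder, child, quantized)
    return model
```

Because activation scales are computed at runtime, the controller can assign per-group precisions at deployment time without any representative calibration batches.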
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24240