TL;DR: We study per-task post-training quantization for LLMs and show that allocating mixed precision using task-conditioned hidden-representation signals preserves task accuracy under substantial compression.
Abstract: Many applications of large language models (LLMs) require only a narrow capability, yet common post-training quantization (PTQ) pipelines assign precision largely without regard to the target task. As a result, they may spend bits on layers that are less relevant to the task. We propose per-task mixed-precision PTQ guided by hidden representations. Given a small set of unlabeled calibration prompts from the target task, we estimate layer importance and allocate higher precision to task-relevant layers and lower precision to the rest, subject to a bit-allocation budget. We introduce three task-aware allocation signals: \textbf{TAQ}, which scores layers using an information-stability criterion derived from activation geometry; \textbf{TAQO}, which ranks layers by direct sensitivity to single-layer quantization; and \textbf{TAQ-KL}, which measures output sensitivity via KL divergence under a noise proxy for quantization error. Together, these methods provide a simple post-training framework that connects mechanistic signals to quantization decisions, enabling task-aligned compression without additional training. A reference implementation is available at https://anonymous.4open.science/r/TAQ-9217.
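As a rough illustration of the idea described in the abstract (not the authors' implementation), the sketch below scores each layer of a toy NumPy network by the KL divergence between clean and noise-perturbed outputs, then assigns the higher bit-width to the top-scoring layers under an average-bit budget. The function names `kl_sensitivity` and `allocate_bits`, the Gaussian noise scale `sigma`, the two-level choice of 4 and 8 bits, and the toy tanh network are all assumptions made for this example.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(weights, x):
    # Toy stand-in for a model: a stack of linear layers with tanh in between.
    h = x
    for i, w in enumerate(weights):
        h = h @ w
        if i < len(weights) - 1:
            h = np.tanh(h)
    return softmax(h)

def kl_sensitivity(weights, x, layer, sigma=0.05, seed=0):
    # Mean KL(p_clean || p_noisy) when Gaussian noise, used here as a
    # hypothetical proxy for quantization error, perturbs a single layer.
    rng = np.random.default_rng(seed)
    w = weights[layer]
    noisy = list(weights)
    noisy[layer] = w + sigma * w.std() * rng.standard_normal(w.shape)
    p = forward(weights, x)
    q = forward(noisy, x)
    eps = 1e-12
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def allocate_bits(scores, avg_bits, low=4, high=8):
    # Give `high` bits to as many top-scoring layers as the budget allows,
    # and `low` bits to the rest, keeping the mean bit-width <= avg_bits.
    n = len(scores)
    k = int((avg_bits - low) * n / (high - low) + 1e-9)
    k = max(0, min(n, k))
    bits = [low] * n
    for i in np.argsort(scores)[::-1][:k]:
        bits[i] = high
    return bits

# Usage on random data standing in for calibration-prompt activations.
rng = np.random.default_rng(1)
weights = [rng.standard_normal((16, 16)) * 0.3 for _ in range(4)]
calib = rng.standard_normal((32, 16))
scores = [kl_sensitivity(weights, calib, layer) for layer in range(len(weights))]
bits = allocate_bits(scores, avg_bits=6)
```

A two-level allocation keeps the budget arithmetic easy to check; a finer-grained scheme (for example, several candidate bit-widths solved as a knapsack problem) would follow the same pattern of ranking layers by a task-conditioned score.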
Primary Area: Deep Learning->Other Representation Learning
Keywords: Post-training quantization, LLMs, LLM compression, mechanistic interpretability.
Submission Number: 2823