You Had One Job: Per-Task Quantization Using LLMs’ Hidden Representations

Amit LeVi; Raz Lapid; Rom Himelstein; Chaim Baskin; Ravid Shwartz-Ziv; Avi Mendelson

10 Jan 2026 (modified: 24 Jun 2026). Submitted to ICML 2026. License: CC BY-NC 4.0

TL;DR: We study per-task post-training quantization for LLMs and show that allocating mixed precision using task-conditioned hidden-representation signals preserves task accuracy under substantial compression.

Abstract: Many applications of large language models (LLMs) require only a narrow capability, yet common post-training quantization (PTQ) pipelines assign precision largely without regard to the target task. As a result, they may spend bits on layers that are less relevant to the task. We propose per-task mixed-precision PTQ guided by hidden representations. Given a small set of unlabeled calibration prompts from the target task, we estimate layer importance and allocate higher precision to task-relevant layers and lower precision to the rest, subject to a bit-allocation budget. We introduce three task-aware allocation signals: TAQ, which scores layers using an information-stability criterion derived from activation geometry; TAQO, which ranks layers by direct sensitivity to single-layer quantization; and TAQ-KL, which measures output sensitivity via KL divergence under a noise proxy for quantization error. Together, these methods provide a simple post-training framework that connects mechanistic signals to quantization decisions, enabling task-aligned compression without additional training. A reference implementation is available at https://anonymous.4open.science/r/TAQ-9217.
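
The budgeted allocation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name allocate_bits, the two precision levels (4 and 8 bits), the average-bits budget, and the greedy rule are all assumptions made for illustration. The per-layer importance scores are taken as given; in the paper they would come from a signal such as TAQ, TAQO, or TAQ-KL.

```python
def allocate_bits(scores, low_bits=4, high_bits=8, avg_budget=5.0):
    """Hypothetical greedy allocation: start every layer at low_bits, then
    upgrade the highest-scoring layers to high_bits while the mean bit
    width stays within avg_budget."""
    n = len(scores)
    bits = [low_bits] * n
    total_budget = avg_budget * n   # total bits available across all layers
    used = low_bits * n             # bits consumed by the all-low baseline
    cost = high_bits - low_bits     # extra bits needed to upgrade one layer

    # Visit layers from most to least important (assumed score convention:
    # a larger score means the layer matters more for the task).
    for idx in sorted(range(n), key=lambda i: scores[i], reverse=True):
        if used + cost > total_budget:
            break
        bits[idx] = high_bits
        used += cost
    return bits


# Illustrative scores for four layers (made-up values, not from the paper).
scores = [0.1, 0.9, 0.4, 0.7]
print(allocate_bits(scores))  # [4, 8, 4, 4]
```

With an average budget of 5 bits over four layers, only the single most important layer can be upgraded. Raising the budget to 6 bits upgrades the top two layers.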

Primary Area: Deep Learning->Other Representation Learning

Keywords: Post-training quantization, LLMs, LLM compression, mechanistic interpretability.

Submission Number: 2823
