Prompt-Adaptive Quantization: Adaptive Per-Prompt Routing for Efficient LLM Inference

Published: 06 Nov 2025, Last Modified: 06 Nov 2025 · AIR-FM Poster · CC BY 4.0
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Abstract: Large Language Models (LLMs) produce strong results but are costly to serve. Static post-training quantization reduces memory and compute, yet uses a single bit width for all prompts, wasting resources on easy inputs and degrading accuracy on harder ones. We introduce Prompt-Adaptive Quantization (PAQ), a per-prompt precision framework that requires no retraining of the underlying model. PAQ trains a lightweight BERT-based router with perplexity-guided supervision to select the smallest adequate quantization level (2, 4, 8, or 16 bits) per input. At inference, prompts are automatically routed to the appropriate pre-quantized LLM variant. Overall, PAQ serves as a novel framework for adaptive per-prompt quantization, reducing latency while maintaining strong accuracy across tasks.
Submission Track: Workshop Paper Track
Submission Number: 39
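A rough illustrative sketch of the routing step described in the abstract, not the authors' implementation: it assumes a BERT sequence classifier with four output classes mapped to the 2/4/8/16-bit precision levels. The checkpoint name (`bert-base-uncased` with an untrained head), the variant names in `QUANTIZED_VARIANTS`, and the `serve` helper are placeholders; in the paper this classifier would be fine-tuned with perplexity-guided supervision before use.

```python
# Hedged sketch of prompt-adaptive quantization routing (assumptions noted inline).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

BIT_WIDTHS = [2, 4, 8, 16]  # router classes, ordered cheapest -> most precise

# Hypothetical pre-quantized variants of the same base LLM, keyed by bit width.
QUANTIZED_VARIANTS = {
    2: "base-llm-2bit",
    4: "base-llm-4bit",
    8: "base-llm-8bit",
    16: "base-llm-16bit",
}

# Assumption: a BERT encoder with a 4-way classification head stands in for the
# router; here the head is randomly initialized, whereas PAQ would fine-tune it
# with perplexity-guided labels.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
router = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(BIT_WIDTHS)
)
router.eval()


def route(prompt: str) -> int:
    """Return the bit width the router predicts is sufficient for this prompt."""
    inputs = tokenizer(prompt, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = router(**inputs).logits
    return BIT_WIDTHS[int(logits.argmax(dim=-1))]


def serve(prompt: str) -> str:
    """Dispatch the prompt to the selected pre-quantized variant (generation elided)."""
    bits = route(prompt)
    return f"routing prompt to {QUANTIZED_VARIANTS[bits]} ({bits}-bit)"


if __name__ == "__main__":
    print(serve("What is 2 + 2?"))
    print(serve("Prove that the sum of two odd integers is even."))
```

With a trained router, easy prompts would tend to land on the 2- or 4-bit variants and harder ones on 8- or 16-bit, which is how the latency savings in the abstract would arise.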