Prompt-Adaptive Quantization: Adaptive Per-Prompt Routing for Efficient LLM Inference

Published: 06 Nov 2025, Last Modified: 06 Nov 2025 · AIR-FM Poster · CC BY 4.0
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Abstract: Large Language Models (LLMs) produce strong results but are costly to serve. Static post-training quantization reduces memory and compute, yet uses a single bit width for all prompts, wasting resources on easy inputs and degrading accuracy on harder ones. We introduce Prompt-Adaptive Quantization (PAQ), a per-prompt precision framework that requires no retraining of the underlying model. PAQ trains a lightweight BERT-based router with perplexity-guided supervision to select the smallest adequate quantization level (2, 4, 8, or 16 bits) per input. At inference, prompts are automatically routed to the appropriate pre-quantized LLM variant. Overall, PAQ serves as a novel framework for adaptive per-prompt quantization, reducing latency while maintaining strong accuracy across tasks.
Submission Track: Workshop Paper Track
Submission Number: 39
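A rough illustrative sketch of the routing step described in the abstract, not the authors' implementation: it assumes a BERT sequence classifier with four output classes mapped to the 2/4/8/16-bit precision levels. The checkpoint name (`bert-base-uncased` with an untrained head), the variant names in `QUANTIZED_VARIANTS`, and the `serve` helper are placeholders; in the paper this classifier would be fine-tuned with perplexity-guided supervision before use.

```python
# Hedged sketch of prompt-adaptive quantization routing (assumptions noted inline).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

BIT_WIDTHS = [2, 4, 8, 16]  # router classes, ordered cheapest -> most precise

# Hypothetical pre-quantized variants of the same base LLM, keyed by bit width.
QUANTIZED_VARIANTS = {
    2: "base-llm-2bit",
    4: "base-llm-4bit",
    8: "base-llm-8bit",
    16: "base-llm-16bit",
}

# Assumption: a BERT encoder with a 4-way classification head stands in for the
# router; here the head is randomly initialized, whereas PAQ would fine-tune it
# with perplexity-guided labels.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
router = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(BIT_WIDTHS)
)
router.eval()


def route(prompt: str) -> int:
    """Return the bit width the router predicts is sufficient for this prompt."""
    inputs = tokenizer(prompt, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = router(**inputs).logits
    return BIT_WIDTHS[int(logits.argmax(dim=-1))]


def serve(prompt: str) -> str:
    """Dispatch the prompt to the selected pre-quantized variant (generation elided)."""
    bits = route(prompt)
    return f"routing prompt to {QUANTIZED_VARIANTS[bits]} ({bits}-bit)"


if __name__ == "__main__":
    print(serve("What is 2 + 2?"))
    print(serve("Prove that the sum of two odd integers is even."))
```

With a trained router, easy prompts would tend to land on the 2- or 4-bit variants and harder ones on 8- or 16-bit, which is how the latency savings in the abstract would arise.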