Precision Is Not Performance: A Utility-Aware Evaluation of Quantized LLM Inference

TMLR Paper6961 Authors

10 Jan 2026 (modified: 10 Apr 2026) | Rejected by TMLR | CC BY 4.0
Abstract: Large Language Models (LLMs) have become an increasingly important part of most modern AI systems; however, as LLMs grow in size, their response latency increases. Efficient inference is also difficult when memory and compute are limited, because memory consumption and computing cost become the dominant constraints. Quantization addresses these concerns by reducing numerical precision at inference time, which lowers memory usage and improves cost efficiency. Unfortunately, quantization research has typically focused on theoretical performance predictions and isolated performance benchmarks, which give a limited view of how reduced numerical precision affects the end-to-end behavior of model responses in real-world deployments. As a result, a significant gap exists in the practical ability to make decisions about deploying quantized LLMs. To help fill this gap, we propose a novel Utility-aware Quantization Framework (UAQF). UAQF is evaluated on three instruction-tuned LLM variants spanning the 7B and 13B parameter scales, including LLaMA-2-7B and LLaMA-2-13B. The framework is tested across FP16, 8-bit, and 4-bit precision settings using a diverse set of prompts, and the resulting end-to-end latency and throughput are compared against established quantization approaches such as GPTQ, AWQ, ZeroQuant, Atom, and SmoothQuant. The experimental results indicate that lower-bit quantization consistently improves throughput with minimal impact on output quality across models and prompts.
Moreover, the analysis reveals that aggressive quantization often provides greater overall utility than intermediate-precision settings, highlighting deployment-level behaviors that are not evident in single-model or isolated-metric evaluations. These results demonstrate that UAQF enables deeper empirical insight into quantization efficiency and system behavior than existing approaches, reinforcing the need for deployment-oriented quantization assessment.
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission:
- Clarify contribution positioning: Explicitly clarified that UAQF is an evaluation and decision framework, not a quantization algorithm. Added a dedicated positioning subsection to distinguish UAQF from existing quantization methods.
- Remove misleading comparisons: Reframed comparative tables and descriptions to avoid presenting UAQF Adaptive as competing algorithmically with GPTQ/AWQ/SmoothQuant/Atom; clarified that it performs deployment-level configuration selection.
- Provide full implementation details: Added complete hardware and software specifications (GPU, CUDA, PyTorch, Transformers versions), decoding configuration, synchronization protocol, and runtime constraints.
- Explain kernel-dependent behavior: Included a dedicated backend discussion explaining memory-bound decoding, packed INT4 kernels, dynamic INT8 dequantization paths, and reasons for observed 8-bit underperformance under the tested configuration.
- Strengthen quality evaluation: Extended evaluation beyond WikiText-2 perplexity by reporting GSM8K exact-match accuracy, CommonsenseQA accuracy, and instruction-following compliance metrics.
- Improve deployment realism clarity: Clearly stated experimental assumptions (single A100, batch=1, fixed sequence length) and added an explicit scope section acknowledging limitations (TTFT, prefill/decode separation, batching, long-context regimes as future extensions).
- Tighten literature and presentation: Condensed literature review, merged tables into a compact format, removed multi-line descriptive cells, reduced redundancy, and streamlined background material.
- Scope calibration: Adjusted claims to avoid overgeneralization and limited conclusions to the evaluated runtime configuration while stating that the framework itself is backend-agnostic.
- Clearly articulated methodological novelty: Emphasized stability-aware aggregation, cross-metric normalization, and deployment-weighted utility ranking as the core contributions in a structured formulation section.
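The three ingredients named in the last item (stability-aware aggregation, cross-metric normalization, and deployment-weighted utility ranking) can be illustrated with a minimal sketch. This is not the paper's formulation, which is not reproduced on this page. The function names, the min-max normalization, the coefficient-of-variation penalty, and the `stability_penalty` parameter are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev

# Assumed metric directions: True means larger values are better.
HIGHER_IS_BETTER = {"throughput": True, "latency": False, "quality": True}


def summarize(samples):
    """Return (mean, coefficient of variation) over repeated measurements."""
    m = mean(samples)
    cv = pstdev(samples) / abs(m) if m else 0.0
    return m, cv


def min_max(values, higher_is_better):
    """Scale values to [0, 1] across configurations; 1 is always best."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All configurations tie on this metric.
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def rank_configs(runs, weights, stability_penalty=0.5):
    """Rank configurations by weighted normalized utility.

    runs:    {config_name: {metric_name: [repeated measurements]}}
    weights: {metric_name: deployment weight}
    The penalty subtracts the average run-to-run variation
    (hypothetical stability term, not taken from the paper).
    """
    names = list(runs)
    metrics = list(weights)
    stats = {n: {m: summarize(runs[n][m]) for m in metrics} for n in names}

    utility = {n: 0.0 for n in names}
    for m in metrics:
        means = [stats[n][m][0] for n in names]
        scores = min_max(means, HIGHER_IS_BETTER[m])
        for n, s in zip(names, scores):
            utility[n] += weights[m] * s

    for n in names:
        avg_cv = mean(stats[n][m][1] for m in metrics)
        utility[n] -= stability_penalty * avg_cv

    # Highest utility first.
    return sorted(utility.items(), key=lambda kv: kv[1], reverse=True)
```

Under these assumptions, changing `weights` expresses a different deployment priority (for example, weighting latency more heavily for interactive use), and the ranking changes accordingly without re-running any measurement.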
Assigned Action Editor: ~Sachin_Kumar1
Submission Number: 6961