FineAMP: Optimization-Based Automatic Mixed Precision Quantization for Efficient Diffusion Model Inference

Published: 22 Sept 2025, Last Modified: 01 Dec 2025, NeurIPS 2025 Workshop, CC BY 4.0
Keywords: Quantization, mixed precision, efficiency, optimization, large models, linear programming, integer linear programming
Abstract: Quantization of weights and activations is crucial for efficient inference in large machine learning models. Operations within a model vary in their sensitivity to quantization: some introduce significantly more error than others. Mixed-precision quantization methods exploit this observation to improve model efficiency without significantly degrading performance. We propose an optimization-based method, called FineAMP, that automatically determines bit-width assignments for individual operations. We show that fine-grained, per-operation bit-width assignments achieve lower on-device latency than coarser, uniform bit-width strategies.
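The abstract does not spell out FineAMP's exact formulation, but the keywords point to integer linear programming. Below is a minimal sketch of one common ILP formulation for per-operation bit-width assignment, assuming each operation has a precomputed quantization-error score and latency cost per candidate bit-width; the numbers, variable names, and latency budget are hypothetical, and the solver shown is SciPy's generic `milp`, not the paper's implementation.

```python
# Sketch: assign one bit-width per operation, minimizing total quantization
# error subject to a latency budget. All sensitivity/latency values are
# made-up placeholders, not measurements from the paper.
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

bit_widths = [4, 8, 16]                      # candidate precisions
n_ops, n_bits = 3, len(bit_widths)

# Hypothetical per-op quantization error at each bit-width (lower is better)
error = np.array([[0.90, 0.20, 0.05],
                  [0.30, 0.10, 0.02],
                  [0.70, 0.15, 0.04]])
# Hypothetical per-op latency cost at each bit-width
latency = np.array([[1.0, 2.0, 4.0],
                    [1.5, 3.0, 6.0],
                    [0.5, 1.0, 2.0]])
latency_budget = 8.0

# Binary decision variables x[i, b] = 1 iff op i runs at bit_widths[b],
# flattened row-major so index i * n_bits + b corresponds to (i, b).
c = error.ravel()                            # objective: total error

# Each op must pick exactly one bit-width: sum_b x[i, b] == 1
pick_one = np.kron(np.eye(n_ops), np.ones(n_bits))
# Total latency must stay within the budget
budget_row = latency.ravel()[None, :]

res = milp(
    c=c,
    constraints=[
        LinearConstraint(pick_one, lb=1, ub=1),
        LinearConstraint(budget_row, ub=latency_budget),
    ],
    integrality=np.ones_like(c),             # integer variables...
    bounds=Bounds(0, 1),                     # ...restricted to {0, 1}
)

assignment = res.x.reshape(n_ops, n_bits).argmax(axis=1)
print([bit_widths[b] for b in assignment])   # chosen bit-width per op
```

With these placeholder numbers, the solver trades precision across operations (e.g., keeping the most sensitive op at a higher bit-width while pushing cheaper ops lower) rather than forcing a single uniform bit-width, which is the behavior the abstract attributes to fine-grained assignment.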
Submission Number: 42