FineAMP: Optimization-Based Automatic Mixed Precision Quantization for Efficient Diffusion Model Inference
Keywords: Quantization, mixed precision, efficiency, optimization, large models, integer linear programming
Abstract: Quantization of weights and activations is crucial for efficient inference in large machine learning models. Operations within a model vary in their sensitivity to quantization: some introduce significantly more error than others. Mixed precision quantization methods exploit this observation to improve efficiency without significantly degrading model quality. We propose an optimization-based method, FineAMP, that automatically determines bit-width assignments for individual operations. We show that fine-grained, per-operation bit-width assignments achieve lower on-device latency than coarser, uniform bit-width strategies.
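The abstract leaves FineAMP's exact formulation to the paper, but the keywords point to integer linear programming. Below is a minimal sketch of how per-operation bit-width assignment can be posed as an ILP with SciPy's MILP solver: minimize total estimated latency subject to each operation receiving exactly one bit-width and the accumulated quantization error staying within a budget. Every operation name, latency/error number, and the error_budget value is an illustrative placeholder, not FineAMP's actual data or formulation.

# Illustrative ILP sketch for per-operation bit-width assignment
# (placeholder data; not FineAMP's actual method or measurements).
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

ops = ["conv_in", "attn_qk", "attn_v", "proj_out"]   # hypothetical operations
bits = [4, 8, 16]                                    # candidate bit-widths

# latency[i, j]: estimated latency of op i at bits[j] (made-up numbers)
latency = np.array([[1.0, 1.8, 3.2],
                    [0.9, 1.5, 2.8],
                    [0.7, 1.2, 2.1],
                    [1.1, 1.9, 3.5]])
# error[i, j]: estimated quantization error of op i at bits[j] (made-up numbers)
error = np.array([[0.30, 0.08, 0.01],
                  [0.50, 0.12, 0.02],
                  [0.20, 0.05, 0.01],
                  [0.40, 0.10, 0.01]])
error_budget = 0.35                                  # total allowed error (assumption)

n_ops, n_bits = latency.shape
n_vars = n_ops * n_bits                              # one binary x[i, j] per (op, bit-width)

# Objective: minimize total estimated latency.
c = latency.ravel()

# Each op selects exactly one bit-width: sum_j x[i, j] == 1.
pick_one = np.zeros((n_ops, n_vars))
for i in range(n_ops):
    pick_one[i, i * n_bits:(i + 1) * n_bits] = 1.0

constraints = [
    LinearConstraint(pick_one, lb=1.0, ub=1.0),
    # Accumulated quantization error must stay within the budget.
    LinearConstraint(error.ravel()[np.newaxis, :], ub=error_budget),
]

res = milp(c, constraints=constraints,
           integrality=np.ones(n_vars),              # all variables integer...
           bounds=Bounds(0, 1))                      # ...and binary

assignment = res.x.reshape(n_ops, n_bits).argmax(axis=1)
for op, j in zip(ops, assignment):
    print(f"{op}: {bits[j]}-bit")

In practice, the latency and error tables would come from on-device profiling and per-operation sensitivity analysis rather than the constants above.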
Submission Number: 42