Profile-Guided Quantization: a compiler solution to automate quantization for efficient LLM training

Published: 21 May 2025, Last Modified: 17 Jun 2025, MLArchSys 2025 Oral, CC BY 4.0
Presentation: Virtual
Keywords: Quantization, LLM Training, compiler
Presenter Full Name: Gil Tabak
TL;DR: A compiler-based approach that automatically applies quantization at the per-operation level using a 'profile-driven' methodology, combining roofline-model estimates of performance gain with the signal-to-quantization-noise ratio (SQNR).
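A minimal illustrative sketch of such a per-op decision rule (not the paper's implementation): it combines the SQNR of int8 quantization, computed from a logged tensor sample, with a roofline estimate of the speedup. The hardware constants, function names, and thresholds below are placeholder assumptions.

```python
import numpy as np

# Placeholder hardware constants (assumptions, not real accelerator specs).
PEAK_FLOPS_BF16 = 1.0e15   # bf16 peak compute, FLOP/s
PEAK_FLOPS_INT8 = 2.0e15   # int8 peak compute, FLOP/s
MEM_BW = 1.6e12            # memory bandwidth, bytes/s

def sqnr_db(x: np.ndarray) -> float:
    """SQNR (dB) of symmetric per-tensor int8 quantization of a logged sample."""
    scale = max(float(np.max(np.abs(x))) / 127.0, 1e-12)
    q = np.clip(np.round(x / scale), -127, 127) * scale  # quantize-dequantize
    return 10.0 * np.log10(np.mean(x**2) / np.mean((x - q) ** 2))

def roofline_speedup(flops: float, bytes_moved: float) -> float:
    """Roofline estimate: an op is bound by max(compute, memory) time; int8
    doubles peak compute and halves bytes moved relative to bf16."""
    t_bf16 = max(flops / PEAK_FLOPS_BF16, bytes_moved / MEM_BW)
    t_int8 = max(flops / PEAK_FLOPS_INT8, (bytes_moved / 2) / MEM_BW)
    return t_bf16 / t_int8

def should_quantize(sample: np.ndarray, flops: float, bytes_moved: float,
                    min_sqnr_db: float = 30.0, min_speedup: float = 1.05) -> bool:
    """Quantize an op only when it is both numerically safe and profitable."""
    return (sqnr_db(sample) >= min_sqnr_db
            and roofline_speedup(flops, bytes_moved) >= min_speedup)
```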
Presenter Email: tabak.gil@gmail.com
Abstract: The growing size and complexity of Large Language Models (LLMs) for generative artificial intelligence (AI) have significantly intensified the compute and memory demands of training and serving. While scaling up model size has fueled rapid progress toward advanced AI capabilities, it is increasingly challenging to host these models efficiently on hardware whose resources are inherently constrained. Quantization is a promising compression technique widely used to address these hardware efficiency bottlenecks, but how and where to apply it remains challenging. Significant barriers complicate its practical use, ranging from developers' varied skill in applying quantization to the diverse numeric sensitivity of target AI models and uneven quantization support across ML frameworks and back-end hardware. To address these issues, we propose profile-guided quantization (PGQ), a compiler-based solution that leverages logged tensor statistics and metadata to automatically determine optimal quantization settings for individual operations given a target workload and hardware. PGQ reduces the model-, framework-, and hardware-specific domain knowledge required and streamlines the design and implementation of quantization recipes through graph-level instruction rewriting within the compiler, making quantized LLM training accessible to a much broader audience. As an example of our proposed approach, we demonstrate its application to quantized training of GemmaV2 models, showing up to an 18.2% speedup of a training step.
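As a rough illustration of the graph-level instruction rewriting the abstract describes, the sketch below walks a toy op graph and splices quantize/dequantize nodes around matmuls whose logged profile clears SQNR and roofline-speedup thresholds. The Node class, profile layout, and pass structure are all hypothetical; PGQ's actual rewrite operates on the compiler's internal IR.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                                   # e.g. "matmul", "quantize"
    inputs: list = field(default_factory=list)
    attrs: dict = field(default_factory=dict)

def rewrite_quantized(graph: list, profile: dict,
                      min_sqnr_db: float = 30.0,
                      min_speedup: float = 1.05) -> list:
    """Splice quantize -> int8 matmul -> dequantize around profitable matmuls.

    `profile` maps an op name to its logged statistics, assumed here to be
    pre-reduced to an SQNR estimate and a roofline speedup estimate.
    (Redirecting downstream consumers to the dequantize output is elided.)
    """
    out = []
    for node in graph:
        stats = profile.get(node.attrs.get("name"), {})
        if (node.op == "matmul"
                and stats.get("sqnr_db", 0.0) >= min_sqnr_db
                and stats.get("speedup", 1.0) >= min_speedup):
            q_ins = [Node("quantize", [x], {"dtype": "int8"}) for x in node.inputs]
            qmm = Node("matmul", q_ins, {**node.attrs, "dtype": "int8"})
            out += q_ins + [qmm, Node("dequantize", [qmm], {"dtype": "bf16"})]
        else:
            out.append(node)
    return out

# Toy usage: one matmul whose profile passes both thresholds gets rewritten.
a, b = Node("param", attrs={"name": "x"}), Node("param", attrs={"name": "w"})
g = [a, b, Node("matmul", [a, b], attrs={"name": "ffw_0"})]
prof = {"ffw_0": {"sqnr_db": 42.0, "speedup": 1.3}}
new_g = rewrite_quantized(g, prof)  # matmul now runs in int8 with a dequantize
```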
Presenter Bio: Gil has been a SWE at Google for 6 years, with work focusing on TPU optimizations.
Paper Checklist Guidelines: I certify that all co-authors have validated the presented results and conclusions, and have read and commit to adhering to the Paper Checklist Guidelines, Call for Papers and Publication Ethics.
YouTube Link: https://www.youtube.com/watch?v=O6vJ-abwV3c
YouTube Link Poster: NA
Dataset Release: I certify that all co-authors commit to release the dataset and necessary scripts to reproduce the presented results.
Google Slides: https://drive.google.com/file/d/1owvr1X51EQWef6aQNsRbd8BOoVtj2fXl (please request access)
Poster: No
Workshop Registration: Cannot attend; pre-recorded video.
YouTube Link Short: (placeholder)
Submission Number: 1