Constrained bit allocation for mixed-precision deep neural networks

Published: 21 May 2025, Last Modified: 17 Jun 2025
Venue: MLArchSys 2025 (Oral)
License: CC BY 4.0
Presentation: In-Person
Keywords: Neural networks, Compression, Numerical formats, Block floating point, Microexponents, MXFP, Interior point methods, Constrained optimization, Bit allocation, Mixed precision, Transformers
Presenter Full Name: Souleyman Boudouh
Presenter Email: souleyman.boudouh@epfl.ch
Abstract: The increasing complexity of deep neural networks (DNNs) necessitates effective model compression to reduce their computational and memory footprints for deployment on resource-constrained hardware. Layer-wise bit allocation is a prominent compression method shown to significantly reduce DNN footprints while preserving model accuracy. However, how best to incorporate hardware constraints within the allocation search remains a key question, as many approaches tacitly assume constraints can be adequately handled via soft penalties or heuristics, often failing to guarantee feasibility or optimality. In this paper, we explore a reformulation of the bit allocation problem as an explicit constrained optimization problem, solved using interior-point methods within a NAS-based framework, notably requiring only minimal calibration data (as few as 128 samples). We corroborate this approach with experiments spanning transformer architectures (Llama, Gemma, Qwen; 500M–3B parameters), evaluating performance with MXFP formats. We show that this constrained formulation achieves significantly finer resolution in compression ratios than the discrete steps offered by uniform MXFP application (e.g., 4.25, 6.25, 8.25 bits), and that explicitly satisfying hardware budgets while optimizing for accuracy consistently outperforms uniform allocation, improving performance by up to several standard deviations in some cases, especially under strict resource limits. Our findings extend to the efficient deployment of large models on resource-constrained compute platforms, offering insights into best practices for applying bit allocation to maximize hardware resource efficiency without unduly compromising accuracy.
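
The sketch below is a minimal, assumption-laden illustration of two points in the abstract, not the paper's implementation: (1) where the discrete 4.25/6.25/8.25 effective bit-widths of uniform MXFP application come from (element bits plus a shared 8-bit block scale amortized over 32 elements), and (2) a toy continuous relaxation of budget-constrained bit allocation solved with SciPy's `trust-constr` method, which uses a trust-region interior-point algorithm for inequality constraints. The per-layer sensitivities, parameter counts, and quantization-noise proxy objective are all hypothetical stand-ins for the paper's NAS-based, calibration-driven objective.

```python
# Illustrative sketch only -- NOT the paper's implementation.
# All layer statistics and the noise-proxy objective are hypothetical.
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, minimize

def mx_effective_bits(element_bits: float, block_size: int = 32,
                      scale_bits: int = 8) -> float:
    """Effective bits/element for an MX format: element bits plus the
    shared block scale amortized over the block."""
    return element_bits + scale_bits / block_size

for eb in (4, 6, 8):
    print(f"MXFP{eb}: {mx_effective_bits(eb):.2f} effective bits/element")
# -> 4.25, 6.25, 8.25: the discrete steps of uniform MXFP allocation.

# Hypothetical per-layer accuracy sensitivities and parameter counts.
sens = np.array([3.0, 1.0, 0.5, 2.0])
n_params = np.array([1e6, 4e6, 4e6, 1e6])
budget = 5.0  # allowed parameter-weighted average effective bits

def objective(b):
    # Proxy: quantization noise per layer decays as sensitivity * 2^(-2b).
    return float(np.sum(sens * 2.0 ** (-2.0 * b)))

# Hard budget constraint: weighted mean element bits <= budget - 8/32,
# since every layer also pays the amortized 0.25-bit block-scale cost.
w = n_params / n_params.sum()
avg_bits = LinearConstraint(w[None, :], -np.inf, budget - 8 / 32)

res = minimize(
    objective,
    x0=np.full(4, budget - 8 / 32),  # uniform allocation exactly on budget
    method="trust-constr",           # trust-region interior-point solver
    bounds=Bounds(2.0, 8.0),         # per-layer element bit-width range
    constraints=[avg_bits],
)
print("per-layer element bits:", np.round(res.x, 2))
```

In this toy relaxation the solver shifts bits toward the more sensitive layers while keeping the weighted average exactly on budget. In practice the continuous solution would be mapped back to the available MX element widths (4, 6, 8 bits), and it is mixing those widths across layers that yields average rates between the uniform 4.25/6.25/8.25 steps.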
Presenter Bio: Currently an EPFL student and visiting researcher at MBZUAI. My research focuses on neural network compression and designing efficient, compact models through mathematically principled approaches with formal guarantees.
Paper Checklist Guidelines: I certify that all co-authors have validated the presented results and conclusions, and have read and commit to adhering to the Paper Checklist Guidelines, Call for Papers and Publication Ethics.
YouTube Link: N/A
Poster: Yes
Workshop Registration: Yes, the presenter has registered for the workshop.
Submission Number: 14