BAQE: Backend-Adaptive DNN Deployment via Synchronous Bayesian Quantization and Hardware Configuration Exploration
Abstract: Efficiently deploying deep learning (DL) algorithms on diverse hardware backends has become a time-consuming challenge. Achieving peak inference efficiency on hardware requires both algorithm-level model compression techniques, such as model quantization, and hardware-level optimizations, such as operation reconfiguration and scheduling. In this article, we propose BAQE, a unified deployment framework that bridges the gap between algorithm-level and backend-level optimization. By constructing a global search space, BAQE synchronously optimizes both the model quantization settings and the backend configuration parameters. To accelerate this laborious and time-consuming process, we propose a search strategy based on multiobjective Bayesian optimization (BO) that uses a Gaussian process with deep kernel learning as the surrogate model. More importantly, BAQE adapts efficiently and effectively to backends with different hardware resources. Every step of the optimization is aware of the actual hardware resources: all accuracy and latency metrics, together with the accumulated historical feedback, are evaluated directly on the device in each iteration. Empirical results demonstrate that our approach achieves both superior inference time and accuracy with a faster optimization process.
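To make the joint search concrete, below is a minimal sketch (not the authors' code) of BO over a global space that crosses a quantization knob with backend configuration knobs. It simplifies in two labeled ways: a weighted-sum scalarization stands in for BAQE's true multiobjective BO, and scikit-learn's default RBF-kernel Gaussian process stands in for the paper's deep-kernel surrogate. The knob names (`BITS`, `TILES`, `THREADS`) and `evaluate_on_device` are hypothetical placeholders; in a real deployment the latter would run the quantized model on the target backend and return measured accuracy and latency.

```python
# Hedged sketch of joint quantization/backend-config search with a GP
# surrogate and expected-improvement acquisition; all names are illustrative.
import itertools
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

# Global search space: algorithm-level bit-width crossed with
# hardware-level knobs (tile size, thread count).
BITS = [4, 8]
TILES = [8, 16, 32]
THREADS = [1, 2, 4]
SPACE = np.array(list(itertools.product(BITS, TILES, THREADS)), dtype=float)

def evaluate_on_device(cfg):
    """Hypothetical stand-in for a real on-device measurement that
    returns (accuracy, latency_ms) for one configuration."""
    bits, tile, threads = cfg
    acc = 0.90 - 0.02 * (8 - bits) / 4           # toy accuracy model
    lat = 50.0 / threads + 100.0 / tile + bits   # toy latency model
    return acc, lat

def scalarize(acc, lat, w=0.5):
    # Weighted-sum scalarization; BAQE itself is multiobjective.
    return w * acc - (1.0 - w) * (lat / 100.0)

rng = np.random.default_rng(0)
init = rng.choice(len(SPACE), size=5, replace=False)  # initial random design
evaluated = set(init.tolist())
X = SPACE[init]
y = np.array([scalarize(*evaluate_on_device(c)) for c in X])

# Default RBF kernel here; the paper's surrogate uses deep kernel learning.
gp = GaussianProcessRegressor(alpha=1e-6, normalize_y=True)
for _ in range(10):                     # each iteration measures on device
    gp.fit(X, y)
    mu, sigma = gp.predict(SPACE, return_std=True)
    best = y.max()
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement
    ei[list(evaluated)] = -np.inf       # never re-measure a known config
    j = int(np.argmax(ei))
    evaluated.add(j)
    X = np.vstack([X, SPACE[j]])
    y = np.append(y, scalarize(*evaluate_on_device(SPACE[j])))

print("best (bits, tile, threads):", X[int(np.argmax(y))])
```

Because every candidate is scored by an on-device measurement inside the loop, the surrogate's training data reflects the genuine hardware, which is the hardware-in-the-loop property the abstract emphasizes.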
External IDs: dblp:journals/tcad/ZhaoYBWY25