Abstract: Quantization is of significance for compressing over-parameterized deep neural models and deploying them on resource-limited devices. Fixed-precision quantization suffers from performance drop due to its limited numerical representation ability. Conversely, mixed-precision quantization (MPQ) is advocated to compress the model effectively by allocating heterogeneous bit-widths to layers. MPQ is typically organized into a searching-retraining two-stage process. Previous works focus only on efficiently determining the optimal bit-width configuration in the first stage, while ignoring the considerable time costs in the second stage, which significantly hinders deployment efficiency. In this paper, we devise a one-shot training-searching paradigm for mixed-precision model compression. Specifically, in the first stage, all potential bit-width configurations are coupled and thus optimized simultaneously within a set of shared weights. However, our observations reveal a previously unseen and severe bit-width interference phenomenon among highly coupled weights during optimization, leading to considerable performance degradation under a high compression ratio. To tackle this problem, we first design a bit-width scheduler that dynamically freezes the most turbulent bit-width of each layer during training, ensuring that the remaining bit-widths converge properly. Then, taking inspiration from information theory, we present an information distortion mitigation technique that aligns the behaviour of the poorly performing bit-widths with that of the well-performing ones. In the second stage, an inference-only greedy search scheme is devised to evaluate the goodness of configurations without introducing any additional training costs. Extensive experiments on three representative models and three datasets demonstrate the effectiveness of the proposed method. Code is available at https://github.com/1hunters/retraining-free-quantization.
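To make the second-stage inference-only greedy search concrete, the sketch below lowers the bit-width of one layer at a time, always choosing the layer whose reduction hurts validation accuracy the least, until a compression budget is met. This is a minimal illustrative sketch under assumed interfaces, not the authors' implementation: `evaluate`, `cost`, `candidate_bits`, and `budget` are hypothetical placeholders standing in for validation with the shared one-shot weights and for the compression constraint.

```python
# Minimal sketch (not the authors' code) of an inference-only greedy search
# over per-layer bit-width configurations. No retraining is performed:
# each candidate configuration is only evaluated with the shared weights.

from typing import Callable, Dict, List


def greedy_search(
    layers: List[str],
    candidate_bits: List[int],                     # e.g. [8, 6, 4, 3, 2] (hypothetical)
    evaluate: Callable[[Dict[str, int]], float],   # validation accuracy under a config
    cost: Callable[[Dict[str, int]], float],       # model size / BitOps of a config
    budget: float,                                 # target compression budget
) -> Dict[str, int]:
    # Start from the highest precision everywhere, then greedily step down the
    # bit-width of whichever layer loses the least accuracy, until within budget.
    config = {layer: max(candidate_bits) for layer in layers}
    while cost(config) > budget:
        best_acc, best_move = -1.0, None
        for layer in layers:
            lower = [b for b in candidate_bits if b < config[layer]]
            if not lower:
                continue                            # this layer is already at the lowest bit-width
            trial = dict(config)
            trial[layer] = max(lower)               # step down by one candidate bit-width
            acc = evaluate(trial)                   # inference-only evaluation
            if acc > best_acc:
                best_acc, best_move = acc, (layer, trial[layer])
        if best_move is None:                       # nothing left to lower; budget unreachable
            break
        config[best_move[0]] = best_move[1]
    return config
```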