Abstract: Collaborative inference accelerates DNN inference on resource-limited devices (e.g., clients) by offloading model slices to resource-rich devices (e.g., servers). During inference, the outputs of model slices are transmitted between devices, which incurs significant intermediate data transmission overhead and risks leaking the client’s private input data. Quantization is widely used in collaborative inference to improve communication efficiency, but traditional quantization does not prevent privacy leakage. Conversely, perturbation-based privacy protection methods, such as adding Laplace noise to the intermediate data, do not consider communication efficiency. In this paper, we introduce Layered Laplace Random Quantization, which achieves communication efficiency and data privacy protection simultaneously by compressing the intermediate data with Laplace quantization noise. We also propose stability training to recover the accuracy lost to our method. Evaluation results show that our method achieves an average inference latency speedup of 1.2x to 1.3x across different DNN models compared with the baseline methods, while providing comparable data privacy protection and a recoverable accuracy loss.