Abstract: Audio recognition systems have become integral to various applications, including speech-to-text, virtual assistants, and security monitoring, where efficiency and privacy are key concerns. The collaborative recognition system (CRS) partitions a neural network (NN) and deploys it across multiple edge devices for cooperative recognition without sharing raw audio data. This distributed approach significantly alleviates the computational burden on the client and helps preserve data privacy. These advantages have made CRS increasingly prevalent in audio recognition applications, improving security and efficiency and providing timely feedback. However, the information shared during collaboration still risks exposing the original data. To the best of our knowledge, this paper introduces InverCRS, the first inversion attack targeting CRS-empowered audio recognition systems. InverCRS is a generative attack in which the attacker trains a local generative model that takes intermediate results as input and outputs the original audio. Once trained, the generative model can invert new intermediate results without further per-sample optimization. Furthermore, InverCRS uses heuristic algorithms to approximate gradients, making it applicable to both white-box and black-box scenarios. We conduct comprehensive experiments to evaluate the feasibility and efficiency of InverCRS on two real-world audio datasets. The results demonstrate that InverCRS effectively reconstructs the original audio from various split points within the CRS. Additionally, we investigate two potential defense strategies and experimentally evaluate their effectiveness in mitigating this attack.
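The core idea of the attack described above, i.e., training a local model that maps intermediate results back to the original input so that fresh intermediates can be inverted in a single forward pass, can be illustrated with a minimal toy sketch. Everything here is an assumption for illustration: a random linear-plus-ReLU layer stands in for the client-side NN partition, an auxiliary dataset stands in for the attacker's training data, and a least-squares linear decoder stands in for the generative model in InverCRS.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "client-side" partition of the NN: the edge device only ever
# shares the intermediate result z = relu(W1 @ x), never the raw x.
W1 = rng.standard_normal((32, 64))
relu = lambda a: np.maximum(a, 0.0)
client_part = lambda x: relu(W1 @ x)

# Attack training phase: the attacker collects (intermediate, original)
# pairs from an auxiliary dataset and fits a local inversion model
# z -> x_hat (here a least-squares linear decoder, standing in for the
# generative model trained in InverCRS).
X_aux = rng.standard_normal((64, 500))   # auxiliary "audio" samples
Z_aux = relu(W1 @ X_aux)                 # their intermediate results
D, *_ = np.linalg.lstsq(Z_aux.T, X_aux.T, rcond=None)

# Attack inference phase: given a fresh intermediate z, reconstruct x
# with a single forward pass -- no further per-sample optimization.
x_new = rng.standard_normal(64)
z_new = client_part(x_new)
x_hat = D.T @ z_new
print(np.corrcoef(x_new, x_hat)[0, 1])   # reconstruction correlation
```

The sketch shows only the white-box training setup; the paper's black-box variant, which approximates gradients with heuristic algorithms, is not modeled here.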
External IDs: dblp:conf/icdcs/ZhangLHDZYL025