Abstract: Black-box attacks aim to generate adversarial noise that causes a black-box victim deep neural network to fail. The central task in designing a black-box attack method is to estimate and characterize the victim model in the high-dimensional model space based on the feedback from queries submitted to the victim network. The central performance goal is to minimize the number of queries needed for a successful attack. Existing attack methods directly search for and refine the adversarial noise in an extremely high-dimensional space, requiring hundreds or even thousands of queries to the victim network. To address this challenge, we propose a consistency and sensitivity guided ensemble attack (CSEA) method that operates in a low-dimensional space. Specifically, we estimate the black-box victim model using a learned linear composition of an ensemble of surrogate models with diversified network structures. Using random block masks on the input image, these surrogate models jointly construct and submit randomized and sparsified queries to the victim model. Based on the query results and guided by a consistency constraint, the surrogate models can be trained with a very small number of queries such that their learned composition accurately approximates the victim model in the high-dimensional space. The randomized and sparsified queries also provide the information needed to construct an attack sensitivity map for the input image, with which the adversarial attack can be locally refined to further increase its success rate. Our extensive experimental results demonstrate that the proposed approach significantly reduces the number of queries to the victim network while maintaining very high success rates, outperforming existing black-box attack methods by large margins.
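The two building blocks named in the abstract, random block masks and a linear composition of surrogate outputs, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`random_block_mask`, `compose`, `fit_weights`), the block size, and the keep probability are all hypothetical, and the ordinary least-squares fit is only a stand-in for the consistency-guided training, whose details the abstract does not give.

```python
import numpy as np

def random_block_mask(height, width, block, keep_prob, rng):
    """Binary mask built from block x block tiles, each kept with probability keep_prob."""
    grid_h = -(-height // block)  # ceiling division
    grid_w = -(-width // block)
    grid = (rng.random((grid_h, grid_w)) < keep_prob).astype(np.float32)
    # Expand each grid cell into a block x block tile, then crop to the image size.
    mask = np.kron(grid, np.ones((block, block), dtype=np.float32))
    return mask[:height, :width]

def compose(surrogate_outputs, weights):
    """Linear composition of K surrogate outputs of shape (K, C) with weights of shape (K,)."""
    return np.tensordot(weights, surrogate_outputs, axes=1)

def fit_weights(surrogate_outputs, victim_outputs):
    """Assumed stand-in: least-squares composition weights from Q query results.

    surrogate_outputs has shape (Q, K, C); victim_outputs has shape (Q, C).
    """
    q, k, c = surrogate_outputs.shape
    design = surrogate_outputs.transpose(0, 2, 1).reshape(q * c, k)
    target = victim_outputs.reshape(q * c)
    weights, *_ = np.linalg.lstsq(design, target, rcond=None)
    return weights
```

In this sketch a masked query would be formed as `image * mask[..., None]` for an image of shape (height, width, channels), and the fitted weights are reused by `compose` to approximate the victim's output on new inputs.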