Keywords: Adversarial robustness, Query-based black-box attacks, Adversarial defense
Abstract: Query-based black-box attack algorithms can compute imperceptible adversarial perturbations that mislead learned models while relying only on model outputs. The success of these attacks poses a significant problem, especially for Machine Learning as a Service (MLaaS) providers. Our study explores a new approach to obfuscating information from an attacker. To craft an adversarial example, an attack exploits the relationship between successive query responses to optimize a perturbation. Our idea for obfuscating this relationship is to randomly select a model from a diverse set of models to respond to each query. This randomization violates the attacker's assumption that model parameters remain unaltered between the queries used to extract information. What remains unclear is whether model randomization leads to sufficient obfuscation to confuse attacks, and how best to build such a method. This study seeks answers to these questions. Our theoretical analysis proves that this approach consistently increases robustness. Extensive experiments across 7 state-of-the-art attacks and all major perturbation norms ($l_\infty$, $l_2$, $l_0$), including adaptive variants, confirm its effectiveness. Importantly, our findings reveal a new avenue for investigating robust methods against black-box attacks, offering theoretical understanding and a practical implementation pathway.
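
The abstract's core mechanism, serving each query from a randomly sampled member of a diverse model pool, can be illustrated with a minimal sketch. This is not the paper's implementation; the class name `RandomizedModelPool`, its interface, and the toy models in the usage example are all illustrative assumptions.

```python
import random
from typing import Callable, Optional, Sequence

# Minimal sketch of the idea described above, not the paper's method.
# Each "model" is any callable mapping an input to model outputs
# (e.g., logits or a predicted label); names here are hypothetical.

class RandomizedModelPool:
    """Answer each query with a model drawn at random from a diverse pool.

    Because a fresh model is sampled per query, successive responses no
    longer come from one fixed set of parameters, breaking the relationship
    a query-based black-box attack exploits to optimize its perturbation.
    """

    def __init__(self, models: Sequence[Callable], rng: Optional[random.Random] = None):
        if not models:
            raise ValueError("model pool must be non-empty")
        self.models = list(models)
        self.rng = rng if rng is not None else random.Random()

    def __call__(self, x):
        model = self.rng.choice(self.models)  # independent draw for every query
        return model(x)


# Hypothetical usage: three stand-in classifiers take the place of a real model set.
if __name__ == "__main__":
    pool = RandomizedModelPool([lambda x: "cat", lambda x: "dog", lambda x: "cat"])
    print([pool("image") for _ in range(5)])
```

One design point the sketch makes concrete: sampling is independent per query, so even repeated identical inputs may receive responses from different models, which is precisely what obfuscates the attacker's view of the decision surface.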
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 23304