Black-Box Adversarial Attack on Dialogue Generation via Multi-Objective Optimization

23 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: dialogue generation, adversarial attack, multi-objective optimization, black-box attack
Abstract: Transformer-based dialogue generation (DG) models are ubiquitous in modern conversational artificial intelligence (AI) platforms. These models, however, are susceptible to adversarial attacks, i.e., prompts that appear textually indiscernible from normal inputs but are maliciously crafted to make the models generate responses that are incoherent and irrelevant to the conversational context. Evaluating the adversarial robustness of DG models is thus crucial to their real-world deployment. Adversarial methods typically exploit gradient information and output logits (or probabilities) to effectively modify key input tokens, thereby achieving excellent attack performance. Nevertheless, such white-box approaches are impractical in real-world scenarios since the models' internal parameters are typically inaccessible. Black-box methods, which exploit only input prompts and the DG models' output responses to craft adversarial attacks, offer wider applicability but often suffer from poor performance. In a human-machine conversation, good responses are expected to be semantically coherent and textually succinct. We thus formulate the adversarial attack on DG models as a bi-objective optimization problem, where input prompts are modified to 1) minimize response coherence and 2) maximize generation length. We empirically demonstrate that optimizing either objective alone yields subpar performance. We then propose a dialogue generation attack framework (DGAttack) that employs multi-objective optimization to consider both objectives simultaneously when perturbing user prompts into adversarial inputs. Leveraging the exploration capability of multi-objective evolutionary algorithms, which intrinsically preserve diversity, DGAttack crafts effective adversarial prompts in a true black-box manner, i.e., by accessing only the DG models' inputs and outputs. Experiments across four benchmark datasets and three language models (i.e., BART, DialoGPT, T5) show that DGAttack outperforms existing white-box, gray-box, and black-box approaches. In particular, benchmarks with large language models (i.e., Llama 3.1 and Gemma 2) suggest that DGAttack is the state-of-the-art black-box adversarial attack on dialogue generation.
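
The following is a minimal sketch of the bi-objective black-box search the abstract describes, not the paper's actual implementation. Every name here is an assumption for illustration: `query_model` stands in for the black-box DG model's input/output interface, `coherence` uses a toy token-overlap proxy in place of a real semantic coherence measure, and `perturb` is a placeholder word-level edit rather than DGAttack's perturbation operators. The loop keeps a Pareto front over (coherence, negative response length), so both objectives are optimized simultaneously using only model inputs and outputs.

```python
import random

def query_model(prompt: str) -> str:
    """Stand-in for the black-box DG model: prompt in, response out.
    Replace with a real API call; no gradients or logits are needed."""
    return "placeholder response echoing: " + prompt

def coherence(context: str, response: str) -> float:
    """Toy coherence proxy (to be minimized): Jaccard token overlap.
    A real attack would use a semantic similarity measure instead."""
    a, b = set(context.lower().split()), set(response.lower().split())
    return len(a & b) / max(len(a | b), 1)

def perturb(prompt: str) -> str:
    """Placeholder perturbation: duplicate or drop one random word."""
    words = prompt.split()
    i = random.randrange(len(words))
    if random.random() < 0.5 and len(words) > 1:
        del words[i]
    else:
        words.insert(i, words[i])
    return " ".join(words)

def fitness(prompt: str) -> tuple[float, float]:
    """Bi-objective fitness, both components minimized:
    (response coherence, -response length), i.e., long incoherent responses."""
    response = query_model(prompt)
    return coherence(prompt, response), -float(len(response.split()))

def dominates(f, g) -> bool:
    """Pareto dominance: f is no worse than g everywhere, better somewhere."""
    return all(a <= b for a, b in zip(f, g)) and any(a < b for a, b in zip(f, g))

def attack(seed_prompt: str, pop_size: int = 8, generations: int = 20) -> list[str]:
    """Minimal multi-objective evolutionary loop over prompt perturbations."""
    population = [perturb(seed_prompt) for _ in range(pop_size)]
    for _ in range(generations):
        offspring = [perturb(random.choice(population)) for _ in range(pop_size)]
        scored = [(p, fitness(p)) for p in population + offspring]
        # Survivors are the non-dominated front; keeping the whole front
        # preserves diversity across the two objectives.
        front = [p for p, f in scored
                 if not any(dominates(g, f) for _, g in scored)]
        population = front[:pop_size]
        while len(population) < pop_size:  # refill if the front is small
            population.append(random.choice([p for p, _ in scored]))
    return population

if __name__ == "__main__":
    print(attack("how is the weather today in your city"))
```

In a full method, the perturbation operators, coherence metric, and survivor selection (e.g., NSGA-II-style non-dominated sorting with crowding distance) would be more elaborate; the sketch only illustrates how maintaining a Pareto front lets the attack trade off low coherence against long responses without any access to model internals.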
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2921