BIRD: Bi-Level Operator Scheduling for Black-Box Attacks on Large Language Models

ACL ARR 2026 May Submission 17359 Authors

26 May 2026 (modified: 02 Jun 2026), ACL ARR 2026 May Submission, CC BY 4.0
Keywords: Black-box LLM attacks; Textual adversarial examples; Confidence elicitation; Adaptive operator scheduling; Multi-armed bandits
Abstract: Black-box attacks on LLM-based classifiers seek adversarial inputs that induce incorrect predictions without access to model parameters, gradients, or logits. Existing confidence-guided attacks typically rely on a fixed perturbation operator or a manually specified operator set, overlooking that operator utility varies across inputs, target models, and attack trajectories. We formulate black-box LLM attacks as an online operator scheduling problem and propose BIRD (BI-level Bandit-dRiven Operator ScheDuling), a bi-level framework for automatic black-box attacks. BIRD maintains a heterogeneous pool of perturbation operators and adaptively schedules them using elicited confidence feedback. At the upper level, a sliding-window UCB scheduler selects operators according to cost-aware rewards; at the lower level, confidence-guided candidate selection accepts edits that reduce confidence in the original prediction or flip the label. Experiments on three benchmarks and three instruction-tuned LLMs show that BIRD improves average attack success from 27.71 to 36.60 over a strong confidence-guided baseline, while preserving comparable semantic similarity and practical query costs. Code is available at https://anonymous.4open.science/r/BIRD.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: adversarial attacks/examples/training, calibration/uncertainty, robustness
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
EMNLP 2026 AI Reviewing Experiment: no
Submission Number: 17359
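
The upper-level scheduler described in the abstract can be illustrated with a minimal sketch of a sliding-window UCB bandit over perturbation operators. This is not the authors' implementation: the class name SlidingWindowUCB, the helper cost_aware_reward, the window size, the exploration constant c, and the exact reward form (confidence drop or label flip, minus a per-query penalty) are all assumptions made for illustration.

```python
import math
from collections import deque


class SlidingWindowUCB:
    """Hypothetical sliding-window UCB over a pool of operators (arms)."""

    def __init__(self, n_arms, window=50, c=1.0):
        self.n_arms = n_arms
        self.c = c  # exploration constant (assumed value)
        # Only the most recent `window` (arm, reward) pairs are kept,
        # so statistics track the current stage of the attack.
        self.history = deque(maxlen=window)

    def select(self):
        counts = [0] * self.n_arms
        sums = [0.0] * self.n_arms
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward
        # Try any operator that has no observation inside the window.
        for arm in range(self.n_arms):
            if counts[arm] == 0:
                return arm
        total = len(self.history)

        def score(arm):
            mean = sums[arm] / counts[arm]
            bonus = self.c * math.sqrt(math.log(total) / counts[arm])
            return mean + bonus

        return max(range(self.n_arms), key=score)

    def update(self, arm, reward):
        self.history.append((arm, reward))


def cost_aware_reward(conf_before, conf_after, flipped, queries, cost_weight=0.01):
    """Assumed reward: label flip or confidence drop, minus a query penalty."""
    gain = 1.0 if flipped else max(0.0, conf_before - conf_after)
    return gain - cost_weight * queries
```

In an attack loop under these assumptions, each step would call select() to pick an operator, apply it to the current text, query the target model for its elicited confidence, and pass the resulting cost_aware_reward to update(). The sliding window lets the scheduler shift toward a different operator when the one that worked early in the trajectory stops reducing confidence.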