RobustIT: Adapter-Centric and Attack-Agnostic Anti-Backdoor Instruction Tuning

16 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: robustness, security, large visual language models
Abstract: Large visual language models (LVLMs) have demonstrated excellent instruction-following capabilities, yet remain vulnerable to stealthy backdoor attacks when fine‑tuned on contaminated data. Existing defenses typically assume full access to model parameters or rely on known trigger patterns and clean validation sets — assumptions that fail in real‑world, efficient tuning settings where visual encoders and core LLM weights are frozen and attack priors are unavailable. Motivated by the empirical insight that LVLM adapters quickly overfit to fixed triggers, we propose \textbf{R}obust \textbf{I}nstruction \textbf{T}uning~(\textbf{RobustIT}), a lightweight, \emph{attack‑agnostic} framework that tunes only adapter modules and text‑embedding layers. RobustIT combines two complementary regularizations: (1) \emph{\textbf{Input Diversity Regularization}}, which applies randomized spatial, color, and textual perturbations to disrupt fixed trigger–response mappings and consistent spurious cues; and (2) \emph{\textbf{Anomalous Activation Regularization}}, which dynamically sparsifies adapter channels exhibiting the abnormally sharp activations associated with backdoor patterns. This dual strategy steers the model toward semantically grounded representations without touching frozen cores or requiring any trigger supervision. Extensive evaluations against seven backdoor attacks on Flickr30k and MSCOCO show that RobustIT drives attack success rates to near zero with under 15\% extra training cost while preserving or improving standard task performance, highlighting the critical role of safeguards for efficient fine-tuning in securing real-world LVLM deployments.
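The two regularizations described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the perturbation magnitudes, the channel-wise mean-plus-`k`-sigma anomaly threshold, and all function names here are illustrative assumptions based only on the high-level description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def diversify_input(image):
    """Input Diversity Regularization (sketch): apply a small random
    spatial shift and per-channel color jitter so that a fixed pixel
    trigger no longer maps consistently to the backdoor response.
    `image` is an HxWxC float array in [0, 1]."""
    dy, dx = rng.integers(-2, 3, size=2)           # random spatial shift
    shifted = np.roll(image, (int(dy), int(dx)), axis=(0, 1))
    jitter = 1.0 + rng.uniform(-0.1, 0.1, size=3)  # random color jitter
    return np.clip(shifted * jitter, 0.0, 1.0)

def sparsify_anomalous(acts, k=3.0):
    """Anomalous Activation Regularization (sketch): zero out adapter
    channels whose mean absolute activation is abnormally sharp,
    i.e. exceeds (mean + k * std) across channels.
    `acts` has shape (batch, channels)."""
    per_channel = np.abs(acts).mean(axis=0)
    threshold = per_channel.mean() + k * per_channel.std()
    keep = per_channel <= threshold                # drop outlier channels
    return acts * keep

# Toy demonstration: plant one abnormally sharp channel and suppress it.
image = rng.random((8, 8, 3))
acts = rng.standard_normal((4, 16))
acts[:, 5] += 50.0                                 # simulated backdoor channel
out = sparsify_anomalous(acts)
```

In this toy run the planted channel 5 dominates the per-channel statistics, exceeds the threshold, and is zeroed, while the remaining channels pass through unchanged; `diversify_input` simply returns a perturbed image of the same shape.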
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 7373