Discrete Latent Features Ablate Adversarial Attack: A Robust Prompt Tuning Framework for VLMs

ICLR 2026 Conference Submission 10526 Authors

Published: 26 Jan 2026, Last Modified: 26 Jan 2026 · ICLR 2026 · CC BY 4.0
Keywords: Prompt Learning, Adversarial Robustness, Vision-Language Models
TL;DR: We propose DEFEAT, a Discrete Latent Feature based Adversarial Training method that mitigates adversarial attacks on VLMs.
Abstract: While adversarial fine-tuning can enhance the robustness of vision-language models (VLMs), it is computationally expensive. Adversarial prompt tuning has emerged as a practical alternative. However, existing methods are limited by their reliance on vulnerable continuous image features. To mitigate this vulnerability in the feature representation, we propose **DEFEAT** (**D**iscrete Lat**E**nt **F**eatur**E** based **A**dversarial **T**raining), a robust prompt tuning framework for VLMs. Specifically, DEFEAT introduces a perturbation discrete shield module that reconstructs discrete latent features, together with a logits fusion strategy, substantially reducing the discrepancy between clean and adversarial image representations. Moreover, DEFEAT integrates prompt tuning with adversarial training while regularizing the learnable prompts toward hand-crafted prompts, further enhancing adversarial robustness. Extensive experiments across 15 datasets show that DEFEAT outperforms existing adversarial prompt tuning methods.
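The abstract gives no implementation details, but its "discrete latent feature" reconstruction is reminiscent of vector quantization, and "logits fusion" suggests mixing similarity logits from continuous and discretized image features. The sketch below is a minimal illustration under those assumptions only; `PerturbationDiscreteShield`, `fused_logits`, and the `alpha` weight are hypothetical names introduced here, not the paper's actual module or API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PerturbationDiscreteShield(nn.Module):
    """Hypothetical sketch: snap continuous image features to the nearest
    entry of a learned codebook (vector quantization). Small adversarial
    perturbations that do not push a feature across a codebook boundary
    are discarded by the quantization step."""

    def __init__(self, num_codes: int = 512, dim: int = 512):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, dim) continuous image features from the VLM encoder.
        # Distances to every codebook entry: (batch, num_codes).
        dists = torch.cdist(feats, self.codebook.weight)
        codes = dists.argmin(dim=-1)          # index of nearest code
        quantized = self.codebook(codes)      # discrete latent feature
        # Straight-through estimator: gradients bypass the argmin.
        return feats + (quantized - feats).detach()


def fused_logits(cont_feats, disc_feats, text_feats, alpha: float = 0.5):
    """Hypothetical logits fusion: blend cosine-similarity logits computed
    from continuous and discrete (reconstructed) image features."""
    txt = F.normalize(text_feats, dim=-1)
    cont = F.normalize(cont_feats, dim=-1) @ txt.T
    disc = F.normalize(disc_feats, dim=-1) @ txt.T
    return alpha * cont + (1 - alpha) * disc


if __name__ == "__main__":
    shield = PerturbationDiscreteShield(num_codes=512, dim=512)
    img = torch.randn(8, 512)    # stand-in for CLIP image features
    txt = torch.randn(10, 512)   # stand-in for class-prompt text features
    logits = fused_logits(img, shield(img), txt)
    print(logits.shape)          # torch.Size([8, 10])
```

In an adversarial prompt tuning loop, these fused logits would presumably be computed for both clean and PGD-perturbed images while only the prompt vectors (and codebook) receive gradients; that training detail is an assumption, as the abstract does not specify it.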
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 10526