Discrete Latent Features Ablate Adversarial Attack: A Robust Prompt Tuning Framework for VLMs

ICLR 2026 Conference Submission 10526 Authors

Published: 26 Jan 2026, Last Modified: 26 Jan 2026 · ICLR 2026 · CC BY 4.0
Keywords: Prompt Learning, Adversarial Robustness, Vision-Language Models
TL;DR: We propose DEFEAT, a Discrete Latent Feature based Adversarial Training method that mitigates adversarial attacks on VLMs.
Abstract: While adversarial fine-tuning can enhance the robustness of vision-language models (VLMs), it is computationally expensive. Adversarial prompt tuning has emerged as a practical alternative. However, existing methods are limited by their reliance on vulnerable continuous image features. To mitigate this vulnerability in the feature representation, we propose **DEFEAT** (**D**iscrete Lat**E**nt **F**eatur**E** based **A**dversarial **T**raining), a robust prompt tuning framework for VLMs. Specifically, DEFEAT introduces a perturbation discrete shield module that reconstructs discrete latent features, together with a logits fusion strategy, substantially reducing the discrepancy between clean and adversarial image representations. Moreover, DEFEAT integrates prompt tuning with adversarial training while regularizing the learnable prompts toward hand-crafted prompts, further enhancing adversarial robustness. Extensive experiments across 15 datasets show that DEFEAT outperforms existing adversarial prompt tuning methods.
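The abstract gives no implementation details, but its "discrete latent feature" reconstruction is reminiscent of vector quantization, and "logits fusion" suggests mixing similarity logits from continuous and discretized image features. The sketch below is a minimal illustration under those assumptions only; `PerturbationDiscreteShield`, `fused_logits`, and the `alpha` weight are hypothetical names introduced here, not the paper's actual module or API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PerturbationDiscreteShield(nn.Module):
    """Hypothetical sketch: snap continuous image features to the nearest
    entry of a learned codebook (vector quantization). Small adversarial
    perturbations that do not push a feature across a codebook boundary
    are discarded by the quantization step."""

    def __init__(self, num_codes: int = 512, dim: int = 512):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, dim) continuous image features from the VLM encoder.
        # Distances to every codebook entry: (batch, num_codes).
        dists = torch.cdist(feats, self.codebook.weight)
        codes = dists.argmin(dim=-1)          # index of nearest code
        quantized = self.codebook(codes)      # discrete latent feature
        # Straight-through estimator: gradients bypass the argmin.
        return feats + (quantized - feats).detach()


def fused_logits(cont_feats, disc_feats, text_feats, alpha: float = 0.5):
    """Hypothetical logits fusion: blend cosine-similarity logits computed
    from continuous and discrete (reconstructed) image features."""
    txt = F.normalize(text_feats, dim=-1)
    cont = F.normalize(cont_feats, dim=-1) @ txt.T
    disc = F.normalize(disc_feats, dim=-1) @ txt.T
    return alpha * cont + (1 - alpha) * disc


if __name__ == "__main__":
    shield = PerturbationDiscreteShield(num_codes=512, dim=512)
    img = torch.randn(8, 512)    # stand-in for CLIP image features
    txt = torch.randn(10, 512)   # stand-in for class-prompt text features
    logits = fused_logits(img, shield(img), txt)
    print(logits.shape)          # torch.Size([8, 10])
```

In an adversarial prompt tuning loop, these fused logits would presumably be computed for both clean and PGD-perturbed images while only the prompt vectors (and codebook) receive gradients; that training detail is an assumption, as the abstract does not specify it.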
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 10526