Interpretable Adversarial Prompt Tuning via Semantic Concepts

Published: 27 Mar 2026, Last Modified: 11 Apr 2026
Venue: 6th AdvML Workshop
License: CC BY 4.0
Keywords: adversarial prompt tuning, vision-language models, interpretable machine learning, semantic concepts, few-shot learning, CLIP
TL;DR: We enhance adversarial prompt tuning with semantic concepts, achieving better few-shot performance and interpretable robustness mechanisms in vision-language models.
Abstract: Adversarial prompt tuning adapts vision-language models efficiently but suffers from poor few-shot performance and a lack of interpretability. We propose concept-enhanced adversarial prompt tuning, which replaces abstract context vectors with structured semantic concepts. Our approach augments base text embeddings with weighted concept combinations, optimizing only scalar weights while keeping concept representations fixed. This provides semantic structure for few-shot learning and interpretability through learned weights. Across six benchmarks, we achieve substantial improvements: +19.58pp clean accuracy on EuroSAT 1-shot and strong robustness gains against PGD-100 attacks. Our method uses 98.7\% fewer parameters than class-specific approaches (5,086 vs. 393K). Analysis reveals semantically meaningful patterns: kitchens emphasize cabinets ($\alpha=+0.92$) while suppressing sinks ($\alpha=-0.88$) to avoid bathroom confusion. Our approach successfully balances adversarial robustness, few-shot generalization, and interpretability.
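The core idea in the abstract (a frozen base text embedding augmented by a weighted combination of frozen concept embeddings, with only the scalar weights trained) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions, the function name `class_embedding`, and the use of random vectors in place of real CLIP text/concept embeddings are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 512, 10  # embedding dim and concepts per class (illustrative sizes)

t_base = rng.normal(size=d)          # frozen base text embedding for one class
concepts = rng.normal(size=(k, d))   # frozen concept embeddings (e.g. "cabinet", "sink")
alpha = np.zeros(k)                  # the only trainable parameters: one scalar per concept

def class_embedding(t_base, concepts, alpha):
    """Base embedding plus weighted concept combination, L2-normalized."""
    t = t_base + alpha @ concepts
    return t / np.linalg.norm(t)

emb = class_embedding(t_base, concepts, alpha)
print(emb.shape)  # (512,)
```

With per-class concept weights, the trainable parameter count is just (number of classes) x (concepts per class) scalars, which is consistent with the paper's reported ~5K parameters versus 393K for class-specific context vectors. The sign and magnitude of each learned weight can then be read off directly for interpretation, as in the kitchen/cabinet example.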
Cps Compliance Confirmation: true
Submission Number: 19