ExplainableGuard: Interpretable Adversarial Defense using Chain-of-Thought Reasoning with DeepSeek-Reasoner
Abstract: Large Language Models (LLMs) are increasingly vulnerable to adversarial attacks that can subtly manipulate their outputs. While various defense mechanisms have been proposed, many operate as black boxes, offering little transparency into their decision-making. This paper introduces ExplainableGuard, an interpretable adversarial defense framework that leverages the chain-of-thought (CoT) reasoning capabilities of DeepSeek-Reasoner. Our approach not only detects and neutralizes adversarial perturbations in text but also provides step-by-step explanations for each defense action. We demonstrate how tailored CoT prompts guide the LLM to perform a multi-level analysis (character-level, word-level, structural, and semantic) and to generate a purified output together with a human-readable justification. Preliminary results on BLUE and IMDB show promising defense efficacy while offering crucial insights into the attack vectors and the defense rationale, paving the way for more trustworthy LLM deployments.
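To make the described pipeline concrete, the following is a minimal sketch of how a CoT defense prompt could be sent to DeepSeek-Reasoner through its OpenAI-compatible API. The prompt wording, function name, and output format are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (assumed interface, not the authors' code): querying
# DeepSeek-Reasoner with a chain-of-thought defense prompt via the
# OpenAI-compatible DeepSeek API.
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

# Hypothetical defense prompt covering the four analysis levels named in the abstract.
DEFENSE_PROMPT = """You are an adversarial-text defense module.
Analyze the input at four levels -- character, word, structural, and semantic --
and reason step by step about possible adversarial perturbations.
Then output:
1. A purified version of the text with perturbations neutralized.
2. A short justification for each change."""

def explainable_guard(text: str) -> tuple[str, str]:
    """Return (purified text with justification, chain-of-thought trace)."""
    response = client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[
            {"role": "system", "content": DEFENSE_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    msg = response.choices[0].message
    # Per DeepSeek's API docs, deepseek-reasoner returns its reasoning trace
    # in `reasoning_content` and the final answer in `content`.
    return msg.content, msg.reasoning_content

purified, reasoning = explainable_guard("Th1s m0vie was absolutly terrlble!!")
print(reasoning)   # step-by-step analysis of the suspected perturbations
print(purified)    # cleaned text plus human-readable justification
```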
Paper Type: Short
Research Area: Language Modeling
Research Area Keywords: adversarial attacks/examples/training, free-text/natural language explanations, chain-of-thought, LLM/AI agents, safety and alignment, robustness
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 7940