ExplainableGuard: Interpretable Adversarial Defense using Chain-of-Thought Reasoning with DeepSeek-Reasoner
Abstract: Large Language Models (LLMs) are increasingly vulnerable to adversarial attacks that can subtly manipulate their outputs. While various defense mechanisms have been proposed, many operate as black boxes, offering little transparency into their decision-making. This paper introduces ExplainableGuard, an interpretable adversarial defense framework that leverages the chain-of-thought (CoT) reasoning capabilities of DeepSeek-Reasoner. Our approach not only detects and neutralizes adversarial perturbations in text but also provides step-by-step explanations for each defense action. We demonstrate how tailored CoT prompts guide the LLM to perform a multi-level analysis (character-level, word-level, structural, and semantic) and to generate a purified output together with a human-readable justification. Preliminary results on BLUE and IMDB show promising defense efficacy while offering crucial insights into the attack vectors and the defense rationale, paving the way for more trustworthy LLM deployments.
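To make the described pipeline concrete, the following is a minimal sketch of how a CoT defense prompt could be sent to DeepSeek-Reasoner through its OpenAI-compatible API. The prompt wording, function name, and output format are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (assumed interface, not the authors' code): querying
# DeepSeek-Reasoner with a chain-of-thought defense prompt via the
# OpenAI-compatible DeepSeek API.
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

# Hypothetical defense prompt covering the four analysis levels named in the abstract.
DEFENSE_PROMPT = """You are an adversarial-text defense module.
Analyze the input at four levels -- character, word, structural, and semantic --
and reason step by step about possible adversarial perturbations.
Then output:
1. A purified version of the text with perturbations neutralized.
2. A short justification for each change."""

def explainable_guard(text: str) -> tuple[str, str]:
    """Return (purified text with justification, chain-of-thought trace)."""
    response = client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[
            {"role": "system", "content": DEFENSE_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    msg = response.choices[0].message
    # Per DeepSeek's API docs, deepseek-reasoner returns its reasoning trace
    # in `reasoning_content` and the final answer in `content`.
    return msg.content, msg.reasoning_content

purified, reasoning = explainable_guard("Th1s m0vie was absolutly terrlble!!")
print(reasoning)   # step-by-step analysis of the suspected perturbations
print(purified)    # cleaned text plus human-readable justification
```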
Paper Type: Short
Research Area: Language Modeling
Research Area Keywords: adversarial attacks/examples/training, free-text/natural language explanations, chain-of-thought, LLM/AI agents, safety and alignment, robustness
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 7940