Defending LLMs Against Adversarial Prompts: A Gradient-Correlation Approach with Graph-Based Parameter Analysis
Abstract: Large Language Models (LLMs) face covert threats from toxic prompts, and existing detection methods often require substantial data and are inefficient. Current gradient-based approaches primarily focus on individual parameter comparisons, limiting their effectiveness against sophisticated toxicity. To address this, we propose GradMesh, which integrates Euclidean distance metrics for gradient magnitudes with direction similarity analysis. We also employ Graph Neural Networks (GNNs) to model relationships among parameters, enhancing detection accuracy by clustering correlated parameters. Additionally, we generate diverse toxic reference samples using the target LLM to improve reliability. Experiments on the benchmark datasets ToxicChat and XSTest show that GradMesh outperforms existing methods across all evaluation metrics.
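To make the abstract's core comparison concrete, here is a minimal sketch (not the authors' released code) of comparing the gradient a test prompt induces against gradients from toxic reference prompts, using (i) Euclidean distance between gradient magnitudes and (ii) cosine similarity between gradient directions. The GNN over correlated parameter groups described above is omitted; the loss target, the per-parameter combination, and the decision threshold are illustrative assumptions.

```python
from typing import Dict, List
import torch


def prompt_gradients(model, tokenizer, prompt: str, refusal: str) -> Dict[str, torch.Tensor]:
    """Gradient of the LM loss on a fixed refusal continuation w.r.t. each parameter.

    Using a refusal string as the target is one common choice in gradient-based
    toxic-prompt detection; the objective actually used by GradMesh may differ.
    """
    model.zero_grad()
    inputs = tokenizer(prompt + refusal, return_tensors="pt")
    labels = inputs["input_ids"].clone()
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    return {n: p.grad.detach().clone()
            for n, p in model.named_parameters() if p.grad is not None}


def gradient_scores(test_g: Dict[str, torch.Tensor],
                    ref_g: Dict[str, torch.Tensor]) -> Dict[str, float]:
    """Per-parameter scores combining magnitude distance and direction similarity."""
    scores = {}
    for name, g_t in test_g.items():
        g_r = ref_g[name]
        v_t, v_r = g_t.flatten(), g_r.flatten()
        # Euclidean distance between gradient magnitudes (smaller => more similar)
        mag_dist = torch.dist(v_t.abs(), v_r.abs()).item()
        # Cosine similarity between gradient directions (larger => more similar)
        cos_sim = torch.nn.functional.cosine_similarity(v_t, v_r, dim=0).item()
        scores[name] = cos_sim - mag_dist  # simple illustrative combination
    return scores


def is_toxic(test_g: Dict[str, torch.Tensor],
             ref_grads: List[Dict[str, torch.Tensor]],
             threshold: float = 0.0) -> bool:
    """Flag a prompt whose gradients align with the toxic references on average."""
    per_ref = [sum(gradient_scores(test_g, r).values()) / len(test_g) for r in ref_grads]
    return sum(per_ref) / len(per_ref) > threshold
```

In the full method, the per-parameter scores would be fed into a GNN that clusters correlated parameters rather than being averaged uniformly as done here.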
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/unfairness mitigation, reflections and critiques
Contribution Types: NLP engineering experiment
Languages Studied: Chinese, English
Submission Number: 8077