Defending LLMs Against Adversarial Prompts: A Gradient-Correlation Approach with Graph-Based Parameter Analysis

ACL ARR 2026 January Submission 5166 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Large Language Models, Toxic Prompt Detection, Model Safety, Gradient-based Analysis, Graph Neural Networks
Abstract: Large language models (LLMs) are increasingly facing complex and covert toxic prompts. Existing gradient-based toxicity detection approaches mostly focus on analyzing the gradient directions of individual model parameters in isolation. However, it neglects the inherent synergistic relationships between parameters within the neural network structure, as well as the differences in the contribution weights of different parameters to the model's safety defense mechanism, restricting their ability to capture subtle safety-related behavioral patterns of LLMs when confronting covert toxic prompts. To address this, we propose the GradMesh method, which combines graph neural networks to model the synergistic relationships between parameters, clusters highly correlated parameters, and incorporates the Euclidean distance of gradients to comprehensively consider the safety scores of parameters. This allows for a more thorough assessment of each parameter's impact on model safety, improving the accuracy of toxic prompt detection. Additionally, we generate multiple types of toxic reference samples using the target LLM to address the issue of randomness in reference samples. Comprehensive experiments on widely-used benchmark datasets, ToxicChat and XStest, demonstrate that our proposed method outperforms existing methods in all aspects.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/unfairness mitigation, reflections and critiques
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 5166