Enhancing Adversarial Robustness of LLMs with Analytic Hierarchy Process

Published: 10 Jul 2024, Last Modified: 26 Aug 2024, Venue: COLM, Readers: Everyone, License: CC BY 4.0
Research Area: Safety
Keywords: Adversarial Robustness, Analytic Hierarchy Process, AI Feedback
TL;DR: Drawing inspiration from cognitive theory, we introduce an innovative Analytic Hierarchy Process (AHP) inference framework to enhance the adversarial robustness of LLMs.
Abstract: As large language models (LLMs) are deployed across increasingly diverse applications, ensuring their robustness has become a pressing concern. Existing defense strategies are tailored to specific attack scenarios, typically require costly model training, and cannot respond rapidly to new threats. To tackle this issue, we conceptualize defense in LLMs as a cognitive process for handling complex user queries. Intuitively, when faced with a spectrum of queries that may contain malicious perturbations, LLMs need human-like discernment to avoid being misled. Drawing inspiration from cognitive theory, we introduce an Analytic Hierarchy Process (AHP) inference framework. Our method decomposes intricate tasks into manageable subtasks, prioritizes them, and addresses each one systematically. Because the framework relies on AI feedback, it requires no training or optimization. We evaluate the framework against jailbreak attacks and for robustness on downstream tasks using representative LLMs, including GPT-3.5 and Llama2. The experimental results show that the proposed framework significantly enhances the adversarial robustness of LLMs.
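The decompose, prioritize, and address-each-step loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`decompose`, `prioritize`, `answer`), the prompt prefixes, and the numeric scoring scheme are all hypothetical, and `stub_llm` is a stand-in for a real model call such as GPT-3.5 or Llama2.

```python
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a reply string.
LLM = Callable[[str], str]


def decompose(llm: LLM, query: str) -> List[str]:
    """Ask the model to split a query into subtasks, one per line (assumed format)."""
    reply = llm(f"DECOMPOSE: {query}")
    return [line.strip() for line in reply.splitlines() if line.strip()]


def prioritize(llm: LLM, subtasks: List[str]) -> List[str]:
    """Order subtasks by a model-assigned score, highest first (assumed scheme)."""
    def score(task: str) -> float:
        try:
            return float(llm(f"SCORE: {task}"))
        except ValueError:
            return 0.0  # unparseable feedback gets the lowest priority
    return sorted(subtasks, key=score, reverse=True)


def answer(llm: LLM, query: str) -> str:
    """Solve subtasks in priority order, then ask for a final response."""
    notes: List[str] = []
    for task in prioritize(llm, decompose(llm, query)):
        context = "\n".join(notes)
        notes.append(llm(f"SOLVE: {task}\nCONTEXT: {context}"))
    return llm(f"FINAL: {query}\nNOTES: " + "\n".join(notes))


def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, used only to exercise the loop."""
    if prompt.startswith("DECOMPOSE:"):
        return "answer question\ncheck safety"
    if prompt.startswith("SCORE:"):
        return "2" if "safety" in prompt else "1"
    if prompt.startswith("SOLVE:"):
        return "done"
    return "final"
```

Because every step is a plain model call, a loop of this shape needs no training or fine-tuning, which matches the training-free property the abstract claims; the actual prompts and prioritization criteria are defined in the paper, not here.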
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 343