Beyond Mere Token Analysis: A Hypergraph Metric Space Framework for Defending Against Socially Engineered LLM Attacks

Published: 22 Jan 2025, Last Modified: 28 Feb 2025
Venue: ICLR 2025 Poster
License: CC BY 4.0
Keywords: Jailbreak Attack, LLMs, LLM Security, AI Security
TL;DR: We propose an approach to enhancing LLM safety that uses hypergraphs to exploit the underlying geometry of input prompt token embeddings.
Abstract: Recent jailbreak attempts on Large Language Models (LLMs) have shifted from algorithm-focused attacks to human-like social engineering attacks, with persuasion-based techniques emerging as a particularly effective subset. These attacks evolve rapidly, demonstrate high creativity, and achieve superior attack success rates. To combat such threats, we propose an approach to enhancing LLM safety that leverages the underlying geometry of input prompt token embeddings using hypergraphs. This approach allows us to model the differences in information flow between benign and malicious LLM prompts. Each LLM prompt is represented as a metric hypergraph, forming a compact metric space. We then construct a higher-order metric space over these compact metric hypergraphs using the Gromov-Hausdorff distance as a generalized metric. Within this space of metric hypergraph spaces, our safety filter learns to distinguish harmful prompts from benign ones. We provide theoretical guarantees on the classifier's generalization error for novel and unseen LLM input prompts. Extensive empirical evaluations demonstrate that our method significantly outperforms both existing state-of-the-art generic defense mechanisms and naive baselines. Notably, our approach also achieves performance comparable to specialized defenses against algorithm-focused attacks.
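For readers unfamiliar with the Gromov-Hausdorff distance mentioned in the abstract, the following is a minimal sketch of its standard definition on very small finite metric spaces: half the smallest distortion over all correspondences between the two spaces. It is not the authors' implementation. The function names and the toy distance matrices are assumptions made for illustration, and the exhaustive search is only feasible for a handful of points.

```python
def distortion(R, dX, dY):
    """Largest mismatch between pairwise distances matched by correspondence R."""
    return max(abs(dX[x][xp] - dY[y][yp]) for (x, y) in R for (xp, yp) in R)


def gromov_hausdorff(dX, dY):
    """Brute-force Gromov-Hausdorff distance between two finite metric spaces.

    dX and dY are square distance matrices (lists of lists). A correspondence
    is a set of pairs (x, y) that covers every point of X and every point of Y.
    The search enumerates all 2^(|X|*|Y|) subsets, so keep inputs tiny.
    """
    n, m = len(dX), len(dY)
    pairs = [(i, j) for i in range(n) for j in range(m)]
    all_x, all_y = set(range(n)), set(range(m))
    best = float("inf")
    for mask in range(1, 1 << len(pairs)):
        R = [p for k, p in enumerate(pairs) if (mask >> k) & 1]
        # Skip subsets that leave some point of X or Y unmatched.
        if {x for x, _ in R} != all_x or {y for _, y in R} != all_y:
            continue
        best = min(best, distortion(R, dX, dY))
    return best / 2


# Hypothetical toy spaces: a single point, and two points at distance 2.
one_point = [[0]]
two_points = [[0, 2], [2, 0]]
print(gromov_hausdorff(one_point, two_points))   # 1.0 (half the diameter)
print(gromov_hausdorff(two_points, two_points))  # 0.0 (isometric spaces)
```

A distance of zero between isometric spaces is what makes this quantity usable as a metric on a space of compact metric spaces. How the paper builds each prompt's metric hypergraph and how it computes or approximates the distance at scale are not specified in this abstract.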
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10409