JailbreakTracer: Explainable Detection of Jailbreaking Prompts in LLMs Using Synthetic Data Generation

Published: 01 Jan 2025, Last Modified: 08 Nov 2025IEEE Access 2025EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: The emergence of Large Language Models (LLMs) has revolutionized natural language processing (NLP), enabling remarkable advancements across various applications. However, these models remain susceptible to adversarial prompts, commonly referred to as jailbreaks, which exploit their vulnerabilities to bypass ethical and safety constraints. These prompts manipulate LLMs to produce harmful or forbidden outputs, posing serious ethical and security challenges. In this study, we propose JailbreakTracer, a novel framework leveraging synthetic data generation and Explainable AI (XAI) to detect and classify jailbreaking prompts. We first construct two comprehensive datasets: a Toxic Prompt Classification Dataset, combining real-world and synthetic jailbreak prompts, and a Forbidden Question Reasoning Dataset, categorizing forbidden queries into 13 distinct scenarios with clear reasoning labels. Synthetic toxic prompts are generated using a fine-tuned GPT model, achieving an attack success rate of 95.1%, effectively addressing the class imbalance. Using transformer-based architectures, we train classifiers that achieved 97.25% accuracy in detecting jailbreak prompts and 100% accuracy in categorizing forbidden questions. Our approach integrates XAI techniques, such as LIME, to ensure interpretability and transparency in the model’s predictions. Extensive evaluations demonstrate the efficacy of JailbreakTracer in detecting and reasoning about jailbreak prompts, providing a critical step toward enhancing the safety and accountability of LLMs. The dataset and code are available on GitHub: https://github.com/faiyazabdullah/JailbreakTracer
Loading