Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race?

ACL ARR 2025 July Submission994 Authors

29 Jul 2025 (modified: 20 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: As large language models (LLMs) are increasingly deployed, ensuring their safe use is paramount. Jailbreak attacks, adversarial prompts that bypass model alignment to trigger harmful outputs, present significant risks, and existing studies report high success rates in evading common LLMs. However, previous evaluations have focused solely on the models themselves, neglecting the full deployment pipeline, which typically incorporates additional safety mechanisms such as content moderation filters. To address this gap, we present the first systematic evaluation of jailbreak attacks targeting LLM safety alignment that assesses their success across the full inference pipeline, including both the input and output filtering stages. Our findings yield two key insights: first, nearly all evaluated jailbreak techniques can be detected by at least one safety filter, suggesting that prior assessments may have overestimated the practical success of these attacks; second, while safety filters are effective at detection, they still struggle to balance high recall (ensuring protection) with high precision (preserving user experience), resulting in suboptimal protection for real-world applications. We highlight critical gaps and call for further refinement of detection accuracy and usability in LLM safety systems.
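For illustration only, the three-stage deployment flow evaluated in the abstract (input filter, aligned model, output filter) can be sketched as follows; this is a minimal conceptual sketch, and the names input_filter, generate, and output_filter are hypothetical assumptions, not the paper's implementation.

# Conceptual sketch of the full inference pipeline described above:
# prompt -> input safety filter -> LLM -> output safety filter.
# All callables passed in (input_filter, generate, output_filter) are
# hypothetical placeholders for whatever moderation components are deployed.
from dataclasses import dataclass

@dataclass
class PipelineResult:
    blocked_at: str | None  # "input", "output", or None if a response is returned
    response: str | None

def run_pipeline(prompt: str, input_filter, generate, output_filter) -> PipelineResult:
    # Stage 1: input-side moderation on the (possibly adversarial) prompt.
    if input_filter(prompt):
        return PipelineResult(blocked_at="input", response=None)
    # Stage 2: the aligned LLM itself may refuse or comply.
    response = generate(prompt)
    # Stage 3: output-side moderation catches harmful completions that slip through.
    if output_filter(response):
        return PipelineResult(blocked_at="output", response=None)
    return PipelineResult(blocked_at=None, response=response)

A jailbreak only "succeeds" end to end if it evades the model's alignment and both filtering stages, which is the success criterion the evaluation targets.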
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: jailbreak attack, harmful content detection, safety filters
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Previous URL: https://openreview.net/forum?id=oP3cdPZIZb
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).
Reassignment Request Reviewers: No, I want the same set of reviewers from our previous submission (subject to their availability).
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: We discuss potential risks in a dedicated paragraph placed before the abstract, following ACL submission guidelines.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 3.3, where all reused datasets and models are cited with proper attribution to their original authors.
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Section 3.3. All benchmark datasets used in our study are publicly available and cited appropriately. Specifically, AdvBench50, MaliciousInstruct, JailbreakBench, HarmBench, and TruthfulQA are released for academic use under permissive terms by their respective authors. The models we evaluate (e.g., Llama, Vicuna, Qwen, Mistral, GPT-4) are accessed under their respective licenses or API terms, such as OpenAI API Terms of Use.
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Section 3.3. All datasets and models are used in accordance with their intended academic use cases. Our usage aligns with the original purposes of the benchmarks (e.g., evaluating model robustness and safety), and we do not repurpose the data outside of research contexts. All usage is consistent with the licenses and ethical guidelines of the original authors.
B4 Data Contains Personally Identifying Info Or Offensive Content: Yes
B4 Elaboration: Section 3.3. We use public benchmark datasets that are carefully constructed for adversarial robustness evaluation. We manually reviewed and filtered the prompts to remove duplicates and ensure no personally identifiable information (PII) is retained. While some prompts involve potentially harmful queries (e.g., related to violence or deception), they are included solely for the purpose of testing LLM safety filters and follow the precedents set by the original datasets.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section 3.3. We document the construction and filtering process of our final evaluation set, including dataset sources, prompt categories, and alignment strategies. The dataset covers 10 types of adversarial behaviors inspired by OpenAI's usage policy. We also document the LLMs used and the testing procedure for reproducibility. We plan to release the constructed dataset under CC-BY 4.0 license for academic use.
B6 Statistics For Data: Yes
B6 Elaboration: Section 3.3: We report comprehensive dataset statistics, explain how the dataset was constructed from multiple benchmark sources, and detail its coverage of 10 distinct misuse categories.
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 3.3 reports the detailed model size, Section 4 reports computational budget.
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 4 reports experimental setup and detector's hyperparameters.
C3 Descriptive Statistics: No
C3 Elaboration: As is common in LLM safety research, we report results from a single run due to the high computational cost and limited API query budgets associated with evaluating large language models.
C4 Parameters For Packages: N/A
C4 Elaboration: Not applicable. We did not use any standard NLP packages (e.g., NLTK, SpaCy, ROUGE) for preprocessing, normalization, or evaluation. Our evaluation is based on LLM outputs and prompt-response behavior without metric-based scoring.
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: N/A
D1 Elaboration: Section 3.3. The annotation was conducted by the author following a predefined set of labeling rules based on the harmfulness categories described in JailbreakBench, HarmBench, and related benchmarks. The rules included examples of harmful and benign prompts and decision guidelines for edge cases. No external annotators or crowdworkers were involved.
D2 Recruitment And Payment: No
D2 Elaboration: Section 3.3. All annotation was performed by the author; no external annotators or crowdworkers were recruited or compensated.
D3 Data Consent: Yes
D3 Elaboration: Section 3.3. All data used are sourced from publicly released benchmark datasets which were originally curated and published for academic research. The original authors obtained the necessary consent, and no additional personal data were collected in this work.
D4 Ethics Review Board Approval: No
D4 Elaboration: Since all data used in this study are publicly released for research purposes and no new data were collected from human subjects, ethics review board approval was not required.
D5 Characteristics Of Annotators: No
D5 Elaboration: All annotations were performed solely by the author; thus, no external annotator demographic information is applicable.
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: The writing and coding processes were conducted primarily by the author without relying on AI assistants.
Author Submission Checklist: Yes
Submission Number: 994