JudgeRail: Harnessing Open-Source LLMs for Fast Harmful Text Detection with Judicial Prompting and Logit Rectification
Keywords: Large Language Model, Harmful Text Detection, Toxic Speech Detection, Content Moderation
Abstract: Large language models (LLMs) simultaneously facilitate the generation and detection of harmful text. Leading LLM developers, such as OpenAI, Meta, and Google, are driving a paradigm shift in harmful text detection, moving from conventional detectors to fine-tuned LLMs. However, these newly released moderation models, which require substantial computational and data resources to build, have not yet been thoroughly investigated for their effectiveness in this new paradigm. In this work, we propose JudgeRail, a novel and generic framework that guides open-source LLMs to adhere to judicial principles during text moderation. Additionally, we introduce a new logit rectification method that accurately interprets an LLM's classification intent, rigorously controls its output format, and significantly accelerates detection. By integrating several top-performing open-source LLMs into JudgeRail without any fine-tuning and evaluating them against the OpenAI Moderation API, LlamaGuard3, ShieldGemma, and other conventional moderation solutions across various datasets, including those specifically designed for jailbreaking LLMs, we demonstrate that JudgeRail can adapt these LLMs to be competitive with fine-tuned moderation models and to significantly outperform conventional solutions. Moreover, we evaluate all models for detection latency, a critical yet rarely examined practical aspect, and show that LLMs with JudgeRail require only 46% to 55% of the time needed by LlamaGuard3 and ShieldGemma. The generic nature and competitive performance of JudgeRail highlight its potential for promoting the practicality of LLM-based harmful text detectors.
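To make the logit rectification idea concrete, the sketch below (not the authors' implementation; the model name, prompt wording, and label tokens are placeholders) illustrates the general technique of reading next-token logits and restricting the decision to a fixed set of label tokens, so a verdict is obtained in a single forward pass with a rigidly controlled output format instead of free-form generation:

```python
# Minimal sketch: first-token logit readout for harmful-text classification.
# MODEL_NAME, the judicial-style prompt, and the label set are assumptions,
# not the paper's exact configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder open-source LLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

LABELS = ["safe", "harmful"]  # assumed label vocabulary
label_token_ids = [tokenizer.encode(l, add_special_tokens=False)[0] for l in LABELS]

def classify(text: str) -> str:
    # Judicial-style prompt (illustrative only).
    prompt = (
        "You are a judge applying content-moderation rules.\n"
        f"Text: {text}\n"
        "Verdict (safe or harmful): "
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token only
    # Restrict the decision to the candidate label tokens: no multi-token
    # generation is needed and the output format is fixed by construction.
    scores = logits[label_token_ids]
    return LABELS[int(torch.argmax(scores))]
```

Because only one decoding step is performed and the verdict is read directly from the logits, this style of readout is where the latency savings over full-generation moderation models would come from.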
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5595