BELLS-O: Evaluating the Operational Trade-offs of LLM Supervision Systems

Published: 03 Jun 2026, Last Modified: 03 Jun 2026 | AI4GOOD Workshop 2026 Regular | CC BY 4.0
Keywords: AI Safety, Content Moderation, Jailbreak Detection, Prompt Injection, Guardrails, Evaluation Benchmarks
TL;DR: We benchmark 28 LLM supervision systems on detection, FPR, latency, and cost. Specialized guardrails dominate content moderation; frontier LLMs dominate jailbreak detection; capability predicts neither, and small generalists are closing in.
Abstract: LLM supervision systems, namely input/output moderation filters and jailbreak detectors, are the primary safeguard against misuse in deployed AI applications, yet existing benchmarks are often vendor-biased, omit cost and latency, and rarely compare specialized guardrails against repurposed generalist LLMs. We present BELLS-O (Benchmark for the Evaluation of LLM Supervision Systems -- Operational), the first independent operational benchmark of LLM supervision systems. BELLS-O jointly evaluates 28 systems from 17 providers, covering every major specialized guardrail (e.g., LlamaGuard-4, ShieldGemma-2, Lakera Guard) and frontier generalists repurposed as supervisors (e.g., GPT-5.4, Claude Sonnet 4.6, Grok-4.1), on detection rate, false-positive rate, latency, and monetary cost. We cover input/output moderation across 11 harm categories and jailbreak detection across 13 attack techniques, using in-house datasets built from handcrafted prompts, expert-curated samples, and quality-controlled synthetic generation. Because synthetic data can carry latent generator-specific signals, every generated sample is run through a paraphrasing step that suppresses these fingerprints. Mapping the Pareto frontier reveals use-case-dependent trade-offs. On content moderation, specialized supervisors are operationally dominant: top systems match frontier LLMs on detection ($\approx$95\% vs.\ 94\%) at comparably low false-positive rates ($\leq$2\%), while running 5--10$\times$ faster and ${\sim}$10$\times$ cheaper. On jailbreak detection, the trade-off shifts: frontier LLMs achieve higher detection and lower false-positive rates but at 10--50$\times$ higher cost and 5--10$\times$ higher latency. We release the benchmark, framework, leaderboard, and datasets as the first vendor-neutral basis for selecting safeguards under real deployment constraints.
Submission Number: 179
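
The abstract's Pareto-frontier analysis over detection rate, false-positive rate, latency, and cost can be illustrated with a minimal sketch. This is not the BELLS-O implementation: the `Supervisor` type, the function names, the system names, the cost unit, and every number below are hypothetical placeholders chosen only to show the dominance check, and none of them are results from the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Supervisor:
    """Hypothetical record of one system's operational metrics."""
    name: str
    detection_rate: float  # higher is better
    fpr: float             # false-positive rate, lower is better
    latency_s: float       # lower is better
    cost_usd: float        # lower is better (assumed unit: USD per 1k prompts)


def _as_costs(s: Supervisor) -> tuple:
    # Negate detection rate so that every component is "lower is better".
    return (-s.detection_rate, s.fpr, s.latency_s, s.cost_usd)


def dominates(a: Supervisor, b: Supervisor) -> bool:
    """True if a is at least as good as b on every metric and strictly better on one."""
    ca, cb = _as_costs(a), _as_costs(b)
    no_worse = all(x <= y for x, y in zip(ca, cb))
    strictly_better = any(x < y for x, y in zip(ca, cb))
    return no_worse and strictly_better


def pareto_frontier(systems: list) -> list:
    """Keep the systems that no other system dominates."""
    return [s for s in systems if not any(dominates(o, s) for o in systems)]


# Placeholder values for illustration only (not benchmark results).
SYSTEMS = [
    Supervisor("guardrail_a", 0.90, 0.03, 0.2, 0.1),
    Supervisor("generalist_b", 0.88, 0.03, 1.5, 1.0),
    Supervisor("guardrail_c", 0.70, 0.08, 0.3, 0.2),
    Supervisor("generalist_d", 0.96, 0.01, 2.0, 2.0),
]

FRONTIER = [s.name for s in pareto_frontier(SYSTEMS)]
print(FRONTIER)
```

In this toy data, a system stays on the frontier whenever it beats every rival on at least one axis, so a slow, expensive system with the best detection and a fast, cheap one with slightly lower detection can both be Pareto-optimal. That is the sense in which the choice depends on deployment constraints.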