Keywords: streaming, guardrail, dynamic
TL;DR: A Streaming Safeguard for Large Models via Latent Dynamics-Guided Risk Detection
Abstract: Large models (LMs) are powerful content generators, yet their open-ended nature can also introduce risks, such as generating harmful or biased content.
Existing guardrails mostly perform post-hoc detection, which may expose unsafe content before it is caught,
and latency constraints further push them toward lightweight models, limiting detection accuracy.
In this work, we propose PlugGuard, a novel plug-in framework that enables streaming risk detection within the LM generation pipeline.
PlugGuard leverages intermediate LM hidden states through a Streaming Latent Dynamics Head (SLD), which models the temporal evolution of risk across the generated sequence for more accurate real-time risk detection.
To ensure reliable streaming moderation in real applications, we introduce an Anchored Temporal Consistency (ATC) loss that enforces monotonic harm predictions by embedding a benign-then-harmful temporal prior. In addition, for rigorous evaluation of streaming guardrails, we present StreamGuardBench—a model-grounded benchmark featuring on-the-fly responses from each protected model, reflecting real-world streaming scenarios in both text and vision–language tasks.
Across diverse models and datasets, PlugGuard consistently outperforms state-of-the-art post-hoc guardrails and prior plug-in probes (15.61% higher average F1), while using only 20M parameters and adding less than 0.5 ms of per-token latency.
The code and StreamGuardBench are released at **PlugGuard** to facilitate research on streaming guardrails.
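To make the ATC idea concrete, the following is a minimal illustrative sketch (not the paper's exact formulation; the function name, anchor targets, and hinge form are assumptions): given per-token risk scores and the index where harmful content begins, it penalizes any decrease in predicted risk over time, and anchors pre-onset scores toward 0 and post-onset scores toward 1, encoding the benign-then-harmful prior.

```python
import numpy as np

def atc_loss(risk_scores, harm_onset):
    """Illustrative Anchored Temporal Consistency loss (hypothetical form).

    risk_scores: per-token risk predictions in [0, 1] over the stream.
    harm_onset: index of the first harmful token (the "anchor" point).
    """
    r = np.asarray(risk_scores, dtype=float)
    # Monotonicity term: hinge on decreases, i.e. penalize r[t] > r[t+1].
    mono = np.maximum(r[:-1] - r[1:], 0.0).sum()
    # Anchor terms: benign prefix pulled toward 0, harmful suffix toward 1.
    anchor = (r[:harm_onset] ** 2).sum() + ((1.0 - r[harm_onset:]) ** 2).sum()
    return mono + anchor
```

A perfectly benign-then-harmful, non-decreasing trajectory incurs zero loss, while a score that dips after rising is penalized by the hinge term.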
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 7154