Silicon A/B Testing for Emergency Alerts
Keywords: A/B testing, emergency alerts, public warnings, risk communication, message design, large language models, simulation, pre-deployment evaluation, human-in-the-loop, equity analysis, failure-mode discovery
Abstract: Emergency alerts must communicate actionable guidance quickly under stress, yet small differences in message structure and wording can change what recipients understand and remember. Rigorous pre-deployment testing of alert variants is often slow and expensive, limiting how much agencies can iterate. We present a lightweight A/B simulation framework that uses instruction-tuned LLMs as stratified resident respondents to stress-test emergency-alert rewrites. For each alert, we generate a control version and three variants (action-first, plain-language, constraint-aware) and evaluate them across resident profiles varying in English proficiency, mobility constraints, and trust in officials. Agents return structured JSON outputs, enabling automatic scoring of action recall, confusion, and intended compliance against human-specified required actions. Across 17 alerts, paired within-alert comparisons show that action-first formatting yields a small but consistent recall lift that remains stable when scaling from 40 to 80 agents, while subgroup analyses reveal heterogeneous effects. We also introduce an interpretable structure score that helps explain when formatting changes translate into recall gains. The framework is intended for rapid screening and failure-mode discovery, complementing—not replacing—human evaluation.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 14