Keywords: crisis communication, social simulation
Abstract: While large language models (LLMs) have improved on procedurally verifiable tasks, their behavior in high-stakes institutional crises remains under-evaluated because success depends on evolving stakeholder perceptions rather than a single verifiable answer. Existing benchmarks focus on static, single-turn competence and provide limited coverage of risk-sensitive, goal-directed communication under interaction. We introduce BrandEval, a two-track benchmark that pairs a rubric-based static diagnostic with BrandPolis, a dynamic multi-agent sandbox of competitive, partially observable markets. We also introduce Strategic Rationale (SR), a lightweight decision workflow, and BrandSRD, a Chinese dataset of crisis-response decision points with human-validated preferences. Using BrandSRD, we build a reference SR-based Strategic Agent and study how communication styles affect long-horizon trust and tail risk. These resources enable controlled stress testing of LLM crisis communication, exposing failure modes and societal risks that single-turn evaluation may miss. BrandSRD, BrandEval, and BrandPolis will be released publicly.
Paper Type: Long
Research Area: Computational Social Science, Cultural Analytics, and NLP for Social Good
Research Area Keywords: benchmarking, evaluation methodologies
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: Chinese, English
Submission Number: 5980