Keywords: Emergent social behavior, Multimodal consistency, Safety in AI, Generative agents
Abstract: Can generative agents be trusted in multimodal environments? Despite recent advances, agents remain limited in their ability to reason about safety, coherence, and trust across modalities. We introduce a reproducible simulation framework that evaluates generative agents along three dimensions: (1) safety improvement over time via iterative plan revision in multimodal scenarios; (2) detection of unsafe activities across social contexts; and (3) social dynamics, measured through interaction and acceptance rates. Agents are scored with metrics that quantify plan revisions and unsafe-to-safe conversions. Experiments show that while agents detect direct multimodal contradictions, they often fail to align local revisions with global safety, correcting only 55% of unsafe plans. Notably, 45% of unsafe actions are accepted when paired with misleading visual cues, revealing a strong tendency to overtrust visual content. We release a dataset of 1,000 multimodal plans, yielding more than 600,000 simulation steps.
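To make the abstract's two headline metrics concrete, here is a minimal sketch of how the unsafe-to-safe conversion rate and the unsafe-action acceptance rate could be computed from per-plan simulation logs. This is not the authors' released code; the record types and field names (PlanOutcome, ActionRecord, initial_safe, final_safe, unsafe, accepted) are hypothetical placeholders for the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class PlanOutcome:
    initial_safe: bool   # was the plan safe before any revision?
    final_safe: bool     # was the plan safe after iterative revision?

@dataclass
class ActionRecord:
    unsafe: bool         # ground-truth label of the proposed action
    accepted: bool       # did the agent accept the action?

def unsafe_to_safe_rate(outcomes: list[PlanOutcome]) -> float:
    """Fraction of initially unsafe plans made safe by revision (higher is better)."""
    unsafe = [o for o in outcomes if not o.initial_safe]
    if not unsafe:
        return 0.0
    return sum(o.final_safe for o in unsafe) / len(unsafe)

def unsafe_acceptance_rate(records: list[ActionRecord]) -> float:
    """Fraction of unsafe actions the agent accepted (lower is better)."""
    unsafe = [r for r in records if r.unsafe]
    if not unsafe:
        return 0.0
    return sum(r.accepted for r in unsafe) / len(unsafe)

# Example: reproduces the 55% conversion rate reported in the abstract.
outcomes = [PlanOutcome(False, True)] * 55 + [PlanOutcome(False, False)] * 45
assert abs(unsafe_to_safe_rate(outcomes) - 0.55) < 1e-9
```

Under this reading, the abstract's 45% figure would be the value of unsafe_acceptance_rate restricted to actions paired with misleading visual cues.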
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: LLM safety, multimodal safety, generative agents, safety evaluation, social simulation
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 5948