Keywords: Emergent social behavior, Multimodal consistency, Safety in AI, Generative agents
Abstract: Can generative agents be trusted in multimodal environments? Despite recent advances, agents remain limited in their ability to reason about safety, coherence, and trust across modalities. We introduce a reproducible simulation framework that evaluates generative agents along three dimensions: (1) safety improvement over time via iterative plan revision in multimodal scenarios; (2) detection of unsafe activities across social contexts; and (3) social dynamics, measured through interaction and acceptance rates. Agents are scored with metrics that quantify plan revisions and unsafe-to-safe conversions. Experiments show that while agents detect direct multimodal contradictions, they often fail to align local revisions with global safety, correcting only 55% of unsafe plans. Notably, 45% of unsafe actions are accepted when paired with misleading visual cues, revealing a strong tendency to overtrust visual content. We release a dataset of 1,000 multimodal plans, yielding more than 600,000 simulation steps.
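To make the abstract's two headline metrics concrete, here is a minimal sketch of how the unsafe-to-safe conversion rate and the unsafe-action acceptance rate could be computed from per-plan simulation logs. This is not the authors' released code; the record types and field names (PlanOutcome, ActionRecord, initial_safe, final_safe, unsafe, accepted) are hypothetical placeholders for the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class PlanOutcome:
    initial_safe: bool   # was the plan safe before any revision?
    final_safe: bool     # was the plan safe after iterative revision?

@dataclass
class ActionRecord:
    unsafe: bool         # ground-truth label of the proposed action
    accepted: bool       # did the agent accept the action?

def unsafe_to_safe_rate(outcomes: list[PlanOutcome]) -> float:
    """Fraction of initially unsafe plans made safe by revision (higher is better)."""
    unsafe = [o for o in outcomes if not o.initial_safe]
    if not unsafe:
        return 0.0
    return sum(o.final_safe for o in unsafe) / len(unsafe)

def unsafe_acceptance_rate(records: list[ActionRecord]) -> float:
    """Fraction of unsafe actions the agent accepted (lower is better)."""
    unsafe = [r for r in records if r.unsafe]
    if not unsafe:
        return 0.0
    return sum(r.accepted for r in unsafe) / len(unsafe)

# Example: reproduces the 55% conversion rate reported in the abstract.
outcomes = [PlanOutcome(False, True)] * 55 + [PlanOutcome(False, False)] * 45
assert abs(unsafe_to_safe_rate(outcomes) - 0.55) < 1e-9
```

Under this reading, the abstract's 45% figure would be the value of unsafe_acceptance_rate restricted to actions paired with misleading visual cues.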
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: LLM safety, multimodal safety, generative agents, safety evaluation, social simulation
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 5948