Keywords: Emergent social behavior, Multimodal consistency, Safety in AI, Generative agents
Abstract: Can generative agents be trusted in multimodal environments? Advances in large language models and vision-language models have enabled generative agents capable of autonomous, goal-driven interaction in rich environments, yet their ability to reason about safety, coherence, and trust across modalities remains deeply limited. We introduce a reproducible simulation framework for evaluating generative agents along three dimensions: (1) safety improvement over time, including iterative plan revisions in multimodal (text-visual paired) scenarios; (2) detection of unsafe activities across multiple categories and subcategories of social situations; and (3) social dynamics, measured as the interaction count and acceptance ratio of social interactions between agents. Agents are equipped with layered memory, dynamic planning, and multimodal perception, and are instrumented with SocialMetrics, a suite of behavioral and structural metrics that quantifies plan revisions, unsafe-to-safe conversions, and information diffusion across agent networks. Experiments show that while agents can detect direct multimodal contradictions, they frequently fail to align local revisions with global safety, achieving only a 55% success rate in correcting unsafe plans. Across eight simulation runs with three models (Claude, GPT-4o mini, and Qwen-VL), five agents achieved average unsafe-to-safe plan conversion rates of 75%, 55%, and 58%, respectively. Performance ranged from 20% in multi-risk scenarios with GPT-4o mini to 98% in localized contexts such as Fire/Heat with Claude. Our dataset consists of 1,000 multimodal plans yielding over 600,000 steps, with an average of approximately 650 conversations per simulation (about 5,200 total) and 132 plan revisions per plan (about 132,000 total). Notably, 45% of unsafe actions were accepted when paired with misleading visual cues, indicating a strong tendency to overtrust visual content. These findings expose critical limitations in current agent architectures and establish a reproducible platform for studying multimodal safety, coherence, and social dynamics in generative agent environments.
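For concreteness, below is a minimal sketch of how the two headline SocialMetrics quantities named in the abstract, the unsafe-to-safe plan conversion rate and the social-interaction acceptance ratio, could be computed from simulation logs. The record fields (plan_id, initially_unsafe, final_safe, interactions_proposed, interactions_accepted) are hypothetical stand-ins for whatever the framework actually logs, not its real API.

    # Illustrative sketch only; the field names here are assumed, not the
    # framework's actual logging schema.
    from dataclasses import dataclass

    @dataclass
    class PlanRecord:
        plan_id: str
        initially_unsafe: bool   # flagged unsafe before any revision
        final_safe: bool         # judged safe after all revisions

    @dataclass
    class AgentRecord:
        agent_id: str
        interactions_proposed: int  # social interactions this agent initiated
        interactions_accepted: int  # of those, how many the peer accepted

    def unsafe_to_safe_rate(plans: list[PlanRecord]) -> float:
        """Fraction of initially unsafe plans that end up safe after revision."""
        unsafe = [p for p in plans if p.initially_unsafe]
        if not unsafe:
            return 0.0
        return sum(p.final_safe for p in unsafe) / len(unsafe)

    def acceptance_ratio(agents: list[AgentRecord]) -> float:
        """Accepted social interactions divided by all proposed interactions."""
        proposed = sum(a.interactions_proposed for a in agents)
        accepted = sum(a.interactions_accepted for a in agents)
        return accepted / proposed if proposed else 0.0

Under this reading, the reported 75%, 55%, and 58% conversion rates correspond to unsafe_to_safe_rate aggregated per model across the eight runs, and the acceptance ratio is pooled over all agents in a simulation.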
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 3192