Structured Multimodal World Models for Knowledge Localization, Safe Editing, and Predictive Situation Understanding

Aditi Tiwari; Heng Ji

Published: 28 Apr 2026, Last Modified: 28 Apr 2026. Venue: MSLD 2026 Poster. Readers: Everyone. License: CC BY 4.0

Keywords: world models, knowledge editing, multimodal reasoning, object-centric representations, knowledge graphs, hallucination reduction, temporal consistency, vision-language models, structured generation, agentic AI

TL;DR: We propose SPKS, a structured multimodal world model that represents knowledge as an explicit, object-centric, temporally indexed graph to enable fact-level localization, safe editing, and hallucination-suppressed generation.

Abstract: Generative AI models remain unreliable as knowledge systems: correcting a single factual error routinely causes unintended modifications to unrelated facts, because knowledge is distributed across parameters with no explicit structure, locality, or traceability. Retrieval-augmented generation, parameter-level editing, and alignment fine-tuning each treat symptoms rather than the underlying cause: the absence of an explicit, inspectable, and updateable representation of world state. We propose Structured Predictive Knowledge Systems (SPKS), a unified multimodal architecture that represents knowledge as an explicit, object-centric, temporally indexed world state and conditions generation on predicted state trajectories rather than implicit embeddings. SPKS comprises four tightly coupled components: a slot-based multimodal encoder producing entity-level traceable tokens; a dynamic knowledge graph supporting fact-node-level editing and constraint propagation without modifying backbone parameters; an action-conditioned graph neural network transition model enabling multi-step forecasting and counterfactual simulation; and a state-conditioned generation module with a consistency loss that enforces graph-validated outputs. Together, these components enable fact-level localization, controlled and traceable knowledge updates, hallucination suppression through graph constraint enforcement, and predictive situation forecasting within a single framework. We evaluate SPKS along three axes: knowledge localization and editing precision on standard benchmarks, hallucination rate and temporal consistency against state-of-the-art vision-language baselines, and situation forecasting accuracy on structured video domains, with ablation conditions isolating individual component contributions.
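
As a rough illustration of what fact-node-level editing on a temporally indexed graph could look like, the following minimal sketch stores facts as explicit records with validity intervals and applies an edit that touches only the targeted fact. It is not the SPKS implementation: the names `Fact`, `FactGraph`, `add`, `query`, and `edit`, the integer time steps, and the example entities are all assumptions made for this example, and the sketch omits the encoder, constraint propagation, transition model, and generation module described in the abstract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    """One explicit fact node (hypothetical schema, not taken from SPKS)."""
    subject: str
    relation: str
    obj: str
    valid_from: int                 # first time step at which the fact holds
    valid_to: Optional[int] = None  # exclusive end; None means still valid


class FactGraph:
    """Toy temporally indexed fact store used only for illustration."""

    def __init__(self):
        self.facts = []  # list of Fact records, in insertion order

    def add(self, subject, relation, obj, t):
        """Record a new fact that holds from time step t onward."""
        fact = Fact(subject, relation, obj, t)
        self.facts.append(fact)
        return fact

    def query(self, subject, relation, t):
        """Return the object of the fact valid at time t, or None."""
        for f in self.facts:
            if (f.subject == subject and f.relation == relation
                    and f.valid_from <= t
                    and (f.valid_to is None or t < f.valid_to)):
                return f.obj
        return None

    def edit(self, subject, relation, new_obj, t):
        """Replace one fact from time t onward and return the facts touched.

        The open fact for (subject, relation) is closed at t and a new
        fact is appended; every other fact is left unmodified, so the
        returned list doubles as a trace of the edit.
        """
        touched = []
        for f in self.facts:
            if (f.subject == subject and f.relation == relation
                    and f.valid_to is None and f.valid_from <= t):
                f.valid_to = t
                touched.append(f)
        touched.append(self.add(subject, relation, new_obj, t))
        return touched


# Usage: edit a single fact and check that an unrelated fact is untouched.
g = FactGraph()
g.add("cup", "on", "table", t=0)
g.add("door", "state", "closed", t=0)
touched = g.edit("cup", "on", "shelf", t=5)
print(g.query("cup", "on", 3))       # state before the edit
print(g.query("cup", "on", 5))       # state after the edit
print(g.query("door", "state", 5))   # unrelated fact
print(len(touched))                  # number of fact nodes the edit touched
```

Because the edit closes the old fact's interval rather than overwriting it, earlier time steps remain queryable, and the returned list makes the scope of the change inspectable. In this toy setting, locality holds by construction; whether it carries over to a learned system is what the evaluation described above is meant to measure.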

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 100
