MMA: Multimodal Memory Agent

ACL ARR 2026 January Submission3203 Authors

04 Jan 2026 (modified: 20 Mar 2026)ACL ARR 2026 January SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Multimodal Memory Agents; Epistemic Prudence; Calibrated Abstention; Belief Dynamics
Abstract: Long-horizon multimodal agents depend on external memory; however, similarity-based retrieval often surfaces stale, low-credibility, or conflicting items, which can trigger overconfident errors. We propose Multimodal Memory Agent (MMA), which assigns each retrieved memory item a dynamic reliability score by combining source credibility, temporal decay, and conflict-aware network consensus, and uses this signal to reweight evidence and abstain when support is insufficient. We also introduce MMA-Bench, a programmatically generated benchmark for belief dynamics with controlled speaker reliability and structured text–vision contradictions. Using this framework, we uncover the "Visual Placebo Effect", revealing how RAG-based agents inherit latent visual biases from foundation models. On FEVER, MMA matches baseline accuracy while reducing variance by 35.2\% and improving selective utility; on LoCoMo, a safety-oriented configuration improves actionable accuracy and reduces wrong answers; on MMA-Bench, MMA reaches 41.18\% Type-B accuracy in Vision mode, while the baseline collapses to 0.0\% under the same protocol.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality; calibration/uncertainty; retrieval-augmented generation; LLM/AI agents; factuality; multi-modal dialogue systems; robustness
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 3203
Loading