Keywords: AI Assistance Systems, Human Cognitive Support, Large Language Models, Multimodal RAG, Video Understanding
Abstract: Effective engagement in human society requires the ability to adapt, filter information, and make informed decisions in ever-changing situations. As robots and intelligent agents become more integrated into human life, there is a growing opportunity, and a growing need, to offload cognitive burden from humans to these systems, particularly in dynamic, information-rich scenarios.
To fill this critical need, we present Multi-RAG, a multimodal retrieval-augmented generation system designed to provide adaptive assistance to humans in information-intensive circumstances. Our system aims to improve situational understanding and reduce cognitive load by integrating and reasoning over multi-source information streams, including video, audio, and text. As an enabling step toward long-term human-robot partnerships, Multi-RAG explores how multimodal information understanding can serve as a foundation for adaptive robotic assistance in dynamic, human-centered situations. To evaluate its capability in a realistic human-assistance proxy task, we benchmarked Multi-RAG on the MMBench-Video dataset, a challenging multimodal video understanding benchmark. Our system achieves superior performance compared to existing open-source video large language models (Video-LLMs) and large vision-language models (LVLMs), while using fewer resources and less input data. The results demonstrate Multi-RAG's potential as a practical and efficient foundation for future human-robot adaptive assistance systems in dynamic, real-world contexts.
Submission Number: 10