Keywords: Egocentric AI Assistant, Multimodal, Video-Grounded Reasoning, Lightweight Model
Abstract: Egocentric AI assistants have emerged as a promising paradigm for real-world human–AI interaction, yet existing approaches face a critical trade-off: large language models provide strong reasoning but are too resource-intensive for mobile deployment, while lightweight models remain text-centric and lack interactive visual grounding. We introduce Ego-VGA, a lightweight multimodal assistant that delivers goal-oriented visual guidance with high efficiency. Ego-VGA incorporates a novel multimodal fusion layer, in which region fusion supports fine-grained vision–language grounding and vision fusion distills temporal cues from egocentric video streams for context-aware reasoning. A lightweight projection module and a compact LLM further improve efficiency, enabling deployment on mobile and wearable devices. To foster research in intent modeling, we construct Ego-IntentBench, a challenging benchmark with fine-grained procedural annotations. Extensive experiments validate our approach: Ego-VGA achieves +8.7\% recall@1 on AssistQ, +17.2 BLEU-1 / +7.6 METEOR on YouCook2, and an ~20\% improvement in mean top-5 recall on MECCANO (RGB-only). On Ego-IntentBench, where strong baselines such as Qwen2.5-VL and MiniCPM-V4 degrade substantially, Ego-VGA consistently outperforms them, demonstrating state-of-the-art generalization and adaptability in complex, goal-directed reasoning under visual guidance. The code and dataset are available at https://anonymous.4open.science/r/Ego-VGA-05CC
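To make the described architecture concrete, below is a minimal, hypothetical sketch of how a fusion layer combining region-level grounding, temporal distillation of video features, and a lightweight projection into a compact LLM could be structured. This is not the authors' implementation; all module names, dimensions, and design choices here are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): region fusion via
# cross-attention, vision fusion via temporal pooling of frame features,
# and a lightweight projection into a compact LLM's embedding space.
import torch
import torch.nn as nn

class MultimodalFusionLayer(nn.Module):
    """Hypothetical region + vision fusion; all sizes are illustrative."""
    def __init__(self, vis_dim=768, llm_dim=1024, n_heads=8):
        super().__init__()
        # Region fusion: text tokens attend over detected region features.
        self.region_attn = nn.MultiheadAttention(
            llm_dim, n_heads, kdim=vis_dim, vdim=vis_dim, batch_first=True)
        # Vision fusion: distill temporal cues from per-frame video features.
        self.temporal = nn.GRU(vis_dim, vis_dim, batch_first=True)
        # Lightweight projection into the compact LLM's embedding space.
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, text_emb, region_feats, frame_feats):
        # text_emb:     (B, T_txt, llm_dim)  token embeddings from the LLM
        # region_feats: (B, R, vis_dim)      per-region visual features
        # frame_feats:  (B, T_vid, vis_dim)  per-frame video features
        grounded, _ = self.region_attn(text_emb, region_feats, region_feats)
        _, h = self.temporal(frame_feats)          # h: (1, B, vis_dim)
        video_ctx = self.proj(h[-1]).unsqueeze(1)  # (B, 1, llm_dim)
        # Prepend the distilled video context to the grounded text tokens.
        return torch.cat([video_ctx, text_emb + grounded], dim=1)

# Shape check with dummy inputs
fusion = MultimodalFusionLayer()
out = fusion(torch.randn(2, 16, 1024),
             torch.randn(2, 10, 768),
             torch.randn(2, 32, 768))
print(out.shape)  # torch.Size([2, 17, 1024])
```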
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11383