MetaSight: See, Streamline, Meta-Evolve --- A Super Efficient Multimodal Agent that Evolves at the Edge

Haoqin Tu; Jianwen Chen; Zijun Wang; Juncheng Wu; Hardy Chen; Haonian Ji; Kaiwen Xiong; Jiaqi Liu; Peng Xia; Jieru Mei; Hongliang Fei; Jason Eshraghian; Zeyu Zheng; Yuyin Zhou; Huaxiu Yao; Cihang Xie

Published: 15 May 2026, Last Modified: 20 May 2026. Venue: AgentSkills 2026 Poster. License: CC BY 4.0

Keywords: Self-Evolve, Efficiency, Multimodal Agent

Abstract: Multimodal large language models (MLLMs) are deployed today mostly as static endpoints with hard budgets: every additional video frame and prompt token costs latency and dollars, and the model has no mechanism to learn from the questions it gets wrong. We present MetaSight, a self-evolving multimodal agent that addresses both problems via hybrid encoding across three layers: a cascaded edge-side frame gate; hot/cold skill injection with top-$k$ retrieved reasoning skills; and memory routed into a skill evolver, so that each retrieved exemplar reshapes the skill bank that serves every future question rather than being concatenated alongside skills into the per-question prompt, as in prior memory-augmented agents. Across $4$ video-QA benchmarks and $2$ VLM families (Gemini 3 Flash and GPT-5.2), MetaSight cuts per-question API cost by 98% on average versus full-frame upload (peak 99.3% on Video-MME long) and by 25.9% relative to the offline uniform 8-frame ceiling at the same evolved skill bank configuration, while boosting accuracy in most settings, e.g., an average +3.85% and a peak +15.80% on EgoSchema using Gemini 3 Flash. At a frame budget matched to the offline uniform baseline of 8 frames per video, our cascade+uniform-filling variant still beats the straightforward uniform-8 upper bound on almost all benchmarks (e.g., 67.4% vs. 65.7% on average with FullEvo), and our offline-best configuration with full evolution surpasses Gemini 1.5 Pro on EgoSchema with a smaller backbone. These properties make MetaSight a natural fit for live edge applications such as AI glasses, where the cascade reduces a $1$-hour streaming session from around 3600 API uploads to only $5$ to $20$ calls.
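
To make the streaming figure in the abstract concrete, the short sketch below works out the implied reduction in API calls. It is only an arithmetic illustration: the helper `upload_reduction` is a hypothetical name, and the one-upload-per-second rate is an assumption inferred from "around 3600 API uploads" in a 1-hour session, not a detail stated by the authors.

```python
def upload_reduction(baseline_calls: int, gated_calls: int) -> float:
    """Fraction of API calls removed by gating, relative to the baseline."""
    if baseline_calls <= 0:
        raise ValueError("baseline_calls must be positive")
    return 1.0 - gated_calls / baseline_calls


# Assumption: one upload per second, so a 1-hour session is about 3600 uploads.
BASELINE_CALLS = 60 * 60

# The abstract reports 5 to 20 calls per session after the cascade.
for gated in (5, 20):
    reduction = upload_reduction(BASELINE_CALLS, gated)
    print(f"{gated} calls: {reduction:.2%} fewer uploads")
# prints:
# 5 calls: 99.86% fewer uploads
# 20 calls: 99.44% fewer uploads
```

This counts calls only. The 98% per-question cost reduction quoted in the abstract is a separate measurement on the video-QA benchmarks and cannot be derived from these call counts.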

Presentation Mode: Yes, at least one author will attend and present in person.

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 87
