MetaSight: See, Streamline, Meta-Evolve --- A Super Efficient Multimodal Agent that Evolves at the Edge

Published: 15 May 2026, Last Modified: 20 May 2026, Venue: AgentSkills 2026 Poster, Readers: Everyone, License: CC BY 4.0
Keywords: Self-Evolve, Efficiency, Multimodal Agent
Abstract: Multimodal large language models (MLLMs) are deployed today mostly as static endpoints with hard budgets: every additional video frame and prompt token costs latency and dollars, and the model has no mechanism to learn from the questions it gets wrong. We present MetaSight, a self-evolving multimodal agent that addresses both problems via hybrid encoding across three layers: a cascaded edge-side frame gate, hot/cold skill injection with top-$k$ retrieved reasoning skills, and memory routed into a skill evolver, so that each retrieved exemplar reshapes the skill bank that serves every future question rather than being concatenated alongside skills into the per-question prompt, as in prior memory-augmented agents. Across $4$ video-QA benchmarks and $2$ VLM families (Gemini 3 Flash and GPT-5.2), MetaSight cuts per-question API cost by an average of 98% versus full-frame upload (peak 99.3% on Video-MME long) and by 25.9% relative to the offline uniform 8-frame ceiling at the same evolved skill bank configuration, while improving accuracy in most settings, e.g., by an average of +3.85% and a peak of +15.80% on EgoSchema with Gemini 3 Flash. At a matched frame budget of 8 uniformly sampled offline frames per video, our cascade+uniform-filling variant still beats the straightforward uniform-8 upper bound on almost all benchmarks (e.g., 67.4% vs. 65.7% on average with FullEvo), and our offline-best configuration with full evolution surpasses Gemini 1.5 Pro on EgoSchema with a smaller backbone. These properties make MetaSight a natural fit for live edge applications such as AI glasses, where the cascade reduces a $1$-hour streaming session from around 3600 API uploads to only $5$ to $20$ calls.
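The streaming figures at the end of the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `upload_reduction` is hypothetical, and the reading of 3600 uploads as roughly one frame per second over a 1-hour session is an assumption, not something the abstract states. It counts API uploads, not dollars, so it is separate from the measured per-question cost savings reported above.

```python
def upload_reduction(baseline_uploads: int, gated_calls: int) -> float:
    """Return the fraction of API uploads avoided relative to the baseline."""
    if baseline_uploads <= 0:
        raise ValueError("baseline_uploads must be positive")
    return 1.0 - gated_calls / baseline_uploads


# From the abstract: around 3600 uploads for a 1-hour session
# (assumed here to mean about one frame per second).
BASELINE_UPLOADS = 3600

# From the abstract: the cascade leaves only 5 to 20 API calls.
for calls in (5, 20):
    saved = upload_reduction(BASELINE_UPLOADS, calls)
    print(f"{calls} calls: {saved:.2%} fewer uploads")
```

Under these assumptions, both ends of the 5-to-20 range correspond to removing more than 99% of the uploads in the streaming scenario.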
Presentation Mode: Yes, at least one author will attend and present in person.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 87