PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios

Xudong Lu; Guan Huankang; Yang Bo; Jinpeng Chen; Xintong Guo; Shuhan LI; Fang Liu; Peiwen Sun; Xueying Li; Wei Zhang; Xue Yang; Rui Liu; Hongsheng Li

Published: 30 Apr 2026, Last Modified: 24 Jun 2026. Venue: ICML 2026 regular. License: CC BY 4.0

TL;DR: We introduce PhoStream, a mobile-centric streaming benchmark that unifies on-screen and off-screen scenarios for evaluating video, audio, and temporal reasoning.

Abstract: Multimodal Large Language Models excel at offline audio-visual understanding, but their ability to serve as mobile assistants in continuous real-world streams remains underexplored. In daily phone use, mobile assistants must track streaming audio-visual inputs and respond at the right time, yet existing benchmarks are often restricted to multiple-choice questions or use shorter videos. In this paper, we introduce **PhoStream**, the first mobile-centric streaming benchmark that unifies on-screen and off-screen scenarios to evaluate video, audio, and temporal reasoning. PhoStream contains 5,572 open-ended QA pairs from 578 videos across 4 scenarios and 10 capabilities. We build it with an Automated Generative Pipeline backed by rigorous human verification, and evaluate models using a realistic Online Inference Pipeline and LLM-as-a-Judge evaluation for open-ended responses. Experiments reveal a temporal asymmetry in LLM-judged scores (0 to 100): models perform well on Instant and Backward tasks (Gemini 3 Pro exceeds 80), but drop sharply on Forward tasks (16.40), largely due to early responses before the required visual and audio cues appear. This highlights a fundamental limitation: current MLLMs struggle to decide ***when*** to speak, not just ***what*** to say. Code and datasets will be made publicly accessible at https://github.com/Lucky-Lance/PhoStream.

Lay Summary: Smartphone AI assistants are becoming increasingly capable of understanding images, videos, and speech, but real-life use is often more challenging than offline tests. In many situations, a user asks a question while a video or phone screen is still unfolding, so the assistant must decide whether to answer immediately or keep watching until enough evidence appears. We introduce **PhoStream**, a benchmark designed to test this kind of real-time mobile assistance. It contains more than 5,500 open-ended questions from real mobile scenarios, including phone tutorials, app recordings, daily videos, and first-person videos. Unlike many existing benchmarks, PhoStream evaluates both what the assistant answers and whether it responds at the right time. Our experiments show that many advanced multimodal AI models perform well when the answer is already visible, but struggle when they need to wait for future information. They often answer too early, before the relevant visual or audio evidence appears. This reveals an important limitation of current AI assistants: they need better judgment about **when to speak**, not just better ability to understand content. PhoStream provides a practical testbed for building more reliable mobile assistants for real-world streaming scenarios.

Originally Submitted Supplementary Material: zip

Link To Code: https://github.com/Lucky-Lance/PhoStream

Primary Area: Applications->Computer Vision

Keywords: Benchmark, Streaming Video Understanding, Multimodal Large Language Models, Mobile Assistant

Originally Submitted PDF: pdf

Submission Number: 159
