Keywords: Multimodal Large Language Models, GUI Navigation, GUI Agents, Smart TV
Abstract: Smart TVs are central to modern home entertainment, yet their interfaces remain cumbersome to navigate, especially for tasks like searching video content using on-screen keyboards. We introduce TVAgent, a lightweight vision–language model (VLM)–based system that enables reliable, real-time Smart TV interaction through keyword-based video search and seamless intra-page and inter-page navigation. To handle diverse and frequently changing app layouts, TVAgent integrates: (1) a fine-tuned lightweight VLM for robust and generalizable GUI parsing, trained with a synthetic data generation pipeline that produces realistic, annotated Smart TV layouts to enable scalable training and rapid domain adaptation, (2) a Multinomial–Dirichlet Modeling (MDM)–driven navigation module specialized for video content search, allowing adaptive traversal of heterogeneous virtual keyboards without manual layout mapping, and (3) a dynamic knowledge base for platform-aware adaptation, reducing redundant exploration and improving responsiveness over time. Across real apps, TVAgent achieves a 97.7% success rate on page-level navigation, 100% on video search, and 83.0% on intra-page content navigation, with sub-second latency throughout. TVAgent addresses pressing usability barriers in Smart TV navigation and demonstrates a clear path to near-term, on-device integration, offering significant potential for enhancing accessibility, efficiency, and personalization in home entertainment systems.
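To illustrate the kind of mechanism the MDM-driven navigation module describes, here is a minimal, hypothetical sketch (not from the paper) of a Dirichlet–multinomial belief update: the agent keeps a Dirichlet posterior over candidate keyboard layouts and sharpens it as layouts are observed, so traversal adapts without a manual layout map. All names and the layout categories are illustrative assumptions.

```python
# Hypothetical sketch of a Dirichlet-multinomial belief over keyboard
# layouts; the actual MDM formulation in TVAgent may differ.
from collections import Counter


class DirichletMultinomial:
    def __init__(self, categories, prior=1.0):
        # Symmetric Dirichlet prior over candidate layout categories.
        self.alpha = {c: prior for c in categories}

    def update(self, observations):
        # Posterior update: add observed counts to the Dirichlet parameters.
        for c, n in Counter(observations).items():
            self.alpha[c] += n

    def predictive(self, c):
        # Posterior predictive probability of category c.
        return self.alpha[c] / sum(self.alpha.values())


# Illustrative layout categories (assumed, not from the paper).
belief = DirichletMultinomial(["qwerty", "abc_grid", "t9"])
belief.update(["qwerty", "qwerty", "abc_grid"])
most_likely = max(belief.alpha, key=belief.predictive)
```

With a symmetric prior of 1 and observations {qwerty: 2, abc_grid: 1}, the predictive probability of `qwerty` is (1+2)/(3+3) = 0.5, so the agent would plan traversal for that layout first while remaining able to revise its belief.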
Submission Number: 42