Poster: Towards Efficient Spatio-Temporal Video Grounding in Pervasive Mobile Devices

Dulanga Kaveesha Weerakoon Weerakoon Mudiyanselage, Vigneshwaran Subbaraju, Joo Hwee Lim, Archan Misra

Published: 2024, Last Modified: 08 Feb 2025, MobiSys 2024, CC BY-SA 4.0

Abstract: As the use of pervasive devices expands into complex collaborative tasks such as cognitive assistants and interactive AR/VR companions, these devices are equipped with a myriad of sensors that facilitate natural interactions, such as voice commands. Spatio-Temporal Video Grounding (STVG), the task of identifying the target object in the field of view referred to in a language instruction, is a key capability needed for such systems. However, current STVG models tend to be resource-intensive, relying on multiple cross-attentional transformers applied to each video frame. This results in runtime complexity that increases linearly with video length. Furthermore, deploying these models on mobile devices while maintaining low latency poses additional challenges. Hence, this paper explores the latency and energy requirements for implementing STVG models on a pervasive device.