Abstract: Virtual Reality (VR) headsets have become increasingly popular for remote collaboration, but video conferencing poses challenges when the user's face is covered by the headset. Existing solutions remain limited in accessibility. In this paper, we propose HeadsetOff, a novel system that achieves photorealistic video conferencing on economical VR headsets by leveraging voice-driven face reconstruction. HeadsetOff consists of three main components: a multimodal attention-based predictor, a generator, and an adaptive controller. The predictor effectively predicts the user's future behavior from multiple modalities. The generator animates the user's face from voice input, head motion, and eye blinks. The adaptive controller dynamically selects the appropriate generator model based on the trade-off between video quality and delay, aiming to maximize Quality of Experience while minimizing latency. Experimental results demonstrate the effectiveness of HeadsetOff in achieving high-quality, low-latency video conferencing on economical VR headsets.
Primary Subject Area: [Systems] Systems and Middleware
Secondary Subject Area: [Experience] Interactions and Quality of Experience
Relevance To Conference: It proposes a novel system that enables photorealistic video conferencing on economical VR headsets, which is a multimedia application involving multiple modalities such as video, audio, and user motion data.
It introduces a multimodal attention-based predictor that can effectively predict a user's future behavior by fusing modalities such as head motion, eye blinks, voice, and gaze direction using cross-modal attention and self-attention mechanisms.
The generator component animates the human face from voice input, head motion, and eye blinks, achieving photorealistic video synthesis. This multimodal fusion of audio and behavior data for realistic face synthesis is a key aspect of multimedia processing.
The system leverages an adaptive controller that dynamically selects the appropriate generator model based on the trade-off between video quality and delay, considering the current network conditions. This adaptive bitrate selection and quality-delay balancing is essential for efficient multimedia streaming and delivery.
Overall, this work is a comprehensive multimedia/multimodal processing system that integrates various modalities, including audio, video, and user motion data, for photorealistic video conferencing in VR environments, addressing challenges in both multimedia generation and delivery.
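To make the quality-delay balancing concrete, here is a minimal illustrative sketch (not the paper's actual algorithm): a controller that scores each candidate generator model by its quality minus a penalty proportional to the delay its size would incur under the current network bandwidth. The model names, quality scores, sizes, and the weight `alpha` are all hypothetical.

```python
# Hypothetical sketch of an adaptive controller that trades off
# video quality against delay under current network conditions.

def select_model(models, bandwidth_mbps, alpha=0.1):
    """Pick the model maximizing quality minus a weighted delay penalty."""
    best, best_score = None, float("-inf")
    for m in models:
        # Approximate delay as transfer time of the model's output/payload.
        delay_s = m["size_mb"] * 8 / bandwidth_mbps
        score = m["quality"] - alpha * delay_s
        if score > best_score:
            best, best_score = m, score
    return best["name"]

# Hypothetical candidate generator models.
models = [
    {"name": "small",  "size_mb": 5,  "quality": 0.60},
    {"name": "medium", "size_mb": 20, "quality": 0.80},
    {"name": "large",  "size_mb": 80, "quality": 0.95},
]

print(select_model(models, bandwidth_mbps=1000))  # ample bandwidth favors a larger model
print(select_model(models, bandwidth_mbps=5))     # scarce bandwidth favors a smaller model
```

In a deployed system the delay term would come from measured network conditions and model inference time rather than a fixed size, but the same score-and-select structure applies.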
Submission Number: 3944