Abstract: Natural face-to-face human-robot conversation is one of the most important capabilities for virtual humans in virtual reality and the metaverse. However, when wake-up relies on Voice Activity Detection (VAD) alone, the robot is often activated unintentionally. To address this issue, we propose a Multimodal Activity Detection (MAD) scheme, which considers not only voice but also gaze, lip movement, and talking content to decide whether a person is activating the robot. A dataset for large-screen-based virtual human conversation is collected from various challenging cases. Experimental results show that the proposed MAD greatly outperforms the VAD-only method.
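To illustrate why combining cues can suppress unintended wake-ups, the following is a minimal sketch and not the fusion method of this work. The `Cues` fields, the 0.5 threshold, and the all-cues-must-pass rule are assumptions introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cues:
    """Per-utterance cue scores in [0, 1]; all four fields are hypothetical inputs."""
    voice: float    # voice activity score (the only cue a VAD-only system uses)
    gaze: float     # likelihood that the person is looking at the robot
    lip: float      # likelihood that the person's lips are moving
    content: float  # likelihood that the spoken content is addressed to the robot


def vad_only(cues: Cues, threshold: float = 0.5) -> bool:
    """Baseline: wake up whenever voice is detected."""
    return cues.voice >= threshold


def mad_all_cues(cues: Cues, threshold: float = 0.5) -> bool:
    """Illustrative rule (an assumption): wake up only if every cue passes the threshold."""
    return min(cues.voice, cues.gaze, cues.lip, cues.content) >= threshold


# A bystander talking to someone else: voice is present, gaze and content are not.
bystander = Cues(voice=0.9, gaze=0.1, lip=0.8, content=0.2)
# A user addressing the robot directly.
user = Cues(voice=0.9, gaze=0.9, lip=0.8, content=0.7)
```

In this sketch the bystander triggers the voice-only baseline but not the combined rule, while the user triggers both. A learned classifier over the same cues would be a natural alternative to the fixed rule.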