Abstract: Micro lip reading is characterized by the tiny lip movements a person makes while speaking. It is highly desirable in robotics and automation, e.g., social robots for geriatric care, patrol robot systems for hospitals, and speech systems for hearing-impaired individuals, yet it has not been well studied before. In this paper, we shed light on this research topic. We first establish a labelled micro lip reading dataset (HUST-LMLR) of 399 video samples, captured from unconstrained movies. One key challenge of micro lip reading in the wild lies in extracting fine features of lip movements effectively and robustly. We address this issue from two aspects: facial context attention and feature extraction. First, we propose, for the first time, a multi-task learning model for micro lip reading in the wild. It performs micro lip reading and facial landmark detection jointly, capturing global face context with soft attention to facilitate micro lip reading. Second, we propose that motion features should be combined with appearance features to characterize tiny lip movements effectively. Finally, experiments on HUST-LMLR demonstrate the challenges of our dataset, and our proposed approach improves on the state of the art by over 26% WER on HUST-LMLR. Results on the well-known public dataset LRS2 further show the generalization and superiority of our approach. If accepted, we will publish our HUST-LMLR dataset and related supporting materials at https://hust-dpkw.github.io/Micro-lip-reading/.

Note to Practitioners—This paper aims to construct a challenging public dataset and propose a novel vision-based approach for automatic micro lip reading under unconstrained conditions.
A labelled dataset, HUST-LMLR, was first established to reveal the "micro" characteristics, with 399 video samples captured from 40 unconstrained movies and documentaries. In addition, two key algorithmic strategies are proposed to address sentence-level micro lip reading in the wild: multi-task learning with facial landmark detection, and feature extraction that combines appearance and motion. Extensive experiments demonstrate both the challenges of our HUST-LMLR dataset and the superiority of our approach on micro lip reading tasks. Nevertheless, the approach remains somewhat sensitive to extremely tiny variations in lip movements and to dramatic variations in human posture. It is worth noting that our proposal can be applied not only to medical support and smart health care, but also to human-computer interaction, special education, information security, assisted driving, virtual reality, etc.
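The idea of pairing appearance with motion features can be sketched minimally. The function below is a hypothetical illustration (not the paper's actual pipeline), using a simple frame-difference motion cue alongside each frame's raw appearance; the function name and shapes are assumptions for demonstration only:

```python
import numpy as np

def motion_appearance_features(frames):
    """Hypothetical sketch: pair each frame's appearance with a
    frame-difference motion cue, one simple way to combine the two streams.

    frames: array-like of shape (T, H, W), grayscale mouth-region crops.
    Returns an array of shape (T-1, 2, H, W): channel 0 is appearance,
    channel 1 is the temporal difference to the previous frame.
    """
    frames = np.asarray(frames, dtype=np.float32)  # (T, H, W)
    motion = np.diff(frames, axis=0)               # (T-1, H, W) frame differences
    appearance = frames[1:]                        # align appearance with motion
    return np.stack([appearance, motion], axis=1)  # (T-1, 2, H, W)

# Usage on a dummy 8-frame clip of 32x32 crops
clip = np.random.rand(8, 32, 32)
feats = motion_appearance_features(clip)
print(feats.shape)  # (7, 2, 32, 32)
```

In practice the motion stream would more likely come from optical flow or a learned temporal module, but the same channel-wise pairing applies.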
External IDs: dblp:journals/tase/WangCCZLHYSZDX25