Abstract: The development of audio-video speech recognition (AVSR) systems has attracted growing attention, and substantial progress has been made in recent years. However, the audio-video speech corpora that underpin this research are scarce, and most are small in scale; the lack of an appropriate audio-visual corpus is a major obstacle to AVSR research. For AVSR research in China, even fewer audio-video speech corpora are available. In addition, features extracted from video images are easily affected by light intensity, directional distortion, and image quality. To address these challenges, this paper presents a large-scale, depth-based multimodal audio-visual corpus. The corpus contains 22.4 hours of speech, comprising a total of 10,074 phonetically balanced utterances read by 69 speakers, and was created specifically to support research on AVSR in Chinese and on speaker verification. The associated database includes high-resolution, high-frame-rate video image streams, depth image streams, 3D information, and audio, all captured with Microsoft Kinect for Windows version 2. A voice recorder was also used to record audio streams. The paper describes the process of building the corpus, including the organization of language material, recording, labeling, and post-processing. Preliminary results on the corpus are presented at the end of the paper.