Abstract: In recent years, audio-visual speech recognition (AVSR) assistance systems have gained increasing attention from researchers as an important part of human-computer interaction (HCI). The objective of this paper is to further advance assistive technologies in the AVSR field by introducing OpenAV, a multi-modal dataset intended for training state-of-the-art neural network models. OpenAV is designed for training AVSR models that assist persons without hands, or with hand or arm disabilities, in HCI. The dataset can also be useful to ordinary users for hands-free, contactless HCI. It currently includes recordings of 15 speakers in two languages (English and Russian), with at least 10 recording sessions per speaker. We provide a detailed description of the dataset and its collection pipeline. In addition, we evaluate a state-of-the-art audio-visual (AV) speech recognition approach and present baseline recognition results. We also describe the recording methodology, release the recording software to the public, and provide open access to the dataset at https://smil-spcras.github.io/OpenAV-dataset/.