Abstract: Highlights
• Proposing a novel cross-modal audio–visual speech recognition network, named CATNet.
• Devising a cross-modal bidirectional fusion model.
• Devising an audio–visual dual-modal speech recognition network.
• CATNet is robust against noise and outperforms other benchmarks.