A New Time-Frequency Attention Mechanism for TDNN and CNN-LSTM-TDNN, with Application to Language Identification
Abstract: In this paper, we aim to improve traditional DNN x-vector language identification (LID) performance by employing Convolutional and Long Short-Term Memory-Recurrent (CLSTM) neural networks, as they can strengthen feature extraction and capture longer temporal dependencies. We also propose a two-dimensional attention mechanism. Compared with conventional one-dimensional time attention, our method introduces a frequency attention mechanism that assigns different weights to different frequency bands to generate weighted means and standard deviations. This mechanism can direct attention to either time or frequency information, and the two can be trained and fused individually or jointly. Experimental results show, firstly, that CLSTM can significantly outperform a traditional DNN x-vector implementation. Secondly, the proposed frequency attention method is more effective than time attention, particularly when the number of frequency bands matches the feature size. Furthermore, score-level merging of frequency and time attention performs best, whereas feature-level merging yields improvements only when the frequency dimension is small.
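The abstract describes attentive statistics pooling in which attention weights produce weighted means and standard deviations, applied over either the time axis or the frequency axis. The following is a minimal NumPy sketch of one plausible reading, not the paper's implementation: the same scoring MLP (with hypothetical parameters `W`, `b`, `v`) attends over frames for time attention, or over frequency-band trajectories for frequency attention.

```python
import numpy as np

def attentive_stats(X, W, b, v, eps=1e-8):
    """Attentive statistics pooling over the rows of X.

    X: (N, D) -- N items to attend over (frames for time attention,
       frequency bands for frequency attention), each a D-dim vector.
    W: (D, H), b: (H,), v: (H,) -- parameters of a one-hidden-layer
       scoring network (hypothetical; learned jointly in practice).
    Returns the attention-weighted mean and standard deviation, each (D,).
    """
    scores = np.tanh(X @ W + b) @ v           # (N,): one score per item
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                      # softmax attention weights
    mu = alpha @ X                            # weighted mean, (D,)
    var = alpha @ (X ** 2) - mu ** 2          # weighted variance
    sigma = np.sqrt(np.maximum(var, eps))     # weighted std deviation
    return mu, sigma

rng = np.random.default_rng(0)
T, F, H = 200, 40, 16                         # frames, bands, hidden units
X = rng.standard_normal((T, F))               # frame-level features
# Time attention: attend over the T frames (each an F-dim vector).
mu_t, sd_t = attentive_stats(X, 0.1 * rng.standard_normal((F, H)),
                             np.zeros(H), 0.1 * rng.standard_normal(H))
# Frequency attention: attend over the F bands (each a T-dim trajectory).
mu_f, sd_f = attentive_stats(X.T, 0.1 * rng.standard_normal((T, H)),
                             np.zeros(H), 0.1 * rng.standard_normal(H))
print(mu_t.shape, mu_f.shape)                 # (40,) (200,)
```

The two pooled statistic pairs could then be combined at the feature level (concatenation) or the score level (fusing classifier outputs), corresponding to the two merging strategies compared in the abstract.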