Abstract: This paper describes our NWPU-BBIC system submitted to the first Chinese Auditory Attention Decoding Challenge, which aims to decode the attended speech/spatial orientation of a listener from EEG signals in multi-speaker scenes. Our system is a temporal-frequency fusion model consisting of three parts. The first is a temporal feature extractor based on the Conformer, which incorporates a Transformer into a convolutional neural network (CNN) to capture global dependencies in the temporal domain. The second is a frequency feature extractor: the EEG signals are first transformed into multi-band images, and a CNN is then combined with a convolutional long short-term memory network (ConvLSTM) to extract frequency features. The third is an adaptive weighted fusion module that integrates the temporal and frequency features. Experimental results demonstrate that our model outperforms the competition's baseline model, achieving fourth place in Track 1 with an average accuracy of 94.96% and second place in Track 2 with an average accuracy of 51.55%. These results underscore the effectiveness and robustness of our system and contribute to the advancement of research on auditory brain-computer interfaces.
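To make the fusion step concrete, below is a minimal PyTorch sketch of an adaptive weighted fusion module, assuming (since the abstract does not specify the mechanism) that the module learns one scalar logit per branch and normalizes the pair with a softmax before combining the features; the class name `AdaptiveWeightedFusion` and parameters `feat_dim` and `num_classes` are hypothetical.

```python
import torch
import torch.nn as nn

class AdaptiveWeightedFusion(nn.Module):
    """Sketch of an adaptive weighted fusion of temporal and frequency features.

    Assumption (not from the paper): each branch gets a learnable scalar
    weight, normalized with a softmax so the weights are positive and sum to 1.
    """

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        # One learnable logit per branch (temporal, frequency).
        self.branch_logits = nn.Parameter(torch.zeros(2))
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, temporal_feat: torch.Tensor, freq_feat: torch.Tensor) -> torch.Tensor:
        # temporal_feat, freq_feat: (batch, feat_dim)
        w = torch.softmax(self.branch_logits, dim=0)
        fused = w[0] * temporal_feat + w[1] * freq_feat
        return self.classifier(fused)

# Usage: fuse 128-dim branch features for a binary decision
# (e.g., left/right spatial attention).
fusion = AdaptiveWeightedFusion(feat_dim=128, num_classes=2)
logits = fusion(torch.randn(8, 128), torch.randn(8, 128))
print(logits.shape)  # torch.Size([8, 2])
```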