GenEn-MNER: Enhancing Nested Chinese NER With Multimodal Fusion and Alignment via Speech-to-Text Generation

Jinzhong Ning, Yuanyuan Sun, Zhihao Yang, Zhijun Wang, Ling Luo, Hongfei Lin, Yijia Zhang

Published: 01 Jan 2025, Last Modified: 12 Mar 2026
Venue: IEEE Transactions on Audio, Speech and Language Processing
License: CC BY-SA 4.0

Abstract: In recent years, the academic community has increasingly focused on multimodal Chinese Named Entity Recognition (NER) that utilizes speech cues. Existing methods typically rely solely on the NER objective function to guide the alignment and fusion of speech and text, overlooking the inherent alignment within speech-text pairs. Furthermore, these approaches generally employ sequence labeling techniques, which are inadequate for handling nested entities. To address these limitations, we introduce GenEn-MNER, a novel multimodal nested Chinese NER approach that enhances fusion and alignment through speech-to-text generation. This method leverages natural alignment information obtained from the speech-to-text task, using a cross-modal Transformer to integrate and align modalities. Additionally, the table-filling module redefines nested NER by conceptualizing it as the prediction of token pair relationships. Experimental results, as indicated by F1 scores, on CNERTA flat version (80.83%), CNERTA nest version (80.66%), and AISHELL-NER (94.52%) not only confirm the effectiveness of our approach but also demonstrate its superiority to existing state-of-the-art methods.
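
To make the table-filling idea concrete, the following minimal sketch shows how nested entities can be written into a token-pair table and read back out. It is an illustration under stated assumptions, not the GenEn-MNER module: the abstract does not specify the table scheme, so the (start, end) cell layout, the function names `spans_to_table` and `table_to_spans`, and the example sentence with its labels are all hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical span format: (start index, inclusive end index, entity label).
Span = Tuple[int, int, str]
Table = List[List[Optional[str]]]


def spans_to_table(n: int, spans: List[Span]) -> Table:
    """Encode entity spans as an n x n table of token-pair labels.

    Cell [i][j] holds the label of the entity that starts at token i and
    ends at token j (assumed layout). Nested entities land in different
    cells, so they can coexist, unlike one-tag-per-token sequence labeling.
    """
    table: Table = [[None] * n for _ in range(n)]
    for start, end, label in spans:
        if not (0 <= start <= end < n):
            raise ValueError(f"span ({start}, {end}) is out of range for {n} tokens")
        table[start][end] = label
    return table


def table_to_spans(table: Table) -> List[Span]:
    """Decode every labeled cell in the upper triangle back into a span."""
    return [
        (i, j, label)
        for i, row in enumerate(table)
        for j, label in enumerate(row)
        if j >= i and label is not None
    ]


# Illustrative example: a location nested inside an organization.
tokens = list("北京大学教授")
gold = [(0, 1, "LOC"), (0, 3, "ORG")]  # 北京 inside 北京大学

table = spans_to_table(len(tokens), gold)
decoded = table_to_spans(table)
```

In a trained model the cell labels would be predicted from token-pair representations rather than copied from gold spans; the sketch only shows why a pairwise table can represent overlapping entities that a single tag per character cannot.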

External IDs: doi:10.1109/taslpro.2025.3555106