USAF: Multimodal Chinese named entity recognition using synthesized acoustic features

Published: 2023, Last Modified: 04 May 2026 | Inf. Process. Manag. 2023 | CC BY-SA 4.0
Abstract: Because of the distinctive nature of Chinese word formation, the Chinese Named Entity Recognition (NER) task has attracted extensive attention in recent years. Some researchers have recently approached the problem with multimodal methods that combine acoustic features and text features. However, the text-speech data pairs these methods require are scarce in real-world scenarios, which makes the methods difficult to apply widely. To address this, we propose a multimodal Chinese NER method called USAF, which uses synthesized acoustic features instead of actual human speech. USAF aligns text and acoustic features through unique position embeddings and uses a multi-head attention mechanism to fuse the features of the two modalities, which consistently improves the performance of Chinese named entity recognition. We evaluate USAF on three Chinese NER datasets. Experimental results show that USAF consistently improves over text-only methods on every dataset and outperforms the state-of-the-art (SOTA) external-vocabulary-based method on two of them. Specifically, compared to the SOTA external-vocabulary-based method, USAF improves the F1 score by 1.84 on CNERTA and by 1.24 on Aishell3-NER.
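The abstract names two mechanisms, position-based alignment and multi-head attention fusion, without giving their details. The following is a minimal sketch of one plausible reading, not the authors' implementation. Everything in it is an assumption made for illustration: the sinusoidal position vectors, the hypothetical `frame_to_char` mapping that gives each acoustic frame the position index of the character it belongs to, the omission of learned projection matrices, and the residual addition at the end.

```python
import math
import random


def sinusoidal_position(pos, dim):
    """Sinusoidal position vector for index `pos` (an assumed choice)."""
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def add_positions(vectors, positions):
    """Add the position vector for positions[k] to vectors[k]."""
    dim = len(vectors[0])
    return [
        [v + p for v, p in zip(vec, sinusoidal_position(pos, dim))]
        for vec, pos in zip(vectors, positions)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values, num_heads):
    """Multi-head scaled dot-product attention, learned projections omitted."""
    dim = len(queries[0])
    assert dim % num_heads == 0, "dim must be divisible by num_heads"
    head_dim = dim // num_heads
    outputs = []
    for q in queries:
        fused = []
        for h in range(num_heads):
            lo, hi = h * head_dim, (h + 1) * head_dim
            # Each head attends using only its own slice of the feature vector.
            scores = [
                sum(a * b for a, b in zip(q[lo:hi], k[lo:hi])) / math.sqrt(head_dim)
                for k in keys
            ]
            weights = softmax(scores)
            for d in range(lo, hi):
                fused.append(sum(w * v[d] for w, v in zip(weights, values)))
        outputs.append(fused)
    return outputs


def fuse(text_feats, acoustic_feats, frame_to_char, num_heads=2):
    """Text characters query acoustic frames that share their position index."""
    text_pos = add_positions(text_feats, range(len(text_feats)))
    audio_pos = add_positions(acoustic_feats, frame_to_char)
    attended = cross_attention(text_pos, audio_pos, audio_pos, num_heads)
    # Residual addition keeps the original text signal (an assumption).
    return [[t + a for t, a in zip(tv, av)] for tv, av in zip(text_pos, attended)]


# Toy data: 3 characters, 5 synthesized acoustic frames, 8-dim features.
random.seed(0)
dim = 8
text_feats = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
acoustic_feats = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]
frame_to_char = [0, 0, 1, 2, 2]  # hypothetical frame-to-character alignment
fused = fuse(text_feats, acoustic_feats, frame_to_char)
print(len(fused), len(fused[0]))  # one fused vector per character
```

Because the text side supplies the queries, the fused output has one vector per character, so it could feed a standard character-level tagging layer without changing the label sequence length.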