- TL;DR: We propose a novel attention networks with the hybird encoder to solve the text representation issue of Chinese text classification, especially the language phenomena about pronunciations such as the polyphone and the homophone.
- Abstract: Chinese text classification has received more and more attention today. However, the problem of Chinese text representation still hinders the improvement of Chinese text classification, especially the polyphone and the homophone in social media. To cope with it effectively, we propose a new structure, the Extractor, based on attention mechanisms and design novel attention networks named Extractor-attention network (EAN). Unlike most of previous works, EAN uses a combination of a word encoder and a Pinyin character encoder instead of a single encoder. It improves the capability of Chinese text representation. Moreover, compared with the hybrid encoder methods, EAN has more complex combination architecture and more reducing parameters structures. Thus, EAN can take advantage of a large amount of information that comes from multi-inputs and alleviates efficiency issues. The proposed model achieves the state of the art results on 5 large datasets for Chinese text classification.