- Keywords: Language Model
- Abstract: Transformers are powerful sequence models with the potential to learn long-term dependencies. Nearly all state-of-the-art language models and pre-trained language models are based on the Transformer architecture. However, the Transformer distinguishes sequential tokens only by their token position index. We hypothesize that the Transformer can generate better contextual representations from richer positional information. To verify this, we propose a segment-aware Transformer (Segatron), which replaces the original token position encoding with a combined position encoding of paragraph, sentence, and token. We first introduce the segment-aware mechanism into Transformer-XL, a popular Transformer model with relative position encoding and a memory extension. Our method outperforms the Transformer-XL base and large models on the WikiText-103 dataset by 1.5 and 1.2 points in perplexity, respectively, which is comparable to the state-of-the-art result. We further pre-trained our model with the masked language modeling objective from BERT, but without any auxiliary tasks. Experimental results show that our pre-trained model outperforms the original BERT model on various NLP tasks.
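The combined position encoding described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the table sizes, embedding dimension, and function names are assumptions, and the key idea shown is simply that each token's encoding is the sum of a paragraph-level, a sentence-level, and a token-level position embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # assumed embedding dimension (illustrative only)

# Separate learned lookup tables for paragraph, sentence, and token
# position indices; the sizes here are illustrative assumptions.
para_table = rng.standard_normal((4, d_model))
sent_table = rng.standard_normal((16, d_model))
tok_table = rng.standard_normal((64, d_model))

def segment_aware_encoding(para_idx, sent_idx, tok_idx):
    """Combined position encoding: sum the paragraph-, sentence-,
    and token-level embeddings into one vector per token."""
    return para_table[para_idx] + sent_table[sent_idx] + tok_table[tok_idx]

# Two tokens at the same within-sentence position but in different
# sentences now receive different positional encodings.
enc_a = segment_aware_encoding(0, 1, 3)
enc_b = segment_aware_encoding(0, 2, 3)
```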