AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization

Xinsong Zhang; Hang Li

28 Sept 2020 (modified: 08 Jun 2025), ICLR 2021 Conference Blind Submission, Readers: Everyone

Keywords: Pre-trained Language Model, Multi-Grained Tokenization

Abstract: Pre-trained language models such as BERT have exhibited remarkable performance in many natural language understanding (NLU) tasks. The tokens in these models are usually fine-grained: for languages like English they are words or sub-words, and for languages like Chinese they are characters. In English, however, there are multi-word expressions that form natural lexical units, so coarse-grained tokenization also appears reasonable. In fact, both fine-grained and coarse-grained tokenization have advantages and disadvantages for learning pre-trained language models. In this paper, we propose a novel pre-trained language model, referred to as AMBERT (A Multi-grained BERT), built on both fine-grained and coarse-grained tokenization. For English, AMBERT takes both the sequence of words (fine-grained tokens) and the sequence of phrases (coarse-grained tokens) as input after tokenization, employs one encoder to process the sequence of words and another encoder to process the sequence of phrases, shares parameters between the two encoders, and finally creates a sequence of contextualized representations of the words and a sequence of contextualized representations of the phrases. Experiments have been conducted on benchmark datasets for Chinese and English, including CLUE, GLUE, SQuAD and RACE. The results show that AMBERT outperforms the best-performing existing models in almost all cases, and the improvements are particularly significant for Chinese. We also develop a version of AMBERT that performs as well as AMBERT but uses about half of its inference time.
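
To make the two-granularity input described in the abstract concrete, the following is a minimal toy sketch in Python, not the authors' implementation. The phrase lexicon, the greedy longest-match rule, and all function names are assumptions made for illustration. The `shared_encoder` function is only a placeholder that shows the same function (standing in for shared parameters) being applied to both token sequences.

```python
# Toy illustration only: the phrase lexicon and helper names are hypothetical,
# not taken from the AMBERT paper or its code.
PHRASE_LEXICON = {("new", "york"), ("ice", "cream")}
MAX_PHRASE_LEN = 2


def fine_grained(text):
    """Split text into lowercase words (fine-grained tokens)."""
    return text.lower().split()


def coarse_grained(words, lexicon=PHRASE_LEXICON, max_len=MAX_PHRASE_LEN):
    """Greedily merge the longest lexicon match at each position into one phrase token."""
    tokens, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            span = tuple(words[i:i + n])
            # A single word is always accepted as a fallback token.
            if n == 1 or span in lexicon:
                tokens.append(" ".join(span))
                i += n
                break
    return tokens


def shared_encoder(tokens):
    """Placeholder for an encoder whose parameters are reused for both sequences."""
    return [(tok, len(tok)) for tok in tokens]  # dummy per-token "representation"


words = fine_grained("She ate ice cream in New York")
phrases = coarse_grained(words)
word_repr = shared_encoder(words)      # one pass over the word sequence
phrase_repr = shared_encoder(phrases)  # a second pass with the same function
```

In this sketch the word sequence has seven tokens while the phrase sequence has five, because "ice cream" and "new york" are each merged into a single coarse-grained token; each sequence gets its own list of per-token outputs.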

One-sentence Summary: We propose a novel pre-trained language model with multi-grained tokenization that can make full use of the advantages of both fine-grained and coarse-grained tokenization.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/ambert-a-pre-trained-language-model-with/code)

Reviewed Version (pdf): https://openreview.net/references/pdf?id=eIGebMYyD
