GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Alex Wang; Amanpreet Singh; Julian Michael; Felix Hill; Omer Levy; Samuel R. Bowman

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman

Published: 21 Dec 2018, Last Modified: 22 Jun 2025ICLR 2019 Conference Blind SubmissionReaders: Everyone

Abstract: For natural language understanding (NLU) technology to be maximally useful, it must be able to process language in a way that is not exclusive to a single task, genre, or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation (GLUE) benchmark, a collection of tools for evaluating the performance of models across a diverse set of existing NLU tasks. By including tasks with limited training data, GLUE is designed to favor and encourage models that share general linguistic knowledge across tasks. GLUE also includes a hand-crafted diagnostic test suite that enables detailed linguistic analysis of models. We evaluate baselines based on current methods for transfer and representation learning and find that multi-task training on all tasks performs better than training a separate model per task. However, the low absolute performance of our best model indicates the need for improved general NLU systems.

Keywords: natural language understanding, multi-task learning, evaluation

TL;DR: We present a multi-task benchmark and analysis platform for evaluating generalization in natural language understanding systems.

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 7 code implementations](https://www.catalyzex.com/paper/glue-a-multi-task-benchmark-and-analysis/code)

9 Replies

Loading