BLM-s/lE: A structured dataset of English spray-load verb alternations for testing generalization in LLMs

Giuseppe Samo; Vivi Nastase; Chunyang Jiang; Paola Merlo

BLM-s/lE: A structured dataset of English spray-load verb alternations for testing generalization in LLMs

Giuseppe Samo, Vivi Nastase, Chunyang Jiang, Paola Merlo

Published: 07 Oct 2023, Last Modified: 01 Dec 2023EMNLP 2023 FindingsEveryoneRevisionsBibTeX

Submission Type: Regular Long Paper

Submission Track: Resources and Evaluation

Submission Track 2: Language Modeling and Analysis of Language Models

Keywords: verb alternation, generating rules, sentence embeddings

Abstract: Current NLP models appear to be achieving performance comparable to human capabilities on well-established benchmarks. New benchmarks are now necessary to test deeper layers of understanding of natural languages by these models. Blackbird's Language Matrices are a recently developed framework that draws inspiration from tests of human analytic intelligence. The BLM task has revealed that successful performances in previously studied linguistic problems do not yet stem from a deep understanding of the generative factors that define these problems. In this study, we define a new BLM task for predicate-argument structure, and develop a structured dataset for its investigation, concentrating on the spray-load verb alternations in English, as a case study. The context sentences include one alternant from the spray-load alternation and the target sentence is the other alternant, to be chosen among a minimally contrastive and adversarial set of answers. We describe the generation process of the dataset and the reasoning behind the generating rules. The dataset aims to facilitate investigations into how verb information is encoded in sentence embeddings and how models generalize to the complex properties of argument structures. Benchmarking experiments conducted on the dataset and qualitative error analysis on the answer set reveal the inherent challenges associated with the problem even for current high-performing representations.

Submission Number: 3507

Loading