Abstract: Large foundation language models and Transformer-based neural language models have exhibited outstanding performance on various downstream tasks. However, there is limited understanding of how these models internalize linguistic knowledge, and various linguistic benchmarks have recently been proposed to facilitate the syntactic evaluation of language models across languages. This paper introduces FrCoLA (French Corpus of Linguistic Acceptability Judgments), consisting of 25,153 sentences annotated with binary acceptability judgments and categorized into four linguistic phenomena. These sentences were manually extracted from an official online resource maintained by a Québec Government institution and divided into in-domain data splits. Moreover, we manually extracted 2,675 additional sentences from a second source, maintained by a France-based organization, to create an out-of-domain hold-out split. We then evaluate the linguistic capabilities of three different language models on each of seven linguistic acceptability judgment benchmarks. The results show that, for most languages, fine-tuned Transformer-based neural language models are, on average, strong baselines on the binary linguistic acceptability classification task. However, on the FrCoLA benchmark, a fine-tuned Transformer-based model outperformed, on average, the other methods tested.
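To make the evaluation setup concrete, below is a minimal sketch of a fine-tuned Transformer baseline for binary acceptability classification, assuming a HuggingFace-style training loop. The checkpoint (camembert-base), file names, and column names are illustrative assumptions; the abstract does not specify which three models were evaluated or how the data is formatted.

    # Minimal sketch of a binary acceptability-classification baseline
    # (illustrative; the paper's actual models, hyperparameters, and
    # data format are not specified in this abstract).
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    # Hypothetical CSV files with a "sentence" column and a binary
    # "label" column (1 = acceptable, 0 = unacceptable), mirroring
    # CoLA-style acceptability data.
    data = load_dataset("csv", data_files={"train": "frcola_train.csv",
                                           "validation": "frcola_dev.csv"})

    # camembert-base is an assumed choice of French Transformer encoder.
    checkpoint = "camembert-base"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=2)

    def tokenize(batch):
        return tokenizer(batch["sentence"], truncation=True)

    data = data.map(tokenize, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="frcola-baseline",
                               num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=data["train"],
        eval_dataset=data["validation"],
        tokenizer=tokenizer,
    )
    trainer.train()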
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, NLP datasets, acceptability judgments
Contribution Types: Reproduction study, Data resources
Languages Studied: English, Swedish, Italian, Russian, Chinese, Norwegian, Japanese, French
Section 2 Permission To Publish Peer Reviewers Content Agreement: Authors grant permission for ACL to publish peer reviewers' content
Submission Number: 35