Abstract: Large foundation language models and Transformer-based neural language models have exhibited outstanding performance on various downstream tasks. However, there is limited understanding of how these models internalize linguistic knowledge, and various linguistic benchmarks have recently been proposed to facilitate the syntactic evaluation of language models across languages. This paper introduces FrCoLA (French Corpus of Linguistic Acceptability Judgments), consisting of 25,153 sentences annotated with binary acceptability judgments and categorized into four linguistic phenomena. These sentences were manually extracted from an official online resource maintained by a Québec Government institution and divided into in-domain data splits. Moreover, we manually extracted 2,675 additional sentences from a second source, maintained by a France-based organization, to create an out-of-domain hold-out split. We then evaluate the linguistic capabilities of three different language models on each of seven linguistic acceptability judgment benchmarks. The results show that, for most languages, fine-tuned Transformer-based neural language models are, on average, strong baselines on the binary linguistic acceptability classification task. However, on the FrCoLA benchmark, a fine-tuned Transformer-based model outperformed, on average, the other methods tested.
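To make the evaluation setup concrete, below is a minimal sketch of a fine-tuned Transformer baseline for binary acceptability classification, assuming a HuggingFace-style training loop. The checkpoint (camembert-base), file names, and column names are illustrative assumptions; the abstract does not specify which three models were evaluated or how the data is formatted.

    # Minimal sketch of a binary acceptability-classification baseline
    # (illustrative; the paper's actual models, hyperparameters, and
    # data format are not specified in this abstract).
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    # Hypothetical CSV files with a "sentence" column and a binary
    # "label" column (1 = acceptable, 0 = unacceptable), mirroring
    # CoLA-style acceptability data.
    data = load_dataset("csv", data_files={"train": "frcola_train.csv",
                                           "validation": "frcola_dev.csv"})

    # camembert-base is an assumed choice of French Transformer encoder.
    checkpoint = "camembert-base"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=2)

    def tokenize(batch):
        return tokenizer(batch["sentence"], truncation=True)

    data = data.map(tokenize, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="frcola-baseline",
                               num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=data["train"],
        eval_dataset=data["validation"],
        tokenizer=tokenizer,
    )
    trainer.train()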
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, NLP datasets, acceptability judgments
Contribution Types: Reproduction study, Data resources
Languages Studied: English, Swedish, Italian, Russian, Chinese, Norwegian, Japanese, French
Section 2 Permission To Publish Peer Reviewers Content Agreement: Authors grant permission for ACL to publish peer reviewers' content
Submission Number: 35