A Novel Resource for English L2

ACL ARR 2025 February Submission2335 Authors

14 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: The availability of suitable learner corpora is paramount for the study of second language acquisition (SLA) and language transfer. However, curating learner corpora is challenging because high-quality learner data is rarely publicly available. As a result, only a few such corpora, such as ICLE and TOEFL-11, are available to the community. To address this gap, we present Anonymous, a novel English learner corpus with longitudinal data. Anonymous contains texts written by adult learners taking English-as-a-second-language courses in the USA, with the goal of either preparing for university admission or improving their language proficiency while starting their university degrees. It comprises 687 instances written by speakers of 15 different L1s. Unlike most learner corpora, it contains longitudinal data, which enables researchers to investigate language learning over time. We present two case studies using Anonymous at the intersection of SLA and Computational Linguistics: (1) Native Language Identification (NLI); and (2) a quantitative and qualitative study using LLMs on linguistic features influenced by L1.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: linguistic theories, computational psycholinguistics, GEC, educational applications, NLI
Contribution Types: Data resources, Data analysis
Languages Studied: English, Arabic, Chinese, Vietnamese
Submission Number: 2335