Large-Scale Native Language Identification with Cross-Corpus Evaluation

Shervin Malmasi, Mark Dras

2015 (modified: 16 Jul 2019), HLT-NAACL 2015, Readers: Everyone

Abstract: We present a large-scale Native Language Identification (NLI) experiment on new data, with a focus on cross-corpus evaluation to identify corpus- and genre-independent language transfer features. We test a new corpus and show it is comparable to other NLI corpora and suitable for this task. Cross-corpus evaluation on two large corpora achieves good accuracy and evidences the existence of reliable language transfer features, but lower performance also suggests that NLI models are not completely portable across corpora. Finally, we present a brief case study of features distinguishing Japanese learners’ English writing, demonstrating the presence of cross-corpus and cross-genre language transfer features that are highly applicable to SLA and ESL research.
