Script-Agnostic Language Identification

Anonymous

16 Dec 2023 | ACL ARR 2023 December Blind Submission | Readers: Everyone
Abstract: Language identification is the first step in many data collection and crawling efforts because it allows us to sort online text into language-specific buckets. However, many modern languages, such as Konkani, Kashmiri, and Punjabi, are synchronically written in several scripts. Moreover, languages with different writing systems do not share significant lexical, semantic, and syntactic properties in neural representation spaces, which is a disadvantage for closely related languages and low-resource languages, especially those of the Indian subcontinent. To counter this, we propose learning script-agnostic embeddings using several experimental strategies (upscaling, flattening, and script mixing), focusing on four major Dravidian languages (Tamil, Telugu, Kannada, and Malayalam). We find that word-level script randomization and exposure to a language written in multiple scripts are extremely valuable for script-agnostic language identification, while maintaining competitive performance on naturally occurring text.
Paper Type: long
Research Area: Multilinguality and Language Diversity
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: Tamil, Kannada, Malayalam, Telugu
