IndiSocialFT: Multilingual Word Representation for Indian languages in code-mixed environment

Saurabh Kumar; Ranbir Singh Sanasam; Sukumar Nandi

Published: 07 Oct 2023, Last Modified: 01 Dec 2023, EMNLP 2023 Findings

Submission Type: Regular Short Paper

Submission Track: Resources and Evaluation

Keywords: Indian Languages, Multilingual Word Embedding, Code-mixed, Social Media Text

Abstract: The increasing number of Indian language users on the internet necessitates the development of Indian language technologies. In response to this demand, our paper presents a generalized word representation that covers diverse text characteristics, including native scripts, transliterated text, multilingual and code-mixed text, and social media-related attributes. We gather text from both social media and well-formed sources and use the FastText model to create the "IndiSocialFT" embedding. Through intrinsic and extrinsic evaluation methods, we compare IndiSocialFT with three popular pretrained embeddings trained on Indian languages. Our findings show that the proposed embedding surpasses the baselines in most cases and languages, demonstrating its suitability for various NLP applications.

Submission Number: 3202
