Toshakhana: A Multidimensional Panjabi Corpus in Gurmukhi Script

Published: 01 Jan 2024, Last Modified: 05 Mar 2025, Venue: ACM Southeast Regional Conference 2024, License: CC BY-SA 4.0
Abstract: Panjabi (also referred to as Punjabi) is a name given to a collection of tonal languages originating in the Punjab area of South Asia. It is the ninth most spoken language in the world, with speakers making up roughly 1.9% of the world population. Panjabi is written in two scripts, Gurmukhi and Shahmukhi. Yet it can be considered a "low resource language" because it lacks the basic building blocks of Natural Language Processing (NLP) research. Toshakhana is our attempt to build the first Panjabi corpus in Gurmukhi script with a temporal component.