Abstract: This study proposes a language-agnostic, transformer-based POS tagging framework designed for low-resource languages, with Bangla and Hindi serving as case studies. The framework was adapted from Bangla to Hindi by changing only three lines of framework-related code, demonstrating its effectiveness with minimal modification. The framework achieves 96.85\% token-level accuracy on Bangla and 97\% on Hindi across POS categories, maintaining robust F1 scores despite dataset imbalance and linguistic overlaps. However, a performance discrepancy in one specific POS type highlights challenges in dataset curation. This performance stems from the transformer used under the hood of the framework, which can itself be swapped with minimal code changes. The framework's modular, language-agnostic, open-source design enables rapid adaptation to new languages. By reducing model design and tuning overhead, it lets researchers prioritize linguistic preprocessing and dataset refinement, key tasks in advancing NLP for underrepresented languages.
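For illustration, the kind of minimal change described above could look like the following sketch, which uses the Hugging Face transformers API directly rather than the framework's own interface; the checkpoint names, file paths, and tag-set size are assumptions, not the paper's actual configuration:

    from transformers import AutoTokenizer, AutoModelForTokenClassification

    # Hypothetical size of the POS tag inventory; the real value depends on the dataset.
    NUM_TAGS = 30

    # Bangla setup: an example Bangla-capable checkpoint and a hypothetical data path.
    model_name = "sagorsarker/bangla-bert-base"
    train_file = "data/bangla_pos_train.conll"

    # Hindi adaptation: in a configuration-driven framework, only these
    # language-specific values would need to change, e.g.
    # model_name = "google/muril-base-cased"
    # train_file = "data/hindi_pos_train.conll"

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=NUM_TAGS)

Swapping the underlying transformer amounts to changing the checkpoint identifier in the same way.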
Paper Type: Short
Research Area: Syntax: Tagging, Chunking and Parsing
Research Area Keywords: low-resource language POS tagging, morphologically-rich language POS tagging
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: Bangla, Hindi
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
Software: zip
Data: zip
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: No
A2 Elaboration: The work operates solely on publicly available, anonymized text corpora (no human subjects or sensitive personal data); proposes a technical framework for part-of-speech tagging without downstream applications that could enable misuse; and involves no collection of new data from individuals, with all resources released under open-license terms.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 3
B2 Discuss The License For Artifacts: No
B2 Elaboration: No — we did not include a central license statement because each artifact’s repository already provides clear licensing information (e.g., MIT for the code framework and CC BY 4.0 for the Bangla dataset).
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: 4
C Computational Experiments: Yes
C1 Model Size And Budget: No
C1 Elaboration: The computation was performed on a personal computer with an Nvidia RTX 4070, taking about 20 minutes per iteration; this does not affect the paper's experiments in any way.
C2 Experimental Setup And Hyperparameters: No
C2 Elaboration: We did not change any default hyperparameters, in order to maintain a neutral environment across experiments.
C3 Descriptive Statistics: Yes
C3 Elaboration: 3,4
C4 Parameters For Packages: No
C4 Elaboration: All package parameters were kept at their default values.
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: yes
Submission Number: 378