Abstract: This study proposes a language-agnostic, transformer-based POS tagging framework designed for low-resource languages, with Bangla and Hindi serving as case studies. The framework was adapted from Bangla to Hindi by changing only three lines of framework-related code, demonstrating its effectiveness with minimal modification. The framework achieves 96.85\% token-level accuracy on Bangla and 97\% on Hindi across POS categories, maintaining robust F1 scores despite dataset imbalance and linguistic overlaps. However, a performance discrepancy in one specific POS type highlights challenges in dataset curation. This performance stems from the transformer used under the hood of the framework, which can itself be swapped with minimal code changes. The framework's modular, language-agnostic, open-source design enables rapid adaptation to new languages. By reducing model design and tuning overhead, it lets researchers prioritize linguistic preprocessing and dataset refinement, key tasks in advancing NLP for underrepresented languages.
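For illustration, the kind of minimal change described above could look like the following sketch, which uses the Hugging Face transformers API directly rather than the framework's own interface; the checkpoint names, file paths, and tag-set size are assumptions, not the paper's actual configuration:

    from transformers import AutoTokenizer, AutoModelForTokenClassification

    # Hypothetical size of the POS tag inventory; the real value depends on the dataset.
    NUM_TAGS = 30

    # Bangla setup: an example Bangla-capable checkpoint and a hypothetical data path.
    model_name = "sagorsarker/bangla-bert-base"
    train_file = "data/bangla_pos_train.conll"

    # Hindi adaptation: in a configuration-driven framework, only these
    # language-specific values would need to change, e.g.
    # model_name = "google/muril-base-cased"
    # train_file = "data/hindi_pos_train.conll"

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=NUM_TAGS)

Swapping the underlying transformer amounts to changing the checkpoint identifier in the same way.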
Paper Type: Short
Research Area: Syntax: Tagging, Chunking and Parsing
Research Area Keywords: low-resource language POS tagging, morphologically-rich language POS tagging
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: Bangla, Hindi
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
Software: zip
Data: zip
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: No
A2 Elaboration: The work operates solely on publicly available, anonymized text corpora (no human subjects or sensitive personal data); proposes a technical framework for part-of-speech tagging without downstream applications that could enable misuse; and involves no collection of new data from individuals, with all resources released under open-license terms.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 3
B2 Discuss The License For Artifacts: No
B2 Elaboration: No — we did not include a central license statement because each artifact’s repository already provides clear licensing information (e.g., MIT for the code framework and CC BY 4.0 for the Bangla dataset).
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: 4
C Computational Experiments: Yes
C1 Model Size And Budget: No
C1 Elaboration: The computation was performed on a personal computer with an Nvidia RTX 4070, taking about 20 minutes per iteration; this does not affect the paper's experiments in any way.
C2 Experimental Setup And Hyperparameters: No
C2 Elaboration: We did not change any default hyperparameters, in order to maintain a neutral environment across experiments.
C3 Descriptive Statistics: Yes
C3 Elaboration: 3,4
C4 Parameters For Packages: No
C4 Elaboration: All package parameters were kept at their default values.
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: yes
Submission Number: 378