Abstract: This paper presents the development of a Sinhala-Tamil bilingual parallel corpus aligned at the sentence level. The source-language text is drawn from contemporary writings, and every sentence was translated manually. Active learning methods were used to select the sentences, to ensure that effective language structures in both languages are represented. The corpus has two parts: one translated from Sinhala to Tamil, with 25k parallel sentences, and one translated from Tamil to Sinhala, with 22k parallel sentences. The manual translations were carried out by two teams of professional translators. The final version of TamSiPara, the Tamil-Sinhala bilingual parallel corpus, contains 47k parallel sentences in total.