Bangla SBERT - Sentence Embedding Using Multilingual Knowledge Distillation

Published: 2024 · Last Modified: 12 Nov 2025 · UEMCON 2024 · CC BY-SA 4.0
Abstract: Word embeddings have revolutionized NLP by effectively capturing semantic associations between words. Sentence embeddings, however, offer notable benefits for more advanced language comprehension. While these developments have benefited high-resource languages like English, low-resource languages like Bangla have gained far less. This study addresses that disparity by creating sentence embeddings for Bangla to improve tasks such as information retrieval, sentiment analysis, and content recommendation. We developed Bangla Sentence-BERT by fine-tuning on novel datasets generated through machine translation, together with diverse open-source datasets. Our approach uses stsb-xlm-r-multilingual as the teacher model and XLM-RoBERTa (XLMR) as the student model for multilingual knowledge distillation. We evaluated the proposed approach against multilingual Sentence-BERT models and classical machine learning algorithms; our model achieved 97% accuracy on a real-world text classification task. The results demonstrate the efficacy of our Bangla sentence transformer in capturing meaning and its potential for a range of Bangla natural language processing applications, such as text classification.
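The teacher-student setup described in the abstract follows the multilingual knowledge distillation recipe associated with the sentence-transformers library. The sketch below illustrates that recipe under stated assumptions: it uses the library's classic `ParallelSentencesDataset` API, and the parallel-data file name and hyperparameters are placeholders, not the paper's actual configuration.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: the pretrained multilingual STS model named in the abstract.
teacher_model = SentenceTransformer("sentence-transformers/stsb-xlm-r-multilingual")

# Student: XLM-RoBERTa with mean pooling, to be aligned with the teacher.
word_embedding_model = models.Transformer("xlm-roberta-base", max_seq_length=128)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
student_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Parallel English-Bangla pairs (tab-separated: english<TAB>bangla per line).
# The file name is hypothetical; the paper's machine-translated data is not shown here.
train_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model)
train_data.load_data("parallel-sentences-en-bn.tsv.gz")
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=32)

# MSE loss pulls the student's embeddings toward the teacher's embeddings for
# both the source sentence and its translation (the distillation objective).
train_loss = losses.MSELoss(model=student_model)

student_model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=5,
    warmup_steps=1000,
    output_path="bangla-sbert",
)
```

The key design choice is that the student never needs Bangla labels: it only has to reproduce the teacher's English-side embeddings for the Bangla translations, which transfers the teacher's semantic space to the low-resource language.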