Transformation of Visual Information into Bangla Textual Representation

Published: 01 Jan 2023, Last Modified: 26 Jul 2025 · CCWC 2023 · CC BY-SA 4.0
Abstract: In recent years, interest in research on generating human-like descriptions of scenes by detecting and analyzing their components has increased tremendously. Although a significant amount of research has gone into automating the conversion of visual information into written representation, low-resource languages such as Bangla remain largely unaddressed due to the lack of standard datasets. To resolve this issue, we introduce a new dataset named “Biboron”, for which we manually gathered Bangla descriptions of images extracted from the widely available Flickr30k dataset; these descriptions were then post-processed and examined for quality assurance. “Biboron” contains 158,915 distinct sentences describing 31,783 images, which underscores the versatile nature of the dataset. Furthermore, we present two models to enhance the automated extraction of visual information from images and its representation in Bangla. The first model uses Local Attention, while the second is based on Multi-Head Attention with Transformers. Both models use VGG16 as the image feature extractor, with a CuDNN-backed bidirectional LSTM in the decoder network. The BLEU scores suggest that the second model outperforms the first in generating relevant textual representations from images, achieving BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores of 0.78, 0.53, 0.37, and 0.21, respectively.
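The abstract reports BLEU-1 through BLEU-4 without defining the metric. As background, a minimal sentence-level, single-reference sketch of BLEU-n is shown below: clipped n-gram precisions combined geometrically and scaled by a brevity penalty. The Bangla token sequences are illustrative toy examples, not from the Biboron dataset, and the paper presumably reports corpus-level scores rather than this simplified per-sentence form.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of all contiguous n-grams in the token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n):
    """Sentence-level BLEU-max_n with uniform weights and one reference."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clipped counts: each candidate n-gram credited at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(
        1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# Toy example: candidate and reference differ in the final verb form.
cand = "একটি ছেলে মাঠে ফুটবল খেলছে".split()
ref = "একটি ছেলে মাঠে ফুটবল খেলে".split()
print(round(bleu(cand, ref, 1), 2))  # → 0.8 (4 of 5 unigrams match)
```

Higher-order scores drop as n grows, since longer n-grams are harder to match, which mirrors the decreasing BLEU-1 through BLEU-4 pattern (0.78 down to 0.21) reported in the abstract.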