Abstract: Although image captioning has a vast array of applications, it has not reached its full potential in languages other than English. Arabic, for instance, although the native language of more than 400 million people, remains largely underrepresented in this area, owing to the lack of labeled data and of powerful Arabic generative models. We alleviate this issue by presenting a novel vision-language model dedicated to Arabic, dubbed Violet. Our model is based on a vision encoder and a Gemini text decoder that maintains generation fluency while allowing fusion between the vision and language components. To train our model, we introduce a new method for automatically acquiring data from available English datasets. We also manually prepare a new dataset for evaluation. Violet performs substantially better than our baselines on all of our evaluation datasets. For example, it reaches a CIDEr score of 61.2 on our manually annotated dataset and improves by 13 points on Flickr8k.
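
To make the encoder-decoder design concrete, below is a minimal sketch of a vision-to-text captioner in PyTorch: image patch features are encoded, and a causal text decoder cross-attends to them (the fusion step) to predict the next caption token. The layer sizes, patch dimensionality, and single-layer Transformer blocks are illustrative assumptions, not Violet's actual vision encoder, Gemini decoder, or training setup.

```python
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    """Minimal vision-encoder + text-decoder captioner (illustrative only)."""

    def __init__(self, vocab_size=32000, d_model=256):
        super().__init__()
        # Vision side: project flattened patch features to the model width,
        # then contextualize them with a small Transformer encoder.
        self.patch_proj = nn.Linear(768, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=1,
        )
        # Text side: token embeddings plus a decoder whose cross-attention
        # over the visual features fuses the two modalities.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=1,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, tokens):
        memory = self.encoder(self.patch_proj(patches))
        tgt = self.embed(tokens)
        # Causal mask so each position only attends to earlier caption tokens.
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(hidden)  # next-token logits per position

model = TinyCaptioner()
patches = torch.randn(1, 49, 768)         # e.g. a 7x7 grid of patch features
tokens = torch.randint(0, 32000, (1, 5))  # partial caption token ids
logits = model(patches, tokens)
print(logits.shape)  # torch.Size([1, 5, 32000])
```

In practice, both sides would presumably be initialized from pretrained checkpoints, which is how the decoder could maintain the generation fluency the abstract describes.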