Abstract: Although image captioning has a vast array of applications, it has not reached its full potential in languages other than English. Arabic, for instance, although the native language of more than 400 million people, remains largely underrepresented in this area, owing to the lack of labeled data and of powerful Arabic generative models. We alleviate this issue by presenting a novel vision-language model dedicated to Arabic, dubbed Violet. Our model is based on a vision encoder and a Gemini text decoder that maintains generation fluency while allowing fusion between the vision and language components. To train our model, we introduce a new method for automatically acquiring data from available English datasets. We also manually prepare a new dataset for evaluation. Violet performs substantially better than our baselines on all of our evaluation datasets. For example, it reaches a CIDEr score of 61.2 on our manually annotated dataset and improves by 13 points on Flickr8k.
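
To make the encoder-decoder design concrete, below is a minimal sketch of a vision-to-text captioner in PyTorch: image patch features are encoded, and a causal text decoder cross-attends to them (the fusion step) to predict the next caption token. The layer sizes, patch dimensionality, and single-layer Transformer blocks are illustrative assumptions, not Violet's actual vision encoder, Gemini decoder, or training setup.

```python
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    """Minimal vision-encoder + text-decoder captioner (illustrative only)."""

    def __init__(self, vocab_size=32000, d_model=256):
        super().__init__()
        # Vision side: project flattened patch features to the model width,
        # then contextualize them with a small Transformer encoder.
        self.patch_proj = nn.Linear(768, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=1,
        )
        # Text side: token embeddings plus a decoder whose cross-attention
        # over the visual features fuses the two modalities.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=1,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, tokens):
        memory = self.encoder(self.patch_proj(patches))
        tgt = self.embed(tokens)
        # Causal mask so each position only attends to earlier caption tokens.
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(hidden)  # next-token logits per position

model = TinyCaptioner()
patches = torch.randn(1, 49, 768)         # e.g. a 7x7 grid of patch features
tokens = torch.randint(0, 32000, (1, 5))  # partial caption token ids
logits = model(patches, tokens)
print(logits.shape)  # torch.Size([1, 5, 32000])
```

In practice, both sides would presumably be initialized from pretrained checkpoints, which is how the decoder could maintain the generation fluency the abstract describes.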