PAELLA: Parameter-Efficient Lightweight Language-Agnostic Captioning Model

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission
TL;DR: We propose PAELLA, an efficient multilingual captioning model with retrieval augmentation.
Abstract: We introduce PAELLA, a Parameter-Efficient Lightweight Language-Agnostic image captioning model that uses retrieval augmentation to perform multilingual caption generation. The model is trained by learning a small mapping network with 30M parameters between a pre-trained visual model and a multilingual language model that is conditioned on two types of input: (i) the image itself, and (ii) a set of retrieved captions in the target language. The retrieved examples play a key role in guiding the model to generate captions across languages. Compared to other multilingual captioning models, PAELLA can be trained in one day on a single GPU. The model is lightweight both in the number of trainable parameters, all of which lie in its mapping network, and in the amount of multilingual training data that is required. Experiments on the XM3600 dataset, featuring 36 languages, show that PAELLA can outperform or compete against models with 4-87x more learned parameters and 35-863x more data. We also find that PAELLA can be trained on only monolingual data and still show strong zero-shot abilities in other languages.
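The abstract's core idea, a small trainable mapping network that bridges a frozen vision encoder and a frozen multilingual language model, whose decoding is further conditioned on retrieved target-language captions, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual architecture: the layer sizes, prefix length, two-layer MLP design, and the function and variable names (MappingNetwork, build_lm_inputs, prefix_len, etc.) are all assumptions chosen for readability.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Small trainable bridge from frozen visual features into the
    frozen multilingual LM's embedding space (sizes are illustrative,
    not the 30M-parameter configuration reported in the paper)."""
    def __init__(self, visual_dim=768, lm_dim=1024, prefix_len=8):
        super().__init__()
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim
        self.proj = nn.Sequential(
            nn.Linear(visual_dim, lm_dim * prefix_len),
            nn.Tanh(),
            nn.Linear(lm_dim * prefix_len, lm_dim * prefix_len),
        )

    def forward(self, visual_feats):
        # visual_feats: (batch, visual_dim) pooled image features
        prefix = self.proj(visual_feats)
        # reshape into a sequence of "soft prompt" vectors for the LM
        return prefix.view(-1, self.prefix_len, self.lm_dim)

def build_lm_inputs(image_prefix, retrieved_caption_embeds):
    """Concatenate the image-derived prefix with embeddings of the
    retrieved target-language captions; the frozen LM decodes the
    output caption conditioned on this combined sequence."""
    return torch.cat([image_prefix, retrieved_caption_embeds], dim=1)

# Toy usage with random tensors standing in for frozen-model outputs.
mapper = MappingNetwork()
img = torch.randn(2, 768)             # stand-in: frozen vision encoder output
retrieved = torch.randn(2, 20, 1024)  # stand-in: embedded retrieved captions
lm_inputs = build_lm_inputs(mapper(img), retrieved)
print(lm_inputs.shape)  # torch.Size([2, 28, 1024])
```

Only the mapping network carries gradients in this setup; the vision encoder and language model stay frozen, which is what keeps the trainable-parameter count and training cost low.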
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Approaches low compute settings-efficiency
Languages Studied: ar, bn, cs, da, de, el, en, es, fa, fi, fil, fr, iw, hi, hr, hu, id, it, ja, ko, mi, nl, no, pl, pt, ro, ru, sv, sw, te, th, tr, uk, vi, quz, zh