Abstract: Recent work in image captioning seems to be driven by increasingly large amounts of training data, and requires considerable computing power for training. We propose and investigate a number of adjustments to state-of-the-art approaches, with the aim of training a performant image captioning model in under two hours on a single consumer-level GPU using only a few thousand images. Firstly, we address the issue of sparse object and scene representation in a small dataset by combining visual attention regions at various levels of granularity. Secondly, we suppress semantically unlikely caption candidates through the introduction of language model rescoring during inference. Thirdly, in order to increase vocabulary and expressiveness, we propose an augmentation of the set of training captions through the use of a paraphrase generator. State-of-the-art performance on the Flickr8k test set is achieved across a number of evaluation metrics. The proposed model also attains competitive test scores compared to existing models trained on a much larger dataset. The findings of this paper can inspire solutions to other vision-and-language tasks where labelled data is scarce.
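To make the language model rescoring step concrete, the sketch below illustrates one common way of re-ranking beam-search caption candidates: interpolating the captioner's score with an external language-model score. This is a minimal, hedged illustration, not the paper's exact formulation; the interpolation weight `alpha`, the length normalisation, and the toy language model are all assumptions introduced here for clarity.

```python
def rescore_candidates(candidates, lm_logprob, alpha=0.3):
    """Re-rank beam-search caption candidates by mixing the captioner's
    log-probability with an external language-model log-probability.

    candidates: list of (caption_tokens, captioner_logprob) pairs
    lm_logprob: callable mapping a token list to a log-probability
    alpha: interpolation weight for the LM score (hypothetical value)
    """
    rescored = []
    for tokens, cap_lp in candidates:
        # Length-normalise both scores so shorter captions are not
        # automatically favoured over longer, more descriptive ones.
        n = max(len(tokens), 1)
        mixed = (1 - alpha) * (cap_lp / n) + alpha * (lm_logprob(tokens) / n)
        rescored.append((tokens, mixed))
    # Highest combined score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy stand-in for a pretrained language model (illustrative only).
    toy_lm = lambda toks: -0.5 * len(toks)
    beams = [(["a", "dog", "runs"], -2.1),
             (["a", "dog", "is", "running", "on", "grass"], -4.0)]
    for tokens, score in rescore_candidates(beams, toy_lm):
        print(" ".join(tokens), round(score, 3))
```

In practice the external language model would assign low scores to semantically unlikely candidates, pushing them down the ranking before the final caption is selected.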