Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions
Abstract: We present a self-supervised method to improve an agent's ability to describe arbitrary objects while actively exploring a generic environment. This is a challenging problem, as current models struggle to produce coherent image captions across different camera viewpoints and in the presence of clutter. We propose a three-phase framework to fine-tune existing captioning models that enhances caption accuracy and consistency across views via a consensus mechanism. First, an agent explores the environment, collecting noisy image-caption pairs. Then, a consistent pseudo-caption for each object instance is distilled via consensus using a large language model. Finally, these pseudo-captions are used to fine-tune an off-the-shelf captioning model, with the addition of contrastive learning. We analyse combinations of captioning models, exploration policies, pseudo-labeling methods, and fine-tuning strategies on our manually labeled test set. Results show that a policy can be trained to mine samples with higher disagreement than classical baselines. Our pseudo-captioning method, in combination with all policies, achieves higher semantic similarity than existing methods, and fine-tuning improves caption accuracy and consistency by a significant margin.
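As a rough illustration of the second phase, the sketch below groups the noisy captions collected for each object instance and asks a language model for a single consensus caption. The `query_llm` callable and the prompt wording are placeholders chosen for the example, not the paper's actual implementation.

```python
from collections import defaultdict

def distill_pseudo_captions(observations, query_llm):
    """Distill one consensus pseudo-caption per object instance.

    observations: iterable of (instance_id, caption) pairs collected
        during exploration.
    query_llm: callable taking a prompt string and returning a string
        (a hypothetical stand-in for any large language model API).
    """
    # Group the noisy captions by the object instance they describe.
    captions_by_instance = defaultdict(list)
    for instance_id, caption in observations:
        captions_by_instance[instance_id].append(caption)

    # Ask the language model for a single caption consistent with all views.
    pseudo_captions = {}
    for instance_id, captions in captions_by_instance.items():
        prompt = (
            "The following captions describe the same object seen from "
            "different viewpoints:\n"
            + "\n".join(f"- {c}" for c in captions)
            + "\nWrite a single caption consistent with all of them."
        )
        pseudo_captions[instance_id] = query_llm(prompt)
    return pseudo_captions
```

The resulting instance-level pseudo-captions would then serve as targets when fine-tuning the off-the-shelf captioning model in the third phase.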