Abstract: While recent years have seen rapid progress in image-conditioned text generation, image captioning still suffers from the fundamental issue of hallucinations: the generation of spurious details that cannot be inferred from the given image. Existing methods largely use closed-vocabulary object lists to mitigate or evaluate hallucinations in image captioning, ignoring the long-tailed nature of hallucinations that occur in practice. To this end, we propose a framework for addressing hallucinations in image captioning in the open-vocabulary setting. Our framework includes a new benchmark, OpenCHAIR, which leverages generative foundation models to evaluate open-vocabulary object hallucinations in image captioning, surpassing the popular and similarly sized CHAIR benchmark in both diversity and accuracy. Furthermore, to mitigate open-vocabulary hallucinations without relying on a closed object list, we propose MOCHa, an approach harnessing advances in reinforcement learning. Our multi-objective reward function explicitly targets the trade-off between fidelity and adequacy in generated captions without requiring any strong supervision. MOCHa improves a large variety of image captioning models, as captured by our OpenCHAIR benchmark and other existing metrics.
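To make the trade-off concrete, a minimal Python sketch of a multi-objective reward of the kind described above is given below; the scalar weighting scheme and the names fidelity, adequacy, and w_fidelity are illustrative assumptions, not the paper's actual formulation.

    def multi_objective_reward(fidelity: float, adequacy: float,
                               w_fidelity: float = 0.5) -> float:
        # Weighted combination of a fidelity score (does the caption avoid
        # unsupported details?) and an adequacy score (is it informative?),
        # both assumed to lie in [0, 1]. The result serves as the scalar
        # reward signal for reinforcement-learning fine-tuning.
        return w_fidelity * fidelity + (1.0 - w_fidelity) * adequacy

    # The weight w_fidelity trades off hallucination avoidance against
    # descriptive richness: a faithful-but-terse caption and a rich-but-
    # hallucinated one can score equally under an even weighting.
    print(multi_objective_reward(fidelity=0.9, adequacy=0.4))  # 0.65
    print(multi_objective_reward(fidelity=0.4, adequacy=0.9))  # 0.65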