Abstract: While recent years have seen rapid progress in image-conditioned text generation, image captioning still suffers from the fundamental issue of hallucinations: the generation of spurious details that cannot be inferred from the given image. Existing methods largely use closed-vocabulary object lists to mitigate or evaluate hallucinations in image captioning, ignoring the long-tailed nature of hallucinations that occur in practice. To this end, we propose a framework for addressing hallucinations in image captioning in the open-vocabulary setting. Our framework includes a new benchmark, OpenCHAIR, which leverages generative foundation models to evaluate open-vocabulary object hallucinations in image captioning, surpassing the popular and similarly sized CHAIR benchmark in both diversity and accuracy. Furthermore, to mitigate open-vocabulary hallucinations without relying on a closed object list, we propose MOCHa, an approach harnessing advances in reinforcement learning. Our multi-objective reward function explicitly targets the trade-off between fidelity and adequacy in generated captions without requiring any strong supervision. MOCHa improves a large variety of image captioning models, as captured by our OpenCHAIR benchmark and other existing metrics.
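To make the trade-off concrete, a minimal Python sketch of a multi-objective reward of the kind described above is given below; the scalar weighting scheme and the names fidelity, adequacy, and w_fidelity are illustrative assumptions, not the paper's actual formulation.

    def multi_objective_reward(fidelity: float, adequacy: float,
                               w_fidelity: float = 0.5) -> float:
        # Weighted combination of a fidelity score (does the caption avoid
        # unsupported details?) and an adequacy score (is it informative?),
        # both assumed to lie in [0, 1]. The result serves as the scalar
        # reward signal for reinforcement-learning fine-tuning.
        return w_fidelity * fidelity + (1.0 - w_fidelity) * adequacy

    # The weight w_fidelity trades off hallucination avoidance against
    # descriptive richness: a faithful-but-terse caption and a rich-but-
    # hallucinated one can score equally under an even weighting.
    print(multi_objective_reward(fidelity=0.9, adequacy=0.4))  # 0.65
    print(multi_objective_reward(fidelity=0.4, adequacy=0.9))  # 0.65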