See both ways: A bidirectional evaluation of Multimodal Language Models and Human Spontaneous Speech for Image Captioning
Keywords: Multimodal large language models, Image captioning, Human–AI comparison, Model evaluation
TL;DR: We study how well captions generated by multimodal models align with spontaneous human speech captions, using a bidirectional evaluation framework.
Abstract: Multimodal large language models (MLLMs) have achieved notable success in image captioning, yet systematic comparisons with human-generated references remain underexplored. In this work, we study the alignment between captions generated by multimodal models and spontaneous human speech captions. To this end, we introduce a human–machine bidirectional evaluation framework, extending a recently proposed image-caption evaluation metric. The evaluation compares crowd-sourced audio-based captions of images with captions generated by various MLLMs, scoring each set against the other as reference. Our detailed analysis reveals that (i) humans are more selective than models, describing specific aspects of an image rather than providing a comprehensive summary; (ii) scores computed with human references and model targets are significantly higher than those computed with model references and human targets; and (iii) images from categories such as "nature" and "educational" evoke more human imagination during the description task than images from other categories. Together, these findings reveal a clear divergence between human and model captioning.
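To make the bidirectional protocol concrete, the Python sketch below scores each human–model caption pair in both directions. It is a minimal illustration only: `caption_score` is a hypothetical, direction-sensitive token-coverage placeholder, not the image-caption metric the paper actually extends, and all function and variable names are ours.

```python
# Sketch of the bidirectional evaluation protocol described in the abstract.
# caption_score is a toy stand-in for the real (direction-sensitive) metric.

def caption_score(reference: str, target: str) -> float:
    """Toy asymmetric score: fraction of reference tokens covered by the target.
    Placeholder for the actual image-caption evaluation metric."""
    ref = set(reference.lower().split())
    tgt = set(target.lower().split())
    return len(ref & tgt) / len(ref) if ref else 0.0

def bidirectional_scores(human_captions, model_captions):
    """Score each image twice: once with the human caption as reference and the
    model caption as target, and once with the roles swapped. Returns the mean
    score for each direction."""
    h2m = [caption_score(h, m) for h, m in zip(human_captions, model_captions)]
    m2h = [caption_score(m, h) for h, m in zip(human_captions, model_captions)]
    return sum(h2m) / len(h2m), sum(m2h) / len(m2h)

if __name__ == "__main__":
    humans = ["a child points at a red kite above the beach"]
    models = ["a red kite flies over a sandy beach on a clear day"]
    human_ref, model_ref = bidirectional_scores(humans, models)
    print(f"human-reference: {human_ref:.3f}, model-reference: {model_ref:.3f}")
```

Because the placeholder score is asymmetric, the two directions generally differ, mirroring the paper's finding that human-reference/model-target scores exceed model-reference/human-target scores.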
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24335