See Both Ways: A Bidirectional Evaluation of Multimodal Language Models and Human Spontaneous Speech for Image Captioning

ACL ARR 2026 January Submission 10514 Authors

06 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Multimodal large language models, Image captioning, Human–AI comparison and interaction, Model Evaluation
Abstract: Multimodal large language models (MLLMs) have achieved notable success in image captioning, yet systematic comparisons with human-generated references remain underexplored. In this work, we present a novel study of the alignment between captions generated by multimodal models and spontaneous human speech captions. To this end, we introduce a human–machine bidirectional evaluation framework that does not assume a "ground truth". The evaluation compares human audio-based captions of images with captions generated by various MLLMs. Our detailed analysis reveals that (i) humans are more selective than models in image captioning, focusing on particular aspects rather than providing a comprehensive summary; (ii) scores computed with human captions as references and model captions as targets are significantly higher than those computed with model captions as references and human captions as targets; and (iii) images from specific categories such as "nature" and "educational" evoke more human imagination during the description task than other categories. Together, these findings reveal a clear divergence between human and model captioning that can pave the way for human-aligned MLLM designs.
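To make the bidirectional evaluation concrete, the sketch below scores each human–model caption pair twice, once with the human caption as the reference and once with the model caption as the reference. This is a minimal illustration only: the abstract does not name the paper's metrics, so BLEU (via NLTK) is used here as a stand-in for an asymmetric reference-based score, and the caption strings are hypothetical.

```python
# Minimal sketch of bidirectional caption evaluation (assumed metric: BLEU).
# The asymmetry of reference-based metrics is what makes the two directions
# differ, which is the effect finding (ii) in the abstract describes.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1  # avoids zero scores on short captions

def directional_bleu(reference: str, target: str) -> float:
    """BLEU of `target` scored against a single `reference` caption."""
    return sentence_bleu([reference.split()], target.split(),
                         smoothing_function=smooth)

# Hypothetical caption pair for one image.
human_caption = "a child pointing at birds over a lake"
model_caption = ("a young child stands by a lake pointing at a flock "
                 "of birds flying over the water")

# Direction 1: human caption as reference, model caption as target.
h_ref_m_tgt = directional_bleu(human_caption, model_caption)
# Direction 2: model caption as reference, human caption as target.
m_ref_h_tgt = directional_bleu(model_caption, human_caption)

print(f"human-ref / model-tgt: {h_ref_m_tgt:.3f}")
print(f"model-ref / human-tgt: {m_ref_h_tgt:.3f}")
```

Averaging each direction over a dataset yields the two score distributions whose gap the paper reports; neither direction is treated as ground truth.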
Paper Type: Long
Research Area: Human-AI Interaction/Cooperation and Human-Centric NLP
Research Area Keywords: Human-Centered NLP
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: Hindi, English
Submission Number: 10514