Comparative Study of Window Views' Distinctive Impact on Human and VLM Impressions

ACL ARR 2025 July Submission 793 Authors

28 Jul 2025 (modified: 04 Sept 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: This paper investigates the subjective dimensions of window view impressions by comparing human participants' verbal responses with image descriptions generated by seven state-of-the-art vision-language models (VLMs). We analyze a dataset of transcribed impressions, comprising 2,100 utterances collected in two separate virtual reality (VR) experiments, and compare it against synthetic texts from the seven VLMs. Using the combined dataset, we compare human and machine responses on three key criteria: (1) most frequent n-grams, (2) clustering structure, and (3) sentiment. Our findings reveal significant differences across all three dimensions and highlight distinctive patterns in human perceptions of window views.
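
The three comparison criteria can be illustrated with a minimal sketch; this is hypothetical code, not the authors' released pipeline. It assumes scikit-learn and NLTK's VADER sentiment analyzer, and the `human_texts`/`vlm_texts` lists are invented placeholders:

```python
# Illustrative sketch of the three criteria named in the abstract:
# (1) frequent n-grams, (2) clustering structure, (3) sentiment.
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.sentiment import SentimentIntensityAnalyzer  # needs the vader_lexicon resource

# Placeholder data standing in for the transcribed and generated impressions.
human_texts = ["calm view of a green park", "the window feels too narrow"]
vlm_texts = ["a serene urban park seen through a large window",
             "a narrow window overlooking a quiet street"]

def top_ngrams(texts, n=2, k=10):
    """Criterion 1: most frequent word n-grams across a list of texts."""
    grams = Counter()
    for t in texts:
        tokens = t.lower().split()
        grams.update(zip(*(tokens[i:] for i in range(n))))
    return grams.most_common(k)

def cluster_labels(texts, k=2):
    """Criterion 2: k-means cluster assignments over TF-IDF vectors."""
    X = TfidfVectorizer().fit_transform(texts)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

def mean_sentiment(texts):
    """Criterion 3: mean VADER compound sentiment score."""
    sia = SentimentIntensityAnalyzer()
    return sum(sia.polarity_scores(t)["compound"] for t in texts) / len(texts)

for name, texts in [("human", human_texts), ("vlm", vlm_texts)]:
    print(name, top_ngrams(texts), cluster_labels(texts), mean_sentiment(texts))
```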
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal content generation, cross-modal application, multimodality, emotion detection and analysis, style analysis, human-centered evaluation
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 793