Abstract: This paper investigates the subjective dimensions of window view impressions by comparing human participants’ verbal responses with image descriptions generated by seven state-of-the-art vision-language models (VLMs). We analyze a dataset of transcribed impressions, comprising 2100 utterances collected in two separate virtual reality (VR) experiments, and compare it against synthetic texts produced by these seven high-performing VLMs. Using the combined dataset, we contrast human and machine responses along three key criteria: (1) most frequent N-grams, (2) clustering structure, and (3) sentiment. Our findings reveal significant differences across all three dimensions and highlight distinctive patterns in human perceptions of window views.
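As a rough illustration of the three comparison criteria named in the abstract, the sketch below shows one way such an analysis could be set up in Python. It assumes standard open-source tooling (NLTK for tokenization, N-grams, and VADER sentiment; scikit-learn for TF-IDF features and k-means clustering); these library and method choices, function names, and the placeholder texts are illustrative assumptions, not the paper's actual pipeline or data.

```python
# Hypothetical sketch: compare human vs. VLM window-view descriptions on
# (1) frequent n-grams, (2) clustering structure, (3) sentiment.
# Requires: nltk.download("punkt"), nltk.download("vader_lexicon")
from collections import Counter

from nltk.util import ngrams
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans


def top_ngrams(texts, n=2, k=10):
    """Most frequent n-grams across a list of utterances."""
    counts = Counter()
    for t in texts:
        counts.update(ngrams(word_tokenize(t.lower()), n))
    return counts.most_common(k)


def cluster_labels(texts, n_clusters=5, seed=0):
    """K-means over TF-IDF features (one of many possible clustering choices)."""
    X = TfidfVectorizer(stop_words="english").fit_transform(texts)
    return KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(X)


def mean_sentiment(texts):
    """Average VADER compound sentiment score over a set of utterances."""
    sia = SentimentIntensityAnalyzer()
    return sum(sia.polarity_scores(t)["compound"] for t in texts) / len(texts)


# Placeholder data, for illustration only:
human_texts = ["I love the green trees outside.", "The view feels calm and open."]
vlm_texts = ["The image shows a window overlooking a garden with trees."]

print(top_ngrams(human_texts))
print(cluster_labels(human_texts + vlm_texts, n_clusters=2))
print(mean_sentiment(human_texts), mean_sentiment(vlm_texts))
```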
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal content generation, cross-modal application, multimodality, emotion detection and analysis, style analysis, human-centered evaluation
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 793