Keywords: Vision-Language Models, Multimodal Evaluation, Benchmarking
Abstract: Vision-language models (VLMs) perform well on standard benchmarks, yet how they handle authentic, culturally grounded tasks remains underexplored.
We introduce HAERAE-Vision, a Korean real-world benchmark built from 86,052 question–image pairs across nine online platforms.
Through a six-stage pipeline that includes appropriateness filtering, difficulty calibration, image-dependency verification, checklist-based decomposition, and multi-phase human validation, we curate 653 rigorously validated items across 13 domains (a 0.76% survival rate).
Each item is paired with a structured checklist rubric, enabling fine-grained evaluation beyond single-point correctness. We evaluate 39 VLMs spanning proprietary, open-weight, and Korean-specialized families under a unified protocol; scoring with LLM judges proves highly reliable (Krippendorff's α = 0.867). Even the strongest systems (Gemini 2.5 Pro, GPT-5) remain below 50% accuracy, with errors concentrated in explicitness and procedural reasoning, and Korean-specialized models show no clear advantage over their multilingual counterparts. These findings highlight persistent gaps in real-world multimodal reasoning, and our work offers a reproducible methodology for constructing robust, culturally grounded benchmarks across languages.
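A minimal sketch of the checklist-based scoring and judge-reliability measurement described above, under assumptions: the data layout, function names, and use of the open-source `krippendorff` package are illustrative choices, not the authors' implementation.

```python
# Minimal sketch (assumption: not the authors' code) of checklist-based scoring
# and LLM-judge reliability as described in the abstract. Uses the open-source
# `krippendorff` package (pip install krippendorff numpy).
import numpy as np
import krippendorff


def checklist_score(verdicts: list[bool]) -> float:
    """Score one item as the fraction of rubric criteria a response satisfies,
    rather than a single correct/incorrect label."""
    return sum(verdicts) / len(verdicts)


def judge_reliability(scores_by_judge: np.ndarray) -> float:
    """Krippendorff's alpha over per-item scores from multiple LLM judges.

    `scores_by_judge` has shape (n_judges, n_items); np.nan marks items a judge
    did not rate. Interval level suits fractional checklist scores.
    """
    return krippendorff.alpha(reliability_data=scores_by_judge,
                              level_of_measurement="interval")


if __name__ == "__main__":
    # Hypothetical checklist verdicts from one judge for a single item.
    print(checklist_score([True, True, False, True]))  # 0.75

    # Toy reliability check: 3 hypothetical judges scoring 5 items on [0, 1].
    scores = np.array([
        [1.0, 0.50, 0.00, 0.75, 1.0],
        [1.0, 0.50, 0.25, 0.75, 1.0],
        [1.0, 0.25, 0.00, 0.75, np.nan],
    ])
    print(f"Krippendorff's alpha = {judge_reliability(scores):.3f}")
```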
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Submission Number: 11581