Abstract: We introduce JEEM, a benchmark designed to evaluate Vision-Language Models (VLMs) on visual understanding across four Arabic-speaking countries: Jordan, the Emirates, Egypt, and Morocco. JEEM includes the tasks of image captioning and visual question answering, and features culturally rich and regionally diverse content. The benchmark is designed to assess the ability of VLMs to generalize across dialects and to accurately interpret cultural elements in visual contexts. In an evaluation of five prominent open-source Arabic VLMs and GPT-4o, we find that the Arabic VLMs consistently underperform, struggling with both visual understanding and dialect-specific generation. While GPT-4o ranks best in this comparison, its linguistic competence varies across dialects and its visual understanding capabilities lag behind. This underscores the need for more inclusive models and the value of culturally diverse evaluation paradigms.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: multilingualism, multimodality, automatic evaluation, multilingual benchmarks, multilingual QA, multimodal QA, visual question answering, NLP datasets, datasets for low resource languages
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: Modern Standard Arabic, Jordanian dialect, Emirati dialect, Egyptian dialect, Moroccan dialect
Previous URL: https://openreview.net/forum?id=3LCLVDGG0l
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Justification For Not Keeping Action Editor Or Reviewers: The previous reviewers raised concerns that we believe fall outside the scope of the paper: (1) they listed the lack of additional downstream tasks, such as OCR and textual understanding, as a weakness, whereas our paper focuses on highlighting cultural and linguistic biases in vision-language processing; (2) they listed our reliance on human annotators rather than automation as a weakness, although this was a deliberate decision to ensure high-quality, original, dialect-specific Arabic captions and VQA pairs rather than translated or synthetic data; and (3) they critiqued the dataset size as insufficient for training, although the dataset is explicitly designed for benchmarking, not training.
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Ethics Statement
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 5. Benchmarking VLMs
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: 3. Dataset Construction
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: 3. Dataset Construction, Licensing Information
B4 Data Contains Personally Identifying Info Or Offensive Content: Yes
B4 Elaboration: A. Annotation Statistics
B5 Documentation Of Artifacts: Yes
B5 Elaboration: 4. Data Analysis
B6 Statistics For Data: Yes
B6 Elaboration: 4. Data Analysis
C Computational Experiments: Yes
C1 Model Size And Budget: No
C1 Elaboration: All models are evaluated zero-shot.
C2 Experimental Setup And Hyperparameters: No
C2 Elaboration: All models are evaluated zero-shot.
C3 Descriptive Statistics: Yes
C3 Elaboration: 5. Benchmarking VLMs
C4 Parameters For Packages: Yes
C4 Elaboration: 5.1.1 Traditional Evaluation Metrics
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: Appendix B
D2 Recruitment And Payment: Yes
D2 Elaboration: 3. Dataset Construction
D3 Data Consent: Yes
D3 Elaboration: 3. Dataset Construction
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: Yes
D5 Elaboration: Appendix B
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: 5. Benchmarking VLMs
Author Submission Checklist: Yes
Submission Number: 1283