---
# This .zip File Contains:
Two files:
- "APPENDIX.pdf": the appendix of the main paper. It covers reproducibility details, with explanations and examples of model evaluation, analysis, and data collection.
- "visual_riddles_benchmark_data.csv": a CSV file containing the Visual Riddles Benchmark data. Its fields are described below in [Data Fields](#data-fields). The file can be loaded locally with the pandas package, or through Hugging Face datasets.
---

---
license: apache-2.0
---

annotations_creators:
- crowdsourced
language:
- en
language_creators:
- found
pretty_name: Visual-Riddles
size_categories:
- 10K<n<100K
source_datasets:
- original
tags:
- commonsense-reasoning
- open-ended-VQA
- visual-commonsense-reasoning
- world-knowledge
- image-generation
- visual-question-answering(VQA)
- question-answering
- synthetic-images


# Dataset Card for Visual-Riddles

- [Dataset Description](#dataset-description)
- [Contribute to Extend Visual-Riddles](#contribute-images-to-extend-visual-riddles)
  - [Languages](#languages)
- [Dataset](#dataset)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
  - [Data Loading](#data-loading)
- [Licensing Information](#licensing-information)
- [Annotations](#annotations)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Citation Information](#citation-information)


## Dataset Description
The Visual Riddles dataset is a benchmark designed to assess visual reasoning and commonsense understanding. It comprises a collection of visually rich riddles, each accompanied by a synthetic image generated specifically for the challenge. These riddles are carefully crafted to integrate subtle visual cues with everyday scenarios, challenging both humans and models in their interpretative and commonsense reasoning abilities.

The Visual Riddles benchmark encompasses various tasks:

- **Open-ended Visual Question Answering (VQA):** In this task, participants provide free-text answers to questions based on the accompanying image. This format evaluates the ability to detect visual cues and apply commonsense reasoning to formulate responses.
- **Multiple-Choice VQA:** Participants select answers from predefined options, assessing their comprehension of visual cues and reasoning abilities in a structured format.
- **Auto-Rating of Open-ended Responses:** Models evaluate the accuracy of open-ended answers in both reference-free and reference-based scenarios. This task explores automatic evaluation methods for assessing the validity of responses, with and without external references.

Experimental results demonstrate a significant performance gap between humans and state-of-the-art vision-and-language models, highlighting the challenges in integrating commonsense reasoning and world knowledge into model architectures. Additionally, attempts to reproduce the benchmark's images using text-to-image models reveal unique challenges posed by visual riddles.

* Homepage: https://visual-riddles.github.io/


## Contribute Images to Extend Visual Riddles
Would you like to add a visual riddle to our database? Please send candidate images to <will be completed later>. Thanks!

### Languages
English.

## Dataset 
### Data Fields
    image (image) - The image of the visual riddle. Available only when loading with Hugging Face datasets.
    image_url (string) - The URL of the image. Available only when loading the local .csv file with the pandas package.
    question (string) - A challenging question that relates a visual clue in the image to additional commonsense or world knowledge.
    ground_truth_answer (string) - The answer to the riddle, given by the designer.
    hint (string) - A hint that directs attention to the visual clue in the image.
    attribution (string) - A URL of a web source that supports the world-knowledge information.
    human-caption (string) - The designer's caption, describing what is seen in the image.
    prompt (string) - The prompt that was used to generate the image of the riddle.
    attribution_content (string) - The textual content of the webpage at the URL given by the designer in the 'attribution' field. Extracted using the Selenium package.
    generative_model_name (string) - The name of the model that was used to generate the image of the riddle.
    designer (string) - The name of the visual riddle designer.
    difficulty_level_index (string) - The difficulty level of the riddle, ranging from 0 (direct visual clue, common knowledge) to 3 (hidden visual clue, very specific world knowledge).
    category (string) - The commonsense/knowledge category the riddle belongs to.
    gemini-1.5-pro-caption (string) - The caption of the image produced by the Gemini-Pro-1.5 model, used in the Caption->LLM case of the open-ended VQA task.
    image_id (string) - The unique id of the image in the dataset.
    
    human_<1,2,3>-open_ended_answer (string) - Crowd-collected open-ended VQA answers to the visual riddle from three different annotators.
    human_<1,2,3>-open_ended_answer-human_annotation (boolean) - Crowd-collected annotations specifying whether the answers given by humans are correct.
    
    <model_name>-open_ended_answer-LVLM (string) - The open-ended VQA answers (LVLM case) to the riddle given by the models: llava-v1.5-7b, llava-v1.6-34b, gemini-1.5-pro, gemini-pro-vision, InstructBlip and GPT4.
    <model_name>-open_ended_answer-LVLM-human_annotation (string) - Crowd-collected annotations specifying whether the answers given by the models in the LVLM case are correct.
    
    gemini-1.5-pro-open_ended_answer-<model_name>-caption_LLM (string) - The open-ended VQA answers (Caption->LLM case) of Gemini-Pro-1.5, given captions from: human (oracle) and gemini-1.5-pro.
    gemini-1.5-pro-open_ended_answer-<model_name>-caption_LLM-human_annotation - Crowd-collected annotations specifying whether the answers given by Gemini-Pro-1.5 in the Caption->LLM case (for each caption source) are correct.
    
    models_answers_order-multiple_choice (list) - The order in which the models' answers are presented in the multiple-choice VQA prompts.
    candidate_answers-multiple_choice (list) - The candidate answers of the models, in the order given in 'models_answers_order-multiple_choice'.
    
    prompt_clean-multiple_choice (string) - The prompt for the multiple-choice VQA task, including (image), question, ground-truth answer, 3 incorrect answer candidates, and one "cannot determine" distractor.
    prompt_hint-multiple_choice (string) - The prompt for the multiple-choice VQA task, including the hint.
    prompt_attribution-multiple_choice (string) - The prompt for the multiple-choice VQA task, including the attribution.
    
    model_order-auto_eval (list) - The order in which the two models' answers are presented in the Automatic Evaluation (judge) reference-free and reference-based prompts.
    model_answers-auto_eval (list) - The candidate answers of the models, in the order given in 'model_order-auto_eval'.
    
    prompts_ref_free-auto_eval (list) - The prompts for the Automatic Evaluation (judge) task in the reference-free scenario. Contains the answers of two models, each in a separate prompt. Each prompt also includes (image), question, and the phrase "Based on the given image and question, is this answer correct?"
    prompts_ref_based-auto_eval (list) - The prompts for the Automatic Evaluation (judge) task in the reference-based scenario. Contains the answers of two models, each in a separate prompt. Each prompt also includes (image), question, ground-truth answer, and the phrase "Based on the given image, question and ground-truth answer, is this answer correct?"
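
The templated field names above (`human_<1,2,3>-...` and `<model_name>-...`) can be expanded programmatically. The following is a minimal sketch, not part of the official benchmark code: the helper names are hypothetical, the exact spelling of model names in the real column headers is assumed to match the list above, and the encoding of annotation cells (booleans or strings such as `True`) is also an assumption that should be checked against the actual data.

```python
from typing import Dict, Iterable, List

# Model names as listed in the field descriptions. Whether the real column
# headers use exactly these spellings is an assumption; check the header row.
LVLM_MODELS = ["llava-v1.5-7b", "llava-v1.6-34b", "gemini-1.5-pro",
               "gemini-pro-vision", "InstructBlip", "GPT4"]


def human_annotation_columns() -> List[str]:
    """Expand human_<1,2,3>-open_ended_answer-human_annotation."""
    return [f"human_{i}-open_ended_answer-human_annotation" for i in (1, 2, 3)]


def lvlm_annotation_column(model_name: str) -> str:
    """Expand <model_name>-open_ended_answer-LVLM-human_annotation."""
    return f"{model_name}-open_ended_answer-LVLM-human_annotation"


def is_correct(value: object) -> bool:
    """Interpret one annotation cell (assumed encoding: bool or 'True'/'1'/'yes')."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in {"true", "1", "yes"}


def accuracy(rows: Iterable[Dict[str, object]], column: str) -> float:
    """Fraction of annotated rows marked correct in `column`.

    Rows that lack the column or have an empty cell are skipped.
    """
    values = [row[column] for row in rows if row.get(column) not in (None, "")]
    if not values:
        raise ValueError(f"no annotations found in column {column!r}")
    return sum(is_correct(v) for v in values) / len(values)
```

Given rows as dictionaries keyed by field name, `accuracy(rows, lvlm_annotation_column("GPT4"))` would then give the share of GPT4 answers that annotators marked correct, under the assumptions stated above.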


### Data Splits
Visual Riddles is a challenge set: there is a single TEST split. 


### Data Loading
You can load the data as follows:
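```
from datasets import load_dataset
# Replace the placeholder string with your own token. Recent `datasets` releases
# accept `token=`; older releases use `use_auth_token=` instead.
examples = load_dataset('visual-riddles/visual-riddles', token='<YOUR USER ACCESS TOKEN>')
```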
You can get `<YOUR USER ACCESS TOKEN>` by following these steps:
1) log into your Hugging Face account
2) click on your profile picture
3) click "Settings"
4) click "Access Tokens"
5) generate an access token
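
The local "visual_riddles_benchmark_data.csv" file can also be read without an access token. The sketch below uses only the Python standard library (`pandas.read_csv` would work as well). The function names are hypothetical, and it assumes that the fields documented as `(list)` are stored as Python-style list literals such as `['a', 'b']`; if the file uses a different encoding, the cell is left as a raw string.

```python
import ast
import csv
from typing import Dict, List

# Fields documented as "(list)" in the Data Fields section. Assumption: each is
# serialized in the CSV as a Python-style list literal.
LIST_FIELDS = [
    "models_answers_order-multiple_choice",
    "candidate_answers-multiple_choice",
    "model_order-auto_eval",
    "model_answers-auto_eval",
    "prompts_ref_free-auto_eval",
    "prompts_ref_based-auto_eval",
]


def parse_list_cell(cell: str) -> object:
    """Return the parsed list, or the raw string if the cell is not a list literal."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return cell
    return value if isinstance(value, list) else cell


def load_riddles(csv_path: str) -> List[Dict[str, object]]:
    """Read the benchmark CSV into a list of row dictionaries."""
    # 'attribution_content' holds full web-page text, so raise the csv module's
    # default per-field size limit before reading.
    csv.field_size_limit(10**9)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows: List[Dict[str, object]] = [dict(row) for row in csv.DictReader(handle)]
    for row in rows:
        for field in LIST_FIELDS:
            if row.get(field):
                row[field] = parse_list_cell(str(row[field]))
    return rows
```

Calling `load_riddles("visual_riddles_benchmark_data.csv")` returns one dictionary per riddle. As noted in the field descriptions, rows loaded this way carry `image_url` rather than the decoded `image`.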

## Licensing Information
[apache-2.0](https://apache.org/licenses/LICENSE-2.0)  

1. **Purpose:** The dataset was primarily designed for use as a test set.
2. **Commercial Use:** The dataset may be used commercially as a test set, but using it as a training set is prohibited.
3. **Rights on Images:** All rights to the images within the dataset are retained by the Visual Riddles authors.

If you are unsure about your specific case, do not hesitate to reach out.

## Annotations
We paid Amazon Mechanical Turk workers to supply open-ended VQA answers, and to annotate model and human answers for the open-ended VQA task in two cases: LVLM and Caption -> LLM.

## Considerations for Using the Data
We took measures to filter out potentially harmful or offensive images and texts in Visual Riddles, but it is still possible that some individuals may find certain content objectionable. 
If you come across any instances of harm, please report them to our point of contact. We will review and eliminate any images from the dataset that are deemed harmful.

[//]: # (All images, questions answers, captions, prompts, hints and attributions were obtained with human annotators.)