---
language: en
license: cc-by-4.0
datasets:
- gpqa
tags:
- question-answering
- evaluation
- annotations
---

## Dataset Description

This dataset contains filtered JSONL files of human annotations for the GPQA Diamond dataset. The annotations cover question specificity, answer uniqueness, and whether each model's response matches the ground-truth answer.


## Fields

- **question_id**: Record ID from the original dataset, used to map each annotation back to its question.
- **model**: List of models whose responses are annotated (only a small subset was used here: DeepSeek v3, GPT-4o, Llama-4-Maverick, Qwen3-32B).
- **thinking**: Thinking tokens (not currently stored).
- **question_text**: Question.
- **answer**: Ground-truth answer to the question.
- **response**: Models' responses.
- **rating_match**: Rating (1-5) on whether the model response (functionally) matches the provided answer.
- **rating_osq**: Rating (1-5) on whether the sample (question, answer) is specific enough that it can be answered with just the question, without any reliance on the options.
- **rating_multians**: Rating (1-5) on whether the question has a single unique correct answer (ignoring paraphrasing and counting only semantically and functionally different answers).

## Usage

This dataset can be used to select a subset of questions that meets given requirements on question specificity and answer uniqueness.
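
The selection described above can be sketched in Python as follows. This is a minimal illustration, not the authors' procedure. It assumes that each JSONL line is one record with the fields listed above, that ratings are stored as numbers, and that higher `rating_osq` and `rating_multians` values mean "more specific" and "more clearly a single answer" (the rating direction is not stated in this card, so check it before relying on the thresholds). The helper names, the default thresholds, and the sample rows are all hypothetical.

```python
import json
from typing import Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per non-empty line of JSONL text."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def select_question_ids(records, min_osq=4, min_multians=4):
    """Return sorted IDs of questions whose ratings meet both thresholds.

    Assumption: higher ratings mean "more specific" / "more clearly unique".
    Records with a missing rating are skipped.
    """
    selected = set()
    for rec in records:
        osq = rec.get("rating_osq")
        multi = rec.get("rating_multians")
        if osq is None or multi is None:
            continue
        if osq >= min_osq and multi >= min_multians:
            selected.add(rec["question_id"])
    return sorted(selected)


# Made-up placeholder rows, only to show the expected shape of the input.
sample_lines = [
    '{"question_id": "q1", "rating_osq": 5, "rating_multians": 4}',
    '{"question_id": "q2", "rating_osq": 2, "rating_multians": 5}',
    "",
    '{"question_id": "q3", "rating_osq": 4, "rating_multians": null}',
]
print(select_question_ids(read_jsonl(sample_lines)))  # ['q1']
```

To run this on a real file, pass an open file handle to `read_jsonl` in place of `sample_lines`, since a text file iterates line by line. Collecting IDs into a set means a question is kept if at least one of its annotated rows passes; a stricter rule (for example, requiring every row to pass) may be more appropriate depending on the use case.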
