# OpenForesight Dataset

OpenForesight is a dataset of **forecasting questions** generated from news articles and paired with retrieval-augmented prompts. It is designed to evaluate how well AI models predict future events when given relevant context.

## Dataset Overview

This dataset contains the following three splits:
- **Train**: 1,000 randomly sampled questions
- **Validation**: 207 questions
- **Test**: 302 questions

The questions are sourced from multiple news outlets and cover events from May to August 2025.

## Dataset Structure

### Fields Description

| Field | Type | Description |
|-------|------|-------------|
| `qid` | string | Unique question identifier |
| `question_title` | string | The main forecasting question |
| `background` | string | Context and background information for the question |
| `resolution_criteria` | string | HTML-formatted criteria for how the question will be resolved |
| `answer` | string | The ground truth answer to the question |
| `answer_type` | string | Type of answer expected (e.g., "string (location)", "string (name)", "string (date)") |
| `url` | string | URL of the source news article |
| `article_maintext` | string | Full text content of the news article |
| `article_publish_date` | string | Publication date of the article (YYYY-MM-DD format) |
| `article_modify_date` | string | Last modification date of the article (YYYY-MM-DD format) |
| `article_download_date` | string | Date when the article was downloaded (YYYY-MM-DD format) |
| `article_title` | string | Title of the news article |
| `article_description` | string | Description/summary of the news article |
| `data_source` | string | Source identifier for the data generation process |
| `news_source` | string | News outlet that published the article |
| `question_start_date` | string | Start date for the forecasting question (YYYY-MM-DD format) |
| `resolution_date` | string | Date when the question will be resolved (YYYY-MM-DD format) |
| `prompt` | string | Full prompt with retrieved news articles for forecasting |
| `prompt_without_retrieval` | string | Prompt without retrieved articles for baseline comparison |

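The field table above can be turned into a quick sanity check on a single record. The following is a minimal sketch, not part of the dataset tooling: `check_record` is a hypothetical helper, and it assumes every date field is always populated with a valid `YYYY-MM-DD` string, which the table implies but does not guarantee.

```python
from datetime import datetime

# Field names taken from the table above; all are documented as strings.
EXPECTED_FIELDS = (
    "qid", "question_title", "background", "resolution_criteria",
    "answer", "answer_type", "url", "article_maintext",
    "article_publish_date", "article_modify_date", "article_download_date",
    "article_title", "article_description", "data_source", "news_source",
    "question_start_date", "resolution_date", "prompt",
    "prompt_without_retrieval",
)

# Fields documented as YYYY-MM-DD dates.
DATE_FIELDS = (
    "article_publish_date", "article_modify_date", "article_download_date",
    "question_start_date", "resolution_date",
)


def check_record(record: dict) -> list:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name in EXPECTED_FIELDS:
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], str):
            problems.append(f"non-string field: {name}")
    for name in DATE_FIELDS:
        value = record.get(name)
        if isinstance(value, str):
            try:
                # Assumption: dates are never empty or partially specified.
                datetime.strptime(value, "%Y-%m-%d")
            except ValueError:
                problems.append(f"bad date in {name}: {value!r}")
    return problems
```

Running `check_record` over each row of a split gives a quick view of which rows, if any, deviate from the documented schema.
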
## Data Generation Process

### News Sources
The dataset is generated from articles published by the following news sources:

Train Set:
- **Hindustan Times** (hindustantimes-2024-25)
- **Irish Times** (irishtimes-2024)
- **Forbes** (forbes-2024)
- **CNN** (cnn-2024)
- **DW** (dw-2024)

Validation Set:
- **The Guardian** (theguardian, UK-based)

Test Set:
- **Al Jazeera** (aljazeera, global news based in Middle East)
- **The Guardian** (theguardian, UK-based)
- **Time** (time.com, global news, US-based)
- **NDTV** (ndtv, India-focused)
- **Fox News** (foxnews, US-centric)

### Model Generation
Questions were generated using language models with the following process:

1. **Article Processing**: News articles were collected and processed to extract relevant information
2. **Question Generation**: Language models generated forecasting questions based on article content
3. **Retrieval Augmentation**: Relevant news articles were retrieved and incorporated into prompts
4. **Question Validation**: Generated questions were checked to confirm that the source article actually resolves the question by the resolution date, and that the question is specific and correct
5. **Quality Control**: Questions were filtered for relevance and quality

### Split Generation

All splits share the same columns, including both the retrieval-augmented prompt (`prompt`) and the non-retrieval prompt (`prompt_without_retrieval`) for comparison.

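Because every record carries both prompt variants, the effect of retrieval can be measured on the same questions. The sketch below is illustrative only: `model` is a hypothetical callable that maps a prompt string to an answer string, and normalized exact match is an assumed scoring rule, not a metric defined by this dataset.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def compare_prompts(records, model):
    """Return (accuracy_with_retrieval, accuracy_without_retrieval).

    `model` is a hypothetical callable: prompt string in, answer string out.
    Scoring is normalized exact match against the `answer` field (an
    assumption made for this sketch).
    """
    hits_with = 0
    hits_without = 0
    total = 0
    for record in records:
        gold = normalize(record["answer"])
        if normalize(model(record["prompt"])) == gold:
            hits_with += 1
        if normalize(model(record["prompt_without_retrieval"])) == gold:
            hits_without += 1
        total += 1
    if total == 0:
        raise ValueError("no records to evaluate")
    return hits_with / total, hits_without / total
```

The difference between the two returned accuracies is a rough estimate of how much the retrieved articles help a given model.
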
#### Train Split (52,183 total questions; 1,000 included here due to a 100 MB size constraint)
- Generated from diverse news sources across multiple time periods
- Covers a wide range of topics and answer types
- Sourced from 5 news sources (Hindustan Times, Irish Times, Forbes, CNN, DW)

#### Validation Split (207 questions)
- Smaller curated set for model validation
- Focused on recent events for temporal validation
- Sourced from The Guardian

#### Test Split (302 questions)
- Standardized test set for evaluation
- Balanced across different news sources and question types
- Sourced from 5 news sources from May to August 2025

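The split descriptions above make two checkable claims: the test split draws on several news sources, and it covers May to August 2025. A minimal sketch for verifying both on loaded records follows. The helper names are hypothetical, and using `resolution_date` as the date that defines the window is an assumption; the card does not say which date field the range refers to.

```python
from collections import Counter
from datetime import date, datetime


def source_counts(records):
    """Count questions per news outlet, using the `news_source` field."""
    return Counter(record["news_source"] for record in records)


def outside_window(records, start, end, field="resolution_date"):
    """Return the qids whose date in `field` falls outside [start, end].

    Assumption: `field` always holds a valid YYYY-MM-DD string.
    """
    flagged = []
    for record in records:
        day = datetime.strptime(record[field], "%Y-%m-%d").date()
        if not (start <= day <= end):
            flagged.append(record["qid"])
    return flagged
```

For the test split, calling `outside_window(records, date(2025, 5, 1), date(2025, 8, 31))` lists any questions that fall outside the stated period.
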
## Answer Types

The dataset includes various answer types:
- **`string (location)`**: Geographic locations, places, venues
- **`string (name)`**: Person names, company names, product names
- **`string (date)`**: Specific dates or time periods
- **`string`**: General text answers

All questions have non-numeric answers.

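The `answer_type` values follow a `base (qualifier)` pattern, so they can be split for per-type analysis. This is a small sketch with hypothetical helper names; it assumes the values look like the examples listed above and labels unqualified types as `"general"` purely for display.

```python
import re
from collections import Counter

# Matches "string", "string (location)", "String (Date)", and similar.
_ANSWER_TYPE = re.compile(r"^\s*(\w+)\s*(?:\(([^)]*)\))?\s*$")


def split_answer_type(answer_type: str):
    """Split e.g. 'string (location)' into ('string', 'location').

    The qualifier is None when the type has no parenthesised part.
    """
    match = _ANSWER_TYPE.match(answer_type)
    if match is None:
        # Unexpected shape: return the cleaned value with no qualifier.
        return answer_type.strip().lower(), None
    base = match.group(1).lower()
    qualifier = match.group(2)
    return base, (qualifier.strip().lower() if qualifier else None)


def qualifier_counts(records):
    """Count records per answer-type qualifier ('general' if there is none)."""
    return Counter(
        split_answer_type(record["answer_type"])[1] or "general"
        for record in records
    )
```

Grouping evaluation results by the qualifier makes it easier to see whether a model struggles more with, say, dates than with names.
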