# Forecasting-RL: AI-Powered Forecasting Research Platform

A research platform for developing and evaluating AI forecasting models using reinforcement learning, supervised fine-tuning, and retrieval-augmented generation. This repository implements the full pipeline, from news collection through question generation to model training and evaluation.

## 🏗️ Architecture Overview

```
forecasting/
├── 📰 news/                # News data collection and processing
├── 🔍 qgen/                # Question generation from articles  
├── 📊 data/                # Curated datasets (Metaculus, Manifold, etc.)
├── 🤖 libraries/verl       # RL training with verl library
├── 🎯 sft/                 # Supervised fine-tuning scripts
├── 📈 custom_eval_scripts/ # Evaluation datasets and results
├── ⚖️ local_judge/         # Local model evaluation framework
└── 📊 plotting/            # Visualization and analysis tools
```

## 🚀 Quick Start

### Installation
```bash
# Clone repository
git clone [REPOSITORY_URL]
cd forecasting-rl

# Automated setup (recommended)
./setup.sh

# Manual setup alternative
uv venv forecast && source forecast/bin/activate
uv pip install torch torchvision --index-url XXXX
uv pip install -e .
```

### Basic Usage
```bash
# Activate environment
source forecast/bin/activate

# Generate questions from news articles
python qgen/question_generator.py --articles_path data/articles.jsonl

# Train forecasting model
python trainingTRL/train_grpo.py --config train_config.yaml

# Evaluate model performance
python local_judge/llm_judge.py --model_path /path/to/model
```

## 📰 News Collection Pipeline

### Overview
An automated pipeline that collects and processes news articles from Common Crawl as source material for forecasting question generation.

### Components

**`news/`** - News data collection and processing
- **`jobs_news.py`** - Cluster job management for news extraction
- **`to_jsonl.py`** - Convert extracted articles to JSONL format
- **`domains.txt`** - Curated list of ~150 high-quality news domains
- **`relevant_domains.txt`** - Subset of most relevant domains
- **`src/tokenize_for_rag.py`** - Tokenization for BM25 retrieval
- **`src/deduplicate_news_jsonl.py`** - Article deduplication
- **`src/bm25_jsonl.py`** - BM25 retrieval implementation
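
To illustrate the deduplication step, the sketch below drops exact duplicates from a list of article records by hashing case- and whitespace-normalized text. The `text` field name and the `dedupe_articles` helper are assumptions made for this example; they are not the interface of `src/deduplicate_news_jsonl.py`, which may use a different strategy (for example, near-duplicate detection).

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedupe_articles(articles: list[dict], text_key: str = "text") -> list[dict]:
    """Keep the first occurrence of each distinct normalized article body.

    `text_key` is a hypothetical field name, not one confirmed by the repository.
    """
    seen: set[str] = set()
    unique = []
    for article in articles:
        digest = hashlib.sha256(normalize(article[text_key]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(article)
    return unique
```

Hashing keeps memory use proportional to the number of distinct articles rather than their total size, which matters at the scale of a Common Crawl extraction.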

### Usage
```bash
# Extract news from Common Crawl (script paths below are relative to news/)
python jobs_news.py --num_extractors 1 --domains domains.txt

# Convert to JSONL and tokenize
python to_jsonl.py --input_dir /path/to/articles --output_dir /path/to/jsonl
python src/tokenize_for_rag.py --input_path articles.jsonl

# Deduplicate articles
python src/deduplicate_news_jsonl.py --input_dir tokenized_data/

# Build BM25 index for retrieval
python src/bm25_jsonl.py --articles_path articles.jsonl --questions_path questions.jsonl
```

**Data Scale**: 27M+ articles across 150+ domains, 150GB+ total
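
For readers unfamiliar with BM25, here is a minimal sketch of Okapi BM25 scoring over pre-tokenized documents, using a commonly used non-negative IDF variant. The function name `bm25_scores` and the defaults `k1=1.5` and `b=0.75` are assumptions for illustration and do not describe how `src/bm25_jsonl.py` is implemented.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every tokenized document against a tokenized query with Okapi BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Length normalization penalizes documents longer than average.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Documents that share no terms with the query score exactly zero, and among documents with equal term counts the shorter one ranks higher.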

## 🔍 Question Generation (`qgen/`)

### Core Functionality
The question generation system transforms news articles into high-quality forecasting questions using LLM-based generation and filtering.

### Key Components

**`question_generator.py`** - Main question generation engine (1,510 lines)
- **Two modes**: MCQ (multiple choice) and FreeQ (free-form) questions
- **Leakage detection**: Prevents data leakage in generated questions
- **Quality validation**: Ensures questions meet forecasting standards
- **Batch processing**: Efficient handling of large article datasets

**`filter_articles.py`** - Article relevance filtering (406 lines)
- **vLLM-powered**: Uses vLLM for GPU-accelerated inference
- **Relevance scoring**: Evaluates forecasting potential of articles
- **Criteria**: Interest/reach (>100 people) and forecasting value (1+ week horizon)
- **Batch processing**: Tensor parallel processing for efficiency

**`article_processor.py`** - Article preprocessing (443 lines)
- **Content extraction**: Cleans and structures article content
- **Metadata parsing**: Extracts dates, sources, and categories
- **Format standardization**: Ensures consistent article format

**`remove_leakage.py`** - Data leakage prevention (282 lines)
- **Temporal validation**: Ensures questions don't leak future information
- **Content analysis**: Checks for answer hints in question text
- **Resolution date verification**: Validates question timing
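
As a rough illustration of the content-analysis check, the sketch below flags a free-form question whose text already contains the answer as a contiguous token sequence. This is a simple string heuristic written for this example; the helper name `answer_leaks` is hypothetical, and `remove_leakage.py` may well rely on an LLM or other logic instead.

```python
import re


def _tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def answer_leaks(question: str, answer: str) -> bool:
    """Return True if the answer's tokens appear as a contiguous run in the question."""
    q, a = _tokens(question), _tokens(answer)
    if not a:
        return False
    return any(q[i:i + len(a)] == a for i in range(len(q) - len(a) + 1))
```

A check like this only applies to free-form questions, since multiple-choice questions list the correct option in the question by design.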

### Question Generation Process

1. **Article Collection**: Scrape and filter relevant news articles
2. **Relevance Filtering**: Score articles for forecasting potential
3. **Question Generation**: Generate 3 questions per article using LLM
4. **Leakage Detection**: Remove questions with data leakage
5. **Quality Validation**: Ensure questions meet standards
6. **Final Filtering**: Select best questions for dataset

### Usage Examples

```bash
# Generate questions from articles
python qgen/question_generator.py \
    --articles_path data/articles.jsonl \
    --output_path questions.jsonl \
    --num_questions 3 \
    --use_freeq \
    --check_leakage

# Filter articles for relevance
python qgen/filter_articles.py \
    --articles_path raw_articles.jsonl \
    --model_path /path/to/model \
    --output_path filtered_articles.jsonl

# Remove data leakage
python qgen/remove_leakage.py \
    --questions_path questions.jsonl \
    --output_path clean_questions.jsonl

# Extract and validate dates
python qgen/extract_date.py \
    --input_path questions.jsonl \
    --output_path dated_questions.jsonl
```

### Question Types Generated

**MCQ (multiple choice)**:
- Multiple choice with probability distributions
- 4-5 answer options with confidence scores
- Binary yes/no questions with uncertainty

**FreeQ (free-form)**:
- Numerical predictions with ranges
- Categorical outcomes with probabilities
- Time-based predictions with dates
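
Because several of these formats attach probabilities to answer options, a basic sanity check is that a forecast forms a valid distribution. The following is a minimal sketch under that assumption; the `is_valid_distribution` helper and the option-to-probability mapping are hypothetical and not taken from the repository.

```python
import math


def is_valid_distribution(probs: dict[str, float], tol: float = 1e-6) -> bool:
    """Check that option probabilities lie in [0, 1] and sum to 1 within `tol`."""
    if not probs:
        return False
    if any(p < 0.0 or p > 1.0 for p in probs.values()):
        return False
    return math.isclose(sum(probs.values()), 1.0, abs_tol=tol)
```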

## 📊 Data Management (`data/`)

### Datasets Supported
- **Metaculus**: Professional forecasting platform questions
- **Manifold Markets**: Prediction market questions  
- **Kalshi**: Event-based prediction markets
- **FutureBench**: Standardized forecasting benchmark
- **Custom Generated**: Questions from news articles

### Key Scripts
- **`manifold_new.py`** - Process Manifold Markets data dump
- **`futureX.py`** - Handle FutureX prediction datasets
- **`process_relevant_docs.py`** - Process retrieved documents for RAG

## 🤖 Model Training

### Supervised Fine-tuning (`sft/`)
```bash
# Create SFT dataset
python sft/create_sft_data.py --input_path questions.jsonl

# Launch training job
python sft/jobs_train.py --config configs/sft_config.yaml
```

## ⚖️ Evaluation Framework (`local_judge/`)

### LLM Judge System
**`llm_judge.py`** - Comprehensive evaluation framework (1,236 lines)
- **Multiple metrics**: Accuracy, Brier score, calibration
- **Judge models**: GPT-4, Claude, local models
- **Batch evaluation**: Efficient processing of large datasets
- **Custom criteria**: Domain-specific evaluation rules

```bash
# Evaluate model with LLM judge
python local_judge/llm_judge.py \
    --model_path /path/to/model \
    --questions_path eval_questions.jsonl \
    --judge_model gpt-4 \
    --output_path results.jsonl
```
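
For reference, two of the metrics named above can be computed for binary questions as in the sketch below. The function names and the 0.5 decision threshold are assumptions for illustration and are not the interface of `llm_judge.py`.

```python
def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes (lower is better)."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def accuracy(probs: list[float], outcomes: list[int], threshold: float = 0.5) -> float:
    """Fraction of questions where thresholding the forecast matches the outcome."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    hits = sum((p >= threshold) == bool(o) for p, o in zip(probs, outcomes))
    return hits / len(probs)
```

A forecaster who always answers 0.5 gets a Brier score of 0.25, which is a useful baseline when reading results.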

## 📈 Analysis and Visualization (`plotting/`)

### Binary Classification
- **ROC curves** and **calibration plots**
- **Performance metrics** across time periods
- **Model comparison** visualizations

### Freeform Questions  
- **Accuracy trends** over time
- **Confidence calibration** analysis
- **Cross-benchmark** performance comparison

```bash
# Generate performance plots
python plotting/binary/plot_results.py --results_dir evals/binary/
python plotting/freeform/across_benchmarks.py --output_dir plots/
```
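
The data behind a calibration plot can be sketched as follows: forecasts are grouped into equal-width probability bins, and each bin's mean confidence is compared with the observed outcome frequency. The helper names and the choice of 10 bins are assumptions for this example, not a description of the scripts in `plotting/`.

```python
def calibration_bins(probs: list[float], outcomes: list[int],
                     n_bins: int = 10) -> list[tuple[float, float, int]]:
    """Return (mean_confidence, observed_frequency, count) for each non-empty bin."""
    sums = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sums[idx][0] += p
        sums[idx][1] += o
        sums[idx][2] += 1
    return [(s[0] / s[2], s[1] / s[2], s[2]) for s in sums if s[2] > 0]


def expected_calibration_error(probs: list[float], outcomes: list[int],
                               n_bins: int = 10) -> float:
    """Count-weighted average gap between confidence and observed frequency."""
    bins = calibration_bins(probs, outcomes, n_bins)
    total = sum(count for _, _, count in bins)
    if total == 0:
        raise ValueError("no forecasts provided")
    return sum(count / total * abs(conf - freq) for conf, freq, count in bins)
```

Plotting the first two elements of each bin against each other gives the reliability curve; a perfectly calibrated model lies on the diagonal.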
