# CHRONOPLAY: A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks

This repository contains the implementation and data for the paper "CHRONOPLAY: A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks" submitted to ICLR 2026.

## Overview

ChronoPlay is a novel framework for the automated and continuous generation of game RAG benchmarks. Retrieval-Augmented Generation (RAG) systems are increasingly vital in dynamic domains like online gaming, yet the lack of a dedicated benchmark has impeded standardized evaluation in this area.

Our framework addresses the core challenge of **Dual Dynamics**: the constant interplay between game content updates and the shifting focus of the player community. ChronoPlay features:

- **Dual-Dynamic Update Mechanism**: Tracks both game content evolution and community focus shifts
- **Dual-Source Synthesis Engine**: Draws from official sources and the player community to ensure both factual correctness and authentic query patterns
- **Player-Centric Authenticity**: Ensures generated questions are realistic and reflect genuine player concerns

This is the first dynamic RAG benchmark for the gaming domain, offering new insights into model performance under complex and realistic conditions.

## 🗂️ ChronoPlay Benchmark Dataset

### Synthetic QA Pairs (Main Benchmark Data)

Our benchmark provides temporally-segmented QA pairs for three games, ready for evaluation:

```
generation/data/
├── dune/                    # Dune: Awakening (6 temporal segments)
│   ├── segment_1/
│   │   └── generated_qa_pairs.jsonl
│   ├── segment_2/
│   │   └── generated_qa_pairs.jsonl
│   ├── ... (segments 3-6)
├── dyinglight2/             # Dying Light 2 (5 temporal segments)
│   ├── segment_1/
│   │   └── generated_qa_pairs.jsonl
│   ├── ... (segments 2-5)
└── pubgm/                   # PUBG Mobile (7 temporal segments)
    ├── segment_1/
    │   └── generated_qa_pairs.jsonl
    └── ... (segments 2-7)
```
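For a quick look at the benchmark files, a minimal Python sketch like the one below may help. It only assumes the directory layout and segment counts shown in the tree above and that each file is standard JSON Lines. The helper names `qa_path` and `load_jsonl` are illustrative and not part of this repository, and no record fields are assumed.

```python
import json
from pathlib import Path

# Segment counts as listed in the directory tree above.
SEGMENTS = {"dune": 6, "dyinglight2": 5, "pubgm": 7}


def qa_path(root, game, segment_id):
    """Return the path of one segment's QA file under the layout shown above."""
    if not 1 <= segment_id <= SEGMENTS[game]:
        raise ValueError(f"{game} has no segment {segment_id}")
    return Path(root) / game / f"segment_{segment_id}" / "generated_qa_pairs.jsonl"


def load_jsonl(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

For example, `pairs = load_jsonl(qa_path("generation/data", "dune", 1))` loads the first Dune segment when run from the repository root, and `sorted(pairs[0])` lists the field names actually present in a record.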

### Knowledge Corpus

Temporal knowledge base supporting the benchmark:

```
data/
├── dune/corpus/             # Dune temporal corpus
│   ├── segment_1/
│   │   ├── corpus.jsonl     # Document corpus
│   │   ├── documents.json   # Structured documents
│   │   └── nodes.json       # Knowledge graph nodes
│   ├── ... (segments 2-6)
│   └── segment_timeless/    # Time-independent content
├── dyinglight2/corpus/      # Similar structure (segments 1-5)
└── pubgm/corpus/            # Similar structure (segments 1-7)
```
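The sketch below shows one way to gather the documents relevant to a single temporal segment. It rests on two assumptions that the tree above does not confirm: that `segment_timeless/` also contains a `corpus.jsonl`, and that a retriever for segment k should see segment k together with the time-independent content. The function names are hypothetical.

```python
import json
from pathlib import Path


def corpus_files(data_root, game, segment_id, include_timeless=True):
    """List the corpus.jsonl files relevant to one temporal segment.

    Assumption: segment_timeless/ holds a corpus.jsonl like the numbered
    segments do. Files that do not exist are silently skipped.
    """
    base = Path(data_root) / game / "corpus"
    names = [f"segment_{segment_id}"]
    if include_timeless:
        names.append("segment_timeless")
    candidates = [base / name / "corpus.jsonl" for name in names]
    return [p for p in candidates if p.is_file()]


def iter_documents(paths):
    """Yield one parsed record per non-blank line across the given files."""
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    yield json.loads(line)
```

Calling `list(iter_documents(corpus_files("data", "dune", 1)))` from the repository root would return the records for Dune segment 1, plus the timeless ones if that file exists.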

### Preprocessed Data (Provided)

We provide pre-processed templates and role data:

```
data/
├── merged_question_templates_dup_mov.jsonl  # Question templates with type annotations
├── merged_question_data_roles_dup_mov.jsonl # Player role characteristics
├── dune/question_segments_results.json      # Temporal segmentation results
├── dyinglight2/question_segments_results.json
└── pubgm/question_segments_results.json
```

## 🚀 Framework Usage Guide

### Data Requirements

**To use the full framework pipeline, you need to provide:**

1. **Wiki Game Data**: Official game documentation, patch notes, and structured game content
2. **Player Community Data**: Forum discussions, community posts, and player-generated content

**Input Format**: Please refer to the preprocessing scripts to understand the expected input formats:
- Check `preprocess/temporal_segmentation.py` for question data format
- Check `preprocess/role_chart_extraction.py` for player community data format  
- Check `corpus/corpus_builder.py` for wiki/documentation data format

**Note**: We have already provided the processed results from our data in the `data/` directory, so you can directly use the benchmark without needing to collect raw data.

### Installation

```bash
# Set up conda environment
conda create -n chrono python=3.12
conda activate chrono

# Install dependencies
pip install -r requirements.txt
```

### Stage 1: Preprocessing (Completed - Data Provided)

**This stage has been completed and the results are provided in the `data/` directory.**

The preprocessing stage includes:

1. **Temporal Segmentation Analysis**
   ```bash
   cd preprocess/
   python temporal_segmentation.py --input_file ../data/raw_questions.jsonl \
                                  --output_file ../data/dune/question_segments_results.json
   ```

2. **Question Template Extraction**
   ```bash
   python question_builder.py --input_dir ../data/raw_data/ \
                              --output_file ../data/merged_question_templates_dup_mov.jsonl
   ```

3. **Player Role Extraction**
   ```bash
   python role_chart_extraction.py --input_file ../data/player_questions.jsonl \
                                   --output_file ../data/merged_question_data_roles_dup_mov.jsonl
   ```

**Provided Data:**
- Question templates base: `data/merged_question_templates_dup_mov.jsonl`
- User role base: `data/merged_question_data_roles_dup_mov.jsonl`
- Temporal segmentation results: `data/{game}/question_segments_results.json`

### Stage 2: Generation

**We have already provided generated QA pairs in the `generation/data/` directory for direct use.**

To generate new QA pairs or extend to other games, use our dual-source synthesis engine:

```bash
cd generation/

# Generate QA pairs for a specific game and segment
python generation.py --game_name dune \
                     --segment_id 1 \
                     --target_sample_size 150 \
                     --corpus_path ../data/dune/corpus \
                     --template_file ../data/merged_question_templates_dup_mov.jsonl \
                     --role_file ../data/merged_question_data_roles_dup_mov.jsonl

# Generate for all segments
for segment in {1..6}; do
    python generation.py --game_name dune --segment_id "$segment" --target_sample_size 150
done
```
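To cover all three games rather than one, the command lines can be generated programmatically. The sketch below reuses only the flags and paths from the command above and the segment counts listed earlier in this README. The function name `generation_commands` is illustrative, and running the script prints the commands as a dry run instead of executing them.

```python
import shlex

# Segment counts per game, as listed in this README.
SEGMENTS = {"dune": 6, "dyinglight2": 5, "pubgm": 7}


def generation_commands(sample_size=150, data_root="../data"):
    """Build one generation.py argv list per (game, segment) pair."""
    commands = []
    for game, n_segments in SEGMENTS.items():
        for segment_id in range(1, n_segments + 1):
            commands.append([
                "python", "generation.py",
                "--game_name", game,
                "--segment_id", str(segment_id),
                "--target_sample_size", str(sample_size),
                "--corpus_path", f"{data_root}/{game}/corpus",
                "--template_file", f"{data_root}/merged_question_templates_dup_mov.jsonl",
                "--role_file", f"{data_root}/merged_question_data_roles_dup_mov.jsonl",
            ])
    return commands


if __name__ == "__main__":
    # Dry run: print shell-quoted commands instead of executing them.
    for argv in generation_commands():
        print(shlex.join(argv))
```

To execute rather than print, each list can be passed to `subprocess.run(argv, check=True, cwd="generation")`, so that the relative `../data` paths resolve as they do in the shell example.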

**Generated Data Location:**
```
generation/data/
├── dune/segment_1-6/generated_qa_pairs.jsonl       # 6 temporal segments
├── dyinglight2/segment_1-5/generated_qa_pairs.jsonl # 5 temporal segments  
└── pubgm/segment_1-7/generated_qa_pairs.jsonl      # 7 temporal segments
```

**Generation Process:**
1. **Template Sampling**: Samples question templates based on temporal patterns
2. **Role Matching**: Matches appropriate player roles to question types
3. **Dual-Source Synthesis**: Combines official documentation with player community patterns
4. **Quality Filtering**: Ensures generated QA pairs meet quality standards
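
The first two steps can be illustrated with a small sketch. This is not the code in `generation.py`: it assumes each template carries a `type` key, each role lists the question types it fits under a `types` key, and the segment's temporal pattern is summarized as a mapping from question type to relative frequency. All of these names are hypothetical. Steps 3 and 4 depend on an LLM and are left out.

```python
import random
from collections import Counter


def sample_templates(templates, type_weights, k, seed=0):
    """Draw k templates (with replacement) so that question types appear
    in proportion to `type_weights` for the target segment.

    Each template's weight is its type's share divided by the number of
    templates of that type, so a crowded type is not over-represented.
    """
    rng = random.Random(seed)
    counts = Counter(t["type"] for t in templates)
    weights = [type_weights.get(t["type"], 0.0) / counts[t["type"]] for t in templates]
    if not any(weights):
        raise ValueError("no template matches the segment's question types")
    return rng.choices(templates, weights=weights, k=k)


def match_role(template, roles, rng):
    """Pick a role that lists the template's type; fall back to any role."""
    compatible = [r for r in roles if template["type"] in r.get("types", [])]
    return rng.choice(compatible or roles)
```

Passing a fixed `seed` keeps a sampling run reproducible, which is useful when regenerating a segment to compare against earlier output.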

### Stage 3: RAG Pipeline Execution & Evaluation

Test RAG systems using our benchmark through a two-step process: first run the RAG pipeline, then evaluate the results.

#### Step 3.1: Run RAG Pipeline

**Retrieval Runner**
```bash
cd evaluation/

# Run retrieval for a specific segment
python retrieval_runner.py --game dune \
                           --segment_id 1

# Run retrieval for all segments
for segment in {1..6}; do
    python retrieval_runner.py --game dune --segment_id "$segment"
done
```

**Generation Runner**
```bash
# Run generation pipeline using retrieval results
python generation_runner.py --retrieval_results ./retrieval_results/retrieval_dune_segment_1.jsonl \
                            --model gpt-4o

# Run generation for multiple retrieval results
python generation_runner.py --retrieval_results ./retrieval_results/retrieval_dune_*.jsonl \
                            --model gpt-4
```

#### Step 3.2: Evaluate Results

**Retrieval Evaluation**
```bash
# Evaluate retrieval performance for specific segment
python retrieval_evaluator.py --game dune \
                              --segment_id 1 \
                              --retrieval_results ./retrieval_results/retrieval_dune_segment_1.jsonl
```

**Generation Evaluation**
```bash
# Evaluate generation quality
python generation_evaluator.py --generation_results ./generation_results/generation_dune_segment_1.jsonl \
                               --metrics correctness

# Batch evaluate multiple results
python generation_evaluator.py --generation_results ./generation_results/generation_dune_*.jsonl \
                               --metrics correctness
```
```

**Evaluation Metrics:**
- **Retrieval**: Precision@K, Recall@K, MNDCG@K
- **Generation**: Correctness, Faithfulness
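
For reference, the retrieval metrics can be sketched with their standard textbook definitions. This is not the code in `retrieval_evaluator.py`: it assumes binary relevance and unique document IDs in the ranking, and it reads MNDCG@K as NDCG@K averaged over queries, which is an interpretation of the abbreviation rather than something this README states.

```python
import math


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def mean_ndcg_at_k(queries, k):
    """Average NDCG@K over (ranked_ids, relevant_ids) pairs."""
    scores = [ndcg_at_k(ranked, relevant, k) for ranked, relevant in queries]
    return sum(scores) / len(scores) if scores else 0.0
```

Here `ranked_ids` is the retriever's output in rank order and `relevant_ids` is the set of gold document IDs for the query.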

**Results Storage:**
- Retrieval results: `evaluation/retrieval_results/`
- Generation results: `evaluation/generation_results/`  
- Evaluation reports: `evaluation/retrieval_evaluation/` and `evaluation/generation_evaluation/`