# Graph-Based ARC Task System

This directory contains the core framework for generating, manipulating, and evaluating graph-based ARC tasks. The system uses a modular architecture that allows for easy extension with new tasks, graph generators, properties, and pre-transformations.

## 🚀 Getting Started (macOS)

### Prerequisites

1. **Install Homebrew** (if you don't have it):
   ```bash
   /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
   ```

2. **Install Miniconda**:
   ```bash
   # Download Miniconda for macOS
   curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
   
   # Install Miniconda
   bash Miniconda3-latest-MacOSX-x86_64.sh
   
   # Follow the prompts and restart your terminal
   source ~/.bash_profile  # or ~/.zshrc if you use zsh
   ```

3. **Create the Conda Environment**:
   ```bash
   # Clone the repository (if you haven't already)
   git clone <repository-url>
   cd <repository-name>
   
   # Create environment from the environment file
   conda env create -f environment.yml
   
   # Activate the environment
   conda activate graph_arc
   ```

4. **Set up API Keys (Should actually already be done)**:
   ```bash
   # Copy the example environment file
   cp .env.example .env
   
   # Edit with your API keys
   nano .env  # or use your preferred editor
   
   # Add your keys:
   # OPENAI_API_KEY=your_openai_key_here
   # GEMINI_API_KEY=your_gemini_key_here
   
   # Load environment variables
   source load_env.sh
   ```

5. **Quick Test**:
   ```bash
   # Test the system with a simple generation
   python -m scripts.generate_graphs --tasks colorLeaves --node_sizes 5 10 --sequential
   
   # Generate visualizations
   python -m scripts.visualization.main --visualizations overview
   ```

## Architecture Overview

The system is organized into several key components:

1. **Tasks**: Definitions of graph transformations (e.g., coloring nodes with specific degrees)
2. **Graph Generators**: Functions that create different types of graph structures
3. **Properties**: Predicates that verify specific characteristics of graphs
4. **Pre-transformations**: Operations applied to graphs before the main task transformation
5. **Validation**: Ensures graphs meet required properties for tasks
6. **Prompt Generation**: Creates customized prompts for model evaluation with flexible question types
7. **Benchmark Running**: Executes models against stored prompts
8. **Evaluation**: Compares model outputs with ground truth
9. **Visualization**: Multi-level visualization system for comprehensive analysis

## Typical Workflow

### 1. Generate Graph Datasets

Start by generating graph datasets with default settings:

```bash
python -m scripts.generate_graphs
```

**Default settings:**
- **Node sizes**: 5, 10, 15 nodes
- **Number of pairs**: 3 input-output examples per task + 1 test case
- **Seed**: 42 (for reproducibility)
- **Workers**: CPU count - 1 (parallel processing)
- **Tasks**: All available tasks (see [Available Tasks](#available-tasks) section)
- **Graph types**: All compatible graph generators for each task

You can customize the node sizes:
```bash
python -m scripts.generate_graphs --node_sizes 25 50 100
```

Other useful options:
```bash
# Generate specific tasks only
python -m scripts.generate_graphs --tasks colorLeaves colorDegree2

# Use specific graph types
python -m scripts.generate_graphs --graph_types randomTree star

# Run sequentially (for debugging)
python -m scripts.generate_graphs --sequential

# List available tasks and graph types
python -m scripts.generate_graphs --list_tasks
python -m scripts.generate_graphs --list_graph_types
```

### 2. Generate Prompt Variations (Batch Mode)

Generate comprehensive prompt variations using the batch generator:

```bash
python -m scripts.generate_prompts_batch \
    --encodings adjacency incident \
    --patterns scale_up_3 \
    --all-system-prompts \
    --all-questions \
    --all-targets
```

**What this does:**
- **Encodings**: Creates prompts in both adjacency list and incident list formats
- **Patterns**: Uses specific size patterns:
  - `scale_up_3`: [5, 10, 15] - 3 examples scaling up in size
- **All system prompts**: Generates prompts with all available system prompts:
  - `none`: No system prompt
  - `analyst`: "You are a graph analyst..."
  - `programmer`: "You are a graph algorithm developer..."
  - `teacher`: "You are a mathematics teacher..."
- **All questions**: Creates prompts for all question types:
  - `full_output`: Generate complete transformed graph (default)
  - `node_count`, `edge_count`: Count nodes/edges in input/output
  - `blue_node_count`, `colored_node_count`: Count specific colored nodes
  - `is_connected`, `is_tree`, `has_cycles`: Boolean graph properties
  - `max_degree`, `min_degree`: Degree-based questions
  - `component_count`: Number of connected components
- **All targets**: For each question, target both input and output graphs (where applicable)

**Result**: This generates thousands of prompts across all combinations of tasks, encodings, patterns, system prompts, question types, and targets. Each prompt is saved individually in the respective task directories.

### 3. Collect Prompts into Single File

Collect all generated prompts into a single JSON file for batch processing:

```bash
python -m scripts.generate_prompts --collect
```

**What this does:**
- Scans all `datasets/*/*/prompts/` directories
- Collects all `.txt` prompt files
- Creates `llm-inference/prompts/prompts_xml_SU234_M3.json` with:
  - Unique ID for each prompt
  - System prompt and main content separated
  - Metadata (task, encoding, pattern, question type, etc.)
  - Structured format compatible with both OpenAI and open-source models

### 4. Split Prompts into Batches

For large prompt collections (10,000+ prompts), split into manageable batches:

```bash
# Split into batches of 9,000 prompts each
jq '.[0:9000]' llm-inference/prompts/prompts_xml_SU234_M3.json > llm-inference/prompts/prompts_xml_SU234_M3_batch1.json
jq '.[9000:18000]' llm-inference/prompts/prompts_xml_SU234_M3.json > llm-inference/prompts/prompts_xml_SU234_M3_batch2.json
# ... continue as needed
```

**File naming convention**: `prompts_xml_SU234_M3` indicates:
- `xml`: XML-based response format with `<thinking>` and `<answer>` tags
- `SU234`: Scale-Up patterns with 2, 3, 4 examples + Mixed_3 pattern
- `M3`: Mixed patterns with 3 examples

### 5A. OpenAI Models Track

For OpenAI models, use the batch submission script:

```bash
python -m scripts.submit_collected_prompts_batch \
    --prompts_file llm-inference/prompts/prompts_xml_SU234_M3_batch1.json \
    --model gpt-4.1-nano \
    --batch_name my_experiment_batch1
```

This uses OpenAI's Batch API for cost-efficient processing of large prompt sets.

### 5B. Open-Source Models Track (running on the cluster)

*Redacted for anonymization purposes*

### 6. Reformat Cluster Results

Convert cluster batch results to individual response files:

```bash
python -m scripts.reformat_cluster_results llm-inference/results/qwen3-8b_results_batch1.json
```

**What this does:**
- Converts batch JSON to individual response files in `datasets/*/*/responses/`
- Analyzes token usage and creates consolidated token statistics
- Maintains filename conventions compatible with evaluation system

### 7. Evaluate and Visualize

Run comprehensive evaluation and generate visualizations:

```bash
python -m scripts.evaluate_responses
```

**What this does:**
- Evaluates all model responses against ground truth
- Supports both graph-based and question-based evaluation
- Generates comprehensive performance statistics
- Creates comprehensive visualization system

### 8. Generate Targeted Visualizations

Create specific visualization sets for different purposes:

```bash
# For presentations and talks
python -m scripts.visualization.main --visualizations presentation --no-titles

# For research analysis  
python -m scripts.visualization.main --visualizations overview detailed summary --verbose

# For cost optimization
python -m scripts.visualization.main --visualizations prompt-token --datasets-dir datasets

# For specific models or tasks
python -m scripts.visualization.main --models gpt-4.1-nano o3-mini --tasks colorLeaves addHub
```

## 📊 Comprehensive Visualization System

The system provides a complete visualization suite with four distinct levels plus specialized analysis:

### **Level 1: Overview Visualizations** 🎯
**Purpose**: High-level performance insights and comparisons
**Generated files**: 13+ charts in `visualizations/overview/`

**Key charts**:
- Overall model performance (full output tasks)
- Question-based task performance  
- Input vs output comparison
- Question type breakdown
- System prompt impact analysis (separate for full output and question-based)
- Task performance summary (separate for full output and question-based)
- Token usage overview
- Encoding impact analysis (separate for full output and question-based)
- System prompt comparison analysis

### **Level 2: Detailed Visualizations** 🔍
**Purpose**: Deep-dive analysis per task and error patterns
**Generated files**: 100+ charts in `visualizations/detailed/`

**Structure**:
- **Per-task analysis**: Individual directories for each task
  - Question type performance matrices (input/output split)
  - Input vs output breakdown
  - Input-output answer transfer analysis
  - System prompt impact (separate for full output and question-based)
  - Graph type analysis (separate for full output and question-based)
  - Encoding impact (separate for full output and question-based)
  - Size pattern analysis (separate for full output and question-based)
- **Error analysis**: Success rates, target difficulty, challenging tasks

### **Level 3: Summary Visualizations** 📈
**Purpose**: Statistical analysis and data quality assessment
**Generated files**: 15+ files in `visualizations/summary/`

**Reports**:
- **Performance summaries** (`.txt`, `.json`) - Comprehensive statistics
- **Data coverage analysis** - Sample distribution heatmaps
- **Sample size analysis** - Statistical reliability assessment  
- **Pattern impact analysis** - Performance by size pattern and model
- **Example size analysis** - Model performance vs number of examples
- **Cross-modal analysis** - Full output vs question-based comparison

### **Level 4: Presentation Visualizations** 🎤 **NEW**
**Purpose**: Clean, high-impact visualizations optimized for talks and presentations
**Generated files**: 12+ charts in `visualizations/presentation/`

**Key features**:
- **Combined model performance**: Full output vs. question-based tasks in one chart
- **Large fonts and clear hierarchy**: Optimized for projection
- **Minimal clutter**: No floating annotations
- **Different bar patterns**: Clear visual distinction (solid vs hatched bars)
- **PDF format**: Vector graphics for crisp projection
- **Filtered input/output comparison**: Only meaningful task-question combinations

**Generated charts**:
- `01_combined_model_performance.pdf` - Main results combining full output and question-based
- `02_scaling_performance_by_pattern.pdf` - Model performance across size patterns
- `03_input_output_comparison.pdf` - Standard input vs output analysis
- `04_transfer_analysis.pdf` - Input-output answer transfer analysis
- `05_task_performance.pdf` - Performance on challenging tasks
- `06_model_performance_by_pattern.pdf` - Detailed pattern analysis
- `07_filtered_input_output_comparison.pdf` - **NEW**: Only meaningful comparisons


## 📋 Visualization Command Reference

### Basic Usage

```bash
# Generate all visualization levels (default)
python -m scripts.visualization.main

# Generate specific levels only
python -m scripts.visualization.main --visualizations overview detailed

# Generate presentation-ready charts
python -m scripts.visualization.main --visualizations presentation

# Include prompt token analysis
python -m scripts.visualization.main --visualizations overview prompt-token
```

### **Complete Parameter Reference**

#### **Data Source Parameters**
```bash
# Specify evaluation file (default: auto-detect latest)
python -m scripts.visualization.main /path/to/evaluation_results_20241215_143022.json

# Specify datasets directory for token analysis (default: datasets)
python -m scripts.visualization.main --datasets-dir /path/to/datasets
```

#### **Filtering Parameters**
```bash
# Filter by models
python -m scripts.visualization.main --models gpt-4.1-nano o3-mini qwen-3-8b-v4

# Filter by tasks
python -m scripts.visualization.main --tasks colorLeaves colorDegree2 addHub

# Filter by question types  
python -m scripts.visualization.main --question-types node_count blue_node_count is_connected

# Use size pattern profiles
python -m scripts.visualization.main --profile scaling  # Only cap10_3, cap25_3, etc.
python -m scripts.visualization.main --profile legacy   # Exclude cap series
python -m scripts.visualization.main --profile all      # Include everything

# Custom pattern filtering (overrides profiles)
python -m scripts.visualization.main --include-patterns scale_up_3
python -m scripts.visualization.main --exclude-patterns cap10_3 cap25_3
```

#### **Output Parameters**
```bash
# Custom output directory
python -m scripts.visualization.main --output-dir my_visualizations

# Add suffix to output directory
python -m scripts.visualization.main --output-suffix _experiment1

# Generate without titles (for publications)
python -m scripts.visualization.main --no-titles
```

#### **Visualization Selection**
```bash
# All visualization types
python -m scripts.visualization.main --visualizations overview detailed summary presentation prompt-token

# Just the essentials
python -m scripts.visualization.main --visualizations overview presentation

# Deep analysis only
python -m scripts.visualization.main --visualizations detailed summary
```

#### **Verbose Output**
```bash
# See detailed progress and statistics
python -m scripts.visualization.main --verbose
```



## Core Components in Detail

### Task Definition System (`task_definition.py`, `task_base.py`)

The task system allows defining graph transformation operations that can be applied to various graph types:

- Each task declares required properties, preferred generators, and pre-transformations
- Tasks can specify parameter schemas for customization
- The system automatically handles compatibility between tasks and graph generators

```python
@register_task(
    name="colorDegree2",
    required_properties=[has_degree(2)],
    preferred_generators=["random", "randomConnected", "randomTree"],
    parameter_schema={"color": "string"},
    description="Colors all nodes with degree 2."
)
def color_degree_2(G: nx.Graph, params: Dict[str, Any]) -> nx.Graph:
    """Colors all nodes with degree 2 blue."""
    color = params.get("color", "blue")
    
    for node in G.nodes:
        if G.degree[node] == 2:
            G.nodes[node]["color"] = color
    
    return G
```

### Available Tasks

#### Color-Based Tasks
Transform graphs by coloring nodes based on structural properties:

| Task | Description |
|------|-------------|
| `colorDegree1` | Colors all nodes with degree 1 (leaf nodes) |
| `colorDegree2` | Colors all nodes with degree 2 |
| `colorDegree3` | Colors all nodes with degree 3 |
| `colorMaxDegree` | Colors all nodes with maximum degree |
| `colorMinDegree` | Colors all nodes with minimum degree |
| `colorInternal` | Colors all non-leaf (internal) nodes |
| `colorLeaves` | Colors all leaf nodes (degree 1) |
| `colorNeighbors` | Colors all neighbors of a pre-colored orange node |
| `colorPath` | Colors all nodes on path between two colored leaves |
| `colorComponents` | Colors nodes based on connected component membership |
| `colorDistanceAtLeast2` | Colors nodes at distance ≥2 from marked nodes |
| `colorEquidistant` | Colors nodes equidistant from two blue nodes |

#### Structure Modification Tasks
Transform graphs by modifying structure (adding/removing nodes/edges):

| Task | Description |
|------|-------------|
| `addHub` | Adds a new colored hub node connected to all existing nodes |
| `edgeToNode` | Replaces each edge with a new intermediate node |
| `removeDegree1` | Removes all nodes with degree 1 |
| `removeDegree2` | Removes all nodes with degree 2 |
| `removeDegree3` | Removes all nodes with degree 3 |
| `bipartitionCompletion` | Colors remaining nodes in bipartite graph based on seeds |
| `blueSubgraph` | Returns subgraph induced by blue nodes |
| `mergeAtBlue` | Merges two components at their blue nodes |
| `complementGraph` | Returns complement graph (inverts edge presence) |
| `removeSameColorEdges` | Removes edges between same-colored nodes |

### Property System (`properties.py`)

Properties are predicates that check specific characteristics of graphs:

- Each property has a definitive status: `TRUE`, `FALSE`, or `MAYBE` for any graph generator
- Basic properties include: `connected`, `acyclic`, `tree`, `bipartite`, etc.
- Parameterized properties like `has_degree(n)` or `has_colored_leaves(n)` 
- Property verification functions determine if a graph has a given property

#### Property Status Enum

```python
class PropertyStatus(Enum):
    TRUE = auto()      # Property is guaranteed to be true
    FALSE = auto()     # Property is guaranteed to be false
    MAYBE = auto()     # Property may be true or false depending on random generation
```

### Pre-transformation System (`pretransformations.py`)

Pre-transformations modify graphs before the main task is applied:

- Examples include: `color_random_node`, `color_some_leaves`, etc.
- Each pre-transformation declares what properties it provides
- Pre-transformations can have parameters to customize their behavior

```python
@register_pretransformation(
    name="color_some_leaves",
    required_properties=["connected", has_degree(1)],
    provided_properties=["has_colored_leaves"],
    provided_properties_fn=lambda params: [f"has_colored_leaves_{params.get('count', 2)}"],
    parameter_schema={"color": "string", "count": "int"},
)
def color_existing_leaves(G: nx.Graph, params: Dict[str, Any]) -> nx.Graph:
    """Colors existing leaf nodes (degree 1) in a graph."""
    # Implementation
```

### Graph Generators (`create_graph.py`)

The system includes multiple graph generation functions:

- **Random Graphs**: `generate_random_graph()` - Erdős-Rényi random graphs
- **Connected Graphs**: `generate_random_connected_graph()` - Guaranteed to be connected
- **Trees**: `generate_random_tree()` - Acyclic connected graphs
- **Multi-Component Graphs**: `generate_random_2_component_graph()` - Exactly 2 components
- **Star Graphs**: `generate_star_graph()` - Central hub with connected leaves
- **Bipartite Graphs**: `generate_random_bipartite_graph()` - Two-colorable graphs

### Graph Validation (`graph_validation.py`)

The validation system ensures graphs satisfy the required properties:

- Generates graphs that meet task requirements
- Intelligently skips checking properties guaranteed to be `TRUE`
- Fails immediately if a property is guaranteed to be `FALSE`
- Only regenerates graphs for properties with status `MAYBE` 
- Applies pre-transformations as needed

### Prompt Generation (`generate_prompts.py`)

The prompt generation system creates test prompts for model evaluation with flexible question types and modular template structure:

#### Template Architecture

The system uses a modular template approach with clear separation of components:

```python
# Template components assembled in order:
# 1. SYSTEM_PROMPT_TEMPLATE (optional)
# 2. INTRODUCTION_TEMPLATE (minimal and universal)
# 3. FORMAT_INSTRUCTION_TEMPLATE (XML tags explanation)
# 4. EXAMPLE_TEMPLATE (repeated for each example)
# 5. FINAL_INPUT_TEMPLATE (neutral transition)
# 6. QUESTION_INSTRUCTION_TEMPLATE (specific question)
# 7. FINAL_FORMAT_REMINDER (XML format reminder)
```

#### Question Types

The system supports multiple question types that can target either input or output graphs:

- **full_output**: Generate the complete transformed graph (default behavior)
- **node_count**: Count nodes in input/output graph
- **edge_count**: Count edges in input/output graph  
- **blue_node_count**: Count blue nodes specifically
- **colored_node_count**: Count all non-grey nodes
- **is_connected**: Boolean connectivity check
- **is_tree**: Boolean tree property check
- **has_cycles**: Boolean cycle detection
- **max_degree**: Maximum node degree
- **min_degree**: Minimum node degree
- **component_count**: Number of connected components

Each question type supports both `input` and `output` targets (where applicable).

#### System Prompts

Flexible system prompts that work with any question type:

- **analyst**: "You are a graph analyst. Study the following graph examples carefully and answer the question that follows."
- **programmer**: "You are a graph algorithm developer. Analyze the example graphs and their patterns, then answer the question about the given input."
- **teacher**: "You are a mathematics teacher. Examine these graph examples to understand any patterns, then answer the question clearly and methodically."
- **none**: Empty prompt (default)

#### Size Patterns

Predefined patterns for example sizing:
- **scale_up_3**: [5, 10, 15] - Progressive scaling with 3 examples
- **scale_up_4**: [5, 10, 15, 15] - Progressive scaling with 4 examples

#### Filename Convention

Prompt files are named using the pattern:
```
{encoding}_{size_pattern}_{system_prompt}_{n_pairs}[_{question_type}_{target}].txt
```

Examples:
- `adjacency_scale_up_3_analyst_3.txt` (default full_output)
- `adjacency_scale_up_3_analyst_3_node_count_output.txt`

### Benchmark Running (`run_benchmarks.py`, `batch_run_benchmarks.py`)

The benchmark system runs models against stored prompts:

- **Pre-generated Prompts**: Uses prompts stored in the benchmark directories
- **Flexible Model Support**: OpenAI, Gemini, Qwen
- **Batch Processing**: Optional batching for OpenAI models
- **Run All Prompts**: Can run all existing prompts for a benchmark
- **Response Naming**: Follows the same pattern as prompts, replacing n_pairs with model name

Response filename pattern:
```
{encoding}_{size_pattern}_{system_prompt}[_{question_type}_{target}]_{model}.txt
```

Examples:
- `adjacency_scale_up_3_analyst_gpt-4.1-nano.txt` (full_output response)
- `adjacency_scale_up_3_analyst_node_count_output_gpt-4.1-nano.txt`

### Evaluation (`evaluate_responses.py`)

The evaluation system compares model outputs with ground truth:

- **Graph Decoding**: Extracts graphs from model responses using XML format
- **Comparison Modes**: Isomorphic (structure only) or label_consistent (exact match)
- **Failure Analysis**: Detailed reporting of parse errors, timeouts, etc.
- **Metadata Extraction**: Parses filenames to extract metadata
- **Support for Question-Based Prompts**: Handles both full graph outputs and specific answer formats
- **Progress Bar**: Displays progress to track evaluation
- **Token Analysis**: Includes comprehensive token usage analysis


## Graph Type Properties

Each graph generator has defined properties:

| Generator | connected | acyclic | cyclic | bipartite | has_degree_1 | has_internal_node |
|-----------|-----------|---------|--------|-----------|--------------|-------------------|
| random | MAYBE | MAYBE | MAYBE | MAYBE | MAYBE | MAYBE |
| randomConnected | TRUE | MAYBE | MAYBE | MAYBE | MAYBE | MAYBE |
| randomTree | TRUE | TRUE | FALSE | MAYBE | TRUE | TRUE |
| star | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE |
| random2Component | FALSE | MAYBE | MAYBE | MAYBE | MAYBE | MAYBE |
| bipartite | MAYBE | MAYBE | MAYBE | TRUE | MAYBE | MAYBE |

## Thread-Safe Metadata Management (`metadata.py`)

The system includes robust thread-safe file operations for managing metadata:
- Uses file locking to prevent race conditions during parallel generation
- Reloads latest version before making updates to avoid overwrites
- Includes retry mechanisms for handling temporary file access issues
- Properly sorts and organizes pairs to maintain consistent formatting
- Provides detailed error reporting for any metadata-related issues

## Graph Encoding and Decoding (`encode_graph.py`, `decode_graph.py`)

The system supports multiple text formats for encoding graphs:

### Adjacency Format
```
In an undirected graph, (i,j) means that node i
and node j are connected with an undirected edge. G
describes a graph among nodes 0, 1, 2, 3, 4.
The edges in G are: (0,1) (1,2) (2,3) (3,4).

The following nodes are colored: 1, 2, 3.
```

### Incident Format
```
G describes a graph among nodes 0, 1, 2, 3, 4.
In this graph:
Node 0 is connected to nodes 1.
Node 1 is connected to nodes 0, 2.
Node 2 is connected to nodes 1, 3.
Node 3 is connected to nodes 2, 4.
Node 4 is connected to nodes 3.

Node 1 is blue.
Node 2 is blue.
Node 3 is blue.
```

## Adding New Components

### Adding a New Task

To add a new task, create a function and register it using the `@register_task` decorator:

```python
from scripts.utils.task_definition import register_task
from scripts.utils.properties import has_degree

@register_task(
    name="colorCustomDegree",  # Unique name for the task
    required_properties=[has_degree(4)],  # Properties the graph must have
    preferred_generators=["random", "randomConnected"],  # Preferred graph generators
    parameter_schema={"color": "string"},  # Parameters the task accepts
    description="Colors all nodes with degree 4."  # Human-readable description
)
def color_custom_degree(G: nx.Graph, params: Dict[str, Any]) -> nx.Graph:
    """
    Colors all nodes with degree 4.
    
    Parameters:
    - G: NetworkX graph to transform
    - params: Dictionary with parameter "color" (default: "blue")
    
    Returns:
    - Graph with degree 4 nodes colored
    """
    color = params.get("color", "blue")
    
    for node in G.nodes:
        if G.degree[node] == 4:
            G.nodes[node]["color"] = color
    
    return G
```

Place your task in one of the task files in the `scripts/tasks/` directory or create a new file and import it in `scripts/generate_graphs.py`.

### Adding a New Graph Generator

To add a new graph generator, add a function to `scripts/utils/create_graph.py` and update the generator registries:

```python
def generate_custom_graph(n, p=0.3, seed=None):
    """
    Generates a custom graph structure.
    
    Properties:
    - connected: True
    - custom_property: True
    
    Parameters:
    - n (int): Number of nodes
    - p (float): Edge probability
    - seed (int, optional): Random seed
    
    Returns:
    - G (networkx.Graph): Generated graph
    """
    # Implementation
```

Then update the `GRAPH_GENERATORS` and `GRAPH_GENERATORS_FUNCTIONS` dictionaries in `scripts/utils/task_compatibility.py`:

```python
from scripts.utils.properties import PropertyStatus, TRUE, FALSE, MAYBE

GRAPH_GENERATORS = {
    # ... existing generators
    "customGraph": {
        "connected": TRUE,
        "acyclic": MAYBE,
        "custom_property": TRUE,
    },
}

GRAPH_GENERATORS_FUNCTIONS = {
    # ... existing functions
    "customGraph": generate_custom_graph,
}
```

### Adding a New Question Type

To add a new question type to the prompt generation system:

```python
# Add to QUESTION_TYPES dictionary in generate_prompts.py
QUESTION_TYPES = {
    # ... existing questions
    "your_question": {
        "input": {
            "question": "Your question about the input graph?",
            "answer_format": "The ANSWER section should contain your expected format."
        },
        "output": {
            "question": "Your question about the output graph after transformation?",
            "answer_format": "The ANSWER section should contain your expected format."
        }
    }
}
```

### Adding a New System Prompt

To add a new system prompt, update the `SYSTEM_PROMPTS` dictionary in `scripts/generate_prompts.py`:

```python
SYSTEM_PROMPTS = {
    # ... existing prompts
    "researcher": "You are a graph theory researcher. Examine these graph examples to understand any patterns, then answer the question with scientific rigor.",
}
```

### Adding a New Size Pattern

To add a new size pattern, update the `SIZE_PATTERNS` dictionary in `scripts/generate_prompts.py`:

```python
SIZE_PATTERNS = {
    # ... existing patterns
    "pyramid_4": [5, 10, 15, 20],  # Increasing sizes pattern
}
```

## Command-Line Usage Examples

### Generating Graphs

```bash
# Generate all benchmarks
python -m scripts.generate_graphs

# Generate specific benchmarks with custom settings
python -m scripts.generate_graphs --tasks colorLeaves colorDegree2 --node_sizes 5 10 15 --seed 42 --workers 8

# List available tasks and graph types
python -m scripts.generate_graphs --list_tasks
python -m scripts.generate_graphs --list_graph_types
```

### Generating Prompts

```bash
# Generate comprehensive prompt variations
python -m scripts.generate_prompts_batch \
    --encodings adjacency incident \
    --patterns mixed_3 scale_up_3 \
    --all-system-prompts \
    --all-questions \
    --all-targets

# Generate prompts with specific configurations
python -m scripts.generate_prompts --benchmarks colorLeaves --pattern scale_up_3 --system_prompt analyst

# Collect all prompts into single file
python -m scripts.generate_prompts --collect

# List available patterns and question types
python -m scripts.generate_prompts --list_patterns
python -m scripts.generate_prompts --list_questions
```

### Running Benchmarks

```bash
# Submit to OpenAI Batch API
python -m scripts.submit_collected_prompts_batch \
    --prompts_file llm-inference/prompts/prompts_xml_SU234_M3_batch1.json \
    --model gpt-4.1-nano

# Check batch status and retrieve results
python -m scripts.batch_run_benchmarks --check BATCH_ID
python -m scripts.batch_run_benchmarks --retrieve BATCH_ID
```

### Evaluating and Visualizing Results

```bash
# Evaluate all responses
python -m scripts.evaluate_responses

# Evaluate with specific filters
python -m scripts.evaluate_responses --models gpt-4.1-nano o3-mini --system_prompts analyst

# ===== COMPREHENSIVE VISUALIZATION EXAMPLES =====

# Generate all visualization levels (auto-detects latest evaluation file)
python -m scripts.visualization.main

# Generate presentation-ready charts for talks
python -m scripts.visualization.main --visualizations presentation --no-titles --verbose

# Generate specific visualization levels
python -m scripts.visualization.main --visualizations overview detailed summary

# Focus on specific models and tasks
python -m scripts.visualization.main --models gpt-4.1-nano o3-mini --tasks colorLeaves colorDegree2 --verbose

# Use evaluation file with filtering profiles
python -m scripts.visualization.main \
    evaluation_data/evaluation_results_20241215.json \
    --profile scaling \
    --visualizations overview presentation

# Generate token analysis for cost optimization
python -m scripts.visualization.main \
    --visualizations prompt-token \
    --datasets-dir datasets \
    --verbose

# Complete research analysis with custom output
python -m scripts.visualization.main \
    --visualizations overview detailed summary \
    --models gpt-4.1-nano qwen-3-8b-v4 \
    --tasks addHub colorLeaves removeDegree1 \
    --output-dir research_analysis \
    --verbose

# Quick performance check
python -m scripts.visualization.main \
    --visualizations overview \
    --output-suffix _quick_check

# Publication-ready figures without titles
python -m scripts.visualization.main \
    --visualizations overview presentation \
    --no-titles \
    --output-dir paper_figures

# ===== ADVANCED FILTERING EXAMPLES =====

# Legacy patterns only (exclude scaling experiments)
python -m scripts.visualization.main \
    --profile legacy \
    --visualizations overview detailed

# Custom pattern inclusion/exclusion
python -m scripts.visualization.main \
    --include-patterns scale_up_3 mixed_3 \
    --exclude-patterns cap10_3 cap25_3 \
    --visualizations overview

# Question type specific analysis
python -m scripts.visualization.main \
    --question-types node_count blue_node_count is_connected \
    --visualizations detailed

# Combine multiple filters
python -m scripts.visualization.main \
    --models gpt-4.1-nano o3-mini \
    --tasks colorLeaves addHub \
    --include-patterns scale_up_3 \
    --visualizations overview presentation \
    --output-suffix _filtered_analysis \
    --verbose
```

### Reformatting Cluster Results

```bash
# Convert cluster batch results to individual response files
python -m scripts.reformat_cluster_results llm-inference/results/qwen3-8b_results_batch1.json

# Include token analysis and individual token files
python -m scripts.reformat_cluster_results \
    llm-inference/results/qwen3-8b_results_batch1.json \
    --store-individual-tokens \
    --verbose
```

### Profile System Examples

```bash
# Use predefined visualization profiles
python -m scripts.visualization.main --profile scaling    # Only cap10_3, cap25_3, etc.
python -m scripts.visualization.main --profile legacy     # Exclude cap series patterns
python -m scripts.visualization.main --profile all        # Include everything

# Profiles with custom visualization types
python -m scripts.visualization.main --profile scaling --visualizations presentation
python -m scripts.visualization.main --profile legacy --visualizations overview detailed --verbose
```

## Common Issues and Solutions

1. **Task Generation Failures**:
   - Check that no required properties are marked as `FALSE` for the generator
   - Ensure pre-transformations provide the required properties
   - Increase the number of attempts or adjust node counts for difficult tasks

2. **Compatibility Issues**:
   - Verify property names match exactly between task requirements and generator properties
   - Check that parameterized properties follow the correct naming convention

3. **Prompt Generation Failures**:
   - Make sure the specified pattern or sizes exist in the dataset
   - Verify that there are enough input-output pairs for the pattern
   - Check that the question type and target combination is valid

4. **Question Type Errors**:
   - Use `--list_questions` to see available question types and valid targets
   - Ensure the question type supports the specified target (input/output)

5. **Evaluation Errors**:
   - Check that response files follow the expected naming convention
   - Ensure the graph decoder can parse the model's output format (XML tags)
   - For timeout issues, consider using the label_consistent comparison mode
   - Note that question-based prompts may require different evaluation approaches than full graph outputs

6. **Visualization Errors**:
   - Check that evaluation files exist in the expected location
   - Use `--verbose` flag to see detailed error messages
   - Filter to specific models or tasks if the full dataset is too large
   - Ensure the visualization module is properly imported



## Advanced Features

### Custom Evaluation Comparisons

The system supports two comparison modes for evaluation:

1. **Isomorphic**: Checks if graphs are structurally equivalent, ignoring specific node labels.
   ```bash
   python -m scripts.evaluate_responses --comparison-mode isomorphic
   ```

2. **Label Consistent**: Checks for exact match including node labels and colors.
   ```bash
   python -m scripts.evaluate_responses --comparison-mode label_consistent
   ```

### Batch API Integration

For large-scale evaluations, the system can use OpenAI's Batch API:

```bash
# Submit a batch job
python -m scripts.submit_collected_prompts_batch --prompts_file prompts.json --model gpt-4.1-nano

# Check status (non-blocking)
python -m scripts.batch_run_benchmarks --check BATCH_ID

# Retrieve results when complete
python -m scripts.batch_run_benchmarks --retrieve BATCH_ID
```

### Token Analysis

The system includes comprehensive token analysis:

- **Individual response tokens**: Tracked automatically during evaluation
- **Prompt token analysis**: Generated via visualization system
- **Token optimization insights**: Strategic recommendations for encoding and prompt construction
- **Consolidated token data**: Efficient storage and analysis across large datasets

### Visualization Customization

The visualization system offers multiple levels of detail:

```bash
# Generate only overview visualizations for quick insights
python -m scripts.visualization.main --visualizations overview

# Focus on specific models and tasks
python -m scripts.visualization.main --models gpt-4.1-nano o3-mini --tasks colorLeaves --visualizations detailed

# Generate all visualizations with verbose output
python -m scripts.visualization.main --verbose --output-dir custom_visualizations

# Include prompt token analysis
python -m scripts.visualization.main --visualizations overview detailed summary prompt-token

# Generate presentation-ready charts
python -m scripts.visualization.main --visualizations presentation --no-titles
```

The resulting visualizations are organized in a clear hierarchy with:
- Interactive README files at each level
- Consistent color schemes across all charts
- Statistical summaries and confidence metrics
- Cross-modal comparisons between different approaches
- Token optimization guidance for cost-effective LLM usage
- **NEW**: Presentation-ready charts optimized for talks and meetings