# DriveGuard RAGAS Evaluation Framework

This directory contains an evaluation framework for the DriveGuard workflow, built on RAGAS (Retrieval-Augmented Generation Assessment) and adapted for driving safety assessment.

## Overview

The evaluation framework assesses the quality of DriveGuard's video analysis workflow across five RAGAS metrics:

- **Faithfulness**: How grounded the safety assessment is in the video analysis
- **Answer Relevancy**: How relevant the safety assessment is to the driving scenario
- **Answer Correctness**: How accurate the assessment is compared to expert evaluations
- **Context Precision**: How precise the extracted scenes and analysis are
- **Context Recall**: How complete the extracted driving information is

## Complete Evaluation Pipeline

The evaluation process consists of several sequential steps:

### Step 1: Prepare Evaluation Data (`1_prepare_evaluation_data.py`)

```bash
uv run evaluation/1_prepare_evaluation_data.py
```

This script:
- Creates ground truth templates for all videos in `data/dashcam/`
- Generates system outputs by running the DriveGuard workflow on the videos
- Prepares the basic structure for manual annotation
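
As a rough illustration of the template step, the sketch below writes one placeholder JSON file per video. The schema (`video_id`, `annotation`, `scenes`, `assessment`), the `.mp4` extension, and the function names are assumptions made for this example, not the actual output of `1_prepare_evaluation_data.py`. Only the `MANUAL_ANNOTATION_REQUIRED` placeholder comes from this README.

```python
import json
from pathlib import Path

PLACEHOLDER = "MANUAL_ANNOTATION_REQUIRED"  # placeholder named in Troubleshooting


def build_template(video_path: Path) -> dict:
    """Return an empty ground-truth template for one video (hypothetical schema)."""
    return {
        "video_id": video_path.stem,
        "annotation": PLACEHOLDER,
        "scenes": [],
        "assessment": {"safety_score": PLACEHOLDER, "risk_level": PLACEHOLDER},
    }


def write_templates(video_dir: Path, out_dir: Path) -> list[Path]:
    """Write one template per .mp4 file, skipping templates that already exist."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for video in sorted(video_dir.glob("*.mp4")):
        target = out_dir / f"{video.stem}.json"
        if target.exists():
            continue  # never overwrite existing annotation work
        target.write_text(json.dumps(build_template(video), indent=2))
        written.append(target)
    return written
```

Skipping existing files keeps the step safe to re-run after manual annotation has started.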

### Step 2: Generate and Refine Ground Truth

#### 2.1 Generate Initial Annotations (`2_1_generate_ground_truth_annotations.py`)

```bash
# Process all files
uv run evaluation/2_1_generate_ground_truth_annotations.py

# Process with specific model
uv run evaluation/2_1_generate_ground_truth_annotations.py --model="openai:gpt-4o"

# Process single file
uv run evaluation/2_1_generate_ground_truth_annotations.py data/evaluation/ground_truth/000_cut_off_accident.json
```

This script:
- Uses an LLM to transform system annotations into narrative ground truth format
- Converts structured categories into chronological stories
- Maintains technical accuracy while improving readability

#### 2.2 Optimize Annotations Interactively (`2_2_optimize_ground_truth_annotation.py`)

```bash
# Interactive optimization for a specific file
uv run evaluation/2_2_optimize_ground_truth_annotation.py 000

# With specific model
uv run evaluation/2_2_optimize_ground_truth_annotation.py 001 --model="openai:gpt-4o"
```

This script:
- Opens an interactive editor for providing revision instructions
- Uses an LLM to optimize annotations based on your feedback
- Allows iterative refinement until satisfied
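
The commands above pass a short ID such as `000` instead of a full path. How the script resolves that ID is not documented here; one plausible approach, sketched below with a hypothetical `resolve_ground_truth` helper, is to match the ID against the numeric prefix of the ground truth file names.

```python
from pathlib import Path


def resolve_ground_truth(file_id: str, gt_dir: Path) -> Path:
    """Return the single ground-truth file whose name starts with `<file_id>_`."""
    matches = sorted(gt_dir.glob(f"{file_id}_*.json"))
    if not matches:
        raise FileNotFoundError(f"No ground truth file for ID {file_id!r} in {gt_dir}")
    if len(matches) > 1:
        names = ", ".join(p.name for p in matches)
        raise ValueError(f"ID {file_id!r} is ambiguous: {names}")
    return matches[0]
```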

#### 2.3 Extract Scenes (`2_3_extract_scenes_from_annotation.py`)

```bash
# Process all files
uv run evaluation/2_3_extract_scenes_from_annotation.py

# Process single file
uv run evaluation/2_3_extract_scenes_from_annotation.py data/evaluation/ground_truth/000_cut_off_accident.json
```

This script:
- Uses the SceneExtractor agent to identify atomic driving scenes
- Breaks down complex annotations into discrete, analyzable events
- Prepares scenes for violation and accident analysis

#### 2.4 Populate Violations and Accidents (`2_4_populate_scenes.py`)

```bash
# Process all files with intelligent analysis
uv run evaluation/2_4_populate_scenes.py

# Use specific model
uv run evaluation/2_4_populate_scenes.py --model="openai:gpt-4o"

# Show help
uv run evaluation/2_4_populate_scenes.py --help
```

This script:
- Analyzes each scene for traffic violations and accident risks
- Uses system outputs for context when available
- Generates violation and accident assessments with an LLM
- Creates templates for manual refinement

#### 2.5 Generate Safety Assessments (`2_5_generate_assessment.py`)

```bash
# Process all files
uv run evaluation/2_5_generate_assessment.py

# Process single file
uv run evaluation/2_5_generate_assessment.py 000

# With specific model
uv run evaluation/2_5_generate_assessment.py --model="openai:gpt-4o"

# Show help
uv run evaluation/2_5_generate_assessment.py --help
```

This script:
- Generates comprehensive safety assessments based on annotations, violations, and accidents
- Calculates safety scores (1-10) and risk levels
- Identifies driving strengths and weaknesses
- Provides improvement recommendations
- Features intelligent caching to avoid reprocessing unchanged content
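
The caching behavior is not specified in detail here. A minimal sketch of one common approach, assuming the cache is keyed by file ID and invalidated by a content hash of the inputs, could look like the following. The function names and the cache layout are hypothetical and may differ from what `assessment_cache.json` actually stores.

```python
import hashlib
import json
from pathlib import Path


def content_hash(payload: dict) -> str:
    """Return a stable SHA-256 digest of a JSON-serializable payload."""
    canonical = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cached_call(cache_path: Path, key: str, payload: dict, compute):
    """Return the cached result for `key` if `payload` is unchanged, else recompute."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    digest = content_hash(payload)
    entry = cache.get(key)
    if entry and entry["hash"] == digest:
        return entry["result"]  # inputs unchanged: skip the expensive call
    result = compute(payload)
    cache[key] = {"hash": digest, "result": result}
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(cache, indent=2))
    return result
```

Hashing the canonical JSON (sorted keys) means that reordering fields does not trigger a recomputation, while any change to the annotation text does.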

### Step 3: Run RAGAS Evaluation (`3_run_ragas_evaluation.py`)

```bash
uv run evaluation/3_run_ragas_evaluation.py
```

This script:
- Loads completed ground truth and system outputs
- Runs RAGAS evaluation metrics
- Features intelligent caching for efficient re-runs
- Generates a detailed report in `data/evaluation/report/evaluation_report.md`
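
To make the loading step more concrete, the sketch below shows one way a ground truth file and a system output could be combined into a single RAGAS-style record. The column names `question`, `answer`, `contexts`, and `ground_truth` follow older RAGAS releases; newer releases use `user_input`, `response`, `retrieved_contexts`, and `reference`. The input keys (`assessment`, `scenes`) and the helper name are assumptions for illustration only.

```python
def to_ragas_record(ground_truth: dict, system_output: dict, question: str) -> dict:
    """Map one DriveGuard sample onto RAGAS-style columns (hypothetical input keys)."""
    return {
        "question": question,                        # the evaluation prompt
        "answer": system_output["assessment"],       # system-generated safety assessment
        "contexts": list(system_output["scenes"]),   # extracted scenes used as context
        "ground_truth": ground_truth["assessment"],  # expert reference assessment
    }


record = to_ragas_record(
    {"assessment": "Unsafe lane change caused a collision."},
    {"assessment": "The ego vehicle was cut off.", "scenes": ["A car merges abruptly."]},
    question="Assess the driving safety of the ego vehicle.",
)
```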

## File Structure

```
evaluation/
├── README.md                          # This file
├── 1_prepare_evaluation_data.py       # Step 1: Create templates and generate system outputs
├── 2_1_generate_ground_truth_annotations.py  # Step 2.1: Generate initial annotations
├── 2_2_optimize_ground_truth_annotation.py   # Step 2.2: Interactive annotation refinement
├── 2_3_extract_scenes_from_annotation.py     # Step 2.3: Extract atomic scenes
├── 2_4_populate_scenes.py             # Step 2.4: Analyze violations and accidents
├── 2_5_generate_assessment.py         # Step 2.5: Generate safety assessments
├── 3_run_ragas_evaluation.py          # Step 3: Run RAGAS evaluation
├── ragas_evaluation_setup.py          # Core RAGAS framework setup
└── make_dataset/                      # Dataset creation tools
    ├── s1_youtube_downloader.py
    ├── s2_video_reviewer/
    └── s3_extract_clips.py
```

### Evaluation Data Structure

```
data/
└── evaluation/                        # Evaluation data directory
    ├── ground_truth/                  # Ground truth annotations
    │   ├── 000_cut_off_accident.json
    │   └── 001_left_turn_cut_off.json
    ├── system_outputs/                # System-generated outputs
    │   ├── 000_cut_off_accident.json
    │   └── 001_left_turn_cut_off.json
    ├── cache/                         # Caching for efficient re-runs
    │   ├── assessment_cache.json
    │   └── ragas_evaluation_cache.json
    └── report/                        # Evaluation reports
        └── evaluation_report.md
```
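
Because ground truth files and system outputs share file names in the layout above, samples can be paired by name. The helper below is a sketch of that idea, not the loader used by `3_run_ragas_evaluation.py`; `pair_samples` is a hypothetical name.

```python
from pathlib import Path


def pair_samples(eval_dir: Path) -> tuple[list[tuple[Path, Path]], list[str]]:
    """Pair ground truth and system output files by name; also list unmatched names."""
    gt = {p.name: p for p in (eval_dir / "ground_truth").glob("*.json")}
    so = {p.name: p for p in (eval_dir / "system_outputs").glob("*.json")}
    paired = [(gt[name], so[name]) for name in sorted(gt.keys() & so.keys())]
    unmatched = sorted(gt.keys() ^ so.keys())  # present on one side only
    return paired, unmatched
```

A non-empty `unmatched` list is a quick way to spot a missing system output or a file naming mismatch before running the evaluation.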

## Evaluation Metrics Explained

### Faithfulness (0.0 - 1.0)
- **High Score (0.8+)**: Safety assessments are well-grounded in the video analysis
- **Low Score (<0.6)**: System may be hallucinating or making unsupported claims

### Answer Relevancy (0.0 - 1.0)
- **High Score (0.8+)**: Safety assessments are specific and relevant to the driving scenario
- **Low Score (<0.6)**: Assessments are too generic or irrelevant

### Answer Correctness (0.0 - 1.0)
- **High Score (0.8+)**: System assessments align well with expert evaluations
- **Low Score (<0.6)**: Significant discrepancies between system and expert assessments

### Context Precision (0.0 - 1.0)
- **High Score (0.8+)**: Extracted scenes and analysis are highly relevant
- **Low Score (<0.6)**: Too much irrelevant information in the analysis

### Context Recall (0.0 - 1.0)
- **High Score (0.8+)**: System captures all important driving behaviors and risks
- **Low Score (<0.6)**: Important information is missed in the analysis

## Evaluation Workflow Summary

### Sequential Pipeline

**Prepare Data** (Step 1) → **Generate Annotations** (2.1) → **Refine Annotations** (2.2) → **Extract Scenes** (2.3) → **Analyze Violations/Accidents** (2.4) → **Generate Assessments** (2.5) → **Run RAGAS** (Step 3)

### Key Features

- **LLM-Assisted Annotation**: Steps 2.1-2.5 use LLMs to draft and refine ground truth content
- **Flexible Processing**: Process all files or single files as needed
- **Model Override**: Use `--model` flag to specify different LLMs
- **Smart Caching**: Avoid reprocessing unchanged content
- **Interactive Refinement**: Step 2.2 allows iterative annotation improvement

### Typical Workflow

```bash
# Step 1: Initial setup
uv run evaluation/1_prepare_evaluation_data.py

# Steps 2.1-2.2: Generate and refine annotations
uv run evaluation/2_1_generate_ground_truth_annotations.py
uv run evaluation/2_2_optimize_ground_truth_annotation.py 000  # Refine specific files

# Steps 2.3-2.4: Extract and analyze scenes
uv run evaluation/2_3_extract_scenes_from_annotation.py
uv run evaluation/2_4_populate_scenes.py

# Step 2.5: Generate assessments
uv run evaluation/2_5_generate_assessment.py

# Step 3: Run evaluation
uv run evaluation/3_run_ragas_evaluation.py
```

## Best Practices for Ground Truth Annotation

### Using the Automated Pipeline

1. **Let LLMs Do Initial Work**: Use scripts 2.1-2.5 to generate initial content
2. **Review and Refine**: Manually review LLM outputs and refine as needed
3. **Use Interactive Optimization**: Script 2.2 helps iteratively improve annotations
4. **Validate Completeness**: Ensure all sections are filled before evaluation

### Manual Annotation Guidelines

#### 1. Video Annotation
- Create a chronological narrative of driving events
- Focus on ego vehicle behavior and interactions
- Include environmental context (weather, road conditions)
- Be objective and factual

#### 2. Scene Extraction
- Break down into atomic, discrete events
- Each scene should represent one specific action/situation
- Keep scenes concise (1-2 sentences)
- Cover all safety-relevant moments

#### 3. Violation Assessment
- Mark "found" only for clear traffic law violations
- Specify the exact rule violated (e.g., "Failed to yield right of way")
- Consider context and severity
- Be consistent in judgments

#### 4. Accident Analysis
- Mark "found" for genuine collision risks or near-misses
- Describe realistic potential consequences
- Consider both immediate and chain-reaction risks
- Be specific about accident types

#### 5. Safety Assessment
- **Safety Score (1-10)**:
  - 1-3: Critical violations, high accident risk
  - 4-6: Moderate concerns, some risks
  - 7-8: Generally safe, minor issues
  - 9-10: Excellent, no concerns
- **Risk Level**: One of critical, high, medium, or low; must be consistent with the safety score
- **Strengths/Weaknesses**: Balance positive and negative observations
- **Improvement Advice**: Provide actionable, specific recommendations
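
The guideline above does not say exactly which score band corresponds to which risk level. The sketch below assumes a one-to-one mapping between the four score bands and the four risk levels (1-3 critical, 4-6 high, 7-8 medium, 9-10 low) and uses it to flag inconsistent annotations. Both the mapping and the `safety_score` / `risk_level` keys are assumptions to adapt to your annotation schema.

```python
def risk_level_for(score: int) -> str:
    """Map a 1-10 safety score to a risk level (assumed band-to-level mapping)."""
    if not 1 <= score <= 10:
        raise ValueError(f"safety score must be between 1 and 10, got {score}")
    if score <= 3:
        return "critical"
    if score <= 6:
        return "high"
    if score <= 8:
        return "medium"
    return "low"


def is_consistent(assessment: dict) -> bool:
    """Check that the annotated risk level matches the annotated safety score."""
    return risk_level_for(assessment["safety_score"]) == assessment["risk_level"]
```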

## Interpreting Results

### Overall Score Ranges
- **0.8+**: Excellent - System performs at expert level
- **0.7 to <0.8**: Good - System is reliable with minor issues
- **0.6 to <0.7**: Fair - System needs improvement
- **<0.6**: Poor - System requires significant work
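
The ranges above can be applied programmatically. The sketch below treats each lower bound as inclusive and computes the overall score as an unweighted mean of the per-metric scores. Both choices are assumptions; the generated report may aggregate differently.

```python
def overall_score(scores: dict[str, float]) -> float:
    """Unweighted mean of per-metric scores (assumed aggregation)."""
    if not scores:
        raise ValueError("no metric scores provided")
    return sum(scores.values()) / len(scores)


def rating(score: float) -> str:
    """Map a 0.0-1.0 score to the rating labels used in this README."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    if score >= 0.8:
        return "Excellent"
    if score >= 0.7:
        return "Good"
    if score >= 0.6:
        return "Fair"
    return "Poor"
```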

### Common Issues and Solutions

| Metric with Low Score | Possible Cause | Solution |
|-----------------------|----------------|----------|
| Faithfulness | System hallucinating | Improve grounding in context |
| Answer Relevancy | Generic responses | Make assessments more specific |
| Answer Correctness | Poor alignment with experts | Improve training data/prompts |
| Context Precision | Too much noise | Better scene extraction |
| Context Recall | Missing information | Improve completeness |
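
The table can also be used as a lookup when post-processing results. The sketch below flags every metric under the 0.6 "low" threshold used in this README and returns the suggested fix. The dictionary keys assume the standard RAGAS metric names, and `diagnose` is a hypothetical helper.

```python
LOW_THRESHOLD = 0.6  # "low score" threshold from the metric descriptions

SUGGESTIONS = {
    "faithfulness": "Improve grounding in context",
    "answer_relevancy": "Make assessments more specific",
    "answer_correctness": "Improve training data/prompts",
    "context_precision": "Better scene extraction",
    "context_recall": "Improve completeness",
}


def diagnose(scores: dict[str, float]) -> dict[str, str]:
    """Return the suggested fix for each known metric scoring below the threshold."""
    return {
        name: SUGGESTIONS[name]
        for name, value in scores.items()
        if name in SUGGESTIONS and value < LOW_THRESHOLD
    }
```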

## Advanced Usage

### Custom Metrics
You can define custom metrics specific to driving safety assessment:

```python
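# NOTE: the base class name and import path differ between RAGAS releases
# (for example, `Metric` in `ragas.metrics.base`); check your installed version.
from ragas.metrics import BaseMetric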

class DrivingSafetySeverity(BaseMetric):
    # Custom metric for assessing safety severity alignment
    pass
```
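
As one possible starting point for the scoring logic of such a metric, the sketch below compares a predicted risk level with the expected one and returns a value between 0.0 and 1.0. It is a plain function, independent of any RAGAS base class, and the linear penalty is an assumption rather than an established measure.

```python
RISK_ORDER = ["low", "medium", "high", "critical"]  # risk levels listed in this README


def severity_alignment(predicted: str, expected: str) -> float:
    """Return 1.0 for identical risk levels, falling linearly to 0.0 for opposite ends."""
    distance = abs(RISK_ORDER.index(predicted) - RISK_ORDER.index(expected))
    return 1.0 - distance / (len(RISK_ORDER) - 1)
```

An unknown risk level raises `ValueError` from `list.index`, which surfaces annotation typos early.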

### Batch Evaluation
For large-scale evaluation:

```python
from evaluation.ragas_evaluation_setup import DriveGuardRAGASEvaluator

# Process multiple evaluation sets
evaluator = DriveGuardRAGASEvaluator(dataset)
results = evaluator.evaluate()
```

### Continuous Evaluation
Set up automated evaluation pipelines:

```bash
# Run evaluation on new data
uv run evaluation/3_run_ragas_evaluation.py --input-dir /path/to/new/videos
```

## Troubleshooting

### Common Issues

1. **"RAGAS not installed"**
   ```bash
   pip install ragas datasets pandas
   ```

2. **"No evaluation samples found"**
   - Ensure ground truth files are completed
   - Check that system outputs exist
   - Verify file naming matches pattern

3. **"Ground truth not completed"**
   - Replace all "MANUAL_ANNOTATION_REQUIRED" placeholders
   - Complete all required fields in ground truth files

4. **Low evaluation scores**
   - Check ground truth quality
   - Verify system is working correctly
   - Consider adjusting prompts or model parameters
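
For the "Ground truth not completed" case, a small check can list the fields that still contain the placeholder. This is a sketch with hypothetical helper names; it walks any JSON structure, so it does not depend on the exact ground truth schema.

```python
import json
from pathlib import Path

PLACEHOLDER = "MANUAL_ANNOTATION_REQUIRED"


def find_placeholders(node, path="$"):
    """Yield JSON paths of string values that still contain the placeholder."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_placeholders(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_placeholders(value, f"{path}[{index}]")
    elif isinstance(node, str) and PLACEHOLDER in node:
        yield path


def unfinished_fields(ground_truth_file: Path) -> list[str]:
    """Return the paths of all unfinished fields in one ground truth file."""
    return list(find_placeholders(json.loads(ground_truth_file.read_text())))
```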

### Getting Help

For issues with the evaluation framework:
1. Check the logs for specific error messages
2. Verify your ground truth annotations are complete
3. Ensure video files are accessible
4. Check RAGAS documentation for metric-specific issues

## Contributing

To improve the evaluation framework:
1. Add new evaluation metrics specific to driving safety
2. Improve ground truth annotation guidelines
3. Add support for additional video formats
4. Enhance the reporting and visualization

## References

- [RAGAS Documentation](https://docs.ragas.io/)
- [DriveGuard Workflow Documentation](../doc/workflow.png)
- [RAGAS: Automated Evaluation of Retrieval Augmented Generation (arXiv:2309.15217)](https://arxiv.org/abs/2309.15217)