# DriveGuard Component Evaluation System

A comprehensive evaluation framework for DriveGuard components that compares multiple LLMs across annotation, scene extraction, violation detection, accident assessment, and driving assessment tasks.

## Features

- **Dynamic Model Loading**: Automatically loads models from configuration files
- **Resume by Default**: Intelligent caching system to skip completed evaluations
- **Dual Evaluation**: Traditional metrics + LLM-as-judge (GPT-4.1) evaluation
- **Enhanced Domain-Specific Metrics**: 60+ metrics including safety, temporal, and coherence analysis
- **Detailed Breakdown by Default**: Complete metric transparency with organized categorization
- **Model-by-Model Analysis**: Detailed performance comparison across models
- **Real-time Progress Tracking**: Live evaluation progress with detailed timing metrics
- **Comprehensive Report Generation**: Automated markdown reports and CSV exports

## Quick Start

### 1. Check Model Configuration
```bash
# View current model configuration
uv run python -m evaluation.component_eval --show-models

# Validate model configuration files
uv run python -m evaluation.component_eval --validate-models
```

### 2. Check Evaluation Status
```bash
# See current evaluation progress
uv run python -m evaluation.component_eval --status

# View timing analysis across all models
uv run python -m evaluation.component_eval --timing-analysis
```

### 3. Run Evaluations

```bash
# Evaluate specific component (resumes automatically)
uv run python -m evaluation.component_eval --component annotation
uv run python -m evaluation.component_eval --component scene
uv run python -m evaluation.component_eval --component violation
uv run python -m evaluation.component_eval --component accident
uv run python -m evaluation.component_eval --component assessment

# Evaluate all components
uv run python -m evaluation.component_eval --all-components

# Force re-evaluation (ignore cached results)
uv run python -m evaluation.component_eval --component scene --overwrite

# Combine evaluation with report generation
uv run python -m evaluation.component_eval --component scene --report --export-csv
```

### 4. View Detailed Metrics (Default)
```bash
# All evaluations show detailed breakdown by default
uv run python -m evaluation.component_eval --component scene

# For programmatic access to detailed metrics
# (`results` is not defined in this snippet; it must hold previously computed evaluation results)
uv run python -c "
from evaluation.component_eval.evaluator import ComponentEvaluator
evaluator = ComponentEvaluator()  # detailed_metrics=True by default
evaluator.show_detailed_metrics_for_model('scene', 'openai:gpt-4o', results)
"

# Use summary mode if preferred
uv run python -c "
from evaluation.component_eval.evaluator import ComponentEvaluator  
evaluator = ComponentEvaluator(detailed_metrics=False)
"
```

### 5. Generate Reports
```bash
# Generate reports for all available results (automatically loads cached data)
uv run python -m evaluation.component_eval --report --all-components

# Generate report for specific component
uv run python -m evaluation.component_eval --component scene --report

# Export all results to CSV with timestamp
uv run python -m evaluation.component_eval --export-csv

# Generate both reports and CSV exports  
uv run python -m evaluation.component_eval --report --export-csv --all-components

# Standalone report generation (no evaluation, just reporting from cached results)
uv run python -m evaluation.component_eval --report
```

## Model Configuration

The system dynamically loads models from text files:

- **Annotation Component**: `evaluation/models/annotation.txt` (12 multimodal models)
- **Other Components**: `evaluation/models/text.txt` (22 text models)

### Adding/Removing Models
Simply edit the corresponding text file:
```
# evaluation/models/annotation.txt
openai:gpt-4o
openai:gpt-4.1
gateway:anthropic/claude-sonnet-4
...

# evaluation/models/text.txt  
openai:gpt-4o
openai:gpt-4.1
groq:llama-3.3-70b-versatile
...
```
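
To illustrate how such a file might be read, here is a minimal sketch. `parse_model_list` is a hypothetical helper, not the loader in `config.py`; it assumes one `provider:model` identifier per line and that blank lines, `#` comments, and the `...` placeholder are skipped.

```python
def parse_model_list(text):
    """Return model identifiers from the contents of a model list file.

    Hypothetical helper: the skip rules below are assumptions and may
    differ from what the real loader in config.py does.
    """
    models = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comment lines, and the "..." placeholder.
        if not line or line.startswith("#") or line == "...":
            continue
        # Keep the first occurrence only, preserving file order.
        if line not in models:
            models.append(line)
    return models
```

Calling it on the text of `evaluation/models/text.txt` would yield the identifiers in file order, with duplicates removed.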

## Architecture

```
evaluation/component_eval/
├── main.py                    # CLI entry point
├── config.py                  # Model loading & configuration
├── evaluator.py              # Main evaluation coordinator
├── utils.py                  # Caching & data management
├── metrics/                  # Traditional evaluation metrics
├── llm_judge/               # LLM-as-judge evaluation
└── reporting/               # Report generation
```

## Data Organization

```
data/evaluation/component_eval/
├── cache/                    # Evaluation caches
├── results/                  # Raw evaluation results
│   ├── annotation/
│   ├── scene/
│   ├── violation/
│   ├── accident/
│   └── assessment/
├── reports/                  # Generated reports
└── exports/                  # CSV/JSON exports
```

## Evaluation Metrics by Component

The system computes **60+ enhanced metrics** organized by component for comprehensive evaluation. All metrics are displayed by default with detailed breakdown.

### Display Modes

- **Detailed Mode (Default)**: Shows all 60+ metrics organized by category, with performance ranking
- **Summary Mode**: Shows the basic metrics plus a count of the enhanced metrics (set `detailed_metrics=False`)

### LLM-as-Judge Configuration

**Implementation**: `LLMJudgeEvaluator` class in `llm_judge/judge_evaluator.py`

- **Model**: OpenAI GPT-4.1 (`gpt-4.1`) with structured output
- **Temperature**: 0.0 for consistent scoring
- **Output Format**: Structured JSON with integer scores (1-10) and reasoning
- **Retry Logic**: 3 attempts with exponential backoff for reliability

#### Scoring Scale (1-10 Integer Scale)
```
10 = Excellent (95-100% quality)    |  5 = Below Average (45-54% quality)
9  = Very Good (85-94% quality)     |  4 = Poor (35-44% quality)  
8  = Good (75-84% quality)          |  3 = Very Poor (25-34% quality)
7  = Above Average (65-74% quality) |  2 = Extremely Poor (15-24% quality)
6  = Average (55-64% quality)       |  1 = Unacceptable (0-14% quality)
```

## Component-Specific Metrics Documentation

### 1. Annotation Component

The annotation component evaluates multimodal models (12 models) that process video frames to generate comprehensive driving scenario descriptions.

#### Traditional Metrics
**Implementation**: `TextSimilarityMetrics` class in `traditional_metrics.py`

- **BLEU Score (0-1)**: Bilingual Evaluation Understudy score, based on word n-gram overlap
  - **Implementation**: Uses NLTK BLEU with smoothing for short sentences
  - **Purpose**: Measures lexical similarity between system and ground truth annotations
  - **Usage**: Lower weight due to multiple valid annotation styles
  
- **ROUGE-L Score (0-1)**: Longest common subsequence similarity
  - **Implementation**: Official rouge-score library with sentence-level LCS
  - **Purpose**: Measures sequence-aware text similarity 
  - **Usage**: Captures ordering and structure better than pure n-gram overlap
  
- **Semantic Similarity (0-1)**: Sentence embedding cosine similarity
  - **Implementation**: SentenceTransformers with 'all-MiniLM-L6-v2' model
  - **Purpose**: Captures semantic meaning beyond lexical overlap
  - **Usage**: Primary metric for annotation quality assessment
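
The semantic similarity metric reduces to a cosine similarity between two embedding vectors. The sketch below shows only that final step in plain Python, assuming the embeddings have already been produced elsewhere (for example by the SentenceTransformers model named above). `cosine_similarity` is a hypothetical helper, and returning 0.0 for a zero vector is a convention chosen for this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat an all-zero vector as "no similarity"
    return dot / (norm_u * norm_v)
```

Cosine similarity can be negative for arbitrary vectors, so a reported 0-1 range implies that values are non-negative in practice or are clipped somewhere in the pipeline.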

#### LLM-as-Judge Metrics (GPT-4.1, 1-10 scale)
**Prompt Template**: `annotation_judge_prompt` in `judge_prompts.py`

- **Accuracy Score (1-10)**: How accurately the system annotation captures the driving scenario
  - **Focus**: Factual correctness, event detection, detail accuracy
  - **Implementation**: Compares system output against ground truth for content accuracy
  
- **Completeness Score (1-10)**: Coverage of all critical driving events
  - **Focus**: Missing events, comprehensive scene coverage, safety-relevant details
  - **Implementation**: Evaluates whether all important events are captured
  
- **Clarity Score (1-10)**: Clarity and usefulness for driving analysis
  - **Focus**: Language clarity, structured presentation, analytical utility
  - **Implementation**: Assesses readability and usefulness for downstream analysis

### 2. Scene Component

The scene component evaluates text models (22 models) that extract distinct, temporally-ordered scenes from driving annotations.

#### Traditional Metrics
**Implementation**: Multiple metric classes in `traditional_metrics.py`

##### Basic Text Metrics
- **BLEU Score (0-1)**: Word-level overlap in scene descriptions
  - **Implementation**: Uses NLTK BLEU with smoothing for short sentences  
  - **Purpose**: Measures lexical similarity between system and ground truth scenes

- **ROUGE-L Score (0-1)**: Sequential similarity of scene text
  - **Implementation**: Official rouge-score library with sentence-level LCS
  - **Purpose**: Measures sequence-aware text similarity

- **Semantic Similarity (0-1)**: Semantic similarity of scene content  
  - **Implementation**: SentenceTransformers with 'all-MiniLM-L6-v2' model
  - **Purpose**: Captures semantic meaning beyond lexical overlap

- **Scene Coverage (0-1)**: Ratio of extracted scenes to ground truth scenes
  - **Implementation**: `len(extracted_scenes) / len(ground_truth_scenes)`
  - **Purpose**: Measures completeness of scene extraction
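  - **Usage**: A ratio below 1 indicates under-extraction; the raw ratio exceeds 1 under over-extraction, so it has to be capped or otherwise penalized to stay within 0-1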

##### Enhanced Safety Metrics  
**Implementation**: `DrivingSafetyMetrics` class in `traditional_metrics.py`

- **Safety Weighted Detection (0-1)**: Enhanced detection scores prioritizing high-risk scenarios
  - **Implementation**: Applies safety weights to standard precision/recall calculations
  - **Calculation**: `weighted_correct_detections / weighted_total_ground_truth`
  - **Usage**: Prioritizes models that don't miss critical safety events

- **Critical Scene Precision (0-1)**: True critical scenes / (True critical + False critical)
- **Critical Scene Recall (0-1)**: True critical scenes / (True critical + Missed critical)  
- **Critical Scene F1 (0-1)**: Harmonic mean of critical scene precision and recall
  - **Implementation**: Uses importance weighting for safety-critical scene types
  - **Safety Weights**: Accident (1.0), Near-miss (0.8), Violation (0.6), Risky behavior (0.4)
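
As a rough sketch of how the safety weights enter these calculations, the code below computes the weighted detection ratio and the critical scene precision, recall, and F1. The data shapes (scene ids paired with scene types, a set of detected ids) and both function names are assumptions made for illustration; the real `DrivingSafetyMetrics` class may match scenes differently.

```python
# Weights taken from the list above; keys are lowercased scene types.
SAFETY_WEIGHTS = {
    "accident": 1.0,
    "near-miss": 0.8,
    "violation": 0.6,
    "risky behavior": 0.4,
}


def safety_weighted_detection(ground_truth, detected_ids, default_weight=0.0):
    """weighted_correct_detections / weighted_total_ground_truth.

    ground_truth: list of (scene_id, scene_type) pairs (assumed shape).
    detected_ids: set of scene ids the model found (assumed shape).
    Scene types without a listed weight get default_weight (an assumption).
    """
    def weight(scene_type):
        return SAFETY_WEIGHTS.get(scene_type.lower(), default_weight)

    total = sum(weight(t) for _, t in ground_truth)
    if total == 0:
        return 0.0
    correct = sum(weight(t) for sid, t in ground_truth if sid in detected_ids)
    return correct / total


def critical_scene_prf(true_critical, false_critical, missed_critical):
    """Precision, recall, and F1 from critical-scene counts."""
    predicted = true_critical + false_critical
    actual = true_critical + missed_critical
    precision = true_critical / predicted if predicted else 0.0
    recall = true_critical / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

With one true critical scene, no false ones, and one missed, this gives precision 1.0, recall 0.5, and F1 of about 0.667.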

##### Enhanced Temporal Metrics
**Implementation**: `SceneEvaluationMetrics` class in `traditional_metrics.py`

- **Temporal Order Accuracy (0-1)**: Correctness of chronological scene ordering
  - **Implementation**: 
    - Extracts temporal indicators ("first", "then", "next", "finally", "meanwhile")
    - Compares predicted vs ground truth scene sequences using LCS algorithm
  - **Calculation**: `LCS_length / max(pred_length, truth_length)`
  - **Usage**: Ensures models maintain logical narrative flow

- **Sequence Similarity (0-1)**: Alignment of predicted vs ground truth scene sequences  
  - **Implementation**: Semantic similarity between scene descriptions using sentence embeddings
  - **Calculation**: Average cosine similarity of matched scene pairs
  - **Usage**: Measures content quality beyond just ordering

- **Order Preservation (0-1)**: Whether extracted scenes maintain logical temporal flow
  - **Implementation**: Checks for temporal consistency violations and out-of-order scenes
  - **Calculation**: Penalty-based scoring for temporal inconsistencies
  - **Usage**: Penalizes models that break narrative chronology
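
The temporal order calculation `LCS_length / max(pred_length, truth_length)` can be sketched as follows. The functions operate on sequences of scene labels that are assumed to be already matched between prediction and ground truth; how that matching is done is not covered here, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for item_a in a:
        curr = [0]
        for j, item_b in enumerate(b, start=1):
            if item_a == item_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def temporal_order_accuracy(predicted, truth):
    """LCS_length / max(pred_length, truth_length); 0.0 if either is empty."""
    if not predicted or not truth:
        return 0.0
    return lcs_length(predicted, truth) / max(len(predicted), len(truth))
```

Swapping two of three scenes leaves a common subsequence of length 2, so the score drops from 1.0 to 2/3.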

##### Enhanced Coherence Metrics

- **Scene Coherence (0-1)**: Semantic consistency between consecutive scenes
  - **Implementation**: 
    - Calculates semantic similarity between adjacent scene descriptions
    - Uses sentence transformers for embedding-based similarity
    - Identifies coherence breaks and transitions
  - **Calculation**: Average similarity score across all scene transitions
  - **Usage**: Ensures smooth narrative flow without abrupt topic changes

- **Semantic Transitions (0-1)**: Quality of logical flow between scene descriptions
  - **Implementation**: Analyzes transition quality using linguistic markers and semantic coherence
  - **Calculation**: Weighted score based on transition quality indicators
  - **Usage**: Measures narrative sophistication and readability

- **Narrative Flow (0-1)**: Overall storytelling coherence across all scenes
  - **Implementation**: Combines scene coherence, temporal accuracy, and semantic transitions
  - **Calculation**: `0.4 * coherence + 0.3 * temporal + 0.3 * transitions`
  - **Usage**: Comprehensive narrative quality assessment
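
A minimal sketch of the coherence average and the narrative flow combination follows. It assumes the per-transition similarities have already been computed and that all inputs lie in the 0-1 range; both function names are hypothetical.

```python
def scene_coherence(adjacent_similarities):
    """Average similarity across consecutive scene pairs."""
    if not adjacent_similarities:
        raise ValueError("need at least one scene transition")
    return sum(adjacent_similarities) / len(adjacent_similarities)


def narrative_flow(coherence, temporal, transitions):
    """0.4 * coherence + 0.3 * temporal + 0.3 * transitions."""
    for name, value in (("coherence", coherence),
                        ("temporal", temporal),
                        ("transitions", transitions)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return 0.4 * coherence + 0.3 * temporal + 0.3 * transitions
```

Because the weights sum to 1.0, the combined score stays in the 0-1 range whenever the three inputs do.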

#### LLM-as-Judge Metrics (GPT-4.1, 1-10 scale)
**Prompt Template**: `scene_judge_prompt` in `judge_prompts.py`

- **Extraction Quality (1-10)**: Quality of scene decomposition from annotation
  - **Focus**: Scene boundary detection, logical segmentation, completeness
  - **Implementation**: Evaluates how well scenes are extracted and separated
  
- **Temporal Coherence (1-10)**: Logical temporal ordering of extracted scenes
  - **Focus**: Chronological accuracy, sequence consistency, narrative flow
  - **Implementation**: Checks temporal logic and scene progression
  
- **Safety Relevance (1-10)**: Focus on driving safety-relevant aspects
  - **Focus**: Safety-critical event emphasis, risk prioritization, safety context
  - **Implementation**: Evaluates emphasis on safety-relevant content

### 3. Violation Component

The violation component evaluates text models (22 models) that identify traffic violations and provide legal explanations.

#### Traditional Metrics
**Implementation**: `ClassificationMetrics` and `DrivingSafetyMetrics` classes in `traditional_metrics.py`

##### Basic Classification Metrics
- **Precision (0-1)**: True positives / (True positives + False positives)
  - **Implementation**: Classification metrics with exact match or semantic similarity
  - **Purpose**: Measures accuracy of positive predictions
  
- **Recall (0-1)**: True positives / (True positives + False negatives)
  - **Implementation**: Handles missing detections and partial matches
  - **Purpose**: Measures completeness of detection
  
- **F1 Score (0-1)**: Harmonic mean of precision and recall
  - **Calculation**: `2 * (precision * recall) / (precision + recall)`
  - **Purpose**: Balanced metric combining precision and recall
  
- **Accuracy (0-1)**: Correct predictions / Total predictions
  - **Implementation**: Overall classification accuracy across all categories
  - **Usage**: General performance indicator
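
The precision, recall, and F1 formulas above can be sketched for the exact-match case as follows. `classification_scores` is a hypothetical helper that compares sets of violation labels; the semantic-similarity matching mentioned above is not modeled, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def classification_scores(predicted, actual):
    """Precision, recall, and F1 for predicted vs. actual label sets."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)   # labels found and correct
    fp = len(predicted - actual)   # labels found but not in ground truth
    fn = len(actual - predicted)   # ground-truth labels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * (precision * recall) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

The zero-denominator guards matter in practice: a video with no violations in either the prediction or the ground truth would otherwise raise `ZeroDivisionError`.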

##### Enhanced Safety Metrics
**Implementation**: `DrivingSafetyMetrics` class in `traditional_metrics.py`

- **Safety Criticality Score (0-1)**: Weighted severity analysis of detected violations
  - **Implementation**: Uses severity weights for different violation types:
    - Extreme: reckless driving, DUI, hit and run (weight: 1.0)
    - High: speeding, running red lights, wrong way (weight: 0.8) 
    - Medium: failure to yield, improper lane change (weight: 0.6)
    - Low: parking violations, minor infractions (weight: 0.4)
  - **Calculation**: `sum(severity_weight * frequency) / total_events`
  - **Usage**: Prioritizes models that correctly identify high-risk behaviors

- **Critical Events Detection**: 
  - **Critical Events Count**: Number of safety-critical incidents detected
  - **Critical Event Ratio**: Proportion of critical events vs total events (0-1)
  - **Total Events**: Total number of driving events analyzed
  - **Implementation**: Pattern matching against critical scene types
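
The safety criticality calculation `sum(severity_weight * frequency) / total_events` can be sketched as below. The input shape (a mapping from severity level to event count) and the function name are assumptions for illustration; how violations are assigned to severity levels is not shown.

```python
# Severity weights from the list above.
SEVERITY_WEIGHTS = {"extreme": 1.0, "high": 0.8, "medium": 0.6, "low": 0.4}


def safety_criticality_score(severity_counts):
    """sum(severity_weight * frequency) / total_events.

    severity_counts: dict mapping a severity level to the number of
    detected events at that level (assumed shape).
    """
    unknown = set(severity_counts) - set(SEVERITY_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown severity levels: {sorted(unknown)}")
    total_events = sum(severity_counts.values())
    if total_events == 0:
        return 0.0
    weighted = sum(SEVERITY_WEIGHTS[level] * count
                   for level, count in severity_counts.items())
    return weighted / total_events
```

With these weights the score for any non-empty set of events falls between 0.4 (all low) and 1.0 (all extreme).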

#### LLM-as-Judge Metrics (GPT-4.1, 1-10 scale)
**Prompt Template**: `violation_judge_prompt` in `judge_prompts.py`

- **Detection Accuracy (1-10)**: Accuracy of traffic violation identification
  - **Focus**: Correct violation identification, false positive/negative rates
  - **Implementation**: Evaluates precision and recall of violation detection
  
- **Explanation Quality (1-10)**: Quality and clarity of violation reasoning
  - **Focus**: Clear reasoning, evidence-based explanations, logical justification
  - **Implementation**: Assesses explanation completeness and clarity
  
- **Legal Consistency (1-10)**: Consistency with established traffic laws
  - **Focus**: Legal accuracy, regulation compliance, proper legal interpretation
  - **Implementation**: Evaluates adherence to traffic law standards

### 4. Accident Component

The accident component evaluates text models (22 models) that assess accident risks and predict consequences.

#### Traditional Metrics
**Implementation**: `ClassificationMetrics` and `DrivingSafetyMetrics` classes in `traditional_metrics.py`

##### Basic Classification Metrics
- **Precision (0-1)**: True positives / (True positives + False positives)
  - **Implementation**: Classification metrics with exact match or semantic similarity
  - **Purpose**: Measures accuracy of positive predictions
  
- **Recall (0-1)**: True positives / (True positives + False negatives)
  - **Implementation**: Handles missing detections and partial matches
  - **Purpose**: Measures completeness of detection
  
- **F1 Score (0-1)**: Harmonic mean of precision and recall
  - **Calculation**: `2 * (precision * recall) / (precision + recall)`
  - **Purpose**: Balanced metric combining precision and recall
  
- **Accuracy (0-1)**: Correct predictions / Total predictions
  - **Implementation**: Overall classification accuracy across all categories
  - **Usage**: General performance indicator

##### Enhanced Safety Metrics
**Implementation**: `DrivingSafetyMetrics` class in `traditional_metrics.py`

- **Temporal Causality Score (0-1)**: Detects logical causal relationships (violations → accidents → assessment)
  - **Implementation**: Analyzes temporal sequences and causal indicators
  - **Calculation**: Measures consistency between violation timing and subsequent accidents
  - **Usage**: Ensures models understand cause-and-effect in driving scenarios

- **Safety Criticality Score (0-1)**: Weighted severity analysis of detected accidents
  - **Implementation**: Uses severity weights for different accident types
  - **Calculation**: `sum(severity_weight * frequency) / total_events`
  - **Usage**: Prioritizes models that correctly identify high-risk situations

#### LLM-as-Judge Metrics (GPT-4.1, 1-10 scale)
**Prompt Template**: `accident_judge_prompt` in `judge_prompts.py`

- **Risk Assessment Accuracy (1-10)**: Accuracy of accident risk evaluation
  - **Focus**: Risk level appropriateness, severity assessment, probability estimation
  - **Implementation**: Compares predicted vs ground truth risk levels
  
- **Consequence Prediction (1-10)**: Quality of potential outcome prediction
  - **Focus**: Realistic outcome scenarios, consequence severity, impact assessment
  - **Implementation**: Evaluates prediction quality and realism
  
- **Context Understanding (1-10)**: Consideration of environmental factors
  - **Focus**: Environmental awareness, situational context, contributing factors
  - **Implementation**: Assesses contextual analysis completeness

### 5. Assessment Component

The assessment component evaluates text models (22 models) that provide comprehensive driving evaluations with safety scores and improvement advice.

#### Traditional Metrics
**Implementation**: `StructuredMetrics` and assessment-specific classes in `traditional_metrics.py`

##### Basic Performance Metrics
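- **Score Correlation (0-1)**: Pearson correlation between predicted and actual safety scores
  - **Implementation**: `scipy.stats.pearsonr()` for numerical safety scores; the raw coefficient ranges from -1 to 1, so negative correlations fall outside the stated range unless clipped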
  - **Purpose**: Measures consistency of numerical assessment
  
- **Risk Accuracy (0-1)**: Exact match accuracy of risk level classifications  
  - **Implementation**: String matching for risk categories (Low/Medium/High/Critical)
  - **Purpose**: Measures categorical assessment accuracy
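
For readers without SciPy at hand, the two metrics can be sketched in plain Python. This is a stand-in for illustration, not the `scipy.stats.pearsonr()` call the implementation uses; the function names are hypothetical, and the case-insensitive comparison in `risk_accuracy` is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def risk_accuracy(predicted, actual):
    """Share of risk labels that match exactly (ignoring case and spaces)."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(p.strip().casefold() == a.strip().casefold()
                  for p, a in zip(predicted, actual))
    return matches / len(actual)
```

The coefficient is undefined when every predicted score is identical, which is worth checking for before aggregating across models.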

##### Enhanced Coverage Metrics
- **Content Coverage (0-1)**: Completeness of assessment content vs ground truth
  - **Implementation**: 
    - Extracts key phrases from strengths, weaknesses, and advice sections
    - Calculates overlap ratio using TF-IDF similarity
    - Measures completeness across all assessment dimensions
  - **Calculation**: `matched_content_phrases / total_ground_truth_phrases`

- **Advice Similarity (0-1)**: Semantic similarity of improvement recommendations
  - **Implementation**: Sentence embedding similarity between predicted and ground truth advice
  - **Usage**: Ensures actionable and relevant improvement suggestions

- **Evaluation Similarity (0-1)**: Alignment of overall driving assessment content
  - **Implementation**: Comprehensive similarity across all assessment components
  - **Usage**: Measures holistic assessment quality

#### LLM-as-Judge Metrics (GPT-4.1, 1-10 scale)
**Prompt Template**: `assessment_judge_prompt` in `judge_prompts.py`

- **Assessment Accuracy (1-10)**: Alignment with expert driving evaluation
  - **Focus**: Professional-level evaluation quality, expert judgment alignment
  - **Implementation**: Compares system assessment with expert ground truth
  
- **Advice Actionability (1-10)**: Practical value of improvement suggestions
  - **Focus**: Specific recommendations, implementable advice, practical utility
  - **Implementation**: Evaluates usefulness and specificity of suggestions
  
- **Score Justification (1-10)**: How well safety score matches evidence
  - **Focus**: Evidence-based scoring, logical consistency, justification quality
  - **Implementation**: Assesses alignment between score and supporting evidence

## Cross-Component Metrics

### Timing Metrics (All Components)

**Implementation**: `TimingMetrics` class in `metrics/timing_metrics.py`

#### Core Timing Analysis
- **Generation Time**: Time taken for each model to process individual videos (seconds)
  - **Extraction**: Parses timing data from system output metadata
  - **Granularity**: Per-video, per-model, per-component timing
  - **Storage**: Cached with evaluation results for persistence

#### Statistical Analysis  
- **Mean Time**: Average generation time across all videos for each model
- **Median Time**: 50th percentile timing (less sensitive to outliers)
- **Min/Max Time**: Fastest and slowest individual video processing times
- **Standard Deviation**: Timing consistency measurement
- **Total Time**: Cumulative time for processing all videos

#### Performance Comparison
- **Speed Rankings**: Models ranked by average generation time within each component
- **Cross-Component Analysis**: Timing comparison across different evaluation tasks  
- **Global Performance Leaders**: Fastest and slowest models across all components
- **Speed Ratios**: Performance multipliers (e.g., "Model A is 3.2x faster than Model B")
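
A speed ranking and ratio of this kind could be derived from per-model mean times as sketched below. Both functions are hypothetical helpers and assume a dict mapping model name to mean generation time in seconds.

```python
def speed_ranking(mean_times):
    """Return (model, mean_seconds) pairs sorted fastest first."""
    return sorted(mean_times.items(), key=lambda item: item[1])


def speed_ratio(mean_times):
    """How many times slower the slowest model is than the fastest."""
    if not mean_times:
        raise ValueError("mean_times must not be empty")
    fastest = min(mean_times.values())
    slowest = max(mean_times.values())
    if fastest <= 0:
        raise ValueError("mean times must be positive")
    return slowest / fastest
```

For mean times of 3.1s and 8.5s the ratio is about 2.7.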

#### Efficiency Metrics
- **Quality vs. Speed Trade-offs**: Correlation analysis between timing and performance scores
- **Efficiency Scores**: Combined metric balancing speed and accuracy
- **Resource Utilization**: Analysis of timing patterns and resource efficiency
- **Scalability Assessment**: Performance trends across different dataset sizes

#### Implementation Details
```python
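import statistics

# Timing extraction from system outputs
def extract_generation_time(system_data):
    """Return the generation time in seconds, or None if it was not recorded."""
    return system_data.get('timing', {}).get('generation_time')

# Statistical calculations
def calculate_timing_stats(times_list):
    """Summarize per-video generation times (seconds) for one model."""
    if not times_list:
        # mean() and median() raise on empty input
        return {'count': 0}
    return {
        'mean_time': statistics.mean(times_list),
        'median_time': statistics.median(times_list),
        # stdev() needs at least two samples; report 0.0 for a single video
        'std_dev': statistics.stdev(times_list) if len(times_list) > 1 else 0.0,
        'min_time': min(times_list),
        'max_time': max(times_list),
        'total_time': sum(times_list),
        'count': len(times_list)
    }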
```

### LLM Judge Implementation Details

#### Batch Processing
- **Batch Size**: 5 videos per batch for efficiency
- **Parallel Processing**: Multiple models evaluated concurrently
- **Progress Tracking**: Real-time progress updates with detailed logging

#### Error Handling
- **Retry Logic**: 3 attempts with exponential backoff (1s, 2s, 4s delays)
- **Structured Output Validation**: Ensures all required scores are present and valid
- **Fallback Scoring**: Default scores if judge evaluation fails completely
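
A retry loop with exponential backoff might look like the sketch below. `call_with_retry` and its parameters are hypothetical, not the interface of `LLMJudgeEvaluator`. The sketch sleeps only between attempts, so three attempts wait 1s and then 2s; it does not model a further 4s wait after the last failure.

```python
import time


def call_with_retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() up to `attempts` times, doubling the delay after each failure.

    `sleep` is injectable so the backoff can be tested without waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:  # a real caller would catch narrower errors
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise last_error
```

A caller that wants fallback scoring would wrap this in its own `try`/`except` and substitute default scores when the final exception propagates.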

#### Output Processing
- **Score Validation**: Ensures integer scores in 1-10 range
- **Reasoning Extraction**: Captures and stores brief explanations for each dimension
- **Aggregate Calculation**: Computes overall quality as weighted average of dimension scores
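
The validation and aggregation steps could be sketched as follows. The function names are hypothetical, and because the dimension weights are not stated here, the sketch defaults to equal weights, which is an assumption.

```python
def validate_scores(scores, required):
    """Check that every required dimension has an integer score in 1-10."""
    validated = {}
    for name in required:
        value = scores.get(name)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name}: expected an integer, got {value!r}")
        if not 1 <= value <= 10:
            raise ValueError(f"{name}: score {value} is outside 1-10")
        validated[name] = value
    return validated


def overall_score(scores, weights=None):
    """Weighted average of dimension scores (equal weights by default)."""
    if not scores:
        raise ValueError("scores must not be empty")
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight
```

With scores of 8, 9, and 8 and equal weights, the overall score is about 8.3.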


## Output Examples

### CLI Evaluation Output

#### Annotation Component (Detailed Mode - Default)
```
============================================================
EVALUATING ANNOTATION COMPONENT
============================================================

[1/12] Model: openai:gpt-4o
    ✅ Completed: 22/22 videos (100.0%)
    📊 Traditional Metrics:
    📊 DETAILED METRICS BREAKDOWN (3 total)
       📈 Basic (3 metrics):
          • Semantic Similarity: 0.890
          • Rouge L: 0.820
          • BLEU: 0.780
    🤖 LLM Judge: Accuracy=8, Completeness=9, Clarity=8 (Avg: 8.3)
    ⏱️  Timing: 4.2s avg, 1m 34s total
```

#### Scene Component (Enhanced - Default)
```
[1/22] Model: openai:gpt-4o
    ✅ Completed: 22/22 videos (100.0%)
    📊 Traditional Metrics:
    📊 DETAILED METRICS BREAKDOWN (15 total)
       📈 Basic (4 metrics):
          • Scene Coverage: 1.000
          • Semantic Similarity: 0.924
          • Rouge L: 0.530
          • BLEU: 0.352
       🧠 Semantic (2 metrics):
          • Semantic Transitions: 0.625
          • Semantic Similarity: 0.924
       🛡️ Safety (4 metrics):
          • Safety Weighted Detection: 0.933
          • Critical Scene Precision: 1.000
          • Critical Scene F1: 0.667
          • Critical Scene Recall: 0.500
       ⏰ Temporal (2 metrics):
          • Temporal Order Accuracy: 1.000
          • Order Preservation: 1.000
       🔗 Coherence (2 metrics):
          • Scene Coherence: 0.875
          • Narrative Flow: 0.750
       📌 Other (1 metrics):
          • Overall Scene Quality: 0.829
    🤖 LLM Judge: Extraction=8, Temporal=9, Safety=8 (Avg: 8.3)
    ⏱️  Timing: 5.8s avg, 2m 8s total

============================================================
SCENE COMPONENT SUMMARY  
============================================================
✅ Successfully evaluated: 22/22 models

🏆 METRIC LEADERS
📊 Traditional Metrics:
  • Best Scene Coverage: claude-sonnet-4 (1.000)
  • Best Semantic Similarity: openai:o3 (0.950)
  🛡️ Domain-Specific Safety:
    • Safety Weighted Detection: claude-sonnet-4 (0.967)
    • Critical Scene F1: gemini-2.5-pro (0.789)
    • Overall Scene Quality: claude-sonnet-4 (0.892)
  ⏰ Temporal & Order:
    • Temporal Order Accuracy: openai:gpt-4o (1.000)
    • Order Preservation: claude-sonnet-4 (1.000)
  🔗 Scene Coherence:
    • Narrative Flow: gemini-2.5-pro (0.834)
    • Scene Coherence: claude-sonnet-4 (0.901)

📈 Enhanced Metrics Summary:
   • Total metrics evaluated: 15
   🛡️  Domain-specific safety metrics: 4
   🎯 Advanced evaluation metrics: 6

⏱️  TIMING COMPARISON
🚀 Fastest: claude-sonnet-4 (3.1s avg)
🐌 Slowest: gemini-2.5-pro (8.5s avg)  
📈 Speed ratio: 2.7x
```

### Report Table Format

**Enhanced Scene Component Table:**
```
| Model | Videos | Basic (4) | Semantic (2) | Safety (4) | Temporal (2) | Coherence (2) | Overall | Avg Time | Status |
|-------|--------|-----------|--------------|------------|--------------|---------------|---------|----------|--------|
| openai:gpt-4o | 22 | 0.85 | 0.90 | 0.78 | 1.00 | 0.81 | 0.829 | 5.8s | ✅ Success |
| claude-sonnet-4 | 22 | 0.88 | 0.92 | 0.84 | 0.95 | 0.86 | 0.892 | 3.1s | ✅ Success |
```

**Enhanced Violation Component Table:**
```
| Model | Videos | Basic (4) | Semantic (3) | Safety (4) | Detection | Explain | Legal | Avg Time | Status |
|-------|--------|-----------|--------------|------------|-----------|---------|-------|----------|--------|
| openai:gpt-4o | 22 | 0.81 | 0.79 | 0.72 | 8 | 9 | 8 | 6.1s | ✅ Success |
| claude-sonnet-4 | 22 | 0.84 | 0.83 | 0.78 | 9 | 8 | 9 | 4.2s | ✅ Success |
```

**Enhanced Assessment Component Table:**
```
| Model | Videos | Basic (3) | Safety (12) | Coverage (7) | Assessment | Advice | Justify | Avg Time | Status |
|-------|--------|-----------|-------------|--------------|------------|--------|---------|----------|--------|
| openai:gpt-4o | 22 | 0.89 | 0.65 | 0.92 | 9 | 8 | 9 | 5.3s | ✅ Success |
| claude-sonnet-4 | 22 | 0.91 | 0.71 | 0.94 | 8 | 9 | 8 | 4.8s | ✅ Success |
```

**Legend:**
- **Basic**: Core metrics (precision, recall, F1, etc.)
- **Semantic**: Reasoning quality and similarity metrics  
- **Safety**: Domain-specific driving safety evaluation
- **Temporal**: Chronological order and causality metrics
- **Coherence**: Narrative flow and scene consistency
- **Coverage**: Content completeness and similarity
- **Overall**: Comprehensive quality score combining all categories

## Resumption & Caching

The system automatically:
- **Resumes interrupted evaluations** - Just re-run the same command
- **Caches all results** - No duplicate work across runs
- **Detects model changes** - Automatically handles updated model lists
- **Validates data** - Ensures ground truth and system outputs exist
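
Conceptually, resuming comes down to skipping every (model, video) pair that already has a cached result. The sketch below illustrates the idea with a hypothetical in-memory cache keyed by `(model, video)`; the actual cache format under `data/evaluation/component_eval/cache/` is not described here.

```python
def pending_work(models, videos, cache):
    """Return the (model, video) pairs that still need evaluation.

    cache: dict keyed by (model, video) holding finished results
    (an assumed shape, chosen for illustration).
    """
    return [(model, video)
            for model in models
            for video in videos
            if (model, video) not in cache]
```

Entries cached for models that are no longer in the list are simply ignored, and newly added models show up as pending for every video.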

## Error Handling

- **Graceful failures**: Individual model failures don't stop evaluation
- **Retry logic**: Automatic retries for transient failures
- **Detailed logging**: Clear error messages and progress indicators
- **Validation**: Pre-flight checks for configuration and data

## Current Status & Recent Improvements

### ✅ Recently Fixed (2025-01-25)
- **Zero Values in Reports**: Fixed issue where cached evaluation data showed 0.00 values in reports
- **All-Components Reporting**: `--report --all-components` command now works correctly with actual metrics
- **Cached Data Loading**: Improved cached result loading to return structured evaluation data instead of placeholders
- **Report Generation**: Enhanced report generation to properly handle both fresh and cached evaluation results
- **Instance Consistency**: Fixed multiple ComponentEvaluator instance creation causing inconsistent behavior

### ✅ Current Working Features
- **Scene Component**: ✅ 22/22 models evaluated, reports showing real values (BLEU: 0.05-0.41, Semantic: 0.82-0.90)
- **Accident Component**: ✅ 22/22 models evaluated, comprehensive metrics (Precision: 0.95-0.97, F1: 0.92-0.95)
- **Violation Component**: ✅ 22/22 models evaluated, detailed analysis (F1: 0.87-0.90, LLM Judge: 8.3-8.6)
- **Annotation Component**: ✅ 12/12 multimodal models with fresh evaluation capability
- **Assessment Component**: ✅ Ready for evaluation, full metric support implemented

### 📊 Verified Metrics Output
All components now display comprehensive metrics including:
- **Traditional Metrics**: BLEU, ROUGE-L, Precision, Recall, F1, Accuracy, Semantic Similarity
- **Enhanced Safety Metrics**: Critical event detection, safety criticality scores, temporal causality
- **LLM Judge Scores**: Component-specific 1-10 scale evaluations with reasoning
- **Timing Analysis**: Detailed performance profiling with speed comparisons

## Dependencies

### Required
- Python 3.12+
- LangChain (for LLM integration)
- NumPy (for metrics calculation)

### Optional (Enhanced Features)
- NLTK (BLEU scores with smoothing)
- SentenceTransformers (semantic similarity)
- Rouge-score (official ROUGE implementation)

## Troubleshooting

### Common Issues

1. **No system outputs found**
   ```bash
   # Check available system outputs
   ls data/evaluation/system_outputs/annotation/
   ```

2. **Model configuration errors**
   ```bash
   # Validate configuration
   uv run python -m evaluation.component_eval --validate-models
   ```

3. **Cache issues**
   ```bash
   # Force fresh evaluation
   uv run python -m evaluation.component_eval --component annotation --overwrite
   ```

4. **LLM-as-judge failures**
   - Check OpenAI API key is set
   - Verify GPT-4.1 model access
   - Check network connectivity

## Performance Tips

- **Run components in parallel** (separate terminal sessions)
- **Start with annotation** (fewer models to test setup)
- **Use status command** to track progress
- **Resume interrupted evaluations** by re-running the same command

## Integration

This evaluation system integrates with:
- **Ground truth annotations** from `data/evaluation/ground_truth/`  
- **System outputs** from `data/evaluation/system_outputs/`
- **Existing DriveGuard LLM infrastructure** in `src/llm/`
- **RAGAS evaluation pipeline** for end-to-end metrics