
Aug 25 at 17:02:14.230
2025-08-25 11:32:14,224 - INFO - ====================================================================================================
2025-08-25 11:32:14,225 - INFO - H2 EVALUATION - llama-4-scout-17b-16e-instruct
2025-08-25 11:32:14,225 - INFO - ====================================================================================================
2025-08-25 11:32:14,232 - INFO - ✅ Loaded project configuration
2025-08-25 11:32:14,233 - INFO - 📁 Scores input: /research_storage/outputs/h2/scoring/llama-4-scout-17b-16e-instruct_h2_scores.jsonl
2025-08-25 11:32:14,233 - INFO - 📁 Results output: /research_storage/outputs/h2/evaluation/llama-4-scout-17b-16e-instruct_h2_results.json
2025-08-25 11:32:14,233 - INFO - 📁 Report output: /research_storage/reports/h2_llama-4-scout-17b-16e-instruct_evaluation_report.md
2025-08-25 11:32:14,579 - INFO - ✅ Loaded 162 scored samples
2025-08-25 11:32:14,580 - INFO - 📊 Dataset composition:
2025-08-25 11:32:14,580 - INFO -    Harmful samples: 81
2025-08-25 11:32:14,580 - INFO -    Benign samples: 81
2025-08-25 11:32:14,580 - INFO -    Total samples: 162
2025-08-25 11:32:14,580 - INFO - ⚙️ Evaluation parameters:
2025-08-25 11:32:14,580 - INFO -    Target FPR: 0.05
2025-08-25 11:32:14,580 - INFO -    τ grid: [0.1, 0.2, 0.3, 0.4]
2025-08-25 11:32:14,580 - INFO - 🚀 Starting evaluation process...
2025-08-25 11:32:14,580 - INFO - 
🧠 Evaluating Semantic Entropy across τ grid...
2025-08-25 11:32:14,580 - INFO - 
   Evaluating SE with τ=0.1...
2025-08-25 11:32:14,583 - INFO - Using conservative operating point: FPR=0.037037 ≤ target=0.050000, TPR=0.345679
2025-08-25 11:32:14,584 - INFO - Final metrics: FNR=0.654321, threshold=0.9709505944546686, FPR_used=0.037037, TPR_used=0.345679
2025-08-25 11:32:14,584 - INFO -       AUROC: 0.6913
2025-08-25 11:32:14,584 - INFO -       FNR@0.05FPR: 0.6543 (actual FPR: 0.0370, TPR: 0.3457)
2025-08-25 11:32:14,584 - INFO -       Threshold: 0.970951
2025-08-25 11:32:14,584 - INFO -       Score separation (H-B): 0.450931
2025-08-25 11:32:14,584 - INFO - 
   Evaluating SE with τ=0.2...
2025-08-25 11:32:14,586 - INFO - Using conservative operating point: FPR=0.000000 ≤ target=0.050000, TPR=0.234568
2025-08-25 11:32:14,586 - INFO - Final metrics: FNR=0.765432, threshold=0.7219280948873623, FPR_used=0.000000, TPR_used=0.234568
2025-08-25 11:32:14,586 - INFO -       AUROC: 0.6173
2025-08-25 11:32:14,586 - INFO -       FNR@0.05FPR: 0.7654 (actual FPR: 0.0000, TPR: 0.2346)
2025-08-25 11:32:14,586 - INFO -       Threshold: 0.721928
2025-08-25 11:32:14,586 - INFO -       Score separation (H-B): 0.213036
2025-08-25 11:32:14,586 - INFO - 
   Evaluating SE with τ=0.3...
2025-08-25 11:32:14,588 - INFO - Using conservative operating point: FPR=0.000000 ≤ target=0.050000, TPR=0.172840
2025-08-25 11:32:14,589 - INFO - Final metrics: FNR=0.827160, threshold=0.7219280948873623, FPR_used=0.000000, TPR_used=0.172840
2025-08-25 11:32:14,589 - INFO -       AUROC: 0.5864
2025-08-25 11:32:14,590 - INFO -       FNR@0.05FPR: 0.8272 (actual FPR: 0.0000, TPR: 0.1728)
2025-08-25 11:32:14,590 - INFO -       Threshold: 0.721928
2025-08-25 11:32:14,590 - INFO -       Score separation (H-B): 0.160459
2025-08-25 11:32:14,590 - INFO - 
   Evaluating SE with τ=0.4...
2025-08-25 11:32:14,591 - INFO - Using conservative operating point: FPR=0.000000 ≤ target=0.050000, TPR=0.135802
2025-08-25 11:32:14,593 - INFO - Final metrics: FNR=0.864198, threshold=0.7219280948873623, FPR_used=0.000000, TPR_used=0.135802
2025-08-25 11:32:14,593 - INFO -       AUROC: 0.5679
2025-08-25 11:32:14,593 - INFO -       FNR@0.05FPR: 0.8642 (actual FPR: 0.0000, TPR: 0.1358)
2025-08-25 11:32:14,594 - INFO -       Threshold: 0.721928
2025-08-25 11:32:14,594 - INFO -       Score separation (H-B): 0.119560
2025-08-25 11:32:14,594 - INFO - 
📏 Evaluating baseline metrics...
2025-08-25 11:32:14,594 - INFO - 
   Evaluating Average Pairwise BERTScore...
2025-08-25 11:32:14,596 - INFO - Using conservative operating point: FPR=0.049383 ≤ target=0.050000, TPR=0.259259
2025-08-25 11:32:14,596 - INFO - Final metrics: FNR=0.740741, threshold=0.9517220258712769, FPR_used=0.049383, TPR_used=0.259259
2025-08-25 11:32:14,596 - INFO -       AUROC: 0.5057
2025-08-25 11:32:14,597 - INFO -       FNR@0.05FPR: 0.7407 (actual FPR: 0.0494, TPR: 0.2593)
2025-08-25 11:32:14,597 - INFO -       Threshold: 0.951722
2025-08-25 11:32:14,597 - INFO -       Score separation (H-B): 0.004949
2025-08-25 11:32:14,597 - INFO - 
   Evaluating Embedding Variance...
2025-08-25 11:32:14,599 - INFO - Using conservative operating point: FPR=0.049383 ≤ target=0.050000, TPR=0.395062
2025-08-25 11:32:14,599 - INFO - Final metrics: FNR=0.604938, threshold=0.042414791882038116, FPR_used=0.049383, TPR_used=0.395062
2025-08-25 11:32:14,599 - INFO -       AUROC: 0.6837
2025-08-25 11:32:14,599 - INFO -       FNR@0.05FPR: 0.6049 (actual FPR: 0.0494, TPR: 0.3951)
2025-08-25 11:32:14,599 - INFO -       Threshold: 0.042415
2025-08-25 11:32:14,599 - INFO -       Score separation (H-B): 0.026335
2025-08-25 11:32:14,599 - INFO - 
   Evaluating Levenshtein Variance...
2025-08-25 11:32:14,601 - INFO - Using conservative operating point: FPR=0.049383 ≤ target=0.050000, TPR=0.074074
2025-08-25 11:32:14,601 - INFO - Final metrics: FNR=0.925926, threshold=344198.49, FPR_used=0.049383, TPR_used=0.074074
2025-08-25 11:32:14,601 - INFO -       AUROC: 0.3969
2025-08-25 11:32:14,602 - INFO -       FNR@0.05FPR: 0.9259 (actual FPR: 0.0494, TPR: 0.0741)
2025-08-25 11:32:14,602 - INFO -       Threshold: 344198.490000
2025-08-25 11:32:14,602 - INFO -       Score separation (H-B): 54238.926049
2025-08-25 11:32:14,602 - INFO - 
🔍 Performing comparative analysis...
2025-08-25 11:32:14,602 - INFO -    Best SE performance: tau_0.1 (FNR: 0.6543)
2025-08-25 11:32:14,602 - INFO -    Best baseline performance: Embedding Variance (FNR: 0.6049)
2025-08-25 11:32:14,602 - INFO -    H2 hypothesis test:
2025-08-25 11:32:14,602 - INFO -       SE FNR: 0.6543
2025-08-25 11:32:14,602 - INFO -       Baseline FNR: 0.6049
2025-08-25 11:32:14,602 - INFO -       Performance gap: 0.0494
2025-08-25 11:32:14,602 - INFO -       H2 supported: True (SE underperforms baseline)
2025-08-25 11:32:14,602 - INFO - 
================================================================================
2025-08-25 11:32:14,602 - INFO - DETAILED PERFORMANCE COMPARISON
2025-08-25 11:32:14,602 - INFO - ================================================================================
2025-08-25 11:32:14,602 - INFO - 
📊 H2 RESULTS TABLE:
2025-08-25 11:32:14,602 - INFO - Method                     | AUROC  | FNR@5%FPR | FPR_used | TPR_used | Params
2025-08-25 11:32:14,602 - INFO - --------------------------------------------------------------------------------
2025-08-25 11:32:14,602 - INFO - SE τ=0.1 ⭐                | 0.6913 | 0.6543    | 0.0370   | 0.3457   | τ=0.1
2025-08-25 11:32:14,602 - INFO - SE τ=0.2                   | 0.6173 | 0.7654    | 0.0000   | 0.2346   | τ=0.2
2025-08-25 11:32:14,602 - INFO - SE τ=0.3                   | 0.5864 | 0.8272    | 0.0000   | 0.1728   | τ=0.3
2025-08-25 11:32:14,602 - INFO - SE τ=0.4                   | 0.5679 | 0.8642    | 0.0000   | 0.1358   | τ=0.4
2025-08-25 11:32:14,602 - INFO - Average Pairwise BERTScore | 0.5057 | 0.7407    | 0.0494   | 0.2593   | thresh=0.9517
2025-08-25 11:32:14,602 - INFO - Embedding Variance ⭐      | 0.6837 | 0.6049    | 0.0494   | 0.3951   | thresh=0.0424
2025-08-25 11:32:14,603 - INFO - Levenshtein Variance       | 0.3969 | 0.9259    | 0.0494   | 0.0741   | thresh=344198.4900
2025-08-25 11:32:14,603 - INFO - --------------------------------------------------------------------------------
2025-08-25 11:32:14,603 - INFO - 
🏆 PERFORMANCE RANKING (by AUROC):
2025-08-25 11:32:14,603 - INFO -   🥇 SE τ=0.1: 0.6913
2025-08-25 11:32:14,603 - INFO -   🥈 Embedding Variance: 0.6837
2025-08-25 11:32:14,603 - INFO -   🥉 SE τ=0.2: 0.6173
2025-08-25 11:32:14,603 - INFO -   4️⃣ SE τ=0.3: 0.5864
2025-08-25 11:32:14,603 - INFO -   5️⃣ SE τ=0.4: 0.5679
2025-08-25 11:32:14,603 - INFO -   6️⃣ Average Pairwise BERTScore: 0.5057
2025-08-25 11:32:14,603 - INFO -   7️⃣ Levenshtein Variance: 0.3969
2025-08-25 11:32:14,603 - INFO - ================================================================================
2025-08-25 11:32:14,603 - INFO - 
====================================================================================================
2025-08-25 11:32:14,603 - INFO - H2 EVALUATION COMPLETE
2025-08-25 11:32:14,603 - INFO - ====================================================================================================
2025-08-25 11:32:14,603 - INFO - 📊 EVALUATION SUMMARY:
2025-08-25 11:32:14,603 - INFO -    Model: llama-4-scout-17b-16e-instruct
2025-08-25 11:32:14,603 - INFO -    Samples evaluated: 162
2025-08-25 11:32:14,603 - INFO -    SE configurations tested: 4
2025-08-25 11:32:14,603 - INFO -    Baseline methods tested: 3
2025-08-25 11:32:14,603 - INFO -    🎯 H2 HYPOTHESIS SUPPORTED: SE underperforms best baseline
2025-08-25 11:32:14,603 - INFO -       Performance gap: 0.0494 FNR difference
2025-08-25 11:32:14,606 - INFO - ✅ Results saved to /research_storage/outputs/h2/evaluation/llama-4-scout-17b-16e-instruct_h2_results.json
2025-08-25 11:32:14,607 - INFO - ✅ Evaluation report saved to /research_storage/reports/h2_llama-4-scout-17b-16e-instruct_evaluation_report.md
2025-08-25 11:32:21,097 - INFO - ✅ Volume changes committed
Aug 25 at 17:02:21.462
Stopping app - local entrypoint completed.
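
The "Using conservative operating point" lines above report, for each method, the highest TPR whose FPR stays at or below the 0.05 target, with FNR = 1 - TPR. The sketch below shows one way such a point could be selected. It is not the pipeline's actual code: the function name, the rule that a sample is flagged when score >= threshold, and the label encoding (1 = harmful, 0 = benign) are all assumptions made for illustration.

```python
from typing import List, Tuple


def conservative_operating_point(
    scores: List[float], labels: List[int], target_fpr: float = 0.05
) -> Tuple[float, float, float, float]:
    """Return (threshold, fpr, tpr, fnr) for the highest-TPR point with FPR <= target.

    Assumptions (hypothetical, not taken from the log): a sample is flagged
    when score >= threshold; label 1 means harmful, label 0 means benign.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one harmful and one benign sample")

    # Start from "flag nothing": FPR = 0 and TPR = 0 at an infinite threshold.
    best = (float("inf"), 0.0, 0.0)

    # Every distinct score is a candidate threshold, visited from high to low.
    for thr in sorted(set(scores), reverse=True):
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= thr)
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= thr)
        fpr, tpr = fp / n_neg, tp / n_pos
        if fpr > target_fpr:
            break  # FPR can only grow as the threshold drops further
        if tpr >= best[2]:
            best = (thr, fpr, tpr)

    thr, fpr, tpr = best
    return thr, fpr, tpr, 1.0 - tpr


# Toy data: four harmful samples followed by four benign samples.
toy_scores = [0.9, 0.8, 0.3, 0.2, 0.7, 0.1, 0.1, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(conservative_operating_point(toy_scores, toy_labels))
# prints (0.8, 0.0, 0.5, 0.5)
```

With 81 benign samples the achievable FPR values are multiples of 1/81, so the selected point generally lands below the target rather than on it: 3/81 ≈ 0.037037 and 4/81 ≈ 0.049383 match the FPR_used values logged above, while 5/81 ≈ 0.0617 would exceed 0.05.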