--- Starting Phase 5: Evaluation ---
Loaded selection data from phase4_analysis/detailed_selections.csv

--- 1. Task-Level Analysis ---

Found 8 case studies where Energy-Guided and Best-Time diverge.
Analyzing algorithmic structure for case studies...
Saved 8 detailed case studies to phase5_evaluation/case_studies.csv
Saved task-level summary table to phase5_evaluation/task_level_summary.csv


--- 2. Aggregate Analysis ---

Energy Efficiency Improvement (vs. Baselines):
  - Average Savings vs. Top-1:      44.69%
  - Average Savings vs. Best-Time:  1.86%
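
Note on the figures above: the log does not state how the averages were formed. A minimal sketch, assuming each figure is the mean of per-task percentage savings relative to the baseline's energy (the function names and the sample values are hypothetical, not taken from the evaluation script):

```python
def percent_savings(baseline_energy, selected_energy):
    """Energy saved by the selected candidate, as a percentage of the baseline."""
    if baseline_energy <= 0:
        raise ValueError("baseline energy must be positive")
    return (baseline_energy - selected_energy) / baseline_energy * 100.0


def average_savings(pairs):
    """Mean of per-task savings; `pairs` holds (baseline, selected) energies."""
    return sum(percent_savings(b, s) for b, s in pairs) / len(pairs)


# Hypothetical energies in joules for two tasks.
example = [(100.0, 50.0), (200.0, 150.0)]
print(f"{average_savings(example):.2f}%")
```

Averaging per-task percentages weights every task equally; dividing total saved energy by total baseline energy would instead weight tasks by their energy use, and the two can differ noticeably.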

Runtime Penalty (vs. Best-Time):
  - Average Penalty: 0.0012 seconds
  - Median Penalty:  0.0000 seconds

Generated plot of energy savings distribution at: phase5_evaluation/energy_savings_distribution.png
Generated plot of runtime penalty distribution at: phase5_evaluation/runtime_penalty_distribution.png

Statistical Significance (Wilcoxon signed-rank test):

  Comparison vs. Top-1:
    - p-value: < 0.00001
    - Result: Statistically significant (p < 0.05)

  Comparison vs. Best-Time:
    - p-value: 0.00391
    - Result: Statistically significant (p < 0.05)
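
Note on the test above: the log does not say which implementation was used, or whether the test was one-sided or two-sided. The sketch below is an assumption-laden illustration of an exact one-sided Wilcoxon signed-rank test for small paired samples, written with the standard library only. Zero differences are dropped, ties get average ranks, and all names and data are hypothetical.

```python
from itertools import product


def average_ranks(values):
    """Rank values ascending (1-based); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_exact_greater(baseline, candidate):
    """Exact one-sided p-value for H1: baseline values exceed candidate values."""
    diffs = [b - c for b, c in zip(baseline, candidate) if b != c]
    n = len(diffs)
    if n == 0:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under H0 every rank is positive or negative with probability 1/2,
    # so enumerate all 2**n sign patterns (feasible only for small n).
    hits = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus
    )
    return hits / 2 ** n


# Hypothetical energies (joules) for 8 tasks; the candidate wins every pair.
baseline = [100, 120, 95, 200, 150, 80, 300, 110]
candidate = [90, 100, 90, 170, 142, 79, 250, 107]
print(f"p = {wilcoxon_exact_greater(baseline, candidate):.5f}")
```

With 8 nonzero differences that all favour the candidate, only one of the 256 sign patterns reaches the observed statistic, so the smallest attainable one-sided p-value is 1/256. Small samples therefore put a floor on how small the p-value can get.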


--- 3. Sensitivity Analysis ---

Robustness of measurements (Coefficient of Variation across runs):
  - Average CV for Runtime: 8.62%
  - Average CV for Energy:  10.75%
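
Note on the CV figures above: the coefficient of variation is the standard deviation of repeated measurements divided by their mean. A minimal sketch, assuming the sample standard deviation is used and that per-candidate CVs are averaged with equal weight (the log confirms neither; names and data are hypothetical):

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """CV in percent: sample standard deviation divided by the mean."""
    m = mean(samples)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(samples) / m * 100.0


def average_cv(runs_by_candidate):
    """Mean CV over candidates; maps a candidate id to its repeated measurements."""
    return mean(coefficient_of_variation(runs) for runs in runs_by_candidate.values())


# Hypothetical repeated energy readings (joules) for two candidates.
runs = {"cand_a": [9.0, 10.0, 11.0], "cand_b": [19.0, 20.0, 21.0]}
print(f"{average_cv(runs):.2f}%")
```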

Ranking stability (Effect of input scale):
  - 3 out of 10 tasks (30.00%) had a consistent energy-guided candidate across all scales.
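
Note on the stability figure above: one way to compute it is to record the energy-guided pick at each input scale and count the tasks whose pick never changes. A minimal sketch under that assumption (the data layout, task names, and scale labels are hypothetical):

```python
def stable_tasks(selections):
    """Return tasks whose selected candidate is identical at every scale.

    `selections` maps task -> {scale: selected candidate id}.
    """
    return [
        task
        for task, by_scale in selections.items()
        if len(set(by_scale.values())) == 1
    ]


# Hypothetical selections for three tasks at three input scales.
selections = {
    "task_1": {"small": "c2", "medium": "c2", "large": "c2"},
    "task_2": {"small": "c1", "medium": "c3", "large": "c3"},
    "task_3": {"small": "c4", "medium": "c4", "large": "c1"},
}
stable = stable_tasks(selections)
share = len(stable) / len(selections) * 100.0
print(f"{len(stable)} out of {len(selections)} tasks ({share:.2f}%) are stable")
```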


--- Phase 5 Evaluation Complete ---
Full report saved to: phase5_evaluation/evaluation_summary_report.txt
