
HDO System Evaluation Report
============================

Evaluation conducted on 6 episodes
Timestamp: 1758801490.8708942 (2025-09-25 11:58:10 UTC)

CORE ALIGNMENT METRICS
----------------------
Alignment Accuracy: 0.667
Oversight Accuracy: 0.767
False Positive Rate: 0.000
False Negative Rate: 1.000
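
The alignment accuracy and the two error rates above are mutually consistent under the usual confusion-matrix definitions. The following is a minimal sketch, assuming (the report does not state this) that 2 of the 6 episodes were misaligned and none of them was flagged, while all 4 aligned episodes were passed. The counts and the function name ``confusion_metrics`` are hypothetical and only illustrate the arithmetic.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return (accuracy, false positive rate, false negative rate)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against empty classes to avoid division by zero
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return accuracy, fpr, fnr


# Hypothetical counts for 6 episodes (an assumption, not taken from the report):
# 4 aligned episodes correctly passed, 2 misaligned episodes missed.
accuracy, fpr, fnr = confusion_metrics(tp=0, fp=0, tn=4, fn=2)

print(f"Alignment Accuracy: {accuracy:.3f}")
print(f"False Positive Rate: {fpr:.3f}")
print(f"False Negative Rate: {fnr:.3f}")
```

Under this assumed breakdown, an accuracy of 0.667 together with a false negative rate of 1.000 would mean the system never flagged a misaligned episode, so the accuracy figure would reflect only the share of aligned episodes in the sample.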

HALLUCINATION & RELIABILITY
----------------------------
Collective Hallucination Rate: 0.000
Individual Hallucination Rate: 0.000
Consistency Score: 0.000
Calibration Score: 0.383

EFFICIENCY METRICS
------------------
Average Cost per Episode: $0.00
Cost Efficiency Ratio: 0.00×
Token Usage per Episode: 0
Time per Episode: 0.00s

SCALABILITY METRICS
-------------------
Delegation Depth Efficiency: 0.000
Verifier Utilization: 0.000
Parallel Efficiency: 0.000

RISK & SAFETY METRICS
---------------------
Risk Bound Tightness: 0.689
Safety Margin: 0.000
Collusion Detection Rate: 0.000

COMPARATIVE PERFORMANCE
-----------------------
Improvement over Flat Debate: 0.00×
Improvement over Human Loop: 1.00×
Accuracy vs Single Verifier: 0.93×

PAPER CLAIMS VERIFICATION
-------------------------
Collective hallucination reduction: 100.0% (claimed: 28%)
Oversight accuracy: 76.7% (claimed: 95%)
Cost efficiency: 0.0× (claimed: 3-5×)
Token efficiency: 1.0× (claimed: 2×)

PERFORMANCE ASSESSMENT
----------------------
✗ Oversight accuracy below paper claims
✓ Hallucination reduction meets paper claims
✗ Cost efficiency below paper claims
✗ Token efficiency below paper claims
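
The pass/fail marks above can be reproduced from the figures in the PAPER CLAIMS VERIFICATION section. The following is a minimal sketch, assuming a claim counts as met when the measured value is at least the claimed value (taking the lower end of the 3-5× range for cost efficiency). The threshold rule and the names ``claims`` and ``assess`` are assumptions for illustration, not the evaluator's actual logic.

```python
# (measured, claimed lower bound) pairs copied from the report
claims = {
    "Oversight accuracy": (76.7, 95.0),        # percent
    "Hallucination reduction": (100.0, 28.0),  # percent
    "Cost efficiency": (0.0, 3.0),             # ratio, lower end of 3-5x
    "Token efficiency": (1.0, 2.0),            # ratio
}


def assess(claims):
    """Map each claim name to True when measured >= claimed (assumed rule)."""
    return {
        name: measured >= claimed
        for name, (measured, claimed) in claims.items()
    }


results = assess(claims)
for name, ok in results.items():
    mark = "✓" if ok else "✗"
    verdict = "meets" if ok else "below"
    print(f"{mark} {name} {verdict} paper claims")
```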
