
Aug 26 at 18:23:46.544
2025-08-26 12:53:46,537 - INFO - ====================================================================================================
2025-08-26 12:53:46,537 - INFO - H3 LENGTH-CONTROL ANALYSIS - llama-4-scout-17b-16e-instruct on H2
2025-08-26 12:53:46,537 - INFO - ====================================================================================================
2025-08-26 12:53:46,544 - INFO - ✅ Loaded project configuration
2025-08-26 12:53:46,544 - INFO - 📁 Scores input: /research_storage/outputs/h2/scoring/llama-4-scout-17b-16e-instruct_h2_scores.jsonl
2025-08-26 12:53:46,544 - INFO - 📊 Length data will be extracted from scoring diagnostics
2025-08-26 12:53:46,544 - INFO - 
🔍 VALIDATING DATA AVAILABILITY
2025-08-26 12:53:46,544 - INFO - ============================================================
2025-08-26 12:53:46,547 - INFO - ✅ Scoring file validation passed
2025-08-26 12:53:46,547 - INFO -    Sample contains 10 fields
2025-08-26 12:53:46,547 - INFO - 
📊 Loading score data...
2025-08-26 12:53:46,561 - INFO - ✅ Loaded 162 scored samples
2025-08-26 12:53:46,562 - INFO - 
📊 Extracting response length data from scoring diagnostics...
2025-08-26 12:53:46,564 - INFO - ✅ Extracted response lengths for 162/162 samples
2025-08-26 12:53:46,564 - INFO - 📊 Length statistics:
2025-08-26 12:53:46,565 - INFO -    Mean: 2089.5 chars
2025-08-26 12:53:46,565 - INFO -    Median: 2264.6 chars
2025-08-26 12:53:46,566 - INFO -    Range: 41-5188 chars
2025-08-26 12:53:46,566 - INFO - 
📊 Dataset composition:
2025-08-26 12:53:46,566 - INFO -    Harmful samples: 81
2025-08-26 12:53:46,567 - INFO -    Benign samples: 81
2025-08-26 12:53:46,567 - INFO -    Total samples: 162
2025-08-26 12:53:46,567 - INFO -    Samples with length data: 162
2025-08-26 12:53:46,567 - INFO - 
============================================================
2025-08-26 12:53:46,567 - INFO - ORIGINAL SEMANTIC ENTROPY PERFORMANCE (ALL TAU VALUES)
2025-08-26 12:53:46,567 - INFO - ============================================================
2025-08-26 12:53:46,570 - INFO - 📊 τ=0.1: AUROC=0.6913, FNR@5%FPR=0.6543 [CI: 0.5459-0.7488]
2025-08-26 12:53:46,572 - INFO - 📊 τ=0.2: AUROC=0.6173, FNR@5%FPR=0.7654 [CI: 0.6625-0.8444]
2025-08-26 12:53:46,574 - INFO - 📊 τ=0.3: AUROC=0.5864, FNR@5%FPR=0.8272 [CI: 0.7305-0.8942]
2025-08-26 12:53:46,577 - INFO - 📊 τ=0.4: AUROC=0.5679, FNR@5%FPR=0.8642 [CI: 0.7730-0.9224]
2025-08-26 12:53:46,577 - INFO - 🏆 Best performing τ=0.1 (AUROC: 0.6913)
2025-08-26 12:53:46,577 - INFO - 
============================================================
2025-08-26 12:53:46,577 - INFO - FITTING LENGTH MODELS FOR ALL TAU VALUES
2025-08-26 12:53:46,577 - INFO - ============================================================
2025-08-26 12:53:46,578 - INFO - 📊 Fitting length models on 81 benign samples
2025-08-26 12:53:46,578 - INFO - 
🔬 Processing τ=0.1...
2025-08-26 12:53:46,585 - INFO -    Length model R²: 0.1027
2025-08-26 12:53:46,585 - INFO -    Intercept: 1.4726, Slope: -0.1738
2025-08-26 12:53:46,586 - INFO -    Calculated residuals for 162 samples
2025-08-26 12:53:46,586 - INFO - 
🔬 Processing τ=0.2...
2025-08-26 12:53:46,587 - INFO -    Length model R²: 1.0000
2025-08-26 12:53:46,587 - INFO -    Intercept: 0.0000, Slope: 0.0000
2025-08-26 12:53:46,588 - INFO -    Calculated residuals for 162 samples
2025-08-26 12:53:46,588 - INFO - 
🔬 Processing τ=0.3...
2025-08-26 12:53:46,589 - INFO -    Length model R²: 1.0000
2025-08-26 12:53:46,589 - INFO -    Intercept: 0.0000, Slope: 0.0000
2025-08-26 12:53:46,592 - INFO -    Calculated residuals for 162 samples
2025-08-26 12:53:46,593 - INFO - 
🔬 Processing τ=0.4...
2025-08-26 12:53:46,594 - INFO -    Length model R²: 1.0000
2025-08-26 12:53:46,594 - INFO -    Intercept: 0.0000, Slope: 0.0000
2025-08-26 12:53:46,595 - INFO -    Calculated residuals for 162 samples
2025-08-26 12:53:46,595 - INFO - ✅ Fitted length models for 4 tau values
2025-08-26 12:53:46,595 - INFO - 
============================================================
2025-08-26 12:53:46,595 - INFO - RESIDUAL SEMANTIC ENTROPY PERFORMANCE (ALL TAU VALUES)
2025-08-26 12:53:46,595 - INFO - ============================================================
2025-08-26 12:53:46,595 - INFO - 
🔬 Evaluating residuals for τ=0.1...
2025-08-26 12:53:46,597 - INFO -    📈 Residual AUROC: 0.6301 (drop: +0.0612)
2025-08-26 12:53:46,598 - INFO -    📉 Residual FNR@5%: 0.6914 (increase: +0.0370)
2025-08-26 12:53:46,598 - INFO -    🎯 H3 supported: ❌ (AUROC >= 0.55)
2025-08-26 12:53:46,598 - INFO - 
🔬 Evaluating residuals for τ=0.2...
2025-08-26 12:53:46,599 - INFO -    📈 Residual AUROC: 0.6173 (drop: +0.0000)
2025-08-26 12:53:46,599 - INFO -    📉 Residual FNR@5%: 0.7654 (increase: +0.0000)
2025-08-26 12:53:46,600 - INFO -    🎯 H3 supported: ❌ (AUROC >= 0.55)
2025-08-26 12:53:46,600 - INFO - 
🔬 Evaluating residuals for τ=0.3...
2025-08-26 12:53:46,601 - INFO -    📈 Residual AUROC: 0.5864 (drop: +0.0000)
2025-08-26 12:53:46,601 - INFO -    📉 Residual FNR@5%: 0.8272 (increase: +0.0000)
2025-08-26 12:53:46,601 - INFO -    🎯 H3 supported: ❌ (AUROC >= 0.55)
2025-08-26 12:53:46,601 - INFO - 
🔬 Evaluating residuals for τ=0.4...
2025-08-26 12:53:46,603 - INFO -    📈 Residual AUROC: 0.5679 (drop: +0.0000)
2025-08-26 12:53:46,603 - INFO -    📉 Residual FNR@5%: 0.8642 (increase: +0.0000)
2025-08-26 12:53:46,603 - INFO -    🎯 H3 supported: ❌ (AUROC >= 0.55)
2025-08-26 12:53:46,603 - INFO - 
💾 Saving per-prompt residual entropy data...
2025-08-26 12:53:46,627 - INFO -    ✅ Per-prompt residuals saved to /research_storage/outputs/h3/llama-4-scout-17b-16e-instruct_per_prompt_residuals.jsonl
2025-08-26 12:53:46,627 - INFO - 
📊 OVERALL H3 STATUS: ❌ NOT SUPPORTED
2025-08-26 12:53:46,627 - INFO -    No length confounding detected for any τ value
2025-08-26 12:53:46,627 - INFO - 
============================================================
2025-08-26 12:53:46,627 - INFO - BASELINE COMPARISON (FOR CONTEXT)
2025-08-26 12:53:46,627 - INFO - ============================================================
2025-08-26 12:53:46,627 - INFO - 📝 Note: H3 primary test is residual SE AUROC < 0.55, baselines shown for context
2025-08-26 12:53:46,630 - INFO - 📊 Avg Pairwise Bertscore: AUROC=0.5057, FNR@5%=0.7407 [CI: 0.636-0.824]
2025-08-26 12:53:46,632 - INFO - 📊 Embedding Variance: AUROC=0.6837, FNR@5%=0.6049 [CI: 0.496-0.704]
2025-08-26 12:53:46,634 - INFO - 📊 Levenshtein Variance: AUROC=0.3969, FNR@5%=0.9259 [CI: 0.848-0.966]
2025-08-26 12:53:46,634 - INFO - 
============================================================
2025-08-26 12:53:46,634 - INFO - H3 HYPOTHESIS FINAL STATUS
2025-08-26 12:53:46,634 - INFO - ============================================================
2025-08-26 12:53:46,634 - INFO - ❌ H3 NOT SUPPORTED: No significant length confounding detected
2025-08-26 12:53:46,634 - INFO -    All τ values retain detection capability after length control
2025-08-26 12:53:46,634 - INFO -    SE captures meaningful signals beyond response length patterns
2025-08-26 12:53:46,634 - INFO -    Best residual performance: τ=0.1 (Residual AUROC: 0.6301)
2025-08-26 12:53:46,636 - INFO - 
💾 Saving detailed per-prompt analysis...
2025-08-26 12:53:46,699 - INFO - 💾 Results saved to: /research_storage/outputs/h3/llama-4-scout-17b-16e-instruct_H2_h3_results.json
2025-08-26 12:53:46,699 - INFO - 💾 Per-prompt analysis saved to: /research_storage/outputs/h3/llama-4-scout-17b-16e-instruct_H2_h3_prompt_analysis.jsonl
2025-08-26 12:53:46,699 - INFO - 📊 Detailed data includes:
2025-08-26 12:53:46,699 - INFO -    - Original/Predicted/Residual SE for all τ values
2025-08-26 12:53:46,699 - INFO -    - Response lengths and log-lengths
2025-08-26 12:53:46,699 - INFO -    - Baseline scores for comparison
2025-08-26 12:53:46,699 - INFO -    - Prompt-level labels and metadata
2025-08-26 12:53:48,964 - INFO - ====================================================================================================
2025-08-26 12:53:48,964 - INFO - H3 LENGTH-CONTROL ANALYSIS - qwen2.5-7b-instruct on H2
2025-08-26 12:53:48,964 - INFO - ====================================================================================================
2025-08-26 12:53:48,970 - INFO - ✅ Loaded project configuration
2025-08-26 12:53:48,970 - INFO - 📁 Scores input: /research_storage/outputs/h2/scoring/qwen2.5-7b-instruct_h2_scores.jsonl
2025-08-26 12:53:48,970 - INFO - 📊 Length data will be extracted from scoring diagnostics
2025-08-26 12:53:48,970 - INFO - 
🔍 VALIDATING DATA AVAILABILITY
2025-08-26 12:53:48,970 - INFO - ============================================================
2025-08-26 12:53:49,596 - INFO - ✅ Scoring file validation passed
2025-08-26 12:53:49,596 - INFO -    Sample contains 10 fields
2025-08-26 12:53:49,596 - INFO - 
📊 Loading score data...
2025-08-26 12:53:49,608 - INFO - ✅ Loaded 162 scored samples
2025-08-26 12:53:49,608 - INFO - 
📊 Extracting response length data from scoring diagnostics...
2025-08-26 12:53:49,610 - INFO - ✅ Extracted response lengths for 162/162 samples
2025-08-26 12:53:49,610 - INFO - 📊 Length statistics:
2025-08-26 12:53:49,610 - INFO -    Mean: 2478.2 chars
2025-08-26 12:53:49,611 - INFO -    Median: 2475.3 chars
2025-08-26 12:53:49,611 - INFO -    Range: 242-5587 chars
2025-08-26 12:53:49,611 - INFO - 
📊 Dataset composition:
2025-08-26 12:53:49,611 - INFO -    Harmful samples: 81
2025-08-26 12:53:49,611 - INFO -    Benign samples: 81
2025-08-26 12:53:49,611 - INFO -    Total samples: 162
2025-08-26 12:53:49,612 - INFO -    Samples with length data: 162
2025-08-26 12:53:49,612 - INFO - 
============================================================
2025-08-26 12:53:49,612 - INFO - ORIGINAL SEMANTIC ENTROPY PERFORMANCE (ALL TAU VALUES)
2025-08-26 12:53:49,612 - INFO - ============================================================
2025-08-26 12:53:49,614 - INFO - 📊 τ=0.1: AUROC=0.7326, FNR@5%FPR=0.6296 [CI: 0.5208-0.7267]
2025-08-26 12:53:49,616 - INFO - 📊 τ=0.2: AUROC=0.5556, FNR@5%FPR=0.8889 [CI: 0.8021-0.9404]
2025-08-26 12:53:49,618 - INFO - 📊 τ=0.3: AUROC=0.5123, FNR@5%FPR=0.9753 [CI: 0.9144-0.9932]
2025-08-26 12:53:49,618 - INFO - 📝 τ=0.4: All SE scores are 0 (perfect consistency) - analyzing anyway
2025-08-26 12:53:49,620 - INFO - 📊 τ=0.4: AUROC=0.5000, FNR@5%FPR=1.0000 [CI: 0.9547-1.0000]
2025-08-26 12:53:49,620 - INFO - 🏆 Best performing τ=0.1 (AUROC: 0.7326)
2025-08-26 12:53:49,620 - INFO - 
============================================================
2025-08-26 12:53:49,620 - INFO - FITTING LENGTH MODELS FOR ALL TAU VALUES
2025-08-26 12:53:49,620 - INFO - ============================================================
2025-08-26 12:53:49,621 - INFO - 📊 Fitting length models on 81 benign samples
2025-08-26 12:53:49,621 - INFO - 
🔬 Processing τ=0.1...
2025-08-26 12:53:49,623 - INFO -    Length model R²: 0.0001
2025-08-26 12:53:49,623 - INFO -    Intercept: 0.1934, Slope: -0.0072
2025-08-26 12:53:49,623 - INFO -    Calculated residuals for 162 samples
2025-08-26 12:53:49,623 - INFO - 
🔬 Processing τ=0.2...
2025-08-26 12:53:49,624 - INFO -    Length model R²: 1.0000
2025-08-26 12:53:49,624 - INFO -    Intercept: 0.0000, Slope: 0.0000
2025-08-26 12:53:49,625 - INFO -    Calculated residuals for 162 samples
2025-08-26 12:53:49,625 - INFO - 
🔬 Processing τ=0.3...
2025-08-26 12:53:49,625 - INFO -    Length model R²: 1.0000
2025-08-26 12:53:49,626 - INFO -    Intercept: 0.0000, Slope: 0.0000
2025-08-26 12:53:49,626 - INFO -    Calculated residuals for 162 samples
2025-08-26 12:53:49,626 - INFO - 
🔬 Processing τ=0.4...
2025-08-26 12:53:49,632 - INFO -    Length model R²: 1.0000
2025-08-26 12:53:49,632 - INFO -    Intercept: 0.0000, Slope: 0.0000
2025-08-26 12:53:49,633 - INFO -    Calculated residuals for 162 samples
2025-08-26 12:53:49,633 - INFO - ✅ Fitted length models for 4 tau values
2025-08-26 12:53:49,633 - INFO - 
============================================================
2025-08-26 12:53:49,633 - INFO - RESIDUAL SEMANTIC ENTROPY PERFORMANCE (ALL TAU VALUES)
2025-08-26 12:53:49,633 - INFO - ============================================================
2025-08-26 12:53:49,633 - INFO - 
🔬 Evaluating residuals for τ=0.1...
2025-08-26 12:53:49,635 - INFO -    📈 Residual AUROC: 0.6905 (drop: +0.0421)
2025-08-26 12:53:49,635 - INFO -    📉 Residual FNR@5%: 0.6296 (increase: +0.0000)
2025-08-26 12:53:49,635 - INFO -    🎯 H3 supported: ❌ (AUROC >= 0.55)
2025-08-26 12:53:49,635 - INFO - 
🔬 Evaluating residuals for τ=0.2...
2025-08-26 12:53:49,636 - INFO -    📈 Residual AUROC: 0.5556 (drop: +0.0000)
2025-08-26 12:53:49,637 - INFO -    📉 Residual FNR@5%: 0.8889 (increase: +0.0000)
2025-08-26 12:53:49,637 - INFO -    🎯 H3 supported: ❌ (AUROC >= 0.55)
2025-08-26 12:53:49,637 - INFO - 
🔬 Evaluating residuals for τ=0.3...
2025-08-26 12:53:49,638 - INFO -    📈 Residual AUROC: 0.5123 (drop: +0.0000)
2025-08-26 12:53:49,638 - INFO -    📉 Residual FNR@5%: 0.9753 (increase: +0.0000)
2025-08-26 12:53:49,638 - INFO -    🎯 H3 supported: ✅ (AUROC < 0.55)
2025-08-26 12:53:49,638 - INFO - 
🔬 Evaluating residuals for τ=0.4...
2025-08-26 12:53:49,640 - INFO -    📈 Residual AUROC: 0.5000 (drop: +0.0000)
2025-08-26 12:53:49,640 - INFO -    📉 Residual FNR@5%: 1.0000 (increase: +0.0000)
2025-08-26 12:53:49,640 - INFO -    🎯 H3 supported: ✅ (AUROC < 0.55)
2025-08-26 12:53:49,640 - INFO - 
💾 Saving per-prompt residual entropy data...
2025-08-26 12:53:49,689 - INFO -    ✅ Per-prompt residuals saved to /research_storage/outputs/h3/qwen2.5-7b-instruct_per_prompt_residuals.jsonl
2025-08-26 12:53:49,689 - INFO - 
📊 OVERALL H3 STATUS: ✅ SUPPORTED
2025-08-26 12:53:49,689 - INFO -    Length confounding detected for τ values: [0.3, 0.4]
2025-08-26 12:53:49,689 - INFO - 
============================================================
2025-08-26 12:53:49,689 - INFO - BASELINE COMPARISON (FOR CONTEXT)
2025-08-26 12:53:49,689 - INFO - ============================================================
2025-08-26 12:53:49,689 - INFO - 📝 Note: H3 primary test is residual SE AUROC < 0.55, baselines shown for context
2025-08-26 12:53:49,692 - INFO - 📊 Avg Pairwise Bertscore: AUROC=0.4312, FNR@5%=0.8519 [CI: 0.759-0.913]
2025-08-26 12:53:49,694 - INFO - 📊 Embedding Variance: AUROC=0.7243, FNR@5%=0.6543 [CI: 0.546-0.749]
2025-08-26 12:53:49,696 - INFO - 📊 Levenshtein Variance: AUROC=0.5728, FNR@5%=0.8148 [CI: 0.717-0.884]
2025-08-26 12:53:49,696 - INFO - 
============================================================
2025-08-26 12:53:49,696 - INFO - H3 HYPOTHESIS FINAL STATUS
2025-08-26 12:53:49,696 - INFO - ============================================================
2025-08-26 12:53:49,696 - INFO - ✅ H3 SUPPORTED: Length confounding detected
2025-08-26 12:53:49,696 - INFO -    τ values showing confounding: [0.3, 0.4]
2025-08-26 12:53:49,696 - INFO -    After controlling for length, SE performance degrades to near-random
2025-08-26 12:53:49,696 - INFO -    This indicates length is a primary signal driving SE detection
2025-08-26 12:53:49,696 - INFO -    Most severe confounding: τ=0.1 (AUROC drop: 0.0421)
2025-08-26 12:53:49,698 - INFO - 
💾 Saving detailed per-prompt analysis...
2025-08-26 12:53:49,762 - INFO - 💾 Results saved to: /research_storage/outputs/h3/qwen2.5-7b-instruct_H2_h3_results.json
2025-08-26 12:53:49,762 - INFO - 💾 Per-prompt analysis saved to: /research_storage/outputs/h3/qwen2.5-7b-instruct_H2_h3_prompt_analysis.jsonl
2025-08-26 12:53:49,762 - INFO - 📊 Detailed data includes:
2025-08-26 12:53:49,762 - INFO -    - Original/Predicted/Residual SE for all τ values
2025-08-26 12:53:49,762 - INFO -    - Response lengths and log-lengths
2025-08-26 12:53:49,762 - INFO -    - Baseline scores for comparison
2025-08-26 12:53:49,762 - INFO -    - Prompt-level labels and metadata
2025-08-26 12:53:54,093 - INFO - ====================================================================================================
2025-08-26 12:53:54,094 - INFO - GENERATING H3 COMPREHENSIVE REPORT
2025-08-26 12:53:54,094 - INFO - ====================================================================================================
2025-08-26 12:53:54,095 - INFO - 📂 Loading: llama-4-scout-17b-16e-instruct_H2_h3_results.json
2025-08-26 12:53:54,314 - INFO - 📂 Loading: qwen2.5-7b-instruct_H2_h3_results.json
2025-08-26 12:53:54,539 - INFO - ✅ Report saved to: /research_storage/reports/h3_length_control_report.md
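
Note on reading the output above: the log reports a length model fitted on benign samples only (intercept and slope), residuals computed for all samples, and a residual AUROC compared against 0.55. The following is a minimal sketch of that kind of procedure, not the pipeline's actual code. It assumes ordinary least squares of SE on the natural log of response length in characters; the function names (`fit_line`, `auroc`, `residual_auroc`) and the toy data are hypothetical. It also illustrates one plausible reading of the "Intercept: 0.0000, Slope: 0.0000" lines: if every benign sample has SE = 0, the fitted line is identically zero, the residuals equal the original scores, and the AUROC drop is exactly 0.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y - slope * mean_x, slope

def auroc(scores, labels):
    """Chance that a harmful (1) sample outscores a benign (0) one; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def residual_auroc(se, lengths, labels):
    """Fit SE ~ log(length) on benign samples only, then score residuals for all samples."""
    log_len = [math.log(n) for n in lengths]
    benign = [i for i, l in enumerate(labels) if l == 0]
    intercept, slope = fit_line([log_len[i] for i in benign], [se[i] for i in benign])
    residuals = [s - (intercept + slope * x) for s, x in zip(se, log_len)]
    return auroc(residuals, labels), intercept, slope

# Hypothetical toy data: all benign samples have SE = 0 (the degenerate case).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
se = [0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.7, 0.0]
lengths = [100, 400, 900, 2500, 300, 1200, 2000, 50]

original = auroc(se, labels)
residual, intercept, slope = residual_auroc(se, lengths, labels)
print(f"original={original:.4f} residual={residual:.4f} "
      f"intercept={intercept:.4f} slope={slope:.4f}")
```

Under this reading, a zero drop at a given τ says that the benign SE scores carried no length trend to remove. It does not by itself show that length was controlled for, so a residual AUROC below 0.55 in those rows may simply reflect an original AUROC that was already below 0.55.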