# Paper Outline Review

## What Works Well
- Strong, memorable title and core narrative: "The Consistency Confound" crisply captures the central mechanism and frames the paper.
- Protocol alignment: The Evaluation Protocol (2.4) now matches actual practice (no separate calibration split; thresholds chosen to not exceed 5% FPR; actual_fpr reported). This increases reproducibility and honesty.
- τ policy clarified: Best-τ for AUROC and canonical τ=0.2 for FNR comparisons are explicitly stated, reducing ambiguity.
- Implementation breadcrumbs: Clear references to code files and logs (src/core/semantic_entropy.py, src/core/baseline_metrics.py, src/core/response_generator_openrouter.py; generation logs in outputs/h1/... and outputs/h2/...).
- Inclusion of H4 N-sensitivity: New results quantify N=5→10 effects and show brittleness persists with τ, citing outputs/h4/evaluation/h4_brittleness_results.json.
- H5 paraphrase section corrected: Now states "some baselines improved" with concrete deltas (from outputs/h5/evaluation/h5_paraphrase_degradation_report.md).
- H6 audits strengthened: Percentages now include counts (44/60, 79/81) and sources are referenced.
- Figure/Table specs improved: Several figures include axes and data sources (e.g., Fig. 1, Fig. 3); good direction.

## Critical Issues Found

### Missing Results
- H6 Qwen/HarmBench JSON is cited but unavailable:
  - Missing file: outputs/h6/qwen-h2-harmbench/qwen-2.5-7b-instruct_H2_h6_qualitative_audit_results.json (attempted read returned unavailable).
    The markdown audit exists and should be cited instead.
    - Available source: outputs/h6/qwen-h2-harmbench/qwen-2.5-7b-instruct_H2_h6_qualitative_audit.md
- Dataset manifest references (for JBB and H2 twins) are not linked inline for reproducibility even though they exist:
  - data/manifests/jbb_validation_ids.json, data/processed/h2_harmbench_twins_test.jsonl

### Accuracy Problems
- τ policy inconsistency in Results text:
  - Section 2.4 says FNR@5%FPR should be reported at canonical τ=0.2 for SE. However, Section 3.2 uses SE at τ=0.1 for FNR comparisons (e.g., Llama: compares Embedding Variance 0.605 vs SE 0.654 at τ=0.1; Qwen: reports SE τ=0.1 FNR 0.630). This conflicts with the stated policy.
  - Fix: Either (a) adhere to τ=0.2 for all FNR mentions in text, and separately note best-τ FNRs in Table 2, or (b) change the stated policy to "report FNR at best τ" and apply it consistently across H1, H2, and H5.
- Minor artifact: The outline contains a stray line "code\nCode" under the title. Remove it to avoid confusion.

### Narrative Gaps
- Figure/Table specification completeness:
  - While several figures now specify axes and sources, ensure all do:
    - Figure 1: add explicit filepaths: outputs/h1/evaluation/llama4scout_120val_results.json; outputs/h1/evaluation/qwen25_120val_results.json.
      State that the SE bar uses the best-τ AUROC.
    - Figure 2 (H3): specify data filepath explicitly (outputs/h3/per_prompt_analysis/llama-4-scout-17b-16e-instruct_H2_h3_prompt_analysis.jsonl) and whether length is median or mean across N.
    - Figure 3 (H4): specify which baseline the "stable baseline" is (e.g., Embedding Variance) and include its data series; source likely in outputs/h2/evaluation/qwen2.5-7b-instruct_h2_results.json for baseline reference.
    - Figure 4 (H6): include data sources for counts (H6 JSON/MD files) and the per-prompt prediction files if needed: outputs/h6/*/per_prompt_predictions.jsonl.
- Data provenance section: The previous outline included a Data Provenance Note. Consider reinstating a brief section to record file versions/dates and outstanding planned experiments (e.g., H7 not executed).
- Formal citations: Section 2.2 references Farquhar et al. (2024) and a Nature paper; add explicit citation keys or footnotes sourced from papers/methodology_notes.md and literature_review_synthesis_notes_1.md.

## Specific Improvements Needed

### Section-by-Section Feedback
1. Introduction:
- Current: Well-framed and introduces three claims clearly.
- Issue: None critical; optionally foreshadow τ/N brittleness and evaluation protocol choices.
- Suggestion: Add one sentence: "Because SE clusters responses at a distance threshold τ and estimates entropy from a small number of samples N, its utility depends sensitively on τ and N (Section 4.2; outputs/h4/evaluation/h4_brittleness_results.json)."

2. Methodology:
- Current: Concise and accurate; now cites core code files and logs.
- Issues:
  - SE implementation details: briefly note clustering linkage (average) and cosine distance with code reference (src/core/semantic_entropy.py) so the method is reproducible.
  - Datasets: Add explicit references to dataset files/manifests (data/processed/h2_harmbench_twins_test.jsonl; data/manifests/jbb_validation_ids.json).
- Suggestions:
  - Keep the clarified evaluation protocol and τ policy.
    If you keep canonical τ=0.2 for FNR, ensure all Results adhere (see below).

3.1 H1 (JBB):
- Current: Correct comparisons; includes τ context and actual_fpr for Llama.
- Issues: For Qwen, include actual_fpr and thresholds if available in qwen25_120val_results.json; otherwise, explicitly note they were not logged.
- Suggestions:
  - Add specific thresholds for SE FNR at τ=0.1 and 0.2 (from the JSON, if available). Include the exact filepath.

3.2 H2 (HarmBench twins):
- Current: Accurately characterizes instability and model dependence; numeric values match raw files.
- Issue: τ policy inconsistency (see above).
- Suggestions:
  - Option A (recommended): Report SE FNR at canonical τ=0.2 (Llama: 0.765) in the text, and mention that best-τ=0.1 achieves 0.654, but this "win" is still worse than Embedding Variance (0.605). For Qwen, report both τ=0.2 and best-τ results for transparency.
  - Update Table 2 to include: for SE, both τ=0.2 (canonical) and best-τ; include actual_fpr and thresholds columns.

4.1 H3:
- Current: Numerically consistent; clear setup.
- Suggestion: Include model spec for the regression (SE score ~ log(length)), number of samples (N=162), and confirm that the figure uses median response length across 5 generations; cite the JSONL path.

4.2 H4:
- Current: Excellent inclusion of N-sensitivity and τ-brittleness with exact numbers from outputs/h4/evaluation/h4_brittleness_results.json.
- Suggestion: Specify which "stable baseline" you will plot (e.g., Embedding Variance) and include its FNR@5%FPR value for Qwen/H2 from outputs/h2/evaluation/qwen2.5-7b-instruct_h2_results.json.

4.3 H5:
- Current: Corrected to "some baselines improved" with precise deltas.
- Suggestion: Add AUROC shifts as in the report for completeness.

5. H6:
- Current: Counts added for both audits; definitions are precise.
- Issues: Qwen/H2 audit JSON is unavailable.
- Suggestions:
  - Cite the markdown audit instead: outputs/h6/qwen-h2-harmbench/qwen-2.5-7b-instruct_H2_h6_qualitative_audit.md. Optionally, export a JSON summary from that report to align with Llama’s format for consistency.
  - Include per-prompt prediction filepaths for reproducibility: outputs/h6/qwen-h2-harmbench/qwen-2.5-7b-instruct_H2_per_prompt_predictions.jsonl and outputs/h6/llama-h2-harmbench/llama-4-scout-17b-16e-instruct_H2_per_prompt_predictions.jsonl (if used for counts).

6. Discussion/Conclusion:
- Current: Limitations added; clear implications.
- Suggestion: Add a brief "Data Provenance" note or appendix listing precise file versions used (timestamps in H4 JSON: 2025-08-28) and any discrepancies resolved (e.g., use of markdown audit for Qwen/H2 H6 due to missing JSON).

### Missing Experiments to Include
- H2 detailed evaluation reports (optional but informative):
  - Location: outputs/h2/evaluation/h2_llama-4-scout-17b-16e-instruct_evaluation_report.md; outputs/h2/evaluation/h2_qwen2.5-7b-instruct_evaluation_report.md
  - Suggested placement: References in Sections 3.2 and 4.2.

### Reproducibility Gaps
- τ policy adherence: Ensure all FNR mentions in text follow the stated canonical τ=0.2 or revise the policy to match usage; then update text and tables consistently.
- Qwen/H2 H6 JSON missing: Either generate and commit the JSON summary or switch the citation to the markdown audit and include counts directly in the text.
- Add dataset manifests/paths inline in Methodology for JBB and H2 twins; cite configs/project_config.yaml where relevant.
- Include actual_fpr and threshold values in tables for transparency (already available in raw JSONs for H1/H2/H4).
- Add a short "Implementation Details" subsection citing src/core/evaluation.py (threshold selection logic) and src/core/baseline_metrics.py (method names/parameters).

## Priority Revisions (Ranked)
1. Resolve τ policy inconsistency: Align Section 3.2 and Table 2 with the stated policy or revise the policy; ensure consistent reporting across H1–H2.
2. Fix H6 Qwen/H2 citation: Replace missing JSON with the existing markdown audit path; optionally export a JSON summary for symmetry with Llama/JBB.
3. Add complete figure/table specifications with filepaths, thresholds, and actual_fpr columns; specify which baseline is used as "stable" in H4 plots.
4. Add dataset manifest paths in Methodology; include data provenance note (timestamps/versions) and per-run seeds/logs references.
5. Remove the "code\nCode" artifact.

## Verification Checklist
- [ ] All completed experiments included: PARTIAL – H6 Qwen/H2 JSON missing; otherwise H1–H5 and H6 results are incorporated.
- [ ] All metrics accurate: YES – Numbers in Sections 3–5 match raw files (H1 Llama JSON, H2 Llama JSON, H4 brittleness JSON, H5 report; Qwen H2 values corroborated by H4 JSON). Minor additions needed for actual_fpr/thresholds in text.
- [ ] Narrative flow logical: YES – Clear progression from hypothesis → negative results → mechanism → implications; minor consistency edits needed for τ policy.
- [ ] Reproducibility info complete: PARTIAL – Evaluation protocol aligned; add dataset manifests, thresholds/actual_fpr in tables; include evaluation code path and baseline parameters.
- [ ] Future work clearly separated: YES – Limitations and Future Work are explicit (calibration/CIs; H7).
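
To make the requested threshold and actual_fpr columns concrete, here is a minimal Python sketch of the rule the review describes: pick a threshold whose false positive rate does not exceed 5%, then report the realized FPR and the FNR at that threshold. It is an illustration under stated assumptions, not the logic in src/core/evaluation.py, which this review did not reproduce. The function name `fnr_at_fpr`, the convention that higher scores mean "more likely harmful", and the toy scores are all hypothetical.

```python
from typing import Sequence, Tuple


def fnr_at_fpr(
    scores: Sequence[float],
    labels: Sequence[int],
    target_fpr: float = 0.05,
) -> Tuple[float, float, float]:
    """Return (threshold, actual_fpr, fnr) for the rule 'flag if score >= threshold'.

    Assumptions (hypothetical, not taken from the project code):
    labels use 1 = harmful (positive) and 0 = benign (negative),
    and a higher score means the detector considers the prompt more harmful.
    """
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one positive and one negative example")

    # Fallback: a threshold above every score flags nothing (FPR = 0, FNR = 1).
    best = (float("inf"), 0.0, 1.0)
    # Candidate thresholds are the observed scores, scanned from high to low.
    for t in sorted(set(scores), reverse=True):
        fpr = sum(s >= t for s in negatives) / len(negatives)
        if fpr > target_fpr:
            break  # FPR only grows as t decreases, so stop at the first violation
        fnr = sum(s < t for s in positives) / len(positives)
        best = (t, fpr, fnr)
    return best


# Toy data: three harmful prompts and five benign ones (made-up scores).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0]
threshold, actual_fpr, fnr = fnr_at_fpr(toy_scores, toy_labels)
print(f"threshold={threshold}, actual_fpr={actual_fpr}, fnr={fnr:.3f}")
```

With only five benign examples, a single false positive already gives an FPR of 20%, so the realized FPR here is 0 rather than 5%. This gap between the target and the achieved rate is why reporting actual_fpr alongside each FNR matters on small evaluation sets.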