# Paper Outline Review

## What Works Well
- Clear, memorable central insight: The "Consistency Confound" is crisply formulated and repeatedly supported with quantitative and qualitative evidence across models and datasets.
- Strong alignment between protocol and reporting: The evaluation protocol now matches what was actually run (no separate calibration split; actual_fpr and thresholds are reported). The τ policy is declared upfront (best-τ for AUROC; canonical τ=0.2 for FNR comparisons), increasing transparency and reproducibility.
- Mechanism-first narrative: Sections 4.1–4.3 rule out confounders (length), show brittleness to τ/N, and demonstrate robustness to paraphrasing, converging on a single explanatory mechanism that is then validated by audits in Section 5.
- Concrete, reproducible references: Each major claim anchors to raw files (e.g., outputs/h1/, outputs/h2/, outputs/h4/evaluation/h4_brittleness_results.json, outputs/h5/evaluation/h5_paraphrase_degradation_report.md, outputs/h6/). Dataset manifests and code breadcrumbs are included.
- Expanded H4 and H6: Including N=10 results while preserving τ brittleness, and adding counts to the H6 audits (44/60; 79/81), materially strengthen the paper’s credibility.
- Positioning to community needs: The framing speaks directly to a live uncertainty in black-box monitoring: whether multi-sample consistency is a useful safety signal as alignment improves.

## Critical Issues Found

### Missing Results
- H6 (Qwen/H2) JSON symmetry missing: The outline correctly cites the markdown audit because the JSON is unavailable.
  For completeness and data hygiene, export a JSON summary to match the Llama/JBB audit format.
  - Missing file: outputs/h6/qwen-h2-harmbench/qwen-2.5-7b-instruct_H2_h6_qualitative_audit_results.json
  - Available source: outputs/h6/qwen-h2-harmbench/qwen-2.5-7b-instruct_H2_h6_qualitative_audit.md
- Baseline time/shift robustness: Given the community’s concern about detector aging (distribution shift), noting the H2→H5 stability check (paraphrases) is good. Discussion/Future Work should also point to the distribution-shift brittleness literature (e.g., JailbreaksOverTime) and, ideally, include a small sensitivity test if feasible (mark it as planned if not executed).

### Accuracy Problems
- τ policy application in Section 3.2: The text both (a) uses canonical τ=0.2 for Llama and (b) highlights a best-τ=0.1 "win" for Qwen. Because Section 2.4 states "FNR and cross-method comparisons use canonical τ=0.2," the phrase "appears to be the winning method" could mislead. Clarify that the win holds only at best-τ; at canonical τ the method collapses, which undermines deployability.
- Minor editorial artifact: Remove the stray "code\nCode" lines under the title.
- Nomenclature consistency: Use a single notation (FNR@5%FPR or FNR@t5FPR) throughout; the current draft mixes both.

### Narrative Gaps
- Elevate novelty vs SOTA: Make it explicit how this work differs from and challenges the assumption behind consistency-based uncertainty methods (Farquhar et al., SelfCheckGPT, UQLM-style disagreement scorers).
  The key contribution is not merely that SE underperforms, but that improved alignment systematically inverts the core proxy (consistency → safety) in the safety domain.
- Anchor the practical takeaway: Add a short “Design Principle” paragraph in Discussion: "Detectors whose primary signal is output diversity become less useful as refusal policies converge; detectors should instead be aligned to features positively correlated with safety (e.g., refusal-template features)."
- Clarify threat model and detection target: Are you detecting harmful prompts, jailbreak success, or both? The H2 twins set implies harmful vs matched benign classification, but it will help readers to state the detection task precisely (input-level vs output-level decision) and how a false negative is defined under that task.
- Figure/table specs: You improved specificity, but ensure all planned figures include exact data sources and axes:
  - Figure 1 (H1 AUROC): Name the SE τ used (best-τ) in the caption and include filepaths.
  - Figure 3 (H4): Name the baseline used for the “stable baseline” trace (e.g., Embedding Variance) and its data source file.
  - Figure 4 (H6): Add per-prompt prediction sources explicitly (outputs/h6//per_prompt_predictions.jsonl) so counts are traceable.

## Specific Improvements Needed

### Section-by-Section Feedback
1. Abstract & Introduction
- Current: Clear central question and strong headline findings; the mechanism is foregrounded.
- Suggestion: Add one sentence to claim novelty vs SOTA: "Unlike prior consistency-based detectors for hallucination detection (Farquhar et al., SelfCheckGPT), we show that in safety settings consistency increases with alignment, invalidating consistency as a black-box proxy for safety."

2. Methodology
- Current: Concise, with implementation breadcrumbs and datasets/manifests.
- Suggestions:
  - Add a one-line description of the clustering linkage/distance, since it affects brittleness (average linkage, cosine), and point to the exact function/params in src/core/semantic_entropy.py.
  - Explicitly state the distance-to-τ conversion and the meaning of the "Infinity threshold" behavior that leads to FNR=1.0 at higher τ.
  - Clarify threat model/detection target as noted above.

3. Results (H1–H2)
- Current: Quantitative and file-anchored; τ/N nuances acknowledged.
- Suggestions:
  - Enforce the τ policy in prose: always report canonical-τ FNRs in text for cross-method comparisons; present best-τ as a secondary, non-deployable reference. Adjust wording in 3.2 to make this distinction explicit for Qwen.
  - Add explicit thresholds and actual_fpr for each quoted FNR (available in the JSONs) to support reproducibility.

4. Failure Mode Analyses (H3–H5)
- Current: Well structured; H4 includes N=10.
- Suggestions:
  - H3: Add the regression formula and sample size in text (SE score at τ=0.1 ~ log(median length); N=162) and clarify the aggregation (median across N responses).
  - H4: Label the baseline trace (e.g., Embedding Variance) and explain why it is expected to be smoother than SE (it involves no thresholding and no entropy compression).
  - H5: Include AUROC deltas alongside FNR deltas, as already available in the report.

5. Consistency Confound (H6)
- Current: Criteria, counts, and examples are strong; Qwen/H2 cites markdown audit due to JSON unavailability.
- Suggestions:
  - Export JSON for Qwen/H2 audit for symmetry and long-term reproducibility.
  - Add a simple quantitative correlation: false-negative probability vs refusal-template rate or duplicate rate (scatter with logistic fit).
    Even one panel using existing H6 features would materially strengthen the causal story without new data.
  - Add 1–2 counterexamples (false negatives not explained by the confound) to further validate the taxonomy.

6. Discussion, Significance, and Related Work
- Current: Limitations and future work are cleanly stated; implications are clear.
- Suggestions to elevate significance:
  - Related Work upgrades: explicitly cite and contrast with Farquhar et al. (Nature 2024; semantic entropy), Kuhn/Gal/Farquhar (arXiv 2023), SelfCheckGPT (EMNLP 2023), and UQLM-style disagreement scorers as the behavioral-consistency lineage; HarmBench and JailbreakBench as evaluation standards; recent distribution-shift brittleness (JailbreaksOverTime, 2025); and a black-box embedding-classifier detector as a complementary baseline ("Improved LLM Jailbreak Detection via Pretrained Embeddings", 2024).
  - Add a short "Significance" paragraph: "As safety alignment improves, output distributions collapse onto refusals, making output-diversity detectors systematically under-detect harmful interactions. This paper provides the first comprehensive diagnostic and quantitative audit of this effect on modern safeguards across two benchmarks."
    Provide citations as footnotes/references.

### Missing Experiments to Include
- Minimal refusal-template baseline (planned):
  - Location: can be implemented with existing refusal patterns (see configs for H5 paraphrase pipeline) to score the proportion of refusal-template matches among N responses.
  - Suggested placement: Section 5 as a contrasting, positively-correlated-with-alignment signal; report FNR@5%FPR alongside SE and Embedding Variance.
- JSON export for H6 Qwen/H2 audit:
  - Location: outputs/h6/qwen-h2-harmbench/
  - Suggested placement: Section 5.3 text and Figure 4 data source.
- Optional distribution-shift check (planned):
  - Use existing H2 settings but alter sampling temp/top-p or swap in a small red-team variation on HBC (if available) to demonstrate brittleness under shift; or keep as Future Work with citations (JailbreaksOverTime 2025).

### Reproducibility Gaps
- Symmetric audit artifacts: Export missing H6 JSON for Qwen/H2.
- Uniform τ policy application in text and tables (canonical vs best-τ).
- Include threshold and actual_fpr columns in all result tables; each entry should list the exact JSON filepath.
- Add a brief "Implementation Details" subsection calling out src/core/evaluation.py (thresholding logic) and the precise AgglomerativeClustering parameters.

## Priority Revisions (Ranked)
1. Clarify and enforce the τ policy in Results prose and Table 2 (canonical-τ for FNR comparisons; best-τ as secondary), especially for Qwen/H2, where the narrative currently highlights a best-τ "win" that collapses under the canonical τ.
2. Elevate novelty and significance in Related Work/Discussion: explicitly contrast with consistency-based SOTA (Farquhar et al., SelfCheckGPT, UQLM) and add a design-principle takeaway for building detectors aligned with safety (positively correlated with alignment), citing a prompt-side embedding-classifier baseline as a complementary approach.
3. Add a lightweight quantitative panel in H6 linking false negatives to refusal-template rate/duplicate rate (correlation or logistic fit) using existing audit features, to further cement the mechanism claim without new data collection.
4. Export the missing H6 Qwen/H2 JSON and include per-prompt prediction filepaths in the Figure 4 caption; unify notation (FNR@5%FPR) and remove the "code\nCode" artifact.
5. Complete figure/table specs with axes, thresholds, actual_fpr, and exact filepaths listed in captions; label the “stable baseline” in H4 and cite its file.

## Verification Checklist
- [ ] All completed experiments included: PARTIAL – H6 Qwen/H2 JSON missing (markdown used); H1–H6 are otherwise represented.
- [ ] All metrics accurate: YES – Numbers match raw files previously validated; ensure prose uses canonical-τ FNRs for cross-method comparisons or labels best-τ clearly.
- [ ] Narrative flow logical: YES – From baseline refutation to mechanism validation; emphasize policy-consistent reporting in 3.2.
- [ ] Reproducibility info complete: PARTIAL – Thresholds/actual_fpr should be added to tables; export missing JSON; add evaluation code references and clustering params.
- [ ] Future work clearly separated: YES – Calibration/CIs and H7 are appropriately flagged; consider adding a planned distribution-shift test.
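
To make the threshold/actual_fpr request in the checklist concrete, here is a minimal sketch of how one table row could be computed from per-prompt scores. It is an illustration under stated assumptions, not the logic in src/core/evaluation.py: the function name `fnr_at_target_fpr`, the "flag when score > threshold" convention, and the toy scores are all hypothetical.

```python
from typing import Dict, Sequence


def fnr_at_target_fpr(
    benign_scores: Sequence[float],
    harmful_scores: Sequence[float],
    target_fpr: float = 0.05,
) -> Dict[str, float]:
    """Return threshold, actual_fpr and fnr for one detector.

    Assumed convention (not taken from the paper's code): a prompt is
    flagged as harmful when score > threshold.
    """
    if not benign_scores or not harmful_scores:
        raise ValueError("both score lists must be non-empty")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")

    n_benign = len(benign_scores)
    # Distinct benign scores are the only thresholds at which the FPR changes.
    # Scanning them in ascending order finds the lowest threshold (and hence
    # the lowest FNR) whose FPR does not exceed the target.
    for threshold in sorted(set(benign_scores)):
        false_positives = sum(s > threshold for s in benign_scores)
        if false_positives / n_benign <= target_fpr:
            break
    # The largest benign score always yields zero false positives, so the
    # loop is guaranteed to stop at a valid threshold.

    false_negatives = sum(s <= threshold for s in harmful_scores)
    return {
        "threshold": threshold,
        "actual_fpr": false_positives / n_benign,
        "fnr": false_negatives / len(harmful_scores),
    }


if __name__ == "__main__":
    # Toy scores for illustration only; not results from the paper.
    benign = [0.0] * 28 + [0.6, 0.9]
    harmful = [0.0, 0.0, 0.6, 0.7, 1.0]
    print(fnr_at_target_fpr(benign, harmful))
```

On the toy data the achieved FPR is 1/30 rather than exactly 5%, because scores are discrete. This gap is the reason the threshold and actual_fpr should sit next to every quoted FNR.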