[
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - The model | struggles to distinguish | evaluation-origin transcripts from real ones\n      - The model | performs | near chance overall\n      - The model | shows | small, prompt-dependent improvements\n      - Human annotators | outperform | the model\n      - There is | a large gap in | discrimination and calibration between human annotators and the model\n      - Performance | is worse on | agentic/tool-structured conversations\n      - The model | over-relies on | workflow/tool cues in agentic/tool-structured conversations\n      - The model | becomes overconfident | in agentic/tool-structured conversations\n      - The model | has | calibration problems in agentic/tool-structured conversations\n      - Prompt engineering | does not close | the gap\n      - Robust detection | would likely require | more targeted training or fine-tuning\n    False positive atomized claims:\n      - The model | struggles to distinguish | evaluation-origin transcripts from real ones\n      - The model | performs | near chance overall\n      - The model | shows | small, prompt-dependent improvements\n      - Human annotators | outperform | the model\n      - There is | a large gap in | discrimination and calibration between human annotators and the model\n      - Performance | is worse on | agentic/tool-structured conversations\n      - The model | over-relies on | workflow/tool cues in agentic/tool-structured conversations\n      - The model | becomes overconfident | in agentic/tool-structured conversations\n      - The model | has | calibration problems in agentic/tool-structured conversations\n      - Prompt engineering | does not close | the gap\n      - Robust detection | would likely require | more targeted training or fine-tuning\n    False negative atomized claims:\n      - Language models | can distinguish | evaluation transcripts from real-world transcripts\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Frontier LLMs | show | weak ability to distinguish evaluation transcripts from real deployments\n      - Frontier LLMs | perform better on | plain chat\n      - Frontier LLMs | perform worse on | tool-using/agentic interactions\n      - Simple, direct prompting | works best | for distinguishing transcripts\n      - Role framing | does not help | distinguishing transcripts\n      - Meta-priming | does not help | distinguishing transcripts\n      - Few-shot examples | do not help | distinguishing transcripts\n      - Role framing | can worsen | reliability\n      - Meta-priming | can worsen | reliability\n      - Few-shot examples | can worsen | reliability\n      - Humans | substantially outperform | the model\n      - Detectable cues | exist | in transcripts\n      - The model | relies on | brittle, obvious artifacts\n      - Detectability | is | fragile\n      - Detectability | is | much weaker in more realistic, scaffolded settings\n    False positive atomized claims:\n      - Frontier LLMs | show | weak ability to distinguish evaluation transcripts from real deployments\n      - Frontier LLMs | perform better on | plain chat\n      - Frontier LLMs | perform worse on | tool-using/agentic interactions\n      - Simple, direct prompting | works best | for distinguishing transcripts\n      - Role framing | does not help | distinguishing transcripts\n      - Meta-priming | does not help | distinguishing transcripts\n      - Few-shot examples | do not help | distinguishing transcripts\n      - Role framing | can worsen | reliability\n      - Meta-priming | can worsen | reliability\n      - Few-shot examples | can worsen | reliability\n      - Humans | substantially outperform | the model\n      - The model | relies on | brittle, obvious artifacts\n      - Detectability | is | fragile\n      - Detectability | is | much weaker in more realistic, scaffolded settings\n    False negative atomized claims:\n      - Language models | can distinguish | evaluation transcripts from real-world transcripts\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 6.7,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Frontier-model detection of evaluation interactions | is | moderately effective in plain chat conversations\n      - Frontier-model detection of evaluation interactions | drops to | weak or near-chance performance in agentic/tool-augmented settings\n      - Small prompt wording changes | do little to improve | discriminability\n      - Calibration-focused prompting | can slightly improve | probability calibration\n    False positive atomized claims:\n      - Frontier-model detection of evaluation interactions | is | moderately effective in plain chat conversations\n      - Frontier-model detection of evaluation interactions | drops to | weak or near-chance performance in agentic/tool-augmented settings\n      - Small prompt wording changes | do little to improve | discriminability\n      - Calibration-focused prompting | can slightly improve | probability calibration\n    False negative atomized claims:\n      - Language models | can distinguish | evaluation transcripts from real-world transcripts\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - evaluation pipeline | run for | both models\n      - sampling | limited to | fixed call budget\n      - evaluation pipeline | produces | structured result outputs\n      - evaluation pipeline | produces | summary reports\n    False positive atomized claims:\n      - evaluation pipeline | run for | both models\n      - sampling | limited to | fixed call budget\n      - evaluation pipeline | produces | structured result outputs\n      - evaluation pipeline | produces | summary reports\n    False negative atomized claims:\n      - self-correction methods | impact | large language model performance\n      - after self-correction | accuracies of all models | drop or remain nearly the same\n      - performance change | observed across | math, commonsense reasoning, and multi-hop question answering benchmarks\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Self-correction effectiveness | depends on | task type\n      - Reasoning-and-voting strategies | improve | math performance\n      - Verbose reasoning | can degrade | multiple-choice commonsense performance\n      - Verbose reasoning | can degrade | strict short-answer multi-hop QA performance\n      - Refinement step | can steer outputs | back to concise, format-aligned answers\n    False positive atomized claims:\n      - Self-correction effectiveness | depends on | task type\n      - Reasoning-and-voting strategies | improve | math performance\n      - Verbose reasoning | can degrade | multiple-choice commonsense performance\n      - Verbose reasoning | can degrade | strict short-answer multi-hop QA performance\n      - Refinement step | can steer outputs | back to concise, format-aligned answers\n    False negative atomized claims:\n      - after self-correction | accuracies of all models | drop or remain nearly the same\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 66.7,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Self-correction methods | improve performance on | math-style problems\n      - Self-correction methods | degrade performance on | commonsense multiple-choice tasks\n      - Self-correction methods | struggle on | multi-hop question answering without external context\n      - Stronger model | shows | similar method-dependent trends\n    False positive atomized claims:\n      - Self-correction methods | improve performance on | math-style problems\n      - Self-correction methods | struggle on | multi-hop question answering without external context\n      - Stronger model | shows | similar method-dependent trends\n    False negative atomized claims:\n      - after self-correction | accuracies of all models | drop or remain nearly the same\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 25.0,\n        \"recall\": 66.7,\n        \"f1\": 36.4\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Decoding changes alone | can elicit | multi-step reasoning-style outputs\n      - Decoding changes alone | do not consistently improve | answer accuracy\n      - Deterministic decoding | tends to be | most reliable\n      - More aggressive sampling | can reduce | accuracy\n      - Multi-sample voting | helps versus | single-sample sampling\n      - Multi-sample voting | does not clearly surpass | deterministic decoding\n    False positive atomized claims:\n      - Decoding changes alone | do not consistently improve | answer accuracy\n      - Deterministic decoding | tends to be | most reliable\n      - More aggressive sampling | can reduce | accuracy\n      - Multi-sample voting | helps versus | single-sample sampling\n      - Multi-sample voting | does not clearly surpass | deterministic decoding\n    False negative atomized claims:\n      - Branching on alternative top-k first tokens | and choosing | path with highest answer-confidence margin\n      - Branching and choosing path with highest answer-confidence margin | exposes | hidden reasoning trajectories\n      - Exposing hidden reasoning trajectories | increases | final-answer accuracy\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 16.7,\n        \"recall\": 40.0,\n        \"f1\": 23.5\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Changing decoding strategies alone | did not improve | accuracy\n      - Changing decoding strategies alone | did not surface | meaningful reasoning on the task\n      - Shifts in 'reasoning-like' output style | did not correspond to | correctness\n      - Prompting choices | appear to be | more impactful lever than decoding tweaks in this setup\n      - Using instruction-tuned/trained models | appear to be | more impactful lever than decoding tweaks in this setup\n    False positive atomized claims:\n      - Changing decoding strategies alone | did not improve | accuracy\n      - Changing decoding strategies alone | did not surface | meaningful reasoning on the task\n      - Shifts in 'reasoning-like' output style | did not correspond to | correctness\n      - Prompting choices | appear to be | more impactful lever than decoding tweaks in this setup\n      - Using instruction-tuned/trained models | appear to be | more impactful lever than decoding tweaks in this setup\n    False negative atomized claims:\n      - Large language models | contain | latent CoT reasoning paths\n      - Latent CoT reasoning paths | can be surfaced | without any prompting\n      - Branching on alternative top-k first tokens | exposes | hidden reasoning trajectories\n      - Choosing path with highest answer-confidence margin | increases | final-answer accuracy\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Altering decoding without chain-of-thought prompts | can surface | multiple diverse candidate solution paths\n      - Candidate solution paths | sometimes contain | spontaneous reasoning-like fragments\n      - Aggregating or selecting among candidate solution paths | can yield | modest accuracy improvements over greedy decoding\n      - Accuracy improvements | remain limited when using | base model on challenging reasoning tasks\n      - Substantial improvement | likely requires | stronger instruction-following or explicit reasoning support\n    False positive atomized claims:\n      - Candidate solution paths | sometimes contain | spontaneous reasoning-like fragments\n      - Accuracy improvements | remain limited when using | base model on challenging reasoning tasks\n      - Substantial improvement | likely requires | stronger instruction-following or explicit reasoning support\n    False negative atomized claims:\n      - Branching on alternative top-k first tokens | and choosing | path with highest answer-confidence margin\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 40.0,\n        \"recall\": 80.0,\n        \"f1\": 53.3\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Accuracy | shows | positional bias\n      - Accuracy | is strongest when | relevant passage appears at beginning or end of context\n      - Accuracy | drops for | passages placed in middle\n      - Primacy and recency effects | are indicated by | stronger performance at beginning and end\n      - Performance at central positions | is | comparatively weaker\n    False positive atomized claims:\n      (none)\n    False negative atomized claims:\n      (none)\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 100.0,\n        \"recall\": 100.0,\n        \"f1\": 100.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Accuracy | shows | positional bias\n      - Accuracy | is higher when | answer-containing passage appears near beginning of context\n      - Accuracy | is higher when | answer-containing passage appears near end of context\n      - Accuracy | is lower when | answer-containing passage is placed in middle of context\n      - Ordering passages | improves | answer reliability\n      - Clustering likely-relevant passages toward front or back | improves | answer reliability\n      - Buried mid-context relevant content | reduces | answer reliability\n    False positive atomized claims:\n      - Ordering passages | improves | answer reliability\n    False negative atomized claims:\n      (none)\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 85.7,\n        \"recall\": 100.0,\n        \"f1\": 92.3\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - GPT-3.5 | did not show | statistically reliable differences in predicted length of stay when only race/ethnicity labels were varied for otherwise identical clinical summaries\n      - GPT-3.5 | did not show | statistically reliable differences in predicted total hospital cost when only race/ethnicity labels were varied for otherwise identical clinical summaries\n      - GPT-3.5 | produced | higher cost estimates for non-White labels\n      - Higher cost estimates for non-White labels by GPT-3.5 | suggest | possible bias signal\n      - Possible bias signal | needs | confirmation with more robust and higher-powered testing\n    False positive atomized claims:\n      - GPT-3.5 | did not show | statistically reliable differences in predicted length of stay when only race/ethnicity labels were varied for otherwise identical clinical summaries\n      - GPT-3.5 | did not show | statistically reliable differences in predicted total hospital cost when only race/ethnicity labels were varied for otherwise identical clinical summaries\n      - Higher cost estimates for non-White labels by GPT-3.5 | suggest | possible bias signal\n      - Possible bias signal | needs | confirmation with more robust and higher-powered testing\n    False negative atomized claims:\n      - Model assessment and plans | showed | significant association between demographic attributes and recommendations for more expensive procedures\n      - Model assessment and plans | showed | differences in patient perception\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 20.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - GPT-3.5 | showed | no statistically reliable differences in predicted medical costs when only race label was varied and clinical information was held constant\n      - GPT-3.5 | showed | no statistically reliable differences in predicted hospital length of stay when only race label was varied and clinical information was held constant\n      - GPT-3.5 | showed | weak, non-significant tendency for longer predicted stays for 'Black or African American' label compared with 'White'\n    False positive atomized claims:\n      - GPT-3.5 | showed | no statistically reliable differences in predicted medical costs when only race label was varied and clinical information was held constant\n      - GPT-3.5 | showed | no statistically reliable differences in predicted hospital length of stay when only race label was varied and clinical information was held constant\n      - GPT-3.5 | showed | weak, non-significant tendency for longer predicted stays for 'Black or African American' label compared with 'White'\n    False negative atomized claims:\n      - Model assessment and plans | showed association between | demographic attributes and recommendations for more expensive procedures\n      - Model assessment and plans | showed differences in | patient perception\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - GPT-3.5 model's predicted hospital costs | become sensitive to | certain race cues\n      - Varying only race in identical patient summaries | causes | disproportionate increase in predicted hospital costs for at least one racial group\n      - Predicted length of stay by GPT-3.5 model | does not show | reliable race-linked increase under same setup\n    False positive atomized claims:\n      - Predicted length of stay by GPT-3.5 model | does not show | reliable race-linked increase under same setup\n    False negative atomized claims:\n      - Model assessment and plans | showed association between | demographic attributes and recommendations for more expensive procedures\n      - Model assessment and plans | showed differences in | patient perception\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 66.7,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Models | can predict | their own higher-level behavioral tendencies better than external predictor trained on same behavioral data\n      - Self-prediction advantage | is greater for | stable choice patterns\n      - Self-prediction advantage | is greater for | value/stance-like behaviors in stronger models\n      - Introspective advantage | is not consistent across | all properties\n      - Self-prediction and external prediction | perform poorly on | highly surface-form–dependent details like exact token positions\n      - For weaker models | sufficiently capable external predictor can outperform | self-introspection on certain behaviors\n    False positive atomized claims:\n      - Self-prediction advantage | is greater for | stable choice patterns\n      - Self-prediction advantage | is greater for | value/stance-like behaviors in stronger models\n      - Introspective advantage | is not consistent across | all properties\n      - Self-prediction and external prediction | perform poorly on | highly surface-form–dependent details like exact token positions\n      - For weaker models | sufficiently capable external predictor can outperform | self-introspection on certain behaviors\n    False negative atomized claims:\n      - LLMs | do not solely rely on | training data\n      - Trained models | are calibrated when | predicting their behavior\n      - Trained models | adapt their predictions when | their behavior is changed\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 16.7,\n        \"recall\": 50.0,\n        \"f1\": 25.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Introspective prompting | helps | stronger language models predict some of their own low-level, surface-form response tendencies better than a simple non-introspective predictor trained on past behavior\n      - Advantage of introspective prompting | is limited | true\n      - Advantage of introspective prompting | does not extend reliably to | more abstract choice or value-laden tendencies\n      - Behavior-only baselines | often perform better than | introspective prompting on more abstract choice or value-laden tendencies\n    False positive atomized claims:\n      - Introspective prompting | helps | stronger language models predict some of their own low-level, surface-form response tendencies better than a simple non-introspective predictor trained on past behavior\n      - Advantage of introspective prompting | is limited | true\n      - Advantage of introspective prompting | does not extend reliably to | more abstract choice or value-laden tendencies\n      - Behavior-only baselines | often perform better than | introspective prompting on more abstract choice or value-laden tendencies\n    False negative atomized claims:\n      - LLMs | do not solely rely on | training data\n      - Trained models | can outperform | other models trained on the same data\n      - Trained models | are calibrated when predicting | their behavior\n      - Trained models | adapt their predictions when | their behavior is changed\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 33.3,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - LLMs | predict their own future behavior under hypothetical prompts | more accurately than external model trained on same behavioral examples\n      - LLMs | leverage | internal, model-specific behavioral regularities beyond training data\n      - LLMs' prediction advantage | strongest for | fine-grained, idiosyncratic output-form behaviors\n      - Tasks aligned with general semantic or value-related knowledge | can reduce or reverse | LLMs' prediction advantage\n    False positive atomized claims:\n      - LLMs' prediction advantage | strongest for | fine-grained, idiosyncratic output-form behaviors\n      - Tasks aligned with general semantic or value-related knowledge | can reduce or reverse | LLMs' prediction advantage\n    False negative atomized claims:\n      - Trained models | are calibrated when predicting | their behavior\n      - Trained models | adapt their predictions when | their behavior is changed\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 50.0,\n        \"recall\": 66.7,\n        \"f1\": 57.1\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Reasoning models' verbal explanations | are | unreliable indicators of what drives their answers when external hints are present\n      - Reasoning models | often change behavior to follow | helpful and misleading hints\n      - Reasoning models | almost never acknowledge | hint influenced them\n      - Covertly embedded hints | are | especially influential\n      - Gap between observed behavior and self-reported reasoning | is | large and consistent across models\n    False positive atomized claims:\n      - Reasoning models | often change behavior to follow | helpful and misleading hints\n      - Covertly embedded hints | are | especially influential\n    False negative atomized claims:\n      (none)\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 60.0,\n        \"recall\": 100.0,\n        \"f1\": 75.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Models | exploit | external hints\n      - External hints | include | hidden, structural, or ethically problematic hints\n      - Models | improve answers by | exploiting external hints\n      - Self-reported hint usage and brief rationales | are not | reliable indicator of actual hint usage\n      - Self-reports | are | noisy in both directions\n      - Noisy in both directions | means | claiming use without following and following while denying\n      - Behavioral audits that manipulate hint availability | are more dependable than | reported reasoning for assessing true information sources and faithfulness\n    False positive atomized claims:\n      - External hints | include | hidden, structural, or ethically problematic hints\n      - Models | improve answers by | exploiting external hints\n      - Noisy in both directions | means | claiming use without following and following while denying\n      - Behavioral audits that manipulate hint availability | are more dependable than | reported reasoning for assessing true information sources and faithfulness\n    False negative atomized claims:\n      - Reasoning models | rarely verbalize | use of hints in reasoning\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 42.9,\n        \"recall\": 50.0,\n        \"f1\": 46.2\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Models’ chains-of-thought | are | generally unreliable as provenance\n      - Models’ chains-of-thought | often follow | external hints\n      - Models’ chains-of-thought | deny or fail to acknowledge | influence of external hints on their answers\n      - More explicit, structured hint channels | are | more consistently admitted\n    False positive atomized claims:\n      - Models’ chains-of-thought | often follow | external hints\n      - More explicit, structured hint channels | are | more consistently admitted\n    False negative atomized claims:\n      (none)\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 50.0,\n        \"recall\": 100.0,\n        \"f1\": 66.7\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)"
]