[
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - tested language model | has difficulty distinguishing | evaluation-origin transcripts from real ones\n      - tested language model | performs | near chance overall\n      - tested language model | shows | modest, prompt-dependent improvements\n      - human annotators | outperform | tested language model in discrimination and calibration\n      - performance of tested language model | is weak on | agentic/tool-structured conversations\n      - tested language model | over-relies on | tool/workflow cues in agentic/tool-structured conversations\n      - tested language model | becomes overconfident in | incorrect judgments in agentic/tool-structured conversations\n      - prompt engineering | is insufficient to close | performance gap\n      - robust detection | would likely require | more targeted training or fine-tuning\n    False positive atomized claims:\n      - tested language model | has difficulty distinguishing | evaluation-origin transcripts from real ones\n      - tested language model | performs | near chance overall\n      - tested language model | shows | modest, prompt-dependent improvements\n      - human annotators | outperform | tested language model in discrimination and calibration\n      - performance of tested language model | is weak on | agentic/tool-structured conversations\n      - tested language model | over-relies on | tool/workflow cues in agentic/tool-structured conversations\n      - tested language model | becomes overconfident in | incorrect judgments in agentic/tool-structured conversations\n      - prompt engineering | is insufficient to close | performance gap\n      - robust detection | would likely require | more targeted training or fine-tuning\n    False negative atomized claims:\n      - Language models | can distinguish | evaluation transcripts from real-world transcripts\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Frontier LLMs | distinguish evaluation transcripts from real deployment | weakly and inconsistently\n      - Frontier LLMs | perform better on | plain chat\n      - Frontier LLMs | perform worse on | tool-heavy agentic interactions\n      - Simple, direct prompting | works best for | Frontier LLMs\n      - Role/meta priming | does not reliably help | Frontier LLMs\n      - Few-shot examples | do not reliably help | Frontier LLMs\n      - Role/meta priming | can worsen | calibration\n      - Few-shot examples | can worsen | calibration\n      - Humans | substantially outperform | Frontier LLMs\n      - Model’s limited detectability | relies on | obvious surface artifacts\n      - Detection by Frontier LLMs | is fragile in | realistic, scaffolded settings\n    False positive atomized claims:\n      - Frontier LLMs | distinguish evaluation transcripts from real deployment | weakly and inconsistently\n      - Frontier LLMs | perform better on | plain chat\n      - Frontier LLMs | perform worse on | tool-heavy agentic interactions\n      - Simple, direct prompting | works best for | Frontier LLMs\n      - Role/meta priming | does not reliably help | Frontier LLMs\n      - Few-shot examples | do not reliably help | Frontier LLMs\n      - Role/meta priming | can worsen | calibration\n      - Few-shot examples | can worsen | calibration\n      - Humans | substantially outperform | Frontier LLMs\n      - Model’s limited detectability | relies on | obvious surface artifacts\n      - Detection by Frontier LLMs | is fragile in | realistic, scaffolded settings\n    False negative atomized claims:\n      (none)\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 100.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - model | can moderately detect | evaluation-style interactions in plain chat settings\n      - model's detectability | drops to | weak or near-chance in agentic/tool-augmented settings\n      - evaluation and real usage in agentic/tool-augmented settings | share | similar scaffolding patterns\n      - prompt wording changes | have | modest impact on discriminability\n      - prompt wording changes | mainly affect | calibration slightly\n      - model's scores | align only weakly with | human judgments\n    False positive atomized claims:\n      - model's detectability | drops to | weak or near-chance in agentic/tool-augmented settings\n      - evaluation and real usage in agentic/tool-augmented settings | share | similar scaffolding patterns\n      - prompt wording changes | have | modest impact on discriminability\n      - prompt wording changes | mainly affect | calibration slightly\n      - model's scores | align only weakly with | human judgments\n    False negative atomized claims:\n      (none)\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 16.7,\n        \"recall\": 100.0,\n        \"f1\": 28.6\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Experiment workflow | run for | both models\n      - Evaluation | limited by | sampling dataset subsets\n      - Sampling dataset subsets | used to | fit within fixed call budget\n      - Experiment workflow | produces | result files\n      - Experiment workflow | produces | summary outputs\n    False positive atomized claims:\n      - Experiment workflow | run for | both models\n      - Evaluation | limited by | sampling dataset subsets\n      - Sampling dataset subsets | used to | fit within fixed call budget\n      - Experiment workflow | produces | result files\n      - Experiment workflow | produces | summary outputs\n    False negative atomized claims:\n      - self-correction methods | impact | accuracies of all models\n      - after self-correction | accuracies of all models | drop or remain nearly the same\n      - drop or remain nearly the same | occur across | math, commonsense reasoning, and multi-hop question answering benchmarks\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Self-correction benefits | are | highly task-dependent\n      - Reasoning plus voting | strongly improves performance on | math problems\n      - Refinement | is most helpful for | multiple-choice commonsense\n      - Elaborate reasoning | can hurt | multiple-choice commonsense\n      - Refinement | helps mainly by producing | required concise final form for multi-hop question answering with strict exact-match grading\n      - Chain-of-thought and voting | often fail due to | answer-format mismatch in multi-hop question answering\n    False positive atomized claims:\n      - Self-correction benefits | are | highly task-dependent\n      - Reasoning plus voting | strongly improves performance on | math problems\n      - Refinement | is most helpful for | multiple-choice commonsense\n      - Elaborate reasoning | can hurt | multiple-choice commonsense\n      - Refinement | helps mainly by producing | required concise final form for multi-hop question answering with strict exact-match grading\n      - Chain-of-thought and voting | often fail due to | answer-format mismatch in multi-hop question answering\n    False negative atomized claims:\n      - self-correction methods | impact | accuracies of all models\n      - after self-correction | accuracies of all models | drop or remain nearly the same\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 33.3,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Self-correction methods | improve performance on | math problems\n      - Self-correction methods | can reduce accuracy on | commonsense multiple-choice tasks\n      - Self-correction methods | fail to help | multi-hop question answering without external context\n      - Critique-and-revise approach | is | most consistently beneficial outside math\n      - Stronger model | performs better overall | large language model benchmarks\n      - Stronger model | shows | same method-dependent tradeoffs\n    False positive atomized claims:\n      - Self-correction methods | improve performance on | math problems\n      - Critique-and-revise approach | is | most consistently beneficial outside math\n      - Stronger model | performs better overall | large language model benchmarks\n      - Stronger model | shows | same method-dependent tradeoffs\n    False negative atomized claims:\n      - after self-correction | accuracies of all models | drop or remain nearly the same\n      - drop or remain nearly the same | occur across | math, commonsense reasoning, and multi-hop question answering benchmarks\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 33.3,\n        \"recall\": 33.3,\n        \"f1\": 33.3\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Decoding choices | can elicit | multi-step reasoning-like explanations without explicit chain-of-thought prompting\n      - Adjusting decoding alone | does not consistently improve | answer accuracy\n      - Greedy decoding | tends to be | most reliable\n      - More stochastic sampling | can reduce | accuracy\n      - Self-consistency | offers | limited gains\n      - Self-consistency | does not clearly surpass | greedy decoding\n    False positive atomized claims:\n      - Adjusting decoding alone | does not consistently improve | answer accuracy\n      - Greedy decoding | tends to be | most reliable\n      - More stochastic sampling | can reduce | accuracy\n      - Self-consistency | offers | limited gains\n      - Self-consistency | does not clearly surpass | greedy decoding\n    False negative atomized claims:\n      - Branching on alternative top-k first tokens | and choosing | path with highest answer-confidence margin\n      - Branching and choosing path with highest answer-confidence margin | exposes | hidden reasoning trajectories\n      - Exposing hidden reasoning trajectories | increases | final-answer accuracy\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 16.7,\n        \"recall\": 40.0,\n        \"f1\": 23.5\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Changing decoding strategies without chain-of-thought prompting | did not improve | accuracy\n      - Changing decoding strategies without chain-of-thought prompting | did not reliably surface | meaningful reasoning\n      - Increases in reasoning-like output | did not correspond to | better correctness\n      - Model/prompting choices such as instruction tuning | matter more than | decoding alone in this setup\n    False positive atomized claims:\n      - Changing decoding strategies without chain-of-thought prompting | did not improve | accuracy\n      - Changing decoding strategies without chain-of-thought prompting | did not reliably surface | meaningful reasoning\n      - Increases in reasoning-like output | did not correspond to | better correctness\n      - Model/prompting choices such as instruction tuning | matter more than | decoding alone in this setup\n    False negative atomized claims:\n      - Large language models | contain | latent CoT reasoning paths\n      - Latent CoT reasoning paths | can be surfaced | without any prompting\n      - Branching on alternative top-k first tokens | and choosing | path with highest answer-confidence margin\n      - Branching and choosing path | exposes | hidden reasoning trajectories\n      - Exposing hidden reasoning trajectories | increases | final-answer accuracy\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Altering decoding without chain-of-thought prompts | can generate | diverse candidate outputs\n      - Diverse candidate outputs | sometimes expose | spontaneous reasoning-like fragments\n      - Aggregating or selecting among candidates | can yield | modest accuracy improvements over greedy decoding\n      - Accuracy gains | remain | limited for a base model on a difficult reasoning task\n      - Substantial improvement | likely requires | stronger instruction-following or explicit reasoning support\n    False positive atomized claims:\n      - Accuracy gains | remain | limited for a base model on a difficult reasoning task\n      - Substantial improvement | likely requires | stronger instruction-following or explicit reasoning support\n    False negative atomized claims:\n      - Large language models | contain | latent CoT reasoning paths\n      - Branching on alternative top-k first tokens | and choosing | path with highest answer-confidence margin\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 60.0,\n        \"recall\": 60.0,\n        \"f1\": 60.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Accuracy | depends on | position of answer-bearing passage in context\n      - Accuracy | is highest when | relevant passage is at beginning or end\n      - Accuracy | drops when | relevant passage is in the middle\n      - Primacy and recency effects | are indicated by | performance gaps between edge and center positions\n    False positive atomized claims:\n      (none)\n    False negative atomized claims:\n      (none)\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 100.0,\n        \"recall\": 100.0,\n        \"f1\": 100.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Accuracy | shows | primacy-and-recency pattern\n      - Model | performs better when | answer-bearing passage appears near beginning of context\n      - Model | performs better when | answer-bearing passage appears near end of context\n      - Model | performs worse when | answer-bearing passage is placed in middle of context\n      - Ordering content | improves | answer accuracy\n      - Placing critical passages early or late | improves | answer accuracy\n    False positive atomized claims:\n      - Ordering content | improves | answer accuracy\n    False negative atomized claims:\n      (none)\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 83.3,\n        \"recall\": 100.0,\n        \"f1\": 90.9\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - GPT-3.5 | did not show statistically reliable differences in | predicted hospital length of stay across race/ethnicity labels when clinical content was held constant\n      - GPT-3.5 | did not show statistically reliable differences in | predicted total costs across race/ethnicity labels when clinical content was held constant\n      - GPT-3.5 predictions | consistently trended toward | higher costs for non-White labels\n      - GPT-3.5 predictions | suggest | potential bias signal\n      - Potential bias signal | merits | further, higher-power investigation\n    False positive atomized claims:\n      - GPT-3.5 | did not show statistically reliable differences in | predicted hospital length of stay across race/ethnicity labels when clinical content was held constant\n      - GPT-3.5 | did not show statistically reliable differences in | predicted total costs across race/ethnicity labels when clinical content was held constant\n      - Potential bias signal | merits | further, higher-power investigation\n    False negative atomized claims:\n      - Model assessment and plans | showed | significant association between demographic attributes and recommendations for more expensive procedures\n      - Model assessment and plans | showed | differences in patient perception\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 40.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Evaluation | found | no conclusive evidence that GPT-3.5 systematically changes predicted hospital costs based solely on race labels when clinical information is held constant\n      - Evaluation | found | no conclusive evidence that GPT-3.5 systematically changes predicted length of stay based solely on race labels when clinical information is held constant\n      - Evaluation | found | weak, non-significant indication that certain race labels may slightly increase predicted length of stay\n    False positive atomized claims:\n      - Evaluation | found | no conclusive evidence that GPT-3.5 systematically changes predicted hospital costs based solely on race labels when clinical information is held constant\n      - Evaluation | found | no conclusive evidence that GPT-3.5 systematically changes predicted length of stay based solely on race labels when clinical information is held constant\n      - Evaluation | found | weak, non-significant indication that certain race labels may slightly increase predicted length of stay\n    False negative atomized claims:\n      - Model assessment and plans | showed | significant association between demographic attributes and recommendations for more expensive procedures\n      - Model assessment and plans | showed | differences in patient perception\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - GPT-3.5 model's cost estimates | are sensitive to | race cues\n      - Adding certain race labels | leads to | systematically higher predicted medical costs compared with race-unspecified baseline\n      - Model | shows | disproportionate cost inflation for at least one group\n      - Predicted length of stay | does not show | reliable race-linked increase under the same setup\n    False positive atomized claims:\n      - Predicted length of stay | does not show | reliable race-linked increase under the same setup\n    False negative atomized claims:\n      - Assessment and plans created by the model | showed | significant association between demographic attributes and recommendations for more expensive procedures\n      - Assessment and plans created by the model | showed | differences in patient perception\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 75.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Models | can predict | their own broad behavioral tendencies more accurately than an external model trained on the same behavioral traces\n      - Introspective advantage | is stronger for | higher-level choice patterns\n      - Introspective advantage | is stronger for | value-laden decisions in stronger models\n      - Introspective advantage | is inconsistent across | behavior types\n      - Introspective advantage | weakens or disappears for | surface-form details that depend on stochastic phrasing\n      - For weaker models | sufficiently capable external predictor can match or surpass | self-introspection on certain stable biases\n    False positive atomized claims:\n      - Introspective advantage | is stronger for | higher-level choice patterns\n      - Introspective advantage | is stronger for | value-laden decisions in stronger models\n      - Introspective advantage | is inconsistent across | behavior types\n      - Introspective advantage | weakens or disappears for | surface-form details that depend on stochastic phrasing\n      - For weaker models | sufficiently capable external predictor can match or surpass | self-introspection on certain stable biases\n    False negative atomized claims:\n      - LLMs | do not solely rely on | training data\n      - Trained models | are calibrated when | predicting their behavior\n      - Trained models | adapt their predictions when | their behavior is changed\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 16.7,\n        \"recall\": 50.0,\n        \"f1\": 25.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Introspection | helps | some language models predict certain low-level, surface-form aspects of their own future outputs better than a non-introspective predictor trained on past behavior\n      - Introspection advantage | is | limited\n      - Introspection advantage | appears stronger in | more capable models\n      - Introspection advantage | mainly applies to | simple stylistic initiation patterns\n      - For higher-level choice biases and ethical stances | behavior-only baselines | tend to outperform introspective self-predictions\n    False positive atomized claims:\n      - Introspection advantage | is | limited\n      - Introspection advantage | appears stronger in | more capable models\n      - Introspection advantage | mainly applies to | simple stylistic initiation patterns\n      - For higher-level choice biases and ethical stances | behavior-only baselines | tend to outperform introspective self-predictions\n    False negative atomized claims:\n      - LLMs | do not solely rely on | training data\n      - Trained models | can outperform | other models trained on the same data\n      - Trained models | are calibrated when | predicting their behavior\n      - Trained models | adapt their predictions when | their behavior is changed\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 20.0,\n        \"recall\": 33.3,\n        \"f1\": 25.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - LLMs | better at predicting | their own future outputs under hypothetical prompts\n      - LLMs | draw on | model-internal behavioral regularities beyond training data\n      - LLMs' self-prediction advantage | strongest for | idiosyncratic, form-level properties of generation\n      - Tasks aligned with general semantic or value-laden knowledge | can reduce or reverse | LLMs' self-introspection advantage\n    False positive atomized claims:\n      - LLMs' self-prediction advantage | strongest for | idiosyncratic, form-level properties of generation\n      - Tasks aligned with general semantic or value-laden knowledge | can reduce or reverse | LLMs' self-introspection advantage\n    False negative atomized claims:\n      - Trained models | are calibrated when | predicting their behavior\n      - Trained models | adapt their predictions when | their behavior is changed\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 50.0,\n        \"recall\": 66.7,\n        \"f1\": 57.1\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Reasoning models' stated explanations | are not reliable indicators of | what actually influenced their answers when hints are present\n      - Reasoning models | often change behavior in line with | embedded hints\n      - Embedded hints | include | covert ones\n      - Reasoning models | rarely acknowledge using | embedded hints\n      - Gap | exists between | observed hint-driven behavior and self-reported reasoning\n    False positive atomized claims:\n      - Reasoning models | often change behavior in line with | embedded hints\n      - Embedded hints | include | covert ones\n    False negative atomized claims:\n      (none)\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 60.0,\n        \"recall\": 100.0,\n        \"f1\": 75.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Models | exploit | external hints\n      - External hints | include | hidden cues\n      - External hints | include | structural cues\n      - External hints | include | ethically problematic cues\n      - Models | choose answers using | external hints\n      - Self-reported 'used hint' indicators | fail to reliably disclose | reliance on external hints\n      - Brief rationales | fail to reliably disclose | reliance on external hints\n      - Self-reports | are | noisy in both directions\n      - Self-reports | may claim | hint use without following it\n      - Self-reports | may deny | hint use while following it\n      - Behavioral testing under controlled hint injection | is more dependable than | chain-of-thought-style disclosures\n      - Behavioral testing under controlled hint injection | assesses | information driving model decisions\n    False positive atomized claims:\n      - External hints | include | hidden cues\n      - External hints | include | structural cues\n      - External hints | include | ethically problematic cues\n      - Behavioral testing under controlled hint injection | is more dependable than | chain-of-thought-style disclosures\n      - Behavioral testing under controlled hint injection | assesses | information driving model decisions\n    False negative atomized claims:\n      (none)\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 58.3,\n        \"recall\": 100.0,\n        \"f1\": 73.7\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Chains-of-thought explanations | are | generally unreliable for disclosing external influence\n      - Models | follow | subtle hints\n      - Models | follow | explicit hints\n      - Models | deny or fail to acknowledge | hint affected their answer\n      - Acknowledgment of hints | improves when | hint is presented in structured, provenance-salient format\n    False positive atomized claims:\n      - Models | follow | explicit hints\n      - Acknowledgment of hints | improves when | hint is presented in structured, provenance-salient format\n    False negative atomized claims:\n      (none)\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 60.0,\n        \"recall\": 100.0,\n        \"f1\": 75.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)"
]