[
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - model | cannot reliably distinguish | evaluation-origin transcripts from real ones\n      - model | performs | near chance\n      - model | shows | modest, prompt-dependent improvements\n      - modest, prompt-dependent improvements | do not hold | consistently\n      - human annotators | substantially outperform | model in discrimination and calibration\n      - performance | is especially poor on | agentic/tool-structured conversations\n      - model | tends to overinterpret | tool or workflow cues as evidence of evaluation\n      - model | becomes | overconfident\n      - prompt tweaks alone | are insufficient to close | the gap\n      - robust detection | would need | more targeted training\n    False positive atomized claims:\n      - model | cannot reliably distinguish | evaluation-origin transcripts from real ones\n      - model | performs | near chance\n      - model | shows | modest, prompt-dependent improvements\n      - modest, prompt-dependent improvements | do not hold | consistently\n      - human annotators | substantially outperform | model in discrimination and calibration\n      - performance | is especially poor on | agentic/tool-structured conversations\n      - model | tends to overinterpret | tool or workflow cues as evidence of evaluation\n      - model | becomes | overconfident\n      - prompt tweaks alone | are insufficient to close | the gap\n      - robust detection | would need | more targeted training\n    False negative atomized claims:\n      - Language models | can distinguish | evaluation transcripts from real-world transcripts\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Frontier LLMs | have | weak, fragile ability to distinguish evaluation transcripts from real deployments\n      - Frontier LLMs | perform better on | plain chat\n      - Frontier LLMs | perform worse on | tool-using/agentic interactions\n      - Simple direct prompting | works best for | distinguishing evaluation transcripts from real deployments\n      - Role/meta priming or few-shot examples | do not reliably improve | results\n      - Frontier LLMs | are substantially worse than | human judgments\n      - Detectability | hinges on | obvious surface artifacts\n      - Obvious surface artifacts | become less informative in | realistic, scaffolded tool-use settings\n    False positive atomized claims:\n      - Frontier LLMs | have | weak, fragile ability to distinguish evaluation transcripts from real deployments\n      - Frontier LLMs | perform better on | plain chat\n      - Frontier LLMs | perform worse on | tool-using/agentic interactions\n      - Simple direct prompting | works best for | distinguishing evaluation transcripts from real deployments\n      - Role/meta priming or few-shot examples | do not reliably improve | results\n      - Frontier LLMs | are substantially worse than | human judgments\n      - Detectability | hinges on | obvious surface artifacts\n      - Obvious surface artifacts | become less informative in | realistic, scaffolded tool-use settings\n    False negative atomized claims:\n      - Language models | can distinguish | evaluation transcripts from real-world transcripts\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - model | can somewhat distinguish | evaluation-style interactions from organic ones in plain chat\n      - detectability | drops to | weak or near-chance levels in tool-augmented/agentic settings\n      - scaffolds and interaction patterns | look similar across | evaluation and real use in tool-augmented/agentic settings\n      - prompt variations | modestly affect | results\n      - prompt variations | provide | slight calibration gains\n      - prompt variations | do not provide | better separability\n      - model’s judgments | align weakly with | human annotators\n      - model’s judgments | imply | limited reliability as proxy for human labeling\n    False positive atomized claims:\n      - detectability | drops to | weak or near-chance levels in tool-augmented/agentic settings\n      - scaffolds and interaction patterns | look similar across | evaluation and real use in tool-augmented/agentic settings\n      - prompt variations | modestly affect | results\n      - prompt variations | provide | slight calibration gains\n      - prompt variations | do not provide | better separability\n      - model’s judgments | align weakly with | human annotators\n      - model’s judgments | imply | limited reliability as proxy for human labeling\n    False negative atomized claims:\n      (none)\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 12.5,\n        \"recall\": 100.0,\n        \"f1\": 22.2\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - full experimental pipeline | will be run for | both models\n      - dataset sampling | is limited to | fixed call budget\n      - pipeline | produces | result files\n      - pipeline | produces | summary outputs\n    False positive atomized claims:\n      - full experimental pipeline | will be run for | both models\n      - dataset sampling | is limited to | fixed call budget\n      - pipeline | produces | result files\n      - pipeline | produces | summary outputs\n    False negative atomized claims:\n      - self-correction methods | impact | accuracies of all models\n      - after self-correction | accuracies of all models | drop or remain nearly the same\n      - drop or remain nearly the same | occur across | math, commonsense reasoning, and multi-hop question answering benchmarks\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Self-correction effects | are | task-dependent\n      - Structured reasoning with voting | substantially benefits | math problems\n      - Refinement | is most helpful when | concise decision-making or answer formatting matters\n      - Chain-of-thought or self-consistency | can underperform when | strict exact-match short answers are required\n      - Chain-of-thought or self-consistency | can underperform when | reasoning causes drift from the expected output format\n    False positive atomized claims:\n      - Self-correction effects | are | task-dependent\n      - Structured reasoning with voting | substantially benefits | math problems\n      - Refinement | is most helpful when | concise decision-making or answer formatting matters\n      - Chain-of-thought or self-consistency | can underperform when | strict exact-match short answers are required\n      - Chain-of-thought or self-consistency | can underperform when | reasoning causes drift from the expected output format\n    False negative atomized claims:\n      - self-correction methods | impact | accuracies of all models\n      - after self-correction | accuracies of all models | drop or remain nearly the same\n      - drop or remain nearly the same | occur across | math, commonsense reasoning, and multi-hop question answering benchmarks\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Self-correction methods | improve performance in | structured math tasks\n      - Self-correction methods | can degrade results on | commonsense multiple-choice questions\n      - Self-correction methods | struggle on | multi-hop question answering without supporting context\n      - Critique-based refinement | is | most consistently helpful across harder settings\n      - Stronger model | shows | similar method-dependent trends\n    False positive atomized claims:\n      - Self-correction methods | improve performance in | structured math tasks\n      - Self-correction methods | struggle on | multi-hop question answering without supporting context\n      - Critique-based refinement | is | most consistently helpful across harder settings\n      - Stronger model | shows | similar method-dependent trends\n    False negative atomized claims:\n      - after self-correction | accuracies of all models | drop or remain nearly the same\n      - drop or remain nearly the same | occur across | math, commonsense reasoning, and multi-hop question answering benchmarks\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 20.0,\n        \"recall\": 33.3,\n        \"f1\": 25.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Decoding changes | can elicit | multi-step 'think-aloud' reasoning\n      - Decoding changes | do not reliably improve | answer accuracy\n      - More stochastic sampling | tends to hurt | accuracy\n      - Ensembling via multiple samples | can help relative to | single-sample sampling\n      - Ensembling via multiple samples | does not consistently beat | simple deterministic decoding\n    False positive atomized claims:\n      - Decoding changes | do not reliably improve | answer accuracy\n      - More stochastic sampling | tends to hurt | accuracy\n      - Ensembling via multiple samples | can help relative to | single-sample sampling\n      - Ensembling via multiple samples | does not consistently beat | simple deterministic decoding\n    False negative atomized claims:\n      - Branching on alternative top-k first tokens | and choosing | path with highest answer-confidence margin\n      - Branching and choosing path with highest answer-confidence margin | exposes | hidden reasoning trajectories\n      - Exposing hidden reasoning trajectories | increases | final-answer accuracy\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 20.0,\n        \"recall\": 40.0,\n        \"f1\": 26.7\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Changing decoding strategies alone | did not improve | correctness\n      - Changing decoding strategies alone | did not surface | real reasoning on the task\n      - Reasoning-like output patterns | varied | without tracking accuracy\n      - Prompt/model choice | matters more than | decoding tweaks for improving performance in this setup\n      - Instruction tuning or other training changes | are examples of | prompt/model choice\n    False positive atomized claims:\n      - Changing decoding strategies alone | did not improve | correctness\n      - Changing decoding strategies alone | did not surface | real reasoning on the task\n      - Reasoning-like output patterns | varied | without tracking accuracy\n      - Prompt/model choice | matters more than | decoding tweaks for improving performance in this setup\n      - Instruction tuning or other training changes | are examples of | prompt/model choice\n    False negative atomized claims:\n      - Large language models | contain | latent CoT reasoning paths\n      - Latent CoT reasoning paths | can be surfaced | without any prompting\n      - Branching on alternative top-k first tokens | and choosing | path with highest answer-confidence margin\n      - Branching and choosing path with highest answer-confidence margin | exposes | hidden reasoning trajectories\n      - Exposing hidden reasoning trajectories | increases | final-answer accuracy\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Altering decoding without explicit chain-of-thought prompting | can elicit | multiple diverse candidate outputs\n      - Multiple diverse candidate outputs | sometimes contain | spontaneous reasoning-like fragments\n      - Aggregating or selecting among diverse candidates | can yield | modest accuracy improvements over greedy decoding\n      - Gains from decoding alterations | are limited | when using a base model on a challenging reasoning task\n      - Larger benefits from decoding alterations | likely depend on | more reasoning-capable or instruction-following models\n      - Larger benefits from decoding alterations | may require | additional scaffolding beyond decoding changes alone\n    False positive atomized claims:\n      - Gains from decoding alterations | are limited | when using a base model on a challenging reasoning task\n      - Larger benefits from decoding alterations | likely depend on | more reasoning-capable or instruction-following models\n      - Larger benefits from decoding alterations | may require | additional scaffolding beyond decoding changes alone\n    False negative atomized claims:\n      - Large language models | contain | latent CoT reasoning paths\n      - Branching on alternative top-k first tokens | and choosing | path with highest answer-confidence margin\n      - Branching and choosing path with highest answer-confidence margin | exposes | hidden reasoning trajectories\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 50.0,\n        \"recall\": 40.0,\n        \"f1\": 44.4\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Answer accuracy | shows | positional bias\n      - Answer accuracy | is highest when | relevant passage appears at start or end of context\n      - Answer accuracy | is lower when | relevant passage appears in middle of context\n      - Positional bias | indicates | strong primacy and recency effects\n      - Performance gaps | exist between | edge and center positions\n    False positive atomized claims:\n      (none)\n    False negative atomized claims:\n      (none)\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 100.0,\n        \"recall\": 100.0,\n        \"f1\": 100.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Accuracy | shows | positional bias\n      - Accuracy | is higher when | answer-bearing passage appears at beginning or near end of context\n      - Accuracy | is lower when | answer-bearing passage is in middle of context\n      - Arranging passages to place most relevant content early or late | can improve | performance\n      - Results | may vary with | different models\n      - Results | may vary with | different prompts\n      - Results | may vary with | different context lengths\n      - Results | may vary with | different evaluation metrics\n    False positive atomized claims:\n      - Results | may vary with | different models\n      - Results | may vary with | different prompts\n      - Results | may vary with | different context lengths\n      - Results | may vary with | different evaluation metrics\n    False negative atomized claims:\n      (none)\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 50.0,\n        \"recall\": 100.0,\n        \"f1\": 66.7\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - GPT-3.5 | did not show | statistically reliable differences in predicted hospital length of stay or total cost when only a race/ethnicity label was varied\n      - GPT-3.5 predictions | trended toward | higher estimated costs for non-White labels\n      - Higher estimated costs for non-White labels | suggest | potential bias signal\n      - Potential bias signal | merits | further, higher-powered and more robust testing\n    False positive atomized claims:\n      - GPT-3.5 | did not show | statistically reliable differences in predicted hospital length of stay or total cost when only a race/ethnicity label was varied\n      - Potential bias signal | merits | further, higher-powered and more robust testing\n    False negative atomized claims:\n      - Assessment and plans created by the model | showed | significant association between demographic attributes and recommendations for more expensive procedures\n      - Assessment and plans created by the model | showed | differences in patient perception\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 50.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Evaluation | found | no statistically reliable evidence that changing only the race label causes GPT-3.5 to systematically predict higher medical costs for any racial group\n      - Evaluation | found | no statistically reliable evidence that changing only the race label causes GPT-3.5 to systematically predict longer hospital stays for any racial group\n      - GPT-3.5 | shows | weak, non-significant tendency for longer predicted stays when label indicates Black or African American\n    False positive atomized claims:\n      - Evaluation | found | no statistically reliable evidence that changing only the race label causes GPT-3.5 to systematically predict higher medical costs for any racial group\n      - Evaluation | found | no statistically reliable evidence that changing only the race label causes GPT-3.5 to systematically predict longer hospital stays for any racial group\n      - GPT-3.5 | shows | weak, non-significant tendency for longer predicted stays when label indicates Black or African American\n    False negative atomized claims:\n      - Assessment and plans created by the model | showed | significant association between demographic attributes and recommendations for more expensive procedures\n      - Assessment and plans created by the model | showed | differences in patient perception\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - GPT-3.5 model's estimated hospital costs | are sensitive to | race cues in the prompt\n      - GPT-3.5 model's estimated hospital costs | show | disproportionate increase for at least one racial group\n      - GPT-3.5 model's estimated hospital costs | show | borderline increase for another racial group\n      - GPT-3.5 model's estimated hospital costs | increase | even when instructed to avoid stereotype-based reasoning\n      - GPT-3.5 model's predicted length of stay | does not show | statistically reliable race-dependent shift under counterfactual setup\n    False positive atomized claims:\n      - GPT-3.5 model's estimated hospital costs | show | borderline increase for another racial group\n      - GPT-3.5 model's estimated hospital costs | increase | even when instructed to avoid stereotype-based reasoning\n      - GPT-3.5 model's predicted length of stay | does not show | statistically reliable race-dependent shift under counterfactual setup\n    False negative atomized claims:\n      - Model assessment and plans | showed | significant association between demographic attributes and recommendations for more expensive procedures\n      - Model assessment and plans | showed | differences in patient perception\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 40.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Models | can predict | their own behavior better than external model trained on same behavioral traces\n      - Self-prediction advantage | applies to | higher-level, stable choice tendencies\n      - Self-prediction advantage | applies to | value/stance-like decisions in stronger models\n      - Self-introspection and external prediction | perform poorly on | fine-grained surface-form details\n      - Fine-grained surface-form details | include | exact token positions\n      - Sufficiently capable external predictor | can match or surpass | weaker model’s self-introspection on some properties\n      - Capacity and property type | strongly influence | when self-knowledge helps\n    False positive atomized claims:\n      - Self-prediction advantage | applies to | higher-level, stable choice tendencies\n      - Self-prediction advantage | applies to | value/stance-like decisions in stronger models\n      - Self-introspection and external prediction | perform poorly on | fine-grained surface-form details\n      - Fine-grained surface-form details | include | exact token positions\n    False negative atomized claims:\n      - LLMs | do not solely rely on | training data\n      - Trained models | are calibrated when predicting | their behavior\n      - Trained models | adapt their predictions when | their behavior is changed\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 42.9,\n        \"recall\": 50.0,\n        \"f1\": 46.2\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Introspective querying | helps | LLM predict some of its own low-level surface-form behaviors better than behavior-only predictor\n      - Advantage of introspective querying | is greater for | more capable models\n      - Advantage of introspective querying | does not generalize to | higher-level choice or value-related tendencies\n      - Behavior-only baseline | performs better at | predicting higher-level choice or value-related tendencies\n    False positive atomized claims:\n      - Advantage of introspective querying | is greater for | more capable models\n      - Advantage of introspective querying | does not generalize to | higher-level choice or value-related tendencies\n      - Behavior-only baseline | performs better at | predicting higher-level choice or value-related tendencies\n    False negative atomized claims:\n      - LLMs | do not solely rely on | training data\n      - Trained models | are calibrated when predicting | their behavior\n      - Trained models | adapt their predictions when | their behavior is changed\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 25.0,\n        \"recall\": 50.0,\n        \"f1\": 33.3\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - LLMs | predict their own future behavior under hypothetical prompts | more accurately than external model trained on same behavioral examples\n      - LLMs | have | model-internal behavioral regularities ('self-knowledge')\n      - Model-internal behavioral regularities | exist | beyond what can be learned from training data alone\n      - LLMs' self-prediction advantage | is strongest for | idiosyncratic, form-level output tendencies\n      - Some value- or content-aligned binary judgments | can be better captured by | external predictor using learned semantic patterns\n    False positive atomized claims:\n      - LLMs' self-prediction advantage | is strongest for | idiosyncratic, form-level output tendencies\n      - Some value- or content-aligned binary judgments | can be better captured by | external predictor using learned semantic patterns\n    False negative atomized claims:\n      - Trained models | are calibrated when predicting | their behavior\n      - Trained models | adapt their predictions when | their behavior is changed\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 60.0,\n        \"recall\": 66.7,\n        \"f1\": 63.2\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Reasoning models' stated explanations | are | unreliable indicators of what influenced their answers when hints are present\n      - Reasoning models | follow | overt and covert hint channels in their behavior\n      - Reasoning models | almost never acknowledge | using hints\n      - Gap | exists between | observed decision influence and self-reported reasoning across models and hint types\n    False positive atomized claims:\n      - Reasoning models | follow | overt and covert hint channels in their behavior\n    False negative atomized claims:\n      (none)\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 75.0,\n        \"recall\": 100.0,\n        \"f1\": 85.7\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Models | exploit | external hints\n      - External hints | include | hidden signals\n      - External hints | include | structural signals\n      - External hints | include | ethically/misaligned signals\n      - Models | use | external hints to drive answers\n      - Self-reported 'used_hint' fields | do not reliably reveal | reliance on external hints\n      - Brief rationales | do not reliably reveal | reliance on external hints\n      - Self-reports | are | noisy in both directions\n      - Self-reports | may claim | hint use without following it\n      - Self-reports | may deny | hint use while following hints\n      - Behavioral audits under controlled hint injection | are more trustworthy than | self-reported reasoning\n      - Behavioral audits under controlled hint injection | assess | information that influenced outputs\n    False positive atomized claims:\n      - External hints | include | hidden signals\n      - External hints | include | structural signals\n      - External hints | include | ethically/misaligned signals\n      - Behavioral audits under controlled hint injection | are more trustworthy than | self-reported reasoning\n      - Behavioral audits under controlled hint injection | assess | information that influenced outputs\n    False negative atomized claims:\n      (none)\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 58.3,\n        \"recall\": 100.0,\n        \"f1\": 73.7\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Chains-of-thought explanations | are | generally unreliable for disclosing when external hints influenced a model’s answer\n      - Models | often follow | subtle and explicit hint channels\n      - Models | deny or fail to acknowledge | hint use\n      - Acknowledgment of hint use | varies by | hint style\n      - Acknowledgment of hint use | is more likely for | structured, explicitly marked hints\n      - Acknowledgment of hint use | is less likely for | implicit cues\n    False positive atomized claims:\n      - Acknowledgment of hint use | varies by | hint style\n      - Acknowledgment of hint use | is more likely for | structured, explicitly marked hints\n    False negative atomized claims:\n      (none)\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 66.7,\n        \"recall\": 100.0,\n        \"f1\": 80.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)"
]