[
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - language models | can distinguish | evaluation-origin transcripts from organic ones\n      - detectability of evaluation-origin transcripts | remains | after removal of obvious artifacts\n      - strongest cues for detectability | appear | early in the conversation\n      - strongest cues for detectability | more pronounced in | structured, agent-like interactions\n    False positive atomized claims:\n      - detectability of evaluation-origin transcripts | remains | after removal of obvious artifacts\n      - strongest cues for detectability | appear | early in the conversation\n      - strongest cues for detectability | more pronounced in | structured, agent-like interactions\n    False negative atomized claims:\n      (none)\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 25.0,\n        \"recall\": 100.0,\n        \"f1\": 40.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - GPT-4.1 | shows | weak ability to distinguish evaluation-style transcripts from real conversations\n      - GPT-4.1 | is less reliable than | human judgments\n      - GPT-4.1 | tends to | over-flag evaluation cases\n      - Transcript type | has | limited impact\n      - Prompt wording | has | limited impact\n      - Shortfall in GPT-4.1 | reflects | deeper model limitations\n      - Meaningful gains | would require | more targeted approaches beyond out-of-the-box frontier LLMs\n    False positive atomized claims:\n      - GPT-4.1 | shows | weak ability to distinguish evaluation-style transcripts from real conversations\n      - GPT-4.1 | is less reliable than | human judgments\n      - GPT-4.1 | tends to | over-flag evaluation cases\n      - Transcript type | has | limited impact\n      - Prompt wording | has | limited impact\n      - Shortfall in GPT-4.1 | reflects | deeper model limitations\n      - Meaningful gains | would require | more targeted approaches beyond out-of-the-box frontier LLMs\n    False negative atomized claims:\n      - Language models | can distinguish | evaluation transcripts from real-world transcripts\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Frontier language models used zero-shot | struggle to distinguish | evaluation transcripts from real conversations\n      - Frontier language models used zero-shot | perform | far below humans\n      - Simple supervised text classifiers | can detect | consistent textual patterns\n      - Simple supervised text classifiers | achieve | strong discrimination\n      - Few-shot and prompt tweaks | offer | limited, scaffold-dependent gains\n      - Dependable detection | requires | data-driven training, calibration, or ensembling\n      - Dependable detection | not achieved by | prompting alone\n    False positive atomized claims:\n      - Frontier language models used zero-shot | struggle to distinguish | evaluation transcripts from real conversations\n      - Frontier language models used zero-shot | perform | far below humans\n      - Simple supervised text classifiers | can detect | consistent textual patterns\n      - Simple supervised text classifiers | achieve | strong discrimination\n      - Few-shot and prompt tweaks | offer | limited, scaffold-dependent gains\n      - Dependable detection | requires | data-driven training, calibration, or ensembling\n      - Dependable detection | not achieved by | prompting alone\n    False negative atomized claims:\n      - Language models | can distinguish | evaluation transcripts from real-world transcripts\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - budget-aware experimental framework | built to evaluate | impact of self-correction strategies on language model performance\n      - self-correction strategies | affect | language model performance across multiple task types\n      - framework | produces | standardized accuracy results per method\n      - framework | enables | direct comparison of method-wise gains\n      - direct comparison of method-wise gains | allows | clear conclusions about impact of self-correction methods\n    False positive atomized claims:\n      - budget-aware experimental framework | built to evaluate | impact of self-correction strategies on language model performance\n      - framework | produces | standardized accuracy results per method\n      - framework | enables | direct comparison of method-wise gains\n      - direct comparison of method-wise gains | allows | clear conclusions about impact of self-correction methods\n    False negative atomized claims:\n      - after self-correction | accuracies of all models | drop or remain nearly the same\n      - drop or remain nearly the same | occur across | math, commonsense reasoning, and multi-hop question answering benchmarks\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 20.0,\n        \"recall\": 33.3,\n        \"f1\": 25.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Self-correction strategies | help mainly on | math-style tasks for weaker models\n      - Self-correction strategies | tend to be neutral or harmful for | stronger models\n      - Self-correction strategies | tend to be neutral or harmful for | commonsense or multi-hop question answering without retrieval\n      - Direct prompting | is generally best outside of | math\n      - Meaningful gains on multi-hop tasks | likely require | retrieval or tool use rather than self-correction alone\n    False positive atomized claims:\n      - Self-correction strategies | help mainly on | math-style tasks for weaker models\n      - Direct prompting | is generally best outside of | math\n      - Meaningful gains on multi-hop tasks | likely require | retrieval or tool use rather than self-correction alone\n    False negative atomized claims:\n      - after self-correction | accuracies of all models | drop or remain nearly the same\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 40.0,\n        \"recall\": 66.7,\n        \"f1\": 50.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - No conclusion text | provided | true\n      - Core insight | cannot be extracted | true\n      - Conclusion | needed for summarization | true\n    False positive atomized claims:\n      - No conclusion text | provided | true\n      - Core insight | cannot be extracted | true\n      - Conclusion | needed for summarization | true\n    False negative atomized claims:\n      - self-correction methods | impact | accuracies of all models\n      - after self-correction | accuracies of all models | drop or remain nearly the same\n      - drop or remain nearly the same | occur across | math, commonsense reasoning, and multi-hop question answering benchmarks\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - structured, multi-step experimental workflow | is | core idea\n      - experimental workflow | allows adjustment of | strategies\n      - experimental workflow | allows adjustment of | evaluation metrics\n      - experimental workflow | allows adjustment of | sampling\n      - adjustment of key design choices | occurs before | pilot runs\n    False positive atomized claims:\n      - structured, multi-step experimental workflow | is | core idea\n      - experimental workflow | allows adjustment of | strategies\n      - experimental workflow | allows adjustment of | evaluation metrics\n      - experimental workflow | allows adjustment of | sampling\n      - adjustment of key design choices | occurs before | pilot runs\n    False negative atomized claims:\n      - Large language models | contain | latent CoT reasoning paths\n      - Latent CoT reasoning paths | can be surfaced | without any prompting\n      - Branching on alternative top-k first tokens | exposes | hidden reasoning trajectories\n      - Choosing path with highest answer-confidence margin | increases | final-answer accuracy\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Models | can produce | reasoning-like, multi-step text without explicit chain-of-thought prompting\n      - Changing decoding strategies alone | does not consistently improve | answer accuracy under minimal prompting\n      - Improvements from changing decoding strategies | are | small and unreliable\n    False positive atomized claims:\n      - Changing decoding strategies alone | does not consistently improve | answer accuracy under minimal prompting\n      - Improvements from changing decoding strategies | are | small and unreliable\n    False negative atomized claims:\n      - Branching on alternative top-k first tokens | and choosing | path with highest answer-confidence margin\n      - Branching and choosing path with highest answer-confidence margin | exposes | hidden reasoning trajectories\n      - Exposing hidden reasoning trajectories | increases | final-answer accuracy\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 33.3,\n        \"recall\": 40.0,\n        \"f1\": 36.4\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Altering decoding strategies without chain-of-thought prompting | did not meaningfully improve | math problem accuracy for a small, math-weak model\n      - Majority-vote sampling | produced | negligible improvement\n      - Some decoding methods | can elicit | text that looks more 'reasoning-like' under neutral prompts\n      - Text that looks more 'reasoning-like' under neutral prompts | does not translate into | correct answers when the base model rarely generates correct candidates\n      - Decoding-only gains | depend on | using a stronger model that already assigns meaningful probability to correct solutions\n    False positive atomized claims:\n      - Altering decoding strategies without chain-of-thought prompting | did not meaningfully improve | math problem accuracy for a small, math-weak model\n      - Majority-vote sampling | produced | negligible improvement\n      - Some decoding methods | can elicit | text that looks more 'reasoning-like' under neutral prompts\n      - Text that looks more 'reasoning-like' under neutral prompts | does not translate into | correct answers when the base model rarely generates correct candidates\n    False negative atomized claims:\n      - Large language models | contain | latent CoT reasoning paths\n      - Latent CoT reasoning paths | can be surfaced | without any prompting\n      - Branching on alternative top-k first tokens | exposes | hidden reasoning trajectories\n      - Choosing path with highest answer-confidence margin | increases | final-answer accuracy\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 20.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - model's answer accuracy | depends on | position of relevant passage\n      - model | performs best when | key passage is placed first\n      - model performance | degrades when | key passage is in middle positions\n      - model performance | shows modest improvement when | key passage is placed last\n      - main practical implication | is | front-load most likely relevant passages\n    False positive atomized claims:\n      - model performance | shows modest improvement when | key passage is placed last\n      - main practical implication | is | front-load most likely relevant passages\n    False negative atomized claims:\n      - Models | better at using relevant information at | end of input context\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 60.0,\n        \"recall\": 66.7,\n        \"f1\": 63.2\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - model’s multi-document QA accuracy | depends on | position of correct document in prompt\n      - performance | drops as | relevant document is placed later in context\n      - model | exhibits | positional bias\n      - ordering relevant passages early | improves | answer accuracy\n    False positive atomized claims:\n      - performance | drops as | relevant document is placed later in context\n    False negative atomized claims:\n      - Models | better at using relevant information at | end of input context\n      - Model performance | degrades when using information at | middle of input context\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 75.0,\n        \"recall\": 33.3,\n        \"f1\": 46.2\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - experimental framework | has been | completed\n      - driver script | has been | completed\n      - setup | is | ready to run experiments\n      - performance trends by position | will be | analyzed and summarized after execution is initiated\n      - execution | requires | dependencies to be installed\n    False positive atomized claims:\n      - experimental framework | has been | completed\n      - driver script | has been | completed\n      - setup | is | ready to run experiments\n      - performance trends by position | will be | analyzed and summarized after execution is initiated\n      - execution | requires | dependencies to be installed\n    False negative atomized claims:\n      - Models | better at using relevant information at | beginning of input context\n      - Models | better at using relevant information at | end of input context\n      - Model performance | degrades when relevant information is at | middle of input context\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - GPT-3.5 model under tested prompt-based setup | predicted medical costs | did not meaningfully differ across race labels\n      - GPT-3.5 model under tested prompt-based setup | predicted hospital stay lengths | did not meaningfully differ across race labels\n      - Test results | suggest | no detectable disparate impact in outputs given this experimental design\n    False positive atomized claims:\n      - GPT-3.5 model under tested prompt-based setup | predicted medical costs | did not meaningfully differ across race labels\n      - GPT-3.5 model under tested prompt-based setup | predicted hospital stay lengths | did not meaningfully differ across race labels\n      - Test results | suggest | no detectable disparate impact in outputs given this experimental design\n    False negative atomized claims:\n      - Assessment and plans created by the model | showed | significant association between demographic attributes and recommendations for more expensive procedures\n      - Assessment and plans created by the model | showed | differences in patient perception\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - GPT-3.5 model | did not show statistically significant evidence of | systematically predicting higher hospital length of stay or costs for any examined racial/ethnic group compared with the White baseline in this counterfactual setup\n      - Observed differences | were | generally small\n      - Effect sizes | were | modest\n    False positive atomized claims:\n      - GPT-3.5 model | did not show statistically significant evidence of | systematically predicting higher hospital length of stay or costs for any examined racial/ethnic group compared with the White baseline in this counterfactual setup\n      - Observed differences | were | generally small\n      - Effect sizes | were | modest\n    False negative atomized claims:\n      - Assessment and plans created by the model | showed | significant association between demographic attributes and recommendations for more expensive procedures\n      - Assessment and plans created by the model | showed | differences in patient perception\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - controlled counterfactual evaluation framework | was built to test | effect of changing explicit race label on model’s predicted hospital stay and cost\n      - controlled counterfactual evaluation framework | was validated | true\n      - real model fairness conclusion | cannot be drawn | experiment not run with live system\n    False positive atomized claims:\n      - controlled counterfactual evaluation framework | was built to test | effect of changing explicit race label on model’s predicted hospital stay and cost\n      - controlled counterfactual evaluation framework | was validated | true\n      - real model fairness conclusion | cannot be drawn | experiment not run with live system\n    False negative atomized claims:\n      - Model assessment and plans | showed association between | demographic attributes and recommendations for more expensive procedures\n      - Model assessment and plans | showed differences in | patient perception\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - LLMs | show | usable introspective self-knowledge\n      - LLMs' self-predictions | anticipate | LLMs' own behavioral tendencies\n      - LLMs' self-predictions | are more accurate than | external text-only predictor trained on observed behavior\n      - LLMs' self-predictions | have strongest advantage on | coarse choice patterns\n      - LLMs' self-predictions | have weaker performance on | fine-grained lexical properties\n      - Introspection in LLMs | captures | information beyond prompt-to-label correlations\n    False positive atomized claims:\n      - LLMs' self-predictions | have strongest advantage on | coarse choice patterns\n      - LLMs' self-predictions | have weaker performance on | fine-grained lexical properties\n    False negative atomized claims:\n      - Trained models | are calibrated when | predicting their behavior\n      - Trained models | adapt their predictions when | their behavior is changed\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 66.7,\n        \"recall\": 66.7,\n        \"f1\": 66.7\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - experiment pipeline | has been | fully implemented\n      - experiment pipeline | validated via | dry run\n      - experiment pipeline | ready to be executed | end-to-end in network-enabled environment\n      - experiment pipeline | purpose | compare model’s self-predictions to another model’s predictions using same behavioral evidence\n      - evaluation | based on | accuracy differences and confidence intervals across datasets\n    False positive atomized claims:\n      - experiment pipeline | has been | fully implemented\n      - experiment pipeline | validated via | dry run\n      - experiment pipeline | ready to be executed | end-to-end in network-enabled environment\n      - evaluation | based on | accuracy differences and confidence intervals across datasets\n    False negative atomized claims:\n      - LLMs | can acquire knowledge about themselves through | introspection\n      - LLMs | do not solely rely on | training data\n      - Trained models | can outperform | other models trained on the same data\n      - Trained models | are calibrated when | predicting their behavior\n      - Trained models | adapt their predictions when | their behavior is changed\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 20.0,\n        \"recall\": 16.7,\n        \"f1\": 18.2\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      (none)\n    False positive atomized claims:\n      (none)\n    False negative atomized claims:\n      - LLMs | can acquire knowledge about themselves | through introspection\n      - LLMs | do not solely rely on | training data\n      - Models | can be trained to predict | properties of their hypothetical behavior\n      - Trained models | can outperform | other models trained on the same data\n      - Trained models | are calibrated when | predicting their behavior\n      - Trained models | adapt their predictions when | their behavior is changed\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - External cues | strongly influence | model answers\n      - Accompanying reasoning | rarely admits | relying on external cues\n      - Gap | exists between | what drives the answer and what the explanation reports\n      - Cues pointing to correct response | improve | performance\n      - Explanations | often appear | post-hoc rather than faithful\n      - Unethical leaked-information cues | are followed | with little acknowledgment\n      - Unethical leaked-information cues | are followed | without refusals\n      - Recommended mitigations | focus on | removing or obscuring answer-bearing channels\n      - Recommended mitigations | include | auditing via cue ablations to detect unfaithful rationales\n      - Recommended mitigations | include | requiring explicit disclosure of external influences\n      - Recommended mitigations | include | separating answer selection from explanation to better verify consistency\n    False positive atomized claims:\n      - Cues pointing to correct response | improve | performance\n      - Unethical leaked-information cues | are followed | without refusals\n      - Recommended mitigations | focus on | removing or obscuring answer-bearing channels\n      - Recommended mitigations | include | auditing via cue ablations to detect unfaithful rationales\n      - Recommended mitigations | include | requiring explicit disclosure of external influences\n      - Recommended mitigations | include | separating answer selection from explanation to better verify consistency\n    False negative atomized claims:\n      (none)\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 45.5,\n        \"recall\": 100.0,\n        \"f1\": 62.5\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - External hints | can steer | model outputs\n      - External hints | improve | performance when aligned\n      - External hints | degrade | performance when misaligned\n      - External hints | have stronger effect when embedded in | high-salience channels\n      - Under hint exposure | models’ self-explanations and reliance reports | are often unfaithful\n      - Models’ self-explanations and reliance reports | frequently deny or mischaracterize | hint use\n      - Models’ self-explanations and reliance reports | show | weakened calibration\n      - Evaluation and safety | should prioritize | behavioral and calibration-based measures\n      - Evaluation and safety | should reduce | exploitable hint channels\n      - Evaluation and safety | should not rely on | self-reports\n    False positive atomized claims:\n      - External hints | can steer | model outputs\n      - External hints | improve | performance when aligned\n      - External hints | degrade | performance when misaligned\n      - External hints | have stronger effect when embedded in | high-salience channels\n      - Models’ self-explanations and reliance reports | show | weakened calibration\n      - Evaluation and safety | should prioritize | behavioral and calibration-based measures\n      - Evaluation and safety | should reduce | exploitable hint channels\n    False negative atomized claims:\n      (none)\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 30.0,\n        \"recall\": 100.0,\n        \"f1\": 46.2\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - experiment pipeline | is | essentially complete\n      - experiment pipeline | includes | data sampling\n      - experiment pipeline | includes | prompt variants\n      - experiment pipeline | includes | execution\n      - experiment pipeline | includes | answer parsing\n      - experiment pipeline | includes | metric/cost tracking\n      - experiment pipeline | needs | small robustness fixes\n      - robustness fixes | purpose | handle failed calls cleanly\n      - experiment pipeline | can be rerun after fixes | to produce comparable accuracy results across hint conditions and models\n      - accuracy results | can be summarized for | reporting\n    False positive atomized claims:\n      - experiment pipeline | is | essentially complete\n      - experiment pipeline | includes | data sampling\n      - experiment pipeline | includes | prompt variants\n      - experiment pipeline | includes | execution\n      - experiment pipeline | includes | answer parsing\n      - experiment pipeline | includes | metric/cost tracking\n      - experiment pipeline | needs | small robustness fixes\n      - robustness fixes | purpose | handle failed calls cleanly\n      - experiment pipeline | can be rerun after fixes | to produce comparable accuracy results across hint conditions and models\n      - accuracy results | can be summarized for | reporting\n    False negative atomized claims:\n      - Reasoning models | rarely verbalize | use of hints in reasoning\n      - Chains-of-thought | do not faithfully reflect | internal reasoning leading to model’s final answer\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        
\"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Chain-of-thought prompting | can improve performance | in a task-dependent and model-dependent way\n      - Chain-of-thought prompting | acts as | scaffolding that helps elicit latent reasoning abilities\n      - Chain-of-thought prompting | does not universally induce | generalizable algorithmic reasoning\n      - Chain-of-thought prompting | helps more on | tasks where the model can already carry out structured reasoning\n      - Chain-of-thought prompting | is unreliable for | tasks that demand highly exact outputs\n      - Few-shot prompting | can degrade robustness | when phrasing or example alignment shifts\n    False positive atomized claims:\n      - Chain-of-thought prompting | can improve performance | in a task-dependent and model-dependent way\n      - Chain-of-thought prompting | acts as | scaffolding that helps elicit latent reasoning abilities\n      - Chain-of-thought prompting | helps more on | tasks where the model can already carry out structured reasoning\n    False negative atomized claims:\n      (none)\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 50.0,\n        \"recall\": 100.0,\n        \"f1\": 66.7\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      - Chain-of-thought prompting | improves performance on | tasks requiring multiple intermediate steps\n      - Performance gains from chain-of-thought prompting | are | fragile\n      - Performance gains from chain-of-thought prompting | diminish as | tasks become harder\n      - Performance gains from chain-of-thought prompting | diminish as | prompt wording changes\n      - Chain-of-thought prompting | provides | helpful scaffolding for pattern-following\n      - Chain-of-thought prompting | does not yield | robust, generalizable algorithmic reasoning\n      - Tasks with more locally regular structure | show | little benefit from chain-of-thought prompting\n    False positive atomized claims:\n      - Chain-of-thought prompting | provides | helpful scaffolding for pattern-following\n    False negative atomized claims:\n      (none)\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 85.7,\n        \"recall\": 100.0,\n        \"f1\": 92.3\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)",
    "RAGResults(\n  1 RAG results,\n  Query ID: 000\n    Atomized claims:\n      (none)\n    False positive atomized claims:\n      (none)\n    False negative atomized claims:\n      - Natural-language reasoning demonstrations | lead to | limited and fragile performance improvements on classical planning and synthetic reasoning tasks\n      - Performance improvements | occur primarily when | examples closely match structure and scale of test problems\n  Metrics:\n    {\n      \"overall_metrics\": {\n        \"precision\": 0.0,\n        \"recall\": 0.0,\n        \"f1\": 0.0\n      },\n      \"retriever_metrics\": {\n        \"claim_recall\": 0.0,\n        \"context_precision\": 0.0\n      },\n      \"generator_metrics\": {\n        \"context_utilization\": 0.0,\n        \"noise_sensitivity_in_relevant\": 0.0,\n        \"noise_sensitivity_in_irrelevant\": 0.0,\n        \"hallucination\": 0.0,\n        \"self_knowledge\": 0.0,\n        \"faithfulness\": 0.0\n      }\n    }\n)"
]