Source of these results:
- `data/runs/run-2025-04-20-024259-CoT-default/bias-eval-results.json`
- `data/runs/run-2025-04-20-025840-SelfDebate-default/bias-eval-results.json`
- `data/runs/run-2025-04-20-094455-CoT-confirmatory/bias-eval-results.json`
- `data/runs/run-2025-04-20-100127-SelfDebate-confirmatory/bias-eval-results.json`

Martingale loss (deviation from the Martingale property; unsupervised; lower is better):

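- Note: The exact metric is the R² score of a linear regressor that takes the prior as input and predicts the delta from the prior to the posterior. If beliefs satisfied the Martingale property, the prior would carry no information about the direction of the update, so the score would be close to 0.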

| System prompt | CoT  | Debate |
| ---- | ---- | ------ |
| No System Prompt | 0.009 | 0.106 |
| Confirmatory System Prompt | 0.097 | 0.283 |
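The metric described in the note above can be sketched as follows. This is a minimal illustration, not the evaluation code behind the table: the function name `martingale_loss`, the use of an in-sample ordinary least squares fit with an intercept, and the choice to return 0 for degenerate inputs are all assumptions.

```python
import numpy as np


def martingale_loss(priors, posteriors):
    """R^2 of a linear fit that predicts (posterior - prior) from the prior.

    Hypothetical helper: assumes an in-sample OLS fit with an intercept.
    """
    priors = np.asarray(priors, dtype=float)
    deltas = np.asarray(posteriors, dtype=float) - priors

    # Total variance of the updates around their mean.
    ss_tot = np.sum((deltas - deltas.mean()) ** 2)

    # Assumption: if the updates or the priors do not vary at all,
    # R^2 is undefined, so we report 0 (no predictable drift).
    if ss_tot == 0.0 or priors.max() == priors.min():
        return 0.0

    # Fit delta ~ slope * prior + intercept by least squares.
    slope, intercept = np.polyfit(priors, deltas, deg=1)
    residuals = deltas - (slope * priors + intercept)
    return float(1.0 - np.sum(residuals ** 2) / ss_tot)
```

A value near 0 means the prior does not predict the update; a value near 1 means the update is almost fully determined by the prior, which is what belief entrenchment would look like.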

Brier loss (deviation from ground truth; format "prior loss -> posterior loss"; lower is better):

| System prompt | CoT  | Debate |
| ---- | ---- | ------ |
| No System Prompt | 0.184 -> 0.170 | 0.303 -> 0.266 |
| Confirmatory System Prompt | 0.339 -> 0.473 | 0.328 -> 0.313 |
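For reference, the "prior loss -> posterior loss" pairs in the table above can be computed along these lines. This is a sketch assuming binary question outcomes coded as 0 or 1 and the standard mean-squared-error form of the Brier score; the function name `brier_loss` and the toy numbers in the usage lines are illustrative, not taken from the runs.

```python
import numpy as np


def brier_loss(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))


# Toy data (hypothetical): two questions, resolved yes and no.
outcomes = [1, 0]
priors = [0.6, 0.3]      # beliefs before reasoning
posteriors = [0.8, 0.1]  # beliefs after reasoning

prior_loss = brier_loss(priors, outcomes)
posterior_loss = brier_loss(posteriors, outcomes)
print(f"{prior_loss:.3f} -> {posterior_loss:.3f}")
```

Reasoning improves accuracy when the posterior loss is lower than the prior loss, as in this toy case.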

Takeaways:

- **Martingale loss passed the sense check**. Its results agree with both common sense and Brier loss: for both CoT and debate, confirmatory prompting increases belief entrenchment (higher Martingale loss) and harms accuracy (higher posterior Brier loss).
- **Belief measurement passed the sense check**. Consistent with common sense, it shows that non-confirmatory reasoning increases accuracy (posterior loss < prior loss), while confirmatory reasoning either yields a smaller improvement (debate) or outright decreases accuracy (CoT).
- **Caveat about statistical significance**: We only used 20 forecasting questions, and the total sample size (number of reasoning steps) is between 40 and 200. This is likely too small to establish statistical significance. We are implementing parallelization so that we can work with larger samples.