## Accuracy-abs(Bias) Regression

accuracy ~ abs(bias)
- coef = -0.174
- p-value = 0.214
- R2 = 0.020

accuracy ~ abs(bias) + domain
- coef = 0.3393
- p-value = 0.065
- R2 = 0.191

accuracy ~ abs(bias) + reasoning_mode
- coef = -0.1297
- p-value = 0.356
- R2 = 0.060

accuracy ~ abs(bias) + domain + reasoning_mode
- coef = 0.4689
- p-value = 0.011
- R2 = 0.269

accuracy ~ abs(bias) + domain + reasoning_mode + model
- coef = 0.3603
- p-value = 0.043
- R2 = 0.344

accuracy ~ abs(bias) + domain + reasoning_mode + prompt
- coef = 0.409
- p-value = 0.016
- R2 = 0.398

## Accuracy-Bias Regression

accuracy ~ bias
- coef = -0.310
- p-value = 0.004
- R2 = 0.102

accuracy ~ bias + domain
- coef = -0.1326
- p-value = 0.298
- R2 = 0.165

accuracy ~ bias + reasoning_mode
- coef = -0.2906
- p-value = 0.007
- R2 = 0.138

accuracy ~ bias + domain + reasoning_mode
- coef = -0.101
- p-value = 0.422
- R2 = 0.209

accuracy ~ bias + domain + reasoning_mode + model
- coef = -0.1481
- p-value = 0.206
- R2 = 0.399

accuracy ~ bias + domain + reasoning_mode + prompt
- coef = -0.060
- p-value = 0.604
- R2 = 0.349
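
The specifications listed in the two sections above can be reproduced with the statsmodels formula API. The following is a minimal sketch on synthetic data, not the original analysis script: the DataFrame, its column names (`accuracy`, `bias`, `abs_bias`, `domain`, `reasoning_mode`) and the `summarize` helper are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the per-setup table; every column name is an assumption.
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "bias": rng.normal(0.05, 0.05, n),
    "domain": rng.choice(["CMV", "Forecasting", "OpenReview"], n),
    "reasoning_mode": rng.choice(["CoT", "SelfDebate"], n),
})
df["abs_bias"] = df["bias"].abs()
df["accuracy"] = 0.6 - 0.3 * df["bias"] + rng.normal(0, 0.05, n)

# Nested specifications, mirroring the first three abs(bias) fits.
specs = [
    "accuracy ~ abs_bias",
    "accuracy ~ abs_bias + C(domain)",
    "accuracy ~ abs_bias + C(domain) + C(reasoning_mode)",
]

def summarize(formula, data, term):
    """Fit one OLS specification and return the three numbers reported per fit."""
    fit = smf.ols(formula, data=data).fit()
    return {"coef": fit.params[term], "p": fit.pvalues[term], "r2": fit.rsquared}

results = {f: summarize(f, df, "abs_bias") for f in specs}
for f, r in results.items():
    print(f"{f}: coef={r['coef']:.4f}, p-value={r['p']:.3f}, R2={r['r2']:.3f}")
```

Because each specification nests the previous one, R2 cannot decrease as covariates are added, while the coefficient on the bias term is free to change sign, as it does between the first two abs(bias) fits above.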

## Per-Trajectory

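Each entry is the category's interaction with `prior` in the regression of `delta` below: the change in the slope on `prior` relative to the reference category, with its 95% confidence interval.

Domains: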
- Forecasting: -0.0590 [-0.064, -0.054]
- CMV: 0 (reference point)
- OpenReview: 0.0076 [-0.000, 0.015]

Reasoning modes:
- SelfDebate: -0.0249 [-0.035, -0.015]
- CoT: 0 (reference point)

Models:
- deepseek_r1: -0.0291 [-0.037, -0.021]
- gpt_4o: -0.0033 [-0.013, 0.007]
- claude_3_5_haiku: 0 (reference point)
- gemini_2_0_flash: 0.0014 [-0.010, 0.013]
- deepseek_v3: 0.0047 [-0.005, 0.014]
- llama_4_maverick: 0.0076 [-0.002, 0.017]
- llama_4_scout: 0.0113 [0.001, 0.022]

System prompts:
- critical: -0.0029 [-0.009, 0.003]
- none: 0 (reference point)
- confirmatory: 0.0179 [0.012, 0.024]
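
The offsets above only become slopes once they are added to the baseline coefficient on `prior`, which is 0.0920 in the regression table of this section when every factor sits at its reference level. A small sketch of that arithmetic follows; the `prior_slope` helper and the dictionary layout are hypothetical, and the numbers are copied from this section.

```python
# Slope of delta on prior when every factor is at its reference level
# (CMV, CoT, claude_3_5_haiku, no system prompt), taken from the OLS table.
BASE_PRIOR_SLOPE = 0.0920

# Interaction offsets (the ":prior" terms), copied from the lists above.
OFFSETS = {
    "domain": {"Forecasting": -0.0590, "CMV": 0.0, "OpenReview": 0.0076},
    "reasoning_mode": {"SelfDebate": -0.0249, "CoT": 0.0},
    "model": {
        "deepseek_r1": -0.0291, "gpt_4o": -0.0033, "claude_3_5_haiku": 0.0,
        "gemini_2_0_flash": 0.0014, "deepseek_v3": 0.0047,
        "llama_4_maverick": 0.0076, "llama_4_scout": 0.0113,
    },
    "prompt": {"critical": -0.0029, "none": 0.0, "confirmatory": 0.0179},
}

def prior_slope(domain="CMV", reasoning_mode="CoT",
                model="claude_3_5_haiku", prompt="none"):
    """Implied slope of delta on prior for one setup (hypothetical helper)."""
    return (BASE_PRIOR_SLOPE
            + OFFSETS["domain"][domain]
            + OFFSETS["reasoning_mode"][reasoning_mode]
            + OFFSETS["model"][model]
            + OFFSETS["prompt"][prompt])

print(f"Forecasting:  {prior_slope(domain='Forecasting'):.4f}")
print(f"confirmatory: {prior_slope(prompt='confirmatory'):.4f}")
```

The offsets are additive in this model, so a setup that differs from the reference in several factors accumulates all of the corresponding terms.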

Fit with `CMV`, `CoT`, `claude_3_5_haiku`, and the `none` prompt as reference levels:

```
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  delta   R-squared:                       0.062
Model:                            OLS   Adj. R-squared:                  0.062
Method:                 Least Squares   F-statistic:                     158.5
Date:                Wed, 14 May 2025   Prob (F-statistic):               0.00
Time:                        02:20:23   Log-Likelihood:                 74833.
No. Observations:               55175   AIC:                        -1.496e+05
Df Residuals:                   55151   BIC:                        -1.494e+05
Df Model:                          23                                         
Covariance Type:            nonrobust                                         
===========================================================================================================
                                              coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------------
Intercept                                  -0.0417      0.002    -17.742      0.000      -0.046      -0.037
domain_Forecasting[T.True]                  0.0290      0.001     19.792      0.000       0.026       0.032
domain_OpenReview[T.True]                  -0.0100      0.002     -4.609      0.000      -0.014      -0.006
reasoning_mode_SelfDebate[T.True]           0.0114      0.003      4.362      0.000       0.006       0.017
model_deepseek_r1[T.True]                   0.0105      0.002      4.490      0.000       0.006       0.015
model_deepseek_v3[T.True]                  -0.0027      0.003     -0.998      0.318      -0.008       0.003
model_gemini_2_0_flash[T.True]             -0.0065      0.003     -2.151      0.032      -0.012      -0.001
model_gpt_4o[T.True]                        0.0027      0.003      0.986      0.324      -0.003       0.008
model_llama_4_maverick[T.True]             -0.0033      0.003     -1.232      0.218      -0.009       0.002
model_llama_4_scout[T.True]                -0.0050      0.003     -1.747      0.081      -0.011       0.001
prompt_confirmatory[T.True]                -0.0054      0.002     -3.268      0.001      -0.009      -0.002
prompt_critical[T.True]                     0.0025      0.002      1.498      0.134      -0.001       0.006
prior                                       0.0920      0.004     21.660      0.000       0.084       0.100
domain_Forecasting[T.True]:prior           -0.0590      0.003    -21.954      0.000      -0.064      -0.054
domain_OpenReview[T.True]:prior             0.0076      0.004      1.898      0.058      -0.000       0.015
reasoning_mode_SelfDebate[T.True]:prior    -0.0249      0.005     -4.948      0.000      -0.035      -0.015
model_deepseek_r1[T.True]:prior            -0.0291      0.004     -6.936      0.000      -0.037      -0.021
model_deepseek_v3[T.True]:prior             0.0047      0.005      0.962      0.336      -0.005       0.014
model_gemini_2_0_flash[T.True]:prior        0.0014      0.006      0.247      0.805      -0.010       0.013
model_gpt_4o[T.True]:prior                 -0.0033      0.005     -0.661      0.508      -0.013       0.007
model_llama_4_maverick[T.True]:prior        0.0076      0.005      1.538      0.124      -0.002       0.017
model_llama_4_scout[T.True]:prior           0.0113      0.005      2.158      0.031       0.001       0.022
prompt_confirmatory[T.True]:prior           0.0179      0.003      5.867      0.000       0.012       0.024
prompt_critical[T.True]:prior              -0.0029      0.003     -0.937      0.349      -0.009       0.003
==============================================================================
Omnibus:                     7174.965   Durbin-Watson:                   1.838
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            82316.947
Skew:                          -0.174   Prob(JB):                         0.00
Kurtosis:                       8.974   Cond. No.                         61.1
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```

Same fit with `confirmatory` as the prompt reference level (all other reference levels unchanged):

```
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  delta   R-squared:                       0.062
Model:                            OLS   Adj. R-squared:                  0.062
Method:                 Least Squares   F-statistic:                     158.5
Date:                Wed, 14 May 2025   Prob (F-statistic):               0.00
Time:                        00:55:30   Log-Likelihood:                 74833.
No. Observations:               55175   AIC:                        -1.496e+05
Df Residuals:                   55151   BIC:                        -1.494e+05
Df Model:                          23                                         
Covariance Type:            nonrobust                                         
===========================================================================================================
                                              coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------------
Intercept                                  -0.0472      0.002    -20.852      0.000      -0.052      -0.043
domain_Forecasting[T.True]                  0.0290      0.001     19.792      0.000       0.026       0.032
domain_OpenReview[T.True]                  -0.0100      0.002     -4.609      0.000      -0.014      -0.006
reasoning_mode_SelfDebate[T.True]           0.0114      0.003      4.362      0.000       0.006       0.017
model_deepseek_r1[T.True]                   0.0105      0.002      4.490      0.000       0.006       0.015
model_deepseek_v3[T.True]                  -0.0027      0.003     -0.998      0.318      -0.008       0.003
model_gemini_2_0_flash[T.True]             -0.0065      0.003     -2.151      0.032      -0.012      -0.001
model_gpt_4o[T.True]                        0.0027      0.003      0.986      0.324      -0.003       0.008
model_llama_4_maverick[T.True]             -0.0033      0.003     -1.232      0.218      -0.009       0.002
model_llama_4_scout[T.True]                -0.0050      0.003     -1.747      0.081      -0.011       0.001
prompt_critical[T.True]                     0.0079      0.002      4.734      0.000       0.005       0.011
prompt_none[T.True]                         0.0054      0.002      3.268      0.001       0.002       0.009
prior                                       0.1099      0.004     27.089      0.000       0.102       0.118
domain_Forecasting[T.True]:prior           -0.0590      0.003    -21.954      0.000      -0.064      -0.054
domain_OpenReview[T.True]:prior             0.0076      0.004      1.898      0.058      -0.000       0.015
reasoning_mode_SelfDebate[T.True]:prior    -0.0249      0.005     -4.948      0.000      -0.035      -0.015
model_deepseek_r1[T.True]:prior            -0.0291      0.004     -6.936      0.000      -0.037      -0.021
model_deepseek_v3[T.True]:prior             0.0047      0.005      0.962      0.336      -0.005       0.014
model_gemini_2_0_flash[T.True]:prior        0.0014      0.006      0.247      0.805      -0.010       0.013
model_gpt_4o[T.True]:prior                 -0.0033      0.005     -0.661      0.508      -0.013       0.007
model_llama_4_maverick[T.True]:prior        0.0076      0.005      1.538      0.124      -0.002       0.017
model_llama_4_scout[T.True]:prior           0.0113      0.005      2.158      0.031       0.001       0.022
prompt_critical[T.True]:prior              -0.0208      0.003     -6.811      0.000      -0.027      -0.015
prompt_none[T.True]:prior                  -0.0179      0.003     -5.867      0.000      -0.024      -0.012
==============================================================================
Omnibus:                     7174.965   Durbin-Watson:                   1.838
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            82316.947
Skew:                          -0.174   Prob(JB):                         0.00
Kurtosis:                       8.974   Cond. No.                         60.9
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```
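
The two tables in this section describe the same fit under different prompt reference levels (`none` in the first, `confirmatory` in the second), which is why R-squared, the F-statistic, and every non-prompt row are identical. The sketch below checks that the prompt-related rows of the second table follow from the first by re-basing. The coefficient values are copied from the tables; the dictionary keys and the `rebase_to_confirmatory` helper are hypothetical.

```python
# Prompt-related coefficients with `none` as the reference (first table).
none_ref = {
    "Intercept": -0.0417,
    "prompt_confirmatory": -0.0054,
    "prompt_critical": 0.0025,
    "prior": 0.0920,
    "prompt_confirmatory:prior": 0.0179,
    "prompt_critical:prior": -0.0029,
}

# The same coefficients with `confirmatory` as the reference (second table).
conf_ref = {
    "Intercept": -0.0472,
    "prompt_critical": 0.0079,
    "prompt_none": 0.0054,
    "prior": 0.1099,
    "prompt_critical:prior": -0.0208,
    "prompt_none:prior": -0.0179,
}

def rebase_to_confirmatory(t):
    """Re-express treatment-coded terms relative to the confirmatory prompt."""
    shift, shift_prior = t["prompt_confirmatory"], t["prompt_confirmatory:prior"]
    return {
        "Intercept": t["Intercept"] + shift,
        "prompt_critical": t["prompt_critical"] - shift,
        "prompt_none": -shift,
        "prior": t["prior"] + shift_prior,
        "prompt_critical:prior": t["prompt_critical:prior"] - shift_prior,
        "prompt_none:prior": -shift_prior,
    }

rebased = rebase_to_confirmatory(none_ref)
max_gap = max(abs(rebased[k] - conf_ref[k]) for k in conf_ref)
print(f"largest gap between rebased and reported values: {max_gap:.4f}")
```

Any remaining gap is at the level of the rounding in the printed tables (the intercept is the only term that does not match to four decimals).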

## Per-Setup

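Regression of per-setup `bias` on the setup factors (reference levels: `CMV`, `CoT`, `claude_3_5_haiku`, `confirmatory` prompt):

```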
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   bias   R-squared:                       0.371
Model:                            OLS   Adj. R-squared:                  0.307
Method:                 Least Squares   F-statistic:                     5.747
Date:                Tue, 13 May 2025   Prob (F-statistic):           3.14e-07
Time:                        19:42:00   Log-Likelihood:                 180.59
No. Observations:                 119   AIC:                            -337.2
Df Residuals:                     107   BIC:                            -303.8
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
=====================================================================================================
                                        coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------
Intercept                             0.0833      0.020      4.091      0.000       0.043       0.124
domain_Forecasting[T.True]           -0.0723      0.012     -5.879      0.000      -0.097      -0.048
domain_OpenReview[T.True]             0.0137      0.013      1.060      0.292      -0.012       0.039
reasoning_mode_SelfDebate[T.True]    -0.0021      0.010     -0.201      0.841      -0.022       0.018
model_deepseek_r1[T.True]            -0.0137      0.022     -0.629      0.531      -0.057       0.029
model_deepseek_v3[T.True]            -0.0130      0.022     -0.596      0.553      -0.056       0.030
model_gemini_2_0_flash[T.True]       -0.0010      0.022     -0.047      0.962      -0.044       0.042
model_gpt_4o[T.True]                  0.0053      0.022      0.245      0.807      -0.038       0.048
model_llama_4_maverick[T.True]        0.0315      0.022      1.449      0.150      -0.012       0.075
model_llama_4_scout[T.True]           0.0041      0.022      0.189      0.851      -0.039       0.047
prompt_critical[T.True]               0.0020      0.013      0.160      0.873      -0.023       0.027
prompt_none[T.True]                   0.0033      0.013      0.263      0.793      -0.022       0.028
==============================================================================
Omnibus:                       29.371   Durbin-Watson:                   2.089
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               74.969
Skew:                          -0.911   Prob(JB):                     5.26e-17
Kurtosis:                       6.435   Cond. No.                         12.7
==============================================================================
```
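
As a consistency check on the summary headers, adjusted R-squared and the F-statistic can be recomputed from R-squared, the number of observations, and the model degrees of freedom using the standard identities for OLS with an intercept. This is a sketch using only the rounded values printed in the tables above; the helper names are hypothetical.

```python
def adjusted_r2(r2, n_obs, df_model):
    """Adjusted R-squared for an OLS fit that includes an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - df_model - 1)

def f_statistic(r2, n_obs, df_model):
    """Overall F-statistic implied by R-squared and the degrees of freedom."""
    return (r2 / df_model) / ((1.0 - r2) / (n_obs - df_model - 1))

# Header values as printed in the per-trajectory and per-setup tables.
reported = {
    "per_trajectory": {"r2": 0.062, "n_obs": 55175, "df_model": 23,
                       "adj_r2": 0.062, "f": 158.5},
    "per_setup": {"r2": 0.371, "n_obs": 119, "df_model": 11,
                  "adj_r2": 0.307, "f": 5.747},
}

checks = {}
for name, s in reported.items():
    adj = adjusted_r2(s["r2"], s["n_obs"], s["df_model"])
    f = f_statistic(s["r2"], s["n_obs"], s["df_model"])
    checks[name] = (adj, f)
    print(f"{name}: adj R2 {adj:.3f} (reported {s['adj_r2']}), "
          f"F {f:.2f} (reported {s['f']})")
```

Small differences from the reported values are expected because R-squared is only available to three decimals here. With 12 parameters on 119 observations, the per-setup adjustment is substantial, whereas it is negligible for the per-trajectory fit.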