## Overall Comparison

Overall rubric score for each system on each topic.

| Topic | agents4sci_v2 | baseline_single | baseline_tree | baseline_debate |
|---|---:|---:|---:|---:|
| what_if_a_persistent_global_cloud | 90.00 | 75.00 | 85.00 | 79.00 |
| what_if_a_supervolcano_erupted_next | 92.00 | 80.00 | 88.00 | 76.00 |
| what_if_all_insects_disappeared | 85.00 | 84.00 | 90.00 | 87.00 |
| what_if_earths_axial_tilt_increased | 92.00 | 87.00 | 87.00 | 86.00 |
| what_if_earths_magnetic_field_collapsed | 89.00 | 84.00 | 82.00 | 78.00 |
| what_if_earths_rotation_slowed_to | 90.00 | 79.00 | 88.00 | 80.00 |
| what_if_global_sea_levels_rose | 92.00 | 68.00 | 83.00 | 80.00 |
| what_if_gravity_became_ten_times | 90.00 | 83.00 | 75.00 | 92.00 |
| what_if_oxygen_levels_rose_to | 79.00 | 82.00 | 89.00 | 81.00 |
| what_if_the_moon_disappeared | 96.00 | 88.00 | 88.00 | 77.00 |

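### Macro Average by Metric (five-dimension rubric; mean ± std across topics)

Each cell is the mean ± standard deviation over the 10 topics above. For every system, the `overall` row equals the sum of the five dimension rows.

| Metric | agents4sci_v2 | baseline_single | baseline_tree | baseline_debate |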
|---|---:|---:|---:|---:|
| rigor_traceability | 22.20 ± 2.32 | 20.60 ± 1.62 | 21.90 ± 1.14 | 21.15 ± 1.84 |
| integration_causality | 22.70 ± 1.10 | 21.60 ± 1.20 | 22.30 ± 0.78 | 21.85 ± 1.34 |
| feasibility_minimality | 17.80 ± 0.75 | 17.10 ± 1.22 | 16.60 ± 2.94 | 15.90 ± 0.70 |
| uncertainty_adaptation | 13.20 ± 0.40 | 11.90 ± 0.70 | 12.90 ± 0.70 | 12.20 ± 1.08 |
| decisionability | 13.60 ± 0.92 | 9.80 ± 1.94 | 11.80 ± 0.75 | 10.50 ± 1.80 |
| overall | 89.50 ± 4.39 | 81.00 ± 5.64 | 85.50 ± 4.27 | 81.60 ± 4.84 |
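The `overall` row can be reproduced from the per-topic table. The sketch below assumes the reported ± is the population standard deviation (ddof=0), because that choice reproduces the published values while the sample standard deviation does not. It also checks that the five dimension means add up to the overall mean. The names `OVERALL`, `DIMENSION_MEANS`, and `macro_summary` are hypothetical and are not taken from the evaluation code.

```python
from statistics import mean, pstdev

# Per-topic overall scores, copied from the Overall Comparison table
# (same topic order as the table rows).
OVERALL = {
    "agents4sci_v2": [90, 92, 85, 92, 89, 90, 92, 90, 79, 96],
    "baseline_single": [75, 80, 84, 87, 84, 79, 68, 83, 82, 88],
    "baseline_tree": [85, 88, 90, 87, 82, 88, 83, 75, 89, 88],
    "baseline_debate": [79, 76, 87, 86, 78, 80, 80, 92, 81, 77],
}

# Macro-averaged dimension scores, in rubric order: rigor_traceability,
# integration_causality, feasibility_minimality, uncertainty_adaptation,
# decisionability.
DIMENSION_MEANS = {
    "agents4sci_v2": [22.20, 22.70, 17.80, 13.20, 13.60],
    "baseline_single": [20.60, 21.60, 17.10, 11.90, 9.80],
    "baseline_tree": [21.90, 22.30, 16.60, 12.90, 11.80],
    "baseline_debate": [21.15, 21.85, 15.90, 12.20, 10.50],
}


def macro_summary(scores):
    """Mean and population standard deviation (ddof=0), rounded to 2 decimals."""
    return round(mean(scores), 2), round(pstdev(scores), 2)


if __name__ == "__main__":
    for system, scores in OVERALL.items():
        avg, std = macro_summary(scores)
        # Sum of the five macro-averaged dimensions; should equal the overall mean.
        dim_total = round(sum(DIMENSION_MEANS[system]), 2)
        print(f"{system}: overall {avg:.2f} ± {std:.2f}; dimension sum {dim_total:.2f}")
```

The first printed line is `agents4sci_v2: overall 89.50 ± 4.39; dimension sum 89.50`. The remaining systems likewise match their `overall` entries.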