================================================================================
Seed Variation Analysis Report
================================================================================

Models tested: 8
Seeds used: [31, 32, 33]
Temperature: 0.7 (default)

================================================================================
MCQ Setting Results (Mean ± Std across 3 seeds)
================================================================================

Model                                   Q7 Accuracy           Q7 F1 (macro)         Over-Disclosure (%)   Under-Disclosure (%)
------------------------------------------------------------------------------------------------------------------------------
gemini-2.5-flash                        N/A                   N/A                   N/A                   N/A
gpt-5                                   N/A                   N/A                   N/A                   N/A
o3                                      N/A                   N/A                   N/A                   N/A
o4-mini                                 N/A                   N/A                   N/A                   N/A
gpt-4.1                                 N/A                   N/A                   N/A                   N/A
gpt-4.1-mini                            N/A                   N/A                   N/A                   N/A
gpt-4o                                  N/A                   N/A                   N/A                   N/A
Llama-4-Maverick-17B-128E-Instruct-FP8  N/A                   N/A                   N/A                   N/A

================================================================================
Free-form (Vanilla Prompting) Setting Results (Mean ± Std across 3 seeds)
================================================================================

Model                                   Q7-label Accuracy     Q7-label F1 (macro)   Over-Disclosure (%)   Under-Disclosure (%)
------------------------------------------------------------------------------------------------------------------------------
gemini-2.5-flash                        0.4748±0.0030         0.4023±0.0016         46.00±0.25%           6.52±0.51%
gpt-5                                   0.4292±0.0041         0.3264±0.0031         51.55±0.67%           5.53±0.30%
o3                                      0.4444±0.0061         0.3749±0.0060         46.11±0.59%           9.45±0.18%
o4-mini                                 0.4422±0.0151         0.3624±0.0142         47.65±0.43%           8.14±1.70%
gpt-4.1                                 0.4584±0.0041         0.3965±0.0040         44.41±0.56%           9.75±0.69%
gpt-4.1-mini                            0.5024±0.0082         0.4781±0.0099         30.66±0.47%           19.09±0.69%
gpt-4o                                  0.4711±0.0068         0.4113±0.0078         43.76±0.43%           9.13±0.26%
Llama-4-Maverick-17B-128E-Instruct-FP8  0.4124±0.0192         0.4086±0.0179         31.37±0.93%           27.40±1.46%
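
Note on the "Mean ± Std" cells: the sketch below shows one way such a cell
could be produced from per-seed scores. It is an illustration only, not the
script that generated this report. The function name `summarize`, the use of
the sample standard deviation (n - 1 denominator), and the "N/A" fallback for
missing runs are all assumptions; the report does not state which standard
deviation estimator was used or why the MCQ cells are N/A. The input values
in the usage comment are made up and are not taken from the tables above.

```python
from statistics import mean, stdev


def summarize(values, as_percent=False):
    """Format per-seed scores as 'mean±std' (hypothetical helper).

    Assumptions, not confirmed by the report:
      - std is the sample standard deviation (n - 1 denominator);
      - an empty list of scores is rendered as 'N/A'.
    """
    if not values:
        return "N/A"
    m = mean(values)
    # stdev() needs at least two points; report 0 spread for a single seed.
    s = stdev(values) if len(values) > 1 else 0.0
    if as_percent:
        return f"{m:.2f}±{s:.2f}%"
    return f"{m:.4f}±{s:.4f}"


# Usage with made-up scores for three seeds:
#   summarize([0.5, 0.5, 0.5])                   # accuracy-style cell
#   summarize([1.0, 2.0, 3.0], as_percent=True)  # percentage-style cell
#   summarize([])                                # no runs available
```

With only 3 seeds the choice of estimator matters: the population standard
deviation (statistics.pstdev) would be smaller than the sample value by a
factor of sqrt(2/3), so the estimator should be stated when comparing spreads
across reports.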

================================================================================
End of Report
================================================================================
