Model,All scores,Noiseless scores,All scores (code),Noiseless scores (code)
o1-mini,0.014516685153225017,0.01731867083489313,nan,nan
Claude 3.7 Sonnet,0.019217893979644372,0.02678834384267479,nan,nan
Kimi K2 (*),0.020721137027420965,0.02058849141691482,nan,nan
gpt-oss-20b (*),0.027452410041519142,0.02719665382222392,nan,nan
GLM-4.5 (*),0.033122295709202644,0.04343093905025202,nan,nan
Gemini 2.0 Flash,0.041075868669549924,0.058957958024969885,nan,nan
o3-mini (low),0.04163729510721105,0.05102871734251289,nan,nan
gpt-oss-120b (*),0.04943059234644526,0.07139570603100144,nan,nan
o4-mini (low),0.05321589749115431,0.0718387529652477,0.14243130184701674,0.2028722810736803
DeepSeek-R1 (*),0.05770296444805328,0.06996589173669451,nan,nan
o1 (low),0.059847150009800364,0.05150959411774659,nan,nan
o1 (medium),0.06444772350260702,0.07491137079260529,nan,nan
o3-mini (medium),0.0687497684430072,0.10364227348451302,nan,nan
o1 (high),0.08014719867922207,0.1069105392571279,nan,nan
o3-mini (high),0.08582831071830922,0.12093434676079962,nan,nan
Qwen3 (*),0.088755042256967,0.1815157219133473,nan,nan
o4-mini (medium),0.10281053586075617,0.1479085461168566,0.20831325956879534,0.3708145261788344
o3 (low),0.10584600519033364,0.18378666197030466,0.16739193601027585,0.26285941004413516
Gemini 2.5,0.1133841876439464,0.20730693744221618,nan,nan
o4-mini (high),0.11732023423530111,0.19094995133220913,0.20094673921261724,0.38953277251654833
o3-pro,0.12023092035263966,0.2084896377361535,nan,nan
o3 (medium),0.12217163432541356,0.1937044344219018,0.18452746744091936,0.307481805795016
o3 (high),0.131977506228661,0.19381471179491236,0.23186881262018919,0.3965950473697276
Code (human-written),0.37613125139229053,0.5518314156862899,nan,nan
