id,task,model,Best score (across scorers),Scores,log viewer,Log download
mTa4f52zsEuxdrPy7WfQvX,GPQA diamond,claude-2.0,0.3465909090909091,choice:0.35±0.03,XXXX-viewer/c79c08da/viewer.html?log_file=https%3A%2F%2Flogs.epoch.ai%2Finspect_ai_logs%2FmTa4f52zsEuxdrPy7WfQvX.eval,XXXX
Gs4vxvDD5wmrSvyTkqx2YV,MATH level 5,claude-2.0,0.1172583081570997,"normalized_string_match:0.1±0.01,sympy_equiv:0.1±0.01,model_graded_equiv:0.12±0.01",,
eMAisrJd7fAUHPfcPWXeXo,GPQA diamond,claude-2.1,0.3295454545454545,choice:0.33±0.02,XXXX
78Wpvp8WDByAcQRBas72cs,GPQA diamond,claude-3-5-sonnet-20240620,0.5404040404040404,choice:0.54±0.03,XXXX
H42HJG29bwqzaMewTMsmky,MATH level 5,claude-3-5-sonnet-20240620,0.5168051359516617,"normalized_string_match:0.44±0.01,sympy_equiv:0.47±0.01,model_graded_equiv:0.52±0.01",,
JCCVcVLz3J4TQXSU67hMJZ,GPQA diamond,claude-3-5-sonnet-20241022,0.553030303030303,choice:0.55±0.03,XXXX
AjroyuZECXLtMTwnzaMHtX,MATH level 5,claude-3-5-sonnet-20241022,0.5694864048338368,"normalized_string_match:0.49±0.01,sympy_equiv:0.52±0.01,model_graded_equiv:0.57±0.01",,
T3FWdCCWBryVjxwTcoY5bK,GPQA diamond,claude-3-haiku-20240307,0.3630050505050505,choice:0.36±0.02,XXXX
QSnWVKUVj9six4jYj4HJD9,MATH level 5,claude-3-haiku-20240307,0.1487915407854985,"normalized_string_match:0.13±0.01,sympy_equiv:0.13±0.01,model_graded_equiv:0.15±0.01",,
hTnDa4gn8RYduCEYqoNywe,GPQA diamond,claude-3-opus-20240229,0.4715909090909091,choice:0.47±0.03,XXXX
6SNzvpd6kXXxqpJoaHsSDd,MATH level 5,claude-3-opus-20240229,0.3748111782477341,"normalized_string_match:0.35±0.01,sympy_equiv:0.36±0.01,model_graded_equiv:0.37±0.01",,
QLejPZXhjX6SQCQgzjNvCG,GPQA diamond,claude-3-sonnet-20240229,0.4059343434343434,choice:0.41±0.02,XXXX
FD2ztfjFUeqGyrtF8yuCUB,MATH level 5,claude-3-sonnet-20240229,0.1817409365558912,"normalized_string_match:0.15±0.01,sympy_equiv:0.16±0.01,model_graded_equiv:0.18±0.01",,
Hc2ycMP6DyCZxCPjKbZFQU,GPQA diamond,gemini-1.5-flash-001,0.4037247474747474,choice:0.4±0.02,XXXX
CPpd4hqQHtZ5pUejBELwKb,MATH level 5,gemini-1.5-flash-001,0.2512273413897281,"normalized_string_match:0.24±0.01,sympy_equiv:0.23±0.01,model_graded_equiv:0.25±0.01",,
PCLvrs4TEBtqQqcYeU8pAG,GPQA diamond,gemini-1.5-flash-002,0.4731691919191919,choice:0.47±0.03,XXXX
kDnfjYKQpnLGEjuhBc9AvM,MATH level 5,gemini-1.5-flash-002,0.6186744712990937,"normalized_string_match:0.6±0.01,sympy_equiv:0.59±0.01,model_graded_equiv:0.62±0.01",,
PKdqpKZ9tqDWyWJ7GksBWz,GPQA diamond,gemini-1.5-pro-001,0.4586489898989899,choice:0.46±0.03,XXXX
PxLmyq7PFfARAuqxg5oPtE,MATH level 5,gemini-1.5-pro-001,0.4074773413897281,"normalized_string_match:0.38±0.01,sympy_equiv:0.37±0.01,model_graded_equiv:0.41±0.01",,
9TrLEEw8TvsyugyCjwvUvD,GPQA diamond,gemini-1.5-pro-002,0.5722853535353535,choice:0.57±0.03,XXXX
QyqLmgcJpeDMsnQyaaJVkR,MATH level 5,gemini-1.5-pro-002,0.7039274924471299,"normalized_string_match:0.67±0.01,sympy_equiv:0.67±0.01,model_graded_equiv:0.7±0.01",,
fQjoHKyUEPAuwdrBGfL8SG,GPQA diamond,gemma-2-9b-it,0.2746212121212121,choice:0.27±0.02,XXXX
ZUJxsJDpHnLXxNiDcvj22W,GPQA diamond,gemma-2-27b-it,0.3648989898989899,choice:0.36±0.02,XXXX
eZaQhpbhbzbo3sbKGS5Efa,GPQA diamond,gpt-3.5-turbo-0125,0.271780303030303,choice:0.27±0.02,XXXX
9tSgxGV3nzhTBpq7GcX7yx,MATH level 5,gpt-3.5-turbo-0125,0.1163141993957704,"normalized_string_match:0.1±0,sympy_equiv:0.11±0.01,model_graded_equiv:0.12±0.01",,
enhRcQmfN6RFB2zu7R3Uxr,GPQA diamond,gpt-3.5-turbo-1106,0.2803030303030303,choice:0.28±0.02,XXXX
cJWh887zSiTtnxm9XAJVkr,MATH level 5,gpt-3.5-turbo-1106,0.158893504531722,"normalized_string_match:0.14±0.01,sympy_equiv:0.15±0.01,model_graded_equiv:0.16±0.01",,
4j3WRhhruXrfaq9Jki3NDc,GPQA diamond,gpt-4-0125-preview,0.4226641414141414,choice:0.42±0.03,XXXX
fu7pWqbSC4MUBdwv3GFNa5,GPQA diamond,gpt-4-0613,0.3065025252525252,choice:0.31±0.02,XXXX
WsSrH8oHBm7m7vq9yTRqSC,MATH level 5,gpt-4-0613,0.2297016616314199,"normalized_string_match:0.2±0.01,sympy_equiv:0.2±0.01,model_graded_equiv:0.23±0.01",,
AWewipyfSQb8hXtS8d4fNL,GPQA diamond,gpt-4-1106-preview,0.4236111111111111,choice:0.42±0.02,XXXX
VBStRVGFi47EVz2pn33xu4,MATH level 5,gpt-4-1106-preview,0.4002077039274924,"normalized_string_match:0.37±0.01,sympy_equiv:0.38±0.01,model_graded_equiv:0.4±0.01",,
Y59q8WbEDp2HH8oovyJCj9,GPQA diamond,gpt-4o-2024-05-13,0.4889520202020202,choice:0.49±0.03,XXXX
Uqsvq9TAGoqD2xpKfhHBmS,MATH level 5,gpt-4o-2024-05-13,0.5104796072507553,"normalized_string_match:0.47±0.01,sympy_equiv:0.47±0.01,model_graded_equiv:0.51±0.01",,
fairJpL9VekDBKxYHG9ZoG,GPQA diamond,gpt-4o-2024-08-06,0.4921085858585858,choice:0.49±0.03,XXXX
ZQgz8AGm4KHin9MFupLu58,MATH level 5,gpt-4o-2024-08-06,0.5327605740181269,"normalized_string_match:0.49±0.01,sympy_equiv:0.5±0.01,model_graded_equiv:0.53±0.01",,
L9yfwZC3snsKZshBXTaXjc,GPQA diamond,gpt-4o-mini-2024-07-18,0.3772095959595959,choice:0.38±0.02,XXXX
5iyKgWVhAmNwrA5rNL7MbW,MATH level 5,gpt-4o-mini-2024-07-18,0.5263406344410876,"normalized_string_match:0.48±0.01,sympy_equiv:0.48±0.01,model_graded_equiv:0.53±0.01",,
83DRJD4FvjwqU9eStRaNiU,GPQA diamond,grok-2-1212,0.5378787878787878,choice:0.54±0.03,XXXX
DHdeZNynu4VtRp2VGkCGvd,GPQA diamond,Hermes-2-Theta-Llama-3-70B,0.3746843434343434,choice:0.37±0.02,XXXX
ZAVTwsDqrvbWbqSgeNdXFd,MATH level 5,Hermes-2-Theta-Llama-3-70B,0.226869335347432,"normalized_string_match:0.2±0.01,sympy_equiv:0.2±0.01,model_graded_equiv:0.23±0.01",,
evPjJw9SKqq3TieSdxrSQn,GPQA diamond,Llama-2-70b-chat-hf,0.2632575757575757,choice:0.26±0.02,XXXX
4bkXhqCNoe6RpTe9taJ7kR,MATH level 5,Llama-2-70b-chat-hf,0.0328549848942598,"normalized_string_match:0.02±0,sympy_equiv:0.03±0,model_graded_equiv:0.03±0",,
WWS3aotkrymLmBBDhhtvWQ,GPQA diamond,Llama-3.1-8B-Instruct,0.2594696969696969,choice:0.26±0.02,XXXX
gyxR3YTcmh4UcKZkET5pPX,MATH level 5,Llama-3.1-8B-Instruct,0.2287575528700906,"normalized_string_match:0.21±0.01,sympy_equiv:0.21±0.01,model_graded_equiv:0.23±0.01",,
MEwMcmSV32JHrsNoAq4BFL,GPQA diamond,Llama-3.1-70B-Instruct,0.4419191919191919,choice:0.44±0.02,XXXX
6JiXiQ67iyiC7cFFRiogvs,MATH level 5,Llama-3.1-70B-Instruct,0.366786253776435,"normalized_string_match:0.34±0.01,sympy_equiv:0.34±0.01,model_graded_equiv:0.37±0.01",,
S5QYXSvQBRSbUbXSnAGbMm,GPQA diamond,Llama-3.1-405B-Instruct,0.5091540404040404,choice:0.51±0.03,XXXX
ZR7amdUBRr5Cke8Aoz4snr,MATH level 5,Llama-3.1-405B-Instruct,0.4977341389728096,"normalized_string_match:0.47±0.01,sympy_equiv:0.47±0.01,model_graded_equiv:0.5±0.01",,
iT3BBFWjsZk2rKHKeJDsM8,GPQA diamond,Meta-Llama-3-8B-Instruct,0.2607323232323232,choice:0.26±0.02,XXXX
BURdKU7mhUX4YMPUDoFsed,MATH level 5,Meta-Llama-3-8B-Instruct,0.0612726586102719,"normalized_string_match:0.04±0,sympy_equiv:0.05±0,model_graded_equiv:0.06±0",,
jrvutCoFFG4KxYtU7REdpe,GPQA diamond,Meta-Llama-3-70B-Instruct,0.4056186868686868,choice:0.41±0.03,XXXX
dekbZbU3BtiQZNKG8Txaax,MATH level 5,Meta-Llama-3-70B-Instruct,0.225547583081571,"normalized_string_match:0.21±0.01,sympy_equiv:0.21±0.01,model_graded_equiv:0.23±0.01",,
4NiUpCrLfz3oFaerkrMeBt,GPQA diamond,o1-mini-2024-09-12_medium,0.5950126262626263,choice:0.6±0.03,XXXX
FaRZDKSriUbS3Brt6qprP9,MATH level 5,o1-mini-2024-09-12_medium,0.8425226586102719,"normalized_string_match:0.78±0.01,sympy_equiv:0.77±0.01,model_graded_equiv:0.84±0.01",,
7CkADTsoNbDd88uqYbqfwj,GPQA diamond,o1-preview-2024-09-12,0.5031565656565656,choice:0.5±0.03,XXXX
9K6c9b6v9zQrQYtw827BDn,GPQA diamond,qwen1.5-32b-chat,0.307449494949495,choice:0.31±0.02,XXXX
NeqeQSxHmdUNLk9Ytpo6qo,GPQA diamond,qwen1.5-72b-chat,0.2881944444444444,choice:0.29±0.02,XXXX
Pex6SpYJX8ZPySncfNNy7H,GPQA diamond,qwen2-72b-instruct,0.4078282828282828,choice:0.41±0.03,XXXX
Jwi8JTeeTxJuJY8qidxngw,GPQA diamond,Yi-1.5-34B-Chat,0.319760101010101,choice:0.32±0.02,XXXX
bH5kVfLE53CxY3qWmB9C5K,MATH level 5,Yi-1.5-34B-Chat,0.2548149546827795,"normalized_string_match:0.24±0.01,sympy_equiv:0.24±0.01,model_graded_equiv:0.25±0.01",,
WxMpEBcTf8qPFPz4BrF86B,GPQA diamond,Yi-34B-Chat,0.1474116161616161,choice:0.15±0.01,XXXX
nzy4SGSeKGzQSKJLWExKpV,MATH level 5,Yi-34B-Chat,0.0514539274924471,"normalized_string_match:0.04±0,sympy_equiv:0.04±0,model_graded_equiv:0.05±0",,
c4V7sc8TUNJ8wQKRhuA2Rv,MATH level 5,gpt-4-0125-preview,0.3541351963746224,"normalized_string_match:0.33±0.01,sympy_equiv:0.33±0.01,model_graded_equiv:0.35±0.01",,
YbQkMPC2GtNLhxqTEhhVsa,GPQA diamond,Mistral-7B-Instruct-v0.3,0.1518308080808081,choice:0.15±0.01,XXXX
82RTgqgiC2UuM5M37L3WWM,GPQA diamond,deepseek-llm-67b-chat,0.2462121212121212,choice:0.25±0.01,XXXX
Sn2qbB2HiySe7NcG28jnyM,GPQA diamond,Mixtral-8x7B-Instruct-v0.1,0.3058712121212121,choice:0.31±0.02,XXXXct-viewer/c79c08da/viewer.html?log_file=https%3A%2F%2Flogs.epoch.ai%2Finspect_ai_logs%2FSn2qbB2HiySe7NcG28jnyM.eval,XXXX
Hdcqs42VnPNVBquXttexhq,MATH level 5,gemma-2-9b-it,0.2100641993957704,"normalized_string_match:0.18±0.01,sympy_equiv:0.18±0.01,model_graded_equiv:0.21±0.01",,
PM3GCQcVK7t2MujNp4F4kH,MATH level 5,gemma-2-27b-it,0.2788897280966767,"normalized_string_match:0.24±0.01,sympy_equiv:0.24±0.01,model_graded_equiv:0.28±0.01",,
WtALgW6VSFdihDZWXRm2AZ,GPQA diamond,qwen2.5-72b-instruct,0.4914772727272727,choice:0.49±0.03,XXXX
9mWoY9fM2ifKvDg3xaug6h,MATH level 5,qwen2-72b-instruct,0.3906722054380664,"normalized_string_match:0.37±0.01,sympy_equiv:0.36±0.01,model_graded_equiv:0.39±0.01",,
6xHApTdQAWM9e5oCvKYS3W,MATH level 5,Mistral-7B-Instruct-v0.3,0.0359705438066465,"normalized_string_match:0.02±0,sympy_equiv:0.02±0,model_graded_equiv:0.04±0",,
Cm4UkgA5TtmzQegrq3rbvm,MATH level 5,deepseek-llm-67b-chat,0.0639161631419939,"normalized_string_match:0.05±0,sympy_equiv:0.05±0,model_graded_equiv:0.06±0",,
4PTATufH2ra3guNeLwT9p9,MATH level 5,Mixtral-8x7B-Instruct-v0.1,0.0929003021148036,"normalized_string_match:0.07±0,sympy_equiv:0.08±0,model_graded_equiv:0.09±0.01",,
MQJA3RUjsQzSubGJiukZ7A,GPQA diamond,WizardLM-2-8x22B,0.4343434343434343,choice:0.43±0.02,XXXX
TTa3hu7PgRxHgizVreZsdJ,GPQA diamond,dbrx-instruct,0.3289141414141414,choice:0.33±0.03,XXXX
4bAYQuh4heJoTNhapVZLNB,MATH level 5,WizardLM-2-8x22B,0.2573640483383685,"normalized_string_match:0.22±0.01,sympy_equiv:0.23±0.01,model_graded_equiv:0.26±0.01",,
PMbefrjykQmTBpABMa7FJi,MATH level 5,dbrx-instruct,0.1165030211480362,"normalized_string_match:0.09±0.01,sympy_equiv:0.09±0.01,model_graded_equiv:0.12±0.01",,
ibpS8bs4z3wD3Hb6F3kpdV,GPQA diamond,gemini-1.0-pro-001,0.3396464646464646,choice:0.34±0.02,XXXX
iUdecHq9vU5utytzgTXzqY,MATH level 5,gemini-1.0-pro-001,0.1124433534743202,"normalized_string_match:0.1±0.01,sympy_equiv:0.1±0.01,model_graded_equiv:0.11±0.01",,
bJV4D8SEo9zMJPEYTTXJNZ,GPQA diamond,ministral-3b-2410,0.2525252525252525,choice:0.25±0.02,XXXX
DGEzZA6nvTne4MEGSuo3Tm,GPQA diamond,ministral-8b-2410,0.2714646464646464,choice:0.27±0.02,XXXX
dDeREdDJ5bVyci3APZxzWb,GPQA diamond,mistral-large-2402,0.3876262626262626,choice:0.39±0.02,XXXX
ZQPNnicTuhuscJvni85PFY,GPQA diamond,mistral-large-2407,0.4902146464646464,choice:0.49±0.03,XXXX
5XuVuHeUDekHqjUy2P8gww,GPQA diamond,open-mistral-7b,0.132260101010101,choice:0.13±0.01,XXXX
XCSWtbDquA7FUiWfd7kh9q,GPQA diamond,open-mistral-nemo-2407,0.2989267676767677,choice:0.3±0.02,XXXX
ifAM3DJBdm5eU9Y2DjaEvW,GPQA diamond,open-mixtral-8x22b,0.3405934343434343,choice:0.34±0.02,XXXX
Y6KPSujFepdwqinKLf9mDF,GPQA diamond,open-mixtral-8x7b,0.2982954545454545,choice:0.3±0.02,XXXX
YNhHxRT9Ee2uEWWJh9RXuP,MATH level 5,ministral-3b-2410,0.1444486404833836,"normalized_string_match:0.13±0.01,sympy_equiv:0.13±0.01,model_graded_equiv:0.14±0.01",,
hqdiKcvEPSRyvenUVB9qxM,MATH level 5,ministral-8b-2410,0.149358006042296,"normalized_string_match:0.14±0.01,sympy_equiv:0.13±0.01,model_graded_equiv:0.15±0.01",,
gypQecfFNK2x4pNGDtGMDv,MATH level 5,mistral-large-2402,0.2446185800604229,"normalized_string_match:0.21±0.01,sympy_equiv:0.22±0.01,model_graded_equiv:0.24±0.01",,
div9uE9eovqykqx883kSpE,MATH level 5,mistral-large-2407,0.4481684290030212,"normalized_string_match:0.41±0.01,sympy_equiv:0.41±0.01,model_graded_equiv:0.45±0.01",,
9ZQaypWLwLHbQrqTWfoEor,MATH level 5,open-mistral-7b,0.0368202416918429,"normalized_string_match:0.02±0,sympy_equiv:0.03±0,model_graded_equiv:0.04±0",,
9qCrggpoq8FjQWVQhHaGBT,MATH level 5,open-mistral-nemo-2407,0.1082892749244713,"normalized_string_match:0.1±0.01,sympy_equiv:0.1±0.01,model_graded_equiv:0.11±0.01",,
cGMb2vs9NhfkwZJsQXzLmQ,MATH level 5,open-mixtral-8x22b,0.2424471299093655,"normalized_string_match:0.22±0.01,sympy_equiv:0.22±0.01,model_graded_equiv:0.24±0.01",,
GN3Hez798byishFw3U6cpp,MATH level 5,open-mixtral-8x7b,0.0995090634441087,"normalized_string_match:0.08±0,sympy_equiv:0.08±0,model_graded_equiv:0.1±0.01",,
9SSHEP95FYVZvHDdoHSRy3,GPQA diamond,Llama-3.2-90B-Vision-Instruct,0.4103535353535353,choice:0.41±0.02,XXXX
iHRmwxsh8M3FGun8x5mxkt,MATH level 5,Llama-3.2-90B-Vision-Instruct,0.3943542296072507,"normalized_string_match:0.35±0.01,sympy_equiv:0.36±0.01,model_graded_equiv:0.39±0.01",,
5JgivNDZp2D7Lc7cpDhS28,GPQA diamond,Llama-3.1-Tulu-3-70B-DPO,0.4627525252525252,choice:0.46±0.02,XXXX
Hf2ESooSteQK99yFjhFm8t,MATH level 5,Llama-3.1-Tulu-3-70B-DPO,0.426642749244713,"normalized_string_match:0.39±0.01,sympy_equiv:0.39±0.01,model_graded_equiv:0.43±0.01",,
RJm4u2vXF6PUmThJGzAfHE,GPQA diamond,Eurus-2-7B-PRIME,0.3390151515151515,choice:0.34±0.02,XXXX
TxXS78Wg2pSCDbpmeJuzQn,GPQA diamond,Llama-3.3-70B-Instruct,0.4744318181818182,choice:0.47±0.03,XXXX
DeaDxFfiQtCTdnZPH7Awaf,MATH level 5,Llama-3.3-70B-Instruct,0.4159743202416918,"normalized_string_match:0.41±0.01,sympy_equiv:0.4±0.01,model_graded_equiv:0.42±0.01",,
ni4DJboYYpdRssVeAuro2V,MATH level 5,qwen2.5-72b-instruct,0.631703172205438,"normalized_string_match:0.59±0.01,sympy_equiv:0.59±0.01,model_graded_equiv:0.63±0.01",,
8Uu6XsSJCKwcwhXDt8vgzq,MATH level 5,o1-preview-2024-09-12,0.8164652567975831,"normalized_string_match:0.68±0.01,sympy_equiv:0.68±0.01,model_graded_equiv:0.82±0.01",,
Uczhz7MKLstjLkSbbGuwdN,GPQA diamond,o1-2024-12-17_medium,0.7575757575757576,choice:0.76±0.03,XXXX
AzanNoyhhVrtQi63Azo6hu,MATH level 5,o1-2024-12-17_medium,0.9441087613293052,"normalized_string_match:0.78±0.01,sympy_equiv:0.8±0.01,model_graded_equiv:0.94±0.01",,
Vwo3nMA8g2gGBCyTdKFMQJ,GPQA diamond,DeepSeek-V3,0.5653409090909091,choice:0.57±0.03,XXXX
exvQCLCiga6gE7zvQNp2Z6,MATH level 5,DeepSeek-V3,0.6485083081570997,"normalized_string_match:0.58±0.01,sympy_equiv:0.58±0.01,model_graded_equiv:0.65±0.01",,
NbsnvBsMoMizozbPZY8LLb,MATH level 5,mistral-small-2501,0.4481684290030212,"normalized_string_match:0.43±0.01,sympy_equiv:0.42±0.01,model_graded_equiv:0.45±0.01",,
d6JihAFVqUy3u9suNt4P3L,GPQA diamond,qwen2.5-32b-instruct,0.4608585858585858,choice:0.46±0.03,XXXX
Nk4mmCXFH3mEyCYnQPdPx5,MATH level 5,qwen2.5-32b-instruct,0.5607061933534743,"normalized_string_match:0.51±0.01,sympy_equiv:0.54±0.01,model_graded_equiv:0.56±0.01",,
oDq358GfyV5Au65scERdVo,GPQA diamond,mistral-small-2501,0.4529671717171717,choice:0.45±0.02,XXXX
XPHDbKVUCPNCs5NoVWU8S3,GPQA diamond,DeepSeek-R1,0.7171717171717171,choice:0.72±0.03,XXXX
XVqLs2khGFSNwEbHJTEgRU,MATH level 5,DeepSeek-R1,0.9305135951661632,"normalized_string_match:0.78±0.01,sympy_equiv:0.78±0.01,model_graded_equiv:0.93±0.01",,
ehvA5nisC7GMbgbyy3Z4Et,GPQA diamond,o3-mini-2025-01-31_medium,0.742739898989899,choice:0.74±0.03,XXXX
2RiSB2ii5kf2Q6fDBv2P3E,MATH level 5,o3-mini-2025-01-31_medium,0.9516616314199396,"normalized_string_match:0.78±0.01,sympy_equiv:0.81±0.01,model_graded_equiv:0.95±0",,
bDS2AGEf3AywbVcdDyAKZK,GPQA diamond,Phi-3-medium-128k-instruct,0.2758838383838384,choice:0.28±0.02,XXXX
RAwuPRwJedy7P2KeQhEmr8,GPQA diamond,phi-4,0.5606060606060606,choice:0.56±0.03,XXXX
njA86GC9DzSXyiZWtUxXiD,MATH level 5,Phi-3-medium-128k-instruct,0.1756042296072507,"normalized_string_match:0.07±0,sympy_equiv:0.08±0,model_graded_equiv:0.18±0.01",,
PnvmfAHVxH9oXfcdcUeUZ3,MATH level 5,phi-4,0.6493580060422961,"normalized_string_match:0.04±0,sympy_equiv:0.05±0,model_graded_equiv:0.65±0.01",,
gULKGqAe9nG9syxmYMtkhi,GPQA diamond,gpt-4o-2024-11-20,0.4788510101010101,choice:0.48±0.03,XXXX
D3G2BKGnh6YSvw4bEYSdiY,MATH level 5,gpt-4o-2024-11-20,0.4977341389728096,"normalized_string_match:0.45±0.01,sympy_equiv:0.47±0.01,model_graded_equiv:0.5±0.01",,
fbszwTApm3f28z5VA6oXBK,GPQA diamond,gemini-2.0-flash-001,0.6414141414141414,choice:0.64±0.03,XXXX
SqwGZz2UfTmbvPiw2rq3LH,MATH level 5,gemini-2.0-flash-001,0.8216578549848943,"normalized_string_match:0.79±0.01,sympy_equiv:0.78±0.01,model_graded_equiv:0.82±0.01",,
g5YE7G6DXb45EH4HxEdiht,GPQA diamond,gemini-2.0-pro-exp-02-05,0.6565656565656566,choice:0.66±0.03,XXXX
fvDt6HaHV5d3gSEuavixsv,MATH level 5,gemini-2.0-pro-exp-02-05,0.8345921450151057,"normalized_string_match:0.79±0.01,sympy_equiv:0.78±0.01,model_graded_equiv:0.83±0.01",,
ag9w2w4xa47ghZaWt5oZey,GPQA diamond,gemini-2.0-flash-thinking-exp-01-21,0.5707070707070707,choice:0.57±0.04,XXXX
MzsTvCEEfEpzn4WvxYdJkX,GPQA diamond,o3-mini-2025-01-31_high,0.7702020202020202,choice:0.77±0.03,XXXX
X2gKewiGPZx5DLFEddUUqW,GPQA diamond,o1-mini-2024-09-12_high,0.6237373737373737,choice:0.62±0.03,XXXX
3VDhPT5WvzjJZkRc36Kuvi,MATH level 5,o3-mini-2025-01-31_high,0.9648791540785498,"normalized_string_match:0.8±0.01,sympy_equiv:0.82±0.01,model_graded_equiv:0.96±0",,
4DWnpJowAfeQDHc2xDNy2C,MATH level 5,o1-mini-2024-09-12_high,0.8918051359516617,"normalized_string_match:0.82±0.01,sympy_equiv:0.82±0.01,model_graded_equiv:0.89±0.01",,
NAkCFiFiDSN3NXMj7HMvMS,GPQA diamond,o1-2024-12-17_high,0.7676767676767676,choice:0.77±0.03,XXXX
jVXtRkTBDnqH8p4sjVUbgN,MATH level 5,o1-2024-12-17_high,0.947129909365559,"normalized_string_match:0.78±0.01,sympy_equiv:0.8±0.01,model_graded_equiv:0.95±0.01",,
Kpck3pLw3veM5Mt7WZAb8G,MATH level 5,grok-2-1212,0.6351963746223565,"normalized_string_match:0.6±0.01,sympy_equiv:0.6±0.01,model_graded_equiv:0.64±0.01",,
PyNEkFCfMEReodeE2oiPhh,GPQA diamond,claude-3-7-sonnet-20250219,0.6603535353535354,choice:0.66±0.03,XXXX
6uSKG2QxR8atueXmDWAPXn,MATH level 5,claude-3-7-sonnet-20250219,0.6818353474320241,"normalized_string_match:0.61±0.01,sympy_equiv:0.63±0.01,model_graded_equiv:0.68±0.01",,
bHLfPfZj8HgD9Gef75NtrM,OTIS Mock AIME 2024-2025,mistral-large-2407,0.0847222222222222,model_graded:0.08±0.02,XXXXps://logs.epoch.ai/inspect_ai_logs/bHLfPfZj8HgD9Gef75NtrM.eval
bBY5VZzpDkmTBdSSvu5Ups,OTIS Mock AIME 2024-2025,DeepSeek-R1,0.5333333333333333,model_graded:0.53±0.08,XXXX
ZnGSRrKbiEnuju4kk4sgVb,OTIS Mock AIME 2024-2025,gemini-2.0-flash-001,0.3111111111111111,model_graded:0.31±0.06,XXXX
MMqL6fQSwsBKAdBSfxrYpU,MATH level 5,mistral-large-2411,0.5028323262839879,"normalized_string_match:0.47±0.01,sympy_equiv:0.47±0.01,model_graded_equiv:0.5±0.01",,
n9gpXfUzo8SbWqvrG2q4xM,GPQA diamond,mistral-large-2411,0.5132575757575758,choice:0.51±0.03,XXXX
buNVL2TWcdVgxAcMfVE8xJ,OTIS Mock AIME 2024-2025,gpt-4o-2024-11-20,0.0625,model_graded:0.06±0.02,XXXX
5mmrBSQb7hVSJ9VYtNm46i,OTIS Mock AIME 2024-2025,phi-4,0.1375,model_graded:0.14±0.04,XXXX
ajytUBY9bDPNNxuBfi5UBS,OTIS Mock AIME 2024-2025,o3-mini-2025-01-31_medium,0.6388888888888888,model_graded:0.64±0.06,XXXX
KvcqycaJpCQMBigiEosyzC,OTIS Mock AIME 2024-2025,o1-2024-12-17_medium,0.7333333333333333,model_graded:0.73±0.07,XXXX
Mjx6vYbgAb2DHXK8xcupX8,OTIS Mock AIME 2024-2025,o3-mini-2025-01-31_high,0.7694444444444445,model_graded:0.77±0.05,XXXX
9cCn5aW522cTRxJSE69crm,OTIS Mock AIME 2024-2025,gpt-4o-2024-05-13,0.0625,model_graded:0.06±0.02,XXXX
mBUKqiVQsqXd3qmdtKP5UP,OTIS Mock AIME 2024-2025,gpt-4o-2024-08-06,0.0638888888888888,model_graded:0.06±0.03,XXXX
ThMd6DNDTtyWxhPcb9Kq4y,OTIS Mock AIME 2024-2025,claude-3-haiku-20240307,0.0180555555555555,model_graded:0.02±0.01,XXXX
mYb9MGQD39RS8VdDyMqujr,OTIS Mock AIME 2024-2025,claude-3-opus-20240229,0.0472222222222222,model_graded:0.05±0.02,XXXX
AeRCEiG87bMFpxbxbmjCyn,OTIS Mock AIME 2024-2025,claude-3-sonnet-20240229,0.025,model_graded:0.03±0.02,XXXX
cqhrR5KnwpcuAVpHkZGQFd,OTIS Mock AIME 2024-2025,claude-3-5-haiku-20241022,0.0430555555555555,model_graded:0.04±0.02,XXXX
2xNPhKLsiWLHQBLLDGU4VJ,OTIS Mock AIME 2024-2025,claude-3-5-sonnet-20240620,0.0652777777777777,model_graded:0.07±0.02,XXXX
geVVaRWxuprKUqnW2VzvCa,OTIS Mock AIME 2024-2025,claude-3-5-sonnet-20241022,0.0847222222222222,model_graded:0.08±0.03,XXXX
N2NcmgmHyGJ9HNmQe5yvcS,OTIS Mock AIME 2024-2025,claude-3-7-sonnet-20250219,0.2194444444444444,model_graded:0.22±0.05,XXXX
NpJvYQDcmwbGgS5MzZUPUg,OTIS Mock AIME 2024-2025,gemini-1.0-pro-001,0.0111111111111111,model_graded:0.01±0.01,XXXX
QdGZ2m52PWt3EkLPbcoRwz,OTIS Mock AIME 2024-2025,gemini-1.5-flash-002,0.1625,model_graded:0.16±0.04,XXXX
U9GVBhYjRzU55YVjna5mMT,OTIS Mock AIME 2024-2025,gemini-1.5-flash-001,0.0388888888888888,model_graded:0.04±0.01,XXXX
jGY5Ee5P9dyStA4Fw5Pa6q,OTIS Mock AIME 2024-2025,gemini-1.5-flash-8b-001,0.0458333333333333,model_graded:0.05±0.02,XXXX
fhCdyX9uUS4WPtjLiVdBdK,OTIS Mock AIME 2024-2025,gemini-1.5-pro-002,0.2305555555555555,model_graded:0.23±0.05,XXXX
3S6fjGAMKWKK3zcdxkmh9H,OTIS Mock AIME 2024-2025,gemini-1.5-pro-001,0.0680555555555555,model_graded:0.07±0.02,XXXX
PpSVDb67Ucbxpcd8JQv5ii,OTIS Mock AIME 2024-2025,grok-2-1212,0.1152777777777777,model_graded:0.12±0.03,XXXX
NLk2oE4rzSFihPzF95peHr,OTIS Mock AIME 2024-2025,Llama-2-70b-chat-hf,0.0,model_graded:0±0,XXXX
7P5HLaaD6DBjNYFx7o6Jma,OTIS Mock AIME 2024-2025,Meta-Llama-3-8B-Instruct,0.0083333333333333,model_graded:0.01±0.01,XXXX
6RD3A6J432TZzxrKYRxph2,OTIS Mock AIME 2024-2025,Meta-Llama-3-70B-Instruct,0.0430555555555555,model_graded:0.04±0.02,XXXX
6SY8iSf5uKn7SrCzzrabPb,OTIS Mock AIME 2024-2025,Llama-3.1-8B-Instruct,0.025,model_graded:0.03±0.02,XXXX
X2tQXGELXtEbc2VX6gARd7,OTIS Mock AIME 2024-2025,Llama-3.1-70B-Instruct,0.0361111111111111,model_graded:0.04±0.02,XXXX
2fQJP3brRApfBdQq7Zy3bm,OTIS Mock AIME 2024-2025,Llama-3.1-405B-Instruct,0.0972222222222222,model_graded:0.1±0.03,XXXX
LHRL6nQJZXGJmYppR7kbGw,OTIS Mock AIME 2024-2025,Llama-3.2-90B-Vision-Instruct,0.0263888888888888,model_graded:0.03±0.01,XXXX
ZjsXQpZ7SsejXDo52aetZZ,OTIS Mock AIME 2024-2025,Llama-3.3-70B-Instruct,0.0513888888888888,model_graded:0.05±0.02,XXXX
7saSEhfVvXuz4yTcGy5u2y,OTIS Mock AIME 2024-2025,DeepSeek-V3,0.1583333333333333,model_graded:0.16±0.04,XXXX
NRCNnCCL5aQKsZf6cKC69Z,OTIS Mock AIME 2024-2025,qwen2.5-32b-instruct,0.0736111111111111,model_graded:0.07±0.02,XXXX
67YbVQfJh7sd77KH3zFgTn,OTIS Mock AIME 2024-2025,qwen2.5-72b-instruct,0.0805555555555555,model_graded:0.08±0.03,XXXX
ZVHvk3GTPUNTgjc5w3vkhW,OTIS Mock AIME 2024-2025,mistral-large-2402,0.0194444444444444,model_graded:0.02±0.01,XXXX
beRGgJXTJggTi8PPEZNoZ5,OTIS Mock AIME 2024-2025,mistral-large-2411,0.0777777777777777,model_graded:0.08±0.02,XXXX.eval,XXXX
Vn3Ug88CRbxEaJhLbRzAWD,GPQA diamond,claude-3-7-sonnet-20250219_16K,0.7676767676767676,choice:0.77±0.03,XXXX
hWeVyRo3hN5H4AspNc9cuf,MATH level 5,claude-3-7-sonnet-20250219_16K,0.8625377643504532,"normalized_string_match:0.78±0.01,sympy_equiv:0.79±0.01,model_graded_equiv:0.86±0.01",,
64LAJymM6eXDkRZb92Mdg8,OTIS Mock AIME 2024-2025,claude-3-7-sonnet-20250219_16K,0.4666666666666667,model_graded:0.47±0.08,XXXX
TaCoN7eyPQEfDWXqs2MpcF,GPQA diamond,gpt-4-turbo-2024-04-09,0.4659090909090909,choice:0.47±0.03,XXXX
AZAJomaSEfZtqYqXUre9fS,MATH level 5,gpt-4-turbo-2024-04-09,0.467333836858006,"normalized_string_match:0.44±0.01,sympy_equiv:0.44±0.01,model_graded_equiv:0.47±0.01",,
MZABVPEY8VCPAyCZ4iir68,OTIS Mock AIME 2024-2025,gpt-4-turbo-2024-04-09,0.0666666666666666,model_graded:0.07±0.02,XXXX
dodibSv7pk8GtqeMy4YLQg,GPQA diamond,gpt-4.5-preview-2025-02-27,0.6868686868686869,choice:0.69±0.03,XXXX
9DCJMt7YsTMDenTdX4Db6V,OTIS Mock AIME 2024-2025,gpt-4.5-preview-2025-02-27,0.3777777777777777,model_graded:0.38±0.07,XXXX
japqZFi3FS7um32EHG6wwW,MATH level 5,gpt-4.5-preview-2025-02-27,0.7862537764350453,"normalized_string_match:0.72±0.01,sympy_equiv:0.73±0.01,model_graded_equiv:0.79±0.01",,
JJqAHUs8LTLCKhnTWQUVTi,FrontierMath-2025-02-28-Public,o3-mini-2025-01-31_high,0.4,verification_code:0.4±0.16,XXXX
GZeqEeaiEhnLJSzVogT9N4,FrontierMath-2025-02-28-Private,o3-mini-2025-01-31_high,0.1103448275862069,verification_code:0.11±0.02,,
7rbXSQ8hFiXCCFv2vtqiiH,FrontierMath-2025-02-28-Public,grok-2-1212,0.0,verification_code:0±0,XXXX
kinyvFjqSqUiY5FyYv7tsA,FrontierMath-2025-02-28-Private,grok-2-1212,0.0068965517241379,verification_code:0.01±0,,
UjGSDrmEPtVoVGdVSbUMsb,FrontierMath-2025-02-28-Public,o3-mini-2025-01-31_medium,0.1125,verification_code:0.11±0.06,XXXX
U8CjtcxudQeP2cnMNixfJH,FrontierMath-2025-02-28-Private,o3-mini-2025-01-31_medium,0.0808189655172413,verification_code:0.08±0.01,,
kPCieUA4pKDto2xEfvQqSL,FrontierMath-2025-02-28-Public,mistral-large-2411,0.0,verification_code:0±0,XXXX
cEvTUVyxAqq5UDgsGQbvLM,FrontierMath-2025-02-28-Private,mistral-large-2411,0.0034482758620689,verification_code:0±0,,
SNDo6TsviYYTiH2kCmf7Pr,FrontierMath-2025-02-28-Public,gpt-4o-2024-11-20,0.0,verification_code:0±0,XXXX%2FSNDo6TsviYYTiH2kCmf7Pr.eval,XXXX
Z7HU95T8eema56EpLtECvG,FrontierMath-2025-02-28-Private,gpt-4o-2024-11-20,0.0034482758620689,verification_code:0±0,,
m7X2ppkiEkweCtFppiRW8S,FrontierMath-2025-02-28-Private,claude-3-7-sonnet-20250219,0.0310344827586206,verification_code:0.03±0.01,,
dvMWxBZMaxp6adP9zGszYG,FrontierMath-2025-02-28-Public,claude-3-7-sonnet-20250219,0.0,verification_code:0±0,XXXX
PU7LXbdNB8Unfvu4qCeJZF,FrontierMath-2025-02-28-Private,claude-3-7-sonnet-20250219_16K,0.0413793103448275,verification_code:0.04±0.01,,
8wm9yBTJHoY6ui45QTpjn2,FrontierMath-2025-02-28-Public,claude-3-7-sonnet-20250219_16K,0.0,verification_code:0±0,XXXX
9zhPZLvNeUPQ7JBSpTwKPp,OTIS Mock AIME 2024-2025,o1-mini-2024-09-12_high,0.4694444444444444,model_graded:0.47±0.06,XXXX
QUTfpiKjphedrKg8H6fGzF,OTIS Mock AIME 2024-2025,o1-mini-2024-09-12_medium,0.4472222222222222,model_graded:0.45±0.06,XXXX
JPadBpaxxGpdvTXtfHuvPV,FrontierMath-2025-02-28-Public,o1-mini-2024-09-12_high,0.0,verification_code:0±0,XXXX
2ekT2UzaebvmGprKvmeeEa,FrontierMath-2025-02-28-Private,o1-mini-2024-09-12_high,0.0137931034482758,verification_code:0.01±0.01,,
QKY2tL2pcQ33bDkdMCQzdr,FrontierMath-2025-02-28-Public,o1-mini-2024-09-12_medium,0.0,verification_code:0±0,XXXX
hC2FY4WuKm6BLgPhbWeZDY,FrontierMath-2025-02-28-Private,o1-mini-2024-09-12_medium,0.0172413793103448,verification_code:0.02±0.01,,
ADAsCz7ZtJ8E4inNusQebu,FrontierMath-2025-02-28-Public,claude-3-5-sonnet-20241022,0.0,verification_code:0±0,XXXX
MVnDrt94wyokzx3a7ZJkph,FrontierMath-2025-02-28-Private,claude-3-5-sonnet-20241022,0.0206896551724137,verification_code:0.02±0.01,,
Hs7dxLcim5bsmuG2DqnnsA,FrontierMath-2025-02-28-Public,gemini-1.5-flash-002,0.0,verification_code:0±0,XXXX
8FtfWCm8DdAqF9rnMVwv8h,FrontierMath-2025-02-28-Private,gemini-1.5-flash-002,0.0,verification_code:0±0,,
Qe4yHtQm26TQHFCAjhKBaA,FrontierMath-2025-02-28-Public,claude-3-5-haiku-20241022,0.0,verification_code:0±0,XXXX
JQcqpW9xPfJRKJkHne6DpK,FrontierMath-2025-02-28-Private,claude-3-5-haiku-20241022,0.0034482758620689,verification_code:0±0,,
i5xfb9Bk8TaiuQ6WccCAzu,FrontierMath-2025-02-28-Public,claude-3-5-sonnet-20240620,0.0,verification_code:0±0,XXXX
h7f9jZVWzZNDHNxP5eewNH,FrontierMath-2025-02-28-Private,claude-3-5-sonnet-20240620,0.0103448275862068,verification_code:0.01±0.01,,
BhMT5JGDGaz4FvSyNLYmok,FrontierMath-2025-02-28-Public,gpt-4o-2024-08-06,0.0,verification_code:0±0,XXXX
SmT4ukjkxsXZvFExMMVFgW,FrontierMath-2025-02-28-Private,gpt-4o-2024-08-06,0.0034482758620689,verification_code:0±0,,
PDACS493wNy85eyMC9NULs,FrontierMath-2025-02-28-Public,gemini-2.0-flash-001,0.0,verification_code:0±0,XXXX
5RpSr8gJZfcZn586aV2hDW,FrontierMath-2025-02-28-Private,gemini-2.0-flash-001,0.0172413793103448,verification_code:0.02±0.01,,
BhT6UhKLcghqng7J8NbPWH,FrontierMath-2025-02-28-Public,o1-2024-12-17_high,0.0,verification_code:0±0,XXXX
bynpttv6jLvbXTY7ktnBpk,FrontierMath-2025-02-28-Private,o1-2024-12-17_high,0.093103448275862,verification_code:0.09±0.02,,
RaMuNb5qvzGzMveuZuhEAZ,FrontierMath-2025-02-28-Public,gemini-2.0-pro-exp-02-05,0.0,verification_code:0±0,XXXX
Cprjaj9JQ9MY5NCXqeQR5H,OTIS Mock AIME 2024-2025,o1-preview-2024-09-12,0.3111111111111111,model_graded:0.31±0.07,XXXX
cXz2gsCmGYSvQ6Ey4LDiN7,OTIS Mock AIME 2024-2025,gemini-2.0-flash-thinking-exp-01-21,0.5777777777777777,model_graded:0.58±0.07,XXXX
7LLTmqs4LC5GLVfJEqJWSC,OTIS Mock AIME 2024-2025,claude-2.0,0.025,model_graded:0.03±0.02,XXXX
2vg9qw9pd8zSyRMu9YNiVm,OTIS Mock AIME 2024-2025,claude-2.1,0.0194444444444444,model_graded:0.02±0.01,XXXX
89zcekqEYnE8zhwG4pR5Nu,OTIS Mock AIME 2024-2025,DeepSeek-R1-Distill-Llama-70B,0.5138888888888888,model_graded:0.51±0.06,XXXX
WNE7cveVN8rsi7DmsFsFVy,OTIS Mock AIME 2024-2025,gemma-2-9b-it,0.0055555555555555,model_graded:0.01±0,XXXX
jD6rJdvfNPErKEz9qkpssj,OTIS Mock AIME 2024-2025,gemma-2-27b-it,0.0138888888888888,model_graded:0.01±0.01,XXXX
FGcofc566CradiNmHUGqVB,OTIS Mock AIME 2024-2025,Llama-3.1-Tulu-3-70B-DPO,0.0444444444444444,model_graded:0.04±0.02,XXXX
iKvrBWEvZT26XhFBRcavtf,FrontierMath-2025-02-28-Public,DeepSeek-V3,0.0,verification_code:0±0,XXXX
aXcfCytWFYNk7E5AZGMzLb,FrontierMath-2025-02-28-Private,DeepSeek-V3,0.0172413793103448,verification_code:0.02±0.01,,
f6pnAATjxmrFJS353EjQKi,GPQA diamond,DeepSeek-R1-Distill-Llama-70B,0.5574494949494949,choice:0.56±0.03,XXXX
oPDoTJFMhMNvU4sybYu2jQ,MATH level 5,DeepSeek-R1-Distill-Llama-70B,0.8989803625377644,"normalized_string_match:0.83±0.01,sympy_equiv:0.83±0.01,model_graded_equiv:0.9±0.01",,
iv9yfwZ9vTznxQfvJF6J7p,GPQA diamond,claude-3-7-sonnet-20250219_32K,0.7676767676767676,choice:0.77±0.03,XXXX
8UqaLobhmNJLuPGTTpeaCw,GPQA diamond,DeepSeek-R1-Distill-Qwen-14B,0.4469696969696969,choice:0.45±0.03,XXXX
R8XGt5ScHpmTbD5DnXR59V,MATH level 5,DeepSeek-R1-Distill-Qwen-14B,0.8712235649546828,"normalized_string_match:0.81±0.01,sympy_equiv:0.81±0.01,model_graded_equiv:0.87±0.01",,
dSuGgEbnQErengjEboPn62,OTIS Mock AIME 2024-2025,claude-3-7-sonnet-20250219_32K,0.5333333333333333,model_graded:0.53±0.08,XXXX
WHGMHrneqJs4nefAh4TjZR,MATH level 5,claude-3-7-sonnet-20250219_32K,0.9003021148036254,"normalized_string_match:0.83±0.01,sympy_equiv:0.84±0.01,model_graded_equiv:0.9±0.01",,
gFT36W4KU3wggLQ9nNCctg,GPQA diamond,gemini-1.5-flash-8b-001,0.3295454545454545,choice:0.33±0.02,XXXX
JpLxrpvthx7A75sbaHZ9Et,GPQA diamond,claude-3-5-haiku-20241022,0.3813131313131313,choice:0.38±0.03,XXXX
GSvi9Zf5aK3t5TfCPqNRHy,MATH level 5,claude-3-5-haiku-20241022,0.4635574018126888,"normalized_string_match:0.42±0.01,sympy_equiv:0.43±0.01,model_graded_equiv:0.46±0.01",,
VzgR6YSTMWX3dFDwQgmFpM,FrontierMath-2025-02-28-Public,claude-3-7-sonnet-20250219_32K,0.0,verification_code:0±0,XXXX
aZqVpPa4SWHapN9QHEoEm3,FrontierMath-2025-02-28-Private,claude-3-7-sonnet-20250219_32K,0.0344827586206896,verification_code:0.03±0.01,,
FtxiPnS7ZH5qLrmFgp4ca2,GPQA diamond,gemma-3-27b-it,0.4886363636363636,choice:0.49±0.03,XXXX
A3ZGeCNhHMp55ZjtRtCGVB,OTIS Mock AIME 2024-2025,gemma-3-27b-it,0.1972222222222222,model_graded:0.2±0.05,XXXX
amrtN7UjAtuyEmv5DaGshP,MATH level 5,gemma-3-27b-it,0.740370090634441,"normalized_string_match:0.71±0.01,sympy_equiv:0.71±0.01,model_graded_equiv:0.74±0.01",,
Rfy7A5TbRK343mTyQkeGt7,FrontierMath-2025-02-28-Public,claude-3-7-sonnet-20250219_64K,0.0,verification_code:0±0,XXXX
DoU54Xy93J3MxztC4uxwYy,FrontierMath-2025-02-28-Private,claude-3-7-sonnet-20250219_64K,0.0310344827586206,verification_code:0.03±0.01,,
JBX3gYnFhUbDVb4oDpaCax,GPQA diamond,claude-3-7-sonnet-20250219_64K,0.7727272727272727,choice:0.77±0.03,XXXX
QW3FUBn5qtAFEX4oQuzEAV,OTIS Mock AIME 2024-2025,claude-3-7-sonnet-20250219_64K,0.5777777777777777,model_graded:0.58±0.07,XXXX
epCyFNTf6vyUi5Dby7GeTq,MATH level 5,claude-3-7-sonnet-20250219_64K,0.9116314199395772,"normalized_string_match:0.83±0.01,sympy_equiv:0.83±0.01,model_graded_equiv:0.91±0.01",,
S8bBDAbucsFceirHxeEDvu,GPQA diamond,mistral-small-2503,0.4747474747474747,choice:0.47±0.03,XXXX
Kas7Y5PBsT7Li5Wm6Y8mj8,MATH level 5,mistral-small-2503,0.4677114803625378,"normalized_string_match:0.45±0.01,sympy_equiv:0.44±0.01,model_graded_equiv:0.47±0.01",,
JxqtHWrwCtYoQYBB3jVqB9,OTIS Mock AIME 2024-2025,mistral-small-2503,0.0583333333333333,model_graded:0.06±0.02,XXXX
b4vEwq8UVmGmAKh4eqp6G3,OTIS Mock AIME 2024-2025,mistral-small-2501,0.0527777777777777,model_graded:0.05±0.03,XXXX
WKjzEQdfnBHnYeZoKeCWnB,GPQA diamond,gemini-2.5-pro-exp-03-25,0.8383838383838383,choice:0.84±0.03,XXXX
XRkZCzSoQC5yYiQ3xitNdC,MATH level 5,gemini-2.5-pro-preview-03-25,0.9556268882175226,"normalized_string_match:0.91±0.01,sympy_equiv:0.9±0.01,model_graded_equiv:0.96±0",,
VjfLwKbT6kTprYsMGcz2YT,GPQA diamond,DeepSeek-V3-0324,0.6761363636363636,choice:0.68±0.03,XXXX
YFyn28VJ6m2hfop3KGRkBs,MATH level 5,DeepSeek-V3-0324,0.75547583081571,"normalized_string_match:0.7±0.01,sympy_equiv:0.69±0.01,model_graded_equiv:0.76±0.01",,
VpsMexBEhXkPYJBe7v6Fk9,OTIS Mock AIME 2024-2025,DeepSeek-V3-0324,0.3777777777777777,model_graded:0.38±0.06,XXXX://logs.epoch.ai/inspect_ai_logs/VpsMexBEhXkPYJBe7v6Fk9.eval
63MnE4e4AXvbQ585MaWTFv,OTIS Mock AIME 2024-2025,Hermes-2-Theta-Llama-3-70B,0.025,model_graded:0.03±0.01,XXXX
GBf7rqHkGGzfX87V37F9mH,GPQA diamond,qwen-max-2025-01-25,0.5612373737373737,choice:0.56±0.03,XXXX
GFZDNH95TnVegNrEtmUT2Q,MATH level 5,qwen-max-2025-01-25,0.6718277945619335,"normalized_string_match:0.64±0.01,sympy_equiv:0.64±0.01,model_graded_equiv:0.67±0.01",,
hWj4Vg5DgUWxhZrT4LWKeU,FrontierMath-2025-02-28-Public,qwen-max-2025-01-25,0.0,verification_code:0±0,XXXX
F77vDPynTKJbz7m2bt7Py8,FrontierMath-2025-02-28-Private,qwen-max-2025-01-25,0.0103448275862068,verification_code:0.01±0.01,,
PZt4Yb9bpXVPDoPi4uTxwK,OTIS Mock AIME 2024-2025,qwen-max-2025-01-25,0.1611111111111111,model_graded:0.16±0.04,XXXX
QtraMoKiQRRpZc9txpFAYF,GPQA diamond,qwen-plus-2025-01-25,0.4810606060606061,choice:0.48±0.03,XXXX
J3WoEpLKAJZWGGpyBVCCCn,MATH level 5,qwen-plus-2025-01-25,0.6527567975830816,"normalized_string_match:0.62±0.01,sympy_equiv:0.62±0.01,model_graded_equiv:0.65±0.01",,
AULNJG2C75JBP6uaiAgj3P,OTIS Mock AIME 2024-2025,qwen-plus-2025-01-25,0.1777777777777777,model_graded:0.18±0.04,XXXX
9Ke5At9bYuf56SmKJJ7Sz7,GPQA diamond,qwen-turbo-2024-11-01,0.4179292929292929,choice:0.42±0.03,XXXX
jWV936hYKg6rYcPEynzEEt,MATH level 5,qwen-turbo-2024-11-01,0.5623111782477341,"normalized_string_match:0.52±0.01,sympy_equiv:0.51±0.01,model_graded_equiv:0.56±0.01",,
G5gxbTH33S7LXA29o5fgmv,OTIS Mock AIME 2024-2025,qwen-turbo-2024-11-01,0.0611111111111111,model_graded:0.06±0.02,XXXX
MsXAXDEKP3xBKtiA7aTBBG,GPQA diamond,Llama-4-Scout-17B-16E-Instruct,0.5183080808080808,choice:0.52±0.03,XXXX
LyHy99ubGCqBhoaksQjkiF,GPQA diamond,Llama-4-Maverick-17B-128E-Instruct-FP8,0.6698232323232324,choice:0.67±0.03,XXXX
c6rTyTFCeoHMBzC2QwjxFL,MATH level 5,Llama-4-Scout-17B-16E-Instruct,0.6227341389728097,"normalized_string_match:0.6±0.01,sympy_equiv:0.6±0.01,model_graded_equiv:0.62±0.01",,
T3X5DeC47iqCaoRYXu5k8L,OTIS Mock AIME 2024-2025,Llama-4-Scout-17B-16E-Instruct,0.0777777777777777,model_graded:0.08±0.03,XXXX
m96PYCivjSmTVDd27zw84m,FrontierMath-2025-02-28-Public,Llama-4-Scout-17B-16E-Instruct,0.0,verification_code:0±0,XXXX
bhjLDA5A7gddsX5cAYKma8,FrontierMath-2025-02-28-Private,Llama-4-Scout-17B-16E-Instruct,0.0,verification_code:0±0,,
HkjCosgMFq5QXB2CyTyU7r,MATH level 5,Llama-4-Maverick-17B-128E-Instruct-FP8,0.7301737160120846,"normalized_string_match:0.7±0.01,sympy_equiv:0.7±0.01,model_graded_equiv:0.73±0.01",,
GJEpPThqUxXJBSbvkuXVnw,OTIS Mock AIME 2024-2025,Llama-4-Maverick-17B-128E-Instruct-FP8,0.2055555555555555,model_graded:0.21±0.05,XXXX
M2qDSno6GVHdWjcdGRGy9D,FrontierMath-2025-02-28-Public,Llama-4-Maverick-17B-128E-Instruct-FP8,0.0,verification_code:0±0,XXXX
QjvBS34fsngfBBMZUirSyh,FrontierMath-2025-02-28-Private,Llama-4-Maverick-17B-128E-Instruct-FP8,0.0068965517241379,verification_code:0.01±0,,
evNX52n9sxzGLwthPEEeFF,GPQA diamond,grok-3-beta,0.7575757575757576,choice:0.76±0.03,XXXX
DKTzh9kSoipZGkDsC4NGf3,OTIS Mock AIME 2024-2025,grok-3-beta,0.5555555555555556,model_graded:0.56±0.07,XXXX
WBwJHBpCqKQvZBnoZ4Y8H9,FrontierMath-2025-02-28-Public,grok-3-beta,0.0,verification_code:0±0,XXXX
i9eqMbaVTDDuCxWxzGHxyU,MATH level 5,grok-3-beta,0.8874622356495468,"normalized_string_match:0.78±0.01,sympy_equiv:0.78±0.01,model_graded_equiv:0.89±0.01",,
DzjFQArKFTZHioQVwKYVtX,FrontierMath-2025-02-28-Private,grok-3-beta,0.0379310344827586,verification_code:0.04±0.01,,
Skcwy4eaZjtyW54X8GZCLC,GPQA diamond,grok-3-mini-beta_high,0.7373737373737373,choice:0.74±0.03,XXXX
fPtize44vDPcDAzMFuAurw,OTIS Mock AIME 2024-2025,grok-3-mini-beta_high,0.7777777777777778,model_graded:0.78±0.06,XXXX
HMm8XaDBr4CQdMFddasEWb,MATH level 5,grok-3-mini-beta_high,0.8806646525679759,"normalized_string_match:0.82±0.01,sympy_equiv:0.82±0.01,model_graded_equiv:0.88±0.01",,
HQ3GnLdmf8Cy6dceb8d7cL,FrontierMath-2025-02-28-Public,grok-3-mini-beta_high,0.0,verification_code:0±0,XXXX
Z83F6bJ7Ne5rXaXwgR2xWT,FrontierMath-2025-02-28-Private,grok-3-mini-beta_high,0.0586206896551724,verification_code:0.06±0.01,,
McxfQdC5aypnD3STwZUBjH,GPQA diamond,grok-3-mini-beta_low,0.7626262626262627,choice:0.76±0.03,XXXX
TTmhagb2ovbi7GJnDxEfrj,OTIS Mock AIME 2024-2025,grok-3-mini-beta_low,0.6222222222222222,model_graded:0.62±0.07,XXXX
5QyCm49Lce2PpGVcZLomCz,MATH level 5,grok-3-mini-beta_low,0.9093655589123868,"normalized_string_match:0.85±0.01,sympy_equiv:0.84±0.01,model_graded_equiv:0.91±0.01",,
mvPsa56tAProncoFm6Y7aW,FrontierMath-2025-02-28-Public,grok-3-mini-beta_low,0.0,verification_code:0±0,XXXX
9FTMS2t6Ksm43CduHeqtJr,FrontierMath-2025-02-28-Private,grok-3-mini-beta_low,0.0275862068965517,verification_code:0.03±0.01,,
PtxHF8BH8WuNdNfDnutGBb,GPQA diamond,qwq-plus,0.6540404040404041,choice:0.65±0.03,XXXX
Gj8pd5gU9r4LqwDK8XZ4mL,GPQA diamond,gpt-4.1-2025-04-14,0.6691919191919192,choice:0.67±0.03,XXXX
VAQtARXUPTpdxGKMekUSbX,OTIS Mock AIME 2024-2025,gpt-4.1-2025-04-14,0.3833333333333333,model_graded:0.38±0.06,XXXX
YyJXPVXJjN5EnLfYjfCTrQ,MATH level 5,gpt-4.1-2025-04-14,0.8300604229607251,"normalized_string_match:0.78±0.01,sympy_equiv:0.78±0.01,model_graded_equiv:0.83±0.01",,
QjrnNoyDCzDYmFxgfYCuTm,FrontierMath-2025-02-28-Public,gpt-4.1-2025-04-14,0.0,verification_code:0±0,XXXX
NPeiYpjbscreYnFkHPvcNB,FrontierMath-2025-02-28-Private,gpt-4.1-2025-04-14,0.0551724137931034,verification_code:0.06±0.01,,
dD8n3SgLkV3Xse6bbG8mMB,SWE-Bench verified,gpt-4.1-mini-2025-04-14,0.328,swe_bench_scorer:0.33±0.02,XXXX
6ybVweWdTfiCuirnS2BKXD,MATH level 5,gpt-4.1-mini-2025-04-14,0.8729229607250756,"normalized_string_match:0.81±0.01,sympy_equiv:0.82±0.01,model_graded_equiv:0.87±0.01",,
dXHddeGRX7JRVa5UNfad5C,GPQA diamond,gpt-4.1-mini-2025-04-14,0.6584595959595959,choice:0.66±0.03,XXXX
cXbbdQDW2jeYassmRuZYpU,OTIS Mock AIME 2024-2025,gpt-4.1-mini-2025-04-14,0.4472222222222222,model_graded:0.45±0.06,XXXX
jM7D9QpAZVXuP93HM6ST5w,FrontierMath-2025-02-28-Public,gpt-4.1-mini-2025-04-14,0.1,verification_code:0.1±0.1,XXXX
ephMMUGDANc9iVTNNuuoZx,FrontierMath-2025-02-28-Private,gpt-4.1-mini-2025-04-14,0.0448275862068965,verification_code:0.04±0.01,,
K4TfMXsaPd4Z46zFmiV4Rf,MATH level 5,gpt-4.1-nano-2025-04-14,0.6999622356495468,"normalized_string_match:0.64±0.01,sympy_equiv:0.65±0.01,model_graded_equiv:0.7±0.01",,
azWfeFPP6zSocuQX3JYBU5,GPQA diamond,gpt-4.1-nano-2025-04-14,0.4892676767676767,choice:0.49±0.02,XXXX
3qLHwibDhgtWEbKGwxUYVa,OTIS Mock AIME 2024-2025,gpt-4.1-nano-2025-04-14,0.2888888888888888,model_graded:0.29±0.06,XXXX
R4GLbPVPjbATgWQJFjF2EQ,FrontierMath-2025-02-28-Public,gpt-4.1-nano-2025-04-14,0.0,verification_code:0±0,XXXX
chtQvebcoaE33KkhAjbCnw,FrontierMath-2025-02-28-Private,gpt-4.1-nano-2025-04-14,0.0103448275862068,verification_code:0.01±0.01,,
Q6UB2pBAwruHWXCNQvKGVv,SWE-Bench verified,gpt-4.1-2025-04-14,0.41,swe_bench_scorer:0.41±0.02,XXXX
evANZQ9oQTdGYbDERw7VrD,GPQA diamond,o3-2025-04-16_high,0.8181818181818182,choice:0.82±0.02,XXXX
6anLU9CfBBNQkWyJG2GDCa,OTIS Mock AIME 2024-2025,o3-2025-04-16_high,0.8388888888888889,model_graded:0.84±0.04,XXXX
gXQ6PHjaqbjcBj6KV9UKSK,FrontierMath-2025-02-28-Public,o3-2025-04-16_high,0.1,verification_code:0.1±0.1,XXXX
dECNMk4kcUtMHLzb3tgaCX,FrontierMath-2025-02-28-Private,o3-2025-04-16_high,0.1034482758620689,verification_code:0.1±0.02,,
9Sgsz5XCjscuZkexf7MX9c,GPQA diamond,o4-mini-2025-04-16_high,0.7960858585858586,choice:0.8±0.02,XXXX
TVsxkJ8ZyfMVmVx88Qq3KN,OTIS Mock AIME 2024-2025,o4-mini-2025-04-16_high,0.8166666666666667,model_graded:0.82±0.05,XXXX
E9AfnKpQrDsFWKmXMhBbJS,FrontierMath-2025-02-28-Public,o4-mini-2025-04-16_high,0.3,verification_code:0.3±0.15,XXXX
ir92cSRbauVXmQfqTSaiSx,FrontierMath-2025-02-28-Private,o4-mini-2025-04-16_high,0.1724137931034483,verification_code:0.17±0.02,,
6oW3GceNQznewQJRwLigth,MATH level 5,o3-2025-04-16_high,0.9777190332326284,"normalized_string_match:0.86±0.01,sympy_equiv:0.87±0.01,model_graded_equiv:0.98±0",,
727ixGKZ7A7G8LmavE2ScL,MATH level 5,o4-mini-2025-04-16_high,0.978285498489426,"normalized_string_match:0.84±0.01,sympy_equiv:0.88±0.01,model_graded_equiv:0.98±0",,
9Ffc6sAe5z8ZAMgUEbktr6,GPQA diamond,gemini-2.5-flash-preview-04-17,0.0839646464646464,choice:0.08±0.01,XXXX
KpZYG5nvdJ6QYweLz6dn5a,OTIS Mock AIME 2024-2025,gemini-2.5-flash-preview-04-17,0.7305555555555555,model_graded:0.73±0.06,XXXX
couunRFGbK2kR3995XSJ6Z,MATH level 5,gemini-2.5-flash-preview-04-17,0.2311178247734139,"normalized_string_match:0.22±0.01,sympy_equiv:0.21±0.01,model_graded_equiv:0.23±0.01",,
gdKs4ZBYcVz8oUvzu5VkDB,FrontierMath-2025-02-28-Public,o3-2025-04-16_medium,0.0,verification_code:0±0,XXXX
jMt3i7znSKdhTniR6o6b5C,FrontierMath-2025-02-28-Private,o3-2025-04-16_medium,0.1,verification_code:0.1±0.02,,
47yfmJgzAJq5DfUcBGVebV,FrontierMath-2025-02-28-Public,o3-2025-04-16_low,0.1,verification_code:0.1±0.1,XXXX
4yHiPUjmcqGZijjnQ6V25s,FrontierMath-2025-02-28-Private,o3-2025-04-16_low,0.1034482758620689,verification_code:0.1±0.02,,
V7txn6pocsNsZSzt5dkZkJ,FrontierMath-2025-02-28-Public,o4-mini-2025-04-16_medium,0.2,verification_code:0.2±0.13,XXXX
8ozSh7wRxGu4zL4yjyEF2L,FrontierMath-2025-02-28-Private,o4-mini-2025-04-16_medium,0.193103448275862,verification_code:0.19±0.02,,
BMEJvvJKbhHa5q7YTuCeAi,FrontierMath-2025-02-28-Public,o4-mini-2025-04-16_low,0.3,verification_code:0.3±0.15,XXXX
HTqZUWGZ8bXbJ4dBRUZ3wv,FrontierMath-2025-02-28-Private,o4-mini-2025-04-16_low,0.096551724137931,verification_code:0.1±0.02,,
MCv2wDLoTzqMRrN8dATPEv,SWE-Bench verified,claude-3-7-sonnet-20250219,0.522,swe_bench_scorer:0.52±0.02,XXXX
ne2Anciuxh2k3VXGYG4Nqy,SWE-Bench verified,gemini-2.0-flash-001,0.22,swe_bench_scorer:0.22±0.02,XXXX
MgQteQgHEpQAAQw7DFbh2u,SWE-Bench verified,grok-3-mini-beta_low,0.152,swe_bench_scorer:0.15±0.02,XXXX
Sp6nNR4N7zRGrqPCkgJ4Mn,SWE-Bench verified,grok-3-beta,0.386,swe_bench_scorer:0.39±0.02,XXXX
eTCmM2smJzrY4xjJnrboT2,SWE-Bench verified,o4-mini-2025-04-16_medium,0.346,swe_bench_scorer:0.35±0.02,XXXX
kvVyU6crkxAnG3DAUciMGc,SWE-Bench verified,o3-mini-2025-01-31_medium,0.378,swe_bench_scorer:0.38±0.02,XXXX
GUbGX7GABSbhxMXiYC5NFr,SWE-Bench verified,claude-3-5-sonnet-20241022,0.406,swe_bench_scorer:0.41±0.02,XXXX
Mi8LCxmduKukxazAtmuhqg,MATH level 5,gemini-2.5-pro-preview-05-06,0.959025679758308,"normalized_string_match:0.9±0.01,sympy_equiv:0.9±0.01,model_graded_equiv:0.96±0",,
FNMbfvEEm9PRnrJnrqk2PH,GPQA diamond,mistral-medium-2505,0.5953282828282829,choice:0.6±0.03,XXXX
QWw65xeTn8x42SymMsVREN,OTIS Mock AIME 2024-2025,mistral-medium-2505,0.3222222222222222,model_graded:0.32±0.06,XXXX
K86QrHt55VSsziWQMEz4jS,FrontierMath-2025-02-28-Public,mistral-medium-2505,0.0,verification_code:0±0,XXXX
PZRX6ME25bn26B6BTRWx6W,FrontierMath-2025-02-28-Private,mistral-medium-2505,0.0034602076124567,verification_code:0±0,,
LNUnvvrDra89Vhdc9bzdCy,MATH level 5,mistral-medium-2505,0.8162764350453172,"normalized_string_match:0.77±0.01,sympy_equiv:0.76±0.01,model_graded_equiv:0.82±0.01",,
SCpoD9h4p73bd6SwebPop4,SWE-Bench verified,qwen-plus-2025-04-28,0.28,swe_bench_scorer:0.28±0.02,XXXX
fpjeeuxXr5G2EdUXAgtynx,FrontierMath-2025-02-28-Public,DeepSeek-V3-0324,0.0,verification_code:0±0,XXXX
