
Model,MMLU,ARC-C,HellaSwag,Winograd,TruthfulQA,GSM8K,XWinograd,HumanEval
Claude-3 Haiku,0.75,0.89,0.86,,,0.89,,0.76
Claude-3 Sonnet,0.79,0.93,0.89,,,0.92,,0.73
Claude-3 Opus,0.87,0.96,0.95,,,0.95,,0.85
Claude-3.5 Sonnet,0.98,,,,,0.96,,0.94
Cohere Command R,,,,,,,,
Cohere Command R+,0.76,0.71,0.89,0.85,0.56,0.71,,
Gemini 1.0 Pro,0.72,,,,,,,
Gemini 1.5 Flash,0.79,,0.86,,,0.86,,0.74
Gemini 1.5 Pro,0.86,,,,,0.91,,0.84
GPT-3.5-turbo,0.70,,,,,,,0.68
GPT-4-turbo,0.86,,,,,,,0.87
GPT-4o,0.86,,,,,,,
Llama 3 8B Instruct,0.73,,,,,0.845,,0.742
Llama 3 70B Instruct,0.86,,0.87,,,0.951,,0.805
Mistral Large,0.812,,,,,,,
Mixtral 8x22B MoE,0.773,0.705,0.889,,,0.765,,
o1-mini,0.85,,,,,,,0.92
o1-preview,0.91,,,,,,,
