Model,Task_SR_V,Task_SR_B,Exec_SR_V,Exec_SR_B,Grammar_Parsing_V,Grammar_Parsing_B,Grammar_Halluc_V,Grammar_Halluc_B,Grammar_PredArgNum_V,Grammar_PredArgNum_B,Runtime_WrongOrder_V,Runtime_WrongOrder_B,Runtime_MissingStep_V,Runtime_MissingStep_B,Runtime_Affordance_V,Runtime_Affordance_B,Runtime_AdditionalStep_V,Runtime_AdditionalStep_B
Claude-3 Haiku,54.8,26.0,60.7,32.0,0.0,0.0,6.6,6.0,0.0,0.0,0.7,7.0,30.5,54.0,1.6,1.0,1.6,1.0
Claude-3 Sonnet,58.4,44.0,63.3,57.0,0.0,0.0,4.9,1.0,0.0,0.0,7.9,4.6,11.0,26.9,19.0,0.3,11.0,3.0
Claude-3 Opus,64.9,51.0,69.5,59.0,0.0,0.0,17.0,0.0,0.0,0.0,0.3,3.0,12.8,35.0,0.3,3.0,2.3,2.0
Claude-3.5 Sonnet,76.7,60.0,81.3,69.0,0.0,0.0,2.0,0.0,0.3,0.0,0.3,5.0,14.4,25.0,1.6,1.0,1.3,2.0
Gemini 1.0 Pro,45.6,27.0,56.7,32.0,0.0,0.0,7.0,3.6,3.0,2.6,6.0,2.0,13.0,33.4,1.6,4.0,8.2,4.0
Gemini 1.5 Flash,69.5,40.0,75.4,52.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,5.0,22.3,0.3,1.0,1.6,2.0
Gemini 1.5 Pro,76.7,42.0,83.6,54.0,0.0,0.0,1.3,0.0,0.7,0.0,0.3,6.0,14.1,39.0,0.0,1.0,3.0,2.0
GPT-3.5-turbo,24.9,16.0,40.7,20.0,10.2,4.0,1.0,7.0,2.0,23.0,0.0,1.0,41.6,36.0,4.6,8.0,3.0,1.3
GPT-4-turbo,60.0,38.0,65.2,45.0,0.0,0.0,2.0,0.0,0.0,1.6,0.0,7.0,30.8,47.0,0.3,1.0,1.3,0.0
GPT-4o,71.5,47.0,81.3,53.0,1.0,0.0,2.0,1.0,0.7,0.0,0.3,9.0,15.1,36.0,0.0,1.0,2.3,0.0
Cohere Command R,44.9,16.0,44.3,19.0,2.0,5.0,23.3,13.0,2.6,0.0,2.0,8.0,17.0,43.0,9.8,12.0,3.6,4.0
Cohere Command R+,54.1,27.0,65.2,35.0,2.9,0.0,12.5,1.0,2.6,0.0,0.3,10.0,11.8,39.0,1.6,0.0,3.9,15.0
Mistral Large,78.4,33.0,84.6,50.0,0.0,0.0,2.0,0.0,0.3,0.0,0.0,8.0,12.5,35.0,0.7,6.0,3.9,7.0
Mixtral 8x22B MoE,63.3,30.0,67.9,40.0,0.0,3.0,13.1,6.0,1.0,0.0,1.3,10.0,14.1,32.0,2.6,9.0,5.6,2.0
Llama 3 8B,21.3,14.0,23.6,16.0,1.0,0.0,34.1,15.0,8.2,9.0,1.0,6.0,30.8,44.0,1.3,9.0,2.6,5.0
Llama 3 70B,59.0,34.0,66.6,42.0,0.0,0.0,14.1,2.0,8.2,0.0,2.0,15.0,9.2,38.0,0.0,3.0,6.2,6.0
o1-mini,71.5,56.0,76.4,65.0,4.9,0.0,2.0,3.0,0.0,0.0,1.0,7.0,17.7,17.0,0.3,0.6,2.6,5.0
o1-preview,65.2,81.0,72.5,91.0,6.6,0.0,11.5,0.0,0.0,0.0,0.0,0.0,12.1,6.0,0.3,2.0,2.0,3.0
