User Input,reasoning_capacity
Difficulty,Model,Objective Score,Alignment Rate
easy,GPT-4o,0.8813559322033898,0.7375
easy,GPT-4o mini,0.8192090395480226,0.7375
easy,Gemini-Flash,0.8531073446327684,0.7375
easy,Claude-3.5,0.8531073446327684,0.7375
easy,Claude-3,0.8248587570621468,0.7375
easy,GLM-4v,0.8700564971751412,0.7375
easy,Qwen2-VL,0.8757062146892656,0
medium,GPT-4o,0.8195121951219512,0.8541666666666666
medium,GPT-4o mini,0.775609756097561,0.8541666666666666
medium,Gemini-Flash,0.7044334975369458,0.8458333333333333
medium,Claude-3.5,0.7463414634146341,0.8541666666666666
medium,Claude-3,0.6731707317073171,0.8541666666666666
medium,GLM-4v,0.7463414634146341,0.8541666666666666
medium,Qwen2-VL,0.7317073170731707,0
hard,GPT-4o,0.7650273224043715,0.7625
hard,GPT-4o mini,0.6994535519125683,0.7625
hard,Gemini-Flash,0.6815642458100558,0.7458333333333333
hard,Claude-3.5,0.5846994535519126,0.7625
hard,Claude-3,0.7158469945355191,0.7625
hard,GLM-4v,0.6612021857923497,0.7625
hard,Qwen2-VL,0.7103825136612022,0
