difficulty,task_id,model_family,model_name,run_times,max_turns,failure_num,num_correct,total_samples,accuracy
baseline,char_to_num,gpt,gpt-4.1-2025-04-14,run_1,12.0,0,0,2,0.0