model_name,thinking_mode,model_family,instruction_tuned,source,AGI-Eval,ANLI,ARC-Challenge,ARC-Easy,Average,BigBenchHard,BoolQ,CodeGeneration,CommonsenseQA,DROP,FactualKnowledge,GPQA,GSM8K-CoT,HellaSwag,HumanEval,LanguageUnderstanding,MATH,MBPP,MGSM,MMLU,Math,MedQA,Multilingual,OpenBookQA,PIQA,PopularAgg,Reasoning,Robustness,SimpleQA,SocialIQa,TriviaQA,TruthfulQA,WinoGrande
Microsoft/Phi-3-Mini-128K-Instruct,Non-thinking,Phi,Yes,Phi Report (Benchmarks Small/Medium Models),39.5,52.3,85.5,,66.4,72.1,77.1,,,,,29.7,85.3,70.5,60.4,,,70.0,,69.7,,56.4,,78.8,80.1,,,,,74.7,57.8,64.8,71.0
Microsoft/Phi-3-Medium-128K-Instruct-14B,Non-thinking,Phi,Yes,Phi Report (Benchmarks Small/Medium Models),49.7,57.3,91.0,97.6,77.3,77.9,86.5,,82.2,,,,87.5,81.6,58.5,,,73.8,,76.6,,67.6,,87.2,87.8,,,,,79.0,73.9,74.3,78.9
Microsoft/Phi-4-14B,Non-thinking,Phi,No,Phi Report (Benchmarks Small/Medium Models),,,,,,,,,,75.5,,56.1,,,82.6,,80.4,,80.6,84.8,,,,,,,,,3.0,,,,
Microsoft/Phi-3-Mini-128K-Instruct,Non-thinking,Phi,Yes,Phi Report (Aggregated Categories Small Models),,,,,,,,61.0,,,35.8,,,,,57.5,,,,,51.6,,56.4,,,60.6,69.4,61.1,,,,,
Microsoft/Phi-3-Medium-128K-Instruct-14B,Non-thinking,Phi,Yes,Phi Report (Medium/Large Instruction Models),49.7,57.3,91.0,97.6,77.3,77.9,86.5,,82.2,,,,87.5,81.6,58.5,,,73.8,,76.6,,67.6,,87.2,87.8,,,,,79.0,73.9,74.3,78.9
Microsoft/Phi-4-14B,Non-thinking,Phi,No,Phi Report (Medium/Large Instruction Models),,,,,,,,,,75.5,,56.1,,,82.6,,80.4,,80.6,84.8,,,,,,,,,3.0,,,,
