gpt-4-0613
math_word_problem_generation   & 20.0 & 47.1 &  0.0 &  0.0 & 55.0 \\
finegrained_fact_verification  & 24.3 &  5.7 & 37.1 &  0.0 & 55.7 \\
answerability_classification   & 20.0 &  0.0 &  0.0 & 43.6 & 57.9 \\

meta-llama/Llama-2-70b-chat-hf
math_word_problem_generation   & 49.4 & 63.1 &  0.0 &  0.0 & 75.0 \\
finegrained_fact_verification  & 55.6 & 44.4 & 45.6 &  0.0 & 78.8 \\
answerability_classification   & 37.5 &  0.0 &  0.0 & 51.9 & 77.5 \\

