# Evaluation scripts
We evaluate our models in three categories:
1. In-domain game performance (win rate vs. the untrained baseline)
2. Out-of-domain game performance (win rate vs. the SFT baseline)
3. OOD generalization (reasoning and code benchmarks)
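
The win-rate metric used in categories 1 and 2 can be sketched as follows. This is a minimal illustration, not the repository's actual script: the function name `win_rate`, the outcome labels (`"win"`, `"loss"`, `"draw"`), and the choice to count a draw as half a win are all assumptions made for the example.

```python
from collections import Counter


def win_rate(outcomes, draw_weight=0.5):
    """Return the model's win rate against a baseline.

    `outcomes` is a list of per-game results from the model's point of view.
    The labels and the half-credit for draws are assumptions for this sketch.
    """
    if not outcomes:
        raise ValueError("need at least one game outcome")

    counts = Counter(outcomes)
    unknown = set(counts) - {"win", "loss", "draw"}
    if unknown:
        raise ValueError(f"unexpected outcome labels: {sorted(unknown)}")

    # Wins count fully; each draw contributes `draw_weight` of a win.
    return (counts["win"] + draw_weight * counts["draw"]) / len(outcomes)


# Hypothetical results of four games against a baseline.
print(win_rate(["win", "loss", "win", "draw"]))
```

Setting `draw_weight=0` gives the strict fraction of games won, which may be the better fit for games that cannot end in a draw.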