Abstract: The success of large language models has shifted the evaluation paradigms in natural language processing (NLP). The community's interest has drifted towards comparing NLP models across many tasks, domains, and datasets, often at an extreme scale. This imposes new engineering challenges: efforts in constructing datasets and models have been fragmented, and their formats and interfaces are incompatible. As a result, it often takes extensive (re)implementation efforts to make fair and controlled comparisons at scale.
Catwalk aims to address these issues. Catwalk provides a unified interface to a broad range of existing NLP datasets and models, from canonical supervised training and fine-tuning to more modern paradigms like in-context learning. Its carefully designed abstractions allow for easy extension to many others. Catwalk substantially lowers the barriers to conducting controlled experiments at scale. For example, we fine-tuned and evaluated over 64 models on over 86 datasets with a single command, without writing any code. Maintained by the AllenNLP team at the Allen Institute for Artificial Intelligence (AI2), Catwalk is an ongoing open-source effort: https://github.com/allenai/catwalk.