Abstract: The success of large language models has shifted the evaluation paradigms in natural language processing (NLP). The community's interest has drifted towards comparing NLP models across many tasks, domains, and datasets, often at an extreme scale. This imposes new engineering challenges: efforts in constructing datasets and models have been fragmented, and their formats and interfaces are incompatible. As a result, it often takes extensive (re)implementation efforts to make fair and controlled comparisons at scale.
Catwalk aims to address these issues. Catwalk provides a unified interface to a broad range of existing NLP datasets and models, from canonical supervised training and fine-tuning to more modern paradigms like in-context learning. Its carefully designed abstractions allow for easy extension to many others. Catwalk substantially lowers the barriers to conducting controlled experiments at scale. For example, we fine-tuned and evaluated over 64 models on over 86 datasets with a single command, without writing any code. Maintained by the AllenNLP team at the Allen Institute for Artificial Intelligence (AI2), Catwalk is an ongoing open-source effort: https://github.com/allenai/catwalk.