Keywords: sentence transformers, natural language processing, representation learning, visualisation, benchmark, metascience
Abstract: The ICLR conference is unique among the top machine learning conferences in that all submitted papers are openly available. Here we present the _ICLR dataset_, consisting of the abstracts of all 24 thousand ICLR submissions from 2017 to 2024 together with metadata, decision scores, and custom keyword-based labels. We find that on this dataset, a bag-of-words representation outperforms most dedicated sentence transformer models in terms of $k$NN classification accuracy, and the top-performing language models barely outperform TF-IDF. We see this as a challenge for the NLP community. Furthermore, we use the ICLR dataset to study how the field of machine learning has changed over the last seven years, finding some improvement in gender balance. Using a 2D embedding of the abstracts' texts, we describe a shift in research topics from 2017 to 2024 and identify _hedgehogs_ and _foxes_ among the authors with the highest number of ICLR submissions.
Primary Subject Area: Data collection and benchmarking techniques
Paper Type: Research paper: up to 8 pages
Participation Mode: In-person
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Submission Number: 55
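
The $k$NN evaluation of a TF-IDF representation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the TF-IDF variant, the value of $k$, the distance measure, or the evaluation split, so the smoothed IDF formula, cosine similarity, `k=1`, and leave-one-out protocol below are all assumptions, and the four toy "abstracts" and their labels are invented purely for illustration. The helper names (`tfidf_vectors`, `cosine`, `knn_loo_accuracy`) are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return L2-normalised TF-IDF vectors (dict: term -> weight) for tokenised docs."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (one common variant; an assumption, not the paper's stated choice).
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: f * idf[t] for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors (a plain dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def knn_loo_accuracy(vectors, labels, k=1):
    """Leave-one-out kNN accuracy with a majority vote over the k most similar docs."""
    correct = 0
    for i, query in enumerate(vectors):
        sims = [(cosine(query, v), j) for j, v in enumerate(vectors) if j != i]
        sims.sort(reverse=True)
        votes = Counter(labels[j] for _, j in sims[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / len(vectors)


# Invented toy data, NOT drawn from the ICLR dataset.
abstracts = [
    "graph neural network for node classification",
    "message passing graph neural network on molecules",
    "policy gradient for reinforcement learning agents",
    "reward shaping in deep reinforcement learning",
]
labels = ["graphs", "graphs", "rl", "rl"]

docs = [a.lower().split() for a in abstracts]
vectors = tfidf_vectors(docs)
accuracy = knn_loo_accuracy(vectors, labels, k=1)
print(f"leave-one-out kNN accuracy: {accuracy:.2f}")
```

To compare against a sentence transformer under the same protocol, one would replace the output of `tfidf_vectors` with that model's embeddings and keep the $k$NN step unchanged, which is what makes $k$NN accuracy a representation-level benchmark.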