The Curious Case of LLM Evaluations

24 Jan 2024 · OpenReview Archive Direct Upload
Abstract: With every popular paper, we keep coming back to the same question: how do we know that this is a good evaluation? Unfortunately, the answer is not simple. I might even go so far as to say that, in all likelihood, it is not solid. We might want it to be, but evaluation and benchmarking were already complicated, even for classification models. We honestly never solved them for small generative models and long-form generation, and then suddenly we were faced with an influx of extremely large, multi-purpose language models, a.k.a. foundation models. Now everyone is left reporting numbers on carefully curated academic datasets that, in all likelihood, leaked into the training set when the whole accessible internet was scraped, and relying on buggy techniques, because we as ML practitioners were never taught basic statistics.