Keywords: ChatGPT, Bard, Claude, GPT-4, large language models, chatbots, conversational agents
TL;DR: A carefully designed benchmark for evaluating new large-scale language models
Abstract: Although informal evaluations of modern LLMs can be found on social media, blogs, and
news outlets, a formal and comprehensive comparison among them has yet to be conducted.
In response to this gap, we have undertaken an extensive benchmark evaluation of LLMs and
conversational bots. Our evaluation involved the collection of 1002 questions encompassing 27
categories, which we refer to as the “Wordsmiths dataset.” These categories include reasoning,
logic, facts, coding, bias, language, humor, and more. Each question in the dataset is accompanied
by an accurate and verified answer. We meticulously assessed four leading chatbots: ChatGPT,
GPT-4, Bard, and Claude, using this dataset. The results of our evaluation revealed the following
key findings: a) GPT-4 emerged as the top-performing chatbot across all categories, achieving a
success rate of 84.1%. On the other hand, Bard faced challenges and achieved a success rate of
62.4%. b) Among the four models evaluated, at least one responded correctly approximately
93% of the time; however, all four models were correct on the same question only about 44% of
the time. c) Bard's responses are less correlated with those of the other models, whereas ChatGPT and GPT-4 are highly correlated in their responses.
d) Chatbots demonstrated proficiency in language understanding, facts, and self-awareness.
However, they encountered difficulties in areas such as math, coding, IQ, and reasoning. e) In
the bias, discrimination, and ethics categories, models generally performed well, suggesting
they are relatively safe to utilize. To make future model evaluations on our dataset easier,
we also provide a multiple-choice version of it (called Wordsmiths-MCQ). The understanding
and assessment of the capabilities and limitations of modern chatbots hold immense societal
implications. In an effort to foster further research in this field, we have made our dataset
available for public access, which can be found at [masked].
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8398