Track: Responsible AI for Education (Day 2)
Paper Length: short-paper (2 pages + references)
Keywords: AI, ML, NLP, LLMs, Education, ChatGPT, GPT-4, GPT-3.5, University, Higher Education, LLMs Impact on Education, LLMs Performance in Higher Education, University Courses' Vulnerabilities to LLMs, GPT-4 Grading
TL;DR: We investigate how susceptible higher education courses are to LLMs such as ChatGPT, and what makes questions more or less vulnerable, as well as whether GPT-4 can be trusted as a grader.
Abstract: With the release of ChatGPT, the incredible potential of Large Language Models (LLMs) to perform a wide array of tasks has been seared into the public mind, inviting both excitement and concern about the significant changes caused by widespread LLM usage.
This paper assesses how grounded these concerns are by investigating the extent to which university students can leverage these models to answer questions and problems from STEM courses. We examine the abilities of GPT-3.5 and GPT-4 in a bilingual, college-level educational setting by having them answer questions from $\sim$100 of our university's courses across a variety of subjects. We employ state-of-the-art prompting strategies and analyze the results along several axes.
Using both automatic grading and human grading by our university's teaching staff, we find that GPT-4 consistently outperforms GPT-3.5; notably, the latter is freely available to the public and still passes 34\% of the courses it was tested on.
We observe that the models' performance is affected not only by the prompting strategy but also by course topic and language. While they perform better in English, working in another language does not necessarily impede them. We also find introductory and general courses to be more susceptible to LLMs, though the models struggle with uncommon question formats and questions that require multi-step reasoning. We conclude that all courses have some level of vulnerability to LLMs.
On the other side of applying LLMs to educational domains, we analyze GPT-4's potential as an automatic grader. We find it falls short of human graders, in part because of its tendency to avoid marking answers as definitively correct or incorrect.
Finally, we provide a set of implications and takeaways for educators on how to make their course material less susceptible to the challenges posed by LLM usage.
Submission Number: 31