Keywords: A reliable foundation model, Benchmark datasets, LLM in Education
Abstract: Generative AI currently performs well across many fields. In particular, GPT-4, one of the leading foundation models, has been evaluated for discourse quality, knowledge level, and problem-solving ability on various benchmark datasets. However, it remains questionable whether a foundation model can appropriately adjust the level of its output to the user’s knowledge level. If the model fails to account for the user’s knowledge level, the quality and reliability of the discourse inevitably decrease. Yet existing datasets remain insufficient for measuring whether a foundation model responds appropriately to the user’s knowledge level. Therefore, drawing on Korean educational experts and curricula, we developed a benchmark dataset to evaluate whether a foundation model can produce discourse appropriate to the user’s knowledge level. This mini-dataset consists of approximately 500 Korean items centered on science and current events in science, and we introduce an evaluation method based on it. The dataset will be released soon after it is expanded.
Submission Number: 34