Reducing Human Effort in Evaluating Small and Medium Language Models as Students and as Teachers

Published: 08 Jun 2025, Last Modified: 08 Jun 2025 · DaSH · CC BY-NC-ND 4.0
Keywords: Small and Medium Language Models, Evaluation, MCQ Generation
TL;DR: Language Models as Students and as Teachers
Abstract: Multiple Choice Questions (MCQs) are commonly used by teachers to assess student understanding, but generating high-quality MCQs is a demanding task. Large Language Models (LLMs) offer a potential solution, yet their use raises concerns about privacy, cost, and energy consumption, especially in educational settings. In this paper, we present a simple and reproducible evaluation framework designed to assess the ability of small and medium-sized LMs to answer (LM as student) and generate (LM as teacher) high-quality MCQs. The framework uses a set of clearly defined measures, such as syntactic correctness, relevance to source material, distractor quality, and answer consistency, to provide a detailed analysis of model performance. We applied the framework to evaluate several language models and found that each exhibits distinct strengths and weaknesses across different metrics. Notably, some small models—such as Phi-3.5-mini and Llama3.1:8b—outperform larger peers in specific areas, demonstrating that model size does not always correlate with overall quality. These findings empower teachers to choose models that best align with their goals and priorities, reinforcing their agency while highlighting the practical value of lightweight models in educational settings. We also outline future work, including targeted fine-tuning to improve model performance on specific MCQ quality dimensions.
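The abstract does not specify how the evaluation measures are computed, so the following is only a minimal illustrative sketch of how such checks might look in code; the MCQ structure, the three measure functions, and the `student` callable are assumptions for illustration, not the authors' framework.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MCQ:
    question: str
    options: List[str]
    answer: str  # the keyed correct option


def is_syntactically_correct(mcq: MCQ) -> bool:
    # Structural checks: non-empty stem, at least three options,
    # and the keyed answer actually appears among the options.
    return bool(mcq.question.strip()) and len(mcq.options) >= 3 and mcq.answer in mcq.options


def distractor_quality(mcq: MCQ) -> float:
    # Fraction of distractors that are non-empty and mutually distinct.
    distractors = [o for o in mcq.options if o != mcq.answer]
    unique = {d.strip().lower() for d in distractors if d.strip()}
    return len(unique) / max(len(distractors), 1)


def answer_consistency(mcq: MCQ, student: Callable[[MCQ], str], trials: int = 3) -> float:
    # Fraction of repeated "LM as student" calls that agree with the keyed answer.
    picks = [student(mcq) for _ in range(trials)]
    return sum(p == mcq.answer for p in picks) / trials


if __name__ == "__main__":
    sample = MCQ(
        question="Which gas do plants absorb during photosynthesis?",
        options=["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"],
        answer="Carbon dioxide",
    )
    # Placeholder student that always picks the keyed answer; in practice this
    # would wrap a small or medium LM prompted with the question and options.
    always_right = lambda q: q.answer
    print(is_syntactically_correct(sample),
          distractor_quality(sample),
          answer_consistency(sample, always_right))
```

In such a setup, per-measure scores are reported separately rather than collapsed into a single number, which matches the paper's aim of letting teachers weigh the dimensions that matter most to them.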
Submission Number: 2