Abstract: Auto-graders are increasingly popular in computer science education, but evaluating their effectiveness remains difficult. Of the 101 auto-graders reviewed by Keuning et al., a third are evaluated by surveys [5]. However, it is possible that the effectiveness ratings students give to auto-graders in surveys (“the ratings”) are biased by the grades students receive from auto-graders (“the grades”), and that the ratings reflect not the quality of feedback but sentiment toward grades. We assess that possibility in the context of a graduate-level cloud computing course, comparing students’ grades, ratings, and utilization of verbose feedback from the auto-graders (“the utilization”). Our data set includes 1,163 students and 2,200 survey responses from 4 semesters, in which students are surveyed after each of the 7 projects in the course. We find that in 96% of the responses, students rate the auto-graders either “effective” or “somewhat effective”. We find a weak correlation (\(\rho = 0.19\)) between ratings and grades, and almost no correlation between ratings and utilization (\(\rho = 0.07\)), both statistically significant (\(p < 0.01\)). We also find that many students give different ratings in consecutive survey responses, which suggests that they responded to the surveys thoughtfully. Our findings show no evidence that student ratings are biased by their grades or by their utilization of auto-graders. We encourage fellow researchers to incorporate similar surveys alongside empirical evaluations of auto-graders to validate their results.
External IDs: doi:10.1007/978-3-032-03873-9_38