Abstract: We have applied BLEU (Papineni et al., 2001), a method originally designed to evaluate automatic Machine Translation systems, to the assessment of short essays written by students. We study how well BLEU scores correlate with human scores and with other keyword-based evaluation metrics. We conclude that, although it is applicable only to a restricted category of questions, BLEU attains better results than other keyword-based procedures. Its simplicity and language independence make it a good candidate to be combined with other well-studied computer assessment scoring procedures.
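To make the evaluation setup concrete, the following is a minimal sketch of the standard BLEU computation (clipped modified n-gram precision combined with a brevity penalty, as defined by Papineni et al.) applied to a student answer against teacher-written reference answers. The function, the example sentences, and the choice of bigram order are illustrative assumptions, not taken from the paper; very short answers typically require a low n-gram order or smoothing to avoid zero precisions.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return a Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Plain BLEU for one candidate (student answer) against several
    reference answers, each given as a list of tokens."""
    weights = [1.0 / max_n] * max_n
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if clipped == 0:
            return 0.0  # short answers often need smoothing or a smaller max_n
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: use the reference length closest to the candidate length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1.0 - r / max(c, 1))
    return bp * math.exp(sum(w * lp for w, lp in zip(weights, log_precisions)))


# Hypothetical example: one student answer scored against two reference answers.
student = "the mitochondria produces energy for the cell".split()
refs = ["the mitochondrion produces energy for the cell".split(),
        "mitochondria generate the energy used by the cell".split()]
print(round(bleu(student, refs, max_n=2), 3))  # bigram BLEU, roughly 0.816
```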