On the evaluation of dialogue systems with next utterance classification