Quality Evaluation of General-Domain Chinese Dialogue Generation Models with a BLEURT-based Model

Abstract: Chinese dialogue generation models have improved markedly over the past year. However, evaluating the generated dialogue still requires human judges, which costs time and labor every time a new model is tested. Automated evaluation methods have not yet achieved a balance between reducing human effort and maintaining evaluation accuracy; current methods remain unstable and inefficient. Popular natural language evaluation metrics such as BLEU and ROUGE perform reasonably well on translation tasks, but they do not reach the same level of stability on open-ended generation tasks such as dialogue or question answering. Recently, the BLEURT model introduced a trained-evaluation approach for automatically assessing translation models: with a relatively small amount of human-rated data, the model can be trained to evaluate a natural language generation task. We adopt the pre-trained BLEURT model as our base model and fine-tune it to learn human evaluations of generated Chinese dialogue. The resulting model produces evaluations close to human judgments, and we find that some generation models can produce better dialogue than the average human on social media.
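The learned-metric idea behind this approach can be illustrated with a toy sketch: given (reference, candidate) pairs annotated with human quality scores, a regressor is trained to predict the human score. In BLEURT the regressor is a fine-tuned pre-trained transformer; the sketch below substitutes hand-crafted word-overlap features and a linear model purely to keep the illustration self-contained, and the training triples are hypothetical examples, not data from the paper.

```python
# Toy illustration of the learned-metric idea behind BLEURT-style evaluation:
# fit a regressor that maps a (reference, candidate) pair to a human rating.
# The real method fine-tunes a pre-trained transformer; here the "features"
# are simple word-overlap statistics, chosen only to keep the sketch runnable.

def features(reference, candidate):
    ref, cand = set(reference.split()), set(candidate.split())
    overlap = len(ref & cand) / max(len(ref | cand), 1)        # Jaccard overlap
    len_ratio = min(len(cand), len(ref)) / max(len(cand), len(ref), 1)
    return [overlap, len_ratio, 1.0]                            # 1.0 is a bias term

def train(data, lr=0.5, epochs=500):
    """Minimise MSE between predicted and human scores with per-example SGD."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for ref, cand, human_score in data:
            x = features(ref, cand)
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - human_score
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def score(w, reference, candidate):
    """Predicted quality of `candidate` relative to `reference`."""
    return sum(wi * xi for wi, xi in zip(w, features(reference, candidate)))

# Hypothetical human-rated triples: (reference, candidate, score in [0, 1]).
data = [
    ("how are you today", "how are you today", 1.0),
    ("how are you today", "i am fine thanks", 0.3),
    ("the weather is nice", "the weather is nice today", 0.8),
    ("the weather is nice", "banana", 0.1),
]
w = train(data)
```

After training, the learned weights rank a faithful candidate above an unrelated one, which is exactly the property a learned dialogue-evaluation metric is trained to capture at much larger scale.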