TALEC: Teach Your LLM to Evaluate in Specific Domain with In-house Criteria by Criteria Division and Zero-shot Plus Few-shot

ACL ARR 2024 June Submission 5756 Authors

16 Jun 2024 (modified: 17 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: With the rapid development of large language models (LLMs), the evaluation of LLMs becomes increasingly important. Measuring text generation tasks such as summarization and article creation is very difficult. Especially in specific application domains (e.g., to-business or to-customer service), in-house evaluation criteria have to meet not only general standards (correctness, helpfulness, creativity, etc.) but also the specific needs of customers and business security requirements at the same time, making evaluation even more difficult. So far, the evaluation of LLMs in business scenarios has mainly relied on manual judgment, which is expensive and time-consuming. In this paper, we propose a model-based evaluation method, TALEC, which allows users to flexibly set their own evaluation criteria and uses in-context learning (ICL) to teach the judge model these in-house criteria. In addition, we try combining zero-shot and few-shot prompting to make the judge model focus on more information. We also propose a prompt paradigm and an engineering approach to adjust and iterate the shots, helping the judge model better understand the complex criteria. We then compare fine-tuning with ICL, finding that fine-tuning can be replaced by ICL. TALEC demonstrates a strong capability to accurately reflect human preferences and achieves a correlation of over 80% with human judgments, outperforming even the inter-human correlation in some tasks.
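To make the combination of zero-shot instructions and few-shot demonstrations concrete, the sketch below assembles a judge prompt from an in-house criterion statement plus a handful of human-scored examples. It is only an illustration of the general ICL judging idea under assumed placeholders (the criterion text, the example records, the 1-5 scale, and the helper names are hypothetical); it does not reproduce the paper's actual prompt paradigm or criteria.

```python
# Minimal sketch of an ICL judge prompt: a zero-shot in-house criterion
# followed by few-shot scored examples. All strings and the scoring scale
# below are hypothetical placeholders, not the paper's prompts.

from dataclasses import dataclass
from typing import List


@dataclass
class Shot:
    """One few-shot demonstration: an answer plus its human score and rationale."""
    question: str
    answer: str
    score: int       # assumed 1-5 scale for illustration
    rationale: str


def build_judge_prompt(criterion: str, shots: List[Shot],
                       question: str, answer: str) -> str:
    """Assemble the evaluation prompt.

    The zero-shot part states the in-house criterion; the few-shot part
    shows how humans applied it, so the judge model can imitate both the
    rule and its application.
    """
    lines = [
        "You are a strict domain evaluator.",
        "Evaluate the answer ONLY against the following in-house criterion:",
        criterion,
        "",
        "Scored examples:",
    ]
    for i, s in enumerate(shots, 1):
        lines += [
            f"Example {i}:",
            f"  Question: {s.question}",
            f"  Answer: {s.answer}",
            f"  Score: {s.score}",
            f"  Rationale: {s.rationale}",
            "",
        ]
    lines += [
        "Now score the new answer on the same 1-5 scale and explain briefly.",
        f"Question: {question}",
        f"Answer: {answer}",
        "Score:",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical in-house business rule and demonstrations.
    criterion = ("Answers about pricing must quote the official rate card "
                 "verbatim and must not promise unauthorized discounts.")
    shots = [
        Shot("What does plan A cost?",
             "Plan A is 99 CNY per month, as listed on the rate card.", 5,
             "Quotes the rate card; no unauthorized discount."),
        Shot("What does plan A cost?",
             "Around 90 CNY, and I can get you 10% off.", 1,
             "Invents a discount, violating business policy."),
    ]
    prompt = build_judge_prompt(
        criterion, shots,
        "How much is plan B?",
        "Plan B costs 199 CNY per month according to the rate card.")
    print(prompt)  # send this text to whichever judge model you use
```

In this sketch the few-shot rationales carry the "teaching" signal: iterating on which examples (and which rationales) to include is where the shot-adjustment step would plug in.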
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: LLM, automatic evaluation, in-context learning
Contribution Types: NLP engineering experiment
Languages Studied: Chinese
Submission Number: 5756