Track: Technical
Keywords: Large Language Models, Evaluations of LMs, Reasoning
TL;DR: ReFeR is a new framework for evaluating NLG with LLMs in a peer-review format; it improves evaluation accuracy, reasoning, and feedback quality, and its feedback enables effective fine-tuning of smaller models for better evaluation.
Abstract: Assessing the quality of Natural Language Generation (NLG) outputs, such as those produced by large language models (LLMs), poses significant challenges. Human evaluations are not scalable, and traditional automatic metrics exhibit low correlation with human judgment. In this study, we propose Review-Feedback-Reason (ReFeR), a novel evaluation framework for NLG using LLM agents. The proposed framework enhances the accuracy of NLG evaluation, surpassing previous benchmarks by $\sim$20\%. Moreover, the feedback collected by our framework is leveraged to instruction fine-tune smaller models such as Mistral-7B, yielding better correlation with human evaluations and performance nearly on par with GPT-3.5. We highlight another ancillary benefit of our methodology through its application to reasoning benchmarks, where it outperforms most state-of-the-art methods and beats GPT-3.5 Turbo by $\sim$11.67\% and GPT-4 by $\sim$1\% on average.
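To make the peer-review-style evaluation concrete, below is a minimal, hypothetical sketch in Python: several peer LLMs independently review a candidate NLG output, and a meta-reviewer LLM aggregates their reviews into a final score and feedback. The prompts, model roles, and aggregation step here are assumptions for illustration, not ReFeR's actual implementation.

```python
# Illustrative sketch of a peer-review-style LLM evaluation loop.
# Prompts, roles, and aggregation are assumptions, not ReFeR's exact design.
from openai import OpenAI

client = OpenAI()

def ask(model: str, prompt: str) -> str:
    """Single chat-completion call returning the model's text reply."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content

def peer_review_evaluate(source: str, candidate: str,
                         peer_models=("gpt-3.5-turbo", "gpt-4"),
                         meta_model="gpt-4") -> str:
    # Step 1: each peer LLM independently reviews the candidate output.
    reviews = []
    for m in peer_models:
        reviews.append(ask(m,
            f"Source:\n{source}\n\nCandidate output:\n{candidate}\n\n"
            "Review this output: give a 1-10 quality score with reasoning."))
    # Step 2: a meta-reviewer aggregates the peer reviews into a final score
    # plus feedback; such feedback could later serve as instruction-tuning data
    # for a smaller model (e.g., Mistral-7B), as described in the abstract.
    joined = "\n\n".join(f"Review {i + 1}:\n{r}" for i, r in enumerate(reviews))
    return ask(meta_model,
        f"{joined}\n\nAggregate these peer reviews into a final 1-10 score "
        "and concise feedback for the candidate output.")
```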
Submission Number: 53