A Dataset for Evaluating LLM-based Evaluation Functions for Research Question Extraction Task

ACL ARR 2024 June Submission 2043 Authors

15 Jun 2024 (modified: 04 Jul 2024) · CC BY 4.0
Abstract: Progress in text summarization techniques has been remarkable. However, the task of accurately extracting and summarizing the necessary information from highly specialized documents such as research papers has not been sufficiently investigated. We focus on the task of extracting research questions (RQs) from research papers and construct a new dataset consisting of machine learning papers, RQs extracted from these papers by GPT-4, and human evaluations of the extracted RQs from multiple perspectives. Using this dataset, we systematically compared recently proposed LLM-based evaluation functions for summarization and found that none of them correlates sufficiently well with human evaluations. We expect our dataset to provide a foundation for further research on evaluation functions tailored to the RQ extraction task and thereby to help improve performance on the task. The dataset is available at https://anonymous.4open.science/r/PaperRQ-HumanAnno-Dataset-8473/README.md.
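The meta-evaluation the abstract describes boils down to correlating an automatic evaluator's scores with human ratings. Below is a minimal sketch of that step; it is not the authors' code, and the scores and variable names are hypothetical, assuming one automatic score and one aggregated human rating per extracted RQ.

```python
# Hypothetical sketch: correlate LLM-based evaluator scores with human ratings.
# Data values are illustrative placeholders, not from the paper's dataset.
from scipy.stats import kendalltau, spearmanr

auto_scores = [4.2, 3.1, 4.8, 2.5, 3.9]   # scores from an LLM-based evaluation function
human_scores = [4.0, 2.0, 5.0, 3.0, 3.5]  # mean human ratings of the same extracted RQs

tau, tau_p = kendalltau(auto_scores, human_scores)
rho, rho_p = spearmanr(auto_scores, human_scores)
print(f"Kendall tau = {tau:.3f} (p = {tau_p:.3f})")
print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")
```

Rank correlations such as Kendall's tau and Spearman's rho are the usual choice here because automatic and human scores live on different scales; a "sufficiently high" correlation in the abstract's sense would be a value near 1.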
Paper Type: Long
Research Area: Summarization
Research Area Keywords: Generation, Human-Centered NLP, Language Modeling, NLP Applications, Summarization
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis, Position papers
Languages Studied: English
Submission Number: 2043