The Music Maestro or The Musically Challenged: A Massive Music Evaluation Benchmark for Large Language Models

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Benchmarks play a pivotal role in assessing the advancements of large language models (LLMs). While numerous benchmarks have been proposed to evaluate LLMs' capabilities, there is a notable absence of a dedicated benchmark for assessing their musical abilities. To address this gap, we present ZIQI-Eval, a comprehensive and large-scale music benchmark specifically designed to evaluate the music-related capabilities of LLMs. ZIQI-Eval encompasses a wide range of questions, covering 10 major categories and 56 subcategories, resulting in over 14,000 meticulously curated data entries. By leveraging ZIQI-Eval, we conduct a comprehensive evaluation of over 15 LLMs to analyze their performance in the domain of music. Results indicate that only GPT-4 is capable of effectively understanding and generating music, and even its average accuracy rate suggests that there is ample room for improvement in existing LLMs. With ZIQI-Eval, we aim to provide a standardized and robust evaluation framework that facilitates a comprehensive assessment of LLMs' music-related abilities.
Paper Type: long
Research Area: Special Theme (conference specific)
Contribution Types: Data resources, Data analysis
Languages Studied: Chinese