The Music Maestro or The Musically Challenged: A Massive Music Evaluation Benchmark for Large Language Models

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Benchmarks play a pivotal role in assessing the advancements of large language models (LLMs). While numerous benchmarks have been proposed to evaluate LLMs' capabilities, there is a notable absence of a dedicated benchmark for assessing their musical abilities. To address this gap, we present ZIQI-Eval, a comprehensive and large-scale music benchmark specifically designed to evaluate the music-related capabilities of LLMs. ZIQI-Eval encompasses a wide range of questions, covering 10 major categories and 56 subcategories, resulting in over 14,000 meticulously curated data entries. By leveraging ZIQI-Eval, we conduct a comprehensive evaluation of over 15 LLMs to analyze their performance in the domain of music. Results indicate that only GPT-4 is capable of effectively understanding and generating music, and even its average accuracy rate suggests that there is ample room for improvement in existing LLMs. With ZIQI-Eval, we aim to provide a standardized and robust evaluation framework that facilitates a comprehensive assessment of LLMs' music-related abilities.
Paper Type: long
Research Area: Special Theme (conference specific)
Contribution Types: Data resources, Data analysis
Languages Studied: Chinese