Abstract: Large language models (LLMs) have been widely deployed as autonomous agents capable of following user instructions and making decisions in real-world applications.
Previous studies have made notable progress in benchmarking the instruction-following capabilities of LLMs in general domains, with a primary focus on their inherent commonsense knowledge.
Recently, LLMs have been increasingly deployed as domain-oriented agents, which rely on domain-oriented guidelines that may conflict with their commonsense knowledge.
These guidelines exhibit two key characteristics: they consist of a wide range of domain-oriented rules and are subject to frequent updates.
Despite these challenges, there is still no comprehensive benchmark for evaluating the domain-oriented guideline-following capabilities of LLMs, which poses a significant obstacle to their effective assessment and further development.
In this paper, we introduce \textsc{GuideBench}, a comprehensive benchmark designed to evaluate the guideline-following performance of LLMs. \textsc{GuideBench} evaluates LLMs on three critical aspects: (i) adherence to diverse rules, (ii) robustness to rule updates, and (iii) alignment with human preferences.
Experimental results on a range of LLMs indicate substantial opportunities for improving their ability to follow domain-oriented guidelines.
Data and code are available at \url{https://github.com/Dlxxx/GuideBench}.