
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>

<script type="text/javascript" charset="utf-8" src="https://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js"></script> 
<!---
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
--->


<style type="text/css">
body {
    font-family: "Titillium Web", "HelveticaNeue-Light", "Helvetica Neue Light", "Helvetica Neue", Helvetica, Arial, "Lucida Grande", sans-serif;
    font-weight: 300;
    font-size: 20px;
    margin-left: auto;
    margin-right: auto;
}

@media screen and (min-width: 980px){
    body {
        width: 980px;
    }
}


h1 {
    font-weight:300;
    line-height: 1.15em;
}

h2 {
    font-size: 1.75em;
}
a:link,a:visited {
    color: #5364cc;
    text-decoration: none;
}
a:hover {
    color: #208799;
}
h1 {
    text-align: center;
}
h2,h3 {
    text-align: left;
}

h1 {
    font-size: 40px;
    font-weight: 500;
}
h2 {
    font-weight: 400;
    margin: 16px 0px 4px 0px;
}
h3 {
    font-weight: 600;
    margin: 16px 0px 4px 0px;
}

.paper-title {
    padding: 1px 0px 1px 0px;
}
section {
    margin: 32px 0px 32px 0px;
    text-align: justify;
    clear: both;
}
.col-5 {
     width: 20%;
     float: left;
}

.move-down {
    margin-top:1.2cm;
}

.col-4 {
     width: 25%;
     float: left;
}
.col-3 {
     width: 33%;
     float: left;
}
.col-2 {
     width: 50%;
     float: left;
}
.col-1 {
     width: 100%;
     float: left;
}

.author-row, .affil-row {
    font-size: 17px;
}

.author-row-new { 
    text-align: center; 
}

.author-row-new a {
    display: inline-block;
    font-size: 17px;
    padding: 4px;
}

.author-row-new sup {
    color: #313436;
    font-size: 13px;
    padding: 4px;
}

.affiliations-new {
    font-size: 16px;
    text-align: center;
    width: 80%;
    margin: 0 auto;
    margin-bottom: 20px;
}

.row {
    margin: 16px 0px 16px 0px;
}
.authors {
    font-size: 26px;
}
.affiliations {
    font-size: 18px;
}
.affil-row {
    margin-top: 18px;
}
.teaser {
    max-width: 100%;
}
.text-center {
    text-align: center;  
}
.screenshot {
    width: 256px;
    border: 1px solid #ddd;
}
.screenshot-el {
    margin-bottom: 16px;
}
hr {
    height: 1px;
    border: 0; 
    border-top: 1px solid #ddd;
    margin: 0;
}
.material-icons {
    vertical-align: -6px;
}
p {
    line-height: 1.25em;
}
.caption {
    font-size: 16px;
    color: #666;
    margin-top: 4px;
    margin-bottom: 10px;
	text-align: left;
}


video {
    display: block;
    margin: auto;
}


figure {
    display: block;
    margin: auto;
    margin-top: 10px;
    margin-bottom: 10px;
}
#bibtex pre {
    font-size: 14px;
    background-color: #eee;
    padding: 16px;
}
.blue {
    color: #2c82c9;
    font-weight: bold;
}
.orange {
    color: #d35400;
    font-weight: bold;
}
.flex-row {
    display: flex;
    flex-flow: row wrap;
    padding: 0;
    margin: 0;
    list-style: none;
}
.flex-row-center {
    display: flex;
    flex-flow: row wrap;
    padding: 0;
    margin: 0;
    list-style: none;
    justify-content: center;
    text-align: center;
}
.flex-container {
  display: flex;
  flex-wrap: wrap;
}

.flex-item {
  flex: 0 0 50%;
  padding: 10px;
  box-sizing: border-box;
}

.paper-btn-coming-soon {
    position: relative; 
    top: 0;
    left: 0;
}

.coming-soon {
    position: absolute;
    top: -15px;
    right: -15px;
}

.center {
  margin-left: 10.0%;
  margin-right: 10.0%;
}

.paper-btn-small {
  position: relative;
  text-align: center;
  vertical-align: center;

  display: inline-block;
  margin: 8px;
  padding: 8px 8px;

  border-width: 0;
  outline: none;
  border-radius: 2px;
  
  background-color: #E0F7FA;
  color: #01579B !important;
  font-size: 20px;
  width: 100px;
  font-weight: 600;
}



.paper-btn {
  position: relative;
  text-align: center;
  vertical-align: center;

  display: inline-block;
  margin: 8px;
  padding: 8px 8px;

  border-width: 0;
  outline: none;
  border-radius: 2px;
  
  background-color: #E0F7FA;
  color: #01579B !important;
  font-size: 20px;
  width: 250px;
  font-weight: 600;
}

.paper-btn-tapestry {
  position: relative;
  text-align: center;

  display: inline-block;
  margin: 8px;
  padding: 8px 8px;

  border-width: 0;
  outline: none;
  border-radius: 2px;
  
  background-color: #5364cc;
  color: white !important;
  font-size: 20px;
  width: 200px;
  font-weight: 600;
}

.paper-btn-parent {
    display: flex;
    justify-content: center;
    margin: 16px 0px;
}

.paper-btn:hover {
    opacity: 0.85;
}

.container {
    margin-left: auto;
    margin-right: auto;
    padding-left: 16px;
    padding-right: 16px;
}

.venue {
    font-size: 23px;
}

.topnav {
    background-color: #EEEEEE;
    overflow: hidden;
}

.topnav div {
    max-width: 1070px;
    margin: 0 auto;
}

.topnav a {
    display: inline-block;
    color: black;
    text-align: center;
    vertical-align: middle;
    padding: 16px 16px;
    text-decoration: none;
    font-size: 18px;
}

.topnav img {
    padding: 2px 0px;
    width: 100%;
    margin: 0.2em 0px 0.3em 0px;
    vertical-align: middle;
}

pre {
    font-size: 0.9em;
    padding-left: 7px;
    padding-right: 7px;
    padding-top: 3px;
    padding-bottom: 3px;
    border-radius: 3px;
    background-color: rgb(235, 235, 235);
    overflow-x: auto;
}

.download-thumb {
    display: flex;
}

@media only screen and (max-width: 620px) {
    .download-thumb {
        display: none;
    }
}

.paper-stuff {
    width: 50%;
    font-size: 20px;
}

@media only screen and (max-width: 620px) {
    .paper-stuff {
        width: 100%;
    }
}
* {
  box-sizing: border-box;
}

.column {
  text-align: center;
  float: left;
  width: 16.666%;
  padding: 5px;
}
.column3 {
  text-align: center;
  float: left;
  width: 33.333%;
  padding: 5px;
}
.column4 {
  text-align: center;
  float: left;
  width: 50%;
  padding: 5px;
}
.column5 {
  text-align: center;
  float: left;
  width: 20%;
  padding: 5px;
}
.column10 {
  text-align: center;
  float: left;
  width: 10%;
  padding: 5px;
}
.border-right {
    border-right: 1px solid black;
}
.border-bottom{
    border-bottom: 1px solid black;
}


.row-center {
    margin: 16px 0px 16px 0px;
    text-align: center;
}

/* Clearfix (clear floats) */
.row::after {
  content: "";
  clear: both;
  display: table;
}
.img-fluid {
  max-width: 100%;
  height: auto;
}
.figure-img {
  margin-bottom: 0.5rem;
  line-height: 1;
}

.rounded-circle {
  border-radius: 50% !important;
}

/* Responsive layout - makes the three columns stack on top of each other instead of next to each other */
@media screen and (max-width: 500px) {
  .column {
    width: 100%;
  }
}
@media screen and (max-width: 500px) {
  .column3 {
    width: 100%;
  }
}

.left-column {
    float: left;
    width: 5%;
    text-align: center;
    vertical-align: center;
}

.right-column {
    float: right;
    width: 95%;
    text-align: center;
    vertical-align: center;
}

</style>

    <link href='https://fonts.googleapis.com/css?family=Titillium+Web:400,600,400italic,600italic,300,300italic' rel='stylesheet' type='text/css'>
    <head>
        <title> CriticBench: Evaluating Large Language Model as Critic </title>
        <meta name="viewport" content="width=device-width, initial-scale=1">
        <meta property="og:description" content="A comprehensive benchmark for evaluating critique ability of LLMs"/>
        <link href="https://fonts.googleapis.com/css2?family=Material+Icons" rel="stylesheet">
        <!--<link rel="icon" href = "https://images.emojiterra.com/google/noto-emoji/unicode-15.1/color/512px/1f6e0.png">-->
    </head>

 <body>

<div class="container">
    <div class="paper-title">
    <h1>
        CriticBench: Evaluating Large Language Model as Critic
    </h1>
    </div>

    <div id="authors">
        <center>
            <div class="author-row-new">
                <a href="https://github.com/gmftbyGMFTBY">Tian Lan<sup>1*</sup></a>,
                <a href="https://zhangwenwei.cn/">Wenwei Zhang<sup>2*</sup></a>,
                Chen Xu<sup>1</sup>,
                Heyan Huang<sup>1</sup>,
                <a href="http://dahua.site/">Dahua Lin<sup>2</sup></a>,
                <a href="https://chenkai.site/">Kai Chen<sup>2†</sup></a>,
                Xian-ling Mao<sup>1†</sup>
            </div>
        </center>
        <center>
        <div class="affiliations">
            <span><sup>1</sup> Beijing Institute of Technology</span>
            <span><sup>2</sup> Shanghai AI Laboratory</span>
        </div>

        </center>

        <div>
            <div class="paper-btn-parent">
            <a class="paper-btn-small" href="https://arxiv.org/abs/2402.13764">
                <span class="material-icons"> description </span> 
                Paper
            </a>
            <a class="paper-btn-small" href="https://github.com/open-compass/CriticBench">
                <span class="material-icons"> code </span>
                Code
            </a>
            <a class="paper-btn" href="./leaderboard_subjective.html">
                <nobr>
                    <span class="material-icons"> description </span>
                    Subjective Leaderboard
                </nobr>
            </a>
            <a class="paper-btn" href="./leaderboard_objective.html">
                <nobr>
                    <span class="material-icons"> description </span>
                    Objective Leaderboard
                </nobr>
            </a>
            </div>
        </div>
    </div>
    <section id="abstract">
        <h2 style="text-align: center;">Abstract</h2>
        <div class="flex-row" style="width: 75%; margin: 0 auto;">
            <p>
                Critique ability is crucial for the scalable oversight and self-improvement of Large Language Models (LLMs). While many recent studies explore the ability of LLMs to judge and refine flaws in generations, how to comprehensively and reliably measure this ability remains under-explored. This paper introduces <b>CriticBench</b>, a novel benchmark designed to comprehensively and reliably evaluate four key critique ability dimensions of LLMs: feedback, comparison, refinement, and meta-feedback. <b>CriticBench</b> encompasses nine diverse tasks, each assessing the LLMs' ability to critique responses at varying levels of quality granularity. Our extensive evaluations of open-source and closed-source LLMs reveal intriguing relationships between critique ability and tasks, response qualities, and model scales.
            </p>
        </div>
    </section>
    
    <section id="teaser-image">
        <hr>
        <center>
            <figure>
                <a>
                    <img width="65%" src="figure/comparison.png"> 
                </a>
                <p class="caption" style="text-align: center;">
                    Figure 1. Comparison between <b>CriticBench</b> and previous works.
                </p>
            </figure>
        </center>
    </section>

    <section id="evaluation protocol">
        <h2 style="text-align: center;">Introduction</h2>
            <p>
                <b>CriticBench</b> evaluates 9 tasks (<b>translation, general chat, question answering, summarization, harmlessness, math with chain-of-thought, math with program-of-thought, code with execution, code without execution</b>) across 4 critique dimensions (<b>Feedback, Comparison, Correction, Meta-Feedback</b>) on 4 kinds of response quality (<b>low-quality, medium-quality, high-quality, correct</b>). In addition, objective and subjective scores are computed for each task and each critique dimension.
            </p>
            <p>
            Overall, <b>CriticBench</b> exhibits significant advantages over previous benchmarks on critique evaluation (Fig. 1), offering greater diversity in response quality granularity, critique formats, critique dimensions, and data size, which allows deeper analysis of the LLMs' critique capabilities.
            </p>
    </section>
    
    <section id="data generation pipeline">
    <h2 style="text-align: center;">Data Generation Pipeline</h2>
     
    <section id="teaser-image">
        <hr>
        <center>
            <figure>
                <a>
                    <img width="95%" src="figure/overview.png"> 
                </a>
                <p class="caption" style="text-align: center;">
                    Figure 2. Overview and Construction Pipeline of <b>CriticBench</b>.
                </p>
            </figure>
        </center>
    </section>
     
    <div class="flex-row">
        <p>
        <b>CriticBench</b> is constructed with a human-in-the-loop pipeline consisting of three main phases: <b>instruction collection</b>, <b>response generation</b>, and <b>reference critique generation</b>. An overview of the construction is shown in Fig. 2, and the details of each phase are described as follows:
        </p>
        <p class="inline-block mt-3">
            <ol>
              <li><b>Instruction Collection:</b> Instructions for 9 distinct tasks are collected to evaluate critique capabilities comprehensively (Step 1 in Fig. 2). Specifically, the benchmark includes three representative classical language tasks: summarization, translation, and question answering. Since a popular application of LLMs is to serve as a chatbot, where alignment is important to ensure safe use, we collect instructions from general chat scenarios and harmlessness cases to evaluate the LLMs' critique ability for alignment. Furthermore, reasoning and coding capabilities are fundamental for augmenting LLMs as agents, another important and promising application of LLMs. Thus, we also collect instructions for math reasoning with chain-of-thought and program-of-thought, and for coding with and without execution results. To ensure the difficulty of <b>CriticBench</b>, we only collect coding and math reasoning questions that some 70B LLMs cannot correctly answer.</li>
              <li><b>Response Generation:</b> For each collected instruction in each task, LLMs of different scales and capabilities are employed to generate responses with flaws, which naturally form responses of various qualities (Step 2 (a) in Fig. 2). To identify the quality of these responses efficiently, GPT-4 is first used to assign quality ratings ranging from 1 to 7 (Step 2 (b) in Fig. 2), and human annotators then meticulously review and adjust these scores. Subsequently, three responses with distinct quality differences are chosen for each instruction based on their human-verified quality scores: a low-, a medium-, and a high-quality response.</li>
              <li><b>Reference Critique Generation:</b> After collecting instructions and the corresponding responses, we collect reference critiques on these responses with the assistance of GPT-4, covering feedback, correction, comparison, and meta-feedback, to make the subjective evaluation more reliable. Note that the correction and meta-feedback critique dimensions are overlooked in previous work.</li>
            </ol>
        </p>
    </div>
    </section>

    <section id="result">
    <hr>
    <h2 style="text-align: center;">Result</h2>
    <div class="flex-row-center">
        <p>
        The latest <a href="./leaderboard_subjective.html">subjective</a> and <a href="./leaderboard_objective.html">objective</a> leaderboards for CriticBench are available on their own pages.
        </p>
        <br>
    </div>
    </section>   

    <section>
        <hr>
        This webpage template was recycled from <a href='https://nv-tlabs.github.io/LION/'>here</a>.
        <!-- <center><p><a href='https://accessibility.mit.edu/'><b>Accessibility</b></a></p></center> -->
    </section>

</div>
</body>
</html>
