<!DOCTYPE html>
<html lang="en-US">
  <head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>EvalLM</title>
    <link rel="stylesheet" href="/assets/css/styles.css">
    <link rel="preconnect" href="https://fonts.gstatic.com">
    <link href="https://fonts.googleapis.com/css2?family=Open+Sans&display=swap" rel="stylesheet">
    <link href="https://fonts.googleapis.com/css?family=Lato" rel="stylesheet">
    <link rel="preconnect" href="https://fonts.googleapis.com">
    <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@100;200;300;400;500;600;700;800;900&family=Lato:ital,wght@0,100;0,300;0,400;0,700;0,900;1,100;1,300;1,400;1,700;1,900&family=Noto+Sans:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&display=swap" rel="stylesheet">
    <link rel="apple-touch-icon" sizes="180x180" href="/assets/favicon/apple-touch-icon.png">
    <link rel="icon" type="image/png" sizes="32x32" href="/assets/favicon/favicon-32x32.png">
    <link rel="icon" type="image/png" sizes="16x16" href="/assets/favicon/favicon-16x16.png">
    <link rel="manifest" href="/assets/favicon/site.webmanifest">
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons/css/academicons.min.css">
    <!--[if lt IE 9]>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv.min.js"></script>
    <![endif]-->
    <base target="_blank">
  </head>
  <body>
    <div>
      <div class="wrapper">
        <h1 style="font-family: 'Inter', sans-serif; text-align: center;">
          <span style="vertical-align:middle; color: #555555; font-variant: small-caps;">EvalLM</span>
        </h1>
        <h4 style="text-align: center; font-size: 22px;">Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria</h4>
        <div class="authors-wrapper">
        
          <div class="author-container">
            <div class="author-image">
              
                <a href="https://taesookim.com">
                  <img src="/assets/img/taesoo.jpeg" alt="Photo of Tae Soo Kim"/>
                </a>
              
            </div>
            <p>
            
              <a href="https://taesookim.com">
                Tae Soo Kim
              </a>
            
            </p>
            <p>
              KAIST
            </p>
          </div>
        
          <div class="author-container">
            <div class="author-image">
              
                <a href="https://yoonjoolee.com">
                  <img src="/assets/img/yoonjoo.jpeg" alt="Photo of Yoonjoo Lee"/>
                </a>
              
            </div>
            <p>
            
              <a href="https://yoonjoolee.com">
                Yoonjoo Lee
              </a>
            
            </p>
            <p>
              KAIST
            </p>
          </div>
        
          <div class="author-container">
            <div class="author-image">
              
                <a href="https://www.jayshin.xyz/">
                  <img src="/assets/img/jamin.jpg" alt="Photo of Jamin Shin"/>
                </a>
              
            </div>
            <p>
            
              <a href="https://www.jayshin.xyz/">
                Jamin Shin
              </a>
            
            </p>
            <p>
              NAVER AI Lab
            </p>
          </div>
        
          <div class="author-container">
            <div class="author-image">
              
                <a href="http://younghokim.net/">
                  <img src="/assets/img/youngho.jpg" alt="Photo of Young-Ho Kim"/>
                </a>
              
            </div>
            <p>
            
              <a href="http://younghokim.net/">
                Young-Ho Kim
              </a>
            
            </p>
            <p>
              NAVER AI Lab
            </p>
          </div>
        
          <div class="author-container">
            <div class="author-image">
              
                <a href="https://juhokim.com">
                  <img src="/assets/img/juho.jpg" alt="Photo of Juho Kim"/>
                </a>
              
            </div>
            <p>
            
              <a href="https://juhokim.com">
                Juho Kim
              </a>
            
            </p>
            <p>
              KAIST
            </p>
          </div>
        
        </div>
      </div>
      <div class="button-container">
        <a class="button" href="https://arxiv.org/abs/2309.13633" target="_blank">
          <span>Paper</span>
          <svg xmlns="http://www.w3.org/2000/svg" height="1em" viewBox="0 0 512 512"><!--! Font Awesome Free 6.4.2 by @fontawesome - https://fontawesome.com License - https://fontawesome.com/license (Commercial License) Copyright 2023 Fonticons, Inc. --><path d="M0 64C0 28.7 28.7 0 64 0H224V128c0 17.7 14.3 32 32 32H384V304H176c-35.3 0-64 28.7-64 64V512H64c-35.3 0-64-28.7-64-64V64zm384 64H256V0L384 128zM176 352h32c30.9 0 56 25.1 56 56s-25.1 56-56 56H192v32c0 8.8-7.2 16-16 16s-16-7.2-16-16V448 368c0-8.8 7.2-16 16-16zm32 80c13.3 0 24-10.7 24-24s-10.7-24-24-24H192v48h16zm96-80h32c26.5 0 48 21.5 48 48v64c0 26.5-21.5 48-48 48H304c-8.8 0-16-7.2-16-16V368c0-8.8 7.2-16 16-16zm32 128c8.8 0 16-7.2 16-16V400c0-8.8-7.2-16-16-16H320v96h16zm80-112c0-8.8 7.2-16 16-16h48c8.8 0 16 7.2 16 16s-7.2 16-16 16H448v32h32c8.8 0 16 7.2 16 16s-7.2 16-16 16H448v48c0 8.8-7.2 16-16 16s-16-7.2-16-16V432 368z"/></svg>
        </a>
        
        <a class="button-disabled">
          <span>Demo (TBA)</span>
          <svg xmlns="http://www.w3.org/2000/svg" height="1em" viewBox="0 0 512 512"><!--! Font Awesome Free 6.4.2 by @fontawesome - https://fontawesome.com License - https://fontawesome.com/license (Commercial License) Copyright 2023 Fonticons, Inc. --><path d="M352 0c-12.9 0-24.6 7.8-29.6 19.8s-2.2 25.7 6.9 34.9L370.7 96 201.4 265.4c-12.5 12.5-12.5 32.8 0 45.3s32.8 12.5 45.3 0L416 141.3l41.4 41.4c9.2 9.2 22.9 11.9 34.9 6.9s19.8-16.6 19.8-29.6V32c0-17.7-14.3-32-32-32H352zM80 32C35.8 32 0 67.8 0 112V432c0 44.2 35.8 80 80 80H400c44.2 0 80-35.8 80-80V320c0-17.7-14.3-32-32-32s-32 14.3-32 32V432c0 8.8-7.2 16-16 16H80c-8.8 0-16-7.2-16-16V112c0-8.8 7.2-16 16-16H192c17.7 0 32-14.3 32-32s-14.3-32-32-32H80z"/></svg>        </a>
      </div>
    </div>
    <div class="wrapper">
      <hr/>
      <p><span class="sys-name">EvalLM</span> ⚗️ is an interactive system that aids prompt designers in iterating on <strong>prompts</strong> by evaluating and comparing generated outputs on user-defined <strong>criteria</strong>. With the aid of an <strong>LLM-based evaluation assistant</strong>, the user can iteratively evolve <strong>criteria+prompts</strong> to distinguish more specific qualities in outputs and then improve the quality of outputs on these aspects.</p>

<p><br /></p>

<p class="sys-img"><img src="/assets/img/animation.gif" alt="Animation of the overall workflow of EvalLM where users sample inputs from a dataset, generate outputs from each input using two different prompts, and then comparatively evaluate these outputs on user-defined criteria." /></p>

<hr />

<h2 id="interface">Interface</h2>

<p>The main screen of the interface consists of three panels.</p>

<p class="sys-img"><img src="/assets/img/interface.png" alt="Main screen of EvalLM shows three panels. The generation panel shows text boxes for the prompt and task instruction, and buttons for input sampling. The evaluation panel shows text boxes for the criteria, buttons for evaluating, and stacked bar charts for the evaluation results." /></p>

<p><strong>Generation Panel</strong>: To generate outputs, the user defines an overall <strong>task instruction</strong> (A), writes the two <strong>prompts</strong> they want to compare (B), and then <strong>samples inputs</strong> from a dataset (C) on which to test the prompts.</p>

<p><strong>Evaluation Panel</strong>: To evaluate outputs, the user defines a set of evaluation <strong><a href="#criteria" target="_self">criteria</a></strong> (D). After running the evaluation, they can check the overall <em>evaluation</em> performance of each prompt (E) or, if they created a validation set, <em>validate</em> how well the automatic evaluations align with ground-truth evaluations (F).</p>

<p><strong>Data Panel</strong>: This panel shows <strong><a href="#datarow" target="_self">data rows</a></strong> containing inputs, outputs, and evaluation results.</p>

<p><br /></p>

<h3 id="criteria"><span>Criteria</span></h3>

<p class="text-left"><span class="sys-name">EvalLM</span> allows users to evaluate outputs on their own criteria, specific to their application or context.
<br /><br />
To define a criterion, the user gives it a <strong>name</strong> (A) and a <strong>description</strong> (B) in natural language.
<br /><br />
To help users create more effective criteria, the system automatically <strong>reviews</strong> them (C) and provides <strong>suggestions</strong> (D) on how each criterion can be <em>refined</em>, <em>merged</em>, or <em>split</em>.</p>

<p class="img-right"><img src="/assets/img/criteria.png" alt="Criteria are represented as a set of text boxes that contain the name and description of the criteria. Suggested revisions are shown below the criteria." /></p>

<p><br /></p>

<h3 id="data-row"><span id="datarow">Data Row</span></h3>

<p class="sys-img"><img src="/assets/img/datarow.png" alt="Data Rows in the interface display inputs, output pairs, and evaluation results. Clicking on evaluation results opens a panel that shows the explanation for that evaluation underneath the row." /></p>

<p>For each sampled <strong>input</strong> (A), the interface presents the <strong>outputs</strong> generated from each prompt side by side (B) and the <strong>evaluation results</strong> for each criterion next to the outputs (C). For each criterion, the evaluation results show which prompt produced the output that better satisfied it.</p>

<p>For more detail, the user can click on one of these evaluations to see the assistant’s <strong>explanation</strong> (D). To help the user match the explanation to the outputs, the system also <strong>highlights</strong> the spans in the outputs that were considered important when evaluating that criterion (E).</p>

<p>If the user chose to evaluate outputs over multiple trials, they can browse the evaluations from the <strong>other trials</strong> through the carousel (F).</p>

<hr />

<h2 id="video-demo">Video Demo</h2>

<p>See <span class="sys-name">EvalLM</span> in action in this video demo.</p>

<div class="video-wrapper">
  <iframe src="https://www.youtube-nocookie.com/embed/7hvTnhiCO7Y?si=6I9aqVnatM8VoaJd&amp;color=white&amp;rel=0&amp;modestbranding=1" id="yt-video" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</div>

<hr />

<h2 id="bibtex">BibTeX</h2>
<pre>
@misc{kim2023evallm,
title={EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria},
author={Tae Soo Kim and Yoonjoo Lee and Jamin Shin and Young-Ho Kim and Juho Kim},
year={2023},
eprint={2309.13633},
archivePrefix={arXiv},
primaryClass={cs.HC}
}
</pre>

<hr />

<p class="logos"><a href="https://kixlab.org"><img src="/assets/img/kixlab_logo.png" alt="Logo of KIXLAB" /></a>
<a href="https://kaist.ac.kr"><img src="/assets/img/kaist_logo.png" alt="Logo of KAIST" /></a>
<a href="https://www.facebook.com/NAVERAILAB"><img src="/assets/img/naver_logo.png" alt="Logo of NAVER" /></a></p>

<p class="center acknowledgement">This research was supported by the <strong>KAIST-NAVER Hypercreative AI Center</strong>.</p>

    </div>
    <div class="footer">
      <div>
        <p class="center credits">
          Template from <a href="https://github.com/kixlab/evallm-website" target="_blank">EvalLM</a> by <a href="https://taesookim.com" target="_blank">tsook</a>. Licensed under MIT License.
          <br/>
          Feel free to borrow the template. We only ask you to keep the credit links above.
        </p>
      </div>
    </div>
    
  </body>
</html>
