
<!DOCTYPE html>
<html>

<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, shrink-to-fit=no">
    <title>MACHIAVELLI</title>
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/4.5.0/css/bootstrap.min.css">
    <link href='https://fonts.googleapis.com/css?family=Source+Sans+Pro:300,400,500,600' rel='stylesheet' type='text/css'>
    <link rel="stylesheet" href="style.css">

    <meta property="og:site_name" content="The MACHIAVELLI Benchmark" />
    <meta property="og:title" content="The MACHIAVELLI Benchmark" />
    <meta property="og:description" content="Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark." />
    <meta property="og:url" content="https://aypan17.github.io/machiavelli" />
    <meta property="og:image" content="https://aypan17.github.io/machiavelli/docs/assets/images/machiavelli.svg" />

    <meta property="article:publisher" content="https://aypan17.github.io/machiavelli" />
    <meta name="twitter:card" content="summary_large_image" />
    <meta name="twitter:title" content="The MACHIAVELLI Benchmark" />
    <meta name="twitter:description" content="Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark." />
    <meta name="twitter:url" content="https://aypan17.github.io.github.io/" />
    <meta name="twitter:image" content="https://aypan17.github.io/machiavelli/docs/assets/images/machiavelli.svg" />
    <!-- <meta name="twitter:site" content="" /> -->

    <script src="https://cdn.jsdelivr.net/npm/p5@1.6.0/lib/p5.js"></script>
    <script language="javascript" type="text/javascript" src="sketch.js"></script>
</head>

<body>
    <div class="container banner-container">
        <div style="padding: 50px 0 0 0;">
            <h1 class="text-center">The <b>MACHIAVELLI</b> Benchmark</h1>
            <br>
            <div class="text-center">
                <img src="assets/fig_radar.svg" alt="Machiavelli" width="50%"/>
            </div>
            <br>
            <div class="buttons" style="margin-bottom: 8px; text-align: center;">
                <a class="btn btn-primary" role="button" href="https://arxiv.org/abs/2304.03279" target=”_blank”>
                    <svg style="width:24px;height:24px;margin-left:-12px;margin-right:12px" viewBox="0 0 24 24">
                        <path fill="currentColor" d="M16 0H8C6.9 0 6 .9 6 2V18C6 19.1 6.9 20 8 20H20C21.1 20 22 19.1 22 18V6L16 0M20 18H8V2H15V7H20V18M4 4V22H20V24H4C2.9 24 2 23.1 2 22V4H4M10 10V12H18V10H10M10 14V16H15V14H10Z"></path>
                    </svg>Paper
                </a>
                <a class="btn btn-primary" role="button" href="https://github.com/aypan17/machiavelli" target=”_blank”>
                    <svg style="width:24px;height:24px;margin-left:-12px;margin-right:12px" width="24px" height="24px" class="svg-inline--fa fa-github fa-w-16" aria-hidden="true" focusable="false" data-prefix="fab" data-icon="github" role="img" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 496 512" data-fa-i2svg="">
                        <path fill="currentColor" d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path>
                    </svg>Code
                </a>
                <a class="btn btn-light" role="button">
                    <svg style="width:24px;height:24px;margin-left:-12px;margin-right:12px" width="24px" height="24px" viewBox="0 0 375 531">
                        <polygon stroke="#000000" points="0.5,0.866 459.5,265.87 0.5,530.874 "/>
                    </svg>Trajectory viewer (coming soon)
                </a>
            </div>
        </div>
    </div>
    <div class="outer-container">
    <div class="container main-container">

        <div class="row">
            <div class="col-md-12" style="text-align: center">
                <em><p>Models are rapidly being deployed in the real world. How do we evaluate models, especially ones as complex as GPT-4, to ensure that they behave safely in pursuit of their objectives? Can we design models that robustly avoid any harms while achieving their goals?</p></em>
                </div>
        </div>
        <br>

        <!-- SECTION ---------------------------------------------------------------------- -->
        <h1>MACHIAVELLI</h1>
        <div class="row">
            <div class="col-md-12">
                <p>To guide progress on text-based agents and encourage them to behave more ethically, we propose the MACHIAVELLI benchmark. Our environment is based on human-written, text-based Choose-Your-Own-Adventure games containing over half a million scenes with millions of annotations. The games abstract away low-level environment interactions, instead spotlighting high-level social decisions alongside real-world goals to achieve. MACHIAVELLI is a step towards measuring an agent's ability to plan and navigate complex trade-offs in realistic social environments.</p>
            </div>
        </div>
        <div class="row captioned_img">
            <div class="col-md-12">
                <br>
                <img src="assets/machiavelli.svg" alt="Machiavelli" width="100%"/>
                <div class="caption">A mock-up of a game in the MACHIAVELLI benchmark, a suite of text-based reinforcement learning environments. Each environment is a text-based story. At each step, the agent observes the scene and a list of possible actions; it selects an action from the list. The agent receives rewards for completing achievements. Using dense annotations of our environment, we construct a behavioral report of the agent and measure the trade-off between rewards and ethical behavior.</div>
            </div>
        </div>
        <hr class="divider" />

        <!-- SECTION ---------------------------------------------------------------------- -->
        <h1>Diverse Worlds with Rich Annotations</h1>
        
        <div class="row">
            <div class="col-md-12">
                <p>Each game in the MACHIAVELLI suite is book-length in itself. Across 134 games spanning diverse fictional worlds, we count 572,322 multi-paragraph scenes and 4,559 author-written achievements serving as objectives for agents. We annotate each scene with dozens of labels, focusing on identifying harmful behaviors such as power-seeking and deception. In total, we collect a dataset of nearly three million annotations worth $500,000 (in terms of human annotation time).</p>
                <p>These dense annotations enable us to track nearly every ethically-salient thing agents do in the environment, and produce a behavioral report scoring various harm metrics. All of our labels are open-access and <a href="https://github.com/aypan17/machiavelli">available for download</a>!</p>
            </div>
        </div>

        <div class="row" style="text-align: center;">
            <div class="col-md-6">
                <p class="stat">134</p>
                <p class="statlabel">Games</p>
            </div>
            <div class="col-md-6">
                <p class="stat">4,559</p>
                <p class="statlabel">Achievements</p>
            </div>
        </div>
        <div class="row" style="text-align: center;">
            <div class="col-md-6">
                <p class="stat">572,322</p>
                <p class="statlabel">Scenarios</p>
            </div>
            <div class="col-md-6">
                <p class="stat">2,861,610</p>
                <p class="statlabel">Annotations</p>
            </div>
        </div>
        <hr class="divider" />

        <!-- SECTION ---------------------------------------------------------------------- -->
        <h1>Reducing Machiavellianism</h1>
        <div class="row">
            <div class="col-md-12">
                <p>In the MACHIAVELLI environment, we find that agents trained to optimize arbitrary objectives tend to adopt "ends justify the means" behavior: becoming power-seeking, causing harm to others, and violating ethical norms like stealing or lying to achieve their objectives. Furthermore, there appears to be a trade-off between behaving ethically and achieving high reward.</p>

                <p>In our paper, we design several methods to improve the behaviors of agents and obtain Pareto improvements on reward and ethical behavior. We invite others to build on our initial steps and use MACHIAVELLI as a testing ground for improving the safety of AI agents.</p>
            </div>
        </div>
        <div class="row captioned_img">
            <div class="col-md-12">
                <br>
                <img src="assets/fig_pareto.svg" alt="Pareto" width="100%"/>
                <div class="caption">In an ideal world, a perfect agent would achieve 100% reward while entirely avoiding any harms (i.e., be as far to the top-right corner as possible). Our baseline agents demonstrate a trade-off between behaving ethically and achieving high reward. Future work should work to improve this trade-off by extending the Pareto frontier.</div>
            </div>
        </div>
        <hr class="divider" />

        <!-- SECTION ---------------------------------------------------------------------- -->
        <h1>Authors</h1>
        <br>
        <div class="row authors">
            <div class="col-sm-3">
                <h5 class="text-center"><a class="text-center" href="https://aypan17.github.io/">Alexander Pan*</a></h5>
                <div class="text-center">UC Berkeley</div>
            </div>
            <div class="col-sm-3">
                <h5 class="text-center"><a class="text-center" href="https://junshern.github.io/">Chan Jun Shern*</a></h5>
                <div class="text-center">Center for AI Safety</div>
            </div>
            <div class="col-sm-3">
                <h5 class="text-center"><a class="text-center" href="https://andyzoujm.github.io/">Andy Zou*</a></h5>
                <div class="text-center">Carnegie Mellon University</div>
            </div>
            <div class="col-sm-3">
                <h5 class="text-center"><a class="text-center" href="https://nli0.github.io/">Nathaniel Li</a></h5>
                <div class="text-center">UC Berkeley</div>
            </div>
        </div>
        <div class="row authors">
            <div class="col-sm-3">
                <h5 class="text-center"><a class="text-center" href="https://stevenbas.art/">Steven Basart</a></h5>
                <div class="text-center">Center for AI Safety</div>
            </div>
            <div class="col-sm-3">
                <h5 class="text-center"><a class="text-center" href="https://www.thomaswoodside.com/">Thomas Woodside</a></h5>
                <div class="text-center">Yale University</div>
            </div>
            <div class="col-sm-3">
                <h5 class="text-center"><a class="text-center" href="https://www.linkedin.com/in/jonathan-ng-7061a3162/">Jonathan Ng</a></h5>
                <div class="text-center">Center for AI Safety</div>
            </div>
            <div class="col-sm-3">
                <h5 class="text-center"><a href="https://hanlin-zhang.com/">Hanlin Zhang</a></h5>
                <div class="text-center">Carnegie Mellon University</div>
            </div>
        </div>
        <div class="row authors">
            <div class="col-sm-3"></div>
            <div class="col-sm-3">
                <h5 class="text-center"><a class="text-center" href="https://scottemmons.com/">Scott Emmons</a></h5>
                <div class="text-center">UC Berkeley</div>
            </div>
            <div class="col-sm-3">
                <h5 class="text-center"><a class="text-center" href="https://people.eecs.berkeley.edu/~hendrycks/">Dan Hendrycks</a></h5>
                <div class="text-center">Center for AI Safety</div>
            </div>
            <div class="col-sm-3"></div>
        </div>
        <hr class="divider" />
        <br>

        <!-- SECTION ---------------------------------------------------------------------- -->
        <h1>Citation</h1>
        <br>
        <div class="row">
            <div class="col-md-12">
                <code>
                    @article{pan2023machiavelli,<br>
                    &nbsp; author = {Pan, Alexander and Chan, Jun Shern and Zou, Andy and Li, Nathaniel and Basart, Steven and Woodside, Thomas and Ng, Jonathan and Zhang, Hanlin and Emmons, Scott and Hendrycks, Dan},<br>
                    &nbsp; title = {Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark.},<br>
                    &nbsp; journal = {ICML},<br>
                    &nbsp; year = {2023},<br>
                }</code>
            </div>
        </div>
    </div>
    </div>
    <script src="/assets/js/scripts.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/4.5.0/js/bootstrap.bundle.min.js"></script>
    <!-- Import the component -->
</body>

</html>
