<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <meta name="description"
        content="PCEval: A Benchmark for Evaluating Physical Computing Capabilities of Large Language Models">
  <meta name="keywords" content="PCEval, Physical Computing, Large Language Models, Benchmark">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>PCEval: A Benchmark for Evaluating Physical Computing Capabilities of Large Language Models</title>

  <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet">

  <link rel="stylesheet" href="./static/css/bulma.min.css">
  <link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
  <link rel="stylesheet" href="./static/css/bulma-slider.min.css">
  <link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
  <link rel="stylesheet"
        href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
  <link rel="stylesheet" href="./static/css/index.css">
  <link rel="icon" href="./static/images/favicon.svg">

  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
  <script defer src="./static/js/fontawesome.all.min.js"></script>
  <script src="./static/js/bulma-carousel.min.js"></script>
  <script src="./static/js/bulma-slider.min.js"></script>
  <script src="./static/js/index.js"></script>
</head>
<body>


<section class="hero">
  <div class="hero-body">
    <div class="container is-max-desktop">
      <div class="columns is-centered">
        <div class="column has-text-centered">
          <h1 class="title is-3 publication-title">PCEval: A Benchmark for Evaluating Physical Computing Capabilities of Large Language Models</h1>
          <h2 class="title is-4 publication-title"><span style="color: #dc3545;">ICLR 2026 Submission</span></h2>
          <h2 class="title is-5 publication-title"><span>Paper ID: 13535</span></h2>

          
        </div>
      </div>
    </div>
  </div>
</section>



<section class="hero is-small py-6">
  <div class="container is-max-desktop">
    <div class="hero-body">
      <video id="teaser" autoplay muted loop playsinline height="100%">
        <source src="static/videos/Alarm.mp4"
                type="video/mp4">
      </video>
      <div style="height: 40px;"></div>
      <h2 class="subtitle has-text-centered">
        PCEval is the first benchmark to systematically and <strong>automatically</strong> evaluate the capabilities of LLMs in physical computing, 
        with a unique focus on real-world <strong>physical circuit</strong> understanding.
      </h2>
    </div>
  </div>
</section>



<section class="hero is-light is-small py-6">
  <div class="container is-max-desktop">
    <!-- Abstract. -->
    <!-- <div class="columns is-centered"> -->
    <div class="columns is-centered has-text-centered">
      <div class="column is-four-fifths">
        <h2 class="title is-3">&nbsp;Motivation</h2>
        <div class="content has-text-justified">
          <p>
            Focus group interviews with physical computing education experts revealed three primary challenges in the field:
          </p>
        </div>
        <div class="column">
          <h2 class="title is-5">1. Hardware-Software Integration Complexity</h2>
          <p class="mb-4">
            Circuit construction and code functionality are intricately interdependent.
          </p>
          <div class="columns is-vcentered is-mobile" style="margin: 0; max-width: 100%; gap: 0;">
            <div class="column is-4 has-text-right" style="padding: 0;">
              <img src="./static/images/Circuit.png" alt="Example circuit" style="max-width: 90%;">
            </div>
            <div class="column is-4 has-text-centered" style="padding: 0;">
              <img src="./static/videos/chain.gif" alt="Chain link between circuit and code" style="width: 50%;">
            </div>
            <div class="column is-4 has-text-left" style="padding: 0;">
              <img src="./static/images/Code.png" alt="Example code" style="max-width: 90%;">
            </div>
          </div>
        </div>


        <div style="height: 40px;"></div>

        <div class="columns is-centered">
          <div class="column">
              <h2 class="title is-5">2. Teacher Expertise</h2>
              <div class="columns is-centered">
              <div class="column content">
                <div class="content has-text-justified">
                  <p class="mb-4">
                    Physical computing education requires extensive instructor knowledge spanning hardware interfaces, software development, and diverse physical components. Achieving this multidisciplinary expertise demands substantial time and financial investment.
                  </p>
                  <!-- <div class="column is-8 has-text-centered" style="padding: 0;">
                    <img src="./static/images/Teacher_Exp.png" style="width: 100%;">
                  </div>            -->
                </div>
              </div>
            </div>
          </div>
    
          <div class="column">
            <h2 class="title is-5">3. Feedback Overload</h2>
            <div class="columns is-centered">
              <div class="column content">
                <div class="content has-text-justified">
                  <p class="mb-4">
                    Educators face substantial difficulties managing heterogeneous student capabilities while providing personalized debugging assistance, particularly for hardware-related problems, which can impede instructional objectives.
                  </p>
                  <!-- <video id="teaser" autoplay muted loop playsinline height="100%">
                    <source src="static/videos/Feedback.mp4"
                            type="video/mp4">
                  </video> -->
                </div>
              </div>
            </div>
          </div>
        </div>
        

      </div>
    </div>
  </div>
</section>


<section class="hero is-small py-6">
  <div class="container is-max-desktop">
    <h2 class="title is-3">Distinctive Features of PCEval</h2>
    <div style="height: 1px;"></div>
    <div class="content has-text-justified">
      <!-- <p>
        We designed 4 tasks for comprehensive, while well distinct, evaluation of physical computing capabilities.
      </p> -->
    </div>

    <div class="columns is-centered has-text-centered">

      <!-- Automated Evaluation. -->
      <div class="column">
          <h2 class="title is-5">Automated Evaluation Framework</h2>
          <div class="columns is-centered">
          <div class="column content">
            <div class="content has-text-justified">
              <p class="mb-4">
                PCEval introduces a fully automated evaluation protocol, a significant advancement over <strong>prior work that often required manual expert assessment or complex hardware-in-the-loop setups</strong>.
                Our structured methodology, with clear task separation and automated metrics, provides a robust and reproducible framework for assessing LLM-generated circuits and code.
              </p>
            </div>
          </div>
        </div>
      </div>

      <div class="column">
        <h2 class="title is-5">Comprehensive Physical Circuit Assessment</h2>

        <div class="columns is-centered">
          <div class="column content">
            <div class="content has-text-justified">
              <p>
                PCEval uniquely assesses LLMs' ability to generate physically implementable breadboard layouts and to produce code that is compatible with these specific physical constraints.
                This addresses a critical gap, as previous works often overlooked the complexities of physical circuit implementation and breadboard layout challenges, focusing instead on logical schematics or code generation from abstract representations.
              </p>
            </div>
          </div>

        </div>
      </div>
    </div>

  </div>
</section>

<section class="section">
  <div class="container is-max-desktop">

    <div class="columns is-centered">
  <div class="column">
    <h2 class="title is-3">Core Tasks in PCEval</h2>
    <div class="columns is-centered">
      <div class="column content">
        <div class="content has-text-justified">
        <p>
          PCEval evaluates LLMs across four distinct generation tasks, designed to comprehensively assess different facets of physical computing capabilities, from logical design to physical implementation and code-hardware compatibility.
          Each task challenges an LLM to produce a specific artifact based on controlled inputs from our dataset.
        </p>
        </div>
        <video id="teaser" autoplay muted loop playsinline height="100%">
          <source src="static/videos/Tasks.mp4"
                  type="video/mp4">
        </video>
      </div>
        </div>
      </div>
    </div>
  </div>
</section>

<section class="hero is-light is-small py-6">
  <div class="container is-max-desktop">
    <h2 class="title is-3 has-text-centered">Key Findings from PCEval</h2>
    <div class="content has-text-justified">
      <p>
        Our evaluation of 13 leading LLMs on the PCEval benchmark yielded several critical insights into their current capabilities and limitations in the physical computing domain.
      </p>
      <ul>
        <li><strong>Code Generation Predominance:</strong> LLMs generally achieved higher success rates on code generation tasks than on circuit generation tasks. This suggests that generating syntactically and logically correct code for a given hardware specification is currently more tractable for LLMs than inferring and designing the hardware circuitry itself.</li>
    
        
        <li><strong>Logical vs. Physical Circuit Generation Disparity:</strong> A striking performance gap was observed between logical circuit design and actual physical circuit (breadboard layout) generation. Success rates for physical circuit generation were markedly lower across all models, highlighting a profound difficulty LLMs face in translating conceptual requirements into physically valid layouts while adhering to hardware constraints.</li>
    
        <li><strong>Impact of Physical Implementation Errors:</strong> Success in physical circuit generation requires not only logical correctness but also the avoidance of critical implementation errors, such as pin conflicts and breadboard bypasses. Pin conflicts, in particular, emerged as a dominant error type that significantly degraded performance for many models.</li>
    
        <li><strong>Capability in Code Generation from Provided Physical Circuits:</strong> Despite difficulties in generating physical circuits, LLMs showed surprisingly strong capabilities in generating code when a specific physical circuit layout was <em>provided</em>. This indicates that LLMs can effectively recognize patterns and adhere to constraints when physical connections are explicitly detailed.</li>
    
        <li><strong>Performance and Project Complexity:</strong> As anticipated, LLM performance generally decreased as the complexity of the projects (in terms of code length, component count, and connection density) increased across the defined levels.</li>
      </ul>
      <p>
        These findings underscore a key limitation in current LLMs: a less developed understanding of physical hardware constraints compared to their reasoning capabilities in logical or code-based tasks. This likely reflects biases in their training data, which predominantly features logical rather than physical circuit representations.
      </p>
    </div>

      <div class="columns mt-5 is-desktop">
        <div class="column">
          <h4 class="title is-5 has-text-centered">Task Success Rates (%)</h4>
          <p class="has-text-centered is-size-7 mb-2" style="min-height: 3em; display: flex; align-items: center; justify-content: center;">Success rates (%) on the primary evaluation tasks.
            The code generation column reports the average over the two code generation tasks (D, L → C and D, P → C).
            </p>
          <div class="table-container">
            <table class="table is-bordered is-striped is-narrow is-hoverable is-fullwidth">
              <thead>
                <tr>
                  <th class="has-text-left" style="white-space: nowrap; vertical-align: middle; min-height: 5em;">Model</th>
                  <th class="has-text-centered" style="vertical-align: middle; min-height: 5em;">D, C → L</th>
                  <th class="has-text-centered" style="vertical-align: middle; min-height: 5em;">D, C → P</th>
                  <th class="has-text-centered" style="vertical-align: middle; min-height: 5em;">D, L → C &amp; <br> D, P → C</th>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <td class="has-text-left" style="white-space: nowrap;">GPT-4o</td>
                  <td class="has-text-centered">58.0</td>
                  <td class="has-text-centered">26.8</td>
                  <td class="has-text-centered">58.8</td>
                </tr>
                <tr>
                  <td class="has-text-left" style="white-space: nowrap;">Claude 3.7 Sonnet</td>
                  <td class="has-text-centered">65.6</td>
                  <td class="has-text-centered">13.6</td>
                  <td class="has-text-centered">63.4</td>
                </tr>
                <tr>
                  <td class="has-text-left" style="white-space: nowrap;">o3-mini</td>
                  <td class="has-text-centered">66.0</td>
                  <td class="has-text-centered">45.2</td>
                  <td class="has-text-centered">67.8</td>
                </tr>
                <tr>
                  <td class="has-text-left" style="white-space: nowrap;">Mistral-Small 3</td>
                  <td class="has-text-centered">46.4</td>
                  <td class="has-text-centered">13.6</td>
                  <td class="has-text-centered">38.2</td>
                </tr>
              </tbody>
            </table>
          </div>
        </div>
        <div class="column">
          <h4 class="title is-5 has-text-centered">Physical Circuit Generation Errors</h4>
          <p class="has-text-centered is-size-7 mb-2" style="min-height: 3em; display: flex; align-items: center; justify-content: center;">Average error frequencies in the physical circuit generation task (D, C → P).</p>
          <div class="table-container">
            <table class="table is-bordered is-striped is-narrow is-hoverable is-fullwidth">
              <thead>
                <tr>
                  <th class="has-text-left" style="white-space: nowrap; vertical-align: middle; min-height: 5em;">Model</th>
                  <th class="has-text-centered" style="vertical-align: middle; min-height: 5em;">Pin Conflict</th>
                  <th class="has-text-centered" style="vertical-align: middle; min-height: 5em;">Breadboard Bypass</th>
                  <th class="has-text-centered" style="vertical-align: middle; min-height: 5em;">Missing Component</th>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <td class="has-text-left" style="white-space: nowrap;">GPT-4o</td>
                  <td class="has-text-centered">2.07</td>
                  <td class="has-text-centered">1.16</td>
                  <td class="has-text-centered">0.20</td>
                </tr>
                <tr>
                  <td class="has-text-left" style="white-space: nowrap;">Claude 3.7 Sonnet</td>
                  <td class="has-text-centered">7.52</td>
                  <td class="has-text-centered">0.17</td>
                  <td class="has-text-centered">0.0</td>
                </tr>
                <tr>
                  <td class="has-text-left" style="white-space: nowrap;">o3-mini</td>
                  <td class="has-text-centered">4.20</td>
                  <td class="has-text-centered">0.01</td>
                  <td class="has-text-centered">0.02</td>
                </tr>
                <tr>
                  <td class="has-text-left" style="white-space: nowrap;">Mistral-Small 3</td>
                  <td class="has-text-centered">2.35</td>
                  <td class="has-text-centered">1.01</td>
                  <td class="has-text-centered">0.19</td>
                </tr>
              </tbody>
            </table>
          </div>
        </div>
      </div>

  </div>
</section>



<section class="section">
  <div class="container is-max-desktop">

    <div class="columns is-centered">
      <!-- More Qualitative Results. -->
      <div class="column">
        <h2 class="title is-3">Qualitative Examples and Visualizations</h2>
        <div class="columns is-centered">
          <div class="column content">
            <div class="content has-text-justified">
              <p class="mb-4">
                This section provides qualitative examples and visualizations of LLM outputs from the PCEval benchmark.
                The examples illustrate success and failure cases for the Button LED project, in physical circuit generation and in code generation from a logical circuit,
                providing a more nuanced understanding of the challenges LLMs encounter.
              </p>
            </div>
        <div class="content has-text-centered">
          <video id="teaser" autoplay muted loop playsinline height="100%">
            <source src="static/videos/Button_LED_Success.mp4"
                    type="video/mp4">
          </video>
          <div style="height: 40px;"></div>
          <video id="teaser" autoplay muted loop playsinline height="100%">
            <source src="static/videos/Button_LED_Fail.mp4"
                    type="video/mp4">
          </video>
        </div>
          </div>
        </div>
      </div>
    </div>
  </div>

</section>


<footer class="footer">
  <div class="container">
    <div class="columns is-centered">
      <div class="column is-8">
        <div class="content">
          <p>
            This website accompanies the ICLR 2026 submission titled "PCEval: A Benchmark for Evaluating Physical Computing Capabilities of Large Language Models" (Paper ID: 13535).
          </p>
        </div>
      </div>
    </div>
  </div>
</footer>

</body>
</html>
