PCEval: A Benchmark for Evaluating Physical Computing Capabilities of Large Language Models

ICLR 2026 Submission

Paper ID: 13535

PCEval is the first benchmark to systematically and automatically evaluate the capabilities of LLMs in physical computing, with a unique focus on real-world physical circuit understanding.

Motivation

Focus group interviews with physical computing education experts revealed three primary challenges in physical computing education:

1. Hardware-Software Integration Complexity

Circuit construction and code functionality are intricately interdependent, so faults are hard to attribute to hardware or software alone.

2. Teacher Expertise

Physical computing education requires extensive instructor knowledge spanning hardware interfaces, software development, and diverse physical components. Achieving this multidisciplinary expertise demands substantial time and financial investment.

3. Feedback Overload

Educators face substantial difficulties managing heterogeneous student capabilities while providing personalized debugging assistance, particularly for hardware-related problems, which can impede instructional objectives.

Distinctive Features of PCEval

Automated Evaluation Framework

PCEval introduces a fully automated evaluation protocol, a significant advancement over prior work that often required manual expert assessment or complex hardware-in-the-loop setups. Our structured methodology, with clear task separation and automated metrics, provides a robust and reproducible framework for assessing LLM-generated circuits and code.
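
PCEval's concrete metrics are not reproduced here, but the flavor of such an automated check can be sketched. The snippet below is a minimal illustration rather than the benchmark's implementation: it assumes a circuit is serialized as (component pin, net label) pairs, which is a hypothetical format, and grades a generated circuit as matching the reference when both connect the same groups of pins, regardless of how the nets are labeled.

```python
"""Minimal sketch of an automated circuit-equivalence check.

Not the PCEval implementation; it only illustrates the idea of a fully
automated, reproducible metric. It assumes a circuit is serialized as
(pin, net) pairs, e.g. ("LED1.anode", "net_a") -- an assumed format.
"""

from typing import Iterable


def nets_as_partition(connections: Iterable[tuple[str, str]]) -> set[frozenset[str]]:
    """Group pins by net label, so circuits are compared up to net renaming."""
    nets: dict[str, set[str]] = {}
    for pin, net in connections:
        nets.setdefault(net, set()).add(pin)
    return {frozenset(pins) for pins in nets.values()}


def circuits_equivalent(generated, reference) -> bool:
    """Two circuits match if they connect the same groups of pins."""
    return nets_as_partition(generated) == nets_as_partition(reference)


if __name__ == "__main__":
    reference = [("ARDUINO.D2", "net_a"), ("BUTTON1.pin1", "net_a"),
                 ("BUTTON1.pin2", "net_b"), ("ARDUINO.GND", "net_b")]
    generated = [("BUTTON1.pin1", "n1"), ("ARDUINO.D2", "n1"),
                 ("ARDUINO.GND", "n2"), ("BUTTON1.pin2", "n2")]
    print(circuits_equivalent(generated, reference))  # True
```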

Comprehensive Physical Circuit Assessment

PCEval uniquely assesses LLMs' ability to generate physically implementable breadboard layouts and to produce code that is compatible with these specific physical constraints. This addresses a critical gap, as previous works often overlooked the complexities of physical circuit implementation and breadboard layout challenges, focusing instead on logical schematics or code generation from abstract representations.
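
To make the logical/physical distinction concrete, the sketch below shows one possible way, not PCEval's actual representation, to reduce a breadboard layout to its electrical nets. It assumes holes are addressed as a column number plus a row letter (e.g. "12a") and that the a–e and f–j halves of a column form separate internal strips; both conventions are assumptions. Verifying physical implementability then amounts to checking that these induced nets realize the intended logical circuit.

```python
"""Sketch of reducing a breadboard layout to electrical nets.

Purely illustrative, not PCEval's representation. Assumes holes are
addressed as "<column><row letter>" (e.g. "12a"), that a-e and f-j
halves of a column are separate strips, and that "+"/"-" denote the
power rails -- the addressing scheme is an assumption.
"""


def hole_to_node(hole: str) -> str:
    """Map a breadboard hole to the electrical node it belongs to."""
    if hole in ("+", "-"):            # power rails run the length of the board
        return f"rail{hole}"
    column, row = hole[:-1], hole[-1]
    half = "top" if row in "abcde" else "bottom"
    return f"col{column}-{half}"      # a-e and f-j halves are not connected


def layout_to_nets(placements: dict[str, str]) -> dict[str, set[str]]:
    """Group component pins (e.g. "LED1.anode") by shared electrical node."""
    nets: dict[str, set[str]] = {}
    for pin, hole in placements.items():
        nets.setdefault(hole_to_node(hole), set()).add(pin)
    return nets


if __name__ == "__main__":
    layout = {"LED1.anode": "12a", "R1.pin1": "12c",   # same column half -> connected
              "R1.pin2": "20a", "ARDUINO.D13": "20b"}
    for node, pins in layout_to_nets(layout).items():
        print(node, sorted(pins))
```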

Core Tasks in PCEval

PCEval evaluates LLMs across four distinct generation tasks, designed to comprehensively assess different facets of physical computing capabilities, from logical design to physical implementation and code-hardware compatibility. Each task challenges an LLM to produce a specific artifact based on controlled inputs from our dataset.
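
Interpreting the abbreviations used in the results tables below as project description (D), code (C), logical circuit (L), and physical breadboard circuit (P), the four task signatures are D, C → L; D, C → P; D, L → C; and D, P → C. The sketch below restates them as simple Python structures; this representation is illustrative, not the benchmark's data format.

```python
"""Sketch of the four PCEval task signatures as simple Python structures.

Abbreviations follow the results tables (D = description, C = code,
L = logical circuit, P = physical circuit); the field names and this
representation are illustrative, not the benchmark's format.
"""

from dataclasses import dataclass
from enum import Enum


class Artifact(str, Enum):
    DESCRIPTION = "D"
    CODE = "C"
    LOGICAL_CIRCUIT = "L"
    PHYSICAL_CIRCUIT = "P"


@dataclass(frozen=True)
class Task:
    inputs: tuple[Artifact, ...]   # artifacts given to the model
    output: Artifact               # artifact the model must generate


TASKS = [
    Task((Artifact.DESCRIPTION, Artifact.CODE), Artifact.LOGICAL_CIRCUIT),   # D, C -> L
    Task((Artifact.DESCRIPTION, Artifact.CODE), Artifact.PHYSICAL_CIRCUIT),  # D, C -> P
    Task((Artifact.DESCRIPTION, Artifact.LOGICAL_CIRCUIT), Artifact.CODE),   # D, L -> C
    Task((Artifact.DESCRIPTION, Artifact.PHYSICAL_CIRCUIT), Artifact.CODE),  # D, P -> C
]

if __name__ == "__main__":
    for task in TASKS:
        print(",".join(a.value for a in task.inputs), "->", task.output.value)
```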

Key Findings from PCEval

Our evaluation of 13 leading LLMs on the PCEval benchmark yielded several critical insights into their current capabilities and limitations in the physical computing domain.

  • Code Generation Predominance: LLMs generally demonstrated higher success rates in code generation tasks compared to circuit generation. This suggests that generating syntactically and logically correct code for a given hardware specification is currently more tractable for LLMs than inferring and designing the hardware circuitry itself.
  • Logical vs. Physical Circuit Generation Disparity: A striking performance gap was observed between logical circuit design and actual physical circuit (breadboard layout) generation. Success rates for physical circuit generation were markedly lower across all models, highlighting a profound difficulty LLMs face in translating conceptual requirements into physically valid layouts while adhering to hardware constraints.
  • Impact of Physical Implementation Errors: Success in physical circuit generation requires not only logical correctness but also the avoidance of critical implementation errors, such as pin conflicts and breadboard bypasses. Pin conflicts, in particular, emerged as a dominant error type that significantly degraded performance for many models.
  • Capability in Code Generation from Provided Physical Circuits: Despite difficulties in generating physical circuits, LLMs showed surprisingly strong capabilities in generating code when a specific physical circuit layout was *provided*. This indicates that LLMs can effectively recognize patterns and adhere to constraints when physical connections are explicitly detailed.
  • Performance and Project Complexity: As anticipated, LLM performance generally decreased as the complexity of the projects (in terms of code length, component count, and connection density) increased across the defined levels.

These findings underscore a key limitation in current LLMs: a less developed understanding of physical hardware constraints compared to their reasoning capabilities in logical or code-based tasks. This likely reflects biases in their training data, which predominantly features logical rather than physical circuit representations.

Success Rates (%)

Success rates (%) for the primary evaluation tasks. Code generation performance is the average of the two code generation tasks (D, L → C and D, P → C).

Model             | D, C → L | D, C → P | D, L → C & D, P → C
GPT-4o            | 58.0     | 26.8     | 58.8
Claude 3.7 Sonnet | 65.6     | 13.6     | 63.4
o3-mini           | 66.0     | 45.2     | 67.8
Mistral-Small 3   | 46.4     | 13.6     | 38.2

Physical Circuit Generation Errors

Average error frequencies in the physical circuit generation task (D, C → P).

Model             | Pin Conflict | Breadboard Bypass | Missing Component
GPT-4o            | 2.07         | 1.16              | 0.20
Claude 3.7 Sonnet | 7.52         | 0.17              | 0.00
o3-mini           | 4.20         | 0.01              | 0.02
Mistral-Small 3   | 2.35         | 1.01              | 0.19
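
As a rough illustration (not the PCEval checker), the sketch below shows how two of these error types could be counted automatically, assuming a physical circuit is given as a mapping from component pins to breadboard holes together with the set of required components; both formats are assumptions. Detecting breadboard bypasses would additionally require the full connectivity model and is omitted here.

```python
"""Sketch of automated detection for two of the error types above.

Illustrative only, not the PCEval checker. Assumes a physical circuit is
a mapping from component pins (e.g. "LED1.anode") to breadboard holes,
plus the set of components the project requires -- assumed formats.
"""

from collections import Counter


def count_pin_conflicts(placements: dict[str, str]) -> int:
    """Pin conflict: more than one component pin placed in the same hole."""
    hole_usage = Counter(placements.values())
    return sum(count - 1 for count in hole_usage.values() if count > 1)


def count_missing_components(placements: dict[str, str],
                             required: set[str]) -> int:
    """Missing component: a required part with no pin placed on the board."""
    placed = {pin.split(".")[0] for pin in placements}
    return len(required - placed)


if __name__ == "__main__":
    layout = {"LED1.anode": "12a", "LED1.cathode": "13a",
              "R1.pin1": "12a",            # conflicts with LED1.anode
              "ARDUINO.D13": "14a"}
    print(count_pin_conflicts(layout))                                  # 1
    print(count_missing_components(layout, {"LED1", "R1", "BUTTON1"}))  # 1
```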

Qualitative Examples and Visualizations

This section provides qualitative examples and visualizations of LLM outputs from the PCEval benchmark. The examples illustrate success and failure modes for the button-LED project in physical circuit generation and in code generation from a logical circuit, giving a more nuanced view of the challenges LLMs encounter.