Focus group interviews with experts in physical computing education revealed three primary challenges:

1. The intricate interdependence between circuit construction and code functionality.
2. The breadth of instructor knowledge required, spanning hardware interfaces, software development, and diverse physical components; acquiring this multidisciplinary expertise demands substantial time and financial investment.
3. The difficulty of managing heterogeneous student capabilities while providing personalized debugging assistance, particularly for hardware-related problems, which can impede instructional objectives.
PCEval introduces a fully automated evaluation protocol, a significant advancement over prior work that often required manual expert assessment or complex hardware-in-the-loop setups. Our structured methodology, with clear task separation and automated metrics, provides a robust and reproducible framework for assessing LLM-generated circuits and code.
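As a concrete illustration, the sketch below shows how such a fully automated loop could be organized: one generation call per task instance, followed by a programmatic check in place of manual expert grading. The names here (`BenchmarkItem`, `generate`, `check`) are hypothetical stand-ins reflecting our reading of the protocol, not PCEval's actual implementation.

```python
# Minimal sketch of an automated evaluation loop; all names are hypothetical
# stand-ins, not PCEval's actual API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchmarkItem:
    prompt: str      # controlled input for one task instance
    reference: dict  # reference artifact consumed by the automated check


def success_rate(items: list[BenchmarkItem],
                 generate: Callable[[str], str],
                 check: Callable[[str, BenchmarkItem], bool]) -> float:
    """Fraction of items whose generated artifact passes the automated check."""
    passed = sum(1 for item in items if check(generate(item.prompt), item))
    return passed / len(items) if items else 0.0
```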
PCEval uniquely assesses LLMs' ability to generate physically implementable breadboard layouts and to produce code that is compatible with these specific physical constraints. This addresses a critical gap, as previous works often overlooked the complexities of physical circuit implementation and breadboard layout challenges, focusing instead on logical schematics or code generation from abstract representations.
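To make "physically implementable" concrete, one possible, purely illustrative encoding of a breadboard layout is sketched below; the hole-and-terminal-strip model reflects a generic half-size breadboard, not PCEval's actual circuit format.

```python
# Hypothetical breadboard model (not PCEval's format): holes in the same
# column on the same side form one terminal strip and are tied together.
from dataclasses import dataclass


@dataclass(frozen=True)
class Hole:
    column: int  # 1..30 on a typical half-size breadboard
    side: str    # 'top' (rows a-e) or 'bottom' (rows f-j)
    row: str     # 'a'..'e' or 'f'..'j'


def same_net(a: Hole, b: Hole) -> bool:
    """Two holes are internally connected if they share a terminal strip."""
    return a.column == b.column and a.side == b.side

# A layout then maps each (component, pin) to a Hole; a logical connection is
# physically implementable only if the two pins land on the same net or are
# bridged by an explicit jumper wire.
```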
PCEval evaluates LLMs across four distinct generation tasks, designed to comprehensively assess different facets of physical computing capabilities, from logical design to physical implementation and code-hardware compatibility. Each task challenges an LLM to produce a specific artifact based on controlled inputs from our dataset.
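The input/output structure of the four tasks, as reflected in the shorthand used in the results table below, can be summarized as follows. The letter expansions (D = project description, C = code, L = logical circuit, P = physical breadboard circuit) are our reading of that shorthand rather than a quotation of the benchmark specification.

```python
# The four generation tasks in the shorthand of the results table; the letter
# expansions are inferred from context (D = description, C = code,
# L = logical circuit, P = physical breadboard circuit).
TASKS = {
    "logical_circuit_generation":  {"inputs": ("D", "C"), "output": "L"},
    "physical_circuit_generation": {"inputs": ("D", "C"), "output": "P"},
    "code_from_logical_circuit":   {"inputs": ("D", "L"), "output": "C"},
    "code_from_physical_circuit":  {"inputs": ("D", "P"), "output": "C"},
}
```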
Our evaluation of 13 leading LLMs on the PCEval benchmark yielded several critical insights into their current capabilities and limitations in the physical computing domain.
These findings underscore a key limitation in current LLMs: a less developed understanding of physical hardware constraints compared to their reasoning capabilities in logical or code-based tasks. This likely reflects biases in their training data, which predominantly features logical rather than physical circuit representations.
Task Performance Success Rates (%). Success rates for the primary evaluation tasks; the code generation column reports the average of the two code generation tasks (D, L → C and D, P → C).
| Model | Logical Circuit (D, C → L) | Physical Circuit (D, C → P) | Code Generation (avg. of D, L → C and D, P → C) |
|---|---|---|---|
| GPT-4o | 58.0 | 26.8 | 58.8 |
| Claude 3.7 Sonnet | 65.6 | 13.6 | 63.4 |
| o3-mini | 66.0 | 45.2 | 67.8 |
| Mistral-Small 3 | 46.4 | 13.6 | 38.2 |
Physical Circuit Generation Error Analysis. Average error frequencies in the physical circuit generation task (D, C → P).
| Model | Pin Conflict | Breadboard Bypass | Missing Component |
|---|---|---|---|
| GPT-4o | 2.07 | 1.16 | 0.20 |
| Claude 3.7 Sonnet | 7.52 | 0.17 | 0.00 |
| o3-mini | 4.20 | 0.01 | 0.02 |
| Mistral-Small 3 | 2.35 | 1.01 | 0.19 |
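For reference, the snippet below sketches one plausible way the three error categories could be counted automatically over a generated layout; the interpretations of "pin conflict", "breadboard bypass", and "missing component" are assumptions on our part, not PCEval's definitions.

```python
# One plausible automatic count of the three error categories; the
# interpretations below are assumptions, not PCEval's definitions.
from collections import Counter


def count_errors(placements, wires, reference_components):
    """placements: [{'component': str, 'pins': {pin_name: hole}}, ...]
    wires: [{'src_type': 'pin'|'hole', 'dst_type': 'pin'|'hole'}, ...]"""
    errors = Counter()
    # Pin conflict: the same breadboard hole claimed by more than one pin.
    hole_usage = Counter(h for p in placements for h in p["pins"].values())
    errors["pin_conflict"] = sum(n - 1 for n in hole_usage.values() if n > 1)
    # Breadboard bypass: a wire drawn directly between two component pins,
    # skipping the breadboard entirely.
    errors["breadboard_bypass"] = sum(
        1 for w in wires if w["src_type"] == "pin" and w["dst_type"] == "pin")
    # Missing component: a reference part absent from the generated layout.
    placed = {p["component"] for p in placements}
    errors["missing_component"] = len(set(reference_components) - placed)
    return errors
```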
This section provides qualitative examples and visualizations of LLM outputs from the PCEval benchmark. The examples illustrate success and failure modes of the button-LED project in physical circuit generation and in code generation from a logical circuit, providing a more nuanced understanding of the challenges LLMs encounter.
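For orientation, a minimal program of the kind the button-LED project calls for is sketched below. It assumes a MicroPython-style target (for example, a Raspberry Pi Pico) and arbitrary pin numbers; the benchmark's actual platform, wiring, and reference code may differ.

```python
# Illustrative button-LED program; platform (MicroPython) and pin numbers are
# assumptions, not taken from the benchmark.
from machine import Pin
import time

led = Pin(15, Pin.OUT)                 # LED driven through a series resistor
button = Pin(14, Pin.IN, Pin.PULL_UP)  # pushbutton to ground, internal pull-up

while True:
    led.value(0 if button.value() else 1)  # pressed (reads 0) -> LED on
    time.sleep_ms(10)                      # short polling / debounce interval
```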