Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design

Published: 30 Apr 2026 · Last Modified: 30 Apr 2026 · CVPR-NAS26 Oral · License: CC BY 4.0
Keywords: Neural Architecture Search, Large Language Models, Few-Shot Prompting, Code Generation, Hash Validation
TL;DR: We demonstrate that three-shot prompting optimizes LLM-based neural architecture generation and introduce a rapid whitespace-normalized hashing method to eliminate redundant training.
Abstract: While large language models (LLMs) have unlocked a new paradigm for automated neural architecture design, their generative power remains poorly understood: the critical question of how in-context example count governs generation quality has never been systematically investigated. Building on the NNGPT/LEMUR framework, we conduct the first empirical study of few-shot prompting for neural architecture generation. We generate and evaluate 1,900 neural architectures produced by LLMs across six common vision benchmarks, rigorously assessing how the number of supporting examples ($n \in \{1,\dots,6\}$) shapes generation stability, architectural diversity, and early-epoch validation performance. Although $n{=}3$ yields the highest dataset-balanced mean accuracy (53.1\%), this improvement is task-dependent. Performance gains increase with task complexity (CIFAR-100: $+11.6\%$, $p{=}0.001$, $d{=}0.73$; CelebA-Gender: $+6.5\%$ at $n{=}2$, $p{=}0.038$), whereas larger context sizes ($n{>}3$) lead to statistically significant degradation on more structured benchmarks (ImageNette: $-14.5\%$, $p{=}0.010$; CIFAR-10: $-8.4\%$, $p{=}0.016$). At $n{=}6$, generation performance collapses entirely (99.8\% failure rate), further corroborating that $n{=}3$ strikes an optimal balance between providing informative context and avoiding prompt saturation. Qualitative analysis reveals that $n{=}3$ uniquely enables architectural pattern synthesis---hybrid ResNet-DPN and ResNet-AlexNet structures---absent under single-example prompting. To support scalable deployment, we introduce whitespace-normalized hashing for real-time duplicate detection, achieving a $100\times$ speedup over AST-based methods and eliminating redundant training of formatting-level duplicates. Together, these findings establish prompt context calibration and lightweight validation as key principles for scalable, resource-efficient LLM-driven architecture design.
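The whitespace-normalized hashing described in the abstract can be illustrated with a minimal sketch. The paper does not publish its implementation, so the code below is an assumed interpretation: it fingerprints a generated architecture's source by collapsing all runs of whitespace before hashing, so that two generations differing only in formatting map to the same digest, and contrasts this with a slower AST-based structural hash. Function names (`whitespace_normalized_hash`, `ast_hash`) are illustrative, not the authors' API.

```python
import ast
import hashlib

def whitespace_normalized_hash(source: str) -> str:
    """Fingerprint code modulo formatting: collapse every run of
    whitespace (spaces, tabs, newlines) to a single space, then hash.
    Assumed sketch of the paper's method, not the authors' code."""
    normalized = " ".join(source.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def ast_hash(source: str) -> str:
    """Structural baseline: parse to an AST and hash its dump.
    Robust to more rewrites, but requires a full parse per candidate."""
    tree = ast.parse(source)
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()

# Two generations of the same layer definition that differ only in
# indentation style (spaces vs. tab) -- formatting-level duplicates.
a = "def forward(x):\n    return x + 1\n"
b = "def forward(x):\n\treturn x + 1\n"

assert whitespace_normalized_hash(a) == whitespace_normalized_hash(b)
assert ast_hash(a) == ast_hash(b)
```

Before training a newly generated architecture, its digest is compared against a set of seen digests; a hit means the candidate is a formatting-level duplicate and its (expensive) training run can be skipped. The trade-off is that whitespace normalization misses semantically equivalent rewrites (e.g. renamed variables) that an AST or deeper canonicalization would catch, which is the cost of the reported speedup.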
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 17