Abstract: Neural Architecture Search (NAS) benchmarks have significantly improved the development and comparison of NAS methods, while drastically reducing the computational overhead by providing meta-information about trained neural networks. However, tabular benchmarks have several drawbacks that can hinder fair comparisons and yield unreliable results. They typically focus on a small pool of operations within heavily constrained search spaces, usually cell-based neural networks with pre-defined outer skeletons. In this work, we conducted an empirical analysis of the widely used NAS-Bench-101, NAS-Bench-201 and TransNAS-Bench-101 benchmarks in terms of their generalizability and of how different operations influence the performance of the generated architectures. We found that only a subset of the operation pool is required to generate architectures close to the upper bound of the performance range. Moreover, the performance distribution is negatively skewed, with many architectures clustered near the upper accuracy bound. Further experiments revealed that convolution layers have the highest impact on an architecture's performance and that specific combinations of operations favor top-scoring architectures. Overall, our results demonstrate the need for benchmarks with greater operation diversity and less constrained search spaces. We provide suggestions for improving future benchmark design and for evaluating NAS methods with existing benchmarks. The code used to conduct the evaluations is available at https://github.com/VascoLopes/NAS-Benchmark-Evaluation.
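The following is a minimal sketch, not the authors' code, of the kind of analysis the abstract describes: measuring the skewness of a benchmark's accuracy distribution and the average accuracy of architectures containing each operation. The `records` list of (cell operations, test accuracy) pairs is hypothetical; in practice these values would be queried from a tabular benchmark such as NAS-Bench-201.

```python
from collections import defaultdict
from statistics import mean
from scipy.stats import skew

# Hypothetical benchmark entries: (operations used in the cell, final test accuracy).
records = [
    (["conv3x3", "conv1x1", "skip"], 0.936),
    (["avgpool", "skip", "skip"],    0.842),
    (["conv3x3", "conv3x3", "none"], 0.941),
    (["conv1x1", "avgpool", "none"], 0.887),
]

# Skewness of the accuracy distribution; a negative value indicates
# that many architectures cluster near the upper accuracy bound.
accuracies = [acc for _, acc in records]
print(f"skewness of accuracy distribution: {skew(accuracies):.3f}")

# Mean accuracy of architectures that contain each operation at least once.
per_op = defaultdict(list)
for ops, acc in records:
    for op in set(ops):
        per_op[op].append(acc)

for op, accs in sorted(per_op.items(), key=lambda kv: -mean(kv[1])):
    print(f"{op:>8}: mean accuracy {mean(accs):.3f} over {len(accs)} architectures")
```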