01 Dec 2021 |
deep learning
neural architecture search
zero-cost proxies
Anonymous
Cases For and Against Zero-Cost Proxies
Based on the experiments from the previous section and on recent literature, we summarize the main strengths and weaknesses of ZC proxies.
Strengths
- Speed.
Although the previous section shows that ZC proxies do not always achieve a strong correlation with model performance, their speed sets ZC proxies apart from all other types of performance prediction techniques. All ZC proxies are computed with at most a forward and backward pass from a single minibatch of data. The wall-clock time depends on the size of the network and the type of data, but it typically takes five seconds or less on a GPU or CPU.
We encourage the community to think of ZC proxies as cheap “weak learners”, which may be combined with other ZC proxies, or other techniques, to achieve strong performance.
ZC proxies are especially useful when they improve other, slower techniques at little extra cost, or when they are used as features in a prediction model, which can then choose whether or not to ignore the signal from each individual ZC proxy on a task-by-task basis. In the next two points, we give specific examples of this.
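To make the "one forward and one backward pass on a single minibatch" cost concrete, here is a minimal sketch of a snip-style saliency score. It is only an illustration: the model is a toy linear regressor with a mean-squared-error loss rather than a neural network, the gradient is written out by hand, and the function name `snip_style_score` and the random data are assumptions made for this example.

```python
import random

def snip_style_score(weights, inputs, targets):
    """Sum of |w_j * dL/dw_j| for a linear model under mean-squared error."""
    n = len(inputs)
    # Forward pass over the single minibatch.
    preds = [sum(w * x for w, x in zip(weights, row)) for row in inputs]
    residuals = [p - t for p, t in zip(preds, targets)]
    # Backward pass: dL/dw_j = (2 / n) * sum_i residual_i * x_ij.
    grads = [
        2.0 / n * sum(r * row[j] for r, row in zip(residuals, inputs))
        for j in range(len(weights))
    ]
    return sum(abs(w * g) for w, g in zip(weights, grads))

# Hypothetical minibatch of 32 examples with 8 features each.
rng = random.Random(0)
batch = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(32)]
labels = [rng.gauss(0, 1) for _ in range(32)]
init_weights = [rng.gauss(0, 1) for _ in range(8)]  # untrained weights
score = snip_style_score(init_weights, batch, labels)
```

The score is computed from untrained weights and one batch, which is why the cost stays at a single forward and backward pass regardless of how long full training would take.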
- Usage with model-based prediction.
There is initial work in using ZC proxies as features in model-based prediction for NAS.
Model-based prediction is a common subroutine used to guide NAS algorithms, especially in combination with Bayesian optimization (e.g. NASBOT, BANANAS, NASBOWL). At various points in time during an algorithm, when there is already a set of architectures fully evaluated, a meta-model can be trained using the architecture topology as features, and the validation accuracies as labels. This model can then be used to predict the validation accuracy of new architectures that have not yet been evaluated. White et al. (2021) showed that adding jacob_cov as an additional feature can improve performance of this model by up to 20%.
Shen et al. (2021) further added ZC proxies to Bayesian optimization, showing 3-5x speedups over previous state-of-the-art methods.
Additional improvements may be possible, for example, if several zero-cost proxies were used as additional features instead of just jacob_cov. ZC proxies are particularly well-suited to serve as features of a model that predicts architecture performance, because this setup mitigates two of their downsides: the model can learn to use only the ZC proxies that are most correlated with the current task, and it can leverage information from multiple ZC proxies at once, including flops and params.
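The idea of feeding ZC proxy scores to a meta-model alongside the architecture encoding can be sketched as follows. This is a deliberately simple stand-in: a k-nearest-neighbour regressor replaces the predictors used in the cited work, and the encodings, proxy scores, accuracies, and the helper names `build_features` and `knn_predict` are all hypothetical.

```python
def build_features(encoding, zc_scores):
    """Concatenate an architecture encoding with its ZC proxy scores."""
    return list(encoding) + list(zc_scores)

def knn_predict(train_features, train_accuracies, query, k=2):
    """Predict accuracy as the mean over the k nearest evaluated architectures."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    ranked = sorted(
        zip(train_features, train_accuracies),
        key=lambda pair: sq_dist(pair[0], query),
    )
    nearest = ranked[:k]
    return sum(acc for _, acc in nearest) / len(nearest)

# Hypothetical data: (encoding, [jacob_cov score], validation accuracy).
evaluated = [
    ([1, 0, 0], [0.2], 0.70),
    ([0, 1, 0], [0.8], 0.90),
    ([0, 0, 1], [0.7], 0.88),
    ([1, 1, 0], [0.3], 0.74),
]
features = [build_features(enc, zc) for enc, zc, _ in evaluated]
accuracies = [acc for _, _, acc in evaluated]

# Score an architecture that has not been trained yet.
candidate = build_features([0, 1, 1], [0.75])
predicted = knn_predict(features, accuracies, candidate, k=2)
```

Adding more proxies only means extending the `zc_scores` list. A k-nearest-neighbour model weights every feature equally, so a learned predictor would be needed for the model to down-weight proxies that are uninformative on a given task.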
- Usage with one-shot methods.
There is also initial work by Xiang et al. (2021) in combining zero-cost proxies with one-shot methods.
Specifically, this work builds off of the popular recent work on perturbation-based operation selection for differentiable NAS.
Xiang et al. (2021) use ZC proxies to score operation perturbations to make decisions during the one-shot procedure. This leads to a new NAS method, Zero-Cost-PT, that can achieve up to 40x speedups compared to prior methods. Again, ZC proxies are particularly well-suited for this task, since many perturbations are encountered throughout each run of a one-shot algorithm, which must be quickly scored.
- Untapped potential.
Preliminary research shows that ZC proxies perform best not when they are used individually, but when they are used in combination. For example, Abdelfattah et al. showed that “vote”, the majority vote among jacob_cov, synflow, and snip, achieved top performance in their settings.
Recent work has also shown that combining jacob_cov, snip, synflow, and zen, as well as combining each ZC proxy with flops and params, leads to even better performance. Fleshing out this direction is a promising avenue for future work. Furthermore, why certain zero-cost proxies are effective remains relatively under-studied. Tackling this problem could be the key to better combining the strengths of each ZC proxy and to devising newer, better ZC proxies. Overall, ZC proxies have not yet achieved their full potential.
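As an illustration of combining proxies, the sketch below shows one plausible reading of a majority vote for comparing two architectures: each proxy votes for the architecture it scores higher, and the architecture with more votes wins. The exact procedure used in the cited work is not spelled out here, so the function `vote_prefers_first` and the scores are assumptions made for the example.

```python
def vote_prefers_first(scores_a, scores_b):
    """Return True if a majority of proxies score architecture A above B."""
    votes_for_a = sum(1 for name in scores_a if scores_a[name] > scores_b[name])
    return votes_for_a > len(scores_a) / 2

# Hypothetical proxy scores for two candidate architectures.
arch_a = {"jacob_cov": 0.61, "synflow": 12.4, "snip": 3.1}
arch_b = {"jacob_cov": 0.58, "synflow": 15.0, "snip": 2.7}
a_wins = vote_prefers_first(arch_a, arch_b)
```

Here jacob_cov and snip favour the first architecture while synflow favours the second, so the vote goes to the first. Because only the ordering within each proxy matters, the proxies do not need to be on a common scale.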
Weaknesses
- Unreliable performance.
In the previous section, our experiments across a diverse set of datasets and tasks showed that while ZC proxies perform well on some datasets and tasks, they do not perform well on other datasets and tasks (e.g. Gaussian data, PDE-solving, EMG signals) even when keeping the search space constant. For some tasks, the majority of ZC proxies have a negative correlation with model performance, meaning that ZC proxies would perform worse than randomly picking neural networks.
In Table 4, we even found that flops, a simple baseline, was the ZC proxy with the best average rank over all 12 tasks we studied.
Therefore, more work must be done to create ZC proxies that consistently outperform flops and params. There is already initial work in this direction, simply by combining ZC proxies with flops and params.
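The notion of a proxy correlating positively or negatively with model performance can be made concrete with a rank correlation. The sketch below uses Kendall's tau as an assumed metric (this section does not say which correlation the experiments used), and all scores and accuracies are made up for illustration.

```python
from itertools import combinations

def kendall_tau(scores, accuracies):
    """Kendall tau-a: (concordant - discordant) / total pairs; ties count as neither."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores)), 2):
        sign = (scores[i] - scores[j]) * (accuracies[i] - accuracies[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(scores) * (len(scores) - 1) // 2
    return (concordant - discordant) / total_pairs

# Hypothetical validation accuracies of four architectures.
accuracies = [0.70, 0.74, 0.88, 0.90]
helpful_proxy = [1.0, 2.0, 3.0, 4.0]     # ranks architectures correctly
misleading_proxy = [4.0, 3.0, 2.0, 1.0]  # ranks them in reverse

tau_helpful = kendall_tau(helpful_proxy, accuracies)
tau_misleading = kendall_tau(misleading_proxy, accuracies)
```

A value near zero corresponds to random selection, so a proxy with negative tau on a task would steer the search toward worse architectures than picking at random.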
- Unhelpful biases.
The goal of a ZC proxy is to correlate strongly with target error metrics.
However, ZC proxies have been found to have other strong preferences that may bias the search process.
For example, synflow has been shown both experimentally and theoretically to prefer large models by Ning et al. (2021).
Furthermore, Chen et al. (2021) experimentally show that snip has a preference for wide channels, while grasp has a preference for narrow architectures.
- Amdahl’s law.
In early ZC proxy research, one of the main selling points was the creation of new NAS algorithms that output an architecture in minutes. However, finding an architecture is only part of the machine learning pipeline, with discovered architectures still needing to be trained to be useful.
As a result, in practical settings, ZC proxies run into an issue akin to the one described by Amdahl’s law from parallel computing: they optimize only an already-fast component of the pipeline, so the overall achievable speedup is quite small.
For example, on the DARTS space, TE-NAS reports that it takes 0.05 GPU-days to achieve a CIFAR-10 accuracy close to that of PDARTS, which takes 0.3 GPU-days; this is a six-fold improvement in search time.
This gain is overwhelmed by the training time of a DARTS architecture, which is roughly 1.2 GPU-days, and thus the improvement for the full pipeline is only 1.2x.
In fact, the best possible theoretical improvement according to Amdahl’s law, in the case where the ZC proxy is truly zero-cost, is only 1.25x.
However, this weakness does not apply to applications in which ZC proxies are used to improve the performance of other techniques such as model-based prediction or one-shot models, described in the “Strengths” section.
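The arithmetic behind the 1.2x and 1.25x figures above can be reproduced with a short calculation, using the GPU-day numbers quoted in this section. The function name `pipeline_speedup` is chosen only for this sketch.

```python
def pipeline_speedup(old_search, new_search, train):
    """End-to-end speedup when only the search stage gets faster (GPU-days)."""
    return (old_search + train) / (new_search + train)

# PDARTS search (0.3) vs. TE-NAS search (0.05), plus 1.2 GPU-days of training.
reported = pipeline_speedup(old_search=0.3, new_search=0.05, train=1.2)

# Upper bound: the search stage costs nothing at all.
upper_bound = pipeline_speedup(old_search=0.3, new_search=0.0, train=1.2)
```

Here `reported` is 1.5 / 1.25, about 1.2, and `upper_bound` is 1.5 / 1.2, about 1.25, which matches the figures in the text.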
- Correlation decay.
Many ZC proxies, both data-oblivious and data-dependent, explicitly use model weights to predict an architecture’s performance.
Although this has not been a target of past ZC proxy papers, ideally the predictive power of ZC proxies would increase as one trains the architecture, allowing them to be combined with early-stopping methods.
However, in our experiments, we showed that in fact the performance of many proxies decreases with the number of training iterations.
On the other hand, ZC proxies could be used in tandem with other techniques that do have this property, such as learning curve extrapolation. There is some preliminary work in this vein.
Conclusions and Future Directions
In this blog post, we took a deeper look at zero-cost proxies for NAS. We ran new experiments using the recent NAS-Bench-360 and TransNAS-Bench-101 benchmarks to probe the effectiveness of zero-cost proxies on more diverse datasets than had previously been tested in existing literature.
Our main findings were the following:
- ZC proxies have differing performance profiles across tasks, and no single ZC proxy performs significantly better than the others across diverse tasks.
- ZC proxies still require further research since flops and params are very competitive baselines.
- Data-agnostic ZC proxies such as synflow have inconsistent performance across different tasks.
In general, ZC proxies are best thought of as cheap “weak learners” which can quickly improve the performance of other techniques.
Based on prior work and on our experimental observations, we find the following avenues for future work particularly promising:
- Integrating zero-cost methods into one-shot and model-based methods.
- Better ways of combining ZC proxies with each other and with flops and params.
- Understanding why zero-cost proxies work well in certain situations, which can lead to the development of even better ZC proxies.
Overall, while there are currently issues with inconsistency, zero-cost proxies are a promising, novel direction that is sure to play a key role in future NAS techniques.
-->

01 Sep 2021 |
sample
template
tutorial
Bubeck, Sebastien (Microsoft); Dobre, David (Mila); Gauthier, Charlie (Mila); Gidel, Gauthier (Mila); Vernade, Claire (DeepMind)
This post outlines a few more things you may need to know for creating and configuring your blog posts.
02 Apr 2020 |
test
tutorial
markdown
Doe, John, School of Life; Doe, Jane, A School
Howdy! This is an example blog post that shows several types of HTML content supported in this theme.