Abstract: Several algorithms are proposed to improve the robustness of deep neural networks against adversarial perturbations beyond $\ell_p$ cases, i.e. weather perturbations. However, evaluations of existing robust training algorithms are over-optimistic. This is in part due to the lack of a standardized evaluation protocol across various robust training algorithms, leading to ad-hoc methods that test robustness on either random perturbations or the adversarial samples from generative models that are used for robust training, which is either uninformative of the worst case, or is heavily biased. In this paper, we identify such evaluation bias in these existing works and propose the first standardized and fair evaluation that compares various robust training algorithms by using physics simulators for common adverse weather effects i.e. rain and snow. With this framework, we evaluated several existing robust training algorithms on two streetview classification datasets (BIC\_GSV, Places365) and show the evaluation bias in experiments.