Adversarially robust models are locally smooth around each data sample, so that small perturbations cannot drastically change model outputs. In modern systems, such smoothness is usually obtained via adversarial training, which explicitly trains models to perform well on perturbed examples. In this work, we show the surprising effectiveness of instead regularizing the gradient with respect to model inputs on natural examples only. Penalizing the input gradient norm is commonly believed to be a much inferior approach. Our analyses identify that the performance of gradient norm regularization critically depends on the smoothness of activation functions; contrary to prior belief, it is in fact extremely effective on modern vision transformers, which adopt smooth activations over piecewise-linear ones (e.g., ReLU). On ImageNet-1k, gradient norm training achieves over 90% of the performance of state-of-the-art PGD-3 adversarial training (52% vs. 56%), while using only 60% of its computation cost and no complex adversarial optimization.
Minimizing the input gradient norm yields a highly competitive model despite training only on natural examples and requiring only 60% of the computational budget. On AutoAttack \(L_{\infty}\) with \(\epsilon=\frac{4}{255}\), the standard benchmark for ImageNet, we obtain 51.58% robust accuracy compared to the 56.12% of state-of-the-art adversarial training.
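The objective above can be illustrated with a minimal pure-Python sketch: a logistic-regression "model" whose loss is the cross-entropy plus \(\lambda\,\|\nabla_x L\|^2\), the squared norm of the gradient with respect to the input. This is a toy illustration of the penalty, not the paper's implementation; real training would compute the penalty's parameter gradient with autodiff (double backpropagation through the network), whereas here we use finite differences. All names and constants are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y, lam):
    """Cross-entropy on one example plus input-gradient-norm penalty."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = sigmoid(z)
    ce = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    # closed-form input gradient of the cross-entropy: dL/dx = (p - y) * w
    g = [(p - y) * wi for wi in w]
    penalty = sum(gi * gi for gi in g)  # squared L2 norm of the input gradient
    return ce + lam * penalty

def grad_w(w, x, y, lam, eps=1e-5):
    """Finite-difference gradient of the regularized loss w.r.t. the weights.
    A real implementation would use autodiff (double backprop) instead."""
    g = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += eps
        wm[i] -= eps
        g.append((loss(wp, x, y, lam) - loss(wm, x, y, lam)) / (2 * eps))
    return g

# a few SGD steps on a single toy example
w, x, y, lr = [0.5, -0.3], [1.0, 2.0], 1, 0.1
for _ in range(100):
    g = grad_w(w, x, y, lam=0.1)
    w = [wi - lr * gi for wi, gi in zip(w, g)]
```

Because the penalty depends on the model's gradient, optimizing it requires differentiating a gradient, which is where the smoothness of the activation function (discussed next) becomes critical.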
As the figures below show, the ResNet with ReLU is incapable of fitting the gradient norm regularizer: clean performance decays sharply while robust performance barely increases. In contrast, the GELU ResNet displays convergence behaviour similar to the adversarially trained models, reaching very similar clean and robust accuracies. Xie et al. [3] conducted a similar analysis for adversarial training, observing small gains from smooth non-linearities. For gradient norm regularization the effect is more than 20 times larger (1% vs. 23%).
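The activation-smoothness effect can be seen directly in the derivatives. The sketch below (pure Python, illustrative only) compares the first derivatives of ReLU and GELU near zero: ReLU's derivative is a step function, so its second derivative is zero almost everywhere and the gradient penalty provides no curvature signal, whereas GELU's derivative varies smoothly.

```python
import math

def relu_grad(x):
    # piecewise-constant: 0 for x < 0, 1 for x > 0;
    # the second derivative is therefore 0 almost everywhere
    return 1.0 if x > 0 else 0.0

def gelu_grad(x):
    # GELU(x) = x * Phi(x); d/dx = Phi(x) + x * phi(x), smooth everywhere
    Phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2)))
    phi = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    return Phi + x * phi

# ReLU's gradient jumps across 0; GELU's changes continuously
for x in (-0.1, -0.01, 0.01, 0.1):
    print(f"x={x:+.2f}  relu'={relu_grad(x):.3f}  gelu'={gelu_grad(x):.3f}")
```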
[1] Wightman, R.: PyTorch Image Models (2019). https://doi.org/10.5281/zenodo.4414861, https://github.com/rwightman/pytorch-image-models
[2] Liu, C., Dong, Y., Xiang, W., Yang, X., Su, H., Zhu, J., Chen, Y., He, Y., Xue, H., Zheng, S.: A Comprehensive Study on Robustness of Image Classification Models: Benchmarking and Rethinking (Feb 2023). https://doi.org/10.48550/arXiv.2302.14301, arXiv:2302.14301 [cs]
[3] Xie, C., Tan, M., Gong, B., Yuille, A., Le, Q.V.: Smooth Adversarial Training (Jul 2021). http://arxiv.org/abs/2006.14536, arXiv:2006.14536 [cs]
[4] Simon-Gabriel, C.J., Ollivier, Y., Bottou, L., Schölkopf, B., Lopez-Paz, D.: First-order Adversarial Vulnerability of Neural Networks and Input Dimension (Jun 2019). http://arxiv.org/abs/1802.01421, arXiv:1802.01421 [cs, stat]
[5] Ross, A.S., Doshi-Velez, F.: Improving the Adversarial Robustness and Interpretability of Deep Neural Networks by Regularizing their Input Gradients (Nov 2017). http://arxiv.org/abs/1711.09404, arXiv:1711.09404 [cs]
[6] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards Deep Learning Models Resistant to Adversarial Attacks. In: International Conference on Learning Representations (2018). https://arxiv.org/abs/1706.06083
@inproceedings{rodriguezmunoz2024characterizing,
title={Characterizing model robustness via natural input gradients},
author={Adrián Rodríguez-Muñoz and Tongzhou Wang and Antonio Torralba},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
year={2024},
url={}
}