Adversarially robust models are locally smooth around each data sample, so that small perturbations cannot drastically change model outputs. In modern systems, such smoothness is usually obtained via adversarial training, which explicitly trains models to perform well on perturbed examples. In this work, we show the surprising effectiveness of instead regularizing the gradient with respect to model inputs on natural examples only. Penalizing the input gradient norm is commonly believed to be a much inferior approach. Our analyses identify that the performance of gradient norm regularization critically depends on the smoothness of activation functions; contrary to prior belief, it is in fact extremely effective on modern vision transformers, which adopt smooth activations over piecewise-linear ones (e.g., ReLU). On ImageNet-1k, gradient norm training achieves over 90% of the performance of state-of-the-art PGD-3 adversarial training (52% vs. 56%), while using only 60% of its computation cost and no complex adversarial optimization.
Minimizing the input gradient norm yields a highly competitive model despite training only on natural examples and requiring only 60% of the computational budget. On AutoAttack \(L_{\infty}\) with \(\epsilon=\frac{4}{255}\), the standard benchmark for ImageNet, we obtain 51.58% robust accuracy compared to the 56.12% of state-of-the-art adversarial training.
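The objective above can be illustrated with a minimal pure-Python sketch: a logistic-regression "model" whose loss is the cross-entropy plus \(\lambda\,\|\nabla_x L\|^2\), the squared norm of the gradient with respect to the input. This is a toy illustration of the penalty, not the paper's implementation; real training would compute the penalty's parameter gradient with autodiff (double backpropagation through the network), whereas here we use finite differences. All names and constants are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y, lam):
    """Cross-entropy on one example plus input-gradient-norm penalty."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = sigmoid(z)
    ce = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    # closed-form input gradient of the cross-entropy: dL/dx = (p - y) * w
    g = [(p - y) * wi for wi in w]
    penalty = sum(gi * gi for gi in g)  # squared L2 norm of the input gradient
    return ce + lam * penalty

def grad_w(w, x, y, lam, eps=1e-5):
    """Finite-difference gradient of the regularized loss w.r.t. the weights.
    A real implementation would use autodiff (double backprop) instead."""
    g = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += eps
        wm[i] -= eps
        g.append((loss(wp, x, y, lam) - loss(wm, x, y, lam)) / (2 * eps))
    return g

# a few SGD steps on a single toy example
w, x, y, lr = [0.5, -0.3], [1.0, 2.0], 1, 0.1
for _ in range(100):
    g = grad_w(w, x, y, lam=0.1)
    w = [wi - lr * gi for wi, gi in zip(w, g)]
```

Because the penalty depends on the model's gradient, optimizing it requires differentiating a gradient, which is where the smoothness of the activation function (discussed next) becomes critical.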
As the figures below show, the ResNet with ReLU is incapable of fitting the gradient norm regularizer: clean performance decays sharply while robust performance barely increases. In contrast, the GELU ResNet displays convergence behaviour similar to the adversarially trained models, reaching very similar clean and robust accuracies. Xie et al. [3] conducted a similar analysis for adversarial training, observing small gains from smooth non-linearities. For gradient norm regularization the effect is more than 20 times larger (1% vs. 23%).
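The activation-smoothness effect can be seen directly in the derivatives. The sketch below (pure Python, illustrative only) compares the first derivatives of ReLU and GELU near zero: ReLU's derivative is a step function, so its second derivative is zero almost everywhere and the gradient penalty provides no curvature signal, whereas GELU's derivative varies smoothly.

```python
import math

def relu_grad(x):
    # piecewise-constant: 0 for x < 0, 1 for x > 0;
    # the second derivative is therefore 0 almost everywhere
    return 1.0 if x > 0 else 0.0

def gelu_grad(x):
    # GELU(x) = x * Phi(x); d/dx = Phi(x) + x * phi(x), smooth everywhere
    Phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2)))
    phi = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    return Phi + x * phi

# ReLU's gradient jumps across 0; GELU's changes continuously
for x in (-0.1, -0.01, 0.01, 0.1):
    print(f"x={x:+.2f}  relu'={relu_grad(x):.3f}  gelu'={gelu_grad(x):.3f}")
```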
[1] Wightman, R.: PyTorch Image Models (2019). https://doi.org/10.5281/zenodo.4414861, https://github.com/rwightman/pytorch-image-models
[2] Liu, C., Dong, Y., Xiang, W., Yang, X., Su, H., Zhu, J., Chen, Y., He, Y., Xue, H., Zheng, S.: A Comprehensive Study on Robustness of Image Classification Models: Benchmarking and Rethinking (Feb 2023). https://doi.org/10.48550/arXiv.2302.14301, arXiv:2302.14301 [cs]
[3] Xie, C., Tan, M., Gong, B., Yuille, A., Le, Q.V.: Smooth Adversarial Training (Jul 2021). http://arxiv.org/abs/2006.14536, arXiv:2006.14536 [cs]
[4] Simon-Gabriel, C.J., Ollivier, Y., Bottou, L., Schölkopf, B., Lopez-Paz, D.: First-order Adversarial Vulnerability of Neural Networks and Input Dimension (Jun 2019). http://arxiv.org/abs/1802.01421, arXiv:1802.01421 [cs, stat]
[5] Ross, A.S., Doshi-Velez, F.: Improving the Adversarial Robustness and Interpretability of Deep Neural Networks by Regularizing their Input Gradients (Nov 2017). http://arxiv.org/abs/1711.09404, arXiv:1711.09404 [cs]
[6] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards Deep Learning Models Resistant to Adversarial Attacks. In: International Conference on Learning Representations (2018). https://arxiv.org/abs/1706.06083
@inproceedings{rodriguezmunoz2024characterizing,
title={Characterizing model robustness via natural input gradients},
author={Adrián Rodríguez-Muñoz and Tongzhou Wang and Antonio Torralba},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
year={2024},
url={}
}