Share this page:

Robustness Verification for Transformers

Zhouxing Shi, Huan Zhang, Kai-Wei Chang, Minlie Huang, and Cho-Jui Hsieh, in ICLR, 2020.

Code

Download the full text


Abstract

Robustness verification that aims to formally certify the prediction behavior of neural networks has become an important tool for understanding the behavior of a given model and for obtaining safety guarantees. However, previous methods are usually limited to relatively simple neural networks. In this paper, we consider the robustness verification problem for Transformers. Transformers have complex self-attention layers that pose many challenges for verification, including cross-nonlinearity and cross-position dependency, which have not been discussed in previous work. We resolve these challenges and develop the first verification algorithm for Transformers. The certified robustness bounds computed by our method are significantly tighter than those by naive Interval Bound Propagation. These bounds also shed light on interpreting Transformers as they consistently reflect the importance of words in sentiment analysis.



Bib Entry

@inproceedings{shi2020robustnest,
  author = {Shi, Zhouxing and Zhang, Huan and Chang, Kai-Wei and Huang, Minlie and Hsieh, Cho-Jui},
  title = {Robustness Verification for Transformers},
  booktitle = {ICLR},
  year = {2020}
}

Related Publications

  • Defense against Synonym Substitution-based Adversarial Attacks via Dirichlet Neighborhood Ensemble

    Yi Zhou, Xiaoqing Zheng, Cho-Jui Hsieh, Kai-Wei Chang, and Xuanjing Huang, in ACL, 2021.
    Full Text Abstract BibTeX Details
    Although deep neural networks have achieved prominent performance on many NLP tasks, they are vulnerable to adversarial examples. We propose Dirichlet Neighborhood Ensemble (DNE), a randomized method for training a robust model to defense synonym substitutionbased attacks. During training, DNE forms virtual sentences by sampling embedding vectors for each word in an input sentence from a convex hull spanned by the word and its synonyms, and it augments them with the training data. In such a way, the model is robust to adversarial attacks while maintaining the performance on the original clean data. DNE is agnostic to the network architectures and scales to large models (e.g., BERT) for NLP applications. Through extensive experimentation, we demonstrate that our method consistently outperforms recently proposed defense methods by a significant margin across different network architectures and multiple data sets.
    @inproceedings{zhou2021defensf,
      title = {Defense against Synonym Substitution-based Adversarial Attacks via Dirichlet Neighborhood Ensemble},
      author = {Zhou, Yi and Zheng, Xiaoqing and Hsieh, Cho-Jui and Chang, Kai-Wei and Huang, Xuanjing},
      booktitle = {ACL},
      year = {2021}
    }
    
    Details
  • Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation

    Chong Zhang, Jieyu Zhao, Huan Zhang, Kai-Wei Chang, and Cho-Jui Hsieh, in NAACL, 2021.
    Full Text Video Code Abstract BibTeX Details
    Robustness and counterfactual bias are usually evaluated on a test dataset. However, are these evaluations robust? If the test dataset is perturbed slightly, will the evaluation results keep the same? In this paper, we propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset. The framework first perturbs the test dataset to construct abundant natural sentences similar to the test data, and then diagnoses the prediction change regarding a single-word substitution. We apply this framework to study two perturbation-based approaches that are used to analyze models’ robustness and counterfactual bias in English. (1) For robustness, we focus on synonym substitutions and identify vulnerable examples where prediction can be altered. Our proposed attack attains high success rates (96.0%-99.8%) in finding vulnerable examples on both original and robustly trained CNNs and Transformers. (2) For counterfactual bias, we focus on substituting demographic tokens (e.g., gender, race) and measure the shift of the expected prediction among constructed sentences. Our method is able to reveal the hidden model biases not directly shown in the test dataset.
    @inproceedings{zhang2021doublf,
      title = {	Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation},
      booktitle = {NAACL},
      author = {Zhang, Chong and Zhao, Jieyu and Zhang, Huan and Chang, Kai-Wei and Hsieh, Cho-Jui},
      year = {2021},
      presentation_id = {https://underline.io/events/122/sessions/4229/lecture/19609-double-perturbation-on-the-robustness-of-robustness-and-counterfactual-bias-evaluation}
    }
    
    Details
  • Provable, Scalable and Automatic Perturbation Analysis on General Computational Graphs

    Kaidi Xu, Zhouxing Shi, Huan Zhang, Yihan Wang, Kai-Wei Chang, Minlie Huang, Bhavya Kailkhura, Xue Lin, and Cho-Jui Hsieh, in NeurIPS, 2020.
    Full Text Code Abstract BibTeX Details
    Linear relaxation based perturbation analysis (LiRPA) for neural networks, which computes provable linear bounds of output neurons given a certain amount of input perturbation, has become a core component in robustness verification and certified defense. The majority of LiRPA-based methods only consider simple feed-forward networks and it needs particular manual derivations and implementations when extended to other architectures. In this paper, we develop an automatic framework to enable perturbation analysis on any neural network structures, by generalizing exiting LiRPA algorithms such as CROWN to operate on general computational graphs. The flexibility, differentiability and ease of use of our framework allow us to obtain state-of-the-art results on LiRPA based certified defense on fairly complicated networks like DenseNet, ResNeXt and Transformer that are not supported by prior work. Our framework also enables loss fusion, a technique that significantly reduces the computational complexity of LiRPA for certified defense. For the first time, we demonstrate LiRPA based certified defense on Tiny ImageNet and Downscaled ImageNet where previous approaches cannot scale to due to the relatively large number of classes. Our work also yields an open-source library for the community to apply LiRPA to areas beyond certified defense without much LiRPA expertise, e.g., we create a neural network with a provably flat optimization landscape. Our open source library is available at https://github.com/KaidiXu/auto_LiRPA
    @inproceedings{xu2020provablf,
      author = {Xu, Kaidi and Shi, Zhouxing and Zhang, Huan and Wang, Yihan and Chang, Kai-Wei and Huang, Minlie and Kailkhura, Bhavya and Lin, Xue and Hsieh, Cho-Jui},
      title = {Provable, Scalable and Automatic Perturbation Analysis on General Computational Graphs},
      booktitle = {NeurIPS},
      year = {2020}
    }
    
    Details
  • On the Robustness of Language Encoders against Grammatical Errors

    Fan Yin, Quanyu Long, Tao Meng, and Kai-Wei Chang, in ACL, 2020.
    Full Text Slides Video Code Abstract BibTeX Details
    We conduct a thorough study to diagnose the behaviors of pre-trained language encoders (ELMo, BERT, and RoBERTa) when confronted with natural grammatical errors. Specifically, we collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data. We use this approach to facilitate debugging models on downstream applications. Results confirm that the performance of all tested models is affected but the degree of impact varies. To interpret model behaviors, we further design a linguistic acceptability task to reveal their abilities in identifying ungrammatical sentences and the position of errors. We find that fixed contextual encoders with a simple classifier trained on the prediction of sentence correctness are able to locate error positions. We also design a cloze test for BERT and discover that BERT captures the interaction between errors and specific tokens in context. Our results shed light on understanding the robustness and behaviors of language encoders against grammatical errors.
    @inproceedings{yin2020robustnest,
      author = {Yin, Fan and Long, Quanyu and Meng, Tao and Chang, Kai-Wei},
      title = {On the Robustness of Language Encoders against Grammatical Errors},
      booktitle = {ACL},
      presentation_id = {https://virtual.acl2020.org/paper_main.310.html},
      year = {2020}
    }
    
    Details
  • Robustness Verification for Transformers

    Zhouxing Shi, Huan Zhang, Kai-Wei Chang, Minlie Huang, and Cho-Jui Hsieh, in ICLR, 2020.
    Full Text Video Code Abstract BibTeX Details
    Robustness verification that aims to formally certify the prediction behavior of
    neural networks has become an important tool for understanding the behavior of
    a given model and for obtaining safety guarantees. However, previous methods
    are usually limited to relatively simple neural networks. In this paper, we consider the robustness verification problem for Transformers. Transformers have
    complex self-attention layers that pose many challenges for verification, including
    cross-nonlinearity and cross-position dependency, which have not been discussed
    in previous work. We resolve these challenges and develop the first verification
    algorithm for Transformers. The certified robustness bounds computed by our
    method are significantly tighter than those by naive Interval Bound Propagation.
    These bounds also shed light on interpreting Transformers as they consistently
    reflect the importance of words in sentiment analysis.
    @inproceedings{shi2020robustnest,
      author = {Shi, Zhouxing and Zhang, Huan and Chang, Kai-Wei and Huang, Minlie and Hsieh, Cho-Jui},
      title = {Robustness Verification for Transformers},
      booktitle = {ICLR},
      year = {2020}
    }
    
    Details
  • Learning to Discriminate Perturbations for Blocking Adversarial Attacks in Text Classification

    Yichao Zhou, Jyun-Yu Jiang, Kai-Wei Chang, and Wei Wang, in EMNLP, 2019.
    Full Text Code Abstract BibTeX Details
    Adversarial attacks against machine learning models have threatened various real-world applications such as spam filtering and sentiment analysis. In this paper, we propose a novel framework, learning to DIScriminate Perturbations (DISP), to identify and adjust malicious perturbations, thereby blocking adversarial attacks for text classification models. To identify adversarial attacks, a perturbation discriminator validates how likely a token in the text is perturbed and provides a set of potential perturbations. For each potential perturbation, an embedding estimator learns to restore the embedding of the original word based on the context and a replacement token is chosen based on approximate kNN search. DISP can block adversarial attacks for any NLP model without modifying the model structure or training procedure. Extensive experiments on two benchmark datasets demonstrate that DISP significantly outperforms baseline methods in blocking adversarial attacks for text classification. In addition, in-depth analysis shows the robustness of DISP across different situations.
    @inproceedings{zhou2019learninh,
      author = {Zhou, Yichao and Jiang, Jyun-Yu and Chang, Kai-Wei and Wang, Wei},
      title = {Learning to Discriminate Perturbations for Blocking Adversarial Attacks in Text Classification},
      booktitle = {EMNLP},
      year = {2019}
    }
    
    Details
  • Retrofitting Contextualized Word Embeddings with Paraphrases

    Weijia Shi, Muhao Chen, Pei Zhou, and Kai-Wei Chang, in EMNLP (short), 2019.
    Full Text Slides Video Code Abstract BibTeX Details
    Contextualized word embedding models, such as ELMo, generate meaningful representations of words and their context. These models have been shown to have a great impact on downstream applications. However, in many cases, the contextualized embedding of a word changes drastically when the context is paraphrased. As a result, the downstream model is not robust to paraphrasing and other linguistic variations. To enhance the stability of contextualized word embedding models, we propose an approach to retrofitting contextualized embedding models with paraphrase contexts. Our method learns an orthogonal transformation on the input space, which seeks to minimize the variance of word representations on paraphrased contexts. Experiments show that the retrofitted model significantly outperforms the original ELMo on various sentence classification and language inference tasks.
    @inproceedings{shi2019retrofittinh,
      author = {Shi, Weijia and Chen, Muhao and Zhou, Pei and Chang, Kai-Wei},
      title = {Retrofitting Contextualized Word Embeddings with Paraphrases},
      booktitle = {EMNLP (short)},
      vimeo_id = {430797636},
      year = {2019}
    }
    
    Details
  • Generating Natural Language Adversarial Examples

    Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang, in EMNLP (short), 2018.
    Full Text Code Abstract BibTeX Details
    Deep neural networks (DNNs) are vulnerable to adversarial examples, perturbations to correctly classified examples which can cause the network to misclassify. In the image domain, these perturbations can often be made virtually indistinguishable to human perception, causing humans and state-of-the-art models to disagree. However, in the natural language domain, small perturbations are clearly perceptible, and the replacement of a single word can drastically alter the semantics of the document. Given these challenges, we use a population-based optimization algorithm to generate semantically and syntactically similar adversarial examples. We demonstrate via a human study that 94.3% of the generated examples are classified to the original label by human evaluators, and that the examples are perceptibly quite similar. We hope our findings encourage researchers to pursue improving the robustness of DNNs in the natural language domain.
    @inproceedings{alzanto2018generatinh,
      author = {Alzantot, Moustafa and Sharma, Yash and Elgohary, Ahmed and Ho, Bo-Jhang and Srivastava, Mani and Chang, Kai-Wei},
      title = {Generating Natural Language Adversarial Examples},
      booktitle = {EMNLP (short)},
      year = {2018}
    }
    
    Details