Share this page:

Improving the Adversarial Robustness of NLP Models by Information Bottleneck

Cenyuan Zhang, Xiang Zhou, Yixin Wan, Xiaoqing Zheng, Kai-Wei Chang, and Cho-Jui Hsieh, in ACL-Finding, 2022.

Download the full text


Abstract

Existing studies have demonstrated that adversarial examples can be directly attributed to the presence of non-robust features, which are highly predictive, but can be easily manipulated by adversaries to fool NLP models. In this study, we explore the feasibility of capturing task-specific robust features, while eliminating the non-robust ones by using the information bottleneck theory. Through extensive experiments, we show that the models trained with our information bottleneck-based method are able to achieve a significant improvement in robust accuracy, exceeding performances of all the previously reported defense methods while suffering almost no performance drop in clean accuracy on SST-2, AGNEWS and IMDB datasets.


Bib Entry

@inproceedings{zhang2022improving,
  title = {Improving the Adversarial Robustness of NLP Models by Information Bottleneck},
  author = {Zhang, Cenyuan and Zhou, Xiang and Wan, Yixin and Zheng, Xiaoqing and Chang, Kai-Wei and Hsieh, Cho-Jui},
  booktitle = {ACL-Finding},
  year = {2022}
}

Related Publications

  1. VideoCon: Robust video-language alignment via contrast captions, CVPR, 2024
  2. CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning, ICCV, 2023
  3. Red Teaming Language Model Detectors with Language Models, TACL, 2023
  4. ADDMU: Detection of Far-Boundary Adversarial Examples with Data and Model Uncertainty Estimation, EMNLP, 2022
  5. Investigating Ensemble Methods for Model Robustness Improvement of Text Classifiers, EMNLP-Finding (short), 2022
  6. Unsupervised Syntactically Controlled Paraphrase Generation with Abstract Meaning Representations, EMNLP-Finding (short), 2022
  7. Searching for an Effiective Defender: Benchmarking Defense against Adversarial Word Substitution, EMNLP, 2021
  8. On the Transferability of Adversarial Attacks against Neural Text Classifier, EMNLP, 2021
  9. Defense against Synonym Substitution-based Adversarial Attacks via Dirichlet Neighborhood Ensemble, ACL, 2021
  10. Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation, NAACL, 2021
  11. Provable, Scalable and Automatic Perturbation Analysis on General Computational Graphs, NeurIPS, 2020
  12. On the Robustness of Language Encoders against Grammatical Errors, ACL, 2020
  13. Robustness Verification for Transformers, ICLR, 2020
  14. Learning to Discriminate Perturbations for Blocking Adversarial Attacks in Text Classification, EMNLP, 2019
  15. Retrofitting Contextualized Word Embeddings with Paraphrases, EMNLP (short), 2019
  16. Generating Natural Language Adversarial Examples, EMNLP (short), 2018