At UCLA-NLP, our mission is to develop reliable, fair, accountable, and robust natural language understanding and generation technology to benefit everyone.

Please see our recent papers at

In the following, we highlight our research papers at NAACL 2021 on these topics:


      Fairness and Social NLP



      Language Generation


      NLP Model Evaluation and Interpretation

      [1], [2]
      1. Evaluating the Values of Sources in Transfer Learning

        Md Rizwan Parvez and Kai-Wei Chang, in NAACL, 2021.
        QA Sessions: 14C-ORAL: INTERPRETABILITY AND ANALYSIS OF MODELS FOR NLP Paper link in the virtual conference
        Transfer learning that adapts a model trained on data-rich sources to low-resource targets has been widely applied in natural language processing (NLP). However, when training a transfer model over multiple sources, not every source is equally useful for the target. To better transfer a model, it is essential to understand the values of the sources. In this paper, we develop SEAL-Shap, an efficient source valuation framework for quantifying the usefulness of the sources (e.g., domains/languages) in transfer learning based on the Shapley value method. Experiments and comprehensive analyses on both cross-domain and cross-lingual transfers demonstrate that our framework is not only effective in choosing useful transfer sources, but the estimated source values also match intuitive source-target similarity.
        @inproceedings{parvez2021evaluating,
          title = {Evaluating the Values of Sources in Transfer Learning},
          author = {Parvez, Md Rizwan and Chang, Kai-Wei},
          booktitle = {NAACL},
          presentation_id = {https://underline.io/events/122/sessions/4261/lecture/19707-evaluating-the-values-of-sources-in-transfer-learning},
          year = {2021}
        }
        
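        To make the source-valuation idea concrete, here is a minimal sketch of the naive Monte-Carlo Shapley estimator that this line of work builds on; making this computation efficient is the point of SEAL-Shap, and the sketch below is ours, not the released code. The helper `train_and_evaluate` is a hypothetical placeholder that trains a transfer model on a given subset of sources and returns target-task performance.

        import random

        def shapley_source_values(sources, train_and_evaluate, num_permutations=100):
            """Monte-Carlo estimate of each source's Shapley value for a target task."""
            base = train_and_evaluate([])           # target-only baseline (no transfer)
            values = {s: 0.0 for s in sources}
            for _ in range(num_permutations):
                order = random.sample(sources, len(sources))
                prev, chosen = base, []
                for s in order:
                    chosen.append(s)
                    score = train_and_evaluate(list(chosen))
                    values[s] += score - prev       # marginal contribution of source s
                    prev = score
            return {s: v / num_permutations for s, v in values.items()}

        Sources with large estimated values are the ones worth keeping when training the final transfer model.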
      2. Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation

        Chong Zhang, Jieyu Zhao, Huan Zhang, Kai-Wei Chang, and Cho-Jui Hsieh, in NAACL, 2021.
        QA Sessions: 11B-ORAL: INTERPRETABILITY AND ANALYSIS OF MODELS FOR NLP Paper link in the virtual conference
        Robustness and counterfactual bias are usually evaluated on a test dataset. However, are these evaluations robust? If the test dataset is perturbed slightly, will the evaluation results stay the same? In this paper, we propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset. The framework first perturbs the test dataset to construct abundant natural sentences similar to the test data, and then diagnoses the prediction change regarding a single-word substitution. We apply this framework to study two perturbation-based approaches that are used to analyze models’ robustness and counterfactual bias in English. (1) For robustness, we focus on synonym substitutions and identify vulnerable examples where prediction can be altered. Our proposed attack attains high success rates (96.0%-99.8%) in finding vulnerable examples on both original and robustly trained CNNs and Transformers. (2) For counterfactual bias, we focus on substituting demographic tokens (e.g., gender, race) and measure the shift of the expected prediction among constructed sentences. Our method is able to reveal hidden model biases not directly shown in the test dataset.
        @inproceedings{zhang2021double,
          title = {Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation},
          booktitle = {NAACL},
          author = {Zhang, Chong and Zhao, Jieyu and Zhang, Huan and Chang, Kai-Wei and Hsieh, Cho-Jui},
          year = {2021},
          presentation_id = {https://underline.io/events/122/sessions/4229/lecture/19609-double-perturbation-on-the-robustness-of-robustness-and-counterfactual-bias-evaluation}
        }
        
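        As an illustration of the double-perturbation idea applied to counterfactual bias, here is a minimal sketch (ours, not the authors' implementation). `neighbors` and `predict_prob` are hypothetical placeholders: the first should construct natural sentences similar to a test sentence (the first perturbation), and the second should return the model's prediction probability for a sentence.

        def counterfactual_shift(sentence, neighbors, predict_prob,
                                 token_a="he", token_b="she"):
            """Average prediction shift when swapping a demographic token
            (the second perturbation) across constructed neighbor sentences."""
            shifts = []
            for sent in neighbors(sentence):
                words = sent.split()
                if token_a not in words:
                    continue
                swapped = " ".join(token_b if w == token_a else w for w in words)
                shifts.append(predict_prob(swapped) - predict_prob(sent))
            return sum(shifts) / len(shifts) if shifts else 0.0

        A shift far from zero across many neighbors points to a bias that the single original test sentence might not expose.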



      (Multi-Modal) Representation Learning

      [1], [2], [3]
      1. Unified Pre-training for Program Understanding and Generation

        Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang, in NAACL, 2021.
        QA Sessions: 8A-ORAL: MACHINE LEARNING FOR NLP: LANGUAGE MODELING AND SEQUENCE TO SEQUENCE MODELS Paper link in the virtual conference
        Top-10 cited paper at NAACL 21
        Code summarization and generation empower conversion between programming language (PL) and natural language (NL), while code translation avails the migration of legacy code from one PL to another. This paper introduces PLBART, a sequence-to-sequence model capable of performing a broad spectrum of program and language understanding and generation tasks. PLBART is pre-trained on an extensive collection of Java and Python functions and associated NL text via denoising autoencoding. Experiments on code summarization in the English language, code generation, and code translation in seven programming languages show that PLBART outperforms or rivals state-of-the-art models. Moreover, experiments on discriminative tasks, e.g., program repair, clone detection, and vulnerable code detection, demonstrate PLBART’s effectiveness in program understanding. Furthermore, analysis reveals that PLBART learns program syntax, style (e.g., identifier naming convention), and logical flow (e.g., an if block inside an else block is equivalent to an else if block), which are crucial to program semantics, and thus excels even with limited annotations.
        @inproceedings{ahmad2021unified,
          title = {Unified Pre-training for Program Understanding and Generation},
          author = {Ahmad, Wasi and Chakraborty, Saikat and Ray, Baishakhi and Chang, Kai-Wei},
          booktitle = {NAACL},
          presentation_id = {https://underline.io/events/122/sessions/4197/lecture/20024-unified-pre-training-for-program-understanding-and-generation},
          year = {2021}
        }
        
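        To illustrate the denoising-autoencoding objective behind PLBART's pre-training, here is a minimal sketch of how corrupted-input / original-output pairs could be built from an unlabeled function or sentence. It shows token masking only (the paper also uses other noising operations), and the masking rate here is an illustrative assumption, not the paper's exact setting.

        import random

        MASK = "<mask>"

        def noise(tokens, mask_prob=0.35):
            """Randomly replace tokens with a mask symbol."""
            return [MASK if random.random() < mask_prob else t for t in tokens]

        def make_denoising_example(code_or_text):
            tokens = code_or_text.split()
            return {"source": " ".join(noise(tokens)),  # corrupted input to the encoder
                    "target": code_or_text}             # decoder reconstructs the original

        print(make_denoising_example("def add ( a , b ) : return a + b"))

        Because the same reconstruction objective is applied to both code and natural-language text, a single sequence-to-sequence model learns representations that serve generation in either direction.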
      2. Disentangling Semantics and Syntax in Sentence Embeddings with Pre-trained Language Models

        James Y. Huang, Kuan-Hao Huang, and Kai-Wei Chang, in NAACL (short), 2021.
        QA Sessions: 4C-ORAL: SEMANTICS: SENTENCE-LEVEL SEMANTICS AND TEXTUAL INFERENCE Paper link in the virtual conference
        Pre-trained language models have achieved huge success on a wide range of NLP tasks. However, contextual representations from pre-trained models contain entangled semantic and syntactic information, and therefore cannot be directly used to derive useful semantic sentence embeddings for some tasks. Paraphrase pairs offer an effective way of learning the distinction between semantics and syntax, as they naturally share semantics and often vary in syntax. In this work, we present ParaBART, a semantic sentence embedding model that learns to disentangle semantics and syntax in sentence embeddings obtained by pre-trained language models. ParaBART is trained to perform syntax-guided paraphrasing, based on a source sentence that shares semantics with the target paraphrase, and a parse tree that specifies the target syntax. In this way, ParaBART learns disentangled semantic and syntactic representations from their respective inputs with separate encoders. Experiments in English show that ParaBART outperforms state-of-the-art sentence embedding models on unsupervised semantic similarity tasks. Additionally, we show that our approach can effectively remove syntactic information from semantic sentence embeddings, leading to better robustness against syntactic variation on downstream semantic tasks.
        @inproceedings{huang2021disentangling,
          title = {Disentangling Semantics and Syntax in Sentence Embeddings with Pre-trained Language Models},
          author = {Huang, James Y. and Huang, Kuan-Hao and Chang, Kai-Wei},
          booktitle = {NAACL (short)},
          presentation_id = {https://underline.io/events/122/sessions/4151/lecture/19910-disentangling-semantics-and-syntax-in-sentence-embeddings-with-pre-trained-language-models},
          year = {2021}
        }
        
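        The separate-encoder design can be pictured with a small PyTorch skeleton: one encoder reads the source sentence (semantics), another reads the target parse (syntax), and a decoder generates the target paraphrase from both. This is an illustrative sketch under assumed dimensions and module choices, not the released ParaBART architecture, which builds on a pre-trained BART model.

        import torch
        import torch.nn as nn

        class DisentangledParaphraser(nn.Module):
            def __init__(self, vocab_size, syn_vocab_size, d_model=512):
                super().__init__()
                self.word_embed = nn.Embedding(vocab_size, d_model)
                self.syn_embed = nn.Embedding(syn_vocab_size, d_model)
                self.sem_encoder = nn.TransformerEncoder(
                    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), 6)
                self.syn_encoder = nn.TransformerEncoder(
                    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), 2)
                self.decoder = nn.TransformerDecoder(
                    nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), 6)
                self.out = nn.Linear(d_model, vocab_size)

            def sentence_embedding(self, src_tokens):
                # Pooled output of the semantic encoder serves as the sentence embedding.
                return self.sem_encoder(self.word_embed(src_tokens)).mean(dim=1)

            def forward(self, src_tokens, tgt_parse, tgt_tokens):
                sem = self.sentence_embedding(src_tokens).unsqueeze(1)  # (B, 1, d)
                syn = self.syn_encoder(self.syn_embed(tgt_parse))       # (B, Ts, d)
                memory = torch.cat([sem, syn], dim=1)
                dec = self.decoder(self.word_embed(tgt_tokens), memory)
                return self.out(dec)  # logits over the target paraphrase

        Pooling the semantic side into a single vector acts as a bottleneck: syntactic detail needed for generation has to come from the parse encoder rather than leak into the sentence embedding.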

      3. Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions

        Liunian Harold Li, Haoxuan You, Zhecan Wang, Alireza Zareian, Shih-Fu Chang, and Kai-Wei Chang, in NAACL, 2021.
        QA Sessions: 15A-ORAL: LANGUAGE GROUNDING TO VISION, ROBOTICS AND BEYOND Paper link in the virtual conference
        Pre-trained contextual vision-and-language (V&L) models have brought impressive performance improvements on various benchmarks. However, the paired text-image data required for pre-training are hard to collect and scale up. We investigate if a strong V&L representation model can be learned without text-image pairs. We propose Weakly-supervised VisualBERT with the key idea of conducting "mask-and-predict" pre-training on language-only and image-only corpora. Additionally, we introduce the object tags detected by an object recognition model as anchor points to bridge the two modalities. Evaluation on four V&L benchmarks shows that Weakly-supervised VisualBERT achieves performance similar to that of a model pre-trained with paired data. Moreover, pre-training on more image-only data further improves a model that already has access to aligned data, suggesting the possibility of utilizing billions of raw images available to enhance V&L models.
        @inproceedings{li2021unsupervised,
          author = {Li, Liunian Harold and You, Haoxuan and Wang, Zhecan and Zareian, Alireza and Chang, Shih-Fu and Chang, Kai-Wei},
          title = {Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions},
          booktitle = {NAACL},
          presentation_id = {https://underline.io/events/122/sessions/4269/lecture/19725-unsupervised-vision-and-language-pre-training-without-parallel-images-and-captions},
          year = {2021}
        }
        
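        The mask-and-predict idea over unaligned corpora can be sketched as follows (an illustrative approximation, not the authors' code). `detect_objects` is a hypothetical placeholder that returns region features and object tags from an object recognition model; those tags act as the anchor text paired with image-only data.

        import random

        MASK = "[MASK]"

        def mask_tokens(tokens, prob=0.15):
            masked, labels = [], []
            for t in tokens:
                if random.random() < prob:
                    masked.append(MASK)
                    labels.append(t)       # token the model must predict
                else:
                    masked.append(t)
                    labels.append(None)    # not a prediction target
            return masked, labels

        def image_only_instance(image, detect_objects):
            # Detected object tags serve as anchor text for the image regions.
            regions, tags = detect_objects(image)
            masked_tags, labels = mask_tokens(tags)
            return {"text": masked_tags, "regions": regions, "labels": labels}

        def text_only_instance(sentence):
            masked, labels = mask_tokens(sentence.split())
            return {"text": masked, "regions": None, "labels": labels}

        Mixing batches of both instance types lets the same masked-prediction objective train the model without any paired text-image data.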

      Event Extraction
