UCLA-NLP (Chang's and PLUS lab) @ NAACL 2021

At UCLA-NLP, our mission is to develop reliable, fair, accountable, robust natural language understanding and generation technology to benefit everyone.

Please see our recent papers at

In the following, we will highlight our reseach papers at NAACL 2021 on the following topics:

Fairness and Social NLP
Language Generation
(Multi-Modal) Represenation Learning
Model Evaluation and Interpretation
Event Extraction

"Nice Try, Kiddo": Investigating Ad Hominems in Dialogue Responses

Emily Sheng, Kai-Wei Chang, Prem Natarajan, and Nanyun Peng, in NAACL, 2021.

QA Sessions: 3A-ORAL: DIALOGUE AND INTERACTIVE SYSTEMS Paper link in the virtual conference

Full Text Code BibTeX Details

Ad hominem attacks are those that target some feature of a person’s character instead of the position the person is maintaining. These attacks are harmful because they propagate implicit biases and diminish a person’s credibility. Since dialogue systems respond directly to user input, it is important to study ad hominems in dialogue responses. To this end, we propose categories of ad hominems, compose an annotated dataset, and build a classifier to analyze human and dialogue system responses to English Twitter posts. We specifically compare responses to Twitter topics about marginalized communities (#BlackLivesMatter, #MeToo) versus other topics (#Vegan, #WFH), because the abusive language of ad hominems could further amplify the skew of power away from marginalized populations. Furthermore, we propose a constrained decoding technique that uses salient n-gram similarity as a soft constraint for top-k sampling to reduce the amount of ad hominems generated. Our results indicate that 1) responses from both humans and DialoGPT contain more ad hominems for discussions around marginalized communities, 2) different quantities of ad hominems in the training data can influence the likelihood of generating ad hominems, and 3) we can use constrained decoding techniques to reduce ad hominems in generated dialogue responses.

@inproceedings{sheng2021nice,
  title = {"Nice Try, Kiddo": Investigating Ad Hominems in Dialogue Responses},
  booktitle = {NAACL},
  author = {Sheng, Emily and Chang, Kai-Wei and Natarajan, Prem and Peng, Nanyun},
  presentation_id = {https://underline.io/events/122/sessions/4137/lecture/19854-%27nice-try,-kiddo%27-investigating-ad-hominems-in-dialogue-responses},
  year = {2021}
}

Really excited about 👉 “Nice Try, Kiddo”: Investigating Ad Hominems in Dialogue Responses (https://t.co/A9aBtzyXmm) w/@kaiwei_chang @natarajan_prem @VioletNPeng #NAACL2021
We find that there are more ad hominem responses in discussions about marginalized communities…
— Emily Sheng (@ewsheng) April 14, 2021

Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems

Yixin Wan, Jieyu Zhao, Aman Chadha, Nanyun Peng, and Kai-Wei Chang, in EMNLP-Finding, 2023.
Full Text Abstract BibTeX Details

Recent advancements in Large Language Models empower them to follow freeform instructions, including imitating generic or specific demographic personas in conversations. Generic personas refer to a demographic group (e.g. an Asian person), whereas specific personas can be actual names of historical figures. While the adoption of personas allows dialogue systems to be more engaging and approachable to users, it also carries the potential risk of exacerbating social biases in model responses, further causing societal harms through interactions with users. In this paper, we systematically study “persona biases”, which we define to be the sensitivity of harmful dialogue model behaviors to different persona adoptions.We categorize persona biases into biases in harmful expression and harmful agreement, as well as establish a comprehensive evaluation framework to measure persona biases in five aspects: Offensiveness, Toxic Continuation, Regard, Stereotype Agreement, and Toxic Agreement. Additionally, we propose to comprehensively investigate persona biases through experimenting with UniversalPersona, a systematized persona dataset with a comprehensive list of both generic and specific model personas. Through benchmarking on four different models, including Blender, ChatGPT, Alpaca, and Vicuna, our study uncovers significant persona biases in dialogue systems. Findings of our study underscores the immediate need to revisit the use of persona traits in dialogue agents to ensure their safe application.

@inproceedings{wan2023personalized,
  author = {Wan, Yixin and Zhao, Jieyu and Chadha, Aman and Peng, Nanyun and Chang, Kai-Wei},
  title = {Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems},
  booktitle = {EMNLP-Finding},
  year = {2023}
}

Details

Kelly is a Warm Person, Joseph is a Role Model: Gender Biases in LLM-Generated Reference Letters

Yixin Wan, George Pu, Jiao Sun, Aparna Garimella, Kai-Wei Chang, and Nanyun Peng, in EMNLP-Findings, 2023.
Full Text Abstract BibTeX Details

As generative language models advance, users have started to utilize Large Language Models (LLMs) to assist in writing various types of content, including professional documents such as recommendation letters. Despite their convenience, these applications introduce unprecedented fairness concerns. As generated reference letter might be directly utilized by users in professional or academic scenarios, it has the potential to cause direct harm such as lowering success rates for female applicants. Therefore, it is imminent and necessary to comprehensively study fairness issues and associated harms in such real-world use cases for future mitigation and monitoring. In this paper, we critically examine gender bias in LLM-generated reference letters. Inspired by findings in social science, we specifically design evaluation methods to manifest gender biases in LLM-generated letters through two dimensions: biases in language style and biases in lexical content. Furthermore, we investigate the extent of bias propagation by separately analyze bias amplification in model-hallucinated contents, which we define to be hallucination bias of model-generated documents. Through benchmarking evaluation on 4 popular LLMs, including ChatGPT, Alpaca, Vicuna and StableLM, our study reveal significant gender biases in LLM-generated recommendation letters. Our findings further point towards the importance and imminence to recognize bias in LLM-generated professional documents.

@inproceedings{wan2023kelly,
  title = {Kelly is a Warm Person, Joseph is a Role Model: Gender Biases in LLM-Generated Reference Letters},
  author = {Wan, Yixin and Pu, George and Sun, Jiao and Garimella, Aparna and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {EMNLP-Findings},
  year = {2023}
}

Details

How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions?

Hritik Bansal, Da Yin, Masoud Monajatipoor, and Kai-Wei Chang, in EMNLP (Short), 2022.
Full Text Code Abstract BibTeX Details

Text-to-image generative models have achieved unprecedented success in generating high-quality images based on natural language descriptions. However, it is shown that these models tend to favor specific social groups when prompted with neutral text descriptions (e.g., ’a photo of a lawyer’). Following Zhao et al. (2021), we study the effect on the diversity of the generated images when adding ethical intervention that supports equitable judgment (e.g., ’if all individuals can be a lawyer irrespective of their gender’) in the input prompts. To this end, we introduce an Ethical NaTural Language Interventions in Text-to-Image GENeration (ENTIGEN) benchmark dataset to evaluate the change in image generations conditional on ethical interventions across three social axes – gender, skin color, and culture. Through ENTIGEN framework, we find that the generations from minDALL.E, DALL.E-mini and Stable Diffusion cover diverse social groups while preserving the image quality. Preliminary studies indicate that a large change in the model predictions is triggered by certain phrases such as ’irrespective of gender’ in the context of gender bias in the ethical interventions. We release code and annotated data at https://github.com/Hritikbansal/entigen_emnlp.

@inproceedings{bansal2022how,
  title = {How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions?},
  author = {Bansal, Hritik and Yin, Da and Monajatipoor, Masoud and Chang, Kai-Wei},
  booktitle = {EMNLP (Short)},
  year = {2022}
}

Details

On the Intrinsic and Extrinsic Fairness Evaluation Metrics for Contextualized Language Representations

Yang Trista Cao, Yada Pruksachatkun, Kai-Wei Chang, Rahul Gupta, Varun Kumar, Jwala Dhamala, and Aram Galstyan, in ACL (short), 2022.
Full Text Abstract BibTeX Details

Multiple metrics have been introduced to measure fairness in various natural language processing tasks. These metrics can be roughly categorized into two categories: 1) \emphextrinsic metrics for evaluating fairness in downstream applications and 2) \emphintrinsic metrics for estimating fairness in upstream contextualized language representation models. In this paper, we conduct an extensive correlation study between intrinsic and extrinsic metrics across bias notions using 19 contextualized language models. We find that intrinsic and extrinsic metrics do not necessarily correlate in their original setting, even when correcting for metric misalignments, noise in evaluation datasets, and confounding factors such as experiment configuration for extrinsic metrics.

@inproceedings{trista2022evaluation,
  title = {On the Intrinsic and Extrinsic Fairness Evaluation Metrics for Contextualized Language Representations},
  author = {Cao, Yang Trista and Pruksachatkun, Yada and Chang, Kai-Wei and Gupta, Rahul and Kumar, Varun and Dhamala, Jwala and Galstyan, Aram},
  booktitle = {ACL (short)},
  year = {2022}
}

Details

Societal Biases in Language Generation: Progress and Challenges

Emily Sheng, Kai-Wei Chang, Prem Natarajan, and Nanyun Peng, in ACL, 2021.
Full Text Abstract BibTeX Details

Technology for language generation has advanced rapidly, spurred by advancements in pre-training large models on massive amounts of data and the need for intelligent agents to communicate in a natural manner. While techniques can effectively generate fluent text, they can also produce undesirable societal biases that can have a disproportionately negative impact on marginalized populations. Language generation presents unique challenges for biases in terms of direct user interaction and the structure of decoding techniques. To better understand these challenges, we present a survey on societal biases in language generation, focusing on how data and techniques contribute to biases and progress towards reducing biases. Motivated by a lack of studies on biases from decoding techniques, we also conduct experiments to quantify the effects of these techniques. By further discussing general trends and open challenges, we call to attention promising directions for research and the importance of fairness and inclusivity considerations for language generation applications.

@inproceedings{sheng2021societam,
  title = {Societal Biases in Language Generation: Progress and Challenges},
  author = {Sheng, Emily and Chang, Kai-Wei and Natarajan, Prem and Peng, Nanyun},
  booktitle = {ACL},
  year = {2021}
}

Details

"Nice Try, Kiddo": Investigating Ad Hominems in Dialogue Responses

Emily Sheng, Kai-Wei Chang, Prem Natarajan, and Nanyun Peng, in NAACL, 2021.
Full Text Video Code Abstract BibTeX Details

Ad hominem attacks are those that target some feature of a person’s character instead of the position the person is maintaining. These attacks are harmful because they propagate implicit biases and diminish a person’s credibility. Since dialogue systems respond directly to user input, it is important to study ad hominems in dialogue responses. To this end, we propose categories of ad hominems, compose an annotated dataset, and build a classifier to analyze human and dialogue system responses to English Twitter posts. We specifically compare responses to Twitter topics about marginalized communities (#BlackLivesMatter, #MeToo) versus other topics (#Vegan, #WFH), because the abusive language of ad hominems could further amplify the skew of power away from marginalized populations. Furthermore, we propose a constrained decoding technique that uses salient n-gram similarity as a soft constraint for top-k sampling to reduce the amount of ad hominems generated. Our results indicate that 1) responses from both humans and DialoGPT contain more ad hominems for discussions around marginalized communities, 2) different quantities of ad hominems in the training data can influence the likelihood of generating ad hominems, and 3) we can use constrained decoding techniques to reduce ad hominems in generated dialogue responses.

@inproceedings{sheng2021nice,
  title = {"Nice Try, Kiddo": Investigating Ad Hominems in Dialogue Responses},
  booktitle = {NAACL},
  author = {Sheng, Emily and Chang, Kai-Wei and Natarajan, Prem and Peng, Nanyun},
  presentation_id = {https://underline.io/events/122/sessions/4137/lecture/19854-%27nice-try,-kiddo%27-investigating-ad-hominems-in-dialogue-responses},
  year = {2021}
}

Details

BOLD: Dataset and metrics for measuring biases in open-ended language generation

Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta, in FAccT, 2021.
Full Text Code Abstract BibTeX Details

Recent advances in deep learning techniques have enabled machines to generate cohesive open-ended text when prompted with a sequence of words as context. While these models now empower many downstream applications from conversation bots to automatic storytelling, they have been shown to generate texts that exhibit social biases. To systematically study and benchmark social biases in open-ended language generation, we introduce the Bias in Open-Ended Language Generation Dataset (BOLD), a large-scale dataset that consists of 23,679 English text generation prompts for bias benchmarking across five domains: profession, gender, race, religion, and political ideology. We also propose new automated metrics for toxicity, psycholinguistic norms, and text gender polarity to measure social biases in open-ended text generation from multiple angles. An examination of text generated from three popular language models reveals that the majority of these models exhibit a larger social bias than human-written Wikipedia text across all domains. With these results we highlight the need to benchmark biases in open-ended language generation and caution users of language generation models on downstream tasks to be cognizant of these embedded prejudices.

@inproceedings{dhamala2021bold,
  author = {Dhamala, Jwala and Sun, Tony and Kumar, Varun and Krishna, Satyapriya and Pruksachatkun, Yada and Chang, Kai-Wei and Gupta, Rahul},
  title = {BOLD: Dataset and metrics for measuring biases in open-ended language generation},
  booktitle = {FAccT},
  year = {2021}
}

Details

Towards Controllable Biases in Language Generation

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng, in EMNLP-Finding, 2020.
Full Text Code Abstract BibTeX Details

We present a general approach towards controllable societal biases in natural language generation (NLG). Building upon the idea of adversarial triggers, we develop a method to induce societal biases in generated text when input prompts contain mentions of specific demographic groups. We then analyze two scenarios: 1) inducing negative biases for one demographic and positive biases for another demographic, and 2) equalizing biases between demographics. The former scenario enables us to detect the types of biases present in the model. Specifically, we show the effectiveness of our approach at facilitating bias analysis by finding topics that correspond to demographic inequalities in generated text and comparing the relative effectiveness of inducing biases for different demographics. The second scenario is useful for mitigating biases in downstream applications such as dialogue generation. In our experiments, the mitigation technique proves to be effective at equalizing the amount of biases across demographics while simultaneously generating less negatively biased text overall.

@inproceedings{sheng2020towards,
  title = {Towards Controllable Biases in Language Generation},
  author = {Sheng, Emily and Chang, Kai-Wei and Natarajan, Premkumar and Peng, Nanyun},
  booktitle = {EMNLP-Finding},
  year = {2020}
}

Details

The Woman Worked as a Babysitter: On Biases in Language Generation

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng, in EMNLP (short), 2019.
Full Text Slides Video Code Abstract BibTeX Details

We present a systematic study of biases in natural language generation (NLG) by analyzing text generated from prompts that contain mentions of different demographic groups. In this work, we introduce the notion of the regard towards a demographic, use the varying levels of regard towards different demographics as a defining metric for bias in NLG, and analyze the extent to which sentiment scores are a relevant proxy metric for regard. To this end, we collect strategically-generated text from language models and manually annotate the text with both sentiment and regard scores. Additionally, we build an automatic regard classifier through transfer learning, so that we can analyze biases in unseen text. Together, these methods reveal the extent of the biased nature of language model generations. Our analysis provides a study of biases in NLG, bias metrics and correlated human judgments, and empirical evidence on the usefulness of our annotated dataset.

@inproceedings{sheng2019woman,
  author = {Sheng, Emily and Chang, Kai-Wei and Natarajan, Premkumar and Peng, Nanyun},
  title = {The Woman Worked as a Babysitter: On Biases in Language Generation},
  booktitle = {EMNLP (short)},
  vimeo_id = {426366363},
  year = {2019}
}

Details

Adapting Coreference Resolution for Processing Violent Death Narratives

Ankith Uppunda, Susan Cochran, Jacob Foster, Alina Arseniev-Koehler, Vickie Mays, and Kai-Wei Chang, in NAACL (short), 2021.

QA Sessions: 13A-Oral: NLP Applications Paper link in the virtual conference

Full Text BibTeX Details

Coreference resolution is an important component in analyzing narrative text from administrative data (e.g., clinical or police sources). However, existing coreference models trained on general language corpora suffer from poor transferability due to domain gaps, especially when they are applied to gender-inclusive data with lesbian, gay, bisexual, and transgender (LGBT) individuals. In this paper, we analyzed the challenges of coreference resolution in an exemplary form of administrative text written in English: violent death narratives from the USA’s Centers for Disease Control’s (CDC) National Violent Death Reporting System. We developed a set of data augmentation rules to improve model performance using a probabilistic data programming framework. Experiments on narratives from an administrative database, as well as existing gender-inclusive coreference datasets, demonstrate the effectiveness of data augmentation in training coreference models that can better handle text data about LGBT individuals.

@inproceedings{uppunda2021adapting,
  title = {Adapting Coreference Resolution for Processing Violent Death Narratives},
  author = {Uppunda, Ankith and Cochran, Susan and Foster, Jacob and Arseniev-Koehler, Alina and Mays, Vickie and Chang, Kai-Wei},
  booktitle = {NAACL (short)},
  presentation_id = {https://underline.io/events/122/sessions/4249/lecture/19662-adapting-coreference-resolution-for-processing-violent-death-narratives},
  year = {2021}
}

Our #NAACL2021 paper demonstrates the challenges when applying NLP in analyzing narratives from USA’s CDC National Violent Death Reporting System. We showed that Coref suffers from poor transferability due to domain gaps, especially in narratives involved LGBT individuals 1/n pic.twitter.com/VNFy26f0CX
— Kai-Wei Chang (@kaiwei_chang) June 5, 2021

Identifying Distributional Perspective Differences from Colingual Groups

Yufei Tian, Tuhin Chakrabarty, Fred Morstatter, and Nanyun Peng, in NAACL 2021 Workshop of Social NLP, 2021.

QA Sessions: NINTH INTERNATIONAL WORKSHOP ON NATURAL LANGUAGE PROCESSING FOR SOCIAL MEDIA (SOCIALNLP 2021) Paper link in the virtual conference

Full Text Code BibTeX Details

Perspective differences exist among different cultures or languages. A lack of mutual understanding among different groups about their perspectives on specific values or events may lead to uninformed decisions or biased opinions. Automatically understanding the group perspectives can provide essential background for many downstream applications of natural language processing techniques. In this paper, we study colingual groups and use language corpora as a proxy to identify their distributional perspectives. We present a novel computational approach to learn shared understandings, and benchmark our method by building culturally-aware models for the English, Chinese, and Japanese languages. On a held out set of diverse topics including marriage, corruption, democracy, our model achieves high correlation with human judgements regarding intra-group values and inter-group differences.

@inproceedings{tian2021identifying,
  title = {Identifying Distributional Perspective Differences from Colingual Groups},
  author = {Tian, Yufei and Chakrabarty, Tuhin and Morstatter, Fred and Peng, Nanyun},
  booktitle = {NAACL 2021 Workshop of Social NLP},
  presentation_id = {https://underline.io/events/122/posters/4298/poster/20429-identifying-distributional-perspectives-from-colingual-groups},
  year = {2021}
}

COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences

Shikhar Singh, Nuan Wen, Yu Hou, Pegah Alipoormolabashi, Te-lin Wu, Xuezhe Ma, and Nanyun Peng, in ACL-Findings, 2021.
Full Text BibTeX Details

@inproceedings{sw2021com,
  title = {COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences},
  author = {Singh, Shikhar and Wen, Nuan and Hou, Yu and Alipoormolabashi, Pegah and Wu, Te-lin and Ma, Xuezhe and Peng, Nanyun},
  booktitle = {ACL-Findings},
  year = {2021}
}

Details

Identifying Distributional Perspective Differences from Colingual Groups

Yufei Tian, Tuhin Chakrabarty, Fred Morstatter, and Nanyun Peng, in NAACL 2021 Workshop of Social NLP, 2021.
Full Text Code Abstract BibTeX Details

Perspective differences exist among different cultures or languages. A lack of mutual understanding among different groups about their perspectives on specific values or events may lead to uninformed decisions or biased opinions. Automatically understanding the group perspectives can provide essential background for many downstream applications of natural language processing techniques. In this paper, we study colingual groups and use language corpora as a proxy to identify their distributional perspectives. We present a novel computational approach to learn shared understandings, and benchmark our method by building culturally-aware models for the English, Chinese, and Japanese languages. On a held out set of diverse topics including marriage, corruption, democracy, our model achieves high correlation with human judgements regarding intra-group values and inter-group differences.

@inproceedings{tian2021identifying,
  title = {Identifying Distributional Perspective Differences from Colingual Groups},
  author = {Tian, Yufei and Chakrabarty, Tuhin and Morstatter, Fred and Peng, Nanyun},
  booktitle = {NAACL 2021 Workshop of Social NLP},
  presentation_id = {https://underline.io/events/122/posters/4298/poster/20429-identifying-distributional-perspectives-from-colingual-groups},
  year = {2021}
}

Details

Language Generation

Plot-guided Adversarial Example Construction for Evaluating Open-domain Story Generation

Sarik Ghazarian, Zixi Liu, Akash S. M, Ralph Weischedel, Aram Galstyan, and Nanyun Peng, in The 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2021.

QA Sessions: 12D-ORAL: LANGUAGE RESOURCES AND EVALUATION Paper link in the virtual conference

Full Text Slides Code BibTeX Details

With the recent advances of open-domain story generation models, the lack of reliable automatic evaluation metrics becomes an increasingly imperative issue that hinders the development of such models. A critical bottleneck of obtaining a trustworthy learnable evaluation metric is the lack of high-quality training data for learning classifiers to efficiently distinguish between plausible and implausible machine-generated stories. Previous works relied on heuristically manipulate plausible examples to mimic possible system drawbacks such as repetition, contradiction, or irrelevant content in the text level, which can be unnatural and oversimplify the characteristics of implausible machine-generated stories. We propose to tackle these issues by generating a more comprehensive set of implausible stories using plots, which are structured representations of controllable factors used to generate stories.  Since these plots are compact and structured, it is easier to manipulate them to generate text with targeted undesirable properties, while at the same time maintain the naturalness of the generation. To improve the quality of incoherent stories, we further apply the adversarial filtering procedure to select a more nuanced set of implausible texts. We find that the evaluation metrics trained on our generated data result in more reliable automatic assessments that correlate remarkably better with human judgments than other baselines.

@inproceedings{ghazarian2021plot,
  title = {Plot-guided Adversarial Example Construction for Evaluating Open-domain Story Generation},
  author = {Ghazarian, Sarik and Liu, Zixi and M, Akash S and Weischedel, Ralph and Galstyan, Aram and Peng, Nanyun},
  booktitle = {The 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  publisher = {Association for Computational Linguistics},
  pages = {4334--4344},
  presentation_id = {https://underline.io/events/122/sessions/4241/lecture/19650-plot-guided-adversarial-example-construction-for-evaluating-open-domain-story-generation},
  year = {2021}
}

In our first paper in the title "Plot-guided Adversarial Example Construction for Evaluating Open-domain Story Generation", we tried to achieve a more accurate story plausibility evaluator by proposing a more comprehensive set of incoherent stories based on plot manipulations.
— Sarik (@Sarikgha) March 19, 2021

Related Publications

Plot-guided Adversarial Example Construction for Evaluating Open-domain Story Generation

With the recent advances of open-domain story generation models, the lack of reliable automatic evaluation metrics becomes an increasingly imperative issue that hinders the development of such models. A critical bottleneck of obtaining a trustworthy learnable evaluation metric is the lack of high-quality training data for learning classifiers to efficiently distinguish between plausible and implausible machine-generated stories. Previous works relied on heuristically manipulate plausible examples to mimic possible system drawbacks such as repetition, contradiction, or irrelevant content in the text level, which can be unnatural and oversimplify the characteristics of implausible machine-generated stories. We propose to tackle these issues by generating a more comprehensive set of implausible stories using plots, which are structured representations of controllable factors used to generate stories.  Since these plots are compact and structured, it is easier to manipulate them to generate text with targeted undesirable properties, while at the same time maintain the naturalness of the generation. To improve the quality of incoherent stories, we further apply the adversarial filtering procedure to select a more nuanced set of implausible texts. We find that the evaluation metrics trained on our generated data result in more reliable automatic assessments that correlate remarkably better with human judgments than other baselines.

@inproceedings{ghazarian2021plot,
  title = {Plot-guided Adversarial Example Construction for Evaluating Open-domain Story Generation},
  author = {Ghazarian, Sarik and Liu, Zixi and M, Akash S and Weischedel, Ralph and Galstyan, Aram and Peng, Nanyun},
  booktitle = {The 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  publisher = {Association for Computational Linguistics},
  pages = {4334--4344},
  presentation_id = {https://underline.io/events/122/sessions/4241/lecture/19650-plot-guided-adversarial-example-construction-for-evaluating-open-domain-story-generation},
  year = {2021}
}

Details

MERMAID: Metaphor Generation with Symbolism and Discriminative Decoding

Tuhin Chakrabarty, Xurui Zhang, Smaranda Muresan, and Nanyun Peng, in The 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2021.

QA Sessions: 12C-ORAL: LANGUAGE GENERATION Paper link in the virtual conference

Full Text Poster Code BibTeX Details

Generating metaphors is a challenging task as it requires a proper understanding of abstract concepts, making connections between unrelated concepts, and deviating from the literal meaning. In this paper, we aim to generate a metaphoric sentence given a literal expression by replacing relevant verbs. Based on a theoretically-grounded connection between metaphors and symbols, we propose a method to automatically construct a parallel corpus by transforming a large number of metaphorical sentences from the Gutenberg Poetry corpus (CITATION) to their literal counterpart using recent advances in masked language modeling coupled with commonsense inference. For the generation task, we incorporate a metaphor discriminator to guide the decoding of a sequence to sequence model fine-tuned on our parallel data to generate high-quality metaphors. Human evaluation on an independent test set of literal statements shows that our best model generates metaphors better than three well-crafted baselines 66% of the time on average. A task-based evaluation shows that human-written poems enhanced with metaphors proposed by our model are preferred 68% of the time compared to poems without metaphors.

@inproceedings{chakrabarty2021mermaid,
  title = {MERMAID: Metaphor Generation with Symbolism and Discriminative Decoding},
  author = {Chakrabarty, Tuhin and Zhang, Xurui and Muresan, Smaranda and Peng, Nanyun},
  booktitle = {The 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  presentation_id = {https://underline.io/events/122/sessions/4240/lecture/19642-mermaid-metaphor-generation-with-symbolism-and-discriminative-decoding},
  talk_url = {https://underline.io/events/122/sessions/4240/lecture/19642-mermaid-metaphor-generation-with-symbolism-and-discriminative-decoding},
  year = {2021}
}

🔥Metaphors are not to be trifled with🔥 Excited to share #NAACL2021 preprint titled “MERMAID: Metaphor Generation with Symbolism and Discriminative Decoding”https://t.co/gcnvn995vR . Joint work with my figurative NLG constants @VioletNPeng and Smaranda Muresan. #NLProc pic.twitter.com/3mJqlLV6j2
— Tuhin Chakrabarty (@TuhinChakr) March 12, 2021

Related Publications

Metaphor Generation with Conceptual Mappings

Kevin Stowe, Tuhin Chakrabarty, Nanyun Peng, Smaranda Muresan, and Iryna Gurevych, in ACL, 2021.
Full Text BibTeX Details

@inproceedings{stowe2021metaphor,
  title = {Metaphor Generation with Conceptual Mappings},
  author = {Stowe, Kevin and Chakrabarty, Tuhin and Peng, Nanyun and Muresan, Smaranda and Gurevych, Iryna},
  booktitle = {ACL},
  year = {2021}
}

Details

MERMAID: Metaphor Generation with Symbolism and Discriminative Decoding

Tuhin Chakrabarty, Xurui Zhang, Smaranda Muresan, and Nanyun Peng, in The 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2021.
Full Text Poster Code Abstract BibTeX Details

Generating metaphors is a challenging task as it requires a proper understanding of abstract concepts, making connections between unrelated concepts, and deviating from the literal meaning. In this paper, we aim to generate a metaphoric sentence given a literal expression by replacing relevant verbs. Based on a theoretically-grounded connection between metaphors and symbols, we propose a method to automatically construct a parallel corpus by transforming a large number of metaphorical sentences from the Gutenberg Poetry corpus (CITATION) to their literal counterpart using recent advances in masked language modeling coupled with commonsense inference. For the generation task, we incorporate a metaphor discriminator to guide the decoding of a sequence to sequence model fine-tuned on our parallel data to generate high-quality metaphors. Human evaluation on an independent test set of literal statements shows that our best model generates metaphors better than three well-crafted baselines 66% of the time on average. A task-based evaluation shows that human-written poems enhanced with metaphors proposed by our model are preferred 68% of the time compared to poems without metaphors.

@inproceedings{chakrabarty2021mermaid,
  title = {MERMAID: Metaphor Generation with Symbolism and Discriminative Decoding},
  author = {Chakrabarty, Tuhin and Zhang, Xurui and Muresan, Smaranda and Peng, Nanyun},
  booktitle = {The 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  presentation_id = {https://underline.io/events/122/sessions/4240/lecture/19642-mermaid-metaphor-generation-with-symbolism-and-discriminative-decoding},
  talk_url = {https://underline.io/events/122/sessions/4240/lecture/19642-mermaid-metaphor-generation-with-symbolism-and-discriminative-decoding},
  year = {2021}
}

Details

DiSCoL: Toward Engaging Dialogue Systems through Conversational Line Guided Response Generation

Sarik Ghazarian, Zixi Liu, Tuhin Chakrabarty, Xuezhe Ma, Aram Galstyan, and Nanyun Peng, 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Demonstrations Track, 2021.

QA Sessions: 10F-POSTER: SYSTEM DEMONSTRATIONS Paper link in the virtual conference

Full Text Code BibTeX Details

Having engaging and informative conversations with users is the utmost goal for open-domain conversational systems. Recent advances in transformer-based language models and their applications to dialogue systems have succeeded to generate fluent and human-like responses. However, they still lack control over the generation process towards producing contentful responses and achieving engaging conversations. To achieve this goal, we present DiSCoL (Dialogue Systems through Coversational Line guided response generation). DiSCoL is an open-domain dialogue system that leverages conversational lines (briefly convlines) as controllable and informative content-planning elements to guide the generation model produce engaging and informative responses. Two primary modules in DiSCoL’s pipeline are conditional generators trained for 1) predicting relevant and informative convlines for dialogue contexts and 2) generating high-quality responses conditioned on the predicted convlines. Users can also change the returned convlines to control the direction of the conversations towards topics that are more interesting for them. Through automatic and human evaluations, we demonstrate the efficiency of the convlines in producing engaging conversations.

@article{ghazarian2021discol,
  title = {DiSCoL: Toward Engaging Dialogue Systems through Conversational Line Guided Response Generation},
  author = {Ghazarian, Sarik and Liu, Zixi and Chakrabarty, Tuhin and Ma, Xuezhe and Galstyan, Aram and Peng, Nanyun},
  booktitle = {2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Demonstrations Track},
  presentation_id = {https://underline.io/events/122/posters/4227/poster/20579-discol-toward-engaging-dialogue-systems-through-conversational-line-guided-response-generation},
  pages = {26–34},
  publisher = {Association for Computational Linguistics},
  year = {2021}
}

In our demo system called DiSCoL (https://t.co/32JAfKtvmH), we presented an engaging conversational system that uses conversational lines to guide the response generation. Users also have the control to change the dialog toward their more favorite direction.
— Sarik (@Sarikgha) March 19, 2021

Related Publications

DiSCoL: Toward Engaging Dialogue Systems through Conversational Line Guided Response Generation

Having engaging and informative conversations with users is the utmost goal for open-domain conversational systems. Recent advances in transformer-based language models and their applications to dialogue systems have succeeded to generate fluent and human-like responses. However, they still lack control over the generation process towards producing contentful responses and achieving engaging conversations. To achieve this goal, we present DiSCoL (Dialogue Systems through Coversational Line guided response generation). DiSCoL is an open-domain dialogue system that leverages conversational lines (briefly convlines) as controllable and informative content-planning elements to guide the generation model produce engaging and informative responses. Two primary modules in DiSCoL’s pipeline are conditional generators trained for 1) predicting relevant and informative convlines for dialogue contexts and 2) generating high-quality responses conditioned on the predicted convlines. Users can also change the returned convlines to control the direction of the conversations towards topics that are more interesting for them. Through automatic and human evaluations, we demonstrate the efficiency of the convlines in producing engaging conversations.

@article{ghazarian2021discol,
  title = {DiSCoL: Toward Engaging Dialogue Systems through Conversational Line Guided Response Generation},
  author = {Ghazarian, Sarik and Liu, Zixi and Chakrabarty, Tuhin and Ma, Xuezhe and Galstyan, Aram and Peng, Nanyun},
  booktitle = {2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Demonstrations Track},
  presentation_id = {https://underline.io/events/122/posters/4227/poster/20579-discol-toward-engaging-dialogue-systems-through-conversational-line-guided-response-generation},
  pages = {26–34},
  publisher = {Association for Computational Linguistics},
  year = {2021}
}

Details

NLP Model Evaluation and Interpretation

Evaluating the Values of Sources in Transfer Learning

Md Rizwan Parvez and Kai-Wei Chang, in NAACL, 2021.

QA Sessions: 14C-ORAL: INTERPRETABILITY AND ANALYSIS OF MODELS FOR NLP Paper link in the virtual conference

Full Text Code BibTeX Details

Transfer learning that adapts a model trained on data-rich sources to low-resource targets has been widely applied in natural language processing (NLP). However, when training a transfer model over multiple sources, not every source is equally useful for the target. To better transfer a model, it is essential to understand the values of the sources. In this paper, we develop SEAL-Shap, an efficient source valuation framework for quantifying the usefulness of the sources (e.g., domains/languages) in transfer learning based on the Shapley value method. Experiments and comprehensive analyses on both cross-domain and cross-lingual transfers demonstrate that our framework is not only effective in choosing useful transfer sources but also the source values match the intuitive source-target similarity.

@inproceedings{parvez2021evaluating,
  title = {Evaluating the Values of Sources in Transfer Learning},
  author = {Parvez, Md Rizwan and Chang, Kai-Wei},
  booktitle = {NAACL},
  presentation_id = {https://underline.io/events/122/sessions/4261/lecture/19707-evaluating-the-values-of-sources-in-transfer-learning},
  year = {2021}
}

When performing transfer learning with multiple sources, one key question is how much info one can leverage from each source. In #NAACL2021 paper, Rizwan Parvez @uclanlp developed SEAL-SHAP, an efficient source valuation framework for quantifying the usefulness of the sources 1/n pic.twitter.com/5qmAG7a1q7
— Kai-Wei Chang (@kaiwei_chang) June 5, 2021

Related Publications

Multilingual Generative Language Models for Zero-Shot Cross-Lingual Event Argument Extraction

Kuan-Hao Huang, I.-Hung Hsu, Prem Natarajan, Kai-Wei Chang, and Nanyun Peng, in ACL, 2022.
Full Text Code Abstract BibTeX Details

We present a study on leveraging multilingual pre-trained generative language models for zero-shot cross-lingual event argument extraction (EAE). By formulating EAE as a language generation task, our method effectively encodes event structures and captures the dependencies between arguments. We design language-agnostic templates to represent the event argument structures, which are compatible with any language, hence facilitating the cross-lingual transfer. Our proposed model finetunes multilingual pre-trained generative language models to generate sentences that fill in the language-agnostic template with arguments extracted from the input passage. The model is trained on source languages and is then directly applied to target languages for event argument extraction. Experiments demonstrate that the proposed model outperforms the current state-of-the-art models on zero-shot cross-lingual EAE. Comprehensive studies and error analyses are presented to better understand the advantages and the current limitations of using generative language models for zero-shot cross-lingual transfer EAE.

@inproceedings{huang2022multilingual,
  title = {Multilingual Generative Language Models for Zero-Shot Cross-Lingual Event Argument Extraction},
  author = {Huang, Kuan-Hao and Hsu, I-Hung and Natarajan, Prem and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {ACL},
  year = {2022}
}

Details

Improving Zero-Shot Cross-Lingual Transfer Learning via Robust Training

Kuan-Hao Huang, Wasi Ahmad, Nanyun Peng, and Kai-Wei Chang, in EMNLP, 2021.
Full Text Code Abstract BibTeX Details

Pre-trained multilingual language encoders, such as multilingual BERT and XLM-R, show great potential for zero-shot cross-lingual transfer. However, these multilingual encoders do not precisely align words and phrases across languages. Especially, learning alignments in the multilingual embedding space usually requires sentence-level or word-level parallel corpora, which are expensive to be obtained for low-resource languages. An alternative is to make the multilingual encoders more robust; when fine-tuning the encoder using downstream task, we train the encoder to tolerate noise in the contextual embedding spaces such that even if the representations of different languages are not aligned well, the model can still achieve good performance on zero-shot cross-lingual transfer. In this work, we propose a learning strategy for training robust models by drawing connections between adversarial examples and the failure cases of zero-shot cross-lingual transfer. We adopt two widely used robust training methods, adversarial training and randomized smoothing, to train the desired robust model. The experimental results demonstrate that robust training improves zero-shot cross-lingual transfer on text classification tasks. The improvement is more significant in the generalized cross-lingual transfer setting, where the pair of input sentences belong to two different languages.

@inproceedings{huang2021improving,
  title = {Improving Zero-Shot Cross-Lingual Transfer Learning via Robust Training},
  author = {Huang, Kuan-Hao and Ahmad, Wasi and Peng, Nanyun and Chang, Kai-Wei},
  presentation_id = {https://underline.io/events/192/posters/7783/poster/40656-improving-zero-shot-cross-lingual-transfer-learning-via-robust-training},
  booktitle = {EMNLP},
  year = {2021}
}

Details

Syntax-augmented Multilingual BERT for Cross-lingual Transfer

Wasi Ahmad, Haoran Li, Kai-Wei Chang, and Yashar Mehdad, in ACL, 2021.
Full Text Video Code Abstract BibTeX Details

In recent years, we have seen a colossal effort
in pre-training multilingual text encoders using large-scale corpora in many languages to
facilitate cross-lingual transfer learning. However, due to typological differences across languages, the cross-lingual transfer is challenging. Nevertheless, language syntax, e.g., syntactic dependencies, can bridge the typological gap. Previous works have shown that pretrained multilingual encoders, such as mBERT
(Devlin et al., 2019), capture language syntax, helping cross-lingual transfer. This work
shows that explicitly providing language syntax and training mBERT using an auxiliary
objective to encode the universal dependency
tree structure helps cross-lingual transfer. We
perform rigorous experiments on four NLP
tasks, including text classification, question answering, named entity recognition, and taskoriented semantic parsing. The experiment results show that syntax-augmented mBERT improves cross-lingual transfer on popular benchmarks, such as PAWS-X and MLQA, by 1.4
and 1.6 points on average across all languages.
In the generalized transfer setting, the performance boosted significantly, with 3.9 and 3.1
points on average in PAWS-X and MLQA.

@inproceedings{ahmad2021syntax,
  title = {Syntax-augmented Multilingual BERT for Cross-lingual Transfer},
  author = {Ahmad, Wasi and Li, Haoran and Chang, Kai-Wei and Mehdad, Yashar},
  booktitle = {ACL},
  year = {2021}
}

Details

Evaluating the Values of Sources in Transfer Learning

Md Rizwan Parvez and Kai-Wei Chang, in NAACL, 2021.
Full Text Video Code Abstract BibTeX Details

Transfer learning that adapts a model trained on data-rich sources to low-resource targets has been widely applied in natural language processing (NLP). However, when training a transfer model over multiple sources, not every source is equally useful for the target. To better transfer a model, it is essential to understand the values of the sources. In this paper, we develop SEAL-Shap, an efficient source valuation framework for quantifying the usefulness of the sources (e.g., domains/languages) in transfer learning based on the Shapley value method. Experiments and comprehensive analyses on both cross-domain and cross-lingual transfers demonstrate that our framework is not only effective in choosing useful transfer sources but also the source values match the intuitive source-target similarity.

@inproceedings{parvez2021evaluating,
  title = {Evaluating the Values of Sources in Transfer Learning},
  author = {Parvez, Md Rizwan and Chang, Kai-Wei},
  booktitle = {NAACL},
  presentation_id = {https://underline.io/events/122/sessions/4261/lecture/19707-evaluating-the-values-of-sources-in-transfer-learning},
  year = {2021}
}

Details

GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction

Wasi Ahmad, Nanyun Peng, and Kai-Wei Chang, in AAAI, 2021.
Full Text Code Abstract BibTeX Details

Prevalent approaches in cross-lingual relation and event extraction use graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic representations such that models trained on one language can be applied to other languages. However, GCNs lack in modeling long-range dependencies or disconnected words in the dependency tree. To address this challenge, we propose to utilize the self-attention mechanism where we explicitly fuse structural information to learn the dependencies between words at different syntactic distances. We introduce GATE, a \bf Graph \bf Attention \bf Transformer \bf Encoder, and test its cross-lingual transferability on relation and event extraction tasks. We perform rigorous experiments on the widely used ACE05 dataset that includes three typologically different languages: English, Chinese, and Arabic. The evaluation results show that GATE outperforms three recently proposed methods by a large margin. Our detailed analysis reveals that due to the reliance on syntactic dependencies, GATE produces robust representations that facilitate transfer across languages.

@inproceedings{ahmad2021gate,
  author = {Ahmad, Wasi and Peng, Nanyun and Chang, Kai-Wei},
  title = {GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction},
  booktitle = {AAAI},
  year = {2021}
}

Details

Cross-Lingual Dependency Parsing by POS-Guided Word Reordering

Lu Liu, Yi Zhou, Jianhan Xu, Xiaoqing Zheng, Kai-Wei Chang, and Xuanjing Huang, in EMNLP-Finding, 2020.
Full Text Abstract BibTeX Details

We propose a novel approach to cross-lingual dependency parsing based on word reordering. The words in each sentence of a source language corpus are rearranged to meet the word order in a target language under the guidance of a part-of-speech based language model (LM). To obtain the highest reordering score under the LM, a population-based optimization algorithm and its genetic operators are designed to deal with the combinatorial nature of such word reordering. A parser trained on the reordered corpus then can be used to parse sentences in the target language. We demonstrate through extensive experimentation that our approach achieves better or comparable results across 25 target languages (1.73% increase in average), and outperforms a baseline by a significant margin on the languages that are greatly different from the source one. For example, when transferring the English parser to Hindi and Latin, our approach outperforms the baseline by 15.3% and 6.7% respectively.

@inproceedings{liu2020cross-lingual,
  author = {Liu, Lu and Zhou, Yi and Xu, Jianhan and Zheng, Xiaoqing and Chang, Kai-Wei and Huang, Xuanjing},
  title = {Cross-Lingual Dependency Parsing by POS-Guided Word Reordering},
  booktitle = {EMNLP-Finding},
  year = {2020}
}

Details

Cross-lingual Dependency Parsing with Unlabeled Auxiliary Languages

Wasi Ahmad, Zhisong Zhang, Xuezhe Ma, Kai-Wei Chang, and Nanyun Peng, in CoNLL, 2019.
Full Text Poster Code Abstract BibTeX Details

Cross-lingual transfer learning has become an important weapon to battle the unavailability of annotated resources for low-resource languages.  One of the fundamental techniques to transfer across languages is learning language-agnostic representations, in the form of word embeddings or contextual encodings. In this work, we propose to leverage unannotated sentences from auxiliary languages to help learning language-agnostic representations  Specifically, we explore adversarial training for learning contextual encoders that produce invariant representations across languages to facilitate cross-lingual transfer. We conduct experiments on cross-lingual dependency parsing where we train a dependency parser on a source language and transfer it to a wide range of target languages.  Experiments on 28 target languages demonstrate that adversarial training significantly improves the overall transfer performances under several different settings.  We conduct a careful analysis to evaluate the language-agnostic representations resulted from adversarial training.

@inproceedings{ahmad2019crosslingual,
  author = {Ahmad, Wasi and Zhang, Zhisong and Ma, Xuezhe and Chang, Kai-Wei and Peng, Nanyun},
  title = {  Cross-lingual Dependency Parsing with Unlabeled Auxiliary Languages},
  booktitle = {CoNLL},
  year = {2019}
}

Details

Target Language-Aware Constrained Inference for Cross-lingual Dependency Parsing

Tao Meng, Nanyun Peng, and Kai-Wei Chang, in EMNLP, 2019.
Full Text Poster Code Abstract BibTeX Details

Prior work on cross-lingual dependency parsing often focuses on capturing the commonalities between source and target languages and overlooks the potential of leveraging linguistic properties of the languages to facilitate the transfer. In this paper, we show that weak supervisions of linguistic knowledge for the target languages can improve a cross-lingual graph-based dependency parser substantially. Specifically, we explore several types of corpus linguistic statistics and compile them into corpus-wise constraints to guide the inference process during the test time. We adapt two techniques, Lagrangian relaxation and posterior regularization, to conduct inference with corpus-statistics constraints. Experiments show that the Lagrangian relaxation and posterior regularization inference improve the performances on 15 and 17 out of 19 target languages, respectively. The improvements are especially significant for target languages that have different word order features from the source language.

@inproceedings{meng2019target,
  author = {Meng, Tao and Peng, Nanyun and Chang, Kai-Wei},
  title = {Target Language-Aware Constrained Inference for Cross-lingual Dependency Parsing},
  booktitle = {EMNLP},
  year = {2019}
}

Details

On Difficulties of Cross-Lingual Transfer with Order Differences: A Case Study on Dependency Parsing

Wasi Uddin Ahmad, Zhisong Zhang, Xuezhe Ma, Eduard Hovy, Kai-Wei Chang, and Nanyun Peng, in NAACL, 2019.
Full Text Video Code Abstract BibTeX Details

Different languages might have different wordorders. In this paper, we investigate cross-lingual transfer and posit that an order-agnostic model will perform better when trans-ferring to distant foreign languages. To test ourhypothesis, we train dependency parsers on anEnglish corpus and evaluate their transfer per-formance on 30 other languages. Specifically,we compare encoders and decoders based onRecurrent Neural Networks (RNNs) and mod-ified self-attentive architectures. The formerrelies on sequential information while the lat-ter is more flexible at modeling word order.Rigorous experiments and detailed analysisshows that RNN-based architectures transferwell to languages that are close to English,while self-attentive models have better overallcross-lingual transferability and perform espe-cially well on distant languages.

@inproceedings{ahmad2019difficulties,
  author = {Ahmad, Wasi Uddin and Zhang, Zhisong and Ma, Xuezhe and Hovy, Eduard and Chang, Kai-Wei and Peng, Nanyun},
  title = {On Difficulties of Cross-Lingual Transfer with Order Differences: A Case Study on Dependency Parsing},
  booktitle = {NAACL},
  year = {2019}
}

Details

Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation

Chong Zhang, Jieyu Zhao, Huan Zhang, Kai-Wei Chang, and Cho-Jui Hsieh, in NAACL, 2021.

QA Sessions: 11B-ORAL: INTERPRETABILITY AND ANALYSIS OF MODELS FOR NLP Paper link in the virtual conference

Full Text Code BibTeX Details

Robustness and counterfactual bias are usually evaluated on a test dataset. However, are these evaluations robust? If the test dataset is perturbed slightly, will the evaluation results keep the same? In this paper, we propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset. The framework first perturbs the test dataset to construct abundant natural sentences similar to the test data, and then diagnoses the prediction change regarding a single-word substitution. We apply this framework to study two perturbation-based approaches that are used to analyze models’ robustness and counterfactual bias in English. (1) For robustness, we focus on synonym substitutions and identify vulnerable examples where prediction can be altered. Our proposed attack attains high success rates (96.0%-99.8%) in finding vulnerable examples on both original and robustly trained CNNs and Transformers. (2) For counterfactual bias, we focus on substituting demographic tokens (e.g., gender, race) and measure the shift of the expected prediction among constructed sentences. Our method is able to reveal the hidden model biases not directly shown in the test dataset.

@inproceedings{zhang2021double,
  title = {	Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation},
  booktitle = {NAACL},
  author = {Zhang, Chong and Zhao, Jieyu and Zhang, Huan and Chang, Kai-Wei and Hsieh, Cho-Jui},
  year = {2021},
  presentation_id = {https://underline.io/events/122/sessions/4229/lecture/19609-double-perturbation-on-the-robustness-of-robustness-and-counterfactual-bias-evaluation}
}

Prior studies often test model robustness by applying semantic-invariant perturbation on a given test set. In our #NAACL2021 “Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation”, we propose a new framework for robustness verification. 1/n pic.twitter.com/h4V1dKhYXL
— Jieyu Zhao (@jieyuzhao11) June 5, 2021

Related Publications

VideoCon: Robust video-language alignment via contrast captions

Hritik Bansal, Yonatan Bitton, Idan Szpektor, Kai-Wei Chang, and Aditya Grover, in CVPR, 2024.
Full Text Code Demo Abstract BibTeX Details Best paper at DPFM workshop at ICLR

Despite being (pre)trained on a massive amount of data, state-of-the-art video-language alignment models are not robust to semantically-plausible contrastive changes in the video captions. Our work addresses this by identifying a broad spectrum of contrast misalignments, such as replacing entities, actions, and flipping event order, which alignment models should be robust against. To this end, we introduce the VideoCon, a video-language alignment dataset constructed by a large language model that generates plausible contrast video captions and explanations for differences between original and contrast video captions. Then, a generative video-language model is finetuned with VideoCon to assess video-language entailment and generate explanations. Our VideoCon-based alignment model significantly outperforms current models. It exhibits a 12-point increase in AUC for the video-language alignment task on human-generated contrast captions. Finally, our model sets new state of the art zero-shot performance in temporally-extensive video-language tasks such as text-to-video retrieval (SSv2-Temporal) and video question answering (ATP-Hard). Moreover, our model shows superior performance on novel videos and human-crafted captions and explanations.

@inproceedings{bansal2023videocon,
  author = {Bansal, Hritik and Bitton, Yonatan and Szpektor, Idan and Chang, Kai-Wei and Grover, Aditya},
  title = {VideoCon: Robust video-language alignment via contrast captions},
  booktitle = {CVPR},
  year = {2024}
}

Details

Red Teaming Language Model Detectors with Language Models

Zhouxing Shi, Yihan Wang, Fan Yin, Xiangning Chen, Kai-Wei Chang, and Cho-Jui Hsieh, in TACL, 2023.
Full Text Code Abstract BibTeX Details

The prevalence and high capacity of large language models (LLMs) present significant safety and ethical risks when malicious users exploit them for automated content generation. To prevent the potentially deceptive usage of LLMs, recent works have proposed several algorithms to detect machine-generated text. In this paper, we systematically test the reliability of the existing detectors, by designing two types of attack strategies to fool the detectors: 1) replacing words with their synonyms based on the context; 2) altering the writing style of generated text. These strategies are implemented by instructing LLMs to generate synonymous word substitutions or writing directives that modify the style without human involvement, and the LLMs leveraged in the attack can also be protected by detectors. Our research reveals that our attacks effectively compromise the performance of all tested detectors, thereby underscoring the urgent need for the development of more robust machine-generated text detection systems.

@inproceedings{shi2023red,
  author = {Shi, Zhouxing and Wang, Yihan and Yin, Fan and Chen, Xiangning and Chang, Kai-Wei and Hsieh, Cho-Jui},
  title = {Red Teaming Language Model Detectors with Language Models},
  booktitle = {TACL},
  year = {2023}
}

Details

CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning

Hritik Bansal, Nishad Singhi, Yu Yang, Fan Yin, Aditya Grover, and Kai-Wei Chang, in ICCV, 2023.
Full Text Code Abstract BibTeX Details Best Paper Award at ICLR Workshop, Oral at ICCV (195 out of 8088 submissions, top 2.5%)

Multimodal contrastive pretraining has been used to train multimodal representation models, such as CLIP, on large amounts of paired image-text data. However, previous studies have revealed that such models are vulnerable to backdoor attacks. Specifically, when trained on backdoored examples, CLIP learns spurious correlations between the embedded backdoor trigger and the target label, aligning their representations in the joint embedding space. Injecting even a small number of poisoned examples, such as 75 examples in 3 million pretraining data, can significantly manipulate the model’s behavior, making it difficult to detect or unlearn such correlations. To address this issue, we propose CleanCLIP, a finetuning framework that weakens the learned spurious associations introduced by backdoor attacks by independently re-aligning the representations for individual modalities. We demonstrate that unsupervised finetuning using a combination of multimodal contrastive and unimodal self-supervised objectives for individual modalities can significantly reduce the impact of the backdoor attack. We show empirically that CleanCLIP maintains model performance on benign examples while erasing a range of backdoor attacks on multimodal contrastive learning.

@inproceedings{bansal2023cleanclip,
  author = {Bansal, Hritik and Singhi, Nishad and Yang, Yu and Yin, Fan and Grover, Aditya and Chang, Kai-Wei},
  title = {CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning},
  booktitle = {ICCV},
  year = {2023}
}

Details

ADDMU: Detection of Far-Boundary Adversarial Examples with Data and Model Uncertainty Estimation

Fan Yin, Yao Li, Cho-Jui Hsieh, and Kai-Wei Chang, in EMNLP, 2022.
Full Text Abstract BibTeX Details

Adversarial Examples Detection (AED) is a crucial defense technique against adversarial attacks and has drawn increasing attention from the Natural Language Processing (NLP) community. Despite the surge of new AED methods, our studies show that existing methods heavily rely on a shortcut to achieve good performance. In other words, current search-based adversarial attacks in NLP stop once model predictions change, and thus most adversarial examples generated by those attacks are located near model decision boundaries. To surpass this shortcut and fairly evaluate AED methods, we propose to test AED methods with Far Boundary (FB) adversarial examples. Existing methods show worse than random guess performance under this scenario. To overcome this limitation, we propose a new technique, ADDMU, adversary detection with data and model uncertainty, which combines two types of uncertainty estimation for both regular and FB adversarial example detection. Our new method outperforms previous methods by 3.6 and 6.0 AUC points under each scenario. Finally, our analysis shows that the two types of uncertainty provided by ADDMU can be leveraged to characterize adversarial examples and identify the ones that contribute most to model’s robustness in adversarial training.

@inproceedings{yin2022addmu,
  title = {ADDMU: Detection of Far-Boundary Adversarial Examples with Data and Model Uncertainty Estimation},
  author = {Yin, Fan and Li, Yao and Hsieh, Cho-Jui and Chang, Kai-Wei},
  booktitle = {EMNLP},
  year = {2022}
}

Details

Investigating Ensemble Methods for Model Robustness Improvement of Text Classifiers

Jieyu Zhao, Xuezhi Wang, Yao Qin, Jilin Chen, and Kai-Wei Chang, in EMNLP-Finding (short), 2022.
Full Text BibTeX Details

@inproceedings{zhao2022investigating,
  title = {	Investigating Ensemble Methods for Model Robustness Improvement of Text Classifiers},
  author = {Zhao, Jieyu and Wang, Xuezhi and Qin, Yao and Chen, Jilin and Chang, Kai-Wei},
  booktitle = {EMNLP-Finding (short)},
  year = {2022}
}

Details

Unsupervised Syntactically Controlled Paraphrase Generation with Abstract Meaning Representations

Kuan-Hao Huang, Varun Iyer, Anoop Kumar, Sriram Venkatapathy, Kai-Wei Chang, and Aram Galstyan, in EMNLP-Finding (short), 2022.
Full Text BibTeX Details

@inproceedings{huang2022unsupervised,
  title = {Unsupervised Syntactically Controlled Paraphrase Generation with Abstract Meaning Representations},
  author = {Huang, Kuan-Hao and Iyer, Varun and Kumar, Anoop and Venkatapathy, Sriram and Chang, Kai-Wei and Galstyan, Aram},
  booktitle = {EMNLP-Finding (short)},
  year = {2022}
}

Details

Improving the Adversarial Robustness of NLP Models by Information Bottleneck

Cenyuan Zhang, Xiang Zhou, Yixin Wan, Xiaoqing Zheng, Kai-Wei Chang, and Cho-Jui Hsieh, in ACL-Finding, 2022.
Full Text Abstract BibTeX Details

Existing studies have demonstrated that adversarial examples can be directly attributed to the presence of non-robust features, which are highly predictive, but can be easily manipulated by adversaries to fool NLP models. In this study, we explore the feasibility of capturing task-specific robust features, while eliminating the non-robust ones by using the information bottleneck theory. Through extensive experiments, we show that the models trained with our information bottleneck-based method are able to achieve a significant improvement in robust accuracy, exceeding performances of all the previously reported defense methods while suffering almost no performance drop in clean accuracy on SST-2, AGNEWS and IMDB datasets.

@inproceedings{zhang2022improving,
  title = {Improving the Adversarial Robustness of NLP Models by Information Bottleneck},
  author = {Zhang, Cenyuan and Zhou, Xiang and Wan, Yixin and Zheng, Xiaoqing and Chang, Kai-Wei and Hsieh, Cho-Jui},
  booktitle = {ACL-Finding},
  year = {2022}
}

Details

Searching for an Effiective Defender: Benchmarking Defense against Adversarial Word Substitution

Zongyi Li, Jianhan Xu, Jiehang Zeng, Linyang Li, Xiaoqing Zheng, Qi Zhang, Kai-Wei Chang, and Cho-Jui Hsieh, in EMNLP, 2021.
Full Text Abstract BibTeX Details

Recent studies have shown that deep neural networks are vulnerable to intentionally crafted adversarial examples, and various methods have been proposed to defend against adversarial word-substitution attacks for neural NLP models. However, there is a lack of systematic study on comparing different defense approaches under the same attacking setting. In this paper, we seek to fill the gap of systematic studies through comprehensive researches on understanding the behavior of neural text classifiers trained by various defense methods under representative adversarial attacks. In addition, we propose an effective method to further improve the robustness of neural text classifiers against such attacks and achieved the highest accuracy on both clean and adversarial examples on AGNEWS and IMDB datasets by a significant margin.

@inproceedings{li2021searching,
  title = {Searching for an Effiective Defender: Benchmarking Defense against Adversarial Word Substitution},
  author = {Li, Zongyi and Xu, Jianhan and Zeng, Jiehang and Li, Linyang and Zheng, Xiaoqing and Zhang, Qi and Chang, Kai-Wei and Hsieh, Cho-Jui},
  presentation_id = {https://underline.io/events/192/posters/8225/poster/38025-searching-for-an-effective-defender-benchmarking-defense-against-adversarial-word-substitution},
  booktitle = {EMNLP},
  year = {2021}
}

Details

On the Transferability of Adversarial Attacks against Neural Text Classifier

Liping Yuan, Xiaoqing Zheng, Yi Zhou, Cho-Jui Hsieh, and Kai-Wei Chang, in EMNLP, 2021.
Full Text Abstract BibTeX Details

Deep neural networks are vulnerable to adversarial attacks, where a small perturbation to an input alters the model prediction. In many cases, malicious inputs intentionally crafted for one model can fool another model. In this paper, we present the first study to systematically investigate the transferability of adversarial examples for text classification models and explore how various factors, including network architecture, tokenization scheme, word embedding, and model capacity, affect the transferability of adversarial examples. Based on these studies, we propose a genetic algorithm to find an ensemble of models that can be used to induce adversarial examples to fool almost all existing models. Such adversarial examples reflect the defects of the learning process and the data bias in the training set. Finally, we derive word replacement rules that can be used for model diagnostics from these adversarial examples.

@inproceedings{yuan2021on,
  title = {On the Transferability of Adversarial Attacks against Neural Text Classifier},
  author = {Yuan, Liping and Zheng, Xiaoqing and Zhou, Yi and Hsieh, Cho-Jui and Chang, Kai-Wei},
  presentation_id = {https://underline.io/events/192/posters/8223/poster/38067-on-the-transferability-of-adversarial-attacks-against-neural-text-classifier},
  booktitle = {EMNLP},
  year = {2021}
}

Details

Defense against Synonym Substitution-based Adversarial Attacks via Dirichlet Neighborhood Ensemble

Yi Zhou, Xiaoqing Zheng, Cho-Jui Hsieh, Kai-Wei Chang, and Xuanjing Huang, in ACL, 2021.
Full Text Code Abstract BibTeX Details

Although deep neural networks have achieved prominent performance on many NLP tasks, they are vulnerable to adversarial examples. We propose Dirichlet Neighborhood Ensemble (DNE), a randomized method for training a robust model to defense synonym substitutionbased attacks. During training, DNE forms virtual sentences by sampling embedding vectors for each word in an input sentence from a convex hull spanned by the word and its synonyms, and it augments them with the training data. In such a way, the model is robust to adversarial attacks while maintaining the performance on the original clean data. DNE is agnostic to the network architectures and scales to large models (e.g., BERT) for NLP applications. Through extensive experimentation, we demonstrate that our method consistently outperforms recently proposed defense methods by a significant margin across different network architectures and multiple data sets.

@inproceedings{zhou2021defense,
  title = {Defense against Synonym Substitution-based Adversarial Attacks via Dirichlet Neighborhood Ensemble},
  author = {Zhou, Yi and Zheng, Xiaoqing and Hsieh, Cho-Jui and Chang, Kai-Wei and Huang, Xuanjing},
  booktitle = {ACL},
  year = {2021}
}

Details

Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation

Chong Zhang, Jieyu Zhao, Huan Zhang, Kai-Wei Chang, and Cho-Jui Hsieh, in NAACL, 2021.
Full Text Video Code Abstract BibTeX Details

Robustness and counterfactual bias are usually evaluated on a test dataset. However, are these evaluations robust? If the test dataset is perturbed slightly, will the evaluation results keep the same? In this paper, we propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset. The framework first perturbs the test dataset to construct abundant natural sentences similar to the test data, and then diagnoses the prediction change regarding a single-word substitution. We apply this framework to study two perturbation-based approaches that are used to analyze models’ robustness and counterfactual bias in English. (1) For robustness, we focus on synonym substitutions and identify vulnerable examples where prediction can be altered. Our proposed attack attains high success rates (96.0%-99.8%) in finding vulnerable examples on both original and robustly trained CNNs and Transformers. (2) For counterfactual bias, we focus on substituting demographic tokens (e.g., gender, race) and measure the shift of the expected prediction among constructed sentences. Our method is able to reveal the hidden model biases not directly shown in the test dataset.

@inproceedings{zhang2021double,
  title = {	Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation},
  booktitle = {NAACL},
  author = {Zhang, Chong and Zhao, Jieyu and Zhang, Huan and Chang, Kai-Wei and Hsieh, Cho-Jui},
  year = {2021},
  presentation_id = {https://underline.io/events/122/sessions/4229/lecture/19609-double-perturbation-on-the-robustness-of-robustness-and-counterfactual-bias-evaluation}
}

Details

Provable, Scalable and Automatic Perturbation Analysis on General Computational Graphs

Kaidi Xu, Zhouxing Shi, Huan Zhang, Yihan Wang, Kai-Wei Chang, Minlie Huang, Bhavya Kailkhura, Xue Lin, and Cho-Jui Hsieh, in NeurIPS, 2020.
Full Text Code Abstract BibTeX Details

Linear relaxation based perturbation analysis (LiRPA) for neural networks, which computes provable linear bounds of output neurons given a certain amount of input perturbation, has become a core component in robustness verification and certified defense. The majority of LiRPA-based methods only consider simple feed-forward networks and it needs particular manual derivations and implementations when extended to other architectures. In this paper, we develop an automatic framework to enable perturbation analysis on any neural network structures, by generalizing exiting LiRPA algorithms such as CROWN to operate on general computational graphs. The flexibility, differentiability and ease of use of our framework allow us to obtain state-of-the-art results on LiRPA based certified defense on fairly complicated networks like DenseNet, ResNeXt and Transformer that are not supported by prior work. Our framework also enables loss fusion, a technique that significantly reduces the computational complexity of LiRPA for certified defense. For the first time, we demonstrate LiRPA based certified defense on Tiny ImageNet and Downscaled ImageNet where previous approaches cannot scale to due to the relatively large number of classes. Our work also yields an open-source library for the community to apply LiRPA to areas beyond certified defense without much LiRPA expertise, e.g., we create a neural network with a provably flat optimization landscape. Our open source library is available at https://github.com/KaidiXu/auto_LiRPA

@inproceedings{xu2020provable,
  author = {Xu, Kaidi and Shi, Zhouxing and Zhang, Huan and Wang, Yihan and Chang, Kai-Wei and Huang, Minlie and Kailkhura, Bhavya and Lin, Xue and Hsieh, Cho-Jui},
  title = {Provable, Scalable and Automatic Perturbation Analysis on General Computational Graphs},
  booktitle = {NeurIPS},
  year = {2020}
}

Details

On the Robustness of Language Encoders against Grammatical Errors

Fan Yin, Quanyu Long, Tao Meng, and Kai-Wei Chang, in ACL, 2020.
Full Text Slides Video Code Abstract BibTeX Details

We conduct a thorough study to diagnose the behaviors of pre-trained language encoders (ELMo, BERT, and RoBERTa) when confronted with natural grammatical errors. Specifically, we collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data. We use this approach to facilitate debugging models on downstream applications. Results confirm that the performance of all tested models is affected but the degree of impact varies. To interpret model behaviors, we further design a linguistic acceptability task to reveal their abilities in identifying ungrammatical sentences and the position of errors. We find that fixed contextual encoders with a simple classifier trained on the prediction of sentence correctness are able to locate error positions. We also design a cloze test for BERT and discover that BERT captures the interaction between errors and specific tokens in context. Our results shed light on understanding the robustness and behaviors of language encoders against grammatical errors.

@inproceedings{yin2020robustness,
  author = {Yin, Fan and Long, Quanyu and Meng, Tao and Chang, Kai-Wei},
  title = {On the Robustness of Language Encoders against Grammatical Errors},
  booktitle = {ACL},
  presentation_id = {https://virtual.acl2020.org/paper_main.310.html},
  year = {2020}
}

Details

Robustness Verification for Transformers

Zhouxing Shi, Huan Zhang, Kai-Wei Chang, Minlie Huang, and Cho-Jui Hsieh, in ICLR, 2020.
Full Text Video Code Abstract BibTeX Details

Robustness verification that aims to formally certify the prediction behavior of
neural networks has become an important tool for understanding the behavior of
a given model and for obtaining safety guarantees. However, previous methods
are usually limited to relatively simple neural networks. In this paper, we consider the robustness verification problem for Transformers. Transformers have
complex self-attention layers that pose many challenges for verification, including
cross-nonlinearity and cross-position dependency, which have not been discussed
in previous work. We resolve these challenges and develop the first verification
algorithm for Transformers. The certified robustness bounds computed by our
method are significantly tighter than those by naive Interval Bound Propagation.
These bounds also shed light on interpreting Transformers as they consistently
reflect the importance of words in sentiment analysis.

@inproceedings{shi2020robustness,
  author = {Shi, Zhouxing and Zhang, Huan and Chang, Kai-Wei and Huang, Minlie and Hsieh, Cho-Jui},
  title = {Robustness Verification for Transformers},
  booktitle = {ICLR},
  year = {2020}
}

Details

Learning to Discriminate Perturbations for Blocking Adversarial Attacks in Text Classification

Yichao Zhou, Jyun-Yu Jiang, Kai-Wei Chang, and Wei Wang, in EMNLP, 2019.
Full Text Code Abstract BibTeX Details

Adversarial attacks against machine learning models have threatened various real-world applications such as spam filtering and sentiment analysis. In this paper, we propose a novel framework, learning to DIScriminate Perturbations (DISP), to identify and adjust malicious perturbations, thereby blocking adversarial attacks for text classification models. To identify adversarial attacks, a perturbation discriminator validates how likely a token in the text is perturbed and provides a set of potential perturbations. For each potential perturbation, an embedding estimator learns to restore the embedding of the original word based on the context and a replacement token is chosen based on approximate kNN search. DISP can block adversarial attacks for any NLP model without modifying the model structure or training procedure. Extensive experiments on two benchmark datasets demonstrate that DISP significantly outperforms baseline methods in blocking adversarial attacks for text classification. In addition, in-depth analysis shows the robustness of DISP across different situations.

@inproceedings{zhou2019learning,
  author = {Zhou, Yichao and Jiang, Jyun-Yu and Chang, Kai-Wei and Wang, Wei},
  title = {Learning to Discriminate Perturbations for Blocking Adversarial Attacks in Text Classification},
  booktitle = {EMNLP},
  year = {2019}
}

Details

Retrofitting Contextualized Word Embeddings with Paraphrases

Weijia Shi, Muhao Chen, Pei Zhou, and Kai-Wei Chang, in EMNLP (short), 2019.
Full Text Slides Video Code Abstract BibTeX Details

Contextualized word embedding models, such as ELMo, generate meaningful representations of words and their context. These models have been shown to have a great impact on downstream applications. However, in many cases, the contextualized embedding of a word changes drastically when the context is paraphrased. As a result, the downstream model is not robust to paraphrasing and other linguistic variations. To enhance the stability of contextualized word embedding models, we propose an approach to retrofitting contextualized embedding models with paraphrase contexts. Our method learns an orthogonal transformation on the input space, which seeks to minimize the variance of word representations on paraphrased contexts. Experiments show that the retrofitted model significantly outperforms the original ELMo on various sentence classification and language inference tasks.

@inproceedings{shi2019retrofitting,
  author = {Shi, Weijia and Chen, Muhao and Zhou, Pei and Chang, Kai-Wei},
  title = {Retrofitting Contextualized Word Embeddings with Paraphrases},
  booktitle = {EMNLP (short)},
  vimeo_id = {430797636},
  year = {2019}
}

Details

Generating Natural Language Adversarial Examples

Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang, in EMNLP (short), 2018.
Full Text Code Abstract BibTeX Details Top-10 cited paper at EMNLP 18

Deep neural networks (DNNs) are vulnerable to adversarial examples, perturbations to correctly classified examples which can cause the network to misclassify. In the image domain, these perturbations can often be made virtually indistinguishable to human perception, causing humans and state-of-the-art models to disagree. However, in the natural language domain, small perturbations are clearly perceptible, and the replacement of a single word can drastically alter the semantics of the document. Given these challenges, we use a population-based optimization algorithm to generate semantically and syntactically similar adversarial examples. We demonstrate via a human study that 94.3% of the generated examples are classified to the original label by human evaluators, and that the examples are perceptibly quite similar. We hope our findings encourage researchers to pursue improving the robustness of DNNs in the natural language domain.

@inproceedings{alzanto2018generating,
  author = {Alzantot, Moustafa and Sharma, Yash and Elgohary, Ahmed and Ho, Bo-Jhang and Srivastava, Mani and Chang, Kai-Wei},
  title = {Generating Natural Language Adversarial Examples},
  booktitle = {EMNLP (short)},
  year = {2018}
}

Details

Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions

Liunian Harold Li, Haoxuan You, Zhecan Wang, Alireza Zareian, Shih-Fu Chang, and Kai-Wei Chang, in NAACL, 2021.

QA Sessions: 15A-ORAL: LANGUAGE GROUNDING TO VISION, ROBOTICS AND BEYOND Paper link in the virtual conference

Full Text BibTeX Details

Pre-trained contextual vision-and-language (V&L) models have brought impressive performance improvement on various benchmarks. However, the paired text-image data required for pre-training are hard to collect and scale up. We investigate if a strong V&L representation model can be learned without text-image pairs. We propose Weakly-supervised VisualBERT with the key idea of conducting "mask-and-predict" pre-training on language-only and image-only corpora. Additionally, we introduce the object tags detected by an object recognition model as anchor points to bridge two modalities. Evaluation on four V&L benchmarks shows that Weakly-supervised VisualBERT achieves similar performance with a model pre-trained with paired data. Besides, pre-training on more image-only data further improves a model that already has access to aligned data, suggesting the possibility of utilizing billions of raw images available to enhance V&L models.

@inproceedings{li2021unsupervised,
  author = {Li, Liunian Harold and You, Haoxuan and Wang, Zhecan and Zareian, Alireza and Chang, Shih-Fu and Chang, Kai-Wei},
  title = {Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions},
  booktitle = {NAACL},
  presentation_id = {https://underline.io/events/122/sessions/4269/lecture/19725-unsupervised-vision-and-language-pre-training-without-parallel-images-and-captions},
  year = {2021}
}

Excited to share our NAACL paper Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions! https://t.co/R248NcGH3b

We show that one could pre-train a V&L model on unaligned images and text with competitive performance as models trained on aligned data. pic.twitter.com/7TrKAMxL6a
— Liunian Harold Li (@LiLiunian) April 16, 2021

Related Publications

Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning

Da Yin, Liunian Harold Li, Ziniu Hu, Nanyun Peng, and Kai-Wei Chang, in EMNLP, 2021.
Full Text Code Abstract BibTeX Details

Commonsense is defined as the knowledge that is shared by everyone. However, certain types of commonsense knowledge are correlated with culture and geographic locations and they are only shared locally. For example, the scenarios of wedding ceremonies vary across regions due to different customs influenced by historical and religious factors. Such regional characteristics, however, are generally omitted in prior work. In this paper, we construct a Geo-Diverse Visual Commonsense Reasoning dataset (GD-VCR) to test vision-and-language models’ ability to understand cultural and geo-location-specific commonsense. In particular, we study two state-of-the-art Vision-and-Language models, VisualBERT and ViLBERT trained on VCR, a standard multimodal commonsense benchmark with images primarily from Western regions. We then evaluate how well the trained models can generalize to answering the questions in GD-VCR. We find that the performance of both models for non-Western regions including East Asia, South Asia, and Africa is significantly lower than that for Western region. We analyze the reasons behind the performance disparity and find that the performance gap is larger on QA pairs that: 1) are concerned with culture-related scenarios, e.g., weddings, religious activities, and festivals; 2) require high-level geo-diverse commonsense reasoning rather than low-order perception and recognition.

@inproceedings{yin2021broaden,
  title = {	Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning},
  author = {Yin, Da and Li, Liunian Harold and Hu, Ziniu and Peng, Nanyun and Chang, Kai-Wei},
  booktitle = {EMNLP},
  presentation_id = {https://underline.io/events/192/sessions/7790/lecture/37514-broaden-the-vision-geo-diverse-visual-commonsense-reasoning},
  year = {2021}
}

Details

Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions

Liunian Harold Li, Haoxuan You, Zhecan Wang, Alireza Zareian, Shih-Fu Chang, and Kai-Wei Chang, in NAACL, 2021.
Full Text Video Abstract BibTeX Details

Pre-trained contextual vision-and-language (V&L) models have brought impressive performance improvement on various benchmarks. However, the paired text-image data required for pre-training are hard to collect and scale up. We investigate if a strong V&L representation model can be learned without text-image pairs. We propose Weakly-supervised VisualBERT with the key idea of conducting "mask-and-predict" pre-training on language-only and image-only corpora. Additionally, we introduce the object tags detected by an object recognition model as anchor points to bridge two modalities. Evaluation on four V&L benchmarks shows that Weakly-supervised VisualBERT achieves similar performance with a model pre-trained with paired data. Besides, pre-training on more image-only data further improves a model that already has access to aligned data, suggesting the possibility of utilizing billions of raw images available to enhance V&L models.

@inproceedings{li2021unsupervised,
  author = {Li, Liunian Harold and You, Haoxuan and Wang, Zhecan and Zareian, Alireza and Chang, Shih-Fu and Chang, Kai-Wei},
  title = {Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions},
  booktitle = {NAACL},
  presentation_id = {https://underline.io/events/122/sessions/4269/lecture/19725-unsupervised-vision-and-language-pre-training-without-parallel-images-and-captions},
  year = {2021}
}

Details

What Does BERT with Vision Look At?

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang, in ACL (short), 2020.
Full Text Slides Video Code Abstract BibTeX Details

Pre-trained visually grounded language models such as ViLBERT, LXMERT, and UNITER have achieved significant performance improvement on vision-and-language tasks but what they learn during pre-training remains unclear. In this work, we demonstrate that certain attention heads of a visually grounded language model actively ground elements of language to image regions. Specifically, some heads can map entities to image regions, performing the task known as entity grounding. Some heads can even detect the syntactic relations between non-entity words and image regions, tracking, for example, associations between verbs and regions corresponding to their arguments. We denote this ability as \emphsyntactic grounding. We verify grounding both quantitatively and qualitatively, using Flickr30K Entities as a testbed.

@inproceedings{li2020what,
  author = {Li, Liunian Harold and Yatskar, Mark and Yin, Da and Hsieh, Cho-Jui and Chang, Kai-Wei},
  title = {What Does BERT with Vision Look At?},
  booktitle = {ACL (short)},
  presentation_id = {https://virtual.acl2020.org/paper_main.469.html},
  year = {2020}
}

Details

VisualBERT: A Simple and Performant Baseline for Vision and Language

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang, in Arxiv, 2019.
Full Text Code Abstract BibTeX Details

We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks. VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an associated input image with self-attention. We further propose two visually-grounded language model objectives for pre-training VisualBERT on image caption data. Experiments on four vision-and-language tasks including VQA, VCR, NLVR2, and Flickr30K show that VisualBERT outperforms or rivals with state-of-the-art models while being significantly simpler. Further analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.

@inproceedings{li2019visualbert,
  author = {Li, Liunian Harold and Yatskar, Mark and Yin, Da and Hsieh, Cho-Jui and Chang, Kai-Wei},
  title = {VisualBERT: A Simple and Performant Baseline for Vision and Language},
  booktitle = {Arxiv},
  year = {2019}
}

Details

Unified Pre-training for Program Understanding and Generation

Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang, in NAACL, 2021.

QA Sessions: 8A-ORAL: MACHINE LEARNING FOR NLP: LANGUAGE MODELING AND SEQUENCE TO SEQUENCE MODELS Paper link in the virtual conference

Full Text Code BibTeX Details Top-10 cited paper at NAACL 21

Code summarization nd generation empower conversion between programming language (PL) and natural language (NL), while code translation avails the migration of legacy code from one PL to another. This paper introduces PLBART, a sequence-to-sequence model capable of performing a broad spectrum of program and language understanding and generation tasks. PLBART is pre-trained on an extensive collection of Java and Python functions and associated NL text via denoising autoencoding. Experiments on code summarization in the English language, code generation, and code translation in seven programming languages show that PLBART outperforms or rivals state-of-the-art models. Moreover, experiments on discriminative tasks, e.g., program repair, clone detection, and vulnerable code detection, demonstrate PLBART’s effectiveness in program understanding. Furthermore, analysis reveals that PLBART learns program syntax, style (e.g., identifier naming convention), logical flow (e.g., if block inside an else block is equivalent to else if block) that are crucial to program semantics and thus excels even with limited annotations.

@inproceedings{ahmad2021unified,
  title = {Unified Pre-training for Program Understanding and Generation},
  author = {Ahmad, Wasi and Chakraborty, Saikat and Ray, Baishakhi and Chang, Kai-Wei},
  booktitle = {NAACL},
  presentation_id = {https://underline.io/events/122/sessions/4197/lecture/20024-unified-pre-training-for-program-understanding-and-generation},
  year = {2021}
}

De-noising pretraining excels for dual modeling of programming language (e.g., source code) + natural language (e.g., code comment). See our new @NAACLHLT paper https://t.co/YrLFIJE1RH. Thanks to awesome collaborations by Wasi Ahmed, Saikat Chakraborty, @kaiwei_chang
.
— Baishakhi Ray (@baishakhir) March 13, 2021

Related Publications

AVATAR: A Parallel Corpus for Java-Python Program Translation

Wasi Ahmad, Md Golam Rahman Tushar, Saikat Chakraborty, and Kai-Wei Chang, in ACL-Finding (short), 2023.
Full Text Code Abstract BibTeX Details

Program translation refers to migrating source code from one programming language to another. It has a tremendous practical value in software development as porting software across different languages is time-consuming and costly. Automating program translation is of paramount importance in software migration, and recently researchers explored unsupervised approaches due to the unavailability of parallel corpora. However, the availability of pre-trained language models for programming languages enable supervised fine-tuning with a small amount of labeled examples. In this work, we present a corpus of 8,475 programming problems and their solutions written in two popular languages, Java and Python. We collect the dataset from competitive programming sites, online platforms, and open source repositories. We present several baselines, including models trained from scratch or pre-trained on large-scale source code collection and fine-tuned on our proposed dataset. Experiment results show that while the models perform relatively well in terms of the lexical match, they lack in generating code that is accurate in terms of syntax and data-flow match.

@inproceedings{ahmad2021avatar,
  title = {AVATAR: A Parallel Corpus for Java-Python Program Translation},
  author = {Ahmad, Wasi and Tushar, Md Golam Rahman and Chakraborty, Saikat and Chang, Kai-Wei},
  booktitle = {ACL-Finding (short)},
  year = {2023}
}

Details

Retrieval Augmented Code Generation and Summarization

Md Rizwan Parvez, Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang, in EMNLP-Finding, 2021.
Full Text Abstract BibTeX Details

Software developers write a lot of source code and documentation during software development. Intrinsically, developers often recall parts of source code or code summaries that they had written in the past while implementing software or documenting them. To mimic developers’ code or summary generation behavior, we propose a retrieval augmented framework, \tool, that retrieves relevant code or summaries from a retrieval database and provides them as a supplement to code generation or summarization models. \tool has a couple of uniqueness. First, it extends the state-of-the-art dense retrieval technique to search for relevant code or summaries. Second, it can work with retrieval databases that include unimodal (only code or natural language description) or bimodal instances (code-description pairs). We conduct experiments and extensive analysis on two benchmark datasets of code generation and summarization in Java and Python, and the promising results endorse the effectiveness of our proposed retrieval augmented framework.

@inproceedings{parvez2021retrieval,
  title = {Retrieval Augmented Code Generation and Summarization},
  author = {Parvez, Md Rizwan and Ahmad, Wasi and Chakraborty, Saikat and Ray, Baishakhi and Chang, Kai-Wei},
  booktitle = {EMNLP-Finding},
  presentation_id = {https://underline.io/events/192/sessions/7923/lecture/38314-retrieval-augmented-code-generation-and-summarization},
  year = {2021}
}

Details

Unified Pre-training for Program Understanding and Generation

Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang, in NAACL, 2021.
Full Text Video Code Abstract BibTeX Details Top-10 cited paper at NAACL 21

Code summarization nd generation empower conversion between programming language (PL) and natural language (NL), while code translation avails the migration of legacy code from one PL to another. This paper introduces PLBART, a sequence-to-sequence model capable of performing a broad spectrum of program and language understanding and generation tasks. PLBART is pre-trained on an extensive collection of Java and Python functions and associated NL text via denoising autoencoding. Experiments on code summarization in the English language, code generation, and code translation in seven programming languages show that PLBART outperforms or rivals state-of-the-art models. Moreover, experiments on discriminative tasks, e.g., program repair, clone detection, and vulnerable code detection, demonstrate PLBART’s effectiveness in program understanding. Furthermore, analysis reveals that PLBART learns program syntax, style (e.g., identifier naming convention), logical flow (e.g., if block inside an else block is equivalent to else if block) that are crucial to program semantics and thus excels even with limited annotations.

@inproceedings{ahmad2021unified,
  title = {Unified Pre-training for Program Understanding and Generation},
  author = {Ahmad, Wasi and Chakraborty, Saikat and Ray, Baishakhi and Chang, Kai-Wei},
  booktitle = {NAACL},
  presentation_id = {https://underline.io/events/122/sessions/4197/lecture/20024-unified-pre-training-for-program-understanding-and-generation},
  year = {2021}
}

Details

Disentangling Semantics and Syntax in Sentence Embeddings with Pre-trained Language Models

James Y. Huang, Kuan-Hao Huang, and Kai-Wei Chang, in NAACL (short), 2021.

QA Sessions: 4C-ORAL: SEMANTICS: SENTENCE-LEVEL SEMANTICS AND TEXTUAL INFERENCE Paper link in the virtual conference

Full Text Code BibTeX Details

Pre-trained language models have achieved huge success on a wide range of NLP tasks. However, contextual representations from pre-trained models contain entangled semantic and syntactic information, and therefore cannot be directly used to derive useful semantic sentence embeddings for some tasks. Paraphrase pairs offer an effective way of learning the distinction between semantics and syntax, as they naturally share semantics and often vary in syntax. In this work, we present ParaBART, a semantic sentence embedding model that learns to disentangle semantics and syntax in sentence embeddings obtained by pre-trained language models. ParaBART is trained to perform syntax-guided paraphrasing, based on a source sentence that shares semantics with the target paraphrase, and a parse tree that specifies the target syntax. In this way, ParaBART learns disentangled semantic and syntactic representations from their respective inputs with separate encoders. Experiments in English show that ParaBART outperforms state-of-the-art sentence embedding models on unsupervised semantic similarity tasks. Additionally, we show that our approach can effectively remove syntactic information from semantic sentence embeddings, leading to better robustness against syntactic variation on downstream semantic tasks.

@inproceedings{huang2021disentangling,
  title = {Disentangling Semantics and Syntax in Sentence Embeddings with Pre-trained Language Models},
  author = {Huang, James Y. and Huang, Kuan-Hao and Chang, Kai-Wei},
  booktitle = {NAACL (short)},
  presentation_id = {https://underline.io/events/122/sessions/4151/lecture/19910-disentangling-semantics-and-syntax-in-sentence-embeddings-with-pre-trained-language-models},
  year = {2021}
}

Check out our #NAACL2021 paper on semantic sentence embeddings! By disentangling the semantics and the syntax of sentences, our ParaBART achieves better performance on semantic textual similarity tasks. (https://t.co/QspSh8W2XJ w/ James Huang and @kaiwei_chang) [1/2] #UCLANLP pic.twitter.com/XzgSmN0353
— Kuan-Hao Huang (@kuanhao_) April 15, 2021

Related Publications

Details

Event Extraction

EventPlus: A Temporal Event Understanding Pipeline

Mingyu Derek Ma, Jiao Sun, Mu Yang, Kung-Hsiang Huang, Nuan Wen, Shikhar Singh, Rujun Han, and Nanyun Peng, in 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Demonstrations Track, 2021.

QA Sessions: 10F-POSTER: SYSTEM DEMONSTRATIONS Paper link in the virtual conference

Full Text Slides Poster Code BibTeX Details

We present EventPlus, a temporal event understanding pipeline that integrates various state-of-the-art event understanding components including event trigger and type detection, event argument detection, event duration and temporal relation extraction. Event information, especially event temporal knowledge, is a type of common sense knowledge that helps people understand how stories evolve and provides predictive hints for future events. EventPlus as the first comprehensive temporal event understanding pipeline provides a convenient tool for users to quickly obtain annotations about events and their temporal information for any user-provided document. Furthermore, we show EventPlus can be easily adapted to other domains (e.g., biomedical domain). We make EventPlus publicly available to facilitate event-related information extraction and downstream applications.

@inproceedings{ma2021eventplus,
  title = {EventPlus: A Temporal Event Understanding Pipeline},
  author = {Ma, Mingyu Derek and Sun, Jiao and Yang, Mu and Huang, Kung-Hsiang and Wen, Nuan and Singh, Shikhar and Han, Rujun and Peng, Nanyun},
  booktitle = {2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Demonstrations Track},
  presentation_id = {https://underline.io/events/122/posters/4227/poster/20582-eventplus-a-temporal-event-understanding-pipeline},
  year = {2021}
}

Related Publications

ESTER: A Machine Reading Comprehension Dataset for Event Semantic Relation Reasoning

Rujun Han, I.-Hung Hsu, Jiao Sun, Julia Baylon, Qiang Ning, Dan Roth, and Nanyun Peng, in EMNLP, 2021.
Full Text Code BibTeX Details

@inproceedings{han2021ester,
  title = {ESTER: A Machine Reading Comprehension Dataset for Event Semantic Relation Reasoning},
  author = {Han, Rujun and Hsu, I-Hung and Sun, Jiao and Baylon, Julia and Ning, Qiang and Roth, Dan and Peng, Nanyun},
  booktitle = {EMNLP},
  presentation_id = {https://underline.io/events/192/sessions/7816/lecture/37869-ester-a-machine-reading-comprehension-dataset-for-reasoning-about-event-semantic-relations},
  year = {2021}
}

Details

ECONET: Effective Continual Pretraining of Language Models for Event Temporal Reasoning

Rujun Han, Xiang Ren, and Nanyun Peng, in EMNLP, 2021.
Full Text Code BibTeX Details

@inproceedings{han2021econet,
  title = {ECONET: Effective Continual Pretraining of Language Models for Event Temporal Reasoning},
  author = {Han, Rujun and Ren, Xiang and Peng, Nanyun},
  booktitle = {EMNLP},
  presentation_id = {https://underline.io/events/192/posters/8243/poster/37875-econet-effective-continual-pretraining-of-language-models-for-event-temporal-reasoning},
  year = {2021}
}

Details

EventPlus: A Temporal Event Understanding Pipeline

We present EventPlus, a temporal event understanding pipeline that integrates various state-of-the-art event understanding components including event trigger and type detection, event argument detection, event duration and temporal relation extraction. Event information, especially event temporal knowledge, is a type of common sense knowledge that helps people understand how stories evolve and provides predictive hints for future events. EventPlus as the first comprehensive temporal event understanding pipeline provides a convenient tool for users to quickly obtain annotations about events and their temporal information for any user-provided document. Furthermore, we show EventPlus can be easily adapted to other domains (e.g., biomedical domain). We make EventPlus publicly available to facilitate event-related information extraction and downstream applications.

@inproceedings{ma2021eventplus,
  title = {EventPlus: A Temporal Event Understanding Pipeline},
  author = {Ma, Mingyu Derek and Sun, Jiao and Yang, Mu and Huang, Kung-Hsiang and Wen, Nuan and Singh, Shikhar and Han, Rujun and Peng, Nanyun},
  booktitle = {2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Demonstrations Track},
  presentation_id = {https://underline.io/events/122/posters/4227/poster/20582-eventplus-a-temporal-event-understanding-pipeline},
  year = {2021}
}

Details

Document-level Event Extraction with Efficient End-to-end Learning of Cross-event Dependencies

Kung-Hsiang Huang and Nanyun Peng, in The 3rd Workshop on Narrative Understanding (NAACL 2021), 2021.

QA Sessions: THE THIRD WORKSHOP ON NARRATIVE UNDERSTANDING Paper link in the virtual conference

Full Text BibTeX Details

Fully understanding narratives often requires identifying events in the context of whole documents and modeling the event relations. However, document-level event extraction is a challenging task as it requires the extraction of event and entity coreference, and capturing arguments that span across different sentences. Existing works on event extraction usually confine on extracting events from single sentences, which fail to capture the relationships between the event mentions at the scale of a document, as well as the event arguments that appear in a different sentence than the event trigger. In this paper, we propose an end-to-end model leveraging Deep Value Networks (DVN), a structured prediction algorithm, to efficiently capture cross-event dependencies for document-level event extraction. Experimental results show that our approach achieves comparable performance to CRF-based models on ACE05, while enjoys significantly higher computational efficiency.

@inproceedings{huang2021document,
  title = {Document-level Event Extraction with Efficient End-to-end Learning of Cross-event Dependencies},
  author = {Huang, Kung-Hsiang and Peng, Nanyun},
  booktitle = {The 3rd Workshop on Narrative Understanding (NAACL 2021)},
  presentation_id = {https://underline.io/events/122/posters/4309/poster/20541-document-level-event-extraction-with-efficient-end-to-end-learning-of-cross-event-dependencies},
  year = {2021}
}

Related Publications

Document-level Event Extraction with Efficient End-to-end Learning of Cross-event Dependencies

Kung-Hsiang Huang and Nanyun Peng, in The 3rd Workshop on Narrative Understanding (NAACL 2021), 2021.
Full Text Abstract BibTeX Details

Fully understanding narratives often requires identifying events in the context of whole documents and modeling the event relations. However, document-level event extraction is a challenging task as it requires the extraction of event and entity coreference, and capturing arguments that span across different sentences. Existing works on event extraction usually confine on extracting events from single sentences, which fail to capture the relationships between the event mentions at the scale of a document, as well as the event arguments that appear in a different sentence than the event trigger. In this paper, we propose an end-to-end model leveraging Deep Value Networks (DVN), a structured prediction algorithm, to efficiently capture cross-event dependencies for document-level event extraction. Experimental results show that our approach achieves comparable performance to CRF-based models on ACE05, while enjoys significantly higher computational efficiency.

@inproceedings{huang2021document,
  title = {Document-level Event Extraction with Efficient End-to-end Learning of Cross-event Dependencies},
  author = {Huang, Kung-Hsiang and Peng, Nanyun},
  booktitle = {The 3rd Workshop on Narrative Understanding (NAACL 2021)},
  presentation_id = {https://underline.io/events/122/posters/4309/poster/20541-document-level-event-extraction-with-efficient-end-to-end-learning-of-cross-event-dependencies},
  year = {2021}
}

Details

Fairness and Social NLP

Language Generation

NLP Model Evaluation and Interpretation

(Multi-Modal) Representation Learning

Event Extraction