Publications

Preprint

2026

Learning Structured Reasoning via Tractable Trajectory Control

Po-Nien Kung, Zhen Yang, Jeffrey Luo, Cheng-Fu Yang, Haikang Deng, Zi-Yi Dou, Yinfei Yang, Nanyun Peng, Zhe Gan, and Kai-Wei Chang, in ICML, 2026.

Spotlight (536/23,918, top 2.2%)

Large language models can exhibit emergent reasoning behaviors, often manifested as recurring lexical patterns (e.g., "wait," indicating verification). However, complex reasoning trajectories remain sparse in unconstrained sampling, and standard RL often fails to guarantee the acquisition of diverse reasoning behaviors. We propose a systematic discovery and reinforcement of diverse reasoning patterns through structured reasoning, a paradigm that requires targeted exploration of specific reasoning patterns during the RL process. To this end, we propose Ctrl-R, a framework for learning structured reasoning via tractable trajectory control that actively guides the rollout process, incentivizing the exploration of diverse reasoning patterns that are critical for complex problem-solving. The resulting behavior policy enables accurate importance-sampling estimation, supporting unbiased on-policy optimization. We further introduce a power-scaling factor on the importance-sampling weights, allowing the policy to selectively learn from exploratory, out-of-distribution trajectories while maintaining stable optimization. Experiments demonstrate that Ctrl-R enables effective exploration and internalization of previously unattainable reasoning patterns, yielding consistent improvements across language and vision-language models on mathematical reasoning tasks.
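As a rough illustration of the power-scaling idea mentioned in the abstract above, the sketch below raises per-trajectory importance-sampling ratios to a power. The function name `power_scaled_weights`, the parameter `alpha`, its range of [0, 1], and the use of trajectory log-probabilities as inputs are all assumptions made for illustration; they are not details taken from the paper.

```python
import math

def power_scaled_weights(logp_target, logp_behavior, alpha=0.5):
    """Return (pi / mu) ** alpha for each trajectory, computed in log space.

    Hypothetical sketch: alpha = 1 gives the ordinary importance ratio,
    alpha = 0 gives a constant weight of 1 for every trajectory.
    """
    return [
        math.exp(alpha * (lt - lb))
        for lt, lb in zip(logp_target, logp_behavior)
    ]

# Two toy trajectories: the second is much less likely under the target policy.
weights = power_scaled_weights([-2.0, -5.0], [-2.0, -3.0], alpha=0.5)
```

With `alpha` below 1, ratios far from 1 are pulled toward 1, which is one plausible way to read "selectively learn from exploratory, out-of-distribution trajectories while maintaining stable optimization".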

@inproceedings{kung2026structured,
  title = {Learning Structured Reasoning via Tractable Trajectory Control},
  author = {Kung, Po-Nien and Yang, Zhen and Luo, Jeffrey and Yang, Cheng-Fu and Deng, Haikang and Dou, Zi-Yi and Yang, Yinfei and Peng, Nanyun and Gan, Zhe and Chang, Kai-Wei},
  booktitle = {ICML},
  year = {2026}
}

VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation

Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, and Kai-Wei Chang, in ICLR, 2026.

Best paper at WorldModel workshop in ICML 2025

Large-scale video generative models, capable of creating realistic videos of diverse visual concepts, are strong candidates for general-purpose physical world simulators. However, their adherence to physical commonsense across real-world actions remains unclear (e.g., playing tennis, backflip). Existing benchmarks suffer from limitations such as limited size, lack of human evaluation, sim-to-real gaps, and absence of fine-grained physical rule analysis. To address this, we introduce VideoPhy-2, an action-centric dataset for evaluating physical commonsense in generated videos. We curate 200 diverse actions and detailed prompts for video synthesis from modern generative models. We perform human evaluation that assesses semantic adherence, physical commonsense, and grounding of physical rules in the generated videos. Our findings reveal major shortcomings, with even the best model achieving only 22% joint performance (i.e., high semantic and physical commonsense adherence) on the hard subset of VideoPhy-2. We find that the models particularly struggle with conservation laws like mass and momentum. Finally, we also train VideoPhy-AutoEval, an automatic evaluator for fast, reliable assessment on our dataset. Overall, VideoPhy-2 serves as a rigorous benchmark, exposing critical gaps in video generative models and guiding future research in physically-grounded video generation.

@inproceedings{bansal2026videophy2,
  author = {Bansal, Hritik and Peng, Clark and Bitton, Yonatan and Goldenberg, Roman and Grover, Aditya and Chang, Kai-Wei},
  title = {VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation},
  year = {2026},
  booktitle = {ICLR}
}

OpenThoughts: Data Recipes for Reasoning Models

Etash Kumar Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Rea Sprague, Ashima Suvarna, Benjamin Feuer, Leon Liangyu Chen, Zaid Khan, Eric Frankel, and others, in ICLR, 2026.

Oral, top 1.8%

Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. After initial explorations, our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThinker3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond, improvements of 15.3, 17.2, and 20.5 percentage points compared to DeepSeek-R1-Distill-Qwen-7B.

@inproceedings{guha2026openthoughts,
  title = {OpenThoughts: Data Recipes for Reasoning Models},
  author = {Guha, Etash Kumar and Marten, Ryan and Keh, Sedrick and Raoof, Negin and Smyrnis, Georgios and Bansal, Hritik and Nezhurina, Marianna and Mercat, Jean and Vu, Trung and Sprague, Zayne Rea and Suvarna, Ashima and Feuer, Benjamin and Chen, Leon Liangyu and Khan, Zaid and Frankel, Eric and others},
  booktitle = {ICLR},
  year = {2026}
}

MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Knowledge Poisoning Attacks

Hyeonjeong Ha, Qiusi Zhan, Jeonghwan Kim, Dimitrios Bralios, Saikrishna Sanniboina, Nanyun Peng, Kai-Wei Chang, Daniel Kang, and Heng Ji, in ACL, 2026.

Multimodal large language models with Retrieval Augmented Generation (RAG) have significantly advanced tasks such as multimodal question answering by grounding responses in external text and images. This grounding improves factuality, reduces hallucination, and extends reasoning beyond parametric knowledge. However, this reliance on external knowledge poses a critical yet underexplored safety risk: knowledge poisoning attacks, where adversaries deliberately inject adversarial multimodal content into external knowledge bases to steer the model toward generating incorrect or even harmful responses. To expose such vulnerabilities, we propose MM-PoisonRAG, the first framework to systematically design knowledge poisoning in multimodal RAG. We introduce two complementary attack strategies: Localized Poisoning Attack (LPA), which implants targeted multimodal misinformation to manipulate specific queries, and Globalized Poisoning Attack (GPA), which inserts a single piece of adversarial knowledge to broadly disrupt reasoning and induce nonsensical responses across all queries. Comprehensive experiments across tasks, models, and access settings show that LPA achieves targeted manipulation with attack success rates of up to 56%, while GPA completely disrupts model generation to 0% accuracy with just a single adversarial knowledge injection. Our results reveal the fragility of multimodal RAG and highlight the urgent need for defenses against knowledge poisoning.

@inproceedings{ha2026mmpoisonrag,
  title = {MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Knowledge Poisoning Attacks},
  author = {Ha, Hyeonjeong and Zhan, Qiusi and Kim, Jeonghwan and Bralios, Dimitrios and Sanniboina, Saikrishna and Peng, Nanyun and Chang, Kai-Wei and Kang, Daniel and Ji, Heng},
  booktitle = {ACL},
  year = {2026}
}

LiveCLKTBench: Towards Reliable Evaluation of Cross-Lingual Knowledge Transfer in Multilingual LLMs

Pei-Fu Guo, Yun-Da Tsai, Chun-Chia Hsu, Kai-Xin Chen, Ya An Tsai, Kai-Wei Chang, Nanyun Peng, Mi-Yen Yeh, and Shou-De Lin, in ACL, 2026.

Evaluating cross-lingual knowledge transfer in large language models is challenging, as correct answers in a target language may arise either from genuine transfer or from prior exposure during pre-training. We present LiveCLKTBench, an automated generation pipeline specifically designed to isolate and measure cross-lingual knowledge transfer. Our pipeline identifies self-contained, time-sensitive knowledge entities from real-world domains, filters them based on temporal occurrence, and verifies them against the model’s knowledge. The documents of these valid entities are then used to generate factual questions, which are translated into multiple languages to evaluate transferability across linguistic boundaries. Using LiveCLKTBench, we evaluate several LLMs across five languages and observe that cross-lingual transfer is strongly influenced by linguistic distance and often asymmetric across language directions. While larger models improve transfer, the gains diminish with scale and vary across domains. These findings provide new insights into multilingual transfer and demonstrate the value of LiveCLKTBench as a reliable benchmark for future research.

@inproceedings{guo2026liveclktbench,
  title = {LiveCLKTBench: Towards Reliable Evaluation of Cross-Lingual Knowledge Transfer in Multilingual LLMs},
  author = {Guo, Pei-Fu and Tsai, Yun-Da and Hsu, Chun-Chia and Chen, Kai-Xin and Tsai, Ya An and Chang, Kai-Wei and Peng, Nanyun and Yeh, Mi-Yen and Lin, Shou-De},
  booktitle = {ACL},
  year = {2026}
}

Gold-Medal-Level Olympiad Geometry Solving with Efficient Heuristic Auxiliary Constructions

Boyan Duan, Xiao Liang, Shuai Lu, Yaoxiang Wang, Yelong Shen, Kai-Wei Chang, Ying Nian Wu, Mao Yang, Weizhu Chen, and Yeyun Gong, in ACL, 2026.

Automated theorem proving in Euclidean geometry, particularly for International Mathematical Olympiad (IMO) level problems, remains a major challenge and an important research focus in Artificial Intelligence. In this paper, we present a highly efficient method for geometry theorem proving that runs entirely on CPUs without relying on neural network-based inference. Our initial study shows that a simple random strategy for adding auxiliary points can achieve silver-medal level human performance on IMO. Building on this, we propose HAGeo, a Heuristic-based method for adding Auxiliary constructions in Geometric deduction that solves 28 of 30 problems on the IMO-30 benchmark, achieving gold-medal level performance and surpassing AlphaGeometry, a competitive neural network-based approach, by a notable margin. To evaluate our method and existing approaches more comprehensively, we further construct HAGeo-409, a benchmark consisting of 409 geometry problems with human-assessed difficulty levels. Compared with the widely used IMO-30, our benchmark poses greater challenges and provides a more precise evaluation, setting a higher bar for geometry theorem proving.

@inproceedings{duan2026gold,
  title = {Gold-Medal-Level Olympiad Geometry Solving with Efficient Heuristic Auxiliary Constructions},
  author = {Duan, Boyan and Liang, Xiao and Lu, Shuai and Wang, Yaoxiang and Shen, Yelong and Chang, Kai-Wei and Wu, Ying Nian and Yang, Mao and Chen, Weizhu and Gong, Yeyun},
  booktitle = {ACL},
  year = {2026}
}

From Narrow Unlearning to Emergent Misalignment in LLMs

Erum Mushtaq, Anil Ramakrishna, Satyapriya Krishna, Sattvik Sahai, Prasoon Goyal, Kai-Wei Chang, Tao Zhang, and Rahul Gupta, in ACL, 2026.

Recent work has shown that fine-tuning on insecure code data can trigger an emergent misalignment (EMA) phenomenon, where models generate malicious responses even to prompts unrelated to the original insecure code-writing task. Such cross-domain generalization of harmful behavior underscores the need for a deeper understanding of the algorithms, tasks, and datasets that induce emergent misalignment. In this work, we extend this study by demonstrating that emergent misalignment can also arise from narrow refusal unlearning in specific domains. We perform refusal unlearning on the Cybersecurity and Safety concepts, and evaluate EMA by monitoring refusal scores across seven responsible AI (RAI) domains: Cybersecurity, Safety, Toxicity, Bias, Sensitive Content, Medical/Legal, and Privacy. Our work shows that narrow domain unlearning can yield compliance responses for the targeted concept; however, it may also propagate EMA to unrelated domains. Among the two intervened concepts, Cybersecurity and Safety, we find that the Safety concept can have a larger EMA impact, i.e., causing lower refusal scores, across other unrelated domains such as Bias. We observe this effect consistently across two model families, Mistral-7b-0.3v and Qwen-7b-2.5. Further, we show that refusal unlearning augmented with a cross-entropy loss function on a small set of retain data from the affected domains can largely, if not fully, restore alignment across the impacted domains while having a lower refusal rate on the concept we perform unlearning on. To investigate the underlying causes of EMA, we analyze concept entanglements at the representation level via concept vectors. Our analysis reveals that concepts with higher representation similarity in earlier layers are more susceptible to EMA after intervention when the refusal stream is altered through targeted refusal unlearning.

@inproceedings{mushtaq2026narrow,
  title = {From Narrow Unlearning to Emergent Misalignment in LLMs},
  author = {Mushtaq, Erum and Ramakrishna, Anil and Krishna, Satyapriya and Sahai, Sattvik and Goyal, Prasoon and Chang, Kai-Wei and Zhang, Tao and Gupta, Rahul},
  booktitle = {ACL},
  year = {2026}
}

InsideOut: Measuring and Mitigating Insider-Outsider Bias in Interview Script Generation

Yixin Wan, Xingrun Chen, and Kai-Wei Chang, in ACL, 2026.

Advancements in large language models (LLMs) have enabled a variety of downstream applications like story and interview script generation. However, recent research raised concerns about culture-related fairness issues in LLM-generated content. In this work, we identify and systematically investigate LLMs’ insider-outsider bias, a phenomenon where models position themselves as ‘insiders’ of mainstream cultures during generation while externalizing less dominant cultures. We propose the InsideOut benchmark with 4,000 generation prompts and three evaluation metrics to quantify this bias through a culturally situated interview script generation task, in which an LLM is positioned as a reporter interviewing local people across 10 diverse cultures. Empirical evaluation on 5 state-of-the-art LLMs reveals that while models adopt insider tones in over 88% of US-contexted scripts on average, they disproportionately default to ‘outsider’ stances for non-Western cultures. To mitigate these biases, we propose two inference-time methods: a baseline prompt-based Fairness Intervention Pillars (FIP) method, and a structured Mitigation via Fairness Agents (MFA) framework consisting of a Single-Agent (MFA-SA), a Hierarchical-Agent (MFA-HA), and an autonomous Agentic Planning (MFA-Plan) pipeline. Empirical results demonstrate that agent-based MFA methods achieve outstanding and robust performance in mitigating the insider-outsider bias. For instance, on the Cultural Alignment Gap (CAG) metric, MFA-SA reduces bias in the Llama model by 89.70% and MFA-HA mitigates bias in Qwen by 82.54%. These findings showcase the effectiveness of agent-based methods as a promising direction for mitigating biases in generative LLMs.

@inproceedings{wan2026insideout,
  title = {InsideOut: Measuring and Mitigating Insider-Outsider Bias in Interview Script Generation},
  author = {Wan, Yixin and Chen, Xingrun and Chang, Kai-Wei},
  booktitle = {ACL},
  year = {2026}
}

VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval

Di Wu, Yixin Wan, and Kai-Wei Chang, in ACL, 2026.

Text-to-image retrieval (T2I retrieval) remains challenging because cross-modal embeddings often behave as bags of concepts, underrepresenting structured visual relationships such as pose and viewpoint. We propose Visualize-then-Retrieve (VisRet), a retrieval paradigm that mitigates this limitation of cross-modal similarity alignment. VisRet first projects textual queries into the image modality via T2I generation, then performs retrieval within the image modality to bypass the weaknesses of cross-modal retrievers in recognizing subtle visual-spatial features. Across four benchmarks (Visual-RAG, INQUIRE-Rerank, Microsoft COCO, and our new Visual-RAG-ME featuring multi-entity comparisons), VisRet substantially outperforms cross-modal similarity matching and baselines that recast T2I retrieval as text-to-text similarity matching, improving nDCG@30 by 0.125 on average with CLIP as the retriever and by 0.121 with E5-V. For downstream question answering, VisRet increases accuracy on Visual-RAG and Visual-RAG-ME by 3.8% and 15.7% in top-1 retrieval, and by 3.9% and 11.1% in top-10 retrieval. Ablation studies show compatibility with different T2I instruction LLMs, T2I generation models, and downstream LLMs. VisRet provides a simple yet effective perspective for advancing text-image retrieval. Our code and the new benchmark are publicly available at https://github.com/xiaowu0162/Visualize-then-Retrieve.
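For readers unfamiliar with the nDCG@k metric reported in the abstract above, here is a minimal sketch of one common formulation (linear gain, log2 discount). The function names are hypothetical. The ideal ranking is computed only from the relevance labels passed in, which assumes every relevant item appears somewhere in the ranked list. The paper's exact evaluation setup may differ.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (rank 1 is undiscounted)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG of the given ranking divided by the DCG of the best possible ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels of retrieved images in ranked order (1 = relevant).
score = ndcg_at_k([0, 1, 1], k=3)
```

A ranking that places all relevant items first scores 1.0, and pushing them down the list lowers the score.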

@inproceedings{wu2026visret,
  title = {VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval},
  author = {Wu, Di and Wan, Yixin and Chang, Kai-Wei},
  booktitle = {ACL},
  year = {2026}
}

Training LLMs for Divide-and-Conquer Reasoning

Xiao Liang, Zhong-Zhi Li, Zhenghao Lin, Eric Hanchen Jiang, Hengyuan Zhang, Yelong Shen, Kai-Wei Chang, Ying Nian Wu, Yeyun Gong, and Weizhu Chen, in ACL, 2026.

Large language models (LLMs) have demonstrated strong reasoning capabilities through step-by-step chain-of-thought (CoT) reasoning. Nevertheless, at the limits of model capability, CoT often proves insufficient, and its strictly sequential nature constrains test-time scalability. A potential alternative is divide-and-conquer (DAC) reasoning, which decomposes a complex problem into subproblems to facilitate more effective exploration of the solution. Although promising, our analysis reveals a fundamental misalignment between general-purpose post-training and DAC-style inference, which limits the model’s capacity to fully leverage this potential. To bridge this gap and fully unlock LLMs’ reasoning capabilities on the most challenging tasks, we propose an end-to-end reinforcement learning (RL) framework to enhance their DAC-style reasoning capacity. At each step, the policy decomposes a problem into a group of subproblems, solves them sequentially, and addresses the original one conditioned on the subproblem solutions, with both decomposition and solution integrated into RL training. Under comparable training, our DAC-style framework endows the model with a higher performance ceiling and stronger test-time scalability, surpassing CoT by 8.6% in Pass@1 and 6.3% in Pass@32 on competition-level benchmarks.

@inproceedings{jiang2026divide,
  title = {Training LLMs for Divide-and-Conquer Reasoning},
  author = {Liang, Xiao and Li, Zhong-Zhi and Lin, Zhenghao and Jiang, Eric Hanchen and Zhang, Hengyuan and Shen, Yelong and Chang, Kai-Wei and Wu, Ying Nian and Gong, Yeyun and Chen, Weizhu},
  booktitle = {ACL},
  year = {2026}
}

SWAN: Semantic Watermarking with Abstract Meaning Representation

Ziping Ye, Gourab Dey, Christos Christodoulopoulos, Charith Peris, Anil Ramakrishna, Weitong Ruan, Aram Galstyan, Kai-Wei Chang, Rahul Gupta, and Ninareh Mehrabi, in ACL, 2026.

@inproceedings{ye2026swan,
  title = {SWAN: Semantic Watermarking with Abstract Meaning Representation},
  author = {Ye, Ziping and Dey, Gourab and Christodoulopoulos, Christos and Peris, Charith and Ramakrishna, Anil and Ruan, Weitong and Galstyan, Aram and Chang, Kai-Wei and Gupta, Rahul and Mehrabi, Ninareh},
  booktitle = {ACL},
  year = {2026}
}

Mitigating Over-Refusal in Aligned Large Language Models via Inference-Time Activation Energy

Eric Hanchen Jiang, Weixuan Ou, Run Liu, Shengyuan Pang, Guancheng Wan, Ranjie Duan, Wei Dong, Kai-Wei Chang, XiaoFeng Wang, Ying Nian Wu, and Xinfeng Li, in ACL, 2026.

Safety alignment of large language models currently faces a central challenge: existing alignment techniques often prioritize mitigating responses to harmful prompts at the expense of overcautious behavior, leading models to incorrectly refuse benign requests. A key goal of safe alignment is therefore to improve safety while simultaneously minimizing false refusals. In this work, we introduce Energy Landscape Steering (ELS), a novel, fine-tuning-free framework designed to resolve this challenge through dynamic, inference-time intervention. We train a lightweight external Energy-Based Model (EBM) to assign high energy to undesirable states (false refusal or jailbreak) and low energy to desirable states (helpful response or safe reject). During inference, the EBM maps the LLM’s internal activations to an energy landscape, and we use the gradient of the energy function to steer the hidden states toward low-energy regions in real time. This dynamically guides the model toward desirable behavior without modifying its parameters. By decoupling behavioral control from the model’s core knowledge, ELS provides a flexible and computationally efficient solution. Extensive experiments across diverse models demonstrate its effectiveness, raising compliance on the ORB-H benchmark from 57.3% to 82.6% while maintaining baseline safety performance. Our work establishes a promising paradigm for building LLMs that simultaneously achieve high safety and low false refusal rates.
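The gradient-based steering step described in the abstract above can be pictured with a toy example. In this sketch, a hand-written quadratic energy stands in for the trained EBM, and a two-dimensional list stands in for a hidden state. The names `energy`, `energy_grad`, and `steer`, the step size, and the number of steps are all assumptions for illustration, not the paper's implementation.

```python
def energy(h, center):
    """Toy energy: half the squared distance from a low-energy center (stand-in for an EBM)."""
    return 0.5 * sum((x - c) ** 2 for x, c in zip(h, center))

def energy_grad(h, center):
    """Gradient of the toy energy with respect to the hidden state h."""
    return [x - c for x, c in zip(h, center)]

def steer(h, center, step_size=0.1, num_steps=5):
    """Move h toward lower energy by repeated gradient-descent steps."""
    for _ in range(num_steps):
        grad = energy_grad(h, center)
        h = [x - step_size * g for x, g in zip(h, grad)]
    return h

hidden = [2.0, -1.0]             # hypothetical hidden state
low_energy_center = [0.0, 0.0]   # hypothetical region of desirable behavior
steered = steer(hidden, low_energy_center)
```

Only the hidden state changes in this loop; the model parameters are untouched, which mirrors the abstract's point about guiding behavior without modifying the model.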

@inproceedings{jiang2026overrefusal,
  title = {Mitigating Over-Refusal in Aligned Large Language Models via Inference-Time Activation Energy},
  author = {Jiang, Eric Hanchen and Ou, Weixuan and Liu, Run and Pang, Shengyuan and Wan, Guancheng and Duan, Ranjie and Dong, Wei and Chang, Kai-Wei and Wang, XiaoFeng and Wu, Ying Nian and Li, Xinfeng},
  booktitle = {ACL},
  year = {2026}
}

Dynamic Generation of Multi LLM Agents Communication Topologies with Graph Diffusion Models

Eric Hanchen Jiang, Mengting Li, Guancheng Wan, Xiao Liang, Sophia Yin, Yuchen Wu, Xinfeng Li, Yizhou Sun, Wei Wang, Kai-Wei Chang, and Ying Nian Wu, in ACL, 2026.

The efficiency of multi-agent systems driven by large language models (LLMs) largely hinges on their communication topology. However, designing an optimal topology is a non-trivial challenge, as it requires balancing competing objectives such as task performance, communication cost, and robustness. Existing frameworks often rely on static or hand-crafted topologies, which inherently fail to adapt to diverse task requirements, leading to either excessive token consumption for simple problems or performance bottlenecks for complex ones. To address this challenge, we introduce a novel generative framework called Guided Topology Diffusion (GTD). Inspired by conditional discrete graph diffusion models, GTD formulates topology synthesis as an iterative construction process. At each step, the generation is steered by a lightweight proxy model that predicts multi-objective rewards (e.g., accuracy, utility, cost), enabling real-time, gradient-free optimization towards task-adaptive topologies. This iterative, guided synthesis process distinguishes GTD from single-step generative frameworks, enabling it to better navigate complex design trade-offs. We validated GTD across multiple benchmarks, and experiments show that this framework can generate highly task-adaptive, sparse, and efficient communication topologies, significantly outperforming existing methods in LLM agent collaboration.

@inproceedings{jiang2026dynamic,
  title = {Dynamic Generation of Multi LLM Agents Communication Topologies with Graph Diffusion Models},
  author = {Jiang, Eric Hanchen and Li, Mengting and Wan, Guancheng and Liang, Xiao and Yin, Sophia and Wu, Yuchen and Li, Xinfeng and Sun, Yizhou and Wang, Wei and Chang, Kai-Wei and Wu, Ying Nian},
  booktitle = {ACL},
  keyword_extra = {AI-agent},
  year = {2026}
}

ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System

Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna, Rahul Gupta, Kai-Wei Chang, Aram Galstyan, and Charith Peris, in ACL, 2026.

@inproceedings{liang2026ares,
  title = {ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System},
  author = {Liang, Jiacheng and Ma, Yao and Kumarage, Tharindu and Krishna, Satyapriya and Gupta, Rahul and Chang, Kai-Wei and Galstyan, Aram and Peris, Charith},
  booktitle = {ACL},
  year = {2026}
}

BRIEF-Pro: Universal Context Compression with Short-to-Long Synthesis for Fast and Accurate Multi-Hop Reasoning

Jia-Chen Gu, Junyi Zhang, Di Wu, Yuankai Li, Kai-Wei Chang, and Nanyun Peng, in ACL-Findings, 2026.

As retrieval-augmented generation (RAG) tackles complex tasks, increasingly expanded contexts offer richer information, but at the cost of higher latency and increased cognitive load on the model. To mitigate this bottleneck, especially for intricate multi-hop questions, we introduce BRIEF-Pro. It is a universal, lightweight compressor that distills relevant evidence for a given query from retrieved documents into a concise summary for seamless integration into in-context RAG. Using seed data consisting of relatively short contexts (fewer than 1k words), BRIEF-Pro is trained to perform abstractive compression of extended contexts exceeding 10k words across a wide range of scenarios. Furthermore, BRIEF-Pro offers flexible user control over summary length by allowing users to specify the desired number of sentences. Experiments on four open-domain multi-hop question-answering datasets show that BRIEF-Pro generates more concise and relevant summaries, enhancing performance across small, large, and proprietary language models. With the 70B reader model, 32x compression by BRIEF-Pro improves QA performance by 4.67% on average over LongLLMLingua’s 9x, while requiring only 23% of its computational overhead.

@inproceedings{gu2026briefpro,
  title = {BRIEF-Pro: Universal Context Compression with Short-to-Long Synthesis for Fast and Accurate Multi-Hop Reasoning},
  author = {Gu, Jia-Chen and Zhang, Junyi and Wu, Di and Li, Yuankai and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {ACL-Findings},
  year = {2026}
}

AutoSUIT Bench - Automated Security UnIt Test Benchmark for LLM Coding

Samuel Osebe, Fan Yang, Junyi Li, Yue Gu, Yongxin Wang, Satyapriya Krishna, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, and Weitong Ruan, in ACL-Findings, 2026.

@inproceedings{osebe2026autosuit,
  title = {AutoSUIT Bench - Automated Security UnIt Test Benchmark for LLM Coding},
  author = {Osebe, Samuel and Yang, Fan and Li, Junyi and Gu, Yue and Wang, Yongxin and Krishna, Satyapriya and Chang, Kai-Wei and Galstyan, Aram and Gupta, Rahul and Ruan, Weitong},
  booktitle = {ACL-Findings},
  keyword_extra = {AI-agent},
  year = {2026}
}

Beyond Facts: Benchmarking Distributional Reading Comprehension in Large Language Models

Pei-Fu Guo, Ya An Tsai, Chun-Chia Hsu, Kai-Xin Chen, Yun-Da Tsai, Kai-Wei Chang, Nanyun Peng, Mi-Yen Yeh, and Shou-De Lin, in ACL-Findings, 2026.

While most reading comprehension benchmarks for LLMs focus on factual information that can be answered by localizing specific textual evidence, many real-world tasks require understanding distributional information, such as population-level trends and preferences expressed across collections of text. We introduce Text2DistBench, a reading comprehension benchmark for evaluating LLMs’ ability to infer distributional knowledge from natural language. Built from real-world YouTube comments about movie and music entities, the benchmark provides models with entity metadata and associated comments, and requires them to answer distributional questions, such as estimating the proportions of positive and negative comments, or identifying the most and second most frequent topics discussed among viewers. To support reliable and long-term evaluation, the construction pipeline of Text2DistBench is fully automated and continuously updated to incorporate newly emerging entities over time. Experiments across multiple LLMs show that while models substantially outperform random baselines, performance varies widely across different distribution types and characteristics. These findings highlight both the capabilities and limitations of current LLMs in distributional reading comprehension and demonstrate the value of Text2DistBench as a practical and scalable testbed for future research.

@inproceedings{guo2026beyondfacts,
  title = {Beyond Facts: Benchmarking Distributional Reading Comprehension in Large Language Models},
  author = {Guo, Pei-Fu and Tsai, Ya An and Hsu, Chun-Chia and Chen, Kai-Xin and Tsai, Yun-Da and Chang, Kai-Wei and Peng, Nanyun and Yeh, Mi-Yen and Lin, Shou-De},
  booktitle = {ACL-Findings},
  year = {2026}
}

ContextNav: Towards Agentic Multimodal In-Context Learning

Honghao Fu, Yuan Ouyang, Kai-Wei Chang, Yiwei Wang, Zi Huang, and Yujun Cai, in ICLR, 2026.

Recent advances demonstrate that multimodal large language models (MLLMs) exhibit strong multimodal in-context learning (ICL) capabilities, enabling them to adapt to novel vision-language tasks from a few contextual examples. However, existing ICL approaches face challenges in reconciling scalability with robustness across diverse tasks and noisy contextual examples: manually selecting examples produces clean contexts but is labor-intensive and task-specific, while similarity-based retrieval improves scalability but could introduce irrelevant or structurally inconsistent samples that degrade ICL performance. To address these limitations, we propose ContextNav, the first agentic framework that integrates the scalability of automated retrieval with the quality and adaptiveness of human-like curation, enabling noise-robust and dynamically optimized contextualization for multimodal ICL. ContextNav unifies context management and noise-robust contextualization within a closed-loop workflow driven by graph-based orchestration. Specifically, it builds a resource-aware multimodal embedding pipeline, maintains a retrievable vector database, and applies agentic retrieval and structural alignment to construct noise-resilient contexts. An Operational Grammar Graph (OGG) further supports adaptive workflow planning and optimization, enabling the agent to refine its operational strategies based on downstream ICL feedback. Experimental results demonstrate that ContextNav achieves state-of-the-art performance across various datasets, underscoring the promise of agentic workflows for advancing scalable and robust contextualization in multimodal ICL.

@inproceedings{fu2026contextnav,
  title = {ContextNav: Towards Agentic Multimodal In-Context Learning},
  author = {Fu, Honghao and Ouyang, Yuan and Chang, Kai-Wei and Wang, Yiwei and Huang, Zi and Cai, Yujun},
  booktitle = {ICLR},
  keyword_extra = {AI-agent},
  year = {2026}
}

HoneyBee: Data Recipes for Vision-Language Reasoners

Hritik Bansal, Devendra Singh Sachan, Kai-Wei Chang, Aditya Grover, Gargi Ghosh, Wen-tau Yih, and Ramakanth Pasunuru, in CVPR, 2026.

Full Text Code LinkedIn Post Abstract BibTeX Details

Recent advances in vision-language models (VLMs) have made them highly effective at reasoning tasks. However, the principles underlying the construction of performant VL reasoning training datasets remain poorly understood. In this work, we introduce several data curation approaches and study their impacts on VL reasoning capabilities by carefully controlling training and evaluation setups. We analyze the effects of context (image and question pair) sources, implement targeted data interventions, and explore scaling up images, questions, and chain-of-thought (CoT) solutions. Our findings reveal that (a) context source strategies significantly affect VLM performance, (b) interventions such as auxiliary signals from image captions and the inclusion of text-only reasoning yield substantial gains, and (c) scaling all data dimensions (e.g., unique questions per image and unique CoTs per image-question pair) consistently improves reasoning capability. Motivated by these insights, we introduce HoneyBee, a large-scale, high-quality CoT reasoning dataset with 2.5M examples consisting of 350K image-question pairs. VLMs trained with HoneyBee outperform state-of-the-art models across model sizes. For instance, a HoneyBee-trained VLM with 3B parameters outperforms the SOTA model and the base model by 7.8% and 24.8%, respectively, on MathVerse. Furthermore, we propose a test-time scaling strategy that reduces decoding cost by 73% without sacrificing accuracy. Overall, this work presents improved strategies for VL reasoning dataset curation research. Data is available at https://huggingface.co/datasets/facebook/HoneyBee.

@inproceedings{bansal2026honeybee,
  title = {HoneyBee: Data Recipes for Vision-Language Reasoners},
  author = {Bansal, Hritik and Sachan, Devendra Singh and Chang, Kai-Wei and Grover, Aditya and Ghosh, Gargi and Yih, Wen-tau and Pasunuru, Ramakanth},
  booktitle = {CVPR},
  year = {2026}
}

MotionEdit: Benchmarking and Learning Motion-Centric Image Editing

Yixin Wan, Lei Ke, Wenhao Yu, Kai-Wei Chang, and Dong Yu, in CVPR, 2026.

Full Text Code Abstract BibTeX Details

We introduce MotionEdit, a novel dataset for motion-centric image editing, the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility. Unlike existing image editing datasets that focus on static appearance changes or contain only sparse, low-quality motion edits, MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted and verified from continuous videos. This new task is not only scientifically challenging but also practically significant, powering downstream applications such as frame-controlled video synthesis and animation. To evaluate model performance on the novel task, we introduce MotionEdit-Bench, a benchmark that challenges models on motion-centric edits and measures model performance with generative, discriminative, and preference-based metrics. Benchmark results reveal that motion editing remains highly challenging for existing state-of-the-art diffusion-based editing models. To address this gap, we propose MotionNFT (Motion-guided Negative-aware Fine Tuning), a post-training framework that computes motion alignment rewards based on how well the motion flow between input and model-edited images matches the ground-truth motion, guiding models toward accurate motion transformations. Extensive experiments on FLUX.1 Kontext and Qwen-Image-Edit show that MotionNFT consistently improves editing quality and motion fidelity of both base models on the motion editing task without sacrificing general editing ability, demonstrating its effectiveness. Our code is at https://github.com/elainew728/motion-edit/.

@inproceedings{wan2026motionedit,
  title = {MotionEdit: Benchmarking and Learning Motion-Centric Image Editing},
  author = {Wan, Yixin and Ke, Lei and Yu, Wenhao and Chang, Kai-Wei and Yu, Dong},
  booktitle = {CVPR},
  year = {2026}
}

BLUR: A Bi-Level Optimization Approach for LLM Unlearning

Hadi Reisizadeh, Jinghan Jia, Zhiqi Bu, Bhanukiran Vinzamuri, Anil Ramakrishna, Kai-Wei Chang, Volkan Cevher, Sijia Liu, and Mingyi Hong, in EACL, 2026.

Full Text Code Abstract BibTeX Details

Enabling large language models (LLMs) to unlearn knowledge and capabilities acquired during training has proven vital for ensuring compliance with data regulations and promoting ethical practices in generative AI. Although there is growing interest in developing various unlearning algorithms, it remains unclear how to best formulate the unlearning problem. The most popular formulation uses a weighted sum of forget and retain losses, but it often leads to performance degradation due to the inherent trade-off between the two. In this work, we argue that it is important to model the hierarchical structure of the unlearning problem, where the forget problem (which unlearns certain knowledge and/or capabilities) takes priority over the retain problem (which preserves model utility). This hierarchical structure naturally leads to a bi-level optimization formulation where the lower-level objective focuses on minimizing the forget loss, while the upper-level objective aims to maintain the model’s utility. Based on this new formulation, we propose a novel algorithm, termed Bi-Level UnleaRning (BLUR), which not only possesses strong theoretical guarantees but, more importantly, delivers superior performance. In particular, our extensive experiments demonstrate that BLUR consistently outperforms all the state-of-the-art algorithms across various unlearning tasks, models, and metrics. Code is available at https://github.com/OptimAI-Lab/BLURLLMUnlearning.
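
The two formulations contrasted in this abstract can be written out as a rough sketch. The symbols are assumptions introduced here for illustration and are not taken from the paper: \theta denotes the model parameters, \ell_f the forget loss, \ell_r the retain loss, and \lambda a trade-off weight.

```latex
% Weighted-sum formulation: forget and retain losses are traded off by \lambda > 0
\min_{\theta} \; \ell_f(\theta) + \lambda \, \ell_r(\theta)

% Bi-level formulation: the lower level minimizes the forget loss,
% the upper level preserves utility among the forget-optimal parameters
\min_{\theta} \; \ell_r(\theta)
\quad \text{s.t.} \quad
\theta \in \arg\min_{\theta'} \; \ell_f(\theta')
```

In the second form the retain loss is only optimized over parameters that already minimize the forget loss, which is one way to express that forgetting takes priority over retention.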

@inproceedings{reisizadeh2026blur,
  title = {BLUR: A Bi-Level Optimization Approach for LLM Unlearning},
  author = {Reisizadeh, Hadi and Jia, Jinghan and Bu, Zhiqi and Vinzamuri, Bhanukiran and Ramakrishna, Anil and Chang, Kai-Wei and Cevher, Volkan and Liu, Sijia and Hong, Mingyi},
  booktitle = {EACL},
  year = {2026}
}

Open-Domain Safety Policy Construction

Di Wu, Siyue Liu, Zixiang Ji, Ya-Liang Chang, Zhe-Yu Liu, Andrew Pleffer, and Kai-Wei Chang, in EACL-Findings, 2026.

Full Text Code Abstract BibTeX Details

Moderation layers are increasingly a core component of many products built on user- or model-generated content. However, drafting and maintaining domain-specific safety policies remains costly. We present Deep Policy Research (DPR), a minimal agentic system that drafts a full content moderation policy based on only human-written seed domain information. DPR uses a single web search tool and lightweight scaffolding to iteratively propose search queries, distill diverse web sources into policy rules, and organize rules into an indexed document. We evaluate DPR on (1) the OpenAI undesired content benchmark across five domains with two compact reader LLMs and (2) an in-house multimodal advertisement moderation benchmark. DPR consistently outperforms definition-only and in-context learning baselines, and in our end-to-end setting it is competitive with expert-written policy sections in several domains. Moreover, under the same seed specification and evaluation protocol, DPR outperforms a general-purpose deep research system, suggesting that a task-specific, structured research loop can be more effective than generic web research for policy drafting.

@inproceedings{wu2026opendomain,
  title = {Open-Domain Safety Policy Construction},
  author = {Wu, Di and Liu, Siyue and Ji, Zixiang and Chang, Ya-Liang and Liu, Zhe-Yu and Pleffer, Andrew and Chang, Kai-Wei},
  booktitle = {EACL-Findings},
  year = {2026}
}

2025

The Amazon Nova Family of Models: Technical Report and Model Card

Amazon AGI, 2025.

Full Text Abstract BibTeX Details

We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly-capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents and text. Amazon Nova Micro is a text-only model that delivers our lowest-latency responses at very low cost. Amazon Nova Canvas is an image generation model that creates professional grade images with rich customization controls. Amazon Nova Reel is a video generation model offering high-quality outputs, customization, and motion control. Our models were built responsibly and with a commitment to customer trust, security, and reliability. We report benchmarking results for core capabilities, agentic performance, long context, functional adaptation, runtime performance, and human evaluation.

@inproceedings{amazon2025nova,
  title = {The Amazon Nova Family of Models: Technical Report and Model Card},
  author = {{Amazon AGI}},
  year = {2025}
}

Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence

Yining Hong, Rui Sun, Bingxuan Li, Xingcheng Yao, Maxine Wu, Alexander Chien, Da Yin, Ying Nian Wu, Zhecan James Wang, and Kai-Wei Chang, in NeurIPS, 2025.

Full Text Code Abstract BibTeX Details Spotlight (top 5% papers)

AI agents today are mostly siloed: they either retrieve and reason over vast amounts of digital information or interact with the physical world through embodied perception, planning and action, but rarely both. This separation limits their ability to solve tasks requiring integrated physical and digital intelligence, such as cooking from online recipes, navigating with dynamic map data, or interpreting real-world landmarks using web knowledge. This paper introduces Embodied Web Agents, a paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. The authors develop Embodied Web Agents task environments, a unified simulation platform integrating realistic 3D indoor and outdoor environments with functional web interfaces. Building on this platform, they construct and release the Embodied Web Agents Benchmark, which encompasses cooking, navigation, shopping, tourism, and geolocation tasks requiring coordinated reasoning across physical and digital realms. Experiments reveal significant performance gaps between state-of-the-art AI systems and human capabilities, highlighting challenges and opportunities at the intersection of embodied cognition and web-scale knowledge.

@inproceedings{hong2025embodied,
  title = {Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence},
  author = {Hong, Yining and Sun, Rui and Li, Bingxuan and Yao, Xingcheng and Wu, Maxine and Chien, Alexander and Yin, Da and Wu, Ying Nian and Wang, Zhecan James and Chang, Kai-Wei},
  booktitle = {NeurIPS},
  keyword_extra = {AI-agent},
  year = {2025}
}

LaViDa: A Large Diffusion Language Model for Multimodal Understanding

Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, and Aditya Grover, in NeurIPS, 2025.

Full Text Code Abstract BibTeX Details Spotlight (top 5% papers)

Existing autoregressive vision-language models (VLMs) offer impressive visual reasoning but suffer from slow sequential decoding and limited control over generation. Discrete diffusion models (DMs) provide parallel decoding and bidirectional context, yet their use in multimodal tasks is underexplored. LaViDa introduces a family of diffusion-based VLMs that integrate a vision encoder into a diffusion model and jointly fine-tune the combined parts for multimodal instruction following. The model incorporates complementary masking to improve training efficiency, a prefix KV cache for faster inference, and timestep shifting for high-quality sampling. LaViDa achieves competitive or superior performance to autoregressive VLMs on multi-modal benchmarks such as MMMU and COCO, while offering flexible speed-quality trade-offs and controllable generation. For example, LaViDa surpasses Open-LLaVa-Next-Llama3-8B by +4.1 CIDEr on COCO captioning with a 1.92x speedup and improves constrained poem completion by 59%. Code and models are available at the authors’ repository.

@inproceedings{li2025lavida,
  title = {LaViDa: A Large Diffusion Language Model for Multimodal Understanding},
  author = {Li, Shufan and Kallidromitis, Konstantinos and Bansal, Hritik and Gokul, Akash and Kato, Yusuke and Kozuka, Kazuki and Kuen, Jason and Lin, Zhe and Chang, Kai-Wei and Grover, Aditya},
  booktitle = {NeurIPS},
  year = {2025}
}

PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding

Ansel Blume, Jeonghwan Kim, Hyeonjeong Ha, Elen Chatikyan, Xiaomeng Jin, Khanh Duy Nguyen, Nanyun Peng, Kai-Wei Chang, Derek Hoiem, and Heng Ji, in NeurIPS, 2025.

Full Text Code Abstract BibTeX Details Spotlight (top 5% papers)

Real-world objects are composed of distinct, object-specific parts that support fine-grained reasoning. However, large multimodal models (LMMs) struggle to identify parts and reason about part-whole relationships. This paper introduces PARTONOMY, an LMM benchmark designed for pixel-level part grounding. The benchmark combines existing part datasets and a new annotated set comprising 862 part labels and 534 object labels. Experiments reveal that state-of-the-art segmenting LMMs perform poorly on part-level tasks (e.g., a strong model attains only 5.9% global IoU), highlighting a major capability gap. The authors identify architectural shortcomings in current segmenting LMMs, such as using [SEG] tokens and discarding predicted segmentations, and train several part-centric LMMs to address these issues. They propose PLUM, a novel segmenting LMM that uses span tagging and conditions on prior predictions in a feedback loop. PLUM trained on PARTONOMY achieves stronger performance on reasoning-based segmentation, VQA and visual hallucination benchmarks, opening avenues for more grounded visual understanding in LMMs.

@inproceedings{blume2025partonomy,
  title = {PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding},
  author = {Blume, Ansel and Kim, Jeonghwan and Ha, Hyeonjeong and Chatikyan, Elen and Jin, Xiaomeng and Nguyen, Khanh Duy and Peng, Nanyun and Chang, Kai-Wei and Hoiem, Derek and Ji, Heng},
  booktitle = {NeurIPS},
  year = {2025}
}

3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model

Wenbo Hu, Yining Hong, Yanjun Wang, Leison Gao, Zibu Wei, Xingcheng Yao, Nanyun Peng, Yonatan Bitton, Idan Szpektor, and Kai-Wei Chang, in NeurIPS, 2025.

Full Text Code Abstract BibTeX Details Best Paper at Foundation Models Meet Embodied Agents Workshop at CVPR 2025

Humans excel at performing complex tasks by leveraging long-term memory across temporal and spatial experiences. In contrast, current Large Language Models (LLMs) struggle to plan and act in dynamic, multi-room 3D environments because they lack proper 3D spatial-temporal memory modeling. To address this, the authors introduce 3DMem-Bench, a benchmark with over 26,000 trajectories and 2,892 embodied tasks designed to evaluate an agent’s ability to reason over long-term memory in 3D environments. They then propose 3DLLM-Mem, a dynamic memory management and fusion model for embodied spatial-temporal reasoning. The model uses working-memory tokens to selectively attend to and fuse the most useful spatial and temporal features from episodic memory, enabling agents to focus on task-relevant information while maintaining memory efficiency. Experiments show that 3DLLM-Mem achieves state-of-the-art performance across various tasks, outperforming strong baselines by 16.5% in success rate on the most challenging in-the-wild tasks of 3DMem-Bench.

@inproceedings{hu2025tdllm,
  title = {3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model},
  author = {Hu, Wenbo and Hong, Yining and Wang, Yanjun and Gao, Leison and Wei, Zibu and Yao, Xingcheng and Peng, Nanyun and Bitton, Yonatan and Szpektor, Idan and Chang, Kai-Wei},
  booktitle = {NeurIPS},
  keyword_extra = {AI-agent},
  year = {2025}
}

White Men Lead, Black Women Help? Benchmarking Language Agency Social Biases in LLMs

Yixin Wan and Kai-Wei Chang, in ACL, 2025.

Full Text Abstract BibTeX Details Best paper at TrustNLP workshop at NAACL 2024

Language agency is an important aspect of evaluating social biases in texts. While several studies approached agency-related bias in human-written language, very limited research has investigated such biases in Large Language Model (LLM)-generated content. In addition, previous research often relies on string-matching techniques to identify agentic and communal words within texts, which fall short of accurately classifying language agency. We introduce the novel Language Agency Bias Evaluation (LABE) benchmark, which comprehensively evaluates biases in LLMs by analyzing agency levels attributed to different demographic groups in model generations. LABE leverages 5,400 template-based prompts, an accurate agency classifier, and corresponding bias metrics to test for gender, racial, and intersectional language agency biases in LLMs on 3 text generation tasks: biographies, professor reviews, and reference letters. To build better and more accurate automated agency classifiers, we also contribute and release the Language Agency Classification (LAC) dataset, consisting of 3,724 agentic and communal sentences. Using LABE, we unveil previously under-explored language agency social biases in 3 recent LLMs: ChatGPT, Llama3, and Mistral. We observe that: (1) For the same text category, LLM generations demonstrate higher levels of gender bias than human-written texts; (2) On most generation tasks, models show remarkably higher levels of intersectional bias than the other bias aspects. Those who are at the intersection of gender and racial minority groups – such as Black females – are consistently described by texts with lower levels of agency; (3) Among the 3 LLMs investigated, Llama3 demonstrates greatest overall bias in language agency; (4) Not only does prompt-based mitigation fail to resolve language agency bias in LLMs, but it frequently leads to the exacerbation of biases in generated texts.

@inproceedings{wan2024white,
  title = {White Men Lead, Black Women Help? Benchmarking Language Agency Social Biases in LLMs},
  author = {Wan, Yixin and Chang, Kai-Wei},
  year = {2025},
  booktitle = {ACL}
}

MQuAKE-Remastered: Multi-Hop Knowledge Editing Can Only Be Advanced with Reliable Evaluations

Shaochen Zhong, Yifan Lu, Lize Shao, Bhargav Bhushanam, Xiaocong Du, Yixin Wan, Yucheng Shi, Daochen Zha, Yiwei Wang, Ninghao Liu, Kaixiong Zhou, Shuai Xu, Kai-Wei Chang, Louis Feng, Vipin Chaudhary, and Xia Hu, in ICLR, 2025.

Full Text Code Abstract BibTeX Details Spotlight (top 5% papers)

Large language models (LLMs) can give out erroneous answers to factually rooted questions either as a result of undesired training outcomes or simply because the world has moved on after a certain knowledge cutoff date. Under such scenarios, knowledge editing often comes to the rescue by delivering efficient patches for such erroneous answers without significantly altering the rest, where many editing methods have seen reasonable success when the editing targets are simple and direct (e.g., "What club does Lionel Messi currently play for?"). However, knowledge fragments like this are often deeply intertwined in the real world, making effectively propagating the editing effect to non-directly related questions a practical challenge (to entertain an extreme example: "What car did the wife of the owner of the club that Messi currently plays for used to get to school in the 80s?"). Prior work has termed this task multi-hop knowledge editing, with the most popular dataset being MQuAKE, serving as the sole evaluation benchmark for many later proposed editing methods due to the expensive nature of constructing knowledge editing datasets at scale. In this work, we reveal that up to 33% or 76% of MQuAKE’s questions and ground truth labels are, in fact, corrupted in various fashions due to some unintentional clerical or procedural oversights. Our work provides a detailed audit of MQuAKE’s error pattern and a comprehensive fix without sacrificing its dataset capacity. Additionally, we benchmarked almost all proposed MQuAKE-evaluated editing methods on our post-fix dataset, MQuAKE-Remastered. We observe that many methods try to overfit the original MQuAKE by exploiting some dataset idiosyncrasies of MQuAKE. We provide a guideline on how to approach such datasets faithfully and show that a simple, minimally invasive approach, GWalk, can offer beyond SOTA editing performance without such exploitation. The MQuAKE-Remastered datasets and utilities are available at huggingface.co/datasets/henryzhongsc/MQuAKE-Remastered and github.com/henryzhongsc/MQuAKE-Remastered, respectively.

@inproceedings{zhong2025mquake,
  title = {MQuAKE-Remastered: Multi-Hop Knowledge Editing Can Only Be Advanced with Reliable Evaluations},
  author = {Zhong, Shaochen and Lu, Yifan and Shao, Lize and Bhushanam, Bhargav and Du, Xiaocong and Wan, Yixin and Shi, Yucheng and Zha, Daochen and Wang, Yiwei and Liu, Ninghao and Zhou, Kaixiong and Xu, Shuai and Chang, Kai-Wei and Feng, Louis and Chaudhary, Vipin and Hu, Xia},
  booktitle = {ICLR},
  year = {2025}
}

SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation

Yining Hong, Beide Liu, Maxine Wu, Yuanhao Zhai, Kai-Wei Chang, Linjie Li, Kevin Lin, Chung-Ching Lin, Jianfeng Wang, Zhengyuan Yang, Ying Nian Wu, and Lijuan Wang, in ICLR, 2025.

Full Text Abstract BibTeX Details Spotlight (top 5% papers)

Human beings are endowed with a complementary learning system, which bridges the slow learning of general world dynamics with fast storage of episodic memory from a new experience. Previous video generation models, however, primarily focus on slow learning by pre-training on vast amounts of data, overlooking the fast learning phase crucial for episodic memory storage. This oversight leads to inconsistencies across temporally distant frames when generating longer videos, as these frames fall beyond the model’s context window. To this end, we introduce SlowFast-VGen, a novel dual-speed learning system for action-driven long video generation. Our approach incorporates a masked conditional video diffusion model for the slow learning of world dynamics, alongside an inference-time fast learning strategy based on a temporal LoRA module. Specifically, the fast learning process updates its temporal LoRA parameters based on local inputs and outputs, thereby efficiently storing episodic memory in its parameters. We further propose a slow-fast learning loop algorithm that seamlessly integrates the inner fast learning loop into the outer slow learning loop, enabling the recall of prior multi-episode experiences for context-aware skill learning. To facilitate the slow learning of an approximate world model, we collect a large-scale dataset of 200k videos with language action annotations, covering a wide range of scenarios. Extensive experiments show that SlowFast-VGen outperforms baselines across various metrics for action-driven video generation, achieving an FVD score of 514 compared to 782, and maintaining consistency in longer videos, with an average of 0.37 scene cuts versus 0.89. The slow-fast learning loop algorithm significantly enhances performances on long-horizon planning tasks as well. Project Website: https://slowfast-vgen.github.io

@inproceedings{hong2025slowfast,
  title = {SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation},
  author = {Hong, Yining and Liu, Beide and Wu, Maxine and Zhai, Yuanhao and Chang, Kai-Wei and Li, Linjie and Lin, Kevin and Lin, Chung-Ching and Wang, Jianfeng and Yang, Zhengyuan and Wu, Ying Nian and Wang, Lijuan},
  booktitle = {ICLR},
  year = {2025}
}

Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation

Tharindu Kumarage, Ninareh Mehrabi, Anil Ramakrishna, Xinyan Zhao, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, and Charith Peris, in ACL-Findings, 2025.

Full Text Abstract BibTeX Details

Safety reasoning is a recent paradigm where LLMs reason over safety policies before generating responses, thereby mitigating limitations in existing safety measures such as over-refusal and jailbreak vulnerabilities. However, implementing this paradigm is challenging due to the resource-intensive process of creating high-quality policy-embedded chain-of-thought (CoT) datasets while ensuring reasoning remains accurate and free from hallucinations or policy conflicts. To tackle this, we propose AIDSAFE: Agentic Iterative Deliberation for Safety Reasoning, a novel data generation recipe that leverages multi-agent deliberation to iteratively expand reasoning on safety policies. A data refiner stage in AIDSAFE ensures high-quality outputs by eliminating repetitive, redundant, and deceptive thoughts. AIDSAFE-generated CoTs provide a strong foundation for supervised fine-tuning (SFT)-based safety training. Additionally, to address the need for preference data in alignment stages, such as DPO training, we introduce a supplemental recipe that uses belief augmentation to create distinct selected and rejected CoT samples. Our evaluations demonstrate that AIDSAFE-generated CoTs achieve superior policy adherence and reasoning quality. Consequently, we show that fine-tuning open-source LLMs on these CoTs can significantly improve safety generalization and jailbreak robustness while maintaining acceptable utility and over-refusal accuracy. The AIDSAFE-generated CoT datasets are publicly available on Hugging Face.

@inproceedings{kumarage2025towards,
  title = {Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation},
  author = {Kumarage, Tharindu and Mehrabi, Ninareh and Ramakrishna, Anil and Zhao, Xinyan and Zemel, Richard and Chang, Kai-Wei and Galstyan, Aram and Gupta, Rahul and Peris, Charith},
  booktitle = {ACL-Findings},
  keyword_extra = {AI-agent},
  year = {2025}
}

OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, and Kai-Wei Chang, in NeurIPS, 2025.

Full Text Code Abstract BibTeX Details

OpenVLThinker is among the first open-source large vision-language models (LVLMs) that exhibit sophisticated chain-of-thought reasoning. When reasoning capabilities from text-only models are distilled into LVLMs via supervised fine-tuning (SFT), performance often degrades due to imprecise visual grounding; pure reinforcement-learning (RL) methods suffer from large search spaces that inhibit reflective behaviors in smaller models. The authors find that alternating between SFT and RL markedly improves performance after a few iterations. Initially, the base LVLM seldom exhibits reasoning behaviors, but SFT surfaces these latent actions and narrows the RL search space. Each subsequent RL stage refines the model’s reasoning and provides higher-quality SFT data for further improvement. OpenVLThinker-7B achieves consistent gains across six benchmarks requiring mathematical and general reasoning, improving MathVista by 3.8%, EMMA by 2.4% and HallusionBench by 1.6%, illustrating the synergy between SFT and RL for complex multimodal reasoning. The authors make the code, model and data publicly available.

@inproceedings{deng2025openvlthinker,
  title = {OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles},
  author = {Deng, Yihe and Bansal, Hritik and Yin, Fan and Peng, Nanyun and Wang, Wei and Chang, Kai-Wei},
  booktitle = {NeurIPS},
  year = {2025}
}

LUME: LLM Unlearning with Multitask Evaluations

Anil Ramakrishna, Yixin Wan, Xiaomeng Jin, Kai-Wei Chang, Zhiqi Bu, Bhanukiran Vinzamuri, Volkan Cevher, Mingyi Hong, and Rahul Gupta, in EMNLP-Finding, 2025.

Full Text Code Abstract BibTeX Details

Unlearning aims to remove copyrighted, sensitive, or private content from large language models without full retraining. This paper introduces LUME, a multi-task unlearning benchmark with three tasks: unlearning synthetically generated creative short novels, unlearning synthetic biographies with sensitive information, and unlearning a collection of public biographies. The authors release two fine-tuned language models (1B and 7B parameters) as target models and conduct detailed evaluations of several unlearning algorithms, presenting results on carefully crafted metrics to understand their behavior and limitations.

@inproceedings{ramakrishna2025lume,
  title = {LUME: LLM Unlearning with Multitask Evaluations},
  author = {Ramakrishna, Anil and Wan, Yixin and Jin, Xiaomeng and Chang, Kai-Wei and Bu, Zhiqi and Vinzamuri, Bhanukiran and Cevher, Volkan and Hong, Mingyi and Gupta, Rahul},
  booktitle = {EMNLP-Finding},
  year = {2025}
}

SNaRe: Domain-aware Data Generation for Low-Resource Event Detection

Tanmay Parekh, Yuxuan Dong, Lucas Bandarkar, Artin Kim, I-Hung Hsu, Kai-Wei Chang, and Nanyun Peng, in EMNLP, 2025.

Full Text Code Abstract BibTeX Details

Event detection (ED) is important for reasoning in specialized domains such as biomedicine, law and epidemiology, but existing generation approaches suffer from label noise and domain drift when applied to specialized domains. This paper introduces SNaRe, a domain-aware synthetic data generation framework with three components: Scout, Narrator and Refiner. Scout extracts triggers from unlabeled target domain data and curates a high-quality domain-specific trigger list. Narrator uses these triggers to generate domain-aligned sentences, and Refiner identifies additional event mentions to ensure annotation quality. Experiments on diverse ED datasets show that SNaRe outperforms baselines with 3-7% F1 gains in zero-/few-shot settings and 4-20% improvements in multilingual generation.

@inproceedings{parekh2025snare,
  title = {SNaRe: Domain-aware Data Generation for Low-Resource Event Detection},
  author = {Parekh, Tanmay and Dong, Yuxuan and Bandarkar, Lucas and Kim, Artin and Hsu, I-Hung and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {EMNLP},
  year = {2025}
}

DiCoRe: Enhancing Zero-shot Event Detection via Divergent-Convergent LLM Reasoning

Tanmay Parekh, Kartik Mehta, Ninareh Mehrabi, Kai-Wei Chang, and Nanyun Peng, in EMNLP, 2025.

Full Text Code Abstract BibTeX Details

Zero-shot event detection (ED) identifies event mentions in text without training data, but large language models struggle with complex ontologies and structural constraints. This paper proposes DiCoRe, a divergent-convergent reasoning framework that decouples ED using two modules: a Dreamer that encourages open-ended event discovery to boost coverage and a Grounder that uses finite-state-machine-guided decoding to align predictions with task-specific constraints. An LLM-based judge verifies outputs. Experiments across six datasets, five domains and nine models show that DiCoRe consistently outperforms zero-shot, transfer learning, and reasoning baselines, achieving 4-7% average F1 gains and establishing DiCoRe as a strong zero-shot ED framework.
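
As a rough illustration of what finite-state-machine-guided decoding means in general, the toy sketch below restricts each decoding step to the tokens the current FSM state allows. It is not the DiCoRe implementation: the function `fsm_guided_decode`, the scorer `toy_scores`, the state names, and the event labels are all hypothetical, and a real system would score tokens with an LLM rather than a fixed table.

```python
def fsm_guided_decode(score_fn, transitions, start, accept, max_steps=20):
    """Greedy decoding where only tokens allowed by the FSM state are eligible.

    score_fn(prefix) returns a dict token -> score (a stand-in for model logits).
    transitions maps state -> {allowed_token: next_state}.
    """
    state, output = start, []
    for _ in range(max_steps):
        if state in accept:
            break
        allowed = transitions[state]
        scores = score_fn(output)
        # Pick the best-scoring token among those the FSM permits in this state.
        token = max(allowed, key=lambda t: scores.get(t, float("-inf")))
        output.append(token)
        state = allowed[token]
    return output


# Hypothetical ontology: an event type, a separator, then a trigger word.
TRANSITIONS = {
    "type": {"Attack": "sep", "Transport": "sep"},
    "sep": {":": "trigger"},
    "trigger": {"bombing": "done", "flew": "done"},
}


def toy_scores(prefix):
    # The unconstrained favourite "Explosion" is not in the ontology,
    # so the FSM never lets it be emitted.
    return {"Explosion": 5.0, "Attack": 2.0, "Transport": 1.0,
            ":": 0.1, "bombing": 3.0, "flew": 0.5}


result = fsm_guided_decode(toy_scores, TRANSITIONS, start="type", accept={"done"})
print(result)  # ['Attack', ':', 'bombing']
```

The point of the sketch is that the constraint is applied at selection time, so an off-ontology label can score highest and still never appear in the output.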

@inproceedings{parekh2025dicore,
  title = {DiCoRe: Enhancing Zero-shot Event Detection via Divergent-Convergent LLM Reasoning},
  author = {Parekh, Tanmay and Mehta, Kartik and Mehrabi, Ninareh and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {EMNLP},
  year = {2025}
}


Not Every Token Needs Forgetting: Selective Unlearning to Limit Change in Utility in Large Language Model Unlearning

Yixin Wan, Anil Ramakrishna, Kai-Wei Chang, Volkan Cevher, and Rahul Gupta, in EMNLP-Findings, 2025.

Full Text Abstract BibTeX Details

Large Language Model (LLM) unlearning has recently gained significant attention because of the need to remove unwanted information such as private or copyrighted content. Conventional unlearning approaches indiscriminately update model parameters to forget all tokens, including common tokens that carry general knowledge. This paper highlights that not every token needs forgetting and proposes Selective Unlearning (SU), which identifies a critical subset of tokens within the forgetting set that is relevant to unwanted information and unlearns only those tokens. Experiments on two benchmarks and six baseline unlearning algorithms show that SU achieves effective unlearning on the targeted forget data while significantly preserving the model’s utility on the retain set.

@inproceedings{wan2025not,
  title = {Not Every Token Needs Forgetting: Selective Unlearning to Limit Change in Utility in Large Language Model Unlearning},
  author = {Wan, Yixin and Ramakrishna, Anil and Chang, Kai-Wei and Cevher, Volkan and Gupta, Rahul},
  booktitle = {EMNLP-Findings},
  year = {2025}
}


Where Fact Ends and Fairness Begins: Redefining AI Bias Evaluation through Cognitive Biases

Jen-tse Huang, Yuhang Yan, Linqi Liu, Yixin Wan, Wenxuan Wang, Kai-Wei Chang, and Michael R. Lyu, in EMNLP-Findings, 2025.

Full Text Code Abstract BibTeX Details

Instances such as misrepresentative images generated by AI illustrate how outputs can be factually plausible yet socially harmful. Existing fairness benchmarks conflate factual correctness and normative fairness, leading to ambiguous evaluations. This paper argues for distinguishing fact and fairness when assessing bias and introduces the Fact-or-Fair benchmark containing objective queries aligned with fact-based judgments and subjective queries aligned with fairness-based judgments. The queries draw on biases studied in cognitive psychology, and experiments across frontier models reveal different fact-fair trade-offs. The authors provide both a theoretical lens and a practical benchmark to advance responsible model evaluation.

@inproceedings{huang2025where,
  title = {Where Fact Ends and Fairness Begins: Redefining AI Bias Evaluation through Cognitive Biases},
  author = {Huang, Jen-tse and Yan, Yuhang and Liu, Linqi and Wan, Yixin and Wang, Wenxuan and Chang, Kai-Wei and Lyu, Michael R.},
  booktitle = {EMNLP-Findings},
  year = {2025}
}


When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning

Nishad Singhi, Hritik Bansal, Arian Hosseini, Aditya Grover, Kai-Wei Chang, Marcus Rohrbach, and Anna Rohrbach, in COLM, 2025.

Full Text Code Abstract BibTeX Details

Scaling test-time compute has emerged as a key strategy for enhancing the reasoning capabilities of large language models (LLMs), particularly in tasks like mathematical problem-solving. A traditional approach, Self-Consistency (SC), generates multiple solutions to a problem and selects the most common answer via majority voting. Another common method involves scoring each solution with a reward model (verifier) and choosing the best one. Recent advancements in Generative Reward Models (GenRM) reframe verification as a next-token prediction task, enabling inference-time scaling along a new axis. Specifically, GenRM generates multiple verification chains-of-thought to score each solution. Under a limited inference budget, this introduces a fundamental trade-off: should you spend the budget on scaling solutions via SC or generate fewer solutions and allocate compute to verification via GenRM? To address this, we evaluate GenRM against SC under a fixed inference budget. Interestingly, we find that SC is more compute-efficient than GenRM for most practical inference budgets across diverse models and datasets. For instance, GenRM first matches SC after consuming up to 8x the inference compute and requires significantly more compute to outperform it. Furthermore, we derive inference scaling laws for the GenRM paradigm, revealing that compute-optimal inference favors scaling solution generation more aggressively than scaling the number of verifications. Our work provides practical guidance on optimizing test-time scaling by balancing solution generation and verification.

@inproceedings{singhi2025solve,
  title = {When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning},
  author = {Singhi, Nishad and Bansal, Hritik and Hosseini, Arian and Grover, Aditya and Chang, Kai-Wei and Rohrbach, Marcus and Rohrbach, Anna},
  booktitle = {COLM 2025},
  year = {2025}
}


A Meta-Evaluation of Measuring LLM Misgendering

Arjun Subramonian, Vagrant Gautam, Preethi Seshadri, Dietrich Klakow, Kai-Wei Chang, and Yizhou Sun, in COLM, 2025.

Full Text Abstract BibTeX Details

Numerous methods have been proposed to measure LLM misgendering, including probability-based evaluations (e.g., automatically with templatic sentences) and generation-based evaluations (e.g., with automatic heuristics or human validation). However, it has gone unexamined whether these evaluation methods have convergent validity, that is, whether their results align. Therefore, we conduct a systematic meta-evaluation of these methods across three existing datasets for LLM misgendering. We propose a method to transform each dataset to enable parallel probability- and generation-based evaluation. Then, by automatically evaluating a suite of 6 models from 3 families, we find that these methods can disagree with each other at the instance, dataset, and model levels, conflicting on 20.2% of evaluation instances. Finally, with a human evaluation of 2400 LLM generations, we show that misgendering behaviour is complex and goes far beyond pronouns, which automatic evaluations are not currently designed to capture, suggesting essential disagreement with human evaluations. Based on our findings, we provide recommendations for future evaluations of LLM misgendering. Our results are also more widely relevant, as they call into question broader methodological conventions in LLM evaluation, which often assume that different evaluation methods agree.

@inproceedings{subramonian2025meta,
  title = {A Meta-Evaluation of Measuring LLM Misgendering},
  author = {Subramonian, Arjun and Gautam, Vagrant and Seshadri, Preethi and Klakow, Dietrich and Chang, Kai-Wei and Sun, Yizhou},
  booktitle = {COLM 2025},
  year = {2025}
}


Customize Multi-modal RAI Guardrails with Precedent-based predictions

Cheng-Fu Yang, Thanh Tran, Christos Christodoulopoulos, Weitong Ruan, Rahul Gupta, and Kai-Wei Chang, in COLM, 2025.

Full Text Abstract BibTeX Details

A multi-modal guardrail must effectively filter image content based on user-defined policies, identifying material that may be hateful, reinforce harmful stereotypes, contain explicit material, or spread misinformation. Deploying such guardrails in real-world applications, however, poses significant challenges. Users often require varied and highly customizable policies and typically cannot provide abundant examples for each custom policy. Consequently, an ideal guardrail should be scalable to multiple policies and adaptable to evolving user standards with minimal retraining. Existing fine-tuning methods typically condition predictions on pre-defined policies, restricting their generalizability to new policies or necessitating extensive retraining to adapt. Conversely, training-free methods struggle with limited context lengths, making it difficult to incorporate all the policies comprehensively. To overcome these limitations, we propose to condition the model’s judgment on "precedents", which are the reasoning processes of prior data points similar to the given input. By leveraging precedents instead of fixed policies, our approach greatly enhances the flexibility and adaptability of the guardrail. In this paper, we introduce a critique-revise mechanism for collecting high-quality precedents and two strategies that utilize precedents for robust prediction. Experimental results demonstrate that our approach outperforms previous methods across both few-shot and full-dataset scenarios and exhibits superior generalization to novel policies.

@inproceedings{yang2025customize,
  title = {Customize Multi-modal RAI Guardrails with Precedent-based predictions},
  author = {Yang, Cheng-Fu and Tran, Thanh and Christodoulopoulos, Christos and Ruan, Weitong and Gupta, Rahul and Chang, Kai-Wei},
  booktitle = {COLM 2025},
  year = {2025}
}


X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents

Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, and Saadia Gabriel, in COLM, 2025.

Full Text Code Abstract BibTeX Details

Multi-turn interactions with language models (LMs) pose critical safety risks, as harmful intent can be strategically spread across exchanges. Yet, the vast majority of prior work has focused on single-turn safety, while adaptability and diversity remain among the key challenges of multi-turn red-teaming. To address these challenges, we present X-Teaming, a scalable framework that systematically explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios. X-Teaming employs collaborative agents for planning, attack optimization, and verification, achieving state-of-the-art multi-turn jailbreak effectiveness and diversity with success rates up to 98.1% across representative leading open-weight and closed-source models. In particular, X-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model, which has been considered nearly immune to single-turn attacks. Building on X-Teaming, we introduce XGuard-Train, an open-source multi-turn safety training dataset that is 20x larger than the previous best resource, comprising 30K interactive jailbreaks, designed to enable robust multi-turn safety alignment for LMs. Our work offers essential tools and insights for mitigating sophisticated conversational attacks, advancing the multi-turn safety of LMs.

@inproceedings{rahman2025xteaming,
  title = {X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents},
  author = {Rahman, Salman and Jiang, Liwei and Shiffer, James and Liu, Genglin and Issaka, Sheriff and Parvez, Md Rizwan and Palangi, Hamid and Chang, Kai-Wei and Choi, Yejin and Gabriel, Saadia},
  booktitle = {COLM 2025},
  keyword_extra = {AI-agent},
  year = {2025}
}


Verbalized Representation Learning for Interpretable Few-Shot Generalization

Cheng-Fu Yang, Da Yin, Wenbo Hu, Heng Ji, Nanyun Peng, Bolei Zhou, and Kai-Wei Chang, in ICCV, 2025.

Full Text Code Abstract BibTeX Details

Humans recognize objects after observing only a few examples, a remarkable capability enabled by their inherent language understanding of the real-world environment. Developing verbalized and interpretable representations can significantly improve model generalization in low-data settings. In this work, we propose Verbalized Representation Learning (VRL), a novel approach for automatically extracting human-interpretable features for object recognition using few-shot data. Our method uniquely captures inter-class differences and intra-class commonalities in the form of natural language by employing a Vision-Language Model (VLM) to identify key discriminative features between different classes and shared characteristics within the same class. These verbalized features are then mapped to numeric vectors through the VLM. The resulting feature vectors can be further utilized to train and infer with downstream classifiers. Experimental results show that, at the same model scale, VRL achieves a 24% absolute improvement over prior state-of-the-art methods while using 95% less data and a smaller model. Furthermore, compared to human-labeled attributes, the features learned by VRL exhibit a 20% absolute gain when used for downstream classification tasks.

@inproceedings{yang2025verbalized,
  title = {Verbalized Representation Learning for Interpretable Few-Shot Generalization},
  author = {Yang, Cheng-Fu and Yin, Da and Hu, Wenbo and Ji, Heng and Peng, Nanyun and Zhou, Bolei and Chang, Kai-Wei},
  booktitle = {ICCV},
  year = {2025}
}


STIV: Scalable Text and Image Conditioned Video Generation

Zongyu Lin, Wei Liu, Chen Chen, Jiasen Lu, Wenze Hu, Tsu-Jui Fu, Jesse Allardice, Zhengfeng Lai, Liangchen Song, Bowen Zhang, Cha Chen, Yiran Fei, Lezhi Li, Yizhou Sun, Kai-Wei Chang, and Yinfei Yang, in ICCV, 2025.

Full Text Abstract BibTeX Details

We present a simple and scalable text and image conditioned video generation method. Our approach, named STIV, integrates a variable number of image conditions into a Diffusion Transformer (DiT) through frame replacement. This design enables STIV to perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks simultaneously, as well as long video generation through autoregressive rollouts. Additionally, STIV can be easily extended to various applications, such as video prediction, frame interpolation, and multi-view generation. With comprehensive ablation studies on T2I, T2V, TI2V, and long video generation, STIV demonstrates strong performance despite its simple design. An 8.7B model at 512² resolution achieves 83.1 on VBench T2V, surpassing both leading open and closed-source models like CogVideoX-5B, Pika, Kling, and Gen-3. The same-sized model also achieves a state-of-the-art result of 90.1 on the VBench I2V task at 512² resolution. Combining all of these, we finally scale up our model to 540p with over 200 frames. By providing a transparent recipe for building cutting-edge video generation models, we aim to empower future research and accelerate progress for video generation.

@inproceedings{lin2025stiv,
  title = {STIV: Scalable Text and Image Conditioned Video Generation},
  author = {Lin, Zongyu and Liu, Wei and Chen, Chen and Lu, Jiasen and Hu, Wenze and Fu, Tsu-Jui and Allardice, Jesse and Lai, Zhengfeng and Song, Liangchen and Zhang, Bowen and Chen, Cha and Fei, Yiran and Li, Lezhi and Sun, Yizhou and Chang, Kai-Wei and Yang, Yinfei},
  booktitle = {ICCV},
  year = {2025}
}


Contradiction Retrieval via Contrastive Learning with Sparsity

Haike Xu, Zongyu Lin, Kai-Wei Chang, Yizhou Sun, and Piotr Indyk, in ICML, 2025.

Full Text Abstract BibTeX Details

Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query, which is important to many downstream applications like fact checking and data cleaning. To retrieve arguments that contradict the query from large document corpora, existing methods such as similarity search and cross-encoder models exhibit significant limitations. The former struggles to capture the essence of contradiction due to its inherent nature of favoring similarity, while the latter suffers from computational inefficiency, especially when the size of corpora is large. To address these challenges, we introduce a novel approach, SparseCL, that leverages specially trained sentence embeddings designed to preserve subtle, contradictory nuances between sentences. Our method utilizes a combined metric of cosine similarity and a sparsity function to efficiently identify and retrieve documents that contradict a given query. This approach dramatically enhances the speed of contradiction detection by reducing the need for exhaustive document comparisons to simple vector calculations. We validate our model using the Arguana dataset, a benchmark dataset specifically geared towards contradiction retrieval, as well as synthetic contradictions generated from the MSMARCO and HotpotQA datasets using GPT-4. Our experiments demonstrate the efficacy of our approach not only in contradiction retrieval with more than 30% accuracy improvements on MSMARCO and HotpotQA across different model architectures but also in applications such as cleaning corrupted corpora to restore high-quality QA retrieval. This paper outlines a promising direction for improving the accuracy and efficiency of contradiction retrieval in large-scale text corpora.

@inproceedings{xu2025contradiction,
  title = {Contradiction Retrieval via Contrastive Learning with Sparsity},
  author = {Xu, Haike and Lin, Zongyu and Chang, Kai-Wei and Sun, Yizhou and Indyk, Piotr},
  booktitle = {ICML},
  year = {2025}
}


QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search

Zongyu Lin, Yao Tang, Xingcheng Yao, Da Yin, Ziniu Hu, Yizhou Sun, and Kai-Wei Chang, in ICML, 2025.

Full Text Abstract BibTeX Details

Language agents have become a promising solution to complex interactive tasks. One of the key ingredients to the success of language agents is the reward model on the trajectory of the agentic workflow, which provides valuable guidance during training or inference. However, due to the lack of annotations of intermediate interactions, most existing works use an outcome reward model to optimize policies across entire trajectories. This may lead to sub-optimal policies and hinder the overall performance. To address this, we propose QLASS (Q-guided Language Agent Stepwise Search), which automatically generates annotations by estimating Q-values in a stepwise manner for open language agents. By introducing a reasoning tree and performing process reward modeling, QLASS provides effective intermediate guidance for each step. With the stepwise guidance, we propose a Q-guided generation strategy to enable language agents to better adapt to long-term value, resulting in significant performance improvement during model inference on complex interactive agent tasks. Notably, even with almost half the annotated data, QLASS retains strong performance, demonstrating its efficiency in handling limited supervision. We also empirically demonstrate that QLASS can lead to more effective decision making through qualitative analysis. We will release our code and data.

@inproceedings{lin2025qlass,
  title = {QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search},
  author = {Lin, Zongyu and Tang, Yao and Yao, Xingcheng and Yin, Da and Hu, Ziniu and Sun, Yizhou and Chang, Kai-Wei},
  booktitle = {ICML},
  keyword_extra = {AI-agent},
  year = {2025}
}


Contrastive Visual Data Augmentation

Yu Zhou, Bingxuan Li, Tang Mohan, Xiaomeng Jin, Te-Lin Wu, Kuan-Hao Huang, Heng Ji, Kai-Wei Chang, and Nanyun Peng, in ICML, 2025.

Full Text Abstract BibTeX Details

Large multimodal models (LMMs) often struggle to recognize novel concepts, as they rely on pre-trained knowledge and have limited ability to capture subtle visual details. Domain-specific knowledge gaps in training also make them prone to confusing visually similar, commonly misrepresented, or low-resource concepts. To help LMMs better align nuanced visual features with language, improving their ability to recognize and reason about novel or rare concepts, we propose a Contrastive visual Data Augmentation (CoDA) strategy. CoDA extracts key contrastive textual and visual features of target concepts against the known concepts they are misrecognized as, and then uses multimodal generative models to produce targeted synthetic data. Automatic filtering of extracted features and augmented images is implemented to guarantee their quality, as verified by human annotators. We show the effectiveness and efficiency of CoDA on low-resource concept and diverse scene recognition datasets including iNaturalist and SUN. We additionally collect NovelSpecies, a benchmark dataset consisting of newly discovered animal species that are guaranteed to be unseen by LMMs. LLaVA-1.6 1-shot updating results on these three datasets show CoDA significantly improves over SOTA visual data augmentation strategies, with absolute accuracy gains of 12.3% (NovelSpecies), 5.1% (SUN), and 6.0% (iNat).

@inproceedings{zhou2025contrastive,
  title = {Contrastive Visual Data Augmentation},
  author = {Zhou, Yu and Li, Bingxuan and Mohan, Tang and Jin, Xiaomeng and Wu, Te-Lin and Huang, Kuan-Hao and Ji, Heng and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {ICML},
  year = {2025}
}


SYNTHIA: Novel Concept Design with Affordance Composition

Hyeonjeong Ha, Xiaomeng Jin, Jeonghwan Kim, Jiateng Liu, Zhenhailong Wang, Khanh Duy Nguyen, Ansel Blume, Nanyun Peng, Kai-Wei Chang, and Heng Ji, in ACL, 2025.

Full Text Code Abstract BibTeX Details

Text-to-image (T2I) models enable rapid concept design, making them widely used in AI-driven design. While recent studies focus on generating semantic and stylistic variations of given design concepts, functional coherence (the integration of multiple affordances into a single coherent concept) remains largely overlooked. In this paper, we introduce SYNTHIA, a framework for generating novel, functionally coherent designs based on desired affordances. Our approach leverages a hierarchical concept ontology that decomposes concepts into parts and affordances, serving as a crucial building block for functionally coherent design. We also develop a curriculum learning scheme based on our ontology that contrastively fine-tunes T2I models to progressively learn affordance composition while maintaining visual novelty. To elaborate, we (i) gradually increase affordance distance, guiding models from basic concept-affordance association to complex affordance compositions that integrate parts of distinct affordances into a single, coherent form, and (ii) enforce visual novelty by employing contrastive objectives to push learned representations away from existing concepts. Experimental results show that SYNTHIA outperforms state-of-the-art T2I models, demonstrating absolute gains of 25.1% and 14.7% for novelty and functional coherence in human evaluation, respectively.

@inproceedings{ha2025synthia,
  title = {SYNTHIA: Novel Concept Design with Affordance Composition},
  author = {Ha, Hyeonjeong and Jin, Xiaomeng and Kim, Jeonghwan and Liu, Jiateng and Wang, Zhenhailong and Nguyen, Khanh Duy and Blume, Ansel and Peng, Nanyun and Chang, Kai-Wei and Ji, Heng},
  booktitle = {ACL},
  year = {2025}
}


The Male CEO and the Female Assistant: Evaluation and Mitigation of Gender Biases in Text-To-Image Generation of Dual Subjects

Yixin Wan and Kai-Wei Chang, in ACL, 2025.

Full Text Abstract BibTeX Details

Recent large-scale T2I models like DALLE-3 have made progress in reducing gender stereotypes when generating single-person images. However, significant biases remain when generating images with more than one person. To systematically evaluate this, we propose the Paired Stereotype Test (PST) framework, which queries T2I models to depict two individuals assigned male-stereotyped and female-stereotyped social identities, respectively (e.g., "a CEO" and "an Assistant"). This contrastive setting often triggers T2I models to generate gender-stereotyped images. Using PST, we evaluate two aspects of gender bias: the well-known bias in gendered occupation and a novel aspect, bias in organizational power. Experiments show that over 74% of images generated by DALLE-3 display gender-occupational biases. Additionally, compared to single-person settings, DALLE-3 is more likely to perpetuate male-associated stereotypes under PST. We further propose FairCritic, a novel and interpretable framework that leverages an LLM-based critic model to i) detect bias in generated images, and ii) adaptively provide feedback to T2I models for improving fairness. FairCritic achieves near-perfect fairness on PST, overcoming the limitations of previous prompt-based intervention approaches.

@inproceedings{wan2025male,
  title = {The Male CEO and the Female Assistant: Evaluation and Mitigation of Gender Biases in Text-To-Image Generation of Dual Subjects},
  author = {Wan, Yixin and Chang, Kai-Wei},
  booktitle = {ACL},
  year = {2025}
}


Magnet: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation

Fan Yin, Zifeng Wang, I-Hung Hsu, Jun Yan, Ke Jiang, Yanfei Chen, Jindong Gu, Long Le, Kai-Wei Chang, Chen-Yu Lee, Hamid Palangi, and Tomas Pfister, in ACL, 2025.

Full Text Abstract BibTeX Details

Large language models (LLMs) have exhibited the ability to effectively utilize external tools to address user queries. However, their performance may be limited in complex, multi-turn interactions involving users and multiple tools. To address this, we propose Magnet, a principled framework for synthesizing high-quality training trajectories to enhance the function calling capability of large language model agents in multi-turn conversations with humans. The framework is based on automatic and iterative translations from a function signature path to a sequence of queries and executable function calls. We model the complicated function interactions in multi-turn cases with a graph and design novel node operations to build reliable signature paths. Motivated by context distillation, when guiding the generation of positive and negative trajectories using a teacher model, we provide reference function call sequences as positive hints in context and contrastive, incorrect function calls as negative hints. Experiments show that, by training on the positive trajectories with supervised fine-tuning and applying preference optimization against the negative trajectories, our 14B model, Magnet-14B-mDPO, obtains 68.01 on BFCL-v3 and 73.30 on ToolQuery, surpassing the performance of the teacher model Gemini-1.5-pro-002 by a large margin in function calling.

@inproceedings{yin2025magnet,
  title = {Magnet: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation},
  author = {Yin, Fan and Wang, Zifeng and Hsu, I-Hung and Yan, Jun and Jiang, Ke and Chen, Yanfei and Gu, Jindong and Le, Long and Chang, Kai-Wei and Lee, Chen-Yu and Palangi, Hamid and Pfister, Tomas},
  booktitle = {ACL},
  keyword_extra = {AI-agent},
  year = {2025}
}


Vulnerability of LLMs to Vertically Aligned Text Manipulations

Zhecheng Li, Yiwei Wang, Bryan Hooi, Yujun Cai, Zhen Xiong, Nanyun Peng, and Kai-Wei Chang, in ACL, 2025.

Full Text Abstract BibTeX Details

Text classification involves categorizing a given text, such as determining its sentiment or identifying harmful content. With the advancement of large language models (LLMs), these models have become highly effective at performing text classification tasks. However, they still show vulnerabilities to variations in text formatting. Recent research demonstrates that modifying input formats, such as vertically aligning words for encoder-based models, can substantially lower accuracy in text classification tasks. While easily understood by humans, these inputs can significantly mislead models, posing a potential risk of bypassing detection in real-world scenarios involving harmful or sensitive information. With the expanding application of LLMs, a crucial question arises: Do decoder-based LLMs exhibit similar vulnerabilities to vertically formatted text input? In this paper, we investigate the impact of vertical text input on the performance of various LLMs across multiple text classification datasets and analyze the underlying causes. Our findings are as follows: (i) Vertical text input significantly degrades the accuracy of LLMs in text classification tasks. (ii) Chain of Thought (CoT) reasoning does not help LLMs recognize vertical input or mitigate its vulnerability, but few-shot learning with careful analysis does. (iii) We explore the underlying cause of the vulnerability by analyzing the inherent issues in tokenization and attention matrices.

@inproceedings{li2025vulnerability,
  title = {Vulnerability of LLMs to Vertically Aligned Text Manipulations},
  author = {Li, Zhecheng and Wang, Yiwei and Hooi, Bryan and Cai, Yujun and Xiong, Zhen and Peng, Nanyun and Chang, Kai-Wei},
  booktitle = {ACL},
  year = {2025}
}


METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling

Bingxuan Li, Yiwei Wang, Jiuxiang Gu, Kai-Wei Chang, and Nanyun Peng, in ACL, 2025.

Full Text Code Abstract BibTeX Details

Chart generation aims to generate code to produce charts satisfying the desired visual properties, e.g., texts, layout, color, and type. It has great potential to empower automatic professional report generation in financial analysis, research presentation, education, and healthcare. In this work, we build a vision-language model (VLM) based multi-agent framework for effective automatic chart generation. Generating high-quality charts requires both strong visual design skills and precise coding capabilities that embed the desired visual properties into code. Such a complex multi-modal reasoning process is difficult for direct prompting of VLMs. To resolve these challenges, we propose METAL, a multi-agent framework that decomposes the task of chart generation into the iterative collaboration among specialized agents. METAL achieves a 5.2% improvement over the current best result in the chart generation task. The METAL framework exhibits the phenomenon of test-time scaling: its performance increases monotonically with the logarithm of the computational budget as the budget grows from 512 to 8192 tokens. In addition, we find that separating different modalities during the critique process of METAL boosts the self-correction capability of VLMs in the multimodal context.

@inproceedings{li2025metal,
  title = {METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling},
  author = {Li, Bingxuan and Wang, Yiwei and Gu, Jiuxiang and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {ACL},
  keyword_extra = {AI-agent},
  year = {2025}
}


V-ALPHASOCIAL: Benchmark and Self-Reflective Chain-of-Thought Generation for Visual Social Commonsense Reasoning

Zongyu Lin, Zhikun Xu, Xiaohan Song, Yixin Wan, Xingcheng Yao, Tsung-Han Lin, Selina Song, Pranav Subbaraman, Ben Zhou, Kai-Wei Chang, and Yizhou Sun, in ACL-Findings, 2025.

Full Text Abstract BibTeX Details

Social commonsense reasoning naturally involves both the verbal and non-verbal cues of a social interaction. It is important for Large Vision-Language Models (VLMs) to leverage both textual and visual information in performing tasks like social understanding and reasoning. However, while current LLMs have shown good social reasoning capabilities in textual context, whether they can effectively incorporate visual information in social comprehension remains under-explored. To narrow the gap, we first construct and propose a benchmark: V-Social, featuring well-aligned text and visual content, tailored to assess visual social commonsense for multimodal foundation models. Through experimenting with V-Social, we find that even the most advanced VLM, GPT-4o, often falls short in social commonsense reasoning. This highlights the critical need to enhance the social grounding of VLMs. One major obstacle to improving this is the lack of high-quality data with good reasoning processes. To overcome this obstacle, we introduce V-AlphaSocial, a novel method that generates high-quality chain-of-thought reasoning paths from unlabeled data. We design a visual reasoning reward model to improve the VLM, and then iteratively refine both the VLM and the reward model. Our extensive analysis showcases how our method enhances social commonsense reasoning, proposing an effective approach that facilitates deeper exploration into the field.

@inproceedings{lin2025valphasocial,
  title = {V-ALPHASOCIAL: Benchmark and Self-Reflective Chain-of-Thought Generation for Visual Social Commonsense Reasoning},
  author = {Lin, Zongyu and Xu, Zhikun and Song, Xiaohan and Wan, Yixin and Yao, Xingcheng and Lin, Tsung-Han and Song, Selina and Subbaraman, Pranav and Zhou, Ben and Chang, Kai-Wei and Sun, Yizhou},
  booktitle = {ACL-Findings},
  year = {2025}
}

DRS: Deep Question Reformulation With Structured Output

Zhecheng Li, Yiwei Wang, Bryan Hooi, Yujun Cai, Nanyun Peng, and Kai-Wei Chang, in ACL-Findings, 2025.

Question answering represents a core capability of large language models (LLMs). However, when individuals encounter unfamiliar knowledge in texts, they often formulate questions that the text itself cannot answer due to insufficient understanding of the underlying information. Recent studies reveal that while LLMs can detect unanswerable questions, they struggle to assist users in reformulating these questions. Even advanced models like GPT-3.5 demonstrate limited effectiveness in this regard. To address this limitation, we propose DRS: Deep Question Reformulation with Structured Output, a novel zero-shot method aimed at enhancing LLMs' ability to assist users in reformulating questions to extract relevant information from new documents. DRS combines the strengths of LLMs with a DFS-based algorithm to iteratively explore potential entity combinations and constrain outputs using predefined entities. This structured approach significantly enhances the reformulation capabilities of LLMs. Comprehensive experimental evaluations demonstrate that DRS improves the reformulation accuracy of GPT-3.5 from 23.03% to 70.42%, while also enhancing the performance of open-source models, such as Gemma2-9B, from 26.35% to 56.75%.

@inproceedings{li2025drs,
  title = {DRS: Deep Question Reformulation With Structured Output},
  author = {Li, Zhecheng and Wang, Yiwei and Hooi, Bryan and Cai, Yujun and Peng, Nanyun and Chang, Kai-Wei},
  booktitle = {ACL-Findings},
  year = {2025}
}

Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization

Hritik Bansal, Ashima Suvarna, Gantavya Bhatt, Nanyun Peng, Kai-Wei Chang, and Aditya Grover, in ACL-Findings, 2025.

A common technique for aligning large language models (LLMs) relies on acquiring human preferences by comparing multiple generations conditioned on a fixed context. This method, however, relies solely on pairwise comparisons, where the generations are evaluated within an identical context. While effective, such conditional preferences often fail to encompass the nuanced and multidimensional nature of human preferences. In this work, we revisit the traditional paradigm of preference acquisition and propose a new axis based on eliciting preferences jointly over the instruction-response pairs. Unlike prior preference optimizations, which are designed for conditional ranking protocols (e.g., DPO), we propose Joint Preference Optimization (JPO), a new preference optimization objective that upweights the joint probability of the chosen instruction-response pair over the rejected instruction-response pair. Interestingly, LLMs trained with joint instruction-response preference data using JPO outperform LLMs trained with DPO by 5.2% and 3.3% win-rate for summarization and open-ended dialogue datasets, respectively. Our findings reveal that joint preferences over instruction and response pairs can significantly enhance the alignment of LLMs by tapping into a broader spectrum of human preference elicitation.

@inproceedings{bansal2025comparing,
  title = {Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization},
  author = {Bansal, Hritik and Suvarna, Ashima and Bhatt, Gantavya and Peng, Nanyun and Chang, Kai-Wei and Grover, Aditya},
  booktitle = {ACL-Findings},
  year = {2025}
}

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu, in ICLR, 2025.

Recent large language model (LLM)-driven chat assistant systems have integrated memory components to track user-assistant chat histories, enabling more accurate and personalized responses. However, their long-term memory capabilities in sustained interactions remain underexplored. We introduce LongMemEval, a comprehensive benchmark designed to evaluate five core long-term memory abilities of chat assistants: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. With 500 meticulously curated questions embedded within freely scalable user-assistant chat histories, LongMemEval presents a significant challenge to existing long-term memory systems, with commercial chat assistants and long-context LLMs showing a 30% accuracy drop on memorizing information across sustained interactions. We then present a unified framework that breaks down the long-term memory design into three stages: indexing, retrieval, and reading. Built upon key experimental insights, we propose several memory design optimizations including session decomposition for value granularity, fact-augmented key expansion for indexing, and time-aware query expansion for refining the search scope. Extensive experiments show that these optimizations greatly improve both memory recall and downstream question answering on LongMemEval. Overall, our study provides valuable resources and guidance for advancing the long-term memory capabilities of LLM-based chat assistants, paving the way toward more personalized and reliable conversational AI.

@inproceedings{wu2025longmemeval,
  title = {LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory},
  author = {Wu, Di and Wang, Hongwei and Yu, Wenhao and Zhang, Yuwei and Chang, Kai-Wei and Yu, Dong},
  booktitle = {ICLR},
  keyword_extra = {AI-agent},
  year = {2025}
}

Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models

Shuyang Hao, Bryan Hooi, Jun Liu, Kai-Wei Chang, Zi Huang, and Yujun Cai, in CVPR, 2025.

Despite inheriting security measures from underlying language models, Vision-Language Models (VLMs) may still be vulnerable to safety alignment issues. Through empirical analysis, we uncover two critical findings: scenario-matched images can significantly amplify harmful outputs, and contrary to common assumptions in gradient-based attacks, minimal loss values do not guarantee optimal attack effectiveness. Building on these insights, we introduce MLAI (Multi-Loss Adversarial Images), a novel jailbreak framework that leverages scenario-aware image generation for semantic alignment, exploits flat minima theory for robust adversarial image selection, and employs multi-image collaborative attacks for enhanced effectiveness. Extensive experiments demonstrate MLAI’s significant impact, achieving attack success rates of 77.75% on MiniGPT-4 and 82.80% on LLaVA-2, substantially outperforming existing methods by margins of 34.37% and 12.77% respectively. Furthermore, MLAI shows considerable transferability to commercial black-box VLMs, achieving up to 60.11% success rate. Our work reveals fundamental visual vulnerabilities in current VLMs’ safety mechanisms and underscores the need for stronger defenses. Warning: This paper contains potentially harmful example text.

@inproceedings{hao2025exploring,
  title = {Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models},
  author = {Hao, Shuyang and Hooi, Bryan and Liu, Jun and Chang, Kai-Wei and Huang, Zi and Cai, Yujun},
  booktitle = {CVPR},
  year = {2025}
}

VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning

Xueqing Wu, Yuheng Ding, Bingxuan Li, Pan Lu, Da Yin, Kai-Wei Chang, and Nanyun Peng, in CVPR, 2025.

The ability of large vision-language models (LVLMs) to critique and correct their reasoning is an essential building block towards their self-improvement. However, a systematic analysis of such capabilities in LVLMs is still lacking. We propose VISCO, the first benchmark to extensively analyze the fine-grained critique and correction capabilities of LVLMs. Compared to existing work that uses a single scalar value to critique the entire reasoning [4], VISCO features dense and fine-grained critique, requiring LVLMs to evaluate the correctness of each step in the chain-of-thought and provide natural language explanations to support their judgments. Extensive evaluation of 24 LVLMs demonstrates that human-written critiques significantly enhance the performance after correction, showcasing the potential of the self-improvement strategy. However, the model-generated critiques are less helpful and sometimes detrimental to the performance, suggesting that critique is the crucial bottleneck. We identified three common patterns in critique failures: failure to critique visual perception, reluctance to "say no", and exaggerated assumption of error propagation. To address these issues, we propose an effective LookBack strategy that revisits the image to verify each piece of information in the initial reasoning. LookBack significantly improves critique and correction performance by up to 13.5%.

@inproceedings{wu2025visco,
  title = {VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning},
  author = {Wu, Xueqing and Ding, Yuheng and Li, Bingxuan and Lu, Pan and Yin, Da and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {CVPR},
  year = {2025}
}

VideoPhy: Evaluating Physical Commonsense for Video Generation

Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover, in ICLR, 2025.

Recent advances in internet-scale video data pretraining have led to the development of text-to-video generative models that can create high-quality videos across a broad range of visual concepts, synthesize realistic motions and render complex objects. Hence, these generative models have the potential to become general-purpose simulators of the physical world. However, it is unclear how far we are from this goal with the existing text-to-video generative models. To this end, we present VideoPhy, a benchmark designed to assess whether the generated videos follow physical commonsense for real-world activities (e.g., marbles will roll down when placed on a slanted surface). Specifically, we curate diverse prompts that involve interactions between various material types in the physical world (e.g., solid-solid, solid-fluid, fluid-fluid). We then generate videos conditioned on these captions from diverse state-of-the-art text-to-video generative models, including open models (e.g., CogVideoX) and closed models (e.g., Lumiere, Dream Machine). Our human evaluation reveals that the existing models severely lack the ability to generate videos adhering to the given text prompts, while also lacking physical commonsense. Specifically, the best performing model, CogVideoX-5B, generates videos that adhere to the caption and physical laws for 39.6% of the instances. VideoPhy thus highlights that the video generative models are far from accurately simulating the physical world. Finally, we propose an auto-evaluator, VideoCon-Physics, to assess the performance reliably for the newly released models.

@inproceedings{bansal2025videophy,
  title = {VideoPhy: Evaluating Physical Commonsense for Video Generation},
  author = {Bansal, Hritik and Lin, Zongyu and Xie, Tianyi and Zong, Zeshun and Yarom, Michal and Bitton, Yonatan and Jiang, Chenfanfu and Sun, Yizhou and Chang, Kai-Wei and Grover, Aditya},
  booktitle = {ICLR},
  year = {2025}
}

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

Fei Wang, Xingyu Fu, James Y. Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, Tianyi Lorena Yan, Wenjie Jacky Mo, Hsiang-Hui Liu, Pan Lu, Chunyuan Li, and others, in ICLR, 2025.

We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations). Comprising 11,264 images and 2,600 multiple-choice questions, MuirBench is created in a pairwise manner, where each standard instance is paired with an unanswerable variant that has minimal semantic differences, to enable a reliable assessment. Evaluated on 20 recent multimodal LLMs, our results reveal that even the best-performing models like GPT-4o and Gemini Pro find it challenging to solve MuirBench, achieving 68.0% and 49.3% in accuracy. Open-source multimodal LLMs trained on single images can hardly generalize to multi-image questions, hovering below 33.3% in accuracy. These results highlight the importance of MuirBench in encouraging the community to develop multimodal LLMs that can look beyond a single image, suggesting potential pathways for future improvements.

@inproceedings{wang2025muirbench,
  title = {MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding},
  author = {Wang, Fei and Fu, Xingyu and Huang, James Y. and Li, Zekun and Liu, Qin and Liu, Xiaogeng and Ma, Mingyu Derek and Xu, Nan and Zhou, Wenxuan and Zhang, Kai and Yan, Tianyi Lorena and Mo, Wenjie Jacky and Liu, Hsiang-Hui and Lu, Pan and Li, Chunyuan and others},
  booktitle = {ICLR},
  year = {2025}
}

Controllable Generation via Locally Constrained Resampling

Kareem Ahmed, Kai-Wei Chang, and Guy Van den Broeck, in ICLR, 2025.

Autoregressive models have demonstrated an unprecedented ability at modeling the intricacies of natural language. However, they continue to struggle with generating complex outputs that adhere to logical constraints. Sampling from a fully-independent distribution subject to a constraint is hard. Sampling from an autoregressive distribution subject to a constraint is doubly hard: We have to contend not only with the hardness of the constraint but also the distribution’s lack of structure. We propose a tractable probabilistic approach that performs Bayesian conditioning to draw samples subject to a constraint. Our approach considers the entire sequence, leading to a more globally optimal constrained generation than current greedy methods. Starting from a model sample, we induce a local, factorized distribution which we can tractably condition on the constraint. To generate samples that satisfy the constraint, we sample from the conditional distribution, correct for biases in the samples and resample. The resulting samples closely approximate the target distribution and are guaranteed to satisfy the constraints. We evaluate our approach on several tasks, including LLM detoxification and solving Sudoku puzzles. We show that by disallowing a list of toxic expressions our approach is able to steer the model’s outputs away from toxic generations, outperforming similar approaches to detoxification. We conclude by showing that our approach achieves a perfect accuracy on Sudoku compared to <50% for GPT-4o and Gemini 1.5.

@inproceedings{ahmed2025controllable,
  title = {Controllable Generation via Locally Constrained Resampling},
  author = {Ahmed, Kareem and Chang, Kai-Wei and Van den Broeck, Guy},
  booktitle = {ICLR},
  keyword_extra = {constraint},
  year = {2025}
}

MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models

Wenbo Hu, Jia-Chen Gu, Zi-Yi Dou, Mohsen Fayyaz, Pan Lu, Kai-Wei Chang, and Nanyun Peng, in ICLR, 2025.

Existing multimodal retrieval benchmarks primarily focus on evaluating whether models can retrieve and utilize external textual knowledge for question answering. However, there are scenarios where retrieving visual information is either more beneficial or easier to access than textual data. In this paper, we introduce a multimodal retrieval-augmented generation benchmark, MRAG-Bench, in which we systematically identify and categorize scenarios where visually augmented knowledge is better than textual knowledge, for instance, more images from varying viewpoints. MRAG-Bench consists of 16,130 images and 1,353 human-annotated multiple-choice questions across 9 distinct scenarios. With MRAG-Bench, we conduct an evaluation of 10 open-source and 4 proprietary large vision-language models (LVLMs). Our results show that all LVLMs exhibit greater improvements when augmented with images compared to textual knowledge, confirming that MRAG-Bench is vision-centric. Additionally, we conduct extensive analysis with MRAG-Bench, which offers valuable insights into retrieval-augmented LVLMs. Notably, the top-performing model, GPT-4o, faces challenges in effectively leveraging retrieved knowledge, achieving only a 5.82% improvement with ground-truth information, in contrast to a 33.16% improvement observed in human participants. These findings highlight the importance of MRAG-Bench in encouraging the community to enhance LVLMs’ ability to utilize retrieved visual knowledge more effectively.

@inproceedings{hu2025mrag,
  title = {MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models},
  author = {Hu, Wenbo and Gu, Jia-Chen and Dou, Zi-Yi and Fayyaz, Mohsen and Lu, Pan and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {ICLR},
  year = {2025}
}

Unlearning as Multi-task Optimization: A Normalized Gradient Difference Approach with an Adaptive Learning Rate

Xiaomeng Jin, Zhiqi Bu, Bhanukiran Vinzamuri, Anil Ramakrishna, Kai-Wei Chang, Volkan Cevher, and Mingyi Hong, in NAACL, 2025.

Machine unlearning has been used to remove unwanted knowledge acquired by large language models (LLMs). In this paper, we examine machine unlearning from an optimization perspective, framing it as a regularized multi-task optimization problem, where one task optimizes a forgetting objective and another optimizes the model performance. In particular, we introduce a normalized gradient difference (NGDiff) algorithm, enabling us to have better control over the trade-off between the objectives, while integrating a new, automatic learning rate scheduler. We provide a theoretical analysis and empirically demonstrate the superior performance of NGDiff among state-of-the-art unlearning methods on the TOFU and MUSE datasets while exhibiting stable training.

@inproceedings{jin2025unlearning,
  title = {Unlearning as Multi-task Optimization: A Normalized Gradient Difference Approach with an Adaptive Learning Rate},
  author = {Jin, Xiaomeng and Bu, Zhiqi and Vinzamuri, Bhanukiran and Ramakrishna, Anil and Chang, Kai-Wei and Cevher, Volkan and Hong, Mingyi},
  booktitle = {NAACL},
  year = {2025}
}

BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression

Yuankai Li, Jia-Chen Gu, Di Wu, Kai-Wei Chang, and Nanyun Peng, in NAACL-Findings, 2025.

Retrieval-augmented generation (RAG) can supplement large language models (LLMs) by integrating external knowledge. However, as the number of retrieved documents increases, the input length to LLMs grows linearly, causing a dramatic increase in latency and a degradation in long-context understanding. This is particularly serious for multi-hop questions that require a chain of reasoning across documents. To accelerate inference, reduce costs, and minimize distractions, this paper presents BRIEF (Bridging Retrieval and Inference through Evidence Fusion), a lightweight approach that performs query-aware multi-hop reasoning by compressing retrieved documents into highly dense textual summaries to integrate into in-context RAG. To enable learning compression for multi-hop reasoning, we curate synthetic data by extracting atomic propositions that encapsulate distinct factoids from the source documents to compose synthetic summaries. Based on our synthetic data built entirely by open-source models, BRIEF generates more concise summaries and enables a range of LLMs to achieve exceptional open-domain question answering (QA) performance. For example, on HotpotQA, BRIEF improves the compression rate by 2 times compared to the state-of-the-art baseline, while outperforming it by 3.00% EM and 4.16% F1 with Flan-UL2 as the reader model. It also generates more concise summaries than proprietary GPT-3.5, while demonstrating nearly identical QA performance.

@inproceedings{li2025brief,
  title = {BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression},
  author = {Li, Yuankai and Gu, Jia-Chen and Wu, Di and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {NAACL-Findings},
  year = {2025}
}

Vulnerability of Large Language Models to Output Prefix Jailbreaks: Impact of Positions on Safety

Yiwei Wang, Muhao Chen, Nanyun Peng, and Kai-Wei Chang, in NAACL-Findings, 2025.

Previous research on jailbreak attacks has mainly focused on optimizing the adversarial snippet content injected into input prompts to expose LLM security vulnerabilities. A significant portion of this research focuses on developing more complex, less readable adversarial snippets that can achieve higher attack success rates. In contrast to this trend, our research investigates the impact of the adversarial snippet’s position on the effectiveness of jailbreak attacks. We find that placing a simple and readable adversarial snippet at the beginning of the output effectively exposes LLM safety vulnerabilities, leading to much higher attack success rates than the input suffix attack or prompt-based output jailbreaks. Precisely speaking, we discover that directly enforcing the user’s target embedded output prefix is an effective method to expose LLMs’ safety vulnerabilities.

@inproceedings{wang2025vulnerability,
  title = {Vulnerability of Large Language Models to Output Prefix Jailbreaks: Impact of Positions on Safety},
  author = {Wang, Yiwei and Chen, Muhao and Peng, Nanyun and Chang, Kai-Wei},
  booktitle = {NAACL-Findings},
  year = {2025}
}

On Localizing and Deleting Toxic Memories in Large Language Models

Anubrata Das, Manoj Kumar, Ninareh Mehrabi, Anil Ramakrishna, Anna Rumshisky, Kai-Wei Chang, Aram Galstyan, Morteza Ziyadi, and Rahul Gupta, in NAACL-Findings, 2025.

Ensuring that large language models (LLMs) do not generate harmful text is critical for their safe deployment. A common failure mode involves producing toxic responses to otherwise innocuous prompts. While various detoxification methods have been proposed, the underlying mechanisms that drive toxic generation in LLMs are not yet fully understood. Our work aims to provide a mechanistic understanding of toxic generation against innocuous-seeming adversarial prompts through the lens of memory localization. We find evidence of localization of toxic memories in the early Multi-layer Perceptron (MLP) layers of GPT-2-XL. We further investigate the effects of editing and deleting these toxic memories in MLP layers to reduce toxic generation. Editing significantly reduces toxic generation, from 62.86% to 28.61%. However, this reduction comes with a trade-off in generation quality, as perplexity increases from 78.18 on GPT-2-XL against the adversarial prompts to 106.06 after editing. Localization-informed deletion achieves a better toxicity-perplexity trade-off compared to random early layer editing, which reduces toxicity but leads to greater perplexity increases.

@inproceedings{das2025localizing,
  title = {On Localizing and Deleting Toxic Memories in Large Language Models},
  author = {Das, Anubrata and Kumar, Manoj and Mehrabi, Ninareh and Ramakrishna, Anil and Rumshisky, Anna and Chang, Kai-Wei and Galstyan, Aram and Ziyadi, Morteza and Gupta, Rahul},
  booktitle = {NAACL-Findings},
  year = {2025}
}

Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation

Cheng-Yi Li, Kao-Jung Chang, Cheng-Fu Yang, Hsin-Yu Wu, Wenting Chen, Hritik Bansal, Ling Chen, Yi-Ping Yang, Yu-Chun Chen, Shih-Pin Chen, Shih-Jen Chen, Jiing-Feng Lirng, Kai-Wei Chang, and Shih-Hwa Chiou, in Nature Communications, 2025.

Multi-modal large language models (MLLMs) have transformed the landscape of modern healthcare, with automated radiology report generation (RRG) emerging as a cutting-edge application. While 2D MLLM-based RRG has been well established, its utility for 3D medical images remains largely unexplored. In this regard, we curate the 3D-BrainCT dataset (18,885 text-scan pairs) and develop BrainGPT, a clinically visual instruction-tuned (CVIT) model designed for 3D CT RRG. While we notice that the traditional LLM metrics failed to gauge the diagnostic quality of the RRG, we propose feature-oriented radiology task evaluation (FORTE), an evaluation scheme that captures the clinical essence of the generated reports. Here we show that BrainGPT achieves an average FORTE F1-score of 0.71 (degree = 0.661; landmark = 0.706; feature = 0.693; and impression = 0.779) and 74% of BrainGPT-generated reports were indistinguishable from human-written ground truth in a Turing-like test. Together, our work establishes a comprehensive framework encompassing dataset curation, anatomy-aware model fine-tuning, and the development of robust evaluation metrics for the RRG. By sharing our experience in 3D MLLM-based RRG, we aim to accelerate the expedition in human-machine collaboration for next-generation healthcare.

@article{li2025holistic,
  title = {Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation},
  author = {Li, Cheng-Yi and Chang, Kao-Jung and Yang, Cheng-Fu and Wu, Hsin-Yu and Chen, Wenting and Bansal, Hritik and Chen, Ling and Yang, Yi-Ping and Chen, Yu-Chun and Chen, Shih-Pin and Chen, Shih-Jen and Lirng, Jiing-Feng and Chang, Kai-Wei and Chiou, Shih-Hwa},
  journal = {Nature Communications},
  year = {2025}
}

2024

VideoCon: Robust video-language alignment via contrast captions

Hritik Bansal, Yonatan Bitton, Idan Szpektor, Kai-Wei Chang, and Aditya Grover, in CVPR, 2024.

Best paper at DPFM workshop at ICLR

Despite being (pre)trained on a massive amount of data, state-of-the-art video-language alignment models are not robust to semantically-plausible contrastive changes in the video captions. Our work addresses this by identifying a broad spectrum of contrast misalignments, such as replacing entities, actions, and flipping event order, which alignment models should be robust against. To this end, we introduce VideoCon, a video-language alignment dataset constructed by a large language model that generates plausible contrast video captions and explanations for differences between original and contrast video captions. Then, a generative video-language model is finetuned with VideoCon to assess video-language entailment and generate explanations. Our VideoCon-based alignment model significantly outperforms current models. It exhibits a 12-point increase in AUC for the video-language alignment task on human-generated contrast captions. Finally, our model sets new state-of-the-art zero-shot performance in temporally-extensive video-language tasks such as text-to-video retrieval (SSv2-Temporal) and video question answering (ATP-Hard). Moreover, our model shows superior performance on novel videos and human-crafted captions and explanations.

@inproceedings{bansal2023videocon,
  author = {Bansal, Hritik and Bitton, Yonatan and Szpektor, Idan and Chang, Kai-Wei and Grover, Aditya},
  title = {VideoCon: Robust video-language alignment via contrast captions},
  booktitle = {CVPR},
  keyword_extra = {vlmodel},
  year = {2024}
}

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Pengshuo Qiu, Ziyu Guo, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, and Hongsheng Li, in ECCV, 2024.

Top-10 cited paper at ECCV 24

The remarkable progress of Multi-modal Large Language Models (MLLMs) has garnered unparalleled attention, due to their superior performance in visual contexts. However, their capabilities in visual math problem-solving remain insufficiently evaluated and understood. We find that current benchmarks incorporate excessive visual content within textual questions, which potentially assists MLLMs in deducing answers without truly interpreting the input diagrams. To this end, we introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Each problem is then transformed by human annotators into six distinct versions, each offering varying degrees of information content in multi-modality, contributing to 15K test samples in total. This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning. In addition, we propose a Chain-of-Thought (CoT) evaluation strategy for a fine-grained assessment of the output answers. Rather than naively judging true or false, we employ GPT-4(V) to adaptively extract crucial reasoning steps, and then assess each step with error analysis to derive a total score, which can reveal the inner CoT reasoning quality of MLLMs. With MathVerse, we find that most existing MLLMs struggle to understand math diagrams, relying heavily on textual questions. Surprisingly, some of them even achieve 5%+ higher accuracy without the visual input, e.g., Gemini-Pro and SPHINX-MoE. In contrast, GPT-4V and InternLM-XComposer2 demonstrate relatively better comprehension of the visual content for mathematical reasoning. We hope the MathVerse benchmark may provide unique insights to guide the future development of MLLMs.

@inproceedings{zhang2024mathverse,
  title = {MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?},
  author = {Zhang, Renrui and Jiang, Dongzhi and Zhang, Yichi and Lin, Haokun and Qiu, Pengshuo and Guo, Ziyu and Zhou, Aojun and Lu, Pan and Chang, Kai-Wei and Gao, Peng and Li, Hongsheng},
  booktitle = {ECCV},
  year = {2024}
}

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao, in ICLR, 2024.

Oral, 85 out of 7200 submissions, top 1.2%, top-10 cited paper at ICLR 2024

Although Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive skills in various domains, their ability for mathematical reasoning within visual contexts has not been formally examined. Equipping LLMs and LMMs with this capability is vital for general-purpose AI assistants and showcases promising potential in education, data analysis, and scientific discovery. To bridge this gap, we present MathVista, a benchmark designed to amalgamate challenges from diverse mathematical and visual tasks. We first taxonomize the key task types, reasoning skills, and visual contexts from the literature to guide our selection from 28 existing math-focused and visual question answering datasets. Then, we construct three new datasets, IQTest, FunctionQA, and PaperQA, to accommodate missing types of visual contexts. The problems featured often require deep visual understanding beyond OCR or image captioning, and compositional reasoning with rich domain-specific tools, thus posing a notable challenge to existing models. We conduct a comprehensive evaluation of 11 prominent open-source and proprietary foundation models (LLMs, LLMs augmented with tools, and LMMs). The best-performing model, Multimodal Bard, achieves only 58% of human performance (34.8% vs. 60.3%), indicating ample room for further improvement. Given this significant gap, MathVista fuels future research in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks.

@inproceedings{lu2024mathvista,
  title = {MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts},
  author = {Lu, Pan and Bansal, Hritik and Xia, Tony and Liu, Jiacheng and Li, Chunyuan and Hajishirzi, Hannaneh and Cheng, Hao and Chang, Kai-Wei and Galley, Michel and Gao, Jianfeng},
  booktitle = {ICLR},
  year = {2024}
}

Details

MQT-LLaVA: Matryoshka Query Transformer for Large Vision-Language Models

Wenbo Hu, Zi-Yi Dou, Liunian Harold Li, Amita Kamath, Nanyun Peng, and Kai-Wei Chang, in NeurIPS, 2024.

Full Text Code Abstract BibTeX Details

Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model. Despite their strong performance, LVLMs face challenges in adapting to varying computational constraints. This raises the question: can we achieve flexibility in the number of visual tokens to suit different tasks and computational resources? We answer this with an emphatic yes. Inspired by Matryoshka Representation Learning, we introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into m visual tokens during inference, where m can be any number up to a predefined maximum. This is achieved by employing a query transformer with M latent query tokens to compress the visual embeddings. During each training step, we randomly select m <= M latent query tokens and train the model using only these first m tokens, discarding the rest. Combining MQT with LLaVA, we train a single model once, and flexibly and drastically reduce the number of inference-time visual tokens while maintaining similar or better performance compared to training independent models for each number of tokens. Our model, MQT-LLaVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA’s fixed 576. Reducing to 16 tokens (8x fewer TFLOPs) only sacrifices the performance by 2.4 points on MMBench. On certain tasks such as ScienceQA and MMMU, we can even go down to only 2 visual tokens with performance drops of just 3% and 6% each. Our exploration of the trade-off between the accuracy and computational cost brought about by the number of visual tokens facilitates future research to achieve the best of both worlds.

@inproceedings{hu2024mqt,
  title = {MQT-LLaVA: Matryoshka Query Transformer for Large Vision-Language Models},
  author = {Hu, Wenbo and Dou, Zi-Yi and Li, Liunian Harold and Kamath, Amita and Peng, Nanyun and Chang, Kai-Wei},
  booktitle = {NeurIPS},
  year = {2024}
}

Details

SafeWorld: Geo-Diverse Safety Alignment

Da Yin, Haoyi Qiu, Kung-Hsiang Huang, Kai-Wei Chang, and Nanyun Peng, in NeurIPS, 2024.

Full Text Abstract BibTeX Details

Content Warning: This paper may contain examples of harmful contents by nature. In the rapidly evolving field of Large Language Models (LLMs), ensuring safety is a crucial and widely discussed topic. However, existing works often overlook the geo-diversity of cultural and legal standards across the world. To demonstrate the challenges posed by geo-diverse safety standards, we introduce SAFEWORLD, a novel benchmark specifically designed to evaluate LLMs’ ability to generate responses that are not only helpful but also culturally sensitive and legally compliant across diverse global contexts. SAFEWORLD encompasses 2,775 test user queries, each grounded in high-quality, human-verified cultural norms and legal policies from 50 countries and 493 regions/races. On top of it, we propose a multidimensional automatic safety evaluation framework that assesses the contextual appropriateness, accuracy, and comprehensiveness of responses. Our evaluations reveal that current LLMs struggle to meet these criteria. To enhance LLMs’ alignment with geo-diverse safety standards, we synthesize helpful preference pairs for Direct Preference Optimization (DPO) alignment training. The preference pair construction aims to encourage LLMs to behave appropriately and provide precise references to relevant cultural norms and policies when necessary. Our trained SAFEWORLDLM outperforms all competing models, including GPT-4o on all the three evaluation dimensions by a large margin. Global human evaluators also note a nearly 20% higher winning rate in helpfulness and harmfulness evaluation.

@inproceedings{yin2024safeworld,
  title = {SafeWorld: Geo-Diverse Safety Alignment},
  author = {Yin, Da and Qiu, Haoyi and Huang, Kung-Hsiang and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {NeurIPS},
  year = {2024}
}

Details

Enhancing Large Vision Language Models with Self-Training on Image Comprehension

Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, Quanquan Gu, James Zou, Kai-Wei Chang, and Wei Wang, in NeurIPS, 2024.

Full Text Abstract BibTeX Details

Large vision language models (LVLMs) integrate large language models (LLMs) with pre-trained vision encoders, thereby activating the perception capability of the model to understand image inputs for different queries and conduct subsequent reasoning. Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire. Self-training approaches have been effective in single-modal settings to alleviate the need for labeled data by leveraging the model’s own generations. However, effective self-training remains a challenge regarding the unique visual perception and reasoning capability of LVLMs. To address this, we introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension. First, the model self-constructs a preference dataset for image descriptions using unlabeled images. Preferred responses are generated through a step-by-step prompt, while dis-preferred responses are generated from either corrupted images or misleading prompts. To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data and append its self-generated image descriptions to the prompts. We validate the effectiveness of STIC across seven different benchmarks, demonstrating substantial performance gains of 4.0% on average while using 70% less supervised fine-tuning data than the current method. Further studies investigate various components of STIC and highlight its potential to leverage vast quantities of unlabeled images for self-training. Code and data are made publicly available.

@inproceedings{deng2024enhancing,
  title = {Enhancing Large Vision Language Models with Self-Training on Image Comprehension},
  author = {Deng, Yihe and Lu, Pan and Yin, Fan and Hu, Ziniu and Shen, Sheng and Gu, Quanquan and Zou, James and Chang, Kai-Wei and Wang, Wei},
  booktitle = {NeurIPS},
  year = {2024}
}

Details

DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation

Xueqing Wu, Rui Zheng, Jingzhen Sha, Te-Lin Wu, Hanyu Zhou, Tang Mohan, Kai-Wei Chang, Nanyun Peng, and Haoran Huang, in NeurIPS (Datasets and Benchmarks Track), 2024.

Full Text Abstract BibTeX Details

Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights to comprehensively answer a given user query for tabular data. In this work, we aim to propose new resources and benchmarks to inspire future research on this crucial yet challenging and under-explored task. However, collecting data analysis annotations curated by experts can be prohibitively expensive. We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs with a multi-turn prompting technique. We construct the DACO dataset, containing (1) 440 databases (of tabular data) collected from real-world scenarios, (2) 2k query-answer pairs that can serve as weak supervision for model training, and (3) a concentrated but high-quality test set with human refined annotations that serves as our main evaluation benchmark. We train a 6B supervised fine-tuning (SFT) model on the DACO dataset, and find that the SFT model learns reasonable data analysis capabilities. To further align the models with human preference, we use reinforcement learning to encourage generating analysis perceived by humans as helpful, and design a set of dense rewards to propagate the sparse human preference reward to intermediate code generation steps. Our DACO-RL algorithm is evaluated by human annotators to produce more helpful answers than the SFT model in 57.72% of cases, validating the effectiveness of our proposed algorithm.

@inproceedings{wu2024daco,
  title = {DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation},
  author = {Wu, Xueqing and Zheng, Rui and Sha, Jingzhen and Wu, Te-Lin and Zhou, Hanyu and Mohan, Tang and Chang, Kai-Wei and Peng, Nanyun and Huang, Haoran},
  booktitle = {NeurIPS (Datasets and Benchmarks Track)},
  github_url = {https://github.com/shirley-wu/daco},
  keyword_extra = {AI-agent},
  year = {2024}
}

Details

JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images

Zhecan Wang, Junzhang Liu, Chia-Wei Tang, Hani Alomari, Anushka Sivakumar, Rui Sun, Wenhao Li, Md. Atabuzzaman, Hammad Ayyubi, Haoxuan You, Alvi Md Ishmam, Kai-Wei Chang, Shih-Fu Chang, and Chris Thomas, in NeurIPS (Datasets and Benchmarks Track), 2024.

Full Text Abstract BibTeX Details

Existing vision-language understanding benchmarks largely consist of images of objects in their usual contexts. As a consequence, recent multimodal large language models can perform well with only a shallow visual understanding by relying on background language biases. Thus, strong performance on these benchmarks does not necessarily correlate with strong visual understanding. In this paper, we release JourneyBench, a comprehensive human-annotated benchmark of generated images designed to assess the model’s fine-grained multimodal reasoning abilities across five tasks: complementary multimodal chain of thought, multi-image VQA, imaginary image captioning, VQA with hallucination triggers, and fine-grained retrieval with sample-specific distractors. Unlike existing benchmarks, JourneyBench explicitly requires fine-grained multimodal reasoning in unusual imaginary scenarios where language bias and holistic image gist are insufficient. We benchmark state-of-the-art models on JourneyBench and analyze performance along a number of fine-grained dimensions. Results across all five tasks show that JourneyBench is exceptionally challenging for even the best models, indicating that models’ visual reasoning abilities are not as strong as they first appear. We discuss the implications of our findings and propose avenues for further research.

@inproceedings{wang2024journeybench,
  title = {JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images},
  author = {Wang, Zhecan and Liu, Junzhang and Tang, Chia-Wei and Alomari, Hani and Sivakumar, Anushka and Sun, Rui and Li, Wenhao and Atabuzzaman, Md. and Ayyubi, Hammad and You, Haoxuan and Ishmam, Alvi Md and Chang, Kai-Wei and Chang, Shih-Fu and Thomas, Chris},
  booktitle = {NeurIPS (Datasets and Benchmarks Track)},
  keyword_extra = {vlmodel},
  year = {2024}
}

Details

Control Large Language Models via Divide and Conquer

Bingxuan Li, Yiwei Wang, Tao Meng, Kai-Wei Chang, and Nanyun Peng, in EMNLP, 2024.

Full Text Abstract BibTeX Details

This paper investigates the capability of LLMs on controllable generation with prompt-based controlling, focusing on Lexically Constrained Generation (LCG). We systematically evaluate the performance of LLMs on satisfying lexical constraints with prompt-based controlling, as well as their efficacy in downstream applications. We identified three key reasons that highlight the limitations of LLMs in LCG, including (1) position bias, where LLMs tend to satisfy constraints that appear in specific positions within the input; (2) low responsiveness to control decoding parameters, which minimally impact the performance of LLMs; and (3) difficulty handling the inherent complexity of certain constraints (e.g., compound words). We conclude that black-box LLMs face significant challenges in consistently satisfying lexical constraints with prompt-based controlling. To address this bottleneck, we introduce the Divide and Conquer Generation strategy, effective for both white-box and black-box LLMs, to enhance LLMs’ performance in LCG tasks, which demonstrates over 90% improvement on success rate in the most challenging LCG task. Our analysis aims to provide valuable insights into the performance of LLMs in LCG with prompt-based controlling, and our proposed strategy offers a pathway to more sophisticated and customized text generation applications.

@inproceedings{li2024llms,
  title = {Control Large Language Models via Divide and Conquer},
  author = {Li, Bingxuan and Wang, Yiwei and Meng, Tao and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {EMNLP},
  year = {2024}
}

Details

FLIRT: Feedback Loop In-context Red Teaming

Ninareh Mehrabi, Palash Goyal, Christophe Dupuy, Qian Hu, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, and Rahul Gupta, in EMNLP, 2024.

Full Text Abstract BibTeX Details

As generative models become available for public use in various applications, testing and analyzing vulnerabilities of these models has become a priority. Here we propose an automatic red teaming framework that evaluates a given model and exposes its vulnerabilities against unsafe and inappropriate content generation. Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation. We propose different in-context attack strategies to automatically learn effective and diverse adversarial prompts for text-to-image models. Our experiments demonstrate that compared to baseline approaches, our proposed strategy is significantly more effective in exposing vulnerabilities in the Stable Diffusion (SD) model, even when the latter is enhanced with safety features. Furthermore, we demonstrate that the proposed framework is effective for red teaming text-to-text models, resulting in a significantly higher toxic response generation rate compared to previously reported numbers.

@inproceedings{mehrabi2024flirt,
  title = {FLIRT: Feedback Loop In-context Red Teaming},
  author = {Mehrabi, Ninareh and Goyal, Palash and Dupuy, Christophe and Hu, Qian and Ghosh, Shalini and Zemel, Richard and Chang, Kai-Wei and Galstyan, Aram and Gupta, Rahul},
  booktitle = {EMNLP},
  year = {2024}
}

Details

QUDSELECT: Selective Decoding for Questions Under Discussion Parsing

Ashima Suvarna, Xiao Liu, Tanmay Parekh, Kai-Wei Chang, and Nanyun Peng, in EMNLP, 2024.

Full Text Abstract BibTeX Details

Question Under Discussion (QUD) is a discourse framework that uses implicit questions to reveal discourse relationships between sentences. In QUD parsing, each sentence is viewed as an answer to a question triggered by an anchor sentence in prior context. The resulting QUD structure is required to conform to several theoretical criteria like answer compatibility (how well the question is answered), making QUD parsing a challenging task. Previous works construct QUD parsers in a pipelined manner (i.e. detect the trigger sentence in context and then generate the question). However, these parsers lack a holistic view of the task and can hardly satisfy all the criteria. In this work, we introduce QUDSELECT, a joint-training framework that selectively decodes the QUD dependency structures considering the QUD criteria. Using instruction-tuning, we train models to simultaneously predict the anchor sentence and generate the associated question. To explicitly incorporate the criteria, we adopt a selective decoding strategy of sampling multiple QUD candidates during inference, followed by selecting the best one with criteria scorers. Our method outperforms the state-of-the-art baseline models by 9% in human evaluation and 4% in automatic evaluation, demonstrating the effectiveness of our framework.

@inproceedings{suvarna2024qudselect,
  title = {QUDSELECT: Selective Decoding for Questions Under Discussion Parsing},
  author = {Suvarna, Ashima and Liu, Xiao and Parekh, Tanmay and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {EMNLP},
  year = {2024}
}

Details

Data Advisor: Data Curation with Foresight for Safety Alignment of Large Language Models

Fei Wang, Ninareh Mehrabi, Palash Goyal, Rahul Gupta, Kai-Wei Chang, and Aram Galstyan, in EMNLP, 2024.

Full Text Abstract BibTeX Details

Data is a crucial element in large language model (LLM) alignment. Recent studies have explored using LLMs for efficient data collection. However, LLM-generated data often suffers from quality issues, with underrepresented or absent aspects and low-quality datapoints. To address these problems, we propose Data Advisor, an enhanced LLM-based method for generating data that takes into account the characteristics of the desired dataset. Starting from a set of pre-defined principles in hand, Data Advisor monitors the status of the generated data, identifies weaknesses in the current dataset, and advises the next iteration of data generation accordingly. Data Advisor can be easily integrated into existing data generation methods to enhance data quality and coverage. Experiments on safety alignment of three representative LLMs (i.e., Mistral, Llama2, and Falcon) demonstrate the effectiveness of Data Advisor in enhancing model safety against various fine-grained safety issues without sacrificing model utility.

@inproceedings{wang2024data,
  title = {Data Advisor: Data Curation with Foresight for Safety Alignment of Large Language Models},
  author = {Wang, Fei and Mehrabi, Ninareh and Goyal, Palash and Gupta, Rahul and Chang, Kai-Wei and Galstyan, Aram},
  booktitle = {EMNLP},
  year = {2024}
}

Details

The Factuality Tax of Diversity-Intervened Text-to-Image Generation: Benchmark and Fact-Augmented Intervention

Yixin Wan, Di Wu, Haoran Wang, and Kai-Wei Chang, in EMNLP, 2024.

Full Text Abstract BibTeX Details

Prompt-based "diversity interventions" are commonly adopted to improve the diversity of Text-to-Image (T2I) models depicting individuals with various racial or gender traits. However, will this strategy result in nonfactual demographic distribution, especially when generating real historical figures? In this work, we propose DemOgraphic FActualIty Representation (DoFaiR), a benchmark to systematically quantify the trade-off between using diversity interventions and preserving demographic factuality in T2I models. DoFaiR consists of 756 meticulously fact-checked test instances to reveal the factuality tax of various diversity prompts through an automated evidence-supported evaluation pipeline. Experiments on DoFaiR unveil that diversity-oriented instructions increase the number of different gender and racial groups in DALLE-3’s generations at the cost of historically inaccurate demographic distributions. To resolve this issue, we propose Fact-Augmented Intervention (FAI), which instructs a Large Language Model (LLM) to reflect on verbalized or retrieved factual information about gender and racial compositions of generation subjects in history, and incorporate it into the generation context of T2I models. By orienting model generations using the reflected historical truths, FAI significantly improves the demographic factuality under diversity interventions while preserving diversity.

@inproceedings{wan2024factuality,
  title = {The Factuality Tax of Diversity-Intervened Text-to-Image Generation: Benchmark and Fact-Augmented Intervention},
  author = {Wan, Yixin and Wu, Di and Wang, Haoran and Chang, Kai-Wei},
  booktitle = {EMNLP},
  year = {2024}
}

Details

Synchronous Faithfulness Monitoring for Trustworthy Retrieval-Augmented Generation

Di Wu, Jia-Chen Gu, Fan Yin, Nanyun Peng, and Kai-Wei Chang, in EMNLP, 2024.

Full Text Abstract BibTeX Details

Retrieval-augmented language models (RALMs) have shown strong performance and wide applicability in knowledge-intensive tasks. However, there are significant trustworthiness concerns as RALMs are prone to generating unfaithful outputs, including baseless information or contradictions with the retrieved context. This paper proposes SynCheck, a lightweight monitor that leverages fine-grained decoding dynamics including sequence likelihood, uncertainty quantification, context influence, and semantic alignment to synchronously detect unfaithful sentences. By integrating efficiently measurable and complementary signals, SynCheck enables accurate and immediate feedback and intervention, achieving 0.85 AUROC in detecting faithfulness errors across six long-form retrieval-augmented generation tasks, improving on the prior best method by 4%. Leveraging SynCheck, we further introduce FOD, a faithfulness-oriented decoding algorithm guided by beam search for long-form retrieval-augmented generation. Empirical results demonstrate that FOD outperforms traditional strategies such as abstention, reranking, or contrastive decoding significantly in terms of faithfulness, achieving over 10% improvement across six datasets.

@inproceedings{wu2024synchronous,
  title = {Synchronous Faithfulness Monitoring for Trustworthy Retrieval-Augmented Generation},
  author = {Wu, Di and Gu, Jia-Chen and Yin, Fan and Peng, Nanyun and Chang, Kai-Wei},
  booktitle = {EMNLP},
  year = {2024}
}

Details

SPEED++: A Multilingual Event Extraction Framework for Epidemic Prediction and Preparedness

Tanmay Parekh, Jeffrey Kwan, Jiarui Yu, Sparsh Johri, Hyosang Ahn, Sreya Muppalla, Kai-Wei Chang, Wei Wang, and Nanyun Peng, in EMNLP, 2024.

Full Text Abstract BibTeX Details

Social media is often the first place where communities discuss the latest societal trends. Prior works have utilized this platform to extract epidemic-related information (e.g. infections, preventive measures) to provide early warnings for epidemic prediction. However, these works only focused on English posts, while epidemics can occur anywhere in the world, and early discussions are often in the local, non-English languages. In this work, we introduce the first multilingual Event Extraction (EE) framework SPEED++ for extracting epidemic event information for any disease and language. To this end, we extend a previous epidemic ontology with 20 argument roles; and curate our multilingual EE dataset SPEED++ comprising 5.1K tweets in four languages for four diseases. Annotating data in every language is infeasible; thus we develop zero-shot cross-lingual cross-disease models (i.e., training only on English COVID data) utilizing multilingual pre-training and show their efficacy in extracting epidemic-related events for 65 diverse languages across different diseases. Experiments demonstrate that our framework can provide epidemic warnings for COVID-19 in its earliest stages in Dec 2019 (3 weeks before global discussions) from Chinese Weibo posts without any training in Chinese. Furthermore, we exploit our framework’s argument extraction capabilities to aggregate community epidemic discussions like symptoms and cure measures, aiding misinformation detection and public attention monitoring. Overall, we lay a strong foundation for multilingual epidemic preparedness.

@inproceedings{parekh2024speed,
  title = {SPEED++: A Multilingual Event Extraction Framework for Epidemic Prediction and Preparedness},
  author = {Parekh, Tanmay and Kwan, Jeffrey and Yu, Jiarui and Johri, Sparsh and Ahn, Hyosang and Muppalla, Sreya and Chang, Kai-Wei and Wang, Wei and Peng, Nanyun},
  booktitle = {EMNLP},
  year = {2024}
}

Details

Re-ReST: Reflection-Reinforced Self-Training for Language Agents

Zi-Yi Dou, Cheng-Fu Yang, Xueqing Wu, Kai-Wei Chang, and Nanyun Peng, in EMNLP, 2024.

Full Text Code Abstract BibTeX Details

Finetuning language agents with reasoning-action trajectories is effective, but obtaining these trajectories from human annotations or stronger models is costly and sometimes impractical. In this paper, we investigate the use of self-training in language agents, which can generate supervision from the agent itself, offering a promising alternative without relying on human or stronger model demonstrations. Self-training, however, requires high-quality model-generated samples, which are hard to obtain for challenging language agent tasks. To address this, we present Reflection-Reinforced Self-Training (Re-ReST), which uses a reflector to refine low-quality generated samples during self-training. The reflector takes the agent’s output and feedback from an external environment (e.g., unit test results in code generation) to produce improved samples. This technique enhances the quality of inferior samples and efficiently enriches the self-training dataset with higher-quality samples. We conduct extensive experiments on open-source language agents across tasks, including multi-hop question answering, sequential decision-making, code generation, visual question answering, and text-to-image generation. The results demonstrate the effectiveness of self-training and Re-ReST in language agent tasks, with self-training improving baselines by 7.6% on HotpotQA and 28.4% on AlfWorld, and Re-ReST further boosting performance by 2.0% and 14.1%, respectively. Our studies also confirm the efficiency of using a reflector to generate high-quality samples for self-training. Moreover, we demonstrate a method to employ reflection during inference without ground-truth feedback, addressing the limitation of previous reflection work. Our code is released at https://github.com/PlusLabNLP/Re-ReST.

@inproceedings{dou2024rere,
  title = {Re-ReST: Reflection-Reinforced Self-Training for Language Agents},
  author = {Dou, Zi-Yi and Yang, Cheng-Fu and Wu, Xueqing and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {EMNLP},
  keyword_extra = {AI-agent, constraint},
  year = {2024}
}

Details

Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue

Jia-Chen Gu, Hao-Xiang Xu, Jun-Yu Ma, Pan Lu, Zhen-Hua Ling, Kai-Wei Chang, and Nanyun Peng, in EMNLP, 2024.

Full Text Abstract BibTeX Details

Model editing is a technique that edits the large language models (LLMs) with updated knowledge to alleviate hallucinations without resource-intensive retraining. While current model editing methods can effectively modify a model’s behavior within a specific area of interest, they often overlook the potential unintended side effects on the general abilities of LLMs such as reasoning, natural language inference, and question answering. In this paper, we raise concerns that model editing’s improvements on factuality may come at the cost of a significant degradation of the model’s general abilities. We systematically analyze the side effects by evaluating four popular editing methods on three LLMs across eight representative tasks. Our extensive empirical experiments show that it is challenging for current editing methods to simultaneously improve factuality of LLMs and maintain their general abilities. Our analysis reveals that the side effects are caused by model editing altering the original model weights excessively, leading to overfitting to the edited facts. To mitigate this, a method named RECT (RElative Change in weighT) is proposed to regularize the edit update weights. Evaluation results show that RECT can significantly mitigate the side effects of editing while still maintaining over 94% editing performance.

@inproceedings{gu2024model,
  title = {Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue},
  author = {Gu, Jia-Chen and Xu, Hao-Xiang and Ma, Jun-Yu and Lu, Pan and Ling, Zhen-Hua and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {EMNLP},
  year = {2024}
}

Details

Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification

Tao Meng, Ninareh Mehrabi, Palash Goyal, Anil Ramakrishna, Aram Galstyan, Richard Zemel, Kai-Wei Chang, Rahul Gupta, and Charith Peris, in EMNLP-Finding, 2024.

Full Text Abstract BibTeX Details

We propose a constraint learning schema for fine-tuning Large Language Models (LLMs) with attribute control. Given a training corpus and control criteria formulated as a sequence-level constraint on model outputs, our method fine-tunes the LLM on the training corpus while enhancing constraint satisfaction with minimal impact on its utility and generation quality. Specifically, our approach regularizes the LLM training by penalizing the KL divergence between the desired output distribution, which satisfies the constraints, and the LLM’s posterior. This regularization term can be approximated by an auxiliary model trained to decompose the sequence-level constraints into token-level guidance, allowing the term to be measured by a closed-form formulation. To further improve efficiency, we design a parallel scheme for concurrently updating both the LLM and the auxiliary model. We evaluate the empirical performance of our approach by controlling the toxicity when training an LLM. We show that our approach leads to an LLM that produces fewer inappropriate responses while achieving competitive performance on benchmarks and a toxicity detection task.

@inproceedings{meng2024attribute,
  title = {Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification},
  author = {Meng, Tao and Mehrabi, Ninareh and Goyal, Palash and Ramakrishna, Anil and Galstyan, Aram and Zemel, Richard and Chang, Kai-Wei and Gupta, Rahul and Peris, Charith},
  booktitle = {EMNLP-Finding},
  year = {2024}
}

Details

LLM-A*: Large Language Model Enhanced Incremental Heuristic Search on Path Planning

Silin Meng, Yiwei Wang, Cheng-Fu Yang, Nanyun Peng, and Kai-Wei Chang, in EMNLP-Finding, 2024.

Full Text Abstract BibTeX Details

Path planning is a fundamental scientific problem in robotics and autonomous navigation, requiring the derivation of efficient routes from starting to destination points while avoiding obstacles. Traditional algorithms like A* and its variants are capable of ensuring path validity but suffer from significant computational and memory inefficiencies as the state space grows. Conversely, large language models (LLMs) excel in broader environmental analysis through contextual understanding, providing global insights into environments. However, they fall short in detailed spatial and temporal reasoning, often leading to invalid or inefficient routes. In this work, we propose LLM-A*, a new LLM-based route planning method that synergistically combines the precise pathfinding capabilities of A* with the global reasoning capability of LLMs. This hybrid approach aims to enhance pathfinding efficiency in terms of time and space complexity while maintaining the integrity of path validity, especially in large-scale scenarios. By integrating the strengths of both methodologies, LLM-A* addresses the computational and memory limitations of conventional algorithms without compromising on the validity required for effective pathfinding.

@inproceedings{meng2024llm,
  title = {LLM-A*: Large Language Model Enhanced Incremental Heuristic Search on Path Planning},
  author = {Meng, Silin and Wang, Yiwei and Yang, Cheng-Fu and Peng, Nanyun and Chang, Kai-Wei},
  booktitle = {EMNLP-Finding},
  year = {2024}
}

Details

MetaKP: On-Demand Keyphrase Generation

Di Wu, Xiaoxian Shen, and Kai-Wei Chang, in EMNLP-Finding, 2024.

Full Text Abstract BibTeX Details

Traditional keyphrase prediction methods predict a single set of keyphrases per document, failing to cater to the diverse needs of users and downstream applications. To bridge the gap, we introduce on-demand keyphrase generation, a novel paradigm that requires keyphrases that conform to specific high-level goals or intents. For this task, we present MetaKP, a large-scale benchmark comprising four datasets, 7500 documents, and 3760 goals across news and biomedical domains with human-annotated keyphrases. Leveraging MetaKP, we design both supervised and unsupervised methods, including a multi-task fine-tuning approach and a self-consistency prompting method with large language models. The results highlight the challenges of supervised fine-tuning, whose performance is not robust to distribution shifts. By contrast, the proposed self-consistency prompting approach greatly improves the performance of large language models, enabling GPT-4o to achieve 0.548 SemF1, surpassing the performance of a fully fine-tuned BART-base model. Finally, we demonstrate the potential of our method to serve as a general NLP infrastructure, exemplified by its application in epidemic event detection from social media.

@inproceedings{wu2024metakp,
  title = {MetaKP: On-Demand Keyphrase Generation},
  author = {Wu, Di and Shen, Xiaoxian and Chang, Kai-Wei},
  booktitle = {EMNLP-Finding},
  year = {2024}
}

Details

MACAROON: Training Vision-Language Models To Be Your Engaged Partners

Shujin Wu, Yi Fung, Sha Li, Yixin Wan, Kai-Wei Chang, and Heng Ji, in EMNLP-Finding, 2024.

Full Text Abstract BibTeX Details

Large vision-language models (LVLMs), while proficient in following instructions and responding to diverse questions, invariably generate detailed responses even when questions are ambiguous or unanswerable, leading to hallucinations and bias issues. Thus, it is essential for LVLMs to proactively engage with humans to ask for clarifications or additional information for better responses. In this study, we aim to shift LVLMs from passive answer providers to proactive engaged partners. We begin by establishing a three-tiered hierarchy for questions of invalid, ambiguous, and personalizable nature to measure the proactive engagement capabilities of LVLMs. Utilizing this hierarchy, we create PIE (ProactIve Engagement Evaluation) through GPT-4o and human annotators, consisting of 853 questions across six distinct, fine-grained question types that are verified by human annotators and accompanied by well-defined metrics. Our evaluations on PIE indicate poor performance of existing LVLMs, with the best-performing open-weights model only achieving an Aggregate Align Rate (AAR) of 0.28. In response, we introduce MACAROON, self-iMaginAtion for ContrAstive pReference OptimizatiON, which instructs LVLMs to autonomously generate contrastive response pairs for unlabeled questions given the task description and human-crafted criteria. Then, the self-imagined data is formatted for conditional reinforcement learning. Experimental results show MACAROON effectively improves LVLMs’ capabilities to be proactively engaged (0.84 AAR) while maintaining comparable performance on general tasks.

@inproceedings{wu2024macaroon,
  title = {MACAROON: Training Vision-Language Models To Be Your Engaged Partners},
  author = {Wu, Shujin and Fung, Yi and Li, Sha and Wan, Yixin and Chang, Kai-Wei and Ji, Heng},
  booktitle = {EMNLP-Finding},
  keyword_extra = {constraint},
  year = {2024}
}

Details

VDebugger: Harnessing Execution Feedback for Debugging Visual Programs

Xueqing Wu, Zongyu Lin, Songyan Zhao, Te-Lin Wu, Pan Lu, Nanyun Peng, and Kai-Wei Chang, in EMNLP-Finding, 2024.

Full Text Code Abstract BibTeX Details

Visual programs are executable code generated by large language models to address visual reasoning problems. They decompose complex questions into multiple reasoning steps and invoke specialized models for each step to solve the problems. However, these programs are prone to logic errors, with our preliminary evaluation showing that 58% of the total errors are caused by program logic errors. Debugging complex visual programs remains a major bottleneck for visual reasoning. To address this, we introduce VDebugger, a novel critic-refiner framework trained to localize and debug visual programs by tracking execution step by step. VDebugger identifies and corrects program errors leveraging detailed execution feedback, improving interpretability and accuracy. The training data is generated through an automated pipeline that injects errors into correct visual programs using a novel mask-best decoding technique. Evaluations on six datasets demonstrate VDebugger’s effectiveness, showing performance improvements of up to 3.2% in downstream task accuracy. Further studies show VDebugger’s ability to generalize to unseen tasks, bringing a notable improvement of 2.3% on the unseen COVR task.

@inproceedings{wu2024vdebugger,
  title = {VDebugger: Harnessing Execution Feedback for Debugging Visual Programs},
  author = {Wu, Xueqing and Lin, Zongyu and Zhao, Songyan and Wu, Te-Lin and Lu, Pan and Peng, Nanyun and Chang, Kai-Wei},
  booktitle = {EMNLP-Finding},
  year = {2024}
}

Details

The Hard Positive Truth about Vision-Language Compositionality

Amita Kamath, Cheng-Yu Hsieh, Kai-Wei Chang, and Ranjay Krishna, in ECCV, 2024.

Full Text Abstract BibTeX Details

Several benchmarks have concluded that our best vision-language models (e.g., CLIP) are lacking in compositionality. Given an image, these benchmarks probe a model’s ability to identify its associated caption amongst a set of compositional distractors. In response, a surge of recent proposals show improvements by finetuning CLIP with distractors as hard negatives. Our investigations reveal that these improvements have been overstated, because existing benchmarks do not probe whether finetuned models remain invariant to hard positives. By curating an evaluation dataset with 112,382 hard negatives and hard positives, we uncover that including hard positives decreases CLIP’s performance by 12.9%, while humans perform effortlessly at 99%. CLIP finetuned with hard negatives results in an even larger decrease, up to 38.7%. With this finding, we then produce a training set of size 1,775,259 with both hard negative and hard positive captions. By training with both, we see improvements on existing benchmarks while simultaneously improving performance on hard positives, indicating an improvement in compositionality. Our work suggests the need for future research to rigorously test and improve CLIP’s understanding of semantic relationships between related “positive” concepts.

@inproceedings{kamath2024hard,
  title = {The Hard Positive Truth about Vision-Language Compositionality},
  author = {Kamath, Amita and Hsieh, Cheng-Yu and Chang, Kai-Wei and Krishna, Ranjay},
  booktitle = {ECCV},
  year = {2024}
}

Details

Tree-of-Traversals: A Zero-Shot Reasoning Algorithm for Augmenting Black-box Language Models with Knowledge Graphs

Elan Sopher Markowitz, Anil Ramakrishna, Jwala Dhamala, Ninareh Mehrabi, Charith Peris, Rahul Gupta, Kai-Wei Chang, and Aram Galstyan, in ACL, 2024.

Full Text Abstract BibTeX Details

Knowledge graphs (KGs) complement Large Language Models (LLMs) by providing reliable, structured, domain-specific, and up-to-date external knowledge. However, KGs and LLMs are often developed separately and must be integrated after training. We introduce Tree-of-Traversals, a novel zero-shot reasoning algorithm that enables augmentation of black-box LLMs with one or more KGs. The algorithm equips an LLM with actions for interfacing with a KG and enables the LLM to perform tree search over possible thoughts and actions to find high-confidence reasoning paths. We evaluate on two popular benchmark datasets. Our results show that Tree-of-Traversals significantly improves performance on question answering and KG question answering tasks. Code is available at https://github.com/amazon-science/tree-of-traversals

@inproceedings{markowitz2024tree,
  title = {Tree-of-Traversals: A Zero-Shot Reasoning Algorithm for Augmenting Black-box Language Models with Knowledge Graphs},
  author = {Markowitz, Elan Sopher and Ramakrishna, Anil and Dhamala, Jwala and Mehrabi, Ninareh and Peris, Charith and Gupta, Rahul and Chang, Kai-Wei and Galstyan, Aram},
  booktitle = {ACL},
  year = {2024}
}

Details

Agent Lumos: Unified and Modular Training for Open-Source Language Agents

Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Chandu, Kai-Wei Chang, Yejin Choi, and Bill Yuchen Lin, in ACL, 2024.

Full Text Abstract BibTeX Details

Closed-source agents suffer from several issues such as a lack of affordability, transparency, and reproducibility, particularly on complex interactive tasks. This motivates the development of open-source alternatives. We introduce LUMOS, one of the first frameworks for training open-source LLM-based agents. LUMOS features a learnable, unified, and modular architecture with a planning module that learns high-level subgoal generation, and a grounding module trained to translate these into actions using various tools in the execution module. The design allows for modular upgrades and wider applicability to diverse interactive tasks. To foster generalizable agent learning, we collect large-scale, unified, and high-quality training annotations derived from diverse ground-truth reasoning rationales across various complex interactive tasks. On 9 datasets, LUMOS exhibits several key advantages: (1) LUMOS outperforms multiple larger open-source agents on the held-out datasets (unused for training) for each task type. LUMOS even surpasses GPT agents on QA and web tasks; (2) LUMOS outperforms open-source agents produced by chain-of-thoughts and unmodularized integrated training; and (3) LUMOS effectively generalizes to unseen tasks, outperforming 33B-scale agents and domain-specific agents.

@inproceedings{yin2024agent,
  title = {Agent Lumos: Unified and Modular Training for Open-Source Language Agents},
  author = {Yin, Da and Brahman, Faeze and Ravichander, Abhilasha and Chandu, Khyathi and Chang, Kai-Wei and Choi, Yejin and Lin, Bill Yuchen},
  booktitle = {ACL},
  keyword_extra = {AI-agent, constraint},
  year = {2024}
}

Details

Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data

Xiao Liu, Zirui Wu, Xueqing Wu, Pan Lu, Kai-Wei Chang, and Yansong Feng, in ACL-Findings, 2024.

Full Text Code Abstract BibTeX Details

Quantitative reasoning is a critical skill to analyze data, yet the assessment of such ability remains limited. To address this gap, we introduce the Quantitative Reasoning with Data (QRData) benchmark, aiming to evaluate Large Language Models’ capability in statistical and causal reasoning with real-world data. The benchmark comprises a carefully constructed dataset of 411 questions accompanied by data sheets from textbooks, online learning materials, and academic papers. To compare models’ quantitative reasoning abilities on data and text, we enrich the benchmark with an auxiliary set of 290 text-only questions, namely QRText. We evaluate natural language reasoning, program-based reasoning, and agent reasoning methods including Chain-of-Thought, Program-of-Thoughts, ReAct, and code interpreter assistants on diverse models. The strongest model, GPT-4, achieves an accuracy of 58%, leaving substantial room for improvement. Among open-source models, Deepseek-coder-instruct, a code LLM pretrained on 2T tokens, gets the highest accuracy of 37%. Analysis reveals that models encounter difficulties in data analysis and causal reasoning, and struggle to use causal knowledge and provided data simultaneously.

@inproceedings{liu2024are,
  title = {Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data},
  author = {Liu, Xiao and Wu, Zirui and Wu, Xueqing and Lu, Pan and Chang, Kai-Wei and Feng, Yansong},
  booktitle = {ACL-Findings},
  year = {2024}
}

Details

TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction

Kuan-Hao Huang, I-Hung Hsu, Tanmay Parekh, Zhiyu Xie, Zixuan Zhang, Prem Natarajan, Kai-Wei Chang, Nanyun Peng, and Heng Ji, in ACL-Findings, 2024.

Full Text Abstract BibTeX Details

Event extraction has gained considerable interest due to its wide-ranging applications. However, recent studies draw attention to evaluation issues, suggesting that reported scores may not accurately reflect the true performance. In this work, we identify and address evaluation challenges, including inconsistency due to varying data assumptions or preprocessing steps, the insufficiency of current evaluation frameworks that may introduce dataset or data split bias, and the low reproducibility of some previous approaches. To address these challenges, we present TextEE, a standardized, fair, and reproducible benchmark for event extraction. TextEE comprises standardized data preprocessing scripts and splits for 16 datasets spanning eight diverse domains and includes 14 recent methodologies, conducting a comprehensive benchmark reevaluation. We also evaluate five varied large language models on our TextEE benchmark and demonstrate how they struggle to achieve satisfactory performance. Inspired by our reevaluation results and findings, we discuss the role of event extraction in the current NLP era, as well as future challenges and insights derived from TextEE. We believe TextEE, the first standardized comprehensive benchmarking tool, will significantly facilitate future event extraction research.

@inproceedings{huang2024textee,
  title = {TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction},
  author = {Huang, Kuan-Hao and Hsu, I-Hung and Parekh, Tanmay and Xie, Zhiyu and Zhang, Zixuan and Natarajan, Prem and Chang, Kai-Wei and Peng, Nanyun and Ji, Heng},
  booktitle = {ACL-Findings},
  year = {2024}
}

Details

KPEval: Towards Fine-Grained Semantic-Based Keyphrase Evaluation

Di Wu, Da Yin, and Kai-Wei Chang, in ACL-Findings, 2024.

Full Text Code Abstract BibTeX Details

Despite the significant advancements in keyphrase extraction and keyphrase generation methods, the predominant approach for evaluation mainly relies on exact matching with human references. This scheme fails to recognize systems that generate keyphrases semantically equivalent to the references or diverse keyphrases that carry practical utility. To better assess the capability of keyphrase systems, we propose KPEval, a comprehensive evaluation framework consisting of four critical aspects: reference agreement, faithfulness, diversity, and utility. For each aspect, we design semantic-based metrics to reflect the evaluation objectives. Meta-evaluation studies demonstrate that our evaluation strategy correlates better with human preferences compared to a range of previously proposed metrics. Using KPEval, we re-evaluate 21 keyphrase systems and discover that (1) established model comparison results have blind-spots especially when considering reference-free evaluation; (2) large language models are underestimated by prior evaluation works; and (3) there is no single best model that can excel in all the aspects.

@inproceedings{wu2024kpeval,
  title = {KPEval: Towards Fine-Grained Semantic-Based Keyphrase Evaluation},
  author = {Wu, Di and Yin, Da and Chang, Kai-Wei},
  booktitle = {ACL-Findings},
  year = {2024}
}

Details

Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension

Fan Yin, Jayanth Srinivasa, and Kai-Wei Chang, in ICML, 2024.

Full Text Abstract BibTeX Details

We study how to characterize and predict the truthfulness of texts generated from large language models (LLMs), which serves as a crucial step in building trust between humans and LLMs. Although several approaches based on entropy or verbalized uncertainty have been proposed to calibrate model predictions, these methods are often intractable, sensitive to hyperparameters, and less reliable when applied in generative tasks with LLMs. In this paper, we suggest investigating internal activations and quantifying LLM’s truthfulness using the local intrinsic dimension (LID) of model activations. Through experiments on four question answering (QA) datasets, we demonstrate the effectiveness of our proposed method. Additionally, we study intrinsic dimensions in LLMs and their relations with model layers, autoregressive language modeling, and the training of LLMs, revealing that intrinsic dimensions can be a powerful approach to understanding LLMs.

@inproceedings{yin2024charactering,
  title = {Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension},
  author = {Yin, Fan and Srinivasa, Jayanth and Chang, Kai-Wei},
  booktitle = {ICML},
  year = {2024}
}

Details

TrustLLM: Trustworthiness in Large Language Models

Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Hanchi Sun, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bertie Vidgen, Bhavya Kailkhura, Caiming Xiong, Chaowei Xiao, Chunyuan Li, Eric P. Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, Huan Zhang, Huaxiu Yao, Manolis Kellis, Marinka Zitnik, Meng Jiang, Mohit Bansal, James Zou, Jian Pei, Jian Liu, Jianfeng Gao, Jiawei Han, Jieyu Zhao, Jiliang Tang, Jindong Wang, Joaquin Vanschoren, John Mitchell, Kai Shu, Kaidi Xu, Kai-Wei Chang, Lifang He, Lifu Huang, Michael Backes, Neil Zhenqiang Gong, Philip S. Yu, Pin-Yu Chen, Quanquan Gu, Ran Xu, Rex Ying, Shuiwang Ji, Suman Jana, Tianlong Chen, Tianming Liu, Tianyi Zhou, William Yang Wang, Xiang Li, Xiangliang Zhang, Xiao Wang, Xing Xie, Xun Chen, Xuyu Wang, Yan Liu, Yanfang Ye, Yinzhi Cao, Yong Chen, and Yue Zhao, in ICML, 2024.

Full Text Abstract BibTeX Details

Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. Therefore, ensuring the trustworthiness of LLMs emerges as an important topic. This paper introduces TrustLLM, a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, established benchmark, evaluation, and analysis of trustworthiness for mainstream LLMs, and discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, consisting of over 30 datasets. Our findings firstly show that in general trustworthiness and utility (i.e., functional effectiveness) are positively related. Secondly, our observations reveal that proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source LLMs. However, a few open-source LLMs come very close to proprietary ones. Thirdly, it is important to note that some LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently not responding. Finally, we emphasize the importance of ensuring transparency not only in the models themselves but also in the technologies that underpin trustworthiness. Knowing the specific trustworthy technologies that have been employed is crucial for analyzing their effectiveness.

@inproceedings{huang2024position,
  title = {TrustLLM: Trustworthiness in Large Language Models},
  author = {Huang, Yue and Sun, Lichao and Wang, Haoran and Wu, Siyuan and Zhang, Qihui and Li, Yuan and Gao, Chujie and Huang, Yixin and Lyu, Wenhan and Zhang, Yixuan and Li, Xiner and Sun, Hanchi and Liu, Zhengliang and Liu, Yixin and Wang, Yijue and Zhang, Zhikun and Vidgen, Bertie and Kailkhura, Bhavya and Xiong, Caiming and Xiao, Chaowei and Li, Chunyuan and Xing, Eric P. and Huang, Furong and Liu, Hao and Ji, Heng and Wang, Hongyi and Zhang, Huan and Yao, Huaxiu and Kellis, Manolis and Zitnik, Marinka and Jiang, Meng and Bansal, Mohit and Zou, James and Pei, Jian and Liu, Jian and Gao, Jianfeng and Han, Jiawei and Zhao, Jieyu and Tang, Jiliang and Wang, Jindong and Vanschoren, Joaquin and Mitchell, John and Shu, Kai and Xu, Kaidi and Chang, Kai-Wei and He, Lifang and Huang, Lifu and Backes, Michael and Gong, Neil Zhenqiang and Yu, Philip S. and Chen, Pin-Yu and Gu, Quanquan and Xu, Ran and Ying, Rex and Ji, Shuiwang and Jana, Suman and Chen, Tianlong and Liu, Tianming and Zhou, Tianyi and Wang, William Yang and Li, Xiang and Zhang, Xiangliang and Wang, Xiao and Xie, Xing and Chen, Xun and Wang, Xuyu and Liu, Yan and Ye, Yanfang and Cao, Yinzhi and Chen, Yong and Zhao, Yue},
  year = {2024},
  booktitle = {ICML}
}

Details

ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, and Nanyun Peng, in ICML, 2024.

Full Text Abstract BibTeX Details

Recent advancements in AI have led to the development of large multimodal models (LMMs) capable of processing complex tasks involving joint reasoning over text and visual content in the image (e.g., navigating maps in public places). This paper introduces ConTextual, a novel benchmark comprising instructions designed explicitly to evaluate LMMs’ ability to perform context-sensitive text-rich visual reasoning. ConTextual emphasizes diverse real-world scenarios (e.g., time-reading, navigation, shopping and more) demanding a deeper understanding of the interactions between textual and visual elements. Our findings reveal a significant performance gap of 30.8% between the best-performing LMM, GPT-4V(ision), and human capabilities using human evaluation indicating substantial room for improvement in context-sensitive text-rich visual reasoning. Notably, while GPT-4V excelled in abstract categories like meme and quote interpretation, its overall performance still lagged behind humans. In addition to human evaluations, we also employed automatic evaluation metrics using GPT-4, uncovering similar trends in performance disparities. We also perform a fine-grained evaluation across diverse visual contexts and provide qualitative analysis which provides a robust framework for future advancements in the LMM design.

@inproceedings{wadhawan2024contextual,
  author = {Wadhawan, Rohan and Bansal, Hritik and Chang, Kai-Wei and Peng, Nanyun},
  title = {ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models},
  booktitle = {ICML},
  year = {2024}
}

Details

Prompt-Driven LLM Safeguarding via Directed Representation Optimization

Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, and Nanyun Peng, in ICML, 2024.

Full Text Abstract BibTeX Details

Prepending model inputs with safety prompts is a common practice for safeguarding large language models (LLMs) against queries with harmful intents. However, the underlying working mechanisms of safety prompts have not been unraveled yet, restricting the possibility of automatically optimizing them to improve LLM safety. In this work, we investigate how LLMs’ behavior (i.e., complying with or refusing user queries) is affected by safety prompts from the perspective of model representation. We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction, in which models become more prone to refusing to provide assistance, even when the queries are harmless. On the other hand, LLMs are naturally capable of distinguishing harmful and harmless queries without safety prompts. Inspired by these findings, we propose a method for safety prompt optimization, namely DRO (Directed Representation Optimization). Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries’ representations along or opposite the refusal direction, depending on their harmfulness. Experiments with eight LLMs on out-of-domain and jailbreak benchmarks demonstrate that DRO remarkably improves the safeguarding performance of human-crafted safety prompts, without compromising the models’ general performance.

@inproceedings{zheng2024prompt,
  title = {Prompt-Driven LLM Safeguarding via Directed Representation Optimization},
  author = {Zheng, Chujie and Yin, Fan and Zhou, Hao and Meng, Fandong and Zhou, Jie and Chang, Kai-Wei and Huang, Minlie and Peng, Nanyun},
  year = {2024},
  booktitle = {ICML}
}

Details

Contextual Label Projection for Cross-Lingual Structured Prediction

Tanmay Parekh, I-Hung Hsu, Kuan-Hao Huang, Kai-Wei Chang, and Nanyun Peng, in NAACL, 2024.

Full Text Abstract BibTeX Details

Label projection, which involves obtaining translated labels and texts jointly, is essential for leveraging machine translation to facilitate cross-lingual transfer in structured prediction tasks. Prior research exploring label projection often compromises translation accuracy by favoring simplified label translation or relying solely on word-level alignments. In this paper, we introduce a novel label projection approach, CLaP, which translates text to the target language and performs contextual translation on the labels using the translated text as the context, ensuring better accuracy for the translated labels. We leverage instruction-tuned language models with multilingual capabilities as our contextual translator, imposing the constraint of the presence of translated labels in the translated text via instructions. We benchmark CLaP with other label projection techniques on zero-shot cross-lingual transfer across 39 languages on two representative structured prediction tasks, event argument extraction (EAE) and named entity recognition (NER), showing over 2.4 F1 improvement for EAE and 1.4 F1 improvement for NER. We further explore the applicability of CLaP on ten extremely low-resource languages to showcase its potential for cross-lingual structured prediction.

@inproceedings{parekh2024contextual,
  title = {Contextual Label Projection for Cross-Lingual Structured Prediction},
  author = {Parekh, Tanmay and Hsu, I-Hung and Huang, Kuan-Hao and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {NAACL},
  year = {2024}
}

Details

Event Detection from Social Media for Epidemic Prediction

Tanmay Parekh, Anh Mac, Jiarui Yu, Yuxuan Dong, Syed Shahriar, Bonnie Liu, Eric J. Yang, Kuan-Hao Huang, Wei Wang, Nanyun Peng, and Kai-Wei Chang, in NAACL, 2024.

Full Text Abstract BibTeX Details

Social media is an easy-to-access platform providing timely updates about societal trends and events. Discussions regarding epidemic-related events such as infections, symptoms, and social interactions can be crucial for informing policymaking during epidemic outbreaks. In our work, we pioneer exploiting Event Detection (ED) for better preparedness and early warnings of any upcoming epidemic by developing a framework to extract and analyze epidemic-related events from social media posts. To this end, we curate an epidemic event ontology comprising seven disease-agnostic event types and construct a Twitter dataset SPEED with human-annotated events focused on the COVID-19 pandemic. Experimentation reveals how ED models trained on COVID-based SPEED can effectively detect epidemic events for three unseen epidemics of Monkeypox, Zika, and Dengue; while models trained on existing ED datasets fail miserably. Furthermore, we show that reporting sharp increases in the extracted events by our framework can provide warnings 4-9 weeks earlier than the WHO epidemic declaration for Monkeypox. This utility of our framework lays the foundations for better preparedness against emerging epidemics.

@inproceedings{parekh2024event,
  title = {Event Detection from Social Media for Epidemic Prediction},
  author = {Parekh, Tanmay and Mac, Anh and Yu, Jiarui and Dong, Yuxuan and Shahriar, Syed and Liu, Bonnie and Yang, Eric J and Huang, Kuan-Hao and Wang, Wei and Peng, Nanyun and Chang, Kai-Wei},
  booktitle = {NAACL},
  year = {2024}
}

Details

The steerability of large language models toward data-driven personas

Junyi Li, Ninareh Mehrabi, Charith Peris, Palash Goyal, Kai-Wei Chang, Aram Galstyan, Richard Zemel, and Rahul Gupta, in NAACL, 2024.

Full Text Abstract BibTeX Details

The recent surge in Large Language Model (LLM) related applications has led to a concurrent escalation in expectations for LLMs to accommodate a myriad of personas and encompass a broad spectrum of perspectives. An important first step towards addressing this demand is to align language models with specific personas, be it groups of users or individuals. Towards this goal, we first present a new conceptualization of a ‘persona’. Moving beyond the traditional reliance on demographics like age, gender, or political party affiliation, we introduce a data-driven persona definition methodology built on collaborative-filtering. In this methodology, users are embedded into a continuous vector space based on their opinions and clustered into cohorts that manifest coherent views across specific inquiries. This methodology allows for a more nuanced understanding of different latent social groups present in the overall population (as opposed to simply using demographic groups) and enhances the applicability of model steerability. Finally, we present an efficient method to steer LLMs towards a particular persona. We learn a soft-prompting model to map the continuous representation of users into sequences of virtual tokens which, when prepended to the LLM input, enables the LLM to produce responses aligned with a given user. Our results show that our steerability algorithm is superior in performance compared to a collection of baselines.

@inproceedings{li2024steerability,
  title = {The steerability of large language models toward data-driven personas},
  author = {Li, Junyi and Mehrabi, Ninareh and Peris, Charith and Goyal, Palash and Chang, Kai-Wei and Galstyan, Aram and Zemel, Richard and Gupta, Rahul},
  booktitle = {NAACL},
  year = {2024}
}

Details

Mitigating Bias for Question Answering Models by Tracking Bias Influence

Mingyu Derek Ma, Jiun-Yu Kao, Arpit Gupta, Yu-Hsiang Lin, Wenbo Zhao, Tagyoung Chung, Wei Wang, Kai-Wei Chang, and Nanyun Peng, in NAACL, 2024.

Full Text Abstract BibTeX Details

Models of various NLP tasks have been shown to exhibit stereotypes, and the bias in question answering (QA) models is especially harmful as the output answers might be directly consumed by the end users. There have been datasets to evaluate bias in QA models, while bias mitigation techniques for QA models are still under-explored. In this work, we propose BMBI, an approach to mitigate the bias of multiple-choice QA models. Based on the intuition that a model would tend to be more biased if it learns from a biased example, we measure the bias level of a query instance by observing its influence on another instance. If the influenced instance is more biased, we infer that the query instance is biased. We then use the bias level detected as an optimization objective to form a multi-task learning setting in addition to the original QA task. We further introduce a new bias evaluation metric to quantify bias in a comprehensive and sensitive way. We show that our method could be applied to multiple QA formulations across multiple bias categories. It can significantly reduce the bias level in all 9 bias categories in the BBQ dataset while maintaining comparable QA accuracy.

@inproceedings{ma2024mitigating,
  title = {Mitigating Bias for Question Answering Models by Tracking Bias Influence},
  author = {Ma, Mingyu Derek and Kao, Jiun-Yu and Gupta, Arpit and Lin, Yu-Hsiang and Zhao, Wenbo and Chung, Tagyoung and Wang, Wei and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {NAACL},
  year = {2024}
}

Details

CASA: Causality-driven Argument Sufficiency Assessment

Xiao Liu, Yansong Feng, and Kai-Wei Chang, in NAACL, 2024.

Full Text Abstract BibTeX Details

The argument sufficiency assessment task aims to determine if the premises of a given argument support its conclusion. To tackle this task, existing works often train a classifier on data annotated by humans. However, annotating data is laborious, and annotations are often inconsistent due to subjective criteria. Motivated by the definition of probability of sufficiency (PS) in the causal literature, we propose CASA, a zero-shot causality-driven argument sufficiency assessment framework. PS measures how likely introducing the premise event would lead to the conclusion when both the premise and conclusion events are absent. To estimate this probability, we propose to use large language models (LLMs) to generate contexts that are inconsistent with the premise and conclusion and revise them by injecting the premise event. Experiments on two logical fallacy detection datasets demonstrate that CASA accurately identifies insufficient arguments. We further deploy CASA in a writing assistance application, and find that suggestions generated by CASA enhance the sufficiency of student-written arguments. Code and data are available at https://github.com/xxxiaol/CASA.

@inproceedings{liu2024casa,
  title = {CASA: Causality-driven Argument Sufficiency Assessment},
  author = {Liu, Xiao and Feng, Yansong and Chang, Kai-Wei},
  booktitle = {NAACL},
  year = {2024}
}

Details

On Leveraging Encoder-only Pre-trained Language Models for Effective Keyphrase Generation

Di Wu, Wasi Uddin Ahmad, and Kai-Wei Chang, in LREC-COLING, 2024.

Full Text Code Abstract BibTeX Details

This study addresses the application of encoder-only Pre-trained Language Models (PLMs) in keyphrase generation (KPG) amidst the broader availability of domain-tailored encoder-only models compared to encoder-decoder models. We investigate three core inquiries: (1) the efficacy of encoder-only PLMs in KPG, (2) optimal architectural decisions for employing encoder-only PLMs in KPG, and (3) a performance comparison between in-domain encoder-only and encoder-decoder PLMs across varied resource settings. Our findings, derived from extensive experimentation in two domains, reveal that with encoder-only PLMs, although KPE with Conditional Random Fields slightly excels in identifying present keyphrases, the KPG formulation renders a broader spectrum of keyphrase predictions. Additionally, prefix-LM fine-tuning of encoder-only PLMs emerges as a strong and data-efficient strategy for KPG, outperforming general-domain seq2seq PLMs. We also identify a favorable parameter allocation towards model depth rather than width when employing encoder-decoder architectures initialized with encoder-only PLMs. The study sheds light on the potential of utilizing encoder-only PLMs for advancing KPG systems and provides a groundwork for future KPG methods.

@inproceedings{wu2024leveraging,
  booktitle = {LREC-COLING},
  year = {2024},
  title = {On Leveraging Encoder-only Pre-trained Language Models for Effective Keyphrase Generation},
  author = {Wu, Di and Ahmad, Wasi Uddin and Chang, Kai-Wei}
}

Details

Are you talking to [’xem’] or [’x’, ’em’]? On Tokenization and Addressing Misgendering in LLMs with Pronoun Tokenization Parity

Anaelia Ovalle, Ninareh Mehrabi, Palash Goyal, Jwala Dhamala, Kai-Wei Chang, Richard Zemel, Aram Galstyan, Yuval Pinter, and Rahul Gupta, in NAACL-Findings, 2024.

Full Text Abstract BibTeX Details

Gender-inclusive NLP research has documented the harmful limitations of gender binary-centric large language models (LLM), such as the inability to correctly use gender-diverse English neopronouns (e.g., xe, zir, fae). While data scarcity is a known culprit, the precise mechanisms through which scarcity affects this behavior remain underexplored. We discover LLM misgendering is significantly influenced by Byte-Pair Encoding (BPE) tokenization, the tokenizer powering many popular LLMs. Unlike binary pronouns, BPE overfragments neopronouns, a direct consequence of data scarcity during tokenizer training. This disparate tokenization mirrors tokenizer limitations observed in multilingual and low-resource NLP, unlocking new misgendering mitigation strategies. We propose two techniques: (1) pronoun tokenization parity, a method to enforce consistent tokenization across gendered pronouns, and (2) utilizing pre-existing LLM pronoun knowledge to improve neopronoun proficiency. Our proposed methods outperform finetuning with standard BPE, improving neopronoun accuracy from 14.1% to 58.4%. Our paper is the first to link LLM misgendering to tokenization and deficient neopronoun grammar, indicating that LLMs unable to correctly treat neopronouns as pronouns are more prone to misgender.

@inproceedings{ovalle2024are,
  title = {Are you talking to ['xem'] or ['x', 'em']? On Tokenization and Addressing Misgendering in LLMs with Pronoun Tokenization Parity},
  author = {Ovalle, Anaelia and Mehrabi, Ninareh and Goyal, Palash and Dhamala, Jwala and Chang, Kai-Wei and Zemel, Richard and Galstyan, Aram and Pinter, Yuval and Gupta, Rahul},
  booktitle = {NAACL-Findings},
  year = {2024}
}

Details

Can small language models help large language models reason better?: LM-guided chain-of-thought

Jooyoung Lee, Fan Yang, Thanh Tran, Qian Hu, Emre Barut, Kai-Wei Chang, and Chengwei Su, in LREC-COLING, 2024.

Full Text Abstract BibTeX Details

We introduce a novel framework, LM-Guided CoT, that leverages a lightweight (i.e., <1B) LM for guiding a black-box large (i.e., >10B) LM in reasoning tasks. Specifically, the lightweight LM first generates a rationale for each input instance. The frozen large LM is then prompted to predict a task output based on the rationale generated by the lightweight LM. Our approach is resource-efficient in the sense that it only requires training the lightweight LM. We optimize the model through 1) knowledge distillation and 2) reinforcement learning from rationale-oriented and task-oriented reward signals. We assess our method with multi-hop extractive question answering (QA) benchmarks, HotpotQA and 2WikiMultiHopQA. Experimental results show that our approach outperforms all baselines regarding answer prediction accuracy. We also find that reinforcement learning helps the model to produce higher-quality rationales with improved QA performance.

@inproceedings{lee2024small,
  title = {Can small language models help large language models reason better?: LM-guided chain-of-thought},
  author = {Lee, Jooyoung and Yang, Fan and Tran, Thanh and Hu, Qian and Barut, Emre and Chang, Kai-Wei and Su, Chengwei},
  year = {2024},
  booktitle = {LREC-COLING}
}

Details

AI-Assisted Summarization of Radiologic Reports: Evaluating GPT3davinci, BARTcnn, LongT5booksum, LEDbooksum, LEDlegal, and LEDclinical

Aichi Chien, Hubert Tang, Bhavita Jagessar, Kai-Wei Chang, Nanyun Peng, Kambiz Nael, and Noriko Salamon, in American Journal of Neuroradiology, 2024.

Full Text Abstract BibTeX Details

The review of clinical reports is an essential part of monitoring disease progression. Synthesizing multiple imaging reports is also important for clinical decisions. It is critical to aggregate information quickly and accurately. Machine learning natural language processing (NLP) models hold promise to address an unmet need for report summarization. We evaluated NLP methods to summarize longitudinal aneurysm reports. A total of 137 clinical reports and 100 PubMed case reports were used in this study. Models were compared against expert-generated summaries using longitudinal imaging notes collected in our institute and compared using publicly accessible PubMed case reports.

@inproceedings{chien2024aiassisted,
  title = {AI-Assisted Summarization of Radiologic Reports: Evaluating GPT3davinci, BARTcnn, LongT5booksum, LEDbooksum, LEDlegal, and LEDclinical},
  author = {Chien, Aichi and Tang, Hubert and Jagessar, Bhavita and Chang, Kai-Wei and Peng, Nanyun and Nael, Kambiz and Salamon, Noriko},
  year = {2024},
  booktitle = {American Journal of Neuroradiology}
}

Details

CoBIT: A Contrastive Bi-directional Image-Text Generation Model

Haoxuan You, Mandy Guo, Zhecan Wang, Kai-Wei Chang, Jason Michael Baldridge, and Jiahui Yu, in ICLR, 2024.

Full Text Abstract BibTeX Details

The field of Vision-and-Language (VL) has witnessed a proliferation of pretrained foundation models. Current techniques typically employ only one type of training objective, whether it’s (1) contrastive objectives (like CLIP), (2) image-to-text generative objectives (like PaLI), or (3) text-to-image generative objectives (like Parti). However, all three objectives are mutually relevant and are all based on image-text pairs. Intuitively, the first two objectives can be considered as complementary projections between two modalities, and contrastive learning can preserve global alignment while generation facilitates fine-grained understanding. Inspired by this, we present a Contrastive Bi-directional Image-Text generation model (CoBIT) to unify, for the first time, the three pre-training objectives in one framework. Specifically, CoBIT employs a novel unicoder-decoder structure consisting of an image unicoder, a text unicoder, and a cross-modal decoder. The image/text unicoders can switch between encoding and decoding in different tasks, enabling flexibility and shared knowledge that benefits

@inproceedings{you2024cobit,
  title = {CoBIT: A Contrastive Bi-directional Image-Text Generation Model},
  author = {You, Haoxuan and Guo, Mandy and Wang, Zhecan and Chang, Kai-Wei and Baldridge, Jason Michael and Yu, Jiahui},
  booktitle = {ICLR},
  year = {2024},
  month = jan,
  day = {16}
}

Details

Understanding and Mitigating Spurious Correlations in Text Classification with Neighborhood Analysis

Oscar Chew, Hsuan-Tien Lin, Kai-Wei Chang, and Kuan-Hao Huang, in EACL-Findings, 2024.

Full Text Abstract BibTeX Details

Recent research has revealed that machine learning models have a tendency to leverage spurious correlations that exist in the training set but may not hold true in general circumstances. For instance, a sentiment classifier may erroneously learn that the token "performances" is commonly associated with positive movie reviews. Relying on these spurious correlations degrades the classifier’s performance when it is deployed on out-of-distribution data. In this paper, we examine the implications of spurious correlations through a novel perspective called neighborhood analysis. The analysis uncovers how spurious correlations lead unrelated words to erroneously cluster together in the embedding space. Driven by the analysis, we design a metric to detect spurious tokens and also propose a family of regularization methods, NFL (doN’t Forget your Language), to mitigate spurious correlations in text classification. Experiments show that NFL can effectively prevent erroneous clusters and significantly improve the robustness of classifiers without auxiliary data. The code is publicly available at https://github.com/oscarchew/doNt-Forget-your-Language.

@inproceedings{chew2024understanding,
  title = {Understanding and Mitigating Spurious Correlations in Text Classification with Neighborhood Analysis},
  author = {Chew, Oscar and Lin, Hsuan-Tien and Chang, Kai-Wei and Huang, Kuan-Hao},
  booktitle = {EACL-Findings},
  year = {2024}
}

Details

2023

Red Teaming Language Model Detectors with Language Models

Zhouxing Shi, Yihan Wang, Fan Yin, Xiangning Chen, Kai-Wei Chang, and Cho-Jui Hsieh, in TACL, 2023.

Full Text Code Abstract BibTeX Details

The prevalence and high capacity of large language models (LLMs) present significant safety and ethical risks when malicious users exploit them for automated content generation. To prevent the potentially deceptive usage of LLMs, recent works have proposed several algorithms to detect machine-generated text. In this paper, we systematically test the reliability of the existing detectors, by designing two types of attack strategies to fool the detectors: 1) replacing words with their synonyms based on the context; 2) altering the writing style of generated text. These strategies are implemented by instructing LLMs to generate synonymous word substitutions or writing directives that modify the style without human involvement, and the LLMs leveraged in the attack can also be protected by detectors. Our research reveals that our attacks effectively compromise the performance of all tested detectors, thereby underscoring the urgent need for the development of more robust machine-generated text detection systems.

@inproceedings{shi2023red,
  author = {Shi, Zhouxing and Wang, Yihan and Yin, Fan and Chen, Xiangning and Chang, Kai-Wei and Hsieh, Cho-Jui},
  title = {Red Teaming Language Model Detectors with Language Models},
  booktitle = {TACL},
  year = {2023}
}

Details

DesCo: Learning Object Recognition with Rich Language Descriptions

Liunian Harold Li, Zi-Yi Dou, Nanyun Peng, and Kai-Wei Chang, in NeurIPS, 2023.

Full Text Demo Abstract BibTeX Details Ranks 1st at the #OmniLabel Challenge of CVPR2023

Recent development in vision-language approaches has instigated a paradigm shift in learning visual recognition models from language supervision. These approaches align objects with language queries (e.g. "a photo of a cat") and improve the models’ adaptability to identify novel objects and domains. Recently, several studies have attempted to query these models with complex language expressions that include specifications of fine-grained semantic details, such as attributes, shapes, textures, and relations. However, simply incorporating language descriptions as queries does not guarantee accurate interpretation by the models. In fact, our experiments show that GLIP, the state-of-the-art vision-language model for object detection, often disregards contextual information in the language descriptions and instead relies heavily on detecting objects solely by their names. To tackle the challenges, we propose a new description-conditioned (DesCo) paradigm of learning object recognition models with rich language descriptions consisting of two major innovations: 1) we employ a large language model as a commonsense knowledge engine to generate rich language descriptions of objects based on object names and the raw image-text caption; 2) we design context-sensitive queries to improve the model’s ability in deciphering intricate nuances embedded within descriptions and force the model to focus on context rather than object names alone. On two novel object detection benchmarks, LVIS and OmniLabel, under the zero-shot detection setting, our approach achieves 34.8 APr minival (+9.1) and 29.3 AP (+3.6), respectively, surpassing the prior state-of-the-art models, GLIP and FIBER, by a large margin.

@inproceedings{li2023desco,
  author = {Li, Liunian Harold and Dou, Zi-Yi and Peng, Nanyun and Chang, Kai-Wei},
  title = {DesCo: Learning Object Recognition with Rich Language Descriptions},
  booktitle = {NeurIPS},
  year = {2023}
}

Details

ParaAMR: A Large-Scale Syntactically Diverse Paraphrase Dataset by AMR Back-Translation

Kuan-Hao Huang, Varun Iyer, I-Hung Hsu, Anoop Kumar, Kai-Wei Chang, and Aram Galstyan, in ACL, 2023.

Full Text Abstract BibTeX Details Area Chair’s Award

Paraphrase generation is a long-standing task in natural language processing (NLP). Supervised paraphrase generation models, which rely on human-annotated paraphrase pairs, are cost-inefficient and hard to scale up. On the other hand, automatically annotated paraphrase pairs (e.g., by machine back-translation), usually suffer from the lack of syntactic diversity – the generated paraphrase sentences are very similar to the source sentences in terms of syntax. In this work, we present ParaAMR, a large-scale syntactically diverse paraphrase dataset created by abstract meaning representation back-translation. Our quantitative analysis, qualitative examples, and human evaluation demonstrate that the paraphrases of ParaAMR are syntactically more diverse compared to existing large-scale paraphrase datasets while preserving good semantic similarity. In addition, we show that ParaAMR can be used to improve on three NLP tasks: learning sentence embeddings, syntactically controlled paraphrase generation, and data augmentation for few-shot learning. Our results thus showcase the potential of ParaAMR for improving various NLP applications.

@inproceedings{huang2023paraarm,
  author = {Huang, Kuan-Hao and Iyer, Varun and Hsu, I-Hung and Kumar, Anoop and Chang, Kai-Wei and Galstyan, Aram},
  title = {ParaAMR: A Large-Scale Syntactically Diverse Paraphrase Dataset by AMR Back-Translation},
  booktitle = {ACL},
  presentation_id = {https://underline.io/events/395/posters/15227/poster/76600-paraamr-a-large-scale-syntactically-diverse-paraphrase-dataset-by-amr-back-translation},
  year = {2023}
}

Details

The Tail Wagging the Dog: Dataset Construction Biases of Social Bias Benchmarks

Nikil Roashan Selvam, Sunipa Dev, Daniel Khashabi, Tushar Khot, and Kai-Wei Chang, in ACL (short), 2023.

Full Text Abstract BibTeX Details Outstanding Paper Award

How reliably can we trust the scores obtained from social bias benchmarks as faithful indicators of problematic social biases in a given language model? In this work, we study this question by contrasting social biases with non-social biases stemming from choices made during dataset construction that might not even be discernible to the human eye. To do so, we empirically simulate various alternative constructions for a given benchmark based on innocuous modifications (such as paraphrasing or random-sampling) that maintain the essence of their social bias. On two well-known social bias benchmarks (Winogender and BiasNLI) we observe that these shallow modifications have a surprising effect on the resulting degree of bias across various models. We hope these troubling observations motivate more robust measures of social biases.

@inproceedings{roashan2023tail,
  author = {Selvam, Nikil Roashan and Dev, Sunipa and Khashabi, Daniel and Khot, Tushar and Chang, Kai-Wei},
  title = {The Tail Wagging the Dog: Dataset Construction Biases of Social Bias Benchmarks},
  presentation_id = {https://underline.io/events/395/posters/15337/poster/76963-the-tail-wagging-the-dog-dataset-construction-biases-of-social-bias-benchmarks},
  booktitle = {ACL (short)},
  year = {2023}
}

Details

On the Paradox of Learning to Reason from Data

Honghua Zhang, Liunian Harold Li, Tao Meng, Kai-Wei Chang, and Guy Van den Broeck, in IJCAI, 2023.

Full Text Code Abstract BibTeX Details Top-10 cited paper at IJCAI 23

Logical reasoning is needed in a wide range of NLP tasks. Can a BERT model be trained end-to-end to solve logical reasoning problems presented in natural language? We attempt to answer this question in a confined problem space where there exists a set of parameters that perfectly simulates logical reasoning. We make observations that seem to contradict each other: BERT attains near-perfect accuracy on in-distribution test examples while failing to generalize to other data distributions over the exact same problem space. Our study provides an explanation for this paradox: instead of learning to emulate the correct reasoning function, BERT has in fact learned statistical features that inherently exist in logical reasoning problems. We also show that it is infeasible to jointly remove statistical features from data, illustrating the difficulty of learning to reason in general. Our result naturally extends to other neural models and unveils the fundamental difference between learning to reason and learning to achieve high performance on NLP benchmarks using statistical features.

@inproceedings{zhang2023on,
  title = {On the Paradox of Learning to Reason from Data},
  author = {Zhang, Honghua and Li, Liunian Harold and Meng, Tao and Chang, Kai-Wei and Van den Broeck, Guy},
  booktitle = {IJCAI},
  year = {2023}
}

Details

CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning

Hritik Bansal, Nishad Singhi, Yu Yang, Fan Yin, Aditya Grover, and Kai-Wei Chang, in ICCV, 2023.

Full Text Code Abstract BibTeX Details Best Paper Award at ICLR Workshop, Oral at ICCV (195 out of 8088 submissions, top 2.5%)

Multimodal contrastive pretraining has been used to train multimodal representation models, such as CLIP, on large amounts of paired image-text data. However, previous studies have revealed that such models are vulnerable to backdoor attacks. Specifically, when trained on backdoored examples, CLIP learns spurious correlations between the embedded backdoor trigger and the target label, aligning their representations in the joint embedding space. Injecting even a small number of poisoned examples, such as 75 examples in 3 million pretraining data, can significantly manipulate the model’s behavior, making it difficult to detect or unlearn such correlations. To address this issue, we propose CleanCLIP, a finetuning framework that weakens the learned spurious associations introduced by backdoor attacks by independently re-aligning the representations for individual modalities. We demonstrate that unsupervised finetuning using a combination of multimodal contrastive and unimodal self-supervised objectives for individual modalities can significantly reduce the impact of the backdoor attack. We show empirically that CleanCLIP maintains model performance on benign examples while erasing a range of backdoor attacks on multimodal contrastive learning.

@inproceedings{bansal2023cleanclip,
  author = {Bansal, Hritik and Singhi, Nishad and Yang, Yu and Yin, Fan and Grover, Aditya and Chang, Kai-Wei},
  title = {CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning},
  booktitle = {ICCV},
  year = {2023}
}

Details

Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models

Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao, in NeurIPS, 2023.

Full Text Code Abstract BibTeX Details

Large language models (LLMs) have achieved remarkable progress in solving various natural language processing tasks due to emergent reasoning abilities. However, LLMs have inherent limitations as they are incapable of accessing up-to-date information (stored on the Web or in task-specific knowledge bases), using external tools, and performing precise mathematical and logical reasoning. In this paper, we present Chameleon, an AI system that mitigates these limitations by augmenting LLMs with plug-and-play modules for compositional reasoning. Chameleon synthesizes programs by composing various tools (e.g., LLMs, off-the-shelf vision models, web search engines, Python functions, and heuristic-based modules) for accomplishing complex reasoning tasks. At the heart of Chameleon is an LLM-based planner that assembles a sequence of tools to execute to generate the final response. We showcase the effectiveness of Chameleon on two multi-modal knowledge-intensive reasoning tasks: ScienceQA and TabMWP. Chameleon, powered by GPT-4, achieves an 86.54% overall accuracy on ScienceQA, improving the best published few-shot result by 11.37%. On TabMWP, GPT-4-powered Chameleon improves the accuracy by 17.0%, lifting the state of the art to 98.78%. Our analysis also shows that the GPT-4-powered planner exhibits more consistent and rational tool selection via inferring potential constraints from instructions, compared to a ChatGPT-powered planner.

@inproceedings{lu2023chameleon,
  author = {Lu, Pan and Peng, Baolin and Cheng, Hao and Galley, Michel and Chang, Kai-Wei and Wu, Ying Nian and Zhu, Song-Chun and Gao, Jianfeng},
  title = {Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models},
  booktitle = {NeurIPS},
  keyword_extra = {AI-agent},
  year = {2023}
}

Details

AVIS: Autonomous Visual Information Seeking with Large Language Models

Ziniu Hu, Ahmet Iscen, Chen Sun, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A. Ross, and Alireza Fathi, in NeurIPS, 2023.

Full Text Abstract BibTeX Details

In this paper, we propose an autonomous information seeking visual question answering framework, AVIS. Our method leverages a Large Language Model (LLM) to dynamically strategize the utilization of external tools and to investigate their outputs, thereby acquiring the indispensable knowledge needed to provide answers to the posed questions. Responding to visual questions that necessitate external knowledge, such as "What event is commemorated by the building depicted in this image?", is a complex task. This task presents a combinatorial search space that demands a sequence of actions, including invoking APIs, analyzing their responses, and making informed decisions. We conduct a user study to collect a variety of instances of human decision-making when faced with this task. This data is then used to design a system comprised of three components: an LLM-powered planner that dynamically determines which tool to use next, an LLM-powered reasoner that analyzes and extracts key information from the tool outputs, and a working memory component that retains the acquired information throughout the process. The collected user behavior serves as a guide for our system in two key ways. First, we create a transition graph by analyzing the sequence of decisions made by users. This graph delineates distinct states and confines the set of actions available at each state. Second, we use examples of user decision-making to provide our LLM-powered planner and reasoner with relevant contextual instances, enhancing their capacity to make informed decisions. We show that AVIS achieves state-of-the-art results on knowledge-intensive visual question answering benchmarks such as Infoseek and OK-VQA.

@inproceedings{hu2023avis,
  author = {Hu, Ziniu and Iscen, Ahmet and Sun, Chen and Chang, Kai-Wei and Sun, Yizhou and Schmid, Cordelia and Ross, David A and Fathi, Alireza},
  booktitle = {NeurIPS},
  title = {AVIS: Autonomous Visual Information Seeking with Large Language Models},
  keyword_extra = {AI-agent},
  year = {2023}
}

Details

A Pseudo-Semantic Loss for Deep Generative Models with Logical Constraints

Kareem Ahmed, Kai-Wei Chang, and Guy Van den Broeck, in NeurIPS, 2023.

Full Text Abstract BibTeX Details

Neuro-symbolic approaches bridge the gap between purely symbolic and neural approaches to learning. This often requires maximizing the probability of a symbolic constraint in the neural network’s output. However, output distributions are typically assumed to be fully-factorized, which prohibits the application of neurosymbolic learning to more expressive output distributions, such as autoregressive deep generative models. There, such probability computation is #P-hard, even for simple constraints. Instead, we propose to locally approximate the probability of the symbolic constraint under the pseudolikelihood distribution – the product of its full conditionals given a sample from the model. This allows our pseudo-semantic loss function to enforce the symbolic constraint. Our method bears relationship to several classical approximation schemes, including hogwild Gibbs sampling, consistent pseudolikelihood learning, and contrastive divergence. We test our proposed approach on three distinct settings: Sudoku, shortest-path prediction, and detoxifying large language models. Experiments show that pseudo-semantic loss greatly improves upon the base model’s ability to satisfy the desired logical constraint in its output distribution.

@inproceedings{ahmed2023neuro,
  title = {A Pseudo-Semantic Loss for Deep Generative Models with Logical Constraints},
  author = {Ahmed, Kareem and Chang, Kai-Wei and Van den Broeck, Guy},
  booktitle = {NeurIPS},
  year = {2023}
}

Details

Text Encoders are Performance Bottlenecks in Contrastive Vision-Language Models

Amita Kamath, Jack Hessel, and Kai-Wei Chang, in EMNLP, 2023.

Full Text Abstract BibTeX Details

Performant vision-language (VL) models like CLIP represent captions using a single vector. How much information about language is lost in this bottleneck? We first curate CompPrompts, a set of increasingly compositional image captions that VL models should be able to capture (e.g., single object, to object+property, to multiple interacting objects). Then, we train text-only recovery probes that aim to reconstruct captions from single-vector text representations produced by several VL models. This approach doesn’t require images, allowing us to test on a broader range of scenes compared to prior work. We find that: 1) CLIP’s text encoder falls short on object relationships, attribute-object association, counting, and negations; 2) some text encoders work significantly better than others; and 3) text-only recovery performance predicts multi-modal matching performance on ControlledImCaps: a new evaluation benchmark we collect+release consisting of fine-grained compositional images+captions. Specifically – our results suggest text-only recoverability is a necessary (but not sufficient) condition for modeling compositional factors in contrastive vision+language models.

@inproceedings{kamath2023text,
  author = {Kamath, Amita and Hessel, Jack and Chang, Kai-Wei},
  title = {Text Encoders are Performance Bottlenecks in Contrastive Vision-Language Models},
  booktitle = {EMNLP},
  year = {2023}
}

Details

Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation

Da Yin, Xiao Liu, Fan Yin, Ming Zhong, Hritik Bansal, Jiawei Han, and Kai-Wei Chang, in EMNLP, 2023.

Full Text Code Abstract BibTeX Details

Instruction tuning has emerged to enhance the capabilities of large language models (LLMs) in providing appropriate outputs based on input instructions. However, existing methods for collecting instruction-tuning data suffer from limitations in scalability and affordability. In this paper, we propose Dynosaur, a dynamic growth paradigm for instruction-tuning data curation. Built upon the metadata of existing NLP datasets, we generate multiple task instructions applicable to various NLP datasets and determine the relevant data fields for constructing instruction-tuning data with LLMs. Dynosaur offers several advantages: 1) lower generation costs (less than $12 for generating 800K instruction-tuning data), 2) good quality of instruction-tuning data (better performance than Alpaca and Instruction GPT-4 on Super-NI with comparable data sizes), and 3) the ability to grow dynamically by incorporating new datasets from Huggingface Datasets Platform. We further investigate continual learning as an approach to learning with the ever-growing instruction-tuning dataset. We demonstrate that replay methods not only help mitigate forgetting issues but help generalize to unseen tasks better. As a novel continual learning scenario for instruction tuning, selecting tasks based on instruction representations can be an effective replaying strategy.

@inproceedings{yin2023dynosaur,
  author = {Yin, Da and Liu, Xiao and Yin, Fan and Zhong, Ming and Bansal, Hritik and Han, Jiawei and Chang, Kai-Wei},
  title = {Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation},
  booktitle = {EMNLP},
  keyword_extra = {AI-agent},
  year = {2023}
}

Details

Rethinking Model Selection and Decoding for Keyphrase Generation with Pre-trained Sequence-to-Sequence Models

Di Wu, Wasi Uddin Ahmad, and Kai-Wei Chang, in EMNLP, 2023.

Full Text Code Abstract BibTeX Details

Keyphrase Generation (KPG) is a longstanding task in NLP with broad applications. The advent of pre-trained language models (PLMs) has recently led to a significant improvement in KPG. Nonetheless, several design choices are arbitrary and have not been comprehensively studied. This paper presents a systematic study aimed at benchmarking the impact of model choice and decoding strategies on PLM-based KPG. Specifically, we first reflect on why sequence-to-sequence (seq2seq) PLMs are suitable for KPG via an attention-based hypothesis. Then, we reveal that the conventional wisdom for selecting seq2seq PLMs is incomplete: (1) scaling up model size or task adaptation alone is parameter inefficient; (2) while in-domain pre-training combined with task adaptation significantly benefits KPG, it also compromises generalization to some extent. For decoding, we show that although greedy search achieves strong F1 scores, its recall has substantial room for improvement compared to sampling-based approaches. Based on the findings, we introduce DeSel, a probability-based decode-select algorithm that improves greedy search by an average of 4.7% semantic F1 over five datasets. Together, our results set a solid foundation for future exploration and study of KPG.

@inproceedings{wu2023rethinking,
  author = {Wu, Di and Ahmad, Wasi Uddin and Chang, Kai-Wei},
  title = {Rethinking Model Selection and Decoding for Keyphrase Generation with Pre-trained Sequence-to-Sequence Models},
  booktitle = {EMNLP},
  year = {2023}
}

Details

LACMA: Language-Aligning Contrastive Learning with Meta-Actions for Embodied Instruction Following

Cheng-Fu Yang, Yen-Chun Chen, Jianwei Yang, Xiyang Dai, Lu Yuan, Yu-Chiang Frank Wang, and Kai-Wei Chang, in EMNLP, 2023.


End-to-end Transformers have demonstrated impressive success rates for Embodied Instruction Following when the environment has been seen during training. However, they tend to struggle when deployed in a new environment. We discover this lack of generalizability is due to ignorance of the natural language instruction. To mitigate this, we first propose to explicitly align the agent’s hidden states to the instructions via contrastive learning. Nevertheless, the semantic gap between high-level language instructions and the agent’s low-level action space remains an obstacle. We further bridge this gap via a novel concept of meta-actions. Meta-actions are ubiquitous action patterns that can be parsed from the original action sequence. These patterns represent higher-level semantics that are intuitively more similar to the instructions. When meta-actions are further applied as additional training signals, the agent generalizes even better to unseen environments. Compared to a strong multi-modal Transformer baseline, we achieve a significant 4.5% absolute gain in success rate in the unseen environments of ALFRED Embodied Instruction Following. Additional analysis shows that the contrastive objective and meta-actions are complementary for achieving the best result, and the resulting agent better aligns its states to corresponding instructions, hence is more favorable for real-world embodied agents.

@inproceedings{yang2023lacma,
  title = {LACMA: Language-Aligning Contrastive Learning with Meta-Actions for Embodied Instruction Following},
  author = {Yang, Cheng-Fu and Chen, Yen-Chun and Yang, Jianwei and Dai, Xiyang and Yuan, Lu and Wang, Yu-Chiang Frank and Chang, Kai-Wei},
  booktitle = {EMNLP},
  keyword_extra = {AI-agent},
  year = {2023}
}


Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks

Po-Nien Kung, Fan Yin, Di Wu, Kai-Wei Chang, and Nanyun Peng, in EMNLP, 2023.


Instruction tuning (IT) achieves impressive zero-shot generalization results by training large language models (LLMs) on a massive amount of diverse tasks with instructions. However, how to select new tasks to improve the performance and generalizability of IT models remains a challenge. Training on all existing tasks is impractical due to prohibitive computation requirements, and randomly selecting tasks can lead to suboptimal performance. In this work, we propose active instruction tuning based on prompt uncertainty, a novel framework to actively identify and train on informative tasks by assessing models’ sensitivity to prompt perturbations. Our experiments on NIV2 and Self-Instruct datasets demonstrate that our method consistently outperforms other baseline strategies for task selection, achieving better out-of-distribution generalization with fewer training tasks. Additionally, we introduce a task map that categorizes and diagnoses tasks based on prompt uncertainty and generation perplexity, and discover that training on ambiguous (prompt-uncertain) tasks improves generalization while training on difficult (prompt-certain and low-probability) tasks offers no benefit, underscoring the importance of task selection for instruction tuning.

@inproceedings{kung2023active,
  title = {Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks},
  author = {Kung, Po-Nien and Yin, Fan and Wu, Di and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {EMNLP},
  year = {2023}
}


"What’s ‘up’ with vision-language models? Investigating their struggle to understand spatial relations."

Amita Kamath, Jack Hessel, and Kai-Wei Chang, in EMNLP, 2023.


Recent vision-language (VL) models have reached human parity on VQAv2, but does that mean they can distinguish "left" from "right"? We curate three new corpora to precisely quantify model ability to comprehend basic spatial relations: COCO-prep from COCO, GQA-prep from GQA, and RealCLEVR from images we capture ourselves with even tighter controls. Compared to prior evaluations which conflate several types of reasoning, our three tests offer precise evaluations of spatial relations. For example, our RealCLEVR benchmark is controlled, with only the preposition changing between images within a set (e.g., mug on/under/left of/right of a table). This enables us to evaluate model performance on pairs or sets of prepositions. We evaluate 18 VL models, finding that all fall far behind human performance (despite surpassing human performance on VQAv2, as in the case of BLIP2); most only achieve a few points above random chance across all benchmarks. We then study the LAION-2B dataset, which was used to train OpenCLIP models, to investigate if pre-training data can provide clues as to why spatial relation understanding doesn’t emerge. We find that prepositions are infrequent and often ambiguous in LAION-2B. Based on this corpus analysis, we investigate a few training strategies to address this shortcoming. While up-weighting preposition-containing instances and fine-tuning on IID data improve accuracy slightly, our three spatial relation benchmarks remain challenging for all VL models we test. We will release code and data.

@inproceedings{kamath2023whatsup,
  title = {"What's 'up' with vision-language models? Investigating their struggle to understand spatial relations."},
  author = {Kamath, Amita and Hessel, Jack and Chang, Kai-Wei},
  booktitle = {EMNLP},
  year = {2023}
}


Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond

Zhecan Wang, Long Chen, Haoxuan You, Keyang Xu, Noel C. Codella, Kai-Wei Chang, and Shih-Fu Chang, in EMNLP-Findings, 2023.


Vision-language (VL) understanding tasks evaluate models’ comprehension of complex visual scenes through multiple-choice questions. However, we have identified two dataset biases that models can exploit as shortcuts to resolve various VL tasks correctly without proper understanding. The first type of dataset bias is Unbalanced Matching bias, where the correct answer overlaps the question and image more than the incorrect answers. The second type of dataset bias is Distractor Similarity bias, where incorrect answers are overly dissimilar to the correct answer but significantly similar to other incorrect answers within the same sample. To address these dataset biases, we first propose Adversarial Data Synthesis (ADS) to generate synthetic training and debiased evaluation data. We then introduce Intra-sample Counterfactual Training (ICT) to assist models in utilizing the synthesized training data, particularly the counterfactual data, via focusing on intra-sample differentiation. Extensive experiments demonstrate the effectiveness of ADS and ICT in consistently improving model performance across different benchmarks, even in domain-shifted scenarios.

@inproceedings{wang2023datasetbias,
  title = {Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond},
  author = {Wang, Zhecan and Chen, Long and You, Haoxuan and Xu, Keyang and Codella, Noel C and Chang, Kai-Wei and Chang, Shih-Fu},
  booktitle = {EMNLP-Findings},
  year = {2023}
}


IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad Ayyubi, Kai-Wei Chang, and Shih-Fu Chang, in EMNLP-Findings, 2023.


The field of vision-and-language (VL) understanding has made unprecedented progress with end-to-end large pre-trained VL models (VLMs). However, they still fall short in zero-shot reasoning tasks that require multi-step inferencing. To achieve this goal, previous works resort to a divide-and-conquer pipeline. In this paper, we argue that previous efforts have several inherent shortcomings: 1) They rely on domain-specific sub-question decomposing models. 2) They force models to predict the final answer even if the sub-questions or sub-answers provide insufficient information. We address these limitations via IdealGPT, a framework that iteratively decomposes VL reasoning using large language models (LLMs). Specifically, IdealGPT utilizes an LLM to generate sub-questions, a VLM to provide corresponding sub-answers, and another LLM to reason to achieve the final answer. These three modules perform the divide-and-conquer procedure iteratively until the model is confident about the final answer to the main question. We evaluate IdealGPT on multiple challenging VL reasoning tasks under a zero-shot setting. In particular, our IdealGPT outperforms the best existing GPT-4-like models by an absolute 10% on VCR and 15% on SNLI-VE.

@inproceedings{you2023idealgpt,
  author = {You, Haoxuan and Sun, Rui and Wang, Zhecan and Chen, Long and Wang, Gengyu and Ayyubi, Hammad and Chang, Kai-Wei and Chang, Shih-Fu},
  booktitle = {EMNLP-Findings},
  title = {IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models},
  year = {2023}
}


Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems

Yixin Wan, Jieyu Zhao, Aman Chadha, Nanyun Peng, and Kai-Wei Chang, in EMNLP-Findings, 2023.


Recent advancements in Large Language Models empower them to follow freeform instructions, including imitating generic or specific demographic personas in conversations. Generic personas refer to a demographic group (e.g. an Asian person), whereas specific personas can be actual names of historical figures. While the adoption of personas allows dialogue systems to be more engaging and approachable to users, it also carries the potential risk of exacerbating social biases in model responses, further causing societal harms through interactions with users. In this paper, we systematically study “persona biases”, which we define to be the sensitivity of harmful dialogue model behaviors to different persona adoptions. We categorize persona biases into biases in harmful expression and harmful agreement, and establish a comprehensive evaluation framework to measure persona biases in five aspects: Offensiveness, Toxic Continuation, Regard, Stereotype Agreement, and Toxic Agreement. Additionally, we propose to comprehensively investigate persona biases through experimenting with UniversalPersona, a systematized persona dataset with a comprehensive list of both generic and specific model personas. Through benchmarking on four different models, including Blender, ChatGPT, Alpaca, and Vicuna, our study uncovers significant persona biases in dialogue systems. The findings of our study underscore the immediate need to revisit the use of persona traits in dialogue agents to ensure their safe application.

@inproceedings{wan2023personalized,
  author = {Wan, Yixin and Zhao, Jieyu and Chadha, Aman and Peng, Nanyun and Chang, Kai-Wei},
  title = {Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems},
  booktitle = {EMNLP-Findings},
  year = {2023}
}


Kelly is a Warm Person, Joseph is a Role Model: Gender Biases in LLM-Generated Reference Letters

Yixin Wan, George Pu, Jiao Sun, Aparna Garimella, Kai-Wei Chang, and Nanyun Peng, in EMNLP-Findings, 2023.


As generative language models advance, users have started to utilize Large Language Models (LLMs) to assist in writing various types of content, including professional documents such as recommendation letters. Despite their convenience, these applications introduce unprecedented fairness concerns. As generated reference letters might be directly utilized by users in professional or academic scenarios, they have the potential to cause direct harm such as lowering success rates for female applicants. Therefore, it is urgent and necessary to comprehensively study fairness issues and associated harms in such real-world use cases for future mitigation and monitoring. In this paper, we critically examine gender bias in LLM-generated reference letters. Inspired by findings in social science, we specifically design evaluation methods to manifest gender biases in LLM-generated letters through two dimensions: biases in language style and biases in lexical content. Furthermore, we investigate the extent of bias propagation by separately analyzing bias amplification in model-hallucinated contents, which we define to be hallucination bias of model-generated documents. Through benchmarking evaluation on 4 popular LLMs, including ChatGPT, Alpaca, Vicuna and StableLM, our study reveals significant gender biases in LLM-generated recommendation letters. Our findings further point towards the importance and urgency of recognizing bias in LLM-generated professional documents.

@inproceedings{wan2023kelly,
  title = {Kelly is a Warm Person, Joseph is a Role Model: Gender Biases in LLM-Generated Reference Letters},
  author = {Wan, Yixin and Pu, George and Sun, Jiao and Garimella, Aparna and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {EMNLP-Findings},
  year = {2023}
}


A Survey of Deep Learning for Mathematical Reasoning

Pan Lu, Liang Qiu, Wenhao Yu, Sean Welleck, and Kai-Wei Chang, in ACL, 2023.


Mathematical reasoning is a fundamental aspect of human intelligence and is applicable in various fields, including science, engineering, finance, and everyday life. The development of artificial intelligence (AI) systems capable of solving math problems and proving theorems has garnered significant interest in the fields of machine learning and natural language processing. For example, mathematics serves as a testbed for aspects of reasoning that are challenging for powerful deep learning models, driving new algorithmic and modeling advances. On the other hand, recent advances in large-scale neural language models have opened up new benchmarks and opportunities to use deep learning for mathematical reasoning. In this survey paper, we review the key tasks, datasets, and methods at the intersection of mathematical reasoning and deep learning over the past decade. We also evaluate existing benchmarks and methods, and discuss future research directions in this domain.

@inproceedings{lu2023survey,
  author = {Lu, Pan and Qiu, Liang and Yu, Wenhao and Welleck, Sean and Chang, Kai-Wei},
  title = {A Survey of Deep Learning for Mathematical Reasoning},
  booktitle = {ACL},
  year = {2023},
  presentation_id = {https://underline.io/events/395/posters/15337/poster/76360-a-survey-of-deep-learning-for-mathematical-reasoning}
}


Efficient Shapley Values Estimation by Amortization for Text Classification

Chenghao Yang, Fan Yin, He He, Kai-Wei Chang, Xiaofei Ma, and Bing Xiang, in ACL, 2023.


Despite the popularity of Shapley Values in explaining neural text classification models, computing them is prohibitive for large pretrained models because it requires a large number of model evaluations over various perturbed text inputs. In practice, Shapley Values are often estimated stochastically with a smaller number of model evaluations. However, we find that the estimated Shapley Values are quite sensitive to random seeds: the top-ranked features often have little overlap under two different seeds, especially on examples with longer input text. As a result, a much larger number of model evaluations is needed to reduce the sensitivity to an acceptable level. To mitigate the trade-off between stability and efficiency, we develop an amortized model that directly predicts Shapley Values of each input feature without additional model evaluation. It is trained on a set of examples with Shapley Values estimated from a large number of model evaluations to ensure stability. Experimental results on two text classification datasets demonstrate that the proposed amortized model can estimate black-box explanation scores in milliseconds per sample at inference time and is up to 60 times more efficient than traditional methods.
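
As an illustration of the stochastic estimation the abstract refers to, the following is a minimal Python sketch of permutation-sampling Shapley estimation. This is a standard Monte Carlo estimator, not the paper's amortized model. The function names, the toy scoring function, and the example tokens are all hypothetical; `toy_score` merely stands in for a real classifier's score on a perturbed input.

```python
import random
from typing import Callable, List, Sequence


def sampled_shapley(
    tokens: Sequence[str],
    value_fn: Callable[[List[str]], float],
    num_permutations: int,
    seed: int,
) -> List[float]:
    """Monte Carlo Shapley estimate from random token orderings."""
    rng = random.Random(seed)
    n = len(tokens)
    totals = [0.0] * n
    for _ in range(num_permutations):
        order = list(range(n))
        rng.shuffle(order)
        present = [False] * n
        prev = value_fn([])  # score with every token masked out
        for i in order:
            # Reveal token i and credit it with the change in score.
            present[i] = True
            kept = [t for t, keep in zip(tokens, present) if keep]
            cur = value_fn(kept)
            totals[i] += cur - prev
            prev = cur
    return [t / num_permutations for t in totals]


def toy_score(kept: List[str]) -> float:
    """Hypothetical stand-in for a classifier's positive-class score."""
    words = set(kept)
    score = 0.0
    if "great" in words:
        score += 1.0
    if "great" in words and "not" in words:
        score -= 1.5  # negation interacts with the sentiment word
    if "movie" in words:
        score += 0.1
    return score


tokens = ["not", "a", "great", "movie"]
for seed in (0, 1):
    estimate = sampled_shapley(tokens, toy_score, num_permutations=5, seed=seed)
    print(seed, [round(v, 3) for v in estimate])
```

With few permutations, estimates from different seeds can disagree on which tokens matter most, which is the seed sensitivity described above. Each permutation costs roughly one model evaluation per token, so reducing that variance by sampling more permutations quickly becomes expensive for a large model.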

@inproceedings{yang2023efficient,
  title = {Efficient Shapley Values Estimation by Amortization for Text Classification},
  author = {Yang, Chenghao and Yin, Fan and He, He and Chang, Kai-Wei and Ma, Xiaofei and Xiang, Bing},
  year = {2023},
  presentation_id = {https://underline.io/events/395/sessions/15249/lecture/76179-efficient-shapley-values-estimation-by-amortization-for-text-classification},
  booktitle = {ACL}
}


Resolving Ambiguities in Text-to-Image Generative Models

Ninareh Mehrabi, Palash Goyal, Apurv Verma, Jwala Dhamala, Varun Kumar, Qian Hu, Kai-Wei Chang, Richard Zemel, Aram Galstyan, and Rahul Gupta, in ACL, 2023.


Natural language often contains ambiguities that can lead to misinterpretation and miscommunication. While humans can handle ambiguities effectively by asking clarifying questions and/or relying on contextual cues and common-sense knowledge, resolving ambiguities can be notoriously hard for machines. In this work, we study ambiguities that arise in text-to-image generative models. We curate a benchmark dataset covering different types of ambiguities that occur in these systems. We then propose a framework to mitigate ambiguities in the prompts given to the systems by soliciting clarifications from the user. Through automatic and human evaluations, we show the effectiveness of our framework in generating more faithful images aligned with human intention in the presence of ambiguities.

@inproceedings{mehrabi2023resolving,
  author = {Mehrabi, Ninareh and Goyal, Palash and Verma, Apurv and Dhamala, Jwala and Kumar, Varun and Hu, Qian and Chang, Kai-Wei and Zemel, Richard and Galstyan, Aram and Gupta, Rahul},
  booktitle = {ACL},
  title = {Resolving Ambiguities in Text-to-Image Generative Models},
  presentation_id = {https://underline.io/events/395/posters/15237/poster/76575-resolving-ambiguities-in-text-to-image-generative-models},
  year = {2023}
}


GENEVA: Pushing the Limit of Generalizability for Event Argument Extraction with 100+ Event Types

Tanmay Parekh, I-Hung Hsu, Kuan-Hao Huang, Kai-Wei Chang, and Nanyun Peng, in ACL, 2023.


Recent works in Event Argument Extraction (EAE) have focused on improving model generalizability to cater to new events and domains. However, standard benchmarking datasets like ACE and ERE cover fewer than 40 event types and 25 entity-centric argument roles. Limited diversity and coverage hinder these datasets from adequately evaluating the generalizability of EAE models. In this paper, we first contribute by creating a large and diverse EAE ontology. This ontology is created by transforming FrameNet, a comprehensive semantic role labeling (SRL) dataset for EAE, by exploiting the similarity between these two tasks. Then, exhaustive human expert annotations are collected to build the ontology, concluding with 115 events and 220 argument roles, with a significant portion of roles not being entities. We utilize this ontology to further introduce GENEVA, a diverse generalizability benchmarking dataset comprising four test suites, aimed at evaluating models’ ability to handle limited data and unseen event type generalization. We benchmark six EAE models from various families. The results show that owing to non-entity argument roles, even the best-performing model can only achieve 39% F1 score, indicating how GENEVA provides new challenges for generalization in EAE. Overall, our large and diverse EAE ontology can aid in creating more comprehensive future resources, while GENEVA is a challenging benchmarking dataset encouraging further research for improving generalizability in EAE.

@inproceedings{parekh2023geneva,
  title = {GENEVA: Pushing the Limit of Generalizability for Event Argument Extraction with 100+ Event Types},
  author = {Parekh, Tanmay and Hsu, I-Hung and Huang, Kuan-Hao and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {ACL},
  presentation_id = {https://underline.io/events/395/posters/15264/poster/77026-geneva-benchmarking-generalizability-for-event-argument-extraction-with-hundreds-of-event-types-and-argument-roles},
  year = {2023}
}


TAGPRIME: A Unified Framework for Relational Structure Extraction

I-Hung Hsu, Kuan-Hao Huang, Shuning Zhang, Wenxin Cheng, Prem Natarajan, Kai-Wei Chang, and Nanyun Peng, in ACL, 2023.


Many tasks in natural language processing require the extraction of relationship information for a given condition, such as event argument extraction, relation extraction, and task-oriented semantic parsing. Recent works usually propose sophisticated models for each task independently and pay less attention to the commonality of these tasks and to having a unified framework for all the tasks. In this work, we propose to take a unified view of all these tasks and introduce TAGPRIME to address relational structure extraction problems. TAGPRIME is a sequence tagging model that appends priming words about the information of the given condition (such as an event trigger) to the input text. With the self-attention mechanism in pre-trained language models, the priming words make the output contextualized representations contain more information about the given condition, and hence become more suitable for extracting specific relationships for the condition. Extensive experiments and analyses on three different tasks that cover ten datasets across five different languages demonstrate the generality and effectiveness of TAGPRIME.

@inproceedings{hsu2023tagprime,
  author = {Hsu, I-Hung and Huang, Kuan-Hao and Zhang, Shuning and Cheng, Wenxin and Natarajan, Prem and Chang, Kai-Wei and Peng, Nanyun},
  title = {TAGPRIME: A Unified Framework for Relational Structure Extraction},
  booktitle = {ACL},
  presentation_id = {https://underline.io/events/395/sessions/15250/lecture/76330-tagprime-a-unified-framework-for-relational-structure-extraction},
  year = {2023}
}


Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step

Liunian Harold Li, Jack Hessel, Youngjae Yu, Xiang Ren, Kai-Wei Chang, and Yejin Choi, in ACL, 2023.


Chain-of-thought prompting (e.g., "Let’s think step-by-step") primes large language models to verbalize rationalization for their predictions. While chain-of-thought can lead to dramatic performance gains, benefits appear to emerge only for sufficiently large models (beyond 50B parameters). We show that orders-of-magnitude smaller models (125M – 1.3B parameters) can still benefit from chain-of-thought prompting. To achieve this, we introduce Symbolic Chain-of-Thought Distillation (SCoTD), a method to train a smaller student model on rationalizations sampled from a significantly larger teacher model. Experiments across several commonsense benchmarks show that: 1) SCoTD enhances the performance of the student model in both supervised and few-shot settings, and especially for challenge sets; 2) sampling many reasoning chains per instance from the teacher is paramount; and 3) after distillation, student chain-of-thoughts are judged by humans as comparable to the teacher, despite orders of magnitude fewer parameters. We test several hypotheses regarding what properties of chain-of-thought samples are important, e.g., diversity vs. teacher likelihood vs. open-endedness. We release our corpus of chain-of-thought samples and code.

@inproceedings{li2023symbolic,
  title = {Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step},
  author = {Li, Liunian Harold and Hessel, Jack and Yu, Youngjae and Ren, Xiang and Chang, Kai-Wei and Choi, Yejin},
  booktitle = {ACL},
  presentation_id = {https://underline.io/events/395/posters/15197/poster/77090-symbolic-chain-of-thought-distillation-small-models-can-also-think-step-by-step?tab=poster},
  year = {2023}
}


PLUE: Language Understanding Evaluation Benchmark for Privacy Policies in English

Jianfeng Chi, Wasi Uddin Ahmad, Yuan Tian, and Kai-Wei Chang, in ACL (short), 2023.


Privacy policies provide individuals with information about their rights and how their personal information is handled. Natural language understanding (NLU) technologies can help individuals and practitioners better understand privacy practices described in lengthy and complex documents. However, existing efforts that use NLU technologies are limited by processing the language in a way exclusive to a single task focusing on certain privacy practices. To this end, we introduce the Privacy Policy Language Understanding Evaluation (PLUE) benchmark, a multi-task benchmark for evaluating privacy policy language understanding across various tasks. We also collect a large corpus of privacy policies to enable privacy policy domain-specific language model pre-training. We evaluate several generic pre-trained language models and continue pre-training them on the collected corpus. We demonstrate that domain-specific continual pre-training offers performance improvements across all tasks.

@inproceedings{chi2023plue,
  author = {Chi, Jianfeng and Ahmad, Wasi Uddin and Tian, Yuan and Chang, Kai-Wei},
  title = {PLUE: Language Understanding Evaluation Benchmark for Privacy Policies in English},
  presentation_id = {https://underline.io/events/395/posters/15279/poster/76751-plue-language-understanding-evaluation-benchmark-for-privacy-policies-in-english},
  booktitle = {ACL (short)},
  year = {2023}
}


MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models

Masoud Monajatipoor, Liunian Harold Li, Mozhdeh Rouhsedaghat, Lin Yang, and Kai-Wei Chang, in ACL (short), 2023.


Large-scale language models have shown the ability to adapt to a new task via conditioning on a few demonstrations (i.e., in-context learning). However, in the vision-language domain, most large-scale pre-trained vision-language (VL) models do not possess the ability to conduct in-context learning. How can we enable in-context learning for VL models? In this paper, we study an interesting hypothesis: can we transfer the in-context learning ability from the language domain to the VL domain? Specifically, we first meta-train a language model to perform in-context learning on NLP tasks (as in MetaICL); then we transfer this model to perform VL tasks by attaching a visual encoder. Our experiments suggest that in-context learning ability can indeed be transferred across modalities: our model considerably improves the in-context learning capability on VL tasks and can even compensate for the size of the model significantly. On VQA, OK-VQA, and GQA, our method could outperform the baseline model while having 20 times fewer parameters.

@inproceedings{monajatipoor2023metavl,
  author = {Monajatipoor, Masoud and Li, Liunian Harold and Rouhsedaghat, Mozhdeh and Yang, Lin and Chang, Kai-Wei},
  title = {MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models},
  booktitle = {ACL (short)},
  presentation_id = {https://underline.io/events/395/posters/15337/poster/76709-metavl-transferring-in-context-learning-ability-from-language-models-to-vision-language-models},
  year = {2023}
}


PIP: Parse-Instructed Prefix for Syntactically Controlled Paraphrase Generation

Yixin Wan, Kuan-Hao Huang, and Kai-Wei Chang, in ACL-Finding (short), 2023.


Syntactically controlled paraphrase generation requires language models to generate paraphrases for sentences according to specific syntactic structures. Existing fine-tuning methods for this task are costly as all the parameters of the model need to be updated during the training process. Inspired by recent studies on parameter-efficient learning, we propose Parse-Instructed Prefix (PIP), a novel adaptation of prefix-tuning to tune large pre-trained language models on the syntactically controlled paraphrase generation task in a low-data setting with significantly less training cost. We introduce two methods to instruct a model’s encoder prefix to capture syntax-related knowledge: direct initiation (PIP-Direct) and indirect optimization (PIP-Indirect). In contrast to traditional fine-tuning methods for this task, PIP is a compute-efficient alternative with 10 times fewer learnable parameters. Compared to existing prefix-tuning methods, PIP excels at capturing syntax control information, achieving significantly higher performance at the same level of learnable parameter count.

@inproceedings{wan2023pip,
  author = {Wan, Yixin and Huang, Kuan-Hao and Chang, Kai-Wei},
  title = {PIP: Parse-Instructed Prefix for Syntactically Controlled Paraphrase Generation},
  booktitle = {ACL-Finding (short)},
  presentation_id = {https://underline.io/events/395/posters/15279/poster/77944-pip-parse-instructed-prefix-for-syntactically-controlled-paraphrase-generation},
  year = {2023}
}


Enhancing Unsupervised Semantic Parsing with Distributed Contextual Representations

Zixuan Ling, Xiaoqing Zheng, Jianhan Xu, Jinshu Lin, Kai-Wei Chang, Cho-Jui Hsieh, and Xuanjing Huang, in ACL-Finding, 2023.


We extend the non-parametric Bayesian model of Titov and Klementiev (2011) to deal with homonymy and polysemy by leveraging distributed contextual word and phrase representations pre-trained on a large collection of unlabelled texts. Then, unsupervised semantic parsing is performed by decomposing sentences into fragments, clustering the fragments to abstract away syntactic variations of the same meaning, and predicting predicate-argument relations between the fragments. To better model the statistical dependencies between predicates and their arguments, we further conduct a hierarchical Pitman-Yor process. An improved Metropolis-Hastings merge-split sampler is proposed to speed up the mixing and convergence of Markov chains by leveraging pre-trained distributed representations. The experimental results show that the models achieve better accuracy on both question-answering and relation extraction tasks.

@inproceedings{ling2023enhancing,
  author = {Ling, Zixuan and Zheng, Xiaoqing and Xu, Jianhan and Lin, Jinshu and Chang, Kai-Wei and Hsieh, Cho-Jui and Huang, Xuanjing},
  title = {Enhancing Unsupervised Semantic Parsing with Distributed Contextual Representations},
  booktitle = {ACL-Finding},
  presentation_id = {https://underline.io/events/395/posters/15279/poster/77281-enhancing-unsupervised-semantic-parsing-with-distributed-contextual-representations?tab=video},
  year = {2023}
}


UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

Rui Sun, Zhecan Wang, Haoxuan You, Noel Codella, Kai-Wei Chang, and Shih-Fu Chang, in ACL-Finding, 2023.


Vision-language tasks, such as VQA, SNLI-VE, and VCR, are challenging because they require the model’s reasoning ability to understand the semantics of the visual world and natural language. Supervised methods for vision-language tasks have been well studied. However, solving these tasks in a zero-shot setting is less explored. Since Contrastive Language-Image Pre-training (CLIP) has shown remarkable zero-shot performance on image-text matching, previous works utilized its strong zero-shot ability by converting vision-language tasks into an image-text matching problem, and they mainly consider global-level matching (e.g., the whole image or sentence). However, we find visual and textual fine-grained information, e.g., keywords in the sentence and objects in the image, can be fairly informative for semantic understanding. Inspired by this, we propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning, covering multiple tasks such as VQA, SNLI-VE, and VCR. Our experiments show that our framework outperforms former zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR. Furthermore, our ablation studies confirm the effectiveness and generalizability of our proposed method.

@inproceedings{sun2023unifine,
  author = {Sun, Rui and Wang, Zhecan and You, Haoxuan and Codella, Noel and Chang, Kai-Wei and Chang, Shih-Fu},
  title = {UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding},
  booktitle = {ACL-Finding},
  year = {2023},
  presentation_id = {https://underline.io/events/395/posters/15279/poster/78004-unifine-a-unified-and-fine-grained-approach-for-zero-shot-vision-language-understanding}
}


AVATAR: A Parallel Corpus for Java-Python Program Translation

Wasi Ahmad, Md Golam Rahman Tushar, Saikat Chakraborty, and Kai-Wei Chang, in ACL-Finding (short), 2023.

Full Text Code Abstract BibTeX Details

Program translation refers to migrating source code from one programming language to another. It has a tremendous practical value in software development as porting software across different languages is time-consuming and costly. Automating program translation is of paramount importance in software migration, and recently researchers explored unsupervised approaches due to the unavailability of parallel corpora. However, the availability of pre-trained language models for programming languages enables supervised fine-tuning with a small amount of labeled examples. In this work, we present a corpus of 8,475 programming problems and their solutions written in two popular languages, Java and Python. We collect the dataset from competitive programming sites, online platforms, and open source repositories. We present several baselines, including models trained from scratch or pre-trained on large-scale source code collection and fine-tuned on our proposed dataset. Experiment results show that while the models perform relatively well in terms of the lexical match, they lack in generating code that is accurate in terms of syntax and data-flow match.

@inproceedings{ahmad2021avatar,
  title = {AVATAR: A Parallel Corpus for Java-Python Program Translation},
  author = {Ahmad, Wasi and Tushar, Md Golam Rahman and Chakraborty, Saikat and Chang, Kai-Wei},
  booktitle = {ACL-Finding (short)},
  year = {2023}
}


REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge

Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A. Ross, and Alireza Fathi, in CVPR, 2023.

Full Text Abstract BibTeX Details

In this paper, we propose an end-to-end Retrieval-Augmented Visual Language Model (REVEAL) that learns to encode world knowledge into a large-scale memory, and to retrieve from it to answer knowledge-intensive queries. REVEAL consists of four key components: the memory, the encoder, the retriever and the generator. The large-scale memory encodes various sources of multimodal world knowledge (e.g. image-text pairs, question answering pairs, knowledge graph triplets, etc) via a unified encoder. The retriever finds the most relevant knowledge entries in the memory, and the generator fuses the retrieved knowledge with the input query to produce the output. A key novelty in our approach is that the memory, encoder, retriever and generator are all pre-trained end-to-end on a massive amount of data. Furthermore, our approach can use a diverse set of multimodal knowledge sources, which is shown to result in significant gains. We show that REVEAL achieves state-of-the-art results on visual question answering and image captioning.

@inproceedings{hu2023reveal,
  author = {Hu, Ziniu and Iscen, Ahmet and Sun, Chen and Wang, Zirui and Chang, Kai-Wei and Sun, Yizhou and Schmid, Cordelia and Ross, David A. and Fathi, Alireza},
  title = {REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge},
  booktitle = {CVPR},
  year = {2023}
}


GIVL: On Improving Geographical Inclusivity of Vision-and-Language Models with Pre-Training Methods

Da Yin, Feng Gao, Govind Thattai, Michael Johnston, and Kai-Wei Chang, in CVPR, 2023.

Full Text Abstract BibTeX Details

A key goal for the advancement of AI is to develop technologies that serve the needs not just of one group but of all communities regardless of their geographical region. In fact, a significant proportion of knowledge is locally shared by people from certain regions but may not apply equally in other regions because of cultural differences. If a model is unaware of regional characteristics, it may lead to performance disparity across regions and result in bias against underrepresented groups. We propose GIVL, a Geographically Inclusive Vision-and-Language Pre-trained model. There are two attributes of geo-diverse visual concepts which can help to learn geo-diverse knowledge: 1) concepts under similar categories have unique knowledge and visual characteristics, 2) concepts with similar visual features may fall in completely different categories. Motivated by the attributes, we design new pre-training objectives Image Knowledge Matching (IKM) and Image Edit Checking (IEC) to pre-train GIVL. Compared with similar-size models pre-trained with similar scale of data, GIVL achieves state-of-the-art (SOTA) and more balanced performance on geo-diverse V&L tasks.

@inproceedings{yin2023givl,
  author = {Yin, Da and Gao, Feng and Thattai, Govind and Johnston, Michael and Chang, Kai-Wei},
  title = {GIVL: On Improving Geographical Inclusivity of Vision-and-Language Models with Pre-Training Methods},
  booktitle = {CVPR},
  year = {2023}
}


Semantic Strengthening of Neuro-Symbolic Learning

Kareem Ahmed, Kai-Wei Chang, and Guy Van den Broeck, in AISTATS, 2023.

Full Text Code Abstract BibTeX Details

Numerous neuro-symbolic approaches have recently been proposed typically with the goal of adding symbolic knowledge to the output layer of a neural network. Ideally, such losses maximize the probability that the neural network’s predictions satisfy the underlying domain. Unfortunately, this type of probabilistic inference is often computationally infeasible. Neuro-symbolic approaches therefore commonly resort to fuzzy approximations of this probabilistic objective, sacrificing sound probabilistic semantics, or to sampling which is very seldom feasible. We approach the problem by first assuming the constraint decomposes conditioned on the features learned by the network. We iteratively strengthen our approximation, restoring the dependence between the constraints most responsible for degrading the quality of the approximation. This corresponds to computing the mutual information between pairs of constraints conditioned on the network’s learned features, and may be construed as a measure of how well aligned the gradients of two distributions are. We show how to compute this efficiently for tractable circuits. We test our approach on three tasks: predicting a minimum-cost path in Warcraft, predicting a minimum-cost perfect matching, and solving Sudoku puzzles, observing that it improves upon the baselines while sidestepping intractability.

@inproceedings{ahmed2023semantic,
  author = {Ahmed, Kareem and Chang, Kai-Wei and Van den Broeck, Guy},
  title = {Semantic Strengthening of Neuro-Symbolic Learning},
  booktitle = {AISTATS},
  year = {2023}
}


Factoring the Matrix of Domination: A Critical Review and Reimagination of Intersectionality in AI Fairness

Anaelia Ovalle, Arjun Subramonian, Vagrant Gautam, Gilbert Gee, and Kai-Wei Chang, in AIES, 2023.

Full Text Abstract BibTeX Details

Intersectionality is a critical framework that, through inquiry and praxis, allows us to examine how social inequalities persist through domains of structure and discipline. Given AI fairness’ raison d’être of "fairness," we argue that adopting intersectionality as an analytical framework is pivotal to effectively operationalizing fairness. Through a critical review of how intersectionality is discussed in 30 papers from the AI fairness literature, we deductively and inductively: 1) map how intersectionality tenets operate within the AI fairness paradigm and 2) uncover gaps between the conceptualization and operationalization of intersectionality. We find that researchers overwhelmingly reduce intersectionality to optimizing for fairness metrics over demographic subgroups. They also fail to discuss their social context and, when mentioning power, they mostly situate it only within the AI pipeline. We: 3) outline and assess the implications of these gaps for critical inquiry and praxis, and 4) provide actionable recommendations for AI fairness researchers to engage with intersectionality in their work by grounding it in AI epistemology.

@inproceedings{ovalle2023factoring,
  title = {Factoring the Matrix of Domination: A Critical Review and Reimagination of Intersectionality in AI Fairness},
  author = {Ovalle, Anaelia and Subramonian, Arjun and Gautam, Vagrant and Gee, Gilbert and Chang, Kai-Wei},
  year = {2023},
  booktitle = {AIES}
}


Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning

Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, and Ashwin Kalyan, in ICLR, 2023.

Full Text Abstract BibTeX Details

Mathematical reasoning, a core ability of human intelligence, presents unique challenges for machines in abstract thinking and logical reasoning. Recent large pre-trained language models such as GPT-3 have achieved remarkable progress on mathematical reasoning tasks written in text form, such as math word problems (MWP). However, it is unknown if the models can handle more complex problems that involve math reasoning over heterogeneous information, such as tabular data. To fill the gap, we present Tabular Math Word Problems (TabMWP), a new dataset containing 38,431 open-domain grade-level problems that require mathematical reasoning on both textual and tabular data. Each question in TabMWP is aligned with a tabular context, which is presented as an image, semi-structured text, and a structured table. There are two types of questions: free-text and multi-choice, and each problem is annotated with gold solutions to reveal the multi-step reasoning process. We evaluate different pre-trained models on TabMWP, including the GPT-3 model in a few-shot setting. As earlier studies suggest, since few-shot GPT-3 relies on the selection of in-context examples, its performance is unstable and can degrade to near chance. The unstable issue is more severe when handling complex problems like TabMWP. To mitigate this, we further propose a novel approach, PromptPG, which utilizes policy gradient to learn to select in-context examples from a small amount of training data and then constructs the corresponding prompt for the test example. Experimental results show that our method outperforms the best baseline by 5.31% on the accuracy metric and reduces the prediction variance significantly compared to random selection, which verifies its effectiveness in the selection of in-context examples.

@inproceedings{lu2023dynamic,
  author = {Lu, Pan and Qiu, Liang and Chang, Kai-Wei and Wu, Ying Nian and Zhu, Song-Chun and Rajpurohit, Tanmay and Clark, Peter and Kalyan, Ashwin},
  title = {Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning},
  booktitle = {ICLR},
  year = {2023}
}


2022

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan, in NeurIPS, 2022.

Full Text Abstract BibTeX Details Top-15 cited paper at NeurIPS 22

When answering a question, humans utilize the information available across different modalities to synthesize a consistent and complete chain of thought (CoT). This process is normally a black box in the case of deep learning models like large-scale language models. Recently, science question benchmarks have been used to diagnose the multi-hop reasoning ability and interpretability of an AI system. However, existing datasets fail to provide annotations for the answers, or are restricted to the textual-only modality, small scales, and limited domain diversity. To this end, we present Science Question Answering (ScienceQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations. We further design language models to learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions. ScienceQA demonstrates the utility of CoT in language models, as CoT improves the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA. We also explore the upper bound for models to leverage explanations by feeding those in the input; we observe that it improves the few-shot performance of GPT-3 by 18.96%. Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data. The data and code are available at https://scienceqa.github.io.

@inproceedings{lu2022learn,
  title = {Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering},
  author = {Lu, Pan and Mishra, Swaroop and Xia, Tony and Qiu, Liang and Chang, Kai-Wei and Zhu, Song-Chun and Tafjord, Oyvind and Clark, Peter and Kalyan, Ashwin},
  booktitle = {NeurIPS},
  github_url = {https://github.com/lupantech/ScienceQA},
  year = {2022}
}


Controllable Text Generation with Neurally-Decomposed Oracle

Tao Meng, Sidi Lu, Nanyun Peng, and Kai-Wei Chang, in NeurIPS, 2022.

Full Text Code Abstract BibTeX Details Oral Presentation, 201 out of 10411, top 1.9%

We propose a general and efficient framework to control auto-regressive generation models with NeurAlly-Decomposed Oracle (NADO). Given a pre-trained base language model and a sequence-level boolean oracle function, we propose to decompose the oracle function into token-level guidance to steer the base model in text generation. Specifically, the token-level guidance is approximated by a neural model trained with examples sampled from the base model, demanding no additional auxiliary labeled data. We present the closed-form optimal solution to incorporate the token-level guidance into the base model for controllable generation. We further provide a theoretical analysis of how the approximation quality of NADO affects the controllable generation results. Experiments conducted on two applications: (1) text generation with lexical constraints and (2) machine translation with formality control demonstrate that our framework efficiently guides the base model towards the given oracle while maintaining high generation quality.

@inproceedings{meng2022controllable,
  title = {Controllable Text Generation with Neurally-Decomposed Oracle},
  author = {Meng, Tao and Lu, Sidi and Peng, Nanyun and Chang, Kai-Wei},
  booktitle = {NeurIPS},
  year = {2022}
}


Grounded Language-Image Pre-training

Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao, in CVPR, 2022.

Full Text Code Abstract BibTeX Details Best Paper Finalist, 33 out of 8161 submissions, top 0.4%

This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuned on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals with a fully-supervised Dynamic Head.

@inproceedings{li2022grounded,
  title = {Grounded Language-Image Pre-training},
  author = {Li, Liunian Harold and Zhang, Pengchuan and Zhang, Haotian and Yang, Jianwei and Li, Chunyuan and Zhong, Yiwu and Wang, Lijuan and Yuan, Lu and Zhang, Lei and Hwang, Jenq-Neng and Chang, Kai-Wei and Gao, Jianfeng},
  booktitle = {CVPR},
  year = {2022}
}


How Much Can CLIP Benefit Vision-and-Language Tasks?

Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer, in ICLR, 2022.

Full Text Code Abstract BibTeX Details Top-10 cited paper at ICLR 22

Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world. However, it has been observed that large-scale pretraining usually can result in better generalization performance, e.g., CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, has shown a strong zero-shot capability on various vision tasks. To further study the advantage brought by CLIP, we propose to use CLIP as the visual encoder in various V&L models in two typical scenarios: 1) plugging CLIP into task-specific fine-tuning; 2) combining CLIP with V&L pre-training and transferring to downstream tasks. We show that CLIP significantly outperforms widely-used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown. We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.

@inproceedings{shen2022how,
  title = {How Much Can CLIP Benefit Vision-and-Language Tasks?},
  author = {Shen, Sheng and Li, Liunian Harold and Tan, Hao and Bansal, Mohit and Rohrbach, Anna and Chang, Kai-Wei and Yao, Zhewei and Keutzer, Kurt},
  booktitle = {ICLR},
  year = {2022}
}


ADDMU: Detection of Far-Boundary Adversarial Examples with Data and Model Uncertainty Estimation

Fan Yin, Yao Li, Cho-Jui Hsieh, and Kai-Wei Chang, in EMNLP, 2022.

Full Text Abstract BibTeX Details

Adversarial Examples Detection (AED) is a crucial defense technique against adversarial attacks and has drawn increasing attention from the Natural Language Processing (NLP) community. Despite the surge of new AED methods, our studies show that existing methods heavily rely on a shortcut to achieve good performance. In other words, current search-based adversarial attacks in NLP stop once model predictions change, and thus most adversarial examples generated by those attacks are located near model decision boundaries. To surpass this shortcut and fairly evaluate AED methods, we propose to test AED methods with Far Boundary (FB) adversarial examples. Existing methods show worse than random guess performance under this scenario. To overcome this limitation, we propose a new technique, ADDMU, adversary detection with data and model uncertainty, which combines two types of uncertainty estimation for both regular and FB adversarial example detection. Our new method outperforms previous methods by 3.6 and 6.0 AUC points under each scenario. Finally, our analysis shows that the two types of uncertainty provided by ADDMU can be leveraged to characterize adversarial examples and identify the ones that contribute most to model’s robustness in adversarial training.

@inproceedings{yin2022addmu,
  title = {ADDMU: Detection of Far-Boundary Adversarial Examples with Data and Model Uncertainty Estimation},
  author = {Yin, Fan and Li, Yao and Hsieh, Cho-Jui and Chang, Kai-Wei},
  booktitle = {EMNLP},
  year = {2022}
}


Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense

Zhecan Wang, Haoxuan You, Yicheng He, Wenhao Li, Kai-Wei Chang, and Shih-Fu Chang, in EMNLP, 2022.

Full Text Abstract BibTeX Details

Visual commonsense understanding requires Vision Language (VL) models to not only understand image and text but also cross-reference in-between to fully integrate and achieve comprehension of the visual scene described. Recently, various approaches have been developed and have achieved high performance on visual commonsense benchmarks. However, it is unclear whether the models really understand the visual scene and underlying commonsense knowledge due to limited evaluation data resources. To provide an in-depth analysis, we present a Multimodal Evaluation (ME) pipeline to automatically generate question-answer pairs to test models’ understanding of the visual scene, text, and related knowledge. We then take a step further to show that training with the ME data boosts the model’s performance in standard VCR evaluation. Lastly, our in-depth analysis and comparison reveal interesting findings: (1) semantically low-level information can assist the learning of high-level information but not the opposite; (2) visual information is generally underutilized compared with text.

@inproceedings{you2022fine,
  title = {Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense},
  author = {Wang, Zhecan and You, Haoxuan and He, Yicheng and Li, Wenhao and Chang, Kai-Wei and Chang, Shih-Fu},
  booktitle = {EMNLP},
  year = {2022}
}


GeoMLAMA: Geo-Diverse Commonsense Probing on Multilingual Pre-Trained Language Models

Da Yin, Hritik Bansal, Masoud Monajatipoor, Liunian Harold Li, and Kai-Wei Chang, in EMNLP, 2022.

Full Text Code Abstract BibTeX Details

Recent work has shown that Pre-trained Language Models (PLMs) have the ability to store the relational knowledge from pre-training data in their model parameters. However, it is not clear to what extent PLMs store geo-diverse commonsense knowledge, the knowledge associated with a culture and only shared locally. For instance, the color of bridal dress is white in American weddings whereas it is red in Chinese weddings. Here, we wish to probe if PLMs can predict white and red as the color of the bridal dress when queried for American and Chinese weddings, respectively. To this end, we introduce a framework for geo-diverse commonsense probing on multilingual PLMs (mPLMs) and introduce a corresponding benchmark Geo-diverse Commonsense Multilingual Language Model Analysis (GeoMLAMA) dataset. GeoMLAMA contains 3125 prompts in English, Chinese, Hindi, Persian, and Swahili, with a wide coverage of concepts shared by people from American, Chinese, Indian, Iranian and Kenyan cultures. We benchmark 11 standard mPLMs which include variants of mBERT, XLM, mT5, and XGLM on GeoMLAMA. Interestingly, we find that 1) larger mPLM variants do not necessarily store geo-diverse concepts better than their smaller variants; 2) mPLMs are not intrinsically biased towards knowledge from the Western countries (the United States); 3) the native language of a country may not be the best language to probe its knowledge and 4) a language may better probe knowledge about a non-native country than its native country.

@inproceedings{yin2022geomlama,
  title = {GeoMLAMA: Geo-Diverse Commonsense Probing on Multilingual Pre-Trained Language Models},
  author = {Yin, Da and Bansal, Hritik and Monajatipoor, Masoud and Li, Liunian Harold and Chang, Kai-Wei},
  booktitle = {EMNLP},
  year = {2022}
}


How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions?

Hritik Bansal, Da Yin, Masoud Monajatipoor, and Kai-Wei Chang, in EMNLP (Short), 2022.

Full Text Code Abstract BibTeX Details

Text-to-image generative models have achieved unprecedented success in generating high-quality images based on natural language descriptions. However, it is shown that these models tend to favor specific social groups when prompted with neutral text descriptions (e.g., ’a photo of a lawyer’). Following Zhao et al. (2021), we study the effect on the diversity of the generated images when adding ethical intervention that supports equitable judgment (e.g., ’if all individuals can be a lawyer irrespective of their gender’) in the input prompts. To this end, we introduce an Ethical NaTural Language Interventions in Text-to-Image GENeration (ENTIGEN) benchmark dataset to evaluate the change in image generations conditional on ethical interventions across three social axes – gender, skin color, and culture. Through ENTIGEN framework, we find that the generations from minDALL.E, DALL.E-mini and Stable Diffusion cover diverse social groups while preserving the image quality. Preliminary studies indicate that a large change in the model predictions is triggered by certain phrases such as ’irrespective of gender’ in the context of gender bias in the ethical interventions. We release code and annotated data at https://github.com/Hritikbansal/entigen_emnlp.

@inproceedings{bansal2022how,
  title = {How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions?},
  author = {Bansal, Hritik and Yin, Da and Monajatipoor, Masoud and Chang, Kai-Wei},
  booktitle = {EMNLP (Short)},
  year = {2022}
}


Empowering Language Models with Knowledge Graph Reasoning for Open-Domain Question Answering

Ziniu Hu, Yichong Xu, Wenhao Yu, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Kai-Wei Chang, and Yizhou Sun, in EMNLP, 2022.

Full Text Abstract BibTeX Details

Answering open-domain questions requires world knowledge about in-context entities. As pre-trained Language Models (LMs) lack the power to store all required knowledge, external knowledge sources, such as knowledge graphs, are often used to augment LMs. In this work, we propose knOwledge REasOning empowered Language Model (OREO-LM), which consists of a novel Knowledge Interaction Layer that can be flexibly plugged into existing Transformer-based LMs to interact with a differentiable Knowledge Graph Reasoning module collaboratively. In this way, LM guides KG to walk towards the desired answer, while the retrieved knowledge improves LM. By adopting OREO-LM to RoBERTa and T5, we show significant performance gain, achieving state-of-the-art results in the Closed-Book setting. The performance enhancement is mainly from the KG reasoning’s capacity to infer missing relational facts. In addition, OREO-LM provides reasoning paths as rationales to interpret the model’s decision.

@inproceedings{hu2022empowering,
  title = {Empowering Language Models with Knowledge Graph Reasoning for Open-Domain Question Answering},
  author = {Hu, Ziniu and Xu, Yichong and Yu, Wenhao and Wang, Shuohang and Yang, Ziyi and Zhu, Chenguang and Chang, Kai-Wei and Sun, Yizhou},
  booktitle = {EMNLP},
  year = {2022}
}


Conditional Supervised Contrastive Learning for Fair Text Classification

Jianfeng Chi, William Shand, Yaodong Yu, Kai-Wei Chang, Han Zhao, and Yuan Tian, in EMNLP-Finding, 2022.

Full Text Abstract BibTeX Details

Contrastive representation learning has gained much attention due to its superior performance in learning representations from both image and sequential data. However, the learned representations could potentially lead to performance disparities in downstream tasks, such as increased silencing of underrepresented groups in toxicity comment classification. In light of this challenge, in this work, we study learning fair representations that satisfy a notion of fairness known as equalized odds for text classification via contrastive learning. Specifically, we first theoretically analyze the connections between learning representations with a fairness constraint and conditional supervised contrastive objectives, and then propose to use conditional supervised contrastive objectives to learn fair representations for text classification. We conduct experiments on two text datasets to demonstrate the effectiveness of our approaches in balancing the trade-offs between task performance and bias mitigation among existing baselines for text classification. Furthermore, we also show that the proposed methods are stable in different hyperparameter settings.

@inproceedings{chi2022conditional,
  title = {Conditional Supervised Contrastive Learning for Fair Text Classification},
  author = {Chi, Jianfeng and Shand, William and Yu, Yaodong and Chang, Kai-Wei and Zhao, Han and Tian, Yuan},
  booktitle = {EMNLP-Finding},
  year = {2022}
}


Representation Learning for Resource-Constrained Keyphrase Generation

Di Wu, Wasi Uddin Ahmad, Sunipa Dev, and Kai-Wei Chang, in EMNLP-Finding, 2022.

Full Text Code Abstract BibTeX Details

State-of-the-art keyphrase generation methods generally depend on large annotated datasets, limiting their performance in domains with limited annotated data. To overcome this challenge, we design a data-oriented approach that first identifies salient information using unsupervised corpus-level statistics, and then learns a task-specific intermediate representation based on a pre-trained language model. We introduce salient span recovery and salient span prediction as denoising training objectives that condense the intra-article and inter-article knowledge essential for keyphrase generation. Through experiments on multiple keyphrase generation benchmarks, we show the effectiveness of the proposed approach for facilitating low-resource and zero-shot keyphrase generation. We further observe that the method especially benefits the generation of absent keyphrases, approaching the performance of models trained with large training sets.

@inproceedings{wu2022representation,
  title = {Representation Learning for Resource-Constrained Keyphrase Generation},
  author = {Wu, Di and Ahmad, Wasi Uddin and Dev, Sunipa and Chang, Kai-Wei},
  booktitle = {EMNLP-Finding},
  year = {2022}
}


Unsupervised Syntactically Controlled Paraphrase Generation with Abstract Meaning Representations

Kuan-Hao Huang, Varun Iyer, Anoop Kumar, Sriram Venkatapathy, Kai-Wei Chang, and Aram Galstyan, in EMNLP-Finding (short), 2022.

Full Text Abstract BibTeX Details

Syntactically controlled paraphrase generation has become an emerging research direction in recent years. Most existing approaches require annotated paraphrase pairs for training and are thus costly to extend to new domains. Unsupervised approaches, on the other hand, do not need paraphrase pairs but suffer from relatively poor performance in terms of syntactic control and quality of generated paraphrases. In this paper, we demonstrate that leveraging Abstract Meaning Representations (AMR) can greatly improve the performance of unsupervised syntactically controlled paraphrase generation. Our proposed model, AMR-enhanced Paraphrase Generator (AMRPG), separately encodes the AMR graph and the constituency parse of the input sentence into two disentangled semantic and syntactic embeddings. A decoder is then learned to reconstruct the input sentence from the semantic and syntactic embeddings. Our experiments show that AMRPG generates more accurate syntactically controlled paraphrases, both quantitatively and qualitatively, compared to the existing unsupervised approaches. We also demonstrate that the paraphrases generated by AMRPG can be used for data augmentation to improve the robustness of NLP models.

@inproceedings{huang2022unsupervised,
  title = {Unsupervised Syntactically Controlled Paraphrase Generation with Abstract Meaning Representations},
  author = {Huang, Kuan-Hao and Iyer, Varun and Kumar, Anoop and Venkatapathy, Sriram and Chang, Kai-Wei and Galstyan, Aram},
  booktitle = {EMNLP-Finding (short)},
  year = {2022}
}

Details

Investigating Ensemble Methods for Model Robustness Improvement of Text Classifiers

Jieyu Zhao, Xuezhi Wang, Yao Qin, Jilin Chen, and Kai-Wei Chang, in EMNLP-Finding (short), 2022.

Full Text Abstract BibTeX Details

Large pre-trained language models have shown remarkable performance over the past few years. These models, however, sometimes learn superficial features from the dataset and cannot generalize to distributions that are dissimilar to the training scenario. Several approaches have been proposed to reduce a model’s reliance on these bias features, which can improve model robustness in the out-of-distribution setting. However, existing methods usually use a fixed low-capacity model to deal with various bias features, which ignores the learnability of those features. In this paper, we analyze a set of existing bias features and demonstrate that no single model works best for all cases. We further show that by choosing an appropriate bias model, we can obtain a better robustness result than baselines with a more sophisticated model design.

@inproceedings{zhao2022investigating,
  title = {Investigating Ensemble Methods for Model Robustness Improvement of Text Classifiers},
  author = {Zhao, Jieyu and Wang, Xuezhi and Qin, Yao and Chen, Jilin and Chang, Kai-Wei},
  booktitle = {EMNLP-Finding (short)},
  year = {2022}
}

Details

Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding

Haoxuan You, Rui Sun, Zhecan Wang, Kai-Wei Chang, and Shih-Fu Chang, in EMNLP-Finding, 2022.

Full Text Abstract BibTeX Details

From a visual scene containing multiple people, humans are able to distinguish each individual given context descriptions about what happened before, their mental/physical states or intentions, etc. This ability relies heavily on human-centric commonsense knowledge and reasoning. For example, if asked to identify the “person who needs healing” in an image, we need to first know that they usually have injuries or suffering expressions, then find the corresponding visual clues before finally grounding the person. We present a new commonsense task, Human-centric Commonsense Grounding, that tests the models’ ability to ground individuals given the context descriptions about what happened before, and their mental/physical states or intentions. We further create a benchmark, HumanCog, a dataset with 130k grounded commonsensical descriptions annotated on 67k images, covering diverse types of commonsense and visual scenes. We set up a context-object-aware method as a strong baseline that outperforms previous pre-trained and non-pretrained models. Further analysis demonstrates that rich visual commonsense and powerful integration of multi-modal commonsense are essential, which sheds light on future work. Data and code will be available at https://github.com/Hxyou/HumanCog.

@inproceedings{you2022find,
  title = {Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding},
  author = {You, Haoxuan and Sun, Rui and Wang, Zhecan and Chang, Kai-Wei and Chang, Shih-Fu},
  booktitle = {EMNLP-Finding},
  year = {2022}
}

Details

On the Discrimination Risk of Mean Aggregation Feature Imputation in Graphs

Arjun Subramonian, Kai-Wei Chang, and Yizhou Sun, in NeurIPS, 2022.

Full Text Abstract BibTeX Details

In human networks, nodes belonging to a marginalized group often have a disproportionate rate of unknown or missing features. This, in conjunction with graph structure and known feature biases, can cause graph feature imputation algorithms to predict values for unknown features that make the marginalized group’s feature values more distinct from the dominant group’s feature values than they are in reality. We call this distinction the discrimination risk. We prove that a higher discrimination risk can amplify the unfairness of a machine learning model applied to the imputed data. We then formalize a general graph feature imputation framework called mean aggregation imputation and theoretically and empirically characterize graphs in which applying this framework can yield feature values with a high discrimination risk. We propose a simple algorithm to ensure mean aggregation-imputed features provably have a low discrimination risk, while minimally sacrificing reconstruction error with respect to the imputation objective. We evaluate the fairness and accuracy of our solution on synthetic and real-world credit networks.

@inproceedings{subramonian2022on,
  title = {On the Discrimination Risk of Mean Aggregation Feature Imputation in Graphs},
  author = {Subramonian, Arjun and Chang, Kai-Wei and Sun, Yizhou},
  booktitle = {NeurIPS},
  year = {2022}
}

Details

Semantic Probabilistic Layers for Neuro-Symbolic Learning

Kareem Ahmed, Stefano Teso, Kai-Wei Chang, Guy Van den Broeck, and Antonio Vergari, in NeurIPS, 2022.

Full Text Abstract BibTeX Details

We design a predictive layer for structured-output prediction (SOP) that can be plugged into any neural network guaranteeing its predictions are consistent with a set of predefined symbolic constraints. Our Semantic Probabilistic Layer (SPL) can model intricate correlations, and hard constraints, over a structured output space all while being amenable to end-to-end learning via maximum likelihood. SPLs combine exact probabilistic inference with logical reasoning in a clean and modular way, learning complex distributions and restricting their support to solutions of the constraint. As such, they can faithfully, and efficiently, model complex SOP tasks beyond the reach of alternative neuro-symbolic approaches. We empirically demonstrate that SPLs outperform these competitors in terms of accuracy on challenging SOP tasks including hierarchical multi-label classification, pathfinding and preference learning, while retaining perfect constraint satisfaction.

@inproceedings{ahmed2022semantic,
  title = {Semantic Probabilistic Layers for Neuro-Symbolic Learning},
  author = {Ahmed, Kareem and Teso, Stefano and Chang, Kai-Wei and Van den Broeck, Guy and Vergari, Antonio},
  booktitle = {NeurIPS},
  year = {2022}
}

Details

Integrating topic modeling and word embedding to characterize violent deaths

Alina Arseniev-Koehler, Susan D. Cochran, Vickie Mays, Kai-Wei Chang, and Jacob Foster, in Proceedings of the National Academy of Sciences, 2022.

Full Text Abstract BibTeX Details

There is an escalating need for methods to identify latent patterns in text data from many domains. We introduce a method to identify topics in a corpus and represent documents as topic sequences. Discourse atom topic modeling (DATM) draws on advances in theoretical machine learning to integrate topic modeling and word embedding, capitalizing on their distinct capabilities. We first identify a set of vectors (discourse atoms) that provide a sparse representation of an embedding space. Discourse atoms can be interpreted as latent topics; through a generative model, atoms map onto distributions over words. We can also infer the topic that generated a sequence of words. We illustrate our method with a prominent example of underutilized text: the US National Violent Death Reporting System (NVDRS). The NVDRS summarizes violent death incidents with structured variables and unstructured narratives. We identify 225 latent topics in the narratives (e.g., preparation for death and physical aggression); many of these topics are not captured by existing structured variables. Motivated by known patterns in suicide and homicide by gender and recent research on gender biases in semantic space, we identify the gender bias of our topics (e.g., a topic about pain medication is feminine). We then compare the gender bias of topics to their prevalence in narratives of female versus male victims. Results provide a detailed quantitative picture of reporting about lethal violence and its gendered nature. Our method offers a flexible and broadly applicable approach to model topics in text data.

@article{arseniev2022aggression,
  title = {Integrating topic modeling and word embedding to characterize violent deaths},
  author = {Arseniev-Koehler, Alina and Cochran, Susan D and Mays, Vickie and Chang, Kai-Wei and Foster, Jacob},
  journal = {Proceedings of the National Academy of Sciences},
  year = {2022}
}

Details

Neuro-Symbolic Entropy Regularization

Kareem Ahmed, Eric Wang, Kai-Wei Chang, and Guy Van den Broeck, in UAI, 2022.

Full Text Abstract BibTeX Details

In structured prediction, the goal is to jointly predict many output variables that together encode a structured object – a path in a graph, an entity-relation triple, or an ordering of objects. Such a large output space makes learning hard and requires vast amounts of labeled data. Different approaches leverage alternate sources of supervision. One approach – entropy regularization – posits that decision boundaries should lie in low-probability regions. It extracts supervision from unlabeled examples, but remains agnostic to the structure of the output space. Conversely, neuro-symbolic approaches exploit the knowledge that not every prediction corresponds to a valid structure in the output space. Yet, they do not further restrict the learned output distribution. This paper introduces a framework that unifies both approaches. We propose a loss, neuro-symbolic entropy regularization, that encourages the model to confidently predict a valid object. It is obtained by restricting entropy regularization to the distribution over only valid structures. This loss is efficiently computed when the output constraint is expressed as a tractable logic circuit. Moreover, it seamlessly integrates with other neuro-symbolic losses that eliminate invalid predictions. We demonstrate the efficacy of our approach on a series of semi-supervised and fully-supervised structured-prediction experiments, where we find that it leads to models whose predictions are more accurate and more likely to be valid.

@inproceedings{ahmadneuro2022,
  title = {Neuro-Symbolic Entropy Regularization},
  author = {Ahmed, Kareem and Wang, Eric and Chang, Kai-Wei and Van den Broeck, Guy},
  booktitle = {UAI},
  year = {2022}
}

Details

DEGREE: A Data-Efficient Generative Event Extraction Model

I-Hung Hsu, Kuan-Hao Huang, Elizabeth Boschee, Scott Miller, Prem Natarajan, Kai-Wei Chang, and Nanyun Peng, in NAACL, 2022.

Full Text Abstract BibTeX Details

Event extraction (EE), the task that identifies event triggers and their arguments in text, is usually formulated as a classification or structured prediction problem. Such models usually reduce labels to numeric identifiers, making them unable to take advantage of label semantics (e.g., an event type named Arrest is related to words like arrest, detain, or apprehend). This prevents generalization to new event types. In this work, we formulate EE as a natural language generation task and propose GenEE, a model that not only captures complex dependencies within an event but also generalizes well to unseen or rare event types. Given a passage and an event type, GenEE is trained to generate a natural sentence following a predefined template for that event type. The generated output is then decoded into trigger and argument predictions. The autoregressive generation process naturally models the dependencies among the predictions – each new word predicted depends on those already generated in the output sentence. Using carefully designed input prompts during generation, GenEE is able to capture label semantics, which enables generalization to new event types. Empirical results show that our model achieves strong performance on event extraction tasks under zero-shot, few-shot, and high-resource scenarios alike. In particular, in the high-resource setting, GenEE outperforms the state-of-the-art model on argument extraction and achieves results competitive with the current best on end-to-end EE tasks.

@inproceedings{hsu2021degree,
  title = {DEGREE: A Data-Efficient Generative Event Extraction Model},
  author = {Hsu, I-Hung and Huang, Kuan-Hao and Boschee, Elizabeth and Miller, Scott and Natarajan, Prem and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {NAACL},
  year = {2022}
}

Details

Socially Aware Bias Measurements for Hindi Language Representations

Vijit Malik, Sunipa Dev, Akihiro Nishi, Nanyun Peng, and Kai-Wei Chang, in NAACL (short), 2022.

Full Text Abstract BibTeX Details

Language representations are efficient tools used across NLP applications, but they are rife with encoded societal biases. These biases are studied extensively, but with a primary focus on English language representations and biases common in the context of Western society. In this work, we investigate biases present in Hindi language representations, with a focus on caste- and religion-associated biases. We demonstrate how biases are unique to specific language representations based on the history and culture of the region they are widely spoken in, and how the same societal bias (such as binary gender-associated biases) is encoded by different words and text spans across languages. The discoveries of our work highlight the necessity of culture awareness and linguistic artifacts when modeling language representations, in order to better understand the encoded biases.

@inproceedings{malik2022socially,
  title = {Socially Aware Bias Measurements for Hindi Language Representations},
  author = {Malik, Vijit and Dev, Sunipa and Nishi, Akihiro and Peng, Nanyun and Chang, Kai-Wei},
  booktitle = {NAACL (short)},
  year = {2022}
}

Details

Measuring Fairness of Text Classifiers via Prediction Sensitivity

Satyapriya Krishna, Rahul Gupta, Apurv Verma, Jwala Dhamala, Yada Pruksachatkun, and Kai-Wei Chang, in ACL, 2022.

Full Text Abstract BibTeX Details

With the rapid growth in language processing applications, fairness has emerged as an important consideration in data-driven solutions. Although various fairness definitions have been explored in the recent literature, there is a lack of consensus on which metrics most accurately reflect the fairness of a system. In this work, we propose a new formulation, ACCUMULATED PREDICTION SENSITIVITY, which measures fairness in machine learning models based on the model’s prediction sensitivity to perturbations in input features. The metric attempts to quantify the extent to which a single prediction depends on a protected attribute, where the protected attribute encodes the membership status of an individual in a protected group. We show that the metric can be theoretically linked with a specific notion of group fairness (statistical parity) and individual fairness. It also correlates well with humans’ perception of fairness. We conduct experiments on two text classification datasets, JIGSAW TOXICITY and BIAS IN BIOS, and evaluate the correlations between metrics and manual annotations on whether the model produced a fair outcome. We observe that the proposed fairness metric based on prediction sensitivity is statistically significantly more correlated with human annotation than the existing counterfactual fairness metric.

@inproceedings{krishna2022measuring,
  title = {Measuring Fairness of Text Classifiers via Prediction Sensitivity},
  author = {Krishna, Satyapriya and Gupta, Rahul and Verma, Apurv and Dhamala, Jwala and Pruksachatkun, Yada and Chang, Kai-Wei},
  booktitle = {ACL},
  year = {2022}
}

Details

On the Sensitivity and Stability of Model Interpretations

Fan Yin, Zhouxing Shi, Cho-Jui Hsieh, and Kai-Wei Chang, in ACL, 2022.

Full Text Abstract BibTeX Details

Recent years have witnessed the emergence of a variety of post-hoc interpretations that aim to uncover how natural language processing (NLP) models make predictions. Despite the surge of new interpretation methods, it remains an open problem how to define and quantitatively measure the faithfulness of interpretations, i.e., to what extent interpretations reflect the reasoning process of a model. We propose two new criteria, sensitivity and stability, that provide notions of faithfulness complementary to the existing removal-based criteria. Our results show that conclusions about how faithful interpretations are can vary substantially depending on the notion used. Motivated by the desiderata of sensitivity and stability, we introduce a new class of interpretation methods that adopt techniques from adversarial robustness. Empirical results show that our proposed methods are effective under the new criteria and overcome limitations of gradient-based methods on removal-based criteria. Besides text classification, we also apply interpretation methods and metrics to dependency parsing. Our results shed light on understanding the diverse set of interpretations.

@inproceedings{yin2022on,
  title = {On the Sensitivity and Stability of Model Interpretations},
  author = {Yin, Fan and Shi, Zhouxing and Hsieh, Cho-Jui and Chang, Kai-Wei},
  booktitle = {ACL},
  year = {2022}
}

Details

PYLON: A PyTorch Framework for Learning with Constraints

Kareem Ahmed, Tao Li, Thy Ton, Quan Guo, Kai-Wei Chang, Parisa Kordjamshidi, Vivek Srikumar, Guy Van den Broeck, and Sameer Singh, in AAAI (demo), 2022.

Full Text Abstract BibTeX Details

Deep learning excels at learning task information from large amounts of data, but struggles with learning from declarative high-level knowledge that can be more succinctly expressed directly. In this work, we introduce PYLON, a neuro-symbolic training framework that builds on PyTorch to augment procedurally trained models with declaratively specified knowledge. PYLON lets users programmatically specify constraints as Python functions and compiles them into a differentiable loss, thus training predictive models that fit the data whilst satisfying the specified constraints. PYLON includes both exact and approximate compilers to efficiently compute the loss, employing fuzzy logic, sampling methods, and circuits, ensuring scalability even to complex models and constraints. Crucially, a guiding principle in designing PYLON is the ease with which any existing deep learning codebase can be extended to learn from constraints in a few lines of code: a function that expresses the constraint, and a single line to compile it into a loss. Our demo comprises models in NLP, computer vision, logical games, and knowledge graphs that can be interactively trained using constraints as supervision.

@inproceedings{ahmad2022pylon,
  title = {PYLON: A PyTorch Framework for Learning with Constraints},
  author = {Ahmed, Kareem and Li, Tao and Ton, Thy and Guo, Quan and Chang, Kai-Wei and Kordjamshidi, Parisa and Srikumar, Vivek and Van den Broeck, Guy and Singh, Sameer},
  booktitle = {AAAI (demo)},
  year = {2022}
}

Details

Multilingual Generative Language Models for Zero-Shot Cross-Lingual Event Argument Extraction

Kuan-Hao Huang, I-Hung Hsu, Prem Natarajan, Kai-Wei Chang, and Nanyun Peng, in ACL, 2022.

Full Text Code Abstract BibTeX Details

We present a study on leveraging multilingual pre-trained generative language models for zero-shot cross-lingual event argument extraction (EAE). By formulating EAE as a language generation task, our method effectively encodes event structures and captures the dependencies between arguments. We design language-agnostic templates to represent the event argument structures, which are compatible with any language, hence facilitating the cross-lingual transfer. Our proposed model finetunes multilingual pre-trained generative language models to generate sentences that fill in the language-agnostic template with arguments extracted from the input passage. The model is trained on source languages and is then directly applied to target languages for event argument extraction. Experiments demonstrate that the proposed model outperforms the current state-of-the-art models on zero-shot cross-lingual EAE. Comprehensive studies and error analyses are presented to better understand the advantages and the current limitations of using generative language models for zero-shot cross-lingual transfer EAE.

@inproceedings{huang2022multilingual,
  title = {Multilingual Generative Language Models for Zero-Shot Cross-Lingual Event Argument Extraction},
  author = {Huang, Kuan-Hao and Hsu, I-Hung and Natarajan, Prem and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {ACL},
  year = {2022}
}

Details

On the Intrinsic and Extrinsic Fairness Evaluation Metrics for Contextualized Language Representations

Yang Trista Cao, Yada Pruksachatkun, Kai-Wei Chang, Rahul Gupta, Varun Kumar, Jwala Dhamala, and Aram Galstyan, in ACL (short), 2022.

Full Text Abstract BibTeX Details

Multiple metrics have been introduced to measure fairness in various natural language processing tasks. These metrics can be roughly grouped into two categories: 1) extrinsic metrics for evaluating fairness in downstream applications and 2) intrinsic metrics for estimating fairness in upstream contextualized language representation models. In this paper, we conduct an extensive correlation study between intrinsic and extrinsic metrics across bias notions using 19 contextualized language models. We find that intrinsic and extrinsic metrics do not necessarily correlate in their original setting, even when correcting for metric misalignments, noise in evaluation datasets, and confounding factors such as experiment configuration for extrinsic metrics.

@inproceedings{trista2022evaluation,
  title = {On the Intrinsic and Extrinsic Fairness Evaluation Metrics for Contextualized Language Representations},
  author = {Cao, Yang Trista and Pruksachatkun, Yada and Chang, Kai-Wei and Gupta, Rahul and Kumar, Varun and Dhamala, Jwala and Galstyan, Aram},
  booktitle = {ACL (short)},
  year = {2022}
}

Details

Improving the Adversarial Robustness of NLP Models by Information Bottleneck

Cenyuan Zhang, Xiang Zhou, Yixin Wan, Xiaoqing Zheng, Kai-Wei Chang, and Cho-Jui Hsieh, in ACL-Finding, 2022.

Full Text Abstract BibTeX Details

Existing studies have demonstrated that adversarial examples can be directly attributed to the presence of non-robust features, which are highly predictive, but can be easily manipulated by adversaries to fool NLP models. In this study, we explore the feasibility of capturing task-specific robust features, while eliminating the non-robust ones by using the information bottleneck theory. Through extensive experiments, we show that the models trained with our information bottleneck-based method achieve a significant improvement in robust accuracy, exceeding the performance of all previously reported defense methods while suffering almost no drop in clean accuracy on the SST-2, AGNEWS and IMDB datasets.

@inproceedings{zhang2022improving,
  title = {Improving the Adversarial Robustness of NLP Models by Information Bottleneck},
  author = {Zhang, Cenyuan and Zhou, Xiang and Wan, Yixin and Zheng, Xiaoqing and Chang, Kai-Wei and Hsieh, Cho-Jui},
  booktitle = {ACL-Finding},
  year = {2022}
}

Details

Mitigating Gender Bias in Distilled Language Models via Counterfactual Role Reversal

Umang Gupta, Jwala Dhamala, Varun Kumar, Apurv Verma, Yada Pruksachatkun, Satyapriya Krishna, Rahul Gupta, Kai-Wei Chang, Greg Ver Steeg, and Aram Galstyan, in ACL-Finding, 2022.

Full Text Abstract BibTeX Details

Language models excel at generating coherent text, and model compression techniques such as knowledge distillation have enabled their use in resource-constrained settings. However, these models can be biased in multiple ways, including the unfounded association of male and female genders with gender-neutral professions. Therefore, knowledge distillation without any fairness constraints may preserve or exaggerate the teacher model’s biases in the distilled model. To this end, we present a novel approach to mitigate gender disparity in text generation by learning a fair model during knowledge distillation. We propose two modifications to the base knowledge distillation based on counterfactual role reversal – modifying teacher probabilities and augmenting the training set. We evaluate gender polarity across professions in open-ended text generated from the resulting distilled and finetuned GPT-2 models and demonstrate a substantial reduction in gender disparity with only a minor compromise in utility. Finally, we observe that language models that reduce gender polarity in language generation do not improve embedding fairness or downstream classification fairness.

@inproceedings{gupta2022equitable,
  title = {Mitigating Gender Bias in Distilled Language Models via Counterfactual Role Reversal},
  author = {Gupta, Umang and Dhamala, Jwala and Kumar, Varun and Verma, Apurv and Pruksachatkun, Yada and Krishna, Satyapriya and Gupta, Rahul and Chang, Kai-Wei and Ver Steeg, Greg and Galstyan, Aram},
  booktitle = {ACL-Finding},
  year = {2022}
}

Details

Towards Adversarially Robust Text Classifiers by Learning to Reweight Clean Examples

Jianhan Xu, Cenyuan Zhang, Xiaoqing Zheng, Linyang Li, Cho-Jui Hsieh, Kai-Wei Chang, and Xuanjing Huang, in ACL-Finding, 2022.

Full Text Abstract BibTeX Details

Most of the existing defense methods improve the adversarial robustness by making the models adapt to the training set augmented with some adversarial examples. However, the augmented adversarial examples may not be natural, which might distort the training distribution, resulting in inferior performance both in clean accuracy and adversarial robustness. In this study, we explore the feasibility of introducing a reweighting mechanism to calibrate the training distribution to obtain robust models. We propose to train text classifiers by a sample reweighting method in which the example weights are learned to minimize the loss of a validation set mixed with the clean examples and their adversarial ones in an online learning manner. Through extensive experiments, we show that there exists a reweighting mechanism to make the models more robust against adversarial attacks without the need to craft the adversarial examples for the entire training set.

@inproceedings{xu2022towards,
  title = {Towards Adversarially Robust Text Classifiers by Learning to Reweight Clean Examples},
  author = {Xu, Jianhan and Zhang, Cenyuan and Zheng, Xiaoqing and Li, Linyang and Hsieh, Cho-Jui and Chang, Kai-Wei and Huang, Xuanjing},
  booktitle = {ACL-Finding},
  year = {2022}
}

Details

SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

Zhecan Wang, Haoxuan You, Liunian Harold Li, Alireza Zareian, Suji Park, Yiqing Liang, Kai-Wei Chang, and Shih-Fu Chang, in AAAI, 2022.

Full Text Abstract BibTeX Details

Answering complex questions about images is an ambitious goal for machine intelligence, which requires a joint understanding of images, text, and commonsense knowledge, as well as a strong reasoning ability. Recently, multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning (VCR), by jointly understanding visual objects and text tokens through layers of cross-modality attention. However, these approaches do not utilize the rich structure of the scene and the interactions between objects, which are essential in answering complex commonsense questions. We propose a Scene Graph Enhanced Image-Text Learning (SGEITL) framework to incorporate visual scene graphs in commonsense reasoning. In order to exploit the scene graph structure, at the model structure level, we propose a multihop graph transformer for regularizing attention interaction among hops. As for pre-training, a scene-graph-aware pre-training method is proposed to leverage structure knowledge extracted from the visual scene graph. Moreover, we introduce a method to train and generate domain-relevant visual scene graphs using textual annotations in a weakly-supervised manner. Extensive experiments on VCR and other tasks show a significant performance boost compared with the state-of-the-art methods, and prove the efficacy of each proposed component.

@inproceedings{wang2022sgeitl,
  title = {SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning},
  author = {Wang, Zhecan and You, Haoxuan and Li, Liunian Harold and Zareian, Alireza and Park, Suji and Liang, Yiqing and Chang, Kai-Wei and Chang, Shih-Fu},
  booktitle = {AAAI},
  year = {2022}
}

Details

2021

Unified Pre-training for Program Understanding and Generation

Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang, in NAACL, 2021.

Full Text Video Code Abstract BibTeX Details Top-10 cited paper at NAACL 21

Code summarization and generation empower conversion between programming language (PL) and natural language (NL), while code translation avails the migration of legacy code from one PL to another. This paper introduces PLBART, a sequence-to-sequence model capable of performing a broad spectrum of program and language understanding and generation tasks. PLBART is pre-trained on an extensive collection of Java and Python functions and associated NL text via denoising autoencoding. Experiments on code summarization in the English language, code generation, and code translation in seven programming languages show that PLBART outperforms or rivals state-of-the-art models. Moreover, experiments on discriminative tasks, e.g., program repair, clone detection, and vulnerable code detection, demonstrate PLBART’s effectiveness in program understanding. Furthermore, analysis reveals that PLBART learns program syntax, style (e.g., identifier naming convention), and logical flow (e.g., an if block inside an else block is equivalent to an else if block) that are crucial to program semantics and thus excels even with limited annotations.

@inproceedings{ahmad2021unified,
  title = {Unified Pre-training for Program Understanding and Generation},
  author = {Ahmad, Wasi and Chakraborty, Saikat and Ray, Baishakhi and Chang, Kai-Wei},
  booktitle = {NAACL},
  presentation_id = {https://underline.io/events/122/sessions/4197/lecture/20024-unified-pre-training-for-program-understanding-and-generation},
  year = {2021}
}

Details

Searching for an Effective Defender: Benchmarking Defense against Adversarial Word Substitution

Zongyi Li, Jianhan Xu, Jiehang Zeng, Linyang Li, Xiaoqing Zheng, Qi Zhang, Kai-Wei Chang, and Cho-Jui Hsieh, in EMNLP, 2021.

Full Text Abstract BibTeX Details

Recent studies have shown that deep neural networks are vulnerable to intentionally crafted adversarial examples, and various methods have been proposed to defend against adversarial word-substitution attacks for neural NLP models. However, there is a lack of systematic study comparing different defense approaches under the same attacking setting. In this paper, we seek to fill this gap through comprehensive research on the behavior of neural text classifiers trained by various defense methods under representative adversarial attacks. In addition, we propose an effective method to further improve the robustness of neural text classifiers against such attacks and achieve the highest accuracy on both clean and adversarial examples on the AGNEWS and IMDB datasets by a significant margin.

@inproceedings{li2021searching,
  title = {Searching for an Effective Defender: Benchmarking Defense against Adversarial Word Substitution},
  author = {Li, Zongyi and Xu, Jianhan and Zeng, Jiehang and Li, Linyang and Zheng, Xiaoqing and Zhang, Qi and Chang, Kai-Wei and Hsieh, Cho-Jui},
  presentation_id = {https://underline.io/events/192/posters/8225/poster/38025-searching-for-an-effective-defender-benchmarking-defense-against-adversarial-word-substitution},
  booktitle = {EMNLP},
  year = {2021}
}

Details

Improving Zero-Shot Cross-Lingual Transfer Learning via Robust Training

Kuan-Hao Huang, Wasi Ahmad, Nanyun Peng, and Kai-Wei Chang, in EMNLP, 2021.

Full Text Code Abstract BibTeX Details

Pre-trained multilingual language encoders, such as multilingual BERT and XLM-R, show great potential for zero-shot cross-lingual transfer. However, these multilingual encoders do not precisely align words and phrases across languages. In particular, learning alignments in the multilingual embedding space usually requires sentence-level or word-level parallel corpora, which are expensive to obtain for low-resource languages. An alternative is to make the multilingual encoders more robust; when fine-tuning the encoder on a downstream task, we train the encoder to tolerate noise in the contextual embedding spaces such that even if the representations of different languages are not aligned well, the model can still achieve good performance on zero-shot cross-lingual transfer. In this work, we propose a learning strategy for training robust models by drawing connections between adversarial examples and the failure cases of zero-shot cross-lingual transfer. We adopt two widely used robust training methods, adversarial training and randomized smoothing, to train the desired robust model. The experimental results demonstrate that robust training improves zero-shot cross-lingual transfer on text classification tasks. The improvement is more significant in the generalized cross-lingual transfer setting, where the pair of input sentences belong to two different languages.

@inproceedings{huang2021improving,
  title = {Improving Zero-Shot Cross-Lingual Transfer Learning via Robust Training},
  author = {Huang, Kuan-Hao and Ahmad, Wasi and Peng, Nanyun and Chang, Kai-Wei},
  presentation_id = {https://underline.io/events/192/posters/7783/poster/40656-improving-zero-shot-cross-lingual-transfer-learning-via-robust-training},
  booktitle = {EMNLP},
  year = {2021}
}

Details

Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning

Da Yin, Liunian Harold Li, Ziniu Hu, Nanyun Peng, and Kai-Wei Chang, in EMNLP, 2021.

Full Text Code Abstract BibTeX Details

Commonsense is defined as the knowledge that is shared by everyone. However, certain types of commonsense knowledge are correlated with culture and geographic locations and they are only shared locally. For example, the scenarios of wedding ceremonies vary across regions due to different customs influenced by historical and religious factors. Such regional characteristics, however, are generally omitted in prior work. In this paper, we construct a Geo-Diverse Visual Commonsense Reasoning dataset (GD-VCR) to test vision-and-language models’ ability to understand cultural and geo-location-specific commonsense. In particular, we study two state-of-the-art Vision-and-Language models, VisualBERT and ViLBERT, trained on VCR, a standard multimodal commonsense benchmark with images primarily from Western regions. We then evaluate how well the trained models can generalize to answering the questions in GD-VCR. We find that the performance of both models for non-Western regions including East Asia, South Asia, and Africa is significantly lower than that for the Western region. We analyze the reasons behind the performance disparity and find that the performance gap is larger on QA pairs that: 1) are concerned with culture-related scenarios, e.g., weddings, religious activities, and festivals; 2) require high-level geo-diverse commonsense reasoning rather than low-order perception and recognition.

@inproceedings{yin2021broaden,
  title = {Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning},
  author = {Yin, Da and Li, Liunian Harold and Hu, Ziniu and Peng, Nanyun and Chang, Kai-Wei},
  booktitle = {EMNLP},
  presentation_id = {https://underline.io/events/192/sessions/7790/lecture/37514-broaden-the-vision-geo-diverse-visual-commonsense-reasoning},
  year = {2021}
}

Details

Harms of Gender Exclusivity and Challenges in Non-Binary Representation in Language Technologies

Sunipa Dev, Masoud Monajatipoor, Anaelia Ovalle, Arjun Subramonian, Jeff Phillips, and Kai-Wei Chang, in EMNLP, 2021.

Full Text Slides Poster Abstract BibTeX Details

Gender is widely discussed in the context of language tasks and when examining the stereotypes propagated by language models. However, current discussions primarily treat gender as binary, which can perpetuate harms such as the cyclical erasure of non-binary gender identities. These harms are driven by model and dataset biases, which are consequences of the non-recognition and lack of understanding of non-binary genders in society. In this paper, we explain the complexity of gender and language around it, and survey non-binary persons to understand harms associated with the treatment of gender as binary in English language technologies. We also detail how current language representations (e.g., GloVe, BERT) capture and perpetuate these harms and related challenges that need to be acknowledged and addressed for representations to equitably encode gender information.

@inproceedings{dev2021harms,
  title = {Harms of Gender Exclusivity and Challenges in Non-Binary Representation in Language Technologies},
  author = {Dev, Sunipa and Monajatipoor, Masoud and Ovalle, Anaelia and Subramonian, Arjun and Phillips, Jeff and Chang, Kai-Wei},
  presentation_id = {https://underline.io/events/192/sessions/7788/lecture/37320-harms-of-gender-exclusivity-and-challenges-in-non-binary-representation-in-language-technologies},
  blog_url = {https://uclanlp.medium.com/harms-of-gender-exclusivity-and-challenges-in-non-binary-representation-in-language-technologies-5f89891b5aee},
  booktitle = {EMNLP},
  year = {2021}
}

Details

"Nice Try, Kiddo": Investigating Ad Hominems in Dialogue Responses

Emily Sheng, Kai-Wei Chang, Prem Natarajan, and Nanyun Peng, in NAACL, 2021.

Full Text Video Code Abstract BibTeX Details

Ad hominem attacks are those that target some feature of a person’s character instead of the position the person is maintaining. These attacks are harmful because they propagate implicit biases and diminish a person’s credibility. Since dialogue systems respond directly to user input, it is important to study ad hominems in dialogue responses. To this end, we propose categories of ad hominems, compose an annotated dataset, and build a classifier to analyze human and dialogue system responses to English Twitter posts. We specifically compare responses to Twitter topics about marginalized communities (#BlackLivesMatter, #MeToo) versus other topics (#Vegan, #WFH), because the abusive language of ad hominems could further amplify the skew of power away from marginalized populations. Furthermore, we propose a constrained decoding technique that uses salient n-gram similarity as a soft constraint for top-k sampling to reduce the amount of ad hominems generated. Our results indicate that 1) responses from both humans and DialoGPT contain more ad hominems for discussions around marginalized communities, 2) different quantities of ad hominems in the training data can influence the likelihood of generating ad hominems, and 3) we can use constrained decoding techniques to reduce ad hominems in generated dialogue responses.

@inproceedings{sheng2021nice,
  title = {"Nice Try, Kiddo": Investigating Ad Hominems in Dialogue Responses},
  booktitle = {NAACL},
  author = {Sheng, Emily and Chang, Kai-Wei and Natarajan, Prem and Peng, Nanyun},
  presentation_id = {https://underline.io/events/122/sessions/4137/lecture/19854-%27nice-try,-kiddo%27-investigating-ad-hominems-in-dialogue-responses},
  year = {2021}
}

Details

Evaluating the Values of Sources in Transfer Learning

Md Rizwan Parvez and Kai-Wei Chang, in NAACL, 2021.

Full Text Video Code Abstract BibTeX Details

Transfer learning that adapts a model trained on data-rich sources to low-resource targets has been widely applied in natural language processing (NLP). However, when training a transfer model over multiple sources, not every source is equally useful for the target. To better transfer a model, it is essential to understand the values of the sources. In this paper, we develop SEAL-Shap, an efficient source valuation framework for quantifying the usefulness of the sources (e.g., domains/languages) in transfer learning based on the Shapley value method. Experiments and comprehensive analyses on both cross-domain and cross-lingual transfers demonstrate that our framework is not only effective in choosing useful transfer sources but also the source values match the intuitive source-target similarity.

@inproceedings{parvez2021evaluating,
  title = {Evaluating the Values of Sources in Transfer Learning},
  author = {Parvez, Md Rizwan and Chang, Kai-Wei},
  booktitle = {NAACL},
  presentation_id = {https://underline.io/events/122/sessions/4261/lecture/19707-evaluating-the-values-of-sources-in-transfer-learning},
  year = {2021}
}

Details

On the Transferability of Adversarial Attacks against Neural Text Classifier

Liping Yuan, Xiaoqing Zheng, Yi Zhou, Cho-Jui Hsieh, and Kai-Wei Chang, in EMNLP, 2021.

Full Text Abstract BibTeX Details

Deep neural networks are vulnerable to adversarial attacks, where a small perturbation to an input alters the model prediction. In many cases, malicious inputs intentionally crafted for one model can fool another model. In this paper, we present the first study to systematically investigate the transferability of adversarial examples for text classification models and explore how various factors, including network architecture, tokenization scheme, word embedding, and model capacity, affect the transferability of adversarial examples. Based on these studies, we propose a genetic algorithm to find an ensemble of models that can be used to induce adversarial examples to fool almost all existing models. Such adversarial examples reflect the defects of the learning process and the data bias in the training set. Finally, we derive word replacement rules that can be used for model diagnostics from these adversarial examples.

@inproceedings{yuan2021on,
  title = {On the Transferability of Adversarial Attacks against Neural Text Classifier},
  author = {Yuan, Liping and Zheng, Xiaoqing and Zhou, Yi and Hsieh, Cho-Jui and Chang, Kai-Wei},
  presentation_id = {https://underline.io/events/192/posters/8223/poster/38067-on-the-transferability-of-adversarial-attacks-against-neural-text-classifier},
  booktitle = {EMNLP},
  year = {2021}
}

Details

Disentangling Semantics and Syntax in Sentence Embeddings with Pre-trained Language Models

James Y. Huang, Kuan-Hao Huang, and Kai-Wei Chang, in NAACL (short), 2021.

Full Text Video Code Abstract BibTeX Details

Pre-trained language models have achieved huge success on a wide range of NLP tasks. However, contextual representations from pre-trained models contain entangled semantic and syntactic information, and therefore cannot be directly used to derive useful semantic sentence embeddings for some tasks. Paraphrase pairs offer an effective way of learning the distinction between semantics and syntax, as they naturally share semantics and often vary in syntax. In this work, we present ParaBART, a semantic sentence embedding model that learns to disentangle semantics and syntax in sentence embeddings obtained by pre-trained language models. ParaBART is trained to perform syntax-guided paraphrasing, based on a source sentence that shares semantics with the target paraphrase, and a parse tree that specifies the target syntax. In this way, ParaBART learns disentangled semantic and syntactic representations from their respective inputs with separate encoders. Experiments in English show that ParaBART outperforms state-of-the-art sentence embedding models on unsupervised semantic similarity tasks. Additionally, we show that our approach can effectively remove syntactic information from semantic sentence embeddings, leading to better robustness against syntactic variation on downstream semantic tasks.

@inproceedings{huang2021disentangling,
  title = {Disentangling Semantics and Syntax in Sentence Embeddings with Pre-trained Language Models},
  author = {Huang, James Y. and Huang, Kuan-Hao and Chang, Kai-Wei},
  booktitle = {NAACL (short)},
  presentation_id = {https://underline.io/events/122/sessions/4151/lecture/19910-disentangling-semantics-and-syntax-in-sentence-embeddings-with-pre-trained-language-models},
  year = {2021}
}

Details

Retrieval Augmented Code Generation and Summarization

Md Rizwan Parvez, Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang, in EMNLP-Finding, 2021.

Full Text Abstract BibTeX Details

Software developers write a lot of source code and documentation during software development. Intrinsically, developers often recall parts of source code or code summaries that they had written in the past while implementing software or documenting them. To mimic developers’ code or summary generation behavior, we propose a retrieval augmented framework, REDCODER, that retrieves relevant code or summaries from a retrieval database and provides them as a supplement to code generation or summarization models. REDCODER is unique in two ways. First, it extends the state-of-the-art dense retrieval technique to search for relevant code or summaries. Second, it can work with retrieval databases that include unimodal (only code or natural language description) or bimodal instances (code-description pairs). We conduct experiments and extensive analysis on two benchmark datasets of code generation and summarization in Java and Python, and the promising results endorse the effectiveness of our proposed retrieval augmented framework.

@inproceedings{parvez2021retrieval,
  title = {Retrieval Augmented Code Generation and Summarization},
  author = {Parvez, Md Rizwan and Ahmad, Wasi and Chakraborty, Saikat and Ray, Baishakhi and Chang, Kai-Wei},
  booktitle = {EMNLP-Finding},
  presentation_id = {https://underline.io/events/192/sessions/7923/lecture/38314-retrieval-augmented-code-generation-and-summarization},
  year = {2021}
}

Details

Relation-Guided Pre-Training for Open-Domain Question Answering

Ziniu Hu, Yizhou Sun, and Kai-Wei Chang, in EMNLP-Finding, 2021.

Full Text Abstract BibTeX Details

Answering complex open-domain questions requires understanding the latent relations between the entities involved. However, we found that the existing QA datasets are extremely imbalanced in some types of relations, which hurts the generalization performance over questions with long-tail relations. To remedy this problem, in this paper, we propose a Relation-Guided Pre-Training (RGPT-QA) framework. We first generate a relational QA dataset covering a wide range of relations from both the Wikidata triplets and Wikipedia hyperlinks. We then pre-train a QA model to infer the latent relations from the question, and then conduct extractive QA to get the target answer entity. We demonstrate that by pre-training with the proposed RGPT-QA technique, the popular open-domain QA model, Dense Passage Retriever (DPR), achieves 2.2%, 2.4%, and 6.3% absolute improvement in Exact Match accuracy on Natural Questions, TriviaQA, and WebQuestions. In particular, we show that RGPT-QA improves significantly on questions with long-tail relations.

@inproceedings{hu2021relation,
  title = {Relation-Guided Pre-Training for Open-Domain Question Answering},
  author = {Hu, Ziniu and Sun, Yizhou and Chang, Kai-Wei},
  presentation_id = {https://underline.io/events/192/sessions/7932/lecture/38507-relation-guided-pre-training-for-open-domain-question-answering},
  booktitle = {EMNLP-Finding},
  year = {2021}
}

Details

BERTHop: An Effective Vision-and-Language Model for Chest X-ray Disease Diagnosis

Masoud Monajatipoor, Mozhdeh Rouhsedaghat, Liunian Harold Li, Aichi Chien, C.-C. Jay Kuo, Fabien Scalzo, and Kai-Wei Chang, in ICCV workshop on Computer Vision for Automated Medical Diagnosis, 2021.

Full Text Abstract BibTeX Details

Vision-and-language (V&L) models take image and text as input and learn to capture the associations between them. Prior studies show that pre-trained V&L models can significantly improve the model performance for downstream tasks such as Visual Question Answering (VQA). However, V&L models are less effective when applied in the medical domain (e.g., on X-ray images and clinical notes) due to the domain gap. In this paper, we investigate the challenges of applying pre-trained V&L models in medical applications. In particular, we identify that the visual representation in general V&L models is not suitable for processing medical data. To overcome this limitation, we propose BERTHop, a transformer-based model based on PixelHop++ and VisualBERT, for better capturing the associations between the two modalities. Experiments on the OpenI dataset, a commonly used thoracic disease diagnosis benchmark, show that BERTHop achieves an average Area Under the Curve (AUC) of 98.12%, which is 1.62% higher than state-of-the-art (SOTA), while it is trained on a 9 times smaller dataset.

@inproceedings{monajatipoor2021berthop,
  title = {BERTHop: An Effective Vision-and-Language Model for Chest X-ray Disease Diagnosis},
  author = {Monajatipoor, Masoud and Rouhsedaghat, Mozhdeh and Li, Liunian Harold and Chien, Aichi and Kuo, C.-C. Jay and Scalzo, Fabien and Chang, Kai-Wei},
  booktitle = {ICCV workshop on Computer Vision for Automated Medical Diagnosis},
  year = {2021}
}

Details

An Integer Linear Programming Framework for Mining Constraints from Data

Tao Meng and Kai-Wei Chang, in ICML, 2021.

Full Text Video Code Abstract BibTeX Details

Various structured output prediction problems (e.g., sequential tagging) involve constraints over the output space. By identifying these constraints, we can filter out infeasible solutions and build an accountable model.
To this end, we present a general integer linear programming (ILP) framework for mining constraints from data. We model the inference of structured output prediction as an ILP problem. Then, given the coefficients of the objective function and the corresponding solution, we mine the underlying constraints by estimating the outer and inner polytopes of the feasible set. We verify the proposed constraint mining algorithm in various synthetic and real-world applications and demonstrate that the proposed approach successfully identifies the feasible set at scale.
In particular, we show that our approach can learn to solve 9x9 Sudoku puzzles and minimal spanning tree problems from examples without providing the underlying rules. We also demonstrate results on hierarchical multi-label classification and conduct a theoretical analysis on how close the mined constraints are from the ground truth.

@inproceedings{meng2020integer,
  author = {Meng, Tao and Chang, Kai-Wei},
  title = {An Integer Linear Programming Framework for Mining Constraints from Data},
  booktitle = {ICML},
  year = {2021}
}

Details

Intent Classification and Slot Filling for Privacy Policies

Wasi Ahmad, Jianfeng Chi, Tu Le, Thomas Norton, Yuan Tian, and Kai-Wei Chang, in ACL, 2021.

Full Text Video Code Abstract BibTeX Details

Understanding privacy policies is crucial for users as it empowers them to learn about the information that matters to them. Sentences written in a privacy policy document explain privacy practices, and the constituent text spans convey further specific information about that practice. We refer to predicting the privacy practice explained in a sentence as intent classification and identifying the text spans sharing specific information as slot filling. In this work, we propose PolicyIE, a corpus consisting of 5,250 intent and 11,788 slot annotations spanning 31 privacy policies of websites and mobile applications. PolicyIE corpus is a challenging benchmark with limited labeled examples reflecting the cost of collecting large-scale annotations. We present two alternative neural approaches as baselines: (1) formulating intent classification and slot filling as a joint sequence tagging and (2) modeling them as a sequence-to-sequence (Seq2Seq) learning task. Experiment results show that both approaches perform comparably in intent classification, while the Seq2Seq method outperforms the sequence tagging approach in slot filling by a large margin. Error analysis reveals the deficiency of the baseline approaches, suggesting room for improvement in future works. We hope the PolicyIE corpus will stimulate future research in this domain.

@inproceedings{ahmad2021intent,
  title = {Intent Classification and Slot Filling for Privacy Policies},
  author = {Ahmad, Wasi and Chi, Jianfeng and Le, Tu and Norton, Thomas and Tian, Yuan and Chang, Kai-Wei},
  booktitle = {ACL},
  year = {2021}
}

Details

Syntax-augmented Multilingual BERT for Cross-lingual Transfer

Wasi Ahmad, Haoran Li, Kai-Wei Chang, and Yashar Mehdad, in ACL, 2021.

Full Text Video Code Abstract BibTeX Details

In recent years, we have seen a colossal effort in pre-training multilingual text encoders using large-scale corpora in many languages to facilitate cross-lingual transfer learning. However, due to typological differences across languages, the cross-lingual transfer is challenging. Nevertheless, language syntax, e.g., syntactic dependencies, can bridge the typological gap. Previous works have shown that pre-trained multilingual encoders, such as mBERT (Devlin et al., 2019), capture language syntax, helping cross-lingual transfer. This work shows that explicitly providing language syntax and training mBERT using an auxiliary objective to encode the universal dependency tree structure helps cross-lingual transfer. We perform rigorous experiments on four NLP tasks, including text classification, question answering, named entity recognition, and task-oriented semantic parsing. The experiment results show that syntax-augmented mBERT improves cross-lingual transfer on popular benchmarks, such as PAWS-X and MLQA, by 1.4 and 1.6 points on average across all languages. In the generalized transfer setting, the performance is boosted significantly, by 3.9 and 3.1 points on average in PAWS-X and MLQA.

@inproceedings{ahmad2021syntax,
  title = {Syntax-augmented Multilingual BERT for Cross-lingual Transfer},
  author = {Ahmad, Wasi and Li, Haoran and Chang, Kai-Wei and Mehdad, Yashar},
  booktitle = {ACL},
  year = {2021}
}

Details

Select, Extract and Generate: Neural Keyphrase Generation with Layer-wise Coverage Attention

Wasi Ahmad, Xiao Bai, Soomin Lee, and Kai-Wei Chang, in ACL, 2021.

Full Text Abstract BibTeX Details

In recent years, the deep neural sequence-to-sequence framework has demonstrated promising results in keyphrase generation. However, processing long documents using such deep neural networks requires high computational resources. To reduce the computational cost, the documents are typically truncated before being given as inputs. As a result, the models may miss essential points conveyed in a document. Moreover, most of the existing methods are either extractive (identify important phrases from the document) or generative (generate phrases word by word), and hence they do not benefit from the advantages of both modeling techniques. To address these challenges, we propose SEG-Net, a neural keyphrase generation model that is composed of two major components, (1) a selector that selects the salient sentences in a document, and (2) an extractor-generator that jointly extracts and generates keyphrases from the selected sentences. SEG-Net uses a self-attentive architecture, known as Transformer, as the building block, with two unique features. First, SEG-Net incorporates a novel layer-wise coverage attention to summarize most of the points discussed in the target document. Second, it uses an informed copy attention mechanism to encourage focusing on different segments of the document during keyphrase extraction and generation. Besides, SEG-Net jointly learns keyphrase generation and their part-of-speech tag prediction, where the latter provides syntactic supervision to the former. The experimental results on seven keyphrase generation benchmarks from scientific and web documents demonstrate that SEG-Net outperforms the state-of-the-art neural generative methods by a large margin in both domains.

@inproceedings{ahmad2021select,
  title = {Select, Extract and Generate: Neural Keyphrase Generation with Layer-wise Coverage Attention},
  author = {Ahmad, Wasi and Bai, Xiao and Lee, Soomin and Chang, Kai-Wei},
  booktitle = {ACL},
  year = {2021}
}

Details

Societal Biases in Language Generation: Progress and Challenges

Emily Sheng, Kai-Wei Chang, Prem Natarajan, and Nanyun Peng, in ACL, 2021.

Full Text Abstract BibTeX Details

Technology for language generation has advanced rapidly, spurred by advancements in pre-training large models on massive amounts of data and the need for intelligent agents to communicate in a natural manner. While techniques can effectively generate fluent text, they can also produce undesirable societal biases that can have a disproportionately negative impact on marginalized populations. Language generation presents unique challenges for biases in terms of direct user interaction and the structure of decoding techniques. To better understand these challenges, we present a survey on societal biases in language generation, focusing on how data and techniques contribute to biases and progress towards reducing biases. Motivated by a lack of studies on biases from decoding techniques, we also conduct experiments to quantify the effects of these techniques. By further discussing general trends and open challenges, we call to attention promising directions for research and the importance of fairness and inclusivity considerations for language generation applications.

@inproceedings{sheng2021societal,
  title = {Societal Biases in Language Generation: Progress and Challenges},
  author = {Sheng, Emily and Chang, Kai-Wei and Natarajan, Prem and Peng, Nanyun},
  booktitle = {ACL},
  year = {2021}
}

Details

Defense against Synonym Substitution-based Adversarial Attacks via Dirichlet Neighborhood Ensemble

Yi Zhou, Xiaoqing Zheng, Cho-Jui Hsieh, Kai-Wei Chang, and Xuanjing Huang, in ACL, 2021.

Full Text Code Abstract BibTeX Details

Although deep neural networks have achieved prominent performance on many NLP tasks, they are vulnerable to adversarial examples. We propose Dirichlet Neighborhood Ensemble (DNE), a randomized method for training a robust model to defend against synonym substitution-based attacks. During training, DNE forms virtual sentences by sampling embedding vectors for each word in an input sentence from a convex hull spanned by the word and its synonyms, and it augments the training data with them. In such a way, the model is robust to adversarial attacks while maintaining the performance on the original clean data. DNE is agnostic to the network architectures and scales to large models (e.g., BERT) for NLP applications. Through extensive experimentation, we demonstrate that our method consistently outperforms recently proposed defense methods by a significant margin across different network architectures and multiple data sets.

@inproceedings{zhou2021defense,
  title = {Defense against Synonym Substitution-based Adversarial Attacks via Dirichlet Neighborhood Ensemble},
  author = {Zhou, Yi and Zheng, Xiaoqing and Hsieh, Cho-Jui and Chang, Kai-Wei and Huang, Xuanjing},
  booktitle = {ACL},
  year = {2021}
}

Details

Ethical-Advice Taker: Do Language Models Understand Natural Language Interventions?

Jieyu Zhao, Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Kai-Wei Chang, in ACL-Finding (short), 2021.

Full Text Abstract BibTeX Details

Is it possible to use natural language to intervene in a model’s behavior and alter its prediction in a desired way? We investigate the effectiveness of natural language interventions for reading-comprehension systems, studying this in the context of social stereotypes. Specifically, we propose a new language understanding task, Linguistic Ethical Interventions (LEI), where the goal is to amend a question-answering (QA) model’s unethical behavior by communicating context-specific principles of ethics and equity to it. To this end, we build upon recent methods for quantifying a system’s social stereotypes, augmenting them with different kinds of ethical interventions and the desired model behavior under such interventions. Our zero-shot evaluation finds that even today’s powerful neural language models are extremely poor ethical-advice takers, that is, they respond surprisingly little to ethical interventions even though these interventions are stated as simple sentences. Few-shot learning improves model behavior but remains far from the desired outcome, especially when evaluated for various types of generalization. Our new task thus poses a novel language understanding challenge for the community.

@inproceedings{zhao2021ethical,
  title = {Ethical-Advice Taker: Do Language Models Understand Natural Language Interventions?},
  author = {Zhao, Jieyu and Khashabi, Daniel and Khot, Tushar and Sabharwal, Ashish and Chang, Kai-Wei},
  booktitle = {ACL-Finding (short)},
  year = {2021}
}

Details

Does Robustness Improve Fairness? Approaching Fairness with Word Substitution Robustness Methods for Text Classification

Yada Pruksachatkun, Satyapriya Krishna, Jwala Dhamala, Rahul Gupta, and Kai-Wei Chang, in ACL-Finding, 2021.

Full Text Code Abstract BibTeX Details

Existing bias mitigation methods to reduce disparities in model outcomes across cohorts have focused on data augmentation, debiasing model embeddings, or adding fairness-based optimization objectives during training. Separately, certified word substitution robustness methods have been developed to decrease the impact of spurious features and synonym substitutions on model predictions. While their end goals are different, they both aim to encourage models to make the same prediction for certain changes in the input. In this paper, we investigate the utility of certified word substitution robustness methods to improve equality of odds and equality of opportunity on multiple text classification tasks. We observe that certified robustness methods improve fairness, and using both robustness and bias mitigation methods in training results in an improvement on both fronts.

@inproceedings{pruksachatkun2021robustness,
  title = {Does Robustness Improve Fairness? Approaching Fairness with Word Substitution Robustness Methods for Text Classification},
  author = {Pruksachatkun, Yada and Krishna, Satyapriya and Dhamala, Jwala and Gupta, Rahul and Chang, Kai-Wei},
  booktitle = {ACL-Finding},
  year = {2021}
}

Details

Aggression, escalation, and other latent themes in legal intervention deaths of non-Hispanic Black and White men: Results from the 2003-2017 NVDRS

Alina Arseniev-Koehler, Jacob Foster, Vickie Mays, Kai-Wei Chang, and Susan Cochran, in American Journal of Public Health, 2021.

Full Text Abstract BibTeX Details

Objectives. To investigate racial/ethnic differences in legal intervention-related deaths using state-of-the-art topic modeling of law enforcement and coroner text summaries drawn from the 2003-2017 US National Violent Death Reporting System (NVDRS). Methods. Employing advanced topic modeling, we identified 8 topics consistent with dangerousness in death incidents in the NVDRS death narratives written by public health workers (PHWs). Using logistic regression, we then evaluated racial/ethnic differences in PHW-coded variables and narrative topics among 4981 males killed by legal intervention, while adjusting for age, county-level characteristics, and year. Results. Black, as compared with White, decedents were younger and their deaths were less likely to include PHW-coded mental health or substance use histories, weapon use, or positive toxicology for alcohol or psychoactive drugs, but more likely to include gangs-as-an-incident-precipitant coding. Topic modeling revealed less frequent thematic representation of physical aggression or escalation but more of gangs or criminal networks among Black versus White decedents. Conclusions. While Black males were more likely to be victims of legal intervention deaths, PHW-coded variables in the NVDRS and death narratives suggest lower threat profiles among Black versus similar White decedents. The source of this greater risk remains undetermined.

@article{arseniev2021aggression,
  title = {Aggression, escalation, and other latent themes in legal intervention deaths of non-Hispanic Black and White men: Results from the 2003-2017 NVDRS},
  author = {Arseniev-Koehler, Alina and Foster, Jacob and Mays, Vickie and Chang, Kai-Wei and Cochran, Susan},
  journal = {American Journal of Public Health},
  year = {2021}
}

Details

Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation

Chong Zhang, Jieyu Zhao, Huan Zhang, Kai-Wei Chang, and Cho-Jui Hsieh, in NAACL, 2021.

Full Text Video Code Abstract BibTeX Details

Robustness and counterfactual bias are usually evaluated on a test dataset. However, are these evaluations robust? If the test dataset is perturbed slightly, will the evaluation results remain the same? In this paper, we propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset. The framework first perturbs the test dataset to construct abundant natural sentences similar to the test data, and then diagnoses the prediction change regarding a single-word substitution. We apply this framework to study two perturbation-based approaches that are used to analyze models’ robustness and counterfactual bias in English. (1) For robustness, we focus on synonym substitutions and identify vulnerable examples where the prediction can be altered. Our proposed attack attains high success rates (96.0%-99.8%) in finding vulnerable examples on both original and robustly trained CNNs and Transformers. (2) For counterfactual bias, we focus on substituting demographic tokens (e.g., gender, race) and measure the shift of the expected prediction among constructed sentences. Our method is able to reveal hidden model biases not directly shown in the test dataset.

@inproceedings{zhang2021double,
  title = {Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation},
  booktitle = {NAACL},
  author = {Zhang, Chong and Zhao, Jieyu and Zhang, Huan and Chang, Kai-Wei and Hsieh, Cho-Jui},
  year = {2021},
  presentation_id = {https://underline.io/events/122/sessions/4229/lecture/19609-double-perturbation-on-the-robustness-of-robustness-and-counterfactual-bias-evaluation}
}

Details

Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions

Liunian Harold Li, Haoxuan You, Zhecan Wang, Alireza Zareian, Shih-Fu Chang, and Kai-Wei Chang, in NAACL, 2021.

Full Text Video Abstract BibTeX Details

Pre-trained contextual vision-and-language (V&L) models have brought impressive performance improvement on various benchmarks. However, the paired text-image data required for pre-training are hard to collect and scale up. We investigate whether a strong V&L representation model can be learned without text-image pairs. We propose Weakly-supervised VisualBERT with the key idea of conducting "mask-and-predict" pre-training on language-only and image-only corpora. Additionally, we introduce the object tags detected by an object recognition model as anchor points to bridge the two modalities. Evaluation on four V&L benchmarks shows that Weakly-supervised VisualBERT achieves performance similar to that of a model pre-trained with paired data. Besides, pre-training on more image-only data further improves a model that already has access to aligned data, suggesting the possibility of utilizing billions of raw images available to enhance V&L models.

@inproceedings{li2021unsupervised,
  author = {Li, Liunian Harold and You, Haoxuan and Wang, Zhecan and Zareian, Alireza and Chang, Shih-Fu and Chang, Kai-Wei},
  title = {Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions},
  booktitle = {NAACL},
  presentation_id = {https://underline.io/events/122/sessions/4269/lecture/19725-unsupervised-vision-and-language-pre-training-without-parallel-images-and-captions},
  year = {2021}
}

Details

Adapting Coreference Resolution for Processing Violent Death Narratives

Ankith Uppunda, Susan Cochran, Jacob Foster, Alina Arseniev-Koehler, Vickie Mays, and Kai-Wei Chang, in NAACL (short), 2021.

Full Text Video Abstract BibTeX Details

Coreference resolution is an important component in analyzing narrative text from administrative data (e.g., clinical or police sources). However, existing coreference models trained on general language corpora suffer from poor transferability due to domain gaps, especially when they are applied to gender-inclusive data with lesbian, gay, bisexual, and transgender (LGBT) individuals. In this paper, we analyzed the challenges of coreference resolution in an exemplary form of administrative text written in English: violent death narratives from the USA’s Centers for Disease Control’s (CDC) National Violent Death Reporting System. We developed a set of data augmentation rules to improve model performance using a probabilistic data programming framework. Experiments on narratives from an administrative database, as well as existing gender-inclusive coreference datasets, demonstrate the effectiveness of data augmentation in training coreference models that can better handle text data about LGBT individuals.

@inproceedings{uppunda2021adapting,
  title = {Adapting Coreference Resolution for Processing Violent Death Narratives},
  author = {Uppunda, Ankith and Cochran, Susan and Foster, Jacob and Arseniev-Koehler, Alina and Mays, Vickie and Chang, Kai-Wei},
  booktitle = {NAACL (short)},
  presentation_id = {https://underline.io/events/122/sessions/4249/lecture/19662-adapting-coreference-resolution-for-processing-violent-death-narratives},
  year = {2021}
}

Details

BOLD: Dataset and metrics for measuring biases in open-ended language generation

Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta, in FAccT, 2021.

Full Text Code Abstract BibTeX Details

Recent advances in deep learning techniques have enabled machines to generate cohesive open-ended text when prompted with a sequence of words as context. While these models now empower many downstream applications from conversation bots to automatic storytelling, they have been shown to generate texts that exhibit social biases. To systematically study and benchmark social biases in open-ended language generation, we introduce the Bias in Open-Ended Language Generation Dataset (BOLD), a large-scale dataset that consists of 23,679 English text generation prompts for bias benchmarking across five domains: profession, gender, race, religion, and political ideology. We also propose new automated metrics for toxicity, psycholinguistic norms, and text gender polarity to measure social biases in open-ended text generation from multiple angles. An examination of text generated from three popular language models reveals that the majority of these models exhibit a larger social bias than human-written Wikipedia text across all domains. With these results we highlight the need to benchmark biases in open-ended language generation and caution users of language generation models on downstream tasks to be cognizant of these embedded prejudices.

@inproceedings{dhamala2021bold,
  author = {Dhamala, Jwala and Sun, Tony and Kumar, Varun and Krishna, Satyapriya and Pruksachatkun, Yada and Chang, Kai-Wei and Gupta, Rahul},
  title = {BOLD: Dataset and metrics for measuring biases in open-ended language generation},
  booktitle = {FAccT},
  year = {2021}
}

Details

Generating Syntactically Controlled Paraphrases without Using Annotated Parallel Pairs

Kuan-Hao Huang and Kai-Wei Chang, in EACL, 2021.

Full Text Slides Poster Code Abstract BibTeX Details

Paraphrase generation plays an essential role in natural language processing (NLP), and it has many downstream applications. However, training supervised paraphrase models requires many annotated paraphrase pairs, which are usually costly to obtain. On the other hand, the paraphrases generated by existing unsupervised approaches are usually syntactically similar to the source sentences and are limited in diversity. In this paper, we demonstrate that it is possible to generate syntactically diverse paraphrases without the need for annotated paraphrase pairs. We propose Syntactically controlled Paraphrase Generator (SynPG), an encoder-decoder based model that learns to disentangle the semantics and the syntax of a sentence from a collection of unannotated texts. The disentanglement enables SynPG to control the syntax of output paraphrases by manipulating the embedding in the syntactic space. Extensive experiments using automatic metrics and human evaluation show that SynPG performs better syntactic control than unsupervised baselines, while the quality of the generated paraphrases is competitive. We also demonstrate that the performance of SynPG is competitive with or even better than that of supervised models when the unannotated data is large. Finally, we show that the syntactically controlled paraphrases generated by SynPG can be utilized for data augmentation to improve the robustness of NLP models.

@inproceedings{huang2021generating,
  author = {Huang, Kuan-Hao and Chang, Kai-Wei},
  title = {Generating Syntactically Controlled Paraphrases without Using Annotated Parallel Pairs},
  booktitle = {EACL},
  year = {2021}
}

Details

Clinical Temporal Relation Extraction with Probabilistic Soft Logic Regularization and Global Inference

Yichao Zhou, Yu Yan, Rujun Han, J. Harry Caufield, Kai-Wei Chang, Yizhou Sun, Peipei Ping, and Wei Wang, in AAAI, 2021.

Full Text Code Abstract BibTeX Details

There has been a steady need in the medical community to precisely extract the temporal relations between clinical events. In particular, temporal information can facilitate a variety of downstream applications such as case report retrieval and medical question answering. However, existing methods either require expensive feature engineering or are incapable of modeling the global relational dependencies among the events. In this paper, we propose Clinical Temporal Relation Extraction with Probabilistic Soft Logic Regularization and Global Inference (CTRL-PG), a novel method to tackle the problem at the document level. Extensive experiments on two benchmark datasets, I2B2-2012 and TB-Dense, demonstrate that CTRL-PG significantly outperforms baseline methods for temporal relation extraction.

@inproceedings{zhou2021clinical,
  author = {Zhou, Yichao and Yan, Yu and Han, Rujun and Caufield, J. Harry and Chang, Kai-Wei and Sun, Yizhou and Ping, Peipei and Wang, Wei},
  title = {Clinical Temporal Relation Extraction with Probabilistic Soft Logic Regularization and Global Inference},
  booktitle = {AAAI},
  year = {2021}
}

Details

GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction

Wasi Ahmad, Nanyun Peng, and Kai-Wei Chang, in AAAI, 2021.

Full Text Code Abstract BibTeX Details

Prevalent approaches in cross-lingual relation and event extraction use graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic representations such that models trained on one language can be applied to other languages. However, GCNs struggle to model long-range dependencies or disconnected words in the dependency tree. To address this challenge, we propose to utilize the self-attention mechanism where we explicitly fuse structural information to learn the dependencies between words at different syntactic distances. We introduce GATE, a Graph Attention Transformer Encoder, and test its cross-lingual transferability on relation and event extraction tasks. We perform rigorous experiments on the widely used ACE05 dataset that includes three typologically different languages: English, Chinese, and Arabic. The evaluation results show that GATE outperforms three recently proposed methods by a large margin. Our detailed analysis reveals that due to the reliance on syntactic dependencies, GATE produces robust representations that facilitate transfer across languages.

@inproceedings{ahmad2021gate,
  author = {Ahmad, Wasi and Peng, Nanyun and Chang, Kai-Wei},
  title = {GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction},
  booktitle = {AAAI},
  year = {2021}
}

Details

2020

GPT-GNN: Generative Pre-Training of Graph Neural Networks

Ziniu Hu, Yuxiao Dong, Kuansan Wang, Kai-Wei Chang, and Yizhou Sun, in KDD, 2020.

Full Text Video Code Abstract BibTeX Details Top-10 cited paper at KDD 20

Graph neural networks (GNNs) have been demonstrated to be successful in modeling graph-structured data. However, training GNNs requires abundant task-specific labeled data, which is often arduously expensive to obtain. One effective way to reduce labeling effort is to pre-train an expressive GNN model on unlabelled data with self-supervision and then transfer the learned knowledge to downstream models. In this paper, we present the GPT-GNN framework to initialize GNNs by generative pre-training. GPT-GNN introduces a self-supervised attributed graph generation task to pre-train a GNN, which allows the GNN to capture the intrinsic structural and semantic properties of the graph. We factorize the likelihood of graph generation into two components: 1) attribute generation, and 2) edge generation. By modeling both components, GPT-GNN captures the inherent dependency between node attributes and graph structure during the generative process. Comprehensive experiments on the billion-scale academic graph and Amazon recommendation data demonstrate that GPT-GNN significantly outperforms state-of-the-art base GNN models without pre-training by up to 9.1% across different downstream tasks.

@inproceedings{hu2020gptgnn,
  author = {Hu, Ziniu and Dong, Yuxiao and Wang, Kuansan and Chang, Kai-Wei and Sun, Yizhou},
  title = {GPT-GNN: Generative Pre-Training of Graph Neural Networks},
  booktitle = {KDD},
  slide_url = {https://acbull.github.io/pdf/gpt.pptx},
  year = {2020}
}

Details

Provable, Scalable and Automatic Perturbation Analysis on General Computational Graphs

Kaidi Xu, Zhouxing Shi, Huan Zhang, Yihan Wang, Kai-Wei Chang, Minlie Huang, Bhavya Kailkhura, Xue Lin, and Cho-Jui Hsieh, in NeurIPS, 2020.

Full Text Code Abstract BibTeX Details

Linear relaxation based perturbation analysis (LiRPA) for neural networks, which computes provable linear bounds of output neurons given a certain amount of input perturbation, has become a core component in robustness verification and certified defense. The majority of LiRPA-based methods only consider simple feed-forward networks, and extending them to other architectures requires manual derivations and implementations. In this paper, we develop an automatic framework to enable perturbation analysis on any neural network structure, by generalizing existing LiRPA algorithms such as CROWN to operate on general computational graphs. The flexibility, differentiability and ease of use of our framework allow us to obtain state-of-the-art results on LiRPA based certified defense on fairly complicated networks like DenseNet, ResNeXt and Transformer that are not supported by prior work. Our framework also enables loss fusion, a technique that significantly reduces the computational complexity of LiRPA for certified defense. For the first time, we demonstrate LiRPA based certified defense on Tiny ImageNet and Downscaled ImageNet, to which previous approaches cannot scale due to the relatively large number of classes. Our work also yields an open-source library for the community to apply LiRPA to areas beyond certified defense without much LiRPA expertise; e.g., we create a neural network with a provably flat optimization landscape. Our open source library is available at https://github.com/KaidiXu/auto_LiRPA

@inproceedings{xu2020provable,
  author = {Xu, Kaidi and Shi, Zhouxing and Zhang, Huan and Wang, Yihan and Chang, Kai-Wei and Huang, Minlie and Kailkhura, Bhavya and Lin, Xue and Hsieh, Cho-Jui},
  title = {Provable, Scalable and Automatic Perturbation Analysis on General Computational Graphs},
  booktitle = {NeurIPS},
  year = {2020}
}

Details

LOGAN: Local Group Bias Detection by Clustering

Jieyu Zhao and Kai-Wei Chang, in EMNLP (short), 2020.

Full Text Code Abstract BibTeX Details

Machine learning techniques have been widely used in natural language processing (NLP). However, as revealed by many recent studies, machine learning models often inherit and amplify the societal biases in data. Various metrics have been proposed to quantify biases in model predictions. In particular, several of them evaluate disparity in model performance between protected groups and advantaged groups in the test corpus. However, we argue that evaluating bias at the corpus level is not enough for understanding how biases are embedded in a model. In fact, a model with similar aggregated performance between different groups on the entire data may behave differently on instances in a local region. To analyze and detect such local bias, we propose LOGAN, a new bias detection technique based on clustering. Experiments on toxicity classification and object classification tasks show that LOGAN identifies bias in a local region and allows us to better analyze the biases in model predictions.

@inproceedings{zhao2020logan,
  author = {Zhao, Jieyu and Chang, Kai-Wei},
  title = {LOGAN: Local Group Bias Detection by Clustering},
  booktitle = {EMNLP (short)},
  presentation_id = {https://virtual.2020.emnlp.org/paper_main.2886.html},
  year = {2020}
}

Details

Towards Controllable Biases in Language Generation

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng, in EMNLP-Finding, 2020.

Full Text Code Abstract BibTeX Details

We present a general approach towards controllable societal biases in natural language generation (NLG). Building upon the idea of adversarial triggers, we develop a method to induce societal biases in generated text when input prompts contain mentions of specific demographic groups. We then analyze two scenarios: 1) inducing negative biases for one demographic and positive biases for another demographic, and 2) equalizing biases between demographics. The former scenario enables us to detect the types of biases present in the model. Specifically, we show the effectiveness of our approach at facilitating bias analysis by finding topics that correspond to demographic inequalities in generated text and comparing the relative effectiveness of inducing biases for different demographics. The second scenario is useful for mitigating biases in downstream applications such as dialogue generation. In our experiments, the mitigation technique proves to be effective at equalizing the amount of biases across demographics while simultaneously generating less negatively biased text overall.

@inproceedings{sheng2020towards,
  title = {Towards Controllable Biases in Language Generation},
  author = {Sheng, Emily and Chang, Kai-Wei and Natarajan, Premkumar and Peng, Nanyun},
  booktitle = {EMNLP-Finding},
  year = {2020}
}

Details

Cross-Lingual Dependency Parsing by POS-Guided Word Reordering

Lu Liu, Yi Zhou, Jianhan Xu, Xiaoqing Zheng, Kai-Wei Chang, and Xuanjing Huang, in EMNLP-Finding, 2020.

Full Text Abstract BibTeX Details

We propose a novel approach to cross-lingual dependency parsing based on word reordering. The words in each sentence of a source language corpus are rearranged to meet the word order in a target language under the guidance of a part-of-speech based language model (LM). To obtain the highest reordering score under the LM, a population-based optimization algorithm and its genetic operators are designed to deal with the combinatorial nature of such word reordering. A parser trained on the reordered corpus can then be used to parse sentences in the target language. We demonstrate through extensive experimentation that our approach achieves better or comparable results across 25 target languages (1.73% increase on average), and outperforms a baseline by a significant margin on the languages that are greatly different from the source one. For example, when transferring the English parser to Hindi and Latin, our approach outperforms the baseline by 15.3% and 6.7% respectively.

@inproceedings{liu2020cross-lingual,
  author = {Liu, Lu and Zhou, Yi and Xu, Jianhan and Zheng, Xiaoqing and Chang, Kai-Wei and Huang, Xuanjing},
  title = {Cross-Lingual Dependency Parsing by POS-Guided Word Reordering},
  booktitle = {EMNLP-Finding},
  year = {2020}
}

Details

PolicyQA: A Reading Comprehension Dataset for Privacy Policies

Wasi Ahmad, Jianfeng Chi, Yuan Tian, and Kai-Wei Chang, in EMNLP-Finding (short), 2020.

Full Text Code Abstract BibTeX Details

Privacy policy documents are long and verbose. A question answering (QA) system can assist users in finding the information that is relevant and important to them. Prior studies in this domain frame the QA task as retrieving the most relevant text segment or a list of sentences from the policy document given a question. In contrast, we argue that providing users with a short text span from policy documents reduces the burden of searching for the target information in a lengthy text segment. In this paper, we present PolicyQA, a dataset that contains 25,017 reading comprehension style examples curated from an existing corpus of 115 website privacy policies. PolicyQA provides 714 human-annotated questions written for a wide range of privacy practices. We evaluate two existing neural QA models and perform rigorous analysis to reveal the advantages and challenges offered by PolicyQA.

@inproceedings{ahmad2020policyqa,
  author = {Ahmad, Wasi and Chi, Jianfeng and Tian, Yuan and Chang, Kai-Wei},
  title = {PolicyQA: A Reading Comprehension Dataset for Privacy Policies},
  booktitle = {EMNLP-Finding (short)},
  year = {2020}
}

Details

Generating Sports News from Live Commentary: A Chinese Dataset for Sports Game Summarization

Kuan-Hao Huang, Chen Li, and Kai-Wei Chang, in AACL (short), 2020.

Full Text Abstract BibTeX Details

Sports game summarization focuses on generating news articles from live commentaries. Unlike traditional summarization tasks, the source documents and the target summaries for sports game summarization tasks are written in quite different writing styles. In addition, live commentaries usually contain many named entities, which makes summarizing sports games precisely very challenging. To study this task in depth, we present SportsSum, a Chinese sports game summarization dataset which contains live commentaries and the corresponding news articles for 5,428 soccer games. Additionally, we propose a two-step summarization model consisting of a selector and a rewriter for SportsSum. To evaluate the correctness of generated sports summaries, we design two novel score metrics: name matching score and event matching score. Experimental results show that our model performs better than other summarization baselines on ROUGE scores as well as the two designed scores.

@inproceedings{huang2020generating,
  author = {Huang, Kuan-Hao and Li, Chen and Chang, Kai-Wei},
  title = {Generating Sports News from Live Commentary: A Chinese Dataset for Sports Game Summarization},
  booktitle = {AACL (short)},
  year = {2020}
}

Details

On the Robustness of Language Encoders against Grammatical Errors

Fan Yin, Quanyu Long, Tao Meng, and Kai-Wei Chang, in ACL, 2020.

Full Text Slides Video Code Abstract BibTeX Details

We conduct a thorough study to diagnose the behaviors of pre-trained language encoders (ELMo, BERT, and RoBERTa) when confronted with natural grammatical errors. Specifically, we collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data. We use this approach to facilitate debugging models on downstream applications. Results confirm that the performance of all tested models is affected but the degree of impact varies. To interpret model behaviors, we further design a linguistic acceptability task to reveal their abilities in identifying ungrammatical sentences and the position of errors. We find that fixed contextual encoders with a simple classifier trained on the prediction of sentence correctness are able to locate error positions. We also design a cloze test for BERT and discover that BERT captures the interaction between errors and specific tokens in context. Our results shed light on understanding the robustness and behaviors of language encoders against grammatical errors.

@inproceedings{yin2020robustness,
  author = {Yin, Fan and Long, Quanyu and Meng, Tao and Chang, Kai-Wei},
  title = {On the Robustness of Language Encoders against Grammatical Errors},
  booktitle = {ACL},
  presentation_id = {https://virtual.acl2020.org/paper_main.310.html},
  year = {2020}
}

Details

Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer

Jieyu Zhao, Subhabrata Mukherjee, Saghar Hosseini, Kai-Wei Chang, and Ahmed Hassan Awadallah, in ACL, 2020.

Full Text Slides Video Abstract BibTeX Details

Multilingual representations embed words from many languages into a single semantic space such that words with similar meanings are close to each other regardless of the language. These embeddings have been widely used in various settings, such as cross-lingual transfer, where a natural language processing (NLP) model trained on one language is deployed to another language. While the cross-lingual transfer techniques are powerful, they carry gender bias from the source to target languages. In this paper, we study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications. We create a multilingual dataset for bias analysis and propose several ways for quantifying bias in multilingual representations from both the intrinsic and extrinsic perspectives. Experimental results show that the magnitude of bias in the multilingual representations changes differently when we align the embeddings to different target spaces and that the alignment direction can also have an influence on the bias in transfer learning. We further provide recommendations for using the multilingual word representations for downstream tasks.

@inproceedings{zhao2020gender,
  author = {Zhao, Jieyu and Mukherjee, Subhabrata and Hosseini, Saghar and Chang, Kai-Wei and Awadallah, Ahmed Hassan},
  title = {Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer},
  booktitle = {ACL},
  year = {2020},
  presentation_id = {https://virtual.acl2020.org/paper_main.260.html}
}

Details

SentiBERT: A Transferable Transformer-Based Architecture for Compositional Sentiment Semantics

Da Yin, Tao Meng, and Kai-Wei Chang, in ACL, 2020.

Full Text Slides Video Code Abstract BibTeX Details

We propose SentiBERT, a variant of BERT that effectively captures compositional sentiment semantics. The model incorporates contextualized representation with a binary constituency parse tree to capture semantic composition. Comprehensive experiments demonstrate that SentiBERT achieves competitive performance on phrase-level sentiment classification. We further demonstrate that the sentiment composition learned from the phrase-level annotations on SST can be transferred to other sentiment analysis tasks as well as related tasks, such as emotion classification tasks. Moreover, we conduct ablation studies and design visualization methods to understand SentiBERT. We show that SentiBERT is better than baseline approaches in capturing negation and the contrastive relation and in modeling the compositional sentiment semantics.

@inproceedings{yin2020sentibert,
  author = {Yin, Da and Meng, Tao and Chang, Kai-Wei},
  title = {SentiBERT: A Transferable Transformer-Based Architecture for Compositional Sentiment Semantics},
  booktitle = {ACL},
  year = {2020},
  presentation_id = {https://virtual.acl2020.org/paper_main.341.html}
}

Details

"The Boating Store Had Its Best Sail Ever": Pronunciation-attentive Contextualized Pun Recognition

Yichao Zhou, Jyun-Yu Jiang, Jieyu Zhao, Kai-Wei Chang, and Wei Wang, in ACL, 2020.

Full Text Slides Video Code Abstract BibTeX Details

Humor plays an important role in human languages and it is essential to model humor when building intelligent systems. Among different forms of humor, puns perform wordplay for humorous effects by employing words with double entendre and high phonetic similarity. However, identifying and modeling puns is challenging as puns usually involve implicit semantic or phonological tricks. In this paper, we propose Pronunciation-attentive Contextualized Pun Recognition (PCPR) to perceive human humor, detect if a sentence contains puns and locate them in the sentence. PCPR derives contextualized representation for each word in a sentence by capturing the association between the surrounding context and its corresponding phonetic symbols. Extensive experiments are conducted on two benchmark datasets. Results demonstrate that the proposed approach significantly outperforms the state-of-the-art methods in pun detection and location tasks. In-depth analyses verify the effectiveness and robustness of PCPR.

@inproceedings{zhou2020boating,
  author = {Zhou, Yichao and Jiang, Jyun-Yu and Zhao, Jieyu and Chang, Kai-Wei and Wang, Wei},
  title = {"The Boating Store Had Its Best Sail Ever": Pronunciation-attentive Contextualized Pun Recognition},
  booktitle = {ACL},
  presentation_id = {https://virtual.acl2020.org/paper_main.75.html},
  year = {2020}
}

Details

Towards Understanding Gender Bias in Relation Extraction

Andrew Gaut, Tony Sun, Shirlyn Tang, Yuxin Huang, Jing Qian, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, and William Yang Wang, in ACL, 2020.

Full Text Abstract BibTeX Details

Recent developments in Neural Relation Extraction (NRE) have made significant strides towards automated knowledge base construction. While much attention has been dedicated towards improvements in accuracy, there have been no attempts in the literature to evaluate social biases exhibited in NRE systems. In this paper, we create WikiGenderBias, a distantly supervised dataset composed of over 45,000 sentences including a 10% human annotated test set for the purpose of analyzing gender bias in relation extraction systems. We find that when extracting spouse and hypernym (i.e., occupation) relations, an NRE system performs differently when the gender of the target entity is different. However, such disparity does not appear when extracting relations such as birth date or birth place. We also analyze two existing bias mitigation techniques, word embedding debiasing and data augmentation. Unfortunately, due to NRE models relying heavily on surface level cues, we find that existing bias mitigation approaches have a negative effect on NRE. Our analysis lays the groundwork for future work on quantifying and mitigating bias in relation extraction.

@inproceedings{gaut2020towards,
  author = {Gaut, Andrew and Sun, Tony and Tang, Shirlyn and Huang, Yuxin and Qian, Jing and ElSherief, Mai and Zhao, Jieyu and Mirza, Diba and Belding, Elizabeth and Chang, Kai-Wei and Wang, William Yang},
  title = {Towards Understanding Gender Bias in Relation Extraction},
  booktitle = {ACL},
  year = {2020},
  presentation_id = {https://virtual.acl2020.org/paper_main.265.html}
}

Details

What Does BERT with Vision Look At?

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang, in ACL (short), 2020.

Full Text Slides Video Code Abstract BibTeX Details

Pre-trained visually grounded language models such as ViLBERT, LXMERT, and UNITER have achieved significant performance improvement on vision-and-language tasks, but what they learn during pre-training remains unclear. In this work, we demonstrate that certain attention heads of a visually grounded language model actively ground elements of language to image regions. Specifically, some heads can map entities to image regions, performing the task known as entity grounding. Some heads can even detect the syntactic relations between non-entity words and image regions, tracking, for example, associations between verbs and regions corresponding to their arguments. We denote this ability as syntactic grounding. We verify grounding both quantitatively and qualitatively, using Flickr30K Entities as a testbed.

@inproceedings{li2020what,
  author = {Li, Liunian Harold and Yatskar, Mark and Yin, Da and Hsieh, Cho-Jui and Chang, Kai-Wei},
  title = {What Does BERT with Vision Look At?},
  booktitle = {ACL (short)},
  presentation_id = {https://virtual.acl2020.org/paper_main.469.html},
  year = {2020}
}

Details

Mitigating Gender Bias Amplification in Distribution by Posterior Regularization

Shengyu Jia, Tao Meng, Jieyu Zhao, and Kai-Wei Chang, in ACL (short), 2020.

Full Text Slides Video Code Abstract BibTeX Details

Advanced machine learning techniques have boosted the performance of natural language processing. Nevertheless, recent studies, e.g., Zhao et al. (2017), show that these techniques inadvertently capture the societal bias hidden in the corpus and further amplify it. However, their analysis is conducted only on models’ top predictions. In this paper, we investigate the gender bias amplification issue from the distribution perspective and demonstrate that the bias is amplified in the view of predicted probability distribution over labels. We further propose a bias mitigation approach based on posterior regularization. With little performance loss, our method can almost remove the bias amplification in the distribution. Our study sheds light on understanding the bias amplification.

@inproceedings{jia2020mitigating,
  author = {Jia, Shengyu and Meng, Tao and Zhao, Jieyu and Chang, Kai-Wei},
  title = {Mitigating Gender Bias Amplification in Distribution by Posterior Regularization},
  booktitle = {ACL (short)},
  year = {2020},
  presentation_id = {https://virtual.acl2020.org/paper_main.264.html}
}

Details

A Transformer-based Approach for Source Code Summarization

Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang, in ACL (short), 2020.

Full Text Slides Video Code Abstract BibTeX Details

Generating a readable summary that describes the functionality of a program is known as source code summarization. In this task, learning code representation by modeling the pairwise relationship between code tokens to capture their long-range dependencies is crucial. To learn code representation for summarization, we explore the Transformer model that uses a self-attention mechanism and has been shown to be effective in capturing long-range dependencies. In this work, we show that although the approach is simple, it outperforms the state-of-the-art techniques by a significant margin. We perform extensive analysis and ablation studies that reveal several important findings, e.g., the absolute encoding of source code tokens’ position hinders, while relative encoding significantly improves, the summarization performance. We have made our code publicly available to facilitate future research.

@inproceedings{ahmad2020transformer,
  author = {Ahmad, Wasi and Chakraborty, Saikat and Ray, Baishakhi and Chang, Kai-Wei},
  title = {A Transformer-based Approach for Source Code Summarization},
  booktitle = {ACL (short)},
  year = {2020},
  presentation_id = {https://virtual.acl2020.org/paper_main.449.html}
}

Details

Robustness Verification for Transformers

Zhouxing Shi, Huan Zhang, Kai-Wei Chang, Minlie Huang, and Cho-Jui Hsieh, in ICLR, 2020.

Full Text Video Code Abstract BibTeX Details

Robustness verification that aims to formally certify the prediction behavior of neural networks has become an important tool for understanding the behavior of a given model and for obtaining safety guarantees. However, previous methods are usually limited to relatively simple neural networks. In this paper, we consider the robustness verification problem for Transformers. Transformers have complex self-attention layers that pose many challenges for verification, including cross-nonlinearity and cross-position dependency, which have not been discussed in previous work. We resolve these challenges and develop the first verification algorithm for Transformers. The certified robustness bounds computed by our method are significantly tighter than those by naive Interval Bound Propagation. These bounds also shed light on interpreting Transformers as they consistently reflect the importance of words in sentiment analysis.

@inproceedings{shi2020robustness,
  author = {Shi, Zhouxing and Zhang, Huan and Chang, Kai-Wei and Huang, Minlie and Hsieh, Cho-Jui},
  title = {Robustness Verification for Transformers},
  booktitle = {ICLR},
  year = {2020}
}

Details

2019

Distributed Block-diagonal Approximation Methods for Regularized Empirical Risk Minimization

Ching-pei Lee and Kai-Wei Chang, in Machine Learning Journal, 2019.

Full Text Code Abstract BibTeX Details

Designing distributed algorithms for empirical risk minimization (ERM) has become an active research topic in recent years because of the practical need to deal with the huge volume of data. In this paper, we propose a general framework for training an ERM model via solving its dual problem in parallel over multiple machines. Our method provides a versatile approach for many large-scale machine learning problems, including linear binary/multi-class classification, regression, and structured prediction. Compared with existing approaches, we show that our method has faster convergence under weaker conditions both theoretically and empirically.

@inproceedings{LD17,
  author = {Lee, Ching-pei and Chang, Kai-Wei},
  title = {Distributed Block-diagonal Approximation Methods for Regularized Empirical Risk Minimization},
  booktitle = {Machine Learning Journal},
  year = {2019}
}

Details

Cross-lingual Dependency Parsing with Unlabeled Auxiliary Languages

Wasi Ahmad, Zhisong Zhang, Xuezhe Ma, Kai-Wei Chang, and Nanyun Peng, in CoNLL, 2019.

Full Text Poster Code Abstract BibTeX Details

Cross-lingual transfer learning has become an important weapon to battle the unavailability of annotated resources for low-resource languages. One of the fundamental techniques to transfer across languages is learning language-agnostic representations, in the form of word embeddings or contextual encodings. In this work, we propose to leverage unannotated sentences from auxiliary languages to help learn language-agnostic representations. Specifically, we explore adversarial training for learning contextual encoders that produce invariant representations across languages to facilitate cross-lingual transfer. We conduct experiments on cross-lingual dependency parsing where we train a dependency parser on a source language and transfer it to a wide range of target languages. Experiments on 28 target languages demonstrate that adversarial training significantly improves the overall transfer performance under several different settings. We conduct a careful analysis to evaluate the language-agnostic representations resulting from adversarial training.

@inproceedings{ahmad2019crosslingual,
  author = {Ahmad, Wasi and Zhang, Zhisong and Ma, Xuezhe and Chang, Kai-Wei and Peng, Nanyun},
  title = {Cross-lingual Dependency Parsing with Unlabeled Auxiliary Languages},
  booktitle = {CoNLL},
  year = {2019}
}

Details

Learning to Represent Bilingual Dictionaries

Muhao Chen, Yingtao Tian, Haochen Chen, Kai-Wei Chang, Steve Skiena, and Carlo Zaniolo, in CoNLL, 2019.

Full Text Abstract BibTeX Details

Bilingual word embeddings have been widely used to capture the correspondence of lexical semantics in different human languages. However, the cross-lingual correspondence between sentences and words is less studied, even though this correspondence can significantly benefit many applications such as cross-lingual semantic search and textual inference. To bridge this gap, we propose a neural embedding model that leverages bilingual dictionaries. The proposed model is trained to map the lexical definitions to the cross-lingual target words, for which we explore different sentence encoding techniques. To enhance the learning process on limited resources, our model adopts several critical learning strategies, including multi-task learning on different bridges of languages, and joint learning of the dictionary model with a bilingual word embedding model. We conduct experiments on two new tasks. In the cross-lingual reverse dictionary retrieval task, we demonstrate that our model is capable of comprehending bilingual concepts based on descriptions, and the proposed learning strategies are effective. In the bilingual paraphrase identification task, we show that our model effectively associates sentences in different languages via a shared embedding space, and outperforms existing approaches in identifying bilingual paraphrases.

@inproceedings{chen2019leanring,
  author = {Chen, Muhao and Tian, Yingtao and Chen, Haochen and Chang, Kai-Wei and Skiena, Steve and Zaniolo, Carlo},
  title = {Learning to Represent Bilingual Dictionaries},
  booktitle = {CoNLL},
  year = {2019}
}

Details

VisualBERT: A Simple and Performant Baseline for Vision and Language

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang, in Arxiv, 2019.

Full Text Code Abstract BibTeX Details One of the first BERT-based vision-language models

We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks. VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an associated input image with self-attention. We further propose two visually-grounded language model objectives for pre-training VisualBERT on image caption data. Experiments on four vision-and-language tasks including VQA, VCR, NLVR2, and Flickr30K show that VisualBERT outperforms or rivals state-of-the-art models while being significantly simpler. Further analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.

@inproceedings{li2019visualbert,
  author = {Li, Liunian Harold and Yatskar, Mark and Yin, Da and Hsieh, Cho-Jui and Chang, Kai-Wei},
  title = {VisualBERT: A Simple and Performant Baseline for Vision and Language},
  booktitle = {Arxiv},
  year = {2019}
}

Details

Target Language-Aware Constrained Inference for Cross-lingual Dependency Parsing

Tao Meng, Nanyun Peng, and Kai-Wei Chang, in EMNLP, 2019.

Full Text Poster Code Abstract BibTeX Details

Prior work on cross-lingual dependency parsing often focuses on capturing the commonalities between source and target languages and overlooks the potential of leveraging linguistic properties of the languages to facilitate the transfer. In this paper, we show that weak supervision from linguistic knowledge for the target languages can improve a cross-lingual graph-based dependency parser substantially. Specifically, we explore several types of corpus linguistic statistics and compile them into corpus-wise constraints to guide the inference process at test time. We adapt two techniques, Lagrangian relaxation and posterior regularization, to conduct inference with corpus-statistics constraints. Experiments show that the Lagrangian relaxation and posterior regularization inference improve the performance on 15 and 17 out of 19 target languages, respectively. The improvements are especially significant for target languages that have different word order features from the source language.

@inproceedings{meng2019target,
  author = {Meng, Tao and Peng, Nanyun and Chang, Kai-Wei},
  title = {Target Language-Aware Constrained Inference for Cross-lingual Dependency Parsing},
  booktitle = {EMNLP},
  year = {2019}
}

Details

Examining Gender Bias in Languages with Grammatical Gender

Pei Zhou, Weijia Shi, Jieyu Zhao, Kuan-Hao Huang, Muhao Chen, Ryan Cotterell, and Kai-Wei Chang, in EMNLP, 2019.

Full Text Poster Code Abstract BibTeX Details

Recent studies have shown that word embeddings exhibit gender bias inherited from the training corpora. However, most studies to date have focused on quantifying and mitigating such bias only in English. These analyses cannot be directly extended to languages that exhibit morphological agreement on gender, such as Spanish and French. In this paper, we propose new metrics for evaluating gender bias in word embeddings of these languages and further demonstrate evidence of gender bias in bilingual embeddings which align these languages with English. Finally, we extend an existing approach to mitigate gender bias in word embeddings under both monolingual and bilingual settings. Experiments on modified Word Embedding Association Test, word similarity, word translation, and word pair translation tasks show that the proposed approaches effectively reduce the gender bias while preserving the utility of the embeddings.

@inproceedings{zhou2019examining,
  author = {Zhou, Pei and Shi, Weijia and Zhao, Jieyu and Huang, Kuan-Hao and Chen, Muhao and Cotterell, Ryan and Chang, Kai-Wei},
  title = {Examining Gender Bias in Languages with Grammatical Gender},
  booktitle = {EMNLP},
  year = {2019}
}

Details

Learning to Discriminate Perturbations for Blocking Adversarial Attacks in Text Classification

Yichao Zhou, Jyun-Yu Jiang, Kai-Wei Chang, and Wei Wang, in EMNLP, 2019.

Full Text Code Abstract BibTeX Details

Adversarial attacks against machine learning models have threatened various real-world applications such as spam filtering and sentiment analysis. In this paper, we propose a novel framework, learning to DIScriminate Perturbations (DISP), to identify and adjust malicious perturbations, thereby blocking adversarial attacks for text classification models. To identify adversarial attacks, a perturbation discriminator validates how likely a token in the text is perturbed and provides a set of potential perturbations. For each potential perturbation, an embedding estimator learns to restore the embedding of the original word based on the context and a replacement token is chosen based on approximate kNN search. DISP can block adversarial attacks for any NLP model without modifying the model structure or training procedure. Extensive experiments on two benchmark datasets demonstrate that DISP significantly outperforms baseline methods in blocking adversarial attacks for text classification. In addition, in-depth analysis shows the robustness of DISP across different situations.

@inproceedings{zhou2019learning,
  author = {Zhou, Yichao and Jiang, Jyun-Yu and Chang, Kai-Wei and Wang, Wei},
  title = {Learning to Discriminate Perturbations for Blocking Adversarial Attacks in Text Classification},
  booktitle = {EMNLP},
  year = {2019}
}

Details

Robust Text Classifier on Test-Time Budgets

Md Rizwan Parvez, Tolga Bolukbasi, Kai-Wei Chang, and Venkatesh Saligrama, in EMNLP (short), 2019.

Full Text Slides Code Abstract BibTeX Details

We propose a generic and interpretable learning framework for building a robust text classification model that achieves accuracy comparable to full models under test-time budget constraints. Our approach learns a selector to identify words that are relevant to the prediction tasks and passes them to the classifier for processing. The selector is trained jointly with the classifier and directly learns to work with the classifier. We further propose a data aggregation scheme to improve the robustness of the classifier. Our learning framework is general and can be incorporated with any type of text classification model. On real-world data, we show that the proposed approach improves the performance of a given classifier and speeds up the model with only a minor loss in accuracy.

@inproceedings{parvez2019robust,
  author = {Parvez, Md Rizwan and Bolukbasi, Tolga and Chang, Kai-Wei and Saligrama, Venkatesh},
  title = {Robust Text Classifier on Test-Time Budgets},
  booktitle = {EMNLP (short)},
  year = {2019}
}

Details

Retrofitting Contextualized Word Embeddings with Paraphrases

Weijia Shi, Muhao Chen, Pei Zhou, and Kai-Wei Chang, in EMNLP (short), 2019.

Full Text Slides Video Code Abstract BibTeX Details

Contextualized word embedding models, such as ELMo, generate meaningful representations of words and their context. These models have been shown to have a great impact on downstream applications. However, in many cases, the contextualized embedding of a word changes drastically when the context is paraphrased. As a result, the downstream model is not robust to paraphrasing and other linguistic variations. To enhance the stability of contextualized word embedding models, we propose an approach to retrofitting contextualized embedding models with paraphrase contexts. Our method learns an orthogonal transformation on the input space, which seeks to minimize the variance of word representations on paraphrased contexts. Experiments show that the retrofitted model significantly outperforms the original ELMo on various sentence classification and language inference tasks.

@inproceedings{shi2019retrofitting,
  author = {Shi, Weijia and Chen, Muhao and Zhou, Pei and Chang, Kai-Wei},
  title = {Retrofitting Contextualized Word Embeddings with Paraphrases},
  booktitle = {EMNLP (short)},
  vimeo_id = {430797636},
  year = {2019}
}

Details

The Woman Worked as a Babysitter: On Biases in Language Generation

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng, in EMNLP (short), 2019.

Full Text Slides Video Code Abstract BibTeX Details

We present a systematic study of biases in natural language generation (NLG) by analyzing text generated from prompts that contain mentions of different demographic groups. In this work, we introduce the notion of the regard towards a demographic, use the varying levels of regard towards different demographics as a defining metric for bias in NLG, and analyze the extent to which sentiment scores are a relevant proxy metric for regard. To this end, we collect strategically-generated text from language models and manually annotate the text with both sentiment and regard scores. Additionally, we build an automatic regard classifier through transfer learning, so that we can analyze biases in unseen text. Together, these methods reveal the extent of the biased nature of language model generations. Our analysis provides a study of biases in NLG, bias metrics and correlated human judgments, and empirical evidence on the usefulness of our annotated dataset.

@inproceedings{sheng2019woman,
  author = {Sheng, Emily and Chang, Kai-Wei and Natarajan, Premkumar and Peng, Nanyun},
  title = {The Woman Worked as a Babysitter: On Biases in Language Generation},
  booktitle = {EMNLP (short)},
  vimeo_id = {426366363},
  year = {2019}
}

Details

Visualizing Trend of Key Roles in News Articles

Chen Xia, Haoxiang Zhang, Jacob Moghtader, Allen Wu, and Kai-Wei Chang, in EMNLP (demo), 2019.

Full Text Code Abstract BibTeX Details

There are tons of news articles generated every day reflecting the activities of key roles such as people, organizations and political parties. Analyzing these key roles allows us to understand the trends in news. In this paper, we present a demonstration system that visualizes the trend of key roles in news articles based on natural language processing techniques. Specifically, we apply a semantic role labeler and the dynamic word embedding technique to understand relationships between key roles in the news across different time periods and visualize how key roles and news topics change over time.

@inproceedings{xia2019visualizing,
  author = {Xia, Chen and Zhang, Haoxiang and Moghtader, Jacob and Wu, Allen and Chang, Kai-Wei},
  title = {Visualizing Trend of Key Roles in News Articles},
  booktitle = {EMNLP (demo)},
  year = {2019}
}

Details

Efficient Contextual Representation Learning With Continuous Outputs

Liunian Harold Li, Patrick H. Chen, Cho-Jui Hsieh, and Kai-Wei Chang, in TACL, 2019.

Full Text Slides Video Abstract BibTeX Details

Contextual representation models have achieved great success in improving various downstream natural language processing tasks. However, these language-model-based encoders are difficult to train due to their large parameter size and high computational complexity. By carefully examining the training procedure, we observe that the softmax layer, which predicts a distribution of the target word, often induces significant overhead, especially when the vocabulary size is large. Therefore, we revisit the design of the output layer and consider directly predicting the pre-trained embedding of the target word for a given context. When applied to ELMo, the proposed approach achieves a 4 times speedup and eliminates 80% of the trainable parameters while achieving competitive performance on downstream tasks. Further analysis shows that the approach maintains the speed advantage under various settings, even when the sentence encoder is scaled up.

@inproceedings{li2019efficient,
  author = {Li, Liunian Harold and Chen, Patrick H. and Hsieh, Cho-Jui and Chang, Kai-Wei},
  title = {Efficient Contextual Representation Learning With Continuous Outputs},
  booktitle = {TACL},
  year = {2019}
}

Details

Balanced Datasets Are Not Enough: Estimating and Mitigating Gender Bias in Deep Image Representations

Tianlu Wang, Jieyu Zhao, Mark Yatskar, Kai-Wei Chang, and Vicente Ordonez, in ICCV, 2019.

Full Text Code Demo Abstract BibTeX Details

In this work, we present a framework to measure and mitigate intrinsic biases with respect to protected variables –such as gender– in visual recognition tasks. We show that trained models significantly amplify the association of target labels with gender beyond what one would expect from biased datasets. Surprisingly, we show that even when datasets are balanced such that each label co-occurs equally with each gender, learned models amplify the association between labels and gender, as much as if data had not been balanced! To mitigate this, we adopt an adversarial approach to remove unwanted features corresponding to protected variables from intermediate representations in a deep neural network – and provide a detailed analysis of its effectiveness. Experiments on two datasets: the COCO dataset (objects), and the imSitu dataset (actions), show reductions in gender bias amplification while maintaining most of the accuracy of the original models.

@inproceedings{wang2019balanced,
  author = {Wang, Tianlu and Zhao, Jieyu and Yatskar, Mark and Chang, Kai-Wei and Ordonez, Vicente},
  title = {Balanced Datasets Are Not Enough: Estimating and Mitigating Gender Bias in Deep Image Representations},
  booktitle = {ICCV},
  year = {2019}
}

Details

Few-Shot Representation Learning for Out-Of-Vocabulary Words

Ziniu Hu, Ting Chen, Kai-Wei Chang, and Yizhou Sun, in ACL, 2019.

Full Text Poster Code Abstract BibTeX Details

Existing approaches for learning word embeddings often assume there are sufficient occurrences for each word in the corpus, such that the representation of words can be accurately estimated from their contexts. However, in real-world scenarios, out-of-vocabulary (a.k.a. OOV) words that do not appear in training corpus emerge frequently. It is challenging to learn accurate representations of these words with only a few observations. In this paper, we formulate the learning of OOV embeddings as a few-shot regression problem, and address it by training a representation function to predict the oracle embedding vector (defined as embedding trained with abundant observations) based on limited observations. Specifically, we propose a novel hierarchical attention-based architecture to serve as the neural regression function, with which the context information of a word is encoded and aggregated from K observations. Furthermore, our approach can leverage Model-Agnostic Meta-Learning (MAML) for adapting the learned model to the new corpus fast and robustly. Experiments show that the proposed approach significantly outperforms existing methods in constructing accurate embeddings for OOV words, and improves downstream tasks where these embeddings are utilized.

@inproceedings{hu2019fewshot,
  author = {Hu, Ziniu and Chen, Ting and Chang, Kai-Wei and Sun, Yizhou},
  title = {Few-Shot Representation Learning for Out-Of-Vocabulary Words},
  booktitle = {ACL},
  year = {2019}
}

Details

Mitigating Gender Bias in Natural Language Processing: Literature Review

Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai ElSherief, Jieyu Zhao, Diba Mirza, Kai-Wei Chang, and William Yang Wang, in ACL, 2019.

Full Text Slides Video Abstract BibTeX Details

As Natural Language Processing (NLP) and Machine Learning (ML) tools rise in popularity, it becomes increasingly vital to recognize the role they play in shaping societal biases and stereotypes. Although NLP models have shown success in modeling various applications, they propagate and may even amplify gender bias found in text corpora. While the study of bias in artificial intelligence is not new, methods to mitigate gender bias in NLP are relatively nascent. In this paper, we review contemporary studies on recognizing and mitigating gender bias in NLP. We discuss gender bias based on four forms of representation bias and analyze methods recognizing gender bias. Furthermore, we discuss the advantages and drawbacks of existing gender debiasing methods. Finally, we discuss future studies for recognizing and mitigating gender bias in NLP.

@inproceedings{sun2019mitigating,
  author = {Sun, Tony and Gaut, Andrew and Tang, Shirlyn and Huang, Yuxin and ElSherief, Mai and Zhao, Jieyu and Mirza, Diba and Chang, Kai-Wei and Wang, William Yang},
  title = {Mitigating Gender Bias in Natural Language Processing: Literature Review},
  booktitle = {ACL},
  vimeo_id = {384482151},
  year = {2019}
}

Details

On Difficulties of Cross-Lingual Transfer with Order Differences: A Case Study on Dependency Parsing

Wasi Uddin Ahmad, Zhisong Zhang, Xuezhe Ma, Eduard Hovy, Kai-Wei Chang, and Nanyun Peng, in NAACL, 2019.

Full Text Video Code Abstract BibTeX Details

Different languages might have different word orders. In this paper, we investigate cross-lingual transfer and posit that an order-agnostic model will perform better when transferring to distant foreign languages. To test our hypothesis, we train dependency parsers on an English corpus and evaluate their transfer performance on 30 other languages. Specifically, we compare encoders and decoders based on Recurrent Neural Networks (RNNs) and modified self-attentive architectures. The former relies on sequential information while the latter is more flexible at modeling word order. Rigorous experiments and detailed analysis show that RNN-based architectures transfer well to languages that are close to English, while self-attentive models have better overall cross-lingual transferability and perform especially well on distant languages.

@inproceedings{ahmad2019difficulties,
  author = {Ahmad, Wasi Uddin and Zhang, Zhisong and Ma, Xuezhe and Hovy, Eduard and Chang, Kai-Wei and Peng, Nanyun},
  title = {On Difficulties of Cross-Lingual Transfer with Order Differences: A Case Study on Dependency Parsing},
  booktitle = {NAACL},
  year = {2019}
}

Details

Gender Bias in Contextualized Word Embeddings

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Ryan Cotterell, Vicente Ordonez, and Kai-Wei Chang, in NAACL (short), 2019.

Full Text Slides Video Abstract BibTeX Details

Despite the great success of contextualized word embeddings on downstream applications, these representations potentially embed the societal biases exhibited in their training corpus. In this paper, we quantify, analyze and mitigate the gender bias exhibited in ELMo contextualized word vectors. We first demonstrate that the vectors encode and propagate information about genders unequally and then conduct a principal component analysis to visualize the geometry of the gender information in the embeddings. Then we show that ELMo works unequally well for men and women in down-stream tasks. Finally, we explore a variety of methods to remove such gender bias and demonstrate that it can be reduced through data augmentation.

@inproceedings{zhao2019gender,
  author = {Zhao, Jieyu and Wang, Tianlu and Yatskar, Mark and Cotterell, Ryan and Ordonez, Vicente and Chang, Kai-Wei},
  title = {Gender Bias in Contextualized Word Embeddings},
  booktitle = {NAACL (short)},
  year = {2019}
}

Details

Context Attentive Document Ranking and Query Suggestion

Wasi Ahmad, Kai-Wei Chang, and Hongning Wang, in SIGIR, 2019.

Full Text Slides Code Abstract BibTeX Details

We present a context-aware neural ranking model to exploit users’ on-task search activities and enhance retrieval performance. In particular, a two-level hierarchical recurrent neural network is introduced to learn search context representation of individual queries, search tasks, and corresponding dependency structure by jointly optimizing two companion retrieval tasks: document ranking and query suggestion. To identify variable dependency structure between search context and users’ ongoing search activities, attention at both levels of recurrent states is introduced. Extensive experimental comparisons against a rich set of baseline methods and an in-depth ablation analysis confirm the value of our proposed approach for modeling search context buried in search tasks.

@inproceedings{ahmad2019context,
  author = {Ahmad, Wasi and Chang, Kai-Wei and Wang, Hongning},
  title = {Context Attentive Document Ranking and Query Suggestion},
  booktitle = {SIGIR},
  year = {2019}
}

Details

Multifaceted Protein-Protein Interaction Prediction Based on Siamese Residual RCNN

Muhao Chen, Chelsea J.-T. Ju, Guangyu Zhou, Xuelu Chen, Tianran Zhang, Kai-Wei Chang, Carlo Zaniolo, and Wei Wang, in ISMB, 2019.

Full Text Code Abstract BibTeX Details

Sequence-based protein-protein interaction (PPI) prediction represents a fundamental computational biology problem. To address this problem, extensive research efforts have been made to extract predefined features from the sequences. Based on these features, statistical algorithms are learned to classify the PPIs. However, such explicit features are usually costly to extract, and typically have limited coverage on the PPI information. Hence, we present an end-to-end framework, Lasagna, for PPI predictions using only the primary sequences of a protein pair. Lasagna incorporates a deep residual recurrent convolutional neural network in the Siamese learning architecture, which leverages both robust local features and contextualized information that are significant for capturing the mutual influence of protein sequences. Our framework relieves the data pre-processing efforts that are required by other systems, and generalizes well to different application scenarios. Experimental evaluations show that Lasagna outperforms various state-of-the-art systems on the binary PPI prediction problem. Moreover, it shows a promising performance on more challenging problems of interaction type prediction and binding affinity estimation, where existing approaches fall short.

@inproceedings{chen2019multifaceted,
  author = {Chen, Muhao and Ju, Chelsea J.-T. and Zhou, Guangyu and Chen, Xuelu and Zhang, Tianran and Chang, Kai-Wei and Zaniolo, Carlo and Wang, Wei},
  title = {Multifaceted Protein-Protein Interaction Prediction Based on Siamese Residual RCNN},
  booktitle = {ISMB},
  year = {2019}
}

Details

Pre-Training Graph Neural Networks for Generic Structural Feature Extraction

Ziniu Hu, Changjun Fan, Ting Chen, Kai-Wei Chang, and Yizhou Sun, in ICLR 2019 Workshop: Representation Learning on Graphs and Manifolds, 2019.

Full Text Abstract BibTeX Details

Graph neural networks (GNNs) have been shown to be successful in modeling applications with graph structures. However, training an accurate GNN model requires a large collection of labeled data and expressive features, which might be inaccessible for some applications. To tackle this problem, we propose a pre-training framework that captures generic graph structural information that is transferable across tasks. Our framework can leverage the following three tasks: 1) denoising link reconstruction, 2) centrality score ranking, and 3) cluster preserving. The pre-training procedure can be conducted purely on synthetic graphs, and the pre-trained GNN is then adapted for downstream applications. With the proposed pre-training procedure, the generic structural information is learned and preserved, so the pre-trained GNN requires less labeled data and fewer domain-specific features to achieve high performance on different downstream tasks. Comprehensive experiments demonstrate that our proposed framework can significantly enhance the performance of various tasks at the level of node, link, and graph.

@inproceedings{hu2019pretraining,
  author = {Hu, Ziniu and Fan, Changjun and Chen, Ting and Chang, Kai-Wei and Sun, Yizhou},
  title = {Pre-Training Graph Neural Networks for Generic Structural Feature Extraction},
  booktitle = {ICLR 2019 Workshop: Representation Learning on Graphs and Manifolds},
  year = {2019}
}

Details

Learning Bilingual Word Embeddings Using Lexical Definitions

Weijia Shi, Muhao Chen, Yingtao Tian, and Kai-Wei Chang, in Repl4NLP (ACL workshop), 2019.

Full Text Abstract BibTeX Details

Bilingual word embeddings, which represent lexicons of different languages in a shared embedding space, are essential for supporting semantic and knowledge transfers in a variety of cross-lingual NLP tasks. Existing approaches to training bilingual word embeddings require either large collections of pre-defined seed lexicons that are expensive to obtain, or parallel sentences that comprise coarse and noisy alignment. In contrast, we propose BiLex, which leverages publicly available lexical definitions for bilingual word embedding learning. Without the need for predefined seed lexicons, BiLex comprises a novel word pairing strategy to automatically identify and propagate the precise fine-grained word alignment from lexical definitions. We evaluate BiLex in word-level and sentence-level translation tasks, which seek to find the cross-lingual counterparts of words and sentences respectively. BiLex significantly outperforms previous embedding methods on both tasks.

@inproceedings{shi2019bilingual,
  author = {Shi, Weijia and Chen, Muhao and Tian, Yingtao and Chang, Kai-Wei},
  title = {Learning Bilingual Word Embeddings Using Lexical Definitions},
  booktitle = {Repl4NLP (ACL workshop)},
  poster = {http://kwchang.net/documents/slides/shi2019bilingual_poster.pdf},
  year = {2019}
}

Details

2018

Generating Natural Language Adversarial Examples

Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang, in EMNLP (short), 2018.

Full Text Code Abstract BibTeX Details Top-10 cited paper at EMNLP 18

Deep neural networks (DNNs) are vulnerable to adversarial examples, perturbations to correctly classified examples which can cause the network to misclassify. In the image domain, these perturbations can often be made virtually indistinguishable to human perception, causing humans and state-of-the-art models to disagree. However, in the natural language domain, small perturbations are clearly perceptible, and the replacement of a single word can drastically alter the semantics of the document. Given these challenges, we use a population-based optimization algorithm to generate semantically and syntactically similar adversarial examples. We demonstrate via a human study that 94.3% of the generated examples are classified to the original label by human evaluators, and that the examples are perceptibly quite similar. We hope our findings encourage researchers to pursue improving the robustness of DNNs in the natural language domain.

@inproceedings{alzanto2018generating,
  author = {Alzantot, Moustafa and Sharma, Yash and Elgohary, Ahmed and Ho, Bo-Jhang and Srivastava, Mani and Chang, Kai-Wei},
  title = {Generating Natural Language Adversarial Examples},
  booktitle = {EMNLP (short)},
  year = {2018}
}

Details

Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang, in NAACL (short), 2018.

Full Text Poster Code Abstract BibTeX Details Top-10 cited paper at NAACL 18

In this paper, we introduce a new benchmark for co-reference resolution focused on gender bias, WinoBias. Our corpus contains Winograd-schema style sentences with entities corresponding to people referred to by their occupation (e.g. the nurse, the doctor, the carpenter). We demonstrate that a rule-based, a feature-rich, and a neural coreference system all link gendered pronouns to pro-stereotypical entities with higher accuracy than anti-stereotypical entities, by an average difference of 21.1 in F1 score. Finally, we demonstrate a data-augmentation approach that, in combination with existing word-embedding debiasing techniques, removes the bias demonstrated by these systems in WinoBias without significantly affecting their performance on existing datasets.

@inproceedings{zhao2018gender,
  author = {Zhao, Jieyu and Wang, Tianlu and Yatskar, Mark and Ordonez, Vicente and Chang, Kai-Wei},
  title = {Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods},
  booktitle = {NAACL (short)},
  press_url = {https://www.stitcher.com/podcast/matt-gardner/nlp-highlights/e/55861936},
  year = {2018}
}

Details

Learning Gender-Neutral Word Embeddings

Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei Chang, in EMNLP (short), 2018.

Full Text Code Abstract BibTeX Details

Word embeddings have become a fundamental component in a wide range of Natural Language Processing (NLP) applications. However, these word embeddings trained on human-generated corpora inherit strong gender stereotypes that reflect social constructs. In this paper, we propose a novel word embedding model, De-GloVe, that preserves gender information in certain dimensions of word vectors while compelling other dimensions to be free of gender influence. Quantitative and qualitative experiments demonstrate that De-GloVe successfully isolates gender information without sacrificing the functionality of the embedding model.

@inproceedings{zhao2018learning,
  author = {Zhao, Jieyu and Zhou, Yichao and Li, Zeyu and Wang, Wei and Chang, Kai-Wei},
  title = {Learning Gender-Neutral Word Embeddings},
  booktitle = {EMNLP (short)},
  year = {2018}
}

Details

Building Language Models for Text with Named Entities

Md Rizwan Parvez, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang, in ACL, 2018.

Full Text Poster Code Abstract BibTeX Details

Text in many domains involves a significant amount of named entities. Predicting the entity names is often challenging for a language model as they appear less frequently in the training corpus. In this paper, we propose a novel and effective approach to building a language model which can learn the entity names by leveraging their entity type information. We also introduce two benchmark datasets based on recipes and Java programming code, on which we evaluate the proposed model. Experimental results show that our model achieves 52.2% better perplexity in recipe generation and 40.3% on code generation than state-of-the-art language models.

@inproceedings{parvez2018building,
  author = {Parvez, Md Rizwan and Chakraborty, Saikat and Ray, Baishakhi and Chang, Kai-Wei},
  title = {Building Language Models for Text with Named Entities},
  booktitle = {ACL},
  year = {2018}
}

Details

Learning Word Embeddings for Low-resource Languages by PU Learning

Chao Jiang, Hsiang-Fu Yu, Cho-Jui Hsieh, and Kai-Wei Chang, in NAACL, 2018.

Full Text Slides Video Code Abstract BibTeX Details

Word embedding has been used as a key component in many downstream applications in processing natural languages. Existing approaches often assume the existence of a large collection of text for learning effective word embedding. However, such a corpus may not be available for some low-resource languages. In this paper, we study how to effectively learn a word embedding model on a corpus with only a few million tokens. In such a situation, the co-occurrence matrix is very sparse because many word pairs are not observed to co-occur. In contrast to existing approaches, we argue that the zero entries in the co-occurrence matrix also provide valuable information and design a Positive-Unlabeled Learning (PU-Learning) approach to factorize the co-occurrence matrix. The experimental results demonstrate that the proposed approach requires a smaller amount of training text to obtain a reasonable word embedding model.

@inproceedings{jiang2018learning,
  author = {Jiang, Chao and Yu, Hsiang-Fu and Hsieh, Cho-Jui and Chang, Kai-Wei},
  title = {Learning Word Embeddings for Low-resource Languages by PU Learning},
  booktitle = {NAACL},
  vimeo_id = {277670013},
  year = {2018}
}

Details

Co-training Embeddings of Knowledge Graphs and Entity Descriptions for Cross-lingual Entity Alignment

Muhao Chen, Yingtao Tian, Kai-Wei Chang, Steven Skiena, and Carlo Zaniolo, in IJCAI, 2018.

Full Text Slides Code Abstract BibTeX Details

Multilingual knowledge graph (KG) embeddings provide latent semantic representations of entities and structured knowledge enabled with cross-lingual inferences that benefit various knowledge-driven cross-lingual NLP tasks. However, precisely learning such cross-lingual inferences is usually hindered by the low coverage of entity alignment in many KGs. Since many multilingual KGs also provide literal descriptions of entities, in this paper, we introduce an embedding-based approach which leverages a weakly aligned multilingual KG for semi-supervised cross-lingual learning using entity descriptions. Our approach performs co-training of two embedding models, i.e. a multilingual KG embedding model and a multilingual literal description embedding model. The models are trained on a large Wikipedia-based trilingual dataset where most entity alignment is unknown to training. Experimental results show that the performance of the proposed approach on the entity alignment task improves at each iteration of co-training, and eventually reaches a stage at which it significantly surpasses previous approaches. We also show that our approach has promising abilities for zero-shot entity alignment, and cross-lingual KG completion.

@inproceedings{chen2018multilingual,
  author = {Chen, Muhao and Tian, Yingtao and Chang, Kai-Wei and Skiena, Steven and Zaniolo, Carlo},
  title = {Co-training Embeddings of Knowledge Graphs and Entity Descriptions for Cross-lingual Entity Alignment},
  booktitle = {IJCAI},
  year = {2018}
}

Details

Multi-Task Learning for Document Ranking and Query Suggestion

Wasi Ahmad, Kai-Wei Chang, and Hongning Wang, in ICLR, 2018.

Full Text Code Abstract BibTeX Details

We propose a multi-task learning framework to jointly learn document ranking and query suggestion for web search. It consists of two major components, a document ranker and a query recommender. The document ranker combines current query and session information and compares the combined representation with the document representation to rank the documents. The query recommender tracks users’ query reformulation sequence, considering all previous in-session queries, using a sequence-to-sequence approach. As both tasks are driven by the users’ underlying search intent, we perform joint learning of these two components through session recurrence, which encodes search context and intent. Extensive comparisons against state-of-the-art document ranking and query suggestion algorithms are performed on the public AOL search log, and the promising results endorse the effectiveness of the joint learning framework.

@inproceedings{ahmad2018multitask,
  author = {Ahmad, Wasi and Chang, Kai-Wei and Wang, Hongning},
  title = {Multi-Task Learning for Document Ranking and Query Suggestion},
  booktitle = {ICLR},
  year = {2018}
}

Details

Intent-aware Query Obfuscation for Privacy Protection in Personalized Web Search

Wasi Ahmad, Kai-Wei Chang, and Hongning Wang, in SIGIR, 2018.

Full Text Code Abstract BibTeX Details

Modern web search engines exploit users’ search history to personalize search results, with a goal of improving their service utility on a per-user basis. But it is this very dimension that leads to the risk of privacy infringement and raises serious public concerns. In this work, we propose a client-centered intent-aware query obfuscation solution for protecting user privacy in a personalized web search scenario. In our solution, each user query is submitted with l additional cover queries and corresponding clicks, which act as decoys to mask users’ genuine search intent from a search engine. The cover queries are sequentially sampled from a set of hierarchically organized language models to ensure the coherency of fake search intents in a cover search task. Our approach emphasizes the plausibility of generated cover queries, not only to the current genuine query but also to previous queries in the same task, to increase the complexity for a search engine to identify a user’s true intent. We also develop two new metrics from an information theoretic perspective to evaluate the effectiveness of provided privacy protection. Comprehensive experiment comparisons with state-of-the-art query obfuscation techniques are performed on the public AOL search log, and the propitious results substantiate the effectiveness of our solution.

@inproceedings{ahmad2018intent,
  author = {Ahmad, Wasi and Chang, Kai-Wei and Wang, Hongning},
  title = {Intent-aware Query Obfuscation for Privacy Protection in Personalized Web Search},
  booktitle = {SIGIR},
  year = {2018}
}

Details

Counterexamples for Robotic Planning Explained in Structured Language

Lu Feng, Mahsa Ghasemi, Kai-Wei Chang, and Ufuk Topcu, in ICRA, 2018.

Full Text Abstract BibTeX Details

Automated techniques such as model checking have been used to verify models of robotic mission plans based on Markov decision processes (MDPs) and generate counterexamples that may help diagnose requirement violations. However, such artifacts may be too complex for humans to understand, because existing representations of counterexamples typically include a large number of paths or a complex automaton. To help improve the interpretability of counterexamples, we define a notion of explainable counterexample, which includes a set of structured natural language sentences to describe the robotic behavior that leads to a requirement violation in an MDP model of a robotic mission plan. We propose an approach based on mixed-integer linear programming for generating explainable counterexamples that are minimal, sound and complete. We demonstrate the usefulness of the proposed approach via a case study of warehouse robot planning.

@inproceedings{feng2018conterexamples,
  author = {Feng, Lu and Ghasemi, Mahsa and Chang, Kai-Wei and Topcu, Ufuk},
  title = {Counterexamples for Robotic Planning Explained in Structured Language},
  booktitle = {ICRA},
  year = {2018}
}

Details

A Corpus to Learn Refer-to-as Relations for Nominals

Wasi Ahmad and Kai-Wei Chang, in LREC, 2018.

Full Text Code Abstract BibTeX Details

Continuous representations for words or phrases, trained on large unlabeled corpora, have proved very useful for many natural language processing tasks. While these vector representations capture many fine-grained syntactic and semantic regularities among words or phrases, they often lack coreferential information, which is useful for many downstream tasks such as information extraction and text summarization. In this paper, we argue that good word and phrase embeddings should contain information for identifying refer-to-as relationships, and we construct a corpus from Wikipedia to generate coreferential neural embeddings for nominals. The term nominal refers to a word or a group of words that functions like a noun phrase. In addition, we use coreference resolution as a proxy to evaluate the learned neural embeddings for noun phrases. To simplify the evaluation procedure, we design a coreferential phrase prediction task where the learned nominal embeddings are used to predict which candidate nominals can be referred to a target nominal. We further describe how to construct an evaluation dataset for such a task from the well-known OntoNotes corpus and demonstrate encouraging baseline results.

@inproceedings{AC18,
  author = {Ahmad, Wasi and Chang, Kai-Wei},
  title = {A Corpus to Learn Refer-to-as Relations for Nominals},
  booktitle = {LREC},
  year = {2018}
}

Details

Word and sentence embedding tools to measure semantic similarity of Gene Ontology terms by their definitions

Dat Duong, Wasi Uddin Ahmad, Eleazar Eskin, Kai-Wei Chang, and Jingyi Jessica Li, in Journal of Computational Biology, 2018.

Full Text Code Abstract BibTeX Details

The Gene Ontology (GO) database contains GO terms that describe biological functions of genes. Previous methods for comparing GO terms have relied on the fact that GO terms are organized into a tree structure. Under this paradigm, the locations of two GO terms in the tree dictate their similarity score. In this paper, we introduce two new solutions for this problem, by focusing instead on the definitions of the GO terms. We apply neural network based techniques from the natural language processing (NLP) domain. The first method does not rely on the GO tree, whereas the second indirectly depends on the GO tree. In our first approach, we compare two GO definitions by treating them as two unordered sets of words. The word similarity is estimated by a word embedding model that maps words into an N-dimensional space. In our second approach, we account for the word-ordering within a sentence. We use a sentence encoder to embed GO definitions into vectors and estimate how likely one definition entails another. We validate our methods in two ways. In the first experiment, we test the model’s ability to differentiate a true protein-protein network from a randomly generated network. In the second experiment, we test the model in identifying orthologs from randomly-matched genes in human, mouse, and fly. In both experiments, a hybrid of NLP and GO-tree based method achieves the best classification accuracy.

@inproceedings{DAECL18,
  author = {Duong, Dat and Ahmad, Wasi Uddin and Eskin, Eleazar and Chang, Kai-Wei and Li, Jingyi Jessica},
  title = {Word and sentence embedding tools to measure semantic similarity of Gene Ontology terms by their definitions},
  booktitle = {Journal of Computational Biology},
  year = {2018}
}

Details

A Corpus of Drug Usage Guidelines Annotated with Type of Advice

Sarah Masud Preum, Md. Rizwan Parvez, Kai-Wei Chang, and John Stankovic, in LREC, 2018.

Full Text Code Abstract BibTeX Details

Adherence to drug usage guidelines for prescription and over-the-counter drugs is critical for drug safety and effectiveness of treatment. Drug usage guideline documents contain advice on potential drug-drug interaction, drug-food interaction, and drug administration process. Current research on drug safety and public health indicates patients are often either unaware of such critical advice or overlook them. Categorizing advice statements from these documents according to their topics can enable the patients to find safety critical information. However, automatically categorizing drug usage guidelines based on their topic is an open challenge and there is no annotated dataset on drug usage guidelines. To address the latter issue, this paper presents (i) an annotation scheme for annotating safety critical advice from drug usage guidelines, (ii) an annotation tool for such data, and (iii) an annotated dataset containing drug usage guidelines from 90 drugs. This work is expected to accelerate further release of annotated drug usage guideline datasets and research on automatically filtering safety critical information from these textual documents.

@inproceedings{PPCS18,
  author = {Preum, Sarah Masud and Parvez, Md. Rizwan and Chang, Kai-Wei and Stankovic, John},
  title = {A Corpus of Drug Usage Guidelines Annotated with Type of Advice},
  booktitle = {LREC},
  year = {2018}
}

Details

Quantification and Analysis of Scientific Language Variation Across Research Fields

Pei Zhou, Muhao Chen, Kai-Wei Chang, and Carlo Zaniolo, in CDEC (workshop at ICDM), 2018.

Full Text Abstract BibTeX Details

Quantifying differences in terminologies from various academic domains has been a longstanding problem yet to be solved. We propose a computational approach for analyzing linguistic variation among scientific research fields by capturing the semantic change of terms based on a neural language model. The model is trained on a large collection of literature in five computer science research fields, for which we obtain field-specific vector representations for key terms, and global vector representations for other words. Several quantitative approaches are introduced to identify the terms whose semantics have drastically changed, or remain unchanged across different research fields. We also propose a metric to quantify the overall linguistic variation of research fields. After quantitative evaluation on human annotated data and qualitative comparison with other methods, we show that our model can improve cross-disciplinary data collaboration by identifying terms that potentially induce confusion during interdisciplinary studies.

@inproceedings{ZCCZ18,
  author = {Zhou, Pei and Chen, Muhao and Chang, Kai-Wei and Zaniolo, Carlo},
  title = {Quantification and Analysis of Scientific Language Variation Across Research Fields},
  booktitle = {CDEC (workshop at ICDM)},
  year = {2018}
}

Details

2017

Counterfactual Language Model Adaptation for Suggesting Phrases

Kenneth Arnold, Kai-Wei Chang, and Adam T. Kalai, in IJCNLP (short), 2017.

Full Text Abstract BibTeX Details

We study the challenge of suggesting multi-word phrases to be inserted while typing on a mobile keyboard. Recent work in mobile text entry user-interfaces has shown that, unlike single-word predictions, these phrases are treated as suggestions rather than predictions, meaning that users often insert words that weren’t what they were planning on typing.
This suggests the NLP problem of offering multi-word suggestions that are likely to be accepted by a user. We propose a method for customizing an existing language model to adapt it to a specific such task, and show how to learn the parameters of that customization offline.

@inproceedings{ACK17,
  author = {Arnold, Kenneth and Chang, Kai-Wei and Kalai, Adam T.},
  title = {Counterfactual Language Model Adaptation for Suggesting Phrases},
  booktitle = {IJCNLP (short)},
  year = {2017}
}

Details

Structured Prediction with Test-time Budget Constraints

Tolga Bolukbasi, Kai-Wei Chang, Joseph Wang, and Venkatesh Saligrama, in AAAI, 2017.

Full Text Slides Abstract BibTeX Details

We study the problem of structured prediction under test-time budget constraints. We propose a novel approach applicable to a wide range of structured prediction problems in computer vision and natural language processing. Our approach seeks to adaptively generate computationally costly features during test-time in order to reduce the computational cost of prediction while maintaining prediction performance. We show that training the adaptive feature generation system can be reduced to a series of structured learning problems, resulting in efficient training using existing structured learning algorithms. This framework provides theoretical justification for several existing heuristic approaches found in the literature. We evaluate our proposed adaptive system on two real-world structured prediction tasks, optical character recognition (OCR) and dependency parsing. For OCR, our method cuts the feature acquisition time by half while coming within a 1% margin of top accuracy. For dependency parsing, we realize an overall runtime gain of 20% without significant loss in performance.

@inproceedings{bolukbasi2017structured,
  author = {Bolukbasi, Tolga and Chang, Kai-Wei and Wang, Joseph and Saligrama, Venkatesh},
  title = {Structured Prediction with Test-time Budget Constraints},
  booktitle = {AAAI},
  year = {2017}
}

Details

Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang, in EMNLP, 2017.

Full Text Slides Code Abstract BibTeX Details EMNLP 2017 Best Long Paper Award; Top-10 cited paper at EMNLP 17

Language is increasingly being used to define rich visual recognition problems with supporting image collections sourced from the web. Structured prediction models are used in these tasks to take advantage of correlations between co-occurring labels and visual input but risk inadvertently encoding social biases found in web corpora.
In this work, we study data and models associated with multilabel object classification and visual semantic role labeling. We find that (a) datasets for these tasks contain significant gender bias and (b) models trained on these datasets further amplify existing bias. For example, the activity cooking is over 33% more likely to involve females than males in a training set, but a trained model amplifies the disparity to 68% at test time. We propose to inject corpus-level constraints for calibrating existing structured prediction models and design an algorithm based on Lagrangian relaxation for the resulting inference problems. Our method results in no performance loss for the underlying recognition task but decreases the magnitude of bias amplification by 33.3% and 44.9% for multilabel classification and visual semantic role labeling, respectively.

@inproceedings{zhao2017men,
  author = {Zhao, Jieyu and Wang, Tianlu and Yatskar, Mark and Ordonez, Vicente and Chang, Kai-Wei},
  title = {Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints},
  booktitle = {EMNLP},
  year = {2017}
}

Details

Beyond Bilingual: Multi-sense Word Embeddings using Multilingual Context

Shyam Upadhyay, Kai-Wei Chang, Matt Taddy, Adam Kalai, and James Zou, in ACL RepL4NLP Workshop, 2017.

Full Text Abstract BibTeX Details Best Paper Award

Word embeddings, which represent a word as a point in a vector space, have become ubiquitous in NLP tasks. A recent line of work uses bilingual (two languages) corpora to learn a different vector for each sense of a word, by exploiting crosslingual signals to aid sense identification. We present a multi-view Bayesian non-parametric algorithm which improves multi-sense word embeddings by (a) using multilingual (i.e., more than two languages) corpora to significantly improve sense embeddings beyond what one achieves with bilingual information, and (b) using a principled approach to learn a variable number of senses per word, in a data-driven manner. Ours is the first approach with the ability to leverage multilingual corpora efficiently for multi-sense representation learning. Experiments show that multilingual training significantly improves performance over monolingual and bilingual training, by allowing us to combine different parallel corpora to leverage multilingual context. Multilingual training yields comparable performance to a state-of-the-art monolingual model trained on five times more training data.

@inproceedings{upadhyay2017beyond,
  author = {Upadhyay, Shyam and Chang, Kai-Wei and Taddy, Matt and Kalai, Adam and Zou, James},
  title = {Beyond Bilingual: Multi-sense Word Embeddings using Multilingual Context},
  booktitle = {ACL RepL4NLP Workshop},
  year = {2017}
}

Details

2016

EMNLP 16 Workshop on Structured Prediction for NLP

Kai-Wei Chang, Ming-Wei Chang, Vivek Srikumar, and Alexander M. Rush, in EMNLP, 2016.

Full Text Abstract BibTeX Details

Many prediction tasks in NLP involve assigning values to mutually dependent variables. For example, when designing a model to automatically perform linguistic analysis of a sentence or a document (e.g., parsing, semantic role labeling, or discourse analysis), it is crucial to model the correlations between labels. Many other NLP tasks, such as machine translation, textual entailment, and information extraction, can be also modeled as structured prediction problems.
In order to tackle such problems, various structured prediction approaches have been proposed, and their effectiveness has been demonstrated. Studying structured prediction is interesting from both NLP and machine learning (ML) perspectives. From the NLP perspective, syntax and semantics of natural language are clearly structured and advances in this area will enable researchers to understand the linguistic structure of data. From the ML perspective, the large amount of available text data and complex linguistic structures bring challenges to the learning community. Designing expressive yet tractable models and studying efficient learning and inference algorithms become important issues.
Recently, there has been significant interest in non-standard structured prediction approaches that take advantage of non-linearity, latent components, and/or approximate inference in both the NLP and ML communities. Researchers have also been discussing the intersection between deep learning and structured prediction through the DeepStructure reading group. This workshop intends to bring together NLP and ML researchers working on diverse aspects of structured prediction and expose the participants to recent progress in this area.
Workshop Site

@inproceedings{CCSR16,
  author = {Chang, Kai-Wei and Chang, Ming-Wei and Srikumar, Vivek and Rush, Alexander M.},
  title = {EMNLP 16 Workshop on Structured Prediction for NLP},
  booktitle = {EMNLP},
  year = {2016}
}

Details

Learning from Explicit and Implicit Supervision Jointly For Algebra Word Problems

Shyam Upadhyay, Ming-Wei Chang, Kai-Wei Chang, and Wen-tau Yih, in EMNLP, 2016.

Full Text Abstract BibTeX Details

Automatically solving algebra word problems has raised considerable interest recently. Existing state-of-the-art approaches mainly rely on learning from human annotated equations. In this paper, we demonstrate that it is possible to efficiently mine algebra problems and their numerical solutions with little to no manual effort. To leverage the mined dataset, we propose a novel structured-output learning algorithm that aims to learn from both explicit (e.g., equations) and implicit (e.g., solutions) supervision signals jointly. Enabled by this new algorithm, our model gains 4.6% absolute improvement in accuracy on the ALG-514 benchmark compared to the one without using implicit supervision. The final model also outperforms the current state-of-the-art approach by 3%.
Dataset

@inproceedings{BCWS16,
  author = {Upadhyay, Shyam and Chang, Ming-Wei and Chang, Kai-Wei and Yih, Wen-tau},
  title = {Learning from Explicit and Implicit Supervision Jointly For Algebra Word Problems},
  booktitle = {EMNLP},
  year = {2016}
}

Details

A Credit Assignment Compiler for Joint Prediction

Kai-Wei Chang, He He, Hal Daume III, John Langford, and Stephane Ross, in NeurIPS, 2016.

Full Text Code Abstract BibTeX Details

Many machine learning applications involve jointly predicting multiple mutually dependent output variables. Learning to search is a family of methods where the complex decision problem is cast into a sequence of decisions via a search space. Although these methods have shown promise both in theory and in practice, implementing them has been burdensomely awkward. In this paper, we show the search space can be defined by an arbitrary imperative program, turning learning to search into a credit assignment compiler. Together with the algorithmic improvements for the compiler, we radically reduce the complexity of programming and the running time. We demonstrate the feasibility of our approach on multiple joint prediction tasks. In all cases, we obtain accuracies as high as alternative approaches, at drastically reduced execution and programming time.

@inproceedings{chang2016credit,
  author = {Chang, Kai-Wei and He, He and Daume III, Hal and Langford, John and Ross, Stephane},
  title = {A Credit Assignment Compiler for Joint Prediction},
  booktitle = {NeurIPS},
  year = {2016}
}

Details

Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings

Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai, in NeurIPS, 2016.

Full Text Code Abstract BibTeX Details Top-10 cited paper at NeurIPS 16

The blind application of machine learning runs the risk of amplifying biases present in data. Such a danger is facing us with word embedding, a popular framework to represent text data as vectors which has been used in many machine learning and natural language processing tasks. We show that even word embeddings trained on Google News articles exhibit female/male gender stereotypes to a disturbing extent. This raises concerns because their widespread use, as we describe, often tends to amplify these biases. Geometrically, gender bias is first shown to be captured by a direction in the word embedding. Second, gender neutral words are shown to be linearly separable from gender definition words in the word embedding. Using these properties, we provide a methodology for modifying an embedding to remove gender stereotypes, such as the association between the words receptionist and female, while maintaining desired associations such as between the words queen and female. We define metrics to quantify both direct and indirect gender biases in embeddings, and develop algorithms to "debias" the embedding. Using crowd-worker evaluation as well as standard benchmarks, we empirically demonstrate that our algorithms significantly reduce gender bias in embeddings while preserving its useful properties such as the ability to cluster related concepts and to solve analogy tasks. The resulting embeddings can be used in applications without amplifying gender bias.

@inproceedings{bolukbasi2016man,
  author = {Bolukbasi, Tolga and Chang, Kai-Wei and Zou, James and Saligrama, Venkatesh and Kalai, Adam},
  title = {Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings},
  booktitle = {NeurIPS},
  year = {2016}
}

Details

2015

IllinoisSL: A JAVA Library for Structured Prediction

Kai-Wei Chang, Shyam Upadhyay, Ming-Wei Chang, Vivek Srikumar, and Dan Roth, in arXiv, 2015.

Full Text Abstract BibTeX Details

Training a structured prediction model involves performing several loss-augmented inference steps. Over the lifetime of the training, many of these inference problems, although different, share the same solution. We propose AI-DCD, an Amortized Inference framework for the Dual Coordinate Descent method, an approximate learning algorithm, that accelerates the training process by exploiting this redundancy of solutions, without compromising the performance of the model. We show the efficacy of our method by training a structured SVM using dual coordinate descent for an entity-relation extraction task. Our method learns the same model as an exact training algorithm would, but calls the inference engine for only 10% to 24% of the inference problems encountered during training. We observe similar gains on a multi-label classification task and with a Structured Perceptron model for the entity-relation task.

@inproceedings{chang2015illinoissl,
  author = {Chang, Kai-Wei and Upadhyay, Shyam and Chang, Ming-Wei and Srikumar, Vivek and Roth, Dan},
  title = {IllinoisSL: A JAVA Library for Structured Prediction},
  booktitle = {arXiv},
  year = {2015}
}

Details

Distributed Training of Structured SVM

Ching-pei Lee, Kai-Wei Chang, Shyam Upadhyay, and Dan Roth, in OPT workshop at NeurIPS, 2015.

Full Text Abstract BibTeX Details

Training structured prediction models is time-consuming. However, most existing approaches use only a single machine, so the computing power and larger data capacity of multiple machines have not been exploited. In this work, we propose an efficient algorithm for distributed training of structured support vector machines based on a distributed block-coordinate descent method. Both theoretical and experimental results indicate that our method is efficient.

@inproceedings{lee2015distributed,
  author = {Lee, Ching-pei and Chang, Kai-Wei and Upadhyay, Shyam and Roth, Dan},
  title = {Distributed Training of Structured SVM},
  booktitle = {OPT workshop at NeurIPS},
  year = {2015}
}

Details

Learning to Search Better Than Your Teacher

Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford, in ICML, 2015.

Full Text Video Code Abstract BibTeX Details

Methods for learning to search for structured prediction typically imitate a reference policy, with existing theoretical guarantees demonstrating low regret compared to that reference. This is unsatisfactory in many applications where the reference policy is suboptimal and the goal of learning is to improve upon it. Can learning to search work even when the reference is poor?
We provide a new learning to search algorithm, LOLS, which does well relative to the reference policy, but additionally guarantees low regret compared to deviations from the learned policy: a local-optimality guarantee. Consequently, LOLS can improve upon the reference policy, unlike previous algorithms. This enables us to develop structured contextual bandits, a partial information structured prediction setting with many potential applications.

@inproceedings{chang2015learninh,
  author = {Chang, Kai-Wei and Krishnamurthy, Akshay and Agarwal, Alekh and Daum{\'e} III, Hal and Langford, John},
  title = {Learning to Search Better Than Your Teacher},
  booktitle = {ICML},
  year = {2015}
}

Details

A Joint Framework for Coreference Resolution and Mention Head Detection

Haoruo Peng, Kai-Wei Chang, and Dan Roth, in CoNLL, 2015.

Full Text Abstract BibTeX Details

In coreference resolution, a fair amount of research treats mention detection as a preprocessing step and focuses on developing algorithms for clustering coreferred mentions. However, there are significant gaps between the performance on gold mentions and the performance on the real problem, when mentions are predicted from raw text via an imperfect Mention Detection (MD) module. Motivated by the goal of reducing such gaps, we develop an ILP-based joint coreference resolution and mention head formulation that is shown to yield significant improvements on coreference from raw text, outperforming existing state-of-the-art systems on both the ACE-2004 and the CoNLL-2012 datasets. At the same time, our joint approach is shown to improve mention detection by close to 15% F1. One key insight underlying our approach is that identifying and co-referring mention heads is not only sufficient but is more robust than working with complete mentions.

@inproceedings{peng2015joint,
  author = {Peng, Haoruo and Chang, Kai-Wei and Roth, Dan},
  title = {A Joint Framework for Coreference Resolution and Mention Head Detection},
  booktitle = {CoNLL},
  year = {2015}
}

Details

Learning to Search for Dependencies

Kai-Wei Chang, He He, Hal Daumé III, and John Langford, in arXiv, 2015.

Full Text Code Abstract BibTeX Details

We demonstrate that a dependency parser can be built using a credit assignment compiler which removes the burden of worrying about low-level machine learning details from the parser implementation. The result is a simple parser that applies robustly to many languages and provides statistical and computational performance similar to the best transition-based parsing approaches to date, while avoiding various downsides including randomization, extra feature requirements, and custom learning algorithms.

@inproceedings{chang2015learning,
  author = {Chang, Kai-Wei and He, He and Daum{\'e} III, Hal and Langford, John},
  title = {Learning to Search for Dependencies},
  booktitle = {arXiv},
  year = {2015}
}

Details

Structural Learning with Amortized Inference

Kai-Wei Chang, Shyam Upadhyay, Gourab Kundu, and Dan Roth, in AAAI, 2015.

Full Text Poster Abstract BibTeX Details

Training a structured prediction model involves performing several loss-augmented inference steps. Over the lifetime of the training, many of these inference problems, although different, share the same solution. We propose AI-DCD, an Amortized Inference framework for the Dual Coordinate Descent method, an approximate learning algorithm, that accelerates the training process by exploiting this redundancy of solutions, without compromising the performance of the model. We show the efficacy of our method by training a structured SVM using dual coordinate descent for an entity-relation extraction task. Our method learns the same model as an exact training algorithm would, but calls the inference engine for only 10% to 24% of the inference problems encountered during training. We observe similar gains on a multi-label classification task and with a Structured Perceptron model for the entity-relation task.

@inproceedings{chang2015structural,
  author = {Chang, Kai-Wei and Upadhyay, Shyam and Kundu, Gourab and Roth, Dan},
  title = {Structural Learning with Amortized Inference},
  booktitle = {AAAI},
  year = {2015}
}

Details

Selective Algorithms for Large-Scale Classification and Structured Learning

Kai-Wei Chang, in UIUC PhD Thesis, 2015.

Full Text Abstract BibTeX Details

The desired output in many machine learning tasks is a structured object, such as a tree, clustering, or sequence. Learning accurate prediction models for such problems requires training on large amounts of data, making use of expressive features and performing global inference that simultaneously assigns values to all interrelated nodes in the structure. All these contribute to significant scalability problems. In this thesis, we describe a collection of results that address several aspects of these problems - by carefully selecting and caching samples, structures, or latent items.
Our results lead to efficient learning algorithms for large-scale binary classification models, structured prediction models and for online clustering models which, in turn, support reduction in problem size, improvements in training and evaluation speed and improved performance. We have used our algorithms to learn expressive models from large amounts of annotated data and achieve state-of-the-art performance on several natural language processing tasks.

@inproceedings{chang2015thesis,
  author = {Chang, Kai-Wei},
  title = {Selective Algorithms for Large-Scale Classification and Structured Learning},
  booktitle = {UIUC PhD Thesis},
  year = {2015}
}

Details

2014

A Discriminative Latent Variable Model for Online Clustering

Rajhans Samdani, Kai-Wei Chang, and Dan Roth, in ICML, 2014.

Full Text Slides Demo Abstract BibTeX Details

This paper presents a latent variable structured prediction model for discriminative supervised clustering of items called the Latent Left-linking Model (L3M). We present an online clustering algorithm for L3M based on a feature-based item similarity function. We provide a learning framework for estimating the similarity function and present a fast stochastic gradient-based learning technique. In our experiments on coreference resolution and document clustering, L3M outperforms several existing online as well as batch supervised clustering techniques.
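The left-linking idea behind an online clustering pass can be sketched as follows. This is a simplified, deterministic illustration and not the L3M model itself: it assumes each item links to its single most similar predecessor when that similarity exceeds a threshold, and otherwise starts a new cluster. The names `left_link_clusters` and `toy_similarity` and the threshold of 0 are hypothetical choices made for the example; in the paper the similarity function is learned from features.

```python
def left_link_clusters(items, similarity, threshold=0.0):
    """Assign cluster ids in one left-to-right pass.

    Each item links to its most similar predecessor if that similarity
    exceeds `threshold`; otherwise it starts a new cluster.
    """
    cluster_of = []
    next_id = 0
    for i, item in enumerate(items):
        best_j, best_score = None, threshold
        for j in range(i):
            score = similarity(items[j], item)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:
            cluster_of.append(next_id)  # no good antecedent: new cluster
            next_id += 1
        else:
            cluster_of.append(cluster_of[best_j])  # inherit the antecedent's cluster
    return cluster_of


def toy_similarity(a, b):
    """Hypothetical similarity on scalars: positive only when the items are close."""
    return 1.0 - abs(a - b)


labels = left_link_clusters([0.0, 0.1, 5.0, 5.2, 0.3], toy_similarity)
```

Because each item only looks at items to its left, new items can be clustered as they arrive without revisiting earlier decisions.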

@inproceedings{samdani2014discriminative,
  author = {Samdani, Rajhans and Chang, Kai-Wei and Roth, Dan},
  title = {A Discriminative Latent Variable Model for Online Clustering},
  booktitle = {ICML},
  year = {2014}
}

Details

Typed Tensor Decomposition of Knowledge Bases for Relation Extraction

Kai-Wei Chang, Wen-tau Yih, Bishan Yang, and Chris Meek, in EMNLP, 2014.

Full Text Video Abstract BibTeX Details

While relation extraction has traditionally been viewed as a task relying solely on textual data, recent work has shown that by taking as input existing facts in the form of entity-relation triples from both knowledge bases and textual data, the performance of relation extraction can be improved significantly. Following this new paradigm, we propose a tensor decomposition approach for knowledge base embedding that is highly scalable, and is especially suitable for relation extraction. By leveraging relational domain knowledge about entity type information, our learning algorithm is significantly faster than previous approaches and is better able to discover new relations missing from the database. In addition, when applied to a relation extraction task, our approach alone is comparable to several existing systems, and improves the weighted mean average precision of a state-of-the-art method by 10 points when used as a subcomponent.

@inproceedings{chang2014typed,
  author = {Chang, Kai-Wei and Yih, Wen-tau and Yang, Bishan and Meek, Chris},
  title = {Typed Tensor Decomposition of Knowledge Bases for Relation Extraction},
  booktitle = {EMNLP},
  year = {2014}
}

Details

The Illinois-Columbia System in the CoNLL-2014 Shared Task

Alla Rozovskaya, Kai-Wei Chang, Mark Sammons, Dan Roth, and Nizar Habash, in CoNLL Shared Task, 2014.

Full Text Abstract BibTeX Details

The CoNLL-2014 shared task is an extension of last year’s shared task and focuses on correcting grammatical errors in essays written by non-native learners of English. In this paper, we describe the Illinois-Columbia system that participated in the shared task. Our system ranked second on the original annotations and first on the revised annotations.
The core of the system is based on the University of Illinois model that placed first in the CoNLL-2013 shared task. This baseline model has been improved and expanded for this year’s competition in several respects. We describe our underlying approach, which relates to our previous work, and describe the novel aspects of the system in more detail.

@inproceedings{RCSRH14,
  author = {Rozovskaya, Alla and Chang, Kai-Wei and Sammons, Mark and Roth, Dan and Habash, Nizar},
  title = {The Illinois-Columbia System in the CoNLL-2014 Shared Task},
  booktitle = {CoNLL Shared Task},
  year = {2014}
}

Details

2013

The University of Illinois System in the CoNLL-2013 Shared Task

Alla Rozovskaya, Kai-Wei Chang, Mark Sammons, and Dan Roth, in CoNLL Shared Task, 2013.

Full Text Poster Abstract BibTeX Details

The CoNLL-2013 shared task focuses on correcting grammatical errors in essays written by non-native learners of English. In this paper, we describe the University of Illinois system that participated in the shared task. The system consists of five components and targets five types of common grammatical mistakes made by English as Second Language writers. We describe our underlying approach, which relates to our previous work, and describe the novel aspects of the system in more detail. Out of 17 participating teams, our system is ranked first based on both the original annotation and on the revised annotation.

@inproceedings{RCSR13,
  author = {Rozovskaya, Alla and Chang, Kai-Wei and Sammons, Mark and Roth, Dan},
  title = {The University of Illinois System in the CoNLL-2013 Shared Task},
  booktitle = {CoNLL Shared Task},
  year = {2013}
}

Details

A Constrained Latent Variable Model for Coreference Resolution

Kai-Wei Chang, Rajhans Samdani, and Dan Roth, in EMNLP, 2013.

Full Text Poster Demo Abstract BibTeX Details

Coreference resolution is a well known clustering task in Natural Language Processing. In this paper, we describe the Latent Left Linking model (L3M), a novel, principled, and linguistically motivated latent structured prediction approach to coreference resolution.
We show that L3M admits efficient inference and can be augmented with knowledge-based constraints; we also present a fast stochastic gradient based learning.
Experiments on ACE and Ontonotes data show that L3M and its constrained version, CL3M, are more accurate than several state-of-the-art approaches as well as some structured prediction models proposed in the literature.

@inproceedings{ChangSaRo13,
  author = {Chang, Kai-Wei and Samdani, Rajhans and Roth, Dan},
  title = {A Constrained Latent Variable Model for Coreference Resolution},
  booktitle = {EMNLP},
  year = {2013}
}

Details

Multi-Relational Latent Semantic Analysis

Kai-Wei Chang, Wen-tau Yih, and Chris Meek, in EMNLP, 2013.

Full Text Slides Abstract BibTeX Details

We present Multi-Relational Latent Semantic Analysis (MRLSA) which generalizes Latent Semantic Analysis (LSA). MRLSA provides an elegant approach to combining multiple relations between words by constructing a 3-way tensor. Similar to LSA, a low-rank approximation of the tensor is derived using a tensor decomposition. Each word in the vocabulary is thus represented by a vector in the latent semantic space and each relation is captured by a latent square matrix. The degree of two words having a specific relation can then be measured through simple linear algebraic operations. We demonstrate that by integrating multiple relations from both homogeneous and heterogeneous information sources, MRLSA achieves state-of-the-art performance on existing benchmark datasets for two relations, antonymy and is-a.

@inproceedings{chang2013mrlsa,
  author = {Chang, Kai-Wei and Yih, Wen-tau and Meek, Chris},
  title = {Multi-Relational Latent Semantic Analysis},
  booktitle = {EMNLP},
  year = {2013}
}

Details

Multi-core Structural SVM Training

Kai-Wei Chang, Vivek Srikumar, and Dan Roth, in ECML, 2013.

Full Text Poster Abstract BibTeX Details

Many problems in natural language processing and computer vision can be framed as structured prediction problems. Structural support vector machines (SVM) is a popular approach for training structured predictors, where learning is framed as an optimization problem. Most structural SVM solvers alternate between a model update phase and an inference phase (which predicts structures for all training examples). As structures become more complex, inference becomes a bottleneck and thus slows down learning considerably. In this paper, we propose a new learning algorithm for structural SVMs called DEMI-DCD that extends the dual coordinate descent approach by decoupling the model update and inference phases into different threads. We take advantage of multi-core hardware to parallelize learning with minimal synchronization between the model update and the inference phases. We prove that our algorithm not only converges but also fully utilizes all available processors to speed up learning, and validate our approach on two real-world NLP problems: part-of-speech tagging and relation extraction. In both cases, we show that our algorithm utilizes all available processors to speed up learning and achieves competitive performance. For example, it achieves a relative duality gap of 1% on a POS tagging problem in 192 seconds using 16 threads, while a standard implementation of a multi-threaded dual coordinate descent algorithm with the same number of threads requires more than 600 seconds to reach a solution of the same quality.

@inproceedings{chang2013multicore,
  author = {Chang, Kai-Wei and Srikumar, Vivek and Roth, Dan},
  title = {Multi-core Structural SVM Training},
  booktitle = {ECML},
  year = {2013}
}

Details

Tractable Semi-Supervised Learning of Complex Structured Prediction Models

Kai-Wei Chang, S. Sundararajan, and S. Sathiya Keerthi, in ECML, 2013.

Full Text Slides Poster Abstract BibTeX Details

Semi-supervised learning has been widely studied in the literature. However, most previous works assume that the output structure is simple enough to allow the direct use of tractable inference/learning algorithms (e.g., binary label or linear chain). Therefore, these methods cannot be applied to problems with complex structure. In this paper, we propose an approximate semi-supervised learning method that uses piecewise training for estimating the model weights and a dual decomposition approach for solving the inference problem of finding the labels of unlabeled data subject to domain specific constraints. This allows us to extend semi-supervised learning to general structured prediction problems. As an example, we apply this approach to the problem of multi-label classification (a fully connected pairwise Markov random field). Experimental results on benchmark data show that, in spite of using approximations, the approach is effective and yields good improvements in generalization performance over the plain supervised method. In addition, we demonstrate that our inference engine can be applied to other semi-supervised learning frameworks, and extends them to solve problems with complex structure.

@inproceedings{ChangSuKe13,
  author = {Chang, Kai-Wei and Sundararajan, S. and Keerthi, S. Sathiya},
  title = {Tractable Semi-Supervised Learning of Complex Structured Prediction Models},
  booktitle = {ECML},
  year = {2013}
}

Details

2012

Illinois-Coref: The UI System in the CoNLL-2012 Shared Task

Kai-Wei Chang, Rajhans Samdani, Alla Rozovskaya, Mark Sammons, and Dan Roth, in CoNLL Shared Task, 2012.

Full Text Poster Abstract BibTeX Details

The CoNLL-2012 shared task is an extension of last year’s coreference task. We participated in the closed track of the shared tasks in both years. In this paper, we present the improvements to the Illinois-Coref system since last year. We focus on improving mention detection and pronoun coreference resolution, and present a new learning protocol. These new strategies boost the performance of the system by 5% MUC F1, 0.8% BCUB F1, and 1.7% CEAF F1 on the OntoNotes-5.0 development set.

@inproceedings{CSRSR12,
  author = {Chang, Kai-Wei and Samdani, Rajhans and Rozovskaya, Alla and Sammons, Mark and Roth, Dan},
  title = {Illinois-Coref: The UI System in the CoNLL-2012 Shared Task},
  booktitle = {CoNLL Shared Task},
  year = {2012}
}

Details

Efficient Pattern-Based Time Series Classification on GPU

Kai-Wei Chang, Biplab Deka, W.-M. W. Hwu, and Dan Roth, in ICDM, 2012.

Full Text Abstract BibTeX Details

The time series shapelet discovery algorithm finds subsequences from a set of time series for use as primitives for time series classification. This algorithm has drawn a lot of interest because of the interpretability of its results. However, computation requirements restrict the algorithm from dealing with large data sets and may limit its application in many domains. In this paper, we address this issue by redesigning the algorithm for implementation on highly parallel Graphics Processing Units (GPUs). We investigate several concepts of GPU programming and propose a dynamic programming algorithm that is suitable for implementation on GPUs. Results show that the proposed GPU implementation significantly reduces the running time of the shapelet discovery algorithm. For example, on the largest sample dataset from the original authors, the running time is reduced from half a day to two minutes.

@inproceedings{CDHR12,
  author = {Chang, Kai-Wei and Deka, Biplab and Hwu, W.-M. W. and Roth, Dan},
  title = {Efficient Pattern-Based Time Series Classification on GPU},
  booktitle = {ICDM},
  year = {2012}
}

Details

Large Linear Classification When Data Cannot Fit In Memory

Hsiang-Fu Yu, Cho-Jui Hsieh, Kai-Wei Chang, and Chih-Jen Lin, in TKDD, 2012.

Full Text Code Abstract BibTeX Details Best Paper Award, KDD 10

Recent advances in linear classification have shown that for applications such as document classification, the training can be extremely efficient. However, most of the existing training methods are designed by assuming that data can be stored in the computer memory. These methods cannot be easily applied to data larger than the memory capacity due to the random access to the disk. We propose and analyze a block minimization framework for data larger than the memory size. At each step a block of data is loaded from the disk and handled by certain learning methods. We investigate two implementations of the proposed framework for primal and dual SVMs, respectively. As data cannot fit in memory, many design considerations are very different from those for traditional algorithms. Experiments using data sets 20 times larger than the memory demonstrate the effectiveness of the proposed method.

@inproceedings{yu2010large,
  author = {Yu, Hsiang-Fu and Hsieh, Cho-Jui and Chang, Kai-Wei and Lin, Chih-Jen},
  title = {Large Linear Classification When Data Cannot Fit In Memory},
  booktitle = {TKDD},
  year = {2012}
}

Details

2011

Selective Block Minimization for Faster Convergence of Limited Memory Large-scale Linear Models

Kai-Wei Chang and Dan Roth, in KDD, 2011.

Full Text Slides Poster Code Abstract BibTeX Details

As the size of data sets used to build classifiers steadily increases, training a linear model efficiently with limited memory becomes essential. Several techniques deal with this problem by loading blocks of data from disk one at a time, but usually take a considerable number of iterations to converge to a reasonable model. Even the best block minimization techniques [1] require many block loads since they treat all training examples uniformly. As disk I/O is expensive, reducing the amount of disk access can dramatically decrease the training time.

@inproceedings{ChangRo11,
  author = {Chang, Kai-Wei and Roth, Dan},
  title = {Selective Block Minimization for Faster Convergence of Limited Memory Large-scale Linear Models},
  booktitle = {KDD},
  year = {2011}
}

Details

Inference Protocols for Coreference Resolution

Kai-Wei Chang, Rajhans Samdani, Alla Rozovskaya, Nick Rizzolo, Mark Sammons, and Dan Roth, in CoNLL Shared Task, 2011.

Full Text Slides Poster Abstract BibTeX Details

This paper presents Illinois-Coref, a system for coreference resolution that participated in the CoNLL-2011 shared task. We investigate two inference methods, Best-Link and All-Link, along with their corresponding, pairwise and structured, learning protocols. Within these, we provide a flexible architecture for incorporating linguistically-motivated constraints, several of which we developed and integrated. We compare and evaluate the inference approaches and the contribution of constraints, analyze the mistakes of the system, and discuss the challenges of resolving coreference for the OntoNotes-4.0 data set.

@inproceedings{CSRRSR11,
  author = {Chang, Kai-Wei and Samdani, Rajhans and Rozovskaya, Alla and Rizzolo, Nick and Sammons, Mark and Roth, Dan},
  title = {Inference Protocols for Coreference Resolution},
  booktitle = {CoNLL Shared Task},
  year = {2011}
}

Details

2010

Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models

Fang-Lan Huang, Cho-Jui Hsieh, Kai-Wei Chang, and Chih-Jen Lin, in JMLR, 2010.

Full Text Abstract BibTeX Details

Maximum entropy (Maxent) is useful in natural language processing and many other areas. Iterative scaling (IS) methods are one of the most popular approaches to solve Maxent. With many variants of IS methods, it is difficult to understand them and see the differences. In this paper, we create a general and unified framework for iterative scaling methods. This framework also connects iterative scaling and coordinate descent methods. We prove general convergence results for IS methods and analyze their computational complexity. Based on the proposed framework, we extend a coordinate descent method for linear SVM to Maxent. Results show that it is faster than existing iterative scaling methods.

@inproceedings{HHCL10,
  author = {Huang, Fang-Lan and Hsieh, Cho-Jui and Chang, Kai-Wei and Lin, Chih-Jen},
  title = {Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models},
  booktitle = {JMLR},
  year = {2010}
}

Details

Training and Testing Low-degree Polynomial Data Mappings via Linear SVM

Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, and Chih-Jen Lin, in JMLR, 2010.

Full Text Code Abstract BibTeX Details

Kernel techniques have long been used in SVM to handle linearly inseparable problems by transforming data to a high dimensional space, but training and testing large data sets is often time consuming. In contrast, we can efficiently train and test much larger data sets using linear SVM without kernels. In this work, we apply fast linear-SVM methods to the explicit form of polynomially mapped data and investigate implementation issues. The approach enjoys fast training and testing, but may sometimes achieve accuracy close to that of using highly nonlinear kernels. Empirical experiments show that the proposed method is useful for certain large-scale data sets. We successfully apply the proposed method to a natural language processing (NLP) application by improving the testing accuracy under some training/testing speed requirements.

@inproceedings{CHCRL10,
  author = {Chang, Yin-Wen and Hsieh, Cho-Jui and Chang, Kai-Wei and Ringgaard, Michael and Lin, Chih-Jen},
  title = {Training and Testing Low-degree Polynomial Data Mappings via Linear SVM},
  booktitle = {JMLR},
  year = {2010}
}

Details

A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification

Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin, in JMLR, 2010.

Full Text Code Abstract BibTeX Details

Large-scale linear classification is widely used in many areas. The L1-regularized form can be applied for feature selection; however, its non-differentiability causes more difficulties in training. Although various optimization methods have been proposed in recent years, these have not yet been compared suitably. In this paper, we first broadly review existing methods. Then, we discuss state-of-the-art software packages in detail and propose two efficient implementations. Extensive comparisons indicate that carefully implemented coordinate descent methods are very suitable for training large document data.

@inproceedings{YCHL10,
  author = {Yuan, Guo-Xun and Chang, Kai-Wei and Hsieh, Cho-Jui and Lin, Chih-Jen},
  title = {A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification},
  booktitle = {JMLR},
  year = {2010}
}

Details

2009

An Ensemble of Three Classifiers for KDD Cup 2009: Expanded Linear Model, Heterogeneous Boosting, and Selective Naive Bayes

Hung-Yi Lo, Kai-Wei Chang, Shang-Tse Chen, Tsung-Hsien Chiang, ChunSung Ferng, Cho-Jui Hsieh, Yi-Kuang Ko, Tsung-Ting Kuo, Hung-Che Lai, Ken-Yi Lin, Chia-Hsuan Wang, Hsiang-Fu Yu, Chih-Jen Lin, Hsuan-Tien Lin, and Shou-de Lin, in KDD Cup, 2009.

Full Text Abstract BibTeX Details

This paper describes our ensemble of three classifiers for the KDD Cup 2009 challenge. First, we transform the three binary classification tasks into a joint multi-class classification problem, and solve an l1-regularized maximum entropy model under the LIBLINEAR framework. Second, we propose a heterogeneous base learner, which is capable of handling different types of features and missing values, and use AdaBoost to improve the base learner. Finally, we adopt a selective naive Bayes classifier that automatically groups categorical features and discretizes numerical ones. The parameters are tuned using cross-validation results rather than the 10% test results on the competition website. Based on the observation that the three positive labels are exclusive, we conduct a post-processing step using the linear SVM to jointly adjust the prediction scores of each classifier on the three tasks. Then, we average these prediction scores with careful validation to get the final outputs. Our final average AUC on the whole test set is 0.8461, which ranks third place in the slow track of KDD Cup 2009.

@inproceedings{LCCCFHKKLLWYLLL09,
  author = {Lo, Hung-Yi and Chang, Kai-Wei and Chen, Shang-Tse and Chiang, Tsung-Hsien and Ferng, ChunSung and Hsieh, Cho-Jui and Ko, Yi-Kuang and Kuo, Tsung-Ting and Lai, Hung-Che and Lin, Ken-Yi and Wang, Chia-Hsuan and Yu, Hsiang-Fu and Lin, Chih-Jen and Lin, Hsuan-Tien and Lin, Shou-de},
  title = {An Ensemble of Three Classifiers for KDD Cup 2009: Expanded Linear Model, Heterogeneous Boosting, and Selective Naive Bayes},
  booktitle = {KDD Cup},
  year = {2009}
}

Details

2008

A Sequential Dual Method for Large Scale Multi-Class Linear SVMs

S. Sathiya Keerthi, S. Sundararajan, Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin, in KDD, 2008.

Full Text Code Abstract BibTeX Details

Efficient training of direct multi-class formulations of linear Support Vector Machines is very useful in applications such as text classification with a huge number of examples as well as features. This paper presents a fast dual method for this training. The main idea is to sequentially traverse through the training set and optimize the dual variables associated with one example at a time. The speed of training is enhanced further by shrinking and cooling heuristics. Experiments indicate that our method is much faster than state-of-the-art solvers such as bundle, cutting plane and exponentiated gradient methods.

@inproceedings{KSCHL08,
  author = {Keerthi, S. Sathiya and Sundararajan, S. and Chang, Kai-Wei and Hsieh, Cho-Jui and Lin, Chih-Jen},
  title = {A Sequential Dual Method for Large Scale Multi-Class Linear SVMs},
  booktitle = {KDD},
  year = {2008}
}

Details

Coordinate Descent Method for Large-scale L2-loss Linear SVM

Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin, in JMLR, 2008.

Full Text Code Abstract BibTeX Details

Linear support vector machines (SVM) are useful for classifying large-scale sparse data. Problems with sparse features are common in applications such as document classification and natural language processing. In this paper, we propose a novel coordinate descent algorithm for training linear SVM with the L2-loss function. At each step, the proposed method minimizes a one-variable sub-problem while fixing other variables. The sub-problem is solved by Newton steps with the line search technique. The procedure globally converges at the linear rate. As each sub-problem involves only values of a corresponding feature, the proposed approach is suitable when accessing a feature is more convenient than accessing an instance. Experiments show that our method is more efficient and stable than state-of-the-art methods such as Pegasos and TRON.

@article{ChangHsLi08,
  author = {Chang, Kai-Wei and Hsieh, Cho-Jui and Lin, Chih-Jen},
  title = {Coordinate Descent Method for Large-scale L2-loss Linear SVM},
  journal = {JMLR},
  year = {2008}
}

LIBLINEAR: A Library for Large Linear Classification

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, X.-R. Wang, and Chih-Jen Lin, in JMLR, 2008.

The linear SVM implementation in scikit-learn

LIBLINEAR is an open source library for large-scale linear classification. It supports logistic regression and linear support vector machines. We provide easy-to-use command-line tools and library calls for users and developers. Comprehensive documents are available for both beginners and advanced users. Experiments demonstrate that LIBLINEAR is very efficient on large sparse data sets.

@article{FCHWL08,
  author = {Fan, Rong-En and Chang, Kai-Wei and Hsieh, Cho-Jui and Wang, X.-R. and Lin, Chih-Jen},
  title = {LIBLINEAR: A Library for Large Linear Classification},
  journal = {JMLR},
  year = {2008}
}

A Dual Coordinate Descent Method for Large-Scale Linear SVM

Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan, in ICML, 2008.

Top-10 cited paper at ICML 2008

In many applications, data appear with a huge number of instances as well as features. Linear Support Vector Machines (SVM) are one of the most popular tools to deal with such large-scale sparse data. This paper presents a novel dual coordinate descent method for linear SVM with L1- and L2-loss functions. The proposed method is simple and reaches an ε-accurate solution in O(log(1/ε)) iterations. Experiments indicate that our method is much faster than state-of-the-art solvers such as Pegasos, TRON, SVMperf, and a recent primal coordinate descent implementation.

@inproceedings{HCLKS08,
  author = {Hsieh, Cho-Jui and Chang, Kai-Wei and Lin, Chih-Jen and Keerthi, S. Sathiya and Sundararajan, S.},
  title = {A Dual Coordinate Descent Method for Large-Scale Linear SVM},
  booktitle = {ICML},
  year = {2008}
}
