Beyond Bilingual: Multi-sense Word Embeddings using Multilingual Context
Shyam Upadhyay, Kai-Wei Chang, Matt Taddy, Adam Kalai, and James Zou, in ACL RepL4NLP Workshop, 2017.
Best Paper Award
Abstract
Word embeddings, which represent a word as a point in a vector space, have become ubiquitous in NLP tasks. A recent line of work uses bilingual (two-language) corpora to learn a different vector for each sense of a word, exploiting crosslingual signals to aid sense identification. We present a multi-view Bayesian non-parametric algorithm that improves multi-sense word embeddings by (a) using multilingual (i.e., more than two languages) corpora to significantly improve sense embeddings beyond what one achieves with bilingual information, and (b) using a principled approach to learn a variable number of senses per word in a data-driven manner. Ours is the first approach with the ability to leverage multilingual corpora efficiently for multi-sense representation learning. Experiments show that multilingual training significantly improves performance over monolingual and bilingual training by allowing us to combine different parallel corpora to leverage multilingual context. Multilingual training yields performance comparable to that of a state-of-the-art monolingual model trained on five times more training data.
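To make the idea of a data-driven number of senses concrete, here is a minimal Python sketch. It is not the paper's Bayesian non-parametric model: it swaps in a much simpler greedy, threshold-based clustering, in which each occurrence of a word either joins the closest existing sense or opens a new one. The function names (`assign_senses`, `mean_vector`, `cosine`), the `new_sense_threshold` parameter, the toy vectors, and the assumption that context vectors from all languages live in one shared space are all hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def assign_senses(
    occurrences: Sequence[Dict[str, List[Vector]]],
    new_sense_threshold: float = 0.5,  # hypothetical knob, not from the paper
) -> List[int]:
    """Greedily label each occurrence of one word with a sense index.

    Each occurrence maps a language code to the context vectors seen in
    that language (assumed to share one vector space). An occurrence joins
    the most similar sense centroid, or opens a new sense when no centroid
    reaches the similarity threshold, so the sense count is not fixed.
    """
    centroids: List[Vector] = []
    counts: List[int] = []
    labels: List[int] = []
    for views in occurrences:
        # Pool the context vectors from every language view into one summary.
        pooled = mean_vector([vec for ctx in views.values() for vec in ctx])
        sims = [cosine(pooled, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is None or sims[best] < new_sense_threshold:
            centroids.append(list(pooled))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            counts[best] += 1
            # Running-mean update of the chosen sense centroid.
            centroids[best] = [
                c + (p - c) / counts[best] for c, p in zip(centroids[best], pooled)
            ]
            labels.append(best)
    return labels


# Toy occurrences of an ambiguous word with English and French context.
toy_occurrences = [
    {"en": [[1.0, 0.0]], "fr": [[0.9, 0.1]]},  # finance-like context
    {"en": [[0.0, 1.0]], "fr": [[0.1, 0.9]]},  # river-like context
    {"en": [[0.8, 0.2]], "fr": [[1.0, 0.0]]},  # finance-like context
]
labels = assign_senses(toy_occurrences)
print(labels)
```

On this toy input the first and third occurrences share a sense and the second opens a new one, so two senses are discovered without fixing that number in advance. Pooling the views by averaging is only the simplest way to combine context from several languages; the paper's multi-view model is more principled than this sketch.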
Bib Entry
@inproceedings{upadhyay2017beyond,
author = {Upadhyay, Shyam and Chang, Kai-Wei and Taddy, Matt and Kalai, Adam and Zou, James},
title = {Beyond Bilingual: Multi-sense Word Embeddings using Multilingual Context},
booktitle = {ACL RepL4NLP Workshop},
year = {2017}
}
Related Publications
- Control Large Language Models via Divide and Conquer, EMNLP, 2024
- Re-ReST: Reflection-Reinforced Self-Training for Language Agents, EMNLP, 2024
- Agent Lumos: Unified and Modular Training for Open-Source Language Agents, ACL, 2024
- Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension, ICML, 2024
- TrustLLM: Trustworthiness in Large Language Models, ICML, 2024
- The steerability of large language models toward data-driven personas, NAACL, 2024
- AI-Assisted Summarization of Radiologic Reports: Evaluating GPT3davinci, BARTcnn, LongT5booksum, LEDbooksum, LEDlegal, and LEDclinical, American Journal of Neuroradiology, 2024
- Understanding and Mitigating Spurious Correlations in Text Classification with Neighborhood Analysis, EACL-Findings, 2024
- Few-Shot Representation Learning for Out-Of-Vocabulary Words, ACL, 2019
- Learning Word Embeddings for Low-resource Languages by PU Learning, NAACL, 2018
- Co-training Embeddings of Knowledge Graphs and Entity Descriptions for Cross-lingual Entity Alignment, IJCAI, 2018