Learning Word Embeddings for Low-resource Languages by PU Learning

Chao Jiang, Hsiang-Fu Yu, Cho-Jui Hsieh, and Kai-Wei Chang, in NAACL, 2018.

Abstract

Word embedding has been used as a key component in many downstream applications in processing natural languages. Existing approaches often assume the existence of a large collection of text for learning effective word embedding. However, such a corpus may not be available for some low-resource languages. In this paper, we study how to effectively learn a word embedding model on a corpus with only a few million tokens. In such a situation, the co-occurrence matrix is very sparse because many word pairs are not observed to co-occur. In contrast to existing approaches, we argue that the zero entries in the co-occurrence matrix also provide valuable information and design a Positive-Unlabeled Learning (PU-Learning) approach to factorize the co-occurrence matrix. The experimental results demonstrate that the proposed approach requires a smaller amount of training text to obtain a reasonable word embedding model.
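
The idea of treating zero entries as weak evidence rather than ignoring them can be illustrated with a small weighted matrix factorization. The sketch below is an assumption-laden toy, not the authors' released code or their exact objective: the function names `pu_factorize` and `weighted_loss`, the weight `rho` on unobserved entries, the regularizer `lam`, and the plain gradient-descent solver are all hypothetical choices made for illustration.

```python
import numpy as np


def weighted_loss(A, W, H, rho=0.1, lam=0.01):
    """Weighted squared error plus L2 regularization.

    Non-zero (observed) entries of A get weight 1; zero (unobserved)
    entries get the smaller weight rho instead of being dropped.
    """
    weights = np.where(A > 0, 1.0, rho)
    residual = W @ H.T - A
    fit = 0.5 * np.sum(weights * residual ** 2)
    reg = 0.5 * lam * (np.sum(W ** 2) + np.sum(H ** 2))
    return fit + reg


def pu_factorize(A, rank=2, rho=0.1, lam=0.01, lr=0.02, epochs=2000, seed=0):
    """Approximate A by W @ H.T using full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = A.shape
    W = 0.1 * rng.standard_normal((n_rows, rank))
    H = 0.1 * rng.standard_normal((n_cols, rank))
    weights = np.where(A > 0, 1.0, rho)
    for _ in range(epochs):
        weighted_residual = weights * (W @ H.T - A)
        grad_W = weighted_residual @ H + lam * W
        grad_H = weighted_residual.T @ W + lam * H
        W -= lr * grad_W
        H -= lr * grad_H
    return W, H


# Toy 4x4 "co-occurrence" matrix (made-up values); zeros mark unobserved pairs.
A = np.array([
    [0.0, 2.0, 1.0, 0.0],
    [2.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 2.0, 0.0],
])
W, H = pu_factorize(A)
print("final loss:", weighted_loss(A, W, H))
```

Setting `rho` to 0 in this sketch would ignore the zero entries entirely, while a small positive `rho` gently pulls the predicted scores of unobserved pairs toward zero.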

Bib Entry

@inproceedings{jiang2018learning,
  author = {Jiang, Chao and Yu, Hsiang-Fu and Hsieh, Cho-Jui and Chang, Kai-Wei},
  title = {Learning Word Embeddings for Low-resource Languages by PU Learning},
  booktitle = {NAACL},
  vimeo_id = {277670013},
  year = {2018}
}

Related Publications

Few-Shot Representation Learning for Out-Of-Vocabulary Words

Ziniu Hu, Ting Chen, Kai-Wei Chang, and Yizhou Sun, in ACL, 2019.

Existing approaches for learning word embeddings often assume there are sufficient occurrences for each word in the corpus, such that the representation of words can be accurately estimated from their contexts. However, in real-world scenarios, out-of-vocabulary (a.k.a. OOV) words that do not appear in the training corpus emerge frequently. It is challenging to learn accurate representations of these words with only a few observations. In this paper, we formulate the learning of OOV embeddings as a few-shot regression problem, and address it by training a representation function to predict the oracle embedding vector (defined as the embedding trained with abundant observations) based on limited observations. Specifically, we propose a novel hierarchical attention-based architecture to serve as the neural regression function, with which the context information of a word is encoded and aggregated from K observations. Furthermore, our approach can leverage Model-Agnostic Meta-Learning (MAML) for adapting the learned model to the new corpus quickly and robustly. Experiments show that the proposed approach significantly outperforms existing methods in constructing accurate embeddings for OOV words, and improves downstream tasks where these embeddings are utilized.

@inproceedings{hu2019fewshot,
  author = {Hu, Ziniu and Chen, Ting and Chang, Kai-Wei and Sun, Yizhou},
  title = {Few-Shot Representation Learning for Out-Of-Vocabulary Words},
  booktitle = {ACL},
  year = {2019}
}

Co-training Embeddings of Knowledge Graphs and Entity Descriptions for Cross-lingual Entity Alignment

Muhao Chen, Yingtao Tian, Kai-Wei Chang, Steven Skiena, and Carlo Zaniolo, in IJCAI, 2018.

Multilingual knowledge graph (KG) embeddings provide latent semantic representations of entities and structured knowledge enabled with cross-lingual inferences that benefit various knowledge-driven cross-lingual NLP tasks. However, precisely learning such cross-lingual inferences is usually hindered by the low coverage of entity alignment in many KGs. Since many multilingual KGs also provide literal descriptions of entities, in this paper, we introduce an embedding-based approach which leverages a weakly aligned multilingual KG for semi-supervised cross-lingual learning using entity descriptions. Our approach performs co-training of two embedding models, i.e. a multilingual KG embedding model and a multilingual literal description embedding model. The models are trained on a large Wikipedia-based trilingual dataset where most entity alignment is unknown to training. Experimental results show that the performance of the proposed approach on the entity alignment task improves at each iteration of co-training, and eventually reaches a stage at which it significantly surpasses previous approaches. We also show that our approach has promising abilities for zero-shot entity alignment, and cross-lingual KG completion.

@inproceedings{chen2018multilingual,
  author = {Chen, Muhao and Tian, Yingtao and Chang, Kai-Wei and Skiena, Steven and Zaniolo, Carlo},
  title = {Co-training Embeddings of Knowledge Graphs and Entity Descriptions for Cross-lingual Entity Alignment},
  booktitle = {IJCAI},
  year = {2018}
}

Beyond Bilingual: Multi-sense Word Embeddings using Multilingual Context

Shyam Upadhyay, Kai-Wei Chang, Matt Taddy, Adam Kalai, and James Zou, in ACL RepL4NLP Workshop, 2017.
Best Paper Award

Word embeddings, which represent a word as a point in a vector space, have become ubiquitous in several NLP tasks. A recent line of work uses bilingual (two languages) corpora to learn a different vector for each sense of a word, by exploiting crosslingual signals to aid sense identification. We present a multi-view Bayesian non-parametric algorithm which improves multi-sense word embeddings by (a) using multilingual (i.e., more than two languages) corpora to significantly improve sense embeddings beyond what one achieves with bilingual information, and (b) using a principled approach to learn a variable number of senses per word, in a data-driven manner. Ours is the first approach with the ability to leverage multilingual corpora efficiently for multi-sense representation learning. Experiments show that multilingual training significantly improves performance over monolingual and bilingual training, by allowing us to combine different parallel corpora to leverage multilingual context. Multilingual training yields comparable performance to a state-of-the-art monolingual model trained on five times more training data.

@inproceedings{upadhyay2017beyond,
  author = {Upadhyay, Shyam and Chang, Kai-Wei and Taddy, Matt and Kalai, Adam and Zou, James},
  title = {Beyond Bilingual: Multi-sense Word Embeddings using Multilingual Context},
  booktitle = {ACL RepL4NLP Workshop},
  year = {2017}
}
