Learning Word Embeddings for Low-resource Languages by PU Learning

Chao Jiang, Hsiang-Fu Yu, Cho-Jui Hsieh, and Kai-Wei Chang, in NAACL, 2018.

Slides Code

Download the full text

Abstract

Word embedding has been used as a key component in many downstream applications in processing natural languages. Existing approaches often assume the existence of a large collection of text for learning effective word embedding. However, such a corpus may not be available for some low-resource languages. In this paper, we study how to effectively learn a word embedding model on a corpus with only a few million tokens. In such a situation, the co-occurrence matrix is very sparse because many word pairs are not observed to co-occur. In contrast to existing approaches, we argue that the zero entries in the co-occurrence matrix also provide valuable information and design a Positive-Unlabeled Learning (PU-Learning) approach to factorize the co-occurrence matrix. The experimental results demonstrate that the proposed approach requires a smaller amount of training text to obtain a reasonable word embedding model.

Bib Entry

@inproceedings{jiang2018learning,
  author = {Jiang, Chao and Yu, Hsiang-Fu and Hsieh, Cho-Jui and Chang, Kai-Wei},
  title = {Learning Word Embeddings for Low-resource Languages by PU Learning},
  booktitle = {NAACL},
  vimeo_id = {277670013},
  year = {2018}
}

Related Publications

Control Large Language Models via Divide and Conquer, EMNLP, 2024
Re-ReST: Reflection-Reinforced Self-Training for Language Agents, EMNLP, 2024
Agent Lumos: Unified and Modular Training for Open-Source Language Agents, ACL, 2024
Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension, ICML, 2024
TrustLLM: Trustworthiness in Large Language Models, ICML, 2024
The steerability of large language models toward data-driven personas, NAACL, 2024
AI-Assisted Summarization of Radiologic Reports: Evaluating GPT3davinci, BARTcnn, LongT5booksum, LEDbooksum, LEDlegal, and LEDclinical, American Journal of Neuroradiology, 2024
Understanding and Mitigating Spurious Correlations in Text Classification with Neighborhood Analysis, EACL-Findings, 2024
Few-Shot Representation Learning for Out-Of-Vocabulary Words, ACL, 2019
Co-training Embeddings of Knowledge Graphs and Entity Descriptions for Cross-lingual Entity Alignment, IJCAI, 2018
Beyond Bilingual: Multi-sense Word Embeddings using Multilingual Context, ACL RepL4NLP Workshop, 2017