Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation

Da Yin, Xiao Liu, Fan Yin, Ming Zhong, Hritik Bansal, Jiawei Han, and Kai-Wei Chang, in EMNLP, 2023.

Code

Download the full text


Abstract

Instruction tuning has emerged to enhance the capabilities of large language models (LLMs) in providing appropriate outputs based on input instructions. However, existing methods for collecting instruction-tuning data suffer from limitations in scalability and affordability. In this paper, we propose Dynosaur, a dynamic growth paradigm for instruction-tuning data curation. Built upon the metadata of existing NLP datasets, we generate multiple task instructions applicable to various NLP datasets and determine the relevant data fields for constructing instruction-tuning data with LLMs. Dynosaur offers several advantages: 1) lower generation costs (less than $12 for generating 800K instruction-tuning examples), 2) good quality of instruction-tuning data (better performance than Alpaca and Instruction GPT-4 on Super-NI with comparable data sizes), and 3) the ability to grow dynamically by incorporating new datasets from the Huggingface Datasets Platform. We further investigate continual learning as an approach to learning with the ever-growing instruction-tuning dataset. We demonstrate that replay methods not only help mitigate forgetting issues but also help generalize to unseen tasks better. As a novel continual learning scenario for instruction tuning, selecting tasks based on instruction representations can be an effective replaying strategy.
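To make the metadata-driven idea concrete, here is a minimal sketch (not the authors' released pipeline) of how one might turn a Hugging Face dataset's metadata into a prompt that asks an LLM to propose task instructions and an input/output field mapping. The `call_llm` function is a hypothetical placeholder for whatever LLM client you use; the prompt wording and JSON schema are illustrative assumptions.

```python
# Minimal sketch, assuming the `datasets` library and some LLM client.
import json
from datasets import load_dataset_builder


def build_metadata_prompt(dataset_name: str) -> str:
    """Summarize a Hugging Face dataset's metadata for instruction generation."""
    info = load_dataset_builder(dataset_name).info
    fields = {name: str(feat) for name, feat in info.features.items()}
    return (
        "Given the following NLP dataset metadata, propose several task "
        "instructions and, for each, which fields serve as input and output.\n"
        f"Dataset: {dataset_name}\n"
        f"Description: {info.description[:500]}\n"
        f"Fields: {json.dumps(fields, indent=2)}\n"
        'Return JSON: [{"instruction": ..., "input_fields": [...], '
        '"output_field": ...}]'
    )


def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for an LLM call (e.g., an API or local model)."""
    raise NotImplementedError("plug in your preferred LLM client here")


if __name__ == "__main__":
    prompt = build_metadata_prompt("squad")  # any Hugging Face dataset id
    print(prompt)  # inspect the metadata-based prompt
    # tasks = json.loads(call_llm(prompt))  # parse the proposed instruction tasks
```

Each proposed instruction, paired with the dataset fields it names, can then be expanded over the dataset's rows into instruction-tuning examples, which is what keeps the generation cost low relative to writing or generating each example from scratch.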


Bib Entry

@inproceedings{yin2023dynosaur,
  author = {Yin, Da and Liu, Xiao and Yin, Fan and Zhong, Ming and Bansal, Hritik and Han, Jiawei and Chang, Kai-Wei},
  title = {Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation},
  booktitle = {EMNLP},
  year = {2023}
}
