Abstract: Data plays a fundamental role in training Large Language Models (LLMs). Efficient data management, particularly the formulation of a well-suited training dataset, is important for enhancing model performance and improving training efficiency during the pretraining and supervised fine-tuning stages. Despite its considerable importance, the research community still lacks a systematic analysis of the effects of data management strategy selection, of methodologies for evaluating curated datasets, and of the ongoing pursuit of improved strategies. Consequently, the exploration of data management has attracted increasing attention in the research community. This survey provides a comprehensive overview of current research on data management in both the pretraining and supervised fine-tuning stages of LLMs, covering several noteworthy aspects of data management strategy design: data quantity, data quality, domain/task composition, etc. Looking toward the future, we discuss existing challenges and outline promising directions for development in this field. This survey therefore serves as a guiding resource for practitioners aspiring to construct powerful LLMs through efficient data management practices.
Paper Type: long
Research Area: Efficient/Low-Resource Methods for NLP
Contribution Types: Surveys
Languages Studied: English, multilingual
Preprint Status: There is a non-anonymous preprint (URL specified in the next question).
A1: yes
A1 Section Or Justification: There is a "Limitations" section before references.
A2: n/a
A2 Section Or Justification: Our work is a survey paper and does not concern potential risks.
A3: yes
A3 Section Or Justification: The abstract is at the beginning of the paper. The introduction is Section 1 of the paper.
B: no
B1: n/a
B1 Section Or Justification: There are no artifacts used in our work.
B2: n/a
B2 Section Or Justification: There are no artifacts used in our work.
B3: n/a
B3 Section Or Justification: There are no artifacts used in our work.
B4: n/a
B4 Section Or Justification: There is no data used in our work.
B5: n/a
B5 Section Or Justification: There are no artifacts used in our work.
B6: n/a
B6 Section Or Justification: There is no data used in our work.
C: no
C1: n/a
C1 Section Or Justification: There are no computational experiments in our work.
C2: n/a
C2 Section Or Justification: There are no computational experiments in our work.
C3: n/a
C3 Section Or Justification: There are no computational experiments in our work.
C4: n/a
C4 Section Or Justification: There are no computational experiments in our work.
D: no
D1: n/a
D1 Section Or Justification: There is no human labor used in our work.
D2: n/a
D2 Section Or Justification: There is no human labor used in our work.
D3: n/a
D3 Section Or Justification: There is no human labor used in our work.
D4: n/a
D4 Section Or Justification: There is no data collection procedure in our work.
D5: n/a
D5 Section Or Justification: There is no human annotation used in our work.
E: no
E1: n/a
E1 Section Or Justification: There is no AI assistant used in our work.