
## Oasis: Data Curation and Assessment System for Pretraining of Large Language Models
Tong Zhou, Yubo Chen, Pengfei Cao, Kang Liu, Shengping Liu, Jun Zhao
Keywords: Natural Language Processing: NLP: Tools, Natural Language Processing: NLP: Applications, Natural Language Processing: NLP: Language models
IJCAI/2024/Proceedings/1048 - Oasis: Data Curation and Assessment System for Pretraining of Large Language Models.pdf

### Implementation
_Given the documentation provided by the authors on the method, how much time would it take to re-implement the method from scratch?_

[5]

The authors link to the source code of their implementation (https://github.com/tongzhou21/Oasis). However, the README is empty and the code is largely undocumented. The sizeable directory and file structure makes it hard to follow the flow of the process, which raises the cost of re-implementation.

### Data
_Given the data description in the documentation, how much effort would it take to either: find the same dataset the authors used, find similar datasets and defend their comparability, or acquire one from scratch?_

[1]

(3/3)

The authors provide their own dataset, obtain two others, and perform a comparative analysis on them. Citations are given. Their own dataset is publicly available and linked.

### Configuration 
_Given the (hyper)parameters, including semantic parameters, of the method: How much effort would it take to acquire the algorithm configurations used for their results, and compare against their budgetary constraints?_

[1]

No algorithm configuration is used.

### Experimental Procedure
_Given the experimental set-up of the work, how difficult is it to set up a new experiment, similar to those presented in the original work, with the same procedure?_

[1]

No empirical evaluation is done.

### Expertise
_How much effort would it take to acquire the expertise required to reproduce the work independently relying on the available documentation?_

[2]

Requires some basic experience with data, biases, and large language models / NLP.
