Abstract: We present BabyBabelLM, a multilingual collection of datasets modeling the language a person observes from birth until they acquire a native language. We curate developmentally plausible pretraining data in each of 45 languages, aiming to cover the equivalent of 100M English words of content. We compile evaluation suites and train baseline models in each language. BabyBabelLM aims to facilitate multilingual pretraining and cognitive modeling.
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingualism, multilingual evaluation, less-resourced languages, resources for less-resourced languages, corpus creation, multilingual corpora
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Data resources
Languages Studied: Achinese, Afrikaans, Arabic, Balinese, Buginese, Bulgarian, Czech, Welsh, Danish, German, Modern Greek (1453-), English, Estonian, Basque, Persian, French, Hebrew, Croatian, Hungarian, Indonesian, Icelandic, Italian, Javanese, Japanese, Korean, Makasar, Minangkabau, Dutch, Norwegian, Pedi, Polish, Portuguese, Romanian, Russian, Southern Sotho, Spanish, Serbian, Sundanese, Swedish, Turkish, Ukrainian, Xhosa, Yue Chinese, Chinese, Zulu
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: No
A2 Elaboration: Our resource is intended for use at the intersection of language acquisition and low-resource language modeling, and presents no potential risks.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 3.2.1 and Appendix B
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: 3.1.3 and Appendix B
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: 3.1.3
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: 3.3
B5 Documentation Of Artifacts: Yes
B5 Elaboration: 3.2.2 and Appendix B
B6 Statistics For Data: Yes
B6 Elaboration: 3.2.2, Appendix B, Table 2
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 5
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 5
C3 Descriptive Statistics: Yes
C3 Elaboration: 5
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: Yes
D3 Elaboration: 3.1.3
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: yes
Submission Number: 762