Textbook Consistency Weighted Internet Improves Efficiency Twofold

Hao Liu; Volodymyr Mnih; Pieter Abbeel; Aleksandra Faust

27 Sept 2024 (modified: 05 Feb 2025) | Submitted to ICLR 2025 | CC BY 4.0

Keywords: training efficiency, large language model, adaptive data weighting

Abstract: We propose Textbook Consistency, a method that improves the training efficiency of large language models by using textbooks as a guiding signal for learning from internet-scale data. Rather than hard-filtering data against quality thresholds before training, our approach adaptively adjusts the weight of data during training based on its consistency with textbooks. We compute the cosine similarity between internet data and textbooks in a latent space and use this metric to modulate the cross-entropy loss. Our method improves training efficiency twofold, measured as a reduction in the training time or the number of tokens required. Empirical results show superior performance for language models trained on large datasets such as FineWeb and The Pile, and the method extends to other domains such as robotics. Our method is simple to implement, incurs no additional overhead, and is compatible with existing data curation techniques.
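The weighting idea in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not say how the cosine similarity modulates the loss, so the mapping `(1 + sim) / 2` from similarity to weight, the normalization by the sum of weights, and the names `cosine_similarity`, `cross_entropy`, and `weighted_loss` are all assumptions made for illustration. The embeddings are taken as given, since the latent space is not specified here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cross_entropy(probs, target):
    """Cross-entropy of one prediction: negative log-probability of the target."""
    return -math.log(probs[target])


def weighted_loss(examples, textbook_embedding):
    """Consistency-weighted mean cross-entropy.

    examples: list of (embedding, predicted_probs, target_index) tuples.
    Assumption: weight = (1 + cosine) / 2, which maps [-1, 1] to [0, 1].
    Assumption: the loss is normalized by the sum of the weights.
    """
    total = 0.0
    weight_sum = 0.0
    for embedding, probs, target in examples:
        sim = cosine_similarity(embedding, textbook_embedding)
        weight = (1.0 + sim) / 2.0
        total += weight * cross_entropy(probs, target)
        weight_sum += weight
    return total / weight_sum if weight_sum > 0.0 else 0.0


# Hypothetical batch: one example aligned with the textbook embedding,
# one pointing in the opposite direction.
batch = [
    ([1.0, 0.0], [0.5, 0.5], 0),
    ([-1.0, 0.0], [0.1, 0.9], 0),
]
loss = weighted_loss(batch, textbook_embedding=[1.0, 0.0])
```

Under these assumptions, the example opposite to the textbook embedding receives weight 0, so only the aligned example contributes to `loss`. Unlike hard filtering, examples with intermediate similarity would still contribute, just with a smaller weight.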

Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 11100
