Keywords: Language Model, Data Valuation, Similarity, Diversity
TL;DR: We introduce DetEmbedMetrics, a method for assessing textual data quality using language model embeddings aligned with deterministic similarity and diversity functions, providing interpretable and robust data valuations.
Abstract: In the rapidly evolving field of machine learning (ML), the quality of training data significantly impacts model performance, especially with the rise of foundation models capable of generating data. Data quality can be characterized by two statistical metrics, similarity and diversity, measured relative to a baseline dataset. We introduce DetEmbedMetrics, a novel deterministic embedding-based metric that enables textual data quality assessment by integrating a language model (LM) with deterministic similarity and diversity measurement functions. The core methodology constrains LM-generated embeddings to align with deterministic mathematical measurement functions, endowing the embeddings with desirable statistical properties. This yields consistent and reliable similarity and diversity measurements, in contrast to methods that employ neural networks directly as data quality scorers. Specifically, our approach fine-tunes an LM on textual data samples with varying levels of similarity and diversity. The model learns to generate embeddings that, when passed to deterministic similarity and diversity functions, capture the relationship between data sample pairs. The model then outputs probabilities for discrete levels of similarity and diversity, enabling clearer interpretation and decision-making than continuous scores. Extensive experiments on synthetic datasets demonstrate the effectiveness of DetEmbedMetrics in identifying similarity and diversity across a variety of datasets. Notably, DetEmbedMetrics generalizes well: it performs robustly across different deterministic similarity and diversity functions rather than relying on a single measurement technique, which broadens its applicability as a framework for diverse measurement functions. By providing high-quality embeddings that support the valuation of similarity and diversity between datasets, this work contributes to the growing field of data-centric ML and underscores the importance of data quality in the ML pipeline.
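To make the pipeline described in the abstract concrete, below is a minimal sketch (not the authors' released code) of the inference side: embeddings from an LM are scored against a baseline dataset with a deterministic similarity function (cosine, as one example), and the continuous score is mapped to probabilities over discrete similarity levels. The stand-in `embed` function, the level anchors, and the temperature are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def embed(texts):
    """Stand-in for the fine-tuned LM encoder; returns one vector per text.
    In practice this would call the fine-tuned embedding model."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 768))

def cosine_similarity(a, b):
    """Deterministic pairwise similarity between two sets of embeddings."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def level_probabilities(score, anchors=(0.0, 0.5, 1.0), temperature=0.1):
    """Map a continuous similarity score to probabilities over discrete
    levels (e.g., low/medium/high) via a softmax over distances to
    assumed per-level anchor scores."""
    logits = -np.abs(score - np.asarray(anchors)) / temperature
    logits -= logits.max()  # numerical stability before exponentiation
    probs = np.exp(logits)
    return probs / probs.sum()

# Score a candidate dataset against a baseline and report level probabilities.
baseline = embed(["baseline sample 1", "baseline sample 2"])
candidate = embed(["candidate sample 1", "candidate sample 2"])
mean_sim = cosine_similarity(candidate, baseline).mean()
print(level_probabilities(mean_sim))  # e.g., P(low), P(medium), P(high)
```

Because the similarity function is a fixed mathematical operation rather than a learned scorer, it could be swapped for another deterministic measure (and a set-level diversity function applied analogously) without retraining the mapping to levels, which is the generalizability property the abstract claims.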
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3963