Navigating and Addressing Data Problems for Foundation Models (DPFM)

Published: 08 Mar 2024, Last Modified: 08 Mar 2024 · ICLR 2024 Workshops · CC BY 4.0
Workshop Type: Hybrid
Keywords: Foundation Models, Data Curation, Data Quality, Data-centric AI, AI Alignment, AI Safety, Copyright for Generative AI
Abstract: Foundation Models (FMs, e.g., GPT-3/4, LLaMA, DALL-E, Stable Diffusion) have achieved sweeping success on a wide range of tasks. As researchers strive to keep up with the capabilities and limitations of FMs, as well as their implications amid this rapid evolution, attention is now shifting to the emerging notion of data-centric AI. The curation of training data has been shown to be crucially important for the performance and reliability of FMs, and a wealth of recent work demonstrates that data-perspective research points toward a promising direction for addressing critical issues such as safety, alignment, efficiency, security, privacy, and interpretability. Recent years have seen a surge of individual works exploring many frontiers related to this topic, providing an excellent opportunity to bring together brilliant minds to search for a systematic framework and roadmap for research. This workshop aims to discuss and explore a better understanding of the new paradigm for research on data problems for foundation models. Our technical agenda is composed of four modules with 12 **confirmed** speakers:
- A. Data Quality, Dataset Curation, and Data Generation – Recent Achievements and Current Efforts
- B. A Data Perspective to Efficiency, Interpretability, and Alignment – Latest Advancements and Breakthroughs
- C. A Data Perspective to Safety and Ethics – Risks, Limitations, and Opportunities
- D. Copyright, Legal Issues, and Data Economy – A Broader Landscape

We strive to build a community around this essential topic. Noting that the current data practices of foundation models are largely opaque, one mission of this workshop is to create a community effort on open-source data at the pretraining stage itself. Subsequent efforts include creating datasets, benchmarks, and dedicated venues to promote research on data problems for foundation models and ultimately facilitate the widespread deployment of FMs in a sociotechnically friendly way that provides benefits at large. Examples of our target communities include researchers on data problems (e.g., data-centric AI, dataset/data curation, data markets) and foundation models (alignment, safety/trustworthiness, fairness/ethics), practitioners of downstream applications, tech companies providing innovative solutions, and beyond.
Submission Number: 60