Conscious Data Contribution via Community-Driven Chain-of-Thought Distillation

Published: 23 Sept 2025, Last Modified: 18 Nov 2025, Venue: ACA-NeurIPS2025 Oral, License: CC BY 4.0
Keywords: Conscious Data Contribution, Multi-Community Distillation, Data Portability, Chain-of-Thought Reasoning
TL;DR: We study how multi-community distillation under Conscious Data Contribution (CDC) benefits from reasoning traces and dataset diversity, highlighting the role of individual incentives and task compatibility in shaping collective gains.
Abstract: The current era of AI development places a heavy emphasis on training large models on increasingly scaled-up datasets. This paradigm has catalyzed entirely new product categories, such as LLM chatbots, while also raising concerns about data privacy and consumer choice. In this paper, we consider questions of data portability and user autonomy in the context of LLMs that "reason" using chain-of-thought (CoT) traces, computing intermediate text artifacts from user input before producing a final output. We first interpret recent data privacy and portability law to argue that these intermediate computations qualify as users' personal data. Then, building on the existing framework of Conscious Data Contribution, we show how communities that receive low utility from an available model can aggregate and distill their shared knowledge into an alternate model better aligned with their goals. We verify this approach empirically and investigate the effects of community diversity, reasoning granularity, and community size on distillation performance.
Submission Number: 30