E-Com100B: A 100B-Token Customer-Service Corpus

18 Sept 2025 (modified: 12 Feb 2026), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0
Keywords: dataset
TL;DR: We create E-Com100B, a 100-billion-token customer-service corpus, by filtering Common Crawl with the FineWeb-Edu recipe.
Abstract: We introduce E-Com100B, a 100-billion-token, English-centric corpus distilled from Common Crawl for pre-training customer-service-oriented language models. Following the FineWeb-Edu recipe, we prompt a lightweight scorer (Qwen3-1.7B) to rate every raw document on a 0–5 Likert scale for its pedagogical value to an aspiring e-commerce support agent, and keep only documents rated 4 or 5. This single filtering step yields a corpus that is 3.2× cleaner and 2.1× more task-relevant than the next-largest open alternative, while completely suppressing PII. Continued pre-training of Qwen-14B on E-Com100B reduces perplexity on 5M held-out chat logs from 9.87 to 5.42 (−45%, p < 0.001). A LangGraph agent built on Qwen3-30B-A3B and fine-tuned on 5k labelled dialogs improves the success rate on ECom-Bench from 54.3% to 71.8% (+32% relative), outperforming GPT-4-0613 (65.9%) at one-seventh the inference cost. Ablations confirm that these gains disappear when the educational-value filter is replaced by random sampling.
Primary Area: datasets and benchmarks
Submission Number: 10964
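
The filtering step described in the abstract (score each document from 0 to 5, keep those rated 4 or 5) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract does not give the scoring prompt, so `toy_scorer` is a hypothetical keyword-based stand-in for the Qwen3-1.7B scorer, and the names `filter_by_score`, `MIN_KEEP_SCORE`, and `relative_change` are assumptions introduced here. The `relative_change` helper only re-derives the relative figures quoted in the abstract from the reported before/after values.

```python
from typing import Callable, Iterable, Iterator

# Assumption: the keep threshold mirrors the abstract ("keep documents rated 4 or 5").
MIN_KEEP_SCORE = 4


def filter_by_score(
    docs: Iterable[str],
    score_fn: Callable[[str], int],
    min_score: int = MIN_KEEP_SCORE,
) -> Iterator[str]:
    """Yield only the documents whose 0-5 score is at least min_score."""
    for doc in docs:
        score = score_fn(doc)
        if not 0 <= score <= 5:
            raise ValueError(f"score {score} is outside the 0-5 scale")
        if score >= min_score:
            yield doc


def toy_scorer(doc: str) -> int:
    """Hypothetical stand-in for the LLM scorer: counts support-related keywords."""
    keywords = ("refund", "order", "shipping", "return", "customer")
    hits = sum(word in doc.lower() for word in keywords)
    return min(hits, 5)


def relative_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


if __name__ == "__main__":
    docs = [
        "How to process a refund for a customer order with delayed shipping",
        "Cat pictures",
    ]
    print(list(filter_by_score(docs, toy_scorer)))
    print(f"{relative_change(54.3, 71.8):+.1f}%")  # success rate on ECom-Bench
    print(f"{relative_change(9.87, 5.42):+.1f}%")  # perplexity on held-out chat logs
```

In a real pipeline `score_fn` would wrap a call to the prompted scorer model. Passing it in as a callable keeps the thresholding logic independent of how scores are produced, which also makes it easy to swap in random sampling for an ablation.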