CDFRS: A scalable sampling approach for efficient big data analysis

Published: 01 Jan 2024, Last Modified: 06 Feb 2025. Inf. Process. Manag. 2024. CC BY-SA 4.0
Abstract: Highlights
• Propose the CDFRS method for efficiently sampling terabyte-scale datasets.
• Propose the A2 algorithm, which efficiently determines the required sample size.
• Theoretical guarantees confirm the quality of samples generated by CDFRS.
• CDFRS can complete sampling on a 10 TB dataset in just hundreds of seconds.
• Models trained with samples closely match those trained with the entire dataset.