Abstract: Highlights
• Propose the CDFRS method for efficiently sampling terabyte-scale datasets.
• Propose the A2 algorithm, which efficiently determines the required sample size.
• Theoretical guarantees confirm the quality of samples generated by CDFRS.
• CDFRS can complete sampling on a 10 TB dataset in just hundreds of seconds.
• Models trained with samples closely match those trained with the entire dataset.