Abstract: K-mers are fundamental in bioinformatics, notably for handling errors in sequencing data. Counting them is memory-intensive because consecutive k-mers overlap and are therefore highly redundant. Existing methods reduce this redundancy by grouping k-mers into super-k-mers, yet inefficiencies persist. We introduce hyper-k-mers, a more compact representation that lowers the duplication bound from 6 to 4 bits per k-mer. We provide a theoretical analysis of their space efficiency and introduce KFC, a k-mer counting algorithm that leverages hyper-k-mers. KFC significantly reduces memory usage: its memory footprint scales sub-linearly with k-mer size, and it outperforms state-of-the-art tools, particularly for large k.

Availability: KFC is available at https://github.com/lrobidou/KFC, with supplementary scripts at https://github.com/imartayan/KFC_experiments and a preprint at https://doi.org/10.1101/2024.11.06.620789.
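To make the redundancy argument concrete, the following minimal Python sketch illustrates super-k-mers, the existing representation the abstract refers to. It is not KFC's implementation and does not show hyper-k-mers. The function names, the lexicographic minimizer ordering, and the choice to ignore reverse complements are all assumptions made for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Return every k-mer of seq, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizer(kmer, m):
    """Smallest m-mer of a k-mer (assumption: plain lexicographic order)."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def super_kmers(seq, k, m):
    """Split seq into maximal runs of consecutive k-mers sharing a minimizer.

    Simplification: minimizers are compared by value, not by position,
    and reverse complements are ignored.
    """
    out = []
    start = 0          # index of the first k-mer in the current run
    prev = None        # minimizer of the previous k-mer
    n = len(seq) - k + 1
    for i in range(n):
        cur = minimizer(seq[i:i + k], m)
        if prev is not None and cur != prev:
            # The run covers k-mers start..i-1, i.e. bases start..i-1+k-1.
            out.append(seq[start:i - 1 + k])
            start = i
        prev = cur
    if n > 0:
        out.append(seq[start:])
    return out


def count_kmers(seq, k):
    """Naive k-mer counting: stores each distinct k-mer in full."""
    return Counter(kmers(seq, k))
```

A run of r consecutive k-mers stored as one super-k-mer takes r + k - 1 bases instead of r * k, which is where the saving comes from. The k - 1 bases shared between adjacent super-k-mers are still duplicated, and this duplication is the overhead the abstract says hyper-k-mers reduce.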