Can We Replace Reads by Numeric Signatures? Lyndon Fingerprints as Representations of Sequencing Reads for Machine Learning

Published: 01 Jan 2021, Last Modified: 07 Aug 2024 · AlCoB 2021 · CC BY-SA 4.0
Abstract: Representations of biological sequences that facilitate sequence comparison are crucial in several bioinformatics tasks. Recently, the Lyndon factorization has been proven to preserve common factors in overlapping reads [6], which suggests using factorizations of sequences to define measures of similarity between reads. In this paper we propose the fingerprint as a signature of sequencing reads, i.e., the sequence of lengths of consecutive factors in a Lyndon-based factorization of the read. Surprisingly, fingerprints of reads are effective in preserving sequence similarities while providing a compact representation of the read; consequently, k-mers extracted from a fingerprint, called k-fingers, can be used to capture sequence similarity between reads.
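
To make the definitions of fingerprint and k-finger concrete, the following is a minimal sketch and not the authors' implementation. It assumes the standard Lyndon factorization computed with Duval's algorithm, whereas the abstract speaks more generally of "Lyndon-based factorizations", which may include other variants. It also assumes plain character order on the alphabet (A < C < G < T). The function names `lyndon_factorization`, `fingerprint`, and `k_fingers` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def lyndon_factorization(s: str) -> List[str]:
    """Standard Lyndon factorization via Duval's algorithm (assumed variant).

    Returns the unique sequence of Lyndon words w1 >= w2 >= ... >= wm
    whose concatenation equals s.
    """
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        # Extend the current run while it stays a prefix of a power of a Lyndon word.
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i        # strictly larger symbol: restart the comparison
            else:
                k += 1       # equal symbol: keep matching the period
            j += 1
        # Emit the repeated Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def fingerprint(read: str) -> List[int]:
    """Fingerprint: lengths of the consecutive factors of the read."""
    return [len(f) for f in lyndon_factorization(read)]


def k_fingers(fp: List[int], k: int) -> List[Tuple[int, ...]]:
    """k-fingers: all windows of k consecutive lengths in a fingerprint."""
    return [tuple(fp[i:i + k]) for i in range(len(fp) - k + 1)]


if __name__ == "__main__":
    read = "GATTACA"  # toy read, for illustration only
    fp = fingerprint(read)
    print(lyndon_factorization(read), fp, k_fingers(fp, 2))
```

Under these assumptions, two reads can be compared through their shared k-fingers, i.e. short tuples of integers, rather than through k-mers of nucleotides.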