Abstract: A key issue in managing large amounts of data is the availability of efficient, accurate, and selective techniques to detect homology (similarity) between newly recovered and previously acquired sequences. The algorithm presented is based on a probabilistic indexing framework that requires minimal access to the database for each match. A highly redundant set of descriptive tuples is generated from the sequences of interest and used as indices in a table look-up paradigm. Theoretical and experimental results on the sensitivity and accuracy of the approach are provided. These include the probability of correct and random matches and the storage and computational requirements. An experimental system is implemented for a database containing the complete genome of the bacterium E. coli (approximately 2 million nucleotides). Search time is a few seconds on a workstation-class machine. The algorithm is shown to scale well to databases containing billions of nucleotides, with performance orders of magnitude better than the fastest current techniques.
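
The table look-up idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the descriptive tuples are formed, so the sketch assumes plain contiguous substrings of length k, and the names build_index and search, the toy sequences, and the voting on alignment offsets are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def build_index(sequences, k=4):
    """Map each length-k tuple to the (sequence id, offset) pairs where it occurs.

    Assumption: tuples are contiguous substrings; the paper's tuples may differ.
    """
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def search(index, query, k=4):
    """Look up every query tuple and vote for (sequence id, alignment offset).

    The alignment offset is database position minus query position, so tuples
    from one consistent alignment all vote for the same key.
    """
    votes = Counter()
    for qpos in range(len(query) - k + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + k], ()):
            votes[(seq_id, dpos - qpos)] += 1
    return votes.most_common()


# Toy database (hypothetical sequences, not real genome data).
database = {
    "seqA": "ACGTACGTTAGCCGATAGGCT",
    "seqB": "TTTTGGGGCCCCAAAATTTT",
}
index = build_index(database, k=4)
hits = search(index, "TAGCCGATAG", k=4)
print(hits)
```

Each query tuple costs one table look-up, so the database sequences themselves are never scanned at search time; candidates whose vote count stands out from the random background would then be passed to a more careful comparison.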