Abstract: Identification of proteins is a key step of metaproteomics research. This protein identification task should be migrated to a fast data streaming architecture to increase horizontal scalability and performance. A protein database search involves two steps: the pairwise matching of experimental spectra against protein sequences creating peptide-spectrum-matches (PSM) and the statistical validation of PSMs. The peptide-spectrum-matching is inherently parallelizable since each match is independent. However, false positive matches are inherent to this method due to measurement errors and artifacts, thus requiring statistical validation. State of the art validation is achieved using the target-decoy method, which estimates the false discovery rate (FDR) by searching against a shuffled version of the original protein database. In contrast to the protein database search, validation by target-decoy is not parallelizable, because the FDR approximation requires all experimental data at once. In short, when using a fast data architecture for the workflow, the target-decoy approach is no longer feasible. Hence a novel approach is required to avoid false discovery of PSM on streaming single-pass experimental data. To this end, the recently proposed nokoi classifier seems promising to solve the aforementioned problems. In this paper, we present a general nokoi pipeline to create such a decoy-free classifier, that reach over 95% accuracy for general metaproteomics data.
Loading