Biological Sequence Analysis Using Bézier Curve

21 Sept 2023 (modified: 11 Feb 2024), Submitted to ICLR 2024
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Bio-sequence Analysis, Bézier Curve, Chaos Game Representation, Deep Learning-based Classification, Image Classification
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: The analysis of biological sequences (e.g., protein and DNA) is essential for disease diagnosis, biomaterial engineering, genetic engineering, and drug discovery. Conventional analytical methods transform sequences into numerical representations so that machine learning or deep learning (DL) models can characterize them. However, their efficacy is constrained by the tendency of DL models to perform suboptimally on tabular data. An alternative group of methods converts biological sequences into images using Chaos Game Representation (CGR). A notable drawback of these methods is that they map the elements of a sequence onto a relatively small subset of pixels in the generated image. The resulting sparse image may not adequately capture the sequence information, potentially leading to suboptimal predictions. In this study, we introduce a novel approach that transforms biological sequences into images using Bézier curves for element mapping. Mapping the elements onto a curve enriches the sequence information represented in the image, yielding better DL-based classification performance. We validate our system on three distinct protein sequence datasets, each with a different classification task, and the results show that our Bézier curve method achieves good performance on all of them. For instance, on a protein subcellular location prediction task, it improves accuracy by 39.4% over the FCGR baseline using a 2-layer CNN classifier. Moreover, for Coronavirus host classification, our Bézier method achieves a 5.3% higher ROC AUC score than FCGR using a 3-layer CNN classifier.
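To make the curve-based mapping idea concrete, the following is a minimal sketch of how points along a Bézier curve could be sampled and accumulated into an image. It is not the submission's actual pipeline: the abstract does not specify how sequence elements are turned into control points, so the control points `toy_control`, the helper names `bezier_points` and `rasterize`, and the 32 x 32 image size are all assumptions chosen for illustration.

```python
from math import comb

import numpy as np


def bezier_points(control_points, num_samples=101):
    """Sample a Bezier curve of degree n = len(control_points) - 1.

    Uses the Bernstein form: B(t) = sum_i C(n, i) * t^i * (1 - t)^(n - i) * P_i.
    """
    pts = np.asarray(control_points, dtype=float)
    n = len(pts) - 1
    t = np.linspace(0.0, 1.0, num_samples)
    # One column per Bernstein basis polynomial, one row per sample of t.
    basis = np.stack(
        [comb(n, i) * t**i * (1.0 - t) ** (n - i) for i in range(n + 1)],
        axis=1,
    )
    return basis @ pts  # shape: (num_samples, 2)


def rasterize(points, size=32):
    """Accumulate (x, y) samples with coordinates in [0, 1] into a count image."""
    img = np.zeros((size, size), dtype=np.int64)
    idx = np.clip(np.rint(points * (size - 1)).astype(int), 0, size - 1)
    # np.add.at handles repeated pixel indices correctly (unbuffered add).
    np.add.at(img, (idx[:, 1], idx[:, 0]), 1)
    return img


# Hypothetical control points; in a real pipeline these would be derived
# from the sequence elements by some mapping the abstract does not describe.
toy_control = [(0.1, 0.1), (0.5, 0.9), (0.9, 0.1)]
curve = bezier_points(toy_control)
image = rasterize(curve)
```

Because every sample along the curve contributes to a pixel, a single set of control points fills many pixels rather than one, which is the intuition behind the claim that curve-based mapping gives a less sparse image than point-based CGR.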
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3543