When Biology has Chemistry: Solubility And Drug Subcategory Prediction using SMILES Strings

Sarwan Ali; Prakash Chourasia; Murray Patterson

01 Mar 2023 (modified: 23 May 2023). Submitted to Tiny Papers @ ICLR 2023. Readers: Everyone

Keywords: SMILES String, Embeddings, Morgan Fingerprint, MACCS Fingerprint, k-mers, Minimizers

TL;DR: This paper investigates the use of traditional molecular fingerprints and sequence-based embeddings of SMILES strings to predict the solubility ALOGPS, comparing five types of embeddings and six regression models.

Abstract: Drug discovery is a complex process that requires extensive research and development. One important aspect of drug discovery is the prediction of drug properties, such as solubility. In recent years, sequence-based representations such as SMILES strings have gained popularity in the drug discovery community due to their ability to encode chemical structures. SMILES strings are text-based representations of chemical structures that can be easily processed by machine learning models. This paper presents a study on predicting (i) the solubility ALOGPS (Ghose-Crippen-Viswanadhan octanol-water partition coefficient) and (ii) drug subcategories using traditional molecular fingerprints and sequence-based embedding methods (from the bioinformatics domain) applied to SMILES strings. The study investigates five types of embeddings: Morgan fingerprint, MACCS fingerprint, $k$-mers, minimizer-based spectrum, and a weighted version of $k$-mers that uses inverse document frequency to assign a weight to each $k$-mer within the spectrum. For the classification task (i.e., drug subcategory prediction), we use these embeddings as input to several classifiers and report classification performance using several evaluation metrics. For the regression task (i.e., solubility ALOGPS prediction), we use the same embeddings with several popular models (e.g., linear regression) and evaluate performance using multiple metrics such as RMSE, MAE, and MSE. The classification results indicate that the weighted $k$-mers method outperforms the baselines in predictive performance. The regression results indicate that the MACCS fingerprint combined with a random forest regression model outperforms all other combinations of embedding method and regression model.
Overall, this study provides insights into the effectiveness of different embeddings, regression models, and classification models for solubility and drug subcategory prediction, which can be helpful for future tasks such as drug discovery.
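
To make the sequence-based embeddings named in the abstract more concrete, the following is a minimal sketch of a $k$-mer spectrum, a minimizer list, and an inverse-document-frequency weighting over SMILES strings. It is an illustration under stated assumptions, not the authors' implementation: the function names (`kmers`, `minimizers`, `idf_weights`, `spectrum`), the character-level tokenization (which splits multi-character atoms such as `Cl`), the forward-only minimizer definition, and the unsmoothed IDF formula log(N / document frequency) are all choices made here for illustration.

```python
import math
from collections import Counter

def kmers(s, k):
    """All overlapping length-k substrings of s (character-level k-mers)."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def minimizers(s, k, m):
    """For each k-mer, keep its lexicographically smallest m-mer (m < k).
    Assumption: forward direction only, a simplified minimizer variant."""
    return [min(kmers(km, m)) for km in kmers(s, k)]

def idf_weights(corpus, k):
    """Unsmoothed IDF per k-mer: log(N / number of strings containing it)."""
    n = len(corpus)
    doc_freq = Counter()
    for s in corpus:
        doc_freq.update(set(kmers(s, k)))  # count each k-mer once per string
    return {km: math.log(n / c) for km, c in doc_freq.items()}

def spectrum(tokens, vocab, weights=None):
    """Count vector over vocab, optionally scaled by per-token weights."""
    counts = Counter(tokens)
    return [counts[t] * (weights[t] if weights is not None else 1.0)
            for t in vocab]

# Toy corpus of SMILES strings (ethanol, ethylamine, propanol).
corpus = ["CCO", "CCN", "CCCO"]
weights = idf_weights(corpus, k=2)
vocab = sorted(weights)                      # ['CC', 'CN', 'CO']

plain = spectrum(kmers("CCCO", 2), vocab)    # [2.0, 0.0, 1.0]
weighted = spectrum(kmers("CCCO", 2), vocab, weights)
```

In this toy corpus the $k$-mer `CC` occurs in every string, so its IDF weight is zero and it contributes nothing to the weighted spectrum, while rarer $k$-mers such as `CN` receive larger weights. The resulting fixed-length vectors can then be passed to any standard classifier or regressor.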
