Improving the assessment of deep learning models in the context of drug-target interaction prediction

Published: 05 Apr 2022, Last Modified: 05 May 2023
Venue: MLDD Poster
Readers: Everyone
Keywords: drug-target interaction, benchmark, protein-ligand binding, evaluation protocol
TL;DR: We show how current benchmarks suffer from information leakage and propose a systematic separation of proteins and ligands between training and test sets to measure it.
Abstract: Machine Learning techniques have been widely adopted to predict drug-target interactions, a central area of research in early drug discovery. These techniques have shown promising results on various benchmarks, although they tend to suffer from poor generalization. This is typically related to the very sparse and nonuniform datasets available, which limit the applicability domain of Machine Learning techniques. Moreover, widespread approaches to splitting datasets (into training and test sets) treat each drug-target interaction as an independent entity, when in reality the drug and target involved may take part in other interactions, which breaks the assumption of independence. We observe that this leads to overly optimistic test results and poor generalization to out-of-distribution samples for various state-of-the-art sequence-based Machine Learning models for drug-target prediction. We show that previous approaches to reducing bias in binding datasets focus on drug or target information only and thus lead to similar pitfalls. Finally, we propose a minimum viable solution to evaluate the generalization capability of a Machine Learning model based on the systematic separation of test samples with respect to the drugs and targets in the training set, thus discerning the three out-of-distribution scenarios seen at test time: (1) the drug or (2) the target is present in the training set, or (3) neither is.
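The separation of test samples described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `categorize_test_pairs`, the group labels, and the representation of an interaction as a `(drug, target)` tuple of identifiers are all assumptions made for illustration. The extra `both_seen` group is also an assumption; it collects test pairs whose drug and target both occur in training, which is where information leakage would be expected.

```python
from collections import defaultdict


def categorize_test_pairs(train_pairs, test_pairs):
    """Group test (drug, target) pairs by what was seen during training.

    Hypothetical helper: pairs are tuples of hashable identifiers
    (e.g. a SMILES string and a protein sequence or accession).
    """
    # Collect every drug and every target that appears in the training set.
    train_drugs = {drug for drug, _ in train_pairs}
    train_targets = {target for _, target in train_pairs}

    groups = defaultdict(list)
    for drug, target in test_pairs:
        seen_drug = drug in train_drugs
        seen_target = target in train_targets
        if seen_drug and seen_target:
            key = "both_seen"         # assumed label: potential leakage case
        elif seen_drug:
            key = "drug_seen_only"    # scenario (1): drug present in training
        elif seen_target:
            key = "target_seen_only"  # scenario (2): target present in training
        else:
            key = "neither_seen"      # scenario (3): neither present
        groups[key].append((drug, target))
    return dict(groups)
```

Reporting a model's metric separately on each returned group, rather than on the pooled test set, is one way to make the gap between the scenarios visible.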