Improving the assessment of deep learning models in the context of drug-target interaction prediction

Mirko Torrisi; Antonio De la Vega de Leon; Guillermo Climent; Remco Loos; Alejandro Panjkovich

Published: 05 Apr 2022, Last Modified: 05 May 2023 | MLDD Poster | Readers: Everyone

Keywords: drug-target interaction, benchmark, protein-ligand binding, evaluation protocol

TL;DR: We show how current benchmarks suffer from information leakage and propose a systematic separation of proteins and ligands between training and test sets to measure it.

Abstract: Machine Learning techniques have been widely adopted to predict drug-target interactions, a central area of research in early drug discovery. These techniques have shown promising results on various benchmarks, although they tend to suffer from poor generalization. This is typically related to the very sparse and nonuniform datasets available, which limit the applicability domain of Machine Learning techniques. Moreover, widespread approaches to splitting datasets (into training and test sets) treat each drug-target interaction as an independent entity, when in reality the drug and target involved may take part in other interactions, which breaks the assumption of independence. We observe that this leads to overly optimistic test results and poor generalization to out-of-distribution samples for various state-of-the-art sequence-based Machine Learning models for drug-target prediction. We show that previous approaches to reducing bias in binding datasets focus on drug or target information only and, thus, lead to similar pitfalls. Finally, we propose a minimum viable solution to evaluate the generalization capability of a Machine Learning model, based on the systematic separation of test samples with respect to the drugs and targets in the training set, thus discerning the three out-of-distribution scenarios seen at test time: (1) the drug or (2) the target is present in the training set, or (3) neither is.
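
The separation of test samples described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `categorize_test_pairs`, the bucket labels, and the toy identifiers are all hypothetical, and interactions are assumed to be given as (drug, target) identifier pairs. Test pairs whose drug and target both occur in the training set are placed in a separate `both_seen` bucket, which is an assumption made here since the abstract only names the three out-of-distribution scenarios.

```python
def categorize_test_pairs(train_pairs, test_pairs):
    """Group test (drug, target) pairs by their overlap with the training set.

    Bucket names are illustrative only (not taken from the paper).
    """
    # Every drug and every target that appears in any training interaction.
    train_drugs = {drug for drug, _ in train_pairs}
    train_targets = {target for _, target in train_pairs}

    buckets = {
        "both_seen": [],     # drug and target both occur in training (assumed extra bucket)
        "drug_seen": [],     # scenario (1): drug in training, target not
        "target_seen": [],   # scenario (2): target in training, drug not
        "neither_seen": [],  # scenario (3): neither occurs in training
    }
    for drug, target in test_pairs:
        drug_seen = drug in train_drugs
        target_seen = target in train_targets
        if drug_seen and target_seen:
            key = "both_seen"
        elif drug_seen:
            key = "drug_seen"
        elif target_seen:
            key = "target_seen"
        else:
            key = "neither_seen"
        buckets[key].append((drug, target))
    return buckets


# Toy usage with made-up identifiers.
train = [("drugA", "protX"), ("drugB", "protY")]
test = [
    ("drugA", "protY"),  # both seen, but never together
    ("drugA", "protZ"),  # only the drug is seen
    ("drugC", "protX"),  # only the target is seen
    ("drugC", "protZ"),  # neither is seen
]
buckets = categorize_test_pairs(train, test)
for name, pairs in buckets.items():
    print(name, pairs)
```

Reporting a model's test metric per bucket, rather than over the pooled test set, is one way to make the gap between the scenarios visible.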
