# Deep Learning for Drug Discovery: A Case Study in Kinase Inhibitors for Anti-Cancer Therapies

## Overview

This educational notebook demonstrates how to apply deep learning techniques to drug discovery. Specifically, it shows how to build and train a PyTorch neural network to predict the binding activity of small molecules against EGFR (Epidermal Growth Factor Receptor), an important protein target in cancer therapeutics.

The notebook provides a complete workflow including data preparation, model training, evaluation, and virtual screening analysis. It is designed for learners with basic Python knowledge who want to understand how machine learning can be applied to molecular property prediction.

## Prerequisites

This tutorial assumes you have:

- Basic Python programming experience
- Familiarity with Jupyter notebooks

No prior chemistry or drug discovery knowledge is required.

## Installation and Setup

### Option 1: Google Colab (Recommended for Beginners)

Google Colab provides a cloud-based environment with most dependencies pre-installed, making it the easiest way to run this notebook.

1. Upload the notebook file `deep_learning_kinases_noah_flynn.ipynb` to Google Colab.
2. Create a `data` folder in your Colab session and upload the data file `EGFR-activities-chembl33.csv` into it, so that it is available at `data/EGFR-activities-chembl33.csv`.
3. Install the required packages by running the following command in a Colab cell:

```bash
%pip install pandas numpy matplotlib seaborn rdkit scikit-learn torch
```

4. Execute the notebook cells sequentially from top to bottom.

### Option 2: Local Installation

To run the notebook on your local machine, activate a Python virtual environment and install the same packages listed under Option 1: Google Colab. When installing from a terminal rather than a notebook cell, use `pip install` in place of the `%pip` magic.

If you encounter issues installing RDKit via pip, install it via conda instead:

```bash
conda install -c conda-forge rdkit
```
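To confirm that the environment is ready before opening the notebook, you can check which dependencies Python can locate. The following is a minimal sketch and not part of the notebook: `find_missing` is a hypothetical helper, and the list uses the usual import names for the packages above (note that scikit-learn is imported as `sklearn`).

```python
import importlib.util


def find_missing(module_names):
    """Return the module names that Python cannot locate in this environment."""
    # find_spec looks a top-level module up without importing it,
    # so this check is fast even for large packages.
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names for the packages installed above (assumed mapping).
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "rdkit", "sklearn", "torch"]

missing = find_missing(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

If any names are reported as missing, install them before running the notebook cells.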

## File Structure

The project directory should be organized as follows:

```
project_directory/
├── deep_learning_kinases_noah_flynn.ipynb    # Main educational notebook
├── README.md                                  # This file
├── data/
│   └── EGFR-activities-chembl33.csv          # EGFR bioactivity data from Kinodata
├── figures/                                   # Generated plots (created during execution)
│   ├── loss_curves.pdf
│   ├── loss_curves.png
│   ├── lbvs_scatter.pdf
│   ├── lbvs_scatter.png
│   ├── enrichment_plot.pdf
│   └── enrichment_plot.png
└── artifacts/                                 # Saved models (created during execution)
    └── kinase_binder_model.pth
```

The `figures/` and `artifacts/` directories are created automatically when you run the notebook.
If you run into issues with `figures/` or `artifacts/`, the downloaded folders contain example outputs that you can use as a reference.
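If the output directories are not created for some reason, they can be created by hand. The sketch below shows one way to do this with `pathlib`; `ensure_output_dirs` is a hypothetical helper, and the notebook's own directory handling may differ.

```python
from pathlib import Path


def ensure_output_dirs(project_root="."):
    """Create figures/ and artifacts/ under project_root if they are missing."""
    root = Path(project_root)
    paths = []
    for name in ("figures", "artifacts"):
        path = root / name
        # exist_ok=True makes this a no-op when the directory already exists.
        path.mkdir(parents=True, exist_ok=True)
        paths.append(path)
    return paths
```

Calling `ensure_output_dirs()` from the folder that contains the notebook creates both directories next to it.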

## Running the Notebook

### Expected Runtime

On a standard CPU, the complete notebook takes approximately 10-15 minutes to execute. The most time-consuming steps are:
- Fingerprint generation
- Model training

If you have a CUDA-compatible GPU, training time will be significantly reduced.
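To see which device would be used on your machine, a check along the following lines can help. This is a sketch under the assumption that the notebook selects the device through `torch.cuda.is_available()`; `pick_device` is a hypothetical name, and the notebook's own logic may differ.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is no GPU backend to query.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Training would run on: {device}")
```

In PyTorch code, the resulting string is typically passed to calls such as `model.to(device)` so that the model and its input tensors live on the same device.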

### Expected Outputs

As you run the notebook, you will see:

1. **Data summaries**: Statistical descriptions of the EGFR bioactivity dataset
2. **Training progress**: Loss values printed for each training epoch
3. **Visualizations**: Loss curves, prediction scatter plots, and enrichment plots
4. **Model evaluation metrics**: MSE, RMSE, MAE, and R-squared values
5. **Enrichment analysis**: Tables showing early enrichment performance

All generated figures are saved to the `figures/` directory in both PNG and PDF formats. The trained model is saved to `artifacts/kinase_binder_model.pth`.
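For readers new to these regression metrics, the sketch below computes them from first principles in plain Python. It is for illustration only: `regression_metrics` is a hypothetical helper, and the notebook may compute the same quantities with NumPy or scikit-learn instead.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, and R-squared for paired true/predicted values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n        # mean squared error
    mae = sum(abs(e) for e in errors) / n       # mean absolute error

    # R-squared compares the model's squared error with that of a baseline
    # that always predicts the mean of the true values.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = sum(e * e for e in errors)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}


# Toy pIC50 values, invented purely for illustration.
print(regression_metrics([6.0, 7.0, 8.0], [6.0, 7.0, 9.0]))
```

Because the targets are pIC50 values, RMSE and MAE are expressed in pIC50 (log) units, which makes them easier to interpret than MSE.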

## Data Source

The EGFR bioactivity data used in this notebook is sourced from Kinodata, a curated database of kinase inhibitor bioactivities extracted from ChEMBL 33 (January 2024 release). The dataset contains 7,287 measurements of compounds tested against EGFR, including SMILES molecular representations and pIC50 activity values.

For more information about Kinodata, visit: https://github.com/openkinome/kinodata/releases
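If pIC50 is unfamiliar: it is the negative base-10 logarithm of the IC50 expressed in mol/L, so higher values mean more potent compounds. The dataset already provides pIC50 values, so the following sketch is only meant to build intuition; `pic50_from_ic50_nm` is a hypothetical helper and assumes the IC50 is given in nanomolar.

```python
import math


def pic50_from_ic50_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    # 1 nM = 1e-9 mol/L, so -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm).
    return 9.0 - math.log10(ic50_nm)


print(pic50_from_ic50_nm(1000))  # 1 µM  -> 6.0
print(pic50_from_ic50_nm(10))    # 10 nM -> 8.0
```

Each unit increase in pIC50 corresponds to a tenfold decrease in IC50.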

## Troubleshooting

### Common Issues

**Issue: "ModuleNotFoundError" for RDKit**

Solution: RDKit can be challenging to install via pip on some systems. Try installing via conda instead:
```bash
conda install -c conda-forge rdkit
```

**Issue: "CUDA out of memory" error**

Solution: The code automatically uses the GPU when CUDA is available and falls back to the CPU otherwise. If you run out of memory on the GPU, reduce the batch size in the `create_data_loaders` function from 32 to 16.

**Issue: "FileNotFoundError" for data file**

Solution: Ensure the data file `EGFR-activities-chembl33.csv` is located in a `data/` subdirectory relative to the notebook. Check that the file path in the `load_kinase_data()` function matches your directory structure.
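A quick way to diagnose this is to resolve the expected path yourself and see where Python is actually looking. The sketch below assumes the default layout from the File Structure section; `resolve_data_file` is a hypothetical helper and is not the notebook's `load_kinase_data()` function.

```python
from pathlib import Path


def resolve_data_file(notebook_dir=".", relative_path="data/EGFR-activities-chembl33.csv"):
    """Return the path to the data file, or raise with the absolute path that was checked."""
    path = Path(notebook_dir) / relative_path
    if not path.is_file():
        # Reporting the absolute path shows which working directory is in effect.
        raise FileNotFoundError(f"Expected data file at: {path.resolve()}")
    return path
```

Calling `resolve_data_file()` in a notebook cell either returns the path or reports the absolute location that was checked, which usually reveals a wrong working directory or a misplaced `data/` folder.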

**Issue: Warnings about problematic SMILES strings**

Solution: A small number of SMILES strings in the dataset may fail to parse. This is expected and the code handles these gracefully by filtering them out. These warnings can be safely ignored.
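The filtering idea can be sketched as follows. RDKit's `Chem.MolFromSmiles` returns `None` when a string cannot be parsed, so unparseable entries can be separated out with a simple check. `split_parseable` is a hypothetical helper written for this README; the notebook's own filtering code may look different.

```python
def split_parseable(smiles_list, parse):
    """Split SMILES into (valid, rejected) using parse(smiles) -> object or None."""
    valid, rejected = [], []
    for smiles in smiles_list:
        if parse(smiles) is not None:
            valid.append(smiles)
        else:
            rejected.append(smiles)
    return valid, rejected


try:
    from rdkit import Chem
except ImportError:
    Chem = None  # RDKit is not installed; the demo below is skipped.

if Chem is not None:
    # "C1CC" opens a ring that is never closed, so RDKit cannot parse it.
    valid, rejected = split_parseable(["CCO", "c1ccccc1", "C1CC"], Chem.MolFromSmiles)
    print(f"kept {len(valid)} molecules, dropped {len(rejected)}")
```

RDKit also logs a parse error for each rejected string, which is the source of the warnings described above.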

**Issue: Matplotlib style warning about 'seaborn-v0_8-whitegrid'**

Solution: This can occur when the installed matplotlib version does not provide the `seaborn-v0_8-whitegrid` style (the name was introduced in matplotlib 3.6). The visualizations will still render correctly with the default style.
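If you would rather avoid the warning, one option is to pick the first style name that your matplotlib installation actually offers. The sketch below illustrates that idea; `pick_style` and `STYLE_CANDIDATES` are hypothetical names, and the notebook's own fallback logic may differ.

```python
def pick_style(candidates, available):
    """Return the first candidate found in `available`, else "default"."""
    for name in candidates:
        if name in available:
            return name
    return "default"


# The v0_8 name is used by matplotlib 3.6 and later; the unprefixed
# name is the one used by older releases.
STYLE_CANDIDATES = ["seaborn-v0_8-whitegrid", "seaborn-whitegrid"]

try:
    import matplotlib.pyplot as plt
except ImportError:
    plt = None  # matplotlib is not installed; nothing to configure.

if plt is not None:
    plt.style.use(pick_style(STYLE_CANDIDATES, plt.style.available))
```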

## Learning Objectives

After completing this notebook, you will understand:

1. How to prepare molecular data for machine learning using fingerprint representations
2. How to implement scaffold-based dataset splitting for realistic model evaluation
3. How to build and train a neural network using PyTorch for molecular property prediction
4. How to evaluate model performance using appropriate metrics for drug discovery
5. How to assess virtual screening effectiveness using enrichment analysis
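As a preview of the last objective, the enrichment factor compares the hit rate among the top-ranked compounds with the hit rate in the whole library. The following is a minimal sketch of that calculation; `enrichment_factor` is a hypothetical helper, and how a compound is labeled "active" (for example, via a pIC50 cut-off) is an assumption that the notebook may define differently.

```python
def enrichment_factor(scores, is_active, top_fraction=0.01):
    """Hit rate in the top-scoring fraction divided by the overall hit rate."""
    n = len(scores)
    if n == 0 or n != len(is_active):
        raise ValueError("scores and is_active must be non-empty and of equal length")
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")

    total_actives = sum(1 for active in is_active if active)
    if total_actives == 0:
        raise ValueError("at least one active compound is required")

    # Rank compounds from highest to lowest predicted score.
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(n * top_fraction))
    hits_in_top = sum(1 for _, active in ranked[:n_top] if active)

    return (hits_in_top / n_top) / (total_actives / n)


# Toy example: 10 compounds, 2 actives, both ranked at the top.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
labels = [True, True, False, False, False, False, False, False, False, False]
print(enrichment_factor(scores, labels, top_fraction=0.2))  # 5.0
```

A value of 1 means the ranking is no better than random selection, while larger values mean that actives are concentrated near the top of the list.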

## Citation and Acknowledgments

This educational material uses data from the ChEMBL database (version 33) accessed through the Kinodata project. The implementation follows concepts described in the book "Machine Learning for Drug Discovery", published by Manning Publications. If you use this notebook for educational or research purposes, please acknowledge these resources.
