{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "4388d45f",
   "metadata": {},
   "source": [
    "# Tutorial: Accessing Knowledge Sources\n",
    "\n",
    "This notebook details how to access the knowledge sources as part of the `leon` package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed649126",
   "metadata": {},
   "outputs": [],
   "source": [
    "import leon"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "764c7e1b",
   "metadata": {},
   "source": [
    "The function `leon.knowledge.get_knowledge_source_options()` returns a list of all of the implemented knowledge bases available in `leon`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "92c85601",
   "metadata": {},
   "outputs": [],
   "source": [
    "leon.knowledge.get_knowledge_source_options()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "247eebc4",
   "metadata": {},
   "source": [
    "All of the listed knowledge sources are subclasses of the base `KnowledgeBase` class in the `leon.knowledge` module. Each knowledge source implements the class method `retrieve()`, which takes as input a string query or list of queries and returns the requested knowledge. All `KnowledgeBase`s share `top_k` as a required argument, which corresponds to the number of documents to retrieve for the input query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "98aaabf6",
   "metadata": {},
   "outputs": [],
   "source": [
    "top_k: int = 1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4fc20a7d",
   "metadata": {},
   "source": [
    "Let's now explore the different knowledge sources that are implemented."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df1ef707",
   "metadata": {},
   "source": [
    "## Retrieval-Based Knowledge Sources\n",
    "\n",
    "The first class of knowledge sources are **retrieval-based**, meaning that they have access to an external, static repository of fixed knowledge and leverage and embedding module to retrieve documents of knowledge that are most relevant to the input query. The following knowledge sources are implemented in `leon`:\n",
    "\n",
    "  - `PMCKnowledgeBase` draws knowledge from open-access, full-length manuscripts from PubMed Central via a publicly accessible [AWS S3 Bucket](https://pmc.ncbi.nlm.nih.gov/tools/pmcaws/).\n",
    "  - `PubMedKnowledgeBase` draws knowledge from unstructured text sourced from PubMed via the [`MedRAG/pubmed`](https://huggingface.co/datasets/MedRAG/pubmed) dataset from [Xiong G et al. ACL Findings (2024)](https://aclanthology.org/2024.findings-acl.372/).\n",
    "  - `TextbooksKnowledgeBase` draws knowledge from medical textbooks via the [`MedRAG/textbooks`](https://huggingface.co/datasets/MedRAG/textbooks) dataset from [Xiong G et al. ACL Findings (2024)](https://aclanthology.org/2024.findings-acl.372/).\n",
    "  - `arXivKnowledgeBase` draws knowledge from abstracts of preprints published on arXiv in the `cs.LG` (i.e., machine learning) category via the [`mteb/raw_arxiv`](https://huggingface.co/datasets/mteb/raw_arxiv) dataset. (For our experiments in the biomedical domain, this knowledge base is intended to power knowledge ablation experiments.)\n",
    "\n",
    "Retrieval-based knowledge sources require an additional `embedder` argument for initialization, which should be a subclass of the `BaseEmbedding` from [`llama_index`](https://github.com/run-llama/llama_index). We implement a subset of popular embedding methods in the `leon.embedding` module - the list of the implemented methods can be accessed by the `get_embedder_options()` function in the parent module:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c2a0f5e",
   "metadata": {},
   "outputs": [],
   "source": [
    "leon.embedding.get_embedder_options()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f198fc05",
   "metadata": {},
   "source": [
    "Note that by default, the OpenAI text embedding modules are accessed via the [`AzureOpenAIEmbedding`](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/embeddings/llama-index-embeddings-azure-openai/llama_index/embeddings/azure_openai/base.py) class instead of the more traditional `OpenAIEmbedding` module.\n",
    "\n",
    "The `leon/random` embedder retrieves documents from a corpus randomly but deterministically, and is helpful for experiments ablating the specific choice of embedder and the knowledge retrieved for a particular task.\n",
    "\n",
    "We can instantiate a particular embedding from the above list using the `get_embedder()` function from the `leon.embedding` module:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "141340be",
   "metadata": {},
   "outputs": [],
   "source": [
    "embedder = leon.embedding.get_embedder(\"openai/text-embedding-3-small\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a8dff5bf",
   "metadata": {},
   "source": [
    "As an example, we can retrieve knowledge from medical textbooks according to the following example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29fd7882",
   "metadata": {},
   "outputs": [],
   "source": [
    "knowledge_base = getattr(leon.knowledge, \"TextbooksKnowledgeBase\")(embedder=embedder, top_k=top_k)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28200b14",
   "metadata": {},
   "source": [
    "Note that depending on the size of the knowledge corpus and the chosen embedder, it can often take minutes or even hours to load the knowledge base - even after the document embeddings are cached locally.\n",
    "\n",
    "After the knowledge base is instantiated, we can retrieve information using the `retrieve()` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a35bc7cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "knowledge_base.retrieve([\"Cisplatin\", \"Gemcitabine\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "68b827f4",
   "metadata": {},
   "source": [
    "## LLM-Based Knowledge Sources\n",
    "\n",
    "Another option is to use *language models* as sources of knowledge. As examples, we implement [MedGemma 4B](https://huggingface.co/google/medgemma-4b-it) and [MedGemma 27B](https://huggingface.co/google/medgemma-27b-text-it) from Google, which have been specifically trained for medical text and image comprehension. We can query these models to provide us with facts that are relevant to the input query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "886911c7",
   "metadata": {},
   "outputs": [],
   "source": [
    "knowledge_base = getattr(leon.knowledge, \"MedGemma4BKnowledgeBase\")(top_k=top_k)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3c4cce5f",
   "metadata": {},
   "outputs": [],
   "source": [
    "knowledge_base.retrieve([\"Cisplatin\", \"Gemcitabine\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "81568a2b",
   "metadata": {},
   "source": [
    "## Knowledge-Graph (KG) Knowledge Sources\n",
    "\n",
    "Knowledge graphs (KG) can also act as sources of knowledge. For example, we implement the [Hetionet](https://het.io/) knowledge graph from [Himmelstein DS et al. eLife (2017)](https://elifesciences.org/articles/26726) as a knowledge base. A general KGs has the expressive power to represent multiple different types of relationships between different entities. For our purposes, we assume that we are interested in finding the relationships between a source compound (i.e., drug) and a target disease. The retrieval function uses an embedding function to map each source compound and target disease to a corresponding node in the graph, and then represents the relationship between the two nodes in natural language. For more information on embedding functions implemented in `leon`, see the **Retrieval-Based Knowledge Sources** section above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e93e015d",
   "metadata": {},
   "outputs": [],
   "source": [
    "embedder = leon.embedding.get_embedder(\"emilyalsentzer/Bio_ClinicalBERT\")\n",
    "knowledge_base = getattr(leon.knowledge, \"HetionetKGKnowledgeBase\")(embedder=embedder, top_k=top_k)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b1f2274",
   "metadata": {},
   "source": [
    "Note that the expected arguments for the `retrieve()` function of KG knowledge sources is different compared to other `KnowledgeBase` implementations, as it expects both `source` compounds and `target` diseases to be specified (in natural language)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb6b72b3",
   "metadata": {},
   "outputs": [],
   "source": [
    "knowledge_base.retrieve(source=[\"Cisplatin\", \"Gemcitabine\"], target=[\"Cancer\", \"Cancer\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c7c11dc",
   "metadata": {},
   "source": [
    "We also implement the [PrimeKG](https://zitniklab.hms.harvard.edu/projects/PrimeKG/) knowledge graph from [Chandak P et al. Sci Rep (2023)](https://doi.org/10.1038/s41597-023-01960-3) as a knowledge base:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bdaeb1ed",
   "metadata": {},
   "outputs": [],
   "source": [
    "knowledge_base = getattr(leon.knowledge, \"PrimeKGKnowledgeBase\")(embedder=embedder, top_k=top_k)\n",
    "knowledge_base.retrieve(source=[\"Cisplatin\", \"Gemcitabine\"], target=[\"Cancer\", \"Cancer\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7823bdd5",
   "metadata": {},
   "source": [
    "## Domain-Specific Knowledge Sources\n",
    "\n",
    "We can also incorporate domain-specific knowledge from the bioclinical domain. We implement the following knowledge sources:\n",
    "\n",
    "  - `CellosaurusKnowledgeBase` draws knowledge from [Cellosaurus](https://www.cellosaurus.org/), an external knowledge source on cell lines from [Bairoch A. J Biomol Tech (2018)](https://pmc.ncbi.nlm.nih.gov/articles/PMC5945021/). Given an input cell line, the knowledge base retrieves information about the origin of the cell line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1fa6e388",
   "metadata": {},
   "outputs": [],
   "source": [
    "knowledge_base = getattr(leon.knowledge, \"CellosaurusKnowledgeBase\")(top_k=top_k)\n",
    "knowledge_base.retrieve([\"0162D\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec81f623",
   "metadata": {},
   "source": [
    "  - `COSMICKnowledgeBase` draws knowledge from the [Catalogue Of Somatic Mutations in Cancer (COSMIC)](https://cancer.sanger.ac.uk/cosmic/) introduced by [Tate JG et al. Nucleic Acids Res (2018)](https://academic.oup.com/nar/article/47/D1/D941/5146192). The knowledge base returns the `top_k` most commonly mutated genes and their corresponding population frequencies for a given cancer type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "637c1825",
   "metadata": {},
   "outputs": [],
   "source": [
    "knowledge_base = getattr(leon.knowledge, \"COSMICKnowledgeBase\")(top_k=top_k)\n",
    "knowledge_base.retrieve(\"BC\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3fc280ba",
   "metadata": {},
   "source": [
    "  - `GDSCKnowledgeBase` draws knowledge from the [Genomics of Drug Sensitivity in Cancer (GDSC)](https://www.cancerrxgene.org/) database introduced by [Yang W et al. Nucleic Acids Res (2013)](https://academic.oup.com/nar/article/41/D1/D955/1059448). The knowledge base retrieves the drug target(s) of an input drug."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20ad9c32",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note that no `top_k` argument is required (or used) for the GDSC knowledge base.\n",
    "knowledge_base = getattr(leon.knowledge, \"GDSCKnowledgeBase\")()\n",
    "knowledge_base.retrieve([\"Cisplatin\", \"Gemcitabine\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7dbc5596",
   "metadata": {},
   "source": [
    "  - `DepMapKnowledgeBase` draws knowledge from the [Dependency Map (DepMap)](https://depmap.org/portal/) from [Tsherniak A et al. Cell (2017)](https://www.sciencedirect.com/science/article/pii/S0092867417306517). Given an input patient description, the knowledge base retrieves the most relevant cell models (i.e., cell lines or organoids). We can also use the `model_cell_metadata()`, `get_sensitive_ko_genes()`, and `ko_gene_metadata()` helper functions to retrieve additional information about the cell models and the genes that, when knocked out, greatly reduce the viability of the cell model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5d0ae14",
   "metadata": {},
   "outputs": [],
   "source": [
    "embedder = leon.embedding.get_embedder(\"openai/text-embedding-3-small\")\n",
    "knowledge_base = getattr(leon.knowledge, \"DepMapKnowledgeBase\")(embedder=embedder, top_k=top_k)\n",
    "cell_lines = knowledge_base.retrieve(\"46M former smoker with stage IV lung adenocarcinoma\")[0]\n",
    "for model in cell_lines:\n",
    "    print(knowledge_base.model_cell_metadata(model))\n",
    "    for gene in knowledge_base.get_sensitive_ko_genes(model, 2):\n",
    "        print(\" - \" + knowledge_base.ko_gene_metadata(gene))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "leon",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.17"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
