{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[],"authorship_tag":"ABX9TyOU16vAETuopQdDFPqZqnhi"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"}},"cells":[{"cell_type":"markdown","source":["# [[EAMC] Intro to ML](https://eamc-penn.github.io/ml.html)\n","\n","The goal of this Python notebook is threefold:\n","\n","  1. **Learn** the basics of Python and how we can easily use programming as a tool to interact with and analyze medical data. **No prior Python experience is required or expected!**\n","  2. **Understand** and interact with the clinical dataset(s) curated for this year's datathon.\n","  3. **Write** a simple function to ask a large language model (like ChatGPT) a question using Python code."],"metadata":{"id":"7ENwhH5C0tRU"}},{"cell_type":"markdown","source":["[Python](https://www.w3schools.com/python/) is a general-purpose, user-friendly programming language that can be used for a variety of different tasks, including analyzing clinical datasets.\n","\n","While Python code is built to be easy to read and write, writing your own code to parse through datasets, multiply numbers, and handle other mundane tasks can be a waste of time. Instead, the Python community has already written a lot of helpful tools and functions for us to use for these tasks! These tools and functions are organized into ***packages***. 
We can use a Python package manager called `pip` to download some packages that are relevant for us:"],"metadata":{"id":"VJcV1r_U0-3r"}},{"cell_type":"code","execution_count":21,"metadata":{"id":"dQRJFpeKwDHa","executionInfo":{"status":"ok","timestamp":1744475275525,"user_tz":240,"elapsed":6542,"user":{"displayName":"Michael Yao","userId":"02175218461346911901"}},"colab":{"base_uri":"https://localhost:8080/"},"outputId":"46fda2d2-96b8-4191-b881-ef68c12bf24e"},"outputs":[{"output_type":"stream","name":"stdout","text":["\u001b[?25l   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/644.4 kB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m644.4/644.4 kB\u001b[0m \u001b[31m20.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25h"]}],"source":["# To run a code block like this one, press the \"Play\" button on the left hand\n","# side or press [Shift + Enter] on your keyboard while this code block is\n","# selected.\n","!pip install -U datasets together openai --quiet"]},{"cell_type":"markdown","source":["Now that we've downloaded these packages, we also need to import them for us to use. The keyword `import` in Python tells your computer to import the packages and all of their helpful functions and tools for us to use here."],"metadata":{"id":"CBbxxPFH1GwE"}},{"cell_type":"code","source":["from datasets import load_dataset\n","from typing import Any, Dict\n","from together import Together\n","from openai import OpenAI\n","from google.colab import userdata"],"metadata":{"id":"hdTBk76XwlRr","executionInfo":{"status":"ok","timestamp":1744475290062,"user_tz":240,"elapsed":2,"user":{"displayName":"Michael Yao","userId":"02175218461346911901"}}},"execution_count":22,"outputs":[]},{"cell_type":"markdown","source":["In this tutorial, we'll explore the MedQA dataset from [Jin et al. 
(2020)](https://arxiv.org/abs/2009.13081), which consists of prior USMLE Step 1, 2, and 3 practice questions. This dataset is frequently used to evaluate how well language models perform on medical tasks. You can access the dataset itself [here](https://huggingface.co/datasets/bigbio/med_qa).\n","\n","Many machine learning datasets are hosted on [Hugging Face](https://huggingface.co/). We can use the Python `datasets` package to download our track's dataset. We'll use the `load_dataset()` function to download the dataset and save it as a variable called `ds`. Note that the dataset has a training split for training any models or algorithms you develop, a validation split for checking them along the way, and a test split for evaluating your project."],"metadata":{"id":"nr-SfdHu1RM2"}},{"cell_type":"code","source":["ds = load_dataset(\"bigbio/med_qa\", data_files=\"data_clean.zip\", trust_remote_code=True)"],"metadata":{"id":"g_eSbwr7wtoD","executionInfo":{"status":"ok","timestamp":1744475577191,"user_tz":240,"elapsed":725,"user":{"displayName":"Michael Yao","userId":"02175218461346911901"}}},"execution_count":24,"outputs":[]},{"cell_type":"code","source":["ds"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"NvCqlWwY2M7t","executionInfo":{"status":"ok","timestamp":1744475577944,"user_tz":240,"elapsed":8,"user":{"displayName":"Michael Yao","userId":"02175218461346911901"}},"outputId":"fa68b169-0dc8-41e3-d1be-4aa67fb1a56a"},"execution_count":25,"outputs":[{"output_type":"execute_result","data":{"text/plain":["DatasetDict({\n","    train: Dataset({\n","        features: ['meta_info', 'question', 'answer_idx', 'answer', 'options'],\n","        num_rows: 10178\n","    })\n","    test: Dataset({\n","        features: ['meta_info', 'question', 'answer_idx', 'answer', 'options'],\n","        num_rows: 1273\n","    })\n","    validation: Dataset({\n","        features: ['meta_info', 'question', 'answer_idx', 'answer', 'options'],\n","        num_rows: 1272\n","    
})\n","})"]},"metadata":{},"execution_count":25}]},{"cell_type":"code","source":["validation_data = ds[\"validation\"]"],"metadata":{"id":"vnqflDhZw27U","executionInfo":{"status":"ok","timestamp":1744475583111,"user_tz":240,"elapsed":2,"user":{"displayName":"Michael Yao","userId":"02175218461346911901"}}},"execution_count":26,"outputs":[]},{"cell_type":"markdown","source":["What does the dataset look like? Let's take a look:"],"metadata":{"id":"TyJZ08nC2RDP"}},{"cell_type":"code","source":["validation_data[0]"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"tqyYnwUrw-1j","executionInfo":{"status":"ok","timestamp":1744474839395,"user_tz":240,"elapsed":38,"user":{"displayName":"Michael Yao","userId":"02175218461346911901"}},"outputId":"9cbfbfaa-c670-4c45-8236-6fcf12b21985"},"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["{'meta_info': 'step1',\n"," 'question': 'A 21-year-old sexually active male complains of fever, pain during urination, and inflammation and pain in the right knee. A culture of the joint fluid shows a bacteria that does not ferment maltose and has no polysaccharide capsule. The physician orders antibiotic therapy for the patient. The mechanism of action of action of the medication given blocks cell wall synthesis, which of the following was given?',\n"," 'answer_idx': 'D',\n"," 'answer': 'Ceftriaxone',\n"," 'options': [{'key': 'A', 'value': 'Chloramphenicol'},\n","  {'key': 'B', 'value': 'Gentamicin'},\n","  {'key': 'C', 'value': 'Ciprofloxacin'},\n","  {'key': 'D', 'value': 'Ceftriaxone'},\n","  {'key': 'E', 'value': 'Trimethoprim'}]}"]},"metadata":{},"execution_count":5}]},{"cell_type":"markdown","source":["Hmm... this doesn't really look like \"normal\" text! `validation_data[0]` is what we call a \"dictionary\" in Python - think of it as a lookup table. 
For example, our dictionary has an `answer` that is equal to `Ceftriaxone`, an `answer_idx` that is equal to `D`, and a `meta_info` that is equal to `step1`.\n","\n","This isn't super readable though - let's convert this to normal text (called a \"string\" in Python):"],"metadata":{"id":"Goo6RNfZ2UGA"}},{"cell_type":"code","source":["def convert_dict_to_string(x: Dict[str, Any]) -> str:\n","    text = x[\"question\"] + \"\\n\\n\"\n","    for option in x[\"options\"]:\n","        text += \"(\" + option[\"key\"] + \") \" + option[\"value\"] + \"\\n\"\n","    return text"],"metadata":{"id":"ca__0LI8xC_X","executionInfo":{"status":"ok","timestamp":1744477572622,"user_tz":240,"elapsed":4,"user":{"displayName":"Michael Yao","userId":"02175218461346911901"}}},"execution_count":27,"outputs":[]},{"cell_type":"code","source":["validation_questions = []\n","for question in validation_data:\n","    validation_questions.append(convert_dict_to_string(question))"],"metadata":{"id":"ISmK8U_JxpsT","executionInfo":{"status":"ok","timestamp":1744477576865,"user_tz":240,"elapsed":155,"user":{"displayName":"Michael Yao","userId":"02175218461346911901"}}},"execution_count":28,"outputs":[]},{"cell_type":"markdown","source":["What does our question look like now?"],"metadata":{"id":"KdhLBYqZ92Zm"}},{"cell_type":"code","source":["print(validation_questions[0])"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"PvcS1QEAx4NE","executionInfo":{"status":"ok","timestamp":1744477586513,"user_tz":240,"elapsed":15,"user":{"displayName":"Michael Yao","userId":"02175218461346911901"}},"outputId":"56bd0316-d931-4499-f234-32e3fab0bcab"},"execution_count":29,"outputs":[{"output_type":"stream","name":"stdout","text":["A 21-year-old sexually active male complains of fever, pain during urination, and inflammation and pain in the right knee. A culture of the joint fluid shows a bacteria that does not ferment maltose and has no polysaccharide capsule. 
The physician orders antibiotic therapy for the patient. The mechanism of action of action of the medication given blocks cell wall synthesis, which of the following was given?\n","\n","(A) Chloramphenicol\n","(B) Gentamicin\n","(C) Ciprofloxacin\n","(D) Ceftriaxone\n","(E) Trimethoprim\n","\n"]}]},{"cell_type":"markdown","source":["We can now ask this question to two different models: Llama 3.3 from Meta (formerly Facebook) and GPT-4o mini from OpenAI. The former is openly available, and anyone can download and run it for free (with enough memory, even on your own laptop!). The latter is one of the models behind what we know and love as \"ChatGPT\"."],"metadata":{"id":"t6LsUVWS95JY"}},{"cell_type":"code","source":["client = Together()\n","\n","response = client.chat.completions.create(\n","    model=\"meta-llama/Llama-3.3-70B-Instruct-Turbo-Free\",\n","    messages=[{\"role\": \"user\", \"content\": validation_questions[0]}],\n",")\n","\n","print(response.choices[0].message.content)"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"wchzdhDAx5W4","executionInfo":{"status":"ok","timestamp":1744474851529,"user_tz":240,"elapsed":11689,"user":{"displayName":"Michael Yao","userId":"02175218461346911901"}},"outputId":"3e6173dd-ef39-4d20-a119-5213e3d13bce"},"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["To solve this question, let's break down the information provided and analyze it step by step:\n","\n","1. **Symptoms and Diagnosis**: The patient presents with fever, pain during urination, and inflammation and pain in the right knee. The culture of the joint fluid shows bacteria that do not ferment maltose and have no polysaccharide capsule. These symptoms and findings suggest a bacterial infection, possibly a sexually transmitted infection (STI) given the patient's age and sexual activity, such as gonorrhea caused by Neisseria gonorrhoeae.\n","\n","2. **Bacterial Characteristics**: The bacteria do not ferment maltose and lack a polysaccharide capsule. 
Neisseria gonorrhoeae is known for not fermenting maltose and not having a polysaccharide capsule, which aligns with the description provided.\n","\n","3. **Mechanism of Action of the Medication**: The medication given blocks cell wall synthesis. This is a key point in identifying the antibiotic.\n","\n","4. **Antibiotic Mechanisms of Action**:\n","   - **(A) Chloramphenicol**: Inhibits protein synthesis by binding to the 50S ribosomal subunit.\n","   - **(B) Gentamicin**: An aminoglycoside that inhibits protein synthesis by binding to the 30S ribosomal subunit.\n","   - **(C) Ciprofloxacin**: A fluoroquinolone that inhibits DNA replication by targeting DNA gyrase and topoisomerase IV.\n","   - **(D) Ceftriaxone**: A third-generation cephalosporin that inhibits cell wall synthesis by binding to penicillin-binding proteins (PBPs) located inside the bacterial cell wall, resulting in the disruption of the cell wall and ultimately leading to cell lysis.\n","   - **(E) Trimethoprim**: Inhibits dihydrofolate reductase, which is necessary for the synthesis of tetrahydrofolate, a precursor for DNA synthesis.\n","\n","Given the mechanism of action described (blocking cell wall synthesis), the correct answer is **(D) Ceftriaxone**. Ceftriaxone is effective against a wide range of bacteria, including Neisseria gonorrhoeae, and is often used to treat gonorrhea, especially given the rise of antibiotic resistance among gonococcal strains. 
It works by inhibiting cell wall synthesis, which aligns with the mechanism of action described in the question.\n"]}]},{"cell_type":"code","source":["client = OpenAI(api_key=userdata.get(\"OPENAI_API_KEY\"))\n","\n","response = client.chat.completions.create(\n","    model=\"gpt-4o-mini-2024-07-18\",\n","    messages=[{\"role\": \"user\", \"content\": validation_questions[0]}]\n",")\n","\n","print(response.choices[0].message.content)"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"HarkLDhlyihx","executionInfo":{"status":"ok","timestamp":1744475075640,"user_tz":240,"elapsed":4241,"user":{"displayName":"Michael Yao","userId":"02175218461346911901"}},"outputId":"bda9cc55-a827-4918-e73d-9d1baa21284b"},"execution_count":20,"outputs":[{"output_type":"stream","name":"stdout","text":["The symptoms described in the clinical vignette suggest the patient has a sexually transmitted infection leading to reactive arthritis, possibly due to Neisseria gonorrhoeae, which can cause joint inflammation and often presents with urethritis and fever.\n","\n","The bacterial characteristics mentioned include that the bacteria does not ferment maltose and has no polysaccharide capsule, which is characteristic of Neisseria gonorrhoeae. \n","\n","Among the choices provided, the antibiotic that blocks cell wall synthesis is:\n","\n","(D) Ceftriaxone\n","\n","Ceftriaxone is a third-generation cephalosporin that is used to treat infections caused by Neisseria gonorrhoeae and works by inhibiting the synthesis of bacterial cell walls. 
\n","\n","The other options act through different mechanisms:\n","- (A) Chloramphenicol inhibits protein synthesis.\n","- (B) Gentamicin is an aminoglycoside that also inhibits protein synthesis.\n","- (C) Ciprofloxacin is a fluoroquinolone that inhibits bacterial DNA gyrase and topoisomerase.\n","- (E) Trimethoprim inhibits bacterial folic acid synthesis.\n","\n","Thus, the correct answer is (D) Ceftriaxone.\n"]}]},{"cell_type":"code","source":[],"metadata":{"id":"M0oFWIJgz9Xq"},"execution_count":null,"outputs":[]}]}