{"metadata":{"kernelspec":{"name":"python3","display_name":"Python 3","language":"python"},"language_info":{"name":"python","version":"3.10.10","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"# SQL Queries with Differential Privacy\n\n<table align=\"left\">\n\n  <td>\n    <a href=\"https://colab.research.google.com/github/opendp/smartnoise-sdk/blob/main/sql/samples/SQL%20Queries.ipynb\">\n      <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Colab logo\"> Run in Colab\n    </a>\n  </td>\n  <td>\n    <a href=\"https://github.com/opendp/smartnoise-sdk/blob/main/sql/samples/SQL%20Queries.ipynb\">\n      <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n      View on GitHub\n    </a>\n  </td>\n  <td>\n    <a href=\"https://www.kaggle.com/notebooks/welcome?src=https://github.com/opendp/smartnoise-sdk/blob/main/sql/samples/SQL%20Queries.ipynb\">\n      <img src=\"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAACAAAAAgCAYAAABzenr0AAABJ0lEQVR4AezUgUYEURSH8c3IWkDIPkEI+wwhK/sMPUCSAYGQEBCwIBACyYKEIIAkrCQQkmSMJFnJYHD6cCJ/CdaZWQx+hjuX881wb8vM5uo0+wGtg8sxMnddR8AE5t6agCagCZj5APYkP6IDdF8fxzjFCUZIkYQHsD7AM+yXEnvhf4C1NR2OAvvohAb4b3/S4diFDw8K8OGPMvwLO2iHngKeKzocn9jGfOgxRIoHGT5BKsNDAkp8yPB3bCCJv4j+diRfXnlAhn5tAe4Oy1UFvOAMpayfYzE6oMA6FjCCiaHcAaEX0RLGMAncCg2Qd6vIJeIVg8gAfb+JQiLu0asqoI0hTFygGx7ge7o+0MQhOtMG3CJ3N//s6+EKOTJ/fg+kVxmf+aO9YwBrLxFeFqDOPAAAAABJRU5ErkJggg==\" alt=\"Kaggle logo\">\n      Run on Kaggle\n    </a>\n  </td>                                                                                               \n</table>","metadata":{}},{"cell_type":"code","source":"!pip install smartnoise-sql --quiet","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Read data\nSmartnoise supports issue SQL queries against CSV files, database engines, and Spark clusters.","metadata":{}},{"cell_type":"code","source":"import pandas as pd\n\npums = pd.read_csv(\"https://raw.githubusercontent.com/opendifferentialprivacy/dp-test-datasets/master/data/PUMS_california_demographics_1000/data.csv\")\n\nprint(pums)","metadata":{"scrolled":true,"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Execute DP query\n\nOpen a private SQL connection","metadata":{}},{"cell_type":"code","source":"from snsql import *\n\n# Adopted from https://github.com/opendp/smartnoise-sdk/blob/main/datasets/PUMS.yaml\nmetadata = {\n  \"PUMS\": {\n    \"PUMS\": {\n      \"PUMS\": {\n        \"row_privacy\": True,\n        \"rows\": 1000,\n        \"age\": {\n          \"type\": \"int\",\n          \"lower\": 0,\n          \"upper\": 100\n        },\n        \"sex\": {\n          \"type\": \"string\"\n        },\n        \"educ\": {\n          \"type\": \"string\"\n        },\n        \"race\": {\n          \"type\": \"string\"\n        },\n        \"income\": {\n          \"type\": \"int\",\n          \"lower\": 0,\n          \"upper\": 500000\n        },\n        \"married\": {\n          \"type\": \"string\"\n        }\n      }\n    }\n  }\n}\n\nprivate_reader = from_connection(\n    pums, metadata=metadata, \n    privacy=Privacy(epsilon=1.0, delta=1/1000)\n)\n\nquery = 'SELECT married, COUNT(*) AS n FROM PUMS.PUMS GROUP BY married'\n\nresult_dp = private_reader.execute_df(query)\nprint(result_dp)","metadata":{"scrolled":true,"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"**Note**, in the above example, we query the PUMS microdata to get the count of individuals by marriage status.  If you run the private query repeatedly, you will see that the answer changes a bit between queries.","metadata":{}},{"cell_type":"code","source":"result_dp = private_reader.execute_df(query)\nprint(result_dp)","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"The `PrivateReader` allows you to swap in differentially private results wherever exact results are currently used, it can accept some additional paramaters to control privacy/accuracy tradeoff.  Smaller epsilon will be more private, but less accurate. ","metadata":{}},{"cell_type":"code","source":"for epsilon in [4.0, 0.1]:\n    private_reader = from_connection(\n        pums, metadata=metadata, \n        privacy=Privacy(epsilon=epsilon, delta=1/1000)\n    )\n    print(f\"epsilon is: {epsilon}\")\n    result = private_reader.execute_df(query)\n    print(result)\n    print()","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Execute normal query \nCalling the underlying `Reader` directly will give the exact result.","metadata":{}},{"cell_type":"code","source":"result = private_reader.reader.execute_df(query)\n\nprint(result)","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Metadata file\n\nThe `PrivateReader` needs some metadata that describes the data source.  Differentially private processing needs to know which columns can be used in numeric computations, as well as information about the sensitivity of data, and which column is the private identifier.  Metadata should be provided by the data owner, and should not be data-dependent.  For example, the acceptable range for the `age` column should be domain-specific, and should not use the actual minimum and maximum values from the data:","metadata":{}},{"cell_type":"code","source":"import snsql\nfrom pathlib import Path\nmetadata_path = Path(\"./PUMS.yaml\")\nmetadata_string = \"\"\"PUMS:\n  PUMS:\n    PUMS:\n      row_privacy: True\n      rows: 1000\n      age:\n        type: int\n        lower: 0\n        upper: 100\n      sex:\n        type: string\n      educ:\n        type: string\n      race:\n        type: string\n      income:\n        type: int\n        lower: 0\n        upper: 500000\n      married:\n        type: string\n\"\"\"\nwith open(metadata_path, \"w+\") as f:\n    f.write(metadata_string)\n    \nmeta = snsql.metadata.Metadata.from_file(metadata_path)\nprint(meta)","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"Metadata can be also supply as a dictionary.","metadata":{}},{"cell_type":"code","source":"metadata = {\n  \"PUMS\": {\n    \"PUMS\": {\n      \"PUMS\": {\n        \"row_privacy\": True,\n        \"rows\": 1000,\n        \"age\": {\n          \"type\": \"int\",\n          \"lower\": 0,\n          \"upper\": 100\n        },\n        \"sex\": {\n          \"type\": \"string\"\n        },\n        \"educ\": {\n          \"type\": \"string\"\n        },\n        \"race\": {\n          \"type\": \"string\"\n        },\n        \"income\": {\n          \"type\": \"int\",\n          \"lower\": 0,\n          \"upper\": 500000\n        },\n        \"married\": {\n          \"type\": \"string\"\n        }\n      }\n    }\n  }\n}\n\nmeta = snsql.metadata.Metadata.from_dict(metadata)\nprint(meta)","metadata":{"trusted":true},"execution_count":null,"outputs":[]}]}