{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XsO5u5zaaxw1"
      },
      "source": [
        "# ReadMe\n",
        "\n",
        "Below is the code for running our fingerprinting implementations. All referenced needed files are in the supplementary materials.\n",
        "\n",
        "**Targeted Fingerprinting**\n",
        "\n",
        "The section for targeted fingerprinting is below, titled \"Targeted Fingerprinting\". First comes helper functions. Then, the main code for targeted fingerprinting is under section \"Main Targeted Fingerprinting Code.\"\n",
        "\n",
        "The main data file needed to run the code from a dataset is \"ints.csv.\" Note that we load the data file using code from https://github.com/gaborgulyas/constrainted_fingerprinting. \"ints.csv\" is a file representing our n x d matrix A. Specifically, there are n users and d attributes. At each entry A[i][j] for all i and j, A[i][j] = 1 denotes that user i is in set j, and A[i][j] = 0 denotes that user i is not in set j. We provide the correponding data files for the datasets that we use in the paper.\n",
        "\n",
        "**Quality Testing for Targeted Fingerprinting**\n",
        "\n",
        "To compare the accuracy of our implementation to that of https://github.com/gaborgulyas/constrainted_fingerprinting, the code is under section \"Quality Section.\" The main data file needed again is \"ints.csv.\" The output from the comparison code is given in \"compTargeted.csv\", also included. This file should be included under the \"results\" folder that will be created to store the results of running our targeted fingerprinting algorithm.\n",
        "\n",
        "**General Fingerprinting**\n",
        "\n",
        "The section for general fingerprinting is under section \"General Fingerprinting.\" Helper functions come first, then the main code is under section \"Main General Fingerprinting Code.\"\n",
        "\n",
        "Similar to targeted fingerprinting, the main data file needed to run the code from a dataset is \"ints.csv.\"  \"ints.csv\" is a file representing our n x d matrix A. Specifically, there are n users and d attributes. At each entry A[i][j] for all i and j, A[i][j] = 1 denotes that user i is in set j, and A[i][j] = 0 denotes that user i is not in set j. We provide the corresponding data files for the datasets that we use in the paper.\n",
        "\n",
        "**Quality Testing for General Fingerprinting**\n",
        "\n",
        "To compare the accuracy of our implementation to that of https://github.com/gaborgulyas/constrainted_fingerprinting, the code is under section \"Quality Testing for General Fingerprinting.\" The main data file needed again is \"ints.csv.\" The output from the comparison code is given in \"g_compare.csv\", also included here. This file should be included under the \"results\" folder that will be created to store the results of running our targeted fingerprinting algorithm.\n",
        "\n",
        "**Feature Selection**\n",
        "\n",
        "The section for general fingerprinting is under section \"Feature Selection.\"\n",
        "\n",
        "The main data file needed to run the code from a dataset is \"ints.csv.\"  \"ints.csv\" is a file representing our n x d matrix A. Specifically, there are n users and d features. At each entry A[i][j] for all i and j, A[i][j] = x denotes that user i has value x for feature j. We provide the corresponding data files for the datasets that we use in the paper."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1GQLcQ63fGR_"
      },
      "source": [
        "# Targeted Fingerprinting"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "XtCPTEI2Kj08"
      },
      "outputs": [],
      "source": [
        "import sys\n",
        "import time\n",
        "import os\n",
        "import numpy as np\n",
        "import csv\n",
        "import math\n",
        "import json\n",
        "from math import log2"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "VLXBILPDKkH5"
      },
      "outputs": [],
      "source": [
        "#offline targeted fingerprinting algorithm from https://github.com/gaborgulyas/constrainted_fingerprinting\n",
        "#takes users, attr as sets\n",
        "#for each attribute, finds all users that would overlap\n",
        "def greedy_individual_fingerprint(uid, Users, Attrs):\n",
        "\tuser_set = set(Users.keys())\n",
        "\tusers = {}\n",
        "\n",
        "\tfor a in Attrs.keys():\n",
        "\t\tif len(Attrs[a]) == 0:\n",
        "\t\t\tcontinue\n",
        "\t\t#print(\"HERE\")\n",
        "\t\t#print(type(user_set))\n",
        "\t\t#print(type(Attrs[a]))\n",
        "\t\tif a in uid: #modified\n",
        "\t\t\tusers[a] = Attrs[a]\n",
        "\t\telse:\n",
        "\t\t\tusers[a] = user_set - Attrs[a]\n",
        "\n",
        "\tmask = []\n",
        "\tanon_set = set(Users.keys())\n",
        "\twhile len(anon_set) > 1:\n",
        "\t\tavail_attrs = set(users.keys()) - set(mask)\n",
        "\n",
        "\t\tmin_a = None\n",
        "\t\tfor a in avail_attrs:\n",
        "\t\t\tif min_a == None:\n",
        "\t\t\t\tmin_a = a\n",
        "\t\t\t\tcontinue\n",
        "\n",
        "\t\t\tif len(users[a] & anon_set) < len(users[min_a] & anon_set):\n",
        "\t\t\t\tmin_a = a\n",
        "\n",
        "\t\tif len(users[min_a] & anon_set) == len(anon_set):\n",
        "\t\t\tbreak\n",
        "\n",
        "\t\tif min_a in uid:\n",
        "\t\t\tmask.append(min_a)\n",
        "\t\telse:\n",
        "\t\t\tif min_a == 0:\n",
        "\t\t\t\tmask.append(-0.01) # otherwise sign of 0 could not be distinguished (remained due to \"historical\" reasons :) )\n",
        "\t\t\telse:\n",
        "\t\t\t\tmask.append(-min_a)\n",
        "\t\tanon_set = anon_set & users[min_a]\n",
        "\n",
        "\treturn mask"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "y9XtKy5ggJBY"
      },
      "outputs": [],
      "source": [
        "#input: filepath (string), sampleProb (floating point that is <= 1)\n",
        "#reads in file which holds an array, samples rows via sampleProb, and returns these sampled rows\n",
        "\n",
        "#doing the uniform sampling from the original dataset here\n",
        "#can store in either ints or bools\n",
        "def getSampleInt(filePath, sampleProb):\n",
        "\n",
        "\tintsData = []\n",
        "\tsubsampleProb = sampleProb\n",
        "\n",
        "\tf = open(filePath, \"r\") #usually reading from ints.csv\n",
        "\tcsv_reader = csv.reader(f)\n",
        "\tfor row in csv_reader:\n",
        "\t\tif np.random.choice([True, False], p=[subsampleProb, 1 - subsampleProb]):\n",
        "\t\t\tintsData.append(row)\n",
        "\tf.close()\n",
        "\n",
        "\t#turning into ints\n",
        "\tintUsers = list(map(lambda lst: list(map(int, lst)), intsData))\n",
        "\tnumUsers = len(intUsers)\n",
        "\tprint(\"number of sampled users: \" + numUsers)\n",
        "\treturn intsData\n",
        "\n",
        "def getSampleBool(filePath, sampleProb): #test this function to make sure its right\n",
        "\tboolData = []\n",
        "\tsubsampleProb = sampleProb\n",
        "\n",
        "\tf = open(filePath, \"r\")\n",
        "\tcsv_reader = csv.reader(f)\n",
        "\tfor row in csv_reader:\n",
        "\t\tif np.random.choice([True, False], p = [subsampleProb, 1-subsampleProb]):\n",
        "\t\t\t#converting to bools here\n",
        "\t\t\tboolData.append([bool(int(x)) for x in row])\n",
        "\tprint(\"number of sampled users:\")\n",
        "\tprint(len(boolData))\n",
        "\tf.close()\n",
        "\treturn boolData"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "rHkrFIKXhKQI"
      },
      "outputs": [],
      "source": [
        "def convertFormat(recoveredUsers, recoveredSets): #converting the sets into format to use offline alg\n",
        "\tUsers = {}\n",
        "\tAttrs = {}\n",
        "\tfor ix, row in enumerate(recoveredUsers):\n",
        "\t\tindices = np.where(row == 1)[0].tolist()\n",
        "\t\tUsers[ix] = set([int(x) for x in indices])\n",
        "\tfor ix, row in enumerate(recoveredSets):\n",
        "\t\tindices = np.where(row == 1)[0].tolist()\n",
        "\t\tAttrs[ix] = set([int(x) for x in indices])\n",
        "\treturn Users, Attrs\n",
        "\n",
        "def writeRecovered(recoveredUsers, recoveredSets): #for alternate testing purposes\n",
        "\tcompressInts = open(\"compInts.csv\", \"w\")\n",
        "\tcsv_writeCI = csv.writer(compressInts)\n",
        "\tcompressIntsT = open(\"compIntsT.csv\", \"w\")\n",
        "\tcsv_writeCIT = csv.writer(compressIntsT)\n",
        "\n",
        "\tindices_list = []\n",
        "\tfor row in recoveredUsers:\n",
        "\t\t# Find indices where value is 1\n",
        "\t\tindices = np.where(row == 1)[0]\n",
        "\t\t# Append indices to the list\n",
        "\t\tindices_list.append(indices)\n",
        "\tcsv_writeCI.writerows(indices_list)\n",
        "\tindices_list_t=[]\n",
        "\tfor row in recoveredSets:\n",
        "\t\tindices = np.where(row == 1)[0]\n",
        "\t\tindices_list_t.append(indices)\n",
        "\tcsv_writeCIT.writerows(indices_list_t)\n",
        "\n",
        "\tcompressInts.close()\n",
        "\tcompressIntsT.close()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "-sNZH8LsbCQF"
      },
      "outputs": [],
      "source": [
        "def loadOrgDataset(filePath, numUsers): #only do it for the number of users needed for RAM reasons\n",
        "\tf = open(filePath, \"r\") #probably ints.csv\n",
        "\tcsv_reader = csv.reader(f)\n",
        "\torgData = []\n",
        "\tcounter = 0\n",
        "\tfor row in csv_reader:\n",
        "\t\tif counter > numUsers:\n",
        "\t\t\tbreak\n",
        "\t\tcounter = counter + 1\n",
        "\t\torgData.append(row)\n",
        "\tf.close()\n",
        "\torgUsers = list(map(lambda lst: list(map(int, lst)), orgData))\n",
        "\treturn orgUsers"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "mfLudJouF-KT"
      },
      "outputs": [],
      "source": [
        "def fingerprinting(outputFilePath, numUsers, orgUsers, Users, Attr):\n",
        "\t#writing results to file\n",
        "\tif not os.path.exists(\"results\"):\n",
        "\t\tos.mkdir(\"results\")\n",
        "\tf = open(outputFilePath, \"w\") #probability results/targeted.csv\n",
        "\n",
        "  #calculating execution time\n",
        "\tstart_time = time.time()\n",
        "\t#Calculating individual fingerprints\n",
        "\tfor uid in range(numUsers): #modified here\n",
        "\t\t#print(\"$\", uid, end=' ')\n",
        "\t\tmask = []\n",
        "\t\t#person to look at\n",
        "\t\tuserOfInterest = np.array(orgUsers[uid])\n",
        "\n",
        "\t\t#turn to other paper format\n",
        "\t\totherFormat = np.where(userOfInterest == 1)[0]\n",
        "\n",
        "\t\tmask = greedy_individual_fingerprint(otherFormat, Users, Attr)\n",
        "\t\t#print(mask)\n",
        "\n",
        "\t\tf.write(\";\".join([str(x) for x in mask])+\"\\n\")\n",
        "\tprint(\"--- %s seconds ---\" % (time.time() - start_time))\n",
        "\tf.close()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "urXO9htWamKS"
      },
      "source": [
        "# Main Targeted Fingerprinting Code"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#first, perform sampling\n",
        "datafileName = \"ints.csv\"\n",
        "subsamplingProb = 0.1 #you can change the subsampling rate here\n",
        "subsampledMatrix = getSampleBool(datafileName, subsamplingProb)\n",
        "\n",
        "#then sample edges \n",
        "subsampleMatrix_np = np.array(subsampledMatrix)\n",
        "# Create a mask where we flip the 1's based on the probability p \n",
        "p = 0.6\n",
        "flip_mask = (subsampleMatrix_np == 1) & (np.random.rand(*subsampleMatrix_np.shape) < p)\n",
        "# Apply the mask to flip the 1's to 0's\n",
        "subsampleMatrix_np[flip_mask] = 0\n",
        "\n",
        "#converting format \n",
        "subsampleMatrix_np_sets = subsampleMatrix_np.T\n",
        "Users, Attr = convertFormat(subsampleMatrix_np, subsampleMatrix_np_sets) #converting format to that of offline algorithm\n",
        "\n",
        "numUsersToTest = 15000 #this is the number of users that we want to get the targeted fingerprint of. Note that we still get the targeted fingerprint in regards to the ENTIRE dataset.\n",
        "orgUsers = loadOrgDataset(\"ints.csv\", numUsersToTest)\n",
        "\n",
        "fingerprinting(\"results/targeted.csv\", numUsersToTest, orgUsers, Users, Attr)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2R-SUvRni2hJ"
      },
      "source": [
        "## Quality Section"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "sLMxdh-YosX6"
      },
      "outputs": [],
      "source": [
        "import csv\n",
        "import numpy as np"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "85OUMosJjE10"
      },
      "outputs": [],
      "source": [
        "def loadingResults(ourFilePath, compFilePath):\n",
        "\t#opening both files\n",
        "\tf = open(ourFilePath, \"r\") #\"results/targeted.csv\"\n",
        "\tcsv_reader = csv.reader(f, delimiter=';')\n",
        "\ttargeted = []\n",
        "\tfor row in csv_reader:\n",
        "\t\ttargeted.append(row)\n",
        "\tf.close()\n",
        "\n",
        "\tf2 = open(compFilePath, \"r\") #compTargeted.csv\n",
        "\tcsv_reader2 = csv.reader(f2, delimiter=';')\n",
        "\tcompared = []\n",
        "\tfor row in csv_reader2:\n",
        "\t\tcompared.append(row)\n",
        "\tf2.close()\n",
        "\n",
        "\treturn targeted, compared"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "R4TRPCCscVKz"
      },
      "outputs": [],
      "source": [
        "def compareSeparation(targeted, compared, intAttr, testingUsers, maxFSize, numUsers): #intAttr should be original dataset\n",
        "#testingUsers = how many users to look at for accuracy purposes\n",
        "#maxFSize = maximum fingerprint size to compare\n",
        "#numUsers = number of original users\n",
        "\tfingerprint = [[] for _ in range(testingUsers)]\n",
        "\tcompPrint = [[] for _ in range(testingUsers)]\n",
        "\n",
        "\tsepFin = [[] for _ in range(testingUsers)]\n",
        "\tsepComp = [[] for _ in range(testingUsers)]\n",
        "\n",
        "\tfor numA in range(maxFSize):\n",
        "\t\tsumUsers = 0 #amount of users with a certain fingerprint size\n",
        "\t\tusersSeparated = 0\n",
        "\t\tcompUsersSeparated = 0\n",
        "\t\tfor user in range(testingUsers):\n",
        "\t\t\tif len(compared[user]) <= numA or len(targeted[user]) <= numA:\n",
        "\t\t\t\tcontinue\n",
        "\t\t\tif targeted[user][numA] == '-0.01' or compared[user][numA] == '-0.01':\n",
        "\t\t\t\tcontinue\n",
        "\t\t\tsumUsers = sumUsers + 1\n",
        "\n",
        "\t\t\tattr = int(targeted[user][numA])\n",
        "\t\t\tcompAttr = abs(int(compared[user][numA]))\n",
        "\n",
        "\t\t\tfingerprint[user].append(intAttr[attr])\n",
        "\t\t\tcompPrint[user].append(intAttr[compAttr])\n",
        "\n",
        "\t\t\t#if orgUsers[user][attr] == 0: #to subtract off the user\n",
        "\t\t\tif intAttr[attr][user] == 0: #to subtract off the user\n",
        "\t\t\t\tsepFin[user].append(np.zeros(numUsers))\n",
        "\t\t\telse:\n",
        "\t\t\t\tsepFin[user].append(np.ones(numUsers))\n",
        "\n",
        "\t\t\tif intAttr[compAttr][user] == 0:\n",
        "\t\t\t\tsepComp[user].append(np.zeros(numUsers))\n",
        "\t\t\telse:\n",
        "\t\t\t\tsepComp[user].append(np.ones(numUsers))\n",
        "\n",
        "\t\t\t#after subtracting off user\n",
        "\t\t\tsep = np.array(fingerprint[user]).T-np.array(sepFin[user]).T\n",
        "\t\t\tcompSep = np.array(compPrint[user]).T-np.array(sepComp[user]).T\n",
        "\n",
        "\t\t\t#counting number of nonzero rows\n",
        "\t\t\tfinRows = np.any(sep != 0, axis=1)\n",
        "\t\t\tusersSeparated = usersSeparated + np.sum(finRows)\n",
        "\t\t\tcompRows = np.any(compSep != 0, axis = 1)\n",
        "\t\t\tcompUsersSeparated = compUsersSeparated + np.sum(compRows)\n",
        "\n",
        "\t\tif (sumUsers == 0):\n",
        "\t\t\tcontinue\n",
        "\t\tavgSep = usersSeparated / sumUsers\n",
        "\t\tavgCompSep = compUsersSeparated / sumUsers\n",
        "\n",
        "\t\tpercentSep = avgSep / numUsers\n",
        "\t\tcompPercentSep = avgCompSep / numUsers\n",
        "\n",
        "\t\tprint(f\"Fingerprint size {numA}:\")\n",
        "\t\tprint(f\"Num Users: {sumUsers}\")\n",
        "\t\tprint(f\"Avg separation: {avgSep}\")\n",
        "\t\tprint(f\"Their avg separation: {avgCompSep}\")\n",
        "\t\tprint(f\"Our separation: {percentSep}\")\n",
        "\t\tprint(f\"Comp separation: {compPercentSep}\")\n",
        "\t\tprint('-' * 10)\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "kDrtuBywbspP"
      },
      "outputs": [],
      "source": [
        "#Run this cell for accuracy testing\n",
        "targeted, compared = loadingResults(\"results/targeted.csv\", \"results/compTargeted.csv\")\n",
        "\n",
        "#get original users\n",
        "numUsers = 2458285 #number of users in original dataset.\n",
        "fullDataSetFileName = \"ints.csv\"\n",
        "orgUsers = loadOrgDataset(fullDataSetFileName, numUsers)\n",
        "\n",
        "#also getting the transpose for easier access to columns\n",
        "intAttr = np.transpose(np.array(orgUsers)).tolist()\n",
        "\n",
        "numUsersToComp = 15 #we look at the fingerprint of numUserstoComp different target users\n",
        "#note we still determine accuracy against entire dataset\n",
        "fingLength = 9 #looking up to a fingerprint length of this size\n",
        "compareSeparation(targeted, compared, intAttr, numUsersToComp, fingLength, numUsers)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-H0BkUzWuSIG"
      },
      "source": [
        "## General Fingerprinting"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "id": "vD_9iGI6eSsp"
      },
      "outputs": [],
      "source": [
        "#methods for perform l0 sketch\n",
        "def l0Sketch(n, K, L):\n",
        "    H = np.zeros([L, n])\n",
        "    #optimizing\n",
        "    H[0]=1\n",
        "    H[1:] = np.random.randint(2, size=[L - 1, n])\n",
        "    H = H.cumprod(axis=0) * H[0]\n",
        "    return H\n",
        "\n",
        "def applySketch(col, H, h, K, L):\n",
        "    n = len(col)\n",
        "    B = np.zeros([L, K], dtype = int)\n",
        "    h = h.astype(int)\n",
        "    for i in range(L):\n",
        "        z = H[i] * col\n",
        "        z = z.astype(int)\n",
        "        np.add.at(B[i], h, z)\n",
        "    return(B)\n",
        "\n",
        "def returnEstimate(sketch):\n",
        "    d = 0\n",
        "    c = 50\n",
        "    for i in range(len(sketch)):\n",
        "        if np.count_nonzero(sketch[i]) >= c:\n",
        "            d = i\n",
        "    return np.count_nonzero(sketch[d]) * (2 ** d)\n",
        "\n",
        "def getSamplingSketch(col, L):\n",
        "    sampled_indices = np.random.randint(0, len(col), size= 500 * L, dtype=int)\n",
        "    sampled_values = np.array(col)[sampled_indices]\n",
        "    return sampled_values"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "id": "f20jAIMdbSXW"
      },
      "outputs": [],
      "source": [
        "#for loading dataset files\n",
        "def loadOrgDataset(filePath):\n",
        "\tf = open(filePath, \"r\") #probably ints.csv\n",
        "\tcsv_reader = csv.reader(f)\n",
        "\torgData = []\n",
        "\tfor row in csv_reader:\n",
        "\t\torgData.append(row)\n",
        "\tf.close()\n",
        "\torgUsers = list(map(lambda lst: list(map(int, lst)), orgData))\n",
        "\n",
        "\t#also getting the transpose for easier access to columns\n",
        "\tintAttr = np.transpose(np.array(orgUsers)).tolist()\n",
        "\treturn intAttr"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {
        "id": "SIK3uEKwvfgM"
      },
      "outputs": [],
      "source": [
        "#creating sketches\n",
        "def sketchingL0(numB, numL, numUsers, numAttr, intAttr): #number of buckets, number of sampling levels\n",
        "\t#with linear sketches\n",
        "\tH = l0Sketch(numUsers, numB, numL)\n",
        "\th = np.random.choice(numB, size = numUsers)\n",
        "\tsketches = []\n",
        "\tfor attr in range(numAttr): #compute linear sketches for each column\n",
        "\t\tsketchToAp = applySketch(intAttr[attr], H, h, numB, numL)\n",
        "\t\tsketches.append(np.where(sketchToAp != 0, 1, 0))\n",
        "\tsketches = np.array(sketches)\n",
        "\n",
        "\t#creating l0 sketch of 1's for subtracting off user later\n",
        "\tsketchingOnes = applySketch(np.ones(numUsers), H, h, numB, numL)\n",
        "\tsketchOfOnes = np.where(sketchingOnes != 0, 1, 0)\n",
        "\n",
        "\treturn sketches, sketchOfOnes\n",
        "\n",
        "def samplingSketches(numAttr, numIter, intAttr):\n",
        "\tsampSketches = []\n",
        "\tfor attr in range(numAttr):\n",
        "\t\tsampSketches.append(getSamplingSketch(intAttr[attr], numIter))\n",
        "\tsampSketches = np.array(sampSketches)\n",
        "\treturn sampSketches\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "metadata": {
        "id": "ky4vE5OMPXki"
      },
      "outputs": [],
      "source": [
        "def reduceToVector(fingerprint_cols, new_sketch, sampling_cols, new_samp_sketch):\n",
        "\t#appending two columns together\n",
        "\tintermed_sketch = np.array([fingerprint_cols, new_sketch])\n",
        "\tg = np.random.randint(-50, (50), size=[2, 1])\n",
        "\tnew_sketch = np.transpose(intermed_sketch, (1, 2, 0))\n",
        "\tnew_sketch = np.squeeze(np.dot(new_sketch, g))\n",
        "\n",
        "\t#also doing it to sample sketch\n",
        "\tintermed_samp_sketch = np.array([sampling_cols, new_samp_sketch])\n",
        "\tnew_samp_sketch = np.squeeze(np.dot(intermed_samp_sketch.T, g))\n",
        "\treturn new_sketch, new_samp_sketch\n",
        "\n",
        "def subtractBFreq(freq, unique_elements, prop, fingLength, subtractRound, sketchOfOnes, new_sketch, numUsers, num_samples):\n",
        "\tfreq_of_highest_elements = []\n",
        "  #subtracting off highest freq item\n",
        "\tmax_freq_index = np.argmax(freq)\n",
        "\tmax_item = unique_elements[max_freq_index]\n",
        "\tif freq[max_freq_index] >= prop and fingLength <= subtractRound:\n",
        "\t\tsubtractor = sketchOfOnes * max_item\n",
        "\t\testimate_sketch = new_sketch - subtractor\n",
        "\t\testimate = returnEstimate(estimate_sketch)\n",
        "\t\tfreqT = numUsers - estimate\n",
        "\t\tif freqT >= 0:\n",
        "\t\t\tfreq_of_highest_elements.append(freqT)\n",
        "\n",
        "\tcount = 0\n",
        "\tfor element in unique_elements:\n",
        "\t\tif element == max_item and fingLength <= subtractRound:\n",
        "\t\t\tcontinue\n",
        "\t\tif freq[count]>= prop:\n",
        "\t\t\tfreqT = (freq[count] * numUsers) / num_samples\n",
        "\t\t\tif freqT >= 0:\n",
        "\t\t\t\tfreq_of_highest_elements.append(freqT)\n",
        "\t\tcount = count + 1\n",
        "\treturn freq_of_highest_elements"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {
        "id": "dPGww3GzNfbF"
      },
      "outputs": [],
      "source": [
        "def generalFing(max_fin_size, resultPath, sketches, sampSketches, subtractRound,sketchOfOnes,numUsers):\n",
        "\t#writing results to file\n",
        "\tif not os.path.exists(\"g_results\"):\n",
        "\t\tos.mkdir(\"g_results\")\n",
        "\tf2 = open(resultPath, \"w\")\n",
        "\tcsv_writer = csv.writer(f2)\n",
        "\n",
        "\t#going from 2 until the max\n",
        "\tfor fin_size in range(2, max_fin_size + 1):\n",
        "\t\tfingerprint = []\n",
        "\t\tfingerprint_cols = []\n",
        "\t\tsampling_cols = []\n",
        "\t\tdebug_cols = []\n",
        "\n",
        "\t\twhile(len(fingerprint) < fin_size):\n",
        "\t\t\tsep = -1\n",
        "\t\t\tgreedy_attr = -1\n",
        "\t\t\tattr_sketch = []\n",
        "\t\t\tsamp_attr_sketch = []\n",
        "\n",
        "\t\t\tfor attr in range(numAttr): #going through all remaining attributes\n",
        "\t\t\t\tif attr in fingerprint:\n",
        "\t\t\t\t\tcontinue\n",
        "\t\t\t\tnew_sketch = sketches[attr] #getting this attribute sketch\n",
        "\t\t\t\tnew_samp_sketch = sampSketches[attr]\n",
        "\n",
        "\t\t\t\tif len(fingerprint) != 0 and len(fingerprint) <= subtractRound:\n",
        "\t\t\t\t\tnew_sketch, new_samp_sketch = reduceToVector(fingerprint_cols, new_sketch, sampling_cols, new_samp_sketch)\n",
        "\n",
        "\t\t\t\tunique_elements, freq = np.unique(new_samp_sketch, return_counts=True)\n",
        "\t\t\t\tnum_samples = len(new_samp_sketch)\n",
        "\t\t\t\tprop =  num_samples * 0.0016\n",
        "\n",
        "\t\t\t\t#subtracting off highest elements, finding frequency of highest elements\n",
        "\t\t\t\tfreq_of_highest_elements = subtractBFreq(freq, unique_elements, prop, len(fingerprint), subtractRound, sketchOfOnes, new_sketch, numUsers, num_samples)\n",
        "\t\t\t\tsquared_freq = np.square(freq_of_highest_elements)\n",
        "\t\t\t\tseparation = (numUsers ** 2) - np.sum(squared_freq)\n",
        "\t\t\t\tif separation > sep:\n",
        "\t\t\t\t\tsep = separation\n",
        "\t\t\t\t\tgreedy_attr = attr\n",
        "\t\t\t\t\tattr_sketch = new_sketch\n",
        "\t\t\t\t\tsamp_attr_sketch = new_samp_sketch\n",
        "\n",
        "\t\t\tfingerprint.append(greedy_attr)\n",
        "\t\t\tdebug_cols.append(intAttr[greedy_attr])\n",
        "\t\t\tfingerprint_cols = attr_sketch\n",
        "\t\t\tsampling_cols = samp_attr_sketch\n",
        "\n",
        "\t\tprint(f\"{fin_size}: {fingerprint}\")\n",
        "\t\tcsv_writer.writerow(fingerprint)\n",
        "\tf2.close()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8rIOdMnvgM-L"
      },
      "source": [
        "# Main General Fingerprinting Code"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ufo8vdg_gMMN"
      },
      "outputs": [],
      "source": [
        "#loading original dataset\n",
        "intAttr = loadOrgDataset(\"ints.csv\")\n",
        "numAttr = len(intAttr)\n",
        "numUsers = len(intAttr[0])\n",
        "\n",
        "#calculating execution time\n",
        "start_time = time.time()\n",
        "\n",
        "#creating l0 sketches\n",
        "numB = 800 #can change number of buckets here\n",
        "numL = int((np.log2(numUsers) + 1)/3) #can change number of sampling levels here\n",
        "sketches, sketchOfOnes = sketchingL0(numB, numL, numUsers, numAttr, intAttr)\n",
        "\n",
        "#making sampling sketches\n",
        "numIter = 7\n",
        "sampSketches = samplingSketches(numAttr, numIter, intAttr)\n",
        "\n",
        "subtractRound = 4 #after this many rounds, estimate freqs with just sampling\n",
        "max_fin_size = 20\n",
        "\n",
        "#now getting general fingerprint\n",
        "resultPath = \"g_results/generalized.csv\"\n",
        "generalFing(max_fin_size, resultPath, sketches, sampSketches, subtractRound,sketchOfOnes,numUsers)\n",
        "print(\"--- %s seconds ---\" % (time.time() - start_time))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4d-elExXuA_E"
      },
      "source": [
        "# Quality Testing for General Fingerprinting"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8_C0o21YuPGC"
      },
      "outputs": [],
      "source": [
        "import csv\n",
        "import math\n",
        "import numpy as np"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "hDKx_SSNu-Hj"
      },
      "outputs": [],
      "source": [
        "def loadResults(thisPaperPath, compPath):\n",
        "\t#opening both files\n",
        "\tf2 = open(thisPaperPath, \"r\")\n",
        "\tcsv_reader2 = csv.reader(f2)\n",
        "\tgen = []\n",
        "\tfor row in csv_reader2:\n",
        "\t\tgen.append(row)\n",
        "\tf2.close()\n",
        "\n",
        "\tf = open(compPath, \"r\")\n",
        "\tcsv_reader1 = csv.reader(f)\n",
        "\tcompared = []\n",
        "\tfor row in csv_reader1:\n",
        "\t\tcompared.append(row)\n",
        "\tf.close()\n",
        "\n",
        "\treturn gen, compared"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "-4OkXrRVumz4"
      },
      "outputs": [],
      "source": [
        "#run this cell for general fingerprinting accuracy\n",
        "\n",
        "#loading dataset in\n",
        "intAttr = loadOrgDataset(\"ints.csv\")\n",
        "numUsers = len(intAttr[0])\n",
        "numAttr = len(intAttr)\n",
        "\n",
        "thisPaperPath = \"g_results/generalized.csv\"\n",
        "compPath = \"g_results/g_compare.csv\"\n",
        "\n",
        "gen, compared = loadResults(thisPaperPath, compPath)\n",
        "fin_size = 20 #max fingerprint size\n",
        "\n",
        "fingerprint = gen[fin_size - 2]\n",
        "compPrint = compared[fin_size - 2]\n",
        "\n",
        "fingerCol = []\n",
        "compPrintCol = []\n",
        "\n",
        "for numA in range(fin_size):\n",
        "\tprint(f\"fingerprint size: {numA}\")\n",
        "\tfingerCol.append(intAttr[int(fingerprint[numA])])\n",
        "\tcompPrintCol.append(intAttr[int(compPrint[numA])])\n",
        "\n",
        "\tunique_rows, counts = np.unique(np.array(fingerCol).T, axis=0, return_counts=True)\n",
        "\tourPercent = (len(unique_rows) / numUsers) * 100\n",
        "\tprint(f\"our percent of unique rows: {ourPercent}\")\n",
        "\n",
        "\tunique_rows1, counts1 = np.unique(np.array(compPrintCol).T, axis=0, return_counts=True)\n",
        "\tcompPerccent = (len(unique_rows1) / numUsers) * 100\n",
        "\n",
        "    #percent_to_print = 1 - (len(unique_rows1) - len(unique_rows)) / len(unique_rows1)\n",
        "    #print(f\"Our separation divided by their separation: {percent_to_print}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AJzigB31kT4y"
      },
      "source": [
        "# Feature Selection"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ILsSLdX_RheR",
        "outputId": "8d89e8f5-c057-493b-fb20-4d62d1c89e1d"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Collecting ucimlrepo\n",
            "  Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)\n",
            "Requirement already satisfied: pandas>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from ucimlrepo) (2.0.3)\n",
            "Requirement already satisfied: certifi>=2020.12.5 in /usr/local/lib/python3.10/dist-packages (from ucimlrepo) (2024.2.2)\n",
            "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.0->ucimlrepo) (2.8.2)\n",
            "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.0->ucimlrepo) (2023.4)\n",
            "Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.0->ucimlrepo) (2024.1)\n",
            "Requirement already satisfied: numpy>=1.21.0 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.0->ucimlrepo) (1.25.2)\n",
            "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas>=1.0.0->ucimlrepo) (1.16.0)\n",
            "Installing collected packages: ucimlrepo\n",
            "Successfully installed ucimlrepo-0.0.7\n"
          ]
        }
      ],
      "source": [
        "%pip install ucimlrepo"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "id": "bRDyA_nFlYB3"
      },
      "outputs": [],
      "source": [
        "import sys\n",
        "import time\n",
        "import os\n",
        "import numpy as np\n",
        "import csv\n",
        "import math\n",
        "from math import log2\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "import matplotlib.pyplot as plt\n",
        "import csv\n",
        "import time\n",
        "from ucimlrepo import fetch_ucirepo\n",
        "import pickle\n",
        "import os\n",
        "import urllib.request\n",
        "from sklearn.cluster import KMeans\n",
        "from sklearn.datasets import make_blobs"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 123,
      "metadata": {
        "id": "g9BpmR51Sdz5"
      },
      "outputs": [],
      "source": [
        "def reducedKMeans(intAttr, featuresToUse, numClusters):\n",
        "\n",
        "\t#first isolating to the features picked\n",
        "\tintsModData = (np.array(intAttr))[featuresToUse, :]\n",
        "\ttransposed_array = np.array(intsModData).T\n",
        "\n",
        "\t#doing kmeans\n",
        "\tstart_time = time.time()\n",
        "\tkmeans = KMeans(n_clusters = numClusters)\n",
        "\tkmeans.fit(transposed_array)\n",
        "\tprint(\"--- %s seconds ---\" % (time.time() - start_time))\n",
        "\tcentroids = kmeans.cluster_centers_\n",
        "\tlabels = kmeans.labels_\n",
        "\treturn centroids, labels\n",
        "\n",
        "def getLabels(filePath):\n",
        "\twith open(filePath, \"r\") as f:\n",
        "\t\ty_list = []\n",
        "\t\tcsv_reader = csv.reader(f)\n",
        "\t\tfor row in csv_reader:\n",
        "\t\t\ty_list.append(int(row[0]))\n",
        "\treturn y_list\n",
        "\n",
        "def findingMajLabelInCentroids(labels, y_list):\n",
        "\tzeroOne, zeroTwo, zeroThree, oneOne, oneTwo, oneThree, twoOne, twoTwo, twoThree = 0, 0,0,0,0,0,0,0,0\n",
        "\tfor x in range(0, len(labels)):\n",
        "\t\tif labels[x] == 0:\n",
        "\t\t\tif y_list[x] == 1:\n",
        "\t\t\t\tzeroOne += 1\n",
        "\t\t\tif y_list[x] == 2:\n",
        "\t\t\t\tzeroTwo += 1\n",
        "\t\t\tif y_list[x] == 3:\n",
        "\t\t\t\tzeroThree += 1\n",
        "\t\tif labels[x] == 1:\n",
        "\t\t\tif y_list[x] == 1:\n",
        "\t\t\t\toneOne += 1\n",
        "\t\t\tif y_list[x] == 2:\n",
        "\t\t\t\toneTwo += 1\n",
        "\t\t\tif y_list[x] == 3:\n",
        "\t\t\t\toneThree += 1\n",
        "\t\tif labels[x] == 2:\n",
        "\t\t\tif y_list[x] == 1:\n",
        "\t\t\t\ttwoOne += 1\n",
        "\t\t\tif y_list[x] == 2:\n",
        "\t\t\t\ttwoTwo += 1\n",
        "\t\t\tif y_list[x] == 3:\n",
        "\t\t\t\ttwoThree += 1\n",
        "\n",
        "\t#find majority of centroids\n",
        "\tcluster_one = 1\n",
        "\tcluster_one_ct = zeroOne\n",
        "\tif zeroTwo > zeroOne:\n",
        "\t\tcluster_one = 2\n",
        "\t\tcluster_one_ct = zeroTwo\n",
        "\tif zeroThree > cluster_one_ct:\n",
        "\t\tcluster_one = 3\n",
        "\tcluster_two = 1\n",
        "\tcluster_two_ct = oneOne\n",
        "\tif oneTwo > oneOne:\n",
        "\t\tcluster_two = 2\n",
        "\t\tcluster_two_ct = oneTwo\n",
        "\tif oneThree > cluster_two_ct:\n",
        "\t\tcluster_two = 3\n",
        "\n",
        "\tcluster_three = 1\n",
        "\tcluster_three_ct = twoOne\n",
        "\tif twoTwo > twoOne:\n",
        "\t\tcluster_three = 2\n",
        "\t\tcluster_three_ct = twoTwo\n",
        "\tif twoThree > cluster_three_ct:\n",
        "\t\tcluster_three = 3\n",
        "\n",
        "\treturn cluster_one, cluster_two, cluster_three\n",
        "\n",
        "def accuracyPercent(labels, y_list, cluster_one, cluster_two, cluster_three):\n",
        "\t#counting how many are not in majority\n",
        "\terrorCluster = 0\n",
        "\tfor x in range(0, len(labels)):\n",
        "\t\tif labels[x] == 0:\n",
        "\t\t\tif y_list[x] != cluster_one:\n",
        "\t\t\t\terrorCluster += 1\n",
        "\t\tif labels[x] == 1:\n",
        "\t\t\tif y_list[x] != cluster_two:\n",
        "\t\t\t\terrorCluster += 1\n",
        "\t\tif labels[x] == 2:\n",
        "\t\t\tif y_list[x] != cluster_three:\n",
        "\t\t\t\terrorCluster += 1\n",
        "\tprint(errorCluster / len(labels))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "lXdgGXdblh-c",
        "outputId": "38f07fa1-3181-4e1b-f07a-73a0af25a4b9"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "2: [12, 9]\n",
            "3: [12, 9, 0]\n",
            "4: [12, 9, 0, 1]\n",
            "5: [12, 9, 0, 1, 2]\n",
            "6: [12, 9, 0, 1, 2, 4]\n",
            "7: [12, 9, 0, 1, 2, 4, 3]\n",
            "8: [12, 9, 0, 1, 2, 4, 3, 6]\n",
            "9: [12, 9, 0, 1, 2, 4, 3, 6, 11]\n",
            "10: [12, 9, 0, 1, 2, 4, 3, 6, 11, 5]\n",
            "--- 1.9144701957702637 seconds ---\n"
          ]
        }
      ],
      "source": [
        "#need to load general fingerprinting section first!\n",
        "#here, we are getting the reduced number of features\n",
        "#load dataset file\n",
        "intAttr = loadOrgDataset(\"ints.csv\")\n",
        "numAttr = len(intAttr)\n",
        "numUsers = len(intAttr[0])\n",
        "\n",
        "#calculating execution time\n",
        "start_time = time.time()\n",
        "\n",
        "#creating l0 sketches\n",
        "numB = 800 #can change number of buckets here\n",
        "numL = int((np.log2(numUsers) + 1)/3) #can change number of sampling levels here\n",
        "sketches, sketchOfOnes = sketchingL0(numB, numL, numUsers, numAttr, intAttr)\n",
        "\n",
        "#making sampling sketches\n",
        "numIter = 7\n",
        "sampSketches = samplingSketches(numAttr, numIter, intAttr)\n",
        "\n",
        "subtractRound = 4 #after this many rounds, estimate freqs with just sampling\n",
        "max_fin_size = 10\n",
        "\n",
        "#now getting general fingerprint\n",
        "if not os.path.exists(\"results_fs\"):\n",
        "    os.mkdir(\"results_fs\")\n",
        "resultPath = \"results_fs/feature_selection.csv\"\n",
        "generalFing(max_fin_size, resultPath, sketches, sampSketches, subtractRound,sketchOfOnes,numUsers)\n",
        "print(\"--- %s seconds ---\" % (time.time() - start_time))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 124,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "wElk7Dw-WiBc",
        "outputId": "1e253d6e-9144-4b6e-f4b3-3dbbdfefdbba"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "--- 0.014792203903198242 seconds ---\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n",
            "  warnings.warn(\n"
          ]
        }
      ],
      "source": [
        "#now doing k means on reduced features\n",
        "featuresToUse = [12, 9, 0, 1]\n",
        "numClusters = 3\n",
        "centroids, labels = reducedKMeans(intAttr, featuresToUse, numClusters)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 126,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "RMvcgh5FRD8q",
        "outputId": "cb1e2568-5f34-42f8-f87c-0bb6db669b8a"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "error:\n",
            "0.29775280898876405\n"
          ]
        }
      ],
      "source": [
        "#doing evaluation\n",
        "filePath = \"labels.csv\"\n",
        "y_list = getLabels(filePath)\n",
        "\n",
        "cluster_one, cluster_two, cluster_three = findingMajLabelInCentroids(labels, y_list)\n",
        "\n",
        "print(\"error:\")\n",
        "errorCluster = accuracyPercent(labels, y_list, cluster_one, cluster_two, cluster_three)\n",
        "#this is printing out the ERROR"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wADua5pPb_RO"
      },
      "source": [
        "#Plotting"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "znw6MCSbcFk9"
      },
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\n",
        "import seaborn as sns\n",
        "import pandas as pd"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "b-p85gphTEFI"
      },
      "outputs": [],
      "source": [
        "#targeted fingerprinting UCI \"Adult\", efficiency\n",
        "data = {\n",
        "    'sub_prob': [0.1, 0.2, 0.4, 0.6, 0.1, 0.2, 0.4, 0.6],\n",
        "    'time': [179.961, 526.306, 1536.116, 1919.39, 4393.5086, 4393.5086, 4393.5086, 4393.5086],\n",
        "    'algorithm': ['this paper', 'this paper', 'this paper', 'this paper', 'comparison', 'comparison', 'comparison', 'comparison']\n",
        "}\n",
        "\n",
        "df = pd.DataFrame(data)\n",
        "\n",
        "plt.figure(figsize=(11,11))\n",
        "sns.lineplot(x='sub_prob', y = 'time', hue='algorithm', data=df, style='algorithm', markers=True, dashes=True, markersize=18)\n",
        "#plt.legend(loc='upper right', bbox_to_anchor=[1.1,0.5], ncol=1,fontsize=26, markerscale=2)\n",
        "plt.legend(title='algorithm', loc='lower right', fontsize=20, title_fontsize='20')\n",
        "#plt.legend('')\n",
        "plt.ylabel('time (sec)', fontsize=26)\n",
        "plt.xlabel('subsampling probability', fontsize=26)\n",
        "#plt.yscale(\"log\")\n",
        "plt.title('Subsampling Probability vs Time (in seconds)',fontsize=28)\n",
        "plt.xticks(fontsize=20)\n",
        "plt.yticks(fontsize=20)\n",
        "plt.tight_layout()\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "_7d1z1W9cJSv"
      },
      "outputs": [],
      "source": [
        "#targeted fingerprinting UCI \"Adult\", user separation\n",
        "data = {\n",
        "    'card_const': [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7,1, 2, 3, 4, 5, 6, 7],\n",
        "    'user_sep': [0.8543, 0.9582, 0.9856, 0.9945, 0.9976, 0.9986, 0.9991,  0.8417, 0.9334, 0.9393, 0.9393, 0.9393, 0.9393, 0.9393, 0.8411, 0.9516,0.9758, 0.9767, 0.9767,0.9767, 0.9767, 0.8420, 0.9534, 0.9773,0.9893,0.9905,0.99164,0.9916,0.8419, 0.9535,0.9813,0.9894,0.9937,0.9969,0.9978],\n",
        "    'algorithm': ['comparison', 'comparison','comparison','comparison','comparison','comparison','comparison', 'p = 0.1','p = 0.1','p = 0.1','p = 0.1','p = 0.1','p = 0.1','p = 0.1', 'p = 0.2', 'p = 0.2','p = 0.2','p = 0.2','p = 0.2','p = 0.2','p = 0.2', 'p = 0.4','p = 0.4','p = 0.4','p = 0.4','p = 0.4','p = 0.4','p = 0.4', 'p = 0.6','p = 0.6','p = 0.6','p = 0.6','p = 0.6','p = 0.6','p = 0.6']\n",
        "}\n",
        "df = pd.DataFrame(data)\n",
        "\n",
        "plt.figure(figsize=(11,11))\n",
        "sns.lineplot(x='card_const', y = 'user_sep', hue='algorithm', data=df, style='algorithm', markers=True, dashes=True, markersize=18)\n",
        "#plt.legend(loc='upper right', bbox_to_anchor=[1.1,0.5], ncol=1,fontsize=26, markerscale=2)\n",
        "plt.legend(title='algorithm', loc='lower right', fontsize=20, title_fontsize='20')\n",
        "#plt.legend('')\n",
        "plt.ylabel('percent of total users separated', fontsize=26)\n",
        "plt.xlabel('cardinality constraint k', fontsize=26)\n",
        "#plt.yscale(\"log\")\n",
        "plt.title('Cardinality Constraint vs. Percent Users Separated',fontsize=28)\n",
        "plt.xticks(fontsize=20)\n",
        "plt.yticks(fontsize=20)\n",
        "plt.tight_layout()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "kfHRYC7mm6eC"
      },
      "outputs": [],
      "source": [
        "#targeted fingerprinting UCI \"UC Census Data (1990)\", efficiency\n",
        "data = {\n",
        "    'sub_prob': [0.1, 0.1],\n",
        "    'time': [1.22, 52.6],\n",
        "    'algorithm': ['this paper', 'comparison']\n",
        "}\n",
        "\n",
        "df = pd.DataFrame(data)\n",
        "\n",
        "plt.figure(figsize=(11,11))\n",
        "sns.lineplot(x='sub_prob', y = 'time', hue='algorithm', data=df, style='algorithm', markers=True, dashes=True, markersize=18)\n",
        "#plt.legend(loc='upper right', bbox_to_anchor=[1.1,0.5], ncol=1,fontsize=26, markerscale=2)\n",
        "plt.legend(title='algorithm', loc='lower right', fontsize=20, title_fontsize='20')\n",
        "#plt.legend('')\n",
        "plt.ylabel('average time per user (sec)', fontsize=26)\n",
        "plt.xlabel('subsampling probability', fontsize=26)\n",
        "#plt.yscale(\"log\")\n",
        "plt.title('Subsampling Probability vs Average Time Per User (in seconds)',fontsize=28)\n",
        "plt.xticks(fontsize=20)\n",
        "plt.yticks(fontsize=20)\n",
        "plt.tight_layout()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Vbza9jNKr367"
      },
      "outputs": [],
      "source": [
        "#targeted fingerprinting UCI \"US Census Data (1990)\", user separation\n",
        "data = {\n",
        "    'card_const': [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7],\n",
        "    'user_sep': [0.9322, 0.9861, 0.9955, 0.9981, 0.9990, 0.9994, 0.9997, 0.9225, 0.97845, 0.9819, 0.9878, 0.9899, 0.9914, 0.9931],\n",
        "    'algorithm': ['comparison', 'comparison','comparison','comparison','comparison','comparison','comparison','p = 0.1','p = 0.1','p = 0.1','p = 0.1','p = 0.1','p = 0.1','p = 0.1']\n",
        "}\n",
        "df = pd.DataFrame(data)\n",
        "\n",
        "plt.figure(figsize=(11,11))\n",
        "sns.lineplot(x='card_const', y = 'user_sep', hue='algorithm', data=df, style='algorithm', markers=True, dashes=True, markersize=18)\n",
        "#plt.legend(loc='upper right', bbox_to_anchor=[1.1,0.5], ncol=1,fontsize=26, markerscale=2)\n",
        "plt.legend(title='algorithm', loc='lower right', fontsize=20, title_fontsize='20')\n",
        "#plt.legend('')\n",
        "plt.ylabel('percent of total users separated', fontsize=26)\n",
        "plt.xlabel('cardinality constraint k', fontsize=26)\n",
        "#plt.yscale(\"log\")\n",
        "plt.title('Cardinality Constraint vs. Percent Users Separated',fontsize=28)\n",
        "plt.xticks(fontsize=20)\n",
        "plt.yticks(fontsize=20)\n",
        "plt.tight_layout()\n",
        "\n",
        "plt.savefig('largerTargeted.png')"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "t1OjctbqHwL8"
      },
      "outputs": [],
      "source": [
        "#general fingerprinting UCI \"Adult\" dataset, sketch size vs time (s)\n",
        "data = {\n",
        "    'sketch_size': [300, 600, 900, 1250, 300, 600, 900, 1250],\n",
        "    'time': [35.30244033, 35.30244033, 35.30244033, 35.30244033, 0.8082528, 0.813254, 0.8281, 0.8285],\n",
        "    'algorithm': ['comparison', 'comparison','comparison', 'comparison', 'this paper', 'this paper', 'this paper', 'this paper']\n",
        "}\n",
        "df = pd.DataFrame(data)\n",
        "\n",
        "plt.figure(figsize=(11,11))\n",
        "sns.lineplot(x='sketch_size', y = 'time', hue='algorithm', data=df, style='algorithm', markers=True, dashes=True, markersize=18)\n",
        "#plt.legend(loc='upper right', bbox_to_anchor=[1.1,0.5], ncol=1,fontsize=26, markerscale=2)\n",
        "plt.legend(title='algorithm', loc='lower right', fontsize=20, title_fontsize='20', bbox_to_anchor=[1.0,0.75])\n",
        "#plt.legend('')\n",
        "plt.ylabel('time(s)', fontsize=26)\n",
        "plt.xlabel('sketch size (number of CountSketch rows)', fontsize=26)\n",
        "#plt.yscale(\"log\")\n",
        "plt.title('Sketch Size vs. Time (in seconds)',fontsize=28)\n",
        "plt.xticks(fontsize=20)\n",
        "plt.yticks(fontsize=20)\n",
        "plt.tight_layout()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "rx0FBf3n_x_c"
      },
      "outputs": [],
      "source": [
        "#general fingerprinting UCI \"Adult\", accuracy\n",
        "data = {\n",
        "    'card_const': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],\n",
        "    'user_sep': [1.0, 1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.988034188, 0.9415464299, 0.8874854919, 0.9193431887, 0.8179467085, 0.972392422, 0.9831391682, 0.958476619, 0.9727631371, 1.0, 1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.9987789988, 0.9713898138, 0.9271523179, 0.9432272355, 0.9673981191, 0.9385555321, 0.8900063062, 0.9689538808, 0.9827363106, 1.0, 1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.999023199, 0.9791589592, 0.9389636103, 0.9599364376, 0.9673981191, 0.960981055, 0.9300063062, 0.9374256789, 0.9598863816, 1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.999023199, 0.9808854359, 0.9432648324, 0.9693744884, 0.9509012539, 0.8941474752, 0.9403232766, 0.9583802025, 0.9709957393],\n",
        "    'algorithm': ['300 rows', '300 rows','300 rows','300 rows','300 rows','300 rows','300 rows','300 rows','300 rows','300 rows','300 rows','300 rows','300 rows','300 rows','300 rows','300 rows','300 rows','300 rows','300 rows','300 rows', '600 rows', '600 rows', '600 rows', '600 rows', '600 rows', '600 rows', '600 rows', '600 rows', '600 rows', '600 rows', '600 rows', '600 rows', '600 rows', '600 rows', '600 rows', '600 rows', '600 rows', '600 rows', '600 rows', '600 rows', '900 rows', '900 rows', '900 rows', '900 rows', '900 rows', '900 rows', '900 rows', '900 rows', '900 rows', '900 rows', '900 rows', '900 rows', '900 rows', '900 rows', '900 rows', '900 rows', '900 rows', '900 rows', '900 rows', '900 rows', '1250 rows', '1250 rows', '1250 rows', '1250 rows', '1250 rows', '1250 rows', '1250 rows', '1250 rows', '1250 rows', '1250 rows', '1250 rows', '1250 rows', '1250 rows', '1250 rows', '1250 rows', '1250 rows', '1250 rows', '1250 rows', '1250 rows', '1250 rows']\n",
        "}\n",
        "df = pd.DataFrame(data)\n",
        "\n",
        "plt.figure(figsize=(11,11))\n",
        "sns.lineplot(x='card_const', y = 'user_sep', hue='algorithm', data=df, style='algorithm', markers=False, dashes=True, markersize=18)\n",
        "#plt.legend(loc='upper right', bbox_to_anchor=[1.1,0.5], ncol=1,fontsize=26, markerscale=2)\n",
        "plt.legend(title='sketch size', loc='lower left', fontsize=20, title_fontsize='20')\n",
        "#plt.legend('')\n",
        "plt.ylabel('accuracy ratio', fontsize=26)\n",
        "plt.xlabel('cardinality constraint k', fontsize=26)\n",
        "#plt.yscale(\"log\")\n",
        "plt.title('Cardinality Constraint vs. Accuracy Ratio',fontsize=28)\n",
        "plt.xticks(fontsize=20)\n",
        "plt.yticks(fontsize=20)\n",
        "plt.tight_layout()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "IX8s_ojtQ1jp"
      },
      "outputs": [],
      "source": [
        "#general fingerprinting UCI \"US Census (1990)\" dataset, sketch size vs time (s)\n",
        "data = {\n",
        "    'sketch_size': [55000, 180000, 400000, 55000, 180000, 400000],\n",
        "    'time': [70.4571, 123.39772295951843, 325.04257702827454, 14760, 14760, 14760],\n",
        "    'algorithm': ['this paper', 'this paper', 'this paper', 'comparison', 'comparison','comparison']\n",
        "}\n",
        "df = pd.DataFrame(data)\n",
        "\n",
        "plt.figure(figsize=(11,11))\n",
        "sns.lineplot(x='sketch_size', y = 'time', hue='algorithm', data=df, style='algorithm', markers=True, dashes=True, markersize=18)\n",
        "#plt.legend(loc='upper right', bbox_to_anchor=[1.1,0.5], ncol=1,fontsize=26, markerscale=2)\n",
        "plt.legend(title='algorithm', loc='lower right', fontsize=20, title_fontsize='20', bbox_to_anchor=[1.0,0.75])\n",
        "#plt.legend('')\n",
        "plt.ylabel('time(s)', fontsize=26)\n",
        "plt.xlabel('sketch size (number of CountSketch rows)', fontsize=26)\n",
        "#plt.yscale(\"log\")\n",
        "plt.title('Sketch Size vs. Time (in seconds)',fontsize=28)\n",
        "plt.xticks(fontsize=20)\n",
        "plt.yticks(fontsize=20)\n",
        "plt.tight_layout()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ef8gItx1S4LB"
      },
      "outputs": [],
      "source": [
        "#general fingerprinting UCI \"US Census (1990)\", accuracy\n",
        "data = {\n",
        "    'card_const': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],\n",
        "    'user_sep': [1.0, 1.0,1.0,1.0, 1.0, 1.0, 1.0, 0.998046875, 0.748046875, 0.7404692082111437, 1.0, 1.0,1.0,1.0, 1.0, 1.0, 1.0,1.0, 0.99609375, 0.9540566959921799, 1.0, 1.0,1.0,1.0, 1.0, 1.0, 1.0,1.0, 0.99609375, 0.9872922776148583],\n",
        "    'algorithm': ['55,000 rows', '55,000 rows','55,000 rows','55,000 rows','55,000 rows','55,000 rows','55,000 rows','55,000 rows','55,000 rows','55,000 rows','180,000 rows', '180,000 rows','180,000 rows','180,000 rows','180,000 rows','180,000 rows','180,000 rows','180,000 rows','180,000 rows','180,000 rows','400,000 rows', '400,000 rows','400,000 rows','400,000 rows','400,000 rows','400,000 rows','400,000 rows','400,000 rows','400,000 rows','400,000 rows']\n",
        "}\n",
        "df = pd.DataFrame(data)\n",
        "\n",
        "plt.figure(figsize=(11,11))\n",
        "sns.lineplot(x='card_const', y = 'user_sep', hue='algorithm', data=df, style='algorithm', markers=False, dashes=True, markersize=18)\n",
        "#plt.legend(loc='upper right', bbox_to_anchor=[1.1,0.5], ncol=1,fontsize=26, markerscale=2)\n",
        "plt.legend(title='sketch size', loc='lower left', fontsize=20, title_fontsize='20')\n",
        "#plt.legend('')\n",
        "plt.ylabel('accuracy ratio', fontsize=26)\n",
        "plt.xlabel('cardinality constraint k', fontsize=26)\n",
        "#plt.yscale(\"log\")\n",
        "plt.title('Cardinality Constraint vs. Accuracy Ratio',fontsize=28)\n",
        "plt.xticks(fontsize=20)\n",
        "plt.yticks(fontsize=20)\n",
        "plt.tight_layout()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ueIlXVlZTIVJ"
      },
      "outputs": [],
      "source": [
        "#feature selection UCI \"Wine\", efficiency\n",
        "data = {\n",
        "    'card_const': [3, 4, 5, 13],\n",
        "    'user_sep': [0.0675, 0.0903, 0.1011, .2168],\n",
        "    'algorithm': ['reduction', 'reduction','reduction', 'reduction']\n",
        "\n",
        "}\n",
        "df = pd.DataFrame(data)\n",
        "\n",
        "plt.figure(figsize=(11,11))\n",
        "sns.lineplot(x='card_const', y = 'user_sep', hue='algorithm', data=df, style='algorithm', markers=True, dashes=True, markersize=18)\n",
        "#plt.legend(loc='upper right', bbox_to_anchor=[1.1,0.5], ncol=1,fontsize=26, markerscale=2)\n",
        "#plt.legend('')\n",
        "plt.legend().remove()\n",
        "plt.ylabel('time to run k-means (s)', fontsize=26)\n",
        "plt.xlabel('number of features', fontsize=26)\n",
        "#plt.yscale(\"log\")\n",
        "plt.title('Number of Features vs Runtime (s)',fontsize=28)\n",
        "plt.xticks(fontsize=20)\n",
        "plt.yticks(fontsize=20)\n",
        "plt.tight_layout()"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.6"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
