{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "16e44b1d-68f4-4f78-8a5d-bc7d1ca0967b",
   "metadata": {},
   "source": [
    "# Benchmark for Clustering\n",
    "\n",
    "In this benchmark, we use the LP Heuristic to label d-dimensional datasets.\n",
    "We start by importing the dependencies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "208974f3-ee65-40b7-93f2-4509a8e955b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"Load/import helper functions\"\"\"\n",
    "\n",
    "# Project-local helpers for running and scoring the LP Heuristic\n",
    "from LocalPopular import (\n",
    "    time_tester,\n",
    "    calculate_scores_clustering,\n",
    "    locally_popular_clustering_with_euclid_graphs,\n",
    ")\n",
    "\n",
    "# Project-local helpers for generating and shuffling datasets\n",
    "from GraphFunctions import my_make_circles, randomize_graph_pos_labels, permute_graph_with_truth\n",
    "\n",
    "import importlib\n",
    "import PlotHelperFunctions\n",
    "importlib.reload(PlotHelperFunctions)\n",
    "\n",
    "from sklearn.cluster import KMeans, DBSCAN\n",
    "from sklearn.datasets import make_moons, load_breast_cancer, load_iris\n",
    "from sklearn.metrics import rand_score\n",
    "import numpy as np\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ffee4aa8-710f-4218-bb3d-9c7585dd1700",
   "metadata": {},
   "source": [
    "## Create Graphs\n",
    "\n",
    "For our clustering simulations, we use four datasets:\n",
    "\n",
    "- **Moon dataset**: Two half-moons generated by `make_moons` (300 points, noise 0.05).\n",
    "- **3-Circles dataset**: 300 points in three slightly overlapping circles centered at (0.5, 0.5), (0.7, 0.3), and (0.1, 0.7), each with radius 0.2 and Gaussian noise (std = 0.05).\n",
    "- **Breast cancer Wisconsin dataset**: 569 samples derived from diagnostic images (30-dimensional data points).\n",
    "- **Iris dataset**: 150 samples from three Iris species; the features are sepal and petal sizes.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fed85ce8-b19f-45fe-9513-67370e7d2f2f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Moon Dataset\n",
    "moon_agents,moon_truth = make_moons(n_samples=300, noise=0.05)\n",
    "\n",
    "# Circle Dataset\n",
    "circle_agents, circle_truth = my_make_circles(300)\n",
    "\n",
    "# Cancer Dataset\n",
    "cancer = load_breast_cancer()\n",
    "cancer_agents = cancer['data']\n",
    "cancer_truth = cancer['target']\n",
    "\n",
    "# Iris Dataset\n",
    "iris = load_iris()\n",
    "iris_agents = iris['data']\n",
    "iris_truth = iris['target']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c80dffb-566a-4315-a94d-d10b01ff9805",
   "metadata": {},
   "source": [
    "## Run the algorithms\n",
    "\n",
    "We now run the LP Heuristic in several modes, covering the three introduced domains of FEN games:\n",
    "\n",
    "- Friend-appreciating\n",
    "- Enemy-averse\n",
    "- Balanced\n",
    "\n",
    "For each mode, we test three pairs of friendship and enemy thresholds:\n",
    "\n",
    "- (0.2, 0.2)\n",
    "- (0.25, 0.35)\n",
    "- (0.4, 0.4)\n",
    "\n",
    "Each threshold is a fraction that is compared to the diameter of the graph; together, the two thresholds determine the friendship and enemy relations between data points.\n",
    "\n",
    "Finally, we use different initial clusterings:\n",
    "\n",
    "- putting each agent into a singleton cluster (LocPop-S),\n",
    "- dividing agents randomly into k clusters, where k is the predicted number of clusters (LocPop-P), and\n",
    "- using the output of the k-means or DBSCAN algorithm (LocPop-KM/D).\n",
    "\n",
    "We also run k-means and DBSCAN on their own for comparison.\n",
    "\n",
    "If you want to run this code yourself, we suggest reducing the repetitions from 10 to 1.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fcc3107b-8692-4e25-afea-8fb2e6b9f902",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import itertools\n",
    "import timeit\n",
    "import pandas as pd\n",
    "import numbers\n",
    "\n",
    "dfs = []\n",
    "labels = [(0.2, 0.2), (0.25, 0.35), (0.4, 0.4)]\n",
    "\n",
    "for threshold in labels:\n",
    "    collected_data = {}\n",
    "    for repetitions in range(10):\n",
    "        f = threshold[0]  # friendship threshold (f-bound)\n",
    "        e = threshold[1]  # enemy threshold (e-bound)\n",
    "        data = [moon_agents, circle_agents, cancer_agents, iris_agents]\n",
    "        expected_clusters = [2, 3, 2, 3]\n",
    "        graph_names = ['Moons', '3 Circles', 'Cancer', 'Iris']\n",
    "        graph_truths = [moon_truth, circle_truth, cancer_truth, iris_truth]\n",
    "\n",
    "        # Baselines; dbscan accepts but ignores the expected number of clusters\n",
    "        kmeans = lambda agents, clusters: KMeans(n_clusters=clusters).fit_predict(agents)\n",
    "        dbscan = lambda agents, clusters: DBSCAN(eps=0.2, min_samples=5).fit_predict(agents)\n",
    "\n",
    "        # Most recent baseline outputs, compared against the LP results below\n",
    "        kmeans_out = None\n",
    "        dbscan_out = None\n",
    "\n",
    "        # LP Heuristic in balanced (B), friend (F), and enemy (E) mode\n",
    "        lp_a_b = lambda agents, initial_clustering, pre: locally_popular_clustering_with_euclid_graphs(agents, f, e, initial_clustering, mode='B', pre=pre)\n",
    "        lp_a_f = lambda agents, initial_clustering, pre: locally_popular_clustering_with_euclid_graphs(agents, f, e, initial_clustering, mode='F', pre=pre)\n",
    "        lp_a_e = lambda agents, initial_clustering, pre: locally_popular_clustering_with_euclid_graphs(agents, f, e, initial_clustering, mode='E', pre=pre)\n",
    "\n",
    "        algorithms = [kmeans, dbscan, lp_a_b, lp_a_f, lp_a_e]\n",
    "        algo_names = ['kmeans', 'dbscan', 'LP (Balanced) Heuristic',\n",
    "                      'LP (Friend-Oriented) Heuristic', 'LP (Enemy-Averse) Heuristic']\n",
    "        is_lp_heuristic = [False, False, True, True, True]\n",
    "\n",
    "        # The baselines come first, so their outputs exist before the LP runs on the same dataset\n",
    "        for ((graph, g_name, clusters, truth), (algo, a_name, lp_heuristic)) in \\\n",
    "            itertools.product(zip(data, graph_names, expected_clusters, graph_truths), zip(algorithms, algo_names, is_lp_heuristic)):\n",
    "\n",
    "            graph, truth = randomize_graph_pos_labels(graph, truth)\n",
    "\n",
    "            # Wrap in single-element lists for time_tester and calculate_scores_clustering\n",
    "            graph = [graph]\n",
    "            truth = [truth]\n",
    "\n",
    "            if lp_heuristic:\n",
    "                # start with everyone alone\n",
    "                a_name_modified = a_name + ' starting with everyone alone'\n",
    "                test_callable = lambda a: list(algo(a, len(graph[0]), None).values())\n",
    "                times, outputs = time_tester(test_callable, graph)\n",
    "                avg_time = sum(times) / len(times)\n",
    "\n",
    "                scores = calculate_scores_clustering(outputs, truth, graph)\n",
    "                scores['Time'] = avg_time\n",
    "                collected_data.setdefault((a_name_modified, g_name), []).append(scores)\n",
    "\n",
    "                # start with random clustering\n",
    "                a_name_modified = a_name + ' starting with predicted number of clusters'\n",
    "                test_callable = lambda a: list(algo(a, clusters, None).values())\n",
    "                times, outputs = time_tester(test_callable, graph)\n",
    "                avg_time = sum(times) / len(times)\n",
    "                scores = calculate_scores_clustering(outputs, truth, graph)\n",
    "                scores['Time'] = avg_time\n",
    "                collected_data.setdefault((a_name_modified, g_name), []).append(scores)\n",
    "\n",
    "                # start with the output of k-means\n",
    "                a_name_modified = a_name + ' starting with the output of k-means'\n",
    "                test_callable = lambda a: list(algo(a, clusters, kmeans).values())\n",
    "                times, outputs = time_tester(test_callable, graph)\n",
    "                avg_time = sum(times) / len(times)\n",
    "                scores = calculate_scores_clustering(outputs, truth, graph)\n",
    "\n",
    "                rand_score_with_init = sum(rand_score(out, k) for out, k in zip(outputs, kmeans_out)) / len(outputs)\n",
    "                scores['Rand Score with initial clustering'] = rand_score_with_init\n",
    "\n",
    "                scores['Time'] = avg_time\n",
    "                collected_data.setdefault((a_name_modified, g_name), []).append(scores)\n",
    "\n",
    "                # start with the output of dbscan\n",
    "                a_name_modified = a_name + ' starting with the output of dbscan'\n",
    "                test_callable = lambda a: list(algo(a, clusters, dbscan).values())\n",
    "                times, outputs = time_tester(test_callable, graph)\n",
    "                avg_time = sum(times) / len(times)\n",
    "                scores = calculate_scores_clustering(outputs, truth, graph)\n",
    "\n",
    "                rand_score_with_init = sum(rand_score(out, db) for out, db in zip(outputs, dbscan_out)) / len(outputs)\n",
    "                scores['Rand Score with initial clustering'] = rand_score_with_init\n",
    "\n",
    "                scores['Time'] = avg_time\n",
    "                collected_data.setdefault((a_name_modified, g_name), []).append(scores)\n",
    "\n",
    "            else:\n",
    "                test_callable = lambda a: algo(a, clusters)\n",
    "\n",
    "                times, outputs = time_tester(test_callable, graph)\n",
    "                # Remember the baseline outputs for the comparison with the LP results\n",
    "                if algo is kmeans:\n",
    "                    kmeans_out = outputs\n",
    "                if algo is dbscan:\n",
    "                    dbscan_out = outputs\n",
    "                avg_time = sum(times) / len(times)\n",
    "                scores = calculate_scores_clustering(outputs, truth, graph)\n",
    "                scores['Time'] = avg_time\n",
    "                collected_data.setdefault((a_name, g_name), []).append(scores)\n",
    "    \n",
    "       \n",
    "    records = []\n",
    "\n",
    "    for (method, dataset), metrics_list in collected_data.items():\n",
    "        record = {'Method': method, 'Dataset': dataset}\n",
    "        keys = metrics_list[0].keys()\n",
    "        for key in keys:\n",
    "            values = [m[key] for m in metrics_list if isinstance(m[key], numbers.Number)]\n",
    "            if values:\n",
    "                mean = sum(values) / len(values)\n",
    "                std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5\n",
    "                record[key] = (mean, std)\n",
    "            else:\n",
    "                record[key] = metrics_list[0][key]  # fallback for non-numeric\n",
    "        records.append(record)\n",
    "    \n",
    "    df = pd.DataFrame(records)\n",
    "    \n",
    "    dfs.append(df)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e1cba20-b628-40f2-a248-3ef91d735992",
   "metadata": {},
   "source": [
    "As a quick check, we compute the maximum standard deviation of the Rand Index and the Silhouette Score over all methods, datasets, and thresholds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3a1e56eb-1787-496e-b797-bba9d94a2baf",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check maximum standard deviations\n",
    "max_stds = {}\n",
    "\n",
    "for df in dfs:\n",
    "    for col in ['Rand Index', 'Silhouette Score']:\n",
    "        stds = df[col].apply(lambda x: x[1] if isinstance(x, (tuple, list)) else float('nan'))\n",
    "        max_stds[col] = max(max_stds.get(col, float('-inf')), stds.max())\n",
    "\n",
    "print(max_stds)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74620cb5-bf02-4e22-af16-55176bec32d5",
   "metadata": {},
   "source": [
    "## Plotting Figures\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6faca54-7c96-4b8a-b6d1-764443543b61",
   "metadata": {},
   "outputs": [],
   "source": [
    "score_cols = ['Rand Index', 'Silhouette Score']\n",
    "\n",
    "# Threshold pairs, in the same order as the entries of dfs\n",
    "labels = [(0.2, 0.2), (0.25, 0.35), (0.4, 0.4)]\n",
    "# Create figures\n",
    "for Dataset in ['Cancer', 'Iris', 'Moons', '3 Circles']:\n",
    "    for score in score_cols:\n",
    "        PlotHelperFunctions.plot_and_save_clustering(\n",
    "            dfs, labels, Dataset, score, mode = \"LP\",\n",
    "            save_path=f'./figures/PopularClustering/{Dataset}-{score}.png'\n",
    "        )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24a559d9-82ce-4d0d-b125-3c5ab784ea2b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import ast\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Compute the mean Rand Scores with the initial clustering\n",
    "\n",
    "all_values = []\n",
    "\n",
    "for df in dfs:\n",
    "    col = 'Rand Score with initial clustering'\n",
    "    if col in df.columns:\n",
    "        values = df[col].apply(\n",
    "            lambda x: ast.literal_eval(x)[0] if isinstance(x, str) and x.startswith('(')\n",
    "            else x[0] if isinstance(x, (tuple, list))\n",
    "            else float('nan')\n",
    "        )\n",
    "        all_values.extend(values.dropna().tolist())\n",
    "\n",
    "# Convert to numpy array for convenience\n",
    "all_values = np.array(all_values)\n",
    "\n",
    "# Compute min, max, and average, ignoring NaNs\n",
    "min_val = np.nanmin(all_values)\n",
    "max_val = np.nanmax(all_values)\n",
    "avg_val = np.nanmean(all_values)\n",
    "\n",
    "print(f\"Min: {min_val}\")\n",
    "print(f\"Max: {max_val}\")\n",
    "print(f\"Average: {avg_val}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e93be20-f6f3-4b76-9ec6-d59509799987",
   "metadata": {},
   "source": [
    "## Saving the dataframes as CSV"
   ]
  },
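  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc3b32cf-f6ed-4bad-8a5d-e9f74d92349b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.makedirs('./csv/PopularClustering', exist_ok=True)  # to_csv does not create missing directories\n",
    "for i, df in enumerate(dfs):\n",
    "    df.to_csv(f'./csv/PopularClustering/dataset-{i}.csv')"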
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
