{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bug Analysis\n",
    "\n",
    "In this notebook, we analyze the impact of a minor bug in model-vs-human: In the calculation of Error Consistency, the border conditions are not handled properly: If one observer gets a perfect score, the EC is 1 automatically, which is not necessarily correct, and we think it's better to return NaN in this case.\n",
    "\n",
    "Here, we load two dataframes of model-vs-human data, one which was created with the bug, the other without (by just changing one line in utils.py to return 1 instead of NaN, then running `bootstrap_models.py -n 1`). \n",
    "\n",
    "We find that this only has an impact on two models, `efficientnet_l2_noisy_student_475` and `transformer_L16_IN21K`, because no other model got a perfect condition anywhere.\n",
    "We here plot how these models would have been ranked using our way of doing it, but since the experiments have such a different distribution of values, the impact of not including the NaN-experiments in the mean is very large.\n",
    "\n",
    "It would probably be better to z-transform all error-consistency scores first, before averaging, and this would probably have a large impact on the ranking as well."
   ]
  },
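  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The z-transform suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `zscore_then_average`, the column name `ec`, and the toy values are made up, and scores are standardised within each experiment (across all models and conditions) before the two averaging steps.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "def zscore_then_average(df, value='ec'):\n",
    "    # Standardise scores within each experiment (sample std; NaNs are skipped).\n",
    "    grouped = df.groupby('experiment')[value]\n",
    "    z = (df[value] - grouped.transform('mean')) / grouped.transform('std')\n",
    "    # Average per (experiment, model) first, then across experiments.\n",
    "    per_experiment = df.assign(z=z).groupby(['experiment', 'model'])['z'].mean()\n",
    "    return per_experiment.groupby(level='model').mean()\n",
    "\n",
    "\n",
    "# Toy data (made up): two experiments on very different EC scales.\n",
    "toy = pd.DataFrame({\n",
    "    'experiment': ['e1', 'e1', 'e2', 'e2'],\n",
    "    'model': ['a', 'b', 'a', 'b'],\n",
    "    'ec': [0.1, 0.3, 0.5, 0.9],\n",
    "})\n",
    "z_scores = zscore_then_average(toy)\n",
    "```\n",
    "\n",
    "After standardising, both experiments contribute on the same scale, so dropping one of them because of a NaN should shift a model's average far less than it does with raw scores."
   ]
  },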
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_with = pd.read_parquet(\"data/bug_analysis_with_1.parquet\", engine=\"pyarrow\")\n",
    "df_without = pd.read_parquet(\"data/bug_analysis_without_1.parquet\", engine=\"pyarrow\")\n",
    "\n",
    "merged = pd.merge(\n",
    "    left=df_with,\n",
    "    right=df_without,\n",
    "    suffixes=[\"_with\", \"_without\"],\n",
    "    on=[\"experiment\", \"condition\", \"bootstrap_id\", \"model\"],\n",
    ")\n",
    "\n",
    "# important to do this in two steps\n",
    "agg = (\n",
    "    merged.groupby([\"experiment\", \"model\"], observed=True)\n",
    "    .mean(numeric_only=True)\n",
    "    .reset_index()\n",
    ")\n",
    "final = agg.groupby(\"model\", observed=True).mean(numeric_only=True)\n",
    "\n",
    "# displaying where we get differences\n",
    "display(final[final[\"model-human-ec_with\"] != final[\"model-human-ec_without\"]])"
   ]
  },
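  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate why the averaging order matters and why a NaN can move the mean so much, here is a small sketch on made-up data. The helper `two_step_mean` and the column name `ec` are hypothetical; the only library behaviour it relies on is that pandas `mean` skips NaN by default.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "def two_step_mean(df, value='ec'):\n",
    "    # Step 1: mean per (experiment, model); step 2: mean over experiments.\n",
    "    per_experiment = df.groupby(['experiment', 'model'])[value].mean()\n",
    "    return per_experiment.groupby(level='model').mean()\n",
    "\n",
    "\n",
    "# Toy data (made up): three conditions in e1, a single condition in e2.\n",
    "toy = pd.DataFrame({\n",
    "    'experiment': ['e1', 'e1', 'e1', 'e2'],\n",
    "    'model': ['m', 'm', 'm', 'm'],\n",
    "    'ec': [0.1, 0.1, 0.1, 0.7],\n",
    "})\n",
    "one_step = toy.groupby('model')['ec'].mean()  # e1 dominates: about 0.25\n",
    "two_step = two_step_mean(toy)                 # equal weights: about 0.4\n",
    "\n",
    "# If the only e2 condition is NaN, e2 silently drops out of the average.\n",
    "toy_nan = toy.assign(ec=[0.1, 0.1, 0.1, np.nan])\n",
    "two_step_nan = two_step_mean(toy_nan)         # about 0.1\n",
    "```\n",
    "\n",
    "In this toy case a single NaN moves the model's score from about 0.4 to about 0.1, because the experiment with the high values no longer enters the mean."
   ]
  },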
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "fig, ax = plt.subplots(1, 1, figsize=(15, 5))\n",
    "plt.grid(axis=\"y\")\n",
    "order = (\n",
    "    final.groupby(\"model\")[\"model-human-ec_with\"]\n",
    "    .mean()\n",
    "    .sort_values(ascending=False)\n",
    "    .index.tolist()\n",
    ")\n",
    "sns.stripplot(\n",
    "    data=final, x=\"model\", y=\"model-human-ec_without\", color=\"red\", order=order, ax=ax\n",
    ")\n",
    "sns.stripplot(\n",
    "    data=final, x=\"model\", y=\"model-human-ec_with\", color=\"blue\", order=order, ax=ax\n",
    ")\n",
    "sns.despine()\n",
    "ax.set_xlabel(\"Model\")\n",
    "ax.set_ylabel(\"Error Consistency [kappa]\")\n",
    "ax.set_ylim(0, 0.3)\n",
    "ax.tick_params(axis=\"x\", labelrotation=90)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 2
}
