{
  "original_problem": {
    "instance_id": "mwaskom__seaborn-3407",
    "repo": "mwaskom/seaborn",
    "created_at": "2023-06-27T23:17:29Z",
    "problem_statement": "pairplot raises KeyError with MultiIndex DataFrame\nWhen trying to pairplot a MultiIndex DataFrame, `pairplot` raises a `KeyError`:\r\n\r\nMRE:\r\n\r\n```python\r\nimport numpy as np\r\nimport pandas as pd\r\nimport seaborn as sns\r\n\r\n\r\ndata = {\r\n    (\"A\", \"1\"): np.random.rand(100),\r\n    (\"A\", \"2\"): np.random.rand(100),\r\n    (\"B\", \"1\"): np.random.rand(100),\r\n    (\"B\", \"2\"): np.random.rand(100),\r\n}\r\ndf = pd.DataFrame(data)\r\nsns.pairplot(df)\r\n```\r\n\r\nOutput:\r\n\r\n```\r\n[c:\\Users\\KLuu\\anaconda3\\lib\\site-packages\\seaborn\\axisgrid.py](file:///C:/Users/KLuu/anaconda3/lib/site-packages/seaborn/axisgrid.py) in pairplot(data, hue, hue_order, palette, vars, x_vars, y_vars, kind, diag_kind, markers, height, aspect, corner, dropna, plot_kws, diag_kws, grid_kws, size)\r\n   2142     diag_kws.setdefault(\"legend\", False)\r\n   2143     if diag_kind == \"hist\":\r\n-> 2144         grid.map_diag(histplot, **diag_kws)\r\n   2145     elif diag_kind == \"kde\":\r\n   2146         diag_kws.setdefault(\"fill\", True)\r\n\r\n[c:\\Users\\KLuu\\anaconda3\\lib\\site-packages\\seaborn\\axisgrid.py](file:///C:/Users/KLuu/anaconda3/lib/site-packages/seaborn/axisgrid.py) in map_diag(self, func, **kwargs)\r\n   1488                 plt.sca(ax)\r\n   1489 \r\n-> 1490             vector = self.data[var]\r\n   1491             if self._hue_var is not None:\r\n   1492                 hue = self.data[self._hue_var]\r\n\r\n[c:\\Users\\KLuu\\anaconda3\\lib\\site-packages\\pandas\\core\\frame.py](file:///C:/Users/KLuu/anaconda3/lib/site-packages/pandas/core/frame.py) in __getitem__(self, key)\r\n   3765             if is_iterator(key):\r\n   3766                 key = list(key)\r\n-> 3767             indexer = self.columns._get_indexer_strict(key, \"columns\")[1]\r\n   3768 \r\n   3769         # take() does not accept boolean 
indexers\r\n\r\n[c:\\Users\\KLuu\\anaconda3\\lib\\site-packages\\pandas\\core\\indexes\\multi.py](file:///C:/Users/KLuu/anaconda3/lib/site-packages/pandas/core/indexes/multi.py) in _get_indexer_strict(self, key, axis_name)\r\n   2534             indexer = self._get_indexer_level_0(keyarr)\r\n   2535 \r\n-> 2536             self._raise_if_missing(key, indexer, axis_name)\r\n   2537             return self[indexer], indexer\r\n   2538 \r\n\r\n[c:\\Users\\KLuu\\anaconda3\\lib\\site-packages\\pandas\\core\\indexes\\multi.py](file:///C:/Users/KLuu/anaconda3/lib/site-packages/pandas/core/indexes/multi.py) in _raise_if_missing(self, key, indexer, axis_name)\r\n   2552                 cmask = check == -1\r\n   2553                 if cmask.any():\r\n-> 2554                     raise KeyError(f\"{keyarr[cmask]} not in index\")\r\n   2555                 # We get here when levels still contain values which are not\r\n   2556                 # actually in Index anymore\r\n\r\nKeyError: \"['1'] not in index\"\r\n```\r\n\r\nA workaround is to \"flatten\" the columns:\r\n\r\n```python\r\ndf.columns = [\"\".join(column) for column in df.columns]\r\n```\n",
    "patch": "diff --git a/seaborn/axisgrid.py b/seaborn/axisgrid.py\n--- a/seaborn/axisgrid.py\n+++ b/seaborn/axisgrid.py\n@@ -1472,8 +1472,8 @@ def map_diag(self, func, **kwargs):\n                 for ax in diag_axes[1:]:\n                     share_axis(diag_axes[0], ax, \"y\")\n \n-            self.diag_vars = np.array(diag_vars, np.object_)\n-            self.diag_axes = np.array(diag_axes, np.object_)\n+            self.diag_vars = diag_vars\n+            self.diag_axes = diag_axes\n \n         if \"hue\" not in signature(func).parameters:\n             return self._map_diag_iter_hue(func, **kwargs)\n"
  },
  "candidates_evaluated": 5,
  "judgment_result": {
    "candidates": [
      {
        "idx": 1,
        "id": "similar_2476",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue is about histplot's `shrink` parameter shifting the drawn bars, unrelated to MultiIndex handling."
      },
      {
        "idx": 2,
        "id": "similar_2502",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue involves log scaling and warnings, not related to MultiIndex or KeyError."
      },
      {
        "idx": 3,
        "id": "similar_2307",
        "decision": "Useful",
        "confidence": "High",
        "reason": "Both issues are PairGrid failures in seaborn/axisgrid.py triggered by column names that are not plain strings; the related fix touched PairGrid._map_diag_iter_hue, which map_diag calls."
      },
      {
        "idx": 4,
        "id": "similar_2555",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue is about performance inefficiencies, not related to MultiIndex or KeyError."
      },
      {
        "idx": 5,
        "id": "similar_2295",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue is about handling NaN in categorical data, not related to MultiIndex or KeyError."
      }
    ]
  },
  "raw_summaries": [
    {
      "similar_issue": {
        "issue_title": "shrink parameter in histplot shifts data",
        "issue_body": "The smaller the value of the `shrink` parameter, the more the values in the histogram get shifted towards positive values.\r\n\r\n`import seaborn as sns`\r\n`import numpy as np`\r\n\r\n`r = np.random.random(100)`\r\n\r\n\r\n`sns.histplot(r);`\r\n\r\n`sns.histplot(r, shrink=0.5);`\r\n\r\n![Screen Shot 2021-02-08 at 10 11 43 PM](https://user-images.githubusercontent.com/35338267/107310557-aacb8e80-6a5a-11eb-991e-d4d4cbfd15b7.png)\r\n\r\n![Screen Shot 2021-02-08 at 10 15 46 PM](https://user-images.githubusercontent.com/35338267/107310839-36ddb600-6a5b-11eb-9f0d-54ae495f14cc.png)\r\n",
        "issue_id": 2476,
        "pr_number": 2477,
        "pr_title": "Fix histplot shrink with non-discrete bins",
        "pr_body": "Fixes #2476\r\n\r\nThe code for shifting the shrunken bars assumed that discrete binning\r\nwas in effect. This is probably the only situation where shrinking\r\nreally makes sense, but there was no prevention or warning of getting\r\nan innacurate result when using it with continuous bins.\r\n\r\nIt works properly now:\r\n\r\n```python\r\nsns.histplot(data=tips, x=\"total_bill\", binwidth=8)\r\nsns.histplot(data=tips, x=\"total_bill\", binwidth=8, shrink=.6)\r\n```\r\n![image](https://user-images.githubusercontent.com/315810/107373188-3e7d7900-6ab4-11eb-9a35-821bd76fbfd0.png)\r\n\r\n```python\r\nsns.histplot(data=tips, x=\"total_bill\", binwidth=8, color=\".6\")\r\nsns.histplot(data=tips, x=\"total_bill\", hue=\"time\", multiple=\"dodge\", binwidth=8, shrink=.6)\r\n```\r\n![image](https://user-images.githubusercontent.com/315810/107373289-610f9200-6ab4-11eb-990d-727132a53526.png)\r\n",
        "issue_closed_at": "2021-02-10T00:01:13Z",
        "base_commit": "b1dc1bc336ca2aec8308915836ec0550397e856e"
      },
      "summary": "### Summary:\nThis issue pertains to a bug in a data visualization library, specifically related to the `histplot` function in Seaborn, a Python data visualization library. The problem arises when using the `shrink` parameter within this function, which is intended to adjust the bar width in histograms. In this case, setting the `shrink` parameter to a smaller value causes an unintended shift of histogram bars towards positive values, altering the accurate representation of the data distribution.\n\n1. **Problem Description in General Terms**: The `shrink` parameter in a histogram plotting function is not functioning as intended, causing a visual distortion in the histogram output by shifting data bars.\n\n2. **Key Symptoms and Behaviors Observed**: When the `shrink` parameter is set to values less than 1, the histogram bars are erroneously shifted towards the positive axis on the plot, which misrepresents the actual data distribution. This is evident when comparing plots with and without the `shrink` parameter applied.\n\n3. **Affected Components or Systems**: The issue is localized to Seaborn's `histplot` function within the `distributions.py` file, specifically affecting the `_DistributionPlotter.plot_univariate_histogram` function.\n\n4. **Potential Impact or Severity**: The visual misrepresentation of data can lead to incorrect data interpretation and analysis, particularly in fields relying on precise data visualization for decision-making. This could affect users across different domains who utilize Seaborn for statistical data visualization.\n\n5. **Relevant Technical Details Abstracted for Broader Understanding**: The problem involves the handling of the `shrink` parameter in the histogram plotting logic, which likely involves incorrect calculations or transformations leading to the shift in bar positioning. As a result, the rendering of histograms does not accurately reflect the intended data distribution when this parameter is utilized. 
The issue has been addressed through modifications in the internal functions responsible for plotting histograms in Seaborn.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: shrink parameter in histplot shifts data\n\nBody:\nThe smaller the value of the `shrink` parameter, the more the values in the histogram get shifted towards positive values.\r\n\r\n`import seaborn as sns`\r\n`import numpy as np`\r\n\r\n`r = np.random.random(100)`\r\n\r\n\r\n`sns.histplot(r);`\r\n\r\n`sns.histplot(r, shrink=0.5);`\r\n\r\n![Screen Shot 2021-02-08 at 10 11 43 PM](https://user-images.githubusercontent.com/35338267/107310557-aacb8e80-6a5a-11eb-991e-d4d4cbfd15b7.png)\r\n\r\n![Screen Shot 2021-02-08 at 10 15 46 PM](https://user-images.githubusercontent.com/35338267/107310839-36ddb600-6a5b-11eb-9f0d-54ae495f14cc.png)\r\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nseaborn/distributions.py\n  function: _DistributionPlotter.plot_univariate_histogram\n  function: _DistributionPlotter.plot_univariate_histogram\n"
    },
    {
      "similar_issue": {
        "issue_title": "displot(kind='ecdf',..., log_scale=True) not working",
        "issue_body": "The following line of code gives an error:\r\n\r\n```\r\nsns.displot(kind='ecdf', data=df, x='col_1', log_scale=True)\r\n\r\nUserWarning: Data has no positive values, and therefore cannot be log-scaled.\r\n```\r\n\r\n\r\nMy data is all positive and kind='hist' or 'kde' works just fine.\r\n\r\n",
        "issue_id": 2502,
        "pr_number": 2504,
        "pr_title": "Fix log scaling in distribution plots",
        "pr_body": "Fixes #2502 \r\n\r\nThis is a huge development footgun; see #2409 for thoughts on how this can be made automatic to reduce the risk of such bugs",
        "issue_closed_at": "2021-03-24T21:54:09Z",
        "base_commit": "ba4bd0fa0a90b2bd00cb62c2b4a5e38013a73ac6"
      },
      "summary": "### Summary:\nThis issue is related to a problem encountered when using the `displot` function from the Seaborn library, specifically with the `kind='ecdf'` parameter in conjunction with `log_scale=True`. The user receives a warning indicating that the data cannot be log-scaled due to the absence of positive values, despite the data being entirely positive. This suggests a potential bug in the handling of log scaling for empirical cumulative distribution function (ECDF) plots within the library. The key symptom is the erroneous warning message and the failure of the plot to render as expected under these conditions. The issue affects the plotting functionality of the Seaborn library, particularly impacting users attempting to create ECDF plots with logarithmic scaling. The severity is moderate as it prevents a specific use case of the plotting function, though alternative plot types like 'hist' and 'kde' are unaffected. Relevant technical details include the handling of data validation for log scaling within the plotting functions, specifically in the `plot_univariate_ecdf` and `_plot_single_rug` methods of the `distributions.py` file in Seaborn.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: displot(kind='ecdf',..., log_scale=True) not working\n\nBody:\nThe following line of code gives an error:\r\n\r\n```\r\nsns.displot(kind='ecdf', data=df, x='col_1', log_scale=True)\r\n\r\nUserWarning: Data has no positive values, and therefore cannot be log-scaled.\r\n```\r\n\r\n\r\nMy data is all positive and kind='hist' or 'kde' works just fine.\r\n\r\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nseaborn/distributions.py\n  line: line 57\n  function: _DistributionPlotter.plot_univariate_ecdf\n  function: _DistributionPlotter._plot_single_rug\n"
    },
    {
      "similar_issue": {
        "issue_title": "map_* methods of PairGrid broken in 0.11.0",
        "issue_body": "Hi,\r\n\r\nI discovered that the map_* methods of PairGrid seem to be broken in version 0.11.0 for user defined functions. See reproducible example below with a corrfunc defined to plot the pearson correlation value on the lower plots. The function doesn't seem to get evaluated in version 0.11.0. When I pip install seaborn==0.10.1, I get the desired result. Plots from both cases also attached.\r\n\r\n```import numpy as np\r\nfrom scipy import stats\r\nimport pandas as pd\r\nimport seaborn as sns\r\nimport matplotlib.pyplot as plt\r\nsns.set(style=\"white\")\r\n\r\nmean = np.zeros(3)\r\ncov = np.random.uniform(.2, .4, (3, 3))\r\ncov += cov.T\r\ncov[np.diag_indices(3)] = 1\r\ndata = np.random.multivariate_normal(mean, cov, 100)\r\ndf = pd.DataFrame(data, columns=[\"X\", \"Y\", \"Z\"])\r\n\r\ndef corrfunc(x, y,**kws):\r\n    r, _ = stats.pearsonr(x, y)\r\n    ax = plt.gca()\r\n    ax.annotate(\"r = {:.2f}\".format(r),\r\n                xy=(.1, .9), xycoords=ax.transAxes)\r\n\r\ng = sns.PairGrid(df, palette=[\"red\"])\r\ng.map_upper(plt.scatter, s=10)\r\ng.map_diag(sns.distplot, kde=False)\r\ng.map_lower(sns.kdeplot, cmap=\"Blues_d\")\r\ng.map_lower(corrfunc)\r\nplt.show()\r\n\r\n```\r\n\r\n![seaborn-0-11-0](https://user-images.githubusercontent.com/3239171/94969718-380d3e00-04d1-11eb-821b-9aad80ec696e.png)\r\n\r\n![seaborn-0-10-1](https://user-images.githubusercontent.com/3239171/94969722-3a6f9800-04d1-11eb-8ef1-861d9beb1f26.png)\r\n\r\n\r\n\r\n",
        "issue_id": 2307,
        "pr_number": 2368,
        "pr_title": "Fix pairgrid off-diagonal plots with non-string column names",
        "pr_body": "Fixes #2307\r\n\r\nWorking reprex from original issue:\r\n\r\n![image](https://user-images.githubusercontent.com/315810/100743239-108c0200-33aa-11eb-8885-9c61d89b3acc.png)\r\n",
        "issue_closed_at": "2020-12-01T20:48:45Z",
        "base_commit": "2717408b564994002fe08f72ba2dd7e1acf359b6"
      },
      "summary": "### Summary:\nThis issue pertains to a regression in the seaborn library, specifically affecting the `map_*` methods of the `PairGrid` class in version 0.11.0. These methods, which are intended to facilitate the application of user-defined functions to different sections of a pair grid plot, fail to execute user-defined functions as expected in this version. As evidenced by a user-provided example, a function designed to calculate and display the Pearson correlation coefficient on the lower plots does not function correctly in version 0.11.0, although it works as intended in version 0.10.1. \n\nKey symptoms include the failure of user-defined functions to be executed, leading to incomplete or incorrect visual outputs compared to previous versions. The affected component is the `PairGrid` class within the seaborn library, specifically impacting its method for mapping functions across grid sections.\n\nThe potential impact of this issue is significant for users relying on custom analytical functions within their data visualizations, as it may render certain plots inaccurate or incomplete without manual intervention or code modification. The severity is compounded for users who depend on these functionalities for data analysis and presentation.\n\nTechnical analysis suggests that the underlying problem may reside in the internal implementation of the `map_lower` function and related methods within `PairGrid`, as indicated by changes to the `PairGrid._map_diag_iter_hue` and `PairGrid._plot_bivariate_iter_hue` functions in the library's source code. These modifications are likely aimed at restoring the expected behavior for user-defined function execution in the visualization grid.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: map_* methods of PairGrid broken in 0.11.0\n\nBody:\nHi,\r\n\r\nI discovered that the map_* methods of PairGrid seem to be broken in version 0.11.0 for user defined functions. See reproducible example below with a corrfunc defined to plot the pearson correlation value on the lower plots. The function doesn't seem to get evaluated in version 0.11.0. When I pip install seaborn==0.10.1, I get the desired result. Plots from both cases also attached.\r\n\r\n```import numpy as np\r\nfrom scipy import stats\r\nimport pandas as pd\r\nimport seaborn as sns\r\nimport matplotlib.pyplot as plt\r\nsns.set(style=\"white\")\r\n\r\nmean = np.zeros(3)\r\ncov = np.random.uniform(.2, .4, (3, 3))\r\ncov += cov.T\r\ncov[np.diag_indices(3)] = 1\r\ndata = np.random.multivariate_normal(mean, cov, 100)\r\ndf = pd.DataFrame(data, columns=[\"X\", \"Y\", \"Z\"])\r\n\r\ndef corrfunc(x, y,**kws):\r\n    r, _ = stats.pearsonr(x, y)\r\n    ax = plt.gca()\r\n    ax.annotate(\"r = {:.2f}\".format(r),\r\n                xy=(.1, .9), xycoords=ax.transAxes)\r\n\r\ng = sns.PairGrid(df, palette=[\"red\"])\r\ng.map_upper(plt.scatter, s=10)\r\ng.map_diag(sns.distplot, kde=False)\r\ng.map_lower(sns.kdeplot, cmap=\"Blues_d\")\r\ng.map_lower(corrfunc)\r\nplt.show()\r\n\r\n```\r\n\r\n![seaborn-0-11-0](https://user-images.githubusercontent.com/3239171/94969718-380d3e00-04d1-11eb-821b-9aad80ec696e.png)\r\n\r\n![seaborn-0-10-1](https://user-images.githubusercontent.com/3239171/94969722-3a6f9800-04d1-11eb-8ef1-861d9beb1f26.png)\r\n\r\n\r\n\r\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured 
summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nseaborn/axisgrid.py\n  function: JointGrid.__init__\n  function: PairGrid._map_diag_iter_hue\n  function: PairGrid._plot_bivariate_iter_hue\n"
    },
    {
      "similar_issue": {
        "issue_title": "linewidth calculation slow for histograms",
        "issue_body": "## Example\r\n\r\n```python\r\nimport pandas as pd\r\nimport seaborn as sns\r\n\r\ndiamonds = sns.load_dataset(\"diamonds\")\r\n```\r\n\r\n```python\r\n%%timeit\r\nsns.displot(diamonds, x=\"price\", row=\"cut\", col=\"color\")\r\n# 5.85 s ± 157 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\r\n```\r\n\r\n```python\r\n%%timeit\r\nsns.displot(diamonds, x=\"price\", row=\"cut\", col=\"color\", linewidth=0.1)\r\n# 4.37 s ± 81.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\r\n```\r\n\r\n## Description\r\n\r\nSetting the linewidth here is taking about ~25% of the computation time. I would note that in the case where I caught this, passing `linewidth` cut plot time in half. I believe the issue is due to an inefficiencies here: https://github.com/mwaskom/seaborn/blob/66b478390c20089de7f9644ba9965ce5d4f973ff/seaborn/distributions.py#L638-L684\r\n\r\nOne cause is that line widths are being set multiple times on each object. In the loop linked above, line widths are calculated, and then set for all subplots (not just the subplot corresponding for this subset). We can see this is the case since all the rectangles have the same linewidth in the end:\r\n\r\n```python\r\nfrom itertools import chain\r\nimport matplotlib as mpl\r\n\r\nsnsfig = sns.displot(diamonds, x=\"price\", row=\"cut\", col=\"color\")\r\n\r\nchildren = chain.from_iterable(ax.get_children() for ax in snsfig.axes.flat)\r\nrects = filter(lambda x: isinstance(x, mpl.patches.Rectangle), children)\r\n\r\n{rect.get_linewidth() for rect in rects}\r\n```\r\n\r\n```\r\n{0.0, 0.37466112012984926}\r\n```\r\n\r\nSomething a little more dataset dependent is this expression:\r\n\r\nhttps://github.com/mwaskom/seaborn/blob/66b478390c20089de7f9644ba9965ce5d4f973ff/seaborn/distributions.py#L647-L650\r\n\r\nWhere `index.to_frame()` can take a very long time in some circumstances. 
I'm not sure exactly what these are, but for my own data [`%%snakeviz`](https://jiffyclub.github.io/snakeviz/) noted a lot of time spend doing this. But I believe `hist_metadata` should be the same for each iterations of `sub_vars`, so this could probably just be moved outside the loop.\r\n\r\n## Version info\r\n\r\nThis is using seaborn at commit 66b478390c20089de7f9644ba9965ce5d4f973ff, though I'd initially noticed this using the last release.\r\n\r\n<details>\r\n<summary> sinfo report </summary>\r\n\r\n\r\n```\r\n-----\r\nmatplotlib          3.4.1\r\npandas              1.2.4\r\nseaborn             0.12.0.dev0\r\nsinfo               0.3.1\r\nvega_datasets       0.9.0\r\n-----\r\nPIL                 8.2.0\r\nappnope             0.1.2\r\nbackcall            0.2.0\r\ncffi                1.14.5\r\ncycler              0.10.0\r\ncython_runtime      NA\r\ndateutil            2.8.1\r\ndecorator           4.4.2\r\nipykernel           5.3.4\r\nipython_genutils    0.2.0\r\njedi                0.17.0\r\nkiwisolver          1.3.1\r\nmatplotlib          3.4.1\r\nmpl_toolkits        NA\r\nnumexpr             2.7.3\r\nnumpy               1.20.2\r\npandas              1.2.4\r\nparso               0.8.2\r\npexpect             4.8.0\r\npickleshare         0.7.5\r\nprompt_toolkit      3.0.17\r\nptyprocess          0.7.0\r\npygments            2.8.1\r\npyparsing           2.4.7\r\npytz                2021.1\r\nscipy               1.6.2\r\nseaborn             0.12.0.dev0\r\nsinfo               0.3.1\r\nsix                 1.15.0\r\nsnakeviz            2.1.0\r\nstatsmodels         0.12.2\r\nstoremagic          NA\r\ntornado             6.1\r\ntraitlets           5.0.5\r\nvega_datasets       0.9.0\r\nwcwidth             0.2.5\r\nzmq                 20.0.0\r\n-----\r\nIPython             7.22.0\r\njupyter_client      6.1.12\r\njupyter_core        4.7.1\r\nnotebook            6.3.0\r\n-----\r\nPython 3.8.8 (default, Apr 13 2021, 12:59:45) [Clang 10.0.0 
]\r\nmacOS-10.15.7-x86_64-i386-64bit\r\n16 logical CPU cores, i386\r\n-----\r\nSession information updated at 2021-04-16 16:25\r\n```\r\n\r\n</details>\r\n",
        "issue_id": 2555,
        "pr_number": 2559,
        "pr_title": "Reduce redundant computation in distplot linewidth",
        "pr_body": "Fixes #2555\r\n\r\n* moves `binwidth`, `thin_bar_idx`, and `left_edge` calculation out of the loop since it's invariant over the iterations\r\n* Only `set_linewidth` one per bar, instead of setting all bar's linewidth once per facet\r\n\r\n## Some evidence this works\r\n\r\nI've run this on this branch, and on master\r\n\r\n```python\r\nimport seaborn as sns\r\nimport matplotlib as mpl\r\nfrom setuptools_scm import get_version\r\n\r\n# To show commit\r\nprint(get_version(root='..', relative_to=sns.__file__))\r\n\r\ndiamonds = sns.load_dataset(\"diamonds\")\r\n```\r\n\r\n```python\r\n%%timeit\r\ng = sns.displot(diamonds, x=\"price\", row=\"cut\", col=\"color\")\r\n```\r\n\r\n```python\r\ng = sns.displot(diamonds, x=\"price\", row=\"cut\", col=\"color\")\r\nprint({rect.get_linewidth() for rect in g.fig.findobj(mpl.patches.Rectangle)})\r\n```\r\n\r\n### This branch\r\n\r\n```\r\n0.10.1.dev198+ga365acc\r\n4.08 s ± 60.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\r\n{0.0, 0.37466112012984926}\r\n```\r\n\r\n### master\r\n\r\n```\r\n0.10.1.dev197+g66b4783\r\n5.03 s ± 82.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\r\n{0.0, 0.37466112012984926}\r\n```\r\n\r\n## TODO\r\n\r\n- [x] Tests (manual?)",
        "issue_closed_at": "2021-04-23T11:40:36Z",
        "base_commit": "e04b07eb3df135511e71e556c2bd34ef59ba08ba"
      },
      "summary": "### Summary:\nThis issue is related to performance inefficiencies in the `seaborn` library when creating histogram plots with specific configurations. The problem is generally described as a slowdown in the computation of histograms when the `linewidth` parameter is not explicitly set by the user. The performance degradation is attributed to unnecessary and repetitive operations within the code, specifically within the `_DistributionPlotter.plot_univariate_histogram` function of the `seaborn.distributions` module.\n\n1. **Problem description in general terms:**\n   The issue involves inefficient computation of histogram plots in the `seaborn` library, particularly when the `linewidth` parameter is not specified. This inefficiency leads to longer execution times for generating plots.\n\n2. **Key symptoms and behaviors observed:**\n   - Significant increase in computation time when generating histogram plots without specifying `linewidth`.\n   - The redundant setting of line widths for plot elements within loops, leading to unnecessary recalculations.\n   - Observations indicated that setting the `linewidth` parameter manually can significantly reduce the plot generation time.\n\n3. **Affected components or systems:**\n   The issue affects the `seaborn` library, specifically the `displot` function used for generating histogram plots. The inefficiency is found in the `_DistributionPlotter.plot_univariate_histogram` function within the `seaborn.distributions` module.\n\n4. **Potential impact or severity:**\n   The impact of this issue is primarily on performance, causing longer execution times for generating plots, which can hinder productivity and usability, especially when working with large datasets or when generating multiple plots in sequence.\n\n5. 
**Relevant technical details abstracted for broader understanding:**\n   - The inefficiency is caused by redundant operations in the code, such as setting line widths multiple times and recalculating metadata for each subplot iteration.\n   - The problem is exacerbated by the use of `index.to_frame()`, which can be computationally expensive in certain scenarios.\n   - The issue can be mitigated by moving certain calculations outside of iterative loops and optimizing the setting of plot parameters to avoid redundant operations.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: linewidth calculation slow for histograms\n\nBody:\n## Example\r\n\r\n```python\r\nimport pandas as pd\r\nimport seaborn as sns\r\n\r\ndiamonds = sns.load_dataset(\"diamonds\")\r\n```\r\n\r\n```python\r\n%%timeit\r\nsns.displot(diamonds, x=\"price\", row=\"cut\", col=\"color\")\r\n# 5.85 s ± 157 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\r\n```\r\n\r\n```python\r\n%%timeit\r\nsns.displot(diamonds, x=\"price\", row=\"cut\", col=\"color\", linewidth=0.1)\r\n# 4.37 s ± 81.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\r\n```\r\n\r\n## Description\r\n\r\nSetting the linewidth here is taking about ~25% of the computation time. I would note that in the case where I caught this, passing `linewidth` cut plot time in half. I believe the issue is due to an inefficiencies here: https://github.com/mwaskom/seaborn/blob/66b478390c20089de7f9644ba9965ce5d4f973ff/seaborn/distributions.py#L638-L684\r\n\r\nOne cause is that line widths are being set multiple times on each object. In the loop linked above, line widths are calculated, and then set for all subplots (not just the subplot corresponding for this subset). 
We can see this is the case since all the rectangles have the same linewidth in the end:\r\n\r\n```python\r\nfrom itertools import chain\r\nimport matplotlib as mpl\r\n\r\nsnsfig = sns.displot(diamonds, x=\"price\", row=\"cut\", col=\"color\")\r\n\r\nchildren = chain.from_iterable(ax.get_children() for ax in snsfig.axes.flat)\r\nrects = filter(lambda x: isinstance(x, mpl.patches.Rectangle), children)\r\n\r\n{rect.get_linewidth() for rect in rects}\r\n```\r\n\r\n```\r\n{0.0, 0.37466112012984926}\r\n```\r\n\r\nSomething a little more dataset dependent is this expression:\r\n\r\nhttps://github.com/mwaskom/seaborn/blob/66b478390c20089de7f9644ba9965ce5d4f973ff/seaborn/distributions.py#L647-L650\r\n\r\nWhere `index.to_frame()` can take a very long time in some circumstances. I'm not sure exactly what these are, but for my own data [`%%snakeviz`](https://jiffyclub.github.io/snakeviz/) noted a lot of time spend doing this. But I believe `hist_metadata` should be the same for each iterations of `sub_vars`, so this could probably just be moved outside the loop.\r\n\r\n## Version info\r\n\r\nThis is using seaborn at commit 66b478390c20089de7f9644ba9965ce5d4f973ff, though I'd initially noticed this using the last release.\r\n\r\n<details>\r\n<summary> sinfo report </summary>\r\n\r\n\r\n```\r\n-----\r\nmatplotlib          3.4.1\r\npandas              1.2.4\r\nseaborn             0.12.0.dev0\r\nsinfo               0.3.1\r\nvega_datasets       0.9.0\r\n-----\r\nPIL                 8.2.0\r\nappnope             0.1.2\r\nbackcall            0.2.0\r\ncffi                1.14.5\r\ncycler              0.10.0\r\ncython_runtime      NA\r\ndateutil            2.8.1\r\ndecorator           4.4.2\r\nipykernel           5.3.4\r\nipython_genutils    0.2.0\r\njedi                0.17.0\r\nkiwisolver          1.3.1\r\nmatplotlib          3.4.1\r\nmpl_toolkits        NA\r\nnumexpr             2.7.3\r\nnumpy               1.20.2\r\npandas              1.2.4\r\nparso               0.8.2\r\npexpect  
           4.8.0\r\npickleshare         0.7.5\r\nprompt_toolkit      3.0.17\r\nptyprocess          0.7.0\r\npygments            2.8.1\r\npyparsing           2.4.7\r\npytz                2021.1\r\nscipy               1.6.2\r\nseaborn             0.12.0.dev0\r\nsinfo               0.3.1\r\nsix                 1.15.0\r\nsnakeviz            2.1.0\r\nstatsmodels         0.12.2\r\nstoremagic          NA\r\ntornado             6.1\r\ntraitlets           5.0.5\r\nvega_datasets       0.9.0\r\nwcwidth             0.2.5\r\nzmq                 20.0.0\r\n-----\r\nIPython             7.22.0\r\njupyter_client      6.1.12\r\njupyter_core        4.7.1\r\nnotebook            6.3.0\r\n-----\r\nPython 3.8.8 (default, Apr 13 2021, 12:59:45) [Clang 10.0.0 ]\r\nmacOS-10.15.7-x86_64-i386-64bit\r\n16 logical CPU cores, i386\r\n-----\r\nSession information updated at 2021-04-16 16:25\r\n```\r\n\r\n</details>\r\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nseaborn/distributions.py\n  function: _DistributionPlotter.plot_univariate_histogram\n  function: _DistributionPlotter.plot_univariate_histogram\n  function: _DistributionPlotter.plot_univariate_histogram\n"
    },
    {
      "similar_issue": {
        "issue_title": "histplot with categorical values crashes with missing data, though numerical values work fine",
        "issue_body": "Not sure if this is intended behaviour, but it caught me out due to the difference in handling numerical/categorical data. I note that drawing histograms of categorical data is labelled as experimental, so ignore/close if that explains it.\r\n\r\nWith numerical data `histplot` ignores NaN and plots the other values, this is the behaviour I would expect:\r\n\r\n```\r\nimport numpy as np\r\nimport seaborn as sns\r\n\r\nsns.histplot(\r\n    [1.1, 1.2, 1.3, 1.4, np.nan]\r\n)\r\n```\r\n\r\nbut with categorical data it crashes:\r\n```\r\nimport numpy as np\r\nimport seaborn as sns\r\n\r\nsns.histplot(\r\n    ['foo', 'foo', 'bar', np.nan]\r\n)\r\n\r\n# output\r\n---------------------------------------------------------------------------\r\nTypeError                                 Traceback (most recent call last)\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/axis.py in convert_units(self, x)\r\n   1519         try:\r\n-> 1520             ret = self.converter.convert(x, self.units, self)\r\n   1521         except Exception as e:\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/category.py in convert(value, unit, axis)\r\n     60         # force an update so it also does type checking\r\n---> 61         unit.update(values)\r\n     62         return np.vectorize(unit._mapping.__getitem__, otypes=[float])(values)\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/category.py in update(self, data)\r\n    210             # OrderedDict just iterates over unique values in data.\r\n--> 211             cbook._check_isinstance((str, bytes), value=val)\r\n    212             if convertible:\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/cbook/__init__.py in _check_isinstance(_types, **kwargs)\r\n   2234         if not isinstance(v, types):\r\n-> 2235             raise TypeError(\r\n   2236                 \"{!r} must be an instance of {}, not a 
{}\".format(\r\n\r\nTypeError: 'value' must be an instance of str or bytes, not a float\r\n\r\nThe above exception was the direct cause of the following exception:\r\n\r\nConversionError                           Traceback (most recent call last)\r\n<ipython-input-61-b132ea7dca6c> in <module>\r\n      2 import seaborn as sns\r\n      3 \r\n----> 4 sns.histplot(\r\n      5     ['foo', 'foo', 'bar', np.nan]\r\n      6 )\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/distributions.py in histplot(data, x, y, hue, weights, stat, bins, binwidth, binrange, discrete, cumulative, common_bins, common_norm, multiple, element, fill, shrink, kde, kde_kws, line_kws, thresh, pthresh, pmax, cbar, cbar_ax, cbar_kws, palette, hue_order, hue_norm, color, log_scale, legend, ax, **kwargs)\r\n   1420     if p.univariate:\r\n   1421 \r\n-> 1422         p.plot_univariate_histogram(\r\n   1423             multiple=multiple,\r\n   1424             element=element,\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/distributions.py in plot_univariate_histogram(self, multiple, element, fill, common_norm, common_bins, shrink, kde, kde_kws, color, legend, line_kws, estimate_kws, **plot_kws)\r\n    421 \r\n    422         # First pass through the data to compute the histograms\r\n--> 423         for sub_vars, sub_data in self.iter_data(\"hue\", from_comp_data=True):\r\n    424 \r\n    425             # Prepare the relevant data\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/_core.py in iter_data(self, grouping_vars, reverse, from_comp_data)\r\n    965 \r\n    966         if from_comp_data:\r\n--> 967             data = self.comp_data\r\n    968         else:\r\n    969             data = self.plot_data\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/_core.py in comp_data(self)\r\n   1034                 axis = getattr(ax, f\"{var}axis\")\r\n   1035 \r\n-> 1036                 comp_var = 
axis.convert_units(self.plot_data[var])\r\n   1037                 if axis.get_scale() == \"log\":\r\n   1038                     comp_var = np.log10(comp_var)\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/axis.py in convert_units(self, x)\r\n   1520             ret = self.converter.convert(x, self.units, self)\r\n   1521         except Exception as e:\r\n-> 1522             raise munits.ConversionError('Failed to convert value(s) to axis '\r\n   1523                                          f'units: {x!r}') from e\r\n   1524         return ret\r\n\r\nConversionError: Failed to convert value(s) to axis units: 0    foo\r\n1    foo\r\n2    bar\r\n3    NaN\r\nName: x, dtype: object\r\n```\r\n\r\n",
        "issue_id": 2295,
        "pr_number": 2417,
        "pr_title": "Improve NA robustness in VectorPlotter.comp_data",
        "pr_body": "This PR avoids passing `nan` through the matplotlib converters used to obtain a numeric/computable representation of the data (i.e. `VectorPlotter.comp_data`).\r\n\r\nIt also\r\n- codifies that the converted columns in `comp_data` have a float dtype\r\n- converts `inf` to `nan`, in line with what matplotlib does\r\n\r\nFixes #2295 \r\n\r\nAdditionally this will implicitly address #1971 once the regression plots are refactored to use `comp_data` internally. (@mojones, funny that you opened both issues).",
        "issue_closed_at": "2021-01-05T19:40:57Z",
        "base_commit": "aad96f8d2e36ceceb82a42b69aa3a8f47ef7210d"
      },
      "summary": "### Summary:\nThis issue is related to the Seaborn visualization library, specifically concerning the `histplot` function when handling datasets that include categorical values with missing data (e.g., `NaN`). The primary problem identified is a discrepancy in how `histplot` processes numerical versus categorical data when missing values are present. While the function operates as expected with numerical data by ignoring `NaN` values and plotting the rest, attempting the same operation with categorical data results in a crash. The crash is due to a `TypeError` arising from the library's underlying components, which expect categorical data to be instances of strings or bytes, rather than floats like `NaN`.\n\nKey symptoms include an inability to plot categorical data with missing values, leading to an error trace in the Matplotlib library, which Seaborn depends on for plotting. The error stems from the conversion process of data types within the plotting axis, where the presence of `NaN` (a float) in an otherwise string-based dataset causes a type mismatch.\n\nThe affected components are primarily within the Seaborn library's data handling and plotting functions, specifically within the `comp_data` function in the `VectorPlotter` class. This issue can potentially disrupt workflows that involve visual analysis of categorical datasets with missing entries, as the current behavior prevents such data from being plotted without preprocessing to handle or remove `NaN` values.\n\nThe impact is moderate, primarily affecting users attempting to visualize categorical data with missing entries. 
The severity is heightened by the fact that numerical data does not exhibit the same issue, potentially leading to unexpected behavior for those unaware of the limitation with categorical data.\n\nTechnical details reveal that the patch addresses this inconsistency by modifying the handling of data in the `comp_data` function, ensuring that categorical data with `NaN` values is processed without triggering type conversion errors. This change aims to align the behavior of `histplot` for both categorical and numerical data, thereby enhancing the library's robustness and user experience.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: histplot with categorical values crashes with missing data, though numerical values work fine\n\nBody:\nNot sure if this is intended behaviour, but it caught me out due to the difference in handling numerical/categorical data. I note that drawing histograms of categorical data is labelled as experimental, so ignore/close if that explains it.\r\n\r\nWith numerical data `histplot` ignores NaN and plots the other values, this is the behaviour I would expect:\r\n\r\n```\r\nimport numpy as np\r\nimport seaborn as sns\r\n\r\nsns.histplot(\r\n    [1.1, 1.2, 1.3, 1.4, np.nan]\r\n)\r\n```\r\n\r\nbut with categorical data it crashes:\r\n```\r\nimport numpy as np\r\nimport seaborn as sns\r\n\r\nsns.histplot(\r\n    ['foo', 'foo', 'bar', np.nan]\r\n)\r\n\r\n# output\r\n---------------------------------------------------------------------------\r\nTypeError                                 Traceback (most recent call last)\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/axis.py in convert_units(self, x)\r\n   1519         try:\r\n-> 1520             ret = self.converter.convert(x, self.units, self)\r\n   1521         except Exception as e:\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/category.py in convert(value, unit, axis)\r\n     60         # force an update so it also does type checking\r\n---> 61         unit.update(values)\r\n     62         return np.vectorize(unit._mapping.__getitem__, otypes=[float])(values)\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/category.py in update(self, data)\r\n    210             # OrderedDict just iterates over 
unique values in data.\r\n--> 211             cbook._check_isinstance((str, bytes), value=val)\r\n    212             if convertible:\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/cbook/__init__.py in _check_isinstance(_types, **kwargs)\r\n   2234         if not isinstance(v, types):\r\n-> 2235             raise TypeError(\r\n   2236                 \"{!r} must be an instance of {}, not a {}\".format(\r\n\r\nTypeError: 'value' must be an instance of str or bytes, not a float\r\n\r\nThe above exception was the direct cause of the following exception:\r\n\r\nConversionError                           Traceback (most recent call last)\r\n<ipython-input-61-b132ea7dca6c> in <module>\r\n      2 import seaborn as sns\r\n      3 \r\n----> 4 sns.histplot(\r\n      5     ['foo', 'foo', 'bar', np.nan]\r\n      6 )\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/distributions.py in histplot(data, x, y, hue, weights, stat, bins, binwidth, binrange, discrete, cumulative, common_bins, common_norm, multiple, element, fill, shrink, kde, kde_kws, line_kws, thresh, pthresh, pmax, cbar, cbar_ax, cbar_kws, palette, hue_order, hue_norm, color, log_scale, legend, ax, **kwargs)\r\n   1420     if p.univariate:\r\n   1421 \r\n-> 1422         p.plot_univariate_histogram(\r\n   1423             multiple=multiple,\r\n   1424             element=element,\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/distributions.py in plot_univariate_histogram(self, multiple, element, fill, common_norm, common_bins, shrink, kde, kde_kws, color, legend, line_kws, estimate_kws, **plot_kws)\r\n    421 \r\n    422         # First pass through the data to compute the histograms\r\n--> 423         for sub_vars, sub_data in self.iter_data(\"hue\", from_comp_data=True):\r\n    424 \r\n    425             # Prepare the relevant data\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/_core.py in iter_data(self, 
grouping_vars, reverse, from_comp_data)\r\n    965 \r\n    966         if from_comp_data:\r\n--> 967             data = self.comp_data\r\n    968         else:\r\n    969             data = self.plot_data\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/_core.py in comp_data(self)\r\n   1034                 axis = getattr(ax, f\"{var}axis\")\r\n   1035 \r\n-> 1036                 comp_var = axis.convert_units(self.plot_data[var])\r\n   1037                 if axis.get_scale() == \"log\":\r\n   1038                     comp_var = np.log10(comp_var)\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/axis.py in convert_units(self, x)\r\n   1520             ret = self.converter.convert(x, self.units, self)\r\n   1521         except Exception as e:\r\n-> 1522             raise munits.ConversionError('Failed to convert value(s) to axis '\r\n   1523                                          f'units: {x!r}') from e\r\n   1524         return ret\r\n\r\nConversionError: Failed to convert value(s) to axis units: 0    foo\r\n1    foo\r\n2    bar\r\n3    NaN\r\nName: x, dtype: object\r\n```\r\n\r\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nseaborn/_core.py\n  function: VectorPlotter.comp_data\n"
    }
  ]
}