{
  "original_problem": {
    "instance_id": "mwaskom__seaborn-3010",
    "repo": "mwaskom/seaborn",
    "created_at": "2022-09-11T19:37:32Z",
    "problem_statement": "PolyFit is not robust to missing data\n```python\r\nso.Plot([1, 2, 3, None, 4], [1, 2, 3, 4, 5]).add(so.Line(), so.PolyFit())\r\n```\r\n\r\n<details><summary>Traceback</summary>\r\n\r\n```python-traceback\r\n---------------------------------------------------------------------------\r\nLinAlgError                               Traceback (most recent call last)\r\nFile ~/miniconda3/envs/seaborn-py39-latest/lib/python3.9/site-packages/IPython/core/formatters.py:343, in BaseFormatter.__call__(self, obj)\r\n    341     method = get_real_method(obj, self.print_method)\r\n    342     if method is not None:\r\n--> 343         return method()\r\n    344     return None\r\n    345 else:\r\n\r\nFile ~/code/seaborn/seaborn/_core/plot.py:265, in Plot._repr_png_(self)\r\n    263 def _repr_png_(self) -> tuple[bytes, dict[str, float]]:\r\n--> 265     return self.plot()._repr_png_()\r\n\r\nFile ~/code/seaborn/seaborn/_core/plot.py:804, in Plot.plot(self, pyplot)\r\n    800 \"\"\"\r\n    801 Compile the plot spec and return the Plotter object.\r\n    802 \"\"\"\r\n    803 with theme_context(self._theme_with_defaults()):\r\n--> 804     return self._plot(pyplot)\r\n\r\nFile ~/code/seaborn/seaborn/_core/plot.py:822, in Plot._plot(self, pyplot)\r\n    819 plotter._setup_scales(self, common, layers, coord_vars)\r\n    821 # Apply statistical transform(s)\r\n--> 822 plotter._compute_stats(self, layers)\r\n    824 # Process scale spec for semantic variables and coordinates computed by stat\r\n    825 plotter._setup_scales(self, common, layers)\r\n\r\nFile ~/code/seaborn/seaborn/_core/plot.py:1110, in Plotter._compute_stats(self, spec, layers)\r\n   1108     grouper = grouping_vars\r\n   1109 groupby = GroupBy(grouper)\r\n-> 1110 res = stat(df, groupby, orient, scales)\r\n   1112 if pair_vars:\r\n   1113     data.frames[coord_vars] = res\r\n\r\nFile ~/code/seaborn/seaborn/_stats/regression.py:41, in PolyFit.__call__(self, data, groupby, orient, scales)\r\n     39 
def __call__(self, data, groupby, orient, scales):\r\n---> 41     return groupby.apply(data, self._fit_predict)\r\n\r\nFile ~/code/seaborn/seaborn/_core/groupby.py:109, in GroupBy.apply(self, data, func, *args, **kwargs)\r\n    106 grouper, groups = self._get_groups(data)\r\n    108 if not grouper:\r\n--> 109     return self._reorder_columns(func(data, *args, **kwargs), data)\r\n    111 parts = {}\r\n    112 for key, part_df in data.groupby(grouper, sort=False):\r\n\r\nFile ~/code/seaborn/seaborn/_stats/regression.py:30, in PolyFit._fit_predict(self, data)\r\n     28     xx = yy = []\r\n     29 else:\r\n---> 30     p = np.polyfit(x, y, self.order)\r\n     31     xx = np.linspace(x.min(), x.max(), self.gridsize)\r\n     32     yy = np.polyval(p, xx)\r\n\r\nFile <__array_function__ internals>:180, in polyfit(*args, **kwargs)\r\n\r\nFile ~/miniconda3/envs/seaborn-py39-latest/lib/python3.9/site-packages/numpy/lib/polynomial.py:668, in polyfit(x, y, deg, rcond, full, w, cov)\r\n    666 scale = NX.sqrt((lhs*lhs).sum(axis=0))\r\n    667 lhs /= scale\r\n--> 668 c, resids, rank, s = lstsq(lhs, rhs, rcond)\r\n    669 c = (c.T/scale).T  # broadcast scale coefficients\r\n    671 # warn on rank reduction, which indicates an ill conditioned matrix\r\n\r\nFile <__array_function__ internals>:180, in lstsq(*args, **kwargs)\r\n\r\nFile ~/miniconda3/envs/seaborn-py39-latest/lib/python3.9/site-packages/numpy/linalg/linalg.py:2300, in lstsq(a, b, rcond)\r\n   2297 if n_rhs == 0:\r\n   2298     # lapack can't handle n_rhs = 0 - so allocate the array one larger in that axis\r\n   2299     b = zeros(b.shape[:-2] + (m, n_rhs + 1), dtype=b.dtype)\r\n-> 2300 x, resids, rank, s = gufunc(a, b, rcond, signature=signature, extobj=extobj)\r\n   2301 if m == 0:\r\n   2302     x[...] 
= 0\r\n\r\nFile ~/miniconda3/envs/seaborn-py39-latest/lib/python3.9/site-packages/numpy/linalg/linalg.py:101, in _raise_linalgerror_lstsq(err, flag)\r\n    100 def _raise_linalgerror_lstsq(err, flag):\r\n--> 101     raise LinAlgError(\"SVD did not converge in Linear Least Squares\")\r\n\r\nLinAlgError: SVD did not converge in Linear Least Squares\r\n\r\n```\r\n\r\n</details>\n",
    "patch": "diff --git a/seaborn/_stats/regression.py b/seaborn/_stats/regression.py\n--- a/seaborn/_stats/regression.py\n+++ b/seaborn/_stats/regression.py\n@@ -38,7 +38,10 @@ def _fit_predict(self, data):\n \n     def __call__(self, data, groupby, orient, scales):\n \n-        return groupby.apply(data, self._fit_predict)\n+        return (\n+            groupby\n+            .apply(data.dropna(subset=[\"x\", \"y\"]), self._fit_predict)\n+        )\n \n \n @dataclass\n"
  },
  "candidates_evaluated": 5,
  "judgment_result": {
    "candidates": [
      {
        "idx": 1,
        "id": "similar_2476",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue involves the histplot `shrink` parameter shifting the plotted bars, unrelated to handling missing data or robustness improvements."
      },
      {
        "idx": 2,
        "id": "similar_2502",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue is about log-scaling and data validation, not related to missing data handling or robustness."
      },
      {
        "idx": 3,
        "id": "similar_2307",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue involves a regression in how the PairGrid `map_*` methods apply user-defined functions, unrelated to missing data handling or robustness."
      },
      {
        "idx": 4,
        "id": "similar_2555",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue is about performance inefficiencies, not related to handling missing data or robustness improvements."
      },
      {
        "idx": 5,
        "id": "similar_2295",
        "decision": "Useful",
        "confidence": "High",
        "reason": "The issue involves handling missing data in plotting functions, directly relevant to improving robustness against missing data."
      }
    ]
  },
  "raw_summaries": [
    {
      "similar_issue": {
        "issue_title": "shrink parameter in histplot shifts data",
        "issue_body": "The smaller the value of the `shrink` parameter, the more the values in the histogram get shifted towards positive values.\r\n\r\n`import seaborn as sns`\r\n`import numpy as np`\r\n\r\n`r = np.random.random(100)`\r\n\r\n\r\n`sns.histplot(r);`\r\n\r\n`sns.histplot(r, shrink=0.5);`\r\n\r\n![Screen Shot 2021-02-08 at 10 11 43 PM](https://user-images.githubusercontent.com/35338267/107310557-aacb8e80-6a5a-11eb-991e-d4d4cbfd15b7.png)\r\n\r\n![Screen Shot 2021-02-08 at 10 15 46 PM](https://user-images.githubusercontent.com/35338267/107310839-36ddb600-6a5b-11eb-9f0d-54ae495f14cc.png)\r\n",
        "issue_id": 2476,
        "pr_number": 2477,
        "pr_title": "Fix histplot shrink with non-discrete bins",
        "pr_body": "Fixes #2476\r\n\r\nThe code for shifting the shrunken bars assumed that discrete binning\r\nwas in effect. This is probably the only situation where shrinking\r\nreally makes sense, but there was no prevention or warning of getting\r\nan innacurate result when using it with continuous bins.\r\n\r\nIt works properly now:\r\n\r\n```python\r\nsns.histplot(data=tips, x=\"total_bill\", binwidth=8)\r\nsns.histplot(data=tips, x=\"total_bill\", binwidth=8, shrink=.6)\r\n```\r\n![image](https://user-images.githubusercontent.com/315810/107373188-3e7d7900-6ab4-11eb-9a35-821bd76fbfd0.png)\r\n\r\n```python\r\nsns.histplot(data=tips, x=\"total_bill\", binwidth=8, color=\".6\")\r\nsns.histplot(data=tips, x=\"total_bill\", hue=\"time\", multiple=\"dodge\", binwidth=8, shrink=.6)\r\n```\r\n![image](https://user-images.githubusercontent.com/315810/107373289-610f9200-6ab4-11eb-990d-727132a53526.png)\r\n",
        "issue_closed_at": "2021-02-10T00:01:13Z",
        "base_commit": "b1dc1bc336ca2aec8308915836ec0550397e856e"
      },
      "summary": "### Summary:\n\nThis issue is related to the `shrink` parameter within the Seaborn data visualization library, specifically affecting the `histplot` function. The problem manifests when using smaller values for the `shrink` parameter, which inadvertently causes the data values in the histogram to shift towards positive values. This unexpected behavior alters the visual representation of the data, potentially leading to misinterpretations.\n\n1. **Problem description in general terms**: \n   The issue involves an unexpected data shift in histogram plots when using the `shrink` parameter in Seaborn's `histplot` function. The smaller the shrink value, the more pronounced the shift towards positive values.\n\n2. **Key symptoms and behaviors observed**: \n   - Data values in histograms appear to shift towards the positive side as the `shrink` parameter value decreases.\n   - The visual output does not accurately reflect the original dataset's distribution, impacting data analysis.\n\n3. **Affected components or systems**: \n   The issue affects Seaborn's `histplot` function, specifically the component responsible for plotting univariate histograms within the `distributions.py` module.\n\n4. **Potential impact or severity**: \n   This bug can mislead users analyzing data distributions, potentially leading to incorrect conclusions or decisions based on inaccurate histogram representations. The severity is moderate, as it directly impacts data visualization integrity.\n\n5. **Relevant technical details abstracted for broader understanding**: \n   The root cause is associated with the implementation of the `shrink` parameter within the histogram plotting function. Adjustments in the function responsible for plotting univariate histograms were necessary to ensure accurate data representation regardless of the shrink value. The code changes were made in the `plot_univariate_histogram` function in the `distributions.py` file of the Seaborn library.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: shrink parameter in histplot shifts data\n\nBody:\nThe smaller the value of the `shrink` parameter, the more the values in the histogram get shifted towards positive values.\r\n\r\n`import seaborn as sns`\r\n`import numpy as np`\r\n\r\n`r = np.random.random(100)`\r\n\r\n\r\n`sns.histplot(r);`\r\n\r\n`sns.histplot(r, shrink=0.5);`\r\n\r\n![Screen Shot 2021-02-08 at 10 11 43 PM](https://user-images.githubusercontent.com/35338267/107310557-aacb8e80-6a5a-11eb-991e-d4d4cbfd15b7.png)\r\n\r\n![Screen Shot 2021-02-08 at 10 15 46 PM](https://user-images.githubusercontent.com/35338267/107310839-36ddb600-6a5b-11eb-9f0d-54ae495f14cc.png)\r\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nseaborn/distributions.py\n  function: _DistributionPlotter.plot_univariate_histogram\n  function: _DistributionPlotter.plot_univariate_histogram\n"
    },
    {
      "similar_issue": {
        "issue_title": "displot(kind='ecdf',..., log_scale=True) not working",
        "issue_body": "The following line of code gives an error:\r\n\r\n```\r\nsns.displot(kind='ecdf', data=df, x='col_1', log_scale=True)\r\n\r\nUserWarning: Data has no positive values, and therefore cannot be log-scaled.\r\n```\r\n\r\n\r\nMy data is all positive and kind='hist' or 'kde' works just fine.\r\n\r\n",
        "issue_id": 2502,
        "pr_number": 2504,
        "pr_title": "Fix log scaling in distribution plots",
        "pr_body": "Fixes #2502 \r\n\r\nThis is a huge development footgun; see #2409 for thoughts on how this can be made automatic to reduce the risk of such bugs",
        "issue_closed_at": "2021-03-24T21:54:09Z",
        "base_commit": "ba4bd0fa0a90b2bd00cb62c2b4a5e38013a73ac6"
      },
      "summary": "### Summary: This issue is related to the seaborn library, specifically when using the `displot` function with the `kind='ecdf'` parameter and enabling `log_scale`. The error encountered suggests a problem with log-scaling data that is already positive, generating a misleading warning indicating that the data contains no positive values. This issue arises only when `kind='ecdf'` is specified, as other plot types like 'hist' or 'kde' work correctly under similar conditions. The underlying problem likely involves the incorrect handling or validation of data within the `plot_univariate_ecdf` and `_plot_single_rug` functions in the seaborn library's `distributions.py` file. The severity of this issue is moderate, affecting users relying on empirical cumulative distribution plots with logarithmic scaling, potentially leading to confusion and the inability to visualize data as intended. The fix involves adjusting the logic within these functions to correctly assess and handle positive data values, ensuring compatibility with the `log_scale` parameter.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: displot(kind='ecdf',..., log_scale=True) not working\n\nBody:\nThe following line of code gives an error:\r\n\r\n```\r\nsns.displot(kind='ecdf', data=df, x='col_1', log_scale=True)\r\n\r\nUserWarning: Data has no positive values, and therefore cannot be log-scaled.\r\n```\r\n\r\n\r\nMy data is all positive and kind='hist' or 'kde' works just fine.\r\n\r\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nseaborn/distributions.py\n  line: line 57\n  function: _DistributionPlotter.plot_univariate_ecdf\n  function: _DistributionPlotter._plot_single_rug\n"
    },
    {
      "similar_issue": {
        "issue_title": "map_* methods of PairGrid broken in 0.11.0",
        "issue_body": "Hi,\r\n\r\nI discovered that the map_* methods of PairGrid seem to be broken in version 0.11.0 for user defined functions. See reproducible example below with a corrfunc defined to plot the pearson correlation value on the lower plots. The function doesn't seem to get evaluated in version 0.11.0. When I pip install seaborn==0.10.1, I get the desired result. Plots from both cases also attached.\r\n\r\n```import numpy as np\r\nfrom scipy import stats\r\nimport pandas as pd\r\nimport seaborn as sns\r\nimport matplotlib.pyplot as plt\r\nsns.set(style=\"white\")\r\n\r\nmean = np.zeros(3)\r\ncov = np.random.uniform(.2, .4, (3, 3))\r\ncov += cov.T\r\ncov[np.diag_indices(3)] = 1\r\ndata = np.random.multivariate_normal(mean, cov, 100)\r\ndf = pd.DataFrame(data, columns=[\"X\", \"Y\", \"Z\"])\r\n\r\ndef corrfunc(x, y,**kws):\r\n    r, _ = stats.pearsonr(x, y)\r\n    ax = plt.gca()\r\n    ax.annotate(\"r = {:.2f}\".format(r),\r\n                xy=(.1, .9), xycoords=ax.transAxes)\r\n\r\ng = sns.PairGrid(df, palette=[\"red\"])\r\ng.map_upper(plt.scatter, s=10)\r\ng.map_diag(sns.distplot, kde=False)\r\ng.map_lower(sns.kdeplot, cmap=\"Blues_d\")\r\ng.map_lower(corrfunc)\r\nplt.show()\r\n\r\n```\r\n\r\n![seaborn-0-11-0](https://user-images.githubusercontent.com/3239171/94969718-380d3e00-04d1-11eb-821b-9aad80ec696e.png)\r\n\r\n![seaborn-0-10-1](https://user-images.githubusercontent.com/3239171/94969722-3a6f9800-04d1-11eb-8ef1-861d9beb1f26.png)\r\n\r\n\r\n\r\n",
        "issue_id": 2307,
        "pr_number": 2368,
        "pr_title": "Fix pairgrid off-diagonal plots with non-string column names",
        "pr_body": "Fixes #2307\r\n\r\nWorking reprex from original issue:\r\n\r\n![image](https://user-images.githubusercontent.com/315810/100743239-108c0200-33aa-11eb-8885-9c61d89b3acc.png)\r\n",
        "issue_closed_at": "2020-12-01T20:48:45Z",
        "base_commit": "2717408b564994002fe08f72ba2dd7e1acf359b6"
      },
      "summary": "### Summary:\n\nThis issue pertains to a regression in the seaborn library, specifically affecting the `map_*` methods of the `PairGrid` class, starting from version 0.11.0. The problem manifests when user-defined functions, such as a custom correlation function intended to annotate plots, fail to execute as expected. This issue is highlighted by a reproducible example where the `corrfunc` function, designed to display Pearson correlation coefficients on plots, does not execute in version 0.11.0, although it functions correctly in version 0.10.1.\n\nKey symptoms include the failure of custom functions used in conjunction with `PairGrid` methods, resulting in incomplete or incorrect plot annotations. The core component affected is the `PairGrid` class, part of seaborn's plotting infrastructure, which is used for creating a matrix of plots to visualize pairwise relationships in a dataset.\n\nThe potential impact of this issue is significant for users relying on seaborn for advanced data visualization, particularly those who utilize custom plotting functions to enhance their visualizations. The regression can disrupt workflows, as users may not be aware of the underlying issue causing expected plot annotations to be missing.\n\nTechnical details indicate that the issue likely arises from changes within the seaborn codebase, specifically in functions related to plot iteration and hue handling in the `axisgrid.py` file. The problem was addressed by modifying these functions, suggesting that the regression was due to alterations in how seaborn handles plotting logic and function application across grid elements.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: map_* methods of PairGrid broken in 0.11.0\n\nBody:\nHi,\r\n\r\nI discovered that the map_* methods of PairGrid seem to be broken in version 0.11.0 for user defined functions. See reproducible example below with a corrfunc defined to plot the pearson correlation value on the lower plots. The function doesn't seem to get evaluated in version 0.11.0. When I pip install seaborn==0.10.1, I get the desired result. Plots from both cases also attached.\r\n\r\n```import numpy as np\r\nfrom scipy import stats\r\nimport pandas as pd\r\nimport seaborn as sns\r\nimport matplotlib.pyplot as plt\r\nsns.set(style=\"white\")\r\n\r\nmean = np.zeros(3)\r\ncov = np.random.uniform(.2, .4, (3, 3))\r\ncov += cov.T\r\ncov[np.diag_indices(3)] = 1\r\ndata = np.random.multivariate_normal(mean, cov, 100)\r\ndf = pd.DataFrame(data, columns=[\"X\", \"Y\", \"Z\"])\r\n\r\ndef corrfunc(x, y,**kws):\r\n    r, _ = stats.pearsonr(x, y)\r\n    ax = plt.gca()\r\n    ax.annotate(\"r = {:.2f}\".format(r),\r\n                xy=(.1, .9), xycoords=ax.transAxes)\r\n\r\ng = sns.PairGrid(df, palette=[\"red\"])\r\ng.map_upper(plt.scatter, s=10)\r\ng.map_diag(sns.distplot, kde=False)\r\ng.map_lower(sns.kdeplot, cmap=\"Blues_d\")\r\ng.map_lower(corrfunc)\r\nplt.show()\r\n\r\n```\r\n\r\n![seaborn-0-11-0](https://user-images.githubusercontent.com/3239171/94969718-380d3e00-04d1-11eb-821b-9aad80ec696e.png)\r\n\r\n![seaborn-0-10-1](https://user-images.githubusercontent.com/3239171/94969722-3a6f9800-04d1-11eb-8ef1-861d9beb1f26.png)\r\n\r\n\r\n\r\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured 
summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nseaborn/axisgrid.py\n  function: JointGrid.__init__\n  function: PairGrid._map_diag_iter_hue\n  function: PairGrid._plot_bivariate_iter_hue\n"
    },
    {
      "similar_issue": {
        "issue_title": "linewidth calculation slow for histograms",
        "issue_body": "## Example\r\n\r\n```python\r\nimport pandas as pd\r\nimport seaborn as sns\r\n\r\ndiamonds = sns.load_dataset(\"diamonds\")\r\n```\r\n\r\n```python\r\n%%timeit\r\nsns.displot(diamonds, x=\"price\", row=\"cut\", col=\"color\")\r\n# 5.85 s ± 157 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\r\n```\r\n\r\n```python\r\n%%timeit\r\nsns.displot(diamonds, x=\"price\", row=\"cut\", col=\"color\", linewidth=0.1)\r\n# 4.37 s ± 81.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\r\n```\r\n\r\n## Description\r\n\r\nSetting the linewidth here is taking about ~25% of the computation time. I would note that in the case where I caught this, passing `linewidth` cut plot time in half. I believe the issue is due to an inefficiencies here: https://github.com/mwaskom/seaborn/blob/66b478390c20089de7f9644ba9965ce5d4f973ff/seaborn/distributions.py#L638-L684\r\n\r\nOne cause is that line widths are being set multiple times on each object. In the loop linked above, line widths are calculated, and then set for all subplots (not just the subplot corresponding for this subset). We can see this is the case since all the rectangles have the same linewidth in the end:\r\n\r\n```python\r\nfrom itertools import chain\r\nimport matplotlib as mpl\r\n\r\nsnsfig = sns.displot(diamonds, x=\"price\", row=\"cut\", col=\"color\")\r\n\r\nchildren = chain.from_iterable(ax.get_children() for ax in snsfig.axes.flat)\r\nrects = filter(lambda x: isinstance(x, mpl.patches.Rectangle), children)\r\n\r\n{rect.get_linewidth() for rect in rects}\r\n```\r\n\r\n```\r\n{0.0, 0.37466112012984926}\r\n```\r\n\r\nSomething a little more dataset dependent is this expression:\r\n\r\nhttps://github.com/mwaskom/seaborn/blob/66b478390c20089de7f9644ba9965ce5d4f973ff/seaborn/distributions.py#L647-L650\r\n\r\nWhere `index.to_frame()` can take a very long time in some circumstances. 
I'm not sure exactly what these are, but for my own data [`%%snakeviz`](https://jiffyclub.github.io/snakeviz/) noted a lot of time spend doing this. But I believe `hist_metadata` should be the same for each iterations of `sub_vars`, so this could probably just be moved outside the loop.\r\n\r\n## Version info\r\n\r\nThis is using seaborn at commit 66b478390c20089de7f9644ba9965ce5d4f973ff, though I'd initially noticed this using the last release.\r\n\r\n<details>\r\n<summary> sinfo report </summary>\r\n\r\n\r\n```\r\n-----\r\nmatplotlib          3.4.1\r\npandas              1.2.4\r\nseaborn             0.12.0.dev0\r\nsinfo               0.3.1\r\nvega_datasets       0.9.0\r\n-----\r\nPIL                 8.2.0\r\nappnope             0.1.2\r\nbackcall            0.2.0\r\ncffi                1.14.5\r\ncycler              0.10.0\r\ncython_runtime      NA\r\ndateutil            2.8.1\r\ndecorator           4.4.2\r\nipykernel           5.3.4\r\nipython_genutils    0.2.0\r\njedi                0.17.0\r\nkiwisolver          1.3.1\r\nmatplotlib          3.4.1\r\nmpl_toolkits        NA\r\nnumexpr             2.7.3\r\nnumpy               1.20.2\r\npandas              1.2.4\r\nparso               0.8.2\r\npexpect             4.8.0\r\npickleshare         0.7.5\r\nprompt_toolkit      3.0.17\r\nptyprocess          0.7.0\r\npygments            2.8.1\r\npyparsing           2.4.7\r\npytz                2021.1\r\nscipy               1.6.2\r\nseaborn             0.12.0.dev0\r\nsinfo               0.3.1\r\nsix                 1.15.0\r\nsnakeviz            2.1.0\r\nstatsmodels         0.12.2\r\nstoremagic          NA\r\ntornado             6.1\r\ntraitlets           5.0.5\r\nvega_datasets       0.9.0\r\nwcwidth             0.2.5\r\nzmq                 20.0.0\r\n-----\r\nIPython             7.22.0\r\njupyter_client      6.1.12\r\njupyter_core        4.7.1\r\nnotebook            6.3.0\r\n-----\r\nPython 3.8.8 (default, Apr 13 2021, 12:59:45) [Clang 10.0.0 
]\r\nmacOS-10.15.7-x86_64-i386-64bit\r\n16 logical CPU cores, i386\r\n-----\r\nSession information updated at 2021-04-16 16:25\r\n```\r\n\r\n</details>\r\n",
        "issue_id": 2555,
        "pr_number": 2559,
        "pr_title": "Reduce redundant computation in distplot linewidth",
        "pr_body": "Fixes #2555\r\n\r\n* moves `binwidth`, `thin_bar_idx`, and `left_edge` calculation out of the loop since it's invariant over the iterations\r\n* Only `set_linewidth` one per bar, instead of setting all bar's linewidth once per facet\r\n\r\n## Some evidence this works\r\n\r\nI've run this on this branch, and on master\r\n\r\n```python\r\nimport seaborn as sns\r\nimport matplotlib as mpl\r\nfrom setuptools_scm import get_version\r\n\r\n# To show commit\r\nprint(get_version(root='..', relative_to=sns.__file__))\r\n\r\ndiamonds = sns.load_dataset(\"diamonds\")\r\n```\r\n\r\n```python\r\n%%timeit\r\ng = sns.displot(diamonds, x=\"price\", row=\"cut\", col=\"color\")\r\n```\r\n\r\n```python\r\ng = sns.displot(diamonds, x=\"price\", row=\"cut\", col=\"color\")\r\nprint({rect.get_linewidth() for rect in g.fig.findobj(mpl.patches.Rectangle)})\r\n```\r\n\r\n### This branch\r\n\r\n```\r\n0.10.1.dev198+ga365acc\r\n4.08 s ± 60.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\r\n{0.0, 0.37466112012984926}\r\n```\r\n\r\n### master\r\n\r\n```\r\n0.10.1.dev197+g66b4783\r\n5.03 s ± 82.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\r\n{0.0, 0.37466112012984926}\r\n```\r\n\r\n## TODO\r\n\r\n- [x] Tests (manual?)",
        "issue_closed_at": "2021-04-23T11:40:36Z",
        "base_commit": "e04b07eb3df135511e71e556c2bd34ef59ba08ba"
      },
      "summary": "### Summary: \nThis issue pertains to performance inefficiencies in the Seaborn library when plotting histograms using the `displot` function. Specifically, the problem arises from the way linewidth calculations are handled, resulting in prolonged computation times during plot generation. \n\n1. **Problem Description in General Terms**: \n   The plotting of histograms in the Seaborn library experiences significant slowdown due to inefficient calculations and settings of line widths. This inefficiency is rooted in redundant operations within the plotting function.\n\n2. **Key Symptoms and Behaviors Observed**: \n   - The inclusion of a `linewidth` parameter in the `displot` function reduces plot computation time by approximately 25%.\n   - Without specifying `linewidth`, the default operations result in unnecessarily prolonged execution times.\n   - The linewidths on histogram rectangles are redundantly set multiple times, impacting performance.\n\n3. **Affected Components or Systems**: \n   The issue specifically affects the `plot_univariate_histogram` function within the `seaborn/distributions.py` module of the Seaborn library.\n\n4. **Potential Impact or Severity**: \n   The impact is primarily on performance, with the potential to significantly hinder efficiency in data visualization tasks involving histograms. This can affect users who rely on Seaborn for quick and interactive data exploration.\n\n5. **Relevant Technical Details Abstracted for Broader Understanding**: \n   - Redundant linewidth calculations occur within loops, leading to inefficiencies.\n   - The `index.to_frame()` operation is identified as a potential bottleneck and is suggested to be moved outside of iterative loops to improve performance.\n   - The problem was observed in the Seaborn version at a specific commit (66b478390c20089de7f9644ba9965ce5d4f973ff), indicating a need for updates or patches to improve this aspect of the library. 
\n\nThis issue highlights the importance of optimizing internal operations within data visualization libraries to ensure efficient performance, particularly when dealing with large datasets or complex plots.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: linewidth calculation slow for histograms\n\nBody:\n## Example\r\n\r\n```python\r\nimport pandas as pd\r\nimport seaborn as sns\r\n\r\ndiamonds = sns.load_dataset(\"diamonds\")\r\n```\r\n\r\n```python\r\n%%timeit\r\nsns.displot(diamonds, x=\"price\", row=\"cut\", col=\"color\")\r\n# 5.85 s ± 157 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\r\n```\r\n\r\n```python\r\n%%timeit\r\nsns.displot(diamonds, x=\"price\", row=\"cut\", col=\"color\", linewidth=0.1)\r\n# 4.37 s ± 81.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\r\n```\r\n\r\n## Description\r\n\r\nSetting the linewidth here is taking about ~25% of the computation time. I would note that in the case where I caught this, passing `linewidth` cut plot time in half. I believe the issue is due to an inefficiencies here: https://github.com/mwaskom/seaborn/blob/66b478390c20089de7f9644ba9965ce5d4f973ff/seaborn/distributions.py#L638-L684\r\n\r\nOne cause is that line widths are being set multiple times on each object. In the loop linked above, line widths are calculated, and then set for all subplots (not just the subplot corresponding for this subset). 
We can see this is the case since all the rectangles have the same linewidth in the end:\r\n\r\n```python\r\nfrom itertools import chain\r\nimport matplotlib as mpl\r\n\r\nsnsfig = sns.displot(diamonds, x=\"price\", row=\"cut\", col=\"color\")\r\n\r\nchildren = chain.from_iterable(ax.get_children() for ax in snsfig.axes.flat)\r\nrects = filter(lambda x: isinstance(x, mpl.patches.Rectangle), children)\r\n\r\n{rect.get_linewidth() for rect in rects}\r\n```\r\n\r\n```\r\n{0.0, 0.37466112012984926}\r\n```\r\n\r\nSomething a little more dataset dependent is this expression:\r\n\r\nhttps://github.com/mwaskom/seaborn/blob/66b478390c20089de7f9644ba9965ce5d4f973ff/seaborn/distributions.py#L647-L650\r\n\r\nWhere `index.to_frame()` can take a very long time in some circumstances. I'm not sure exactly what these are, but for my own data [`%%snakeviz`](https://jiffyclub.github.io/snakeviz/) noted a lot of time spend doing this. But I believe `hist_metadata` should be the same for each iterations of `sub_vars`, so this could probably just be moved outside the loop.\r\n\r\n## Version info\r\n\r\nThis is using seaborn at commit 66b478390c20089de7f9644ba9965ce5d4f973ff, though I'd initially noticed this using the last release.\r\n\r\n<details>\r\n<summary> sinfo report </summary>\r\n\r\n\r\n```\r\n-----\r\nmatplotlib          3.4.1\r\npandas              1.2.4\r\nseaborn             0.12.0.dev0\r\nsinfo               0.3.1\r\nvega_datasets       0.9.0\r\n-----\r\nPIL                 8.2.0\r\nappnope             0.1.2\r\nbackcall            0.2.0\r\ncffi                1.14.5\r\ncycler              0.10.0\r\ncython_runtime      NA\r\ndateutil            2.8.1\r\ndecorator           4.4.2\r\nipykernel           5.3.4\r\nipython_genutils    0.2.0\r\njedi                0.17.0\r\nkiwisolver          1.3.1\r\nmatplotlib          3.4.1\r\nmpl_toolkits        NA\r\nnumexpr             2.7.3\r\nnumpy               1.20.2\r\npandas              1.2.4\r\nparso               0.8.2\r\npexpect  
           4.8.0\r\npickleshare         0.7.5\r\nprompt_toolkit      3.0.17\r\nptyprocess          0.7.0\r\npygments            2.8.1\r\npyparsing           2.4.7\r\npytz                2021.1\r\nscipy               1.6.2\r\nseaborn             0.12.0.dev0\r\nsinfo               0.3.1\r\nsix                 1.15.0\r\nsnakeviz            2.1.0\r\nstatsmodels         0.12.2\r\nstoremagic          NA\r\ntornado             6.1\r\ntraitlets           5.0.5\r\nvega_datasets       0.9.0\r\nwcwidth             0.2.5\r\nzmq                 20.0.0\r\n-----\r\nIPython             7.22.0\r\njupyter_client      6.1.12\r\njupyter_core        4.7.1\r\nnotebook            6.3.0\r\n-----\r\nPython 3.8.8 (default, Apr 13 2021, 12:59:45) [Clang 10.0.0 ]\r\nmacOS-10.15.7-x86_64-i386-64bit\r\n16 logical CPU cores, i386\r\n-----\r\nSession information updated at 2021-04-16 16:25\r\n```\r\n\r\n</details>\r\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nseaborn/distributions.py\n  function: _DistributionPlotter.plot_univariate_histogram\n  function: _DistributionPlotter.plot_univariate_histogram\n  function: _DistributionPlotter.plot_univariate_histogram\n"
    },
    {
      "similar_issue": {
        "issue_title": "histplot with categorical values crashes with missing data, though numerical values work fine",
        "issue_body": "Not sure if this is intended behaviour, but it caught me out due to the difference in handling numerical/categorical data. I note that drawing histograms of categorical data is labelled as experimental, so ignore/close if that explains it.\r\n\r\nWith numerical data `histplot` ignores NaN and plots the other values, this is the behaviour I would expect:\r\n\r\n```\r\nimport numpy as np\r\nimport seaborn as sns\r\n\r\nsns.histplot(\r\n    [1.1, 1.2, 1.3, 1.4, np.nan]\r\n)\r\n```\r\n\r\nbut with categorical data it crashes:\r\n```\r\nimport numpy as np\r\nimport seaborn as sns\r\n\r\nsns.histplot(\r\n    ['foo', 'foo', 'bar', np.nan]\r\n)\r\n\r\n# output\r\n---------------------------------------------------------------------------\r\nTypeError                                 Traceback (most recent call last)\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/axis.py in convert_units(self, x)\r\n   1519         try:\r\n-> 1520             ret = self.converter.convert(x, self.units, self)\r\n   1521         except Exception as e:\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/category.py in convert(value, unit, axis)\r\n     60         # force an update so it also does type checking\r\n---> 61         unit.update(values)\r\n     62         return np.vectorize(unit._mapping.__getitem__, otypes=[float])(values)\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/category.py in update(self, data)\r\n    210             # OrderedDict just iterates over unique values in data.\r\n--> 211             cbook._check_isinstance((str, bytes), value=val)\r\n    212             if convertible:\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/cbook/__init__.py in _check_isinstance(_types, **kwargs)\r\n   2234         if not isinstance(v, types):\r\n-> 2235             raise TypeError(\r\n   2236                 \"{!r} must be an instance of {}, not a 
{}\".format(\r\n\r\nTypeError: 'value' must be an instance of str or bytes, not a float\r\n\r\nThe above exception was the direct cause of the following exception:\r\n\r\nConversionError                           Traceback (most recent call last)\r\n<ipython-input-61-b132ea7dca6c> in <module>\r\n      2 import seaborn as sns\r\n      3 \r\n----> 4 sns.histplot(\r\n      5     ['foo', 'foo', 'bar', np.nan]\r\n      6 )\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/distributions.py in histplot(data, x, y, hue, weights, stat, bins, binwidth, binrange, discrete, cumulative, common_bins, common_norm, multiple, element, fill, shrink, kde, kde_kws, line_kws, thresh, pthresh, pmax, cbar, cbar_ax, cbar_kws, palette, hue_order, hue_norm, color, log_scale, legend, ax, **kwargs)\r\n   1420     if p.univariate:\r\n   1421 \r\n-> 1422         p.plot_univariate_histogram(\r\n   1423             multiple=multiple,\r\n   1424             element=element,\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/distributions.py in plot_univariate_histogram(self, multiple, element, fill, common_norm, common_bins, shrink, kde, kde_kws, color, legend, line_kws, estimate_kws, **plot_kws)\r\n    421 \r\n    422         # First pass through the data to compute the histograms\r\n--> 423         for sub_vars, sub_data in self.iter_data(\"hue\", from_comp_data=True):\r\n    424 \r\n    425             # Prepare the relevant data\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/_core.py in iter_data(self, grouping_vars, reverse, from_comp_data)\r\n    965 \r\n    966         if from_comp_data:\r\n--> 967             data = self.comp_data\r\n    968         else:\r\n    969             data = self.plot_data\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/_core.py in comp_data(self)\r\n   1034                 axis = getattr(ax, f\"{var}axis\")\r\n   1035 \r\n-> 1036                 comp_var = 
axis.convert_units(self.plot_data[var])\r\n   1037                 if axis.get_scale() == \"log\":\r\n   1038                     comp_var = np.log10(comp_var)\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/axis.py in convert_units(self, x)\r\n   1520             ret = self.converter.convert(x, self.units, self)\r\n   1521         except Exception as e:\r\n-> 1522             raise munits.ConversionError('Failed to convert value(s) to axis '\r\n   1523                                          f'units: {x!r}') from e\r\n   1524         return ret\r\n\r\nConversionError: Failed to convert value(s) to axis units: 0    foo\r\n1    foo\r\n2    bar\r\n3    NaN\r\nName: x, dtype: object\r\n```\r\n\r\n",
        "issue_id": 2295,
        "pr_number": 2417,
        "pr_title": "Improve NA robustness in VectorPlotter.comp_data",
        "pr_body": "This PR avoids passing `nan` through the matplotlib converters used to obtain a numeric/computable representation of the data (i.e. `VectorPlotter.comp_data`).\r\n\r\nIt also\r\n- codifies that the converted columns in `comp_data` have a float dtype\r\n- converts `inf` to `nan`, in line with what matplotlib does\r\n\r\nFixes #2295 \r\n\r\nAdditionally this will implicitly address #1971 once the regression plots are refactored to use `comp_data` internally. (@mojones, funny that you opened both issues).",
        "issue_closed_at": "2021-01-05T19:40:57Z",
        "base_commit": "aad96f8d2e36ceceb82a42b69aa3a8f47ef7210d"
      },
      "summary": "### Summary:\nThis issue concerns the behavior of the `histplot` function from the Seaborn library when handling missing data within categorical datasets. The problem arises due to a disparity in how missing values (e.g., NaN) are processed between numerical and categorical data inputs. When numerical data with NaN values is used, the `histplot` function successfully ignores the NaN and plots the remaining values, as expected. However, when categorical data containing NaN values is passed to the same function, it results in a TypeError, as the function attempts to convert NaN (a float) into a string or byte format, which is not possible. This leads to a crash with a subsequent ConversionError, indicating a failure to convert the values to axis units.\n\nThe affected component is the `VectorPlotter.comp_data` function within the Seaborn library's core module, which handles data preparation for plotting. The severity of this issue lies in its potential to disrupt data visualization workflows that involve categorical data with missing entries, especially given the experimental status of categorical histogram plotting in Seaborn. The issue highlights a need for consistent handling of missing data across different data types to prevent unexpected crashes and ensure robust plotting functionality.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: histplot with categorical values crashes with missing data, though numerical values work fine\n\nBody:\nNot sure if this is intended behaviour, but it caught me out due to the difference in handling numerical/categorical data. I note that drawing histograms of categorical data is labelled as experimental, so ignore/close if that explains it.\r\n\r\nWith numerical data `histplot` ignores NaN and plots the other values, this is the behaviour I would expect:\r\n\r\n```\r\nimport numpy as np\r\nimport seaborn as sns\r\n\r\nsns.histplot(\r\n    [1.1, 1.2, 1.3, 1.4, np.nan]\r\n)\r\n```\r\n\r\nbut with categorical data it crashes:\r\n```\r\nimport numpy as np\r\nimport seaborn as sns\r\n\r\nsns.histplot(\r\n    ['foo', 'foo', 'bar', np.nan]\r\n)\r\n\r\n# output\r\n---------------------------------------------------------------------------\r\nTypeError                                 Traceback (most recent call last)\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/axis.py in convert_units(self, x)\r\n   1519         try:\r\n-> 1520             ret = self.converter.convert(x, self.units, self)\r\n   1521         except Exception as e:\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/category.py in convert(value, unit, axis)\r\n     60         # force an update so it also does type checking\r\n---> 61         unit.update(values)\r\n     62         return np.vectorize(unit._mapping.__getitem__, otypes=[float])(values)\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/category.py in update(self, data)\r\n    210             # OrderedDict just iterates over 
unique values in data.\r\n--> 211             cbook._check_isinstance((str, bytes), value=val)\r\n    212             if convertible:\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/cbook/__init__.py in _check_isinstance(_types, **kwargs)\r\n   2234         if not isinstance(v, types):\r\n-> 2235             raise TypeError(\r\n   2236                 \"{!r} must be an instance of {}, not a {}\".format(\r\n\r\nTypeError: 'value' must be an instance of str or bytes, not a float\r\n\r\nThe above exception was the direct cause of the following exception:\r\n\r\nConversionError                           Traceback (most recent call last)\r\n<ipython-input-61-b132ea7dca6c> in <module>\r\n      2 import seaborn as sns\r\n      3 \r\n----> 4 sns.histplot(\r\n      5     ['foo', 'foo', 'bar', np.nan]\r\n      6 )\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/distributions.py in histplot(data, x, y, hue, weights, stat, bins, binwidth, binrange, discrete, cumulative, common_bins, common_norm, multiple, element, fill, shrink, kde, kde_kws, line_kws, thresh, pthresh, pmax, cbar, cbar_ax, cbar_kws, palette, hue_order, hue_norm, color, log_scale, legend, ax, **kwargs)\r\n   1420     if p.univariate:\r\n   1421 \r\n-> 1422         p.plot_univariate_histogram(\r\n   1423             multiple=multiple,\r\n   1424             element=element,\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/distributions.py in plot_univariate_histogram(self, multiple, element, fill, common_norm, common_bins, shrink, kde, kde_kws, color, legend, line_kws, estimate_kws, **plot_kws)\r\n    421 \r\n    422         # First pass through the data to compute the histograms\r\n--> 423         for sub_vars, sub_data in self.iter_data(\"hue\", from_comp_data=True):\r\n    424 \r\n    425             # Prepare the relevant data\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/_core.py in iter_data(self, 
grouping_vars, reverse, from_comp_data)\r\n    965 \r\n    966         if from_comp_data:\r\n--> 967             data = self.comp_data\r\n    968         else:\r\n    969             data = self.plot_data\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/_core.py in comp_data(self)\r\n   1034                 axis = getattr(ax, f\"{var}axis\")\r\n   1035 \r\n-> 1036                 comp_var = axis.convert_units(self.plot_data[var])\r\n   1037                 if axis.get_scale() == \"log\":\r\n   1038                     comp_var = np.log10(comp_var)\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/axis.py in convert_units(self, x)\r\n   1520             ret = self.converter.convert(x, self.units, self)\r\n   1521         except Exception as e:\r\n-> 1522             raise munits.ConversionError('Failed to convert value(s) to axis '\r\n   1523                                          f'units: {x!r}') from e\r\n   1524         return ret\r\n\r\nConversionError: Failed to convert value(s) to axis units: 0    foo\r\n1    foo\r\n2    bar\r\n3    NaN\r\nName: x, dtype: object\r\n```\r\n\r\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nseaborn/_core.py\n  function: VectorPlotter.comp_data\n"
    }
  ]
}