{
  "Selected_candidate": {
    "pr_number": 2417,
    "pr_title": "Improve NA robustness in VectorPlotter.comp_data",
    "pr_body": "This PR avoids passing `nan` through the matplotlib converters used to obtain a numeric/computable representation of the data (i.e. `VectorPlotter.comp_data`).\r\n\r\nIt also\r\n- codifies that the converted columns in `comp_data` have a float dtype\r\n- converts `inf` to `nan`, in line with what matplotlib does\r\n\r\nFixes #2295 \r\n\r\nAdditionally this will implicitly address #1971 once the regression plots are refactored to use `comp_data` internally. (@mojones, funny that you opened both issues).",
    "issue_id": 2295,
    "issue_title": "histplot with categorical values crashes with missing data, though numerical values work fine",
    "issue_body": "Not sure if this is intended behaviour, but it caught me out due to the difference in handling numerical/categorical data. I note that drawing histograms of categorical data is labelled as experimental, so ignore/close if that explains it.\r\n\r\nWith numerical data `histplot` ignores NaN and plots the other values, this is the behaviour I would expect:\r\n\r\n```\r\nimport numpy as np\r\nimport seaborn as sns\r\n\r\nsns.histplot(\r\n    [1.1, 1.2, 1.3, 1.4, np.nan]\r\n)\r\n```\r\n\r\nbut with categorical data it crashes:\r\n```\r\nimport numpy as np\r\nimport seaborn as sns\r\n\r\nsns.histplot(\r\n    ['foo', 'foo', 'bar', np.nan]\r\n)\r\n\r\n# output\r\n---------------------------------------------------------------------------\r\nTypeError                                 Traceback (most recent call last)\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/axis.py in convert_units(self, x)\r\n   1519         try:\r\n-> 1520             ret = self.converter.convert(x, self.units, self)\r\n   1521         except Exception as e:\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/category.py in convert(value, unit, axis)\r\n     60         # force an update so it also does type checking\r\n---> 61         unit.update(values)\r\n     62         return np.vectorize(unit._mapping.__getitem__, otypes=[float])(values)\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/category.py in update(self, data)\r\n    210             # OrderedDict just iterates over unique values in data.\r\n--> 211             cbook._check_isinstance((str, bytes), value=val)\r\n    212             if convertible:\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/cbook/__init__.py in _check_isinstance(_types, **kwargs)\r\n   2234         if not isinstance(v, types):\r\n-> 2235             raise TypeError(\r\n   2236                 \"{!r} must be an instance of {}, not a {}\".format(\r\n\r\nTypeError: 'value' must be an instance of str or bytes, not a float\r\n\r\nThe above exception was the direct cause of the following exception:\r\n\r\nConversionError                           Traceback (most recent call last)\r\n<ipython-input-61-b132ea7dca6c> in <module>\r\n      2 import seaborn as sns\r\n      3 \r\n----> 4 sns.histplot(\r\n      5     ['foo', 'foo', 'bar', np.nan]\r\n      6 )\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/distributions.py in histplot(data, x, y, hue, weights, stat, bins, binwidth, binrange, discrete, cumulative, common_bins, common_norm, multiple, element, fill, shrink, kde, kde_kws, line_kws, thresh, pthresh, pmax, cbar, cbar_ax, cbar_kws, palette, hue_order, hue_norm, color, log_scale, legend, ax, **kwargs)\r\n   1420     if p.univariate:\r\n   1421 \r\n-> 1422         p.plot_univariate_histogram(\r\n   1423             multiple=multiple,\r\n   1424             element=element,\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/distributions.py in plot_univariate_histogram(self, multiple, element, fill, common_norm, common_bins, shrink, kde, kde_kws, color, legend, line_kws, estimate_kws, **plot_kws)\r\n    421 \r\n    422         # First pass through the data to compute the histograms\r\n--> 423         for sub_vars, sub_data in self.iter_data(\"hue\", from_comp_data=True):\r\n    424 \r\n    425             # Prepare the relevant data\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/_core.py in iter_data(self, grouping_vars, reverse, from_comp_data)\r\n    965 \r\n    966         if from_comp_data:\r\n--> 967             data = self.comp_data\r\n    968         else:\r\n    969             data = self.plot_data\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/_core.py in comp_data(self)\r\n   1034                 axis = getattr(ax, f\"{var}axis\")\r\n   1035 \r\n-> 1036                 comp_var = axis.convert_units(self.plot_data[var])\r\n   1037                 if axis.get_scale() == \"log\":\r\n   1038                     comp_var = np.log10(comp_var)\r\n\r\n~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/axis.py in convert_units(self, x)\r\n   1520             ret = self.converter.convert(x, self.units, self)\r\n   1521         except Exception as e:\r\n-> 1522             raise munits.ConversionError('Failed to convert value(s) to axis '\r\n   1523                                          f'units: {x!r}') from e\r\n   1524         return ret\r\n\r\nConversionError: Failed to convert value(s) to axis units: 0    foo\r\n1    foo\r\n2    bar\r\n3    NaN\r\nName: x, dtype: object\r\n```\r\n\r\n",
    "issue_closed_at": "2021-01-05T19:40:57Z",
    "base_commit": "aad96f8d2e36ceceb82a42b69aa3a8f47ef7210d",
    "changes": [
      {
        "file": "seaborn/_core.py",
        "type": "function",
        "name": "comp_data",
        "class_name": "VectorPlotter",
        "code": "def comp_data(self):\n        \"\"\"Dataframe with numeric x and y, after unit conversion and log scaling.\"\"\"\n        if not hasattr(self, \"ax\"):\n            # Probably a good idea, but will need a bunch of tests updated\n            # Most of these tests should just use the external interface\n            # Then this can be re-enabled.\n            # raise AttributeError(\"No Axes attached to plotter\")\n            return self.plot_data\n\n        if not hasattr(self, \"_comp_data\"):\n\n            comp_data = (\n                self.plot_data\n                .copy(deep=False)\n                .drop([\"x\", \"y\"], axis=1, errors=\"ignore\")\n            )\n            for var in \"yx\":\n                if var not in self.variables:\n                    continue\n\n                # Get a corresponding axis object so that we can convert the units\n                # to matplotlib's numeric representation, which we can compute on\n                # This is messy and it would probably be better for VectorPlotter\n                # to manage its own converters (using the matplotlib tools).\n                # XXX Currently does not support unshared categorical axes!\n                # (But see comment in _attach about how those don't exist)\n                if self.ax is None:\n                    ax = self.facets.axes.flat[0]\n                else:\n                    ax = self.ax\n                axis = getattr(ax, f\"{var}axis\")\n\n                comp_var = axis.convert_units(self.plot_data[var])\n                if axis.get_scale() == \"log\":\n                    comp_var = np.log10(comp_var)\n                comp_data.insert(0, var, comp_var)\n\n            self._comp_data = comp_data\n\n        return self._comp_data"
      }
    ]
  },
  "Justification": "Candidate E addresses similar issues with handling missing data, which is directly relevant to the CURRENT bug where missing values lead to a crash in the PolyFit function. Both reports highlight the challenges that arise when missing data is dealt with in plotting functions. Since Candidate E's fix improves robustness against NaN values specifically for histogram plots, it provides crucial insights on how to manage missing data appropriately in the context of statistical visualizations, making it particularly beneficial for debugging the CURRENT bug related to missing data in PolyFit.",
  "instance_id": "mwaskom__seaborn-3010",
  "repo": "mwaskom/seaborn",
  "created_at": "2022-09-11T19:37:32Z",
  "problem_statement": "PolyFit is not robust to missing data\n```python\r\nso.Plot([1, 2, 3, None, 4], [1, 2, 3, 4, 5]).add(so.Line(), so.PolyFit())\r\n```\r\n\r\n<details><summary>Traceback</summary>\r\n\r\n```python-traceback\r\n---------------------------------------------------------------------------\r\nLinAlgError                               Traceback (most recent call last)\r\nFile ~/miniconda3/envs/seaborn-py39-latest/lib/python3.9/site-packages/IPython/core/formatters.py:343, in BaseFormatter.__call__(self, obj)\r\n    341     method = get_real_method(obj, self.print_method)\r\n    342     if method is not None:\r\n--> 343         return method()\r\n    344     return None\r\n    345 else:\r\n\r\nFile ~/code/seaborn/seaborn/_core/plot.py:265, in Plot._repr_png_(self)\r\n    263 def _repr_png_(self) -> tuple[bytes, dict[str, float]]:\r\n--> 265     return self.plot()._repr_png_()\r\n\r\nFile ~/code/seaborn/seaborn/_core/plot.py:804, in Plot.plot(self, pyplot)\r\n    800 \"\"\"\r\n    801 Compile the plot spec and return the Plotter object.\r\n    802 \"\"\"\r\n    803 with theme_context(self._theme_with_defaults()):\r\n--> 804     return self._plot(pyplot)\r\n\r\nFile ~/code/seaborn/seaborn/_core/plot.py:822, in Plot._plot(self, pyplot)\r\n    819 plotter._setup_scales(self, common, layers, coord_vars)\r\n    821 # Apply statistical transform(s)\r\n--> 822 plotter._compute_stats(self, layers)\r\n    824 # Process scale spec for semantic variables and coordinates computed by stat\r\n    825 plotter._setup_scales(self, common, layers)\r\n\r\nFile ~/code/seaborn/seaborn/_core/plot.py:1110, in Plotter._compute_stats(self, spec, layers)\r\n   1108     grouper = grouping_vars\r\n   1109 groupby = GroupBy(grouper)\r\n-> 1110 res = stat(df, groupby, orient, scales)\r\n   1112 if pair_vars:\r\n   1113     data.frames[coord_vars] = res\r\n\r\nFile ~/code/seaborn/seaborn/_stats/regression.py:41, in PolyFit.__call__(self, data, groupby, orient, scales)\r\n     39 def __call__(self, data, groupby, orient, scales):\r\n---> 41     return groupby.apply(data, self._fit_predict)\r\n\r\nFile ~/code/seaborn/seaborn/_core/groupby.py:109, in GroupBy.apply(self, data, func, *args, **kwargs)\r\n    106 grouper, groups = self._get_groups(data)\r\n    108 if not grouper:\r\n--> 109     return self._reorder_columns(func(data, *args, **kwargs), data)\r\n    111 parts = {}\r\n    112 for key, part_df in data.groupby(grouper, sort=False):\r\n\r\nFile ~/code/seaborn/seaborn/_stats/regression.py:30, in PolyFit._fit_predict(self, data)\r\n     28     xx = yy = []\r\n     29 else:\r\n---> 30     p = np.polyfit(x, y, self.order)\r\n     31     xx = np.linspace(x.min(), x.max(), self.gridsize)\r\n     32     yy = np.polyval(p, xx)\r\n\r\nFile <__array_function__ internals>:180, in polyfit(*args, **kwargs)\r\n\r\nFile ~/miniconda3/envs/seaborn-py39-latest/lib/python3.9/site-packages/numpy/lib/polynomial.py:668, in polyfit(x, y, deg, rcond, full, w, cov)\r\n    666 scale = NX.sqrt((lhs*lhs).sum(axis=0))\r\n    667 lhs /= scale\r\n--> 668 c, resids, rank, s = lstsq(lhs, rhs, rcond)\r\n    669 c = (c.T/scale).T  # broadcast scale coefficients\r\n    671 # warn on rank reduction, which indicates an ill conditioned matrix\r\n\r\nFile <__array_function__ internals>:180, in lstsq(*args, **kwargs)\r\n\r\nFile ~/miniconda3/envs/seaborn-py39-latest/lib/python3.9/site-packages/numpy/linalg/linalg.py:2300, in lstsq(a, b, rcond)\r\n   2297 if n_rhs == 0:\r\n   2298     # lapack can't handle n_rhs = 0 - so allocate the array one larger in that axis\r\n   2299     b = zeros(b.shape[:-2] + (m, n_rhs + 1), dtype=b.dtype)\r\n-> 2300 x, resids, rank, s = gufunc(a, b, rcond, signature=signature, extobj=extobj)\r\n   2301 if m == 0:\r\n   2302     x[...] = 0\r\n\r\nFile ~/miniconda3/envs/seaborn-py39-latest/lib/python3.9/site-packages/numpy/linalg/linalg.py:101, in _raise_linalgerror_lstsq(err, flag)\r\n    100 def _raise_linalgerror_lstsq(err, flag):\r\n--> 101     raise LinAlgError(\"SVD did not converge in Linear Least Squares\")\r\n\r\nLinAlgError: SVD did not converge in Linear Least Squares\r\n\r\n```\r\n\r\n</details>\n",
  "patch": "diff --git a/seaborn/_stats/regression.py b/seaborn/_stats/regression.py\n--- a/seaborn/_stats/regression.py\n+++ b/seaborn/_stats/regression.py\n@@ -38,7 +38,10 @@ def _fit_predict(self, data):\n \n     def __call__(self, data, groupby, orient, scales):\n \n-        return groupby.apply(data, self._fit_predict)\n+        return (\n+            groupby\n+            .apply(data.dropna(subset=[\"x\", \"y\"]), self._fit_predict)\n+        )\n \n \n @dataclass\n"
}