{
  "original_problem": {
    "instance_id": "scikit-learn__scikit-learn-25638",
    "repo": "scikit-learn/scikit-learn",
    "created_at": "2023-02-17T22:17:50Z",
    "problem_statement": "Support nullable pandas dtypes in `unique_labels`\n### Describe the workflow you want to enable\n\nI would like to be able to pass the nullable pandas dtypes (\"Int64\", \"Float64\", \"boolean\") into sklearn's `unique_labels` function. Because the dtypes become `object` dtype when converted to numpy arrays we get `ValueError: Mix type of y not allowed, got types {'binary', 'unknown'}`:\r\n\r\nRepro with sklearn 1.2.1\r\n```py \r\n    import pandas as pd\r\n    import pytest\r\n    from sklearn.utils.multiclass import unique_labels\r\n    \r\n    for dtype in [\"Int64\", \"Float64\", \"boolean\"]:\r\n        y_true = pd.Series([1, 0, 0, 1, 0, 1, 1, 0, 1], dtype=dtype)\r\n        y_predicted = pd.Series([0, 0, 1, 1, 0, 1, 1, 1, 1], dtype=\"int64\")\r\n\r\n        with pytest.raises(ValueError, match=\"Mix type of y not allowed, got types\"):\r\n            unique_labels(y_true, y_predicted)\r\n```\n\n### Describe your proposed solution\n\nWe should get the same behavior as when `int64`, `float64`, and `bool` dtypes are used, which is no error:  \r\n\r\n```python\r\n    import pandas as pd\r\n    from sklearn.utils.multiclass import unique_labels\r\n    \r\n    for dtype in [\"int64\", \"float64\", \"bool\"]:\r\n        y_true = pd.Series([1, 0, 0, 1, 0, 1, 1, 0, 1], dtype=dtype)\r\n        y_predicted = pd.Series([0, 0, 1, 1, 0, 1, 1, 1, 1], dtype=\"int64\")\r\n\r\n        unique_labels(y_true, y_predicted)\r\n```\n\n### Describe alternatives you've considered, if relevant\n\nOur current workaround is to convert the data to numpy arrays with the corresponding dtype that works prior to passing it into `unique_labels`.\n\n### Additional context\n\n_No response_\n",
    "patch": "diff --git a/sklearn/utils/multiclass.py b/sklearn/utils/multiclass.py\n--- a/sklearn/utils/multiclass.py\n+++ b/sklearn/utils/multiclass.py\n@@ -155,14 +155,25 @@ def is_multilabel(y):\n     if hasattr(y, \"__array__\") or isinstance(y, Sequence) or is_array_api:\n         # DeprecationWarning will be replaced by ValueError, see NEP 34\n         # https://numpy.org/neps/nep-0034-infer-dtype-is-object.html\n+        check_y_kwargs = dict(\n+            accept_sparse=True,\n+            allow_nd=True,\n+            force_all_finite=False,\n+            ensure_2d=False,\n+            ensure_min_samples=0,\n+            ensure_min_features=0,\n+        )\n         with warnings.catch_warnings():\n             warnings.simplefilter(\"error\", np.VisibleDeprecationWarning)\n             try:\n-                y = xp.asarray(y)\n-            except (np.VisibleDeprecationWarning, ValueError):\n+                y = check_array(y, dtype=None, **check_y_kwargs)\n+            except (np.VisibleDeprecationWarning, ValueError) as e:\n+                if str(e).startswith(\"Complex data not supported\"):\n+                    raise\n+\n                 # dtype=object should be provided explicitly for ragged arrays,\n                 # see NEP 34\n-                y = xp.asarray(y, dtype=object)\n+                y = check_array(y, dtype=object, **check_y_kwargs)\n \n     if not (hasattr(y, \"shape\") and y.ndim == 2 and y.shape[1] > 1):\n         return False\n@@ -302,15 +313,27 @@ def type_of_target(y, input_name=\"\"):\n     # https://numpy.org/neps/nep-0034-infer-dtype-is-object.html\n     # We therefore catch both deprecation (NumPy < 1.24) warning and\n     # value error (NumPy >= 1.24).\n+    check_y_kwargs = dict(\n+        accept_sparse=True,\n+        allow_nd=True,\n+        force_all_finite=False,\n+        ensure_2d=False,\n+        ensure_min_samples=0,\n+        ensure_min_features=0,\n+    )\n+\n     with warnings.catch_warnings():\n         
warnings.simplefilter(\"error\", np.VisibleDeprecationWarning)\n         if not issparse(y):\n             try:\n-                y = xp.asarray(y)\n-            except (np.VisibleDeprecationWarning, ValueError):\n+                y = check_array(y, dtype=None, **check_y_kwargs)\n+            except (np.VisibleDeprecationWarning, ValueError) as e:\n+                if str(e).startswith(\"Complex data not supported\"):\n+                    raise\n+\n                 # dtype=object should be provided explicitly for ragged arrays,\n                 # see NEP 34\n-                y = xp.asarray(y, dtype=object)\n+                y = check_array(y, dtype=object, **check_y_kwargs)\n \n     # The old sequence of sequences format\n     try:\n"
  },
  "candidates_evaluated": 5,
  "judgment_result": {
    "candidates": [
      {
        "idx": 1,
        "id": "similar_19878",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue involves handling sparse matrices, which is unrelated to handling nullable dtypes in pandas."
      },
      {
        "idx": 2,
        "id": "similar_14877",
        "decision": "Useful",
        "confidence": "High",
        "reason": "Both issues involve handling and validating diverse data types in scikit-learn, focusing on ensuring compatibility with expected input types."
      },
      {
        "idx": 3,
        "id": "similar_12276",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue is about error handling for unfitted models, which does not relate to dtype handling or conversion."
      },
      {
        "idx": 4,
        "id": "similar_11906",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue focuses on error messaging clarity, which is not directly related to dtype handling or conversion."
      },
      {
        "idx": 5,
        "id": "similar_11034",
        "decision": "Useful",
        "confidence": "High",
        "reason": "Both issues involve ensuring that specified dtypes are respected during data processing in scikit-learn."
      }
    ]
  },
  "raw_summaries": [
    {
      "similar_issue": {
        "issue_title": "OrdinalEncoder accept and failed with sparse matrix in inverse_transform",
        "issue_body": "`OrdinalEncoder` was documented to accept sparse matrix in `inverse_transform`.\r\nA check is internally done to accept sparse matrix. However, the `inverse_transform` will fail with this type of data.\r\nWe should remove the support in the check and make sure that we issue the right error.\r\n",
        "issue_id": 19878,
        "pr_number": 19879,
        "pr_title": "FIX Error for sparse matrix in OrdinalEncoder.inverse_transform",
        "pr_body": "closes #19878\r\n\r\n`OrdinalEncoder.inverse_transform` should not support sparse matrix. It was already failing but with an obscure message.\r\nThis PR adds a non-regression test to check for the error message.",
        "issue_closed_at": "2021-04-13T11:59:30Z",
        "base_commit": "767fd63c9ddddc46e288fdec2cca36a129529a8e"
      },
      "summary": "### Summary:\nThis issue concerns a mismatch between documented and actual behavior in the `inverse_transform` method of scikit-learn's `OrdinalEncoder`. The method was documented to accept sparse matrices, and its internal input check allowed them, but the transformation itself fails on sparse input. The key symptom is that `inverse_transform` fails with an obscure error message when given a sparse matrix, contrary to what the documentation leads users to expect. The affected component is `OrdinalEncoder.inverse_transform` in `sklearn/preprocessing/_encoders.py`. The severity is moderate: sparse input never worked, so no working behavior is lost, but users receive a confusing error instead of a clear statement that sparse input is unsupported. The resolution is to stop accepting sparse matrices in the input check so that the right error is raised, and to add a non-regression test for the error message, bringing the validation in line with what the method can actually do.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: OrdinalEncoder accept and failed with sparse matrix in inverse_transform\n\nBody:\n`OrdinalEncoder` was documented to accept sparse matrix in `inverse_transform`.\r\nA check is internally done to accept sparse matrix. However, the `inverse_transform` will fail with this type of data.\r\nWe should remove the support in the check and make sure that we issue the right error.\r\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nsklearn/preprocessing/_encoders.py\n  function: OrdinalEncoder.inverse_transform\n  function: OrdinalEncoder.inverse_transform\n"
    },
    {
      "similar_issue": {
        "issue_title": "DataFrames not properly validated in SimpleImputer",
        "issue_body": "```python\r\nimport pandas as pd\r\nfrom sklearn.impute import SimpleImputer\r\n\r\nSimpleImputer().fit(pd.DataFrame({'a': ['b', 'c']}))\r\n```\r\nis not validated correctly:\r\n\r\n```pythontb\r\n---------------------------------------------------------------------------\r\nValueError                                Traceback (most recent call last)\r\n~/checkout/scikit-learn/sklearn/impute/_base.py in _validate_input(self, X)\r\n    198             X = check_array(X, accept_sparse='csc', dtype=dtype,\r\n--> 199                             force_all_finite=force_all_finite, copy=self.copy)\r\n    200         except ValueError as ve:\r\n\r\n~/checkout/scikit-learn/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)\r\n    496                 warnings.simplefilter('error', ComplexWarning)\r\n--> 497                 array = np.asarray(array, dtype=dtype, order=order)\r\n    498             except ComplexWarning:\r\n\r\n~/miniconda3/lib/python3.7/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)\r\n    537     \"\"\"\r\n--> 538     return array(a, dtype, copy=False, order=order)\r\n    539 \r\n\r\nValueError: could not convert string to float: 'b'\r\n\r\nDuring handling of the above exception, another exception occurred:\r\n\r\nAttributeError                            Traceback (most recent call last)\r\n<ipython-input-6-f08c4f6715ce> in <module>\r\n----> 1 SimpleImputer().fit(pd.DataFrame({'a': ['b', 'c']}))\r\n\r\n~/checkout/scikit-learn/sklearn/impute/_base.py in fit(self, X, y)\r\n    230         self : SimpleImputer\r\n    231         \"\"\"\r\n--> 232         X = self._validate_input(X)\r\n    233 \r\n    234         # default fill_value is 0 for numerical input and \"missing_value\"\r\n\r\n~/checkout/scikit-learn/sklearn/impute/_base.py in _validate_input(self, X)\r\n 
   202                 raise ValueError(\"Cannot use {0} strategy with non-numeric \"\r\n    203                                  \"data. Received datatype :{1}.\"\r\n--> 204                                  \"\".format(self.strategy, X.dtype.kind))\r\n    205             else:\r\n    206                 raise ve\r\n\r\n~/miniconda3/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)\r\n   5065             if self._info_axis._can_hold_identifiers_and_holds_name(name):\r\n   5066                 return self[name]\r\n-> 5067             return object.__getattribute__(self, name)\r\n   5068 \r\n   5069     def __setattr__(self, name, value):\r\n\r\nAttributeError: 'DataFrame' object has no attribute 'dtype'\r\n```\r\n",
        "issue_id": 14877,
        "pr_number": 14878,
        "pr_title": "API Improves error msg when passing non-arrays to SimpleImputer",
        "pr_body": "Fixes #14877.\r\nAn alternative would be calling ``asarray`` but that would be less informative imho.\r\n\r\nStill needs a regression test. Probably one for list and one for dataframe?",
        "issue_closed_at": "2019-10-18T15:28:54Z",
        "base_commit": "b02217d8a5651760353e310701e749c1eaece6df"
      },
      "summary": "### Summary:\nThis issue pertains to improper validation of DataFrame inputs in the `SimpleImputer` class of the `scikit-learn` library when the data is non-numeric. Fitting a `SimpleImputer` on a DataFrame that contains strings first raises a `ValueError` inside `check_array`, because the strings cannot be converted to floats. The error handler in `_validate_input` then tries to build a more informative message from `X.dtype.kind`, but a pandas `DataFrame` has no `dtype` attribute, so an `AttributeError` is raised instead. The key symptom is that the intended `ValueError`, which states that the chosen strategy cannot be used with non-numeric data, is masked by an unrelated `AttributeError` that does not tell the user what went wrong. The affected component, according to the traceback, is the `_validate_input` method of `SimpleImputer` in `sklearn/impute/_base.py`. The impact is mainly on usability: the input is rejected either way, but users who pass DataFrames with string columns get a misleading traceback and must work out the cause themselves. The issue highlights the need for validation code to handle non-array inputs such as DataFrames and lists when it constructs error messages.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: DataFrames not properly validated in SimpleImputer\n\nBody:\n```python\r\nimport pandas as pd\r\nfrom sklearn.impute import SimpleImputer\r\n\r\nSimpleImputer().fit(pd.DataFrame({'a': ['b', 'c']}))\r\n```\r\nis not validated correctly:\r\n\r\n```pythontb\r\n---------------------------------------------------------------------------\r\nValueError                                Traceback (most recent call last)\r\n~/checkout/scikit-learn/sklearn/impute/_base.py in _validate_input(self, X)\r\n    198             X = check_array(X, accept_sparse='csc', dtype=dtype,\r\n--> 199                             force_all_finite=force_all_finite, copy=self.copy)\r\n    200         except ValueError as ve:\r\n\r\n~/checkout/scikit-learn/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)\r\n    496                 warnings.simplefilter('error', ComplexWarning)\r\n--> 497                 array = np.asarray(array, dtype=dtype, order=order)\r\n    498             except ComplexWarning:\r\n\r\n~/miniconda3/lib/python3.7/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)\r\n    537     \"\"\"\r\n--> 538     return array(a, dtype, copy=False, order=order)\r\n    539 \r\n\r\nValueError: could not convert string to float: 'b'\r\n\r\nDuring handling of the above exception, another exception occurred:\r\n\r\nAttributeError                            Traceback (most recent call last)\r\n<ipython-input-6-f08c4f6715ce> in <module>\r\n----> 1 SimpleImputer().fit(pd.DataFrame({'a': 
['b', 'c']}))\r\n\r\n~/checkout/scikit-learn/sklearn/impute/_base.py in fit(self, X, y)\r\n    230         self : SimpleImputer\r\n    231         \"\"\"\r\n--> 232         X = self._validate_input(X)\r\n    233 \r\n    234         # default fill_value is 0 for numerical input and \"missing_value\"\r\n\r\n~/checkout/scikit-learn/sklearn/impute/_base.py in _validate_input(self, X)\r\n    202                 raise ValueError(\"Cannot use {0} strategy with non-numeric \"\r\n    203                                  \"data. Received datatype :{1}.\"\r\n--> 204                                  \"\".format(self.strategy, X.dtype.kind))\r\n    205             else:\r\n    206                 raise ve\r\n\r\n~/miniconda3/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)\r\n   5065             if self._info_axis._can_hold_identifiers_and_holds_name(name):\r\n   5066                 return self[name]\r\n-> 5067             return object.__getattribute__(self, name)\r\n   5068 \r\n   5069     def __setattr__(self, name, value):\r\n\r\nAttributeError: 'DataFrame' object has no attribute 'dtype'\r\n```\r\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nsklearn/impute/_base.py\n  function: MissingIndicator._validate_input\n"
    },
    {
      "similar_issue": {
        "issue_title": "Calling NearestNeighbors.{kneighbors,radius_neighbors}_graph without first fitting should raise NotFittedError",
        "issue_body": "<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n\r\nRunning prediction methods without running `fit` should raise `NotFittedError`. This is not the case for some non-standard methods.\r\n\r\n`check_is_fitted` should be applied for this purpose.\r\n\r\n#### Steps/Code to Reproduce\r\n\r\n```py\r\nfrom sklearn.neighbors import NearestNeighbors\r\nNearestNeighbors().kneighbors_graph([[1]])\r\n```\r\nor\r\n```py\r\nfrom sklearn.neighbors import NearestNeighbors\r\nNearestNeighbors().radius_neighbors_graph([[1]])\r\n```\r\n\r\n\r\n#### Expected Results\r\n\r\nNotFittedError raised\r\n\r\n#### Actual Results\r\nAttributeError raised\r\n\r\n#### Versions\r\n\r\n```\r\nSystem\r\n------\r\n    python: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:52:12)  [GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]\r\nexecutable: /Users/joel/anaconda3/envs/scipy3k/bin/python\r\n   machine: Darwin-17.7.0-x86_64-i386-64bit\r\n\r\nBLAS\r\n----\r\n  lib_dirs: /Users/joel/anaconda3/envs/scipy3k/lib\r\n    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None\r\ncblas_libs: mkl_rt, pthread\r\n\r\nPython deps\r\n-----------\r\nsetuptools: 37.0.0\r\n     scipy: 1.0.0\r\n       pip: 18.0\r\n     numpy: 1.14.1\r\n   sklearn: 0.21.dev0\r\n    pandas: 0.23.4\r\n    Cython: 0.28.5\r\n```",
        "issue_id": 12276,
        "pr_number": 12279,
        "pr_title": "[MRG+1] Add check_is_fitted to non standard functions",
        "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist\r\n-->\r\n\r\n#### Reference Issues/PRs\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\nFixes #12276 \r\n\r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\nAdding `check_is_fitted` method to other non standard functions\r\n\r\n#### Any other comments?\r\n\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
        "issue_closed_at": "2018-10-19T13:46:09Z",
        "base_commit": "74b56dbc57d9295df8fb653adccb265da356b670"
      },
      "summary": "### Summary:\nThis issue pertains to the behavior of certain prediction methods within the `scikit-learn` library's `NearestNeighbors` module. Specifically, the problem arises from calling the methods `kneighbors_graph` and `radius_neighbors_graph` without first fitting the model using the `fit` method. This oversight does not result in a `NotFittedError` as expected, but rather an `AttributeError`, which may lead to confusion for users expecting standard error handling.\n\n1. **Problem Description in General Terms:**\n   The issue involves the improper handling of method calls on an unfitted model in a machine learning library, where the absence of a prerequisite fitting step should trigger a specific error (`NotFittedError`) but instead raises a different, less informative error.\n\n2. **Key Symptoms and Behaviors Observed:**\n   - When attempting to use `kneighbors_graph` or `radius_neighbors_graph` without calling `fit` first, users encounter an `AttributeError`.\n   - The expected behavior is for a `NotFittedError` to be raised, clearly indicating that the method cannot be called without prior fitting.\n\n3. **Affected Components or Systems:**\n   - The issue affects the `NearestNeighbors` module in `scikit-learn`, specifically within the `KNeighborsMixin.kneighbors_graph` and `RadiusNeighborsMixin.radius_neighbors_graph` functions.\n\n4. **Potential Impact or Severity:**\n   - This issue may lead to user confusion or misdiagnosis of problems, as the error message provided does not clearly indicate the root cause. It could potentially hinder the development process by requiring additional debugging to understand the source of the error.\n   - The severity is moderate, as it affects user experience and code correctness but does not lead to data loss or corruption.\n\n5. 
**Relevant Technical Details Abstracted for Broader Understanding:**\n   - The fix calls `check_is_fitted` in each affected method, so that invoking it on an unfitted estimator raises `NotFittedError` instead of an `AttributeError`.\n   - More generally, methods that depend on fitted state should validate that state explicitly and fail with a specific, documented exception rather than with an incidental error from a missing attribute.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: Calling NearestNeighbors.{kneighbors,radius_neighbors}_graph without first fitting should raise NotFittedError\n\nBody:\n<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n\r\nRunning prediction methods without running `fit` should raise `NotFittedError`. 
This is not the case for some non-standard methods.\r\n\r\n`check_is_fitted` should be applied for this purpose.\r\n\r\n#### Steps/Code to Reproduce\r\n\r\n```py\r\nfrom sklearn.neighbors import NearestNeighbors\r\nNearestNeighbors().kneighbors_graph([[1]])\r\n```\r\nor\r\n```py\r\nfrom sklearn.neighbors import NearestNeighbors\r\nNearestNeighbors().radius_neighbors_graph([[1]])\r\n```\r\n\r\n\r\n#### Expected Results\r\n\r\nNotFittedError raised\r\n\r\n#### Actual Results\r\nAttributeError raised\r\n\r\n#### Versions\r\n\r\n```\r\nSystem\r\n------\r\n    python: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:52:12)  [GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]\r\nexecutable: /Users/joel/anaconda3/envs/scipy3k/bin/python\r\n   machine: Darwin-17.7.0-x86_64-i386-64bit\r\n\r\nBLAS\r\n----\r\n  lib_dirs: /Users/joel/anaconda3/envs/scipy3k/lib\r\n    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None\r\ncblas_libs: mkl_rt, pthread\r\n\r\nPython deps\r\n-----------\r\nsetuptools: 37.0.0\r\n     scipy: 1.0.0\r\n       pip: 18.0\r\n     numpy: 1.14.1\r\n   sklearn: 0.21.dev0\r\n    pandas: 0.23.4\r\n    Cython: 0.28.5\r\n```\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nsklearn/neighbors/base.py\n  function: KNeighborsMixin.kneighbors_graph\n  function: RadiusNeighborsMixin.radius_neighbors_graph\n"
    },
    {
      "similar_issue": {
        "issue_title": "Better error message for invalid metric in NearestNeighbors ",
        "issue_body": "<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n<!-- Example: Joblib Error thrown when calling fit on LatentDirichletAllocation with evaluate_every > 0-->\r\nError message for invalid metric in NearestNeighbors is unclear.\r\n\r\n#### Steps/Code to Reproduce\r\n<!--\r\nExample:\r\n```python\r\nfrom sklearn.feature_extraction.text import CountVectorizer\r\nfrom sklearn.decomposition import LatentDirichletAllocation\r\n\r\ndocs = [\"Help I have a bug\" for i in range(1000)]\r\n\r\nvectorizer = CountVectorizer(input=docs, analyzer='word')\r\nlda_features = vectorizer.fit_transform(docs)\r\n\r\nlda_model = LatentDirichletAllocation(\r\n    n_topics=10,\r\n    learning_method='online',\r\n    evaluate_every=10,\r\n    n_jobs=4,\r\n)\r\nmodel = lda_model.fit(lda_features)\r\n```\r\nIf the code is too long, feel free to put it in a public gist and link\r\nit in the issue: https://gist.github.com\r\n-->\r\n```python\r\nNearestNeighbors(metric='cheybshev')\r\n```\r\n\r\n#### Expected Results\r\n<!-- Example: No error is thrown. Please paste or describe the expected results.-->\r\nError message stating that metric should be 'cityblock', ... or callable rather than metric not valid for algorithm 'auto'. When I initially saw the error message, I did not realize I had a typo in the metric string. I thought it has something to do with the algorithm.\r\n\r\n#### Actual Results\r\n<!-- Please paste or specifically describe the actual output or traceback. 
-->\r\n```\r\nValueError: Metric 'cheybshev' not valid for algorithm 'auto'\r\n```\r\n\r\n#### Versions\r\n<!--\r\nPlease run the following snippet and paste the output below.\r\nimport platform; print(platform.platform())\r\nimport sys; print(\"Python\", sys.version)\r\nimport numpy; print(\"NumPy\", numpy.__version__)\r\nimport scipy; print(\"SciPy\", scipy.__version__)\r\nimport sklearn; print(\"Scikit-Learn\", sklearn.__version__)\r\n-->\r\nLinux-4.15.0-24-generic-x86_64-with-debian-stretch-sid\r\nPython 3.6.3 |Anaconda custom (64-bit)| (default, Nov  9 2017, 00:19:18) \r\n[GCC 7.2.0]\r\nNumPy 1.13.3\r\nSciPy 0.19.1\r\nScikit-Learn 0.19.1\r\n\r\n<!-- Thanks for contributing! -->\r\n",
        "issue_id": 11906,
        "pr_number": 11914,
        "pr_title": "[MRG] ENH Better error message for metrics of neighbors",
        "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist\r\n-->\r\n\r\n#### Reference Issues/PRs\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\n#Fixes #11906 \r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\nAdded expression to error message to get list of valid metrics.\r\n\r\n<!--\r\n#### Any other comments?\r\n\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
        "issue_closed_at": "2018-09-13T15:34:02Z",
        "base_commit": "7ed61a24feb4ffde0bee9342acf4a58e3f946a61"
      },
      "summary": "### Summary:\n\nThis issue is centered around the need for improved clarity in error messaging within the Scikit-Learn library, specifically when an invalid metric is specified in the `NearestNeighbors` class. \n\n1. **Problem Description in General Terms**: \n   The core of the problem is that the error message provided when an invalid metric is specified for the `NearestNeighbors` class is ambiguous and potentially misleading. Instead of accurately pinpointing that the issue is with the metric string itself, the error message implies a problem related to the algorithm specified.\n\n2. **Key Symptoms and Behaviors Observed**:\n   Users encountering this issue receive an error message stating, \"Metric 'cheybshev' not valid for algorithm 'auto'\", when they mistakenly input a typo in the metric string. This leads to confusion as the message suggests a problem with the algorithm rather than directly indicating that the metric string is incorrect.\n\n3. **Affected Components or Systems**:\n   This issue affects the `NearestNeighbors` component of the Scikit-Learn library, particularly within the `neighbors` module. The specific function involved is `NeighborsBase._check_algorithm_metric`, which is responsible for validating the metric parameter.\n\n4. **Potential Impact or Severity**:\n   The impact of this issue is primarily on user experience. It can lead to confusion and wasted time as users may not immediately realize that a typographical error in the metric string is the cause of the problem. This could potentially delay development and debugging processes.\n\n5. **Any Relevant Technical Details Abstracted for Broader Understanding**:\n   The issue arises due to a lack of specificity in the error message when an invalid metric is supplied. A clearer message would specifically list acceptable metrics or suggest that the user verify their input for typographical errors. 
The fixed code elements include changes to `__init__.py` and `base.py` files within the `neighbors` module, particularly enhancing the function responsible for metric validation.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: Better error message for invalid metric in NearestNeighbors \n\nBody:\n<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n<!-- Example: Joblib Error thrown when calling fit on LatentDirichletAllocation with evaluate_every > 0-->\r\nError message for invalid metric in NearestNeighbors is unclear.\r\n\r\n#### Steps/Code to Reproduce\r\n<!--\r\nExample:\r\n```python\r\nfrom sklearn.feature_extraction.text import CountVectorizer\r\nfrom sklearn.decomposition import LatentDirichletAllocation\r\n\r\ndocs = [\"Help I have a bug\" for i in range(1000)]\r\n\r\nvectorizer = CountVectorizer(input=docs, analyzer='word')\r\nlda_features = vectorizer.fit_transform(docs)\r\n\r\nlda_model = LatentDirichletAllocation(\r\n    n_topics=10,\r\n    learning_method='online',\r\n    evaluate_every=10,\r\n    n_jobs=4,\r\n)\r\nmodel = lda_model.fit(lda_features)\r\n```\r\nIf the code is too long, feel free to put it in a public gist and link\r\nit in the issue: https://gist.github.com\r\n-->\r\n```python\r\nNearestNeighbors(metric='cheybshev')\r\n```\r\n\r\n#### Expected Results\r\n<!-- Example: No error is thrown. 
Please paste or describe the expected results.-->\r\nError message stating that metric should be 'cityblock', ... or callable rather than metric not valid for algorithm 'auto'. When I initially saw the error message, I did not realize I had a typo in the metric string. I thought it has something to do with the algorithm.\r\n\r\n#### Actual Results\r\n<!-- Please paste or specifically describe the actual output or traceback. -->\r\n```\r\nValueError: Metric 'cheybshev' not valid for algorithm 'auto'\r\n```\r\n\r\n#### Versions\r\n<!--\r\nPlease run the following snippet and paste the output below.\r\nimport platform; print(platform.platform())\r\nimport sys; print(\"Python\", sys.version)\r\nimport numpy; print(\"NumPy\", numpy.__version__)\r\nimport scipy; print(\"SciPy\", scipy.__version__)\r\nimport sklearn; print(\"Scikit-Learn\", sklearn.__version__)\r\n-->\r\nLinux-4.15.0-24-generic-x86_64-with-debian-stretch-sid\r\nPython 3.6.3 |Anaconda custom (64-bit)| (default, Nov  9 2017, 00:19:18) \r\n[GCC 7.2.0]\r\nNumPy 1.13.3\r\nSciPy 0.19.1\r\nScikit-Learn 0.19.1\r\n\r\n<!-- Thanks for contributing! -->\r\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nsklearn/neighbors/__init__.py\n  line: line 14\n  line: line 28\n\nsklearn/neighbors/base.py\n  function: NeighborsBase._check_algorithm_metric\n"
    },
    {
      "similar_issue": {
        "issue_title": "OneHotEncoder does not output scipy sparse matrix of given dtype",
        "issue_body": "#### Description\r\nOneHotEncoder ignores the specified dtype in the construction of the sparse array when mixed input data are passed, i.e with both categorical and real data type\r\n\r\n#### Steps/Code to Reproduce\r\n```python\r\nimport numpy as np\r\n\r\nfrom sklearn.preprocessing import OneHotEncoder\r\nenc = OneHotEncoder(dtype=np.float32, categorical_features=[0, 1])\r\n\r\nx = np.array([[0, 1, 0, 0], [1, 2, 0, 0]], dtype=int)\r\nsparse = enc.fit(x).transform(x)\r\n```\r\n\r\n#### Expected Results\r\n```python\r\nsparse: <2x6 sparse matrix of type '<class 'numpy.float32'>'\r\n\twith 4 stored elements in COOrdinate format>\r\n```\r\n\r\n#### Actual Results\r\n```python\r\nsparse: <2x6 sparse matrix of type '<class 'numpy.float64'>'\r\n\twith 4 stored elements in COOrdinate format>\r\n```\r\n\r\n#### Versions\r\n__Platform__: Linux-4.13.0-38-generic-x86_64-with-debian-stretch-sid\r\n__Python__: 3.6.3 |Anaconda custom (64-bit)| (default, Oct 13 2017, 12:02:49) [GCC 7.2.0]\r\n__NumPy__: NumPy \r\n__SciPy__: SciPy 1.0.1\r\n__Scikit-Learn__: Scikit-Learn 0.19.1\r\n",
        "issue_id": 11034,
        "pr_number": 11042,
        "pr_title": "[MRG + 1] Ensuring that the OneHotEncoder outputs sparse matrix with given dtype #11034",
        "pr_body": "#### Reference Issues/PRs\r\nOriginal discussion at #11034\r\n\r\n#### What does this implement/fix? Explain your changes.\r\n",
        "issue_closed_at": "2018-06-06T09:03:02Z",
        "base_commit": "f049ec72eb70443ec8d7826066c4246035677c11"
      },
      "summary": "### Summary:\nThis issue pertains to the OneHotEncoder in the Scikit-Learn library, which fails to respect the specified data type (dtype) for the output sparse matrix when handling mixed input data types, specifically a combination of categorical and real data. The expected behavior is for the OneHotEncoder to output a scipy sparse matrix of a given dtype, as specified by the user. However, the encoder was producing a sparse matrix with a default dtype, overriding user specifications.\n\nKey symptoms include the OneHotEncoder generating a sparse matrix of type 'numpy.float64' regardless of the user-specified type 'numpy.float32'. This discrepancy occurs during the transformation process when fitting the encoder to mixed data inputs.\n\nThe affected components are within the preprocessing module of Scikit-Learn, particularly the functions related to encoding transformations, including `add_dummy_feature`, `_transform_selected`, `OneHotEncoder.fit_transform`, and `CategoricalEncoder.transform`.\n\nThe potential impact of this issue is moderate, as it affects the usability of the OneHotEncoder for users who rely on specific dtypes for performance optimization or integration with other systems that require a particular data type. It may lead to unexpected behavior or inefficiencies in applications where precise control over data types is critical.\n\nRelevant technical details include the use of the OneHotEncoder with mixed categorical and real data types, where the dtype parameter should dictate the output format. The fix likely involves ensuring the dtype parameter is honored throughout the transformation process, affecting several encoding-related functions in the preprocessing module.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: OneHotEncoder does not output scipy sparse matrix of given dtype\n\nBody:\n#### Description\r\nOneHotEncoder ignores the specified dtype in the construction of the sparse array when mixed input data are passed, i.e with both categorical and real data type\r\n\r\n#### Steps/Code to Reproduce\r\n```python\r\nimport numpy as np\r\n\r\nfrom sklearn.preprocessing import OneHotEncoder\r\nenc = OneHotEncoder(dtype=np.float32, categorical_features=[0, 1])\r\n\r\nx = np.array([[0, 1, 0, 0], [1, 2, 0, 0]], dtype=int)\r\nsparse = enc.fit(x).transform(x)\r\n```\r\n\r\n#### Expected Results\r\n```python\r\nsparse: <2x6 sparse matrix of type '<class 'numpy.float32'>'\r\n\twith 4 stored elements in COOrdinate format>\r\n```\r\n\r\n#### Actual Results\r\n```python\r\nsparse: <2x6 sparse matrix of type '<class 'numpy.float64'>'\r\n\twith 4 stored elements in COOrdinate format>\r\n```\r\n\r\n#### Versions\r\n__Platform__: Linux-4.13.0-38-generic-x86_64-with-debian-stretch-sid\r\n__Python__: 3.6.3 |Anaconda custom (64-bit)| (default, Oct 13 2017, 12:02:49) [GCC 7.2.0]\r\n__NumPy__: NumPy \r\n__SciPy__: SciPy 1.0.1\r\n__Scikit-Learn__: Scikit-Learn 0.19.1\r\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. 
Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nsklearn/preprocessing/data.py\n  function: add_dummy_feature\n  function: _transform_selected\n  function: _transform_selected\n  function: OneHotEncoder.fit_transform\n  function: CategoricalEncoder.transform\n"
    }
  ]
}