{
  "original_problem": {
    "instance_id": "scikit-learn__scikit-learn-13497",
    "repo": "scikit-learn/scikit-learn",
    "created_at": "2019-03-23T14:28:08Z",
    "problem_statement": "Comparing string to array in _estimate_mi\nIn ``_estimate_mi`` there is ``discrete_features == 'auto'`` but discrete features can be an array of indices or a boolean mask.\r\nThis will error in future versions of numpy.\r\nAlso this means we never test this function with discrete features != 'auto', it seems?\n",
    "patch": "diff --git a/sklearn/feature_selection/mutual_info_.py b/sklearn/feature_selection/mutual_info_.py\n--- a/sklearn/feature_selection/mutual_info_.py\n+++ b/sklearn/feature_selection/mutual_info_.py\n@@ -10,7 +10,7 @@\n from ..preprocessing import scale\n from ..utils import check_random_state\n from ..utils.fixes import _astype_copy_false\n-from ..utils.validation import check_X_y\n+from ..utils.validation import check_array, check_X_y\n from ..utils.multiclass import check_classification_targets\n \n \n@@ -247,14 +247,16 @@ def _estimate_mi(X, y, discrete_features='auto', discrete_target=False,\n     X, y = check_X_y(X, y, accept_sparse='csc', y_numeric=not discrete_target)\n     n_samples, n_features = X.shape\n \n-    if discrete_features == 'auto':\n-        discrete_features = issparse(X)\n-\n-    if isinstance(discrete_features, bool):\n+    if isinstance(discrete_features, (str, bool)):\n+        if isinstance(discrete_features, str):\n+            if discrete_features == 'auto':\n+                discrete_features = issparse(X)\n+            else:\n+                raise ValueError(\"Invalid string value for discrete_features.\")\n         discrete_mask = np.empty(n_features, dtype=bool)\n         discrete_mask.fill(discrete_features)\n     else:\n-        discrete_features = np.asarray(discrete_features)\n+        discrete_features = check_array(discrete_features, ensure_2d=False)\n         if discrete_features.dtype != 'bool':\n             discrete_mask = np.zeros(n_features, dtype=bool)\n             discrete_mask[discrete_features] = True\n"
  },
  "candidates_evaluated": 5,
  "judgment_result": {
    "candidates": [
      {
        "idx": 1,
        "id": "similar_11906",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue focuses on improving error messaging, which is not directly related to handling data type mismatches."
      },
      {
        "idx": 2,
        "id": "similar_6032",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue deals with attribute size inconsistency, which is unrelated to data type validation or handling."
      },
      {
        "idx": 3,
        "id": "similar_12276",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue is about ensuring model fitting status, which does not relate to data type comparison or validation."
      },
      {
        "idx": 4,
        "id": "similar_7976",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue involves errors during re-fitting due to data handling, but it does not address data type comparison or validation."
      },
      {
        "idx": 5,
        "id": "similar_11951",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue is about scoring without input samples, which is not related to data type validation or handling."
      }
    ]
  },
  "raw_summaries": [
    {
      "similar_issue": {
        "issue_title": "Better error message for invalid metric in NearestNeighbors ",
        "issue_body": "<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n<!-- Example: Joblib Error thrown when calling fit on LatentDirichletAllocation with evaluate_every > 0-->\r\nError message for invalid metric in NearestNeighbors is unclear.\r\n\r\n#### Steps/Code to Reproduce\r\n<!--\r\nExample:\r\n```python\r\nfrom sklearn.feature_extraction.text import CountVectorizer\r\nfrom sklearn.decomposition import LatentDirichletAllocation\r\n\r\ndocs = [\"Help I have a bug\" for i in range(1000)]\r\n\r\nvectorizer = CountVectorizer(input=docs, analyzer='word')\r\nlda_features = vectorizer.fit_transform(docs)\r\n\r\nlda_model = LatentDirichletAllocation(\r\n    n_topics=10,\r\n    learning_method='online',\r\n    evaluate_every=10,\r\n    n_jobs=4,\r\n)\r\nmodel = lda_model.fit(lda_features)\r\n```\r\nIf the code is too long, feel free to put it in a public gist and link\r\nit in the issue: https://gist.github.com\r\n-->\r\n```python\r\nNearestNeighbors(metric='cheybshev')\r\n```\r\n\r\n#### Expected Results\r\n<!-- Example: No error is thrown. Please paste or describe the expected results.-->\r\nError message stating that metric should be 'cityblock', ... or callable rather than metric not valid for algorithm 'auto'. When I initially saw the error message, I did not realize I had a typo in the metric string. I thought it has something to do with the algorithm.\r\n\r\n#### Actual Results\r\n<!-- Please paste or specifically describe the actual output or traceback. 
-->\r\n```\r\nValueError: Metric 'cheybshev' not valid for algorithm 'auto'\r\n```\r\n\r\n#### Versions\r\n<!--\r\nPlease run the following snippet and paste the output below.\r\nimport platform; print(platform.platform())\r\nimport sys; print(\"Python\", sys.version)\r\nimport numpy; print(\"NumPy\", numpy.__version__)\r\nimport scipy; print(\"SciPy\", scipy.__version__)\r\nimport sklearn; print(\"Scikit-Learn\", sklearn.__version__)\r\n-->\r\nLinux-4.15.0-24-generic-x86_64-with-debian-stretch-sid\r\nPython 3.6.3 |Anaconda custom (64-bit)| (default, Nov  9 2017, 00:19:18) \r\n[GCC 7.2.0]\r\nNumPy 1.13.3\r\nSciPy 0.19.1\r\nScikit-Learn 0.19.1\r\n\r\n<!-- Thanks for contributing! -->\r\n",
        "issue_id": 11906,
        "pr_number": 11914,
        "pr_title": "[MRG] ENH Better error message for metrics of neighbors",
        "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist\r\n-->\r\n\r\n#### Reference Issues/PRs\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\n#Fixes #11906 \r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\nAdded expression to error message to get list of valid metrics.\r\n\r\n<!--\r\n#### Any other comments?\r\n\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
        "issue_closed_at": "2018-09-13T15:34:02Z",
        "base_commit": "7ed61a24feb4ffde0bee9342acf4a58e3f946a61"
      },
      "summary": "### Summary:\n\nThis issue pertains to the need for improved error messaging in the `NearestNeighbors` component of the Scikit-Learn library. Specifically, the error message displayed when an invalid metric is specified is ambiguous and does not clearly guide the user towards the actual problem. Instead of indicating a typo in the metric string, the error message incorrectly suggests that the issue lies with the algorithm's compatibility with the given metric. This led to confusion and misinterpretation, as the user believed the problem was related to the algorithm rather than the misspelled metric.\n\n1. **Problem Description in General Terms:**\n   The error message for invalid metric input in the `NearestNeighbors` function is misleading, causing users to misinterpret the source of the error.\n\n2. **Key Symptoms and Behaviors Observed:**\n   - The error message incorrectly attributes the issue to algorithm compatibility rather than an invalid metric.\n   - Users may not immediately recognize input typos in metric strings due to the misleading error message.\n\n3. **Affected Components or Systems:**\n   - The `NearestNeighbors` class within the Scikit-Learn library.\n   - Specifically, the error handling mechanism for metric validation in the `NeighborsBase._check_algorithm_metric` function.\n\n4. **Potential Impact or Severity:**\n   - Users may experience confusion and frustration due to unclear error messages, potentially leading to increased support requests or prolonged debugging.\n   - The issue could hinder effective usage of the `NearestNeighbors` functionality, impacting users' ability to implement nearest neighbor algorithms correctly.\n\n5. 
**Relevant Technical Details Abstracted for Broader Understanding:**\n   - The error message should explicitly list the accepted metric values and highlight potential typos in the input.\n   - This improvement would enhance the user experience by providing clearer guidance on correcting input errors, thereby reducing misunderstanding and improving the usability of the library.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: Better error message for invalid metric in NearestNeighbors \n\nBody:\n<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n<!-- Example: Joblib Error thrown when calling fit on LatentDirichletAllocation with evaluate_every > 0-->\r\nError message for invalid metric in NearestNeighbors is unclear.\r\n\r\n#### Steps/Code to Reproduce\r\n<!--\r\nExample:\r\n```python\r\nfrom sklearn.feature_extraction.text import CountVectorizer\r\nfrom sklearn.decomposition import LatentDirichletAllocation\r\n\r\ndocs = [\"Help I have a bug\" for i in range(1000)]\r\n\r\nvectorizer = CountVectorizer(input=docs, analyzer='word')\r\nlda_features = vectorizer.fit_transform(docs)\r\n\r\nlda_model = LatentDirichletAllocation(\r\n    n_topics=10,\r\n    learning_method='online',\r\n    evaluate_every=10,\r\n    n_jobs=4,\r\n)\r\nmodel = lda_model.fit(lda_features)\r\n```\r\nIf the code is too long, feel free to put it in a public gist and link\r\nit in the issue: https://gist.github.com\r\n-->\r\n```python\r\nNearestNeighbors(metric='cheybshev')\r\n```\r\n\r\n#### Expected Results\r\n<!-- Example: No error is thrown. 
Please paste or describe the expected results.-->\r\nError message stating that metric should be 'cityblock', ... or callable rather than metric not valid for algorithm 'auto'. When I initially saw the error message, I did not realize I had a typo in the metric string. I thought it has something to do with the algorithm.\r\n\r\n#### Actual Results\r\n<!-- Please paste or specifically describe the actual output or traceback. -->\r\n```\r\nValueError: Metric 'cheybshev' not valid for algorithm 'auto'\r\n```\r\n\r\n#### Versions\r\n<!--\r\nPlease run the following snippet and paste the output below.\r\nimport platform; print(platform.platform())\r\nimport sys; print(\"Python\", sys.version)\r\nimport numpy; print(\"NumPy\", numpy.__version__)\r\nimport scipy; print(\"SciPy\", scipy.__version__)\r\nimport sklearn; print(\"Scikit-Learn\", sklearn.__version__)\r\n-->\r\nLinux-4.15.0-24-generic-x86_64-with-debian-stretch-sid\r\nPython 3.6.3 |Anaconda custom (64-bit)| (default, Nov  9 2017, 00:19:18) \r\n[GCC 7.2.0]\r\nNumPy 1.13.3\r\nSciPy 0.19.1\r\nScikit-Learn 0.19.1\r\n\r\n<!-- Thanks for contributing! -->\r\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nsklearn/neighbors/__init__.py\n  line: line 14\n  line: line 28\n\nsklearn/neighbors/base.py\n  function: NeighborsBase._check_algorithm_metric\n"
    },
    {
      "similar_issue": {
        "issue_title": "LDA.explained_variance_ratio_ is of the wrong size",
        "issue_body": "The docs say that <a href=\"http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis\">LDA.explained_variance_ratio_</a> should have only `n_components_`. But it doesn't.\n\nIt looks like this bug only exists when we use the `eigen` solver, not the `svd` solver.\n\n```\n>>> import numpy as np\n>>> from sklearn.lda import LDA\n>>> from sklearn.utils.testing import assert_equal\n>>>\n>>> state = np.random.RandomState(0)\n>>> X = state.normal(loc=0, scale=100, size=(40, 20))\n>>> y = state.randint(0, 3, size=(40, 1))\n>>>\n>>> # Train the LDA classifier. Use the eigen solver\n>>> lda_eigen = LDA(solver='eigen', n_components=5)\n>>> lda_eigen.fit(X, y)\n>>> assert_equal(lda_eigen.explained_variance_ratio_.shape, (5,))\nAssertionError: Tuples differ: (20,) != (5,)\n\nFirst differing element 0:\n20\n5\n\n- (20,)\n+ (5,)\n```\n\nLooks like we fix either the docs or the code. Which one?\n\nPinging @JPFrancoia.\n\nAddresses an issue in #6031.\n",
        "issue_id": 6032,
        "pr_number": 7632,
        "pr_title": "[MRG+1] Correcting length of explained_variance_ratio_, eigen solver, final PR",
        "pr_body": "<!--\nThanks for contributing a pull request! Please ensure you have taken a look at\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#Contributing-Pull-Requests\n-->\n#### Reference Issue\n\n<!-- Example: Fixes #1234 -->\n\nFix #6032 \n#### What does this implement/fix? Explain your changes.\n\nAttribute explained_variance_ratio_ from LinearDiscriminantAnalysis class will be of length n_components (eigen solver).\n#### Any other comments?\n\nThis PR follows PR 7616. I mixed up my git history, so it was easier to open a new PR.\n\n<!--\nPlease be aware that we are a loose team of volunteers so patience is\nnecessary; assistance handling other issues is very welcome. We value\nall user contributions, no matter how minor they are. If we are slow to\nreview, either the pull request needs some benchmarking, tinkering,\nconvincing, etc. or more likely the reviewers are simply busy. In either\ncase, we ask for your understanding during the review process.\nFor more information, see our FAQ on this topic:\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\n\nThanks for contributing!\n-->\n",
        "issue_closed_at": "2016-10-25T12:52:13Z",
        "base_commit": "ee3e61754bd4bb10cea8065993e462fc7b112cb3"
      },
      "summary": "### Summary:\nThis issue pertains to an inconsistency in the `explained_variance_ratio_` attribute of the `LinearDiscriminantAnalysis` (LDA) class in the scikit-learn library, specifically when using the `eigen` solver. According to the documentation, this attribute should have a size corresponding to the number of components specified (`n_components_`). However, it incorrectly returns a size that matches the number of features in the input data instead. This discrepancy does not occur when using the `svd` solver.\n\nKey symptoms include an `AssertionError` during the testing of the `explained_variance_ratio_` attribute size, indicating that the size of the returned tuple (20) does not match the expected size (5) based on the number of components. This suggests a bug in the implementation rather than an error in the documentation.\n\nThe components affected by this issue are functions within the `LinearDiscriminantAnalysis` class, including `_solve_lsqr` and `_solve_svd`, as well as the `QuadraticDiscriminantAnalysis.fit` and `LinearDiscriminantAnalysis.transform` methods.\n\nThe potential impact of this issue is significant for users relying on correct dimensionality of the `explained_variance_ratio_` attribute for further data analysis or model evaluation. It may lead to incorrect interpretations or downstream errors in data processing pipelines.\n\nRelevant technical details abstracted for a broader understanding include the fact that the issue is specific to the usage of the `eigen` solver and not the `svd` solver, highlighting the importance of solver-specific implementation in machine learning algorithms. The resolution involves ensuring that the `explained_variance_ratio_` attribute correctly reflects the number of components, thereby aligning with the documentation and user expectations.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: LDA.explained_variance_ratio_ is of the wrong size\n\nBody:\nThe docs say that <a href=\"http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis\">LDA.explained_variance_ratio_</a> should have only `n_components_`. But it doesn't.\n\nIt looks like this bug only exists when we use the `eigen` solver, not the `svd` solver.\n\n```\n>>> import numpy as np\n>>> from sklearn.lda import LDA\n>>> from sklearn.utils.testing import assert_equal\n>>>\n>>> state = np.random.RandomState(0)\n>>> X = state.normal(loc=0, scale=100, size=(40, 20))\n>>> y = state.randint(0, 3, size=(40, 1))\n>>>\n>>> # Train the LDA classifier. Use the eigen solver\n>>> lda_eigen = LDA(solver='eigen', n_components=5)\n>>> lda_eigen.fit(X, y)\n>>> assert_equal(lda_eigen.explained_variance_ratio_.shape, (5,))\nAssertionError: Tuples differ: (20,) != (5,)\n\nFirst differing element 0:\n20\n5\n\n- (20,)\n+ (5,)\n```\n\nLooks like we fix either the docs or the code. Which one?\n\nPinging @JPFrancoia.\n\nAddresses an issue in #6031.\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. 
Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nsklearn/discriminant_analysis.py\n  function: LinearDiscriminantAnalysis._solve_lsqr\n  function: LinearDiscriminantAnalysis._solve_svd\n  function: QuadraticDiscriminantAnalysis.fit\n  function: LinearDiscriminantAnalysis.transform\n"
    },
    {
      "similar_issue": {
        "issue_title": "Calling NearestNeighbors.{kneighbors,radius_neighbors}_graph without first fitting should raise NotFittedError",
        "issue_body": "<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n\r\nRunning prediction methods without running `fit` should raise `NotFittedError`. This is not the case for some non-standard methods.\r\n\r\n`check_is_fitted` should be applied for this purpose.\r\n\r\n#### Steps/Code to Reproduce\r\n\r\n```py\r\nfrom sklearn.neighbors import NearestNeighbors\r\nNearestNeighbors().kneighbors_graph([[1]])\r\n```\r\nor\r\n```py\r\nfrom sklearn.neighbors import NearestNeighbors\r\nNearestNeighbors().radius_neighbors_graph([[1]])\r\n```\r\n\r\n\r\n#### Expected Results\r\n\r\nNotFittedError raised\r\n\r\n#### Actual Results\r\nAttributeError raised\r\n\r\n#### Versions\r\n\r\n```\r\nSystem\r\n------\r\n    python: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:52:12)  [GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]\r\nexecutable: /Users/joel/anaconda3/envs/scipy3k/bin/python\r\n   machine: Darwin-17.7.0-x86_64-i386-64bit\r\n\r\nBLAS\r\n----\r\n  lib_dirs: /Users/joel/anaconda3/envs/scipy3k/lib\r\n    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None\r\ncblas_libs: mkl_rt, pthread\r\n\r\nPython deps\r\n-----------\r\nsetuptools: 37.0.0\r\n     scipy: 1.0.0\r\n       pip: 18.0\r\n     numpy: 1.14.1\r\n   sklearn: 0.21.dev0\r\n    pandas: 0.23.4\r\n    Cython: 0.28.5\r\n```",
        "issue_id": 12276,
        "pr_number": 12279,
        "pr_title": "[MRG+1] Add check_is_fitted to non standard functions",
        "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist\r\n-->\r\n\r\n#### Reference Issues/PRs\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\nFixes #12276 \r\n\r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\nAdding `check_is_fitted` method to other non standard functions\r\n\r\n#### Any other comments?\r\n\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
        "issue_closed_at": "2018-10-19T13:46:09Z",
        "base_commit": "74b56dbc57d9295df8fb653adccb265da356b670"
      },
      "summary": "### Summary: This issue concerns the `NearestNeighbors` class in scikit-learn. Two of its methods, `kneighbors_graph` and `radius_neighbors_graph`, do not check whether the estimator has been fitted before they run. Calling a prediction method on an unfitted estimator should raise a `NotFittedError`, which tells the user to call `fit` first. These two methods raise an `AttributeError` instead, which does not point to the missing `fit` call. The affected components are `KNeighborsMixin.kneighbors_graph` and `RadiusNeighborsMixin.radius_neighbors_graph` in `sklearn/neighbors/base.py`. Users who hit the error may not understand why their code fails or how to correct it. The issue report proposes applying `check_is_fitted` so that these methods validate the fitted state before proceeding, which would align them with standard scikit-learn behavior.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: Calling NearestNeighbors.{kneighbors,radius_neighbors}_graph without first fitting should raise NotFittedError\n\nBody:\n<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n\r\nRunning prediction methods without running `fit` should raise `NotFittedError`. 
This is not the case for some non-standard methods.\r\n\r\n`check_is_fitted` should be applied for this purpose.\r\n\r\n#### Steps/Code to Reproduce\r\n\r\n```py\r\nfrom sklearn.neighbors import NearestNeighbors\r\nNearestNeighbors().kneighbors_graph([[1]])\r\n```\r\nor\r\n```py\r\nfrom sklearn.neighbors import NearestNeighbors\r\nNearestNeighbors().radius_neighbors_graph([[1]])\r\n```\r\n\r\n\r\n#### Expected Results\r\n\r\nNotFittedError raised\r\n\r\n#### Actual Results\r\nAttributeError raised\r\n\r\n#### Versions\r\n\r\n```\r\nSystem\r\n------\r\n    python: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:52:12)  [GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]\r\nexecutable: /Users/joel/anaconda3/envs/scipy3k/bin/python\r\n   machine: Darwin-17.7.0-x86_64-i386-64bit\r\n\r\nBLAS\r\n----\r\n  lib_dirs: /Users/joel/anaconda3/envs/scipy3k/lib\r\n    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None\r\ncblas_libs: mkl_rt, pthread\r\n\r\nPython deps\r\n-----------\r\nsetuptools: 37.0.0\r\n     scipy: 1.0.0\r\n       pip: 18.0\r\n     numpy: 1.14.1\r\n   sklearn: 0.21.dev0\r\n    pandas: 0.23.4\r\n    Cython: 0.28.5\r\n```\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nsklearn/neighbors/base.py\n  function: KNeighborsMixin.kneighbors_graph\n  function: RadiusNeighborsMixin.radius_neighbors_graph\n"
    },
    {
      "similar_issue": {
        "issue_title": "MLPClasiffier produce error when trying to re-fit",
        "issue_body": "Hi,\r\n\r\nI am training a MLPClasiffier model twice, each time on a different data-set.\r\nOn the second iteration the fit method produce an error. Every time a different error.\r\nI did the processes on various models, this is the only one that produce an error. \r\n\r\nthis are the errors i get (each time a different one) - \r\n\r\n> _lbfgsb.error: failed in converting 7th argument `g' of _lbfgsb.setulb to C/Fortran array\r\n\r\n> ValueError: total size of new array must be unchanged\r\n\r\n > ValueError: operands could not be broadcast together with shapes (154,100) (25,) (154,100) \r\n\r\nthanks",
        "issue_id": 7976,
        "pr_number": 8035,
        "pr_title": "[MRG+1] Catch cases for different class size in MLPClassifier with warm start (#7976) ",
        "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#Contributing-Pull-Requests\r\n-->\r\n#### Reference Issue\r\n<!-- Example: Fixes #1234 -->\r\nFixes #7976 \r\n\r\n#### What does this implement/fix? Explain your changes.\r\nThis provides a test for different cases that throws an error when warm_start = True for MLPClassifier. Currently, vague errors are thrown when class size is different between the current fit and the previous fit. This fix will throw a clearer error message. \r\n\r\n#### Any other comments?\r\n\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n\r\n",
        "issue_closed_at": "2016-12-29T01:01:13Z",
        "base_commit": "40a1b7a0b10fea2995c7aaa46c90a9633e6d99f6"
      },
      "summary": "### Summary:\nThis issue is related to the `MLPClassifier` from the scikit-learn library, which encounters errors when attempting to re-fit the model on a second dataset. The problem arises specifically during the second iteration of fitting, resulting in various errors depending on the input dataset. The observed errors include issues with converting arguments to C/Fortran arrays, incompatible array sizes, and shape mismatches during operations, indicating potential problems with data handling and validation within the model's fitting process.\n\nKey symptoms include:\n- Errors occur exclusively on the second fit attempt, not during the first.\n- Different error messages are generated each time, suggesting inconsistent behavior.\n- The specific errors reported involve argument conversion failures, array size mismatches, and broadcasting issues.\n\nThe affected component is the `MLPClassifier` in the `sklearn.neural_network` module, which is part of the broader scikit-learn machine learning library. The impact of this issue is significant for users who need to re-fit their models on multiple datasets, as it prevents the successful execution of subsequent training iterations and could lead to unreliable model performance.\n\nThe errors point to missing input validation in `fit`: according to the linked pull request, they arise when `warm_start=True` and the set of classes differs between the previous fit and the current one, and the fix raises a clearer error message in that case. The listed code changes touch `MLPRegressor._validate_input` and `MLPRegressor.predict` in `sklearn/neural_network/multilayer_perceptron.py`.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: MLPClasiffier produce error when trying to re-fit\n\nBody:\nHi,\r\n\r\nI am training a MLPClasiffier model twice, each time on a different data-set.\r\nOn the second iteration the fit method produce an error. Every time a different error.\r\nI did the processes on various models, this is the only one that produce an error. \r\n\r\nthis are the errors i get (each time a different one) - \r\n\r\n> _lbfgsb.error: failed in converting 7th argument `g' of _lbfgsb.setulb to C/Fortran array\r\n\r\n> ValueError: total size of new array must be unchanged\r\n\r\n > ValueError: operands could not be broadcast together with shapes (154,100) (25,) (154,100) \r\n\r\nthanks\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nsklearn/neural_network/multilayer_perceptron.py\n  function: MLPRegressor._validate_input\n  function: MLPRegressor.predict\n"
    },
    {
      "similar_issue": {
        "issue_title": "Scoring for dummy classifier does not work without test samples",
        "issue_body": "When using the `Dummy classifier`, it is possible to fit the classifier without providing any examples, which makes sense as the classifier only operates on the targets. However, it is not possible to score the classifier without providing examples. Using `DummyClassifier` without examples is helpful if the examples, but not the targets, are to big to fit into memory. One can still get around this by constructing \r\nartificial examples, say just zeros, but avoiding this would make things a bit easier.\r\n\r\n#### Code to Reproduce\r\n```python\r\nfrom sklearn.dummy import DummyClassifier\r\nimport numpy as np\r\n\r\ny = [1, 1, 2]\r\ny_ = [1, 1, 2]\r\n\r\nd = DummyClassifier()\r\nd.fit(None, y)\r\nprint(d.score(None, y_))\r\n```\r\n\r\n\r\n#### Expected Results\r\nThe score is printed.\r\n\r\n#### Actual Results\r\nAn exception is thrown\r\n```\r\nValueError: Found input variables with inconsistent numbers of samples: [3, 1]\r\n```\r\n\r\nSo one has to do\r\n\r\n```python\r\nfrom sklearn.dummy import DummyClassifier\r\nimport numpy as np\r\n\r\ny = [1, 1, 2]\r\ny_ = [1, 1, 2]\r\nx = np.zeros(shape=(3, 1))\r\n\r\nd = DummyClassifier()\r\nd.fit(None, y)\r\nprint(d.score(x,  y_))\r\n```\r\n\r\nIf this seems like a useful feature, I would be happy to submit a PR.\r\n\r\n#### Versions\r\nLinux-4.15.0-33-generic-x86_64-with-debian-buster-sid\r\nPython 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) \r\n[GCC 7.2.0]\r\nNumPy 1.14.3\r\nSciPy 1.0.0\r\nScikit-Learn 0.19.2\r\n\r\n<!-- Thanks for contributing! -->\r\n",
        "issue_id": 11951,
        "pr_number": 11957,
        "pr_title": "[MRG] Allow scoring of dummies without testsamples",
        "pr_body": "As DummyClassifier and DummyRegressor operate solely on the targets,\r\nthey can now be used without passing test samples, instead passing None.\r\nAlso includes some minor renaming in the corresponding tests for more\r\nconsistency.\r\n\r\n#### Reference Issues/PRs\r\nResolves #11951\r\n",
        "issue_closed_at": "2018-09-13T08:04:00Z",
        "base_commit": "ddf37c75c7b912104df56e1325363cd94a4fdd5f"
      },
      "summary": "### Summary:\n\nThis issue concerns the scoring of the `DummyClassifier` in the `scikit-learn` library when no test samples are provided. The `DummyClassifier` is a simple model that makes predictions based solely on the target values, so it can be fitted with `None` in place of the input samples. This is useful when the input data is too large to fit into memory but the targets are manageable. Scoring, however, still requires input samples, and calling `score` with `None` raises an exception.\n\nThe key symptom is a `ValueError` reporting inconsistent numbers of samples, so the score can only be computed after constructing artificial input samples, for example an array of zeros. The affected component is the `DummyClassifier` within the `scikit-learn` library, specifically its scoring function, which expects input samples even though this type of classifier does not need them.\n\nThe potential impact of this issue is moderate. It hinders the use of the `DummyClassifier` in scenarios where memory constraints prevent loading the input data, and it forces users to create dummy input samples as a workaround. The core functionality of the classifier is unaffected, but its usage becomes unnecessarily complex.\n\nThe relevant technical detail is that scoring enforces consistent sample counts between input data and targets, a check that is not needed for a classifier that operates independently of the input features. The proposed fix modifies the scoring mechanism so that scoring can occur without input samples, simplifying the process for users dealing with large datasets.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: Scoring for dummy classifier does not work without test samples\n\nBody:\nWhen using the `Dummy classifier`, it is possible to fit the classifier without providing any examples, which makes sense as the classifier only operates on the targets. However, it is not possible to score the classifier without providing examples. Using `DummyClassifier` without examples is helpful if the examples, but not the targets, are to big to fit into memory. One can still get around this by constructing \r\nartificial examples, say just zeros, but avoiding this would make things a bit easier.\r\n\r\n#### Code to Reproduce\r\n```python\r\nfrom sklearn.dummy import DummyClassifier\r\nimport numpy as np\r\n\r\ny = [1, 1, 2]\r\ny_ = [1, 1, 2]\r\n\r\nd = DummyClassifier()\r\nd.fit(None, y)\r\nprint(d.score(None, y_))\r\n```\r\n\r\n\r\n#### Expected Results\r\nThe score is printed.\r\n\r\n#### Actual Results\r\nAn exception is thrown\r\n```\r\nValueError: Found input variables with inconsistent numbers of samples: [3, 1]\r\n```\r\n\r\nSo one has to do\r\n\r\n```python\r\nfrom sklearn.dummy import DummyClassifier\r\nimport numpy as np\r\n\r\ny = [1, 1, 2]\r\ny_ = [1, 1, 2]\r\nx = np.zeros(shape=(3, 1))\r\n\r\nd = DummyClassifier()\r\nd.fit(None, y)\r\nprint(d.score(x,  y_))\r\n```\r\n\r\nIf this seems like a useful feature, I would be happy to submit a PR.\r\n\r\n#### Versions\r\nLinux-4.15.0-33-generic-x86_64-with-debian-buster-sid\r\nPython 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) \r\n[GCC 7.2.0]\r\nNumPy 1.14.3\r\nSciPy 1.0.0\r\nScikit-Learn 0.19.2\r\n\r\n<!-- Thanks for contributing! 
-->\r\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nsklearn/dummy.py\n  function: DummyClassifier.predict_log_proba\n  function: DummyRegressor.predict\n"
    }
  ]
}