{
  "original_problem": {
    "instance_id": "scikit-learn__scikit-learn-13142",
    "repo": "scikit-learn/scikit-learn",
    "created_at": "2019-02-12T14:32:37Z",
    "problem_statement": "GaussianMixture predict and fit_predict disagree when n_init>1\n#### Description\r\nWhen `n_init` is specified in GaussianMixture, the results of fit_predict(X) and predict(X) are often different.  The `test_gaussian_mixture_fit_predict` unit test doesn't catch this because it does not set `n_init`.\r\n\r\n#### Steps/Code to Reproduce\r\n```\r\npython\r\nfrom sklearn.mixture import GaussianMixture\r\nfrom sklearn.utils.testing import assert_array_equal\r\nimport numpy\r\nX = numpy.random.randn(1000,5)\r\nprint 'no n_init'\r\ngm = GaussianMixture(n_components=5)\r\nc1 = gm.fit_predict(X)\r\nc2 = gm.predict(X)\r\nassert_array_equal(c1,c2)\r\nprint 'n_init=5'\r\ngm = GaussianMixture(n_components=5, n_init=5)\r\nc1 = gm.fit_predict(X)\r\nc2 = gm.predict(X)\r\nassert_array_equal(c1,c2)\r\n```\r\n\r\n#### Expected Results\r\n```\r\nno n_init\r\nn_init=5\r\n```\r\nNo exceptions.\r\n\r\n#### Actual Results\r\n```\r\nno n_init\r\nn_init=5\r\nTraceback (most recent call last):\r\n  File \"test_gm.py\", line 17, in <module>\r\n    assert_array_equal(c1,c2)\r\n  File \"/home/scott/.local/lib/python2.7/site-packages/numpy/testing/_private/utils.py\", line 872, in assert_array_equal\r\n    verbose=verbose, header='Arrays are not equal')\r\n  File \"/home/scott/.local/lib/python2.7/site-packages/numpy/testing/_private/utils.py\", line 796, in assert_array_compare\r\n    raise AssertionError(msg)\r\nAssertionError: \r\nArrays are not equal\r\n\r\n(mismatch 88.6%)\r\n x: array([4, 0, 1, 1, 1, 3, 3, 4, 4, 2, 0, 0, 1, 2, 0, 2, 0, 1, 3, 1, 1, 3,\r\n       2, 1, 0, 2, 1, 0, 2, 0, 3, 1, 2, 3, 3, 1, 0, 2, 2, 0, 3, 0, 2, 0,\r\n       4, 2, 3, 0, 4, 2, 4, 1, 0, 2, 2, 1, 3, 2, 1, 4, 0, 2, 2, 1, 1, 2,...\r\n y: array([4, 1, 0, 2, 2, 1, 1, 4, 4, 0, 4, 1, 0, 3, 1, 0, 2, 2, 1, 2, 0, 0,\r\n       1, 0, 4, 1, 0, 4, 0, 1, 1, 2, 3, 1, 4, 0, 1, 4, 4, 4, 0, 1, 0, 2,\r\n       4, 1, 1, 2, 4, 3, 4, 0, 2, 3, 2, 3, 0, 0, 2, 3, 3, 3, 3, 0, 3, 2,...\r\n```\r\n\r\n#### Versions\r\n```\r\nSystem:\r\n    python: 2.7.15rc1 (default, Nov 12 2018, 14:31:15)  [GCC 7.3.0]\r\n   machine: Linux-4.15.0-43-generic-x86_64-with-Ubuntu-18.04-bionic\r\nexecutable: /usr/bin/python\r\n\r\nBLAS:\r\n    macros: HAVE_CBLAS=None, NO_ATLAS_INFO=-1\r\ncblas_libs: cblas\r\n  lib_dirs: /usr/lib/x86_64-linux-gnu\r\n\r\nPython deps:\r\n    Cython: 0.28.5\r\n     scipy: 1.2.0\r\nsetuptools: 39.0.1\r\n       pip: 19.0.1\r\n     numpy: 1.16.0\r\n    pandas: 0.23.1\r\n   sklearn: 0.20.2\r\n```\n",
    "patch": "diff --git a/sklearn/mixture/base.py b/sklearn/mixture/base.py\n--- a/sklearn/mixture/base.py\n+++ b/sklearn/mixture/base.py\n@@ -257,11 +257,6 @@ def fit_predict(self, X, y=None):\n                 best_params = self._get_parameters()\n                 best_n_iter = n_iter\n \n-        # Always do a final e-step to guarantee that the labels returned by\n-        # fit_predict(X) are always consistent with fit(X).predict(X)\n-        # for any value of max_iter and tol (and any random_state).\n-        _, log_resp = self._e_step(X)\n-\n         if not self.converged_:\n             warnings.warn('Initialization %d did not converge. '\n                           'Try different init parameters, '\n@@ -273,6 +268,11 @@ def fit_predict(self, X, y=None):\n         self.n_iter_ = best_n_iter\n         self.lower_bound_ = max_lower_bound\n \n+        # Always do a final e-step to guarantee that the labels returned by\n+        # fit_predict(X) are always consistent with fit(X).predict(X)\n+        # for any value of max_iter and tol (and any random_state).\n+        _, log_resp = self._e_step(X)\n+\n         return log_resp.argmax(axis=1)\n \n     def _e_step(self, X):\n"
  },
  "candidates_evaluated": 5,
  "judgment_result": {
    "candidates": [
      {
        "idx": 1,
        "id": "similar_12276",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue is about error handling for unfitted models, which is unrelated to the consistency problem between fit_predict and predict."
      },
      {
        "idx": 2,
        "id": "similar_7065",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "This issue deals with error handling for unfitted models, not relevant to the consistency issue in GaussianMixture."
      },
      {
        "idx": 3,
        "id": "similar_6032",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue is about attribute size inconsistency, which does not relate to the fit_predict and predict consistency problem."
      },
      {
        "idx": 4,
        "id": "similar_8933",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "This issue involves class label handling in OOB score calculation, unrelated to the GaussianMixture consistency issue."
      },
      {
        "idx": 5,
        "id": "similar_11906",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue is about improving error message clarity, which does not relate to the consistency problem in GaussianMixture."
      }
    ]
  },
  "raw_summaries": [
    {
      "similar_issue": {
        "issue_title": "Calling NearestNeighbors.{kneighbors,radius_neighbors}_graph without first fitting should raise NotFittedError",
        "issue_body": "<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n\r\nRunning prediction methods without running `fit` should raise `NotFittedError`. This is not the case for some non-standard methods.\r\n\r\n`check_is_fitted` should be applied for this purpose.\r\n\r\n#### Steps/Code to Reproduce\r\n\r\n```py\r\nfrom sklearn.neighbors import NearestNeighbors\r\nNearestNeighbors().kneighbors_graph([[1]])\r\n```\r\nor\r\n```py\r\nfrom sklearn.neighbors import NearestNeighbors\r\nNearestNeighbors().radius_neighbors_graph([[1]])\r\n```\r\n\r\n\r\n#### Expected Results\r\n\r\nNotFittedError raised\r\n\r\n#### Actual Results\r\nAttributeError raised\r\n\r\n#### Versions\r\n\r\n```\r\nSystem\r\n------\r\n    python: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:52:12)  [GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]\r\nexecutable: /Users/joel/anaconda3/envs/scipy3k/bin/python\r\n   machine: Darwin-17.7.0-x86_64-i386-64bit\r\n\r\nBLAS\r\n----\r\n  lib_dirs: /Users/joel/anaconda3/envs/scipy3k/lib\r\n    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None\r\ncblas_libs: mkl_rt, pthread\r\n\r\nPython deps\r\n-----------\r\nsetuptools: 37.0.0\r\n     scipy: 1.0.0\r\n       pip: 18.0\r\n     numpy: 1.14.1\r\n   sklearn: 0.21.dev0\r\n    pandas: 0.23.4\r\n    Cython: 0.28.5\r\n```",
        "issue_id": 12276,
        "pr_number": 12279,
        "pr_title": "[MRG+1] Add check_is_fitted to non standard functions",
        "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist\r\n-->\r\n\r\n#### Reference Issues/PRs\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\nFixes #12276 \r\n\r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\nAdding `check_is_fitted` method to other non standard functions\r\n\r\n#### Any other comments?\r\n\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
        "issue_closed_at": "2018-10-19T13:46:09Z",
        "base_commit": "74b56dbc57d9295df8fb653adccb265da356b670"
      },
      "summary": "### Summary:\nThis issue pertains to a flaw in the error handling mechanism within the scikit-learn library, specifically regarding the `NearestNeighbors` class's methods `kneighbors_graph` and `radius_neighbors_graph`. The core problem is that these methods do not properly check whether the model has been fitted before execution. As a result, instead of raising a `NotFittedError`—which is the expected exception when a model is used without prior fitting—an `AttributeError` is raised, leading to confusion and potentially hindering debugging efforts.\n\nKey symptoms observed include the unintended raising of an `AttributeError` when the `kneighbors_graph` or `radius_neighbors_graph` methods are called without prior invocation of the `fit` method on an instance of `NearestNeighbors`. This behavior is inconsistent with the standard practice in scikit-learn, where models must be fitted before making predictions or generating outputs.\n\nThe components affected by this issue include the `KNeighborsMixin.kneighbors_graph` and `RadiusNeighborsMixin.radius_neighbors_graph` functions in the `sklearn/neighbors/base.py` module. The potential impact of this issue is primarily on user experience and the reliability of error reporting within the library, which could lead to a misunderstanding of the root cause of the error by the users.\n\nFrom a technical perspective, the resolution involves implementing a check for model fitting status using the `check_is_fitted` utility function. This ensures that a `NotFittedError` is raised as intended, improving the robustness and predictability of the library's behavior in scenarios where methods are called without prior model fitting.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: Calling NearestNeighbors.{kneighbors,radius_neighbors}_graph without first fitting should raise NotFittedError\n\nBody:\n<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n\r\nRunning prediction methods without running `fit` should raise `NotFittedError`. This is not the case for some non-standard methods.\r\n\r\n`check_is_fitted` should be applied for this purpose.\r\n\r\n#### Steps/Code to Reproduce\r\n\r\n```py\r\nfrom sklearn.neighbors import NearestNeighbors\r\nNearestNeighbors().kneighbors_graph([[1]])\r\n```\r\nor\r\n```py\r\nfrom sklearn.neighbors import NearestNeighbors\r\nNearestNeighbors().radius_neighbors_graph([[1]])\r\n```\r\n\r\n\r\n#### Expected Results\r\n\r\nNotFittedError raised\r\n\r\n#### Actual Results\r\nAttributeError raised\r\n\r\n#### Versions\r\n\r\n```\r\nSystem\r\n------\r\n    python: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:52:12)  [GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]\r\nexecutable: /Users/joel/anaconda3/envs/scipy3k/bin/python\r\n   machine: Darwin-17.7.0-x86_64-i386-64bit\r\n\r\nBLAS\r\n----\r\n  lib_dirs: /Users/joel/anaconda3/envs/scipy3k/lib\r\n    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None\r\ncblas_libs: mkl_rt, pthread\r\n\r\nPython deps\r\n-----------\r\nsetuptools: 37.0.0\r\n     scipy: 1.0.0\r\n       pip: 18.0\r\n     numpy: 1.14.1\r\n   sklearn: 0.21.dev0\r\n    pandas: 0.23.4\r\n    Cython: 0.28.5\r\n```\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nsklearn/neighbors/base.py\n  function: KNeighborsMixin.kneighbors_graph\n  function: RadiusNeighborsMixin.radius_neighbors_graph\n"
    },
    {
      "similar_issue": {
        "issue_title": "DummyRegressor raises ValueError instead of NotFittedError",
        "issue_body": "#### Description\n\ntrying to call predict on an instance of DummyRegressor that has not been fitted raises ValueError. I think it should be NotFittedError.\n#### Steps/Code to Reproduce\n\n```\n>>>from sklearn.dummy import DummyRegressor\n>>>clf = DummyRegressor()\n>>>clf.predict(np.zeros((10,10)))\n```\n#### Expected Results\n\nNotFittedError\n#### Actual Results\n\nValueError\n\n<!--\nIf your issue is a usage question, submit it here instead:\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\n-->\n\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\n\n<!-- Example: Joblib Error thrown when calling fit on LatentDirichletAllocation with evaluate_every > 0-->\n\n<!--\nExample:\n```\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom sklearn.decomposition import LatentDirichletAllocation\n\ndocs = [\"Help I have a bug\" for i in range(1000)]\n\nvectorizer = CountVectorizer(input=docs, analyzer='word')\nlda_features = vectorizer.fit_transform(docs)\n\nlda_model = LatentDirichletAllocation(\n    n_topics=10,\n    learning_method='online',\n    evaluate_every=10,\n    n_jobs=4,\n)\nmodel = lda_model.fit(lda_features)\n```\nIf the code is too long, feel free to put it in a public gist and link\nit in the issue: https://gist.github.com\n-->\n#### Versions\n\n<!--\nPlease run the following snippet and paste the output below.\nimport platform; print(platform.platform())\nimport sys; print(\"Python\", sys.version)\nimport numpy; print(\"NumPy\", numpy.__version__)\nimport scipy; print(\"SciPy\", scipy.__version__)\nimport sklearn; print(\"Scikit-Learn\", sklearn.__version__)\n-->\n\nLinux-3.19.0-47-generic-x86_64-with-Ubuntu-14.04-trusty\n('Python', '2.7.6 (default, Jun 22 2015, 17:58:13) \\n[GCC 4.8.2]')\n('NumPy', '1.11.0')\n('SciPy', '0.16.1')\n('Scikit-Learn', '0.17')\n\n<!-- Thanks for contributing! -->\n",
        "issue_id": 7065,
        "pr_number": 7069,
        "pr_title": "DummyClassifier and DummyRegressor raise NotFittedError",
        "pr_body": "<!--\nThanks for contributing a pull request! Please ensure you have taken a look at\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#Contributing-Pull-Requests\n-->\n#### Reference Issue\n\nFixes #7065\n\n<!-- Example: Fixes #1234 -->\n#### What does this implement/fix? Explain your changes.\n\nDummyClassifier and DummyRegressor raise NotFittedError\n#### Any other comments?\n\n<!--\nPlease be aware that we are a loose team of volunteers so patience is\nnecessary; assistance handling other issues is very welcome. We value\nall user contributions, no matter how minor they are. If we are slow to\nreview, either the pull request needs some benchmarking, tinkering,\nconvincing, etc. or more likely the reviewers are simply busy. In either\ncase, we ask for your understanding during the review process.\nFor more information, see our FAQ on this topic:\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\n\nThanks for contributing!\n-->\n",
        "issue_closed_at": "2016-07-25T07:53:18Z",
        "base_commit": "7a7e8091c73abc59de4bb71f577b020cd2572c38"
      },
      "summary": "### Summary: \nThis issue pertains to the scikit-learn library, specifically involving the `DummyRegressor` class. The problem arises when an attempt is made to call the `predict` method on an instance of `DummyRegressor` that has not been fitted to any data. Instead of the expected `NotFittedError`, a `ValueError` is raised. This behavior is inconsistent with typical error handling conventions in scikit-learn, where a `NotFittedError` is used to indicate that a model has not been fitted prior to making predictions.\n\n1. **Problem description in general terms**: The issue is a discrepancy in error handling when using a regression model without prior fitting, leading to the wrong type of error being raised.\n\n2. **Key symptoms and behaviors observed**: When the `predict` method is called on an unfitted `DummyRegressor`, the system raises a `ValueError` instead of the more appropriate `NotFittedError`.\n\n3. **Affected components or systems**: The issue impacts the `DummyRegressor` class within the `sklearn.dummy` module of the scikit-learn library. It may also extend to similar methods in other dummy estimators, such as `DummyClassifier`.\n\n4. **Potential impact or severity**: This issue can lead to confusion for developers and users of the library who rely on consistent error messaging for debugging and ensuring models are used correctly. It affects error handling and could potentially lead to incorrect assumptions about the model’s state.\n\n5. **Relevant technical details abstracted for broader understanding**: The fix was applied to the `DummyRegressor.predict` method within the `sklearn/dummy.py` file, ensuring that a `NotFittedError` is raised when predicting without fitting. This adjustment aligns the behavior with other scikit-learn estimators, maintaining consistency within the library's error handling practices.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: DummyRegressor raises ValueError instead of NotFittedError\n\nBody:\n#### Description\n\ntrying to call predict on an instance of DummyRegressor that has not been fitted raises ValueError. I think it should be NotFittedError.\n#### Steps/Code to Reproduce\n\n```\n>>>from sklearn.dummy import DummyRegressor\n>>>clf = DummyRegressor()\n>>>clf.predict(np.zeros((10,10)))\n```\n#### Expected Results\n\nNotFittedError\n#### Actual Results\n\nValueError\n\n<!--\nIf your issue is a usage question, submit it here instead:\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\n-->\n\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\n\n<!-- Example: Joblib Error thrown when calling fit on LatentDirichletAllocation with evaluate_every > 0-->\n\n<!--\nExample:\n```\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom sklearn.decomposition import LatentDirichletAllocation\n\ndocs = [\"Help I have a bug\" for i in range(1000)]\n\nvectorizer = CountVectorizer(input=docs, analyzer='word')\nlda_features = vectorizer.fit_transform(docs)\n\nlda_model = LatentDirichletAllocation(\n    n_topics=10,\n    learning_method='online',\n    evaluate_every=10,\n    n_jobs=4,\n)\nmodel = lda_model.fit(lda_features)\n```\nIf the code is too long, feel free to put it in a public gist and link\nit in the issue: https://gist.github.com\n-->\n#### Versions\n\n<!--\nPlease run the following snippet and paste the output below.\nimport platform; print(platform.platform())\nimport sys; print(\"Python\", sys.version)\nimport numpy; print(\"NumPy\", numpy.__version__)\nimport scipy; print(\"SciPy\", scipy.__version__)\nimport sklearn; print(\"Scikit-Learn\", sklearn.__version__)\n-->\n\nLinux-3.19.0-47-generic-x86_64-with-Ubuntu-14.04-trusty\n('Python', '2.7.6 (default, Jun 22 2015, 17:58:13) \\n[GCC 4.8.2]')\n('NumPy', '1.11.0')\n('SciPy', '0.16.1')\n('Scikit-Learn', '0.17')\n\n<!-- Thanks for contributing! -->\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nsklearn/dummy.py\n  line: line 12\n  function: DummyRegressor.predict\n  function: DummyClassifier.predict_proba\n  function: DummyRegressor.predict\n"
    },
    {
      "similar_issue": {
        "issue_title": "LDA.explained_variance_ratio_ is of the wrong size",
        "issue_body": "The docs say that <a href=\"http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis\">LDA.explained_variance_ratio_</a> should have only `n_components_`. But it doesn't.\n\nIt looks like this bug only exists when we use the `eigen` solver, not the `svd` solver.\n\n```\n>>> import numpy as np\n>>> from sklearn.lda import LDA\n>>> from sklearn.utils.testing import assert_equal\n>>>\n>>> state = np.random.RandomState(0)\n>>> X = state.normal(loc=0, scale=100, size=(40, 20))\n>>> y = state.randint(0, 3, size=(40, 1))\n>>>\n>>> # Train the LDA classifier. Use the eigen solver\n>>> lda_eigen = LDA(solver='eigen', n_components=5)\n>>> lda_eigen.fit(X, y)\n>>> assert_equal(lda_eigen.explained_variance_ratio_.shape, (5,))\nAssertionError: Tuples differ: (20,) != (5,)\n\nFirst differing element 0:\n20\n5\n\n- (20,)\n+ (5,)\n```\n\nLooks like we fix either the docs or the code. Which one?\n\nPinging @JPFrancoia.\n\nAddresses an issue in #6031.\n",
        "issue_id": 6032,
        "pr_number": 7632,
        "pr_title": "[MRG+1] Correcting length of explained_variance_ratio_, eigen solver, final PR",
        "pr_body": "<!--\nThanks for contributing a pull request! Please ensure you have taken a look at\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#Contributing-Pull-Requests\n-->\n#### Reference Issue\n\n<!-- Example: Fixes #1234 -->\n\nFix #6032 \n#### What does this implement/fix? Explain your changes.\n\nAttribute explained_variance_ratio_ from LinearDiscriminantAnalysis class will be of length n_components (eigen solver).\n#### Any other comments?\n\nThis PR follows PR 7616. I mixed up my git history, so it was easier to open a new PR.\n\n<!--\nPlease be aware that we are a loose team of volunteers so patience is\nnecessary; assistance handling other issues is very welcome. We value\nall user contributions, no matter how minor they are. If we are slow to\nreview, either the pull request needs some benchmarking, tinkering,\nconvincing, etc. or more likely the reviewers are simply busy. In either\ncase, we ask for your understanding during the review process.\nFor more information, see our FAQ on this topic:\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\n\nThanks for contributing!\n-->\n",
        "issue_closed_at": "2016-10-25T12:52:13Z",
        "base_commit": "ee3e61754bd4bb10cea8065993e462fc7b112cb3"
      },
      "summary": "### Summary:\nThis issue pertains to a discrepancy between the documented behavior and the actual behavior of the `explained_variance_ratio_` attribute in the `LinearDiscriminantAnalysis` (LDA) class within the Scikit-learn library. Specifically, the documentation states that the `explained_variance_ratio_` attribute should have a size corresponding to the number of components (`n_components_`) specified when using the LDA model. However, when the `eigen` solver is utilized, the attribute's size is incorrectly set to the number of features instead.\n\nKey symptoms and behaviors observed include an assertion error when checking if the `explained_variance_ratio_` attribute has the expected shape `(n_components_,)`. Instead, it incorrectly returns the shape `(number_of_features,)`, which is inconsistent with the intended behavior described in the documentation. This issue is not present when the `svd` solver is used, indicating that the problem is specific to the `eigen` solver.\n\nThe affected component is the `LinearDiscriminantAnalysis` class in the `sklearn.discriminant_analysis` module, particularly when using the `eigen` solver. The potential impact of this issue is significant for users relying on the `explained_variance_ratio_` attribute for dimensionality reduction and data interpretation, as the incorrect size of this attribute could lead to erroneous analysis or misinterpretation of results.\n\nTechnical details abstracted for broader understanding include the need to align the implementation of the `explained_variance_ratio_` attribute with its intended design as per the documentation, either by correcting the code to reflect the documented behavior or by updating the documentation to match the current implementation. The fix involves changes to several functions within the `discriminant_analysis.py` file, ensuring that the attribute's size correctly matches the number of components specified.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: LDA.explained_variance_ratio_ is of the wrong size\n\nBody:\nThe docs say that <a href=\"http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis\">LDA.explained_variance_ratio_</a> should have only `n_components_`. But it doesn't.\n\nIt looks like this bug only exists when we use the `eigen` solver, not the `svd` solver.\n\n```\n>>> import numpy as np\n>>> from sklearn.lda import LDA\n>>> from sklearn.utils.testing import assert_equal\n>>>\n>>> state = np.random.RandomState(0)\n>>> X = state.normal(loc=0, scale=100, size=(40, 20))\n>>> y = state.randint(0, 3, size=(40, 1))\n>>>\n>>> # Train the LDA classifier. Use the eigen solver\n>>> lda_eigen = LDA(solver='eigen', n_components=5)\n>>> lda_eigen.fit(X, y)\n>>> assert_equal(lda_eigen.explained_variance_ratio_.shape, (5,))\nAssertionError: Tuples differ: (20,) != (5,)\n\nFirst differing element 0:\n20\n5\n\n- (20,)\n+ (5,)\n```\n\nLooks like we fix either the docs or the code. Which one?\n\nPinging @JPFrancoia.\n\nAddresses an issue in #6031.\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nsklearn/discriminant_analysis.py\n  function: LinearDiscriminantAnalysis._solve_lsqr\n  function: LinearDiscriminantAnalysis._solve_svd\n  function: QuadraticDiscriminantAnalysis.fit\n  function: LinearDiscriminantAnalysis.transform\n"
    },
    {
      "similar_issue": {
        "issue_title": "BUG: BaggingClassifier.oob_score_ should not change with class label",
        "issue_body": "Let us compute the oob score of a bagged classifier.\r\n\r\n```python\r\nimport numpy as np\r\nimport pandas as pd\r\nfrom sklearn.ensemble import BaggingClassifier\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\n\r\nN = 50\r\nrandState = 5\r\nlabel = 'Label'\r\nfeatures = ['A','B','C']\r\n\r\nlabels = np.random.randint(3, size = N) - 1\r\ndf = pd.DataFrame( labels , index=range(N), columns=[label] )\r\nfor col in features:\r\n    df[col] = df[label] + 0.01 * np.random.rand( N )\r\n\r\nclf = BaggingClassifier(base_estimator = KNeighborsClassifier(), n_estimators = 10, oob_score = True, random_state = randState )\r\nclf.fit(df[features], df[label])\r\nprint clf.oob_score_\r\n```\r\n\r\nHere, clf.oob_score_=0.0.\r\n\r\nNow, you would not expect that the OOB accuracy is a function of the class labels...\r\n\r\n```python\r\ndf.loc[ df[label] == -1 , label ] = 2\r\nclf = BaggingClassifier(base_estimator = KNeighborsClassifier(), n_estimators = 10, oob_score = True, random_state = randState )\r\nclf.fit(df[features], df[label])\r\nprint clf.oob_score_\r\n```\r\n\r\nNow, clf.oob_score_=1.0.\r\n\r\nClearly, OOB score should not be a function of the labels arbitrarily chosen for the classes.\r\n\r\nsklearn.__version__: '0.18.1'\r\nnumpy.__version__: '1.11.3'",
        "issue_id": 8933,
        "pr_number": 8936,
        "pr_title": "[MRG+1] fixed OOB_Score bug for bagging classifiers.",
        "pr_body": "Fixes #8933\r\n\r\n<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#Contributing-Pull-Requests\r\n-->\r\n#### Reference Issue\r\n<!-- Example: Fixes #1234 -->\r\n\r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\n\r\n#### Any other comments?\r\n\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
        "issue_closed_at": "2017-06-08T09:35:49Z",
        "base_commit": "9131f89e6c165fb27dadd37d3168c1ee5ea84f5a"
      },
      "summary": "### Summary:\nThis issue pertains to a bug in the `BaggingClassifier` of the `scikit-learn` library, specifically regarding the calculation of the out-of-bag (OOB) score. The problem arises when the OOB score of a bagged classifier is influenced by the arbitrary labels assigned to the classes, which should not be the case. The expected behavior is that the OOB score remains consistent regardless of the numeric values assigned to class labels; however, in this scenario, simply altering the class labels resulted in a significant change in the OOB score. This inconsistent behavior indicates a flaw in the handling of class labels within the OOB score calculation mechanism.\n\nKey symptoms and behaviors observed include:\n- The OOB score being 0.0 when class labels are initially set.\n- The OOB score inexplicably changing to 1.0 after reassigning class labels, despite no modification to the feature data or model parameters.\n\nThe affected component is the `BaggingClassifier` within the `sklearn.ensemble` module of the `scikit-learn` library, particularly focused on the function responsible for setting the OOB score, `BaggingRegressor._set_oob_score`.\n\nThe potential impact of this bug is significant for users relying on OOB scores for model evaluation, as it undermines the reliability and interpretability of model performance metrics. If unaddressed, it could lead to incorrect assessments of model accuracy and misinform decision-making processes.\n\nRelevant technical details abstracted for broader understanding include the fact that the bug is present in `scikit-learn` version '0.18.1', and it manifests when using the `BaggingClassifier` with a `KNeighborsClassifier` as the base estimator. The issue is fixed in the `bagging.py` file of the `sklearn.ensemble` module, specifically in the `BaggingRegressor._set_oob_score` function.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: BUG: BaggingClassifier.oob_score_ should not change with class label\n\nBody:\nLet us compute the oob score of a bagged classifier.\r\n\r\n```python\r\nimport numpy as np\r\nimport pandas as pd\r\nfrom sklearn.ensemble import BaggingClassifier\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\n\r\nN = 50\r\nrandState = 5\r\nlabel = 'Label'\r\nfeatures = ['A','B','C']\r\n\r\nlabels = np.random.randint(3, size = N) - 1\r\ndf = pd.DataFrame( labels , index=range(N), columns=[label] )\r\nfor col in features:\r\n    df[col] = df[label] + 0.01 * np.random.rand( N )\r\n\r\nclf = BaggingClassifier(base_estimator = KNeighborsClassifier(), n_estimators = 10, oob_score = True, random_state = randState )\r\nclf.fit(df[features], df[label])\r\nprint clf.oob_score_\r\n```\r\n\r\nHere, clf.oob_score_=0.0.\r\n\r\nNow, you would not expect that the OOB accuracy is a function of the class labels...\r\n\r\n```python\r\ndf.loc[ df[label] == -1 , label ] = 2\r\nclf = BaggingClassifier(base_estimator = KNeighborsClassifier(), n_estimators = 10, oob_score = True, random_state = randState )\r\nclf.fit(df[features], df[label])\r\nprint clf.oob_score_\r\n```\r\n\r\nNow, clf.oob_score_=1.0.\r\n\r\nClearly, OOB score should not be a function of the labels arbitrarily chosen for the classes.\r\n\r\nsklearn.__version__: '0.18.1'\r\nnumpy.__version__: '1.11.3'\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nsklearn/ensemble/bagging.py\n  function: BaggingRegressor._set_oob_score\n"
    },
    {
      "similar_issue": {
        "issue_title": "Better error message for invalid metric in NearestNeighbors ",
        "issue_body": "<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n<!-- Example: Joblib Error thrown when calling fit on LatentDirichletAllocation with evaluate_every > 0-->\r\nError message for invalid metric in NearestNeighbors is unclear.\r\n\r\n#### Steps/Code to Reproduce\r\n<!--\r\nExample:\r\n```python\r\nfrom sklearn.feature_extraction.text import CountVectorizer\r\nfrom sklearn.decomposition import LatentDirichletAllocation\r\n\r\ndocs = [\"Help I have a bug\" for i in range(1000)]\r\n\r\nvectorizer = CountVectorizer(input=docs, analyzer='word')\r\nlda_features = vectorizer.fit_transform(docs)\r\n\r\nlda_model = LatentDirichletAllocation(\r\n    n_topics=10,\r\n    learning_method='online',\r\n    evaluate_every=10,\r\n    n_jobs=4,\r\n)\r\nmodel = lda_model.fit(lda_features)\r\n```\r\nIf the code is too long, feel free to put it in a public gist and link\r\nit in the issue: https://gist.github.com\r\n-->\r\n```python\r\nNearestNeighbors(metric='cheybshev')\r\n```\r\n\r\n#### Expected Results\r\n<!-- Example: No error is thrown. Please paste or describe the expected results.-->\r\nError message stating that metric should be 'cityblock', ... or callable rather than metric not valid for algorithm 'auto'. When I initially saw the error message, I did not realize I had a typo in the metric string. I thought it has something to do with the algorithm.\r\n\r\n#### Actual Results\r\n<!-- Please paste or specifically describe the actual output or traceback. -->\r\n```\r\nValueError: Metric 'cheybshev' not valid for algorithm 'auto'\r\n```\r\n\r\n#### Versions\r\n<!--\r\nPlease run the following snippet and paste the output below.\r\nimport platform; print(platform.platform())\r\nimport sys; print(\"Python\", sys.version)\r\nimport numpy; print(\"NumPy\", numpy.__version__)\r\nimport scipy; print(\"SciPy\", scipy.__version__)\r\nimport sklearn; print(\"Scikit-Learn\", sklearn.__version__)\r\n-->\r\nLinux-4.15.0-24-generic-x86_64-with-debian-stretch-sid\r\nPython 3.6.3 |Anaconda custom (64-bit)| (default, Nov  9 2017, 00:19:18) \r\n[GCC 7.2.0]\r\nNumPy 1.13.3\r\nSciPy 0.19.1\r\nScikit-Learn 0.19.1\r\n\r\n<!-- Thanks for contributing! -->\r\n",
        "issue_id": 11906,
        "pr_number": 11914,
        "pr_title": "[MRG] ENH Better error message for metrics of neighbors",
        "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist\r\n-->\r\n\r\n#### Reference Issues/PRs\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\n#Fixes #11906 \r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\nAdded expression to error message to get list of valid metrics.\r\n\r\n<!--\r\n#### Any other comments?\r\n\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
        "issue_closed_at": "2018-09-13T15:34:02Z",
        "base_commit": "7ed61a24feb4ffde0bee9342acf4a58e3f946a61"
      },
      "summary": "### Summary:\n\nThis issue is centered around the clarity and effectiveness of error messages in the Scikit-Learn library, specifically related to the NearestNeighbors class. The problem arises when an invalid metric is specified, leading to a misleading error message. The current message indicates that the metric is not valid for the algorithm 'auto', which can confuse users into thinking the issue is related to the algorithm selection, rather than a simple typographical error in the metric name.\n\n1. **Problem description in general terms**: The issue involves enhancing the clarity of error messages within a software library to improve user understanding and debugging efficiency. In this case, the error message does not clearly indicate that the problem is due to an invalid metric name, potentially leading to user confusion and misdiagnosis of the issue.\n\n2. **Key symptoms and behaviors observed**: Users receive a ValueError message stating that a specified metric is not valid for the algorithm 'auto'. This message does not explicitly point out the root cause, which is an incorrect metric name caused by a typo, leading users to misinterpret the problem as being related to algorithm compatibility.\n\n3. **Affected components or systems**: The components affected are part of the Scikit-Learn library, specifically the NearestNeighbors class and its associated error handling mechanism within the sklearn.neighbors module.\n\n4. **Potential impact or severity**: The severity is moderate as it primarily affects user experience and efficiency in debugging. While it does not cause a system crash or incorrect results, it can lead to unnecessary confusion and time spent troubleshooting the wrong aspect of the issue.\n\n5. **Relevant technical details abstracted for broader understanding**: The issue highlights the importance of precise and informative error messages in software libraries, especially those used widely in data science and machine learning. Improving error messages can significantly enhance usability and reduce the learning curve for users, fostering a more intuitive interaction with the library's features. The fix involves modifying the function responsible for checking algorithm-metric compatibility to provide a clearer indication of the specific problem with the metric input.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: Better error message for invalid metric in NearestNeighbors \n\nBody:\n<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n<!-- Example: Joblib Error thrown when calling fit on LatentDirichletAllocation with evaluate_every > 0-->\r\nError message for invalid metric in NearestNeighbors is unclear.\r\n\r\n#### Steps/Code to Reproduce\r\n<!--\r\nExample:\r\n```python\r\nfrom sklearn.feature_extraction.text import CountVectorizer\r\nfrom sklearn.decomposition import LatentDirichletAllocation\r\n\r\ndocs = [\"Help I have a bug\" for i in range(1000)]\r\n\r\nvectorizer = CountVectorizer(input=docs, analyzer='word')\r\nlda_features = vectorizer.fit_transform(docs)\r\n\r\nlda_model = LatentDirichletAllocation(\r\n    n_topics=10,\r\n    learning_method='online',\r\n    evaluate_every=10,\r\n    n_jobs=4,\r\n)\r\nmodel = lda_model.fit(lda_features)\r\n```\r\nIf the code is too long, feel free to put it in a public gist and link\r\nit in the issue: https://gist.github.com\r\n-->\r\n```python\r\nNearestNeighbors(metric='cheybshev')\r\n```\r\n\r\n#### Expected Results\r\n<!-- Example: No error is thrown. Please paste or describe the expected results.-->\r\nError message stating that metric should be 'cityblock', ... or callable rather than metric not valid for algorithm 'auto'. When I initially saw the error message, I did not realize I had a typo in the metric string. I thought it has something to do with the algorithm.\r\n\r\n#### Actual Results\r\n<!-- Please paste or specifically describe the actual output or traceback. -->\r\n```\r\nValueError: Metric 'cheybshev' not valid for algorithm 'auto'\r\n```\r\n\r\n#### Versions\r\n<!--\r\nPlease run the following snippet and paste the output below.\r\nimport platform; print(platform.platform())\r\nimport sys; print(\"Python\", sys.version)\r\nimport numpy; print(\"NumPy\", numpy.__version__)\r\nimport scipy; print(\"SciPy\", scipy.__version__)\r\nimport sklearn; print(\"Scikit-Learn\", sklearn.__version__)\r\n-->\r\nLinux-4.15.0-24-generic-x86_64-with-debian-stretch-sid\r\nPython 3.6.3 |Anaconda custom (64-bit)| (default, Nov  9 2017, 00:19:18) \r\n[GCC 7.2.0]\r\nNumPy 1.13.3\r\nSciPy 0.19.1\r\nScikit-Learn 0.19.1\r\n\r\n<!-- Thanks for contributing! -->\r\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nsklearn/neighbors/__init__.py\n  line: line 14\n  line: line 28\n\nsklearn/neighbors/base.py\n  function: NeighborsBase._check_algorithm_metric\n"
    }
  ]
}