{
  "Selected_candidate": {
    "pr_number": 7069,
    "pr_title": "DummyClassifier and DummyRegressor raise NotFittedError",
    "pr_body": "<!--\nThanks for contributing a pull request! Please ensure you have taken a look at\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#Contributing-Pull-Requests\n-->\n#### Reference Issue\n\nFixes #7065\n\n<!-- Example: Fixes #1234 -->\n#### What does this implement/fix? Explain your changes.\n\nDummyClassifier and DummyRegressor raise NotFittedError\n#### Any other comments?\n\n<!--\nPlease be aware that we are a loose team of volunteers so patience is\nnecessary; assistance handling other issues is very welcome. We value\nall user contributions, no matter how minor they are. If we are slow to\nreview, either the pull request needs some benchmarking, tinkering,\nconvincing, etc. or more likely the reviewers are simply busy. In either\ncase, we ask for your understanding during the review process.\nFor more information, see our FAQ on this topic:\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\n\nThanks for contributing!\n-->\n",
    "issue_id": 7065,
    "issue_title": "DummyRegressor raises ValueError instead of NotFittedError",
    "issue_body": "#### Description\n\ntrying to call predict on an instance of DummyRegressor that has not been fitted raises ValueError. I think it should be NotFittedError.\n#### Steps/Code to Reproduce\n\n```\n>>>from sklearn.dummy import DummyRegressor\n>>>clf = DummyRegressor()\n>>>clf.predict(np.zeros((10,10)))\n```\n#### Expected Results\n\nNotFittedError\n#### Actual Results\n\nValueError\n\n<!--\nIf your issue is a usage question, submit it here instead:\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\n-->\n\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\n\n<!-- Example: Joblib Error thrown when calling fit on LatentDirichletAllocation with evaluate_every > 0-->\n\n<!--\nExample:\n```\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom sklearn.decomposition import LatentDirichletAllocation\n\ndocs = [\"Help I have a bug\" for i in range(1000)]\n\nvectorizer = CountVectorizer(input=docs, analyzer='word')\nlda_features = vectorizer.fit_transform(docs)\n\nlda_model = LatentDirichletAllocation(\n    n_topics=10,\n    learning_method='online',\n    evaluate_every=10,\n    n_jobs=4,\n)\nmodel = lda_model.fit(lda_features)\n```\nIf the code is too long, feel free to put it in a public gist and link\nit in the issue: https://gist.github.com\n-->\n#### Versions\n\n<!--\nPlease run the following snippet and paste the output below.\nimport platform; print(platform.platform())\nimport sys; print(\"Python\", sys.version)\nimport numpy; print(\"NumPy\", numpy.__version__)\nimport scipy; print(\"SciPy\", scipy.__version__)\nimport sklearn; print(\"Scikit-Learn\", sklearn.__version__)\n-->\n\nLinux-3.19.0-47-generic-x86_64-with-Ubuntu-14.04-trusty\n('Python', 
'2.7.6 (default, Jun 22 2015, 17:58:13) \\n[GCC 4.8.2]')\n('NumPy', '1.11.0')\n('SciPy', '0.16.1')\n('Scikit-Learn', '0.17')\n\n<!-- Thanks for contributing! -->\n",
    "issue_closed_at": "2016-07-25T07:53:18Z",
    "base_commit": "7a7e8091c73abc59de4bb71f577b020cd2572c38",
    "changes": [
      {
        "file": "sklearn/dummy.py",
        "type": "line",
        "name": "line 12",
        "code": "from .utils import check_random_state\nfrom .utils.validation import check_array\nfrom .utils.validation import check_consistent_length\nfrom .utils.random import random_choice_csc\nfrom .utils.stats import _weighted_percentile\nfrom .utils.multiclass import class_distribution"
      },
      {
        "file": "sklearn/dummy.py",
        "type": "function",
        "name": "predict",
        "class_name": "DummyRegressor",
        "code": "def predict(self, X):\n        \"\"\"\n        Perform classification on test vectors X.\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix}, shape = [n_samples, n_features]\n            Input vectors, where n_samples is the number of samples\n            and n_features is the number of features.\n\n        Returns\n        -------\n        y : array, shape = [n_samples]  or [n_samples, n_outputs]\n            Predicted target values for X.\n        \"\"\"\n        if not hasattr(self, \"constant_\"):\n            raise ValueError(\"DummyRegressor not fitted.\")\n\n        X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])\n        n_samples = X.shape[0]\n\n        y = np.ones((n_samples, 1)) * self.constant_\n\n        if self.n_outputs_ == 1 and not self.output_2d_:\n            y = np.ravel(y)\n\n        return y"
      },
      {
        "file": "sklearn/dummy.py",
        "type": "function",
        "name": "predict_proba",
        "class_name": "DummyClassifier",
        "code": "def predict_proba(self, X):\n        \"\"\"\n        Return probability estimates for the test vectors X.\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix}, shape = [n_samples, n_features]\n            Input vectors, where n_samples is the number of samples\n            and n_features is the number of features.\n\n        Returns\n        -------\n        P : array-like or list of array-lke of shape = [n_samples, n_classes]\n            Returns the probability of the sample for each class in\n            the model, where classes are ordered arithmetically, for each\n            output.\n        \"\"\"\n        if not hasattr(self, \"classes_\"):\n            raise ValueError(\"DummyClassifier not fitted.\")\n\n        X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])\n        # numpy random_state expects Python int and not long as size argument\n        # under Windows\n        n_samples = int(X.shape[0])\n        rs = check_random_state(self.random_state)\n\n        n_classes_ = self.n_classes_\n        classes_ = self.classes_\n        class_prior_ = self.class_prior_\n        constant = self.constant\n        if self.n_outputs_ == 1 and not self.output_2d_:\n            # Get same type even for self.n_outputs_ == 1\n            n_classes_ = [n_classes_]\n            classes_ = [classes_]\n            class_prior_ = [class_prior_]\n            constant = [constant]\n\n        P = []\n        for k in range(self.n_outputs_):\n            if self.strategy == \"most_frequent\":\n                ind = class_prior_[k].argmax()\n                out = np.zeros((n_samples, n_classes_[k]), dtype=np.float64)\n                out[:, ind] = 1.0\n            elif self.strategy == \"prior\":\n                out = np.ones((n_samples, 1)) * class_prior_[k]\n\n            elif self.strategy == \"stratified\":\n                out = rs.multinomial(1, class_prior_[k], size=n_samples)\n\n            elif self.strategy == 
\"uniform\":\n                out = np.ones((n_samples, n_classes_[k]), dtype=np.float64)\n                out /= n_classes_[k]\n\n            elif self.strategy == \"constant\":\n                ind = np.where(classes_[k] == constant[k])\n                out = np.zeros((n_samples, n_classes_[k]), dtype=np.float64)\n                out[:, ind] = 1.0\n\n            P.append(out)\n\n        if self.n_outputs_ == 1 and not self.output_2d_:\n            P = P[0]\n\n        return P"
      }
    ]
  },
  "Justification": "Candidate B is relevant because it also concerns the relationship between fitting and prediction in a scikit-learn estimator, here DummyRegressor. In Candidate B, calling `predict` on an unfitted model raises the wrong exception type; in the CURRENT report, `fit_predict` and `predict` on GaussianMixture disagree when `n_init` is greater than 1. The implementations differ, but both reports turn on which fitted state the estimator holds when a prediction method runs, so Candidate B's reasoning about fitted state can help guide debugging of the CURRENT bug. Candidate B's fix also corrects how prediction methods check fitted state, which bears on the discrepancy described in the CURRENT report.",
  "instance_id": "scikit-learn__scikit-learn-13142",
  "repo": "scikit-learn/scikit-learn",
  "created_at": "2019-02-12T14:32:37Z",
  "problem_statement": "GaussianMixture predict and fit_predict disagree when n_init>1\n#### Description\r\nWhen `n_init` is specified in GaussianMixture, the results of fit_predict(X) and predict(X) are often different.  The `test_gaussian_mixture_fit_predict` unit test doesn't catch this because it does not set `n_init`.\r\n\r\n#### Steps/Code to Reproduce\r\n```\r\npython\r\nfrom sklearn.mixture import GaussianMixture\r\nfrom sklearn.utils.testing import assert_array_equal\r\nimport numpy\r\nX = numpy.random.randn(1000,5)\r\nprint 'no n_init'\r\ngm = GaussianMixture(n_components=5)\r\nc1 = gm.fit_predict(X)\r\nc2 = gm.predict(X)\r\nassert_array_equal(c1,c2)\r\nprint 'n_init=5'\r\ngm = GaussianMixture(n_components=5, n_init=5)\r\nc1 = gm.fit_predict(X)\r\nc2 = gm.predict(X)\r\nassert_array_equal(c1,c2)\r\n```\r\n\r\n#### Expected Results\r\n```\r\nno n_init\r\nn_init=5\r\n```\r\nNo exceptions.\r\n\r\n#### Actual Results\r\n```\r\nno n_init\r\nn_init=5\r\nTraceback (most recent call last):\r\n  File \"test_gm.py\", line 17, in <module>\r\n    assert_array_equal(c1,c2)\r\n  File \"/home/scott/.local/lib/python2.7/site-packages/numpy/testing/_private/utils.py\", line 872, in assert_array_equal\r\n    verbose=verbose, header='Arrays are not equal')\r\n  File \"/home/scott/.local/lib/python2.7/site-packages/numpy/testing/_private/utils.py\", line 796, in assert_array_compare\r\n    raise AssertionError(msg)\r\nAssertionError: \r\nArrays are not equal\r\n\r\n(mismatch 88.6%)\r\n x: array([4, 0, 1, 1, 1, 3, 3, 4, 4, 2, 0, 0, 1, 2, 0, 2, 0, 1, 3, 1, 1, 3,\r\n       2, 1, 0, 2, 1, 0, 2, 0, 3, 1, 2, 3, 3, 1, 0, 2, 2, 0, 3, 0, 2, 0,\r\n       4, 2, 3, 0, 4, 2, 4, 1, 0, 2, 2, 1, 3, 2, 1, 4, 0, 2, 2, 1, 1, 2,...\r\n y: array([4, 1, 0, 2, 2, 1, 1, 4, 4, 0, 4, 1, 0, 3, 1, 0, 2, 2, 1, 2, 0, 0,\r\n       1, 0, 4, 1, 0, 4, 0, 1, 1, 2, 3, 1, 4, 0, 1, 4, 4, 4, 0, 1, 0, 2,\r\n       4, 1, 1, 2, 4, 3, 4, 0, 2, 3, 2, 3, 0, 0, 2, 3, 3, 3, 3, 0, 3, 2,...\r\n```\r\n\r\n#### 
Versions\r\n```\r\nSystem:\r\n    python: 2.7.15rc1 (default, Nov 12 2018, 14:31:15)  [GCC 7.3.0]\r\n   machine: Linux-4.15.0-43-generic-x86_64-with-Ubuntu-18.04-bionic\r\nexecutable: /usr/bin/python\r\n\r\nBLAS:\r\n    macros: HAVE_CBLAS=None, NO_ATLAS_INFO=-1\r\ncblas_libs: cblas\r\n  lib_dirs: /usr/lib/x86_64-linux-gnu\r\n\r\nPython deps:\r\n    Cython: 0.28.5\r\n     scipy: 1.2.0\r\nsetuptools: 39.0.1\r\n       pip: 19.0.1\r\n     numpy: 1.16.0\r\n    pandas: 0.23.1\r\n   sklearn: 0.20.2\r\n```\n",
  "patch": "diff --git a/sklearn/mixture/base.py b/sklearn/mixture/base.py\n--- a/sklearn/mixture/base.py\n+++ b/sklearn/mixture/base.py\n@@ -257,11 +257,6 @@ def fit_predict(self, X, y=None):\n                 best_params = self._get_parameters()\n                 best_n_iter = n_iter\n \n-        # Always do a final e-step to guarantee that the labels returned by\n-        # fit_predict(X) are always consistent with fit(X).predict(X)\n-        # for any value of max_iter and tol (and any random_state).\n-        _, log_resp = self._e_step(X)\n-\n         if not self.converged_:\n             warnings.warn('Initialization %d did not converge. '\n                           'Try different init parameters, '\n@@ -273,6 +268,11 @@ def fit_predict(self, X, y=None):\n         self.n_iter_ = best_n_iter\n         self.lower_bound_ = max_lower_bound\n \n+        # Always do a final e-step to guarantee that the labels returned by\n+        # fit_predict(X) are always consistent with fit(X).predict(X)\n+        # for any value of max_iter and tol (and any random_state).\n+        _, log_resp = self._e_step(X)\n+\n         return log_resp.argmax(axis=1)\n \n     def _e_step(self, X):\n"
}