{
  "instance_id": "scikit-learn__scikit-learn-13142",
  "repo": "scikit-learn/scikit-learn",
  "created_at": "2019-02-12T14:32:37Z",
  "problem_statement": "GaussianMixture predict and fit_predict disagree when n_init>1\n#### Description\r\nWhen `n_init` is specified in GaussianMixture, the results of fit_predict(X) and predict(X) are often different.  The `test_gaussian_mixture_fit_predict` unit test doesn't catch this because it does not set `n_init`.\r\n\r\n#### Steps/Code to Reproduce\r\n```\r\npython\r\nfrom sklearn.mixture import GaussianMixture\r\nfrom sklearn.utils.testing import assert_array_equal\r\nimport numpy\r\nX = numpy.random.randn(1000,5)\r\nprint 'no n_init'\r\ngm = GaussianMixture(n_components=5)\r\nc1 = gm.fit_predict(X)\r\nc2 = gm.predict(X)\r\nassert_array_equal(c1,c2)\r\nprint 'n_init=5'\r\ngm = GaussianMixture(n_components=5, n_init=5)\r\nc1 = gm.fit_predict(X)\r\nc2 = gm.predict(X)\r\nassert_array_equal(c1,c2)\r\n```\r\n\r\n#### Expected Results\r\n```\r\nno n_init\r\nn_init=5\r\n```\r\nNo exceptions.\r\n\r\n#### Actual Results\r\n```\r\nno n_init\r\nn_init=5\r\nTraceback (most recent call last):\r\n  File \"test_gm.py\", line 17, in <module>\r\n    assert_array_equal(c1,c2)\r\n  File \"/home/scott/.local/lib/python2.7/site-packages/numpy/testing/_private/utils.py\", line 872, in assert_array_equal\r\n    verbose=verbose, header='Arrays are not equal')\r\n  File \"/home/scott/.local/lib/python2.7/site-packages/numpy/testing/_private/utils.py\", line 796, in assert_array_compare\r\n    raise AssertionError(msg)\r\nAssertionError: \r\nArrays are not equal\r\n\r\n(mismatch 88.6%)\r\n x: array([4, 0, 1, 1, 1, 3, 3, 4, 4, 2, 0, 0, 1, 2, 0, 2, 0, 1, 3, 1, 1, 3,\r\n       2, 1, 0, 2, 1, 0, 2, 0, 3, 1, 2, 3, 3, 1, 0, 2, 2, 0, 3, 0, 2, 0,\r\n       4, 2, 3, 0, 4, 2, 4, 1, 0, 2, 2, 1, 3, 2, 1, 4, 0, 2, 2, 1, 1, 2,...\r\n y: array([4, 1, 0, 2, 2, 1, 1, 4, 4, 0, 4, 1, 0, 3, 1, 0, 2, 2, 1, 2, 0, 0,\r\n       1, 0, 4, 1, 0, 4, 0, 1, 1, 2, 3, 1, 4, 0, 1, 4, 4, 4, 0, 1, 0, 2,\r\n       4, 1, 1, 2, 4, 3, 4, 0, 2, 3, 2, 3, 0, 0, 2, 3, 3, 3, 3, 0, 3, 2,...\r\n```\r\n\r\n#### Versions\r\n```\r\nSystem:\r\n    python: 2.7.15rc1 (default, Nov 12 2018, 14:31:15)  [GCC 7.3.0]\r\n   machine: Linux-4.15.0-43-generic-x86_64-with-Ubuntu-18.04-bionic\r\nexecutable: /usr/bin/python\r\n\r\nBLAS:\r\n    macros: HAVE_CBLAS=None, NO_ATLAS_INFO=-1\r\ncblas_libs: cblas\r\n  lib_dirs: /usr/lib/x86_64-linux-gnu\r\n\r\nPython deps:\r\n    Cython: 0.28.5\r\n     scipy: 1.2.0\r\nsetuptools: 39.0.1\r\n       pip: 19.0.1\r\n     numpy: 1.16.0\r\n    pandas: 0.23.1\r\n   sklearn: 0.20.2\r\n```\n",
  "patch": "diff --git a/sklearn/mixture/base.py b/sklearn/mixture/base.py\n--- a/sklearn/mixture/base.py\n+++ b/sklearn/mixture/base.py\n@@ -257,11 +257,6 @@ def fit_predict(self, X, y=None):\n                 best_params = self._get_parameters()\n                 best_n_iter = n_iter\n \n-        # Always do a final e-step to guarantee that the labels returned by\n-        # fit_predict(X) are always consistent with fit(X).predict(X)\n-        # for any value of max_iter and tol (and any random_state).\n-        _, log_resp = self._e_step(X)\n-\n         if not self.converged_:\n             warnings.warn('Initialization %d did not converge. '\n                           'Try different init parameters, '\n@@ -273,6 +268,11 @@ def fit_predict(self, X, y=None):\n         self.n_iter_ = best_n_iter\n         self.lower_bound_ = max_lower_bound\n \n+        # Always do a final e-step to guarantee that the labels returned by\n+        # fit_predict(X) are always consistent with fit(X).predict(X)\n+        # for any value of max_iter and tol (and any random_state).\n+        _, log_resp = self._e_step(X)\n+\n         return log_resp.argmax(axis=1)\n \n     def _e_step(self, X):\n",
  "similar_bug_items": [
    {
      "pr_number": 12279,
      "pr_title": "[MRG+1] Add check_is_fitted to non standard functions",
      "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist\r\n-->\r\n\r\n#### Reference Issues/PRs\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\nFixes #12276 \r\n\r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\nAdding `check_is_fitted` method to other non standard functions\r\n\r\n#### Any other comments?\r\n\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
      "issue_id": 12276,
      "issue_title": "Calling NearestNeighbors.{kneighbors,radius_neighbors}_graph without first fitting should raise NotFittedError",
      "issue_body": "<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n\r\nRunning prediction methods without running `fit` should raise `NotFittedError`. This is not the case for some non-standard methods.\r\n\r\n`check_is_fitted` should be applied for this purpose.\r\n\r\n#### Steps/Code to Reproduce\r\n\r\n```py\r\nfrom sklearn.neighbors import NearestNeighbors\r\nNearestNeighbors().kneighbors_graph([[1]])\r\n```\r\nor\r\n```py\r\nfrom sklearn.neighbors import NearestNeighbors\r\nNearestNeighbors().radius_neighbors_graph([[1]])\r\n```\r\n\r\n\r\n#### Expected Results\r\n\r\nNotFittedError raised\r\n\r\n#### Actual Results\r\nAttributeError raised\r\n\r\n#### Versions\r\n\r\n```\r\nSystem\r\n------\r\n    python: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:52:12)  [GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]\r\nexecutable: /Users/joel/anaconda3/envs/scipy3k/bin/python\r\n   machine: Darwin-17.7.0-x86_64-i386-64bit\r\n\r\nBLAS\r\n----\r\n  lib_dirs: /Users/joel/anaconda3/envs/scipy3k/lib\r\n    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None\r\ncblas_libs: mkl_rt, pthread\r\n\r\nPython deps\r\n-----------\r\nsetuptools: 37.0.0\r\n     scipy: 1.0.0\r\n       pip: 18.0\r\n     numpy: 1.14.1\r\n   sklearn: 0.21.dev0\r\n    pandas: 0.23.4\r\n    Cython: 0.28.5\r\n```",
      "issue_closed_at": "2018-10-19T13:46:09Z",
      "base_commit": "74b56dbc57d9295df8fb653adccb265da356b670",
      "changes": [
        {
          "file": "sklearn/neighbors/base.py",
          "type": "function",
          "name": "kneighbors_graph",
          "class_name": "KNeighborsMixin",
          "code": "def kneighbors_graph(self, X=None, n_neighbors=None,\n                         mode='connectivity'):\n        \"\"\"Computes the (weighted) graph of k-Neighbors for points in X\n\n        Parameters\n        ----------\n        X : array-like, shape (n_query, n_features), \\\n                or (n_query, n_indexed) if metric == 'precomputed'\n            The query point or points.\n            If not provided, neighbors of each indexed point are returned.\n            In this case, the query point is not considered its own neighbor.\n\n        n_neighbors : int\n            Number of neighbors for each sample.\n            (default is value passed to the constructor).\n\n        mode : {'connectivity', 'distance'}, optional\n            Type of returned matrix: 'connectivity' will return the\n            connectivity matrix with ones and zeros, in 'distance' the\n            edges are Euclidean distance between points.\n\n        Returns\n        -------\n        A : sparse matrix in CSR format, shape = [n_samples, n_samples_fit]\n            n_samples_fit is the number of samples in the fitted data\n            A[i, j] is assigned the weight of edge that connects i to j.\n\n        Examples\n        --------\n        >>> X = [[0], [3], [1]]\n        >>> from sklearn.neighbors import NearestNeighbors\n        >>> neigh = NearestNeighbors(n_neighbors=2)\n        >>> neigh.fit(X) # doctest: +ELLIPSIS\n        NearestNeighbors(algorithm='auto', leaf_size=30, ...)\n        >>> A = neigh.kneighbors_graph(X)\n        >>> A.toarray()\n        array([[1., 0., 1.],\n               [0., 1., 1.],\n               [1., 0., 1.]])\n\n        See also\n        --------\n        NearestNeighbors.radius_neighbors_graph\n        \"\"\"\n        if n_neighbors is None:\n            n_neighbors = self.n_neighbors\n\n        # kneighbors does the None handling.\n        if X is not None:\n            X = check_array(X, accept_sparse='csr')\n            n_samples1 = X.shape[0]\n        else:\n            n_samples1 = self._fit_X.shape[0]\n\n        n_samples2 = self._fit_X.shape[0]\n        n_nonzero = n_samples1 * n_neighbors\n        A_indptr = np.arange(0, n_nonzero + 1, n_neighbors)\n\n        # construct CSR matrix representation of the k-NN graph\n        if mode == 'connectivity':\n            A_data = np.ones(n_samples1 * n_neighbors)\n            A_ind = self.kneighbors(X, n_neighbors, return_distance=False)\n\n        elif mode == 'distance':\n            A_data, A_ind = self.kneighbors(\n                X, n_neighbors, return_distance=True)\n            A_data = np.ravel(A_data)\n\n        else:\n            raise ValueError(\n                'Unsupported mode, must be one of \"connectivity\" '\n                'or \"distance\" but got \"%s\" instead' % mode)\n\n        kneighbors_graph = csr_matrix((A_data, A_ind.ravel(), A_indptr),\n                                      shape=(n_samples1, n_samples2))\n\n        return kneighbors_graph"
        },
        {
          "file": "sklearn/neighbors/base.py",
          "type": "function",
          "name": "radius_neighbors_graph",
          "class_name": "RadiusNeighborsMixin",
          "code": "def radius_neighbors_graph(self, X=None, radius=None, mode='connectivity'):\n        \"\"\"Computes the (weighted) graph of Neighbors for points in X\n\n        Neighborhoods are restricted the points at a distance lower than\n        radius.\n\n        Parameters\n        ----------\n        X : array-like, shape = [n_samples, n_features], optional\n            The query point or points.\n            If not provided, neighbors of each indexed point are returned.\n            In this case, the query point is not considered its own neighbor.\n\n        radius : float\n            Radius of neighborhoods.\n            (default is the value passed to the constructor).\n\n        mode : {'connectivity', 'distance'}, optional\n            Type of returned matrix: 'connectivity' will return the\n            connectivity matrix with ones and zeros, in 'distance' the\n            edges are Euclidean distance between points.\n\n        Returns\n        -------\n        A : sparse matrix in CSR format, shape = [n_samples, n_samples]\n            A[i, j] is assigned the weight of edge that connects i to j.\n\n        Examples\n        --------\n        >>> X = [[0], [3], [1]]\n        >>> from sklearn.neighbors import NearestNeighbors\n        >>> neigh = NearestNeighbors(radius=1.5)\n        >>> neigh.fit(X) # doctest: +ELLIPSIS\n        NearestNeighbors(algorithm='auto', leaf_size=30, ...)\n        >>> A = neigh.radius_neighbors_graph(X)\n        >>> A.toarray()\n        array([[1., 0., 1.],\n               [0., 1., 0.],\n               [1., 0., 1.]])\n\n        See also\n        --------\n        kneighbors_graph\n        \"\"\"\n        if X is not None:\n            X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])\n\n        n_samples2 = self._fit_X.shape[0]\n        if radius is None:\n            radius = self.radius\n\n        # construct CSR matrix representation of the NN graph\n        if mode == 'connectivity':\n            A_ind = self.radius_neighbors(X, radius,\n                                          return_distance=False)\n            A_data = None\n        elif mode == 'distance':\n            dist, A_ind = self.radius_neighbors(X, radius,\n                                                return_distance=True)\n            A_data = np.concatenate(list(dist))\n        else:\n            raise ValueError(\n                'Unsupported mode, must be one of \"connectivity\", '\n                'or \"distance\" but got %s instead' % mode)\n\n        n_samples1 = A_ind.shape[0]\n        n_neighbors = np.array([len(a) for a in A_ind])\n        A_ind = np.concatenate(list(A_ind))\n        if A_data is None:\n            A_data = np.ones(len(A_ind))\n        A_indptr = np.concatenate((np.zeros(1, dtype=int),\n                                   np.cumsum(n_neighbors)))\n\n        return csr_matrix((A_data, A_ind, A_indptr),\n                          shape=(n_samples1, n_samples2))"
        }
      ]
    },
    {
      "pr_number": 7069,
      "pr_title": "DummyClassifier and DummyRegressor raise NotFittedError",
      "pr_body": "<!--\nThanks for contributing a pull request! Please ensure you have taken a look at\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#Contributing-Pull-Requests\n-->\n#### Reference Issue\n\nFixes #7065\n\n<!-- Example: Fixes #1234 -->\n#### What does this implement/fix? Explain your changes.\n\nDummyClassifier and DummyRegressor raise NotFittedError\n#### Any other comments?\n\n<!--\nPlease be aware that we are a loose team of volunteers so patience is\nnecessary; assistance handling other issues is very welcome. We value\nall user contributions, no matter how minor they are. If we are slow to\nreview, either the pull request needs some benchmarking, tinkering,\nconvincing, etc. or more likely the reviewers are simply busy. In either\ncase, we ask for your understanding during the review process.\nFor more information, see our FAQ on this topic:\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\n\nThanks for contributing!\n-->\n",
      "issue_id": 7065,
      "issue_title": "DummyRegressor raises ValueError instead of NotFittedError",
      "issue_body": "#### Description\n\ntrying to call predict on an instance of DummyRegressor that has not been fitted raises ValueError. I think it should be NotFittedError.\n#### Steps/Code to Reproduce\n\n```\n>>>from sklearn.dummy import DummyRegressor\n>>>clf = DummyRegressor()\n>>>clf.predict(np.zeros((10,10)))\n```\n#### Expected Results\n\nNotFittedError\n#### Actual Results\n\nValueError\n\n<!--\nIf your issue is a usage question, submit it here instead:\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\n-->\n\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\n\n<!-- Example: Joblib Error thrown when calling fit on LatentDirichletAllocation with evaluate_every > 0-->\n\n<!--\nExample:\n```\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom sklearn.decomposition import LatentDirichletAllocation\n\ndocs = [\"Help I have a bug\" for i in range(1000)]\n\nvectorizer = CountVectorizer(input=docs, analyzer='word')\nlda_features = vectorizer.fit_transform(docs)\n\nlda_model = LatentDirichletAllocation(\n    n_topics=10,\n    learning_method='online',\n    evaluate_every=10,\n    n_jobs=4,\n)\nmodel = lda_model.fit(lda_features)\n```\nIf the code is too long, feel free to put it in a public gist and link\nit in the issue: https://gist.github.com\n-->\n#### Versions\n\n<!--\nPlease run the following snippet and paste the output below.\nimport platform; print(platform.platform())\nimport sys; print(\"Python\", sys.version)\nimport numpy; print(\"NumPy\", numpy.__version__)\nimport scipy; print(\"SciPy\", scipy.__version__)\nimport sklearn; print(\"Scikit-Learn\", sklearn.__version__)\n-->\n\nLinux-3.19.0-47-generic-x86_64-with-Ubuntu-14.04-trusty\n('Python', '2.7.6 (default, Jun 22 2015, 17:58:13) \\n[GCC 4.8.2]')\n('NumPy', '1.11.0')\n('SciPy', '0.16.1')\n('Scikit-Learn', '0.17')\n\n<!-- Thanks for contributing! -->\n",
      "issue_closed_at": "2016-07-25T07:53:18Z",
      "base_commit": "7a7e8091c73abc59de4bb71f577b020cd2572c38",
      "changes": [
        {
          "file": "sklearn/dummy.py",
          "type": "line",
          "name": "line 12",
          "code": "from .utils import check_random_state\nfrom .utils.validation import check_array\nfrom .utils.validation import check_consistent_length\nfrom .utils.random import random_choice_csc\nfrom .utils.stats import _weighted_percentile\nfrom .utils.multiclass import class_distribution"
        },
        {
          "file": "sklearn/dummy.py",
          "type": "function",
          "name": "predict",
          "class_name": "DummyRegressor",
          "code": "def predict(self, X):\n        \"\"\"\n        Perform classification on test vectors X.\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix}, shape = [n_samples, n_features]\n            Input vectors, where n_samples is the number of samples\n            and n_features is the number of features.\n\n        Returns\n        -------\n        y : array, shape = [n_samples]  or [n_samples, n_outputs]\n            Predicted target values for X.\n        \"\"\"\n        if not hasattr(self, \"constant_\"):\n            raise ValueError(\"DummyRegressor not fitted.\")\n\n        X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])\n        n_samples = X.shape[0]\n\n        y = np.ones((n_samples, 1)) * self.constant_\n\n        if self.n_outputs_ == 1 and not self.output_2d_:\n            y = np.ravel(y)\n\n        return y"
        },
        {
          "file": "sklearn/dummy.py",
          "type": "function",
          "name": "predict_proba",
          "class_name": "DummyClassifier",
          "code": "def predict_proba(self, X):\n        \"\"\"\n        Return probability estimates for the test vectors X.\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix}, shape = [n_samples, n_features]\n            Input vectors, where n_samples is the number of samples\n            and n_features is the number of features.\n\n        Returns\n        -------\n        P : array-like or list of array-lke of shape = [n_samples, n_classes]\n            Returns the probability of the sample for each class in\n            the model, where classes are ordered arithmetically, for each\n            output.\n        \"\"\"\n        if not hasattr(self, \"classes_\"):\n            raise ValueError(\"DummyClassifier not fitted.\")\n\n        X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])\n        # numpy random_state expects Python int and not long as size argument\n        # under Windows\n        n_samples = int(X.shape[0])\n        rs = check_random_state(self.random_state)\n\n        n_classes_ = self.n_classes_\n        classes_ = self.classes_\n        class_prior_ = self.class_prior_\n        constant = self.constant\n        if self.n_outputs_ == 1 and not self.output_2d_:\n            # Get same type even for self.n_outputs_ == 1\n            n_classes_ = [n_classes_]\n            classes_ = [classes_]\n            class_prior_ = [class_prior_]\n            constant = [constant]\n\n        P = []\n        for k in range(self.n_outputs_):\n            if self.strategy == \"most_frequent\":\n                ind = class_prior_[k].argmax()\n                out = np.zeros((n_samples, n_classes_[k]), dtype=np.float64)\n                out[:, ind] = 1.0\n            elif self.strategy == \"prior\":\n                out = np.ones((n_samples, 1)) * class_prior_[k]\n\n            elif self.strategy == \"stratified\":\n                out = rs.multinomial(1, class_prior_[k], size=n_samples)\n\n            elif self.strategy == \"uniform\":\n                out = np.ones((n_samples, n_classes_[k]), dtype=np.float64)\n                out /= n_classes_[k]\n\n            elif self.strategy == \"constant\":\n                ind = np.where(classes_[k] == constant[k])\n                out = np.zeros((n_samples, n_classes_[k]), dtype=np.float64)\n                out[:, ind] = 1.0\n\n            P.append(out)\n\n        if self.n_outputs_ == 1 and not self.output_2d_:\n            P = P[0]\n\n        return P"
        },
        {
          "file": "sklearn/dummy.py",
          "type": "function",
          "name": "predict",
          "class_name": "DummyRegressor",
          "code": "def predict(self, X):\n        \"\"\"\n        Perform classification on test vectors X.\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix}, shape = [n_samples, n_features]\n            Input vectors, where n_samples is the number of samples\n            and n_features is the number of features.\n\n        Returns\n        -------\n        y : array, shape = [n_samples]  or [n_samples, n_outputs]\n            Predicted target values for X.\n        \"\"\"\n        if not hasattr(self, \"constant_\"):\n            raise ValueError(\"DummyRegressor not fitted.\")\n\n        X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])\n        n_samples = X.shape[0]\n\n        y = np.ones((n_samples, 1)) * self.constant_\n\n        if self.n_outputs_ == 1 and not self.output_2d_:\n            y = np.ravel(y)\n\n        return y"
        }
      ]
    },
    {
      "pr_number": 7632,
      "pr_title": "[MRG+1] Correcting length of explained_variance_ratio_, eigen solver, final PR",
      "pr_body": "<!--\nThanks for contributing a pull request! Please ensure you have taken a look at\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#Contributing-Pull-Requests\n-->\n#### Reference Issue\n\n<!-- Example: Fixes #1234 -->\n\nFix #6032 \n#### What does this implement/fix? Explain your changes.\n\nAttribute explained_variance_ratio_ from LinearDiscriminantAnalysis class will be of length n_components (eigen solver).\n#### Any other comments?\n\nThis PR follows PR 7616. I mixed up my git history, so it was easier to open a new PR.\n\n<!--\nPlease be aware that we are a loose team of volunteers so patience is\nnecessary; assistance handling other issues is very welcome. We value\nall user contributions, no matter how minor they are. If we are slow to\nreview, either the pull request needs some benchmarking, tinkering,\nconvincing, etc. or more likely the reviewers are simply busy. In either\ncase, we ask for your understanding during the review process.\nFor more information, see our FAQ on this topic:\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\n\nThanks for contributing!\n-->\n",
      "issue_id": 6032,
      "issue_title": "LDA.explained_variance_ratio_ is of the wrong size",
      "issue_body": "The docs say that <a href=\"http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis\">LDA.explained_variance_ratio_</a> should have only `n_components_`. But it doesn't.\n\nIt looks like this bug only exists when we use the `eigen` solver, not the `svd` solver.\n\n```\n>>> import numpy as np\n>>> from sklearn.lda import LDA\n>>> from sklearn.utils.testing import assert_equal\n>>>\n>>> state = np.random.RandomState(0)\n>>> X = state.normal(loc=0, scale=100, size=(40, 20))\n>>> y = state.randint(0, 3, size=(40, 1))\n>>>\n>>> # Train the LDA classifier. Use the eigen solver\n>>> lda_eigen = LDA(solver='eigen', n_components=5)\n>>> lda_eigen.fit(X, y)\n>>> assert_equal(lda_eigen.explained_variance_ratio_.shape, (5,))\nAssertionError: Tuples differ: (20,) != (5,)\n\nFirst differing element 0:\n20\n5\n\n- (20,)\n+ (5,)\n```\n\nLooks like we fix either the docs or the code. Which one?\n\nPinging @JPFrancoia.\n\nAddresses an issue in #6031.\n",
      "issue_closed_at": "2016-10-25T12:52:13Z",
      "base_commit": "ee3e61754bd4bb10cea8065993e462fc7b112cb3",
      "changes": [
        {
          "file": "sklearn/discriminant_analysis.py",
          "type": "function",
          "name": "_solve_lsqr",
          "class_name": "LinearDiscriminantAnalysis",
          "code": "def _solve_lsqr(self, X, y, shrinkage):\n        \"\"\"Least squares solver.\n\n        The least squares solver computes a straightforward solution of the\n        optimal decision rule based directly on the discriminant functions. It\n        can only be used for classification (with optional shrinkage), because\n        estimation of eigenvectors is not performed. Therefore, dimensionality\n        reduction with the transform is not supported.\n\n        Parameters\n        ----------\n        X : array-like, shape (n_samples, n_features)\n            Training data.\n\n        y : array-like, shape (n_samples,) or (n_samples, n_classes)\n            Target values.\n\n        shrinkage : string or float, optional\n            Shrinkage parameter, possible values:\n              - None: no shrinkage (default).\n              - 'auto': automatic shrinkage using the Ledoit-Wolf lemma.\n              - float between 0 and 1: fixed shrinkage parameter.\n\n        Notes\n        -----\n        This solver is based on [1]_, section 2.6.2, pp. 39-41.\n\n        References\n        ----------\n        .. [1] R. O. Duda, P. E. Hart, D. G. Stork. Pattern Classification\n           (Second Edition). John Wiley & Sons, Inc., New York, 2001. ISBN\n           0-471-05669-3.\n        \"\"\"\n        self.means_ = _class_means(X, y)\n        self.covariance_ = _class_cov(X, y, self.priors_, shrinkage)\n        self.coef_ = linalg.lstsq(self.covariance_, self.means_.T)[0].T\n        self.intercept_ = (-0.5 * np.diag(np.dot(self.means_, self.coef_.T))\n                           + np.log(self.priors_))"
        },
        {
          "file": "sklearn/discriminant_analysis.py",
          "type": "function",
          "name": "_solve_svd",
          "class_name": "LinearDiscriminantAnalysis",
          "code": "def _solve_svd(self, X, y):\n        \"\"\"SVD solver.\n\n        Parameters\n        ----------\n        X : array-like, shape (n_samples, n_features)\n            Training data.\n\n        y : array-like, shape (n_samples,) or (n_samples, n_targets)\n            Target values.\n        \"\"\"\n        n_samples, n_features = X.shape\n        n_classes = len(self.classes_)\n\n        self.means_ = _class_means(X, y)\n        if self.store_covariance:\n            self.covariance_ = _class_cov(X, y, self.priors_)\n\n        Xc = []\n        for idx, group in enumerate(self.classes_):\n            Xg = X[y == group, :]\n            Xc.append(Xg - self.means_[idx])\n\n        self.xbar_ = np.dot(self.priors_, self.means_)\n\n        Xc = np.concatenate(Xc, axis=0)\n\n        # 1) within (univariate) scaling by with classes std-dev\n        std = Xc.std(axis=0)\n        # avoid division by zero in normalization\n        std[std == 0] = 1.\n        fac = 1. / (n_samples - n_classes)\n\n        # 2) Within variance scaling\n        X = np.sqrt(fac) * (Xc / std)\n        # SVD of centered (within)scaled data\n        U, S, V = linalg.svd(X, full_matrices=False)\n\n        rank = np.sum(S > self.tol)\n        if rank < n_features:\n            warnings.warn(\"Variables are collinear.\")\n        # Scaling of within covariance is: V' 1/S\n        scalings = (V[:rank] / std).T / S[:rank]\n\n        # 3) Between variance scaling\n        # Scale weighted centers\n        X = np.dot(((np.sqrt((n_samples * self.priors_) * fac)) *\n                    (self.means_ - self.xbar_).T).T, scalings)\n        # Centers are living in a space with n_classes-1 dim (maximum)\n        # Use SVD to find projection in the space spanned by the\n        # (n_classes) centers\n        _, S, V = linalg.svd(X, full_matrices=0)\n\n        self.explained_variance_ratio_ = (S**2 / np.sum(\n                S**2))[:self.n_components]\n        rank = np.sum(S > self.tol * S[0])\n        self.scalings_ = np.dot(scalings, V.T[:, :rank])\n        coef = np.dot(self.means_ - self.xbar_, self.scalings_)\n        self.intercept_ = (-0.5 * np.sum(coef ** 2, axis=1)\n                           + np.log(self.priors_))\n        self.coef_ = np.dot(coef, self.scalings_.T)\n        self.intercept_ -= np.dot(self.xbar_, self.coef_.T)"
        },
        {
          "file": "sklearn/discriminant_analysis.py",
          "type": "function",
          "name": "fit",
          "class_name": "QuadraticDiscriminantAnalysis",
          "code": "def fit(self, X, y, store_covariances=None, tol=None):\n        \"\"\"Fit the model according to the given training data and parameters.\n\n            .. versionchanged:: 0.17\n               Deprecated *store_covariance* have been moved to main constructor.\n\n            .. versionchanged:: 0.17\n               Deprecated *tol* have been moved to main constructor.\n\n        Parameters\n        ----------\n        X : array-like, shape = [n_samples, n_features]\n            Training vector, where n_samples in the number of samples and\n            n_features is the number of features.\n\n        y : array, shape = [n_samples]\n            Target values (integers)\n        \"\"\"\n        if store_covariances:\n            warnings.warn(\"The parameter 'store_covariances' is deprecated as \"\n                          \"of version 0.17 and will be removed in 0.19. The \"\n                          \"parameter is no longer necessary because the value \"\n                          \"is set via the estimator initialisation or \"\n                          \"set_params method.\", DeprecationWarning)\n            self.store_covariances = store_covariances\n        if tol:\n            warnings.warn(\"The parameter 'tol' is deprecated as of version \"\n                          \"0.17 and will be removed in 0.19. The parameter is \"\n                          \"no longer necessary because the value is set via \"\n                          \"the estimator initialisation or set_params method.\",\n                          DeprecationWarning)\n            self.tol = tol\n        X, y = check_X_y(X, y)\n        check_classification_targets(y)\n        self.classes_, y = np.unique(y, return_inverse=True)\n        n_samples, n_features = X.shape\n        n_classes = len(self.classes_)\n        if n_classes < 2:\n            raise ValueError('y has less than 2 classes')\n        if self.priors is None:\n            self.priors_ = bincount(y) / float(n_samples)\n        else:\n            self.priors_ = self.priors\n\n        cov = None\n        if self.store_covariances:\n            cov = []\n        means = []\n        scalings = []\n        rotations = []\n        for ind in xrange(n_classes):\n            Xg = X[y == ind, :]\n            meang = Xg.mean(0)\n            means.append(meang)\n            if len(Xg) == 1:\n                raise ValueError('y has only 1 sample in class %s, covariance '\n                                 'is ill defined.' % str(self.classes_[ind]))\n            Xgc = Xg - meang\n            # Xgc = U * S * V.T\n            U, S, Vt = np.linalg.svd(Xgc, full_matrices=False)\n            rank = np.sum(S > self.tol)\n            if rank < n_features:\n                warnings.warn(\"Variables are collinear\")\n            S2 = (S ** 2) / (len(Xg) - 1)\n            S2 = ((1 - self.reg_param) * S2) + self.reg_param\n            if self.store_covariances:\n                # cov = V * (S^2 / (n-1)) * V.T\n                cov.append(np.dot(S2 * Vt.T, Vt))\n            scalings.append(S2)\n            rotations.append(Vt.T)\n        if self.store_covariances:\n            self.covariances_ = cov\n        self.means_ = np.asarray(means)\n        self.scalings_ = scalings\n        self.rotations_ = rotations\n        return self"
        },
        {
          "file": "sklearn/discriminant_analysis.py",
          "type": "function",
          "name": "transform",
          "class_name": "LinearDiscriminantAnalysis",
          "code": "def transform(self, X):\n        \"\"\"Project data to maximize class separation.\n\n        Parameters\n        ----------\n        X : array-like, shape (n_samples, n_features)\n            Input data.\n\n        Returns\n        -------\n        X_new : array, shape (n_samples, n_components)\n            Transformed data.\n        \"\"\"\n        if self.solver == 'lsqr':\n            raise NotImplementedError(\"transform not implemented for 'lsqr' \"\n                                      \"solver (use 'svd' or 'eigen').\")\n        check_is_fitted(self, ['xbar_', 'scalings_'], all_or_any=any)\n\n        X = check_array(X)\n        if self.solver == 'svd':\n            X_new = np.dot(X - self.xbar_, self.scalings_)\n        elif self.solver == 'eigen':\n            X_new = np.dot(X, self.scalings_)\n        n_components = X.shape[1] if self.n_components is None \\\n            else self.n_components\n        return X_new[:, :n_components]"
        }
      ]
    },
    {
      "pr_number": 8936,
      "pr_title": "[MRG+1] fixed OOB_Score bug for bagging classifiers.",
      "pr_body": "Fixes #8933\r\n\r\n<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#Contributing-Pull-Requests\r\n-->\r\n#### Reference Issue\r\n<!-- Example: Fixes #1234 -->\r\n\r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\n\r\n#### Any other comments?\r\n\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
      "issue_id": 8933,
      "issue_title": "BUG: BaggingClassifier.oob_score_ should not change with class label",
      "issue_body": "Let us compute the oob score of a bagged classifier.\r\n\r\n```python\r\nimport numpy as np\r\nimport pandas as pd\r\nfrom sklearn.ensemble import BaggingClassifier\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\n\r\nN = 50\r\nrandState = 5\r\nlabel = 'Label'\r\nfeatures = ['A','B','C']\r\n\r\nlabels = np.random.randint(3, size = N) - 1\r\ndf = pd.DataFrame( labels , index=range(N), columns=[label] )\r\nfor col in features:\r\n    df[col] = df[label] + 0.01 * np.random.rand( N )\r\n\r\nclf = BaggingClassifier(base_estimator = KNeighborsClassifier(), n_estimators = 10, oob_score = True, random_state = randState )\r\nclf.fit(df[features], df[label])\r\nprint clf.oob_score_\r\n```\r\n\r\nHere, clf.oob_score_=0.0.\r\n\r\nNow, you would not expect that the OOB accuracy is a function of the class labels...\r\n\r\n```python\r\ndf.loc[ df[label] == -1 , label ] = 2\r\nclf = BaggingClassifier(base_estimator = KNeighborsClassifier(), n_estimators = 10, oob_score = True, random_state = randState )\r\nclf.fit(df[features], df[label])\r\nprint clf.oob_score_\r\n```\r\n\r\nNow, clf.oob_score_=1.0.\r\n\r\nClearly, OOB score should not be a function of the labels arbitrarily chosen for the classes.\r\n\r\nsklearn.__version__: '0.18.1'\r\nnumpy.__version__: '1.11.3'",
      "issue_closed_at": "2017-06-08T09:35:49Z",
      "base_commit": "9131f89e6c165fb27dadd37d3168c1ee5ea84f5a",
      "changes": [
        {
          "file": "sklearn/ensemble/bagging.py",
          "type": "function",
          "name": "_set_oob_score",
          "class_name": "BaggingRegressor",
          "code": "def _set_oob_score(self, X, y):\n        n_samples = y.shape[0]\n\n        predictions = np.zeros((n_samples,))\n        n_predictions = np.zeros((n_samples,))\n\n        for estimator, samples, features in zip(self.estimators_,\n                                                self.estimators_samples_,\n                                                self.estimators_features_):\n            # Create mask for OOB samples\n            mask = ~samples\n\n            predictions[mask] += estimator.predict((X[mask, :])[:, features])\n            n_predictions[mask] += 1\n\n        if (n_predictions == 0).any():\n            warn(\"Some inputs do not have OOB scores. \"\n                 \"This probably means too few estimators were used \"\n                 \"to compute any reliable oob estimates.\")\n            n_predictions[n_predictions == 0] = 1\n\n        predictions /= n_predictions\n\n        self.oob_prediction_ = predictions\n        self.oob_score_ = r2_score(y, predictions)"
        }
      ]
    },
    {
      "pr_number": 11914,
      "pr_title": "[MRG] ENH Better error message for metrics of neighbors",
      "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist\r\n-->\r\n\r\n#### Reference Issues/PRs\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\n#Fixes #11906 \r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\nAdded expression to error message to get list of valid metrics.\r\n\r\n<!--\r\n#### Any other comments?\r\n\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
      "issue_id": 11906,
      "issue_title": "Better error message for invalid metric in NearestNeighbors ",
      "issue_body": "<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n<!-- Example: Joblib Error thrown when calling fit on LatentDirichletAllocation with evaluate_every > 0-->\r\nError message for invalid metric in NearestNeighbors is unclear.\r\n\r\n#### Steps/Code to Reproduce\r\n<!--\r\nExample:\r\n```python\r\nfrom sklearn.feature_extraction.text import CountVectorizer\r\nfrom sklearn.decomposition import LatentDirichletAllocation\r\n\r\ndocs = [\"Help I have a bug\" for i in range(1000)]\r\n\r\nvectorizer = CountVectorizer(input=docs, analyzer='word')\r\nlda_features = vectorizer.fit_transform(docs)\r\n\r\nlda_model = LatentDirichletAllocation(\r\n    n_topics=10,\r\n    learning_method='online',\r\n    evaluate_every=10,\r\n    n_jobs=4,\r\n)\r\nmodel = lda_model.fit(lda_features)\r\n```\r\nIf the code is too long, feel free to put it in a public gist and link\r\nit in the issue: https://gist.github.com\r\n-->\r\n```python\r\nNearestNeighbors(metric='cheybshev')\r\n```\r\n\r\n#### Expected Results\r\n<!-- Example: No error is thrown. Please paste or describe the expected results.-->\r\nError message stating that metric should be 'cityblock', ... or callable rather than metric not valid for algorithm 'auto'. When I initially saw the error message, I did not realize I had a typo in the metric string. I thought it has something to do with the algorithm.\r\n\r\n#### Actual Results\r\n<!-- Please paste or specifically describe the actual output or traceback. -->\r\n```\r\nValueError: Metric 'cheybshev' not valid for algorithm 'auto'\r\n```\r\n\r\n#### Versions\r\n<!--\r\nPlease run the following snippet and paste the output below.\r\nimport platform; print(platform.platform())\r\nimport sys; print(\"Python\", sys.version)\r\nimport numpy; print(\"NumPy\", numpy.__version__)\r\nimport scipy; print(\"SciPy\", scipy.__version__)\r\nimport sklearn; print(\"Scikit-Learn\", sklearn.__version__)\r\n-->\r\nLinux-4.15.0-24-generic-x86_64-with-debian-stretch-sid\r\nPython 3.6.3 |Anaconda custom (64-bit)| (default, Nov  9 2017, 00:19:18) \r\n[GCC 7.2.0]\r\nNumPy 1.13.3\r\nSciPy 0.19.1\r\nScikit-Learn 0.19.1\r\n\r\n<!-- Thanks for contributing! -->\r\n",
      "issue_closed_at": "2018-09-13T15:34:02Z",
      "base_commit": "7ed61a24feb4ffde0bee9342acf4a58e3f946a61",
      "changes": [
        {
          "file": "sklearn/neighbors/__init__.py",
          "type": "line",
          "name": "line 14",
          "code": "from .kde import KernelDensity\nfrom .approximate import LSHForest\nfrom .lof import LocalOutlierFactor\n\n__all__ = ['BallTree',\n           'DistanceMetric',"
        },
        {
          "file": "sklearn/neighbors/__init__.py",
          "type": "line",
          "name": "line 28",
          "code": "           'radius_neighbors_graph',\n           'KernelDensity',\n           'LSHForest',\n           'LocalOutlierFactor']"
        },
        {
          "file": "sklearn/neighbors/base.py",
          "type": "function",
          "name": "_check_algorithm_metric",
          "class_name": "NeighborsBase",
          "code": "def _check_algorithm_metric(self):\n        if self.algorithm not in ['auto', 'brute',\n                                  'kd_tree', 'ball_tree']:\n            raise ValueError(\"unrecognized algorithm: '%s'\" % self.algorithm)\n\n        if self.algorithm == 'auto':\n            if self.metric == 'precomputed':\n                alg_check = 'brute'\n            elif (callable(self.metric) or\n                  self.metric in VALID_METRICS['ball_tree']):\n                alg_check = 'ball_tree'\n            else:\n                alg_check = 'brute'\n        else:\n            alg_check = self.algorithm\n\n        if callable(self.metric):\n            if self.algorithm == 'kd_tree':\n                # callable metric is only valid for brute force and ball_tree\n                raise ValueError(\n                    \"kd_tree algorithm does not support callable metric '%s'\"\n                    % self.metric)\n        elif self.metric not in VALID_METRICS[alg_check]:\n            raise ValueError(\"Metric '%s' not valid for algorithm '%s'\"\n                             % (self.metric, self.algorithm))\n\n        if self.metric_params is not None and 'p' in self.metric_params:\n            warnings.warn(\"Parameter p is found in metric_params. \"\n                          \"The corresponding parameter from __init__ \"\n                          \"is ignored.\", SyntaxWarning, stacklevel=3)\n            effective_p = self.metric_params['p']\n        else:\n            effective_p = self.p\n\n        if self.metric in ['wminkowski', 'minkowski'] and effective_p < 1:\n            raise ValueError(\"p must be greater than one for minkowski metric\")"
        }
      ]
    }
  ]
}