{
  "instance_id": "scikit-learn__scikit-learn-14092",
  "repo": "scikit-learn/scikit-learn",
  "created_at": "2019-06-14T14:16:17Z",
  "problem_statement": "NCA fails in GridSearch due to too strict parameter checks\nNCA checks its parameters to have a specific type, which can easily fail in a GridSearch due to how param grid is made.\r\n\r\nHere is an example:\r\n```python\r\nimport numpy as np\r\n\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.model_selection import GridSearchCV\r\nfrom sklearn.neighbors import NeighborhoodComponentsAnalysis\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\n\r\nX = np.random.random_sample((100, 10))\r\ny = np.random.randint(2, size=100)\r\n\r\nnca = NeighborhoodComponentsAnalysis()\r\nknn = KNeighborsClassifier()\r\n\r\npipe = Pipeline([('nca', nca),\r\n                 ('knn', knn)])\r\n                \r\nparams = {'nca__tol': [0.1, 0.5, 1],\r\n          'nca__n_components': np.arange(1, 10)}\r\n          \r\ngs = GridSearchCV(estimator=pipe, param_grid=params, error_score='raise')\r\ngs.fit(X,y)\r\n```\r\n\r\nThe issue is that for `tol`: 1 is not a float, and for  `n_components`: np.int64 is not int\r\n\r\nBefore proposing a fix for this specific situation, I'd like to have your general opinion about parameter checking.  \r\nI like this idea of common parameter checking tool introduced with the NCA PR. What do you think about extending it across the code-base (or at least for new or recent estimators) ?\r\n\r\nCurrently parameter checking is not always done or often partially done, and is quite redundant. 
For instance, here is the input validation of lda:\r\n```python\r\ndef _check_params(self):\r\n        \"\"\"Check model parameters.\"\"\"\r\n        if self.n_components <= 0:\r\n            raise ValueError(\"Invalid 'n_components' parameter: %r\"\r\n                             % self.n_components)\r\n\r\n        if self.total_samples <= 0:\r\n            raise ValueError(\"Invalid 'total_samples' parameter: %r\"\r\n                             % self.total_samples)\r\n\r\n        if self.learning_offset < 0:\r\n            raise ValueError(\"Invalid 'learning_offset' parameter: %r\"\r\n                             % self.learning_offset)\r\n\r\n        if self.learning_method not in (\"batch\", \"online\"):\r\n            raise ValueError(\"Invalid 'learning_method' parameter: %r\"\r\n                             % self.learning_method)\r\n```\r\nMost params aren't checked, and for those that are, there's a lot of duplicated code.\r\n\r\nI propose upgrading the new tool so that it can check open/closed intervals (currently only closed) and list membership.\r\n\r\nThe API would be something like this:\r\n```\r\ncheck_param(param, name, valid_options)\r\n```\r\nwhere valid_options would be a dict of `type: constraint`. E.g. for the `beta_loss` param of `NMF`, it can be either a float or a string in a list, which would give\r\n```\r\nvalid_options = {numbers.Real: None,  # None for no constraint\r\n                 str: ['frobenius', 'kullback-leibler', 'itakura-saito']}\r\n```\r\nSometimes a parameter can only be positive or within a given interval, e.g. `l1_ratio` of `LogisticRegression` must be between 0 and 1, which would give\r\n```\r\nvalid_options = {numbers.Real: Interval(0, 1, closed='both')}\r\n```\r\nPositivity of e.g. `max_iter` would be `numbers.Integral: Interval(left=1)`.\n",
  "patch": "diff --git a/sklearn/neighbors/nca.py b/sklearn/neighbors/nca.py\n--- a/sklearn/neighbors/nca.py\n+++ b/sklearn/neighbors/nca.py\n@@ -13,6 +13,7 @@\n import numpy as np\n import sys\n import time\n+import numbers\n from scipy.optimize import minimize\n from ..utils.extmath import softmax\n from ..metrics import pairwise_distances\n@@ -299,7 +300,8 @@ def _validate_params(self, X, y):\n \n         # Check the preferred dimensionality of the projected space\n         if self.n_components is not None:\n-            check_scalar(self.n_components, 'n_components', int, 1)\n+            check_scalar(\n+                self.n_components, 'n_components', numbers.Integral, 1)\n \n             if self.n_components > X.shape[1]:\n                 raise ValueError('The preferred dimensionality of the '\n@@ -318,9 +320,9 @@ def _validate_params(self, X, y):\n                                  .format(X.shape[1],\n                                          self.components_.shape[1]))\n \n-        check_scalar(self.max_iter, 'max_iter', int, 1)\n-        check_scalar(self.tol, 'tol', float, 0.)\n-        check_scalar(self.verbose, 'verbose', int, 0)\n+        check_scalar(self.max_iter, 'max_iter', numbers.Integral, 1)\n+        check_scalar(self.tol, 'tol', numbers.Real, 0.)\n+        check_scalar(self.verbose, 'verbose', numbers.Integral, 0)\n \n         if self.callback is not None:\n             if not callable(self.callback):\n",
  "similar_bug_items": [
    {
      "pr_number": 11914,
      "pr_title": "[MRG] ENH Better error message for metrics of neighbors",
      "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist\r\n-->\r\n\r\n#### Reference Issues/PRs\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\n#Fixes #11906 \r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\nAdded expression to error message to get list of valid metrics.\r\n\r\n<!--\r\n#### Any other comments?\r\n\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
      "issue_id": 11906,
      "issue_title": "Better error message for invalid metric in NearestNeighbors ",
      "issue_body": "<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n<!-- Example: Joblib Error thrown when calling fit on LatentDirichletAllocation with evaluate_every > 0-->\r\nError message for invalid metric in NearestNeighbors is unclear.\r\n\r\n#### Steps/Code to Reproduce\r\n<!--\r\nExample:\r\n```python\r\nfrom sklearn.feature_extraction.text import CountVectorizer\r\nfrom sklearn.decomposition import LatentDirichletAllocation\r\n\r\ndocs = [\"Help I have a bug\" for i in range(1000)]\r\n\r\nvectorizer = CountVectorizer(input=docs, analyzer='word')\r\nlda_features = vectorizer.fit_transform(docs)\r\n\r\nlda_model = LatentDirichletAllocation(\r\n    n_topics=10,\r\n    learning_method='online',\r\n    evaluate_every=10,\r\n    n_jobs=4,\r\n)\r\nmodel = lda_model.fit(lda_features)\r\n```\r\nIf the code is too long, feel free to put it in a public gist and link\r\nit in the issue: https://gist.github.com\r\n-->\r\n```python\r\nNearestNeighbors(metric='cheybshev')\r\n```\r\n\r\n#### Expected Results\r\n<!-- Example: No error is thrown. Please paste or describe the expected results.-->\r\nError message stating that metric should be 'cityblock', ... or callable rather than metric not valid for algorithm 'auto'. When I initially saw the error message, I did not realize I had a typo in the metric string. I thought it has something to do with the algorithm.\r\n\r\n#### Actual Results\r\n<!-- Please paste or specifically describe the actual output or traceback. 
-->\r\n```\r\nValueError: Metric 'cheybshev' not valid for algorithm 'auto'\r\n```\r\n\r\n#### Versions\r\n<!--\r\nPlease run the following snippet and paste the output below.\r\nimport platform; print(platform.platform())\r\nimport sys; print(\"Python\", sys.version)\r\nimport numpy; print(\"NumPy\", numpy.__version__)\r\nimport scipy; print(\"SciPy\", scipy.__version__)\r\nimport sklearn; print(\"Scikit-Learn\", sklearn.__version__)\r\n-->\r\nLinux-4.15.0-24-generic-x86_64-with-debian-stretch-sid\r\nPython 3.6.3 |Anaconda custom (64-bit)| (default, Nov  9 2017, 00:19:18) \r\n[GCC 7.2.0]\r\nNumPy 1.13.3\r\nSciPy 0.19.1\r\nScikit-Learn 0.19.1\r\n\r\n<!-- Thanks for contributing! -->\r\n",
      "issue_closed_at": "2018-09-13T15:34:02Z",
      "base_commit": "7ed61a24feb4ffde0bee9342acf4a58e3f946a61",
      "changes": [
        {
          "file": "sklearn/neighbors/__init__.py",
          "type": "line",
          "name": "line 14",
          "code": "from .kde import KernelDensity\nfrom .approximate import LSHForest\nfrom .lof import LocalOutlierFactor\n\n__all__ = ['BallTree',\n           'DistanceMetric',"
        },
        {
          "file": "sklearn/neighbors/__init__.py",
          "type": "line",
          "name": "line 28",
          "code": "           'radius_neighbors_graph',\n           'KernelDensity',\n           'LSHForest',\n           'LocalOutlierFactor']"
        },
        {
          "file": "sklearn/neighbors/base.py",
          "type": "function",
          "name": "_check_algorithm_metric",
          "class_name": "NeighborsBase",
          "code": "def _check_algorithm_metric(self):\n        if self.algorithm not in ['auto', 'brute',\n                                  'kd_tree', 'ball_tree']:\n            raise ValueError(\"unrecognized algorithm: '%s'\" % self.algorithm)\n\n        if self.algorithm == 'auto':\n            if self.metric == 'precomputed':\n                alg_check = 'brute'\n            elif (callable(self.metric) or\n                  self.metric in VALID_METRICS['ball_tree']):\n                alg_check = 'ball_tree'\n            else:\n                alg_check = 'brute'\n        else:\n            alg_check = self.algorithm\n\n        if callable(self.metric):\n            if self.algorithm == 'kd_tree':\n                # callable metric is only valid for brute force and ball_tree\n                raise ValueError(\n                    \"kd_tree algorithm does not support callable metric '%s'\"\n                    % self.metric)\n        elif self.metric not in VALID_METRICS[alg_check]:\n            raise ValueError(\"Metric '%s' not valid for algorithm '%s'\"\n                             % (self.metric, self.algorithm))\n\n        if self.metric_params is not None and 'p' in self.metric_params:\n            warnings.warn(\"Parameter p is found in metric_params. \"\n                          \"The corresponding parameter from __init__ \"\n                          \"is ignored.\", SyntaxWarning, stacklevel=3)\n            effective_p = self.metric_params['p']\n        else:\n            effective_p = self.p\n\n        if self.metric in ['wminkowski', 'minkowski'] and effective_p < 1:\n            raise ValueError(\"p must be greater than one for minkowski metric\")"
        }
      ]
    },
    {
      "pr_number": 12279,
      "pr_title": "[MRG+1] Add check_is_fitted to non standard functions",
      "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist\r\n-->\r\n\r\n#### Reference Issues/PRs\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\nFixes #12276 \r\n\r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\nAdding `check_is_fitted` method to other non standard functions\r\n\r\n#### Any other comments?\r\n\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
      "issue_id": 12276,
      "issue_title": "Calling NearestNeighbors.{kneighbors,radius_neighbors}_graph without first fitting should raise NotFittedError",
      "issue_body": "<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n\r\nRunning prediction methods without running `fit` should raise `NotFittedError`. This is not the case for some non-standard methods.\r\n\r\n`check_is_fitted` should be applied for this purpose.\r\n\r\n#### Steps/Code to Reproduce\r\n\r\n```py\r\nfrom sklearn.neighbors import NearestNeighbors\r\nNearestNeighbors().kneighbors_graph([[1]])\r\n```\r\nor\r\n```py\r\nfrom sklearn.neighbors import NearestNeighbors\r\nNearestNeighbors().radius_neighbors_graph([[1]])\r\n```\r\n\r\n\r\n#### Expected Results\r\n\r\nNotFittedError raised\r\n\r\n#### Actual Results\r\nAttributeError raised\r\n\r\n#### Versions\r\n\r\n```\r\nSystem\r\n------\r\n    python: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:52:12)  [GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]\r\nexecutable: /Users/joel/anaconda3/envs/scipy3k/bin/python\r\n   machine: Darwin-17.7.0-x86_64-i386-64bit\r\n\r\nBLAS\r\n----\r\n  lib_dirs: /Users/joel/anaconda3/envs/scipy3k/lib\r\n    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None\r\ncblas_libs: mkl_rt, pthread\r\n\r\nPython deps\r\n-----------\r\nsetuptools: 37.0.0\r\n     scipy: 1.0.0\r\n       pip: 18.0\r\n     numpy: 1.14.1\r\n   sklearn: 0.21.dev0\r\n    pandas: 0.23.4\r\n    Cython: 0.28.5\r\n```",
      "issue_closed_at": "2018-10-19T13:46:09Z",
      "base_commit": "74b56dbc57d9295df8fb653adccb265da356b670",
      "changes": [
        {
          "file": "sklearn/neighbors/base.py",
          "type": "function",
          "name": "kneighbors_graph",
          "class_name": "KNeighborsMixin",
          "code": "def kneighbors_graph(self, X=None, n_neighbors=None,\n                         mode='connectivity'):\n        \"\"\"Computes the (weighted) graph of k-Neighbors for points in X\n\n        Parameters\n        ----------\n        X : array-like, shape (n_query, n_features), \\\n                or (n_query, n_indexed) if metric == 'precomputed'\n            The query point or points.\n            If not provided, neighbors of each indexed point are returned.\n            In this case, the query point is not considered its own neighbor.\n\n        n_neighbors : int\n            Number of neighbors for each sample.\n            (default is value passed to the constructor).\n\n        mode : {'connectivity', 'distance'}, optional\n            Type of returned matrix: 'connectivity' will return the\n            connectivity matrix with ones and zeros, in 'distance' the\n            edges are Euclidean distance between points.\n\n        Returns\n        -------\n        A : sparse matrix in CSR format, shape = [n_samples, n_samples_fit]\n            n_samples_fit is the number of samples in the fitted data\n            A[i, j] is assigned the weight of edge that connects i to j.\n\n        Examples\n        --------\n        >>> X = [[0], [3], [1]]\n        >>> from sklearn.neighbors import NearestNeighbors\n        >>> neigh = NearestNeighbors(n_neighbors=2)\n        >>> neigh.fit(X) # doctest: +ELLIPSIS\n        NearestNeighbors(algorithm='auto', leaf_size=30, ...)\n        >>> A = neigh.kneighbors_graph(X)\n        >>> A.toarray()\n        array([[1., 0., 1.],\n               [0., 1., 1.],\n               [1., 0., 1.]])\n\n        See also\n        --------\n        NearestNeighbors.radius_neighbors_graph\n        \"\"\"\n        if n_neighbors is None:\n            n_neighbors = self.n_neighbors\n\n        # kneighbors does the None handling.\n        if X is not None:\n            X = check_array(X, accept_sparse='csr')\n            n_samples1 = 
X.shape[0]\n        else:\n            n_samples1 = self._fit_X.shape[0]\n\n        n_samples2 = self._fit_X.shape[0]\n        n_nonzero = n_samples1 * n_neighbors\n        A_indptr = np.arange(0, n_nonzero + 1, n_neighbors)\n\n        # construct CSR matrix representation of the k-NN graph\n        if mode == 'connectivity':\n            A_data = np.ones(n_samples1 * n_neighbors)\n            A_ind = self.kneighbors(X, n_neighbors, return_distance=False)\n\n        elif mode == 'distance':\n            A_data, A_ind = self.kneighbors(\n                X, n_neighbors, return_distance=True)\n            A_data = np.ravel(A_data)\n\n        else:\n            raise ValueError(\n                'Unsupported mode, must be one of \"connectivity\" '\n                'or \"distance\" but got \"%s\" instead' % mode)\n\n        kneighbors_graph = csr_matrix((A_data, A_ind.ravel(), A_indptr),\n                                      shape=(n_samples1, n_samples2))\n\n        return kneighbors_graph"
        },
        {
          "file": "sklearn/neighbors/base.py",
          "type": "function",
          "name": "radius_neighbors_graph",
          "class_name": "RadiusNeighborsMixin",
          "code": "def radius_neighbors_graph(self, X=None, radius=None, mode='connectivity'):\n        \"\"\"Computes the (weighted) graph of Neighbors for points in X\n\n        Neighborhoods are restricted the points at a distance lower than\n        radius.\n\n        Parameters\n        ----------\n        X : array-like, shape = [n_samples, n_features], optional\n            The query point or points.\n            If not provided, neighbors of each indexed point are returned.\n            In this case, the query point is not considered its own neighbor.\n\n        radius : float\n            Radius of neighborhoods.\n            (default is the value passed to the constructor).\n\n        mode : {'connectivity', 'distance'}, optional\n            Type of returned matrix: 'connectivity' will return the\n            connectivity matrix with ones and zeros, in 'distance' the\n            edges are Euclidean distance between points.\n\n        Returns\n        -------\n        A : sparse matrix in CSR format, shape = [n_samples, n_samples]\n            A[i, j] is assigned the weight of edge that connects i to j.\n\n        Examples\n        --------\n        >>> X = [[0], [3], [1]]\n        >>> from sklearn.neighbors import NearestNeighbors\n        >>> neigh = NearestNeighbors(radius=1.5)\n        >>> neigh.fit(X) # doctest: +ELLIPSIS\n        NearestNeighbors(algorithm='auto', leaf_size=30, ...)\n        >>> A = neigh.radius_neighbors_graph(X)\n        >>> A.toarray()\n        array([[1., 0., 1.],\n               [0., 1., 0.],\n               [1., 0., 1.]])\n\n        See also\n        --------\n        kneighbors_graph\n        \"\"\"\n        if X is not None:\n            X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])\n\n        n_samples2 = self._fit_X.shape[0]\n        if radius is None:\n            radius = self.radius\n\n        # construct CSR matrix representation of the NN graph\n        if mode == 'connectivity':\n            A_ind = 
self.radius_neighbors(X, radius,\n                                          return_distance=False)\n            A_data = None\n        elif mode == 'distance':\n            dist, A_ind = self.radius_neighbors(X, radius,\n                                                return_distance=True)\n            A_data = np.concatenate(list(dist))\n        else:\n            raise ValueError(\n                'Unsupported mode, must be one of \"connectivity\", '\n                'or \"distance\" but got %s instead' % mode)\n\n        n_samples1 = A_ind.shape[0]\n        n_neighbors = np.array([len(a) for a in A_ind])\n        A_ind = np.concatenate(list(A_ind))\n        if A_data is None:\n            A_data = np.ones(len(A_ind))\n        A_indptr = np.concatenate((np.zeros(1, dtype=int),\n                                   np.cumsum(n_neighbors)))\n\n        return csr_matrix((A_data, A_ind, A_indptr),\n                          shape=(n_samples1, n_samples2))"
        }
      ]
    },
    {
      "pr_number": 8936,
      "pr_title": "[MRG+1] fixed OOB_Score bug for bagging classifiers.",
      "pr_body": "Fixes #8933\r\n\r\n<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#Contributing-Pull-Requests\r\n-->\r\n#### Reference Issue\r\n<!-- Example: Fixes #1234 -->\r\n\r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\n\r\n#### Any other comments?\r\n\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
      "issue_id": 8933,
      "issue_title": "BUG: BaggingClassifier.oob_score_ should not change with class label",
      "issue_body": "Let us compute the oob score of a bagged classifier.\r\n\r\n```python\r\nimport numpy as np\r\nimport pandas as pd\r\nfrom sklearn.ensemble import BaggingClassifier\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\n\r\nN = 50\r\nrandState = 5\r\nlabel = 'Label'\r\nfeatures = ['A','B','C']\r\n\r\nlabels = np.random.randint(3, size = N) - 1\r\ndf = pd.DataFrame( labels , index=range(N), columns=[label] )\r\nfor col in features:\r\n    df[col] = df[label] + 0.01 * np.random.rand( N )\r\n\r\nclf = BaggingClassifier(base_estimator = KNeighborsClassifier(), n_estimators = 10, oob_score = True, random_state = randState )\r\nclf.fit(df[features], df[label])\r\nprint clf.oob_score_\r\n```\r\n\r\nHere, clf.oob_score_=0.0.\r\n\r\nNow, you would not expect that the OOB accuracy is a function of the class labels...\r\n\r\n```python\r\ndf.loc[ df[label] == -1 , label ] = 2\r\nclf = BaggingClassifier(base_estimator = KNeighborsClassifier(), n_estimators = 10, oob_score = True, random_state = randState )\r\nclf.fit(df[features], df[label])\r\nprint clf.oob_score_\r\n```\r\n\r\nNow, clf.oob_score_=1.0.\r\n\r\nClearly, OOB score should not be a function of the labels arbitrarily chosen for the classes.\r\n\r\nsklearn.__version__: '0.18.1'\r\nnumpy.__version__: '1.11.3'",
      "issue_closed_at": "2017-06-08T09:35:49Z",
      "base_commit": "9131f89e6c165fb27dadd37d3168c1ee5ea84f5a",
      "changes": [
        {
          "file": "sklearn/ensemble/bagging.py",
          "type": "function",
          "name": "_set_oob_score",
          "class_name": "BaggingRegressor",
          "code": "def _set_oob_score(self, X, y):\n        n_samples = y.shape[0]\n\n        predictions = np.zeros((n_samples,))\n        n_predictions = np.zeros((n_samples,))\n\n        for estimator, samples, features in zip(self.estimators_,\n                                                self.estimators_samples_,\n                                                self.estimators_features_):\n            # Create mask for OOB samples\n            mask = ~samples\n\n            predictions[mask] += estimator.predict((X[mask, :])[:, features])\n            n_predictions[mask] += 1\n\n        if (n_predictions == 0).any():\n            warn(\"Some inputs do not have OOB scores. \"\n                 \"This probably means too few estimators were used \"\n                 \"to compute any reliable oob estimates.\")\n            n_predictions[n_predictions == 0] = 1\n\n        predictions /= n_predictions\n\n        self.oob_prediction_ = predictions\n        self.oob_score_ = r2_score(y, predictions)"
        }
      ]
    },
    {
      "pr_number": 8035,
      "pr_title": "[MRG+1] Catch cases for different class size in MLPClassifier with warm start (#7976) ",
      "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#Contributing-Pull-Requests\r\n-->\r\n#### Reference Issue\r\n<!-- Example: Fixes #1234 -->\r\nFixes #7976 \r\n\r\n#### What does this implement/fix? Explain your changes.\r\nThis provides a test for different cases that throws an error when warm_start = True for MLPClassifier. Currently, vague errors are thrown when class size is different between the current fit and the previous fit. This fix will throw a clearer error message. \r\n\r\n#### Any other comments?\r\n\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n\r\n",
      "issue_id": 7976,
      "issue_title": "MLPClasiffier produce error when trying to re-fit",
      "issue_body": "Hi,\r\n\r\nI am training a MLPClasiffier model twice, each time on a different data-set.\r\nOn the second iteration the fit method produce an error. Every time a different error.\r\nI did the processes on various models, this is the only one that produce an error. \r\n\r\nthis are the errors i get (each time a different one) - \r\n\r\n> _lbfgsb.error: failed in converting 7th argument `g' of _lbfgsb.setulb to C/Fortran array\r\n\r\n> ValueError: total size of new array must be unchanged\r\n\r\n > ValueError: operands could not be broadcast together with shapes (154,100) (25,) (154,100) \r\n\r\nthanks",
      "issue_closed_at": "2016-12-29T01:01:13Z",
      "base_commit": "40a1b7a0b10fea2995c7aaa46c90a9633e6d99f6",
      "changes": [
        {
          "file": "sklearn/neural_network/multilayer_perceptron.py",
          "type": "function",
          "name": "_validate_input",
          "class_name": "MLPRegressor",
          "code": "def _validate_input(self, X, y, incremental):\n        X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],\n                         multi_output=True, y_numeric=True)\n        if y.ndim == 2 and y.shape[1] == 1:\n            y = column_or_1d(y, warn=True)\n        return X, y"
        },
        {
          "file": "sklearn/neural_network/multilayer_perceptron.py",
          "type": "function",
          "name": "predict",
          "class_name": "MLPRegressor",
          "code": "def predict(self, X):\n        \"\"\"Predict using the multi-layer perceptron model.\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix}, shape (n_samples, n_features)\n            The input data.\n\n        Returns\n        -------\n        y : array-like, shape (n_samples, n_outputs)\n            The predicted values.\n        \"\"\"\n        check_is_fitted(self, \"coefs_\")\n        y_pred = self._predict(X)\n        if y_pred.shape[1] == 1:\n            return y_pred.ravel()\n        return y_pred"
        }
      ]
    },
    {
      "pr_number": 7594,
      "pr_title": "[MRG+1] FIX Make sure GridSearchCV and RandomizedSearchCV are pickle-able",
      "pr_body": "Fixes #7562 \n- Subclasses the `np.ma.MaskedArray` and overrides the `__getstate__` to make obj dtyped `MaskedArray`s pickle-able.\n- Uses this fixed `utils.fixes.MaskedArray` inside `gs.cv_results_`...\n\nThis is based off of https://github.com/numpy/numpy/pull/8122\n\nPlease review @jnothman @amueller @GaelVaroquaux @davechallis\n",
      "issue_id": 7562,
      "issue_title": "Error unpickling RandomizedSearchCV objects in 0.18 due to masked arrays",
      "issue_body": "#### Description\n\nIn version 0.18, loading pickles of fitted RandomizedSearchCV objects results in a `TypeError` exception (from pickle also created with version 0.18).\n\nThe error seems related to the use of masked arrays in the `RandomizedSearchCV.cv_results_` attribute - clearing this before pickling (i.e. setting to to `None`) allows pickling/unpickling to work.\n#### Steps/Code to Reproduce\n\n```\nimport pickle                                                                   \nfrom sklearn.model_selection import RandomizedSearchCV                          \nfrom sklearn.ensemble import RandomForestClassifier                             \n\nX = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]                            \ny = [1, 1, 1, 0, 0, 0]                                                          \n\nmodel = RandomizedSearchCV(RandomForestClassifier(),                            \n                           {'n_estimators': [5, 10, 20]},                       \n                           n_iter=3)                                            \nmodel.fit(X, y)                                                                 \n\nwith open('model.pkl', 'wb') as fh:                                             \n    pickle.dump(model, fh)                                                      \n\nwith open('model.pkl', 'rb') as fh:                                             \n    model = pickle.load(fh)\n\nprint(model.predict(X))\n```\n#### Expected Results\n\n```\n[1, 1, 1, 0, 0, 0]\n```\n#### Actual Results\n\n```\nTraceback (most recent call last):\n  File \"./t.py\", line 19, in <module>\n    model = pickle.load(fh)\n  File \"/Users/dsc/miniconda3/envs/p3/lib/python3.5/site-packages/numpy/ma/core.py\", line 5863, in __setstate__\n    super(MaskedArray, self).__setstate__((shp, typ, isf, raw))\nTypeError: object pickle not returning list\n```\n#### Versions\n\nPython 3.5.1 |Continuum Analytics, Inc.| (default, Dec  7 2015, 11:24:55) 
\n[GCC 4.2.1 (Apple Inc. build 5577)]\nNumPy 1.11.1\nSciPy 0.18.1\nScikit-Learn 0.18\n",
      "issue_closed_at": "2016-10-10T19:33:44Z",
      "base_commit": "33ed90dc0aa0549a5963000d7d070aa18ca389c4",
      "changes": [
        {
          "file": "sklearn/model_selection/_search.py",
          "type": "line",
          "name": "line 30",
          "code": "from ..utils import check_random_state\nfrom ..utils.fixes import sp_version\nfrom ..utils.fixes import rankdata\nfrom ..utils.random import sample_without_replacement\nfrom ..utils.validation import indexable, check_is_fitted\nfrom ..utils.metaestimators import if_delegate_has_method"
        },
        {
          "file": "sklearn/model_selection/_search.py",
          "type": "function",
          "name": "_store",
          "class_name": "BaseSearchCV",
          "code": "def _store(key_name, array, weights=None, splits=False, rank=False):\n            \"\"\"A small helper to store the scores/times to the cv_results_\"\"\"\n            array = np.array(array, dtype=np.float64).reshape(n_candidates,\n                                                              n_splits)\n            if splits:\n                for split_i in range(n_splits):\n                    results[\"split%d_%s\"\n                            % (split_i, key_name)] = array[:, split_i]\n\n            array_means = np.average(array, axis=1, weights=weights)\n            results['mean_%s' % key_name] = array_means\n            # Weighted std is not directly available in numpy\n            array_stds = np.sqrt(np.average((array -\n                                             array_means[:, np.newaxis]) ** 2,\n                                            axis=1, weights=weights))\n            results['std_%s' % key_name] = array_stds\n\n            if rank:\n                results[\"rank_%s\" % key_name] = np.asarray(\n                    rankdata(-array_means, method='min'), dtype=np.int32)"
        },
        {
          "file": "sklearn/utils/fixes.py",
          "type": "function",
          "name": "rankdata",
          "class_name": null,
          "code": "def rankdata(a, method='average'):\n        if method not in ('average', 'min', 'max', 'dense', 'ordinal'):\n            raise ValueError('unknown method \"{0}\"'.format(method))\n\n        arr = np.ravel(np.asarray(a))\n        algo = 'mergesort' if method == 'ordinal' else 'quicksort'\n        sorter = np.argsort(arr, kind=algo)\n\n        inv = np.empty(sorter.size, dtype=np.intp)\n        inv[sorter] = np.arange(sorter.size, dtype=np.intp)\n\n        if method == 'ordinal':\n            return inv + 1\n\n        arr = arr[sorter]\n        obs = np.r_[True, arr[1:] != arr[:-1]]\n        dense = obs.cumsum()[inv]\n\n        if method == 'dense':\n            return dense\n\n        # cumulative counts of each unique value\n        count = np.r_[np.nonzero(obs)[0], len(obs)]\n\n        if method == 'max':\n            return count[dense]\n\n        if method == 'min':\n            return count[dense - 1] + 1\n\n        # average method\n        return .5 * (count[dense] + count[dense - 1] + 1)"
        }
      ]
    }
  ]
}