{
  "instance_id": "scikit-learn__scikit-learn-15535",
  "repo": "scikit-learn/scikit-learn",
  "created_at": "2019-11-05T02:09:55Z",
  "problem_statement": "regression in input validation of clustering metrics\n```python\r\nfrom sklearn.metrics.cluster import mutual_info_score\r\nimport numpy as np\r\n\r\nx = np.random.choice(['a', 'b'], size=20).astype(object)\r\nmutual_info_score(x, x)\r\n```\r\nValueError: could not convert string to float: 'b'\r\n\r\nwhile\r\n```python\r\nx = np.random.choice(['a', 'b'], size=20)\r\nmutual_info_score(x, x)\r\n```\r\nworks with a warning?\r\n\r\nthis worked in 0.21.1 without a warning (as I think it should)\r\n\r\n\r\nEdit by @ogrisel: I removed the `.astype(object)` in the second code snippet.\n",
  "patch": "diff --git a/sklearn/metrics/cluster/_supervised.py b/sklearn/metrics/cluster/_supervised.py\n--- a/sklearn/metrics/cluster/_supervised.py\n+++ b/sklearn/metrics/cluster/_supervised.py\n@@ -43,10 +43,10 @@ def check_clusterings(labels_true, labels_pred):\n         The predicted labels.\n     \"\"\"\n     labels_true = check_array(\n-        labels_true, ensure_2d=False, ensure_min_samples=0\n+        labels_true, ensure_2d=False, ensure_min_samples=0, dtype=None,\n     )\n     labels_pred = check_array(\n-        labels_pred, ensure_2d=False, ensure_min_samples=0\n+        labels_pred, ensure_2d=False, ensure_min_samples=0, dtype=None,\n     )\n \n     # input checks\n",
  "similar_bug_items": [
    {
      "pr_number": 8936,
      "pr_title": "[MRG+1] fixed OOB_Score bug for bagging classifiers.",
      "pr_body": "Fixes #8933\r\n\r\n<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#Contributing-Pull-Requests\r\n-->\r\n#### Reference Issue\r\n<!-- Example: Fixes #1234 -->\r\n\r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\n\r\n#### Any other comments?\r\n\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
      "issue_id": 8933,
      "issue_title": "BUG: BaggingClassifier.oob_score_ should not change with class label",
      "issue_body": "Let us compute the oob score of a bagged classifier.\r\n\r\n```python\r\nimport numpy as np\r\nimport pandas as pd\r\nfrom sklearn.ensemble import BaggingClassifier\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\n\r\nN = 50\r\nrandState = 5\r\nlabel = 'Label'\r\nfeatures = ['A','B','C']\r\n\r\nlabels = np.random.randint(3, size = N) - 1\r\ndf = pd.DataFrame( labels , index=range(N), columns=[label] )\r\nfor col in features:\r\n    df[col] = df[label] + 0.01 * np.random.rand( N )\r\n\r\nclf = BaggingClassifier(base_estimator = KNeighborsClassifier(), n_estimators = 10, oob_score = True, random_state = randState )\r\nclf.fit(df[features], df[label])\r\nprint clf.oob_score_\r\n```\r\n\r\nHere, clf.oob_score_=0.0.\r\n\r\nNow, you would not expect that the OOB accuracy is a function of the class labels...\r\n\r\n```python\r\ndf.loc[ df[label] == -1 , label ] = 2\r\nclf = BaggingClassifier(base_estimator = KNeighborsClassifier(), n_estimators = 10, oob_score = True, random_state = randState )\r\nclf.fit(df[features], df[label])\r\nprint clf.oob_score_\r\n```\r\n\r\nNow, clf.oob_score_=1.0.\r\n\r\nClearly, OOB score should not be a function of the labels arbitrarily chosen for the classes.\r\n\r\nsklearn.__version__: '0.18.1'\r\nnumpy.__version__: '1.11.3'",
      "issue_closed_at": "2017-06-08T09:35:49Z",
      "base_commit": "9131f89e6c165fb27dadd37d3168c1ee5ea84f5a",
      "changes": [
        {
          "file": "sklearn/ensemble/bagging.py",
          "type": "function",
          "name": "_set_oob_score",
          "class_name": "BaggingRegressor",
          "code": "def _set_oob_score(self, X, y):\n        n_samples = y.shape[0]\n\n        predictions = np.zeros((n_samples,))\n        n_predictions = np.zeros((n_samples,))\n\n        for estimator, samples, features in zip(self.estimators_,\n                                                self.estimators_samples_,\n                                                self.estimators_features_):\n            # Create mask for OOB samples\n            mask = ~samples\n\n            predictions[mask] += estimator.predict((X[mask, :])[:, features])\n            n_predictions[mask] += 1\n\n        if (n_predictions == 0).any():\n            warn(\"Some inputs do not have OOB scores. \"\n                 \"This probably means too few estimators were used \"\n                 \"to compute any reliable oob estimates.\")\n            n_predictions[n_predictions == 0] = 1\n\n        predictions /= n_predictions\n\n        self.oob_prediction_ = predictions\n        self.oob_score_ = r2_score(y, predictions)"
        }
      ]
    },
    {
      "pr_number": 4894,
      "pr_title": "fix dtype transform problem in KNN and RandomForest",
      "pr_body": "Fix #4879\n\nI add a int array to store the indices before the loop. \n@amueller I also checked LabelEncoder but it can only accommodate 1d array, so a loop is still needed and slicing problem will still be there. \nFor MultiLabelBinarizer, it can accommodate 2d array, but it will return a 2d binary array to store the indices, which is very different from the original implementation in RandomForest and KNN. I am wondering is there any better way to do so? \n",
      "issue_id": 4879,
      "issue_title": "RandomForestClassifier changes label values",
      "issue_body": "Consider the following example:\n\n``` python\nimport numpy as np\nfrom sklearn.ensemble import RandomForestClassifier\n\nCLASSES = 15\nX = np.arange(CLASSES).reshape(-1, 1)\ny = [ch for ch in 'ABCDEFGHIJKLMNOPQRSTU'[:CLASSES]]\n\nprint(RandomForestClassifier().fit(X, y).predict(X))\n```\n\nThe output is\n\n`['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'B' 'B' 'B' 'B' 'B']`\n\nwhereas I expect it to be\n\n`['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O']`\n\nI traced the cause of the error to [here](https://github.com/scikit-learn/scikit-learn/blob/733629d256abaf9a31d6e6305f859768432907ed/sklearn/ensemble/forest.py#L418). I'm happy to put in a pull request, but can others please verify they get the same unexpected behaviour first?\n\nSystem information:\n``\nCPython 3.3.5\nIPython 3.1.0\n\nnumpy 1.9.2\nscipy 0.14.0\nscikit-learn 0.16.1(conda)\n\ncompiler   : GCC 4.2.1 (Apple Inc. build 5577)\nsystem     : Darwin\nrelease    : 13.4.0\nmachine    : x86_64\nprocessor  : i386\nCPU cores  : 4\ninterpreter: 64bit\n\n```\n```\n",
      "issue_closed_at": "2015-06-24T21:33:10Z",
      "base_commit": "cd98ffced656569681648c88c51dae844501643c",
      "changes": [
        {
          "file": "sklearn/ensemble/forest.py",
          "type": "function",
          "name": "_validate_y_class_weight",
          "class_name": "ForestClassifier",
          "code": "def _validate_y_class_weight(self, y):\n        y = np.copy(y)\n        expanded_class_weight = None\n\n        if self.class_weight is not None:\n            y_original = np.copy(y)\n\n        self.classes_ = []\n        self.n_classes_ = []\n\n        for k in range(self.n_outputs_):\n            classes_k, y[:, k] = np.unique(y[:, k], return_inverse=True)\n            self.classes_.append(classes_k)\n            self.n_classes_.append(classes_k.shape[0])\n\n        if self.class_weight is not None:\n            valid_presets = ('auto', 'balanced', 'balanced_subsample', 'subsample', 'auto')\n            if isinstance(self.class_weight, six.string_types):\n                if self.class_weight not in valid_presets:\n                    raise ValueError('Valid presets for class_weight include '\n                                     '\"balanced\" and \"balanced_subsample\". Given \"%s\".'\n                                     % self.class_weight)\n                if self.class_weight == \"subsample\":\n                    warn(\"class_weight='subsample' is deprecated and will be removed in 0.18.\"\n                         \" It was replaced by class_weight='balanced_subsample' \"\n                         \"using the balanced strategy.\", DeprecationWarning)\n                if self.warm_start:\n                    warn('class_weight presets \"balanced\" or \"balanced_subsample\" are '\n                         'not recommended for warm_start if the fitted data '\n                         'differs from the full dataset. In order to use '\n                         '\"balanced\" weights, use compute_class_weight(\"balanced\", '\n                         'classes, y). In place of y you can use a large '\n                         'enough sample of the full training set target to '\n                         'properly estimate the class frequency '\n                         'distributions. Pass the resulting weights as the '\n                         'class_weight parameter.')\n\n            if (self.class_weight not in ['subsample', 'balanced_subsample'] or\n                    not self.bootstrap):\n                if self.class_weight == 'subsample':\n                    class_weight = 'auto'\n                elif self.class_weight == \"balanced_subsample\":\n                    class_weight = \"balanced\"\n                else:\n                    class_weight = self.class_weight\n                with warnings.catch_warnings():\n                    if class_weight == \"auto\":\n                        warnings.simplefilter('ignore', DeprecationWarning)\n                    expanded_class_weight = compute_sample_weight(class_weight,\n                                                                  y_original)\n\n        return y, expanded_class_weight"
        },
        {
          "file": "sklearn/tree/tree.py",
          "type": "function",
          "name": "fit",
          "class_name": "BaseDecisionTree",
          "code": "def fit(self, X, y, sample_weight=None, check_input=True):\n        \"\"\"Build a decision tree from the training set (X, y).\n\n        Parameters\n        ----------\n        X : array-like or sparse matrix, shape = [n_samples, n_features]\n            The training input samples. Internally, it will be converted to\n            ``dtype=np.float32`` and if a sparse matrix is provided\n            to a sparse ``csc_matrix``.\n\n        y : array-like, shape = [n_samples] or [n_samples, n_outputs]\n            The target values (class labels in classification, real numbers in\n            regression). In the regression case, use ``dtype=np.float64`` and\n            ``order='C'`` for maximum efficiency.\n\n        sample_weight : array-like, shape = [n_samples] or None\n            Sample weights. If None, then samples are equally weighted. Splits\n            that would create child nodes with net zero or negative weight are\n            ignored while searching for a split in each node. In the case of\n            classification, splits are also ignored if they would result in any\n            single class carrying a negative weight in either child node.\n\n        check_input : boolean, (default=True)\n            Allow to bypass several input checking.\n            Don't use this parameter unless you know what you do.\n\n        Returns\n        -------\n        self : object\n            Returns self.\n        \"\"\"\n        random_state = check_random_state(self.random_state)\n        if check_input:\n            X = check_array(X, dtype=DTYPE, accept_sparse=\"csc\")\n            if issparse(X):\n                X.sort_indices()\n\n                if X.indices.dtype != np.intc or X.indptr.dtype != np.intc:\n                    raise ValueError(\"No support for np.int64 index based \"\n                                     \"sparse matrices\")\n\n        # Determine output settings\n        n_samples, self.n_features_ = X.shape\n        is_classification = isinstance(self, ClassifierMixin)\n\n        y = np.atleast_1d(y)\n        expanded_class_weight = None\n\n        if y.ndim == 1:\n            # reshape is necessary to preserve the data contiguity against vs\n            # [:, np.newaxis] that does not.\n            y = np.reshape(y, (-1, 1))\n\n        self.n_outputs_ = y.shape[1]\n\n        if is_classification:\n            y = np.copy(y)\n\n            self.classes_ = []\n            self.n_classes_ = []\n\n            if self.class_weight is not None:\n                y_original = np.copy(y)\n\n            for k in range(self.n_outputs_):\n                classes_k, y[:, k] = np.unique(y[:, k], return_inverse=True)\n                self.classes_.append(classes_k)\n                self.n_classes_.append(classes_k.shape[0])\n\n            if self.class_weight is not None:\n                expanded_class_weight = compute_sample_weight(\n                    self.class_weight, y_original)\n\n        else:\n            self.classes_ = [None] * self.n_outputs_\n            self.n_classes_ = [1] * self.n_outputs_\n\n        self.n_classes_ = np.array(self.n_classes_, dtype=np.intp)\n\n        if getattr(y, \"dtype\", None) != DOUBLE or not y.flags.contiguous:\n            y = np.ascontiguousarray(y, dtype=DOUBLE)\n\n        # Check parameters\n        max_depth = ((2 ** 31) - 1 if self.max_depth is None\n                     else self.max_depth)\n        max_leaf_nodes = (-1 if self.max_leaf_nodes is None\n                          else self.max_leaf_nodes)\n\n        if isinstance(self.max_features, six.string_types):\n            if self.max_features == \"auto\":\n                if is_classification:\n                    max_features = max(1, int(np.sqrt(self.n_features_)))\n                else:\n                    max_features = self.n_features_\n            elif self.max_features == \"sqrt\":\n                max_features = max(1, int(np.sqrt(self.n_features_)))\n            elif self.max_features == \"log2\":\n                max_features = max(1, int(np.log2(self.n_features_)))\n            else:\n                raise ValueError(\n                    'Invalid value for max_features. Allowed string '\n                    'values are \"auto\", \"sqrt\" or \"log2\".')\n        elif self.max_features is None:\n            max_features = self.n_features_\n        elif isinstance(self.max_features, (numbers.Integral, np.integer)):\n            max_features = self.max_features\n        else:  # float\n            if self.max_features > 0.0:\n                max_features = max(1, int(self.max_features * self.n_features_))\n            else:\n                max_features = 0\n\n        self.max_features_ = max_features\n\n        if len(y) != n_samples:\n            raise ValueError(\"Number of labels=%d does not match \"\n                             \"number of samples=%d\" % (len(y), n_samples))\n        if self.min_samples_split <= 0:\n            raise ValueError(\"min_samples_split must be greater than zero.\")\n        if self.min_samples_leaf <= 0:\n            raise ValueError(\"min_samples_leaf must be greater than zero.\")\n        if not 0 <= self.min_weight_fraction_leaf <= 0.5:\n            raise ValueError(\"min_weight_fraction_leaf must in [0, 0.5]\")\n        if max_depth <= 0:\n            raise ValueError(\"max_depth must be greater than zero. \")\n        if not (0 < max_features <= self.n_features_):\n            raise ValueError(\"max_features must be in (0, n_features]\")\n        if not isinstance(max_leaf_nodes, (numbers.Integral, np.integer)):\n            raise ValueError(\"max_leaf_nodes must be integral number but was \"\n                             \"%r\" % max_leaf_nodes)\n        if -1 < max_leaf_nodes < 2:\n            raise ValueError((\"max_leaf_nodes {0} must be either smaller than \"\n                              \"0 or larger than 1\").format(max_leaf_nodes))\n\n        if sample_weight is not None:\n            if (getattr(sample_weight, \"dtype\", None) != DOUBLE or\n                    not sample_weight.flags.contiguous):\n                sample_weight = np.ascontiguousarray(\n                    sample_weight, dtype=DOUBLE)\n            if len(sample_weight.shape) > 1:\n                raise ValueError(\"Sample weights array has more \"\n                                 \"than one dimension: %d\" %\n                                 len(sample_weight.shape))\n            if len(sample_weight) != n_samples:\n                raise ValueError(\"Number of weights=%d does not match \"\n                                 \"number of samples=%d\" %\n                                 (len(sample_weight), n_samples))\n\n        if expanded_class_weight is not None:\n            if sample_weight is not None:\n                sample_weight = sample_weight * expanded_class_weight\n            else:\n                sample_weight = expanded_class_weight\n\n        # Set min_weight_leaf from min_weight_fraction_leaf\n        if self.min_weight_fraction_leaf != 0. and sample_weight is not None:\n            min_weight_leaf = (self.min_weight_fraction_leaf *\n                               np.sum(sample_weight))\n        else:\n            min_weight_leaf = 0.\n\n        # Set min_samples_split sensibly\n        min_samples_split = max(self.min_samples_split,\n                                2 * self.min_samples_leaf)\n\n        # Build tree\n        criterion = self.criterion\n        if not isinstance(criterion, Criterion):\n            if is_classification:\n                criterion = CRITERIA_CLF[self.criterion](self.n_outputs_,\n                                                         self.n_classes_)\n            else:\n                criterion = CRITERIA_REG[self.criterion](self.n_outputs_)\n\n        SPLITTERS = SPARSE_SPLITTERS if issparse(X) else DENSE_SPLITTERS\n\n        splitter = self.splitter\n        if not isinstance(self.splitter, Splitter):\n            splitter = SPLITTERS[self.splitter](criterion,\n                                                self.max_features_,\n                                                self.min_samples_leaf,\n                                                min_weight_leaf,\n                                                random_state)\n\n        self.tree_ = Tree(self.n_features_, self.n_classes_, self.n_outputs_)\n\n        # Use BestFirst if max_leaf_nodes given; use DepthFirst otherwise\n        if max_leaf_nodes < 0:\n            builder = DepthFirstTreeBuilder(splitter, min_samples_split,\n                                            self.min_samples_leaf,\n                                            min_weight_leaf,\n                                            max_depth)\n        else:\n            builder = BestFirstTreeBuilder(splitter, min_samples_split,\n                                           self.min_samples_leaf,\n                                           min_weight_leaf,\n                                           max_depth,\n                                           max_leaf_nodes)\n\n        builder.build(self.tree_, X, y, sample_weight)\n\n        if self.n_outputs_ == 1:\n            self.n_classes_ = self.n_classes_[0]\n            self.classes_ = self.classes_[0]\n\n        return self"
        }
      ]
    },
    {
      "pr_number": 12279,
      "pr_title": "[MRG+1] Add check_is_fitted to non standard functions",
      "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist\r\n-->\r\n\r\n#### Reference Issues/PRs\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\nFixes #12276 \r\n\r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\nAdding `check_is_fitted` method to other non standard functions\r\n\r\n#### Any other comments?\r\n\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
      "issue_id": 12276,
      "issue_title": "Calling NearestNeighbors.{kneighbors,radius_neighbors}_graph without first fitting should raise NotFittedError",
      "issue_body": "<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n\r\nRunning prediction methods without running `fit` should raise `NotFittedError`. This is not the case for some non-standard methods.\r\n\r\n`check_is_fitted` should be applied for this purpose.\r\n\r\n#### Steps/Code to Reproduce\r\n\r\n```py\r\nfrom sklearn.neighbors import NearestNeighbors\r\nNearestNeighbors().kneighbors_graph([[1]])\r\n```\r\nor\r\n```py\r\nfrom sklearn.neighbors import NearestNeighbors\r\nNearestNeighbors().radius_neighbors_graph([[1]])\r\n```\r\n\r\n\r\n#### Expected Results\r\n\r\nNotFittedError raised\r\n\r\n#### Actual Results\r\nAttributeError raised\r\n\r\n#### Versions\r\n\r\n```\r\nSystem\r\n------\r\n    python: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:52:12)  [GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]\r\nexecutable: /Users/joel/anaconda3/envs/scipy3k/bin/python\r\n   machine: Darwin-17.7.0-x86_64-i386-64bit\r\n\r\nBLAS\r\n----\r\n  lib_dirs: /Users/joel/anaconda3/envs/scipy3k/lib\r\n    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None\r\ncblas_libs: mkl_rt, pthread\r\n\r\nPython deps\r\n-----------\r\nsetuptools: 37.0.0\r\n     scipy: 1.0.0\r\n       pip: 18.0\r\n     numpy: 1.14.1\r\n   sklearn: 0.21.dev0\r\n    pandas: 0.23.4\r\n    Cython: 0.28.5\r\n```",
      "issue_closed_at": "2018-10-19T13:46:09Z",
      "base_commit": "74b56dbc57d9295df8fb653adccb265da356b670",
      "changes": [
        {
          "file": "sklearn/neighbors/base.py",
          "type": "function",
          "name": "kneighbors_graph",
          "class_name": "KNeighborsMixin",
          "code": "def kneighbors_graph(self, X=None, n_neighbors=None,\n                         mode='connectivity'):\n        \"\"\"Computes the (weighted) graph of k-Neighbors for points in X\n\n        Parameters\n        ----------\n        X : array-like, shape (n_query, n_features), \\\n                or (n_query, n_indexed) if metric == 'precomputed'\n            The query point or points.\n            If not provided, neighbors of each indexed point are returned.\n            In this case, the query point is not considered its own neighbor.\n\n        n_neighbors : int\n            Number of neighbors for each sample.\n            (default is value passed to the constructor).\n\n        mode : {'connectivity', 'distance'}, optional\n            Type of returned matrix: 'connectivity' will return the\n            connectivity matrix with ones and zeros, in 'distance' the\n            edges are Euclidean distance between points.\n\n        Returns\n        -------\n        A : sparse matrix in CSR format, shape = [n_samples, n_samples_fit]\n            n_samples_fit is the number of samples in the fitted data\n            A[i, j] is assigned the weight of edge that connects i to j.\n\n        Examples\n        --------\n        >>> X = [[0], [3], [1]]\n        >>> from sklearn.neighbors import NearestNeighbors\n        >>> neigh = NearestNeighbors(n_neighbors=2)\n        >>> neigh.fit(X) # doctest: +ELLIPSIS\n        NearestNeighbors(algorithm='auto', leaf_size=30, ...)\n        >>> A = neigh.kneighbors_graph(X)\n        >>> A.toarray()\n        array([[1., 0., 1.],\n               [0., 1., 1.],\n               [1., 0., 1.]])\n\n        See also\n        --------\n        NearestNeighbors.radius_neighbors_graph\n        \"\"\"\n        if n_neighbors is None:\n            n_neighbors = self.n_neighbors\n\n        # kneighbors does the None handling.\n        if X is not None:\n            X = check_array(X, accept_sparse='csr')\n            n_samples1 = X.shape[0]\n        else:\n            n_samples1 = self._fit_X.shape[0]\n\n        n_samples2 = self._fit_X.shape[0]\n        n_nonzero = n_samples1 * n_neighbors\n        A_indptr = np.arange(0, n_nonzero + 1, n_neighbors)\n\n        # construct CSR matrix representation of the k-NN graph\n        if mode == 'connectivity':\n            A_data = np.ones(n_samples1 * n_neighbors)\n            A_ind = self.kneighbors(X, n_neighbors, return_distance=False)\n\n        elif mode == 'distance':\n            A_data, A_ind = self.kneighbors(\n                X, n_neighbors, return_distance=True)\n            A_data = np.ravel(A_data)\n\n        else:\n            raise ValueError(\n                'Unsupported mode, must be one of \"connectivity\" '\n                'or \"distance\" but got \"%s\" instead' % mode)\n\n        kneighbors_graph = csr_matrix((A_data, A_ind.ravel(), A_indptr),\n                                      shape=(n_samples1, n_samples2))\n\n        return kneighbors_graph"
        },
        {
          "file": "sklearn/neighbors/base.py",
          "type": "function",
          "name": "radius_neighbors_graph",
          "class_name": "RadiusNeighborsMixin",
          "code": "def radius_neighbors_graph(self, X=None, radius=None, mode='connectivity'):\n        \"\"\"Computes the (weighted) graph of Neighbors for points in X\n\n        Neighborhoods are restricted the points at a distance lower than\n        radius.\n\n        Parameters\n        ----------\n        X : array-like, shape = [n_samples, n_features], optional\n            The query point or points.\n            If not provided, neighbors of each indexed point are returned.\n            In this case, the query point is not considered its own neighbor.\n\n        radius : float\n            Radius of neighborhoods.\n            (default is the value passed to the constructor).\n\n        mode : {'connectivity', 'distance'}, optional\n            Type of returned matrix: 'connectivity' will return the\n            connectivity matrix with ones and zeros, in 'distance' the\n            edges are Euclidean distance between points.\n\n        Returns\n        -------\n        A : sparse matrix in CSR format, shape = [n_samples, n_samples]\n            A[i, j] is assigned the weight of edge that connects i to j.\n\n        Examples\n        --------\n        >>> X = [[0], [3], [1]]\n        >>> from sklearn.neighbors import NearestNeighbors\n        >>> neigh = NearestNeighbors(radius=1.5)\n        >>> neigh.fit(X) # doctest: +ELLIPSIS\n        NearestNeighbors(algorithm='auto', leaf_size=30, ...)\n        >>> A = neigh.radius_neighbors_graph(X)\n        >>> A.toarray()\n        array([[1., 0., 1.],\n               [0., 1., 0.],\n               [1., 0., 1.]])\n\n        See also\n        --------\n        kneighbors_graph\n        \"\"\"\n        if X is not None:\n            X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])\n\n        n_samples2 = self._fit_X.shape[0]\n        if radius is None:\n            radius = self.radius\n\n        # construct CSR matrix representation of the NN graph\n        if mode == 'connectivity':\n            A_ind = self.radius_neighbors(X, radius,\n                                          return_distance=False)\n            A_data = None\n        elif mode == 'distance':\n            dist, A_ind = self.radius_neighbors(X, radius,\n                                                return_distance=True)\n            A_data = np.concatenate(list(dist))\n        else:\n            raise ValueError(\n                'Unsupported mode, must be one of \"connectivity\", '\n                'or \"distance\" but got %s instead' % mode)\n\n        n_samples1 = A_ind.shape[0]\n        n_neighbors = np.array([len(a) for a in A_ind])\n        A_ind = np.concatenate(list(A_ind))\n        if A_data is None:\n            A_data = np.ones(len(A_ind))\n        A_indptr = np.concatenate((np.zeros(1, dtype=int),\n                                   np.cumsum(n_neighbors)))\n\n        return csr_matrix((A_data, A_ind, A_indptr),\n                          shape=(n_samples1, n_samples2))"
        }
      ]
    },
    {
      "pr_number": 7632,
      "pr_title": "[MRG+1] Correcting length of explained_variance_ratio_, eigen solver, final PR",
      "pr_body": "<!--\nThanks for contributing a pull request! Please ensure you have taken a look at\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#Contributing-Pull-Requests\n-->\n#### Reference Issue\n\n<!-- Example: Fixes #1234 -->\n\nFix #6032 \n#### What does this implement/fix? Explain your changes.\n\nAttribute explained_variance_ratio_ from LinearDiscriminantAnalysis class will be of length n_components (eigen solver).\n#### Any other comments?\n\nThis PR follows PR 7616. I mixed up my git history, so it was easier to open a new PR.\n\n<!--\nPlease be aware that we are a loose team of volunteers so patience is\nnecessary; assistance handling other issues is very welcome. We value\nall user contributions, no matter how minor they are. If we are slow to\nreview, either the pull request needs some benchmarking, tinkering,\nconvincing, etc. or more likely the reviewers are simply busy. In either\ncase, we ask for your understanding during the review process.\nFor more information, see our FAQ on this topic:\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\n\nThanks for contributing!\n-->\n",
      "issue_id": 6032,
      "issue_title": "LDA.explained_variance_ratio_ is of the wrong size",
      "issue_body": "The docs say that <a href=\"http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis\">LDA.explained_variance_ratio_</a> should have only `n_components_`. But it doesn't.\n\nIt looks like this bug only exists when we use the `eigen` solver, not the `svd` solver.\n\n```\n>>> import numpy as np\n>>> from sklearn.lda import LDA\n>>> from sklearn.utils.testing import assert_equal\n>>>\n>>> state = np.random.RandomState(0)\n>>> X = state.normal(loc=0, scale=100, size=(40, 20))\n>>> y = state.randint(0, 3, size=(40, 1))\n>>>\n>>> # Train the LDA classifier. Use the eigen solver\n>>> lda_eigen = LDA(solver='eigen', n_components=5)\n>>> lda_eigen.fit(X, y)\n>>> assert_equal(lda_eigen.explained_variance_ratio_.shape, (5,))\nAssertionError: Tuples differ: (20,) != (5,)\n\nFirst differing element 0:\n20\n5\n\n- (20,)\n+ (5,)\n```\n\nLooks like we fix either the docs or the code. Which one?\n\nPinging @JPFrancoia.\n\nAddresses an issue in #6031.\n",
      "issue_closed_at": "2016-10-25T12:52:13Z",
      "base_commit": "ee3e61754bd4bb10cea8065993e462fc7b112cb3",
      "changes": [
        {
          "file": "sklearn/discriminant_analysis.py",
          "type": "function",
          "name": "_solve_lsqr",
          "class_name": "LinearDiscriminantAnalysis",
          "code": "def _solve_lsqr(self, X, y, shrinkage):\n        \"\"\"Least squares solver.\n\n        The least squares solver computes a straightforward solution of the\n        optimal decision rule based directly on the discriminant functions. It\n        can only be used for classification (with optional shrinkage), because\n        estimation of eigenvectors is not performed. Therefore, dimensionality\n        reduction with the transform is not supported.\n\n        Parameters\n        ----------\n        X : array-like, shape (n_samples, n_features)\n            Training data.\n\n        y : array-like, shape (n_samples,) or (n_samples, n_classes)\n            Target values.\n\n        shrinkage : string or float, optional\n            Shrinkage parameter, possible values:\n              - None: no shrinkage (default).\n              - 'auto': automatic shrinkage using the Ledoit-Wolf lemma.\n              - float between 0 and 1: fixed shrinkage parameter.\n\n        Notes\n        -----\n        This solver is based on [1]_, section 2.6.2, pp. 39-41.\n\n        References\n        ----------\n        .. [1] R. O. Duda, P. E. Hart, D. G. Stork. Pattern Classification\n           (Second Edition). John Wiley & Sons, Inc., New York, 2001. ISBN\n           0-471-05669-3.\n        \"\"\"\n        self.means_ = _class_means(X, y)\n        self.covariance_ = _class_cov(X, y, self.priors_, shrinkage)\n        self.coef_ = linalg.lstsq(self.covariance_, self.means_.T)[0].T\n        self.intercept_ = (-0.5 * np.diag(np.dot(self.means_, self.coef_.T))\n                           + np.log(self.priors_))"
        },
        {
          "file": "sklearn/discriminant_analysis.py",
          "type": "function",
          "name": "_solve_svd",
          "class_name": "LinearDiscriminantAnalysis",
          "code": "def _solve_svd(self, X, y):\n        \"\"\"SVD solver.\n\n        Parameters\n        ----------\n        X : array-like, shape (n_samples, n_features)\n            Training data.\n\n        y : array-like, shape (n_samples,) or (n_samples, n_targets)\n            Target values.\n        \"\"\"\n        n_samples, n_features = X.shape\n        n_classes = len(self.classes_)\n\n        self.means_ = _class_means(X, y)\n        if self.store_covariance:\n            self.covariance_ = _class_cov(X, y, self.priors_)\n\n        Xc = []\n        for idx, group in enumerate(self.classes_):\n            Xg = X[y == group, :]\n            Xc.append(Xg - self.means_[idx])\n\n        self.xbar_ = np.dot(self.priors_, self.means_)\n\n        Xc = np.concatenate(Xc, axis=0)\n\n        # 1) within (univariate) scaling by with classes std-dev\n        std = Xc.std(axis=0)\n        # avoid division by zero in normalization\n        std[std == 0] = 1.\n        fac = 1. / (n_samples - n_classes)\n\n        # 2) Within variance scaling\n        X = np.sqrt(fac) * (Xc / std)\n        # SVD of centered (within)scaled data\n        U, S, V = linalg.svd(X, full_matrices=False)\n\n        rank = np.sum(S > self.tol)\n        if rank < n_features:\n            warnings.warn(\"Variables are collinear.\")\n        # Scaling of within covariance is: V' 1/S\n        scalings = (V[:rank] / std).T / S[:rank]\n\n        # 3) Between variance scaling\n        # Scale weighted centers\n        X = np.dot(((np.sqrt((n_samples * self.priors_) * fac)) *\n                    (self.means_ - self.xbar_).T).T, scalings)\n        # Centers are living in a space with n_classes-1 dim (maximum)\n        # Use SVD to find projection in the space spanned by the\n        # (n_classes) centers\n        _, S, V = linalg.svd(X, full_matrices=0)\n\n        self.explained_variance_ratio_ = (S**2 / np.sum(\n                S**2))[:self.n_components]\n        rank = np.sum(S > self.tol * S[0])\n        self.scalings_ = np.dot(scalings, V.T[:, :rank])\n        coef = np.dot(self.means_ - self.xbar_, self.scalings_)\n        self.intercept_ = (-0.5 * np.sum(coef ** 2, axis=1)\n                           + np.log(self.priors_))\n        self.coef_ = np.dot(coef, self.scalings_.T)\n        self.intercept_ -= np.dot(self.xbar_, self.coef_.T)"
        },
        {
          "file": "sklearn/discriminant_analysis.py",
          "type": "function",
          "name": "fit",
          "class_name": "QuadraticDiscriminantAnalysis",
          "code": "def fit(self, X, y, store_covariances=None, tol=None):\n        \"\"\"Fit the model according to the given training data and parameters.\n\n            .. versionchanged:: 0.17\n               Deprecated *store_covariance* have been moved to main constructor.\n\n            .. versionchanged:: 0.17\n               Deprecated *tol* have been moved to main constructor.\n\n        Parameters\n        ----------\n        X : array-like, shape = [n_samples, n_features]\n            Training vector, where n_samples in the number of samples and\n            n_features is the number of features.\n\n        y : array, shape = [n_samples]\n            Target values (integers)\n        \"\"\"\n        if store_covariances:\n            warnings.warn(\"The parameter 'store_covariances' is deprecated as \"\n                          \"of version 0.17 and will be removed in 0.19. The \"\n                          \"parameter is no longer necessary because the value \"\n                          \"is set via the estimator initialisation or \"\n                          \"set_params method.\", DeprecationWarning)\n            self.store_covariances = store_covariances\n        if tol:\n            warnings.warn(\"The parameter 'tol' is deprecated as of version \"\n                          \"0.17 and will be removed in 0.19. The parameter is \"\n                          \"no longer necessary because the value is set via \"\n                          \"the estimator initialisation or set_params method.\",\n                          DeprecationWarning)\n            self.tol = tol\n        X, y = check_X_y(X, y)\n        check_classification_targets(y)\n        self.classes_, y = np.unique(y, return_inverse=True)\n        n_samples, n_features = X.shape\n        n_classes = len(self.classes_)\n        if n_classes < 2:\n            raise ValueError('y has less than 2 classes')\n        if self.priors is None:\n            self.priors_ = bincount(y) / float(n_samples)\n        else:\n            self.priors_ = self.priors\n\n        cov = None\n        if self.store_covariances:\n            cov = []\n        means = []\n        scalings = []\n        rotations = []\n        for ind in xrange(n_classes):\n            Xg = X[y == ind, :]\n            meang = Xg.mean(0)\n            means.append(meang)\n            if len(Xg) == 1:\n                raise ValueError('y has only 1 sample in class %s, covariance '\n                                 'is ill defined.' % str(self.classes_[ind]))\n            Xgc = Xg - meang\n            # Xgc = U * S * V.T\n            U, S, Vt = np.linalg.svd(Xgc, full_matrices=False)\n            rank = np.sum(S > self.tol)\n            if rank < n_features:\n                warnings.warn(\"Variables are collinear\")\n            S2 = (S ** 2) / (len(Xg) - 1)\n            S2 = ((1 - self.reg_param) * S2) + self.reg_param\n            if self.store_covariances:\n                # cov = V * (S^2 / (n-1)) * V.T\n                cov.append(np.dot(S2 * Vt.T, Vt))\n            scalings.append(S2)\n            rotations.append(Vt.T)\n        if self.store_covariances:\n            self.covariances_ = cov\n        self.means_ = np.asarray(means)\n        self.scalings_ = scalings\n        self.rotations_ = rotations\n        return self"
        },
        {
          "file": "sklearn/discriminant_analysis.py",
          "type": "function",
          "name": "transform",
          "class_name": "LinearDiscriminantAnalysis",
          "code": "def transform(self, X):\n        \"\"\"Project data to maximize class separation.\n\n        Parameters\n        ----------\n        X : array-like, shape (n_samples, n_features)\n            Input data.\n\n        Returns\n        -------\n        X_new : array, shape (n_samples, n_components)\n            Transformed data.\n        \"\"\"\n        if self.solver == 'lsqr':\n            raise NotImplementedError(\"transform not implemented for 'lsqr' \"\n                                      \"solver (use 'svd' or 'eigen').\")\n        check_is_fitted(self, ['xbar_', 'scalings_'], all_or_any=any)\n\n        X = check_array(X)\n        if self.solver == 'svd':\n            X_new = np.dot(X - self.xbar_, self.scalings_)\n        elif self.solver == 'eigen':\n            X_new = np.dot(X, self.scalings_)\n        n_components = X.shape[1] if self.n_components is None \\\n            else self.n_components\n        return X_new[:, :n_components]"
        }
      ]
    },
    {
      "pr_number": 8035,
      "pr_title": "[MRG+1] Catch cases for different class size in MLPClassifier with warm start (#7976) ",
      "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#Contributing-Pull-Requests\r\n-->\r\n#### Reference Issue\r\n<!-- Example: Fixes #1234 -->\r\nFixes #7976 \r\n\r\n#### What does this implement/fix? Explain your changes.\r\nThis provides a test for different cases that throws an error when warm_start = True for MLPClassifier. Currently, vague errors are thrown when class size is different between the current fit and the previous fit. This fix will throw a clearer error message. \r\n\r\n#### Any other comments?\r\n\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n\r\n",
      "issue_id": 7976,
      "issue_title": "MLPClasiffier produce error when trying to re-fit",
      "issue_body": "Hi,\r\n\r\nI am training a MLPClasiffier model twice, each time on a different data-set.\r\nOn the second iteration the fit method produce an error. Every time a different error.\r\nI did the processes on various models, this is the only one that produce an error. \r\n\r\nthis are the errors i get (each time a different one) - \r\n\r\n> _lbfgsb.error: failed in converting 7th argument `g' of _lbfgsb.setulb to C/Fortran array\r\n\r\n> ValueError: total size of new array must be unchanged\r\n\r\n > ValueError: operands could not be broadcast together with shapes (154,100) (25,) (154,100) \r\n\r\nthanks",
      "issue_closed_at": "2016-12-29T01:01:13Z",
      "base_commit": "40a1b7a0b10fea2995c7aaa46c90a9633e6d99f6",
      "changes": [
        {
          "file": "sklearn/neural_network/multilayer_perceptron.py",
          "type": "function",
          "name": "_validate_input",
          "class_name": "MLPRegressor",
          "code": "def _validate_input(self, X, y, incremental):\n        X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],\n                         multi_output=True, y_numeric=True)\n        if y.ndim == 2 and y.shape[1] == 1:\n            y = column_or_1d(y, warn=True)\n        return X, y"
        },
        {
          "file": "sklearn/neural_network/multilayer_perceptron.py",
          "type": "function",
          "name": "predict",
          "class_name": "MLPRegressor",
          "code": "def predict(self, X):\n        \"\"\"Predict using the multi-layer perceptron model.\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix}, shape (n_samples, n_features)\n            The input data.\n\n        Returns\n        -------\n        y : array-like, shape (n_samples, n_outputs)\n            The predicted values.\n        \"\"\"\n        check_is_fitted(self, \"coefs_\")\n        y_pred = self._predict(X)\n        if y_pred.shape[1] == 1:\n            return y_pred.ravel()\n        return y_pred"
        }
      ]
    }
  ]
}