{
  "instance_id": "scikit-learn__scikit-learn-10949",
  "repo": "scikit-learn/scikit-learn",
  "created_at": "2018-04-10T15:30:56Z",
  "problem_statement": "warn_on_dtype with DataFrame\n#### Description\r\n\r\n``warn_on_dtype`` has no effect when input is a pandas ``DataFrame``\r\n\r\n#### Steps/Code to Reproduce\r\n```python\r\nfrom sklearn.utils.validation import check_array\r\nimport pandas as pd\r\ndf = pd.DataFrame([[1, 2, 3], [2, 3, 4]], dtype=object)\r\nchecked = check_array(df, warn_on_dtype=True)\r\n```\r\n\r\n#### Expected result: \r\n\r\n```python-traceback\r\nDataConversionWarning: Data with input dtype object was converted to float64.\r\n```\r\n\r\n#### Actual Results\r\nNo warning is thrown\r\n\r\n#### Versions\r\nLinux-4.4.0-116-generic-x86_64-with-debian-stretch-sid\r\nPython 3.6.3 |Anaconda, Inc.| (default, Nov  3 2017, 19:19:16) \r\n[GCC 7.2.0]\r\nNumPy 1.13.1\r\nSciPy 0.19.1\r\nScikit-Learn 0.20.dev0\r\nPandas 0.21.0\r\n\nwarn_on_dtype with DataFrame\n#### Description\r\n\r\n``warn_on_dtype`` has no effect when input is a pandas ``DataFrame``\r\n\r\n#### Steps/Code to Reproduce\r\n```python\r\nfrom sklearn.utils.validation import check_array\r\nimport pandas as pd\r\ndf = pd.DataFrame([[1, 2, 3], [2, 3, 4]], dtype=object)\r\nchecked = check_array(df, warn_on_dtype=True)\r\n```\r\n\r\n#### Expected result: \r\n\r\n```python-traceback\r\nDataConversionWarning: Data with input dtype object was converted to float64.\r\n```\r\n\r\n#### Actual Results\r\nNo warning is thrown\r\n\r\n#### Versions\r\nLinux-4.4.0-116-generic-x86_64-with-debian-stretch-sid\r\nPython 3.6.3 |Anaconda, Inc.| (default, Nov  3 2017, 19:19:16) \r\n[GCC 7.2.0]\r\nNumPy 1.13.1\r\nSciPy 0.19.1\r\nScikit-Learn 0.20.dev0\r\nPandas 0.21.0\r\n\n",
  "patch": "diff --git a/sklearn/utils/validation.py b/sklearn/utils/validation.py\n--- a/sklearn/utils/validation.py\n+++ b/sklearn/utils/validation.py\n@@ -466,6 +466,12 @@ def check_array(array, accept_sparse=False, accept_large_sparse=True,\n         # not a data type (e.g. a column named dtype in a pandas DataFrame)\n         dtype_orig = None\n \n+    # check if the object contains several dtypes (typically a pandas\n+    # DataFrame), and store them. If not, store None.\n+    dtypes_orig = None\n+    if hasattr(array, \"dtypes\") and hasattr(array, \"__array__\"):\n+        dtypes_orig = np.array(array.dtypes)\n+\n     if dtype_numeric:\n         if dtype_orig is not None and dtype_orig.kind == \"O\":\n             # if input is object, convert to float.\n@@ -581,6 +587,16 @@ def check_array(array, accept_sparse=False, accept_large_sparse=True,\n     if copy and np.may_share_memory(array, array_orig):\n         array = np.array(array, dtype=dtype, order=order)\n \n+    if (warn_on_dtype and dtypes_orig is not None and\n+            {array.dtype} != set(dtypes_orig)):\n+        # if there was at the beginning some other types than the final one\n+        # (for instance in a DataFrame that can contain several dtypes) then\n+        # some data must have been converted\n+        msg = (\"Data with input dtype %s were all converted to %s%s.\"\n+               % (', '.join(map(str, sorted(set(dtypes_orig)))), array.dtype,\n+                  context))\n+        warnings.warn(msg, DataConversionWarning, stacklevel=3)\n+\n     return array\n \n \n",
  "similar_bug_items": [
    {
      "pr_number": 4894,
      "pr_title": "fix dtype transform problem in KNN and RandomForest",
      "pr_body": "Fix #4879\n\nI add a int array to store the indices before the loop. \n@amueller I also checked LabelEncoder but it can only accommodate 1d array, so a loop is still needed and slicing problem will still be there. \nFor MultiLabelBinarizer, it can accommodate 2d array, but it will return a 2d binary array to store the indices, which is very different from the original implementation in RandomForest and KNN. I am wondering is there any better way to do so? \n",
      "issue_id": 4879,
      "issue_title": "RandomForestClassifier changes label values",
      "issue_body": "Consider the following example:\n\n``` python\nimport numpy as np\nfrom sklearn.ensemble import RandomForestClassifier\n\nCLASSES = 15\nX = np.arange(CLASSES).reshape(-1, 1)\ny = [ch for ch in 'ABCDEFGHIJKLMNOPQRSTU'[:CLASSES]]\n\nprint(RandomForestClassifier().fit(X, y).predict(X))\n```\n\nThe output is\n\n`['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'B' 'B' 'B' 'B' 'B']`\n\nwhereas I expect it to be\n\n`['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O']`\n\nI traced the cause of the error to [here](https://github.com/scikit-learn/scikit-learn/blob/733629d256abaf9a31d6e6305f859768432907ed/sklearn/ensemble/forest.py#L418). I'm happy to put in a pull request, but can others please verify they get the same unexpected behaviour first?\n\nSystem information:\n``\nCPython 3.3.5\nIPython 3.1.0\n\nnumpy 1.9.2\nscipy 0.14.0\nscikit-learn 0.16.1(conda)\n\ncompiler   : GCC 4.2.1 (Apple Inc. build 5577)\nsystem     : Darwin\nrelease    : 13.4.0\nmachine    : x86_64\nprocessor  : i386\nCPU cores  : 4\ninterpreter: 64bit\n\n```\n```\n",
      "issue_closed_at": "2015-06-24T21:33:10Z",
      "base_commit": "cd98ffced656569681648c88c51dae844501643c",
      "changes": [
        {
          "file": "sklearn/ensemble/forest.py",
          "type": "function",
          "name": "_validate_y_class_weight",
          "class_name": "ForestClassifier",
          "code": "def _validate_y_class_weight(self, y):\n        y = np.copy(y)\n        expanded_class_weight = None\n\n        if self.class_weight is not None:\n            y_original = np.copy(y)\n\n        self.classes_ = []\n        self.n_classes_ = []\n\n        for k in range(self.n_outputs_):\n            classes_k, y[:, k] = np.unique(y[:, k], return_inverse=True)\n            self.classes_.append(classes_k)\n            self.n_classes_.append(classes_k.shape[0])\n\n        if self.class_weight is not None:\n            valid_presets = ('auto', 'balanced', 'balanced_subsample', 'subsample', 'auto')\n            if isinstance(self.class_weight, six.string_types):\n                if self.class_weight not in valid_presets:\n                    raise ValueError('Valid presets for class_weight include '\n                                     '\"balanced\" and \"balanced_subsample\". Given \"%s\".'\n                                     % self.class_weight)\n                if self.class_weight == \"subsample\":\n                    warn(\"class_weight='subsample' is deprecated and will be removed in 0.18.\"\n                         \" It was replaced by class_weight='balanced_subsample' \"\n                         \"using the balanced strategy.\", DeprecationWarning)\n                if self.warm_start:\n                    warn('class_weight presets \"balanced\" or \"balanced_subsample\" are '\n                         'not recommended for warm_start if the fitted data '\n                         'differs from the full dataset. In order to use '\n                         '\"balanced\" weights, use compute_class_weight(\"balanced\", '\n                         'classes, y). In place of y you can use a large '\n                         'enough sample of the full training set target to '\n                         'properly estimate the class frequency '\n                         'distributions. Pass the resulting weights as the '\n                         'class_weight parameter.')\n\n            if (self.class_weight not in ['subsample', 'balanced_subsample'] or\n                    not self.bootstrap):\n                if self.class_weight == 'subsample':\n                    class_weight = 'auto'\n                elif self.class_weight == \"balanced_subsample\":\n                    class_weight = \"balanced\"\n                else:\n                    class_weight = self.class_weight\n                with warnings.catch_warnings():\n                    if class_weight == \"auto\":\n                        warnings.simplefilter('ignore', DeprecationWarning)\n                    expanded_class_weight = compute_sample_weight(class_weight,\n                                                                  y_original)\n\n        return y, expanded_class_weight"
        },
        {
          "file": "sklearn/tree/tree.py",
          "type": "function",
          "name": "fit",
          "class_name": "BaseDecisionTree",
          "code": "def fit(self, X, y, sample_weight=None, check_input=True):\n        \"\"\"Build a decision tree from the training set (X, y).\n\n        Parameters\n        ----------\n        X : array-like or sparse matrix, shape = [n_samples, n_features]\n            The training input samples. Internally, it will be converted to\n            ``dtype=np.float32`` and if a sparse matrix is provided\n            to a sparse ``csc_matrix``.\n\n        y : array-like, shape = [n_samples] or [n_samples, n_outputs]\n            The target values (class labels in classification, real numbers in\n            regression). In the regression case, use ``dtype=np.float64`` and\n            ``order='C'`` for maximum efficiency.\n\n        sample_weight : array-like, shape = [n_samples] or None\n            Sample weights. If None, then samples are equally weighted. Splits\n            that would create child nodes with net zero or negative weight are\n            ignored while searching for a split in each node. In the case of\n            classification, splits are also ignored if they would result in any\n            single class carrying a negative weight in either child node.\n\n        check_input : boolean, (default=True)\n            Allow to bypass several input checking.\n            Don't use this parameter unless you know what you do.\n\n        Returns\n        -------\n        self : object\n            Returns self.\n        \"\"\"\n        random_state = check_random_state(self.random_state)\n        if check_input:\n            X = check_array(X, dtype=DTYPE, accept_sparse=\"csc\")\n            if issparse(X):\n                X.sort_indices()\n\n                if X.indices.dtype != np.intc or X.indptr.dtype != np.intc:\n                    raise ValueError(\"No support for np.int64 index based \"\n                                     \"sparse matrices\")\n\n        # Determine output settings\n        n_samples, self.n_features_ = X.shape\n        is_classification = isinstance(self, ClassifierMixin)\n\n        y = np.atleast_1d(y)\n        expanded_class_weight = None\n\n        if y.ndim == 1:\n            # reshape is necessary to preserve the data contiguity against vs\n            # [:, np.newaxis] that does not.\n            y = np.reshape(y, (-1, 1))\n\n        self.n_outputs_ = y.shape[1]\n\n        if is_classification:\n            y = np.copy(y)\n\n            self.classes_ = []\n            self.n_classes_ = []\n\n            if self.class_weight is not None:\n                y_original = np.copy(y)\n\n            for k in range(self.n_outputs_):\n                classes_k, y[:, k] = np.unique(y[:, k], return_inverse=True)\n                self.classes_.append(classes_k)\n                self.n_classes_.append(classes_k.shape[0])\n\n            if self.class_weight is not None:\n                expanded_class_weight = compute_sample_weight(\n                    self.class_weight, y_original)\n\n        else:\n            self.classes_ = [None] * self.n_outputs_\n            self.n_classes_ = [1] * self.n_outputs_\n\n        self.n_classes_ = np.array(self.n_classes_, dtype=np.intp)\n\n        if getattr(y, \"dtype\", None) != DOUBLE or not y.flags.contiguous:\n            y = np.ascontiguousarray(y, dtype=DOUBLE)\n\n        # Check parameters\n        max_depth = ((2 ** 31) - 1 if self.max_depth is None\n                     else self.max_depth)\n        max_leaf_nodes = (-1 if self.max_leaf_nodes is None\n                          else self.max_leaf_nodes)\n\n        if isinstance(self.max_features, six.string_types):\n            if self.max_features == \"auto\":\n                if is_classification:\n                    max_features = max(1, int(np.sqrt(self.n_features_)))\n                else:\n                    max_features = self.n_features_\n            elif self.max_features == \"sqrt\":\n                max_features = max(1, int(np.sqrt(self.n_features_)))\n            elif self.max_features == \"log2\":\n                max_features = max(1, int(np.log2(self.n_features_)))\n            else:\n                raise ValueError(\n                    'Invalid value for max_features. Allowed string '\n                    'values are \"auto\", \"sqrt\" or \"log2\".')\n        elif self.max_features is None:\n            max_features = self.n_features_\n        elif isinstance(self.max_features, (numbers.Integral, np.integer)):\n            max_features = self.max_features\n        else:  # float\n            if self.max_features > 0.0:\n                max_features = max(1, int(self.max_features * self.n_features_))\n            else:\n                max_features = 0\n\n        self.max_features_ = max_features\n\n        if len(y) != n_samples:\n            raise ValueError(\"Number of labels=%d does not match \"\n                             \"number of samples=%d\" % (len(y), n_samples))\n        if self.min_samples_split <= 0:\n            raise ValueError(\"min_samples_split must be greater than zero.\")\n        if self.min_samples_leaf <= 0:\n            raise ValueError(\"min_samples_leaf must be greater than zero.\")\n        if not 0 <= self.min_weight_fraction_leaf <= 0.5:\n            raise ValueError(\"min_weight_fraction_leaf must in [0, 0.5]\")\n        if max_depth <= 0:\n            raise ValueError(\"max_depth must be greater than zero. \")\n        if not (0 < max_features <= self.n_features_):\n            raise ValueError(\"max_features must be in (0, n_features]\")\n        if not isinstance(max_leaf_nodes, (numbers.Integral, np.integer)):\n            raise ValueError(\"max_leaf_nodes must be integral number but was \"\n                             \"%r\" % max_leaf_nodes)\n        if -1 < max_leaf_nodes < 2:\n            raise ValueError((\"max_leaf_nodes {0} must be either smaller than \"\n                              \"0 or larger than 1\").format(max_leaf_nodes))\n\n        if sample_weight is not None:\n            if (getattr(sample_weight, \"dtype\", None) != DOUBLE or\n                    not sample_weight.flags.contiguous):\n                sample_weight = np.ascontiguousarray(\n                    sample_weight, dtype=DOUBLE)\n            if len(sample_weight.shape) > 1:\n                raise ValueError(\"Sample weights array has more \"\n                                 \"than one dimension: %d\" %\n                                 len(sample_weight.shape))\n            if len(sample_weight) != n_samples:\n                raise ValueError(\"Number of weights=%d does not match \"\n                                 \"number of samples=%d\" %\n                                 (len(sample_weight), n_samples))\n\n        if expanded_class_weight is not None:\n            if sample_weight is not None:\n                sample_weight = sample_weight * expanded_class_weight\n            else:\n                sample_weight = expanded_class_weight\n\n        # Set min_weight_leaf from min_weight_fraction_leaf\n        if self.min_weight_fraction_leaf != 0. and sample_weight is not None:\n            min_weight_leaf = (self.min_weight_fraction_leaf *\n                               np.sum(sample_weight))\n        else:\n            min_weight_leaf = 0.\n\n        # Set min_samples_split sensibly\n        min_samples_split = max(self.min_samples_split,\n                                2 * self.min_samples_leaf)\n\n        # Build tree\n        criterion = self.criterion\n        if not isinstance(criterion, Criterion):\n            if is_classification:\n                criterion = CRITERIA_CLF[self.criterion](self.n_outputs_,\n                                                         self.n_classes_)\n            else:\n                criterion = CRITERIA_REG[self.criterion](self.n_outputs_)\n\n        SPLITTERS = SPARSE_SPLITTERS if issparse(X) else DENSE_SPLITTERS\n\n        splitter = self.splitter\n        if not isinstance(self.splitter, Splitter):\n            splitter = SPLITTERS[self.splitter](criterion,\n                                                self.max_features_,\n                                                self.min_samples_leaf,\n                                                min_weight_leaf,\n                                                random_state)\n\n        self.tree_ = Tree(self.n_features_, self.n_classes_, self.n_outputs_)\n\n        # Use BestFirst if max_leaf_nodes given; use DepthFirst otherwise\n        if max_leaf_nodes < 0:\n            builder = DepthFirstTreeBuilder(splitter, min_samples_split,\n                                            self.min_samples_leaf,\n                                            min_weight_leaf,\n                                            max_depth)\n        else:\n            builder = BestFirstTreeBuilder(splitter, min_samples_split,\n                                           self.min_samples_leaf,\n                                           min_weight_leaf,\n                                           max_depth,\n                                           max_leaf_nodes)\n\n        builder.build(self.tree_, X, y, sample_weight)\n\n        if self.n_outputs_ == 1:\n            self.n_classes_ = self.n_classes_[0]\n            self.classes_ = self.classes_[0]\n\n        return self"
        }
      ]
    },
    {
      "pr_number": 7594,
      "pr_title": "[MRG+1] FIX Make sure GridSearchCV and RandomizedSearchCV are pickle-able",
      "pr_body": "Fixes #7562 \n- Subclasses the `np.ma.MaskedArray` and overrides the `__getstate__` to make obj dtyped `MaskedArray`s pickle-able.\n- Uses this fixed `utils.fixes.MaskedArray` inside `gs.cv_results_`...\n\nThis is based off of https://github.com/numpy/numpy/pull/8122\n\nPlease review @jnothman @amueller @GaelVaroquaux @davechallis\n",
      "issue_id": 7562,
      "issue_title": "Error unpickling RandomizedSearchCV objects in 0.18 due to masked arrays",
      "issue_body": "#### Description\n\nIn version 0.18, loading pickles of fitted RandomizedSearchCV objects results in a `TypeError` exception (from pickle also created with version 0.18).\n\nThe error seems related to the use of masked arrays in the `RandomizedSearchCV.cv_results_` attribute - clearing this before pickling (i.e. setting to to `None`) allows pickling/unpickling to work.\n#### Steps/Code to Reproduce\n\n```\nimport pickle                                                                   \nfrom sklearn.model_selection import RandomizedSearchCV                          \nfrom sklearn.ensemble import RandomForestClassifier                             \n\nX = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]                            \ny = [1, 1, 1, 0, 0, 0]                                                          \n\nmodel = RandomizedSearchCV(RandomForestClassifier(),                            \n                           {'n_estimators': [5, 10, 20]},                       \n                           n_iter=3)                                            \nmodel.fit(X, y)                                                                 \n\nwith open('model.pkl', 'wb') as fh:                                             \n    pickle.dump(model, fh)                                                      \n\nwith open('model.pkl', 'rb') as fh:                                             \n    model = pickle.load(fh)\n\nprint(model.predict(X))\n```\n#### Expected Results\n\n```\n[1, 1, 1, 0, 0, 0]\n```\n#### Actual Results\n\n```\nTraceback (most recent call last):\n  File \"./t.py\", line 19, in <module>\n    model = pickle.load(fh)\n  File \"/Users/dsc/miniconda3/envs/p3/lib/python3.5/site-packages/numpy/ma/core.py\", line 5863, in __setstate__\n    super(MaskedArray, self).__setstate__((shp, typ, isf, raw))\nTypeError: object pickle not returning list\n```\n#### Versions\n\nPython 3.5.1 |Continuum Analytics, Inc.| (default, Dec  7 2015, 11:24:55) \n[GCC 4.2.1 (Apple Inc. build 5577)]\nNumPy 1.11.1\nSciPy 0.18.1\nScikit-Learn 0.18\n",
      "issue_closed_at": "2016-10-10T19:33:44Z",
      "base_commit": "33ed90dc0aa0549a5963000d7d070aa18ca389c4",
      "changes": [
        {
          "file": "sklearn/model_selection/_search.py",
          "type": "line",
          "name": "line 30",
          "code": "from ..utils import check_random_state\nfrom ..utils.fixes import sp_version\nfrom ..utils.fixes import rankdata\nfrom ..utils.random import sample_without_replacement\nfrom ..utils.validation import indexable, check_is_fitted\nfrom ..utils.metaestimators import if_delegate_has_method"
        },
        {
          "file": "sklearn/model_selection/_search.py",
          "type": "function",
          "name": "_store",
          "class_name": "BaseSearchCV",
          "code": "def _store(key_name, array, weights=None, splits=False, rank=False):\n            \"\"\"A small helper to store the scores/times to the cv_results_\"\"\"\n            array = np.array(array, dtype=np.float64).reshape(n_candidates,\n                                                              n_splits)\n            if splits:\n                for split_i in range(n_splits):\n                    results[\"split%d_%s\"\n                            % (split_i, key_name)] = array[:, split_i]\n\n            array_means = np.average(array, axis=1, weights=weights)\n            results['mean_%s' % key_name] = array_means\n            # Weighted std is not directly available in numpy\n            array_stds = np.sqrt(np.average((array -\n                                             array_means[:, np.newaxis]) ** 2,\n                                            axis=1, weights=weights))\n            results['std_%s' % key_name] = array_stds\n\n            if rank:\n                results[\"rank_%s\" % key_name] = np.asarray(\n                    rankdata(-array_means, method='min'), dtype=np.int32)"
        },
        {
          "file": "sklearn/utils/fixes.py",
          "type": "function",
          "name": "rankdata",
          "class_name": null,
          "code": "def rankdata(a, method='average'):\n        if method not in ('average', 'min', 'max', 'dense', 'ordinal'):\n            raise ValueError('unknown method \"{0}\"'.format(method))\n\n        arr = np.ravel(np.asarray(a))\n        algo = 'mergesort' if method == 'ordinal' else 'quicksort'\n        sorter = np.argsort(arr, kind=algo)\n\n        inv = np.empty(sorter.size, dtype=np.intp)\n        inv[sorter] = np.arange(sorter.size, dtype=np.intp)\n\n        if method == 'ordinal':\n            return inv + 1\n\n        arr = arr[sorter]\n        obs = np.r_[True, arr[1:] != arr[:-1]]\n        dense = obs.cumsum()[inv]\n\n        if method == 'dense':\n            return dense\n\n        # cumulative counts of each unique value\n        count = np.r_[np.nonzero(obs)[0], len(obs)]\n\n        if method == 'max':\n            return count[dense]\n\n        if method == 'min':\n            return count[dense - 1] + 1\n\n        # average method\n        return .5 * (count[dense] + count[dense - 1] + 1)"
        }
      ]
    },
    {
      "pr_number": 9655,
      "pr_title": "[MRG+1] Return nan in RadiusNeighborsRegressor for empty neighbor set",
      "pr_body": "#### Reference Issue\r\nFixes #9654 \r\n\r\n\r\n\r\n#### What does this implement/fix? Explain your changes.\r\nRadiusNeighborsRegressor is behaving differently when there are no neighbors for a sample between when weights are or aren't used. This PR fixes this inconsistency. This PR also fixes raised error when no available data points for `RadiusNeighborRegression` using non-uniform weights.",
      "issue_id": 9654,
      "issue_title": "RadiusNeighborRegression error",
      "issue_body": "#### Description\r\nRadiusNeighborRegression has inconsistent output depending whether using weights that are uniform or by distance. \r\n\r\n- When using uniform weights then if no observations are available within the specified radius of an observation point then no it returns `np.nan`, \r\n\r\n- When using distance weights it raises `ZeroDivisionError: Weights sum to zero, can't be normalized` is raised from `np.average` as there are no weights to use for the observation. This is demonstrated with the following example below - copied from `RadiusNeighborsRegressor` where distance is specified and for X used for prediction is different,\r\n\r\n#### Steps/Code to Reproduce\r\n```python\r\nfrom sklearn.neighbors import RadiusNeighborsRegressor   \r\n\r\nX = [[0], [1], [2], [3]]\r\ny = [[0], [0], [1], [1]]\r\n\r\nneigh = RadiusNeighborsRegressor(radius=1.0, weights='distance')\r\nneigh.fit(X, y) \r\n\r\ny_hat = neigh.predict([[-2],[0]])\r\n```\r\n\r\n#### Expected Results\r\n```python\r\narray([[ nan],\r\n       [  0.]])\r\n```\r\n\r\n#### Actual Results\r\n```\r\n---------------------------------------------------------------------------\r\nZeroDivisionError                         Traceback (most recent call last)\r\n<ipython-input-6-c1cb420157ca> in <module>()\r\n      7 neigh.fit(X, y)\r\n      8 \r\n----> 9 y_hat = neigh.predict([[-1.5],[1]])\r\n\r\n~\\AppData\\Local\\Continuum\\Miniconda3\\lib\\site-packages\\sklearn\\neighbors\\regression.py in predict(self, X)\r\n    294             y_pred = np.array([(np.average(_y[ind, :], axis=0,\r\n    295                                            weights=weights[i]))\r\n--> 296                                for (i, ind) in enumerate(neigh_ind)])\r\n    297 \r\n    298         if self._y.ndim == 1:\r\n\r\n~\\AppData\\Local\\Continuum\\Miniconda3\\lib\\site-packages\\sklearn\\neighbors\\regression.py in <listcomp>(.0)\r\n    294             y_pred = np.array([(np.average(_y[ind, :], axis=0,\r\n    295                                            weights=weights[i]))\r\n--> 296                                for (i, ind) in enumerate(neigh_ind)])\r\n    297 \r\n    298         if self._y.ndim == 1:\r\n\r\n~\\AppData\\Local\\Continuum\\Miniconda3\\lib\\site-packages\\numpy\\lib\\function_base.py in average(a, axis, weights, returned)\r\n   1138         if np.any(scl == 0.0):\r\n   1139             raise ZeroDivisionError(\r\n-> 1140                 \"Weights sum to zero, can't be normalized\")\r\n   1141 \r\n   1142         avg = np.multiply(a, wgt, dtype=result_dtype).sum(axis)/scl\r\n\r\nZeroDivisionError: Weights sum to zero, can't be normalized\r\n```\r\n\r\n\r\n#### Versions\r\nWindows-10-10.0.15063-SP0\r\nPython 3.6.1 |Continuum Analytics, Inc.| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]\r\nNumPy 1.13.1\r\nSciPy 0.19.1\r\nScikit-Learn 0.18.2\r\n\r\nI encountered the same problem on Mac OS and Linux machines.",
      "issue_closed_at": "2017-11-21T09:07:55Z",
      "base_commit": "bbdcd70fcaa875e78a009d89f755bba602a861be",
      "changes": [
        {
          "file": "sklearn/neighbors/regression.py",
          "type": "line",
          "name": "line 5",
          "code": "#          Alexandre Gramfort <alexandre.gramfort@inria.fr>\n#          Sparseness support by Lars Buitinck\n#          Multi-output support by Arnaud Joly <a.joly@ulg.ac.be>\n#\n# License: BSD 3 clause (C) INRIA, University of Amsterdam\n\nimport numpy as np\nfrom scipy.sparse import issparse"
        },
        {
          "file": "sklearn/neighbors/regression.py",
          "type": "function",
          "name": "predict",
          "class_name": "RadiusNeighborsRegressor",
          "code": "def predict(self, X):\n        \"\"\"Predict the target for the provided data\n\n        Parameters\n        ----------\n        X : array-like, shape (n_query, n_features), \\\n                or (n_query, n_indexed) if metric == 'precomputed'\n            Test samples.\n\n        Returns\n        -------\n        y : array of int, shape = [n_samples] or [n_samples, n_outputs]\n            Target values\n        \"\"\"\n        X = check_array(X, accept_sparse='csr')\n\n        neigh_dist, neigh_ind = self.radius_neighbors(X)\n\n        weights = _get_weights(neigh_dist, self.weights)\n\n        _y = self._y\n        if _y.ndim == 1:\n            _y = _y.reshape((-1, 1))\n\n        if weights is None:\n            y_pred = np.array([np.mean(_y[ind, :], axis=0)\n                               for ind in neigh_ind])\n        else:\n            y_pred = np.array([(np.average(_y[ind, :], axis=0,\n                                           weights=weights[i]))\n                               for (i, ind) in enumerate(neigh_ind)])\n\n        if self._y.ndim == 1:\n            y_pred = y_pred.ravel()\n\n        return y_pred"
        },
        {
          "file": "sklearn/neighbors/regression.py",
          "type": "function",
          "name": "predict",
          "class_name": "RadiusNeighborsRegressor",
          "code": "def predict(self, X):\n        \"\"\"Predict the target for the provided data\n\n        Parameters\n        ----------\n        X : array-like, shape (n_query, n_features), \\\n                or (n_query, n_indexed) if metric == 'precomputed'\n            Test samples.\n\n        Returns\n        -------\n        y : array of int, shape = [n_samples] or [n_samples, n_outputs]\n            Target values\n        \"\"\"\n        X = check_array(X, accept_sparse='csr')\n\n        neigh_dist, neigh_ind = self.radius_neighbors(X)\n\n        weights = _get_weights(neigh_dist, self.weights)\n\n        _y = self._y\n        if _y.ndim == 1:\n            _y = _y.reshape((-1, 1))\n\n        if weights is None:\n            y_pred = np.array([np.mean(_y[ind, :], axis=0)\n                               for ind in neigh_ind])\n        else:\n            y_pred = np.array([(np.average(_y[ind, :], axis=0,\n                                           weights=weights[i]))\n                               for (i, ind) in enumerate(neigh_ind)])\n\n        if self._y.ndim == 1:\n            y_pred = y_pred.ravel()\n\n        return y_pred"
        }
      ]
    },
    {
      "pr_number": 7069,
      "pr_title": "DummyClassifier and DummyRegressor raise NotFittedError",
      "pr_body": "<!--\nThanks for contributing a pull request! Please ensure you have taken a look at\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#Contributing-Pull-Requests\n-->\n#### Reference Issue\n\nFixes #7065\n\n<!-- Example: Fixes #1234 -->\n#### What does this implement/fix? Explain your changes.\n\nDummyClassifier and DummyRegressor raise NotFittedError\n#### Any other comments?\n\n<!--\nPlease be aware that we are a loose team of volunteers so patience is\nnecessary; assistance handling other issues is very welcome. We value\nall user contributions, no matter how minor they are. If we are slow to\nreview, either the pull request needs some benchmarking, tinkering,\nconvincing, etc. or more likely the reviewers are simply busy. In either\ncase, we ask for your understanding during the review process.\nFor more information, see our FAQ on this topic:\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\n\nThanks for contributing!\n-->\n",
      "issue_id": 7065,
      "issue_title": "DummyRegressor raises ValueError instead of NotFittedError",
      "issue_body": "#### Description\n\ntrying to call predict on an instance of DummyRegressor that has not been fitted raises ValueError. I think it should be NotFittedError.\n#### Steps/Code to Reproduce\n\n```\n>>>from sklearn.dummy import DummyRegressor\n>>>clf = DummyRegressor()\n>>>clf.predict(np.zeros((10,10)))\n```\n#### Expected Results\n\nNotFittedError\n#### Actual Results\n\nValueError\n\n<!--\nIf your issue is a usage question, submit it here instead:\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\n-->\n\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\n\n<!-- Example: Joblib Error thrown when calling fit on LatentDirichletAllocation with evaluate_every > 0-->\n\n<!--\nExample:\n```\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom sklearn.decomposition import LatentDirichletAllocation\n\ndocs = [\"Help I have a bug\" for i in range(1000)]\n\nvectorizer = CountVectorizer(input=docs, analyzer='word')\nlda_features = vectorizer.fit_transform(docs)\n\nlda_model = LatentDirichletAllocation(\n    n_topics=10,\n    learning_method='online',\n    evaluate_every=10,\n    n_jobs=4,\n)\nmodel = lda_model.fit(lda_features)\n```\nIf the code is too long, feel free to put it in a public gist and link\nit in the issue: https://gist.github.com\n-->\n#### Versions\n\n<!--\nPlease run the following snippet and paste the output below.\nimport platform; print(platform.platform())\nimport sys; print(\"Python\", sys.version)\nimport numpy; print(\"NumPy\", numpy.__version__)\nimport scipy; print(\"SciPy\", scipy.__version__)\nimport sklearn; print(\"Scikit-Learn\", sklearn.__version__)\n-->\n\nLinux-3.19.0-47-generic-x86_64-with-Ubuntu-14.04-trusty\n('Python', '2.7.6 (default, Jun 22 2015, 17:58:13) \\n[GCC 4.8.2]')\n('NumPy', '1.11.0')\n('SciPy', '0.16.1')\n('Scikit-Learn', '0.17')\n\n<!-- Thanks for contributing! -->\n",
      "issue_closed_at": "2016-07-25T07:53:18Z",
      "base_commit": "7a7e8091c73abc59de4bb71f577b020cd2572c38",
      "changes": [
        {
          "file": "sklearn/dummy.py",
          "type": "line",
          "name": "line 12",
          "code": "from .utils import check_random_state\nfrom .utils.validation import check_array\nfrom .utils.validation import check_consistent_length\nfrom .utils.random import random_choice_csc\nfrom .utils.stats import _weighted_percentile\nfrom .utils.multiclass import class_distribution"
        },
        {
          "file": "sklearn/dummy.py",
          "type": "function",
          "name": "predict",
          "class_name": "DummyRegressor",
          "code": "def predict(self, X):\n        \"\"\"\n        Perform classification on test vectors X.\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix}, shape = [n_samples, n_features]\n            Input vectors, where n_samples is the number of samples\n            and n_features is the number of features.\n\n        Returns\n        -------\n        y : array, shape = [n_samples]  or [n_samples, n_outputs]\n            Predicted target values for X.\n        \"\"\"\n        if not hasattr(self, \"constant_\"):\n            raise ValueError(\"DummyRegressor not fitted.\")\n\n        X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])\n        n_samples = X.shape[0]\n\n        y = np.ones((n_samples, 1)) * self.constant_\n\n        if self.n_outputs_ == 1 and not self.output_2d_:\n            y = np.ravel(y)\n\n        return y"
        },
        {
          "file": "sklearn/dummy.py",
          "type": "function",
          "name": "predict_proba",
          "class_name": "DummyClassifier",
          "code": "def predict_proba(self, X):\n        \"\"\"\n        Return probability estimates for the test vectors X.\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix}, shape = [n_samples, n_features]\n            Input vectors, where n_samples is the number of samples\n            and n_features is the number of features.\n\n        Returns\n        -------\n        P : array-like or list of array-lke of shape = [n_samples, n_classes]\n            Returns the probability of the sample for each class in\n            the model, where classes are ordered arithmetically, for each\n            output.\n        \"\"\"\n        if not hasattr(self, \"classes_\"):\n            raise ValueError(\"DummyClassifier not fitted.\")\n\n        X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])\n        # numpy random_state expects Python int and not long as size argument\n        # under Windows\n        n_samples = int(X.shape[0])\n        rs = check_random_state(self.random_state)\n\n        n_classes_ = self.n_classes_\n        classes_ = self.classes_\n        class_prior_ = self.class_prior_\n        constant = self.constant\n        if self.n_outputs_ == 1 and not self.output_2d_:\n            # Get same type even for self.n_outputs_ == 1\n            n_classes_ = [n_classes_]\n            classes_ = [classes_]\n            class_prior_ = [class_prior_]\n            constant = [constant]\n\n        P = []\n        for k in range(self.n_outputs_):\n            if self.strategy == \"most_frequent\":\n                ind = class_prior_[k].argmax()\n                out = np.zeros((n_samples, n_classes_[k]), dtype=np.float64)\n                out[:, ind] = 1.0\n            elif self.strategy == \"prior\":\n                out = np.ones((n_samples, 1)) * class_prior_[k]\n\n            elif self.strategy == \"stratified\":\n                out = rs.multinomial(1, class_prior_[k], size=n_samples)\n\n            elif self.strategy == \"uniform\":\n                out = np.ones((n_samples, n_classes_[k]), dtype=np.float64)\n                out /= n_classes_[k]\n\n            elif self.strategy == \"constant\":\n                ind = np.where(classes_[k] == constant[k])\n                out = np.zeros((n_samples, n_classes_[k]), dtype=np.float64)\n                out[:, ind] = 1.0\n\n            P.append(out)\n\n        if self.n_outputs_ == 1 and not self.output_2d_:\n            P = P[0]\n\n        return P"
        },
        {
          "file": "sklearn/dummy.py",
          "type": "function",
          "name": "predict",
          "class_name": "DummyRegressor",
          "code": "def predict(self, X):\n        \"\"\"\n        Perform classification on test vectors X.\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix}, shape = [n_samples, n_features]\n            Input vectors, where n_samples is the number of samples\n            and n_features is the number of features.\n\n        Returns\n        -------\n        y : array, shape = [n_samples]  or [n_samples, n_outputs]\n            Predicted target values for X.\n        \"\"\"\n        if not hasattr(self, \"constant_\"):\n            raise ValueError(\"DummyRegressor not fitted.\")\n\n        X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])\n        n_samples = X.shape[0]\n\n        y = np.ones((n_samples, 1)) * self.constant_\n\n        if self.n_outputs_ == 1 and not self.output_2d_:\n            y = np.ravel(y)\n\n        return y"
        }
      ]
    },
    {
      "pr_number": 4146,
      "pr_title": "[MRG + 1] Fdr treshold bug",
      "pr_body": "Continues #2932. Fixes #2771.\nThese are some minor fixes on top of #2932, where @bthirion already gave his +1.\nMaybe @arjoly wants to have a look as he commented there.\nThis is a good bug fix that I think we should include asap.\n\nFYI tests take .5s.\n",
      "issue_id": 2771,
      "issue_title": "SelectFdr has serious thresholding bug",
      "issue_body": "The current code reads like:\n\n```\ndef _get_support_mask(self):\n    alpha = self.alpha\n    sv = np.sort(self.pvalues_)\n    threshold = sv[sv < alpha * np.arange(len(self.pvalues_))].max()\n    return self.pvalues_ <= threshold\n```\n\nBut this doesn't actually control FDR at all, the correct implementation should have:\n\n```\n    bf_alpha = alpha / len(self.pvalues_)\n    threshold = sv[sv < bf_alpha * np.arange(len(self.pvalues_))].max()\n```\n\nNote the k / m term in the equation at:\nhttp://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure\n",
      "issue_closed_at": "2015-02-24T22:15:19Z",
      "base_commit": "f6af4881a5a66fb21688379b39f9304898a11bc0",
      "changes": [
        {
          "file": "sklearn/feature_selection/univariate_selection.py",
          "type": "function",
          "name": "_get_support_mask",
          "class_name": "GenericUnivariateSelect",
          "code": "def _get_support_mask(self):\n        check_is_fitted(self, 'scores_')\n\n        selector = self._make_selector()\n        selector.pvalues_ = self.pvalues_\n        selector.scores_ = self.scores_\n        return selector._get_support_mask()"
        },
        {
          "file": "sklearn/feature_selection/univariate_selection.py",
          "type": "class",
          "name": "SelectFdr",
          "code": "class SelectFdr(_BaseFilter):\n    \"\"\"Filter: Select the p-values for an estimated false discovery rate\n\n    This uses the Benjamini-Hochberg procedure. ``alpha`` is the target false\n    discovery rate.\n\n    Parameters\n    ----------\n    score_func : callable\n        Function taking two arrays X and y, and returning a pair of arrays\n        (scores, pvalues).\n\n    alpha : float, optional\n        The highest uncorrected p-value for features to keep.\n\n\n    Attributes\n    ----------\n    scores_ : array-like, shape=(n_features,)\n        Scores of features.\n\n    pvalues_ : array-like, shape=(n_features,)\n        p-values of feature scores.\n    \"\"\"\n\n    def __init__(self, score_func=f_classif, alpha=5e-2):\n        super(SelectFdr, self).__init__(score_func)\n        self.alpha = alpha\n\n    def _get_support_mask(self):\n        check_is_fitted(self, 'scores_')\n\n        alpha = self.alpha\n        sv = np.sort(self.pvalues_)\n        selected = sv[sv < alpha * np.arange(len(self.pvalues_))]\n        if selected.size == 0:\n            return np.zeros_like(self.pvalues_, dtype=bool)\n        return self.pvalues_ <= selected.max()"
        },
        {
          "file": "sklearn/feature_selection/univariate_selection.py",
          "type": "function",
          "name": "__init__",
          "class_name": "GenericUnivariateSelect",
          "code": "def __init__(self, score_func=f_classif, mode='percentile', param=1e-5):\n        super(GenericUnivariateSelect, self).__init__(score_func)\n        self.mode = mode\n        self.param = param"
        }
      ]
    }
  ]
}