{
  "instance_id": "scikit-learn__scikit-learn-12471",
  "repo": "scikit-learn/scikit-learn",
  "created_at": "2018-10-27T10:43:48Z",
  "problem_statement": "OneHotEncoder ignore unknown error when categories are strings \n#### Description\r\n\r\nThis bug is very specific, but it happens when you set OneHotEncoder to ignore unknown entries.\r\nand your labels are strings. The memory of the arrays is not handled safely and it can lead to a ValueError\r\n\r\nBasically, when you call the transform method it will sets all the unknown strings on your array to OneHotEncoder.categories_[i][0] which is the first category alphabetically sorted given for fit\r\nIf this OneHotEncoder.categories_[i][0] is a long string, and the array that you want to transform has small strings, then it is impossible to fit the whole  OneHotEncoder.categories_[i][0] into the entries of the array we want to transform. So  OneHotEncoder.categories_[i][0]  is truncated and this raise the ValueError.\r\n\r\n\r\n\r\n#### Steps/Code to Reproduce\r\n```\r\n\r\nimport numpy as np\r\nfrom sklearn.preprocessing import OneHotEncoder\r\n\r\n\r\n# It needs to be numpy arrays, the error does not appear \r\n# is you have lists of lists because it gets treated like an array of objects.\r\ntrain  = np.array([ '22','333','4444','11111111' ]).reshape((-1,1))\r\ntest   = np.array([ '55555',  '22' ]).reshape((-1,1))\r\n\r\nohe = OneHotEncoder(dtype=bool,handle_unknown='ignore')\r\n\r\nohe.fit( train )\r\nenc_test = ohe.transform( test )\r\n\r\n```\r\n\r\n\r\n#### Expected Results\r\nHere we should get an sparse matrix 2x4 false everywhere except at (1,1) the '22' that is known\r\n\r\n#### Actual Results\r\n\r\n> ValueError: y contains previously unseen labels: ['111111']\r\n\r\n\r\n#### Versions\r\nSystem:\r\n    python: 2.7.12 (default, Dec  4 2017, 14:50:18)  [GCC 5.4.0 20160609]\r\n   machine: Linux-4.4.0-138-generic-x86_64-with-Ubuntu-16.04-xenial\r\nexecutable: /usr/bin/python\r\n\r\nBLAS:\r\n    macros: HAVE_CBLAS=None\r\ncblas_libs: openblas, openblas\r\n  lib_dirs: /usr/lib\r\n\r\nPython deps:\r\n    Cython: 0.25.2\r\n     scipy: 0.18.1\r\nsetuptools: 36.7.0\r\n       pip: 9.0.1\r\n     numpy: 1.15.2\r\n    pandas: 0.19.1\r\n   sklearn: 0.21.dev0\r\n\r\n\r\n\r\n#### Comments\r\n\r\nI already implemented a fix for this issue, where I check the size of the elements in the array before, and I cast them into objects if necessary.\n",
  "patch": "diff --git a/sklearn/preprocessing/_encoders.py b/sklearn/preprocessing/_encoders.py\n--- a/sklearn/preprocessing/_encoders.py\n+++ b/sklearn/preprocessing/_encoders.py\n@@ -110,7 +110,14 @@ def _transform(self, X, handle_unknown='error'):\n                     # continue `The rows are marked `X_mask` and will be\n                     # removed later.\n                     X_mask[:, i] = valid_mask\n-                    Xi = Xi.copy()\n+                    # cast Xi into the largest string type necessary\n+                    # to handle different lengths of numpy strings\n+                    if (self.categories_[i].dtype.kind in ('U', 'S')\n+                            and self.categories_[i].itemsize > Xi.itemsize):\n+                        Xi = Xi.astype(self.categories_[i].dtype)\n+                    else:\n+                        Xi = Xi.copy()\n+\n                     Xi[~valid_mask] = self.categories_[i][0]\n             _, encoded = _encode(Xi, self.categories_[i], encode=True)\n             X_int[:, i] = encoded\n",
  "similar_bug_items": [
    {
      "pr_number": 11043,
      "pr_title": "[MRG+2] ENH Passthrough DataFrame in FunctionTransformer",
      "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist\r\n-->\r\n\r\n#### Reference Issues/PRs\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\n\r\ncloses #10655 \r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\nAdded the following option\r\n\r\n- [x] Raise FutureWarning that `validate=False` will be the default in the future.\r\n- [x] Convert list to array when `validate=False`.\r\n- [x] Make a what's new entry for the change of behaviour.\r\n\r\n#### Any other comments?\r\n\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
      "issue_id": 10655,
      "issue_title": "FunctionTransformer should not convert DataFrames to arrays by default",
      "issue_body": "I would expect a common use of FunctionTransformer is to apply some function to a Pandas DataFrame, ideally using its own methods or accessors. As noted in #10648, it can be easy for users to miss that they need to set validate=False to pass through a DataFrame without converting it to a NumPy array. I think it would be more user-friendly to have `validate='array-or-frame'` by default, which would pass through DataFrames to the function, but otherwise convert its input to a 2d array. For strict backwards compatibility, the default should be changed through a deprecation cycle, warning whenever using the default validation means a DataFrame is currently converted to an array.\r\n\r\nDo others agree?",
      "issue_closed_at": "2018-07-10T09:48:04Z",
      "base_commit": "19bc7e8af6ec3468b6c7f4718a31cd5f508528cd",
      "changes": [
        {
          "file": "sklearn/preprocessing/_function_transformer.py",
          "type": "class",
          "name": "FunctionTransformer",
          "code": "class FunctionTransformer(BaseEstimator, TransformerMixin):\n    \"\"\"Constructs a transformer from an arbitrary callable.\n\n    A FunctionTransformer forwards its X (and optionally y) arguments to a\n    user-defined function or function object and returns the result of this\n    function. This is useful for stateless transformations such as taking the\n    log of frequencies, doing custom scaling, etc.\n\n    Note: If a lambda is used as the function, then the resulting\n    transformer will not be pickleable.\n\n    .. versionadded:: 0.17\n\n    Read more in the :ref:`User Guide <function_transformer>`.\n\n    Parameters\n    ----------\n    func : callable, optional default=None\n        The callable to use for the transformation. This will be passed\n        the same arguments as transform, with args and kwargs forwarded.\n        If func is None, then func will be the identity function.\n\n    inverse_func : callable, optional default=None\n        The callable to use for the inverse transformation. This will be\n        passed the same arguments as inverse transform, with args and\n        kwargs forwarded. If inverse_func is None, then inverse_func\n        will be the identity function.\n\n    validate : bool, optional default=True\n        Indicate that the input X array should be checked before calling\n        func. If validate is false, there will be no input validation.\n        If it is true, then X will be converted to a 2-dimensional NumPy\n        array or sparse matrix. If this conversion is not possible or X\n        contains NaN or infinity, an exception is raised.\n\n    accept_sparse : boolean, optional\n        Indicate that func accepts a sparse matrix as input. If validate is\n        False, this has no effect. Otherwise, if accept_sparse is false,\n        sparse matrix inputs will cause an exception to be raised.\n\n    pass_y : bool, optional default=False\n        Indicate that transform should forward the y argument to the\n        inner callable.\n\n        .. deprecated::0.19\n\n    check_inverse : bool, default=True\n       Whether to check that or ``func`` followed by ``inverse_func`` leads to\n       the original inputs. It can be used for a sanity check, raising a\n       warning when the condition is not fulfilled.\n\n       .. versionadded:: 0.20\n\n    kw_args : dict, optional\n        Dictionary of additional keyword arguments to pass to func.\n\n    inv_kw_args : dict, optional\n        Dictionary of additional keyword arguments to pass to inverse_func.\n\n    \"\"\"\n    def __init__(self, func=None, inverse_func=None, validate=True,\n                 accept_sparse=False, pass_y='deprecated', check_inverse=True,\n                 kw_args=None, inv_kw_args=None):\n        self.func = func\n        self.inverse_func = inverse_func\n        self.validate = validate\n        self.accept_sparse = accept_sparse\n        self.pass_y = pass_y\n        self.check_inverse = check_inverse\n        self.kw_args = kw_args\n        self.inv_kw_args = inv_kw_args\n\n    def _check_inverse_transform(self, X):\n        \"\"\"Check that func and inverse_func are the inverse.\"\"\"\n        idx_selected = slice(None, None, max(1, X.shape[0] // 100))\n        try:\n            assert_allclose_dense_sparse(\n                X[idx_selected],\n                self.inverse_transform(self.transform(X[idx_selected])))\n        except AssertionError:\n            warnings.warn(\"The provided functions are not strictly\"\n                          \" inverse of each other. If you are sure you\"\n                          \" want to proceed regardless, set\"\n                          \" 'check_inverse=False'.\", UserWarning)\n\n    def fit(self, X, y=None):\n        \"\"\"Fit transformer by checking X.\n\n        If ``validate`` is ``True``, ``X`` will be checked.\n\n        Parameters\n        ----------\n        X : array-like, shape (n_samples, n_features)\n            Input array.\n\n        Returns\n        -------\n        self\n        \"\"\"\n        if self.validate:\n            X = check_array(X, self.accept_sparse)\n        if (self.check_inverse and not (self.func is None or\n                                        self.inverse_func is None)):\n            self._check_inverse_transform(X)\n        return self\n\n    def transform(self, X, y='deprecated'):\n        \"\"\"Transform X using the forward function.\n\n        Parameters\n        ----------\n        X : array-like, shape (n_samples, n_features)\n            Input array.\n\n        y : (ignored)\n            .. deprecated::0.19\n\n        Returns\n        -------\n        X_out : array-like, shape (n_samples, n_features)\n            Transformed input.\n        \"\"\"\n        if not isinstance(y, string_types) or y != 'deprecated':\n            warnings.warn(\"The parameter y on transform() is \"\n                          \"deprecated since 0.19 and will be removed in 0.21\",\n                          DeprecationWarning)\n\n        return self._transform(X, y=y, func=self.func, kw_args=self.kw_args)\n\n    def inverse_transform(self, X, y='deprecated'):\n        \"\"\"Transform X using the inverse function.\n\n        Parameters\n        ----------\n        X : array-like, shape (n_samples, n_features)\n            Input array.\n\n        y : (ignored)\n            .. deprecated::0.19\n\n        Returns\n        -------\n        X_out : array-like, shape (n_samples, n_features)\n            Transformed input.\n        \"\"\"\n        if not isinstance(y, string_types) or y != 'deprecated':\n            warnings.warn(\"The parameter y on inverse_transform() is \"\n                          \"deprecated since 0.19 and will be removed in 0.21\",\n                          DeprecationWarning)\n        return self._transform(X, y=y, func=self.inverse_func,\n                               kw_args=self.inv_kw_args)\n\n    def _transform(self, X, y=None, func=None, kw_args=None):\n        if self.validate:\n            X = check_array(X, self.accept_sparse)\n\n        if func is None:\n            func = _identity\n\n        if (not isinstance(self.pass_y, string_types) or\n                self.pass_y != 'deprecated'):\n            # We do this to know if pass_y was set to False / True\n            pass_y = self.pass_y\n            warnings.warn(\"The parameter pass_y is deprecated since 0.19 and \"\n                          \"will be removed in 0.21\", DeprecationWarning)\n        else:\n            pass_y = False\n\n        return func(X, *((y,) if pass_y else ()),\n                    **(kw_args if kw_args else {}))"
        },
        {
          "file": "sklearn/preprocessing/_function_transformer.py",
          "type": "class",
          "name": "FunctionTransformer",
          "code": "class FunctionTransformer(BaseEstimator, TransformerMixin):\n    \"\"\"Constructs a transformer from an arbitrary callable.\n\n    A FunctionTransformer forwards its X (and optionally y) arguments to a\n    user-defined function or function object and returns the result of this\n    function. This is useful for stateless transformations such as taking the\n    log of frequencies, doing custom scaling, etc.\n\n    Note: If a lambda is used as the function, then the resulting\n    transformer will not be pickleable.\n\n    .. versionadded:: 0.17\n\n    Read more in the :ref:`User Guide <function_transformer>`.\n\n    Parameters\n    ----------\n    func : callable, optional default=None\n        The callable to use for the transformation. This will be passed\n        the same arguments as transform, with args and kwargs forwarded.\n        If func is None, then func will be the identity function.\n\n    inverse_func : callable, optional default=None\n        The callable to use for the inverse transformation. This will be\n        passed the same arguments as inverse transform, with args and\n        kwargs forwarded. If inverse_func is None, then inverse_func\n        will be the identity function.\n\n    validate : bool, optional default=True\n        Indicate that the input X array should be checked before calling\n        func. If validate is false, there will be no input validation.\n        If it is true, then X will be converted to a 2-dimensional NumPy\n        array or sparse matrix. If this conversion is not possible or X\n        contains NaN or infinity, an exception is raised.\n\n    accept_sparse : boolean, optional\n        Indicate that func accepts a sparse matrix as input. If validate is\n        False, this has no effect. Otherwise, if accept_sparse is false,\n        sparse matrix inputs will cause an exception to be raised.\n\n    pass_y : bool, optional default=False\n        Indicate that transform should forward the y argument to the\n        inner callable.\n\n        .. deprecated::0.19\n\n    check_inverse : bool, default=True\n       Whether to check that or ``func`` followed by ``inverse_func`` leads to\n       the original inputs. It can be used for a sanity check, raising a\n       warning when the condition is not fulfilled.\n\n       .. versionadded:: 0.20\n\n    kw_args : dict, optional\n        Dictionary of additional keyword arguments to pass to func.\n\n    inv_kw_args : dict, optional\n        Dictionary of additional keyword arguments to pass to inverse_func.\n\n    \"\"\"\n    def __init__(self, func=None, inverse_func=None, validate=True,\n                 accept_sparse=False, pass_y='deprecated', check_inverse=True,\n                 kw_args=None, inv_kw_args=None):\n        self.func = func\n        self.inverse_func = inverse_func\n        self.validate = validate\n        self.accept_sparse = accept_sparse\n        self.pass_y = pass_y\n        self.check_inverse = check_inverse\n        self.kw_args = kw_args\n        self.inv_kw_args = inv_kw_args\n\n    def _check_inverse_transform(self, X):\n        \"\"\"Check that func and inverse_func are the inverse.\"\"\"\n        idx_selected = slice(None, None, max(1, X.shape[0] // 100))\n        try:\n            assert_allclose_dense_sparse(\n                X[idx_selected],\n                self.inverse_transform(self.transform(X[idx_selected])))\n        except AssertionError:\n            warnings.warn(\"The provided functions are not strictly\"\n                          \" inverse of each other. If you are sure you\"\n                          \" want to proceed regardless, set\"\n                          \" 'check_inverse=False'.\", UserWarning)\n\n    def fit(self, X, y=None):\n        \"\"\"Fit transformer by checking X.\n\n        If ``validate`` is ``True``, ``X`` will be checked.\n\n        Parameters\n        ----------\n        X : array-like, shape (n_samples, n_features)\n            Input array.\n\n        Returns\n        -------\n        self\n        \"\"\"\n        if self.validate:\n            X = check_array(X, self.accept_sparse)\n        if (self.check_inverse and not (self.func is None or\n                                        self.inverse_func is None)):\n            self._check_inverse_transform(X)\n        return self\n\n    def transform(self, X, y='deprecated'):\n        \"\"\"Transform X using the forward function.\n\n        Parameters\n        ----------\n        X : array-like, shape (n_samples, n_features)\n            Input array.\n\n        y : (ignored)\n            .. deprecated::0.19\n\n        Returns\n        -------\n        X_out : array-like, shape (n_samples, n_features)\n            Transformed input.\n        \"\"\"\n        if not isinstance(y, string_types) or y != 'deprecated':\n            warnings.warn(\"The parameter y on transform() is \"\n                          \"deprecated since 0.19 and will be removed in 0.21\",\n                          DeprecationWarning)\n\n        return self._transform(X, y=y, func=self.func, kw_args=self.kw_args)\n\n    def inverse_transform(self, X, y='deprecated'):\n        \"\"\"Transform X using the inverse function.\n\n        Parameters\n        ----------\n        X : array-like, shape (n_samples, n_features)\n            Input array.\n\n        y : (ignored)\n            .. deprecated::0.19\n\n        Returns\n        -------\n        X_out : array-like, shape (n_samples, n_features)\n            Transformed input.\n        \"\"\"\n        if not isinstance(y, string_types) or y != 'deprecated':\n            warnings.warn(\"The parameter y on inverse_transform() is \"\n                          \"deprecated since 0.19 and will be removed in 0.21\",\n                          DeprecationWarning)\n        return self._transform(X, y=y, func=self.inverse_func,\n                               kw_args=self.inv_kw_args)\n\n    def _transform(self, X, y=None, func=None, kw_args=None):\n        if self.validate:\n            X = check_array(X, self.accept_sparse)\n\n        if func is None:\n            func = _identity\n\n        if (not isinstance(self.pass_y, string_types) or\n                self.pass_y != 'deprecated'):\n            # We do this to know if pass_y was set to False / True\n            pass_y = self.pass_y\n            warnings.warn(\"The parameter pass_y is deprecated since 0.19 and \"\n                          \"will be removed in 0.21\", DeprecationWarning)\n        else:\n            pass_y = False\n\n        return func(X, *((y,) if pass_y else ()),\n                    **(kw_args if kw_args else {}))"
        },
        {
          "file": "sklearn/preprocessing/_function_transformer.py",
          "type": "function",
          "name": "__init__",
          "class_name": "FunctionTransformer",
          "code": "def __init__(self, func=None, inverse_func=None, validate=True,\n                 accept_sparse=False, pass_y='deprecated', check_inverse=True,\n                 kw_args=None, inv_kw_args=None):\n        self.func = func\n        self.inverse_func = inverse_func\n        self.validate = validate\n        self.accept_sparse = accept_sparse\n        self.pass_y = pass_y\n        self.check_inverse = check_inverse\n        self.kw_args = kw_args\n        self.inv_kw_args = inv_kw_args"
        },
        {
          "file": "sklearn/preprocessing/_function_transformer.py",
          "type": "function",
          "name": "fit",
          "class_name": "FunctionTransformer",
          "code": "def fit(self, X, y=None):\n        \"\"\"Fit transformer by checking X.\n\n        If ``validate`` is ``True``, ``X`` will be checked.\n\n        Parameters\n        ----------\n        X : array-like, shape (n_samples, n_features)\n            Input array.\n\n        Returns\n        -------\n        self\n        \"\"\"\n        if self.validate:\n            X = check_array(X, self.accept_sparse)\n        if (self.check_inverse and not (self.func is None or\n                                        self.inverse_func is None)):\n            self._check_inverse_transform(X)\n        return self"
        },
        {
          "file": "sklearn/preprocessing/_function_transformer.py",
          "type": "function",
          "name": "inverse_transform",
          "class_name": "FunctionTransformer",
          "code": "def inverse_transform(self, X, y='deprecated'):\n        \"\"\"Transform X using the inverse function.\n\n        Parameters\n        ----------\n        X : array-like, shape (n_samples, n_features)\n            Input array.\n\n        y : (ignored)\n            .. deprecated::0.19\n\n        Returns\n        -------\n        X_out : array-like, shape (n_samples, n_features)\n            Transformed input.\n        \"\"\"\n        if not isinstance(y, string_types) or y != 'deprecated':\n            warnings.warn(\"The parameter y on inverse_transform() is \"\n                          \"deprecated since 0.19 and will be removed in 0.21\",\n                          DeprecationWarning)\n        return self._transform(X, y=y, func=self.inverse_func,\n                               kw_args=self.inv_kw_args)"
        }
      ]
    },
    {
      "pr_number": 12104,
      "pr_title": "[MRG] Convert ColumnTransformer input list to numpy array",
      "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist\r\n-->\r\n\r\n#### Reference Issues/PRs\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\nFixes #12096.\r\n\r\n#### What does this implement/fix? Explain your changes.\r\nConverts the input list for ColumnTransformer to a numpy array.\r\n\r\nAdded a check inside `transform` and `fit_transform` to check if the input `X` is a list, if it is then it gets converted to a numpy array.\r\n\r\n#### Any other comments?\r\nShould this conversion be documented in the docstrings for ColumnTransfomer's `fit`, `transform` and `fit_transform`?\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
      "issue_id": 12096,
      "issue_title": "ColumnTransformer breaks where X is a list",
      "issue_body": "```py\r\n>>> from sklearn.preprocessing import StandardScaler\r\n>>> from sklearn.compose import ColumnTransformer\r\n>>> ColumnTransformer([('foobar', StandardScaler(), [0, 1, 2])]).fit([[1, 2, 3]])\r\nTraceback (most recent call last):\r\n  File \"<stdin>\", line 1, in <module>\r\n  File \"/Users/joel/repos/scikit-learn/sklearn/compose/_column_transformer.py\", line 398, in fit\r\n    self.fit_transform(X, y=y)\r\n  File \"/Users/joel/repos/scikit-learn/sklearn/compose/_column_transformer.py\", line 422, in fit_transform\r\n    self._validate_remainder(X)\r\n  File \"/Users/joel/repos/scikit-learn/sklearn/compose/_column_transformer.py\", line 275, in _validate_remainder\r\n    n_columns = X.shape[1]\r\nAttributeError: 'list' object has no attribute 'shape'\r\n```\r\n\r\nThe passed list should be interpreted as an array for the sake of extracting columns. Instead an error is raised.",
      "issue_closed_at": "2018-09-25T14:37:16Z",
      "base_commit": "0c0a9e8406987fd53ccd705c3e455514c47c49c4",
      "changes": [
        {
          "file": "sklearn/compose/_column_transformer.py",
          "type": "line",
          "name": "line 16",
          "code": "from ..base import clone, TransformerMixin\nfrom ..utils import Parallel, delayed\nfrom ..externals import six\nfrom ..pipeline import (\n    _fit_one_transformer, _fit_transform_one, _transform_one, _name_estimators)\nfrom ..preprocessing import FunctionTransformer\nfrom ..utils import Bunch\nfrom ..utils.metaestimators import _BaseComposition\nfrom ..utils.validation import check_is_fitted\n\n\n__all__ = ['ColumnTransformer', 'make_column_transformer']"
        },
        {
          "file": "sklearn/compose/_column_transformer.py",
          "type": "function",
          "name": "fit_transform",
          "class_name": "ColumnTransformer",
          "code": "def fit_transform(self, X, y=None):\n        \"\"\"Fit all transformers, transform the data and concatenate results.\n\n        Parameters\n        ----------\n        X : array-like or DataFrame of shape [n_samples, n_features]\n            Input data, of which specified subsets are used to fit the\n            transformers.\n\n        y : array-like, shape (n_samples, ...), optional\n            Targets for supervised learning.\n\n        Returns\n        -------\n        X_t : array-like or sparse matrix, shape (n_samples, sum_n_components)\n            hstack of results of transformers. sum_n_components is the\n            sum of n_components (output dimension) over transformers. If\n            any result is a sparse matrix, everything will be converted to\n            sparse matrices.\n\n        \"\"\"\n        self._validate_transformers()\n        self._validate_column_callables(X)\n        self._validate_remainder(X)\n\n        result = self._fit_transform(X, y, _fit_transform_one)\n\n        if not result:\n            self._update_fitted_transformers([])\n            # All transformers are None\n            return np.zeros((X.shape[0], 0))\n\n        Xs, transformers = zip(*result)\n\n        # determine if concatenated output will be sparse or not\n        if all(sparse.issparse(X) for X in Xs):\n            self.sparse_output_ = True\n        elif any(sparse.issparse(X) for X in Xs):\n            nnz = sum(X.nnz if sparse.issparse(X) else X.size for X in Xs)\n            total = sum(X.shape[0] * X.shape[1] if sparse.issparse(X)\n                        else X.size for X in Xs)\n            density = nnz / total\n            self.sparse_output_ = density < self.sparse_threshold\n        else:\n            self.sparse_output_ = False\n\n        self._update_fitted_transformers(transformers)\n        self._validate_output(Xs)\n\n        return self._hstack(list(Xs))"
        },
        {
          "file": "sklearn/compose/_column_transformer.py",
          "type": "function",
          "name": "transform",
          "class_name": "ColumnTransformer",
          "code": "def transform(self, X):\n        \"\"\"Transform X separately by each transformer, concatenate results.\n\n        Parameters\n        ----------\n        X : array-like or DataFrame of shape [n_samples, n_features]\n            The data to be transformed by subset.\n\n        Returns\n        -------\n        X_t : array-like or sparse matrix, shape (n_samples, sum_n_components)\n            hstack of results of transformers. sum_n_components is the\n            sum of n_components (output dimension) over transformers. If\n            any result is a sparse matrix, everything will be converted to\n            sparse matrices.\n\n        \"\"\"\n        check_is_fitted(self, 'transformers_')\n\n        Xs = self._fit_transform(X, None, _transform_one, fitted=True)\n        self._validate_output(Xs)\n\n        if not Xs:\n            # All transformers are None\n            return np.zeros((X.shape[0], 0))\n\n        return self._hstack(list(Xs))"
        },
        {
          "file": "sklearn/compose/_column_transformer.py",
          "type": "function",
          "name": "_hstack",
          "class_name": "ColumnTransformer",
          "code": "def _hstack(self, Xs):\n        \"\"\"Stacks Xs horizontally.\n\n        This allows subclasses to control the stacking behavior, while reusing\n        everything else from ColumnTransformer.\n\n        Parameters\n        ----------\n        Xs : List of numpy arrays, sparse arrays, or DataFrames\n        \"\"\"\n        if self.sparse_output_:\n            return sparse.hstack(Xs).tocsr()\n        else:\n            Xs = [f.toarray() if sparse.issparse(f) else f for f in Xs]\n            return np.hstack(Xs)"
        }
      ]
    },
    {
      "pr_number": 11042,
      "pr_title": "[MRG + 1] Ensuring that the OneHotEncoder outputs sparse matrix with given dtype #11034",
      "pr_body": "#### Reference Issues/PRs\r\nOriginal discussion at #11034\r\n\r\n#### What does this implement/fix? Explain your changes.\r\n",
      "issue_id": 11034,
      "issue_title": "OneHotEncoder does not output scipy sparse matrix of given dtype",
      "issue_body": "#### Description\r\nOneHotEncoder ignores the specified dtype in the construction of the sparse array when mixed input data are passed, i.e with both categorical and real data type\r\n\r\n#### Steps/Code to Reproduce\r\n```python\r\nimport numpy as np\r\n\r\nfrom sklearn.preprocessing import OneHotEncoder\r\nenc = OneHotEncoder(dtype=np.float32, categorical_features=[0, 1])\r\n\r\nx = np.array([[0, 1, 0, 0], [1, 2, 0, 0]], dtype=int)\r\nsparse = enc.fit(x).transform(x)\r\n```\r\n\r\n#### Expected Results\r\n```python\r\nsparse: <2x6 sparse matrix of type '<class 'numpy.float32'>'\r\n\twith 4 stored elements in COOrdinate format>\r\n```\r\n\r\n#### Actual Results\r\n```python\r\nsparse: <2x6 sparse matrix of type '<class 'numpy.float64'>'\r\n\twith 4 stored elements in COOrdinate format>\r\n```\r\n\r\n#### Versions\r\n__Platform__: Linux-4.13.0-38-generic-x86_64-with-debian-stretch-sid\r\n__Python__: 3.6.3 |Anaconda custom (64-bit)| (default, Oct 13 2017, 12:02:49) [GCC 7.2.0]\r\n__NumPy__: NumPy \r\n__SciPy__: SciPy 1.0.1\r\n__Scikit-Learn__: Scikit-Learn 0.19.1\r\n",
      "issue_closed_at": "2018-06-06T09:03:02Z",
      "base_commit": "f049ec72eb70443ec8d7826066c4246035677c11",
      "changes": [
        {
          "file": "sklearn/preprocessing/data.py",
          "type": "function",
          "name": "add_dummy_feature",
          "class_name": null,
          "code": "def add_dummy_feature(X, value=1.0):\n    \"\"\"Augment dataset with an additional dummy feature.\n\n    This is useful for fitting an intercept term with implementations which\n    cannot otherwise fit it directly.\n\n    Parameters\n    ----------\n    X : {array-like, sparse matrix}, shape [n_samples, n_features]\n        Data.\n\n    value : float\n        Value to use for the dummy feature.\n\n    Returns\n    -------\n\n    X : {array, sparse matrix}, shape [n_samples, n_features + 1]\n        Same data with dummy feature added as first column.\n\n    Examples\n    --------\n\n    >>> from sklearn.preprocessing import add_dummy_feature\n    >>> add_dummy_feature([[0, 1], [1, 0]])\n    array([[1., 0., 1.],\n           [1., 1., 0.]])\n    \"\"\"\n    X = check_array(X, accept_sparse=['csc', 'csr', 'coo'], dtype=FLOAT_DTYPES)\n    n_samples, n_features = X.shape\n    shape = (n_samples, n_features + 1)\n    if sparse.issparse(X):\n        if sparse.isspmatrix_coo(X):\n            # Shift columns to the right.\n            col = X.col + 1\n            # Column indices of dummy feature are 0 everywhere.\n            col = np.concatenate((np.zeros(n_samples), col))\n            # Row indices of dummy feature are 0, ..., n_samples-1.\n            row = np.concatenate((np.arange(n_samples), X.row))\n            # Prepend the dummy feature n_samples times.\n            data = np.concatenate((np.ones(n_samples) * value, X.data))\n            return sparse.coo_matrix((data, (row, col)), shape)\n        elif sparse.isspmatrix_csc(X):\n            # Shift index pointers since we need to add n_samples elements.\n            indptr = X.indptr + n_samples\n            # indptr[0] must be 0.\n            indptr = np.concatenate((np.array([0]), indptr))\n            # Row indices of dummy feature are 0, ..., n_samples-1.\n            indices = np.concatenate((np.arange(n_samples), X.indices))\n            # Prepend the dummy feature n_samples times.\n            data = np.concatenate((np.ones(n_samples) * value, X.data))\n            return sparse.csc_matrix((data, indices, indptr), shape)\n        else:\n            klass = X.__class__\n            return klass(add_dummy_feature(X.tocoo(), value))\n    else:\n        return np.hstack((np.ones((n_samples, 1)) * value, X))"
        },
        {
          "file": "sklearn/preprocessing/data.py",
          "type": "function",
          "name": "_transform_selected",
          "class_name": null,
          "code": "def _transform_selected(X, transform, selected=\"all\", copy=True):\n    \"\"\"Apply a transform function to portion of selected features\n\n    Parameters\n    ----------\n    X : {array-like, sparse matrix}, shape [n_samples, n_features]\n        Dense array or sparse matrix.\n\n    transform : callable\n        A callable transform(X) -> X_transformed\n\n    copy : boolean, optional\n        Copy X even if it could be avoided.\n\n    selected: \"all\" or array of indices or mask\n        Specify which features to apply the transform to.\n\n    Returns\n    -------\n    X : array or sparse matrix, shape=(n_samples, n_features_new)\n    \"\"\"\n    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)\n\n    if isinstance(selected, six.string_types) and selected == \"all\":\n        return transform(X)\n\n    if len(selected) == 0:\n        return X\n\n    n_features = X.shape[1]\n    ind = np.arange(n_features)\n    sel = np.zeros(n_features, dtype=bool)\n    sel[np.asarray(selected)] = True\n    not_sel = np.logical_not(sel)\n    n_selected = np.sum(sel)\n\n    if n_selected == 0:\n        # No features selected.\n        return X\n    elif n_selected == n_features:\n        # All features selected.\n        return transform(X)\n    else:\n        X_sel = transform(X[:, ind[sel]])\n        X_not_sel = X[:, ind[not_sel]]\n\n        if sparse.issparse(X_sel) or sparse.issparse(X_not_sel):\n            return sparse.hstack((X_sel, X_not_sel))\n        else:\n            return np.hstack((X_sel, X_not_sel))"
        },
        {
          "file": "sklearn/preprocessing/data.py",
          "type": "function",
          "name": "_transform_selected",
          "class_name": null,
          "code": "def _transform_selected(X, transform, selected=\"all\", copy=True):\n    \"\"\"Apply a transform function to portion of selected features\n\n    Parameters\n    ----------\n    X : {array-like, sparse matrix}, shape [n_samples, n_features]\n        Dense array or sparse matrix.\n\n    transform : callable\n        A callable transform(X) -> X_transformed\n\n    copy : boolean, optional\n        Copy X even if it could be avoided.\n\n    selected: \"all\" or array of indices or mask\n        Specify which features to apply the transform to.\n\n    Returns\n    -------\n    X : array or sparse matrix, shape=(n_samples, n_features_new)\n    \"\"\"\n    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)\n\n    if isinstance(selected, six.string_types) and selected == \"all\":\n        return transform(X)\n\n    if len(selected) == 0:\n        return X\n\n    n_features = X.shape[1]\n    ind = np.arange(n_features)\n    sel = np.zeros(n_features, dtype=bool)\n    sel[np.asarray(selected)] = True\n    not_sel = np.logical_not(sel)\n    n_selected = np.sum(sel)\n\n    if n_selected == 0:\n        # No features selected.\n        return X\n    elif n_selected == n_features:\n        # All features selected.\n        return transform(X)\n    else:\n        X_sel = transform(X[:, ind[sel]])\n        X_not_sel = X[:, ind[not_sel]]\n\n        if sparse.issparse(X_sel) or sparse.issparse(X_not_sel):\n            return sparse.hstack((X_sel, X_not_sel))\n        else:\n            return np.hstack((X_sel, X_not_sel))"
        },
        {
          "file": "sklearn/preprocessing/data.py",
          "type": "function",
          "name": "fit_transform",
          "class_name": "OneHotEncoder",
          "code": "def fit_transform(self, X, y=None):\n        \"\"\"Fit OneHotEncoder to X, then transform X.\n\n        Equivalent to self.fit(X).transform(X), but more convenient and more\n        efficient. See fit for the parameters, transform for the return value.\n\n        Parameters\n        ----------\n        X : array-like, shape [n_samples, n_feature]\n            Input array of type int.\n        \"\"\"\n        return _transform_selected(X, self._fit_transform,\n                                   self.categorical_features, copy=True)"
        },
        {
          "file": "sklearn/preprocessing/data.py",
          "type": "function",
          "name": "transform",
          "class_name": "CategoricalEncoder",
          "code": "def transform(self, X):\n        \"\"\"Transform X using specified encoding scheme.\n\n        Parameters\n        ----------\n        X : array-like, shape [n_samples, n_features]\n            The data to encode.\n\n        Returns\n        -------\n        X_out : sparse matrix or a 2-d array\n            Transformed input.\n\n        \"\"\"\n        X_temp = check_array(X, dtype=None)\n        if not hasattr(X, 'dtype') and np.issubdtype(X_temp.dtype, np.str_):\n            X = check_array(X, dtype=np.object)\n        else:\n            X = X_temp\n\n        n_samples, n_features = X.shape\n        X_int = np.zeros_like(X, dtype=np.int)\n        X_mask = np.ones_like(X, dtype=np.bool)\n\n        for i in range(n_features):\n            Xi = X[:, i]\n            valid_mask = np.in1d(Xi, self.categories_[i])\n\n            if not np.all(valid_mask):\n                if self.handle_unknown == 'error':\n                    diff = np.unique(X[~valid_mask, i])\n                    msg = (\"Found unknown categories {0} in column {1}\"\n                           \" during transform\".format(diff, i))\n                    raise ValueError(msg)\n                else:\n                    # Set the problematic rows to an acceptable value and\n                    # continue `The rows are marked `X_mask` and will be\n                    # removed later.\n                    X_mask[:, i] = valid_mask\n                    Xi = Xi.copy()\n                    Xi[~valid_mask] = self.categories_[i][0]\n            X_int[:, i] = self._label_encoders_[i].transform(Xi)\n\n        if self.encoding == 'ordinal':\n            return X_int.astype(self.dtype, copy=False)\n\n        mask = X_mask.ravel()\n        n_values = [cats.shape[0] for cats in self.categories_]\n        n_values = np.array([0] + n_values)\n        feature_indices = np.cumsum(n_values)\n\n        indices = (X_int + feature_indices[:-1]).ravel()[mask]\n        indptr = X_mask.sum(axis=1).cumsum()\n        indptr = np.insert(indptr, 0, 0)\n        data = np.ones(n_samples * n_features)[mask]\n\n        out = sparse.csr_matrix((data, indices, indptr),\n                                shape=(n_samples, feature_indices[-1]),\n                                dtype=self.dtype)\n        if self.encoding == 'onehot-dense':\n            return out.toarray()\n        else:\n            return out"
        }
      ]
    },
    {
      "pr_number": 4894,
      "pr_title": "fix dtype transform problem in KNN and RandomForest",
      "pr_body": "Fix #4879\n\nI add a int array to store the indices before the loop. \n@amueller I also checked LabelEncoder but it can only accommodate 1d array, so a loop is still needed and slicing problem will still be there. \nFor MultiLabelBinarizer, it can accommodate 2d array, but it will return a 2d binary array to store the indices, which is very different from the original implementation in RandomForest and KNN. I am wondering is there any better way to do so? \n",
      "issue_id": 4879,
      "issue_title": "RandomForestClassifier changes label values",
      "issue_body": "Consider the following example:\n\n``` python\nimport numpy as np\nfrom sklearn.ensemble import RandomForestClassifier\n\nCLASSES = 15\nX = np.arange(CLASSES).reshape(-1, 1)\ny = [ch for ch in 'ABCDEFGHIJKLMNOPQRSTU'[:CLASSES]]\n\nprint(RandomForestClassifier().fit(X, y).predict(X))\n```\n\nThe output is\n\n`['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'B' 'B' 'B' 'B' 'B']`\n\nwhereas I expect it to be\n\n`['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O']`\n\nI traced the cause of the error to [here](https://github.com/scikit-learn/scikit-learn/blob/733629d256abaf9a31d6e6305f859768432907ed/sklearn/ensemble/forest.py#L418). I'm happy to put in a pull request, but can others please verify they get the same unexpected behaviour first?\n\nSystem information:\n``\nCPython 3.3.5\nIPython 3.1.0\n\nnumpy 1.9.2\nscipy 0.14.0\nscikit-learn 0.16.1(conda)\n\ncompiler   : GCC 4.2.1 (Apple Inc. build 5577)\nsystem     : Darwin\nrelease    : 13.4.0\nmachine    : x86_64\nprocessor  : i386\nCPU cores  : 4\ninterpreter: 64bit\n\n```\n```\n",
      "issue_closed_at": "2015-06-24T21:33:10Z",
      "base_commit": "cd98ffced656569681648c88c51dae844501643c",
      "changes": [
        {
          "file": "sklearn/ensemble/forest.py",
          "type": "function",
          "name": "_validate_y_class_weight",
          "class_name": "ForestClassifier",
          "code": "def _validate_y_class_weight(self, y):\n        y = np.copy(y)\n        expanded_class_weight = None\n\n        if self.class_weight is not None:\n            y_original = np.copy(y)\n\n        self.classes_ = []\n        self.n_classes_ = []\n\n        for k in range(self.n_outputs_):\n            classes_k, y[:, k] = np.unique(y[:, k], return_inverse=True)\n            self.classes_.append(classes_k)\n            self.n_classes_.append(classes_k.shape[0])\n\n        if self.class_weight is not None:\n            valid_presets = ('auto', 'balanced', 'balanced_subsample', 'subsample', 'auto')\n            if isinstance(self.class_weight, six.string_types):\n                if self.class_weight not in valid_presets:\n                    raise ValueError('Valid presets for class_weight include '\n                                     '\"balanced\" and \"balanced_subsample\". Given \"%s\".'\n                                     % self.class_weight)\n                if self.class_weight == \"subsample\":\n                    warn(\"class_weight='subsample' is deprecated and will be removed in 0.18.\"\n                         \" It was replaced by class_weight='balanced_subsample' \"\n                         \"using the balanced strategy.\", DeprecationWarning)\n                if self.warm_start:\n                    warn('class_weight presets \"balanced\" or \"balanced_subsample\" are '\n                         'not recommended for warm_start if the fitted data '\n                         'differs from the full dataset. In order to use '\n                         '\"balanced\" weights, use compute_class_weight(\"balanced\", '\n                         'classes, y). In place of y you can use a large '\n                         'enough sample of the full training set target to '\n                         'properly estimate the class frequency '\n                         'distributions. Pass the resulting weights as the '\n                         'class_weight parameter.')\n\n            if (self.class_weight not in ['subsample', 'balanced_subsample'] or\n                    not self.bootstrap):\n                if self.class_weight == 'subsample':\n                    class_weight = 'auto'\n                elif self.class_weight == \"balanced_subsample\":\n                    class_weight = \"balanced\"\n                else:\n                    class_weight = self.class_weight\n                with warnings.catch_warnings():\n                    if class_weight == \"auto\":\n                        warnings.simplefilter('ignore', DeprecationWarning)\n                    expanded_class_weight = compute_sample_weight(class_weight,\n                                                                  y_original)\n\n        return y, expanded_class_weight"
        },
        {
          "file": "sklearn/tree/tree.py",
          "type": "function",
          "name": "fit",
          "class_name": "BaseDecisionTree",
          "code": "def fit(self, X, y, sample_weight=None, check_input=True):\n        \"\"\"Build a decision tree from the training set (X, y).\n\n        Parameters\n        ----------\n        X : array-like or sparse matrix, shape = [n_samples, n_features]\n            The training input samples. Internally, it will be converted to\n            ``dtype=np.float32`` and if a sparse matrix is provided\n            to a sparse ``csc_matrix``.\n\n        y : array-like, shape = [n_samples] or [n_samples, n_outputs]\n            The target values (class labels in classification, real numbers in\n            regression). In the regression case, use ``dtype=np.float64`` and\n            ``order='C'`` for maximum efficiency.\n\n        sample_weight : array-like, shape = [n_samples] or None\n            Sample weights. If None, then samples are equally weighted. Splits\n            that would create child nodes with net zero or negative weight are\n            ignored while searching for a split in each node. In the case of\n            classification, splits are also ignored if they would result in any\n            single class carrying a negative weight in either child node.\n\n        check_input : boolean, (default=True)\n            Allow to bypass several input checking.\n            Don't use this parameter unless you know what you do.\n\n        Returns\n        -------\n        self : object\n            Returns self.\n        \"\"\"\n        random_state = check_random_state(self.random_state)\n        if check_input:\n            X = check_array(X, dtype=DTYPE, accept_sparse=\"csc\")\n            if issparse(X):\n                X.sort_indices()\n\n                if X.indices.dtype != np.intc or X.indptr.dtype != np.intc:\n                    raise ValueError(\"No support for np.int64 index based \"\n                                     \"sparse matrices\")\n\n        # Determine output settings\n        n_samples, self.n_features_ = X.shape\n        is_classification = isinstance(self, ClassifierMixin)\n\n        y = np.atleast_1d(y)\n        expanded_class_weight = None\n\n        if y.ndim == 1:\n            # reshape is necessary to preserve the data contiguity against vs\n            # [:, np.newaxis] that does not.\n            y = np.reshape(y, (-1, 1))\n\n        self.n_outputs_ = y.shape[1]\n\n        if is_classification:\n            y = np.copy(y)\n\n            self.classes_ = []\n            self.n_classes_ = []\n\n            if self.class_weight is not None:\n                y_original = np.copy(y)\n\n            for k in range(self.n_outputs_):\n                classes_k, y[:, k] = np.unique(y[:, k], return_inverse=True)\n                self.classes_.append(classes_k)\n                self.n_classes_.append(classes_k.shape[0])\n\n            if self.class_weight is not None:\n                expanded_class_weight = compute_sample_weight(\n                    self.class_weight, y_original)\n\n        else:\n            self.classes_ = [None] * self.n_outputs_\n            self.n_classes_ = [1] * self.n_outputs_\n\n        self.n_classes_ = np.array(self.n_classes_, dtype=np.intp)\n\n        if getattr(y, \"dtype\", None) != DOUBLE or not y.flags.contiguous:\n            y = np.ascontiguousarray(y, dtype=DOUBLE)\n\n        # Check parameters\n        max_depth = ((2 ** 31) - 1 if self.max_depth is None\n                     else self.max_depth)\n        max_leaf_nodes = (-1 if self.max_leaf_nodes is None\n                          else self.max_leaf_nodes)\n\n        if isinstance(self.max_features, six.string_types):\n            if self.max_features == \"auto\":\n                if is_classification:\n                    max_features = max(1, int(np.sqrt(self.n_features_)))\n                else:\n                    max_features = self.n_features_\n            elif self.max_features == \"sqrt\":\n                max_features = max(1, int(np.sqrt(self.n_features_)))\n            elif self.max_features == \"log2\":\n                max_features = max(1, int(np.log2(self.n_features_)))\n            else:\n                raise ValueError(\n                    'Invalid value for max_features. Allowed string '\n                    'values are \"auto\", \"sqrt\" or \"log2\".')\n        elif self.max_features is None:\n            max_features = self.n_features_\n        elif isinstance(self.max_features, (numbers.Integral, np.integer)):\n            max_features = self.max_features\n        else:  # float\n            if self.max_features > 0.0:\n                max_features = max(1, int(self.max_features * self.n_features_))\n            else:\n                max_features = 0\n\n        self.max_features_ = max_features\n\n        if len(y) != n_samples:\n            raise ValueError(\"Number of labels=%d does not match \"\n                             \"number of samples=%d\" % (len(y), n_samples))\n        if self.min_samples_split <= 0:\n            raise ValueError(\"min_samples_split must be greater than zero.\")\n        if self.min_samples_leaf <= 0:\n            raise ValueError(\"min_samples_leaf must be greater than zero.\")\n        if not 0 <= self.min_weight_fraction_leaf <= 0.5:\n            raise ValueError(\"min_weight_fraction_leaf must in [0, 0.5]\")\n        if max_depth <= 0:\n            raise ValueError(\"max_depth must be greater than zero. \")\n        if not (0 < max_features <= self.n_features_):\n            raise ValueError(\"max_features must be in (0, n_features]\")\n        if not isinstance(max_leaf_nodes, (numbers.Integral, np.integer)):\n            raise ValueError(\"max_leaf_nodes must be integral number but was \"\n                             \"%r\" % max_leaf_nodes)\n        if -1 < max_leaf_nodes < 2:\n            raise ValueError((\"max_leaf_nodes {0} must be either smaller than \"\n                              \"0 or larger than 1\").format(max_leaf_nodes))\n\n        if sample_weight is not None:\n            if (getattr(sample_weight, \"dtype\", None) != DOUBLE or\n                    not sample_weight.flags.contiguous):\n                sample_weight = np.ascontiguousarray(\n                    sample_weight, dtype=DOUBLE)\n            if len(sample_weight.shape) > 1:\n                raise ValueError(\"Sample weights array has more \"\n                                 \"than one dimension: %d\" %\n                                 len(sample_weight.shape))\n            if len(sample_weight) != n_samples:\n                raise ValueError(\"Number of weights=%d does not match \"\n                                 \"number of samples=%d\" %\n                                 (len(sample_weight), n_samples))\n\n        if expanded_class_weight is not None:\n            if sample_weight is not None:\n                sample_weight = sample_weight * expanded_class_weight\n            else:\n                sample_weight = expanded_class_weight\n\n        # Set min_weight_leaf from min_weight_fraction_leaf\n        if self.min_weight_fraction_leaf != 0. and sample_weight is not None:\n            min_weight_leaf = (self.min_weight_fraction_leaf *\n                               np.sum(sample_weight))\n        else:\n            min_weight_leaf = 0.\n\n        # Set min_samples_split sensibly\n        min_samples_split = max(self.min_samples_split,\n                                2 * self.min_samples_leaf)\n\n        # Build tree\n        criterion = self.criterion\n        if not isinstance(criterion, Criterion):\n            if is_classification:\n                criterion = CRITERIA_CLF[self.criterion](self.n_outputs_,\n                                                         self.n_classes_)\n            else:\n                criterion = CRITERIA_REG[self.criterion](self.n_outputs_)\n\n        SPLITTERS = SPARSE_SPLITTERS if issparse(X) else DENSE_SPLITTERS\n\n        splitter = self.splitter\n        if not isinstance(self.splitter, Splitter):\n            splitter = SPLITTERS[self.splitter](criterion,\n                                                self.max_features_,\n                                                self.min_samples_leaf,\n                                                min_weight_leaf,\n                                                random_state)\n\n        self.tree_ = Tree(self.n_features_, self.n_classes_, self.n_outputs_)\n\n        # Use BestFirst if max_leaf_nodes given; use DepthFirst otherwise\n        if max_leaf_nodes < 0:\n            builder = DepthFirstTreeBuilder(splitter, min_samples_split,\n                                            self.min_samples_leaf,\n                                            min_weight_leaf,\n                                            max_depth)\n        else:\n            builder = BestFirstTreeBuilder(splitter, min_samples_split,\n                                           self.min_samples_leaf,\n                                           min_weight_leaf,\n                                           max_depth,\n                                           max_leaf_nodes)\n\n        builder.build(self.tree_, X, y, sample_weight)\n\n        if self.n_outputs_ == 1:\n            self.n_classes_ = self.n_classes_[0]\n            self.classes_ = self.classes_[0]\n\n        return self"
        }
      ]
    },
    {
      "pr_number": 8936,
      "pr_title": "[MRG+1] fixed OOB_Score bug for bagging classifiers.",
      "pr_body": "Fixes #8933\r\n\r\n<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#Contributing-Pull-Requests\r\n-->\r\n#### Reference Issue\r\n<!-- Example: Fixes #1234 -->\r\n\r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\n\r\n#### Any other comments?\r\n\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
      "issue_id": 8933,
      "issue_title": "BUG: BaggingClassifier.oob_score_ should not change with class label",
      "issue_body": "Let us compute the oob score of a bagged classifier.\r\n\r\n```python\r\nimport numpy as np\r\nimport pandas as pd\r\nfrom sklearn.ensemble import BaggingClassifier\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\n\r\nN = 50\r\nrandState = 5\r\nlabel = 'Label'\r\nfeatures = ['A','B','C']\r\n\r\nlabels = np.random.randint(3, size = N) - 1\r\ndf = pd.DataFrame( labels , index=range(N), columns=[label] )\r\nfor col in features:\r\n    df[col] = df[label] + 0.01 * np.random.rand( N )\r\n\r\nclf = BaggingClassifier(base_estimator = KNeighborsClassifier(), n_estimators = 10, oob_score = True, random_state = randState )\r\nclf.fit(df[features], df[label])\r\nprint clf.oob_score_\r\n```\r\n\r\nHere, clf.oob_score_=0.0.\r\n\r\nNow, you would not expect that the OOB accuracy is a function of the class labels...\r\n\r\n```python\r\ndf.loc[ df[label] == -1 , label ] = 2\r\nclf = BaggingClassifier(base_estimator = KNeighborsClassifier(), n_estimators = 10, oob_score = True, random_state = randState )\r\nclf.fit(df[features], df[label])\r\nprint clf.oob_score_\r\n```\r\n\r\nNow, clf.oob_score_=1.0.\r\n\r\nClearly, OOB score should not be a function of the labels arbitrarily chosen for the classes.\r\n\r\nsklearn.__version__: '0.18.1'\r\nnumpy.__version__: '1.11.3'",
      "issue_closed_at": "2017-06-08T09:35:49Z",
      "base_commit": "9131f89e6c165fb27dadd37d3168c1ee5ea84f5a",
      "changes": [
        {
          "file": "sklearn/ensemble/bagging.py",
          "type": "function",
          "name": "_set_oob_score",
          "class_name": "BaggingRegressor",
          "code": "def _set_oob_score(self, X, y):\n        n_samples = y.shape[0]\n\n        predictions = np.zeros((n_samples,))\n        n_predictions = np.zeros((n_samples,))\n\n        for estimator, samples, features in zip(self.estimators_,\n                                                self.estimators_samples_,\n                                                self.estimators_features_):\n            # Create mask for OOB samples\n            mask = ~samples\n\n            predictions[mask] += estimator.predict((X[mask, :])[:, features])\n            n_predictions[mask] += 1\n\n        if (n_predictions == 0).any():\n            warn(\"Some inputs do not have OOB scores. \"\n                 \"This probably means too few estimators were used \"\n                 \"to compute any reliable oob estimates.\")\n            n_predictions[n_predictions == 0] = 1\n\n        predictions /= n_predictions\n\n        self.oob_prediction_ = predictions\n        self.oob_score_ = r2_score(y, predictions)"
        }
      ]
    }
  ]
}