{
  "instance_id": "scikit-learn__scikit-learn-13241",
  "repo": "scikit-learn/scikit-learn",
  "created_at": "2019-02-25T11:27:41Z",
  "problem_statement": "Differences among the results of KernelPCA with rbf kernel\nHi there,\r\nI met with a problem:\r\n\r\n#### Description\r\nWhen I run KernelPCA for dimension reduction for the same datasets, the results are different in signs.\r\n\r\n#### Steps/Code to Reproduce\r\nJust to reduce the dimension to 7 with rbf kernel:\r\npca = KernelPCA(n_components=7, kernel='rbf', copy_X=False, n_jobs=-1)\r\npca.fit_transform(X)\r\n\r\n#### Expected Results\r\nThe same result.\r\n\r\n#### Actual Results\r\nThe results are the same except for their signs:(\r\n[[-0.44457617 -0.18155886 -0.10873474  0.13548386 -0.1437174  -0.057469\t0.18124364]] \r\n\r\n[[ 0.44457617  0.18155886  0.10873474 -0.13548386 -0.1437174  -0.057469 -0.18124364]] \r\n\r\n[[-0.44457617 -0.18155886  0.10873474  0.13548386  0.1437174   0.057469  0.18124364]] \r\n\r\n#### Versions\r\n0.18.1\r\n\n",
  "patch": "diff --git a/sklearn/decomposition/kernel_pca.py b/sklearn/decomposition/kernel_pca.py\n--- a/sklearn/decomposition/kernel_pca.py\n+++ b/sklearn/decomposition/kernel_pca.py\n@@ -8,6 +8,7 @@\n from scipy.sparse.linalg import eigsh\n \n from ..utils import check_random_state\n+from ..utils.extmath import svd_flip\n from ..utils.validation import check_is_fitted, check_array\n from ..exceptions import NotFittedError\n from ..base import BaseEstimator, TransformerMixin, _UnstableOn32BitMixin\n@@ -210,6 +211,10 @@ def _fit_transform(self, K):\n                                                 maxiter=self.max_iter,\n                                                 v0=v0)\n \n+        # flip eigenvectors' sign to enforce deterministic output\n+        self.alphas_, _ = svd_flip(self.alphas_,\n+                                   np.empty_like(self.alphas_).T)\n+\n         # sort eigenvectors in descending order\n         indices = self.lambdas_.argsort()[::-1]\n         self.lambdas_ = self.lambdas_[indices]\n",
  "similar_bug_items": [
    {
      "pr_number": 11042,
      "pr_title": "[MRG + 1] Ensuring that the OneHotEncoder outputs sparse matrix with given dtype #11034",
      "pr_body": "#### Reference Issues/PRs\r\nOriginal discussion at #11034\r\n\r\n#### What does this implement/fix? Explain your changes.\r\n",
      "issue_id": 11034,
      "issue_title": "OneHotEncoder does not output scipy sparse matrix of given dtype",
      "issue_body": "#### Description\r\nOneHotEncoder ignores the specified dtype in the construction of the sparse array when mixed input data are passed, i.e with both categorical and real data type\r\n\r\n#### Steps/Code to Reproduce\r\n```python\r\nimport numpy as np\r\n\r\nfrom sklearn.preprocessing import OneHotEncoder\r\nenc = OneHotEncoder(dtype=np.float32, categorical_features=[0, 1])\r\n\r\nx = np.array([[0, 1, 0, 0], [1, 2, 0, 0]], dtype=int)\r\nsparse = enc.fit(x).transform(x)\r\n```\r\n\r\n#### Expected Results\r\n```python\r\nsparse: <2x6 sparse matrix of type '<class 'numpy.float32'>'\r\n\twith 4 stored elements in COOrdinate format>\r\n```\r\n\r\n#### Actual Results\r\n```python\r\nsparse: <2x6 sparse matrix of type '<class 'numpy.float64'>'\r\n\twith 4 stored elements in COOrdinate format>\r\n```\r\n\r\n#### Versions\r\n__Platform__: Linux-4.13.0-38-generic-x86_64-with-debian-stretch-sid\r\n__Python__: 3.6.3 |Anaconda custom (64-bit)| (default, Oct 13 2017, 12:02:49) [GCC 7.2.0]\r\n__NumPy__: NumPy \r\n__SciPy__: SciPy 1.0.1\r\n__Scikit-Learn__: Scikit-Learn 0.19.1\r\n",
      "issue_closed_at": "2018-06-06T09:03:02Z",
      "base_commit": "f049ec72eb70443ec8d7826066c4246035677c11",
      "changes": [
        {
          "file": "sklearn/preprocessing/data.py",
          "type": "function",
          "name": "add_dummy_feature",
          "class_name": null,
          "code": "def add_dummy_feature(X, value=1.0):\n    \"\"\"Augment dataset with an additional dummy feature.\n\n    This is useful for fitting an intercept term with implementations which\n    cannot otherwise fit it directly.\n\n    Parameters\n    ----------\n    X : {array-like, sparse matrix}, shape [n_samples, n_features]\n        Data.\n\n    value : float\n        Value to use for the dummy feature.\n\n    Returns\n    -------\n\n    X : {array, sparse matrix}, shape [n_samples, n_features + 1]\n        Same data with dummy feature added as first column.\n\n    Examples\n    --------\n\n    >>> from sklearn.preprocessing import add_dummy_feature\n    >>> add_dummy_feature([[0, 1], [1, 0]])\n    array([[1., 0., 1.],\n           [1., 1., 0.]])\n    \"\"\"\n    X = check_array(X, accept_sparse=['csc', 'csr', 'coo'], dtype=FLOAT_DTYPES)\n    n_samples, n_features = X.shape\n    shape = (n_samples, n_features + 1)\n    if sparse.issparse(X):\n        if sparse.isspmatrix_coo(X):\n            # Shift columns to the right.\n            col = X.col + 1\n            # Column indices of dummy feature are 0 everywhere.\n            col = np.concatenate((np.zeros(n_samples), col))\n            # Row indices of dummy feature are 0, ..., n_samples-1.\n            row = np.concatenate((np.arange(n_samples), X.row))\n            # Prepend the dummy feature n_samples times.\n            data = np.concatenate((np.ones(n_samples) * value, X.data))\n            return sparse.coo_matrix((data, (row, col)), shape)\n        elif sparse.isspmatrix_csc(X):\n            # Shift index pointers since we need to add n_samples elements.\n            indptr = X.indptr + n_samples\n            # indptr[0] must be 0.\n            indptr = np.concatenate((np.array([0]), indptr))\n            # Row indices of dummy feature are 0, ..., n_samples-1.\n            indices = np.concatenate((np.arange(n_samples), X.indices))\n            # Prepend the dummy feature n_samples times.\n            data = np.concatenate((np.ones(n_samples) * value, X.data))\n            return sparse.csc_matrix((data, indices, indptr), shape)\n        else:\n            klass = X.__class__\n            return klass(add_dummy_feature(X.tocoo(), value))\n    else:\n        return np.hstack((np.ones((n_samples, 1)) * value, X))"
        },
        {
          "file": "sklearn/preprocessing/data.py",
          "type": "function",
          "name": "_transform_selected",
          "class_name": null,
          "code": "def _transform_selected(X, transform, selected=\"all\", copy=True):\n    \"\"\"Apply a transform function to portion of selected features\n\n    Parameters\n    ----------\n    X : {array-like, sparse matrix}, shape [n_samples, n_features]\n        Dense array or sparse matrix.\n\n    transform : callable\n        A callable transform(X) -> X_transformed\n\n    copy : boolean, optional\n        Copy X even if it could be avoided.\n\n    selected: \"all\" or array of indices or mask\n        Specify which features to apply the transform to.\n\n    Returns\n    -------\n    X : array or sparse matrix, shape=(n_samples, n_features_new)\n    \"\"\"\n    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)\n\n    if isinstance(selected, six.string_types) and selected == \"all\":\n        return transform(X)\n\n    if len(selected) == 0:\n        return X\n\n    n_features = X.shape[1]\n    ind = np.arange(n_features)\n    sel = np.zeros(n_features, dtype=bool)\n    sel[np.asarray(selected)] = True\n    not_sel = np.logical_not(sel)\n    n_selected = np.sum(sel)\n\n    if n_selected == 0:\n        # No features selected.\n        return X\n    elif n_selected == n_features:\n        # All features selected.\n        return transform(X)\n    else:\n        X_sel = transform(X[:, ind[sel]])\n        X_not_sel = X[:, ind[not_sel]]\n\n        if sparse.issparse(X_sel) or sparse.issparse(X_not_sel):\n            return sparse.hstack((X_sel, X_not_sel))\n        else:\n            return np.hstack((X_sel, X_not_sel))"
        },
        {
          "file": "sklearn/preprocessing/data.py",
          "type": "function",
          "name": "_transform_selected",
          "class_name": null,
          "code": "def _transform_selected(X, transform, selected=\"all\", copy=True):\n    \"\"\"Apply a transform function to portion of selected features\n\n    Parameters\n    ----------\n    X : {array-like, sparse matrix}, shape [n_samples, n_features]\n        Dense array or sparse matrix.\n\n    transform : callable\n        A callable transform(X) -> X_transformed\n\n    copy : boolean, optional\n        Copy X even if it could be avoided.\n\n    selected: \"all\" or array of indices or mask\n        Specify which features to apply the transform to.\n\n    Returns\n    -------\n    X : array or sparse matrix, shape=(n_samples, n_features_new)\n    \"\"\"\n    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)\n\n    if isinstance(selected, six.string_types) and selected == \"all\":\n        return transform(X)\n\n    if len(selected) == 0:\n        return X\n\n    n_features = X.shape[1]\n    ind = np.arange(n_features)\n    sel = np.zeros(n_features, dtype=bool)\n    sel[np.asarray(selected)] = True\n    not_sel = np.logical_not(sel)\n    n_selected = np.sum(sel)\n\n    if n_selected == 0:\n        # No features selected.\n        return X\n    elif n_selected == n_features:\n        # All features selected.\n        return transform(X)\n    else:\n        X_sel = transform(X[:, ind[sel]])\n        X_not_sel = X[:, ind[not_sel]]\n\n        if sparse.issparse(X_sel) or sparse.issparse(X_not_sel):\n            return sparse.hstack((X_sel, X_not_sel))\n        else:\n            return np.hstack((X_sel, X_not_sel))"
        },
        {
          "file": "sklearn/preprocessing/data.py",
          "type": "function",
          "name": "fit_transform",
          "class_name": "OneHotEncoder",
          "code": "def fit_transform(self, X, y=None):\n        \"\"\"Fit OneHotEncoder to X, then transform X.\n\n        Equivalent to self.fit(X).transform(X), but more convenient and more\n        efficient. See fit for the parameters, transform for the return value.\n\n        Parameters\n        ----------\n        X : array-like, shape [n_samples, n_feature]\n            Input array of type int.\n        \"\"\"\n        return _transform_selected(X, self._fit_transform,\n                                   self.categorical_features, copy=True)"
        },
        {
          "file": "sklearn/preprocessing/data.py",
          "type": "function",
          "name": "transform",
          "class_name": "CategoricalEncoder",
          "code": "def transform(self, X):\n        \"\"\"Transform X using specified encoding scheme.\n\n        Parameters\n        ----------\n        X : array-like, shape [n_samples, n_features]\n            The data to encode.\n\n        Returns\n        -------\n        X_out : sparse matrix or a 2-d array\n            Transformed input.\n\n        \"\"\"\n        X_temp = check_array(X, dtype=None)\n        if not hasattr(X, 'dtype') and np.issubdtype(X_temp.dtype, np.str_):\n            X = check_array(X, dtype=np.object)\n        else:\n            X = X_temp\n\n        n_samples, n_features = X.shape\n        X_int = np.zeros_like(X, dtype=np.int)\n        X_mask = np.ones_like(X, dtype=np.bool)\n\n        for i in range(n_features):\n            Xi = X[:, i]\n            valid_mask = np.in1d(Xi, self.categories_[i])\n\n            if not np.all(valid_mask):\n                if self.handle_unknown == 'error':\n                    diff = np.unique(X[~valid_mask, i])\n                    msg = (\"Found unknown categories {0} in column {1}\"\n                           \" during transform\".format(diff, i))\n                    raise ValueError(msg)\n                else:\n                    # Set the problematic rows to an acceptable value and\n                    # continue `The rows are marked `X_mask` and will be\n                    # removed later.\n                    X_mask[:, i] = valid_mask\n                    Xi = Xi.copy()\n                    Xi[~valid_mask] = self.categories_[i][0]\n            X_int[:, i] = self._label_encoders_[i].transform(Xi)\n\n        if self.encoding == 'ordinal':\n            return X_int.astype(self.dtype, copy=False)\n\n        mask = X_mask.ravel()\n        n_values = [cats.shape[0] for cats in self.categories_]\n        n_values = np.array([0] + n_values)\n        feature_indices = np.cumsum(n_values)\n\n        indices = (X_int + feature_indices[:-1]).ravel()[mask]\n        indptr = X_mask.sum(axis=1).cumsum()\n        indptr = np.insert(indptr, 0, 0)\n        data = np.ones(n_samples * n_features)[mask]\n\n        out = sparse.csr_matrix((data, indices, indptr),\n                                shape=(n_samples, feature_indices[-1]),\n                                dtype=self.dtype)\n        if self.encoding == 'onehot-dense':\n            return out.toarray()\n        else:\n            return out"
        }
      ]
    },
    {
      "pr_number": 2523,
      "pr_title": "[MRG] FIX #2481: add warning for bug in old numpy with unicode",
      "pr_body": "This is a fix for  #2481 which causes the jenkins tests to fail with numpy 1.3.0.\n",
      "issue_id": 2481,
      "issue_title": "LabelEncoder doesn't work correctly for unicode labels in Python 2.6 + numpy 1.3",
      "issue_body": "LabelEncoder works incorrectly for unicode labels in Python 2.6 + numpy 1.3. This is currently untested; to reproduce replace bytestrings with unicode strings here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/preprocessing/tests/test_label.py#L191\n\nThis is the cause of Jenkins failure that https://github.com/scikit-learn/scikit-learn/pull/2462 triggered.\n",
      "issue_closed_at": "2013-10-16T12:24:23Z",
      "base_commit": "a2580d6fe04340f535c624ae48b4a52a66cbc839",
      "changes": [
        {
          "file": "sklearn/preprocessing/label.py",
          "type": "line",
          "name": "line 8",
          "code": "\nfrom ..base import BaseEstimator, TransformerMixin\n\nfrom ..utils.fixes import unique\nfrom ..utils import deprecated, column_or_1d\n\nfrom ..utils.multiclass import unique_labels"
        },
        {
          "file": "sklearn/preprocessing/label.py",
          "type": "line",
          "name": "line 25",
          "code": "    'LabelEncoder',\n]\n\n\nclass LabelEncoder(BaseEstimator, TransformerMixin):\n    \"\"\"Encode labels with value between 0 and n_classes-1."
        },
        {
          "file": "sklearn/preprocessing/label.py",
          "type": "function",
          "name": "fit",
          "class_name": "LabelBinarizer",
          "code": "def fit(self, y):\n        \"\"\"Fit label binarizer\n\n        Parameters\n        ----------\n        y : numpy array of shape (n_samples,) or sequence of sequences\n            Target values. In the multilabel case the nested sequences can\n            have variable lengths.\n\n        Returns\n        -------\n        self : returns an instance of self.\n        \"\"\"\n        y_type = type_of_target(y)\n        self.multilabel_ = y_type.startswith('multilabel')\n        if self.multilabel_:\n            self.indicator_matrix_ = y_type == 'multilabel-indicator'\n\n        self.classes_ = unique_labels(y)\n\n        return self"
        },
        {
          "file": "sklearn/preprocessing/label.py",
          "type": "function",
          "name": "fit_transform",
          "class_name": "LabelEncoder",
          "code": "def fit_transform(self, y):\n        \"\"\"Fit label encoder and return encoded labels\n\n        Parameters\n        ----------\n        y : array-like of shape [n_samples]\n            Target values.\n\n        Returns\n        -------\n        y : array-like of shape [n_samples]\n        \"\"\"\n        y = column_or_1d(y, warn=True)\n        self.classes_, y = unique(y, return_inverse=True)\n        return y"
        },
        {
          "file": "sklearn/preprocessing/label.py",
          "type": "function",
          "name": "transform",
          "class_name": "LabelBinarizer",
          "code": "def transform(self, y):\n        \"\"\"Transform multi-class labels to binary labels\n\n        The output of transform is sometimes referred to by some authors as the\n        1-of-K coding scheme.\n\n        Parameters\n        ----------\n        y : numpy array of shape [n_samples] or sequence of sequences\n            Target values. In the multilabel case the nested sequences can\n            have variable lengths.\n\n        Returns\n        -------\n        Y : numpy array of shape [n_samples, n_classes]\n        \"\"\"\n        self._check_fitted()\n\n        y_is_multilabel = type_of_target(y).startswith('multilabel')\n\n        if y_is_multilabel and not self.multilabel_:\n            raise ValueError(\"The object was not fitted with multilabel\"\n                             \" input.\")\n\n        return label_binarize(y, self.classes_,\n                              multilabel=self.multilabel_,\n                              pos_label=self.pos_label,\n                              neg_label=self.neg_label)"
        }
      ]
    },
    {
      "pr_number": 9910,
      "pr_title": "[MRG] Improve MinCovDet error when covariance of support data is 0",
      "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#Contributing-Pull-Requests\r\n-->\r\n#### Reference Issue\r\nFixes #9864 \r\n\r\n\r\n#### What does this implement/fix? Explain your changes.\r\nThis improves the MinCovDet error message when the covariance matrix of the support data is equal to 0. This also adds non-regression tests.\r\n\r\n**EDIT**: When the support data or more lie on a hyperplane, the algorithm described in the original paper returns the minimum covariance determinant estimates of the location and the (singular) covariance matrix. The algorithm then computes the equation of the hyperplane. I don't think (although I'm not certain about it) that `MinCovDet` implements such a particular case and I don't know if this should be handled but we might want to test what happens in this case. I can try the example of the original paper for such a situation and see what happens. My guess is that using `pinvh` makes `MinCovDet` return a solution anyway from which we can compute a Mahalanobis distance.\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
      "issue_id": 9864,
      "issue_title": "Improve MinCovDet.fit error when covariance is zero",
      "issue_body": "Ok this is extremely weird, can someone run this code and see if it crashes with that error?\r\n\r\n```py\r\nimport numpy as np\r\n\r\nfrom sklearn.covariance import MinCovDet\r\n\r\nclf = MinCovDet()\r\n\r\ndata = np.array([0.5, 0.1, 0.1, 0.1, 0.957, 0.1, 0.1,\r\n                 0.1, 0.4285, 0.1]).reshape(-1, 1)\r\nclf.fit(data)\r\n```\r\n\r\nIf I change the array to this\r\n\r\n```py\r\ndata = np.array([0.5, 0.11, 0.1, 0.1, 0.957, 0.1, 0.1, \r\n                 0.1, 0.4285, 0.1]).reshape(-1, 1)\r\n```\r\n\r\nThen it runs fine\r\n\r\nBut it seems to crash with any array where there are too many of the same values\r\n\r\nThis array crashes as well\r\n\r\n```py\r\ndata = np.array([0.5, 0.3, 0.3, 0.3, 0.957, 0.3, 0.3, \r\n                 0.3, 0.4285, 0.3]).reshape(-1, 1)\r\n```\r\n\r\nI already checked for NANs and everything, there's nothing\r\n\r\nUsing Python 3.6.2\r\nPandas 0.20.3\r\nNumpy 1.13.1\r\nscikit-learn 0.19.0\r\n\r\n\r\nThanks",
      "issue_closed_at": "2017-10-13T08:33:57Z",
      "base_commit": "fabb4fe4929066041c1c739d40b5fb9b5f514a3a",
      "changes": [
        {
          "file": "sklearn/covariance/robust_covariance.py",
          "type": "function",
          "name": "fast_mcd",
          "class_name": null,
          "code": "def fast_mcd(X, support_fraction=None,\n             cov_computation_method=empirical_covariance,\n             random_state=None):\n    \"\"\"Estimates the Minimum Covariance Determinant matrix.\n\n    Read more in the :ref:`User Guide <robust_covariance>`.\n\n    Parameters\n    ----------\n    X : array-like, shape (n_samples, n_features)\n      The data matrix, with p features and n samples.\n\n    support_fraction : float, 0 < support_fraction < 1\n          The proportion of points to be included in the support of the raw\n          MCD estimate. Default is None, which implies that the minimum\n          value of support_fraction will be used within the algorithm:\n          `[n_sample + n_features + 1] / 2`.\n\n    cov_computation_method : callable, default empirical_covariance\n        The function which will be used to compute the covariance.\n        Must return shape (n_features, n_features)\n\n    random_state : int, RandomState instance or None, optional (default=None)\n        If int, random_state is the seed used by the random number generator;\n        If RandomState instance, random_state is the random number generator;\n        If None, the random number generator is the RandomState instance used\n        by `np.random`.\n\n    Notes\n    -----\n    The FastMCD algorithm has been introduced by Rousseuw and Van Driessen\n    in \"A Fast Algorithm for the Minimum Covariance Determinant Estimator,\n    1999, American Statistical Association and the American Society\n    for Quality, TECHNOMETRICS\".\n    The principle is to compute robust estimates and random subsets before\n    pooling them into a larger subsets, and finally into the full data set.\n    Depending on the size of the initial sample, we have one, two or three\n    such computation levels.\n\n    Note that only raw estimates are returned. If one is interested in\n    the correction and reweighting steps described in [RouseeuwVan]_,\n    see the MinCovDet object.\n\n    References\n    ----------\n\n    .. [RouseeuwVan] A Fast Algorithm for the Minimum Covariance\n        Determinant Estimator, 1999, American Statistical Association\n        and the American Society for Quality, TECHNOMETRICS\n\n    .. [Butler1993] R. W. Butler, P. L. Davies and M. Jhun,\n        Asymptotics For The Minimum Covariance Determinant Estimator,\n        The Annals of Statistics, 1993, Vol. 21, No. 3, 1385-1400\n\n    Returns\n    -------\n    location : array-like, shape (n_features,)\n        Robust location of the data.\n\n    covariance : array-like, shape (n_features, n_features)\n        Robust covariance of the features.\n\n    support : array-like, type boolean, shape (n_samples,)\n        A mask of the observations that have been used to compute\n        the robust location and covariance estimates of the data set.\n\n    \"\"\"\n    random_state = check_random_state(random_state)\n\n    X = check_array(X, ensure_min_samples=2, estimator='fast_mcd')\n    n_samples, n_features = X.shape\n\n    # minimum breakdown value\n    if support_fraction is None:\n        n_support = int(np.ceil(0.5 * (n_samples + n_features + 1)))\n    else:\n        n_support = int(support_fraction * n_samples)\n\n    # 1-dimensional case quick computation\n    # (Rousseeuw, P. J. and Leroy, A. M. (2005) References, in Robust\n    #  Regression and Outlier Detection, John Wiley & Sons, chapter 4)\n    if n_features == 1:\n        if n_support < n_samples:\n            # find the sample shortest halves\n            X_sorted = np.sort(np.ravel(X))\n            diff = X_sorted[n_support:] - X_sorted[:(n_samples - n_support)]\n            halves_start = np.where(diff == np.min(diff))[0]\n            # take the middle points' mean to get the robust location estimate\n            location = 0.5 * (X_sorted[n_support + halves_start] +\n                              X_sorted[halves_start]).mean()\n            support = np.zeros(n_samples, dtype=bool)\n            X_centered = X - location\n            support[np.argsort(np.abs(X_centered), 0)[:n_support]] = True\n            covariance = np.asarray([[np.var(X[support])]])\n            location = np.array([location])\n            # get precision matrix in an optimized way\n            precision = linalg.pinvh(covariance)\n            dist = (np.dot(X_centered, precision) * (X_centered)).sum(axis=1)\n        else:\n            support = np.ones(n_samples, dtype=bool)\n            covariance = np.asarray([[np.var(X)]])\n            location = np.asarray([np.mean(X)])\n            X_centered = X - location\n            # get precision matrix in an optimized way\n            precision = linalg.pinvh(covariance)\n            dist = (np.dot(X_centered, precision) * (X_centered)).sum(axis=1)\n# Starting FastMCD algorithm for p-dimensional case\n    if (n_samples > 500) and (n_features > 1):\n        # 1. Find candidate supports on subsets\n        # a. split the set in subsets of size ~ 300\n        n_subsets = n_samples // 300\n        n_samples_subsets = n_samples // n_subsets\n        samples_shuffle = random_state.permutation(n_samples)\n        h_subset = int(np.ceil(n_samples_subsets *\n                       (n_support / float(n_samples))))\n        # b. perform a total of 500 trials\n        n_trials_tot = 500\n        # c. select 10 best (location, covariance) for each subset\n        n_best_sub = 10\n        n_trials = max(10, n_trials_tot // n_subsets)\n        n_best_tot = n_subsets * n_best_sub\n        all_best_locations = np.zeros((n_best_tot, n_features))\n        try:\n            all_best_covariances = np.zeros((n_best_tot, n_features,\n                                             n_features))\n        except MemoryError:\n            # The above is too big. Let's try with something much small\n            # (and less optimal)\n            all_best_covariances = np.zeros((n_best_tot, n_features,\n                                             n_features))\n            n_best_tot = 10\n            n_best_sub = 2\n        for i in range(n_subsets):\n            low_bound = i * n_samples_subsets\n            high_bound = low_bound + n_samples_subsets\n            current_subset = X[samples_shuffle[low_bound:high_bound]]\n            best_locations_sub, best_covariances_sub, _, _ = select_candidates(\n                current_subset, h_subset, n_trials,\n                select=n_best_sub, n_iter=2,\n                cov_computation_method=cov_computation_method,\n                random_state=random_state)\n            subset_slice = np.arange(i * n_best_sub, (i + 1) * n_best_sub)\n            all_best_locations[subset_slice] = best_locations_sub\n            all_best_covariances[subset_slice] = best_covariances_sub\n        # 2. Pool the candidate supports into a merged set\n        # (possibly the full dataset)\n        n_samples_merged = min(1500, n_samples)\n        h_merged = int(np.ceil(n_samples_merged *\n                       (n_support / float(n_samples))))\n        if n_samples > 1500:\n            n_best_merged = 10\n        else:\n            n_best_merged = 1\n        # find the best couples (location, covariance) on the merged set\n        selection = random_state.permutation(n_samples)[:n_samples_merged]\n        locations_merged, covariances_merged, supports_merged, d = \\\n            select_candidates(\n                X[selection], h_merged,\n                n_trials=(all_best_locations, all_best_covariances),\n                select=n_best_merged,\n                cov_computation_method=cov_computation_method,\n                random_state=random_state)\n        # 3. Finally get the overall best (locations, covariance) couple\n        if n_samples < 1500:\n            # directly get the best couple (location, covariance)\n            location = locations_merged[0]\n            covariance = covariances_merged[0]\n            support = np.zeros(n_samples, dtype=bool)\n            dist = np.zeros(n_samples)\n            support[selection] = supports_merged[0]\n            dist[selection] = d[0]\n        else:\n            # select the best couple on the full dataset\n            locations_full, covariances_full, supports_full, d = \\\n                select_candidates(\n                    X, n_support,\n                    n_trials=(locations_merged, covariances_merged),\n                    select=1,\n                    cov_computation_method=cov_computation_method,\n                    random_state=random_state)\n            location = locations_full[0]\n            covariance = covariances_full[0]\n            support = supports_full[0]\n            dist = d[0]\n    elif n_features > 1:\n        # 1. Find the 10 best couples (location, covariance)\n        # considering two iterations\n        n_trials = 30\n        n_best = 10\n        locations_best, covariances_best, _, _ = select_candidates(\n            X, n_support, n_trials=n_trials, select=n_best, n_iter=2,\n            cov_computation_method=cov_computation_method,\n            random_state=random_state)\n        # 2. Select the best couple on the full dataset amongst the 10\n        locations_full, covariances_full, supports_full, d = select_candidates(\n            X, n_support, n_trials=(locations_best, covariances_best),\n            select=1, cov_computation_method=cov_computation_method,\n            random_state=random_state)\n        location = locations_full[0]\n        covariance = covariances_full[0]\n        support = supports_full[0]\n        dist = d[0]\n\n    return location, covariance, support, dist"
        },
        {
          "file": "sklearn/covariance/robust_covariance.py",
          "type": "function",
          "name": "correct_covariance",
          "class_name": "MinCovDet",
          "code": "def correct_covariance(self, data):\n        \"\"\"Apply a correction to raw Minimum Covariance Determinant estimates.\n\n        Correction using the empirical correction factor suggested\n        by Rousseeuw and Van Driessen in [RVD]_.\n\n        Parameters\n        ----------\n        data : array-like, shape (n_samples, n_features)\n            The data matrix, with p features and n samples.\n            The data set must be the one which was used to compute\n            the raw estimates.\n\n        References\n        ----------\n\n        .. [RVD] `A Fast Algorithm for the Minimum Covariance\n            Determinant Estimator, 1999, American Statistical Association\n            and the American Society for Quality, TECHNOMETRICS`\n\n        Returns\n        -------\n        covariance_corrected : array-like, shape (n_features, n_features)\n            Corrected robust covariance estimate.\n\n        \"\"\"\n        correction = np.median(self.dist_) / chi2(data.shape[1]).isf(0.5)\n        covariance_corrected = self.raw_covariance_ * correction\n        self.dist_ /= correction\n        return covariance_corrected"
        }
      ]
    },
    {
      "pr_number": 4894,
      "pr_title": "fix dtype transform problem in KNN and RandomForest",
      "pr_body": "Fix #4879\n\nI add a int array to store the indices before the loop. \n@amueller I also checked LabelEncoder but it can only accommodate 1d array, so a loop is still needed and slicing problem will still be there. \nFor MultiLabelBinarizer, it can accommodate 2d array, but it will return a 2d binary array to store the indices, which is very different from the original implementation in RandomForest and KNN. I am wondering is there any better way to do so? \n",
      "issue_id": 4879,
      "issue_title": "RandomForestClassifier changes label values",
      "issue_body": "Consider the following example:\n\n``` python\nimport numpy as np\nfrom sklearn.ensemble import RandomForestClassifier\n\nCLASSES = 15\nX = np.arange(CLASSES).reshape(-1, 1)\ny = [ch for ch in 'ABCDEFGHIJKLMNOPQRSTU'[:CLASSES]]\n\nprint(RandomForestClassifier().fit(X, y).predict(X))\n```\n\nThe output is\n\n`['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'B' 'B' 'B' 'B' 'B']`\n\nwhereas I expect it to be\n\n`['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O']`\n\nI traced the cause of the error to [here](https://github.com/scikit-learn/scikit-learn/blob/733629d256abaf9a31d6e6305f859768432907ed/sklearn/ensemble/forest.py#L418). I'm happy to put in a pull request, but can others please verify they get the same unexpected behaviour first?\n\nSystem information:\n``\nCPython 3.3.5\nIPython 3.1.0\n\nnumpy 1.9.2\nscipy 0.14.0\nscikit-learn 0.16.1(conda)\n\ncompiler   : GCC 4.2.1 (Apple Inc. build 5577)\nsystem     : Darwin\nrelease    : 13.4.0\nmachine    : x86_64\nprocessor  : i386\nCPU cores  : 4\ninterpreter: 64bit\n\n```\n```\n",
      "issue_closed_at": "2015-06-24T21:33:10Z",
      "base_commit": "cd98ffced656569681648c88c51dae844501643c",
      "changes": [
        {
          "file": "sklearn/ensemble/forest.py",
          "type": "function",
          "name": "_validate_y_class_weight",
          "class_name": "ForestClassifier",
          "code": "def _validate_y_class_weight(self, y):\n        y = np.copy(y)\n        expanded_class_weight = None\n\n        if self.class_weight is not None:\n            y_original = np.copy(y)\n\n        self.classes_ = []\n        self.n_classes_ = []\n\n        for k in range(self.n_outputs_):\n            classes_k, y[:, k] = np.unique(y[:, k], return_inverse=True)\n            self.classes_.append(classes_k)\n            self.n_classes_.append(classes_k.shape[0])\n\n        if self.class_weight is not None:\n            valid_presets = ('auto', 'balanced', 'balanced_subsample', 'subsample', 'auto')\n            if isinstance(self.class_weight, six.string_types):\n                if self.class_weight not in valid_presets:\n                    raise ValueError('Valid presets for class_weight include '\n                                     '\"balanced\" and \"balanced_subsample\". Given \"%s\".'\n                                     % self.class_weight)\n                if self.class_weight == \"subsample\":\n                    warn(\"class_weight='subsample' is deprecated and will be removed in 0.18.\"\n                         \" It was replaced by class_weight='balanced_subsample' \"\n                         \"using the balanced strategy.\", DeprecationWarning)\n                if self.warm_start:\n                    warn('class_weight presets \"balanced\" or \"balanced_subsample\" are '\n                         'not recommended for warm_start if the fitted data '\n                         'differs from the full dataset. In order to use '\n                         '\"balanced\" weights, use compute_class_weight(\"balanced\", '\n                         'classes, y). In place of y you can use a large '\n                         'enough sample of the full training set target to '\n                         'properly estimate the class frequency '\n                         'distributions. Pass the resulting weights as the '\n                         'class_weight parameter.')\n\n            if (self.class_weight not in ['subsample', 'balanced_subsample'] or\n                    not self.bootstrap):\n                if self.class_weight == 'subsample':\n                    class_weight = 'auto'\n                elif self.class_weight == \"balanced_subsample\":\n                    class_weight = \"balanced\"\n                else:\n                    class_weight = self.class_weight\n                with warnings.catch_warnings():\n                    if class_weight == \"auto\":\n                        warnings.simplefilter('ignore', DeprecationWarning)\n                    expanded_class_weight = compute_sample_weight(class_weight,\n                                                                  y_original)\n\n        return y, expanded_class_weight"
        },
        {
          "file": "sklearn/tree/tree.py",
          "type": "function",
          "name": "fit",
          "class_name": "BaseDecisionTree",
          "code": "def fit(self, X, y, sample_weight=None, check_input=True):\n        \"\"\"Build a decision tree from the training set (X, y).\n\n        Parameters\n        ----------\n        X : array-like or sparse matrix, shape = [n_samples, n_features]\n            The training input samples. Internally, it will be converted to\n            ``dtype=np.float32`` and if a sparse matrix is provided\n            to a sparse ``csc_matrix``.\n\n        y : array-like, shape = [n_samples] or [n_samples, n_outputs]\n            The target values (class labels in classification, real numbers in\n            regression). In the regression case, use ``dtype=np.float64`` and\n            ``order='C'`` for maximum efficiency.\n\n        sample_weight : array-like, shape = [n_samples] or None\n            Sample weights. If None, then samples are equally weighted. Splits\n            that would create child nodes with net zero or negative weight are\n            ignored while searching for a split in each node. In the case of\n            classification, splits are also ignored if they would result in any\n            single class carrying a negative weight in either child node.\n\n        check_input : boolean, (default=True)\n            Allow to bypass several input checking.\n            Don't use this parameter unless you know what you do.\n\n        Returns\n        -------\n        self : object\n            Returns self.\n        \"\"\"\n        random_state = check_random_state(self.random_state)\n        if check_input:\n            X = check_array(X, dtype=DTYPE, accept_sparse=\"csc\")\n            if issparse(X):\n                X.sort_indices()\n\n                if X.indices.dtype != np.intc or X.indptr.dtype != np.intc:\n                    raise ValueError(\"No support for np.int64 index based \"\n                                     \"sparse matrices\")\n\n        # Determine output settings\n        n_samples, self.n_features_ = X.shape\n        is_classification = isinstance(self, ClassifierMixin)\n\n        y = np.atleast_1d(y)\n        expanded_class_weight = None\n\n        if y.ndim == 1:\n            # reshape is necessary to preserve the data contiguity against vs\n            # [:, np.newaxis] that does not.\n            y = np.reshape(y, (-1, 1))\n\n        self.n_outputs_ = y.shape[1]\n\n        if is_classification:\n            y = np.copy(y)\n\n            self.classes_ = []\n            self.n_classes_ = []\n\n            if self.class_weight is not None:\n                y_original = np.copy(y)\n\n            for k in range(self.n_outputs_):\n                classes_k, y[:, k] = np.unique(y[:, k], return_inverse=True)\n                self.classes_.append(classes_k)\n                self.n_classes_.append(classes_k.shape[0])\n\n            if self.class_weight is not None:\n                expanded_class_weight = compute_sample_weight(\n                    self.class_weight, y_original)\n\n        else:\n            self.classes_ = [None] * self.n_outputs_\n            self.n_classes_ = [1] * self.n_outputs_\n\n        self.n_classes_ = np.array(self.n_classes_, dtype=np.intp)\n\n        if getattr(y, \"dtype\", None) != DOUBLE or not y.flags.contiguous:\n            y = np.ascontiguousarray(y, dtype=DOUBLE)\n\n        # Check parameters\n        max_depth = ((2 ** 31) - 1 if self.max_depth is None\n                     else self.max_depth)\n        max_leaf_nodes = (-1 if self.max_leaf_nodes is None\n                          else self.max_leaf_nodes)\n\n        if isinstance(self.max_features, six.string_types):\n            if self.max_features == \"auto\":\n                if is_classification:\n                    max_features = max(1, int(np.sqrt(self.n_features_)))\n                else:\n                    max_features = self.n_features_\n            elif self.max_features == \"sqrt\":\n                max_features = max(1, int(np.sqrt(self.n_features_)))\n            elif self.max_features == \"log2\":\n                max_features = max(1, int(np.log2(self.n_features_)))\n            else:\n                raise ValueError(\n                    'Invalid value for max_features. Allowed string '\n                    'values are \"auto\", \"sqrt\" or \"log2\".')\n        elif self.max_features is None:\n            max_features = self.n_features_\n        elif isinstance(self.max_features, (numbers.Integral, np.integer)):\n            max_features = self.max_features\n        else:  # float\n            if self.max_features > 0.0:\n                max_features = max(1, int(self.max_features * self.n_features_))\n            else:\n                max_features = 0\n\n        self.max_features_ = max_features\n\n        if len(y) != n_samples:\n            raise ValueError(\"Number of labels=%d does not match \"\n                             \"number of samples=%d\" % (len(y), n_samples))\n        if self.min_samples_split <= 0:\n            raise ValueError(\"min_samples_split must be greater than zero.\")\n        if self.min_samples_leaf <= 0:\n            raise ValueError(\"min_samples_leaf must be greater than zero.\")\n        if not 0 <= self.min_weight_fraction_leaf <= 0.5:\n            raise ValueError(\"min_weight_fraction_leaf must in [0, 0.5]\")\n        if max_depth <= 0:\n            raise ValueError(\"max_depth must be greater than zero. \")\n        if not (0 < max_features <= self.n_features_):\n            raise ValueError(\"max_features must be in (0, n_features]\")\n        if not isinstance(max_leaf_nodes, (numbers.Integral, np.integer)):\n            raise ValueError(\"max_leaf_nodes must be integral number but was \"\n                             \"%r\" % max_leaf_nodes)\n        if -1 < max_leaf_nodes < 2:\n            raise ValueError((\"max_leaf_nodes {0} must be either smaller than \"\n                              \"0 or larger than 1\").format(max_leaf_nodes))\n\n        if sample_weight is not None:\n            if (getattr(sample_weight, \"dtype\", None) != DOUBLE or\n                    not sample_weight.flags.contiguous):\n                sample_weight = np.ascontiguousarray(\n                    sample_weight, dtype=DOUBLE)\n            if len(sample_weight.shape) > 1:\n                raise ValueError(\"Sample weights array has more \"\n                                 \"than one dimension: %d\" %\n                                 len(sample_weight.shape))\n            if len(sample_weight) != n_samples:\n                raise ValueError(\"Number of weights=%d does not match \"\n                                 \"number of samples=%d\" %\n                                 (len(sample_weight), n_samples))\n\n        if expanded_class_weight is not None:\n            if sample_weight is not None:\n                sample_weight = sample_weight * expanded_class_weight\n            else:\n                sample_weight = expanded_class_weight\n\n        # Set min_weight_leaf from min_weight_fraction_leaf\n        if self.min_weight_fraction_leaf != 0. and sample_weight is not None:\n            min_weight_leaf = (self.min_weight_fraction_leaf *\n                               np.sum(sample_weight))\n        else:\n            min_weight_leaf = 0.\n\n        # Set min_samples_split sensibly\n        min_samples_split = max(self.min_samples_split,\n                                2 * self.min_samples_leaf)\n\n        # Build tree\n        criterion = self.criterion\n        if not isinstance(criterion, Criterion):\n            if is_classification:\n                criterion = CRITERIA_CLF[self.criterion](self.n_outputs_,\n                                                         self.n_classes_)\n            else:\n                criterion = CRITERIA_REG[self.criterion](self.n_outputs_)\n\n        SPLITTERS = SPARSE_SPLITTERS if issparse(X) else DENSE_SPLITTERS\n\n        splitter = self.splitter\n        if not isinstance(self.splitter, Splitter):\n            splitter = SPLITTERS[self.splitter](criterion,\n                                                self.max_features_,\n                                                self.min_samples_leaf,\n                                                min_weight_leaf,\n                                                random_state)\n\n        self.tree_ = Tree(self.n_features_, self.n_classes_, self.n_outputs_)\n\n        # Use BestFirst if max_leaf_nodes given; use DepthFirst otherwise\n        if max_leaf_nodes < 0:\n            builder = DepthFirstTreeBuilder(splitter, min_samples_split,\n                                            self.min_samples_leaf,\n                                            min_weight_leaf,\n                                            max_depth)\n        else:\n            builder = BestFirstTreeBuilder(splitter, min_samples_split,\n                                           self.min_samples_leaf,\n                                           min_weight_leaf,\n                                           max_depth,\n                                           max_leaf_nodes)\n\n        builder.build(self.tree_, X, y, sample_weight)\n\n        if self.n_outputs_ == 1:\n            self.n_classes_ = self.n_classes_[0]\n            self.classes_ = self.classes_[0]\n\n        return self"
        }
      ]
    },
    {
      "pr_number": 12443,
      "pr_title": "[MRG] Fix incorrect error when OneHotEncoder.transform called prior to fit",
      "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist\r\n-->\r\n\r\n#### Reference Issues/PRs\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\nFixes #12395 \r\n\r\n\r\n#### What does this implement/fix? Explain your changes.\r\nFix checks for `categories_` attribute when transform is called. \r\n \r\nThe edge case when class is instantiated with `categorical_features` set to all False,  `_legacy_mode` is set to True but `_legacy_fit_transform` never gets called. To fix this, explicitly tested for this and set `categories_` to an empty list. This fix also allows `get_feature_names` to return an empty array.\r\n\r\n```\r\nimport numpy as np\r\nfrom sklearn.preprocessing import OneHotEncoder\r\n\r\nX = np.array([[3, 2, 1], [0, 1, 1]])\r\n\r\n# Edge case: all non-categorical\r\ncat = [False, False, False]\r\nenc = OneHotEncoder(categorical_features=cat)\r\nenc.fit(X)\r\nprint('Categories are {}'.format(enc.categories_))\r\nprint('Feature names are {}'.format(enc.get_feature_names()))\r\n```\r\n\r\n```\r\nCategories are []\r\nFeature names are []\r\n```\r\n\r\n#### Any other comments?\r\n\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
      "issue_id": 12395,
      "issue_title": "OneHotEncoder throws unhelpful error messages when tranform called prior to fit",
      "issue_body": "<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: https://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n<!-- Example: Joblib Error thrown when calling fit on LatentDirichletAllocation with evaluate_every > 0-->\r\n\r\nOneHotEncoder throws an `AttributeError` instead of a `NotFittedError` when tranform is called prior to fit\r\n\r\n- if `transform` is called prior to being fit an `AttributeError` is thrown\r\n- if `categories` includes arrays of of unicode type\r\n\r\n#### Steps/Code to Reproduce\r\n<!--\r\nExample:\r\n```python\r\nfrom sklearn.feature_extraction.text import CountVectorizer\r\nfrom sklearn.decomposition import LatentDirichletAllocation\r\n\r\ndocs = [\"Help I have a bug\" for i in range(1000)]\r\n\r\nvectorizer = CountVectorizer(input=docs, analyzer='word')\r\nlda_features = vectorizer.fit_transform(docs)\r\n\r\nlda_model = LatentDirichletAllocation(\r\n    n_topics=10,\r\n    learning_method='online',\r\n    evaluate_every=10,\r\n    n_jobs=4,\r\n)\r\nmodel = lda_model.fit(lda_features)\r\n```\r\nIf the code is too long, feel free to put it in a public gist and link\r\nit in the issue: https://gist.github.com\r\n-->\r\n\r\n```\r\nimport numpy as np\r\nfrom sklearn.preprocessing import OneHotEncoder\r\n\r\ncategories = sorted(['Dillon', 'Joel', 'Earl', 'Liz'])\r\nX = np.array(['Dillon', 'Dillon', 'Joel', 'Liz', 'Liz', 'Earl']).reshape(-1, 1)\r\n\r\nohe = OneHotEncoder(categories=[sorted(categories)])\r\nohe.transform(X)\r\n# Throws AttributeError: 'OneHotEncoder' object has no attribute '_legacy_mode'\r\n```\r\n\r\n#### Expected Results\r\n<!-- Example: No error is thrown. Please paste or describe the expected results.-->\r\n`NotFittedError: This OneHotEncoder instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.`\r\n#### Actual Results\r\n<!-- Please paste or specifically describe the actual output or traceback. -->\r\n`Throws AttributeError: 'OneHotEncoder' object has no attribute '_legacy_mode'`\r\n#### Versions\r\n<!--\r\nPlease run the following snippet and paste the output below.\r\nFor scikit-learn >= 0.20:\r\nimport sklearn; sklearn.show_versions()\r\nFor scikit-learn < 0.20:\r\nimport platform; print(platform.platform())\r\nimport sys; print(\"Python\", sys.version)\r\nimport numpy; print(\"NumPy\", numpy.__version__)\r\nimport scipy; print(\"SciPy\", scipy.__version__)\r\nimport sklearn; print(\"Scikit-Learn\", sklearn.__version__)\r\n-->\r\n```\r\nSystem\r\n------\r\n    python: 3.6.3 (default, Oct  4 2017, 06:09:38)  [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.37)]\r\nexecutable: /Users/dillon/Envs/mewtwo/bin/python3.6\r\n   machine: Darwin-18.0.0-x86_64-i386-64bit\r\n\r\nBLAS\r\n----\r\n    macros: NO_ATLAS_INFO=3, HAVE_CBLAS=None\r\n  lib_dirs:\r\ncblas_libs: cblas\r\n\r\nPython deps\r\n-----------\r\n       pip: 18.1\r\nsetuptools: 39.0.1\r\n   sklearn: 0.20.0\r\n     numpy: 1.14.2\r\n     scipy: 1.0.1\r\n    Cython: None\r\n    pandas: 0.22.0\r\n```\r\n\r\n<!-- Thanks for contributing! -->\r\n",
      "issue_closed_at": "2018-11-12T22:49:54Z",
      "base_commit": "88cdeb85a9303a7b580952b720703a4aca9dc1c0",
      "changes": [
        {
          "file": "sklearn/preprocessing/_encoders.py",
          "type": "line",
          "name": "line 21",
          "code": "from .base import _transform_selected\nfrom .label import _encode, _encode_check_unknown\n\n\nrange = six.moves.range\n\n\n__all__ = [\n    'OneHotEncoder',\n    'OrdinalEncoder'"
        },
        {
          "file": "sklearn/preprocessing/_encoders.py",
          "type": "function",
          "name": "_handle_deprecations",
          "class_name": "OneHotEncoder",
          "code": "def _handle_deprecations(self, X):\n\n        # internal version of the attributes to handle deprecations\n        self._categories = getattr(self, '_categories', None)\n        self._categorical_features = getattr(self, '_categorical_features',\n                                             None)\n\n        # user manually set the categories or second fit -> never legacy mode\n        if self.categories is not None or self._categories is not None:\n            self._legacy_mode = False\n            if self.categories is not None:\n                self._categories = self.categories\n\n        # categories not set -> infer if we need legacy mode or not\n        elif self.n_values is not None and self.n_values != 'auto':\n            msg = (\n                \"Passing 'n_values' is deprecated in version 0.20 and will be \"\n                \"removed in 0.22. You can use the 'categories' keyword \"\n                \"instead. 'n_values=n' corresponds to 'categories=[range(n)]'.\"\n            )\n            warnings.warn(msg, DeprecationWarning)\n            self._legacy_mode = True\n\n        else:  # n_values = 'auto'\n            if self.handle_unknown == 'ignore':\n                # no change in behaviour, no need to raise deprecation warning\n                self._legacy_mode = False\n                self._categories = 'auto'\n                if self.n_values == 'auto':\n                    # user manually specified this\n                    msg = (\n                        \"Passing 'n_values' is deprecated in version 0.20 and \"\n                        \"will be removed in 0.22. n_values='auto' can be \"\n                        \"replaced with categories='auto'.\"\n                    )\n                    warnings.warn(msg, DeprecationWarning)\n            else:\n\n                # check if we have integer or categorical input\n                try:\n                    check_array(X, dtype=np.int)\n                except ValueError:\n                    self._legacy_mode = False\n                    self._categories = 'auto'\n                else:\n                    msg = (\n                        \"The handling of integer data will change in version \"\n                        \"0.22. Currently, the categories are determined \"\n                        \"based on the range [0, max(values)], while in the \"\n                        \"future they will be determined based on the unique \"\n                        \"values.\\nIf you want the future behaviour and \"\n                        \"silence this warning, you can specify \"\n                        \"\\\"categories='auto'\\\".\\n\"\n                        \"In case you used a LabelEncoder before this \"\n                        \"OneHotEncoder to convert the categories to integers, \"\n                        \"then you can now use the OneHotEncoder directly.\"\n                    )\n                    warnings.warn(msg, FutureWarning)\n                    self._legacy_mode = True\n                    self.n_values = 'auto'\n\n        # if user specified categorical_features -> always use legacy mode\n        if self.categorical_features is not None:\n            if (isinstance(self.categorical_features, six.string_types)\n                    and self.categorical_features == 'all'):\n                warnings.warn(\n                    \"The 'categorical_features' keyword is deprecated in \"\n                    \"version 0.20 and will be removed in 0.22. The passed \"\n                    \"value of 'all' is the default and can simply be removed.\",\n                    DeprecationWarning)\n            else:\n                if self.categories is not None:\n                    raise ValueError(\n                        \"The 'categorical_features' keyword is deprecated, \"\n                        \"and cannot be used together with specifying \"\n                        \"'categories'.\")\n                warnings.warn(\n                    \"The 'categorical_features' keyword is deprecated in \"\n                    \"version 0.20 and will be removed in 0.22. You can \"\n                    \"use the ColumnTransformer instead.\", DeprecationWarning)\n                self._legacy_mode = True\n            self._categorical_features = self.categorical_features\n        else:\n            self._categorical_features = 'all'"
        },
        {
          "file": "sklearn/preprocessing/_encoders.py",
          "type": "function",
          "name": "transform",
          "class_name": "OrdinalEncoder",
          "code": "def transform(self, X):\n        \"\"\"Transform X to ordinal codes.\n\n        Parameters\n        ----------\n        X : array-like, shape [n_samples, n_features]\n            The data to encode.\n\n        Returns\n        -------\n        X_out : sparse matrix or a 2-d array\n            Transformed input.\n\n        \"\"\"\n        X_int, _ = self._transform(X)\n        return X_int.astype(self.dtype, copy=False)"
        },
        {
          "file": "sklearn/preprocessing/_encoders.py",
          "type": "function",
          "name": "get_feature_names",
          "class_name": "OneHotEncoder",
          "code": "def get_feature_names(self, input_features=None):\n        \"\"\"Return feature names for output features.\n\n        Parameters\n        ----------\n        input_features : list of string, length n_features, optional\n            String names for input features if available. By default,\n            \"x0\", \"x1\", ... \"xn_features\" is used.\n\n        Returns\n        -------\n        output_feature_names : array of string, length n_output_features\n\n        \"\"\"\n        check_is_fitted(self, 'categories_')\n        cats = self.categories_\n        if input_features is None:\n            input_features = ['x%d' % i for i in range(len(cats))]\n        elif(len(input_features) != len(self.categories_)):\n            raise ValueError(\n                \"input_features should have length equal to number of \"\n                \"features ({}), got {}\".format(len(self.categories_),\n                                               len(input_features)))\n\n        feature_names = []\n        for i in range(len(cats)):\n            names = [\n                input_features[i] + '_' + six.text_type(t) for t in cats[i]]\n            feature_names.extend(names)\n\n        return np.array(feature_names, dtype=object)"
        }
      ]
    }
  ]
}