{
  "instance_id": "scikit-learn__scikit-learn-13439",
  "repo": "scikit-learn/scikit-learn",
  "created_at": "2019-03-12T20:32:50Z",
  "problem_statement": "Pipeline should implement __len__\n#### Description\r\n\r\nWith the new indexing support `pipe[:len(pipe)]` raises an error.\r\n\r\n#### Steps/Code to Reproduce\r\n\r\n```python\r\nfrom sklearn import svm\r\nfrom sklearn.datasets import samples_generator\r\nfrom sklearn.feature_selection import SelectKBest\r\nfrom sklearn.feature_selection import f_regression\r\nfrom sklearn.pipeline import Pipeline\r\n\r\n# generate some data to play with\r\nX, y = samples_generator.make_classification(\r\n    n_informative=5, n_redundant=0, random_state=42)\r\n\r\nanova_filter = SelectKBest(f_regression, k=5)\r\nclf = svm.SVC(kernel='linear')\r\npipe = Pipeline([('anova', anova_filter), ('svc', clf)])\r\n\r\nlen(pipe)\r\n```\r\n\r\n#### Versions\r\n\r\n```\r\nSystem:\r\n    python: 3.6.7 | packaged by conda-forge | (default, Feb 19 2019, 18:37:23)  [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]\r\nexecutable: /Users/krisz/.conda/envs/arrow36/bin/python\r\n   machine: Darwin-18.2.0-x86_64-i386-64bit\r\n\r\nBLAS:\r\n    macros: HAVE_CBLAS=None\r\n  lib_dirs: /Users/krisz/.conda/envs/arrow36/lib\r\ncblas_libs: openblas, openblas\r\n\r\nPython deps:\r\n       pip: 19.0.3\r\nsetuptools: 40.8.0\r\n   sklearn: 0.21.dev0\r\n     numpy: 1.16.2\r\n     scipy: 1.2.1\r\n    Cython: 0.29.6\r\n    pandas: 0.24.1\r\n```\n",
  "patch": "diff --git a/sklearn/pipeline.py b/sklearn/pipeline.py\n--- a/sklearn/pipeline.py\n+++ b/sklearn/pipeline.py\n@@ -199,6 +199,12 @@ def _iter(self, with_final=True):\n             if trans is not None and trans != 'passthrough':\n                 yield idx, name, trans\n \n+    def __len__(self):\n+        \"\"\"\n+        Returns the length of the Pipeline\n+        \"\"\"\n+        return len(self.steps)\n+\n     def __getitem__(self, ind):\n         \"\"\"Returns a sub-pipeline or a single esimtator in the pipeline\n \n",
  "similar_bug_items": [
    {
      "pr_number": 8035,
      "pr_title": "[MRG+1] Catch cases for different class size in MLPClassifier with warm start (#7976) ",
      "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#Contributing-Pull-Requests\r\n-->\r\n#### Reference Issue\r\n<!-- Example: Fixes #1234 -->\r\nFixes #7976 \r\n\r\n#### What does this implement/fix? Explain your changes.\r\nThis provides a test for different cases that throws an error when warm_start = True for MLPClassifier. Currently, vague errors are thrown when class size is different between the current fit and the previous fit. This fix will throw a clearer error message. \r\n\r\n#### Any other comments?\r\n\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n\r\n",
      "issue_id": 7976,
      "issue_title": "MLPClasiffier produce error when trying to re-fit",
      "issue_body": "Hi,\r\n\r\nI am training a MLPClasiffier model twice, each time on a different data-set.\r\nOn the second iteration the fit method produce an error. Every time a different error.\r\nI did the processes on various models, this is the only one that produce an error. \r\n\r\nthis are the errors i get (each time a different one) - \r\n\r\n> _lbfgsb.error: failed in converting 7th argument `g' of _lbfgsb.setulb to C/Fortran array\r\n\r\n> ValueError: total size of new array must be unchanged\r\n\r\n > ValueError: operands could not be broadcast together with shapes (154,100) (25,) (154,100) \r\n\r\nthanks",
      "issue_closed_at": "2016-12-29T01:01:13Z",
      "base_commit": "40a1b7a0b10fea2995c7aaa46c90a9633e6d99f6",
      "changes": [
        {
          "file": "sklearn/neural_network/multilayer_perceptron.py",
          "type": "function",
          "name": "_validate_input",
          "class_name": "MLPRegressor",
          "code": "def _validate_input(self, X, y, incremental):\n        X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],\n                         multi_output=True, y_numeric=True)\n        if y.ndim == 2 and y.shape[1] == 1:\n            y = column_or_1d(y, warn=True)\n        return X, y"
        },
        {
          "file": "sklearn/neural_network/multilayer_perceptron.py",
          "type": "function",
          "name": "predict",
          "class_name": "MLPRegressor",
          "code": "def predict(self, X):\n        \"\"\"Predict using the multi-layer perceptron model.\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix}, shape (n_samples, n_features)\n            The input data.\n\n        Returns\n        -------\n        y : array-like, shape (n_samples, n_outputs)\n            The predicted values.\n        \"\"\"\n        check_is_fitted(self, \"coefs_\")\n        y_pred = self._predict(X)\n        if y_pred.shape[1] == 1:\n            return y_pred.ravel()\n        return y_pred"
        }
      ]
    },
    {
      "pr_number": 11042,
      "pr_title": "[MRG + 1] Ensuring that the OneHotEncoder outputs sparse matrix with given dtype #11034",
      "pr_body": "#### Reference Issues/PRs\r\nOriginal discussion at #11034\r\n\r\n#### What does this implement/fix? Explain your changes.\r\n",
      "issue_id": 11034,
      "issue_title": "OneHotEncoder does not output scipy sparse matrix of given dtype",
      "issue_body": "#### Description\r\nOneHotEncoder ignores the specified dtype in the construction of the sparse array when mixed input data are passed, i.e with both categorical and real data type\r\n\r\n#### Steps/Code to Reproduce\r\n```python\r\nimport numpy as np\r\n\r\nfrom sklearn.preprocessing import OneHotEncoder\r\nenc = OneHotEncoder(dtype=np.float32, categorical_features=[0, 1])\r\n\r\nx = np.array([[0, 1, 0, 0], [1, 2, 0, 0]], dtype=int)\r\nsparse = enc.fit(x).transform(x)\r\n```\r\n\r\n#### Expected Results\r\n```python\r\nsparse: <2x6 sparse matrix of type '<class 'numpy.float32'>'\r\n\twith 4 stored elements in COOrdinate format>\r\n```\r\n\r\n#### Actual Results\r\n```python\r\nsparse: <2x6 sparse matrix of type '<class 'numpy.float64'>'\r\n\twith 4 stored elements in COOrdinate format>\r\n```\r\n\r\n#### Versions\r\n__Platform__: Linux-4.13.0-38-generic-x86_64-with-debian-stretch-sid\r\n__Python__: 3.6.3 |Anaconda custom (64-bit)| (default, Oct 13 2017, 12:02:49) [GCC 7.2.0]\r\n__NumPy__: NumPy \r\n__SciPy__: SciPy 1.0.1\r\n__Scikit-Learn__: Scikit-Learn 0.19.1\r\n",
      "issue_closed_at": "2018-06-06T09:03:02Z",
      "base_commit": "f049ec72eb70443ec8d7826066c4246035677c11",
      "changes": [
        {
          "file": "sklearn/preprocessing/data.py",
          "type": "function",
          "name": "add_dummy_feature",
          "class_name": null,
          "code": "def add_dummy_feature(X, value=1.0):\n    \"\"\"Augment dataset with an additional dummy feature.\n\n    This is useful for fitting an intercept term with implementations which\n    cannot otherwise fit it directly.\n\n    Parameters\n    ----------\n    X : {array-like, sparse matrix}, shape [n_samples, n_features]\n        Data.\n\n    value : float\n        Value to use for the dummy feature.\n\n    Returns\n    -------\n\n    X : {array, sparse matrix}, shape [n_samples, n_features + 1]\n        Same data with dummy feature added as first column.\n\n    Examples\n    --------\n\n    >>> from sklearn.preprocessing import add_dummy_feature\n    >>> add_dummy_feature([[0, 1], [1, 0]])\n    array([[1., 0., 1.],\n           [1., 1., 0.]])\n    \"\"\"\n    X = check_array(X, accept_sparse=['csc', 'csr', 'coo'], dtype=FLOAT_DTYPES)\n    n_samples, n_features = X.shape\n    shape = (n_samples, n_features + 1)\n    if sparse.issparse(X):\n        if sparse.isspmatrix_coo(X):\n            # Shift columns to the right.\n            col = X.col + 1\n            # Column indices of dummy feature are 0 everywhere.\n            col = np.concatenate((np.zeros(n_samples), col))\n            # Row indices of dummy feature are 0, ..., n_samples-1.\n            row = np.concatenate((np.arange(n_samples), X.row))\n            # Prepend the dummy feature n_samples times.\n            data = np.concatenate((np.ones(n_samples) * value, X.data))\n            return sparse.coo_matrix((data, (row, col)), shape)\n        elif sparse.isspmatrix_csc(X):\n            # Shift index pointers since we need to add n_samples elements.\n            indptr = X.indptr + n_samples\n            # indptr[0] must be 0.\n            indptr = np.concatenate((np.array([0]), indptr))\n            # Row indices of dummy feature are 0, ..., n_samples-1.\n            indices = np.concatenate((np.arange(n_samples), X.indices))\n            # Prepend the dummy feature n_samples times.\n            data = np.concatenate((np.ones(n_samples) * value, X.data))\n            return sparse.csc_matrix((data, indices, indptr), shape)\n        else:\n            klass = X.__class__\n            return klass(add_dummy_feature(X.tocoo(), value))\n    else:\n        return np.hstack((np.ones((n_samples, 1)) * value, X))"
        },
        {
          "file": "sklearn/preprocessing/data.py",
          "type": "function",
          "name": "_transform_selected",
          "class_name": null,
          "code": "def _transform_selected(X, transform, selected=\"all\", copy=True):\n    \"\"\"Apply a transform function to portion of selected features\n\n    Parameters\n    ----------\n    X : {array-like, sparse matrix}, shape [n_samples, n_features]\n        Dense array or sparse matrix.\n\n    transform : callable\n        A callable transform(X) -> X_transformed\n\n    copy : boolean, optional\n        Copy X even if it could be avoided.\n\n    selected: \"all\" or array of indices or mask\n        Specify which features to apply the transform to.\n\n    Returns\n    -------\n    X : array or sparse matrix, shape=(n_samples, n_features_new)\n    \"\"\"\n    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)\n\n    if isinstance(selected, six.string_types) and selected == \"all\":\n        return transform(X)\n\n    if len(selected) == 0:\n        return X\n\n    n_features = X.shape[1]\n    ind = np.arange(n_features)\n    sel = np.zeros(n_features, dtype=bool)\n    sel[np.asarray(selected)] = True\n    not_sel = np.logical_not(sel)\n    n_selected = np.sum(sel)\n\n    if n_selected == 0:\n        # No features selected.\n        return X\n    elif n_selected == n_features:\n        # All features selected.\n        return transform(X)\n    else:\n        X_sel = transform(X[:, ind[sel]])\n        X_not_sel = X[:, ind[not_sel]]\n\n        if sparse.issparse(X_sel) or sparse.issparse(X_not_sel):\n            return sparse.hstack((X_sel, X_not_sel))\n        else:\n            return np.hstack((X_sel, X_not_sel))"
        },
        {
          "file": "sklearn/preprocessing/data.py",
          "type": "function",
          "name": "fit_transform",
          "class_name": "OneHotEncoder",
          "code": "def fit_transform(self, X, y=None):\n        \"\"\"Fit OneHotEncoder to X, then transform X.\n\n        Equivalent to self.fit(X).transform(X), but more convenient and more\n        efficient. See fit for the parameters, transform for the return value.\n\n        Parameters\n        ----------\n        X : array-like, shape [n_samples, n_feature]\n            Input array of type int.\n        \"\"\"\n        return _transform_selected(X, self._fit_transform,\n                                   self.categorical_features, copy=True)"
        },
        {
          "file": "sklearn/preprocessing/data.py",
          "type": "function",
          "name": "transform",
          "class_name": "CategoricalEncoder",
          "code": "def transform(self, X):\n        \"\"\"Transform X using specified encoding scheme.\n\n        Parameters\n        ----------\n        X : array-like, shape [n_samples, n_features]\n            The data to encode.\n\n        Returns\n        -------\n        X_out : sparse matrix or a 2-d array\n            Transformed input.\n\n        \"\"\"\n        X_temp = check_array(X, dtype=None)\n        if not hasattr(X, 'dtype') and np.issubdtype(X_temp.dtype, np.str_):\n            X = check_array(X, dtype=np.object)\n        else:\n            X = X_temp\n\n        n_samples, n_features = X.shape\n        X_int = np.zeros_like(X, dtype=np.int)\n        X_mask = np.ones_like(X, dtype=np.bool)\n\n        for i in range(n_features):\n            Xi = X[:, i]\n            valid_mask = np.in1d(Xi, self.categories_[i])\n\n            if not np.all(valid_mask):\n                if self.handle_unknown == 'error':\n                    diff = np.unique(X[~valid_mask, i])\n                    msg = (\"Found unknown categories {0} in column {1}\"\n                           \" during transform\".format(diff, i))\n                    raise ValueError(msg)\n                else:\n                    # Set the problematic rows to an acceptable value and\n                    # continue `The rows are marked `X_mask` and will be\n                    # removed later.\n                    X_mask[:, i] = valid_mask\n                    Xi = Xi.copy()\n                    Xi[~valid_mask] = self.categories_[i][0]\n            X_int[:, i] = self._label_encoders_[i].transform(Xi)\n\n        if self.encoding == 'ordinal':\n            return X_int.astype(self.dtype, copy=False)\n\n        mask = X_mask.ravel()\n        n_values = [cats.shape[0] for cats in self.categories_]\n        n_values = np.array([0] + n_values)\n        feature_indices = np.cumsum(n_values)\n\n        indices = (X_int + feature_indices[:-1]).ravel()[mask]\n        indptr = X_mask.sum(axis=1).cumsum()\n        indptr = np.insert(indptr, 0, 0)\n        data = np.ones(n_samples * n_features)[mask]\n\n        out = sparse.csr_matrix((data, indices, indptr),\n                                shape=(n_samples, feature_indices[-1]),\n                                dtype=self.dtype)\n        if self.encoding == 'onehot-dense':\n            return out.toarray()\n        else:\n            return out"
        }
      ]
    },
    {
      "pr_number": 6379,
      "pr_title": "[MRG+1] fix StratifiedShuffleSplit train and test overlap",
      "pr_body": "Fix #6121.\n",
      "issue_id": 6121,
      "issue_title": "StratifiedShuffleSplit generates overlapping train and test indices",
      "issue_body": "Why there is overlap between `dev_idx` and `t_idx` in the following code? It should have been no overlap.\n\n<pre>\n```\ntrain_test_split = StratifiedShuffleSplit(labels, n_iter=1, test_size=0.2, random_state=0)\nfor train_idx, test_idx in train_test_split:\n    train_tmp = set(train_idx)\n    test_tmp = set(test_idx)\n    assert_equal(train_tmp.intersection(test_tmp), set())\n    X_train = np.copy(feats[train_idx])\n    y_train = np.copy(labels[train_idx])\n    trans_train = np.copy(trans[train_idx])\n    X_valid = np.copy(feats[test_idx])\n    y_valid = np.copy(labels[test_idx])\n    trans_valid = np.copy(trans[test_idx])\ndel feats\ndel labels\ndel trans\ndev_test_split = StratifiedShuffleSplit(y_valid, n_iter=1, test_size=0.5, random_state=0)\nfor dev_idx, t_idx in dev_test_split:\n    dev_tmp = set(dev_idx)\n    t_tmp = set(t_idx)\n    <b>assert_equal(dev_tmp.intersection(t_tmp), set())</b>\n    X_dev = np.copy(X_valid[dev_idx])\n    y_dev = np.copy(y_valid[dev_idx])\n    trans_dev = np.copy(trans_valid[dev_idx])\n    X_test = np.copy(X_valid[t_idx])\n    y_test = np.copy(y_valid[t_idx])\n    trans_test = np.copy(trans_valid[t_idx])\ndel X_valid\ndel y_valid\ndel trans_valid\n```\n</pre>\n\nThe second `assert_equal()` test prompted a error as follows:\n\n```\n    assert_equal(dev_tmp.intersection(t_tmp), set())\n  File \"/home/xyang45/miniconda2/lib/python2.7/unittest/case.py\", line 513, in assertEqual\n    assertion_func(first, second, msg=msg)\n  File \"/home/xyang45/miniconda2/lib/python2.7/unittest/case.py\", line 796, in assertSetEqual\n    self.fail(self._formatMessage(msg, standardMsg))\n  File \"/home/xyang45/miniconda2/lib/python2.7/unittest/case.py\", line 410, in fail\n    raise self.failureException(msg)\nAssertionError: Items in the first set but not the second:\n1160\n1161\n907\n1070\n1747\n2232\n```\n",
      "issue_closed_at": "2016-02-29T20:23:23Z",
      "base_commit": "90f2c184c64c7e27e2def24ca4a8340c5501f545",
      "changes": [
        {
          "file": "sklearn/cross_validation.py",
          "type": "function",
          "name": "_iter_indices",
          "class_name": "LabelShuffleSplit",
          "code": "def _iter_indices(self):\n        for label_train, label_test in super(LabelShuffleSplit,\n                                             self)._iter_indices():\n            # these are the indices of classes in the partition\n            # invert them into data indices\n\n            train = np.flatnonzero(np.in1d(self.label_indices, label_train))\n            test = np.flatnonzero(np.in1d(self.label_indices, label_test))\n\n            yield train, test"
        },
        {
          "file": "sklearn/model_selection/_split.py",
          "type": "function",
          "name": "_iter_indices",
          "class_name": "StratifiedShuffleSplit",
          "code": "def _iter_indices(self, X, y, labels=None):\n        n_samples = _num_samples(X)\n        n_train, n_test = _validate_shuffle_split(n_samples, self.test_size,\n                                                  self.train_size)\n        classes, y_indices = np.unique(y, return_inverse=True)\n        n_classes = classes.shape[0]\n\n        class_counts = bincount(y_indices)\n        if np.min(class_counts) < 2:\n            raise ValueError(\"The least populated class in y has only 1\"\n                             \" member, which is too few. The minimum\"\n                             \" number of labels for any class cannot\"\n                             \" be less than 2.\")\n\n        if n_train < n_classes:\n            raise ValueError('The train_size = %d should be greater or '\n                             'equal to the number of classes = %d' %\n                             (n_train, n_classes))\n        if n_test < n_classes:\n            raise ValueError('The test_size = %d should be greater or '\n                             'equal to the number of classes = %d' %\n                             (n_test, n_classes))\n\n        rng = check_random_state(self.random_state)\n        p_i = class_counts / float(n_samples)\n        n_i = np.round(n_train * p_i).astype(int)\n        t_i = np.minimum(class_counts - n_i,\n                         np.round(n_test * p_i).astype(int))\n\n        for _ in range(self.n_iter):\n            train = []\n            test = []\n\n            for i, class_i in enumerate(classes):\n                permutation = rng.permutation(class_counts[i])\n                perm_indices_class_i = np.where((y == class_i))[0][permutation]\n\n                train.extend(perm_indices_class_i[:n_i[i]])\n                test.extend(perm_indices_class_i[n_i[i]:n_i[i] + t_i[i]])\n\n            # Because of rounding issues (as n_train and n_test are not\n            # dividers of the number of elements per class), we may end\n            # up here with less samples in train and test than asked for.\n            if len(train) < n_train or len(test) < n_test:\n                # We complete by affecting randomly the missing indexes\n                missing_indices = np.where(bincount(train + test,\n                                                    minlength=len(y)) == 0)[0]\n                missing_indices = rng.permutation(missing_indices)\n                train.extend(missing_indices[:(n_train - len(train))])\n                test.extend(missing_indices[-(n_test - len(test)):])\n\n            train = rng.permutation(train)\n            test = rng.permutation(test)\n\n            yield train, test"
        }
      ]
    },
    {
      "pr_number": 2523,
      "pr_title": "[MRG] FIX #2481: add warning for bug in old numpy with unicode",
      "pr_body": "This is a fix for  #2481 which causes the jenkins tests to fail with numpy 1.3.0.\n",
      "issue_id": 2481,
      "issue_title": "LabelEncoder doesn't work correctly for unicode labels in Python 2.6 + numpy 1.3",
      "issue_body": "LabelEncoder works incorrectly for unicode labels in Python 2.6 + numpy 1.3. This is currently untested; to reproduce replace bytestrings with unicode strings here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/preprocessing/tests/test_label.py#L191\n\nThis is the cause of Jenkins failure that https://github.com/scikit-learn/scikit-learn/pull/2462 triggered.\n",
      "issue_closed_at": "2013-10-16T12:24:23Z",
      "base_commit": "a2580d6fe04340f535c624ae48b4a52a66cbc839",
      "changes": [
        {
          "file": "sklearn/preprocessing/label.py",
          "type": "line",
          "name": "line 8",
          "code": "\nfrom ..base import BaseEstimator, TransformerMixin\n\nfrom ..utils.fixes import unique\nfrom ..utils import deprecated, column_or_1d\n\nfrom ..utils.multiclass import unique_labels"
        },
        {
          "file": "sklearn/preprocessing/label.py",
          "type": "line",
          "name": "line 25",
          "code": "    'LabelEncoder',\n]\n\n\nclass LabelEncoder(BaseEstimator, TransformerMixin):\n    \"\"\"Encode labels with value between 0 and n_classes-1."
        },
        {
          "file": "sklearn/preprocessing/label.py",
          "type": "function",
          "name": "fit",
          "class_name": "LabelBinarizer",
          "code": "def fit(self, y):\n        \"\"\"Fit label binarizer\n\n        Parameters\n        ----------\n        y : numpy array of shape (n_samples,) or sequence of sequences\n            Target values. In the multilabel case the nested sequences can\n            have variable lengths.\n\n        Returns\n        -------\n        self : returns an instance of self.\n        \"\"\"\n        y_type = type_of_target(y)\n        self.multilabel_ = y_type.startswith('multilabel')\n        if self.multilabel_:\n            self.indicator_matrix_ = y_type == 'multilabel-indicator'\n\n        self.classes_ = unique_labels(y)\n\n        return self"
        },
        {
          "file": "sklearn/preprocessing/label.py",
          "type": "function",
          "name": "fit_transform",
          "class_name": "LabelEncoder",
          "code": "def fit_transform(self, y):\n        \"\"\"Fit label encoder and return encoded labels\n\n        Parameters\n        ----------\n        y : array-like of shape [n_samples]\n            Target values.\n\n        Returns\n        -------\n        y : array-like of shape [n_samples]\n        \"\"\"\n        y = column_or_1d(y, warn=True)\n        self.classes_, y = unique(y, return_inverse=True)\n        return y"
        },
        {
          "file": "sklearn/preprocessing/label.py",
          "type": "function",
          "name": "transform",
          "class_name": "LabelBinarizer",
          "code": "def transform(self, y):\n        \"\"\"Transform multi-class labels to binary labels\n\n        The output of transform is sometimes referred to by some authors as the\n        1-of-K coding scheme.\n\n        Parameters\n        ----------\n        y : numpy array of shape [n_samples] or sequence of sequences\n            Target values. In the multilabel case the nested sequences can\n            have variable lengths.\n\n        Returns\n        -------\n        Y : numpy array of shape [n_samples, n_classes]\n        \"\"\"\n        self._check_fitted()\n\n        y_is_multilabel = type_of_target(y).startswith('multilabel')\n\n        if y_is_multilabel and not self.multilabel_:\n            raise ValueError(\"The object was not fitted with multilabel\"\n                             \" input.\")\n\n        return label_binarize(y, self.classes_,\n                              multilabel=self.multilabel_,\n                              pos_label=self.pos_label,\n                              neg_label=self.neg_label)"
        }
      ]
    },
    {
      "pr_number": 9396,
      "pr_title": "[MRG + 1] Fix wrong error message in StratifiedKFold",
      "pr_body": "#### Reference Issue\r\nFixes #9381 \r\n\r\n#### What does this implement/fix? Explain your changes.\r\nFixes wrong error message used while creating folds in StratifiedKFold \r\n\r\n#### Any other comments?\r\nNone\r\n",
      "issue_id": 9381,
      "issue_title": "Wrong error message in StratifiedKFold",
      "issue_body": "``All the n_groups for individual classes are less than n_splits=%d.`` That seem confusing / wrong",
      "issue_closed_at": "2017-07-18T23:25:28Z",
      "base_commit": "eece6d909a8baa03c6f19276f494a7680ae299e1",
      "changes": [
        {
          "file": "sklearn/model_selection/_split.py",
          "type": "function",
          "name": "_make_test_folds",
          "class_name": "StratifiedKFold",
          "code": "def _make_test_folds(self, X, y=None):\n        rng = self.random_state\n        y = np.asarray(y)\n        n_samples = y.shape[0]\n        unique_y, y_inversed = np.unique(y, return_inverse=True)\n        y_counts = np.bincount(y_inversed)\n        min_groups = np.min(y_counts)\n        if np.all(self.n_splits > y_counts):\n            raise ValueError(\"All the n_groups for individual classes\"\n                             \" are less than n_splits=%d.\"\n                             % (self.n_splits))\n        if self.n_splits > min_groups:\n            warnings.warn((\"The least populated class in y has only %d\"\n                           \" members, which is too few. The minimum\"\n                           \" number of groups for any class cannot\"\n                           \" be less than n_splits=%d.\"\n                           % (min_groups, self.n_splits)), Warning)\n\n        # pre-assign each sample to a test fold index using individual KFold\n        # splitting strategies for each class so as to respect the balance of\n        # classes\n        # NOTE: Passing the data corresponding to ith class say X[y==class_i]\n        # will break when the data is not 100% stratifiable for all classes.\n        # So we pass np.zeroes(max(c, n_splits)) as data to the KFold\n        per_cls_cvs = [\n            KFold(self.n_splits, shuffle=self.shuffle,\n                  random_state=rng).split(np.zeros(max(count, self.n_splits)))\n            for count in y_counts]\n\n        test_folds = np.zeros(n_samples, dtype=np.int)\n        for test_fold_indices, per_cls_splits in enumerate(zip(*per_cls_cvs)):\n            for cls, (_, test_split) in zip(unique_y, per_cls_splits):\n                cls_test_folds = test_folds[y == cls]\n                # the test split can be too big because we used\n                # KFold(...).split(X[:max(c, n_splits)]) when data is not 100%\n                # stratifiable for all the classes\n                # (we use a warning instead of raising an exception)\n                # If this is the case, let's trim it:\n                test_split = test_split[test_split < len(cls_test_folds)]\n                cls_test_folds[test_split] = test_fold_indices\n                test_folds[y == cls] = cls_test_folds\n\n        return test_folds"
        }
      ]
    }
  ]
}