{
  "Selected_candidate": {
    "pr_number": 6379,
    "pr_title": "[MRG+1] fix StratifiedShuffleSplit train and test overlap",
    "pr_body": "Fix #6121.\n",
    "issue_id": 6121,
    "issue_title": "StratifiedShuffleSplit generates overlapping train and test indices",
    "issue_body": "Why there is overlap between `dev_idx` and `t_idx` in the following code? It should have been no overlap.\n\n<pre>\n```\ntrain_test_split = StratifiedShuffleSplit(labels, n_iter=1, test_size=0.2, random_state=0)\nfor train_idx, test_idx in train_test_split:\n    train_tmp = set(train_idx)\n    test_tmp = set(test_idx)\n    assert_equal(train_tmp.intersection(test_tmp), set())\n    X_train = np.copy(feats[train_idx])\n    y_train = np.copy(labels[train_idx])\n    trans_train = np.copy(trans[train_idx])\n    X_valid = np.copy(feats[test_idx])\n    y_valid = np.copy(labels[test_idx])\n    trans_valid = np.copy(trans[test_idx])\ndel feats\ndel labels\ndel trans\ndev_test_split = StratifiedShuffleSplit(y_valid, n_iter=1, test_size=0.5, random_state=0)\nfor dev_idx, t_idx in dev_test_split:\n    dev_tmp = set(dev_idx)\n    t_tmp = set(t_idx)\n    <b>assert_equal(dev_tmp.intersection(t_tmp), set())</b>\n    X_dev = np.copy(X_valid[dev_idx])\n    y_dev = np.copy(y_valid[dev_idx])\n    trans_dev = np.copy(trans_valid[dev_idx])\n    X_test = np.copy(X_valid[t_idx])\n    y_test = np.copy(y_valid[t_idx])\n    trans_test = np.copy(trans_valid[t_idx])\ndel X_valid\ndel y_valid\ndel trans_valid\n```\n</pre>\n\nThe second `assert_equal()` test prompted a error as follows:\n\n```\n    assert_equal(dev_tmp.intersection(t_tmp), set())\n  File \"/home/xyang45/miniconda2/lib/python2.7/unittest/case.py\", line 513, in assertEqual\n    assertion_func(first, second, msg=msg)\n  File \"/home/xyang45/miniconda2/lib/python2.7/unittest/case.py\", line 796, in assertSetEqual\n    self.fail(self._formatMessage(msg, standardMsg))\n  File \"/home/xyang45/miniconda2/lib/python2.7/unittest/case.py\", line 410, in fail\n    raise self.failureException(msg)\nAssertionError: Items in the first set but not the second:\n1160\n1161\n907\n1070\n1747\n2232\n```\n",
    "issue_closed_at": "2016-02-29T20:23:23Z",
    "base_commit": "90f2c184c64c7e27e2def24ca4a8340c5501f545",
    "changes": [
      {
        "file": "sklearn/cross_validation.py",
        "type": "function",
        "name": "_iter_indices",
        "class_name": "LabelShuffleSplit",
        "code": "def _iter_indices(self):\n        for label_train, label_test in super(LabelShuffleSplit,\n                                             self)._iter_indices():\n            # these are the indices of classes in the partition\n            # invert them into data indices\n\n            train = np.flatnonzero(np.in1d(self.label_indices, label_train))\n            test = np.flatnonzero(np.in1d(self.label_indices, label_test))\n\n            yield train, test"
      },
      {
        "file": "sklearn/model_selection/_split.py",
        "type": "function",
        "name": "_iter_indices",
        "class_name": "StratifiedShuffleSplit",
        "code": "def _iter_indices(self, X, y, labels=None):\n        n_samples = _num_samples(X)\n        n_train, n_test = _validate_shuffle_split(n_samples, self.test_size,\n                                                  self.train_size)\n        classes, y_indices = np.unique(y, return_inverse=True)\n        n_classes = classes.shape[0]\n\n        class_counts = bincount(y_indices)\n        if np.min(class_counts) < 2:\n            raise ValueError(\"The least populated class in y has only 1\"\n                             \" member, which is too few. The minimum\"\n                             \" number of labels for any class cannot\"\n                             \" be less than 2.\")\n\n        if n_train < n_classes:\n            raise ValueError('The train_size = %d should be greater or '\n                             'equal to the number of classes = %d' %\n                             (n_train, n_classes))\n        if n_test < n_classes:\n            raise ValueError('The test_size = %d should be greater or '\n                             'equal to the number of classes = %d' %\n                             (n_test, n_classes))\n\n        rng = check_random_state(self.random_state)\n        p_i = class_counts / float(n_samples)\n        n_i = np.round(n_train * p_i).astype(int)\n        t_i = np.minimum(class_counts - n_i,\n                         np.round(n_test * p_i).astype(int))\n\n        for _ in range(self.n_iter):\n            train = []\n            test = []\n\n            for i, class_i in enumerate(classes):\n                permutation = rng.permutation(class_counts[i])\n                perm_indices_class_i = np.where((y == class_i))[0][permutation]\n\n                train.extend(perm_indices_class_i[:n_i[i]])\n                test.extend(perm_indices_class_i[n_i[i]:n_i[i] + t_i[i]])\n\n            # Because of rounding issues (as n_train and n_test are not\n            # dividers of the number of elements per class), we may end\n            # up here with less samples in train and test than asked for.\n            if len(train) < n_train or len(test) < n_test:\n                # We complete by affecting randomly the missing indexes\n                missing_indices = np.where(bincount(train + test,\n                                                    minlength=len(y)) == 0)[0]\n                missing_indices = rng.permutation(missing_indices)\n                train.extend(missing_indices[:(n_train - len(train))])\n                test.extend(missing_indices[-(n_test - len(test)):])\n\n            train = rng.permutation(train)\n            test = rng.permutation(test)\n\n            yield train, test"
      }
    ]
  },
  "Justification": "Candidate C concerns partitioning data into train and test index sets that must not overlap. It can inform debugging of the Pipeline issue because both problems turn on consistent index handling: split indices must stay disjoint, just as `pipe[:len(pipe)]` requires a `__len__` method that agrees with the Pipeline's indexing support. Its focus on index correctness gives useful background for fixing the current issue.",
  "instance_id": "scikit-learn__scikit-learn-13439",
  "repo": "scikit-learn/scikit-learn",
  "created_at": "2019-03-12T20:32:50Z",
  "problem_statement": "Pipeline should implement __len__\n#### Description\r\n\r\nWith the new indexing support `pipe[:len(pipe)]` raises an error.\r\n\r\n#### Steps/Code to Reproduce\r\n\r\n```python\r\nfrom sklearn import svm\r\nfrom sklearn.datasets import samples_generator\r\nfrom sklearn.feature_selection import SelectKBest\r\nfrom sklearn.feature_selection import f_regression\r\nfrom sklearn.pipeline import Pipeline\r\n\r\n# generate some data to play with\r\nX, y = samples_generator.make_classification(\r\n    n_informative=5, n_redundant=0, random_state=42)\r\n\r\nanova_filter = SelectKBest(f_regression, k=5)\r\nclf = svm.SVC(kernel='linear')\r\npipe = Pipeline([('anova', anova_filter), ('svc', clf)])\r\n\r\nlen(pipe)\r\n```\r\n\r\n#### Versions\r\n\r\n```\r\nSystem:\r\n    python: 3.6.7 | packaged by conda-forge | (default, Feb 19 2019, 18:37:23)  [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]\r\nexecutable: /Users/krisz/.conda/envs/arrow36/bin/python\r\n   machine: Darwin-18.2.0-x86_64-i386-64bit\r\n\r\nBLAS:\r\n    macros: HAVE_CBLAS=None\r\n  lib_dirs: /Users/krisz/.conda/envs/arrow36/lib\r\ncblas_libs: openblas, openblas\r\n\r\nPython deps:\r\n       pip: 19.0.3\r\nsetuptools: 40.8.0\r\n   sklearn: 0.21.dev0\r\n     numpy: 1.16.2\r\n     scipy: 1.2.1\r\n    Cython: 0.29.6\r\n    pandas: 0.24.1\r\n```\n",
  "patch": "diff --git a/sklearn/pipeline.py b/sklearn/pipeline.py\n--- a/sklearn/pipeline.py\n+++ b/sklearn/pipeline.py\n@@ -199,6 +199,12 @@ def _iter(self, with_final=True):\n             if trans is not None and trans != 'passthrough':\n                 yield idx, name, trans\n \n+    def __len__(self):\n+        \"\"\"\n+        Returns the length of the Pipeline\n+        \"\"\"\n+        return len(self.steps)\n+\n     def __getitem__(self, ind):\n         \"\"\"Returns a sub-pipeline or a single esimtator in the pipeline\n \n"
}