{
    "Selected_candidate": {
        "pr_number": 6379,
        "pr_title": "[MRG+1] fix StratifiedShuffleSplit train and test overlap",
        "pr_body": "Fix #6121.\n",
        "issue_id": 6121,
        "issue_title": "StratifiedShuffleSplit generates overlapping train and test indices",
        "issue_body": "Why is there overlap between `dev_idx` and `t_idx` in the following code? There should be no overlap.\n\n<pre>\n```\ntrain_test_split = StratifiedShuffleSplit(labels, n_iter=1, test_size=0.2, random_state=0)\nfor train_idx, test_idx in train_test_split:\n    train_tmp = set(train_idx)\n    test_tmp = set(test_idx)\n    assert_equal(train_tmp.intersection(test_tmp), set())\n    X_train = np.copy(feats[train_idx])\n    y_train = np.copy(labels[train_idx])\n    trans_train = np.copy(trans[train_idx])\n    X_valid = np.copy(feats[test_idx])\n    y_valid = np.copy(labels[test_idx])\n    trans_valid = np.copy(trans[test_idx])\ndel feats\ndel labels\ndel trans\ndev_test_split = StratifiedShuffleSplit(y_valid, n_iter=1, test_size=0.5, random_state=0)\nfor dev_idx, t_idx in dev_test_split:\n    dev_tmp = set(dev_idx)\n    t_tmp = set(t_idx)\n    <b>assert_equal(dev_tmp.intersection(t_tmp), set())</b>\n    X_dev = np.copy(X_valid[dev_idx])\n    y_dev = np.copy(y_valid[dev_idx])\n    trans_dev = np.copy(trans_valid[dev_idx])\n    X_test = np.copy(X_valid[t_idx])\n    y_test = np.copy(y_valid[t_idx])\n    trans_test = np.copy(trans_valid[t_idx])\ndel X_valid\ndel y_valid\ndel trans_valid\n```\n</pre>\n\nThe second `assert_equal()` check raised an error as follows:\n\n```\n    assert_equal(dev_tmp.intersection(t_tmp), set())\n  File \"/home/xyang45/miniconda2/lib/python2.7/unittest/case.py\", line 513, in assertEqual\n    assertion_func(first, second, msg=msg)\n  File \"/home/xyang45/miniconda2/lib/python2.7/unittest/case.py\", line 796, in assertSetEqual\n    self.fail(self._formatMessage(msg, standardMsg))\n  File \"/home/xyang45/miniconda2/lib/python2.7/unittest/case.py\", line 410, in fail\n    raise self.failureException(msg)\nAssertionError: Items in the first set but not the second:\n1160\n1161\n907\n1070\n1747\n2232\n```\n",
        "issue_closed_at": "2016-02-29T20:23:23Z",
        "base_commit": "90f2c184c64c7e27e2def24ca4a8340c5501f545",
        "changes": [
            {
                "file": "sklearn/cross_validation.py",
                "type": "function",
                "name": "_iter_indices",
                "class_name": "LabelShuffleSplit",
                "code": "def _iter_indices(self):\n        for label_train, label_test in super(LabelShuffleSplit,\n                                             self)._iter_indices():\n            # these are the indices of classes in the partition\n            # invert them into data indices\n\n            train = np.flatnonzero(np.in1d(self.label_indices, label_train))\n            test = np.flatnonzero(np.in1d(self.label_indices, label_test))\n\n            yield train, test"
            },
            {
                "file": "sklearn/model_selection/_split.py",
                "type": "function",
                "name": "_iter_indices",
                "class_name": "StratifiedShuffleSplit",
                "code": "def _iter_indices(self, X, y, labels=None):\n        n_samples = _num_samples(X)\n        n_train, n_test = _validate_shuffle_split(n_samples, self.test_size,\n                                                  self.train_size)\n        classes, y_indices = np.unique(y, return_inverse=True)\n        n_classes = classes.shape[0]\n\n        class_counts = bincount(y_indices)\n        if np.min(class_counts) < 2:\n            raise ValueError(\"The least populated class in y has only 1\"\n                             \" member, which is too few. The minimum\"\n                             \" number of labels for any class cannot\"\n                             \" be less than 2.\")\n\n        if n_train < n_classes:\n            raise ValueError('The train_size = %d should be greater or '\n                             'equal to the number of classes = %d' %\n                             (n_train, n_classes))\n        if n_test < n_classes:\n            raise ValueError('The test_size = %d should be greater or '\n                             'equal to the number of classes = %d' %\n                             (n_test, n_classes))\n\n        rng = check_random_state(self.random_state)\n        p_i = class_counts / float(n_samples)\n        n_i = np.round(n_train * p_i).astype(int)\n        t_i = np.minimum(class_counts - n_i,\n                         np.round(n_test * p_i).astype(int))\n\n        for _ in range(self.n_iter):\n            train = []\n            test = []\n\n            for i, class_i in enumerate(classes):\n                permutation = rng.permutation(class_counts[i])\n                perm_indices_class_i = np.where((y == class_i))[0][permutation]\n\n                train.extend(perm_indices_class_i[:n_i[i]])\n                test.extend(perm_indices_class_i[n_i[i]:n_i[i] + t_i[i]])\n\n            # Because of rounding issues (as n_train and n_test are not\n            # dividers of the number of elements per class), we may end\n            # up here with less samples in train and test than asked for.\n            if len(train) < n_train or len(test) < n_test:\n                # We complete by affecting randomly the missing indexes\n                missing_indices = np.where(bincount(train + test,\n                                                    minlength=len(y)) == 0)[0]\n                missing_indices = rng.permutation(missing_indices)\n                train.extend(missing_indices[:(n_train - len(train))])\n                test.extend(missing_indices[-(n_test - len(test)):])\n\n            train = rng.permutation(train)\n            test = rng.permutation(test)\n\n            yield train, test"
            }
        ]
    },
    "Justification": "Candidate C concerns how data is partitioned into train and test sets. Its handling of index bookkeeping across splits can inform debugging of the Pipeline, where indexing must likewise be managed correctly in the `__len__` method. Its focus on preserving data integrity provides a foundation for correcting the CURRENT issue."
}