{
  "instance_id": "scikit-learn__scikit-learn-14087",
  "repo": "scikit-learn/scikit-learn",
  "created_at": "2019-06-13T20:09:22Z",
  "problem_statement": "IndexError thrown with LogisticRegressionCV and refit=False\n#### Description\r\nThe following error is thrown when trying to estimate a regularization parameter via cross-validation, *without* refitting.\r\n\r\n#### Steps/Code to Reproduce\r\n```python\r\nimport sys\r\nimport sklearn\r\nfrom sklearn.linear_model import LogisticRegressionCV\r\nimport numpy as np\r\n\r\nnp.random.seed(29)\r\nX = np.random.normal(size=(1000, 3))\r\nbeta = np.random.normal(size=3)\r\nintercept = np.random.normal(size=None)\r\ny = np.sign(intercept + X @ beta)\r\n\r\nLogisticRegressionCV(\r\ncv=5,\r\nsolver='saga', # same error with 'liblinear'\r\ntol=1e-2,\r\nrefit=False).fit(X, y)\r\n```\r\n\r\n\r\n#### Expected Results\r\nNo error is thrown. \r\n\r\n#### Actual Results\r\n```\r\n---------------------------------------------------------------------------\r\nIndexError                                Traceback (most recent call last)\r\n<ipython-input-3-81609fd8d2ca> in <module>\r\n----> 1 LogisticRegressionCV(refit=False).fit(X, y)\r\n\r\n~/.pyenv/versions/3.6.7/envs/jupyter/lib/python3.6/site-packages/sklearn/linear_model/logistic.py in fit(self, X, y, sample_weight)\r\n   2192                 else:\r\n   2193                     w = np.mean([coefs_paths[:, i, best_indices[i], :]\r\n-> 2194                                  for i in range(len(folds))], axis=0)\r\n   2195 \r\n   2196                 best_indices_C = best_indices % len(self.Cs_)\r\n\r\n~/.pyenv/versions/3.6.7/envs/jupyter/lib/python3.6/site-packages/sklearn/linear_model/logistic.py in <listcomp>(.0)\r\n   2192                 else:\r\n   2193                     w = np.mean([coefs_paths[:, i, best_indices[i], :]\r\n-> 2194                                  for i in range(len(folds))], axis=0)\r\n   2195 \r\n   2196                 best_indices_C = best_indices % len(self.Cs_)\r\n\r\nIndexError: too many indices for array\r\n```\r\n\r\n#### Versions\r\n```\r\nSystem:\r\n    python: 3.6.7 (default, 
May 13 2019, 16:14:45)  [GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.46.4)]\r\nexecutable: /Users/tsweetser/.pyenv/versions/3.6.7/envs/jupyter/bin/python\r\n   machine: Darwin-18.6.0-x86_64-i386-64bit\r\n\r\nBLAS:\r\n    macros: NO_ATLAS_INFO=3, HAVE_CBLAS=None\r\n  lib_dirs: \r\ncblas_libs: cblas\r\n\r\nPython deps:\r\n       pip: 19.1.1\r\nsetuptools: 39.0.1\r\n   sklearn: 0.21.2\r\n     numpy: 1.15.1\r\n     scipy: 1.1.0\r\n    Cython: 0.29.6\r\n    pandas: 0.24.2\r\n```\n",
  "patch": "diff --git a/sklearn/linear_model/logistic.py b/sklearn/linear_model/logistic.py\n--- a/sklearn/linear_model/logistic.py\n+++ b/sklearn/linear_model/logistic.py\n@@ -2170,7 +2170,7 @@ def fit(self, X, y, sample_weight=None):\n                 # Take the best scores across every fold and the average of\n                 # all coefficients corresponding to the best scores.\n                 best_indices = np.argmax(scores, axis=1)\n-                if self.multi_class == 'ovr':\n+                if multi_class == 'ovr':\n                     w = np.mean([coefs_paths[i, best_indices[i], :]\n                                  for i in range(len(folds))], axis=0)\n                 else:\n@@ -2180,8 +2180,11 @@ def fit(self, X, y, sample_weight=None):\n                 best_indices_C = best_indices % len(self.Cs_)\n                 self.C_.append(np.mean(self.Cs_[best_indices_C]))\n \n-                best_indices_l1 = best_indices // len(self.Cs_)\n-                self.l1_ratio_.append(np.mean(l1_ratios_[best_indices_l1]))\n+                if self.penalty == 'elasticnet':\n+                    best_indices_l1 = best_indices // len(self.Cs_)\n+                    self.l1_ratio_.append(np.mean(l1_ratios_[best_indices_l1]))\n+                else:\n+                    self.l1_ratio_.append(None)\n \n             if multi_class == 'multinomial':\n                 self.C_ = np.tile(self.C_, n_classes)\n",
  "similar_bug_items": [
    {
      "pr_number": 4894,
      "pr_title": "fix dtype transform problem in KNN and RandomForest",
      "pr_body": "Fix #4879\n\nI add a int array to store the indices before the loop. \n@amueller I also checked LabelEncoder but it can only accommodate 1d array, so a loop is still needed and slicing problem will still be there. \nFor MultiLabelBinarizer, it can accommodate 2d array, but it will return a 2d binary array to store the indices, which is very different from the original implementation in RandomForest and KNN. I am wondering is there any better way to do so? \n",
      "issue_id": 4879,
      "issue_title": "RandomForestClassifier changes label values",
      "issue_body": "Consider the following example:\n\n``` python\nimport numpy as np\nfrom sklearn.ensemble import RandomForestClassifier\n\nCLASSES = 15\nX = np.arange(CLASSES).reshape(-1, 1)\ny = [ch for ch in 'ABCDEFGHIJKLMNOPQRSTU'[:CLASSES]]\n\nprint(RandomForestClassifier().fit(X, y).predict(X))\n```\n\nThe output is\n\n`['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'B' 'B' 'B' 'B' 'B']`\n\nwhereas I expect it to be\n\n`['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O']`\n\nI traced the cause of the error to [here](https://github.com/scikit-learn/scikit-learn/blob/733629d256abaf9a31d6e6305f859768432907ed/sklearn/ensemble/forest.py#L418). I'm happy to put in a pull request, but can others please verify they get the same unexpected behaviour first?\n\nSystem information:\n``\nCPython 3.3.5\nIPython 3.1.0\n\nnumpy 1.9.2\nscipy 0.14.0\nscikit-learn 0.16.1(conda)\n\ncompiler   : GCC 4.2.1 (Apple Inc. build 5577)\nsystem     : Darwin\nrelease    : 13.4.0\nmachine    : x86_64\nprocessor  : i386\nCPU cores  : 4\ninterpreter: 64bit\n\n```\n```\n",
      "issue_closed_at": "2015-06-24T21:33:10Z",
      "base_commit": "cd98ffced656569681648c88c51dae844501643c",
      "changes": [
        {
          "file": "sklearn/ensemble/forest.py",
          "type": "function",
          "name": "_validate_y_class_weight",
          "class_name": "ForestClassifier",
          "code": "def _validate_y_class_weight(self, y):\n        y = np.copy(y)\n        expanded_class_weight = None\n\n        if self.class_weight is not None:\n            y_original = np.copy(y)\n\n        self.classes_ = []\n        self.n_classes_ = []\n\n        for k in range(self.n_outputs_):\n            classes_k, y[:, k] = np.unique(y[:, k], return_inverse=True)\n            self.classes_.append(classes_k)\n            self.n_classes_.append(classes_k.shape[0])\n\n        if self.class_weight is not None:\n            valid_presets = ('auto', 'balanced', 'balanced_subsample', 'subsample', 'auto')\n            if isinstance(self.class_weight, six.string_types):\n                if self.class_weight not in valid_presets:\n                    raise ValueError('Valid presets for class_weight include '\n                                     '\"balanced\" and \"balanced_subsample\". Given \"%s\".'\n                                     % self.class_weight)\n                if self.class_weight == \"subsample\":\n                    warn(\"class_weight='subsample' is deprecated and will be removed in 0.18.\"\n                         \" It was replaced by class_weight='balanced_subsample' \"\n                         \"using the balanced strategy.\", DeprecationWarning)\n                if self.warm_start:\n                    warn('class_weight presets \"balanced\" or \"balanced_subsample\" are '\n                         'not recommended for warm_start if the fitted data '\n                         'differs from the full dataset. In order to use '\n                         '\"balanced\" weights, use compute_class_weight(\"balanced\", '\n                         'classes, y). In place of y you can use a large '\n                         'enough sample of the full training set target to '\n                         'properly estimate the class frequency '\n                         'distributions. 
Pass the resulting weights as the '\n                         'class_weight parameter.')\n\n            if (self.class_weight not in ['subsample', 'balanced_subsample'] or\n                    not self.bootstrap):\n                if self.class_weight == 'subsample':\n                    class_weight = 'auto'\n                elif self.class_weight == \"balanced_subsample\":\n                    class_weight = \"balanced\"\n                else:\n                    class_weight = self.class_weight\n                with warnings.catch_warnings():\n                    if class_weight == \"auto\":\n                        warnings.simplefilter('ignore', DeprecationWarning)\n                    expanded_class_weight = compute_sample_weight(class_weight,\n                                                                  y_original)\n\n        return y, expanded_class_weight"
        },
        {
          "file": "sklearn/tree/tree.py",
          "type": "function",
          "name": "fit",
          "class_name": "BaseDecisionTree",
          "code": "def fit(self, X, y, sample_weight=None, check_input=True):\n        \"\"\"Build a decision tree from the training set (X, y).\n\n        Parameters\n        ----------\n        X : array-like or sparse matrix, shape = [n_samples, n_features]\n            The training input samples. Internally, it will be converted to\n            ``dtype=np.float32`` and if a sparse matrix is provided\n            to a sparse ``csc_matrix``.\n\n        y : array-like, shape = [n_samples] or [n_samples, n_outputs]\n            The target values (class labels in classification, real numbers in\n            regression). In the regression case, use ``dtype=np.float64`` and\n            ``order='C'`` for maximum efficiency.\n\n        sample_weight : array-like, shape = [n_samples] or None\n            Sample weights. If None, then samples are equally weighted. Splits\n            that would create child nodes with net zero or negative weight are\n            ignored while searching for a split in each node. 
In the case of\n            classification, splits are also ignored if they would result in any\n            single class carrying a negative weight in either child node.\n\n        check_input : boolean, (default=True)\n            Allow to bypass several input checking.\n            Don't use this parameter unless you know what you do.\n\n        Returns\n        -------\n        self : object\n            Returns self.\n        \"\"\"\n        random_state = check_random_state(self.random_state)\n        if check_input:\n            X = check_array(X, dtype=DTYPE, accept_sparse=\"csc\")\n            if issparse(X):\n                X.sort_indices()\n\n                if X.indices.dtype != np.intc or X.indptr.dtype != np.intc:\n                    raise ValueError(\"No support for np.int64 index based \"\n                                     \"sparse matrices\")\n\n        # Determine output settings\n        n_samples, self.n_features_ = X.shape\n        is_classification = isinstance(self, ClassifierMixin)\n\n        y = np.atleast_1d(y)\n        expanded_class_weight = None\n\n        if y.ndim == 1:\n            # reshape is necessary to preserve the data contiguity against vs\n            # [:, np.newaxis] that does not.\n            y = np.reshape(y, (-1, 1))\n\n        self.n_outputs_ = y.shape[1]\n\n        if is_classification:\n            y = np.copy(y)\n\n            self.classes_ = []\n            self.n_classes_ = []\n\n            if self.class_weight is not None:\n                y_original = np.copy(y)\n\n            for k in range(self.n_outputs_):\n                classes_k, y[:, k] = np.unique(y[:, k], return_inverse=True)\n                self.classes_.append(classes_k)\n                self.n_classes_.append(classes_k.shape[0])\n\n            if self.class_weight is not None:\n                expanded_class_weight = compute_sample_weight(\n                    self.class_weight, y_original)\n\n        else:\n            self.classes_ = [None] 
* self.n_outputs_\n            self.n_classes_ = [1] * self.n_outputs_\n\n        self.n_classes_ = np.array(self.n_classes_, dtype=np.intp)\n\n        if getattr(y, \"dtype\", None) != DOUBLE or not y.flags.contiguous:\n            y = np.ascontiguousarray(y, dtype=DOUBLE)\n\n        # Check parameters\n        max_depth = ((2 ** 31) - 1 if self.max_depth is None\n                     else self.max_depth)\n        max_leaf_nodes = (-1 if self.max_leaf_nodes is None\n                          else self.max_leaf_nodes)\n\n        if isinstance(self.max_features, six.string_types):\n            if self.max_features == \"auto\":\n                if is_classification:\n                    max_features = max(1, int(np.sqrt(self.n_features_)))\n                else:\n                    max_features = self.n_features_\n            elif self.max_features == \"sqrt\":\n                max_features = max(1, int(np.sqrt(self.n_features_)))\n            elif self.max_features == \"log2\":\n                max_features = max(1, int(np.log2(self.n_features_)))\n            else:\n                raise ValueError(\n                    'Invalid value for max_features. 
Allowed string '\n                    'values are \"auto\", \"sqrt\" or \"log2\".')\n        elif self.max_features is None:\n            max_features = self.n_features_\n        elif isinstance(self.max_features, (numbers.Integral, np.integer)):\n            max_features = self.max_features\n        else:  # float\n            if self.max_features > 0.0:\n                max_features = max(1, int(self.max_features * self.n_features_))\n            else:\n                max_features = 0\n\n        self.max_features_ = max_features\n\n        if len(y) != n_samples:\n            raise ValueError(\"Number of labels=%d does not match \"\n                             \"number of samples=%d\" % (len(y), n_samples))\n        if self.min_samples_split <= 0:\n            raise ValueError(\"min_samples_split must be greater than zero.\")\n        if self.min_samples_leaf <= 0:\n            raise ValueError(\"min_samples_leaf must be greater than zero.\")\n        if not 0 <= self.min_weight_fraction_leaf <= 0.5:\n            raise ValueError(\"min_weight_fraction_leaf must in [0, 0.5]\")\n        if max_depth <= 0:\n            raise ValueError(\"max_depth must be greater than zero. 
\")\n        if not (0 < max_features <= self.n_features_):\n            raise ValueError(\"max_features must be in (0, n_features]\")\n        if not isinstance(max_leaf_nodes, (numbers.Integral, np.integer)):\n            raise ValueError(\"max_leaf_nodes must be integral number but was \"\n                             \"%r\" % max_leaf_nodes)\n        if -1 < max_leaf_nodes < 2:\n            raise ValueError((\"max_leaf_nodes {0} must be either smaller than \"\n                              \"0 or larger than 1\").format(max_leaf_nodes))\n\n        if sample_weight is not None:\n            if (getattr(sample_weight, \"dtype\", None) != DOUBLE or\n                    not sample_weight.flags.contiguous):\n                sample_weight = np.ascontiguousarray(\n                    sample_weight, dtype=DOUBLE)\n            if len(sample_weight.shape) > 1:\n                raise ValueError(\"Sample weights array has more \"\n                                 \"than one dimension: %d\" %\n                                 len(sample_weight.shape))\n            if len(sample_weight) != n_samples:\n                raise ValueError(\"Number of weights=%d does not match \"\n                                 \"number of samples=%d\" %\n                                 (len(sample_weight), n_samples))\n\n        if expanded_class_weight is not None:\n            if sample_weight is not None:\n                sample_weight = sample_weight * expanded_class_weight\n            else:\n                sample_weight = expanded_class_weight\n\n        # Set min_weight_leaf from min_weight_fraction_leaf\n        if self.min_weight_fraction_leaf != 0. 
and sample_weight is not None:\n            min_weight_leaf = (self.min_weight_fraction_leaf *\n                               np.sum(sample_weight))\n        else:\n            min_weight_leaf = 0.\n\n        # Set min_samples_split sensibly\n        min_samples_split = max(self.min_samples_split,\n                                2 * self.min_samples_leaf)\n\n        # Build tree\n        criterion = self.criterion\n        if not isinstance(criterion, Criterion):\n            if is_classification:\n                criterion = CRITERIA_CLF[self.criterion](self.n_outputs_,\n                                                         self.n_classes_)\n            else:\n                criterion = CRITERIA_REG[self.criterion](self.n_outputs_)\n\n        SPLITTERS = SPARSE_SPLITTERS if issparse(X) else DENSE_SPLITTERS\n\n        splitter = self.splitter\n        if not isinstance(self.splitter, Splitter):\n            splitter = SPLITTERS[self.splitter](criterion,\n                                                self.max_features_,\n                                                self.min_samples_leaf,\n                                                min_weight_leaf,\n                                                random_state)\n\n        self.tree_ = Tree(self.n_features_, self.n_classes_, self.n_outputs_)\n\n        # Use BestFirst if max_leaf_nodes given; use DepthFirst otherwise\n        if max_leaf_nodes < 0:\n            builder = DepthFirstTreeBuilder(splitter, min_samples_split,\n                                            self.min_samples_leaf,\n                                            min_weight_leaf,\n                                            max_depth)\n        else:\n            builder = BestFirstTreeBuilder(splitter, min_samples_split,\n                                           self.min_samples_leaf,\n                                           min_weight_leaf,\n                                           max_depth,\n                                  
         max_leaf_nodes)\n\n        builder.build(self.tree_, X, y, sample_weight)\n\n        if self.n_outputs_ == 1:\n            self.n_classes_ = self.n_classes_[0]\n            self.classes_ = self.classes_[0]\n\n        return self"
        }
      ]
    },
    {
      "pr_number": 8936,
      "pr_title": "[MRG+1] fixed OOB_Score bug for bagging classifiers.",
      "pr_body": "Fixes #8933\r\n\r\n<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#Contributing-Pull-Requests\r\n-->\r\n#### Reference Issue\r\n<!-- Example: Fixes #1234 -->\r\n\r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\n\r\n#### Any other comments?\r\n\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
      "issue_id": 8933,
      "issue_title": "BUG: BaggingClassifier.oob_score_ should not change with class label",
      "issue_body": "Let us compute the oob score of a bagged classifier.\r\n\r\n```python\r\nimport numpy as np\r\nimport pandas as pd\r\nfrom sklearn.ensemble import BaggingClassifier\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\n\r\nN = 50\r\nrandState = 5\r\nlabel = 'Label'\r\nfeatures = ['A','B','C']\r\n\r\nlabels = np.random.randint(3, size = N) - 1\r\ndf = pd.DataFrame( labels , index=range(N), columns=[label] )\r\nfor col in features:\r\n    df[col] = df[label] + 0.01 * np.random.rand( N )\r\n\r\nclf = BaggingClassifier(base_estimator = KNeighborsClassifier(), n_estimators = 10, oob_score = True, random_state = randState )\r\nclf.fit(df[features], df[label])\r\nprint clf.oob_score_\r\n```\r\n\r\nHere, clf.oob_score_=0.0.\r\n\r\nNow, you would not expect that the OOB accuracy is a function of the class labels...\r\n\r\n```python\r\ndf.loc[ df[label] == -1 , label ] = 2\r\nclf = BaggingClassifier(base_estimator = KNeighborsClassifier(), n_estimators = 10, oob_score = True, random_state = randState )\r\nclf.fit(df[features], df[label])\r\nprint clf.oob_score_\r\n```\r\n\r\nNow, clf.oob_score_=1.0.\r\n\r\nClearly, OOB score should not be a function of the labels arbitrarily chosen for the classes.\r\n\r\nsklearn.__version__: '0.18.1'\r\nnumpy.__version__: '1.11.3'",
      "issue_closed_at": "2017-06-08T09:35:49Z",
      "base_commit": "9131f89e6c165fb27dadd37d3168c1ee5ea84f5a",
      "changes": [
        {
          "file": "sklearn/ensemble/bagging.py",
          "type": "function",
          "name": "_set_oob_score",
          "class_name": "BaggingRegressor",
          "code": "def _set_oob_score(self, X, y):\n        n_samples = y.shape[0]\n\n        predictions = np.zeros((n_samples,))\n        n_predictions = np.zeros((n_samples,))\n\n        for estimator, samples, features in zip(self.estimators_,\n                                                self.estimators_samples_,\n                                                self.estimators_features_):\n            # Create mask for OOB samples\n            mask = ~samples\n\n            predictions[mask] += estimator.predict((X[mask, :])[:, features])\n            n_predictions[mask] += 1\n\n        if (n_predictions == 0).any():\n            warn(\"Some inputs do not have OOB scores. \"\n                 \"This probably means too few estimators were used \"\n                 \"to compute any reliable oob estimates.\")\n            n_predictions[n_predictions == 0] = 1\n\n        predictions /= n_predictions\n\n        self.oob_prediction_ = predictions\n        self.oob_score_ = r2_score(y, predictions)"
        }
      ]
    },
    {
      "pr_number": 9396,
      "pr_title": "[MRG + 1] Fix wrong error message in StratifiedKFold",
      "pr_body": "#### Reference Issue\r\nFixes #9381 \r\n\r\n#### What does this implement/fix? Explain your changes.\r\nFixes wrong error message used while creating folds in StratifiedKFold \r\n\r\n#### Any other comments?\r\nNone\r\n",
      "issue_id": 9381,
      "issue_title": "Wrong error message in StratifiedKFold",
      "issue_body": "``All the n_groups for individual classes are less than n_splits=%d.`` That seem confusing / wrong",
      "issue_closed_at": "2017-07-18T23:25:28Z",
      "base_commit": "eece6d909a8baa03c6f19276f494a7680ae299e1",
      "changes": [
        {
          "file": "sklearn/model_selection/_split.py",
          "type": "function",
          "name": "_make_test_folds",
          "class_name": "StratifiedKFold",
          "code": "def _make_test_folds(self, X, y=None):\n        rng = self.random_state\n        y = np.asarray(y)\n        n_samples = y.shape[0]\n        unique_y, y_inversed = np.unique(y, return_inverse=True)\n        y_counts = np.bincount(y_inversed)\n        min_groups = np.min(y_counts)\n        if np.all(self.n_splits > y_counts):\n            raise ValueError(\"All the n_groups for individual classes\"\n                             \" are less than n_splits=%d.\"\n                             % (self.n_splits))\n        if self.n_splits > min_groups:\n            warnings.warn((\"The least populated class in y has only %d\"\n                           \" members, which is too few. The minimum\"\n                           \" number of groups for any class cannot\"\n                           \" be less than n_splits=%d.\"\n                           % (min_groups, self.n_splits)), Warning)\n\n        # pre-assign each sample to a test fold index using individual KFold\n        # splitting strategies for each class so as to respect the balance of\n        # classes\n        # NOTE: Passing the data corresponding to ith class say X[y==class_i]\n        # will break when the data is not 100% stratifiable for all classes.\n        # So we pass np.zeroes(max(c, n_splits)) as data to the KFold\n        per_cls_cvs = [\n            KFold(self.n_splits, shuffle=self.shuffle,\n                  random_state=rng).split(np.zeros(max(count, self.n_splits)))\n            for count in y_counts]\n\n        test_folds = np.zeros(n_samples, dtype=np.int)\n        for test_fold_indices, per_cls_splits in enumerate(zip(*per_cls_cvs)):\n            for cls, (_, test_split) in zip(unique_y, per_cls_splits):\n                cls_test_folds = test_folds[y == cls]\n                # the test split can be too big because we used\n                # KFold(...).split(X[:max(c, n_splits)]) when data is not 100%\n                # stratifiable for all the classes\n                # 
(we use a warning instead of raising an exception)\n                # If this is the case, let's trim it:\n                test_split = test_split[test_split < len(cls_test_folds)]\n                cls_test_folds[test_split] = test_fold_indices\n                test_folds[y == cls] = cls_test_folds\n\n        return test_folds"
        }
      ]
    },
    {
      "pr_number": 6379,
      "pr_title": "[MRG+1] fix StratifiedShuffleSplit train and test overlap",
      "pr_body": "Fix #6121.\n",
      "issue_id": 6121,
      "issue_title": "StratifiedShuffleSplit generates overlapping train and test indices",
      "issue_body": "Why there is overlap between `dev_idx` and `t_idx` in the following code? It should have been no overlap.\n\n<pre>\n```\ntrain_test_split = StratifiedShuffleSplit(labels, n_iter=1, test_size=0.2, random_state=0)\nfor train_idx, test_idx in train_test_split:\n    train_tmp = set(train_idx)\n    test_tmp = set(test_idx)\n    assert_equal(train_tmp.intersection(test_tmp), set())\n    X_train = np.copy(feats[train_idx])\n    y_train = np.copy(labels[train_idx])\n    trans_train = np.copy(trans[train_idx])\n    X_valid = np.copy(feats[test_idx])\n    y_valid = np.copy(labels[test_idx])\n    trans_valid = np.copy(trans[test_idx])\ndel feats\ndel labels\ndel trans\ndev_test_split = StratifiedShuffleSplit(y_valid, n_iter=1, test_size=0.5, random_state=0)\nfor dev_idx, t_idx in dev_test_split:\n    dev_tmp = set(dev_idx)\n    t_tmp = set(t_idx)\n    <b>assert_equal(dev_tmp.intersection(t_tmp), set())</b>\n    X_dev = np.copy(X_valid[dev_idx])\n    y_dev = np.copy(y_valid[dev_idx])\n    trans_dev = np.copy(trans_valid[dev_idx])\n    X_test = np.copy(X_valid[t_idx])\n    y_test = np.copy(y_valid[t_idx])\n    trans_test = np.copy(trans_valid[t_idx])\ndel X_valid\ndel y_valid\ndel trans_valid\n```\n</pre>\n\nThe second `assert_equal()` test prompted a error as follows:\n\n```\n    assert_equal(dev_tmp.intersection(t_tmp), set())\n  File \"/home/xyang45/miniconda2/lib/python2.7/unittest/case.py\", line 513, in assertEqual\n    assertion_func(first, second, msg=msg)\n  File \"/home/xyang45/miniconda2/lib/python2.7/unittest/case.py\", line 796, in assertSetEqual\n    self.fail(self._formatMessage(msg, standardMsg))\n  File \"/home/xyang45/miniconda2/lib/python2.7/unittest/case.py\", line 410, in fail\n    raise self.failureException(msg)\nAssertionError: Items in the first set but not the second:\n1160\n1161\n907\n1070\n1747\n2232\n```\n",
      "issue_closed_at": "2016-02-29T20:23:23Z",
      "base_commit": "90f2c184c64c7e27e2def24ca4a8340c5501f545",
      "changes": [
        {
          "file": "sklearn/cross_validation.py",
          "type": "function",
          "name": "_iter_indices",
          "class_name": "LabelShuffleSplit",
          "code": "def _iter_indices(self):\n        for label_train, label_test in super(LabelShuffleSplit,\n                                             self)._iter_indices():\n            # these are the indices of classes in the partition\n            # invert them into data indices\n\n            train = np.flatnonzero(np.in1d(self.label_indices, label_train))\n            test = np.flatnonzero(np.in1d(self.label_indices, label_test))\n\n            yield train, test"
        },
        {
          "file": "sklearn/model_selection/_split.py",
          "type": "function",
          "name": "_iter_indices",
          "class_name": "StratifiedShuffleSplit",
          "code": "def _iter_indices(self, X, y, labels=None):\n        n_samples = _num_samples(X)\n        n_train, n_test = _validate_shuffle_split(n_samples, self.test_size,\n                                                  self.train_size)\n        classes, y_indices = np.unique(y, return_inverse=True)\n        n_classes = classes.shape[0]\n\n        class_counts = bincount(y_indices)\n        if np.min(class_counts) < 2:\n            raise ValueError(\"The least populated class in y has only 1\"\n                             \" member, which is too few. The minimum\"\n                             \" number of labels for any class cannot\"\n                             \" be less than 2.\")\n\n        if n_train < n_classes:\n            raise ValueError('The train_size = %d should be greater or '\n                             'equal to the number of classes = %d' %\n                             (n_train, n_classes))\n        if n_test < n_classes:\n            raise ValueError('The test_size = %d should be greater or '\n                             'equal to the number of classes = %d' %\n                             (n_test, n_classes))\n\n        rng = check_random_state(self.random_state)\n        p_i = class_counts / float(n_samples)\n        n_i = np.round(n_train * p_i).astype(int)\n        t_i = np.minimum(class_counts - n_i,\n                         np.round(n_test * p_i).astype(int))\n\n        for _ in range(self.n_iter):\n            train = []\n            test = []\n\n            for i, class_i in enumerate(classes):\n                permutation = rng.permutation(class_counts[i])\n                perm_indices_class_i = np.where((y == class_i))[0][permutation]\n\n                train.extend(perm_indices_class_i[:n_i[i]])\n                test.extend(perm_indices_class_i[n_i[i]:n_i[i] + t_i[i]])\n\n            # Because of rounding issues (as n_train and n_test are not\n            # dividers of the number of elements per class), we may end\n    
        # up here with less samples in train and test than asked for.\n            if len(train) < n_train or len(test) < n_test:\n                # We complete by affecting randomly the missing indexes\n                missing_indices = np.where(bincount(train + test,\n                                                    minlength=len(y)) == 0)[0]\n                missing_indices = rng.permutation(missing_indices)\n                train.extend(missing_indices[:(n_train - len(train))])\n                test.extend(missing_indices[-(n_test - len(test)):])\n\n            train = rng.permutation(train)\n            test = rng.permutation(test)\n\n            yield train, test"
        }
      ]
    },
    {
      "pr_number": 7594,
      "pr_title": "[MRG+1] FIX Make sure GridSearchCV and RandomizedSearchCV are pickle-able",
      "pr_body": "Fixes #7562 \n- Subclasses the `np.ma.MaskedArray` and overrides the `__getstate__` to make obj dtyped `MaskedArray`s pickle-able.\n- Uses this fixed `utils.fixes.MaskedArray` inside `gs.cv_results_`...\n\nThis is based off of https://github.com/numpy/numpy/pull/8122\n\nPlease review @jnothman @amueller @GaelVaroquaux @davechallis\n",
      "issue_id": 7562,
      "issue_title": "Error unpickling RandomizedSearchCV objects in 0.18 due to masked arrays",
      "issue_body": "#### Description\n\nIn version 0.18, loading pickles of fitted RandomizedSearchCV objects results in a `TypeError` exception (from pickles also created with version 0.18).\n\nThe error seems related to the use of masked arrays in the `RandomizedSearchCV.cv_results_` attribute - clearing this before pickling (i.e. setting it to `None`) allows pickling/unpickling to work.\n#### Steps/Code to Reproduce\n\n```\nimport pickle\nfrom sklearn.model_selection import RandomizedSearchCV\nfrom sklearn.ensemble import RandomForestClassifier\n\nX = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]\ny = [1, 1, 1, 0, 0, 0]\n\nmodel = RandomizedSearchCV(RandomForestClassifier(),\n                           {'n_estimators': [5, 10, 20]},\n                           n_iter=3)\nmodel.fit(X, y)\n\nwith open('model.pkl', 'wb') as fh:\n    pickle.dump(model, fh)\n\nwith open('model.pkl', 'rb') as fh:\n    model = pickle.load(fh)\n\nprint(model.predict(X))\n```\n#### Expected Results\n\n```\n[1, 1, 1, 0, 0, 0]\n```\n#### Actual Results\n\n```\nTraceback (most recent call last):\n  File \"./t.py\", line 19, in <module>\n    model = pickle.load(fh)\n  File \"/Users/dsc/miniconda3/envs/p3/lib/python3.5/site-packages/numpy/ma/core.py\", line 5863, in __setstate__\n    super(MaskedArray, self).__setstate__((shp, typ, isf, raw))\nTypeError: object pickle not returning list\n```\n#### Versions\n\nPython 3.5.1 |Continuum Analytics, Inc.| (default, Dec  7 2015, 11:24:55) 
\n[GCC 4.2.1 (Apple Inc. build 5577)]\nNumPy 1.11.1\nSciPy 0.18.1\nScikit-Learn 0.18\n",
      "issue_closed_at": "2016-10-10T19:33:44Z",
      "base_commit": "33ed90dc0aa0549a5963000d7d070aa18ca389c4",
      "changes": [
        {
          "file": "sklearn/model_selection/_search.py",
          "type": "line",
          "name": "line 30",
          "code": "from ..utils import check_random_state\nfrom ..utils.fixes import sp_version\nfrom ..utils.fixes import rankdata\nfrom ..utils.random import sample_without_replacement\nfrom ..utils.validation import indexable, check_is_fitted\nfrom ..utils.metaestimators import if_delegate_has_method"
        },
        {
          "file": "sklearn/model_selection/_search.py",
          "type": "function",
          "name": "_store",
          "class_name": "BaseSearchCV",
          "code": "def _store(key_name, array, weights=None, splits=False, rank=False):\n            \"\"\"A small helper to store the scores/times to the cv_results_\"\"\"\n            array = np.array(array, dtype=np.float64).reshape(n_candidates,\n                                                              n_splits)\n            if splits:\n                for split_i in range(n_splits):\n                    results[\"split%d_%s\"\n                            % (split_i, key_name)] = array[:, split_i]\n\n            array_means = np.average(array, axis=1, weights=weights)\n            results['mean_%s' % key_name] = array_means\n            # Weighted std is not directly available in numpy\n            array_stds = np.sqrt(np.average((array -\n                                             array_means[:, np.newaxis]) ** 2,\n                                            axis=1, weights=weights))\n            results['std_%s' % key_name] = array_stds\n\n            if rank:\n                results[\"rank_%s\" % key_name] = np.asarray(\n                    rankdata(-array_means, method='min'), dtype=np.int32)"
        },
        {
          "file": "sklearn/utils/fixes.py",
          "type": "function",
          "name": "rankdata",
          "class_name": null,
          "code": "def rankdata(a, method='average'):\n        if method not in ('average', 'min', 'max', 'dense', 'ordinal'):\n            raise ValueError('unknown method \"{0}\"'.format(method))\n\n        arr = np.ravel(np.asarray(a))\n        algo = 'mergesort' if method == 'ordinal' else 'quicksort'\n        sorter = np.argsort(arr, kind=algo)\n\n        inv = np.empty(sorter.size, dtype=np.intp)\n        inv[sorter] = np.arange(sorter.size, dtype=np.intp)\n\n        if method == 'ordinal':\n            return inv + 1\n\n        arr = arr[sorter]\n        obs = np.r_[True, arr[1:] != arr[:-1]]\n        dense = obs.cumsum()[inv]\n\n        if method == 'dense':\n            return dense\n\n        # cumulative counts of each unique value\n        count = np.r_[np.nonzero(obs)[0], len(obs)]\n\n        if method == 'max':\n            return count[dense]\n\n        if method == 'min':\n            return count[dense - 1] + 1\n\n        # average method\n        return .5 * (count[dense] + count[dense - 1] + 1)"
        }
      ]
    }
  ]
}