{
  "Selected_candidate": {
    "pr_number": 4894,
    "pr_title": "fix dtype transform problem in KNN and RandomForest",
    "pr_body": "Fix #4879\n\nI add a int array to store the indices before the loop. \n@amueller I also checked LabelEncoder but it can only accommodate 1d array, so a loop is still needed and slicing problem will still be there. \nFor MultiLabelBinarizer, it can accommodate 2d array, but it will return a 2d binary array to store the indices, which is very different from the original implementation in RandomForest and KNN. I am wondering is there any better way to do so? \n",
    "issue_id": 4879,
    "issue_title": "RandomForestClassifier changes label values",
    "issue_body": "Consider the following example:\n\n``` python\nimport numpy as np\nfrom sklearn.ensemble import RandomForestClassifier\n\nCLASSES = 15\nX = np.arange(CLASSES).reshape(-1, 1)\ny = [ch for ch in 'ABCDEFGHIJKLMNOPQRSTU'[:CLASSES]]\n\nprint(RandomForestClassifier().fit(X, y).predict(X))\n```\n\nThe output is\n\n`['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'B' 'B' 'B' 'B' 'B']`\n\nwhereas I expect it to be\n\n`['A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O']`\n\nI traced the cause of the error to [here](https://github.com/scikit-learn/scikit-learn/blob/733629d256abaf9a31d6e6305f859768432907ed/sklearn/ensemble/forest.py#L418). I'm happy to put in a pull request, but can others please verify they get the same unexpected behaviour first?\n\nSystem information:\n``\nCPython 3.3.5\nIPython 3.1.0\n\nnumpy 1.9.2\nscipy 0.14.0\nscikit-learn 0.16.1(conda)\n\ncompiler   : GCC 4.2.1 (Apple Inc. build 5577)\nsystem     : Darwin\nrelease    : 13.4.0\nmachine    : x86_64\nprocessor  : i386\nCPU cores  : 4\ninterpreter: 64bit\n\n```\n```\n",
    "issue_closed_at": "2015-06-24T21:33:10Z",
    "base_commit": "cd98ffced656569681648c88c51dae844501643c",
    "changes": [
      {
        "file": "sklearn/ensemble/forest.py",
        "type": "function",
        "name": "_validate_y_class_weight",
        "class_name": "ForestClassifier",
        "code": "def _validate_y_class_weight(self, y):\n        y = np.copy(y)\n        expanded_class_weight = None\n\n        if self.class_weight is not None:\n            y_original = np.copy(y)\n\n        self.classes_ = []\n        self.n_classes_ = []\n\n        for k in range(self.n_outputs_):\n            classes_k, y[:, k] = np.unique(y[:, k], return_inverse=True)\n            self.classes_.append(classes_k)\n            self.n_classes_.append(classes_k.shape[0])\n\n        if self.class_weight is not None:\n            valid_presets = ('auto', 'balanced', 'balanced_subsample', 'subsample', 'auto')\n            if isinstance(self.class_weight, six.string_types):\n                if self.class_weight not in valid_presets:\n                    raise ValueError('Valid presets for class_weight include '\n                                     '\"balanced\" and \"balanced_subsample\". Given \"%s\".'\n                                     % self.class_weight)\n                if self.class_weight == \"subsample\":\n                    warn(\"class_weight='subsample' is deprecated and will be removed in 0.18.\"\n                         \" It was replaced by class_weight='balanced_subsample' \"\n                         \"using the balanced strategy.\", DeprecationWarning)\n                if self.warm_start:\n                    warn('class_weight presets \"balanced\" or \"balanced_subsample\" are '\n                         'not recommended for warm_start if the fitted data '\n                         'differs from the full dataset. In order to use '\n                         '\"balanced\" weights, use compute_class_weight(\"balanced\", '\n                         'classes, y). In place of y you can use a large '\n                         'enough sample of the full training set target to '\n                         'properly estimate the class frequency '\n                         'distributions. Pass the resulting weights as the '\n                         'class_weight parameter.')\n\n            if (self.class_weight not in ['subsample', 'balanced_subsample'] or\n                    not self.bootstrap):\n                if self.class_weight == 'subsample':\n                    class_weight = 'auto'\n                elif self.class_weight == \"balanced_subsample\":\n                    class_weight = \"balanced\"\n                else:\n                    class_weight = self.class_weight\n                with warnings.catch_warnings():\n                    if class_weight == \"auto\":\n                        warnings.simplefilter('ignore', DeprecationWarning)\n                    expanded_class_weight = compute_sample_weight(class_weight,\n                                                                  y_original)\n\n        return y, expanded_class_weight"
      },
      {
        "file": "sklearn/tree/tree.py",
        "type": "function",
        "name": "fit",
        "class_name": "BaseDecisionTree",
        "code": "def fit(self, X, y, sample_weight=None, check_input=True):\n        \"\"\"Build a decision tree from the training set (X, y).\n\n        Parameters\n        ----------\n        X : array-like or sparse matrix, shape = [n_samples, n_features]\n            The training input samples. Internally, it will be converted to\n            ``dtype=np.float32`` and if a sparse matrix is provided\n            to a sparse ``csc_matrix``.\n\n        y : array-like, shape = [n_samples] or [n_samples, n_outputs]\n            The target values (class labels in classification, real numbers in\n            regression). In the regression case, use ``dtype=np.float64`` and\n            ``order='C'`` for maximum efficiency.\n\n        sample_weight : array-like, shape = [n_samples] or None\n            Sample weights. If None, then samples are equally weighted. Splits\n            that would create child nodes with net zero or negative weight are\n            ignored while searching for a split in each node. In the case of\n            classification, splits are also ignored if they would result in any\n            single class carrying a negative weight in either child node.\n\n        check_input : boolean, (default=True)\n            Allow to bypass several input checking.\n            Don't use this parameter unless you know what you do.\n\n        Returns\n        -------\n        self : object\n            Returns self.\n        \"\"\"\n        random_state = check_random_state(self.random_state)\n        if check_input:\n            X = check_array(X, dtype=DTYPE, accept_sparse=\"csc\")\n            if issparse(X):\n                X.sort_indices()\n\n                if X.indices.dtype != np.intc or X.indptr.dtype != np.intc:\n                    raise ValueError(\"No support for np.int64 index based \"\n                                     \"sparse matrices\")\n\n        # Determine output settings\n        n_samples, self.n_features_ = X.shape\n        is_classification = isinstance(self, ClassifierMixin)\n\n        y = np.atleast_1d(y)\n        expanded_class_weight = None\n\n        if y.ndim == 1:\n            # reshape is necessary to preserve the data contiguity against vs\n            # [:, np.newaxis] that does not.\n            y = np.reshape(y, (-1, 1))\n\n        self.n_outputs_ = y.shape[1]\n\n        if is_classification:\n            y = np.copy(y)\n\n            self.classes_ = []\n            self.n_classes_ = []\n\n            if self.class_weight is not None:\n                y_original = np.copy(y)\n\n            for k in range(self.n_outputs_):\n                classes_k, y[:, k] = np.unique(y[:, k], return_inverse=True)\n                self.classes_.append(classes_k)\n                self.n_classes_.append(classes_k.shape[0])\n\n            if self.class_weight is not None:\n                expanded_class_weight = compute_sample_weight(\n                    self.class_weight, y_original)\n\n        else:\n            self.classes_ = [None] * self.n_outputs_\n            self.n_classes_ = [1] * self.n_outputs_\n\n        self.n_classes_ = np.array(self.n_classes_, dtype=np.intp)\n\n        if getattr(y, \"dtype\", None) != DOUBLE or not y.flags.contiguous:\n            y = np.ascontiguousarray(y, dtype=DOUBLE)\n\n        # Check parameters\n        max_depth = ((2 ** 31) - 1 if self.max_depth is None\n                     else self.max_depth)\n        max_leaf_nodes = (-1 if self.max_leaf_nodes is None\n                          else self.max_leaf_nodes)\n\n        if isinstance(self.max_features, six.string_types):\n            if self.max_features == \"auto\":\n                if is_classification:\n                    max_features = max(1, int(np.sqrt(self.n_features_)))\n                else:\n                    max_features = self.n_features_\n            elif self.max_features == \"sqrt\":\n                max_features = max(1, int(np.sqrt(self.n_features_)))\n            elif self.max_features == \"log2\":\n                max_features = max(1, int(np.log2(self.n_features_)))\n            else:\n                raise ValueError(\n                    'Invalid value for max_features. Allowed string '\n                    'values are \"auto\", \"sqrt\" or \"log2\".')\n        elif self.max_features is None:\n            max_features = self.n_features_\n        elif isinstance(self.max_features, (numbers.Integral, np.integer)):\n            max_features = self.max_features\n        else:  # float\n            if self.max_features > 0.0:\n                max_features = max(1, int(self.max_features * self.n_features_))\n            else:\n                max_features = 0\n\n        self.max_features_ = max_features\n\n        if len(y) != n_samples:\n            raise ValueError(\"Number of labels=%d does not match \"\n                             \"number of samples=%d\" % (len(y), n_samples))\n        if self.min_samples_split <= 0:\n            raise ValueError(\"min_samples_split must be greater than zero.\")\n        if self.min_samples_leaf <= 0:\n            raise ValueError(\"min_samples_leaf must be greater than zero.\")\n        if not 0 <= self.min_weight_fraction_leaf <= 0.5:\n            raise ValueError(\"min_weight_fraction_leaf must in [0, 0.5]\")\n        if max_depth <= 0:\n            raise ValueError(\"max_depth must be greater than zero. \")\n        if not (0 < max_features <= self.n_features_):\n            raise ValueError(\"max_features must be in (0, n_features]\")\n        if not isinstance(max_leaf_nodes, (numbers.Integral, np.integer)):\n            raise ValueError(\"max_leaf_nodes must be integral number but was \"\n                             \"%r\" % max_leaf_nodes)\n        if -1 < max_leaf_nodes < 2:\n            raise ValueError((\"max_leaf_nodes {0} must be either smaller than \"\n                              \"0 or larger than 1\").format(max_leaf_nodes))\n\n        if sample_weight is not None:\n            if (getattr(sample_weight, \"dtype\", None) != DOUBLE or\n                    not sample_weight.flags.contiguous):\n                sample_weight = np.ascontiguousarray(\n                    sample_weight, dtype=DOUBLE)\n            if len(sample_weight.shape) > 1:\n                raise ValueError(\"Sample weights array has more \"\n                                 \"than one dimension: %d\" %\n                                 len(sample_weight.shape))\n            if len(sample_weight) != n_samples:\n                raise ValueError(\"Number of weights=%d does not match \"\n                                 \"number of samples=%d\" %\n                                 (len(sample_weight), n_samples))\n\n        if expanded_class_weight is not None:\n            if sample_weight is not None:\n                sample_weight = sample_weight * expanded_class_weight\n            else:\n                sample_weight = expanded_class_weight\n\n        # Set min_weight_leaf from min_weight_fraction_leaf\n        if self.min_weight_fraction_leaf != 0. and sample_weight is not None:\n            min_weight_leaf = (self.min_weight_fraction_leaf *\n                               np.sum(sample_weight))\n        else:\n            min_weight_leaf = 0.\n\n        # Set min_samples_split sensibly\n        min_samples_split = max(self.min_samples_split,\n                                2 * self.min_samples_leaf)\n\n        # Build tree\n        criterion = self.criterion\n        if not isinstance(criterion, Criterion):\n            if is_classification:\n                criterion = CRITERIA_CLF[self.criterion](self.n_outputs_,\n                                                         self.n_classes_)\n            else:\n                criterion = CRITERIA_REG[self.criterion](self.n_outputs_)\n\n        SPLITTERS = SPARSE_SPLITTERS if issparse(X) else DENSE_SPLITTERS\n\n        splitter = self.splitter\n        if not isinstance(self.splitter, Splitter):\n            splitter = SPLITTERS[self.splitter](criterion,\n                                                self.max_features_,\n                                                self.min_samples_leaf,\n                                                min_weight_leaf,\n                                                random_state)\n\n        self.tree_ = Tree(self.n_features_, self.n_classes_, self.n_outputs_)\n\n        # Use BestFirst if max_leaf_nodes given; use DepthFirst otherwise\n        if max_leaf_nodes < 0:\n            builder = DepthFirstTreeBuilder(splitter, min_samples_split,\n                                            self.min_samples_leaf,\n                                            min_weight_leaf,\n                                            max_depth)\n        else:\n            builder = BestFirstTreeBuilder(splitter, min_samples_split,\n                                           self.min_samples_leaf,\n                                           min_weight_leaf,\n                                           max_depth,\n                                           max_leaf_nodes)\n\n        builder.build(self.tree_, X, y, sample_weight)\n\n        if self.n_outputs_ == 1:\n            self.n_classes_ = self.n_classes_[0]\n            self.classes_ = self.classes_[0]\n\n        return self"
      }
    ]
  },
  "Justification": "Candidate A directly relates to the `warn_on_dtype` behavior within scikit-learn and involves issues with data type handling, specifically with the RandomForestClassifier. Although it does not mirror the exact mechanism of warnings, both deal with the expectations of data inputs and corresponding outputs — checking types and managing their conversions effectively. The patch proposed in this candidate also reflects a fundamental understanding of data types, which is critical in debugging the currently observed lack of a warning in `check_array`. Overall, insights from handling type-related errors in Candidate A could provide valuable pointers for fixing the `warn_on_dtype` issue with DataFrames.",
  "instance_id": "scikit-learn__scikit-learn-10949",
  "repo": "scikit-learn/scikit-learn",
  "created_at": "2018-04-10T15:30:56Z",
  "problem_statement": "warn_on_dtype with DataFrame\n#### Description\r\n\r\n``warn_on_dtype`` has no effect when input is a pandas ``DataFrame``\r\n\r\n#### Steps/Code to Reproduce\r\n```python\r\nfrom sklearn.utils.validation import check_array\r\nimport pandas as pd\r\ndf = pd.DataFrame([[1, 2, 3], [2, 3, 4]], dtype=object)\r\nchecked = check_array(df, warn_on_dtype=True)\r\n```\r\n\r\n#### Expected result: \r\n\r\n```python-traceback\r\nDataConversionWarning: Data with input dtype object was converted to float64.\r\n```\r\n\r\n#### Actual Results\r\nNo warning is thrown\r\n\r\n#### Versions\r\nLinux-4.4.0-116-generic-x86_64-with-debian-stretch-sid\r\nPython 3.6.3 |Anaconda, Inc.| (default, Nov  3 2017, 19:19:16) \r\n[GCC 7.2.0]\r\nNumPy 1.13.1\r\nSciPy 0.19.1\r\nScikit-Learn 0.20.dev0\r\nPandas 0.21.0\r\n\nwarn_on_dtype with DataFrame\n#### Description\r\n\r\n``warn_on_dtype`` has no effect when input is a pandas ``DataFrame``\r\n\r\n#### Steps/Code to Reproduce\r\n```python\r\nfrom sklearn.utils.validation import check_array\r\nimport pandas as pd\r\ndf = pd.DataFrame([[1, 2, 3], [2, 3, 4]], dtype=object)\r\nchecked = check_array(df, warn_on_dtype=True)\r\n```\r\n\r\n#### Expected result: \r\n\r\n```python-traceback\r\nDataConversionWarning: Data with input dtype object was converted to float64.\r\n```\r\n\r\n#### Actual Results\r\nNo warning is thrown\r\n\r\n#### Versions\r\nLinux-4.4.0-116-generic-x86_64-with-debian-stretch-sid\r\nPython 3.6.3 |Anaconda, Inc.| (default, Nov  3 2017, 19:19:16) \r\n[GCC 7.2.0]\r\nNumPy 1.13.1\r\nSciPy 0.19.1\r\nScikit-Learn 0.20.dev0\r\nPandas 0.21.0\r\n\n",
  "patch": "diff --git a/sklearn/utils/validation.py b/sklearn/utils/validation.py\n--- a/sklearn/utils/validation.py\n+++ b/sklearn/utils/validation.py\n@@ -466,6 +466,12 @@ def check_array(array, accept_sparse=False, accept_large_sparse=True,\n         # not a data type (e.g. a column named dtype in a pandas DataFrame)\n         dtype_orig = None\n \n+    # check if the object contains several dtypes (typically a pandas\n+    # DataFrame), and store them. If not, store None.\n+    dtypes_orig = None\n+    if hasattr(array, \"dtypes\") and hasattr(array, \"__array__\"):\n+        dtypes_orig = np.array(array.dtypes)\n+\n     if dtype_numeric:\n         if dtype_orig is not None and dtype_orig.kind == \"O\":\n             # if input is object, convert to float.\n@@ -581,6 +587,16 @@ def check_array(array, accept_sparse=False, accept_large_sparse=True,\n     if copy and np.may_share_memory(array, array_orig):\n         array = np.array(array, dtype=dtype, order=order)\n \n+    if (warn_on_dtype and dtypes_orig is not None and\n+            {array.dtype} != set(dtypes_orig)):\n+        # if there was at the beginning some other types than the final one\n+        # (for instance in a DataFrame that can contain several dtypes) then\n+        # some data must have been converted\n+        msg = (\"Data with input dtype %s were all converted to %s%s.\"\n+               % (', '.join(map(str, sorted(set(dtypes_orig)))), array.dtype,\n+                  context))\n+        warnings.warn(msg, DataConversionWarning, stacklevel=3)\n+\n     return array\n \n \n"
}