{
  "instance_id": "scikit-learn__scikit-learn-25500",
  "repo": "scikit-learn/scikit-learn",
  "created_at": "2023-01-27T19:49:28Z",
  "problem_statement": "CalibratedClassifierCV doesn't work with `set_config(transform_output=\"pandas\")`\n### Describe the bug\r\n\r\nCalibratedClassifierCV with isotonic regression doesn't work when we previously set `set_config(transform_output=\"pandas\")`.\r\nThe IsotonicRegression seems to return a dataframe, which is a problem for `_CalibratedClassifier`  in `predict_proba` where it tries to put the dataframe in a numpy array row `proba[:, class_idx] = calibrator.predict(this_pred)`.\r\n\r\n### Steps/Code to Reproduce\r\n\r\n```python\r\nimport numpy as np\r\nfrom sklearn import set_config\r\nfrom sklearn.calibration import CalibratedClassifierCV\r\nfrom sklearn.linear_model import SGDClassifier\r\n\r\nset_config(transform_output=\"pandas\")\r\nmodel = CalibratedClassifierCV(SGDClassifier(), method='isotonic')\r\nmodel.fit(np.arange(90).reshape(30, -1), np.arange(30) % 2)\r\nmodel.predict(np.arange(90).reshape(30, -1))\r\n```\r\n\r\n### Expected Results\r\n\r\nIt should not crash.\r\n\r\n### Actual Results\r\n\r\n```\r\n../core/model_trainer.py:306: in train_model\r\n    cv_predictions = cross_val_predict(pipeline,\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:968: in cross_val_predict\r\n    predictions = parallel(\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/joblib/parallel.py:1085: in __call__\r\n    if self.dispatch_one_batch(iterator):\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/joblib/parallel.py:901: in dispatch_one_batch\r\n    self._dispatch(tasks)\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/joblib/parallel.py:819: in _dispatch\r\n    job = self._backend.apply_async(batch, callback=cb)\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/joblib/_parallel_backends.py:208: in apply_async\r\n    result = 
ImmediateResult(func)\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/joblib/_parallel_backends.py:597: in __init__\r\n    self.results = batch()\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/joblib/parallel.py:288: in __call__\r\n    return [func(*args, **kwargs)\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/joblib/parallel.py:288: in <listcomp>\r\n    return [func(*args, **kwargs)\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/sklearn/utils/fixes.py:117: in __call__\r\n    return self.function(*args, **kwargs)\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:1052: in _fit_and_predict\r\n    predictions = func(X_test)\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/sklearn/pipeline.py:548: in predict_proba\r\n    return self.steps[-1][1].predict_proba(Xt, **predict_proba_params)\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/sklearn/calibration.py:477: in predict_proba\r\n    proba = calibrated_classifier.predict_proba(X)\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/sklearn/calibration.py:764: in predict_proba\r\n    proba[:, class_idx] = calibrator.predict(this_pred)\r\nE   ValueError: could not broadcast input array from shape (20,1) into shape (20,)\r\n```\r\n\r\n### Versions\r\n\r\n```shell\r\nSystem:\r\n    python: 3.9.15 (main, Nov 24 2022, 14:31:59)  [GCC 11.2.0]\r\nexecutable: /home/philippe/.anaconda3/envs/strategy-training/bin/python\r\n   machine: Linux-5.15.0-57-generic-x86_64-with-glibc2.31\r\n\r\nPython dependencies:\r\n      sklearn: 1.2.0\r\n          pip: 22.2.2\r\n   setuptools: 62.3.2\r\n        numpy: 1.23.5\r\n        scipy: 1.9.3\r\n       Cython: None\r\n       pandas: 1.4.1\r\n   matplotlib: 3.6.3\r\n       joblib: 1.2.0\r\nthreadpoolctl: 3.1.0\r\n\r\nBuilt with 
OpenMP: True\r\n\r\nthreadpoolctl info:\r\n       user_api: openmp\r\n   internal_api: openmp\r\n         prefix: libgomp\r\n       filepath: /home/philippe/.anaconda3/envs/strategy-training/lib/python3.9/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0\r\n        version: None\r\n    num_threads: 12\r\n\r\n       user_api: blas\r\n   internal_api: openblas\r\n         prefix: libopenblas\r\n       filepath: /home/philippe/.anaconda3/envs/strategy-training/lib/python3.9/site-packages/numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so\r\n        version: 0.3.20\r\nthreading_layer: pthreads\r\n   architecture: Haswell\r\n    num_threads: 12\r\n\r\n       user_api: blas\r\n   internal_api: openblas\r\n         prefix: libopenblas\r\n       filepath: /home/philippe/.anaconda3/envs/strategy-training/lib/python3.9/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so\r\n        version: 0.3.18\r\nthreading_layer: pthreads\r\n   architecture: Haswell\r\n    num_threads: 12\r\n```\r\n\n",
  "patch": "diff --git a/sklearn/isotonic.py b/sklearn/isotonic.py\n--- a/sklearn/isotonic.py\n+++ b/sklearn/isotonic.py\n@@ -360,23 +360,16 @@ def fit(self, X, y, sample_weight=None):\n         self._build_f(X, y)\n         return self\n \n-    def transform(self, T):\n-        \"\"\"Transform new data by linear interpolation.\n-\n-        Parameters\n-        ----------\n-        T : array-like of shape (n_samples,) or (n_samples, 1)\n-            Data to transform.\n+    def _transform(self, T):\n+        \"\"\"`_transform` is called by both `transform` and `predict` methods.\n \n-            .. versionchanged:: 0.24\n-               Also accepts 2d array with 1 feature.\n+        Since `transform` is wrapped to output arrays of specific types (e.g.\n+        NumPy arrays, pandas DataFrame), we cannot make `predict` call `transform`\n+        directly.\n \n-        Returns\n-        -------\n-        y_pred : ndarray of shape (n_samples,)\n-            The transformed data.\n+        The above behaviour could be changed in the future, if we decide to output\n+        other type of arrays when calling `predict`.\n         \"\"\"\n-\n         if hasattr(self, \"X_thresholds_\"):\n             dtype = self.X_thresholds_.dtype\n         else:\n@@ -397,6 +390,24 @@ def transform(self, T):\n \n         return res\n \n+    def transform(self, T):\n+        \"\"\"Transform new data by linear interpolation.\n+\n+        Parameters\n+        ----------\n+        T : array-like of shape (n_samples,) or (n_samples, 1)\n+            Data to transform.\n+\n+            .. 
versionchanged:: 0.24\n+               Also accepts 2d array with 1 feature.\n+\n+        Returns\n+        -------\n+        y_pred : ndarray of shape (n_samples,)\n+            The transformed data.\n+        \"\"\"\n+        return self._transform(T)\n+\n     def predict(self, T):\n         \"\"\"Predict new data by linear interpolation.\n \n@@ -410,7 +421,7 @@ def predict(self, T):\n         y_pred : ndarray of shape (n_samples,)\n             Transformed data.\n         \"\"\"\n-        return self.transform(T)\n+        return self._transform(T)\n \n     # We implement get_feature_names_out here instead of using\n     # `ClassNamePrefixFeaturesOutMixin`` because `input_features` are ignored.\n",
  "similar_bug_items": [
    {
      "pr_number": 23256,
      "pr_title": "FIX SGDRegressor and SGDClassifier use correct number of validation data",
      "pr_body": "#### Reference Issues/PRs\r\nFixes #23255.\r\n\r\n#### What does this implement/fix? Explain your changes.\r\n- changes the `validation_mask` to be a boolean array for correct indexing.\r\n- fixes a failing test case.\r\n\r\n#### Any other comments?\r\nI haven't added any test cases because I'm not quite sure how to test this behavior without directly using internal api like checking against `_ValidationScoreCallback.X_val.shape`.\r\n",
      "issue_id": 23255,
      "issue_title": "BUG SGDRegressor only using the first two samples for validation error calculation",
      "issue_body": "### Describe the bug\r\n\r\n`BaseSGD._make_validation_split` returns a 0-1 integer array as validation mask, which is used to calculate the validation score with:\r\n```python\r\n_ValidationScoreCallback(\r\n            self,\r\n            X[validation_mask],\r\n            y[validation_mask],\r\n            sample_weight[validation_mask],\r\n            classes=classes,\r\n```\r\nHere `X[validation_mask]` just repeats the first two samples, as shown in the following code. A quick fix would be to convert `validation_mask` to a boolean array.\r\n\r\n### Steps/Code to Reproduce\r\n\r\nI added a print statement in `_ValidationScoreCallback` to print out the shape of validation data\r\n```python\r\nclass _ValidationScoreCallback:\r\n    \"\"\"Callback for early stopping based on validation score\"\"\"\r\n\r\n    def __init__(self, estimator, X_val, y_val, sample_weight_val, classes=None):\r\n        self.estimator = clone(estimator)\r\n        self.estimator.t_ = 1  # to pass check_is_fitted\r\n        if classes is not None:\r\n            self.estimator.classes_ = classes\r\n        self.X_val = X_val\r\n        self.y_val = y_val\r\n        self.sample_weight_val = sample_weight_val\r\n        print(X_val.shape, y_val.shape)\r\n```\r\nand afterwards running the following code\r\n```python\r\nimport numpy as np\r\nfrom sklearn.linear_model import SGDRegressor\r\n\r\nX = np.random.randn(1000, 5)\r\ny = np.random.randn(1000)\r\n\r\nsgd = SGDRegressor(early_stopping=True, max_iter=1, validation_fraction=0.1)\r\nsgd.fit(X, y)\r\n```\r\n\r\n### Expected Results\r\n\r\nThe size of validation data should be `(100,5), (100, )`.\r\n\r\n### Actual Results\r\n\r\nprints out `(1000, 5), (1000,)`\r\n\r\n### Versions\r\n\r\n```shell\r\nSystem:\r\n    python: 3.8.8 (default, Apr 13 2021, 12:59:45)  [Clang 10.0.0 ]\r\nexecutable: /Users/mac/Documents/GitHub/.ENV/bin/python\r\n   machine: macOS-10.16-x86_64-i386-64bit\r\n\r\nPython dependencies:\r\n      
sklearn: 1.2.dev0\r\n          pip: 20.2.3\r\n   setuptools: 60.9.3\r\n        numpy: 1.21.2\r\n        scipy: 1.8.0\r\n       Cython: 0.29.28\r\n       pandas: 1.3.3\r\n   matplotlib: 3.5.1\r\n       joblib: 1.1.0\r\nthreadpoolctl: 3.0.0\r\n\r\nBuilt with OpenMP: False\r\n```\r\n",
      "issue_closed_at": "2022-05-02T11:41:03Z",
      "base_commit": "920ab2508fe634ad32df68bb0ebd4f4512fbfb53",
      "changes": [
        {
          "file": "sklearn/linear_model/_stochastic_gradient.py",
          "type": "function",
          "name": "_make_validation_split",
          "class_name": "BaseSGD",
          "code": "def _make_validation_split(self, y):\n        \"\"\"Split the dataset between training set and validation set.\n\n        Parameters\n        ----------\n        y : ndarray of shape (n_samples, )\n            Target values.\n\n        Returns\n        -------\n        validation_mask : ndarray of shape (n_samples, )\n            Equal to 1 on the validation set, 0 on the training set.\n        \"\"\"\n        n_samples = y.shape[0]\n        validation_mask = np.zeros(n_samples, dtype=np.uint8)\n        if not self.early_stopping:\n            # use the full set for training, with an empty validation set\n            return validation_mask\n\n        if is_classifier(self):\n            splitter_type = StratifiedShuffleSplit\n        else:\n            splitter_type = ShuffleSplit\n        cv = splitter_type(\n            test_size=self.validation_fraction, random_state=self.random_state\n        )\n        idx_train, idx_val = next(cv.split(np.zeros(shape=(y.shape[0], 1)), y))\n        if idx_train.shape[0] == 0 or idx_val.shape[0] == 0:\n            raise ValueError(\n                \"Splitting %d samples into a train set and a validation set \"\n                \"with validation_fraction=%r led to an empty set (%d and %d \"\n                \"samples). Please either change validation_fraction, increase \"\n                \"number of samples, or disable early_stopping.\"\n                % (\n                    n_samples,\n                    self.validation_fraction,\n                    idx_train.shape[0],\n                    idx_val.shape[0],\n                )\n            )\n\n        validation_mask[idx_val] = 1\n        return validation_mask"
        }
      ]
    },
    {
      "pr_number": 11796,
      "pr_title": "[MRG+2] Fix LDA predict_proba() ",
      "pr_body": "#### Reference Issues/PRs\r\nFixes #6848\r\ncloses #11727\r\ncloses #5149\r\n\r\n#### What does this implement/fix? Explain your changes.\r\nFixes the `predict_proba()` method of LinearDiscriminantAnalysis.\r\nAn `if` statement is used to differentiate between the binary and multi-class case, due to the different output format of the `decision_function` method implemented in the `LinearClassifierMixin` class.\r\n\r\n#### Any other comments?\r\nCopying from #6848:\r\nDo we perhaps want to include additional tests checking the output of predict_proba for LDA and QDA both for the binary and multi-class cases?\r\n",
      "issue_id": 6848,
      "issue_title": "LinearDiscriminantAnalysis predict probability bug",
      "issue_body": "I am pretty confident there is a bug introduced in commit\n7c1101d7c26ba0b77184cce9c0b9be79adb526de\n\nConcretely, line 518 of the current version \nhttps://github.com/scikit-learn/scikit-learn/blob/master/sklearn/discriminant_analysis.py\nshould be removed as it yields wrong results. \n\nThere is no reason why constant 1 should be added to the computed probability after exponentiation and before inversion. \n\nTo verify this, I have run a one-to-one comparison between the outcome of the method and MATLAB's builtin LDA classifier on the Iris dataset. Only after removal of line 518, results match (up to a tolerance).\n\nIf everyone agrees on that, I am happy to submit a PR.\n",
      "issue_closed_at": "2019-03-07T16:44:18Z",
      "base_commit": "b73a51bcda362d94d8907915a382a8eb403554c8",
      "changes": [
        {
          "file": "sklearn/discriminant_analysis.py",
          "type": "line",
          "name": "line 22",
          "code": "from .utils import check_array, check_X_y\nfrom .utils.validation import check_is_fitted\nfrom .utils.multiclass import check_classification_targets\nfrom .preprocessing import StandardScaler\n\n"
        },
        {
          "file": "sklearn/discriminant_analysis.py",
          "type": "function",
          "name": "predict_proba",
          "class_name": "QuadraticDiscriminantAnalysis",
          "code": "def predict_proba(self, X):\n        \"\"\"Return posterior probabilities of classification.\n\n        Parameters\n        ----------\n        X : array-like, shape = [n_samples, n_features]\n            Array of samples/test vectors.\n\n        Returns\n        -------\n        C : array, shape = [n_samples, n_classes]\n            Posterior probabilities of classification per class.\n        \"\"\"\n        values = self._decision_function(X)\n        # compute the likelihood of the underlying gaussian models\n        # up to a multiplicative constant.\n        likelihood = np.exp(values - values.max(axis=1)[:, np.newaxis])\n        # compute posterior probabilities\n        return likelihood / likelihood.sum(axis=1)[:, np.newaxis]"
        }
      ]
    },
    {
      "pr_number": 19641,
      "pr_title": "Fix Calibrated classifier cv predictions with pipeline",
      "pr_body": "#### Reference Issues/PRs\r\nFixes #19637, #8710.\r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\nAs suggested in #19641, this PR removed validation from the CalibratedClassifierCV predict_proba function and replaced X.shape[0] with _num_samples(X).\r\n\r\nRegression testing has also been added.\r\n\r\n#### Any other comments?\r\n",
      "issue_id": 19637,
      "issue_title": "CalibratedClassifier on pipelines",
      "issue_body": "Looks like CalibratedClassifierCV doesn't support pipelines, see https://github.com/scikit-learn/scikit-learn/issues/19625 and https://github.com/scikit-learn/scikit-learn/discussions/19279\r\n\r\nWe should try to delegate the validation to the base estimator (https://github.com/scikit-learn/scikit-learn/discussions/19279#discussioncomment-313812)",
      "issue_closed_at": "2021-03-10T10:27:24Z",
      "base_commit": "42e90e9ba28fb37c2c9bd3e8aed1ac2387f1d5d5",
      "changes": [
        {
          "file": "sklearn/calibration.py",
          "type": "line",
          "name": "line 24",
          "code": "                   MetaEstimatorMixin)\nfrom .preprocessing import label_binarize, LabelEncoder\nfrom .utils import (\n    check_array,\n    column_or_1d,\n    deprecated,\n    indexable,\n)\nfrom .utils.multiclass import check_classification_targets\nfrom .utils.fixes import delayed\nfrom .utils.validation import check_is_fitted, check_consistent_length\nfrom .utils.validation import _check_sample_weight\nfrom .pipeline import Pipeline\nfrom .isotonic import IsotonicRegression\nfrom .svm import LinearSVC"
        },
        {
          "file": "sklearn/calibration.py",
          "type": "function",
          "name": "predict_proba",
          "class_name": "_CalibratedClassifier",
          "code": "def predict_proba(self, X):\n        \"\"\"Calculate calibrated probabilities.\n\n        Calculates classification calibrated probabilities\n        for each class, in a one-vs-all manner, for `X`.\n\n        Parameters\n        ----------\n        X : ndarray of shape (n_samples, n_features)\n            The sample data.\n\n        Returns\n        -------\n        proba : array, shape (n_samples, n_classes)\n            The predicted probabilities. Can be exact zeros.\n        \"\"\"\n        n_classes = len(self.classes)\n        pred_method = _get_prediction_method(self.base_estimator)\n        predictions = _compute_predictions(pred_method, X, n_classes)\n\n        label_encoder = LabelEncoder().fit(self.classes)\n        pos_class_indices = label_encoder.transform(\n            self.base_estimator.classes_\n        )\n\n        proba = np.zeros((X.shape[0], n_classes))\n        for class_idx, this_pred, calibrator in \\\n                zip(pos_class_indices, predictions.T, self.calibrators):\n            if n_classes == 2:\n                # When binary, `predictions` consists only of predictions for\n                # clf.classes_[1] but `pos_class_indices` = 0\n                class_idx += 1\n            proba[:, class_idx] = calibrator.predict(this_pred)\n\n        # Normalize the probabilities\n        if n_classes == 2:\n            proba[:, 0] = 1. 
- proba[:, 1]\n        else:\n            denominator = np.sum(proba, axis=1)[:, np.newaxis]\n            # In the edge case where for each class calibrator returns a null\n            # probability for a given sample, use the uniform distribution\n            # instead.\n            uniform_proba = np.full_like(proba, 1 / n_classes)\n            proba = np.divide(proba, denominator, out=uniform_proba,\n                              where=denominator != 0)\n\n        # Deal with cases where the predicted probability minimally exceeds 1.0\n        proba[(1.0 < proba) & (proba <= 1.0 + 1e-5)] = 1.0\n\n        return proba"
        }
      ]
    },
    {
      "pr_number": 19879,
      "pr_title": "FIX Error for sparse matrix in OrdinalEncoder.inverse_transform",
      "pr_body": "closes #19878\r\n\r\n`OrdinalEncoder.inverse_transform` should not support sparse matrix. It was already failing but with an obscure message.\r\nThis PR adds a non-regression test to check for the error message.",
      "issue_id": 19878,
      "issue_title": "OrdinalEncoder accept and failed with sparse matrix in inverse_transform",
      "issue_body": "`OrdinalEncoder` was documented to accept sparse matrix in `inverse_transform`.\r\nA check is internally done to accept sparse matrix. However, the `inverse_transform` will fail with this type of data.\r\nWe should remove the support in the check and make sure that we issue the right error.\r\n",
      "issue_closed_at": "2021-04-13T11:59:30Z",
      "base_commit": "767fd63c9ddddc46e288fdec2cca36a129529a8e",
      "changes": [
        {
          "file": "sklearn/preprocessing/_encoders.py",
          "type": "function",
          "name": "inverse_transform",
          "class_name": "OrdinalEncoder",
          "code": "def inverse_transform(self, X):\n        \"\"\"\n        Convert the data back to the original representation.\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix} of shape (n_samples, n_features)\n            The transformed data.\n\n        Returns\n        -------\n        X_tr : ndarray of shape (n_samples, n_features)\n            Inverse transformed array.\n        \"\"\"\n        check_is_fitted(self)\n        X = check_array(X, accept_sparse='csr', force_all_finite='allow-nan')\n\n        n_samples, _ = X.shape\n        n_features = len(self.categories_)\n\n        # validate shape of passed X\n        msg = (\"Shape of the passed X data is not correct. Expected {0} \"\n               \"columns, got {1}.\")\n        if X.shape[1] != n_features:\n            raise ValueError(msg.format(n_features, X.shape[1]))\n\n        # create resulting array of appropriate dtype\n        dt = np.find_common_type([cat.dtype for cat in self.categories_], [])\n        X_tr = np.empty((n_samples, n_features), dtype=dt)\n\n        found_unknown = {}\n\n        for i in range(n_features):\n            labels = X[:, i].astype('int64', copy=False)\n\n            # replace values of X[:, i] that were nan with actual indices\n            if i in self._missing_indices:\n                X_i_mask = _get_mask(X[:, i], np.nan)\n                labels[X_i_mask] = self._missing_indices[i]\n\n            if self.handle_unknown == 'use_encoded_value':\n                unknown_labels = labels == self.unknown_value\n                X_tr[:, i] = self.categories_[i][np.where(\n                    unknown_labels, 0, labels)]\n                found_unknown[i] = unknown_labels\n            else:\n                X_tr[:, i] = self.categories_[i][labels]\n\n        # insert None values for unknown values\n        if found_unknown:\n            X_tr = X_tr.astype(object, copy=False)\n\n            for idx, mask in found_unknown.items():\n              
  X_tr[mask, idx] = None\n\n        return X_tr"
        }
      ]
    },
    {
      "pr_number": 8936,
      "pr_title": "[MRG+1] fixed OOB_Score bug for bagging classifiers.",
      "pr_body": "Fixes #8933",
      "issue_id": 8933,
      "issue_title": "BUG: BaggingClassifier.oob_score_ should not change with class label",
      "issue_body": "Let us compute the oob score of a bagged classifier.\r\n\r\n```python\r\nimport numpy as np\r\nimport pandas as pd\r\nfrom sklearn.ensemble import BaggingClassifier\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\n\r\nN = 50\r\nrandState = 5\r\nlabel = 'Label'\r\nfeatures = ['A','B','C']\r\n\r\nlabels = np.random.randint(3, size = N) - 1\r\ndf = pd.DataFrame( labels , index=range(N), columns=[label] )\r\nfor col in features:\r\n    df[col] = df[label] + 0.01 * np.random.rand( N )\r\n\r\nclf = BaggingClassifier(base_estimator = KNeighborsClassifier(), n_estimators = 10, oob_score = True, random_state = randState )\r\nclf.fit(df[features], df[label])\r\nprint(clf.oob_score_)\r\n```\r\n\r\nHere, clf.oob_score_=0.0.\r\n\r\nNow, you would not expect the OOB accuracy to be a function of the class labels...\r\n\r\n```python\r\ndf.loc[ df[label] == -1 , label ] = 2\r\nclf = BaggingClassifier(base_estimator = KNeighborsClassifier(), n_estimators = 10, oob_score = True, random_state = randState )\r\nclf.fit(df[features], df[label])\r\nprint(clf.oob_score_)\r\n```\r\n\r\nNow, clf.oob_score_=1.0.\r\n\r\nClearly, the OOB score should not be a function of the labels arbitrarily chosen for the classes.\r\n\r\nsklearn.__version__: '0.18.1'\r\nnumpy.__version__: '1.11.3'",
      "issue_closed_at": "2017-06-08T09:35:49Z",
      "base_commit": "9131f89e6c165fb27dadd37d3168c1ee5ea84f5a",
      "changes": [
        {
          "file": "sklearn/ensemble/bagging.py",
          "type": "function",
          "name": "_set_oob_score",
          "class_name": "BaggingRegressor",
          "code": "def _set_oob_score(self, X, y):\n        n_samples = y.shape[0]\n\n        predictions = np.zeros((n_samples,))\n        n_predictions = np.zeros((n_samples,))\n\n        for estimator, samples, features in zip(self.estimators_,\n                                                self.estimators_samples_,\n                                                self.estimators_features_):\n            # Create mask for OOB samples\n            mask = ~samples\n\n            predictions[mask] += estimator.predict((X[mask, :])[:, features])\n            n_predictions[mask] += 1\n\n        if (n_predictions == 0).any():\n            warn(\"Some inputs do not have OOB scores. \"\n                 \"This probably means too few estimators were used \"\n                 \"to compute any reliable oob estimates.\")\n            n_predictions[n_predictions == 0] = 1\n\n        predictions /= n_predictions\n\n        self.oob_prediction_ = predictions\n        self.oob_score_ = r2_score(y, predictions)"
        }
      ]
    }
  ]
}