{
  "Selected_candidate": {
    "pr_number": 19641,
    "pr_title": "Fix Calibrated classifier cv predictions with pipeline",
    "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/main/CONTRIBUTING.md\r\n-->\r\n\r\n#### Reference Issues/PRs\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\nFixes #19637, #8710.\r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\nAs suggested in #19641, this PR removed validation from the CalibratedClassifierCV predict_proba function and replaced X.shape[0] with _num_samples(X).\r\n\r\nRegression testing has also been added.\r\n\r\n#### Any other comments?\r\n\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
    "issue_id": 19637,
    "issue_title": "CalibratedClassifier on pipelines",
    "issue_body": "Looks like CalibratedClassifierCV doesn't support pipelines, see https://github.com/scikit-learn/scikit-learn/issues/19625 and https://github.com/scikit-learn/scikit-learn/discussions/19279\r\n\r\nWe should try to delegate the validation to the base estimator (https://github.com/scikit-learn/scikit-learn/discussions/19279#discussioncomment-313812)",
    "issue_closed_at": "2021-03-10T10:27:24Z",
    "base_commit": "42e90e9ba28fb37c2c9bd3e8aed1ac2387f1d5d5",
    "changes": [
      {
        "file": "sklearn/calibration.py",
        "type": "line",
        "name": "line 24",
        "code": "                   MetaEstimatorMixin)\nfrom .preprocessing import label_binarize, LabelEncoder\nfrom .utils import (\n    check_array,\n    column_or_1d,\n    deprecated,\n    indexable,\n)\nfrom .utils.multiclass import check_classification_targets\nfrom .utils.fixes import delayed\nfrom .utils.validation import check_is_fitted, check_consistent_length\nfrom .utils.validation import _check_sample_weight\nfrom .pipeline import Pipeline\nfrom .isotonic import IsotonicRegression\nfrom .svm import LinearSVC"
      },
      {
        "file": "sklearn/calibration.py",
        "type": "function",
        "name": "predict_proba",
        "class_name": "_CalibratedClassifier",
        "code": "def predict_proba(self, X):\n        \"\"\"Calculate calibrated probabilities.\n\n        Calculates classification calibrated probabilities\n        for each class, in a one-vs-all manner, for `X`.\n\n        Parameters\n        ----------\n        X : ndarray of shape (n_samples, n_features)\n            The sample data.\n\n        Returns\n        -------\n        proba : array, shape (n_samples, n_classes)\n            The predicted probabilities. Can be exact zeros.\n        \"\"\"\n        n_classes = len(self.classes)\n        pred_method = _get_prediction_method(self.base_estimator)\n        predictions = _compute_predictions(pred_method, X, n_classes)\n\n        label_encoder = LabelEncoder().fit(self.classes)\n        pos_class_indices = label_encoder.transform(\n            self.base_estimator.classes_\n        )\n\n        proba = np.zeros((X.shape[0], n_classes))\n        for class_idx, this_pred, calibrator in \\\n                zip(pos_class_indices, predictions.T, self.calibrators):\n            if n_classes == 2:\n                # When binary, `predictions` consists only of predictions for\n                # clf.classes_[1] but `pos_class_indices` = 0\n                class_idx += 1\n            proba[:, class_idx] = calibrator.predict(this_pred)\n\n        # Normalize the probabilities\n        if n_classes == 2:\n            proba[:, 0] = 1. - proba[:, 1]\n        else:\n            denominator = np.sum(proba, axis=1)[:, np.newaxis]\n            # In the edge case where for each class calibrator returns a null\n            # probability for a given sample, use the uniform distribution\n            # instead.\n            uniform_proba = np.full_like(proba, 1 / n_classes)\n            proba = np.divide(proba, denominator, out=uniform_proba,\n                              where=denominator != 0)\n\n        # Deal with cases where the predicted probability minimally exceeds 1.0\n        proba[(1.0 < proba) & (proba <= 1.0 + 1e-5)] = 1.0\n\n        return proba"
      },
      {
        "file": "sklearn/calibration.py",
        "type": "function",
        "name": "predict_proba",
        "class_name": "_CalibratedClassifier",
        "code": "def predict_proba(self, X):\n        \"\"\"Calculate calibrated probabilities.\n\n        Calculates classification calibrated probabilities\n        for each class, in a one-vs-all manner, for `X`.\n\n        Parameters\n        ----------\n        X : ndarray of shape (n_samples, n_features)\n            The sample data.\n\n        Returns\n        -------\n        proba : array, shape (n_samples, n_classes)\n            The predicted probabilities. Can be exact zeros.\n        \"\"\"\n        n_classes = len(self.classes)\n        pred_method = _get_prediction_method(self.base_estimator)\n        predictions = _compute_predictions(pred_method, X, n_classes)\n\n        label_encoder = LabelEncoder().fit(self.classes)\n        pos_class_indices = label_encoder.transform(\n            self.base_estimator.classes_\n        )\n\n        proba = np.zeros((X.shape[0], n_classes))\n        for class_idx, this_pred, calibrator in \\\n                zip(pos_class_indices, predictions.T, self.calibrators):\n            if n_classes == 2:\n                # When binary, `predictions` consists only of predictions for\n                # clf.classes_[1] but `pos_class_indices` = 0\n                class_idx += 1\n            proba[:, class_idx] = calibrator.predict(this_pred)\n\n        # Normalize the probabilities\n        if n_classes == 2:\n            proba[:, 0] = 1. - proba[:, 1]\n        else:\n            denominator = np.sum(proba, axis=1)[:, np.newaxis]\n            # In the edge case where for each class calibrator returns a null\n            # probability for a given sample, use the uniform distribution\n            # instead.\n            uniform_proba = np.full_like(proba, 1 / n_classes)\n            proba = np.divide(proba, denominator, out=uniform_proba,\n                              where=denominator != 0)\n\n        # Deal with cases where the predicted probability minimally exceeds 1.0\n        proba[(1.0 < proba) & (proba <= 1.0 + 1e-5)] = 1.0\n\n        return proba"
      },
      {
        "file": "sklearn/calibration.py",
        "type": "function",
        "name": "predict_proba",
        "class_name": "_CalibratedClassifier",
        "code": "def predict_proba(self, X):\n        \"\"\"Calculate calibrated probabilities.\n\n        Calculates classification calibrated probabilities\n        for each class, in a one-vs-all manner, for `X`.\n\n        Parameters\n        ----------\n        X : ndarray of shape (n_samples, n_features)\n            The sample data.\n\n        Returns\n        -------\n        proba : array, shape (n_samples, n_classes)\n            The predicted probabilities. Can be exact zeros.\n        \"\"\"\n        n_classes = len(self.classes)\n        pred_method = _get_prediction_method(self.base_estimator)\n        predictions = _compute_predictions(pred_method, X, n_classes)\n\n        label_encoder = LabelEncoder().fit(self.classes)\n        pos_class_indices = label_encoder.transform(\n            self.base_estimator.classes_\n        )\n\n        proba = np.zeros((X.shape[0], n_classes))\n        for class_idx, this_pred, calibrator in \\\n                zip(pos_class_indices, predictions.T, self.calibrators):\n            if n_classes == 2:\n                # When binary, `predictions` consists only of predictions for\n                # clf.classes_[1] but `pos_class_indices` = 0\n                class_idx += 1\n            proba[:, class_idx] = calibrator.predict(this_pred)\n\n        # Normalize the probabilities\n        if n_classes == 2:\n            proba[:, 0] = 1. - proba[:, 1]\n        else:\n            denominator = np.sum(proba, axis=1)[:, np.newaxis]\n            # In the edge case where for each class calibrator returns a null\n            # probability for a given sample, use the uniform distribution\n            # instead.\n            uniform_proba = np.full_like(proba, 1 / n_classes)\n            proba = np.divide(proba, denominator, out=uniform_proba,\n                              where=denominator != 0)\n\n        # Deal with cases where the predicted probability minimally exceeds 1.0\n        proba[(1.0 < proba) & (proba <= 1.0 + 1e-5)] = 1.0\n\n        return proba"
      }
    ]
  },
  "Justification": "Candidate C is the most relevant because it directly addresses issues with the `CalibratedClassifierCV` and has a very similar context, especially regarding its integration with pipelines. Both the CURRENT bug report and Candidate C's report involve the `predict_proba` function and validation handling in `CalibratedClassifierCV`. Addressing the problem of incorrect predictions when using pipelines or configuration settings, such as transforming outputs, makes Candidate C an ideal choice for debugging the CURRENT bug. The structural and module similarities, along with their common focus on `CalibratedClassifierCV`, provide substantial insights for fixing the CURRENT issue.",
  "instance_id": "scikit-learn__scikit-learn-25500",
  "repo": "scikit-learn/scikit-learn",
  "created_at": "2023-01-27T19:49:28Z",
  "problem_statement": "CalibratedClassifierCV doesn't work with `set_config(transform_output=\"pandas\")`\n### Describe the bug\r\n\r\nCalibratedClassifierCV with isotonic regression doesn't work when we previously set `set_config(transform_output=\"pandas\")`.\r\nThe IsotonicRegression seems to return a dataframe, which is a problem for `_CalibratedClassifier`  in `predict_proba` where it tries to put the dataframe in a numpy array row `proba[:, class_idx] = calibrator.predict(this_pred)`.\r\n\r\n### Steps/Code to Reproduce\r\n\r\n```python\r\nimport numpy as np\r\nfrom sklearn import set_config\r\nfrom sklearn.calibration import CalibratedClassifierCV\r\nfrom sklearn.linear_model import SGDClassifier\r\n\r\nset_config(transform_output=\"pandas\")\r\nmodel = CalibratedClassifierCV(SGDClassifier(), method='isotonic')\r\nmodel.fit(np.arange(90).reshape(30, -1), np.arange(30) % 2)\r\nmodel.predict(np.arange(90).reshape(30, -1))\r\n```\r\n\r\n### Expected Results\r\n\r\nIt should not crash.\r\n\r\n### Actual Results\r\n\r\n```\r\n../core/model_trainer.py:306: in train_model\r\n    cv_predictions = cross_val_predict(pipeline,\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:968: in cross_val_predict\r\n    predictions = parallel(\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/joblib/parallel.py:1085: in __call__\r\n    if self.dispatch_one_batch(iterator):\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/joblib/parallel.py:901: in dispatch_one_batch\r\n    self._dispatch(tasks)\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/joblib/parallel.py:819: in _dispatch\r\n    job = self._backend.apply_async(batch, callback=cb)\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/joblib/_parallel_backends.py:208: in apply_async\r\n    result = ImmediateResult(func)\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/joblib/_parallel_backends.py:597: in __init__\r\n    self.results = batch()\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/joblib/parallel.py:288: in __call__\r\n    return [func(*args, **kwargs)\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/joblib/parallel.py:288: in <listcomp>\r\n    return [func(*args, **kwargs)\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/sklearn/utils/fixes.py:117: in __call__\r\n    return self.function(*args, **kwargs)\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:1052: in _fit_and_predict\r\n    predictions = func(X_test)\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/sklearn/pipeline.py:548: in predict_proba\r\n    return self.steps[-1][1].predict_proba(Xt, **predict_proba_params)\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/sklearn/calibration.py:477: in predict_proba\r\n    proba = calibrated_classifier.predict_proba(X)\r\n../../../../.anaconda3/envs/strategy-training/lib/python3.9/site-packages/sklearn/calibration.py:764: in predict_proba\r\n    proba[:, class_idx] = calibrator.predict(this_pred)\r\nE   ValueError: could not broadcast input array from shape (20,1) into shape (20,)\r\n```\r\n\r\n### Versions\r\n\r\n```shell\r\nSystem:\r\n    python: 3.9.15 (main, Nov 24 2022, 14:31:59)  [GCC 11.2.0]\r\nexecutable: /home/philippe/.anaconda3/envs/strategy-training/bin/python\r\n   machine: Linux-5.15.0-57-generic-x86_64-with-glibc2.31\r\n\r\nPython dependencies:\r\n      sklearn: 1.2.0\r\n          pip: 22.2.2\r\n   setuptools: 62.3.2\r\n        numpy: 1.23.5\r\n        scipy: 1.9.3\r\n       Cython: None\r\n       pandas: 1.4.1\r\n   matplotlib: 3.6.3\r\n       joblib: 1.2.0\r\nthreadpoolctl: 3.1.0\r\n\r\nBuilt with OpenMP: True\r\n\r\nthreadpoolctl info:\r\n       user_api: openmp\r\n   internal_api: openmp\r\n         prefix: libgomp\r\n       filepath: /home/philippe/.anaconda3/envs/strategy-training/lib/python3.9/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0\r\n        version: None\r\n    num_threads: 12\r\n\r\n       user_api: blas\r\n   internal_api: openblas\r\n         prefix: libopenblas\r\n       filepath: /home/philippe/.anaconda3/envs/strategy-training/lib/python3.9/site-packages/numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so\r\n        version: 0.3.20\r\nthreading_layer: pthreads\r\n   architecture: Haswell\r\n    num_threads: 12\r\n\r\n       user_api: blas\r\n   internal_api: openblas\r\n         prefix: libopenblas\r\n       filepath: /home/philippe/.anaconda3/envs/strategy-training/lib/python3.9/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so\r\n        version: 0.3.18\r\nthreading_layer: pthreads\r\n   architecture: Haswell\r\n    num_threads: 12\r\n```\r\n\n",
  "patch": "diff --git a/sklearn/isotonic.py b/sklearn/isotonic.py\n--- a/sklearn/isotonic.py\n+++ b/sklearn/isotonic.py\n@@ -360,23 +360,16 @@ def fit(self, X, y, sample_weight=None):\n         self._build_f(X, y)\n         return self\n \n-    def transform(self, T):\n-        \"\"\"Transform new data by linear interpolation.\n-\n-        Parameters\n-        ----------\n-        T : array-like of shape (n_samples,) or (n_samples, 1)\n-            Data to transform.\n+    def _transform(self, T):\n+        \"\"\"`_transform` is called by both `transform` and `predict` methods.\n \n-            .. versionchanged:: 0.24\n-               Also accepts 2d array with 1 feature.\n+        Since `transform` is wrapped to output arrays of specific types (e.g.\n+        NumPy arrays, pandas DataFrame), we cannot make `predict` call `transform`\n+        directly.\n \n-        Returns\n-        -------\n-        y_pred : ndarray of shape (n_samples,)\n-            The transformed data.\n+        The above behaviour could be changed in the future, if we decide to output\n+        other type of arrays when calling `predict`.\n         \"\"\"\n-\n         if hasattr(self, \"X_thresholds_\"):\n             dtype = self.X_thresholds_.dtype\n         else:\n@@ -397,6 +390,24 @@ def transform(self, T):\n \n         return res\n \n+    def transform(self, T):\n+        \"\"\"Transform new data by linear interpolation.\n+\n+        Parameters\n+        ----------\n+        T : array-like of shape (n_samples,) or (n_samples, 1)\n+            Data to transform.\n+\n+            .. versionchanged:: 0.24\n+               Also accepts 2d array with 1 feature.\n+\n+        Returns\n+        -------\n+        y_pred : ndarray of shape (n_samples,)\n+            The transformed data.\n+        \"\"\"\n+        return self._transform(T)\n+\n     def predict(self, T):\n         \"\"\"Predict new data by linear interpolation.\n \n@@ -410,7 +421,7 @@ def predict(self, T):\n         y_pred : ndarray of shape (n_samples,)\n             Transformed data.\n         \"\"\"\n-        return self.transform(T)\n+        return self._transform(T)\n \n     # We implement get_feature_names_out here instead of using\n     # `ClassNamePrefixFeaturesOutMixin`` because `input_features` are ignored.\n"
}