{
  "Selected_candidate": {
    "pr_number": 22775,
    "pr_title": "FIX Fix ColumnTransformer.get_feature_names_out with slices",
    "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/main/CONTRIBUTING.md\r\n-->\r\n\r\n#### Reference Issues/PRs\r\nFixes #22774\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\n\r\n\r\n#### What does this implement/fix? Explain your changes.\r\nCurrently, ColumnTransformer's get_feature_names_out does not work when the columns are specified as slices. This fix makes get_feature_names_out work properly with slices as well.\r\n\r\n#### Any other comments?\r\n\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
    "issue_id": 22774,
    "issue_title": "ColumnTransformer's get_feature_names_out does not work properly with slices",
    "issue_body": "### Describe the bug\r\n\r\nSlices are a supported for selecting columns in `ColumnTransformer`. Also, `get_column_names_out` is supported to label generated columns. But get_column_names_out does not work with slices.\r\n\r\n### Steps/Code to Reproduce\r\n\r\n```python\r\nimport numpy as np\r\nimport pandas as pd\r\nfrom sklearn.compose import ColumnTransformer\r\nfrom sklearn.preprocessing import Normalizer as _Normalizer\r\nfrom sklearn.base import _OneToOneFeatureMixin\r\n\r\n\r\nclass Normalizer(_OneToOneFeatureMixin, _Normalizer):\r\n    pass\r\n\r\nct = ColumnTransformer([(\"norm1\", Normalizer(norm='l1'), [0, 1]),\r\n                        (\"norm2\", Normalizer(norm='l1'), slice(2, 4))],\r\n                       verbose_feature_names_out=False)\r\nX = np.array([[0., 1., 2., 2.], [1., 1., 0., 1.]])\r\nct.fit_transform(X)\r\n\r\n#  Expecting array(['x0', 'x1', 'x2', 'x3'], dtype=object)\r\nct.get_feature_names_out()  # get TypeError\r\n\r\ndf = pd.DataFrame(X, columns=['c1', 'c2', 'c3', 'c4'])\r\nct.fit_transform(df)\r\n\r\n# Expecting array(['c0', 'c1', 'c2', 'c3'], dtype=object)\r\nct.get_feature_names_out()  # get TypeError\r\n```\r\n\r\n### Expected Results\r\n\r\nExpecting array(['x0', 'x1', 'x2', 'x3'], dtype=object)\r\n\r\nand \r\n\r\nExpecting array(['c0', 'c1', 'c2', 'c3'], dtype=object)\r\n\r\n### Actual Results\r\n\r\nProduces error\r\n\r\n### Versions\r\n\r\n```shell\r\nSystem:\r\n    python: 3.8.10 (default, Nov 26 2021, 20:14:08)  [GCC 9.3.0]\r\nexecutable: /home/popos/code/sklearn-transformer-extensions/.venv/bin/python\r\n   machine: Linux-5.16.11-76051611-generic-x86_64-with-glibc2.29\r\n\r\nPython dependencies:\r\n          pip: 22.0.4\r\n   setuptools: 60.9.3\r\n      sklearn: 1.0.2\r\n        numpy: 1.22.3\r\n        scipy: 1.6.1\r\n       Cython: None\r\n       pandas: 1.4.1\r\n   matplotlib: None\r\n       joblib: 1.1.0\r\nthreadpoolctl: 3.1.0\r\n\r\nBuilt with OpenMP: True\r\n```\r\n",
    "issue_closed_at": "2022-03-15T19:06:04Z",
    "base_commit": "cd5385e7112c453afff205fa0ce6a67356cbcf32",
    "changes": [
      {
        "file": "sklearn/compose/_column_transformer.py",
        "type": "function",
        "name": "_get_feature_name_out_for_transformer",
        "class_name": "ColumnTransformer",
        "code": "def _get_feature_name_out_for_transformer(\n        self, name, trans, column, feature_names_in\n    ):\n        \"\"\"Gets feature names of transformer.\n\n        Used in conjunction with self._iter(fitted=True) in get_feature_names_out.\n        \"\"\"\n        if trans == \"drop\" or _is_empty_column_selection(column):\n            return\n        elif trans == \"passthrough\":\n            if (not isinstance(column, slice)) and all(\n                isinstance(col, str) for col in column\n            ):\n                # selection was already strings\n                return column\n            else:\n                return feature_names_in[column]\n\n        # An actual transformer\n        if not hasattr(trans, \"get_feature_names_out\"):\n            raise AttributeError(\n                f\"Transformer {name} (type {type(trans).__name__}) does \"\n                \"not provide get_feature_names_out.\"\n            )\n        if isinstance(column, Iterable) and not all(\n            isinstance(col, str) for col in column\n        ):\n            column = _safe_indexing(feature_names_in, column)\n        return trans.get_feature_names_out(column)"
      }
    ]
  },
  "Justification": "Candidate B is the most helpful because it examines the operation of the `ColumnTransformer`, which is closely related to the `FeatureUnion` mentioned in the CURRENT bug report. Both involve how transformer outputs are managed within scikit-learn pipelines. The issue of `get_feature_names_out` not working properly with slices indicates potential underlying problems with data handling in transformers, which could also relate to the failure encountered when using `pandas` output in `FeatureUnion`. Furthermore, the candidate bug highlights a recognizable issue with how columns are processed and named in the output, likely providing parallel insights into transforming and aggregating data correctly. This structural and symptom similarity positions it well for providing valuable insights into resolving the CURRENT bug.",
  "instance_id": "scikit-learn__scikit-learn-25747",
  "repo": "scikit-learn/scikit-learn",
  "created_at": "2023-03-02T20:38:47Z",
  "problem_statement": "FeatureUnion not working when aggregating data and pandas transform output selected\n### Describe the bug\n\nI would like to use `pandas` transform output and use a custom transformer in a feature union which aggregates data. When I'm using this combination I got an error. When I use default `numpy` output it works fine.\n\n### Steps/Code to Reproduce\n\n```python\r\nimport pandas as pd\r\nfrom sklearn.base import BaseEstimator, TransformerMixin\r\nfrom sklearn import set_config\r\nfrom sklearn.pipeline import make_union\r\n\r\nindex = pd.date_range(start=\"2020-01-01\", end=\"2020-01-05\", inclusive=\"left\", freq=\"H\")\r\ndata = pd.DataFrame(index=index, data=[10] * len(index), columns=[\"value\"])\r\ndata[\"date\"] = index.date\r\n\r\n\r\nclass MyTransformer(BaseEstimator, TransformerMixin):\r\n    def fit(self, X: pd.DataFrame, y: pd.Series | None = None, **kwargs):\r\n        return self\r\n\r\n    def transform(self, X: pd.DataFrame, y: pd.Series | None = None) -> pd.DataFrame:\r\n        return X[\"value\"].groupby(X[\"date\"]).sum()\r\n\r\n\r\n# This works.\r\nset_config(transform_output=\"default\")\r\nprint(make_union(MyTransformer()).fit_transform(data))\r\n\r\n# This does not work.\r\nset_config(transform_output=\"pandas\")\r\nprint(make_union(MyTransformer()).fit_transform(data))\r\n```\n\n### Expected Results\n\nNo error is thrown when using `pandas` transform output.\n\n### Actual Results\n\n```python\r\n---------------------------------------------------------------------------\r\nValueError                                Traceback (most recent call last)\r\nCell In[5], line 25\r\n     23 # This does not work.\r\n     24 set_config(transform_output=\"pandas\")\r\n---> 25 print(make_union(MyTransformer()).fit_transform(data))\r\n\r\nFile ~/.local/share/virtualenvs/3e_VBrf2/lib/python3.10/site-packages/sklearn/utils/_set_output.py:150, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)\r\n    143 if isinstance(data_to_wrap, tuple):\r\n    144     # only wrap the first output for cross decomposition\r\n    145     return (\r\n    146         _wrap_data_with_container(method, data_to_wrap[0], X, self),\r\n    147         *data_to_wrap[1:],\r\n    148     )\r\n--> 150 return _wrap_data_with_container(method, data_to_wrap, X, self)\r\n\r\nFile ~/.local/share/virtualenvs/3e_VBrf2/lib/python3.10/site-packages/sklearn/utils/_set_output.py:130, in _wrap_data_with_container(method, data_to_wrap, original_input, estimator)\r\n    127     return data_to_wrap\r\n    129 # dense_config == \"pandas\"\r\n--> 130 return _wrap_in_pandas_container(\r\n    131     data_to_wrap=data_to_wrap,\r\n    132     index=getattr(original_input, \"index\", None),\r\n    133     columns=estimator.get_feature_names_out,\r\n    134 )\r\n\r\nFile ~/.local/share/virtualenvs/3e_VBrf2/lib/python3.10/site-packages/sklearn/utils/_set_output.py:59, in _wrap_in_pandas_container(data_to_wrap, columns, index)\r\n     57         data_to_wrap.columns = columns\r\n     58     if index is not None:\r\n---> 59         data_to_wrap.index = index\r\n     60     return data_to_wrap\r\n     62 return pd.DataFrame(data_to_wrap, index=index, columns=columns)\r\n\r\nFile ~/.local/share/virtualenvs/3e_VBrf2/lib/python3.10/site-packages/pandas/core/generic.py:5588, in NDFrame.__setattr__(self, name, value)\r\n   5586 try:\r\n   5587     object.__getattribute__(self, name)\r\n-> 5588     return object.__setattr__(self, name, value)\r\n   5589 except AttributeError:\r\n   5590     pass\r\n\r\nFile ~/.local/share/virtualenvs/3e_VBrf2/lib/python3.10/site-packages/pandas/_libs/properties.pyx:70, in pandas._libs.properties.AxisProperty.__set__()\r\n\r\nFile ~/.local/share/virtualenvs/3e_VBrf2/lib/python3.10/site-packages/pandas/core/generic.py:769, in NDFrame._set_axis(self, axis, labels)\r\n    767 def _set_axis(self, axis: int, labels: Index) -> None:\r\n    768     labels = ensure_index(labels)\r\n--> 769     self._mgr.set_axis(axis, labels)\r\n    770     self._clear_item_cache()\r\n\r\nFile ~/.local/share/virtualenvs/3e_VBrf2/lib/python3.10/site-packages/pandas/core/internals/managers.py:214, in BaseBlockManager.set_axis(self, axis, new_labels)\r\n    212 def set_axis(self, axis: int, new_labels: Index) -> None:\r\n    213     # Caller is responsible for ensuring we have an Index object.\r\n--> 214     self._validate_set_axis(axis, new_labels)\r\n    215     self.axes[axis] = new_labels\r\n\r\nFile ~/.local/share/virtualenvs/3e_VBrf2/lib/python3.10/site-packages/pandas/core/internals/base.py:69, in DataManager._validate_set_axis(self, axis, new_labels)\r\n     66     pass\r\n     68 elif new_len != old_len:\r\n---> 69     raise ValueError(\r\n     70         f\"Length mismatch: Expected axis has {old_len} elements, new \"\r\n     71         f\"values have {new_len} elements\"\r\n     72     )\r\n\r\nValueError: Length mismatch: Expected axis has 4 elements, new values have 96 elements\r\n```\n\n### Versions\n\n```shell\nSystem:\r\n    python: 3.10.6 (main, Aug 30 2022, 05:11:14) [Clang 13.0.0 (clang-1300.0.29.30)]\r\nexecutable: /Users/macbookpro/.local/share/virtualenvs/3e_VBrf2/bin/python\r\n   machine: macOS-11.3-x86_64-i386-64bit\r\n\r\nPython dependencies:\r\n      sklearn: 1.2.1\r\n          pip: 22.3.1\r\n   setuptools: 67.3.2\r\n        numpy: 1.23.5\r\n        scipy: 1.10.1\r\n       Cython: None\r\n       pandas: 1.4.4\r\n   matplotlib: 3.7.0\r\n       joblib: 1.2.0\r\nthreadpoolctl: 3.1.0\r\n\r\nBuilt with OpenMP: True\r\n\r\nthreadpoolctl info:\r\n       user_api: blas\r\n   internal_api: openblas\r\n         prefix: libopenblas\r\n       filepath: /Users/macbookpro/.local/share/virtualenvs/3e_VBrf2/lib/python3.10/site-packages/numpy/.dylibs/libopenblas64_.0.dylib\r\n        version: 0.3.20\r\nthreading_layer: pthreads\r\n   architecture: Haswell\r\n    num_threads: 4\r\n\r\n       user_api: openmp\r\n   internal_api: openmp\r\n         prefix: libomp\r\n       filepath: /Users/macbookpro/.local/share/virtualenvs/3e_VBrf2/lib/python3.10/site-packages/sklearn/.dylibs/libomp.dylib\r\n        version: None\r\n    num_threads: 8\r\n\r\n       user_api: blas\r\n   internal_api: openblas\r\n         prefix: libopenblas\r\n       filepath: /Users/macbookpro/.local/share/virtualenvs/3e_VBrf2/lib/python3.10/site-packages/scipy/.dylibs/libopenblas.0.dylib\r\n        version: 0.3.18\r\nthreading_layer: pthreads\r\n   architecture: Haswell\r\n    num_threads: 4\n```\n\n",
  "patch": "diff --git a/sklearn/utils/_set_output.py b/sklearn/utils/_set_output.py\n--- a/sklearn/utils/_set_output.py\n+++ b/sklearn/utils/_set_output.py\n@@ -34,7 +34,7 @@ def _wrap_in_pandas_container(\n         `range(n_features)`.\n \n     index : array-like, default=None\n-        Index for data.\n+        Index for data. `index` is ignored if `data_to_wrap` is already a DataFrame.\n \n     Returns\n     -------\n@@ -55,8 +55,6 @@ def _wrap_in_pandas_container(\n     if isinstance(data_to_wrap, pd.DataFrame):\n         if columns is not None:\n             data_to_wrap.columns = columns\n-        if index is not None:\n-            data_to_wrap.index = index\n         return data_to_wrap\n \n     return pd.DataFrame(data_to_wrap, index=index, columns=columns)\n"
}