{
  "instance_id": "scikit-learn__scikit-learn-25570",
  "repo": "scikit-learn/scikit-learn",
  "created_at": "2023-02-08T18:28:21Z",
  "problem_statement": "ColumnTransformer with pandas output can't handle transformers with no features\n### Describe the bug\r\n\r\nHi,\r\n\r\nColumnTransformer doesn't deal well with transformers that apply to 0 features (categorical_features in the example below) when using \"pandas\" as output. It seems steps with 0 features are not fitted, hence don't appear in `self._iter(fitted=True)` (_column_transformer.py l.856) and hence break the input to the `_add_prefix_for_feature_names_out` function (l.859).\r\n\r\n\r\n### Steps/Code to Reproduce\r\n\r\nHere is some code to reproduce the error. If you remove .set_output(transform=\"pandas\") on the line before last, all works fine. If you remove the (\"categorical\", ...) step, it works fine too.\r\n\r\n```python\r\nimport numpy as np\r\nimport pandas as pd\r\nfrom lightgbm import LGBMClassifier\r\nfrom sklearn.compose import ColumnTransformer\r\nfrom sklearn.impute import SimpleImputer\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.preprocessing import RobustScaler\r\n\r\nX = pd.DataFrame(data=[[1.0, 2.0, 3.0, 4.0], [4, 2, 2, 5]],\r\n                 columns=[\"a\", \"b\", \"c\", \"d\"])\r\ny = np.array([0, 1])\r\ncategorical_features = []\r\nnumerical_features = [\"a\", \"b\", \"c\"]\r\nmodel_preprocessing = (\"preprocessing\",\r\n                       ColumnTransformer([\r\n                           ('categorical', 'passthrough', categorical_features),\r\n                           ('numerical', Pipeline([(\"scaler\", RobustScaler()),\r\n                                                   (\"imputer\", SimpleImputer(strategy=\"median\"))\r\n                                                   ]), numerical_features),\r\n                       ], remainder='drop'))\r\npipeline = Pipeline([model_preprocessing, (\"classifier\", LGBMClassifier())]).set_output(transform=\"pandas\")\r\npipeline.fit(X, y)\r\n```\r\n\r\n### Expected Results\r\n\r\nThe step with no features should be ignored.\r\n\r\n### Actual 
Results\r\n\r\nHere is the error message:\r\n```pytb\r\nTraceback (most recent call last):\r\n  File \"/home/philippe/workspace/script.py\", line 22, in <module>\r\n    pipeline.fit(X, y)\r\n  File \"/home/philippe/.anaconda3/envs/deleteme/lib/python3.9/site-packages/sklearn/pipeline.py\", line 402, in fit\r\n    Xt = self._fit(X, y, **fit_params_steps)\r\n  File \"/home/philippe/.anaconda3/envs/deleteme/lib/python3.9/site-packages/sklearn/pipeline.py\", line 360, in _fit\r\n    X, fitted_transformer = fit_transform_one_cached(\r\n  File \"/home/philippe/.anaconda3/envs/deleteme/lib/python3.9/site-packages/joblib/memory.py\", line 349, in __call__\r\n    return self.func(*args, **kwargs)\r\n  File \"/home/philippe/.anaconda3/envs/deleteme/lib/python3.9/site-packages/sklearn/pipeline.py\", line 894, in _fit_transform_one\r\n    res = transformer.fit_transform(X, y, **fit_params)\r\n  File \"/home/philippe/.anaconda3/envs/deleteme/lib/python3.9/site-packages/sklearn/utils/_set_output.py\", line 142, in wrapped\r\n    data_to_wrap = f(self, X, *args, **kwargs)\r\n  File \"/home/philippe/.anaconda3/envs/deleteme/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py\", line 750, in fit_transform\r\n    return self._hstack(list(Xs))\r\n  File \"/home/philippe/.anaconda3/envs/deleteme/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py\", line 862, in _hstack\r\n    output.columns = names_out\r\n  File \"/home/philippe/.anaconda3/envs/deleteme/lib/python3.9/site-packages/pandas/core/generic.py\", line 5596, in __setattr__\r\n    return object.__setattr__(self, name, value)\r\n  File \"pandas/_libs/properties.pyx\", line 70, in pandas._libs.properties.AxisProperty.__set__\r\n  File \"/home/philippe/.anaconda3/envs/deleteme/lib/python3.9/site-packages/pandas/core/generic.py\", line 769, in _set_axis\r\n    self._mgr.set_axis(axis, labels)\r\n  File 
\"/home/philippe/.anaconda3/envs/deleteme/lib/python3.9/site-packages/pandas/core/internals/managers.py\", line 214, in set_axis\r\n    self._validate_set_axis(axis, new_labels)\r\n  File \"/home/philippe/.anaconda3/envs/deleteme/lib/python3.9/site-packages/pandas/core/internals/base.py\", line 69, in _validate_set_axis\r\n    raise ValueError(\r\nValueError: Length mismatch: Expected axis has 3 elements, new values have 0 elements\r\n\r\nProcess finished with exit code 1\r\n```\r\n\r\n### Versions\r\n\r\n```shell\r\nSystem:\r\n    python: 3.9.15 (main, Nov 24 2022, 14:31:59)  [GCC 11.2.0]\r\nexecutable: /home/philippe/.anaconda3/envs/strategy-training/bin/python\r\n   machine: Linux-5.15.0-57-generic-x86_64-with-glibc2.31\r\n\r\nPython dependencies:\r\n      sklearn: 1.2.0\r\n          pip: 22.2.2\r\n   setuptools: 62.3.2\r\n        numpy: 1.23.5\r\n        scipy: 1.9.3\r\n       Cython: None\r\n       pandas: 1.4.1\r\n   matplotlib: 3.6.3\r\n       joblib: 1.2.0\r\nthreadpoolctl: 3.1.0\r\n\r\nBuilt with OpenMP: True\r\n\r\nthreadpoolctl info:\r\n       user_api: openmp\r\n   internal_api: openmp\r\n         prefix: libgomp\r\n       filepath: /home/philippe/.anaconda3/envs/strategy-training/lib/python3.9/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0\r\n        version: None\r\n    num_threads: 12\r\n\r\n       user_api: blas\r\n   internal_api: openblas\r\n         prefix: libopenblas\r\n       filepath: /home/philippe/.anaconda3/envs/strategy-training/lib/python3.9/site-packages/numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so\r\n        version: 0.3.20\r\nthreading_layer: pthreads\r\n   architecture: Haswell\r\n    num_threads: 12\r\n\r\n       user_api: blas\r\n   internal_api: openblas\r\n         prefix: libopenblas\r\n       filepath: /home/philippe/.anaconda3/envs/strategy-training/lib/python3.9/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so\r\n        version: 0.3.18\r\nthreading_layer: pthreads\r\n   architecture: Haswell\r\n    
num_threads: 12\r\n```\r\n\n",
  "patch": "diff --git a/sklearn/compose/_column_transformer.py b/sklearn/compose/_column_transformer.py\n--- a/sklearn/compose/_column_transformer.py\n+++ b/sklearn/compose/_column_transformer.py\n@@ -865,7 +865,9 @@ def _hstack(self, Xs):\n                 transformer_names = [\n                     t[0] for t in self._iter(fitted=True, replace_strings=True)\n                 ]\n-                feature_names_outs = [X.columns for X in Xs]\n+                # Selection of columns might be empty.\n+                # Hence feature names are filtered for non-emptiness.\n+                feature_names_outs = [X.columns for X in Xs if X.shape[1] != 0]\n                 names_out = self._add_prefix_for_feature_names_out(\n                     list(zip(transformer_names, feature_names_outs))\n                 )\n",
  "similar_bug_items": [
    {
      "pr_number": 19879,
      "pr_title": "FIX Error for sparse matrix in OrdinalEncoder.inverse_transform",
      "pr_body": "closes #19878\r\n\r\n`OrdinalEncoder.inverse_transform` should not support sparse matrices. It was already failing, but with an obscure message.\r\nThis PR adds a non-regression test to check for the error message.",
      "issue_id": 19878,
      "issue_title": "OrdinalEncoder accepts sparse matrices in inverse_transform but then fails",
      "issue_body": "`OrdinalEncoder` was documented to accept sparse matrices in `inverse_transform`, and an internal check allows them. However, `inverse_transform` fails on this type of data.\r\nWe should remove sparse support from the check and make sure that we raise the right error.\r\n",
      "issue_closed_at": "2021-04-13T11:59:30Z",
      "base_commit": "767fd63c9ddddc46e288fdec2cca36a129529a8e",
      "changes": [
        {
          "file": "sklearn/preprocessing/_encoders.py",
          "type": "function",
          "name": "inverse_transform",
          "class_name": "OrdinalEncoder",
          "code": "def inverse_transform(self, X):\n        \"\"\"\n        Convert the data back to the original representation.\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix} of shape (n_samples, n_features)\n            The transformed data.\n\n        Returns\n        -------\n        X_tr : ndarray of shape (n_samples, n_features)\n            Inverse transformed array.\n        \"\"\"\n        check_is_fitted(self)\n        X = check_array(X, accept_sparse='csr', force_all_finite='allow-nan')\n\n        n_samples, _ = X.shape\n        n_features = len(self.categories_)\n\n        # validate shape of passed X\n        msg = (\"Shape of the passed X data is not correct. Expected {0} \"\n               \"columns, got {1}.\")\n        if X.shape[1] != n_features:\n            raise ValueError(msg.format(n_features, X.shape[1]))\n\n        # create resulting array of appropriate dtype\n        dt = np.find_common_type([cat.dtype for cat in self.categories_], [])\n        X_tr = np.empty((n_samples, n_features), dtype=dt)\n\n        found_unknown = {}\n\n        for i in range(n_features):\n            labels = X[:, i].astype('int64', copy=False)\n\n            # replace values of X[:, i] that were nan with actual indices\n            if i in self._missing_indices:\n                X_i_mask = _get_mask(X[:, i], np.nan)\n                labels[X_i_mask] = self._missing_indices[i]\n\n            if self.handle_unknown == 'use_encoded_value':\n                unknown_labels = labels == self.unknown_value\n                X_tr[:, i] = self.categories_[i][np.where(\n                    unknown_labels, 0, labels)]\n                found_unknown[i] = unknown_labels\n            else:\n                X_tr[:, i] = self.categories_[i][labels]\n\n        # insert None values for unknown values\n        if found_unknown:\n            X_tr = X_tr.astype(object, copy=False)\n\n            for idx, mask in found_unknown.items():\n              
  X_tr[mask, idx] = None\n\n        return X_tr"
        }
      ]
    },
    {
      "pr_number": 23256,
      "pr_title": "FIX SGDRegressor and SGDClassifier use correct number of validation data",
      "pr_body": "#### Reference Issues/PRs\r\nFixes #23255.\r\n\r\n#### What does this implement/fix? Explain your changes.\r\n- Changes the `validation_mask` to a boolean array for correct indexing.\r\n- Fixes a failing test case.\r\n\r\n#### Any other comments?\r\nI haven't added any test cases because I'm not sure how to test this behavior without directly using internal API, such as checking against `_ValidationScoreCallback.X_val.shape`.\r\n",
      "issue_id": 23255,
      "issue_title": "BUG SGDRegressor only using the first two samples for validation error calculation",
      "issue_body": "### Describe the bug\r\n\r\n`BaseSGD._make_validation_split` returns a 0-1 integer array as validation mask, which is used to calculate the validation score with:\r\n```python\r\n_ValidationScoreCallback(\r\n            self,\r\n            X[validation_mask],\r\n            y[validation_mask],\r\n            sample_weight[validation_mask],\r\n            classes=classes,\r\n```\r\nhere `X[validation_mask]` is just repeating the first two samples as shown in the following code. A quick fix would be convert `validation_mask` to a boolean array.\r\n\r\n### Steps/Code to Reproduce\r\n\r\nI added a print statement in `_ValidationScoreCallback` to print out the shape of validation data\r\n```python\r\nclass _ValidationScoreCallback:\r\n    \"\"\"Callback for early stopping based on validation score\"\"\"\r\n\r\n    def __init__(self, estimator, X_val, y_val, sample_weight_val, classes=None):\r\n        self.estimator = clone(estimator)\r\n        self.estimator.t_ = 1  # to pass check_is_fitted\r\n        if classes is not None:\r\n            self.estimator.classes_ = classes\r\n        self.X_val = X_val\r\n        self.y_val = y_val\r\n        self.sample_weight_val = sample_weight_val\r\n        print(X_val.shape, y_val.shape)\r\n```\r\nand afterwards running the following code\r\n```python\r\nimport numpy as np\r\nfrom sklearn.linear_model import SGDRegressor\r\n\r\nX = np.random.randn(1000, 5)\r\ny = np.random.randn(1000)\r\n\r\nsgd = SGDRegressor(early_stopping=True, max_iter=1, validation_fraction=0.1)\r\nsgd.fit(X, y)\r\n```\r\n\r\n### Expected Results\r\n\r\nThe size of validation data should be `(100,5), (100, )`.\r\n\r\n### Actual Results\r\n\r\nprints out `(1000, 5), (1000,)`\r\n\r\n### Versions\r\n\r\n```shell\r\nSystem:\r\n    python: 3.8.8 (default, Apr 13 2021, 12:59:45)  [Clang 10.0.0 ]\r\nexecutable: /Users/mac/Documents/GitHub/.ENV/bin/python\r\n   machine: macOS-10.16-x86_64-i386-64bit\r\n\r\nPython dependencies:\r\n      
sklearn: 1.2.dev0\r\n          pip: 20.2.3\r\n   setuptools: 60.9.3\r\n        numpy: 1.21.2\r\n        scipy: 1.8.0\r\n       Cython: 0.29.28\r\n       pandas: 1.3.3\r\n   matplotlib: 3.5.1\r\n       joblib: 1.1.0\r\nthreadpoolctl: 3.0.0\r\n\r\nBuilt with OpenMP: False\r\n```\r\n",
      "issue_closed_at": "2022-05-02T11:41:03Z",
      "base_commit": "920ab2508fe634ad32df68bb0ebd4f4512fbfb53",
      "changes": [
        {
          "file": "sklearn/linear_model/_stochastic_gradient.py",
          "type": "function",
          "name": "_make_validation_split",
          "class_name": "BaseSGD",
          "code": "def _make_validation_split(self, y):\n        \"\"\"Split the dataset between training set and validation set.\n\n        Parameters\n        ----------\n        y : ndarray of shape (n_samples, )\n            Target values.\n\n        Returns\n        -------\n        validation_mask : ndarray of shape (n_samples, )\n            Equal to 1 on the validation set, 0 on the training set.\n        \"\"\"\n        n_samples = y.shape[0]\n        validation_mask = np.zeros(n_samples, dtype=np.uint8)\n        if not self.early_stopping:\n            # use the full set for training, with an empty validation set\n            return validation_mask\n\n        if is_classifier(self):\n            splitter_type = StratifiedShuffleSplit\n        else:\n            splitter_type = ShuffleSplit\n        cv = splitter_type(\n            test_size=self.validation_fraction, random_state=self.random_state\n        )\n        idx_train, idx_val = next(cv.split(np.zeros(shape=(y.shape[0], 1)), y))\n        if idx_train.shape[0] == 0 or idx_val.shape[0] == 0:\n            raise ValueError(\n                \"Splitting %d samples into a train set and a validation set \"\n                \"with validation_fraction=%r led to an empty set (%d and %d \"\n                \"samples). Please either change validation_fraction, increase \"\n                \"number of samples, or disable early_stopping.\"\n                % (\n                    n_samples,\n                    self.validation_fraction,\n                    idx_train.shape[0],\n                    idx_val.shape[0],\n                )\n            )\n\n        validation_mask[idx_val] = 1\n        return validation_mask"
        }
      ]
    },
    {
      "pr_number": 18528,
      "pr_title": "FIX TruncatedSVD.fit_transform returns the same as fit.transform",
      "pr_body": "Fix #15144\r\nSupersede and close #16421",
      "issue_id": 15144,
      "issue_title": "TruncatedSVD.fit(X).transform(X) is not the same as .fit_transform(X)",
      "issue_body": "<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\nRecall that SVD(X) decomposes X into three matrices, U, Sigma, and V^t. \r\n\r\nIn scikit-learn, TruncatedSVD treats .fit().transform() differently from .fit_transform(). On the one hand, .fit(X).transform(X) will return X @ V. On the other hand, .fit_transform(X) will return U * Sigma. If this is intended behaviour, I believe it should be detailed in the documentation.\r\n\r\n#### Steps/Code to Reproduce\r\n\r\nExample:\r\n```python\r\nimport numpy as np\r\nfrom sklearn.decomposition import TruncatedSVD\r\nfrom sklearn.random_projection import sparse_random_matrix\r\nX = sparse_random_matrix(100, 100, density=0.01, random_state=42)\r\n\r\nx1 =  TruncatedSVD(n_components=5, n_iter=7, random_state=42).fit_transform(X)\r\nx2 =  TruncatedSVD(n_components=5, n_iter=7, random_state=42).fit_transform(X)\r\nx3 =  TruncatedSVD(n_components=5, n_iter=7, random_state=42).fit(X).transform(X)\r\n\r\nnp.linalg.norm(x1 - x2)  # >>> equals 0, as desired\r\nnp.linalg.norm(x1 - x3)  # >>> equals 0.0248, implies the result is different -- not desirable!\r\n```\r\n\r\n#### Expected Results\r\nWe should have np.linalg.norm(x1 - x3) == 0 to be True.\r\n\r\n#### Actual Results\r\nWe get that np.linalg.norm(x1 - x3) equals 0.0248, meaning that the result of .fit(X).transform(X) is different from .fit_transform(X).\r\n\r\n#### Versions\r\nPython deps:\r\n       pip: 19.2.3\r\nsetuptools: 41.2.0\r\n   sklearn: 0.21.3\r\n     numpy: 1.17.2\r\n     scipy: 1.3.1\r\n    Cython: None\r\n    pandas: 0.25.1\r\n",
      "issue_closed_at": "2020-10-06T20:47:11Z",
      "base_commit": "aa4a10dbfaee9fd52af06f0f0c8e8ae77f243ef6",
      "changes": [
        {
          "file": "sklearn/decomposition/_truncated_svd.py",
          "type": "function",
          "name": "__init__",
          "class_name": "TruncatedSVD",
          "code": "def __init__(self, n_components=2, *, algorithm=\"randomized\", n_iter=5,\n                 random_state=None, tol=0.):\n        self.algorithm = algorithm\n        self.n_components = n_components\n        self.n_iter = n_iter\n        self.random_state = random_state\n        self.tol = tol"
        },
        {
          "file": "sklearn/decomposition/_truncated_svd.py",
          "type": "function",
          "name": "fit",
          "class_name": "TruncatedSVD",
          "code": "def fit(self, X, y=None):\n        \"\"\"Fit LSI model on training data X.\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix} of shape (n_samples, n_features)\n            Training data.\n\n        y : Ignored\n\n        Returns\n        -------\n        self : object\n            Returns the transformer object.\n        \"\"\"\n        self.fit_transform(X)\n        return self"
        },
        {
          "file": "sklearn/decomposition/_truncated_svd.py",
          "type": "function",
          "name": "fit_transform",
          "class_name": "TruncatedSVD",
          "code": "def fit_transform(self, X, y=None):\n        \"\"\"Fit LSI model to X and perform dimensionality reduction on X.\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix} of shape (n_samples, n_features)\n            Training data.\n\n        y : Ignored\n\n        Returns\n        -------\n        X_new : ndarray of shape (n_samples, n_components)\n            Reduced version of X. This will always be a dense array.\n        \"\"\"\n        X = self._validate_data(X, accept_sparse=['csr', 'csc'],\n                                ensure_min_features=2)\n        random_state = check_random_state(self.random_state)\n\n        if self.algorithm == \"arpack\":\n            U, Sigma, VT = svds(X, k=self.n_components, tol=self.tol)\n            # svds doesn't abide by scipy.linalg.svd/randomized_svd\n            # conventions, so reverse its outputs.\n            Sigma = Sigma[::-1]\n            U, VT = svd_flip(U[:, ::-1], VT[::-1])\n\n        elif self.algorithm == \"randomized\":\n            k = self.n_components\n            n_features = X.shape[1]\n            if k >= n_features:\n                raise ValueError(\"n_components must be < n_features;\"\n                                 \" got %d >= %d\" % (k, n_features))\n            U, Sigma, VT = randomized_svd(X, self.n_components,\n                                          n_iter=self.n_iter,\n                                          random_state=random_state)\n        else:\n            raise ValueError(\"unknown algorithm %r\" % self.algorithm)\n\n        self.components_ = VT\n\n        # Calculate explained variance & explained variance ratio\n        X_transformed = U * Sigma\n        self.explained_variance_ = exp_var = np.var(X_transformed, axis=0)\n        if sp.issparse(X):\n            _, full_var = mean_variance_axis(X, axis=0)\n            full_var = full_var.sum()\n        else:\n            full_var = np.var(X, axis=0).sum()\n        
self.explained_variance_ratio_ = exp_var / full_var\n        self.singular_values_ = Sigma  # Store the singular values.\n\n        return X_transformed"
        }
      ]
    },
    {
      "pr_number": 2523,
      "pr_title": "[MRG] FIX #2481: add warning for bug in old numpy with unicode",
      "pr_body": "This is a fix for #2481, which causes the Jenkins tests to fail with numpy 1.3.0.\n",
      "issue_id": 2481,
      "issue_title": "LabelEncoder doesn't work correctly for unicode labels in Python 2.6 + numpy 1.3",
      "issue_body": "LabelEncoder works incorrectly for unicode labels in Python 2.6 + numpy 1.3. This is currently untested; to reproduce, replace bytestrings with unicode strings here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/preprocessing/tests/test_label.py#L191\n\nThis is the cause of the Jenkins failure that https://github.com/scikit-learn/scikit-learn/pull/2462 triggered.\n",
      "issue_closed_at": "2013-10-16T12:24:23Z",
      "base_commit": "a2580d6fe04340f535c624ae48b4a52a66cbc839",
      "changes": [
        {
          "file": "sklearn/preprocessing/label.py",
          "type": "line",
          "name": "line 8",
          "code": "\nfrom ..base import BaseEstimator, TransformerMixin\n\nfrom ..utils.fixes import unique\nfrom ..utils import deprecated, column_or_1d\n\nfrom ..utils.multiclass import unique_labels"
        },
        {
          "file": "sklearn/preprocessing/label.py",
          "type": "line",
          "name": "line 25",
          "code": "    'LabelEncoder',\n]\n\n\nclass LabelEncoder(BaseEstimator, TransformerMixin):\n    \"\"\"Encode labels with value between 0 and n_classes-1."
        },
        {
          "file": "sklearn/preprocessing/label.py",
          "type": "function",
          "name": "fit",
          "class_name": "LabelBinarizer",
          "code": "def fit(self, y):\n        \"\"\"Fit label binarizer\n\n        Parameters\n        ----------\n        y : numpy array of shape (n_samples,) or sequence of sequences\n            Target values. In the multilabel case the nested sequences can\n            have variable lengths.\n\n        Returns\n        -------\n        self : returns an instance of self.\n        \"\"\"\n        y_type = type_of_target(y)\n        self.multilabel_ = y_type.startswith('multilabel')\n        if self.multilabel_:\n            self.indicator_matrix_ = y_type == 'multilabel-indicator'\n\n        self.classes_ = unique_labels(y)\n\n        return self"
        },
        {
          "file": "sklearn/preprocessing/label.py",
          "type": "function",
          "name": "fit_transform",
          "class_name": "LabelEncoder",
          "code": "def fit_transform(self, y):\n        \"\"\"Fit label encoder and return encoded labels\n\n        Parameters\n        ----------\n        y : array-like of shape [n_samples]\n            Target values.\n\n        Returns\n        -------\n        y : array-like of shape [n_samples]\n        \"\"\"\n        y = column_or_1d(y, warn=True)\n        self.classes_, y = unique(y, return_inverse=True)\n        return y"
        },
        {
          "file": "sklearn/preprocessing/label.py",
          "type": "function",
          "name": "transform",
          "class_name": "LabelBinarizer",
          "code": "def transform(self, y):\n        \"\"\"Transform multi-class labels to binary labels\n\n        The output of transform is sometimes referred to by some authors as the\n        1-of-K coding scheme.\n\n        Parameters\n        ----------\n        y : numpy array of shape [n_samples] or sequence of sequences\n            Target values. In the multilabel case the nested sequences can\n            have variable lengths.\n\n        Returns\n        -------\n        Y : numpy array of shape [n_samples, n_classes]\n        \"\"\"\n        self._check_fitted()\n\n        y_is_multilabel = type_of_target(y).startswith('multilabel')\n\n        if y_is_multilabel and not self.multilabel_:\n            raise ValueError(\"The object was not fitted with multilabel\"\n                             \" input.\")\n\n        return label_binarize(y, self.classes_,\n                              multilabel=self.multilabel_,\n                              pos_label=self.pos_label,\n                              neg_label=self.neg_label)"
        }
      ]
    },
    {
      "pr_number": 11043,
      "pr_title": "[MRG+2] ENH Passthrough DataFrame in FunctionTransformer",
      "pr_body": "#### Reference Issues/PRs\r\n\r\ncloses #10655\r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\nAdded the following option\r\n\r\n- [x] Raise FutureWarning that `validate=False` will be the default in the future.\r\n- [x] Convert list to array when `validate=False`.\r\n- [x] Make a what's new entry for the change of behaviour.\r\n\r\n#### Any other comments?\r\n",
      "issue_id": 10655,
      "issue_title": "FunctionTransformer should not convert DataFrames to arrays by default",
      "issue_body": "I would expect a common use of FunctionTransformer is to apply some function to a Pandas DataFrame, ideally using its own methods or accessors. As noted in #10648, it can be easy for users to miss that they need to set validate=False to pass through a DataFrame without converting it to a NumPy array. I think it would be more user-friendly to have `validate='array-or-frame'` by default, which would pass through DataFrames to the function, but otherwise convert its input to a 2d array. For strict backwards compatibility, the default should be changed through a deprecation cycle, warning whenever using the default validation means a DataFrame is currently converted to an array.\r\n\r\nDo others agree?",
      "issue_closed_at": "2018-07-10T09:48:04Z",
      "base_commit": "19bc7e8af6ec3468b6c7f4718a31cd5f508528cd",
      "changes": [
        {
          "file": "sklearn/preprocessing/_function_transformer.py",
          "type": "class",
          "name": "FunctionTransformer",
          "code": "class FunctionTransformer(BaseEstimator, TransformerMixin):\n    \"\"\"Constructs a transformer from an arbitrary callable.\n\n    A FunctionTransformer forwards its X (and optionally y) arguments to a\n    user-defined function or function object and returns the result of this\n    function. This is useful for stateless transformations such as taking the\n    log of frequencies, doing custom scaling, etc.\n\n    Note: If a lambda is used as the function, then the resulting\n    transformer will not be pickleable.\n\n    .. versionadded:: 0.17\n\n    Read more in the :ref:`User Guide <function_transformer>`.\n\n    Parameters\n    ----------\n    func : callable, optional default=None\n        The callable to use for the transformation. This will be passed\n        the same arguments as transform, with args and kwargs forwarded.\n        If func is None, then func will be the identity function.\n\n    inverse_func : callable, optional default=None\n        The callable to use for the inverse transformation. This will be\n        passed the same arguments as inverse transform, with args and\n        kwargs forwarded. If inverse_func is None, then inverse_func\n        will be the identity function.\n\n    validate : bool, optional default=True\n        Indicate that the input X array should be checked before calling\n        func. If validate is false, there will be no input validation.\n        If it is true, then X will be converted to a 2-dimensional NumPy\n        array or sparse matrix. If this conversion is not possible or X\n        contains NaN or infinity, an exception is raised.\n\n    accept_sparse : boolean, optional\n        Indicate that func accepts a sparse matrix as input. If validate is\n        False, this has no effect. 
Otherwise, if accept_sparse is false,\n        sparse matrix inputs will cause an exception to be raised.\n\n    pass_y : bool, optional default=False\n        Indicate that transform should forward the y argument to the\n        inner callable.\n\n        .. deprecated::0.19\n\n    check_inverse : bool, default=True\n       Whether to check that or ``func`` followed by ``inverse_func`` leads to\n       the original inputs. It can be used for a sanity check, raising a\n       warning when the condition is not fulfilled.\n\n       .. versionadded:: 0.20\n\n    kw_args : dict, optional\n        Dictionary of additional keyword arguments to pass to func.\n\n    inv_kw_args : dict, optional\n        Dictionary of additional keyword arguments to pass to inverse_func.\n\n    \"\"\"\n    def __init__(self, func=None, inverse_func=None, validate=True,\n                 accept_sparse=False, pass_y='deprecated', check_inverse=True,\n                 kw_args=None, inv_kw_args=None):\n        self.func = func\n        self.inverse_func = inverse_func\n        self.validate = validate\n        self.accept_sparse = accept_sparse\n        self.pass_y = pass_y\n        self.check_inverse = check_inverse\n        self.kw_args = kw_args\n        self.inv_kw_args = inv_kw_args\n\n    def _check_inverse_transform(self, X):\n        \"\"\"Check that func and inverse_func are the inverse.\"\"\"\n        idx_selected = slice(None, None, max(1, X.shape[0] // 100))\n        try:\n            assert_allclose_dense_sparse(\n                X[idx_selected],\n                self.inverse_transform(self.transform(X[idx_selected])))\n        except AssertionError:\n            warnings.warn(\"The provided functions are not strictly\"\n                          \" inverse of each other. 
If you are sure you\"\n                          \" want to proceed regardless, set\"\n                          \" 'check_inverse=False'.\", UserWarning)\n\n    def fit(self, X, y=None):\n        \"\"\"Fit transformer by checking X.\n\n        If ``validate`` is ``True``, ``X`` will be checked.\n\n        Parameters\n        ----------\n        X : array-like, shape (n_samples, n_features)\n            Input array.\n\n        Returns\n        -------\n        self\n        \"\"\"\n        if self.validate:\n            X = check_array(X, self.accept_sparse)\n        if (self.check_inverse and not (self.func is None or\n                                        self.inverse_func is None)):\n            self._check_inverse_transform(X)\n        return self\n\n    def transform(self, X, y='deprecated'):\n        \"\"\"Transform X using the forward function.\n\n        Parameters\n        ----------\n        X : array-like, shape (n_samples, n_features)\n            Input array.\n\n        y : (ignored)\n            .. deprecated::0.19\n\n        Returns\n        -------\n        X_out : array-like, shape (n_samples, n_features)\n            Transformed input.\n        \"\"\"\n        if not isinstance(y, string_types) or y != 'deprecated':\n            warnings.warn(\"The parameter y on transform() is \"\n                          \"deprecated since 0.19 and will be removed in 0.21\",\n                          DeprecationWarning)\n\n        return self._transform(X, y=y, func=self.func, kw_args=self.kw_args)\n\n    def inverse_transform(self, X, y='deprecated'):\n        \"\"\"Transform X using the inverse function.\n\n        Parameters\n        ----------\n        X : array-like, shape (n_samples, n_features)\n            Input array.\n\n        y : (ignored)\n            .. 
deprecated::0.19\n\n        Returns\n        -------\n        X_out : array-like, shape (n_samples, n_features)\n            Transformed input.\n        \"\"\"\n        if not isinstance(y, string_types) or y != 'deprecated':\n            warnings.warn(\"The parameter y on inverse_transform() is \"\n                          \"deprecated since 0.19 and will be removed in 0.21\",\n                          DeprecationWarning)\n        return self._transform(X, y=y, func=self.inverse_func,\n                               kw_args=self.inv_kw_args)\n\n    def _transform(self, X, y=None, func=None, kw_args=None):\n        if self.validate:\n            X = check_array(X, self.accept_sparse)\n\n        if func is None:\n            func = _identity\n\n        if (not isinstance(self.pass_y, string_types) or\n                self.pass_y != 'deprecated'):\n            # We do this to know if pass_y was set to False / True\n            pass_y = self.pass_y\n            warnings.warn(\"The parameter pass_y is deprecated since 0.19 and \"\n                          \"will be removed in 0.21\", DeprecationWarning)\n        else:\n            pass_y = False\n\n        return func(X, *((y,) if pass_y else ()),\n                    **(kw_args if kw_args else {}))"
        },
        {
          "file": "sklearn/preprocessing/_function_transformer.py",
          "type": "function",
          "name": "__init__",
          "class_name": "FunctionTransformer",
          "code": "def __init__(self, func=None, inverse_func=None, validate=True,\n                 accept_sparse=False, pass_y='deprecated', check_inverse=True,\n                 kw_args=None, inv_kw_args=None):\n        self.func = func\n        self.inverse_func = inverse_func\n        self.validate = validate\n        self.accept_sparse = accept_sparse\n        self.pass_y = pass_y\n        self.check_inverse = check_inverse\n        self.kw_args = kw_args\n        self.inv_kw_args = inv_kw_args"
        },
        {
          "file": "sklearn/preprocessing/_function_transformer.py",
          "type": "function",
          "name": "fit",
          "class_name": "FunctionTransformer",
          "code": "def fit(self, X, y=None):\n        \"\"\"Fit transformer by checking X.\n\n        If ``validate`` is ``True``, ``X`` will be checked.\n\n        Parameters\n        ----------\n        X : array-like, shape (n_samples, n_features)\n            Input array.\n\n        Returns\n        -------\n        self\n        \"\"\"\n        if self.validate:\n            X = check_array(X, self.accept_sparse)\n        if (self.check_inverse and not (self.func is None or\n                                        self.inverse_func is None)):\n            self._check_inverse_transform(X)\n        return self"
        },
        {
          "file": "sklearn/preprocessing/_function_transformer.py",
          "type": "function",
          "name": "inverse_transform",
          "class_name": "FunctionTransformer",
          "code": "def inverse_transform(self, X, y='deprecated'):\n        \"\"\"Transform X using the inverse function.\n\n        Parameters\n        ----------\n        X : array-like, shape (n_samples, n_features)\n            Input array.\n\n        y : (ignored)\n            .. deprecated::0.19\n\n        Returns\n        -------\n        X_out : array-like, shape (n_samples, n_features)\n            Transformed input.\n        \"\"\"\n        if not isinstance(y, string_types) or y != 'deprecated':\n            warnings.warn(\"The parameter y on inverse_transform() is \"\n                          \"deprecated since 0.19 and will be removed in 0.21\",\n                          DeprecationWarning)\n        return self._transform(X, y=y, func=self.inverse_func,\n                               kw_args=self.inv_kw_args)"
        }
      ]
    }
  ]
}