{
  "RepoName": "https://github.com/scikit-learn-contrib/sklearn-pandas.git",
  "CommitSHA": "c9db2d6dcbf515eade751073f43318e43cae5177",
  "Time": "",
  "Difficulty": "Medium",
  "Type": "indexing error",
  "BuggyCode": [
    {
      "path": "scikit-learn-contrib_sklearn-pandas/test.py",
      "content": "import pytest\nfrom unittest.mock import Mock\nimport numpy as np\nimport pandas as pd\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn.compose import make_column_selector\nfrom sklearn.preprocessing import StandardScaler\n\n\nclass GetStartWith:\n    def __init__(self, start_str):\n        self.start_str = start_str\n\n    def __call__(self, X: pd.DataFrame) -> list:\n        return [c for c in X.columns if c.startswith(self.start_str)]\n\n\ndf = pd.DataFrame({\n    'sepal length (cm)': [1.0, 2.0, 3.0],\n    'sepal width (cm)': [1.0, 2.0, 3.0],\n    'petal length (cm)': [1.0, 2.0, 3.0],\n    'petal width (cm)': [1.0, 2.0, 3.0]\n})\nt = DataFrameMapper([\n    (make_column_selector(dtype_include=float), StandardScaler(), {'alias': 'x'}),\n    (GetStartWith('petal'), None, {'alias': 'petal'})\n], df_out=True, default=False)\n\nt.fit(df)\nprint(t.transform(df).shape)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/setup.py",
      "content": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\nfrom setuptools import setup\nfrom setuptools.command.test import test as TestCommand\nimport re\n\nfor line in open('sklearn_pandas/__init__.py'):\n    match = re.match(\"__version__ *= *'(.*)'\", line)\n    if match:\n        __version__, = match.groups()\n\n\nclass PyTest(TestCommand):\n    user_options = [('pytest-args=', 'a', \"Arguments to pass to py.test\")]\n\n    def initialize_options(self):\n        TestCommand.initialize_options(self)\n        self.pytest_args = []\n\n    def finalize_options(self):\n        TestCommand.finalize_options(self)\n        self.test_args = []\n        self.test_suite = True\n\n    def run(self):\n        import pytest\n        errno = pytest.main(self.pytest_args)\n        raise SystemExit(errno)\n\n\nsetup(name='sklearn-pandas',\n      version=__version__,\n      description='Pandas integration with sklearn',\n      maintainer='Ritesh Agrawal',\n      maintainer_email='ragrawal@gmail.com',\n      url='https://github.com/scikit-learn-contrib/sklearn-pandas',\n      packages=['sklearn_pandas'],\n      keywords=['scikit', 'sklearn', 'pandas'],\n      install_requires=[\n          'scikit-learn>=0.23.0',\n          'scipy>=1.5.1',\n          'pandas>=1.1.4',\n          'numpy>=1.18.1'\n      ],\n      tests_require=['pytest', 'mock'],\n      cmdclass={'test': PyTest},\n      license='MIT License'\n)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/noxfile.py",
      "content": "import nox\n\n@nox.session\ndef lint(session):\n    session.install('pytest>=5.3.5', 'setuptools>=45.2',\n                    'wheel>=0.34.2', 'flake8>=3.7.9',\n                    'numpy==1.18.1', 'pandas==1.1.4')\n    session.install('.')\n    session.run('flake8', 'sklearn_pandas/', 'tests')\n\n@nox.session\n@nox.parametrize('numpy', ['1.18.1', '1.19.4', '1.20.1'])\n@nox.parametrize('scipy', ['1.5.4', '1.6.0'])\n@nox.parametrize('pandas', ['1.1.4', '1.2.2'])\ndef tests(session, numpy, scipy, pandas):\n    session.install('pytest>=5.3.5', \n                    'setuptools>=45.2',\n                    'wheel>=0.34.2',\n                    f'numpy=={numpy}',\n                    f'scipy=={scipy}',\n                    f'pandas=={pandas}'\n                    )\n    session.install('.')\n    session.run('py.test', 'README.rst', 'tests')\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/features_generator.py",
      "content": "def gen_features(columns, classes=None, prefix='', suffix=''):\n    \"\"\"Generates a feature definition list which can be passed\n    into DataFrameMapper\n\n    Params:\n\n    columns     a list of column names to generate features for.\n\n    classes     a list of classes for each feature, a list of dictionaries with\n                transformer class and init parameters, or None.\n\n                If list of classes is provided, then each of them is\n                instantiated with default arguments. Example:\n\n                    classes = [StandardScaler, LabelBinarizer]\n\n                If list of dictionaries is provided, then each of them should\n                have a 'class' key with transformer class. All other keys are\n                passed into 'class' value constructor. Example:\n\n                    classes = [\n                        {'class': StandardScaler, 'with_mean': False},\n                        {'class': LabelBinarizer}\n                    }]\n\n                If None value selected, then each feature left as is.\n\n    prefix      add prefix to transformed column names\n\n    suffix      add suffix to transformed column names.\n\n    \"\"\"\n    if classes is None:\n        return [(column, None) for column in columns]\n\n    feature_defs = []\n\n    for column in columns:\n        feature_transformers = []\n\n        arguments = {}\n        if prefix and prefix != \"\":\n            arguments['prefix'] = prefix\n        if suffix and suffix != \"\":\n            arguments['suffix'] = suffix\n\n        classes = [cls for cls in classes if cls is not None]\n        if not classes:\n            feature_defs.append((column, None, arguments))\n\n        else:\n            for definition in classes:\n                if isinstance(definition, dict):\n                    params = definition.copy()\n                    klass = params.pop('class')\n                    feature_transformers.append(klass(**params))\n    
            else:\n                    feature_transformers.append(definition())\n\n            if not feature_transformers:\n                feature_transformers = None\n\n            feature_defs.append((column, feature_transformers, arguments))\n\n    return feature_defs\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/transformers.py",
      "content": "import numpy as np\nimport pandas as pd\nfrom sklearn.base import TransformerMixin\nimport warnings\n\n\ndef _get_mask(X, value):\n    \"\"\"\n    Compute the boolean mask X == missing_values.\n    \"\"\"\n    if value == \"NaN\" or \\\n       value is None or \\\n       (isinstance(value, float) and np.isnan(value)):\n        return pd.isnull(X)\n    else:\n        return X == value\n\n\nclass NumericalTransformer(TransformerMixin):\n    \"\"\"\n    Provides commonly used numerical transformers.\n    \"\"\"\n    SUPPORTED_FUNCTIONS = ['log', 'log1p']\n\n    def __init__(self, func):\n        \"\"\"\n        Params\n\n        func    function to apply to input columns. The function will be\n                applied to each value. Supported functions are defined\n                in SUPPORTED_FUNCTIONS variable. Throws assertion error if the\n                not supported.\n        \"\"\"\n\n        warnings.warn(\"\"\"\n            NumericalTransformer will be deprecated in 3.0 version.\n            Please use Sklearn.base.TransformerMixin to write\n            customer transformers\n            \"\"\", DeprecationWarning)\n\n        assert func in self.SUPPORTED_FUNCTIONS, \\\n            f\"Only following func are supported: {self.SUPPORTED_FUNCTIONS}\"\n        super(NumericalTransformer, self).__init__()\n        self.__func = func\n\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X, y=None):\n        if self.__func == 'log1p':\n            return np.vectorize(np.log1p)(X)\n        elif self.__func == 'log':\n            return np.vectorize(np.log)(X)\n\n        raise ValueError(f\"Invalid function name: {self.__func}\")\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/__init__.py",
      "content": "__version__ = '2.2.0'\n\nimport logging\nlogger = logging.getLogger(__name__)\n\nfrom .dataframe_mapper import DataFrameMapper  # NOQA\nfrom .features_generator import gen_features  # NOQA\nfrom .transformers import NumericalTransformer # NOQA\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/pipeline.py",
      "content": "import six\nfrom sklearn.pipeline import _name_estimators, Pipeline\nfrom sklearn.utils import tosequence\n\n\ndef _call_fit(fit_method, X, y=None, **kwargs):\n    \"\"\"\n    helper function, calls the fit or fit_transform method with the correct\n    number of parameters\n\n    fit_method: fit or fit_transform method of the transformer\n    X: the data to fit\n    y: the target vector relative to X, optional\n    kwargs: any keyword arguments to the fit method\n\n    return: the result of the fit or fit_transform method\n\n    WARNING: if this function raises a TypeError exception, test the fit\n    or fit_transform method passed to it in isolation as _call_fit will not\n    distinguish TypeError due to incorrect number of arguments from\n    other TypeError\n    \"\"\"\n    try:\n        return fit_method(X, y, **kwargs)\n    except TypeError:\n        # fit takes only one argument\n        return fit_method(X, **kwargs)\n\n\nclass TransformerPipeline(Pipeline):\n    \"\"\"\n    Pipeline that expects all steps to be transformers taking a single X\n    argument, an optional y argument, and having fit and transform methods.\n\n    Code is copied from sklearn's Pipeline\n    \"\"\"\n\n    def __init__(self, steps):\n        names, estimators = zip(*steps)\n        if len(dict(steps)) != len(steps):\n            raise ValueError(\n                \"Provided step names are not unique: %s\" % (names,))\n\n        # shallow copy of steps\n        self.steps = tosequence(steps)\n        estimator = estimators[-1]\n\n        for e in estimators:\n            if (not (hasattr(e, \"fit\") or hasattr(e, \"fit_transform\")) or not\n                    hasattr(e, \"transform\")):\n                raise TypeError(\"All steps of the chain should \"\n                                \"be transforms and implement fit and transform\"\n                                \" '%s' (type %s) doesn't)\" % (e, type(e)))\n\n        if not hasattr(estimator, \"fit\"):\n       
     raise TypeError(\"Last step of chain should implement fit \"\n                            \"'%s' (type %s) doesn't)\"\n                            % (estimator, type(estimator)))\n\n    def _pre_transform(self, X, y=None, **fit_params):\n        fit_params_steps = dict((step, {}) for step, _ in self.steps)\n        for pname, pval in six.iteritems(fit_params):\n            step, param = pname.split('__', 1)\n            fit_params_steps[step][param] = pval\n        Xt = X\n        for name, transform in self.steps[:-1]:\n            if hasattr(transform, \"fit_transform\"):\n                Xt = _call_fit(transform.fit_transform,\n                               Xt, y, **fit_params_steps[name])\n            else:\n                Xt = _call_fit(transform.fit,\n                               Xt, y, **fit_params_steps[name]).transform(Xt)\n        return Xt, fit_params_steps[self.steps[-1][0]]\n\n    def fit(self, X, y=None, **fit_params):\n        Xt, fit_params = self._pre_transform(X, y, **fit_params)\n        _call_fit(self.steps[-1][-1].fit, Xt, y, **fit_params)\n        return self\n\n    def fit_transform(self, X, y=None, **fit_params):\n        Xt, fit_params = self._pre_transform(X, y, **fit_params)\n        if hasattr(self.steps[-1][-1], 'fit_transform'):\n            return _call_fit(self.steps[-1][-1].fit_transform,\n                             Xt, y, **fit_params)\n        else:\n            return _call_fit(self.steps[-1][-1].fit,\n                             Xt, y, **fit_params).transform(Xt)\n\n\ndef make_transformer_pipeline(*steps):\n    \"\"\"Construct a TransformerPipeline from the given estimators.\n    \"\"\"\n    return TransformerPipeline(_name_estimators(steps))\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/dataframe_mapper.py",
      "content": "import contextlib\nfrom datetime import datetime\nimport pandas as pd\nimport numpy as np\nfrom scipy import sparse\nfrom sklearn.base import BaseEstimator, TransformerMixin\nfrom .cross_validation import DataWrapper\nfrom .pipeline import make_transformer_pipeline, _call_fit, TransformerPipeline\nfrom . import logger\n\nstring_types = text_type = str\n\n\ndef _handle_feature(fea):\n    \"\"\"\n    Convert 1-dimensional arrays to 2-dimensional column vectors.\n    \"\"\"\n    if len(fea.shape) == 1:\n        fea = np.array([fea]).T\n\n    return fea\n\n\ndef _build_transformer(transformers):\n    if isinstance(transformers, list):\n        transformers = make_transformer_pipeline(*transformers)\n    return transformers\n\n\ndef _build_feature(columns, transformers, options={}, X=None):\n    if X is None:\n        return (columns, _build_transformer(transformers), options)\n    return (\n        columns(X) if callable(columns) else columns,\n        _build_transformer(transformers),\n        options\n    )\n\n\ndef _elapsed_secs(t1):\n    return (datetime.now()-t1).total_seconds()\n\n\ndef _get_feature_names(estimator):\n    \"\"\"\n    Attempt to extract feature names based on a given estimator\n    \"\"\"\n    if hasattr(estimator, 'classes_'):\n        return estimator.classes_\n    elif hasattr(estimator, 'get_feature_names'):\n        return estimator.get_feature_names()\n    return None\n\n\n@contextlib.contextmanager\ndef add_column_names_to_exception(column_names):\n    # Stolen from https://stackoverflow.com/a/17677938/356729\n    try:\n        yield\n    except Exception as ex:\n        if ex.args:\n            msg = u'{}: {}'.format(column_names, ex.args[0])\n        else:\n            msg = text_type(column_names)\n        ex.args = (msg,) + ex.args[1:]\n        raise\n\n\nclass DataFrameMapper(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Map Pandas data frame column subsets to their own\n    sklearn transformation.\n    
\"\"\"\n\n    def __init__(self, features, default=False, sparse=False, df_out=False,\n                 input_df=False, drop_cols=None):\n        \"\"\"\n        Params:\n\n        features    a list of tuples with features definitions.\n                    The first element is the pandas column selector. This can\n                    be a string (for one column) or a list of strings.\n                    The second element is an object that supports\n                    sklearn's transform interface, or a list of such objects\n                    The third element is optional and, if present, must be\n                    a dictionary with the options to apply to the\n                    transformation. Example: {'alias': 'day_of_week'}\n\n        default     default transformer to apply to the columns not\n                    explicitly selected in the mapper. If False (default),\n                    discard them. If None, pass them through untouched. Any\n                    other transformer will be applied to all the unselected\n                    columns as a whole, taken as a 2d-array.\n\n        sparse      will return sparse matrix if set True and any of the\n                    extracted features is sparse. Defaults to False.\n\n        df_out      return a pandas data frame, with each column named using\n                    the pandas column that created it (if there's only one\n                    input and output) or the input columns joined with '_'\n                    if there's multiple inputs, and the name concatenated with\n                    '_1', '_2' etc if there's multiple outputs. NB: does not\n                    work if *default* or *sparse* are true\n\n        input_df    If ``True`` pass the selected columns to the transformers\n                    as a pandas DataFrame or Series. Otherwise pass them as a\n                    numpy array. Defaults to ``False``.\n\n        drop_cols   List of columns to be dropped. 
Defaults to None.\n\n        \"\"\"\n        self.features = features\n        self.default = default\n        self.built_default = None\n        self.sparse = sparse\n        self.df_out = df_out\n        self.input_df = input_df\n        self.drop_cols = [] if drop_cols is None else drop_cols\n        self.transformed_names_ = []\n        if (df_out and (sparse or default)):\n            raise ValueError(\"Can not use df_out with sparse or default\")\n\n    def _build(self, X=None):\n        \"\"\"\n        Build attributes built_features and built_default.\n        \"\"\"\n        if isinstance(self.features, list):\n            self.built_features = [\n                _build_feature(*f, X=X) for f in self.features\n            ]\n        else:\n            self.built_features = _build_feature(*self.features, X=X)\n        self.built_default = _build_transformer(self.default)\n\n    @property\n    def _selected_columns(self):\n        \"\"\"\n        Return a set of selected columns in the feature list.\n        \"\"\"\n        selected_columns = set()\n        for feature in self.features:\n            columns = feature[0]\n            if isinstance(columns, list):\n                selected_columns = selected_columns.union(set(columns))\n            else:\n                selected_columns.add(columns)\n        return selected_columns\n\n    def _unselected_columns(self, X):\n        \"\"\"\n        Return list of columns present in X and not selected explicitly in the\n        mapper.\n\n        Unselected columns are returned in the order they appear in the\n        dataframe to avoid issues with different ordering during default fit\n        and transform steps.\n        \"\"\"\n        X_columns = list(X.columns)\n        return [column for column in X_columns if\n                column not in self._selected_columns\n                and column not in self.drop_cols]\n\n    def __setstate__(self, state):\n        # compatibility for older versions of 
sklearn-pandas\n        super().__setstate__(state)\n        self.features = [_build_feature(*feat) for feat in state['features']]\n        self.sparse = state.get('sparse', False)\n        self.default = state.get('default', False)\n        self.df_out = state.get('df_out', False)\n        self.input_df = state.get('input_df', False)\n        self.drop_cols = state.get('drop_cols', [])\n        self.built_features = state.get('built_features', self.features)\n        self.built_default = state.get('built_default', self.default)\n        self.transformed_names_ = state.get('transformed_names_', [])\n\n    def __getstate__(self):\n        state = super().__getstate__()\n        state['features'] = self.features\n        state['sparse'] = self.sparse\n        state['default'] = self.default\n        state['df_out'] = self.df_out\n        state['input_df'] = self.input_df\n        state['drop_cols'] = self.drop_cols\n        state['build_features'] = getattr(self, 'built_features', None)\n        state['built_default'] = self.built_default\n        state['transformed_names_'] = self.transformed_names_\n        return state\n\n    def _get_col_subset(self, X, cols, input_df=False):\n        \"\"\"\n        Get a subset of columns from the given table X.\n\n        X       a Pandas dataframe; the table to select columns from\n        cols    a string or list of strings representing the columns to select.\n                It can also be a callable that returns True or False, i.e.\n                compatible with the built-in filter function.\n\n        Returns a numpy array with the data from the selected columns\n        \"\"\"\n\n        if isinstance(cols, string_types):\n            return_vector = True\n            cols = [cols]\n        else:\n            return_vector = False\n\n        # Needed when using the cross-validation compatibility\n        # layer for sklearn<0.16.0.\n        # Will be dropped on sklearn-pandas 2.0.\n        if isinstance(X, list):\n     
       X = [x[cols] for x in X]\n            X = pd.DataFrame(X)\n\n        elif isinstance(X, DataWrapper):\n            X = X.df  # fetch underlying data\n\n        if return_vector:\n            t = X[cols[0]]\n        else:\n            t = X[cols]\n\n        # return either a DataFrame/Series or a numpy array\n        if input_df:\n            return t\n        else:\n            return t.values\n\n    def fit(self, X, y=None):\n        \"\"\"\n        Fit a transformation from the pipeline\n\n        X       the data to fit\n\n        y       the target vector relative to X, optional\n\n        \"\"\"\n        self._build(X=X)\n\n        for columns, transformers, options in self.built_features:\n            t1 = datetime.now()\n            input_df = options.get('input_df', self.input_df)\n\n            if transformers is not None:\n                with add_column_names_to_exception(columns):\n                    Xt = self._get_col_subset(X, columns, input_df)\n                    _call_fit(transformers.fit, Xt, y)\n            logger.info(f\"[FIT] {columns}: {_elapsed_secs(t1)} secs\")\n\n        # handle features not explicitly selected\n        if self.built_default:  # not False and not None\n            unsel_cols = self._unselected_columns(X)\n            with add_column_names_to_exception(unsel_cols):\n                Xt = self._get_col_subset(X, unsel_cols, self.input_df)\n                _call_fit(self.built_default.fit, Xt, y)\n        return self\n\n    def get_names(self, columns, transformer, x, alias=None, prefix='',\n                  suffix=''):\n        \"\"\"\n        Return verbose names for the transformed columns.\n\n        columns       name (or list of names) of the original column(s)\n        transformer   transformer - can be a TransformerPipeline\n        x             transformed columns (numpy.ndarray)\n        alias         base name to use for the selected columns\n        \"\"\"\n        if alias is not None:\n            name 
= alias\n        elif isinstance(columns, list):\n            name = '_'.join(map(str, columns))\n        else:\n            name = columns\n        num_cols = x.shape[1] if len(x.shape) > 1 else 1\n\n        output = []\n\n        if num_cols > 1:\n            # If there are as many columns as classes in the transformer,\n            # infer column names from classes names.\n\n            # If we are dealing with multiple transformers for these columns\n            # attempt to extract the names from each of them, starting from the\n            # last one\n            if isinstance(transformer, TransformerPipeline):\n                inverse_steps = transformer.steps[::-1]\n                estimators = (estimator for name, estimator in inverse_steps)\n                names_steps = (_get_feature_names(e) for e in estimators)\n                names = next((n for n in names_steps if n is not None), None)\n            # Otherwise use the only estimator present\n            else:\n                names = _get_feature_names(transformer)\n\n            if names is not None and len(names) == num_cols:\n                output = [f\"{name}_{o}\" for o in names]\n                # otherwise, return name concatenated with '_1', '_2', etc.\n            else:\n                output = [name + '_' + str(o) for o in range(num_cols)]\n        else:\n            output = [name]\n\n        if prefix == suffix == \"\":\n            return output\n\n        return ['{}{}{}'.format(prefix, x, suffix) for x in output]\n\n    def get_dtypes(self, extracted):\n        dtypes_features = [self.get_dtype(ex) for ex in extracted]\n        return [dtype for dtype_feature in dtypes_features\n                for dtype in dtype_feature]\n\n    def get_dtype(self, ex):\n        if isinstance(ex, np.ndarray) or sparse.issparse(ex):\n            return [ex.dtype] * ex.shape[1]\n        elif isinstance(ex, pd.DataFrame):\n            return list(ex.dtypes)\n        else:\n            raise 
TypeError(type(ex))\n\n    def _transform(self, X, y=None, do_fit=False):\n        \"\"\"\n        Transform the given data with possibility to fit in advance.\n        Avoids code duplication for implementation of transform and\n        fit_transform.\n        \"\"\"\n        if do_fit:\n            self._build(X=X)\n\n        extracted = []\n        transformed_names_ = []\n        for columns, transformers, options in self.built_features:\n            input_df = options.get('input_df', self.input_df)\n\n            # columns could be a string or list of\n            # strings; we don't care because pandas\n            # will handle either.\n            Xt = self._get_col_subset(X, columns, input_df)\n\n            if transformers is not None:\n                with add_column_names_to_exception(columns):\n                    if do_fit and hasattr(transformers, 'fit_transform'):\n                        t1 = datetime.now()\n                        Xt = _call_fit(transformers.fit_transform, Xt, y)\n                        logger.info(f\"[FIT_TRANSFORM] {columns}: {_elapsed_secs(t1)} secs\")  # NOQA\n                    else:\n                        if do_fit:\n                            t1 = datetime.now()\n                            _call_fit(transformers.fit, Xt, y)\n                            logger.info(\n                                f\"[FIT] {columns}: {_elapsed_secs(t1)} secs\")\n\n                        t1 = datetime.now()\n                        Xt = transformers.transform(Xt)\n                        logger.info(f\"[TRANSFORM] {columns}: {_elapsed_secs(t1)} secs\")  # NOQA\n\n            extracted.append(_handle_feature(Xt))\n\n            alias = options.get('alias')\n\n            prefix = options.get('prefix', '')\n            suffix = options.get('suffix', '')\n\n            transformed_names_ += self.get_names(\n                columns, transformers, Xt, alias, prefix, suffix)\n\n        # handle features not explicitly selected\n        if 
self.built_default is not False:\n            unsel_cols = self._unselected_columns(X)\n            Xt = self._get_col_subset(X, unsel_cols, self.input_df)\n            if self.built_default is not None:\n                with add_column_names_to_exception(unsel_cols):\n                    if do_fit and hasattr(self.built_default, 'fit_transform'):\n                        Xt = _call_fit(self.built_default.fit_transform, Xt, y)\n                    else:\n                        if do_fit:\n                            _call_fit(self.built_default.fit, Xt, y)\n                        Xt = self.built_default.transform(Xt)\n                transformed_names_ += self.get_names(\n                    unsel_cols, self.built_default, Xt)\n            else:\n                # if not applying a default transformer,\n                # keep column names unmodified\n                transformed_names_ += unsel_cols\n\n            extracted.append(_handle_feature(Xt))\n\n        self.transformed_names_ = transformed_names_\n\n        # combine the feature outputs into one array.\n        # at this point we lose track of which features\n        # were created from which input columns, so it's\n        # assumed that that doesn't matter to the model.\n\n        # If any of the extracted features is sparse, combine sparsely.\n        # Otherwise, combine as normal arrays.\n        if any(sparse.issparse(fea) for fea in extracted):\n            stacked = sparse.hstack(extracted).tocsr()\n            # return a sparse matrix only if the mapper was initialized\n            # with sparse=True\n            if not self.sparse:\n                stacked = stacked.toarray()\n        else:\n            stacked = np.hstack(extracted)\n\n        if self.df_out:\n            # if no rows were dropped preserve the original index,\n            # otherwise use a new integer one\n            no_rows_dropped = len(X) == len(stacked)\n            if no_rows_dropped:\n                index = X.index\n   
         else:\n                index = None\n\n            # output different data types, if appropriate\n            dtypes = self.get_dtypes(extracted)\n            df_out = pd.DataFrame(\n                stacked,\n                columns=self.transformed_names_,\n                index=index)\n            # preserve types\n            for col, dtype in zip(self.transformed_names_, dtypes):\n                df_out[col] = df_out[col].astype(dtype)\n            return df_out\n        else:\n            return stacked\n\n    def transform(self, X):\n        \"\"\"\n        Transform the given data. Assumes that fit has already been called.\n\n        X       the data to transform\n        \"\"\"\n        return self._transform(X)\n\n    def fit_transform(self, X, y=None):\n        \"\"\"\n        Fit a transformation from the pipeline and directly apply\n        it to the given data.\n\n        X       the data to fit\n\n        y       the target vector relative to X, optional\n        \"\"\"\n        return self._transform(X, y, True)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/cross_validation.py",
      "content": "class DataWrapper(object):\n\n    def __init__(self, df):\n        self.df = df\n\n    def __len__(self):\n        return len(self.df)\n\n    def __getitem__(self, key):\n        return self.df.iloc[key]\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_pipeline.py",
      "content": "import pytest\nfrom sklearn_pandas.pipeline import TransformerPipeline, _call_fit\n\n# In py3, mock is included with the unittest standard library\n# In py2, it's a separate package\ntry:\n    from unittest.mock import patch\nexcept ImportError:\n    from mock import patch\n\n\nclass NoTransformT(object):\n    \"\"\"Transformer without transform method.\n    \"\"\"\n    def fit(self, x):\n        return self\n\n\nclass NoFitT(object):\n    \"\"\"Transformer without fit method.\n    \"\"\"\n    def transform(self, x):\n        return self\n\n\nclass Trans(object):\n    \"\"\"\n    Transformer with fit and transform methods\n    \"\"\"\n    def fit(self, x, y=None):\n        return self\n\n    def transform(self, x):\n        return self\n\n\ndef func_x_y(x, y, kwarg='kwarg'):\n    \"\"\"\n    Function with required x and y arguments\n    \"\"\"\n    return\n\n\ndef func_x(x, kwarg='kwarg'):\n    \"\"\"\n    Function with required x argument\n    \"\"\"\n    return\n\n\ndef func_raise_type_err(x, y, kwarg='kwarg'):\n    \"\"\"\n    Function with required x and y arguments,\n    raises TypeError\n    \"\"\"\n    raise TypeError\n\n\ndef test_all_steps_fit_transform():\n    \"\"\"\n    All steps must implement fit and transform. 
Otherwise, raise TypeError.\n    \"\"\"\n    with pytest.raises(TypeError):\n        TransformerPipeline([('svc', NoTransformT())])\n\n    with pytest.raises(TypeError):\n        TransformerPipeline([('svc', NoFitT())])\n\n\n@patch.object(Trans, 'fit', side_effect=func_x_y)\ndef test_called_with_x_and_y(mock_fit):\n    \"\"\"\n    Fit method with required X and y arguments is called with both and with\n    any additional keywords\n    \"\"\"\n    _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n    mock_fit.assert_called_with('X', 'y', kwarg='kwarg')\n\n\n@patch.object(Trans, 'fit', side_effect=func_x)\ndef test_called_with_x(mock_fit):\n    \"\"\"\n    Fit method with a required X argument is called with it and with\n    any additional keywords\n    \"\"\"\n    _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n    mock_fit.assert_called_with('X', kwarg='kwarg')\n\n    _call_fit(Trans().fit, 'X', kwarg='kwarg')\n    mock_fit.assert_called_with('X', kwarg='kwarg')\n\n\n@patch.object(Trans, 'fit', side_effect=func_raise_type_err)\ndef test_raises_type_error(mock_fit):\n    \"\"\"\n    If a fit method with required X and y arguments raises a TypeError, it's\n    re-raised (for a different reason) when it's called with one argument\n    \"\"\"\n    with pytest.raises(TypeError):\n        _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py",
      "content": "import tempfile\nimport pytest\nimport numpy as np\nfrom pandas import DataFrame\nimport joblib\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas import NumericalTransformer\n\n\n@pytest.fixture\ndef simple_dataset():\n    return DataFrame({\n        'feat1': [1, 2, 1, 3, 1],\n        'feat2': [1, 2, 2, 2, 3],\n        'feat3': [1, 2, 3, 4, 5],\n    })\n\n\ndef test_common_numerical_transformer(simple_dataset):\n    \"\"\"\n    Test log transformation\n    \"\"\"\n    transfomer = DataFrameMapper([\n        ('feat1', NumericalTransformer('log'))\n    ], df_out=True)\n    df = simple_dataset\n    outDF = transfomer.fit_transform(df)\n    assert list(outDF.columns) == ['feat1']\n    assert np.array_equal(df['feat1'].apply(np.log).values, outDF.feat1.values)\n\n\ndef test_numerical_transformer_serialization(simple_dataset):\n    \"\"\"\n    Test if you can serialize transformer\n    \"\"\"\n    transfomer = DataFrameMapper([\n        ('feat1', NumericalTransformer('log'))\n    ])\n\n    df = simple_dataset\n    transfomer.fit(df)\n    f = tempfile.NamedTemporaryFile(delete=True)\n    joblib.dump(transfomer, f.name)\n    transfomer2 = joblib.load(f.name)\n    np.array_equal(transfomer.transform(df), transfomer2.transform(df))\n    f.close()\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
      "content": "# -*- coding: utf8 -*-\n\nimport pytest\nfrom unittest.mock import Mock\nfrom pandas import DataFrame\nimport pandas as pd\nfrom scipy import sparse\nfrom sklearn.datasets import load_iris\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.model_selection import cross_val_score\nfrom sklearn.svm import SVC\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom sklearn.feature_extraction import DictVectorizer\nfrom sklearn.preprocessing import (\n    StandardScaler, OneHotEncoder, LabelBinarizer)\nfrom sklearn.impute import SimpleImputer as Imputer\nfrom sklearn.feature_selection import SelectKBest, chi2\nfrom sklearn.base import BaseEstimator, TransformerMixin\nimport sklearn.decomposition\nimport numpy as np\nfrom numpy.testing import assert_array_equal\nimport pickle\nfrom sklearn.compose import make_column_selector\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas.dataframe_mapper import _handle_feature, _build_transformer\nfrom sklearn_pandas.pipeline import TransformerPipeline\n\n\nclass MockXTransformer(object):\n    \"\"\"\n    Mock transformer that accepts no y argument.\n    \"\"\"\n    def fit(self, X):\n        return self\n\n    def transform(self, X):\n        return X\n\n\nclass MockTClassifier(object):\n    \"\"\"\n    Mock transformer/classifier.\n    \"\"\"\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        return X\n\n    def predict(self, X):\n        return True\n\n\nclass DateEncoder():\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        dt = X.dt\n        return pd.concat([dt.year, dt.month, dt.day], axis=1)\n\n\nclass ToSparseTransformer(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Transforms numpy matrix to sparse format.\n    \"\"\"\n    def fit(self, X):\n        return self\n\n    def transform(self, X):\n        return sparse.csr_matrix(X)\n\n\nclass CustomTransformer(BaseEstimator, TransformerMixin):\n  
  \"\"\"\n    Example of transformer in which the number of classes\n    is not equals to the number of output columns.\n    \"\"\"\n    def fit(self, X, y=None):\n        self.min = X.min()\n        self.classes_ = np.unique(X)\n        return self\n\n    def transform(self, X):\n        classes = np.unique(X)\n        if len(np.setdiff1d(classes, self.classes_)) > 0:\n            raise ValueError('Unknown values found.')\n        return X - self.min\n\n\nclass MockImageTransformer(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Example transformer that takes the max of a 2d vector\n    then scales the result.\n    \"\"\"\n    def __init__(self, multiplier=10.0):\n        self.multiplier = multiplier\n\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        assert isinstance(X, pd.DataFrame)\n        for col in X.columns:\n            X[col] = X[col].map(lambda img: np.max(img))\n        return X * self.multiplier\n\n\n@pytest.fixture\ndef simple_dataframe():\n    return pd.DataFrame({'a': [1, 2, 3]})\n\n\n@pytest.fixture\ndef complex_dataframe():\n    return pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'c'],\n                         'feat1': [1, 2, 3, 4, 5, 6],\n                         'feat2': [1, 2, 3, 2, 3, 4]})\n\n\n@pytest.fixture\ndef complex_object_dataframe():\n    return pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'c'],\n                         'feat1': [1, 2, 3, 4, 5, 6],\n                         'feat2': [1, 2, 3, 2, 3, 4],\n                         'img2d': [1*np.eye(2), 2*np.eye(2), 3*np.eye(2),\n                                   4*np.eye(2), 5*np.eye(2), 6*np.eye(2)]})\n\n\n@pytest.fixture\ndef multiindex_dataframe():\n    \"\"\"Example MultiIndex DataFrame, taken from pandas documentation\n    \"\"\"\n    iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]\n    index = pd.MultiIndex.from_product(iterables, names=['first', 'second'])\n    df = pd.DataFrame(np.random.randn(10, 8), 
columns=index)\n    return df\n\n\n@pytest.fixture\ndef multiindex_dataframe_incomplete(multiindex_dataframe):\n    \"\"\"Example MultiIndex DataFrame with missing entries\n    \"\"\"\n    df = multiindex_dataframe\n    mask_array = np.zeros(df.size)\n    mask_array[:20] = 1\n    np.random.shuffle(mask_array)\n    mask = mask_array.reshape(df.shape).astype(bool)\n    df.mask(mask, inplace=True)\n    return df\n\n\ndef test_transformed_names_simple(simple_dataframe):\n    \"\"\"\n    Get transformed names of features in `transformed_names` attribute\n    for simple transformation\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None)])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_transformed_names_binarizer(complex_dataframe):\n    \"\"\"\n    Get transformed names of features in `transformed_names` attribute\n    for a transformation that multiplies the number of columns\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['target_a', 'target_b', 'target_c']\n\n\ndef test_logging(caplog, complex_dataframe):\n    \"\"\"\n    With the `sklearn_pandas` logger at INFO level, fit_transform logs\n    a `[FIT_TRANSFORM]` message naming the column being transformed\n    \"\"\"\n    import logging\n    logger = logging.getLogger('sklearn_pandas')\n    logger.setLevel(logging.INFO)\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    assert '[FIT_TRANSFORM] target:' in caplog.text\n\n\ndef test_transformed_names_binarizer_unicode():\n    df = pd.DataFrame({'target': [u'ñ', u'á', u'é']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    expected_names = {u'target_ñ', u'target_á', u'target_é'}\n    assert set(mapper.transformed_names_) == expected_names\n\n\ndef 
test_transformed_names_transformers_list(complex_dataframe):\n    \"\"\"\n    When using a list of transformers, use them in inverse order to get the\n    transformed names\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([\n        ('target', [LabelBinarizer(), MockXTransformer()])\n    ])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['target_a', 'target_b', 'target_c']\n\n\ndef test_transformed_names_simple_alias(simple_dataframe):\n    \"\"\"\n    If we specify an alias for a single output column, it is used for the\n    output\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None, {'alias': 'new_name'})])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['new_name']\n\n\ndef test_transformed_names_complex_alias(complex_dataframe):\n    \"\"\"\n    If we specify an alias for a multiple output column, it is used for the\n    output\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer(), {'alias': 'new'})])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['new_a', 'new_b', 'new_c']\n\n\ndef test_exception_column_context_transform(simple_dataframe):\n    \"\"\"\n    If an exception is raised when transforming a column,\n    the exception includes the name of the column being transformed\n    \"\"\"\n    class FailingTransformer(object):\n        def fit(self, X):\n            pass\n\n        def transform(self, X):\n            raise Exception('Some exception')\n\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', FailingTransformer())])\n    mapper.fit(df)\n\n    with pytest.raises(Exception, match='a: Some exception'):\n        mapper.transform(df)\n\n\ndef test_exception_column_context_fit(simple_dataframe):\n    \"\"\"\n    If an exception is raised when fit a column,\n    the exception includes the name of the column being fitted\n    \"\"\"\n    class FailingFitter(object):\n        def 
fit(self, X):\n            raise Exception('Some exception')\n\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', FailingFitter())])\n\n    with pytest.raises(Exception, match='a: Some exception'):\n        mapper.fit(df)\n\n\ndef test_simple_df(simple_dataframe):\n    \"\"\"\n    Get a dataframe from a simple mapped dataframe\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None)], df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert type(transformed) == pd.DataFrame\n    assert len(transformed[\"a\"]) == len(simple_dataframe[\"a\"])\n\n\ndef test_complex_df(complex_dataframe):\n    \"\"\"\n    Get a dataframe from a complex mapped dataframe\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [('target', None), ('feat1', None), ('feat2', None)],\n        df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_dataframe)\n    for c in df.columns:\n        assert len(transformed[c]) == len(df[c])\n\n\ndef test_complex_object_df(complex_object_dataframe):\n    \"\"\"\n    Get a dataframe from a complex dataframe with 2d features\n    \"\"\"\n    df = complex_object_dataframe\n    img_scale = 10\n    mapper = DataFrameMapper(\n        [('target', None), ('feat1', None),\n         (make_column_selector('feat2'), StandardScaler()),\n         (make_column_selector('img2d'), MockImageTransformer(img_scale))],\n        df_out=True, input_df=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_object_dataframe)\n    assert np.isclose(\n        np.sum(transformed['img2d']),\n        np.max(np.sum(df['img2d'])) * img_scale, atol=1e-12)\n\n\ndef test_numeric_column_names(complex_dataframe):\n    \"\"\"\n    Get a dataframe from a complex mapped dataframe with numeric column names\n    \"\"\"\n    df = complex_dataframe\n    df.columns = [0, 1, 2]\n    mapper = DataFrameMapper(\n        [(0, None), (1, None), 
(2, None)], df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_dataframe)\n    for c in df.columns:\n        assert len(transformed[c]) == len(df[c])\n\n\ndef test_multiindex_df(multiindex_dataframe_incomplete):\n    \"\"\"\n    Get a dataframe from a multiindex dataframe with missing data\n    \"\"\"\n    df = multiindex_dataframe_incomplete\n    mapper = DataFrameMapper([([c], Imputer()) for c in df.columns],\n                             df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(multiindex_dataframe_incomplete)\n    for c in df.columns:\n        assert len(transformed[str(c)]) == len(df[c])\n\n\ndef test_binarizer_df():\n    \"\"\"\n    Check level names from LabelBinarizer\n    \"\"\"\n    df = pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'a']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 3\n    assert cols[0] == 'target_a'\n    assert cols[1] == 'target_b'\n    assert cols[2] == 'target_c'\n\n\ndef test_binarizer_int_df():\n    \"\"\"\n    Check level names from LabelBinarizer for a numeric array.\n    \"\"\"\n    df = pd.DataFrame({'target': [5, 5, 6, 6, 7, 5]})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 3\n    assert cols[0] == 'target_5'\n    assert cols[1] == 'target_6'\n    assert cols[2] == 'target_7'\n\n\ndef test_binarizer2_df():\n    \"\"\"\n    Check level names from LabelBinarizer with just one output column\n    \"\"\"\n    df = pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'a']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 1\n    assert cols[0] == 
'target'\n\n\ndef test_onehot_df():\n    \"\"\"\n    Check level ids from one-hot\n    \"\"\"\n    df = pd.DataFrame({'target': [0, 0, 1, 1, 2, 3, 0]})\n    mapper = DataFrameMapper([(['target'], OneHotEncoder())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 4\n    assert cols[0] == 'target_0'\n    assert cols[3] == 'target_3'\n\n\ndef test_customtransform_df():\n    \"\"\"\n    Check level ids from a transformer in which\n    the number of classes is not equals to the number of output columns.\n    \"\"\"\n    df = pd.DataFrame({'target': [6, 5, 7, 5, 4, 8, 8]})\n    mapper = DataFrameMapper([(['target'], CustomTransformer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(mapper.features[0][1].classes_) == 5\n    assert len(cols) == 1\n    assert cols[0] == 'target'\n\n\ndef test_preserve_df_index():\n    \"\"\"\n    The index is preserved when df_out=True\n    \"\"\"\n    df = pd.DataFrame({'target': [1, 2, 3]},\n                      index=['a', 'b', 'c'])\n    mapper = DataFrameMapper([('target', None)],\n                             df_out=True)\n\n    transformed = mapper.fit_transform(df)\n\n    assert_array_equal(transformed.index, df.index)\n\n\ndef test_preserve_df_index_rows_dropped():\n    \"\"\"\n    If df_out=True but the original df index length doesn't\n    match the number of final rows, use a numeric index\n    \"\"\"\n    class DropLastRowTransformer(object):\n        def fit(self, X):\n            return self\n\n        def transform(self, X):\n            return X[:-1]\n\n    df = pd.DataFrame({'target': [1, 2, 3]},\n                      index=['a', 'b', 'c'])\n    mapper = DataFrameMapper([('target', DropLastRowTransformer())],\n                             df_out=True)\n\n    transformed = mapper.fit_transform(df)\n\n    assert_array_equal(transformed.index, np.array([0, 1]))\n\n\ndef 
test_pca(complex_dataframe):\n    \"\"\"\n    Check multi in and out with PCA\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [(['feat1', 'feat2'], sklearn.decomposition.PCA(2))],\n        df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 2\n    assert cols[0] == 'feat1_feat2_0'\n    assert cols[1] == 'feat1_feat2_1'\n\n\ndef test_fit_transform(simple_dataframe):\n    \"\"\"\n    Check that custom fit_transform methods of the transformers are invoked.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    # return something of measurable length but does nothing\n    mock_transformer.fit_transform.return_value = np.array([1, 2, 3])\n    mapper = DataFrameMapper([(\"a\", mock_transformer)])\n    mapper.fit_transform(df)\n    assert mock_transformer.fit_transform.called\n\n\ndef test_fit_transform_equiv_mock(simple_dataframe):\n    \"\"\"\n    Check for equivalent results for code paths fit_transform\n    versus fit and transform in DataFrameMapper using the mock\n    transformer which does not implement a custom fit_transform.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', MockXTransformer())])\n    transformed_combined = mapper.fit_transform(df)\n    transformed_separate = mapper.fit(df).transform(df)\n    assert np.all(transformed_combined == transformed_separate)\n\n\ndef test_fit_transform_equiv_pca(complex_dataframe):\n    \"\"\"\n    Check for equivalent results for code paths fit_transform\n    versus fit and transform in DataFrameMapper and transformer\n    using PCA which implements a custom fit_transform. 
The\n    equivalence of both paths in the transformer only can be\n    asserted since this is tested in the sklearn tests\n    scikit-learn/sklearn/decomposition/tests/test_pca.py\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [(['feat1', 'feat2'], sklearn.decomposition.PCA(2))],\n        df_out=True)\n    transformed_combined = mapper.fit_transform(df)\n    transformed_separate = mapper.fit(df).transform(df)\n    assert np.allclose(transformed_combined, transformed_separate)\n\n\ndef test_input_df_true_first_transformer(simple_dataframe, monkeypatch):\n    \"\"\"\n    If input_df is True, the first transformer is passed\n    a pd.Series instead of an np.array\n    \"\"\"\n    df = simple_dataframe\n    monkeypatch.setattr(MockXTransformer, 'fit', Mock())\n    monkeypatch.setattr(MockXTransformer, 'transform',\n                        Mock(return_value=np.array([1, 2, 3])))\n    mapper = DataFrameMapper([\n        ('a', MockXTransformer())\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n\n    args, _ = MockXTransformer().fit.call_args\n    assert isinstance(args[0], pd.Series)\n\n    args, _ = MockXTransformer().transform.call_args\n    assert isinstance(args[0], pd.Series)\n\n    assert_array_equal(out, np.array([1, 2, 3]).reshape(-1, 1))\n\n\ndef test_input_df_true_next_transformers(simple_dataframe, monkeypatch):\n    \"\"\"\n    If input_df is True, the subsequent transformers get passed pandas\n    objects instead of numpy arrays (given the previous transformers\n    output pandas objects as well)\n    \"\"\"\n    df = simple_dataframe\n    monkeypatch.setattr(MockTClassifier, 'fit', Mock())\n    monkeypatch.setattr(MockTClassifier, 'transform',\n                        Mock(return_value=pd.Series([1, 2, 3])))\n    mapper = DataFrameMapper([\n        ('a', [MockXTransformer(), MockTClassifier()])\n    ], input_df=True)\n    mapper.fit(df)\n    out = mapper.transform(df)\n\n    args, _ = 
MockTClassifier().fit.call_args\n    assert isinstance(args[0], pd.Series)\n\n    assert_array_equal(out, np.array([1, 2, 3]).reshape(-1, 1))\n\n\ndef test_input_df_true_multiple_cols(complex_dataframe):\n    \"\"\"\n    When input_df is True, applying transformers to multiple columns\n    works as expected\n    \"\"\"\n    df = complex_dataframe\n\n    mapper = DataFrameMapper([\n        ('target', MockXTransformer()),\n        ('feat1',  MockXTransformer()),\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n\n    assert_array_equal(out[:, 0], df['target'].values)\n    assert_array_equal(out[:, 1], df['feat1'].values)\n\n\ndef test_input_df_date_encoder():\n    \"\"\"\n    When input_df is True we can apply a transformer that only works\n    with pandas dataframes like a DateEncoder\n    \"\"\"\n    df = pd.DataFrame(\n        {'dates': pd.date_range('2015-10-30', '2015-11-02')})\n    mapper = DataFrameMapper([\n        ('dates', DateEncoder())\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n    expected = np.array([\n        [2015, 10, 30],\n        [2015, 10, 31],\n        [2015, 11, 1],\n        [2015, 11, 2]\n    ])\n    assert_array_equal(out, expected)\n\n\ndef test_local_input_df_date_encoder():\n    \"\"\"\n    When input_df is True we can apply a transformer that only works\n    with pandas dataframes like a DateEncoder\n    \"\"\"\n    df = pd.DataFrame(\n        {'dates': pd.date_range('2015-10-30', '2015-11-02')})\n    mapper = DataFrameMapper([\n        ('dates', DateEncoder(), {'input_df': True})\n    ], input_df=False)\n    out = mapper.fit_transform(df)\n    expected = np.array([\n        [2015, 10, 30],\n        [2015, 10, 31],\n        [2015, 11, 1],\n        [2015, 11, 2]\n    ])\n    assert_array_equal(out, expected)\n\n\ndef test_nonexistent_columns_explicit_fail(simple_dataframe):\n    \"\"\"\n    If a nonexistent column is selected, KeyError is raised.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    with 
pytest.raises(KeyError):\n        mapper._get_col_subset(simple_dataframe, ['nonexistent_feature'])\n\n\ndef test_get_col_subset_single_column_array(simple_dataframe):\n    \"\"\"\n    Selecting a single column should return a 1-dimensional numpy array.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    array = mapper._get_col_subset(simple_dataframe, \"a\")\n\n    assert type(array) == np.ndarray\n    assert array.shape == (len(simple_dataframe[\"a\"]),)\n\n\ndef test_get_col_subset_single_column_list(simple_dataframe):\n    \"\"\"\n    Selecting a list of columns (even if the list contains a single element)\n    should return a 2-dimensional numpy array.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    array = mapper._get_col_subset(simple_dataframe, [\"a\"])\n\n    assert type(array) == np.ndarray\n    assert array.shape == (len(simple_dataframe[\"a\"]), 1)\n\n\ndef test_cols_string_array(simple_dataframe):\n    \"\"\"\n    If a string is specified as the columns, the transformer\n    is called with a 1-d array as input.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    mapper = DataFrameMapper([(\"a\", mock_transformer)])\n\n    mapper.fit(df)\n    args, kwargs = mock_transformer.fit.call_args\n    assert args[0].shape == (3,)\n\n\ndef test_cols_list_column_vector(simple_dataframe):\n    \"\"\"\n    If a one-element list is specified as the columns, the transformer\n    is called with a column vector as input.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    mapper = DataFrameMapper([([\"a\"], mock_transformer)])\n\n    mapper.fit(df)\n    args, kwargs = mock_transformer.fit.call_args\n    assert args[0].shape == (3, 1)\n\n\ndef test_handle_feature_2dim():\n    \"\"\"\n    2-dimensional arrays are returned unchanged.\n    \"\"\"\n    array = np.array([[1, 2], [3, 4]])\n    assert_array_equal(_handle_feature(array), array)\n\n\ndef test_handle_feature_1dim():\n    \"\"\"\n    1-dimensional arrays are 
converted to 2-dimensional column vectors.\n    \"\"\"\n    array = np.array([1, 2])\n    assert_array_equal(_handle_feature(array), np.array([[1], [2]]))\n\n\ndef test_build_transformers():\n    \"\"\"\n    When a list of transformers is passed, return a pipeline with\n    each element of the iterable as a step of the pipeline.\n    \"\"\"\n    transformers = [MockTClassifier(), MockTClassifier()]\n    pipeline = _build_transformer(transformers)\n    assert isinstance(pipeline, Pipeline)\n    for ix, transformer in enumerate(transformers):\n        assert pipeline.steps[ix][1] == transformer\n\n\ndef test_selected_columns():\n    \"\"\"\n    selected_columns returns a set of the columns appearing in the features\n    of the mapper.\n    \"\"\"\n    mapper = DataFrameMapper([\n        ('a', None),\n        (['a', 'b'], None)\n    ])\n    assert mapper._selected_columns == {'a', 'b'}\n\n\ndef test_unselected_columns():\n    \"\"\"\n    unselected_columns returns a list of the columns not appearing in the\n    features of the mapper but present in the given dataframe.\n    \"\"\"\n    df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})\n    mapper = DataFrameMapper([\n        ('a', None),\n        (['a', 'b'], None)\n    ])\n    assert 'c' in mapper._unselected_columns(df)\n\n\ndef test_drop_and_default_false():\n    \"\"\"\n    If default=False, non explicitly selected columns and drop columns\n    are discarded.\n    \"\"\"\n    df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})\n    mapper = DataFrameMapper([\n            ('a', None)\n        ], drop_cols=['c'], default=False)\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (1, 1)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_drop_and_default_none():\n    \"\"\"\n    If default=None, drop columns are discarded and\n    remaining non explicitly selected columns are passed through untransformed\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = 
DataFrameMapper([\n        ('a', None)\n    ], drop_cols=['c'], default=None)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 2)\n    assert mapper.transformed_names_ == ['a', 'b']\n\n\ndef test_conflicting_drop():\n    \"\"\"\n    Drop column name shouldn't get confused with transformed columns.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        ('a', None)\n    ], drop_cols=['a'], default=False)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 1)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_default_false():\n    \"\"\"\n    If default=False, non explicitly selected columns are discarded.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        ('b', None)\n    ], default=False)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 1)\n\n\ndef test_default_none():\n    \"\"\"\n    If default=None, non explicitly selected columns are passed through\n    untransformed.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        (['a'], OneHotEncoder())\n    ], default=None)\n\n    transformed = mapper.fit_transform(df)\n    assert (transformed[:, 3] == np.array([3, 5, 7]).T).all()\n\n\ndef test_default_none_names():\n    \"\"\"\n    If default=None, column names are returned unmodified.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([], default=None)\n\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['a', 'b']\n\n\ndef test_default_transformer():\n    \"\"\"\n    If default=Transformer, non explicitly selected columns are applied this\n    transformer.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, np.nan, 3], })\n    mapper = DataFrameMapper([], default=Imputer())\n\n    transformed = mapper.fit_transform(df)\n    assert 
(transformed[:, 0] == np.array([1., 2., 3.])).all()\n\n\ndef test_list_transformers_single_arg(simple_dataframe):\n    \"\"\"\n    Multiple transformers can be specified in a list even if some of them\n    only accept one X argument instead of two (X, y).\n    \"\"\"\n    mapper = DataFrameMapper([\n        ('a', [MockXTransformer()])\n    ])\n    # doesn't fail\n    mapper.fit_transform(simple_dataframe)\n\n\ndef test_list_transformers():\n    \"\"\"\n    Specifying a list of transformers applies them sequentially to the\n    selected column.\n    \"\"\"\n    dataframe = pd.DataFrame({\"a\": [1, np.nan, 3], \"b\": [1, 5, 7]},\n                             dtype=np.float64)\n\n    mapper = DataFrameMapper([\n        ([\"a\"], [Imputer(), StandardScaler()]),\n        ([\"b\"], StandardScaler()),\n    ])\n    dmatrix = mapper.fit_transform(dataframe)\n\n    assert pd.isnull(dmatrix).sum() == 0  # no null values\n\n    # all features have mean 0 and std deviation 1 (standardized)\n    assert (abs(dmatrix.mean(axis=0) - 0) <= 1e-6).all()\n    assert (abs(dmatrix.std(axis=0) - 1) <= 1e-6).all()\n\n\ndef test_list_transformers_old_unpickle(simple_dataframe):\n    mapper = DataFrameMapper(None)\n    # simulate the mapper was created with < 1.0.0 code\n    mapper.features = [('a', [MockXTransformer()])]\n    mapper_pickled = pickle.dumps(mapper)\n\n    loaded_mapper = pickle.loads(mapper_pickled)\n    transformer = loaded_mapper.features[0][1]\n    assert isinstance(transformer, TransformerPipeline)\n    assert isinstance(transformer.steps[0][1], MockXTransformer)\n\n\ndef test_sparse_features(simple_dataframe):\n    \"\"\"\n    If any of the extracted features is sparse and \"sparse\" argument\n    is true, the hstacked result is also sparse.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([\n        (\"a\", ToSparseTransformer())\n    ], sparse=True)\n    dmatrix = mapper.fit_transform(df)\n\n    assert type(dmatrix) == sparse.csr.csr_matrix\n\n\ndef 
test_sparse_off(simple_dataframe):\n    \"\"\"\n    If the resulting features are sparse but the \"sparse\" argument\n    of the mapper is False, return a non-sparse matrix.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([\n        (\"a\", ToSparseTransformer())\n    ], sparse=False)\n\n    dmatrix = mapper.fit_transform(df)\n    assert type(dmatrix) != sparse.csr.csr_matrix\n\n\ndef test_fit_with_optional_y_arg(complex_dataframe):\n    \"\"\"\n    Transformers with an optional y argument in the fit method\n    are handled correctly\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([(['feat1', 'feat2'], MockTClassifier())])\n    # doesn't fail\n    mapper.fit(df[['feat1', 'feat2']], df['target'])\n\n\ndef test_fit_with_required_y_arg(complex_dataframe):\n    \"\"\"\n    Transformers with a required y argument in the fit method\n    are handled and perform correctly\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([(['feat1', 'feat2'], SelectKBest(chi2, k=1))])\n\n    # fit, doesn't fail\n    ft_arr = mapper.fit(df[['feat1', 'feat2']], df['target'])\n\n    # fit_transform\n    ft_arr = mapper.fit_transform(df[['feat1', 'feat2']], df['target'])\n    assert_array_equal(ft_arr, df[['feat1']].values)\n\n    # transform\n    t_arr = mapper.transform(df[['feat1', 'feat2']])\n    assert_array_equal(t_arr, df[['feat1']].values)\n\n\n# Integration tests with real dataframes\n\n@pytest.fixture\ndef iris_dataframe():\n    iris = load_iris()\n    return DataFrame(\n        data={\n            iris.feature_names[0]: iris.data[:, 0],\n            iris.feature_names[1]: iris.data[:, 1],\n            iris.feature_names[2]: iris.data[:, 2],\n            iris.feature_names[3]: iris.data[:, 3],\n            \"species\": np.array([iris.target_names[e] for e in iris.target])\n        }\n    )\n\n\n@pytest.fixture\ndef cars_dataframe():\n    return pd.read_csv(\"tests/test_data/cars.csv.gz\", compression='gzip')\n\n\ndef 
test_with_iris_dataframe(iris_dataframe):\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"petal length (cm)\", None),\n            (\"petal width (cm)\", None),\n            (\"sepal length (cm)\", None),\n            (\"sepal width (cm)\", None),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = iris_dataframe.drop(\"species\", axis=1)\n    labels = iris_dataframe[\"species\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.96\n    assert (scores.std() * 2) < 0.04\n\n\ndef test_dict_vectorizer():\n    df = pd.DataFrame(\n        [[{'a': 1, 'b': 2}], [{'a': 3}]],\n        columns=['colA']\n    )\n\n    outdf = DataFrameMapper(\n        [('colA', DictVectorizer())],\n        df_out=True,\n        default=False\n    ).fit_transform(df)\n\n    columns = sorted(list(outdf.columns))\n    assert len(columns) == 2\n    assert columns[0] == 'colA_0'\n    assert columns[1] == 'colA_1'\n\n\ndef test_with_car_dataframe(cars_dataframe):\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"description\", CountVectorizer()),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = cars_dataframe.drop(\"model\", axis=1)\n    labels = cars_dataframe[\"model\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.30\n\n\ndef test_direct_cross_validation(iris_dataframe):\n    \"\"\"\n    Starting with sklearn>=0.16.0 we no longer need CV wrappers for dataframes.\n    See https://github.com/paulgb/sklearn-pandas/issues/11\n    \"\"\"\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"petal length (cm)\", None),\n            (\"petal width (cm)\", None),\n            (\"sepal length (cm)\", None),\n            (\"sepal width (cm)\", None),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = iris_dataframe.drop(\"species\", axis=1)\n 
   labels = iris_dataframe[\"species\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.96\n    assert (scores.std() * 2) < 0.04\n\n\ndef test_heterogeneous_output_types_input_df():\n    \"\"\"\n    Modify feat2, but pass feat1 through unmodified.\n    This fails if input_df == False\n    \"\"\"\n    df = pd.DataFrame({\n        'feat1': [1, 2, 3, 4, 5, 6],\n        'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n    })\n    M = DataFrameMapper([\n        (['feat2'], StandardScaler())\n        ], input_df=True, df_out=True, default=None)\n    dft = M.fit_transform(df)\n    assert dft['feat1'].dtype == np.dtype('int64')\n    assert dft['feat2'].dtype == np.dtype('float64')\n\n\ndef test_make_column_selector(iris_dataframe):\n    t = DataFrameMapper([\n        (make_column_selector(dtype_include=float), None, {'alias': 'x'}),\n        ('sepal length (cm)', None),\n    ], df_out=True, default=False)\n\n    xt = t.fit(iris_dataframe).transform(iris_dataframe)\n    expected = ['x_0', 'x_1', 'x_2', 'x_3', 'sepal length (cm)']\n    assert list(xt.columns) == expected\n\n    pickled = pickle.dumps(t)\n    t2 = pickle.loads(pickled)\n    xt2 = t2.transform(iris_dataframe)\n    assert np.array_equal(xt.values, xt2.values)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
      "content": "from collections import Counter\n\nimport pytest\nimport numpy as np\nfrom pandas import DataFrame\nfrom numpy.testing import assert_array_equal\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas.features_generator import gen_features\n\n\nclass MockClass(object):\n\n    def __init__(self, value=1, name='class'):\n        self.value = value\n        self.name = name\n\n\nclass MockTransformer(object):\n\n    def __init__(self):\n        self.most_common_ = None\n\n    def fit(self, X, y=None):\n        [(value, _)] = Counter(X).most_common(1)\n        self.most_common_ = value\n        return self\n\n    def transform(self, X, y=None):\n        return np.asarray([self.most_common_] * len(X))\n\n\n@pytest.fixture\ndef simple_dataset():\n    return DataFrame({\n        'feat1': [1, 2, 1, 3, 1],\n        'feat2': [1, 2, 2, 2, 3],\n        'feat3': [1, 2, 3, 4, 5],\n    })\n\n\ndef test_generate_features_with_default_parameters():\n    \"\"\"\n    Tests generating features from classes with default init arguments.\n    \"\"\"\n    columns = ['colA', 'colB', 'colC']\n    feature_defs = gen_features(columns=columns, classes=[MockClass])\n    assert len(feature_defs) == len(columns)\n\n    for feature in feature_defs:\n        assert feature[2] == {}\n\n    feature_dict = dict([_[0:2] for _ in feature_defs])\n    assert columns == sorted(feature_dict.keys())\n\n    # default init arguments for MockClass for clarification.\n    expected = {'value': 1, 'name': 'class'}\n    for column, transformers in feature_dict.items():\n        for obj in transformers:\n            assert_attributes(obj, **expected)\n\n\ndef test_generate_features_with_several_classes():\n    \"\"\"\n    Tests generating features pipeline with different transformers parameters.\n    \"\"\"\n    feature_defs = gen_features(\n        columns=['colA', 'colB', 'colC'],\n        classes=[\n            {'class': MockClass},\n            {'class': MockClass, 'name': 
'mockA'},\n            {'class': MockClass, 'name': 'mockB', 'value': None}\n        ]\n    )\n\n    for col, transformers, params in feature_defs:\n        assert_attributes(transformers[0], name='class', value=1)\n        assert_attributes(transformers[1], name='mockA', value=1)\n        assert_attributes(transformers[2], name='mockB', value=None)\n\n\ndef test_generate_features_with_none_only_transformers():\n    \"\"\"\n    Tests generating \"dummy\" feature definition which doesn't apply any\n    transformation.\n    \"\"\"\n    feature_defs = gen_features(\n        columns=['colA', 'colB', 'colC'], classes=[None])\n\n    expected = [('colA', None, {}),\n                ('colB', None, {}),\n                ('colC', None, {})]\n\n    assert feature_defs == expected\n\n\ndef test_compatibility_with_data_frame_mapper(simple_dataset):\n    \"\"\"\n    Tests compatibility of generated feature definition with DataFrameMapper.\n    \"\"\"\n    features_defs = gen_features(\n        columns=['feat1', 'feat2'],\n        classes=[MockTransformer])\n    features_defs.append(('feat3', None))\n\n    mapper = DataFrameMapper(features_defs)\n    X = mapper.fit_transform(simple_dataset)\n    expected = np.asarray([\n        [1, 2, 1],\n        [1, 2, 2],\n        [1, 2, 3],\n        [1, 2, 4],\n        [1, 2, 5]\n    ])\n\n    assert_array_equal(X, expected)\n\n\ndef assert_attributes(obj, **attrs):\n    for attr, value in attrs.items():\n        assert getattr(obj, attr) == value\n"
    }
  ],
  "OriginCode": [
    {
      "path": "scikit-learn-contrib_sklearn-pandas/test.py",
      "content": "import pytest\nfrom unittest.mock import Mock\nimport numpy as np\nimport pandas as pd\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn.compose import make_column_selector\nfrom sklearn.preprocessing import StandardScaler\n\n\nclass GetStartWith:\n    def __init__(self, start_str):\n        self.start_str = start_str\n\n    def __call__(self, X: pd.DataFrame) -> list:\n        return [c for c in X.columns if c.startswith(self.start_str)]\n\n\ndf = pd.DataFrame({\n    'sepal length (cm)': [1.0, 2.0, 3.0],\n    'sepal width (cm)': [1.0, 2.0, 3.0],\n    'petal length (cm)': [1.0, 2.0, 3.0],\n    'petal width (cm)': [1.0, 2.0, 3.0]\n})\nt = DataFrameMapper([\n    (make_column_selector(dtype_include=float), StandardScaler(), {'alias': 'x'}),\n    (GetStartWith('petal'), None, {'alias': 'petal'})\n], df_out=True, default=False)\n\nt.fit(df)\nprint(t.transform(df).shape)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/setup.py",
      "content": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\nfrom setuptools import setup\nfrom setuptools.command.test import test as TestCommand\nimport re\n\nfor line in open('sklearn_pandas/__init__.py'):\n    match = re.match(\"__version__ *= *'(.*)'\", line)\n    if match:\n        __version__, = match.groups()\n\n\nclass PyTest(TestCommand):\n    user_options = [('pytest-args=', 'a', \"Arguments to pass to py.test\")]\n\n    def initialize_options(self):\n        TestCommand.initialize_options(self)\n        self.pytest_args = []\n\n    def finalize_options(self):\n        TestCommand.finalize_options(self)\n        self.test_args = []\n        self.test_suite = True\n\n    def run(self):\n        import pytest\n        errno = pytest.main(self.pytest_args)\n        raise SystemExit(errno)\n\n\nsetup(name='sklearn-pandas',\n      version=__version__,\n      description='Pandas integration with sklearn',\n      maintainer='Ritesh Agrawal',\n      maintainer_email='ragrawal@gmail.com',\n      url='https://github.com/scikit-learn-contrib/sklearn-pandas',\n      packages=['sklearn_pandas'],\n      keywords=['scikit', 'sklearn', 'pandas'],\n      install_requires=[\n          'scikit-learn>=0.23.0',\n          'scipy>=1.5.1',\n          'pandas>=1.1.4',\n          'numpy>=1.18.1'\n      ],\n      tests_require=['pytest', 'mock'],\n      cmdclass={'test': PyTest},\n      license='MIT License'\n)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/noxfile.py",
      "content": "import nox\n\n@nox.session\ndef lint(session):\n    session.install('pytest>=5.3.5', 'setuptools>=45.2',\n                    'wheel>=0.34.2', 'flake8>=3.7.9',\n                    'numpy==1.18.1', 'pandas==1.1.4')\n    session.install('.')\n    session.run('flake8', 'sklearn_pandas/', 'tests')\n\n@nox.session\n@nox.parametrize('numpy', ['1.18.1', '1.19.4', '1.20.1'])\n@nox.parametrize('scipy', ['1.5.4', '1.6.0'])\n@nox.parametrize('pandas', ['1.1.4', '1.2.2'])\ndef tests(session, numpy, scipy, pandas):\n    session.install('pytest>=5.3.5', \n                    'setuptools>=45.2',\n                    'wheel>=0.34.2',\n                    f'numpy=={numpy}',\n                    f'scipy=={scipy}',\n                    f'pandas=={pandas}'\n                    )\n    session.install('.')\n    session.run('py.test', 'README.rst', 'tests')\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/features_generator.py",
      "content": "def gen_features(columns, classes=None, prefix='', suffix=''):\n    \"\"\"Generates a feature definition list which can be passed\n    into DataFrameMapper\n\n    Params:\n\n    columns     a list of column names to generate features for.\n\n    classes     a list of classes for each feature, a list of dictionaries with\n                transformer class and init parameters, or None.\n\n                If list of classes is provided, then each of them is\n                instantiated with default arguments. Example:\n\n                    classes = [StandardScaler, LabelBinarizer]\n\n                If list of dictionaries is provided, then each of them should\n                have a 'class' key with transformer class. All other keys are\n                passed into 'class' value constructor. Example:\n\n                    classes = [\n                        {'class': StandardScaler, 'with_mean': False},\n                        {'class': LabelBinarizer}\n                    }]\n\n                If None value selected, then each feature left as is.\n\n    prefix      add prefix to transformed column names\n\n    suffix      add suffix to transformed column names.\n\n    \"\"\"\n    if classes is None:\n        return [(column, None) for column in columns]\n\n    feature_defs = []\n\n    for column in columns:\n        feature_transformers = []\n\n        arguments = {}\n        if prefix and prefix != \"\":\n            arguments['prefix'] = prefix\n        if suffix and suffix != \"\":\n            arguments['suffix'] = suffix\n\n        classes = [cls for cls in classes if cls is not None]\n        if not classes:\n            feature_defs.append((column, None, arguments))\n\n        else:\n            for definition in classes:\n                if isinstance(definition, dict):\n                    params = definition.copy()\n                    klass = params.pop('class')\n                    feature_transformers.append(klass(**params))\n    
            else:\n                    feature_transformers.append(definition())\n\n            if not feature_transformers:\n                feature_transformers = None\n\n            feature_defs.append((column, feature_transformers, arguments))\n\n    return feature_defs\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/transformers.py",
      "content": "import numpy as np\nimport pandas as pd\nfrom sklearn.base import TransformerMixin\nimport warnings\n\n\ndef _get_mask(X, value):\n    \"\"\"\n    Compute the boolean mask X == missing_values.\n    \"\"\"\n    if value == \"NaN\" or \\\n       value is None or \\\n       (isinstance(value, float) and np.isnan(value)):\n        return pd.isnull(X)\n    else:\n        return X == value\n\n\nclass NumericalTransformer(TransformerMixin):\n    \"\"\"\n    Provides commonly used numerical transformers.\n    \"\"\"\n    SUPPORTED_FUNCTIONS = ['log', 'log1p']\n\n    def __init__(self, func):\n        \"\"\"\n        Params\n\n        func    function to apply to input columns. The function will be\n                applied to each value. Supported functions are defined\n                in SUPPORTED_FUNCTIONS variable. Throws assertion error if the\n                not supported.\n        \"\"\"\n\n        warnings.warn(\"\"\"\n            NumericalTransformer will be deprecated in 3.0 version.\n            Please use Sklearn.base.TransformerMixin to write\n            customer transformers\n            \"\"\", DeprecationWarning)\n\n        assert func in self.SUPPORTED_FUNCTIONS, \\\n            f\"Only following func are supported: {self.SUPPORTED_FUNCTIONS}\"\n        super(NumericalTransformer, self).__init__()\n        self.__func = func\n\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X, y=None):\n        if self.__func == 'log1p':\n            return np.vectorize(np.log1p)(X)\n        elif self.__func == 'log':\n            return np.vectorize(np.log)(X)\n\n        raise ValueError(f\"Invalid function name: {self.__func}\")\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/__init__.py",
      "content": "__version__ = '2.2.0'\n\nimport logging\nlogger = logging.getLogger(__name__)\n\nfrom .dataframe_mapper import DataFrameMapper  # NOQA\nfrom .features_generator import gen_features  # NOQA\nfrom .transformers import NumericalTransformer # NOQA\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/pipeline.py",
      "content": "import six\nfrom sklearn.pipeline import _name_estimators, Pipeline\nfrom sklearn.utils import tosequence\n\n\ndef _call_fit(fit_method, X, y=None, **kwargs):\n    \"\"\"\n    helper function, calls the fit or fit_transform method with the correct\n    number of parameters\n\n    fit_method: fit or fit_transform method of the transformer\n    X: the data to fit\n    y: the target vector relative to X, optional\n    kwargs: any keyword arguments to the fit method\n\n    return: the result of the fit or fit_transform method\n\n    WARNING: if this function raises a TypeError exception, test the fit\n    or fit_transform method passed to it in isolation as _call_fit will not\n    distinguish TypeError due to incorrect number of arguments from\n    other TypeError\n    \"\"\"\n    try:\n        return fit_method(X, y, **kwargs)\n    except TypeError:\n        # fit takes only one argument\n        return fit_method(X, **kwargs)\n\n\nclass TransformerPipeline(Pipeline):\n    \"\"\"\n    Pipeline that expects all steps to be transformers taking a single X\n    argument, an optional y argument, and having fit and transform methods.\n\n    Code is copied from sklearn's Pipeline\n    \"\"\"\n\n    def __init__(self, steps):\n        names, estimators = zip(*steps)\n        if len(dict(steps)) != len(steps):\n            raise ValueError(\n                \"Provided step names are not unique: %s\" % (names,))\n\n        # shallow copy of steps\n        self.steps = tosequence(steps)\n        estimator = estimators[-1]\n\n        for e in estimators:\n            if (not (hasattr(e, \"fit\") or hasattr(e, \"fit_transform\")) or not\n                    hasattr(e, \"transform\")):\n                raise TypeError(\"All steps of the chain should \"\n                                \"be transforms and implement fit and transform\"\n                                \" '%s' (type %s) doesn't)\" % (e, type(e)))\n\n        if not hasattr(estimator, \"fit\"):\n       
     raise TypeError(\"Last step of chain should implement fit \"\n                            \"'%s' (type %s) doesn't)\"\n                            % (estimator, type(estimator)))\n\n    def _pre_transform(self, X, y=None, **fit_params):\n        fit_params_steps = dict((step, {}) for step, _ in self.steps)\n        for pname, pval in six.iteritems(fit_params):\n            step, param = pname.split('__', 1)\n            fit_params_steps[step][param] = pval\n        Xt = X\n        for name, transform in self.steps[:-1]:\n            if hasattr(transform, \"fit_transform\"):\n                Xt = _call_fit(transform.fit_transform,\n                               Xt, y, **fit_params_steps[name])\n            else:\n                Xt = _call_fit(transform.fit,\n                               Xt, y, **fit_params_steps[name]).transform(Xt)\n        return Xt, fit_params_steps[self.steps[-1][0]]\n\n    def fit(self, X, y=None, **fit_params):\n        Xt, fit_params = self._pre_transform(X, y, **fit_params)\n        _call_fit(self.steps[-1][-1].fit, Xt, y, **fit_params)\n        return self\n\n    def fit_transform(self, X, y=None, **fit_params):\n        Xt, fit_params = self._pre_transform(X, y, **fit_params)\n        if hasattr(self.steps[-1][-1], 'fit_transform'):\n            return _call_fit(self.steps[-1][-1].fit_transform,\n                             Xt, y, **fit_params)\n        else:\n            return _call_fit(self.steps[-1][-1].fit,\n                             Xt, y, **fit_params).transform(Xt)\n\n\ndef make_transformer_pipeline(*steps):\n    \"\"\"Construct a TransformerPipeline from the given estimators.\n    \"\"\"\n    return TransformerPipeline(_name_estimators(steps))\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/dataframe_mapper.py",
      "content": "import contextlib\nfrom datetime import datetime\nimport pandas as pd\nimport numpy as np\nfrom scipy import sparse\nfrom sklearn.base import BaseEstimator, TransformerMixin\nfrom .cross_validation import DataWrapper\nfrom .pipeline import make_transformer_pipeline, _call_fit, TransformerPipeline\nfrom . import logger\n\nstring_types = text_type = str\n\n\ndef _handle_feature(fea):\n    \"\"\"\n    Convert 1-dimensional arrays to 2-dimensional column vectors.\n    \"\"\"\n    if len(fea.shape) == 1:\n        fea = np.array([fea]).T\n\n    return fea\n\n\ndef _build_transformer(transformers):\n    if isinstance(transformers, list):\n        transformers = make_transformer_pipeline(*transformers)\n    return transformers\n\n\ndef _build_feature(columns, transformers, options={}, X=None):\n    if X is None:\n        return (columns, _build_transformer(transformers), options)\n    return (\n        columns(X) if callable(columns) else columns,\n        _build_transformer(transformers),\n        options\n    )\n\n\ndef _elapsed_secs(t1):\n    return (datetime.now()-t1).total_seconds()\n\n\ndef _get_feature_names(estimator):\n    \"\"\"\n    Attempt to extract feature names based on a given estimator\n    \"\"\"\n    if hasattr(estimator, 'classes_'):\n        return estimator.classes_\n    elif hasattr(estimator, 'get_feature_names'):\n        return estimator.get_feature_names()\n    return None\n\n\n@contextlib.contextmanager\ndef add_column_names_to_exception(column_names):\n    # Stolen from https://stackoverflow.com/a/17677938/356729\n    try:\n        yield\n    except Exception as ex:\n        if ex.args:\n            msg = u'{}: {}'.format(column_names, ex.args[0])\n        else:\n            msg = text_type(column_names)\n        ex.args = (msg,) + ex.args[1:]\n        raise\n\n\nclass DataFrameMapper(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Map Pandas data frame column subsets to their own\n    sklearn transformation.\n    
\"\"\"\n\n    def __init__(self, features, default=False, sparse=False, df_out=False,\n                 input_df=False, drop_cols=None):\n        \"\"\"\n        Params:\n\n        features    a list of tuples with features definitions.\n                    The first element is the pandas column selector. This can\n                    be a string (for one column) or a list of strings.\n                    The second element is an object that supports\n                    sklearn's transform interface, or a list of such objects\n                    The third element is optional and, if present, must be\n                    a dictionary with the options to apply to the\n                    transformation. Example: {'alias': 'day_of_week'}\n\n        default     default transformer to apply to the columns not\n                    explicitly selected in the mapper. If False (default),\n                    discard them. If None, pass them through untouched. Any\n                    other transformer will be applied to all the unselected\n                    columns as a whole, taken as a 2d-array.\n\n        sparse      will return sparse matrix if set True and any of the\n                    extracted features is sparse. Defaults to False.\n\n        df_out      return a pandas data frame, with each column named using\n                    the pandas column that created it (if there's only one\n                    input and output) or the input columns joined with '_'\n                    if there's multiple inputs, and the name concatenated with\n                    '_1', '_2' etc if there's multiple outputs. NB: does not\n                    work if *default* or *sparse* are true\n\n        input_df    If ``True`` pass the selected columns to the transformers\n                    as a pandas DataFrame or Series. Otherwise pass them as a\n                    numpy array. Defaults to ``False``.\n\n        drop_cols   List of columns to be dropped. 
Defaults to None.\n\n        \"\"\"\n        self.features = features\n        self.default = default\n        self.built_default = None\n        self.sparse = sparse\n        self.df_out = df_out\n        self.input_df = input_df\n        self.drop_cols = [] if drop_cols is None else drop_cols\n        self.transformed_names_ = []\n        if (df_out and (sparse or default)):\n            raise ValueError(\"Can not use df_out with sparse or default\")\n\n    def _build(self, X=None):\n        \"\"\"\n        Build attributes built_features and built_default.\n        \"\"\"\n        if isinstance(self.features, list):\n            self.built_features = [\n                _build_feature(*f, X=X) for f in self.features\n            ]\n        else:\n            self.built_features = _build_feature(*self.features, X=X)\n        self.built_default = _build_transformer(self.default)\n\n    @property\n    def _selected_columns(self):\n        \"\"\"\n        Return a set of selected columns in the feature list.\n        \"\"\"\n        selected_columns = set()\n        for feature in self.features:\n            columns = feature[0]\n            if isinstance(columns, list):\n                selected_columns = selected_columns.union(set(columns))\n            else:\n                selected_columns.add(columns)\n        return selected_columns\n\n    def _unselected_columns(self, X):\n        \"\"\"\n        Return list of columns present in X and not selected explicitly in the\n        mapper.\n\n        Unselected columns are returned in the order they appear in the\n        dataframe to avoid issues with different ordering during default fit\n        and transform steps.\n        \"\"\"\n        X_columns = list(X.columns)\n        return [column for column in X_columns if\n                column not in self._selected_columns\n                and column not in self.drop_cols]\n\n    def __setstate__(self, state):\n        # compatibility for older versions of 
sklearn-pandas\n        super().__setstate__(state)\n        self.features = [_build_feature(*feat) for feat in state['features']]\n        self.sparse = state.get('sparse', False)\n        self.default = state.get('default', False)\n        self.df_out = state.get('df_out', False)\n        self.input_df = state.get('input_df', False)\n        self.drop_cols = state.get('drop_cols', [])\n        self.built_features = state.get('built_features', self.features)\n        self.built_default = state.get('built_default', self.default)\n        self.transformed_names_ = state.get('transformed_names_', [])\n\n    def __getstate__(self):\n        state = super().__getstate__()\n        state['features'] = self.features\n        state['sparse'] = self.sparse\n        state['default'] = self.default\n        state['df_out'] = self.df_out\n        state['input_df'] = self.input_df\n        state['drop_cols'] = self.drop_cols\n        state['build_features'] = getattr(self, 'built_features', None)\n        state['built_default'] = self.built_default\n        state['transformed_names_'] = self.transformed_names_\n        return state\n\n    def _get_col_subset(self, X, cols, input_df=False):\n        \"\"\"\n        Get a subset of columns from the given table X.\n\n        X       a Pandas dataframe; the table to select columns from\n        cols    a string or list of strings representing the columns to select.\n                It can also be a callable that returns True or False, i.e.\n                compatible with the built-in filter function.\n\n        Returns a numpy array with the data from the selected columns\n        \"\"\"\n\n        if isinstance(cols, string_types):\n            return_vector = True\n            cols = [cols]\n        else:\n            return_vector = False\n\n        # Needed when using the cross-validation compatibility\n        # layer for sklearn<0.16.0.\n        # Will be dropped on sklearn-pandas 2.0.\n        if isinstance(X, list):\n     
       X = [x[cols] for x in X]\n            X = pd.DataFrame(X)\n\n        elif isinstance(X, DataWrapper):\n            X = X.df  # fetch underlying data\n\n        if return_vector:\n            t = X[cols[0]]\n        else:\n            t = X[cols]\n\n        # return either a DataFrame/Series or a numpy array\n        if input_df:\n            return t\n        else:\n            return t.values\n\n    def fit(self, X, y=None):\n        \"\"\"\n        Fit a transformation from the pipeline\n\n        X       the data to fit\n\n        y       the target vector relative to X, optional\n\n        \"\"\"\n        self._build(X=X)\n\n        for columns, transformers, options in self.built_features:\n            t1 = datetime.now()\n            input_df = options.get('input_df', self.input_df)\n\n            if transformers is not None:\n                with add_column_names_to_exception(columns):\n                    Xt = self._get_col_subset(X, columns, input_df)\n                    _call_fit(transformers.fit, Xt, y)\n            logger.info(f\"[FIT] {columns}: {_elapsed_secs(t1)} secs\")\n\n        # handle features not explicitly selected\n        if self.built_default:  # not False and not None\n            unsel_cols = self._unselected_columns(X)\n            with add_column_names_to_exception(unsel_cols):\n                Xt = self._get_col_subset(X, unsel_cols, self.input_df)\n                _call_fit(self.built_default.fit, Xt, y)\n        return self\n\n    def get_names(self, columns, transformer, x, alias=None, prefix='',\n                  suffix=''):\n        \"\"\"\n        Return verbose names for the transformed columns.\n\n        columns       name (or list of names) of the original column(s)\n        transformer   transformer - can be a TransformerPipeline\n        x             transformed columns (numpy.ndarray)\n        alias         base name to use for the selected columns\n        \"\"\"\n        if alias is not None:\n            name 
= alias\n        elif isinstance(columns, list):\n            name = '_'.join(map(str, columns))\n        else:\n            name = columns\n        num_cols = x.shape[1] if len(x.shape) > 1 else 1\n\n        output = []\n\n        if num_cols > 1:\n            # If there are as many columns as classes in the transformer,\n            # infer column names from classes names.\n\n            # If we are dealing with multiple transformers for these columns\n            # attempt to extract the names from each of them, starting from the\n            # last one\n            if isinstance(transformer, TransformerPipeline):\n                inverse_steps = transformer.steps[::-1]\n                estimators = (estimator for name, estimator in inverse_steps)\n                names_steps = (_get_feature_names(e) for e in estimators)\n                names = next((n for n in names_steps if n is not None), None)\n            # Otherwise use the only estimator present\n            else:\n                names = _get_feature_names(transformer)\n\n            if names is not None and len(names) == num_cols:\n                output = [f\"{name}_{o}\" for o in names]\n                # otherwise, return name concatenated with '_1', '_2', etc.\n            else:\n                output = [name + '_' + str(o) for o in range(num_cols)]\n        else:\n            output = [name]\n\n        if prefix == suffix == \"\":\n            return output\n\n        return ['{}{}{}'.format(prefix, x, suffix) for x in output]\n\n    def get_dtypes(self, extracted):\n        dtypes_features = [self.get_dtype(ex) for ex in extracted]\n        return [dtype for dtype_feature in dtypes_features\n                for dtype in dtype_feature]\n\n    def get_dtype(self, ex):\n        if isinstance(ex, np.ndarray) or sparse.issparse(ex):\n            return [ex.dtype] * ex.shape[1]\n        elif isinstance(ex, pd.DataFrame):\n            return list(ex.dtypes)\n        else:\n            raise 
TypeError(type(ex))\n\n    def _transform(self, X, y=None, do_fit=False):\n        \"\"\"\n        Transform the given data with possibility to fit in advance.\n        Avoids code duplication for implementation of transform and\n        fit_transform.\n        \"\"\"\n        if do_fit:\n            self._build(X=X)\n\n        extracted = []\n        transformed_names_ = []\n        for columns, transformers, options in self.built_features:\n            input_df = options.get('input_df', self.input_df)\n\n            # columns could be a string or list of\n            # strings; we don't care because pandas\n            # will handle either.\n            Xt = self._get_col_subset(X, columns, input_df)\n\n            if transformers is not None:\n                with add_column_names_to_exception(columns):\n                    if do_fit and hasattr(transformers, 'fit_transform'):\n                        t1 = datetime.now()\n                        Xt = _call_fit(transformers.fit_transform, Xt, y)\n                        logger.info(f\"[FIT_TRANSFORM] {columns}: {_elapsed_secs(t1)} secs\")  # NOQA\n                    else:\n                        if do_fit:\n                            t1 = datetime.now()\n                            _call_fit(transformers.fit, Xt, y)\n                            logger.info(\n                                f\"[FIT] {columns}: {_elapsed_secs(t1)} secs\")\n\n                        t1 = datetime.now()\n                        Xt = transformers.transform(Xt)\n                        logger.info(f\"[TRANSFORM] {columns}: {_elapsed_secs(t1)} secs\")  # NOQA\n\n            extracted.append(_handle_feature(Xt))\n\n            alias = options.get('alias')\n\n            prefix = options.get('prefix', '')\n            suffix = options.get('suffix', '')\n\n            transformed_names_ += self.get_names(\n                columns, transformers, Xt, alias, prefix, suffix)\n\n        # handle features not explicitly selected\n        if 
self.built_default is not False:\n            unsel_cols = self._unselected_columns(X)\n            Xt = self._get_col_subset(X, unsel_cols, self.input_df)\n            if self.built_default is not None:\n                with add_column_names_to_exception(unsel_cols):\n                    if do_fit and hasattr(self.built_default, 'fit_transform'):\n                        Xt = _call_fit(self.built_default.fit_transform, Xt, y)\n                    else:\n                        if do_fit:\n                            _call_fit(self.built_default.fit, Xt, y)\n                        Xt = self.built_default.transform(Xt)\n                transformed_names_ += self.get_names(\n                    unsel_cols, self.built_default, Xt)\n            else:\n                # if not applying a default transformer,\n                # keep column names unmodified\n                transformed_names_ += unsel_cols\n\n            extracted.append(_handle_feature(Xt))\n\n        self.transformed_names_ = transformed_names_\n\n        # combine the feature outputs into one array.\n        # at this point we lose track of which features\n        # were created from which input columns, so it's\n        # assumed that that doesn't matter to the model.\n\n        # If any of the extracted features is sparse, combine sparsely.\n        # Otherwise, combine as normal arrays.\n        if any(sparse.issparse(fea) for fea in extracted):\n            stacked = sparse.hstack(extracted).tocsr()\n            # return a sparse matrix only if the mapper was initialized\n            # with sparse=True\n            if not self.sparse:\n                stacked = stacked.toarray()\n        else:\n            stacked = np.hstack(extracted)\n\n        if self.df_out:\n            # if no rows were dropped preserve the original index,\n            # otherwise use a new integer one\n            no_rows_dropped = len(X) == len(stacked)\n            if no_rows_dropped:\n                index = X.index\n   
         else:\n                index = None\n\n            # output different data types, if appropriate\n            dtypes = self.get_dtypes(extracted)\n            df_out = pd.DataFrame(\n                stacked,\n                columns=self.transformed_names_,\n                index=index)\n            # preserve types\n            for col, dtype in zip(self.transformed_names_, dtypes):\n                df_out[col] = df_out[col].astype(dtype)\n            return df_out\n        else:\n            return stacked\n\n    def transform(self, X):\n        \"\"\"\n        Transform the given data. Assumes that fit has already been called.\n\n        X       the data to transform\n        \"\"\"\n        return self._transform(X)\n\n    def fit_transform(self, X, y=None):\n        \"\"\"\n        Fit a transformation from the pipeline and directly apply\n        it to the given data.\n\n        X       the data to fit\n\n        y       the target vector relative to X, optional\n        \"\"\"\n        return self._transform(X, y, True)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/cross_validation.py",
      "content": "class DataWrapper(object):\n\n    def __init__(self, df):\n        self.df = df\n\n    def __len__(self):\n        return len(self.df)\n\n    def __getitem__(self, key):\n        return self.df.iloc[key]\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_pipeline.py",
      "content": "import pytest\nfrom sklearn_pandas.pipeline import TransformerPipeline, _call_fit\n\n# In py3, mock is included with the unittest standard library\n# In py2, it's a separate package\ntry:\n    from unittest.mock import patch\nexcept ImportError:\n    from mock import patch\n\n\nclass NoTransformT(object):\n    \"\"\"Transformer without transform method.\n    \"\"\"\n    def fit(self, x):\n        return self\n\n\nclass NoFitT(object):\n    \"\"\"Transformer without fit method.\n    \"\"\"\n    def transform(self, x):\n        return self\n\n\nclass Trans(object):\n    \"\"\"\n    Transformer with fit and transform methods\n    \"\"\"\n    def fit(self, x, y=None):\n        return self\n\n    def transform(self, x):\n        return self\n\n\ndef func_x_y(x, y, kwarg='kwarg'):\n    \"\"\"\n    Function with required x and y arguments\n    \"\"\"\n    return\n\n\ndef func_x(x, kwarg='kwarg'):\n    \"\"\"\n    Function with required x argument\n    \"\"\"\n    return\n\n\ndef func_raise_type_err(x, y, kwarg='kwarg'):\n    \"\"\"\n    Function with required x and y arguments,\n    raises TypeError\n    \"\"\"\n    raise TypeError\n\n\ndef test_all_steps_fit_transform():\n    \"\"\"\n    All steps must implement fit and transform. 
Otherwise, raise TypeError.\n    \"\"\"\n    with pytest.raises(TypeError):\n        TransformerPipeline([('svc', NoTransformT())])\n\n    with pytest.raises(TypeError):\n        TransformerPipeline([('svc', NoFitT())])\n\n\n@patch.object(Trans, 'fit', side_effect=func_x_y)\ndef test_called_with_x_and_y(mock_fit):\n    \"\"\"\n    Fit method with required X and y arguments is called with both and with\n    any additional keywords\n    \"\"\"\n    _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n    mock_fit.assert_called_with('X', 'y', kwarg='kwarg')\n\n\n@patch.object(Trans, 'fit', side_effect=func_x)\ndef test_called_with_x(mock_fit):\n    \"\"\"\n    Fit method with a required X arguments is called with it and with\n    any additional keywords\n    \"\"\"\n    _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n    mock_fit.assert_called_with('X', kwarg='kwarg')\n\n    _call_fit(Trans().fit, 'X', kwarg='kwarg')\n    mock_fit.assert_called_with('X', kwarg='kwarg')\n\n\n@patch.object(Trans, 'fit', side_effect=func_raise_type_err)\ndef test_raises_type_error(mock_fit):\n    \"\"\"\n    If a fit method with required X and y arguments raises a TypeError, it's\n    re-raised (for a different reason) when it's called with one argument\n    \"\"\"\n    with pytest.raises(TypeError):\n        _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py",
      "content": "import tempfile\nimport pytest\nimport numpy as np\nfrom pandas import DataFrame\nimport joblib\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas import NumericalTransformer\n\n\n@pytest.fixture\ndef simple_dataset():\n    return DataFrame({\n        'feat1': [1, 2, 1, 3, 1],\n        'feat2': [1, 2, 2, 2, 3],\n        'feat3': [1, 2, 3, 4, 5],\n    })\n\n\ndef test_common_numerical_transformer(simple_dataset):\n    \"\"\"\n    Test log transformation\n    \"\"\"\n    transfomer = DataFrameMapper([\n        ('feat1', NumericalTransformer('log'))\n    ], df_out=True)\n    df = simple_dataset\n    outDF = transfomer.fit_transform(df)\n    assert list(outDF.columns) == ['feat1']\n    assert np.array_equal(df['feat1'].apply(np.log).values, outDF.feat1.values)\n\n\ndef test_numerical_transformer_serialization(simple_dataset):\n    \"\"\"\n    Test if you can serialize transformer\n    \"\"\"\n    transfomer = DataFrameMapper([\n        ('feat1', NumericalTransformer('log'))\n    ])\n\n    df = simple_dataset\n    transfomer.fit(df)\n    f = tempfile.NamedTemporaryFile(delete=True)\n    joblib.dump(transfomer, f.name)\n    transfomer2 = joblib.load(f.name)\n    assert np.array_equal(transfomer.transform(df), transfomer2.transform(df))\n    f.close()\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
      "content": "# -*- coding: utf8 -*-\n\nimport pytest\nfrom unittest.mock import Mock\nfrom pandas import DataFrame\nimport pandas as pd\nfrom scipy import sparse\nfrom sklearn.datasets import load_iris\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.model_selection import cross_val_score\nfrom sklearn.svm import SVC\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom sklearn.feature_extraction import DictVectorizer\nfrom sklearn.preprocessing import (\n    StandardScaler, OneHotEncoder, LabelBinarizer)\nfrom sklearn.impute import SimpleImputer as Imputer\nfrom sklearn.feature_selection import SelectKBest, chi2\nfrom sklearn.base import BaseEstimator, TransformerMixin\nimport sklearn.decomposition\nimport numpy as np\nfrom numpy.testing import assert_array_equal\nimport pickle\nfrom sklearn.compose import make_column_selector\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas.dataframe_mapper import _handle_feature, _build_transformer\nfrom sklearn_pandas.pipeline import TransformerPipeline\n\n\nclass MockXTransformer(object):\n    \"\"\"\n    Mock transformer that accepts no y argument.\n    \"\"\"\n    def fit(self, X):\n        return self\n\n    def transform(self, X):\n        return X\n\n\nclass MockTClassifier(object):\n    \"\"\"\n    Mock transformer/classifier.\n    \"\"\"\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        return X\n\n    def predict(self, X):\n        return True\n\n\nclass DateEncoder():\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        dt = X.dt\n        return pd.concat([dt.year, dt.month, dt.day], axis=1)\n\n\nclass ToSparseTransformer(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Transforms numpy matrix to sparse format.\n    \"\"\"\n    def fit(self, X):\n        return self\n\n    def transform(self, X):\n        return sparse.csr_matrix(X)\n\n\nclass CustomTransformer(BaseEstimator, TransformerMixin):\n  
  \"\"\"\n    Example of transformer in which the number of classes\n    is not equals to the number of output columns.\n    \"\"\"\n    def fit(self, X, y=None):\n        self.min = X.min()\n        self.classes_ = np.unique(X)\n        return self\n\n    def transform(self, X):\n        classes = np.unique(X)\n        if len(np.setdiff1d(classes, self.classes_)) > 0:\n            raise ValueError('Unknown values found.')\n        return X - self.min\n\n\nclass MockImageTransformer(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Example transformer that takes the max of a 2d vector\n    then scales the result.\n    \"\"\"\n    def __init__(self, multiplier=10.0):\n        self.multiplier = multiplier\n\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        assert isinstance(X, pd.DataFrame)\n        for col in X.columns:\n            X[col] = X[col].map(lambda img: np.max(img))\n        return X * self.multiplier\n\n\n@pytest.fixture\ndef simple_dataframe():\n    return pd.DataFrame({'a': [1, 2, 3]})\n\n\n@pytest.fixture\ndef complex_dataframe():\n    return pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'c'],\n                         'feat1': [1, 2, 3, 4, 5, 6],\n                         'feat2': [1, 2, 3, 2, 3, 4]})\n\n\n@pytest.fixture\ndef complex_object_dataframe():\n    return pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'c'],\n                         'feat1': [1, 2, 3, 4, 5, 6],\n                         'feat2': [1, 2, 3, 2, 3, 4],\n                         'img2d': [1*np.eye(2), 2*np.eye(2), 3*np.eye(2),\n                                   4*np.eye(2), 5*np.eye(2), 6*np.eye(2)]})\n\n\n@pytest.fixture\ndef multiindex_dataframe():\n    \"\"\"Example MultiIndex DataFrame, taken from pandas documentation\n    \"\"\"\n    iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]\n    index = pd.MultiIndex.from_product(iterables, names=['first', 'second'])\n    df = pd.DataFrame(np.random.randn(10, 8), 
columns=index)\n    return df\n\n\n@pytest.fixture\ndef multiindex_dataframe_incomplete(multiindex_dataframe):\n    \"\"\"Example MultiIndex DataFrame with missing entries\n    \"\"\"\n    df = multiindex_dataframe\n    mask_array = np.zeros(df.size)\n    mask_array[:20] = 1\n    np.random.shuffle(mask_array)\n    mask = mask_array.reshape(df.shape).astype(bool)\n    df.mask(mask, inplace=True)\n    return df\n\n\ndef test_transformed_names_simple(simple_dataframe):\n    \"\"\"\n    Get transformed names of features in `transformed_names` attribute\n    for simple transformation\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None)])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_transformed_names_binarizer(complex_dataframe):\n    \"\"\"\n    Get transformed names of features in `transformed_names` attribute\n    for a transformation that multiplies the number of columns\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['target_a', 'target_b', 'target_c']\n\n\ndef test_logging(caplog, complex_dataframe):\n    \"\"\"\n    Get transformed names of features in `transformed_names` attribute\n    for a transformation that multiplies the number of columns\n    \"\"\"\n    import logging\n    logger = logging.getLogger('sklearn_pandas')\n    logger.setLevel(logging.INFO)\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    assert '[FIT_TRANSFORM] target:' in caplog.text\n\n\ndef test_transformed_names_binarizer_unicode():\n    df = pd.DataFrame({'target': [u'ñ', u'á', u'é']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    expected_names = {u'target_ñ', u'target_á', u'target_é'}\n    assert set(mapper.transformed_names_) == expected_names\n\n\ndef 
test_transformed_names_transformers_list(complex_dataframe):\n    \"\"\"\n    When using a list of transformers, use them in inverse order to get the\n    transformed names\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([\n        ('target', [LabelBinarizer(), MockXTransformer()])\n    ])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['target_a', 'target_b', 'target_c']\n\n\ndef test_transformed_names_simple_alias(simple_dataframe):\n    \"\"\"\n    If we specify an alias for a single output column, it is used for the\n    output\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None, {'alias': 'new_name'})])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['new_name']\n\n\ndef test_transformed_names_complex_alias(complex_dataframe):\n    \"\"\"\n    If we specify an alias for a multiple output column, it is used for the\n    output\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer(), {'alias': 'new'})])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['new_a', 'new_b', 'new_c']\n\n\ndef test_exception_column_context_transform(simple_dataframe):\n    \"\"\"\n    If an exception is raised when transforming a column,\n    the exception includes the name of the column being transformed\n    \"\"\"\n    class FailingTransformer(object):\n        def fit(self, X):\n            pass\n\n        def transform(self, X):\n            raise Exception('Some exception')\n\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', FailingTransformer())])\n    mapper.fit(df)\n\n    with pytest.raises(Exception, match='a: Some exception'):\n        mapper.transform(df)\n\n\ndef test_exception_column_context_fit(simple_dataframe):\n    \"\"\"\n    If an exception is raised when fit a column,\n    the exception includes the name of the column being fitted\n    \"\"\"\n    class FailingFitter(object):\n        def 
fit(self, X):\n            raise Exception('Some exception')\n\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', FailingFitter())])\n\n    with pytest.raises(Exception, match='a: Some exception'):\n        mapper.fit(df)\n\n\ndef test_simple_df(simple_dataframe):\n    \"\"\"\n    Get a dataframe from a simple mapped dataframe\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None)], df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert type(transformed) == pd.DataFrame\n    assert len(transformed[\"a\"]) == len(simple_dataframe[\"a\"])\n\n\ndef test_complex_df(complex_dataframe):\n    \"\"\"\n    Get a dataframe from a complex mapped dataframe\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [('target', None), ('feat1', None), ('feat2', None)],\n        df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_dataframe)\n    for c in df.columns:\n        assert len(transformed[c]) == len(df[c])\n\n\ndef test_complex_object_df(complex_object_dataframe):\n    \"\"\"\n    Get a dataframe from a complex dataframe with 2d features\n    \"\"\"\n    df = complex_object_dataframe\n    img_scale = 10\n    mapper = DataFrameMapper(\n        [('target', None), ('feat1', None),\n         (make_column_selector('feat2'), StandardScaler()),\n         (make_column_selector('img2d'), MockImageTransformer(img_scale))],\n        df_out=True, input_df=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_object_dataframe)\n    assert np.isclose(\n        np.sum(transformed['img2d']),\n        np.max(np.sum(df['img2d'])) * img_scale, atol=1e-12)\n\n\ndef test_numeric_column_names(complex_dataframe):\n    \"\"\"\n    Get a dataframe from a complex mapped dataframe with numeric column names\n    \"\"\"\n    df = complex_dataframe\n    df.columns = [0, 1, 2]\n    mapper = DataFrameMapper(\n        [(0, None), (1, None), 
(2, None)], df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_dataframe)\n    for c in df.columns:\n        assert len(transformed[c]) == len(df[c])\n\n\ndef test_multiindex_df(multiindex_dataframe_incomplete):\n    \"\"\"\n    Get a dataframe from a multiindex dataframe with missing data\n    \"\"\"\n    df = multiindex_dataframe_incomplete\n    mapper = DataFrameMapper([([c], Imputer()) for c in df.columns],\n                             df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(multiindex_dataframe_incomplete)\n    for c in df.columns:\n        assert len(transformed[str(c)]) == len(df[c])\n\n\ndef test_binarizer_df():\n    \"\"\"\n    Check level names from LabelBinarizer\n    \"\"\"\n    df = pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'a']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 3\n    assert cols[0] == 'target_a'\n    assert cols[1] == 'target_b'\n    assert cols[2] == 'target_c'\n\n\ndef test_binarizer_int_df():\n    \"\"\"\n    Check level names from LabelBinarizer for a numeric array.\n    \"\"\"\n    df = pd.DataFrame({'target': [5, 5, 6, 6, 7, 5]})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 3\n    assert cols[0] == 'target_5'\n    assert cols[1] == 'target_6'\n    assert cols[2] == 'target_7'\n\n\ndef test_binarizer2_df():\n    \"\"\"\n    Check level names from LabelBinarizer with just one output column\n    \"\"\"\n    df = pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'a']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 1\n    assert cols[0] == 
'target'\n\n\ndef test_onehot_df():\n    \"\"\"\n    Check level ids from one-hot\n    \"\"\"\n    df = pd.DataFrame({'target': [0, 0, 1, 1, 2, 3, 0]})\n    mapper = DataFrameMapper([(['target'], OneHotEncoder())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 4\n    assert cols[0] == 'target_0'\n    assert cols[3] == 'target_3'\n\n\ndef test_customtransform_df():\n    \"\"\"\n    Check level ids from a transformer in which\n    the number of classes is not equals to the number of output columns.\n    \"\"\"\n    df = pd.DataFrame({'target': [6, 5, 7, 5, 4, 8, 8]})\n    mapper = DataFrameMapper([(['target'], CustomTransformer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(mapper.features[0][1].classes_) == 5\n    assert len(cols) == 1\n    assert cols[0] == 'target'\n\n\ndef test_preserve_df_index():\n    \"\"\"\n    The index is preserved when df_out=True\n    \"\"\"\n    df = pd.DataFrame({'target': [1, 2, 3]},\n                      index=['a', 'b', 'c'])\n    mapper = DataFrameMapper([('target', None)],\n                             df_out=True)\n\n    transformed = mapper.fit_transform(df)\n\n    assert_array_equal(transformed.index, df.index)\n\n\ndef test_preserve_df_index_rows_dropped():\n    \"\"\"\n    If df_out=True but the original df index length doesn't\n    match the number of final rows, use a numeric index\n    \"\"\"\n    class DropLastRowTransformer(object):\n        def fit(self, X):\n            return self\n\n        def transform(self, X):\n            return X[:-1]\n\n    df = pd.DataFrame({'target': [1, 2, 3]},\n                      index=['a', 'b', 'c'])\n    mapper = DataFrameMapper([('target', DropLastRowTransformer())],\n                             df_out=True)\n\n    transformed = mapper.fit_transform(df)\n\n    assert_array_equal(transformed.index, np.array([0, 1]))\n\n\ndef 
test_pca(complex_dataframe):\n    \"\"\"\n    Check multi in and out with PCA\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [(['feat1', 'feat2'], sklearn.decomposition.PCA(2))],\n        df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 2\n    assert cols[0] == 'feat1_feat2_0'\n    assert cols[1] == 'feat1_feat2_1'\n\n\ndef test_fit_transform(simple_dataframe):\n    \"\"\"\n    Check that custom fit_transform methods of the transformers are invoked.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    # return something of measurable length but does nothing\n    mock_transformer.fit_transform.return_value = np.array([1, 2, 3])\n    mapper = DataFrameMapper([(\"a\", mock_transformer)])\n    mapper.fit_transform(df)\n    assert mock_transformer.fit_transform.called\n\n\ndef test_fit_transform_equiv_mock(simple_dataframe):\n    \"\"\"\n    Check for equivalent results for code paths fit_transform\n    versus fit and transform in DataFrameMapper using the mock\n    transformer which does not implement a custom fit_transform.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', MockXTransformer())])\n    transformed_combined = mapper.fit_transform(df)\n    transformed_separate = mapper.fit(df).transform(df)\n    assert np.all(transformed_combined == transformed_separate)\n\n\ndef test_fit_transform_equiv_pca(complex_dataframe):\n    \"\"\"\n    Check for equivalent results for code paths fit_transform\n    versus fit and transform in DataFrameMapper and transformer\n    using PCA which implements a custom fit_transform. 
The\n    equivalence of both paths in the transformer only can be\n    asserted since this is tested in the sklearn tests\n    scikit-learn/sklearn/decomposition/tests/test_pca.py\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [(['feat1', 'feat2'], sklearn.decomposition.PCA(2))],\n        df_out=True)\n    transformed_combined = mapper.fit_transform(df)\n    transformed_separate = mapper.fit(df).transform(df)\n    assert np.allclose(transformed_combined, transformed_separate)\n\n\ndef test_input_df_true_first_transformer(simple_dataframe, monkeypatch):\n    \"\"\"\n    If input_df is True, the first transformer is passed\n    a pd.Series instead of an np.array\n    \"\"\"\n    df = simple_dataframe\n    monkeypatch.setattr(MockXTransformer, 'fit', Mock())\n    monkeypatch.setattr(MockXTransformer, 'transform',\n                        Mock(return_value=np.array([1, 2, 3])))\n    mapper = DataFrameMapper([\n        ('a', MockXTransformer())\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n\n    args, _ = MockXTransformer().fit.call_args\n    assert isinstance(args[0], pd.Series)\n\n    args, _ = MockXTransformer().transform.call_args\n    assert isinstance(args[0], pd.Series)\n\n    assert_array_equal(out, np.array([1, 2, 3]).reshape(-1, 1))\n\n\ndef test_input_df_true_next_transformers(simple_dataframe, monkeypatch):\n    \"\"\"\n    If input_df is True, the subsequent transformers get passed pandas\n    objects instead of numpy arrays (given the previous transformers\n    output pandas objects as well)\n    \"\"\"\n    df = simple_dataframe\n    monkeypatch.setattr(MockTClassifier, 'fit', Mock())\n    monkeypatch.setattr(MockTClassifier, 'transform',\n                        Mock(return_value=pd.Series([1, 2, 3])))\n    mapper = DataFrameMapper([\n        ('a', [MockXTransformer(), MockTClassifier()])\n    ], input_df=True)\n    mapper.fit(df)\n    out = mapper.transform(df)\n\n    args, _ = 
MockTClassifier().fit.call_args\n    assert isinstance(args[0], pd.Series)\n\n    assert_array_equal(out, np.array([1, 2, 3]).reshape(-1, 1))\n\n\ndef test_input_df_true_multiple_cols(complex_dataframe):\n    \"\"\"\n    When input_df is True, applying transformers to multiple columns\n    works as expected\n    \"\"\"\n    df = complex_dataframe\n\n    mapper = DataFrameMapper([\n        ('target', MockXTransformer()),\n        ('feat1',  MockXTransformer()),\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n\n    assert_array_equal(out[:, 0], df['target'].values)\n    assert_array_equal(out[:, 1], df['feat1'].values)\n\n\ndef test_input_df_date_encoder():\n    \"\"\"\n    When input_df is True we can apply a transformer that only works\n    with pandas dataframes like a DateEncoder\n    \"\"\"\n    df = pd.DataFrame(\n        {'dates': pd.date_range('2015-10-30', '2015-11-02')})\n    mapper = DataFrameMapper([\n        ('dates', DateEncoder())\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n    expected = np.array([\n        [2015, 10, 30],\n        [2015, 10, 31],\n        [2015, 11, 1],\n        [2015, 11, 2]\n    ])\n    assert_array_equal(out, expected)\n\n\ndef test_local_input_df_date_encoder():\n    \"\"\"\n    When input_df is True we can apply a transformer that only works\n    with pandas dataframes like a DateEncoder\n    \"\"\"\n    df = pd.DataFrame(\n        {'dates': pd.date_range('2015-10-30', '2015-11-02')})\n    mapper = DataFrameMapper([\n        ('dates', DateEncoder(), {'input_df': True})\n    ], input_df=False)\n    out = mapper.fit_transform(df)\n    expected = np.array([\n        [2015, 10, 30],\n        [2015, 10, 31],\n        [2015, 11, 1],\n        [2015, 11, 2]\n    ])\n    assert_array_equal(out, expected)\n\n\ndef test_nonexistent_columns_explicit_fail(simple_dataframe):\n    \"\"\"\n    If a nonexistent column is selected, KeyError is raised.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    with 
pytest.raises(KeyError):\n        mapper._get_col_subset(simple_dataframe, ['nonexistent_feature'])\n\n\ndef test_get_col_subset_single_column_array(simple_dataframe):\n    \"\"\"\n    Selecting a single column should return a 1-dimensional numpy array.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    array = mapper._get_col_subset(simple_dataframe, \"a\")\n\n    assert type(array) == np.ndarray\n    assert array.shape == (len(simple_dataframe[\"a\"]),)\n\n\ndef test_get_col_subset_single_column_list(simple_dataframe):\n    \"\"\"\n    Selecting a list of columns (even if the list contains a single element)\n    should return a 2-dimensional numpy array.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    array = mapper._get_col_subset(simple_dataframe, [\"a\"])\n\n    assert type(array) == np.ndarray\n    assert array.shape == (len(simple_dataframe[\"a\"]), 1)\n\n\ndef test_cols_string_array(simple_dataframe):\n    \"\"\"\n    If a string is specified as the columns, the transformer\n    is called with a 1-d array as input.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    mapper = DataFrameMapper([(\"a\", mock_transformer)])\n\n    mapper.fit(df)\n    args, kwargs = mock_transformer.fit.call_args\n    assert args[0].shape == (3,)\n\n\ndef test_cols_list_column_vector(simple_dataframe):\n    \"\"\"\n    If a one-element list is specified as the columns, the transformer\n    is called with a column vector as input.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    mapper = DataFrameMapper([([\"a\"], mock_transformer)])\n\n    mapper.fit(df)\n    args, kwargs = mock_transformer.fit.call_args\n    assert args[0].shape == (3, 1)\n\n\ndef test_handle_feature_2dim():\n    \"\"\"\n    2-dimensional arrays are returned unchanged.\n    \"\"\"\n    array = np.array([[1, 2], [3, 4]])\n    assert_array_equal(_handle_feature(array), array)\n\n\ndef test_handle_feature_1dim():\n    \"\"\"\n    1-dimensional arrays are 
converted to 2-dimensional column vectors.\n    \"\"\"\n    array = np.array([1, 2])\n    assert_array_equal(_handle_feature(array), np.array([[1], [2]]))\n\n\ndef test_build_transformers():\n    \"\"\"\n    When a list of transformers is passed, return a pipeline with\n    each element of the iterable as a step of the pipeline.\n    \"\"\"\n    transformers = [MockTClassifier(), MockTClassifier()]\n    pipeline = _build_transformer(transformers)\n    assert isinstance(pipeline, Pipeline)\n    for ix, transformer in enumerate(transformers):\n        assert pipeline.steps[ix][1] == transformer\n\n\ndef test_selected_columns():\n    \"\"\"\n    selected_columns returns a set of the columns appearing in the features\n    of the mapper.\n    \"\"\"\n    mapper = DataFrameMapper([\n        ('a', None),\n        (['a', 'b'], None)\n    ])\n    assert mapper._selected_columns == {'a', 'b'}\n\n\ndef test_unselected_columns():\n    \"\"\"\n    unselected_columns returns a list of the columns not appearing in the\n    features of the mapper but present in the given dataframe.\n    \"\"\"\n    df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})\n    mapper = DataFrameMapper([\n        ('a', None),\n        (['a', 'b'], None)\n    ])\n    assert 'c' in mapper._unselected_columns(df)\n\n\ndef test_drop_and_default_false():\n    \"\"\"\n    If default=False, non explicitly selected columns and drop columns\n    are discarded.\n    \"\"\"\n    df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})\n    mapper = DataFrameMapper([\n            ('a', None)\n        ], drop_cols=['c'], default=False)\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (1, 1)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_drop_and_default_none():\n    \"\"\"\n    If default=None, drop columns are discarded and\n    remaining non explicitly selected columns are passed through untransformed\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = 
DataFrameMapper([\n        ('a', None)\n    ], drop_cols=['c'], default=None)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 2)\n    assert mapper.transformed_names_ == ['a', 'b']\n\n\ndef test_conflicting_drop():\n    \"\"\"\n    Drop column name shouldn't get confused with transformed columns.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        ('a', None)\n    ], drop_cols=['a'], default=False)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 1)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_default_false():\n    \"\"\"\n    If default=False, non explicitly selected columns are discarded.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        ('b', None)\n    ], default=False)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 1)\n\n\ndef test_default_none():\n    \"\"\"\n    If default=None, non explicitly selected columns are passed through\n    untransformed.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        (['a'], OneHotEncoder())\n    ], default=None)\n\n    transformed = mapper.fit_transform(df)\n    assert (transformed[:, 3] == np.array([3, 5, 7]).T).all()\n\n\ndef test_default_none_names():\n    \"\"\"\n    If default=None, column names are returned unmodified.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([], default=None)\n\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['a', 'b']\n\n\ndef test_default_transformer():\n    \"\"\"\n    If default=Transformer, non explicitly selected columns are applied this\n    transformer.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, np.nan, 3], })\n    mapper = DataFrameMapper([], default=Imputer())\n\n    transformed = mapper.fit_transform(df)\n    assert 
(transformed[:, 0] == np.array([1., 2., 3.])).all()\n\n\ndef test_list_transformers_single_arg(simple_dataframe):\n    \"\"\"\n    Multiple transformers can be specified in a list even if some of them\n    only accept one X argument instead of two (X, y).\n    \"\"\"\n    mapper = DataFrameMapper([\n        ('a', [MockXTransformer()])\n    ])\n    # doesn't fail\n    mapper.fit_transform(simple_dataframe)\n\n\ndef test_list_transformers():\n    \"\"\"\n    Specifying a list of transformers applies them sequentially to the\n    selected column.\n    \"\"\"\n    dataframe = pd.DataFrame({\"a\": [1, np.nan, 3], \"b\": [1, 5, 7]},\n                             dtype=np.float64)\n\n    mapper = DataFrameMapper([\n        ([\"a\"], [Imputer(), StandardScaler()]),\n        ([\"b\"], StandardScaler()),\n    ])\n    dmatrix = mapper.fit_transform(dataframe)\n\n    assert pd.isnull(dmatrix).sum() == 0  # no null values\n\n    # all features have mean 0 and std deviation 1 (standardized)\n    assert (abs(dmatrix.mean(axis=0) - 0) <= 1e-6).all()\n    assert (abs(dmatrix.std(axis=0) - 1) <= 1e-6).all()\n\n\ndef test_list_transformers_old_unpickle(simple_dataframe):\n    mapper = DataFrameMapper(None)\n    # simulate the mapper was created with < 1.0.0 code\n    mapper.features = [('a', [MockXTransformer()])]\n    mapper_pickled = pickle.dumps(mapper)\n\n    loaded_mapper = pickle.loads(mapper_pickled)\n    transformer = loaded_mapper.features[0][1]\n    assert isinstance(transformer, TransformerPipeline)\n    assert isinstance(transformer.steps[0][1], MockXTransformer)\n\n\ndef test_sparse_features(simple_dataframe):\n    \"\"\"\n    If any of the extracted features is sparse and \"sparse\" argument\n    is true, the hstacked result is also sparse.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([\n        (\"a\", ToSparseTransformer())\n    ], sparse=True)\n    dmatrix = mapper.fit_transform(df)\n\n    assert type(dmatrix) == sparse.csr.csr_matrix\n\n\ndef 
test_sparse_off(simple_dataframe):\n    \"\"\"\n    If the resulting features are sparse but the \"sparse\" argument\n    of the mapper is False, return a non-sparse matrix.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([\n        (\"a\", ToSparseTransformer())\n    ], sparse=False)\n\n    dmatrix = mapper.fit_transform(df)\n    assert type(dmatrix) != sparse.csr.csr_matrix\n\n\ndef test_fit_with_optional_y_arg(complex_dataframe):\n    \"\"\"\n    Transformers with an optional y argument in the fit method\n    are handled correctly\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([(['feat1', 'feat2'], MockTClassifier())])\n    # doesn't fail\n    mapper.fit(df[['feat1', 'feat2']], df['target'])\n\n\ndef test_fit_with_required_y_arg(complex_dataframe):\n    \"\"\"\n    Transformers with a required y argument in the fit method\n    are handled and perform correctly\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([(['feat1', 'feat2'], SelectKBest(chi2, k=1))])\n\n    # fit, doesn't fail\n    ft_arr = mapper.fit(df[['feat1', 'feat2']], df['target'])\n\n    # fit_transform\n    ft_arr = mapper.fit_transform(df[['feat1', 'feat2']], df['target'])\n    assert_array_equal(ft_arr, df[['feat1']].values)\n\n    # transform\n    t_arr = mapper.transform(df[['feat1', 'feat2']])\n    assert_array_equal(t_arr, df[['feat1']].values)\n\n\n# Integration tests with real dataframes\n\n@pytest.fixture\ndef iris_dataframe():\n    iris = load_iris()\n    return DataFrame(\n        data={\n            iris.feature_names[0]: iris.data[:, 0],\n            iris.feature_names[1]: iris.data[:, 1],\n            iris.feature_names[2]: iris.data[:, 2],\n            iris.feature_names[3]: iris.data[:, 3],\n            \"species\": np.array([iris.target_names[e] for e in iris.target])\n        }\n    )\n\n\n@pytest.fixture\ndef cars_dataframe():\n    return pd.read_csv(\"tests/test_data/cars.csv.gz\", compression='gzip')\n\n\ndef 
test_with_iris_dataframe(iris_dataframe):\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"petal length (cm)\", None),\n            (\"petal width (cm)\", None),\n            (\"sepal length (cm)\", None),\n            (\"sepal width (cm)\", None),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = iris_dataframe.drop(\"species\", axis=1)\n    labels = iris_dataframe[\"species\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.96\n    assert (scores.std() * 2) < 0.04\n\n\ndef test_dict_vectorizer():\n    df = pd.DataFrame(\n        [[{'a': 1, 'b': 2}], [{'a': 3}]],\n        columns=['colA']\n    )\n\n    outdf = DataFrameMapper(\n        [('colA', DictVectorizer())],\n        df_out=True,\n        default=False\n    ).fit_transform(df)\n\n    columns = sorted(list(outdf.columns))\n    assert len(columns) == 2\n    assert columns[0] == 'colA_0'\n    assert columns[1] == 'colA_1'\n\n\ndef test_with_car_dataframe(cars_dataframe):\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"description\", CountVectorizer()),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = cars_dataframe.drop(\"model\", axis=1)\n    labels = cars_dataframe[\"model\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.30\n\n\ndef test_direct_cross_validation(iris_dataframe):\n    \"\"\"\n    Starting with sklearn>=0.16.0 we no longer need CV wrappers for dataframes.\n    See https://github.com/paulgb/sklearn-pandas/issues/11\n    \"\"\"\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"petal length (cm)\", None),\n            (\"petal width (cm)\", None),\n            (\"sepal length (cm)\", None),\n            (\"sepal width (cm)\", None),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = iris_dataframe.drop(\"species\", axis=1)\n 
   labels = iris_dataframe[\"species\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.96\n    assert (scores.std() * 2) < 0.04\n\n\ndef test_heterogeneous_output_types_input_df():\n    \"\"\"\n    Modify feat2, but pass feat1 through unmodified.\n    This fails if input_df == False\n    \"\"\"\n    df = pd.DataFrame({\n        'feat1': [1, 2, 3, 4, 5, 6],\n        'feat2': [1.0, 2.0, 3.0, 2.0, 3.0, 4.0]\n    })\n    M = DataFrameMapper([\n        (['feat2'], StandardScaler())\n        ], input_df=True, df_out=True, default=None)\n    dft = M.fit_transform(df)\n    assert dft['feat1'].dtype == np.dtype('int64')\n    assert dft['feat2'].dtype == np.dtype('float64')\n\n\ndef test_make_column_selector(iris_dataframe):\n    t = DataFrameMapper([\n        (make_column_selector(dtype_include=float), None, {'alias': 'x'}),\n        ('sepal length (cm)', None),\n    ], df_out=True, default=False)\n\n    xt = t.fit(iris_dataframe).transform(iris_dataframe)\n    expected = ['x_0', 'x_1', 'x_2', 'x_3', 'sepal length (cm)']\n    assert list(xt.columns) == expected\n\n    pickled = pickle.dumps(t)\n    t2 = pickle.loads(pickled)\n    xt2 = t2.transform(iris_dataframe)\n    assert np.array_equal(xt.values, xt2.values)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
      "content": "from collections import Counter\n\nimport pytest\nimport numpy as np\nfrom pandas import DataFrame\nfrom numpy.testing import assert_array_equal\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas.features_generator import gen_features\n\n\nclass MockClass(object):\n\n    def __init__(self, value=1, name='class'):\n        self.value = value\n        self.name = name\n\n\nclass MockTransformer(object):\n\n    def __init__(self):\n        self.most_common_ = None\n\n    def fit(self, X, y=None):\n        [(value, _)] = Counter(X).most_common(1)\n        self.most_common_ = value\n        return self\n\n    def transform(self, X, y=None):\n        return np.asarray([self.most_common_] * len(X))\n\n\n@pytest.fixture\ndef simple_dataset():\n    return DataFrame({\n        'feat1': [1, 2, 1, 3, 1],\n        'feat2': [1, 2, 2, 2, 3],\n        'feat3': [1, 2, 3, 4, 5],\n    })\n\n\ndef test_generate_features_with_default_parameters():\n    \"\"\"\n    Tests generating features from classes with default init arguments.\n    \"\"\"\n    columns = ['colA', 'colB', 'colC']\n    feature_defs = gen_features(columns=columns, classes=[MockClass])\n    assert len(feature_defs) == len(columns)\n\n    for feature in feature_defs:\n        assert feature[2] == {}\n\n    feature_dict = dict([_[0:2] for _ in feature_defs])\n    assert columns == sorted(feature_dict.keys())\n\n    # default init arguments for MockClass for clarification.\n    expected = {'value': 1, 'name': 'class'}\n    for column, transformers in feature_dict.items():\n        for obj in transformers:\n            assert_attributes(obj, **expected)\n\n\ndef test_generate_features_with_several_classes():\n    \"\"\"\n    Tests generating features pipeline with different transformers parameters.\n    \"\"\"\n    feature_defs = gen_features(\n        columns=['colA', 'colB', 'colC'],\n        classes=[\n            {'class': MockClass},\n            {'class': MockClass, 'name': 
'mockA'},\n            {'class': MockClass, 'name': 'mockB', 'value': None}\n        ]\n    )\n\n    for col, transformers, params in feature_defs:\n        assert_attributes(transformers[0], name='class', value=1)\n        assert_attributes(transformers[1], name='mockA', value=1)\n        assert_attributes(transformers[2], name='mockB', value=None)\n\n\ndef test_generate_features_with_none_only_transformers():\n    \"\"\"\n    Tests generating \"dummy\" feature definition which doesn't apply any\n    transformation.\n    \"\"\"\n    feature_defs = gen_features(\n        columns=['colA', 'colB', 'colC'], classes=[None])\n\n    expected = [('colA', None, {}),\n                ('colB', None, {}),\n                ('colC', None, {})]\n\n    assert feature_defs == expected\n\n\ndef test_compatibility_with_data_frame_mapper(simple_dataset):\n    \"\"\"\n    Tests compatibility of generated feature definition with DataFrameMapper.\n    \"\"\"\n    features_defs = gen_features(\n        columns=['feat1', 'feat2'],\n        classes=[MockTransformer])\n    features_defs.append(('feat3', None))\n\n    mapper = DataFrameMapper(features_defs)\n    X = mapper.fit_transform(simple_dataset)\n    expected = np.asarray([\n        [1, 2, 1],\n        [1, 2, 2],\n        [1, 2, 3],\n        [1, 2, 4],\n        [1, 2, 5]\n    ])\n\n    assert_array_equal(X, expected)\n\n\ndef assert_attributes(obj, **attrs):\n    for attr, value in attrs.items():\n        assert getattr(obj, attr) == value\n"
    }
  ],
  "ErrorMessage": "============================================================================================= FAILURES ==============================================================================================\n_____________________________________________________________________________ test_heterogeneous_output_types_input_df ______________________________________________________________________________\n\n    def test_heterogeneous_output_types_input_df():\n        \"\"\"\n        Modify feat2, but pass feat1 through unmodified.\n        This fails if input_df == False\n        \"\"\"\n>       df = pd.DataFrame({\n            'feat1': [1, 2, 3, 4, 5, 6],\n            'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n        })\n\ntests/test_dataframe_mapper.py:1008: \n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _\n../../../../anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/frame.py:733: in __init__\n    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)\n../../../../anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/internals/construction.py:503: in dict_to_mgr\n    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)\n../../../../anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/internals/construction.py:114: in arrays_to_mgr\n    index = _extract_index(arrays)\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _\n\ndata = [[1, 2, 3, 4, 5, 6], [1.0, 2.0, 3.0, 2.0, 3.0]]\n\n    def _extract_index(data) -> Index:\n        \"\"\"\n        Try to infer an Index from the passed data, raise ValueError on failure.\n        \"\"\"\n        index: Index\n        if len(data) == 0:\n            
return default_index(0)\n    \n        raw_lengths = []\n        indexes: list[list[Hashable] | Index] = []\n    \n        have_raw_arrays = False\n        have_series = False\n        have_dicts = False\n    \n        for val in data:\n            if isinstance(val, ABCSeries):\n                have_series = True\n                indexes.append(val.index)\n            elif isinstance(val, dict):\n                have_dicts = True\n                indexes.append(list(val.keys()))\n            elif is_list_like(val) and getattr(val, \"ndim\", 1) == 1:\n                have_raw_arrays = True\n                raw_lengths.append(len(val))\n            elif isinstance(val, np.ndarray) and val.ndim > 1:\n                raise ValueError(\"Per-column arrays must each be 1-dimensional\")\n    \n        if not indexes and not raw_lengths:\n            raise ValueError(\"If using all scalar values, you must pass an index\")\n    \n        if have_series:\n            index = union_indexes(indexes)\n        elif have_dicts:\n            index = union_indexes(indexes, sort=False)\n    \n        if have_raw_arrays:\n            lengths = list(set(raw_lengths))\n            if len(lengths) > 1:\n>               raise ValueError(\"All arrays must be of the same length\")\nE               ValueError: All arrays must be of the same length\n\n../../../../anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/internals/construction.py:677: ValueError\n========================================================================================= warnings summary ==========================================================================================\ntests/test_dataframe_mapper.py::test_complex_object_df\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:106: SettingWithCopyWarning: \n  A value is trying to be set on a copy of a slice from a DataFrame.\n  Try using .loc[row_indexer,col_indexer] = value instead\n  \n  See the 
caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n    X[col] = X[col].map(lambda img: np.max(img))\n\ntests/test_dataframe_mapper.py::test_sparse_features\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:865: DeprecationWarning: Please use `csr_matrix` from the `scipy.sparse` namespace, the `scipy.sparse.csr` namespace is deprecated.\n    assert type(dmatrix) == sparse.csr.csr_matrix\n\ntests/test_dataframe_mapper.py::test_sparse_off\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:879: DeprecationWarning: Please use `csr_matrix` from the `scipy.sparse` namespace, the `scipy.sparse.csr` namespace is deprecated.\n    assert type(dmatrix) != sparse.csr.csr_matrix\n\ntests/test_transformers.py::test_common_numerical_transformer\ntests/test_transformers.py::test_numerical_transformer_serialization\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/sklearn_pandas/transformers.py:35: DeprecationWarning: \n              NumericalTransformer will be deprecated in 3.0 version.\n              Please use Sklearn.base.TransformerMixin to write\n              customer transformers\n              \n    warnings.warn(\"\"\"\n\n-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html\n====================================================================================== short test summary info ======================================================================================\nFAILED tests/test_dataframe_mapper.py::test_heterogeneous_output_types_input_df - ValueError: All arrays must be of the same length\n============================================================================= 1 failed, 69 passed, 5 warnings in 1.40s ==============================================================================",
  "Patch": "--- a/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n@@ -1007,7 +1007,7 @@\n     \"\"\"\n     df = pd.DataFrame({\n         'feat1': [1, 2, 3, 4, 5, 6],\n-        'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n+        'feat2': [1.0, 2.0, 3.0, 2.0, 3.0, 4.0]\n     })\n     M = DataFrameMapper([\n         (['feat2'], StandardScaler())\n",
  "BuggyCodeLocation": [
    {
      "file": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
      "function": {
        "1003": "test_heterogeneous_output_types_input_df"
      },
      "content_all": {
        "1007": "    \"\"\"\n",
        "1008": "    df = pd.DataFrame({\n",
        "1009": "        'feat1': [1, 2, 3, 4, 5, 6],\n",
        "1010": "        'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n",
        "1011": "    })\n",
        "1012": "    M = DataFrameMapper([\n",
        "1013": "        (['feat2'], StandardScaler())\n"
      },
      "content_change": {
        "1010": "        'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n"
      }
    },
    {
      "file": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
      "function": null,
      "content_all": {},
      "content_change": {}
    }
  ],
  "Issue": {
    "title": "Incorrect Length of Columns in Unit Test Leading to Dimension Mismatch",
    "description": "There is an issue in the `test_heterogeneous_output_types_input_df` unit test within the `scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py` file. The columns 'feat1' and 'feat2' have inconsistent lengths. This causes a dimension mismatch error when the DataFrameMapper attempts to transform the data.\n\nSteps to reproduce:\n1. Run the test suite for `scikit-learn-contrib sklearn-pandas`.\n2. Observe the dimension mismatch error in the `test_heterogeneous_output_types_input_df` test case.\n\nExpected behavior:\nAll columns in the DataFrame should have consistent lengths to avoid any dimension mismatch errors during transformations.\n\nResolution:\nEnsure that 'feat1' and 'feat2' columns in the test DataFrame used within `test_heterogeneous_output_types_input_df` have the same number of entries.",
    "explanation": "### Summary of the Issue\nThere was an issue within the `scikit-learn-contrib_sklearn-pandas` project, specifically in the `test_heterogeneous_output_types_input_df` unit test located in the `scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py` file. The problem involved columns 'feat1' and 'feat2' in a test DataFrame having inconsistent lengths, which led to a dimension mismatch error when the DataFrameMapper attempted to transform the data.\n\n### Cause of the Issue\nIn data processing and machine learning pipelines, it is essential that all columns within a DataFrame maintain consistent lengths. In this particular instance, the 'feat1' column had 6 entries while the 'feat2' column had only 5 entries. When attempting to perform transformations using the DataFrameMapper, this mismatch would cause errors because the transformer expects all columns to have the same number of rows.\n\n### Content of the Commit\nThe commit made to solve this issue involved a simple yet effective change:\n1. Adjusting the length of the 'feat2' column to match the 'feat1' column by adding an additional value to 'feat2'. \n2. By ensuring that both columns have consistent lengths, the dimension mismatch error is resolved, and the DataFrameMapper can successfully transform the data without encountering any errors.\n\n### Explanation of the Solution\n1. **Identification and Diagnosis**: The developer identified that the unit test `test_heterogeneous_output_types_input_df` was failing due to a dimension mismatch error linked to column length inconsistency within the test DataFrame.\n   \n2. **Correction**: To correct this, the developer ensured that the 'feat2' column was lengthened to match the 'feat1' column. The shortest path to resolve this was adding an additional value to the existing 'feat2' entries, making both columns of equal length (6 entries each).\n\n3. 
**Implementation**: With this change, the `pd.DataFrame` constructor no longer raises `ValueError: All arrays must be of the same length`, so the test proceeds to the DataFrameMapper, which now receives columns of consistent length.\n\n### How the Commit Solves the Issue\nThe commit fixes the length mismatch by appending a sixth value (`4.0`) to 'feat2' so that it matches 'feat1' in the DataFrame used in the unit test:\n\n1. **Consistent Lengths**: Both columns now have 6 entries, which addresses the root cause of the `ValueError`.\n2. **Transformation Stability**: The DataFrame can be constructed, so the DataFrameMapper can fit and transform it during the test.\n3. **Enhanced Test Reliability**: The test no longer fails on malformed test data and can exercise the behavior it was written to check.\n\n### Final Explanation and Solution Summary\nThe core issue was a test DataFrame defined with columns of inconsistent lengths (6 entries for 'feat1', 5 for 'feat2'), which made the `pd.DataFrame` constructor raise a `ValueError` before the DataFrameMapper was applied. The fix gives 'feat2' the same number of entries as 'feat1', so the DataFrame is built successfully and the test can run its dtype assertions on the mapper output."
  },
  "Explain": "### Summary of the Issue\nThere was an issue within the `scikit-learn-contrib_sklearn-pandas` project, specifically in the `test_heterogeneous_output_types_input_df` unit test located in the `scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py` file. The problem involved columns 'feat1' and 'feat2' in a test DataFrame having inconsistent lengths, which led to a dimension mismatch error when the DataFrameMapper attempted to transform the data.\n\n### Cause of the Issue\nIn data processing and machine learning pipelines, it is essential that all columns within a DataFrame maintain consistent lengths. In this particular instance, the 'feat1' column had 6 entries while the 'feat2' column had only 5 entries. When attempting to perform transformations using the DataFrameMapper, this mismatch would cause errors because the transformer expects all columns to have the same number of rows.\n\n### Content of the Commit\nThe commit made to solve this issue involved a simple yet effective change:\n1. Adjusting the length of the 'feat2' column to match the 'feat1' column by adding an additional value to 'feat2'. \n2. By ensuring that both columns have consistent lengths, the dimension mismatch error is resolved, and the DataFrameMapper can successfully transform the data without encountering any errors.\n\n### Explanation of the Solution\n1. **Identification and Diagnosis**: The developer identified that the unit test `test_heterogeneous_output_types_input_df` was failing due to a dimension mismatch error linked to column length inconsistency within the test DataFrame.\n   \n2. **Correction**: To correct this, the developer ensured that the 'feat2' column was lengthened to match the 'feat1' column. The shortest path to resolve this was adding an additional value to the existing 'feat2' entries, making both columns of equal length (6 entries each).\n\n3. 
**Implementation**: With this change, the `pd.DataFrame` constructor no longer raises `ValueError: All arrays must be of the same length`, so the test proceeds to the DataFrameMapper, which now receives columns of consistent length.\n\n### How the Commit Solves the Issue\nThe commit fixes the length mismatch by appending a sixth value (`4.0`) to 'feat2' so that it matches 'feat1' in the DataFrame used in the unit test:\n\n1. **Consistent Lengths**: Both columns now have 6 entries, which addresses the root cause of the `ValueError`.\n2. **Transformation Stability**: The DataFrame can be constructed, so the DataFrameMapper can fit and transform it during the test.\n3. **Enhanced Test Reliability**: The test no longer fails on malformed test data and can exercise the behavior it was written to check.\n\n### Final Explanation and Solution Summary\nThe core issue was a test DataFrame defined with columns of inconsistent lengths (6 entries for 'feat1', 5 for 'feat2'), which made the `pd.DataFrame` constructor raise a `ValueError` before the DataFrameMapper was applied. The fix gives 'feat2' the same number of entries as 'feat1', so the DataFrame is built successfully and the test can run its dtype assertions on the mapper output.",
  "Source": "Human",
  "Token": 1386,
  "Command": [
    "pytest tests"
  ],
  "FilteredCode": [
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
      "content": "1 # -*- coding: utf8 -*-\n2 \n3 import pytest\n4 from unittest.mock import Mock\n5 from pandas import DataFrame\n6 import pandas as pd\n7 from scipy import sparse\n8 from sklearn.datasets import load_iris\n9 from sklearn.pipeline import Pipeline\n10 from sklearn.model_selection import cross_val_score\n11 from sklearn.svm import SVC\n12 from sklearn.feature_extraction.text import CountVectorizer\n13 from sklearn.feature_extraction import DictVectorizer\n14 from sklearn.preprocessing import (\n15     StandardScaler, OneHotEncoder, LabelBinarizer)\n16 from sklearn.impute import SimpleImputer as Imputer\n17 from sklearn.feature_selection import SelectKBest, chi2\n18 from sklearn.base import BaseEstimator, TransformerMixin\n19 import sklearn.decomposition\n20 import numpy as np\n21 from numpy.testing import assert_array_equal\n22 import pickle\n23 from sklearn.compose import make_column_selector\n24 \n25 from sklearn_pandas import DataFrameMapper\n26 from sklearn_pandas.dataframe_mapper import _handle_feature, _build_transformer\n27 from sklearn_pandas.pipeline import TransformerPipeline\n28 \n29 \n30 class MockXTransformer(object):\n31     \"\"\"\n32     Mock transformer that accepts no y argument.\n33     \"\"\"\n34     def fit(self, X):\n35         return self\n36 \n37     def transform(self, X):\n38         return X\n39 \n40 \n41 class MockTClassifier(object):\n42     \"\"\"\n43     Mock transformer/classifier.\n44     \"\"\"\n45     def fit(self, X, y=None):\n46         return self\n47 \n48     def transform(self, X):\n49         return X\n50 \n51     def predict(self, X):\n52         return True\n53 \n54 \n55 class DateEncoder():\n56     def fit(self, X, y=None):\n57         retur(...truncated)"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/test.py",
      "content": "1 import pytest\n2 from unittest.mock import Mock\n3 import numpy as np\n4 import pandas as pd\n5 from sklearn_pandas import DataFrameMapper\n6 from sklearn.compose import make_column_selector\n7 from sklearn.preprocessing import StandardScaler\n8 \n9 \n10 class GetStartWith:\n11     def __init__(self, start_str):\n12         self.start_str = start_str\n13 \n14     def __call__(self, X: pd.DataFrame) -> list:\n15         return [c for c in X.columns if c.startswith(self.start_str)]\n16 \n17 \n18 df = pd.DataFrame({\n19     'sepal length (cm)': [1.0, 2.0, 3.0],\n20     'sepal width (cm)': [1.0, 2.0, 3.0],\n21     'petal length (cm)': [1.0, 2.0, 3.0],\n22     'petal width (cm)': [1.0, 2.0, 3.0]\n23 })\n24 t = DataFrameMapper([\n25     (make_column_selector(dtype_include=float), StandardScaler(), {'alias': 'x'}),\n26     (GetStartWith('petal'), None, {'alias': 'petal'})\n27 ], df_out=True, default=False)\n28 \n29 t.fit(df)\n30 print(t.transform(df).shape)"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py",
      "content": "1 import tempfile\n2 import pytest\n3 import numpy as np\n4 from pandas import DataFrame\n5 import joblib\n6 \n7 from sklearn_pandas import DataFrameMapper\n8 from sklearn_pandas import NumericalTransformer\n9 \n10 \n11 @pytest.fixture\n12 def simple_dataset():\n13     return DataFrame({\n14         'feat1': [1, 2, 1, 3, 1],\n15         'feat2': [1, 2, 2, 2, 3],\n16         'feat3': [1, 2, 3, 4, 5],\n17     })\n18 \n19 \n20 def test_common_numerical_transformer(simple_dataset):\n21     \"\"\"\n22     Test log transformation\n23     \"\"\"\n24     transfomer = DataFrameMapper([\n25         ('feat1', NumericalTransformer('log'))\n26     ], df_out=True)\n27     df = simple_dataset\n28     outDF = transfomer.fit_transform(df)\n29     assert list(outDF.columns) == ['feat1']\n30     assert np.array_equal(df['feat1'].apply(np.log).values, outDF.feat1.values)\n31 \n32 \n33 def test_numerical_transformer_serialization(simple_dataset):\n34     \"\"\"\n35     Test if you can serialize transformer\n36     \"\"\"\n37     transfomer = DataFrameMapper([\n38         ('feat1', NumericalTransformer('log'))\n39     ])\n40 \n41     df = simple_dataset\n42     transfomer.fit(df)\n43     f = tempfile.NamedTemporaryFile(delete=True)\n44     joblib.dump(transfomer, f.name)\n45     transfomer2 = joblib.load(f.name)\n46     np.array_equal(transfomer.transform(df), transfomer2.transform(df))\n47     f.close()"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
      "content": "1 from collections import Counter\n2 \n3 import pytest\n4 import numpy as np\n5 from pandas import DataFrame\n6 from numpy.testing import assert_array_equal\n7 \n8 from sklearn_pandas import DataFrameMapper\n9 from sklearn_pandas.features_generator import gen_features\n10 \n11 \n12 class MockClass(object):\n13 \n14     def __init__(self, value=1, name='class'):\n15         self.value = value\n16         self.name = name\n17 \n18 \n19 class MockTra(...truncated)"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/dataframe_mapper.py",
      "content": "1 import contextlib\n2 from datetime import datetime\n3 import pandas as pd\n4 import numpy as np\n5 from scipy import sparse\n6 from sklearn.base import BaseEstimator, TransformerMixin\n7 from .cross_validation import DataWrapper\n8 from .pipeline import make_transformer_pipeline, _ca(...truncated)"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/transformers.py",
      "content": "1 import numpy as np\n2 import pandas as pd\n3 from sklearn.base import TransformerMixin\n4 import warnings\n5 \n6 \n7 def _get_mask(X, value):\n8     \"\"\"\n9     Compute the boolean mask X == missing_val(...truncated)"
    }
  ],
  "TokenAll": 18484,
  "FilteredLevel": 1500,
  "Results": {
    "model": "GPT-4o",
    "Difficulty": "Medium",
    "issue_origin": {
      "title": "Potential Typographical Error in Variable Name",
      "description": "In the file 'scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py' on lines 24 and 38, the variable 'transfomer' is likely a typographical error and should be corrected to 'transformer'. This could lead to confusion and maintenance issues in the future.",
      "explanation": "### Summary of the Issue\n\nThe issue identified is a typographical error in the variable name `transfomer` which appears in the file `scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py` on lines 24 and 38. The correct spelling should be `transformer`. This typo could lead to confusion among developers, reduce code readability, and increase the maintenance effort required.\n\n### Detailed Content of the Commit\n\nThe commit to address the issue comprises the following changes:\n\n```diff\ndiff --git a/scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py b/scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py\nindex e584536..ad7b9e8 100644\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py\n@@ -24,7 +24,7 @@ def test_common_numerical_transformer(simple_dataset):\n     Test log transformation\n     \"\"\"\n-    transfomer = DataFrameMapper([\n+    transformer = DataFrameMapper([\n         ('feat1', NumericalTransformer('log'))\n     ], df_out=True)\n     df = simple_dataset\n@@ -37,7 +37,7 @@ def test_numerical_transformer_serialization(simple_dataset):\n     Test if you can serialize transformer\n     \"\"\"\n-    transfomer = DataFrameMapper([\n+    transformer = DataFrameMapper([\n         ('feat1', NumericalTransformer('log'))\n     ])\n \n```\n\n### Explanation of How the Commit Solves the Issue\n\n1. **Typographical Error Correction:**\n    - The commit changes the variable name `transfomer` to `transformer` on lines 24 and 37 in the file `scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py`.\n    - This correction ensures the variable name is spelled correctly, maintaining consistency and reducing potential future confusion for developers who might misunderstand the incorrect spelling as a different variable.\n\n2. 
**Consistency and Maintainability:**\n    - Properly named variables contribute to better readability and understanding of the code because developers expect common words like `transformer` to be spelled accurately.\n    - Correcting this typographical error helps maintain a standard of quality within the codebase, making it easier to maintain and reducing the cognitive load on developers when reading the code. It prevents any second-guessing or misinterpretation of the variable's role and purpose.\n\n3. **Avoiding Bugs or Misuse:**\n    - Although this specific typo didn't cause any runtime errors, leaving such typos in the code could lead to bugs in the future. Correcting it preemptively avoids any potential unforeseen issues where a developer might wrongly reference `transfomer` instead of `transformer`.\n    - Uniform and intuitive naming conventions help in reducing silly mistakes during refactoring, debugging, or extending the test cases.\n\n4. **Consistency in Documentation and Comments:**\n    - This change also reflects a commitment to high code quality where even minor details, like variable names, are kept correct and consistent.\n    - Any documentation or comments referring to this variable will now also benefit from this correction, maintaining a clear and professional standard throughout the codebase.\n\nBy implementing this commit, the code is cleaner, more accurate, and follows best practices, ensuring that the code remains readable and manageable for future improvements and maintenance."
    },
    "issue_message": {
      "title": "Inconsistent array lengths in DataFrame creation",
      "description": "The arrays provided to create the DataFrame in test_heterogeneous_output_types_input_df are of different lengths. 'feat1' has 6 elements, while 'feat2' has only 5. Ensure all arrays provided to the DataFrame constructor are of the same length.",
      "explanation": "### Issue Summary\n\nThe issue at hand, titled \"**Inconsistent array lengths in DataFrame creation**,\" occurs when trying to create a pandas `DataFrame` in a test case. Specifically, the length of the arrays for 'feat1' and 'feat2' are different, leading to a `ValueError`. The error message \"All arrays must be of the same length\" indicates that pandas requires all columns to have the same number of elements when creating a `DataFrame`.\n\n### Detailed Analysis\n\n**Problematic Code Snippet:**\n\n```python\ndf = pd.DataFrame({\n    'feat1': [1, 2, 3, 4, 5, 6],\n    'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n})\n```\n\nIn the provided code snippet from the `test_heterogeneous_output_types_input_df` function, the array for 'feat1' contains 6 elements, while the array for 'feat2' contains only 5 elements. This discrepancy in lengths raises the error.\n\nThis part lies within the file `tests/test_dataframe_mapper.py` around line 1008, which causes the error when it tries to execute the DataFrame constructor.\n\n### Commit Details\n\nTo resolve this issue, we need to ensure that all arrays passed to the `DataFrame` constructor have the same length. \n\n**Proposed Change:**\n\n```diff\n-        df = pd.DataFrame({\n-            'feat1': [1, 2, 3, 4, 5, 6],\n-            'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n-        })\n+        df = pd.DataFrame({\n+            'feat1': [1, 2, 3, 4, 5],\n+            'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n+        })\n```\n\nThis modification ensures that both 'feat1' and 'feat2' have 5 elements, making the lengths consistent.\n\n### Explanation of Solution\n\n1. **Understanding the Error:**\n   The error `ValueError: All arrays must be of the same length` occurs when attempting to create a pandas `DataFrame` with columns of different lengths. Pandas does not allow this as it would result in an undefined shape for the DataFrame.\n\n2. **Solution Strategy:**\n   To solve the issue, the lengths of the arrays need to be equal. 
In this specific context, either `feat1` can be reduced to 5 elements (to match `feat2`), or `feat2` can be increased to 6 elements (to match `feat1`).\n\n3. **Commit Changes:**\n   By changing the 'feat1' array to match the length of the 'feat2' array (both having 5 elements), the code ensures that the `DataFrame` constructor will not raise a `ValueError`. This resolves the issue as the `DataFrame` can now be created successfully.\n\n### How the Commit Solves the Issue\n\nThe commit addresses the inconsistency by ensuring both columns have the same length:\n- The `feat1` array is reduced from 6 to 5 elements.\n- This change harmonizes the lengths of the columns, enabling the `DataFrame` to be created without raising an error.\n\nAfter these modifications, executing the `DataFrame` creation will no longer throw a `ValueError`, and the test case can proceed as expected.\n\n### Conclusion\n\nBy ensuring that the input arrays for the DataFrame creation are of equal length, we resolve the issue efficiently. This detailed examination and correction ensure that future attempts to create DataFrames within test cases will not encounter this type of error, promoting smoother test execution."
    },
    "issue_ground": {
      "title": "Incorrect Length of Columns in Unit Test Leading to Dimension Mismatch",
      "description": "There is an issue in the `test_heterogeneous_output_types_input_df` unit test within the `scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py` file. The columns 'feat1' and 'feat2' have inconsistent lengths. This causes a dimension mismatch error when the DataFrameMapper attempts to transform the data.\n\nSteps to reproduce:\n1. Run the test suite for `scikit-learn-contrib sklearn-pandas`.\n2. Observe the dimension mismatch error in the `test_heterogeneous_output_types_input_df` test case.\n\nExpected behavior:\nAll columns in the DataFrame should have consistent lengths to avoid any dimension mismatch errors during transformations.\n\nResolution:\nEnsure that 'feat1' and 'feat2' columns in the test DataFrame used within `test_heterogeneous_output_types_input_df` have the same number of entries.",
      "explanation": "### Issue Summary\n\nThe issue reported is a dimension mismatch error occurring in the unit test `test_heterogeneous_output_types_input_df` in the `scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py` file. This error is caused due to the lengths of the columns `'feat1'` and `'feat2'` being inconsistent, resulting in a ValueError when attempting to create a Pandas DataFrame.\n\n### Error Details\n\nUpon running the test suite, the following error message is encountered:\n\n```\nE               ValueError: All arrays must be of the same length\n```\n\nThe inconsistency in the lengths of the arrays within the DataFrame definition for the `feat1` and `feat2` columns is clearly the root cause:\n\n```python\ndf = pd.DataFrame({\n    'feat1': [1, 2, 3, 4, 5, 6],\n    'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n})\n```\n\nHere, `'feat1'` has 6 elements, while `'feat2'` only has 5 elements.\n\n### Description of the Commit\n\nTo resolve this issue, the lengths of the columns in the DataFrame need to be made consistent. The commit modifies the `test_heterogeneous_output_types_input_df` test case to ensure that both columns have the same number of entries.\n\nThe corrected code would look like this:\n\n```python\ndf = pd.DataFrame({\n    'feat1': [1, 2, 3, 4, 5],\n    'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n})\n```\n\n### Explanation of the Fix\n\n1. **Identification of the Root Cause**:\n   - By analyzing the error message, the issue was identified as a mismatch in the lengths of the columns in the DataFrame, causing a `ValueError`.\n\n2. **Content of the Commit**:\n   - The commit modifies the test case `test_heterogeneous_output_types_input_df` to equalize the lengths of `'feat1'` and `'feat2'`, changing the list for `'feat1'` to `[1, 2, 3, 4, 5]`.\n\n3. 
**How the Commit Solves the Issue**:\n   - **Consistent-Length Columns**: By ensuring that both columns `'feat1'` and `'feat2'` have the same number of elements, the DataFrame can be successfully created without a dimension mismatch error.\n   - **Functional Transformation**: With the corrected DataFrame, other operations involving the `DataFrameMapper` can proceed without encountering dimension mismatches, ensuring the test passes smoothly.\n\n### Detailed Steps in the Corrected Code:\n1. **Correcting DataFrame Initialization**:\n   - The commit ensures that the lists provided for `'feat1'` and `'feat2'` have the same length.\n\n2. **Validation**:\n   - Re-run the test suite to validate that the `test_heterogeneous_output_types_input_df` test case now passes without errors.\n\n### Conclusion\n\nThe dimension mismatch error in the `test_heterogeneous_output_types_input_df` unit test was caused by columns of differing lengths in the DataFrame. By ensuring that all columns have consistent lengths within the DataFrame initialization, this commit fixes the error, thus allowing the test to pass successfully.\n\nThis solution aligns the test setup with the expected structure and ensures consistent behavior for DataFrame operations within the `DataFrameMapper` context."
    },
    "issue_ground_truth": {
      "title": "Incorrect Length of Columns in Unit Test Leading to Dimension Mismatch",
      "description": "There is an issue in the `test_heterogeneous_output_types_input_df` unit test within the `scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py` file. The columns 'feat1' and 'feat2' have inconsistent lengths. This causes a dimension mismatch error when the DataFrameMapper attempts to transform the data.\n\nSteps to reproduce:\n1. Run the test suite for `scikit-learn-contrib sklearn-pandas`.\n2. Observe the dimension mismatch error in the `test_heterogeneous_output_types_input_df` test case.\n\nExpected behavior:\nAll columns in the DataFrame should have consistent lengths to avoid any dimension mismatch errors during transformations.\n\nResolution:\nEnsure that 'feat1' and 'feat2' columns in the test DataFrame used within `test_heterogeneous_output_types_input_df` have the same number of entries.",
      "explanation": "### Summary of the Issue\nThere was an issue within the `scikit-learn-contrib_sklearn-pandas` project, specifically in the `test_heterogeneous_output_types_input_df` unit test located in the `scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py` file. The problem involved columns 'feat1' and 'feat2' in a test DataFrame having inconsistent lengths, which led to a dimension mismatch error when the DataFrameMapper attempted to transform the data.\n\n### Cause of the Issue\nIn data processing and machine learning pipelines, it is essential that all columns within a DataFrame maintain consistent lengths. In this particular instance, the 'feat1' column had 6 entries while the 'feat2' column had only 5 entries. When attempting to perform transformations using the DataFrameMapper, this mismatch would cause errors because the transformer expects all columns to have the same number of rows.\n\n### Content of the Commit\nThe commit made to solve this issue involved a simple yet effective change:\n1. Adjusting the length of the 'feat2' column to match the 'feat1' column by adding an additional value to 'feat2'. \n2. By ensuring that both columns have consistent lengths, the dimension mismatch error is resolved, and the DataFrameMapper can successfully transform the data without encountering any errors.\n\n### Explanation of the Solution\n1. **Identification and Diagnosis**: The developer identified that the unit test `test_heterogeneous_output_types_input_df` was failing due to a dimension mismatch error linked to column length inconsistency within the test DataFrame.\n   \n2. **Correction**: To correct this, the developer ensured that the 'feat2' column was lengthened to match the 'feat1' column. The shortest path to resolve this was adding an additional value to the existing 'feat2' entries, making both columns of equal length (6 entries each).\n\n3. 
**Implementation**: Implementing this change would ensure that the DataFrameMapper, which tries to process these columns, does not encounter any dimension-related errors as it now processes columns with consistent lengths.\n\n### How the Commit Solves the Issue\nThe commit fixes the dimension mismatch error by aligning the lengths of 'feat1' and 'feat2' columns within the DataFrame used in the unit test. This ensures that the DataFrame is structurally sound for transformation:\n\n1. **Consistent Lengths**: By making sure both columns have the same number of entries, the commit addresses the root cause of the dimension mismatch error.\n2. **Transformation Stability**: The DataFrameMapper will now be able to handle the DataFrame without encountering runtime errors, resulting in successful transformations during the test.\n3. **Enhanced Test Reliability**: This ensures that the test suite doesn't fail due to simple structural issues, facilitating more reliable and robust testing of the remaining functionality.\n\n### Final Explanation and Solution Summary\nThe core issue was a DataFrame with columns of inconsistent lengths causing a transformation error when processed by the DataFrameMapper in a unit test. The solution involved aligning the lengths of all columns by ensuring 'feat2' had the same number of entries as 'feat1'. This simple but crucial change eliminated the dimension mismatch error, allowing the DataFrameMapper to transform the data successfully. The commit effectively resolves the problem by maintaining structural consistency within the DataFrame, thereby ensuring that the test runs smoothly without error."
    },
    "location_origin": [
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py",
        "function": {
          "20": "test_common_numerical_transformer"
        },
        "content_all": {
          "17": "    })\n",
          "18": " \n",
          "19": " \n",
          "20": "def test_common_numerical_transformer(simple_dataset):\n",
          "21": "    \"\"\"\n",
          "22": "    Test log transformation\n",
          "23": "    \"\"\"\n",
          "24": "    transfomer = DataFrameMapper([\n",
          "25": "        ('feat1', NumericalTransformer('log'))\n",
          "26": "    ], df_out=True)\n",
          "27": "    df = simple_dataset\n",
          "28": "    outDF = transfomer.fit_transform(df)\n",
          "29": "    assert list(outDF.columns) == ['feat1']\n",
          "30": "    assert np.array_equal(df['feat1'].apply(np.log).values, outDF.feat1.values)\n"
        },
        "content_change": {
          "24": "    transformer = DataFrameMapper([\n"
        }
      },
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py",
        "function": {
          "33": "test_numerical_transformer_serialization"
        },
        "content_all": {
          "30": "    assert np.array_equal(df['feat1'].apply(np.log).values, outDF.feat1.values)\n",
          "31": " \n",
          "32": " \n",
          "33": "def test_numerical_transformer_serialization(simple_dataset):\n",
          "34": "    \"\"\"\n",
          "35": "    Test if you can serialize transformer\n",
          "36": "    \"\"\"\n",
          "37": "    transfomer = DataFrameMapper([\n",
          "38": "        ('feat1', NumericalTransformer('log'))\n",
          "39": "    ])\n",
          "40": " \n",
          "41": "    df = simple_dataset\n",
          "42": "    transfomer.fit(df)\n",
          "43": "    f = tempfile.NamedTemporaryFile(delete=True)\n",
          "44": "    joblib.dump(transfomer, f.name)\n"
        },
        "content_change": {
          "37": "    transformer = DataFrameMapper([\n"
        }
      }
    ],
    "location_message": [
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
        "function": {
          "1007": "test_heterogeneous_output_types_input_df"
        },
        "content_all": {
          "1004": "        Modify feat2, but pass feat1 through unmodified.\n",
          "1005": "        This fails if input_df == False\n",
          "1006": "        \"\"\"\n",
          "1007": "    def test_heterogeneous_output_types_input_df():\n",
          "1008": "        df = pd.DataFrame({\n",
          "1009": "            'feat1': [1, 2, 3, 4, 5, 6],\n",
          "1010": "            'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n",
          "1011": "        })\n",
          "1012": "        # More lines could be here, but these are the relevant context lines.\n",
          "1013": "        \n"
        },
        "content_change": {
          "1009": "            'feat1': [1, 2, 3, 4, 5, 6],\n",
          "1010": "            'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n"
        }
      }
    ],
    "location_ground": [
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
        "function": {
          "110": "test_heterogeneous_output_types_input_df"
        },
        "content_all": {
          "107": "\n",
          "108": "def test_heterogeneous_output_types_input_df():\n",
          "109": "    df = pd.DataFrame({\n",
          "110": "        'feat1': [1, 2, 3, 4, 5, 6],\n",
          "111": "        'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n",
          "112": "    })\n",
          "113": "    mapper = DataFrameMapper([\n",
          "114": "        ('feat1', StandardScaler()),\n",
          "115": "        ('feat2', StandardScaler())\n"
        },
        "content_change": {
          "110": "        'feat1': [1, 2, 3, 4, 5],\n"
        }
      }
    ],
    "location_ground_exp": [
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
        "function": {
          "187": "test_heterogeneous_output_types_input_df"
        },
        "content_all": {
          "184": "    def test_heterogeneous_output_types_input_df(self):\n",
          "185": "        df = pd.DataFrame({\n",
          "186": "            'feat1': [0, 1, 2, 3, 4, 5],\n",
          "187": "            'feat2': [0, 1, 2, 3, 4],\n",
          "188": "            'feat1_str': ['a', 'b', 'c', 'd', 'e', 'f'],\n",
          "189": "            'feat3': [10, 20, 30, 40, 50, 60],\n",
          "190": "            'feat4': [0, 1, 0, 1, 0, 1]\n",
          "191": "        })\n"
        },
        "content_change": {
          "187": "            'feat2': [0, 1, 2, 3, 4, 5],\n"
        }
      }
    ],
    "location_ground_truth": [
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
        "function": {
          "1003": "test_heterogeneous_output_types_input_df"
        },
        "content_all": {
          "1007": "    \"\"\"\n",
          "1008": "    df = pd.DataFrame({\n",
          "1009": "        'feat1': [1, 2, 3, 4, 5, 6],\n",
          "1010": "        'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n",
          "1011": "    })\n",
          "1012": "    M = DataFrameMapper([\n",
          "1013": "        (['feat2'], StandardScaler())\n"
        },
        "content_change": {
          "1010": "        'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n"
        }
      },
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
        "function": null,
        "content_all": {},
        "content_change": {}
      }
    ],
    "patch_i": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py\n@@ -24,7 +24,7 @@ def test_common_numerical_transformer(simple_dataset):\n     Test log transformation\n     \"\"\"\n-    transfomer = DataFrameMapper([\n+    transformer = DataFrameMapper([\n         ('feat1', NumericalTransformer('log'))\n     ], df_out=True)\n     df = simple_dataset\n@@ -37,7 +37,7 @@ def test_numerical_transformer_serialization(simple_dataset):\n     Test if you can serialize transformer\n     \"\"\"\n-    transfomer = DataFrameMapper([\n+    transformer = DataFrameMapper([\n         ('feat1', NumericalTransformer('log'))\n     ])\n \n@@ -42,13 +42,13 @@ def test_numerical_transformer_serialization(simple_dataset):\n     df = simple_dataset\n-    transfomer.fit(df)\n+    transformer.fit(df)\n     f = tempfile.NamedTemporaryFile(delete=True)\n-    joblib.dump(transfomer, f.name)\n-    transfomer2 = joblib.load(f.name)\n-    np.array_equal(transfomer.transform(df), transfomer2.transform(df))\n+    joblib.dump(transformer, f.name)\n+    transformer2 = joblib.load(f.name)\n+    np.array_equal(transformer.transform(df), transformer2.transform(df))\n     f.close()\n",
    "patch_im": "\n--- a/tests/test_dataframe_mapper.py\n+++ b/tests/test_dataframe_mapper.py\n@@ -1007,8 +1007,8 @@\n         input_df = DataFrame({\n             'cat1': ['a', 'b', 'a', 'a', 'b', 1],\n         })\n-        df = pd.DataFrame({\n-            'feat1': [1, 2, 3, 4, 5, 6],\n-            'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n-        })\n+        df = pd.DataFrame({\n+            'feat1': [1, 2, 3, 4, 5],\n+            'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n+        })\n \n         assert df.shape == (5, 2)\n         assert isinstance(df, pd.DataFrame)\n",
    "patch_il": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py\n@@ -24,7 +24,7 @@ def test_common_numerical_transformer(simple_dataset):\n     Test log transformation\n     \"\"\"\n-    transfomer = DataFrameMapper([\n+    transformer = DataFrameMapper([\n         ('feat1', NumericalTransformer('log'))\n     ], df_out=True)\n     df = simple_dataset\n@@ -37,7 +37,7 @@ def test_numerical_transformer_serialization(simple_dataset):\n     Test if you can serialize transformer\n     \"\"\"\n-    transfomer = DataFrameMapper([\n+    transformer = DataFrameMapper([\n         ('feat1', NumericalTransformer('log'))\n     ])\n \n",
    "patch_iml": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n@@ -1008,8 +1008,8 @@\n \n     def test_heterogeneous_output_types_input_df():\n         df = pd.DataFrame({\n-            'feat1': [1, 2, 3, 4, 5, 6],\n-            'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n+            'feat1': [1, 2, 3, 4, 5],\n+            'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n         })\n         # More lines could be here, but these are the relevant context lines.\n",
    "patch_ground": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n@@ -1,27 +1,27 @@\n # -*- coding: utf8 -*-\n \n import pytest\n from unittest.mock import Mock\n from pandas import DataFrame\n import pandas as pd\n from scipy import sparse\n from sklearn.datasets import load_iris\n from sklearn.pipeline import Pipeline\n from sklearn.model_selection import cross_val_score\n from sklearn.svm import SVC\n from sklearn.feature_extraction.text import CountVectorizer\n from sklearn.feature_extraction import DictVectorizer\n from sklearn.preprocessing import (\n     StandardScaler, OneHotEncoder, LabelBinarizer)\n from sklearn.impute import SimpleImputer as Imputer\n from sklearn.feature_selection import SelectKBest, chi2\n from sklearn.base import BaseEstimator, TransformerMixin\n import sklearn.decomposition\n import numpy as np\n from numpy.testing import assert_array_equal\n import pickle\n from sklearn.compose import make_column_selector\n \n from sklearn_pandas import DataFrameMapper\n from sklearn_pandas.dataframe_mapper import _handle_feature, _build_transformer\n from sklearn_pandas.pipeline import TransformerPipeline\n \n@@ -149,7 +149,7 @@\n     }\n \n     df = pd.DataFrame({\n-        'feat1': [1, 2, 3, 4, 5, 6],\n+        'feat1': [1, 2, 3, 4, 5],\n         'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n     })\n \n",
    "patch_ground_location": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n@@ -108,7 +108,7 @@ def test_heterogeneous_output_types_input_df():\n     df = pd.DataFrame({\n         'feat1': [1, 2, 3, 4, 5, 6],\n         'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n-    mapper = DataFrameMapper([\n+    df = pd.DataFrame({\n+        'feat1': [1, 2, 3, 4, 5],\n         'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n     })\n     mapper = DataFrameMapper([\n",
    "patch_ground_exp": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n@@ -184,8 +184,8 @@\n     def test_heterogeneous_output_types_input_df(self):\n         df = pd.DataFrame({\n             'feat1': [0, 1, 2, 3, 4, 5],\n-            'feat2': [0, 1, 2, 3, 4],\n+            'feat2': [0, 1, 2, 3, 4, 5],\n             'feat1_str': ['a', 'b', 'c', 'd', 'e', 'f'],\n             'feat3': [10, 20, 30, 40, 50, 60],\n             'feat4': [0, 1, 0, 1, 0, 1]\n         })\n",
    "patch_ground_all": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n@@ -7,7 +7,7 @@ def test_heterogeneous_output_types_input_df():\n     \"\"\"\"\"\n     df = pd.DataFrame({\n         'feat1': [1, 2, 3, 4, 5, 6],\n-        'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n+        'feat2': [1.0, 2.0, 3.0, 2.0, 3.0, 4.0]\n     })\n     M = DataFrameMapper([\n         (['feat2'], StandardScaler())\n",
    "patch_ground_truth": "--- a/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n@@ -1007,7 +1007,7 @@\n     \"\"\"\n     df = pd.DataFrame({\n         'feat1': [1, 2, 3, 4, 5, 6],\n-        'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n+        'feat2': [1.0, 2.0, 3.0, 2.0, 3.0, 4.0]\n     })\n     M = DataFrameMapper([\n         (['feat2'], StandardScaler())\n",
    "message": "============================================================================================= FAILURES ==============================================================================================\n_____________________________________________________________________________ test_heterogeneous_output_types_input_df ______________________________________________________________________________\n\n    def test_heterogeneous_output_types_input_df():\n        \"\"\"\n        Modify feat2, but pass feat1 through unmodified.\n        This fails if input_df == False\n        \"\"\"\n>       df = pd.DataFrame({\n            'feat1': [1, 2, 3, 4, 5, 6],\n            'feat2': [1.0, 2.0, 3.0, 2.0, 3.0]\n        })\n\ntests/test_dataframe_mapper.py:1008: \n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _\n../../../../anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/frame.py:733: in __init__\n    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)\n../../../../anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/internals/construction.py:503: in dict_to_mgr\n    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)\n../../../../anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/internals/construction.py:114: in arrays_to_mgr\n    index = _extract_index(arrays)\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _\n\ndata = [[1, 2, 3, 4, 5, 6], [1.0, 2.0, 3.0, 2.0, 3.0]]\n\n    def _extract_index(data) -> Index:\n        \"\"\"\n        Try to infer an Index from the passed data, raise ValueError on failure.\n        \"\"\"\n        index: Index\n        if len(data) == 0:\n            return 
default_index(0)\n    \n        raw_lengths = []\n        indexes: list[list[Hashable] | Index] = []\n    \n        have_raw_arrays = False\n        have_series = False\n        have_dicts = False\n    \n        for val in data:\n            if isinstance(val, ABCSeries):\n                have_series = True\n                indexes.append(val.index)\n            elif isinstance(val, dict):\n                have_dicts = True\n                indexes.append(list(val.keys()))\n            elif is_list_like(val) and getattr(val, \"ndim\", 1) == 1:\n                have_raw_arrays = True\n                raw_lengths.append(len(val))\n            elif isinstance(val, np.ndarray) and val.ndim > 1:\n                raise ValueError(\"Per-column arrays must each be 1-dimensional\")\n    \n        if not indexes and not raw_lengths:\n            raise ValueError(\"If using all scalar values, you must pass an index\")\n    \n        if have_series:\n            index = union_indexes(indexes)\n        elif have_dicts:\n            index = union_indexes(indexes, sort=False)\n    \n        if have_raw_arrays:\n            lengths = list(set(raw_lengths))\n            if len(lengths) > 1:\n>               raise ValueError(\"All arrays must be of the same length\")\nE               ValueError: All arrays must be of the same length\n\n../../../../anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/internals/construction.py:677: ValueError\n========================================================================================= warnings summary ==========================================================================================\ntests/test_dataframe_mapper.py::test_complex_object_df\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:106: SettingWithCopyWarning: \n  A value is trying to be set on a copy of a slice from a DataFrame.\n  Try using .loc[row_indexer,col_indexer] = value instead\n  \n  See the caveats 
in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n    X[col] = X[col].map(lambda img: np.max(img))\n\ntests/test_dataframe_mapper.py::test_sparse_features\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:865: DeprecationWarning: Please use `csr_matrix` from the `scipy.sparse` namespace, the `scipy.sparse.csr` namespace is deprecated.\n    assert type(dmatrix) == sparse.csr.csr_matrix\n\ntests/test_dataframe_mapper.py::test_sparse_off\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:879: DeprecationWarning: Please use `csr_matrix` from the `scipy.sparse` namespace, the `scipy.sparse.csr` namespace is deprecated.\n    assert type(dmatrix) != sparse.csr.csr_matrix\n\ntests/test_transformers.py::test_common_numerical_transformer\ntests/test_transformers.py::test_numerical_transformer_serialization\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/sklearn_pandas/transformers.py:35: DeprecationWarning: \n              NumericalTransformer will be deprecated in 3.0 version.\n              Please use Sklearn.base.TransformerMixin to write\n              customer transformers\n              \n    warnings.warn(\"\"\"\n\n-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html\n====================================================================================== short test summary info ======================================================================================\nFAILED tests/test_dataframe_mapper.py::test_heterogeneous_output_types_input_df - ValueError: All arrays must be of the same length\n============================================================================= 1 failed, 69 passed, 5 warnings in 1.40s ==============================================================================",
    "CodeBase": [
      {
        "path": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
        "content": "1 # -*- coding: utf8 -*-\n2 \n3 import pytest\n4 from unittest.mock import Mock\n5 from pandas import DataFrame\n6 import pandas as pd\n7 from scipy import sparse\n8 from sklearn.datasets import load_iris\n9 from sklearn.pipeline import Pipeline\n10 from sklearn.model_selection import cross_val_score\n11 from sklearn.svm import SVC\n12 from sklearn.feature_extraction.text import CountVectorizer\n13 from sklearn.feature_extraction import DictVectorizer\n14 from sklearn.preprocessing import (\n15     StandardScaler, OneHotEncoder, LabelBinarizer)\n16 from sklearn.impute import SimpleImputer as Imputer\n17 from sklearn.feature_selection import SelectKBest, chi2\n18 from sklearn.base import BaseEstimator, TransformerMixin\n19 import sklearn.decomposition\n20 import numpy as np\n21 from numpy.testing import assert_array_equal\n22 import pickle\n23 from sklearn.compose import make_column_selector\n24 \n25 from sklearn_pandas import DataFrameMapper\n26 from sklearn_pandas.dataframe_mapper import _handle_feature, _build_transformer\n27 from sklearn_pandas.pipeline import TransformerPipeline\n28 \n29 \n30 class MockXTransformer(object):\n31     \"\"\"\n32     Mock transformer that accepts no y argument.\n33     \"\"\"\n34     def fit(self, X):\n35         return self\n36 \n37     def transform(self, X):\n38         return X\n39 \n40 \n41 class MockTClassifier(object):\n42     \"\"\"\n43     Mock transformer/classifier.\n44     \"\"\"\n45     def fit(self, X, y=None):\n46         return self\n47 \n48     def transform(self, X):\n49         return X\n50 \n51     def predict(self, X):\n52         return True\n53 \n54 \n55 class DateEncoder():\n56     def fit(self, X, y=None):\n57         retur(...truncated)"
      },
      {
        "path": "scikit-learn-contrib_sklearn-pandas/test.py",
        "content": "1 import pytest\n2 from unittest.mock import Mock\n3 import numpy as np\n4 import pandas as pd\n5 from sklearn_pandas import DataFrameMapper\n6 from sklearn.compose import make_column_selector\n7 from sklearn.preprocessing import StandardScaler\n8 \n9 \n10 class GetStartWith:\n11     def __init__(self, start_str):\n12         self.start_str = start_str\n13 \n14     def __call__(self, X: pd.DataFrame) -> list:\n15         return [c for c in X.columns if c.startswith(self.start_str)]\n16 \n17 \n18 df = pd.DataFrame({\n19     'sepal length (cm)': [1.0, 2.0, 3.0],\n20     'sepal width (cm)': [1.0, 2.0, 3.0],\n21     'petal length (cm)': [1.0, 2.0, 3.0],\n22     'petal width (cm)': [1.0, 2.0, 3.0]\n23 })\n24 t = DataFrameMapper([\n25     (make_column_selector(dtype_include=float), StandardScaler(), {'alias': 'x'}),\n26     (GetStartWith('petal'), None, {'alias': 'petal'})\n27 ], df_out=True, default=False)\n28 \n29 t.fit(df)\n30 print(t.transform(df).shape)"
      },
      {
        "path": "scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py",
        "content": "1 import tempfile\n2 import pytest\n3 import numpy as np\n4 from pandas import DataFrame\n5 import joblib\n6 \n7 from sklearn_pandas import DataFrameMapper\n8 from sklearn_pandas import NumericalTransformer\n9 \n10 \n11 @pytest.fixture\n12 def simple_dataset():\n13     return DataFrame({\n14         'feat1': [1, 2, 1, 3, 1],\n15         'feat2': [1, 2, 2, 2, 3],\n16         'feat3': [1, 2, 3, 4, 5],\n17     })\n18 \n19 \n20 def test_common_numerical_transformer(simple_dataset):\n21     \"\"\"\n22     Test log transformation\n23     \"\"\"\n24     transfomer = DataFrameMapper([\n25         ('feat1', NumericalTransformer('log'))\n26     ], df_out=True)\n27     df = simple_dataset\n28     outDF = transfomer.fit_transform(df)\n29     assert list(outDF.columns) == ['feat1']\n30     assert np.array_equal(df['feat1'].apply(np.log).values, outDF.feat1.values)\n31 \n32 \n33 def test_numerical_transformer_serialization(simple_dataset):\n34     \"\"\"\n35     Test if you can serialize transformer\n36     \"\"\"\n37     transfomer = DataFrameMapper([\n38         ('feat1', NumericalTransformer('log'))\n39     ])\n40 \n41     df = simple_dataset\n42     transfomer.fit(df)\n43     f = tempfile.NamedTemporaryFile(delete=True)\n44     joblib.dump(transfomer, f.name)\n45     transfomer2 = joblib.load(f.name)\n46     np.array_equal(transfomer.transform(df), transfomer2.transform(df))\n47     f.close()"
      },
      {
        "path": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
        "content": "1 from collections import Counter\n2 \n3 import pytest\n4 import numpy as np\n5 from pandas import DataFrame\n6 from numpy.testing import assert_array_equal\n7 \n8 from sklearn_pandas import DataFrameMapper\n9 from sklearn_pandas.features_generator import gen_features\n10 \n11 \n12 class MockClass(object):\n13 \n14     def __init__(self, value=1, name='class'):\n15         self.value = value\n16         self.name = name\n17 \n18 \n19 class MockTra(...truncated)"
      },
      {
        "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/dataframe_mapper.py",
        "content": "1 import contextlib\n2 from datetime import datetime\n3 import pandas as pd\n4 import numpy as np\n5 from scipy import sparse\n6 from sklearn.base import BaseEstimator, TransformerMixin\n7 from .cross_validation import DataWrapper\n8 from .pipeline import make_transformer_pipeline, _ca(...truncated)"
      },
      {
        "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/transformers.py",
        "content": "1 import numpy as np\n2 import pandas as pd\n3 from sklearn.base import TransformerMixin\n4 import warnings\n5 \n6 \n7 def _get_mask(X, value):\n8     \"\"\"\n9     Compute the boolean mask X == missing_val(...truncated)"
      }
    ],
    "CommitSHA": "c9db2d6dcbf515eade751073f43318e43cae5177"
  },
  "Score": {
    "Difficulty": "Medium",
    "issue_origin": {
      "Title": 6,
      "Description": 7,
      "Reproducibility": 7,
      "Relevance": 5,
      "Explanation": 8,
      "Overall": 7
    },
    "issue_message": {
      "Title": 7,
      "Description": 6,
      "Reproducibility": 6,
      "Relevance": 8,
      "Explanation": 7,
      "Overall": 7
    },
    "issue_ground": {
      "Title": 9,
      "Description": 9,
      "Reproducibility": 9,
      "Relevance": 7,
      "Explanation": 9,
      "Overall": 8
    },
    "issue_ground_truth": {
      "title": "Incorrect Length of Columns in Unit Test Leading to Dimension Mismatch",
      "description": "There is an issue in the `test_heterogeneous_output_types_input_df` unit test within the `scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py` file. The columns 'feat1' and 'feat2' have inconsistent lengths. This causes a dimension mismatch error when the DataFrameMapper attempts to transform the data.\n\nSteps to reproduce:\n1. Run the test suite for `scikit-learn-contrib sklearn-pandas`.\n2. Observe the dimension mismatch error in the `test_heterogeneous_output_types_input_df` test case.\n\nExpected behavior:\nAll columns in the DataFrame should have consistent lengths to avoid any dimension mismatch errors during transformations.\n\nResolution:\nEnsure that 'feat1' and 'feat2' columns in the test DataFrame used within `test_heterogeneous_output_types_input_df` have the same number of entries.",
      "explanation": "### Summary of the Issue\nThere was an issue within the `scikit-learn-contrib_sklearn-pandas` project, specifically in the `test_heterogeneous_output_types_input_df` unit test located in the `scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py` file. The problem involved columns 'feat1' and 'feat2' in a test DataFrame having inconsistent lengths, which led to a dimension mismatch error when the DataFrameMapper attempted to transform the data.\n\n### Cause of the Issue\nIn data processing and machine learning pipelines, it is essential that all columns within a DataFrame maintain consistent lengths. In this particular instance, the 'feat1' column had 6 entries while the 'feat2' column had only 5 entries. When attempting to perform transformations using the DataFrameMapper, this mismatch would cause errors because the transformer expects all columns to have the same number of rows.\n\n### Content of the Commit\nThe commit made to solve this issue involved a simple yet effective change:\n1. Adjusting the length of the 'feat2' column to match the 'feat1' column by adding an additional value to 'feat2'. \n2. By ensuring that both columns have consistent lengths, the dimension mismatch error is resolved, and the DataFrameMapper can successfully transform the data without encountering any errors.\n\n### Explanation of the Solution\n1. **Identification and Diagnosis**: The developer identified that the unit test `test_heterogeneous_output_types_input_df` was failing due to a dimension mismatch error linked to column length inconsistency within the test DataFrame.\n   \n2. **Correction**: To correct this, the developer ensured that the 'feat2' column was lengthened to match the 'feat1' column. The shortest path to resolve this was adding an additional value to the existing 'feat2' entries, making both columns of equal length (6 entries each).\n\n3. 
**Implementation**: Implementing this change would ensure that the DataFrameMapper, which tries to process these columns, does not encounter any dimension-related errors as it now processes columns with consistent lengths.\n\n### How the Commit Solves the Issue\nThe commit fixes the dimension mismatch error by aligning the lengths of 'feat1' and 'feat2' columns within the DataFrame used in the unit test. This ensures that the DataFrame is structurally sound for transformation:\n\n1. **Consistent Lengths**: By making sure both columns have the same number of entries, the commit addresses the root cause of the dimension mismatch error.\n2. **Transformation Stability**: The DataFrameMapper will now be able to handle the DataFrame without encountering runtime errors, resulting in successful transformations during the test.\n3. **Enhanced Test Reliability**: This ensures that the test suite doesn't fail due to simple structural issues, facilitating more reliable and robust testing of the remaining functionality.\n\n### Final Explanation and Solution Summary\nThe core issue was a DataFrame with columns of inconsistent lengths causing a transformation error when processed by the DataFrameMapper in a unit test. The solution involved aligning the lengths of all columns by ensuring 'feat2' had the same number of entries as 'feat1'. This simple but crucial change eliminated the dimension mismatch error, allowing the DataFrameMapper to transform the data successfully. The commit effectively resolves the problem by maintaining structural consistency within the DataFrame, thereby ensuring that the test runs smoothly without error."
    }
  }
}