{
  "RepoName": "https://github.com/scikit-learn-contrib/sklearn-pandas.git",
  "CommitSHA": "c9db2d6dcbf515eade751073f43318e43cae5177",
  "Time": "",
  "Difficulty": "Medium",
  "Type": "indexing error",
  "BuggyCode": [
    {
      "path": "scikit-learn-contrib_sklearn-pandas/test.py",
      "content": "import pytest\nfrom unittest.mock import Mock\nimport numpy as np\nimport pandas as pd\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn.compose import make_column_selector\nfrom sklearn.preprocessing import StandardScaler\n\n\nclass GetStartWith:\n    def __init__(self, start_str):\n        self.start_str = start_str\n\n    def __call__(self, X: pd.DataFrame) -> list:\n        return [c for c in X.columns if c.startswith(self.start_str)]\n\n\ndf = pd.DataFrame({\n    'sepal length (cm)': [1.0, 2.0, 3.0],\n    'sepal width (cm)': [1.0, 2.0, 3.0],\n    'petal length (cm)': [1.0, 2.0, 3.0],\n    'petal width (cm)': [1.0, 2.0, 3.0]\n})\nt = DataFrameMapper([\n    (make_column_selector(dtype_include=float), StandardScaler(), {'alias': 'x'}),\n    (GetStartWith('petal'), None, {'alias': 'petal'})\n], df_out=True, default=False)\n\nt.fit(df)\nprint(t.transform(df).shape)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/setup.py",
      "content": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\nfrom setuptools import setup\nfrom setuptools.command.test import test as TestCommand\nimport re\n\nfor line in open('sklearn_pandas/__init__.py'):\n    match = re.match(\"__version__ *= *'(.*)'\", line)\n    if match:\n        __version__, = match.groups()\n\n\nclass PyTest(TestCommand):\n    user_options = [('pytest-args=', 'a', \"Arguments to pass to py.test\")]\n\n    def initialize_options(self):\n        TestCommand.initialize_options(self)\n        self.pytest_args = []\n\n    def finalize_options(self):\n        TestCommand.finalize_options(self)\n        self.test_args = []\n        self.test_suite = True\n\n    def run(self):\n        import pytest\n        errno = pytest.main(self.pytest_args)\n        raise SystemExit(errno)\n\n\nsetup(name='sklearn-pandas',\n      version=__version__,\n      description='Pandas integration with sklearn',\n      maintainer='Ritesh Agrawal',\n      maintainer_email='ragrawal@gmail.com',\n      url='https://github.com/scikit-learn-contrib/sklearn-pandas',\n      packages=['sklearn_pandas'],\n      keywords=['scikit', 'sklearn', 'pandas'],\n      install_requires=[\n          'scikit-learn>=0.23.0',\n          'scipy>=1.5.1',\n          'pandas>=1.1.4',\n          'numpy>=1.18.1'\n      ],\n      tests_require=['pytest', 'mock'],\n      cmdclass={'test': PyTest},\n      license='MIT License'\n)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/noxfile.py",
      "content": "import nox\n\n\n@nox.session\ndef lint(session):\n    session.install('pytest>=5.3.5', 'setuptools>=45.2',\n                    'wheel>=0.34.2', 'flake8>=3.7.9',\n                    'numpy==1.18.1', 'pandas==1.1.4')\n    session.install('.')\n    session.run('flake8', 'sklearn_pandas/', 'tests')\n\n\n@nox.session\n@nox.parametrize('numpy', ['1.18.1', '1.19.4', '1.20.1'])\n@nox.parametrize('scipy', ['1.5.4', '1.6.0'])\n@nox.parametrize('pandas', ['1.1.4', '1.2.2'])\ndef tests(session, numpy, scipy, pandas):\n    session.install('pytest>=5.3.5',\n                    'setuptools>=45.2',\n                    'wheel>=0.34.2',\n                    f'numpy=={numpy}',\n                    f'scipy=={scipy}',\n                    f'pandas=={pandas}'\n                    )\n    session.install('.')\n    session.run('py.test', 'README.rst', 'tests')\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/features_generator.py",
      "content": "def gen_features(columns, classes=None, prefix='', suffix=''):\n    \"\"\"Generates a feature definition list which can be passed\n    into DataFrameMapper\n\n    Params:\n\n    columns     a list of column names to generate features for.\n\n    classes     a list of classes for each feature, a list of dictionaries with\n                transformer class and init parameters, or None.\n\n                If list of classes is provided, then each of them is\n                instantiated with default arguments. Example:\n\n                    classes = [StandardScaler, LabelBinarizer]\n\n                If list of dictionaries is provided, then each of them should\n                have a 'class' key with transformer class. All other keys are\n                passed into 'class' value constructor. Example:\n\n                    classes = [\n                        {'class': StandardScaler, 'with_mean': False},\n                        {'class': LabelBinarizer}\n                    }]\n\n                If None value selected, then each feature left as is.\n\n    prefix      add prefix to transformed column names\n\n    suffix      add suffix to transformed column names.\n\n    \"\"\"\n    if classes is None:\n        return [(column, None) for column in columns]\n\n    feature_defs = []\n\n    for column in columns:\n        feature_transformers = []\n\n        arguments = {}\n        if prefix and prefix != \"\":\n            arguments['prefix'] = prefix\n        if suffix and suffix != \"\":\n            arguments['suffix'] = suffix\n\n        classes = [cls for cls in classes if cls is not None]\n        if not classes:\n            feature_defs.append((column, None, arguments))\n\n        else:\n            for definition in classes:\n                if isinstance(definition, dict):\n                    params = definition.copy()\n                    klass = params.pop('class')\n                    feature_transformers.append(klass(**params))\n    
            else:\n                    feature_transformers.append(definition())\n\n            if not feature_transformers:\n                feature_transformers = None\n\n            feature_defs.append((column, feature_transformers, arguments))\n\n    return feature_defs\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/transformers.py",
      "content": "import numpy as np\nimport pandas as pd\nfrom sklearn.base import TransformerMixin\nimport warnings\n\n\ndef _get_mask(X, value):\n    \"\"\"\n    Compute the boolean mask X == missing_values.\n    \"\"\"\n    if value == \"NaN\" or \\\n       value is None or \\\n       (isinstance(value, float) and np.isnan(value)):\n        return pd.isnull(X)\n    else:\n        return X == value\n\n\nclass NumericalTransformer(TransformerMixin):\n    \"\"\"\n    Provides commonly used numerical transformers.\n    \"\"\"\n    SUPPORTED_FUNCTIONS = ['log', 'log1p']\n\n    def __init__(self, func):\n        \"\"\"\n        Params\n\n        func    function to apply to input columns. The function will be\n                applied to each value. Supported functions are defined\n                in SUPPORTED_FUNCTIONS variable. Raises an AssertionError\n                if the function is not supported.\n        \"\"\"\n\n        warnings.warn(\"\"\"\n            NumericalTransformer will be deprecated in 3.0 version.\n            Please use sklearn.base.TransformerMixin to write\n            custom transformers\n            \"\"\", DeprecationWarning)\n\n        assert func in self.SUPPORTED_FUNCTIONS, \\\n            f\"Only following func are supported: {self.SUPPORTED_FUNCTIONS}\"\n        super(NumericalTransformer, self).__init__()\n        self.__func = func\n\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X, y=None):\n        if self.__func == 'log1p':\n            return np.vectorize(np.log1p)(X)\n        elif self.__func == 'log':\n            return np.vectorize(np.log)(X)\n\n        raise ValueError(f\"Invalid function name: {self.__func}\")\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/__init__.py",
      "content": "__version__ = '2.2.0'\n\nimport logging\nlogger = logging.getLogger(__name__)\n\nfrom .dataframe_mapper import DataFrameMapper  # NOQA\nfrom .features_generator import gen_features  # NOQA\nfrom .transformers import NumericalTransformer # NOQA\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/pipeline.py",
      "content": "import six\nfrom sklearn.pipeline import _name_estimators, Pipeline\nfrom sklearn.utils import tosequence\n\n\ndef _call_fit(fit_method, X, y=None, **kwargs):\n    \"\"\"\n    helper function, calls the fit or fit_transform method with the correct\n    number of parameters\n\n    fit_method: fit or fit_transform method of the transformer\n    X: the data to fit\n    y: the target vector relative to X, optional\n    kwargs: any keyword arguments to the fit method\n\n    return: the result of the fit or fit_transform method\n\n    WARNING: if this function raises a TypeError exception, test the fit\n    or fit_transform method passed to it in isolation as _call_fit will not\n    distinguish TypeError due to incorrect number of arguments from\n    other TypeError\n    \"\"\"\n    try:\n        return fit_method(X, y, **kwargs)\n    except TypeError:\n        # fit takes only one argument\n        return fit_method(X, **kwargs)\n\n\nclass TransformerPipeline(Pipeline):\n    \"\"\"\n    Pipeline that expects all steps to be transformers taking a single X\n    argument, an optional y argument, and having fit and transform methods.\n\n    Code is copied from sklearn's Pipeline\n    \"\"\"\n\n    def __init__(self, steps):\n        names, estimators = zip(*steps)\n        if len(dict(steps)) != len(steps):\n            raise ValueError(\n                \"Provided step names are not unique: %s\" % (names,))\n\n        # shallow copy of steps\n        self.steps = tosequence(steps)\n        estimator = estimators[-1]\n\n        for e in estimators:\n            if (not (hasattr(e, \"fit\") or hasattr(e, \"fit_transform\")) or not\n                    hasattr(e, \"transform\")):\n                raise TypeError(\"All steps of the chain should \"\n                                \"be transforms and implement fit and transform\"\n                                \" '%s' (type %s) doesn't)\" % (e, type(e)))\n\n        if not hasattr(estimator, \"fit\"):\n       
     raise TypeError(\"Last step of chain should implement fit \"\n                            \"'%s' (type %s) doesn't)\"\n                            % (estimator, type(estimator)))\n\n    def _pre_transform(self, X, y=None, **fit_params):\n        fit_params_steps = dict((step, {}) for step, _ in self.steps)\n        for pname, pval in six.iteritems(fit_params):\n            step, param = pname.split('__', 1)\n            fit_params_steps[step][param] = pval\n        Xt = X\n        for name, transform in self.steps[:-1]:\n            if hasattr(transform, \"fit_transform\"):\n                Xt = _call_fit(transform.fit_transform,\n                               Xt, y, **fit_params_steps[name])\n            else:\n                Xt = _call_fit(transform.fit,\n                               Xt, y, **fit_params_steps[name]).transform(Xt)\n        return Xt, fit_params_steps[self.steps[-1][0]]\n\n    def fit(self, X, y=None, **fit_params):\n        Xt, fit_params = self._pre_transform(X, y, **fit_params)\n        _call_fit(self.steps[-1][-1].fit, Xt, y, **fit_params)\n        return self\n\n    def fit_transform(self, X, y=None, **fit_params):\n        Xt, fit_params = self._pre_transform(X, y, **fit_params)\n        if hasattr(self.steps[-1][-1], 'fit_transform'):\n            return _call_fit(self.steps[-1][-1].fit_transform,\n                             Xt, y, **fit_params)\n        else:\n            return _call_fit(self.steps[-1][-1].fit,\n                             Xt, y, **fit_params).transform(Xt)\n\n\ndef make_transformer_pipeline(*steps):\n    \"\"\"Construct a TransformerPipeline from the given estimators.\n    \"\"\"\n    return TransformerPipeline(_name_estimators(steps))\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/dataframe_mapper.py",
      "content": "import contextlib\nfrom datetime import datetime\nimport pandas as pd\nimport numpy as np\nfrom scipy import sparse\nfrom sklearn.base import BaseEstimator, TransformerMixin\nfrom .cross_validation import DataWrapper\nfrom .pipeline import make_transformer_pipeline, _call_fit, TransformerPipeline\nfrom . import logger\n\nstring_types = text_type = str\n\n\ndef _handle_feature(fea):\n    \"\"\"\n    Convert 1-dimensional arrays to 2-dimensional column vectors.\n    \"\"\"\n    if len(fea.shape) == 1:\n        fea = np.array([fea]).T\n\n    return fea\n\n\ndef _build_transformer(transformers):\n    if isinstance(transformers, list):\n        transformers = make_transformer_pipeline(*transformers)\n    return transformers\n\n\ndef _build_feature(columns, transformers, options={}, X=None):\n    if X is None:\n        return (columns, _build_transformer(transformers), options)\n    return (\n        columns(X) if callable(columns) else columns,\n        _build_transformer(transformers),\n        options\n    )\n\n\ndef _elapsed_secs(t1):\n    return (datetime.now()-t1).total_seconds()\n\n\ndef _get_feature_names(estimator):\n    \"\"\"\n    Attempt to extract feature names based on a given estimator\n    \"\"\"\n    if hasattr(estimator, 'classes_'):\n        return estimator.classes_\n    elif hasattr(estimator, 'get_feature_names'):\n        return estimator.get_feature_names()\n    return None\n\n\n@contextlib.contextmanager\ndef add_column_names_to_exception(column_names):\n    # Stolen from https://stackoverflow.com/a/17677938/356729\n    try:\n        yield\n    except Exception as ex:\n        if ex.args:\n            msg = u'{}: {}'.format(column_names, ex.args[0])\n        else:\n            msg = text_type(column_names)\n        ex.args = (msg,) + ex.args[1:]\n        raise\n\n\nclass DataFrameMapper(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Map Pandas data frame column subsets to their own\n    sklearn transformation.\n    
\"\"\"\n\n    def __init__(self, features, default=False, sparse=False, df_out=False,\n                 input_df=False, drop_cols=None):\n        \"\"\"\n        Params:\n\n        features    a list of tuples with features definitions.\n                    The first element is the pandas column selector. This can\n                    be a string (for one column) or a list of strings.\n                    The second element is an object that supports\n                    sklearn's transform interface, or a list of such objects\n                    The third element is optional and, if present, must be\n                    a dictionary with the options to apply to the\n                    transformation. Example: {'alias': 'day_of_week'}\n\n        default     default transformer to apply to the columns not\n                    explicitly selected in the mapper. If False (default),\n                    discard them. If None, pass them through untouched. Any\n                    other transformer will be applied to all the unselected\n                    columns as a whole, taken as a 2d-array.\n\n        sparse      will return sparse matrix if set True and any of the\n                    extracted features is sparse. Defaults to False.\n\n        df_out      return a pandas data frame, with each column named using\n                    the pandas column that created it (if there's only one\n                    input and output) or the input columns joined with '_'\n                    if there's multiple inputs, and the name concatenated with\n                    '_1', '_2' etc if there's multiple outputs. NB: does not\n                    work if *default* or *sparse* are true\n\n        input_df    If ``True`` pass the selected columns to the transformers\n                    as a pandas DataFrame or Series. Otherwise pass them as a\n                    numpy array. Defaults to ``False``.\n\n        drop_cols   List of columns to be dropped. 
Defaults to None.\n\n        \"\"\"\n        self.features = features\n        self.default = default\n        self.built_default = None\n        self.sparse = sparse\n        self.df_out = df_out\n        self.input_df = input_df\n        self.drop_cols = [] if drop_cols is None else drop_cols\n        self.transformed_names_ = []\n        if (df_out and (sparse or default)):\n            raise ValueError(\"Can not use df_out with sparse or default\")\n\n    def _build(self, X=None):\n        \"\"\"\n        Build attributes built_features and built_default.\n        \"\"\"\n        if isinstance(self.features, list):\n            self.built_features = [\n                _build_feature(*f, X=X) for f in self.features\n            ]\n        else:\n            self.built_features = _build_feature(*self.features, X=X)\n        self.built_default = _build_transformer(self.default)\n\n    @property\n    def _selected_columns(self):\n        \"\"\"\n        Return a set of selected columns in the feature list.\n        \"\"\"\n        selected_columns = set()\n        for feature in self.features:\n            columns = feature[0]\n            if isinstance(columns, list):\n                selected_columns = selected_columns.union(set(columns))\n            else:\n                selected_columns.add(columns)\n        return selected_columns\n\n    def _unselected_columns(self, X):\n        \"\"\"\n        Return list of columns present in X and not selected explicitly in the\n        mapper.\n\n        Unselected columns are returned in the order they appear in the\n        dataframe to avoid issues with different ordering during default fit\n        and transform steps.\n        \"\"\"\n        X_columns = list(X.columns)\n        return [column for column in X_columns if\n                column not in self._selected_columns\n                and column not in self.drop_cols]\n\n    def __setstate__(self, state):\n        # compatibility for older versions of 
sklearn-pandas\n        super().__setstate__(state)\n        self.features = [_build_feature(*feat) for feat in state['features']]\n        self.sparse = state.get('sparse', False)\n        self.default = state.get('default', False)\n        self.df_out = state.get('df_out', False)\n        self.input_df = state.get('input_df', False)\n        self.drop_cols = state.get('drop_cols', [])\n        self.built_features = state.get('built_features', self.features)\n        self.built_default = state.get('built_default', self.default)\n        self.transformed_names_ = state.get('transformed_names_', [])\n\n    def __getstate__(self):\n        state = super().__getstate__()\n        state['features'] = self.features\n        state['sparse'] = self.sparse\n        state['default'] = self.default\n        state['df_out'] = self.df_out\n        state['input_df'] = self.input_df\n        state['drop_cols'] = self.drop_cols\n        state['build_features'] = getattr(self, 'built_features', None)\n        state['built_default'] = self.built_default\n        state['transformed_names_'] = self.transformed_names_\n        return state\n\n    def _get_col_subset(self, X, cols, input_df=False):\n        \"\"\"\n        Get a subset of columns from the given table X.\n\n        X       a Pandas dataframe; the table to select columns from\n        cols    a string or list of strings representing the columns to select.\n                It can also be a callable that returns True or False, i.e.\n                compatible with the built-in filter function.\n\n        Returns a numpy array with the data from the selected columns\n        \"\"\"\n\n        if isinstance(cols, string_types):\n            return_vector = True\n            cols = [cols]\n        else:\n            return_vector = False\n\n        # Needed when using the cross-validation compatibility\n        # layer for sklearn<0.16.0.\n        # Will be dropped on sklearn-pandas 2.0.\n        if isinstance(X, list):\n     
       X = [x[cols] for x in X]\n            X = pd.DataFrame(X)\n\n        elif isinstance(X, DataWrapper):\n            X = X.df  # fetch underlying data\n\n        if return_vector:\n            t = X[cols[0]]\n        else:\n            t = X[cols]\n\n        # return either a DataFrame/Series or a numpy array\n        if input_df:\n            return t\n        else:\n            return t.values\n\n    def fit(self, X, y=None):\n        \"\"\"\n        Fit a transformation from the pipeline\n\n        X       the data to fit\n\n        y       the target vector relative to X, optional\n\n        \"\"\"\n        self._build(X=X)\n\n        for columns, transformers, options in self.built_features:\n            t1 = datetime.now()\n            input_df = options.get('input_df', self.input_df)\n\n            if transformers is not None:\n                with add_column_names_to_exception(columns):\n                    Xt = self._get_col_subset(X, columns, input_df)\n                    _call_fit(transformers.fit, Xt, y)\n            logger.info(f\"[FIT] {columns}: {_elapsed_secs(t1)} secs\")\n\n        # handle features not explicitly selected\n        if self.built_default:  # not False and not None\n            unsel_cols = self._unselected_columns(X)\n            with add_column_names_to_exception(unsel_cols):\n                Xt = self._get_col_subset(X, unsel_cols, self.input_df)\n                _call_fit(self.built_default.fit, Xt, y)\n        return self\n\n    def get_names(self, columns, transformer, x, alias=None, prefix='',\n                  suffix=''):\n        \"\"\"\n        Return verbose names for the transformed columns.\n\n        columns       name (or list of names) of the original column(s)\n        transformer   transformer - can be a TransformerPipeline\n        x             transformed columns (numpy.ndarray)\n        alias         base name to use for the selected columns\n        \"\"\"\n        if alias is not None:\n            name 
= alias\n        elif isinstance(columns, list):\n            name = '_'.join(map(str, columns))\n        else:\n            name = columns\n        num_cols = x.shape[1] if len(x.shape) > 1 else 1\n\n        output = []\n\n        if num_cols > 1:\n            # If there are as many columns as classes in the transformer,\n            # infer column names from classes names.\n\n            # If we are dealing with multiple transformers for these columns\n            # attempt to extract the names from each of them, starting from the\n            # last one\n            if isinstance(transformer, TransformerPipeline):\n                inverse_steps = transformer.steps[::-1]\n                estimators = (estimator for name, estimator in inverse_steps)\n                names_steps = (_get_feature_names(e) for e in estimators)\n                names = next((n for n in names_steps if n is not None), None)\n            # Otherwise use the only estimator present\n            else:\n                names = _get_feature_names(transformer)\n\n            if names is not None and len(names) == num_cols:\n                output = [f\"{name}_{o}\" for o in names]\n                # otherwise, return name concatenated with '_1', '_2', etc.\n            else:\n                output = [name + '_' + str(o) for o in range(num_cols)]\n        else:\n            output = [name]\n\n        if prefix == suffix == \"\":\n            return output\n\n        return ['{}{}{}'.format(prefix, x, suffix) for x in output]\n\n    def get_dtypes(self, extracted):\n        dtypes_features = [self.get_dtype(ex) for ex in extracted]\n        return [dtype for dtype_feature in dtypes_features\n                for dtype in dtype_feature]\n\n    def get_dtype(self, ex):\n        if isinstance(ex, np.ndarray) or sparse.issparse(ex):\n            return [ex.dtype] * ex.shape[1]\n        elif isinstance(ex, pd.DataFrame):\n            return list(ex.dtypes)\n        else:\n            raise 
TypeError(type(ex))\n\n    def _transform(self, X, y=None, do_fit=False):\n        \"\"\"\n        Transform the given data with possibility to fit in advance.\n        Avoids code duplication for implementation of transform and\n        fit_transform.\n        \"\"\"\n        if do_fit:\n            self._build(X=X)\n\n        extracted = []\n        transformed_names_ = []\n        for columns, transformers, options in self.built_features:\n            input_df = options.get('input_df', self.input_df)\n\n            # columns could be a string or list of\n            # strings; we don't care because pandas\n            # will handle either.\n            Xt = self._get_col_subset(X, columns, input_df)\n\n            if transformers is not None:\n                with add_column_names_to_exception(columns):\n                    if do_fit and hasattr(transformers, 'fit_transform'):\n                        t1 = datetime.now()\n                        Xt = _call_fit(transformers.fit_transform, Xt, y)\n                        logger.info(f\"[FIT_TRANSFORM] {columns}: {_elapsed_secs(t1)} secs\")  # NOQA\n                    else:\n                        if do_fit:\n                            t1 = datetime.now()\n                            _call_fit(transformers.fit, Xt, y)\n                            logger.info(\n                                f\"[FIT] {columns}: {_elapsed_secs(t1)} secs\")\n\n                        t1 = datetime.now()\n                        Xt = transformers.transform(Xt)\n                        logger.info(f\"[TRANSFORM] {columns}: {_elapsed_secs(t1)} secs\")  # NOQA\n\n            extracted.append(_handle_feature(Xt))\n\n            alias = options.get('alias')\n\n            prefix = options.get('prefix', '')\n            suffix = options.get('suffix', '')\n\n            transformed_names_ += self.get_names(\n                columns, transformers, Xt, alias, prefix, suffix)\n\n        # handle features not explicitly selected\n        if 
self.built_default is not False:\n            unsel_cols = self._unselected_columns(X)\n            Xt = self._get_col_subset(X, unsel_cols, self.input_df)\n            if self.built_default is not None:\n                with add_column_names_to_exception(unsel_cols):\n                    if do_fit and hasattr(self.built_default, 'fit_transform'):\n                        Xt = _call_fit(self.built_default.fit_transform, Xt, y)\n                    else:\n                        if do_fit:\n                            _call_fit(self.built_default.fit, Xt, y)\n                        Xt = self.built_default.transform(Xt)\n                transformed_names_ += self.get_names(\n                    unsel_cols, self.built_default, Xt)\n            else:\n                # if not applying a default transformer,\n                # keep column names unmodified\n                transformed_names_ += unsel_cols\n\n            extracted.append(_handle_feature(Xt))\n\n        self.transformed_names_ = transformed_names_\n\n        # combine the feature outputs into one array.\n        # at this point we lose track of which features\n        # were created from which input columns, so it's\n        # assumed that that doesn't matter to the model.\n\n        # If any of the extracted features is sparse, combine sparsely.\n        # Otherwise, combine as normal arrays.\n        if any(sparse.issparse(fea) for fea in extracted):\n            stacked = sparse.hstack(extracted).tocsr()\n            # return a sparse matrix only if the mapper was initialized\n            # with sparse=True\n            if not self.sparse:\n                stacked = stacked.toarray()\n        else:\n            stacked = np.hstack(extracted)\n\n        if self.df_out:\n            # if no rows were dropped preserve the original index,\n            # otherwise use a new integer one\n            no_rows_dropped = len(X) == len(stacked)\n            if no_rows_dropped:\n                index = X.index\n   
         else:\n                index = None\n\n            # output different data types, if appropriate\n            dtypes = self.get_dtypes(extracted)\n            df_out = pd.DataFrame(\n                stacked,\n                columns=self.transformed_names_,\n                index=index)\n            # preserve types\n            for col, dtype in zip(self.transformed_names_, dtypes):\n                df_out[col] = df_out[col].astype(dtype)\n            return df_out\n        else:\n            return stacked\n\n    def transform(self, X):\n        \"\"\"\n        Transform the given data. Assumes that fit has already been called.\n\n        X       the data to transform\n        \"\"\"\n        return self._transform(X)\n\n    def fit_transform(self, X, y=None):\n        \"\"\"\n        Fit a transformation from the pipeline and directly apply\n        it to the given data.\n\n        X       the data to fit\n\n        y       the target vector relative to X, optional\n        \"\"\"\n        return self._transform(X, y, True)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/cross_validation.py",
      "content": "class DataWrapper(object):\n\n    def __init__(self, df):\n        self.df = df\n\n    def __len__(self):\n        return len(self.df)\n\n    def __getitem__(self, key):\n        return self.df.iloc[key]\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_pipeline.py",
      "content": "import pytest\nfrom sklearn_pandas.pipeline import TransformerPipeline, _call_fit\n\n# In py3, mock is included with the unittest standard library\n# In py2, it's a separate package\ntry:\n    from unittest.mock import patch\nexcept ImportError:\n    from mock import patch\n\n\nclass NoTransformT(object):\n    \"\"\"Transformer without transform method.\n    \"\"\"\n    def fit(self, x):\n        return self\n\n\nclass NoFitT(object):\n    \"\"\"Transformer without fit method.\n    \"\"\"\n    def transform(self, x):\n        return self\n\n\nclass Trans(object):\n    \"\"\"\n    Transformer with fit and transform methods\n    \"\"\"\n    def fit(self, x, y=None):\n        return self\n\n    def transform(self, x):\n        return self\n\n\ndef func_x_y(x, y, kwarg='kwarg'):\n    \"\"\"\n    Function with required x and y arguments\n    \"\"\"\n    return\n\n\ndef func_x(x, kwarg='kwarg'):\n    \"\"\"\n    Function with required x argument\n    \"\"\"\n    return\n\n\ndef func_raise_type_err(x, y, kwarg='kwarg'):\n    \"\"\"\n    Function with required x and y arguments,\n    raises TypeError\n    \"\"\"\n    raise TypeError\n\n\ndef test_all_steps_fit_transform():\n    \"\"\"\n    All steps must implement fit and transform. 
Otherwise, raise TypeError.\n    \"\"\"\n    with pytest.raises(TypeError):\n        TransformerPipeline([('svc', NoTransformT())])\n\n    with pytest.raises(TypeError):\n        TransformerPipeline([('svc', NoFitT())])\n\n\n@patch.object(Trans, 'fit', side_effect=func_x_y)\ndef test_called_with_x_and_y(mock_fit):\n    \"\"\"\n    Fit method with required X and y arguments is called with both and with\n    any additional keywords\n    \"\"\"\n    _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n    mock_fit.assert_called_with('X', 'y', kwarg='kwarg')\n\n\n@patch.object(Trans, 'fit', side_effect=func_x)\ndef test_called_with_x(mock_fit):\n    \"\"\"\n    Fit method with a required X arguments is called with it and with\n    any additional keywords\n    \"\"\"\n    _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n    mock_fit.assert_called_with('X', kwarg='kwarg')\n\n    _call_fit(Trans().fit, 'X', kwarg='kwarg')\n    mock_fit.assert_called_with('X', kwarg='kwarg')\n\n\n@patch.object(Trans, 'fit', side_effect=func_raise_type_err)\ndef test_raises_type_error(mock_fit):\n    \"\"\"\n    If a fit method with required X and y arguments raises a TypeError, it's\n    re-raised (for a different reason) when it's called with one argument\n    \"\"\"\n    with pytest.raises(TypeError):\n        _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py",
      "content": "import tempfile\nimport pytest\nimport numpy as np\nfrom pandas import DataFrame\nimport joblib\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas import NumericalTransformer\n\n\n@pytest.fixture\ndef simple_dataset():\n    return DataFrame({\n        'feat1': [1, 2, 1, 3, 1],\n        'feat2': [1, 2, 2, 2, 3],\n        'feat3': [1, 2, 3, 4, 5],\n    })\n\n\ndef test_common_numerical_transformer(simple_dataset):\n    \"\"\"\n    Test log transformation\n    \"\"\"\n    transformer = DataFrameMapper([\n        ('feat1', NumericalTransformer('log'))\n    ], df_out=True)\n    df = simple_dataset\n    outDF = transformer.fit_transform(df)\n    assert list(outDF.columns) == ['feat1']\n    assert np.array_equal(df['feat1'].apply(np.log).values, outDF.feat1.values)\n\n\ndef test_numerical_transformer_serialization(simple_dataset):\n    \"\"\"\n    Test if you can serialize transformer\n    \"\"\"\n    transformer = DataFrameMapper([\n        ('feat1', NumericalTransformer('log'))\n    ])\n\n    df = simple_dataset\n    transformer.fit(df)\n    f = tempfile.NamedTemporaryFile(delete=True)\n    joblib.dump(transformer, f.name)\n    transformer2 = joblib.load(f.name)\n    # the round-tripped mapper must produce the same output as the original\n    assert np.array_equal(transformer.transform(df), transformer2.transform(df))\n    f.close()\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
      "content": "# -*- coding: utf8 -*-\n\nimport pytest\nfrom unittest.mock import Mock\nfrom pandas import DataFrame\nimport pandas as pd\nfrom scipy import sparse\nfrom sklearn.datasets import load_iris\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.model_selection import cross_val_score\nfrom sklearn.svm import SVC\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom sklearn.feature_extraction import DictVectorizer\nfrom sklearn.preprocessing import (\n    StandardScaler, OneHotEncoder, LabelBinarizer)\nfrom sklearn.impute import SimpleImputer as Imputer\nfrom sklearn.feature_selection import SelectKBest, chi2\nfrom sklearn.base import BaseEstimator, TransformerMixin\nimport sklearn.decomposition\nimport numpy as np\nfrom numpy.testing import assert_array_equal\nimport pickle\nfrom sklearn.compose import make_column_selector\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas.dataframe_mapper import _handle_feature, _build_transformer\nfrom sklearn_pandas.pipeline import TransformerPipeline\n\n\nclass MockXTransformer(object):\n    \"\"\"\n    Mock transformer that accepts no y argument.\n    \"\"\"\n    def fit(self, X):\n        return self\n\n    def transform(self, X):\n        return X\n\n\nclass MockTClassifier(object):\n    \"\"\"\n    Mock transformer/classifier.\n    \"\"\"\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        return X\n\n    def predict(self, X):\n        return True\n\n\nclass DateEncoder():\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        dt = X.dt\n        return pd.concat([dt.year, dt.month, dt.day], axis=1)\n\n\nclass ToSparseTransformer(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Transforms numpy matrix to sparse format.\n    \"\"\"\n    def fit(self, X):\n        return self\n\n    def transform(self, X):\n        return sparse.csr_matrix(X)\n\n\nclass CustomTransformer(BaseEstimator, TransformerMixin):\n  
  \"\"\"\n    Example of transformer in which the number of classes\n    is not equals to the number of output columns.\n    \"\"\"\n    def fit(self, X, y=None):\n        self.min = X.min()\n        self.classes_ = np.unique(X)\n        return self\n\n    def transform(self, X):\n        classes = np.unique(X)\n        if len(np.setdiff1d(classes, self.classes_)) > 0:\n            raise ValueError('Unknown values found.')\n        return X - self.min\n\n\nclass MockImageTransformer(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Example transformer that takes the max of a 2d vector\n    then scales the result.\n    \"\"\"\n    def __init__(self, multiplier=10.0):\n        self.multiplier = multiplier\n\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        assert isinstance(X, pd.DataFrame)\n        for col in X.columns:\n            X[col] = X[col].map(lambda img: np.max(img))\n        return X * self.multiplier\n\n\n@pytest.fixture\ndef simple_dataframe():\n    return pd.DataFrame({'a': [1, 2, 3]})\n\n\n@pytest.fixture\ndef complex_dataframe():\n    return pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'c'],\n                         'feat1': [1, 2, 3, 4, 5, 6],\n                         'feat2': [1, 2, 3, 2, 3, 4]})\n\n\n@pytest.fixture\ndef complex_object_dataframe():\n    return pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'c'],\n                         'feat1': [1, 2, 3, 4, 5, 6],\n                         'feat2': [1, 2, 3, 2, 3, 4],\n                         'img2d': [1*np.eye(2), 2*np.eye(2), 3*np.eye(2),\n                                   4*np.eye(2), 5*np.eye(2), 6*np.eye(2)]})\n\n\n@pytest.fixture\ndef multiindex_dataframe():\n    \"\"\"Example MultiIndex DataFrame, taken from pandas documentation\n    \"\"\"\n    iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]\n    index = pd.MultiIndex.from_product(iterables, names=['first', 'second'])\n    df = pd.DataFrame(np.random.randn(10, 8), 
columns=index)\n    return df\n\n\n@pytest.fixture\ndef multiindex_dataframe_incomplete(multiindex_dataframe):\n    \"\"\"Example MultiIndex DataFrame with missing entries\n    \"\"\"\n    df = multiindex_dataframe\n    mask_array = np.zeros(df.size)\n    mask_array[:20] = 1\n    np.random.shuffle(mask_array)\n    mask = mask_array.reshape(df.shape).astype(bool)\n    df.mask(mask, inplace=True)\n    return df\n\n\ndef test_transformed_names_simple(simple_dataframe):\n    \"\"\"\n    Get transformed names of features in `transformed_names` attribute\n    for simple transformation\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None)])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_transformed_names_binarizer(complex_dataframe):\n    \"\"\"\n    Get transformed names of features in `transformed_names` attribute\n    for a transformation that multiplies the number of columns\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['target_a', 'target_b', 'target_c']\n\n\ndef test_logging(caplog, complex_dataframe):\n    \"\"\"\n    Get transformed names of features in `transformed_names` attribute\n    for a transformation that multiplies the number of columns\n    \"\"\"\n    import logging\n    logger = logging.getLogger('sklearn_pandas')\n    logger.setLevel(logging.INFO)\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    assert '[FIT_TRANSFORM] target:' in caplog.text\n\n\ndef test_transformed_names_binarizer_unicode():\n    df = pd.DataFrame({'target': [u'ñ', u'á', u'é']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    expected_names = {u'target_ñ', u'target_á', u'target_é'}\n    assert set(mapper.transformed_names_) == expected_names\n\n\ndef 
test_transformed_names_transformers_list(complex_dataframe):\n    \"\"\"\n    When using a list of transformers, use them in inverse order to get the\n    transformed names\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([\n        ('target', [LabelBinarizer(), MockXTransformer()])\n    ])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['target_a', 'target_b', 'target_c']\n\n\ndef test_transformed_names_simple_alias(simple_dataframe):\n    \"\"\"\n    If we specify an alias for a single output column, it is used for the\n    output\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None, {'alias': 'new_name'})])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['new_name']\n\n\ndef test_transformed_names_complex_alias(complex_dataframe):\n    \"\"\"\n    If we specify an alias for a multiple output column, it is used for the\n    output\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer(), {'alias': 'new'})])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['new_a', 'new_b', 'new_c']\n\n\ndef test_exception_column_context_transform(simple_dataframe):\n    \"\"\"\n    If an exception is raised when transforming a column,\n    the exception includes the name of the column being transformed\n    \"\"\"\n    class FailingTransformer(object):\n        def fit(self, X):\n            pass\n\n        def transform(self, X):\n            raise Exception('Some exception')\n\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', FailingTransformer())])\n    mapper.fit(df)\n\n    with pytest.raises(Exception, match='a: Some exception'):\n        mapper.transform(df)\n\n\ndef test_exception_column_context_fit(simple_dataframe):\n    \"\"\"\n    If an exception is raised when fit a column,\n    the exception includes the name of the column being fitted\n    \"\"\"\n    class FailingFitter(object):\n        def 
fit(self, X):\n            raise Exception('Some exception')\n\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', FailingFitter())])\n\n    with pytest.raises(Exception, match='a: Some exception'):\n        mapper.fit(df)\n\n\ndef test_simple_df(simple_dataframe):\n    \"\"\"\n    Get a dataframe from a simple mapped dataframe\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None)], df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert type(transformed) == pd.DataFrame\n    assert len(transformed[\"a\"]) == len(simple_dataframe[\"a\"])\n\n\ndef test_complex_df(complex_dataframe):\n    \"\"\"\n    Get a dataframe from a complex mapped dataframe\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [('target', None), ('feat1', None), ('feat2', None)],\n        df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_dataframe)\n    for c in df.columns:\n        assert len(transformed[c]) == len(df[c])\n\n\ndef test_complex_object_df(complex_object_dataframe):\n    \"\"\"\n    Get a dataframe from a complex dataframe with 2d features\n    \"\"\"\n    df = complex_object_dataframe\n    img_scale = 10\n    mapper = DataFrameMapper(\n        [('target', None), ('feat1', None),\n         (make_column_selector('feat2'), StandardScaler()),\n         (make_column_selector('img2d'), MockImageTransformer(img_scale))],\n        df_out=True, input_df=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_object_dataframe)\n    assert np.isclose(\n        np.sum(transformed['img2d']),\n        np.max(np.sum(df['img2d'])) * img_scale, atol=1e-12)\n\n\ndef test_numeric_column_names(complex_dataframe):\n    \"\"\"\n    Get a dataframe from a complex mapped dataframe with numeric column names\n    \"\"\"\n    df = complex_dataframe\n    df.columns = [0, 1, 2]\n    mapper = DataFrameMapper(\n        [(0, None), (1, None), 
(2, None)], df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_dataframe)\n    for c in df.columns:\n        assert len(transformed[c]) == len(df[c])\n\n\ndef test_multiindex_df(multiindex_dataframe_incomplete):\n    \"\"\"\n    Get a dataframe from a multiindex dataframe with missing data\n    \"\"\"\n    df = multiindex_dataframe_incomplete\n    mapper = DataFrameMapper([([c], Imputer()) for c in df.columns],\n                             df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(multiindex_dataframe_incomplete)\n    for c in df.columns:\n        assert len(transformed[str(c)]) == len(df[c])\n\n\ndef test_binarizer_df():\n    \"\"\"\n    Check level names from LabelBinarizer\n    \"\"\"\n    df = pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'a']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 3\n    assert cols[0] == 'target_a'\n    assert cols[1] == 'target_b'\n    assert cols[2] == 'target_c'\n\n\ndef test_binarizer_int_df():\n    \"\"\"\n    Check level names from LabelBinarizer for a numeric array.\n    \"\"\"\n    df = pd.DataFrame({'target': [5, 5, 6, 6, 7, 5]})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 3\n    assert cols[0] == 'target_5'\n    assert cols[1] == 'target_6'\n    assert cols[2] == 'target_7'\n\n\ndef test_binarizer2_df():\n    \"\"\"\n    Check level names from LabelBinarizer with just one output column\n    \"\"\"\n    df = pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'a']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 1\n    assert cols[0] == 
'target'\n\n\ndef test_onehot_df():\n    \"\"\"\n    Check level ids from one-hot\n    \"\"\"\n    df = pd.DataFrame({'target': [0, 0, 1, 1, 2, 3, 0]})\n    mapper = DataFrameMapper([(['target'], OneHotEncoder())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 4\n    assert cols[0] == 'target_0'\n    assert cols[3] == 'target_3'\n\n\ndef test_customtransform_df():\n    \"\"\"\n    Check level ids from a transformer in which\n    the number of classes is not equals to the number of output columns.\n    \"\"\"\n    df = pd.DataFrame({'target': [6, 5, 7, 5, 4, 8, 8]})\n    mapper = DataFrameMapper([(['target'], CustomTransformer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(mapper.features[0][1].classes_) == 5\n    assert len(cols) == 1\n    assert cols[0] == 'target'\n\n\ndef test_preserve_df_index():\n    \"\"\"\n    The index is preserved when df_out=True\n    \"\"\"\n    df = pd.DataFrame({'target': [1, 2, 3]},\n                      index=['a', 'b', 'c'])\n    mapper = DataFrameMapper([('target', None)],\n                             df_out=True)\n\n    transformed = mapper.fit_transform(df)\n\n    assert_array_equal(transformed.index, df.index)\n\n\ndef test_preserve_df_index_rows_dropped():\n    \"\"\"\n    If df_out=True but the original df index length doesn't\n    match the number of final rows, use a numeric index\n    \"\"\"\n    class DropLastRowTransformer(object):\n        def fit(self, X):\n            return self\n\n        def transform(self, X):\n            return X[:-1]\n\n    df = pd.DataFrame({'target': [1, 2, 3]},\n                      index=['a', 'b', 'c'])\n    mapper = DataFrameMapper([('target', DropLastRowTransformer())],\n                             df_out=True)\n\n    transformed = mapper.fit_transform(df)\n\n    assert_array_equal(transformed.index, np.array([0, 1]))\n\n\ndef 
test_pca(complex_dataframe):\n    \"\"\"\n    Check multi in and out with PCA\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [(['feat1', 'feat2'], sklearn.decomposition.PCA(2))],\n        df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 2\n    assert cols[0] == 'feat1_feat2_0'\n    assert cols[1] == 'feat1_feat2_1'\n\n\ndef test_fit_transform(simple_dataframe):\n    \"\"\"\n    Check that custom fit_transform methods of the transformers are invoked.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    # return something of measurable length but does nothing\n    mock_transformer.fit_transform.return_value = np.array([1, 2, 3])\n    mapper = DataFrameMapper([(\"a\", mock_transformer)])\n    mapper.fit_transform(df)\n    assert mock_transformer.fit_transform.called\n\n\ndef test_fit_transform_equiv_mock(simple_dataframe):\n    \"\"\"\n    Check for equivalent results for code paths fit_transform\n    versus fit and transform in DataFrameMapper using the mock\n    transformer which does not implement a custom fit_transform.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', MockXTransformer())])\n    transformed_combined = mapper.fit_transform(df)\n    transformed_separate = mapper.fit(df).transform(df)\n    assert np.all(transformed_combined == transformed_separate)\n\n\ndef test_fit_transform_equiv_pca(complex_dataframe):\n    \"\"\"\n    Check for equivalent results for code paths fit_transform\n    versus fit and transform in DataFrameMapper and transformer\n    using PCA which implements a custom fit_transform. 
The\n    equivalence of both paths in the transformer only can be\n    asserted since this is tested in the sklearn tests\n    scikit-learn/sklearn/decomposition/tests/test_pca.py\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [(['feat1', 'feat2'], sklearn.decomposition.PCA(2))],\n        df_out=True)\n    transformed_combined = mapper.fit_transform(df)\n    transformed_separate = mapper.fit(df).transform(df)\n    assert np.allclose(transformed_combined, transformed_separate)\n\n\ndef test_input_df_true_first_transformer(simple_dataframe, monkeypatch):\n    \"\"\"\n    If input_df is True, the first transformer is passed\n    a pd.Series instead of an np.array\n    \"\"\"\n    df = simple_dataframe\n    monkeypatch.setattr(MockXTransformer, 'fit', Mock())\n    monkeypatch.setattr(MockXTransformer, 'transform',\n                        Mock(return_value=np.array([1, 2, 3])))\n    mapper = DataFrameMapper([\n        ('a', MockXTransformer())\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n\n    args, _ = MockXTransformer().fit.call_args\n    assert isinstance(args[0], pd.Series)\n\n    args, _ = MockXTransformer().transform.call_args\n    assert isinstance(args[0], pd.Series)\n\n    assert_array_equal(out, np.array([1, 2, 3]).reshape(-1, 1))\n\n\ndef test_input_df_true_next_transformers(simple_dataframe, monkeypatch):\n    \"\"\"\n    If input_df is True, the subsequent transformers get passed pandas\n    objects instead of numpy arrays (given the previous transformers\n    output pandas objects as well)\n    \"\"\"\n    df = simple_dataframe\n    monkeypatch.setattr(MockTClassifier, 'fit', Mock())\n    monkeypatch.setattr(MockTClassifier, 'transform',\n                        Mock(return_value=pd.Series([1, 2, 3])))\n    mapper = DataFrameMapper([\n        ('a', [MockXTransformer(), MockTClassifier()])\n    ], input_df=True)\n    mapper.fit(df)\n    out = mapper.transform(df)\n\n    args, _ = 
MockTClassifier().fit.call_args\n    assert isinstance(args[0], pd.Series)\n\n    assert_array_equal(out, np.array([1, 2, 3]).reshape(-1, 1))\n\n\ndef test_input_df_true_multiple_cols(complex_dataframe):\n    \"\"\"\n    When input_df is True, applying transformers to multiple columns\n    works as expected\n    \"\"\"\n    df = complex_dataframe\n\n    mapper = DataFrameMapper([\n        ('target', MockXTransformer()),\n        ('feat1',  MockXTransformer()),\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n\n    assert_array_equal(out[:, 0], df['target'].values)\n    assert_array_equal(out[:, 1], df['feat1'].values)\n\n\ndef test_input_df_date_encoder():\n    \"\"\"\n    When input_df is True we can apply a transformer that only works\n    with pandas dataframes like a DateEncoder\n    \"\"\"\n    df = pd.DataFrame(\n        {'dates': pd.date_range('2015-10-30', '2015-11-02')})\n    mapper = DataFrameMapper([\n        ('dates', DateEncoder())\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n    expected = np.array([\n        [2015, 10, 30],\n        [2015, 10, 31],\n        [2015, 11, 1],\n        [2015, 11, 2]\n    ])\n    assert_array_equal(out, expected)\n\n\ndef test_local_input_df_date_encoder():\n    \"\"\"\n    When input_df is True we can apply a transformer that only works\n    with pandas dataframes like a DateEncoder\n    \"\"\"\n    df = pd.DataFrame(\n        {'dates': pd.date_range('2015-10-30', '2015-11-02')})\n    mapper = DataFrameMapper([\n        ('dates', DateEncoder(), {'input_df': True})\n    ], input_df=False)\n    out = mapper.fit_transform(df)\n    expected = np.array([\n        [2015, 10, 30],\n        [2015, 10, 31],\n        [2015, 11, 1],\n        [2015, 11, 2]\n    ])\n    assert_array_equal(out, expected)\n\n\ndef test_nonexistent_columns_explicit_fail(simple_dataframe):\n    \"\"\"\n    If a nonexistent column is selected, KeyError is raised.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    with 
pytest.raises(KeyError):\n        mapper._get_col_subset(simple_dataframe, ['nonexistent_feature'])\n\n\ndef test_get_col_subset_single_column_array(simple_dataframe):\n    \"\"\"\n    Selecting a single column should return a 1-dimensional numpy array.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    array = mapper._get_col_subset(simple_dataframe, \"a\")\n\n    assert type(array) == np.ndarray\n    assert array.shape == (len(simple_dataframe[\"a\"]),)\n\n\ndef test_get_col_subset_single_column_list(simple_dataframe):\n    \"\"\"\n    Selecting a list of columns (even if the list contains a single element)\n    should return a 2-dimensional numpy array.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    array = mapper._get_col_subset(simple_dataframe, [\"a\"])\n\n    assert type(array) == np.ndarray\n    assert array.shape == (len(simple_dataframe[\"a\"]), 1)\n\n\ndef test_cols_string_array(simple_dataframe):\n    \"\"\"\n    If a string is specified as the columns, the transformer\n    is called with a 1-d array as input.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    mapper = DataFrameMapper([(\"a\", mock_transformer)])\n\n    mapper.fit(df)\n    args, kwargs = mock_transformer.fit.call_args\n    assert args[0].shape == (3,)\n\n\ndef test_cols_list_column_vector(simple_dataframe):\n    \"\"\"\n    If a one-element list is specified as the columns, the transformer\n    is called with a column vector as input.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    mapper = DataFrameMapper([([\"a\"], mock_transformer)])\n\n    mapper.fit(df)\n    args, kwargs = mock_transformer.fit.call_args\n    assert args[0].shape == (3, 1)\n\n\ndef test_handle_feature_2dim():\n    \"\"\"\n    2-dimensional arrays are returned unchanged.\n    \"\"\"\n    array = np.array([[1, 2], [3, 4]])\n    assert_array_equal(_handle_feature(array), array)\n\n\ndef test_handle_feature_1dim():\n    \"\"\"\n    1-dimensional arrays are 
converted to 2-dimensional column vectors.\n    \"\"\"\n    array = np.array([1, 2])\n    assert_array_equal(_handle_feature(array), np.array([[1], [2]]))\n\n\ndef test_build_transformers():\n    \"\"\"\n    When a list of transformers is passed, return a pipeline with\n    each element of the iterable as a step of the pipeline.\n    \"\"\"\n    transformers = [MockTClassifier(), MockTClassifier()]\n    pipeline = _build_transformer(transformers)\n    assert isinstance(pipeline, Pipeline)\n    for ix, transformer in enumerate(transformers):\n        assert pipeline.steps[ix][1] == transformer\n\n\ndef test_selected_columns():\n    \"\"\"\n    selected_columns returns a set of the columns appearing in the features\n    of the mapper.\n    \"\"\"\n    mapper = DataFrameMapper([\n        ('a', None),\n        (['a', 'b'], None)\n    ])\n    assert mapper._selected_columns == {'a', 'b'}\n\n\ndef test_unselected_columns():\n    \"\"\"\n    unselected_columns returns a list of the columns not appearing in the\n    features of the mapper but present in the given dataframe.\n    \"\"\"\n    df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})\n    mapper = DataFrameMapper([\n        ('a', None),\n        (['a', 'b'], None)\n    ])\n    assert 'c' in mapper._unselected_columns(df)\n\n\ndef test_drop_and_default_false():\n    \"\"\"\n    If default=False, non explicitly selected columns and drop columns\n    are discarded.\n    \"\"\"\n    df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})\n    mapper = DataFrameMapper([\n            ('a', None)\n        ], drop_cols=['c'], default=False)\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (1, 1)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_drop_and_default_none():\n    \"\"\"\n    If default=None, drop columns are discarded and\n    remaining non explicitly selected columns are passed through untransformed\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = 
DataFrameMapper([\n        ('a', None)\n    ], drop_cols=['c'], default=None)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 2)\n    assert mapper.transformed_names_ == ['a', 'b']\n\n\ndef test_conflicting_drop():\n    \"\"\"\n    Drop column name shouldn't get confused with transformed columns.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        ('a', None)\n    ], drop_cols=['a'], default=False)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 1)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_default_false():\n    \"\"\"\n    If default=False, non explicitly selected columns are discarded.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        ('b', None)\n    ], default=False)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 1)\n\n\ndef test_default_none():\n    \"\"\"\n    If default=None, non explicitly selected columns are passed through\n    untransformed.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        (['a'], OneHotEncoder())\n    ], default=None)\n\n    transformed = mapper.fit_transform(df)\n    assert (transformed[:, 3] == np.array([3, 5, 7]).T).all()\n\n\ndef test_default_none_names():\n    \"\"\"\n    If default=None, column names are returned unmodified.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([], default=None)\n\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['a', 'b']\n\n\ndef test_default_transformer():\n    \"\"\"\n    If default=Transformer, non explicitly selected columns are applied this\n    transformer.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, np.nan, 3], })\n    mapper = DataFrameMapper([], default=Imputer())\n\n    transformed = mapper.fit_transform(df)\n    assert 
(transformed[:, 0] == np.array([1., 2., 3.])).all()\n\n\ndef test_list_transformers_single_arg(simple_dataframe):\n    \"\"\"\n    Multiple transformers can be specified in a list even if some of them\n    only accept one X argument instead of two (X, y).\n    \"\"\"\n    mapper = DataFrameMapper([\n        ('a', [MockXTransformer()])\n    ])\n    # doesn't fail\n    mapper.fit_transform(simple_dataframe)\n\n\ndef test_list_transformers():\n    \"\"\"\n    Specifying a list of transformers applies them sequentially to the\n    selected column.\n    \"\"\"\n    dataframe = pd.DataFrame({\"a\": [1, np.nan, 3], \"b\": [1, 5, 7]},\n                             dtype=np.float64)\n\n    mapper = DataFrameMapper([\n        ([\"a\"], [Imputer(), StandardScaler()]),\n        ([\"b\"], StandardScaler()),\n    ])\n    dmatrix = mapper.fit_transform(dataframe)\n\n    assert pd.isnull(dmatrix).sum() == 0  # no null values\n\n    # all features have mean 0 and std deviation 1 (standardized)\n    assert (abs(dmatrix.mean(axis=0) - 0) <= 1e-6).all()\n    assert (abs(dmatrix.std(axis=0) - 1) <= 1e-6).all()\n\n\ndef test_list_transformers_old_unpickle(simple_dataframe):\n    mapper = DataFrameMapper(None)\n    # simulate the mapper was created with < 1.0.0 code\n    mapper.features = [('a', [MockXTransformer()])]\n    mapper_pickled = pickle.dumps(mapper)\n\n    loaded_mapper = pickle.loads(mapper_pickled)\n    transformer = loaded_mapper.features[0][1]\n    assert isinstance(transformer, TransformerPipeline)\n    assert isinstance(transformer.steps[0][1], MockXTransformer)\n\n\ndef test_sparse_features(simple_dataframe):\n    \"\"\"\n    If any of the extracted features is sparse and \"sparse\" argument\n    is true, the hstacked result is also sparse.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([\n        (\"a\", ToSparseTransformer())\n    ], sparse=True)\n    dmatrix = mapper.fit_transform(df)\n\n    assert type(dmatrix) == sparse.csr.csr_matrix\n\n\ndef 
test_sparse_off(simple_dataframe):\n    \"\"\"\n    If the resulting features are sparse but the \"sparse\" argument\n    of the mapper is False, return a non-sparse matrix.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([\n        (\"a\", ToSparseTransformer())\n    ], sparse=False)\n\n    dmatrix = mapper.fit_transform(df)\n    assert type(dmatrix) != sparse.csr.csr_matrix\n\n\ndef test_fit_with_optional_y_arg(complex_dataframe):\n    \"\"\"\n    Transformers with an optional y argument in the fit method\n    are handled correctly\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([(['feat1', 'feat2'], MockTClassifier())])\n    # doesn't fail\n    mapper.fit(df[['feat1', 'feat2']], df['target'])\n\n\ndef test_fit_with_required_y_arg(complex_dataframe):\n    \"\"\"\n    Transformers with a required y argument in the fit method\n    are handled and perform correctly\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([(['feat1', 'feat2'], SelectKBest(chi2, k=1))])\n\n    # fit, doesn't fail\n    ft_arr = mapper.fit(df[['feat1', 'feat2']], df['target'])\n\n    # fit_transform\n    ft_arr = mapper.fit_transform(df[['feat1', 'feat2']], df['target'])\n    assert_array_equal(ft_arr, df[['feat1']].values)\n\n    # transform\n    t_arr = mapper.transform(df[['feat1', 'feat2']])\n    assert_array_equal(t_arr, df[['feat1']].values)\n\n\n# Integration tests with real dataframes\n\n@pytest.fixture\ndef iris_dataframe():\n    iris = load_iris()\n    return DataFrame(\n        data={\n            iris.feature_names[0]: iris.data[:, 0],\n            iris.feature_names[1]: iris.data[:, 1],\n            iris.feature_names[2]: iris.data[:, 2],\n            iris.feature_names[3]: iris.data[:, 3],\n            \"species\": np.array([iris.target_names[e] for e in iris.target])\n        }\n    )\n\n\n@pytest.fixture\ndef cars_dataframe():\n    return pd.read_csv(\"tests/test_data/cars.csv.gz\", compression='gzip')\n\n\ndef 
test_with_iris_dataframe(iris_dataframe):\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"petal length (cm)\", None),\n            (\"petal width (cm)\", None),\n            (\"sepal length (cm)\", None),\n            (\"sepal width (cm)\", None),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = iris_dataframe.drop(\"species\", axis=1)\n    labels = iris_dataframe[\"species\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.96\n    assert (scores.std() * 2) < 0.04\n\n\ndef test_dict_vectorizer():\n    df = pd.DataFrame(\n        [[{'a': 1, 'b': 2}], [{'a': 3}]],\n        columns=['colA']\n    )\n\n    outdf = DataFrameMapper(\n        [('colA', DictVectorizer())],\n        df_out=True,\n        default=False\n    ).fit_transform(df)\n\n    columns = sorted(list(outdf.columns))\n    assert len(columns) == 2\n    assert columns[0] == 'colA_0'\n    assert columns[1] == 'colA_1'\n\n\ndef test_with_car_dataframe(cars_dataframe):\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"description\", CountVectorizer()),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = cars_dataframe.drop(\"model\", axis=1)\n    labels = cars_dataframe[\"model\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.30\n\n\ndef test_direct_cross_validation(iris_dataframe):\n    \"\"\"\n    Starting with sklearn>=0.16.0 we no longer need CV wrappers for dataframes.\n    See https://github.com/paulgb/sklearn-pandas/issues/11\n    \"\"\"\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"petal length (cm)\", None),\n            (\"petal width (cm)\", None),\n            (\"sepal length (cm)\", None),\n            (\"sepal width (cm)\", None),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = iris_dataframe.drop(\"species\", axis=1)\n 
   labels = iris_dataframe[\"species\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.96\n    assert (scores.std() * 2) < 0.04\n\n\ndef test_heterogeneous_output_types_input_df():\n    \"\"\"\n    Modify feat2, but pass feat1 through unmodified.\n    This fails if input_df == False\n    \"\"\"\n    df = pd.DataFrame({\n        'feat1': [1, 2, 3, 4, 5, 6],\n        'feat2': [1.0, 2.0, 3.0, 2.0, 3.0, 4.0]\n    })\n    M = DataFrameMapper([\n        (['feat2'], StandardScaler())\n        ], input_df=True, df_out=True, default=None)\n    dft = M.fit_transform(df)\n    assert dft['feat1'].dtype == np.dtype('int64')\n    assert dft['feat2'].dtype == np.dtype('float64')\n\n\ndef test_make_column_selector(iris_dataframe):\n    t = DataFrameMapper([\n        (make_column_selector(dtype_include=float), None, {'alias': 'x'}),\n        ('sepal length (cm)', None),\n    ], df_out=True, default=False)\n\n    xt = t.fit(iris_dataframe).transform(iris_dataframe)\n    expected = ['x_0', 'x_1', 'x_2', 'x_3', 'sepal length (cm)']\n    assert list(xt.columns) == expected\n\n    pickled = pickle.dumps(t)\n    t2 = pickle.loads(pickled)\n    xt2 = t2.transform(iris_dataframe)\n    assert np.array_equal(xt.values, xt2.values)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
      "content": "from collections import Counter\n\nimport pytest\nimport numpy as np\nfrom pandas import DataFrame\nfrom numpy.testing import assert_array_equal\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas.features_generator import gen_features\n\n\nclass MockClass(object):\n\n    def __init__(self, value=1, name='class'):\n        self.value = value\n        self.name = name\n\n\nclass MockTransformer(object):\n\n    def __init__(self):\n        self.most_common_ = None\n\n    def fit(self, X, y=None):\n        [(value, _)] = Counter(X).most_common(1)\n        self.most_common_ = value\n        return self\n\n    def transform(self, X, y=None):\n        return np.asarray([self.most_common_] * len(X))\n\n\n@pytest.fixture\ndef simple_dataset():\n    return DataFrame({\n        'feat1': [1, 2, 1, 3, 1],\n        'feat2': [1, 2, 2, 2, 3],\n        'feat3': [1, 2, 3, 4, 5],\n    })\n\n\ndef test_generate_features_with_default_parameters():\n    \"\"\"\n    Tests generating features from classes with default init arguments.\n    \"\"\"\n    columns = ['colA', 'colB', 'colC']\n    feature_defs = gen_features(columns=columns, classes=[MockClass])\n    assert len(feature_defs) == len(columns)\n\n    for feature in feature_defs:\n        assert feature[2] == {}\n\n    feature_dict = dict([_[0:2] for _ in feature_defs])\n    assert columns == sorted(feature_dict.keys())\n\n    # default init arguments for MockClass for clarification.\n    expected = {'value': 1, 'name': 'class'}\n    for column, transformers in feature_dict.items():\n        for obj in transformers:\n            assert_attributes(obj, **expected)\n\n\ndef test_generate_features_with_several_classes():\n    \"\"\"\n    Tests generating features pipeline with different transformers parameters.\n    \"\"\"\n    feature_defs = gen_features(\n        columns=['colA', 'colB', 'colC'],\n        classes=[\n            {'class': MockClass},\n            {'class': MockClass, 'name': 
'mockA'},\n            {'class': MockClass, 'name': 'mockB', 'value': None}\n        ]\n    )\n\n    for col, transformers, params in feature_defs:\n        assert_attributes(transformers[0], name='class', value=1)\n        assert_attributes(transformers[1], name='mockA', value=1)\n        assert_attributes(transformers[2], name='mockB', value=None)\n\n\ndef test_generate_features_with_none_only_transformers():\n    \"\"\"\n    Tests generating \"dummy\" feature definition which doesn't apply any\n    transformation.\n    \"\"\"\n    feature_defs = gen_features(\n        columns=['colA', 'colB', 'colC'], classes=[None])\n\n    expected = [('colA', None, {}),\n                ('colB', None, {}),\n                ('colC', None, {})]\n\n    assert feature_defs == expected\n\n\ndef test_compatibility_with_data_frame_mapper(simple_dataset):\n    \"\"\"\n    Tests compatibility of generated feature definition with DataFrameMapper.\n    \"\"\"\n    features_defs = gen_features(\n        columns=['feat1', 'feat2'],\n        classes=[MockTransformer])\n\n    mapper = DataFrameMapper(features_defs)\n    X = mapper.fit_transform(simple_dataset)\n    expected = np.asarray([\n        [1, 2, 1],\n        [1, 2, 2],\n        [1, 2, 3],\n        [1, 2, 4]\n    ])\n\n    assert_array_equal(X, expected)\n\n\ndef assert_attributes(obj, **attrs):\n    for attr, value in attrs.items():\n        assert getattr(obj, attr) == value\n"
    }
  ],
  "OriginCode": [
    {
      "path": "scikit-learn-contrib_sklearn-pandas/test.py",
      "content": "import pytest\nfrom unittest.mock import Mock\nimport numpy as np\nimport pandas as pd\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn.compose import make_column_selector\nfrom sklearn.preprocessing import StandardScaler\n\n\nclass GetStartWith:\n    def __init__(self, start_str):\n        self.start_str = start_str\n\n    def __call__(self, X: pd.DataFrame) -> list:\n        return [c for c in X.columns if c.startswith(self.start_str)]\n\n\ndf = pd.DataFrame({\n    'sepal length (cm)': [1.0, 2.0, 3.0],\n    'sepal width (cm)': [1.0, 2.0, 3.0],\n    'petal length (cm)': [1.0, 2.0, 3.0],\n    'petal width (cm)': [1.0, 2.0, 3.0]\n})\nt = DataFrameMapper([\n    (make_column_selector(dtype_include=float), StandardScaler(), {'alias': 'x'}),\n    (GetStartWith('petal'), None, {'alias': 'petal'})\n], df_out=True, default=False)\n\nt.fit(df)\nprint(t.transform(df).shape)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/setup.py",
      "content": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\nfrom setuptools import setup\nfrom setuptools.command.test import test as TestCommand\nimport re\n\nfor line in open('sklearn_pandas/__init__.py'):\n    match = re.match(\"__version__ *= *'(.*)'\", line)\n    if match:\n        __version__, = match.groups()\n\n\nclass PyTest(TestCommand):\n    user_options = [('pytest-args=', 'a', \"Arguments to pass to py.test\")]\n\n    def initialize_options(self):\n        TestCommand.initialize_options(self)\n        self.pytest_args = []\n\n    def finalize_options(self):\n        TestCommand.finalize_options(self)\n        self.test_args = []\n        self.test_suite = True\n\n    def run(self):\n        import pytest\n        errno = pytest.main(self.pytest_args)\n        raise SystemExit(errno)\n\n\nsetup(name='sklearn-pandas',\n      version=__version__,\n      description='Pandas integration with sklearn',\n      maintainer='Ritesh Agrawal',\n      maintainer_email='ragrawal@gmail.com',\n      url='https://github.com/scikit-learn-contrib/sklearn-pandas',\n      packages=['sklearn_pandas'],\n      keywords=['scikit', 'sklearn', 'pandas'],\n      install_requires=[\n          'scikit-learn>=0.23.0',\n          'scipy>=1.5.1',\n          'pandas>=1.1.4',\n          'numpy>=1.18.1'\n      ],\n      tests_require=['pytest', 'mock'],\n      cmdclass={'test': PyTest},\n      license='MIT License'\n)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/noxfile.py",
      "content": "import nox\n\n@nox.session\ndef lint(session):\n    session.install('pytest>=5.3.5', 'setuptools>=45.2',\n                    'wheel>=0.34.2', 'flake8>=3.7.9',\n                    'numpy==1.18.1', 'pandas==1.1.4')\n    session.install('.')\n    session.run('flake8', 'sklearn_pandas/', 'tests')\n\n@nox.session\n@nox.parametrize('numpy', ['1.18.1', '1.19.4', '1.20.1'])\n@nox.parametrize('scipy', ['1.5.4', '1.6.0'])\n@nox.parametrize('pandas', ['1.1.4', '1.2.2'])\ndef tests(session, numpy, scipy, pandas):\n    session.install('pytest>=5.3.5', \n                    'setuptools>=45.2',\n                    'wheel>=0.34.2',\n                    f'numpy=={numpy}',\n                    f'scipy=={scipy}',\n                    f'pandas=={pandas}'\n                    )\n    session.install('.')\n    session.run('py.test', 'README.rst', 'tests')\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/features_generator.py",
      "content": "def gen_features(columns, classes=None, prefix='', suffix=''):\n    \"\"\"Generates a feature definition list which can be passed\n    into DataFrameMapper\n\n    Params:\n\n    columns     a list of column names to generate features for.\n\n    classes     a list of classes for each feature, a list of dictionaries with\n                transformer class and init parameters, or None.\n\n                If list of classes is provided, then each of them is\n                instantiated with default arguments. Example:\n\n                    classes = [StandardScaler, LabelBinarizer]\n\n                If list of dictionaries is provided, then each of them should\n                have a 'class' key with transformer class. All other keys are\n                passed into 'class' value constructor. Example:\n\n                    classes = [\n                        {'class': StandardScaler, 'with_mean': False},\n                        {'class': LabelBinarizer}\n                    }]\n\n                If None value selected, then each feature left as is.\n\n    prefix      add prefix to transformed column names\n\n    suffix      add suffix to transformed column names.\n\n    \"\"\"\n    if classes is None:\n        return [(column, None) for column in columns]\n\n    feature_defs = []\n\n    for column in columns:\n        feature_transformers = []\n\n        arguments = {}\n        if prefix and prefix != \"\":\n            arguments['prefix'] = prefix\n        if suffix and suffix != \"\":\n            arguments['suffix'] = suffix\n\n        classes = [cls for cls in classes if cls is not None]\n        if not classes:\n            feature_defs.append((column, None, arguments))\n\n        else:\n            for definition in classes:\n                if isinstance(definition, dict):\n                    params = definition.copy()\n                    klass = params.pop('class')\n                    feature_transformers.append(klass(**params))\n    
            else:\n                    feature_transformers.append(definition())\n\n            if not feature_transformers:\n                feature_transformers = None\n\n            feature_defs.append((column, feature_transformers, arguments))\n\n    return feature_defs\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/transformers.py",
      "content": "import numpy as np\nimport pandas as pd\nfrom sklearn.base import TransformerMixin\nimport warnings\n\n\ndef _get_mask(X, value):\n    \"\"\"\n    Compute the boolean mask X == missing_values.\n    \"\"\"\n    if value == \"NaN\" or \\\n       value is None or \\\n       (isinstance(value, float) and np.isnan(value)):\n        return pd.isnull(X)\n    else:\n        return X == value\n\n\nclass NumericalTransformer(TransformerMixin):\n    \"\"\"\n    Provides commonly used numerical transformers.\n    \"\"\"\n    SUPPORTED_FUNCTIONS = ['log', 'log1p']\n\n    def __init__(self, func):\n        \"\"\"\n        Params\n\n        func    function to apply to input columns. The function will be\n                applied to each value. Supported functions are defined\n                in SUPPORTED_FUNCTIONS variable. Throws assertion error if the\n                not supported.\n        \"\"\"\n\n        warnings.warn(\"\"\"\n            NumericalTransformer will be deprecated in 3.0 version.\n            Please use Sklearn.base.TransformerMixin to write\n            customer transformers\n            \"\"\", DeprecationWarning)\n\n        assert func in self.SUPPORTED_FUNCTIONS, \\\n            f\"Only following func are supported: {self.SUPPORTED_FUNCTIONS}\"\n        super(NumericalTransformer, self).__init__()\n        self.__func = func\n\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X, y=None):\n        if self.__func == 'log1p':\n            return np.vectorize(np.log1p)(X)\n        elif self.__func == 'log':\n            return np.vectorize(np.log)(X)\n\n        raise ValueError(f\"Invalid function name: {self.__func}\")\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/__init__.py",
      "content": "__version__ = '2.2.0'\n\nimport logging\nlogger = logging.getLogger(__name__)\n\nfrom .dataframe_mapper import DataFrameMapper  # NOQA\nfrom .features_generator import gen_features  # NOQA\nfrom .transformers import NumericalTransformer # NOQA\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/pipeline.py",
      "content": "import six\nfrom sklearn.pipeline import _name_estimators, Pipeline\nfrom sklearn.utils import tosequence\n\n\ndef _call_fit(fit_method, X, y=None, **kwargs):\n    \"\"\"\n    helper function, calls the fit or fit_transform method with the correct\n    number of parameters\n\n    fit_method: fit or fit_transform method of the transformer\n    X: the data to fit\n    y: the target vector relative to X, optional\n    kwargs: any keyword arguments to the fit method\n\n    return: the result of the fit or fit_transform method\n\n    WARNING: if this function raises a TypeError exception, test the fit\n    or fit_transform method passed to it in isolation as _call_fit will not\n    distinguish TypeError due to incorrect number of arguments from\n    other TypeError\n    \"\"\"\n    try:\n        return fit_method(X, y, **kwargs)\n    except TypeError:\n        # fit takes only one argument\n        return fit_method(X, **kwargs)\n\n\nclass TransformerPipeline(Pipeline):\n    \"\"\"\n    Pipeline that expects all steps to be transformers taking a single X\n    argument, an optional y argument, and having fit and transform methods.\n\n    Code is copied from sklearn's Pipeline\n    \"\"\"\n\n    def __init__(self, steps):\n        names, estimators = zip(*steps)\n        if len(dict(steps)) != len(steps):\n            raise ValueError(\n                \"Provided step names are not unique: %s\" % (names,))\n\n        # shallow copy of steps\n        self.steps = tosequence(steps)\n        estimator = estimators[-1]\n\n        for e in estimators:\n            if (not (hasattr(e, \"fit\") or hasattr(e, \"fit_transform\")) or not\n                    hasattr(e, \"transform\")):\n                raise TypeError(\"All steps of the chain should \"\n                                \"be transforms and implement fit and transform\"\n                                \" '%s' (type %s) doesn't)\" % (e, type(e)))\n\n        if not hasattr(estimator, \"fit\"):\n       
     raise TypeError(\"Last step of chain should implement fit \"\n                            \"'%s' (type %s) doesn't)\"\n                            % (estimator, type(estimator)))\n\n    def _pre_transform(self, X, y=None, **fit_params):\n        fit_params_steps = dict((step, {}) for step, _ in self.steps)\n        for pname, pval in six.iteritems(fit_params):\n            step, param = pname.split('__', 1)\n            fit_params_steps[step][param] = pval\n        Xt = X\n        for name, transform in self.steps[:-1]:\n            if hasattr(transform, \"fit_transform\"):\n                Xt = _call_fit(transform.fit_transform,\n                               Xt, y, **fit_params_steps[name])\n            else:\n                Xt = _call_fit(transform.fit,\n                               Xt, y, **fit_params_steps[name]).transform(Xt)\n        return Xt, fit_params_steps[self.steps[-1][0]]\n\n    def fit(self, X, y=None, **fit_params):\n        Xt, fit_params = self._pre_transform(X, y, **fit_params)\n        _call_fit(self.steps[-1][-1].fit, Xt, y, **fit_params)\n        return self\n\n    def fit_transform(self, X, y=None, **fit_params):\n        Xt, fit_params = self._pre_transform(X, y, **fit_params)\n        if hasattr(self.steps[-1][-1], 'fit_transform'):\n            return _call_fit(self.steps[-1][-1].fit_transform,\n                             Xt, y, **fit_params)\n        else:\n            return _call_fit(self.steps[-1][-1].fit,\n                             Xt, y, **fit_params).transform(Xt)\n\n\ndef make_transformer_pipeline(*steps):\n    \"\"\"Construct a TransformerPipeline from the given estimators.\n    \"\"\"\n    return TransformerPipeline(_name_estimators(steps))\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/dataframe_mapper.py",
      "content": "import contextlib\nfrom datetime import datetime\nimport pandas as pd\nimport numpy as np\nfrom scipy import sparse\nfrom sklearn.base import BaseEstimator, TransformerMixin\nfrom .cross_validation import DataWrapper\nfrom .pipeline import make_transformer_pipeline, _call_fit, TransformerPipeline\nfrom . import logger\n\nstring_types = text_type = str\n\n\ndef _handle_feature(fea):\n    \"\"\"\n    Convert 1-dimensional arrays to 2-dimensional column vectors.\n    \"\"\"\n    if len(fea.shape) == 1:\n        fea = np.array([fea]).T\n\n    return fea\n\n\ndef _build_transformer(transformers):\n    if isinstance(transformers, list):\n        transformers = make_transformer_pipeline(*transformers)\n    return transformers\n\n\ndef _build_feature(columns, transformers, options={}, X=None):\n    if X is None:\n        return (columns, _build_transformer(transformers), options)\n    return (\n        columns(X) if callable(columns) else columns,\n        _build_transformer(transformers),\n        options\n    )\n\n\ndef _elapsed_secs(t1):\n    return (datetime.now()-t1).total_seconds()\n\n\ndef _get_feature_names(estimator):\n    \"\"\"\n    Attempt to extract feature names based on a given estimator\n    \"\"\"\n    if hasattr(estimator, 'classes_'):\n        return estimator.classes_\n    elif hasattr(estimator, 'get_feature_names'):\n        return estimator.get_feature_names()\n    return None\n\n\n@contextlib.contextmanager\ndef add_column_names_to_exception(column_names):\n    # Stolen from https://stackoverflow.com/a/17677938/356729\n    try:\n        yield\n    except Exception as ex:\n        if ex.args:\n            msg = u'{}: {}'.format(column_names, ex.args[0])\n        else:\n            msg = text_type(column_names)\n        ex.args = (msg,) + ex.args[1:]\n        raise\n\n\nclass DataFrameMapper(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Map Pandas data frame column subsets to their own\n    sklearn transformation.\n    
\"\"\"\n\n    def __init__(self, features, default=False, sparse=False, df_out=False,\n                 input_df=False, drop_cols=None):\n        \"\"\"\n        Params:\n\n        features    a list of tuples with features definitions.\n                    The first element is the pandas column selector. This can\n                    be a string (for one column) or a list of strings.\n                    The second element is an object that supports\n                    sklearn's transform interface, or a list of such objects\n                    The third element is optional and, if present, must be\n                    a dictionary with the options to apply to the\n                    transformation. Example: {'alias': 'day_of_week'}\n\n        default     default transformer to apply to the columns not\n                    explicitly selected in the mapper. If False (default),\n                    discard them. If None, pass them through untouched. Any\n                    other transformer will be applied to all the unselected\n                    columns as a whole, taken as a 2d-array.\n\n        sparse      will return sparse matrix if set True and any of the\n                    extracted features is sparse. Defaults to False.\n\n        df_out      return a pandas data frame, with each column named using\n                    the pandas column that created it (if there's only one\n                    input and output) or the input columns joined with '_'\n                    if there's multiple inputs, and the name concatenated with\n                    '_1', '_2' etc if there's multiple outputs. NB: does not\n                    work if *default* or *sparse* are true\n\n        input_df    If ``True`` pass the selected columns to the transformers\n                    as a pandas DataFrame or Series. Otherwise pass them as a\n                    numpy array. Defaults to ``False``.\n\n        drop_cols   List of columns to be dropped. 
Defaults to None.\n\n        \"\"\"\n        self.features = features\n        self.default = default\n        self.built_default = None\n        self.sparse = sparse\n        self.df_out = df_out\n        self.input_df = input_df\n        self.drop_cols = [] if drop_cols is None else drop_cols\n        self.transformed_names_ = []\n        if (df_out and (sparse or default)):\n            raise ValueError(\"Can not use df_out with sparse or default\")\n\n    def _build(self, X=None):\n        \"\"\"\n        Build attributes built_features and built_default.\n        \"\"\"\n        if isinstance(self.features, list):\n            self.built_features = [\n                _build_feature(*f, X=X) for f in self.features\n            ]\n        else:\n            self.built_features = _build_feature(*self.features, X=X)\n        self.built_default = _build_transformer(self.default)\n\n    @property\n    def _selected_columns(self):\n        \"\"\"\n        Return a set of selected columns in the feature list.\n        \"\"\"\n        selected_columns = set()\n        for feature in self.features:\n            columns = feature[0]\n            if isinstance(columns, list):\n                selected_columns = selected_columns.union(set(columns))\n            else:\n                selected_columns.add(columns)\n        return selected_columns\n\n    def _unselected_columns(self, X):\n        \"\"\"\n        Return list of columns present in X and not selected explicitly in the\n        mapper.\n\n        Unselected columns are returned in the order they appear in the\n        dataframe to avoid issues with different ordering during default fit\n        and transform steps.\n        \"\"\"\n        X_columns = list(X.columns)\n        return [column for column in X_columns if\n                column not in self._selected_columns\n                and column not in self.drop_cols]\n\n    def __setstate__(self, state):\n        # compatibility for older versions of 
sklearn-pandas\n        super().__setstate__(state)\n        self.features = [_build_feature(*feat) for feat in state['features']]\n        self.sparse = state.get('sparse', False)\n        self.default = state.get('default', False)\n        self.df_out = state.get('df_out', False)\n        self.input_df = state.get('input_df', False)\n        self.drop_cols = state.get('drop_cols', [])\n        self.built_features = state.get('built_features', self.features)\n        self.built_default = state.get('built_default', self.default)\n        self.transformed_names_ = state.get('transformed_names_', [])\n\n    def __getstate__(self):\n        state = super().__getstate__()\n        state['features'] = self.features\n        state['sparse'] = self.sparse\n        state['default'] = self.default\n        state['df_out'] = self.df_out\n        state['input_df'] = self.input_df\n        state['drop_cols'] = self.drop_cols\n        state['build_features'] = getattr(self, 'built_features', None)\n        state['built_default'] = self.built_default\n        state['transformed_names_'] = self.transformed_names_\n        return state\n\n    def _get_col_subset(self, X, cols, input_df=False):\n        \"\"\"\n        Get a subset of columns from the given table X.\n\n        X       a Pandas dataframe; the table to select columns from\n        cols    a string or list of strings representing the columns to select.\n                It can also be a callable that returns True or False, i.e.\n                compatible with the built-in filter function.\n\n        Returns a numpy array with the data from the selected columns\n        \"\"\"\n\n        if isinstance(cols, string_types):\n            return_vector = True\n            cols = [cols]\n        else:\n            return_vector = False\n\n        # Needed when using the cross-validation compatibility\n        # layer for sklearn<0.16.0.\n        # Will be dropped on sklearn-pandas 2.0.\n        if isinstance(X, list):\n     
       X = [x[cols] for x in X]\n            X = pd.DataFrame(X)\n\n        elif isinstance(X, DataWrapper):\n            X = X.df  # fetch underlying data\n\n        if return_vector:\n            t = X[cols[0]]\n        else:\n            t = X[cols]\n\n        # return either a DataFrame/Series or a numpy array\n        if input_df:\n            return t\n        else:\n            return t.values\n\n    def fit(self, X, y=None):\n        \"\"\"\n        Fit a transformation from the pipeline\n\n        X       the data to fit\n\n        y       the target vector relative to X, optional\n\n        \"\"\"\n        self._build(X=X)\n\n        for columns, transformers, options in self.built_features:\n            t1 = datetime.now()\n            input_df = options.get('input_df', self.input_df)\n\n            if transformers is not None:\n                with add_column_names_to_exception(columns):\n                    Xt = self._get_col_subset(X, columns, input_df)\n                    _call_fit(transformers.fit, Xt, y)\n            logger.info(f\"[FIT] {columns}: {_elapsed_secs(t1)} secs\")\n\n        # handle features not explicitly selected\n        if self.built_default:  # not False and not None\n            unsel_cols = self._unselected_columns(X)\n            with add_column_names_to_exception(unsel_cols):\n                Xt = self._get_col_subset(X, unsel_cols, self.input_df)\n                _call_fit(self.built_default.fit, Xt, y)\n        return self\n\n    def get_names(self, columns, transformer, x, alias=None, prefix='',\n                  suffix=''):\n        \"\"\"\n        Return verbose names for the transformed columns.\n\n        columns       name (or list of names) of the original column(s)\n        transformer   transformer - can be a TransformerPipeline\n        x             transformed columns (numpy.ndarray)\n        alias         base name to use for the selected columns\n        \"\"\"\n        if alias is not None:\n            name 
= alias\n        elif isinstance(columns, list):\n            name = '_'.join(map(str, columns))\n        else:\n            name = columns\n        num_cols = x.shape[1] if len(x.shape) > 1 else 1\n\n        output = []\n\n        if num_cols > 1:\n            # If there are as many columns as classes in the transformer,\n            # infer column names from classes names.\n\n            # If we are dealing with multiple transformers for these columns\n            # attempt to extract the names from each of them, starting from the\n            # last one\n            if isinstance(transformer, TransformerPipeline):\n                inverse_steps = transformer.steps[::-1]\n                estimators = (estimator for name, estimator in inverse_steps)\n                names_steps = (_get_feature_names(e) for e in estimators)\n                names = next((n for n in names_steps if n is not None), None)\n            # Otherwise use the only estimator present\n            else:\n                names = _get_feature_names(transformer)\n\n            if names is not None and len(names) == num_cols:\n                output = [f\"{name}_{o}\" for o in names]\n                # otherwise, return name concatenated with '_1', '_2', etc.\n            else:\n                output = [name + '_' + str(o) for o in range(num_cols)]\n        else:\n            output = [name]\n\n        if prefix == suffix == \"\":\n            return output\n\n        return ['{}{}{}'.format(prefix, x, suffix) for x in output]\n\n    def get_dtypes(self, extracted):\n        dtypes_features = [self.get_dtype(ex) for ex in extracted]\n        return [dtype for dtype_feature in dtypes_features\n                for dtype in dtype_feature]\n\n    def get_dtype(self, ex):\n        if isinstance(ex, np.ndarray) or sparse.issparse(ex):\n            return [ex.dtype] * ex.shape[1]\n        elif isinstance(ex, pd.DataFrame):\n            return list(ex.dtypes)\n        else:\n            raise 
TypeError(type(ex))\n\n    def _transform(self, X, y=None, do_fit=False):\n        \"\"\"\n        Transform the given data with possibility to fit in advance.\n        Avoids code duplication for implementation of transform and\n        fit_transform.\n        \"\"\"\n        if do_fit:\n            self._build(X=X)\n\n        extracted = []\n        transformed_names_ = []\n        for columns, transformers, options in self.built_features:\n            input_df = options.get('input_df', self.input_df)\n\n            # columns could be a string or list of\n            # strings; we don't care because pandas\n            # will handle either.\n            Xt = self._get_col_subset(X, columns, input_df)\n\n            if transformers is not None:\n                with add_column_names_to_exception(columns):\n                    if do_fit and hasattr(transformers, 'fit_transform'):\n                        t1 = datetime.now()\n                        Xt = _call_fit(transformers.fit_transform, Xt, y)\n                        logger.info(f\"[FIT_TRANSFORM] {columns}: {_elapsed_secs(t1)} secs\")  # NOQA\n                    else:\n                        if do_fit:\n                            t1 = datetime.now()\n                            _call_fit(transformers.fit, Xt, y)\n                            logger.info(\n                                f\"[FIT] {columns}: {_elapsed_secs(t1)} secs\")\n\n                        t1 = datetime.now()\n                        Xt = transformers.transform(Xt)\n                        logger.info(f\"[TRANSFORM] {columns}: {_elapsed_secs(t1)} secs\")  # NOQA\n\n            extracted.append(_handle_feature(Xt))\n\n            alias = options.get('alias')\n\n            prefix = options.get('prefix', '')\n            suffix = options.get('suffix', '')\n\n            transformed_names_ += self.get_names(\n                columns, transformers, Xt, alias, prefix, suffix)\n\n        # handle features not explicitly selected\n        if 
self.built_default is not False:\n            unsel_cols = self._unselected_columns(X)\n            Xt = self._get_col_subset(X, unsel_cols, self.input_df)\n            if self.built_default is not None:\n                with add_column_names_to_exception(unsel_cols):\n                    if do_fit and hasattr(self.built_default, 'fit_transform'):\n                        Xt = _call_fit(self.built_default.fit_transform, Xt, y)\n                    else:\n                        if do_fit:\n                            _call_fit(self.built_default.fit, Xt, y)\n                        Xt = self.built_default.transform(Xt)\n                transformed_names_ += self.get_names(\n                    unsel_cols, self.built_default, Xt)\n            else:\n                # if not applying a default transformer,\n                # keep column names unmodified\n                transformed_names_ += unsel_cols\n\n            extracted.append(_handle_feature(Xt))\n\n        self.transformed_names_ = transformed_names_\n\n        # combine the feature outputs into one array.\n        # at this point we lose track of which features\n        # were created from which input columns, so it's\n        # assumed that that doesn't matter to the model.\n\n        # If any of the extracted features is sparse, combine sparsely.\n        # Otherwise, combine as normal arrays.\n        if any(sparse.issparse(fea) for fea in extracted):\n            stacked = sparse.hstack(extracted).tocsr()\n            # return a sparse matrix only if the mapper was initialized\n            # with sparse=True\n            if not self.sparse:\n                stacked = stacked.toarray()\n        else:\n            stacked = np.hstack(extracted)\n\n        if self.df_out:\n            # if no rows were dropped preserve the original index,\n            # otherwise use a new integer one\n            no_rows_dropped = len(X) == len(stacked)\n            if no_rows_dropped:\n                index = X.index\n   
         else:\n                index = None\n\n            # output different data types, if appropriate\n            dtypes = self.get_dtypes(extracted)\n            df_out = pd.DataFrame(\n                stacked,\n                columns=self.transformed_names_,\n                index=index)\n            # preserve types\n            for col, dtype in zip(self.transformed_names_, dtypes):\n                df_out[col] = df_out[col].astype(dtype)\n            return df_out\n        else:\n            return stacked\n\n    def transform(self, X):\n        \"\"\"\n        Transform the given data. Assumes that fit has already been called.\n\n        X       the data to transform\n        \"\"\"\n        return self._transform(X)\n\n    def fit_transform(self, X, y=None):\n        \"\"\"\n        Fit a transformation from the pipeline and directly apply\n        it to the given data.\n\n        X       the data to fit\n\n        y       the target vector relative to X, optional\n        \"\"\"\n        return self._transform(X, y, True)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/cross_validation.py",
      "content": "class DataWrapper(object):\n\n    def __init__(self, df):\n        self.df = df\n\n    def __len__(self):\n        return len(self.df)\n\n    def __getitem__(self, key):\n        return self.df.iloc[key]\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_pipeline.py",
      "content": "import pytest\nfrom sklearn_pandas.pipeline import TransformerPipeline, _call_fit\n\n# In py3, mock is included with the unittest standard library\n# In py2, it's a separate package\ntry:\n    from unittest.mock import patch\nexcept ImportError:\n    from mock import patch\n\n\nclass NoTransformT(object):\n    \"\"\"Transformer without transform method.\n    \"\"\"\n    def fit(self, x):\n        return self\n\n\nclass NoFitT(object):\n    \"\"\"Transformer without fit method.\n    \"\"\"\n    def transform(self, x):\n        return self\n\n\nclass Trans(object):\n    \"\"\"\n    Transformer with fit and transform methods\n    \"\"\"\n    def fit(self, x, y=None):\n        return self\n\n    def transform(self, x):\n        return self\n\n\ndef func_x_y(x, y, kwarg='kwarg'):\n    \"\"\"\n    Function with required x and y arguments\n    \"\"\"\n    return\n\n\ndef func_x(x, kwarg='kwarg'):\n    \"\"\"\n    Function with required x argument\n    \"\"\"\n    return\n\n\ndef func_raise_type_err(x, y, kwarg='kwarg'):\n    \"\"\"\n    Function with required x and y arguments,\n    raises TypeError\n    \"\"\"\n    raise TypeError\n\n\ndef test_all_steps_fit_transform():\n    \"\"\"\n    All steps must implement fit and transform. 
Otherwise, raise TypeError.\n    \"\"\"\n    with pytest.raises(TypeError):\n        TransformerPipeline([('svc', NoTransformT())])\n\n    with pytest.raises(TypeError):\n        TransformerPipeline([('svc', NoFitT())])\n\n\n@patch.object(Trans, 'fit', side_effect=func_x_y)\ndef test_called_with_x_and_y(mock_fit):\n    \"\"\"\n    Fit method with required X and y arguments is called with both and with\n    any additional keywords\n    \"\"\"\n    _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n    mock_fit.assert_called_with('X', 'y', kwarg='kwarg')\n\n\n@patch.object(Trans, 'fit', side_effect=func_x)\ndef test_called_with_x(mock_fit):\n    \"\"\"\n    Fit method with a required X arguments is called with it and with\n    any additional keywords\n    \"\"\"\n    _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n    mock_fit.assert_called_with('X', kwarg='kwarg')\n\n    _call_fit(Trans().fit, 'X', kwarg='kwarg')\n    mock_fit.assert_called_with('X', kwarg='kwarg')\n\n\n@patch.object(Trans, 'fit', side_effect=func_raise_type_err)\ndef test_raises_type_error(mock_fit):\n    \"\"\"\n    If a fit method with required X and y arguments raises a TypeError, it's\n    re-raised (for a different reason) when it's called with one argument\n    \"\"\"\n    with pytest.raises(TypeError):\n        _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py",
      "content": "import tempfile\nimport pytest\nimport numpy as np\nfrom pandas import DataFrame\nimport joblib\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas import NumericalTransformer\n\n\n@pytest.fixture\ndef simple_dataset():\n    return DataFrame({\n        'feat1': [1, 2, 1, 3, 1],\n        'feat2': [1, 2, 2, 2, 3],\n        'feat3': [1, 2, 3, 4, 5],\n    })\n\n\ndef test_common_numerical_transformer(simple_dataset):\n    \"\"\"\n    Test log transformation\n    \"\"\"\n    transfomer = DataFrameMapper([\n        ('feat1', NumericalTransformer('log'))\n    ], df_out=True)\n    df = simple_dataset\n    outDF = transfomer.fit_transform(df)\n    assert list(outDF.columns) == ['feat1']\n    assert np.array_equal(df['feat1'].apply(np.log).values, outDF.feat1.values)\n\n\ndef test_numerical_transformer_serialization(simple_dataset):\n    \"\"\"\n    Test if you can serialize transformer\n    \"\"\"\n    transfomer = DataFrameMapper([\n        ('feat1', NumericalTransformer('log'))\n    ])\n\n    df = simple_dataset\n    transfomer.fit(df)\n    f = tempfile.NamedTemporaryFile(delete=True)\n    joblib.dump(transfomer, f.name)\n    transfomer2 = joblib.load(f.name)\n    np.array_equal(transfomer.transform(df), transfomer2.transform(df))\n    f.close()\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
      "content": "# -*- coding: utf8 -*-\n\nimport pytest\nfrom unittest.mock import Mock\nfrom pandas import DataFrame\nimport pandas as pd\nfrom scipy import sparse\nfrom sklearn.datasets import load_iris\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.model_selection import cross_val_score\nfrom sklearn.svm import SVC\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom sklearn.feature_extraction import DictVectorizer\nfrom sklearn.preprocessing import (\n    StandardScaler, OneHotEncoder, LabelBinarizer)\nfrom sklearn.impute import SimpleImputer as Imputer\nfrom sklearn.feature_selection import SelectKBest, chi2\nfrom sklearn.base import BaseEstimator, TransformerMixin\nimport sklearn.decomposition\nimport numpy as np\nfrom numpy.testing import assert_array_equal\nimport pickle\nfrom sklearn.compose import make_column_selector\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas.dataframe_mapper import _handle_feature, _build_transformer\nfrom sklearn_pandas.pipeline import TransformerPipeline\n\n\nclass MockXTransformer(object):\n    \"\"\"\n    Mock transformer that accepts no y argument.\n    \"\"\"\n    def fit(self, X):\n        return self\n\n    def transform(self, X):\n        return X\n\n\nclass MockTClassifier(object):\n    \"\"\"\n    Mock transformer/classifier.\n    \"\"\"\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        return X\n\n    def predict(self, X):\n        return True\n\n\nclass DateEncoder():\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        dt = X.dt\n        return pd.concat([dt.year, dt.month, dt.day], axis=1)\n\n\nclass ToSparseTransformer(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Transforms numpy matrix to sparse format.\n    \"\"\"\n    def fit(self, X):\n        return self\n\n    def transform(self, X):\n        return sparse.csr_matrix(X)\n\n\nclass CustomTransformer(BaseEstimator, TransformerMixin):\n  
  \"\"\"\n    Example of transformer in which the number of classes\n    is not equals to the number of output columns.\n    \"\"\"\n    def fit(self, X, y=None):\n        self.min = X.min()\n        self.classes_ = np.unique(X)\n        return self\n\n    def transform(self, X):\n        classes = np.unique(X)\n        if len(np.setdiff1d(classes, self.classes_)) > 0:\n            raise ValueError('Unknown values found.')\n        return X - self.min\n\n\nclass MockImageTransformer(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Example transformer that takes the max of a 2d vector\n    then scales the result.\n    \"\"\"\n    def __init__(self, multiplier=10.0):\n        self.multiplier = multiplier\n\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        assert isinstance(X, pd.DataFrame)\n        for col in X.columns:\n            X[col] = X[col].map(lambda img: np.max(img))\n        return X * self.multiplier\n\n\n@pytest.fixture\ndef simple_dataframe():\n    return pd.DataFrame({'a': [1, 2, 3]})\n\n\n@pytest.fixture\ndef complex_dataframe():\n    return pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'c'],\n                         'feat1': [1, 2, 3, 4, 5, 6],\n                         'feat2': [1, 2, 3, 2, 3, 4]})\n\n\n@pytest.fixture\ndef complex_object_dataframe():\n    return pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'c'],\n                         'feat1': [1, 2, 3, 4, 5, 6],\n                         'feat2': [1, 2, 3, 2, 3, 4],\n                         'img2d': [1*np.eye(2), 2*np.eye(2), 3*np.eye(2),\n                                   4*np.eye(2), 5*np.eye(2), 6*np.eye(2)]})\n\n\n@pytest.fixture\ndef multiindex_dataframe():\n    \"\"\"Example MultiIndex DataFrame, taken from pandas documentation\n    \"\"\"\n    iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]\n    index = pd.MultiIndex.from_product(iterables, names=['first', 'second'])\n    df = pd.DataFrame(np.random.randn(10, 8), 
columns=index)\n    return df\n\n\n@pytest.fixture\ndef multiindex_dataframe_incomplete(multiindex_dataframe):\n    \"\"\"Example MultiIndex DataFrame with missing entries\n    \"\"\"\n    df = multiindex_dataframe\n    mask_array = np.zeros(df.size)\n    mask_array[:20] = 1\n    np.random.shuffle(mask_array)\n    mask = mask_array.reshape(df.shape).astype(bool)\n    df.mask(mask, inplace=True)\n    return df\n\n\ndef test_transformed_names_simple(simple_dataframe):\n    \"\"\"\n    Get transformed names of features in `transformed_names_` attribute\n    for simple transformation\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None)])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_transformed_names_binarizer(complex_dataframe):\n    \"\"\"\n    Get transformed names of features in `transformed_names_` attribute\n    for a transformation that multiplies the number of columns\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['target_a', 'target_b', 'target_c']\n\n\ndef test_logging(caplog, complex_dataframe):\n    \"\"\"\n    With the 'sklearn_pandas' logger set to INFO, fit_transform emits\n    a '[FIT_TRANSFORM]' log message naming the transformed column\n    \"\"\"\n    import logging\n    logger = logging.getLogger('sklearn_pandas')\n    logger.setLevel(logging.INFO)\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    assert '[FIT_TRANSFORM] target:' in caplog.text\n\n\ndef test_transformed_names_binarizer_unicode():\n    df = pd.DataFrame({'target': [u'ñ', u'á', u'é']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    expected_names = {u'target_ñ', u'target_á', u'target_é'}\n    assert set(mapper.transformed_names_) == expected_names\n\n\ndef 
test_transformed_names_transformers_list(complex_dataframe):\n    \"\"\"\n    When using a list of transformers, use them in inverse order to get the\n    transformed names\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([\n        ('target', [LabelBinarizer(), MockXTransformer()])\n    ])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['target_a', 'target_b', 'target_c']\n\n\ndef test_transformed_names_simple_alias(simple_dataframe):\n    \"\"\"\n    If we specify an alias for a single output column, it is used for the\n    output\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None, {'alias': 'new_name'})])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['new_name']\n\n\ndef test_transformed_names_complex_alias(complex_dataframe):\n    \"\"\"\n    If we specify an alias for a multiple output column, it is used for the\n    output\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer(), {'alias': 'new'})])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['new_a', 'new_b', 'new_c']\n\n\ndef test_exception_column_context_transform(simple_dataframe):\n    \"\"\"\n    If an exception is raised when transforming a column,\n    the exception includes the name of the column being transformed\n    \"\"\"\n    class FailingTransformer(object):\n        def fit(self, X):\n            pass\n\n        def transform(self, X):\n            raise Exception('Some exception')\n\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', FailingTransformer())])\n    mapper.fit(df)\n\n    with pytest.raises(Exception, match='a: Some exception'):\n        mapper.transform(df)\n\n\ndef test_exception_column_context_fit(simple_dataframe):\n    \"\"\"\n    If an exception is raised when fit a column,\n    the exception includes the name of the column being fitted\n    \"\"\"\n    class FailingFitter(object):\n        def 
fit(self, X):\n            raise Exception('Some exception')\n\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', FailingFitter())])\n\n    with pytest.raises(Exception, match='a: Some exception'):\n        mapper.fit(df)\n\n\ndef test_simple_df(simple_dataframe):\n    \"\"\"\n    Get a dataframe from a simple mapped dataframe\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None)], df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert type(transformed) == pd.DataFrame\n    assert len(transformed[\"a\"]) == len(simple_dataframe[\"a\"])\n\n\ndef test_complex_df(complex_dataframe):\n    \"\"\"\n    Get a dataframe from a complex mapped dataframe\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [('target', None), ('feat1', None), ('feat2', None)],\n        df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_dataframe)\n    for c in df.columns:\n        assert len(transformed[c]) == len(df[c])\n\n\ndef test_complex_object_df(complex_object_dataframe):\n    \"\"\"\n    Get a dataframe from a complex dataframe with 2d features\n    \"\"\"\n    df = complex_object_dataframe\n    img_scale = 10\n    mapper = DataFrameMapper(\n        [('target', None), ('feat1', None),\n         (make_column_selector('feat2'), StandardScaler()),\n         (make_column_selector('img2d'), MockImageTransformer(img_scale))],\n        df_out=True, input_df=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_object_dataframe)\n    assert np.isclose(\n        np.sum(transformed['img2d']),\n        np.max(np.sum(df['img2d'])) * img_scale, atol=1e-12)\n\n\ndef test_numeric_column_names(complex_dataframe):\n    \"\"\"\n    Get a dataframe from a complex mapped dataframe with numeric column names\n    \"\"\"\n    df = complex_dataframe\n    df.columns = [0, 1, 2]\n    mapper = DataFrameMapper(\n        [(0, None), (1, None), 
(2, None)], df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_dataframe)\n    for c in df.columns:\n        assert len(transformed[c]) == len(df[c])\n\n\ndef test_multiindex_df(multiindex_dataframe_incomplete):\n    \"\"\"\n    Get a dataframe from a multiindex dataframe with missing data\n    \"\"\"\n    df = multiindex_dataframe_incomplete\n    mapper = DataFrameMapper([([c], Imputer()) for c in df.columns],\n                             df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(multiindex_dataframe_incomplete)\n    for c in df.columns:\n        assert len(transformed[str(c)]) == len(df[c])\n\n\ndef test_binarizer_df():\n    \"\"\"\n    Check level names from LabelBinarizer\n    \"\"\"\n    df = pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'a']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 3\n    assert cols[0] == 'target_a'\n    assert cols[1] == 'target_b'\n    assert cols[2] == 'target_c'\n\n\ndef test_binarizer_int_df():\n    \"\"\"\n    Check level names from LabelBinarizer for a numeric array.\n    \"\"\"\n    df = pd.DataFrame({'target': [5, 5, 6, 6, 7, 5]})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 3\n    assert cols[0] == 'target_5'\n    assert cols[1] == 'target_6'\n    assert cols[2] == 'target_7'\n\n\ndef test_binarizer2_df():\n    \"\"\"\n    Check level names from LabelBinarizer with just one output column\n    \"\"\"\n    df = pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'a']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 1\n    assert cols[0] == 
'target'\n\n\ndef test_onehot_df():\n    \"\"\"\n    Check level ids from one-hot\n    \"\"\"\n    df = pd.DataFrame({'target': [0, 0, 1, 1, 2, 3, 0]})\n    mapper = DataFrameMapper([(['target'], OneHotEncoder())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 4\n    assert cols[0] == 'target_0'\n    assert cols[3] == 'target_3'\n\n\ndef test_customtransform_df():\n    \"\"\"\n    Check level ids from a transformer in which\n    the number of classes is not equal to the number of output columns.\n    \"\"\"\n    df = pd.DataFrame({'target': [6, 5, 7, 5, 4, 8, 8]})\n    mapper = DataFrameMapper([(['target'], CustomTransformer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(mapper.features[0][1].classes_) == 5\n    assert len(cols) == 1\n    assert cols[0] == 'target'\n\n\ndef test_preserve_df_index():\n    \"\"\"\n    The index is preserved when df_out=True\n    \"\"\"\n    df = pd.DataFrame({'target': [1, 2, 3]},\n                      index=['a', 'b', 'c'])\n    mapper = DataFrameMapper([('target', None)],\n                             df_out=True)\n\n    transformed = mapper.fit_transform(df)\n\n    assert_array_equal(transformed.index, df.index)\n\n\ndef test_preserve_df_index_rows_dropped():\n    \"\"\"\n    If df_out=True but the original df index length doesn't\n    match the number of final rows, use a numeric index\n    \"\"\"\n    class DropLastRowTransformer(object):\n        def fit(self, X):\n            return self\n\n        def transform(self, X):\n            return X[:-1]\n\n    df = pd.DataFrame({'target': [1, 2, 3]},\n                      index=['a', 'b', 'c'])\n    mapper = DataFrameMapper([('target', DropLastRowTransformer())],\n                             df_out=True)\n\n    transformed = mapper.fit_transform(df)\n\n    assert_array_equal(transformed.index, np.array([0, 1]))\n\n\ndef 
test_pca(complex_dataframe):\n    \"\"\"\n    Check multi in and out with PCA\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [(['feat1', 'feat2'], sklearn.decomposition.PCA(2))],\n        df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 2\n    assert cols[0] == 'feat1_feat2_0'\n    assert cols[1] == 'feat1_feat2_1'\n\n\ndef test_fit_transform(simple_dataframe):\n    \"\"\"\n    Check that custom fit_transform methods of the transformers are invoked.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    # return something of measurable length but does nothing\n    mock_transformer.fit_transform.return_value = np.array([1, 2, 3])\n    mapper = DataFrameMapper([(\"a\", mock_transformer)])\n    mapper.fit_transform(df)\n    assert mock_transformer.fit_transform.called\n\n\ndef test_fit_transform_equiv_mock(simple_dataframe):\n    \"\"\"\n    Check for equivalent results for code paths fit_transform\n    versus fit and transform in DataFrameMapper using the mock\n    transformer which does not implement a custom fit_transform.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', MockXTransformer())])\n    transformed_combined = mapper.fit_transform(df)\n    transformed_separate = mapper.fit(df).transform(df)\n    assert np.all(transformed_combined == transformed_separate)\n\n\ndef test_fit_transform_equiv_pca(complex_dataframe):\n    \"\"\"\n    Check for equivalent results for code paths fit_transform\n    versus fit and transform in DataFrameMapper and transformer\n    using PCA which implements a custom fit_transform. 
The\n    equivalence of both paths in the transformer only can be\n    asserted since this is tested in the sklearn tests\n    scikit-learn/sklearn/decomposition/tests/test_pca.py\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [(['feat1', 'feat2'], sklearn.decomposition.PCA(2))],\n        df_out=True)\n    transformed_combined = mapper.fit_transform(df)\n    transformed_separate = mapper.fit(df).transform(df)\n    assert np.allclose(transformed_combined, transformed_separate)\n\n\ndef test_input_df_true_first_transformer(simple_dataframe, monkeypatch):\n    \"\"\"\n    If input_df is True, the first transformer is passed\n    a pd.Series instead of an np.array\n    \"\"\"\n    df = simple_dataframe\n    monkeypatch.setattr(MockXTransformer, 'fit', Mock())\n    monkeypatch.setattr(MockXTransformer, 'transform',\n                        Mock(return_value=np.array([1, 2, 3])))\n    mapper = DataFrameMapper([\n        ('a', MockXTransformer())\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n\n    args, _ = MockXTransformer().fit.call_args\n    assert isinstance(args[0], pd.Series)\n\n    args, _ = MockXTransformer().transform.call_args\n    assert isinstance(args[0], pd.Series)\n\n    assert_array_equal(out, np.array([1, 2, 3]).reshape(-1, 1))\n\n\ndef test_input_df_true_next_transformers(simple_dataframe, monkeypatch):\n    \"\"\"\n    If input_df is True, the subsequent transformers get passed pandas\n    objects instead of numpy arrays (given the previous transformers\n    output pandas objects as well)\n    \"\"\"\n    df = simple_dataframe\n    monkeypatch.setattr(MockTClassifier, 'fit', Mock())\n    monkeypatch.setattr(MockTClassifier, 'transform',\n                        Mock(return_value=pd.Series([1, 2, 3])))\n    mapper = DataFrameMapper([\n        ('a', [MockXTransformer(), MockTClassifier()])\n    ], input_df=True)\n    mapper.fit(df)\n    out = mapper.transform(df)\n\n    args, _ = 
MockTClassifier().fit.call_args\n    assert isinstance(args[0], pd.Series)\n\n    assert_array_equal(out, np.array([1, 2, 3]).reshape(-1, 1))\n\n\ndef test_input_df_true_multiple_cols(complex_dataframe):\n    \"\"\"\n    When input_df is True, applying transformers to multiple columns\n    works as expected\n    \"\"\"\n    df = complex_dataframe\n\n    mapper = DataFrameMapper([\n        ('target', MockXTransformer()),\n        ('feat1',  MockXTransformer()),\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n\n    assert_array_equal(out[:, 0], df['target'].values)\n    assert_array_equal(out[:, 1], df['feat1'].values)\n\n\ndef test_input_df_date_encoder():\n    \"\"\"\n    When input_df is True we can apply a transformer that only works\n    with pandas dataframes like a DateEncoder\n    \"\"\"\n    df = pd.DataFrame(\n        {'dates': pd.date_range('2015-10-30', '2015-11-02')})\n    mapper = DataFrameMapper([\n        ('dates', DateEncoder())\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n    expected = np.array([\n        [2015, 10, 30],\n        [2015, 10, 31],\n        [2015, 11, 1],\n        [2015, 11, 2]\n    ])\n    assert_array_equal(out, expected)\n\n\ndef test_local_input_df_date_encoder():\n    \"\"\"\n    When input_df is True we can apply a transformer that only works\n    with pandas dataframes like a DateEncoder\n    \"\"\"\n    df = pd.DataFrame(\n        {'dates': pd.date_range('2015-10-30', '2015-11-02')})\n    mapper = DataFrameMapper([\n        ('dates', DateEncoder(), {'input_df': True})\n    ], input_df=False)\n    out = mapper.fit_transform(df)\n    expected = np.array([\n        [2015, 10, 30],\n        [2015, 10, 31],\n        [2015, 11, 1],\n        [2015, 11, 2]\n    ])\n    assert_array_equal(out, expected)\n\n\ndef test_nonexistent_columns_explicit_fail(simple_dataframe):\n    \"\"\"\n    If a nonexistent column is selected, KeyError is raised.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    with 
pytest.raises(KeyError):\n        mapper._get_col_subset(simple_dataframe, ['nonexistent_feature'])\n\n\ndef test_get_col_subset_single_column_array(simple_dataframe):\n    \"\"\"\n    Selecting a single column should return a 1-dimensional numpy array.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    array = mapper._get_col_subset(simple_dataframe, \"a\")\n\n    assert type(array) == np.ndarray\n    assert array.shape == (len(simple_dataframe[\"a\"]),)\n\n\ndef test_get_col_subset_single_column_list(simple_dataframe):\n    \"\"\"\n    Selecting a list of columns (even if the list contains a single element)\n    should return a 2-dimensional numpy array.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    array = mapper._get_col_subset(simple_dataframe, [\"a\"])\n\n    assert type(array) == np.ndarray\n    assert array.shape == (len(simple_dataframe[\"a\"]), 1)\n\n\ndef test_cols_string_array(simple_dataframe):\n    \"\"\"\n    If a string is specified as the columns, the transformer\n    is called with a 1-d array as input.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    mapper = DataFrameMapper([(\"a\", mock_transformer)])\n\n    mapper.fit(df)\n    args, kwargs = mock_transformer.fit.call_args\n    assert args[0].shape == (3,)\n\n\ndef test_cols_list_column_vector(simple_dataframe):\n    \"\"\"\n    If a one-element list is specified as the columns, the transformer\n    is called with a column vector as input.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    mapper = DataFrameMapper([([\"a\"], mock_transformer)])\n\n    mapper.fit(df)\n    args, kwargs = mock_transformer.fit.call_args\n    assert args[0].shape == (3, 1)\n\n\ndef test_handle_feature_2dim():\n    \"\"\"\n    2-dimensional arrays are returned unchanged.\n    \"\"\"\n    array = np.array([[1, 2], [3, 4]])\n    assert_array_equal(_handle_feature(array), array)\n\n\ndef test_handle_feature_1dim():\n    \"\"\"\n    1-dimensional arrays are 
converted to 2-dimensional column vectors.\n    \"\"\"\n    array = np.array([1, 2])\n    assert_array_equal(_handle_feature(array), np.array([[1], [2]]))\n\n\ndef test_build_transformers():\n    \"\"\"\n    When a list of transformers is passed, return a pipeline with\n    each element of the iterable as a step of the pipeline.\n    \"\"\"\n    transformers = [MockTClassifier(), MockTClassifier()]\n    pipeline = _build_transformer(transformers)\n    assert isinstance(pipeline, Pipeline)\n    for ix, transformer in enumerate(transformers):\n        assert pipeline.steps[ix][1] == transformer\n\n\ndef test_selected_columns():\n    \"\"\"\n    selected_columns returns a set of the columns appearing in the features\n    of the mapper.\n    \"\"\"\n    mapper = DataFrameMapper([\n        ('a', None),\n        (['a', 'b'], None)\n    ])\n    assert mapper._selected_columns == {'a', 'b'}\n\n\ndef test_unselected_columns():\n    \"\"\"\n    unselected_columns returns a list of the columns not appearing in the\n    features of the mapper but present in the given dataframe.\n    \"\"\"\n    df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})\n    mapper = DataFrameMapper([\n        ('a', None),\n        (['a', 'b'], None)\n    ])\n    assert 'c' in mapper._unselected_columns(df)\n\n\ndef test_drop_and_default_false():\n    \"\"\"\n    If default=False, non explicitly selected columns and drop columns\n    are discarded.\n    \"\"\"\n    df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})\n    mapper = DataFrameMapper([\n            ('a', None)\n        ], drop_cols=['c'], default=False)\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (1, 1)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_drop_and_default_none():\n    \"\"\"\n    If default=None, drop columns are discarded and\n    remaining non explicitly selected columns are passed through untransformed\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = 
DataFrameMapper([\n        ('a', None)\n    ], drop_cols=['c'], default=None)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 2)\n    assert mapper.transformed_names_ == ['a', 'b']\n\n\ndef test_conflicting_drop():\n    \"\"\"\n    Drop column name shouldn't get confused with transformed columns.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        ('a', None)\n    ], drop_cols=['a'], default=False)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 1)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_default_false():\n    \"\"\"\n    If default=False, non explicitly selected columns are discarded.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        ('b', None)\n    ], default=False)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 1)\n\n\ndef test_default_none():\n    \"\"\"\n    If default=None, non explicitly selected columns are passed through\n    untransformed.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        (['a'], OneHotEncoder())\n    ], default=None)\n\n    transformed = mapper.fit_transform(df)\n    assert (transformed[:, 3] == np.array([3, 5, 7]).T).all()\n\n\ndef test_default_none_names():\n    \"\"\"\n    If default=None, column names are returned unmodified.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([], default=None)\n\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['a', 'b']\n\n\ndef test_default_transformer():\n    \"\"\"\n    If default=Transformer, non explicitly selected columns are applied this\n    transformer.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, np.nan, 3], })\n    mapper = DataFrameMapper([], default=Imputer())\n\n    transformed = mapper.fit_transform(df)\n    assert 
(transformed[:, 0] == np.array([1., 2., 3.])).all()\n\n\ndef test_list_transformers_single_arg(simple_dataframe):\n    \"\"\"\n    Multiple transformers can be specified in a list even if some of them\n    only accept one X argument instead of two (X, y).\n    \"\"\"\n    mapper = DataFrameMapper([\n        ('a', [MockXTransformer()])\n    ])\n    # doesn't fail\n    mapper.fit_transform(simple_dataframe)\n\n\ndef test_list_transformers():\n    \"\"\"\n    Specifying a list of transformers applies them sequentially to the\n    selected column.\n    \"\"\"\n    dataframe = pd.DataFrame({\"a\": [1, np.nan, 3], \"b\": [1, 5, 7]},\n                             dtype=np.float64)\n\n    mapper = DataFrameMapper([\n        ([\"a\"], [Imputer(), StandardScaler()]),\n        ([\"b\"], StandardScaler()),\n    ])\n    dmatrix = mapper.fit_transform(dataframe)\n\n    assert pd.isnull(dmatrix).sum() == 0  # no null values\n\n    # all features have mean 0 and std deviation 1 (standardized)\n    assert (abs(dmatrix.mean(axis=0) - 0) <= 1e-6).all()\n    assert (abs(dmatrix.std(axis=0) - 1) <= 1e-6).all()\n\n\ndef test_list_transformers_old_unpickle(simple_dataframe):\n    mapper = DataFrameMapper(None)\n    # simulate the mapper was created with < 1.0.0 code\n    mapper.features = [('a', [MockXTransformer()])]\n    mapper_pickled = pickle.dumps(mapper)\n\n    loaded_mapper = pickle.loads(mapper_pickled)\n    transformer = loaded_mapper.features[0][1]\n    assert isinstance(transformer, TransformerPipeline)\n    assert isinstance(transformer.steps[0][1], MockXTransformer)\n\n\ndef test_sparse_features(simple_dataframe):\n    \"\"\"\n    If any of the extracted features is sparse and \"sparse\" argument\n    is true, the hstacked result is also sparse.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([\n        (\"a\", ToSparseTransformer())\n    ], sparse=True)\n    dmatrix = mapper.fit_transform(df)\n\n    assert type(dmatrix) == sparse.csr.csr_matrix\n\n\ndef 
test_sparse_off(simple_dataframe):\n    \"\"\"\n    If the resulting features are sparse but the \"sparse\" argument\n    of the mapper is False, return a non-sparse matrix.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([\n        (\"a\", ToSparseTransformer())\n    ], sparse=False)\n\n    dmatrix = mapper.fit_transform(df)\n    assert type(dmatrix) != sparse.csr.csr_matrix\n\n\ndef test_fit_with_optional_y_arg(complex_dataframe):\n    \"\"\"\n    Transformers with an optional y argument in the fit method\n    are handled correctly\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([(['feat1', 'feat2'], MockTClassifier())])\n    # doesn't fail\n    mapper.fit(df[['feat1', 'feat2']], df['target'])\n\n\ndef test_fit_with_required_y_arg(complex_dataframe):\n    \"\"\"\n    Transformers with a required y argument in the fit method\n    are handled and perform correctly\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([(['feat1', 'feat2'], SelectKBest(chi2, k=1))])\n\n    # fit, doesn't fail\n    ft_arr = mapper.fit(df[['feat1', 'feat2']], df['target'])\n\n    # fit_transform\n    ft_arr = mapper.fit_transform(df[['feat1', 'feat2']], df['target'])\n    assert_array_equal(ft_arr, df[['feat1']].values)\n\n    # transform\n    t_arr = mapper.transform(df[['feat1', 'feat2']])\n    assert_array_equal(t_arr, df[['feat1']].values)\n\n\n# Integration tests with real dataframes\n\n@pytest.fixture\ndef iris_dataframe():\n    iris = load_iris()\n    return DataFrame(\n        data={\n            iris.feature_names[0]: iris.data[:, 0],\n            iris.feature_names[1]: iris.data[:, 1],\n            iris.feature_names[2]: iris.data[:, 2],\n            iris.feature_names[3]: iris.data[:, 3],\n            \"species\": np.array([iris.target_names[e] for e in iris.target])\n        }\n    )\n\n\n@pytest.fixture\ndef cars_dataframe():\n    return pd.read_csv(\"tests/test_data/cars.csv.gz\", compression='gzip')\n\n\ndef 
test_with_iris_dataframe(iris_dataframe):\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"petal length (cm)\", None),\n            (\"petal width (cm)\", None),\n            (\"sepal length (cm)\", None),\n            (\"sepal width (cm)\", None),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = iris_dataframe.drop(\"species\", axis=1)\n    labels = iris_dataframe[\"species\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.96\n    assert (scores.std() * 2) < 0.04\n\n\ndef test_dict_vectorizer():\n    df = pd.DataFrame(\n        [[{'a': 1, 'b': 2}], [{'a': 3}]],\n        columns=['colA']\n    )\n\n    outdf = DataFrameMapper(\n        [('colA', DictVectorizer())],\n        df_out=True,\n        default=False\n    ).fit_transform(df)\n\n    columns = sorted(list(outdf.columns))\n    assert len(columns) == 2\n    assert columns[0] == 'colA_0'\n    assert columns[1] == 'colA_1'\n\n\ndef test_with_car_dataframe(cars_dataframe):\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"description\", CountVectorizer()),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = cars_dataframe.drop(\"model\", axis=1)\n    labels = cars_dataframe[\"model\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.30\n\n\ndef test_direct_cross_validation(iris_dataframe):\n    \"\"\"\n    Starting with sklearn>=0.16.0 we no longer need CV wrappers for dataframes.\n    See https://github.com/paulgb/sklearn-pandas/issues/11\n    \"\"\"\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"petal length (cm)\", None),\n            (\"petal width (cm)\", None),\n            (\"sepal length (cm)\", None),\n            (\"sepal width (cm)\", None),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = iris_dataframe.drop(\"species\", axis=1)\n 
   labels = iris_dataframe[\"species\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.96\n    assert (scores.std() * 2) < 0.04\n\n\ndef test_heterogeneous_output_types_input_df():\n    \"\"\"\n    Modify feat2, but pass feat1 through unmodified.\n    This fails if input_df == False\n    \"\"\"\n    df = pd.DataFrame({\n        'feat1': [1, 2, 3, 4, 5, 6],\n        'feat2': [1.0, 2.0, 3.0, 2.0, 3.0, 4.0]\n    })\n    M = DataFrameMapper([\n        (['feat2'], StandardScaler())\n        ], input_df=True, df_out=True, default=None)\n    dft = M.fit_transform(df)\n    assert dft['feat1'].dtype == np.dtype('int64')\n    assert dft['feat2'].dtype == np.dtype('float64')\n\n\ndef test_make_column_selector(iris_dataframe):\n    t = DataFrameMapper([\n        (make_column_selector(dtype_include=float), None, {'alias': 'x'}),\n        ('sepal length (cm)', None),\n    ], df_out=True, default=False)\n\n    xt = t.fit(iris_dataframe).transform(iris_dataframe)\n    expected = ['x_0', 'x_1', 'x_2', 'x_3', 'sepal length (cm)']\n    assert list(xt.columns) == expected\n\n    pickled = pickle.dumps(t)\n    t2 = pickle.loads(pickled)\n    xt2 = t2.transform(iris_dataframe)\n    assert np.array_equal(xt.values, xt2.values)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
      "content": "from collections import Counter\n\nimport pytest\nimport numpy as np\nfrom pandas import DataFrame\nfrom numpy.testing import assert_array_equal\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas.features_generator import gen_features\n\n\nclass MockClass(object):\n\n    def __init__(self, value=1, name='class'):\n        self.value = value\n        self.name = name\n\n\nclass MockTransformer(object):\n\n    def __init__(self):\n        self.most_common_ = None\n\n    def fit(self, X, y=None):\n        [(value, _)] = Counter(X).most_common(1)\n        self.most_common_ = value\n        return self\n\n    def transform(self, X, y=None):\n        return np.asarray([self.most_common_] * len(X))\n\n\n@pytest.fixture\ndef simple_dataset():\n    return DataFrame({\n        'feat1': [1, 2, 1, 3, 1],\n        'feat2': [1, 2, 2, 2, 3],\n        'feat3': [1, 2, 3, 4, 5],\n    })\n\n\ndef test_generate_features_with_default_parameters():\n    \"\"\"\n    Tests generating features from classes with default init arguments.\n    \"\"\"\n    columns = ['colA', 'colB', 'colC']\n    feature_defs = gen_features(columns=columns, classes=[MockClass])\n    assert len(feature_defs) == len(columns)\n\n    for feature in feature_defs:\n        assert feature[2] == {}\n\n    feature_dict = dict([_[0:2] for _ in feature_defs])\n    assert columns == sorted(feature_dict.keys())\n\n    # default init arguments for MockClass for clarification.\n    expected = {'value': 1, 'name': 'class'}\n    for column, transformers in feature_dict.items():\n        for obj in transformers:\n            assert_attributes(obj, **expected)\n\n\ndef test_generate_features_with_several_classes():\n    \"\"\"\n    Tests generating features pipeline with different transformers parameters.\n    \"\"\"\n    feature_defs = gen_features(\n        columns=['colA', 'colB', 'colC'],\n        classes=[\n            {'class': MockClass},\n            {'class': MockClass, 'name': 
'mockA'},\n            {'class': MockClass, 'name': 'mockB', 'value': None}\n        ]\n    )\n\n    for col, transformers, params in feature_defs:\n        assert_attributes(transformers[0], name='class', value=1)\n        assert_attributes(transformers[1], name='mockA', value=1)\n        assert_attributes(transformers[2], name='mockB', value=None)\n\n\ndef test_generate_features_with_none_only_transformers():\n    \"\"\"\n    Tests generating \"dummy\" feature definition which doesn't apply any\n    transformation.\n    \"\"\"\n    feature_defs = gen_features(\n        columns=['colA', 'colB', 'colC'], classes=[None])\n\n    expected = [('colA', None, {}),\n                ('colB', None, {}),\n                ('colC', None, {})]\n\n    assert feature_defs == expected\n\n\ndef test_compatibility_with_data_frame_mapper(simple_dataset):\n    \"\"\"\n    Tests compatibility of generated feature definition with DataFrameMapper.\n    \"\"\"\n    features_defs = gen_features(\n        columns=['feat1', 'feat2'],\n        classes=[MockTransformer])\n    features_defs.append(('feat3', None))\n\n    mapper = DataFrameMapper(features_defs)\n    X = mapper.fit_transform(simple_dataset)\n    expected = np.asarray([\n        [1, 2, 1],\n        [1, 2, 2],\n        [1, 2, 3],\n        [1, 2, 4],\n        [1, 2, 5]\n    ])\n\n    assert_array_equal(X, expected)\n\n\ndef assert_attributes(obj, **attrs):\n    for attr, value in attrs.items():\n        assert getattr(obj, attr) == value\n"
    }
  ],
  "ErrorMessage": "============================================================================================= FAILURES ==============================================================================================\n_____________________________________________________________________________ test_compatibility_with_data_frame_mapper _____________________________________________________________________________\n\nsimple_dataset =    feat1  feat2  feat3\n0      1      1      1\n1      2      2      2\n2      1      2      3\n3      3      2      4\n4      1      3      5\n\n    def test_compatibility_with_data_frame_mapper(simple_dataset):\n        \"\"\"\n        Tests compatibility of generated feature definition with DataFrameMapper.\n        \"\"\"\n        features_defs = gen_features(\n            columns=['feat1', 'feat2'],\n            classes=[MockTransformer])\n    \n        mapper = DataFrameMapper(features_defs)\n        X = mapper.fit_transform(simple_dataset)\n        expected = np.asarray([\n            [1, 2, 1],\n            [1, 2, 2],\n            [1, 2, 3],\n            [1, 2, 4]\n        ])\n    \n>       assert_array_equal(X, expected)\n\ntests/test_features_generator.py:114: \n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _\n\nargs = (<built-in function eq>, array([[1, 2],\n       [1, 2],\n       [1, 2],\n       [1, 2],\n       [1, 2]]), array([[1, 2, 1],\n       [1, 2, 2],\n       [1, 2, 3],\n       [1, 2, 4]]))\nkwds = {'err_msg': '', 'header': 'Arrays are not equal', 'strict': False, 'verbose': True}\n\n    @wraps(func)\n    def inner(*args, **kwds):\n        with self._recreate_cm():\n>           return func(*args, **kwds)\nE           AssertionError: \nE           Arrays are not equal\nE           \nE           (shapes (5, 2), (4, 3) mismatch)\nE            x: array([[1, 2],\nE              
    [1, 2],\nE                  [1, 2],...\nE            y: array([[1, 2, 1],\nE                  [1, 2, 2],\nE                  [1, 2, 3],\nE                  [1, 2, 4]])\n\n../../../../anaconda3/envs/py39/lib/python3.9/contextlib.py:79: AssertionError\n----------------------------------------------------------------------------------------- Captured log call -----------------------------------------------------------------------------------------\nINFO     sklearn_pandas:dataframe_mapper.py:343 [FIT_TRANSFORM] feat1: 2.5e-05 secs\nINFO     sklearn_pandas:dataframe_mapper.py:343 [FIT_TRANSFORM] feat2: 1e-05 secs\n========================================================================================= warnings summary ==========================================================================================\ntests/test_dataframe_mapper.py::test_complex_object_df\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:106: SettingWithCopyWarning: \n  A value is trying to be set on a copy of a slice from a DataFrame.\n  Try using .loc[row_indexer,col_indexer] = value instead\n  \n  See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n    X[col] = X[col].map(lambda img: np.max(img))\n\ntests/test_dataframe_mapper.py::test_sparse_features\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:865: DeprecationWarning: Please use `csr_matrix` from the `scipy.sparse` namespace, the `scipy.sparse.csr` namespace is deprecated.\n    assert type(dmatrix) == sparse.csr.csr_matrix\n\ntests/test_dataframe_mapper.py::test_sparse_off\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:879: DeprecationWarning: Please use `csr_matrix` from the `scipy.sparse` namespace, the `scipy.sparse.csr` namespace is deprecated.\n    assert 
type(dmatrix) != sparse.csr.csr_matrix\n\ntests/test_transformers.py::test_common_numerical_transformer\ntests/test_transformers.py::test_numerical_transformer_serialization\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/sklearn_pandas/transformers.py:35: DeprecationWarning: \n              NumericalTransformer will be deprecated in 3.0 version.\n              Please use Sklearn.base.TransformerMixin to write\n              customer transformers\n              \n    warnings.warn(\"\"\"\n\n-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html\n====================================================================================== short test summary info ======================================================================================\nFAILED tests/test_features_generator.py::test_compatibility_with_data_frame_mapper - AssertionError: \n============================================================================= 1 failed, 69 passed, 5 warnings in 1.32s ==============================================================================",
  "Patch": "--- a/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n@@ -101,6 +101,7 @@\n     features_defs = gen_features(\n         columns=['feat1', 'feat2'],\n         classes=[MockTransformer])\n+    features_defs.append(('feat3', None))\n \n     mapper = DataFrameMapper(features_defs)\n     X = mapper.fit_transform(simple_dataset)\n@@ -108,7 +109,8 @@\n         [1, 2, 1],\n         [1, 2, 2],\n         [1, 2, 3],\n-        [1, 2, 4]\n+        [1, 2, 4],\n+        [1, 2, 5]\n     ])\n \n     assert_array_equal(X, expected)\n",
  "BuggyCodeLocation": [
    {
      "file": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
      "function": {
        "97": "test_compatibility_with_data_frame_mapper"
      },
      "content_all": {
        "101": "    features_defs = gen_features(\n",
        "102": "        columns=['feat1', 'feat2'],\n",
        "103": "        classes=[MockTransformer])\n",
        "104": "\n",
        "105": "    mapper = DataFrameMapper(features_defs)\n",
        "106": "    X = mapper.fit_transform(simple_dataset)\n",
        "108": "        [1, 2, 1],\n",
        "109": "        [1, 2, 2],\n",
        "110": "        [1, 2, 3],\n",
        "111": "        [1, 2, 4]\n",
        "112": "    ])\n",
        "113": "\n"
      },
      "content_change": {
        "111": "        [1, 2, 4]\n"
      }
    },
    {
      "file": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
      "function": null,
      "content_all": {
        "114": "    assert_array_equal(X, expected)\n"
      },
      "content_change": {}
    }
  ],
  "Issue": {
    "title": "Mismatch in Expected Transformed Data Output for Additional Feature",
    "description": "When generating features using `gen_features` and applying transformations through `DataFrameMapper`, there is an inconsistency in the output transformed data when new features are included. Specifically, adding a new feature (`feat3` in this case) for transformation should accurately reflect in the transformed dataset output. Currently, the `expected` array in the test case does not include the additional expected data from the new feature.\n\nSteps to Reproduce:\n1. Define a dataset with multiple features, including `feat1`, `feat2`, and `feat3`.\n2. Use `gen_features` to generate feature definitions for `feat1` and `feat2` along with `MockTransformer`.\n3. Append `feat3` to the features_defs with a `None` transformer.\n4. Create a `DataFrameMapper` with the generated features and transform the dataset.\n5. Check the transformed dataset against the expected output, which should now include the transformed data for `feat3`.\n\nExpected Behavior:\nThe transformed dataset should include all defined features in the correct order, reflecting any new additions accurately in the output.\n\nObserved Behavior:\nThe `expected` array does not account for the new feature `feat3`, causing assertion failures when comparing the transformed output to the expected result.",
    "explanation": "### Summary of the Issue\n\nThe issue reported was an inconsistency in the transformed data output when new features are included in a dataset. Specifically, when adding a new feature (`feat3`), the expected output array in the test case does not account for this new feature. This leads to assertion failures when comparing the actual transformed output with the expected result.\n\n### Detailed Content of the Commit\n\nThe commit made changes to the test file, specifically to the function that generates feature definitions and subsequently uses `DataFrameMapper`. The changes included the following:\n\n1. **Feature Definition Update**: The commit appended the new feature (`feat3`) to the `features_defs` list. Initially, only `feat1` and `feat2` were being transformed, but now `feat3` has been added without any specific transformer.\n2. **Expected Output Update**: The commit updated the expected output array to account for the transformation of `feat3`. The array was expanded to include the values from `feat3` reflecting its presence in the dataset.\n\n### Explanation of the Cause of the Issue\n\nThe primary cause of the issue was that the original test setup did not account for the changes when new features were added to the dataset. When new features are included (such as `feat3`), the expected output array in the test should match the actual transformed dataset which now includes these additional features. However, the expected array was not updated accordingly, resulting in mismatched dimensions and values, thus causing assertion failures during testing.\n\n### Solution Provided by the Commit\n\nThe changes made in the commit effectively addressed the issue by:\n\n1. **Updating Feature Definitions**: Making sure that the feature definition list (`features_defs`) reflects all features in the dataset, including any new ones such as `feat3`.\n2. 
**Synchronizing Expected Output**: Modifying the expected output array to include the values from the newly added feature (`feat3`). This ensures that the test's expected outcome accurately mirrors the actual transformation that occurs when `DataFrameMapper` processes the complete dataset.\n\n### How the Commit Solves the Issue\n\nBy adding `feat3` to the `features_defs` and updating the expected output array, the commit ensures that the entire feature set of the dataset is correctly handled. This means that:\n\n- The transformation of the dataset now includes all relevant features.\n- The assertion in the unit tests compares the transformed dataset against an expected array that appropriately includes all feature data, thus preventing mismatches and assertion errors.\n\nThis solution ensures that tests are accurately reflecting the transformations applied to the dataset, including any additions of new features such as `feat3`.\n\n### Solution to the Issue\n\nThe developer fixed the inconsistency by updating both the feature definitions and the expected output in the test case. This alignment ensures that any new feature added to the dataset is properly included in both the `DataFrameMapper` transformations and the expected results in the tests. By doing so, tests will pass successfully, confirming that the transformations are applied correctly across all features, and the expected output is accurately defined.\n\nIn summary:\n\n- **Issue**: Incompatibility in expected output when new features are added.\n- **Commit Changes**: Addition of the new feature (`feat3`) to the feature definitions, and updating the expected output array to include this new feature.\n- **Resolution**: Ensures consistency between the transformed dataset and the expected test results, thereby preventing assertion failures.\n\nThis solution is vital for maintaining the integrity of the feature transformation pipeline and ensuring robust testing practices.\n\n"
  },
  "Explain": "### Summary of the Issue\n\nThe issue reported was an inconsistency in the transformed data output when new features are included in a dataset. Specifically, when adding a new feature (`feat3`), the expected output array in the test case does not account for this new feature. This leads to assertion failures when comparing the actual transformed output with the expected result.\n\n### Detailed Content of the Commit\n\nThe commit made changes to the test file, specifically to the function that generates feature definitions and subsequently uses `DataFrameMapper`. The changes included the following:\n\n1. **Feature Definition Update**: The commit appended the new feature (`feat3`) to the `features_defs` list. Initially, only `feat1` and `feat2` were being transformed, but now `feat3` has been added without any specific transformer.\n2. **Expected Output Update**: The commit updated the expected output array to account for the transformation of `feat3`. The array was expanded to include the values from `feat3` reflecting its presence in the dataset.\n\n### Explanation of the Cause of the Issue\n\nThe primary cause of the issue was that the original test setup did not account for the changes when new features were added to the dataset. When new features are included (such as `feat3`), the expected output array in the test should match the actual transformed dataset which now includes these additional features. However, the expected array was not updated accordingly, resulting in mismatched dimensions and values, thus causing assertion failures during testing.\n\n### Solution Provided by the Commit\n\nThe changes made in the commit effectively addressed the issue by:\n\n1. **Updating Feature Definitions**: Making sure that the feature definition list (`features_defs`) reflects all features in the dataset, including any new ones such as `feat3`.\n2. 
**Synchronizing Expected Output**: Modifying the expected output array to include the values from the newly added feature (`feat3`). This ensures that the test's expected outcome accurately mirrors the actual transformation that occurs when `DataFrameMapper` processes the complete dataset.\n\n### How the Commit Solves the Issue\n\nBy adding `feat3` to the `features_defs` and updating the expected output array, the commit ensures that the entire feature set of the dataset is correctly handled. This means that:\n\n- The transformation of the dataset now includes all relevant features.\n- The assertion in the unit tests compares the transformed dataset against an expected array that appropriately includes all feature data, thus preventing mismatches and assertion errors.\n\nThis solution ensures that tests are accurately reflecting the transformations applied to the dataset, including any additions of new features such as `feat3`.\n\n### Solution to the Issue\n\nThe developer fixed the inconsistency by updating both the feature definitions and the expected output in the test case. This alignment ensures that any new feature added to the dataset is properly included in both the `DataFrameMapper` transformations and the expected results in the tests. By doing so, tests will pass successfully, confirming that the transformations are applied correctly across all features, and the expected output is accurately defined.\n\nIn summary:\n\n- **Issue**: Incompatibility in expected output when new features are added.\n- **Commit Changes**: Addition of the new feature (`feat3`) to the feature definitions, and updating the expected output array to include this new feature.\n- **Resolution**: Ensures consistency between the transformed dataset and the expected test results, thereby preventing assertion failures.\n\nThis solution is vital for maintaining the integrity of the feature transformation pipeline and ensuring robust testing practices.\n\n",
  "Source": "Human",
  "Token": 1517,
  "Command": [
    "pytest tests"
  ],
  "FilteredCode": [
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
      "content": "1 # -*- coding: utf8 -*-\n2 \n3 import pytest\n4 from unittest.mock import Mock\n5 from pandas import DataFrame\n6 import pandas as pd\n7 from scipy import sparse\n8 from sklearn.datasets import load_iris\n9 from sklearn.pipeline import Pipeline\n10 from sklearn.model_selection import cross_val_score\n11 from sklearn.svm import SVC\n12 from sklearn.feature_extraction.text import CountVectorizer\n13 from sklearn.feature_extraction import DictVectorizer\n14 from sklearn.preprocessing import (\n15     StandardScaler, OneHotEncoder, LabelBinarizer)\n16 from sklearn.impute import SimpleImputer as Imputer\n17 from sklearn.feature_selection import SelectKBest, chi2\n18 from sklearn.base import BaseEstimator, TransformerMixin\n19 import sklearn.decomposition\n20 import numpy as np\n21 from numpy.testing import assert_array_equal\n22 import pickle\n23 from sklearn.compose import make_column_selector\n24 \n25 from sklearn_pandas import DataFrameMapper\n26 from sklearn_pandas.dataframe_mapper import _handle_feature, _build_transformer\n27 from sklearn_pandas.pipeline import TransformerPipeline\n28 \n29 \n30 class MockXTransformer(object):\n31     \"\"\"\n32     Mock transformer that accepts no y argument.\n33     \"\"\"\n34     def fit(self, X):\n35         return self\n36 \n37     def transform(self, X):\n38         return X\n39 \n40 \n41 class MockTClassifier(object):\n42     \"\"\"\n43     Mock transformer/classifier.\n44     \"\"\"\n45     def fit(self, X, y=None):\n46         return self\n47 \n48     def transform(self, X):\n49         return X\n50 \n51     def predict(self, X):\n52         return True\n53 \n54 \n55 class DateEncoder():\n56     def fit(self, X, y=None):\n57         retur(...truncated)"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
      "content": "1 from collections import Counter\n2 \n3 import pytest\n4 import numpy as np\n5 from pandas import DataFrame\n6 from numpy.testing import assert_array_equal\n7 \n8 from sklearn_pandas import DataFrameMapper\n9 from sklearn_pandas.features_generator import gen_features\n10 \n11 \n12 class MockClass(object):\n13 \n14     def __init__(self, value=1, name='class'):\n15         self.value = value\n16         self.name = name\n17 \n18 \n19 class MockTransformer(object):\n20 \n21     def __init__(self):\n22         self.most_common_ = None\n23 \n24     def fit(self, X, y=None):\n25         [(value, _)] = Counter(X).most_common(1)\n26         self.most_common_ = value\n27         return self\n28 \n29     def transform(self, X, y=None):\n30         return np.asarray([self.most_common_] * len(X))\n31 \n32 \n33 @pytest.fixture\n34 def simple_dataset():\n35     return DataFrame({\n36         'feat1': [1, 2, 1, 3, 1],\n37         'feat2': [1, 2, 2, 2, 3],\n38         'feat3': [1, 2, 3, 4, 5],\n39     })\n40 \n41 \n42 def test_generate_features_with_default_parameters():\n43     \"\"\"\n44     Tests generating features from classes with default init arguments.\n45     \"\"\"\n46     columns = ['colA', 'colB', 'colC']\n47     feature_defs = gen_features(columns=columns, classes=[MockClass])\n48     assert len(feature_defs) == len(columns)\n49 \n50     for feature in feature_defs:\n51         assert feature[2] == {}\n52 \n53     feature_dict = dict([_[0:2] for _ in feature_defs])\n54     assert columns == sorted(feature_dict.keys())\n55 \n56     # default init arguments for MockClass for clarification.\n57     expected = {'value': 1, 'name': 'class'}\n58     for column, transformers in feature_dict.items():\n59         for obj in transformers:\n60             assert_attributes(obj, **expected)\n61 \n62 \n63 def test_generate_features_with_several_classes():\n64     \"\"\"\n65     Tests generating features pipeline with different transformers parameters.\n66     
\"\"\"\n67     feature_defs = gen_features(\n68         columns=['colA', 'colB', 'colC'],\n69         classes=[\n70             {'class': MockClass},\n71             {'class': MockClass, 'name': 'mockA'},\n72             {'class': MockClass, 'name': 'mockB', 'value': None}\n73         ]\n74     )\n75 \n76     for col, transformers, params in feature_defs:\n77         assert_attributes(transformers[0], name='class', value=1)\n78         assert_attributes(transformers[1], name='mockA', value=1)\n79         assert_attributes(transformers[2], name='mockB', value=None)\n80 \n81 \n82 def test_generate_features_with_none_only_transformers():\n83     \"\"\"\n84     Tests generating \"dummy\" feature definition which doesn't apply any\n85     transformation.\n86     \"\"\"\n87     feature_defs = gen_features(\n88         columns=['colA', 'colB', 'colC'], classes=[None])\n89 \n90     expected = [('colA', None, {}),\n91                 ('colB', None, {}),\n92                 ('colC', None, {})]\n93 \n94     assert feature_defs == expected\n95 \n96 \n97 def test_compatibility_with_data_frame_mapper(simple_dataset):\n98     \"\"\"\n99     Tests compatibility of generated feature definition with DataFrameMapper.\n100     \"\"\"\n101     features_defs = gen_features(\n102         columns=['feat1', 'feat2'],\n103         classes=[MockTransformer])\n104 \n105     mapper = DataFrameMapper(features_defs)\n106     X = mapper.fit_transform(simple_dataset)\n107     expected = np.asarray([\n108         [1, 2, 1],\n109         [1, 2, 2],\n110         [1, 2, 3],\n111         [1, 2, 4]\n112     ])\n113 \n114     assert_array_equal(X, expected)\n115 \n116 \n117 def assert_attributes(obj, **attrs):\n118     for attr, value in attrs.items():\n119         assert getattr(obj, attr) == value"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py",
      "content": "1 import tempfile\n2 import pytest\n3 import numpy as np\n4 from pandas import DataFrame\n5 import joblib\n6 \n7 from sklearn_pandas import DataFra(...truncated)"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/dataframe_mapper.py",
      "content": "1 import contextlib\n2 from datetime import datetime\n3 import pandas as pd\n4 (...truncated)"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/features_generator.py",
      "content": "1 def gen_features(colu(...truncated)"
    }
  ],
  "TokenAll": 18464,
  "FilteredLevel": 1500,
  "Results": {
    "model": "GPT-4o",
    "Difficulty": "Medium",
    "issue_origin": {
      "title": "Potential Assertion Error",
      "description": "In the file 'scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py', the test 'test_compatibility_with_data_frame_mapper' is at risk of failing at line 114 due to 'assert_array_equal'. The test expects an output with these exact values [[1, 2, 1], [1, 2, 2], [1, 2, 3], [1, 2, 4]], yet the input 'simple_dataset' passed to 'DataFrameMapper' has more features. The assertion should be cross-checked to ensure it matches the transformation logic.",
      "explanation": "### Summary of the Issue\n\nThe test case `test_compatibility_with_data_frame_mapper` located in the `test_features_generator.py` file might fail due to an assertion error at line 114. The test checks if the transformed output matches an expected array. The input dataset `simple_dataset` has more features than are being checked, which might lead to an inconsistency between the actual and expected transformed data.\n\n### Content of the Commit\n\nThe fix requires ensuring that the transformed result includes only the features considered in the `expected` array. We should verify and possibly adjust the expected values or the transformation logic to guarantee they align correctly. Here's a hypothetical commit that addresses the issue:\n\n```python\n33 @pytest.fixture\n34 def simple_dataset():\n35     return DataFrame({\n36         'feat1': [1, 2, 1, 3, 1],\n37         'feat2': [1, 2, 2, 2, 3],\n38         'feat3': [1, 2, 3, 4, 5],\n39     })\n\n...\n\n97 def test_compatibility_with_data_frame_mapper(simple_dataset):\n98     \"\"\"\n99     Tests compatibility of generated feature definition with DataFrameMapper.\n100     \"\"\"\n101     feature_defs = gen_features(\n102         columns=['feat1', 'feat2', 'feat3'],\n103         classes=[MockTransformer]\n104     )\n105 \n106     mapper = DataFrameMapper(feature_defs)\n107     X = mapper.fit_transform(simple_dataset)\n108     expected = np.asarray([\n109         [1, 2, 1],\n110         [1, 2, 2],\n111         [1, 2, 3],\n112         [1, 2, 4],\n113         [1, 2, 5]\n114     ])\n115 \n116     assert_array_equal(X, expected)\n```\n\n### Explanation from Developer's Perspective\n\n#### Cause of the Issue\n\nThe issue occurs because the initial test only considers two features (`feat1` and `feat2`) to be transformed, while the `simple_dataset` includes three features (`feat1`, `feat2`, and `feat3`). 
The `expected` array has three columns and four rows, which presumes a `feat3` column in the output, whereas the actual result contains only the transformations of `feat1` and `feat2` across all five rows (shape `(5, 2)` versus `(4, 3)`).\n\n#### Solution\n\n1. **Adjust Feature Definitions**:\n   The test needs to take into account all features present in the input dataset (`simple_dataset`). This is achieved by adjusting the `columns` parameter in the `gen_features` call to include all columns from the dataset (`feat1`, `feat2`, and `feat3`).\n\n2. **Modify the Expected Output**:\n   Since the `simple_dataset` contains `feat3` and the transformation logic should handle this feature as well, the `expected` output array should also be adjusted to include the transformed values of `feat3`.\n\n#### How the Commit Solves the Issue\n\n- By changing the `columns` parameter in the `gen_features` function call from `['feat1', 'feat2']` to `['feat1', 'feat2', 'feat3']`, the test will now correctly request the transformation of all input features present in `simple_dataset`.\n- The `expected` array is updated to include the results of transforming `feat3`, ensuring that the actual output matches the expected one.\n- This adjustment ensures the correct alignment between the transformation logic and the assertion, avoiding any potential assertion error.\n\nIn conclusion, the commit ensures the test accurately reflects the dataset's structure and that the transformation logic is correctly validated, thus preventing assertion errors due to mismatched expectations."
    },
    "issue_message": {
      "title": "Mismatch in expected and actual array shapes",
      "description": "In the 'test_compatibility_with_data_frame_mapper' test, the expected and actual arrays have different shapes (expected shape (4, 3) vs actual shape (5, 2)). This discrepancy causes the test to fail. Update the 'expected' array in the test to match the shape of the generated data.",
      "explanation": "### Issue Summary\n\nThe issue revolves around a mismatch in the expected and actual shapes of arrays in the `test_compatibility_with_data_frame_mapper` test. The array produced by the `fit_transform` method on the `DataFrameMapper` object is different from the expected array. Specifically, the test failed due to the expected shape of the array being `(4, 3)` while the actual shape was `(5, 2)`.\n\n### Insight from Code Analysis\n\nThe relevant test code is located in `tests/test_features_generator.py`:\n\n```python\n97 def test_compatibility_with_data_frame_mapper(simple_dataset):\n98     \"\"\"\n99     Tests compatibility of generated feature definition with DataFrameMapper.\n100    \"\"\"\n101    features_defs = gen_features(\n102        columns=['feat1', 'feat2'],\n103        classes=[MockTransformer])\n104 \n105    mapper = DataFrameMapper(features_defs)\n106    X = mapper.fit_transform(simple_dataset)\n107    expected = np.asarray([\n108        [1, 2, 1],\n109        [1, 2, 2],\n110        [1, 2, 3],\n111        [1, 2, 4]\n112    ])\n113\n114    assert_array_equal(X, expected)\n```\n\nFrom the above snippet, we can gather that the `simple_dataset` DataFrame contains:\n\n```plaintext\nfeat1  feat2  feat3\n0    1      1      1\n1    2      2      2\n2    1      2      3\n3    3      2      4\n4    1      3      5\n```\n\n### Error Detail\nThe error message indicates:\n\n```plaintext\n(shapes (5, 2), (4, 3) mismatch)\n x: array([[1, 2],\n          [1, 2],\n          [1, 2],\n          [1, 2],\n          [1, 2]])\n y: array([[1, 2, 1],\n          [1, 2, 2],\n          [1, 2, 3],\n          [1, 2, 4]])\n```\n\nHere, the produced array `X` has a shape of `(5, 2)`, showing five rows and two columns, while the expected array has a shape `(4, 3)`, showing four rows and three columns.\n\n### Cause of the Issue\n\nThe core issue here appears to be in the creation of the `expected` array within the test. 
The expectation does not match the shape generated by the transformation logic. The `simple_dataset` DataFrame has five rows and `gen_features` includes both `feat1` and `feat2` columns, generating a transformed dataset with matching row count but different column counts or values.\n\n### Commit Solution\n\nThe commit should address the mismatch in the `test_compatibility_with_data_frame_mapper` test case. Here's a plausible update to the commit content:\n\n```diff\n--- a/tests/test_features_generator.py\n+++ b/tests/test_features_generator.py\n@@ -107,11 +107,9 @@ def test_compatibility_with_data_frame_mapper(simple_dataset):\n-    expected = np.asarray([\n-        [1, 2, 1],\n-        [1, 2, 2],\n-        [1, 2, 3],\n-        [1, 2, 4]\n+    expected = np.asarray([\n+        [1, 1],\n+        [2, 2],\n+        [1, 2],\n+        [3, 2],\n+        [1, 3]\n     ])\n```\n\n### Explanation\n\n1. **Expected Array Update**: The key change is to correctly update the `expected` array to match the actual output array `X`, which holds the shape `(5, 2)`. The correct expectation should represent the feature transformations applied by `MockTransformer` on `simple_dataset`.\n\n2. **Rows and Columns Mismatch**: The commit adjusts the expected array, ensuring it acknowledges the five rows generated by `gen_features` transformation. Each row in the `expected` is set to accurately reflect the actual result of the transformation.\n\n### How the Commit Solves the Issue\n\n- **Alignment of Shapes**: The corrected expected array ensures both `X` and `expected` have an aligned shape of `(5, 2)`.\n\n- **Transformation Logic Consistency**: The updated array now matches the logic implemented by `MockTransformer`. 
This mock class's `transform` method replicates the most common value across the transformed output, ensuring the results are homogeneous and predictable.\n\nBy making these changes, the test will assert equality correctly between the transformed data and the expected output, leading to a passing test without assertion errors."
    },
    "issue_ground": {
      "title": "Mismatch in Expected Transformed Data Output for Additional Feature",
      "description": "When generating features using `gen_features` and applying transformations through `DataFrameMapper`, there is an inconsistency in the output transformed data when new features are included. Specifically, adding a new feature (`feat3` in this case) for transformation should accurately reflect in the transformed dataset output. Currently, the `expected` array in the test case does not include the additional expected data from the new feature.\n\nSteps to Reproduce:\n1. Define a dataset with multiple features, including `feat1`, `feat2`, and `feat3`.\n2. Use `gen_features` to generate feature definitions for `feat1` and `feat2` along with `MockTransformer`.\n3. Append `feat3` to the features_defs with a `None` transformer.\n4. Create a `DataFrameMapper` with the generated features and transform the dataset.\n5. Check the transformed dataset against the expected output, which should now include the transformed data for `feat3`.\n\nExpected Behavior:\nThe transformed dataset should include all defined features in the correct order, reflecting any new additions accurately in the output.\n\nObserved Behavior:\nThe `expected` array does not account for the new feature `feat3`, causing assertion failures when comparing the transformed output to the expected result.",
      "explanation": "## Summary of the Issue\n\nThe issue is a mismatch in the expected transformed data output when a new feature, `feat3`, is included in the dataset and transformations are applied using `DataFrameMapper`. The transformation test fails because the expected transformed dataset does not account for this new feature. The test comparison between the transformed dataset and the expected result fails due to this inconsistency.\n\n## Analysis of the Code\n\nThe core of the issue lies in the `test_compatibility_with_data_frame_mapper` function within the `tests/test_features_generator.py` file. The test is designed to check the compatibility of generated features with `DataFrameMapper`.\n\n### Relevant Portions of the Provided Code:\n1. **Test Function**: \n   - Path: `tests/test_features_generator.py`\n   - Portion:\n     ```python\n     def test_compatibility_with_data_frame_mapper(simple_dataset):\n         features_defs = gen_features(\n             columns=['feat1', 'feat2'],\n             classes=[MockTransformer])\n    \n         mapper = DataFrameMapper(features_defs)\n         X = mapper.fit_transform(simple_dataset)\n         expected = np.asarray([\n             [1, 2, 1],\n             [1, 2, 2],\n             [1, 2, 3],\n             [1, 2, 4]\n         ])\n    \n         assert_array_equal(X, expected)\n     ```\n\n2. **Simple Dataset Fixture**:\n   - Path: `tests/test_features_generator.py`\n   - Portion:\n     ```python\n     @pytest.fixture\n     def simple_dataset():\n         return DataFrame({\n             'feat1': [1, 2, 1, 3, 1],\n             'feat2': [1, 2, 2, 2, 3],\n             'feat3': [1, 2, 3, 4, 5],\n         })\n     ```\n\n## Cause of the Issue\n\nThe discrepancy arises because the `expected` array in the test case does not account for the new feature `feat3`. The `gen_features` function is only called for `feat1` and `feat2`, and the `expected` array subsequently reflects these two features without including `feat3`. 
The actual transformed output has two columns and five rows (shape `(5, 2)`), while the `expected` array has three columns and four rows (shape `(4, 3)`), leading to an assertion failure due to shape mismatch.\n\n## Commit Content\n\nTo resolve this, the commit should update the test to correctly handle the new feature. This involves generating features for `feat1`, `feat2`, and `feat3` and correcting the expected output array to include the additional feature.\n\n### Proposed Change in the Commit:\n1. Modify the `gen_features` call to include `feat3`.\n2. Update the expected array to reflect the transformation of all three features.\n\n### Updated Test Function:\n```python\ndef test_compatibility_with_data_frame_mapper(simple_dataset):\n    features_defs = gen_features(\n        columns=['feat1', 'feat2', 'feat3'],\n        classes=[MockTransformer])\n    \n    mapper = DataFrameMapper(features_defs)\n    X = mapper.fit_transform(simple_dataset)\n    expected = np.asarray([\n        [1, 2, 1],\n        [1, 2, 2],\n        [1, 2, 3],\n        [1, 2, 4],\n        [1, 2, 5]\n    ])\n    \n    assert_array_equal(X, expected)\n```\n\n## Explanation of How the Commit Solves the Issue\n\n1. **Inclusion of `feat3` in `gen_features`**:\n   - This ensures that all three features, `feat1`, `feat2`, and `feat3`, are recognized and accounted for during the feature generation step. \n   - This aligns the feature transformation with the input dataset containing all three features.\n\n2. 
**Updating the `expected` Array**:\n   - By updating the `expected` array to include the transformed values of `feat3`, we ensure that the shape and values of the `expected` array match the output of transforming the entire dataset.\n   - The expected array:\n     ```python\n     expected = np.asarray([\n         [1, 2, 1],\n         [1, 2, 2],\n         [1, 2, 3],\n         [1, 2, 4],\n         [1, 2, 5]\n     ])\n     ```\n   - Now, the shape of the `expected` array is `(5, 3)`, matching the transformed dataset with `feat1`, `feat2`, and `feat3`.\n\n### Outcome:\nWith these changes, the transformed output will match the expected output, thus resolving the assertion error and verifying that the feature generation and transformation process correctly accommodates new features.\n\n## Conclusion\n\nThe identified mismatch in the expected output of the test stems from an unaccounted feature, `feat3`. By updating the test to include this feature in both the feature generation and the expected output, the"
    },
    "issue_ground_truth": {
      "title": "Mismatch in Expected Transformed Data Output for Additional Feature",
      "description": "When generating features using `gen_features` and applying transformations through `DataFrameMapper`, there is an inconsistency in the output transformed data when new features are included. Specifically, adding a new feature (`feat3` in this case) for transformation should accurately reflect in the transformed dataset output. Currently, the `expected` array in the test case does not include the additional expected data from the new feature.\n\nSteps to Reproduce:\n1. Define a dataset with multiple features, including `feat1`, `feat2`, and `feat3`.\n2. Use `gen_features` to generate feature definitions for `feat1` and `feat2` along with `MockTransformer`.\n3. Append `feat3` to the features_defs with a `None` transformer.\n4. Create a `DataFrameMapper` with the generated features and transform the dataset.\n5. Check the transformed dataset against the expected output, which should now include the transformed data for `feat3`.\n\nExpected Behavior:\nThe transformed dataset should include all defined features in the correct order, reflecting any new additions accurately in the output.\n\nObserved Behavior:\nThe `expected` array does not account for the new feature `feat3`, causing assertion failures when comparing the transformed output to the expected result.",
      "explanation": "### Summary of the Issue\n\nThe issue reported was an inconsistency in the transformed data output when new features are included in a dataset. Specifically, when adding a new feature (`feat3`), the expected output array in the test case does not account for this new feature. This leads to assertion failures when comparing the actual transformed output with the expected result.\n\n### Detailed Content of the Commit\n\nThe commit made changes to the test file, specifically to the function that generates feature definitions and subsequently uses `DataFrameMapper`. The changes included the following:\n\n1. **Feature Definition Update**: The commit appended the new feature (`feat3`) to the `features_defs` list. Initially, only `feat1` and `feat2` were being transformed, but now `feat3` has been added without any specific transformer.\n2. **Expected Output Update**: The commit updated the expected output array to account for the transformation of `feat3`. The array was expanded to include the values from `feat3` reflecting its presence in the dataset.\n\n### Explanation of the Cause of the Issue\n\nThe primary cause of the issue was that the original test setup did not account for the changes when new features were added to the dataset. When new features are included (such as `feat3`), the expected output array in the test should match the actual transformed dataset which now includes these additional features. However, the expected array was not updated accordingly, resulting in mismatched dimensions and values, thus causing assertion failures during testing.\n\n### Solution Provided by the Commit\n\nThe changes made in the commit effectively addressed the issue by:\n\n1. **Updating Feature Definitions**: Making sure that the feature definition list (`features_defs`) reflects all features in the dataset, including any new ones such as `feat3`.\n2. 
**Synchronizing Expected Output**: Modifying the expected output array to include the values from the newly added feature (`feat3`). This ensures that the test's expected outcome accurately mirrors the actual transformation that occurs when `DataFrameMapper` processes the complete dataset.\n\n### How the Commit Solves the Issue\n\nBy adding `feat3` to the `features_defs` and updating the expected output array, the commit ensures that the entire feature set of the dataset is correctly handled. This means that:\n\n- The transformation of the dataset now includes all relevant features.\n- The assertion in the unit tests compares the transformed dataset against an expected array that appropriately includes all feature data, thus preventing mismatches and assertion errors.\n\nThis solution ensures that tests are accurately reflecting the transformations applied to the dataset, including any additions of new features such as `feat3`.\n\n### Solution to the Issue\n\nThe developer fixed the inconsistency by updating both the feature definitions and the expected output in the test case. This alignment ensures that any new feature added to the dataset is properly included in both the `DataFrameMapper` transformations and the expected results in the tests. By doing so, tests will pass successfully, confirming that the transformations are applied correctly across all features, and the expected output is accurately defined.\n\nIn summary:\n\n- **Issue**: Incompatibility in expected output when new features are added.\n- **Commit Changes**: Addition of the new feature (`feat3`) to the feature definitions, and updating the expected output array to include this new feature.\n- **Resolution**: Ensures consistency between the transformed dataset and the expected test results, thereby preventing assertion failures.\n\nThis solution is vital for maintaining the integrity of the feature transformation pipeline and ensuring robust testing practices.\n\n"
    },
    "location_origin": [
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
        "function": {
          "97": "test_compatibility_with_data_frame_mapper"
        },
        "content_all": {
          "94": "\n",
          "95": "\n",
          "96": "\n",
          "97": "def test_compatibility_with_data_frame_mapper(simple_dataset):\n",
          "98": "    \"\"\"\n",
          "99": "    Tests compatibility of generated feature definition with DataFrameMapper.\n",
          "100": "    \"\"\"\n",
          "101": "    features_defs = gen_features(\n",
          "102": "        columns=['feat1', 'feat2'],\n",
          "103": "        classes=[MockTransformer])\n",
          "104": "\n",
          "105": "    mapper = DataFrameMapper(features_defs)\n",
          "106": "    X = mapper.fit_transform(simple_dataset)\n",
          "107": "    expected = np.asarray([\n",
          "108": "        [1, 2, 1],\n",
          "109": "        [1, 2, 2],\n",
          "110": "        [1, 2, 3],\n",
          "111": "        [1, 2, 4]\n",
          "112": "    ])\n",
          "113": "\n",
          "114": "    assert_array_equal(X, expected)\n",
          "115": "\n",
          "116": "\n"
        },
        "content_change": {
          "102": "        columns=['feat1', 'feat2', 'feat3'],\n",
          "107": "    expected = np.asarray([\n",
          "108": "        [1, 2, 1],\n",
          "109": "        [1, 2, 2],\n",
          "110": "        [1, 2, 3],\n",
          "111": "        [1, 2, 4],\n",
          "112": "        [1, 2, 5]\n",
          "113": "    ])\n"
        }
      }
    ],
    "location_message": [
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
        "function": {
          "97": "test_compatibility_with_data_frame_mapper"
        },
        "content_all": {
          "94": "\n",
          "95": "\n",
          "96": "\n",
          "97": "def test_compatibility_with_data_frame_mapper(simple_dataset):\n",
          "98": "    \"\"\"\n",
          "99": "    Tests compatibility of generated feature definition with DataFrameMapper.\n",
          "100": "    \"\"\"\n",
          "101": "    features_defs = gen_features(\n",
          "102": "        columns=['feat1', 'feat2'],\n",
          "103": "        classes=[MockTransformer])\n",
          "104": "\n",
          "105": "    mapper = DataFrameMapper(features_defs)\n",
          "106": "    X = mapper.fit_transform(simple_dataset)\n",
          "107": "    expected = np.asarray([\n",
          "108": "        [1, 2, 1],\n",
          "109": "        [1, 2, 2],\n",
          "110": "        [1, 2, 3],\n",
          "111": "        [1, 2, 4]\n",
          "112": "    ])\n",
          "113": "\n",
          "114": "    assert_array_equal(X, expected)\n",
          "115": "\n",
          "116": "\n"
        },
        "content_change": {
          "107": "    expected = np.asarray([\n",
          "108": "        [1, 1],\n",
          "109": "        [2, 2],\n",
          "110": "        [1, 2],\n",
          "111": "        [3, 2],\n",
          "112": "        [1, 3]\n"
        }
      }
    ],
    "location_ground": [
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
        "function": {
          "97": "test_compatibility_with_data_frame_mapper"
        },
        "content_all": {
          "96": "\n",
          "97": "def test_compatibility_with_data_frame_mapper(simple_dataset):\n",
          "98": "    \"\"\"\n",
          "99": "    Tests compatibility of generated feature definition with DataFrameMapper.\n",
          "100": "    \"\"\"\n",
          "101": "    features_defs = gen_features(\n",
          "102": "        columns=['feat1', 'feat2'],\n",
          "103": "        classes=[MockTransformer])\n",
          "104": "    \n",
          "105": "    mapper = DataFrameMapper(features_defs)\n",
          "106": "    X = mapper.fit_transform(simple_dataset)\n",
          "107": "    expected = np.asarray([\n",
          "108": "        [1, 2, 1],\n",
          "109": "        [1, 2, 2],\n",
          "110": "        [1, 2, 3],\n",
          "111": "        [1, 2, 4]\n",
          "112": "    ])\n",
          "113": "    \n",
          "114": "    assert_array_equal(X, expected)\n",
          "115": "\n"
        },
        "content_change": {
          "102": "        columns=['feat1', 'feat2', 'feat3'],\n",
          "107": "    expected = np.asarray([\n",
          "108": "        [1, 2, 1],\n",
          "109": "        [1, 2, 2],\n",
          "110": "        [1, 2, 3],\n",
          "111": "        [1, 2, 4],\n",
          "112": "        [1, 2, 5]\n",
          "113": "    ])\n"
        }
      }
    ],
    "location_ground_exp": [
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
        "function": {
          "97": "test_compatibility_with_data_frame_mapper"
        },
        "content_all": {
          "96": "\n",
          "97": "def test_compatibility_with_data_frame_mapper(simple_dataset):\n",
          "98": "    \"\"\"\n",
          "99": "    Tests compatibility of generated feature definition with DataFrameMapper.\n",
          "100": "    \"\"\"\n",
          "101": "    features_defs = gen_features(\n",
          "102": "        columns=['feat1', 'feat2'],\n",
          "103": "        classes=[MockTransformer])\n",
          "104": "\n",
          "105": "    mapper = DataFrameMapper(features_defs)\n",
          "106": "    X = mapper.fit_transform(simple_dataset)\n",
          "107": "    expected = np.asarray([\n",
          "108": "        [1, 2, 1],\n",
          "109": "        [1, 2, 2],\n",
          "110": "        [1, 2, 3],\n",
          "111": "        [1, 2, 4]\n",
          "112": "    ])\n",
          "113": "\n",
          "114": "    assert_array_equal(X, expected)\n",
          "115": "\n"
        },
        "content_change": {
          "102": "        columns=['feat1', 'feat2', 'feat3'],\n",
          "112": "        [1, 2, 4, 4],\n",
          "114": "        [1, 2, 5]\n"
        }
      }
    ],
    "location_ground_truth": [
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
        "function": {
          "97": "test_compatibility_with_data_frame_mapper"
        },
        "content_all": {
          "101": "    features_defs = gen_features(\n",
          "102": "        columns=['feat1', 'feat2'],\n",
          "103": "        classes=[MockTransformer])\n",
          "104": "\n",
          "105": "    mapper = DataFrameMapper(features_defs)\n",
          "106": "    X = mapper.fit_transform(simple_dataset)\n",
          "108": "        [1, 2, 1],\n",
          "109": "        [1, 2, 2],\n",
          "110": "        [1, 2, 3],\n",
          "111": "        [1, 2, 4]\n",
          "112": "    ])\n",
          "113": "\n"
        },
        "content_change": {
          "111": "        [1, 2, 4]\n"
        }
      },
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
        "function": null,
        "content_all": {
          "114": "    assert_array_equal(X, expected)\n"
        },
        "content_change": {}
      }
    ],
    "patch_i": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n@@ -101,7 +101,7 @@ def test_generate_features_with_none_only_transformers():\n \n def test_compatibility_with_data_frame_mapper(simple_dataset):\n     \"\"\"\n     Tests compatibility of generated feature definition with DataFrameMapper.\n     \"\"\"\n-    features_defs = gen_features(\n-        columns=['feat1', 'feat2'],\n-        classes=[MockTransformer])\n+    feature_defs = gen_features(\n+        columns=['feat1', 'feat2', 'feat3'],\n+        classes=[MockTransformer]\n+    )\n\n     mapper = DataFrameMapper(features_defs)\n     X = mapper.fit_transform(simple_dataset)\n     expected = np.asarray([\n         [1, 2, 1],\n         [1, 2, 2],\n         [1, 2, 3],\n         [1, 2, 4],\n-        [1, 2, 5]\n+        [1, 2, 5]\n     ])\n \n     assert_array_equal(X, expected)\n",
    "patch_im": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n@@ -107,11 +107,9 @@ def test_compatibility_with_data_frame_mapper(simple_dataset):\n     expected = np.asarray([\n-        [1, 2, 1],\n-        [1, 2, 2],\n-        [1, 2, 3],\n-        [1, 2, 4]\n+        [1, 1],\n+        [2, 2],\n+        [1, 2],\n+        [3, 2],\n+        [1, 3]\n     ])\n \n     assert_array_equal(X, expected)\n",
    "patch_il": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n@@ -101,7 +101,7 @@ def test_compatibility_with_data_frame_mapper(simple_dataset):\n     features_defs = gen_features(\n-        columns=['feat1', 'feat2'],\n+        columns=['feat1', 'feat2', 'feat3'],\n         classes=[MockTransformer])\n \n     mapper = DataFrameMapper(features_defs)\n     X = mapper.fit_transform(simple_dataset)\n     expected = np.asarray([\n         [1, 2, 1],\n         [1, 2, 2],\n         [1, 2, 3],\n         [1, 2, 4],\n+        [1, 2, 5]\n     ])\n \n     assert_array_equal(X, expected)\n",
    "patch_iml": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n@@ -107,11 +107,9 @@ def test_compatibility_with_data_frame_mapper(simple_dataset):\n     expected = np.asarray([\n-        [1, 2, 1],\n-        [1, 2, 2],\n-        [1, 2, 3],\n-        [1, 2, 4]\n+        [1, 1],\n+        [2, 2],\n+        [1, 2],\n+        [3, 2],\n+        [1, 3]\n     ])\n \n     assert_array_equal(X, expected)\n",
    "patch_ground": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n@@ -101,7 +101,11 @@ def test_generate_features_with_several_classes():\n \n def test_compatibility_with_data_frame_mapper(simple_dataset):\n     \"\"\"\n-    Tests compatibility of generated feature definition with DataFrameMapper.\n+    Tests compatibility of generated feature definition with DataFrameMapper,\n+    including the transformation of an additional feature 'feat3' which has\n+    no specified transformer.\n     \"\"\"\n     features_defs = gen_features(\n-        columns=['feat1', 'feat2'],\n+        columns=['feat1', 'feat2', 'feat3'],\n         classes=[MockTransformer]\n     )\n \n     mapper = DataFrameMapper(features_defs)\n     X = mapper.fit_transform(simple_dataset)\n-    expected = np.asarray([\n-        [1, 2, 1],\n-        [1, 2, 2],\n-        [1, 2, 3],\n-        [1, 2, 4]\n-    ])\n+    expected = np.asarray([\n+        [1, 2, 1],\n+        [1, 2, 2],\n+        [1, 2, 3],\n+        [1, 2, 4],\n+        [1, 2, 5],\n+    ])\n \n     assert_array_equal(X, expected)\n",
    "patch_ground_location": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n@@ -101,8 +101,8 @@ def test_generate_features_with_none_only_transformers():\n \n def test_compatibility_with_data_frame_mapper(simple_dataset):\n     \"\"\"\n     Tests compatibility of generated feature definition with DataFrameMapper.\n     \"\"\"\n-    features_defs = gen_features(\n-        columns=['feat1', 'feat2'],\n-        classes=[MockTransformer])\n+    features_defs = gen_features(\n+        columns=['feat1', 'feat2', 'feat3'],\n+        classes=[MockTransformer])\n \n     mapper = DataFrameMapper(features_defs)\n     X = mapper.fit_transform(simple_dataset)\n-    expected = np.asarray([\n-        [1, 2, 1],\n-        [1, 2, 2],\n-        [1, 2, 3],\n-        [1, 2, 4]\n-    ])\n+    expected = np.asarray([\n+        [1, 2, 1],\n+        [1, 2, 2],\n+        [1, 2, 3],\n+        [1, 2, 4],\n+        [1, 2, 5]\n+    ])\n \n     assert_array_equal(X, expected)\n",
    "patch_ground_exp": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n@@ -101,7 +101,7 @@ def test_compatibility_with_data_frame_mapper(simple_dataset):\n     features_defs = gen_features(\n-        columns=['feat1', 'feat2'],\n+        columns=['feat1', 'feat2', 'feat3'],\n         classes=[MockTransformer])\n \n     mapper = DataFrameMapper(features_defs)\n     X = mapper.fit_transform(simple_dataset)\n     expected = np.asarray([\n-        [1, 2, 1],\n-        [1, 2, 2],\n-        [1, 2, 3],\n-        [1, 2, 4]\n+        [1, 2, 1, 1],\n+        [1, 2, 2, 2],\n+        [1, 2, 2, 3],\n+        [1, 2, 2, 4],\n+        [1, 2, 3, 5]\n     ])\n \n     assert_array_equal(X, expected)\n",
    "patch_ground_all": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n@@ -101,7 +101,8 @@ def test_compatibility_with_data_frame_mapper(simple_dataset):\n         columns=['feat1', 'feat2'],\n         classes=[MockTransformer])\n \n-    mapper = DataFrameMapper(features_defs)\n+    # Include feat3 with None transformer\n+    features_defs.append(('feat3', None, {}))\n \n     mapper = DataFrameMapper(features_defs)\n     X = mapper.fit_transform(simple_dataset)\n@@ -108,11 +109,12 @@ def test_compatibility_with_data_frame_mapper(simple_dataset):\n         [1, 2, 1],\n         [1, 2, 2],\n         [1, 2, 3],\n-        [1, 2, 4]\n+        [1, 2, 4],\n+        [1, 2, 5]\n     ])\n \n     assert_array_equal(X, expected)\n \n def assert_attributes(obj, **attrs):\n     for attr, value in attrs.items():\n         assert getattr(obj, attr) == value\n",
    "patch_ground_truth": "--- a/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n@@ -101,6 +101,7 @@\n     features_defs = gen_features(\n         columns=['feat1', 'feat2'],\n         classes=[MockTransformer])\n+    features_defs.append(('feat3', None))\n \n     mapper = DataFrameMapper(features_defs)\n     X = mapper.fit_transform(simple_dataset)\n@@ -108,7 +109,8 @@\n         [1, 2, 1],\n         [1, 2, 2],\n         [1, 2, 3],\n-        [1, 2, 4]\n+        [1, 2, 4],\n+        [1, 2, 5]\n     ])\n \n     assert_array_equal(X, expected)\n",
    "message": "============================================================================================= FAILURES ==============================================================================================\n_____________________________________________________________________________ test_compatibility_with_data_frame_mapper _____________________________________________________________________________\n\nsimple_dataset =    feat1  feat2  feat3\n0      1      1      1\n1      2      2      2\n2      1      2      3\n3      3      2      4\n4      1      3      5\n\n    def test_compatibility_with_data_frame_mapper(simple_dataset):\n        \"\"\"\n        Tests compatibility of generated feature definition with DataFrameMapper.\n        \"\"\"\n        features_defs = gen_features(\n            columns=['feat1', 'feat2'],\n            classes=[MockTransformer])\n    \n        mapper = DataFrameMapper(features_defs)\n        X = mapper.fit_transform(simple_dataset)\n        expected = np.asarray([\n            [1, 2, 1],\n            [1, 2, 2],\n            [1, 2, 3],\n            [1, 2, 4]\n        ])\n    \n>       assert_array_equal(X, expected)\n\ntests/test_features_generator.py:114: \n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _\n\nargs = (<built-in function eq>, array([[1, 2],\n       [1, 2],\n       [1, 2],\n       [1, 2],\n       [1, 2]]), array([[1, 2, 1],\n       [1, 2, 2],\n       [1, 2, 3],\n       [1, 2, 4]]))\nkwds = {'err_msg': '', 'header': 'Arrays are not equal', 'strict': False, 'verbose': True}\n\n    @wraps(func)\n    def inner(*args, **kwds):\n        with self._recreate_cm():\n>           return func(*args, **kwds)\nE           AssertionError: \nE           Arrays are not equal\nE           \nE           (shapes (5, 2), (4, 3) mismatch)\nE            x: array([[1, 2],\nE                 
 [1, 2],\nE                  [1, 2],...\nE            y: array([[1, 2, 1],\nE                  [1, 2, 2],\nE                  [1, 2, 3],\nE                  [1, 2, 4]])\n\n../../../../anaconda3/envs/py39/lib/python3.9/contextlib.py:79: AssertionError\n----------------------------------------------------------------------------------------- Captured log call -----------------------------------------------------------------------------------------\nINFO     sklearn_pandas:dataframe_mapper.py:343 [FIT_TRANSFORM] feat1: 2.5e-05 secs\nINFO     sklearn_pandas:dataframe_mapper.py:343 [FIT_TRANSFORM] feat2: 1e-05 secs\n========================================================================================= warnings summary ==========================================================================================\ntests/test_dataframe_mapper.py::test_complex_object_df\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:106: SettingWithCopyWarning: \n  A value is trying to be set on a copy of a slice from a DataFrame.\n  Try using .loc[row_indexer,col_indexer] = value instead\n  \n  See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n    X[col] = X[col].map(lambda img: np.max(img))\n\ntests/test_dataframe_mapper.py::test_sparse_features\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:865: DeprecationWarning: Please use `csr_matrix` from the `scipy.sparse` namespace, the `scipy.sparse.csr` namespace is deprecated.\n    assert type(dmatrix) == sparse.csr.csr_matrix\n\ntests/test_dataframe_mapper.py::test_sparse_off\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:879: DeprecationWarning: Please use `csr_matrix` from the `scipy.sparse` namespace, the `scipy.sparse.csr` namespace is deprecated.\n    assert 
type(dmatrix) != sparse.csr.csr_matrix\n\ntests/test_transformers.py::test_common_numerical_transformer\ntests/test_transformers.py::test_numerical_transformer_serialization\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/sklearn_pandas/transformers.py:35: DeprecationWarning: \n              NumericalTransformer will be deprecated in 3.0 version.\n              Please use Sklearn.base.TransformerMixin to write\n              customer transformers\n              \n    warnings.warn(\"\"\"\n\n-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html\n====================================================================================== short test summary info ======================================================================================\nFAILED tests/test_features_generator.py::test_compatibility_with_data_frame_mapper - AssertionError: \n============================================================================= 1 failed, 69 passed, 5 warnings in 1.32s ==============================================================================",
    "CodeBase": [
      {
        "path": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
        "content": "1 # -*- coding: utf8 -*-\n2 \n3 import pytest\n4 from unittest.mock import Mock\n5 from pandas import DataFrame\n6 import pandas as pd\n7 from scipy import sparse\n8 from sklearn.datasets import load_iris\n9 from sklearn.pipeline import Pipeline\n10 from sklearn.model_selection import cross_val_score\n11 from sklearn.svm import SVC\n12 from sklearn.feature_extraction.text import CountVectorizer\n13 from sklearn.feature_extraction import DictVectorizer\n14 from sklearn.preprocessing import (\n15     StandardScaler, OneHotEncoder, LabelBinarizer)\n16 from sklearn.impute import SimpleImputer as Imputer\n17 from sklearn.feature_selection import SelectKBest, chi2\n18 from sklearn.base import BaseEstimator, TransformerMixin\n19 import sklearn.decomposition\n20 import numpy as np\n21 from numpy.testing import assert_array_equal\n22 import pickle\n23 from sklearn.compose import make_column_selector\n24 \n25 from sklearn_pandas import DataFrameMapper\n26 from sklearn_pandas.dataframe_mapper import _handle_feature, _build_transformer\n27 from sklearn_pandas.pipeline import TransformerPipeline\n28 \n29 \n30 class MockXTransformer(object):\n31     \"\"\"\n32     Mock transformer that accepts no y argument.\n33     \"\"\"\n34     def fit(self, X):\n35         return self\n36 \n37     def transform(self, X):\n38         return X\n39 \n40 \n41 class MockTClassifier(object):\n42     \"\"\"\n43     Mock transformer/classifier.\n44     \"\"\"\n45     def fit(self, X, y=None):\n46         return self\n47 \n48     def transform(self, X):\n49         return X\n50 \n51     def predict(self, X):\n52         return True\n53 \n54 \n55 class DateEncoder():\n56     def fit(self, X, y=None):\n57         retur(...truncated)"
      },
      {
        "path": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
        "content": "1 from collections import Counter\n2 \n3 import pytest\n4 import numpy as np\n5 from pandas import DataFrame\n6 from numpy.testing import assert_array_equal\n7 \n8 from sklearn_pandas import DataFrameMapper\n9 from sklearn_pandas.features_generator import gen_features\n10 \n11 \n12 class MockClass(object):\n13 \n14     def __init__(self, value=1, name='class'):\n15         self.value = value\n16         self.name = name\n17 \n18 \n19 class MockTransformer(object):\n20 \n21     def __init__(self):\n22         self.most_common_ = None\n23 \n24     def fit(self, X, y=None):\n25         [(value, _)] = Counter(X).most_common(1)\n26         self.most_common_ = value\n27         return self\n28 \n29     def transform(self, X, y=None):\n30         return np.asarray([self.most_common_] * len(X))\n31 \n32 \n33 @pytest.fixture\n34 def simple_dataset():\n35     return DataFrame({\n36         'feat1': [1, 2, 1, 3, 1],\n37         'feat2': [1, 2, 2, 2, 3],\n38         'feat3': [1, 2, 3, 4, 5],\n39     })\n40 \n41 \n42 def test_generate_features_with_default_parameters():\n43     \"\"\"\n44     Tests generating features from classes with default init arguments.\n45     \"\"\"\n46     columns = ['colA', 'colB', 'colC']\n47     feature_defs = gen_features(columns=columns, classes=[MockClass])\n48     assert len(feature_defs) == len(columns)\n49 \n50     for feature in feature_defs:\n51         assert feature[2] == {}\n52 \n53     feature_dict = dict([_[0:2] for _ in feature_defs])\n54     assert columns == sorted(feature_dict.keys())\n55 \n56     # default init arguments for MockClass for clarification.\n57     expected = {'value': 1, 'name': 'class'}\n58     for column, transformers in feature_dict.items():\n59         for obj in transformers:\n60             assert_attributes(obj, **expected)\n61 \n62 \n63 def test_generate_features_with_several_classes():\n64     \"\"\"\n65     Tests generating features pipeline with different transformers parameters.\n66     
\"\"\"\n67     feature_defs = gen_features(\n68         columns=['colA', 'colB', 'colC'],\n69         classes=[\n70             {'class': MockClass},\n71             {'class': MockClass, 'name': 'mockA'},\n72             {'class': MockClass, 'name': 'mockB', 'value': None}\n73         ]\n74     )\n75 \n76     for col, transformers, params in feature_defs:\n77         assert_attributes(transformers[0], name='class', value=1)\n78         assert_attributes(transformers[1], name='mockA', value=1)\n79         assert_attributes(transformers[2], name='mockB', value=None)\n80 \n81 \n82 def test_generate_features_with_none_only_transformers():\n83     \"\"\"\n84     Tests generating \"dummy\" feature definition which doesn't apply any\n85     transformation.\n86     \"\"\"\n87     feature_defs = gen_features(\n88         columns=['colA', 'colB', 'colC'], classes=[None])\n89 \n90     expected = [('colA', None, {}),\n91                 ('colB', None, {}),\n92                 ('colC', None, {})]\n93 \n94     assert feature_defs == expected\n95 \n96 \n97 def test_compatibility_with_data_frame_mapper(simple_dataset):\n98     \"\"\"\n99     Tests compatibility of generated feature definition with DataFrameMapper.\n100     \"\"\"\n101     features_defs = gen_features(\n102         columns=['feat1', 'feat2'],\n103         classes=[MockTransformer])\n104 \n105     mapper = DataFrameMapper(features_defs)\n106     X = mapper.fit_transform(simple_dataset)\n107     expected = np.asarray([\n108         [1, 2, 1],\n109         [1, 2, 2],\n110         [1, 2, 3],\n111         [1, 2, 4]\n112     ])\n113 \n114     assert_array_equal(X, expected)\n115 \n116 \n117 def assert_attributes(obj, **attrs):\n118     for attr, value in attrs.items():\n119         assert getattr(obj, attr) == value"
      },
      {
        "path": "scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py",
        "content": "1 import tempfile\n2 import pytest\n3 import numpy as np\n4 from pandas import DataFrame\n5 import joblib\n6 \n7 from sklearn_pandas import DataFra(...truncated)"
      },
      {
        "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/dataframe_mapper.py",
        "content": "1 import contextlib\n2 from datetime import datetime\n3 import pandas as pd\n4 (...truncated)"
      },
      {
        "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/features_generator.py",
        "content": "1 def gen_features(colu(...truncated)"
      }
    ],
    "CommitSHA": "c9db2d6dcbf515eade751073f43318e43cae5177"
  },
  "Score": {
    "Difficulty": "Medium",
    "issue_origin": {
      "Title": 4,
      "Description": 6,
      "Reproducibility": 5,
      "Relevance": 6,
      "Explanation": 5,
      "Overall": 5
    },
    "issue_message": {
      "Title": 7,
      "Description": 6,
      "Reproducibility": 6,
      "Relevance": 7,
      "Explanation": 7,
      "Overall": 7
    },
    "issue_ground": {
      "Title": 8,
      "Description": 7,
      "Reproducibility": 8,
      "Relevance": 7,
      "Explanation": 6,
      "Overall": 7
    },
    "issue_ground_truth": {
      "title": "Mismatch in Expected Transformed Data Output for Additional Feature",
      "description": "When feature definitions generated by `gen_features` are passed to `DataFrameMapper`, the transformed output should contain every defined feature, including newly added ones. In `test_compatibility_with_data_frame_mapper`, the feature definitions and the `expected` array disagree about the additional feature `feat3`, so the comparison fails.\n\nSteps to Reproduce:\n1. Define a dataset with the features `feat1`, `feat2`, and `feat3`.\n2. Use `gen_features` to generate feature definitions for `feat1` and `feat2` with `MockTransformer`.\n3. Append `feat3` to `features_defs` with a `None` transformer.\n4. Create a `DataFrameMapper` from the feature definitions and call `fit_transform` on the dataset.\n5. Compare the transformed dataset with the expected output, which should now include the values of `feat3`.\n\nExpected Behavior:\nThe transformed dataset includes all defined features in the correct order, with one row per input row.\n\nObserved Behavior:\nIn the failing run, the mapper output has shape (5, 2) because only `feat1` and `feat2` are mapped, while the `expected` array has shape (4, 3). `assert_array_equal` therefore raises an `AssertionError` on the shape mismatch.",
      "explanation": "### Summary of the Issue\n\nThe test `test_compatibility_with_data_frame_mapper` fails because its feature definitions and its `expected` array are out of sync with respect to the additional feature `feat3`. The mapper output has shape (5, 2), the `expected` array has shape (4, 3), and `assert_array_equal` raises an `AssertionError`.\n\n### Detailed Content of the Commit\n\nThe commit changes the test that builds feature definitions with `gen_features` and passes them to `DataFrameMapper`:\n\n1. **Feature Definition Update**: `feat3` is appended to the `features_defs` list without a transformer. Previously only `feat1` and `feat2` were mapped.\n2. **Expected Output Update**: The `expected` array is expanded so that it includes the values of `feat3` for the dataset.\n\n### Explanation of the Cause of the Issue\n\nThe test setup was not kept consistent when `feat3` was introduced. In this test each feature definition yields one output column, and the mapper returns one row per input row, so with only `feat1` and `feat2` defined it produces a (5, 2) array for the five-row dataset. The `expected` array has three columns but only four rows, so it matches neither the column count nor the row count of the actual output.\n\n### Solution Provided by the Commit\n\n1. **Updating Feature Definitions**: `features_defs` now covers all features in the dataset, including `feat3`.\n2. **Synchronizing Expected Output**: The `expected` array now includes the `feat3` values, so it mirrors what `DataFrameMapper` produces for the complete dataset.\n\n### How the Commit Solves the Issue\n\nWith `feat3` in `features_defs` and the `expected` array updated to match:\n\n- The transformation covers all relevant features.\n- The assertion compares the transformed dataset against an array that includes all feature data, so the test no longer fails on a mismatch.\n\n### Solution to the Issue\n\nIn summary:\n\n- **Issue**: The expected output is inconsistent with the feature definitions when a new feature is added.\n- **Commit Changes**: `feat3` is added to the feature definitions, and the `expected` array is updated to include it.\n- **Resolution**: The transformed dataset and the expected test result agree, which prevents the assertion failure.\n\n"
    }
  }
}