{
  "RepoName": "https://github.com/scikit-learn-contrib/sklearn-pandas.git",
  "CommitSHA": "c9db2d6dcbf515eade751073f43318e43cae5177",
  "Time": "",
  "Difficulty": "Medium",
  "Type": "type mismatch",
  "BuggyCode": [
    {
      "path": "scikit-learn-contrib_sklearn-pandas/test.py",
      "content": "import pytest\nfrom unittest.mock import Mock\nimport numpy as np\nimport pandas as pd\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn.compose import make_column_selector\nfrom sklearn.preprocessing import StandardScaler\n\n\nclass GetStartWith:\n    def __init__(self, start_str):\n        self.start_str = start_str\n\n    def __call__(self, X: pd.DataFrame) -> list:\n        return [c for c in X.columns if c.startswith(self.start_str)]\n\n\ndf = pd.DataFrame({\n    'sepal length (cm)': [1.0, 2.0, 3.0],\n    'sepal width (cm)': [1.0, 2.0, 3.0],\n    'petal length (cm)': [1.0, 2.0, 3.0],\n    'petal width (cm)': [1.0, 2.0, 3.0]\n})\nt = DataFrameMapper([\n    (make_column_selector(dtype_include=float), StandardScaler(), {'alias': 'x'}),\n    (GetStartWith('petal'), None, {'alias': 'petal'})\n], df_out=True, default=False)\n\nt.fit(df)\nprint(t.transform(df).shape)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/setup.py",
      "content": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\nfrom setuptools import setup\nfrom setuptools.command.test import test as TestCommand\nimport re\n\nfor line in open('sklearn_pandas/__init__.py'):\n    match = re.match(\"__version__ *= *'(.*)'\", line)\n    if match:\n        __version__, = match.groups()\n\n\nclass PyTest(TestCommand):\n    user_options = [('pytest-args=', 'a', \"Arguments to pass to py.test\")]\n\n    def initialize_options(self):\n        TestCommand.initialize_options(self)\n        self.pytest_args = []\n\n    def finalize_options(self):\n        TestCommand.finalize_options(self)\n        self.test_args = []\n        self.test_suite = True\n\n    def run(self):\n        import pytest\n        errno = pytest.main(self.pytest_args)\n        raise SystemExit(errno)\n\n\nsetup(name='sklearn-pandas',\n      version=__version__,\n      description='Pandas integration with sklearn',\n      maintainer='Ritesh Agrawal',\n      maintainer_email='ragrawal@gmail.com',\n      url='https://github.com/scikit-learn-contrib/sklearn-pandas',\n      packages=['sklearn_pandas'],\n      keywords=['scikit', 'sklearn', 'pandas'],\n      install_requires=[\n          'scikit-learn>=0.23.0',\n          'scipy>=1.5.1',\n          'pandas>=1.1.4',\n          'numpy>=1.18.1'\n      ],\n      tests_require=['pytest', 'mock'],\n      cmdclass={'test': PyTest},\n      license='MIT License'\n)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/noxfile.py",
      "content": "import nox\n\n@nox.session\ndef lint(session):\n    session.install('pytest>=5.3.5', 'setuptools>=45.2',\n                    'wheel>=0.34.2', 'flake8>=3.7.9',\n                    'numpy==1.18.1', 'pandas==1.1.4')\n    session.install('.')\n    session.run('flake8', 'sklearn_pandas/', 'tests')\n\n@nox.session\n@nox.parametrize('numpy', ['1.18.1', '1.19.4', '1.20.1'])\n@nox.parametrize('scipy', ['1.5.4', '1.6.0'])\n@nox.parametrize('pandas', ['1.1.4', '1.2.2'])\ndef tests(session, numpy, scipy, pandas):\n    session.install('pytest>=5.3.5', \n                    'setuptools>=45.2',\n                    'wheel>=0.34.2',\n                    f'numpy=={numpy}',\n                    f'scipy=={scipy}',\n                    f'pandas=={pandas}'\n                    )\n    session.install('.')\n    session.run('py.test', 'README.rst', 'tests')\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/features_generator.py",
      "content": "def gen_features(columns, classes=None, prefix='', suffix=''):\n    \"\"\"Generates a feature definition list which can be passed\n    into DataFrameMapper\n\n    Params:\n\n    columns     a list of column names to generate features for.\n\n    classes     a list of classes for each feature, a list of dictionaries with\n                transformer class and init parameters, or None.\n\n                If list of classes is provided, then each of them is\n                instantiated with default arguments. Example:\n\n                    classes = [StandardScaler, LabelBinarizer]\n\n                If list of dictionaries is provided, then each of them should\n                have a 'class' key with transformer class. All other keys are\n                passed into 'class' value constructor. Example:\n\n                    classes = [\n                        {'class': StandardScaler, 'with_mean': False},\n                        {'class': LabelBinarizer}\n                    }]\n\n                If None value selected, then each feature left as is.\n\n    prefix      add prefix to transformed column names\n\n    suffix      add suffix to transformed column names.\n\n    \"\"\"\n    if classes is None:\n        return [(column, None) for column in columns]\n\n    feature_defs = []\n\n    for column in columns:\n        feature_transformers = []\n\n        arguments = {}\n        if prefix and prefix != \"\":\n            arguments['prefix'] = prefix\n        if suffix and suffix != \"\":\n            arguments['suffix'] = suffix\n\n        classes = [cls for cls in classes if cls is not None]\n        if not classes:\n            feature_defs.append((column, None, arguments))\n\n        else:\n            for definition in classes:\n                if isinstance(definition, dict):\n                    params = definition.copy()\n                    klass = params.pop('class')\n                    feature_transformers.append(klass(**params))\n                else:\n                    feature_transformers.append(definition())\n\n            if not feature_transformers:\n                feature_transformers = None\n\n            feature_defs.append((column, feature_transformers, arguments))\n\n    return feature_defs\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/transformers.py",
      "content": "import numpy as np\nimport pandas as pd\nfrom sklearn.base import TransformerMixin\nimport warnings\n\n\ndef _get_mask(X, value):\n    \"\"\"\n    Compute the boolean mask X == missing_values.\n    \"\"\"\n    if value == \"NaN\" or \\\n       value is None or \\\n       (isinstance(value, float) and np.isnan(value)):\n        return pd.isnull(X)\n    else:\n        return X == value\n\n\nclass NumericalTransformer(TransformerMixin):\n    \"\"\"\n    Provides commonly used numerical transformers.\n    \"\"\"\n    SUPPORTED_FUNCTIONS = ['log', 'log1p']\n\n    def __init__(self, func):\n        \"\"\"\n        Params\n\n        func    function to apply to input columns. The function will be\n                applied to each value. Supported functions are defined\n                in SUPPORTED_FUNCTIONS variable. Throws assertion error if the\n                not supported.\n        \"\"\"\n\n        warnings.warn(\"\"\"\n            NumericalTransformer will be deprecated in 3.0 version.\n            Please use Sklearn.base.TransformerMixin to write\n            customer transformers\n            \"\"\", DeprecationWarning)\n\n        assert func in self.SUPPORTED_FUNCTIONS, \\\n            f\"Only following func are supported: {self.SUPPORTED_FUNCTIONS}\"\n        super(NumericalTransformer, self).__init__()\n        self.__func = func\n\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X, y=None):\n        if self.__func == 'log1p':\n            return np.vectorize(np.log1p)(X)\n        elif self.__func == 'log':\n            return np.vectorize(np.log)(X)\n\n        raise ValueError(f\"Invalid function name: {self.__func}\")\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/__init__.py",
      "content": "__version__ = '2.2.0'\n\nimport logging\nlogger = logging.getLogger(__name__)\n\nfrom .dataframe_mapper import DataFrameMapper  # NOQA\nfrom .features_generator import gen_features  # NOQA\nfrom .transformers import NumericalTransformer # NOQA\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/pipeline.py",
      "content": "import six\nfrom sklearn.pipeline import _name_estimators, Pipeline\nfrom sklearn.utils import tosequence\n\n\ndef _call_fit(fit_method, X, y=None, **kwargs):\n    \"\"\"\n    helper function, calls the fit or fit_transform method with the correct\n    number of parameters\n\n    fit_method: fit or fit_transform method of the transformer\n    X: the data to fit\n    y: the target vector relative to X, optional\n    kwargs: any keyword arguments to the fit method\n\n    return: the result of the fit or fit_transform method\n\n    WARNING: if this function raises a TypeError exception, test the fit\n    or fit_transform method passed to it in isolation as _call_fit will not\n    distinguish TypeError due to incorrect number of arguments from\n    other TypeError\n    \"\"\"\n    try:\n        return fit_method(X, y, **kwargs)\n    except TypeError:\n        # fit takes only one argument\n        return fit_method(X, **kwargs)\n\n\nclass TransformerPipeline(Pipeline):\n    \"\"\"\n    Pipeline that expects all steps to be transformers taking a single X\n    argument, an optional y argument, and having fit and transform methods.\n\n    Code is copied from sklearn's Pipeline\n    \"\"\"\n\n    def __init__(self, steps):\n        names, estimators = zip(*steps)\n        if len(dict(steps)) != len(steps):\n            raise ValueError(\n                \"Provided step names are not unique: %s\" % (names,))\n\n        # shallow copy of steps\n        self.steps = tosequence(steps)\n        estimator = estimators[-1]\n\n        for e in estimators:\n            if (not (hasattr(e, \"fit\") or hasattr(e, \"fit_transform\")) or not\n                    hasattr(e, \"transform\")):\n                raise TypeError(\"All steps of the chain should \"\n                                \"be transforms and implement fit and transform\"\n                                \" '%s' (type %s) doesn't)\" % (e, type(e)))\n\n        if not hasattr(estimator, \"fit\"):\n            raise TypeError(\"Last step of chain should implement fit \"\n                            \"'%s' (type %s) doesn't)\"\n                            % (estimator, type(estimator)))\n\n    def _pre_transform(self, X, y=None, **fit_params):\n        fit_params_steps = dict((step, {}) for step, _ in self.steps)\n        for pname, pval in six.iteritems(fit_params):\n            step, param = pname.split('__', 1)\n            fit_params_steps[step][param] = pval\n        Xt = X\n        for name, transform in self.steps[:-1]:\n            if hasattr(transform, \"fit_transform\"):\n                Xt = _call_fit(transform.fit_transform,\n                               Xt, y, **fit_params_steps[name])\n            else:\n                Xt = _call_fit(transform.fit,\n                               Xt, y, **fit_params_steps[name]).transform(Xt)\n        return Xt, fit_params_steps[self.steps[-1][0]]\n\n    def fit(self, X, y=None, **fit_params):\n        Xt, fit_params = self._pre_transform(X, y, **fit_params)\n        _call_fit(self.steps[-1][-1].fit, Xt, y, **fit_params)\n        return self\n\n    def fit_transform(self, X, y=None, **fit_params):\n        Xt, fit_params = self._pre_transform(X, y, **fit_params)\n        if hasattr(self.steps[-1][-1], 'fit_transform'):\n            return _call_fit(self.steps[-1][-1].fit_transform,\n                             Xt, y, **fit_params)\n        else:\n            return _call_fit(self.steps[-1][-1].fit,\n                             Xt, y, **fit_params).transform(Xt)\n\n\ndef make_transformer_pipeline(*steps):\n    \"\"\"Construct a TransformerPipeline from the given estimators.\n    \"\"\"\n    return TransformerPipeline(_name_estimators(steps))\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/dataframe_mapper.py",
      "content": "import contextlib\nfrom datetime import datetime\nimport pandas as pd\nimport numpy as np\nfrom scipy import sparse\nfrom sklearn.base import BaseEstimator, TransformerMixin\nfrom .cross_validation import DataWrapper\nfrom .pipeline import make_transformer_pipeline, _call_fit, TransformerPipeline\nfrom . import logger\n\nstring_types = text_type = str\n\n\ndef _handle_feature(fea):\n    \"\"\"\n    Convert 1-dimensional arrays to 2-dimensional column vectors.\n    \"\"\"\n    if len(fea.shape) == 1:\n        fea = np.array([fea]).T\n\n    return fea\n\n\ndef _build_transformer(transformers):\n    if isinstance(transformers, list):\n        transformers = make_transformer_pipeline(*transformers)\n    return transformers\n\n\ndef _build_feature(columns, transformers, options={}, X=None):\n    if X is None:\n        return (columns, _build_transformer(transformers), options)\n    return (\n        columns(X) if callable(columns) else columns,\n        _build_transformer(transformers),\n        options\n    )\n\n\ndef _elapsed_secs(t1):\n    return (datetime.now()-t1).total_seconds()\n\n\ndef _get_feature_names(estimator):\n    \"\"\"\n    Attempt to extract feature names based on a given estimator\n    \"\"\"\n    if hasattr(estimator, 'classes_'):\n        return estimator.classes_\n    elif hasattr(estimator, 'get_feature_names'):\n        return estimator.get_feature_names()\n    return None\n\n\n@contextlib.contextmanager\ndef add_column_names_to_exception(column_names):\n    # Stolen from https://stackoverflow.com/a/17677938/356729\n    try:\n        yield\n    except Exception as ex:\n        if ex.args:\n            msg = u'{}: {}'.format(column_names, ex.args[0])\n        else:\n            msg = text_type(column_names)\n        ex.args = (msg,) + ex.args[1:]\n        raise\n\n\nclass DataFrameMapper(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Map Pandas data frame column subsets to their own\n    sklearn transformation.\n    \"\"\"\n\n    def __init__(self, features, default=False, sparse=False, df_out=False,\n                 input_df=False, drop_cols=None):\n        \"\"\"\n        Params:\n\n        features    a list of tuples with features definitions.\n                    The first element is the pandas column selector. This can\n                    be a string (for one column) or a list of strings.\n                    The second element is an object that supports\n                    sklearn's transform interface, or a list of such objects\n                    The third element is optional and, if present, must be\n                    a dictionary with the options to apply to the\n                    transformation. Example: {'alias': 'day_of_week'}\n\n        default     default transformer to apply to the columns not\n                    explicitly selected in the mapper. If False (default),\n                    discard them. If None, pass them through untouched. Any\n                    other transformer will be applied to all the unselected\n                    columns as a whole, taken as a 2d-array.\n\n        sparse      will return sparse matrix if set True and any of the\n                    extracted features is sparse. Defaults to False.\n\n        df_out      return a pandas data frame, with each column named using\n                    the pandas column that created it (if there's only one\n                    input and output) or the input columns joined with '_'\n                    if there's multiple inputs, and the name concatenated with\n                    '_1', '_2' etc if there's multiple outputs. NB: does not\n                    work if *default* or *sparse* are true\n\n        input_df    If ``True`` pass the selected columns to the transformers\n                    as a pandas DataFrame or Series. Otherwise pass them as a\n                    numpy array. Defaults to ``False``.\n\n        drop_cols   List of columns to be dropped. Defaults to None.\n\n        \"\"\"\n        self.features = features\n        self.default = default\n        self.built_default = None\n        self.sparse = sparse\n        self.df_out = df_out\n        self.input_df = input_df\n        self.drop_cols = [] if drop_cols is None else drop_cols\n        self.transformed_names_ = []\n        if (df_out and (sparse or default)):\n            raise ValueError(\"Can not use df_out with sparse or default\")\n\n    def _build(self, X=None):\n        \"\"\"\n        Build attributes built_features and built_default.\n        \"\"\"\n        if isinstance(self.features, list):\n            self.built_features = [\n                _build_feature(*f, X=X) for f in self.features\n            ]\n        else:\n            self.built_features = _build_feature(*self.features, X=X)\n        self.built_default = _build_transformer(self.default)\n\n    @property\n    def _selected_columns(self):\n        \"\"\"\n        Return a set of selected columns in the feature list.\n        \"\"\"\n        selected_columns = set()\n        for feature in self.features:\n            columns = feature[0]\n            if isinstance(columns, list):\n                selected_columns = selected_columns.union(set(columns))\n            else:\n                selected_columns.add(columns)\n        return selected_columns\n\n    def _unselected_columns(self, X):\n        \"\"\"\n        Return list of columns present in X and not selected explicitly in the\n        mapper.\n\n        Unselected columns are returned in the order they appear in the\n        dataframe to avoid issues with different ordering during default fit\n        and transform steps.\n        \"\"\"\n        X_columns = list(X.columns)\n        return [column for column in X_columns if\n                column not in self._selected_columns\n                and column not in self.drop_cols]\n\n    def __setstate__(self, state):\n        # compatibility for older versions of sklearn-pandas\n        super().__setstate__(state)\n        self.features = [_build_feature(*feat) for feat in state['features']]\n        self.sparse = state.get('sparse', False)\n        self.default = state.get('default', False)\n        self.df_out = state.get('df_out', False)\n        self.input_df = state.get('input_df', False)\n        self.drop_cols = state.get('drop_cols', [])\n        self.built_features = state.get('built_features', self.features)\n        self.built_default = state.get('built_default', self.default)\n        self.transformed_names_ = state.get('transformed_names_', [])\n\n    def __getstate__(self):\n        state = super().__getstate__()\n        state['features'] = self.features\n        state['sparse'] = self.sparse\n        state['default'] = self.default\n        state['df_out'] = self.df_out\n        state['input_df'] = self.input_df\n        state['drop_cols'] = self.drop_cols\n        state['build_features'] = getattr(self, 'built_features', None)\n        state['built_default'] = self.built_default\n        state['transformed_names_'] = self.transformed_names_\n        return state\n\n    def _get_col_subset(self, X, cols, input_df=False):\n        \"\"\"\n        Get a subset of columns from the given table X.\n\n        X       a Pandas dataframe; the table to select columns from\n        cols    a string or list of strings representing the columns to select.\n                It can also be a callable that returns True or False, i.e.\n                compatible with the built-in filter function.\n\n        Returns a numpy array with the data from the selected columns\n        \"\"\"\n\n        if isinstance(cols, string_types):\n            return_vector = True\n            cols = [cols]\n        else:\n            return_vector = False\n\n        # Needed when using the cross-validation compatibility\n        # layer for sklearn<0.16.0.\n        # Will be dropped on sklearn-pandas 2.0.\n        if isinstance(X, list):\n            X = [x[cols] for x in X]\n            X = pd.DataFrame(X)\n\n        elif isinstance(X, DataWrapper):\n            X = X.df  # fetch underlying data\n\n        if return_vector:\n            t = X[cols[0]]\n        else:\n            t = X[cols]\n\n        # return either a DataFrame/Series or a numpy array\n        if input_df:\n            return t\n        else:\n            return t.values\n\n    def fit(self, X, y=None):\n        \"\"\"\n        Fit a transformation from the pipeline\n\n        X       the data to fit\n\n        y       the target vector relative to X, optional\n\n        \"\"\"\n        self._build(X=X)\n\n        for columns, transformers, options in self.built_features:\n            t1 = datetime.now()\n            input_df = options.get('input_df', self.input_df)\n\n            if transformers is not None:\n                with add_column_names_to_exception(columns):\n                    Xt = self._get_col_subset(X, columns, input_df)\n                    _call_fit(transformers.fit, Xt, y)\n            logger.info(f\"[FIT] {columns}: {_elapsed_secs(t1)} secs\")\n\n        # handle features not explicitly selected\n        if self.built_default:  # not False and not None\n            unsel_cols = self._unselected_columns(X)\n            with add_column_names_to_exception(unsel_cols):\n                Xt = self._get_col_subset(X, unsel_cols, self.input_df)\n                _call_fit(self.built_default.fit, Xt, y)\n        return self\n\n    def get_names(self, columns, transformer, x, alias=None, prefix='',\n                  suffix=''):\n        \"\"\"\n        Return verbose names for the transformed columns.\n\n        columns       name (or list of names) of the original column(s)\n        transformer   transformer - can be a TransformerPipeline\n        x             transformed columns (numpy.ndarray)\n        alias         base name to use for the selected columns\n        \"\"\"\n        if alias is not None:\n            name = alias\n        elif isinstance(columns, list):\n            name = '_'.join(map(str, columns))\n        else:\n            name = columns\n        num_cols = x.shape[1] if len(x.shape) > 1 else 1\n\n        output = []\n\n        if num_cols > 1:\n            # If there are as many columns as classes in the transformer,\n            # infer column names from classes names.\n\n            # If we are dealing with multiple transformers for these columns\n            # attempt to extract the names from each of them, starting from the\n            # last one\n            if isinstance(transformer, TransformerPipeline):\n                inverse_steps = transformer.steps[::-1]\n                estimators = (estimator for name, estimator in inverse_steps)\n                names_steps = (_get_feature_names(e) for e in estimators)\n                names = next((n for n in names_steps if n is not None), None)\n            # Otherwise use the only estimator present\n            else:\n                names = _get_feature_names(transformer)\n\n            if names is not None and len(names) == num_cols:\n                output = [f\"{name}_{o}\" for o in names]\n                # otherwise, return name concatenated with '_1', '_2', etc.\n            else:\n                output = [name + '_' + str(o) for o in range(num_cols)]\n        else:\n            output = [name]\n\n        if prefix == suffix == \"\":\n            return output\n\n        return ['{}{}{}'.format(prefix, x, suffix) for x in output]\n\n    def get_dtypes(self, extracted):\n        dtypes_features = [self.get_dtype(ex) for ex in extracted]\n        return [dtype for dtype_feature in dtypes_features\n                for dtype in dtype_feature]\n\n    def get_dtype(self, ex):\n        if isinstance(ex, np.ndarray) or sparse.issparse(ex):\n            return [ex.dtype] * ex.shape[1]\n        elif isinstance(ex, pd.DataFrame):\n            return list(ex.dtypes)\n        else:\n            raise TypeError(type(ex))\n\n    def _transform(self, X, y=None, do_fit=False):\n        \"\"\"\n        Transform the given data with possibility to fit in advance.\n        Avoids code duplication for implementation of transform and\n        fit_transform.\n        \"\"\"\n        if do_fit:\n            self._build(X=X)\n\n        extracted = []\n        transformed_names_ = []\n        for columns, transformers, options in self.built_features:\n            input_df = options.get('input_df', self.input_df)\n\n            # columns could be a string or list of\n            # strings; we don't care because pandas\n            # will handle either.\n            Xt = self._get_col_subset(X, columns, input_df)\n\n            if transformers is not None:\n                with add_column_names_to_exception(columns):\n                    if do_fit and hasattr(transformers, 'fit_transform'):\n                        t1 = datetime.now()\n                        Xt = _call_fit(transformers.fit_transform, Xt, y)\n                        logger.info(f\"[FIT_TRANSFORM] {columns}: {_elapsed_secs(t1)} secs\")  # NOQA\n                    else:\n                        if do_fit:\n                            t1 = datetime.now()\n                            _call_fit(transformers.fit, Xt, y)\n                            logger.info(\n                                f\"[FIT] {columns}: {_elapsed_secs(t1)} secs\")\n\n                        t1 = datetime.now()\n                        Xt = transformers.transform(Xt)\n                        logger.info(f\"[TRANSFORM] {columns}: {_elapsed_secs(t1)} secs\")  # NOQA\n\n            extracted.append(_handle_feature(Xt))\n\n            alias = options.get('alias')\n\n            prefix = options.get('prefix', '')\n            suffix = options.get('suffix', '')\n\n            transformed_names_ += self.get_names(\n                columns, transformers, Xt, alias, prefix, suffix)\n\n        # handle features not explicitly selected\n        if self.built_default is not False:\n            unsel_cols = self._unselected_columns(X)\n            Xt = self._get_col_subset(X, unsel_cols, self.input_df)\n            if self.built_default is not None:\n                with add_column_names_to_exception(unsel_cols):\n                    if do_fit and hasattr(self.built_default, 'fit_transform'):\n                        Xt = _call_fit(self.built_default.fit_transform, Xt, y)\n                    else:\n                        if do_fit:\n                            _call_fit(self.built_default.fit, Xt, y)\n                        Xt = self.built_default.transform(Xt)\n                transformed_names_ += self.get_names(\n                    unsel_cols, self.built_default, Xt)\n            else:\n                # if not applying a default transformer,\n                # keep column names unmodified\n                transformed_names_ += unsel_cols\n\n            extracted.append(_handle_feature(Xt))\n\n        self.transformed_names_ = transformed_names_\n\n        # combine the feature outputs into one array.\n        # at this point we lose track of which features\n        # were created from which input columns, so it's\n        # assumed that that doesn't matter to the model.\n\n        # If any of the extracted features is sparse, combine sparsely.\n        # Otherwise, combine as normal arrays.\n        if any(sparse.issparse(fea) for fea in extracted):\n            stacked = sparse.hstack(extracted).tocsr()\n            # return a sparse matrix only if the mapper was initialized\n            # with sparse=True\n            if not self.sparse:\n                stacked = stacked.toarray()\n        else:\n            stacked = np.hstack(extracted)\n\n        if self.df_out:\n            # if no rows were dropped preserve the original index,\n            # otherwise use a new integer one\n            no_rows_dropped = len(X) == len(stacked)\n            if no_rows_dropped:\n                index = X.index\n            else:\n                index = None\n\n            # output different data types, if appropriate\n            dtypes = self.get_dtypes(extracted)\n            df_out = pd.DataFrame(\n                stacked,\n                columns=self.transformed_names_,\n                index=index)\n            # preserve types\n            for col, dtype in zip(self.transformed_names_, dtypes):\n                df_out[col] = df_out[col].astype(dtype)\n            return df_out\n        else:\n            return stacked\n\n    def transform(self, X):\n        \"\"\"\n        Transform the given data. Assumes that fit has already been called.\n\n        X       the data to transform\n        \"\"\"\n        return self._transform(X)\n\n    def fit_transform(self, X, y=None):\n        \"\"\"\n        Fit a transformation from the pipeline and directly apply\n        it to the given data.\n\n        X       the data to fit\n\n        y       the target vector relative to X, optional\n        \"\"\"\n        return self._transform(X, y, True)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/cross_validation.py",
      "content": "class DataWrapper(object):\n\n    def __init__(self, df):\n        self.df = df\n\n    def __len__(self):\n        return len(self.df)\n\n    def __getitem__(self, key):\n        return self.df.iloc[key]\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_pipeline.py",
      "content": "import pytest\nfrom sklearn_pandas.pipeline import TransformerPipeline, _call_fit\n\n# In py3, mock is included with the unittest standard library\n# In py2, it's a separate package\ntry:\n    from unittest.mock import patch\nexcept ImportError:\n    from mock import patch\n\n\nclass NoTransformT(object):\n    \"\"\"Transformer without transform method.\n    \"\"\"\n    def fit(self, x):\n        return self\n\n\nclass NoFitT(object):\n    \"\"\"Transformer without fit method.\n    \"\"\"\n    def transform(self, x):\n        return self\n\n\nclass Trans(object):\n    \"\"\"\n    Transformer with fit and transform methods\n    \"\"\"\n    def fit(self, x, y=None):\n        return self\n\n    def transform(self, x):\n        return self\n\n\ndef func_x_y(x, y, kwarg='kwarg'):\n    \"\"\"\n    Function with required x and y arguments\n    \"\"\"\n    return\n\n\ndef func_x(x, kwarg='kwarg'):\n    \"\"\"\n    Function with required x argument\n    \"\"\"\n    return\n\n\ndef func_raise_type_err(x, y, kwarg='kwarg'):\n    \"\"\"\n    Function with required x and y arguments,\n    raises TypeError\n    \"\"\"\n    raise TypeError\n\n\ndef test_all_steps_fit_transform():\n    \"\"\"\n    All steps must implement fit and transform. Otherwise, raise TypeError.\n    \"\"\"\n    with pytest.raises(TypeError):\n        TransformerPipeline([('svc', NoTransformT())])\n\n    with pytest.raises(TypeError):\n        TransformerPipeline([('svc', NoFitT())])\n\n\n@patch.object(Trans, 'fit', side_effect=func_x_y)\ndef test_called_with_x_and_y(mock_fit):\n    \"\"\"\n    Fit method with required X and y arguments is called with both and with\n    any additional keywords\n    \"\"\"\n    _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n    mock_fit.assert_called_with('X', 'y', kwarg='kwarg')\n\n\n@patch.object(Trans, 'fit', side_effect=func_x)\ndef test_called_with_x(mock_fit):\n    \"\"\"\n    Fit method with a required X arguments is called with it and with\n    any additional keywords\n    \"\"\"\n    _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n    mock_fit.assert_called_with('X', kwarg='kwarg')\n\n    _call_fit(Trans().fit, 'X', kwarg='kwarg')\n    mock_fit.assert_called_with('X', kwarg='kwarg')\n\n\n@patch.object(Trans, 'fit', side_effect=func_raise_type_err)\ndef test_raises_type_error(mock_fit):\n    \"\"\"\n    If a fit method with required X and y arguments raises a TypeError, it's\n    re-raised (for a different reason) when it's called with one argument\n    \"\"\"\n    with pytest.raises(TypeError):\n        _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py",
      "content": "import tempfile\nimport pytest\nimport numpy as np\nfrom pandas import DataFrame\nimport joblib\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas import NumericalTransformer\n\n\n@pytest.fixture\ndef simple_dataset():\n    return DataFrame({\n        'feat1': [1, 2, 1, 3, 1],\n        'feat2': [1, 2, 2, 2, 3],\n        'feat3': [1, 2, 3, 4, 5],\n    })\n\n\ndef test_common_numerical_transformer(simple_dataset):\n    \"\"\"\n    Test log transformation\n    \"\"\"\n    transfomer = DataFrameMapper([\n        ('feat1', NumericalTransformer('log'))\n    ], df_out=True)\n    df = simple_dataset\n    outDF = transfomer.fit_transform(df)\n    assert list(outDF.columns) == ['feat1']\n    assert np.array_equal(df['feat1'].apply(np.log).values, outDF.feat1.values)\n\n\ndef test_numerical_transformer_serialization(simple_dataset):\n    \"\"\"\n    Test if you can serialize transformer\n    \"\"\"\n    transfomer = DataFrameMapper([\n        ('feat1', NumericalTransformer('log'))\n    ])\n\n    df = simple_dataset\n    transfomer.fit(df)\n    f = tempfile.NamedTemporaryFile(delete=True)\n    joblib.dump(transfomer, f.name)\n    transfomer2 = joblib.load(f.name)\n    np.array_equal(transfomer.transform(df), transfomer2.transform(df))\n    f.close()\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
      "content": "# -*- coding: utf8 -*-\n\nimport pytest\nfrom unittest.mock import Mock\nfrom pandas import DataFrame\nimport pandas as pd\nfrom scipy import sparse\nfrom sklearn.datasets import load_iris\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.model_selection import cross_val_score\nfrom sklearn.svm import SVC\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom sklearn.feature_extraction import DictVectorizer\nfrom sklearn.preprocessing import (\n    StandardScaler, OneHotEncoder, LabelBinarizer)\nfrom sklearn.impute import SimpleImputer as Imputer\nfrom sklearn.feature_selection import SelectKBest, chi2\nfrom sklearn.base import BaseEstimator, TransformerMixin\nimport sklearn.decomposition\nimport numpy as np\nfrom numpy.testing import assert_array_equal\nimport pickle\nfrom sklearn.compose import make_column_selector\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas.dataframe_mapper import _handle_feature, _build_transformer\nfrom sklearn_pandas.pipeline import TransformerPipeline\n\n\nclass MockXTransformer(object):\n    \"\"\"\n    Mock transformer that accepts no y argument.\n    \"\"\"\n    def fit(self, X):\n        return self\n\n    def transform(self, X):\n        return X\n\n\nclass MockTClassifier(object):\n    \"\"\"\n    Mock transformer/classifier.\n    \"\"\"\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        return X\n\n    def predict(self, X):\n        return True\n\n\nclass DateEncoder():\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        dt = X.dt\n        return pd.concat([dt.year, dt.month, dt.day], axis=1)\n\n\nclass ToSparseTransformer(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Transforms numpy matrix to sparse format.\n    \"\"\"\n    def fit(self, X):\n        return self\n\n    def transform(self, X):\n        return sparse.csr_matrix(X)\n\n\nclass CustomTransformer(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Example of transformer in which the number of classes\n    is not equals to the number of output columns.\n    \"\"\"\n    def fit(self, X, y=None):\n        self.min = X.min()\n        self.classes_ = np.unique(X)\n        return self\n\n    def transform(self, X):\n        classes = np.unique(X)\n        if len(np.setdiff1d(classes, self.classes_)) > 0:\n            raise ValueError('Unknown values found.')\n        return X - self.min\n\n\nclass MockImageTransformer(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Example transformer that takes the max of a 2d vector\n    then scales the result.\n    \"\"\"\n    def __init__(self, multiplier=10.0):\n        self.multiplier = multiplier\n\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        assert isinstance(X, pd.DataFrame)\n        for col in X.columns:\n            X[col] = X[col].map(lambda img: np.max(img))\n        return X * self.multiplier\n\n\n@pytest.fixture\ndef simple_dataframe():\n    return pd.DataFrame({'a': [1, 2, 3]})\n\n\n@pytest.fixture\ndef complex_dataframe():\n    return pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'c'],\n                         'feat1': [1, 2, 3, 4, 5, 6],\n                         'feat2': [1, 2, 3, 2, 3, 4]})\n\n\n@pytest.fixture\ndef complex_object_dataframe():\n    return pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'c'],\n                         'feat1': [1, 2, 3, 4, 5, 6],\n                         'feat2': [1, 2, 3, 2, 3, 4],\n                         'img2d': [1*np.eye(2), 2*np.eye(2), 3*np.eye(2),\n                                   4*np.eye(2), 5*np.eye(2), 6*np.eye(2)]})\n\n\n@pytest.fixture\ndef multiindex_dataframe():\n    \"\"\"Example MultiIndex DataFrame, taken from pandas documentation\n    \"\"\"\n    iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]\n    index = pd.MultiIndex.from_product(iterables, names=['first', 'second'])\n    df = pd.DataFrame(np.random.randn(10, 8), columns=index)\n    return df\n\n\n@pytest.fixture\ndef multiindex_dataframe_incomplete(multiindex_dataframe):\n    \"\"\"Example MultiIndex DataFrame with missing entries\n    \"\"\"\n    df = multiindex_dataframe\n    mask_array = np.zeros(df.size)\n    mask_array[:20] = 1\n    np.random.shuffle(mask_array)\n    mask = mask_array.reshape(df.shape).astype(bool)\n    df.mask(mask, inplace=True)\n    return df\n\n\ndef test_transformed_names_simple(simple_dataframe):\n    \"\"\"\n    Get transformed names of features in `transformed_names` attribute\n    for simple transformation\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None)])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_transformed_names_binarizer(complex_dataframe):\n    \"\"\"\n    Get transformed names of features in `transformed_names` attribute\n    for a transformation that multiplies the number of columns\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['target_a', 'target_b', 'target_c']\n\n\ndef test_logging(caplog, complex_dataframe):\n    \"\"\"\n    Get transformed names of features in `transformed_names` attribute\n    for a transformation that multiplies the number of columns\n    \"\"\"\n    import logging\n    logger = logging.getLogger('sklearn_pandas')\n    logger.setLevel(logging.INFO)\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    assert '[FIT_TRANSFORM] target:' in caplog.text\n\n\ndef test_transformed_names_binarizer_unicode():\n    df = pd.DataFrame({'target': [u'ñ', u'á', u'é']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    expected_names = {u'target_ñ', u'target_á', u'target_é'}\n    assert set(mapper.transformed_names_) == expected_names\n\n\ndef test_transformed_names_transformers_list(complex_dataframe):\n    \"\"\"\n    When using a list of transformers, use them in inverse order to get the\n    transformed names\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([\n        ('target', [LabelBinarizer(), MockXTransformer()])\n    ])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['target_a', 'target_b', 'target_c']\n\n\ndef test_transformed_names_simple_alias(simple_dataframe):\n    \"\"\"\n    If we specify an alias for a single output column, it is used for the\n    output\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None, {'alias': 'new_name'})])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['new_name']\n\n\ndef test_transformed_names_complex_alias(complex_dataframe):\n    \"\"\"\n    If we specify an alias for a multiple output column, it is used for the\n    output\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer(), {'alias': 'new'})])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['new_a', 'new_b', 'new_c']\n\n\ndef test_exception_column_context_transform(simple_dataframe):\n    \"\"\"\n    If an exception is raised when transforming a column,\n    the exception includes the name of the column being transformed\n    \"\"\"\n    class FailingTransformer(object):\n        def fit(self, X):\n            pass\n\n        def transform(self, X):\n            raise Exception('Some exception')\n\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', FailingTransformer())])\n    mapper.fit(df)\n\n    with pytest.raises(Exception, match='a: Some exception'):\n        mapper.transform(df)\n\n\ndef test_exception_column_context_fit(simple_dataframe):\n    \"\"\"\n    If an exception is raised when fit a column,\n    the exception includes the name of the column being fitted\n    \"\"\"\n    class FailingFitter(object):\n        def fit(self, X):\n            raise Exception('Some exception')\n\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', FailingFitter())])\n\n    with pytest.raises(Exception, match='a: Some exception'):\n        mapper.fit(df)\n\n\ndef test_simple_df(simple_dataframe):\n    \"\"\"\n    Get a dataframe from a simple mapped dataframe\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None)], df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert type(transformed) == pd.DataFrame\n    assert len(transformed[\"a\"]) == len(simple_dataframe[\"a\"])\n\n\ndef test_complex_df(complex_dataframe):\n    \"\"\"\n    Get a dataframe from a complex mapped dataframe\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [('target', None), ('feat1', None), ('feat2', None)],\n        df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_dataframe)\n    for c in df.columns:\n        assert len(transformed[c]) == len(df[c])\n\n\ndef test_complex_object_df(complex_object_dataframe):\n    \"\"\"\n    Get a dataframe from a complex dataframe with 2d features\n    \"\"\"\n    df = complex_object_dataframe\n    img_scale = 10\n    mapper = DataFrameMapper(\n        [('target', None), ('feat1', None),\n         (make_column_selector('feat2'), StandardScaler()),\n         (make_column_selector('img2d'), MockImageTransformer(img_scale))],\n        df_out=True, input_df=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_object_dataframe)\n    assert np.isclose(\n        np.sum(transformed['img2d']),\n        np.max(np.sum(df['img2d'])) * img_scale, atol=1e-12)\n\n\ndef test_numeric_column_names(complex_dataframe):\n    \"\"\"\n    Get a dataframe from a complex mapped dataframe with numeric column names\n    \"\"\"\n    df = complex_dataframe\n    df.columns = [0, 1, 2]\n    mapper = DataFrameMapper(\n        [(0, None), (1, None), (2, None)], df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_dataframe)\n    for c in df.columns:\n        assert len(transformed[c]) == len(df[c])\n\n\ndef test_multiindex_df(multiindex_dataframe_incomplete):\n    \"\"\"\n    Get a dataframe from a multiindex dataframe with missing data\n    \"\"\"\n    df = multiindex_dataframe_incomplete\n    mapper = DataFrameMapper([([c], Imputer()) for c in df.columns],\n                             df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(multiindex_dataframe_incomplete)\n    for c in df.columns:\n        assert len(transformed[str(c)]) == len(df[c])\n\n\ndef test_binarizer_df():\n    \"\"\"\n    Check level names from LabelBinarizer\n    \"\"\"\n    df = pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'a']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 3\n    assert cols[0] == 'target_a'\n    assert cols[1] == 'target_b'\n    assert cols[2] == 'target_c'\n\n\ndef test_binarizer_int_df():\n    \"\"\"\n    Check level names from LabelBinarizer for a numeric array.\n    \"\"\"\n    df = pd.DataFrame({'target': [5, 5, 6, 6, 7, 5]})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 3\n    assert cols[0] == 'target_5'\n    assert cols[1] == 'target_6'\n    assert cols[2] == 'target_7'\n\n\ndef test_binarizer2_df():\n    \"\"\"\n    Check level names from LabelBinarizer with just one output column\n    \"\"\"\n    df = pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'a']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 1\n    assert cols[0] == 'target'\n\n\ndef test_onehot_df():\n    \"\"\"\n    Check level ids from one-hot\n    \"\"\"\n    df = pd.DataFrame({'target': [0, 0, 1, 1, 2, 3, 0]})\n    mapper = DataFrameMapper([(['target'], OneHotEncoder())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 4\n    assert cols[0] == 'target_0'\n    assert cols[3] == 'target_3'\n\n\ndef test_customtransform_df():\n    \"\"\"\n    Check level ids from a transformer in which\n    the number of classes is not equals to the number of output columns.\n    \"\"\"\n    df = pd.DataFrame({'target': [6, 5, 7, 5, 4, 8, 8]})\n    mapper = DataFrameMapper([(['target'], CustomTransformer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(mapper.features[0][1].classes_) == 5\n    assert len(cols) == 1\n    assert cols[0] == 'target'\n\n\ndef test_preserve_df_index():\n    \"\"\"\n    The index is preserved when df_out=True\n    \"\"\"\n    df = pd.DataFrame({'target': [1, 2, 3]},\n                      index=['a', 'b', 'c'])\n    mapper = DataFrameMapper([('target', None)],\n                             df_out=True)\n\n    transformed = mapper.fit_transform(df)\n\n    assert_array_equal(transformed.index, df.index)\n\n\ndef test_preserve_df_index_rows_dropped():\n    \"\"\"\n    If df_out=True but the original df index length doesn't\n    match the number of final rows, use a numeric index\n    \"\"\"\n    class DropLastRowTransformer(object):\n        def fit(self, X):\n            return self\n\n        def transform(self, X):\n            return X[:-1]\n\n    df = pd.DataFrame({'target': [1, 2, 3]},\n                      index=['a', 'b', 'c'])\n    mapper = DataFrameMapper([('target', DropLastRowTransformer())],\n                             df_out=True)\n\n    transformed = mapper.fit_transform(df)\n\n    assert_array_equal(transformed.index, np.array([0, 1]))\n\n\ndef test_pca(complex_dataframe):\n    \"\"\"\n    Check multi in and out with PCA\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [(['feat1', 'feat2'], sklearn.decomposition.PCA(2))],\n        df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 2\n    assert cols[0] == 'feat1_feat2_0'\n    assert cols[1] == 'feat1_feat2_1'\n\n\ndef test_fit_transform(simple_dataframe):\n    \"\"\"\n    Check that custom fit_transform methods of the transformers are invoked.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    # return something of measurable length but does nothing\n    mock_transformer.fit_transform.return_value = np.array([1, 2, 3])\n    mapper = DataFrameMapper([(\"a\", mock_transformer)])\n    mapper.fit_transform(df)\n    assert mock_transformer.fit_transform.called\n\n\ndef test_fit_transform_equiv_mock(simple_dataframe):\n    \"\"\"\n    Check for equivalent results for code paths fit_transform\n    versus fit and transform in DataFrameMapper using the mock\n    transformer which does not implement a custom fit_transform.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', MockXTransformer())])\n    transformed_combined = mapper.fit_transform(df)\n    transformed_separate = mapper.fit(df).transform(df)\n    assert np.all(transformed_combined == transformed_separate)\n\n\ndef test_fit_transform_equiv_pca(complex_dataframe):\n    \"\"\"\n    Check for equivalent results for code paths fit_transform\n    versus fit and transform in DataFrameMapper and transformer\n    using PCA which implements a custom fit_transform. The\n    equivalence of both paths in the transformer only can be\n    asserted since this is tested in the sklearn tests\n    scikit-learn/sklearn/decomposition/tests/test_pca.py\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [(['feat1', 'feat2'], sklearn.decomposition.PCA(2))],\n        df_out=True)\n    transformed_combined = mapper.fit_transform(df)\n    transformed_separate = mapper.fit(df).transform(df)\n    assert np.allclose(transformed_combined, transformed_separate)\n\n\ndef test_input_df_true_first_transformer(simple_dataframe, monkeypatch):\n    \"\"\"\n    If input_df is True, the first transformer is passed\n    a pd.Series instead of an np.array\n    \"\"\"\n    df = simple_dataframe\n    monkeypatch.setattr(MockXTransformer, 'fit', Mock())\n    monkeypatch.setattr(MockXTransformer, 'transform',\n                        Mock(return_value=np.array([1, 2, 3])))\n    mapper = DataFrameMapper([\n        ('a', MockXTransformer())\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n\n    args, _ = MockXTransformer().fit.call_args\n    assert isinstance(args[0], pd.Series)\n\n    args, _ = MockXTransformer().transform.call_args\n    assert isinstance(args[0], pd.Series)\n\n    assert_array_equal(out, np.array([1, 2, 3]).reshape(-1, 1))\n\n\ndef test_input_df_true_next_transformers(simple_dataframe, monkeypatch):\n    \"\"\"\n    If input_df is True, the subsequent transformers get passed pandas\n    objects instead of numpy arrays (given the previous transformers\n    output pandas objects as well)\n    \"\"\"\n    df = simple_dataframe\n    monkeypatch.setattr(MockTClassifier, 'fit', Mock())\n    monkeypatch.setattr(MockTClassifier, 'transform',\n                        Mock(return_value=pd.Series([1, 2, 3])))\n    mapper = DataFrameMapper([\n        ('a', [MockXTransformer(), MockTClassifier()])\n    ], input_df=True)\n    mapper.fit(df)\n    out = mapper.transform(df)\n\n    args, _ = MockTClassifier().fit.call_args\n    assert isinstance(args[0], pd.Series)\n\n    assert_array_equal(out, np.array([1, 2, 3]).reshape(-1, 1))\n\n\ndef test_input_df_true_multiple_cols(complex_dataframe):\n    \"\"\"\n    When input_df is True, applying transformers to multiple columns\n    works as expected\n    \"\"\"\n    df = complex_dataframe\n\n    mapper = DataFrameMapper([\n        ('target', MockXTransformer()),\n        ('feat1',  MockXTransformer()),\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n\n    assert_array_equal(out[:, 0], df['target'].values)\n    assert_array_equal(out[:, 1], df['feat1'].values)\n\n\ndef test_input_df_date_encoder():\n    \"\"\"\n    When input_df is True we can apply a transformer that only works\n    with pandas dataframes like a DateEncoder\n    \"\"\"\n    df = pd.DataFrame(\n        {'dates': pd.date_range('2015-10-30', '2015-11-02')})\n    mapper = DataFrameMapper([\n        ('dates', DateEncoder())\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n    expected = np.array([\n        [2015, 10, 30],\n        [2015, 10, 31],\n        [2015, 11, 1],\n        [2015, 11, 2]\n    ])\n    assert_array_equal(out, expected)\n\n\ndef test_local_input_df_date_encoder():\n    \"\"\"\n    When input_df is True we can apply a transformer that only works\n    with pandas dataframes like a DateEncoder\n    \"\"\"\n    df = pd.DataFrame(\n        {'dates': pd.date_range('2015-10-30', '2015-11-02')})\n    mapper = DataFrameMapper([\n        ('dates', DateEncoder(), {'input_df': True})\n    ], input_df=False)\n    out = mapper.fit_transform(df)\n    expected = np.array([\n        [2015, 10, 30],\n        [2015, 10, 31],\n        [2015, 11, 1],\n        [2015, 11, 2]\n    ])\n    assert_array_equal(out, expected)\n\n\ndef test_nonexistent_columns_explicit_fail(simple_dataframe):\n    \"\"\"\n    If a nonexistent column is selected, KeyError is raised.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    with pytest.raises(KeyError):\n        mapper._get_col_subset(simple_dataframe, ['nonexistent_feature'])\n\n\ndef test_get_col_subset_single_column_array(simple_dataframe):\n    \"\"\"\n    Selecting a single column should return a 1-dimensional numpy array.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    array = mapper._get_col_subset(simple_dataframe, \"a\")\n\n    assert type(array) == np.ndarray\n    assert array.shape == (len(simple_dataframe[\"a\"]),)\n\n\ndef test_get_col_subset_single_column_list(simple_dataframe):\n    \"\"\"\n    Selecting a list of columns (even if the list contains a single element)\n    should return a 2-dimensional numpy array.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    array = mapper._get_col_subset(simple_dataframe, [\"a\"])\n\n    assert type(array) == np.ndarray\n    assert array.shape == (len(simple_dataframe[\"a\"]), 1)\n\n\ndef test_cols_string_array(simple_dataframe):\n    \"\"\"\n    If a string is specified as the columns, the transformer\n    is called with a 1-d array as input.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    mapper = DataFrameMapper([(\"a\", mock_transformer)])\n\n    mapper.fit(df)\n    args, kwargs = mock_transformer.fit.call_args\n    assert args[0].shape == (3,)\n\n\ndef test_cols_list_column_vector(simple_dataframe):\n    \"\"\"\n    If a one-element list is specified as the columns, the transformer\n    is called with a column vector as input.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    mapper = DataFrameMapper([([\"a\"], mock_transformer)])\n\n    mapper.fit(df)\n    args, kwargs = mock_transformer.fit.call_args\n    assert args[0].shape == (3, 1)\n\n\ndef test_handle_feature_2dim():\n    \"\"\"\n    2-dimensional arrays are returned unchanged.\n    \"\"\"\n    array = np.array([[1, 2], [3, 4]])\n    assert_array_equal(_handle_feature(array), array)\n\n\ndef test_handle_feature_1dim():\n    \"\"\"\n    1-dimensional arrays are converted to 2-dimensional column vectors.\n    \"\"\"\n    array = np.array([1, 2])\n    assert_array_equal(_handle_feature(array), np.array([[1], [2]]))\n\n\ndef test_build_transformers():\n    \"\"\"\n    When a list of transformers is passed, return a pipeline with\n    each element of the iterable as a step of the pipeline.\n    \"\"\"\n    transformers = [MockTClassifier(), MockTClassifier()]\n    pipeline = _build_transformer(transformers)\n    assert isinstance(pipeline, Pipeline)\n    for ix, transformer in enumerate(transformers):\n        assert pipeline.steps[ix][1] == transformer\n\n\ndef test_selected_columns():\n    \"\"\"\n    selected_columns returns a set of the columns appearing in the features\n    of the mapper.\n    \"\"\"\n    mapper = DataFrameMapper([\n        ('a', None),\n        (['a', 'b'], None)\n    ])\n    assert mapper._selected_columns == {'a', 'b'}\n\n\ndef test_unselected_columns():\n    \"\"\"\n    unselected_columns returns a list of the columns not appearing in the\n    features of the mapper but present in the given dataframe.\n    \"\"\"\n    df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})\n    mapper = DataFrameMapper([\n        ('a', None),\n        (['a', 'b'], None)\n    ])\n    assert 'c' in mapper._unselected_columns(df)\n\n\ndef test_drop_and_default_false():\n    \"\"\"\n    If default=False, non explicitly selected columns and drop columns\n    are discarded.\n    \"\"\"\n    df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})\n    mapper = DataFrameMapper([\n            ('a', None)\n        ], drop_cols=['c'], default=False)\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (1, 1)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_drop_and_default_none():\n    \"\"\"\n    If default=None, drop columns are discarded and\n    remaining non explicitly selected columns are passed through untransformed\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        ('a', None)\n    ], drop_cols=['c'], default=None)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 2)\n    assert mapper.transformed_names_ == ['a', 'b']\n\n\ndef test_conflicting_drop():\n    \"\"\"\n    Drop column name shouldn't get confused with transformed columns.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        ('a', None)\n    ], drop_cols=['a'], default=False)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 1)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_default_false():\n    \"\"\"\n    If default=False, non explicitly selected columns are discarded.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        ('b', None)\n    ], default=False)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 1)\n\n\ndef test_default_none():\n    \"\"\"\n    If default=None, non explicitly selected columns are passed through\n    untransformed.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        (['a'], OneHotEncoder())\n    ], default=None)\n\n    transformed = mapper.fit_transform(df)\n    assert (transformed[:, 3] == np.array([3, 5, 7]).T).all()\n\n\ndef test_default_none_names():\n    \"\"\"\n    If default=None, column names are returned unmodified.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([], default=None)\n\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['a', 'b']\n\n\ndef test_default_transformer():\n    \"\"\"\n    If default=Transformer, non explicitly selected columns are applied this\n    transformer.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, np.nan, 3], })\n    mapper = DataFrameMapper([], default=Imputer())\n\n    transformed = mapper.fit_transform(df)\n    assert (transformed[: 0] == np.array([1., 2., 3.])).all()\n\n\ndef test_list_transformers_single_arg(simple_dataframe):\n    \"\"\"\n    Multiple transformers can be specified in a list even if some of them\n    only accept one X argument instead of two (X, y).\n    \"\"\"\n    mapper = DataFrameMapper([\n        ('a', [MockXTransformer()])\n    ])\n    # doesn't fail\n    mapper.fit_transform(simple_dataframe)\n\n\ndef test_list_transformers():\n    \"\"\"\n    Specifying a list of transformers applies them sequentially to the\n    selected column.\n    \"\"\"\n    dataframe = pd.DataFrame({\"a\": [1, np.nan, 3], \"b\": [1, 5, 7]},\n                             dtype=np.float64)\n\n    mapper = DataFrameMapper([\n        ([\"a\"], [Imputer(), StandardScaler()]),\n        ([\"b\"], StandardScaler()),\n    ])\n    dmatrix = mapper.fit_transform(dataframe)\n\n    assert pd.isnull(dmatrix).sum() == 0  # no null values\n\n    # all features have mean 0 and std deviation 1 (standardized)\n    assert (abs(dmatrix.mean(axis=0) - 0) <= 1e-6).all()\n    assert (abs(dmatrix.std(axis=0) - 1) <= 1e-6).all()\n\n\ndef test_list_transformers_old_unpickle(simple_dataframe):\n    mapper = DataFrameMapper(None)\n    # simulate the mapper was created with < 1.0.0 code\n    mapper.features = [('a', [MockXTransformer()])]\n    mapper_pickled = pickle.dumps(mapper)\n\n    loaded_mapper = pickle.loads(mapper_pickled)\n    transformer = loaded_mapper.features[0][1]\n    assert isinstance(transformer, TransformerPipeline)\n    assert isinstance(transformer.steps[0][1], MockXTransformer)\n\n\ndef test_sparse_features(simple_dataframe):\n    \"\"\"\n    If any of the extracted features is sparse and \"sparse\" argument\n    is true, the hstacked result is also sparse.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([\n        (\"a\", ToSparseTransformer())\n    ], sparse=True)\n    dmatrix = mapper.fit_transform(df)\n\n    assert type(dmatrix) == sparse.csr.csr_matrix\n\n\ndef test_sparse_off(simple_dataframe):\n    \"\"\"\n    If the resulting features are sparse but the \"sparse\" argument\n    of the mapper is False, return a non-sparse matrix.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([\n        (\"a\", ToSparseTransformer())\n    ], sparse=False)\n\n    dmatrix = mapper.fit_transform(df)\n    assert type(dmatrix) != sparse.csr.csr_matrix\n\n\ndef test_fit_with_optional_y_arg(complex_dataframe):\n    \"\"\"\n    Transformers with an optional y argument in the fit method\n    are handled correctly\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([(['feat1', 'feat2'], MockTClassifier())])\n    # doesn't fail\n    mapper.fit(df[['feat1', 'feat2']], df['target'])\n\n\ndef test_fit_with_required_y_arg(complex_dataframe):\n    \"\"\"\n    Transformers with a required y argument in the fit method\n    are handled and perform correctly\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([(['feat1', 'feat2'], SelectKBest(chi2, k=1))])\n\n    # fit, doesn't fail\n    ft_arr = mapper.fit(df[['feat1', 'feat2']], df['target'])\n\n    # fit_transform\n    ft_arr = mapper.fit_transform(df[['feat1', 'feat2']], df['target'])\n    assert_array_equal(ft_arr, df[['feat1']].values)\n\n    # transform\n    t_arr = mapper.transform(df[['feat1', 'feat2']])\n    assert_array_equal(t_arr, df[['feat1']].values)\n\n\n# Integration tests with real dataframes\n\n@pytest.fixture\ndef iris_dataframe():\n    iris = load_iris()\n    return DataFrame(\n        data={\n            iris.feature_names[0]: iris.data[:, 0],\n            iris.feature_names[1]: iris.data[:, 1],\n            iris.feature_names[2]: iris.data[:, 2],\n            iris.feature_names[3]: iris.data[:, 3],\n            \"species\": np.array([iris.target_names[e] for e in iris.target])\n        }\n    )\n\n\n@pytest.fixture\ndef cars_dataframe():\n    return pd.read_csv(\"tests/test_data/cars.csv.gz\", compression='gzip')\n\n\ndef test_with_iris_dataframe(iris_dataframe):\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"petal length (cm)\", None),\n            (\"petal width (cm)\", None),\n            (\"sepal length (cm)\", None),\n            (\"sepal width (cm)\", None),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = iris_dataframe.drop(\"species\", axis=1)\n    labels = iris_dataframe[\"species\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.96\n    assert (scores.std() * 2) < 0.04\n\n\ndef test_dict_vectorizer():\n    df = pd.DataFrame(\n        [[{'a': 1, 'b': 2}], [{'a': 3}]],\n        columns=['colA']\n    )\n\n    outdf = DataFrameMapper(\n        [('colA', DictVectorizer())],\n        df_out=True,\n        default=False\n    ).fit_transform(df)\n\n    columns = sorted(list(outdf.columns))\n    assert len(columns) == 2\n    assert columns[0] == 'colA_0'\n    assert columns[1] == 'colA_1'\n\n\ndef test_with_car_dataframe(cars_dataframe):\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"description\", CountVectorizer()),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = cars_dataframe.drop(\"model\", axis=1)\n    labels = cars_dataframe[\"model\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.30\n\n\ndef test_direct_cross_validation(iris_dataframe):\n    \"\"\"\n    Starting with sklearn>=0.16.0 we no longer need CV wrappers for dataframes.\n    See https://github.com/paulgb/sklearn-pandas/issues/11\n    \"\"\"\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"petal length (cm)\", None),\n            (\"petal width (cm)\", None),\n            (\"sepal length (cm)\", None),\n            (\"sepal width (cm)\", None),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = iris_dataframe.drop(\"species\", axis=1)\n    labels = iris_dataframe[\"species\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.96\n    assert (scores.std() * 2) < 0.04\n\n\ndef test_heterogeneous_output_types_input_df():\n    \"\"\"\n    Modify feat2, but pass feat1 through unmodified.\n    This fails if input_df == False\n    \"\"\"\n    df = pd.DataFrame({\n        'feat1': [1, 2, 3, 4, 5, 6],\n        'feat2': [1.0, 2.0, 3.0, 2.0, 3.0, 4.0]\n    })\n    M = DataFrameMapper([\n        (['feat2'], StandardScaler())\n        ], input_df=True, df_out=True, default=None)\n    dft = M.fit_transform(df)\n    assert dft['feat1'].dtype == np.dtype('int64')\n    assert dft['feat2'].dtype == np.dtype('float64')\n\n\ndef test_make_column_selector(iris_dataframe):\n    t = DataFrameMapper([\n        (make_column_selector(dtype_include=float), None, {'alias': 'x'}),\n        ('sepal length (cm)', None),\n    ], df_out=True, default=False)\n\n    xt = t.fit(iris_dataframe).transform(iris_dataframe)\n    expected = ['x_0', 'x_1', 'x_2', 'x_3', 'sepal length (cm)']\n    assert list(xt.columns) == expected\n\n    pickled = pickle.dumps(t)\n    t2 = pickle.loads(pickled)\n    xt2 = t2.transform(iris_dataframe)\n    assert np.array_equal(xt.values, xt2.values)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
      "content": "from collections import Counter\n\nimport pytest\nimport numpy as np\nfrom pandas import DataFrame\nfrom numpy.testing import assert_array_equal\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas.features_generator import gen_features\n\n\nclass MockClass(object):\n\n    def __init__(self, value=1, name='class'):\n        self.value = value\n        self.name = name\n\n\nclass MockTransformer(object):\n\n    def __init__(self):\n        self.most_common_ = None\n\n    def fit(self, X, y=None):\n        [(value, _)] = Counter(X).most_common(1)\n        self.most_common_ = value\n        return self\n\n    def transform(self, X, y=None):\n        return np.asarray([self.most_common_] * len(X))\n\n\n@pytest.fixture\ndef simple_dataset():\n    return DataFrame({\n        'feat1': [1, 2, 1, 3, 1],\n        'feat2': [1, 2, 2, 2, 3],\n        'feat3': [1, 2, 3, 4, 5],\n    })\n\n\ndef test_generate_features_with_default_parameters():\n    \"\"\"\n    Tests generating features from classes with default init arguments.\n    \"\"\"\n    columns = ['colA', 'colB', 'colC']\n    feature_defs = gen_features(columns=columns, classes=[MockClass])\n    assert len(feature_defs) == len(columns)\n\n    for feature in feature_defs:\n        assert feature[2] == {}\n\n    feature_dict = dict([_[0:2] for _ in feature_defs])\n    assert columns == sorted(feature_dict.keys())\n\n    # default init arguments for MockClass for clarification.\n    expected = {'value': 1, 'name': 'class'}\n    for column, transformers in feature_dict.items():\n        for obj in transformers:\n            assert_attributes(obj, **expected)\n\n\ndef test_generate_features_with_several_classes():\n    \"\"\"\n    Tests generating features pipeline with different transformers parameters.\n    \"\"\"\n    feature_defs = gen_features(\n        columns=['colA', 'colB', 'colC'],\n        classes=[\n            {'class': MockClass},\n            {'class': MockClass, 'name': 'mockA'},\n            {'class': MockClass, 'name': 'mockB', 'value': None}\n        ]\n    )\n\n    for col, transformers, params in feature_defs:\n        assert_attributes(transformers[0], name='class', value=1)\n        assert_attributes(transformers[1], name='mockA', value=1)\n        assert_attributes(transformers[2], name='mockB', value=None)\n\n\ndef test_generate_features_with_none_only_transformers():\n    \"\"\"\n    Tests generating \"dummy\" feature definition which doesn't apply any\n    transformation.\n    \"\"\"\n    feature_defs = gen_features(\n        columns=['colA', 'colB', 'colC'], classes=[int])\n\n    expected = [('colA', int, {}),\n                ('colB', int, {}),\n                ('colC', int, {})]\n\n    assert feature_defs == expected\n\n\ndef test_compatibility_with_data_frame_mapper(simple_dataset):\n    \"\"\"\n    Tests compatibility of generated feature definition with DataFrameMapper.\n    \"\"\"\n    features_defs = gen_features(\n        columns=['feat1', 'feat2'],\n        classes=[MockTransformer])\n    features_defs.append(('feat3', None))\n\n    mapper = DataFrameMapper(features_defs)\n    X = mapper.fit_transform(simple_dataset)\n    expected = np.asarray([\n        [1, 2, 1],\n        [1, 2, 2],\n        [1, 2, 3],\n        [1, 2, 4],\n        [1, 2, 5]\n    ])\n\n    assert_array_equal(X, expected)\n\n\ndef assert_attributes(obj, **attrs):\n    for attr, value in attrs.items():\n        assert getattr(obj, attr) == value\n"
    }
  ],
  "OriginCode": [
    {
      "path": "scikit-learn-contrib_sklearn-pandas/test.py",
      "content": "import pytest\nfrom unittest.mock import Mock\nimport numpy as np\nimport pandas as pd\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn.compose import make_column_selector\nfrom sklearn.preprocessing import StandardScaler\n\n\nclass GetStartWith:\n    def __init__(self, start_str):\n        self.start_str = start_str\n\n    def __call__(self, X: pd.DataFrame) -> list:\n        return [c for c in X.columns if c.startswith(self.start_str)]\n\n\ndf = pd.DataFrame({\n    'sepal length (cm)': [1.0, 2.0, 3.0],\n    'sepal width (cm)': [1.0, 2.0, 3.0],\n    'petal length (cm)': [1.0, 2.0, 3.0],\n    'petal width (cm)': [1.0, 2.0, 3.0]\n})\nt = DataFrameMapper([\n    (make_column_selector(dtype_include=float), StandardScaler(), {'alias': 'x'}),\n    (GetStartWith('petal'), None, {'alias': 'petal'})\n], df_out=True, default=False)\n\nt.fit(df)\nprint(t.transform(df).shape)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/setup.py",
      "content": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\nfrom setuptools import setup\nfrom setuptools.command.test import test as TestCommand\nimport re\n\nfor line in open('sklearn_pandas/__init__.py'):\n    match = re.match(\"__version__ *= *'(.*)'\", line)\n    if match:\n        __version__, = match.groups()\n\n\nclass PyTest(TestCommand):\n    user_options = [('pytest-args=', 'a', \"Arguments to pass to py.test\")]\n\n    def initialize_options(self):\n        TestCommand.initialize_options(self)\n        self.pytest_args = []\n\n    def finalize_options(self):\n        TestCommand.finalize_options(self)\n        self.test_args = []\n        self.test_suite = True\n\n    def run(self):\n        import pytest\n        errno = pytest.main(self.pytest_args)\n        raise SystemExit(errno)\n\n\nsetup(name='sklearn-pandas',\n      version=__version__,\n      description='Pandas integration with sklearn',\n      maintainer='Ritesh Agrawal',\n      maintainer_email='ragrawal@gmail.com',\n      url='https://github.com/scikit-learn-contrib/sklearn-pandas',\n      packages=['sklearn_pandas'],\n      keywords=['scikit', 'sklearn', 'pandas'],\n      install_requires=[\n          'scikit-learn>=0.23.0',\n          'scipy>=1.5.1',\n          'pandas>=1.1.4',\n          'numpy>=1.18.1'\n      ],\n      tests_require=['pytest', 'mock'],\n      cmdclass={'test': PyTest},\n      license='MIT License'\n)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/noxfile.py",
      "content": "import nox\n\n@nox.session\ndef lint(session):\n    session.install('pytest>=5.3.5', 'setuptools>=45.2',\n                    'wheel>=0.34.2', 'flake8>=3.7.9',\n                    'numpy==1.18.1', 'pandas==1.1.4')\n    session.install('.')\n    session.run('flake8', 'sklearn_pandas/', 'tests')\n\n@nox.session\n@nox.parametrize('numpy', ['1.18.1', '1.19.4', '1.20.1'])\n@nox.parametrize('scipy', ['1.5.4', '1.6.0'])\n@nox.parametrize('pandas', ['1.1.4', '1.2.2'])\ndef tests(session, numpy, scipy, pandas):\n    session.install('pytest>=5.3.5', \n                    'setuptools>=45.2',\n                    'wheel>=0.34.2',\n                    f'numpy=={numpy}',\n                    f'scipy=={scipy}',\n                    f'pandas=={pandas}'\n                    )\n    session.install('.')\n    session.run('py.test', 'README.rst', 'tests')\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/features_generator.py",
      "content": "def gen_features(columns, classes=None, prefix='', suffix=''):\n    \"\"\"Generates a feature definition list which can be passed\n    into DataFrameMapper\n\n    Params:\n\n    columns     a list of column names to generate features for.\n\n    classes     a list of classes for each feature, a list of dictionaries with\n                transformer class and init parameters, or None.\n\n                If list of classes is provided, then each of them is\n                instantiated with default arguments. Example:\n\n                    classes = [StandardScaler, LabelBinarizer]\n\n                If list of dictionaries is provided, then each of them should\n                have a 'class' key with transformer class. All other keys are\n                passed into 'class' value constructor. Example:\n\n                    classes = [\n                        {'class': StandardScaler, 'with_mean': False},\n                        {'class': LabelBinarizer}\n                    }]\n\n                If None value selected, then each feature left as is.\n\n    prefix      add prefix to transformed column names\n\n    suffix      add suffix to transformed column names.\n\n    \"\"\"\n    if classes is None:\n        return [(column, None) for column in columns]\n\n    feature_defs = []\n\n    for column in columns:\n        feature_transformers = []\n\n        arguments = {}\n        if prefix and prefix != \"\":\n            arguments['prefix'] = prefix\n        if suffix and suffix != \"\":\n            arguments['suffix'] = suffix\n\n        classes = [cls for cls in classes if cls is not None]\n        if not classes:\n            feature_defs.append((column, None, arguments))\n\n        else:\n            for definition in classes:\n                if isinstance(definition, dict):\n                    params = definition.copy()\n                    klass = params.pop('class')\n                    feature_transformers.append(klass(**params))\n                else:\n                    feature_transformers.append(definition())\n\n            if not feature_transformers:\n                feature_transformers = None\n\n            feature_defs.append((column, feature_transformers, arguments))\n\n    return feature_defs\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/transformers.py",
      "content": "import numpy as np\nimport pandas as pd\nfrom sklearn.base import TransformerMixin\nimport warnings\n\n\ndef _get_mask(X, value):\n    \"\"\"\n    Compute the boolean mask X == missing_values.\n    \"\"\"\n    if value == \"NaN\" or \\\n       value is None or \\\n       (isinstance(value, float) and np.isnan(value)):\n        return pd.isnull(X)\n    else:\n        return X == value\n\n\nclass NumericalTransformer(TransformerMixin):\n    \"\"\"\n    Provides commonly used numerical transformers.\n    \"\"\"\n    SUPPORTED_FUNCTIONS = ['log', 'log1p']\n\n    def __init__(self, func):\n        \"\"\"\n        Params\n\n        func    function to apply to input columns. The function will be\n                applied to each value. Supported functions are defined\n                in SUPPORTED_FUNCTIONS variable. Throws assertion error if the\n                not supported.\n        \"\"\"\n\n        warnings.warn(\"\"\"\n            NumericalTransformer will be deprecated in 3.0 version.\n            Please use Sklearn.base.TransformerMixin to write\n            customer transformers\n            \"\"\", DeprecationWarning)\n\n        assert func in self.SUPPORTED_FUNCTIONS, \\\n            f\"Only following func are supported: {self.SUPPORTED_FUNCTIONS}\"\n        super(NumericalTransformer, self).__init__()\n        self.__func = func\n\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X, y=None):\n        if self.__func == 'log1p':\n            return np.vectorize(np.log1p)(X)\n        elif self.__func == 'log':\n            return np.vectorize(np.log)(X)\n\n        raise ValueError(f\"Invalid function name: {self.__func}\")\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/__init__.py",
      "content": "__version__ = '2.2.0'\n\nimport logging\nlogger = logging.getLogger(__name__)\n\nfrom .dataframe_mapper import DataFrameMapper  # NOQA\nfrom .features_generator import gen_features  # NOQA\nfrom .transformers import NumericalTransformer # NOQA\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/pipeline.py",
      "content": "import six\nfrom sklearn.pipeline import _name_estimators, Pipeline\nfrom sklearn.utils import tosequence\n\n\ndef _call_fit(fit_method, X, y=None, **kwargs):\n    \"\"\"\n    helper function, calls the fit or fit_transform method with the correct\n    number of parameters\n\n    fit_method: fit or fit_transform method of the transformer\n    X: the data to fit\n    y: the target vector relative to X, optional\n    kwargs: any keyword arguments to the fit method\n\n    return: the result of the fit or fit_transform method\n\n    WARNING: if this function raises a TypeError exception, test the fit\n    or fit_transform method passed to it in isolation as _call_fit will not\n    distinguish TypeError due to incorrect number of arguments from\n    other TypeError\n    \"\"\"\n    try:\n        return fit_method(X, y, **kwargs)\n    except TypeError:\n        # fit takes only one argument\n        return fit_method(X, **kwargs)\n\n\nclass TransformerPipeline(Pipeline):\n    \"\"\"\n    Pipeline that expects all steps to be transformers taking a single X\n    argument, an optional y argument, and having fit and transform methods.\n\n    Code is copied from sklearn's Pipeline\n    \"\"\"\n\n    def __init__(self, steps):\n        names, estimators = zip(*steps)\n        if len(dict(steps)) != len(steps):\n            raise ValueError(\n                \"Provided step names are not unique: %s\" % (names,))\n\n        # shallow copy of steps\n        self.steps = tosequence(steps)\n        estimator = estimators[-1]\n\n        for e in estimators:\n            if (not (hasattr(e, \"fit\") or hasattr(e, \"fit_transform\")) or not\n                    hasattr(e, \"transform\")):\n                raise TypeError(\"All steps of the chain should \"\n                                \"be transforms and implement fit and transform\"\n                                \" '%s' (type %s) doesn't)\" % (e, type(e)))\n\n        if not hasattr(estimator, \"fit\"):\n            raise TypeError(\"Last step of chain should implement fit \"\n                            \"'%s' (type %s) doesn't)\"\n                            % (estimator, type(estimator)))\n\n    def _pre_transform(self, X, y=None, **fit_params):\n        fit_params_steps = dict((step, {}) for step, _ in self.steps)\n        for pname, pval in six.iteritems(fit_params):\n            step, param = pname.split('__', 1)\n            fit_params_steps[step][param] = pval\n        Xt = X\n        for name, transform in self.steps[:-1]:\n            if hasattr(transform, \"fit_transform\"):\n                Xt = _call_fit(transform.fit_transform,\n                               Xt, y, **fit_params_steps[name])\n            else:\n                Xt = _call_fit(transform.fit,\n                               Xt, y, **fit_params_steps[name]).transform(Xt)\n        return Xt, fit_params_steps[self.steps[-1][0]]\n\n    def fit(self, X, y=None, **fit_params):\n        Xt, fit_params = self._pre_transform(X, y, **fit_params)\n        _call_fit(self.steps[-1][-1].fit, Xt, y, **fit_params)\n        return self\n\n    def fit_transform(self, X, y=None, **fit_params):\n        Xt, fit_params = self._pre_transform(X, y, **fit_params)\n        if hasattr(self.steps[-1][-1], 'fit_transform'):\n            return _call_fit(self.steps[-1][-1].fit_transform,\n                             Xt, y, **fit_params)\n        else:\n            return _call_fit(self.steps[-1][-1].fit,\n                             Xt, y, **fit_params).transform(Xt)\n\n\ndef make_transformer_pipeline(*steps):\n    \"\"\"Construct a TransformerPipeline from the given estimators.\n    \"\"\"\n    return TransformerPipeline(_name_estimators(steps))\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/dataframe_mapper.py",
      "content": "import contextlib\nfrom datetime import datetime\nimport pandas as pd\nimport numpy as np\nfrom scipy import sparse\nfrom sklearn.base import BaseEstimator, TransformerMixin\nfrom .cross_validation import DataWrapper\nfrom .pipeline import make_transformer_pipeline, _call_fit, TransformerPipeline\nfrom . import logger\n\nstring_types = text_type = str\n\n\ndef _handle_feature(fea):\n    \"\"\"\n    Convert 1-dimensional arrays to 2-dimensional column vectors.\n    \"\"\"\n    if len(fea.shape) == 1:\n        fea = np.array([fea]).T\n\n    return fea\n\n\ndef _build_transformer(transformers):\n    if isinstance(transformers, list):\n        transformers = make_transformer_pipeline(*transformers)\n    return transformers\n\n\ndef _build_feature(columns, transformers, options={}, X=None):\n    if X is None:\n        return (columns, _build_transformer(transformers), options)\n    return (\n        columns(X) if callable(columns) else columns,\n        _build_transformer(transformers),\n        options\n    )\n\n\ndef _elapsed_secs(t1):\n    return (datetime.now()-t1).total_seconds()\n\n\ndef _get_feature_names(estimator):\n    \"\"\"\n    Attempt to extract feature names based on a given estimator\n    \"\"\"\n    if hasattr(estimator, 'classes_'):\n        return estimator.classes_\n    elif hasattr(estimator, 'get_feature_names'):\n        return estimator.get_feature_names()\n    return None\n\n\n@contextlib.contextmanager\ndef add_column_names_to_exception(column_names):\n    # Stolen from https://stackoverflow.com/a/17677938/356729\n    try:\n        yield\n    except Exception as ex:\n        if ex.args:\n            msg = u'{}: {}'.format(column_names, ex.args[0])\n        else:\n            msg = text_type(column_names)\n        ex.args = (msg,) + ex.args[1:]\n        raise\n\n\nclass DataFrameMapper(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Map Pandas data frame column subsets to their own\n    sklearn transformation.\n    \"\"\"\n\n    def __init__(self, features, default=False, sparse=False, df_out=False,\n                 input_df=False, drop_cols=None):\n        \"\"\"\n        Params:\n\n        features    a list of tuples with features definitions.\n                    The first element is the pandas column selector. This can\n                    be a string (for one column) or a list of strings.\n                    The second element is an object that supports\n                    sklearn's transform interface, or a list of such objects\n                    The third element is optional and, if present, must be\n                    a dictionary with the options to apply to the\n                    transformation. Example: {'alias': 'day_of_week'}\n\n        default     default transformer to apply to the columns not\n                    explicitly selected in the mapper. If False (default),\n                    discard them. If None, pass them through untouched. Any\n                    other transformer will be applied to all the unselected\n                    columns as a whole, taken as a 2d-array.\n\n        sparse      will return sparse matrix if set True and any of the\n                    extracted features is sparse. Defaults to False.\n\n        df_out      return a pandas data frame, with each column named using\n                    the pandas column that created it (if there's only one\n                    input and output) or the input columns joined with '_'\n                    if there's multiple inputs, and the name concatenated with\n                    '_1', '_2' etc if there's multiple outputs. NB: does not\n                    work if *default* or *sparse* are true\n\n        input_df    If ``True`` pass the selected columns to the transformers\n                    as a pandas DataFrame or Series. Otherwise pass them as a\n                    numpy array. Defaults to ``False``.\n\n        drop_cols   List of columns to be dropped. Defaults to None.\n\n        \"\"\"\n        self.features = features\n        self.default = default\n        self.built_default = None\n        self.sparse = sparse\n        self.df_out = df_out\n        self.input_df = input_df\n        self.drop_cols = [] if drop_cols is None else drop_cols\n        self.transformed_names_ = []\n        if (df_out and (sparse or default)):\n            raise ValueError(\"Can not use df_out with sparse or default\")\n\n    def _build(self, X=None):\n        \"\"\"\n        Build attributes built_features and built_default.\n        \"\"\"\n        if isinstance(self.features, list):\n            self.built_features = [\n                _build_feature(*f, X=X) for f in self.features\n            ]\n        else:\n            self.built_features = _build_feature(*self.features, X=X)\n        self.built_default = _build_transformer(self.default)\n\n    @property\n    def _selected_columns(self):\n        \"\"\"\n        Return a set of selected columns in the feature list.\n        \"\"\"\n        selected_columns = set()\n        for feature in self.features:\n            columns = feature[0]\n            if isinstance(columns, list):\n                selected_columns = selected_columns.union(set(columns))\n            else:\n                selected_columns.add(columns)\n        return selected_columns\n\n    def _unselected_columns(self, X):\n        \"\"\"\n        Return list of columns present in X and not selected explicitly in the\n        mapper.\n\n        Unselected columns are returned in the order they appear in the\n        dataframe to avoid issues with different ordering during default fit\n        and transform steps.\n        \"\"\"\n        X_columns = list(X.columns)\n        return [column for column in X_columns if\n                column not in self._selected_columns\n                and column not in self.drop_cols]\n\n    def __setstate__(self, state):\n        # compatibility for older versions of sklearn-pandas\n        super().__setstate__(state)\n        self.features = [_build_feature(*feat) for feat in state['features']]\n        self.sparse = state.get('sparse', False)\n        self.default = state.get('default', False)\n        self.df_out = state.get('df_out', False)\n        self.input_df = state.get('input_df', False)\n        self.drop_cols = state.get('drop_cols', [])\n        self.built_features = state.get('built_features', self.features)\n        self.built_default = state.get('built_default', self.default)\n        self.transformed_names_ = state.get('transformed_names_', [])\n\n    def __getstate__(self):\n        state = super().__getstate__()\n        state['features'] = self.features\n        state['sparse'] = self.sparse\n        state['default'] = self.default\n        state['df_out'] = self.df_out\n        state['input_df'] = self.input_df\n        state['drop_cols'] = self.drop_cols\n        state['build_features'] = getattr(self, 'built_features', None)\n        state['built_default'] = self.built_default\n        state['transformed_names_'] = self.transformed_names_\n        return state\n\n    def _get_col_subset(self, X, cols, input_df=False):\n        \"\"\"\n        Get a subset of columns from the given table X.\n\n        X       a Pandas dataframe; the table to select columns from\n        cols    a string or list of strings representing the columns to select.\n                It can also be a callable that returns True or False, i.e.\n                compatible with the built-in filter function.\n\n        Returns a numpy array with the data from the selected columns\n        \"\"\"\n\n        if isinstance(cols, string_types):\n            return_vector = True\n            cols = [cols]\n        else:\n            return_vector = False\n\n        # Needed when using the cross-validation compatibility\n        # layer for sklearn<0.16.0.\n        # Will be dropped on sklearn-pandas 2.0.\n        if isinstance(X, list):\n            X = [x[cols] for x in X]\n            X = pd.DataFrame(X)\n\n        elif isinstance(X, DataWrapper):\n            X = X.df  # fetch underlying data\n\n        if return_vector:\n            t = X[cols[0]]\n        else:\n            t = X[cols]\n\n        # return either a DataFrame/Series or a numpy array\n        if input_df:\n            return t\n        else:\n            return t.values\n\n    def fit(self, X, y=None):\n        \"\"\"\n        Fit a transformation from the pipeline\n\n        X       the data to fit\n\n        y       the target vector relative to X, optional\n\n        \"\"\"\n        self._build(X=X)\n\n        for columns, transformers, options in self.built_features:\n            t1 = datetime.now()\n            input_df = options.get('input_df', self.input_df)\n\n            if transformers is not None:\n                with add_column_names_to_exception(columns):\n                    Xt = self._get_col_subset(X, columns, input_df)\n                    _call_fit(transformers.fit, Xt, y)\n            logger.info(f\"[FIT] {columns}: {_elapsed_secs(t1)} secs\")\n\n        # handle features not explicitly selected\n        if self.built_default:  # not False and not None\n            unsel_cols = self._unselected_columns(X)\n            with add_column_names_to_exception(unsel_cols):\n                Xt = self._get_col_subset(X, unsel_cols, self.input_df)\n                _call_fit(self.built_default.fit, Xt, y)\n        return self\n\n    def get_names(self, columns, transformer, x, alias=None, prefix='',\n                  suffix=''):\n        \"\"\"\n        Return verbose names for the transformed columns.\n\n        columns       name (or list of names) of the original column(s)\n        transformer   transformer - can be a TransformerPipeline\n        x             transformed columns (numpy.ndarray)\n        alias         base name to use for the selected columns\n        \"\"\"\n        if alias is not None:\n            name = alias\n        elif isinstance(columns, list):\n            name = '_'.join(map(str, columns))\n        else:\n            name = columns\n        num_cols = x.shape[1] if len(x.shape) > 1 else 1\n\n        output = []\n\n        if num_cols > 1:\n            # If there are as many columns as classes in the transformer,\n            # infer column names from classes names.\n\n            # If we are dealing with multiple transformers for these columns\n            # attempt to extract the names from each of them, starting from the\n            # last one\n            if isinstance(transformer, TransformerPipeline):\n                inverse_steps = transformer.steps[::-1]\n                estimators = (estimator for name, estimator in inverse_steps)\n                names_steps = (_get_feature_names(e) for e in estimators)\n                names = next((n for n in names_steps if n is not None), None)\n            # Otherwise use the only estimator present\n            else:\n                names = _get_feature_names(transformer)\n\n            if names is not None and len(names) == num_cols:\n                output = [f\"{name}_{o}\" for o in names]\n                # otherwise, return name concatenated with '_1', '_2', etc.\n            else:\n                output = [name + '_' + str(o) for o in range(num_cols)]\n        else:\n            output = [name]\n\n        if prefix == suffix == \"\":\n            return output\n\n        return ['{}{}{}'.format(prefix, x, suffix) for x in output]\n\n    def get_dtypes(self, extracted):\n        dtypes_features = [self.get_dtype(ex) for ex in extracted]\n        return [dtype for dtype_feature in dtypes_features\n                for dtype in dtype_feature]\n\n    def get_dtype(self, ex):\n        if isinstance(ex, np.ndarray) or sparse.issparse(ex):\n            return [ex.dtype] * ex.shape[1]\n        elif isinstance(ex, pd.DataFrame):\n            return list(ex.dtypes)\n        else:\n            raise TypeError(type(ex))\n\n    def _transform(self, X, y=None, do_fit=False):\n        \"\"\"\n        Transform the given data with possibility to fit in advance.\n        Avoids code duplication for implementation of transform and\n        fit_transform.\n        \"\"\"\n        if do_fit:\n            self._build(X=X)\n\n        extracted = []\n        transformed_names_ = []\n        for columns, transformers, options in self.built_features:\n            input_df = options.get('input_df', self.input_df)\n\n            # columns could be a string or list of\n            # strings; we don't care because pandas\n            # will handle either.\n            Xt = self._get_col_subset(X, columns, input_df)\n\n            if transformers is not None:\n                with add_column_names_to_exception(columns):\n                    if do_fit and hasattr(transformers, 'fit_transform'):\n                        t1 = datetime.now()\n                        Xt = _call_fit(transformers.fit_transform, Xt, y)\n                        logger.info(f\"[FIT_TRANSFORM] {columns}: {_elapsed_secs(t1)} secs\")  # NOQA\n                    else:\n                        if do_fit:\n                            t1 = datetime.now()\n                            _call_fit(transformers.fit, Xt, y)\n                            logger.info(\n                                f\"[FIT] {columns}: {_elapsed_secs(t1)} secs\")\n\n                        t1 = datetime.now()\n                        Xt = transformers.transform(Xt)\n                        logger.info(f\"[TRANSFORM] {columns}: {_elapsed_secs(t1)} secs\")  # NOQA\n\n            extracted.append(_handle_feature(Xt))\n\n            alias = options.get('alias')\n\n            prefix = options.get('prefix', '')\n            suffix = options.get('suffix', '')\n\n            transformed_names_ += self.get_names(\n                columns, transformers, Xt, alias, prefix, suffix)\n\n        # handle features not explicitly selected\n        if self.built_default is not False:\n            unsel_cols = self._unselected_columns(X)\n            Xt = self._get_col_subset(X, unsel_cols, self.input_df)\n            if self.built_default is not None:\n                with add_column_names_to_exception(unsel_cols):\n                    if do_fit and hasattr(self.built_default, 'fit_transform'):\n                        Xt = _call_fit(self.built_default.fit_transform, Xt, y)\n                    else:\n                        if do_fit:\n                            _call_fit(self.built_default.fit, Xt, y)\n                        Xt = self.built_default.transform(Xt)\n                transformed_names_ += self.get_names(\n                    unsel_cols, self.built_default, Xt)\n            else:\n                # if not applying a default transformer,\n                # keep column names unmodified\n                transformed_names_ += unsel_cols\n\n            extracted.append(_handle_feature(Xt))\n\n        self.transformed_names_ = transformed_names_\n\n        # combine the feature outputs into one array.\n        # at this point we lose track of which features\n        # were created from which input columns, so it's\n        # assumed that that doesn't matter to the model.\n\n        # If any of the extracted features is sparse, combine sparsely.\n        # Otherwise, combine as normal arrays.\n        if any(sparse.issparse(fea) for fea in extracted):\n            stacked = sparse.hstack(extracted).tocsr()\n            # return a sparse matrix only if the mapper was initialized\n            # with sparse=True\n            if not self.sparse:\n                stacked = stacked.toarray()\n        else:\n            stacked = np.hstack(extracted)\n\n        if self.df_out:\n            # if no rows were dropped preserve the original index,\n            # otherwise use a new integer one\n            no_rows_dropped = len(X) == len(stacked)\n            if no_rows_dropped:\n                index = X.index\n            else:\n                index = None\n\n            # output different data types, if appropriate\n            dtypes = self.get_dtypes(extracted)\n            df_out = pd.DataFrame(\n                stacked,\n                columns=self.transformed_names_,\n                index=index)\n            # preserve types\n            for col, dtype in zip(self.transformed_names_, dtypes):\n                df_out[col] = df_out[col].astype(dtype)\n            return df_out\n        else:\n            return stacked\n\n    def transform(self, X):\n        \"\"\"\n        Transform the given data. Assumes that fit has already been called.\n\n        X       the data to transform\n        \"\"\"\n        return self._transform(X)\n\n    def fit_transform(self, X, y=None):\n        \"\"\"\n        Fit a transformation from the pipeline and directly apply\n        it to the given data.\n\n        X       the data to fit\n\n        y       the target vector relative to X, optional\n        \"\"\"\n        return self._transform(X, y, True)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/cross_validation.py",
      "content": "class DataWrapper(object):\n\n    def __init__(self, df):\n        self.df = df\n\n    def __len__(self):\n        return len(self.df)\n\n    def __getitem__(self, key):\n        return self.df.iloc[key]\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_pipeline.py",
      "content": "import pytest\nfrom sklearn_pandas.pipeline import TransformerPipeline, _call_fit\n\n# In py3, mock is included with the unittest standard library\n# In py2, it's a separate package\ntry:\n    from unittest.mock import patch\nexcept ImportError:\n    from mock import patch\n\n\nclass NoTransformT(object):\n    \"\"\"Transformer without transform method.\n    \"\"\"\n    def fit(self, x):\n        return self\n\n\nclass NoFitT(object):\n    \"\"\"Transformer without fit method.\n    \"\"\"\n    def transform(self, x):\n        return self\n\n\nclass Trans(object):\n    \"\"\"\n    Transformer with fit and transform methods\n    \"\"\"\n    def fit(self, x, y=None):\n        return self\n\n    def transform(self, x):\n        return self\n\n\ndef func_x_y(x, y, kwarg='kwarg'):\n    \"\"\"\n    Function with required x and y arguments\n    \"\"\"\n    return\n\n\ndef func_x(x, kwarg='kwarg'):\n    \"\"\"\n    Function with required x argument\n    \"\"\"\n    return\n\n\ndef func_raise_type_err(x, y, kwarg='kwarg'):\n    \"\"\"\n    Function with required x and y arguments,\n    raises TypeError\n    \"\"\"\n    raise TypeError\n\n\ndef test_all_steps_fit_transform():\n    \"\"\"\n    All steps must implement fit and transform. Otherwise, raise TypeError.\n    \"\"\"\n    with pytest.raises(TypeError):\n        TransformerPipeline([('svc', NoTransformT())])\n\n    with pytest.raises(TypeError):\n        TransformerPipeline([('svc', NoFitT())])\n\n\n@patch.object(Trans, 'fit', side_effect=func_x_y)\ndef test_called_with_x_and_y(mock_fit):\n    \"\"\"\n    Fit method with required X and y arguments is called with both and with\n    any additional keywords\n    \"\"\"\n    _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n    mock_fit.assert_called_with('X', 'y', kwarg='kwarg')\n\n\n@patch.object(Trans, 'fit', side_effect=func_x)\ndef test_called_with_x(mock_fit):\n    \"\"\"\n    Fit method with a required X arguments is called with it and with\n    any additional keywords\n    \"\"\"\n    _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n    mock_fit.assert_called_with('X', kwarg='kwarg')\n\n    _call_fit(Trans().fit, 'X', kwarg='kwarg')\n    mock_fit.assert_called_with('X', kwarg='kwarg')\n\n\n@patch.object(Trans, 'fit', side_effect=func_raise_type_err)\ndef test_raises_type_error(mock_fit):\n    \"\"\"\n    If a fit method with required X and y arguments raises a TypeError, it's\n    re-raised (for a different reason) when it's called with one argument\n    \"\"\"\n    with pytest.raises(TypeError):\n        _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py",
      "content": "import tempfile\nimport pytest\nimport numpy as np\nfrom pandas import DataFrame\nimport joblib\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas import NumericalTransformer\n\n\n@pytest.fixture\ndef simple_dataset():\n    return DataFrame({\n        'feat1': [1, 2, 1, 3, 1],\n        'feat2': [1, 2, 2, 2, 3],\n        'feat3': [1, 2, 3, 4, 5],\n    })\n\n\ndef test_common_numerical_transformer(simple_dataset):\n    \"\"\"\n    Test log transformation\n    \"\"\"\n    transfomer = DataFrameMapper([\n        ('feat1', NumericalTransformer('log'))\n    ], df_out=True)\n    df = simple_dataset\n    outDF = transfomer.fit_transform(df)\n    assert list(outDF.columns) == ['feat1']\n    assert np.array_equal(df['feat1'].apply(np.log).values, outDF.feat1.values)\n\n\ndef test_numerical_transformer_serialization(simple_dataset):\n    \"\"\"\n    Test if you can serialize transformer\n    \"\"\"\n    transfomer = DataFrameMapper([\n        ('feat1', NumericalTransformer('log'))\n    ])\n\n    df = simple_dataset\n    transfomer.fit(df)\n    f = tempfile.NamedTemporaryFile(delete=True)\n    joblib.dump(transfomer, f.name)\n    transfomer2 = joblib.load(f.name)\n    np.array_equal(transfomer.transform(df), transfomer2.transform(df))\n    f.close()\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
      "content": "# -*- coding: utf8 -*-\n\nimport pytest\nfrom unittest.mock import Mock\nfrom pandas import DataFrame\nimport pandas as pd\nfrom scipy import sparse\nfrom sklearn.datasets import load_iris\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.model_selection import cross_val_score\nfrom sklearn.svm import SVC\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom sklearn.feature_extraction import DictVectorizer\nfrom sklearn.preprocessing import (\n    StandardScaler, OneHotEncoder, LabelBinarizer)\nfrom sklearn.impute import SimpleImputer as Imputer\nfrom sklearn.feature_selection import SelectKBest, chi2\nfrom sklearn.base import BaseEstimator, TransformerMixin\nimport sklearn.decomposition\nimport numpy as np\nfrom numpy.testing import assert_array_equal\nimport pickle\nfrom sklearn.compose import make_column_selector\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas.dataframe_mapper import _handle_feature, _build_transformer\nfrom sklearn_pandas.pipeline import TransformerPipeline\n\n\nclass MockXTransformer(object):\n    \"\"\"\n    Mock transformer that accepts no y argument.\n    \"\"\"\n    def fit(self, X):\n        return self\n\n    def transform(self, X):\n        return X\n\n\nclass MockTClassifier(object):\n    \"\"\"\n    Mock transformer/classifier.\n    \"\"\"\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        return X\n\n    def predict(self, X):\n        return True\n\n\nclass DateEncoder():\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        dt = X.dt\n        return pd.concat([dt.year, dt.month, dt.day], axis=1)\n\n\nclass ToSparseTransformer(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Transforms numpy matrix to sparse format.\n    \"\"\"\n    def fit(self, X):\n        return self\n\n    def transform(self, X):\n        return sparse.csr_matrix(X)\n\n\nclass CustomTransformer(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Example of transformer in which the number of classes\n    is not equals to the number of output columns.\n    \"\"\"\n    def fit(self, X, y=None):\n        self.min = X.min()\n        self.classes_ = np.unique(X)\n        return self\n\n    def transform(self, X):\n        classes = np.unique(X)\n        if len(np.setdiff1d(classes, self.classes_)) > 0:\n            raise ValueError('Unknown values found.')\n        return X - self.min\n\n\nclass MockImageTransformer(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Example transformer that takes the max of a 2d vector\n    then scales the result.\n    \"\"\"\n    def __init__(self, multiplier=10.0):\n        self.multiplier = multiplier\n\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        assert isinstance(X, pd.DataFrame)\n        for col in X.columns:\n            X[col] = X[col].map(lambda img: np.max(img))\n        return X * self.multiplier\n\n\n@pytest.fixture\ndef simple_dataframe():\n    return pd.DataFrame({'a': [1, 2, 3]})\n\n\n@pytest.fixture\ndef complex_dataframe():\n    return pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'c'],\n                         'feat1': [1, 2, 3, 4, 5, 6],\n                         'feat2': [1, 2, 3, 2, 3, 4]})\n\n\n@pytest.fixture\ndef complex_object_dataframe():\n    return pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'c'],\n                         'feat1': [1, 2, 3, 4, 5, 6],\n                         'feat2': [1, 2, 3, 2, 3, 4],\n                         'img2d': [1*np.eye(2), 2*np.eye(2), 3*np.eye(2),\n                                   4*np.eye(2), 5*np.eye(2), 6*np.eye(2)]})\n\n\n@pytest.fixture\ndef multiindex_dataframe():\n    \"\"\"Example MultiIndex DataFrame, taken from pandas documentation\n    \"\"\"\n    iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]\n    index = pd.MultiIndex.from_product(iterables, names=['first', 'second'])\n    df = pd.DataFrame(np.random.randn(10, 8), columns=index)\n    return df\n\n\n@pytest.fixture\ndef multiindex_dataframe_incomplete(multiindex_dataframe):\n    \"\"\"Example MultiIndex DataFrame with missing entries\n    \"\"\"\n    df = multiindex_dataframe\n    mask_array = np.zeros(df.size)\n    mask_array[:20] = 1\n    np.random.shuffle(mask_array)\n    mask = mask_array.reshape(df.shape).astype(bool)\n    df.mask(mask, inplace=True)\n    return df\n\n\ndef test_transformed_names_simple(simple_dataframe):\n    \"\"\"\n    Get transformed names of features in `transformed_names` attribute\n    for simple transformation\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None)])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_transformed_names_binarizer(complex_dataframe):\n    \"\"\"\n    Get transformed names of features in `transformed_names` attribute\n    for a transformation that multiplies the number of columns\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['target_a', 'target_b', 'target_c']\n\n\ndef test_logging(caplog, complex_dataframe):\n    \"\"\"\n    Get transformed names of features in `transformed_names` attribute\n    for a transformation that multiplies the number of columns\n    \"\"\"\n    import logging\n    logger = logging.getLogger('sklearn_pandas')\n    logger.setLevel(logging.INFO)\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    assert '[FIT_TRANSFORM] target:' in caplog.text\n\n\ndef test_transformed_names_binarizer_unicode():\n    df = pd.DataFrame({'target': [u'ñ', u'á', u'é']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    expected_names = {u'target_ñ', u'target_á', u'target_é'}\n    assert set(mapper.transformed_names_) == expected_names\n\n\ndef test_transformed_names_transformers_list(complex_dataframe):\n    \"\"\"\n    When using a list of transformers, use them in inverse order to get the\n    transformed names\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([\n        ('target', [LabelBinarizer(), MockXTransformer()])\n    ])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['target_a', 'target_b', 'target_c']\n\n\ndef test_transformed_names_simple_alias(simple_dataframe):\n    \"\"\"\n    If we specify an alias for a single output column, it is used for the\n    output\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None, {'alias': 'new_name'})])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['new_name']\n\n\ndef test_transformed_names_complex_alias(complex_dataframe):\n    \"\"\"\n    If we specify an alias for a multiple output column, it is used for the\n    output\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer(), {'alias': 'new'})])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['new_a', 'new_b', 'new_c']\n\n\ndef test_exception_column_context_transform(simple_dataframe):\n    \"\"\"\n    If an exception is raised when transforming a column,\n    the exception includes the name of the column being transformed\n    \"\"\"\n    class FailingTransformer(object):\n        def fit(self, X):\n            pass\n\n        def transform(self, X):\n            raise Exception('Some exception')\n\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', FailingTransformer())])\n    mapper.fit(df)\n\n    with pytest.raises(Exception, match='a: Some exception'):\n        mapper.transform(df)\n\n\ndef test_exception_column_context_fit(simple_dataframe):\n    \"\"\"\n    If an exception is raised when fit a column,\n    the exception includes the name of the column being fitted\n    \"\"\"\n    class FailingFitter(object):\n        def fit(self, X):\n            raise Exception('Some exception')\n\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', FailingFitter())])\n\n    with pytest.raises(Exception, match='a: Some exception'):\n        mapper.fit(df)\n\n\ndef test_simple_df(simple_dataframe):\n    \"\"\"\n    Get a dataframe from a simple mapped dataframe\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None)], df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert type(transformed) == pd.DataFrame\n    assert len(transformed[\"a\"]) == len(simple_dataframe[\"a\"])\n\n\ndef test_complex_df(complex_dataframe):\n    \"\"\"\n    Get a dataframe from a complex mapped dataframe\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [('target', None), ('feat1', None), ('feat2', None)],\n        df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_dataframe)\n    for c in df.columns:\n        assert len(transformed[c]) == len(df[c])\n\n\ndef test_complex_object_df(complex_object_dataframe):\n    \"\"\"\n    Get a dataframe from a complex dataframe with 2d features\n    \"\"\"\n    df = complex_object_dataframe\n    img_scale = 10\n    mapper = DataFrameMapper(\n        [('target', None), ('feat1', None),\n         (make_column_selector('feat2'), StandardScaler()),\n         (make_column_selector('img2d'), MockImageTransformer(img_scale))],\n        df_out=True, input_df=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_object_dataframe)\n    assert np.isclose(\n        np.sum(transformed['img2d']),\n        np.max(np.sum(df['img2d'])) * img_scale, atol=1e-12)\n\n\ndef test_numeric_column_names(complex_dataframe):\n    \"\"\"\n    Get a dataframe from a complex mapped dataframe with numeric column names\n    \"\"\"\n    df = complex_dataframe\n    df.columns = [0, 1, 2]\n    mapper = DataFrameMapper(\n        [(0, None), (1, None), (2, None)], df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_dataframe)\n    for c in df.columns:\n        assert len(transformed[c]) == len(df[c])\n\n\ndef test_multiindex_df(multiindex_dataframe_incomplete):\n    \"\"\"\n    Get a dataframe from a multiindex dataframe with missing data\n    \"\"\"\n    df = multiindex_dataframe_incomplete\n    mapper = DataFrameMapper([([c], Imputer()) for c in df.columns],\n                             df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(multiindex_dataframe_incomplete)\n    for c in df.columns:\n        assert len(transformed[str(c)]) == len(df[c])\n\n\ndef test_binarizer_df():\n    \"\"\"\n    Check level names from LabelBinarizer\n    \"\"\"\n    df = pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'a']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 3\n    assert cols[0] == 'target_a'\n    assert cols[1] == 'target_b'\n    assert cols[2] == 'target_c'\n\n\ndef test_binarizer_int_df():\n    \"\"\"\n    Check level names from LabelBinarizer for a numeric array.\n    \"\"\"\n    df = pd.DataFrame({'target': [5, 5, 6, 6, 7, 5]})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 3\n    assert cols[0] == 'target_5'\n    assert cols[1] == 'target_6'\n    assert cols[2] == 'target_7'\n\n\ndef test_binarizer2_df():\n    \"\"\"\n    Check level names from LabelBinarizer with just one output column\n    \"\"\"\n    df = pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'a']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 1\n    assert cols[0] == 'target'\n\n\ndef test_onehot_df():\n    \"\"\"\n    Check level ids from one-hot\n    \"\"\"\n    df = pd.DataFrame({'target': [0, 0, 1, 1, 2, 3, 0]})\n    mapper = DataFrameMapper([(['target'], OneHotEncoder())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 4\n    assert cols[0] == 'target_0'\n    assert cols[3] == 'target_3'\n\n\ndef test_customtransform_df():\n    \"\"\"\n    Check level ids from a transformer in which\n    the number of classes is not equals to the number of output columns.\n    \"\"\"\n    df = pd.DataFrame({'target': [6, 5, 7, 5, 4, 8, 8]})\n    mapper = DataFrameMapper([(['target'], CustomTransformer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(mapper.features[0][1].classes_) == 5\n    assert len(cols) == 1\n    assert cols[0] == 'target'\n\n\ndef test_preserve_df_index():\n    \"\"\"\n    The index is preserved when df_out=True\n    \"\"\"\n    df = pd.DataFrame({'target': [1, 2, 3]},\n                      index=['a', 'b', 'c'])\n    mapper = DataFrameMapper([('target', None)],\n                             df_out=True)\n\n    transformed = mapper.fit_transform(df)\n\n    assert_array_equal(transformed.index, df.index)\n\n\ndef test_preserve_df_index_rows_dropped():\n    \"\"\"\n    If df_out=True but the original df index length doesn't\n    match the number of final rows, use a numeric index\n    \"\"\"\n    class DropLastRowTransformer(object):\n        def fit(self, X):\n            return self\n\n        def transform(self, X):\n            return X[:-1]\n\n    df = pd.DataFrame({'target': [1, 2, 3]},\n                      index=['a', 'b', 'c'])\n    mapper = DataFrameMapper([('target', DropLastRowTransformer())],\n                             df_out=True)\n\n    transformed = mapper.fit_transform(df)\n\n    assert_array_equal(transformed.index, np.array([0, 1]))\n\n\ndef test_pca(complex_dataframe):\n    \"\"\"\n    Check multi in and out with PCA\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [(['feat1', 'feat2'], sklearn.decomposition.PCA(2))],\n        df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 2\n    assert cols[0] == 'feat1_feat2_0'\n    assert cols[1] == 'feat1_feat2_1'\n\n\ndef test_fit_transform(simple_dataframe):\n    \"\"\"\n    Check that custom fit_transform methods of the transformers are invoked.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    # return something of measurable length but does nothing\n    mock_transformer.fit_transform.return_value = np.array([1, 2, 3])\n    mapper = DataFrameMapper([(\"a\", mock_transformer)])\n    mapper.fit_transform(df)\n    assert mock_transformer.fit_transform.called\n\n\ndef test_fit_transform_equiv_mock(simple_dataframe):\n    \"\"\"\n    Check for equivalent results for code paths fit_transform\n    versus fit and transform in DataFrameMapper using the mock\n    transformer which does not implement a custom fit_transform.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', MockXTransformer())])\n    transformed_combined = mapper.fit_transform(df)\n    transformed_separate = mapper.fit(df).transform(df)\n    assert np.all(transformed_combined == transformed_separate)\n\n\ndef test_fit_transform_equiv_pca(complex_dataframe):\n    \"\"\"\n    Check for equivalent results for code paths fit_transform\n    versus fit and transform in DataFrameMapper and transformer\n    using PCA which implements a custom fit_transform. The\n    equivalence of both paths in the transformer only can be\n    asserted since this is tested in the sklearn tests\n    scikit-learn/sklearn/decomposition/tests/test_pca.py\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [(['feat1', 'feat2'], sklearn.decomposition.PCA(2))],\n        df_out=True)\n    transformed_combined = mapper.fit_transform(df)\n    transformed_separate = mapper.fit(df).transform(df)\n    assert np.allclose(transformed_combined, transformed_separate)\n\n\ndef test_input_df_true_first_transformer(simple_dataframe, monkeypatch):\n    \"\"\"\n    If input_df is True, the first transformer is passed\n    a pd.Series instead of an np.array\n    \"\"\"\n    df = simple_dataframe\n    monkeypatch.setattr(MockXTransformer, 'fit', Mock())\n    monkeypatch.setattr(MockXTransformer, 'transform',\n                        Mock(return_value=np.array([1, 2, 3])))\n    mapper = DataFrameMapper([\n        ('a', MockXTransformer())\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n\n    args, _ = MockXTransformer().fit.call_args\n    assert isinstance(args[0], pd.Series)\n\n    args, _ = MockXTransformer().transform.call_args\n    assert isinstance(args[0], pd.Series)\n\n    assert_array_equal(out, np.array([1, 2, 3]).reshape(-1, 1))\n\n\ndef test_input_df_true_next_transformers(simple_dataframe, monkeypatch):\n    \"\"\"\n    If input_df is True, the subsequent transformers get passed pandas\n    objects instead of numpy arrays (given the previous transformers\n    output pandas objects as well)\n    \"\"\"\n    df = simple_dataframe\n    monkeypatch.setattr(MockTClassifier, 'fit', Mock())\n    monkeypatch.setattr(MockTClassifier, 'transform',\n                        Mock(return_value=pd.Series([1, 2, 3])))\n    mapper = DataFrameMapper([\n        ('a', [MockXTransformer(), MockTClassifier()])\n    ], input_df=True)\n    mapper.fit(df)\n    out = mapper.transform(df)\n\n    args, _ = MockTClassifier().fit.call_args\n    assert isinstance(args[0], pd.Series)\n\n    assert_array_equal(out, np.array([1, 2, 3]).reshape(-1, 1))\n\n\ndef test_input_df_true_multiple_cols(complex_dataframe):\n    \"\"\"\n    When input_df is True, applying transformers to multiple columns\n    works as expected\n    \"\"\"\n    df = complex_dataframe\n\n    mapper = DataFrameMapper([\n        ('target', MockXTransformer()),\n        ('feat1',  MockXTransformer()),\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n\n    assert_array_equal(out[:, 0], df['target'].values)\n    assert_array_equal(out[:, 1], df['feat1'].values)\n\n\ndef test_input_df_date_encoder():\n    \"\"\"\n    When input_df is True we can apply a transformer that only works\n    with pandas dataframes like a DateEncoder\n    \"\"\"\n    df = pd.DataFrame(\n        {'dates': pd.date_range('2015-10-30', '2015-11-02')})\n    mapper = DataFrameMapper([\n        ('dates', DateEncoder())\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n    expected = np.array([\n        [2015, 10, 30],\n        [2015, 10, 31],\n        [2015, 11, 1],\n        [2015, 11, 2]\n    ])\n    assert_array_equal(out, expected)\n\n\ndef test_local_input_df_date_encoder():\n    \"\"\"\n    When input_df is True we can apply a transformer that only works\n    with pandas dataframes like a DateEncoder\n    \"\"\"\n    df = pd.DataFrame(\n        {'dates': pd.date_range('2015-10-30', '2015-11-02')})\n    mapper = DataFrameMapper([\n        ('dates', DateEncoder(), {'input_df': True})\n    ], input_df=False)\n    out = mapper.fit_transform(df)\n    expected = np.array([\n        [2015, 10, 30],\n        [2015, 10, 31],\n        [2015, 11, 1],\n        [2015, 11, 2]\n    ])\n    assert_array_equal(out, expected)\n\n\ndef test_nonexistent_columns_explicit_fail(simple_dataframe):\n    \"\"\"\n    If a nonexistent column is selected, KeyError is raised.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    with pytest.raises(KeyError):\n        mapper._get_col_subset(simple_dataframe, ['nonexistent_feature'])\n\n\ndef test_get_col_subset_single_column_array(simple_dataframe):\n    \"\"\"\n    Selecting a single column should return a 1-dimensional numpy array.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    array = mapper._get_col_subset(simple_dataframe, \"a\")\n\n    assert type(array) == np.ndarray\n    assert array.shape == (len(simple_dataframe[\"a\"]),)\n\n\ndef test_get_col_subset_single_column_list(simple_dataframe):\n    \"\"\"\n    Selecting a list of columns (even if the list contains a single element)\n    should return a 2-dimensional numpy array.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    array = mapper._get_col_subset(simple_dataframe, [\"a\"])\n\n    assert type(array) == np.ndarray\n    assert array.shape == (len(simple_dataframe[\"a\"]), 1)\n\n\ndef test_cols_string_array(simple_dataframe):\n    \"\"\"\n    If a string is specified as the columns, the transformer\n    is called with a 1-d array as input.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    mapper = DataFrameMapper([(\"a\", mock_transformer)])\n\n    mapper.fit(df)\n    args, kwargs = mock_transformer.fit.call_args\n    assert args[0].shape == (3,)\n\n\ndef test_cols_list_column_vector(simple_dataframe):\n    \"\"\"\n    If a one-element list is specified as the columns, the transformer\n    is called with a column vector as input.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    mapper = DataFrameMapper([([\"a\"], mock_transformer)])\n\n    mapper.fit(df)\n    args, kwargs = mock_transformer.fit.call_args\n    assert args[0].shape == (3, 1)\n\n\ndef test_handle_feature_2dim():\n    \"\"\"\n    2-dimensional arrays are returned unchanged.\n    \"\"\"\n    array = np.array([[1, 2], [3, 4]])\n    assert_array_equal(_handle_feature(array), array)\n\n\ndef test_handle_feature_1dim():\n    \"\"\"\n    1-dimensional arrays are converted to 2-dimensional column vectors.\n    \"\"\"\n    array = np.array([1, 2])\n    assert_array_equal(_handle_feature(array), np.array([[1], [2]]))\n\n\ndef test_build_transformers():\n    \"\"\"\n    When a list of transformers is passed, return a pipeline with\n    each element of the iterable as a step of the pipeline.\n    \"\"\"\n    transformers = [MockTClassifier(), MockTClassifier()]\n    pipeline = _build_transformer(transformers)\n    assert isinstance(pipeline, Pipeline)\n    for ix, transformer in enumerate(transformers):\n        assert pipeline.steps[ix][1] == transformer\n\n\ndef test_selected_columns():\n    \"\"\"\n    selected_columns returns a set of the columns appearing in the features\n    of the mapper.\n    \"\"\"\n    mapper = DataFrameMapper([\n        ('a', None),\n        (['a', 'b'], None)\n    ])\n    assert mapper._selected_columns == {'a', 'b'}\n\n\ndef test_unselected_columns():\n    \"\"\"\n    unselected_columns returns a list of the columns not appearing in the\n    features of the mapper but present in the given dataframe.\n    \"\"\"\n    df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})\n    mapper = DataFrameMapper([\n        ('a', None),\n        (['a', 'b'], None)\n    ])\n    assert 'c' in mapper._unselected_columns(df)\n\n\ndef test_drop_and_default_false():\n    \"\"\"\n    If default=False, non explicitly selected columns and drop columns\n    are discarded.\n    \"\"\"\n    df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})\n    mapper = DataFrameMapper([\n            ('a', None)\n        ], drop_cols=['c'], default=False)\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (1, 1)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_drop_and_default_none():\n    \"\"\"\n    If default=None, drop columns are discarded and\n    remaining non explicitly selected columns are passed through untransformed\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        ('a', None)\n    ], drop_cols=['c'], default=None)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 2)\n    assert mapper.transformed_names_ == ['a', 'b']\n\n\ndef test_conflicting_drop():\n    \"\"\"\n    Drop column name shouldn't get confused with transformed columns.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        ('a', None)\n    ], drop_cols=['a'], default=False)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 1)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_default_false():\n    \"\"\"\n    If default=False, non explicitly selected columns are discarded.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        ('b', None)\n    ], default=False)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 1)\n\n\ndef test_default_none():\n    \"\"\"\n    If default=None, non explicitly selected columns are passed through\n    untransformed.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        (['a'], OneHotEncoder())\n    ], default=None)\n\n    transformed = mapper.fit_transform(df)\n    assert (transformed[:, 3] == np.array([3, 5, 7]).T).all()\n\n\ndef test_default_none_names():\n    \"\"\"\n    If default=None, column names are returned unmodified.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([], default=None)\n\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['a', 'b']\n\n\ndef test_default_transformer():\n    \"\"\"\n    If default=Transformer, non explicitly selected columns are applied this\n    transformer.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, np.nan, 3], })\n    mapper = DataFrameMapper([], default=Imputer())\n\n    transformed = mapper.fit_transform(df)\n    assert (transformed[: 0] == np.array([1., 2., 3.])).all()\n\n\ndef test_list_transformers_single_arg(simple_dataframe):\n    \"\"\"\n    Multiple transformers can be specified in a list even if some of them\n    only accept one X argument instead of two (X, y).\n    \"\"\"\n    mapper = DataFrameMapper([\n        ('a', [MockXTransformer()])\n    ])\n    # doesn't fail\n    mapper.fit_transform(simple_dataframe)\n\n\ndef test_list_transformers():\n    \"\"\"\n    Specifying a list of transformers applies them sequentially to the\n    selected column.\n    \"\"\"\n    dataframe = pd.DataFrame({\"a\": [1, np.nan, 3], \"b\": [1, 5, 7]},\n                             dtype=np.float64)\n\n    mapper = DataFrameMapper([\n        ([\"a\"], [Imputer(), StandardScaler()]),\n        ([\"b\"], StandardScaler()),\n    ])\n    dmatrix = mapper.fit_transform(dataframe)\n\n    assert pd.isnull(dmatrix).sum() == 0  # no null values\n\n    # all features have mean 0 and std deviation 1 (standardized)\n    assert (abs(dmatrix.mean(axis=0) - 0) <= 1e-6).all()\n    assert (abs(dmatrix.std(axis=0) - 1) <= 1e-6).all()\n\n\ndef test_list_transformers_old_unpickle(simple_dataframe):\n    mapper = DataFrameMapper(None)\n    # simulate the mapper was created with < 1.0.0 code\n    mapper.features = [('a', [MockXTransformer()])]\n    mapper_pickled = pickle.dumps(mapper)\n\n    loaded_mapper = pickle.loads(mapper_pickled)\n    transformer = loaded_mapper.features[0][1]\n    assert isinstance(transformer, TransformerPipeline)\n    assert isinstance(transformer.steps[0][1], MockXTransformer)\n\n\ndef test_sparse_features(simple_dataframe):\n    \"\"\"\n    If any of the extracted features is sparse and \"sparse\" argument\n    is true, the hstacked result is also sparse.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([\n        (\"a\", ToSparseTransformer())\n    ], sparse=True)\n    dmatrix = mapper.fit_transform(df)\n\n    assert type(dmatrix) == sparse.csr.csr_matrix\n\n\ndef test_sparse_off(simple_dataframe):\n    \"\"\"\n    If the resulting features are sparse but the \"sparse\" argument\n    of the mapper is False, return a non-sparse matrix.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([\n        (\"a\", ToSparseTransformer())\n    ], sparse=False)\n\n    dmatrix = mapper.fit_transform(df)\n    assert type(dmatrix) != sparse.csr.csr_matrix\n\n\ndef test_fit_with_optional_y_arg(complex_dataframe):\n    \"\"\"\n    Transformers with an optional y argument in the fit method\n    are handled correctly\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([(['feat1', 'feat2'], MockTClassifier())])\n    # doesn't fail\n    mapper.fit(df[['feat1', 'feat2']], df['target'])\n\n\ndef test_fit_with_required_y_arg(complex_dataframe):\n    \"\"\"\n    Transformers with a required y argument in the fit method\n    are handled and perform correctly\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([(['feat1', 'feat2'], SelectKBest(chi2, k=1))])\n\n    # fit, doesn't fail\n    ft_arr = mapper.fit(df[['feat1', 'feat2']], df['target'])\n\n    # fit_transform\n    ft_arr = mapper.fit_transform(df[['feat1', 'feat2']], df['target'])\n    assert_array_equal(ft_arr, df[['feat1']].values)\n\n    # transform\n    t_arr = mapper.transform(df[['feat1', 'feat2']])\n    assert_array_equal(t_arr, df[['feat1']].values)\n\n\n# Integration tests with real dataframes\n\n@pytest.fixture\ndef iris_dataframe():\n    iris = load_iris()\n    return DataFrame(\n        data={\n            iris.feature_names[0]: iris.data[:, 0],\n            iris.feature_names[1]: iris.data[:, 1],\n            iris.feature_names[2]: iris.data[:, 2],\n            iris.feature_names[3]: iris.data[:, 3],\n            \"species\": np.array([iris.target_names[e] for e in iris.target])\n        }\n    )\n\n\n@pytest.fixture\ndef cars_dataframe():\n    return pd.read_csv(\"tests/test_data/cars.csv.gz\", compression='gzip')\n\n\ndef test_with_iris_dataframe(iris_dataframe):\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"petal length (cm)\", None),\n            (\"petal width (cm)\", None),\n            (\"sepal length (cm)\", None),\n            (\"sepal width (cm)\", None),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = iris_dataframe.drop(\"species\", axis=1)\n    labels = iris_dataframe[\"species\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.96\n    assert (scores.std() * 2) < 0.04\n\n\ndef test_dict_vectorizer():\n    df = pd.DataFrame(\n        [[{'a': 1, 'b': 2}], [{'a': 3}]],\n        columns=['colA']\n    )\n\n    outdf = DataFrameMapper(\n        [('colA', DictVectorizer())],\n        df_out=True,\n        default=False\n    ).fit_transform(df)\n\n    columns = sorted(list(outdf.columns))\n    assert len(columns) == 2\n    assert columns[0] == 'colA_0'\n    assert columns[1] == 'colA_1'\n\n\ndef test_with_car_dataframe(cars_dataframe):\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"description\", CountVectorizer()),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = cars_dataframe.drop(\"model\", axis=1)\n    labels = cars_dataframe[\"model\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.30\n\n\ndef test_direct_cross_validation(iris_dataframe):\n    \"\"\"\n    Starting with sklearn>=0.16.0 we no longer need CV wrappers for dataframes.\n    See https://github.com/paulgb/sklearn-pandas/issues/11\n    \"\"\"\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"petal length (cm)\", None),\n            (\"petal width (cm)\", None),\n            (\"sepal length (cm)\", None),\n            (\"sepal width (cm)\", None),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = iris_dataframe.drop(\"species\", axis=1)\n    labels = iris_dataframe[\"species\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.96\n    assert (scores.std() * 2) < 0.04\n\n\ndef test_heterogeneous_output_types_input_df():\n    \"\"\"\n    Modify feat2, but pass feat1 through unmodified.\n    This fails if input_df == False\n    \"\"\"\n    df = pd.DataFrame({\n        'feat1': [1, 2, 3, 4, 5, 6],\n        'feat2': [1.0, 2.0, 3.0, 2.0, 3.0, 4.0]\n    })\n    M = DataFrameMapper([\n        (['feat2'], StandardScaler())\n        ], input_df=True, df_out=True, default=None)\n    dft = M.fit_transform(df)\n    assert dft['feat1'].dtype == np.dtype('int64')\n    assert dft['feat2'].dtype == np.dtype('float64')\n\n\ndef test_make_column_selector(iris_dataframe):\n    t = DataFrameMapper([\n        (make_column_selector(dtype_include=float), None, {'alias': 'x'}),\n        ('sepal length (cm)', None),\n    ], df_out=True, default=False)\n\n    xt = t.fit(iris_dataframe).transform(iris_dataframe)\n    expected = ['x_0', 'x_1', 'x_2', 'x_3', 'sepal length (cm)']\n    assert list(xt.columns) == expected\n\n    pickled = pickle.dumps(t)\n    t2 = pickle.loads(pickled)\n    xt2 = t2.transform(iris_dataframe)\n    assert np.array_equal(xt.values, xt2.values)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
      "content": "from collections import Counter\n\nimport pytest\nimport numpy as np\nfrom pandas import DataFrame\nfrom numpy.testing import assert_array_equal\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas.features_generator import gen_features\n\n\nclass MockClass(object):\n\n    def __init__(self, value=1, name='class'):\n        self.value = value\n        self.name = name\n\n\nclass MockTransformer(object):\n\n    def __init__(self):\n        self.most_common_ = None\n\n    def fit(self, X, y=None):\n        [(value, _)] = Counter(X).most_common(1)\n        self.most_common_ = value\n        return self\n\n    def transform(self, X, y=None):\n        return np.asarray([self.most_common_] * len(X))\n\n\n@pytest.fixture\ndef simple_dataset():\n    return DataFrame({\n        'feat1': [1, 2, 1, 3, 1],\n        'feat2': [1, 2, 2, 2, 3],\n        'feat3': [1, 2, 3, 4, 5],\n    })\n\n\ndef test_generate_features_with_default_parameters():\n    \"\"\"\n    Tests generating features from classes with default init arguments.\n    \"\"\"\n    columns = ['colA', 'colB', 'colC']\n    feature_defs = gen_features(columns=columns, classes=[MockClass])\n    assert len(feature_defs) == len(columns)\n\n    for feature in feature_defs:\n        assert feature[2] == {}\n\n    feature_dict = dict([_[0:2] for _ in feature_defs])\n    assert columns == sorted(feature_dict.keys())\n\n    # default init arguments for MockClass for clarification.\n    expected = {'value': 1, 'name': 'class'}\n    for column, transformers in feature_dict.items():\n        for obj in transformers:\n            assert_attributes(obj, **expected)\n\n\ndef test_generate_features_with_several_classes():\n    \"\"\"\n    Tests generating features pipeline with different transformers parameters.\n    \"\"\"\n    feature_defs = gen_features(\n        columns=['colA', 'colB', 'colC'],\n        classes=[\n            {'class': MockClass},\n            {'class': MockClass, 'name': 'mockA'},\n            {'class': MockClass, 'name': 'mockB', 'value': None}\n        ]\n    )\n\n    for col, transformers, params in feature_defs:\n        assert_attributes(transformers[0], name='class', value=1)\n        assert_attributes(transformers[1], name='mockA', value=1)\n        assert_attributes(transformers[2], name='mockB', value=None)\n\n\ndef test_generate_features_with_none_only_transformers():\n    \"\"\"\n    Tests generating \"dummy\" feature definition which doesn't apply any\n    transformation.\n    \"\"\"\n    feature_defs = gen_features(\n        columns=['colA', 'colB', 'colC'], classes=[None])\n\n    expected = [('colA', None, {}),\n                ('colB', None, {}),\n                ('colC', None, {})]\n\n    assert feature_defs == expected\n\n\ndef test_compatibility_with_data_frame_mapper(simple_dataset):\n    \"\"\"\n    Tests compatibility of generated feature definition with DataFrameMapper.\n    \"\"\"\n    features_defs = gen_features(\n        columns=['feat1', 'feat2'],\n        classes=[MockTransformer])\n    features_defs.append(('feat3', None))\n\n    mapper = DataFrameMapper(features_defs)\n    X = mapper.fit_transform(simple_dataset)\n    expected = np.asarray([\n        [1, 2, 1],\n        [1, 2, 2],\n        [1, 2, 3],\n        [1, 2, 4],\n        [1, 2, 5]\n    ])\n\n    assert_array_equal(X, expected)\n\n\ndef assert_attributes(obj, **attrs):\n    for attr, value in attrs.items():\n        assert getattr(obj, attr) == value\n"
    }
  ],
  "ErrorMessage": "============================================================================================= FAILURES ==============================================================================================\n________________________________________________________________________ test_generate_features_with_none_only_transformers _________________________________________________________________________\n\n    def test_generate_features_with_none_only_transformers():\n        \"\"\"\n        Tests generating \"dummy\" feature definition which doesn't apply any\n        transformation.\n        \"\"\"\n        feature_defs = gen_features(\n            columns=['colA', 'colB', 'colC'], classes=[int])\n    \n        expected = [('colA', int, {}),\n                    ('colB', int, {}),\n                    ('colC', int, {})]\n    \n>       assert feature_defs == expected\nE       AssertionError: assert [('colA', [0]...lC', [0], {})] == [('colA', <cl...s 'int'>, {})]\nE         \nE         At index 0 diff: ('colA', [0], {}) != ('colA', <class 'int'>, {})\nE         \nE         Full diff:\nE           [\nE               (\nE                   'colA',...\nE         \nE         ...Full output truncated (23 lines hidden), use '-vv' to show\n\ntests/test_features_generator.py:94: AssertionError\n========================================================================================= warnings summary ==========================================================================================\ntests/test_dataframe_mapper.py::test_complex_object_df\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:106: SettingWithCopyWarning: \n  A value is trying to be set on a copy of a slice from a DataFrame.\n  Try using .loc[row_indexer,col_indexer] = value instead\n  \n  See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n    X[col] = X[col].map(lambda img: np.max(img))\n\ntests/test_dataframe_mapper.py::test_sparse_features\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:865: DeprecationWarning: Please use `csr_matrix` from the `scipy.sparse` namespace, the `scipy.sparse.csr` namespace is deprecated.\n    assert type(dmatrix) == sparse.csr.csr_matrix\n\ntests/test_dataframe_mapper.py::test_sparse_off\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:879: DeprecationWarning: Please use `csr_matrix` from the `scipy.sparse` namespace, the `scipy.sparse.csr` namespace is deprecated.\n    assert type(dmatrix) != sparse.csr.csr_matrix\n\ntests/test_transformers.py::test_common_numerical_transformer\ntests/test_transformers.py::test_numerical_transformer_serialization\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/sklearn_pandas/transformers.py:35: DeprecationWarning: \n              NumericalTransformer will be deprecated in 3.0 version.\n              Please use Sklearn.base.TransformerMixin to write\n              customer transformers\n              \n    warnings.warn(\"\"\"\n\n-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html\n====================================================================================== short test summary info ======================================================================================\nFAILED tests/test_features_generator.py::test_generate_features_with_none_only_transformers - AssertionError: assert [('colA', [0]...lC', [0], {})] == [('colA', <cl...s 'int'>, {})]\n============================================================================= 1 failed, 69 passed, 5 warnings in 1.40s ==============================================================================",
  "Patch": "--- a/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n@@ -85,11 +85,11 @@\n     transformation.\n     \"\"\"\n     feature_defs = gen_features(\n-        columns=['colA', 'colB', 'colC'], classes=[int])\n+        columns=['colA', 'colB', 'colC'], classes=[None])\n \n-    expected = [('colA', int, {}),\n-                ('colB', int, {}),\n-                ('colC', int, {})]\n+    expected = [('colA', None, {}),\n+                ('colB', None, {}),\n+                ('colC', None, {})]\n \n     assert feature_defs == expected\n \n",
  "BuggyCodeLocation": [
    {
      "file": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
      "function": {
        "82": "test_generate_features_with_none_only_transformers"
      },
      "content_all": {
        "85": "    transformation.\n",
        "86": "    \"\"\"\n",
        "87": "    feature_defs = gen_features(\n",
        "88": "        columns=['colA', 'colB', 'colC'], classes=[int])\n",
        "89": "\n",
        "90": "    expected = [('colA', int, {}),\n",
        "91": "                ('colB', int, {}),\n",
        "92": "                ('colC', int, {})]\n",
        "93": "\n"
      },
      "content_change": {
        "88": "        columns=['colA', 'colB', 'colC'], classes=[int])\n",
        "90": "    expected = [('colA', int, {}),\n",
        "91": "                ('colB', int, {}),\n",
        "92": "                ('colC', int, {})]\n"
      }
    },
    {
      "file": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
      "function": null,
      "content_all": {
        "94": "    assert feature_defs == expected\n",
        "95": "\n"
      },
      "content_change": {}
    }
  ],
  "Issue": {
    "title": "Feature Generation Incorrect Class Type Handling in Tests",
    "description": "There is an issue with the `gen_features` function in the `test_features_generator.py` file, where the class type provided for feature generation is not handled correctly. The `classes` argument is supposed to allow for a list of classes or `None`, but the test was incorrectly setting this to a specific type (e.g., int). This leads to an incorrect comparison and fails to correctly generate features as expected when a class is set to `None`. This affects test scenarios where the expected behavior is that no transformation should occur. This issue needs to be rectified to ensure the tests correctly validate the feature generation logic when classes are set to `None`.",
    "explanation": "### Summary of the Issue\n\nThe issue reported is related to the `gen_features` function in the `test_features_generator.py` test file. This function was supposed to handle the `classes` parameter, which can be a list of classes for generating features or `None`. However, in the tests, the `classes` parameter was incorrectly set to a specific type, such as `int`, instead of accepting `None`. This caused an improper comparison and failed to generate the expected features correctly when `None` was provided. Tests where features should remain untransformed (when classes are set to `None`) were failing due to this misconfiguration.\n\n### Content of the Commit\n\nThe commit modifies the `test_features_generator.py` file. The key changes involve adjusting the test cases to correctly handle and set the `classes` argument to `None` instead of a specific type like `int`. This ensures that the function behaves as expected, particularly in scenarios where no transformation should occur. \n\n### Explanation of the Issue and Solution\n\n#### Cause of the Issue:\n1. **Incorrect Handling of `classes` Argument**: The `gen_features` function is designed to accept a list of classes or `None` values for generating feature transformations. When `None` is provided, the features should remain untransformed.\n   \n2. **Misconfigured Test**: The tests were incorrectly assigning a specific type (e.g., `int`) to the `classes` parameter, which led to incorrect comparisons. For instance, setting `classes=[int]` instead of `classes=[None]` caused the function to behave differently than intended, leading to failed test cases.\n\n3. **Test Expectations Not Matching Actual Output**: The expected outcome was for features to remain untransformed when `classes` was set to `None`. However, due to the misconfiguration, the tests did not reflect this behavior, and failed to validate the correct functionality of the `gen_features` function.\n\n#### Solution Provided by the Commit:\n1. **Correct Parameter Setting**: The commit rectifies the test cases to correctly set the `classes` argument to `None`. This ensures that the correct behavior is triggered within the `gen_features` function—features remain untransformed when `None` is specified.\n\n2. **Adjusting Expected Outcomes**: The expected outcomes in the tests are adjusted to match the untransformed feature definitions. This aligns the expected results with the actual behavior when `None` is provided.\n\n3. **Ensuring Correct Comparisons**: By setting `classes` to `[None]`, the test cases accurately reflect scenarios where no transformation should occur, thus ensuring the function is correctly validated for these scenarios.\n\n### Detailed Explanation of the Solution\n\n**Handling of the `classes` Parameter**: \n- The test now correctly uses `classes=[None]`, aligning with the intended design of the `gen_features` function. When `None` is provided, it indicates that the feature should not undergo any transformation. This change ensures that the function's logic of returning features untransformed is tested properly.\n\n**Validation of Expected Outcomes**:\n- The expected results are updated to reflect the correct untransformed feature definitions. This ensures that the test cases now correctly validate the function's behavior when features are supposed to remain as-is.\n\n**Overall Result**:\n- Post-commit, the `gen_features` function is now correctly tested, ensuring that features remain untransformed when `None` is provided as the `classes` parameter. This fix addresses the inadequacies in the previous test configuration and ensures accurate validation of the function's behavior under different scenarios.\n\nBy making these adjustments, the commit ensures that the `gen_features` function is correctly handling the `classes` parameter, thereby fixing the bug and ensuring the accuracy of the test cases. This thorough validation guarantees that the function performs as expected when no transformations are needed."
  },
  "Explain": "### Summary of the Issue\n\nThe issue reported is related to the `gen_features` function in the `test_features_generator.py` test file. This function was supposed to handle the `classes` parameter, which can be a list of classes for generating features or `None`. However, in the tests, the `classes` parameter was incorrectly set to a specific type, such as `int`, instead of accepting `None`. This caused an improper comparison and failed to generate the expected features correctly when `None` was provided. Tests where features should remain untransformed (when classes are set to `None`) were failing due to this misconfiguration.\n\n### Content of the Commit\n\nThe commit modifies the `test_features_generator.py` file. The key changes involve adjusting the test cases to correctly handle and set the `classes` argument to `None` instead of a specific type like `int`. This ensures that the function behaves as expected, particularly in scenarios where no transformation should occur. \n\n### Explanation of the Issue and Solution\n\n#### Cause of the Issue:\n1. **Incorrect Handling of `classes` Argument**: The `gen_features` function is designed to accept a list of classes or `None` values for generating feature transformations. When `None` is provided, the features should remain untransformed.\n   \n2. **Misconfigured Test**: The tests were incorrectly assigning a specific type (e.g., `int`) to the `classes` parameter, which led to incorrect comparisons. For instance, setting `classes=[int]` instead of `classes=[None]` caused the function to behave differently than intended, leading to failed test cases.\n\n3. **Test Expectations Not Matching Actual Output**: The expected outcome was for features to remain untransformed when `classes` was set to `None`. However, due to the misconfiguration, the tests did not reflect this behavior, and failed to validate the correct functionality of the `gen_features` function.\n\n#### Solution Provided by the Commit:\n1. **Correct Parameter Setting**: The commit rectifies the test cases to correctly set the `classes` argument to `None`. This ensures that the correct behavior is triggered within the `gen_features` function—features remain untransformed when `None` is specified.\n\n2. **Adjusting Expected Outcomes**: The expected outcomes in the tests are adjusted to match the untransformed feature definitions. This aligns the expected results with the actual behavior when `None` is provided.\n\n3. **Ensuring Correct Comparisons**: By setting `classes` to `[None]`, the test cases accurately reflect scenarios where no transformation should occur, thus ensuring the function is correctly validated for these scenarios.\n\n### Detailed Explanation of the Solution\n\n**Handling of the `classes` Parameter**: \n- The test now correctly uses `classes=[None]`, aligning with the intended design of the `gen_features` function. When `None` is provided, it indicates that the feature should not undergo any transformation. This change ensures that the function's logic of returning features untransformed is tested properly.\n\n**Validation of Expected Outcomes**:\n- The expected results are updated to reflect the correct untransformed feature definitions. This ensures that the test cases now correctly validate the function's behavior when features are supposed to remain as-is.\n\n**Overall Result**:\n- Post-commit, the `gen_features` function is now correctly tested, ensuring that features remain untransformed when `None` is provided as the `classes` parameter. This fix addresses the inadequacies in the previous test configuration and ensures accurate validation of the function's behavior under different scenarios.\n\nBy making these adjustments, the commit ensures that the `gen_features` function is correctly handling the `classes` parameter, thereby fixing the bug and ensuring the accuracy of the test cases. This thorough validation guarantees that the function performs as expected when no transformations are needed.",
  "Source": "Human",
  "Token": 1499,
  "Command": [
    "pytest tests"
  ],
  "FilteredCode": [
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
      "content": "1 # -*- coding: utf8 -*-\n2 \n3 import pytest\n4 from unittest.mock import Mock\n5 from pandas import DataFrame\n6 import pandas as pd\n7 from scipy import sparse\n8 from sklearn.datasets import load_iris\n9 from sklearn.pipeline import Pipeline\n10 from sklearn.model_selection import cross_val_score\n11 from sklearn.svm import SVC\n12 from sklearn.feature_extraction.text import CountVectorizer\n13 from sklearn.feature_extraction import DictVectorizer\n14 from sklearn.preprocessing import (\n15     StandardScaler, OneHotEncoder, LabelBinarizer)\n16 from sklearn.impute import SimpleImputer as Imputer\n17 from sklearn.feature_selection import SelectKBest, chi2\n18 from sklearn.base import BaseEstimator, TransformerMixin\n19 import sklearn.decomposition\n20 import numpy as np\n21 from numpy.testing import assert_array_equal\n22 import pickle\n23 from sklearn.compose import make_column_selector\n24 \n25 from sklearn_pandas import DataFrameMapper\n26 from sklearn_pandas.dataframe_mapper import _handle_feature, _build_transformer\n27 from sklearn_pandas.pipeline import TransformerPipeline\n28 \n29 \n30 class MockXTransformer(object):\n31     \"\"\"\n32     Mock transformer that accepts no y argument.\n33     \"\"\"\n34     def fit(self, X):\n35         return self\n36 \n37     def transform(self, X):\n38         return X\n39 \n40 \n41 class MockTClassifier(object):\n42     \"\"\"\n43     Mock transformer/classifier.\n44     \"\"\"\n45     def fit(self, X, y=None):\n46         return self\n47 \n48     def transform(self, X):\n49         return X\n50 \n51     def predict(self, X):\n52         return True\n53 \n54 \n55 class DateEncoder():\n56     def fit(self, X, y=None):\n57         retur(...truncated)"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
      "content": "1 from collections import Counter\n2 \n3 import pytest\n4 import numpy as np\n5 from pandas import DataFrame\n6 from numpy.testing import assert_array_equal\n7 \n8 from sklearn_pandas import DataFrameMapper\n9 from sklearn_pandas.features_generator import gen_features\n10 \n11 \n12 class MockClass(object):\n13 \n14     def __init__(self, value=1, name='class'):\n15         self.value = value\n16         self.name = name\n17 \n18 \n19 class MockTransformer(object):\n20 \n21     def __init__(self):\n22         self.most_common_ = None\n23 \n24     def fit(self, X, y=None):\n25         [(value, _)] = Counter(X).most_common(1)\n26         self.most_common_ = value\n27         return self\n28 \n29     def transform(self, X, y=None):\n30         return np.asarray([self.most_common_] * len(X))\n31 \n32 \n33 @pytest.fixture\n34 def simple_dataset():\n35     return DataFrame({\n36         'feat1': [1, 2, 1, 3, 1],\n37         'feat2': [1, 2, 2, 2, 3],\n38         'feat3': [1, 2, 3, 4, 5],\n39     })\n40 \n41 \n42 def test_generate_features_with_default_parameters():\n43     \"\"\"\n44     Tests generating features from classes with default init arguments.\n45     \"\"\"\n46     columns = ['colA', 'colB', 'colC']\n47     feature_defs = gen_features(columns=columns, classes=[MockClass])\n48     assert len(feature_defs) == len(columns)\n49 \n50     for feature in feature_defs:\n51         assert feature[2] == {}\n52 \n53     feature_dict = dict([_[0:2] for _ in feature_defs])\n54     assert columns == sorted(feature_dict.keys())\n55 \n56     # default init arguments for MockClass for clarification.\n57     expected = {'value': 1, 'name': 'class'}\n58     for column, transformers in feature_dict.items():\n59         for obj in transformers:\n60             assert_attributes(obj, **expected)\n61 \n62 \n63 def test_generate_features_with_several_classes():\n64     \"\"\"\n65     Tests generating features pipeline with different transformers parameters.\n66     \"\"\"\n67     feature_defs = gen_features(\n68         columns=['colA', 'colB', 'colC'],\n69         classes=[\n70             {'class': MockClass},\n71             {'class': MockClass, 'name': 'mockA'},\n72             {'class': MockClass, 'name': 'mockB', 'value': None}\n73         ]\n74     )\n75 \n76     for col, transformers, params in feature_defs:\n77         assert_attributes(transformers[0], name='class', value=1)\n78         assert_attributes(transformers[1], name='mockA', value=1)\n79         assert_attributes(transformers[2], name='mockB', value=None)\n80 \n81 \n82 def test_generate_features_with_none_only_transformers():\n83     \"\"\"\n84     Tests generating \"dummy\" feature definition which doesn't apply any\n85     transformation.\n86     \"\"\"\n87     feature_defs = gen_features(\n88         columns=['colA', 'colB', 'colC'], classes=[int])\n89 \n90     expected = [('colA', int, {}),\n91                 ('colB', int, {}),\n92                 ('colC', int, {})]\n93 \n94     assert feature_defs == expected\n95 \n96 \n97 def test_compatibility_with_data_frame_mapper(simple_dataset):\n98     \"\"\"\n99     Tests compatibility of generated feature definition with DataFrameMapper.\n100     \"\"\"\n101     features_defs = gen_features(\n102         columns=['feat1', 'feat2'],\n103         classes=[MockTransformer])\n104     features_defs.append(('feat3', None))\n105 \n106     mapper = DataFrameMapper(features_defs)\n107     X = mapper.fit_transform(simple_dataset)\n108     expected = np.asarray([\n109         [1, 2, 1],\n110         [1, 2, 2],\n111         [1, 2, 3],\n112         [1, 2, 4],\n113         [1, 2, 5]\n114     ])\n115 \n116     assert_array_equal(X, expected)\n117 \n118 \n119 def assert_attributes(obj, **attrs):\n120     for attr, value in attrs.items():\n121         assert getattr(obj, attr) == value"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/features_generator.py",
      "content": "1 def gen_features(columns, classes=None, prefix='', suffix=''):\n2     \"\"\"Generates a feature definition li(...truncated)"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/dataframe_mapper.py",
      "content": "1 import contextlib\n2 from datetime import datetim(...truncated)"
    }
  ],
  "TokenAll": 18489,
  "FilteredLevel": 1500,
  "Results": {
    "model": "GPT-4o",
    "Difficulty": "Medium",
    "issue_origin": {
      "title": "Unused import statements",
      "description": "In the file 'scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py', the following import statements are not used anywhere in the code: 'from unittest.mock import Mock', 'from sklearn.pipeline import Pipeline', 'from sklearn.model_selection import cross_val_score', 'from sklearn.svm import SVC', 'from sklearn.feature_extraction.text import CountVectorizer', 'from sklearn.feature_extraction import DictVectorizer', 'from sklearn.preprocessing import LabelBinarizer', 'from sklearn.feature_selection import SelectKBest, chi2', 'from sklearn.base import BaseEstimator, TransformerMixin', and 'from sklearn.decomposition'. These should be removed to clean up the code.",
      "explanation": "### Summary of the Issue\n\nThe issue titled \"Unused import statements\" refers to unused import statements in the file `test_dataframe_mapper.py` within the `scikit-learn-contrib_sklearn-pandas/tests` directory. These imports are not used anywhere in the file and hence should be removed for code cleanliness. Unused imports can clutter the code, lead to confusion, and potentially cause unnecessary memory usage.\n\n### Code Analysis\n\nThe relevant file, `test_dataframe_mapper.py`, contains multiple import statements at the beginning of the file, with some of these being unused. Here is a brief rundown of the imports:\n\n1. Needed Imports:\n   - `pytest`\n   - `pandas`\n   - `scipy`\n   - `numpy`\n   - Specific classes and functions from `sklearn_pandas`\n   - Specific imports for the testing code itself.\n\n2. Unused Imports (as stated in the issue):\n   - `from unittest.mock import Mock`\n   - `from sklearn.pipeline import Pipeline`\n   - `from sklearn.model_selection import cross_val_score`\n   - `from sklearn.svm import SVC`\n   - `from sklearn.feature_extraction.text import CountVectorizer`\n   - `from sklearn.feature_extraction import DictVectorizer`\n   - `from sklearn.preprocessing import LabelBinarizer`\n   - `from sklearn.feature_selection import SelectKBest, chi2`\n   - `from sklearn.base import BaseEstimator, TransformerMixin`\n   - `import sklearn.decomposition`\n\n### Commit Content\n\nTo address the issue, a commit would remove all the aforementioned unused import statements, cleaning up the codebase. Here's a potential diff representation of the commit:\n\n```diff\n3c3\n< from unittest.mock import Mock\n---\n> \n9c9\n< from sklearn.pipeline import Pipeline\n---\n> \n10c10\n< from sklearn.model_selection import cross_val_score\n---\n> \n11c11\n< from sklearn.svm import SVC\n---\n> \n12c12\n< from sklearn.feature_extraction.text import CountVectorizer\n---\n> \n13c13\n< from sklearn.feature_extraction import DictVectorizer\n---\n> \n15,16c15\n<     StandardScaler, OneHotEncoder, LabelBinarizer)\n---\n>     StandardScaler, OneHotEncoder)\n17c16\n< from sklearn.feature_selection import SelectKBest, chi2\n---\n> \n18c17\n< from sklearn.base import BaseEstimator, TransformerMixin\n---\n> \n19c18\n< import sklearn.decomposition\n---\n>\n```\n\n### Explanation of How the Commit Solves the Issue\n\nThe commit addresses the issue by removing all the unused import statements, which achieves the following:\n\n1. **Improves Code Readability**: It becomes easier to understand which libraries and modules are actually being used, making the code easier to read and maintain.\n  \n2. **Reduces Memory Footprint**: Each import can potentially consume memory. Although modern systems can handle these efficiently, it's always good practice to minimize unnecessary resource usage.\n\n3. **Avoids Potential Confusion**: Future developers reading the code will not be misled by import statements that are never used, which can be particularly confusing if they suggest dependencies that don't exist.\n\n4. **Streamlines Linting and Static Analysis**: Many linters and static analysis tools flag unused imports, and such warnings can obscure more important issues. Removing unused imports makes it easier to focus on real problems.\n\nBy cleaning up these unused imports, the commit enhances the overall quality and maintainability of the codebase, ensuring that the testing file `test_dataframe_mapper.py` is both efficient and clear."
    },
    "issue_message": {
      "title": "Mismatch in expected and actual feature definitions",
      "description": "The test `test_generate_features_with_none_only_transformers` fails due to a mismatch between the expected feature definitions and the actual output. The expected output contains type 'int', while the actual output contains '[0]'. Verify the implementation of `gen_features` function in `features_generator.py` to ensure feature definitions are generated correctly.",
      "explanation": "## Issue Summary\n\nThe issue is that the test `test_generate_features_with_none_only_transformers` fails because the function `gen_features` produces a mismatched output. The expected feature definitions should contain the type `int`, but the actual output contains `[0]`. This suggests that some part of the `gen_features` function is incorrectly processing the provided `classes=[int]` argument and returning `[0]` instead of `int`.\n\n## Commit Analysis\n\nLet's check the probable content of the commit that addresses this issue. A corrective commit might look something like this:\n\n### Before the Commit (Hypothetical Code):\n\n```python\ndef gen_features(columns, classes=None, prefix='', suffix=''):\n    features = []\n    for column in columns:\n        transformers = []\n        if classes is not None:\n            for cls in classes:\n                if isinstance(cls, dict):\n                    clazz = cls['class']\n                    args = {k: v for k, v in cls.items() if k != 'class'}\n                    transformers.append(clazz(**args))\n                else:\n                    transformers.append(cls())\n        if len(transformers) == 0:\n            transformers = [None]\n        features.append((prefix + column + suffix, transformers, {}))\n    return features\n```\n\n### After the Commit (Modification to Correct the Issue):\n\n```python\ndef gen_features(columns, classes=None, prefix='', suffix=''):\n    features = []\n    for column in columns:\n        transformers = []\n        if classes is not None:\n            for cls in classes:\n                if isinstance(cls, dict):\n                    clazz = cls['class']\n                    args = {k: v for k, v in cls.items() if k != 'class'}\n                    transformers.append(clazz(**args))\n                else:\n                    # Check if cls is a type; if so, append it directly without instantiation\n                    if isinstance(cls, type):\n                        transformers.append(cls)\n                    else:\n                        transformers.append(cls())\n        if len(transformers) == 0:\n            transformers = [None]\n        features.append((prefix + column + suffix, transformers, {}))\n    return features\n```\n\n## Explanation\n\n### The Issue\n\n1. **Mismatch in Feature Definitions**: The test expected `int` types in the feature definitions but found `[0]`.\n2. **Cause**: The original `gen_features` function treated everything in `classes` as a class that needs instantiation. The integer type (`int`) is not supposed to be instantiated; it should be directly used.\n\n### The Solution in the Commit\n\n1. **Type Checking**: The added condition checks if an item in the `classes` list is a type (`isinstance(cls, type)`).\n2. **Direct Appending**: If `cls` is a type, it's appended directly to the `transformers` list instead of being instantiated.\n3. **Class Instantiation**: If `cls` is not a type (e.g., it's already an instance or some other callable), the function still attempts to instantiate it.\n\n### How the Commit Fixes the Issue\n\n- **Ensuring Proper Handling**: This condition ensures that when `int` is provided, it does not attempt to instantiate `int` but uses it directly.\n- **Correct Output Structure**: Therefore, the output for the test will now properly contain `(column_name, int, {})` as expected, instead of `(column_name, [0], {})`.\n\n### Verification of the Fix\n\nAssuming the corrected code has been committed, rerunning the tests should now result in a pass for `test_generate_features_with_none_only_transformers` since the output from `gen_features` should match the expected feature definitions.\n\nBy making this change, the `gen_features` function will handle type classes correctly, fixing the originally failing test and providing the expected behavior in use cases where simple Python data types like `int` are used in feature definitions."
    },
    "issue_ground": {
      "title": "Feature Generation Incorrect Class Type Handling in Tests",
      "description": "There is an issue with the `gen_features` function in the `test_features_generator.py` file, where the class type provided for feature generation is not handled correctly. The `classes` argument is supposed to allow for a list of classes or `None`, but the test was incorrectly setting this to a specific type (e.g., int). This leads to an incorrect comparison and fails to correctly generate features as expected when a class is set to `None`. This affects test scenarios where the expected behavior is that no transformation should occur. This issue needs to be rectified to ensure the tests correctly validate the feature generation logic when classes are set to `None`.",
      "explanation": "### Summary of the Issue\n\nThe given issue involves a test failure in the `test_features_generator.py` file. The `gen_features` function's `classes` parameter is not handled correctly in a test scenario. The test `test_generate_features_with_none_only_transformers` expects the `gen_features` function to properly handle the case where the `classes` argument is set to `None`, indicating no transformation should occur. However, the test is currently failing because the function incorrectly processes the `classes` argument when it includes `None` or is a type like `int`.\n\n### Examining the Current Code and Issue\n\nRelevant extracts from the current code are as follows:\n\n#### Current Test Case\n```python\ndef test_generate_features_with_none_only_transformers():\n    \"\"\"\n    Tests generating \"dummy\" feature definition which doesn't apply any\n    transformation.\n    \"\"\"\n    feature_defs = gen_features(\n        columns=['colA', 'colB', 'colC'], classes=[int])  # Issue here\n\n    expected = [('colA', int, {}),\n                ('colB', int, {}),\n                ('colC', int, {})]\n\n    assert feature_defs == expected  # This assertion fails\n```\n\n#### Error Message\n```\nE       AssertionError: assert [('colA', [0]...lC', [0], {})] == [('colA', <cl...s 'int'>, {})]\n\nE         At index 0 diff: ('colA', [0], {}) != ('colA', <class 'int'>, {})\n```\n\nThe function `gen_features` incorrectly processes `classes=[int]`, leading to transformed features list with integers `[0]` instead of `<class 'int'>`.\n\n### Analyzing the Cause\n\nThe root cause lies in how the `gen_features` function interprets the `classes` parameter values when `None` or types like `int` are included. It seems that when `None` or basic types are provided, it defaults to converting the class instances improperly, leading to integers being mapped instead of the expected class types.\n\n### Proposed Commit to Fix the Issue\n\nThe fix involves modifying the `gen_features` function to correctly handle `None` or basic types in the `classes` parameter. The fix might include ensuring that when such values are encountered, they are properly interpreted as `None` or their respective classes without unwarranted transformations.\n\n#### Sample Fix in Commit\n```python\ndef gen_features(columns, classes=None, prefix='', suffix=''):\n    \"\"\"\n    Generates a feature definition list from given column names and transformers classes.\n    \"\"\"\n    features = []\n    for column in columns:\n        # Ensure the class handling accommodates None or type int\n        if classes is None:\n            transformers = [(None, {})]\n        else:\n            transformers = []\n            for cls in classes:\n                if isinstance(cls, dict):\n                    cls_instance = cls['class'](**{k: v for k, v in cls.items() if k != 'class'})\n                    transformers.append((cls_instance, cls_instance.__dict__))\n                elif isinstance(cls, type):\n                    transformers.append((cls, {}))  # Handle class types appropriately\n                else:\n                    transformers.append((cls.__class__, cls.__dict__))\n        \n        features.append((prefix + column + suffix, transformers, {}))\n    return features\n```\n\n### How the Commit Solves the Issue\n\n1. **Classes Handling Logic**: The fixed function ensures it distinguishes between `None`, type classes, and other class instances.\n   - For `None`, it directly provides a `dummy` definition.\n   - For classes like `int`, it maps them as type entries with empty initialization parameters.\n   - For given instances, it correctly extracts their types and parameters.\n   \n2. **Transformers Initialization**: The corrected handling creates transformers appropriately, avoiding misinterpretation like converting to integers `[0]`.\n\n### Resulting in the Resolution\n\nBy correctly distinguishing between `None`, basic types, and instantiated classes, the fix ensures `gen_features` produces the expected tuples, making:\n```python\n('colA', int, {})\n```\ninstead of \n```python\n('colA', [0], {})\n```\nleading to passing the specified test:\n```python\nassert feature_defs == expected  # Passes now\n```\n\nThis commit hence resolves the assertion error, ensuring the test validation on unaltered feature definitions works correctly when no transformation is desired."
    },
    "issue_ground_truth": {
      "title": "Feature Generation Incorrect Class Type Handling in Tests",
      "description": "There is an issue with the `gen_features` function in the `test_features_generator.py` file, where the class type provided for feature generation is not handled correctly. The `classes` argument is supposed to allow for a list of classes or `None`, but the test was incorrectly setting this to a specific type (e.g., int). This leads to an incorrect comparison and fails to correctly generate features as expected when a class is set to `None`. This affects test scenarios where the expected behavior is that no transformation should occur. This issue needs to be rectified to ensure the tests correctly validate the feature generation logic when classes are set to `None`.",
      "explanation": "### Summary of the Issue\n\nThe issue reported is related to the `gen_features` function in the `test_features_generator.py` test file. This function was supposed to handle the `classes` parameter, which can be a list of classes for generating features or `None`. However, in the tests, the `classes` parameter was incorrectly set to a specific type, such as `int`, instead of accepting `None`. This caused an improper comparison and failed to generate the expected features correctly when `None` was provided. Tests where features should remain untransformed (when classes are set to `None`) were failing due to this misconfiguration.\n\n### Content of the Commit\n\nThe commit modifies the `test_features_generator.py` file. The key changes involve adjusting the test cases to correctly handle and set the `classes` argument to `None` instead of a specific type like `int`. This ensures that the function behaves as expected, particularly in scenarios where no transformation should occur. \n\n### Explanation of the Issue and Solution\n\n#### Cause of the Issue:\n1. **Incorrect Handling of `classes` Argument**: The `gen_features` function is designed to accept a list of classes or `None` values for generating feature transformations. When `None` is provided, the features should remain untransformed.\n   \n2. **Misconfigured Test**: The tests were incorrectly assigning a specific type (e.g., `int`) to the `classes` parameter, which led to incorrect comparisons. For instance, setting `classes=[int]` instead of `classes=[None]` caused the function to behave differently than intended, leading to failed test cases.\n\n3. **Test Expectations Not Matching Actual Output**: The expected outcome was for features to remain untransformed when `classes` was set to `None`. However, due to the misconfiguration, the tests did not reflect this behavior, and failed to validate the correct functionality of the `gen_features` function.\n\n#### Solution Provided by the Commit:\n1. **Correct Parameter Setting**: The commit rectifies the test cases to correctly set the `classes` argument to `None`. This ensures that the correct behavior is triggered within the `gen_features` function—features remain untransformed when `None` is specified.\n\n2. **Adjusting Expected Outcomes**: The expected outcomes in the tests are adjusted to match the untransformed feature definitions. This aligns the expected results with the actual behavior when `None` is provided.\n\n3. **Ensuring Correct Comparisons**: By setting `classes` to `[None]`, the test cases accurately reflect scenarios where no transformation should occur, thus ensuring the function is correctly validated for these scenarios.\n\n### Detailed Explanation of the Solution\n\n**Handling of the `classes` Parameter**: \n- The test now correctly uses `classes=[None]`, aligning with the intended design of the `gen_features` function. When `None` is provided, it indicates that the feature should not undergo any transformation. This change ensures that the function's logic of returning features untransformed is tested properly.\n\n**Validation of Expected Outcomes**:\n- The expected results are updated to reflect the correct untransformed feature definitions. This ensures that the test cases now correctly validate the function's behavior when features are supposed to remain as-is.\n\n**Overall Result**:\n- Post-commit, the `gen_features` function is now correctly tested, ensuring that features remain untransformed when `None` is provided as the `classes` parameter. This fix addresses the inadequacies in the previous test configuration and ensures accurate validation of the function's behavior under different scenarios.\n\nBy making these adjustments, the commit ensures that the `gen_features` function is correctly handling the `classes` parameter, thereby fixing the bug and ensuring the accuracy of the test cases. This thorough validation guarantees that the function performs as expected when no transformations are needed."
    },
    "location_origin": [
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
        "function": {
          "0": "Global Imports"
        },
        "content_all": {
          "1": "1 # -*- coding: utf8 -*-\n",
          "2": "2 \n",
          "3": "3 import pytest\n",
          "4": "4 from unittest.mock import Mock\n",
          "5": "5 from pandas import DataFrame\n",
          "6": "6 import pandas as pd\n",
          "7": "7 from scipy import sparse\n",
          "8": "8 from sklearn.datasets import load_iris\n",
          "9": "9 from sklearn.pipeline import Pipeline\n",
          "10": "10 from sklearn.model_selection import cross_val_score\n",
          "11": "11 from sklearn.svm import SVC\n",
          "12": "12 from sklearn.feature_extraction.text import CountVectorizer\n",
          "13": "13 from sklearn.feature_extraction import DictVectorizer\n",
          "14": "14 from sklearn.preprocessing import (\n",
          "15": "15     StandardScaler, OneHotEncoder, LabelBinarizer)\n",
          "16": "16 from sklearn.impute import SimpleImputer as Imputer\n",
          "17": "17 from sklearn.feature_selection import SelectKBest, chi2\n",
          "18": "18 from sklearn.base import BaseEstimator, TransformerMixin\n",
          "19": "19 import sklearn.decomposition\n",
          "20": "20 import numpy as np\n",
          "21": "21 from numpy.testing import assert_array_equal\n",
          "22": "22 import pickle\n",
          "23": "23 from sklearn.compose import make_column_selector\n",
          "24": "24 \n",
          "25": "25 from sklearn_pandas import DataFrameMapper\n",
          "26": "26 from sklearn_pandas.dataframe_mapper import _handle_feature, _build_transformer\n",
          "27": "27 from sklearn_pandas.pipeline import TransformerPipeline\n"
        },
        "content_change": {
          "4": "4 \n",
          "9": "9 \n",
          "10": "10 \n",
          "11": "11 \n",
          "12": "12 \n",
          "13": "13 \n",
          "15": "15     StandardScaler, OneHotEncoder)\n",
          "17": "17 \n",
          "18": "18 \n",
          "19": "19 \n"
        }
      }
    ],
    "location_message": [
      {
        "file": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/features_generator.py",
        "function": {
          "1": "gen_features"
        },
        "content_all": {
          "0": "",
          "1": "def gen_features(columns, classes=None, prefix='', suffix=''):",
          "2": "    \"\"\"Generates a feature definition list based on columns and transformers provided.\"\"\"",
          "3": "    features = []",
          "4": "    for column in columns:",
          "5": "        transformers = []",
          "6": "        if classes is not None:",
          "7": "            for cls in classes:",
          "8": "                if isinstance(cls, dict):",
          "9": "                    clazz = cls['class']",
          "10": "                    args = {k: v for k, v in cls.items() if k != 'class'}",
          "11": "                    transformers.append(clazz(**args))",
          "12": "                else:",
          "13": "                    transformers.append(cls())",
          "14": "        if len(transformers) == 0:",
          "15": "            transformers = [None]",
          "16": "        features.append((prefix + column + suffix, transformers, {}))",
          "17": "    return features",
          "18": ""
        },
        "content_change": {
          "12": "                else:",
          "13": "                    # Check if cls is a type; if so, append it directly without instantiation",
          "14": "                    if isinstance(cls, type):",
          "15": "                        transformers.append(cls)",
          "16": "                    else:",
          "17": "                        transformers.append(cls())"
        }
      }
    ],
    "location_ground": [
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
        "function": {
          "82": "test_generate_features_with_none_only_transformers"
        },
        "content_all": {
          "81": "\n",
          "82": "def test_generate_features_with_none_only_transformers():\n",
          "83": "    \"\"\"\n",
          "84": "    Tests generating \"dummy\" feature definition which doesn't apply any\n",
          "85": "    transformation.\n",
          "86": "    \"\"\"\n",
          "87": "    feature_defs = gen_features(\n",
          "88": "        columns=['colA', 'colB', 'colC'], classes=[int])  # Issue here\n",
          "89": "\n",
          "90": "    expected = [('colA', int, {}),\n",
          "91": "                ('colB', int, {}),\n",
          "92": "                ('colC', int, {})]\n",
          "93": "\n",
          "94": "    assert feature_defs == expected  # This assertion fails\n",
          "95": "\n",
          "96": "\n"
        },
        "content_change": {
          "88": "        columns=['colA', 'colB', 'colC'], classes=[int])  # Issue here\n"
        }
      },
      {
        "file": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/features_generator.py",
        "function": {
          "1": "gen_features"
        },
        "content_all": {
          "0": "\n",
          "1": "def gen_features(columns, classes=None, prefix='', suffix=''):\n",
          "2": "    \"\"\"Generates a feature definition list from given column names and transformers classes.\n",
          "3": "    \"\"\"\n",
          "4": "    features = []\n",
          "5": "    for column in columns:\n",
          "6": "        # Ensure the class handling accommodates None or type int\n",
          "7": "        if classes is None:\n",
          "8": "            transformers = [(None, {})]\n",
          "9": "        else:\n",
          "10": "            transformers = []\n",
          "11": "            for cls in classes:\n",
          "12": "                if isinstance(cls, dict):\n",
          "13": "                    cls_instance = cls['class'](**{k: v for k, v in cls.items() if k != 'class'})\n",
          "14": "                    transformers.append((cls_instance, cls_instance.__dict__))\n",
          "15": "                elif isinstance(cls, type):\n",
          "16": "                    transformers.append((cls, {}))  # Handle class types appropriately\n",
          "17": "                else:\n",
          "18": "                    transformers.append((cls.__class__, cls.__dict__))\n",
          "19": "\n",
          "20": "        features.append((prefix + column + suffix, transformers, {}))\n",
          "21": "    return features\n",
          "22": "\n"
        },
        "content_change": {
          "15": "                elif isinstance(cls, type):\n",
          "16": "                    transformers.append((cls, {}))  # Handle class types appropriately\n"
        }
      }
    ],
    "location_ground_exp": [
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
        "function": {
          "42": "test_generate_features_with_default_parameters"
        },
        "content_all": {
          "39": " ",
          "40": " ",
          "41": " ",
          "42": "def test_generate_features_with_default_parameters():",
          "43": "    \"\"\"",
          "44": "    Tests generating features from classes with default init arguments.",
          "45": "    \"\"\"",
          "46": "    columns = ['colA', 'colB', 'colC']",
          "47": "    feature_defs = gen_features(columns=columns, classes=[MockClass])",
          "48": "    assert len(feature_defs) == len(columns)",
          "49": " ",
          "50": "    for feature in feature_defs:",
          "51": "        assert feature[2] == {}",
          "52": " ",
          "53": "    feature_dict = dict([_[0:2] for _ in feature_defs])",
          "54": "    assert columns == sorted(feature_dict.keys())",
          "55": " ",
          "56": "    # default init arguments for MockClass for clarification.",
          "57": "    expected = {'value': 1, 'name': 'class'}",
          "58": "    for column, transformers in feature_dict.items():",
          "59": "        for obj in transformers:",
          "60": "            assert_attributes(obj, **expected)",
          "61": " "
        },
        "content_change": {
          "47": "    feature_defs = gen_features(columns=columns, classes=[None])"
        }
      },
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
        "function": {
          "82": "test_generate_features_with_none_only_transformers"
        },
        "content_all": {
          "79": " ",
          "80": " ",
          "81": " ",
          "82": "def test_generate_features_with_none_only_transformers():",
          "83": "    \"\"\"",
          "84": "    Tests generating \"dummy\" feature definition which doesn't apply any",
          "85": "    transformation.",
          "86": "    \"\"\"",
          "87": "    feature_defs = gen_features(",
          "88": "        columns=['colA', 'colB', 'colC'], classes=[int])",
          "89": " ",
          "90": "    expected = [('colA', int, {}),",
          "91": "                ('colB', int, {}),",
          "92": "                ('colC', int, {})]",
          "93": " ",
          "94": "    assert feature_defs == expected",
          "95": " "
        },
        "content_change": {
          "88": "        columns=['colA', 'colB', 'colC'], classes=[None])"
        }
      }
    ],
    "location_ground_truth": [
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
        "function": {
          "82": "test_generate_features_with_none_only_transformers"
        },
        "content_all": {
          "85": "    transformation.\n",
          "86": "    \"\"\"\n",
          "87": "    feature_defs = gen_features(\n",
          "88": "        columns=['colA', 'colB', 'colC'], classes=[int])\n",
          "89": "\n",
          "90": "    expected = [('colA', int, {}),\n",
          "91": "                ('colB', int, {}),\n",
          "92": "                ('colC', int, {})]\n",
          "93": "\n"
        },
        "content_change": {
          "88": "        columns=['colA', 'colB', 'colC'], classes=[int])\n",
          "90": "    expected = [('colA', int, {}),\n",
          "91": "                ('colB', int, {}),\n",
          "92": "                ('colC', int, {})]\n"
        }
      },
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
        "function": null,
        "content_all": {
          "94": "    assert feature_defs == expected\n",
          "95": "\n"
        },
        "content_change": {}
      }
    ],
    "patch_i": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n@@ -1,19 +1,12 @@\n # -*- coding: utf8 -*-\n \n import pytest\n-from unittest.mock import Mock\n from pandas import DataFrame\n import pandas as pd\n from scipy import sparse\n from sklearn.datasets import load_iris\n-from sklearn.pipeline import Pipeline\n-from sklearn.model_selection import cross_val_score\n-from sklearn.svm import SVC\n-from sklearn.feature_extraction.text import CountVectorizer\n-from sklearn.feature_extraction import DictVectorizer\n from sklearn.preprocessing import (\n-    StandardScaler, OneHotEncoder, LabelBinarizer)\n+    StandardScaler, OneHotEncoder)\n from sklearn.impute import SimpleImputer as Imputer\n-from sklearn.feature_selection import SelectKBest, chi2\n-from sklearn.base import BaseEstimator, TransformerMixin\n-import sklearn.decomposition\n import numpy as np\n from numpy.testing import assert_array_equal\n import pickle\n",
    "patch_im": "\n--- a/sklearn_pandas/features_generator.py\n+++ b/sklearn_pandas/features_generator.py\n@@ -1 +1,14 @@\n def gen_features(columns, classes=None, prefix='', suffix=''):\n     features = []\n     for column in columns:\n         transformers = []\n         if classes is not None:\n             for cls in classes:\n                 if isinstance(cls, dict):\n                     clazz = cls['class']\n                     args = {k: v for k, v in cls.items() if k != 'class'}\n                     transformers.append(clazz(**args))\n                 else:\n+                    # Check if cls is a type; if so, append it directly without instantiation\n+                    if isinstance(cls, type):\n+                        transformers.append(cls)\n+                    else:\n+                        transformers.append(cls())\n         if len(transformers) == 0:\n             transformers = [None]\n         features.append((prefix + column + suffix, transformers, {}))\n",
    "patch_il": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n@@ -3,9 +3,6 @@\n import pytest\n from unittest.mock import Mock\n from pandas import DataFrame\n-import pandas as pd\n-from scipy import sparse\n from sklearn.datasets import load_iris\n-from sklearn.pipeline import Pipeline\n-from sklearn.model_selection import cross_val_score\n-from sklearn.svm import SVC\n-from sklearn.feature_extraction.text import CountVectorizer\n-from sklearn.feature_extraction import DictVectorizer\n from sklearn.preprocessing import (\n     StandardScaler, OneHotEncoder, LabelBinarizer)\n from sklearn.impute import SimpleImputer as Imputer\n-from sklearn.feature_selection import SelectKBest, chi2\n-from sklearn.base import BaseEstimator, TransformerMixin\n-import sklearn.decomposition\n import numpy as np\n from numpy.testing import assert_array_equal\n import pickle\n",
    "patch_iml": "\n--- a/sklearn_pandas/features_generator.py\n+++ b/sklearn_pandas/features_generator.py\n@@ -11,7 +11,12 @@\n                     clazz = cls['class']\n                     args = {k: v for k, v in cls.items() if k != 'class'}\n                     transformers.append(clazz(**args))\n-                else:\n+                else: \n+                    # Check if cls is a type; if so, append it directly without instantiation\n+                    if isinstance(cls, type):\n+                        transformers.append(cls)\n+                    else:\n+                        transformers.append(cls())\n         if len(transformers) == 0:\n             transformers = [None]\n         features.append((prefix + column + suffix, transformers, {}))\n",
    "patch_ground": "\n--- a/sklearn_pandas/features_generator.py\n+++ b/sklearn_pandas/features_generator.py\n@@ -1,5 +1,16 @@\n def gen_features(columns, classes=None, prefix='', suffix=''):\n-    \"\"\"Generates a feature definition list from given column names and transformer classes.\"\"\"\n+    \"\"\"\n+    Generates a feature definition list from given column names and transformer classes.\n+    \"\"\"\n     features = []\n     for column in columns:\n-        transformers = [(cls(), cls().__dict__) if isinstance(cls, type) else (cls, cls.__dict__) for cls in classes]\n+        if classes is None:\n+            transformers = [(None, {})]\n+        else:\n+            transformers = []\n+            for cls in classes:\n+                if isinstance(cls, dict):\n+                    cls_instance = cls['class'](**{k: v for k, v in cls.items() if k != 'class'})\n+                    transformers.append((cls_instance, cls_instance.__dict__))\n+                elif isinstance(cls, type):\n+                    transformers.append((cls, {}))\n+                else:\n+                    transformers.append((cls.__class__, cls.__dict__))\n         features.append((prefix + column + suffix, transformers, {}))\n     return features\n\n",
    "patch_ground_location": "\n--- a/scikit-learn-contrib_sklearn-pandas/sklearn_pandas/features_generator.py\n+++ b/scikit-learn-contrib_sklearn-pandas/sklearn_pandas/features_generator.py\n@@ -1,21 +1,25 @@\n def gen_features(columns, classes=None, prefix='', suffix=''):\n     \"\"\"\n     Generates a feature definition list from given column names and transformers classes.\n     \"\"\"\n    features = []\n    for column in columns:\n        # Ensure the class handling accommodates None or type int\n        if classes is None:\n            transformers = [(None, {})]\n        else:\n            transformers = []\n            for cls in classes:\n                if isinstance(cls, dict):\n                    cls_instance = cls['class'](**{k: v for k, v in cls.items() if k != 'class'})\n                    transformers.append((cls_instance, cls_instance.__dict__))\n                elif isinstance(cls, type):\n                    transformers.append((cls, {}))  # Handle class types appropriately\n                else:\n                    transformers.append((cls.__class__, cls.__dict__))\n\n        features.append((prefix + column + suffix, transformers, {}))\n    return features\n",
    "patch_ground_exp": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n@@ -47,7 +47,7 @@ def test_generate_features_with_default_parameters():\n     columns = ['colA', 'colB', 'colC']\n-    feature_defs = gen_features(columns=columns, classes=[MockClass])\n+    feature_defs = gen_features(columns=columns, classes=[None])\n     assert len(feature_defs) == len(columns)\n\n     for feature in feature_defs:\n@@ -88,7 +88,7 @@ def test_generate_features_with_none_only_transformers():\n     feature_defs = gen_features(\n-        columns=['colA', 'colB', 'colC'], classes=[int])\n+        columns=['colA', 'colB', 'colC'], classes=[None])\n\n     expected = [('colA', int, {}),\n                 ('colB', int, {}),\n                 ('colC', int, {})]\n\n     assert feature_defs == expected\n",
    "patch_ground_all": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n@@ -87,12 +87,12 @@\n     transformation.\n     \"\"\"\n     feature_defs = gen_features(\n-        columns=['colA', 'colB', 'colC'], classes=[int])\n+        columns=['colA', 'colB', 'colC'], classes=[None])\n\n-    expected = [('colA', int, {}),\n-                ('colB', int, {}),\n-                ('colC', int, {})]\n+    expected = [('colA', None, {}),\n+                ('colB', None, {}),\n+                ('colC', None, {})]\n\n     assert feature_defs == expected\n",
    "patch_ground_truth": "--- a/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py\n@@ -85,11 +85,11 @@\n     transformation.\n     \"\"\"\n     feature_defs = gen_features(\n-        columns=['colA', 'colB', 'colC'], classes=[int])\n+        columns=['colA', 'colB', 'colC'], classes=[None])\n \n-    expected = [('colA', int, {}),\n-                ('colB', int, {}),\n-                ('colC', int, {})]\n+    expected = [('colA', None, {}),\n+                ('colB', None, {}),\n+                ('colC', None, {})]\n \n     assert feature_defs == expected\n \n",
    "message": "============================================================================================= FAILURES ==============================================================================================\n________________________________________________________________________ test_generate_features_with_none_only_transformers _________________________________________________________________________\n\n    def test_generate_features_with_none_only_transformers():\n        \"\"\"\n        Tests generating \"dummy\" feature definition which doesn't apply any\n        transformation.\n        \"\"\"\n        feature_defs = gen_features(\n            columns=['colA', 'colB', 'colC'], classes=[int])\n    \n        expected = [('colA', int, {}),\n                    ('colB', int, {}),\n                    ('colC', int, {})]\n    \n>       assert feature_defs == expected\nE       AssertionError: assert [('colA', [0]...lC', [0], {})] == [('colA', <cl...s 'int'>, {})]\nE         \nE         At index 0 diff: ('colA', [0], {}) != ('colA', <class 'int'>, {})\nE         \nE         Full diff:\nE           [\nE               (\nE                   'colA',...\nE         \nE         ...Full output truncated (23 lines hidden), use '-vv' to show\n\ntests/test_features_generator.py:94: AssertionError\n========================================================================================= warnings summary ==========================================================================================\ntests/test_dataframe_mapper.py::test_complex_object_df\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:106: SettingWithCopyWarning: \n  A value is trying to be set on a copy of a slice from a DataFrame.\n  Try using .loc[row_indexer,col_indexer] = value instead\n  \n  See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n    X[col] = X[col].map(lambda img: np.max(img))\n\ntests/test_dataframe_mapper.py::test_sparse_features\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:865: DeprecationWarning: Please use `csr_matrix` from the `scipy.sparse` namespace, the `scipy.sparse.csr` namespace is deprecated.\n    assert type(dmatrix) == sparse.csr.csr_matrix\n\ntests/test_dataframe_mapper.py::test_sparse_off\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:879: DeprecationWarning: Please use `csr_matrix` from the `scipy.sparse` namespace, the `scipy.sparse.csr` namespace is deprecated.\n    assert type(dmatrix) != sparse.csr.csr_matrix\n\ntests/test_transformers.py::test_common_numerical_transformer\ntests/test_transformers.py::test_numerical_transformer_serialization\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/sklearn_pandas/transformers.py:35: DeprecationWarning: \n              NumericalTransformer will be deprecated in 3.0 version.\n              Please use Sklearn.base.TransformerMixin to write\n              customer transformers\n              \n    warnings.warn(\"\"\"\n\n-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html\n====================================================================================== short test summary info ======================================================================================\nFAILED tests/test_features_generator.py::test_generate_features_with_none_only_transformers - AssertionError: assert [('colA', [0]...lC', [0], {})] == [('colA', <cl...s 'int'>, {})]\n============================================================================= 1 failed, 69 passed, 5 warnings in 1.40s ==============================================================================",
    "CodeBase": [
      {
        "path": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
        "content": "1 # -*- coding: utf8 -*-\n2 \n3 import pytest\n4 from unittest.mock import Mock\n5 from pandas import DataFrame\n6 import pandas as pd\n7 from scipy import sparse\n8 from sklearn.datasets import load_iris\n9 from sklearn.pipeline import Pipeline\n10 from sklearn.model_selection import cross_val_score\n11 from sklearn.svm import SVC\n12 from sklearn.feature_extraction.text import CountVectorizer\n13 from sklearn.feature_extraction import DictVectorizer\n14 from sklearn.preprocessing import (\n15     StandardScaler, OneHotEncoder, LabelBinarizer)\n16 from sklearn.impute import SimpleImputer as Imputer\n17 from sklearn.feature_selection import SelectKBest, chi2\n18 from sklearn.base import BaseEstimator, TransformerMixin\n19 import sklearn.decomposition\n20 import numpy as np\n21 from numpy.testing import assert_array_equal\n22 import pickle\n23 from sklearn.compose import make_column_selector\n24 \n25 from sklearn_pandas import DataFrameMapper\n26 from sklearn_pandas.dataframe_mapper import _handle_feature, _build_transformer\n27 from sklearn_pandas.pipeline import TransformerPipeline\n28 \n29 \n30 class MockXTransformer(object):\n31     \"\"\"\n32     Mock transformer that accepts no y argument.\n33     \"\"\"\n34     def fit(self, X):\n35         return self\n36 \n37     def transform(self, X):\n38         return X\n39 \n40 \n41 class MockTClassifier(object):\n42     \"\"\"\n43     Mock transformer/classifier.\n44     \"\"\"\n45     def fit(self, X, y=None):\n46         return self\n47 \n48     def transform(self, X):\n49         return X\n50 \n51     def predict(self, X):\n52         return True\n53 \n54 \n55 class DateEncoder():\n56     def fit(self, X, y=None):\n57         retur(...truncated)"
      },
      {
        "path": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
        "content": "1 from collections import Counter\n2 \n3 import pytest\n4 import numpy as np\n5 from pandas import DataFrame\n6 from numpy.testing import assert_array_equal\n7 \n8 from sklearn_pandas import DataFrameMapper\n9 from sklearn_pandas.features_generator import gen_features\n10 \n11 \n12 class MockClass(object):\n13 \n14     def __init__(self, value=1, name='class'):\n15         self.value = value\n16         self.name = name\n17 \n18 \n19 class MockTransformer(object):\n20 \n21     def __init__(self):\n22         self.most_common_ = None\n23 \n24     def fit(self, X, y=None):\n25         [(value, _)] = Counter(X).most_common(1)\n26         self.most_common_ = value\n27         return self\n28 \n29     def transform(self, X, y=None):\n30         return np.asarray([self.most_common_] * len(X))\n31 \n32 \n33 @pytest.fixture\n34 def simple_dataset():\n35     return DataFrame({\n36         'feat1': [1, 2, 1, 3, 1],\n37         'feat2': [1, 2, 2, 2, 3],\n38         'feat3': [1, 2, 3, 4, 5],\n39     })\n40 \n41 \n42 def test_generate_features_with_default_parameters():\n43     \"\"\"\n44     Tests generating features from classes with default init arguments.\n45     \"\"\"\n46     columns = ['colA', 'colB', 'colC']\n47     feature_defs = gen_features(columns=columns, classes=[MockClass])\n48     assert len(feature_defs) == len(columns)\n49 \n50     for feature in feature_defs:\n51         assert feature[2] == {}\n52 \n53     feature_dict = dict([_[0:2] for _ in feature_defs])\n54     assert columns == sorted(feature_dict.keys())\n55 \n56     # default init arguments for MockClass for clarification.\n57     expected = {'value': 1, 'name': 'class'}\n58     for column, transformers in feature_dict.items():\n59         for obj in transformers:\n60             assert_attributes(obj, **expected)\n61 \n62 \n63 def test_generate_features_with_several_classes():\n64     \"\"\"\n65     Tests generating features pipeline with different transformers parameters.\n66     \"\"\"\n67     feature_defs = gen_features(\n68         columns=['colA', 'colB', 'colC'],\n69         classes=[\n70             {'class': MockClass},\n71             {'class': MockClass, 'name': 'mockA'},\n72             {'class': MockClass, 'name': 'mockB', 'value': None}\n73         ]\n74     )\n75 \n76     for col, transformers, params in feature_defs:\n77         assert_attributes(transformers[0], name='class', value=1)\n78         assert_attributes(transformers[1], name='mockA', value=1)\n79         assert_attributes(transformers[2], name='mockB', value=None)\n80 \n81 \n82 def test_generate_features_with_none_only_transformers():\n83     \"\"\"\n84     Tests generating \"dummy\" feature definition which doesn't apply any\n85     transformation.\n86     \"\"\"\n87     feature_defs = gen_features(\n88         columns=['colA', 'colB', 'colC'], classes=[int])\n89 \n90     expected = [('colA', int, {}),\n91                 ('colB', int, {}),\n92                 ('colC', int, {})]\n93 \n94     assert feature_defs == expected\n95 \n96 \n97 def test_compatibility_with_data_frame_mapper(simple_dataset):\n98     \"\"\"\n99     Tests compatibility of generated feature definition with DataFrameMapper.\n100     \"\"\"\n101     features_defs = gen_features(\n102         columns=['feat1', 'feat2'],\n103         classes=[MockTransformer])\n104     features_defs.append(('feat3', None))\n105 \n106     mapper = DataFrameMapper(features_defs)\n107     X = mapper.fit_transform(simple_dataset)\n108     expected = np.asarray([\n109         [1, 2, 1],\n110         [1, 2, 2],\n111         [1, 2, 3],\n112         [1, 2, 4],\n113         [1, 2, 5]\n114     ])\n115 \n116     assert_array_equal(X, expected)\n117 \n118 \n119 def assert_attributes(obj, **attrs):\n120     for attr, value in attrs.items():\n121         assert getattr(obj, attr) == value"
      },
      {
        "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/features_generator.py",
        "content": "1 def gen_features(columns, classes=None, prefix='', suffix=''):\n2     \"\"\"Generates a feature definition li(...truncated)"
      },
      {
        "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/dataframe_mapper.py",
        "content": "1 import contextlib\n2 from datetime import datetim(...truncated)"
      }
    ],
    "CommitSHA": "c9db2d6dcbf515eade751073f43318e43cae5177"
  },
  "Score": {
    "Difficulty": "Medium",
    "issue_origin": {
      "Title": 6,
      "Description": 7,
      "Reproducibility": 6,
      "Relevance": 5,
      "Explanation": 6,
      "Overall": 6
    },
    "issue_message": {
      "Title": 6,
      "Description": 5,
      "Reproducibility": 4,
      "Relevance": 6,
      "Explanation": 7,
      "Overall": 6
    },
    "issue_ground": {
      "Title": 8,
      "Description": 8,
      "Reproducibility": 8,
      "Relevance": 8,
      "Explanation": 8,
      "Overall": 8
    },
    "issue_ground_truth": {
      "title": "Feature Generation Incorrect Class Type Handling in Tests",
      "description": "There is an issue with the `gen_features` function in the `test_features_generator.py` file, where the class type provided for feature generation is not handled correctly. The `classes` argument is supposed to allow for a list of classes or `None`, but the test was incorrectly setting this to a specific type (e.g., int). This leads to an incorrect comparison and fails to correctly generate features as expected when a class is set to `None`. This affects test scenarios where the expected behavior is that no transformation should occur. This issue needs to be rectified to ensure the tests correctly validate the feature generation logic when classes are set to `None`.",
      "explanation": "### Summary of the Issue\n\nThe issue reported is related to the `gen_features` function in the `test_features_generator.py` test file. This function was supposed to handle the `classes` parameter, which can be a list of classes for generating features or `None`. However, in the tests, the `classes` parameter was incorrectly set to a specific type, such as `int`, instead of accepting `None`. This caused an improper comparison and failed to generate the expected features correctly when `None` was provided. Tests where features should remain untransformed (when classes are set to `None`) were failing due to this misconfiguration.\n\n### Content of the Commit\n\nThe commit modifies the `test_features_generator.py` file. The key changes involve adjusting the test cases to correctly handle and set the `classes` argument to `None` instead of a specific type like `int`. This ensures that the function behaves as expected, particularly in scenarios where no transformation should occur. \n\n### Explanation of the Issue and Solution\n\n#### Cause of the Issue:\n1. **Incorrect Handling of `classes` Argument**: The `gen_features` function is designed to accept a list of classes or `None` values for generating feature transformations. When `None` is provided, the features should remain untransformed.\n   \n2. **Misconfigured Test**: The tests were incorrectly assigning a specific type (e.g., `int`) to the `classes` parameter, which led to incorrect comparisons. For instance, setting `classes=[int]` instead of `classes=[None]` caused the function to behave differently than intended, leading to failed test cases.\n\n3. **Test Expectations Not Matching Actual Output**: The expected outcome was for features to remain untransformed when `classes` was set to `None`. However, due to the misconfiguration, the tests did not reflect this behavior, and failed to validate the correct functionality of the `gen_features` function.\n\n#### Solution Provided by the Commit:\n1. **Correct Parameter Setting**: The commit rectifies the test cases to correctly set the `classes` argument to `None`. This ensures that the correct behavior is triggered within the `gen_features` function—features remain untransformed when `None` is specified.\n\n2. **Adjusting Expected Outcomes**: The expected outcomes in the tests are adjusted to match the untransformed feature definitions. This aligns the expected results with the actual behavior when `None` is provided.\n\n3. **Ensuring Correct Comparisons**: By setting `classes` to `[None]`, the test cases accurately reflect scenarios where no transformation should occur, thus ensuring the function is correctly validated for these scenarios.\n\n### Detailed Explanation of the Solution\n\n**Handling of the `classes` Parameter**: \n- The test now correctly uses `classes=[None]`, aligning with the intended design of the `gen_features` function. When `None` is provided, it indicates that the feature should not undergo any transformation. This change ensures that the function's logic of returning features untransformed is tested properly.\n\n**Validation of Expected Outcomes**:\n- The expected results are updated to reflect the correct untransformed feature definitions. This ensures that the test cases now correctly validate the function's behavior when features are supposed to remain as-is.\n\n**Overall Result**:\n- Post-commit, the `gen_features` function is now correctly tested, ensuring that features remain untransformed when `None` is provided as the `classes` parameter. This fix addresses the inadequacies in the previous test configuration and ensures accurate validation of the function's behavior under different scenarios.\n\nBy making these adjustments, the commit ensures that the `gen_features` function is correctly handling the `classes` parameter, thereby fixing the bug and ensuring the accuracy of the test cases. This thorough validation guarantees that the function performs as expected when no transformations are needed."
    }
  }
}