{
  "RepoName": "https://github.com/scikit-learn-contrib/sklearn-pandas.git",
  "CommitSHA": "c9db2d6dcbf515eade751073f43318e43cae5177",
  "Time": "",
  "Difficulty": "Medium",
  "Type": "missing or wrong import",
  "BuggyCode": [
    {
      "path": "scikit-learn-contrib_sklearn-pandas/test.py",
      "content": "import pytest\nfrom unittest.mock import Mock\nimport numpy as np\nimport pandas as pd\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn.compose import make_column_selector\nfrom sklearn.preprocessing import StandardScaler\n\n\nclass GetStartWith:\n    def __init__(self, start_str):\n        self.start_str = start_str\n\n    def __call__(self, X: pd.DataFrame) -> list:\n        return [c for c in X.columns if c.startswith(self.start_str)]\n\n\ndf = pd.DataFrame({\n    'sepal length (cm)': [1.0, 2.0, 3.0],\n    'sepal width (cm)': [1.0, 2.0, 3.0],\n    'petal length (cm)': [1.0, 2.0, 3.0],\n    'petal width (cm)': [1.0, 2.0, 3.0]\n})\nt = DataFrameMapper([\n    (make_column_selector(dtype_include=float), StandardScaler(), {'alias': 'x'}),\n    (GetStartWith('petal'), None, {'alias': 'petal'})\n], df_out=True, default=False)\n\nt.fit(df)\nprint(t.transform(df).shape)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/setup.py",
      "content": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\nfrom setuptools import setup\nfrom setuptools.command.test import test as TestCommand\nimport re\n\nfor line in open('sklearn_pandas/__init__.py'):\n    match = re.match(\"__version__ *= *'(.*)'\", line)\n    if match:\n        __version__, = match.groups()\n\n\nclass PyTest(TestCommand):\n    user_options = [('pytest-args=', 'a', \"Arguments to pass to py.test\")]\n\n    def initialize_options(self):\n        TestCommand.initialize_options(self)\n        self.pytest_args = []\n\n    def finalize_options(self):\n        TestCommand.finalize_options(self)\n        self.test_args = []\n        self.test_suite = True\n\n    def run(self):\n        import pytest\n        errno = pytest.main(self.pytest_args)\n        raise SystemExit(errno)\n\n\nsetup(name='sklearn-pandas',\n      version=__version__,\n      description='Pandas integration with sklearn',\n      maintainer='Ritesh Agrawal',\n      maintainer_email='ragrawal@gmail.com',\n      url='https://github.com/scikit-learn-contrib/sklearn-pandas',\n      packages=['sklearn_pandas'],\n      keywords=['scikit', 'sklearn', 'pandas'],\n      install_requires=[\n          'scikit-learn>=0.23.0',\n          'scipy>=1.5.1',\n          'pandas>=1.1.4',\n          'numpy>=1.18.1'\n      ],\n      tests_require=['pytest', 'mock'],\n      cmdclass={'test': PyTest},\n      license='MIT License'\n)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/noxfile.py",
      "content": "import nox\n\n@nox.session\ndef lint(session):\n    session.install('pytest>=5.3.5', 'setuptools>=45.2',\n                    'wheel>=0.34.2', 'flake8>=3.7.9',\n                    'numpy==1.18.1', 'pandas==1.1.4')\n    session.install('.')\n    session.run('flake8', 'sklearn_pandas/', 'tests')\n\n@nox.session\n@nox.parametrize('numpy', ['1.18.1', '1.19.4', '1.20.1'])\n@nox.parametrize('scipy', ['1.5.4', '1.6.0'])\n@nox.parametrize('pandas', ['1.1.4', '1.2.2'])\ndef tests(session, numpy, scipy, pandas):\n    session.install('pytest>=5.3.5', \n                    'setuptools>=45.2',\n                    'wheel>=0.34.2',\n                    f'numpy=={numpy}',\n                    f'scipy=={scipy}',\n                    f'pandas=={pandas}'\n                    )\n    session.install('.')\n    session.run('py.test', 'README.rst', 'tests')\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/features_generator.py",
      "content": "def gen_features(columns, classes=None, prefix='', suffix=''):\n    \"\"\"Generates a feature definition list which can be passed\n    into DataFrameMapper\n\n    Params:\n\n    columns     a list of column names to generate features for.\n\n    classes     a list of classes for each feature, a list of dictionaries with\n                transformer class and init parameters, or None.\n\n                If list of classes is provided, then each of them is\n                instantiated with default arguments. Example:\n\n                    classes = [StandardScaler, LabelBinarizer]\n\n                If list of dictionaries is provided, then each of them should\n                have a 'class' key with transformer class. All other keys are\n                passed into 'class' value constructor. Example:\n\n                    classes = [\n                        {'class': StandardScaler, 'with_mean': False},\n                        {'class': LabelBinarizer}\n                    }]\n\n                If None value selected, then each feature left as is.\n\n    prefix      add prefix to transformed column names\n\n    suffix      add suffix to transformed column names.\n\n    \"\"\"\n    if classes is None:\n        return [(column, None) for column in columns]\n\n    feature_defs = []\n\n    for column in columns:\n        feature_transformers = []\n\n        arguments = {}\n        if prefix and prefix != \"\":\n            arguments['prefix'] = prefix\n        if suffix and suffix != \"\":\n            arguments['suffix'] = suffix\n\n        classes = [cls for cls in classes if cls is not None]\n        if not classes:\n            feature_defs.append((column, None, arguments))\n\n        else:\n            for definition in classes:\n                if isinstance(definition, dict):\n                    params = definition.copy()\n                    klass = params.pop('class')\n                    feature_transformers.append(klass(**params))\n                else:\n                    feature_transformers.append(definition())\n\n            if not feature_transformers:\n                feature_transformers = None\n\n            feature_defs.append((column, feature_transformers, arguments))\n\n    return feature_defs\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/transformers.py",
      "content": "import numpy as np\nimport pandas as pd\nfrom sklearn.base import TransformerMixin\nimport warnings\n\n\ndef _get_mask(X, value):\n    \"\"\"\n    Compute the boolean mask X == missing_values.\n    \"\"\"\n    if value == \"NaN\" or \\\n       value is None or \\\n       (isinstance(value, float) and np.isnan(value)):\n        return pd.isnull(X)\n    else:\n        return X == value\n\n\nclass NumericalTransformer(TransformerMixin):\n    \"\"\"\n    Provides commonly used numerical transformers.\n    \"\"\"\n    SUPPORTED_FUNCTIONS = ['log', 'log1p']\n\n    def __init__(self, func):\n        \"\"\"\n        Params\n\n        func    function to apply to input columns. The function will be\n                applied to each value. Supported functions are defined\n                in SUPPORTED_FUNCTIONS variable. Throws assertion error if the\n                not supported.\n        \"\"\"\n\n        warnings.warn(\"\"\"\n            NumericalTransformer will be deprecated in 3.0 version.\n            Please use Sklearn.base.TransformerMixin to write\n            customer transformers\n            \"\"\", DeprecationWarning)\n\n        assert func in self.SUPPORTED_FUNCTIONS, \\\n            f\"Only following func are supported: {self.SUPPORTED_FUNCTIONS}\"\n        super(NumericalTransformer, self).__init__()\n        self.__func = func\n\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X, y=None):\n        if self.__func == 'log1p':\n            return np.vectorize(np.log1p)(X)\n        elif self.__func == 'log':\n            return np.vectorize(np.log)(X)\n\n        raise ValueError(f\"Invalid function name: {self.__func}\")\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/__init__.py",
      "content": "__version__ = '2.2.0'\n\nimport logging\nlogger = logging.getLogger(__name__)\n\nfrom .dataframe_mapper import DataFrameMapper  # NOQA\nfrom .features_generator import gen_features  # NOQA\nfrom .transformers import NumericalTransformer # NOQA\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/pipeline.py",
      "content": "import six\nfrom sklearn.pipeline import _name_estimators, Pipeline\nfrom sklearn.utils import tosequence\n\n\ndef _call_fit(fit_method, X, y=None, **kwargs):\n    \"\"\"\n    helper function, calls the fit or fit_transform method with the correct\n    number of parameters\n\n    fit_method: fit or fit_transform method of the transformer\n    X: the data to fit\n    y: the target vector relative to X, optional\n    kwargs: any keyword arguments to the fit method\n\n    return: the result of the fit or fit_transform method\n\n    WARNING: if this function raises a TypeError exception, test the fit\n    or fit_transform method passed to it in isolation as _call_fit will not\n    distinguish TypeError due to incorrect number of arguments from\n    other TypeError\n    \"\"\"\n    try:\n        return fit_method(X, y, **kwargs)\n    except TypeError:\n        # fit takes only one argument\n        return fit_method(X, **kwargs)\n\n\nclass TransformerPipeline(Pipeline):\n    \"\"\"\n    Pipeline that expects all steps to be transformers taking a single X\n    argument, an optional y argument, and having fit and transform methods.\n\n    Code is copied from sklearn's Pipeline\n    \"\"\"\n\n    def __init__(self, steps):\n        names, estimators = zip(*steps)\n        if len(dict(steps)) != len(steps):\n            raise ValueError(\n                \"Provided step names are not unique: %s\" % (names,))\n\n        # shallow copy of steps\n        self.steps = tosequence(steps)\n        estimator = estimators[-1]\n\n        for e in estimators:\n            if (not (hasattr(e, \"fit\") or hasattr(e, \"fit_transform\")) or not\n                    hasattr(e, \"transform\")):\n                raise TypeError(\"All steps of the chain should \"\n                                \"be transforms and implement fit and transform\"\n                                \" '%s' (type %s) doesn't)\" % (e, type(e)))\n\n        if not hasattr(estimator, \"fit\"):\n            raise TypeError(\"Last step of chain should implement fit \"\n                            \"'%s' (type %s) doesn't)\"\n                            % (estimator, type(estimator)))\n\n    def _pre_transform(self, X, y=None, **fit_params):\n        fit_params_steps = dict((step, {}) for step, _ in self.steps)\n        for pname, pval in six.iteritems(fit_params):\n            step, param = pname.split('__', 1)\n            fit_params_steps[step][param] = pval\n        Xt = X\n        for name, transform in self.steps[:-1]:\n            if hasattr(transform, \"fit_transform\"):\n                Xt = _call_fit(transform.fit_transform,\n                               Xt, y, **fit_params_steps[name])\n            else:\n                Xt = _call_fit(transform.fit,\n                               Xt, y, **fit_params_steps[name]).transform(Xt)\n        return Xt, fit_params_steps[self.steps[-1][0]]\n\n    def fit(self, X, y=None, **fit_params):\n        Xt, fit_params = self._pre_transform(X, y, **fit_params)\n        _call_fit(self.steps[-1][-1].fit, Xt, y, **fit_params)\n        return self\n\n    def fit_transform(self, X, y=None, **fit_params):\n        Xt, fit_params = self._pre_transform(X, y, **fit_params)\n        if hasattr(self.steps[-1][-1], 'fit_transform'):\n            return _call_fit(self.steps[-1][-1].fit_transform,\n                             Xt, y, **fit_params)\n        else:\n            return _call_fit(self.steps[-1][-1].fit,\n                             Xt, y, **fit_params).transform(Xt)\n\n\ndef make_transformer_pipeline(*steps):\n    \"\"\"Construct a TransformerPipeline from the given estimators.\n    \"\"\"\n    return TransformerPipeline(_name_estimators(steps))\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/dataframe_mapper.py",
      "content": "import contextlib\nfrom datetime import datetime\nimport pandas as pd\nimport numpy as np\nfrom scipy import sparse\nfrom sklearn.base import BaseEstimator, TransformerMixin\nfrom .cross_validation import DataWrapper\nfrom .pipeline import make_transformer_pipeline, _call_fit, TransformerPipeline\nfrom . import logger\n\nstring_types = text_type = str\n\n\ndef _handle_feature(fea):\n    \"\"\"\n    Convert 1-dimensional arrays to 2-dimensional column vectors.\n    \"\"\"\n    if len(fea.shape) == 1:\n        fea = np.array([fea]).T\n\n    return fea\n\n\ndef _build_transformer(transformers):\n    if isinstance(transformers, list):\n        transformers = make_transformer_pipeline(*transformers)\n    return transformers\n\n\ndef _build_feature(columns, transformers, options={}, X=None):\n    if X is None:\n        return (columns, _build_transformer(transformers), options)\n    return (\n        columns(X) if callable(columns) else columns,\n        _build_transformer(transformers),\n        options\n    )\n\n\ndef _elapsed_secs(t1):\n    return (datetime.now()-t1).total_seconds()\n\n\ndef _get_feature_names(estimator):\n    \"\"\"\n    Attempt to extract feature names based on a given estimator\n    \"\"\"\n    if hasattr(estimator, 'classes_'):\n        return estimator.classes_\n    elif hasattr(estimator, 'get_feature_names'):\n        return estimator.get_feature_names()\n    return None\n\n\n@contextlib.contextmanager\ndef add_column_names_to_exception(column_names):\n    # Stolen from https://stackoverflow.com/a/17677938/356729\n    try:\n        yield\n    except Exception as ex:\n        if ex.args:\n            msg = u'{}: {}'.format(column_names, ex.args[0])\n        else:\n            msg = text_type(column_names)\n        ex.args = (msg,) + ex.args[1:]\n        raise\n\n\nclass DataFrameMapper(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Map Pandas data frame column subsets to their own\n    sklearn transformation.\n    \"\"\"\n\n    def __init__(self, features, default=False, sparse=False, df_out=False,\n                 input_df=False, drop_cols=None):\n        \"\"\"\n        Params:\n\n        features    a list of tuples with features definitions.\n                    The first element is the pandas column selector. This can\n                    be a string (for one column) or a list of strings.\n                    The second element is an object that supports\n                    sklearn's transform interface, or a list of such objects\n                    The third element is optional and, if present, must be\n                    a dictionary with the options to apply to the\n                    transformation. Example: {'alias': 'day_of_week'}\n\n        default     default transformer to apply to the columns not\n                    explicitly selected in the mapper. If False (default),\n                    discard them. If None, pass them through untouched. Any\n                    other transformer will be applied to all the unselected\n                    columns as a whole, taken as a 2d-array.\n\n        sparse      will return sparse matrix if set True and any of the\n                    extracted features is sparse. Defaults to False.\n\n        df_out      return a pandas data frame, with each column named using\n                    the pandas column that created it (if there's only one\n                    input and output) or the input columns joined with '_'\n                    if there's multiple inputs, and the name concatenated with\n                    '_1', '_2' etc if there's multiple outputs. NB: does not\n                    work if *default* or *sparse* are true\n\n        input_df    If ``True`` pass the selected columns to the transformers\n                    as a pandas DataFrame or Series. Otherwise pass them as a\n                    numpy array. Defaults to ``False``.\n\n        drop_cols   List of columns to be dropped. Defaults to None.\n\n        \"\"\"\n        self.features = features\n        self.default = default\n        self.built_default = None\n        self.sparse = sparse\n        self.df_out = df_out\n        self.input_df = input_df\n        self.drop_cols = [] if drop_cols is None else drop_cols\n        self.transformed_names_ = []\n        if (df_out and (sparse or default)):\n            raise ValueError(\"Can not use df_out with sparse or default\")\n\n    def _build(self, X=None):\n        \"\"\"\n        Build attributes built_features and built_default.\n        \"\"\"\n        if isinstance(self.features, list):\n            self.built_features = [\n                _build_feature(*f, X=X) for f in self.features\n            ]\n        else:\n            self.built_features = _build_feature(*self.features, X=X)\n        self.built_default = _build_transformer(self.default)\n\n    @property\n    def _selected_columns(self):\n        \"\"\"\n        Return a set of selected columns in the feature list.\n        \"\"\"\n        selected_columns = set()\n        for feature in self.features:\n            columns = feature[0]\n            if isinstance(columns, list):\n                selected_columns = selected_columns.union(set(columns))\n            else:\n                selected_columns.add(columns)\n        return selected_columns\n\n    def _unselected_columns(self, X):\n        \"\"\"\n        Return list of columns present in X and not selected explicitly in the\n        mapper.\n\n        Unselected columns are returned in the order they appear in the\n        dataframe to avoid issues with different ordering during default fit\n        and transform steps.\n        \"\"\"\n        X_columns = list(X.columns)\n        return [column for column in X_columns if\n                column not in self._selected_columns\n                and column not in self.drop_cols]\n\n    def __setstate__(self, state):\n        # compatibility for older versions of sklearn-pandas\n        super().__setstate__(state)\n        self.features = [_build_feature(*feat) for feat in state['features']]\n        self.sparse = state.get('sparse', False)\n        self.default = state.get('default', False)\n        self.df_out = state.get('df_out', False)\n        self.input_df = state.get('input_df', False)\n        self.drop_cols = state.get('drop_cols', [])\n        self.built_features = state.get('built_features', self.features)\n        self.built_default = state.get('built_default', self.default)\n        self.transformed_names_ = state.get('transformed_names_', [])\n\n    def __getstate__(self):\n        state = super().__getstate__()\n        state['features'] = self.features\n        state['sparse'] = self.sparse\n        state['default'] = self.default\n        state['df_out'] = self.df_out\n        state['input_df'] = self.input_df\n        state['drop_cols'] = self.drop_cols\n        state['build_features'] = getattr(self, 'built_features', None)\n        state['built_default'] = self.built_default\n        state['transformed_names_'] = self.transformed_names_\n        return state\n\n    def _get_col_subset(self, X, cols, input_df=False):\n        \"\"\"\n        Get a subset of columns from the given table X.\n\n        X       a Pandas dataframe; the table to select columns from\n        cols    a string or list of strings representing the columns to select.\n                It can also be a callable that returns True or False, i.e.\n                compatible with the built-in filter function.\n\n        Returns a numpy array with the data from the selected columns\n        \"\"\"\n\n        if isinstance(cols, string_types):\n            return_vector = True\n            cols = [cols]\n        else:\n            return_vector = False\n\n        # Needed when using the cross-validation compatibility\n        # layer for sklearn<0.16.0.\n        # Will be dropped on sklearn-pandas 2.0.\n        if isinstance(X, list):\n            X = [x[cols] for x in X]\n            X = pd.DataFrame(X)\n\n        elif isinstance(X, DataWrapper):\n            X = X.df  # fetch underlying data\n\n        if return_vector:\n            t = X[cols[0]]\n        else:\n            t = X[cols]\n\n        # return either a DataFrame/Series or a numpy array\n        if input_df:\n            return t\n        else:\n            return t.values\n\n    def fit(self, X, y=None):\n        \"\"\"\n        Fit a transformation from the pipeline\n\n        X       the data to fit\n\n        y       the target vector relative to X, optional\n\n        \"\"\"\n        self._build(X=X)\n\n        for columns, transformers, options in self.built_features:\n            t1 = datetime.now()\n            input_df = options.get('input_df', self.input_df)\n\n            if transformers is not None:\n                with add_column_names_to_exception(columns):\n                    Xt = self._get_col_subset(X, columns, input_df)\n                    _call_fit(transformers.fit, Xt, y)\n            logger.info(f\"[FIT] {columns}: {_elapsed_secs(t1)} secs\")\n\n        # handle features not explicitly selected\n        if self.built_default:  # not False and not None\n            unsel_cols = self._unselected_columns(X)\n            with add_column_names_to_exception(unsel_cols):\n                Xt = self._get_col_subset(X, unsel_cols, self.input_df)\n                _call_fit(self.built_default.fit, Xt, y)\n        return self\n\n    def get_names(self, columns, transformer, x, alias=None, prefix='',\n                  suffix=''):\n        \"\"\"\n        Return verbose names for the transformed columns.\n\n        columns       name (or list of names) of the original column(s)\n        transformer   transformer - can be a TransformerPipeline\n        x             transformed columns (numpy.ndarray)\n        alias         base name to use for the selected columns\n        \"\"\"\n        if alias is not None:\n            name = alias\n        elif isinstance(columns, list):\n            name = '_'.join(map(str, columns))\n        else:\n            name = columns\n        num_cols = x.shape[1] if len(x.shape) > 1 else 1\n\n        output = []\n\n        if num_cols > 1:\n            # If there are as many columns as classes in the transformer,\n            # infer column names from classes names.\n\n            # If we are dealing with multiple transformers for these columns\n            # attempt to extract the names from each of them, starting from the\n            # last one\n            if isinstance(transformer, TransformerPipeline):\n                inverse_steps = transformer.steps[::-1]\n                estimators = (estimator for name, estimator in inverse_steps)\n                names_steps = (_get_feature_names(e) for e in estimators)\n                names = next((n for n in names_steps if n is not None), None)\n            # Otherwise use the only estimator present\n            else:\n                names = _get_feature_names(transformer)\n\n            if names is not None and len(names) == num_cols:\n                output = [f\"{name}_{o}\" for o in names]\n                # otherwise, return name concatenated with '_1', '_2', etc.\n            else:\n                output = [name + '_' + str(o) for o in range(num_cols)]\n        else:\n            output = [name]\n\n        if prefix == suffix == \"\":\n            return output\n\n        return ['{}{}{}'.format(prefix, x, suffix) for x in output]\n\n    def get_dtypes(self, extracted):\n        dtypes_features = [self.get_dtype(ex) for ex in extracted]\n        return [dtype for dtype_feature in dtypes_features\n                for dtype in dtype_feature]\n\n    def get_dtype(self, ex):\n        if isinstance(ex, np.ndarray) or sparse.issparse(ex):\n            return [ex.dtype] * ex.shape[1]\n        elif isinstance(ex, pd.DataFrame):\n            return list(ex.dtypes)\n        else:\n            raise TypeError(type(ex))\n\n    def _transform(self, X, y=None, do_fit=False):\n        \"\"\"\n        Transform the given data with possibility to fit in advance.\n        Avoids code duplication for implementation of transform and\n        fit_transform.\n        \"\"\"\n        if do_fit:\n            self._build(X=X)\n\n        extracted = []\n        transformed_names_ = []\n        for columns, transformers, options in self.built_features:\n            input_df = options.get('input_df', self.input_df)\n\n            # columns could be a string or list of\n            # strings; we don't care because pandas\n            # will handle either.\n            Xt = self._get_col_subset(X, columns, input_df)\n\n            if transformers is not None:\n                with add_column_names_to_exception(columns):\n                    if do_fit and hasattr(transformers, 'fit_transform'):\n                        t1 = datetime.now()\n                        Xt = _call_fit(transformers.fit_transform, Xt, y)\n                        logger.info(f\"[FIT_TRANSFORM] {columns}: {_elapsed_secs(t1)} secs\")  # NOQA\n                    else:\n                        if do_fit:\n                            t1 = datetime.now()\n                            _call_fit(transformers.fit, Xt, y)\n                            logger.info(\n                                f\"[FIT] {columns}: {_elapsed_secs(t1)} secs\")\n\n                        t1 = datetime.now()\n                        Xt = transformers.transform(Xt)\n                        logger.info(f\"[TRANSFORM] {columns}: {_elapsed_secs(t1)} secs\")  # NOQA\n\n            extracted.append(_handle_feature(Xt))\n\n            alias = options.get('alias')\n\n            prefix = options.get('prefix', '')\n            suffix = options.get('suffix', '')\n\n            transformed_names_ += self.get_names(\n                columns, transformers, Xt, alias, prefix, suffix)\n\n        # handle features not explicitly selected\n        if self.built_default is not False:\n            unsel_cols = self._unselected_columns(X)\n            Xt = self._get_col_subset(X, unsel_cols, self.input_df)\n            if self.built_default is not None:\n                with add_column_names_to_exception(unsel_cols):\n                    if do_fit and hasattr(self.built_default, 'fit_transform'):\n                        Xt = _call_fit(self.built_default.fit_transform, Xt, y)\n                    else:\n                        if do_fit:\n                            _call_fit(self.built_default.fit, Xt, y)\n                        Xt = self.built_default.transform(Xt)\n                transformed_names_ += self.get_names(\n                    unsel_cols, self.built_default, Xt)\n            else:\n                # if not applying a default transformer,\n                # keep column names unmodified\n                transformed_names_ += unsel_cols\n\n            extracted.append(_handle_feature(Xt))\n\n        self.transformed_names_ = transformed_names_\n\n        # combine the feature outputs into one array.\n        # at this point we lose track of which features\n        # were created from which input columns, so it's\n        # assumed that that doesn't matter to the model.\n\n        # If any of the extracted features is sparse, combine sparsely.\n        # Otherwise, combine as normal arrays.\n        if any(sparse.issparse(fea) for fea in extracted):\n            stacked = sparse.hstack(extracted).tocsr()\n            # return a sparse matrix only if the mapper was initialized\n            # with sparse=True\n            if not self.sparse:\n                stacked = stacked.toarray()\n        else:\n            stacked = np.hstack(extracted)\n\n        if self.df_out:\n            # if no rows were dropped preserve the original index,\n            # otherwise use a new integer one\n            no_rows_dropped = len(X) == len(stacked)\n            if no_rows_dropped:\n                index = X.index\n            else:\n                index = None\n\n            # output different data types, if appropriate\n            dtypes = self.get_dtypes(extracted)\n            df_out = pd.DataFrame(\n                stacked,\n                columns=self.transformed_names_,\n                index=index)\n            # preserve types\n            for col, dtype in zip(self.transformed_names_, dtypes):\n                df_out[col] = df_out[col].astype(dtype)\n            return df_out\n        else:\n            return stacked\n\n    def transform(self, X):\n        \"\"\"\n        Transform the given data. Assumes that fit has already been called.\n\n        X       the data to transform\n        \"\"\"\n        return self._transform(X)\n\n    def fit_transform(self, X, y=None):\n        \"\"\"\n        Fit a transformation from the pipeline and directly apply\n        it to the given data.\n\n        X       the data to fit\n\n        y       the target vector relative to X, optional\n        \"\"\"\n        return self._transform(X, y, True)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/cross_validation.py",
      "content": "class DataWrapper(object):\n\n    def __init__(self, df):\n        self.df = df\n\n    def __len__(self):\n        return len(self.df)\n\n    def __getitem__(self, key):\n        return self.df.iloc[key]\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_pipeline.py",
      "content": "import pytest\nfrom sklearn_pandas.pipeline import TransformerPipeline, _call_fit\n\n# In py3, mock is included with the unittest standard library\n# In py2, it's a separate package\ntry:\n    from unittest.mock import patch\nexcept ImportError:\n    from mock import patch\n\n\nclass NoTransformT(object):\n    \"\"\"Transformer without transform method.\n    \"\"\"\n    def fit(self, x):\n        return self\n\n\nclass NoFitT(object):\n    \"\"\"Transformer without fit method.\n    \"\"\"\n    def transform(self, x):\n        return self\n\n\nclass Trans(object):\n    \"\"\"\n    Transformer with fit and transform methods\n    \"\"\"\n    def fit(self, x, y=None):\n        return self\n\n    def transform(self, x):\n        return self\n\n\ndef func_x_y(x, y, kwarg='kwarg'):\n    \"\"\"\n    Function with required x and y arguments\n    \"\"\"\n    return\n\n\ndef func_x(x, kwarg='kwarg'):\n    \"\"\"\n    Function with required x argument\n    \"\"\"\n    return\n\n\ndef func_raise_type_err(x, y, kwarg='kwarg'):\n    \"\"\"\n    Function with required x and y arguments,\n    raises TypeError\n    \"\"\"\n    raise TypeError\n\n\ndef test_all_steps_fit_transform():\n    \"\"\"\n    All steps must implement fit and transform. Otherwise, raise TypeError.\n    \"\"\"\n    with pytest.raises(TypeError):\n        TransformerPipeline([('svc', NoTransformT())])\n\n    with pytest.raises(TypeError):\n        TransformerPipeline([('svc', NoFitT())])\n\n\n@patch.object(Trans, 'fit', side_effect=func_x_y)\ndef test_called_with_x_and_y(mock_fit):\n    \"\"\"\n    Fit method with required X and y arguments is called with both and with\n    any additional keywords\n    \"\"\"\n    _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n    mock_fit.assert_called_with('X', 'y', kwarg='kwarg')\n\n\n@patch.object(Trans, 'fit', side_effect=func_x)\ndef test_called_with_x(mock_fit):\n    \"\"\"\n    Fit method with a required X arguments is called with it and with\n    any additional keywords\n    \"\"\"\n    _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n    mock_fit.assert_called_with('X', kwarg='kwarg')\n\n    _call_fit(Trans().fit, 'X', kwarg='kwarg')\n    mock_fit.assert_called_with('X', kwarg='kwarg')\n\n\n@patch.object(Trans, 'fit', side_effect=func_raise_type_err)\ndef test_raises_type_error(mock_fit):\n    \"\"\"\n    If a fit method with required X and y arguments raises a TypeError, it's\n    re-raised (for a different reason) when it's called with one argument\n    \"\"\"\n    with pytest.raises(TypeError):\n        _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py",
      "content": "import tempfile\nimport pytest\nimport numpy as np\nfrom pandas import DataFrame\nimport joblib\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas import NumericalTransformer\n\n\n@pytest.fixture\ndef simple_dataset():\n    return DataFrame({\n        'feat1': [1, 2, 1, 3, 1],\n        'feat2': [1, 2, 2, 2, 3],\n        'feat3': [1, 2, 3, 4, 5],\n    })\n\n\ndef test_common_numerical_transformer(simple_dataset):\n    \"\"\"\n    Test log transformation\n    \"\"\"\n    transfomer = DataFrameMapper([\n        ('feat1', NumericalTransformer('log'))\n    ], df_out=True)\n    df = simple_dataset\n    outDF = transfomer.fit_transform(df)\n    assert list(outDF.columns) == ['feat1']\n    assert np.array_equal(df['feat1'].apply(np.log).values, outDF.feat1.values)\n\n\ndef test_numerical_transformer_serialization(simple_dataset):\n    \"\"\"\n    Test if you can serialize transformer\n    \"\"\"\n    transfomer = DataFrameMapper([\n        ('feat1', NumericalTransformer('log'))\n    ])\n\n    df = simple_dataset\n    transfomer.fit(df)\n    f = tempfile.NamedTemporaryFile(delete=True)\n    joblib.dump(transfomer, f.name)\n    transfomer2 = joblib.load(f.name)\n    np.array_equal(transfomer.transform(df), transfomer2.transform(df))\n    f.close()\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
      "content": "# -*- coding: utf8 -*-\n\nimport pytest\nfrom unittest.mock import Mock\nfrom pandas import DataFrame\nimport pandas as pd\nfrom scipy import sparse\nfrom sklearn.datasets import load_iris\nfrom sklearn import pipeline as Pipeline\nfrom sklearn.model_selection import cross_val_score\nfrom sklearn.svm import SVC\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom sklearn.feature_extraction import DictVectorizer\nfrom sklearn.preprocessing import (\n    StandardScaler, OneHotEncoder, LabelBinarizer)\nfrom sklearn.impute import SimpleImputer as Imputer\nfrom sklearn.feature_selection import SelectKBest, chi2\nfrom sklearn.base import BaseEstimator, TransformerMixin\nimport sklearn.decomposition\nimport numpy as np\nfrom numpy.testing import assert_array_equal\nimport pickle\nfrom sklearn.compose import make_column_selector\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas.dataframe_mapper import _handle_feature, _build_transformer\nfrom sklearn_pandas.pipeline import TransformerPipeline\n\n\nclass MockXTransformer(object):\n    \"\"\"\n    Mock transformer that accepts no y argument.\n    \"\"\"\n    def fit(self, X):\n        return self\n\n    def transform(self, X):\n        return X\n\n\nclass MockTClassifier(object):\n    \"\"\"\n    Mock transformer/classifier.\n    \"\"\"\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        return X\n\n    def predict(self, X):\n        return True\n\n\nclass DateEncoder():\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        dt = X.dt\n        return pd.concat([dt.year, dt.month, dt.day], axis=1)\n\n\nclass ToSparseTransformer(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Transforms numpy matrix to sparse format.\n    \"\"\"\n    def fit(self, X):\n        return self\n\n    def transform(self, X):\n        return sparse.csr_matrix(X)\n\n\nclass CustomTransformer(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Example of transformer in which the number of classes\n    is not equals to the number of output columns.\n    \"\"\"\n    def fit(self, X, y=None):\n        self.min = X.min()\n        self.classes_ = np.unique(X)\n        return self\n\n    def transform(self, X):\n        classes = np.unique(X)\n        if len(np.setdiff1d(classes, self.classes_)) > 0:\n            raise ValueError('Unknown values found.')\n        return X - self.min\n\n\nclass MockImageTransformer(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Example transformer that takes the max of a 2d vector\n    then scales the result.\n    \"\"\"\n    def __init__(self, multiplier=10.0):\n        self.multiplier = multiplier\n\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        assert isinstance(X, pd.DataFrame)\n        for col in X.columns:\n            X[col] = X[col].map(lambda img: np.max(img))\n        return X * self.multiplier\n\n\n@pytest.fixture\ndef simple_dataframe():\n    return pd.DataFrame({'a': [1, 2, 3]})\n\n\n@pytest.fixture\ndef complex_dataframe():\n    return pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'c'],\n                         'feat1': [1, 2, 3, 4, 5, 6],\n                         'feat2': [1, 2, 3, 2, 3, 4]})\n\n\n@pytest.fixture\ndef complex_object_dataframe():\n    return pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'c'],\n                         'feat1': [1, 2, 3, 4, 5, 6],\n                         'feat2': [1, 2, 3, 2, 3, 4],\n                         'img2d': [1*np.eye(2), 2*np.eye(2), 3*np.eye(2),\n                                   4*np.eye(2), 5*np.eye(2), 6*np.eye(2)]})\n\n\n@pytest.fixture\ndef multiindex_dataframe():\n    \"\"\"Example MultiIndex DataFrame, taken from pandas documentation\n    \"\"\"\n    iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]\n    index = pd.MultiIndex.from_product(iterables, names=['first', 'second'])\n    df = pd.DataFrame(np.random.randn(10, 8), columns=index)\n    return df\n\n\n@pytest.fixture\ndef multiindex_dataframe_incomplete(multiindex_dataframe):\n    \"\"\"Example MultiIndex DataFrame with missing entries\n    \"\"\"\n    df = multiindex_dataframe\n    mask_array = np.zeros(df.size)\n    mask_array[:20] = 1\n    np.random.shuffle(mask_array)\n    mask = mask_array.reshape(df.shape).astype(bool)\n    df.mask(mask, inplace=True)\n    return df\n\n\ndef test_transformed_names_simple(simple_dataframe):\n    \"\"\"\n    Get transformed names of features in `transformed_names` attribute\n    for simple transformation\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None)])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_transformed_names_binarizer(complex_dataframe):\n    \"\"\"\n    Get transformed names of features in `transformed_names` attribute\n    for a transformation that multiplies the number of columns\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['target_a', 'target_b', 'target_c']\n\n\ndef test_logging(caplog, complex_dataframe):\n    \"\"\"\n    Get transformed names of features in `transformed_names` attribute\n    for a transformation that multiplies the number of columns\n    \"\"\"\n    import logging\n    logger = logging.getLogger('sklearn_pandas')\n    logger.setLevel(logging.INFO)\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    assert '[FIT_TRANSFORM] target:' in caplog.text\n\n\ndef test_transformed_names_binarizer_unicode():\n    df = pd.DataFrame({'target': [u'ñ', u'á', u'é']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    expected_names = {u'target_ñ', u'target_á', u'target_é'}\n    assert set(mapper.transformed_names_) == expected_names\n\n\ndef test_transformed_names_transformers_list(complex_dataframe):\n    \"\"\"\n    When using a list of transformers, use them in inverse order to get the\n    transformed names\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([\n        ('target', [LabelBinarizer(), MockXTransformer()])\n    ])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['target_a', 'target_b', 'target_c']\n\n\ndef test_transformed_names_simple_alias(simple_dataframe):\n    \"\"\"\n    If we specify an alias for a single output column, it is used for the\n    output\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None, {'alias': 'new_name'})])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['new_name']\n\n\ndef test_transformed_names_complex_alias(complex_dataframe):\n    \"\"\"\n    If we specify an alias for a multiple output column, it is used for the\n    output\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer(), {'alias': 'new'})])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['new_a', 'new_b', 'new_c']\n\n\ndef test_exception_column_context_transform(simple_dataframe):\n    \"\"\"\n    If an exception is raised when transforming a column,\n    the exception includes the name of the column being transformed\n    \"\"\"\n    class FailingTransformer(object):\n        def fit(self, X):\n            pass\n\n        def transform(self, X):\n            raise Exception('Some exception')\n\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', FailingTransformer())])\n    mapper.fit(df)\n\n    with pytest.raises(Exception, match='a: Some exception'):\n        mapper.transform(df)\n\n\ndef test_exception_column_context_fit(simple_dataframe):\n    \"\"\"\n    If an exception is raised when fit a column,\n    the exception includes the name of the column being fitted\n    \"\"\"\n    class FailingFitter(object):\n        def fit(self, X):\n            raise Exception('Some exception')\n\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', FailingFitter())])\n\n    with pytest.raises(Exception, match='a: Some exception'):\n        mapper.fit(df)\n\n\ndef test_simple_df(simple_dataframe):\n    \"\"\"\n    Get a dataframe from a simple mapped dataframe\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None)], df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert type(transformed) == pd.DataFrame\n    assert len(transformed[\"a\"]) == len(simple_dataframe[\"a\"])\n\n\ndef test_complex_df(complex_dataframe):\n    \"\"\"\n    Get a dataframe from a complex mapped dataframe\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [('target', None), ('feat1', None), ('feat2', None)],\n        df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_dataframe)\n    for c in df.columns:\n        assert len(transformed[c]) == len(df[c])\n\n\ndef test_complex_object_df(complex_object_dataframe):\n    \"\"\"\n    Get a dataframe from a complex dataframe with 2d features\n    \"\"\"\n    df = complex_object_dataframe\n    img_scale = 10\n    mapper = DataFrameMapper(\n        [('target', None), ('feat1', None),\n         (make_column_selector('feat2'), StandardScaler()),\n         (make_column_selector('img2d'), MockImageTransformer(img_scale))],\n        df_out=True, input_df=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_object_dataframe)\n    assert np.isclose(\n        np.sum(transformed['img2d']),\n        np.max(np.sum(df['img2d'])) * img_scale, atol=1e-12)\n\n\ndef test_numeric_column_names(complex_dataframe):\n    \"\"\"\n    Get a dataframe from a complex mapped dataframe with numeric column names\n    \"\"\"\n    df = complex_dataframe\n    df.columns = [0, 1, 2]\n    mapper = DataFrameMapper(\n        [(0, None), (1, None), (2, None)], df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_dataframe)\n    for c in df.columns:\n        assert len(transformed[c]) == len(df[c])\n\n\ndef test_multiindex_df(multiindex_dataframe_incomplete):\n    \"\"\"\n    Get a dataframe from a multiindex dataframe with missing data\n    \"\"\"\n    df = multiindex_dataframe_incomplete\n    mapper = DataFrameMapper([([c], Imputer()) for c in df.columns],\n                             df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(multiindex_dataframe_incomplete)\n    for c in df.columns:\n        assert len(transformed[str(c)]) == len(df[c])\n\n\ndef test_binarizer_df():\n    \"\"\"\n    Check level names from LabelBinarizer\n    \"\"\"\n    df = pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'a']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 3\n    assert cols[0] == 'target_a'\n    assert cols[1] == 'target_b'\n    assert cols[2] == 'target_c'\n\n\ndef test_binarizer_int_df():\n    \"\"\"\n    Check level names from LabelBinarizer for a numeric array.\n    \"\"\"\n    df = pd.DataFrame({'target': [5, 5, 6, 6, 7, 5]})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 3\n    assert cols[0] == 'target_5'\n    assert cols[1] == 'target_6'\n    assert cols[2] == 'target_7'\n\n\ndef test_binarizer2_df():\n    \"\"\"\n    Check level names from LabelBinarizer with just one output column\n    \"\"\"\n    df = pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'a']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 1\n    assert cols[0] == 'target'\n\n\ndef test_onehot_df():\n    \"\"\"\n    Check level ids from one-hot\n    \"\"\"\n    df = pd.DataFrame({'target': [0, 0, 1, 1, 2, 3, 0]})\n    mapper = DataFrameMapper([(['target'], OneHotEncoder())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 4\n    assert cols[0] == 'target_0'\n    assert cols[3] == 'target_3'\n\n\ndef test_customtransform_df():\n    \"\"\"\n    Check level ids from a transformer in which\n    the number of classes is not equals to the number of output columns.\n    \"\"\"\n    df = pd.DataFrame({'target': [6, 5, 7, 5, 4, 8, 8]})\n    mapper = DataFrameMapper([(['target'], CustomTransformer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(mapper.features[0][1].classes_) == 5\n    assert len(cols) == 1\n    assert cols[0] == 'target'\n\n\ndef test_preserve_df_index():\n    \"\"\"\n    The index is preserved when df_out=True\n    \"\"\"\n    df = pd.DataFrame({'target': [1, 2, 3]},\n                      index=['a', 'b', 'c'])\n    mapper = DataFrameMapper([('target', None)],\n                             df_out=True)\n\n    transformed = mapper.fit_transform(df)\n\n    assert_array_equal(transformed.index, df.index)\n\n\ndef test_preserve_df_index_rows_dropped():\n    \"\"\"\n    If df_out=True but the original df index length doesn't\n    match the number of final rows, use a numeric index\n    \"\"\"\n    class DropLastRowTransformer(object):\n        def fit(self, X):\n            return self\n\n        def transform(self, X):\n            return X[:-1]\n\n    df = pd.DataFrame({'target': [1, 2, 3]},\n                      index=['a', 'b', 'c'])\n    mapper = DataFrameMapper([('target', DropLastRowTransformer())],\n                             df_out=True)\n\n    transformed = mapper.fit_transform(df)\n\n    assert_array_equal(transformed.index, np.array([0, 1]))\n\n\ndef test_pca(complex_dataframe):\n    \"\"\"\n    Check multi in and out with PCA\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [(['feat1', 'feat2'], sklearn.decomposition.PCA(2))],\n        df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 2\n    assert cols[0] == 'feat1_feat2_0'\n    assert cols[1] == 'feat1_feat2_1'\n\n\ndef test_fit_transform(simple_dataframe):\n    \"\"\"\n    Check that custom fit_transform methods of the transformers are invoked.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    # return something of measurable length but does nothing\n    mock_transformer.fit_transform.return_value = np.array([1, 2, 3])\n    mapper = DataFrameMapper([(\"a\", mock_transformer)])\n    mapper.fit_transform(df)\n    assert mock_transformer.fit_transform.called\n\n\ndef test_fit_transform_equiv_mock(simple_dataframe):\n    \"\"\"\n    Check for equivalent results for code paths fit_transform\n    versus fit and transform in DataFrameMapper using the mock\n    transformer which does not implement a custom fit_transform.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', MockXTransformer())])\n    transformed_combined = mapper.fit_transform(df)\n    transformed_separate = mapper.fit(df).transform(df)\n    assert np.all(transformed_combined == transformed_separate)\n\n\ndef test_fit_transform_equiv_pca(complex_dataframe):\n    \"\"\"\n    Check for equivalent results for code paths fit_transform\n    versus fit and transform in DataFrameMapper and transformer\n    using PCA which implements a custom fit_transform. The\n    equivalence of both paths in the transformer only can be\n    asserted since this is tested in the sklearn tests\n    scikit-learn/sklearn/decomposition/tests/test_pca.py\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [(['feat1', 'feat2'], sklearn.decomposition.PCA(2))],\n        df_out=True)\n    transformed_combined = mapper.fit_transform(df)\n    transformed_separate = mapper.fit(df).transform(df)\n    assert np.allclose(transformed_combined, transformed_separate)\n\n\ndef test_input_df_true_first_transformer(simple_dataframe, monkeypatch):\n    \"\"\"\n    If input_df is True, the first transformer is passed\n    a pd.Series instead of an np.array\n    \"\"\"\n    df = simple_dataframe\n    monkeypatch.setattr(MockXTransformer, 'fit', Mock())\n    monkeypatch.setattr(MockXTransformer, 'transform',\n                        Mock(return_value=np.array([1, 2, 3])))\n    mapper = DataFrameMapper([\n        ('a', MockXTransformer())\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n\n    args, _ = MockXTransformer().fit.call_args\n    assert isinstance(args[0], pd.Series)\n\n    args, _ = MockXTransformer().transform.call_args\n    assert isinstance(args[0], pd.Series)\n\n    assert_array_equal(out, np.array([1, 2, 3]).reshape(-1, 1))\n\n\ndef test_input_df_true_next_transformers(simple_dataframe, monkeypatch):\n    \"\"\"\n    If input_df is True, the subsequent transformers get passed pandas\n    objects instead of numpy arrays (given the previous transformers\n    output pandas objects as well)\n    \"\"\"\n    df = simple_dataframe\n    monkeypatch.setattr(MockTClassifier, 'fit', Mock())\n    monkeypatch.setattr(MockTClassifier, 'transform',\n                        Mock(return_value=pd.Series([1, 2, 3])))\n    mapper = DataFrameMapper([\n        ('a', [MockXTransformer(), MockTClassifier()])\n    ], input_df=True)\n    mapper.fit(df)\n    out = mapper.transform(df)\n\n    args, _ = MockTClassifier().fit.call_args\n    assert isinstance(args[0], pd.Series)\n\n    assert_array_equal(out, np.array([1, 2, 3]).reshape(-1, 1))\n\n\ndef test_input_df_true_multiple_cols(complex_dataframe):\n    \"\"\"\n    When input_df is True, applying transformers to multiple columns\n    works as expected\n    \"\"\"\n    df = complex_dataframe\n\n    mapper = DataFrameMapper([\n        ('target', MockXTransformer()),\n        ('feat1',  MockXTransformer()),\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n\n    assert_array_equal(out[:, 0], df['target'].values)\n    assert_array_equal(out[:, 1], df['feat1'].values)\n\n\ndef test_input_df_date_encoder():\n    \"\"\"\n    When input_df is True we can apply a transformer that only works\n    with pandas dataframes like a DateEncoder\n    \"\"\"\n    df = pd.DataFrame(\n        {'dates': pd.date_range('2015-10-30', '2015-11-02')})\n    mapper = DataFrameMapper([\n        ('dates', DateEncoder())\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n    expected = np.array([\n        [2015, 10, 30],\n        [2015, 10, 31],\n        [2015, 11, 1],\n        [2015, 11, 2]\n    ])\n    assert_array_equal(out, expected)\n\n\ndef test_local_input_df_date_encoder():\n    \"\"\"\n    When input_df is True we can apply a transformer that only works\n    with pandas dataframes like a DateEncoder\n    \"\"\"\n    df = pd.DataFrame(\n        {'dates': pd.date_range('2015-10-30', '2015-11-02')})\n    mapper = DataFrameMapper([\n        ('dates', DateEncoder(), {'input_df': True})\n    ], input_df=False)\n    out = mapper.fit_transform(df)\n    expected = np.array([\n        [2015, 10, 30],\n        [2015, 10, 31],\n        [2015, 11, 1],\n        [2015, 11, 2]\n    ])\n    assert_array_equal(out, expected)\n\n\ndef test_nonexistent_columns_explicit_fail(simple_dataframe):\n    \"\"\"\n    If a nonexistent column is selected, KeyError is raised.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    with pytest.raises(KeyError):\n        mapper._get_col_subset(simple_dataframe, ['nonexistent_feature'])\n\n\ndef test_get_col_subset_single_column_array(simple_dataframe):\n    \"\"\"\n    Selecting a single column should return a 1-dimensional numpy array.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    array = mapper._get_col_subset(simple_dataframe, \"a\")\n\n    assert type(array) == np.ndarray\n    assert array.shape == (len(simple_dataframe[\"a\"]),)\n\n\ndef test_get_col_subset_single_column_list(simple_dataframe):\n    \"\"\"\n    Selecting a list of columns (even if the list contains a single element)\n    should return a 2-dimensional numpy array.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    array = mapper._get_col_subset(simple_dataframe, [\"a\"])\n\n    assert type(array) == np.ndarray\n    assert array.shape == (len(simple_dataframe[\"a\"]), 1)\n\n\ndef test_cols_string_array(simple_dataframe):\n    \"\"\"\n    If a string is specified as the columns, the transformer\n    is called with a 1-d array as input.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    mapper = DataFrameMapper([(\"a\", mock_transformer)])\n\n    mapper.fit(df)\n    args, kwargs = mock_transformer.fit.call_args\n    assert args[0].shape == (3,)\n\n\ndef test_cols_list_column_vector(simple_dataframe):\n    \"\"\"\n    If a one-element list is specified as the columns, the transformer\n    is called with a column vector as input.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    mapper = DataFrameMapper([([\"a\"], mock_transformer)])\n\n    mapper.fit(df)\n    args, kwargs = mock_transformer.fit.call_args\n    assert args[0].shape == (3, 1)\n\n\ndef test_handle_feature_2dim():\n    \"\"\"\n    2-dimensional arrays are returned unchanged.\n    \"\"\"\n    array = np.array([[1, 2], [3, 4]])\n    assert_array_equal(_handle_feature(array), array)\n\n\ndef test_handle_feature_1dim():\n    \"\"\"\n    1-dimensional arrays are converted to 2-dimensional column vectors.\n    \"\"\"\n    array = np.array([1, 2])\n    assert_array_equal(_handle_feature(array), np.array([[1], [2]]))\n\n\ndef test_build_transformers():\n    \"\"\"\n    When a list of transformers is passed, return a pipeline with\n    each element of the iterable as a step of the pipeline.\n    \"\"\"\n    transformers = [MockTClassifier(), MockTClassifier()]\n    pipeline = _build_transformer(transformers)\n    assert isinstance(pipeline, Pipeline)\n    for ix, transformer in enumerate(transformers):\n        assert pipeline.steps[ix][1] == transformer\n\n\ndef test_selected_columns():\n    \"\"\"\n    selected_columns returns a set of the columns appearing in the features\n    of the mapper.\n    \"\"\"\n    mapper = DataFrameMapper([\n        ('a', None),\n        (['a', 'b'], None)\n    ])\n    assert mapper._selected_columns == {'a', 'b'}\n\n\ndef test_unselected_columns():\n    \"\"\"\n    unselected_columns returns a list of the columns not appearing in the\n    features of the mapper but present in the given dataframe.\n    \"\"\"\n    df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})\n    mapper = DataFrameMapper([\n        ('a', None),\n        (['a', 'b'], None)\n    ])\n    assert 'c' in mapper._unselected_columns(df)\n\n\ndef test_drop_and_default_false():\n    \"\"\"\n    If default=False, non explicitly selected columns and drop columns\n    are discarded.\n    \"\"\"\n    df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})\n    mapper = DataFrameMapper([\n            ('a', None)\n        ], drop_cols=['c'], default=False)\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (1, 1)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_drop_and_default_none():\n    \"\"\"\n    If default=None, drop columns are discarded and\n    remaining non explicitly selected columns are passed through untransformed\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        ('a', None)\n    ], drop_cols=['c'], default=None)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 2)\n    assert mapper.transformed_names_ == ['a', 'b']\n\n\ndef test_conflicting_drop():\n    \"\"\"\n    Drop column name shouldn't get confused with transformed columns.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        ('a', None)\n    ], drop_cols=['a'], default=False)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 1)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_default_false():\n    \"\"\"\n    If default=False, non explicitly selected columns are discarded.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        ('b', None)\n    ], default=False)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 1)\n\n\ndef test_default_none():\n    \"\"\"\n    If default=None, non explicitly selected columns are passed through\n    untransformed.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        (['a'], OneHotEncoder())\n    ], default=None)\n\n    transformed = mapper.fit_transform(df)\n    assert (transformed[:, 3] == np.array([3, 5, 7]).T).all()\n\n\ndef test_default_none_names():\n    \"\"\"\n    If default=None, column names are returned unmodified.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([], default=None)\n\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['a', 'b']\n\n\ndef test_default_transformer():\n    \"\"\"\n    If default=Transformer, non explicitly selected columns are applied this\n    transformer.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, np.nan, 3], })\n    mapper = DataFrameMapper([], default=Imputer())\n\n    transformed = mapper.fit_transform(df)\n    assert (transformed[: 0] == np.array([1., 2., 3.])).all()\n\n\ndef test_list_transformers_single_arg(simple_dataframe):\n    \"\"\"\n    Multiple transformers can be specified in a list even if some of them\n    only accept one X argument instead of two (X, y).\n    \"\"\"\n    mapper = DataFrameMapper([\n        ('a', [MockXTransformer()])\n    ])\n    # doesn't fail\n    mapper.fit_transform(simple_dataframe)\n\n\ndef test_list_transformers():\n    \"\"\"\n    Specifying a list of transformers applies them sequentially to the\n    selected column.\n    \"\"\"\n    dataframe = pd.DataFrame({\"a\": [1, np.nan, 3], \"b\": [1, 5, 7]},\n                             dtype=np.float64)\n\n    mapper = DataFrameMapper([\n        ([\"a\"], [Imputer(), StandardScaler()]),\n        ([\"b\"], StandardScaler()),\n    ])\n    dmatrix = mapper.fit_transform(dataframe)\n\n    assert pd.isnull(dmatrix).sum() == 0  # no null values\n\n    # all features have mean 0 and std deviation 1 (standardized)\n    assert (abs(dmatrix.mean(axis=0) - 0) <= 1e-6).all()\n    assert (abs(dmatrix.std(axis=0) - 1) <= 1e-6).all()\n\n\ndef test_list_transformers_old_unpickle(simple_dataframe):\n    mapper = DataFrameMapper(None)\n    # simulate the mapper was created with < 1.0.0 code\n    mapper.features = [('a', [MockXTransformer()])]\n    mapper_pickled = pickle.dumps(mapper)\n\n    loaded_mapper = pickle.loads(mapper_pickled)\n    transformer = loaded_mapper.features[0][1]\n    assert isinstance(transformer, TransformerPipeline)\n    assert isinstance(transformer.steps[0][1], MockXTransformer)\n\n\ndef test_sparse_features(simple_dataframe):\n    \"\"\"\n    If any of the extracted features is sparse and \"sparse\" argument\n    is true, the hstacked result is also sparse.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([\n        (\"a\", ToSparseTransformer())\n    ], sparse=True)\n    dmatrix = mapper.fit_transform(df)\n\n    assert type(dmatrix) == sparse.csr.csr_matrix\n\n\ndef test_sparse_off(simple_dataframe):\n    \"\"\"\n    If the resulting features are sparse but the \"sparse\" argument\n    of the mapper is False, return a non-sparse matrix.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([\n        (\"a\", ToSparseTransformer())\n    ], sparse=False)\n\n    dmatrix = mapper.fit_transform(df)\n    assert type(dmatrix) != sparse.csr.csr_matrix\n\n\ndef test_fit_with_optional_y_arg(complex_dataframe):\n    \"\"\"\n    Transformers with an optional y argument in the fit method\n    are handled correctly\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([(['feat1', 'feat2'], MockTClassifier())])\n    # doesn't fail\n    mapper.fit(df[['feat1', 'feat2']], df['target'])\n\n\ndef test_fit_with_required_y_arg(complex_dataframe):\n    \"\"\"\n    Transformers with a required y argument in the fit method\n    are handled and perform correctly\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([(['feat1', 'feat2'], SelectKBest(chi2, k=1))])\n\n    # fit, doesn't fail\n    ft_arr = mapper.fit(df[['feat1', 'feat2']], df['target'])\n\n    # fit_transform\n    ft_arr = mapper.fit_transform(df[['feat1', 'feat2']], df['target'])\n    assert_array_equal(ft_arr, df[['feat1']].values)\n\n    # transform\n    t_arr = mapper.transform(df[['feat1', 'feat2']])\n    assert_array_equal(t_arr, df[['feat1']].values)\n\n\n# Integration tests with real dataframes\n\n@pytest.fixture\ndef iris_dataframe():\n    iris = load_iris()\n    return DataFrame(\n        data={\n            iris.feature_names[0]: iris.data[:, 0],\n            iris.feature_names[1]: iris.data[:, 1],\n            iris.feature_names[2]: iris.data[:, 2],\n            iris.feature_names[3]: iris.data[:, 3],\n            \"species\": np.array([iris.target_names[e] for e in iris.target])\n        }\n    )\n\n\n@pytest.fixture\ndef cars_dataframe():\n    return pd.read_csv(\"tests/test_data/cars.csv.gz\", compression='gzip')\n\n\ndef test_with_iris_dataframe(iris_dataframe):\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"petal length (cm)\", None),\n            (\"petal width (cm)\", None),\n            (\"sepal length (cm)\", None),\n            (\"sepal width (cm)\", None),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = iris_dataframe.drop(\"species\", axis=1)\n    labels = iris_dataframe[\"species\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.96\n    assert (scores.std() * 2) < 0.04\n\n\ndef test_dict_vectorizer():\n    df = pd.DataFrame(\n        [[{'a': 1, 'b': 2}], [{'a': 3}]],\n        columns=['colA']\n    )\n\n    outdf = DataFrameMapper(\n        [('colA', DictVectorizer())],\n        df_out=True,\n        default=False\n    ).fit_transform(df)\n\n    columns = sorted(list(outdf.columns))\n    assert len(columns) == 2\n    assert columns[0] == 'colA_0'\n    assert columns[1] == 'colA_1'\n\n\ndef test_with_car_dataframe(cars_dataframe):\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"description\", CountVectorizer()),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = cars_dataframe.drop(\"model\", axis=1)\n    labels = cars_dataframe[\"model\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.30\n\n\ndef test_direct_cross_validation(iris_dataframe):\n    \"\"\"\n    Starting with sklearn>=0.16.0 we no longer need CV wrappers for dataframes.\n    See https://github.com/paulgb/sklearn-pandas/issues/11\n    \"\"\"\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"petal length (cm)\", None),\n            (\"petal width (cm)\", None),\n            (\"sepal length (cm)\", None),\n            (\"sepal width (cm)\", None),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = iris_dataframe.drop(\"species\", axis=1)\n    labels = iris_dataframe[\"species\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.96\n    assert (scores.std() * 2) < 0.04\n\n\ndef test_heterogeneous_output_types_input_df():\n    \"\"\"\n    Modify feat2, but pass feat1 through unmodified.\n    This fails if input_df == False\n    \"\"\"\n    df = pd.DataFrame({\n        'feat1': [1, 2, 3, 4, 5, 6],\n        'feat2': [1.0, 2.0, 3.0, 2.0, 3.0, 4.0]\n    })\n    M = DataFrameMapper([\n        (['feat2'], StandardScaler())\n        ], input_df=True, df_out=True, default=None)\n    dft = M.fit_transform(df)\n    assert dft['feat1'].dtype == np.dtype('int64')\n    assert dft['feat2'].dtype == np.dtype('float64')\n\n\ndef test_make_column_selector(iris_dataframe):\n    t = DataFrameMapper([\n        (make_column_selector(dtype_include=float), None, {'alias': 'x'}),\n        ('sepal length (cm)', None),\n    ], df_out=True, default=False)\n\n    xt = t.fit(iris_dataframe).transform(iris_dataframe)\n    expected = ['x_0', 'x_1', 'x_2', 'x_3', 'sepal length (cm)']\n    assert list(xt.columns) == expected\n\n    pickled = pickle.dumps(t)\n    t2 = pickle.loads(pickled)\n    xt2 = t2.transform(iris_dataframe)\n    assert np.array_equal(xt.values, xt2.values)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
      "content": "from collections import Counter\n\nimport pytest\nimport numpy as np\nfrom pandas import DataFrame\nfrom numpy.testing import assert_array_equal\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas.features_generator import gen_features\n\n\nclass MockClass(object):\n\n    def __init__(self, value=1, name='class'):\n        self.value = value\n        self.name = name\n\n\nclass MockTransformer(object):\n\n    def __init__(self):\n        self.most_common_ = None\n\n    def fit(self, X, y=None):\n        [(value, _)] = Counter(X).most_common(1)\n        self.most_common_ = value\n        return self\n\n    def transform(self, X, y=None):\n        return np.asarray([self.most_common_] * len(X))\n\n\n@pytest.fixture\ndef simple_dataset():\n    return DataFrame({\n        'feat1': [1, 2, 1, 3, 1],\n        'feat2': [1, 2, 2, 2, 3],\n        'feat3': [1, 2, 3, 4, 5],\n    })\n\n\ndef test_generate_features_with_default_parameters():\n    \"\"\"\n    Tests generating features from classes with default init arguments.\n    \"\"\"\n    columns = ['colA', 'colB', 'colC']\n    feature_defs = gen_features(columns=columns, classes=[MockClass])\n    assert len(feature_defs) == len(columns)\n\n    for feature in feature_defs:\n        assert feature[2] == {}\n\n    feature_dict = dict([_[0:2] for _ in feature_defs])\n    assert columns == sorted(feature_dict.keys())\n\n    # default init arguments for MockClass for clarification.\n    expected = {'value': 1, 'name': 'class'}\n    for column, transformers in feature_dict.items():\n        for obj in transformers:\n            assert_attributes(obj, **expected)\n\n\ndef test_generate_features_with_several_classes():\n    \"\"\"\n    Tests generating features pipeline with different transformers parameters.\n    \"\"\"\n    feature_defs = gen_features(\n        columns=['colA', 'colB', 'colC'],\n        classes=[\n            {'class': MockClass},\n            {'class': MockClass, 'name': 'mockA'},\n            {'class': MockClass, 'name': 'mockB', 'value': None}\n        ]\n    )\n\n    for col, transformers, params in feature_defs:\n        assert_attributes(transformers[0], name='class', value=1)\n        assert_attributes(transformers[1], name='mockA', value=1)\n        assert_attributes(transformers[2], name='mockB', value=None)\n\n\ndef test_generate_features_with_none_only_transformers():\n    \"\"\"\n    Tests generating \"dummy\" feature definition which doesn't apply any\n    transformation.\n    \"\"\"\n    feature_defs = gen_features(\n        columns=['colA', 'colB', 'colC'], classes=[None])\n\n    expected = [('colA', None, {}),\n                ('colB', None, {}),\n                ('colC', None, {})]\n\n    assert feature_defs == expected\n\n\ndef test_compatibility_with_data_frame_mapper(simple_dataset):\n    \"\"\"\n    Tests compatibility of generated feature definition with DataFrameMapper.\n    \"\"\"\n    features_defs = gen_features(\n        columns=['feat1', 'feat2'],\n        classes=[MockTransformer])\n    features_defs.append(('feat3', None))\n\n    mapper = DataFrameMapper(features_defs)\n    X = mapper.fit_transform(simple_dataset)\n    expected = np.asarray([\n        [1, 2, 1],\n        [1, 2, 2],\n        [1, 2, 3],\n        [1, 2, 4],\n        [1, 2, 5]\n    ])\n\n    assert_array_equal(X, expected)\n\n\ndef assert_attributes(obj, **attrs):\n    for attr, value in attrs.items():\n        assert getattr(obj, attr) == value\n"
    }
  ],
  "OriginCode": [
    {
      "path": "scikit-learn-contrib_sklearn-pandas/test.py",
      "content": "import pytest\nfrom unittest.mock import Mock\nimport numpy as np\nimport pandas as pd\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn.compose import make_column_selector\nfrom sklearn.preprocessing import StandardScaler\n\n\nclass GetStartWith:\n    def __init__(self, start_str):\n        self.start_str = start_str\n\n    def __call__(self, X: pd.DataFrame) -> list:\n        return [c for c in X.columns if c.startswith(self.start_str)]\n\n\ndf = pd.DataFrame({\n    'sepal length (cm)': [1.0, 2.0, 3.0],\n    'sepal width (cm)': [1.0, 2.0, 3.0],\n    'petal length (cm)': [1.0, 2.0, 3.0],\n    'petal width (cm)': [1.0, 2.0, 3.0]\n})\nt = DataFrameMapper([\n    (make_column_selector(dtype_include=float), StandardScaler(), {'alias': 'x'}),\n    (GetStartWith('petal'), None, {'alias': 'petal'})\n], df_out=True, default=False)\n\nt.fit(df)\nprint(t.transform(df).shape)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/setup.py",
      "content": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\nfrom setuptools import setup\nfrom setuptools.command.test import test as TestCommand\nimport re\n\nfor line in open('sklearn_pandas/__init__.py'):\n    match = re.match(\"__version__ *= *'(.*)'\", line)\n    if match:\n        __version__, = match.groups()\n\n\nclass PyTest(TestCommand):\n    user_options = [('pytest-args=', 'a', \"Arguments to pass to py.test\")]\n\n    def initialize_options(self):\n        TestCommand.initialize_options(self)\n        self.pytest_args = []\n\n    def finalize_options(self):\n        TestCommand.finalize_options(self)\n        self.test_args = []\n        self.test_suite = True\n\n    def run(self):\n        import pytest\n        errno = pytest.main(self.pytest_args)\n        raise SystemExit(errno)\n\n\nsetup(name='sklearn-pandas',\n      version=__version__,\n      description='Pandas integration with sklearn',\n      maintainer='Ritesh Agrawal',\n      maintainer_email='ragrawal@gmail.com',\n      url='https://github.com/scikit-learn-contrib/sklearn-pandas',\n      packages=['sklearn_pandas'],\n      keywords=['scikit', 'sklearn', 'pandas'],\n      install_requires=[\n          'scikit-learn>=0.23.0',\n          'scipy>=1.5.1',\n          'pandas>=1.1.4',\n          'numpy>=1.18.1'\n      ],\n      tests_require=['pytest', 'mock'],\n      cmdclass={'test': PyTest},\n      license='MIT License'\n)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/noxfile.py",
      "content": "import nox\n\n@nox.session\ndef lint(session):\n    session.install('pytest>=5.3.5', 'setuptools>=45.2',\n                    'wheel>=0.34.2', 'flake8>=3.7.9',\n                    'numpy==1.18.1', 'pandas==1.1.4')\n    session.install('.')\n    session.run('flake8', 'sklearn_pandas/', 'tests')\n\n@nox.session\n@nox.parametrize('numpy', ['1.18.1', '1.19.4', '1.20.1'])\n@nox.parametrize('scipy', ['1.5.4', '1.6.0'])\n@nox.parametrize('pandas', ['1.1.4', '1.2.2'])\ndef tests(session, numpy, scipy, pandas):\n    session.install('pytest>=5.3.5', \n                    'setuptools>=45.2',\n                    'wheel>=0.34.2',\n                    f'numpy=={numpy}',\n                    f'scipy=={scipy}',\n                    f'pandas=={pandas}'\n                    )\n    session.install('.')\n    session.run('py.test', 'README.rst', 'tests')\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/features_generator.py",
      "content": "def gen_features(columns, classes=None, prefix='', suffix=''):\n    \"\"\"Generates a feature definition list which can be passed\n    into DataFrameMapper\n\n    Params:\n\n    columns     a list of column names to generate features for.\n\n    classes     a list of classes for each feature, a list of dictionaries with\n                transformer class and init parameters, or None.\n\n                If list of classes is provided, then each of them is\n                instantiated with default arguments. Example:\n\n                    classes = [StandardScaler, LabelBinarizer]\n\n                If list of dictionaries is provided, then each of them should\n                have a 'class' key with transformer class. All other keys are\n                passed into 'class' value constructor. Example:\n\n                    classes = [\n                        {'class': StandardScaler, 'with_mean': False},\n                        {'class': LabelBinarizer}\n                    }]\n\n                If None value selected, then each feature left as is.\n\n    prefix      add prefix to transformed column names\n\n    suffix      add suffix to transformed column names.\n\n    \"\"\"\n    if classes is None:\n        return [(column, None) for column in columns]\n\n    feature_defs = []\n\n    for column in columns:\n        feature_transformers = []\n\n        arguments = {}\n        if prefix and prefix != \"\":\n            arguments['prefix'] = prefix\n        if suffix and suffix != \"\":\n            arguments['suffix'] = suffix\n\n        classes = [cls for cls in classes if cls is not None]\n        if not classes:\n            feature_defs.append((column, None, arguments))\n\n        else:\n            for definition in classes:\n                if isinstance(definition, dict):\n                    params = definition.copy()\n                    klass = params.pop('class')\n                    feature_transformers.append(klass(**params))\n                else:\n                    feature_transformers.append(definition())\n\n            if not feature_transformers:\n                feature_transformers = None\n\n            feature_defs.append((column, feature_transformers, arguments))\n\n    return feature_defs\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/transformers.py",
      "content": "import numpy as np\nimport pandas as pd\nfrom sklearn.base import TransformerMixin\nimport warnings\n\n\ndef _get_mask(X, value):\n    \"\"\"\n    Compute the boolean mask X == missing_values.\n    \"\"\"\n    if value == \"NaN\" or \\\n       value is None or \\\n       (isinstance(value, float) and np.isnan(value)):\n        return pd.isnull(X)\n    else:\n        return X == value\n\n\nclass NumericalTransformer(TransformerMixin):\n    \"\"\"\n    Provides commonly used numerical transformers.\n    \"\"\"\n    SUPPORTED_FUNCTIONS = ['log', 'log1p']\n\n    def __init__(self, func):\n        \"\"\"\n        Params\n\n        func    function to apply to input columns. The function will be\n                applied to each value. Supported functions are defined\n                in SUPPORTED_FUNCTIONS variable. Throws assertion error if the\n                not supported.\n        \"\"\"\n\n        warnings.warn(\"\"\"\n            NumericalTransformer will be deprecated in 3.0 version.\n            Please use Sklearn.base.TransformerMixin to write\n            customer transformers\n            \"\"\", DeprecationWarning)\n\n        assert func in self.SUPPORTED_FUNCTIONS, \\\n            f\"Only following func are supported: {self.SUPPORTED_FUNCTIONS}\"\n        super(NumericalTransformer, self).__init__()\n        self.__func = func\n\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X, y=None):\n        if self.__func == 'log1p':\n            return np.vectorize(np.log1p)(X)\n        elif self.__func == 'log':\n            return np.vectorize(np.log)(X)\n\n        raise ValueError(f\"Invalid function name: {self.__func}\")\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/__init__.py",
      "content": "__version__ = '2.2.0'\n\nimport logging\nlogger = logging.getLogger(__name__)\n\nfrom .dataframe_mapper import DataFrameMapper  # NOQA\nfrom .features_generator import gen_features  # NOQA\nfrom .transformers import NumericalTransformer # NOQA\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/pipeline.py",
      "content": "import six\nfrom sklearn.pipeline import _name_estimators, Pipeline\nfrom sklearn.utils import tosequence\n\n\ndef _call_fit(fit_method, X, y=None, **kwargs):\n    \"\"\"\n    helper function, calls the fit or fit_transform method with the correct\n    number of parameters\n\n    fit_method: fit or fit_transform method of the transformer\n    X: the data to fit\n    y: the target vector relative to X, optional\n    kwargs: any keyword arguments to the fit method\n\n    return: the result of the fit or fit_transform method\n\n    WARNING: if this function raises a TypeError exception, test the fit\n    or fit_transform method passed to it in isolation as _call_fit will not\n    distinguish TypeError due to incorrect number of arguments from\n    other TypeError\n    \"\"\"\n    try:\n        return fit_method(X, y, **kwargs)\n    except TypeError:\n        # fit takes only one argument\n        return fit_method(X, **kwargs)\n\n\nclass TransformerPipeline(Pipeline):\n    \"\"\"\n    Pipeline that expects all steps to be transformers taking a single X\n    argument, an optional y argument, and having fit and transform methods.\n\n    Code is copied from sklearn's Pipeline\n    \"\"\"\n\n    def __init__(self, steps):\n        names, estimators = zip(*steps)\n        if len(dict(steps)) != len(steps):\n            raise ValueError(\n                \"Provided step names are not unique: %s\" % (names,))\n\n        # shallow copy of steps\n        self.steps = tosequence(steps)\n        estimator = estimators[-1]\n\n        for e in estimators:\n            if (not (hasattr(e, \"fit\") or hasattr(e, \"fit_transform\")) or not\n                    hasattr(e, \"transform\")):\n                raise TypeError(\"All steps of the chain should \"\n                                \"be transforms and implement fit and transform\"\n                                \" '%s' (type %s) doesn't)\" % (e, type(e)))\n\n        if not hasattr(estimator, \"fit\"):\n            raise TypeError(\"Last step of chain should implement fit \"\n                            \"'%s' (type %s) doesn't)\"\n                            % (estimator, type(estimator)))\n\n    def _pre_transform(self, X, y=None, **fit_params):\n        fit_params_steps = dict((step, {}) for step, _ in self.steps)\n        for pname, pval in six.iteritems(fit_params):\n            step, param = pname.split('__', 1)\n            fit_params_steps[step][param] = pval\n        Xt = X\n        for name, transform in self.steps[:-1]:\n            if hasattr(transform, \"fit_transform\"):\n                Xt = _call_fit(transform.fit_transform,\n                               Xt, y, **fit_params_steps[name])\n            else:\n                Xt = _call_fit(transform.fit,\n                               Xt, y, **fit_params_steps[name]).transform(Xt)\n        return Xt, fit_params_steps[self.steps[-1][0]]\n\n    def fit(self, X, y=None, **fit_params):\n        Xt, fit_params = self._pre_transform(X, y, **fit_params)\n        _call_fit(self.steps[-1][-1].fit, Xt, y, **fit_params)\n        return self\n\n    def fit_transform(self, X, y=None, **fit_params):\n        Xt, fit_params = self._pre_transform(X, y, **fit_params)\n        if hasattr(self.steps[-1][-1], 'fit_transform'):\n            return _call_fit(self.steps[-1][-1].fit_transform,\n                             Xt, y, **fit_params)\n        else:\n            return _call_fit(self.steps[-1][-1].fit,\n                             Xt, y, **fit_params).transform(Xt)\n\n\ndef make_transformer_pipeline(*steps):\n    \"\"\"Construct a TransformerPipeline from the given estimators.\n    \"\"\"\n    return TransformerPipeline(_name_estimators(steps))\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/dataframe_mapper.py",
      "content": "import contextlib\nfrom datetime import datetime\nimport pandas as pd\nimport numpy as np\nfrom scipy import sparse\nfrom sklearn.base import BaseEstimator, TransformerMixin\nfrom .cross_validation import DataWrapper\nfrom .pipeline import make_transformer_pipeline, _call_fit, TransformerPipeline\nfrom . import logger\n\nstring_types = text_type = str\n\n\ndef _handle_feature(fea):\n    \"\"\"\n    Convert 1-dimensional arrays to 2-dimensional column vectors.\n    \"\"\"\n    if len(fea.shape) == 1:\n        fea = np.array([fea]).T\n\n    return fea\n\n\ndef _build_transformer(transformers):\n    if isinstance(transformers, list):\n        transformers = make_transformer_pipeline(*transformers)\n    return transformers\n\n\ndef _build_feature(columns, transformers, options={}, X=None):\n    if X is None:\n        return (columns, _build_transformer(transformers), options)\n    return (\n        columns(X) if callable(columns) else columns,\n        _build_transformer(transformers),\n        options\n    )\n\n\ndef _elapsed_secs(t1):\n    return (datetime.now()-t1).total_seconds()\n\n\ndef _get_feature_names(estimator):\n    \"\"\"\n    Attempt to extract feature names based on a given estimator\n    \"\"\"\n    if hasattr(estimator, 'classes_'):\n        return estimator.classes_\n    elif hasattr(estimator, 'get_feature_names'):\n        return estimator.get_feature_names()\n    return None\n\n\n@contextlib.contextmanager\ndef add_column_names_to_exception(column_names):\n    # Stolen from https://stackoverflow.com/a/17677938/356729\n    try:\n        yield\n    except Exception as ex:\n        if ex.args:\n            msg = u'{}: {}'.format(column_names, ex.args[0])\n        else:\n            msg = text_type(column_names)\n        ex.args = (msg,) + ex.args[1:]\n        raise\n\n\nclass DataFrameMapper(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Map Pandas data frame column subsets to their own\n    sklearn transformation.\n    \"\"\"\n\n    def __init__(self, features, default=False, sparse=False, df_out=False,\n                 input_df=False, drop_cols=None):\n        \"\"\"\n        Params:\n\n        features    a list of tuples with features definitions.\n                    The first element is the pandas column selector. This can\n                    be a string (for one column) or a list of strings.\n                    The second element is an object that supports\n                    sklearn's transform interface, or a list of such objects\n                    The third element is optional and, if present, must be\n                    a dictionary with the options to apply to the\n                    transformation. Example: {'alias': 'day_of_week'}\n\n        default     default transformer to apply to the columns not\n                    explicitly selected in the mapper. If False (default),\n                    discard them. If None, pass them through untouched. Any\n                    other transformer will be applied to all the unselected\n                    columns as a whole, taken as a 2d-array.\n\n        sparse      will return sparse matrix if set True and any of the\n                    extracted features is sparse. Defaults to False.\n\n        df_out      return a pandas data frame, with each column named using\n                    the pandas column that created it (if there's only one\n                    input and output) or the input columns joined with '_'\n                    if there's multiple inputs, and the name concatenated with\n                    '_1', '_2' etc if there's multiple outputs. NB: does not\n                    work if *default* or *sparse* are true\n\n        input_df    If ``True`` pass the selected columns to the transformers\n                    as a pandas DataFrame or Series. Otherwise pass them as a\n                    numpy array. Defaults to ``False``.\n\n        drop_cols   List of columns to be dropped. Defaults to None.\n\n        \"\"\"\n        self.features = features\n        self.default = default\n        self.built_default = None\n        self.sparse = sparse\n        self.df_out = df_out\n        self.input_df = input_df\n        self.drop_cols = [] if drop_cols is None else drop_cols\n        self.transformed_names_ = []\n        if (df_out and (sparse or default)):\n            raise ValueError(\"Can not use df_out with sparse or default\")\n\n    def _build(self, X=None):\n        \"\"\"\n        Build attributes built_features and built_default.\n        \"\"\"\n        if isinstance(self.features, list):\n            self.built_features = [\n                _build_feature(*f, X=X) for f in self.features\n            ]\n        else:\n            self.built_features = _build_feature(*self.features, X=X)\n        self.built_default = _build_transformer(self.default)\n\n    @property\n    def _selected_columns(self):\n        \"\"\"\n        Return a set of selected columns in the feature list.\n        \"\"\"\n        selected_columns = set()\n        for feature in self.features:\n            columns = feature[0]\n            if isinstance(columns, list):\n                selected_columns = selected_columns.union(set(columns))\n            else:\n                selected_columns.add(columns)\n        return selected_columns\n\n    def _unselected_columns(self, X):\n        \"\"\"\n        Return list of columns present in X and not selected explicitly in the\n        mapper.\n\n        Unselected columns are returned in the order they appear in the\n        dataframe to avoid issues with different ordering during default fit\n        and transform steps.\n        \"\"\"\n        X_columns = list(X.columns)\n        return [column for column in X_columns if\n                column not in self._selected_columns\n                and column not in self.drop_cols]\n\n    def __setstate__(self, state):\n        # compatibility for older versions of sklearn-pandas\n        super().__setstate__(state)\n        self.features = [_build_feature(*feat) for feat in state['features']]\n        self.sparse = state.get('sparse', False)\n        self.default = state.get('default', False)\n        self.df_out = state.get('df_out', False)\n        self.input_df = state.get('input_df', False)\n        self.drop_cols = state.get('drop_cols', [])\n        self.built_features = state.get('built_features', self.features)\n        self.built_default = state.get('built_default', self.default)\n        self.transformed_names_ = state.get('transformed_names_', [])\n\n    def __getstate__(self):\n        state = super().__getstate__()\n        state['features'] = self.features\n        state['sparse'] = self.sparse\n        state['default'] = self.default\n        state['df_out'] = self.df_out\n        state['input_df'] = self.input_df\n        state['drop_cols'] = self.drop_cols\n        state['build_features'] = getattr(self, 'built_features', None)\n        state['built_default'] = self.built_default\n        state['transformed_names_'] = self.transformed_names_\n        return state\n\n    def _get_col_subset(self, X, cols, input_df=False):\n        \"\"\"\n        Get a subset of columns from the given table X.\n\n        X       a Pandas dataframe; the table to select columns from\n        cols    a string or list of strings representing the columns to select.\n                It can also be a callable that returns True or False, i.e.\n                compatible with the built-in filter function.\n\n        Returns a numpy array with the data from the selected columns\n        \"\"\"\n\n        if isinstance(cols, string_types):\n            return_vector = True\n            cols = [cols]\n        else:\n            return_vector = False\n\n        # Needed when using the cross-validation compatibility\n        # layer for sklearn<0.16.0.\n        # Will be dropped on sklearn-pandas 2.0.\n        if isinstance(X, list):\n            X = [x[cols] for x in X]\n            X = pd.DataFrame(X)\n\n        elif isinstance(X, DataWrapper):\n            X = X.df  # fetch underlying data\n\n        if return_vector:\n            t = X[cols[0]]\n        else:\n            t = X[cols]\n\n        # return either a DataFrame/Series or a numpy array\n        if input_df:\n            return t\n        else:\n            return t.values\n\n    def fit(self, X, y=None):\n        \"\"\"\n        Fit a transformation from the pipeline\n\n        X       the data to fit\n\n        y       the target vector relative to X, optional\n\n        \"\"\"\n        self._build(X=X)\n\n        for columns, transformers, options in self.built_features:\n            t1 = datetime.now()\n            input_df = options.get('input_df', self.input_df)\n\n            if transformers is not None:\n                with add_column_names_to_exception(columns):\n                    Xt = self._get_col_subset(X, columns, input_df)\n                    _call_fit(transformers.fit, Xt, y)\n            logger.info(f\"[FIT] {columns}: {_elapsed_secs(t1)} secs\")\n\n        # handle features not explicitly selected\n        if self.built_default:  # not False and not None\n            unsel_cols = self._unselected_columns(X)\n            with add_column_names_to_exception(unsel_cols):\n                Xt = self._get_col_subset(X, unsel_cols, self.input_df)\n                _call_fit(self.built_default.fit, Xt, y)\n        return self\n\n    def get_names(self, columns, transformer, x, alias=None, prefix='',\n                  suffix=''):\n        \"\"\"\n        Return verbose names for the transformed columns.\n\n        columns       name (or list of names) of the original column(s)\n        transformer   transformer - can be a TransformerPipeline\n        x             transformed columns (numpy.ndarray)\n        alias         base name to use for the selected columns\n        \"\"\"\n        if alias is not None:\n            name = alias\n        elif isinstance(columns, list):\n            name = '_'.join(map(str, columns))\n        else:\n            name = columns\n        num_cols = x.shape[1] if len(x.shape) > 1 else 1\n\n        output = []\n\n        if num_cols > 1:\n            # If there are as many columns as classes in the transformer,\n            # infer column names from classes names.\n\n            # If we are dealing with multiple transformers for these columns\n            # attempt to extract the names from each of them, starting from the\n            # last one\n            if isinstance(transformer, TransformerPipeline):\n                inverse_steps = transformer.steps[::-1]\n                estimators = (estimator for name, estimator in inverse_steps)\n                names_steps = (_get_feature_names(e) for e in estimators)\n                names = next((n for n in names_steps if n is not None), None)\n            # Otherwise use the only estimator present\n            else:\n                names = _get_feature_names(transformer)\n\n            if names is not None and len(names) == num_cols:\n                output = [f\"{name}_{o}\" for o in names]\n                # otherwise, return name concatenated with '_1', '_2', etc.\n            else:\n                output = [name + '_' + str(o) for o in range(num_cols)]\n        else:\n            output = [name]\n\n        if prefix == suffix == \"\":\n            return output\n\n        return ['{}{}{}'.format(prefix, x, suffix) for x in output]\n\n    def get_dtypes(self, extracted):\n        dtypes_features = [self.get_dtype(ex) for ex in extracted]\n        return [dtype for dtype_feature in dtypes_features\n                for dtype in dtype_feature]\n\n    def get_dtype(self, ex):\n        if isinstance(ex, np.ndarray) or sparse.issparse(ex):\n            return [ex.dtype] * ex.shape[1]\n        elif isinstance(ex, pd.DataFrame):\n            return list(ex.dtypes)\n        else:\n            raise TypeError(type(ex))\n\n    def _transform(self, X, y=None, do_fit=False):\n        \"\"\"\n        Transform the given data with possibility to fit in advance.\n        Avoids code duplication for implementation of transform and\n        fit_transform.\n        \"\"\"\n        if do_fit:\n            self._build(X=X)\n\n        extracted = []\n        transformed_names_ = []\n        for columns, transformers, options in self.built_features:\n            input_df = options.get('input_df', self.input_df)\n\n            # columns could be a string or list of\n            # strings; we don't care because pandas\n            # will handle either.\n            Xt = self._get_col_subset(X, columns, input_df)\n\n            if transformers is not None:\n                with add_column_names_to_exception(columns):\n                    if do_fit and hasattr(transformers, 'fit_transform'):\n                        t1 = datetime.now()\n                        Xt = _call_fit(transformers.fit_transform, Xt, y)\n                        logger.info(f\"[FIT_TRANSFORM] {columns}: {_elapsed_secs(t1)} secs\")  # NOQA\n                    else:\n                        if do_fit:\n                            t1 = datetime.now()\n                            _call_fit(transformers.fit, Xt, y)\n                            logger.info(\n                                f\"[FIT] {columns}: {_elapsed_secs(t1)} secs\")\n\n                        t1 = datetime.now()\n                        Xt = transformers.transform(Xt)\n                        logger.info(f\"[TRANSFORM] {columns}: {_elapsed_secs(t1)} secs\")  # NOQA\n\n            extracted.append(_handle_feature(Xt))\n\n            alias = options.get('alias')\n\n            prefix = options.get('prefix', '')\n            suffix = options.get('suffix', '')\n\n            transformed_names_ += self.get_names(\n                columns, transformers, Xt, alias, prefix, suffix)\n\n        # handle features not explicitly selected\n        if self.built_default is not False:\n            unsel_cols = self._unselected_columns(X)\n            Xt = self._get_col_subset(X, unsel_cols, self.input_df)\n            if self.built_default is not None:\n                with add_column_names_to_exception(unsel_cols):\n                    if do_fit and hasattr(self.built_default, 'fit_transform'):\n                        Xt = _call_fit(self.built_default.fit_transform, Xt, y)\n                    else:\n                        if do_fit:\n                            _call_fit(self.built_default.fit, Xt, y)\n                        Xt = self.built_default.transform(Xt)\n                transformed_names_ += self.get_names(\n                    unsel_cols, self.built_default, Xt)\n            else:\n                # if not applying a default transformer,\n                # keep column names unmodified\n                transformed_names_ += unsel_cols\n\n            extracted.append(_handle_feature(Xt))\n\n        self.transformed_names_ = transformed_names_\n\n        # combine the feature outputs into one array.\n        # at this point we lose track of which features\n        # were created from which input columns, so it's\n        # assumed that that doesn't matter to the model.\n\n        # If any of the extracted features is sparse, combine sparsely.\n        # Otherwise, combine as normal arrays.\n        if any(sparse.issparse(fea) for fea in extracted):\n            stacked = sparse.hstack(extracted).tocsr()\n            # return a sparse matrix only if the mapper was initialized\n            # with sparse=True\n            if not self.sparse:\n                stacked = stacked.toarray()\n        else:\n            stacked = np.hstack(extracted)\n\n        if self.df_out:\n            # if no rows were dropped preserve the original index,\n            # otherwise use a new integer one\n            no_rows_dropped = len(X) == len(stacked)\n            if no_rows_dropped:\n                index = X.index\n            else:\n                index = None\n\n            # output different data types, if appropriate\n            dtypes = self.get_dtypes(extracted)\n            df_out = pd.DataFrame(\n                stacked,\n                columns=self.transformed_names_,\n                index=index)\n            # preserve types\n            for col, dtype in zip(self.transformed_names_, dtypes):\n                df_out[col] = df_out[col].astype(dtype)\n            return df_out\n        else:\n            return stacked\n\n    def transform(self, X):\n        \"\"\"\n        Transform the given data. Assumes that fit has already been called.\n\n        X       the data to transform\n        \"\"\"\n        return self._transform(X)\n\n    def fit_transform(self, X, y=None):\n        \"\"\"\n        Fit a transformation from the pipeline and directly apply\n        it to the given data.\n\n        X       the data to fit\n\n        y       the target vector relative to X, optional\n        \"\"\"\n        return self._transform(X, y, True)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/cross_validation.py",
      "content": "class DataWrapper(object):\n\n    def __init__(self, df):\n        self.df = df\n\n    def __len__(self):\n        return len(self.df)\n\n    def __getitem__(self, key):\n        return self.df.iloc[key]\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_pipeline.py",
      "content": "import pytest\nfrom sklearn_pandas.pipeline import TransformerPipeline, _call_fit\n\n# In py3, mock is included with the unittest standard library\n# In py2, it's a separate package\ntry:\n    from unittest.mock import patch\nexcept ImportError:\n    from mock import patch\n\n\nclass NoTransformT(object):\n    \"\"\"Transformer without transform method.\n    \"\"\"\n    def fit(self, x):\n        return self\n\n\nclass NoFitT(object):\n    \"\"\"Transformer without fit method.\n    \"\"\"\n    def transform(self, x):\n        return self\n\n\nclass Trans(object):\n    \"\"\"\n    Transformer with fit and transform methods\n    \"\"\"\n    def fit(self, x, y=None):\n        return self\n\n    def transform(self, x):\n        return self\n\n\ndef func_x_y(x, y, kwarg='kwarg'):\n    \"\"\"\n    Function with required x and y arguments\n    \"\"\"\n    return\n\n\ndef func_x(x, kwarg='kwarg'):\n    \"\"\"\n    Function with required x argument\n    \"\"\"\n    return\n\n\ndef func_raise_type_err(x, y, kwarg='kwarg'):\n    \"\"\"\n    Function with required x and y arguments,\n    raises TypeError\n    \"\"\"\n    raise TypeError\n\n\ndef test_all_steps_fit_transform():\n    \"\"\"\n    All steps must implement fit and transform. Otherwise, raise TypeError.\n    \"\"\"\n    with pytest.raises(TypeError):\n        TransformerPipeline([('svc', NoTransformT())])\n\n    with pytest.raises(TypeError):\n        TransformerPipeline([('svc', NoFitT())])\n\n\n@patch.object(Trans, 'fit', side_effect=func_x_y)\ndef test_called_with_x_and_y(mock_fit):\n    \"\"\"\n    Fit method with required X and y arguments is called with both and with\n    any additional keywords\n    \"\"\"\n    _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n    mock_fit.assert_called_with('X', 'y', kwarg='kwarg')\n\n\n@patch.object(Trans, 'fit', side_effect=func_x)\ndef test_called_with_x(mock_fit):\n    \"\"\"\n    Fit method with a required X arguments is called with it and with\n    any additional keywords\n    \"\"\"\n    _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n    mock_fit.assert_called_with('X', kwarg='kwarg')\n\n    _call_fit(Trans().fit, 'X', kwarg='kwarg')\n    mock_fit.assert_called_with('X', kwarg='kwarg')\n\n\n@patch.object(Trans, 'fit', side_effect=func_raise_type_err)\ndef test_raises_type_error(mock_fit):\n    \"\"\"\n    If a fit method with required X and y arguments raises a TypeError, it's\n    re-raised (for a different reason) when it's called with one argument\n    \"\"\"\n    with pytest.raises(TypeError):\n        _call_fit(Trans().fit, 'X', 'y', kwarg='kwarg')\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_transformers.py",
      "content": "import tempfile\nimport pytest\nimport numpy as np\nfrom pandas import DataFrame\nimport joblib\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas import NumericalTransformer\n\n\n@pytest.fixture\ndef simple_dataset():\n    return DataFrame({\n        'feat1': [1, 2, 1, 3, 1],\n        'feat2': [1, 2, 2, 2, 3],\n        'feat3': [1, 2, 3, 4, 5],\n    })\n\n\ndef test_common_numerical_transformer(simple_dataset):\n    \"\"\"\n    Test log transformation\n    \"\"\"\n    transfomer = DataFrameMapper([\n        ('feat1', NumericalTransformer('log'))\n    ], df_out=True)\n    df = simple_dataset\n    outDF = transfomer.fit_transform(df)\n    assert list(outDF.columns) == ['feat1']\n    assert np.array_equal(df['feat1'].apply(np.log).values, outDF.feat1.values)\n\n\ndef test_numerical_transformer_serialization(simple_dataset):\n    \"\"\"\n    Test if you can serialize transformer\n    \"\"\"\n    transfomer = DataFrameMapper([\n        ('feat1', NumericalTransformer('log'))\n    ])\n\n    df = simple_dataset\n    transfomer.fit(df)\n    f = tempfile.NamedTemporaryFile(delete=True)\n    joblib.dump(transfomer, f.name)\n    transfomer2 = joblib.load(f.name)\n    np.array_equal(transfomer.transform(df), transfomer2.transform(df))\n    f.close()\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
      "content": "# -*- coding: utf8 -*-\n\nimport pytest\nfrom unittest.mock import Mock\nfrom pandas import DataFrame\nimport pandas as pd\nfrom scipy import sparse\nfrom sklearn.datasets import load_iris\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.model_selection import cross_val_score\nfrom sklearn.svm import SVC\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom sklearn.feature_extraction import DictVectorizer\nfrom sklearn.preprocessing import (\n    StandardScaler, OneHotEncoder, LabelBinarizer)\nfrom sklearn.impute import SimpleImputer as Imputer\nfrom sklearn.feature_selection import SelectKBest, chi2\nfrom sklearn.base import BaseEstimator, TransformerMixin\nimport sklearn.decomposition\nimport numpy as np\nfrom numpy.testing import assert_array_equal\nimport pickle\nfrom sklearn.compose import make_column_selector\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas.dataframe_mapper import _handle_feature, _build_transformer\nfrom sklearn_pandas.pipeline import TransformerPipeline\n\n\nclass MockXTransformer(object):\n    \"\"\"\n    Mock transformer that accepts no y argument.\n    \"\"\"\n    def fit(self, X):\n        return self\n\n    def transform(self, X):\n        return X\n\n\nclass MockTClassifier(object):\n    \"\"\"\n    Mock transformer/classifier.\n    \"\"\"\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        return X\n\n    def predict(self, X):\n        return True\n\n\nclass DateEncoder():\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        dt = X.dt\n        return pd.concat([dt.year, dt.month, dt.day], axis=1)\n\n\nclass ToSparseTransformer(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Transforms numpy matrix to sparse format.\n    \"\"\"\n    def fit(self, X):\n        return self\n\n    def transform(self, X):\n        return sparse.csr_matrix(X)\n\n\nclass CustomTransformer(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Example of transformer in which the number of classes\n    is not equals to the number of output columns.\n    \"\"\"\n    def fit(self, X, y=None):\n        self.min = X.min()\n        self.classes_ = np.unique(X)\n        return self\n\n    def transform(self, X):\n        classes = np.unique(X)\n        if len(np.setdiff1d(classes, self.classes_)) > 0:\n            raise ValueError('Unknown values found.')\n        return X - self.min\n\n\nclass MockImageTransformer(BaseEstimator, TransformerMixin):\n    \"\"\"\n    Example transformer that takes the max of a 2d vector\n    then scales the result.\n    \"\"\"\n    def __init__(self, multiplier=10.0):\n        self.multiplier = multiplier\n\n    def fit(self, X, y=None):\n        return self\n\n    def transform(self, X):\n        assert isinstance(X, pd.DataFrame)\n        for col in X.columns:\n            X[col] = X[col].map(lambda img: np.max(img))\n        return X * self.multiplier\n\n\n@pytest.fixture\ndef simple_dataframe():\n    return pd.DataFrame({'a': [1, 2, 3]})\n\n\n@pytest.fixture\ndef complex_dataframe():\n    return pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'c'],\n                         'feat1': [1, 2, 3, 4, 5, 6],\n                         'feat2': [1, 2, 3, 2, 3, 4]})\n\n\n@pytest.fixture\ndef complex_object_dataframe():\n    return pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'c'],\n                         'feat1': [1, 2, 3, 4, 5, 6],\n                         'feat2': [1, 2, 3, 2, 3, 4],\n                         'img2d': [1*np.eye(2), 2*np.eye(2), 3*np.eye(2),\n                                   4*np.eye(2), 5*np.eye(2), 6*np.eye(2)]})\n\n\n@pytest.fixture\ndef multiindex_dataframe():\n    \"\"\"Example MultiIndex DataFrame, taken from pandas documentation\n    \"\"\"\n    iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]\n    index = pd.MultiIndex.from_product(iterables, names=['first', 'second'])\n    df = pd.DataFrame(np.random.randn(10, 8), columns=index)\n    return df\n\n\n@pytest.fixture\ndef multiindex_dataframe_incomplete(multiindex_dataframe):\n    \"\"\"Example MultiIndex DataFrame with missing entries\n    \"\"\"\n    df = multiindex_dataframe\n    mask_array = np.zeros(df.size)\n    mask_array[:20] = 1\n    np.random.shuffle(mask_array)\n    mask = mask_array.reshape(df.shape).astype(bool)\n    df.mask(mask, inplace=True)\n    return df\n\n\ndef test_transformed_names_simple(simple_dataframe):\n    \"\"\"\n    Get transformed names of features in `transformed_names` attribute\n    for simple transformation\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None)])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_transformed_names_binarizer(complex_dataframe):\n    \"\"\"\n    Get transformed names of features in `transformed_names` attribute\n    for a transformation that multiplies the number of columns\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['target_a', 'target_b', 'target_c']\n\n\ndef test_logging(caplog, complex_dataframe):\n    \"\"\"\n    Get transformed names of features in `transformed_names` attribute\n    for a transformation that multiplies the number of columns\n    \"\"\"\n    import logging\n    logger = logging.getLogger('sklearn_pandas')\n    logger.setLevel(logging.INFO)\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    assert '[FIT_TRANSFORM] target:' in caplog.text\n\n\ndef test_transformed_names_binarizer_unicode():\n    df = pd.DataFrame({'target': [u'ñ', u'á', u'é']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())])\n    mapper.fit_transform(df)\n    expected_names = {u'target_ñ', u'target_á', u'target_é'}\n    assert set(mapper.transformed_names_) == expected_names\n\n\ndef test_transformed_names_transformers_list(complex_dataframe):\n    \"\"\"\n    When using a list of transformers, use them in inverse order to get the\n    transformed names\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([\n        ('target', [LabelBinarizer(), MockXTransformer()])\n    ])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['target_a', 'target_b', 'target_c']\n\n\ndef test_transformed_names_simple_alias(simple_dataframe):\n    \"\"\"\n    If we specify an alias for a single output column, it is used for the\n    output\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None, {'alias': 'new_name'})])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['new_name']\n\n\ndef test_transformed_names_complex_alias(complex_dataframe):\n    \"\"\"\n    If we specify an alias for a multiple output column, it is used for the\n    output\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([('target', LabelBinarizer(), {'alias': 'new'})])\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['new_a', 'new_b', 'new_c']\n\n\ndef test_exception_column_context_transform(simple_dataframe):\n    \"\"\"\n    If an exception is raised when transforming a column,\n    the exception includes the name of the column being transformed\n    \"\"\"\n    class FailingTransformer(object):\n        def fit(self, X):\n            pass\n\n        def transform(self, X):\n            raise Exception('Some exception')\n\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', FailingTransformer())])\n    mapper.fit(df)\n\n    with pytest.raises(Exception, match='a: Some exception'):\n        mapper.transform(df)\n\n\ndef test_exception_column_context_fit(simple_dataframe):\n    \"\"\"\n    If an exception is raised when fit a column,\n    the exception includes the name of the column being fitted\n    \"\"\"\n    class FailingFitter(object):\n        def fit(self, X):\n            raise Exception('Some exception')\n\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', FailingFitter())])\n\n    with pytest.raises(Exception, match='a: Some exception'):\n        mapper.fit(df)\n\n\ndef test_simple_df(simple_dataframe):\n    \"\"\"\n    Get a dataframe from a simple mapped dataframe\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', None)], df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert type(transformed) == pd.DataFrame\n    assert len(transformed[\"a\"]) == len(simple_dataframe[\"a\"])\n\n\ndef test_complex_df(complex_dataframe):\n    \"\"\"\n    Get a dataframe from a complex mapped dataframe\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [('target', None), ('feat1', None), ('feat2', None)],\n        df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_dataframe)\n    for c in df.columns:\n        assert len(transformed[c]) == len(df[c])\n\n\ndef test_complex_object_df(complex_object_dataframe):\n    \"\"\"\n    Get a dataframe from a complex dataframe with 2d features\n    \"\"\"\n    df = complex_object_dataframe\n    img_scale = 10\n    mapper = DataFrameMapper(\n        [('target', None), ('feat1', None),\n         (make_column_selector('feat2'), StandardScaler()),\n         (make_column_selector('img2d'), MockImageTransformer(img_scale))],\n        df_out=True, input_df=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_object_dataframe)\n    assert np.isclose(\n        np.sum(transformed['img2d']),\n        np.max(np.sum(df['img2d'])) * img_scale, atol=1e-12)\n\n\ndef test_numeric_column_names(complex_dataframe):\n    \"\"\"\n    Get a dataframe from a complex mapped dataframe with numeric column names\n    \"\"\"\n    df = complex_dataframe\n    df.columns = [0, 1, 2]\n    mapper = DataFrameMapper(\n        [(0, None), (1, None), (2, None)], df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(complex_dataframe)\n    for c in df.columns:\n        assert len(transformed[c]) == len(df[c])\n\n\ndef test_multiindex_df(multiindex_dataframe_incomplete):\n    \"\"\"\n    Get a dataframe from a multiindex dataframe with missing data\n    \"\"\"\n    df = multiindex_dataframe_incomplete\n    mapper = DataFrameMapper([([c], Imputer()) for c in df.columns],\n                             df_out=True)\n    transformed = mapper.fit_transform(df)\n    assert len(transformed) == len(multiindex_dataframe_incomplete)\n    for c in df.columns:\n        assert len(transformed[str(c)]) == len(df[c])\n\n\ndef test_binarizer_df():\n    \"\"\"\n    Check level names from LabelBinarizer\n    \"\"\"\n    df = pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'c', 'a']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 3\n    assert cols[0] == 'target_a'\n    assert cols[1] == 'target_b'\n    assert cols[2] == 'target_c'\n\n\ndef test_binarizer_int_df():\n    \"\"\"\n    Check level names from LabelBinarizer for a numeric array.\n    \"\"\"\n    df = pd.DataFrame({'target': [5, 5, 6, 6, 7, 5]})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 3\n    assert cols[0] == 'target_5'\n    assert cols[1] == 'target_6'\n    assert cols[2] == 'target_7'\n\n\ndef test_binarizer2_df():\n    \"\"\"\n    Check level names from LabelBinarizer with just one output column\n    \"\"\"\n    df = pd.DataFrame({'target': ['a', 'a', 'b', 'b', 'a']})\n    mapper = DataFrameMapper([('target', LabelBinarizer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 1\n    assert cols[0] == 'target'\n\n\ndef test_onehot_df():\n    \"\"\"\n    Check level ids from one-hot\n    \"\"\"\n    df = pd.DataFrame({'target': [0, 0, 1, 1, 2, 3, 0]})\n    mapper = DataFrameMapper([(['target'], OneHotEncoder())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 4\n    assert cols[0] == 'target_0'\n    assert cols[3] == 'target_3'\n\n\ndef test_customtransform_df():\n    \"\"\"\n    Check level ids from a transformer in which\n    the number of classes is not equals to the number of output columns.\n    \"\"\"\n    df = pd.DataFrame({'target': [6, 5, 7, 5, 4, 8, 8]})\n    mapper = DataFrameMapper([(['target'], CustomTransformer())], df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(mapper.features[0][1].classes_) == 5\n    assert len(cols) == 1\n    assert cols[0] == 'target'\n\n\ndef test_preserve_df_index():\n    \"\"\"\n    The index is preserved when df_out=True\n    \"\"\"\n    df = pd.DataFrame({'target': [1, 2, 3]},\n                      index=['a', 'b', 'c'])\n    mapper = DataFrameMapper([('target', None)],\n                             df_out=True)\n\n    transformed = mapper.fit_transform(df)\n\n    assert_array_equal(transformed.index, df.index)\n\n\ndef test_preserve_df_index_rows_dropped():\n    \"\"\"\n    If df_out=True but the original df index length doesn't\n    match the number of final rows, use a numeric index\n    \"\"\"\n    class DropLastRowTransformer(object):\n        def fit(self, X):\n            return self\n\n        def transform(self, X):\n            return X[:-1]\n\n    df = pd.DataFrame({'target': [1, 2, 3]},\n                      index=['a', 'b', 'c'])\n    mapper = DataFrameMapper([('target', DropLastRowTransformer())],\n                             df_out=True)\n\n    transformed = mapper.fit_transform(df)\n\n    assert_array_equal(transformed.index, np.array([0, 1]))\n\n\ndef test_pca(complex_dataframe):\n    \"\"\"\n    Check multi in and out with PCA\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [(['feat1', 'feat2'], sklearn.decomposition.PCA(2))],\n        df_out=True)\n    transformed = mapper.fit_transform(df)\n    cols = transformed.columns\n    assert len(cols) == 2\n    assert cols[0] == 'feat1_feat2_0'\n    assert cols[1] == 'feat1_feat2_1'\n\n\ndef test_fit_transform(simple_dataframe):\n    \"\"\"\n    Check that custom fit_transform methods of the transformers are invoked.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    # return something of measurable length but does nothing\n    mock_transformer.fit_transform.return_value = np.array([1, 2, 3])\n    mapper = DataFrameMapper([(\"a\", mock_transformer)])\n    mapper.fit_transform(df)\n    assert mock_transformer.fit_transform.called\n\n\ndef test_fit_transform_equiv_mock(simple_dataframe):\n    \"\"\"\n    Check for equivalent results for code paths fit_transform\n    versus fit and transform in DataFrameMapper using the mock\n    transformer which does not implement a custom fit_transform.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([('a', MockXTransformer())])\n    transformed_combined = mapper.fit_transform(df)\n    transformed_separate = mapper.fit(df).transform(df)\n    assert np.all(transformed_combined == transformed_separate)\n\n\ndef test_fit_transform_equiv_pca(complex_dataframe):\n    \"\"\"\n    Check for equivalent results for code paths fit_transform\n    versus fit and transform in DataFrameMapper and transformer\n    using PCA which implements a custom fit_transform. The\n    equivalence of both paths in the transformer only can be\n    asserted since this is tested in the sklearn tests\n    scikit-learn/sklearn/decomposition/tests/test_pca.py\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper(\n        [(['feat1', 'feat2'], sklearn.decomposition.PCA(2))],\n        df_out=True)\n    transformed_combined = mapper.fit_transform(df)\n    transformed_separate = mapper.fit(df).transform(df)\n    assert np.allclose(transformed_combined, transformed_separate)\n\n\ndef test_input_df_true_first_transformer(simple_dataframe, monkeypatch):\n    \"\"\"\n    If input_df is True, the first transformer is passed\n    a pd.Series instead of an np.array\n    \"\"\"\n    df = simple_dataframe\n    monkeypatch.setattr(MockXTransformer, 'fit', Mock())\n    monkeypatch.setattr(MockXTransformer, 'transform',\n                        Mock(return_value=np.array([1, 2, 3])))\n    mapper = DataFrameMapper([\n        ('a', MockXTransformer())\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n\n    args, _ = MockXTransformer().fit.call_args\n    assert isinstance(args[0], pd.Series)\n\n    args, _ = MockXTransformer().transform.call_args\n    assert isinstance(args[0], pd.Series)\n\n    assert_array_equal(out, np.array([1, 2, 3]).reshape(-1, 1))\n\n\ndef test_input_df_true_next_transformers(simple_dataframe, monkeypatch):\n    \"\"\"\n    If input_df is True, the subsequent transformers get passed pandas\n    objects instead of numpy arrays (given the previous transformers\n    output pandas objects as well)\n    \"\"\"\n    df = simple_dataframe\n    monkeypatch.setattr(MockTClassifier, 'fit', Mock())\n    monkeypatch.setattr(MockTClassifier, 'transform',\n                        Mock(return_value=pd.Series([1, 2, 3])))\n    mapper = DataFrameMapper([\n        ('a', [MockXTransformer(), MockTClassifier()])\n    ], input_df=True)\n    mapper.fit(df)\n    out = mapper.transform(df)\n\n    args, _ = MockTClassifier().fit.call_args\n    assert isinstance(args[0], pd.Series)\n\n    assert_array_equal(out, np.array([1, 2, 3]).reshape(-1, 1))\n\n\ndef test_input_df_true_multiple_cols(complex_dataframe):\n    \"\"\"\n    When input_df is True, applying transformers to multiple columns\n    works as expected\n    \"\"\"\n    df = complex_dataframe\n\n    mapper = DataFrameMapper([\n        ('target', MockXTransformer()),\n        ('feat1',  MockXTransformer()),\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n\n    assert_array_equal(out[:, 0], df['target'].values)\n    assert_array_equal(out[:, 1], df['feat1'].values)\n\n\ndef test_input_df_date_encoder():\n    \"\"\"\n    When input_df is True we can apply a transformer that only works\n    with pandas dataframes like a DateEncoder\n    \"\"\"\n    df = pd.DataFrame(\n        {'dates': pd.date_range('2015-10-30', '2015-11-02')})\n    mapper = DataFrameMapper([\n        ('dates', DateEncoder())\n    ], input_df=True)\n    out = mapper.fit_transform(df)\n    expected = np.array([\n        [2015, 10, 30],\n        [2015, 10, 31],\n        [2015, 11, 1],\n        [2015, 11, 2]\n    ])\n    assert_array_equal(out, expected)\n\n\ndef test_local_input_df_date_encoder():\n    \"\"\"\n    When input_df is True we can apply a transformer that only works\n    with pandas dataframes like a DateEncoder\n    \"\"\"\n    df = pd.DataFrame(\n        {'dates': pd.date_range('2015-10-30', '2015-11-02')})\n    mapper = DataFrameMapper([\n        ('dates', DateEncoder(), {'input_df': True})\n    ], input_df=False)\n    out = mapper.fit_transform(df)\n    expected = np.array([\n        [2015, 10, 30],\n        [2015, 10, 31],\n        [2015, 11, 1],\n        [2015, 11, 2]\n    ])\n    assert_array_equal(out, expected)\n\n\ndef test_nonexistent_columns_explicit_fail(simple_dataframe):\n    \"\"\"\n    If a nonexistent column is selected, KeyError is raised.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    with pytest.raises(KeyError):\n        mapper._get_col_subset(simple_dataframe, ['nonexistent_feature'])\n\n\ndef test_get_col_subset_single_column_array(simple_dataframe):\n    \"\"\"\n    Selecting a single column should return a 1-dimensional numpy array.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    array = mapper._get_col_subset(simple_dataframe, \"a\")\n\n    assert type(array) == np.ndarray\n    assert array.shape == (len(simple_dataframe[\"a\"]),)\n\n\ndef test_get_col_subset_single_column_list(simple_dataframe):\n    \"\"\"\n    Selecting a list of columns (even if the list contains a single element)\n    should return a 2-dimensional numpy array.\n    \"\"\"\n    mapper = DataFrameMapper(None)\n    array = mapper._get_col_subset(simple_dataframe, [\"a\"])\n\n    assert type(array) == np.ndarray\n    assert array.shape == (len(simple_dataframe[\"a\"]), 1)\n\n\ndef test_cols_string_array(simple_dataframe):\n    \"\"\"\n    If a string is specified as the columns, the transformer\n    is called with a 1-d array as input.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    mapper = DataFrameMapper([(\"a\", mock_transformer)])\n\n    mapper.fit(df)\n    args, kwargs = mock_transformer.fit.call_args\n    assert args[0].shape == (3,)\n\n\ndef test_cols_list_column_vector(simple_dataframe):\n    \"\"\"\n    If a one-element list is specified as the columns, the transformer\n    is called with a column vector as input.\n    \"\"\"\n    df = simple_dataframe\n    mock_transformer = Mock()\n    mapper = DataFrameMapper([([\"a\"], mock_transformer)])\n\n    mapper.fit(df)\n    args, kwargs = mock_transformer.fit.call_args\n    assert args[0].shape == (3, 1)\n\n\ndef test_handle_feature_2dim():\n    \"\"\"\n    2-dimensional arrays are returned unchanged.\n    \"\"\"\n    array = np.array([[1, 2], [3, 4]])\n    assert_array_equal(_handle_feature(array), array)\n\n\ndef test_handle_feature_1dim():\n    \"\"\"\n    1-dimensional arrays are converted to 2-dimensional column vectors.\n    \"\"\"\n    array = np.array([1, 2])\n    assert_array_equal(_handle_feature(array), np.array([[1], [2]]))\n\n\ndef test_build_transformers():\n    \"\"\"\n    When a list of transformers is passed, return a pipeline with\n    each element of the iterable as a step of the pipeline.\n    \"\"\"\n    transformers = [MockTClassifier(), MockTClassifier()]\n    pipeline = _build_transformer(transformers)\n    assert isinstance(pipeline, Pipeline)\n    for ix, transformer in enumerate(transformers):\n        assert pipeline.steps[ix][1] == transformer\n\n\ndef test_selected_columns():\n    \"\"\"\n    selected_columns returns a set of the columns appearing in the features\n    of the mapper.\n    \"\"\"\n    mapper = DataFrameMapper([\n        ('a', None),\n        (['a', 'b'], None)\n    ])\n    assert mapper._selected_columns == {'a', 'b'}\n\n\ndef test_unselected_columns():\n    \"\"\"\n    unselected_columns returns a list of the columns not appearing in the\n    features of the mapper but present in the given dataframe.\n    \"\"\"\n    df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})\n    mapper = DataFrameMapper([\n        ('a', None),\n        (['a', 'b'], None)\n    ])\n    assert 'c' in mapper._unselected_columns(df)\n\n\ndef test_drop_and_default_false():\n    \"\"\"\n    If default=False, non explicitly selected columns and drop columns\n    are discarded.\n    \"\"\"\n    df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})\n    mapper = DataFrameMapper([\n            ('a', None)\n        ], drop_cols=['c'], default=False)\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (1, 1)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_drop_and_default_none():\n    \"\"\"\n    If default=None, drop columns are discarded and\n    remaining non explicitly selected columns are passed through untransformed\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        ('a', None)\n    ], drop_cols=['c'], default=None)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 2)\n    assert mapper.transformed_names_ == ['a', 'b']\n\n\ndef test_conflicting_drop():\n    \"\"\"\n    Drop column name shouldn't get confused with transformed columns.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        ('a', None)\n    ], drop_cols=['a'], default=False)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 1)\n    assert mapper.transformed_names_ == ['a']\n\n\ndef test_default_false():\n    \"\"\"\n    If default=False, non explicitly selected columns are discarded.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        ('b', None)\n    ], default=False)\n\n    transformed = mapper.fit_transform(df)\n    assert transformed.shape == (3, 1)\n\n\ndef test_default_none():\n    \"\"\"\n    If default=None, non explicitly selected columns are passed through\n    untransformed.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([\n        (['a'], OneHotEncoder())\n    ], default=None)\n\n    transformed = mapper.fit_transform(df)\n    assert (transformed[:, 3] == np.array([3, 5, 7]).T).all()\n\n\ndef test_default_none_names():\n    \"\"\"\n    If default=None, column names are returned unmodified.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 5, 7]})\n    mapper = DataFrameMapper([], default=None)\n\n    mapper.fit_transform(df)\n    assert mapper.transformed_names_ == ['a', 'b']\n\n\ndef test_default_transformer():\n    \"\"\"\n    If default=Transformer, non explicitly selected columns are applied this\n    transformer.\n    \"\"\"\n    df = pd.DataFrame({'a': [1, np.nan, 3], })\n    mapper = DataFrameMapper([], default=Imputer())\n\n    transformed = mapper.fit_transform(df)\n    assert (transformed[: 0] == np.array([1., 2., 3.])).all()\n\n\ndef test_list_transformers_single_arg(simple_dataframe):\n    \"\"\"\n    Multiple transformers can be specified in a list even if some of them\n    only accept one X argument instead of two (X, y).\n    \"\"\"\n    mapper = DataFrameMapper([\n        ('a', [MockXTransformer()])\n    ])\n    # doesn't fail\n    mapper.fit_transform(simple_dataframe)\n\n\ndef test_list_transformers():\n    \"\"\"\n    Specifying a list of transformers applies them sequentially to the\n    selected column.\n    \"\"\"\n    dataframe = pd.DataFrame({\"a\": [1, np.nan, 3], \"b\": [1, 5, 7]},\n                             dtype=np.float64)\n\n    mapper = DataFrameMapper([\n        ([\"a\"], [Imputer(), StandardScaler()]),\n        ([\"b\"], StandardScaler()),\n    ])\n    dmatrix = mapper.fit_transform(dataframe)\n\n    assert pd.isnull(dmatrix).sum() == 0  # no null values\n\n    # all features have mean 0 and std deviation 1 (standardized)\n    assert (abs(dmatrix.mean(axis=0) - 0) <= 1e-6).all()\n    assert (abs(dmatrix.std(axis=0) - 1) <= 1e-6).all()\n\n\ndef test_list_transformers_old_unpickle(simple_dataframe):\n    mapper = DataFrameMapper(None)\n    # simulate the mapper was created with < 1.0.0 code\n    mapper.features = [('a', [MockXTransformer()])]\n    mapper_pickled = pickle.dumps(mapper)\n\n    loaded_mapper = pickle.loads(mapper_pickled)\n    transformer = loaded_mapper.features[0][1]\n    assert isinstance(transformer, TransformerPipeline)\n    assert isinstance(transformer.steps[0][1], MockXTransformer)\n\n\ndef test_sparse_features(simple_dataframe):\n    \"\"\"\n    If any of the extracted features is sparse and \"sparse\" argument\n    is true, the hstacked result is also sparse.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([\n        (\"a\", ToSparseTransformer())\n    ], sparse=True)\n    dmatrix = mapper.fit_transform(df)\n\n    assert type(dmatrix) == sparse.csr.csr_matrix\n\n\ndef test_sparse_off(simple_dataframe):\n    \"\"\"\n    If the resulting features are sparse but the \"sparse\" argument\n    of the mapper is False, return a non-sparse matrix.\n    \"\"\"\n    df = simple_dataframe\n    mapper = DataFrameMapper([\n        (\"a\", ToSparseTransformer())\n    ], sparse=False)\n\n    dmatrix = mapper.fit_transform(df)\n    assert type(dmatrix) != sparse.csr.csr_matrix\n\n\ndef test_fit_with_optional_y_arg(complex_dataframe):\n    \"\"\"\n    Transformers with an optional y argument in the fit method\n    are handled correctly\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([(['feat1', 'feat2'], MockTClassifier())])\n    # doesn't fail\n    mapper.fit(df[['feat1', 'feat2']], df['target'])\n\n\ndef test_fit_with_required_y_arg(complex_dataframe):\n    \"\"\"\n    Transformers with a required y argument in the fit method\n    are handled and perform correctly\n    \"\"\"\n    df = complex_dataframe\n    mapper = DataFrameMapper([(['feat1', 'feat2'], SelectKBest(chi2, k=1))])\n\n    # fit, doesn't fail\n    ft_arr = mapper.fit(df[['feat1', 'feat2']], df['target'])\n\n    # fit_transform\n    ft_arr = mapper.fit_transform(df[['feat1', 'feat2']], df['target'])\n    assert_array_equal(ft_arr, df[['feat1']].values)\n\n    # transform\n    t_arr = mapper.transform(df[['feat1', 'feat2']])\n    assert_array_equal(t_arr, df[['feat1']].values)\n\n\n# Integration tests with real dataframes\n\n@pytest.fixture\ndef iris_dataframe():\n    iris = load_iris()\n    return DataFrame(\n        data={\n            iris.feature_names[0]: iris.data[:, 0],\n            iris.feature_names[1]: iris.data[:, 1],\n            iris.feature_names[2]: iris.data[:, 2],\n            iris.feature_names[3]: iris.data[:, 3],\n            \"species\": np.array([iris.target_names[e] for e in iris.target])\n        }\n    )\n\n\n@pytest.fixture\ndef cars_dataframe():\n    return pd.read_csv(\"tests/test_data/cars.csv.gz\", compression='gzip')\n\n\ndef test_with_iris_dataframe(iris_dataframe):\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"petal length (cm)\", None),\n            (\"petal width (cm)\", None),\n            (\"sepal length (cm)\", None),\n            (\"sepal width (cm)\", None),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = iris_dataframe.drop(\"species\", axis=1)\n    labels = iris_dataframe[\"species\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.96\n    assert (scores.std() * 2) < 0.04\n\n\ndef test_dict_vectorizer():\n    df = pd.DataFrame(\n        [[{'a': 1, 'b': 2}], [{'a': 3}]],\n        columns=['colA']\n    )\n\n    outdf = DataFrameMapper(\n        [('colA', DictVectorizer())],\n        df_out=True,\n        default=False\n    ).fit_transform(df)\n\n    columns = sorted(list(outdf.columns))\n    assert len(columns) == 2\n    assert columns[0] == 'colA_0'\n    assert columns[1] == 'colA_1'\n\n\ndef test_with_car_dataframe(cars_dataframe):\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"description\", CountVectorizer()),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = cars_dataframe.drop(\"model\", axis=1)\n    labels = cars_dataframe[\"model\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.30\n\n\ndef test_direct_cross_validation(iris_dataframe):\n    \"\"\"\n    Starting with sklearn>=0.16.0 we no longer need CV wrappers for dataframes.\n    See https://github.com/paulgb/sklearn-pandas/issues/11\n    \"\"\"\n    pipeline = Pipeline([\n        (\"preprocess\", DataFrameMapper([\n            (\"petal length (cm)\", None),\n            (\"petal width (cm)\", None),\n            (\"sepal length (cm)\", None),\n            (\"sepal width (cm)\", None),\n        ])),\n        (\"classify\", SVC(kernel='linear'))\n    ])\n    data = iris_dataframe.drop(\"species\", axis=1)\n    labels = iris_dataframe[\"species\"]\n    scores = cross_val_score(pipeline, data, labels)\n    assert scores.mean() > 0.96\n    assert (scores.std() * 2) < 0.04\n\n\ndef test_heterogeneous_output_types_input_df():\n    \"\"\"\n    Modify feat2, but pass feat1 through unmodified.\n    This fails if input_df == False\n    \"\"\"\n    df = pd.DataFrame({\n        'feat1': [1, 2, 3, 4, 5, 6],\n        'feat2': [1.0, 2.0, 3.0, 2.0, 3.0, 4.0]\n    })\n    M = DataFrameMapper([\n        (['feat2'], StandardScaler())\n        ], input_df=True, df_out=True, default=None)\n    dft = M.fit_transform(df)\n    assert dft['feat1'].dtype == np.dtype('int64')\n    assert dft['feat2'].dtype == np.dtype('float64')\n\n\ndef test_make_column_selector(iris_dataframe):\n    t = DataFrameMapper([\n        (make_column_selector(dtype_include=float), None, {'alias': 'x'}),\n        ('sepal length (cm)', None),\n    ], df_out=True, default=False)\n\n    xt = t.fit(iris_dataframe).transform(iris_dataframe)\n    expected = ['x_0', 'x_1', 'x_2', 'x_3', 'sepal length (cm)']\n    assert list(xt.columns) == expected\n\n    pickled = pickle.dumps(t)\n    t2 = pickle.loads(pickled)\n    xt2 = t2.transform(iris_dataframe)\n    assert np.array_equal(xt.values, xt2.values)\n"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
      "content": "from collections import Counter\n\nimport pytest\nimport numpy as np\nfrom pandas import DataFrame\nfrom numpy.testing import assert_array_equal\n\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn_pandas.features_generator import gen_features\n\n\nclass MockClass(object):\n\n    def __init__(self, value=1, name='class'):\n        self.value = value\n        self.name = name\n\n\nclass MockTransformer(object):\n\n    def __init__(self):\n        self.most_common_ = None\n\n    def fit(self, X, y=None):\n        [(value, _)] = Counter(X).most_common(1)\n        self.most_common_ = value\n        return self\n\n    def transform(self, X, y=None):\n        return np.asarray([self.most_common_] * len(X))\n\n\n@pytest.fixture\ndef simple_dataset():\n    return DataFrame({\n        'feat1': [1, 2, 1, 3, 1],\n        'feat2': [1, 2, 2, 2, 3],\n        'feat3': [1, 2, 3, 4, 5],\n    })\n\n\ndef test_generate_features_with_default_parameters():\n    \"\"\"\n    Tests generating features from classes with default init arguments.\n    \"\"\"\n    columns = ['colA', 'colB', 'colC']\n    feature_defs = gen_features(columns=columns, classes=[MockClass])\n    assert len(feature_defs) == len(columns)\n\n    for feature in feature_defs:\n        assert feature[2] == {}\n\n    feature_dict = dict([_[0:2] for _ in feature_defs])\n    assert columns == sorted(feature_dict.keys())\n\n    # default init arguments for MockClass for clarification.\n    expected = {'value': 1, 'name': 'class'}\n    for column, transformers in feature_dict.items():\n        for obj in transformers:\n            assert_attributes(obj, **expected)\n\n\ndef test_generate_features_with_several_classes():\n    \"\"\"\n    Tests generating features pipeline with different transformers parameters.\n    \"\"\"\n    feature_defs = gen_features(\n        columns=['colA', 'colB', 'colC'],\n        classes=[\n            {'class': MockClass},\n            {'class': MockClass, 'name': 'mockA'},\n            {'class': MockClass, 'name': 'mockB', 'value': None}\n        ]\n    )\n\n    for col, transformers, params in feature_defs:\n        assert_attributes(transformers[0], name='class', value=1)\n        assert_attributes(transformers[1], name='mockA', value=1)\n        assert_attributes(transformers[2], name='mockB', value=None)\n\n\ndef test_generate_features_with_none_only_transformers():\n    \"\"\"\n    Tests generating \"dummy\" feature definition which doesn't apply any\n    transformation.\n    \"\"\"\n    feature_defs = gen_features(\n        columns=['colA', 'colB', 'colC'], classes=[None])\n\n    expected = [('colA', None, {}),\n                ('colB', None, {}),\n                ('colC', None, {})]\n\n    assert feature_defs == expected\n\n\ndef test_compatibility_with_data_frame_mapper(simple_dataset):\n    \"\"\"\n    Tests compatibility of generated feature definition with DataFrameMapper.\n    \"\"\"\n    features_defs = gen_features(\n        columns=['feat1', 'feat2'],\n        classes=[MockTransformer])\n    features_defs.append(('feat3', None))\n\n    mapper = DataFrameMapper(features_defs)\n    X = mapper.fit_transform(simple_dataset)\n    expected = np.asarray([\n        [1, 2, 1],\n        [1, 2, 2],\n        [1, 2, 3],\n        [1, 2, 4],\n        [1, 2, 5]\n    ])\n\n    assert_array_equal(X, expected)\n\n\ndef assert_attributes(obj, **attrs):\n    for attr, value in attrs.items():\n        assert getattr(obj, attr) == value\n"
    }
  ],
  "ErrorMessage": "\n============================================================================================= FAILURES ==============================================================================================\n______________________________________________________________________________________ test_build_transformers ______________________________________________________________________________________\n\n    def test_build_transformers():\n        \"\"\"\n        When a list of transformers is passed, return a pipeline with\n        each element of the iterable as a step of the pipeline.\n        \"\"\"\n        transformers = [MockTClassifier(), MockTClassifier()]\n        pipeline = _build_transformer(transformers)\n>       assert isinstance(pipeline, Pipeline)\nE       TypeError: isinstance() arg 2 must be a type or tuple of types\n\ntests/test_dataframe_mapper.py:686: TypeError\n_____________________________________________________________________________________ test_with_iris_dataframe ______________________________________________________________________________________\n\niris_dataframe =      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)    species\n0                  5.1       ...ginica\n149                5.9               3.0                5.1               1.8  virginica\n\n[150 rows x 5 columns]\n\n    def test_with_iris_dataframe(iris_dataframe):\n>       pipeline = Pipeline([\n            (\"preprocess\", DataFrameMapper([\n                (\"petal length (cm)\", None),\n                (\"petal width (cm)\", None),\n                (\"sepal length (cm)\", None),\n                (\"sepal width (cm)\", None),\n            ])),\n            (\"classify\", SVC(kernel='linear'))\n        ])\nE       TypeError: 'module' object is not callable\n\ntests/test_dataframe_mapper.py:935: TypeError\n______________________________________________________________________________________ test_with_car_dataframe ______________________________________________________________________________________\n\ncars_dataframe =      model                                        description\n0      500  Immaculate bodywork, electric panoramic sunr...R REPAIRS DUE TO EPC/EML COMIN...\n541  corsa  1.2 16v auto corsa 3 door, Full mot with no ad...\n\n[542 rows x 2 columns]\n\n    def test_with_car_dataframe(cars_dataframe):\n>       pipeline = Pipeline([\n            (\"preprocess\", DataFrameMapper([\n                (\"description\", CountVectorizer()),\n            ])),\n            (\"classify\", SVC(kernel='linear'))\n        ])\nE       TypeError: 'module' object is not callable\n\ntests/test_dataframe_mapper.py:970: TypeError\n___________________________________________________________________________________ test_direct_cross_validation ____________________________________________________________________________________\n\niris_dataframe =      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)    species\n0                  5.1       ...ginica\n149                5.9               3.0                5.1               1.8  virginica\n\n[150 rows x 5 columns]\n\n    def test_direct_cross_validation(iris_dataframe):\n        \"\"\"\n        Starting with sklearn>=0.16.0 we no longer need CV wrappers for dataframes.\n        See https://github.com/paulgb/sklearn-pandas/issues/11\n        \"\"\"\n>       pipeline = Pipeline([\n            (\"preprocess\", DataFrameMapper([\n                (\"petal length (cm)\", None),\n                (\"petal width (cm)\", None),\n                (\"sepal length (cm)\", None),\n                (\"sepal width (cm)\", None),\n            ])),\n            (\"classify\", SVC(kernel='linear'))\n        ])\nE       TypeError: 'module' object is not callable\n\ntests/test_dataframe_mapper.py:987: TypeError\n========================================================================================= warnings summary ==========================================================================================\ntests/test_dataframe_mapper.py::test_complex_object_df\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:106: SettingWithCopyWarning: \n  A value is trying to be set on a copy of a slice from a DataFrame.\n  Try using .loc[row_indexer,col_indexer] = value instead\n  \n  See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n    X[col] = X[col].map(lambda img: np.max(img))\n\ntests/test_dataframe_mapper.py::test_sparse_features\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:865: DeprecationWarning: Please use `csr_matrix` from the `scipy.sparse` namespace, the `scipy.sparse.csr` namespace is deprecated.\n    assert type(dmatrix) == sparse.csr.csr_matrix\n\ntests/test_dataframe_mapper.py::test_sparse_off\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:879: DeprecationWarning: Please use `csr_matrix` from the `scipy.sparse` namespace, the `scipy.sparse.csr` namespace is deprecated.\n    assert type(dmatrix) != sparse.csr.csr_matrix\n\ntests/test_transformers.py::test_common_numerical_transformer\ntests/test_transformers.py::test_numerical_transformer_serialization\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/sklearn_pandas/transformers.py:35: DeprecationWarning: \n              NumericalTransformer will be deprecated in 3.0 version.\n              Please use Sklearn.base.TransformerMixin to write\n              customer transformers\n              \n    warnings.warn(\"\"\"\n\n-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html\n====================================================================================== short test summary info ======================================================================================\nFAILED tests/test_dataframe_mapper.py::test_build_transformers - TypeError: isinstance() arg 2 must be a type or tuple of types\nFAILED tests/test_dataframe_mapper.py::test_with_iris_dataframe - TypeError: 'module' object is not callable\nFAILED tests/test_dataframe_mapper.py::test_with_car_dataframe - TypeError: 'module' object is not callable\nFAILED tests/test_dataframe_mapper.py::test_direct_cross_validation - TypeError: 'module' object is not callable\n============================================================================= 4 failed, 66 passed, 5 warnings in 0.66s ==============================================================================",
  "Patch": "--- a/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n@@ -6,7 +6,7 @@\n import pandas as pd\n from scipy import sparse\n from sklearn.datasets import load_iris\n-from sklearn import pipeline as Pipeline\n+from sklearn.pipeline import Pipeline\n from sklearn.model_selection import cross_val_score\n from sklearn.svm import SVC\n from sklearn.feature_extraction.text import CountVectorizer\n",
  "BuggyCodeLocation": [
    {
      "file": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
      "function": null,
      "content_all": {
        "6": "import pandas as pd\n",
        "7": "from scipy import sparse\n",
        "8": "from sklearn.datasets import load_iris\n",
        "9": "from sklearn import pipeline as Pipeline\n",
        "10": "from sklearn.model_selection import cross_val_score\n",
        "11": "from sklearn.svm import SVC\n",
        "12": "from sklearn.feature_extraction.text import CountVectorizer\n"
      },
      "content_change": {
        "9": "from sklearn import pipeline as Pipeline\n"
      }
    }
  ],
  "Issue": {
    "title": "Inconsistent Import Style for `Pipeline` in test_dataframe_mapper.py",
    "description": "There is an inconsistency in the way the `Pipeline` class is imported from `sklearn.pipeline` within the test_dataframe_mapper.py file. Currently, it is being imported as a module (`pipeline as Pipeline`), which is not the correct usage style and can lead to confusion and potential import errors during runtime or testing. To maintain a uniform import style across the codebase and to follow Python's best practices for importing classes, the `Pipeline` class should be directly imported from `sklearn.pipeline` as `from sklearn.pipeline import Pipeline`. This change will ensure that the code is more readable and less error-prone, especially during automated testing. Additionally, this consistency is crucial for maintaining code quality and ease of maintenance.",
    "explanation": "### Summary of the Issue:\nThe issue reported is about the inconsistent import style of the `Pipeline` class in the `test_dataframe_mapper.py` file. Specifically, the `Pipeline` class from the `sklearn.pipeline` module is incorrectly imported as a module (i.e., `pipeline as Pipeline`). This import style is not only unconventional but can also potentially lead to confusion and runtime errors. To promote uniformity and readability across the codebase, the import style should follow Python's best practices.\n\n### Cause of the Issue:\nThe primary cause of the issue is the incorrect import statement for the `Pipeline` class. Instead of importing the class directly, the module itself is imported and aliased as `Pipeline`. This contradicts standard Python importing conventions, which recommend importing classes and functions directly for simplicity and clarity. Indirect imports can mislead developers into thinking that they are dealing with a class when they are actually dealing with a module, leading to possible errors when the code is executed.\n\n### Solution Provided by the Commit:\nThe commit addresses this issue by modifying the import statement in the `test_dataframe_mapper.py` file. Instead of importing the `pipeline` module and aliasing it as `Pipeline`, the commit changes the statement to directly import the `Pipeline` class from the `sklearn.pipeline` module. This adjustment conforms to conventional Python import styles, enhancing both the readability and maintainability of the code.\n\n### Detailed Explanation:\n\n1. **Summary of the Issue:**\n   - There is a problem in the file `test_dataframe_mapper.py` where the `Pipeline` class from the `sklearn.pipeline` module is imported incorrectly. It is being imported as a module and then aliased as `Pipeline`.\n   - This incorrect import method is non-standard and can induce confusion or lead to import-related errors during runtime or testing.\n\n2. **Content of the Commit:**\n   - The commit includes an update to the import statement to fix the inconsistency. The original import statement is replaced with a more conventional one.\n   - Specifically, the original statement: `from sklearn import pipeline as Pipeline` is corrected to `from sklearn.pipeline import Pipeline`.\n\n3. **How the Commit Solves the Issue:**\n   - By importing the `Pipeline` class directly, the commit resolves the confusion caused by the previous indirect import style.\n   - Directly importing `Pipeline` ensures that it is clear to anyone reading the code that `Pipeline` is a class from `sklearn.pipeline`, not a module.\n   - This change aligns the code with Python’s best practices, which advocate for straightforward and readable import statements.\n\n4. **Solution to the Issue:**\n   - **Improve Uniformity:** The consistent and direct import style makes the code uniform with other parts of the codebase, which likely also use direct imports.\n   - **Enhance Readability:** It becomes evident to developers that `Pipeline` is a class, not a module, making the code more readable and understandable.\n   - **Reduce Error Potential:** By adhering to the standard import practice, the risk of import-related errors is minimized. This is particularly important during automated testing when unexpected import errors can disrupt the testing process.\n\n### Conclusion:\nThe commit makes a fundamental yet important change to the import statement in `test_dataframe_mapper.py`. It replaces an indirect import with a direct import, bringing clarity, reducing potential errors, and aligning with best practices. This subtle but critical fix enhances the overall code quality, aiding developers in maintaining and extending the codebase efficiently."
  },
  "Explain": "### Summary of the Issue:\nThe issue reported is about the inconsistent import style of the `Pipeline` class in the `test_dataframe_mapper.py` file. Specifically, the `Pipeline` class from the `sklearn.pipeline` module is incorrectly imported as a module (i.e., `pipeline as Pipeline`). This import style is not only unconventional but can also potentially lead to confusion and runtime errors. To promote uniformity and readability across the codebase, the import style should follow Python's best practices.\n\n### Cause of the Issue:\nThe primary cause of the issue is the incorrect import statement for the `Pipeline` class. Instead of importing the class directly, the module itself is imported and aliased as `Pipeline`. This contradicts standard Python importing conventions, which recommend importing classes and functions directly for simplicity and clarity. Indirect imports can mislead developers into thinking that they are dealing with a class when they are actually dealing with a module, leading to possible errors when the code is executed.\n\n### Solution Provided by the Commit:\nThe commit addresses this issue by modifying the import statement in the `test_dataframe_mapper.py` file. Instead of importing the `pipeline` module and aliasing it as `Pipeline`, the commit changes the statement to directly import the `Pipeline` class from the `sklearn.pipeline` module. This adjustment conforms to conventional Python import styles, enhancing both the readability and maintainability of the code.\n\n### Detailed Explanation:\n\n1. **Summary of the Issue:**\n   - There is a problem in the file `test_dataframe_mapper.py` where the `Pipeline` class from the `sklearn.pipeline` module is imported incorrectly. It is being imported as a module and then aliased as `Pipeline`.\n   - This incorrect import method is non-standard and can induce confusion or lead to import-related errors during runtime or testing.\n\n2. **Content of the Commit:**\n   - The commit includes an update to the import statement to fix the inconsistency. The original import statement is replaced with a more conventional one.\n   - Specifically, the original statement: `from sklearn import pipeline as Pipeline` is corrected to `from sklearn.pipeline import Pipeline`.\n\n3. **How the Commit Solves the Issue:**\n   - By importing the `Pipeline` class directly, the commit resolves the confusion caused by the previous indirect import style.\n   - Directly importing `Pipeline` ensures that it is clear to anyone reading the code that `Pipeline` is a class from `sklearn.pipeline`, not a module.\n   - This change aligns the code with Python’s best practices, which advocate for straightforward and readable import statements.\n\n4. **Solution to the Issue:**\n   - **Improve Uniformity:** The consistent and direct import style makes the code uniform with other parts of the codebase, which likely also use direct imports.\n   - **Enhance Readability:** It becomes evident to developers that `Pipeline` is a class, not a module, making the code more readable and understandable.\n   - **Reduce Error Potential:** By adhering to the standard import practice, the risk of import-related errors is minimized. This is particularly important during automated testing when unexpected import errors can disrupt the testing process.\n\n### Conclusion:\nThe commit makes a fundamental yet important change to the import statement in `test_dataframe_mapper.py`. It replaces an indirect import with a direct import, bringing clarity, reducing potential errors, and aligning with best practices. This subtle but critical fix enhances the overall code quality, aiding developers in maintaining and extending the codebase efficiently.",
  "Source": "Human",
  "Token": 1275,
  "Command": [
    "pytest tests"
  ],
  "FilteredCode": [
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
      "content": "1 # -*- coding: utf8 -*-\n2 \n3 import pytest\n4 from unittest.mock import Mock\n5 from pandas import DataFrame\n6 import pandas as pd\n7 from scipy import sparse\n8 from sklearn.datasets import load_iris\n9 from sklearn import pipeline as Pipeline\n10 from sklearn.model_selection import cross_val_score\n11 from sklearn.svm import SVC\n12 from sklearn.feature_extraction.text import CountVectorizer\n13 from sklearn.feature_extraction import DictVectorizer\n14 from sklearn.preprocessing import (\n15     StandardScaler, OneHotEncoder, LabelBinarizer)\n16 from sklearn.impute import SimpleImputer as Imputer\n17 from sklearn.feature_selection import SelectKBest, chi2\n18 from sklearn.base import BaseEstimator, TransformerMixin\n19 import sklearn.decomposition\n20 import numpy as np\n21 from numpy.testing import assert_array_equal\n22 import pickle\n23 from sklearn.compose import make_column_selector\n24 \n25 from sklearn_pandas import DataFrameMapper\n26 from sklearn_pandas.dataframe_mapper import _handle_feature, _build_transformer\n27 from sklearn_pandas.pipeline import TransformerPipeline\n28 \n29 \n30 class MockXTransformer(object):\n31     \"\"\"\n32     Mock transformer that accepts no y argument.\n33     \"\"\"\n34     def fit(self, X):\n35         return self\n36 \n37     def transform(self, X):\n38         return X\n39 \n40 \n41 class MockTClassifier(object):\n42     \"\"\"\n43     Mock transformer/classifier.\n44     \"\"\"\n45     def fit(self, X, y=None):\n46         return self\n47 \n48     def transform(self, X):\n49         return X\n50 \n51     def predict(self, X):\n52         return True\n53 \n54 \n55 class DateEncoder():\n56     def fit(self, X, y=None):\n57         re(...truncated)"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/setup.py",
      "content": "1 #!/usr/bin/env python\n2 # -*- coding: utf-8 -*-\n3 \n4 from setuptools import setup\n5 from setuptools.command.test import test as TestCommand\n6 import re\n7 \n8 for line in open('sklearn_pandas/__init__.py'):\n9     match = re.match(\"__version__ *= *'(.*)'\", line)\n10     if match:\n11         __version__, = match.groups()\n12 \n13 \n14 class PyTest(TestCommand):\n15     user_options = [('pytest-args=', 'a', \"Arguments to pass to py.test\")]\n16 \n17     def initialize_options(self):\n18         TestCommand.initialize_options(self)\n19         self.pytest_args = []\n20 \n21     def finalize_options(self):\n22         TestCommand.finalize_options(self)\n23         self.test_args = []\n24         self.test_suite = True\n25 \n26     def run(self):\n27         import pytest\n28         errno = pytest.main(self.pytest_args)\n29         raise SystemExit(errno)\n30 \n31 \n32 setup(name='sklearn-pandas',\n33       version=__version__,\n34       description='Pandas integration with sklearn',\n35       maintainer='Ritesh Agrawal',\n36       maintainer_email='ragrawal@gmail.com',\n37       url='https://github.com/scikit-learn-contrib/sklearn-pandas',\n38       packages=['sklearn_pandas'],\n39       keywords=['scikit', 'sklearn', 'pandas'],\n40       install_requires=[\n41           'scikit-learn>=0.23.0',\n42           'scipy>=1.5.1',\n43           'pandas>=1.1.4',\n44           'numpy>=1.18.1'\n45       ],\n46       tests_require=['pytest', 'mock'],\n47       cmdclass={'test': PyTest},\n48       license='MIT License'\n49 )"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/dataframe_mapper.py",
      "content": "1 import contextlib\n2 from datetime import datetime\n3 import pandas as pd\n4 import numpy as np\n5 from scipy import sparse\n6 from sklearn.base import BaseEstimator, TransformerMixin\n7 from .cross_validation import DataWrapper\n8 from .pipeline import make_transformer_pipeline, _call_fit, TransformerPipeline\n9 from . import logger\n10 \n11 string_types = text_type = str\n12 \n13 \n14 def _handle_feature(fea):\n15     \"\"\"\n16     Convert 1-dimensional arrays to 2-dimensional column vectors.\n17     \"\"\"\n18     if len(fea.shape) == 1:\n19         fea = np.array([fea]).T\n20 \n21     return fea\n22 \n23 \n24 def _build_transformer(transformers):\n25     if isinstance(transformers, list):\n26         transformers = make_transformer_pipeline(*transformer(...truncated)"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/pipeline.py",
      "content": "1 import six\n2 from sklearn.pipeline import _name_estimators, Pipeline\n3 from sklearn.utils import tosequence\n4 \n5 \n6 def _call_fit(fit_method, X, y=None, **kwargs):\n7     \"\"\"\n8     helper function, calls the fit or fit_transform method with the correct\n9     number of parameters\n10 \n11     fit_method: fit or fit_transform method of the transformer\n12     X: the data to fit\n13     y: the target vector relative to X, optional\n14     kwargs: any keyword arguments to the fit method\n15 \n16     return: the r(...truncated)"
    },
    {
      "path": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
      "content": "1 from collections import Counter\n2 \n3 import pytest\n4 import numpy as np\n5 from pandas import DataFrame\n6 from numpy.testing import assert_array_equal\n7 \n8 from sklearn_pandas import DataFrameMapper\n9 from sklearn_pandas.features_generator import gen_features\n10 \n11 \n12 class MockClass(object):\n13 \n14     def __init__(self, value=1, name='class'):\n15   (...truncated)"
    }
  ],
  "TokenAll": 18490,
  "FilteredLevel": 1500,
  "Results": {
    "model": "GPT-4o",
    "Difficulty": "Medium",
    "issue_origin": {
      "title": "Incorrect import alias",
      "description": "In the file 'scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py' at line 9, the sklearn.pipeline is imported as Pipeline, but it should be imported as 'pipeline' instead to match its usage. Please correct the import alias.",
      "explanation": "### Issue Summary\n\nThe issue reported is about an incorrect import alias in a test file. Specifically, in the file `scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py`, at line 9, `sklearn.pipeline` is imported and aliased as `Pipeline`. However, based on its usage in the code, it should be imported as `pipeline` instead.\n\n### Code Analysis\n\nAnalyzing the code snippet provided, we identify the problematic line:\n```python\n9 from sklearn import pipeline as Pipeline\n```\nThe rest of the imports and definitions are standard and do not seem connected to this issue.\n\n### Explanation of the Issue\n\n1. **Incorrect Alias**: The import statement at line 9 is:\n    ```python\n    from sklearn import pipeline as Pipeline\n    ```\n    This means that the module `pipeline` from `sklearn` is being imported and then aliased to `Pipeline`. However, `Pipeline` is not a module but a class within the `pipeline` module.\n\n2. **Confusion in Usage**: The naming convention in Python signifies module names should typically be in lowercase and classes in CamelCase. By aliasing the module `pipeline` as `Pipeline`, this can cause confusion when trying to use `Pipeline` as a class. \n\n3. **Correct Import and Usage**: To correctly import the `Pipeline` class from `sklearn.pipeline`, the line should be:\n    ```python\n    from sklearn.pipeline import Pipeline\n    ```\n\n### Solution\n\nThe solution is to correct the import statement by importing the `Pipeline` class directly from the `sklearn.pipeline` module.\n\n### Commit Explanation\n\nA hypothetical commit to resolve this issue might look like:\n\n```text\nCorrected incorrect import alias in test_dataframe_mapper.py\n\n- Imported Pipeline class from sklearn.pipeline directly.\n```\nThe code change necessary in the file `scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py` would be:\n\nOriginal:\n```python\n9 from sklearn import pipeline as Pipeline\n```\n\nUpdated:\n```python\n9 from sklearn.pipeline import Pipeline\n```\n\n### How the Commit Solves the Issue\n\n1. **Accuracy of Import**:\n   By importing the `Pipeline` class directly from the `sklearn.pipeline` module using:\n   ```python\n   from sklearn.pipeline import Pipeline\n   ```\n   it ensures that `Pipeline` unequivocally refers to the class `Pipeline` and does not introduce any ambiguity.\n\n2. **Improved Readability and Maintainability**:\n   This change aligns with Python's conventions and makes the code more readable and maintainable. Any developer reading the code can easily understand that `Pipeline` is a class being used from `sklearn.pipeline`.\n\n3. **Avoiding Potential Errors**:\n   Importing `pipeline` as `Pipeline` could lead to runtime errors if there are attempts to utilize `Pipeline` as a class, only to realize it's a module. By correctly importing the `Pipeline` class, such potential issues are resolved.\n\n### Conclusion\n\nThe commit correctly changes the import statement to import the `Pipeline` class directly from `sklearn.pipeline`, thereby fixing the improper aliasing error and aligning the code with standard Python practices for clearer readability and maintainability. This careful correction ensures the code behaves as expected and future contributors can easily understand and work with the code."
    },
    "issue_message": {
      "title": "TypeError: 'module' object is not callable",
      "description": "The code is using 'Pipeline' instead of 'pipeline.Pipeline' in 'tests/test_dataframe_mapper.py'. This results in a TypeError as 'Pipeline' is being interpreted as a module rather than a callable class. Please update the code to use 'pipeline.Pipeline' instead of 'Pipeline' in the test cases.",
      "explanation": "### Summary of the Issue\n\nThe given issue pertains to a `TypeError` that occurs in the `tests/test_dataframe_mapper.py` file. Specifically, the error message `'module' object is not callable` is encountered because the code mistakenly imports the `pipeline` module from `sklearn` as `Pipeline` instead of importing the `Pipeline` callable class from the `sklearn.pipeline` module.\n\n### Detailed Analysis\n\nHere's an analysis of the situation which led to the error:\n\n1. **Wrong Import Statement**:\n   The line in question is:\n   ```python\n   9    from sklearn import pipeline as Pipeline\n   ```\n   This line imports the entire `pipeline` module and assigns it the name `Pipeline`. Later in the code, when `Pipeline` is called as a class, it results in a `TypeError` because a module is not callable.\n\n2. **Subsequent Errors**:\n   Several test cases fail due to this incorrect import:\n   - `test_build_transformers`\n   - `test_with_iris_dataframe`\n   - `test_with_car_dataframe`\n   - `test_direct_cross_validation`\n\n   These tests attempt to instantiate a pipeline using `Pipeline()`, leading to the mentioned `TypeError`.\n\n### Content of the Commit\n\nTo resolve the issue, the commit would need to correctly import the `Pipeline` class. The expected content of the commit should alter the incorrect import statement. Here's how it could look:\n\n```diff\n9c9\n- from sklearn import pipeline as Pipeline\n+ from sklearn.pipeline import Pipeline\n```\n\n### Explanation of How the Commit Solves the Issue\n\n1. **Correct Import**:\n   By changing the import statement to:\n   ```python\n   from sklearn.pipeline import Pipeline\n   ```\n   the `Pipeline` used in the code now correctly refers to the `Pipeline` class in the `sklearn.pipeline` module, which is callable.\n\n2. **Fixing the TypeError**:\n   With this correction, when the tests try to create an instance of `Pipeline`, it refers to the correct class, thus avoiding the `'module' object is not callable` error.\n\n3. **Impact on Test Cases**:\n   Since the proper class is now being used, the type checks and the instantiation of `Pipeline` within the test cases:\n   ```python\n   pipeline = Pipeline([\n       (\"preprocess\", DataFrameMapper([\n           (\"petal length (cm)\", None),\n           (\"petal width (cm)\", None),\n           (\"sepal length (cm)\", None),\n           (\"sepal width (cm)\", None),\n       ])),\n       (\"classify\", SVC(kernel='linear'))\n   ])\n   ```\n   will not throw errors and will work as intended. This will resolve the TypeErrors in the tests `test_with_iris_dataframe`, `test_with_car_dataframe`, and `test_direct_cross_validation`.\n\nBy ensuring that the `Pipeline` class is correctly imported, the commit effectively eliminates the root cause of the error, allowing the tests to execute successfully."
    },
    "issue_ground": {
      "title": "Inconsistent Import Style for `Pipeline` in test_dataframe_mapper.py",
      "description": "There is an inconsistency in the way the `Pipeline` class is imported from `sklearn.pipeline` within the test_dataframe_mapper.py file. Currently, it is being imported as a module (`pipeline as Pipeline`), which is not the correct usage style and can lead to confusion and potential import errors during runtime or testing. To maintain a uniform import style across the codebase and to follow Python's best practices for importing classes, the `Pipeline` class should be directly imported from `sklearn.pipeline` as `from sklearn.pipeline import Pipeline`. This change will ensure that the code is more readable and less error-prone, especially during automated testing. Additionally, this consistency is crucial for maintaining code quality and ease of maintenance.",
      "explanation": "## Summary of the Issue\n\nThe issue reported states that there is an inconsistency in the way the `Pipeline` class from `sklearn.pipeline` is imported in the `test_dataframe_mapper.py` file. Specifically, the class is imported as a module with the statement `from sklearn import pipeline as Pipeline`, which is incorrect. This inconsistency has led to runtime errors in the form of `TypeError` because the `Pipeline` class is being used incorrectly as a result of improper import. \n\n## Content of the Commit\n\nTo address this issue, the commit involves correcting the import statement for the `Pipeline` class in `test_dataframe_mapper.py`. The updated import statement should directly import the class from the `sklearn.pipeline`, adhering to Python's best practices for importing classes. Here's how the changes are made:\n\n```python\n# Old import\n9 from sklearn import pipeline as Pipeline\n\n# New import\n9 from sklearn.pipeline import Pipeline\n```\n\nThis change will ensure that the code within `test_dataframe_mapper.py` is using the `Pipeline` class correctly, avoiding the namespace confusion caused by importing the module and renaming it.\n\n## Explanation of How the Commit Solves the Issue\n\nThe commit solves the issue by directly importing the `Pipeline` class from the `sklearn.pipeline` module. This resolves the following errors:\n\n### 1. `TypeError: isinstance() arg 2 must be a type or tuple of types`\nThis error was due to the use of `Pipeline` as a module rather than a class. The `isinstance` function expects a type or a tuple of types, not a module. By correctly importing the `Pipeline` class, the `isinstance(pipeline, Pipeline)` check now correctly identifies `pipeline` as an instance of `Pipeline`.\n\n### 2. `TypeError: 'module' object is not callable`\nThis error occurred when trying to instantiate `Pipeline` with `Pipeline([...])`. Since `Pipeline` was mistakenly imported as a module, trying to call it led to a `TypeError`. Correcting the import to the statement `from sklearn.pipeline import Pipeline` allows for the correct usage of `Pipeline` as a class, and thus makes it callable.\n\nWith the corrected import, the `test_dataframe_mapper.py` will now correctly handle all instances where the `Pipeline` class is used. This change not only fixes the existing import issues but also ensures consistency across the codebase, making the code more readable and maintainable. The automated tests will also run smoothly without encountering the mentioned errors.\n\n## Conclusion\n\nTo summarize, the root cause of the issue was an incorrect import statement for the `Pipeline` class, which led to namespace and type errors. The commit fixes this by using the correct import statement. By directly importing `Pipeline` from `sklearn.pipeline`, the code now correctly references the `Pipeline` class, ensuring proper function calls and type checks, thereby resolving the runtime errors and improving overall code consistency and readability."
    },
    "issue_ground_truth": {
      "title": "Inconsistent Import Style for `Pipeline` in test_dataframe_mapper.py",
      "description": "There is an inconsistency in the way the `Pipeline` class is imported from `sklearn.pipeline` within the test_dataframe_mapper.py file. Currently, it is being imported as a module (`pipeline as Pipeline`), which is not the correct usage style and can lead to confusion and potential import errors during runtime or testing. To maintain a uniform import style across the codebase and to follow Python's best practices for importing classes, the `Pipeline` class should be directly imported from `sklearn.pipeline` as `from sklearn.pipeline import Pipeline`. This change will ensure that the code is more readable and less error-prone, especially during automated testing. Additionally, this consistency is crucial for maintaining code quality and ease of maintenance.",
      "explanation": "### Summary of the Issue:\nThe issue reported is about the inconsistent import style of the `Pipeline` class in the `test_dataframe_mapper.py` file. Specifically, the `Pipeline` class from the `sklearn.pipeline` module is incorrectly imported as a module (i.e., `pipeline as Pipeline`). This import style is not only unconventional but can also potentially lead to confusion and runtime errors. To promote uniformity and readability across the codebase, the import style should follow Python's best practices.\n\n### Cause of the Issue:\nThe primary cause of the issue is the incorrect import statement for the `Pipeline` class. Instead of importing the class directly, the module itself is imported and aliased as `Pipeline`. This contradicts standard Python importing conventions, which recommend importing classes and functions directly for simplicity and clarity. Indirect imports can mislead developers into thinking that they are dealing with a class when they are actually dealing with a module, leading to possible errors when the code is executed.\n\n### Solution Provided by the Commit:\nThe commit addresses this issue by modifying the import statement in the `test_dataframe_mapper.py` file. Instead of importing the `pipeline` module and aliasing it as `Pipeline`, the commit changes the statement to directly import the `Pipeline` class from the `sklearn.pipeline` module. This adjustment conforms to conventional Python import styles, enhancing both the readability and maintainability of the code.\n\n### Detailed Explanation:\n\n1. **Summary of the Issue:**\n   - There is a problem in the file `test_dataframe_mapper.py` where the `Pipeline` class from the `sklearn.pipeline` module is imported incorrectly. It is being imported as a module and then aliased as `Pipeline`.\n   - This incorrect import method is non-standard and can induce confusion or lead to import-related errors during runtime or testing.\n\n2. **Content of the Commit:**\n   - The commit includes an update to the import statement to fix the inconsistency. The original import statement is replaced with a more conventional one.\n   - Specifically, the original statement: `from sklearn import pipeline as Pipeline` is corrected to `from sklearn.pipeline import Pipeline`.\n\n3. **How the Commit Solves the Issue:**\n   - By importing the `Pipeline` class directly, the commit resolves the confusion caused by the previous indirect import style.\n   - Directly importing `Pipeline` ensures that it is clear to anyone reading the code that `Pipeline` is a class from `sklearn.pipeline`, not a module.\n   - This change aligns the code with Python’s best practices, which advocate for straightforward and readable import statements.\n\n4. **Solution to the Issue:**\n   - **Improve Uniformity:** The consistent and direct import style makes the code uniform with other parts of the codebase, which likely also use direct imports.\n   - **Enhance Readability:** It becomes evident to developers that `Pipeline` is a class, not a module, making the code more readable and understandable.\n   - **Reduce Error Potential:** By adhering to the standard import practice, the risk of import-related errors is minimized. This is particularly important during automated testing when unexpected import errors can disrupt the testing process.\n\n### Conclusion:\nThe commit makes a fundamental yet important change to the import statement in `test_dataframe_mapper.py`. It replaces an indirect import with a direct import, bringing clarity, reducing potential errors, and aligning with best practices. This subtle but critical fix enhances the overall code quality, aiding developers in maintaining and extending the codebase efficiently."
    },
    "location_origin": [
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
        "function": {
          "9": "global"
        },
        "content_all": {
          "6": "import pandas as pd\n",
          "7": "from scipy import sparse\n",
          "8": "from sklearn.datasets import load_iris\n",
          "9": "from sklearn import pipeline as Pipeline\n",
          "10": "from sklearn.model_selection import cross_val_score\n",
          "11": "from sklearn.svm import SVC\n",
          "12": "from sklearn.feature_extraction.text import CountVectorizer\n"
        },
        "content_change": {
          "9": "from sklearn.pipeline import Pipeline\n"
        }
      }
    ],
    "location_message": [
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
        "function": {
          "1": "Module Level Import"
        },
        "content_all": {
          "6": "import pandas as pd\n",
          "7": "from scipy import sparse\n",
          "8": "from sklearn.datasets import load_iris\n",
          "9": "from sklearn import pipeline as Pipeline\n",
          "10": "from sklearn.model_selection import cross_val_score\n",
          "11": "from sklearn.svm import SVC\n",
          "12": "from sklearn.feature_extraction.text import CountVectorizer\n"
        },
        "content_change": {
          "9": "from sklearn.pipeline import Pipeline\n"
        }
      },
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
        "function": {
          "686": "test_build_transformers"
        },
        "content_all": {
          "682": "    transformers = [MockTClassifier(), MockTClassifier()]\n",
          "683": "    pipeline = _build_transformer(transformers)\n",
          "684": "\n",
          "685": "    # Verify the type since importing Pipeline was incorrect\n",
          "686": "    assert isinstance(pipeline, Pipeline)\n",
          "687": "\n",
          "688": "def test_with_iris_dataframe(iris_dataframe):\n"
        },
        "content_change": {
          "686": "    assert isinstance(pipeline, Pipeline)\n"
        }
      },
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
        "function": {
          "935": "test_with_iris_dataframe"
        },
        "content_all": {
          "932": "        ]),\n",
          "933": "        (\"classify\", SVC(kernel='linear'))\n",
          "934": "    ])\n",
          "935": "    ]))\n",
          "936": "    ])\n",
          "937": "])\n",
          "938": "    pipeline.fit(X, y)\n"
        },
        "content_change": {
          "935": "    ]))\n"
        }
      },
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
        "function": {
          "970": "test_with_car_dataframe"
        },
        "content_all": {
          "966": "                (\"description\", CountVectorizer()),\n",
          "967": "            ])),\n",
          "968": "            (\"classify\", SVC(kernel='linear'))\n",
          "969": "        ]))\n",
          "970": "    ])\n",
          "971": "])\n",
          "972": "    pipeline.fit(X, y)\n"
        },
        "content_change": {
          "970": "    ])\n"
        }
      },
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
        "function": {
          "987": "test_direct_cross_validation"
        },
        "content_all": {
          "982": "                (\"petal width (cm)\", None),\n",
          "983": "                (\"sepal length (cm)\", None),\n",
          "984": "                (\"sepal width (cm)\", None),\n",
          "985": "            ])),\n",
          "986": "            (\"classify\", SVC(kernel='linear'))\n",
          "987": "        ]))\n",
          "988": "    ])\n",
          "989": "])\n"
        },
        "content_change": {
          "987": "        ]))\n"
        }
      }
    ],
    "location_ground": [
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
        "function": {
          "9": "import statements"
        },
        "content_all": {
          "6": "6 import pandas as pd",
          "7": "7 from scipy import sparse",
          "8": "8 from sklearn.datasets import load_iris",
          "9": "9 from sklearn import pipeline as Pipeline",
          "10": "10 from sklearn.model_selection import cross_val_score",
          "11": "11 from sklearn.svm import SVC",
          "12": "12 from sklearn.feature_extraction.text import CountVectorizer"
        },
        "content_change": {
          "9": "9 from sklearn.pipeline import Pipeline"
        }
      }
    ],
    "location_ground_exp": [
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
        "function": {
          "none": "global"
        },
        "content_all": {
          "6": "import pandas as pd",
          "7": "from scipy import sparse",
          "8": "from sklearn.datasets import load_iris",
          "9": "from sklearn import pipeline as Pipeline",
          "10": "from sklearn.model_selection import cross_val_score",
          "11": "from sklearn.svm import SVC",
          "12": "from sklearn.feature_extraction.text import CountVectorizer"
        },
        "content_change": {
          "9": "from sklearn.pipeline import Pipeline"
        }
      }
    ],
    "location_ground_truth": [
      {
        "file": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
        "function": null,
        "content_all": {
          "6": "import pandas as pd\n",
          "7": "from scipy import sparse\n",
          "8": "from sklearn.datasets import load_iris\n",
          "9": "from sklearn import pipeline as Pipeline\n",
          "10": "from sklearn.model_selection import cross_val_score\n",
          "11": "from sklearn.svm import SVC\n",
          "12": "from sklearn.feature_extraction.text import CountVectorizer\n"
        },
        "content_change": {
          "9": "from sklearn import pipeline as Pipeline\n"
        }
      }
    ],
    "patch_i": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n@@ -6,7 +6,7 @@\n from scipy import sparse\n from sklearn.datasets import load_iris\n-from sklearn import pipeline as Pipeline\n+from sklearn.pipeline import Pipeline\n from sklearn.model_selection import cross_val_score\n from sklearn.svm import SVC\n from sklearn.feature_extraction.text import CountVectorizer\n",
    "patch_im": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n@@ -8,7 +8,7 @@\n import pandas as pd\n from scipy import sparse\n from sklearn.datasets import load_iris\n-from sklearn import pipeline as Pipeline\n+from sklearn.pipeline import Pipeline\n from sklearn.model_selection import cross_val_score\n from sklearn.svm import SVC\n from sklearn.feature_extraction.text import CountVectorizer\n",
    "patch_il": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n@@ -6,7 +6,7 @@\n import pandas as pd\n from scipy import sparse\n from sklearn.datasets import load_iris\n-from sklearn import pipeline as Pipeline\n+from sklearn.pipeline import Pipeline\n from sklearn.model_selection import cross_val_score\n from sklearn.svm import SVC\n from sklearn.feature_extraction.text import CountVectorizer\n",
    "patch_iml": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n@@ -8,7 +8,7 @@\n from unittest.mock import Mock\n from pandas import DataFrame\n import pandas as pd\n from scipy import sparse\n from sklearn.datasets import load_iris\n-from sklearn import pipeline as Pipeline\n+from sklearn.pipeline import Pipeline\n from sklearn.model_selection import cross_val_score\n from sklearn.svm import SVC\n from sklearn.feature_extraction.text import CountVectorizer\n from sklearn.feature_extraction import DictVectorizer\n from sklearn.preprocessing import (\n",
    "patch_ground": "\n--- a/tests/test_dataframe_mapper.py\n+++ b/tests/test_dataframe_mapper.py\n@@ -8 +8 @@\n-from sklearn import pipeline as Pipeline\n+from sklearn.pipeline import Pipeline\n",
    "patch_ground_location": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n@@ -6,7 +6,7 @@\n import pandas as pd\n from scipy import sparse\n from sklearn.datasets import load_iris\n-from sklearn import pipeline as Pipeline\n+from sklearn.pipeline import Pipeline\n from sklearn.model_selection import cross_val_score\n from sklearn.svm import SVC\n from sklearn.feature_extraction.text import CountVectorizer\n",
    "patch_ground_exp": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n@@ -8,7 +8,7 @@\n from sklearn.datasets import load_iris\n-from sklearn import pipeline as Pipeline\n+from sklearn.pipeline import Pipeline\n from sklearn.model_selection import cross_val_score\n from sklearn.svm import SVC\n from sklearn.feature_extraction.text import CountVectorizer\n",
    "patch_ground_all": "\n--- a/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n@@ -8,7 +8,7 @@\n from sklearn.datasets import load_iris\n-from sklearn import pipeline as Pipeline\n+from sklearn.pipeline import Pipeline\n from sklearn.model_selection import cross_val_score\n from sklearn.svm import SVC\n from sklearn.feature_extraction.text import CountVectorizer\n",
    "patch_ground_truth": "--- a/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n+++ b/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py\n@@ -6,7 +6,7 @@\n import pandas as pd\n from scipy import sparse\n from sklearn.datasets import load_iris\n-from sklearn import pipeline as Pipeline\n+from sklearn.pipeline import Pipeline\n from sklearn.model_selection import cross_val_score\n from sklearn.svm import SVC\n from sklearn.feature_extraction.text import CountVectorizer\n",
    "message": "\n============================================================================================= FAILURES ==============================================================================================\n______________________________________________________________________________________ test_build_transformers ______________________________________________________________________________________\n\n    def test_build_transformers():\n        \"\"\"\n        When a list of transformers is passed, return a pipeline with\n        each element of the iterable as a step of the pipeline.\n        \"\"\"\n        transformers = [MockTClassifier(), MockTClassifier()]\n        pipeline = _build_transformer(transformers)\n>       assert isinstance(pipeline, Pipeline)\nE       TypeError: isinstance() arg 2 must be a type or tuple of types\n\ntests/test_dataframe_mapper.py:686: TypeError\n_____________________________________________________________________________________ test_with_iris_dataframe ______________________________________________________________________________________\n\niris_dataframe =      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)    species\n0                  5.1       ...ginica\n149                5.9               3.0                5.1               1.8  virginica\n\n[150 rows x 5 columns]\n\n    def test_with_iris_dataframe(iris_dataframe):\n>       pipeline = Pipeline([\n            (\"preprocess\", DataFrameMapper([\n                (\"petal length (cm)\", None),\n                (\"petal width (cm)\", None),\n                (\"sepal length (cm)\", None),\n                (\"sepal width (cm)\", None),\n            ])),\n            (\"classify\", SVC(kernel='linear'))\n        ])\nE       TypeError: 'module' object is not callable\n\ntests/test_dataframe_mapper.py:935: TypeError\n______________________________________________________________________________________ test_with_car_dataframe ______________________________________________________________________________________\n\ncars_dataframe =      model                                        description\n0      500  Immaculate bodywork, electric panoramic sunr...R REPAIRS DUE TO EPC/EML COMIN...\n541  corsa  1.2 16v auto corsa 3 door, Full mot with no ad...\n\n[542 rows x 2 columns]\n\n    def test_with_car_dataframe(cars_dataframe):\n>       pipeline = Pipeline([\n            (\"preprocess\", DataFrameMapper([\n                (\"description\", CountVectorizer()),\n            ])),\n            (\"classify\", SVC(kernel='linear'))\n        ])\nE       TypeError: 'module' object is not callable\n\ntests/test_dataframe_mapper.py:970: TypeError\n___________________________________________________________________________________ test_direct_cross_validation ____________________________________________________________________________________\n\niris_dataframe =      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)    species\n0                  5.1       ...ginica\n149                5.9               3.0                5.1               1.8  virginica\n\n[150 rows x 5 columns]\n\n    def test_direct_cross_validation(iris_dataframe):\n        \"\"\"\n        Starting with sklearn>=0.16.0 we no longer need CV wrappers for dataframes.\n        See https://github.com/paulgb/sklearn-pandas/issues/11\n        \"\"\"\n>       pipeline = Pipeline([\n            (\"preprocess\", DataFrameMapper([\n                (\"petal length (cm)\", None),\n                (\"petal width (cm)\", None),\n                (\"sepal length (cm)\", None),\n                (\"sepal width (cm)\", None),\n            ])),\n            (\"classify\", SVC(kernel='linear'))\n        ])\nE       TypeError: 'module' object is not callable\n\ntests/test_dataframe_mapper.py:987: TypeError\n========================================================================================= warnings summary ==========================================================================================\ntests/test_dataframe_mapper.py::test_complex_object_df\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:106: SettingWithCopyWarning: \n  A value is trying to be set on a copy of a slice from a DataFrame.\n  Try using .loc[row_indexer,col_indexer] = value instead\n  \n  See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n    X[col] = X[col].map(lambda img: np.max(img))\n\ntests/test_dataframe_mapper.py::test_sparse_features\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:865: DeprecationWarning: Please use `csr_matrix` from the `scipy.sparse` namespace, the `scipy.sparse.csr` namespace is deprecated.\n    assert type(dmatrix) == sparse.csr.csr_matrix\n\ntests/test_dataframe_mapper.py::test_sparse_off\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py:879: DeprecationWarning: Please use `csr_matrix` from the `scipy.sparse` namespace, the `scipy.sparse.csr` namespace is deprecated.\n    assert type(dmatrix) != sparse.csr.csr_matrix\n\ntests/test_transformers.py::test_common_numerical_transformer\ntests/test_transformers.py::test_numerical_transformer_serialization\n  /home/user/Documents/repoben/buggycode/scikit-learn-contrib_sklearn-pandas/sklearn_pandas/transformers.py:35: DeprecationWarning: \n              NumericalTransformer will be deprecated in 3.0 version.\n              Please use Sklearn.base.TransformerMixin to write\n              customer transformers\n              \n    warnings.warn(\"\"\"\n\n-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html\n====================================================================================== short test summary info ======================================================================================\nFAILED tests/test_dataframe_mapper.py::test_build_transformers - TypeError: isinstance() arg 2 must be a type or tuple of types\nFAILED tests/test_dataframe_mapper.py::test_with_iris_dataframe - TypeError: 'module' object is not callable\nFAILED tests/test_dataframe_mapper.py::test_with_car_dataframe - TypeError: 'module' object is not callable\nFAILED tests/test_dataframe_mapper.py::test_direct_cross_validation - TypeError: 'module' object is not callable\n============================================================================= 4 failed, 66 passed, 5 warnings in 0.66s ==============================================================================",
    "CodeBase": [
      {
        "path": "scikit-learn-contrib_sklearn-pandas/tests/test_dataframe_mapper.py",
        "content": "1 # -*- coding: utf8 -*-\n2 \n3 import pytest\n4 from unittest.mock import Mock\n5 from pandas import DataFrame\n6 import pandas as pd\n7 from scipy import sparse\n8 from sklearn.datasets import load_iris\n9 from sklearn import pipeline as Pipeline\n10 from sklearn.model_selection import cross_val_score\n11 from sklearn.svm import SVC\n12 from sklearn.feature_extraction.text import CountVectorizer\n13 from sklearn.feature_extraction import DictVectorizer\n14 from sklearn.preprocessing import (\n15     StandardScaler, OneHotEncoder, LabelBinarizer)\n16 from sklearn.impute import SimpleImputer as Imputer\n17 from sklearn.feature_selection import SelectKBest, chi2\n18 from sklearn.base import BaseEstimator, TransformerMixin\n19 import sklearn.decomposition\n20 import numpy as np\n21 from numpy.testing import assert_array_equal\n22 import pickle\n23 from sklearn.compose import make_column_selector\n24 \n25 from sklearn_pandas import DataFrameMapper\n26 from sklearn_pandas.dataframe_mapper import _handle_feature, _build_transformer\n27 from sklearn_pandas.pipeline import TransformerPipeline\n28 \n29 \n30 class MockXTransformer(object):\n31     \"\"\"\n32     Mock transformer that accepts no y argument.\n33     \"\"\"\n34     def fit(self, X):\n35         return self\n36 \n37     def transform(self, X):\n38         return X\n39 \n40 \n41 class MockTClassifier(object):\n42     \"\"\"\n43     Mock transformer/classifier.\n44     \"\"\"\n45     def fit(self, X, y=None):\n46         return self\n47 \n48     def transform(self, X):\n49         return X\n50 \n51     def predict(self, X):\n52         return True\n53 \n54 \n55 class DateEncoder():\n56     def fit(self, X, y=None):\n57         re(...truncated)"
      },
      {
        "path": "scikit-learn-contrib_sklearn-pandas/setup.py",
        "content": "1 #!/usr/bin/env python\n2 # -*- coding: utf-8 -*-\n3 \n4 from setuptools import setup\n5 from setuptools.command.test import test as TestCommand\n6 import re\n7 \n8 for line in open('sklearn_pandas/__init__.py'):\n9     match = re.match(\"__version__ *= *'(.*)'\", line)\n10     if match:\n11         __version__, = match.groups()\n12 \n13 \n14 class PyTest(TestCommand):\n15     user_options = [('pytest-args=', 'a', \"Arguments to pass to py.test\")]\n16 \n17     def initialize_options(self):\n18         TestCommand.initialize_options(self)\n19         self.pytest_args = []\n20 \n21     def finalize_options(self):\n22         TestCommand.finalize_options(self)\n23         self.test_args = []\n24         self.test_suite = True\n25 \n26     def run(self):\n27         import pytest\n28         errno = pytest.main(self.pytest_args)\n29         raise SystemExit(errno)\n30 \n31 \n32 setup(name='sklearn-pandas',\n33       version=__version__,\n34       description='Pandas integration with sklearn',\n35       maintainer='Ritesh Agrawal',\n36       maintainer_email='ragrawal@gmail.com',\n37       url='https://github.com/scikit-learn-contrib/sklearn-pandas',\n38       packages=['sklearn_pandas'],\n39       keywords=['scikit', 'sklearn', 'pandas'],\n40       install_requires=[\n41           'scikit-learn>=0.23.0',\n42           'scipy>=1.5.1',\n43           'pandas>=1.1.4',\n44           'numpy>=1.18.1'\n45       ],\n46       tests_require=['pytest', 'mock'],\n47       cmdclass={'test': PyTest},\n48       license='MIT License'\n49 )"
      },
      {
        "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/dataframe_mapper.py",
        "content": "1 import contextlib\n2 from datetime import datetime\n3 import pandas as pd\n4 import numpy as np\n5 from scipy import sparse\n6 from sklearn.base import BaseEstimator, TransformerMixin\n7 from .cross_validation import DataWrapper\n8 from .pipeline import make_transformer_pipeline, _call_fit, TransformerPipeline\n9 from . import logger\n10 \n11 string_types = text_type = str\n12 \n13 \n14 def _handle_feature(fea):\n15     \"\"\"\n16     Convert 1-dimensional arrays to 2-dimensional column vectors.\n17     \"\"\"\n18     if len(fea.shape) == 1:\n19         fea = np.array([fea]).T\n20 \n21     return fea\n22 \n23 \n24 def _build_transformer(transformers):\n25     if isinstance(transformers, list):\n26         transformers = make_transformer_pipeline(*transformer(...truncated)"
      },
      {
        "path": "scikit-learn-contrib_sklearn-pandas/sklearn_pandas/pipeline.py",
        "content": "1 import six\n2 from sklearn.pipeline import _name_estimators, Pipeline\n3 from sklearn.utils import tosequence\n4 \n5 \n6 def _call_fit(fit_method, X, y=None, **kwargs):\n7     \"\"\"\n8     helper function, calls the fit or fit_transform method with the correct\n9     number of parameters\n10 \n11     fit_method: fit or fit_transform method of the transformer\n12     X: the data to fit\n13     y: the target vector relative to X, optional\n14     kwargs: any keyword arguments to the fit method\n15 \n16     return: the r(...truncated)"
      },
      {
        "path": "scikit-learn-contrib_sklearn-pandas/tests/test_features_generator.py",
        "content": "1 from collections import Counter\n2 \n3 import pytest\n4 import numpy as np\n5 from pandas import DataFrame\n6 from numpy.testing import assert_array_equal\n7 \n8 from sklearn_pandas import DataFrameMapper\n9 from sklearn_pandas.features_generator import gen_features\n10 \n11 \n12 class MockClass(object):\n13 \n14     def __init__(self, value=1, name='class'):\n15   (...truncated)"
      }
    ],
    "CommitSHA": "c9db2d6dcbf515eade751073f43318e43cae5177"
  },
  "Score": {
    "Difficulty": "Medium",
    "issue_origin": {
      "Title": 6,
      "Description": 5,
      "Reproducibility": 5,
      "Relevance": 7,
      "Explanation": 6,
      "Overall": 6
    },
    "issue_message": {
      "Title": 6,
      "Description": 5,
      "Reproducibility": 4,
      "Relevance": 7,
      "Explanation": 8,
      "Overall": 6
    },
    "issue_ground": {
      "Title": 8,
      "Description": 8,
      "Reproducibility": 8,
      "Relevance": 8,
      "Explanation": 8,
      "Overall": 8
    },
    "issue_ground_truth": {
      "title": "Inconsistent Import Style for `Pipeline` in test_dataframe_mapper.py",
      "description": "There is an inconsistency in the way the `Pipeline` class is imported from `sklearn.pipeline` within the test_dataframe_mapper.py file. Currently, it is being imported as a module (`pipeline as Pipeline`), which is not the correct usage style and can lead to confusion and potential import errors during runtime or testing. To maintain a uniform import style across the codebase and to follow Python's best practices for importing classes, the `Pipeline` class should be directly imported from `sklearn.pipeline` as `from sklearn.pipeline import Pipeline`. This change will ensure that the code is more readable and less error-prone, especially during automated testing. Additionally, this consistency is crucial for maintaining code quality and ease of maintenance.",
      "explanation": "### Summary of the Issue:\nThe issue reported is about the inconsistent import style of the `Pipeline` class in the `test_dataframe_mapper.py` file. Specifically, the `Pipeline` class from the `sklearn.pipeline` module is incorrectly imported as a module (i.e., `pipeline as Pipeline`). This import style is not only unconventional but can also potentially lead to confusion and runtime errors. To promote uniformity and readability across the codebase, the import style should follow Python's best practices.\n\n### Cause of the Issue:\nThe primary cause of the issue is the incorrect import statement for the `Pipeline` class. Instead of importing the class directly, the module itself is imported and aliased as `Pipeline`. This contradicts standard Python importing conventions, which recommend importing classes and functions directly for simplicity and clarity. Indirect imports can mislead developers into thinking that they are dealing with a class when they are actually dealing with a module, leading to possible errors when the code is executed.\n\n### Solution Provided by the Commit:\nThe commit addresses this issue by modifying the import statement in the `test_dataframe_mapper.py` file. Instead of importing the `pipeline` module and aliasing it as `Pipeline`, the commit changes the statement to directly import the `Pipeline` class from the `sklearn.pipeline` module. This adjustment conforms to conventional Python import styles, enhancing both the readability and maintainability of the code.\n\n### Detailed Explanation:\n\n1. **Summary of the Issue:**\n   - There is a problem in the file `test_dataframe_mapper.py` where the `Pipeline` class from the `sklearn.pipeline` module is imported incorrectly. It is being imported as a module and then aliased as `Pipeline`.\n   - This incorrect import method is non-standard and can induce confusion or lead to import-related errors during runtime or testing.\n\n2. **Content of the Commit:**\n   - The commit includes an update to the import statement to fix the inconsistency. The original import statement is replaced with a more conventional one.\n   - Specifically, the original statement: `from sklearn import pipeline as Pipeline` is corrected to `from sklearn.pipeline import Pipeline`.\n\n3. **How the Commit Solves the Issue:**\n   - By importing the `Pipeline` class directly, the commit resolves the confusion caused by the previous indirect import style.\n   - Directly importing `Pipeline` ensures that it is clear to anyone reading the code that `Pipeline` is a class from `sklearn.pipeline`, not a module.\n   - This change aligns the code with Python’s best practices, which advocate for straightforward and readable import statements.\n\n4. **Solution to the Issue:**\n   - **Improve Uniformity:** The consistent and direct import style makes the code uniform with other parts of the codebase, which likely also use direct imports.\n   - **Enhance Readability:** It becomes evident to developers that `Pipeline` is a class, not a module, making the code more readable and understandable.\n   - **Reduce Error Potential:** By adhering to the standard import practice, the risk of import-related errors is minimized. This is particularly important during automated testing when unexpected import errors can disrupt the testing process.\n\n### Conclusion:\nThe commit makes a fundamental yet important change to the import statement in `test_dataframe_mapper.py`. It replaces an indirect import with a direct import, bringing clarity, reducing potential errors, and aligning with best practices. This subtle but critical fix enhances the overall code quality, aiding developers in maintaining and extending the codebase efficiently."
    }
  }
}