{
  "instance_id": "scikit-learn__scikit-learn-13496",
  "repo": "scikit-learn/scikit-learn",
  "created_at": "2019-03-23T09:46:59Z",
  "problem_statement": "Expose warm_start in Isolation forest\nIt seems to me that `sklearn.ensemble.IsolationForest` supports incremental addition of new trees with the `warm_start` parameter of its parent class, `sklearn.ensemble.BaseBagging`.\r\n\r\nEven though this parameter is not exposed in `__init__()`, it gets inherited from `BaseBagging` and one can use it by changing it to `True` after initialization. To make it work, you also have to increment `n_estimators` on every iteration. \r\n\r\nIt took me a while to notice that it actually works, and I had to inspect the source code of both `IsolationForest` and `BaseBagging`. Also, it looks to me that the behavior is in line with `sklearn.ensemble.BaseForest` that is behind e.g. `sklearn.ensemble.RandomForestClassifier`.\r\n\r\nTo make it easier to use, I'd suggest:\r\n* expose `warm_start` in `IsolationForest.__init__()`, default `False`;\r\n* document it in the same way as it is documented for `RandomForestClassifier`, i.e. say:\r\n```py\r\n    warm_start : bool, optional (default=False)\r\n        When set to ``True``, reuse the solution of the previous call to fit\r\n        and add more estimators to the ensemble, otherwise, just fit a whole\r\n        new forest. See :term:`the Glossary <warm_start>`.\r\n```\r\n* add a test to make sure it works properly;\r\n* possibly also mention it in the \"IsolationForest example\" documentation entry;\r\n\n",
  "patch": "diff --git a/sklearn/ensemble/iforest.py b/sklearn/ensemble/iforest.py\n--- a/sklearn/ensemble/iforest.py\n+++ b/sklearn/ensemble/iforest.py\n@@ -120,6 +120,12 @@ class IsolationForest(BaseBagging, OutlierMixin):\n     verbose : int, optional (default=0)\n         Controls the verbosity of the tree building process.\n \n+    warm_start : bool, optional (default=False)\n+        When set to ``True``, reuse the solution of the previous call to fit\n+        and add more estimators to the ensemble, otherwise, just fit a whole\n+        new forest. See :term:`the Glossary <warm_start>`.\n+\n+        .. versionadded:: 0.21\n \n     Attributes\n     ----------\n@@ -173,7 +179,8 @@ def __init__(self,\n                  n_jobs=None,\n                  behaviour='old',\n                  random_state=None,\n-                 verbose=0):\n+                 verbose=0,\n+                 warm_start=False):\n         super().__init__(\n             base_estimator=ExtraTreeRegressor(\n                 max_features=1,\n@@ -185,6 +192,7 @@ def __init__(self,\n             n_estimators=n_estimators,\n             max_samples=max_samples,\n             max_features=max_features,\n+            warm_start=warm_start,\n             n_jobs=n_jobs,\n             random_state=random_state,\n             verbose=verbose)\n",
  "similar_bug_items": [
    {
      "pr_number": 7594,
      "pr_title": "[MRG+1] FIX Make sure GridSearchCV and RandomizedSearchCV are pickle-able",
      "pr_body": "Fixes #7562 \n- Subclasses the `np.ma.MaskedArray` and overrides the `__getstate__` to make obj dtyped `MaskedArray`s pickle-able.\n- Uses this fixed `utils.fixes.MaskedArray` inside `gs.cv_results_`...\n\nThis is based off of https://github.com/numpy/numpy/pull/8122\n\nPlease review @jnothman @amueller @GaelVaroquaux @davechallis\n",
      "issue_id": 7562,
      "issue_title": "Error unpickling RandomizedSearchCV objects in 0.18 due to masked arrays",
      "issue_body": "#### Description\n\nIn version 0.18, loading pickles of fitted RandomizedSearchCV objects results in a `TypeError` exception (from pickle also created with version 0.18).\n\nThe error seems related to the use of masked arrays in the `RandomizedSearchCV.cv_results_` attribute - clearing this before pickling (i.e. setting to to `None`) allows pickling/unpickling to work.\n#### Steps/Code to Reproduce\n\n```\nimport pickle                                                                   \nfrom sklearn.model_selection import RandomizedSearchCV                          \nfrom sklearn.ensemble import RandomForestClassifier                             \n\nX = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]                            \ny = [1, 1, 1, 0, 0, 0]                                                          \n\nmodel = RandomizedSearchCV(RandomForestClassifier(),                            \n                           {'n_estimators': [5, 10, 20]},                       \n                           n_iter=3)                                            \nmodel.fit(X, y)                                                                 \n\nwith open('model.pkl', 'wb') as fh:                                             \n    pickle.dump(model, fh)                                                      \n\nwith open('model.pkl', 'rb') as fh:                                             \n    model = pickle.load(fh)\n\nprint(model.predict(X))\n```\n#### Expected Results\n\n```\n[1, 1, 1, 0, 0, 0]\n```\n#### Actual Results\n\n```\nTraceback (most recent call last):\n  File \"./t.py\", line 19, in <module>\n    model = pickle.load(fh)\n  File \"/Users/dsc/miniconda3/envs/p3/lib/python3.5/site-packages/numpy/ma/core.py\", line 5863, in __setstate__\n    super(MaskedArray, self).__setstate__((shp, typ, isf, raw))\nTypeError: object pickle not returning list\n```\n#### Versions\n\nPython 3.5.1 |Continuum Analytics, Inc.| (default, Dec  7 2015, 11:24:55) 
\n[GCC 4.2.1 (Apple Inc. build 5577)]\nNumPy 1.11.1\nSciPy 0.18.1\nScikit-Learn 0.18\n",
      "issue_closed_at": "2016-10-10T19:33:44Z",
      "base_commit": "33ed90dc0aa0549a5963000d7d070aa18ca389c4",
      "changes": [
        {
          "file": "sklearn/model_selection/_search.py",
          "type": "line",
          "name": "line 30",
          "code": "from ..utils import check_random_state\nfrom ..utils.fixes import sp_version\nfrom ..utils.fixes import rankdata\nfrom ..utils.random import sample_without_replacement\nfrom ..utils.validation import indexable, check_is_fitted\nfrom ..utils.metaestimators import if_delegate_has_method"
        },
        {
          "file": "sklearn/model_selection/_search.py",
          "type": "function",
          "name": "_store",
          "class_name": "BaseSearchCV",
          "code": "def _store(key_name, array, weights=None, splits=False, rank=False):\n            \"\"\"A small helper to store the scores/times to the cv_results_\"\"\"\n            array = np.array(array, dtype=np.float64).reshape(n_candidates,\n                                                              n_splits)\n            if splits:\n                for split_i in range(n_splits):\n                    results[\"split%d_%s\"\n                            % (split_i, key_name)] = array[:, split_i]\n\n            array_means = np.average(array, axis=1, weights=weights)\n            results['mean_%s' % key_name] = array_means\n            # Weighted std is not directly available in numpy\n            array_stds = np.sqrt(np.average((array -\n                                             array_means[:, np.newaxis]) ** 2,\n                                            axis=1, weights=weights))\n            results['std_%s' % key_name] = array_stds\n\n            if rank:\n                results[\"rank_%s\" % key_name] = np.asarray(\n                    rankdata(-array_means, method='min'), dtype=np.int32)"
        },
        {
          "file": "sklearn/utils/fixes.py",
          "type": "function",
          "name": "rankdata",
          "class_name": null,
          "code": "def rankdata(a, method='average'):\n        if method not in ('average', 'min', 'max', 'dense', 'ordinal'):\n            raise ValueError('unknown method \"{0}\"'.format(method))\n\n        arr = np.ravel(np.asarray(a))\n        algo = 'mergesort' if method == 'ordinal' else 'quicksort'\n        sorter = np.argsort(arr, kind=algo)\n\n        inv = np.empty(sorter.size, dtype=np.intp)\n        inv[sorter] = np.arange(sorter.size, dtype=np.intp)\n\n        if method == 'ordinal':\n            return inv + 1\n\n        arr = arr[sorter]\n        obs = np.r_[True, arr[1:] != arr[:-1]]\n        dense = obs.cumsum()[inv]\n\n        if method == 'dense':\n            return dense\n\n        # cumulative counts of each unique value\n        count = np.r_[np.nonzero(obs)[0], len(obs)]\n\n        if method == 'max':\n            return count[dense]\n\n        if method == 'min':\n            return count[dense - 1] + 1\n\n        # average method\n        return .5 * (count[dense] + count[dense - 1] + 1)"
        }
      ]
    },
    {
      "pr_number": 11914,
      "pr_title": "[MRG] ENH Better error message for metrics of neighbors",
      "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist\r\n-->\r\n\r\n#### Reference Issues/PRs\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\n#Fixes #11906 \r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\nAdded expression to error message to get list of valid metrics.\r\n\r\n<!--\r\n#### Any other comments?\r\n\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
      "issue_id": 11906,
      "issue_title": "Better error message for invalid metric in NearestNeighbors ",
      "issue_body": "<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n<!-- Example: Joblib Error thrown when calling fit on LatentDirichletAllocation with evaluate_every > 0-->\r\nError message for invalid metric in NearestNeighbors is unclear.\r\n\r\n#### Steps/Code to Reproduce\r\n<!--\r\nExample:\r\n```python\r\nfrom sklearn.feature_extraction.text import CountVectorizer\r\nfrom sklearn.decomposition import LatentDirichletAllocation\r\n\r\ndocs = [\"Help I have a bug\" for i in range(1000)]\r\n\r\nvectorizer = CountVectorizer(input=docs, analyzer='word')\r\nlda_features = vectorizer.fit_transform(docs)\r\n\r\nlda_model = LatentDirichletAllocation(\r\n    n_topics=10,\r\n    learning_method='online',\r\n    evaluate_every=10,\r\n    n_jobs=4,\r\n)\r\nmodel = lda_model.fit(lda_features)\r\n```\r\nIf the code is too long, feel free to put it in a public gist and link\r\nit in the issue: https://gist.github.com\r\n-->\r\n```python\r\nNearestNeighbors(metric='cheybshev')\r\n```\r\n\r\n#### Expected Results\r\n<!-- Example: No error is thrown. Please paste or describe the expected results.-->\r\nError message stating that metric should be 'cityblock', ... or callable rather than metric not valid for algorithm 'auto'. When I initially saw the error message, I did not realize I had a typo in the metric string. I thought it has something to do with the algorithm.\r\n\r\n#### Actual Results\r\n<!-- Please paste or specifically describe the actual output or traceback. 
-->\r\n```\r\nValueError: Metric 'cheybshev' not valid for algorithm 'auto'\r\n```\r\n\r\n#### Versions\r\n<!--\r\nPlease run the following snippet and paste the output below.\r\nimport platform; print(platform.platform())\r\nimport sys; print(\"Python\", sys.version)\r\nimport numpy; print(\"NumPy\", numpy.__version__)\r\nimport scipy; print(\"SciPy\", scipy.__version__)\r\nimport sklearn; print(\"Scikit-Learn\", sklearn.__version__)\r\n-->\r\nLinux-4.15.0-24-generic-x86_64-with-debian-stretch-sid\r\nPython 3.6.3 |Anaconda custom (64-bit)| (default, Nov  9 2017, 00:19:18) \r\n[GCC 7.2.0]\r\nNumPy 1.13.3\r\nSciPy 0.19.1\r\nScikit-Learn 0.19.1\r\n\r\n<!-- Thanks for contributing! -->\r\n",
      "issue_closed_at": "2018-09-13T15:34:02Z",
      "base_commit": "7ed61a24feb4ffde0bee9342acf4a58e3f946a61",
      "changes": [
        {
          "file": "sklearn/neighbors/__init__.py",
          "type": "line",
          "name": "line 14",
          "code": "from .kde import KernelDensity\nfrom .approximate import LSHForest\nfrom .lof import LocalOutlierFactor\n\n__all__ = ['BallTree',\n           'DistanceMetric',"
        },
        {
          "file": "sklearn/neighbors/__init__.py",
          "type": "line",
          "name": "line 28",
          "code": "           'radius_neighbors_graph',\n           'KernelDensity',\n           'LSHForest',\n           'LocalOutlierFactor']"
        },
        {
          "file": "sklearn/neighbors/base.py",
          "type": "function",
          "name": "_check_algorithm_metric",
          "class_name": "NeighborsBase",
          "code": "def _check_algorithm_metric(self):\n        if self.algorithm not in ['auto', 'brute',\n                                  'kd_tree', 'ball_tree']:\n            raise ValueError(\"unrecognized algorithm: '%s'\" % self.algorithm)\n\n        if self.algorithm == 'auto':\n            if self.metric == 'precomputed':\n                alg_check = 'brute'\n            elif (callable(self.metric) or\n                  self.metric in VALID_METRICS['ball_tree']):\n                alg_check = 'ball_tree'\n            else:\n                alg_check = 'brute'\n        else:\n            alg_check = self.algorithm\n\n        if callable(self.metric):\n            if self.algorithm == 'kd_tree':\n                # callable metric is only valid for brute force and ball_tree\n                raise ValueError(\n                    \"kd_tree algorithm does not support callable metric '%s'\"\n                    % self.metric)\n        elif self.metric not in VALID_METRICS[alg_check]:\n            raise ValueError(\"Metric '%s' not valid for algorithm '%s'\"\n                             % (self.metric, self.algorithm))\n\n        if self.metric_params is not None and 'p' in self.metric_params:\n            warnings.warn(\"Parameter p is found in metric_params. \"\n                          \"The corresponding parameter from __init__ \"\n                          \"is ignored.\", SyntaxWarning, stacklevel=3)\n            effective_p = self.metric_params['p']\n        else:\n            effective_p = self.p\n\n        if self.metric in ['wminkowski', 'minkowski'] and effective_p < 1:\n            raise ValueError(\"p must be greater than one for minkowski metric\")"
        }
      ]
    },
    {
      "pr_number": 13157,
      "pr_title": "[MRG+1]\u00a0API Change default multioutput in RegressorMixin.score to keep consistent with metrics.r2_score",
      "pr_body": "Closes #12772 \r\nWondering if someone has a better way :)\r\nIn the original issue, I tried to ask why we prefer uniform_average, but received no reply. I guess we choose uniform_average to keep consistent with other regression metrics.",
      "issue_id": 12772,
      "issue_title": "Different r2_score multioutput default in r2_score and base.RegressorMixin",
      "issue_body": "We've changed multioutput default in r2_score to \"uniform_average\" in 0.19, but in base.RegressorMixin, we still use ``multioutput='variance_weighted'`` (#5143).\r\nAlso see the strange things below:\r\nhttps://github.com/scikit-learn/scikit-learn/blob/4603e481e9ac67eaf906ae5936263b675ba9bc9c/sklearn/multioutput.py#L283-L286",
      "issue_closed_at": "2019-03-15T09:47:51Z",
      "base_commit": "85440978f517118e78dc15f84e397d50d14c8097",
      "changes": [
        {
          "file": "sklearn/base.py",
          "type": "function",
          "name": "score",
          "class_name": "DensityMixin",
          "code": "def score(self, X, y=None):\n        \"\"\"Returns the score of the model on the data X\n\n        Parameters\n        ----------\n        X : array-like, shape = (n_samples, n_features)\n\n        Returns\n        -------\n        score : float\n        \"\"\"\n        pass"
        },
        {
          "file": "sklearn/linear_model/coordinate_descent.py",
          "type": "class",
          "name": "MultiTaskLassoCV",
          "code": "class MultiTaskLassoCV(LinearModelCV, RegressorMixin):\n    \"\"\"Multi-task Lasso model trained with L1/L2 mixed-norm as regularizer.\n\n    See glossary entry for :term:`cross-validation estimator`.\n\n    The optimization objective for MultiTaskLasso is::\n\n        (1 / (2 * n_samples)) * ||Y - XW||^Fro_2 + alpha * ||W||_21\n\n    Where::\n\n        ||W||_21 = \\\\sum_i \\\\sqrt{\\\\sum_j w_{ij}^2}\n\n    i.e. the sum of norm of each row.\n\n    Read more in the :ref:`User Guide <multi_task_lasso>`.\n\n    Parameters\n    ----------\n    eps : float, optional\n        Length of the path. ``eps=1e-3`` means that\n        ``alpha_min / alpha_max = 1e-3``.\n\n    n_alphas : int, optional\n        Number of alphas along the regularization path\n\n    alphas : array-like, optional\n        List of alphas where to compute the models.\n        If not provided, set automatically.\n\n    fit_intercept : boolean\n        whether to calculate the intercept for this model. If set\n        to false, no intercept will be used in calculations\n        (e.g. 
data is expected to be already centered).\n\n    normalize : boolean, optional, default False\n        This parameter is ignored when ``fit_intercept`` is set to False.\n        If True, the regressors X will be normalized before regression by\n        subtracting the mean and dividing by the l2-norm.\n        If you wish to standardize, please use\n        :class:`sklearn.preprocessing.StandardScaler` before calling ``fit``\n        on an estimator with ``normalize=False``.\n\n    max_iter : int, optional\n        The maximum number of iterations.\n\n    tol : float, optional\n        The tolerance for the optimization: if the updates are\n        smaller than ``tol``, the optimization code checks the\n        dual gap for optimality and continues until it is smaller\n        than ``tol``.\n\n    copy_X : boolean, optional, default True\n        If ``True``, X will be copied; else, it may be overwritten.\n\n    cv : int, cross-validation generator or an iterable, optional\n        Determines the cross-validation splitting strategy.\n        Possible inputs for cv are:\n\n        - None, to use the default 3-fold cross-validation,\n        - integer, to specify the number of folds.\n        - :term:`CV splitter`,\n        - An iterable yielding (train, test) splits as arrays of indices.\n\n        For integer/None inputs, :class:`KFold` is used.\n\n        Refer :ref:`User Guide <cross_validation>` for the various\n        cross-validation strategies that can be used here.\n\n        .. versionchanged:: 0.20\n            ``cv`` default value if None will change from 3-fold to 5-fold\n            in v0.22.\n\n    verbose : bool or integer\n        Amount of verbosity.\n\n    n_jobs : int or None, optional (default=None)\n        Number of CPUs to use during the cross validation. 
Note that this is\n        used only if multiple values for l1_ratio are given.\n        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.\n        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`\n        for more details.\n\n    random_state : int, RandomState instance or None, optional, default None\n        The seed of the pseudo random number generator that selects a random\n        feature to update.  If int, random_state is the seed used by the random\n        number generator; If RandomState instance, random_state is the random\n        number generator; If None, the random number generator is the\n        RandomState instance used by `np.random`. Used when ``selection`` ==\n        'random'\n\n    selection : str, default 'cyclic'\n        If set to 'random', a random coefficient is updated every iteration\n        rather than looping over features sequentially by default. This\n        (setting to 'random') often leads to significantly faster convergence\n        especially when tol is higher than 1e-4.\n\n    Attributes\n    ----------\n    intercept_ : array, shape (n_tasks,)\n        Independent term in decision function.\n\n    coef_ : array, shape (n_tasks, n_features)\n        Parameter vector (W in the cost function formula).\n        Note that ``coef_`` stores the transpose of ``W``, ``W.T``.\n\n    alpha_ : float\n        The amount of penalization chosen by cross validation\n\n    mse_path_ : array, shape (n_alphas, n_folds)\n        mean square error for the test set on each fold, varying alpha\n\n    alphas_ : numpy array, shape (n_alphas,)\n        The grid of alphas used for fitting.\n\n    n_iter_ : int\n        number of iterations run by the coordinate descent solver to reach\n        the specified tolerance for the optimal alpha.\n\n    Examples\n    --------\n    >>> from sklearn.linear_model import MultiTaskLassoCV\n    >>> from sklearn.datasets import make_regression\n    >>> X, y = 
make_regression(n_targets=2, noise=4, random_state=0)\n    >>> reg = MultiTaskLassoCV(cv=5, random_state=0).fit(X, y)\n    >>> reg.score(X, y) # doctest: +ELLIPSIS\n    0.9994...\n    >>> reg.alpha_\n    0.5713...\n    >>> reg.predict(X[:1,])\n    array([[153.7971...,  94.9015...]])\n\n    See also\n    --------\n    MultiTaskElasticNet\n    ElasticNetCV\n    MultiTaskElasticNetCV\n\n    Notes\n    -----\n    The algorithm used to fit the model is coordinate descent.\n\n    To avoid unnecessary memory duplication the X argument of the fit method\n    should be directly passed as a Fortran-contiguous numpy array.\n    \"\"\"\n    path = staticmethod(lasso_path)\n\n    def __init__(self, eps=1e-3, n_alphas=100, alphas=None, fit_intercept=True,\n                 normalize=False, max_iter=1000, tol=1e-4, copy_X=True,\n                 cv='warn', verbose=False, n_jobs=None, random_state=None,\n                 selection='cyclic'):\n        super().__init__(\n            eps=eps, n_alphas=n_alphas, alphas=alphas,\n            fit_intercept=fit_intercept, normalize=normalize,\n            max_iter=max_iter, tol=tol, copy_X=copy_X,\n            cv=cv, verbose=verbose, n_jobs=n_jobs, random_state=random_state,\n            selection=selection)\n\n    def _more_tags(self):\n        return {'multioutput_only': True}"
        },
        {
          "file": "sklearn/multioutput.py",
          "type": "function",
          "name": "partial_fit",
          "class_name": "MultiOutputRegressor",
          "code": "def partial_fit(self, X, y, sample_weight=None):\n        \"\"\"Incrementally fit the model to data.\n        Fit a separate model for each output variable.\n\n        Parameters\n        ----------\n        X : (sparse) array-like, shape (n_samples, n_features)\n            Data.\n\n        y : (sparse) array-like, shape (n_samples, n_outputs)\n            Multi-output targets.\n\n        sample_weight : array-like, shape = (n_samples) or None\n            Sample weights. If None, then samples are equally weighted.\n            Only supported if the underlying regressor supports sample\n            weights.\n\n        Returns\n        -------\n        self : object\n        \"\"\"\n        super().partial_fit(\n            X, y, sample_weight=sample_weight)"
        }
      ]
    },
    {
      "pr_number": 6907,
      "pr_title": "[MRG+1] Added support for sample_weight in linearSVR, including tests and documentation. Fixes #6862",
      "pr_body": "<!--\nThanks for contributing a pull request! Please ensure you have taken a look at\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#Contributing-Pull-Requests\n-->\n#### Reference Issue\n\n<!-- Example: Fixes #1234 -->\n#### What does this implement/fix? Explain your changes.\n#### Any other comments?\n\n<!--\nPlease be aware that we are a loose team of volunteers so patience is\nnecessary; assistance handling other issues is very welcome. We value\nall user contributions, no matter how minor they are. If we are slow to\nreview, either the pull request needs some benchmarking, tinkering,\nconvincing, etc. or more likely the reviewers are simply busy. In either\ncase, we ask for your understanding during the review process.\nFor more information, see our FAQ on this topic:\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\n\nThanks for contributing!\n-->\n\n\u2026umentation\n",
      "issue_id": 6860,
      "issue_title": "[Question]When excuting \"make html\" to generate the full web page of \"http://scikit-learn.org\", it pops up ERROR",
      "issue_body": "I wanted to generate the full web page of \"http://scikit-learn.org\" under the guide of \"scikit-learn-master\\doc\\README.md\", there are the error messages:\n D:\\scikit-learn-master\\doc>make html\n Running Sphinx v1.3.1\n\nException occurred:\n File \"D:\\Anaconda3\\lib\\subprocess.py\", line 1220, in _execute_child\n startupinfo)\n\nFileNotFoundError: [WinError 2] \u7cfb\u7edf\u627e\u4e0d\u5230\u6307\u5b9a\u7684\u6587\u4ef6\u3002\nThe full traceback has been saved in C:...\\Local\\Temp\\sphinx-err-c4x44do0.log, if you want to report the issue to the developers.\n Please also report this if it was a user error, so that a better error message can be provided next time.\n A bug report can be filed in the tracker at https://github.com/sphinx-doc/sphinx/issues. Thanks!\n\nBuild finished. The HTML pages are in _build/html.\n\nI opened up D:\\Anaconda3\\lib\\subprocess.py found line 1220 was that winapi create process \n\n```\n   try:\n        hp, ht, pid, tid = _winapi.CreateProcess(executable, args,\n                                 # no special security\n                                 None, None,\n                                 int(not close_fds),\n                                 creationflags,\n                                 env,\n                                 cwd,\n                                 startupinfo)\n```\n\nAnd there is the attachment file.\n[sphinx-err-c4x44do0.log.txt](https://github.com/scikit-learn/scikit-learn/files/299912/sphinx-err-c4x44do0.log.txt)\n",
      "issue_closed_at": "2016-06-21T13:30:54Z",
      "base_commit": "4a2bc34be20bc6df06d61cc936387a00b2fd155e",
      "changes": [
        {
          "file": "sklearn/svm/classes.py",
          "type": "line",
          "name": "line 6",
          "code": "from ..linear_model.base import LinearClassifierMixin, SparseCoefMixin, \\\n    LinearModel\nfrom ..feature_selection.from_model import _LearntSelectorMixin\nfrom ..utils import check_X_y\nfrom ..utils.validation import _num_samples\nfrom ..utils.multiclass import check_classification_targets\n"
        },
        {
          "file": "sklearn/svm/classes.py",
          "type": "function",
          "name": "__init__",
          "class_name": "OneClassSVM",
          "code": "def __init__(self, kernel='rbf', degree=3, gamma='auto', coef0=0.0,\n                 tol=1e-3, nu=0.5, shrinking=True, cache_size=200,\n                 verbose=False, max_iter=-1, random_state=None):\n\n        super(OneClassSVM, self).__init__(\n            'one_class', kernel, degree, gamma, coef0, tol, 0., nu, 0.,\n            shrinking, False, cache_size, None, verbose, max_iter,\n            random_state)"
        },
        {
          "file": "sklearn/svm/classes.py",
          "type": "function",
          "name": "fit",
          "class_name": "OneClassSVM",
          "code": "def fit(self, X, y=None, sample_weight=None, **params):\n        \"\"\"\n        Detects the soft boundary of the set of samples X.\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix}, shape (n_samples, n_features)\n            Set of samples, where n_samples is the number of samples and\n            n_features is the number of features.\n\n        sample_weight : array-like, shape (n_samples,)\n            Per-sample weights. Rescale C per sample. Higher weights\n            force the classifier to put more emphasis on these points.\n\n        Returns\n        -------\n        self : object\n            Returns self.\n\n        Notes\n        -----\n        If X is not a C-ordered contiguous array it is copied.\n\n        \"\"\"\n        super(OneClassSVM, self).fit(X, np.ones(_num_samples(X)), sample_weight=sample_weight,\n                                     **params)\n        return self"
        },
        {
          "file": "sklearn/svm/classes.py",
          "type": "class",
          "name": "SVR",
          "code": "class SVR(BaseLibSVM, RegressorMixin):\n    \"\"\"Epsilon-Support Vector Regression.\n\n    The free parameters in the model are C and epsilon.\n\n    The implementation is based on libsvm.\n\n    Read more in the :ref:`User Guide <svm_regression>`.\n\n    Parameters\n    ----------\n    C : float, optional (default=1.0)\n        Penalty parameter C of the error term.\n\n    epsilon : float, optional (default=0.1)\n         Epsilon in the epsilon-SVR model. It specifies the epsilon-tube\n         within which no penalty is associated in the training loss function\n         with points predicted within a distance epsilon from the actual\n         value.\n\n    kernel : string, optional (default='rbf')\n         Specifies the kernel type to be used in the algorithm.\n         It must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or\n         a callable.\n         If none is given, 'rbf' will be used. If a callable is given it is\n         used to precompute the kernel matrix.\n\n    degree : int, optional (default=3)\n        Degree of the polynomial kernel function ('poly').\n        Ignored by all other kernels.\n\n    gamma : float, optional (default='auto')\n        Kernel coefficient for 'rbf', 'poly' and 'sigmoid'.\n        If gamma is 'auto' then 1/n_features will be used instead.\n\n    coef0 : float, optional (default=0.0)\n        Independent term in kernel function.\n        It is only significant in 'poly' and 'sigmoid'.\n\n    shrinking : boolean, optional (default=True)\n        Whether to use the shrinking heuristic.\n\n    tol : float, optional (default=1e-3)\n        Tolerance for stopping criterion.\n\n    cache_size : float, optional\n        Specify the size of the kernel cache (in MB).\n\n    verbose : bool, default: False\n        Enable verbose output. 
Note that this setting takes advantage of a\n        per-process runtime setting in libsvm that, if enabled, may not work\n        properly in a multithreaded context.\n\n    max_iter : int, optional (default=-1)\n        Hard limit on iterations within solver, or -1 for no limit.\n\n    Attributes\n    ----------\n    support_ : array-like, shape = [n_SV]\n        Indices of support vectors.\n\n    support_vectors_ : array-like, shape = [nSV, n_features]\n        Support vectors.\n\n    dual_coef_ : array, shape = [1, n_SV]\n        Coefficients of the support vector in the decision function.\n\n    coef_ : array, shape = [1, n_features]\n        Weights assigned to the features (coefficients in the primal\n        problem). This is only available in the case of a linear kernel.\n\n        `coef_` is readonly property derived from `dual_coef_` and\n        `support_vectors_`.\n\n    intercept_ : array, shape = [1]\n        Constants in decision function.\n\n    Examples\n    --------\n    >>> from sklearn.svm import SVR\n    >>> import numpy as np\n    >>> n_samples, n_features = 10, 5\n    >>> np.random.seed(0)\n    >>> y = np.random.randn(n_samples)\n    >>> X = np.random.randn(n_samples, n_features)\n    >>> clf = SVR(C=1.0, epsilon=0.2)\n    >>> clf.fit(X, y) #doctest: +NORMALIZE_WHITESPACE\n    SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.2, gamma='auto',\n        kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)\n\n    See also\n    --------\n    NuSVR\n        Support Vector Machine for regression implemented using libsvm\n        using a parameter to control the number of support vectors.\n\n    LinearSVR\n        Scalable Linear Support Vector Machine for regression\n        implemented using liblinear.\n    \"\"\"\n    def __init__(self, kernel='rbf', degree=3, gamma='auto', coef0=0.0,\n                 tol=1e-3, C=1.0, epsilon=0.1, shrinking=True,\n                 cache_size=200, verbose=False, max_iter=-1):\n\n        
super(SVR, self).__init__(\n            'epsilon_svr', kernel=kernel, degree=degree, gamma=gamma,\n            coef0=coef0, tol=tol, C=C, nu=0., epsilon=epsilon, verbose=verbose,\n            shrinking=shrinking, probability=False, cache_size=cache_size,\n            class_weight=None, max_iter=max_iter, random_state=None)"
        }
      ]
    },
    {
      "pr_number": 12104,
      "pr_title": "[MRG] Convert ColumnTransformer input list to numpy array",
      "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist\r\n-->\r\n\r\n#### Reference Issues/PRs\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\nFixes #12096.\r\n\r\n#### What does this implement/fix? Explain your changes.\r\nConverts the input list for ColumnTransformer to a numpy array.\r\n\r\nAdded a check inside `transform` and `fit_transform` to check if the input `X` is a list, if it is then it gets converted to a numpy array.\r\n\r\n#### Any other comments?\r\nShould this conversion be documented in the docstrings for ColumnTransfomer's `fit`, `transform` and `fit_transform`?\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
      "issue_id": 12096,
      "issue_title": "ColumnTransformer breaks where X is a list",
      "issue_body": "```py\r\n>>> from sklearn.preprocessing import StandardScaler\r\n>>> from sklearn.compose import ColumnTransformer\r\n>>> ColumnTransformer([('foobar', StandardScaler(), [0, 1, 2])]).fit([[1, 2, 3]])\r\nTraceback (most recent call last):\r\n  File \"<stdin>\", line 1, in <module>\r\n  File \"/Users/joel/repos/scikit-learn/sklearn/compose/_column_transformer.py\", line 398, in fit\r\n    self.fit_transform(X, y=y)\r\n  File \"/Users/joel/repos/scikit-learn/sklearn/compose/_column_transformer.py\", line 422, in fit_transform\r\n    self._validate_remainder(X)\r\n  File \"/Users/joel/repos/scikit-learn/sklearn/compose/_column_transformer.py\", line 275, in _validate_remainder\r\n    n_columns = X.shape[1]\r\nAttributeError: 'list' object has no attribute 'shape'\r\n```\r\n\r\nThe passed list should be interpreted as an array for the sake of extracting columns. Instead an error is raised.",
      "issue_closed_at": "2018-09-25T14:37:16Z",
      "base_commit": "0c0a9e8406987fd53ccd705c3e455514c47c49c4",
      "changes": [
        {
          "file": "sklearn/compose/_column_transformer.py",
          "type": "line",
          "name": "line 16",
          "code": "from ..base import clone, TransformerMixin\nfrom ..utils import Parallel, delayed\nfrom ..externals import six\nfrom ..pipeline import (\n    _fit_one_transformer, _fit_transform_one, _transform_one, _name_estimators)\nfrom ..preprocessing import FunctionTransformer\nfrom ..utils import Bunch\nfrom ..utils.metaestimators import _BaseComposition\nfrom ..utils.validation import check_is_fitted\n\n\n__all__ = ['ColumnTransformer', 'make_column_transformer']"
        },
        {
          "file": "sklearn/compose/_column_transformer.py",
          "type": "function",
          "name": "fit_transform",
          "class_name": "ColumnTransformer",
          "code": "def fit_transform(self, X, y=None):\n        \"\"\"Fit all transformers, transform the data and concatenate results.\n\n        Parameters\n        ----------\n        X : array-like or DataFrame of shape [n_samples, n_features]\n            Input data, of which specified subsets are used to fit the\n            transformers.\n\n        y : array-like, shape (n_samples, ...), optional\n            Targets for supervised learning.\n\n        Returns\n        -------\n        X_t : array-like or sparse matrix, shape (n_samples, sum_n_components)\n            hstack of results of transformers. sum_n_components is the\n            sum of n_components (output dimension) over transformers. If\n            any result is a sparse matrix, everything will be converted to\n            sparse matrices.\n\n        \"\"\"\n        self._validate_transformers()\n        self._validate_column_callables(X)\n        self._validate_remainder(X)\n\n        result = self._fit_transform(X, y, _fit_transform_one)\n\n        if not result:\n            self._update_fitted_transformers([])\n            # All transformers are None\n            return np.zeros((X.shape[0], 0))\n\n        Xs, transformers = zip(*result)\n\n        # determine if concatenated output will be sparse or not\n        if all(sparse.issparse(X) for X in Xs):\n            self.sparse_output_ = True\n        elif any(sparse.issparse(X) for X in Xs):\n            nnz = sum(X.nnz if sparse.issparse(X) else X.size for X in Xs)\n            total = sum(X.shape[0] * X.shape[1] if sparse.issparse(X)\n                        else X.size for X in Xs)\n            density = nnz / total\n            self.sparse_output_ = density < self.sparse_threshold\n        else:\n            self.sparse_output_ = False\n\n        self._update_fitted_transformers(transformers)\n        self._validate_output(Xs)\n\n        return self._hstack(list(Xs))"
        },
        {
          "file": "sklearn/compose/_column_transformer.py",
          "type": "function",
          "name": "transform",
          "class_name": "ColumnTransformer",
          "code": "def transform(self, X):\n        \"\"\"Transform X separately by each transformer, concatenate results.\n\n        Parameters\n        ----------\n        X : array-like or DataFrame of shape [n_samples, n_features]\n            The data to be transformed by subset.\n\n        Returns\n        -------\n        X_t : array-like or sparse matrix, shape (n_samples, sum_n_components)\n            hstack of results of transformers. sum_n_components is the\n            sum of n_components (output dimension) over transformers. If\n            any result is a sparse matrix, everything will be converted to\n            sparse matrices.\n\n        \"\"\"\n        check_is_fitted(self, 'transformers_')\n\n        Xs = self._fit_transform(X, None, _transform_one, fitted=True)\n        self._validate_output(Xs)\n\n        if not Xs:\n            # All transformers are None\n            return np.zeros((X.shape[0], 0))\n\n        return self._hstack(list(Xs))"
        },
        {
          "file": "sklearn/compose/_column_transformer.py",
          "type": "function",
          "name": "_hstack",
          "class_name": "ColumnTransformer",
          "code": "def _hstack(self, Xs):\n        \"\"\"Stacks Xs horizontally.\n\n        This allows subclasses to control the stacking behavior, while reusing\n        everything else from ColumnTransformer.\n\n        Parameters\n        ----------\n        Xs : List of numpy arrays, sparse arrays, or DataFrames\n        \"\"\"\n        if self.sparse_output_:\n            return sparse.hstack(Xs).tocsr()\n        else:\n            Xs = [f.toarray() if sparse.issparse(f) else f for f in Xs]\n            return np.hstack(Xs)"
        }
      ]
    }
  ]
}