{
  "instance_id": "scikit-learn__scikit-learn-15512",
  "repo": "scikit-learn/scikit-learn",
  "created_at": "2019-11-02T22:28:57Z",
  "problem_statement": "Return values of non converged affinity propagation clustering\nThe affinity propagation Documentation states: \r\n\"When the algorithm does not converge, it returns an empty array as cluster_center_indices and -1 as label for each training sample.\"\r\n\r\nExample:\r\n```python\r\nfrom sklearn.cluster import AffinityPropagation\r\nimport pandas as pd\r\n\r\ndata = pd.DataFrame([[1,0,0,0,0,0],[0,1,1,1,0,0],[0,0,1,0,0,1]])\r\naf = AffinityPropagation(affinity='euclidean', verbose=True, copy=False, max_iter=2).fit(data)\r\n\r\nprint(af.cluster_centers_indices_)\r\nprint(af.labels_)\r\n\r\n```\r\nI would expect that the clustering here (which does not converge) prints first an empty List and then [-1,-1,-1], however, I get [2] as cluster center and [0,0,0] as cluster labels. \r\nThe only way I currently know if the clustering fails is if I use the verbose option, however that is very unhandy. A hacky solution is to check if max_iter == n_iter_ but it could have converged exactly 15 iterations before max_iter (although unlikely).\r\nI am not sure if this is intended behavior and the documentation is wrong?\r\n\r\nFor my use-case within a bigger script, I would prefer to get back -1 values or have a property to check if it has converged, as otherwise, a user might not be aware that the clustering never converged.\r\n\r\n\r\n#### Versions\r\nSystem:\r\n    python: 3.6.7 | packaged by conda-forge | (default, Nov 21 2018, 02:32:25)  [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]\r\nexecutable: /home/jenniferh/Programs/anaconda3/envs/TF_RDKit_1_19/bin/python\r\n   machine: Linux-4.15.0-52-generic-x86_64-with-debian-stretch-sid\r\nBLAS:\r\n    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None\r\n  lib_dirs: /home/jenniferh/Programs/anaconda3/envs/TF_RDKit_1_19/lib\r\ncblas_libs: mkl_rt, pthread\r\nPython deps:\r\n    pip: 18.1\r\n   setuptools: 40.6.3\r\n   sklearn: 0.20.3\r\n   numpy: 1.15.4\r\n   scipy: 1.2.0\r\n   Cython: 0.29.2\r\n   pandas: 0.23.4\r\n\r\n\n",
  "patch": "diff --git a/sklearn/cluster/_affinity_propagation.py b/sklearn/cluster/_affinity_propagation.py\n--- a/sklearn/cluster/_affinity_propagation.py\n+++ b/sklearn/cluster/_affinity_propagation.py\n@@ -194,17 +194,19 @@ def affinity_propagation(S, preference=None, convergence_iter=15, max_iter=200,\n             unconverged = (np.sum((se == convergence_iter) + (se == 0))\n                            != n_samples)\n             if (not unconverged and (K > 0)) or (it == max_iter):\n+                never_converged = False\n                 if verbose:\n                     print(\"Converged after %d iterations.\" % it)\n                 break\n     else:\n+        never_converged = True\n         if verbose:\n             print(\"Did not converge\")\n \n     I = np.flatnonzero(E)\n     K = I.size  # Identify exemplars\n \n-    if K > 0:\n+    if K > 0 and not never_converged:\n         c = np.argmax(S[:, I], axis=1)\n         c[I] = np.arange(K)  # Identify clusters\n         # Refine the final set of exemplars and clusters and return results\n@@ -408,6 +410,7 @@ def predict(self, X):\n             Cluster labels.\n         \"\"\"\n         check_is_fitted(self)\n+        X = check_array(X)\n         if not hasattr(self, \"cluster_centers_\"):\n             raise ValueError(\"Predict method is not supported when \"\n                              \"affinity='precomputed'.\")\n",
  "similar_bug_items": [
    {
      "pr_number": 4322,
      "pr_title": "[MRG+2] Pass include_self=True to kneighbors_graph",
      "pr_body": "Fixes #4235.\n",
      "issue_id": 4235,
      "issue_title": "Check all usages of kneighbors_graph",
      "issue_body": "After @MechCoder fixed kneighbors_graph in #4046, we should check all uses in the code for whether we want `include_self==True` or not. I came across this in `SpectralEmbedding` where the current default makes no sense, I think.\nYou can simply `git grep` and should find a lot of occurrences.\nIf our test output wasn't so flooded with warnings, we would have probably detected that earlier :-/\n",
      "issue_closed_at": "2015-03-20T12:05:11Z",
      "base_commit": "4be62142aa3afed11d45c9cdcb066b3d8ba9badf",
      "changes": [
        {
          "file": "examples/cluster/plot_cluster_comparison.py",
          "type": "line",
          "name": "line 45",
          "code": "colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])\ncolors = np.hstack([colors] * 20)\n\nclustering_names =  [\n    'MiniBatchKMeans', 'AffinityPropagation', 'MeanShift',\n    'SpectralClustering', 'Ward', 'AgglomerativeClustering',\n    'DBSCAN', 'Birch'\n    ]\n\nplt.figure(figsize=(len(clustering_names) * 2 + 3, 9.5))\nplt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05,"
        },
        {
          "file": "examples/cluster/plot_cluster_comparison.py",
          "type": "line",
          "name": "line 67",
          "code": "    bandwidth = cluster.estimate_bandwidth(X, quantile=0.3)\n\n    # connectivity matrix for structured Ward\n    connectivity = kneighbors_graph(X, n_neighbors=10)\n    # make connectivity symmetric\n    connectivity = 0.5 * (connectivity + connectivity.T)\n"
        },
        {
          "file": "examples/cluster/plot_cluster_comparison.py",
          "type": "line",
          "name": "line 83",
          "code": "    affinity_propagation = cluster.AffinityPropagation(damping=.9,\n                                                       preference=-200)\n\n    average_linkage = cluster.AgglomerativeClustering(linkage=\"average\",\n                            affinity=\"cityblock\", n_clusters=2,\n                            connectivity=connectivity)\n\n    birch = cluster.Birch(n_clusters=2)\n    clustering_algorithms = [\n        two_means, affinity_propagation, ms, spectral, ward, average_linkage,\n        dbscan, birch\n        ]\n\n    for name, algorithm in zip(clustering_names, clustering_algorithms):\n        # predict cluster memberships"
        },
        {
          "file": "examples/cluster/plot_ward_structured_vs_unstructured.py",
          "type": "line",
          "name": "line 65",
          "code": "###############################################################################\n# Define the structure A of the data. Here a 10 nearest neighbors\nfrom sklearn.neighbors import kneighbors_graph\nconnectivity = kneighbors_graph(X, n_neighbors=10)\n\n###############################################################################\n# Compute clustering"
        },
        {
          "file": "sklearn/manifold/spectral_embedding_.py",
          "type": "function",
          "name": "_get_affinity_matrix",
          "class_name": "SpectralEmbedding",
          "code": "def _get_affinity_matrix(self, X, Y=None):\n        \"\"\"Caclulate the affinity matrix from data\n        Parameters\n        ----------\n        X : array-like, shape (n_samples, n_features)\n            Training vector, where n_samples in the number of samples\n            and n_features is the number of features.\n\n            If affinity is \"precomputed\"\n            X : array-like, shape (n_samples, n_samples),\n            Interpret X as precomputed adjacency graph computed from\n            samples.\n\n        Returns\n        -------\n        affinity_matrix, shape (n_samples, n_samples)\n        \"\"\"\n        if self.affinity == 'precomputed':\n            self.affinity_matrix_ = X\n            return self.affinity_matrix_\n        if self.affinity == 'nearest_neighbors':\n            if sparse.issparse(X):\n                warnings.warn(\"Nearest neighbors affinity currently does \"\n                              \"not support sparse input, falling back to \"\n                              \"rbf affinity\")\n                self.affinity = \"rbf\"\n            else:\n                self.n_neighbors_ = (self.n_neighbors\n                                     if self.n_neighbors is not None\n                                     else max(int(X.shape[0] / 10), 1))\n                self.affinity_matrix_ = kneighbors_graph(X, self.n_neighbors_)\n                # currently only symmetric affinity_matrix supported\n                self.affinity_matrix_ = 0.5 * (self.affinity_matrix_ +\n                                               self.affinity_matrix_.T)\n                return self.affinity_matrix_\n        if self.affinity == 'rbf':\n            self.gamma_ = (self.gamma\n                           if self.gamma is not None else 1.0 / X.shape[1])\n            self.affinity_matrix_ = rbf_kernel(X, gamma=self.gamma_)\n            return self.affinity_matrix_\n        self.affinity_matrix_ = self.affinity(X)\n        return self.affinity_matrix_"
        }
      ]
    },
    {
      "pr_number": 11914,
      "pr_title": "[MRG] ENH Better error message for metrics of neighbors",
      "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist\r\n-->\r\n\r\n#### Reference Issues/PRs\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\n#Fixes #11906 \r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\nAdded expression to error message to get list of valid metrics.\r\n\r\n<!--\r\n#### Any other comments?\r\n\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
      "issue_id": 11906,
      "issue_title": "Better error message for invalid metric in NearestNeighbors ",
      "issue_body": "<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n<!-- Example: Joblib Error thrown when calling fit on LatentDirichletAllocation with evaluate_every > 0-->\r\nError message for invalid metric in NearestNeighbors is unclear.\r\n\r\n#### Steps/Code to Reproduce\r\n<!--\r\nExample:\r\n```python\r\nfrom sklearn.feature_extraction.text import CountVectorizer\r\nfrom sklearn.decomposition import LatentDirichletAllocation\r\n\r\ndocs = [\"Help I have a bug\" for i in range(1000)]\r\n\r\nvectorizer = CountVectorizer(input=docs, analyzer='word')\r\nlda_features = vectorizer.fit_transform(docs)\r\n\r\nlda_model = LatentDirichletAllocation(\r\n    n_topics=10,\r\n    learning_method='online',\r\n    evaluate_every=10,\r\n    n_jobs=4,\r\n)\r\nmodel = lda_model.fit(lda_features)\r\n```\r\nIf the code is too long, feel free to put it in a public gist and link\r\nit in the issue: https://gist.github.com\r\n-->\r\n```python\r\nNearestNeighbors(metric='cheybshev')\r\n```\r\n\r\n#### Expected Results\r\n<!-- Example: No error is thrown. Please paste or describe the expected results.-->\r\nError message stating that metric should be 'cityblock', ... or callable rather than metric not valid for algorithm 'auto'. When I initially saw the error message, I did not realize I had a typo in the metric string. I thought it has something to do with the algorithm.\r\n\r\n#### Actual Results\r\n<!-- Please paste or specifically describe the actual output or traceback. -->\r\n```\r\nValueError: Metric 'cheybshev' not valid for algorithm 'auto'\r\n```\r\n\r\n#### Versions\r\n<!--\r\nPlease run the following snippet and paste the output below.\r\nimport platform; print(platform.platform())\r\nimport sys; print(\"Python\", sys.version)\r\nimport numpy; print(\"NumPy\", numpy.__version__)\r\nimport scipy; print(\"SciPy\", scipy.__version__)\r\nimport sklearn; print(\"Scikit-Learn\", sklearn.__version__)\r\n-->\r\nLinux-4.15.0-24-generic-x86_64-with-debian-stretch-sid\r\nPython 3.6.3 |Anaconda custom (64-bit)| (default, Nov  9 2017, 00:19:18) \r\n[GCC 7.2.0]\r\nNumPy 1.13.3\r\nSciPy 0.19.1\r\nScikit-Learn 0.19.1\r\n\r\n<!-- Thanks for contributing! -->\r\n",
      "issue_closed_at": "2018-09-13T15:34:02Z",
      "base_commit": "7ed61a24feb4ffde0bee9342acf4a58e3f946a61",
      "changes": [
        {
          "file": "sklearn/neighbors/__init__.py",
          "type": "line",
          "name": "line 14",
          "code": "from .kde import KernelDensity\nfrom .approximate import LSHForest\nfrom .lof import LocalOutlierFactor\n\n__all__ = ['BallTree',\n           'DistanceMetric',"
        },
        {
          "file": "sklearn/neighbors/__init__.py",
          "type": "line",
          "name": "line 28",
          "code": "           'radius_neighbors_graph',\n           'KernelDensity',\n           'LSHForest',\n           'LocalOutlierFactor']"
        },
        {
          "file": "sklearn/neighbors/base.py",
          "type": "function",
          "name": "_check_algorithm_metric",
          "class_name": "NeighborsBase",
          "code": "def _check_algorithm_metric(self):\n        if self.algorithm not in ['auto', 'brute',\n                                  'kd_tree', 'ball_tree']:\n            raise ValueError(\"unrecognized algorithm: '%s'\" % self.algorithm)\n\n        if self.algorithm == 'auto':\n            if self.metric == 'precomputed':\n                alg_check = 'brute'\n            elif (callable(self.metric) or\n                  self.metric in VALID_METRICS['ball_tree']):\n                alg_check = 'ball_tree'\n            else:\n                alg_check = 'brute'\n        else:\n            alg_check = self.algorithm\n\n        if callable(self.metric):\n            if self.algorithm == 'kd_tree':\n                # callable metric is only valid for brute force and ball_tree\n                raise ValueError(\n                    \"kd_tree algorithm does not support callable metric '%s'\"\n                    % self.metric)\n        elif self.metric not in VALID_METRICS[alg_check]:\n            raise ValueError(\"Metric '%s' not valid for algorithm '%s'\"\n                             % (self.metric, self.algorithm))\n\n        if self.metric_params is not None and 'p' in self.metric_params:\n            warnings.warn(\"Parameter p is found in metric_params. \"\n                          \"The corresponding parameter from __init__ \"\n                          \"is ignored.\", SyntaxWarning, stacklevel=3)\n            effective_p = self.metric_params['p']\n        else:\n            effective_p = self.p\n\n        if self.metric in ['wminkowski', 'minkowski'] and effective_p < 1:\n            raise ValueError(\"p must be greater than one for minkowski metric\")"
        }
      ]
    },
    {
      "pr_number": 11796,
      "pr_title": "[MRG+2] Fix LDA predict_proba() ",
      "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist\r\n-->\r\n\r\n#### Reference Issues/PRs\r\nFixes #6848\r\ncloses #11727\r\ncloses #5149\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\n\r\n\r\n#### What does this implement/fix? Explain your changes.\r\nFixes the `predict_proba()` method of LinearDiscriminantAnalysis.\r\nAn `if` statement is used to differentiate between the binary and multi-class case, due to the different output format of the `decision_function` method implemented in the `LinearClassifierMixin` class.\r\n\r\n#### Any other comments?\r\nCopying from #6848:\r\nDo we perhaps want to include additional tests checking the output of predict_proba for LDA and QDA both for the binary and multi-class cases?\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
      "issue_id": 6848,
      "issue_title": "LinearDiscriminantAnalysis predict probability bug",
      "issue_body": "I am pretty confident there is a bug introduced in commit\n7c1101d7c26ba0b77184cce9c0b9be79adb526de\n\nConcretely, line 518 of the current version \nhttps://github.com/scikit-learn/scikit-learn/blob/master/sklearn/discriminant_analysis.py\nshould be removed as it yields wrong results. \n\nThere is no reason why constant 1 should be added to the computed probability after exponentiation and before inversion. \n\nTo verify this, I have run a one-to-one comparison between the outcome of the method and MATLAB's builtin LDA classifier on the Iris dataset. Only after removal of line 518, results match (up to a tolerance).\n\nIf everyone agrees on that, I am happy to submit a PR.\n",
      "issue_closed_at": "2019-03-07T16:44:18Z",
      "base_commit": "b73a51bcda362d94d8907915a382a8eb403554c8",
      "changes": [
        {
          "file": "sklearn/discriminant_analysis.py",
          "type": "line",
          "name": "line 22",
          "code": "from .utils import check_array, check_X_y\nfrom .utils.validation import check_is_fitted\nfrom .utils.multiclass import check_classification_targets\nfrom .preprocessing import StandardScaler\n\n"
        },
        {
          "file": "sklearn/discriminant_analysis.py",
          "type": "function",
          "name": "predict_proba",
          "class_name": "QuadraticDiscriminantAnalysis",
          "code": "def predict_proba(self, X):\n        \"\"\"Return posterior probabilities of classification.\n\n        Parameters\n        ----------\n        X : array-like, shape = [n_samples, n_features]\n            Array of samples/test vectors.\n\n        Returns\n        -------\n        C : array, shape = [n_samples, n_classes]\n            Posterior probabilities of classification per class.\n        \"\"\"\n        values = self._decision_function(X)\n        # compute the likelihood of the underlying gaussian models\n        # up to a multiplicative constant.\n        likelihood = np.exp(values - values.max(axis=1)[:, np.newaxis])\n        # compute posterior probabilities\n        return likelihood / likelihood.sum(axis=1)[:, np.newaxis]"
        }
      ]
    },
    {
      "pr_number": 13641,
      "pr_title": "[MRG+1] API make sure vectorizers read data from file before analyzing",
      "pr_body": "Fixes #5482\r\n\r\nIf the given analyzer is a calable, it seems reasonable to assume if `input='file'` or `input='filename'`, the data should be read from the file first, and then passed to the analyzer, the same way as it's done for non-callable analyzers.\r\n\r\nThis PR clarifies this in the docstrings, and passes the \"decoded\" input to the analyzer. It should be less of a concern regarding the input on the bytes vs str since we don't support python2 anymore.\r\n\r\nI'm not entirely sure if this is what we wanna do, it's more of a proposal to move it forward.\r\n\r\nAfter this PR, the following would result in a `FileNotFoundError` exception:\r\n\r\n```python\r\ncv = CountVectorizer(analyzer=lambda x: x.split(), input='filename')\r\ncv.fit(['hello world']).vocabulary_\r\n```",
      "issue_id": 5482,
      "issue_title": "CountVectorizer with custom analyzer ignores input argument",
      "issue_body": "Example:\n\n``` py\ncv = CountVectorizer(analyzer=lambda x: x.split(), input='filename')\ncv.fit(['hello world']).vocabulary_\n```\n\nSame for `input=\"file\"`. Not sure if this should be fixed or just documented; I don't like changing the behavior of the vectorizers yet again...\n",
      "issue_closed_at": "2019-04-23T03:50:24Z",
      "base_commit": "badaa153e67ffa56fb1a413b3b7b5b8507024291",
      "changes": [
        {
          "file": "sklearn/feature_extraction/text.py",
          "type": "line",
          "name": "line 31",
          "code": "from ..utils.validation import check_is_fitted, check_array, FLOAT_DTYPES\nfrom ..utils import _IS_32BIT\nfrom ..utils.fixes import _astype_copy_false\n\n\n__all__ = ['HashingVectorizer',"
        },
        {
          "file": "sklearn/feature_extraction/text.py",
          "type": "function",
          "name": "_check_stop_words_consistency",
          "class_name": "VectorizerMixin",
          "code": "def _check_stop_words_consistency(self, stop_words, preprocess, tokenize):\n        \"\"\"Check if stop words are consistent\n\n        Returns\n        -------\n        is_consistent : True if stop words are consistent with the preprocessor\n                        and tokenizer, False if they are not, None if the check\n                        was previously performed, \"error\" if it could not be\n                        performed (e.g. because of the use of a custom\n                        preprocessor / tokenizer)\n        \"\"\"\n        if id(self.stop_words) == getattr(self, '_stop_words_id', None):\n            # Stop words are were previously validated\n            return None\n\n        # NB: stop_words is validated, unlike self.stop_words\n        try:\n            inconsistent = set()\n            for w in stop_words or ():\n                tokens = list(tokenize(preprocess(w)))\n                for token in tokens:\n                    if token not in stop_words:\n                        inconsistent.add(token)\n            self._stop_words_id = id(self.stop_words)\n\n            if inconsistent:\n                warnings.warn('Your stop_words may be inconsistent with '\n                              'your preprocessing. Tokenizing the stop '\n                              'words generated tokens %r not in '\n                              'stop_words.' % sorted(inconsistent))\n            return not inconsistent\n        except Exception:\n            # Failed to check stop words consistency (e.g. because a custom\n            # preprocessor or tokenizer was used)\n            self._stop_words_id = id(self.stop_words)\n            return 'error'"
        },
        {
          "file": "sklearn/feature_extraction/text.py",
          "type": "class",
          "name": "HashingVectorizer",
          "code": "class HashingVectorizer(BaseEstimator, VectorizerMixin, TransformerMixin):\n    \"\"\"Convert a collection of text documents to a matrix of token occurrences\n\n    It turns a collection of text documents into a scipy.sparse matrix holding\n    token occurrence counts (or binary occurrence information), possibly\n    normalized as token frequencies if norm='l1' or projected on the euclidean\n    unit sphere if norm='l2'.\n\n    This text vectorizer implementation uses the hashing trick to find the\n    token string name to feature integer index mapping.\n\n    This strategy has several advantages:\n\n    - it is very low memory scalable to large datasets as there is no need to\n      store a vocabulary dictionary in memory\n\n    - it is fast to pickle and un-pickle as it holds no state besides the\n      constructor parameters\n\n    - it can be used in a streaming (partial fit) or parallel pipeline as there\n      is no state computed during fit.\n\n    There are also a couple of cons (vs using a CountVectorizer with an\n    in-memory vocabulary):\n\n    - there is no way to compute the inverse transform (from feature indices to\n      string feature names) which can be a problem when trying to introspect\n      which features are most important to a model.\n\n    - there can be collisions: distinct tokens can be mapped to the same\n      feature index. However in practice this is rarely an issue if n_features\n      is large enough (e.g. 2 ** 18 for text classification problems).\n\n    - no IDF weighting as this would render the transformer stateful.\n\n    The hash function employed is the signed 32-bit version of Murmurhash3.\n\n    Read more in the :ref:`User Guide <text_feature_extraction>`.\n\n    Parameters\n    ----------\n\n    input : string {'filename', 'file', 'content'}\n        If 'filename', the sequence passed as an argument to fit is\n        expected to be a list of filenames that need reading to fetch\n        the raw content to analyze.\n\n        If 'file', the sequence items must have a 'read' method (file-like\n        object) that is called to fetch the bytes in memory.\n\n        Otherwise the input is expected to be the sequence strings or\n        bytes items are expected to be analyzed directly.\n\n    encoding : string, default='utf-8'\n        If bytes or files are given to analyze, this encoding is used to\n        decode.\n\n    decode_error : {'strict', 'ignore', 'replace'}\n        Instruction on what to do if a byte sequence is given to analyze that\n        contains characters not of the given `encoding`. By default, it is\n        'strict', meaning that a UnicodeDecodeError will be raised. Other\n        values are 'ignore' and 'replace'.\n\n    strip_accents : {'ascii', 'unicode', None}\n        Remove accents and perform other character normalization\n        during the preprocessing step.\n        'ascii' is a fast method that only works on characters that have\n        an direct ASCII mapping.\n        'unicode' is a slightly slower method that works on any characters.\n        None (default) does nothing.\n\n        Both 'ascii' and 'unicode' use NFKD normalization from\n        :func:`unicodedata.normalize`.\n\n    lowercase : boolean, default=True\n        Convert all characters to lowercase before tokenizing.\n\n    preprocessor : callable or None (default)\n        Override the preprocessing (string transformation) stage while\n        preserving the tokenizing and n-grams generation steps.\n\n    tokenizer : callable or None (default)\n        Override the string tokenization step while preserving the\n        preprocessing and n-grams generation steps.\n        Only applies if ``analyzer == 'word'``.\n\n    stop_words : string {'english'}, list, or None (default)\n        If 'english', a built-in stop word list for English is used.\n        There are several known issues with 'english' and you should\n        consider an alternative (see :ref:`stop_words`).\n\n        If a list, that list is assumed to contain stop words, all of which\n        will be removed from the resulting tokens.\n        Only applies if ``analyzer == 'word'``.\n\n    token_pattern : string\n        Regular expression denoting what constitutes a \"token\", only used\n        if ``analyzer == 'word'``. The default regexp selects tokens of 2\n        or more alphanumeric characters (punctuation is completely ignored\n        and always treated as a token separator).\n\n    ngram_range : tuple (min_n, max_n), default=(1, 1)\n        The lower and upper boundary of the range of n-values for different\n        n-grams to be extracted. All values of n such that min_n <= n <= max_n\n        will be used.\n\n    analyzer : string, {'word', 'char', 'char_wb'} or callable\n        Whether the feature should be made of word or character n-grams.\n        Option 'char_wb' creates character n-grams only from text inside\n        word boundaries; n-grams at the edges of words are padded with space.\n\n        If a callable is passed it is used to extract the sequence of features\n        out of the raw, unprocessed input.\n\n    n_features : integer, default=(2 ** 20)\n        The number of features (columns) in the output matrices. Small numbers\n        of features are likely to cause hash collisions, but large numbers\n        will cause larger coefficient dimensions in linear learners.\n\n    binary : boolean, default=False.\n        If True, all non zero counts are set to 1. This is useful for discrete\n        probabilistic models that model binary events rather than integer\n        counts.\n\n    norm : 'l1', 'l2' or None, optional\n        Norm used to normalize term vectors. None for no normalization.\n\n    alternate_sign : boolean, optional, default True\n        When True, an alternating sign is added to the features as to\n        approximately conserve the inner product in the hashed space even for\n        small n_features. This approach is similar to sparse random projection.\n\n        .. versionadded:: 0.19\n\n    dtype : type, optional\n        Type of the matrix returned by fit_transform() or transform().\n\n    Examples\n    --------\n    >>> from sklearn.feature_extraction.text import HashingVectorizer\n    >>> corpus = [\n    ...     'This is the first document.',\n    ...     'This document is the second document.',\n    ...     'And this is the third one.',\n    ...     'Is this the first document?',\n    ... ]\n    >>> vectorizer = HashingVectorizer(n_features=2**4)\n    >>> X = vectorizer.fit_transform(corpus)\n    >>> print(X.shape)\n    (4, 16)\n\n    See also\n    --------\n    CountVectorizer, TfidfVectorizer\n\n    \"\"\"\n    def __init__(self, input='content', encoding='utf-8',\n                 decode_error='strict', strip_accents=None,\n                 lowercase=True, preprocessor=None, tokenizer=None,\n                 stop_words=None, token_pattern=r\"(?u)\\b\\w\\w+\\b\",\n                 ngram_range=(1, 1), analyzer='word', n_features=(2 ** 20),\n                 binary=False, norm='l2', alternate_sign=True,\n                 dtype=np.float64):\n        self.input = input\n        self.encoding = encoding\n        self.decode_error = decode_error\n        self.strip_accents = strip_accents\n        self.preprocessor = preprocessor\n        self.tokenizer = tokenizer\n        self.analyzer = analyzer\n        self.lowercase = lowercase\n        self.token_pattern = token_pattern\n        self.stop_words = stop_words\n        self.n_features = n_features\n        self.ngram_range = ngram_range\n        self.binary = binary\n        self.norm = norm\n        self.alternate_sign = alternate_sign\n        self.dtype = dtype\n\n    def partial_fit(self, X, y=None):\n        \"\"\"Does nothing: this transformer is stateless.\n\n        This method is just there to mark the fact that this transformer\n        can work in a streaming setup.\n\n        Parameters\n        ----------\n        X : array-like, shape [n_samples, n_features]\n            Training data.\n        \"\"\"\n        return self\n\n    def fit(self, X, y=None):\n        \"\"\"Does nothing: this transformer is stateless.\n\n        Parameters\n        ----------\n        X : array-like, shape [n_samples, n_features]\n            Training data.\n        \"\"\"\n        # triggers a parameter validation\n        if isinstance(X, str):\n            raise ValueError(\n                \"Iterable over raw text documents expected, \"\n                \"string object received.\")\n\n        self._validate_params()\n\n        self._get_hasher().fit(X, y=y)\n        return self\n\n    def transform(self, X):\n        \"\"\"Transform a sequence of documents to a document-term matrix.\n\n        Parameters\n        ----------\n        X : iterable over raw text documents, length = n_samples\n            Samples. Each sample must be a text document (either bytes or\n            unicode strings, file name or file object depending on the\n            constructor argument) which will be tokenized and hashed.\n\n        Returns\n        -------\n        X : scipy.sparse matrix, shape = (n_samples, self.n_features)\n            Document-term matrix.\n        \"\"\"\n        if isinstance(X, str):\n            raise ValueError(\n                \"Iterable over raw text documents expected, \"\n                \"string object received.\")\n\n        self._validate_params()\n\n        analyzer = self.build_analyzer()\n        X = self._get_hasher().transform(analyzer(doc) for doc in X)\n        if self.binary:\n            X.data.fill(1)\n        if self.norm is not None:\n            X = normalize(X, norm=self.norm, copy=False)\n        return X\n\n    def fit_transform(self, X, y=None):\n        \"\"\"Transform a sequence of documents to a document-term matrix.\n\n        Parameters\n        ----------\n        X : iterable over raw text documents, length = n_samples\n            Samples. Each sample must be a text document (either bytes or\n            unicode strings, file name or file object depending on the\n            constructor argument) which will be tokenized and hashed.\n        y : any\n            Ignored. This parameter exists only for compatibility with\n            sklearn.pipeline.Pipeline.\n\n        Returns\n        -------\n        X : scipy.sparse matrix, shape = (n_samples, self.n_features)\n            Document-term matrix.\n        \"\"\"\n        return self.fit(X, y).transform(X)\n\n    def _get_hasher(self):\n        return FeatureHasher(n_features=self.n_features,\n                             input_type='string', dtype=self.dtype,\n                             alternate_sign=self.alternate_sign)\n\n    def _more_tags(self):\n        return {'X_types': ['string']}"
        },
        {
          "file": "sklearn/feature_extraction/text.py",
          "type": "class",
          "name": "CountVectorizer",
          "code": "class CountVectorizer(BaseEstimator, VectorizerMixin):\n    \"\"\"Convert a collection of text documents to a matrix of token counts\n\n    This implementation produces a sparse representation of the counts using\n    scipy.sparse.csr_matrix.\n\n    If you do not provide an a-priori dictionary and you do not use an analyzer\n    that does some kind of feature selection then the number of features will\n    be equal to the vocabulary size found by analyzing the data.\n\n    Read more in the :ref:`User Guide <text_feature_extraction>`.\n\n    Parameters\n    ----------\n    input : string {'filename', 'file', 'content'}\n        If 'filename', the sequence passed as an argument to fit is\n        expected to be a list of filenames that need reading to fetch\n        the raw content to analyze.\n\n        If 'file', the sequence items must have a 'read' method (file-like\n        object) that is called to fetch the bytes in memory.\n\n        Otherwise the input is expected to be the sequence strings or\n        bytes items are expected to be analyzed directly.\n\n    encoding : string, 'utf-8' by default.\n        If bytes or files are given to analyze, this encoding is used to\n        decode.\n\n    decode_error : {'strict', 'ignore', 'replace'}\n        Instruction on what to do if a byte sequence is given to analyze that\n        contains characters not of the given `encoding`. By default, it is\n        'strict', meaning that a UnicodeDecodeError will be raised. Other\n        values are 'ignore' and 'replace'.\n\n    strip_accents : {'ascii', 'unicode', None}\n        Remove accents and perform other character normalization\n        during the preprocessing step.\n        'ascii' is a fast method that only works on characters that have\n        an direct ASCII mapping.\n        'unicode' is a slightly slower method that works on any characters.\n        None (default) does nothing.\n\n        Both 'ascii' and 'unicode' use NFKD normalization from\n        :func:`unicodedata.normalize`.\n\n    lowercase : boolean, True by default\n        Convert all characters to lowercase before tokenizing.\n\n    preprocessor : callable or None (default)\n        Override the preprocessing (string transformation) stage while\n        preserving the tokenizing and n-grams generation steps.\n\n    tokenizer : callable or None (default)\n        Override the string tokenization step while preserving the\n        preprocessing and n-grams generation steps.\n        Only applies if ``analyzer == 'word'``.\n\n    stop_words : string {'english'}, list, or None (default)\n        If 'english', a built-in stop word list for English is used.\n        There are several known issues with 'english' and you should\n        consider an alternative (see :ref:`stop_words`).\n\n        If a list, that list is assumed to contain stop words, all of which\n        will be removed from the resulting tokens.\n        Only applies if ``analyzer == 'word'``.\n\n        If None, no stop words will be used. max_df can be set to a value\n        in the range [0.7, 1.0) to automatically detect and filter stop\n        words based on intra corpus document frequency of terms.\n\n    token_pattern : string\n        Regular expression denoting what constitutes a \"token\", only used\n        if ``analyzer == 'word'``. The default regexp select tokens of 2\n        or more alphanumeric characters (punctuation is completely ignored\n        and always treated as a token separator).\n\n    ngram_range : tuple (min_n, max_n)\n        The lower and upper boundary of the range of n-values for different\n        n-grams to be extracted. All values of n such that min_n <= n <= max_n\n        will be used.\n\n    analyzer : string, {'word', 'char', 'char_wb'} or callable\n        Whether the feature should be made of word or character n-grams.\n        Option 'char_wb' creates character n-grams only from text inside\n        word boundaries; n-grams at the edges of words are padded with space.\n\n        If a callable is passed it is used to extract the sequence of features\n        out of the raw, unprocessed input.\n\n    max_df : float in range [0.0, 1.0] or int, default=1.0\n        When building the vocabulary ignore terms that have a document\n        frequency strictly higher than the given threshold (corpus-specific\n        stop words).\n        If float, the parameter represents a proportion of documents, integer\n        absolute counts.\n        This parameter is ignored if vocabulary is not None.\n\n    min_df : float in range [0.0, 1.0] or int, default=1\n        When building the vocabulary ignore terms that have a document\n        frequency strictly lower than the given threshold. This value is also\n        called cut-off in the literature.\n        If float, the parameter represents a proportion of documents, integer\n        absolute counts.\n        This parameter is ignored if vocabulary is not None.\n\n    max_features : int or None, default=None\n        If not None, build a vocabulary that only consider the top\n        max_features ordered by term frequency across the corpus.\n\n        This parameter is ignored if vocabulary is not None.\n\n    vocabulary : Mapping or iterable, optional\n        Either a Mapping (e.g., a dict) where keys are terms and values are\n        indices in the feature matrix, or an iterable over terms. If not\n        given, a vocabulary is determined from the input documents. Indices\n        in the mapping should not be repeated and should not have any gap\n        between 0 and the largest index.\n\n    binary : boolean, default=False\n        If True, all non zero counts are set to 1. This is useful for discrete\n        probabilistic models that model binary events rather than integer\n        counts.\n\n    dtype : type, optional\n        Type of the matrix returned by fit_transform() or transform().\n\n    Attributes\n    ----------\n    vocabulary_ : dict\n        A mapping of terms to feature indices.\n\n    stop_words_ : set\n        Terms that were ignored because they either:\n\n          - occurred in too many documents (`max_df`)\n          - occurred in too few documents (`min_df`)\n          - were cut off by feature selection (`max_features`).\n\n        This is only available if no vocabulary was given.\n\n    Examples\n    --------\n    >>> from sklearn.feature_extraction.text import CountVectorizer\n    >>> corpus = [\n    ...     'This is the first document.',\n    ...     'This document is the second document.',\n    ...     'And this is the third one.',\n    ...     'Is this the first document?',\n    ... ]\n    >>> vectorizer = CountVectorizer()\n    >>> X = vectorizer.fit_transform(corpus)\n    >>> print(vectorizer.get_feature_names())\n    ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']\n    >>> print(X.toarray())  # doctest: +NORMALIZE_WHITESPACE\n    [[0 1 1 1 0 0 1 0 1]\n     [0 2 0 1 0 1 1 0 1]\n     [1 0 0 1 1 0 1 1 1]\n     [0 1 1 1 0 0 1 0 1]]\n\n    See also\n    --------\n    HashingVectorizer, TfidfVectorizer\n\n    Notes\n    -----\n    The ``stop_words_`` attribute can get large and increase the model size\n    when pickling. This attribute is provided only for introspection and can\n    be safely removed using delattr or set to None before pickling.\n    \"\"\"\n\n    def __init__(self, input='content', encoding='utf-8',\n                 decode_error='strict', strip_accents=None,\n                 lowercase=True, preprocessor=None, tokenizer=None,\n                 stop_words=None, token_pattern=r\"(?u)\\b\\w\\w+\\b\",\n                 ngram_range=(1, 1), analyzer='word',\n                 max_df=1.0, min_df=1, max_features=None,\n                 vocabulary=None, binary=False, dtype=np.int64):\n        self.input = input\n        self.encoding = encoding\n        self.decode_error = decode_error\n        self.strip_accents = strip_accents\n        self.preprocessor = preprocessor\n        self.tokenizer = tokenizer\n        self.analyzer = analyzer\n        self.lowercase = lowercase\n        self.token_pattern = token_pattern\n        self.stop_words = stop_words\n        self.max_df = max_df\n        self.min_df = min_df\n        if max_df < 0 or min_df < 0:\n            raise ValueError(\"negative value for max_df or min_df\")\n        self.max_features = max_features\n        if max_features is not None:\n            if (not isinstance(max_features, numbers.Integral) or\n                    max_features <= 0):\n                raise ValueError(\n                    \"max_features=%r, neither a positive integer nor None\"\n                    % max_features)\n        self.ngram_range = ngram_range\n        self.vocabulary = vocabulary\n        self.binary = binary\n        self.dtype = dtype\n\n    def _sort_features(self, X, vocabulary):\n        \"\"\"Sort features by name\n\n        Returns a reordered matrix and modifies the vocabulary in place\n        \"\"\"\n        sorted_features = sorted(vocabulary.items())\n        map_index = np.empty(len(sorted_features), dtype=X.indices.dtype)\n        for new_val, (term, old_val) in enumerate(sorted_features):\n            vocabulary[term] = new_val\n            map_index[old_val] = new_val\n\n        X.indices = map_index.take(X.indices, mode='clip')\n        return X\n\n    def _limit_features(self, X, vocabulary, high=None, low=None,\n                        limit=None):\n        \"\"\"Remove too rare or too common features.\n\n        Prune features that are non zero in more samples than high or less\n        documents than low, modifying the vocabulary, and restricting it to\n        at most the limit most frequent.\n\n        This does not prune samples with zero features.\n        \"\"\"\n        if high is None and low is None and limit is None:\n            return X, set()\n\n        # Calculate a mask based on document frequencies\n        dfs = _document_frequency(X)\n        tfs = np.asarray(X.sum(axis=0)).ravel()\n        mask = np.ones(len(dfs), dtype=bool)\n        if high is not None:\n            mask &= dfs <= high\n        if low is not None:\n            mask &= dfs >= low\n        if limit is not None and mask.sum() > limit:\n            mask_inds = (-tfs[mask]).argsort()[:limit]\n            new_mask = np.zeros(len(dfs), dtype=bool)\n            new_mask[np.where(mask)[0][mask_inds]] = True\n            mask = new_mask\n\n        new_indices = np.cumsum(mask) - 1  # maps old indices to new\n        removed_terms = set()\n        for term, old_index in list(vocabulary.items()):\n            if mask[old_index]:\n                vocabulary[term] = new_indices[old_index]\n            else:\n                del vocabulary[term]\n                removed_terms.add(term)\n        kept_indices = np.where(mask)[0]\n        if len(kept_indices) == 0:\n            raise ValueError(\"After pruning, no terms remain. Try a lower\"\n                             \" min_df or a higher max_df.\")\n        return X[:, kept_indices], removed_terms\n\n    def _count_vocab(self, raw_documents, fixed_vocab):\n        \"\"\"Create sparse feature matrix, and vocabulary where fixed_vocab=False\n        \"\"\"\n        if fixed_vocab:\n            vocabulary = self.vocabulary_\n        else:\n            # Add a new value when a new vocabulary item is seen\n            vocabulary = defaultdict()\n            vocabulary.default_factory = vocabulary.__len__\n\n        analyze = self.build_analyzer()\n        j_indices = []\n        indptr = []\n\n        values = _make_int_array()\n        indptr.append(0)\n        for doc in raw_documents:\n            feature_counter = {}\n            for feature in analyze(doc):\n                try:\n                    feature_idx = vocabulary[feature]\n                    if feature_idx not in feature_counter:\n                        feature_counter[feature_idx] = 1\n                    else:\n                        feature_counter[feature_idx] += 1\n                except KeyError:\n                    # Ignore out-of-vocabulary items for fixed_vocab=True\n                    continue\n\n            j_indices.extend(feature_counter.keys())\n            values.extend(feature_counter.values())\n            indptr.append(len(j_indices))\n\n        if not fixed_vocab:\n            # disable defaultdict behaviour\n            vocabulary = dict(vocabulary)\n            if not vocabulary:\n                raise ValueError(\"empty vocabulary; perhaps the documents only\"\n                                 \" contain stop words\")\n\n        if indptr[-1] > 2147483648:  # = 2**31 - 1\n            if _IS_32BIT:\n                raise ValueError(('sparse CSR array has {} non-zero '\n                                  'elements and requires 64 bit indexing, '\n                                  'which is unsupported with 32 bit Python.')\n                                 .format(indptr[-1]))\n            indices_dtype = np.int64\n\n        else:\n            indices_dtype = np.int32\n        j_indices = np.asarray(j_indices, dtype=indices_dtype)\n        indptr = np.asarray(indptr, dtype=indices_dtype)\n        values = np.frombuffer(values, dtype=np.intc)\n\n        X = sp.csr_matrix((values, j_indices, indptr),\n                          shape=(len(indptr) - 1, len(vocabulary)),\n                          dtype=self.dtype)\n        X.sort_indices()\n        return vocabulary, X\n\n    def fit(self, raw_documents, y=None):\n        \"\"\"Learn a vocabulary dictionary of all tokens in the raw documents.\n\n        Parameters\n        ----------\n        raw_documents : iterable\n            An iterable which yields either str, unicode or file objects.\n\n        Returns\n        -------\n        self\n        \"\"\"\n        self.fit_transform(raw_documents)\n        return self\n\n    def fit_transform(self, raw_documents, y=None):\n        \"\"\"Learn the vocabulary dictionary and return term-document matrix.\n\n        This is equivalent to fit followed by transform, but more efficiently\n        implemented.\n\n        Parameters\n        ----------\n        raw_documents : iterable\n            An iterable which yields either str, unicode or file objects.\n\n        Returns\n        -------\n        X : array, [n_samples, n_features]\n            Document-term matrix.\n        \"\"\"\n        # We intentionally don't call the transform method to make\n        # fit_transform overridable without unwanted side effects in\n        # TfidfVectorizer.\n        if isinstance(raw_documents, str):\n            raise ValueError(\n                \"Iterable over raw text documents expected, \"\n                \"string object received.\")\n\n        self._validate_params()\n        self._validate_vocabulary()\n        max_df = self.max_df\n        min_df = self.min_df\n        max_features = self.max_features\n\n        vocabulary, X = self._count_vocab(raw_documents,\n                                          self.fixed_vocabulary_)\n\n        if self.binary:\n            X.data.fill(1)\n\n        if not self.fixed_vocabulary_:\n            X = self._sort_features(X, vocabulary)\n\n            n_doc = X.shape[0]\n            max_doc_count = (max_df\n                             if isinstance(max_df, numbers.Integral)\n                             else max_df * n_doc)\n            min_doc_count = (min_df\n                             if isinstance(min_df, numbers.Integral)\n                             else min_df * n_doc)\n            if max_doc_count < min_doc_count:\n                raise ValueError(\n                    \"max_df corresponds to < documents than min_df\")\n            X, self.stop_words_ = self._limit_features(X, vocabulary,\n                                                       max_doc_count,\n                                                       min_doc_count,\n                                                       max_features)\n\n            self.vocabulary_ = vocabulary\n\n        return X\n\n    def transform(self, raw_documents):\n        \"\"\"Transform documents to document-term matrix.\n\n        Extract token counts out of raw text documents using the vocabulary\n        fitted with fit or the one provided to the constructor.\n\n        Parameters\n        ----------\n        raw_documents : iterable\n            An iterable which yields either str, unicode or file objects.\n\n        Returns\n        -------\n        X : sparse matrix, [n_samples, n_features]\n            Document-term matrix.\n        \"\"\"\n        if isinstance(raw_documents, str):\n            raise ValueError(\n                \"Iterable over raw text documents expected, \"\n                \"string object received.\")\n\n        if not hasattr(self, 'vocabulary_'):\n            self._validate_vocabulary()\n\n        self._check_vocabulary()\n\n        # use the same matrix-building strategy as fit_transform\n        _, X = self._count_vocab(raw_documents, fixed_vocab=True)\n        if self.binary:\n            X.data.fill(1)\n        return X\n\n    def inverse_transform(self, X):\n        \"\"\"Return terms per document with nonzero entries in X.\n\n        Parameters\n        ----------\n        X : {array, sparse matrix}, shape = [n_samples, n_features]\n\n        Returns\n        -------\n        X_inv : list of arrays, len = n_samples\n            List of arrays of terms.\n        \"\"\"\n        self._check_vocabulary()\n\n        if sp.issparse(X):\n            # We need CSR format for fast row manipulations.\n            X = X.tocsr()\n        else:\n            # We need to convert X to a matrix, so that the indexing\n            # returns 2D objects\n            X = np.asmatrix(X)\n        n_samples = X.shape[0]\n\n        terms = np.array(list(self.vocabulary_.keys()))\n        indices = np.array(list(self.vocabulary_.values()))\n        inverse_vocabulary = terms[np.argsort(indices)]\n\n        return [inverse_vocabulary[X[i, :].nonzero()[1]].ravel()\n                for i in range(n_samples)]\n\n    def get_feature_names(self):\n        \"\"\"Array mapping from feature integer indices to feature name\"\"\"\n        if not hasattr(self, 'vocabulary_'):\n            self._validate_vocabulary()\n\n        self._check_vocabulary()\n\n        return [t for t, i in sorted(self.vocabulary_.items(),\n                                     key=itemgetter(1))]\n\n    def _more_tags(self):\n        return {'X_types': ['string']}"
        },
        {
          "file": "sklearn/feature_extraction/text.py",
          "type": "class",
          "name": "TfidfVectorizer",
          "code": "class TfidfVectorizer(CountVectorizer):\n    \"\"\"Convert a collection of raw documents to a matrix of TF-IDF features.\n\n    Equivalent to :class:`CountVectorizer` followed by\n    :class:`TfidfTransformer`.\n\n    Read more in the :ref:`User Guide <text_feature_extraction>`.\n\n    Parameters\n    ----------\n    input : string {'filename', 'file', 'content'}\n        If 'filename', the sequence passed as an argument to fit is\n        expected to be a list of filenames that need reading to fetch\n        the raw content to analyze.\n\n        If 'file', the sequence items must have a 'read' method (file-like\n        object) that is called to fetch the bytes in memory.\n\n        Otherwise the input is expected to be the sequence strings or\n        bytes items are expected to be analyzed directly.\n\n    encoding : string, 'utf-8' by default.\n        If bytes or files are given to analyze, this encoding is used to\n        decode.\n\n    decode_error : {'strict', 'ignore', 'replace'} (default='strict')\n        Instruction on what to do if a byte sequence is given to analyze that\n        contains characters not of the given `encoding`. By default, it is\n        'strict', meaning that a UnicodeDecodeError will be raised. Other\n        values are 'ignore' and 'replace'.\n\n    strip_accents : {'ascii', 'unicode', None} (default=None)\n        Remove accents and perform other character normalization\n        during the preprocessing step.\n        'ascii' is a fast method that only works on characters that have\n        an direct ASCII mapping.\n        'unicode' is a slightly slower method that works on any characters.\n        None (default) does nothing.\n\n        Both 'ascii' and 'unicode' use NFKD normalization from\n        :func:`unicodedata.normalize`.\n\n    lowercase : boolean (default=True)\n        Convert all characters to lowercase before tokenizing.\n\n    preprocessor : callable or None (default=None)\n        Override the preprocessing (string transformation) stage while\n        preserving the tokenizing and n-grams generation steps.\n\n    tokenizer : callable or None (default=None)\n        Override the string tokenization step while preserving the\n        preprocessing and n-grams generation steps.\n        Only applies if ``analyzer == 'word'``.\n\n    analyzer : string, {'word', 'char', 'char_wb'} or callable\n        Whether the feature should be made of word or character n-grams.\n        Option 'char_wb' creates character n-grams only from text inside\n        word boundaries; n-grams at the edges of words are padded with space.\n\n        If a callable is passed it is used to extract the sequence of features\n        out of the raw, unprocessed input.\n\n    stop_words : string {'english'}, list, or None (default=None)\n        If a string, it is passed to _check_stop_list and the appropriate stop\n        list is returned. 'english' is currently the only supported string\n        value.\n        There are several known issues with 'english' and you should\n        consider an alternative (see :ref:`stop_words`).\n\n        If a list, that list is assumed to contain stop words, all of which\n        will be removed from the resulting tokens.\n        Only applies if ``analyzer == 'word'``.\n\n        If None, no stop words will be used. max_df can be set to a value\n        in the range [0.7, 1.0) to automatically detect and filter stop\n        words based on intra corpus document frequency of terms.\n\n    token_pattern : string\n        Regular expression denoting what constitutes a \"token\", only used\n        if ``analyzer == 'word'``. The default regexp selects tokens of 2\n        or more alphanumeric characters (punctuation is completely ignored\n        and always treated as a token separator).\n\n    ngram_range : tuple (min_n, max_n) (default=(1, 1))\n        The lower and upper boundary of the range of n-values for different\n        n-grams to be extracted. All values of n such that min_n <= n <= max_n\n        will be used.\n\n    max_df : float in range [0.0, 1.0] or int (default=1.0)\n        When building the vocabulary ignore terms that have a document\n        frequency strictly higher than the given threshold (corpus-specific\n        stop words).\n        If float, the parameter represents a proportion of documents, integer\n        absolute counts.\n        This parameter is ignored if vocabulary is not None.\n\n    min_df : float in range [0.0, 1.0] or int (default=1)\n        When building the vocabulary ignore terms that have a document\n        frequency strictly lower than the given threshold. This value is also\n        called cut-off in the literature.\n        If float, the parameter represents a proportion of documents, integer\n        absolute counts.\n        This parameter is ignored if vocabulary is not None.\n\n    max_features : int or None (default=None)\n        If not None, build a vocabulary that only consider the top\n        max_features ordered by term frequency across the corpus.\n\n        This parameter is ignored if vocabulary is not None.\n\n    vocabulary : Mapping or iterable, optional (default=None)\n        Either a Mapping (e.g., a dict) where keys are terms and values are\n        indices in the feature matrix, or an iterable over terms. If not\n        given, a vocabulary is determined from the input documents.\n\n    binary : boolean (default=False)\n        If True, all non-zero term counts are set to 1. This does not mean\n        outputs will have only 0/1 values, only that the tf term in tf-idf\n        is binary. (Set idf and normalization to False to get 0/1 outputs.)\n\n    dtype : type, optional (default=float64)\n        Type of the matrix returned by fit_transform() or transform().\n\n    norm : 'l1', 'l2' or None, optional (default='l2')\n        Each output row will have unit norm, either:\n        * 'l2': Sum of squares of vector elements is 1. The cosine\n        similarity between two vectors is their dot product when l2 norm has\n        been applied.\n        * 'l1': Sum of absolute values of vector elements is 1.\n        See :func:`preprocessing.normalize`\n\n    use_idf : boolean (default=True)\n        Enable inverse-document-frequency reweighting.\n\n    smooth_idf : boolean (default=True)\n        Smooth idf weights by adding one to document frequencies, as if an\n        extra document was seen containing every term in the collection\n        exactly once. Prevents zero divisions.\n\n    sublinear_tf : boolean (default=False)\n        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).\n\n    Attributes\n    ----------\n    vocabulary_ : dict\n        A mapping of terms to feature indices.\n\n    idf_ : array, shape (n_features)\n        The inverse document frequency (IDF) vector; only defined\n        if  ``use_idf`` is True.\n\n    stop_words_ : set\n        Terms that were ignored because they either:\n\n          - occurred in too many documents (`max_df`)\n          - occurred in too few documents (`min_df`)\n          - were cut off by feature selection (`max_features`).\n\n        This is only available if no vocabulary was given.\n\n    Examples\n    --------\n    >>> from sklearn.feature_extraction.text import TfidfVectorizer\n    >>> corpus = [\n    ...     'This is the first document.',\n    ...     'This document is the second document.',\n    ...     'And this is the third one.',\n    ...     'Is this the first document?',\n    ... ]\n    >>> vectorizer = TfidfVectorizer()\n    >>> X = vectorizer.fit_transform(corpus)\n    >>> print(vectorizer.get_feature_names())\n    ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']\n    >>> print(X.shape)\n    (4, 9)\n\n    See also\n    --------\n    CountVectorizer : Transforms text into a sparse matrix of n-gram counts.\n\n    TfidfTransformer : Performs the TF-IDF transformation from a provided\n        matrix of counts.\n\n    Notes\n    -----\n    The ``stop_words_`` attribute can get large and increase the model size\n    when pickling. This attribute is provided only for introspection and can\n    be safely removed using delattr or set to None before pickling.\n    \"\"\"\n\n    def __init__(self, input='content', encoding='utf-8',\n                 decode_error='strict', strip_accents=None, lowercase=True,\n                 preprocessor=None, tokenizer=None, analyzer='word',\n                 stop_words=None, token_pattern=r\"(?u)\\b\\w\\w+\\b\",\n                 ngram_range=(1, 1), max_df=1.0, min_df=1,\n                 max_features=None, vocabulary=None, binary=False,\n                 dtype=np.float64, norm='l2', use_idf=True, smooth_idf=True,\n                 sublinear_tf=False):\n\n        super().__init__(\n            input=input, encoding=encoding, decode_error=decode_error,\n            strip_accents=strip_accents, lowercase=lowercase,\n            preprocessor=preprocessor, tokenizer=tokenizer, analyzer=analyzer,\n            stop_words=stop_words, token_pattern=token_pattern,\n            ngram_range=ngram_range, max_df=max_df, min_df=min_df,\n            max_features=max_features, vocabulary=vocabulary, binary=binary,\n            dtype=dtype)\n\n        self._tfidf = TfidfTransformer(norm=norm, use_idf=use_idf,\n                                       smooth_idf=smooth_idf,\n                                       sublinear_tf=sublinear_tf)\n\n    # Broadcast the TF-IDF parameters to the underlying transformer instance\n    # for easy grid search and repr\n\n    @property\n    def norm(self):\n        return self._tfidf.norm\n\n    @norm.setter\n    def norm(self, value):\n        self._tfidf.norm = value\n\n    @property\n    def use_idf(self):\n        return self._tfidf.use_idf\n\n    @use_idf.setter\n    def use_idf(self, value):\n        self._tfidf.use_idf = value\n\n    @property\n    def smooth_idf(self):\n        return self._tfidf.smooth_idf\n\n    @smooth_idf.setter\n    def smooth_idf(self, value):\n        self._tfidf.smooth_idf = value\n\n    @property\n    def sublinear_tf(self):\n        return self._tfidf.sublinear_tf\n\n    @sublinear_tf.setter\n    def sublinear_tf(self, value):\n        self._tfidf.sublinear_tf = value\n\n    @property\n    def idf_(self):\n        return self._tfidf.idf_\n\n    @idf_.setter\n    def idf_(self, value):\n        self._validate_vocabulary()\n        if hasattr(self, 'vocabulary_'):\n            if len(self.vocabulary_) != len(value):\n                raise ValueError(\"idf length = %d must be equal \"\n                                 \"to vocabulary size = %d\" %\n                                 (len(value), len(self.vocabulary)))\n        self._tfidf.idf_ = value\n\n    def _check_params(self):\n        if self.dtype not in FLOAT_DTYPES:\n            warnings.warn(\"Only {} 'dtype' should be used. {} 'dtype' will \"\n                          \"be converted to np.float64.\"\n                          .format(FLOAT_DTYPES, self.dtype),\n                          UserWarning)\n\n    def fit(self, raw_documents, y=None):\n        \"\"\"Learn vocabulary and idf from training set.\n\n        Parameters\n        ----------\n        raw_documents : iterable\n            an iterable which yields either str, unicode or file objects\n\n        Returns\n        -------\n        self : TfidfVectorizer\n        \"\"\"\n        self._check_params()\n        X = super().fit_transform(raw_documents)\n        self._tfidf.fit(X)\n        return self\n\n    def fit_transform(self, raw_documents, y=None):\n        \"\"\"Learn vocabulary and idf, return term-document matrix.\n\n        This is equivalent to fit followed by transform, but more efficiently\n        implemented.\n\n        Parameters\n        ----------\n        raw_documents : iterable\n            an iterable which yields either str, unicode or file objects\n\n        Returns\n        -------\n        X : sparse matrix, [n_samples, n_features]\n            Tf-idf-weighted document-term matrix.\n        \"\"\"\n        self._check_params()\n        X = super().fit_transform(raw_documents)\n        self._tfidf.fit(X)\n        # X is already a transformed view of raw_documents so\n        # we set copy to False\n        return self._tfidf.transform(X, copy=False)\n\n    def transform(self, raw_documents, copy=True):\n        \"\"\"Transform documents to document-term matrix.\n\n        Uses the vocabulary and document frequencies (df) learned by fit (or\n        fit_transform).\n\n        Parameters\n        ----------\n        raw_documents : iterable\n            an iterable which yields either str, unicode or file objects\n\n        copy : boolean, default True\n            Whether to copy X and operate on the copy or perform in-place\n            operations.\n\n        Returns\n        -------\n        X : sparse matrix, [n_samples, n_features]\n            Tf-idf-weighted document-term matrix.\n        \"\"\"\n        check_is_fitted(self, '_tfidf', 'The tfidf vector is not fitted')\n\n        X = super().transform(raw_documents)\n        return self._tfidf.transform(X, copy=False)\n\n    def _more_tags(self):\n        return {'X_types': ['string'], '_skip_test': True}"
        }
      ]
    },
    {
      "pr_number": 13157,
      "pr_title": "[MRG+1]\u00a0API Change default multioutput in RegressorMixin.score to keep consistent with metrics.r2_score",
      "pr_body": "Closes #12772 \r\nWondering if someone has a better way :)\r\nIn the original issue, I tried to ask why we prefer uniform_average, but received no reply. I guess we choose uniform_average to keep consistent with other regression metrics.",
      "issue_id": 12772,
      "issue_title": "Different r2_score multioutput default in r2_score and base.RegressorMixin",
      "issue_body": "We've changed multioutput default in r2_score to \"uniform_average\" in 0.19, but in base.RegressorMixin, we still use ``multioutput='variance_weighted'`` (#5143).\r\nAlso see the strange things below:\r\nhttps://github.com/scikit-learn/scikit-learn/blob/4603e481e9ac67eaf906ae5936263b675ba9bc9c/sklearn/multioutput.py#L283-L286",
      "issue_closed_at": "2019-03-15T09:47:51Z",
      "base_commit": "85440978f517118e78dc15f84e397d50d14c8097",
      "changes": [
        {
          "file": "sklearn/base.py",
          "type": "function",
          "name": "score",
          "class_name": "DensityMixin",
          "code": "def score(self, X, y=None):\n        \"\"\"Returns the score of the model on the data X\n\n        Parameters\n        ----------\n        X : array-like, shape = (n_samples, n_features)\n\n        Returns\n        -------\n        score : float\n        \"\"\"\n        pass"
        },
        {
          "file": "sklearn/linear_model/coordinate_descent.py",
          "type": "class",
          "name": "MultiTaskLassoCV",
          "code": "class MultiTaskLassoCV(LinearModelCV, RegressorMixin):\n    \"\"\"Multi-task Lasso model trained with L1/L2 mixed-norm as regularizer.\n\n    See glossary entry for :term:`cross-validation estimator`.\n\n    The optimization objective for MultiTaskLasso is::\n\n        (1 / (2 * n_samples)) * ||Y - XW||^Fro_2 + alpha * ||W||_21\n\n    Where::\n\n        ||W||_21 = \\\\sum_i \\\\sqrt{\\\\sum_j w_{ij}^2}\n\n    i.e. the sum of norm of each row.\n\n    Read more in the :ref:`User Guide <multi_task_lasso>`.\n\n    Parameters\n    ----------\n    eps : float, optional\n        Length of the path. ``eps=1e-3`` means that\n        ``alpha_min / alpha_max = 1e-3``.\n\n    n_alphas : int, optional\n        Number of alphas along the regularization path\n\n    alphas : array-like, optional\n        List of alphas where to compute the models.\n        If not provided, set automatically.\n\n    fit_intercept : boolean\n        whether to calculate the intercept for this model. If set\n        to false, no intercept will be used in calculations\n        (e.g. data is expected to be already centered).\n\n    normalize : boolean, optional, default False\n        This parameter is ignored when ``fit_intercept`` is set to False.\n        If True, the regressors X will be normalized before regression by\n        subtracting the mean and dividing by the l2-norm.\n        If you wish to standardize, please use\n        :class:`sklearn.preprocessing.StandardScaler` before calling ``fit``\n        on an estimator with ``normalize=False``.\n\n    max_iter : int, optional\n        The maximum number of iterations.\n\n    tol : float, optional\n        The tolerance for the optimization: if the updates are\n        smaller than ``tol``, the optimization code checks the\n        dual gap for optimality and continues until it is smaller\n        than ``tol``.\n\n    copy_X : boolean, optional, default True\n        If ``True``, X will be copied; else, it may be overwritten.\n\n    cv : int, cross-validation generator or an iterable, optional\n        Determines the cross-validation splitting strategy.\n        Possible inputs for cv are:\n\n        - None, to use the default 3-fold cross-validation,\n        - integer, to specify the number of folds.\n        - :term:`CV splitter`,\n        - An iterable yielding (train, test) splits as arrays of indices.\n\n        For integer/None inputs, :class:`KFold` is used.\n\n        Refer :ref:`User Guide <cross_validation>` for the various\n        cross-validation strategies that can be used here.\n\n        .. versionchanged:: 0.20\n            ``cv`` default value if None will change from 3-fold to 5-fold\n            in v0.22.\n\n    verbose : bool or integer\n        Amount of verbosity.\n\n    n_jobs : int or None, optional (default=None)\n        Number of CPUs to use during the cross validation. Note that this is\n        used only if multiple values for l1_ratio are given.\n        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.\n        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`\n        for more details.\n\n    random_state : int, RandomState instance or None, optional, default None\n        The seed of the pseudo random number generator that selects a random\n        feature to update.  If int, random_state is the seed used by the random\n        number generator; If RandomState instance, random_state is the random\n        number generator; If None, the random number generator is the\n        RandomState instance used by `np.random`. Used when ``selection`` ==\n        'random'\n\n    selection : str, default 'cyclic'\n        If set to 'random', a random coefficient is updated every iteration\n        rather than looping over features sequentially by default. This\n        (setting to 'random') often leads to significantly faster convergence\n        especially when tol is higher than 1e-4.\n\n    Attributes\n    ----------\n    intercept_ : array, shape (n_tasks,)\n        Independent term in decision function.\n\n    coef_ : array, shape (n_tasks, n_features)\n        Parameter vector (W in the cost function formula).\n        Note that ``coef_`` stores the transpose of ``W``, ``W.T``.\n\n    alpha_ : float\n        The amount of penalization chosen by cross validation\n\n    mse_path_ : array, shape (n_alphas, n_folds)\n        mean square error for the test set on each fold, varying alpha\n\n    alphas_ : numpy array, shape (n_alphas,)\n        The grid of alphas used for fitting.\n\n    n_iter_ : int\n        number of iterations run by the coordinate descent solver to reach\n        the specified tolerance for the optimal alpha.\n\n    Examples\n    --------\n    >>> from sklearn.linear_model import MultiTaskLassoCV\n    >>> from sklearn.datasets import make_regression\n    >>> X, y = make_regression(n_targets=2, noise=4, random_state=0)\n    >>> reg = MultiTaskLassoCV(cv=5, random_state=0).fit(X, y)\n    >>> reg.score(X, y) # doctest: +ELLIPSIS\n    0.9994...\n    >>> reg.alpha_\n    0.5713...\n    >>> reg.predict(X[:1,])\n    array([[153.7971...,  94.9015...]])\n\n    See also\n    --------\n    MultiTaskElasticNet\n    ElasticNetCV\n    MultiTaskElasticNetCV\n\n    Notes\n    -----\n    The algorithm used to fit the model is coordinate descent.\n\n    To avoid unnecessary memory duplication the X argument of the fit method\n    should be directly passed as a Fortran-contiguous numpy array.\n    \"\"\"\n    path = staticmethod(lasso_path)\n\n    def __init__(self, eps=1e-3, n_alphas=100, alphas=None, fit_intercept=True,\n                 normalize=False, max_iter=1000, tol=1e-4, copy_X=True,\n                 cv='warn', verbose=False, n_jobs=None, random_state=None,\n                 selection='cyclic'):\n        super().__init__(\n            eps=eps, n_alphas=n_alphas, alphas=alphas,\n            fit_intercept=fit_intercept, normalize=normalize,\n            max_iter=max_iter, tol=tol, copy_X=copy_X,\n            cv=cv, verbose=verbose, n_jobs=n_jobs, random_state=random_state,\n            selection=selection)\n\n    def _more_tags(self):\n        return {'multioutput_only': True}"
        },
        {
          "file": "sklearn/multioutput.py",
          "type": "function",
          "name": "partial_fit",
          "class_name": "MultiOutputRegressor",
          "code": "def partial_fit(self, X, y, sample_weight=None):\n        \"\"\"Incrementally fit the model to data.\n        Fit a separate model for each output variable.\n\n        Parameters\n        ----------\n        X : (sparse) array-like, shape (n_samples, n_features)\n            Data.\n\n        y : (sparse) array-like, shape (n_samples, n_outputs)\n            Multi-output targets.\n\n        sample_weight : array-like, shape = (n_samples) or None\n            Sample weights. If None, then samples are equally weighted.\n            Only supported if the underlying regressor supports sample\n            weights.\n\n        Returns\n        -------\n        self : object\n        \"\"\"\n        super().partial_fit(\n            X, y, sample_weight=sample_weight)"
        }
      ]
    }
  ]
}