{
  "instance_id": "scikit-learn__scikit-learn-14983",
  "repo": "scikit-learn/scikit-learn",
  "created_at": "2019-09-14T15:31:18Z",
  "problem_statement": "RepeatedKFold and RepeatedStratifiedKFold do not show correct __repr__ string\n#### Description\r\n\r\n`RepeatedKFold` and `RepeatedStratifiedKFold` do not show correct \\_\\_repr\\_\\_ string.\r\n\r\n#### Steps/Code to Reproduce\r\n\r\n```python\r\n>>> from sklearn.model_selection import RepeatedKFold, RepeatedStratifiedKFold\r\n>>> repr(RepeatedKFold())\r\n>>> repr(RepeatedStratifiedKFold())\r\n```\r\n\r\n#### Expected Results\r\n\r\n```python\r\n>>> repr(RepeatedKFold())\r\nRepeatedKFold(n_splits=5, n_repeats=10, random_state=None)\r\n>>> repr(RepeatedStratifiedKFold())\r\nRepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=None)\r\n```\r\n\r\n#### Actual Results\r\n\r\n```python\r\n>>> repr(RepeatedKFold())\r\n'<sklearn.model_selection._split.RepeatedKFold object at 0x0000016421AA4288>'\r\n>>> repr(RepeatedStratifiedKFold())\r\n'<sklearn.model_selection._split.RepeatedStratifiedKFold object at 0x0000016420E115C8>'\r\n```\r\n\r\n#### Versions\r\n```\r\nSystem:\r\n    python: 3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]\r\nexecutable: D:\\anaconda3\\envs\\xyz\\python.exe\r\n   machine: Windows-10-10.0.16299-SP0\r\n\r\nBLAS:\r\n    macros:\r\n  lib_dirs:\r\ncblas_libs: cblas\r\n\r\nPython deps:\r\n       pip: 19.2.2\r\nsetuptools: 41.0.1\r\n   sklearn: 0.21.2\r\n     numpy: 1.16.4\r\n     scipy: 1.3.1\r\n    Cython: None\r\n    pandas: 0.24.2\r\n```\n",
  "patch": "diff --git a/sklearn/model_selection/_split.py b/sklearn/model_selection/_split.py\n--- a/sklearn/model_selection/_split.py\n+++ b/sklearn/model_selection/_split.py\n@@ -1163,6 +1163,9 @@ def get_n_splits(self, X=None, y=None, groups=None):\n                      **self.cvargs)\n         return cv.get_n_splits(X, y, groups) * self.n_repeats\n \n+    def __repr__(self):\n+        return _build_repr(self)\n+\n \n class RepeatedKFold(_RepeatedSplits):\n     \"\"\"Repeated K-Fold cross validator.\n@@ -2158,6 +2161,8 @@ def _build_repr(self):\n         try:\n             with warnings.catch_warnings(record=True) as w:\n                 value = getattr(self, key, None)\n+                if value is None and hasattr(self, 'cvargs'):\n+                    value = self.cvargs.get(key, None)\n             if len(w) and w[0].category == DeprecationWarning:\n                 # if the parameter is deprecated, don't show it\n                 continue\n",
  "similar_bug_items": [
    {
      "pr_number": 11042,
      "pr_title": "[MRG + 1] Ensuring that the OneHotEncoder outputs sparse matrix with given dtype #11034",
      "pr_body": "#### Reference Issues/PRs\r\nOriginal discussion at #11034\r\n\r\n#### What does this implement/fix? Explain your changes.\r\n",
      "issue_id": 11034,
      "issue_title": "OneHotEncoder does not output scipy sparse matrix of given dtype",
      "issue_body": "#### Description\r\nOneHotEncoder ignores the specified dtype in the construction of the sparse array when mixed input data are passed, i.e. with both categorical and real data types\r\n\r\n#### Steps/Code to Reproduce\r\n```python\r\nimport numpy as np\r\n\r\nfrom sklearn.preprocessing import OneHotEncoder\r\nenc = OneHotEncoder(dtype=np.float32, categorical_features=[0, 1])\r\n\r\nx = np.array([[0, 1, 0, 0], [1, 2, 0, 0]], dtype=int)\r\nsparse = enc.fit(x).transform(x)\r\n```\r\n\r\n#### Expected Results\r\n```python\r\nsparse: <2x6 sparse matrix of type '<class 'numpy.float32'>'\r\n\twith 4 stored elements in COOrdinate format>\r\n```\r\n\r\n#### Actual Results\r\n```python\r\nsparse: <2x6 sparse matrix of type '<class 'numpy.float64'>'\r\n\twith 4 stored elements in COOrdinate format>\r\n```\r\n\r\n#### Versions\r\n__Platform__: Linux-4.13.0-38-generic-x86_64-with-debian-stretch-sid\r\n__Python__: 3.6.3 |Anaconda custom (64-bit)| (default, Oct 13 2017, 12:02:49) [GCC 7.2.0]\r\n__NumPy__: NumPy \r\n__SciPy__: SciPy 1.0.1\r\n__Scikit-Learn__: Scikit-Learn 0.19.1\r\n",
      "issue_closed_at": "2018-06-06T09:03:02Z",
      "base_commit": "f049ec72eb70443ec8d7826066c4246035677c11",
      "changes": [
        {
          "file": "sklearn/preprocessing/data.py",
          "type": "function",
          "name": "add_dummy_feature",
          "class_name": null,
          "code": "def add_dummy_feature(X, value=1.0):\n    \"\"\"Augment dataset with an additional dummy feature.\n\n    This is useful for fitting an intercept term with implementations which\n    cannot otherwise fit it directly.\n\n    Parameters\n    ----------\n    X : {array-like, sparse matrix}, shape [n_samples, n_features]\n        Data.\n\n    value : float\n        Value to use for the dummy feature.\n\n    Returns\n    -------\n\n    X : {array, sparse matrix}, shape [n_samples, n_features + 1]\n        Same data with dummy feature added as first column.\n\n    Examples\n    --------\n\n    >>> from sklearn.preprocessing import add_dummy_feature\n    >>> add_dummy_feature([[0, 1], [1, 0]])\n    array([[1., 0., 1.],\n           [1., 1., 0.]])\n    \"\"\"\n    X = check_array(X, accept_sparse=['csc', 'csr', 'coo'], dtype=FLOAT_DTYPES)\n    n_samples, n_features = X.shape\n    shape = (n_samples, n_features + 1)\n    if sparse.issparse(X):\n        if sparse.isspmatrix_coo(X):\n            # Shift columns to the right.\n            col = X.col + 1\n            # Column indices of dummy feature are 0 everywhere.\n            col = np.concatenate((np.zeros(n_samples), col))\n            # Row indices of dummy feature are 0, ..., n_samples-1.\n            row = np.concatenate((np.arange(n_samples), X.row))\n            # Prepend the dummy feature n_samples times.\n            data = np.concatenate((np.ones(n_samples) * value, X.data))\n            return sparse.coo_matrix((data, (row, col)), shape)\n        elif sparse.isspmatrix_csc(X):\n            # Shift index pointers since we need to add n_samples elements.\n            indptr = X.indptr + n_samples\n            # indptr[0] must be 0.\n            indptr = np.concatenate((np.array([0]), indptr))\n            # Row indices of dummy feature are 0, ..., n_samples-1.\n            indices = np.concatenate((np.arange(n_samples), X.indices))\n            # Prepend the dummy feature n_samples times.\n        
    data = np.concatenate((np.ones(n_samples) * value, X.data))\n            return sparse.csc_matrix((data, indices, indptr), shape)\n        else:\n            klass = X.__class__\n            return klass(add_dummy_feature(X.tocoo(), value))\n    else:\n        return np.hstack((np.ones((n_samples, 1)) * value, X))"
        },
        {
          "file": "sklearn/preprocessing/data.py",
          "type": "function",
          "name": "_transform_selected",
          "class_name": null,
          "code": "def _transform_selected(X, transform, selected=\"all\", copy=True):\n    \"\"\"Apply a transform function to portion of selected features\n\n    Parameters\n    ----------\n    X : {array-like, sparse matrix}, shape [n_samples, n_features]\n        Dense array or sparse matrix.\n\n    transform : callable\n        A callable transform(X) -> X_transformed\n\n    copy : boolean, optional\n        Copy X even if it could be avoided.\n\n    selected: \"all\" or array of indices or mask\n        Specify which features to apply the transform to.\n\n    Returns\n    -------\n    X : array or sparse matrix, shape=(n_samples, n_features_new)\n    \"\"\"\n    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)\n\n    if isinstance(selected, six.string_types) and selected == \"all\":\n        return transform(X)\n\n    if len(selected) == 0:\n        return X\n\n    n_features = X.shape[1]\n    ind = np.arange(n_features)\n    sel = np.zeros(n_features, dtype=bool)\n    sel[np.asarray(selected)] = True\n    not_sel = np.logical_not(sel)\n    n_selected = np.sum(sel)\n\n    if n_selected == 0:\n        # No features selected.\n        return X\n    elif n_selected == n_features:\n        # All features selected.\n        return transform(X)\n    else:\n        X_sel = transform(X[:, ind[sel]])\n        X_not_sel = X[:, ind[not_sel]]\n\n        if sparse.issparse(X_sel) or sparse.issparse(X_not_sel):\n            return sparse.hstack((X_sel, X_not_sel))\n        else:\n            return np.hstack((X_sel, X_not_sel))"
        },
        {
          "file": "sklearn/preprocessing/data.py",
          "type": "function",
          "name": "fit_transform",
          "class_name": "OneHotEncoder",
          "code": "def fit_transform(self, X, y=None):\n        \"\"\"Fit OneHotEncoder to X, then transform X.\n\n        Equivalent to self.fit(X).transform(X), but more convenient and more\n        efficient. See fit for the parameters, transform for the return value.\n\n        Parameters\n        ----------\n        X : array-like, shape [n_samples, n_feature]\n            Input array of type int.\n        \"\"\"\n        return _transform_selected(X, self._fit_transform,\n                                   self.categorical_features, copy=True)"
        },
        {
          "file": "sklearn/preprocessing/data.py",
          "type": "function",
          "name": "transform",
          "class_name": "CategoricalEncoder",
          "code": "def transform(self, X):\n        \"\"\"Transform X using specified encoding scheme.\n\n        Parameters\n        ----------\n        X : array-like, shape [n_samples, n_features]\n            The data to encode.\n\n        Returns\n        -------\n        X_out : sparse matrix or a 2-d array\n            Transformed input.\n\n        \"\"\"\n        X_temp = check_array(X, dtype=None)\n        if not hasattr(X, 'dtype') and np.issubdtype(X_temp.dtype, np.str_):\n            X = check_array(X, dtype=np.object)\n        else:\n            X = X_temp\n\n        n_samples, n_features = X.shape\n        X_int = np.zeros_like(X, dtype=np.int)\n        X_mask = np.ones_like(X, dtype=np.bool)\n\n        for i in range(n_features):\n            Xi = X[:, i]\n            valid_mask = np.in1d(Xi, self.categories_[i])\n\n            if not np.all(valid_mask):\n                if self.handle_unknown == 'error':\n                    diff = np.unique(X[~valid_mask, i])\n                    msg = (\"Found unknown categories {0} in column {1}\"\n                           \" during transform\".format(diff, i))\n                    raise ValueError(msg)\n                else:\n                    # Set the problematic rows to an acceptable value and\n                    # continue `The rows are marked `X_mask` and will be\n                    # removed later.\n                    X_mask[:, i] = valid_mask\n                    Xi = Xi.copy()\n                    Xi[~valid_mask] = self.categories_[i][0]\n            X_int[:, i] = self._label_encoders_[i].transform(Xi)\n\n        if self.encoding == 'ordinal':\n            return X_int.astype(self.dtype, copy=False)\n\n        mask = X_mask.ravel()\n        n_values = [cats.shape[0] for cats in self.categories_]\n        n_values = np.array([0] + n_values)\n        feature_indices = np.cumsum(n_values)\n\n        indices = (X_int + feature_indices[:-1]).ravel()[mask]\n        indptr = X_mask.sum(axis=1).cumsum()\n 
       indptr = np.insert(indptr, 0, 0)\n        data = np.ones(n_samples * n_features)[mask]\n\n        out = sparse.csr_matrix((data, indices, indptr),\n                                shape=(n_samples, feature_indices[-1]),\n                                dtype=self.dtype)\n        if self.encoding == 'onehot-dense':\n            return out.toarray()\n        else:\n            return out"
        }
      ]
    },
    {
      "pr_number": 13641,
      "pr_title": "[MRG+1] API make sure vectorizers read data from file before analyzing",
      "pr_body": "Fixes #5482\r\n\r\nIf the given analyzer is a callable, it seems reasonable to assume that if `input='file'` or `input='filename'`, the data should be read from the file first, and then passed to the analyzer, the same way as it's done for non-callable analyzers.\r\n\r\nThis PR clarifies this in the docstrings, and passes the \"decoded\" input to the analyzer. It should be less of a concern regarding the input on the bytes vs str since we don't support python2 anymore.\r\n\r\nI'm not entirely sure if this is what we wanna do, it's more of a proposal to move it forward.\r\n\r\nAfter this PR, the following would result in a `FileNotFoundError` exception:\r\n\r\n```python\r\ncv = CountVectorizer(analyzer=lambda x: x.split(), input='filename')\r\ncv.fit(['hello world']).vocabulary_\r\n```",
      "issue_id": 5482,
      "issue_title": "CountVectorizer with custom analyzer ignores input argument",
      "issue_body": "Example:\n\n``` py\ncv = CountVectorizer(analyzer=lambda x: x.split(), input='filename')\ncv.fit(['hello world']).vocabulary_\n```\n\nSame for `input=\"file\"`. Not sure if this should be fixed or just documented; I don't like changing the behavior of the vectorizers yet again...\n",
      "issue_closed_at": "2019-04-23T03:50:24Z",
      "base_commit": "badaa153e67ffa56fb1a413b3b7b5b8507024291",
      "changes": [
        {
          "file": "sklearn/feature_extraction/text.py",
          "type": "line",
          "name": "line 31",
          "code": "from ..utils.validation import check_is_fitted, check_array, FLOAT_DTYPES\nfrom ..utils import _IS_32BIT\nfrom ..utils.fixes import _astype_copy_false\n\n\n__all__ = ['HashingVectorizer',"
        },
        {
          "file": "sklearn/feature_extraction/text.py",
          "type": "function",
          "name": "_check_stop_words_consistency",
          "class_name": "VectorizerMixin",
          "code": "def _check_stop_words_consistency(self, stop_words, preprocess, tokenize):\n        \"\"\"Check if stop words are consistent\n\n        Returns\n        -------\n        is_consistent : True if stop words are consistent with the preprocessor\n                        and tokenizer, False if they are not, None if the check\n                        was previously performed, \"error\" if it could not be\n                        performed (e.g. because of the use of a custom\n                        preprocessor / tokenizer)\n        \"\"\"\n        if id(self.stop_words) == getattr(self, '_stop_words_id', None):\n            # Stop words were previously validated\n            return None\n\n        # NB: stop_words is validated, unlike self.stop_words\n        try:\n            inconsistent = set()\n            for w in stop_words or ():\n                tokens = list(tokenize(preprocess(w)))\n                for token in tokens:\n                    if token not in stop_words:\n                        inconsistent.add(token)\n            self._stop_words_id = id(self.stop_words)\n\n            if inconsistent:\n                warnings.warn('Your stop_words may be inconsistent with '\n                              'your preprocessing. Tokenizing the stop '\n                              'words generated tokens %r not in '\n                              'stop_words.' % sorted(inconsistent))\n            return not inconsistent\n        except Exception:\n            # Failed to check stop words consistency (e.g. because a custom\n            # preprocessor or tokenizer was used)\n            self._stop_words_id = id(self.stop_words)\n            return 'error'"
        },
        {
          "file": "sklearn/feature_extraction/text.py",
          "type": "class",
          "name": "HashingVectorizer",
          "code": "class HashingVectorizer(BaseEstimator, VectorizerMixin, TransformerMixin):\n    \"\"\"Convert a collection of text documents to a matrix of token occurrences\n\n    It turns a collection of text documents into a scipy.sparse matrix holding\n    token occurrence counts (or binary occurrence information), possibly\n    normalized as token frequencies if norm='l1' or projected on the euclidean\n    unit sphere if norm='l2'.\n\n    This text vectorizer implementation uses the hashing trick to find the\n    token string name to feature integer index mapping.\n\n    This strategy has several advantages:\n\n    - it is very low memory scalable to large datasets as there is no need to\n      store a vocabulary dictionary in memory\n\n    - it is fast to pickle and un-pickle as it holds no state besides the\n      constructor parameters\n\n    - it can be used in a streaming (partial fit) or parallel pipeline as there\n      is no state computed during fit.\n\n    There are also a couple of cons (vs using a CountVectorizer with an\n    in-memory vocabulary):\n\n    - there is no way to compute the inverse transform (from feature indices to\n      string feature names) which can be a problem when trying to introspect\n      which features are most important to a model.\n\n    - there can be collisions: distinct tokens can be mapped to the same\n      feature index. However in practice this is rarely an issue if n_features\n      is large enough (e.g. 
2 ** 18 for text classification problems).\n\n    - no IDF weighting as this would render the transformer stateful.\n\n    The hash function employed is the signed 32-bit version of Murmurhash3.\n\n    Read more in the :ref:`User Guide <text_feature_extraction>`.\n\n    Parameters\n    ----------\n\n    input : string {'filename', 'file', 'content'}\n        If 'filename', the sequence passed as an argument to fit is\n        expected to be a list of filenames that need reading to fetch\n        the raw content to analyze.\n\n        If 'file', the sequence items must have a 'read' method (file-like\n        object) that is called to fetch the bytes in memory.\n\n        Otherwise the input is expected to be the sequence strings or\n        bytes items are expected to be analyzed directly.\n\n    encoding : string, default='utf-8'\n        If bytes or files are given to analyze, this encoding is used to\n        decode.\n\n    decode_error : {'strict', 'ignore', 'replace'}\n        Instruction on what to do if a byte sequence is given to analyze that\n        contains characters not of the given `encoding`. By default, it is\n        'strict', meaning that a UnicodeDecodeError will be raised. 
Other\n        values are 'ignore' and 'replace'.\n\n    strip_accents : {'ascii', 'unicode', None}\n        Remove accents and perform other character normalization\n        during the preprocessing step.\n        'ascii' is a fast method that only works on characters that have\n        an direct ASCII mapping.\n        'unicode' is a slightly slower method that works on any characters.\n        None (default) does nothing.\n\n        Both 'ascii' and 'unicode' use NFKD normalization from\n        :func:`unicodedata.normalize`.\n\n    lowercase : boolean, default=True\n        Convert all characters to lowercase before tokenizing.\n\n    preprocessor : callable or None (default)\n        Override the preprocessing (string transformation) stage while\n        preserving the tokenizing and n-grams generation steps.\n\n    tokenizer : callable or None (default)\n        Override the string tokenization step while preserving the\n        preprocessing and n-grams generation steps.\n        Only applies if ``analyzer == 'word'``.\n\n    stop_words : string {'english'}, list, or None (default)\n        If 'english', a built-in stop word list for English is used.\n        There are several known issues with 'english' and you should\n        consider an alternative (see :ref:`stop_words`).\n\n        If a list, that list is assumed to contain stop words, all of which\n        will be removed from the resulting tokens.\n        Only applies if ``analyzer == 'word'``.\n\n    token_pattern : string\n        Regular expression denoting what constitutes a \"token\", only used\n        if ``analyzer == 'word'``. The default regexp selects tokens of 2\n        or more alphanumeric characters (punctuation is completely ignored\n        and always treated as a token separator).\n\n    ngram_range : tuple (min_n, max_n), default=(1, 1)\n        The lower and upper boundary of the range of n-values for different\n        n-grams to be extracted. 
All values of n such that min_n <= n <= max_n\n        will be used.\n\n    analyzer : string, {'word', 'char', 'char_wb'} or callable\n        Whether the feature should be made of word or character n-grams.\n        Option 'char_wb' creates character n-grams only from text inside\n        word boundaries; n-grams at the edges of words are padded with space.\n\n        If a callable is passed it is used to extract the sequence of features\n        out of the raw, unprocessed input.\n\n    n_features : integer, default=(2 ** 20)\n        The number of features (columns) in the output matrices. Small numbers\n        of features are likely to cause hash collisions, but large numbers\n        will cause larger coefficient dimensions in linear learners.\n\n    binary : boolean, default=False.\n        If True, all non zero counts are set to 1. This is useful for discrete\n        probabilistic models that model binary events rather than integer\n        counts.\n\n    norm : 'l1', 'l2' or None, optional\n        Norm used to normalize term vectors. None for no normalization.\n\n    alternate_sign : boolean, optional, default True\n        When True, an alternating sign is added to the features as to\n        approximately conserve the inner product in the hashed space even for\n        small n_features. This approach is similar to sparse random projection.\n\n        .. versionadded:: 0.19\n\n    dtype : type, optional\n        Type of the matrix returned by fit_transform() or transform().\n\n    Examples\n    --------\n    >>> from sklearn.feature_extraction.text import HashingVectorizer\n    >>> corpus = [\n    ...     'This is the first document.',\n    ...     'This document is the second document.',\n    ...     'And this is the third one.',\n    ...     'Is this the first document?',\n    ... 
]\n    >>> vectorizer = HashingVectorizer(n_features=2**4)\n    >>> X = vectorizer.fit_transform(corpus)\n    >>> print(X.shape)\n    (4, 16)\n\n    See also\n    --------\n    CountVectorizer, TfidfVectorizer\n\n    \"\"\"\n    def __init__(self, input='content', encoding='utf-8',\n                 decode_error='strict', strip_accents=None,\n                 lowercase=True, preprocessor=None, tokenizer=None,\n                 stop_words=None, token_pattern=r\"(?u)\\b\\w\\w+\\b\",\n                 ngram_range=(1, 1), analyzer='word', n_features=(2 ** 20),\n                 binary=False, norm='l2', alternate_sign=True,\n                 dtype=np.float64):\n        self.input = input\n        self.encoding = encoding\n        self.decode_error = decode_error\n        self.strip_accents = strip_accents\n        self.preprocessor = preprocessor\n        self.tokenizer = tokenizer\n        self.analyzer = analyzer\n        self.lowercase = lowercase\n        self.token_pattern = token_pattern\n        self.stop_words = stop_words\n        self.n_features = n_features\n        self.ngram_range = ngram_range\n        self.binary = binary\n        self.norm = norm\n        self.alternate_sign = alternate_sign\n        self.dtype = dtype\n\n    def partial_fit(self, X, y=None):\n        \"\"\"Does nothing: this transformer is stateless.\n\n        This method is just there to mark the fact that this transformer\n        can work in a streaming setup.\n\n        Parameters\n        ----------\n        X : array-like, shape [n_samples, n_features]\n            Training data.\n        \"\"\"\n        return self\n\n    def fit(self, X, y=None):\n        \"\"\"Does nothing: this transformer is stateless.\n\n        Parameters\n        ----------\n        X : array-like, shape [n_samples, n_features]\n            Training data.\n        \"\"\"\n        # triggers a parameter validation\n        if isinstance(X, str):\n            raise ValueError(\n                \"Iterable 
over raw text documents expected, \"\n                \"string object received.\")\n\n        self._validate_params()\n\n        self._get_hasher().fit(X, y=y)\n        return self\n\n    def transform(self, X):\n        \"\"\"Transform a sequence of documents to a document-term matrix.\n\n        Parameters\n        ----------\n        X : iterable over raw text documents, length = n_samples\n            Samples. Each sample must be a text document (either bytes or\n            unicode strings, file name or file object depending on the\n            constructor argument) which will be tokenized and hashed.\n\n        Returns\n        -------\n        X : scipy.sparse matrix, shape = (n_samples, self.n_features)\n            Document-term matrix.\n        \"\"\"\n        if isinstance(X, str):\n            raise ValueError(\n                \"Iterable over raw text documents expected, \"\n                \"string object received.\")\n\n        self._validate_params()\n\n        analyzer = self.build_analyzer()\n        X = self._get_hasher().transform(analyzer(doc) for doc in X)\n        if self.binary:\n            X.data.fill(1)\n        if self.norm is not None:\n            X = normalize(X, norm=self.norm, copy=False)\n        return X\n\n    def fit_transform(self, X, y=None):\n        \"\"\"Transform a sequence of documents to a document-term matrix.\n\n        Parameters\n        ----------\n        X : iterable over raw text documents, length = n_samples\n            Samples. Each sample must be a text document (either bytes or\n            unicode strings, file name or file object depending on the\n            constructor argument) which will be tokenized and hashed.\n        y : any\n            Ignored. 
This parameter exists only for compatibility with\n            sklearn.pipeline.Pipeline.\n\n        Returns\n        -------\n        X : scipy.sparse matrix, shape = (n_samples, self.n_features)\n            Document-term matrix.\n        \"\"\"\n        return self.fit(X, y).transform(X)\n\n    def _get_hasher(self):\n        return FeatureHasher(n_features=self.n_features,\n                             input_type='string', dtype=self.dtype,\n                             alternate_sign=self.alternate_sign)\n\n    def _more_tags(self):\n        return {'X_types': ['string']}"
        },
        {
          "file": "sklearn/feature_extraction/text.py",
          "type": "class",
          "name": "CountVectorizer",
          "code": "class CountVectorizer(BaseEstimator, VectorizerMixin):\n    \"\"\"Convert a collection of text documents to a matrix of token counts\n\n    This implementation produces a sparse representation of the counts using\n    scipy.sparse.csr_matrix.\n\n    If you do not provide an a-priori dictionary and you do not use an analyzer\n    that does some kind of feature selection then the number of features will\n    be equal to the vocabulary size found by analyzing the data.\n\n    Read more in the :ref:`User Guide <text_feature_extraction>`.\n\n    Parameters\n    ----------\n    input : string {'filename', 'file', 'content'}\n        If 'filename', the sequence passed as an argument to fit is\n        expected to be a list of filenames that need reading to fetch\n        the raw content to analyze.\n\n        If 'file', the sequence items must have a 'read' method (file-like\n        object) that is called to fetch the bytes in memory.\n\n        Otherwise the input is expected to be the sequence strings or\n        bytes items are expected to be analyzed directly.\n\n    encoding : string, 'utf-8' by default.\n        If bytes or files are given to analyze, this encoding is used to\n        decode.\n\n    decode_error : {'strict', 'ignore', 'replace'}\n        Instruction on what to do if a byte sequence is given to analyze that\n        contains characters not of the given `encoding`. By default, it is\n        'strict', meaning that a UnicodeDecodeError will be raised. 
Other\n        values are 'ignore' and 'replace'.\n\n    strip_accents : {'ascii', 'unicode', None}\n        Remove accents and perform other character normalization\n        during the preprocessing step.\n        'ascii' is a fast method that only works on characters that have\n        an direct ASCII mapping.\n        'unicode' is a slightly slower method that works on any characters.\n        None (default) does nothing.\n\n        Both 'ascii' and 'unicode' use NFKD normalization from\n        :func:`unicodedata.normalize`.\n\n    lowercase : boolean, True by default\n        Convert all characters to lowercase before tokenizing.\n\n    preprocessor : callable or None (default)\n        Override the preprocessing (string transformation) stage while\n        preserving the tokenizing and n-grams generation steps.\n\n    tokenizer : callable or None (default)\n        Override the string tokenization step while preserving the\n        preprocessing and n-grams generation steps.\n        Only applies if ``analyzer == 'word'``.\n\n    stop_words : string {'english'}, list, or None (default)\n        If 'english', a built-in stop word list for English is used.\n        There are several known issues with 'english' and you should\n        consider an alternative (see :ref:`stop_words`).\n\n        If a list, that list is assumed to contain stop words, all of which\n        will be removed from the resulting tokens.\n        Only applies if ``analyzer == 'word'``.\n\n        If None, no stop words will be used. max_df can be set to a value\n        in the range [0.7, 1.0) to automatically detect and filter stop\n        words based on intra corpus document frequency of terms.\n\n    token_pattern : string\n        Regular expression denoting what constitutes a \"token\", only used\n        if ``analyzer == 'word'``. 
The default regexp select tokens of 2\n        or more alphanumeric characters (punctuation is completely ignored\n        and always treated as a token separator).\n\n    ngram_range : tuple (min_n, max_n)\n        The lower and upper boundary of the range of n-values for different\n        n-grams to be extracted. All values of n such that min_n <= n <= max_n\n        will be used.\n\n    analyzer : string, {'word', 'char', 'char_wb'} or callable\n        Whether the feature should be made of word or character n-grams.\n        Option 'char_wb' creates character n-grams only from text inside\n        word boundaries; n-grams at the edges of words are padded with space.\n\n        If a callable is passed it is used to extract the sequence of features\n        out of the raw, unprocessed input.\n\n    max_df : float in range [0.0, 1.0] or int, default=1.0\n        When building the vocabulary ignore terms that have a document\n        frequency strictly higher than the given threshold (corpus-specific\n        stop words).\n        If float, the parameter represents a proportion of documents, integer\n        absolute counts.\n        This parameter is ignored if vocabulary is not None.\n\n    min_df : float in range [0.0, 1.0] or int, default=1\n        When building the vocabulary ignore terms that have a document\n        frequency strictly lower than the given threshold. 
This value is also\n        called cut-off in the literature.\n        If float, the parameter represents a proportion of documents, integer\n        absolute counts.\n        This parameter is ignored if vocabulary is not None.\n\n    max_features : int or None, default=None\n        If not None, build a vocabulary that only consider the top\n        max_features ordered by term frequency across the corpus.\n\n        This parameter is ignored if vocabulary is not None.\n\n    vocabulary : Mapping or iterable, optional\n        Either a Mapping (e.g., a dict) where keys are terms and values are\n        indices in the feature matrix, or an iterable over terms. If not\n        given, a vocabulary is determined from the input documents. Indices\n        in the mapping should not be repeated and should not have any gap\n        between 0 and the largest index.\n\n    binary : boolean, default=False\n        If True, all non zero counts are set to 1. This is useful for discrete\n        probabilistic models that model binary events rather than integer\n        counts.\n\n    dtype : type, optional\n        Type of the matrix returned by fit_transform() or transform().\n\n    Attributes\n    ----------\n    vocabulary_ : dict\n        A mapping of terms to feature indices.\n\n    stop_words_ : set\n        Terms that were ignored because they either:\n\n          - occurred in too many documents (`max_df`)\n          - occurred in too few documents (`min_df`)\n          - were cut off by feature selection (`max_features`).\n\n        This is only available if no vocabulary was given.\n\n    Examples\n    --------\n    >>> from sklearn.feature_extraction.text import CountVectorizer\n    >>> corpus = [\n    ...     'This is the first document.',\n    ...     'This document is the second document.',\n    ...     'And this is the third one.',\n    ...     'Is this the first document?',\n    ... 
]\n    >>> vectorizer = CountVectorizer()\n    >>> X = vectorizer.fit_transform(corpus)\n    >>> print(vectorizer.get_feature_names())\n    ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']\n    >>> print(X.toarray())  # doctest: +NORMALIZE_WHITESPACE\n    [[0 1 1 1 0 0 1 0 1]\n     [0 2 0 1 0 1 1 0 1]\n     [1 0 0 1 1 0 1 1 1]\n     [0 1 1 1 0 0 1 0 1]]\n\n    See also\n    --------\n    HashingVectorizer, TfidfVectorizer\n\n    Notes\n    -----\n    The ``stop_words_`` attribute can get large and increase the model size\n    when pickling. This attribute is provided only for introspection and can\n    be safely removed using delattr or set to None before pickling.\n    \"\"\"\n\n    def __init__(self, input='content', encoding='utf-8',\n                 decode_error='strict', strip_accents=None,\n                 lowercase=True, preprocessor=None, tokenizer=None,\n                 stop_words=None, token_pattern=r\"(?u)\\b\\w\\w+\\b\",\n                 ngram_range=(1, 1), analyzer='word',\n                 max_df=1.0, min_df=1, max_features=None,\n                 vocabulary=None, binary=False, dtype=np.int64):\n        self.input = input\n        self.encoding = encoding\n        self.decode_error = decode_error\n        self.strip_accents = strip_accents\n        self.preprocessor = preprocessor\n        self.tokenizer = tokenizer\n        self.analyzer = analyzer\n        self.lowercase = lowercase\n        self.token_pattern = token_pattern\n        self.stop_words = stop_words\n        self.max_df = max_df\n        self.min_df = min_df\n        if max_df < 0 or min_df < 0:\n            raise ValueError(\"negative value for max_df or min_df\")\n        self.max_features = max_features\n        if max_features is not None:\n            if (not isinstance(max_features, numbers.Integral) or\n                    max_features <= 0):\n                raise ValueError(\n                    \"max_features=%r, neither a positive integer nor 
None\"\n                    % max_features)\n        self.ngram_range = ngram_range\n        self.vocabulary = vocabulary\n        self.binary = binary\n        self.dtype = dtype\n\n    def _sort_features(self, X, vocabulary):\n        \"\"\"Sort features by name\n\n        Returns a reordered matrix and modifies the vocabulary in place\n        \"\"\"\n        sorted_features = sorted(vocabulary.items())\n        map_index = np.empty(len(sorted_features), dtype=X.indices.dtype)\n        for new_val, (term, old_val) in enumerate(sorted_features):\n            vocabulary[term] = new_val\n            map_index[old_val] = new_val\n\n        X.indices = map_index.take(X.indices, mode='clip')\n        return X\n\n    def _limit_features(self, X, vocabulary, high=None, low=None,\n                        limit=None):\n        \"\"\"Remove too rare or too common features.\n\n        Prune features that are non zero in more samples than high or less\n        documents than low, modifying the vocabulary, and restricting it to\n        at most the limit most frequent.\n\n        This does not prune samples with zero features.\n        \"\"\"\n        if high is None and low is None and limit is None:\n            return X, set()\n\n        # Calculate a mask based on document frequencies\n        dfs = _document_frequency(X)\n        tfs = np.asarray(X.sum(axis=0)).ravel()\n        mask = np.ones(len(dfs), dtype=bool)\n        if high is not None:\n            mask &= dfs <= high\n        if low is not None:\n            mask &= dfs >= low\n        if limit is not None and mask.sum() > limit:\n            mask_inds = (-tfs[mask]).argsort()[:limit]\n            new_mask = np.zeros(len(dfs), dtype=bool)\n            new_mask[np.where(mask)[0][mask_inds]] = True\n            mask = new_mask\n\n        new_indices = np.cumsum(mask) - 1  # maps old indices to new\n        removed_terms = set()\n        for term, old_index in list(vocabulary.items()):\n            if 
mask[old_index]:\n                vocabulary[term] = new_indices[old_index]\n            else:\n                del vocabulary[term]\n                removed_terms.add(term)\n        kept_indices = np.where(mask)[0]\n        if len(kept_indices) == 0:\n            raise ValueError(\"After pruning, no terms remain. Try a lower\"\n                             \" min_df or a higher max_df.\")\n        return X[:, kept_indices], removed_terms\n\n    def _count_vocab(self, raw_documents, fixed_vocab):\n        \"\"\"Create sparse feature matrix, and vocabulary where fixed_vocab=False\n        \"\"\"\n        if fixed_vocab:\n            vocabulary = self.vocabulary_\n        else:\n            # Add a new value when a new vocabulary item is seen\n            vocabulary = defaultdict()\n            vocabulary.default_factory = vocabulary.__len__\n\n        analyze = self.build_analyzer()\n        j_indices = []\n        indptr = []\n\n        values = _make_int_array()\n        indptr.append(0)\n        for doc in raw_documents:\n            feature_counter = {}\n            for feature in analyze(doc):\n                try:\n                    feature_idx = vocabulary[feature]\n                    if feature_idx not in feature_counter:\n                        feature_counter[feature_idx] = 1\n                    else:\n                        feature_counter[feature_idx] += 1\n                except KeyError:\n                    # Ignore out-of-vocabulary items for fixed_vocab=True\n                    continue\n\n            j_indices.extend(feature_counter.keys())\n            values.extend(feature_counter.values())\n            indptr.append(len(j_indices))\n\n        if not fixed_vocab:\n            # disable defaultdict behaviour\n            vocabulary = dict(vocabulary)\n            if not vocabulary:\n                raise ValueError(\"empty vocabulary; perhaps the documents only\"\n                                 \" contain stop words\")\n\n        if 
indptr[-1] > 2147483648:  # = 2**31 - 1\n            if _IS_32BIT:\n                raise ValueError(('sparse CSR array has {} non-zero '\n                                  'elements and requires 64 bit indexing, '\n                                  'which is unsupported with 32 bit Python.')\n                                 .format(indptr[-1]))\n            indices_dtype = np.int64\n\n        else:\n            indices_dtype = np.int32\n        j_indices = np.asarray(j_indices, dtype=indices_dtype)\n        indptr = np.asarray(indptr, dtype=indices_dtype)\n        values = np.frombuffer(values, dtype=np.intc)\n\n        X = sp.csr_matrix((values, j_indices, indptr),\n                          shape=(len(indptr) - 1, len(vocabulary)),\n                          dtype=self.dtype)\n        X.sort_indices()\n        return vocabulary, X\n\n    def fit(self, raw_documents, y=None):\n        \"\"\"Learn a vocabulary dictionary of all tokens in the raw documents.\n\n        Parameters\n        ----------\n        raw_documents : iterable\n            An iterable which yields either str, unicode or file objects.\n\n        Returns\n        -------\n        self\n        \"\"\"\n        self.fit_transform(raw_documents)\n        return self\n\n    def fit_transform(self, raw_documents, y=None):\n        \"\"\"Learn the vocabulary dictionary and return term-document matrix.\n\n        This is equivalent to fit followed by transform, but more efficiently\n        implemented.\n\n        Parameters\n        ----------\n        raw_documents : iterable\n            An iterable which yields either str, unicode or file objects.\n\n        Returns\n        -------\n        X : array, [n_samples, n_features]\n            Document-term matrix.\n        \"\"\"\n        # We intentionally don't call the transform method to make\n        # fit_transform overridable without unwanted side effects in\n        # TfidfVectorizer.\n        if isinstance(raw_documents, str):\n            
raise ValueError(\n                \"Iterable over raw text documents expected, \"\n                \"string object received.\")\n\n        self._validate_params()\n        self._validate_vocabulary()\n        max_df = self.max_df\n        min_df = self.min_df\n        max_features = self.max_features\n\n        vocabulary, X = self._count_vocab(raw_documents,\n                                          self.fixed_vocabulary_)\n\n        if self.binary:\n            X.data.fill(1)\n\n        if not self.fixed_vocabulary_:\n            X = self._sort_features(X, vocabulary)\n\n            n_doc = X.shape[0]\n            max_doc_count = (max_df\n                             if isinstance(max_df, numbers.Integral)\n                             else max_df * n_doc)\n            min_doc_count = (min_df\n                             if isinstance(min_df, numbers.Integral)\n                             else min_df * n_doc)\n            if max_doc_count < min_doc_count:\n                raise ValueError(\n                    \"max_df corresponds to < documents than min_df\")\n            X, self.stop_words_ = self._limit_features(X, vocabulary,\n                                                       max_doc_count,\n                                                       min_doc_count,\n                                                       max_features)\n\n            self.vocabulary_ = vocabulary\n\n        return X\n\n    def transform(self, raw_documents):\n        \"\"\"Transform documents to document-term matrix.\n\n        Extract token counts out of raw text documents using the vocabulary\n        fitted with fit or the one provided to the constructor.\n\n        Parameters\n        ----------\n        raw_documents : iterable\n            An iterable which yields either str, unicode or file objects.\n\n        Returns\n        -------\n        X : sparse matrix, [n_samples, n_features]\n            Document-term matrix.\n        \"\"\"\n        if 
isinstance(raw_documents, str):\n            raise ValueError(\n                \"Iterable over raw text documents expected, \"\n                \"string object received.\")\n\n        if not hasattr(self, 'vocabulary_'):\n            self._validate_vocabulary()\n\n        self._check_vocabulary()\n\n        # use the same matrix-building strategy as fit_transform\n        _, X = self._count_vocab(raw_documents, fixed_vocab=True)\n        if self.binary:\n            X.data.fill(1)\n        return X\n\n    def inverse_transform(self, X):\n        \"\"\"Return terms per document with nonzero entries in X.\n\n        Parameters\n        ----------\n        X : {array, sparse matrix}, shape = [n_samples, n_features]\n\n        Returns\n        -------\n        X_inv : list of arrays, len = n_samples\n            List of arrays of terms.\n        \"\"\"\n        self._check_vocabulary()\n\n        if sp.issparse(X):\n            # We need CSR format for fast row manipulations.\n            X = X.tocsr()\n        else:\n            # We need to convert X to a matrix, so that the indexing\n            # returns 2D objects\n            X = np.asmatrix(X)\n        n_samples = X.shape[0]\n\n        terms = np.array(list(self.vocabulary_.keys()))\n        indices = np.array(list(self.vocabulary_.values()))\n        inverse_vocabulary = terms[np.argsort(indices)]\n\n        return [inverse_vocabulary[X[i, :].nonzero()[1]].ravel()\n                for i in range(n_samples)]\n\n    def get_feature_names(self):\n        \"\"\"Array mapping from feature integer indices to feature name\"\"\"\n        if not hasattr(self, 'vocabulary_'):\n            self._validate_vocabulary()\n\n        self._check_vocabulary()\n\n        return [t for t, i in sorted(self.vocabulary_.items(),\n                                     key=itemgetter(1))]\n\n    def _more_tags(self):\n        return {'X_types': ['string']}"
        },
        {
          "file": "sklearn/feature_extraction/text.py",
          "type": "class",
          "name": "TfidfVectorizer",
          "code": "class TfidfVectorizer(CountVectorizer):\n    \"\"\"Convert a collection of raw documents to a matrix of TF-IDF features.\n\n    Equivalent to :class:`CountVectorizer` followed by\n    :class:`TfidfTransformer`.\n\n    Read more in the :ref:`User Guide <text_feature_extraction>`.\n\n    Parameters\n    ----------\n    input : string {'filename', 'file', 'content'}\n        If 'filename', the sequence passed as an argument to fit is\n        expected to be a list of filenames that need reading to fetch\n        the raw content to analyze.\n\n        If 'file', the sequence items must have a 'read' method (file-like\n        object) that is called to fetch the bytes in memory.\n\n        Otherwise the input is expected to be the sequence strings or\n        bytes items are expected to be analyzed directly.\n\n    encoding : string, 'utf-8' by default.\n        If bytes or files are given to analyze, this encoding is used to\n        decode.\n\n    decode_error : {'strict', 'ignore', 'replace'} (default='strict')\n        Instruction on what to do if a byte sequence is given to analyze that\n        contains characters not of the given `encoding`. By default, it is\n        'strict', meaning that a UnicodeDecodeError will be raised. 
Other\n        values are 'ignore' and 'replace'.\n\n    strip_accents : {'ascii', 'unicode', None} (default=None)\n        Remove accents and perform other character normalization\n        during the preprocessing step.\n        'ascii' is a fast method that only works on characters that have\n        an direct ASCII mapping.\n        'unicode' is a slightly slower method that works on any characters.\n        None (default) does nothing.\n\n        Both 'ascii' and 'unicode' use NFKD normalization from\n        :func:`unicodedata.normalize`.\n\n    lowercase : boolean (default=True)\n        Convert all characters to lowercase before tokenizing.\n\n    preprocessor : callable or None (default=None)\n        Override the preprocessing (string transformation) stage while\n        preserving the tokenizing and n-grams generation steps.\n\n    tokenizer : callable or None (default=None)\n        Override the string tokenization step while preserving the\n        preprocessing and n-grams generation steps.\n        Only applies if ``analyzer == 'word'``.\n\n    analyzer : string, {'word', 'char', 'char_wb'} or callable\n        Whether the feature should be made of word or character n-grams.\n        Option 'char_wb' creates character n-grams only from text inside\n        word boundaries; n-grams at the edges of words are padded with space.\n\n        If a callable is passed it is used to extract the sequence of features\n        out of the raw, unprocessed input.\n\n    stop_words : string {'english'}, list, or None (default=None)\n        If a string, it is passed to _check_stop_list and the appropriate stop\n        list is returned. 
'english' is currently the only supported string\n        value.\n        There are several known issues with 'english' and you should\n        consider an alternative (see :ref:`stop_words`).\n\n        If a list, that list is assumed to contain stop words, all of which\n        will be removed from the resulting tokens.\n        Only applies if ``analyzer == 'word'``.\n\n        If None, no stop words will be used. max_df can be set to a value\n        in the range [0.7, 1.0) to automatically detect and filter stop\n        words based on intra corpus document frequency of terms.\n\n    token_pattern : string\n        Regular expression denoting what constitutes a \"token\", only used\n        if ``analyzer == 'word'``. The default regexp selects tokens of 2\n        or more alphanumeric characters (punctuation is completely ignored\n        and always treated as a token separator).\n\n    ngram_range : tuple (min_n, max_n) (default=(1, 1))\n        The lower and upper boundary of the range of n-values for different\n        n-grams to be extracted. All values of n such that min_n <= n <= max_n\n        will be used.\n\n    max_df : float in range [0.0, 1.0] or int (default=1.0)\n        When building the vocabulary ignore terms that have a document\n        frequency strictly higher than the given threshold (corpus-specific\n        stop words).\n        If float, the parameter represents a proportion of documents, integer\n        absolute counts.\n        This parameter is ignored if vocabulary is not None.\n\n    min_df : float in range [0.0, 1.0] or int (default=1)\n        When building the vocabulary ignore terms that have a document\n        frequency strictly lower than the given threshold. 
This value is also\n        called cut-off in the literature.\n        If float, the parameter represents a proportion of documents, integer\n        absolute counts.\n        This parameter is ignored if vocabulary is not None.\n\n    max_features : int or None (default=None)\n        If not None, build a vocabulary that only consider the top\n        max_features ordered by term frequency across the corpus.\n\n        This parameter is ignored if vocabulary is not None.\n\n    vocabulary : Mapping or iterable, optional (default=None)\n        Either a Mapping (e.g., a dict) where keys are terms and values are\n        indices in the feature matrix, or an iterable over terms. If not\n        given, a vocabulary is determined from the input documents.\n\n    binary : boolean (default=False)\n        If True, all non-zero term counts are set to 1. This does not mean\n        outputs will have only 0/1 values, only that the tf term in tf-idf\n        is binary. (Set idf and normalization to False to get 0/1 outputs.)\n\n    dtype : type, optional (default=float64)\n        Type of the matrix returned by fit_transform() or transform().\n\n    norm : 'l1', 'l2' or None, optional (default='l2')\n        Each output row will have unit norm, either:\n        * 'l2': Sum of squares of vector elements is 1. The cosine\n        similarity between two vectors is their dot product when l2 norm has\n        been applied.\n        * 'l1': Sum of absolute values of vector elements is 1.\n        See :func:`preprocessing.normalize`\n\n    use_idf : boolean (default=True)\n        Enable inverse-document-frequency reweighting.\n\n    smooth_idf : boolean (default=True)\n        Smooth idf weights by adding one to document frequencies, as if an\n        extra document was seen containing every term in the collection\n        exactly once. Prevents zero divisions.\n\n    sublinear_tf : boolean (default=False)\n        Apply sublinear tf scaling, i.e. 
replace tf with 1 + log(tf).\n\n    Attributes\n    ----------\n    vocabulary_ : dict\n        A mapping of terms to feature indices.\n\n    idf_ : array, shape (n_features)\n        The inverse document frequency (IDF) vector; only defined\n        if  ``use_idf`` is True.\n\n    stop_words_ : set\n        Terms that were ignored because they either:\n\n          - occurred in too many documents (`max_df`)\n          - occurred in too few documents (`min_df`)\n          - were cut off by feature selection (`max_features`).\n\n        This is only available if no vocabulary was given.\n\n    Examples\n    --------\n    >>> from sklearn.feature_extraction.text import TfidfVectorizer\n    >>> corpus = [\n    ...     'This is the first document.',\n    ...     'This document is the second document.',\n    ...     'And this is the third one.',\n    ...     'Is this the first document?',\n    ... ]\n    >>> vectorizer = TfidfVectorizer()\n    >>> X = vectorizer.fit_transform(corpus)\n    >>> print(vectorizer.get_feature_names())\n    ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']\n    >>> print(X.shape)\n    (4, 9)\n\n    See also\n    --------\n    CountVectorizer : Transforms text into a sparse matrix of n-gram counts.\n\n    TfidfTransformer : Performs the TF-IDF transformation from a provided\n        matrix of counts.\n\n    Notes\n    -----\n    The ``stop_words_`` attribute can get large and increase the model size\n    when pickling. 
This attribute is provided only for introspection and can\n    be safely removed using delattr or set to None before pickling.\n    \"\"\"\n\n    def __init__(self, input='content', encoding='utf-8',\n                 decode_error='strict', strip_accents=None, lowercase=True,\n                 preprocessor=None, tokenizer=None, analyzer='word',\n                 stop_words=None, token_pattern=r\"(?u)\\b\\w\\w+\\b\",\n                 ngram_range=(1, 1), max_df=1.0, min_df=1,\n                 max_features=None, vocabulary=None, binary=False,\n                 dtype=np.float64, norm='l2', use_idf=True, smooth_idf=True,\n                 sublinear_tf=False):\n\n        super().__init__(\n            input=input, encoding=encoding, decode_error=decode_error,\n            strip_accents=strip_accents, lowercase=lowercase,\n            preprocessor=preprocessor, tokenizer=tokenizer, analyzer=analyzer,\n            stop_words=stop_words, token_pattern=token_pattern,\n            ngram_range=ngram_range, max_df=max_df, min_df=min_df,\n            max_features=max_features, vocabulary=vocabulary, binary=binary,\n            dtype=dtype)\n\n        self._tfidf = TfidfTransformer(norm=norm, use_idf=use_idf,\n                                       smooth_idf=smooth_idf,\n                                       sublinear_tf=sublinear_tf)\n\n    # Broadcast the TF-IDF parameters to the underlying transformer instance\n    # for easy grid search and repr\n\n    @property\n    def norm(self):\n        return self._tfidf.norm\n\n    @norm.setter\n    def norm(self, value):\n        self._tfidf.norm = value\n\n    @property\n    def use_idf(self):\n        return self._tfidf.use_idf\n\n    @use_idf.setter\n    def use_idf(self, value):\n        self._tfidf.use_idf = value\n\n    @property\n    def smooth_idf(self):\n        return self._tfidf.smooth_idf\n\n    @smooth_idf.setter\n    def smooth_idf(self, value):\n        self._tfidf.smooth_idf = value\n\n    @property\n    def 
sublinear_tf(self):\n        return self._tfidf.sublinear_tf\n\n    @sublinear_tf.setter\n    def sublinear_tf(self, value):\n        self._tfidf.sublinear_tf = value\n\n    @property\n    def idf_(self):\n        return self._tfidf.idf_\n\n    @idf_.setter\n    def idf_(self, value):\n        self._validate_vocabulary()\n        if hasattr(self, 'vocabulary_'):\n            if len(self.vocabulary_) != len(value):\n                raise ValueError(\"idf length = %d must be equal \"\n                                 \"to vocabulary size = %d\" %\n                                 (len(value), len(self.vocabulary)))\n        self._tfidf.idf_ = value\n\n    def _check_params(self):\n        if self.dtype not in FLOAT_DTYPES:\n            warnings.warn(\"Only {} 'dtype' should be used. {} 'dtype' will \"\n                          \"be converted to np.float64.\"\n                          .format(FLOAT_DTYPES, self.dtype),\n                          UserWarning)\n\n    def fit(self, raw_documents, y=None):\n        \"\"\"Learn vocabulary and idf from training set.\n\n        Parameters\n        ----------\n        raw_documents : iterable\n            an iterable which yields either str, unicode or file objects\n\n        Returns\n        -------\n        self : TfidfVectorizer\n        \"\"\"\n        self._check_params()\n        X = super().fit_transform(raw_documents)\n        self._tfidf.fit(X)\n        return self\n\n    def fit_transform(self, raw_documents, y=None):\n        \"\"\"Learn vocabulary and idf, return term-document matrix.\n\n        This is equivalent to fit followed by transform, but more efficiently\n        implemented.\n\n        Parameters\n        ----------\n        raw_documents : iterable\n            an iterable which yields either str, unicode or file objects\n\n        Returns\n        -------\n        X : sparse matrix, [n_samples, n_features]\n            Tf-idf-weighted document-term matrix.\n        \"\"\"\n        
self._check_params()\n        X = super().fit_transform(raw_documents)\n        self._tfidf.fit(X)\n        # X is already a transformed view of raw_documents so\n        # we set copy to False\n        return self._tfidf.transform(X, copy=False)\n\n    def transform(self, raw_documents, copy=True):\n        \"\"\"Transform documents to document-term matrix.\n\n        Uses the vocabulary and document frequencies (df) learned by fit (or\n        fit_transform).\n\n        Parameters\n        ----------\n        raw_documents : iterable\n            an iterable which yields either str, unicode or file objects\n\n        copy : boolean, default True\n            Whether to copy X and operate on the copy or perform in-place\n            operations.\n\n        Returns\n        -------\n        X : sparse matrix, [n_samples, n_features]\n            Tf-idf-weighted document-term matrix.\n        \"\"\"\n        check_is_fitted(self, '_tfidf', 'The tfidf vector is not fitted')\n\n        X = super().transform(raw_documents)\n        return self._tfidf.transform(X, copy=False)\n\n    def _more_tags(self):\n        return {'X_types': ['string'], '_skip_test': True}"
        }
      ]
    },
    {
      "pr_number": 13323,
      "pr_title": "Hotfix Skip non deterministic tests on PowerPC",
      "pr_body": "An attempt to work around https://github.com/scikit-learn/scikit-learn/issues/13051 by extending the tests currently considered non determinisitic on 32 bit arch to PowerPC as well.\r\n\r\nCloses #13051\r\n\r\nThe `platform.machine` settings can be seen e.g. [here](https://github.com/python/cpython/blob/f14c28f39766855420dd58d209da4ad847f3030e/Lib/test/test_sysconfig.py#L402).\r\n\r\nIdeally, it would be good to understand why these are non-deterministic (on 32 bit arch as well) but meanwhile since we are not officially supporting PowerPC this aims to avoid new bug reports from debian maintainers while packaging scikit-learn, for tests that we deem acceptable to skip on 32 bit arch.",
      "issue_id": 13051,
      "issue_title": "[0.20.2] test_non_meta_estimators fails on Powerpc 64 bit little endian",
      "issue_body": "Splitting #13036:\r\nTest failure for `test_non_meta_estimators[KernelPCA-KernelPCA-check_pipeline_consistency]`:\r\n```\r\nname = 'KernelPCA'\r\nEstimator = <class 'sklearn.decomposition.kernel_pca.KernelPCA'>\r\ncheck = <function check_pipeline_consistency at 0x3fff8b056b18>\r\n\r\n    @pytest.mark.parametrize(\r\n            \"name, Estimator, check\",\r\n            _generate_checks_per_estimator(_yield_all_checks,\r\n                                           _tested_non_meta_estimators()),\r\n            ids=_rename_partial\r\n    )\r\n    def test_non_meta_estimators(name, Estimator, check):\r\n        # Common tests for non-meta estimators\r\n        with ignore_warnings(category=(DeprecationWarning, ConvergenceWarning,\r\n                                       UserWarning, FutureWarning)):\r\n            estimator = Estimator()\r\n            set_checking_parameters(estimator)\r\n>           check(name, estimator)\r\n\r\nsklearn/tests/test_common.py:101: \r\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \r\nsklearn/utils/testing.py:350: in wrapper\r\n    return fn(*args, **kwargs)\r\nsklearn/utils/estimator_checks.py:1048: in check_pipeline_consistency\r\n    assert_allclose_dense_sparse(result, result_pipe)\r\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \r\n\r\nx = array([[ 9.17543358e-01,  4.04410079e-02,  2.21585378e-02,\r\n        -2.69563828...,  2.47000231e-11,\r\n         2.81623715e-09,  1.36863809e-09,  1.28803077e-09]])\r\ny = array([[ 9.17543358e-01,  4.04410079e-02,  2.21585378e-02,\r\n        -2.69560026..., -6.90266524e-10,\r\n        -8.33997668e-11,  1.37018888e-09,  2.03087265e-09]])\r\nrtol = 1e-07, atol = 1e-09, err_msg = ''\r\n\r\n    def assert_allclose_dense_sparse(x, y, rtol=1e-07, atol=1e-9, err_msg=''):\r\n        \"\"\"[...]\"\"\"\r\n        if sp.sparse.issparse(x) and sp.sparse.issparse(y):\r\n            x = x.tocsr()\r\n            y = 
y.tocsr()\r\n            x.sum_duplicates()\r\n            y.sum_duplicates()\r\n            assert_array_equal(x.indices, y.indices, err_msg=err_msg)\r\n            assert_array_equal(x.indptr, y.indptr, err_msg=err_msg)\r\n            assert_allclose(x.data, y.data, rtol=rtol, atol=atol, err_msg=err_msg)\r\n        elif not sp.sparse.issparse(x) and not sp.sparse.issparse(y):\r\n            # both dense\r\n>           assert_allclose(x, y, rtol=rtol, atol=atol, err_msg=err_msg)\r\nE           AssertionError: \r\nE           Not equal to tolerance rtol=1e-07, atol=1e-09\r\nE           \r\nE           (mismatch 13.7777777778%)\r\nE            x: array([[ 9.175434e-01,  4.044101e-02,  2.215854e-02, -2.695638e-08,\r\nE                    3.128852e-08,  0.000000e+00,  0.000000e+00,  0.000000e+00,\r\nE                    0.000000e+00,  0.000000e+00,  0.000000e+00,  0.000000e+00,...\r\nE            y: array([[ 9.175434e-01,  4.044101e-02,  2.215854e-02, -2.695600e-08,\r\nE                    3.128926e-08,  0.000000e+00,  0.000000e+00,  0.000000e+00,\r\nE                    0.000000e+00,  0.000000e+00,  0.000000e+00,  0.000000e+00,...\r\n\r\nsklearn/utils/testing.py:464: AssertionError\r\n```\r\nSimilarly for  `test_non_meta_estimators[LocallyLinearEmbedding-LocallyLinearEmbedding-check_pipeline_consistency]`:\r\n```\r\n>           assert_allclose(x, y, rtol=rtol, atol=atol, err_msg=err_msg)\r\nE           AssertionError: \r\nE           Not equal to tolerance rtol=1e-07, atol=1e-09\r\nE           \r\nE           (mismatch 50.0%)\r\nE            x: array([[-2.011707e-01,  2.825862e-01],\r\nE                  [-1.618549e-01,  1.102424e-12],\r\nE                  [-1.618549e-01,  3.172043e-11],...\r\nE            y: array([[-1.753929e-01,  2.825862e-01],\r\nE                  [-1.894835e-01, -1.533793e-11],\r\nE                  [-1.894835e-01,  3.785336e-11],...\r\n```\r\nand for 
`test_non_meta_estimators[LocallyLinearEmbedding-LocallyLinearEmbedding-check_transformer_data_not_an_array]`:\r\n```\r\n>           assert_allclose(x, y, rtol=rtol, atol=atol, err_msg=err_msg)\r\nE           AssertionError: \r\nE           Not equal to tolerance rtol=1e-07, atol=0.01\r\nE           fit_transform and transform outcomes not consistent in LocallyLinearEmbedding(eigen_solver='auto', hessian_tol=0.0001, max_iter=5,\r\nE                       method='standard', modified_tol=1e-12, n_components=2,\r\nE                       n_jobs=None, n_neighbors=5, neighbors_algorithm='auto',\r\nE                       random_state=0, reg=0.001, tol=1e-06)\r\nE           (mismatch 50.0%)\r\nE            x: array([[-8.953081e-02,  2.983824e-01],\r\nE                  [-2.421795e-01,  4.555523e-13],\r\nE                  [-2.421795e-01,  3.751804e-12],...\r\nE            y: array([[-1.897519e-01,  2.981428e-01],\r\nE                  [-1.751025e-01,  6.989894e-12],\r\nE                  [-1.751025e-01,  9.856396e-12],...\r\n```\r\nand for `test_non_meta_estimators[LocallyLinearEmbedding-LocallyLinearEmbedding-check_transformer_general]`:\r\n```\r\n>           assert_allclose(x, y, rtol=rtol, atol=atol, err_msg=err_msg)\r\nE           AssertionError: \r\nE           Not equal to tolerance rtol=1e-07, atol=0.01\r\nE           fit_transform and transform outcomes not consistent in LocallyLinearEmbedding(eigen_solver='auto', hessian_tol=0.0001, max_iter=5,\r\nE                       method='standard', modified_tol=1e-12, n_components=2,\r\nE                       n_jobs=None, n_neighbors=5, neighbors_algorithm='auto',\r\nE                       random_state=0, reg=0.001, tol=1e-06)\r\nE           (mismatch 50.0%)\r\nE            x: array([[ 4.698100e-02, -2.983824e-01],\r\nE                  [ 2.538887e-01, -1.711825e-12],\r\nE                  [ 2.538887e-01, -8.653966e-12],...\r\nE            y: array([[-2.546916e-01, -2.981428e-01],\r\nE                  [ 4.241318e-02, 
-6.757535e-12],\r\nE                  [ 4.241318e-02, -2.107003e-11],...\r\n```",
      "issue_closed_at": "2019-02-28T10:16:53Z",
      "base_commit": "9d211978741a7e23f8b7c0bf1315d7ac7a259861",
      "changes": [
        {
          "file": "sklearn/base.py",
          "type": "line",
          "name": "line 6",
          "code": "import copy\nimport warnings\nfrom collections import defaultdict\nimport struct\nimport inspect\n\nimport numpy as np"
        },
        {
          "file": "sklearn/base.py",
          "type": "function",
          "name": "_more_tags",
          "class_name": "_UnstableOn32BitMixin",
          "code": "def _more_tags(self):\n        return {'non_deterministic': _IS_32BIT}"
        },
        {
          "file": "sklearn/cross_decomposition/cca_.py",
          "type": "line",
          "name": "line 1",
          "code": "from .pls_ import _PLS\nfrom ..base import _UnstableOn32BitMixin\n\n__all__ = ['CCA']\n\n\nclass CCA(_PLS, _UnstableOn32BitMixin):\n    \"\"\"CCA Canonical Correlation Analysis.\n\n    CCA inherits from PLS with mode=\"B\" and deflation_mode=\"canonical\"."
        },
        {
          "file": "sklearn/manifold/locally_linear.py",
          "type": "line",
          "name": "line 9",
          "code": "from scipy.sparse import eye, csr_matrix\nfrom scipy.sparse.linalg import eigsh\n\nfrom ..base import BaseEstimator, TransformerMixin, _UnstableOn32BitMixin\nfrom ..utils import check_random_state, check_array\nfrom ..utils.extmath import stable_cumsum\nfrom ..utils.validation import check_is_fitted"
        },
        {
          "file": "sklearn/manifold/locally_linear.py",
          "type": "function",
          "name": "locally_linear_embedding",
          "class_name": null,
          "code": "def locally_linear_embedding(\n        X, n_neighbors, n_components, reg=1e-3, eigen_solver='auto', tol=1e-6,\n        max_iter=100, method='standard', hessian_tol=1E-4, modified_tol=1E-12,\n        random_state=None, n_jobs=None):\n    \"\"\"Perform a Locally Linear Embedding analysis on the data.\n\n    Read more in the :ref:`User Guide <locally_linear_embedding>`.\n\n    Parameters\n    ----------\n    X : {array-like, NearestNeighbors}\n        Sample data, shape = (n_samples, n_features), in the form of a\n        numpy array or a NearestNeighbors object.\n\n    n_neighbors : integer\n        number of neighbors to consider for each point.\n\n    n_components : integer\n        number of coordinates for the manifold.\n\n    reg : float\n        regularization constant, multiplies the trace of the local covariance\n        matrix of the distances.\n\n    eigen_solver : string, {'auto', 'arpack', 'dense'}\n        auto : algorithm will attempt to choose the best method for input data\n\n        arpack : use arnoldi iteration in shift-invert mode.\n                    For this method, M may be a dense matrix, sparse matrix,\n                    or general linear operator.\n                    Warning: ARPACK can be unstable for some problems.  It is\n                    best to try several random seeds in order to check results.\n\n        dense  : use standard dense matrix operations for the eigenvalue\n                    decomposition.  For this method, M must be an array\n                    or matrix type.  
This method should be avoided for\n                    large problems.\n\n    tol : float, optional\n        Tolerance for 'arpack' method\n        Not used if eigen_solver=='dense'.\n\n    max_iter : integer\n        maximum number of iterations for the arpack solver.\n\n    method : {'standard', 'hessian', 'modified', 'ltsa'}\n        standard : use the standard locally linear embedding algorithm.\n                   see reference [1]_\n        hessian  : use the Hessian eigenmap method.  This method requires\n                   n_neighbors > n_components * (1 + (n_components + 1) / 2.\n                   see reference [2]_\n        modified : use the modified locally linear embedding algorithm.\n                   see reference [3]_\n        ltsa     : use local tangent space alignment algorithm\n                   see reference [4]_\n\n    hessian_tol : float, optional\n        Tolerance for Hessian eigenmapping method.\n        Only used if method == 'hessian'\n\n    modified_tol : float, optional\n        Tolerance for modified LLE method.\n        Only used if method == 'modified'\n\n    random_state : int, RandomState instance or None, optional (default=None)\n        If int, random_state is the seed used by the random number generator;\n        If RandomState instance, random_state is the random number generator;\n        If None, the random number generator is the RandomState instance used\n        by `np.random`. Used when ``solver`` == 'arpack'.\n\n    n_jobs : int or None, optional (default=None)\n        The number of parallel jobs to run for neighbors search.\n        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.\n        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`\n        for more details.\n\n    Returns\n    -------\n    Y : array-like, shape [n_samples, n_components]\n        Embedding vectors.\n\n    squared_error : float\n        Reconstruction error for the embedding vectors. 
Equivalent to\n        ``norm(Y - W Y, 'fro')**2``, where W are the reconstruction weights.\n\n    References\n    ----------\n\n    .. [1] Roweis, S. & Saul, L. Nonlinear dimensionality reduction\n        by locally linear embedding.  Science 290:2323 (2000).\n    .. [2] Donoho, D. & Grimes, C. Hessian eigenmaps: Locally\n        linear embedding techniques for high-dimensional data.\n        Proc Natl Acad Sci U S A.  100:5591 (2003).\n    .. [3] Zhang, Z. & Wang, J. MLLE: Modified Locally Linear\n        Embedding Using Multiple Weights.\n        http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.70.382\n    .. [4] Zhang, Z. & Zha, H. Principal manifolds and nonlinear\n        dimensionality reduction via tangent space alignment.\n        Journal of Shanghai Univ.  8:406 (2004)\n    \"\"\"\n    if eigen_solver not in ('auto', 'arpack', 'dense'):\n        raise ValueError(\"unrecognized eigen_solver '%s'\" % eigen_solver)\n\n    if method not in ('standard', 'hessian', 'modified', 'ltsa'):\n        raise ValueError(\"unrecognized method '%s'\" % method)\n\n    nbrs = NearestNeighbors(n_neighbors=n_neighbors + 1, n_jobs=n_jobs)\n    nbrs.fit(X)\n    X = nbrs._fit_X\n\n    N, d_in = X.shape\n\n    if n_components > d_in:\n        raise ValueError(\"output dimension must be less than or equal \"\n                         \"to input dimension\")\n    if n_neighbors >= N:\n        raise ValueError(\n            \"Expected n_neighbors <= n_samples, \"\n            \" but n_samples = %d, n_neighbors = %d\" %\n            (N, n_neighbors)\n        )\n\n    if n_neighbors <= 0:\n        raise ValueError(\"n_neighbors must be positive\")\n\n    M_sparse = (eigen_solver != 'dense')\n\n    if method == 'standard':\n        W = barycenter_kneighbors_graph(\n            nbrs, n_neighbors=n_neighbors, reg=reg, n_jobs=n_jobs)\n\n        # we'll compute M = (I-W)'(I-W)\n        # depending on the solver, we'll do this differently\n        if M_sparse:\n            M = 
eye(*W.shape, format=W.format) - W\n            M = (M.T * M).tocsr()\n        else:\n            M = (W.T * W - W.T - W).toarray()\n            M.flat[::M.shape[0] + 1] += 1  # W = W - I = W - I\n\n    elif method == 'hessian':\n        dp = n_components * (n_components + 1) // 2\n\n        if n_neighbors <= n_components + dp:\n            raise ValueError(\"for method='hessian', n_neighbors must be \"\n                             \"greater than \"\n                             \"[n_components * (n_components + 3) / 2]\")\n\n        neighbors = nbrs.kneighbors(X, n_neighbors=n_neighbors + 1,\n                                    return_distance=False)\n        neighbors = neighbors[:, 1:]\n\n        Yi = np.empty((n_neighbors, 1 + n_components + dp), dtype=np.float64)\n        Yi[:, 0] = 1\n\n        M = np.zeros((N, N), dtype=np.float64)\n\n        use_svd = (n_neighbors > d_in)\n\n        for i in range(N):\n            Gi = X[neighbors[i]]\n            Gi -= Gi.mean(0)\n\n            # build Hessian estimator\n            if use_svd:\n                U = svd(Gi, full_matrices=0)[0]\n            else:\n                Ci = np.dot(Gi, Gi.T)\n                U = eigh(Ci)[1][:, ::-1]\n\n            Yi[:, 1:1 + n_components] = U[:, :n_components]\n\n            j = 1 + n_components\n            for k in range(n_components):\n                Yi[:, j:j + n_components - k] = (U[:, k:k + 1] *\n                                                 U[:, k:n_components])\n                j += n_components - k\n\n            Q, R = qr(Yi)\n\n            w = Q[:, n_components + 1:]\n            S = w.sum(0)\n\n            S[np.where(abs(S) < hessian_tol)] = 1\n            w /= S\n\n            nbrs_x, nbrs_y = np.meshgrid(neighbors[i], neighbors[i])\n            M[nbrs_x, nbrs_y] += np.dot(w, w.T)\n\n        if M_sparse:\n            M = csr_matrix(M)\n\n    elif method == 'modified':\n        if n_neighbors < n_components:\n            raise ValueError(\"modified LLE requires 
\"\n                             \"n_neighbors >= n_components\")\n\n        neighbors = nbrs.kneighbors(X, n_neighbors=n_neighbors + 1,\n                                    return_distance=False)\n        neighbors = neighbors[:, 1:]\n\n        # find the eigenvectors and eigenvalues of each local covariance\n        # matrix. We want V[i] to be a [n_neighbors x n_neighbors] matrix,\n        # where the columns are eigenvectors\n        V = np.zeros((N, n_neighbors, n_neighbors))\n        nev = min(d_in, n_neighbors)\n        evals = np.zeros([N, nev])\n\n        # choose the most efficient way to find the eigenvectors\n        use_svd = (n_neighbors > d_in)\n\n        if use_svd:\n            for i in range(N):\n                X_nbrs = X[neighbors[i]] - X[i]\n                V[i], evals[i], _ = svd(X_nbrs,\n                                        full_matrices=True)\n            evals **= 2\n        else:\n            for i in range(N):\n                X_nbrs = X[neighbors[i]] - X[i]\n                C_nbrs = np.dot(X_nbrs, X_nbrs.T)\n                evi, vi = eigh(C_nbrs)\n                evals[i] = evi[::-1]\n                V[i] = vi[:, ::-1]\n\n        # find regularized weights: this is like normal LLE.\n        # because we've already computed the SVD of each covariance matrix,\n        # it's faster to use this rather than np.linalg.solve\n        reg = 1E-3 * evals.sum(1)\n\n        tmp = np.dot(V.transpose(0, 2, 1), np.ones(n_neighbors))\n        tmp[:, :nev] /= evals + reg[:, None]\n        tmp[:, nev:] /= reg[:, None]\n\n        w_reg = np.zeros((N, n_neighbors))\n        for i in range(N):\n            w_reg[i] = np.dot(V[i], tmp[i])\n        w_reg /= w_reg.sum(1)[:, None]\n\n        # calculate eta: the median of the ratio of small to large eigenvalues\n        # across the points.  
This is used to determine s_i, below\n        rho = evals[:, n_components:].sum(1) / evals[:, :n_components].sum(1)\n        eta = np.median(rho)\n\n        # find s_i, the size of the \"almost null space\" for each point:\n        # this is the size of the largest set of eigenvalues\n        # such that Sum[v; v in set]/Sum[v; v not in set] < eta\n        s_range = np.zeros(N, dtype=int)\n        evals_cumsum = stable_cumsum(evals, 1)\n        eta_range = evals_cumsum[:, -1:] / evals_cumsum[:, :-1] - 1\n        for i in range(N):\n            s_range[i] = np.searchsorted(eta_range[i, ::-1], eta)\n        s_range += n_neighbors - nev  # number of zero eigenvalues\n\n        # Now calculate M.\n        # This is the [N x N] matrix whose null space is the desired embedding\n        M = np.zeros((N, N), dtype=np.float64)\n        for i in range(N):\n            s_i = s_range[i]\n\n            # select bottom s_i eigenvectors and calculate alpha\n            Vi = V[i, :, n_neighbors - s_i:]\n            alpha_i = np.linalg.norm(Vi.sum(0)) / np.sqrt(s_i)\n\n            # compute Householder matrix which satisfies\n            #  Hi*Vi.T*ones(n_neighbors) = alpha_i*ones(s)\n            # using prescription from paper\n            h = np.full(s_i, alpha_i) - np.dot(Vi.T, np.ones(n_neighbors))\n\n            norm_h = np.linalg.norm(h)\n            if norm_h < modified_tol:\n                h *= 0\n            else:\n                h /= norm_h\n\n            # Householder matrix is\n            #  >> Hi = np.identity(s_i) - 2*np.outer(h,h)\n            # Then the weight matrix is\n            #  >> Wi = np.dot(Vi,Hi) + (1-alpha_i) * w_reg[i,:,None]\n            # We do this much more efficiently:\n            Wi = (Vi - 2 * np.outer(np.dot(Vi, h), h) +\n                  (1 - alpha_i) * w_reg[i, :, None])\n\n            # Update M as follows:\n            # >> W_hat = np.zeros( (N,s_i) )\n            # >> W_hat[neighbors[i],:] = Wi\n            # >> W_hat[i] -= 1\n         
   # >> M += np.dot(W_hat,W_hat.T)\n            # We can do this much more efficiently:\n            nbrs_x, nbrs_y = np.meshgrid(neighbors[i], neighbors[i])\n            M[nbrs_x, nbrs_y] += np.dot(Wi, Wi.T)\n            Wi_sum1 = Wi.sum(1)\n            M[i, neighbors[i]] -= Wi_sum1\n            M[neighbors[i], i] -= Wi_sum1\n            M[i, i] += s_i\n\n        if M_sparse:\n            M = csr_matrix(M)\n\n    elif method == 'ltsa':\n        neighbors = nbrs.kneighbors(X, n_neighbors=n_neighbors + 1,\n                                    return_distance=False)\n        neighbors = neighbors[:, 1:]\n\n        M = np.zeros((N, N))\n\n        use_svd = (n_neighbors > d_in)\n\n        for i in range(N):\n            Xi = X[neighbors[i]]\n            Xi -= Xi.mean(0)\n\n            # compute n_components largest eigenvalues of Xi * Xi^T\n            if use_svd:\n                v = svd(Xi, full_matrices=True)[0]\n            else:\n                Ci = np.dot(Xi, Xi.T)\n                v = eigh(Ci)[1][:, ::-1]\n\n            Gi = np.zeros((n_neighbors, n_components + 1))\n            Gi[:, 1:] = v[:, :n_components]\n            Gi[:, 0] = 1. / np.sqrt(n_neighbors)\n\n            GiGiT = np.dot(Gi, Gi.T)\n\n            nbrs_x, nbrs_y = np.meshgrid(neighbors[i], neighbors[i])\n            M[nbrs_x, nbrs_y] -= GiGiT\n            M[neighbors[i], neighbors[i]] += 1\n\n    return null_space(M, n_components, k_skip=1, eigen_solver=eigen_solver,\n                      tol=tol, max_iter=max_iter, random_state=random_state)"
        }
      ]
    },
    {
      "pr_number": 11796,
      "pr_title": "[MRG+2] Fix LDA predict_proba() ",
      "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist\r\n-->\r\n\r\n#### Reference Issues/PRs\r\nFixes #6848\r\ncloses #11727\r\ncloses #5149\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\n\r\n\r\n#### What does this implement/fix? Explain your changes.\r\nFixes the `predict_proba()` method of LinearDiscriminantAnalysis.\r\nAn `if` statement is used to differentiate between the binary and multi-class case, due to the different output format of the `decision_function` method implemented in the `LinearClassifierMixin` class.\r\n\r\n#### Any other comments?\r\nCopying from #6848:\r\nDo we perhaps want to include additional tests checking the output of predict_proba for LDA and QDA both for the binary and multi-class cases?\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
      "issue_id": 6848,
      "issue_title": "LinearDiscriminantAnalysis predict probability bug",
      "issue_body": "I am pretty confident there is a bug introduced in commit\n7c1101d7c26ba0b77184cce9c0b9be79adb526de\n\nConcretely, line 518 of the current version \nhttps://github.com/scikit-learn/scikit-learn/blob/master/sklearn/discriminant_analysis.py\nshould be removed as it yields wrong results. \n\nThere is no reason why constant 1 should be added to the computed probability after exponentiation and before inversion. \n\nTo verify this, I have run a one-to-one comparison between the outcome of the method and MATLAB's builtin LDA classifier on the Iris dataset. Only after removal of line 518, results match (up to a tolerance).\n\nIf everyone agrees on that, I am happy to submit a PR.\n",
      "issue_closed_at": "2019-03-07T16:44:18Z",
      "base_commit": "b73a51bcda362d94d8907915a382a8eb403554c8",
      "changes": [
        {
          "file": "sklearn/discriminant_analysis.py",
          "type": "line",
          "name": "line 22",
          "code": "from .utils import check_array, check_X_y\nfrom .utils.validation import check_is_fitted\nfrom .utils.multiclass import check_classification_targets\nfrom .preprocessing import StandardScaler\n\n"
        },
        {
          "file": "sklearn/discriminant_analysis.py",
          "type": "function",
          "name": "predict_proba",
          "class_name": "QuadraticDiscriminantAnalysis",
          "code": "def predict_proba(self, X):\n        \"\"\"Return posterior probabilities of classification.\n\n        Parameters\n        ----------\n        X : array-like, shape = [n_samples, n_features]\n            Array of samples/test vectors.\n\n        Returns\n        -------\n        C : array, shape = [n_samples, n_classes]\n            Posterior probabilities of classification per class.\n        \"\"\"\n        values = self._decision_function(X)\n        # compute the likelihood of the underlying gaussian models\n        # up to a multiplicative constant.\n        likelihood = np.exp(values - values.max(axis=1)[:, np.newaxis])\n        # compute posterior probabilities\n        return likelihood / likelihood.sum(axis=1)[:, np.newaxis]"
        }
      ]
    },
    {
      "pr_number": 9569,
      "pr_title": "[MRG+2] remove modification of warning registry for no reason",
      "pr_body": "Fixes #9560. Fixes #2755. Fixes #7346.",
      "issue_id": 7346,
      "issue_title": "Pop from empty list coming from get_params()",
      "issue_body": "<!--\nIf your issue is a usage question, submit it here instead:\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\n-->\n\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\n#### Description\n\n I am getting a pop from empty list error from the warnings.filers.pop(0) call in get_params(). I am using Dask to parallelize the computation of fitting a bunch of MeanShift objects. I only get this error on one machine (a remote linux machine), but it works fine on my home compute (running ubuntu 14) \n#### Steps/Code to Reproduce\n\n<!--\n\n-->\n#### Expected Results\n\nShould just fit the MeanShifts and move on\n#### Actual Results\n\nTraceback (most recent call last):\n  File \"tda_profile.py\", line 34, in <module>\n    _tda.fit(train_features, train_targets)\n  File \"/home/ben/tda/tda_parallel_test.py\", line 652, in fit\n    fits = fits.compute()\n  File \"/home/ben/anaconda3/lib/python3.5/site-packages/dask/base.py\", line 86, in compute\n    return compute(self, *_kwargs)[0]\n  File \"/home/ben/anaconda3/lib/python3.5/site-packages/dask/base.py\", line 179, in compute\n    results = get(dsk, keys, *_kwargs)\n  File \"/home/ben/anaconda3/lib/python3.5/site-packages/dask/threaded.py\", line 57, in get\n    **kwargs)\n  File \"/home/ben/anaconda3/lib/python3.5/site-packages/dask/async.py\", line 484, in get_async\n    raise(remote_exception(res, tb))\ndask.async.IndexError: pop from empty list\n## Traceback\n\n  File \"/home/ben/anaconda3/lib/python3.5/site-packages/dask/async.py\", line 267, in execute_task\n    result = _execute_task(task, data)\n  File \"/home/ben/anaconda3/lib/python3.5/site-packages/dask/async.py\", line 249, in _execute_task\n    
return func(*args2)\n  File \"/home/ben/anaconda3/lib/python3.5/site-packages/sklearn/cluster/mean_shift_.py\", line 391, in fit\n    cluster_all=self.cluster_all, n_jobs=self.n_jobs)\n  File \"/home/ben/anaconda3/lib/python3.5/site-packages/sklearn/cluster/mean_shift_.py\", line 191, in mean_shift\n    (seed, X, nbrs, max_iter) for seed in seeds)\n  File \"/home/ben/anaconda3/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py\", line 800, in __call__\n    while self.dispatch_one_batch(iterator):\n  File \"/home/ben/anaconda3/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py\", line 658, in dispatch_one_batch\n    self._dispatch(tasks)\n  File \"/home/ben/anaconda3/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py\", line 566, in _dispatch\n    job = ImmediateComputeBatch(batch)\n  File \"/home/ben/anaconda3/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py\", line 180, in __init__\n    self.results = batch()\n  File \"/home/ben/anaconda3/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py\", line 72, in __call__\n    return [func(*args, **kwargs) for func, args, kwargs in self.items]\n  File \"/home/ben/anaconda3/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py\", line 72, in <listcomp>\n    return [func(*args, **kwargs) for func, args, kwargs in self.items]\n  File \"/home/ben/anaconda3/lib/python3.5/site-packages/sklearn/cluster/mean_shift_.py\", line 75, in _mean_shift_single_seed\n    bandwidth = nbrs.get_params()['radius']\n  File \"/home/ben/anaconda3/lib/python3.5/site-packages/sklearn/base.py\", line 227, in get_params\n    warnings.filters.pop(0)\n#### Versions\n\n> > > import platform; print(platform.platform())\n> > > Linux-3.10.0-327.el7.x86_64-x86_64-with-centos-7.2.1511-Core\n> > > import sys; print(\"Python\", sys.version)\n> > > Python 3.5.2 |Anaconda 4.1.1 (64-bit)| (default, Jul  2 2016, 17:53:06) \n> > > [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]\n> > > import 
numpy; print(\"NumPy\", numpy.__version__)\n> > > NumPy 1.11.1\n> > > import scipy; print(\"SciPy\", scipy.__version__)\n> > > SciPy 0.17.1\n> > > import sklearn; print(\"Scikit-Learn\", sklearn.__version__)\n> > > Scikit-Learn 0.17.1\n\n<!-- Thanks for contributing! -->\n",
      "issue_closed_at": "2017-09-08T15:29:37Z",
      "base_commit": "e1fb03c86d2a2c47ef008ead958e1bc10fb06e77",
      "changes": [
        {
          "file": "sklearn/base.py",
          "type": "function",
          "name": "get_params",
          "class_name": "BaseEstimator",
          "code": "def get_params(self, deep=True):\n        \"\"\"Get parameters for this estimator.\n\n        Parameters\n        ----------\n        deep : boolean, optional\n            If True, will return the parameters for this estimator and\n            contained subobjects that are estimators.\n\n        Returns\n        -------\n        params : mapping of string to any\n            Parameter names mapped to their values.\n        \"\"\"\n        out = dict()\n        for key in self._get_param_names():\n            # We need deprecation warnings to always be on in order to\n            # catch deprecated param values.\n            # This is set in utils/__init__.py but it gets overwritten\n            # when running under python3 somehow.\n            warnings.simplefilter(\"always\", DeprecationWarning)\n            try:\n                with warnings.catch_warnings(record=True) as w:\n                    value = getattr(self, key, None)\n                if len(w) and w[0].category == DeprecationWarning:\n                    # if the parameter is deprecated, don't show it\n                    continue\n            finally:\n                warnings.filters.pop(0)\n\n            # XXX: should we rather test if instance of estimator?\n            if deep and hasattr(value, 'get_params'):\n                deep_items = value.get_params().items()\n                out.update((key + '__' + k, val) for k, val in deep_items)\n            out[key] = value\n        return out"
        },
        {
          "file": "sklearn/base.py",
          "type": "function",
          "name": "__setstate__",
          "class_name": "BaseEstimator",
          "code": "def __setstate__(self, state):\n        if type(self).__module__.startswith('sklearn.'):\n            pickle_version = state.pop(\"_sklearn_version\", \"pre-0.18\")\n            if pickle_version != __version__:\n                warnings.warn(\n                    \"Trying to unpickle estimator {0} from version {1} when \"\n                    \"using version {2}. This might lead to breaking code or \"\n                    \"invalid results. Use at your own risk.\".format(\n                        self.__class__.__name__, pickle_version, __version__),\n                    UserWarning)\n        try:\n            super(BaseEstimator, self).__setstate__(state)\n        except AttributeError:\n            self.__dict__.update(state)"
        }
      ]
    }
  ]
}