{
  "Selected_candidate": {
    "pr_number": 13641,
    "pr_title": "[MRG+1] API make sure vectorizers read data from file before analyzing",
    "pr_body": "Fixes #5482\r\n\r\nIf the given analyzer is a calable, it seems reasonable to assume if `input='file'` or `input='filename'`, the data should be read from the file first, and then passed to the analyzer, the same way as it's done for non-callable analyzers.\r\n\r\nThis PR clarifies this in the docstrings, and passes the \"decoded\" input to the analyzer. It should be less of a concern regarding the input on the bytes vs str since we don't support python2 anymore.\r\n\r\nI'm not entirely sure if this is what we wanna do, it's more of a proposal to move it forward.\r\n\r\nAfter this PR, the following would result in a `FileNotFoundError` exception:\r\n\r\n```python\r\ncv = CountVectorizer(analyzer=lambda x: x.split(), input='filename')\r\ncv.fit(['hello world']).vocabulary_\r\n```",
    "issue_id": 5482,
    "issue_title": "CountVectorizer with custom analyzer ignores input argument",
    "issue_body": "Example:\n\n``` py\ncv = CountVectorizer(analyzer=lambda x: x.split(), input='filename')\ncv.fit(['hello world']).vocabulary_\n```\n\nSame for `input=\"file\"`. Not sure if this should be fixed or just documented; I don't like changing the behavior of the vectorizers yet again...\n",
    "issue_closed_at": "2019-04-23T03:50:24Z",
    "base_commit": "badaa153e67ffa56fb1a413b3b7b5b8507024291",
    "changes": [
      {
        "file": "sklearn/feature_extraction/text.py",
        "type": "line",
        "name": "line 31",
        "code": "from ..utils.validation import check_is_fitted, check_array, FLOAT_DTYPES\nfrom ..utils import _IS_32BIT\nfrom ..utils.fixes import _astype_copy_false\n\n\n__all__ = ['HashingVectorizer',"
      },
      {
        "file": "sklearn/feature_extraction/text.py",
        "type": "function",
        "name": "_check_stop_words_consistency",
        "class_name": "VectorizerMixin",
        "code": "def _check_stop_words_consistency(self, stop_words, preprocess, tokenize):\n        \"\"\"Check if stop words are consistent\n\n        Returns\n        -------\n        is_consistent : True if stop words are consistent with the preprocessor\n                        and tokenizer, False if they are not, None if the check\n                        was previously performed, \"error\" if it could not be\n                        performed (e.g. because of the use of a custom\n                        preprocessor / tokenizer)\n        \"\"\"\n        if id(self.stop_words) == getattr(self, '_stop_words_id', None):\n            # Stop words are were previously validated\n            return None\n\n        # NB: stop_words is validated, unlike self.stop_words\n        try:\n            inconsistent = set()\n            for w in stop_words or ():\n                tokens = list(tokenize(preprocess(w)))\n                for token in tokens:\n                    if token not in stop_words:\n                        inconsistent.add(token)\n            self._stop_words_id = id(self.stop_words)\n\n            if inconsistent:\n                warnings.warn('Your stop_words may be inconsistent with '\n                              'your preprocessing. Tokenizing the stop '\n                              'words generated tokens %r not in '\n                              'stop_words.' % sorted(inconsistent))\n            return not inconsistent\n        except Exception:\n            # Failed to check stop words consistency (e.g. because a custom\n            # preprocessor or tokenizer was used)\n            self._stop_words_id = id(self.stop_words)\n            return 'error'"
      },
      {
        "file": "sklearn/feature_extraction/text.py",
        "type": "class",
        "name": "HashingVectorizer",
        "code": "class HashingVectorizer(BaseEstimator, VectorizerMixin, TransformerMixin):\n    \"\"\"Convert a collection of text documents to a matrix of token occurrences\n\n    It turns a collection of text documents into a scipy.sparse matrix holding\n    token occurrence counts (or binary occurrence information), possibly\n    normalized as token frequencies if norm='l1' or projected on the euclidean\n    unit sphere if norm='l2'.\n\n    This text vectorizer implementation uses the hashing trick to find the\n    token string name to feature integer index mapping.\n\n    This strategy has several advantages:\n\n    - it is very low memory scalable to large datasets as there is no need to\n      store a vocabulary dictionary in memory\n\n    - it is fast to pickle and un-pickle as it holds no state besides the\n      constructor parameters\n\n    - it can be used in a streaming (partial fit) or parallel pipeline as there\n      is no state computed during fit.\n\n    There are also a couple of cons (vs using a CountVectorizer with an\n    in-memory vocabulary):\n\n    - there is no way to compute the inverse transform (from feature indices to\n      string feature names) which can be a problem when trying to introspect\n      which features are most important to a model.\n\n    - there can be collisions: distinct tokens can be mapped to the same\n      feature index. However in practice this is rarely an issue if n_features\n      is large enough (e.g. 2 ** 18 for text classification problems).\n\n    - no IDF weighting as this would render the transformer stateful.\n\n    The hash function employed is the signed 32-bit version of Murmurhash3.\n\n    Read more in the :ref:`User Guide <text_feature_extraction>`.\n\n    Parameters\n    ----------\n\n    input : string {'filename', 'file', 'content'}\n        If 'filename', the sequence passed as an argument to fit is\n        expected to be a list of filenames that need reading to fetch\n        the raw content to analyze.\n\n        If 'file', the sequence items must have a 'read' method (file-like\n        object) that is called to fetch the bytes in memory.\n\n        Otherwise the input is expected to be the sequence strings or\n        bytes items are expected to be analyzed directly.\n\n    encoding : string, default='utf-8'\n        If bytes or files are given to analyze, this encoding is used to\n        decode.\n\n    decode_error : {'strict', 'ignore', 'replace'}\n        Instruction on what to do if a byte sequence is given to analyze that\n        contains characters not of the given `encoding`. By default, it is\n        'strict', meaning that a UnicodeDecodeError will be raised. Other\n        values are 'ignore' and 'replace'.\n\n    strip_accents : {'ascii', 'unicode', None}\n        Remove accents and perform other character normalization\n        during the preprocessing step.\n        'ascii' is a fast method that only works on characters that have\n        an direct ASCII mapping.\n        'unicode' is a slightly slower method that works on any characters.\n        None (default) does nothing.\n\n        Both 'ascii' and 'unicode' use NFKD normalization from\n        :func:`unicodedata.normalize`.\n\n    lowercase : boolean, default=True\n        Convert all characters to lowercase before tokenizing.\n\n    preprocessor : callable or None (default)\n        Override the preprocessing (string transformation) stage while\n        preserving the tokenizing and n-grams generation steps.\n\n    tokenizer : callable or None (default)\n        Override the string tokenization step while preserving the\n        preprocessing and n-grams generation steps.\n        Only applies if ``analyzer == 'word'``.\n\n    stop_words : string {'english'}, list, or None (default)\n        If 'english', a built-in stop word list for English is used.\n        There are several known issues with 'english' and you should\n        consider an alternative (see :ref:`stop_words`).\n\n        If a list, that list is assumed to contain stop words, all of which\n        will be removed from the resulting tokens.\n        Only applies if ``analyzer == 'word'``.\n\n    token_pattern : string\n        Regular expression denoting what constitutes a \"token\", only used\n        if ``analyzer == 'word'``. The default regexp selects tokens of 2\n        or more alphanumeric characters (punctuation is completely ignored\n        and always treated as a token separator).\n\n    ngram_range : tuple (min_n, max_n), default=(1, 1)\n        The lower and upper boundary of the range of n-values for different\n        n-grams to be extracted. All values of n such that min_n <= n <= max_n\n        will be used.\n\n    analyzer : string, {'word', 'char', 'char_wb'} or callable\n        Whether the feature should be made of word or character n-grams.\n        Option 'char_wb' creates character n-grams only from text inside\n        word boundaries; n-grams at the edges of words are padded with space.\n\n        If a callable is passed it is used to extract the sequence of features\n        out of the raw, unprocessed input.\n\n    n_features : integer, default=(2 ** 20)\n        The number of features (columns) in the output matrices. Small numbers\n        of features are likely to cause hash collisions, but large numbers\n        will cause larger coefficient dimensions in linear learners.\n\n    binary : boolean, default=False.\n        If True, all non zero counts are set to 1. This is useful for discrete\n        probabilistic models that model binary events rather than integer\n        counts.\n\n    norm : 'l1', 'l2' or None, optional\n        Norm used to normalize term vectors. None for no normalization.\n\n    alternate_sign : boolean, optional, default True\n        When True, an alternating sign is added to the features as to\n        approximately conserve the inner product in the hashed space even for\n        small n_features. This approach is similar to sparse random projection.\n\n        .. versionadded:: 0.19\n\n    dtype : type, optional\n        Type of the matrix returned by fit_transform() or transform().\n\n    Examples\n    --------\n    >>> from sklearn.feature_extraction.text import HashingVectorizer\n    >>> corpus = [\n    ...     'This is the first document.',\n    ...     'This document is the second document.',\n    ...     'And this is the third one.',\n    ...     'Is this the first document?',\n    ... ]\n    >>> vectorizer = HashingVectorizer(n_features=2**4)\n    >>> X = vectorizer.fit_transform(corpus)\n    >>> print(X.shape)\n    (4, 16)\n\n    See also\n    --------\n    CountVectorizer, TfidfVectorizer\n\n    \"\"\"\n    def __init__(self, input='content', encoding='utf-8',\n                 decode_error='strict', strip_accents=None,\n                 lowercase=True, preprocessor=None, tokenizer=None,\n                 stop_words=None, token_pattern=r\"(?u)\\b\\w\\w+\\b\",\n                 ngram_range=(1, 1), analyzer='word', n_features=(2 ** 20),\n                 binary=False, norm='l2', alternate_sign=True,\n                 dtype=np.float64):\n        self.input = input\n        self.encoding = encoding\n        self.decode_error = decode_error\n        self.strip_accents = strip_accents\n        self.preprocessor = preprocessor\n        self.tokenizer = tokenizer\n        self.analyzer = analyzer\n        self.lowercase = lowercase\n        self.token_pattern = token_pattern\n        self.stop_words = stop_words\n        self.n_features = n_features\n        self.ngram_range = ngram_range\n        self.binary = binary\n        self.norm = norm\n        self.alternate_sign = alternate_sign\n        self.dtype = dtype\n\n    def partial_fit(self, X, y=None):\n        \"\"\"Does nothing: this transformer is stateless.\n\n        This method is just there to mark the fact that this transformer\n        can work in a streaming setup.\n\n        Parameters\n        ----------\n        X : array-like, shape [n_samples, n_features]\n            Training data.\n        \"\"\"\n        return self\n\n    def fit(self, X, y=None):\n        \"\"\"Does nothing: this transformer is stateless.\n\n        Parameters\n        ----------\n        X : array-like, shape [n_samples, n_features]\n            Training data.\n        \"\"\"\n        # triggers a parameter validation\n        if isinstance(X, str):\n            raise ValueError(\n                \"Iterable over raw text documents expected, \"\n                \"string object received.\")\n\n        self._validate_params()\n\n        self._get_hasher().fit(X, y=y)\n        return self\n\n    def transform(self, X):\n        \"\"\"Transform a sequence of documents to a document-term matrix.\n\n        Parameters\n        ----------\n        X : iterable over raw text documents, length = n_samples\n            Samples. Each sample must be a text document (either bytes or\n            unicode strings, file name or file object depending on the\n            constructor argument) which will be tokenized and hashed.\n\n        Returns\n        -------\n        X : scipy.sparse matrix, shape = (n_samples, self.n_features)\n            Document-term matrix.\n        \"\"\"\n        if isinstance(X, str):\n            raise ValueError(\n                \"Iterable over raw text documents expected, \"\n                \"string object received.\")\n\n        self._validate_params()\n\n        analyzer = self.build_analyzer()\n        X = self._get_hasher().transform(analyzer(doc) for doc in X)\n        if self.binary:\n            X.data.fill(1)\n        if self.norm is not None:\n            X = normalize(X, norm=self.norm, copy=False)\n        return X\n\n    def fit_transform(self, X, y=None):\n        \"\"\"Transform a sequence of documents to a document-term matrix.\n\n        Parameters\n        ----------\n        X : iterable over raw text documents, length = n_samples\n            Samples. Each sample must be a text document (either bytes or\n            unicode strings, file name or file object depending on the\n            constructor argument) which will be tokenized and hashed.\n        y : any\n            Ignored. This parameter exists only for compatibility with\n            sklearn.pipeline.Pipeline.\n\n        Returns\n        -------\n        X : scipy.sparse matrix, shape = (n_samples, self.n_features)\n            Document-term matrix.\n        \"\"\"\n        return self.fit(X, y).transform(X)\n\n    def _get_hasher(self):\n        return FeatureHasher(n_features=self.n_features,\n                             input_type='string', dtype=self.dtype,\n                             alternate_sign=self.alternate_sign)\n\n    def _more_tags(self):\n        return {'X_types': ['string']}"
      },
      {
        "file": "sklearn/feature_extraction/text.py",
        "type": "class",
        "name": "CountVectorizer",
        "code": "class CountVectorizer(BaseEstimator, VectorizerMixin):\n    \"\"\"Convert a collection of text documents to a matrix of token counts\n\n    This implementation produces a sparse representation of the counts using\n    scipy.sparse.csr_matrix.\n\n    If you do not provide an a-priori dictionary and you do not use an analyzer\n    that does some kind of feature selection then the number of features will\n    be equal to the vocabulary size found by analyzing the data.\n\n    Read more in the :ref:`User Guide <text_feature_extraction>`.\n\n    Parameters\n    ----------\n    input : string {'filename', 'file', 'content'}\n        If 'filename', the sequence passed as an argument to fit is\n        expected to be a list of filenames that need reading to fetch\n        the raw content to analyze.\n\n        If 'file', the sequence items must have a 'read' method (file-like\n        object) that is called to fetch the bytes in memory.\n\n        Otherwise the input is expected to be the sequence strings or\n        bytes items are expected to be analyzed directly.\n\n    encoding : string, 'utf-8' by default.\n        If bytes or files are given to analyze, this encoding is used to\n        decode.\n\n    decode_error : {'strict', 'ignore', 'replace'}\n        Instruction on what to do if a byte sequence is given to analyze that\n        contains characters not of the given `encoding`. By default, it is\n        'strict', meaning that a UnicodeDecodeError will be raised. Other\n        values are 'ignore' and 'replace'.\n\n    strip_accents : {'ascii', 'unicode', None}\n        Remove accents and perform other character normalization\n        during the preprocessing step.\n        'ascii' is a fast method that only works on characters that have\n        an direct ASCII mapping.\n        'unicode' is a slightly slower method that works on any characters.\n        None (default) does nothing.\n\n        Both 'ascii' and 'unicode' use NFKD normalization from\n        :func:`unicodedata.normalize`.\n\n    lowercase : boolean, True by default\n        Convert all characters to lowercase before tokenizing.\n\n    preprocessor : callable or None (default)\n        Override the preprocessing (string transformation) stage while\n        preserving the tokenizing and n-grams generation steps.\n\n    tokenizer : callable or None (default)\n        Override the string tokenization step while preserving the\n        preprocessing and n-grams generation steps.\n        Only applies if ``analyzer == 'word'``.\n\n    stop_words : string {'english'}, list, or None (default)\n        If 'english', a built-in stop word list for English is used.\n        There are several known issues with 'english' and you should\n        consider an alternative (see :ref:`stop_words`).\n\n        If a list, that list is assumed to contain stop words, all of which\n        will be removed from the resulting tokens.\n        Only applies if ``analyzer == 'word'``.\n\n        If None, no stop words will be used. max_df can be set to a value\n        in the range [0.7, 1.0) to automatically detect and filter stop\n        words based on intra corpus document frequency of terms.\n\n    token_pattern : string\n        Regular expression denoting what constitutes a \"token\", only used\n        if ``analyzer == 'word'``. The default regexp select tokens of 2\n        or more alphanumeric characters (punctuation is completely ignored\n        and always treated as a token separator).\n\n    ngram_range : tuple (min_n, max_n)\n        The lower and upper boundary of the range of n-values for different\n        n-grams to be extracted. All values of n such that min_n <= n <= max_n\n        will be used.\n\n    analyzer : string, {'word', 'char', 'char_wb'} or callable\n        Whether the feature should be made of word or character n-grams.\n        Option 'char_wb' creates character n-grams only from text inside\n        word boundaries; n-grams at the edges of words are padded with space.\n\n        If a callable is passed it is used to extract the sequence of features\n        out of the raw, unprocessed input.\n\n    max_df : float in range [0.0, 1.0] or int, default=1.0\n        When building the vocabulary ignore terms that have a document\n        frequency strictly higher than the given threshold (corpus-specific\n        stop words).\n        If float, the parameter represents a proportion of documents, integer\n        absolute counts.\n        This parameter is ignored if vocabulary is not None.\n\n    min_df : float in range [0.0, 1.0] or int, default=1\n        When building the vocabulary ignore terms that have a document\n        frequency strictly lower than the given threshold. This value is also\n        called cut-off in the literature.\n        If float, the parameter represents a proportion of documents, integer\n        absolute counts.\n        This parameter is ignored if vocabulary is not None.\n\n    max_features : int or None, default=None\n        If not None, build a vocabulary that only consider the top\n        max_features ordered by term frequency across the corpus.\n\n        This parameter is ignored if vocabulary is not None.\n\n    vocabulary : Mapping or iterable, optional\n        Either a Mapping (e.g., a dict) where keys are terms and values are\n        indices in the feature matrix, or an iterable over terms. If not\n        given, a vocabulary is determined from the input documents. Indices\n        in the mapping should not be repeated and should not have any gap\n        between 0 and the largest index.\n\n    binary : boolean, default=False\n        If True, all non zero counts are set to 1. This is useful for discrete\n        probabilistic models that model binary events rather than integer\n        counts.\n\n    dtype : type, optional\n        Type of the matrix returned by fit_transform() or transform().\n\n    Attributes\n    ----------\n    vocabulary_ : dict\n        A mapping of terms to feature indices.\n\n    stop_words_ : set\n        Terms that were ignored because they either:\n\n          - occurred in too many documents (`max_df`)\n          - occurred in too few documents (`min_df`)\n          - were cut off by feature selection (`max_features`).\n\n        This is only available if no vocabulary was given.\n\n    Examples\n    --------\n    >>> from sklearn.feature_extraction.text import CountVectorizer\n    >>> corpus = [\n    ...     'This is the first document.',\n    ...     'This document is the second document.',\n    ...     'And this is the third one.',\n    ...     'Is this the first document?',\n    ... ]\n    >>> vectorizer = CountVectorizer()\n    >>> X = vectorizer.fit_transform(corpus)\n    >>> print(vectorizer.get_feature_names())\n    ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']\n    >>> print(X.toarray())  # doctest: +NORMALIZE_WHITESPACE\n    [[0 1 1 1 0 0 1 0 1]\n     [0 2 0 1 0 1 1 0 1]\n     [1 0 0 1 1 0 1 1 1]\n     [0 1 1 1 0 0 1 0 1]]\n\n    See also\n    --------\n    HashingVectorizer, TfidfVectorizer\n\n    Notes\n    -----\n    The ``stop_words_`` attribute can get large and increase the model size\n    when pickling. This attribute is provided only for introspection and can\n    be safely removed using delattr or set to None before pickling.\n    \"\"\"\n\n    def __init__(self, input='content', encoding='utf-8',\n                 decode_error='strict', strip_accents=None,\n                 lowercase=True, preprocessor=None, tokenizer=None,\n                 stop_words=None, token_pattern=r\"(?u)\\b\\w\\w+\\b\",\n                 ngram_range=(1, 1), analyzer='word',\n                 max_df=1.0, min_df=1, max_features=None,\n                 vocabulary=None, binary=False, dtype=np.int64):\n        self.input = input\n        self.encoding = encoding\n        self.decode_error = decode_error\n        self.strip_accents = strip_accents\n        self.preprocessor = preprocessor\n        self.tokenizer = tokenizer\n        self.analyzer = analyzer\n        self.lowercase = lowercase\n        self.token_pattern = token_pattern\n        self.stop_words = stop_words\n        self.max_df = max_df\n        self.min_df = min_df\n        if max_df < 0 or min_df < 0:\n            raise ValueError(\"negative value for max_df or min_df\")\n        self.max_features = max_features\n        if max_features is not None:\n            if (not isinstance(max_features, numbers.Integral) or\n                    max_features <= 0):\n                raise ValueError(\n                    \"max_features=%r, neither a positive integer nor None\"\n                    % max_features)\n        self.ngram_range = ngram_range\n        self.vocabulary = vocabulary\n        self.binary = binary\n        self.dtype = dtype\n\n    def _sort_features(self, X, vocabulary):\n        \"\"\"Sort features by name\n\n        Returns a reordered matrix and modifies the vocabulary in place\n        \"\"\"\n        sorted_features = sorted(vocabulary.items())\n        map_index = np.empty(len(sorted_features), dtype=X.indices.dtype)\n        for new_val, (term, old_val) in enumerate(sorted_features):\n            vocabulary[term] = new_val\n            map_index[old_val] = new_val\n\n        X.indices = map_index.take(X.indices, mode='clip')\n        return X\n\n    def _limit_features(self, X, vocabulary, high=None, low=None,\n                        limit=None):\n        \"\"\"Remove too rare or too common features.\n\n        Prune features that are non zero in more samples than high or less\n        documents than low, modifying the vocabulary, and restricting it to\n        at most the limit most frequent.\n\n        This does not prune samples with zero features.\n        \"\"\"\n        if high is None and low is None and limit is None:\n            return X, set()\n\n        # Calculate a mask based on document frequencies\n        dfs = _document_frequency(X)\n        tfs = np.asarray(X.sum(axis=0)).ravel()\n        mask = np.ones(len(dfs), dtype=bool)\n        if high is not None:\n            mask &= dfs <= high\n        if low is not None:\n            mask &= dfs >= low\n        if limit is not None and mask.sum() > limit:\n            mask_inds = (-tfs[mask]).argsort()[:limit]\n            new_mask = np.zeros(len(dfs), dtype=bool)\n            new_mask[np.where(mask)[0][mask_inds]] = True\n            mask = new_mask\n\n        new_indices = np.cumsum(mask) - 1  # maps old indices to new\n        removed_terms = set()\n        for term, old_index in list(vocabulary.items()):\n            if mask[old_index]:\n                vocabulary[term] = new_indices[old_index]\n            else:\n                del vocabulary[term]\n                removed_terms.add(term)\n        kept_indices = np.where(mask)[0]\n        if len(kept_indices) == 0:\n            raise ValueError(\"After pruning, no terms remain. Try a lower\"\n                             \" min_df or a higher max_df.\")\n        return X[:, kept_indices], removed_terms\n\n    def _count_vocab(self, raw_documents, fixed_vocab):\n        \"\"\"Create sparse feature matrix, and vocabulary where fixed_vocab=False\n        \"\"\"\n        if fixed_vocab:\n            vocabulary = self.vocabulary_\n        else:\n            # Add a new value when a new vocabulary item is seen\n            vocabulary = defaultdict()\n            vocabulary.default_factory = vocabulary.__len__\n\n        analyze = self.build_analyzer()\n        j_indices = []\n        indptr = []\n\n        values = _make_int_array()\n        indptr.append(0)\n        for doc in raw_documents:\n            feature_counter = {}\n            for feature in analyze(doc):\n                try:\n                    feature_idx = vocabulary[feature]\n                    if feature_idx not in feature_counter:\n                        feature_counter[feature_idx] = 1\n                    else:\n                        feature_counter[feature_idx] += 1\n                except KeyError:\n                    # Ignore out-of-vocabulary items for fixed_vocab=True\n                    continue\n\n            j_indices.extend(feature_counter.keys())\n            values.extend(feature_counter.values())\n            indptr.append(len(j_indices))\n\n        if not fixed_vocab:\n            # disable defaultdict behaviour\n            vocabulary = dict(vocabulary)\n            if not vocabulary:\n                raise ValueError(\"empty vocabulary; perhaps the documents only\"\n                                 \" contain stop words\")\n\n        if indptr[-1] > 2147483648:  # = 2**31 - 1\n            if _IS_32BIT:\n                raise ValueError(('sparse CSR array has {} non-zero '\n                                  'elements and requires 64 bit indexing, '\n                                  'which is unsupported with 32 bit Python.')\n                                 .format(indptr[-1]))\n            indices_dtype = np.int64\n\n        else:\n            indices_dtype = np.int32\n        j_indices = np.asarray(j_indices, dtype=indices_dtype)\n        indptr = np.asarray(indptr, dtype=indices_dtype)\n        values = np.frombuffer(values, dtype=np.intc)\n\n        X = sp.csr_matrix((values, j_indices, indptr),\n                          shape=(len(indptr) - 1, len(vocabulary)),\n                          dtype=self.dtype)\n        X.sort_indices()\n        return vocabulary, X\n\n    def fit(self, raw_documents, y=None):\n        \"\"\"Learn a vocabulary dictionary of all tokens in the raw documents.\n\n        Parameters\n        ----------\n        raw_documents : iterable\n            An iterable which yields either str, unicode or file objects.\n\n        Returns\n        -------\n        self\n        \"\"\"\n        self.fit_transform(raw_documents)\n        return self\n\n    def fit_transform(self, raw_documents, y=None):\n        \"\"\"Learn the vocabulary dictionary and return term-document matrix.\n\n        This is equivalent to fit followed by transform, but more efficiently\n        implemented.\n\n        Parameters\n        ----------\n        raw_documents : iterable\n            An iterable which yields either str, unicode or file objects.\n\n        Returns\n        -------\n        X : array, [n_samples, n_features]\n            Document-term matrix.\n        \"\"\"\n        # We intentionally don't call the transform method to make\n        # fit_transform overridable without unwanted side effects in\n        # TfidfVectorizer.\n        if isinstance(raw_documents, str):\n            raise ValueError(\n                \"Iterable over raw text documents expected, \"\n                \"string object received.\")\n\n        self._validate_params()\n        self._validate_vocabulary()\n        max_df = self.max_df\n        min_df = self.min_df\n        max_features = self.max_features\n\n        vocabulary, X = self._count_vocab(raw_documents,\n                                          self.fixed_vocabulary_)\n\n        if self.binary:\n            X.data.fill(1)\n\n        if not self.fixed_vocabulary_:\n            X = self._sort_features(X, vocabulary)\n\n            n_doc = X.shape[0]\n            max_doc_count = (max_df\n                             if isinstance(max_df, numbers.Integral)\n                             else max_df * n_doc)\n            min_doc_count = (min_df\n                             if isinstance(min_df, numbers.Integral)\n                             else min_df * n_doc)\n            if max_doc_count < min_doc_count:\n                raise ValueError(\n                    \"max_df corresponds to < documents than min_df\")\n            X, self.stop_words_ = self._limit_features(X, vocabulary,\n                                                       max_doc_count,\n                                                       min_doc_count,\n                                                       max_features)\n\n            self.vocabulary_ = vocabulary\n\n        return X\n\n    def transform(self, raw_documents):\n        \"\"\"Transform documents to document-term matrix.\n\n        Extract token counts out of raw text documents using the vocabulary\n        fitted with fit or the one provided to the constructor.\n\n        Parameters\n        ----------\n        raw_documents : iterable\n            An iterable which yields either str, unicode or file objects.\n\n        Returns\n        -------\n        X : sparse matrix, [n_samples, n_features]\n            Document-term matrix.\n        \"\"\"\n        if isinstance(raw_documents, str):\n            raise ValueError(\n                \"Iterable over raw text documents expected, \"\n                \"string object received.\")\n\n        if not hasattr(self, 'vocabulary_'):\n            self._validate_vocabulary()\n\n        self._check_vocabulary()\n\n        # use the same matrix-building strategy as fit_transform\n        _, X = self._count_vocab(raw_documents, fixed_vocab=True)\n        if self.binary:\n            X.data.fill(1)\n        return X\n\n    def inverse_transform(self, X):\n        \"\"\"Return terms per document with nonzero entries in X.\n\n        Parameters\n        ----------\n        X : {array, sparse matrix}, shape = [n_samples, n_features]\n\n        Returns\n        -------\n        X_inv : list of arrays, len = n_samples\n            List of arrays of terms.\n        \"\"\"\n        self._check_vocabulary()\n\n        if sp.issparse(X):\n            # We need CSR format for fast row manipulations.\n            X = X.tocsr()\n        else:\n            # We need to convert X to a matrix, so that the indexing\n            # returns 2D objects\n            X = np.asmatrix(X)\n        n_samples = X.shape[0]\n\n        terms = np.array(list(self.vocabulary_.keys()))\n        indices = np.array(list(self.vocabulary_.values()))\n        inverse_vocabulary = terms[np.argsort(indices)]\n\n        return [inverse_vocabulary[X[i, :].nonzero()[1]].ravel()\n                for i in range(n_samples)]\n\n    def get_feature_names(self):\n        \"\"\"Array mapping from feature integer indices to feature name\"\"\"\n        if not hasattr(self, 'vocabulary_'):\n            self._validate_vocabulary()\n\n        self._check_vocabulary()\n\n        return [t for t, i in sorted(self.vocabulary_.items(),\n                                     key=itemgetter(1))]\n\n    def _more_tags(self):\n        return {'X_types': ['string']}"
      },
      {
        "file": "sklearn/feature_extraction/text.py",
        "type": "class",
        "name": "TfidfVectorizer",
        "code": "class TfidfVectorizer(CountVectorizer):\n    \"\"\"Convert a collection of raw documents to a matrix of TF-IDF features.\n\n    Equivalent to :class:`CountVectorizer` followed by\n    :class:`TfidfTransformer`.\n\n    Read more in the :ref:`User Guide <text_feature_extraction>`.\n\n    Parameters\n    ----------\n    input : string {'filename', 'file', 'content'}\n        If 'filename', the sequence passed as an argument to fit is\n        expected to be a list of filenames that need reading to fetch\n        the raw content to analyze.\n\n        If 'file', the sequence items must have a 'read' method (file-like\n        object) that is called to fetch the bytes in memory.\n\n        Otherwise the input is expected to be the sequence strings or\n        bytes items are expected to be analyzed directly.\n\n    encoding : string, 'utf-8' by default.\n        If bytes or files are given to analyze, this encoding is used to\n        decode.\n\n    decode_error : {'strict', 'ignore', 'replace'} (default='strict')\n        Instruction on what to do if a byte sequence is given to analyze that\n        contains characters not of the given `encoding`. By default, it is\n        'strict', meaning that a UnicodeDecodeError will be raised. Other\n        values are 'ignore' and 'replace'.\n\n    strip_accents : {'ascii', 'unicode', None} (default=None)\n        Remove accents and perform other character normalization\n        during the preprocessing step.\n        'ascii' is a fast method that only works on characters that have\n        an direct ASCII mapping.\n        'unicode' is a slightly slower method that works on any characters.\n        None (default) does nothing.\n\n        Both 'ascii' and 'unicode' use NFKD normalization from\n        :func:`unicodedata.normalize`.\n\n    lowercase : boolean (default=True)\n        Convert all characters to lowercase before tokenizing.\n\n    preprocessor : callable or None (default=None)\n        Override the preprocessing (string transformation) stage while\n        preserving the tokenizing and n-grams generation steps.\n\n    tokenizer : callable or None (default=None)\n        Override the string tokenization step while preserving the\n        preprocessing and n-grams generation steps.\n        Only applies if ``analyzer == 'word'``.\n\n    analyzer : string, {'word', 'char', 'char_wb'} or callable\n        Whether the feature should be made of word or character n-grams.\n        Option 'char_wb' creates character n-grams only from text inside\n        word boundaries; n-grams at the edges of words are padded with space.\n\n        If a callable is passed it is used to extract the sequence of features\n        out of the raw, unprocessed input.\n\n    stop_words : string {'english'}, list, or None (default=None)\n        If a string, it is passed to _check_stop_list and the appropriate stop\n        list is returned. 'english' is currently the only supported string\n        value.\n        There are several known issues with 'english' and you should\n        consider an alternative (see :ref:`stop_words`).\n\n        If a list, that list is assumed to contain stop words, all of which\n        will be removed from the resulting tokens.\n        Only applies if ``analyzer == 'word'``.\n\n        If None, no stop words will be used. max_df can be set to a value\n        in the range [0.7, 1.0) to automatically detect and filter stop\n        words based on intra corpus document frequency of terms.\n\n    token_pattern : string\n        Regular expression denoting what constitutes a \"token\", only used\n        if ``analyzer == 'word'``. The default regexp selects tokens of 2\n        or more alphanumeric characters (punctuation is completely ignored\n        and always treated as a token separator).\n\n    ngram_range : tuple (min_n, max_n) (default=(1, 1))\n        The lower and upper boundary of the range of n-values for different\n        n-grams to be extracted. All values of n such that min_n <= n <= max_n\n        will be used.\n\n    max_df : float in range [0.0, 1.0] or int (default=1.0)\n        When building the vocabulary ignore terms that have a document\n        frequency strictly higher than the given threshold (corpus-specific\n        stop words).\n        If float, the parameter represents a proportion of documents, integer\n        absolute counts.\n        This parameter is ignored if vocabulary is not None.\n\n    min_df : float in range [0.0, 1.0] or int (default=1)\n        When building the vocabulary ignore terms that have a document\n        frequency strictly lower than the given threshold. This value is also\n        called cut-off in the literature.\n        If float, the parameter represents a proportion of documents, integer\n        absolute counts.\n        This parameter is ignored if vocabulary is not None.\n\n    max_features : int or None (default=None)\n        If not None, build a vocabulary that only consider the top\n        max_features ordered by term frequency across the corpus.\n\n        This parameter is ignored if vocabulary is not None.\n\n    vocabulary : Mapping or iterable, optional (default=None)\n        Either a Mapping (e.g., a dict) where keys are terms and values are\n        indices in the feature matrix, or an iterable over terms. If not\n        given, a vocabulary is determined from the input documents.\n\n    binary : boolean (default=False)\n        If True, all non-zero term counts are set to 1. This does not mean\n        outputs will have only 0/1 values, only that the tf term in tf-idf\n        is binary. (Set idf and normalization to False to get 0/1 outputs.)\n\n    dtype : type, optional (default=float64)\n        Type of the matrix returned by fit_transform() or transform().\n\n    norm : 'l1', 'l2' or None, optional (default='l2')\n        Each output row will have unit norm, either:\n        * 'l2': Sum of squares of vector elements is 1. The cosine\n        similarity between two vectors is their dot product when l2 norm has\n        been applied.\n        * 'l1': Sum of absolute values of vector elements is 1.\n        See :func:`preprocessing.normalize`\n\n    use_idf : boolean (default=True)\n        Enable inverse-document-frequency reweighting.\n\n    smooth_idf : boolean (default=True)\n        Smooth idf weights by adding one to document frequencies, as if an\n        extra document was seen containing every term in the collection\n        exactly once. Prevents zero divisions.\n\n    sublinear_tf : boolean (default=False)\n        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).\n\n    Attributes\n    ----------\n    vocabulary_ : dict\n        A mapping of terms to feature indices.\n\n    idf_ : array, shape (n_features)\n        The inverse document frequency (IDF) vector; only defined\n        if  ``use_idf`` is True.\n\n    stop_words_ : set\n        Terms that were ignored because they either:\n\n          - occurred in too many documents (`max_df`)\n          - occurred in too few documents (`min_df`)\n          - were cut off by feature selection (`max_features`).\n\n        This is only available if no vocabulary was given.\n\n    Examples\n    --------\n    >>> from sklearn.feature_extraction.text import TfidfVectorizer\n    >>> corpus = [\n    ...     'This is the first document.',\n    ...     'This document is the second document.',\n    ...     'And this is the third one.',\n    ...     'Is this the first document?',\n    ... ]\n    >>> vectorizer = TfidfVectorizer()\n    >>> X = vectorizer.fit_transform(corpus)\n    >>> print(vectorizer.get_feature_names())\n    ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']\n    >>> print(X.shape)\n    (4, 9)\n\n    See also\n    --------\n    CountVectorizer : Transforms text into a sparse matrix of n-gram counts.\n\n    TfidfTransformer : Performs the TF-IDF transformation from a provided\n        matrix of counts.\n\n    Notes\n    -----\n    The ``stop_words_`` attribute can get large and increase the model size\n    when pickling. This attribute is provided only for introspection and can\n    be safely removed using delattr or set to None before pickling.\n    \"\"\"\n\n    def __init__(self, input='content', encoding='utf-8',\n                 decode_error='strict', strip_accents=None, lowercase=True,\n                 preprocessor=None, tokenizer=None, analyzer='word',\n                 stop_words=None, token_pattern=r\"(?u)\\b\\w\\w+\\b\",\n                 ngram_range=(1, 1), max_df=1.0, min_df=1,\n                 max_features=None, vocabulary=None, binary=False,\n                 dtype=np.float64, norm='l2', use_idf=True, smooth_idf=True,\n                 sublinear_tf=False):\n\n        super().__init__(\n            input=input, encoding=encoding, decode_error=decode_error,\n            strip_accents=strip_accents, lowercase=lowercase,\n            preprocessor=preprocessor, tokenizer=tokenizer, analyzer=analyzer,\n            stop_words=stop_words, token_pattern=token_pattern,\n            ngram_range=ngram_range, max_df=max_df, min_df=min_df,\n            max_features=max_features, vocabulary=vocabulary, binary=binary,\n            dtype=dtype)\n\n        self._tfidf = TfidfTransformer(norm=norm, use_idf=use_idf,\n                                       smooth_idf=smooth_idf,\n                                       sublinear_tf=sublinear_tf)\n\n    # Broadcast the TF-IDF parameters to the underlying transformer instance\n    # for easy grid search and repr\n\n    @property\n    def norm(self):\n        return self._tfidf.norm\n\n    @norm.setter\n    def norm(self, value):\n        self._tfidf.norm = value\n\n    @property\n    def use_idf(self):\n        return self._tfidf.use_idf\n\n    @use_idf.setter\n    def use_idf(self, value):\n        self._tfidf.use_idf = value\n\n    @property\n    def smooth_idf(self):\n        return self._tfidf.smooth_idf\n\n    @smooth_idf.setter\n    def smooth_idf(self, value):\n        self._tfidf.smooth_idf = value\n\n    @property\n    def sublinear_tf(self):\n        return self._tfidf.sublinear_tf\n\n    @sublinear_tf.setter\n    def sublinear_tf(self, value):\n        self._tfidf.sublinear_tf = value\n\n    @property\n    def idf_(self):\n        return self._tfidf.idf_\n\n    @idf_.setter\n    def idf_(self, value):\n        self._validate_vocabulary()\n        if hasattr(self, 'vocabulary_'):\n            if len(self.vocabulary_) != len(value):\n                raise ValueError(\"idf length = %d must be equal \"\n                                 \"to vocabulary size = %d\" %\n                                 (len(value), len(self.vocabulary)))\n        self._tfidf.idf_ = value\n\n    def _check_params(self):\n        if self.dtype not in FLOAT_DTYPES:\n            warnings.warn(\"Only {} 'dtype' should be used. {} 'dtype' will \"\n                          \"be converted to np.float64.\"\n                          .format(FLOAT_DTYPES, self.dtype),\n                          UserWarning)\n\n    def fit(self, raw_documents, y=None):\n        \"\"\"Learn vocabulary and idf from training set.\n\n        Parameters\n        ----------\n        raw_documents : iterable\n            an iterable which yields either str, unicode or file objects\n\n        Returns\n        -------\n        self : TfidfVectorizer\n        \"\"\"\n        self._check_params()\n        X = super().fit_transform(raw_documents)\n        self._tfidf.fit(X)\n        return self\n\n    def fit_transform(self, raw_documents, y=None):\n        \"\"\"Learn vocabulary and idf, return term-document matrix.\n\n        This is equivalent to fit followed by transform, but more efficiently\n        implemented.\n\n        Parameters\n        ----------\n        raw_documents : iterable\n            an iterable which yields either str, unicode or file objects\n\n        Returns\n        -------\n        X : sparse matrix, [n_samples, n_features]\n            Tf-idf-weighted document-term matrix.\n        \"\"\"\n        self._check_params()\n        X = super().fit_transform(raw_documents)\n        self._tfidf.fit(X)\n        # X is already a transformed view of raw_documents so\n        # we set copy to False\n        return self._tfidf.transform(X, copy=False)\n\n    def transform(self, raw_documents, copy=True):\n        \"\"\"Transform documents to document-term matrix.\n\n        Uses the vocabulary and document frequencies (df) learned by fit (or\n        fit_transform).\n\n        Parameters\n        ----------\n        raw_documents : iterable\n            an iterable which yields either str, unicode or file objects\n\n        copy : boolean, default True\n            Whether to copy X and operate on the copy or perform in-place\n            operations.\n\n        Returns\n        -------\n        X : sparse matrix, [n_samples, n_features]\n            Tf-idf-weighted document-term matrix.\n        \"\"\"\n        check_is_fitted(self, '_tfidf', 'The tfidf vector is not fitted')\n\n        X = super().transform(raw_documents)\n        return self._tfidf.transform(X, copy=False)\n\n    def _more_tags(self):\n        return {'X_types': ['string'], '_skip_test': True}"
      }
    ]
  },
  "Justification": "Candidate D involves changes in the `CountVectorizer`, which is part of the text processing in scikit-learn, similar to how the `AffinityPropagation` utilizes clustering of data. This report discusses unexpected behavior concerning input handling which parallels the user's confusion regarding empty output for non-converged clustering. Both bugs seem to highlight issues of expected versus actual outputs based on input parameters. Furthermore, the fact that the bug report addresses a problem with how input is parsed can potentially provide insights on managing the outputs more effectively when convergence issues arise in clustering algorithms. The structural and module similarity between the handling of input in both cases makes it highly relevant for debugging the current issue.",
  "instance_id": "scikit-learn__scikit-learn-15512",
  "repo": "scikit-learn/scikit-learn",
  "created_at": "2019-11-02T22:28:57Z",
  "problem_statement": "Return values of non converged affinity propagation clustering\nThe affinity propagation Documentation states: \r\n\"When the algorithm does not converge, it returns an empty array as cluster_center_indices and -1 as label for each training sample.\"\r\n\r\nExample:\r\n```python\r\nfrom sklearn.cluster import AffinityPropagation\r\nimport pandas as pd\r\n\r\ndata = pd.DataFrame([[1,0,0,0,0,0],[0,1,1,1,0,0],[0,0,1,0,0,1]])\r\naf = AffinityPropagation(affinity='euclidean', verbose=True, copy=False, max_iter=2).fit(data)\r\n\r\nprint(af.cluster_centers_indices_)\r\nprint(af.labels_)\r\n\r\n```\r\nI would expect that the clustering here (which does not converge) prints first an empty List and then [-1,-1,-1], however, I get [2] as cluster center and [0,0,0] as cluster labels. \r\nThe only way I currently know if the clustering fails is if I use the verbose option, however that is very unhandy. A hacky solution is to check if max_iter == n_iter_ but it could have converged exactly 15 iterations before max_iter (although unlikely).\r\nI am not sure if this is intended behavior and the documentation is wrong?\r\n\r\nFor my use-case within a bigger script, I would prefer to get back -1 values or have a property to check if it has converged, as otherwise, a user might not be aware that the clustering never converged.\r\n\r\n\r\n#### Versions\r\nSystem:\r\n    python: 3.6.7 | packaged by conda-forge | (default, Nov 21 2018, 02:32:25)  [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]\r\nexecutable: /home/jenniferh/Programs/anaconda3/envs/TF_RDKit_1_19/bin/python\r\n   machine: Linux-4.15.0-52-generic-x86_64-with-debian-stretch-sid\r\nBLAS:\r\n    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None\r\n  lib_dirs: /home/jenniferh/Programs/anaconda3/envs/TF_RDKit_1_19/lib\r\ncblas_libs: mkl_rt, pthread\r\nPython deps:\r\n    pip: 18.1\r\n   setuptools: 40.6.3\r\n   sklearn: 0.20.3\r\n   numpy: 1.15.4\r\n   scipy: 1.2.0\r\n   Cython: 0.29.2\r\n   pandas: 0.23.4\r\n\r\n\n",
  "patch": "diff --git a/sklearn/cluster/_affinity_propagation.py b/sklearn/cluster/_affinity_propagation.py\n--- a/sklearn/cluster/_affinity_propagation.py\n+++ b/sklearn/cluster/_affinity_propagation.py\n@@ -194,17 +194,19 @@ def affinity_propagation(S, preference=None, convergence_iter=15, max_iter=200,\n             unconverged = (np.sum((se == convergence_iter) + (se == 0))\n                            != n_samples)\n             if (not unconverged and (K > 0)) or (it == max_iter):\n+                never_converged = False\n                 if verbose:\n                     print(\"Converged after %d iterations.\" % it)\n                 break\n     else:\n+        never_converged = True\n         if verbose:\n             print(\"Did not converge\")\n \n     I = np.flatnonzero(E)\n     K = I.size  # Identify exemplars\n \n-    if K > 0:\n+    if K > 0 and not never_converged:\n         c = np.argmax(S[:, I], axis=1)\n         c[I] = np.arange(K)  # Identify clusters\n         # Refine the final set of exemplars and clusters and return results\n@@ -408,6 +410,7 @@ def predict(self, X):\n             Cluster labels.\n         \"\"\"\n         check_is_fitted(self)\n+        X = check_array(X)\n         if not hasattr(self, \"cluster_centers_\"):\n             raise ValueError(\"Predict method is not supported when \"\n                              \"affinity='precomputed'.\")\n"
}