{
  "original_problem": {
    "instance_id": "scikit-learn__scikit-learn-15512",
    "repo": "scikit-learn/scikit-learn",
    "created_at": "2019-11-02T22:28:57Z",
    "problem_statement": "Return values of non converged affinity propagation clustering\nThe affinity propagation Documentation states: \r\n\"When the algorithm does not converge, it returns an empty array as cluster_center_indices and -1 as label for each training sample.\"\r\n\r\nExample:\r\n```python\r\nfrom sklearn.cluster import AffinityPropagation\r\nimport pandas as pd\r\n\r\ndata = pd.DataFrame([[1,0,0,0,0,0],[0,1,1,1,0,0],[0,0,1,0,0,1]])\r\naf = AffinityPropagation(affinity='euclidean', verbose=True, copy=False, max_iter=2).fit(data)\r\n\r\nprint(af.cluster_centers_indices_)\r\nprint(af.labels_)\r\n\r\n```\r\nI would expect that the clustering here (which does not converge) prints first an empty List and then [-1,-1,-1], however, I get [2] as cluster center and [0,0,0] as cluster labels. \r\nThe only way I currently know if the clustering fails is if I use the verbose option, however that is very unhandy. A hacky solution is to check if max_iter == n_iter_ but it could have converged exactly 15 iterations before max_iter (although unlikely).\r\nI am not sure if this is intended behavior and the documentation is wrong?\r\n\r\nFor my use-case within a bigger script, I would prefer to get back -1 values or have a property to check if it has converged, as otherwise, a user might not be aware that the clustering never converged.\r\n\r\n\r\n#### Versions\r\nSystem:\r\n    python: 3.6.7 | packaged by conda-forge | (default, Nov 21 2018, 02:32:25)  [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]\r\nexecutable: /home/jenniferh/Programs/anaconda3/envs/TF_RDKit_1_19/bin/python\r\n   machine: Linux-4.15.0-52-generic-x86_64-with-debian-stretch-sid\r\nBLAS:\r\n    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None\r\n  lib_dirs: /home/jenniferh/Programs/anaconda3/envs/TF_RDKit_1_19/lib\r\ncblas_libs: mkl_rt, pthread\r\nPython deps:\r\n    pip: 18.1\r\n   setuptools: 40.6.3\r\n   sklearn: 0.20.3\r\n   numpy: 1.15.4\r\n   scipy: 1.2.0\r\n   Cython: 0.29.2\r\n   pandas: 0.23.4\r\n\r\n\n",
    "patch": "diff --git a/sklearn/cluster/_affinity_propagation.py b/sklearn/cluster/_affinity_propagation.py\n--- a/sklearn/cluster/_affinity_propagation.py\n+++ b/sklearn/cluster/_affinity_propagation.py\n@@ -194,17 +194,19 @@ def affinity_propagation(S, preference=None, convergence_iter=15, max_iter=200,\n             unconverged = (np.sum((se == convergence_iter) + (se == 0))\n                            != n_samples)\n             if (not unconverged and (K > 0)) or (it == max_iter):\n+                never_converged = False\n                 if verbose:\n                     print(\"Converged after %d iterations.\" % it)\n                 break\n     else:\n+        never_converged = True\n         if verbose:\n             print(\"Did not converge\")\n \n     I = np.flatnonzero(E)\n     K = I.size  # Identify exemplars\n \n-    if K > 0:\n+    if K > 0 and not never_converged:\n         c = np.argmax(S[:, I], axis=1)\n         c[I] = np.arange(K)  # Identify clusters\n         # Refine the final set of exemplars and clusters and return results\n@@ -408,6 +410,7 @@ def predict(self, X):\n             Cluster labels.\n         \"\"\"\n         check_is_fitted(self)\n+        X = check_array(X)\n         if not hasattr(self, \"cluster_centers_\"):\n             raise ValueError(\"Predict method is not supported when \"\n                              \"affinity='precomputed'.\")\n"
  },
  "candidates_evaluated": 5,
  "judgment_result": {
    "candidates": [
      {
        "idx": 1,
        "id": "similar_4235",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue focuses on parameter consistency across functions, which is not directly related to convergence detection in clustering."
      },
      {
        "idx": 2,
        "id": "similar_11906",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue is about improving error messages, which does not relate to algorithm convergence or result validation."
      },
      {
        "idx": 3,
        "id": "similar_6848",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "This issue deals with a mathematical error in probability prediction, unrelated to convergence detection in clustering."
      },
      {
        "idx": 4,
        "id": "similar_5482",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue is about parameter handling in text processing, not related to clustering convergence or result validation."
      },
      {
        "idx": 5,
        "id": "similar_12772",
        "decision": "Not useful",
        "confidence": "Medium",
        "reason": "The issue concerns parameter defaults in scoring, which does not relate to detecting non-convergence in clustering."
      }
    ]
  },
  "raw_summaries": [
    {
      "similar_issue": {
        "issue_title": "Check all usages of kneighbors_graph",
        "issue_body": "After @MechCoder fixed kneighbors_graph in #4046, we should check all uses in the code for whether we want `include_self==True` or not. I came across this in `SpectralEmbedding` where the current default makes no sense, I think.\nYou can simply `git grep` and should find a lot of occurrences.\nIf our test output wasn't so flooded with warnings, we would have probably detected that earlier :-/\n",
        "issue_id": 4235,
        "pr_number": 4322,
        "pr_title": "[MRG+2] Pass include_self=True to kneighbors_graph",
        "pr_body": "Fixes #4235.\n",
        "issue_closed_at": "2015-03-20T12:05:11Z",
        "base_commit": "4be62142aa3afed11d45c9cdcb066b3d8ba9badf"
      },
      "summary": "### Summary: This issue pertains to the usage of the `kneighbors_graph` function across the codebase, focusing on the parameter `include_self`. A recent fix to this function necessitated a review of its implementations to ensure that the parameter settings align with the intended functionality, particularly in the context of the `SpectralEmbedding` component. The problem was identified when the default behavior in `SpectralEmbedding` appeared illogical, suggesting a broader issue with how `kneighbors_graph` is being utilized elsewhere. The prevalence of warnings in test outputs masked the detection of this inconsistency earlier. The issue affects various scripts and functions, including those related to clustering and manifold learning, such as `plot_cluster_comparison.py`, `plot_ward_structured_vs_unstructured.py`, and the `SpectralEmbedding` class. If unchecked, the issue could lead to incorrect data processing results, thereby affecting the reliability of the software. The problem highlights the importance of consistent parameter settings across key functions to maintain expected behavior and functionality within the system.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: Check all usages of kneighbors_graph\n\nBody:\nAfter @MechCoder fixed kneighbors_graph in #4046, we should check all uses in the code for whether we want `include_self==True` or not. I came across this in `SpectralEmbedding` where the current default makes no sense, I think.\nYou can simply `git grep` and should find a lot of occurrences.\nIf our test output wasn't so flooded with warnings, we would have probably detected that earlier :-/\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nexamples/cluster/plot_cluster_comparison.py\n  line: line 45\n  line: line 67\n  line: line 83\n\nexamples/cluster/plot_ward_structured_vs_unstructured.py\n  line: line 65\n\nsklearn/manifold/spectral_embedding_.py\n  function: SpectralEmbedding._get_affinity_matrix\n"
    },
    {
      "similar_issue": {
        "issue_title": "Better error message for invalid metric in NearestNeighbors ",
        "issue_body": "<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n<!-- Example: Joblib Error thrown when calling fit on LatentDirichletAllocation with evaluate_every > 0-->\r\nError message for invalid metric in NearestNeighbors is unclear.\r\n\r\n#### Steps/Code to Reproduce\r\n<!--\r\nExample:\r\n```python\r\nfrom sklearn.feature_extraction.text import CountVectorizer\r\nfrom sklearn.decomposition import LatentDirichletAllocation\r\n\r\ndocs = [\"Help I have a bug\" for i in range(1000)]\r\n\r\nvectorizer = CountVectorizer(input=docs, analyzer='word')\r\nlda_features = vectorizer.fit_transform(docs)\r\n\r\nlda_model = LatentDirichletAllocation(\r\n    n_topics=10,\r\n    learning_method='online',\r\n    evaluate_every=10,\r\n    n_jobs=4,\r\n)\r\nmodel = lda_model.fit(lda_features)\r\n```\r\nIf the code is too long, feel free to put it in a public gist and link\r\nit in the issue: https://gist.github.com\r\n-->\r\n```python\r\nNearestNeighbors(metric='cheybshev')\r\n```\r\n\r\n#### Expected Results\r\n<!-- Example: No error is thrown. Please paste or describe the expected results.-->\r\nError message stating that metric should be 'cityblock', ... or callable rather than metric not valid for algorithm 'auto'. When I initially saw the error message, I did not realize I had a typo in the metric string. I thought it has something to do with the algorithm.\r\n\r\n#### Actual Results\r\n<!-- Please paste or specifically describe the actual output or traceback. -->\r\n```\r\nValueError: Metric 'cheybshev' not valid for algorithm 'auto'\r\n```\r\n\r\n#### Versions\r\n<!--\r\nPlease run the following snippet and paste the output below.\r\nimport platform; print(platform.platform())\r\nimport sys; print(\"Python\", sys.version)\r\nimport numpy; print(\"NumPy\", numpy.__version__)\r\nimport scipy; print(\"SciPy\", scipy.__version__)\r\nimport sklearn; print(\"Scikit-Learn\", sklearn.__version__)\r\n-->\r\nLinux-4.15.0-24-generic-x86_64-with-debian-stretch-sid\r\nPython 3.6.3 |Anaconda custom (64-bit)| (default, Nov  9 2017, 00:19:18) \r\n[GCC 7.2.0]\r\nNumPy 1.13.3\r\nSciPy 0.19.1\r\nScikit-Learn 0.19.1\r\n\r\n<!-- Thanks for contributing! -->\r\n",
        "issue_id": 11906,
        "pr_number": 11914,
        "pr_title": "[MRG] ENH Better error message for metrics of neighbors",
        "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist\r\n-->\r\n\r\n#### Reference Issues/PRs\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\n#Fixes #11906 \r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\nAdded expression to error message to get list of valid metrics.\r\n\r\n<!--\r\n#### Any other comments?\r\n\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
        "issue_closed_at": "2018-09-13T15:34:02Z",
        "base_commit": "7ed61a24feb4ffde0bee9342acf4a58e3f946a61"
      },
      "summary": "### Summary:\n\nThis issue pertains to the clarity of error messages in the Scikit-learn library, specifically within the NearestNeighbors class, when an invalid distance metric is specified. The primary problem is that the error message provided when a user inputs an incorrect metric is not sufficiently informative, leading to potential misunderstandings about the source of the error. The message suggests an issue with the algorithm rather than clearly indicating the invalid metric name.\n\n1. **Problem Description in General Terms:**\n   The error message for an invalid metric input in the NearestNeighbors class needs improvement to be more user-friendly and informative. The current message can mislead users into thinking the problem lies with the algorithm rather than a simple typo in the metric name.\n\n2. **Key Symptoms and Behaviors Observed:**\n   - Users receive a vague error message when an incorrect metric is specified.\n   - The error message does not explicitly list the acceptable metric options, leading to confusion.\n   - Users may initially believe the issue is related to the algorithm rather than an incorrect metric name.\n\n3. **Affected Components or Systems:**\n   - The error-checking mechanism within the NearestNeighbors class of the Scikit-learn library.\n   - Specifically, the function responsible for validating the metric provided by the user, `NeighborsBase._check_algorithm_metric`.\n\n4. **Potential Impact or Severity:**\n   - The severity is low in terms of functionality, as the error is correctly raised, preventing further processing with an invalid metric.\n   - However, it can significantly impact user experience, leading to confusion and wasted time while diagnosing the issue.\n\n5. **Relevant Technical Details Abstracted for Broader Understanding:**\n   - The issue arises when a user specifies an unsupported or incorrectly spelled distance metric in the NearestNeighbors class.\n   - The library should provide a more descriptive error message, including a list of valid metrics or a suggestion to check for typographical errors.\n\nOverall, the resolution involves enhancing the error message to guide users more effectively in correcting their input, thereby improving the usability and user experience of the Scikit-learn library.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: Better error message for invalid metric in NearestNeighbors \n\nBody:\n<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n<!-- Example: Joblib Error thrown when calling fit on LatentDirichletAllocation with evaluate_every > 0-->\r\nError message for invalid metric in NearestNeighbors is unclear.\r\n\r\n#### Steps/Code to Reproduce\r\n<!--\r\nExample:\r\n```python\r\nfrom sklearn.feature_extraction.text import CountVectorizer\r\nfrom sklearn.decomposition import LatentDirichletAllocation\r\n\r\ndocs = [\"Help I have a bug\" for i in range(1000)]\r\n\r\nvectorizer = CountVectorizer(input=docs, analyzer='word')\r\nlda_features = vectorizer.fit_transform(docs)\r\n\r\nlda_model = LatentDirichletAllocation(\r\n    n_topics=10,\r\n    learning_method='online',\r\n    evaluate_every=10,\r\n    n_jobs=4,\r\n)\r\nmodel = lda_model.fit(lda_features)\r\n```\r\nIf the code is too long, feel free to put it in a public gist and link\r\nit in the issue: https://gist.github.com\r\n-->\r\n```python\r\nNearestNeighbors(metric='cheybshev')\r\n```\r\n\r\n#### Expected Results\r\n<!-- Example: No error is thrown. Please paste or describe the expected results.-->\r\nError message stating that metric should be 'cityblock', ... or callable rather than metric not valid for algorithm 'auto'. When I initially saw the error message, I did not realize I had a typo in the metric string. I thought it has something to do with the algorithm.\r\n\r\n#### Actual Results\r\n<!-- Please paste or specifically describe the actual output or traceback. -->\r\n```\r\nValueError: Metric 'cheybshev' not valid for algorithm 'auto'\r\n```\r\n\r\n#### Versions\r\n<!--\r\nPlease run the following snippet and paste the output below.\r\nimport platform; print(platform.platform())\r\nimport sys; print(\"Python\", sys.version)\r\nimport numpy; print(\"NumPy\", numpy.__version__)\r\nimport scipy; print(\"SciPy\", scipy.__version__)\r\nimport sklearn; print(\"Scikit-Learn\", sklearn.__version__)\r\n-->\r\nLinux-4.15.0-24-generic-x86_64-with-debian-stretch-sid\r\nPython 3.6.3 |Anaconda custom (64-bit)| (default, Nov  9 2017, 00:19:18) \r\n[GCC 7.2.0]\r\nNumPy 1.13.3\r\nSciPy 0.19.1\r\nScikit-Learn 0.19.1\r\n\r\n<!-- Thanks for contributing! -->\r\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nsklearn/neighbors/__init__.py\n  line: line 14\n  line: line 28\n\nsklearn/neighbors/base.py\n  function: NeighborsBase._check_algorithm_metric\n"
    },
    {
      "similar_issue": {
        "issue_title": "LinearDiscriminantAnalysis predict probability bug",
        "issue_body": "I am pretty confident there is a bug introduced in commit\n7c1101d7c26ba0b77184cce9c0b9be79adb526de\n\nConcretely, line 518 of the current version \nhttps://github.com/scikit-learn/scikit-learn/blob/master/sklearn/discriminant_analysis.py\nshould be removed as it yields wrong results. \n\nThere is no reason why constant 1 should be added to the computed probability after exponentiation and before inversion. \n\nTo verify this, I have run a one-to-one comparison between the outcome of the method and MATLAB's builtin LDA classifier on the Iris dataset. Only after removal of line 518, results match (up to a tolerance).\n\nIf everyone agrees on that, I am happy to submit a PR.\n",
        "issue_id": 6848,
        "pr_number": 11796,
        "pr_title": "[MRG+2] Fix LDA predict_proba() ",
        "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist\r\n-->\r\n\r\n#### Reference Issues/PRs\r\nFixes #6848\r\ncloses #11727\r\ncloses #5149\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\n\r\n\r\n#### What does this implement/fix? Explain your changes.\r\nFixes the `predict_proba()` method of LinearDiscriminantAnalysis.\r\nAn `if` statement is used to differentiate between the binary and multi-class case, due to the different output format of the `decision_function` method implemented in the `LinearClassifierMixin` class.\r\n\r\n#### Any other comments?\r\nCopying from #6848:\r\nDo we perhaps want to include additional tests checking the output of predict_proba for LDA and QDA both for the binary and multi-class cases?\r\n\r\n<!--\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
        "issue_closed_at": "2019-03-07T16:44:18Z",
        "base_commit": "b73a51bcda362d94d8907915a382a8eb403554c8"
      },
      "summary": "### Summary: This issue pertains to a bug in the probability prediction mechanism of the `LinearDiscriminantAnalysis` module within the Scikit-learn library. Specifically, the problem arises from an unnecessary addition of a constant value during the calculation of predicted probabilities. This issue was identified in a particular code commit and manifests as a discrepancy in the results when compared to a standard benchmark (MATLAB's LDA classifier) on a known dataset (Iris dataset). The reported behavior suggests that the inclusion of this constant leads to incorrect probability predictions, diverging from expected outcomes. The affected component is the `predict_proba` function in the `discriminant_analysis.py` file. The potential impact of this bug is significant, as it affects the accuracy of the predictive model, potentially leading to incorrect conclusions or decisions based on flawed probability estimates. The issue highlights the importance of precise mathematical operations in statistical computations and the need for consistency with established algorithms.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: LinearDiscriminantAnalysis predict probability bug\n\nBody:\nI am pretty confident there is a bug introduced in commit\n7c1101d7c26ba0b77184cce9c0b9be79adb526de\n\nConcretely, line 518 of the current version \nhttps://github.com/scikit-learn/scikit-learn/blob/master/sklearn/discriminant_analysis.py\nshould be removed as it yields wrong results. \n\nThere is no reason why constant 1 should be added to the computed probability after exponentiation and before inversion. \n\nTo verify this, I have run a one-to-one comparison between the outcome of the method and MATLAB's builtin LDA classifier on the Iris dataset. Only after removal of line 518, results match (up to a tolerance).\n\nIf everyone agrees on that, I am happy to submit a PR.\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nsklearn/discriminant_analysis.py\n  line: line 22\n  function: QuadraticDiscriminantAnalysis.predict_proba\n"
    },
    {
      "similar_issue": {
        "issue_title": "CountVectorizer with custom analyzer ignores input argument",
        "issue_body": "Example:\n\n``` py\ncv = CountVectorizer(analyzer=lambda x: x.split(), input='filename')\ncv.fit(['hello world']).vocabulary_\n```\n\nSame for `input=\"file\"`. Not sure if this should be fixed or just documented; I don't like changing the behavior of the vectorizers yet again...\n",
        "issue_id": 5482,
        "pr_number": 13641,
        "pr_title": "[MRG+1] API make sure vectorizers read data from file before analyzing",
        "pr_body": "Fixes #5482\r\n\r\nIf the given analyzer is a calable, it seems reasonable to assume if `input='file'` or `input='filename'`, the data should be read from the file first, and then passed to the analyzer, the same way as it's done for non-callable analyzers.\r\n\r\nThis PR clarifies this in the docstrings, and passes the \"decoded\" input to the analyzer. It should be less of a concern regarding the input on the bytes vs str since we don't support python2 anymore.\r\n\r\nI'm not entirely sure if this is what we wanna do, it's more of a proposal to move it forward.\r\n\r\nAfter this PR, the following would result in a `FileNotFoundError` exception:\r\n\r\n```python\r\ncv = CountVectorizer(analyzer=lambda x: x.split(), input='filename')\r\ncv.fit(['hello world']).vocabulary_\r\n```",
        "issue_closed_at": "2019-04-23T03:50:24Z",
        "base_commit": "badaa153e67ffa56fb1a413b3b7b5b8507024291"
      },
      "summary": "### Summary:\nThis issue involves the `CountVectorizer` class within the `scikit-learn` library, a popular Python library for data science and machine learning. The problem arises when using a custom analyzer in conjunction with the `input` parameter set to either `'filename'` or `'file'`. The specific concern is that the `CountVectorizer` seems to disregard the `input` argument under these circumstances, leading to unexpected behavior.\n\n1. **Problem description in general terms**: The issue centers around the `CountVectorizer` class not respecting the `input` parameter when a custom analyzer function is provided. This inconsistency could lead to confusion and incorrect data processing if the user expects the vectorizer to handle file inputs.\n\n2. **Key symptoms and behaviors observed**: Users have observed that when they specify the `input` parameter as `'filename'` or `'file'` along with a custom analyzer, the `CountVectorizer` ignores the `input` argument. This results in the vectorizer not being able to process file inputs as intended.\n\n3. **Affected components or systems**: The main component affected is the `CountVectorizer` class within the `scikit-learn` library's text processing module. This also extends to other classes such as `TfidfVectorizer` and potentially `HashingVectorizer`, which share similar functionality.\n\n4. **Potential impact or severity**: The impact of this issue can vary from minor to moderate. For users who rely on processing file inputs with custom analyzers, this can lead to significant disruptions in their data pipeline, potentially leading to incorrect analysis results or additional workaround implementations.\n\n5. **Relevant technical details abstracted for broader understanding**: The `CountVectorizer` is a tool used to convert a collection of text documents into a matrix of token counts. The `input` parameter is supposed to dictate the type of input source, such as file content. However, when a custom analyzer is used, this parameter is disregarded, likely due to an oversight in handling the integration of custom functions with file input processing.\n\nChanges Summary:\n- The modifications were made in the `sklearn/feature_extraction/text.py` file.\n- Changes occurred around the function `VectorizerMixin._check_stop_words_consistency`.\n- The `HashingVectorizer`, `CountVectorizer`, and `TfidfVectorizer` classes were reviewed or modified to address the issue.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: CountVectorizer with custom analyzer ignores input argument\n\nBody:\nExample:\n\n``` py\ncv = CountVectorizer(analyzer=lambda x: x.split(), input='filename')\ncv.fit(['hello world']).vocabulary_\n```\n\nSame for `input=\"file\"`. Not sure if this should be fixed or just documented; I don't like changing the behavior of the vectorizers yet again...\n\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nsklearn/feature_extraction/text.py\n  line: line 31\n  function: VectorizerMixin._check_stop_words_consistency\n  class: HashingVectorizer\n  class: CountVectorizer\n  class: TfidfVectorizer\n"
    },
    {
      "similar_issue": {
        "issue_title": "Different r2_score multioutput default in r2_score and base.RegressorMixin",
        "issue_body": "We've changed multioutput default in r2_score to \"uniform_average\" in 0.19, but in base.RegressorMixin, we still use ``multioutput='variance_weighted'`` (#5143).\r\nAlso see the strange things below:\r\nhttps://github.com/scikit-learn/scikit-learn/blob/4603e481e9ac67eaf906ae5936263b675ba9bc9c/sklearn/multioutput.py#L283-L286",
        "issue_id": 12772,
        "pr_number": 13157,
        "pr_title": "[MRG+1] API Change default multioutput in RegressorMixin.score to keep consistent with metrics.r2_score",
        "pr_body": "Closes #12772 \r\nWondering if someone has a better way :)\r\nIn the original issue, I tried to ask why we prefer uniform_average, but received no reply. I guess we choose uniform_average to keep consistent with other regression metrics.",
        "issue_closed_at": "2019-03-15T09:47:51Z",
        "base_commit": "85440978f517118e78dc15f84e397d50d14c8097"
      },
      "summary": "### Summary:\nThis issue highlights an inconsistency in the default parameter settings for multi-output scoring within a widely-used machine learning library, scikit-learn. Specifically, the problem arises from a discrepancy between the default values used for the `multioutput` parameter in the `r2_score` function and the `base.RegressorMixin` class. The `r2_score` function was updated to use \"uniform_average\" as the default setting for `multioutput` in version 0.19, whereas `base.RegressorMixin` continues to use \"variance_weighted\". This inconsistency can lead to confusion and unexpected behavior for users relying on default configurations across different components of the library.\n\nKey symptoms and behaviors observed include the unexpected default behavior of regression scoring functions when dealing with multi-output data, potentially leading to incorrect model evaluations if users are not aware of the differing defaults.\n\nThe affected components are primarily within the scikit-learn library, specifically the `r2_score` function and any classes inheriting from `base.RegressorMixin`, which may include various regression models and utilities, as well as other related modules like multi-output regression handling.\n\nThe potential impact of this issue is moderate, as it may lead to incorrect model evaluation results if users are unaware of the differing default settings. However, the severity is mitigated by the fact that it only affects specific use cases involving multi-output data.\n\nRelevant technical details include the need for consistency in default parameter settings across the library to prevent unexpected behavior and ensure that users have a seamless experience when using different components of the library for regression tasks. The discrepancy also highlights the importance of updating documentation and guides to reflect any changes in default settings.",
      "prompt_used": "You are an expert in software issue reasoning analysis.\nGiven the following problem report and its fixed code elements, generate a comprehensive summary based on the entire document. Your goal is to abstract the information in the problem description into a more general description.\n\n## Original Issue Report:\nTitle: Different r2_score multioutput default in r2_score and base.RegressorMixin\n\nBody:\nWe've changed multioutput default in r2_score to \"uniform_average\" in 0.19, but in base.RegressorMixin, we still use ``multioutput='variance_weighted'`` (#5143).\r\nAlso see the strange things below:\r\nhttps://github.com/scikit-learn/scikit-learn/blob/4603e481e9ac67eaf906ae5936263b675ba9bc9c/sklearn/multioutput.py#L283-L286\n\n## Code elements fixed by the patch:\n{FIXED_CODE_ELEMENTS}\n\nPlease analyze the above issue report and provide a structured summary that includes:\n1. Problem description in general terms\n2. Key symptoms and behaviors observed\n3. Affected components or systems\n4. Potential impact or severity\n5. Any relevant technical details abstracted for broader understanding\n\nPlease return the summary with “### Summary:\", For example:\n### Summary: This issue is ...\n\nChanges Summary:\nsklearn/base.py\n  function: DensityMixin.score\n\nsklearn/linear_model/coordinate_descent.py\n  class: MultiTaskLassoCV\n\nsklearn/multioutput.py\n  function: MultiOutputRegressor.partial_fit\n"
    }
  ]
}