{
  "Selected_candidate": {
    "pr_number": 11914,
    "pr_title": "[MRG] ENH Better error message for metrics of neighbors",
    "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist\r\n-->\r\n\r\n#### Reference Issues/PRs\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\n#Fixes #11906 \r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\nAdded expression to error message to get list of valid metrics.\r\n\r\n<!--\r\n#### Any other comments?\r\n\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
    "issue_id": 11906,
    "issue_title": "Better error message for invalid metric in NearestNeighbors ",
    "issue_body": "<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n<!-- Example: Joblib Error thrown when calling fit on LatentDirichletAllocation with evaluate_every > 0-->\r\nError message for invalid metric in NearestNeighbors is unclear.\r\n\r\n#### Steps/Code to Reproduce\r\n<!--\r\nExample:\r\n```python\r\nfrom sklearn.feature_extraction.text import CountVectorizer\r\nfrom sklearn.decomposition import LatentDirichletAllocation\r\n\r\ndocs = [\"Help I have a bug\" for i in range(1000)]\r\n\r\nvectorizer = CountVectorizer(input=docs, analyzer='word')\r\nlda_features = vectorizer.fit_transform(docs)\r\n\r\nlda_model = LatentDirichletAllocation(\r\n    n_topics=10,\r\n    learning_method='online',\r\n    evaluate_every=10,\r\n    n_jobs=4,\r\n)\r\nmodel = lda_model.fit(lda_features)\r\n```\r\nIf the code is too long, feel free to put it in a public gist and link\r\nit in the issue: https://gist.github.com\r\n-->\r\n```python\r\nNearestNeighbors(metric='cheybshev')\r\n```\r\n\r\n#### Expected Results\r\n<!-- Example: No error is thrown. Please paste or describe the expected results.-->\r\nError message stating that metric should be 'cityblock', ... or callable rather than metric not valid for algorithm 'auto'. When I initially saw the error message, I did not realize I had a typo in the metric string. I thought it has something to do with the algorithm.\r\n\r\n#### Actual Results\r\n<!-- Please paste or specifically describe the actual output or traceback. -->\r\n```\r\nValueError: Metric 'cheybshev' not valid for algorithm 'auto'\r\n```\r\n\r\n#### Versions\r\n<!--\r\nPlease run the following snippet and paste the output below.\r\nimport platform; print(platform.platform())\r\nimport sys; print(\"Python\", sys.version)\r\nimport numpy; print(\"NumPy\", numpy.__version__)\r\nimport scipy; print(\"SciPy\", scipy.__version__)\r\nimport sklearn; print(\"Scikit-Learn\", sklearn.__version__)\r\n-->\r\nLinux-4.15.0-24-generic-x86_64-with-debian-stretch-sid\r\nPython 3.6.3 |Anaconda custom (64-bit)| (default, Nov  9 2017, 00:19:18) \r\n[GCC 7.2.0]\r\nNumPy 1.13.3\r\nSciPy 0.19.1\r\nScikit-Learn 0.19.1\r\n\r\n<!-- Thanks for contributing! -->\r\n",
    "issue_closed_at": "2018-09-13T15:34:02Z",
    "base_commit": "7ed61a24feb4ffde0bee9342acf4a58e3f946a61",
    "changes": [
      {
        "file": "sklearn/neighbors/__init__.py",
        "type": "line",
        "name": "line 14",
        "code": "from .kde import KernelDensity\nfrom .approximate import LSHForest\nfrom .lof import LocalOutlierFactor\n\n__all__ = ['BallTree',\n           'DistanceMetric',"
      },
      {
        "file": "sklearn/neighbors/__init__.py",
        "type": "line",
        "name": "line 28",
        "code": "           'radius_neighbors_graph',\n           'KernelDensity',\n           'LSHForest',\n           'LocalOutlierFactor']"
      },
      {
        "file": "sklearn/neighbors/base.py",
        "type": "function",
        "name": "_check_algorithm_metric",
        "class_name": "NeighborsBase",
        "code": "def _check_algorithm_metric(self):\n        if self.algorithm not in ['auto', 'brute',\n                                  'kd_tree', 'ball_tree']:\n            raise ValueError(\"unrecognized algorithm: '%s'\" % self.algorithm)\n\n        if self.algorithm == 'auto':\n            if self.metric == 'precomputed':\n                alg_check = 'brute'\n            elif (callable(self.metric) or\n                  self.metric in VALID_METRICS['ball_tree']):\n                alg_check = 'ball_tree'\n            else:\n                alg_check = 'brute'\n        else:\n            alg_check = self.algorithm\n\n        if callable(self.metric):\n            if self.algorithm == 'kd_tree':\n                # callable metric is only valid for brute force and ball_tree\n                raise ValueError(\n                    \"kd_tree algorithm does not support callable metric '%s'\"\n                    % self.metric)\n        elif self.metric not in VALID_METRICS[alg_check]:\n            raise ValueError(\"Metric '%s' not valid for algorithm '%s'\"\n                             % (self.metric, self.algorithm))\n\n        if self.metric_params is not None and 'p' in self.metric_params:\n            warnings.warn(\"Parameter p is found in metric_params. \"\n                          \"The corresponding parameter from __init__ \"\n                          \"is ignored.\", SyntaxWarning, stacklevel=3)\n            effective_p = self.metric_params['p']\n        else:\n            effective_p = self.p\n\n        if self.metric in ['wminkowski', 'minkowski'] and effective_p < 1:\n            raise ValueError(\"p must be greater than one for minkowski metric\")"
      }
    ]
  },
  "Justification": "Candidate A addresses a similar structural issue related to improper handling of inputs in the `NearestNeighbors` function when an invalid metric is specified. Both the CURRENT bug and Candidate A's report deal with criteria that could lead to future compatibility issues with NumPy—specifically, the comparison of types or structures that could lead to errors. The insights from Candidate A could help in establishing clearer error messages or input validation strategies in the `_estimate_mi` function. Moreover, the shared modules indicate mutual components which could provide a better understanding of the implementation. This relevance of shared underlying mechanics makes it particularly useful for debugging the CURRENT bug.",
  "instance_id": "scikit-learn__scikit-learn-13497",
  "repo": "scikit-learn/scikit-learn",
  "created_at": "2019-03-23T14:28:08Z",
  "problem_statement": "Comparing string to array in _estimate_mi\nIn ``_estimate_mi`` there is ``discrete_features == 'auto'`` but discrete features can be an array of indices or a boolean mask.\r\nThis will error in future versions of numpy.\r\nAlso this means we never test this function with discrete features != 'auto', it seems?\n",
  "patch": "diff --git a/sklearn/feature_selection/mutual_info_.py b/sklearn/feature_selection/mutual_info_.py\n--- a/sklearn/feature_selection/mutual_info_.py\n+++ b/sklearn/feature_selection/mutual_info_.py\n@@ -10,7 +10,7 @@\n from ..preprocessing import scale\n from ..utils import check_random_state\n from ..utils.fixes import _astype_copy_false\n-from ..utils.validation import check_X_y\n+from ..utils.validation import check_array, check_X_y\n from ..utils.multiclass import check_classification_targets\n \n \n@@ -247,14 +247,16 @@ def _estimate_mi(X, y, discrete_features='auto', discrete_target=False,\n     X, y = check_X_y(X, y, accept_sparse='csc', y_numeric=not discrete_target)\n     n_samples, n_features = X.shape\n \n-    if discrete_features == 'auto':\n-        discrete_features = issparse(X)\n-\n-    if isinstance(discrete_features, bool):\n+    if isinstance(discrete_features, (str, bool)):\n+        if isinstance(discrete_features, str):\n+            if discrete_features == 'auto':\n+                discrete_features = issparse(X)\n+            else:\n+                raise ValueError(\"Invalid string value for discrete_features.\")\n         discrete_mask = np.empty(n_features, dtype=bool)\n         discrete_mask.fill(discrete_features)\n     else:\n-        discrete_features = np.asarray(discrete_features)\n+        discrete_features = check_array(discrete_features, ensure_2d=False)\n         if discrete_features.dtype != 'bool':\n             discrete_mask = np.zeros(n_features, dtype=bool)\n             discrete_mask[discrete_features] = True\n"
}