{
  "Selected_candidate": {
    "pr_number": 11914,
    "pr_title": "[MRG] ENH Better error message for metrics of neighbors",
    "pr_body": "<!--\r\nThanks for contributing a pull request! Please ensure you have taken a look at\r\nthe contribution guidelines: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#pull-request-checklist\r\n-->\r\n\r\n#### Reference Issues/PRs\r\n<!--\r\nExample: Fixes #1234. See also #3456.\r\nPlease use keywords (e.g., Fixes) to create link to the issues or pull requests\r\nyou resolved, so that they will automatically be closed when your pull request\r\nis merged. See https://github.com/blog/1506-closing-issues-via-pull-requests\r\n-->\r\n#Fixes #11906 \r\n\r\n#### What does this implement/fix? Explain your changes.\r\n\r\nAdded expression to error message to get list of valid metrics.\r\n\r\n<!--\r\n#### Any other comments?\r\n\r\nPlease be aware that we are a loose team of volunteers so patience is\r\nnecessary; assistance handling other issues is very welcome. We value\r\nall user contributions, no matter how minor they are. If we are slow to\r\nreview, either the pull request needs some benchmarking, tinkering,\r\nconvincing, etc. or more likely the reviewers are simply busy. In either\r\ncase, we ask for your understanding during the review process.\r\nFor more information, see our FAQ on this topic:\r\nhttp://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention.\r\n\r\nThanks for contributing!\r\n-->\r\n",
    "issue_id": 11906,
    "issue_title": "Better error message for invalid metric in NearestNeighbors ",
    "issue_body": "<!--\r\nIf your issue is a usage question, submit it here instead:\r\n- StackOverflow with the scikit-learn tag: http://stackoverflow.com/questions/tagged/scikit-learn\r\n- Mailing List: https://mail.python.org/mailman/listinfo/scikit-learn\r\nFor more information, see User Questions: http://scikit-learn.org/stable/support.html#user-questions\r\n-->\r\n\r\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\r\n\r\n#### Description\r\n<!-- Example: Joblib Error thrown when calling fit on LatentDirichletAllocation with evaluate_every > 0-->\r\nError message for invalid metric in NearestNeighbors is unclear.\r\n\r\n#### Steps/Code to Reproduce\r\n<!--\r\nExample:\r\n```python\r\nfrom sklearn.feature_extraction.text import CountVectorizer\r\nfrom sklearn.decomposition import LatentDirichletAllocation\r\n\r\ndocs = [\"Help I have a bug\" for i in range(1000)]\r\n\r\nvectorizer = CountVectorizer(input=docs, analyzer='word')\r\nlda_features = vectorizer.fit_transform(docs)\r\n\r\nlda_model = LatentDirichletAllocation(\r\n    n_topics=10,\r\n    learning_method='online',\r\n    evaluate_every=10,\r\n    n_jobs=4,\r\n)\r\nmodel = lda_model.fit(lda_features)\r\n```\r\nIf the code is too long, feel free to put it in a public gist and link\r\nit in the issue: https://gist.github.com\r\n-->\r\n```python\r\nNearestNeighbors(metric='cheybshev')\r\n```\r\n\r\n#### Expected Results\r\n<!-- Example: No error is thrown. Please paste or describe the expected results.-->\r\nError message stating that metric should be 'cityblock', ... or callable rather than metric not valid for algorithm 'auto'. When I initially saw the error message, I did not realize I had a typo in the metric string. I thought it has something to do with the algorithm.\r\n\r\n#### Actual Results\r\n<!-- Please paste or specifically describe the actual output or traceback. -->\r\n```\r\nValueError: Metric 'cheybshev' not valid for algorithm 'auto'\r\n```\r\n\r\n#### Versions\r\n<!--\r\nPlease run the following snippet and paste the output below.\r\nimport platform; print(platform.platform())\r\nimport sys; print(\"Python\", sys.version)\r\nimport numpy; print(\"NumPy\", numpy.__version__)\r\nimport scipy; print(\"SciPy\", scipy.__version__)\r\nimport sklearn; print(\"Scikit-Learn\", sklearn.__version__)\r\n-->\r\nLinux-4.15.0-24-generic-x86_64-with-debian-stretch-sid\r\nPython 3.6.3 |Anaconda custom (64-bit)| (default, Nov  9 2017, 00:19:18) \r\n[GCC 7.2.0]\r\nNumPy 1.13.3\r\nSciPy 0.19.1\r\nScikit-Learn 0.19.1\r\n\r\n<!-- Thanks for contributing! -->\r\n",
    "issue_closed_at": "2018-09-13T15:34:02Z",
    "base_commit": "7ed61a24feb4ffde0bee9342acf4a58e3f946a61",
    "changes": [
      {
        "file": "sklearn/neighbors/__init__.py",
        "type": "line",
        "name": "line 14",
        "code": "from .kde import KernelDensity\nfrom .approximate import LSHForest\nfrom .lof import LocalOutlierFactor\n\n__all__ = ['BallTree',\n           'DistanceMetric',"
      },
      {
        "file": "sklearn/neighbors/__init__.py",
        "type": "line",
        "name": "line 28",
        "code": "           'radius_neighbors_graph',\n           'KernelDensity',\n           'LSHForest',\n           'LocalOutlierFactor']"
      },
      {
        "file": "sklearn/neighbors/base.py",
        "type": "function",
        "name": "_check_algorithm_metric",
        "class_name": "NeighborsBase",
        "code": "def _check_algorithm_metric(self):\n        if self.algorithm not in ['auto', 'brute',\n                                  'kd_tree', 'ball_tree']:\n            raise ValueError(\"unrecognized algorithm: '%s'\" % self.algorithm)\n\n        if self.algorithm == 'auto':\n            if self.metric == 'precomputed':\n                alg_check = 'brute'\n            elif (callable(self.metric) or\n                  self.metric in VALID_METRICS['ball_tree']):\n                alg_check = 'ball_tree'\n            else:\n                alg_check = 'brute'\n        else:\n            alg_check = self.algorithm\n\n        if callable(self.metric):\n            if self.algorithm == 'kd_tree':\n                # callable metric is only valid for brute force and ball_tree\n                raise ValueError(\n                    \"kd_tree algorithm does not support callable metric '%s'\"\n                    % self.metric)\n        elif self.metric not in VALID_METRICS[alg_check]:\n            raise ValueError(\"Metric '%s' not valid for algorithm '%s'\"\n                             % (self.metric, self.algorithm))\n\n        if self.metric_params is not None and 'p' in self.metric_params:\n            warnings.warn(\"Parameter p is found in metric_params. \"\n                          \"The corresponding parameter from __init__ \"\n                          \"is ignored.\", SyntaxWarning, stacklevel=3)\n            effective_p = self.metric_params['p']\n        else:\n            effective_p = self.p\n\n        if self.metric in ['wminkowski', 'minkowski'] and effective_p < 1:\n            raise ValueError(\"p must be greater than one for minkowski metric\")"
      }
    ]
  },
  "Justification": "Candidate A addresses issues related to parameter checking in scikit-learn, which is similar to the CURRENT bug involving parameter type checks in GridSearchCV. Both are related to validation of input parameters and ensuring they conform to expected types. The recommendation to improve error messages and validation practices in Candidate A aligns with the developer's interest in enhancing parameter checks for new estimators, thereby providing context and potential solutions that can guide the debugging of the CURRENT bug. Its structural similarity in the context of type-checking makes it the most relevant choice.",
  "instance_id": "scikit-learn__scikit-learn-14092",
  "repo": "scikit-learn/scikit-learn",
  "created_at": "2019-06-14T14:16:17Z",
  "problem_statement": "NCA fails in GridSearch due to too strict parameter checks\nNCA checks its parameters to have a specific type, which can easily fail in a GridSearch due to how param grid is made.\r\n\r\nHere is an example:\r\n```python\r\nimport numpy as np\r\n\r\nfrom sklearn.pipeline import Pipeline\r\nfrom sklearn.model_selection import GridSearchCV\r\nfrom sklearn.neighbors import NeighborhoodComponentsAnalysis\r\nfrom sklearn.neighbors import KNeighborsClassifier\r\n\r\nX = np.random.random_sample((100, 10))\r\ny = np.random.randint(2, size=100)\r\n\r\nnca = NeighborhoodComponentsAnalysis()\r\nknn = KNeighborsClassifier()\r\n\r\npipe = Pipeline([('nca', nca),\r\n                 ('knn', knn)])\r\n                \r\nparams = {'nca__tol': [0.1, 0.5, 1],\r\n          'nca__n_components': np.arange(1, 10)}\r\n          \r\ngs = GridSearchCV(estimator=pipe, param_grid=params, error_score='raise')\r\ngs.fit(X,y)\r\n```\r\n\r\nThe issue is that for `tol`: 1 is not a float, and for  `n_components`: np.int64 is not int\r\n\r\nBefore proposing a fix for this specific situation, I'd like to have your general opinion about parameter checking.  \r\nI like this idea of common parameter checking tool introduced with the NCA PR. What do you think about extending it across the code-base (or at least for new or recent estimators) ?\r\n\r\nCurrently parameter checking is not always done or often partially done, and is quite redundant. For instance, here is the input validation of lda:\r\n```python\r\ndef _check_params(self):\r\n        \"\"\"Check model parameters.\"\"\"\r\n        if self.n_components <= 0:\r\n            raise ValueError(\"Invalid 'n_components' parameter: %r\"\r\n                             % self.n_components)\r\n\r\n        if self.total_samples <= 0:\r\n            raise ValueError(\"Invalid 'total_samples' parameter: %r\"\r\n                             % self.total_samples)\r\n\r\n        if self.learning_offset < 0:\r\n            raise ValueError(\"Invalid 'learning_offset' parameter: %r\"\r\n                             % self.learning_offset)\r\n\r\n        if self.learning_method not in (\"batch\", \"online\"):\r\n            raise ValueError(\"Invalid 'learning_method' parameter: %r\"\r\n                             % self.learning_method)\r\n```\r\nmost params aren't checked and for those who are there's a lot of duplicated code.\r\n\r\nA propose to be upgrade the new tool to be able to check open/closed intervals (currently only closed) and list membership.\r\n\r\nThe api would be something like that:\r\n```\r\ncheck_param(param, name, valid_options)\r\n```\r\nwhere valid_options would be a dict of `type: constraint`. e.g for the `beta_loss` param of `NMF`, it can be either a float or a string in a list, which would give\r\n```\r\nvalid_options = {numbers.Real: None,  # None for no constraint\r\n                 str: ['frobenius', 'kullback-leibler', 'itakura-saito']}\r\n```\r\nSometimes a parameter can only be positive or within a given interval, e.g. `l1_ratio` of `LogisticRegression` must be between 0 and 1, which would give\r\n```\r\nvalid_options = {numbers.Real: Interval(0, 1, closed='both')}\r\n```\r\npositivity of e.g. `max_iter` would be `numbers.Integral: Interval(left=1)`.\n",
  "patch": "diff --git a/sklearn/neighbors/nca.py b/sklearn/neighbors/nca.py\n--- a/sklearn/neighbors/nca.py\n+++ b/sklearn/neighbors/nca.py\n@@ -13,6 +13,7 @@\n import numpy as np\n import sys\n import time\n+import numbers\n from scipy.optimize import minimize\n from ..utils.extmath import softmax\n from ..metrics import pairwise_distances\n@@ -299,7 +300,8 @@ def _validate_params(self, X, y):\n \n         # Check the preferred dimensionality of the projected space\n         if self.n_components is not None:\n-            check_scalar(self.n_components, 'n_components', int, 1)\n+            check_scalar(\n+                self.n_components, 'n_components', numbers.Integral, 1)\n \n             if self.n_components > X.shape[1]:\n                 raise ValueError('The preferred dimensionality of the '\n@@ -318,9 +320,9 @@ def _validate_params(self, X, y):\n                                  .format(X.shape[1],\n                                          self.components_.shape[1]))\n \n-        check_scalar(self.max_iter, 'max_iter', int, 1)\n-        check_scalar(self.tol, 'tol', float, 0.)\n-        check_scalar(self.verbose, 'verbose', int, 0)\n+        check_scalar(self.max_iter, 'max_iter', numbers.Integral, 1)\n+        check_scalar(self.tol, 'tol', numbers.Real, 0.)\n+        check_scalar(self.verbose, 'verbose', numbers.Integral, 0)\n \n         if self.callback is not None:\n             if not callable(self.callback):\n"
}