{
  "problem": {
    "name": "usp_p2p",
    "description": "In this competition, you will train your models on a novel semantic similarity dataset to extract relevant information by matching key phrases in patent documents. Determining the semantic similarity between phrases is critically important during the patent search and examination process to determine if an invention has been described before. For example, if one invention claims \"television set\" and a prior publication describes \"TV set\", a model would ideally recognize these are the same and assist a patent attorney or examiner in retrieving relevant documents. This extends beyond paraphrase identification; if one invention claims a \"strong material\" and another uses \"steel\", that may also be a match. What counts as a \"strong material\" varies per domain (it may be steel in one domain and ripstop fabric in another, but you wouldn't want your parachute made of steel). We have included the Cooperative Patent Classification as the technical domain context as an additional feature to help you disambiguate these situations. \n Can you build a model to match phrases in order to extract contextual information, thereby helping the patent community connect the dots between millions of patent documents? \n Models are evaluated on the Pearson correlation coefficient between the predicted and actual similarity scores. \n In the dataset, you are presented pairs of phrases (an anchor and a target phrase) and asked to rate how similar they are on a scale from 0 (not at all similar) to 1 (identical in meaning). This challenge differs from a standard semantic similarity task in that similarity has been scored here within a patent's context, specifically its CPC classification (version 2021.05), which indicates the subject to which the patent relates. For example, while the phrases \"bird\" and \"Cape Cod\" may have low semantic similarity in normal language, the likeness of their meaning is much closer if considered in the context of \"house\".\n\nThis is a code competition in which you will submit code that will be run against an unseen test set. The unseen test set contains approximately 12 000 pairs of phrases. A small public test set has been provided for testing purposes but is not used in scoring.\n\nInformation on the meaning of CPC codes may be found on the USPTO website. The CPC version 2021.05 can be found on the CPC archive website.\n\nScore meanings:\n- 1.0: Very close match (usually exact match except for minor changes in conjugation, quantity, or stopwords).\n- 0.75: Close synonym or abbreviation (for example, \"mobile phone\" vs. \"cellphone\" or \"TCP\" → \"transmission control protocol\").\n- 0.5: Synonyms with different breadth (hyponym/hypernym matches).\n- 0.25: Somewhat related (same high‐level domain or antonyms).\n- 0.0: Unrelated.\n\nFiles:\n- train.csv: the training set, containing phrases, contexts, and their similarity scores\n- test.csv: the test set, identical in structure to the training set but including true scores\n\nColumns:\n- id: unique identifier for a phrase pair\n- anchor: the first phrase\n- target: the second phrase\n- context: the CPC classification (version 2021.05) indicating the subject within which similarity is scored\n- score: the similarity value, sourced from one or more manual expert ratings",
    "metric": "pearson_correlation",
    "interface": "deepevolve_interface.py"
  },
  "initial_idea": {
    "title": "Fine-tune the Patent BERT model on the USP-P2P dataset",
    "content": "The idea first uses the `anferico/bert-for-patents` model with a single-label regression head. It then tokenizes each example by joining the anchor, target, and context with `[SEP]`, fine-tunes for one epoch (batch size = 160, learning rate = 2e-5) without checkpointing or logging, and finally evaluates on the test set by computing the Pearson correlation between predicted and actual scores.",
    "supplement": "BERT for Patents: https://huggingface.co/anferico/bert-for-patents"
  }
}