<|im_start|>system
You are given a machine learning task. You must solve the task by training a model and running the model on the test set to produce a submission file.<|im_end|>
<|im_start|>user
First outline your proposed solution in natural language (3-5 sentences), followed by a single markdown code block (wrapped in ```) which implements this solution. Note:
- Datasets train.csv and test.csv are available in `/workdir/us-patent-phrase-to-phrase-matching/prepared/public/`.
- You MUST produce a submission file at `./submission.csv` by running your model on the test split.
- Your task is to predict similarity scores (0-1) between pairs of phrases based on their context. 
- Submissions are evaluated on the Pearson correlation coefficient between predicted and actual scores.
- You should import packages such as: `pandas`, `scipy.stats`, `torch`, `lightgbm`, `sklearn`, `numpy`.
- You can't access the internet so don't use any pre-trained models need downloading.
- For each id (representing a pair of phrases) in the test set, you must predict the similarity score.
- `./submission.csv` should have the following format:
```
id,score
4112d61851461f60,0
09e418c93a776564,0.25
etc.
```
- Data snippet:
-> /workdir/us-patent-phrase-to-phrase-matching/prepared/public/train.csv has 32825 rows and 5 columns.
id (object) has 32825 unique values. Some example values: ['4e3ae02c7c3dc246', '378b8322a01b88db']
anchor (object) has 733 unique values. Some example values: ['component composite coating', 'sheet supply roller']
target (object) has 26850 unique values. Some example values: ['composition', 'motor']
context (object) has 106 unique values. Some example values: ['H04', 'H01']
score (float64) has 5 unique values: [0.5, 0.25]
-> /workdir/us-patent-phrase-to-phrase-matching/prepared/public/test.csv has 4 columns.
id,anchor,target,context
2a988c7d98568627,project onto surface,disposing,G03
- Note: You need to use `anchor`, `target`, and `context` as features to predict `score` for each row in `/workdir/us-patent-phrase-to-phrase-matching/prepared/public/test.csv`. Double check that you imported all the packages used. Define every variable used. Do not generate a submission file for any other data splits.
<|im_end|>
<|im_start|>assistant
