{
    "problem": {
      "name": "parkinson_disease",
      "description": "Goal of the Competition:\nThe goal of this competition is to predict MDS-UPDR scores, which measure progression in patients with Parkinson's disease. The Movement Disorder Society-Sponsored Revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS) is a comprehensive assessment of both motor and non-motor symptoms associated with Parkinson's. You will develop a model trained on data of protein and peptide levels over time in subjects with Parkinson's disease versus normal age-matched control subjects.\n\nYour work could help provide important breakthrough information about which molecules change as Parkinson's disease progresses.\n\nContext:\nParkinson's disease (PD) is a disabling brain disorder that affects movements, cognition, sleep, and other normal functions. Unfortunately, there is no current cure—and the disease worsens over time. It's estimated that by 2037, 1.6 million people in the U.S. will have Parkinson's disease, at an economic cost approaching $80 billion. Research indicates that protein or peptide abnormalities play a key role in the onset and worsening of this disease. Gaining a better understanding of this—with the help of data science—could provide important clues for the development of new pharmacotherapies to slow the progression or cure Parkinson's disease.\n\nCurrent efforts have resulted in complex clinical and neurobiological data on over 10,000 subjects for broad sharing with the research community. A number of important findings have been published using this data, but clear biomarkers or cures are still lacking.\n\nCompetition host:\nThe Accelerating Medicines Partnership® Parkinson's Disease (AMP®PD) is a public-private partnership between government, industry, and nonprofits that is managed through the Foundation of the National Institutes of Health (FNIH). 
The Partnership created the AMP PD Knowledge Platform, which includes a deep molecular characterization and longitudinal clinical profiling of Parkinson's disease patients, with the goal of identifying and validating diagnostic, prognostic, and/or disease progression biomarkers for Parkinson's disease.\n\nYour work could help in the search for a cure for Parkinson's disease, which would alleviate the substantial suffering and medical care costs of patients with this disease.\n\nThis is a Code Competition. Refer to Code Requirements for details.\n\nEvaluation:\nSubmissions are evaluated on SMAPE between forecasts and actual values. We define SMAPE = 0 when the actual and predicted values are both 0.\n\nFor each patient visit where a protein/peptide sample was taken you will need to estimate both their UPDRS scores for that visit and predict their scores for any potential visits 6, 12, and 24 months later. Predictions for any visits that didn't ultimately take place are ignored.\n\nDataset Description:\nThe goal of this competition is to predict the course of Parkinson's disease (PD) using protein abundance data. The complete set of proteins involved in PD remains an open research question and any proteins that have predictive value are likely worth investigating further. The core of the dataset consists of protein abundance values derived from mass spectrometry readings of cerebrospinal fluid (CSF) samples gathered from several hundred patients. Each patient contributed several samples over the course of multiple years while they also took assessments of PD severity.\n\nThis is a time-series code competition: you will receive test set data and make predictions with Kaggle's time-series API.\n\nFiles:\n- train_peptides.csv: Mass spectrometry data at the peptide level. 
Columns:\n  - visit_id: ID code for the visit.\n  - visit_month: The month of the visit, relative to the first visit by the patient.\n  - patient_id: An ID code for the patient.\n  - UniProt: The UniProt ID code for the associated protein. There are often several peptides per protein.\n  - Peptide: The sequence of amino acids included in the peptide. See this table for the relevant codes. Some rare annotations may not be included in the table. The test set may include peptides not found in the train set.\n  - PeptideAbundance: The frequency of the amino acid in the sample.\n- train_proteins.csv: Protein expression frequencies aggregated from the peptide level data. Columns:\n  - visit_id: ID code for the visit.\n  - visit_month: The month of the visit, relative to the first visit by the patient.\n  - patient_id: An ID code for the patient.\n  - UniProt: The UniProt ID code for the associated protein. There are often several peptides per protein. The test set may include proteins not found in the train set.\n  - NPX: Normalized protein expression. The frequency of the protein's occurrence in the sample. May not have a 1:1 relationship with the component peptides as some proteins contain repeated copies of a given peptide.\n- train_clinical_data.csv: Clinical data. Columns:\n  - visit_id: ID code for the visit.\n  - visit_month: The month of the visit, relative to the first visit by the patient.\n  - patient_id: An ID code for the patient.\n  - updrs_1 to updrs_4: The patient's score for part N of the Unified Parkinson's Disease Rating Scale. Higher numbers indicate more severe symptoms. Each sub-section covers a distinct category of symptoms, such as mood and behavior for Part 1 and motor functions for Part 3.\n  - upd23b_clinical_state_on_medication: Whether or not the patient was taking medication such as Levodopa during the UPDRS assessment. Expected to mainly affect the scores for Part 3 (motor function). 
These medications wear off fairly quickly (on the order of one day), so it's common for patients to take the motor function exam twice in a single month, both with and without medication.\n- supplemental_clinical_data.csv: Clinical records without any associated CSF samples. This data is intended to provide additional context about the typical progression of Parkinson's. Uses the same columns as train_clinical_data.csv.\n- example_test_files/: Data intended to illustrate how the API functions. Includes the same columns delivered by the API (i.e., no updrs columns).\n- amp_pd_peptide/: Files that enable the API. Expect the API to deliver all of the data (fewer than 1,000 additional patients) in under five minutes and to reserve less than 0.5 GB of memory. A brief demonstration of what the API delivers is available here.\n- public_timeseries_testing_util.py: An optional file intended to make it easier to run custom offline API tests. ",
      "metric": "(1-SMAPE(%))/2, where SMAPE is Symmetric mean absolute percentage error (SMAPE or sMAPE)",
      "interface": "deepevolve_interface.py"
    },
    "initial_idea": {
      "title": "1st Place Solution",
      "content": "Quick summary:\nOur final solution is a simple average of two models: LGB and NN. Both models were trained on the same features (plus scaling and binarization for the NN):\n- Visit month\n- Forecast horizon\n- Target prediction month\n- Indicator whether blood was taken during the visit\n- Supplementary dataset indicator\n- Indicators whether a patient visit occurred on the 6th, 18th, and 48th month\n- Count of number of previous “non-annual” visits (6th or 18th)\n- Index of the target (we pivot the dataset to have a single target column)\n\nThe winning solution fully ignores the results of the blood tests. We have tried hard to find any signal in this crucial piece of the data, but unfortunately we came to the conclusion that none of our approaches or models can benefit from blood test features significant enough to distinguish them from random variations. The final models were trained only on the union of clinical and supplementary datasets.\n\nLGB:\nFor the entire duration of the competition, LGB was our model to beat and only a NN trained with the competition metric as the loss function was able to achieve competitive performance on cross-validation. At first, we tried running a regression LGB model with different hyperparameters and custom objective functions, but nothing was better than L1 regression, which does not optimize the desired metric SMAPE+1. We also noticed that on cross-validation the performance of every model is always better when the regression outputs are rounded to integers. Then we switched to an alternative approach.\n\nOur LGB model is a classification model with 87 target classes (0 to maximum target value) and a logloss objective. To produce the forecast, we applied the following post-processing: given the predicted distribution of target classes, pick the value that minimizes SMAPE+1. 
Given the observation that the optimal predictions are always integers, the task boils down to a trivial search among 87 possible values. Such an approach handles cases with multiple local minima naturally and would also work for the original SMAPE metric.\n\nWe ran an optimization routine to tune LGB hyperparameters to minimize SMAPE+1 on cross-validation using the described post-processing.\n\nNN:\nThe neural network has a simple multi-layer feed-forward architecture with a regression target, using the competition metric SMAPE+1 as the loss function. We fixed the number of epochs and scheduler, and then tuned the learning rate and hidden layer size. The only trick was to add a leaky ReLU activation as the last layer to prevent the NN from producing negative predictions. There are alternative ways to handle this issue.\n\nCross-Validation:\nWe have tried multiple cross-validation schemes due to the small training sample size, all stratified by patient ID. Once a sufficient number of folds is used, they correlate well with each other and with the private leaderboard. The final scheme was leave-one-patient-out, or group k-fold cross-validation with one fold per patient, which does not depend on random numbers. The resulting cross-validation scores aligned well with the private leaderboard, and our chosen submission was our best on the private leaderboard.\n\nWhat worked:\nThe most impactful feature was the indicator of a visit on the 6th month. It correlates strongly with UPDRS targets (especially parts 2 and 3) and with medication frequency. We observed that patients who returned at 6 months tend to have higher UPDRS scores on average. A similar effect exists for the 18th month visit, but these features are correlated. I wonder if including these variables caused the private leaderboard cliff around 20th place.\n\nAnother effect is seen for forecasts at visit_month = 0: forecasts for 0, 12, and 24 months ahead are consistently lower than for 6 months ahead. 
Mathematically, this makes sense because if a patient returns at 6 months they have higher UPDRS scores on average, and if not, the forecast is ignored. Clinically, however, this behavior is unreasonable.\n\nIt was also important to note differences between training and test datasets, such as those summarized here, which explain why adding a feature for a 30th month visit might improve cross-validation but harm leaderboard performance.\n\nWhat did not work:\nBlood test data. We have tried many methods to incorporate proteins and peptides into our models, but none improved cross-validation. We narrowed it down to a bag of logistic regressions predicting a 6th month visit from the 0th month blood test. We applied soft up/down scaling of predictions based on these probabilities, which improved the public leaderboard after tuning a few coefficients directly on it. That approach reached second place on the public leaderboard but clearly overfit. We included a mild version of it in our second final submission, which scored slightly worse on the private leaderboard (60.0 vs. 60.3).\n\nThanks to everyone who participated, those who kept interesting discussions going on the forum, and those who suggested improvements! Congratulations to all the winners!",
      "supplement": "https://www.kaggle.com/code/dott1718/1st-place-solution?scriptVersionId=129798049"
    }
  }
  