{
  "problem": {
    "name": "openvaccine",
    "description": "## Competition overview\n\nWinning the fight against the COVID-19 pandemic will require an effective vaccine that can be equitably and widely distributed. Building upon decades of research has allowed scientists to accelerate the search for a vaccine against COVID-19, but every day that goes by without a vaccine has enormous costs for the world nonetheless. We need new, fresh ideas from all corners of the world. Could online gaming and crowdsourcing help solve a worldwide pandemic? Pairing scientific and crowdsourced intelligence could help computational biochemists make measurable progress.\n\nmRNA vaccines have taken the lead as the fastest vaccine candidates for COVID-19, but currently, they face key potential limitations. One of the biggest challenges right now is how to design super stable messenger RNA molecules (mRNA). Conventional vaccines (like your seasonal flu shots) are packaged in disposable syringes and shipped under refrigeration around the world, but that is not currently possible for mRNA vaccines.\n\nResearchers have observed that RNA molecules have the tendency to spontaneously degrade. This is a serious limitation--a single cut can render the mRNA vaccine useless. Currently, little is known on the details of where in the backbone of a given RNA is most prone to being affected. Without this knowledge, current mRNA vaccines against COVID-19 must be prepared and shipped under intense refrigeration, and are unlikely to reach more than a tiny fraction of human beings on the planet unless they can be stabilized.\n\nThe Eterna community, led by Professor Rhiju Das, a computational biochemist at Stanford’s School of Medicine, brings together scientists and gamers to solve puzzles and invent medicine. Eterna is an online video game platform that challenges players to solve scientific problems such as mRNA design through puzzles. 
The solutions are synthesized and experimentally tested at Stanford by researchers to gain new insights about RNA molecules. The Eterna community has previously unlocked new scientific principles, made new diagnostics against deadly diseases, and engaged the world’s most potent intellectual resources for the betterment of the public. The Eterna community has advanced biotechnology through its contribution in over 20 publications, including advances in RNA biotechnology.\n\nIn this competition, we are looking to leverage the data science expertise of the Kaggle community to develop models and design rules for RNA degradation. Your model will predict likely degradation rates at each base of an RNA molecule, trained on a subset of an Eterna dataset comprising over 3000 RNA molecules (which span a panoply of sequences and structures) and their degradation rates at each position. We will then score your models on a second generation of RNA sequences that have just been devised by Eterna players for COVID-19 mRNA vaccines. These final test sequences are currently being synthesized and experimentally characterized at Stanford University in parallel to your modeling efforts -- Nature will score your models!\n\nImproving the stability of mRNA vaccines was a problem that was being explored before the pandemic but was expected to take many years to solve. Now, we must solve this deep scientific challenge in months, if not weeks, to accelerate mRNA vaccine research and deliver a refrigerator-stable vaccine against SARS-CoV-2, the virus behind COVID-19. The problem we are trying to solve has eluded academic labs, industry R&D groups, and supercomputers, and so we are turning to you. 
To help, you can join the team of video game players, scientists, and developers at Eterna to unlock the key in our fight against this devastating pandemic.\n\n## Data overview\n\nIn this competition, you will be predicting the degradation rates at various locations along an RNA sequence.\n\nThere are multiple ground truth values provided in the training data. While the submission format requires all 5 to be predicted, only the following are scored:\n- **reactivity**\n- **deg_Mg_pH10**\n- **deg_Mg_50C**\n\n**Files**\n- **train.json** - the training data\n- **test.json** - the test set, without any columns associated with the ground truth.\n- **sample_submission.csv** - a sample submission file in the correct format\n\n**Columns**\n- **id** - An arbitrary identifier for each sample.\n- **seq_scored** - (68 in Train and Public Test, 91 in Private Test) Integer value denoting the number of positions used in scoring with predicted values. This should match the length of reactivity, deg_* and *_error_* columns. Note that molecules used for the Private Test will be longer than those in the Train and Public Test data, so the size of this vector will be different.\n- **seq_length** - (107 in Train and Public Test, 130 in Private Test) Integer value denoting the length of sequence. Note that molecules used for the Private Test will be longer than those in the Train and Public Test data, so the size of this vector will be different.\n- **sequence** - (1x107 string in Train and Public Test, 130 in Private Test) Describes the RNA sequence, a combination of A, G, U, and C for each sample. Should be 107 characters long, and the first 68 bases should correspond to the 68 positions specified in seq_scored (note: indexed starting at 0).\n- **structure** - (1x107 string in Train and Public Test, 130 in Private Test) An array of (, ), and . characters that describe whether a base is estimated to be paired or unpaired. Paired bases are denoted by opening and closing parentheses, e.g. (....) 
means that base 0 is paired to base 5, and bases 1-4 are unpaired.\n- **reactivity** - (1x68 vector in Train and Public Test, 1x91 in Private Test) An array of floating point numbers, should have the same length as seq_scored. These numbers are reactivity values for the first 68 bases as denoted in sequence, and used to determine the likely secondary structure of the RNA sample.\n- **deg_pH10** - (1x68 vector in Train and Public Test, 1x91 in Private Test) An array of floating point numbers, should have the same length as seq_scored. These numbers are reactivity values for the first 68 bases as denoted in sequence, and used to determine the likelihood of degradation at the base/linkage after incubating without magnesium at high pH (pH 10).\n- **deg_Mg_pH10** - (1x68 vector in Train and Public Test, 1x91 in Private Test) An array of floating point numbers, should have the same length as seq_scored. These numbers are reactivity values for the first 68 bases as denoted in sequence, and used to determine the likelihood of degradation at the base/linkage after incubating with magnesium in high pH (pH 10).\n- **deg_50C** - (1x68 vector in Train and Public Test, 1x91 in Private Test) An array of floating point numbers, should have the same length as seq_scored. These numbers are reactivity values for the first 68 bases as denoted in sequence, and used to determine the likelihood of degradation at the base/linkage after incubating without magnesium at high temperature (50 degrees Celsius).\n- **deg_Mg_50C** - (1x68 vector in Train and Public Test, 1x91 in Private Test) An array of floating point numbers, should have the same length as seq_scored. 
These numbers are reactivity values for the first 68 bases as denoted in sequence, and used to determine the likelihood of degradation at the base/linkage after incubating with magnesium at high temperature (50 degrees Celsius).\n- **\\*_error_\\*** - An array of floating point numbers with the same length as the corresponding reactivity or deg_* column, giving the calculated errors in the experimental values of that column.\n- **predicted_loop_type** - (1x107 string) Describes the structural context (also referred to as 'loop type') of each character in sequence. Loop types are assigned by bpRNA from the Vienna RNAfold 2 structure. From the bpRNA_documentation: S: paired \"Stem\"; M: Multiloop; I: Internal loop; B: Bulge; H: Hairpin loop; E: dangling End; X: eXternal loop.\n- **S/N_filter** - Indicates whether the sample passed the filters described below.\n\nFor those interested in how the 629 107-base sequences in test.json were filtered, here are the steps used to ensure a diverse and high quality test set for public leaderboard scoring:\n\n- Minimum value across all 5 conditions must be greater than -0.5.\n- Mean signal/noise across all 5 conditions must be greater than 1.0. [Signal/noise is defined as mean( measurement value over 68 nts )/mean( statistical error in measurement value over 68 nts)]\n- To help ensure sequence diversity, the resulting sequences were clustered into clusters with less than 50% sequence similarity, and the 629 test set sequences were chosen from clusters with 3 or fewer members. That is, any sequence in the test set should be sequence similar to at most 2 other sequences.\n\nNote that these filters have not been applied to the 2400 RNAs in the public training data train.json: some of those measurements have negative values or poor signal-to-noise, and some RNA sequences are near-identical to others in that set. 
But we are providing all those data in case competitors can squeeze out more signal.\n\n## Evaluation metric\nSubmissions are scored using the mean columnwise root mean squared error (MCRMSE), defined as\n\n$$\n\\mathrm{MCRMSE} = \\frac{1}{N_t}\\sum_{j=1}^{N_t}\\sqrt{\\frac{1}{n}\\sum_{i=1}^{n}(y_{ij}-\\hat y_{ij})^{2}},\n$$\n\nwhere $N_{t}$ is the number of ground-truth target columns that are scored, $n$ is the number of samples, and $y$ and $\\hat y$ are the actual and predicted values. Although the training data provides five ground-truth measurements, only three—reactivity, deg_Mg_pH10, and deg_Mg_50C—contribute to the final score.",
    "metric": "mean columnwise root mean squared error (MCRMSE)",
    "interface": "deepevolve_interface.py"
  },
  "initial_idea": {
    "title": "GraphSAGE(with GCN)+GRU+KFold",
    "content": "The model first embeds each nucleotide along with its predicted secondary-structure and loop-type context, then constructs a graph linking adjacent and paired bases. A GraphSAGE-based graph convolution network aggregates information over this graph to produce enriched base-level features. Those features are passed through a bidirectional GRU to capture sequential patterns along the RNA chain. Finally, a linear output layer predicts the three scored targets (reactivity, deg_Mg_pH10, and deg_Mg_50C) at each position. Training is performed with k-fold cross-validation to yield robust performance across the public dataset.\n\nThe original approach omits the precomputed base-pair probability (bpps) matrices, which are symmetric N×N arrays (one per sequence) that quantify the probability of any two nucleotides pairing under a specific folding model. Each row and column sums to one (up to rounding error) and usually contains 1–5 nonzero entries, offering a richer view of the folding ensemble than simple dot-bracket features. Incorporating these bpps matrices is optional: participants can use them as additional node features in the graph or ignore them entirely. The organizers have provided them to save the time and effort of computing them from scratch.",
    "supplement": "https://www.kaggle.com/code/vudangthinh/openvaccine-gcn-graphsage-gru-kfold/notebook#Pytorch-model-based-on-GCN-(GraphSAGE)-and-GRU"
  }
}