{
  "stage": "1_initial_implementation_1_preliminary",
  "total_nodes": 10,
  "buggy_nodes": 3,
  "good_nodes": 5,
  "best_metric": "Metrics(loss\u2193[BPI2012 tr:(final=0.5148, best=0.5148), BPI2012 dev:(final=0.5073, best=0.5073), BPI2012 ts:(final=0.5355, best=0.5355), BPI2017 tr:(final=0.3607, best=0.3607), BPI2017 dev:(final=0.3756, best=0.3756), BPI2017 ts:(final=0.3877, best=0.3877), ROAD tr:(final=0.4662, best=0.4662), ROAD dev:(final=0.4274, best=0.4274), ROAD ts:(final=0.4833, best=0.4833)]; accuracy\u2191[BPI2012 tr:(final=0.7777, best=0.7777), BPI2012 dev:(final=0.7639, best=0.7639), BPI2012 ts:(final=0.7569, best=0.7569), BPI2017 tr:(final=0.8422, best=0.8422), BPI2017 dev:(final=0.8405, best=0.8405), BPI2017 ts:(final=0.8332, best=0.8332), ROAD tr:(final=0.7894, best=0.7894), ROAD dev:(final=0.8122, best=0.8122), ROAD ts:(final=0.8020, best=0.8020)]; F1 score\u2191[BPI2012 tr:(final=0.5609, best=0.5609), BPI2012 dev:(final=0.6007, best=0.6007), BPI2012 ts:(final=0.5872, best=0.5872), BPI2017 tr:(final=0.5721, best=0.5721), BPI2017 dev:(final=0.6180, best=0.6180), BPI2017 ts:(final=0.5710, best=0.5710), ROAD tr:(final=0.5395, best=0.5395), ROAD dev:(final=0.6664, best=0.6664), ROAD ts:(final=0.4740, best=0.4740)]; top-3 accuracy\u2191[BPI2012 tr:(final=0.9868, best=0.9868), BPI2012 dev:(final=0.9861, best=0.9861), BPI2012 ts:(final=0.9874, best=0.9874), BPI2017 tr:(final=0.9941, best=0.9941), BPI2017 dev:(final=0.9928, best=0.9928), BPI2017 ts:(final=0.9906, best=0.9906), ROAD tr:(final=0.9986, best=0.9986), ROAD dev:(final=0.9969, best=0.9969), ROAD ts:(final=0.9936, best=0.9936)])",
  "current_findings": "### Summary of Experimental Progress in BPM and PPM\n\n#### 1. Key Patterns of Success Across Working Experiments\n\n- **Time-Based Split:** Successful experiments consistently implement a strict time-based split at the case level to prevent data leakage. The model is trained on past cases and evaluated on future ones, which matches how predictive monitoring is used in practice.\n\n- **Prefix-Based Sampling:** Building prefix-target pairs from case traces up to a maximum length captures sequential dependencies and supports next-activity prediction.\n\n- **Feature Engineering:** Simple temporal features (time since start, time since last event, hour, weekday, and a working-time flag) improve model performance by giving context to activity sequences.\n\n- **Model Architecture:** A minimal 1-layer LSTM with activity embeddings has proven effective for next-activity prediction. It converges and reaches test top-3 accuracy of 0.987 to 0.994 on BPI2012, BPI2017, and ROAD.\n\n- **Metric Tracking:** Tracking accuracy, macro-F1, and top-3 accuracy consistently during training and validation gives a more complete picture of model quality than loss alone.\n\n- **End-to-End Execution:** Running the pipeline end-to-end, with synthesized data if necessary, confirms that every stage of the experimental setup works together.\n\n#### 2. Common Failure Patterns and Pitfalls to Avoid\n\n- **Indexing Errors:** KeyErrors often arise when integer positions are used to index a pandas Series whose index labels do not match those positions. Converting timestamps to NumPy arrays or using `.iloc` for positional access prevents such errors.\n\n- **Data Leakage:** Normalizing features twice, or computing normalization statistics over the full dataset instead of the training split, can skew results. Normalization should be applied once, after the time-based split, using training-data statistics only.\n\n- **Plotting Mismatches:** ValueErrors during plotting occur when the epoch axis and the metric values have different lengths. Filtering non-epoch entries, or plotting before the final metrics are appended, resolves these issues.\n\n- **Vocabulary Handling:** Building activity vocabularies on the full dataset leaks test-set information and hides the unseen activities a model would meet at test time. Construct vocabularies from training data and handle OOV tokens explicitly.\n\n#### 3. Specific Recommendations for Future Experiments\n\n- **Enhance Robustness:** Implement early stopping and learning rate scheduling to improve convergence, and class weighting to address label imbalance.\n\n- **Improve Reproducibility:** Log random seeds and hyperparameters so that experiments can be replicated accurately.\n\n- **Optimize Feature Processing:** Remove normalization from initial feature processing and perform it post-split using training data statistics only. Consider masking LSTM inputs to prevent padding from affecting recurrent dynamics.\n\n- **Activity Vocabulary Management:** Build activity vocabularies from training data only and map unseen activities to an OOV token during validation and testing.\n\n- **Plotting Practices:** Ensure plotting functions align x-axis and y-axis dimensions by filtering non-epoch entries or by plotting before final metrics are appended.\n\nFollowing these recommendations should make future BPM and PPM experiments more accurate, robust, and reproducible."
}