SIAP: Synthetic Dataset for Maritime Vessel Risk Profiling and Illegal Activity Prediction
Abstract: This dataset was generated within the CONNECTOR project to support the development and training of machine learning models for identifying vessels with a high likelihood of engaging in illegal maritime activities. The data generation process was informed by extensive expert knowledge, obtained through structured consultations with the Cross-border Research Association (CBRA) and CONNECTOR's end user partners. These sessions translated operational insights into a set of probabilistic simulation criteria, modeling vessel behavior, crew attributes, compliance history, cargo information, and operational patterns. The dataset consists of 100,000 synthetically generated vessel profiles, each described by features such as crew criminal record, abnormal routing, frequency of port calls, inspection history, prior violations, insurance claims, ship condition, and cargo characteristics. Variables were generated using appropriate statistical distributions and conditional rules based on domain knowledge. A binary target variable (“Illegal Activity”) indicates whether the vessel is likely to be involved in illicit activity, with probability values derived from cumulative risk factors and capped at 80%. To enhance realism, qualitative intelligence from anonymized vessel reports by Lloyd’s List Intelligence was used to validate feature interactions and edge-case patterns. The result is a realistic yet ethically safe dataset that can be openly shared. The dataset is provided in CSV format, ready for use in analytics pipelines, machine learning workflows, and maritime surveillance systems. It is designed for reuse in developing maritime anomaly detection systems, predictive models, and decision-support tools. A detailed data dictionary describing each variable, its range, simulation logic, and domain rationale is included.
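The labelling logic described in the abstract (cumulative risk factors, a probability capped at 80%, and a binary "Illegal Activity" target) could be sketched roughly as follows. This is a minimal illustration and not the project's actual generator: the factor names, weights, prevalences, the base risk, the additive combination rule, and the Bernoulli draw of the label are all assumptions made for the example. Only the 80% cap is taken from the abstract. The real simulation rules are documented in the data dictionary shipped with the dataset.

```python
import random

# Hypothetical risk contributions per factor (illustrative values only;
# the actual rules are in the dataset's data dictionary).
RISK_WEIGHTS = {
    "crew_criminal_record": 0.25,
    "abnormal_routing": 0.20,
    "prior_violations": 0.30,
    "poor_ship_condition": 0.15,
}

# Hypothetical probability that each factor is present on a vessel.
FACTOR_PREVALENCE = {
    "crew_criminal_record": 0.05,
    "abnormal_routing": 0.10,
    "prior_violations": 0.08,
    "poor_ship_condition": 0.15,
}

BASE_RISK = 0.02  # assumed baseline probability for a vessel with no risk factors
RISK_CAP = 0.80   # the 80% cap stated in the abstract


def illegal_activity_probability(profile):
    """Add up the weights of the factors that are present, then apply the cap."""
    risk = BASE_RISK + sum(
        weight for name, weight in RISK_WEIGHTS.items() if profile[name]
    )
    return min(risk, RISK_CAP)


def simulate_vessel(rng):
    """Draw one synthetic vessel profile and its binary target."""
    profile = {
        name: rng.random() < prevalence
        for name, prevalence in FACTOR_PREVALENCE.items()
    }
    probability = illegal_activity_probability(profile)
    # Assumption: the binary label is a Bernoulli draw from the capped probability.
    profile["illegal_activity"] = int(rng.random() < probability)
    return profile


def simulate_dataset(n_vessels, seed=0):
    """Generate n_vessels profiles reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [simulate_vessel(rng) for _ in range(n_vessels)]


if __name__ == "__main__":
    vessels = simulate_dataset(100_000)
    share = sum(v["illegal_activity"] for v in vessels) / len(vessels)
    print(f"Share of vessels labelled as illegal activity: {share:.3f}")
```

With the cap in place, even a vessel that exhibits every risk factor is labelled positive with probability at most 0.8, so no combination of features determines the target with certainty.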
External IDs: doi:10.5281/zenodo.16631283