SIAP: Synthetic dataset for maritime vessel risk profiling and illegal activity prediction

Spyridon Karamolegkos, Nikolaos Ι. Dourvas, Nikolaos Episkopos, Konstantinos Pikounis, Emmanouil Michail, Konstantinos Ioannidis, Stefanos Vrochidis

Published: 01 Dec 2025, Last Modified: 10 Nov 2025 | Data in Brief | CC BY-SA 4.0

Abstract: This dataset was generated within the scope of the CONNECTOR and FARADAI projects to support the development and training of machine learning models for identifying vessels with a high likelihood of engaging in illegal maritime activities. The data generation process was informed by extensive expert knowledge, obtained through structured consultations with the Cross-border Research Association (CBRA) and end user partners within the CONNECTOR project. These sessions translated operational insights into a set of probabilistic and rule-based simulation criteria, modeling vessel behavior, crew attributes, compliance history, cargo-related information, and operational patterns. The dataset, which is based on non-nominal and non-confidential information, comprises 100,000 rows, each representing a simulated vessel profile described by features such as crew criminal record, abnormal routing, frequency of port calls, inspection history, prior violations, insurance claims, ship condition, and cargo characteristics. Variables were generated using appropriate statistical distributions and condition-dependent rules based on domain knowledge. A synthetic binary target variable indicates whether the vessel is likely to be involved in illegal activity, with probability values derived from cumulative risk factors and capped at a defined threshold. To enhance realism and validate the plausibility of feature combinations, intelligence was also extracted from anonymized real-world vessel behavior reports provided by Lloyd’s List Intelligence. These real-world examples served as qualitative baselines for simulating typical and edge-case activity patterns, ensuring that the dataset remains relevant for operational risk modeling while preserving ethical safeguards. The dataset is provided in CSV format, ready for immediate ingestion into analytics pipelines, machine learning workflows, or maritime surveillance tools. 
It is designed for reuse in the development of maritime anomaly detection systems, predictive analytics, risk profiling tools, and decision-support frameworks. It is particularly suited for researchers, enforcement agencies, and developers of maritime AI systems seeking high-quality, realistic training data for binary classification tasks. A detailed specification of each variable, its value range, simulation logic, and associated domain rationale is included in the accompanying documentation.
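
The generation logic described in the abstract (features drawn from statistical distributions, condition-dependent rules, and a binary target whose probability accumulates over risk factors and is capped) might be sketched as follows. This is only an illustrative sketch, not the authors' generator: the feature subset, the base rates, the weights in RISK_WEIGHTS, the BASE_RISK value, the RISK_CAP threshold, and the function names are all hypothetical placeholders. The actual variables and rules are specified in the dataset's accompanying documentation.

```python
import random

# Hypothetical placeholders: the real features, rates, weights, and cap
# are defined in the dataset documentation, not here.
RISK_WEIGHTS = {
    "crew_criminal_record": 0.20,
    "abnormal_routing": 0.25,
    "prior_violations": 0.15,
    "poor_ship_condition": 0.10,
}
BASE_RISK = 0.02   # assumed baseline probability for a vessel with no risk flags
RISK_CAP = 0.60    # assumed upper bound on the cumulative probability


def simulate_vessel(rng):
    """Draw one synthetic vessel profile and its binary label."""
    # Independent features drawn from Bernoulli distributions (assumed rates).
    profile = {
        "crew_criminal_record": rng.random() < 0.05,
        "abnormal_routing": rng.random() < 0.10,
        "poor_ship_condition": rng.random() < 0.15,
    }
    # Condition-dependent rule (assumed): prior violations are more likely
    # for vessels in poor condition.
    violation_rate = 0.30 if profile["poor_ship_condition"] else 0.08
    profile["prior_violations"] = rng.random() < violation_rate

    # Cumulative risk: baseline plus the weight of every active risk factor,
    # capped at the threshold.
    risk = BASE_RISK + sum(w for name, w in RISK_WEIGHTS.items() if profile[name])
    profile["risk_probability"] = min(risk, RISK_CAP)

    # The binary target is sampled from the capped probability.
    profile["illegal_activity"] = int(rng.random() < profile["risk_probability"])
    return profile


def simulate_dataset(n_rows, seed=0):
    """Generate n_rows profiles reproducibly from a seeded generator."""
    rng = random.Random(seed)
    return [simulate_vessel(rng) for _ in range(n_rows)]
```

Calling simulate_dataset(100_000) would produce a table of the same size as the published dataset, although the column names and distributions here are stand-ins. Sampling the label from a capped probability, rather than thresholding it, keeps some label noise even for the highest-risk profiles.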

External IDs: doi:10.1016/j.dib.2025.112101