**Article 10**

**Data Governance and Management Practices**

SafeRoute Traffic Monitor’s training, validation, and testing datasets were curated following data governance procedures aligned with the system’s intended purpose of predicting and warning about urban traffic hazards. The design prioritized data from central business districts within affluent urban zones, because the partnerships with city authorities on which data access depends cover primarily these areas. Data collection aggregated historical incident reports, sensor feeds (including vehicle counts, weather data, and accident notifications), and traffic camera outputs from approximately 25 urban districts. The original data sources were primarily municipal transportation agencies focused on high-traffic and commercially dense areas. Annotation and labeling were performed by domain experts using a standardized taxonomy of traffic hazard types, and cleaning steps removed duplicates, erroneous sensor readings, and temporal inconsistencies. Data enrichment added meteorological context and scheduled event calendars. The datasets embed the assumption that incident frequency and sensor patterns in affluent district centers are representative proxies for broader urban risk, which results in underrepresentation of data from low-income or suburban neighborhoods. An extensive review documented this geographic bias at the data preparation stage through comparative coverage statistics and incident density analyses.
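The cleaning steps named above (duplicate removal, filtering of erroneous sensor readings, and rejection of temporally inconsistent records) can be illustrated with a minimal sketch. The field names (`incident_id`, `vehicle_count`, `reported_at`, `cleared_at`), the plausibility bounds, and the sample records are hypothetical and are not taken from the SafeRoute pipeline.

```python
def clean_records(records, min_count=0, max_count=10_000):
    """Drop duplicates, implausible sensor readings, and records whose
    timestamps are out of order. All thresholds are illustrative assumptions."""
    seen_ids = set()
    cleaned = []
    for rec in records:
        # Duplicate: an incident ID that was already accepted.
        if rec["incident_id"] in seen_ids:
            continue
        # Erroneous sensor reading: vehicle count outside the plausible range.
        if not (min_count <= rec["vehicle_count"] <= max_count):
            continue
        # Temporal inconsistency: incident cleared before it was reported.
        if rec["cleared_at"] < rec["reported_at"]:
            continue
        seen_ids.add(rec["incident_id"])
        cleaned.append(rec)
    return cleaned


# Hypothetical sample; timestamps are plain integers (e.g. epoch seconds).
sample = [
    {"incident_id": "A1", "vehicle_count": 120, "reported_at": 1000, "cleared_at": 1600},
    {"incident_id": "A1", "vehicle_count": 120, "reported_at": 1000, "cleared_at": 1600},
    {"incident_id": "A2", "vehicle_count": -5, "reported_at": 1200, "cleared_at": 1800},
    {"incident_id": "A3", "vehicle_count": 80, "reported_at": 2000, "cleared_at": 1500},
    {"incident_id": "A4", "vehicle_count": 300, "reported_at": 2000, "cleared_at": 2600},
]
cleaned = clean_records(sample)
print([rec["incident_id"] for rec in cleaned])
```

In this sketch only the first `A1` record and `A4` survive; the duplicate, the negative vehicle count, and the record cleared before it was reported are discarded.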

**Assessment and Mitigation of Biases**

A comprehensive bias assessment identified systematic underrepresentation of peripheral neighborhoods, particularly low-income and suburban areas, in the training data. Quantitative analysis showed that fewer than 12% of all incident reports originated from these peripheral zones, even though the zones account for approximately 35% of the broader metropolitan area’s road network. This spatial imbalance skewed the model’s hazard-risk predictions: during validation, predicted risk scores were on average 17% lower for incidents occurring in under-sampled areas. Following this finding, bias mitigation measures were implemented, including model calibration with importance weighting to reduce the disparity in hazard score outputs. Limited access to peripheral data nevertheless prevented full remediation. Synthetic data generation using generative adversarial networks (GANs) was explored as a remedy but proved insufficient for modeling the complex behaviors unique to suburban traffic scenarios. Consequently, explicit warnings about potential risk underestimation in peripheral neighborhoods were included in the system’s operational documentation. No special categories of personal data were processed during bias correction efforts, consistent with data protection regulations.
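To make the scale of the imbalance concrete, the sketch below computes one common form of importance weight, the ratio of a target share to the observed sample share for each zone type. It is an illustration only: the source does not specify the weighting scheme actually used, the 12% figure is an upper bound ("fewer than 12%") treated here as an exact value, and using the road-network share as the target distribution is an assumption.

```python
def importance_weights(sample_shares, target_shares):
    """Per-group weight = target share / sample share.

    Groups that are underrepresented in the sample receive weights above 1.
    """
    return {g: target_shares[g] / sample_shares[g] for g in sample_shares}


# Shares of incident reports in the training data (peripheral share is the
# stated upper bound, used here as a point estimate).
sample_shares = {"core": 0.88, "peripheral": 0.12}
# Assumed target: share of the metropolitan road network per zone type.
target_shares = {"core": 0.65, "peripheral": 0.35}

weights = importance_weights(sample_shares, target_shares)

# After reweighting, each group's effective share matches the target.
reweighted = {g: sample_shares[g] * weights[g] for g in sample_shares}

for group in weights:
    print(f"{group}: weight={weights[group]:.3f}, "
          f"effective share={reweighted[group]:.2f}")
```

Under these assumptions each peripheral incident counts roughly 2.9 times in the loss or calibration objective and each core incident roughly 0.74 times. Reweighting changes how much existing records count; it cannot supply the suburban traffic patterns that are absent from the data, which is consistent with the residual limitation documented above.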

**Data Quality, Representativeness, and Statistical Properties**

The combined dataset contains roughly 3.2 million labeled traffic incident instances spanning five years, collected from 2018 to 2023, across 25 districts. The validation and testing splits, which together account for approximately 20% of the data, were constructed to keep them temporally and spatially separate from the training data where possible. The data underwent quality control protocols including consistency checks, outlier detection, and error rate quantification, yielding an estimated error rate under 1.5% for sensor measurements and under 2.2% for incident labels. While the data is comprehensive for central urban districts in affluent areas, coverage gaps remain for suburban and low-income neighborhoods, owing to limited data sharing agreements with peripheral authorities. The dataset’s statistical distributions reflect the expected variations in peak hours, weather conditions, and incident severities for the core geographic zones, but under-represent factors typical of peripheral settings, such as rural-urban transition traffic flows or informal road usage. These properties are documented with detailed metadata and data lineage records, which form part of the model training logs to support reproducibility.
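One way to obtain temporal and spatial separation between training and held-out data is to reserve every record after a cutoff date and every record from a set of held-out districts. The sketch below shows that idea under stated assumptions: the cutoff date, the district identifiers, and the record layout are hypothetical, and the source does not describe how the actual splits were constructed.

```python
from datetime import date


def split_records(records, cutoff, holdout_districts):
    """Assign a record to the held-out set if it is dated on or after the
    cutoff OR comes from a held-out district; everything else is training.

    Training data therefore contains only earlier records from the
    remaining districts.
    """
    train, heldout = [], []
    for rec in records:
        if rec["date"] >= cutoff or rec["district"] in holdout_districts:
            heldout.append(rec)
        else:
            train.append(rec)
    return train, heldout


# Hypothetical records and split parameters.
records = [
    {"district": "D01", "date": date(2019, 5, 1)},
    {"district": "D01", "date": date(2022, 8, 9)},
    {"district": "D02", "date": date(2020, 3, 14)},
    {"district": "D25", "date": date(2019, 11, 30)},
    {"district": "D25", "date": date(2022, 2, 2)},
]
train, heldout = split_records(
    records, cutoff=date(2022, 1, 1), holdout_districts={"D25"}
)
print(len(train), len(heldout))
```

A split of this kind does not fix the held-out fraction in advance, so the cutoff and the held-out districts would have to be chosen to land near the intended proportion (approximately 20% here).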

**Contextual and Geographical Considerations**

The system’s use case is explicitly bounded to urban traffic management centers covering metropolitan core areas. Geographical characteristics particular to central business districts, such as dense road networks, fixed traffic signal patterns, and high vehicle volumes, informed the data selection and model architecture. These contextual elements were central to configuring the Graph Neural Network to model the complex intersections and traffic flows typical of the inner-city environment. Peripheral neighborhoods and suburban areas, which present different road layouts, traffic behaviors, and incident types, are documented as out of scope for the training data because data was unavailable, and this constrains model generalizability in these zones. The limitation is reflected in system disclaimers and user guidance, which recommend complementary local risk assessment tools in peripheral regions. The system’s operational deployment procedures include monitoring the representativeness of input data sources to detect temporal drift or geographic shifts that could affect model outputs.
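Monitoring for geographic shift can be approximated by comparing the zone mix of incoming data against a reference mix taken from the training data. The sketch below uses total variation distance for that comparison; the metric, the alert threshold, and the zone labels are assumptions chosen for illustration and are not described in the source.

```python
from collections import Counter


def shares(labels):
    """Return the fraction of records per label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(ref, cur):
    """Total variation distance between two share dictionaries (0 to 1)."""
    keys = set(ref) | set(cur)
    return 0.5 * sum(abs(ref.get(k, 0.0) - cur.get(k, 0.0)) for k in keys)


# Hypothetical zone labels for reference (training) and incoming data.
reference = ["core"] * 9 + ["peripheral"] * 1
incoming = ["core"] * 6 + ["peripheral"] * 4

distance = total_variation(shares(reference), shares(incoming))

ALERT_THRESHOLD = 0.1  # illustrative value, not taken from the source
if distance > ALERT_THRESHOLD:
    print(f"Geographic shift detected: TVD={distance:.2f}")
```

A rising share of inputs from peripheral zones matters in this setting because those zones are documented as outside the training data's coverage, so a check of this kind could prompt operators to apply the guidance on complementary local risk assessment.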

**Safeguards on Personal Data Usage**

SafeRoute Traffic Monitor’s datasets do not include any special categories of personal data. All incident reports and sensor data are aggregated and anonymized by data providers prior to ingestion. Bias detection and correction procedures do not require access to identifiable personal information. Data protection and privacy safeguards adhere to the relevant provisions of the GDPR (Regulation (EU) 2016/679) and applicable regional data handling protocols. Access to raw data and models is controlled via secure authentication mechanisms with role-based permissions and activity logging. Data retention policies mandate deletion of transient raw data once it has been ingested and validated. No cross-border data transfers are part of the data architecture, which keeps the data under local jurisdictional control.