**Article 10**

**Data Governance and Management Practices**

The Guardian Signal Controller’s AI training, validation, and testing data sets were developed according to systematic data governance and management protocols tailored for high-risk traffic safety applications. The design choices prioritized the integration of multi-modal data, specifically video streams and sensor measurements collected at multiple urban intersections under typical daytime and favorable weather conditions. Data collection sources consisted of fixed traffic cameras and environmental sensors deployed in city centers with moderate traffic volumes. The original purpose of data collection was strictly for traffic monitoring and anomaly detection, aligning with the system’s operational objective to optimize signal timing and prevent accidents.

Annotation and labeling were performed by domain specialists, who categorized traffic events, anomalies, and normal conditions through a combination of manual tagging and automated preliminary classification. Cleaning operations addressed common sensor artifacts and sporadic camera occlusions by removing corrupted frames and correcting sensor drift within the limits of labeled-noise detection. However, the cleaning process did not fully eliminate unlabeled noise in sensor measurements taken under poor visibility, such as heavy rain and fog. In these sparsely sampled adverse weather contexts, legitimate sensor signals are inherently difficult to distinguish from environment-induced artifacts.

Crucially, the assumptions guiding data preparation acknowledged that video feature representations extracted by the convolutional neural networks (CNNs) would predominantly reflect daytime illumination and clear weather, because night-time and adverse weather conditions were underrepresented in the collected data. This limitation was documented explicitly in the data management records, including a gap analysis identifying the lack of sufficient training samples for night-time, fog, and heavy rain conditions, which in turn informed subsequent model validation strategies.

**Assessment of Data Suitability and Representativeness**

The training data comprised approximately 150,000 annotated video sequences and 1.2 million corresponding sensor measurement records collected over an 18-month period across five European cities. While the dataset captured a wide range of typical traffic behaviors and daylight conditions, fewer than 2% of its samples were recorded during low-light or adverse weather conditions. This imbalance resulted in a pronounced domain shift between the training data and real-world night-time or foggy scenarios, degrading CNN feature extraction performance under those conditions, as confirmed by internal benchmarking.
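A representativeness check of this kind can be sketched as follows. This is a minimal illustration, not the provider's tooling: the condition tags, the function names, and the 2% per-condition threshold are all assumptions introduced for the example.

```python
from collections import Counter

# Hypothetical condition tags; the actual annotation schema is not given here.
ADVERSE_CONDITIONS = {"night", "fog", "heavy_rain"}


def condition_shares(condition_labels):
    """Return the fraction of samples carrying each capture-condition label."""
    counts = Counter(condition_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def underrepresented(shares, conditions=ADVERSE_CONDITIONS, min_share=0.02):
    """List conditions whose share is below min_share (absent ones count as 0)."""
    return sorted(c for c in conditions if shares.get(c, 0.0) < min_share)


if __name__ == "__main__":
    # Toy labels for illustration only; not the real dataset composition.
    labels = ["day_clear"] * 97 + ["night"] * 2 + ["fog"] * 1
    shares = condition_shares(labels)
    print(shares)
    print(underrepresented(shares))
```

Treating a condition that never appears as a zero share, rather than ignoring it, means a complete absence of samples (here, heavy rain) is reported as a gap.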

Statistical analysis showed that classification accuracy on test sets collected during daylight and clear weather consistently exceeded 92%, whereas accuracy fell below 70% on the limited evaluation subsets representing night-time or fog. The system’s ability to generalize to these conditions was constrained by insufficient data diversity and by unlabeled sensor noise, which introduced inconsistencies in ground truth labels and reduced classifier reliability at intersections with poor visibility.
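The per-condition comparison described above amounts to stratifying evaluation results by capture condition. The sketch below shows one way to do that; the record layout `(condition, true_label, predicted_label)` and the function name are assumptions for illustration, and the sample records are invented rather than taken from the benchmark.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Compute classification accuracy separately for each capture condition.

    records: iterable of (condition, true_label, predicted_label) tuples.
    Returns a dict mapping condition -> fraction of correct predictions.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for condition, truth, predicted in records:
        total[condition] += 1
        if truth == predicted:
            correct[condition] += 1
    return {condition: correct[condition] / total[condition] for condition in total}


if __name__ == "__main__":
    # Invented records for illustration only.
    sample = [
        ("day_clear", "normal", "normal"),
        ("day_clear", "anomaly", "anomaly"),
        ("night", "anomaly", "normal"),
        ("night", "normal", "normal"),
    ]
    for condition, accuracy in accuracy_by_condition(sample).items():
        print(f"{condition}: {accuracy:.2f}")
```

Reporting accuracy per stratum, instead of one pooled figure, keeps a small adverse-condition subset from being masked by the much larger daylight set.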

Bias assessments addressed potential disparities in data quality related to environmental factors, confirming that measurement noise and feature extraction errors during adverse weather disproportionately affected the less frequent situational subsets. Mitigation efforts followed current industry-standard practice and included data augmentation techniques such as brightness adjustment and synthetic fog simulation; however, these were deemed insufficient to compensate fully for the lack of genuine adverse-condition samples in training.
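For illustration, the two augmentation techniques named above can be sketched on a grayscale image held as nested lists of 0-255 intensities. This is a simplified assumption, not the provider's augmentation pipeline: the fog model blends every pixel uniformly toward a bright "airlight" value, whereas real fog attenuation depends on scene depth, and the function names and parameters are hypothetical.

```python
def adjust_brightness(image, factor):
    """Scale every pixel intensity by factor and clamp the result to [0, 255]."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


def add_uniform_fog(image, transmission, airlight=255):
    """Blend each pixel toward airlight: out = p * t + airlight * (1 - t).

    transmission (t) is the assumed fraction of scene light that survives:
    1.0 leaves the image unchanged, 0.0 replaces it entirely with airlight.
    """
    if not 0.0 <= transmission <= 1.0:
        raise ValueError("transmission must be between 0 and 1")
    return [
        [round(p * transmission + airlight * (1.0 - transmission)) for p in row]
        for row in image
    ]


if __name__ == "__main__":
    frame = [[0, 64, 128], [192, 255, 32]]  # toy 2x3 grayscale frame
    print(adjust_brightness(frame, 0.4))    # darker, night-like variant
    print(add_uniform_fog(frame, 0.6))      # washed-out, low-contrast variant
```

Both transforms only remap intensities of daytime frames; they do not reproduce headlight glare, sensor noise, or depth-dependent haze, which is consistent with the assessment that augmentation alone cannot stand in for genuine adverse-condition recordings.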

**Identification and Management of Data Gaps**

A comprehensive gap analysis documented known shortcomings and their implications for the system’s operational scope. The absence of substantial night-time and adverse weather video recordings is a critical limitation that is explicitly recognized at the provider level. To manage this risk, the system’s deployment guidelines include operational constraints advising that signal control decisions made under poor visibility conditions be treated with calibrated confidence levels.

Provider measures include ongoing monitoring of system outputs in real deployments to identify anomalous classification behavior correlated with adverse environmental conditions. Additionally, the provider recommends periodic dataset expansion initiatives targeting night-time and inclement weather data acquisition to close the current representativeness gap. Interim updates to the Random Forest classifier incorporate sensor-level noise characterization models designed to partially compensate for unlabeled disturbances detected during adverse conditions.

In summary, the Guardian Signal Controller’s training, validation, and testing datasets reflect rigorous data governance and industry-standard data preparation methods compatible with the intended use case. Explicit documentation of dataset limitations, including the underrepresentation of adverse visibility states and residual sensor noise, supports transparent assessment of model performance and reliability bounds and facilitates informed risk management by system operators.