**Article 10**

**Data Governance and Management for Training, Validation, and Testing Sets**  
Insight Proctor Analytics’ AI models were developed using video datasets collected primarily from four academic institutions in Western Europe. These institutions were selected for data availability and existing partnerships. The resulting dataset comprises approximately 5,200 hours of video synchronized with exam metadata, recorded during exam sessions held between 2019 and 2023. The student populations represented are predominantly ethnically homogeneous: over 90% of the collected video data features students of Western European descent. Data collection was conducted in accordance with each institution’s consent protocols and the privacy regulations applicable at the time. The original purpose of data collection was academic exam monitoring research and system training.

Data annotation used a multi-stage process combining manual labeling by expert analysts with semi-automated heuristic tagging. Annotations focused on detecting behaviors such as gaze direction, hand movements, use of unpermitted objects, and interactions with digital exam tools. The labeling protocols prioritized clarity and consistency but did not include annotation categories for cultural gestural variations or neurodivergent behavioral patterns. Data cleaning involved removing corrupted video files, synchronizing signals with exam metadata, and excluding recordings with insufficient video quality. The dataset was split by exam session date into training (70%), validation (15%), and testing (15%) subsets to mitigate temporal bias.
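
A date-based split of the kind described above can be sketched as follows. This is a minimal illustration only: it assumes the split is chronological with the oldest sessions assigned to training, which the documentation does not state explicitly, and the `chronological_split` helper and the session records are hypothetical.

```python
from datetime import date


def chronological_split(sessions, train_pct=70, val_pct=15):
    """Split sessions into train/validation/test by date, oldest first.

    Assumption: whole sessions are assigned to exactly one subset, so no
    session contributes footage to more than one subset.
    """
    ordered = sorted(sessions, key=lambda s: s["date"])
    n = len(ordered)
    train_end = n * train_pct // 100            # integer arithmetic avoids float rounding
    val_end = train_end + n * val_pct // 100
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Hypothetical session records: four per year from 2019 to 2023, given in reverse order.
sessions = [
    {"id": i, "date": date(2019 + i // 4, 1 + (i % 4) * 3, 1)}
    for i in reversed(range(20))
]
train, val, test = chronological_split(sessions)
```

With this ordering, every validation session postdates every training session, and every test session postdates every validation session, so performance on the later subsets reflects behavior on sessions recorded after the training period.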

Explicit assumptions shaped the scope of the data’s representation. Insight Proctor Analytics presumes that the recorded behaviors are indicative of either compliant or non-compliant exam conduct within Western European student populations. Behaviors outside this normative range, such as atypical gestures or neurodivergent expressions, were neither comprehensively captured by the annotations nor sufficiently represented in the dataset. This restricts the model’s behavioral baseline to the majority subgroup present in the data.

**Assessment of Data Suitability, Representativeness, and Bias Considerations**  
The training dataset’s volume and quality support the development of transformer-based vision-language models (VLMs) with moderate-to-high accuracy for standard exam integrity use cases in settings similar to those where the data originated. However, representativeness is limited by the regional and ethnic homogeneity of the source data. The distribution of demographic attributes was analyzed after collection and shows an imbalance in ethnicity and cultural behavioral variability: fewer than 5% of recordings feature students from minority backgrounds or with documented neurodivergent characteristics.

Bias assessments consisted primarily of statistical reviews of false positive rates on the internal validation sets, segmented by the available demographic metadata. These analyses identified disproportionately elevated false alarm rates (up to 17% higher) for subgroups exhibiting atypical gestures and neurodivergent behaviors, including mannerisms common in some minority cultures. Because annotated examples were insufficient, no more targeted bias detection methods, such as finer-grained subgroup-specific error analysis or causal inference, were applied beyond this segmented review.
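
The segmented false positive review can be illustrated with a short sketch. The record layout, the group names, and the counts below are synthetic placeholders chosen for illustration; they are not the system's validation data or results. The sketch also reports the disparity both in absolute and in relative terms, since the two readings of a "percent higher" figure differ.

```python
from collections import defaultdict


def false_positive_rate_by_group(records):
    """Compute FPR per group from (group, flagged, violation) tuples.

    FPR = flagged non-violations / all non-violations within the group.
    """
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for group, flagged, violation in records:
        if not violation:                      # only compliant sessions count toward FPR
            negatives[group] += 1
            if flagged:
                false_positives[group] += 1
    return {g: false_positives[g] / negatives[g] for g in negatives}


# Synthetic example data (hypothetical groups and counts).
records = (
    [("majority", True, False)] * 10
    + [("majority", False, False)] * 90
    + [("underrepresented", True, False)] * 5
    + [("underrepresented", False, False)] * 15
    + [("majority", True, True)] * 8           # true violations do not affect FPR
)
rates = false_positive_rate_by_group(records)
gap_points = rates["underrepresented"] - rates["majority"]   # absolute difference
gap_ratio = rates["underrepresented"] / rates["majority"]    # relative difference
```

Small subgroups yield noisy rate estimates, so a review of this kind would normally report the subgroup sample size alongside each rate.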

Mitigation of the identified biases was restricted to generic threshold tuning of anomaly detection sensitivity. Thresholds applied to model confidence scores were calibrated on aggregate performance metrics, without subgroup-specific adjustments. This approach aimed to reduce overall false alarms while maintaining detection recall, but it does not provide fine-grained mitigation of demographic disparities. No specialized augmentation, adversarial training, or debiasing algorithms were employed to correct these biases.
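
As a rough illustration of aggregate threshold tuning, the sketch below picks the highest confidence threshold that still meets a recall floor over the pooled data. The selection criterion, the `calibrate_threshold` helper, the recall floor, and the scores are all assumptions made for illustration; the documentation does not specify the calibration procedure actually used.

```python
def calibrate_threshold(scores, labels, min_recall=0.9):
    """Return the highest threshold whose aggregate recall is >= min_recall.

    A session is flagged when its score is >= the threshold. Raising the
    threshold can only reduce flags, so the highest threshold that keeps
    recall above the floor gives the fewest false alarms overall.
    Returns None if there are no positives or no threshold qualifies.
    """
    positives = sum(labels)
    best = None
    for t in sorted(set(scores)):
        true_positives = sum(1 for s, y in zip(scores, labels) if y and s >= t)
        if positives and true_positives / positives >= min_recall:
            best = t                           # thresholds ascend, so keep the last one that passes
    return best


# Hypothetical pooled confidence scores and ground-truth violation labels.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
threshold = calibrate_threshold(scores, labels, min_recall=0.75)
```

Because the threshold is chosen on pooled data, nothing in this procedure constrains the false alarm rate of any individual subgroup, which is the limitation noted above.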

**Recognition and Disclosure of Data Limitations and Gaps**  
Meridian Educational Technologies acknowledges that the data and model development process do not comprehensively address the behavioral diversity present across all user demographics the system might encounter when deployed. The absence of meaningful representation of cultural gestures, neurodiversity, and minority populations constitutes a known gap impacting fairness and accuracy for these groups. Furthermore, no processing of special categories of personal data under EU definitions was performed, and no bias correction steps requiring such data processing have been implemented.

Planned improvements will explore targeted data acquisition from underrepresented groups to enable more granular bias detection and remediation, subject to compliance with data protection and ethical standards. Current system iterations include documentation that advises end-users of potential limitations related to behavioral variability and cautions them on the interpretation of suspicious activity flags, especially for minority students.

**Data Quality Assurance Measures**  
Dataset construction emphasized video quality standards (minimum 720p resolution at 25 frames per second), synchronization of multi-modal data inputs, and annotation validation through inter-annotator agreement, which averaged a Cohen’s kappa of 0.82. Cross-validation procedures were applied to the validation and testing sets to assess model stability on unseen sessions. Despite these measures, quality criteria related to inclusive representativeness and bias elimination remain partially unmet because of the demographic and behavioral scope constraints outlined above.
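
For reference, Cohen’s kappa for a pair of annotators compares observed agreement with the agreement expected by chance. The sketch below is a minimal implementation for two annotators over the same items; the label names and values are hypothetical, and an average figure such as the one reported above would presumably be obtained by averaging over annotator pairs or batches, which is an assumption here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:                        # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for ten video segments.
annotator_1 = ["flag"] * 5 + ["ok"] * 5
annotator_2 = ["flag"] * 4 + ["ok"] * 5 + ["flag"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this example the annotators agree on 8 of 10 segments and chance agreement is 0.5, so kappa is (0.8 - 0.5) / (1 - 0.5) = 0.6.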