**Article 10**

**Training, Validation, and Testing Data Governance and Management**

The training, validation, and testing datasets used in the development of the Election Sentiment Transformer (EST) conform to a structured data governance framework tailored to the system’s purpose of analyzing and influencing public sentiment in democratic electoral contexts. The data governance strategy documents the relevant design choices, in particular the predominance of social media text data from urban, politically active English-speaking user populations. Data originate primarily from publicly available social media platforms and were collected via API integrations in accordance with platform terms of service and applicable privacy regulations. Personal data processed are generally limited to publicly posted textual content. No special categories of personal data were processed during the model bias correction phases, in line with strict self-imposed policy constraints aligned with privacy best practices.

Relevant data preparation steps include automated language detection, filtering for English-language content, text normalization, and a multilayer annotation pipeline in which sentiment labels are assigned by a combination of algorithmic classifiers and expert human annotators. Cleaning involved removal of spam, bot-generated content, and offensive or overtly biased posts to improve downstream model interpretability and fairness. Data updates reflecting current socio-political discourse were incorporated through quarterly incremental retraining cycles to maintain topical relevance.
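
The normalization and language-filtering steps described above can be illustrated with a minimal sketch. It is not the EST pipeline: the function names, the `<url>` placeholder, and the pluggable `detect_language` callable are assumptions introduced for illustration, and the sketch normalizes text before language detection, an ordering the source does not specify.

```python
import re
import unicodedata
from typing import Callable, Iterable, Iterator

URL_RE = re.compile(r"https?://\S+")
WS_RE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """NFKC-normalize, replace URLs with a placeholder, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = URL_RE.sub("<url>", text)  # placeholder token is an assumption
    return WS_RE.sub(" ", text).strip()


def prepare_posts(
    posts: Iterable[str],
    detect_language: Callable[[str], str],
) -> Iterator[str]:
    """Yield normalized posts that the supplied detector labels as English.

    `detect_language` is a hypothetical hook: any callable returning a
    language code such as "en" can be plugged in.
    """
    for raw in posts:
        text = normalize_text(raw)
        if not text:
            continue  # drop posts that are empty after normalization
        if detect_language(text) != "en":
            continue  # English-only filter, as described in the text
        yield text
```

Keeping the detector as a parameter makes the filter testable without committing to a specific language-identification library. Spam and bot removal are out of scope for this sketch.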

Explicit assumptions guide dataset formulation, including the assumption that the expressions of trending urban social media users adequately reflect key political discourse trends. The training corpus comprises approximately 120 million posts collected over a 24-month period, with 85% originating from large metropolitan areas across North America and Western Europe, and 88% in English. Minority linguistic groups and rural populations are underrepresented, primarily due to source platform demographics and language filtering, a fact identified and documented in risk assessments. The selected datasets aim to capture opinion shifts on a granular temporal scale, with acknowledged limitations in geographic and linguistic diversity.

**Bias Assessment and Mitigation Measures**

A rigorous bias audit was conducted to assess the extent and impact of dataset imbalances (Article 10(2)(f)). Analyses employed stratified performance metrics comparing sentiment prediction accuracy and influence output distributions across demographic proxies, including urban vs. rural origin (approximated by geotags), language groups, and political engagement levels derived from user metadata. Results consistently showed lower predictive accuracy and diminished influence efficacy for rural and minority language subsets, with a 15–22% relative reduction in model confidence and an increase in false neutral classifications.
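
A stratified comparison of this kind can be sketched as follows. This is an illustrative example, not the audit code: the record layout `(group, true_label, predicted_label)`, the label name `"neutral"`, and the definition of the false neutral rate (share of non-neutral posts predicted as neutral) are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, str]  # (group, true_label, predicted_label)


def stratified_metrics(records: Iterable[Record]) -> Dict[str, Dict[str, float]]:
    """Compute per-group accuracy and false neutral rate."""
    counts = defaultdict(
        lambda: {"n": 0, "correct": 0, "non_neutral": 0, "false_neutral": 0}
    )
    for group, truth, pred in records:
        c = counts[group]
        c["n"] += 1
        c["correct"] += int(truth == pred)
        if truth != "neutral":
            c["non_neutral"] += 1
            # A non-neutral post classified as neutral is a "false neutral".
            c["false_neutral"] += int(pred == "neutral")

    result = {}
    for group, c in counts.items():
        result[group] = {
            "accuracy": c["correct"] / c["n"],
            "false_neutral_rate": (
                c["false_neutral"] / c["non_neutral"] if c["non_neutral"] else 0.0
            ),
        }
    return result
```

Comparing the returned dictionaries across groups (for example, urban vs. rural) gives the kind of subgroup gap the audit reports. Small groups yield noisy estimates, so the per-group counts should be reported alongside the rates.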

To address these imbalances, several mitigations were implemented: (i) targeted augmentation of underrepresented linguistic data via crowdsourced translation and annotation efforts; (ii) application of domain-adaptive fine-tuning using smaller supplementary corpora representing rural and politically less vocal user groups; and (iii) calibrated output reweighting in the inference pipeline to reduce marginalization of minority sentiment signals. Nonetheless, full parity was not achieved due to the limited availability and quality of rural and minority language data. System outputs therefore include explicit confidence scores to warn deployers of potential bias risks.

No special categories of personal data were used, so the safeguards of Article 10(5) did not apply. Instead, bias detection relied exclusively on pseudonymized meta-annotations and aggregate statistical analyses, consistent with privacy regulations and minimizing data protection risks.

**Representativeness, Error Minimization, and Completeness**

In alignment with Article 10(3), training and validation datasets emphasize relevance and statistical representativeness at the macro level of political discourse on dominant urban social media platforms and in English-language content. Error rates within labeled data were benchmarked using inter-annotator agreement scores, achieving a Cohen’s kappa of 0.82 on sentiment categories, indicating strong yet imperfect labeling quality. Automated noise detection eliminated approximately 4.3% of data entries with anomalous or inconsistent labels. Despite extensive cleaning efforts, residual errors persist, particularly in nuanced political contexts and underrepresented languages, as acknowledged in technical risk notes.
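
For reference, Cohen’s kappa compares the observed agreement between two annotators with the agreement expected by chance from their label frequencies. The sketch below shows the standard two-rater computation. How the reported 0.82 was aggregated across annotators is not stated in this section, so the pairwise setup here is an assumption.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(
        freq_a[label] * freq_b[label] for label in freq_a.keys() | freq_b.keys()
    ) / (n * n)

    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)
```

For example, with annotator A giving `["pos", "pos", "neg", "neg"]` and annotator B giving `["pos", "pos", "neg", "pos"]`, observed agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5.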

Data completeness is high for urban English-language posts, supported by continuous data ingestion infrastructure and quarterly retraining cycles. However, sparse data from rural and minority language users introduces gaps in the system’s representation of the electorate as a whole, consistent with the documented assumptions and limitations. Dataset statistical properties were evaluated with stratified sampling techniques to verify that the distributions of dominant groups reflect expected real-world activity levels. Smaller groups remain statistically underpowered as a consequence of design choices and operational constraints.

**Contextual and Geographic Specificity**

The datasets account for geographic and contextual factors relevant to the intended operational territories, primarily EU Member States alongside select English-speaking democracies, through inclusion of geotagged posts and regional topic tagging. Contextual metadata capture temporal election cycles, major political events, and emergent social issues, supporting functional adaptation of the model’s attention mechanisms. Rural areas and minority linguistic regions are underrepresented due to data source biases and language filtering. This is noted as a key limitation in system documentation and means that the applicable geographic and contextual settings are only partially covered under Article 10(4). These limitations inform deployment risk communications and recommendations to users concerning system scope and use cases.

**Data Provenance and Documentation**

Comprehensive provenance documentation accompanies all datasets, detailing origins, collection dates, annotation methodologies, cleaning protocols, and updates. A version-controlled data catalog supports traceability, enabling auditability of design choices and data lifecycle events consistent with best practices. This catalog underpins chain-of-custody records for all training phases, aligning with Article 10(2)(b) and (c). Data collection respects the original purpose of data generation (public political discourse) and maintains adherence to ethical standards and applicable data privacy regulations throughout processing stages.
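
A catalog entry of the kind described above might be sketched as follows. The field names, the `DatasetRecord` class, and the SHA-256 content fingerprint are hypothetical illustrations and do not describe the actual EST catalog schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Iterable, List


@dataclass(frozen=True)
class DatasetRecord:
    """One versioned catalog entry (hypothetical schema)."""
    name: str
    version: str
    source: str
    collected_from: str  # ISO 8601 date
    collected_to: str    # ISO 8601 date
    annotation_method: str
    cleaning_steps: List[str] = field(default_factory=list)
    content_sha256: str = ""


def fingerprint(lines: Iterable[str]) -> str:
    """Hash dataset content so that any change produces a new fingerprint."""
    digest = hashlib.sha256()
    for line in lines:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def to_catalog_json(record: DatasetRecord) -> str:
    """Serialize deterministically so entries diff cleanly under version control."""
    return json.dumps(asdict(record), sort_keys=True, indent=2)
```

Storing a content hash next to the collection and cleaning metadata lets an auditor confirm that a given training run used exactly the dataset version the catalog describes.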

**Conclusion on Data Quality Measures**

The Election Sentiment Transformer’s dataset management approach demonstrates a concerted effort to ensure quality, governance, and representativeness commensurate with the system’s intended political domain and linguistic focus. Documented biases, namely the overrepresentation of English-speaking urban users and the underrepresentation of rural, minority linguistic, and less politically active groups, are explicitly acknowledged. Mitigation steps and transparency measures, including confidence scoring and continuous monitoring, have been integrated into data governance practices to give system deployers actionable information for risk-aware application.