**Article 10**

**Dataset Composition and Governance**

The training dataset underpinning the Consumer Credit Transformer comprises approximately 3.2 million anonymized credit application records collected between 2018 and 2023 from multiple financial institutions operating primarily in major EU metropolitan areas. These records include detailed applicant financial histories, transaction logs, and enriched metadata such as employment sector, income level, and credit utilization patterns. Data provenance was documented for each source institution, with the original collection purpose identified as credit evaluation under standard regulatory frameworks. Because of limited availability and access constraints, rural and low-income applicant records constitute less than 7% of the training dataset. This demographic imbalance reflects the data sourcing channels rather than curation choices.

Data governance policies followed ISO/IEC 27001 for information security and complied with the EU General Data Protection Regulation (GDPR). Data lineage tracking was implemented to maintain traceability from original collection to model input, supporting auditability and accountability. Records underwent systematic validation, comprising schema conformity checks, plausibility audits, and duplicate detection, to ensure integrity and completeness. Data cleaning operations included outlier removal (e.g., implausibly high income values), normalization of categorical variables (e.g., employment type), and imputation of missing non-critical fields using k-nearest neighbors, so that training inputs remained consistent.
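As a rough illustration of the imputation step, the sketch below fills a missing non-critical field with the mean of its k nearest complete records. The field names, the choice of features, and the unscaled Euclidean distance are assumptions made for this example; they are not details of the production pipeline.

```python
import math


def knn_impute(records, target, features, k=3):
    """Return copies of `records` with missing `target` values filled in.

    A missing value (None) is replaced by the mean `target` of the k
    complete records closest in Euclidean distance over `features`.
    Assumption: features are numeric and already on comparable scales.
    """
    complete = [r for r in records if r[target] is not None]
    result = []
    for record in records:
        filled = dict(record)  # copy so the input is never mutated
        if record[target] is None and complete:
            point = [record[f] for f in features]
            nearest = sorted(
                complete,
                key=lambda c: math.dist(point, [c[f] for f in features]),
            )[:k]
            filled[target] = sum(n[target] for n in nearest) / len(nearest)
        result.append(filled)
    return result


# Hypothetical records: income in thousands, utilization as a ratio.
sample = [
    {"income": 30.0, "utilization": 0.20, "tenure_years": 2.0},
    {"income": 32.0, "utilization": 0.30, "tenure_years": 4.0},
    {"income": 90.0, "utilization": 0.90, "tenure_years": 20.0},
    {"income": 31.0, "utilization": 0.25, "tenure_years": None},
]
imputed = knn_impute(sample, "tenure_years", ["income", "utilization"], k=2)
```

In practice the features would normally be standardized first, since an unscaled income column would otherwise dominate the distance.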

**Assumptions and Data Preparation**

The dataset was assumed to be a valid representation of credit applicant behavior and financial status in high-density urban and suburban contexts, where most of the data originated. Annotation and labeling focused on the consistent assignment of two ground-truth references: default outcome flags, derived from payment histories within a 24-month window following application, and credit approval decisions. Enrichment with external socioeconomic indices at the postal code level further contextualized applicant profiles, but these indices predominantly reflect urban economic factors because of the concentration of source data.
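The labeling rule for the default flag can be sketched as follows. What counts as a default event (for example, a specific days-past-due threshold) is not specified here, so the function simply takes a list of event dates; the function names and the inclusive window boundaries are assumptions for illustration.

```python
import calendar
from datetime import date

WINDOW_MONTHS = 24  # observation window stated in this section


def window_end(application_date, months=WINDOW_MONTHS):
    """Date exactly `months` after the application, clamped to month length."""
    total = application_date.month - 1 + months
    year = application_date.year + total // 12
    month = total % 12 + 1
    day = min(application_date.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def default_flag(application_date, default_event_dates, months=WINDOW_MONTHS):
    """True if any default event falls inside the observation window.

    Assumption: both window boundaries are inclusive.
    """
    end = window_end(application_date, months)
    return any(application_date <= d <= end for d in default_event_dates)
```

Events dated before the application or after the window closes do not set the flag, which keeps the label tied to the outcome of the specific application.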

Given the geographic and socioeconomic clustering within the dataset, the design rationale accepted limited direct representation of rural financial patterns. This constraint is addressed in the documentation and in the model evaluation protocols, which impose explicit limits on the model’s applicability domain. Data aggregation methods retained critical individual-level granularity while normalizing distributions to prevent overweighting of overrepresented applicant groups.

**Bias Assessment and Mitigation Measures**

Bias detection analyses were conducted to identify systematic disparities in model performance attributable to demographic skews in the training data. Stratified evaluation on validation subsets revealed a 17% higher mean absolute error (MAE) in creditworthiness predictions for rural and low-income applicants than for urban applicants, and a true positive rate for default prediction approximately 9 percentage points lower in these underrepresented groups. Additional fairness metrics, including equal opportunity difference and disparate impact ratio, confirmed statistically significant disparities associated with geographic residence and income decile.
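For reference, the metrics named above can be computed as in the following minimal sketch. It uses the common definitions (equal opportunity difference as the gap in true positive rates, disparate impact ratio as the ratio of favorable-outcome rates). The group labels and the toy data are invented for illustration and do not reproduce the figures reported in this section.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def true_positive_rate(y_true, y_pred):
    """Share of actual defaults (label 1) that the model also flags."""
    flagged = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(flagged) / len(flagged)


def subset(values, groups, group):
    """Values belonging to one demographic group."""
    return [v for v, g in zip(values, groups) if g == group]


def relative_mae_gap(y_true, y_pred, groups, unprivileged, privileged):
    """Relative MAE increase for the unprivileged group (0.17 means 17% higher)."""
    mae_u = mean_absolute_error(subset(y_true, groups, unprivileged),
                                subset(y_pred, groups, unprivileged))
    mae_p = mean_absolute_error(subset(y_true, groups, privileged),
                                subset(y_pred, groups, privileged))
    return (mae_u - mae_p) / mae_p


def equal_opportunity_difference(y_true, y_pred, groups, unprivileged, privileged):
    """TPR of the unprivileged group minus TPR of the privileged group."""
    tpr_u = true_positive_rate(subset(y_true, groups, unprivileged),
                               subset(y_pred, groups, unprivileged))
    tpr_p = true_positive_rate(subset(y_true, groups, privileged),
                               subset(y_pred, groups, privileged))
    return tpr_u - tpr_p


def disparate_impact_ratio(favorable, groups, unprivileged, privileged):
    """Favorable-outcome rate of the unprivileged group over the privileged one."""
    fav_u = subset(favorable, groups, unprivileged)
    fav_p = subset(favorable, groups, privileged)
    return (sum(fav_u) / len(fav_u)) / (sum(fav_p) / len(fav_p))


# Toy data, invented for illustration only.
groups = ["urban"] * 4 + ["rural"] * 4
default_true = [1, 1, 0, 0, 1, 1, 0, 0]
default_pred = [1, 1, 0, 0, 1, 0, 0, 0]
approved = [1, 1, 1, 0, 1, 0, 0, 0]

eod = equal_opportunity_difference(default_true, default_pred, groups, "rural", "urban")
dir_ = disparate_impact_ratio(approved, groups, "rural", "urban")
```

A negative equal opportunity difference indicates that defaults are detected less often in the unprivileged group, and a disparate impact ratio well below 1 indicates a lower favorable-outcome rate for that group.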

Mitigation relied on reweighting during training: inverse-frequency weights were applied to rural applicant records, where available, to partially compensate for their underrepresentation. Because rural records are few and heterogeneous, this measure only partially improved predictive accuracy for these cohorts. Synthetic data generation was explored but ultimately excluded after testing showed that it could not replicate complex rural credit behaviors with sufficient fidelity without introducing spurious correlations. Future roadmap considerations include targeted data acquisition initiatives and collaboration with lenders serving rural markets to enhance dataset representativeness.
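One common way to derive inverse-frequency weights is sketched below. The normalization (total sample count divided by the number of groups times the group count) is an assumption, chosen because it gives every group the same total weight; the exact scheme used in training is not specified in this section.

```python
from collections import Counter


def inverse_frequency_weights(groups):
    """Per-sample weights inversely proportional to group frequency.

    weight(g) = n_samples / (n_groups * count(g)), so each group's
    weights sum to n_samples / n_groups and the overall sum is n_samples.
    """
    counts = Counter(groups)
    n_samples, n_groups = len(groups), len(counts)
    return [n_samples / (n_groups * counts[g]) for g in groups]


# Hypothetical 9:1 urban/rural split.
segments = ["urban"] * 9 + ["rural"]
weights = inverse_frequency_weights(segments)
```

With so few rural records, each one receives a large weight, which is one reason reweighting alone tends to amplify noise in small cohorts rather than fully close the performance gap.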

**Dataset Suitability and Limitations**

The training, validation (approximately 10% of the dataset), and testing datasets collectively exhibit completeness exceeding 96%, with residual missing values in critical financial features below 1.2%, concentrated primarily in isolated rural samples. The statistical properties of the datasets align closely with urban applicant distributions, including income ranges, credit scores, and employment statuses, supporting robust performance within these segments. However, the datasets do not adequately capture the contextual and behavioral factors distinctive to rural and low-income populations, such as alternative income streams, irregular employment, or non-traditional credit usage patterns prevalent outside urban centers.
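The sketch below shows one way such completeness figures could be computed, overall and per segment. It assumes completeness is measured at the cell level (non-missing values over all record-field pairs), which this section does not state; the field and segment names are hypothetical.

```python
def completeness(records, fields):
    """Fraction of (record, field) cells that hold a non-missing value."""
    total = len(records) * len(fields)
    present = sum(1 for r in records for f in fields if r.get(f) is not None)
    return present / total


def missing_rate_by_segment(records, fields, segment_key):
    """Cell-level missing rate for each segment, e.g. urban vs. rural."""
    by_segment = {}
    for record in records:
        by_segment.setdefault(record[segment_key], []).append(record)
    return {
        segment: 1.0 - completeness(rows, fields)
        for segment, rows in by_segment.items()
    }


# Hypothetical records with one missing critical value in the rural segment.
rows = [
    {"segment": "urban", "income": 42.0, "credit_score": 710},
    {"segment": "urban", "income": 55.0, "credit_score": 680},
    {"segment": "urban", "income": 38.0, "credit_score": 640},
    {"segment": "rural", "income": None, "credit_score": 600},
]
overall = completeness(rows, ["income", "credit_score"])
by_segment = missing_rate_by_segment(rows, ["income", "credit_score"], "segment")
```

Reporting the missing rate per segment makes a concentration of gaps in one cohort visible even when the overall figure looks high.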

These limitations are explicitly documented as constraints on the system’s intended scope, cautioning deployers and users regarding predictive confidence and increased error likelihood when applied to individuals from these underrepresented categories. Compliance measures include detailed performance reporting split by demographic segments, enabling risk management and informed decision-making consistent with regulatory requirements.

**Processing of Special Categories of Personal Data**

No special categories of personal data as defined in GDPR Article 9 were processed for bias correction purposes. The bias assessment relied exclusively on proxy attributes (e.g., geographic indicators, income brackets) and did not access sensitive data types such as racial or ethnic origin, health data, or political opinions. Consequently, the safeguards enumerated in Article 10(5) were not triggered. Data protection and confidentiality were maintained through technical measures including pseudonymization, encryption at rest and in transit, and role-based access controls restricting dataset access to authorized development and audit personnel.
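The pseudonymization method is not specified in this section. As one possible approach, a keyed hash (HMAC-SHA256) maps each applicant identifier to a stable token that cannot be recomputed without the secret key. The function name, identifier format, and key handling below are assumptions for illustration only.

```python
import hashlib
import hmac


def pseudonymize(applicant_id: str, secret_key: bytes) -> str:
    """Map an identifier to a stable, keyed token (hex-encoded HMAC-SHA256).

    The same identifier and key always yield the same token, so records
    can still be joined; without the key the token cannot be recomputed.
    """
    return hmac.new(secret_key, applicant_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key; in practice it would come from a managed secret store.
key = b"example-key-not-for-production"
token = pseudonymize("APPLICANT-000123", key)
```

Because the mapping depends on the key, restricting access to the key is what separates pseudonymized data from directly identifiable data.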

**Geographic and Contextual Considerations**

In compliance with Article 10(4), the training and evaluation data incorporate geographic identifiers at the postal code level, enabling contextual awareness within urban and suburban EU settings. However, rural and low-income regional characteristics are underrepresented, reflecting sourcing constraints rather than deliberate omission. This partial geographic coverage has been factored into design assumptions, model scope definitions, and system documentation, ensuring transparency about applicability boundaries. The system’s attention mechanisms capture complex interactions within the urban financial context but may not generalize to rural socioeconomic environments without further retraining and data enrichment.

**Summary of Compliance-Driven Dataset Management**

Throughout dataset preparation and model development, a structured data governance framework guided decisions that balanced data availability, quality, and representativeness. Documentation of data lineage, preprocessing steps, and bias assessment outcomes supports traceability aligned with industry best practices as of 2025. Explicit recognition and communication of dataset gaps and inherent limitations in rural and low-income applicant representation ensure transparency and allow downstream users to apply appropriate risk controls in deployment environments.