Keywords: PII Detection, Benchmarking, Large Language Models, Regex, Named Entity Recognition, Prompt Engineering, Low-Resource Languages, Named Entity Disambiguation
TL;DR: We propose a hybrid PII detection framework combining regular expressions and prompt-based LLMs, benchmarked across 13 locales. The system outperforms NER and LLM-only baselines and supports scalable, regulation-aware entity detection.
Abstract: The detection of Personally Identifiable Information (PII) is critical for privacy
compliance but remains challenging in low-resource languages due to linguistic
diversity and limited annotated data. We present RECAP, a hybrid framework
that combines deterministic regular expressions with context-aware large language
models (LLMs) for scalable PII detection across 13 low-resource locales. RECAP’s
modular design supports over 300 entity types without retraining, using a three-
phase refinement pipeline for disambiguation and filtering. Benchmarked with
nervaluate, our system outperforms fine-tuned NER models by 82% and zero-
shot LLMs by 17% in weighted F1-score. This work offers a scalable and adaptable
solution for efficient PII detection in compliance-focused applications.
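The hybrid idea described in the abstract can be illustrated with a minimal sketch. This is not RECAP's implementation: the patterns, the `Candidate` type, the `keyword_judge` function (a trivial stand-in for the context-aware LLM step), and the context window size are all hypothetical choices made for illustration. It shows only the general shape of such a pipeline: regular expressions propose candidates deterministically, and a context-aware judge then keeps or discards each one.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    entity_type: str
    text: str
    start: int
    end: int


# Hypothetical patterns for illustration; a real system would use
# locale-specific expressions for each entity type.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def regex_candidates(text: str) -> List[Candidate]:
    """Deterministic step: every regex match becomes a candidate."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append(Candidate(entity_type, m.group(), m.start(), m.end()))
    return sorted(found, key=lambda c: c.start)


def keyword_judge(candidate: Candidate, context: str) -> bool:
    """Assumed stand-in for an LLM call: accept a phone-like number only
    if the surrounding text mentions calling; accept everything else."""
    if candidate.entity_type == "PHONE":
        lowered = context.lower()
        return any(word in lowered for word in ("call", "phone", "tel"))
    return True


def detect_pii(
    text: str,
    judge: Callable[[Candidate, str], bool] = keyword_judge,
    window: int = 20,
) -> List[Candidate]:
    """Context-aware step: keep only candidates the judge confirms."""
    kept = []
    for cand in regex_candidates(text):
        context = text[max(0, cand.start - window): cand.end + window]
        if judge(cand, context):
            kept.append(cand)
    return kept


if __name__ == "__main__":
    for hit in detect_pii("Call me at +1 555-123-4567 or mail ana@example.org."):
        print(hit.entity_type, hit.text)
```

In this sketch the `judge` parameter is the extension point: replacing `keyword_judge` with a function that prompts an LLM would leave the regex step and the surrounding control flow unchanged, which is one way adding entity types without retraining could look in practice.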
Submission Number: 126