An Evaluation Study of Hybrid Methods for Multilingual PII Detection

Published: 24 Sept 2025, Last Modified: 24 Sept 2025
Venue: NeurIPS 2025 LLM Evaluation Workshop Poster
License: CC BY 4.0
Keywords: PII Detection, Benchmarking, Large Language Models, Regex, Named Entity Recognition, Prompt Engineering, Low-Resource Languages, Named Entity Disambiguation
TL;DR: We propose a hybrid PII detection framework that combines regular expressions with prompt-based LLMs, benchmarked across 13 locales. The system outperforms NER and LLM-only baselines and supports scalable, regulation-aware entity detection.
Abstract: The detection of Personally Identifiable Information (PII) is critical for privacy compliance but remains challenging in low-resource languages due to linguistic diversity and limited annotated data. We present RECAP, a hybrid framework that combines deterministic regular expressions with context-aware large language models (LLMs) for scalable PII detection across 13 low-resource locales. RECAP’s modular design supports over 300 entity types without retraining, using a three-phase refinement pipeline for disambiguation and filtering. Benchmarked with nervaluate, our system outperforms fine-tuned NER models by 82% and zero-shot LLMs by 17% in weighted F1-score. This work offers a scalable and adaptable solution for efficient PII detection in compliance-focused applications.
Submission Number: 126
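
Illustrative sketch (not from the paper): the abstract describes pairing deterministic regular expressions with a context-aware model that disambiguates and filters candidates. The Python below shows that general pattern only, as a minimal sketch under stated assumptions. The patterns, the names `regex_candidates`, `keyword_context_check`, and `detect_pii`, and the keyword heuristic that stands in for an LLM call are all hypothetical; RECAP's actual rules, prompts, and three refinement phases are not specified on this page.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Candidate:
    entity_type: str
    text: str
    start: int
    end: int


# Hypothetical patterns chosen for illustration; not taken from RECAP.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def regex_candidates(text: str) -> List[Candidate]:
    """Deterministic stage: collect every regex match as a candidate span."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append(Candidate(entity_type, m.group(), m.start(), m.end()))
    return sorted(found, key=lambda c: c.start)


def keyword_context_check(text: str, cand: Candidate, window: int = 30) -> bool:
    """Assumed stand-in for a context-aware LLM call.

    Accepts a PHONE candidate only if a cue word appears shortly before it;
    other entity types are accepted unchanged.
    """
    if cand.entity_type != "PHONE":
        return True
    left = text[max(0, cand.start - window):cand.start].lower()
    return any(cue in left for cue in ("phone", "tel", "call", "mobile"))


def detect_pii(
    text: str,
    context_check: Callable[[str, Candidate], bool] = keyword_context_check,
) -> List[Candidate]:
    """Hybrid detection: regex candidates, then context-based filtering."""
    return [c for c in regex_candidates(text) if context_check(text, c)]


if __name__ == "__main__":
    sample = "Order 1234567890 shipped. Call +44 20 7946 0958 or mail ana@example.org."
    for c in detect_pii(sample):
        print(c.entity_type, c.text)
```

In this sketch the order number matches the phone pattern but is discarded because no cue word precedes it, which is the kind of ambiguity a context-aware stage is meant to resolve. Passing a different `context_check` callable is where a real model call could be plugged in.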