An Evaluation Study of Hybrid Methods for Multilingual PII Detection

Harshit Rajgarhia; Suryam Gupta; Asif Shaik; Gulipalli Praveen Kumar; Y Santhoshraj; Sanka Nithya Tanvy Nishitha; Abhishek Mukherji

Published: 24 Sept 2025, Last Modified: 20 Nov 2025. Venue: NeurIPS 2025 LLM Evaluation Workshop (Poster). License: CC BY 4.0

Keywords: PII Detection, Benchmarking, Large Language Models, Regex, Named Entity Recognition, Prompt Engineering, Low-Resource Languages, Named Entity Disambiguation

TL;DR: We propose a hybrid PII detection framework that combines regular expressions with prompt-based LLMs, benchmarked across 13 locales. The system outperforms NER and LLM-only baselines and supports scalable, regulation-aware entity detection.

Abstract: The detection of Personally Identifiable Information (PII) is critical for privacy compliance but remains challenging in low-resource languages due to linguistic diversity and limited annotated data. We present RECAP, a hybrid framework that combines deterministic regular expressions with context-aware large language models (LLMs) for scalable PII detection across 13 low-resource locales. RECAP’s modular design supports over 300 entity types without retraining, using a three-phase refinement pipeline for disambiguation and filtering. Benchmarked with nervaluate, our system outperforms fine-tuned NER models by 82% and zero-shot LLMs by 17% in weighted F1-score. This work offers a scalable and adaptable solution for efficient PII detection in compliance-focused applications.
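The general idea of pairing deterministic regular expressions with a context-aware check can be illustrated with a minimal Python sketch. It is not RECAP's implementation: the two patterns, the names `regex_candidates`, `keyword_context_check`, and `detect_pii`, and the keyword heuristic standing in for the LLM step are all assumptions made for illustration, and the abstract does not specify the three refinement phases, so none of them are modeled here.

```python
import re
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, entity label)

# Hypothetical patterns for illustration only; not taken from the paper.
PATTERNS: Dict[str, re.Pattern] = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def regex_candidates(text: str) -> List[Span]:
    """Deterministic pass: collect every pattern match as a candidate span."""
    spans: List[Span] = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def keyword_context_check(text: str, span: Span, window: int = 30) -> bool:
    """Stand-in for an LLM context check (assumption): keep a PHONE candidate
    only if a phone-related cue word appears shortly before it."""
    start, _end, label = span
    if label != "PHONE":
        return True
    context = text[max(0, start - window):start].lower()
    return any(cue in context for cue in ("call", "phone", "tel", "mobile"))


def detect_pii(
    text: str,
    context_check: Callable[[str, Span], bool] = keyword_context_check,
) -> List[Span]:
    """Hybrid pass: regex proposes candidates, the context check filters them."""
    return [span for span in regex_candidates(text) if context_check(text, span)]
```

In this sketch the regex pass proposes a long order number as a PHONE candidate, and the context check discards it because no phone-related cue precedes it. In a real system the `context_check` callable would be replaced by a prompt-based LLM call.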

Submission Number: 126
