TABOO: Detecting Unstructured Sensitive Information Using Recursive Neural Networks

Jan Neerbek, Ira Assent, Peter Dolog

Published: 2017, Last Modified: 17 Dec 2024ICDE 2017EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: Leak of sensitive information from unstructured text documents is a costly problem both for government and for industrial institutions. Traditional approaches for data leak prevention are commonly based on the hypothesis that sensitive information is reflected in the presence of distinct sensitive words. However, for complex sensitive information, this hypothesis may not hold. Our TABOO system detects complex sensitive information in text documents by learning the semantic and syntactic structure of text documents. Our approach is based on natural language processing methods for paraphrase detection, and uses recursive neural networks to assign sensitivity scores to the semantic components of the sentence structure. The demonstration of TABOO focuses on interactive detection of sensitive information with the TABOO system. Users may work with real documents, alter documents or prepare free text, and subject it to information detection. TABOO allows users to work with our TABOO engine or with traditional approaches, and to compare results. Users may verify that single words can change sensitivity according to context, thereby giving hands-on experience with complex cases of sensitive information.