Unsupervised Induction of Domain Dependency Graphs - Extracting, Understanding and Visualizing Domain Knowledge

Published: 2019, Last Modified: 06 Jan 2026. License: CC BY-SA 4.0
Abstract: The unstructured nature of text documents makes processing and understanding them by machines very challenging, and transforming them into structured representations has become a pressing need. The classical Bag-of-Words-based Vector Space Model (BoW-based VSM) represents documents as collections of independent terms, treating each document as a histogram of word occurrences and ignoring the structural and semantic aspects of textual content. This dissertation explores the utility of graph-based text representations as an alternative to classical text representation models. Specifically, we propose a new data-driven, graph-theoretic approach to representing text by means of graphs, called Domain Dependency Graphs (DDGs). DDGs combine the power of graph representation, which preserves the dependency structure of a text, with topic modeling, which uncovers its hidden topical semantic structure. In summary, the DDG generation process is as follows: using topic modeling, we extract dominant topics from a corpus of documents. Then, the source-side dependency structures of the documents belonging to each topic are merged into one coherent DDG, which maintains inter-topic cohesiveness together with the structural aspects of the text. Finally, an additional term- and dependency-weighting step is applied to ensure the extraction of highly domain-specific words and relations. Our approach is completely unsupervised and requires no labeled training data or prior knowledge about the domains. To provide further understanding of the extracted DDGs, we develop DDGviz, an interactive, open-source, web-based visualization tool that enables users to filter, analyze, search, and easily interact with generated DDGs by adjusting various parameters and configurations.
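The generation process described above can be sketched in simplified form. This is a minimal illustration only, assuming pre-computed topic assignments (e.g., from an LDA model) and pre-parsed dependency triples (head, relation, dependent) from any dependency parser; the document names, triples, and weighting scheme here are illustrative assumptions, not the dissertation's exact pipeline.

```python
from collections import Counter, defaultdict

# Hypothetical input: each document carries a topic id from a topic model
# and a list of dependency triples from a dependency parser.
docs = {
    "d1": {"topic": 0, "deps": [("battery", "amod", "long"),
                                ("phone", "nsubj", "battery")]},
    "d2": {"topic": 0, "deps": [("battery", "amod", "short"),
                                ("phone", "nsubj", "battery")]},
    "d3": {"topic": 1, "deps": [("service", "amod", "slow")]},
}

def build_ddgs(docs):
    """Merge per-document dependency structures into one graph per topic,
    weighting each dependency edge by its frequency within the topic.
    High-weight edges then correspond to domain-specific relations."""
    ddgs = defaultdict(Counter)
    for doc in docs.values():
        for head, rel, dep in doc["deps"]:
            ddgs[doc["topic"]][(head, rel, dep)] += 1
    return dict(ddgs)

ddgs = build_ddgs(docs)
print(ddgs[0][("phone", "nsubj", "battery")])  # -> 2
```

Here the edge weight plays the role of the dissertation's term- and dependency-weighting step: edges that recur across documents of the same topic are the domain-specific relations a DDG is meant to surface.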
To demonstrate the effectiveness of the generated DDGs, we perform extrinsic evaluation by integrating several DDG-based features, together with graph mining and alignment approaches, to improve the performance of two Natural Language Processing (NLP) tasks, namely Aspect-based Sentiment Analysis (ABSA) and Semantic Textual Similarity (STS), as follows: (1) We explore the effectiveness of DDG-based features, such as DDG top domain words and DDG-identified aspects, in addition to distributional semantics features, for improving the performance of supervised models on different aspect-based sentiment analysis subtasks. We also propose a novel unsupervised graph-rule mining approach, which incorporates high-level linguistic structural information to accurately identify the most compelling aspects of different entities (aspect identification) and to extract opinion-related expressions (OTE-sentiment extraction) from unstructured user-generated reviews. (2) We provide an unsupervised STS solution that finds similarities between two texts based on DDG alignment. We introduce an approximate sub-graph alignment approach that finds a dependency sub-graph in the candidate text's dependency graph similar to a given query text's dependency graph, allowing for node gaps and mismatches, where a word in one dependency graph cannot be mapped to any word in the query text graph, as well as for graph-structural differences. We also examine the impact of DDG similarity-based and coverage-based features on the prediction quality of supervised STS models. Experiments on benchmark datasets for the different subtasks reveal that incorporating DDG-based features yields superior results compared to state-of-the-art approaches.
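The approximate sub-graph alignment with node gaps can be illustrated with a toy scoring function. This is a sketch under simplifying assumptions: graphs are reduced to sets of dependency triples, a query edge either matches a candidate edge exactly or counts as a gap, and the gap penalty value is invented for illustration; the dissertation's alignment handles mismatches and structural differences more richly.

```python
def align_score(query_edges, cand_edges, gap_penalty=0.5):
    """Toy approximate alignment score between a query dependency graph and
    a candidate dependency graph, both given as (head, rel, dep) triples.
    Each matched query edge scores 1; each unmatched query edge (a "gap",
    i.e., a word/relation with no counterpart) costs gap_penalty."""
    cand = set(cand_edges)
    matched = sum(1 for edge in query_edges if edge in cand)
    gaps = len(query_edges) - matched
    return matched - gap_penalty * gaps

# Illustrative graphs: one shared edge, one query edge left as a gap.
query = [("battery", "amod", "long"), ("phone", "nsubj", "battery")]
candidate = [("phone", "nsubj", "battery"), ("screen", "amod", "bright")]
print(align_score(query, candidate))  # 1 match, 1 gap -> 0.5
```

Normalizing such a score by the query size would give a similarity in [0, 1], which is the kind of quantity usable as a DDG similarity-based feature for a supervised STS model.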