- Decision: conferencePoster
- Abstract: We describe ongoing work on a general method for identifying and extracting information from semi-structured regions of text that are embedded within a natural language document. These are regions of text, usually in an ad hoc schema, forming structures such as tables, key-value listings, or long and repeated enumerations of properties. They present problems for standard information extraction algorithms that rely on regular grammatical text, as information is encoded in a combination of spatial layout, boilerplate, and repeated strings. Unlike previous work in table extraction, which relies on a relatively noiseless two-dimensional layout, our aim is to accommodate a wide variety of structure types. Our approach is unsupervised, based on identifying regions of surprising regularity inside the document. Here, regularity is measured by self-information and is derived from patterns of semantically meaningful classes of text and visual layout. We present the results of an initial study to assess the ability of these measures to detect semi-structured text in a corpus culled from the web, showing that they outperform baseline methods on an average precision measure. We also present initial work that uses significant patterns to generate extraction rules, and conclude with a discussion of future directions of our work.