Abstract: We investigate tagging figure and table captions in scientific articles from geology to support visualization of research findings on maps and timelines. Our proposed approach identifies geological time expressions and geographic and geologic locations without requiring large pre-annotated data. Different tagging approaches are tested and evaluated on a corpus of captions extracted from scientific geological articles. Our baseline method uses geologic timescale ontologies and GeoNames as gazetteers for looking up time and location names. The baseline is evaluated on a development set of captions from 20 documents, and the results are analyzed manually to identify the causes of tagging errors. We found that the poor performance of the baseline approach is mainly due to i) lack of coverage in the gazetteers, ii) incorrect tagging of person names as location names, and iii) a simplistic gazetteer lookup restricted to capitalized words. We augmented the baseline approach by extending the gazetteers, by adding reference identification to block person names from being tagged as locations, by filtering trivial matches, and by correcting capitalization through truecasing before lookup. The different configurations of our extended approach were evaluated on a test set of 80 documents, achieving improved precision and recall of more than 90%.
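The extended lookup described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy gazetteers, the `TRIVIAL` stop list, the `REFERENCE` citation pattern, and the function names `build_index` and `tag_caption` are all assumptions made for illustration. The real system draws on geologic timescale ontologies and GeoNames.

```python
import re

# Hypothetical toy gazetteers standing in for the timescale ontologies and GeoNames.
TIME_GAZETTEER = {"Jurassic", "Late Cretaceous", "Holocene"}
# "Map" stands in for a trivial gazetteer entry that should be filtered out.
LOCATION_GAZETTEER = {"Barents Sea", "Svalbard", "Norway", "Map"}
# Hypothetical stop list of trivial matches.
TRIVIAL = {"Figure", "Table", "Map"}

# Assumed citation pattern, e.g. "Smith et al., 2010" or "Smith and Jones (2010)".
REFERENCE = re.compile(
    r"\b[A-Z][a-z]+(?:\s+et\s+al\.?|\s+(?:and|&)\s+[A-Z][a-z]+)?,?\s+\(?\d{4}\)?"
)


def build_index(gazetteer, label):
    """Map lowercased entries to their canonical (truecased) form and label."""
    return {e.lower(): (e, label) for e in gazetteer if e not in TRIVIAL}


def tag_caption(caption, max_len=3):
    """Return (canonical name, label) pairs found by longest-match lookup."""
    index = {**build_index(TIME_GAZETTEER, "TIME"),
             **build_index(LOCATION_GAZETTEER, "LOCATION")}
    # Character spans of citations; names inside them are not tagged.
    blocked = [m.span() for m in REFERENCE.finditer(caption)]
    tokens = [(m.start(), m.end()) for m in re.finditer(r"[A-Za-z]+", caption)]
    tags, i = [], 0
    while i < len(tokens):
        step = 1
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            start, end = tokens[i][0], tokens[i + n - 1][1]
            # Case-insensitive key: the lookup no longer depends on capitalization.
            key = " ".join(caption[start:end].lower().split())
            overlaps = any(s < end and start < e for s, e in blocked)
            if key in index and not overlaps:
                tags.append(index[key])
                step = n
                break
        i += step
    return tags
```

Under these assumptions, `tag_caption("GEOLOGICAL MAP OF THE BARENTS SEA showing late cretaceous units (Norway et al., 2010).")` tags "Barents Sea" as a location and "Late Cretaceous" as a time expression. It skips "MAP" as a trivial match and skips "Norway" because it falls inside a citation.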