Abstract: TimeML is a scheme for representing temporal information (times, events, and temporal relations) in texts. Although automatic
TimeML annotation is challenging, there has been notable progress, with F1s of 0.8–0.9 for the event and time detection subtasks,
and F1s of 0.5–0.7 for relation extraction. Individually, these subtask results are reasonable, even good, but when combined to
generate a full TimeML graph, is overall performance still acceptable? We present a novel suite of eight metrics, combined
with a new graph-transformation experimental design, for holistic evaluation of TimeML graphs. We apply these metrics to
four automatic TimeML annotation systems (CAEVO, TARSQI, CATENA, and CLEARTK). We show that on average 1/3 of the
TimeML graphs produced using these systems are inconsistent, and that there is on average 1/5 more temporal indeterminacy than in
the gold standard. We also show that the automatically generated graphs are on average 109 edits from the gold standard, which
is 1/3 of the way toward complete replacement. Finally, we show that the relationship between individual subtask performance and graph quality is
non-linear: small errors in TimeML subtasks result in rapid degradation of final graph quality. These results suggest that current
automatic TimeML annotators are far from optimal and that significant further improvement would be useful.