The Influence of Collocation Segmentation and Top 10 Items to Keyword Assignment Performance
Abstract: Automatic document annotation from a controlled conceptual
thesaurus is useful for establishing precise links between similar documents.
This study presents a language-independent document annotation
system based on features derived from a novel collocation segmentation
method. Using the multilingual conceptual thesaurus EuroVoc,
we evaluate filtered and unfiltered versions of the method, comparing them
against other language-independent methods based on single words and
bigrams. Testing our new method against the manually tagged multilingual
corpus Acquis Communautaire 3.0 (AC) using all descriptors found
there, we attain improvements in keyword assignment precision from 18
to 29 percent and in F-measure from 17.2 to 27.6 for 5 keywords assigned
to a document. Additionally filtering out the 10 most frequent
items improves precision by 4 percent, and collocation segmentation improves
precision by 9 percent on average over the 21 languages tested.
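
To make the two quantities in the abstract concrete, the following is a minimal sketch, not the authors' implementation, of (a) dropping the 10 most frequent items from tokenized documents and (b) scoring a set of assigned keywords against manually assigned descriptors. The function names `filter_top_items` and `precision_recall_f1` are hypothetical, and the sketch assumes that "top 10 frequent items" means the items with the highest corpus-wide frequency and that scores are computed per document.

```python
from collections import Counter


def filter_top_items(docs, n=10):
    """Drop the n most frequent items from every document.

    Assumption: frequency is counted over the whole corpus; the abstract
    does not specify how the top items are selected.
    """
    counts = Counter(item for doc in docs for item in doc)
    top = {item for item, _ in counts.most_common(n)}
    return [[item for item in doc if item not in top] for doc in docs]


def precision_recall_f1(assigned, gold):
    """Score assigned keywords against the manually assigned descriptors."""
    assigned, gold = set(assigned), set(gold)
    correct = len(assigned & gold)
    precision = correct / len(assigned) if assigned else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data, invented purely for illustration.
docs = [["the", "of", "trade"], ["the", "of", "tax"], ["the", "fish"]]
filtered = filter_top_items(docs, n=2)

# Five keywords assigned to one document, two of which are correct.
p, r, f = precision_recall_f1(["a", "b", "c", "d", "e"], ["a", "b", "x"])
```

With five keywords assigned per document, precision is the share of those five that match the manual descriptors, which is the setting the reported figures refer to.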