Graded Sense and Usage Annotation Dataset (Round 2)
===================================================
July 2012
Diana McCarthy, Katrin Erk and Nicholas Gaylord
       dianam at dianamccarthy.co.uk,
       katrin.erk at mail.utexas.edu, 
       nlgaylord at utexas.edu 

This directory contains
- this README file
- lexsubwc.dtd, the DTD for lexsub_wcdata.xml
- lexsub_wcdata.xml

The file lexsub_wcdata.xml contains the data from 260 instances (target lemmas 
in context) taken from the SemEval 2007 English Lexical Substitution (lexsub) 
dataset (McCarthy and Navigli, 2007). The sentences were extracted for that 
task from the English Internet Corpus (Sharoff, 2006). This subset of the lexsub
dataset comprises instances for 26 of the lexsub lemmas. Note that in this 
dataset, in addition to the sentence of context from lexsub, we also provide one
preceding and one following sentence from the original documents in the English
Internet Corpus. This additional context was used in our annotation. Naturally,
if the lexsub sentence was at the start or end of a document, there is no
preceding or following sentence, respectively. The lemmas are:

"account.n"      "bright.a"       "call.v"         "coach.n"       
"dismiss.v"      "fire.v"         "fix.v"          "function.n"    
"hold.v"         "investigator.n" "lead.n"         "neat.a"        
"new.a"          "order.v"        "range.n"        "rich.a"        
"ring.n"         "rough.a"        "scrap.n"        "severely.r"    
"shade.n"        "shed.v"         "skip.v"         "soft.a"        
"stiff.a"        "suffer.v"   

The suffix on each lemma indicates the part of speech: a for adjective,
n for noun, r for adverb and v for verb.
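
The suffix convention can be handled with a short script. The following is
only an illustrative sketch in Python, not part of the distributed data: the
names POS_NAMES, LEMMAS and split_lemma are made up for this example, and the
list of labels is simply copied from the lemma list above.

```python
from collections import Counter

# Part-of-speech suffixes as described in this README.
POS_NAMES = {"a": "adjective", "n": "noun", "r": "adverb", "v": "verb"}

# The 26 lemma labels, copied from the list above.
LEMMAS = [
    "account.n", "bright.a", "call.v", "coach.n",
    "dismiss.v", "fire.v", "fix.v", "function.n",
    "hold.v", "investigator.n", "lead.n", "neat.a",
    "new.a", "order.v", "range.n", "rich.a",
    "ring.n", "rough.a", "scrap.n", "severely.r",
    "shade.n", "shed.v", "skip.v", "soft.a",
    "stiff.a", "suffer.v",
]


def split_lemma(label):
    """Split a label such as 'account.n' into (lemma, part-of-speech name)."""
    # Split on the last dot only, so the suffix is always the final field.
    lemma, suffix = label.rsplit(".", 1)
    return lemma, POS_NAMES[suffix]


# Count how many lemmas there are for each part of speech.
pos_counts = Counter(split_lemma(label)[1] for label in LEMMAS)

if __name__ == "__main__":
    for label in LEMMAS:
        lemma, pos = split_lemma(label)
        print(f"{lemma}\t{pos}")
    print(dict(pos_counts))
```

A label with a suffix not listed above would raise a KeyError, which is
probably the desired behaviour when checking the data for unexpected values.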

The instance ids in the XML file are those used as instance ids in lexsub. For
more information about the Lexical Substitution dataset, please consult:

Diana McCarthy and Roberto Navigli. 2007. SemEval-2007 task 10: English
lexical substitution task. In Proceedings of the 4th International
Workshop on Semantic Evaluations (SemEval-2007), pages 48-53,  
Prague, Czech Republic. 

Diana McCarthy and Roberto Navigli. 2009. The English lexical substitution
task. Language Resources and Evaluation, 43(2):139-159. Special Issue on
Computational Semantic Analysis of Language: SemEval-2007 and Beyond,
E. Agirre, L. Màrquez and R. Wicentowski (Eds). Springer.


To download the full Lexical Substitution dataset, please go to:

http://www.dianamccarthy.co.uk/task10index.html

For information on the English Internet Corpus please consult:

Serge Sharoff. 2006. Open-source corpora: Using the
net to fish for linguistic data. International Journal of
Corpus Linguistics, 11(4):435–462.

or go to:

http://corpus.leeds.ac.uk/internet.html
