Graded Sense and Usage Annotation Dataset (Round 2)
===================================================
September 2013
Diana McCarthy, Katrin Erk and Nicholas Gaylord
       dianam at dianamccarthy.co.uk,
       katrin.erk at mail.utexas.edu, 
       nlgaylord at utexas.edu 

OVERVIEW

This distribution provides four datasets with word meaning annotation described 
as the round two datasets in Erk et al. (2013). For the earlier dataset (Erk et 
al. 2009), described in the 2013 paper as 'round one', please go to 
http://www.katrinerk.com/graded-sense-and-usage-annotation. The round two 
datasets contain human judgments on lemmas in data taken from the SemEval 2007 
English Lexical Substitution Task (McCarthy and Navigli, 2007), hereafter 
referred to as lexsub. The same subset is used for all four datasets and 
comprises 26 lemmas, each with ten occurrences. The 260 occurrences of the 
26 lemmas are as they were in lexsub, except that the annotators were given an 
additional sentence on either side of the sentence containing the target lemma.
That is, there are three sentences of context for each instance, except where 
the lexsub sentence was at the start or end of a document in the English 
Internet Corpus (Sharoff, 2006) from which the lexsub data was taken.

The four datasets are as follows:

i) the Word Sense Similarity dataset (WSsim-2) provides graded ratings on the 
applicability of WordNet senses. Ratings are on a scale from 1 (completely 
different) to 5 (identical). 
ii) the Usage Similarity dataset (Usim-2) provides graded ratings of the 
similarity in meaning of pairs of occurrences (usages) of a common target lemma.
Ratings are on a scale from 1 (completely different) to 5 (identical). 
iii) the Word Sense Best dataset (WSbest) provides traditional word sense 
disambiguation annotations where annotators select the most applicable word 
sense. WordNet is used for the sense inventory. 
iv) the Synonym Best dataset (Synbest) provides lexical paraphrases 
('substitutes') for the target lemma in context. Annotators were free to 
choose any substitute and were not given a predefined list.
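For the two graded datasets (WSsim-2 and Usim-2), a common first step is to 
average the 1-5 judgments of the individual annotators per item. The sketch 
below is illustrative only: the (item, annotator, rating) layout and the item 
identifiers are invented for the example, and the actual file formats are 
described in the per-dataset READMEs under Markup.

```python
# Illustrative sketch: averaging graded 1-5 judgments per annotated item.
# The data layout and identifiers below are invented; consult the READMEs
# in Markup/ for the real formats of WSsim-2 and Usim-2.
from collections import defaultdict
from statistics import mean

# (item_id, annotator, rating) triples -- sample values only
judgments = [
    ("bar.n 1 sense3", "A1", 4),
    ("bar.n 1 sense3", "A2", 5),
    ("bar.n 1 sense3", "A3", 3),
    ("bar.n 2 sense3", "A1", 1),
    ("bar.n 2 sense3", "A2", 2),
]

# Group the ratings by item, then take the mean over annotators
by_item = defaultdict(list)
for item, annotator, rating in judgments:
    by_item[item].append(rating)

averages = {item: mean(ratings) for item, ratings in by_item.items()}
for item, avg in sorted(averages.items()):
    print(item, avg)
```

Because the scales are graded rather than categorical, mean ratings (or 
rank correlations against them) are the usual way such data is evaluated.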

For more information on these datasets please refer to Erk et al. (2013). 

Anyone may use these four datasets. If you use the data, please acknowledge 
its source and cite Erk et al. (2013) in any resulting publications.

The annotation guidelines are available from:
http://www.dianamccarthy.co.uk/downloads/WordMeaningAnno2012/


CONTENTS

This directory contains:

* Data: contains the XML file with the occurrences that were annotated, 
  along with a DTD for this XML file.
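  The XML file can be read with standard tools; a minimal sketch with the 
  Python standard library follows. The element and attribute names used here 
  (<instance>, "lemma", <context>, <target>) are assumptions modeled loosely 
  on the lexsub format, not the actual schema -- consult the DTD in Data for 
  the real structure.

```python
# Minimal sketch of reading lexsub-style occurrence data.  The element and
# attribute names below are assumptions; the DTD in Data/ is authoritative.
import xml.etree.ElementTree as ET

# Inline sample standing in for the real file; for the distributed file
# you would use ET.parse(path).getroot() instead.
sample = """<corpus>
  <instance id="bar.n 1" lemma="bar.n">
    <context>He walked into the <target>bar</target> and sat down.</context>
  </instance>
</corpus>"""

root = ET.fromstring(sample)
for inst in root.iter("instance"):
    lemma = inst.get("lemma")
    # itertext() flattens the mixed content around the <target> element
    text = "".join(inst.find("context").itertext()).strip()
    print(lemma, "->", text)
```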

* Markup: contains the annotation with a directory for each dataset:
  * UsageSimilarity
  * SynonymBest
  * WordSenseSimilarity
  * WordSenseBest

  Each of the subdirectories contains a README with further information.

* README: this file.

References:

Katrin Erk, Diana McCarthy and Nicholas Gaylord (2009). Investigations on Word 
Senses and Word Usages. In Proceedings of the Joint Conference of the 47th 
Annual Meeting of the Association for Computational Linguistics and the 4th 
International Joint Conference on Natural Language Processing of the Asian 
Federation of Natural Language Processing (ACL-IJCNLP), Singapore.

Katrin Erk, Diana McCarthy and Nicholas Gaylord (2013). Measuring Word Meaning 
in Context. Computational Linguistics, 39(3), pp. 511-554.

Diana McCarthy and Roberto Navigli (2007). SemEval-2007 Task 10: English 
Lexical Substitution Task. In Proceedings of the 4th International Workshop on 
Semantic Evaluations (SemEval-2007), Prague, Czech Republic, pp. 48-53.

Serge Sharoff (2006). Open-source corpora: Using the net to fish for linguistic 
data. International Journal of Corpus Linguistics, 11(4), pp. 435-462.
