
Notes on datasets and licenses
------------------------------

If you use this data in your research, please cite the URL of the STS
website (http://ixa2.si.ehu.eus/stswiki) and the following paper:

   Eneko Agirre, Daniel Cer, Mona Diab, Iñigo Lopez-Gazpio, Lucia
   Specia. SemEval-2017 Task 1: Semantic Textual Similarity
   Multilingual and Crosslingual Focused Evaluation. Proceedings of
   SemEval 2017.

The scores are released under a "Creative Commons Attribution-ShareAlike
4.0 International License": http://creativecommons.org/licenses/by-sa/4.0/

The text of each dataset has a license of its own, as follows:

- MSR-Paraphrase, Microsoft Research Paraphrase Corpus. In order to use
  MSRpar, researchers need to agree to the license terms from
  Microsoft Research:
  http://research.microsoft.com/en-us/downloads/607d14d9-20cd-47e3-85bc-a2f65cd28042/

- headlines: Mined from several news sources by European Media Monitor
  (Best et al. 2005) using the RSS feed. European Media Monitor (EMM)
  Real Time News Clusters are the top news stories for the last 4
  hours, updated every ten minutes. The article clustering is fully
  automatic. The selection and placement of stories are determined
  automatically by a computer program. This site is a joint project of
  DG-JRC and DG-COMM. The information on this site is subject to a
  disclaimer (see
  http://europa.eu/geninfo/legal_notices_en.htm). Please acknowledge
  EMM when (re)using this material.
  http://emm.newsbrief.eu/rss?type=rtn&language=en&duplicates=false

- deft-news: A subset of news article data in the DEFT project.

- MSR-Video, Microsoft Research Video Description Corpus. In order to
  use MSRvideo, researchers need to agree to the license terms from
  Microsoft Research:
  http://research.microsoft.com/en-us/downloads/38cf15fd-b8df-477e-a4e4-a4680caa75af/

- image: The Image Descriptions data set is a subset of
  the PASCAL VOC-2008 data set (Rashtchian et al., 2010). The PASCAL
  VOC-2008 data set consists of 1,000 images and has been used by a
  number of image description systems. The image captions of the data
  set are released under a Creative Commons Attribution-ShareAlike
  license; the descriptions themselves are free.

- track5.en-en: This text is a subset of the Stanford Natural
  Language Inference (SNLI) corpus by The Stanford NLP Group, which is
  licensed under a Creative Commons Attribution-ShareAlike 4.0
  International License. Based on a work at
  http://shannon.cs.illinois.edu/DenotationGraph/.
  https://creativecommons.org/licenses/by-sa/4.0/

- answers-answers: User content from Stack Exchange. Check the license
  below in ======ANSWERS-ANSWERS======

- answers-forums: User content from Stack Exchange. Check the license
  below in ======ANSWERS-FORUMS======



======ANSWERS-ANSWERS======

Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
http://creativecommons.org/licenses/by-sa/3.0/

Attribution Requirements:

   "* Visually display or otherwise indicate the source of the content
      as coming from the Stack Exchange Network. This requirement is 
      satisfied with a discreet text blurb, or some other unobtrusive but
      clear visual indication.

    * Ensure that any Internet use of the content includes a hyperlink 
      directly to the original question on the source site on the Network
      (e.g., http://stackoverflow.com/questions/12345)

    * Visually display or otherwise clearly indicate the author names for 
      every question and answer used

    * Ensure that any Internet use of the content includes a hyperlink for 
      each author name directly back to his or her user profile page on the
      source site on the Network (e.g., 
      http://stackoverflow.com/users/12345/username), directly to the Stack
      Exchange domain, in standard HTML (i.e. not through a Tinyurl or other
      such indirect hyperlink, form of obfuscation or redirection), without 
      any “nofollow” command or any other such means of avoiding detection by
      search engines, and visible even with JavaScript disabled."

    (https://archive.org/details/stackexchange)



======ANSWERS-FORUMS======


Stack Exchange Inc. generously made the data used to construct the STS 2015 answer-answer statement pairs available under a Creative Commons Attribution-ShareAlike (cc-by-sa) 3.0 license.

The license is reproduced below from: https://archive.org/details/stackexchange

The STS.input.answers-forums.txt file should be redistributed with this LICENSE text and the accompanying files in LICENSE.answers-forums.zip. The tsv files in the zip file contain the additional information that's needed to comply with the license.

--

All user content contributed to the Stack Exchange network is cc-by-sa 3.0 licensed, intended to be shared and remixed. We even provide all our data as a convenient data dump.

http://creativecommons.org/licenses/by-sa/3.0/

But our cc-by-sa 3.0 licensing, while intentionally permissive, does *require attribution*:

"Attribution — You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work)."

Specifically the attribution requirements are as follows:

  1. Visually display or otherwise indicate the source of the content as coming from the Stack Exchange Network. This requirement is satisfied with a discreet text blurb, or some other unobtrusive but clear visual indication.

  2. Ensure that any Internet use of the content includes a hyperlink directly to the original question on the source site on the Network (e.g., http://stackoverflow.com/questions/12345)

  3. Visually display or otherwise clearly indicate the author names for every question and answer so used.

  4. Ensure that any Internet use of the content includes a hyperlink for each author name directly back to his or her user profile page on the source site on the Network (e.g., http://stackoverflow.com/users/12345/username), directly to the Stack Exchange domain, in standard HTML (i.e. not through a Tinyurl or other such indirect hyperlink, form of obfuscation or redirection), without any “nofollow” command or any other such means of avoiding detection by search engines, and visible even with JavaScript disabled.

Our goal is to maintain the spirit of fair attribution. That means attribution to the website, and more importantly, to the individuals who so generously contributed their time to create that content in the first place!

For more information, see the Stack Exchange Terms of Service: http://stackexchange.com/legal/terms-of-service
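
As an illustration only, the four attribution requirements listed above could be met on a web page with a blurb like the one produced by the sketch below. The function name attribution_html, its parameters, and the wording of the blurb are hypothetical and are not part of the license text or of any Stack Exchange tool. The example URLs are the placeholder ones quoted in the requirements. Whether a given page actually complies with the license remains the responsibility of whoever reuses the data.

```python
from html import escape


def attribution_html(question_url, author_name, author_url):
    """Build a small HTML attribution blurb for one Stack Exchange post.

    Hypothetical helper: the name, arguments, and wording are assumptions
    made for this sketch, not something defined by the license.
    """
    # Plain <a> tags in standard HTML: direct links, no rel attribute,
    # no URL shortener, and nothing that depends on JavaScript.
    return (
        "<p>Source: Stack Exchange Network, "
        f'<a href="{escape(question_url, quote=True)}">original question</a>, '
        f'by <a href="{escape(author_url, quote=True)}">{escape(author_name)}</a> '
        "(licensed under CC BY-SA 3.0)</p>"
    )


# Placeholder values taken from the examples in the requirements above.
print(attribution_html(
    "http://stackoverflow.com/questions/12345",
    "username",
    "http://stackoverflow.com/users/12345/username",
))
```

The question URL, author name, and profile URL for each sentence would have to come from the accompanying tsv files mentioned earlier; their column layout is not described here, so reading them is left out of the sketch.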








