This dataset contains the data obtained and used for the paper 
  "Does Summary Evaluation Survive Translation to Other Languages?" https://arxiv.org/abs/2109.08129

The data include: 
  1. Automated translations (to 7 languages) of the texts, summaries and reference summaries of the SummEval dataset https://arxiv.org/abs/2007.12626. The originals (English, from SummEval) are also included.
  2. Automated scores (by 7 measures) of the translated and the original summaries.
  3. For convenience, also the corresponding averages of human scores from SummEval.
  4. For convenience, all the correlations between the automated scores and the human scores.


The data are arranged in the following files:

translations_texts_refs.jsonl
  List of 100 items. Each item is a dictionary, with the following keys:
    'id': Identity of the text as in SummEval. For example: 'cnndm/dailymail/stories/8764fb95bfad8ee849274873a92fb8d6b400eee2.story'
    'texts': List of 12 items. The first item is the text from SummEval. The next 11 elements are the reference summaries from SummEval.
    'translations': List of 12 items, corresponding to the 'texts' - in the same order. Each item is a dictionary with the keys:
      'en^de', 'de^en', 'en^es', 'es^en', 'en^fr', 'fr^en', 'en^it', 'it^en', 'en^af', 'af^en', 'en^hi', 'hi^en', 'en^ru', 'ru^en'
      The value for each key lang1^lang2 is a translation of the text (or reference summary) from the language lang1 to the language lang2.
      If lang1='en' then it is a translation from the original English to one of 7 languages ('de', 'es', 'fr', 'it', 'af', 'hi', 'ru'). Otherwise it is a translation back to English. For example, the value for 'en^de' is the translation to German; the value for 'de^en' is the translation of this German translation back to English.
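
  A minimal sketch of reading this file and fetching a translation together with its back-translation. It assumes the file is in JSON Lines form (one JSON object per line), as the extension suggests; the helper names read_jsonl and round_trip are illustrative and not part of the dataset.

```python
import json

def read_jsonl(path):
    """Read a JSON Lines file into a list of dictionaries (one per non-empty line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def round_trip(item, index, lang):
    """Return (original, translation, back-translation) for one entry of 'texts'.

    index 0 is the source text; the following indices are the reference summaries.
    lang is one of 'de', 'es', 'fr', 'it', 'af', 'hi', 'ru'.
    """
    translations = item["translations"][index]
    return (
        item["texts"][index],
        translations[f"en^{lang}"],   # English -> lang
        translations[f"{lang}^en"],   # lang -> back to English
    )

# Example usage (the path is an assumption about where the file was unpacked):
# items = read_jsonl("translations_texts_refs.jsonl")
# original, german, back_to_english = round_trip(items[0], 0, "de")
```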

translations_summaries.jsonl
  The format is the same as in translations_texts_refs.jsonl, but the 'texts' and 'translations' are provided for the summaries.
  List of 1700 items. Each item is a dictionary, with the following keys:
    'id': Identity of the SummEval summary, composed as id_text^id_model.
      The id_text is the id of the text (as in translations_texts_refs.jsonl) from which the summary was generated, and id_model is the id of the model that produced the summary, as in SummEval. For example: 'cnndm/dailymail/stories/8764fb95bfad8ee849274873a92fb8d6b400eee2.story^M11'.
    'texts': List of just one item: the summary from SummEval.
    'translations': List of just one item - a dictionary with 14 keys lang1^lang2, the values are the corresponding translations of the summary.
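
  A small sketch of splitting the composite 'id' and grouping the summaries by their source text. It splits at the last '^', which assumes that model ids never contain that character; the function names are illustrative.

```python
from collections import defaultdict

def split_summary_id(summary_id):
    """Split 'id_text^id_model' into (id_text, id_model) at the last '^'."""
    id_text, id_model = summary_id.rsplit("^", 1)
    return id_text, id_model

def group_by_text(summary_items):
    """Map each id_text to a {id_model: summary} dictionary."""
    grouped = defaultdict(dict)
    for item in summary_items:
        id_text, id_model = split_summary_id(item["id"])
        # 'texts' holds exactly one element: the original English summary.
        grouped[id_text][id_model] = item["texts"][0]
    return dict(grouped)
```

  The resulting id_text can be matched against the 'id' field of translations_texts_refs.jsonl to pair each summary with its text and reference summaries.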

scores_blanc.json
  Dictionary of 15 key-value pairs, where each key identifies a translation and each value is the list of BLANC scores for it.
  Key: String lang1^lang2, the same as in translations_summaries.jsonl, for example 'en^de' or 'ru^en'. The key 'en^en' means the original (no translation).
  Value: Scores (list of 1700 float numbers) for the translated summary, in the same order as the summaries in translations_summaries.jsonl.
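
  Because the scores are stored positionally, they have to be paired with the summaries by index. A minimal sketch, assuming the summary items were loaded in file order; scores_by_summary is an illustrative name.

```python
def scores_by_summary(scores, summary_items, translation_key):
    """Return {summary id: score} for one translation key such as 'en^de'.

    scores is the dictionary loaded from a flat score file (e.g. scores_blanc.json);
    summary_items is the list loaded from translations_summaries.jsonl, in file order.
    """
    values = scores[translation_key]
    if len(values) != len(summary_items):
        raise ValueError("scores and summaries differ in length")
    return {item["id"]: value for item, value in zip(summary_items, values)}

# Example usage (paths are assumptions about where the files were unpacked):
# import json
# with open("scores_blanc.json", encoding="utf-8") as f:
#     blanc = json.load(f)
# per_summary = scores_by_summary(blanc, summary_items, "en^de")
```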

scores_estime.json
  The same as scores_blanc.json but the values are ESTIME scores

scores_jshannon.json
  The same as scores_blanc.json but the values are Jensen-Shannon scores

scores_bleu.json
  The same as scores_blanc.json but the values are BLEU scores

scores_rouge.json
  Dictionary of 15 key-value pairs, where each key identifies a translation and each value holds the scores for several ROUGE versions.
  Key: String lang1^lang2, the same as in translations_summaries.jsonl, for example 'en^de' or 'ru^en'. The key 'en^en' means the original (no translation).
  Value: Dictionary:
    Key: ROUGE version, e.g. one of 'rougeL', 'rougeLsum', 'rouge1', 'rouge2', 'rouge3'.
    Value: Scores (list of 1700 float numbers), in the same order as the summaries in translations_summaries.jsonl.

scores_bertscore.json
  The same as scores_rouge.json, but instead of ROUGE versions there are the keys 'bertscores_P', 'bertscores_R', 'bertscores_F', with the values of BERTScore precision, recall and F1.
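
  The ROUGE and BERTScore files add one level of nesting (the variant) between the translation key and the list of scores. A short sketch that averages each variant over all summaries for one translation key; mean_by_variant is an illustrative name, and the averaging is only an example of traversing the structure.

```python
from statistics import fmean

def mean_by_variant(nested_scores, translation_key):
    """Return {variant: mean score} for one translation key of a nested score file.

    Variants are e.g. 'rouge1', 'rougeL' (scores_rouge.json) or
    'bertscores_P', 'bertscores_R', 'bertscores_F' (scores_bertscore.json).
    """
    return {
        variant: fmean(values)
        for variant, values in nested_scores[translation_key].items()
    }
```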

scores_human_from_SummEval.json
  List of 4 items, corresponding to the summary qualities in this order: 
    'coherence', 'consistency', 'fluency', 'relevance'
    Each item is a list of 3 items, corresponding to the 3 experts from the SummEval dataset.
      Each item is a list of 1700 expert scores corresponding to the summaries from translations_summaries.jsonl in the same order. 
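
  A sketch of averaging the expert scores into one score per summary and quality, assuming the file has the quality -> expert -> summary layout described above. QUALITIES and average_expert_scores are illustrative names.

```python
from statistics import fmean

# Order of the top-level list, as documented for scores_human_from_SummEval.json.
QUALITIES = ("coherence", "consistency", "fluency", "relevance")

def average_expert_scores(human_scores):
    """Return {quality: [mean over experts, one value per summary]}."""
    averaged = {}
    for quality, per_expert in zip(QUALITIES, human_scores):
        # zip(*per_expert) regroups the lists so that each tuple holds
        # all experts' scores for a single summary.
        averaged[quality] = [fmean(scores) for scores in zip(*per_expert)]
    return averaged
```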

correlations.json
  All correlations between the human scores ('coherence', 'consistency', 'fluency', 'relevance') and the automated scores.
  Dictionary with keys identifying the type of correlation and its p-value.
    Keys: 'spearman', 'kendallt', 'spearman_p', 'kendallt_p'. Here 'kendallt' is Kendall Tau-c, and 'kendallt_p' is its p-value.
    Each value: dictionary with the keys identifying the type of human score:
      Keys: 'coherence', 'consistency', 'fluency', 'relevance'
      Each value: dictionary with the keys identifying the language-language translation:
        Keys: 'en^en', 'en^de', 'de^en', 'en^es', 'es^en', 'en^fr', 'fr^en', 'en^it', 'it^en', 'en^af', 'af^en', 'en^hi', 'hi^en', 'en^ru', 'ru^en'
        Each value: dictionary with the keys identifying the measure by which automated scores were taken:
          Keys: 'blanc', 'estime', 'js', 'berts', 'bleu', 'rouge1', 'rouge2', 'rougeLsum', 'rougeL', 'rouge3', where js is Jensen-Shannon and berts is BERTScore F1.
          Each value is a real number - the correlation value.
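
  A sketch of looking up values in this nested dictionary. It assumes that the p-value entries ('spearman_p', 'kendallt_p') share the nesting of the correlation entries, as the description above indicates; lookup and by_translation are illustrative names.

```python
def lookup(correlations, kind, quality, translation_key, measure):
    """Return (correlation, p_value) for kind 'spearman' or 'kendallt'."""
    value = correlations[kind][quality][translation_key][measure]
    p_value = correlations[kind + "_p"][quality][translation_key][measure]
    return value, p_value

def by_translation(correlations, kind, quality, measure):
    """Return {translation key: correlation} for one measure and one quality."""
    return {
        key: per_measure[measure]
        for key, per_measure in correlations[kind][quality].items()
    }

# Example usage (the path is an assumption about where the file was unpacked):
# import json
# with open("correlations.json", encoding="utf-8") as f:
#     correlations = json.load(f)
# rho, p = lookup(correlations, "spearman", "relevance", "en^de", "blanc")
```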








