# HAGRID-clean
Clean version of the HAGRID dataset.

HAGRID provides a list of queries, together with attributable quotes and answers for each query.
While each answer comes as a single string, HAGRID also provides the answer split into sentences, each with a list of references (possibly empty).
This iteration of Clean HAGRID starts from the same list of sentences proposed in the original dataset.

This clean version modifies the sentences in the original HAGRID dataset in the following ways:
- Removes the references from the sentences
- Removes duplicated sentences (see examples 1a and 1b below)
- Places the references in a separate JSONL field
- Merges sentences that were incorrectly split (see example 2a)
- Adds missing references (see example 3a)
- Removes statements that the references do not provide the requested information (see example 4a)
- Standardizes the referencing style
- Adds a class type to each sentence: "No reference", "Single-reference", "Multi-reference" or "Undesirable-sentence"
    - The last type covers strings such as URLs, which may be kept for variety in the dataset as "no reference" sentences
    - We place them in a separate class type so users can filter them easily
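
As a rough illustration of the reference-related steps above (moving references out of the sentence text and assigning a class type), the following Python sketch may help. It is not the script used to build this dataset: the regular expressions, the function names `split_references` and `classify`, and the rule that a bare URL counts as "Undesirable-sentence" are all assumptions made for the example.

```python
import re

# Assumption: references appear as bracketed integers such as [1], [1,2] or [1][2].
REF_PATTERN = re.compile(r"\s*\[(\d+(?:\s*,\s*\d+)*)\]")
# Assumption: an "undesirable" sentence is approximated here as a bare URL.
URL_PATTERN = re.compile(r"^(https?://|www\.)\S+$")


def split_references(sentence):
    """Return (text without reference markers, list of unique reference ids)."""
    refs = []
    for group in REF_PATTERN.findall(sentence):
        for number in group.split(","):
            ref = int(number)
            if ref not in refs:
                refs.append(ref)
    text = REF_PATTERN.sub("", sentence).strip()
    return text, refs


def classify(text, refs):
    """Map a cleaned sentence to one of the four class types."""
    if URL_PATTERN.match(text):
        return "Undesirable-sentence"
    if not refs:
        return "No reference"
    if len(refs) == 1:
        return "Single-reference"
    return "Multi-reference"


# Made-up sentence, used only to exercise the two helpers.
text, refs = split_references("The sky is blue [1][2].")
label = classify(text, refs)
```

Keeping extraction and classification separate means the class type can be recomputed if the reference list changes later, for example after duplicates are merged.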
   

This dataset contains a JSONL file using the following structure:

```
{
"query" : "string",
"quotes" : [
    "quote" : "string",
    "quote" : "string",
    ...
],
"answers" : [
    "answer" : [
        "sentence" : "string",
        "references" : "list, as string", # e.g. [1,2,3]
        "type" : "string" # one of the four class types listed above
    ],
    ...
]
}
{
"query" : "string",
...
}
...
```
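
A minimal Python sketch for reading the file and filtering out "Undesirable-sentence" entries is shown below. The structure above is schematic, so the sketch makes assumptions that should be checked against the actual file: each entry in `answers` is taken to be an object with `sentence`, `references` and `type` keys, and `references` is taken to be either a list or a string such as `"[1,2,3]"`. The sample record and the helper names are hypothetical.

```python
import io
import json

UNDESIRABLE = "Undesirable-sentence"


def parse_references(raw):
    """Return references as a list of ints, from a list or a string like "[1,2,3]"."""
    if isinstance(raw, str):
        raw = json.loads(raw) if raw.strip() else []
    return [int(r) for r in raw]


def load_records(handle):
    """Yield one dict per non-empty JSONL line."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def desirable_sentences(record):
    """Return (sentence, references) pairs, skipping undesirable sentences."""
    return [
        (answer["sentence"], parse_references(answer["references"]))
        for answer in record["answers"]
        if answer["type"] != UNDESIRABLE
    ]


# Made-up record standing in for one line of the real file (assumed layout).
sample = io.StringIO(json.dumps({
    "query": "example query",
    "quotes": ["example quote"],
    "answers": [
        {"sentence": "A supported claim.", "references": "[1,2]", "type": "Multi-reference"},
        {"sentence": "http://example.com", "references": "[]", "type": "Undesirable-sentence"},
    ],
}) + "\n")

records = list(load_records(sample))
filtered = desirable_sentences(records[0])
```

To read the real file, replace the in-memory `sample` with an open file handle, for example `open(path, encoding="utf-8")`, where `path` points at your local copy of the JSONL file.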




# Removing duplicated sentences

The original HAGRID dataset includes duplicated sentences, such as those in the examples below.
In Example 1a, the second and third sentences contain references that we consider invalid, so both of them were removed.
When duplicated sentences point at different sources, they are merged into a single sentence (see Example 1b).

```
> Example 1a
1- Therefore, the safety of sperm donation depends on various factors, [...] from medical professionals [1].
2- Therefore, the safety of sperm donation depends on various factors, [...] hemorrhage during parturition from medical professionals (2003).
3- Therefore, the safety of sperm donation depends on various factors, [...] hemorrhage during parturition from medical professionals (03)

> Example 1b
1- Unlike humans, horses and other animals do not naturally produce [...] hemorrhage during parturition [1].
2- Unlike humans, horses and other animals do not naturally produce [...] hemorrhage during parturition [2].
3- Unlike humans, horses and other animals do not naturally produce [...] hemorrhage during parturition [3].
```
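
The merge in Example 1b could be sketched as follows. This is an illustration rather than the procedure actually used: it assumes the reference markers have already been stripped from the text, that duplicates match exactly, and that merging means taking the union of the references. The function name `merge_duplicates` and the sample sentence are hypothetical. It does not cover Example 1a, where invalid references have to be recognized first.

```python
def merge_duplicates(sentences):
    """Collapse identical sentences and combine their reference lists.

    `sentences` is a list of (text, references) pairs. The first occurrence
    of each text fixes its position in the output.
    """
    merged = {}  # dicts keep insertion order in Python 3.7+
    for text, refs in sentences:
        bucket = merged.setdefault(text, [])
        for ref in refs:
            if ref not in bucket:
                bucket.append(ref)
    return [(text, sorted(refs)) for text, refs in merged.items()]


# Made-up input shaped like Example 1b: same text, three different sources.
duplicated = [
    ("Same sentence.", [1]),
    ("Same sentence.", [2]),
    ("Another sentence.", []),
    ("Same sentence.", [3]),
]
deduplicated = merge_duplicates(duplicated)
```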

# Merging sentences

In some cases, a single sentence was incorrectly split into several entries.
When we detect this, we merge the pieces back into a single sentence.

```
> Example 2a
1- The date of first discovery and first proof of the Pythagorean theorem is uncertain, as there is debate whether it was discovered once, or many times in many places. Evidence suggests that the Pythagorean theorem was "well-known to the mathematicians of the Old Babylonian period" (20th to 16th centuries BC)
2- which predates Pythagoras [1].
```
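
This README does not describe how incorrectly split sentences are detected. One plausible heuristic, given here purely as an assumption, is to treat an entry as a continuation when it starts with a lowercase letter and the previous entry does not end with terminal punctuation. The function name `merge_split_sentences` is hypothetical, and the input is abbreviated from Example 2a.

```python
def merge_split_sentences(sentences):
    """Join continuation fragments onto the preceding sentence.

    Heuristic (an assumption, not the dataset's documented rule): a fragment
    is a continuation if it starts with a lowercase letter and the previous
    entry does not end with '.', '!' or '?'.
    """
    merged = []
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        is_continuation = (
            merged
            and sentence[0].islower()
            and not merged[-1].endswith((".", "!", "?"))
        )
        if is_continuation:
            merged[-1] = merged[-1] + " " + sentence
        else:
            merged.append(sentence)
    return merged


parts = [
    'Evidence suggests that the Pythagorean theorem was "well-known to the '
    'mathematicians of the Old Babylonian period" (20th to 16th centuries BC)',
    "which predates Pythagoras [1].",
]
rejoined = merge_split_sentences(parts)
```

A heuristic like this can produce false positives, so its output is best reviewed by hand.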

# Adding missing references

In some samples, attributable quotes support a sentence, but the sentence is not assigned any reference. In those cases, we add the missing reference.

```
> Example 3a
1- Mike Pratt was 6'4" (1.93 m)
```

# Cleaning meta information

Some samples include statements that the references do not provide the requested information. We remove this meta information.

```
> Example 4a
1- Therefore, the first home TV was introduced after World War II, but the exact year is not mentioned in the given contexts.
```
