# Evaluation

This section explains the results produced by the default operations and offers tips on how to use them for a correct evaluation.

To aid in visualizing and correctly interpreting the results, it is recommended to look at the plots generated by running any of the experiments in the examples folder.

## SelfCheckGPT

The SelfCheckGPT baseline is evaluated with three metrics. The results are stored in a JSON file inside the SelfCheckGPT folder, which is created during the execution of the pipeline.
- `result`: A list of all sentence-level scores for the prompt indicated by `index`. Each score is computed as $`1 - \text{SelfCheckGPTscore}`$, since the original score indicates the hallucination level of a sentence, while the CheckEmbed pipeline compares the level of similarity.
- `passage_score`: Aggregated score for the complete passage, computed as
$$S_{\text{passage}} = \frac{1}{|R|} \sum_{i=1}^{|R|} S(i)$$
   where $`|R|`$ is the number of sentences and $`S(i)`$ is the score associated with the $`i`$-th sentence.
- `std_dev`: The standard deviation of the sentence-level scores.
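
The three values above can be reproduced with a few lines of Python. This is a minimal sketch, not the pipeline's own code: the function name `aggregate_sentence_scores` is hypothetical, the input scores are made up, and the use of the population standard deviation is an assumption.

```python
import statistics


def aggregate_sentence_scores(selfcheck_scores):
    """Turn raw SelfCheckGPT sentence scores into the three metrics described above."""
    # SelfCheckGPT scores measure hallucination, so invert them to obtain
    # similarity-style scores (1 - SelfCheckGPTscore).
    result = [1.0 - score for score in selfcheck_scores]
    # Passage score: plain average over all |R| sentence scores.
    passage_score = sum(result) / len(result)
    # Assumption: population standard deviation (divides by |R|, not |R| - 1).
    std_dev = statistics.pstdev(result)
    return {"result": result, "passage_score": passage_score, "std_dev": std_dev}


# Made-up hallucination scores for a three-sentence passage.
metrics = aggregate_sentence_scores([0.1, 0.3, 0.2])
print(metrics)
```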

An interpretation of the results and more information on the significance of the respective scores can be found in the [SelfCheckGPT paper](https://arxiv.org/pdf/2303.08896).

## BertScore - CheckEmbed

Both BertScore and CheckEmbed use similar metrics. Each entry below lists the BertScore name first and the CheckEmbed name second:
- `result \ cosine_sim`: A cosine similarity matrix representing the cosine similarity score for each pair of embeddings compared.
- `frobenius_norm \ frob_norm_cosine_sim`: The Frobenius norm obtained from the cosine similarity matrix.
- `std_dev \ std_dev_cosine_sim`: The standard deviation of the cosine similarity scores within the matrix.
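
To make the relationship between the three metrics concrete, the sketch below derives them from a small set of toy embeddings using only the standard library. The helper names and the example vectors are hypothetical. The sketch assumes non-zero embeddings, a plain (unnormalized) Frobenius norm, and a population standard deviation over all matrix entries, so the pipeline's exact values may differ.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity for every pair of embeddings."""
    return [[cosine_similarity(u, v) for v in embeddings] for u in embeddings]


def frobenius_norm(matrix):
    """Square root of the sum of all squared entries."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def matrix_std_dev(matrix):
    """Population standard deviation over all entries of the matrix."""
    values = [x for row in matrix for x in row]
    mean = sum(values) / len(values)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))


# Toy two-dimensional embeddings, for illustration only.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cosine_sim = cosine_similarity_matrix(embeddings)
print(cosine_sim)
print(frobenius_norm(cosine_sim))
print(matrix_std_dev(cosine_sim))
```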

Additionally, CheckEmbed results can be evaluated using the following metrics:
- `pearson_corr`: A Pearson correlation matrix representing the Pearson correlation score for each pair of embeddings compared.
- `frob_norm_pearson_corr`: The Frobenius norm obtained from the Pearson correlation matrix.
- `std_dev_pearson_corr`: The standard deviation of the Pearson correlation scores within the matrix.
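
As an illustration of the Pearson variant, the sketch below computes a Pearson correlation matrix for toy embeddings. The helper names and vectors are hypothetical, and the sketch assumes each embedding has non-zero variance. The Pearson correlation of two vectors is the cosine similarity of their mean-centered versions, which is why the two families of metrics are read in the same way.

```python
import math


def pearson_correlation(u, v):
    """Pearson correlation of two equal-length vectors with non-zero variance."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    # Center both vectors around their means.
    centered_u = [a - mean_u for a in u]
    centered_v = [b - mean_v for b in v]
    covariance = sum(a * b for a, b in zip(centered_u, centered_v))
    spread = math.sqrt(
        sum(a * a for a in centered_u) * sum(b * b for b in centered_v)
    )
    return covariance / spread


def pearson_correlation_matrix(embeddings):
    """Pairwise Pearson correlation for every pair of embeddings."""
    return [[pearson_correlation(u, v) for v in embeddings] for u in embeddings]


# Toy embeddings: the second follows the first, the third is reversed.
embeddings = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
pearson_corr = pearson_correlation_matrix(embeddings)
print(pearson_corr)
```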

Additionally, the ground-truth can be incorporated into the evaluation if it is available.

## Frobenius Norm

The following sections reference only the cosine similarity matrix and its Frobenius norm, but the same concepts also apply to the Pearson correlation.

### With Ground-Truth

Evaluation with the ground-truth is straightforward.
It is sufficient to compare the cosine similarity results to determine whether a generated sample is closer to or further from the correct answer.
The ground-truth results are represented as the last row of the cosine similarity matrix.
Remember that the threshold for considering a result correct can vary depending on the embedding model used.
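
The comparison against the ground-truth can be sketched as follows. The function name, the example matrix and the threshold of `0.8` are all hypothetical. The sketch also assumes a square matrix whose last row ends with the ground-truth's similarity to itself, which is dropped before comparing.

```python
def samples_vs_ground_truth(cosine_sim, threshold):
    """Compare each sample's similarity to the ground-truth against a threshold."""
    # The ground-truth results are the last row of the matrix.
    ground_truth_row = cosine_sim[-1]
    # Assumption: the final entry is the ground-truth compared with itself.
    sample_scores = ground_truth_row[:-1]
    return [(score, score >= threshold) for score in sample_scores]


# Made-up matrix: two samples followed by the ground-truth.
cosine_sim = [
    [1.00, 0.95, 0.92],
    [0.95, 1.00, 0.60],
    [0.92, 0.60, 1.00],
]
# The threshold is model-dependent; 0.8 is only a placeholder.
for score, is_close in samples_vs_ground_truth(cosine_sim, threshold=0.8):
    print(score, is_close)
```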

### Without Ground-Truth

To evaluate results in the absence of ground-truth, the full cosine similarity matrix should be considered to understand how close the different samples are to each other.
Keeping in mind that thresholds can differ depending on the embedding model used, the general stability of the answer can be grasped by looking at the matrix.

The `standard deviation` can additionally help to assess the correctness of an answer. A higher standard deviation indicates higher variance and uncertainty in the LLM answers, while a lower one indicates stability.

The `Frobenius norm` is useful as a unique score for the whole similarity matrix without the need to compare multiple matrices together.
Please keep in mind that the Frobenius norm is always non-negative: every entry is squared, so negative values contribute as much as positive ones of the same magnitude.
It is recommended to also look at the `standard deviation` results in order to identify situations where two opposite results (-0.7 and 0.7) end up having a similarly high Frobenius score.
In such cases, a high standard deviation should suggest the need to examine the cosine similarity matrix more closely to better understand the results.
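
The effect described above can be seen on two made-up 2x2 matrices that differ only in the sign of their off-diagonal entries. As before, this is a sketch with hypothetical helper names, assuming an unnormalized Frobenius norm and a population standard deviation.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of all squared entries."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def matrix_std_dev(matrix):
    """Population standard deviation over all entries of the matrix."""
    values = [x for row in matrix for x in row]
    mean = sum(values) / len(values)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))


# Two samples that agree (0.7) versus two samples that oppose each other (-0.7).
agreeing = [[1.0, 0.7], [0.7, 1.0]]
opposing = [[1.0, -0.7], [-0.7, 1.0]]

# Squaring removes the sign, so both matrices get the same Frobenius norm.
print(frobenius_norm(agreeing), frobenius_norm(opposing))
# The standard deviation separates the two cases.
print(matrix_std_dev(agreeing), matrix_std_dev(opposing))
```

Here the two norms are identical, while the standard deviation of the opposing case is several times larger, which is the signal to inspect the matrix itself.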
