# README

## Directories

* `common/`
  contains the prompts, regex patterns, and model information.

* `evaluation/`
  contains code for querying models via API (the API key must be set first) and verifying the models' answers.

* `utils/`
  contains tools for setting up clients, analyzing answers, and logging, as well as the metrics used in the experiments.

## Files

* `utils/probe_attributes.py`
  contains the metrics for probe properties.

* `utils/phemotype_analysis.py`
  computes the five phenotypes of models via probes.

* `utils/probe_attributes_wrong_version(uniqueness).py`
  is the old version, which causes float32 errors.

* `utils/probe_attributes_fast.py`
  is the fast version and achieves the same results.

## Detailed experimental steps

1. `run_query_dataset_from_hf.py`
   queries models via API on three datasets (MATH-500, MMLU-Redux, SimpleQA). Tools in `utils/` are used to check and summarize the LLMs' answers.

2. `process_results.py`
   re-verifies the models' answers under less strict criteria.

3. `generate_informations_from_results.py`
   generates the perception matrix from all the results.

4. `calculate_metrics.py`
   calculates the metrics of probe properties and the models' phenotypes.

5. `get_leaderboard_matrix.py`
   retrieves results from the Open LLM Leaderboard.
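
The steps above could be chained with a small driver script. The following is only a minimal sketch: the script names come from the list above, but their location in the repository root, the absence of required command-line arguments, and the helper names `build_commands` and `run_pipeline` are all assumptions made for illustration.

```python
import subprocess
import sys
from pathlib import Path

# Script names are taken from the numbered steps; running them from the
# repository root without arguments is an assumption.
PIPELINE = [
    "run_query_dataset_from_hf.py",
    "process_results.py",
    "generate_informations_from_results.py",
    "calculate_metrics.py",
    "get_leaderboard_matrix.py",
]


def build_commands(scripts, python=sys.executable, root="."):
    """Return one command (as an argument list) per script, in order."""
    return [[python, str(Path(root) / name)] for name in scripts]


def run_pipeline(scripts=PIPELINE, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands(scripts):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

With `dry_run=True` the sketch only prints the commands, which makes it possible to check the order before spending API calls on the first step.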
