This document describes how to run the MT-Bench evaluation (MTBenchEval) for models in this repo.


1. First, go to the FastChat/fastchat directory and install the required libraries.

``pip install -e ".[model_worker,llm_judge]"``


2. Set up your OpenAI credentials, in particular the OPENAI_API_KEY environment variable.

``source openai_creds.sh``
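Since the judge step later calls the OpenAI API, it can help to fail early when the key is missing. The following is a minimal sketch; the function name `require_openai_key` is hypothetical and not part of FastChat.

```shell
# Hypothetical helper: verify that OPENAI_API_KEY is present in the environment.
require_openai_key() {
    if [ -z "${OPENAI_API_KEY:-}" ]; then
        echo "OPENAI_API_KEY is not set; run: source openai_creds.sh" >&2
        return 1
    fi
    echo "OPENAI_API_KEY is set"
}
```

Calling `require_openai_key` right after sourcing the credentials file shows whether the variable was set in the current shell.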

3. Go to the FastChat/fastchat/llm_judge directory.

First, generate the answers from your model (given by name or directory) and give it a distinct, memorable id for tabulation and comparison later on.

``pip install openai==0.28``
``python gen_model_answer.py --model-path meta-llama/Meta-Llama-3-8B-Instruct --model-id Llama8BInstructPlain``
``pip install openai==1.14.3``
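The pin, run, restore pattern above recurs in the later steps, so it can be wrapped in a small helper. This is only a sketch: `with_legacy_openai` and the `PIP_CMD` override are hypothetical names introduced here, and the version numbers are the ones used in this document.

```shell
# PIP_CMD is a hypothetical override so the helper can be dry-run (e.g. PIP_CMD="echo pip").
PIP_CMD="${PIP_CMD:-pip}"

# Install openai 0.28, run the given command, then restore openai 1.14.3.
# The restore runs even if the command fails; the command's exit status is returned.
with_legacy_openai() {
    $PIP_CMD install openai==0.28 || return 1
    status=0
    "$@" || status=$?
    $PIP_CMD install openai==1.14.3 || return 1
    return "$status"
}
```

For example: `with_legacy_openai python gen_model_answer.py --model-path meta-llama/Meta-Llama-3-8B-Instruct --model-id Llama8BInstructPlain`.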

For a LoRA fine-tuned checkpoint, include lora- in the model id alongside the parent model's indicator string (e.g. Llama3-).

Examples:
``python gen_model_answer.py --model-path /root/multiwoz-api/checkpoints/llama-3-lora-kto-works --model-id Llama3-8B-Lora-KTO-Aug26``
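Before judging, it may be worth checking that answer generation finished. The sketch below assumes that answers are written one per line to `data/mt_bench/model_answer/<model-id>.jsonl` (the upstream FastChat default, which may differ in this repo) and that the benchmark has 80 questions; `check_answers` and `ANSWER_DIR` are hypothetical names.

```shell
# Assumption: answers land in data/mt_bench/model_answer/<model-id>.jsonl, one JSON object per line.
ANSWER_DIR="${ANSWER_DIR:-data/mt_bench/model_answer}"
EXPECTED_ANSWERS=80  # assumed number of MT-Bench questions

# Print the number of answers for a model id; succeed only if it matches the expected count.
check_answers() {
    file="$ANSWER_DIR/$1.jsonl"
    if [ ! -f "$file" ]; then
        echo "missing: $file" >&2
        return 1
    fi
    n="$(awk 'END { print NR }' "$file")"
    echo "$1: $n answers"
    [ "$n" -eq "$EXPECTED_ANSWERS" ]
}
```

For example, `check_answers Llama3-8B-Lora-KTO-Aug26` returns a non-zero status if the file is missing or incomplete.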


If the model is served through an API instead, run:

``python gen_api_answer_migrated.py --model <apimodelname>``


Then, run the GPT-4-based LLM judge to evaluate the answers. This may take 15-20 minutes, since it goes over up to 320 passages, including some at the 2-turn level.
``pip install openai==0.28; python gen_judgment.py --model-list Llama8BInstructPlain; pip install openai==1.14.3``

Actual examples, generating judgments for the results from Llama3Instruct and two differently fine-tuned variants on 25th August:
``pip install openai==0.28; python gen_judgment.py --model-list Llama8BInstructPlainAug25; pip install openai==1.14.3``

``pip install openai==0.28; python gen_judgment.py --model-list llama-3-lora-sft-checkpoint-78; pip install openai==1.14.3``

``pip install openai==0.28; python gen_judgment.py --model-list Llama3-8B-Lora-KTO-Aug26; pip install openai==1.14.3``
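The three runs above could likely be combined into one. In upstream FastChat, `--model-list` accepts several space-separated ids; whether that holds for the copy of `gen_judgment.py` in this repo is an assumption. The sketch below only prints the combined command for review; `judge_all_command` and `MODEL_IDS` are hypothetical names.

```shell
# Model ids taken from the examples in this document.
MODEL_IDS="Llama8BInstructPlainAug25 llama-3-lora-sft-checkpoint-78 Llama3-8B-Lora-KTO-Aug26"

# Print (do not run) a single judge invocation covering every id in MODEL_IDS.
# Assumption: gen_judgment.py accepts several ids after --model-list.
judge_all_command() {
    echo "pip install openai==0.28; python gen_judgment.py --model-list $MODEL_IDS; pip install openai==1.14.3"
}
```

Run `judge_all_command` to inspect the line, then paste it into the shell once it looks right.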


Finally, show the results. This prints the first-turn, second-turn, and overall scores for the model ids you named.
``python show_result.py``
