A Comparison of LLM Fine-Tuning Methods & Evaluation Metrics with a Travel Chatbot Use Case

09 Jan 2025 (modified: 18 Jun 2025) · Submitted to ICML 2025 · CC BY 4.0
TL;DR: Our findings: 1) quantitative and Ragas metrics misalign with human evaluation, while GPT-4 evaluation aligns well; 2) RAFT outperforms QLoRA but needs postprocessing; and 3) RLHF significantly boosts performance, surpassing benchmark models.
Abstract: This research compared large language model (LLM) fine-tuning methods, including Quantized Low-Rank Adaptation (QLoRA), Retrieval Augmented Fine-Tuning (RAFT), and Reinforcement Learning from Human Feedback (RLHF), and additionally compared LLM evaluation methods, including an End-to-End (E2E) benchmark method of "Golden Answers", traditional natural language processing (NLP) metrics, RAG Assessment (Ragas), OpenAI GPT-4 evaluation metrics, and human evaluation, using a travel chatbot use case. The travel dataset was sourced from the Reddit API by requesting posts from travel-related subreddits to obtain conversation prompts and personalized travel experiences, and was augmented for each fine-tuning method. QLoRA and RAFT were applied to two pre-trained LLMs: LLaMa 2 7B and Mistral 7B. The best model according to human evaluation and several GPT-4 metrics was Mistral RAFT, which therefore underwent an RLHF training pipeline and was ultimately evaluated as the best model. Our main findings are: 1) quantitative and Ragas metrics do not align with human evaluation, while OpenAI GPT-4 evaluations do, 2) RAFT outperforms QLoRA but still needs postprocessing, and 3) RLHF improves model performance significantly, surpassing benchmark models.
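As a rough illustration of the QLoRA setup referenced in the abstract, the following is a minimal sketch using Hugging Face transformers, peft, and bitsandbytes. The checkpoint name, quantization settings, and adapter hyperparameters are assumptions for illustration only, not the paper's exact configuration.

# Minimal QLoRA sketch (illustrative; not the authors' exact pipeline)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint; the paper also fine-tunes LLaMa 2 7B

# 4-bit NF4 quantization of the base model: the "Q" in QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters on the attention projections; rank, alpha, and targets are illustrative
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

The resulting model can then be trained with a standard causal-language-modeling loop (e.g., the transformers Trainer) on the augmented travel dataset described above.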
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Model, Fine-tuning, QLoRA, RAFT, RLHF
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Submission Number: 406