2024 KDD Cup CRAG Workshop: UM6P Team Technical Report

Published: 11 Sept 2024 · Last Modified: 11 Sept 2024 · 2024 KDD Cup CRAG Workshop · License: CC BY-NC 4.0
Keywords: Large Language Models, Retrieval Augmented Generation, Web-Based, Summarization.
TL;DR: This paper describes the methodology followed by the UM6P team to participate in the 2024 KDD Cup CRAG Competition.
Abstract: Large Language Models (LLMs) have shown impressive capabilities in understanding and generating coherent natural language, but they still suffer from hallucinations, i.e., answers that seem coherent but are incorrect. Retrieval-Augmented Generation (RAG) aims to reduce hallucinations by grounding LLMs in external, relevant data sources. The Meta KDD Cup 2024 introduced the Comprehensive RAG (CRAG) challenge, which evaluates Question-Answering (QA) systems across five domains and eight question types. The challenge consists of three tasks: Web-Based Retrieval Summarization, Knowledge Graph and Web Augmentation, and End-to-End RAG. This paper summarizes the UM6P Team’s participation in the first task. We describe our experimental framework, including hyperparameter tuning, sampling strategy selection (to best align the offline and online results), and an extensive evaluation of various RAG pipelines. We also share key insights about the contribution of each RAG component to the overall performance, covering chunking, retrieval, and enhancement techniques. The pipelines were assessed offline using Llama 3 and online using GPT-4, based on the number of correct answers, missing answers, and hallucinations. Our experiments indicate that the best-performing pipeline consists of Facebook AI Similarity Search (FAISS), sentence chunking, re-ranking, and Hypothetical Document Embeddings (HyDE) for input enhancement, achieving a competitive accuracy score of 0.339 compared to the top score of 0.393, despite a lower overall CRAG score (0.05 vs. 0.204) due to hallucinations (0.288 vs. 0.189). We conclude with a discussion of the main technical and performance challenges encountered during the competition, and some pointers on future research directions.
Submission Number: 12
