Summary: This tutorial demonstrates implementing a Retrieval-Augmented Generation (RAG) system using LlamaIndex, focusing on integration with FAISS vector store and OpenAI's LLM. It covers essential techniques for document loading, text chunking, embedding generation using HuggingFace models, vector store setup with FAISS, and query engine configuration. The tutorial helps with tasks like building custom RAG pipelines, optimizing text splitting parameters, and customizing prompt templates. Key features include configurable chunk sizes, embedding model selection, FAISS vector store integration, and customizable query response generation, making it valuable for developers implementing production-ready RAG systems.

# RAG with LlamaIndex

LlamaIndex is a popular framework for connecting data sources to LLMs, and a top choice for building RAG applications. In this tutorial, we will walk through how to use LlamaIndex to combine bge-base-en-v1.5 and GPT-4o-mini into a RAG application.

## 0. Preparation

First, install the required packages in your environment.


```python
%pip install llama-index-llms-openai llama-index-embeddings-huggingface llama-index-vector-stores-faiss
%pip install llama_index 
```

Then fill in your OpenAI API key below:


```python
# For openai key
import os
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
```

BGE-M3 is a very powerful embedding model. We would like to know what the 'M3' stands for.

Let's first ask GPT the question:


```python
from llama_index.llms.openai import OpenAI

# non-streaming
response = OpenAI().complete("What does M3-Embedding stand for?")
print(response)
```
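The `# non-streaming` comment above hints at the streaming alternative. As a minimal sketch, `stream_complete` yields the response incrementally; each chunk exposes a `delta` with the newly generated text:

```python
from llama_index.llms.openai import OpenAI

# streaming: print tokens as they arrive
stream = OpenAI().stream_complete("What does M3-Embedding stand for?")
for chunk in stream:
    print(chunk.delta, end="", flush=True)
```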

By checking the description in the GitHub [repo](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3) of BGE-M3, we can be pretty sure that GPT's answer is a hallucination. Let's build a RAG pipeline to solve the problem!

## 1. Data

First, download the BGE-M3 [paper](https://arxiv.org/pdf/2402.03216) to a directory, and load it with `SimpleDirectoryReader`.

Note that `SimpleDirectoryReader` reads all the documents under the given directory and supports many commonly used [file types](https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader/#supported-file-types).


```python
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader("data")
# reader = SimpleDirectoryReader("DIR_TO_FILE")
documents = reader.load_data()
```
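If the directory contains files you don't want to index, `SimpleDirectoryReader` also accepts filters. A minimal sketch that restricts the reader to PDFs and searches subdirectories recursively (the `data` directory is carried over from above):

```python
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(
    "data",
    required_exts=[".pdf"],  # only load files with these extensions
    recursive=True,          # also search subdirectories
)
documents = reader.load_data()
```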

The `Settings` object holds the global configuration for the RAG pipeline. Its attributes have defaults (OpenAI's GPT as the LLM and OpenAI's embedding model) and can be overridden by users. Large attributes such as models are only loaded when first used.

Here, we set the `node_parser` to a `SentenceSplitter` with our chosen parameters, use the open-source `bge-base-en-v1.5` as our embedding model, and `gpt-4o-mini` as our LLM.


```python
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI

# set the parser with parameters
Settings.node_parser = SentenceSplitter(
    chunk_size=1000,    # maximum number of tokens per chunk
    chunk_overlap=150,  # number of overlapping tokens between consecutive chunks
)

# set the specific embedding model
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")

# set the llm we want to use
Settings.llm = OpenAI(model="gpt-4o-mini")
```
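To sanity-check the splitter settings before building an index, you can run the configured `node_parser` directly on the loaded documents. This sketch assumes `documents` from step 1 is still in scope:

```python
# split the documents into nodes without building an index
nodes = Settings.node_parser.get_nodes_from_documents(documents)
print(f"{len(documents)} document(s) split into {len(nodes)} nodes")
```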

## 2. Indexing

Indexing is one of the most important parts of RAG. LlamaIndex integrates with a great number of vector databases. Here we will use Faiss as an example.

First, check the dimension of the embeddings, which we will need when initializing the Faiss index.


```python
# embed a sample sentence to check the dimension
embedding = Settings.embed_model.get_text_embedding("Hello world")
dim = len(embedding)
print(dim)  # bge-base-en-v1.5 produces 768-dimensional embeddings
```

Then create the index with Faiss and our documents. Here LlamaIndex encapsulates the Faiss function calls for us. If you would like to know more about Faiss, refer to the tutorial on [Faiss and indexing](https://github.com/FlagOpen/FlagEmbedding/tree/master/Tutorials/3_Indexing).


```python
import faiss
from llama_index.vector_stores.faiss import FaissVectorStore
from llama_index.core import StorageContext, VectorStoreIndex

# init Faiss and create a vector store
faiss_index = faiss.IndexFlatL2(dim)
vector_store = FaissVectorStore(faiss_index=faiss_index)

# customize the storage context using our vector store
storage_context = StorageContext.from_defaults(
    vector_store=vector_store
)

# use the loaded documents to build the index
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
```
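Building the index embeds every chunk, so you may want to persist it to disk and reload it later instead of re-embedding the documents. A minimal sketch following LlamaIndex's persistence pattern for Faiss (the `./storage` directory is just an example path):

```python
from llama_index.core import StorageContext, load_index_from_storage
from llama_index.vector_stores.faiss import FaissVectorStore

# save the index (including the underlying Faiss index) to disk
index.storage_context.persist(persist_dir="./storage")

# load it back later without re-embedding the documents
vector_store = FaissVectorStore.from_persist_dir("./storage")
storage_context = StorageContext.from_defaults(
    vector_store=vector_store, persist_dir="./storage"
)
index = load_index_from_storage(storage_context=storage_context)
```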

## 3. Retrieve and Generate

With a well-constructed index, we can now build the query engine to accomplish our task:


```python
query_engine = index.as_query_engine()
```
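`as_query_engine()` accepts keyword arguments that are forwarded to the underlying retriever and response synthesizer. For example, `similarity_top_k` controls how many chunks are retrieved for each query (the default is 2); to give the LLM more context:

```python
# retrieve the top 5 most similar chunks instead of the default
query_engine = index.as_query_engine(similarity_top_k=5)
```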

The following cell displays the default prompt template for Q&A in our pipeline:


```python
# check the default prompt template
prompt_template = query_engine.get_prompts()['response_synthesizer:text_qa_template']
print(prompt_template.get_template())
```

(Optional) You can modify the prompt to match your use case:


```python
from llama_index.core import PromptTemplate

template = """
You are a Q&A chatbot.
Using only the given context, answer the question.

<context>
{context_str}
</context>

Question: {query_str}
"""

new_template = PromptTemplate(template)
query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": new_template}
)

prompt_template = query_engine.get_prompts()['response_synthesizer:text_qa_template']
print(prompt_template.get_template())
```

Finally, let's see how the RAG application performs on our query!


```python
response = query_engine.query("What does M3-Embedding stand for?")
print(response)
```
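To check that the answer is actually grounded in the paper rather than hallucinated, you can inspect the chunks the engine retrieved. Each entry in `response.source_nodes` carries the node text and its similarity score:

```python
# show the retrieved chunks and their similarity scores
for node_with_score in response.source_nodes:
    print(f"score: {node_with_score.score:.4f}")
    print(node_with_score.node.get_text()[:200], "...\n")
```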
