Summary: This tutorial provides practical implementation guidance for text embedding systems, focusing on both open-source and commercial API approaches. It demonstrates code implementations for popular embedding models including BGE and Sentence Transformers (open-source), plus OpenAI and Voyage AI (commercial APIs). The tutorial covers essential setup requirements, environment configurations, and code snippets for generating embeddings. Key features include vector normalization, GPU configuration, and embedding generation with different frameworks. It's particularly useful for tasks involving semantic search, text similarity comparison, and vector representations, while highlighting important trade-offs between open-source and commercial solutions regarding costs, resource requirements, and usage limitations.

# Intro to Embedding

For text retrieval, pattern matching is the most intuitive way. People would use certain characters, words, phrases, or sentence patterns. However, not only for human, it is also extremely inefficient for computer to do pattern matching between a query and a collection of text files to find the possible results. 

For images and acoustic waves, there are rgb pixels and digital signals. Similarly, in order to accomplish more sophisticated tasks of natural language such as retrieval, classification, clustering, or semantic search, we need a way to represent text data. That's how text embedding comes in front of the stage.

## 1. Background

Traditional text embedding methods like one-hot encoding and bag-of-words (BoW) represent words and sentences as sparse vectors based on their statistical features, such as word appearance and frequency within a document. More advanced methods like TF-IDF and BM25 improve on these by considering a word's importance across an entire corpus, while n-gram techniques capture word order in small groups. However, these approaches suffer from the "curse of dimensionality" and fail to capture semantic similarity like "cat" and "kitty", difference like "play the watch" and "watch the play".


```python
# example of bag-of-words
sentence1 = "I love basketball"
sentence2 = "I have a basketball match"

words = ['I', 'love', 'basketball', 'have', 'a', 'match']
sen1_vec = [1, 1, 1, 0, 0, 0]
sen2_vec = [1, 0, 1, 1, 1, 1]
```

To overcome these limitations, dense word embeddings were developed, mapping words to vectors in a low-dimensional space that captures semantic and relational information. Early models like Word2Vec demonstrated the power of dense embeddings using neural networks. Subsequent advancements with neural network architectures like RNNs, LSTMs, and Transformers have enabled more sophisticated models such as BERT, RoBERTa, and GPT to excel in capturing complex word relationships and contexts. **BAAI General Embedding (BGE)** provide a series of open-source models that could satisfy all kinds of demands.

## Get Embedding

The first step of modern text retrieval is embedding the text. So let's take a look at how to use the embedding models.

Install the packages:


```python
%%capture
%pip install -U FlagEmbedding sentence_transformers openai cohere
```


```python
import os 
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'
# single GPU is better for small tasks
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
```

We'll use the following three sentences as the inputs:


```python
sentences = [
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day",
]
```

### Open-source Models

A huge portion of embedding models are in the open source community. The advantages of open-source models include:
- Free, no extra cost. But make sure to check the License and your use case before using.
- No frequency limit, can accelerate a lot if you have enough GPUs to parallelize.
- Transparent and might be reproducible.

Let's take a look at two representatives:

#### BGE

BGE is a series of embedding models and rerankers published by BAAI. Several of them reached SOTA at the time they released.


```python
from FlagEmbedding import FlagModel

# Load BGE model
model = FlagModel('BAAI/bge-base-en-v1.5')

# encode the queries and corpus
embeddings = model.encode(sentences)
print(f"Embeddings:\n{embeddings.shape}")

scores = embeddings @ embeddings.T
print(f"Similarity scores:\n{scores}")
```

#### Sentence Transformers

Sentence Transformers is a library for sentence embeddings with a huge amount of embedding models and datasets for related tasks.


```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = model.encode(sentences, normalize_embeddings=True)
print(f"Embeddings:\n{embeddings.shape}")

scores = embeddings @ embeddings.T
print(f"Similarity scores:\n{scores}")
```

### Commercial Models

There are also plenty choices of commercial models. They have the advantages of:
- Efficient memory usage, fast inference with no need of GPUs.
- Systematic support, commercial models have closer connections with their other products.
- Better training data, commercial models might be trained on larger, higher-quality datasets than some open-source models.

#### OpenAI

Along with GPT series, OpenAI has their own embedding models. Make sure to fill in your own API key in the field `"YOUR_API_KEY"`


```python
import os
import numpy as np

os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
```

Then run the following cells to get the embeddings. Check their official [documentation](https://platform.openai.com/docs/guides/embeddings) for more details.


```python
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(input = sentences, model="text-embedding-3-small")
```


```python
embeddings = np.asarray([response.data[i].embedding for i in range(len(sentences))])
print(f"Embeddings:\n{embeddings.shape}")

scores = embeddings @ embeddings.T
print(f"Similarity scores:\n{scores}")
```

#### Voyage AI

Voyage AI provides embedding models and rerankers for different purpus and in various fields. Their API keys can be freely used in low frequency and token length.


```python
os.environ["VOYAGE_API_KEY"] = "YOUR_API_KEY"
```

Check their official [documentation](https://docs.voyageai.com/docs/api-key-and-installation) for more details.


```python
import voyageai

vo = voyageai.Client()

result = vo.embed(sentences, model="voyage-large-2-instruct")
```


```python
embeddings = np.asarray(result.embeddings)
print(f"Embeddings:\n{embeddings.shape}")

scores = embeddings @ embeddings.T
print(f"Similarity scores:\n{scores}")
```
