Image by author – generated by Stable Diffusion

Let’s pretend you have a big collection of books, articles, or other texts on various topics. You want to create a magical system that can understand what each text is about and help you find similar texts when you ask a question or give it a short description. To make this happen, we’ll use two cool tools: the OpenAI Embedding API and Pinecone.

  1. OpenAI Embedding API: Think of this as a magical box that can read and understand the meaning of any text you give it. When you feed it some text, it gives you a secret code (called an “embedding”) that captures the essence of that text.
  2. Pinecone: This is like a super-smart librarian who can store and manage all the secret codes (embeddings) from the magical box (OpenAI Embedding API). When you give Pinecone a secret code, it quickly finds and returns the most similar codes (and their texts) from the library.
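
To make "similar codes" a little less magical, here is a minimal sketch of one common way to compare two embeddings: cosine similarity. The three-number vectors and their names are made up for illustration (real embeddings are much longer lists of numbers), and the cosine_similarity helper is our own, not part of either tool.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", invented for this example
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
stock_market = [0.0, 0.2, 0.9]

sim_close = cosine_similarity(cat, kitten)
sim_far = cosine_similarity(cat, stock_market)

# Texts with related meanings should score higher than unrelated ones
print(sim_close > sim_far)  # prints True
```

The closer the score is to 1, the more the two codes point in the same direction, which is what "similar meaning" boils down to here.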

Now let’s see how to create this magical system step by step:

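1 – First, we need to install the Python packages that let us talk to OpenAI and Pinecone. The code in this post uses the older interfaces of both libraries (openai.Embedding and pinecone.init), which were removed in openai 1.0 and pinecone-client 3.0, so we install versions that still include them:

pip install -U "openai<1.0" "pinecone-client<3.0" datasets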

2 – Next, we need to tell our computer how to connect with OpenAI and Pinecone using secret passwords (called API keys):

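import openai
import pinecone

openai.api_key = "YOUR_OPENAI_API_KEY"
# pinecone.init also takes the environment your Pinecone project lives in
pinecone.init(api_key="YOUR_PINECONE_API_KEY", environment="YOUR_PINECONE_ENVIRONMENT")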
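
If you'd rather not paste your secret passwords straight into the script, one option is to read them from environment variables. This is only a sketch: the read_api_key helper and the variable names OPENAI_API_KEY and PINECONE_API_KEY are assumptions chosen for this example, not something either service requires.

```python
import os

def read_api_key(var_name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(var_name)
    if not value:
        raise RuntimeError(f"Please set the {var_name} environment variable")
    return value

# Possible usage (assumed variable names):
# openai.api_key = read_api_key("OPENAI_API_KEY")
# pinecone_key = read_api_key("PINECONE_API_KEY")
```

This way the keys never end up in a file you might share by accident.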

3 – Now, let’s use the magical box (OpenAI Embedding API) to read some text and give us the secret code (embedding):

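# the embedding model to use
MODEL = "text-embedding-ada-002"
# ask the API for one embedding per input text
res = openai.Embedding.create(input=["Sample document text goes here"], engine=MODEL)
# pull the embeddings (lists of numbers) out of the response
embeds = [record['embedding'] for record in res['data']]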

4 – It’s time to ask Pinecone, our super-smart librarian, to create a special place (an “index”) to store the secret codes:

if 'openai' not in pinecone.list_indexes():
    pinecone.create_index('openai', dimension=len(embeds[0]))
index = pinecone.Index('openai')

5 – Let’s load a collection of texts (here, the first 1,000 entries from a dataset called TREC) that we’ll turn into secret codes using the magical box:

from datasets import load_dataset

trec = load_dataset('trec', split='train[:1000]')

6 – We give the secret codes to Pinecone, so it can store them in the special place (index) we created earlier:

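batch_size = 32  # texts per API call; 32 is an arbitrary choice, adjust as needed
for i in range(0, len(trec['text']), batch_size):
    # end of this batch, clamped so the last batch doesn't run past the data
    i_end = min(i + batch_size, len(trec['text']))
    lines_batch = trec['text'][i:i_end]
    ids_batch = [str(n) for n in range(i, i_end)]
    # embed the whole batch in one request
    res = openai.Embedding.create(input=lines_batch, engine=MODEL)
    embeds = [record['embedding'] for record in res['data']]
    # keep the original text as metadata so query results are readable
    meta = [{'text': line} for line in lines_batch]
    # each vector is an (id, embedding, metadata) tuple
    to_upsert = zip(ids_batch, embeds, meta)
    index.upsert(vectors=list(to_upsert))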
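
The batching part of this loop can be tried on its own, without calling any API. Here is a small sketch of the idea: the batched helper and the "doc N" texts are hypothetical, made up for this example, and the batch size of 3 is only there to keep the output short.

```python
def batched(items, batch_size):
    """Yield (start, end, chunk) for consecutive slices of items."""
    for start in range(0, len(items), batch_size):
        # clamp the end so the final, shorter batch stays in range
        end = min(start + batch_size, len(items))
        yield start, end, items[start:end]

# Seven made-up texts, split into batches of three
texts = [f"doc {n}" for n in range(7)]
batches = list(batched(texts, 3))

for start, end, chunk in batches:
    # ids are the positions of the texts, as strings
    ids = [str(n) for n in range(start, end)]
    print(ids, chunk)
```

Notice that the last batch holds just one text; clamping the end index keeps the ids and the texts lined up even when the collection doesn't divide evenly.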

7 – Now, when you have a question or a short description, you can use the magical box to create a secret code for it and then ask Pinecone to find the most similar codes (and texts) from its library:

query = "What caused the 1929 Great Depression?"
xq = openai.Embedding.create(input=query, engine=MODEL)['data'][0]['embedding']
res = index.query([xq], top_k=5, include_metadata=True)
for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")
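
To get a feel for what the librarian is doing when it answers a query, here is a tiny stand-in that runs entirely on your own computer. Treat it as a sketch under assumptions: the top_k function, the two-number vectors, and the texts are all invented for illustration, and it simply compares the query against every stored code using cosine similarity, which a real vector database is designed to do far more efficiently.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, stored, k):
    """Return the k best (score, id, metadata) tuples, highest score first.

    stored is a list of (id, vector, metadata) tuples, the same shape
    that was upserted in the previous step.
    """
    scored = [(cosine(query_vec, vec), id_, meta) for id_, vec, meta in stored]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]

# Made-up library of three stored codes
stored = [
    ("0", [1.0, 0.0], {"text": "economy"}),
    ("1", [0.0, 1.0], {"text": "animals"}),
    ("2", [0.7, 0.7], {"text": "mixed"}),
]

# Made-up query code that points mostly toward "economy"
matches = top_k([0.9, 0.1], stored, k=2)
for score, id_, meta in matches:
    print(f"{score:.2f}: {meta['text']}")
```

The best match comes first, just like the scores printed from the Pinecone results above.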

So, there you have it! By using the magical box (OpenAI Embedding API) and our super-smart librarian (Pinecone), we’ve created a system that helps you find similar texts based on their meaning, even if they don’t use the same words. This can be really helpful when you want to learn more about a topic, answer questions, or explore new ideas!