
Effortless Embeddings with vecs and Supabase

Supabase have been adding a huge number of features to support machine learning use cases over the last few months. In particular I’ve been keen to try out vecs, their Python library for managing postgres databases that have the pgvector extension enabled. It’s completely open source, available on PyPI and works with any pgvector database, not just Supabase.

They also recently announced that they’ve integrated Huggingface models, so I’ll show you how to use those too, in the context of searching a collection of movie descriptions.

This post was written as part of Supabase’s AI content storm - go check it out for all the other exciting work people have done.

Vector Databases?

If you’re new to machine learning and everything that goes with it, you might wonder what vector databases are in the first place. Vector databases allow us to store embeddings. An embedding is a numerical representation of a piece of text (or an image, audio clip, and so on) stored as a vector, and comparing two embeddings tells us how similar the underlying pieces of text are.

This means that if we calculate embeddings for different pieces of text and the resulting vectors are far from one another, the texts are not very similar; if the vectors are close together, the texts are similar. This makes embeddings really useful for things like search.
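
As a toy illustration (using numpy and made-up three-dimensional vectors, where real embeddings have hundreds or thousands of dimensions), cosine similarity is one common way of measuring how close two vectors are:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way, values near 0 mean unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Pretend embeddings for three pieces of text
matrix_article = np.array([0.9, 0.1, 0.0])
sci_fi_query = np.array([0.8, 0.3, 0.1])
cooking_blog = np.array([0.0, 0.2, 0.9])

print(cosine_similarity(matrix_article, sci_fi_query))  # high - similar topics
print(cosine_similarity(matrix_article, cooking_blog))  # low - unrelated topics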

Using Vecs with Supabase

Supabase supports the pgvector extension (called vector), which needs to be enabled on your database before you can store embeddings. Once it’s enabled, you can use vecs to manage it.

Install vecs along with the optional dependencies for text embeddings into a virtual environment using your favourite dependency manager. Here’s how to do that using pip:

python -m venv .venv
source ./.venv/bin/activate

pip install "vecs[text_embedding]"

Create a new vecs client using the credentials for your database. If you’re using Supabase, you’ll find the Postgres connection details in your project’s database settings.

import vecs

DB_CONNECTION = "postgresql://<user>:<password>@<host>:<port>/<db_name>"

# create vector store client
vx = vecs.create_client(DB_CONNECTION)

Creating Embeddings with Ollama

I wanted to create entries for a few recent movies, which typically fall outside the training data of large language models. I’m going to store a long text description of each movie, taken from Wikipedia, and then search across those documents.

First I need to take my text description and calculate the embeddings for it. I could pass embeddings to vecs but I’d need something else to create those embeddings first.

Here’s an example of how that might be done with Ollama running locally using mistral:

import json
import requests

url = "http://localhost:11434/api/embeddings"
headers = {"Content-Type": "application/json"}
data = {
    "model": "mistral",
    "prompt": "Here is an article about my movie..."
}

response = requests.post(url, headers=headers, data=json.dumps(data))

embedding = response.json()["embedding"]
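
The upsert below expects one embedding per film, so in practice I’d wrap the call above in a small helper. Here’s a rough sketch, assuming a film_data dict that maps ids to each movie’s Wikipedia text (the same dict I use for the metadata later on):

import json
import requests

def ollama_embedding(text):
    # Ask the locally running Ollama server for a mistral embedding of the text
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        headers={"Content-Type": "application/json"},
        data=json.dumps({"model": "mistral", "prompt": text}),
    )
    response.raise_for_status()
    return response.json()["embedding"]

# film_data is assumed to map ids to each movie's Wikipedia text
embedding_1 = ollama_embedding(film_data["film0"])
embedding_2 = ollama_embedding(film_data["film1"])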

In the case of an embedding created by mistral, the number of dimensions is 4096. When we create a vector collection, we need to create it with the same number of dimensions. Here’s how to do that before upserting two documents with the embeddings created above.

docs = vx.get_or_create_collection(name="docs", dimension=4096)

docs.upsert(
    records=[
        (
            "film0",
            embedding_1,
            {"year": 2021},
        ),
        (
            "film1",
            embedding_2,
            {"year": 2023},
        ),
    ]
)

What makes this process even more difficult is that the text will typically need to be broken up (or chunked), since each article is longer than the mistral model can handle in one go.
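
If I were doing that by hand, it might look something like this naive sketch (splitting each article on blank lines and reusing the hypothetical ollama_embedding helper from above), which gets tedious quickly:

def paragraph_chunks(text):
    # Split an article on blank lines and drop any empty chunks
    return [p.strip() for p in text.split("\n\n") if p.strip()]

records = []
for film_id, article in film_data.items():
    for i, paragraph in enumerate(paragraph_chunks(article)):
        # One record per paragraph, each with its own embedding
        records.append(
            (f"{film_id}_para_{i:03d}", ollama_embedding(paragraph), {"film": film_id})
        )

docs.upsert(records=records)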

Using Huggingface Embeddings

Luckily vecs has some neat ways of making this much easier through text embeddings provided by Huggingface. Additionally, instead of having to chunk my text myself or include another library to do so, I can also hand this off to vecs.

from vecs.adapter import Adapter, ParagraphChunker, TextEmbedding

docs = vx.get_or_create_collection(
    name="docs",
    adapter=Adapter(
        [
            ParagraphChunker(skip_during_query=True),
            TextEmbedding(model="all-MiniLM-L6-v2", batch_size=8),
        ]
    ),
)

Here, we define an adapter pipeline: a paragraph chunker first breaks each document into manageable chunks, then a text embedding step creates an embedding for each chunk using the all-MiniLM-L6-v2 sentence transformer model.

To upsert with that approach, we can now just pass the text we want to store, skipping the manual embedding step with Ollama.

docs.upsert(
    records=[
        (
            "film0",
            "The Matrix Resurrections is a 2021 American science fiction action film produced, co-written, and directed by Lana Wachowski, and the first in the Matrix franchise to be directed solely..",
            {
                "title": "The Matrix Resurrections",
                "year": 2021,
                "director": "Lana Wachowski",
                "text": film_data["film0"]
            },
        ),
        (
            "film1",
            "The Creator is a 2023 American science fiction film produced and directed by Gareth Edwards, who co-wrote the screenplay with Chris Weitz. After a nuclear detonation in Los Angeles and a war against artificial intelligence..",
            {
                "title": "The Creator",
                "year": 2023,
                "director": "Gareth Edwards",
                "text": film_data["film1"],
            },
        ),
    ]
)

Querying Our Data

Querying our records is really simple once we’ve stored them with embeddings. We stored some metadata along with our docs, which we include in the response by setting include_metadata=True. When we send a text query, we’re asking how similar each piece of stored text is to the query we make.

results = docs.query(
    data="movies directed by the wachowskis", limit=3, include_metadata=True
)

for doc_id, metadata in results:
    print(f"Result {doc_id}: {metadata['title']} ({metadata['year']})")

We can see the first 3 results returned are paragraphs from the first article.

Result film0_para_001: The Matrix Resurrections (2021)
Result film0_para_027: The Matrix Resurrections (2021)
Result film0_para_013: The Matrix Resurrections (2021)

The eagle-eyed might be thinking that we could have answered this particular query with a relational database, storing each article along with columns for each of the metadata values.
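
That’s true for exact attributes, and vecs doesn’t force a choice between the two: as I understand it from the vecs docs, you can pass a mongo-style filters argument alongside the text query so the vector search only considers records with matching metadata. A rough sketch:

# Only consider records whose "year" metadata equals 2021,
# then rank the remaining chunks by similarity to the query
results = docs.query(
    data="movies directed by the wachowskis",
    limit=3,
    filters={"year": {"$eq": 2021}},
    include_metadata=True,
)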

Another example is querying with text that would be particularly difficult for a non-vector database. Here I query with “Good sci-fi movie with robots”. Even after doubling my limit to 6 results, every match is a paragraph from “The Creator”, indicating it’s a strong match from our collection.

results = docs.query(
    data="Good sci-fi movie with robots", limit=6, include_metadata=True
)

Result film1_para_041: The Creator (2023)
Result film1_para_036: The Creator (2023)
Result film1_para_043: The Creator (2023)
Result film1_para_027: The Creator (2023)
Result film1_para_042: The Creator (2023)
Result film1_para_046: The Creator (2023)

Further Work

It’s great to see the improvements Supabase are making to vecs over time. There are so many things I want to build with it. I’m keen to explore how I can integrate langchain to combine Ollama with Supabase and query my collections using natural language.
