Build a content recommendation API using vector search

In this guide you'll learn how to calculate vector embeddings using HuggingFace models and use Tinybird to perform vector search to find similar content based on vector distances.

GitHub Repository
Tinybird blog related posts uses vector search recommendation algorithm.

TL;DR

In this tutorial, you will learn how to:

  1. Use Python to fetch content from an RSS feed
  2. Calculate vector embeddings on long form content (blog posts) using SentenceTransformers in Python
  3. Post vector embeddings to a Tinybird Data Source using the Tinybird Events API
  4. Write a dynamic SQL query to calculate the closest content matches to a given blog post based on vector distances
  5. Publish your query as an API and integrate it into a frontend application

Prerequisites

To complete this tutorial, you'll need:

  1. A free Tinybird account
  2. An empty Tinybird Workspace
  3. Python >= 3.8

This tutorial does not include a frontend, but we provide an example snippet below showing how you might integrate the published API into a React frontend.

1. Setup

Clone the demo_vector_search_recommendation repo. We'll use the repository as a reference throughout this tutorial.

Authenticate the Tinybird CLI using your user admin token from your Tinybird Workspace:

cd tinybird
tb auth --token $USER_ADMIN_TOKEN

2. Fetch content and calculate embeddings

In this tutorial we fetch blog posts from the Tinybird Blog using the Tinybird Blog RSS feed. You can use any rss.xml feed to fetch blog posts and calculate embeddings from their content.

You can fetch and parse the RSS feed using the feedparser library in Python, get a list of posts, and then fetch each post and parse the content with the BeautifulSoup library.

Once you've fetched each post, you can calculate an embedding using the HuggingFace sentence_transformers library. In this demo, we use the all-MiniLM-L6-v2 model, which maps sentences and paragraphs to a 384-dimensional dense vector space. You can browse other models on HuggingFace.

You can achieve this and the following step (fetch posts, calculate embeddings, and send them to Tinybird) by running load.py from the code repository. We walk through the function of that script below so you can understand how it works.

from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer

import datetime
import feedparser
import requests
import json

timestamp = datetime.datetime.now().isoformat()
url = "https://www.tinybird.co/blog-posts/rss.xml" # Update to your preferred RSS feed
feed = feedparser.parse(url)

model = SentenceTransformer("all-MiniLM-L6-v2")

posts = []
for entry in feed.entries:
    doc = BeautifulSoup(requests.get(entry.link).content, features="html.parser")
    if (content := doc.find(id="content")):
        embedding = model.encode([content.get_text()])
        posts.append(json.dumps({
            "timestamp": timestamp,
            "title": entry.title,
            "url": entry.link,
            "embedding": embedding.mean(axis=0).tolist()
        }))
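
In the script above, model.encode() receives a one-item list, so it returns a single row and mean(axis=0) leaves that row unchanged. Sentence-transformers models truncate input past their maximum sequence length, so for very long posts you might instead encode several chunks and average them. The sketch below shows that pooling step in plain Python with toy 3-dimensional vectors; the helper names chunk_words and mean_pool and the 200-word chunk size are assumptions for illustration, not part of load.py.

```python
from typing import List


def chunk_words(text: str, max_words: int = 200) -> List[str]:
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def mean_pool(vectors: List[List[float]]) -> List[float]:
    """Average equal-length vectors element-wise, like ndarray.mean(axis=0)."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy "embeddings" standing in for two encoded chunks of one post.
chunk_embeddings = [[0.0, 1.0, 2.0], [2.0, 3.0, 4.0]]
post_embedding = mean_pool(chunk_embeddings)
print(post_embedding)  # [1.0, 2.0, 3.0]
```

With the real model, you would pass the output of chunk_words() to model.encode() and keep the existing embedding.mean(axis=0) call to get one vector per post.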

3. Post content metadata and embeddings to Tinybird

Once you've calculated the embeddings, you can push them along with the content metadata to Tinybird using the Events API.

First, set up some environment variables for your Tinybird host and token with DATASOURCES:WRITE scope:

export TB_HOST=your_tinybird_host
export TB_APPEND_TOKEN=your_tinybird_token

Next, you'll need to set up a Tinybird Data Source to receive your data. Note that if the Events API doesn't find a Tinybird Data Source by the supplied name, it will create one. But since we want control over our schema, we're going to create an empty Data Source first.

In the tinybird/datasources folder of the repository, you'll find a posts.datasource file that looks like this:

SCHEMA >
    `timestamp` DateTime `json:$.timestamp`,
    `title` String `json:$.title`,
    `url` String `json:$.url`,
    `embedding` Array(Float32) `json:$.embedding[:]`

ENGINE ReplacingMergeTree
ENGINE_PARTITION_KEY ""
ENGINE_SORTING_KEY title, url
ENGINE_VER timestamp

This Data Source will receive the updated post metadata and calculated embeddings and deduplicate based on the most up-to-date retrieval. The ReplacingMergeTree is used to deduplicate, relying on the ENGINE_VER setting, which in this case is set to the timestamp column. This tells the engine that the versioning of each entry is based on the timestamp column, and only the entry with the latest timestamp will be kept in the Data Source.
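
To make the deduplication rule concrete, here is a minimal Python sketch of the end result: for each sorting key (title, url), only the row with the latest timestamp survives. It only models the outcome, not the engine itself. ReplacingMergeTree deduplicates during background merges, so duplicates can coexist for a while after ingestion. The function name latest_rows and the sample rows are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def latest_rows(rows: Iterable[Row]) -> List[Row]:
    """Keep one row per (title, url), preferring the highest timestamp."""
    latest: Dict[Tuple[str, str], Row] = {}
    for row in rows:
        key = (row["title"], row["url"])
        current = latest.get(key)
        # ISO 8601 timestamps in the same format compare correctly as strings.
        if current is None or row["timestamp"] >= current["timestamp"]:
            latest[key] = row
    return list(latest.values())


# Hypothetical rows: "Post A" was fetched twice, "Post B" once.
rows = [
    {"timestamp": "2024-01-01T00:00:00", "title": "Post A", "url": "/a", "embedding": [0.1]},
    {"timestamp": "2024-01-01T12:00:00", "title": "Post A", "url": "/a", "embedding": [0.2]},
    {"timestamp": "2024-01-01T00:00:00", "title": "Post B", "url": "/b", "embedding": [0.3]},
]
deduplicated = latest_rows(rows)
print([(r["title"], r["timestamp"]) for r in deduplicated])
```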

The Data Source has the title column as its primary sorting key, because we will be filtering by title to retrieve the embedding for the current post. Having title as the primary sorting key makes that filter more performant.

Push this Data Source to Tinybird:

cd tinybird
tb push datasources/posts.datasource

Then, you can use a Python script to push the post metadata and embeddings to the Data Source using the Events API:

import os
import requests

TB_APPEND_TOKEN=os.getenv("TB_APPEND_TOKEN")
TB_HOST=os.getenv("TB_HOST")

def send_posts(posts):
    params = {
        "name": "posts",
        "token": TB_APPEND_TOKEN
    }
    data = "\n".join(posts) # ndjson
    r = requests.post(f"{TB_HOST}/v0/events", params=params, data=data)
    print(r.status_code)

send_posts(posts)
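
As an alternative sketch, the same request can be assembled with only the standard library, passing the token in an Authorization header rather than as a query parameter. The function name build_events_request is hypothetical, and unlike send_posts() it takes a list of dicts and serializes them itself. The example only builds the request; sending it is left as a commented-out line.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List


def build_events_request(
    host: str, token: str, datasource: str, rows: List[Dict[str, Any]]
) -> urllib.request.Request:
    """Build (but do not send) an NDJSON POST to the Events API."""
    # NDJSON: one JSON document per line.
    body = "\n".join(json.dumps(row) for row in rows).encode("utf-8")
    query = urllib.parse.urlencode({"name": datasource})
    url = f"{host.rstrip('/')}/v0/events?{query}"
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )


# Placeholder host, token, and rows for illustration only.
sample_rows = [
    {"timestamp": "2024-01-01T00:00:00", "title": "Post A", "url": "/a", "embedding": [0.1, 0.2]},
    {"timestamp": "2024-01-01T00:00:00", "title": "Post B", "url": "/b", "embedding": [0.3, 0.4]},
]
request = build_events_request("https://api.tinybird.co", "test-token", "posts", sample_rows)
print(request.full_url)
# To send it: urllib.request.urlopen(request)
```

Keeping the token out of the URL means it is less likely to end up in logs that record request URLs.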

To keep embeddings up to date, you should retrieve new content on a schedule and push it to Tinybird. In the repository, you'll find a GitHub Action called tinybird_recommendations.yml that fetches new content from the Tinybird blog every 12 hours and pushes it to Tinybird. The Tinybird Data Source in this project uses a ReplacingMergeTree to deduplicate blog post metadata and embeddings as new data arrives.

4. Calculate distances in SQL using Tinybird Pipes

If you've completed the steps above, you should have a posts Data Source in your Tinybird Workspace containing the last fetched timestamp, title, url, and embedding for each blog post fetched from your RSS feed.

You can verify that you have data from the Tinybird CLI with:

tb sql 'SELECT * FROM posts'

This tutorial includes a single-node SQL Pipe to calculate the vector distance of each post to a specific post supplied as a query parameter. The Pipe config is contained in the similar_posts.pipe file in the tinybird/pipes folder, and the SQL is copied below for reference and explanation.

%
WITH
  (
    SELECT embedding
    FROM
    (
      SELECT 0 AS id, embedding
      FROM posts
      WHERE
          title = {{ String(title) }}
      ORDER BY timestamp DESC
      LIMIT 1
      UNION ALL
      SELECT 999 AS id, arrayWithConstant(384, 0.0) embedding
    )
      ORDER BY id
      LIMIT 1
    ) AS post_embedding
SELECT title, url, L2Distance(embedding, post_embedding) similarity
FROM posts FINAL
WHERE title <> {{ String(title) }}
ORDER BY similarity ASC
LIMIT 10

This query first fetches the embedding of the requested post, and returns an array of 0s in the event an embedding can't be fetched. It then calculates the Euclidean vector distance between each additional post and the specified post using the L2Distance() function, sorts them by ascending distance, and limits to the top 10 results.
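
The same logic can be sketched in plain Python to make each step explicit: look up the target embedding, fall back to a zero vector, compute Euclidean distances, and keep the closest matches. This is an illustration of what the query computes, not how Tinybird executes it. The function names and the 2-dimensional sample vectors are assumptions.

```python
import math
from typing import Any, Dict, List


def l2_distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similar_posts(
    posts: List[Dict[str, Any]], title: str, limit: int = 10, dim: int = 384
) -> List[Dict[str, Any]]:
    """Rank every other post by distance to the post with the given title."""
    target = next((p["embedding"] for p in posts if p["title"] == title), None)
    if target is None:
        target = [0.0] * dim  # same fallback as arrayWithConstant(384, 0.0)
    scored = [
        {"title": p["title"], "url": p["url"],
         "similarity": l2_distance(p["embedding"], target)}
        for p in posts
        if p["title"] != title
    ]
    scored.sort(key=lambda row: row["similarity"])  # smaller distance = closer
    return scored[:limit]


# Toy 2-dimensional embeddings for illustration.
posts = [
    {"title": "A", "url": "/a", "embedding": [0.0, 0.0]},
    {"title": "B", "url": "/b", "embedding": [3.0, 4.0]},
    {"title": "C", "url": "/c", "embedding": [1.0, 0.0]},
]
results = similar_posts(posts, "A", dim=2)
print([r["title"] for r in results])  # ['C', 'B']
```

Note that the column is named similarity but holds a distance, so lower values mean more similar posts.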

You can push this Pipe to your Tinybird server with:

cd tinybird
tb push pipes/similar_posts.pipe

When you push it, Tinybird will automatically publish it as a scalable, dynamic REST API Endpoint that accepts a title query parameter.

You can test your API Endpoint with cURL. First, create an environment variable with a token that has PIPES:READ scope for your Pipe. You can get this token from your Workspace UI or in the CLI with tb token commands.

export TB_READ_TOKEN=your_read_token

Then request your Endpoint:

curl --compressed -G -H "Authorization: Bearer $TB_READ_TOKEN" "https://api.tinybird.co/v0/pipes/similar_posts.json" --data-urlencode "title=Some blog post title"

You will get a JSON object containing the 10 most similar posts to the post whose title you supplied in the request.
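
If you call the Endpoint from Python, the title needs to be URL-encoded, and the matches are found under the data key of the response. The sketch below builds the request URL and reads a response body. The helper names are hypothetical, and the sample payload is made up for illustration, with other response keys omitted.

```python
from typing import Any, Dict, List, Tuple
from urllib.parse import urlencode


def build_similar_posts_url(host: str, title: str) -> str:
    """Build the Endpoint URL with a safely encoded title parameter."""
    return f"{host.rstrip('/')}/v0/pipes/similar_posts.json?{urlencode({'title': title})}"


def extract_results(payload: Dict[str, Any]) -> List[Tuple[str, str, float]]:
    """Return (title, url, similarity) tuples from a response body."""
    return [
        (row["title"], row["url"], row["similarity"])
        for row in payload.get("data", [])
    ]


url = build_similar_posts_url("https://api.tinybird.co", "Some blog post title")
print(url)

# Illustrative payload only; real responses carry your own posts.
sample_payload = {
    "data": [
        {"title": "Post B", "url": "https://example.com/blog-posts/post-b", "similarity": 0.42}
    ]
}
print(extract_results(sample_payload))
```

Send the request with the same Authorization: Bearer header used in the cURL example.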

5. Integrate into the frontend

Integrating your vector search API into the frontend is relatively straightforward, as it's just a RESTful Endpoint. Here's an example implementation (pulled from the actual code used to fetch related posts in the Tinybird Blog):

export async function getRelatedPosts(title: string) {
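  // host and token come from your app's configuration; encode the title so spaces and "&" survive in the query string
  const recommendationsUrl = `${host}/v0/pipes/similar_posts.json?token=${token}&title=${encodeURIComponent(title)}`;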
  const recommendationsResponse = await fetch(recommendationsUrl).then(
    function (response) {
      return response.json();
    }
  );
  if (!recommendationsResponse.data) return;
  return Promise.all(
    recommendationsResponse.data.map(async ({ url }) => {
      const slug = url.split("/").pop();
      return await getPost(slug);
    })
  ).then((data) => data.filter(Boolean));
}

6. See it in action

You can see how this looks by checking out any blog post in the Tinybird Blog. At the bottom of each post, you'll find a Related Posts section that's powered by a real Tinybird API using the method described here!
