Build a content recommendation API using vector search

Learn how to calculate vector embeddings with HuggingFace models and use Tinybird to run a vector search that finds similar content based on vector distances.

GitHub Repository
The Related Posts section of the Tinybird blog uses a vector search recommendation algorithm.

In this tutorial, you learn how to:

  1. Use Python to fetch content from an RSS feed.
  2. Calculate vector embeddings on long form content (blog posts) using SentenceTransformers in Python.
  3. Post vector embeddings to a Tinybird Data Source using the Tinybird Events API.
  4. Write a dynamic SQL query to calculate the closest content matches to a given blog post based on vector distances.
  5. Publish your query as an API and integrate it into a frontend application.

Prerequisites

To complete this tutorial, you need the following:

  1. A free Tinybird account
  2. An empty Tinybird Workspace
  3. Python 3.8 or higher

This tutorial doesn't include a frontend. An example snippet is provided to show how you can integrate the published API into a React frontend.

1. Setup

Authenticate the Tinybird CLI using your user admin token from your Tinybird Workspace:

cd tinybird
tb auth --token $USER_ADMIN_TOKEN

2. Fetch content and calculate embeddings

This tutorial uses the Tinybird Blog RSS feed to fetch blog posts. You can use any rss.xml feed to fetch blog posts and calculate embeddings from their content.

You can fetch and parse the RSS feed using the feedparser library in Python, get a list of posts, and then fetch each post and parse the content with the BeautifulSoup library.

Once you've fetched each post, you can calculate an embedding using the HuggingFace sentence_transformers library. This demo uses the all-MiniLM-L6-v2 model, which maps sentences and paragraphs to a 384-dimensional dense vector space:

import datetime
import json

import feedparser
import requests
from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer

# One retrieval timestamp for the whole run; it versions the rows in Tinybird.
timestamp = datetime.datetime.now().isoformat()
url = "https://www.tinybird.co/blog-posts/rss.xml"  # Update to your preferred RSS feed
feed = feedparser.parse(url)

model = SentenceTransformer("all-MiniLM-L6-v2")

posts = []
for entry in feed.entries:
    # Fetch the full post page linked from the feed entry.
    html = requests.get(entry.link, timeout=30).content
    doc = BeautifulSoup(html, features="html.parser")
    # Assumes the post body lives in an element with id="content".
    if (content := doc.find(id="content")):
        # encode() on a one-item list returns an array of shape (1, 384).
        embedding = model.encode([content.get_text()])
        posts.append(json.dumps({
            "timestamp": timestamp,
            "title": entry.title,
            "url": entry.link,
            # Average over the first axis to get a flat 384-element list.
            "embedding": embedding.mean(axis=0).tolist()
        }))
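
The all-MiniLM-L6-v2 model truncates long inputs (its model card documents a limit of 256 word pieces), so encoding a whole blog post as one string mostly captures the beginning of the text. One possible workaround, which is not part of the original script, is to split the text into chunks, encode each chunk, and average the resulting vectors. The sketch below shows only the chunking and averaging steps in plain Python. The helper names chunk_words and mean_pool are hypothetical, and the 200-word chunk size is an arbitrary assumption.

```python
from typing import List, Sequence


def chunk_words(text: str, max_words: int = 200) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average equal-length vectors element by element."""
    if not vectors:
        raise ValueError("mean_pool needs at least one vector")
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Hypothetical usage with the model from the script above:
#   chunks = chunk_words(content.get_text())
#   embedding = mean_pool(model.encode(chunks).tolist())
```

Averaging weights every chunk equally, which is a simple design choice; other pooling strategies may suit long posts better.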

3. Post content metadata and embeddings to Tinybird

After calculating the embeddings, you can push them along with the content metadata to Tinybird using the Events API.

First, set up some environment variables for your Tinybird host and token with DATASOURCES:WRITE scope:

export TB_HOST=your_tinybird_host  # for example, https://api.tinybird.co
export TB_APPEND_TOKEN=your_tinybird_token

Next, set up a Tinybird Data Source to receive your data. In the tinybird/datasources folder of the repository, find a posts.datasource file that looks like this:

SCHEMA >
    `timestamp` DateTime `json:$.timestamp`,
    `title` String `json:$.title`,
    `url` String `json:$.url`,
    `embedding` Array(Float32) `json:$.embedding[:]`

ENGINE ReplacingMergeTree
ENGINE_PARTITION_KEY ""
ENGINE_SORTING_KEY title, url
ENGINE_VER timestamp

This Data Source receives the updated post metadata and calculated embeddings, and deduplicates them so that only the most recent retrieval of each post is kept. The ReplacingMergeTree engine handles the deduplication, relying on the ENGINE_VER setting, which in this case is set to the timestamp column. This tells the engine that the version of each entry is its timestamp, so among rows that share the same sorting key (title, url), only the row with the latest timestamp is kept in the Data Source.
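
To make the deduplication rule concrete, the following sketch mimics it in plain Python: for each (title, url) sorting key it keeps only the row with the latest timestamp. This illustrates the semantics only, not how the engine is implemented. The function name deduplicate is hypothetical, and the sketch assumes ISO 8601 timestamps in a single format so that string comparison matches chronological order.

```python
from typing import Dict, List, Tuple


def deduplicate(rows: List[dict]) -> List[dict]:
    """Keep the newest row per (title, url), like ENGINE_VER timestamp."""
    latest: Dict[Tuple[str, str], dict] = {}
    for row in rows:
        key = (row["title"], row["url"])
        # Assumes same-format ISO 8601 strings, so string order is chronological.
        if key not in latest or row["timestamp"] >= latest[key]["timestamp"]:
            latest[key] = row
    return list(latest.values())


rows = [
    {"timestamp": "2024-01-01T00:00:00", "title": "A", "url": "/a", "embedding": [0.1]},
    {"timestamp": "2024-01-02T00:00:00", "title": "A", "url": "/a", "embedding": [0.2]},
    {"timestamp": "2024-01-01T00:00:00", "title": "B", "url": "/b", "embedding": [0.3]},
]
deduped = deduplicate(rows)
```

In the engine itself, replacement happens during background merges rather than at insert time, which is why the query later in this tutorial reads from posts FINAL.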

The Data Source has the title column as its primary sorting key, because you filter by title to retrieve the embedding for the current post. Having title as the primary sorting key makes that filter more performant.

Push this Data Source to Tinybird:

cd tinybird
tb push datasources/posts.datasource

Then, you can use a Python script to push the post metadata and embeddings to the Data Source using the Events API:

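import os
import requests

# Read the Tinybird host and append token from the environment.
TB_APPEND_TOKEN = os.getenv("TB_APPEND_TOKEN")
TB_HOST = os.getenv("TB_HOST")

def send_posts(posts):
    """Send a list of JSON strings to the posts Data Source as NDJSON."""
    params = {
        "name": "posts",  # Target Data Source
        "token": TB_APPEND_TOKEN
    }
    data = "\n".join(posts)  # NDJSON: one JSON object per line
    r = requests.post(f"{TB_HOST}/v0/events", params=params, data=data, timeout=30)
    print(r.status_code)

send_posts(posts)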
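
If the feed contains many posts, a single request body can get large. A possible refinement, assuming you want to cap the size of each request, is to group the NDJSON lines into batches before sending them. The sketch below is not part of the repository. The name batch_posts and the default max_bytes value are assumptions, so check the Events API limits that apply to your Workspace before picking a number.

```python
from typing import Iterator, List


def batch_posts(lines: List[str], max_bytes: int = 1_000_000) -> Iterator[List[str]]:
    """Yield groups of lines whose newline-joined UTF-8 size fits max_bytes.

    A single line larger than max_bytes is still yielded on its own.
    """
    batch: List[str] = []
    size = 0
    for line in lines:
        line_size = len(line.encode("utf-8")) + 1  # +1 for the newline separator
        if batch and size + line_size > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(line)
        size += line_size
    if batch:
        yield batch


# Hypothetical usage with send_posts from the script above:
#   for batch in batch_posts(posts):
#       send_posts(batch)
```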

To keep embeddings up to date, retrieve new content on a schedule and push it to Tinybird. The repository includes a GitHub Action called tinybird_recommendations.yml that fetches new content from the Tinybird blog every 12 hours and pushes it to Tinybird. Because the Data Source uses a ReplacingMergeTree, blog post metadata and embeddings are deduplicated as new data arrives.

4. Calculate distances in SQL using Tinybird Pipes

If you've completed the previous steps, you should have a posts Data Source in your Tinybird Workspace containing the last fetched timestamp, title, URL, and embedding for each blog post fetched from your RSS feed.

You can verify that you have data from the Tinybird CLI with:

tb sql 'SELECT * FROM posts'

This tutorial includes a single-node SQL Pipe that calculates the vector distance from each post to a specific post supplied as a query parameter. The Pipe config is contained in the similar_posts.pipe file in the tinybird/pipes folder, and the SQL is copied in the following snippet for reference and explanation.

%
WITH
  (
    SELECT embedding
    FROM
    (
      SELECT 0 AS id, embedding
      FROM posts
      WHERE title = {{ String(title) }}
      ORDER BY timestamp DESC
      LIMIT 1
      UNION ALL
      SELECT 999 AS id, arrayWithConstant(384, 0.0) embedding
    )
    ORDER BY id
    LIMIT 1
  ) AS post_embedding
SELECT title, url, L2Distance(embedding, post_embedding) similarity
FROM posts FINAL
WHERE title <> {{ String(title) }}
ORDER BY similarity ASC
LIMIT 10

This query first fetches the embedding of the requested post, and falls back to an array of 384 zeros if no embedding can be fetched. It then calculates the Euclidean distance between the requested post and every other post using the L2Distance() function, sorts the results by ascending distance, and limits them to the top 10. Because the similarity column holds a distance, lower values mean more similar posts.
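
The same ranking logic can be sketched in plain Python, which may help when reasoning about what the Pipe returns. This is an illustration under assumptions, not the Pipe's implementation: rank_similar is a hypothetical function, posts are plain dictionaries that have already been deduplicated, and math.dist stands in for L2Distance().

```python
import math
from typing import Dict, List


def rank_similar(posts: List[Dict], title: str, dims: int = 384, limit: int = 10) -> List[Dict]:
    """Rank other posts by Euclidean distance to the post with the given title."""
    # Fall back to a zero vector when the title is unknown, as the SQL does.
    target = next((p["embedding"] for p in posts if p["title"] == title), [0.0] * dims)
    candidates = [
        {"title": p["title"], "url": p["url"], "similarity": math.dist(p["embedding"], target)}
        for p in posts
        if p["title"] != title
    ]
    # Smaller distance means more similar, so sort ascending.
    return sorted(candidates, key=lambda c: c["similarity"])[:limit]


posts = [
    {"title": "A", "url": "/a", "embedding": [0.0, 0.0]},
    {"title": "B", "url": "/b", "embedding": [3.0, 4.0]},
    {"title": "C", "url": "/c", "embedding": [1.0, 0.0]},
]
ranked = rank_similar(posts, "A", dims=2)
```

Here post C (distance 1.0) ranks ahead of post B (distance 5.0) for the query title "A".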

You can push this Pipe to your Tinybird server with:

cd tinybird
tb push pipes/similar_posts.pipe

When you push it, Tinybird automatically publishes it as a scalable, dynamic REST API Endpoint that accepts a title query parameter.

You can test your API Endpoint with cURL. First, create an environment variable with a token that has PIPES:READ scope for your Pipe. You can get this token from your Workspace UI or in the CLI with tb token commands.

export TB_READ_TOKEN=your_read_token

Then request your endpoint:

curl --compressed -G -H "Authorization: Bearer $TB_READ_TOKEN" "https://api.tinybird.co/v0/pipes/similar_posts.json" --data-urlencode "title=Some blog post title"

The response is a JSON object containing up to 10 posts that are most similar to the post whose title you supplied in the request.
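
Titles usually contain spaces and punctuation, so they need URL encoding when you build the request yourself. The sketch below shows one way to construct the request URL and read the rows from the response in Python. The helpers similar_posts_url and extract_posts are hypothetical; the data key matches the field that the frontend example in this tutorial reads from the response.

```python
from typing import Dict, List
from urllib.parse import urlencode


def similar_posts_url(host: str, title: str) -> str:
    """Build the endpoint URL with a URL-encoded title parameter."""
    query = urlencode({"title": title})
    return f"{host}/v0/pipes/similar_posts.json?{query}"


def extract_posts(payload: Dict) -> List[Dict]:
    """Return the result rows, or an empty list if the response has no data."""
    return payload.get("data") or []


url = similar_posts_url("https://api.tinybird.co", "Some blog post title")
# Hypothetical request, using the read token from the environment:
#   import os, requests
#   headers = {"Authorization": f"Bearer {os.environ['TB_READ_TOKEN']}"}
#   rows = extract_posts(requests.get(url, headers=headers, timeout=10).json())
```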

5. Integrate into the frontend

Integrating your vector search API into the frontend is relatively straightforward. Here's an example implementation:

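// host, token, and getPost are assumed to be defined elsewhere in your application.
export async function getRelatedPosts(title: string) {
  // Encode the title so spaces and punctuation survive in the query string.
  const recommendationsUrl = `${host}/v0/pipes/similar_posts.json?token=${token}&title=${encodeURIComponent(title)}`;
  const recommendationsResponse = await fetch(recommendationsUrl).then(
    function (response) {
      return response.json();
    }
  );
  if (!recommendationsResponse.data) return;
  return Promise.all(
    recommendationsResponse.data.map(async ({ url }: { url: string }) => {
      // The last path segment of the post URL is used as its slug.
      const slug = url.split("/").pop();
      return await getPost(slug);
    })
  ).then((data) => data.filter(Boolean));
}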

6. See it in action

You can see how this looks by checking out any blog post on the Tinybird Blog. At the bottom of each post, you can find a Related Posts section that's powered by a real Tinybird API.
