Build a content recommendation API using vector search¶
Read on to learn how to calculate vector embeddings using HuggingFace models and use Tinybird to perform vector search to find similar content based on vector distances.
In this tutorial, you learn how to:
- Use Python to fetch content from an RSS feed.
- Calculate vector embeddings for long-form content (blog posts) using SentenceTransformers in Python.
- Post vector embeddings to a Tinybird Data Source using the Tinybird Events API.
- Write a dynamic SQL query to calculate the closest content matches to a given blog post based on vector distances.
- Publish your query as an API and integrate it into a frontend application.
Prerequisites¶
To complete this tutorial, you need the following:
- A free Tinybird account
- An empty Tinybird Workspace
- Python 3.8 or higher
This tutorial doesn't include a frontend. An example snippet is provided to show how you can integrate the published API into a React frontend.
Setup¶
Clone the demo_vector_search_recommendation repo.
Authenticate the Tinybird CLI using your user admin token from your Tinybird Workspace:
cd tinybird
tb auth --token $USER_ADMIN_TOKEN
Fetch content and calculate embeddings¶
This tutorial uses the Tinybird Blog RSS feed to fetch blog posts. You can use any rss.xml feed to fetch blog posts and calculate embeddings from their content.
You can fetch and parse the RSS feed using the feedparser library in Python, get a list of posts, and then fetch each post and parse the content with the BeautifulSoup library.
Once you've fetched each post, you can calculate an embedding using the HuggingFace sentence_transformers library. This demo uses the all-MiniLM-L6-v2 model, which maps sentences and paragraphs to a 384-dimensional dense vector space:
from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer
import datetime
import feedparser
import requests
import json

# One retrieval timestamp shared by every post fetched in this run
timestamp = datetime.datetime.now().isoformat()

url = "https://www.tinybird.co/blog-posts/rss.xml"  # Update to your preferred RSS feed
feed = feedparser.parse(url)

model = SentenceTransformer("all-MiniLM-L6-v2")
posts = []

for entry in feed.entries:
    # Fetch the full post and parse its HTML
    doc = BeautifulSoup(requests.get(entry.link).content, features="html.parser")
    # Skip pages that have no element with id="content"
    if (content := doc.find(id="content")):
        # encode() returns one vector per input text
        embedding = model.encode([content.get_text()])
        posts.append(json.dumps({
            "timestamp": timestamp,
            "title": entry.title,
            "url": entry.link,
            "embedding": embedding.mean(axis=0).tolist()
        }))
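The snippet above passes each post to the model as a single text, so the mean over axis 0 averages only one vector. The all-MiniLM-L6-v2 model card states that input longer than 256 word pieces is truncated by default, so one possible variation is to split a long post into chunks, embed each chunk, and average the results. The following is a minimal sketch of that idea, not the repository's method: `split_into_chunks`, `mean_pool`, `embed_long_text`, and the 200-word chunk size are hypothetical names and values chosen for illustration, and `encode` stands in for any function that maps a list of texts to a list of vectors.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def split_into_chunks(text: str, max_words: int = 200) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Average equal-length vectors element by element."""
    if not vectors:
        raise ValueError("need at least one vector")
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def embed_long_text(
    text: str,
    encode: Callable[[List[str]], Sequence[Sequence[float]]],
    max_words: int = 200,  # assumption: a rough word budget per chunk
) -> Vector:
    """Embed each chunk with the supplied encoder and average the vectors."""
    chunks = split_into_chunks(text, max_words)
    return mean_pool(encode(chunks))
```

With sentence_transformers, the encoder could be supplied as `lambda chunks: model.encode(chunks).tolist()`. Whether chunking improves recommendations over embedding the truncated text depends on the content, so treat it as something to test.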
Post content metadata and embeddings to Tinybird¶
After calculating the embeddings, you can push them along with the content metadata to Tinybird using the Events API.
First, set up some environment variables for your Tinybird host and a token with DATASOURCES:WRITE scope:
export TB_HOST=your_tinybird_host
export TB_TOKEN=your_tinybird_token
Next, set up a Tinybird Data Source to receive your data. In the tinybird/datasources folder of the repository, find a posts.datasource file that looks like this:
SCHEMA >
    `timestamp` DateTime `json:$.timestamp`,
    `title` String `json:$.title`,
    `url` String `json:$.url`,
    `embedding` Array(Float32) `json:$.embedding[:]`

ENGINE ReplacingMergeTree
ENGINE_PARTITION_KEY ""
ENGINE_SORTING_KEY title, url
ENGINE_VER timestamp
This Data Source receives the updated post metadata and calculated embeddings, and deduplicates them so that only the most recent retrieval of each post is kept. Deduplication relies on the ReplacingMergeTree engine and its ENGINE_VER setting, which in this case is set to the timestamp column. This tells the engine to version each entry by its timestamp: among rows that share the same sorting key (title and url), only the row with the latest timestamp is kept in the Data Source.
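To make the deduplication rule concrete, the following sketch reproduces it in plain Python. It only illustrates the outcome and is not how the engine is implemented: `deduplicate` is a hypothetical helper, and the rows are assumed to be dictionaries with the same fields as the Data Source. ReplacingMergeTree applies this rule when it merges data in the background, so duplicates can remain visible for a while after ingestion, which is why the query later in this tutorial reads from `posts FINAL`.

```python
from typing import Dict, Iterable, List, Tuple


def deduplicate(rows: Iterable[dict]) -> List[dict]:
    """Keep one row per (title, url), choosing the latest timestamp."""
    latest: Dict[Tuple[str, str], dict] = {}
    for row in rows:
        key = (row["title"], row["url"])  # mirrors ENGINE_SORTING_KEY title, url
        current = latest.get(key)
        # ISO 8601 strings in the same format compare chronologically;
        # on a tie, the row seen last wins
        if current is None or row["timestamp"] >= current["timestamp"]:
            latest[key] = row
    return list(latest.values())
```

For example, two rows for the same title and URL with timestamps `2024-01-01T00:00:00` and `2024-01-02T00:00:00` collapse into the second one.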
The title column comes first in the sorting key because the query filters by title to retrieve the embedding for the current post. Leading the sorting key with title makes that filter faster.
Push this Data Source to Tinybird:
cd tinybird
tb push datasources/posts.datasource
Then, you can use a Python script to push the post metadata and embeddings to the Data Source using the Events API:
import os
import requests

# Token with DATASOURCES:WRITE scope, exported earlier as TB_TOKEN
TB_TOKEN = os.getenv("TB_TOKEN")
TB_HOST = os.getenv("TB_HOST")

def send_posts(posts):
    params = {
        "name": "posts",
        "token": TB_TOKEN
    }
    data = "\n".join(posts)  # ndjson: one JSON document per line
    r = requests.post(f"{TB_HOST}/v0/events", params=params, data=data)
    print(r.status_code)

send_posts(posts)
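As a variation on the script above, the request can be assembled with the standard library and the token sent in an `Authorization: Bearer` header instead of the query string, which keeps it out of logged URLs. This is a sketch under assumptions rather than the repository's code: `build_events_request` is a hypothetical helper, and it takes rows as dictionaries instead of the pre-serialized JSON strings used in the tutorial.

```python
import json
import urllib.parse
import urllib.request
from typing import Iterable


def build_events_request(
    host: str, token: str, datasource: str, rows: Iterable[dict]
) -> urllib.request.Request:
    """Build (but do not send) an Events API request with an NDJSON body."""
    # NDJSON: one JSON document per line
    body = "\n".join(json.dumps(row) for row in rows).encode("utf-8")
    query = urllib.parse.urlencode({"name": datasource})
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/v0/events?{query}",
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )
```

Passing the result to `urllib.request.urlopen()` sends it; that call raises `urllib.error.HTTPError` on error responses, so a failed ingestion doesn't go unnoticed the way a printed status code might.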
To keep embeddings up to date, you should retrieve new content on a schedule and push it to Tinybird. In the repository, you can find a GitHub Action called tinybird_recommendations.yml that fetches new content from the Tinybird blog every 12 hours and pushes it to Tinybird. The Tinybird Data Source in this project uses a ReplacingMergeTree to deduplicate blog post metadata and embeddings as new data arrives.
Calculate distances in SQL using Tinybird Pipes¶
If you've completed the previous steps, you should have a posts Data Source in your Tinybird Workspace containing the last fetched timestamp, title, URL, and embedding for each blog post fetched from your RSS feed.
You can verify that you have data from the Tinybird CLI with:
tb sql 'SELECT * FROM posts'
This tutorial includes a single-node SQL Pipe that calculates the vector distance from each post to a specific post supplied as a query parameter. The Pipe config is contained in the similar_posts.pipe file in the tinybird/pipes folder, and the SQL is copied in the following snippet for reference and explanation.
%
WITH (
    SELECT embedding FROM (
        SELECT 0 AS id, embedding
        FROM posts
        WHERE title = {{ String(title) }}
        ORDER BY timestamp DESC
        LIMIT 1
        UNION ALL
        SELECT 999 AS id, arrayWithConstant(384, 0.0) embedding
    )
    ORDER BY id
    LIMIT 1
) AS post_embedding
SELECT
    title,
    url,
    L2Distance(embedding, post_embedding) similarity
FROM posts FINAL
WHERE title <> {{ String(title) }}
ORDER BY similarity ASC
LIMIT 10
This query first fetches the embedding of the requested post, falling back to an array of 384 zeros if no embedding is found. It then calculates the Euclidean distance between the requested post and every other post using the L2Distance() function, sorts the results by ascending distance, and returns the 10 closest posts. Although the column is named similarity, it holds a distance, so smaller values mean more similar posts.
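The same logic can be expressed in a few lines of Python, which may help when checking the endpoint's results against a small sample. This is an illustrative sketch, not part of the repository: `similar_posts` is a hypothetical function, and `posts` is assumed to be a dictionary that maps each title to its embedding. It uses `math.dist`, available from Python 3.8, to compute the Euclidean distance.

```python
import math
from typing import Dict, List, Tuple


def similar_posts(
    posts: Dict[str, List[float]],
    title: str,
    limit: int = 10,
    dims: int = 384,
) -> List[Tuple[str, float]]:
    """Rank every other post by Euclidean distance to the requested one."""
    # Fall back to a zero vector when the title is unknown, as the SQL does
    target = posts.get(title, [0.0] * dims)
    ranked = sorted(
        (
            (other, math.dist(vector, target))
            for other, vector in posts.items()
            if other != title
        ),
        key=lambda pair: pair[1],  # ascending distance: closest first
    )
    return ranked[:limit]
```

For unit-length vectors, ordering by Euclidean distance gives the same ranking as ordering by cosine similarity; for unnormalized vectors the two rankings can differ.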
You can push this Pipe to your Tinybird server with:
cd tinybird
tb push pipes/similar_posts.pipe
When you push it, Tinybird automatically publishes it as a scalable, dynamic REST API Endpoint that accepts a title query parameter.
You can test your API Endpoint with cURL. First, create an environment variable with a token that has PIPES:READ scope for your Pipe. You can get this token from your Workspace UI or from the CLI with the tb token commands.
export TB_READ_TOKEN=your_read_token
Then request your endpoint:
curl --compressed -G -H "Authorization: Bearer $TB_READ_TOKEN" --data-urlencode "title=Some blog post title" "https://api.tinybird.co/v0/pipes/similar_posts.json"
A JSON object appears containing the 10 most similar posts to the post whose title you supplied in the request.
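The same request can be made from Python. The sketch below uses only the standard library and assumes the response carries its rows under a `data` key, as the frontend example in this tutorial does. `build_similar_posts_url` and `fetch_similar_posts` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request
from typing import List


def build_similar_posts_url(host: str, title: str) -> str:
    """Build the endpoint URL with the title safely URL-encoded."""
    query = urllib.parse.urlencode({"title": title})
    return f"{host.rstrip('/')}/v0/pipes/similar_posts.json?{query}"


def fetch_similar_posts(host: str, token: str, title: str) -> List[dict]:
    """Call the endpoint and return its rows (title, url, similarity)."""
    request = urllib.request.Request(
        build_similar_posts_url(host, title),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # Assumption: rows live under "data"; return an empty list otherwise
    return payload.get("data", [])
```

Encoding the title matters because blog post titles often contain spaces, ampersands, or question marks that would otherwise break the query string.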
Integrate into the frontend¶
Integrating your vector search API into the frontend is relatively straightforward. Here's an example implementation:
export async function getRelatedPosts(title: string) {
  // host, token, and getPost are assumed to be defined elsewhere in the application
  const recommendationsUrl = `${host}/v0/pipes/similar_posts.json?token=${token}&title=${encodeURIComponent(title)}`;
  const recommendationsResponse = await fetch(recommendationsUrl).then(
    function (response) {
      return response.json();
    }
  );
  // Return early when the endpoint sends no rows
  if (!recommendationsResponse.data) return;
  return Promise.all(
    recommendationsResponse.data.map(async ({ url }) => {
      // The last path segment of the post URL is its slug
      const slug = url.split("/").pop();
      return await getPost(slug);
    })
  ).then((data) => data.filter(Boolean));
}
See it in action¶
You can see how this looks by checking out any blog post in the Tinybird Blog. At the bottom of each post, you can find a Related Posts section that's powered by a real Tinybird API.
Next steps¶
- Read more about vector search and content recommendation use cases.
- Join the Tinybird Slack Community for additional support.