Optimizations 201: Fix common mistakes

In this guide, you'll learn the top 5 questions to ask yourself to find and fix common pitfalls in your data project.

Prerequisites

You'll need to read Optimizations 101: Detect inefficiencies first. It'll give you an idea of which Pipe is the worst performing, and the particular characteristics of that performance, so you can start to look into common causes.

It's also a really good idea to read Best practices for faster SQL and the Thinking in Tinybird blog post.

This guide walks through 5 common questions (the "usual suspects"). If you find a badly-performing Pipe, ask yourself these 5 questions first before exploring other, more nuanced problem areas.

1. Are you aggregating or transforming data at query time?

Calculating a count(), sum(), or avg(), or casting to another data type, is common in a published API Endpoint. As your data scales, you may be processing more data than necessary, because the same query runs on every request to that API Endpoint. If this is the case, you should create a Materialized View.

In a traditional database, you have to schedule Materialized Views to run on a regular cadence. Although this helps pre-process large amounts of data, the batch nature renders your data stale.

In Tinybird, Materialized Views let you incrementally pre-aggregate, transform, and filter large Data Sources upon ingestion. By shifting the computational load from query time to ingestion time, you'll scan less data and keep your API Endpoints fast.
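
For illustration, here's a minimal sketch of the kind of query a Materialized View node might contain. The events Data Source, its columns, and the aggregation chosen here are assumptions for the example, not your actual schema:

-- Hypothetical Materialized View query: pre-aggregate raw events per customer and day.
-- The -State combinators store partial aggregate states that are merged at query time
-- (e.g. with countMerge() and avgMerge()) in the API Endpoint.
SELECT
    customer_id,
    toDate(timestamp) AS date,
    countState() AS events_count,
    avgState(response_time_ms) AS avg_response_time
FROM events
GROUP BY customer_id, date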

Read the docs to create a Materialized View.

2. Are you filtering by the fields in the sorting key?

The sorting key is important: it determines how data is indexed and stored in your Data Source, and it's crucial for the performance of your API Endpoints. Setting the right sorting key keeps all the data you need for any given query as close together as possible.

In all databases (including Tinybird), indexing lets you skip reading data you don't need, which hugely speeds up operations like filtering.

The goal of sorting keys (SKs) is to reduce scan size by discarding as much data as possible when evaluating the WHERE clauses in your queries. In short, a good sorting key is what helps you avoid expensive, time-consuming full scans.

Some good rules of thumb for setting Sorting Keys:

  • Order matters: data is stored based on the sorting key order.
  • Between 3 and 5 columns is good enough; more will probably penalize performance.
  • timestamp is often a bad candidate for being the first element of the SK.
  • If you have a multi-tenant app, customer_id is a great candidate for being the first element of the SK.

One common mistake is to use the partition key for filtering. Use the sorting key, not the partition key.
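
As an illustration, here's a hedged sketch of a schema and a query that benefits from it. The table, columns, and key order are assumptions for the example; in a Tinybird .datasource file you'd express the same choice with ENGINE_SORTING_KEY.

-- Hypothetical multi-tenant events table: customer_id first, then timestamp,
-- so per-customer time-range filters hit the index instead of scanning everything.
CREATE TABLE events
(
    customer_id UInt32,
    timestamp   DateTime,
    event_type  LowCardinality(String),
    value       Float64
)
ENGINE = MergeTree
ORDER BY (customer_id, timestamp);

-- This query filters by the leading sorting key columns, so it avoids a full scan.
SELECT event_type, count() AS c
FROM events
WHERE customer_id = 42
  AND timestamp >= now() - INTERVAL 1 DAY
GROUP BY event_type;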

3. Are you using the best data types?

If you do need to read data, try to use the smallest types that can get the job done. A common example is timestamps: do you really need millisecond precision?

Often when users start doing data analytics, they aren't sure what the data will look like or how they'll query it from their application. But now that you have your API Endpoint or Pipe, you can (and should!) go back and check whether your schema best supports your resulting use case.

It's common to start with simple types like String, Int, and DateTime, but as your application implementation progresses, you should review the data types you selected at the beginning.

When reviewing your data types, focus on the following points (there's an illustrative sketch after this list):

  • Downsizing types: select a different data type with a smaller size. For instance, UUID values can be stored in a UUID column instead of a String, unsigned integers (UInt) can replace integers (Int) where there are no negative values, and Date can replace DateTime when you don't need time precision.
  • Examine string cardinality: use LowCardinality(String) if there are fewer than ~100k unique values.
  • Nullable columns are bigger and slower, and they can't be part of the sorting key, so use coalesce() (or a default value) instead.
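
For example, here's a hypothetical before/after of the same columns (the column names, types, and the raw_events table are assumptions for illustration):

-- Before: generic types chosen during exploration
--   user_id String, amount Int64, created_at DateTime64(3), country Nullable(String)

-- After: downsized types for the same data
SELECT
    toUUID(user_id)                                 AS user_id,     -- UUID instead of String
    toUInt32(amount)                                AS amount,      -- unsigned: values are never negative
    toDate(created_at)                              AS created_at,  -- day precision is enough here
    toLowCardinality(coalesce(country, 'unknown')) AS country       -- LowCardinality instead of Nullable(String)
FROM raw_events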

Sorting key and data type changes are done by changing your schema, which means iterating the Data Source. See an example of these types of changes in the thinking-in-tinybird demo repo.

4. Are you doing complex operations early in the processing pipeline?

Operations such as joins or aggregations get increasingly expensive as your data grows.

Filter your data first to reduce the number of rows, then perform the more complex operations later in the pipeline.
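
As a hedged sketch (the events and products tables and their columns are assumptions), reduce the rows in a subquery before the heavy operation rather than after it:

-- Less efficient: join everything, then filter
--   SELECT ... FROM events JOIN products USING (product_id) WHERE events.date = today()

-- Better: filter first, then join the much smaller result
SELECT e.product_id, p.name, count() AS purchases
FROM (
    SELECT product_id
    FROM events
    WHERE date = today() AND event_type = 'buy'
) AS e
JOIN products AS p USING (product_id)
GROUP BY e.product_id, p.name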

Follow this example: Rule 5 for faster SQL.

5. Are you joining two or more Data Sources?

It's a common scenario to want to enrich your events with dimension tables, which means materializing a JOIN. This kind of approach can process more data than you need, so here are some tips to reduce it:

  • Try to switch out JOINs and replace them with a subquery: WHERE column IN (SELECT column FROM dimensions) (see the sketch after this list).
  • If the join is needed, try to filter the right table first (better if you can use a field in the sorting key).
  • Remember that the materialization is only triggered when you ingest data into the left Data Source (the one you use in SELECT … FROM datasource). So, if you need to recalculate data from the past, creating a Materialized View is not the right approach. Instead, check this guide about Copy Pipes.
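
For instance, here's a hedged sketch of the first tip (the events and dimensions tables and their columns are assumptions). When you only need the other table to filter, an IN subquery avoids the JOIN entirely:

-- Instead of joining just to filter...
--   SELECT e.* FROM events AS e JOIN dimensions AS d ON e.company_id = d.company_id WHERE d.plan = 'enterprise'

-- ...filter with a subquery:
SELECT *
FROM events
WHERE company_id IN (
    SELECT company_id
    FROM dimensions
    WHERE plan = 'enterprise'
)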

Understanding the Materialized JOIN issue

The issue

There's a common pitfall when working with Materialized Views:

Materialized Views generated using JOIN clauses are tricky. The resulting Data Source is only automatically updated when new data arrives in the Data Source in the FROM clause.

Since Materialized Views work with the result of a SQL query, you can use JOINs and any other SQL feature. But JOINs should be used with caution.

SELECT a.id, a.value, b.value
FROM a
LEFT JOIN b USING (id)

If you insert data in a (LEFT SIDE), data will be processed as expected... but what happens if you add data to b (RIGHT SIDE)?

It will not be processed. This is because a Materialized View is only triggered when its source table receives inserts. It's just a trigger on the source table and knows nothing about the joined table. Note that this doesn't only apply to JOIN queries; it's relevant whenever you introduce any table external to the Materialized View's SELECT statement, e.g. using an IN (SELECT ...) clause.

Things can get more complex if you need to deal with stream joins, but this guide focuses on the basic setup, because these JOINs have a cost that most people don't realize.

These JOINs can be very expensive: you read the small block of rows being ingested (LEFT SIDE), plus a full scan of the joined table (RIGHT SIDE), since nothing filters it.

The optimization

To detect these cases, it's useful to review the read_bytes/write_bytes ratio of the materialization. If you're reading far more than you're writing, you're most likely doing JOINs within the Materialized View.
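
As a sketch of that check, you could query the ops log for materializations. This assumes the read_bytes and written_bytes columns and the event_type value shown here; adjust to whatever your tinybird.datasources_ops_log Service Data Source actually exposes:

-- Hypothetical check: which materializations read far more than they write?
SELECT
    datasource_name,
    sum(read_bytes) AS total_read,
    sum(written_bytes) AS total_written,
    round(total_read / nullIf(total_written, 0), 1) AS read_write_ratio
FROM tinybird.datasources_ops_log
WHERE event_type = 'materialize'   -- assumed event type for materializations
  AND timestamp > now() - INTERVAL 1 DAY
GROUP BY datasource_name
ORDER BY read_write_ratio DESC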

You can easily fix this by adding a filter on the right side, rewriting the previous query as follows:

SELECT a.id, a.value, b.value
FROM a
LEFT JOIN (
  SELECT id, value
  FROM b
  WHERE b.id IN (SELECT id FROM a)
) b USING (id)

This might sound counter-intuitive the first time you write such a query, because you're basically reading a twice. However, keep in mind that a is usually much smaller than b, since you're only reading the block of rows being ingested. To see a real improvement, the fields you're using to filter need to be in the sorting key of b. Most of the time you'll use the joining key, but you can use any other field that lets you hit the index in b and filter the right side of the JOIN.
