Best practices for Materialized Views

Read on to learn how Materialized Views work and how to best use them in your data projects.

How data gets into a Materialized View

Tinybird ingests data into Materialized Views in blocks. This process is presented in the following diagram. Every time new data is ingested into the origin Data Source, the materialization process is triggered, which applies the transformation Pipe over the data ingested and sabes the output of that Pipe, which is a partial result, in the Materialized View.

Materialization process

Data that was present in the origin Data Source prior to the Materialized View creation is inserted into the destination Materialized View through a populate operation.

Regular materializaton

Materialized Views in Tinybird are incremental and triggered upon ingestion. From the moment it's created, the new rows are inserted in the Materialized View. If an insert is too big, it's processed in blocks.

Because materializations are performed only over the new data being ingested, and not over the whole data source, avoid using the following operations: Window functions, ORDER BY, Neighbor, and DISTINCT.

Populates

Populates move historical data from the origin Data Source into the Materialized View. There are two types: complete and partial.

If you're populating from a Data Source with hundreds of millions of rows and doing complex transformations, you might face memory errors. In this type of situation, use partial populates.

If you’re using the CLI, populates are triggered using tb pipe populate. You can add conditions using the --sql-condition flag, for example, --sql-condition='date == toYYYYMM(now())'. If your sql_condition includes any column present in the Data Source engine_sorting_key, the populate job should process less data.

If you have constant ingest in your origin Data Source, see Populates and streaming ingest.

Aggregated Materialized Views

Sometimes a background process in Tinybird merges partial results saved in intermediate states during the ingestion process, compacting the results and reducing the number of rows. The following diagram illustrates this process in more detail through a simple example.

Let’s say an eCommerce store wants to materialize the count of units sold by product. It's ingesting a JSON object every minute, with a product represented by a capital letter and the quantity sold during the last minute.

The store could define in their Pipe some simple SQL to sum the count of units sold per minute as data is ingested. The Pipe is applied over each new block of appended data, and the output is immediately saved in intermediate states into the Materialized View. Every 8 or 10 minutes, the background process merges the intermediate states, completing the aggregation across the entire Data Source.

Because they're working in real time, the store can’t always wait for this background process to take place. When querying the Materialized View, they should use the proper merge combinator and GROUP BY clause in the query itself.

Understanding State and Merge combinators for Aggregates

Tracking when the background process that merges aggregate results in a Materialized View has occurred isn't always practical. Because of this, you need to store intermediate states using the -State suffix. If you’re creating a Materialized View using the UI, this is done automatically.

Here’s an example of using -State when defining the transformation Pipe to calculate these intermediate states:

USING the -State SUFFIX
NODE Avg calculation
SQL >
SELECT 
  day, 
  city,
  avgState() avg
FROM table
GROUP BY day, city 

You also need to specifically define the appropriate schema for the Materialized View:

MV SCHEMA
SCHEMA >
  day Date,
  city String,
  avg AggregateFunction(avg, Float64)

ENGINE_SORTING_KEY date, city

Finally, you need to retrieve the data using the -Merge suffix in your API Endpoint Node to make sure the merge process is completed for all data in the Materialized View:

USE MERGE SUFFIX IN ENDPOINT NODE
NODE endpoint
SQL > 
% 
SELECT 
  day, 
  city,
  avgMerge(avg) as avg
FROM avg_table
WHERE day > {{Date(start_date)}}
GROUP BY day, city

Understanding the Materialized View parameters

When you create a Materialized View in the UI, Tinybird automatically recommends the best parameters for most use cases. This is useful to understand parameters for more complex use cases.

Sorting Key

The Sorting Key defines how data is sorted and is critical for great performance when filtering. Choose the order of your sorting keys depending on how you're going to query them. Here are a few examples for a simple Materialized View containing day, city, and avg columns:

  • You want to query the average for all cities on a particular day: the day column should be the first sorting key.
  • You wanted the average over the last month for a particular city: the city column should be the first sorting key.

For Materialized Views containing aggregations, every column in the GROUP BY statement has to be in the sorting keys, and only those columns can be sorting keys.

For non-aggregated Materialized Views, you can select other columns if they fit better for your use case, but we don't recommend adding too many. You get only a negligible performance boost after the fourth sorting key column.

Partition by

A partition is a logical combination of records by a given criterion. Usually you don't need a partition key, or a partition by month is enough.

Tinybird guesses the best partition key if your materialization query has a Date or DateTime column. If there aren't any Date columns, Tinybird doesn't set a partition key. Having no partition is better than having the wrong partition. If you're comfortable with partitions and you want to group records by another criteria, you can switch to the advanced tab and add your custom partition code.

Time To Live (TTL)

If you have certain lifetime requirements on the data in your Materialized Views, you can specify a Time To Live (TTL) parameter when creating a Materialized View.

An example of TTL requirement is satisfying GDPR regulations. TTLs can also be useful if you only intend to query a brief history of the data. For example, if you always and only query for data from within the last week, you can set a TTL of 7 days. When a TTL is set, all rows older than the TTL are removed from the Materialized View.

Advanced tab (UI)

Most of the time, the defaults recommended by Tinybird are the best parameters to use. Occasionally, however, you might need to tweak these parameters. For this, you can use the Advanced tab in Tinybird, where you can write code that's passed to the View creation.

You need ClickHouse® expertise to modify these parameters. If you aren't sure, stick with the defaults.

Populates and streaming ingest

A populate is the operation of moving data that was present in the origin Data Source before the creation of the Materialized View.

The following diagram illustrates the process. At t1, the Materialized View is created, so new rows arriving in the origin Data Source are processed into the Materialized View. To move the data from t0 to t1, launch a populate, either manually or when defining the Materialized View, at time t2.

All that data that arrives between t1 and t2 might be materialized twice: once due to the regular materialization process, at ingest time, and the other one due to the populate process.

When you don’t have streaming ingest in the origin Data Source, it's usually a safe operation, as long as no new data arrives while the populate is running.

Backfill strategies

Consider one of the following strategies for backfilling data.

Two Materialized View Pipes

Use a timestamp in the near future to split real-time ingest and backfill.

Create the regular MV with a WHERE clause specifying that materialized data is newer than a certain timestamp in the future. For example, WHERE timestamp >= '2024-01-31 00:00:00':

realtime materialized.pipe
NODE node
SQL > 
   %
   SELECT (...) 
   FROM origin_ds
   WHERE timestamp >= '2024-01-31 00:00:00'

TYPE Materialized
TARGET_DATASOURCE mv

Wait until the desired timestamp has passed, and create a the backfill Materialized View Pipe with a WHERE clause for data before the specified timestamp. No new data is processed, as the condition can't be met.

populate.pipe
NODE node
SQL > 
   %
   SELECT (...) 
   FROM origin_ds
   WHERE timestamp < '2024-01-31 00:00:00'

TYPE Materialized
TARGET_DATASOURCE mv

Finally, because it's now safe, run the --populate command.

Use Copy Pipes

Depending on the transformations, Copy Pipes can substitute the populate for historical data. See Backfill strategies.

These tips only apply for streaming ingest. With batch ingest, or being able to pause ingest, populates are totally safe.

Use the same alias in SELECT and GROUP BY

If you use an alias in the SELECT clause, you must reuse the same alias in the GROUP BY.

Take the following query as an example:

Different alias in SELECT and GROUP BY
SELECT
    key,
    multiIf(value = 0, 'isZero', 'isNotZero') as zero,
    sum(amount) as amount
FROM ds
GROUP BY key, value

The previous query results in the following error:

Column 'value' is present in the GROUP BY but not in the SELECT clause

To fix this, use the same alias in the GROUP BY:

GOOD: same alias in SELECT and GROUP BY
SELECT
    key,
    multiIf(value = 0, 'isZero', 'isNotZero') as zero,
    sum(amount) as amount
FROM ds
GROUP BY key, zero

Don't use nested GROUP BYs

Don't use nested GROUP BY clauses in the Pipe that creates a Materialized View. While nested aggregations are possible, Materialized Views are processed in independent blocks, and this might yield unexpected results. Tinybird restricts these behaviors and throws an error when they're detected to avoid inaccurate results.

Consider the following query with nested GROUP BY clauses:

Nested GROUP BY in Pipe
SELECT
    product,
    count() as c
FROM (
    SELECT
        key,
        product,
        count() as orders
    FROM ds
    GROUP BY key, product
)
GROUP BY product

The previous query throws the following error:

Columns 'key, product' are present in the GROUP BY but not in the SELECT clause

To fix this, make sure you don't nest GROUP BY clauses:

Single GROUP BY
SELECT
    key,
    product,
    count() as orders
FROM ds
GROUP BY key, product

Avoid big scans

Avoid big scans in Materialized Views. When using JOINs, do them with a subquery to the Data Source you join to, not the whole Data Source.

Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc.
Updated