Backfill strategies for iterations

Backfilling is the process of filling in historical data that a Data Source doesn't yet contain. Whether you're changing data types, changing the sorting key, or redefining whole views, at some point you may need to run a backfill from the previous version of your Data Source or Materialized View to the new one.

This page introduces the key challenges of backfilling real-time data, and covers the different strategies to run a backfill when you are iterating a Data Source or Materialized View.

Before you start iterating and making critical changes to your Data Sources, Materialized Views, and Pipes, it's crucial to read the Deployment Strategies docs.

The challenge of backfilling real-time data

Iterating a Data Source or Materialized View often requires a careful approach to backfilling data. This becomes critical when you create a new version of a Data Source or Materialized View, because the new version starts out empty.

The main challenge lies in migrating historical data while continuously ingesting new real-time data. See the detailed explanation in Best practices for Materialized Views.

Use case

Imagine you have the following Data Source deployed in your main Workspace:

analytics_events.datasource
SCHEMA >
    `timestamp` DateTime `json:$.timestamp`,
    `session_id` String `json:$.session_id`,
    `action` LowCardinality(String) `json:$.action`,
    `version` LowCardinality(String) `json:$.version`,
    `payload` String `json:$.payload`

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp"

You want to modify the sorting key from timestamp to action, timestamp. This change requires creating a new Data Source (e.g. analytics_events_1.datasource).

When you merge the Pull Request, you will have analytics_events and analytics_events_1 in your main Workspace (and also in the Branch while in CI). This means you will have two Data Sources: the new one, representing the version you want to deploy, is empty because it has just been created.

So, how can you sync the data between the two Data Sources? This is where a backfill comes in.

How to move data in Tinybird

Reminder: "Running a backfill" just means copying all the data from one Data Source to another Data Source. There are different ways to move data in Tinybird:

Using Copy Pipes

A Copy Pipe is a Pipe used to copy data from one Data Source to another Data Source. This method is useful for one-time moves of data or scheduled executions (for example, every day at 00:00), but it's not recommended if you want to keep the data in sync between two Data Sources.

In the context of a backfill, you could use the following Pipe to copy the data from one Data Source to another. (The reason for the timestamp BETWEEN {{DateTime(start_backfill_timestamp)}} AND {{DateTime(end_backfill_timestamp)}} condition is explained later.)

backfill_data.pipe file
NODE node
SQL >
   %
   SELECT *
   FROM analytics_events
   WHERE timestamp BETWEEN {{DateTime(start_backfill_timestamp)}} AND {{DateTime(end_backfill_timestamp)}}

TYPE COPY
TARGET_DATASOURCE analytics_events_1

Once deployed, you would need to run the following command to execute the copy:

Command to run the Copy Pipe with the backfill_timestamp parameter
tb pipe copy run backfill_data --param start_backfill_timestamp='1970-01-01 00:00:00' --param end_backfill_timestamp='2024-01-31 00:00:00' --wait --yes

You can read more about it in our Copy Pipes docs.

Using Materialized Views

A Materialized View is a Pipe that materializes data from one Data Source into another as new data is ingested.

This method is useful to keep the data in sync between two Data Sources.

sync_data.pipe file
NODE node
SQL >
   %
   SELECT *
   FROM analytics_events
   WHERE timestamp > '2024-01-31 00:00:00'

TYPE materialized
DATASOURCE analytics_events_1

By default, a Materialized View only materializes new incoming data; it doesn't process the old data.

You can force it to process existing data with the tb pipe populate CLI command, but be careful: this can lead to duplicated or lost data, as explained in the previous section.
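
For example, to force the sync_data Pipe above to process the existing data (accepting the risks just mentioned), the populate run could look like this sketch, using the Pipe and node names from this example:

Run a populate of the sync_data Pipe
tb pipe populate sync_data --node node --wait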

By combining both methods, you can keep the two Data Sources in sync while backfilling the historical data, as shown in the scenarios below.

Scenarios for backfill strategies

Depending on your use case and ingestion pattern, there are different recommended strategies for backfilling data in Tinybird. The complexity of this migration depends on several factors, notably the presence of streaming ingestion. The most common scenarios are:

We are actively improving this workflow in Tinybird. Reach out to Tinybird support (support@tinybird.co) if you have any questions.

  • Scenario 1: I'm not in Production.
  • Scenario 2: Full replacement every few hours.
  • Scenario 3: Streaming ingestion WITH an incremental timestamp column.
  • Scenario 4: Streaming ingestion WITHOUT an incremental timestamp column.

Scenario 1: I'm not in Production

If you are not in production, or the data from that Data Source is not being used and you can accept losing it, you can simply create a new Data Source and start using it right away. Alternatively, you can remove and re-create the original Data Source using a custom deployment (see the sketch below).

Once you start appending data, you will see it arrive in the new Data Source.
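
As an illustrative sketch of the "remove and re-create" option (assuming the tb datasource rm and tb push commands are available in your custom deployment, and that the datafile lives under datasources/), the script could look like this:

Custom deployment script to re-create the Data Source (sketch)
#!/bin/bash

# WARNING: this deletes all existing data in analytics_events.
# The CLI may ask for confirmation before removing the Data Source.
tb datasource rm analytics_events
tb push datasources/analytics_events.datasource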

Scenario 2: Full replacement every few hours

If you are running a full replacement every few hours, you can create a Materialized View connecting the two Data Sources.

To sync the data between the two Data Sources, use a Materialized Pipe (MV) that materializes the data from the old Data Source to the new one. Something like this:

Materialize data from old to new Data Source
NODE migration_node
SQL >
    SELECT *
    FROM analytics_events

TYPE materialized
DATASOURCE analytics_events_1

You would deploy this new Pipe along with the modified Data Source. Once the Pull Request is merged, you will have this Materialized Pipe along with the new Data Source, and the Pipe will materialize the data arriving in the old Data Source.

At this point, you just need to wait until the next full replacement runs. After that, the new Data Source will have all the data, and you can create a new Pull Request to connect the new Data Source to the rest of your Pipe Endpoints, as in the example below.
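
For example, an endpoint Pipe that previously read from analytics_events would only need its FROM clause updated (the Pipe and node names here are hypothetical):

events_per_action.pipe file (hypothetical)
NODE endpoint_node
SQL >
    SELECT action, count() AS total_events
    FROM analytics_events_1
    GROUP BY action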

Scenario 3: Streaming ingestion WITH an incremental timestamp column

If you have streaming ingestion through the Events API with a high ingest rate, you can use the following strategy to avoid being impacted by the backfilling challenge with real-time data.

To use this strategy successfully, you must have an incremental timestamp column with the same time zone in your Data Source. In this example, that's the timestamp column.

First, create a new Pipe that materializes the data from the old Data Source to the new one, but filters by a future timestamp. For example, if you are deploying the Pull Request on 2024-02-02 13:00:00, you can use timestamp > '2024-02-02 13:30:00'.

sync_data.pipe file
NODE node
SQL >
   SELECT *
   FROM analytics_events
   WHERE timestamp > '2024-02-02 13:30:00'

TYPE materialized
DATASOURCE analytics_events_1

The timestamp > '2024-02-02 13:30:00' condition ensures that only data newer than 2024-02-02 13:30:00 is materialized.

Then, create a Copy Pipe with the same SQL statement, but instead of filtering by a specific future timestamp, you will use two parameters to filter by a timestamp range.

This gives you better control over the backfilling process. For example, if you are moving very large amounts of data, split the backfilling process into several batches to avoid overloading the system.

backfill_data.pipe file
NODE node
SQL >
   %
   SELECT *
   FROM analytics_events
   WHERE timestamp BETWEEN {{DateTime(start_backfill_timestamp)}} AND {{DateTime(end_backfill_timestamp)}}

TYPE COPY
TARGET_DATASOURCE analytics_events_1

Once you have these changes in code, create a Pull Request and the CI Workflow will generate a new Branch.

CI workflow

Once the CI Workflow has finished successfully, a new Branch will be created.

For the following steps, use the CLI. If you don't have it installed, you can follow the CLI installation docs.

First, you should be able to authenticate in the Branch by copying the Token from the Branch or using these commands:

Authenticate in the Branch
# You can use `tb auth -i` to authenticate in the branch
tb auth -i

# Or you can switch to the branch if you are already authenticated
tb branch ls

# By default, the CI Workflow will create a branch following the pattern `tmp_ci-<PULL_REQUEST_ID>`.
tb branch use <NAME_BRANCH>

We recommend running the Copy Pipe outside of the CI Workflow. You can automate it using a custom deployment, but most of the time it's not worth it.

As you do not have continuous ingestion in the Branch, don't wait for the future filter timestamp. Instead, run the Copy Pipe directly to backfill the data with the following command:

Run the Copy Pipe
tb pipe copy run backfill_data --param start_backfill_timestamp='1970-01-01 00:00:00' --param end_backfill_timestamp='2024-02-02 13:30:00' --wait --yes

Once the Copy Pipe has finished, you will have all the data in the new Data Source. You can compare the number of rows in both Data Sources with the following commands:

Compare the number of rows
tb sql "SELECT count() FROM analytics_events"
tb sql "SELECT count() FROM analytics_events_1"

CD workflow

Now that you have tested the backfilling process in the Branch, you can merge the Pull Request to trigger the CD Workflow. The operation is exactly the same as in the Branch: first deploy the resources, then run the data operations either manually (recommended) or automated with a custom deployment.

You should verify that you have deployed the new Data Source before the timestamp you used in the Materialized Pipe. Otherwise, you will be missing data in the new Data Source.

For example, if you have used timestamp > '2024-02-02 13:30:00' in the Materialized Pipe, you should verify that you have deployed before 2024-02-02 13:30:00.

If you have deployed after 2024-02-02 13:30:00, you will need to remove the Data Source and start the process again with a later timestamp.
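
One way to check whether you deployed in time (once data starts flowing at the cutoff) is to look at the earliest row the Materialized Pipe has written to the new Data Source; it should sit at, or very close to, the cutoff timestamp, with no gap after it. This is just a sketch using the tb sql command shown elsewhere in this guide:

Check the earliest materialized row
tb sql "SELECT min(timestamp) FROM analytics_events_1"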

At 2024-02-02 13:30:00, you will start seeing data in the new Data Source. That's when you execute the same process you followed in the CI Workflow to backfill the data.

First, you will need to authenticate in the Workspace by running the following command:

Authenticate in the Workspace
tb auth -i

Then, run the Copy Pipe to backfill the data with the following command:

Run the Copy Pipe
tb pipe copy run backfill_data --param start_backfill_timestamp='1970-01-01 00:00:00' --param end_backfill_timestamp='2024-02-02 13:30:00' --wait --yes

If the Copy Pipe fails, you can re-run the same command without duplicating data: the Copy Pipe only commits the data if the operation succeeds.

If you get an error such as a MEMORY LIMIT error, you can also run the Copy Pipe in batches. For example, you could run the Copy Pipe with a timestamp range of 1 hour, 1 day, or 1 week, depending on the amount of data you are moving, as in the sketch below.
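
For example, the last two daily batches of the range could look like this (the exact ranges are up to you; note that BETWEEN is inclusive on both ends, so rows with a timestamp exactly on a shared boundary would be copied twice unless you adjust the ranges):

Run the Copy Pipe in batches (sketch)
tb pipe copy run backfill_data --param start_backfill_timestamp='2024-02-01 00:00:00' --param end_backfill_timestamp='2024-02-01 23:59:59' --wait --yes
tb pipe copy run backfill_data --param start_backfill_timestamp='2024-02-02 00:00:00' --param end_backfill_timestamp='2024-02-02 13:30:00' --wait --yes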

Once the Copy Pipe has finished, you will have all the data in the new Data Source. You can compare the number of rows by running the following commands:

Compare the number of rows
tb sql "SELECT count() FROM analytics_events"
tb sql "SELECT count() FROM analytics_events"

At this point, you should have the same number of rows in both Data Sources, and you can connect the new Data Source to the rest of the Dataflow.

Finally, you have the Data Source with the new schema and all the data migrated from the previous one. The Data Source is now receiving real-time data directly, so the next step is to remove the Materialized Pipe and the Copy Pipe you used to backfill the data.

To do that, you just need to create a new Pull Request and remove (git rm) the Materialized Pipe and the Copy Pipe you used to backfill the data, as in the example below. Once the Pull Request is merged, the resources are removed automatically; you can double-check this operation while in CI.
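
For example, assuming the Pipes live under a pipes/ folder (the path may differ in your project):

Remove the backfill Pipes
git rm pipes/sync_data.pipe pipes/backfill_data.pipe
git commit -m "Remove backfill Pipes"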

Scenario 4: Streaming ingestion WITHOUT an incremental timestamp column

If you have streaming ingestion, but you do not have an incremental timestamp column, you can use one of the following strategies to backfill data in Tinybird.

Reach out to Tinybird support (support@tinybird.co) if you have any questions or you aren't sure how to proceed.

Strategy 1: Run a populate

Before following this strategy, you should be aware of the backfilling challenge with real-time data.

Let's consider that the use case is the same as the previous one, but you do not have an incremental timestamp column. You cannot rely on the timestamp column to filter the data because it's not incremental.

First, create a Materialized Pipe that will materialize the data from the old Data Source to the new one.

backfill_data.pipe file
NODE migrating_node
SQL >
   SELECT *
   FROM analytics_events

TYPE materialized
DATASOURCE analytics_events_1

To run the backfill, you will use the tb pipe populate command. This command materializes the existing data from the old Data Source into the new one, and because you don't need to wait for a future timestamp, you can run it inside the CI/CD Workflow.

You can create a custom deployment by setting VERSION=1.0.0 in the .tinyenv file and placing the custom deployment script in the deploy/1.0.0 folder:

Scripts generated inside the deploy folder
deploy/1.0.0
├── deploy.sh ## This is the script that will be executed during the deployment
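
The .tinyenv file just needs the version set:

.tinyenv file
VERSION=1.0.0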

You will need to modify the deploy.sh script to run the tb pipe populate command:

deploy.sh script
#!/bin/bash

# This script will be executed after the deployment
# You can use it to run any command after the deployment

# Run the populate Pipe
tb pipe populate backfill_data --node migrating_node --wait

Once you have these changes in the code, create a Pull Request and the CI Workflow will generate a new Branch with the new Data Source. Now, verify that everything is working as expected, as you did in the previous section.

# You can use `tb auth -i` to authenticate in the branch
tb auth -i

# Or you can switch to the branch if you are already authenticated
tb branch ls

# By default, the CI Workflow will create a branch following the pattern `tmp_ci-<PULL_REQUEST_ID>`.
tb branch use <NAME_BRANCH>

# Also, you can compare the number of rows by running the following commands:
tb sql "SELECT count() FROM analytics_events"
tb sql "SELECT count() FROM analytics_events_1"

Once you have verified that everything is working as expected, merge the Pull Request and the CD Workflow will create the new Data Source in the main Workspace. Once the CD Workflow has finished successfully, verify it the same way as you did in the Branch.

Strategy 2: Move ingestion to the new Data Source

Let's consider that the use case is the same as the previous one: you do not have an incremental timestamp column, so you cannot rely on the timestamp column to filter the data, and you do not want to run a populate because you might be impacted by the backfilling challenge with real-time data.

In this case, you can move the ingestion to the new Data Source until you finish backfilling data.

First, create a Copy Pipe that will copy the data from the old Data Source to the new one.

backfill_data.pipe file
NODE migrate_data
SQL >
   SELECT *
   FROM analytics_events

TYPE COPY
TARGET_DATASOURCE analytics_events_1

You could also parametrize the Copy Pipe to filter by a parameter. This gives you better control over the backfilling process, as in the sketch below.
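
For example, since the Data Source is partitioned by toYYYYMM(timestamp), you could copy one partition at a time (the parameter name is illustrative):

backfill_data.pipe file (parametrized sketch)
NODE migrate_data
SQL >
   %
   SELECT *
   FROM analytics_events
   WHERE toYYYYMM(timestamp) = {{Int32(partition_to_copy)}}

TYPE COPY
TARGET_DATASOURCE analytics_events_1

You would then run one copy per partition, for example: tb pipe copy run backfill_data --param partition_to_copy=202401 --wait --yes.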

Once you have these changes in your code, create a Pull Request and the CI Workflow will generate a new Branch with the new Data Source and the Copy Pipe.

Now, run the Copy Pipe to backfill the data. To do that, authenticate in the Branch by either copying the Token from the Branch or using the following commands:

Authenticate in the branch
# You can use `tb auth -i` to authenticate in the branch
tb auth -i

# Or you can switch to the branch if you are already authenticated
tb branch ls

# By default, the CI Workflow will create a branch following the pattern `tmp_ci-<PULL_REQUEST_ID>`.
tb branch use <NAME_BRANCH>

# Once you have authenticated in the branch, you can run the Copy Pipe by running the following command:
tb pipe copy run backfill_data --node migrate_data --wait --yes

Once the Copy Pipe has finished, you will have all the data in the new Data Source. As you are likely not ingesting data into your Branch, both numbers should match.

Compare the number of rows
tb sql "SELECT count() FROM analytics_events"
tb sql "SELECT count() FROM analytics_events_1"

Now, merge the Pull Request and the CD Workflow will generate the new resources in the main Branch.

At this point, you should modify the ingestion to start sending data to the new Data Source. Keep in mind that once you start ingesting into the new Data Source, you stop ingesting into the old one.

Once all the ingestion is pointing to the new Data Source, verify that new data is being ingested into the new Data Source and nothing is being ingested into the old one. To do that, you can query the Data Sources directly or check the Service Data Source tinybird.datasources_ops_log, as in the example below.
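
For example, a quick check against the Service Data Source could look like this sketch (the column names are the usual ones in tinybird.datasources_ops_log; double-check them in your Workspace):

Check recent appends per Data Source
tb sql "SELECT datasource_name, event_type, sum(rows) AS total_rows FROM tinybird.datasources_ops_log WHERE timestamp > now() - INTERVAL 1 HOUR GROUP BY datasource_name, event_type"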

At this point, you can start backfilling data by running the Copy Pipe. To do that, you would need to run the following command:

Run the Copy Pipe
tb pipe copy run backfill_data --node migrate_data --wait --yes

Sometimes the Data Source you are modifying has downstream dependencies. In that case, when creating the new version of the Data Source, make sure you also create new versions of all downstream dependencies. Otherwise, you could end up connecting two different Data Sources that are both receiving data to the same part of the Dataflow, and therefore duplicating data.

Next steps

If you're familiar with backfilling strategies, check out the Deployment Strategies docs.
