Copy Pipes

Copy Pipes are an extension of Tinybird's Pipes. Copy Pipes allow you to capture the result of a Pipe at a moment in time, and write the result into a target Data Source. They can be run on a schedule, or executed on demand.

Copy Pipes are great for use cases like:

  • Event-sourced snapshots, such as change data capture (CDC).
  • Copy data from Tinybird to another location in Tinybird to experiment.
  • De-duplicate with snapshots.

Copy Pipes should not be confused with Materialized Views. Materialized Views continuously re-evaluate a query as new events are inserted, while Copy Pipes create a single snapshot at a given point in time.

Tinybird represents Copy Pipes using the icon.

Best practices

A Copy Pipe executes the Pipe's query on each run to export the result. This means that the size of the copy operation is tied to the size of your data and the complexity of the Pipe's query. As Copy Pipes can run frequently, it's strongly recommended that you follow the best practices for faster SQL to optimize your queries, and understand the following best practices for optimizing your Copy Pipes.

1. Round datetime filters to your schedule

Queries in Copy Pipes should always have a time window filter that aligns with the execution schedule. For example, a Copy Pipe that runs once a day typically has a filter that filters for yesterday's data, and an hourly schedule usually means a filter to get results from the previous hour.

Copy Pipe job aren't guaranteed to run exactly at the scheduled time. If a Copy Pipe is scheduled to run at 16:00:00, the job could run at 16:00:01 or even 16:05:05. To get the time at which the Copy Pipe job was received (not executed), use job_timestamp in your time window filter and round it to the schedule window. This special parameter will default to the timestamp of when the job was received, making it a lot more reliable to queue lags.

For example, if your Copy Pipe is scheduled hourly, instead of writing:

SELECT *
FROM datasource
WHERE datetime >= now() - interval 1 hour
    AND datetime < now()

Use toStartOfHour() and job_timestamp to round the time filter to the hour and get the time at which the Copy Pipe job was received:

SELECT *
FROM datasource
WHERE datetime >= toStartOfHour({{DateTime(job_timestamp)}}) - interval 1 hour
    AND datetime < toStartOfHour({{DateTime(job_timestamp)}})

Doing this means that, even if the Copy Pipe's execution is delayed (perhaps being triggered at 16:00:01, 17:00:30, and 18:00:02) you still maintain consistent copies of data regardless of the delay, with no gaps or overlaps.

2. Account for late data in your schedule

There are many reasons why you might have late-arriving data: system downtime in your message queues, network outages, or something else. These are largely unavoidable and will occur at some point in your streaming journey. You should account for potential delays ahead of time.

When using Copy Pipes, include some headroom in your schedule to allow for late data. How much headroom you give to your schedule is up to you, but some useful guidance is that you should consider the Service Level Agreements (SLAs) both up and downstream of Tinybird. For instance, if your streaming pipeline has 5 minute downtime SLAs, then most of your late data should be less than 5 minutes. Similarly, consider if you have any SLAs from data consumers who are expecting timely data in Tinybird.

If you schedule a Copy Pipe every 5 minutes (16:00, 16:05, 16:10...), there could be events with a timestamp of 15:59:59 that don't arrive in Tinybird until 16:00:01 (2 seconds late!). If the Copy Pipe executes at exactly 16:00:00, these events could be lost.

There are two ways to add headroom to your schedule:

Option 1: Delay the execution

The first option is to simply delay the schedule.

For example, if you want to create a Copy Pipe that creates 5 minute snapshots, you could delay the schedule by 1 minute, so that instead of running at 17:00, 17:05, 17:10, etc. it would instead run at 17:01, 17:06, 17:11.

To achieve this, you could use the cron expression 1-59/5 * * * *.

If you use this method, you must combine it with the advice from the first tip in this best practices guide to Round datetime filters to your schedule. For example:

SELECT *
FROM datasource
WHERE datetime >= toStartOfFiveMinutes({{DateTime(job_timestamp)}}) - interval 5 minute
    AND datetime < toStartOfFiveMinutes({{DateTime(job_timestamp)}})

Option 2: Filter the result with headroom

Another strategy is to keep your schedule as desired (17:00, 17:05, 17:10, etc.) but apply a filter in the query to add some headroom.

For example, you can move the copy window by 15 seconds:

WITH (SELECT toStartOfFiveMinutes({{DateTime(job_timestamp)}}) - interval 15 second) AS snapshot
SELECT snapshot, *
FROM datasource
WHERE timestamp >= snapshot - interval 5 minute
    AND timestamp < snapshot

With this method, a Copy Pipe that executes at 17:00 will copy data from 16:54:45 to 16:59:45. At 17:05, it would copy data from 16:59:45 to 17:04:45, and so on.

It is worth noting that this can be confusing to data consumers who might notice that the data timestamps don't perfectly align with the schedule, so consider whether you'll need extra documentation.

3. Write a snapshot timestamp

There are many reasons why you might want to capture a snapshot timestamp, as it documents when a particular row was written. This helps you identify which execution of the Copy Pipe is responsible for which row, which is useful for debugging or auditing.

For example:

SELECT toStartOfHour({{DateTime(job_timestamp)}}) as snapshot_id, *
FROM datasource
WHERE timestamp >= snapshot_id - interval 1 hour
    AND timestamp < snapshot_id

In this example, you're adding a new column at the start of the result which contains the rounded timestamp of the execution time. By applying an alias to this column, you can re-use it in the query as your rounded datetime filter, saving you a bit of typing.

4. Backfilling data on-demand

Copy Pipes can be executed following a schedule or on-demand. All the previous best practices on this page are focused on scheduled executions.

There might be cases when you want to use the same Copy Pipe to do a backfill. For example, you want to execute the Copy Pipe only on data from last year, to fill in a gap behind your fresh data. To do this, you can parameterize the filters.

When you run an on-demand Copy Pipe with parameters, you can modify the values of the parameters before execution. Scheduled Copy Pipes with parameters use the default value for any parameters. This means you can reuse the same Copy Pipe for your fresh, scheduled runs as well as any ad-hoc backfills.

If you have used job_timestamp in your time window filter, you can specify its value when running an on-demand Copy Pipe to the date you want to backfill.For example, consider the following Copy Pipe which copies data from the last hour:

SELECT toStartOfHour({{DateTime(job_timestamp)}}), *
FROM datasource
WHERE datetime >= toStartOfHour({{DateTime(job_timestamp)}}) - interval 1 hour
    AND datetime < toStartOfHour({{DateTime(job_timestamp)}})

You might specify job_timestamp to be any datetime in the past, and the same Copy Pipe would copy the data from the last hour to the specified datetime. With this, the same Copy Pipe can perform both functions: being executed on a regular schedule to keep up with fresh data, and being executed on-demand when needed.

Configure a Copy Pipe (CLI)

To create a Copy Pipe from the CLI, you need to create a .pipe file. This file follows the same format as any other .pipe file, including defining nodes that contain your SQL queries. In this file, define the queries that will filter and transform the data as needed. The final result of all queries should be the result that you want to write into a Data Source.

You must define which node contains the final result. To do this, include the following parameters at the end of the Node:

TYPE COPY
TARGET_DATASOURCE datasource_name
COPY_SCHEDULE --(optional) a cron expression or @on-demand. If not defined, it would default to @on-demand.
COPY_MODE append --(Optional) The strategy to ingest data for copy jobs. One of `append` or `replace`, if empty the default strategy is `append`.

There can be only one copy node per Pipe, and no other outputs, such as Materialized Views or API Endpoints.

Copy Pipes can either be scheduled, or executed on-demand. This is configured using the COPY_SCHEDULE setting. To schedule a Copy Pipe, configure COPY_SCHEDULE with a cron expression. On-demand Copy Pipes are defined by configuring COPY_SCHEDULE with the value @on-demand.

Note that all schedules are executed in the UTC time zone. If you are configuring a schedule that runs at a specific time, be careful to consider that you will need to convert the desired time from your local time zone into UTC.

Here is an example of a Copy Pipe that is scheduled every hour and writes the results of a query into the sales_hour_copy Data Source:

NODE daily_sales
SQL >
    %
    SELECT toStartOfDay(starting_date) day, country, sum(sales) as total_sales
    FROM teams
    WHERE
    day BETWEEN toStartOfDay({{DateTime(job_timestamp)}}) - interval 1 day AND toStartOfDay({{DateTime(job_timestamp)}})
    and country = {{ String(country, 'US')}}
    GROUP BY day, country

TYPE COPY
TARGET_DATASOURCE sales_hour_copy
COPY_SCHEDULE 0 * * * *

Before pushing the Copy Pipe to your Workspace, make sure that the target Data Source already exists and has a schema that matches the output of the query result. Data Sources will not be created automatically when a Copy Pipe runs.

If you push the target Data Source and the Copy Pipe at the same time, be sure to use the --push-deps option in the CLI.

Change Copy Pipe Token reference

Tinybird automatically creates a Token each time a scheduled copy is created, to read from the Pipe and copy the results. To change the Token strategy (for instance, to share the same one across each copy, rather than individual Tokens for each), update the Token reference in the .pipe datafile.

Important: Once a Token is created and associated with a copy, it's crucial to handle it with care. Avoid deleting the Token or modifying its scopes, as this will break the import process and disrupt scheduled copies. The Token is essential for maintaining the connection and ensuring that data is copied correctly.

Execute Copy Pipes (CLI)

Copy Pipes can either be scheduled, or executed on-demand.

When a Copy Pipe is pushed with a schedule, it will automatically be executed as per the schedule you defined. If you need to pause the scheduler, you can run tb pipe copy pause [pipe_name], and use tb pipe copy resume [pipe_name] to resume.

Note that you can't customize the values of dynamic parameters on a scheduled Copy Pipe. Any parameters will use their default values.

When a Copy Pipe is pushed without a schedule, using the @on-demand directive, you can run tb pipe copy run [pipe_name] to trigger the Copy Pipe as needed. You can pass parameter values to the Copy Pipe by using the param flag, e.g., --param key=value.

You can run tb job ls to see any running jobs, as well as any jobs that have finished during the last 48 hours.

If you remove a Copy Pipe from your Workspace, the schedule will automatically stop and no more copies will be executed.

Configure a Copy Pipe (UI)

To create a Copy Pipe from the UI, follow the process to create a standard Pipe.

After writing your queries:

  1. Select the node that contains the final result.
  2. Select the actions button next to the Node.
  3. Select Create Copy Job.

To configure the frequency:

  1. Select whether the Copy Pipe should be scheduled using a cron expression, or run on-demand, using the Frequency menu.
  2. If you select a cron expression, configure the expression.
  3. Select Next to continue.

All schedules run in the UTC time zone. If you are configuring a schedule that runs at a specific time, you might need to convert the desired time from your local time zone into UTC.

On-demand

If you selected on-demand as the frequency for the Copy Pipe, you can customize the values for any parameters of the Pipe. You can also configure whether the Copy Pipe should write results into a new, or existing, Data Source, using the radial buttons.

If you use an existing Data Source, you can select which one to use from the menu of your Data Sources. Only Data Sources with a compatible schema are in the menu.

If you create a new Data Source, Tinybird guides you through creating the new Data Source. Select Next to continue and go through the standard Create Data Source wizard.

Scheduled

If you selected cron expression as the frequency for the Copy Pipe, a preview of the result appears. You can't configure parameter values for a scheduled Copy Pipe. Review the results and select Next to continue.

Finally, you can configure whether the Copy Pipe should write results into a new, or existing, Data Source, using the radial buttons.

If you use an existing Data Source, you can select which one to use from the menu of your Data Sources. Only Data Sources with a compatible schema is in the menu.

If you create a new Data Source, Tinybird guides you through creating the new Data Source. Select Next to continue and go through the standard Create Data Source wizard.

Run Copy Pipes (UI)

To run a Copy Pipe in the UI, navigate to the Pipe, and select the Copying button. From the options, select Run copy now.

You can't customize the values of dynamic parameters on a scheduled Copy Pipe. Any parameters use their default values.

Iterating a Copy Pipe

Copy Pipes can be iterated using version control like any other resource in your Data Project. However, you need to understand how connections work in Branches and deployments to select the appropriate strategy for your desired changes.

Branches don't execute on creation by default recurrent jobs, like the scheduled ones for Copy Pipes.

To iterate a Copy Pipe, create a new one, or recreate the existing one, with the desired configuration. The new Copy Pipe starts executing from the Branch, without affecting the unchanged production resource. This means you can test the changes without mixing the test resource with your production exports.

This example, shows how to change the Copy Pipe time granularity, adding an extra step for backfill the old data.

Monitoring

Tinybird provides a high level metrics page for each Copy Pipe in the UI, as well as exposing low-level observability data through the Service Data Sources.

You can view high level status and statistics about your Copy Pipes in the Tinybird UI from the Copy Pipe's details page. To access the details page, navigate to the Pipe, and select View Copy Job.

The details page shows summaries of the Copy Pipe's current status and configuration, as well as Charts showing the performance of previous executions.

You can also monitor your Copy Pipes using the datasource_ops_log Service Data Source. This Data Source contains data about all your operations in Tinybird. Logs that relate to Copy Pipes can be identified by a value of copy in the event_type column.

For example, the following query aggregates the Processed Data from Copy Pipes, for the current month, for a given Data Source name.

SELECT toStartOfMonth(timestamp) month, sum(read_bytes + written_bytes) processed_data
FROM tinybird.datasources_ops_log
WHERE datasource_name = '{YOUR_DATASOURCE_NAME}'
    AND event_type = 'copy'
    AND timestamp >= toStartOfMonth({{DateTime(job_timestamp)}})
GROUP BY month

Using this Data Source, you can also write queries to determine average job duration, amount of errors, error messages, and more.

Execution

Copy Jobs are processed through a First-In-First-Out (FIFO) queue system. The execution behavior is governed by the following parallelism settings.

Default concurrency

Each environment supports two concurrent Copy Jobs. In shared environments, this limit applies collectively across all users, meaning the two parallel slots are shared among all users in that environment:

  • Copy Jobs are processed in FIFO queue order.
  • Multiple jobs from the same Copy Pipe can execute concurrently.
  • For scheduled Copy Pipe jobs, immediate execution at the configured time isn't guaranteed.

Dedicated environments

In dedicated environments, the concurrent job limit can be increased upon request to accommodate your specific workload requirements.

For optimization strategies and guidelines on preventing duplicate data copies, refer to the Best Practices section.

Billing

Processed Data and Storage are the two metrics that Tinybird uses for billing Copy Pipes. A Copy Pipe executes the Pipe's query (Processed Data) and writes the result into a Data Source (Storage).

Any processed data and storage incurred by a Copy Pipe is charged at the standard rate for your billing plan (see "Tinybird plans").

See the Monitoring section for guidance on monitoring your usage of Copy Pipes.

Limits

Check the limits page for limits on ingestion, queries, API Endpoints, and more.

Build and Professional

The schedule applied to a Copy Pipe doesn't guarantee that the underlying job executes immediately at the configured time. The job is placed into a job queue when the configured time elapses. It is possible that, if the queue is busy, the job could be delayed and executed some time after the scheduled time.

Enterprise

A maximum execution time of 50% of the scheduling period, 30 minutes max, means that if the Copy Pipe is scheduled to run every minute, the operation can take up to 30 seconds. If it's scheduled to run every 5 minutes, the job can last up to 2m30s, and so forth. This is to prevent overlapping jobs, which can impact results.

The schedule applied to a Copy Pipe doesn't guarantee that the job executes immediately at the configured time. When the configured time elapses, the job is placed into a job queue. It is possible that, if the queue is busy, the job could be delayed and executed some time after the scheduled time.

Distribute the jobs over a wider period of time rather than grouping them close together.

For Enterprise customers, these settings can be customized. Reach out to your Customer Success team directly, or email us at support@tinybird.co.

Next steps

Updated