Migrate from Postgres¶
In this guide, you'll learn how to migrate events from Postgres to Tinybird so that you can begin building performant, real-time analytics over your event data.
Need to create a Tinybird account? It's free! Start here.
Prerequisites¶
You'll need a free Tinybird account and a Workspace.
At a high level¶
Postgres is an incredible general-purpose database, and it can even be extended to support columnar functionality for analytics.
That said, Tinybird can be a great alternative to Postgres extensions for a few reasons:
- It uses ClickHouse® as its underlying database, which is one of the fastest real-time analytics databases in the world.
- It provides additional services on top of the database, like an integrated API backend, ingestion load balancing, and native connectors, so you don't have to spin up extra services and infrastructure for your analytics.
Tinybird is a data platform for data and engineering teams to solve complex real-time, operational, and user-facing analytics use cases at any scale, with end-to-end latency in milliseconds for streaming ingest and high QPS workloads.
It's a SQL-first analytics engine, purpose-built for the cloud, with real-time data ingest and full JOIN support. Native, managed ingest connectors make it easy to ingest data from a variety of sources. SQL queries can be published as production-grade, scalable REST APIs for public use or secured with JWTs.
Tinybird is a managed platform that scales transparently, with no cluster operations, shard management, or replica maintenance to worry about.
See how Tinybird is used by industry-leading companies today in the Customer Stories hub.
Follow these steps to migrate from Postgres to Tinybird¶
Below you'll find an example walkthrough of migrating 100M rows of events data from Postgres to Tinybird. You can apply the same workflow to your existing Postgres instance. If at any point you get stuck and would like assistance with your migration, contact Tinybird at support@tinybird.co or in the Slack Community.
The Postgres table¶
Suppose you have a table in Postgres that looks like this:
postgres=# CREATE TABLE events (
    id SERIAL PRIMARY KEY,
    timestamp TIMESTAMPTZ NOT NULL,
    user_id TEXT NOT NULL,
    session_id TEXT NOT NULL,
    action TEXT NOT NULL,
    version TEXT NOT NULL,
    payload TEXT NOT NULL
);
The table contains 100 million rows totalling about 15GB of data:
postgres=# SELECT pg_size_pretty(pg_relation_size('events')) AS size;
  size
-------
 15 GB
(1 row)
The table stores website click events, including an unstructured JSON payload column.
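For illustration, a row in this table might look like the following insert. The values here are hypothetical, just to show the shape of the data, including the JSON payload stored as text:

postgres=# INSERT INTO events (timestamp, user_id, session_id, action, version, payload)
           VALUES (
               '2024-08-01 12:34:56+00',  -- event time
               'user_8231',               -- hypothetical user id
               'session_4f7a',            -- hypothetical session id
               'click',                   -- action name
               '1.0.3',                   -- app version
               '{"page": "/pricing", "referrer": "https://google.com", "device": "mobile"}'  -- unstructured JSON stored as TEXT
           );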
Setup¶
Within your Postgres instance, create a user with read-only permissions on the table (or tables) you need to export:
postgres=# CREATE USER tb_read_user WITH PASSWORD '<password>';
postgres=# GRANT CONNECT ON DATABASE test_db TO tb_read_user;
postgres=# GRANT USAGE ON SCHEMA public TO tb_read_user;
postgres=# GRANT SELECT ON TABLE events TO tb_read_user;
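Optionally, verify that the new user can connect and read the table. This is a quick sanity check using psql; the host name is a placeholder for your own instance:

# Connect as the read-only user and count rows (host is a placeholder)
psql "host=your.postgres.host port=5432 dbname=test_db user=tb_read_user" \
  -c "SELECT count(*) FROM events;"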
Limits¶
To perform this migration, we'll be running a series of Copy Jobs to incrementally migrate the events from Postgres to Tinybird. We break it up into chunks so as to remain under the limits of both Tinybird and Postgres.
There are two limits to take into account:
- Copy Pipe limits: Copy Pipes have a default max execution time of 20s for Build plans, 30s for Pro plans, 30m for Enterprise plans. If you're on a Free or Pro plan and need to temporarily extend your limits to perform the migration, please reach out to us at support@tinybird.co.
- The max execution time of queries in Postgres. This is controlled by the statement_timeout setting. We recommend setting this value in Postgres equal or similar to the max execution time of the Copy Pipe in Tinybird. For this example, we'll use three minutes:
postgres=# ALTER ROLE tb_read_user SET statement_timeout = '180000'; -- 3 minutes
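If you want to double-check the role-level setting, you can inspect the standard pg_roles catalog; this is just an optional sanity check:

postgres=# SELECT rolname, rolconfig FROM pg_roles WHERE rolname = 'tb_read_user';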
Create a local Tinybird project¶
Install Tinybird CLI, then create a new Data Project:
export TB_ADMIN_TOKEN=<YOUR_WORKSPACE_ADMIN_TOKEN>
export TB_HOST=https://api.us-east.aws.tinybird.co # replace with your host

tb auth --host $TB_HOST --token $TB_ADMIN_TOKEN
tb init
Create the target Data Source in Tinybird:
touch datasources/events.datasource
Define a schema that matches your Postgres schema, keeping in mind that Tinybird may use different data types. For our example:
# datasources/events.datasource

SCHEMA >
    `id` Int32,
    `timestamp` DateTime64(6),
    `user_id` String,
    `session_id` String,
    `action` String,
    `version` String,
    `payload` String

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYear(timestamp)"
ENGINE_SORTING_KEY "timestamp, session_id, user_id"
Push the Data Source to the Tinybird server:
tb push datasources/events.datasource
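To confirm the Data Source was created in your Workspace, you can list Data Sources with the CLI (assuming the tb datasource ls command is available in your CLI version); events should appear in the output:

tb datasource ls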
Backfilling your existing Postgres data¶
We're going to create a parameterized Copy Pipe to perform the initial backfill in chunks. We'll use a script to run the Copy Job on demand.
Storing secrets in Tinybird¶
Start by adding two secrets to Tinybird using the Environment Variables API. This avoids hard-coding credentials in your Copy Pipes.
Create one for your Postgres username:
curl \
  -X POST "${TB_HOST}/v0/variables" \
  -H "Authorization: Bearer ${TB_ADMIN_TOKEN}" \
  -d "type=secret" \
  -d "name=tb_read_user" \
  -d "value=tb_read_user"
And one for the password:
curl \
  -X POST "${TB_HOST}/v0/variables" \
  -H "Authorization: Bearer ${TB_ADMIN_TOKEN}" \
  -d "type=secret" \
  -d "name=tb_read_password" \
  -d "value=<password>"
Define the Copy Pipe¶
Create a new Pipe:
touch pipes/backfill_postgres.pipe
And paste the following code, updating the host, port, database name, and table name to match your specific Postgres setup:
NODE migrate
SQL >
    %
    SELECT *
    FROM postgresql(
        'your.postgres.host:port',
        'your_postgres_database_name',
        'your_postgres_table_name',
        {{tb_secret('tb_read_user')}},
        {{tb_secret('tb_read_password')}},
        'public'
    )
    WHERE timestamp > {{DateTime(from_date, '2020-01-01 00:00:00')}} -- adjust based on your data
    AND timestamp <= {{DateTime(to_date, '2020-01-01 00:00:01')}} -- use a small default range

TYPE COPY
TARGET_DATASOURCE events
This uses the PostgreSQL Table Function to select data from the remote Postgres table. It pushes the timestamp filters down to Postgres, incrementally querying your Postgres table and copying the results into your events Data Source in Tinybird.
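Conceptually, each Copy Job run results in Postgres executing a filtered read roughly like the following (the dates are illustrative; the exact query Postgres receives may differ):

postgres=# SELECT * FROM events
           WHERE timestamp > '2023-01-01 00:00:00'
           AND timestamp <= '2023-01-02 00:00:00';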
Push this Pipe to the server:
tb push pipes/backfill_postgres.pipe
Backfill in one go¶
Depending on the size of your Postgres table, you may be able to perform the migration in a single Copy Job.
For example, get the minimum timestamp from Postgres (and the current datetime):
postgres=# SELECT min(timestamp) FROM events;
          min
------------------------
 2023-01-01 00:00:00+00
(1 row)

❯ date -u +"%Y-%m-%d %H:%M:%S"
2024-08-29 10:20:57
And run the Copy Job with those parameters:
tb pipe copy run backfill_postgres --param from_date="2023-01-01 00:00:00" --param to_date="2024-08-29 10:20:57" --wait --yes
If it succeeds, you'll see something like this:
** Running backfill_postgres
** Copy to 'events' job created: https://api.us-east.aws.tinybird.co/v0/jobs/4dd482f9-168b-44f7-a4c9-d1b64fc9665d
** Copying data [████████████████████████████████████] 100%
** Data copied to 'events'
And you'll be able to query the resulting Data Source:
tb sql "select count() from events" ------------- | count() | ------------- | 100000000 | ------------- tb sql "select count() as c, action from events group by action order by c asc" --stats ** Query took 0.228730096 seconds ** Rows read: 100,000,000 ** Bytes read: 1.48 GB ----------------------- | c | action | ----------------------- | 19996881 | logout | | 19997421 | signup | | 20000982 | purchase | | 20001649 | view | | 20003067 | click | -----------------------
Note that Copy operations in Tinybird are atomic, so a bulk backfill will either succeed completely or fail without copying any data.
For instance, if the statement_timeout in Postgres is not large enough to export the table with a single query, you'll get an error like this:
** Copy to 'events' job created: https://api.us-east.aws.tinybird.co/v0/jobs/ec58749a-f4c3-4302-9236-f8036f0cb67b
** Copying data
Error: ** Failed creating copy job:
** Error while running job: There was a problem while copying data: [Error] Query cancelled due to statement timeout in postgres. Make sure you use a user with a proper statement timeout to run this type of query.
In this case you can try to increase the statement_timeout or try backfilling in chunks. As a reference, copying 100M rows from Postgres to Tinybird takes about 150s when Postgres and Tinybird are in the same cloud and region.
The Tinybird PostgreSQL Table Function internally uses a PostgreSQL COPY TO statement. You can tweak other settings in Postgres if necessary, but it's usually not needed; refer to your Postgres provider or admin.
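If you want to see what the export looks like from the Postgres side while a Copy Job is running, you can query the standard pg_stat_activity view to list the active statement issued by the read-only user and how long it has been running (purely an optional check):

postgres=# SELECT pid, now() - query_start AS duration, state, left(query, 80) AS query
           FROM pg_stat_activity
           WHERE usename = 'tb_read_user' AND state = 'active';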
Backfilling in chunks¶
If you find that you're hitting the limits of either your Postgres or Tinybird's Copy Pipes, you can backfill in chunks.
First of all, make sure your Postgres table is indexed by the column you are filtering on, in this case timestamp:
postgres=# CREATE INDEX idx_events_timestamp ON events (timestamp);
postgres=# VACUUM ANALYZE events;
And make sure a query like the one sent from Tinybird will use the index (note the Index Scan in the plan below):
postgres=# explain select * from events where timestamp > '2024-01-01 00:00:00' and timestamp <= '2024-01-02 00:00:00';
                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_events_timestamp on events  (cost=0.57..607150.89 rows=151690 width=115)
   Index Cond: (("timestamp" > '2024-01-01 00:00:00+00'::timestamp with time zone) AND ("timestamp" <= '2024-01-02 00:00:00+00'::timestamp with time zone))
 JIT:
   Functions: 2
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(5 rows)
Then run multiple Copy Jobs, adjusting the amount of data copied per job to stay within your Postgres statement timeout and Tinybird max execution time. This is a trial-and-error process that depends on the granularity of your data.
For example, here's a migration script that first tries a full backfill, and if it fails uses daily chunks:
#!/bin/bash

HOST="YOUR_TB_HOST"
TOKEN="YOUR_TB_TOKEN"
PIPE_NAME="backfill_postgres"
FROM_DATE="2023-01-01 00:00:00"
TO_DATE="2024-08-31 00:00:00"
LOG_FILE="pipe_copy.log"

run_command() {
    local from_date="$1"
    local to_date="$2"

    echo "Copying from $from_date to $to_date" | tee -a $LOG_FILE

    if output=$(tb --host $HOST --token $TOKEN pipe copy run $PIPE_NAME --param from_date="$from_date" --param to_date="$to_date" --wait --yes 2>&1); then
        echo "Success $from_date - $to_date" | tee -a $LOG_FILE
        return 0
    else
        echo "Error $from_date - $to_date" | tee -a $LOG_FILE
        echo "Error detail: $output" | tee -a $LOG_FILE
        return 1
    fi
}

iterate_chunks() {
    local from_date="$1"
    local to_date="$2"

    local current_from="$from_date"
    local next_to=""

    while [[ "$(date -d "$current_from" +"%s")" -lt "$(date -d "$to_date" +"%s")" ]]; do
        # End of current day (23:59:59)
        next_to=$(date -d "$current_from +1 day -1 second" +"%Y-%m-%d")" 23:59:59"

        # Adjust next_to if it's bigger than to_date
        if [[ "$(date -d "$next_to" +"%s")" -ge "$(date -d "$to_date" +"%s")" ]]; then
            next_to="$to_date"
        fi

        # Create copy job for one single day
        if ! run_command "$current_from" "$next_to"; then
            echo "Error processing $current_from to $next_to"
            return 1
        fi

        # Go to next day (starting at 00:00:00)
        current_from=$(date -d "$(date -d "$current_from" +'%Y-%m-%d') +1 day $(date -d "$current_from" +'%H:%M:%S')" +'%Y-%m-%d %H:%M:%S')
    done
}

# Step 1: Try full backfill
echo "Running full backfill..." | tee -a $LOG_FILE
if ! run_command "$FROM_DATE" "$TO_DATE"; then
    echo "Full backfill failed, iterating in daily chunks..." | tee -a $LOG_FILE
    iterate_chunks "$FROM_DATE" "$TO_DATE"
fi

echo "Process completed." | tee -a $LOG_FILE
Using either a full backfill or backfilling in chunks, you can successfully migrate your data from Postgres to Tinybird.
Syncing new events from Postgres to Tinybird¶
The next step is keeping your Tinybird Data Source in sync with your Postgres table as new events arrive.
The steps below show you how to use Tinybird's PostgreSQL Table Function and scheduled Copy Jobs to continually sync data from Postgres to Tinybird. However, you should consider sending future events to Tinybird directly using either the Events API or another streaming Data Source connector, as this is more resource-efficient (and closer to real time).
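For reference, here's a minimal sketch of sending a single event to the events Data Source with the Events API; the field values are hypothetical and assume the schema defined earlier (including the id column, which Tinybird won't auto-generate for you):

# Send one event as NDJSON to the Events API (values are illustrative)
curl \
  -X POST "${TB_HOST}/v0/events?name=events" \
  -H "Authorization: Bearer ${TB_ADMIN_TOKEN}" \
  -d '{"id":100000001,"timestamp":"2024-08-29 10:21:00.000000","user_id":"user_8231","session_id":"session_4f7a","action":"click","version":"1.0.3","payload":"{\"page\":\"/pricing\"}"}'

In production you'd typically use a token scoped to append to the events Data Source rather than the Workspace admin Token.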
Create the incremental Copy Pipe¶
Create another Copy Pipe to perform the incremental syncs:
touch pipes/sync_events_from_pg.pipe
Paste in this code, again updating your Postgres details, as well as the desired sync schedule. Note that the Copy Pipe limits apply here too.
NODE sync_from_pg
SQL >
    %
    SELECT *
    FROM postgresql(
        'your.postgres.host:port',
        'your_postgres_database_name',
        'your_postgres_table_name',
        {{tb_secret('tb_read_user')}},
        {{tb_secret('tb_read_password')}},
        'public'
    )
    WHERE timestamp > (SELECT max(timestamp) FROM events)

TYPE COPY
TARGET_DATASOURCE events
COPY_SCHEDULE */5 * * * *
Push this to the Tinybird server:
tb push pipes/sync_events_from_pg.pipe
It's important to complete the backfill operation before pushing the sync Pipe. The sync Pipe uses the latest timestamp in the Tinybird Data Source to perform a filtered select from Postgres. Failure to backfill first will result in a full scan of your Postgres table on your configured schedule.
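As a quick sanity check before pushing the sync Pipe, you can confirm that the backfilled data is in place and see the latest timestamp that the incremental filter will use:

tb sql "select count() from events"
tb sql "select max(timestamp) from events"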
Once you've pushed this Pipe, Tinybird will sync with your Postgres updates based on the schedule you set.
Next steps¶
If you’d like assistance with your migration, contact Tinybird at support@tinybird.co or in the Community Slack.
- Set up a free Tinybird account and build a working prototype: Sign up here.
- Run through a quick example with your free account: Tinybird quick start.
- Read the billing docs to understand plans and pricing on Tinybird.