Iterating a Data Source (change or update schema)¶
Creating a Data Source for the first time is really straightforward. However, when iterating data projects, sometimes you need to edit the Data Source schema. This can be challenging when the data is already in production, and there are a few different scenarios. With Tinybird you can easily add more columns, but other operations (such as changing the sorting key or changing a column type) require you to fully recreate the Data Source.
This guide is for Workspaces that aren't using version control. If your Workspace is linked using the Git<>Tinybird integration, see the repo of common use cases for iterating when using version control.
Overview¶
This guide walks through the iteration process for 4 different scenarios. Pick the one that's most relevant for you:
- Scenario 1: I'm not in production
- Scenario 2: I can stop/pause data ingestion
- Scenario 3: I need to change a Materialized View & I can't stop data ingest
- Scenario 4: It's too complex and I can't figure it out
Prerequisites¶
You'll need to be familiar with the Tinybird CLI to follow along with this guide. Never used it before? Read the docs here.
All of the guide examples have the same setup - a Data Source with a nullable(Int64)
column that the user wants to change to a Int64
for performance reasons. This requires editing the schema and, to keep the existing data, replacing any occurrences of NULL
with a number, like 0
.
Scenario 1: I'm not in production¶
This scenario assumes that you are not in production and can accept losing any data you have already ingested.
If you are not in production, and you can accept losing data, use the Tinybird CLI to pull your Data Source down to a file, modify it, and push it back into Tinybird.
Begin with tb pull
to pull your Tinybird resources down to files. Then, modify the .datasource file for the Data Source you want to change. When you're finished modifying the Data Source, delete the existing Data Source from Tinybird, either in the CLI with tb datasource rm
or through the UI. Finally, push the new Data Source to Tinybird with tb push
.
See a screencast example: https://www.youtube.com/watch?v=gzpuQfk3Byg.
Scenario 2: I can stop data ingestion¶
This scenario assumes that you have stopped all ingestion into the affected Data Sources.
1. Use the CLI to pull your Tinybird resources down into files¶
Use tb pull --auto
to pull your Tinybird resources down into files. The --auto
flag will organize the resources into directories, with your Data Sources being places into a datasources
directory.
2. Create the new Data Source¶
Create a copy of the Data Source file that you want to modify and rename it.
For example, datasources/original_ds.datasource
-> datasources/new_ds.datasource
.
Modify the new Data Source schema in the file to make the changes you need.
Now push the new Data Source to Tinybird with tb push datasources/new_ds.datasource
.
3. Backfill the new Data Source with existing data¶
If you want to move the existing data from the original Data Source to the new Data Source, use a Copy Pipe or a Pipe that materializes data into the new Data Source.
3.1 Recommended option: Copy Pipe¶
A Copy Pipe is a Pipe used to copy data from one Data Source to another Data Source. This method is useful for one-time moves of data or scheduled executions.
Move your data using the following Copy Pipe, paying particular attention to the TYPE
, TARGET_DATASOURCE
and COPY_SCHEDULE
configs at the end:
NODE copy_node SQL > SELECT * EXCEPT (my_nullable_column), toInt64(coalesce(my_nullable_column,0)) as my_column -- adjust query to your changes FROM original_ds TYPE COPY TARGET_DATASOURCE new_ds COPY_SCHEDULE @on-demand
Push it to the Workspace:
tb push pipes/temp_copy.pipe
And run the Copy:
tb pipe copy run temp_copy
When it's done, remove the Pipe:
tb pipe rm temp_copy
3.2 Alternative option: A Populate¶
Alternatively, you can create a Materialized View Pipe and run a Populate to transform data from the original schema into the modified schema of the new Data Source.
Do this using the following Pipe, paying particular attention to the TYPE
and DATASOURCE
configs at the end:
NODE temp_populate SQL > SELECT * EXCEPT (my_nullable_column), toInt64(coalesce(my_nullable_column,0)) as my_column FROM original_ds TYPE materialized DATASOURCE new_ds
Then push the Pipe to Tinybird, passing the --populate
flag to force it to immediately start processing data:
tb push pipes/temp.pipe --populate --wait
When it's done, remove the Pipe:
tb pipe rm temp
At this point, review your new Data Source and ensure that everything is as expected.
4. Delete the original Data Source and rename the new Data Source¶
You can now go to the UI, delete the original Data Source, and rename the new Data Source to use the name of the original Data Source.
By renaming the new Data Source to use the same name as the original Data Source, any SQL in your Pipes or Endpoints that referred to the original Data Source will continue to work.
If you have a Materialized View that depends on the Data Source, you must unlink the Pipe that is materializing data before removing the Data Source. You can modify and reconnect your Pipe after completing the steps above.
Scenario 3: I need to change a Materialized View & I can't interrupt service¶
This scenario assumes you want to modify a Materialized View that is actively receiving data and serving API Endpoints, and you want to avoid service downtime.
Before you begin¶
Because this is a complex scenario, let's introduce some names for the example resources to make it a bit easier to follow along.
Let's assume that you have a Data Source that is actively receiving data; let's call this the Landing Data Source
. From the Landing Data Source
, you have a Pipe that is writing to a Materialized View; let's call these the Materializing Pipe
and Materialized View Data Source
respectively.
1. Use the CLI to pull your Tinybird resources down into files¶
Use tb pull --auto
to pull your Tinybird resources down into files. The --auto
flag organizes the resources into directories, with your Data Sources being places into a datasources
directory.
2. Duplicate the Materializing Pipe & Materialized View Data Source¶
Duplicate the Materializing Pipe
& Materialized View Data Source
.
For example:
pipes/original_materializing_pipe.pipe -> pipes/new_materializing_pipe.pipe datasources/original_materialized_view_data_source.datasource -> datasources/new_materialized_view_data_source.datasource
Modify the new files to change the schema as needed.
Lastly, you'll need to add a WHERE
clause to the new Materializing Pipe
. This clause is going to filter out old rows, so that the Materializing Pipe
is only materializing rows newer than a specific time. For the purpose of this guide, let's call this the Future Timestamp
. Do not use variable time functions for this timestamp (e.g. now()
). Pick a static time that is in the near future; five to fifteen minutes should be enough. The condition should be >
, for example:
WHERE … AND my_timestamp > "2024-04-12 13:15:00"
3. Push the Materializing Pipe & Materialized View Data Source¶
Push the Materializing Pipe
& Materialized View Data Source
to Tinybird:
tb push datasources/new_materialized_view_data_source.datasource tb push pipes/new_materializing_pipe.pipe
4. Create a new Pipe to transform & materialize the old schema to the new schema¶
You now have two Materialized Views: the one with the original schema, and the new one with the new schema. You need to take the data from the original Materialized View, transform it into the new schema, and write it into the new Materialized View.
To do this, create a new Pipe. In this guide, it's called the Transform Pipe
. In your Transform Pipe
create the SQL SELECT
logic that transforms the old schema to the new schema.
Lastly, your Transform Pipe
should have a WHERE
clause that only selects rows that are older than our Future Timestamp
. The condition should be <=
, for example:
WHERE … AND my_timestamp <= "2024-01-12 13:00:00"
5. Wait until after the Future Timestamp, then push & populate with the Transform Pipe¶
Now, to avoid any potential for creating duplicates or missing rows, wait until after the Future Timestamp
time has passed. This means that there should no longer be any rows arriving that have a timestamp that is older than the Future Timestamp
.
Then, push the Transform Pipe
and force a populate:
tb push pipes/new_materializing_pipe.pipe --populate --wait
6. Wait for the populate to finish, then change your API Endpoint to read from the new Materialized View Data Source¶
Wait until the previous command has completed to ensure that all data from the original Materialized View has been written to the new Materialized View Data Source
.
When it is complete, modify the API Endpoint that is querying the old Materialized View to query from the new Materialized View Data Source
.
For example:
SELECT * from original_materialized_view_data_source
Would become:
SELECT * from new_materialized_view_data_source
7. Test, then clean up old resources¶
Test that your API Endpoint is serving the correct data. If everything looks good, you can tidy up your Workspace by deleting the original Materialized View & the new Transform Pipe
.
Scenario 4: It's too complex and I can't figure it out¶
If you are dealing with a very complex scenario, don't worry! Contact Tinybird support (support@tinybird.co).
Next steps¶
- Got your schema sorted and ready to make some queries? Understand how to work with time.
- Learn how to monitor your ingestion.