Data Sources¶
When you get data into Tinybird, it's stored in a Data Source. You then write SQL queries to explore the data from the Data Source. Tinybird represents Data Sources using the icon.
For example, if your event data lives in a Kafka topic, you can create a Data Source that connects directly to Kafka and writes the events to Tinybird. You can then create a Pipe to query fresh event data.
A Data Source can also be the result of materializing a SQL query through a Pipe.
Create Data Sources¶
You can use Tinybird's UI, CLI, and API to create Data Sources.
Using the UI ¶
Follow these steps to create a new Data Source:
- In your Workspace, go to Data Sources.
- Select + to add a new Data Source.
Using the CLI ¶
You can create Data Source using the tb datasource
command. See tb datasource in the CLI reference.
Using the Events API ¶
If you send data to the Events API and the Data Source doesn't exist, the Events API creates a Data Source by guessing the types from the data you send. See Events API.
You can still modify the Data Source using the alter options or adding a TTL as explained in the following section.
Set the Data Source TTL¶
You can apply a TTL (Time To Live) to a Data Source in Tinybird. Use a TTL to define how long you want to store data.
For example, you can define a TTL of 7 days, which means Tinybird automatically deletes data older than 7 days.
You may set the TTL at the time of creating the Data Source, or set it later. Your data must have a column with a type that represents a date or datetime. Valid types are Date
and DateTime
.
Using the UI ¶
Follow these steps to set a TTL using the Tinybird UI:
- Select Advanced Settings.
- Open the TTL menu.
- Select a column that represents a date.
- Define the TTL period in days.
If you need to apply transformations to the date column, or want to use more complex logic, select the Code editor tab and enter SQL code to define your TTL.
Using the CLI ¶
Follow these steps to set a TTL using the Tinybird CLI:
- Create a new Data Source and .datasource file using the
tb datasource
command. - Edit the .datasource file you've created.
- Go to the Engine settings.
- Add a new setting called
ENGINE_TTL
and enter your TTL string enclosed in double quotes. - Save the file.
The following example shows a .datasource file with TTL defined:
SCHEMA > `date` DateTime, `product_id` String, `user_id` Int64, `event` String, `extra_data` String ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYear(date)" ENGINE_SORTING_KEY "date, user_id, event, extra_data" ENGINE_TTL "date + toIntervalDay(90)"
Change Data Source TTL¶
You can modify the TTL of an existing Data Source, either by adding a new TTL or by updating an existing TTL.
Using the UI ¶
Follow these steps to modify a TTL using the Tinybird UI:
- Go to the Data Source details page by clicking on the Data Source with the TTL you wish to change.
- Select the Schema tab.
- Select the TTL text.
- A dialog opens. Select the menu.
- Select the field to use for the TTL.
- Change the TTL interval.
- Select Save.
The updated TTL value appears in the Data Source's schema page.
Using the CLI ¶
Follow these steps to modify a TTL using the Tinybird CLI:
- Open the .datasource file.
- Go to the Engine settings.
- If
ENGINE_TTL
doesn't exist, add it and enter your TTL enclosed in double quotes. - If a TTL is already defined, modify the existing setting.
The following is an example TTL setting:
ENGINE_TTL "date + toIntervalDay(90)"
When ready, save the .datasource file and push the changes to Tinybird using the CLI:
tb push DATA_SOURCE_FILE -f
Share a Data Source¶
Workspace administrators can share a Data Source with another Workspace they've access to on the same region and cluster.
To share a Data Source, follow these steps:
- Find the Data Source you want to share inside Data Project.
- Select the More actions (⋯) icon next to Data Source.
- Select Share.
- Type the Workspace name or ID.
- Select Share.
You can use the shared Data Source to create Pipes and Materialized Views in the target Workspace. Users that have access to a shared Data Source can access the tinybird.datasources_ops_log
and the tinybird.kafka_ops_log
Service Data Sources.
Limitations¶
The following limitations apply to shared Data Sources:
- Shared Data Sources are read-only.
- You can't share a shared Data Source, only the original.
- You can't check the quarantine of a shared Data Source.
Supported engines¶
Tinybird features different strategies to store data, which define where and how the data is stored and also what kind of data access, queries, and availability your data has. A Tinybird Data Source uses a table engine that determines those factors.
See Engines.
Supported data types¶
Data types specify how Tinybird stores and processes values in a database. They determine what kind of data can fit in a column (like numbers, text, dates, etc.), how much storage space the data uses, and what operations you can perform on the values. Choosing the most appropriate data type is important for both data integrity and query performance.
See Data types.
Set a different codec¶
Tinybird applies compression codecs to data types to optimize performance. You can override the default compression codecs by adding the CODEC(<codec>)
statement after the type declarations in your .datasource schema. For example:
SCHEMA > `product_id` Int32 `json:$.product_id`, `timestamp` DateTime64(3) `json:$.timestamp` CODEC(DoubleDelta, ZSTD(1)),
Supported file types and compression formats for ingest¶
Tinybird supports these file types and compression formats at ingest time:
File type | Method | Accepted extensions | Compression formats supported |
---|---|---|---|
CSV | File upload, URL | .csv , .csv.gz | gzip |
NDJSON | File upload, URL, Events API | .ndjson , .ndjson.gz | gzip |
Parquet | File upload, URL | .parquet , .parquet.gz | gzip |
Avro | Kafka | gzip |
Quarantine Data Sources¶
Every Data Source you create in your Workspace has a quarantine Data Source associated that store data that doesn't fit the schema. If you send rows that don't fit the Data Source schema, they're automatically sent to the quarantine table so that the ingest process doesn't fail.
By convention, quarantine Data Sources follow the naming pattern {datasource_name}_quarantine
. You can review quarantined rows at any time or perform operations on them using Pipes. This is a useful source of information when fixing issues in the origin source or applying changes during ingest.
The quarantine Data Source schema contains the columns of the original row and the following columns with information about the issues that caused the quarantine:
c__error_column
Array(String) contains an array of all the columns that contain an invalid value.c__error
Array(String) contains an array of all the errors that caused the ingestion to fail and lead to store the values in quarantine. This column along thec__error_column
allows you so easily identify which is the columns that has problems and which is the error.c__import_id
Nullable(String) contains the job's identifier in case the column was imported through a job.insertion_date
(DateTime) contains the timestamp in which the ingestion was done.
See the Quarantine guide for practical examples on using the quarantine Data Source.
Partitioning¶
Use partitions for data manipulation. Partitioning isn't intended to speed up SELECT
queries: experiment with more efficient sorting keys, as defined by ENGINE_SORTING_KEY
, for that. A bad partition key, or creating too many partitions, can negatively impact query performance.
Configure partitioning using the ENGINE_PARTITION_KEY
setting.
When choosing a partition key:
- Leave the
ENGINE_PARTITION_KEY
key empty. If the table is small or you aren't sure what the best partition key should be, leave it empty: Tinybird places all data in a single partition. - Use a date column. Depending on the filter, you can opt for more or less granularity based on your needs.
toYYYYMM(date_column)
ortoYear(date_column)
are valid default choices.
Don't use too granular a partition key, like a customer ID or name. This could lead to the TOO_MANY_PARTS
error.
Examples¶
The following examples show how to define partitions.
Using an empty tuple to create a single partition
ENGINE_PARTITION_KEY "tuple()"
Using a Date column to create monthly partitions
ENGINE_PARTITION_KEY "toYYYYMM(date_column)"
Using a column to partition by event types
ENGINE_PARTITION_KEY "event_type % 8"
TOO_MANY_PARTS error¶
Each insert operation creates a new part containing compressed data files and index files for each column. Tinybird merges smaller parts into bigger parts in the background based on specific rules. The goal is to maintain one large part, or a few large parts, per partition.
The TOO_MANY_PARTS
error happens when you insert data faster than Tinybird can merge the parts. Inserting data to many partitions at once multiplies the problem by the number of partitions.
To prevent this error:
- Batch your inserts into larger chunks instead of making many small inserts.
- Limit the number of partitions you write to simultaneously .
- Define a less granular partition key.
If the error persists, contact support.
Upserts and deletes¶
See this guide. Depending on the frequency needed, you might want to convert upserts and deletes into an append operation that you can solve through deduplication.
Limits¶
There is a limit of 100 Data Sources per Workspace.