Data Sources¶
When you get data into Tinybird, it's stored in a Data Source. You then write SQL queries to explore the data from the Data Source.
For example, if your event data lives in a Kafka topic, you can create a Data Source that connects directly to Kafka and writes the events to Tinybird. You can then create a Pipe to query fresh event data.
A Data Source can also be the result of materializing a SQL query through a Pipe.
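For illustration, the following is a minimal sketch of a .pipe file that materializes a query into a Data Source. The node name, the origin Data Source (events), and the target Data Source (purchase_events_mv) are assumptions for the example:

NODE filter_purchases
SQL >
    SELECT date, user_id, product_id
    FROM events
    WHERE event = 'purchase'

TYPE materialized
DATASOURCE purchase_events_mv

With a materialized Pipe, rows are transformed as they're ingested into the origin Data Source and appended to the target Data Source.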
Create Data Sources¶
You can use Tinybird's UI, CLI, and API to create Data Sources.
Using the UI¶
Follow these steps to create a new Data Source:
- In your Workspace, go to Data Sources.
- Select + to add a new Data Source.
Using the CLI¶
You can create a Data Source using the tb datasource command. See tb datasource in the CLI reference.
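As a minimal sketch, you can also define the Data Source in a .datasource file and push it with the CLI. The file name (events.datasource) and schema below are illustrative:

SCHEMA >
    `timestamp` DateTime,
    `user_id` String,
    `event` String

ENGINE "MergeTree"
ENGINE_SORTING_KEY "timestamp"

Save the file as events.datasource and push it to your Workspace:

tb push events.datasource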
Set the Data Source TTL¶
You can apply a TTL (Time To Live) to a Data Source in Tinybird. Use a TTL to define how long you want to store data.
For example, you can define a TTL of 7 days, which means that any data older than 7 days is deleted automatically.
You must define the TTL when you create the Data Source, and your data must have a column with a type that represents a date. Valid types are any of the Date or Int types.
Using the UI¶
If you are using the Tinybird Events API and want to use a TTL, create the Data Source with a TTL first before sending data.
Follow these steps to set a TTL using the Tinybird UI:
- Select Advanced Settings.
- Open the TTL menu.
- Select a column that represents a date.
- Define the TTL period in days.
If you need to apply transformations to the date column, or want to use more complex logic, select the Code editor tab and enter SQL code to define your TTL.
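For example, a minimal sketch of a TTL expression you could enter in the code editor, assuming a column named created_at that is stored as a String and needs to be cast to a date first:

toDateTime(created_at) + toIntervalDay(7)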
Using the CLI¶
Follow these steps to set a TTL using the Tinybird CLI:
- Create a new Data Source and .datasource file using the tb datasource command.
- Edit the .datasource file you've created.
- Go to the Engine settings.
- Add a new setting called ENGINE_TTL and enter your TTL string enclosed in double quotes.
- Save the file.
The following example shows a .datasource file with TTL defined:
SCHEMA >
    `date` DateTime,
    `product_id` String,
    `user_id` Int64,
    `event` String,
    `extra_data` String

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYear(date)"
ENGINE_SORTING_KEY "date, user_id, event, extra_data"
ENGINE_TTL "date + toIntervalDay(90)"
Change Data Source TTL¶
You can modify the TTL of an existing Data Source, either by adding a new TTL or by updating an existing TTL.
Using the UI¶
Follow these steps to modify a TTL using the Tinybird UI:
- Go to the Data Source details page by clicking on the Data Source with the TTL you wish to change.
- Select the Schema tab.
- Select the TTL text.
- A dialog opens. Select the menu.
- Select the field to use for the TTL.
- Change the TTL interval.
- Select Save.
The updated TTL value appears in the Data Source's schema page.
Using the CLI¶
Follow these steps to modify a TTL using the Tinybird CLI:
- Open the .datasource file.
- Go to the Engine settings.
- If ENGINE_TTL doesn't exist, add it and enter your TTL enclosed in double quotes.
- If a TTL is already defined, modify the existing setting.
The following is an example TTL setting:
ENGINE_TTL "date + toIntervalDay(90)"
When ready, save the .datasource file and push the changes to Tinybird using the CLI:
tb push DATA_SOURCE_FILE -f
Supported engines and settings¶
Don't change the following settings unless you are familiar with ClickHouse® and understand their impact. If you're unsure, contact Tinybird at support@tinybird.co or in the Community Slack.
Tinybird uses ClickHouse® as the underlying storage engine. ClickHouse features different strategies to store data, which define where and how the data is stored and also what kind of data access, queries, and availability your data has. In ClickHouse terms, a Tinybird Data Source uses a Table Engine that determines those factors.
With Tinybird you can select the Table Engine for your Data Source. Tinybird supports the following engines:
- MergeTree
- ReplacingMergeTree
- SummingMergeTree
- AggregatingMergeTree
- CollapsingMergeTree
- VersionedCollapsingMergeTree
- Null
If you need to use any other Table Engine, contact Tinybird at support@tinybird.co or in the Community Slack.
You can use the engine parameter in the Data Sources API to specify the name of any of the available engines, for example engine=ReplacingMergeTree. To set the engine parameters and the engine options, use as many engine_* request parameters as needed. You can also configure settings in .datasource files.
The supported parameters for each engine are:
| ENGINE | SIGNATURE | PARAMETER | DESCRIPTION |
|---|---|---|---|
| ReplacingMergeTree | (ver, is_deleted) | engine_ver | Optional. The column with the version. If not set, the last row is kept during a merge. If set, the maximum version is kept. |
| | | engine_is_deleted | Active only when ver is used. The name of the column used to determine whether the row is to be deleted; 1 is a deleted row, 0 is a state row. |
| SummingMergeTree | ([columns]) | engine_columns | Optional. The names of columns where values are summarized. |
| CollapsingMergeTree | (sign) | engine_sign | Name of the column for computing the state. |
| VersionedCollapsingMergeTree | (sign, version) | engine_sign | Name of the column for computing the state. |
| | | engine_version | Name of the column with the version of the object state. |
The engine options, in particular the MergeTree engine options, match ClickHouse terminology: engine_partition_key, engine_sorting_key, engine_primary_key, engine_sampling_key, engine_ttl, and engine_settings.
If engine_partition_key is empty or not passed as a parameter, the underlying Data Source doesn't have any partition unless there's a Date column, in which case the Data Source is partitioned by year. If you want to create a Data Source with no partitions, send engine_partition_key=tuple().
engine_settings allows for fine-grained control over the parameters of the underlying Table Engine. In general, don't change these settings unless you are sure about their impact.
The supported engine settings are:
- index_granularity
- merge_with_ttl_timeout
- ttl_only_drop_parts
- min_bytes_for_wide_part
- min_rows_for_wide_part
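As an illustrative sketch, these settings can also be declared in a .datasource file through the ENGINE_SETTINGS option, following the same uppercase pattern as the other engine options shown above. The values below are placeholders, not recommendations:

ENGINE "MergeTree"
ENGINE_SORTING_KEY "timestamp"
ENGINE_TTL "timestamp + toIntervalDay(30)"
ENGINE_SETTINGS "index_granularity = 8192, ttl_only_drop_parts = 1"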
See the Data Sources API for examples of creating Data Sources with custom engine settings using the Tinybird REST API.
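For illustration, a hedged sketch of a Data Sources API request that creates a Data Source with a specific engine and engine options. The host, token, Data Source name, and schema are placeholders:

curl -X POST "https://api.tinybird.co/v0/datasources" \
  -H "Authorization: Bearer $TB_ADMIN_TOKEN" \
  --data-urlencode "name=events_dedup" \
  --data-urlencode "schema=timestamp DateTime, user_id String, event String, ver UInt64" \
  --data-urlencode "engine=ReplacingMergeTree" \
  --data-urlencode "engine_ver=ver" \
  --data-urlencode "engine_sorting_key=user_id, timestamp"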
Supported data types¶
The supported ClickHouse data types are:
- Int8, Int16, Int32, Int64, Int128, Int256
- UInt8, UInt16, UInt32, UInt64, UInt128, UInt256
- Float32, Float64
- Decimal, Decimal(P, S), Decimal32(S), Decimal64(S), Decimal128(S), Decimal256(S)
- String
- FixedString(N)
- UUID
- Date, Date32
- DateTime([TZ]), DateTime64(P, [TZ])
- Bool
- Array(T)
- Map(K, V)
- Tuple(K, V)
- SimpleAggregateFunction, AggregateFunction
- LowCardinality
- Nullable
- Nothing
If you are ingesting using the NDJSON format and want to store Decimal values containing 15 or more digits, send the values as strings instead of numbers to avoid precision issues. In the following example, the first value has a high chance of losing accuracy during ingestion, while the second one is stored correctly:
{"decimal_value": 1234567890.123456789} # Last digits might change during ingestion {"decimal_value": "1234567890.123456789"} # Will be stored correctly
Set a different codec¶
Tinybird applies compression codecs to data types to optimize performance. You can override the default compression codecs by adding the CODEC(<codec>) statement after the type declarations in your .datasource schema. For example:
SCHEMA >
    `product_id` Int32 `json:$.product_id`,
    `timestamp` DateTime64(3) `json:$.timestamp` CODEC(DoubleDelta, ZSTD(1)),
For a list of available codecs, see Compression in the ClickHouse documentation.
Supported file types and compression formats for ingest¶
The following file types and compression formats are supported at ingest time:
| File type | Method | Accepted extensions | Compression formats supported |
|---|---|---|---|
| CSV | File upload, URL | .csv, .csv.gz | gzip |
| NDJSON | File upload, URL, Events API | .ndjson, .ndjson.gz | gzip |
| Parquet | File upload, URL | .parquet, .parquet.gz | gzip |
| Avro | Kafka | | gzip |
Quarantine Data Sources¶
Every Data Source you create in your Workspace has an associated quarantine Data Source that stores data that doesn't fit the schema. If you send rows that don't fit the Data Source schema, they're automatically sent to the quarantine table so that the ingest process doesn't fail.
By convention, a quarantine Data Source is named {datasource_name}_quarantine. You can review quarantined rows at any time or perform operations on them using Pipes. This is a useful source of information when fixing issues in the origin source or applying changes during ingest.
The quarantine Data Source schema contains the columns of the original row and the following columns with information about the issues that caused the quarantine:
- c__error_column (Array(String)): an array of all the columns that contain an invalid value.
- c__error (Array(String)): an array of all the errors that caused the ingestion to fail and sent the values to quarantine. Together with c__error_column, this column lets you easily identify which columns have problems and what the errors are.
- c__import_id (Nullable(String)): the job identifier, if the row was imported through a job.
- insertion_date (DateTime): the timestamp of the ingestion.
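For example, a simple query to inspect the most recent quarantined rows, assuming an original Data Source named events:

SELECT
    insertion_date,
    c__error_column,
    c__error
FROM events_quarantine
ORDER BY insertion_date DESC
LIMIT 100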
See the Quarantine guide for practical examples on using the quarantine Data Source.
Partitioning¶
Use partitions for data manipulation. Partitioning isn't intended to speed up SELECT queries: experiment with more efficient sorting keys (ENGINE_SORTING_KEY) for that. A bad partition key, or creating too many partitions, can negatively impact query performance.
Partitioning is configured using the ENGINE_PARTITION_KEY setting.
When choosing a partition key:
- Leave ENGINE_PARTITION_KEY empty. If the table is small or you aren't sure what the best partition key should be, leave it empty: the data is placed in a single partition.
- Use a date column. Depending on the filter, you can opt for more or less granularity based on your needs. toYYYYMM(date_column) or toYear(date_column) are valid default choices.
If you have questions about choosing a partition key, contact Tinybird at support@tinybird.co or in the Community Slack.
Examples¶
The following examples show how to define partitions.
Using an empty tuple to create a single partition
ENGINE_PARTITION_KEY "tuple()"
Using a Date column to create monthly partitions
ENGINE_PARTITION_KEY "toYYYYMM(date_column)"
Using a column to partition by event types
ENGINE_PARTITION_KEY "event_type % 8"
Upserts and deletes¶
See this guide. Depending on the frequency you need, you might want to convert upserts and deletes into append operations that you can resolve through deduplication.
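As a hedged sketch of that approach, assuming a Data Source named users where the latest row per user_id should win, you could append every change to a ReplacingMergeTree keyed by user_id and versioned by an updated_at column (ENGINE_VER being the uppercase datafile counterpart of the engine_ver parameter described above):

SCHEMA >
    `user_id` String,
    `name` String,
    `updated_at` DateTime

ENGINE "ReplacingMergeTree"
ENGINE_SORTING_KEY "user_id"
ENGINE_VER "updated_at"

At query time, you can deduplicate explicitly rather than waiting for merges, for example:

SELECT user_id, argMax(name, updated_at) AS name
FROM users
GROUP BY user_id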
Limits¶
There is a limit of 100 Data Sources per Workspace.