Data Sources

When you ingest data into Tinybird, it's stored in a Data Source. You then write SQL queries to explore the data in the Data Source.

For example, if your event data lives in a Kafka topic, you can create a Data Source that connects directly to Kafka and writes the events to Tinybird. You can then create a Pipe to query fresh event data.

A Data Source can also be the result of materializing a SQL query through a Pipe.

Create Data Sources

You can use Tinybird's UI, CLI, and API to create Data Sources.

Using the UI

Follow these steps to create a new Data Source:

  1. In your Workspace, go to Data Sources.
  2. Select + to add a new Data Source.

Using the CLI

You can create a Data Source using the tb datasource command. See tb datasource in the CLI reference.

Set the Data Source TTL

You can apply a TTL (Time To Live) to a Data Source in Tinybird. Use a TTL to define how long you want to store data.

For example, if you define a TTL of 7 days, any data older than 7 days is deleted automatically.

You must define the TTL at the time of creating the Data Source and your data must have a column with a type that represents a date. Valid types are any of the Date or Int types.

Using the UI

If you are using the Tinybird Events API and want to use a TTL, create the Data Source with a TTL first before sending data.

Follow these steps to set a TTL using the Tinybird UI:

  1. Select Advanced Settings.
  2. Open the TTL menu.
  3. Select a column that represents a date.
  4. Define the TTL period in days.

If you need to apply transformations to the date column, or want to use more complex logic, select the Code editor tab and enter SQL code to define your TTL.
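For example, if the source column stores an epoch timestamp in milliseconds (the timestamp_ms column below is hypothetical), the TTL expression entered in the code editor might look like this sketch:

```sql
toDateTime(intDiv(timestamp_ms, 1000)) + toIntervalDay(30)
```

The expression first converts the raw value to a DateTime and then adds the retention interval, mirroring the date_column + toIntervalDay(N) form used for plain date columns.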

Using the CLI

Follow these steps to set a TTL using the Tinybird CLI:

  1. Create a new Data Source and .datasource file using the tb datasource command.
  2. Edit the .datasource file you've created.
  3. Go to the Engine settings.
  4. Add a new setting called ENGINE_TTL and enter your TTL string enclosed in double quotes.
  5. Save the file.

The following example shows a .datasource file with TTL defined:

SCHEMA >
    `date` DateTime,
    `product_id` String,
    `user_id` Int64,
    `event` String,
    `extra_data` String

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYear(date)"
ENGINE_SORTING_KEY "date, user_id, event, extra_data"
ENGINE_TTL "date + toIntervalDay(90)"

Change Data Source TTL

You can modify the TTL of an existing Data Source, either by adding a new TTL or by updating an existing TTL.

Using the UI

Follow these steps to modify a TTL using the Tinybird UI:

  1. Select the Data Source whose TTL you want to change to open its details page.
  2. Select the Schema tab.
  3. Select the TTL text.
  4. A dialog opens. Select the menu.
  5. Select the field to use for the TTL.
  6. Change the TTL interval.
  7. Select Save.

The updated TTL value appears in the Data Source's schema page.

Using the CLI

Follow these steps to modify a TTL using the Tinybird CLI:

  1. Open the .datasource file.
  2. Go to the Engine settings.
  3. If ENGINE_TTL doesn't exist, add it and enter your TTL enclosed in double quotes.
  4. If a TTL is already defined, modify the existing setting.

The following is an example TTL setting:

ENGINE_TTL "date + toIntervalDay(90)"

When ready, save the .datasource file and push the changes to Tinybird using the CLI:

tb push DATA_SOURCE_FILE -f

Supported engines and settings

Don't change the following settings unless you are familiar with ClickHouse® and understand their impact. If you're unsure, contact Tinybird at support@tinybird.co or in the Community Slack.

Tinybird uses ClickHouse® as the underlying storage engine. ClickHouse features different strategies to store data, which define where and how the data is stored and also what kind of data access, queries, and availability your data has. In ClickHouse terms, a Tinybird Data Source uses a Table Engine that determines those factors.

With Tinybird you can select the Table Engine for your Data Source. Tinybird supports the following engines:

  • MergeTree
  • ReplacingMergeTree
  • SummingMergeTree
  • AggregatingMergeTree
  • CollapsingMergeTree
  • VersionedCollapsingMergeTree
  • Null

If you need to use any other Table Engine, contact Tinybird at support@tinybird.co or in the Community Slack.

You can use the engine parameter in the Data Sources API to specify the name of any of the available engines, for example engine=ReplacingMergeTree. To set the engine parameters and the engine options, use as many engine_* request parameters as needed. You can also configure settings in .datasource files.

The supported parameters for each engine are:

  • ReplacingMergeTree(ver, is_deleted)
      • engine_ver: Optional. The column with the version. If not set, the last row is kept during a merge. If set, the row with the maximum version is kept.
      • engine_is_deleted: Active only when ver is used. The name of the column used to determine whether the row is to be deleted: 1 is a deleted row, 0 is a state row.
  • SummingMergeTree([columns])
      • engine_columns: Optional. The names of columns where values are summarized.
  • CollapsingMergeTree(sign)
      • engine_sign: Name of the column for computing the state.
  • VersionedCollapsingMergeTree(sign, version)
      • engine_sign: Name of the column for computing the state.
      • engine_version: Name of the column with the version of the object state.

The engine options, in particular the MergeTree engine options, match ClickHouse terminology: engine_partition_key, engine_sorting_key, engine_primary_key, engine_sampling_key, engine_ttl and engine_settings.
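As a sketch of how these options come together in a .datasource file (the column names are hypothetical, and ENGINE_VER is assumed here to be the datafile counterpart of the engine_ver API parameter), a ReplacingMergeTree Data Source that keeps the latest row per product could look like:

SCHEMA >
    `product_id` String,
    `updated_at` DateTime,
    `price` Float64

ENGINE "ReplacingMergeTree"
ENGINE_SORTING_KEY "product_id"
ENGINE_VER "updated_at"

During merges, rows sharing the same sorting key are collapsed and the row with the highest updated_at value is kept.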

If engine_partition_key is empty or not passed as a parameter, the underlying Data Source doesn't have any partition unless there's a Date column. In that case the Data Source is partitioned by year. If you want to create a Data Source with no partitions, send engine_partition_key=tuple().

engine_settings allows for fine-grained control over the parameters of the underlying Table Engine. In general, we do not recommend changing these settings unless you are absolutely sure about their impact.

The supported engine settings are:

  • index_granularity
  • merge_with_ttl_timeout
  • ttl_only_drop_parts
  • min_bytes_for_wide_part
  • min_rows_for_wide_part

See the Data Sources API for examples of creating Data Sources with custom engine settings using the Tinybird REST API.

Supported data types

The supported ClickHouse data types are:

  • Int8, Int16, Int32, Int64, Int128, Int256
  • UInt8, UInt16, UInt32, UInt64, UInt128, UInt256
  • Float32, Float64
  • Decimal, Decimal(P, S), Decimal32(S), Decimal64(S), Decimal128(S), Decimal256(S)
  • String
  • FixedString(N)
  • UUID
  • Date, Date32
  • DateTime([TZ]), DateTime64(P, [TZ])
  • Bool
  • Array(T)
  • Map(K, V)
  • Tuple(K, V)
  • SimpleAggregateFunction, AggregateFunction
  • LowCardinality
  • Nullable
  • Nothing

If you are ingesting using the NDJSON format and want to store Decimal values containing 15 or more digits, send the values as strings instead of numbers to avoid precision issues. In the following example, the first value is likely to lose accuracy during ingestion, while the second one is stored correctly:

{"decimal_value": 1234567890.123456789}  # Last digits might change during ingestion
{"decimal_value": "1234567890.123456789"}  # Will be stored correctly

Set a different codec

Tinybird applies compression codecs to data types to optimize performance. You can override the default compression codecs by adding the CODEC(<codec>) statement after the type declarations in your .datasource schema. For example:

SCHEMA >
    `product_id` Int32 `json:$.product_id`,
    `timestamp` DateTime64(3) `json:$.timestamp` CODEC(DoubleDelta, ZSTD(1)),

For a list of available codecs, see Compression in the ClickHouse documentation.

Supported file types and compression formats for ingest

The following file types and compression formats are supported at ingest time:

  • CSV: File upload, URL. Accepted extensions: .csv, .csv.gz. Compression: gzip.
  • NDJSON: File upload, URL, Events API. Accepted extensions: .ndjson, .ndjson.gz. Compression: gzip.
  • Parquet: File upload, URL. Accepted extensions: .parquet, .parquet.gz. Compression: gzip.
  • Avro: Kafka. Compression: gzip.
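NDJSON rows are sent to the Events API as newline-delimited JSON in the request body. The following sketch builds such a request with the standard library; the Data Source name, host, and token are placeholders, and the helper function is hypothetical:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Placeholders: replace with your own host, Data Source name, and token.
TINYBIRD_HOST = "https://api.tinybird.co"
DATASOURCE = "events"
TOKEN = "<your-token>"

def build_events_request(rows):
    """Build an NDJSON POST request for the Events API (one JSON object per line)."""
    body = "\n".join(json.dumps(row) for row in rows).encode("utf-8")
    url = f"{TINYBIRD_HOST}/v0/events?{urlencode({'name': DATASOURCE})}"
    return Request(
        url,
        data=body,
        headers={"Authorization": f"Bearer {TOKEN}"},
        method="POST",
    )

req = build_events_request([{"date": "2024-01-01 00:00:00", "event": "view"}])
# urllib.request.urlopen(req) would send it; omitted here.
```

Each row becomes one line of the body, which is what makes the payload valid NDJSON.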

Quarantine Data Sources

Every Data Source you create in your Workspace has an associated quarantine Data Source that stores data that doesn't fit the schema. If you send rows that don't match the Data Source schema, they're automatically sent to the quarantine table so that the ingest process doesn't fail.

By convention, a quarantine Data Source is named {datasource_name}_quarantine. You can review quarantined rows at any time or perform operations on them using Pipes. This is a useful source of information when fixing issues in the origin source or applying changes during ingest.

The quarantine Data Source schema contains the columns of the original row and the following columns with information about the issues that caused the quarantine:

  • c__error_column (Array(String)) contains the names of all the columns with an invalid value.
  • c__error (Array(String)) contains all the errors that caused the ingestion to fail and sent the row to quarantine. Together with c__error_column, this column lets you identify which column has a problem and what the error is.
  • c__import_id (Nullable(String)) contains the job's identifier, if the row was ingested through a job.
  • insertion_date (DateTime) contains the timestamp of the ingestion attempt.

See the Quarantine guide for practical examples on using the quarantine Data Source.
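As a sketch, inspecting the most recent quarantined rows of a hypothetical Data Source named events could look like this query in a Pipe:

```sql
SELECT insertion_date, c__error_column, c__error
FROM events_quarantine
ORDER BY insertion_date DESC
LIMIT 100
```

Pairing each error with its column name usually makes it clear whether the fix belongs in the origin source or in the ingestion transformation.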

Partitioning

Use partitions for data manipulation. Partitioning isn't intended to speed up SELECT queries: experiment with more efficient sorting keys (ENGINE_SORTING_KEY) for that. A bad partition key, or creating too many partitions, can negatively impact query performance.

Partitioning is configured using the ENGINE_PARTITION_KEY setting.

When choosing a partition key:

  • Leave the ENGINE_PARTITION_KEY key empty. If the table is small or you aren't sure what the best partition key should be, leave it empty: the data is placed in a single partition.
  • Use a date column. Depending on the filter, you can opt for more or less granularity based on your needs. toYYYYMM(date_column) or toYear(date_column) are valid default choices.

If you have questions about choosing a partition key, contact Tinybird at support@tinybird.co or in the Community Slack.

Examples

The following examples show how to define partitions.

Using an empty tuple to create a single partition:

ENGINE_PARTITION_KEY "tuple()"

Using a Date column to create monthly partitions:

ENGINE_PARTITION_KEY "toYYYYMM(date_column)"

Using a column to partition by event types:

ENGINE_PARTITION_KEY "event_type % 8"

Upserts and deletes

See this guide. Depending on the frequency you need, you can convert upserts and deletes into append operations and resolve them through deduplication.
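For example, with an append-only Data Source, a common deduplication pattern at query time picks the latest value per key with argMax. This sketch assumes a hypothetical products Data Source with product_id, price, and updated_at columns:

```sql
SELECT
    product_id,
    argMax(price, updated_at) AS price
FROM products
GROUP BY product_id
```

Each upsert then becomes a plain append of a new row with a fresh updated_at, and the query always surfaces the most recent state.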

Limits

There is a limit of 100 Data Sources per Workspace.

Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc.