# Tinybird Documentation Generated on: 2024-11-19T13:43:42.913Z URL: https://www.tinybird.co/docs/api-reference/analyze-api Last update: 2024-07-31T16:54:46.000Z Content: --- title: "Analyze API Reference · Tinybird Docs" theme-color: "#171612" description: "The Analyze API allows you to analyze a given NDJSON, CSV, or Parquet file to generate a Tinybird Data Source schema." --- POST /v0/analyze/? [¶](https://www.tinybird.co/docs/about:blank#post--v0-analyze-?) The Analyze API takes a sample of a supported file ( `csv`, `ndjson`, `parquet` ) and guesses the file format, schema, columns, types, nullables, and JSONPaths (in the case of NDJSON and Parquet files). This is a helper endpoint for creating Data Sources without having to write the schema manually. Keep in mind that Tinybird's guessing algorithm is not deterministic: it takes a random portion of the file passed to the endpoint, so it can guess different types or nullables depending on the sample analyzed. We recommend double-checking the guessed schema and making manual adjustments where needed. Analyze a local file [¶](https://www.tinybird.co/docs/about:blank#id1) curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/analyze" \ -F "file=@path_to_local_file" Analyze a remote file [¶](https://www.tinybird.co/docs/about:blank#id2) curl \ -H "Authorization: Bearer " \ -G -X POST "https://api.tinybird.co/v0/analyze" \ --data-urlencode "url=https://example.com/file" Analyze response [¶](https://www.tinybird.co/docs/about:blank#id3) { "analysis": { "columns": [ { "path": "$.a_nested_array.nested_array[:]", "recommended_type": "Array(Int16)", "present_pct": 3, "name": "a_nested_array_nested_array" }, { "path": "$.an_array[:]", "recommended_type": "Array(Int16)", "present_pct": 3, "name": "an_array" }, { "path": "$.field", "recommended_type": "String", "present_pct": 1, "name": "field" }, { "path": "$.nested.nested_field", "recommended_type": "String", "present_pct": 1, "name": "nested_nested_field" } ], "schema": "a_nested_array_nested_array Array(Int16) `json:$.a_nested_array.nested_array[:]`, an_array Array(Int16) `json:$.an_array[:]`, field String `json:$.field`, nested_nested_field String `json:$.nested.nested_field`" }, "preview": { "meta": [ { "name": "a_nested_array_nested_array", "type": "Array(Int16)" }, { "name": "an_array", "type": "Array(Int16)" }, { "name": "field", "type": "String" }, { "name": "nested_nested_field", "type": "String" } ], "data": [ { "a_nested_array_nested_array": [ 1, 2, 3 ], "an_array": [ 1, 2, 3 ], "field": "test", "nested_nested_field": "bla" } ], "rows": 1, "statistics": { "elapsed": 0.000310539, "rows_read": 2, "bytes_read": 142 } } } The `columns` attribute contains the guessed columns, and for each one: - `path` : The JSONPath syntax, in the case of NDJSON/Parquet files - `recommended_type` : The guessed database type - `present_pct` : If the value is lower than 1, there were nulls in the sample used for guessing - `name` : The recommended column name The `schema` attribute is ready to be used in the [Data Sources API](https://www.tinybird.co/docs/docs/api-reference/datasource-api). The `preview` contains up to 10 rows of the content of the file. --- URL: https://www.tinybird.co/docs/api-reference/datasource-api Last update: 2024-10-15T15:38:30.000Z Content: --- title: "Data Sources API Reference · Tinybird Docs" theme-color: "#171612" description: "The Data Source API enables you to create, manage and import data into your Data Sources." --- POST /v0/datasources/?
[¶](https://www.tinybird.co/docs/about:blank#post--v0-datasources-?) This endpoint supports 3 modes to enable 3 distinct operations, depending on the parameters provided: > - Create a new Data Source with a schema - Append data to an existing Data Source - Replace data in an existing Data Source The mode is controlled by setting the `mode` parameter, for example, `-d "mode=create"` . Each mode has different [rate limits](https://www.tinybird.co/docs/docs/api-reference/overview#limits). When importing remote files by URL, if the server hosting the remote file supports HTTP Range headers, the import process will be parallelized. | KEY | TYPE | DESCRIPTION | | --- | --- | --- | | mode | String | Default: `create` . Other modes: `append` and `replace` . The `create` mode creates a new Data Source and attempts to import the data of the CSV if a URL is provided or the body contains any data. The `append` mode inserts the new rows provided into an existing Data Source (it will also create it if it does not exist yet). The `replace` mode will remove the previous Data Source and its data and replace it with the new one; Pipes or queries pointing to this Data Source will immediately start returning data from the new one and without disruption once the replace operation is complete. The `create` mode will automatically name the Data Source if no `name` parameter is provided; for the `append` and `replace` modes to work, the `name` parameter must be provided and the schema must be compatible. | | name | String | Optional. Name of the Data Source to create, append or replace data. This parameter is mandatory when using the `append` or `replace` modes. | | url | String | Optional. The URL of the CSV with the data to be imported | | dialect_delimiter | String | Optional. The one-character string separating the fields. We try to guess the delimiter based on the CSV contents using some statistics, but sometimes we fail to identify the correct one. If you know your CSV’s field delimiter, you can use this parameter to explicitly define it. | | dialect_new_line | String | Optional. The one- or two-character string separating the records. We try to guess the delimiter based on the CSV contents using some statistics, but sometimes we fail to identify the correct one. If you know your CSV’s record delimiter, you can use this parameter to explicitly define it. | | dialect_escapechar | String | Optional. The escapechar removes any special meaning from the following character. This is useful if the CSV does not use double quotes to encapsulate a column but uses double quotes in the content of a column and it is escaped with, e.g. a backslash. | | schema | String | Optional. Data Source schema in the format ‘column_name Type, column_name_2 Type2…’. When creating a Data Source with format `ndjson` the `schema` must include the `jsonpath` for each column, see the `JSONPaths` section for more details. | | engine | String | Optional. Engine for the underlying data. Requires the `schema` parameter. | | engine_* | String | Optional. Engine parameters and options, check the[ Engines](https://www.tinybird.co/docs/concepts/data-sources.html#supported-engines) section for more details | | progress | String | Default: `false` . When using `true` and sending the data in the request body, Tinybird will return block status while loading using Line-delimited JSON. | | token | String | Auth token with create or append permissions. 
Required only if no Bearer Authorization header is found | | type_guessing | String | Default: `true` The `type_guessing` parameter is not taken into account when replacing or appending data to an existing Data Source. When using `false` all columns are created as `String` otherwise it tries to guess the column types based on the CSV contents. Sometimes you are not familiar with the data and the first step is to get familiar with it: by disabling the type guessing, we enable you to quickly import everything as strings that you can explore with SQL and cast to the right type or shape in whatever way you see fit via a Pipe. | | debug | String | Optional. Enables returning debug information from logs. It can include `blocks` , `block_log` and/or `hook_log` | | replace_condition | String | Optional. When used in combination with the `replace` mode it allows you to replace a portion of your Data Source that matches the `replace_condition` SQL statement with the contents of the `url` or query passed as a parameter. See this[ guide](https://www.tinybird.co/guide/replacing-and-deleting-data#replace-data-selectively) to learn more. | | replace_truncate_when_empty | Boolean | Optional. When used in combination with the `replace` mode it allows truncating the Data Source when empty data is provided. Not supported when `replace_condition` is specified | | format | String | Default: `csv` . Indicates the format of the data to be ingested in the Data Source. By default is `csv` and you should specify `format=ndjson` for NDJSON format, and `format=parquet` for Parquet files. | **Examples** Creating a CSV Data Source from a schema [¶](https://www.tinybird.co/docs/about:blank#id2) curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources" \ -d "name=stocks" \ -d "schema=symbol String, date Date, close Float32" Creating a CSV Data Source from a local CSV file with schema inference [¶](https://www.tinybird.co/docs/about:blank#id3) curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources?name=stocks" \ -F csv = @local_file.csv Creating a CSV Data Source from a remote CSV file with schema inference [¶](https://www.tinybird.co/docs/about:blank#id4) curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources" \ -d "name=stocks" \ -d url = 'https://.../data.csv' Creating an empty Data Source with a ReplacingMergeTree engine and custom engine settings [¶](https://www.tinybird.co/docs/about:blank#id5) curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources" \ -d "schema=pk UInt64, insert_date Date, close Float32" \ -d "engine=ReplacingMergeTree" \ -d "engine_sorting_key=pk" \ -d "engine_ver=insert_date" \ -d "name=test123" \ -d "engine_settings=index_granularity=2048, ttl_only_drop_parts=false" Appending data to a Data Source from a local CSV file [¶](https://www.tinybird.co/docs/about:blank#id6) curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources?name=data_source_name&mode=append" \ -F csv = @local_file.csv Appending data to a Data Source from a remote CSV file [¶](https://www.tinybird.co/docs/about:blank#id7) curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources" \ -d mode = 'append' \ -d name = 'data_source_name' \ -d url = 'https://.../data.csv' Replacing data with a local file [¶](https://www.tinybird.co/docs/about:blank#id8) curl \ -H "Authorization: Bearer " \ -X POST 
"https://api.tinybird.co/v0/datasources?name=data_source_name&mode=replace" \ -F csv = @local_file.csv Replacing data with a remote file from a URL [¶](https://www.tinybird.co/docs/about:blank#id9) curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources" \ -d mode = 'replace' \ -d name = 'data_source_name' \ --data-urlencode "url=http://example.com/file.csv" GET /v0/datasources/? [¶](https://www.tinybird.co/docs/about:blank#get--v0-datasources-?) getting a list of your Data Sources [¶](https://www.tinybird.co/docs/about:blank#id10) curl \ -H "Authorization: Bearer " \ -X GET "https://api.tinybird.co/v0/datasources" Get a list of the Data Sources in your account. The token you use to query the available Data Sources will determine what Data Sources get returned: only those accessible with the token you are using will be returned in the response. Successful response [¶](https://www.tinybird.co/docs/about:blank#id11) { "datasources": [{ "id": "t_a049eb516ef743d5ba3bbe5e5749433a", "name": "your_datasource_name", "cluster": "tinybird", "tags": {}, "created_at": "2019-11-13 13:53:05.340975", "updated_at": "2022-02-11 13:11:19.464343", "replicated": true, "version": 0, "project": null, "headers": {}, "shared_with": [ "89496c21-2bfe-4775-a6e8-97f1909c8fff" ], "engine": { "engine": "MergeTree", "engine_sorting_key": "example_column_1", "engine_partition_key": "", "engine_primary_key": "example_column_1" }, "description": "", "used_by": [], "type": "csv", "columns": [{ "name": "example_column_1", "type": "Date", "codec": null, "default_value": null, "jsonpath": null, "nullable": false, "normalized_name": "example_column_1" }, { "name": "example_column_2", "type": "String", "codec": null, "default_value": null, "jsonpath": null, "nullable": false, "normalized_name": "example_column_2" } ], "statistics": { "bytes": 77822, "row_count": 226188 }, "new_columns_detected": {}, "quarantine_rows": 0 }] }| Key | Type | Description | | --- | --- | --- | | attrs | String | comma separated list of the Data Source attributes to return in the response. Example: `attrs=name,id,engine` . Leave empty to return a full response | Note that the `statistics` ’s `bytes` and `row_count` attributes might be `null` depending on how the Data Source was created. POST /v0/datasources/(.+)/alter [¶](https://www.tinybird.co/docs/about:blank#post--v0-datasources-(.+)-alter) Modify the Data Source schema. This endpoint supports the operation to alter the following fields of a Data Source: | Key | Type | Description | | --- | --- | --- | | schema | String | Optional. Set the whole schema that adds new columns to the existing ones of a Data Source. | | description | String | Optional. Sets the description of the Data Source. | | kafka_store_raw_value | Boolean | Optional. Default: false. When set to true, the ‘value’ column of a Kafka Data Source will save the JSON as a raw string. | | kafka_store_headers | Boolean | Optional. Default: false. When set to true, the ‘headers’ of a Kafka Data Source will be saved as a binary map. | | ttl | String | Optional. Set to any value accepted in ClickHouse for a TTL or to ‘false’ to remove the TTL. | | dry | Boolean | Optional. Default: false. Set to true to show what would be modified in the Data Source, without running any modification at all. | The schema parameter can be used to add new columns at the end of the existing ones in a Data Source. 
Be aware that we currently don't validate whether the change will affect the existing MVs (Materialized Views) attached to the Data Source being modified, so this change may break existing MVs. For example, avoid changing a Data Source that has an MV created with something like `SELECT * FROM Data Source ...` . If you want MVs that are forward compatible with column additions, create them specifying the columns explicitly instead of using the `*` operator. Also, take into account that, for now, the only engines supporting adding new columns are those inside the MergeTree family. To add a column to a Data Source, call this endpoint with the Data Source name and the new schema definition. For example, given a Data Source created like this: Creating a Data Source from a schema [¶](https://www.tinybird.co/docs/about:blank#id14) curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources" \ -d "name=stocks" \ -d "schema=symbol String, date Date, close Float32" If you want to add a new column 'concept String', you need to call this endpoint with the new schema: Adding a new column to an existing Data Source [¶](https://www.tinybird.co/docs/about:blank#id15) curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources/stocks/alter" \ -d "schema=symbol String, date Date, close Float32, concept String" If everything went OK, you will get the operations performed in the response: ADD COLUMN operation resulting from the schema change. [¶](https://www.tinybird.co/docs/about:blank#id16) { "operations": [ "ADD COLUMN `concept` String" ] } You can also view the inferred operations without executing them by adding `dry=true` to the parameters. - To modify the description of a Data Source: Modifying the description of a Data Source [¶](https://www.tinybird.co/docs/about:blank#id17) curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources/stocks/alter" \ -d "name=stocks" \ -d "description=My new description" - To save the JSON as a raw string in the "value" column of a Kafka Data Source: Saving the raw string in the value column of a Kafka Data Source [¶](https://www.tinybird.co/docs/about:blank#id18) curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources/stocks/alter" \ -d "name=stocks" \ -d "kafka_store_raw_value=true" -d "kafka_store_headers=true" - To modify the TTL of a Data Source: Modifying the TTL of a Data Source [¶](https://www.tinybird.co/docs/about:blank#id19) curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources/stocks/alter" \ -d "name=stocks" \ -d "ttl=12 hours" - To remove the TTL of a Data Source: Removing the TTL of a Data Source [¶](https://www.tinybird.co/docs/about:blank#id20) curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources" \ -d "name=stocks" \ -d "ttl=false" - To add default values to the columns of a Data Source: Modifying default values [¶](https://www.tinybird.co/docs/about:blank#id21) curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources" \ -d "name=stocks" \ -d "schema=symbol String DEFAULT '-', date Date DEFAULT now(), close Float32 DEFAULT 1.1" - To add default values to the columns of an NDJSON Data Source, add the default definition after the jsonpath definition: Modifying default values in an NDJSON Data Source [¶](https://www.tinybird.co/docs/about:blank#id22) curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources" \ -d "name=stocks" \ -d "schema=symbol String `json:$.symbol` DEFAULT '-', date Date `json:$.date` DEFAULT now(), close `json:$.close` Float32 DEFAULT 1.1" - To make a column nullable, change the type of the column by adding the Nullable prefix to the old type: Converting column "close" to Nullable [¶](https://www.tinybird.co/docs/about:blank#id23) curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources" \ -d "name=stocks" \ -d "schema=symbol String `json:$.symbol`, date Date `json:$.date`, close `json:$.close` Nullable(Float32)" - To drop a column, just remove the column from the schema definition. It is not possible to remove columns that are part of the primary key or the partition key: Remove column "close" from the Data Source [¶](https://www.tinybird.co/docs/about:blank#id24) curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources" \ -d "name=stocks" \ -d "schema=symbol String `json:$.symbol`, date Date `json:$.date`" You can also alter the JSONPaths of existing Data Sources. In that case you have to specify the [JSONPath](https://www.tinybird.co/docs/docs/guides/ingesting-data/ingest-ndjson-data) in the schema in the same way as when you created the Data Source. POST /v0/datasources/(.+)/truncate [¶](https://www.tinybird.co/docs/about:blank#post--v0-datasources-(.+)-truncate) Truncates a Data Source in your account. If the Data Source has dependent Materialized Views, those **won't** be truncated in cascade. If you also want to delete data from dependent Materialized Views, you'll have to make a subsequent call to this method for each of them. The Auth token in use must have the `DATASOURCES:CREATE` scope. Truncating a Data Source [¶](https://www.tinybird.co/docs/about:blank#id25) curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources/:name/truncate" This works as well for the `quarantine` table of a Data Source. Remember that the quarantine table for a Data Source has the same name but with the "_quarantine" suffix. Truncating the quarantine table from a Data Source [¶](https://www.tinybird.co/docs/about:blank#id26) curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources/:name_quarantine/truncate" POST /v0/datasources/(.+)/delete [¶](https://www.tinybird.co/docs/about:blank#post--v0-datasources-(.+)-delete) Deletes rows from a Data Source in your account given a SQL condition. The Auth token in use must have the `DATASOURCES:CREATE` scope. Deleting rows from a Data Source given a SQL condition [¶](https://www.tinybird.co/docs/about:blank#id27) curl \ -H "Authorization: Bearer " \ --data "delete_condition=(country='ES')" \ "https://api.tinybird.co/v0/datasources/:name/delete" When deleting rows from a Data Source, the response will not be the final result of the deletion but a Job. You can check the job status and progress using the [Jobs API](https://www.tinybird.co/docs/docs/api-reference/jobs-api).
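Since the deletion runs asynchronously, a common pattern is to poll the returned job until it reaches a final status. A minimal shell sketch, assuming `jq` is installed and using `$TB_TOKEN` and `$JOB_ID` (the `job_id` from the delete response) as placeholders:

```bash
# Poll the Jobs API every few seconds until the delete job finishes
while true; do
  STATUS=$(curl -s -H "Authorization: Bearer $TB_TOKEN" \
    "https://api.tinybird.co/v0/jobs/$JOB_ID" | jq -r '.status')
  echo "delete job status: $STATUS"
  [ "$STATUS" = "done" ] || [ "$STATUS" = "error" ] && break
  sleep 5
done
```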
In the response, `id`, `job_id` , and `delete_id` should have the same value: Delete API Response [¶](https://www.tinybird.co/docs/about:blank#id28) { "id": "64e5f541-xxxx-xxxx-xxxx-00524051861b", "job_id": "64e5f541-xxxx-xxxx-xxxx-00524051861b", "job_url": "https://api.tinybird.co/v0/jobs/64e5f541-xxxx-xxxx-xxxx-00524051861b", "job": { "kind": "delete_data", "id": "64e5f541-xxxx-xxxx-xxxx-00524051861b", "job_id": "64e5f541-xxxx-xxxx-xxxx-00524051861b", "status": "waiting", "created_at": "2023-04-11 13:52:32.423207", "updated_at": "2023-04-11 13:52:32.423213", "started_at": null, "is_cancellable": true, "datasource": { "id": "t_c45d5ae6781b41278fcee365f5bxxxxx", "name": "shopping_data" }, "delete_condition": "event = 'search'" }, "status": "waiting", "delete_id": "64e5f541-xxxx-xxxx-xxxx-00524051861b" } To check on the progress of the delete job, use the `job_id` from the Delete API response to query the [Jobs API](https://www.tinybird.co/docs/docs/api-reference/jobs-api). For example, to check on the status of the above delete job: checking the status of the delete job [¶](https://www.tinybird.co/docs/about:blank#id29) curl \ -H "Authorization: Bearer " \ https://api.tinybird.co/v0/jobs/64e5f541-xxxx-xxxx-xxxx-00524051861b Would respond with: Job API Response [¶](https://www.tinybird.co/docs/about:blank#id30) { "kind": "delete_data", "id": "64e5f541-xxxx-xxxx-xxxx-00524051861b", "job_id": "64e5f541-xxxx-xxxx-xxxx-00524051861b", "status": "done", "created_at": "2023-04-11 13:52:32.423207", "updated_at": "2023-04-11 13:52:37.330020", "started_at": "2023-04-11 13:52:32.842861", "is_cancellable": false, "datasource": { "id": "t_c45d5ae6781b41278fcee365f5bc2d35", "name": "shopping_data" }, "delete_condition": " event = 'search'", "rows_affected": 100 } ### Data Source engines supported Tinybird uses ClickHouse as the underlying storage technology. ClickHouse features different strategies to store data, these different strategies define not only where and how the data is stored but what kind of data access, queries, and availability your data has. In ClickHouse terms, a Tinybird Data Source uses a [Table Engine](https://clickhouse.tech/docs/en/engines/table_engines/) that determines those factors. Currently, Tinybird supports deleting data for data sources with the following Engines: - MergeTree - ReplacingMergeTree - SummingMergeTree - AggregatingMergeTree - CollapsingMergeTree - VersionedCollapsingMergeTree ### Dependent views deletion If the Data Source has dependent Materialized Views, those won’t be cascade deleted. In case you want to delete data from other dependent Materialized Views, you’ll have to do a subsequent call to this method for the affected view with a proper `delete_condition` . This applies as well to the associated `quarantine` Data Source. | KEY | TYPE | DESCRIPTION | | --- | --- | --- | | delete_condition | String | Mandatory. A string representing the WHERE SQL clause you’d add to a regular DELETE FROM WHERE statement. Most of the times you might want to write a simple `delete_condition` such as `column_name=value` but any valid SQL statement including conditional operators is valid | | dry_run | String | Default: `false` . It allows you to test the deletion. When using `true` it will execute all deletion validations and return number of matched `rows_to_be_deleted` . 
| GET /v0/datasources/(.+) [¶](https://www.tinybird.co/docs/about:blank#get--v0-datasources-(.+)) Getting information about a particular Data Source [¶](https://www.tinybird.co/docs/about:blank#id32) curl \ -H "Authorization: Bearer " \ -X GET "https://api.tinybird.co/v0/datasources/datasource_name" Get Data Source information and stats. The token provided must have read access to the Data Source. Successful response [¶](https://www.tinybird.co/docs/about:blank#id33) { "id": "t_bd1c62b5e67142bd9bf9a7f113a2b6ea", "name": "datasource_name", "statistics": { "bytes": 430833, "row_count": 3980 }, "used_by": [{ "id": "t_efdc62b5e67142bd9bf9a7f113a34353", "name": "pipe_using_datasource_name" }] "updated_at": "2018-09-07 23:50:32.322461", "created_at": "2018-11-28 23:50:32.322461", "type": "csv" }| Key | Type | Description | | --- | --- | --- | | attrs | String | comma separated list of the Data Source attributes to return in the response. Example: `attrs=name,id,engine` . Leave empty to return a full response | `id` and `name` are two ways to refer to the Data Source in SQL queries and API endpoints. The only difference is that the `id` never changes; it will work even if you change the `name` (which is the name used to display the Data Source in the UI). In general you can use `id` or `name` indistinctively: Using the above response as an example: `select count(1) from events_table` is equivalent to `select count(1) from t_bd1c62b5e67142bd9bf9a7f113a2b6ea` The id `t_bd1c62b5e67142bd9bf9a7f113a2b6ea` is not a descriptive name so you can add a description like `t_my_events_datasource.bd1c62b5e67142bd9bf9a7f113a2b6ea` The `statistics` property contains information about the table. Those numbers are an estimation: `bytes` is the estimated data size on disk and `row_count` the estimated number of rows. These statistics are updated whenever data is appended to the Data Source. The `used_by` property contains the list of pipes that are using this data source. Only Pipe `id` and `name` are sent. The `type` property indicates the `format` used when the Data Source was created. Available formats are `csv`, `ndjson` , and `parquet` . The Data Source `type` indicates what file format you can use to ingest data. DELETE /v0/datasources/(.+) [¶](https://www.tinybird.co/docs/about:blank#delete--v0-datasources-(.+)) Dropping a Data Source [¶](https://www.tinybird.co/docs/about:blank#id35) curl \ -H "Authorization: Bearer " \ -X DELETE "https://api.tinybird.co/v0/datasources/:name" Drops a Data Source from your account. | Key | Type | Description | | --- | --- | --- | | force | String | Default: `false` . The `force` parameter is taken into account when trying to delete Materialized Views. By default, when using `false` the deletion will not be carried out; you can enable it by setting it to `true` . If the given Data Source is being used as the trigger of a Materialized Node, it will not be deleted in any case. | | dry_run | String | Default: `false` . It allows you to test the deletion. When using `true` it will execute all deletion validations and return the possible affected materializations and other dependencies of a given Data Source. | | token | String | Auth token. Only required if no Bearer Authorization header is sent. It must have `DROP:datasource_name` scope for the given Data Source. 
| PUT /v0/datasources/(.+) [¶](https://www.tinybird.co/docs/about:blank#put--v0-datasources-(.+)) Update Data Source attributes Updating the name of a Data Source [¶](https://www.tinybird.co/docs/about:blank#id37) curl \ -H "Authorization: Bearer " \ -X PUT "https://api.tinybird.co/v0/datasources/:name?name=new_name" Promoting a Data Source to a Snowflake one [¶](https://www.tinybird.co/docs/about:blank#id38) curl \ -H "Authorization: Bearer " \ -X PUT "https://api.tinybird.co/v0/datasources/:name" \ -d "connector=1d8232bf-2254-4d68-beff-4dd9aa505ab0" \ -d "service=snowflake" \ -d "cron=*/30 * * * *" \ -d "query=select a, b, c from test" \ -d "mode=replace" \ -d "external_data_source=database.schema.table" \ -d "ingest_now=True" \| Key | Type | Description | | --- | --- | --- | | name | String | new name for the Data Source | | token | String | Auth token. Only required if no Bearer Authorization header is sent. It should have `DATASOURCES:CREATE` scope for the given Data Source | | connector | String | Connector ID to link it to | | service | String | Type of service to promote it to. Only ‘snowflake’ or ‘bigquery’ allowed | | cron | String | Cron-like pattern to execute the connector’s job | | query | String | Optional: custom query to collect from the external data source | | mode | String | Only replace is allowed for connectors | | external_data_source | String | External data source to use for Snowflake | | ingest_now | Boolean | To ingest the data immediately instead of waiting for the first execution determined by cron | --- URL: https://www.tinybird.co/docs/api-reference/environment-variables-api Last update: 2024-10-17T14:29:53.000Z Content: --- title: "Environment Variables API · Tinybird Docs" theme-color: "#171612" description: "The Environment Variables API allows you to create, update, delete and list environment variables that can be used in Pipes in a Workspace." --- # Environment Variables API¶ The Environment Variables API allows you to create, update, delete, and list environment variables that can be used in Pipes in a Workspace. Environment variables allow you to store sensitive information, such as access secrets and hostnames, in your Workspace. Using the Environment Variables API requires a Workspace admin token. Environment variables are encrypted at rest. ## Environment variables types¶ The Environment Variables API support different types of environment variables: | Environment variable type | Comments | | --- | --- | | `secret` | Used to store passwords and other secrets, automatically prevents Endpoint from exposing its value. It's the default type. | More types of environment variables types will be added soon. ## API Limits¶ Check the [limits page](https://www.tinybird.co/docs/docs/support/limits) for limits on ingestion, queries, API Endpoints, and more. The Environment Variables API has limits of: - 5 requests per second. - 100 environment variables per Workspace. - 8KB max size of the `value` attribute. ## Templating¶ Once you've created environment variables in a Workspace, you can use the `tb_secret` template function to replace the original value: % SELECT * FROM postgresql('host:post', 'database', 'table', 'user', {{tb_secret('pg_password')}}) Environment variables values are rendered as `String` data type. If you need to use a different type, use any of the ClickHouse® functions to cast a String value to a given type. 
For example: % SELECT * FROM table WHERE int_value = toUInt8({{tb_secret('int_secret')}}) ## Staging and production use¶ If you have staging and production Workspaces, create the same environment variables with the same name in both Workspaces, changing only their corresponding value. Tinybird doesn't allow you to create an API Endpoint when exposing environment variables with `type=secret` in a SELECT clause. So, while it's possible to have a Node that uses the logic `SELECT {{tb_secret('username')}}` , you can't publish that Node as a Copy Pipe or API Endpoint. ## Branch use¶ Environment variables can be used in Branches, but they must be created in the main Workspace initially. Environment variables have the same value in the main Workspace as in the Branches. You cannot create a environment variable in a Branch to be deployed in the main Workspace. ## POST /v0/variables/?¶ Creates a new environment variable. ### Restrictions¶ Environment variables names are unique for a Workspace. ### Example¶ curl \ -X POST "https://$TB_HOST/v0/variables" \ -H "Authorization: Bearer " \ -d "type=secret" \ -d "name=test_password" \ -d "value=test" ### Request parameters¶ | Key | Type | Description | | --- | --- | --- | | type | String (optional) | The type of the variable. Defaults to `secret` | | name | String | The name of the variable | | value | String | The variable value | ### Successful response example¶ { "name": "test_token", "created_at": "2024-06-21T10:27:57", "updated_at": "2024-06-21T10:27:57", "edited_by": "token: 'admin token'" } ### Response codes¶ | Code | Description | | --- | --- | | 200 | OK | | 400 | Invalid or missing parameters | | 403 | Limit reached or invalid token | | 404 | Workspace not found | ## DELETE /v0/variables/(.\+)¶ Deletes a environment variable. ### Example¶ curl \ -X DELETE "https://$TB_HOST/v0/variables/test_password" \ -H "Authorization: Bearer " ### Successful response example¶ { "ok": true } ### Response codes¶ | Code | Description | | --- | --- | | 200 | OK | | 400 | Invalid or missing parameters | | 403 | Limit reached or token invalid | | 404 | Workspace or variable not found | ## PUT /v0/variables/(.\+)¶ Updates a environment variable. ### Example¶ curl \ -X PUT "https://$TB_HOST/v0/variables/test_password" \ -H "Authorization: Bearer " \ -d "value=new_value" ### Successful response example¶ { "name": "test_password", "type": "secret", "created_at": "2024-06-21T10:27:57", "updated_at": "2024-06-21T10:29:57", "edited_by": "token: 'admin token'" } ### Response codes¶ | Code | Description | | --- | --- | | 200 | OK | | 400 | Invalid or missing parameters | | 403 | Limit reached or token invalid | | 404 | Workspace or variable not found | ## GET /v0/variables/?¶ Retrieves all Workspace environment variables. The value is not returned. ### Example¶ curl \ -X GET "https://$TB_HOST/v0/variables" \ -H "Authorization: Bearer " ### Successful response example¶ { "variables": [ { "name": "test_token", "type": "secret", "created_at": "2024-06-21T10:27:57", "updated_at": "2024-06-21T10:27:57", "edited_by": "token: 'admin token'" }, { "name": "test_token2", "type": "secret", "created_at": "2024-06-21T10:27:57", "updated_at": "2024-06-21T10:29:57", "edited_by": "token: 'admin token'" } ] } ### Response codes¶ | Code | Description | | --- | --- | | 200 | OK | | 400 | Invalid or missing parameters | | 403 | Limit reached or token invalid | | 404 | Workspace not found | ## GET /v0/variables/(.\+)¶ Fetches information about a particular environment variable. 
The value is not returned. ### Example¶ curl \ -X GET "https://$TB_HOST/v0/variables/test_password" \ -H "Authorization: Bearer " ### Successful response example¶ { "name": "test_password", "type": "secret", "created_at": "2024-06-21T10:27:57", "updated_at": "2024-06-21T10:27:57", "edited_by": "token: 'admin token'" } ### Response codes¶ | Code | Description | | --- | --- | | 200 | OK | | 400 | Invalid or missing parameters | | 403 | Limit reached or token invalid | | 404 | Workspace or variable not found | Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/api-reference/events-api Last update: 2024-10-28T11:06:14.000Z Content: --- title: "Events API Reference · Tinybird Docs" theme-color: "#171612" description: "With the Tinybird Events API you can ingest thousands of JSON events per second." --- # Events API¶ The Events API allows you to ingest JSON events with a simple HTTP POST request. [Read more about the Events API](https://www.tinybird.co/docs/docs/ingest/events-api). All endpoints require authentication using a Token with `DATASOURCE:APPEND` or `DATASOURCE:CREATE` scope. ## POST /v0/events¶ Use this endpoint to send NDJSON (new-line delimited JSON) events to a [Data Source](https://www.tinybird.co/docs/docs/concepts/data-sources). **Request parameters** | Key | Type | Description | | --- | --- | --- | | name | String | Name or ID of the target Data Source to append data to | | wait | Boolean | 'false' by default. Set to 'true' to wait until the write is acknowledged by the database. Enabling this flag makes it possible to retry on database errors, but it introduces additional latency. It's recommended to enable it in use cases where avoiding data loss is critical, and to disable it otherwise. | **Return HTTP status codes** | Status Code | Description | | --- | --- | | 200 | The data has been inserted into the database. The write has been acknowledged. The request 'wait' parameter was enabled. | | 202 | The data has been processed, and it will be sent to the database eventually. The write has not been acknowledged yet. The request 'wait' parameter was disabled. | | 400 | The request is invalid. The body will contain more information. A common cause is missing the 'name' parameter. No data has been inserted, but the request shouldn't be retried. | | 403 | The token is not valid. The request shouldn't be retried. | | 404 | The token's Workspace doesn't belong to this cluster. The Workspace is probably removed or in another cluster. The request shouldn't be retried; ensure the token's region and the Tinybird domain match. | | 422 | The ingestion has been partially completed due to an error in a Materialized View. Retrying may result in data duplication, but not retrying may result in data loss. The general advice is to not retry, review attached Materialized Views, and contact us if the issue persists. | | 429 | The request/second limit has been reached. The default limit is 1000 requests/second; contact us for increased capacity. The request may be retried after a while; we recommend using exponential backoff with a limited number of retries. | | 500 | An unexpected error has occurred. The body will contain more information. Retrying is the general advice; contact us if the issue persists. | | 503 | The service is temporarily unavailable. The body may contain more information.
A common cause is to have reached a throughput limit, or to have attached a Materialized View with an issue. No data has been inserted, and it's safe to retry. Contact with us if the issue persists. | | 0x07 GOAWAY | HTTP2 only. Too many alive connections. Recreate the connection and retry. | **Compression** You can compress JSON events with Gzip and send the compressed payload to the Events API. You must include the header `Content-Encoding: gzip` with the request. **Examples** ##### Send individual JSON messages curl \ -H "Authorization: Bearer " \ -d '{"date": "2020-04-05 00:05:38", "city": "Chicago"}' \ 'https://api.tinybird.co/v0/events?name=events_test' ##### Send many NDJSON events. Notice the '$' before the JSON events. It's needed in order for Bash to replace the '\\n'. curl doesn't do it automatically. curl \ -H "Authorization: Bearer " \ -d $'{"date": "2020-04-05 00:05:38", "city": "Chicago"}\n{"date": "2020-04-05 00:07:22", "city": "Madrid"}\n' \ 'https://api.tinybird.co/v0/events?name=events_test' ##### Send a Gzip compressed payload, where 'body.gz' is a batch of NDJSON events curl \ -H "Authorization: Bearer " \ -H "Content-Encoding: gzip" \ --data-binary @body.gz \ 'https://api.tinybird.co/v0/events?name=events_example' --- URL: https://www.tinybird.co/docs/api-reference/jobs-api Last update: 2024-07-31T16:54:46.000Z Content: --- title: "Jobs API Reference · Tinybird Docs" theme-color: "#171612" description: "With the Jobs API, you can list the jobs for the last 48 hours or the last 100 jobs and also get the details for a specific job." --- GET /v0/jobs/? [¶](https://www.tinybird.co/docs/about:blank#get--v0-jobs-?) We can get a list of the last 100 jobs in the last 48 hours, with the possibility of filtering them by kind, status, pipe_id, pipe_name, created_after, and created_before. | Key | Type | Description | | --- | --- | --- | | kind | String | This will return only the jobs with that particular kind. Example: `kind=populateview` or `kind=copy` or `kind=import` | | status | String | This will return only the jobs with the status provided. Example: `status=done` or `status=waiting` or `status=working` or `status=error` | | pipe_id | String | This will return only the jobs associated with the provided pipe id. Example: `pipe_id=t_31a0ff508c9843b59c32f7f81a156968` | | pipe_name | String | This will return only the jobs associated with the provided pipe name. Example: `pipe_name=test_pipe` | | created_after | String | This will return jobs that were created after the provided date in the ISO 8601 standard date format. Example: `created_after=2023-06-15T18:13:25.855Z` | | created_before | String | This will return jobs that were created before the provided date in the ISO 8601 standard date format. Example: `created_before=2023-06-19T18:13:25.855Z` | Getting the latest jobs [¶](https://www.tinybird.co/docs/about:blank#id2) curl \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/jobs" \ You will get a list of jobs with the `kind`, `status`, `id` , and the `url` to access the specific information about that job. 
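You can also narrow the list with the filter parameters described above. For example, a sketch that lists only failed import jobs (`$TB_TOKEN` is a placeholder for your Auth token):

```bash
# List only import jobs that ended in error
curl \
  -H "Authorization: Bearer $TB_TOKEN" \
  "https://api.tinybird.co/v0/jobs?kind=import&status=error"
```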
Jobs list [¶](https://www.tinybird.co/docs/about:blank#id3) { "jobs": [ { "id": "c8ae13ef-e739-40b6-8bd5-b1e07c8671c2", "kind": "import", "status": "done", "created_at": "2020-12-04 15:08:33.214377", "updated_at": "2020-12-04 15:08:33.396286", "job_url": "https://api.tinybird.co/v0/jobs/c8ae13ef-e739-40b6-8bd5-b1e07c8671c2", "datasource": { "id": "t_31a0ff508c9843b59c32f7f81a156968", "name": "my_datasource_1" } }, { "id": "1f6a5a3d-cfcb-4244-ba0b-0bfa1d1752fb", "kind": "import", "status": "error", "created_at": "2020-12-04 15:08:09.051310", "updated_at": "2020-12-04 15:08:09.263055", "job_url": "https://api.tinybird.co/v0/jobs/1f6a5a3d-cfcb-4244-ba0b-0bfa1d1752fb", "datasource": { "id": "t_49806938714f4b72a225599cdee6d3ab", "name": "my_datasource_2" } } ] } Job details in `job_url` will be available for 48h after its creation. POST /v0/jobs/(.+)/cancel [¶](https://www.tinybird.co/docs/about:blank#post--v0-jobs-(.+)-cancel) With this endpoint you can try to cancel an existing Job. All jobs can be cancelled if they are in the "waiting" status, but you can't cancel a Job in "done" or "error" status. In the case of "populate" jobs, you can also cancel them in the "working" state. After successfully starting the cancellation process, you will see one of two different statuses in the job: - "cancelling": The Job can't be immediately cancelled as it's doing some work, but the cancellation will eventually happen. - "cancelled": The Job has been completely cancelled. A Job cancellation doesn't guarantee a complete rollback of the changes being made by it; sometimes you will need to delete newly inserted rows or newly created Data Sources. The fastest way to know if a job is cancellable is to read the "is_cancellable" key inside the job JSON description. Depending on the Job and its status, when you try to cancel it you may get different responses: - HTTP Code 200: The Job has successfully started the cancellation process. Remember that if the Job now has a "cancelling" status, it may need some time to completely cancel itself. This request will return the status of the job. - HTTP Code 404: Job not found. - HTTP Code 403: The token provided doesn't have access to this Job. - HTTP Code 400: Job is not in a cancellable status or you are trying to cancel a job which is already in the "cancelling" state. Try to cancel a Job [¶](https://www.tinybird.co/docs/about:blank#id4) curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/jobs/:job_id/cancel" Populate Job in cancelling state right after the cancellation request.
[¶](https://www.tinybird.co/docs/about:blank#id5) { "kind": "populateview", "id": "32c3438d-582e-4a6f-9b57-7d7a3bfbeb8c", "job_id": "32c3438d-582e-4a6f-9b57-7d7a3bfbeb8c", "status": "cancelling", "created_at": "2021-03-17 18:56:23.939380", "updated_at": "2021-03-17 18:56:44.343245", "is_cancellable": false, "datasource": { "id": "t_02043945875b4070ae975f3812444b76", "name": "your_datasource_name", "cluster": null, "tags": {}, "created_at": "2020-07-15 10:55:12.427269", "updated_at": "2020-07-15 10:55:12.427270", "statistics": null, "replicated": false, "version": 0, "project": null, "used_by": [] }, "query_id": "01HSZ9WJES5QEZZM4TGDD3YFZ2", "pipe_id": "t_7fa8009023a245b696b4f2f7195b23c3", "pipe_name": "top_product_per_day", "queries": [ { "query_id": "01HSZ9WJES5QEZZM4TGDD3YFZ2", "status": "done" }, { "query_id": "01HSZ9WY6QS6XAMBHZMSNB1G75", "status": "done" }, { "query_id": "01HSZ9X8YVEQ0PXA6T2HZQFFPX", "status": "working" }, { "query_id": "01HSZQ5YX0517X81JBF9G9HB2P", "status": "waiting" }, { "query_id": "01HSZQ6PZJA3P81RC6Q6EF6HMK", "status": "waiting" }, { "query_id": "01HSZQ76D7YYFB16TFT32KXMCY", "status": "waiting" } ], "progress_percentage": 50.0 } GET /v0/jobs/(.+) [¶](https://www.tinybird.co/docs/about:blank#get--v0-jobs-(.+)) Get the details of a specific Job. You can get the details of a Job by using its ID. Get the details of a Job [¶](https://www.tinybird.co/docs/about:blank#id6) curl \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/jobs/:job_id" You will get a JSON response with the details of the Job, including the `kind`, `status`, `id`, `created_at`, `updated_at` , and the `datasource` associated with the Job. This is available for 48h after the Job creation. After that, you can consult the Job details in the Service Data Source jobs_log. Job details [¶](https://www.tinybird.co/docs/about:blank#id7) { "kind": "import", "id": "d5b869ed-3a74-45f9-af54-57350aae4cef", "job_id": "d5b869ed-3a74-45f9-af54-57350aae4cef", "status": "done", "created_at": "2024-07-22 11:47:58.207606", "updated_at": "2024-07-22 11:48:52.971327", "started_at": "2024-07-22 11:47:58.351734", "is_cancellable": false, "mode": "append", "datasource": { "id": "t_caf95c54174e48f488ea65d181eb5b75", "name": "events", "cluster": "default", "tags": { }, "created_at": "2024-07-22 11:47:51.807384", "updated_at": "2024-07-22 11:48:52.726243", "replicated": true, "version": 0, "project": null, "headers": { "cached_delimiter": "," }, "shared_with": [ ], "engine": { "engine": "MergeTree", "partition_key": "toYear(date)", "sorting_key": "date, user_id, event, extra_data" }, "description": "", "used_by": [ ], "last_commit": { "content_sha": "", "status": "changed", "path": "" }, "errors_discarded_at": null, "type": "csv" }, "import_id": "d5b869ed-3a74-45f9-af54-57350aae4cef", "url": "https://storage.googleapis.com/tinybird-assets/datasets/guides/events_50M_1.csv", "statistics": { "bytes": 1592720209, "row_count": 50000000 }, "quarantine_rows": 0, "invalid_lines": 0 } --- URL: https://www.tinybird.co/docs/api-reference/overview Last update: 2024-10-28T11:06:14.000Z Content: --- title: "API Overview · Tinybird Docs" theme-color: "#171612" description: "Tinybird's API Endpoints, such as the Data Sources API to import Data, the Pipes API to transform data and publish the results through API Endpoints, and the Query API to run arbitrary queries." --- # API Overview¶ You can control Tinybird services using the API as an alternative to the UI and CLI. 
The following APIs are available: | API name | Description | | --- | --- | | [ Analyze API](https://www.tinybird.co/docs/docs/api-reference/analyze-api) | Analyze a given NDJSON, CSV, or Parquet file to generate a Tinybird Data Source schema. | | [ Data Sources API](https://www.tinybird.co/docs/docs/api-reference/datasource-api) | List, create, update, or delete your Tinybird Data Sources, and insert or delete data from Data Sources. | | [ Events API](https://www.tinybird.co/docs/docs/api-reference/events-api) | Ingest NDJSON events with a simple HTTP POST request. | | [ Jobs API](https://www.tinybird.co/docs/docs/api-reference/jobs-api) | Get details on Tinybird jobs, and list the jobs for the last 48 hours or the last 100 jobs. | | [ Pipes API](https://www.tinybird.co/docs/docs/api-reference/pipe-api/overview) | Interact with your Pipes, including Pipes themselves, API Endpoints, Materialized Views, and managing Copy jobs. | | [ Query API](https://www.tinybird.co/docs/docs/api-reference/query-api) | Query your Pipes and Data Sources inside Tinybird as if you were running SQL statements against a regular database. | | [ Environment Variables API](https://www.tinybird.co/docs/docs/api-reference/environment-variables-api) | Create, update, delete, and list variables that can be used in Pipes in a Workspace. | | [ Sink Pipes API](https://www.tinybird.co/docs/docs/api-reference/sink-pipes-api) | Create, delete, schedule, and trigger Sink Pipes. | | [ Tokens API](https://www.tinybird.co/docs/docs/api-reference/token-api) | List, create, update, or delete your Tinybird Static Tokens. | Make all requests to Tinybird's API Endpoints over TLS (HTTPS). All response bodies, including errors, are encoded as JSON. You can get information on several Workspace operations by [monitoring your jobs](https://www.tinybird.co/docs/docs/monitoring/jobs) , using either the APIs or the built-in Tinybird Service Data Sources. ## Regions and endpoints¶ A Workspace belongs to one region. The API for each region has a specific API base URL that you use to make API requests. The following table lists the current regions and their corresponding API base URLs: **Current Tinybird regions** | Region | Provider | Provider region | API base URL | | --- | --- | --- | --- | | Europe | GCP | europe-west3 | [ https://api.tinybird.co](https://api.tinybird.co/) | | US East | GCP | us-east4 | [ https://api.us-east.tinybird.co](https://api.us-east.tinybird.co/) | | Europe | AWS | eu-central-1 | [ https://api.eu-central-1.aws.tinybird.co](https://api.eu-central-1.aws.tinybird.co/) | | US East | AWS | us-east-1 | [ https://api.us-east.aws.tinybird.co](https://api.us-east.aws.tinybird.co/) | | US West | AWS | us-west-2 | [ https://api.us-west-2.aws.tinybird.co](https://api.us-west-2.aws.tinybird.co/) | Tinybird documentation uses `https://api.tinybird.co` as the default example API base URL in code snippets. If you are not using the Europe GCP region, replace the sample URL with the API base URL for your region. ## Authentication¶ Tinybird makes use of Tokens for every API call. This ensures that each user or application can only access data that they are authorized to access. See [Tokens](https://www.tinybird.co/docs/docs/concepts/auth-tokens). You must make all API requests over HTTPS. Don't make calls over plain HTTP or send API requests without authentication. 
Replace the Tinybird API hostname/region with the [right API URL region](https://www.tinybird.co/docs/docs/api-reference/overview#regions-and-endpoints) that matches your Workspace. Your Token lives in the Workspace under "Tokens". There are two ways to authenticate your requests in the Tinybird API: using an authorization header, or using a URL parameter. ### 1. Authorization header¶ You can send a Bearer authorization header to authenticate API calls. With curl, use `-H "Authorization: Bearer "`. If you have a valid Token with read access to the particular Data Source, you can get a successful response by sending the following request: ##### Authorization header Authenticated request curl \ -X GET \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/sql?q=SELECT+*+FROM+" ### 2. URL parameter¶ You can also specify the Token using a parameter in the URL, using `token=` . For example: ##### URL parameter authenticated request curl -X GET \ "https://api.tinybird.co/v0/sql?q=SELECT+*+FROM+&token=" ## Compression¶ To compress API responses, add `Accept-Encoding: gzip` to your requests. For example: ##### Request with compressed response curl \ -X GET \ -H "Authorization: Bearer " \ -H "Accept-Encoding: gzip" \ "https://api.tinybird.co/v0/sql?q=SELECT+*+FROM+" ## Errors¶ Tinybird 's API returns standard HTTP success or error status codes. In case of errors, responses include additional information in JSON format. The following table lists the error status codes. **Response codes** | Code | Description | | --- | --- | | 400 | Bad request. This could be due to a missing parameter in a request, for instance | | 403 | Forbidden. Provided auth token doesn't have the right scope or the Data Source is not available | | 404 | Not found | | 405 | HTTP Method not allowed | | 408 | Request timeout (e.g. query execution time was exceeded) | | 409 | You need to resubmit the request due to a conflict with the current state of the target source (e.g.: you need to delete a Materialized View) | | 411 | No valid Content-Length header containing the length of the message-body | | 413 | The message body is too large | | 429 | Too many requests. When over the rate limits of your account | | 500 | Unexpected error | ## Limits¶ Check the [limits page](https://www.tinybird.co/docs/docs/support/limits) for limits on ingestion, queries, API Endpoints, and more. ## Versioning¶ All Tinybird APIs are versioned with a version string specified in the base URL. We encourage you to always use the latest API available. When versioning our web services, Tinybird adheres to [semantic versioning](https://semver.org/) rules. ## Reserved words¶ The following keywords are reserved words. You can't use them to name Data Sources, Pipes, Nodes or Workspaces. Case is ignored. 
- `Array` - `Boolean` - `Date` - `Date32` - `DateTime` - `DateTime32` - `DateTime64` - `Decimal` - `Decimal128` - `Decimal256` - `Decimal32` - `Decimal64` - `Enum` - `Enum16` - `Enum8` - `FixedString` - `Float32` - `Float64` - `IPv4` - `IPv6` - `Int128` - `Int16` - `Int256` - `Int32` - `Int64` - `Int8` - `MultiPolygon` - `Point` - `Polygon` - `Ring` - `String` - `TABLE` - `UInt128` - `UInt16` - `UInt256` - `UInt32` - `UInt64` - `UInt8` - `UUID` - `_temporary_and_external_tables` - `add` - `after` - `all` - `and` - `anti` - `any` - `array` - `as` - `asc` - `asof` - `between` - `by` - `case` - `collate` - `column` - `columns` - `cross` - `cube` - `custom_error` - `day_diff` - `default` - `defined` - `desc` - `distinct` - `else` - `end` - `enumerate_with_last` - `error` - `exists` - `from` - `full` - `functions` - `generateRandom` - `global` - `group` - `having` - `if` - `ilike` - `in` - `inner` - `insert` - `interval` - `into` - `join` - `left` - `like` - `limit` - `limits` - `max` - `min` - `not` - `null` - `numbers_mt` - `on` - `one` - `or` - `order` - `outer` - `prewhere` - `public` - `right` - `sample` - `select` - `semi` - `split_to_array` - `sql_and` - `sql_unescape` - `system` - `table` - `then` - `tinybird` - `to` - `union` - `using` - `where` - `with` - `zeros_mt` Pipe, Data Source and Node names are globally unique. You can't use an alias for a column that matches a globally unique name. --- URL: https://www.tinybird.co/docs/api-reference/pipe-api/api-endpoints Last update: 2024-07-31T16:54:46.000Z Content: --- title: "Pipes API > API Endpoints Reference · Tinybird Docs" theme-color: "#171612" description: "The Pipes API enables you to manage your Pipes. Use the API Endpoints service to publish or unpublish your Pipes as API Endpoints." --- POST /v0/pipes/(.+)/nodes/(.+)/endpoint [¶](https://www.tinybird.co/docs/about:blank#post--v0-pipes-(.+)-nodes-(.+)-endpoint) Publishes an API endpoint Publishing an endpoint [¶](https://www.tinybird.co/docs/about:blank#id1) curl -X POST \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/pipes/:pipe/nodes/:node/endpoint" Successful response [¶](https://www.tinybird.co/docs/about:blank#id2) { "id": "t_60d8f84ce5d349b28160013ce99758c7", "name": "my_pipe", "description": "this is my pipe description", "nodes": [{ "id": "t_bd1e095da943494d9410a812b24cea81", "name": "get_all", "sql": "SELECT * FROM my_datasource", "description": "This is a description for the **first** node", "materialized": null, "cluster": null, "dependencies": [ "my_datasource" ], "tags": {}, "created_at": "2019-09-03 19:56:03.704840", "updated_at": "2019-09-04 07:05:53.191437", "version": 0, "project": null, "result": null, "ignore_sql_errors": false }], "endpoint": "t_bd1e095da943494d9410a812b24cea81", "created_at": "2019-09-03 19:56:03.193446", "updated_at": "2019-09-10 07:18:39.797083", "parent": null } The response will contain a `token` if there’s a **unique READ token** for this pipe. You could use this token to share your endpoint. | Code | Description | | --- | --- | | 200 | No error | | 400 | Wrong node id | | 403 | Forbidden. 
Provided token doesn’t have permissions to publish a pipe, it needs `ADMIN` or `PIPE:CREATE` | | 404 | Pipe not found | DELETE /v0/pipes/(.+)/nodes/(.+)/endpoint [¶](https://www.tinybird.co/docs/about:blank#delete--v0-pipes-(.+)-nodes-(.+)-endpoint) Unpublishes an API endpoint Unpublishing an endpoint [¶](https://www.tinybird.co/docs/about:blank#id4) curl -X DELETE \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/pipes/:pipe/nodes/:node/endpoint"| Code | Description | | --- | --- | | 200 | No error | | 400 | Wrong node id | | 403 | Forbidden. Provided token doesn’t have permissions to publish a pipe, it needs `ADMIN` or `PIPE:CREATE` | | 404 | Pipe not found | --- URL: https://www.tinybird.co/docs/api-reference/pipe-api/copy-pipes-api Last update: 2024-07-31T16:54:46.000Z Content: --- title: "Pipes API > Copy Pipes Reference · Tinybird Docs" theme-color: "#171612" description: "The Pipes API enables you to manage your Pipes. Use the Copy Pipes service to create, delete, schedule, and trigger Copy jobs." --- POST /v0/pipes/(.+)/nodes/(.+)/copy [¶](https://www.tinybird.co/docs/about:blank#post--v0-pipes-(.+)-nodes-(.+)-copy) Calling this endpoint sets the pipe as a Copy one with the given settings. Scheduling is optional. To run the actual copy after you set the pipe as a Copy one, you must call the POST `/v0/pipes/:pipe/copy` endpoint. If you need to change the target Data Source or the scheduling configuration, you can call PUT endpoint. Restrictions: - You can set only one schedule per Copy pipe. - You can’t set a Copy pipe if the pipe is already materializing. You must unlink the Materialization first. - You can’t set a Copy pipe if the pipe is already an endpoint. You must unpublish the endpoint first. Setting the pipe as a Copy pipe [¶](https://www.tinybird.co/docs/about:blank#id1) curl -X POST \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/pipes/:pipe/nodes/:node/copy" \ -d "target_datasource=my_destination_datasource" \ -d "schedule_cron=*/15 * * * *"| Key | Type | Description | | --- | --- | --- | | token | String | Auth token. Ensure it has the `PIPE:CREATE` and `DATASOURCE:APPEND` scopes on it | | target_datasource | String | Name or the id of the target Data Source. | | schedule_cron | String | Optional. A crontab expression. | Successful response [¶](https://www.tinybird.co/docs/about:blank#id3) { "id": "t_3aa11a5cabd1482c905bc8dfc551a84d", "name": "my_copy_pipe", "description": "This is a pipe to copy", "type": "copy", "endpoint": null, "created_at": "2023-03-01 10:14:04.497505", "updated_at": "2023-03-01 10:34:19.113518", "parent": null, "copy_node": "t_33ec8ac3c3324a53822fded61a83dbbd", "copy_target_datasource": "t_0be6161a5b7b4f6180b10325643e0b7b", "copy_target_workspace": "5a70f2f5-9635-47bf-96a9-7b50362d4e2f", "nodes": [{ "id": "t_33ec8ac3c3324a53822fded61a83dbbd", "name": "emps", "sql": "SELECT * FROM employees WHERE starting_date > '2016-01-01 00:00:00'", "description": null, "materialized": null, "cluster": null, "mode": "append", "tags": { "copy_target_datasource": "t_0be6161a5b7b4f6180b10325643e0b7b", "copy_target_workspace": "5a70f2f5-9635-47bf-96a9-7b50362d4e2f" }, "created_at": "2023-03-01 10:14:04.497547", "updated_at": "2023-03-01 10:14:04.497547", "version": 0, "project": null, "result": null, "ignore_sql_errors": false, "dependencies": [ "employees" ], "params": [] }] } DELETE /v0/pipes/(.+)/nodes/(.+)/copy [¶](https://www.tinybird.co/docs/about:blank#delete--v0-pipes-(.+)-nodes-(.+)-copy) Removes the Copy type of the pipe. 
Removing the Copy type deletes neither the node nor the pipe. The pipe remains, but its schedule and copy settings no longer apply. Unsetting the pipe as a Copy pipe [¶](https://www.tinybird.co/docs/about:blank#id4) curl -X DELETE \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/pipes/:pipe/nodes/:node/copy"| Code | Description | | --- | --- | | 204 | No error | | 400 | Wrong node id | | 403 | Forbidden. Provided token doesn’t have permissions to publish a pipe, it needs `ADMIN` or `PIPE:CREATE` | | 404 | Pipe not found | PUT /v0/pipes/(.+)/nodes/(.+)/copy [¶](https://www.tinybird.co/docs/about:blank#put--v0-pipes-(.+)-nodes-(.+)-copy) Calling this endpoint updates a Copy pipe with the given settings: you can change its target Data Source, as well as add or modify its schedule. Updating a Copy Pipe [¶](https://www.tinybird.co/docs/about:blank#id6) curl -X PUT \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/pipes/:pipe/nodes/:node/copy" \ -d "target_datasource=other_destination_datasource" \ -d "schedule_cron=*/15 * * * *"| Key | Type | Description | | --- | --- | --- | | token | String | Auth token. Ensure it has the `PIPE:CREATE` scope on it | | target_datasource | String | Optional. Name or the id of the target Data Source. | | schedule_cron | String | Optional. A crontab expression. If `schedule_cron='None'` , the schedule, if previously defined, is removed from the Copy pipe | Successful response [¶](https://www.tinybird.co/docs/about:blank#id8) { "id": "t_3aa11a5cabd1482c905bc8dfc551a84d", "name": "my_copy_pipe", "description": "This is a pipe to copy", "type": "copy", "endpoint": null, "created_at": "2023-03-01 10:14:04.497505", "updated_at": "2023-03-01 10:34:19.113518", "parent": null, "copy_node": "t_33ec8ac3c3324a53822fded61a83dbbd", "copy_target_datasource": "t_2f046a4b2cc44137834a35420a533465", "copy_target_workspace": "5a70f2f5-9635-47bf-96a9-7b50362d4e2f", "nodes": [{ "id": "t_33ec8ac3c3324a53822fded61a83dbbd", "name": "emps", "sql": "SELECT * FROM employees WHERE starting_date > '2016-01-01 00:00:00'", "description": null, "materialized": null, "cluster": null, "mode": "append", "tags": { "copy_target_datasource": "t_2f046a4b2cc44137834a35420a533465", "copy_target_workspace": "5a70f2f5-9635-47bf-96a9-7b50362d4e2f" }, "created_at": "2023-03-01 10:14:04.497547", "updated_at": "2023-03-07 09:08:34.206123", "version": 0, "project": null, "result": null, "ignore_sql_errors": false, "dependencies": [ "employees" ], "params": [] }] } POST /v0/pipes/(.+)/copy [¶](https://www.tinybird.co/docs/about:blank#post--v0-pipes-(.+)-copy) Runs a copy job using the settings previously set in the pipe. You can use this URL for an on-demand copy; the scheduler also uses it for the programmed calls. The URL accepts parameters, just like a regular endpoint. This operation is asynchronous and copies the output of the endpoint to an existing Data Source. Runs a copy job on a Copy pipe [¶](https://www.tinybird.co/docs/about:blank#id9) curl -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/pipes/:pipe/copy?param1=test&param2=test2"| Key | Type | Description | | --- | --- | --- | | token | String | Auth token. Ensure it has the `PIPE:READ` scope on it | | parameters | String | Optional. The value of the parameters to run the Copy with. They are regular URL query parameters. | | _mode | String | Optional. One of `append` or `replace` . Default is `append` .
| | Code | Description | | --- | --- | | 200 | No error | | 400 | Pipe is not a Copy pipe or there is a problem with the SQL query | | 400 | The columns in the SQL query don’t match the columns in the target Data Source | | 403 | Forbidden. The provided token doesn’t have permissions to append a node to the pipe ( `ADMIN` or `PIPE:READ` and `DATASOURCE:APPEND` ) | | 403 | Job limits exceeded. Tried to copy more than 100M rows, or there are too many active (working and waiting) Copy jobs. | | 404 | Pipe not found, Node not found or Target Data Source not found | The response will not be the final result of the copy but a Job. You can check the job status and progress using the [Jobs API](https://www.tinybird.co/docs/docs/api-reference/jobs-api). Successful response [¶](https://www.tinybird.co/docs/about:blank#id12) { "id": "t_33ec8ac3c3324a53822fded61a83dbbd", "name": "emps", "sql": "SELECT * FROM employees WHERE starting_date > '2016-01-01 00:00:00'", "description": null, "materialized": null, "cluster": null, "tags": { "copy_target_datasource": "t_0be6161a5b7b4f6180b10325643e0b7b", "copy_target_workspace": "5a70f2f5-9635-47bf-96a9-7b50362d4e2f" }, "created_at": "2023-03-01 10:14:04.497547", "updated_at": "2023-03-01 10:14:04.497547", "version": 0, "project": null, "result": null, "ignore_sql_errors": false, "dependencies": [ "employees" ], "params": [], "job": { "kind": "copy", "id": "f0b2f107-0af8-4c28-a83b-53053cb45f0f", "job_id": "f0b2f107-0af8-4c28-a83b-53053cb45f0f", "status": "waiting", "created_at": "2023-03-01 10:41:07.398102", "updated_at": "2023-03-01 10:41:07.398128", "started_at": null, "is_cancellable": true, "datasource": { "id": "t_0be6161a5b7b4f6180b10325643e0b7b" }, "query_id": "19a8d613-b424-4afd-95f1-39cfbd87e827", "query_sql": "SELECT * FROM d_b0ca70.t_25f928e33bcb40bd8e8999e69cb02f94 AS employees WHERE starting_date > '2016-01-01 00:00:00'", "pipe_id": "t_3aa11a5cabd1482c905bc8dfc551a84d", "pipe_name": "copy_emp", "job_url": "https://api.tinybird.co/v0/jobs/f0b2f107-0af8-4c28-a83b-53053cb45f0f" } } POST /v0/pipes/(.+)/copy/pause [¶](https://www.tinybird.co/docs/about:blank#post--v0-pipes-(.+)-copy-pause) Pauses the scheduling. This affects any future scheduled Copy job. Any copy operation currently copying data will be completed. Pauses a scheduled copy [¶](https://www.tinybird.co/docs/about:blank#id13) curl -X POST \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/pipes/:pipe/copy/pause"| Code | Description | | --- | --- | | 200 | Scheduled copy paused correctly | | 400 | Pipe is not copy | | 404 | Pipe not found, Scheduled copy for pipe not found | POST /v0/pipes/(.+)/copy/resume [¶](https://www.tinybird.co/docs/about:blank#post--v0-pipes-(.+)-copy-resume) Resumes a previously paused scheduled copy. Resumes a Scheduled copy [¶](https://www.tinybird.co/docs/about:blank#id15) curl -X POST \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/pipes/:pipe/copy/resume"| Code | Description | | --- | --- | | 200 | Scheduled copy resumed correctly | | 400 | Pipe is not copy | | 404 | Pipe not found, Scheduled copy for pipe not found | POST /v0/pipes/(.+)/copy/cancel [¶](https://www.tinybird.co/docs/about:blank#post--v0-pipes-(.+)-copy-cancel) Cancels jobs that are working or waiting that are tied to the pipe and pauses the scheduling of copy jobs for this pipe. To allow scheduled copy jobs to run for the pipe, the copy pipe must be resumed and the already cancelled jobs will not be resumed. 
Cancels scheduled copy jobs tied to the pipe [¶](https://www.tinybird.co/docs/about:blank#id17) curl -X POST \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/pipes/:pipe/copy/cancel"| Code | Description | | --- | --- | | 200 | Scheduled copy pipe cancelled correctly | | 400 | Pipe is not copy | | 400 | Job is not in cancellable status | | 400 | Job is already being cancelled | | 404 | Pipe not found, Scheduled copy for pipe not found | Successful response [¶](https://www.tinybird.co/docs/about:blank#id19) { "id": "t_fb56a87a520441189a5a6d61f8d968f4", "name": "scheduled_copy_pipe", "description": "none", "endpoint": "none", "created_at": "2023-06-09 10:54:21.847433", "updated_at": "2023-06-09 10:54:21.897854", "parent": "none", "type": "copy", "copy_node": "t_bb96e50cb1b94ffe9e598f870d88ad1b", "copy_target_datasource": "t_3f7e6534733f425fb1add6229ca8be4b", "copy_target_workspace": "8119d519-80b2-454a-a114-b092aea3b9b0", "schedule": { "timezone": "Etc/UTC", "cron": "0 * * * *", "status": "paused" }, "nodes": [ { "id": "t_bb96e50cb1b94ffe9e598f870d88ad1b", "name": "scheduled_copy_pipe_0", "sql": "SELECT * FROM landing_ds", "description": "none", "materialized": "none", "cluster": "none", "tags": { "copy_target_datasource": "t_3f7e6534733f425fb1add6229ca8be4b", "copy_target_workspace": "8119d519-80b2-454a-a114-b092aea3b9b0" }, "created_at": "2023-06-09 10:54:21.847451", "updated_at": "2023-06-09 10:54:21.847451", "version": 0, "project": "none", "result": "none", "ignore_sql_errors": "false", "node_type": "copy", "dependencies": [ "landing_ds" ], "params": [] } ], "cancelled_jobs": [ { "id": "ced3534f-8b5e-4fe0-8dcc-4369aa256b11", "kind": "copy", "status": "cancelled", "created_at": "2023-06-09 07:54:21.921446", "updated_at": "2023-06-09 10:54:22.043272", "job_url": "https://api.tinybird.co/v0/jobsjobs/ced3534f-8b5e-4fe0-8dcc-4369aa256b11", "is_cancellable": "false", "pipe_id": "t_fb56a87a520441189a5a6d61f8d968f4", "pipe_name": "pipe_test_scheduled_copy_pipe_cancel_multiple_jobs", "datasource": { "id": "t_3f7e6534733f425fb1add6229ca8be4b", "name": "target_ds_test_scheduled_copy_pipe_cancel_multiple_jobs" } }, { "id": "b507ded9-9862-43ae-8863-b6de17c3f914", "kind": "copy", "status": "cancelling", "created_at": "2023-06-09 07:54:21.903036", "updated_at": "2023-06-09 10:54:22.044837", "job_url": "https://api.tinybird.co/v0/jobsb507ded9-9862-43ae-8863-b6de17c3f914", "is_cancellable": "false", "pipe_id": "t_fb56a87a520441189a5a6d61f8d968f4", "pipe_name": "pipe_test_scheduled_copy_pipe_cancel_multiple_jobs", "datasource": { "id": "t_3f7e6534733f425fb1add6229ca8be4b", "name": "target_ds_test_scheduled_copy_pipe_cancel_multiple_jobs" } } ] } --- URL: https://www.tinybird.co/docs/api-reference/pipe-api/materialized-views Last update: 2024-07-31T16:54:46.000Z Content: --- title: "Pipes API > Materialized Views and Populates Reference · Tinybird Docs" theme-color: "#171612" description: "The Pipes API enables you to manage your Pipes. Use the Materialized Views and Populates service to create, delete, or populate Materialized Views." --- POST /v0/pipes/(.+)/nodes/(.+)/population [¶](https://www.tinybird.co/docs/about:blank#post--v0-pipes-(.+)-nodes-(.+)-population) Populates a Materialized View Populating a Materialized View [¶](https://www.tinybird.co/docs/about:blank#id1) curl -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/pipes/:pipe/nodes/:node/population" \ -d "populate_condition=toYYYYMM(date) = 202203" The response will not be the final result of the import but a Job. 
You can check the job status and progress using the [Jobs API](https://www.tinybird.co/docs/docs/api-reference/jobs-api). Alternatively you can use a query like this to check the operations related to the populate Job: Check populate jobs in the datasources_ops_log including dependent Materialized Views triggered [¶](https://www.tinybird.co/docs/about:blank#id2) SELECT * FROM tinybird. datasources_ops_log WHERE timestamp > now () - INTERVAL 1 DAY AND operation_id IN ( SELECT operation_id FROM tinybird. datasources_ops_log WHERE timestamp > now () - INTERVAL 1 DAY and datasource_id = '{the_datasource_id}' and job_id = '{the_job_id}' ) ORDER BY timestamp ASC When a populate job fails for the first time, the Materialized View is automatically unlinked. In that case you can get failed population jobs and their errors to fix them with a query like this: Check failed populate jobs [¶](https://www.tinybird.co/docs/about:blank#id3) SELECT * FROM tinybird. datasources_ops_log WHERE datasource_id = '{the_datasource_id}' AND pipe_name = '{the_pipe_name}' AND event_type LIKE 'populateview%' AND result = 'error' ORDER BY timestamp ASC Alternatively you can use the `unlink_on_populate_error='true'` flag to always unlink the Materialized View if the populate job does not work as expected. | Key | Type | Description | | --- | --- | --- | | token | String | Auth token. Ensure it has the `PIPE:CREATE` scope on it | | populate_subset | Float | Optional. Populate with a subset percent of the data (limited to a maximum of 2M rows), this is useful to quickly test a materialized node with some data. The subset must be greater than 0 and lower than 0.1. A subset of 0.1 means a 10 percent of the data in the source Data Source will be used to populate the Materialized View. It has precedence over `populate_condition` | | populate_condition | String | Optional. Populate with a SQL condition to be applied to the trigger Data Source of the Materialized View. For instance, `populate_condition='date == toYYYYMM(now())'` it’ll populate taking all the rows from the trigger Data Source which `date` is the current month. `populate_condition` is not taken into account if the `populate_subset` param is present. Including in the `populate_condition` any column present in the Data Source `engine_sorting_key` will make the populate job process less data. | | truncate | String | Optional. Default is `false` . Populates over existing data, useful to populate past data while new data is being ingested. Use `true` to truncate the Data Source before populating. | | unlink_on_populate_error | String | Optional. Default is `false` . If the populate job fails the Materialized View is unlinked and new data won’t be ingested in the Materialized View. | | Code | Description | | --- | --- | | 200 | No error | | 400 | Node is not materialized | | 403 | Forbidden. Provided token doesn’t have permissions to append a node to the pipe, it needs `ADMIN` or `PIPE:CREATE` | | 404 | Pipe not found, Node not found | POST /v0/pipes/(.+)/nodes/(.+)/materialization [¶](https://www.tinybird.co/docs/about:blank#post--v0-pipes-(.+)-nodes-(.+)-materialization) Creates a Materialized View Creating a Materialized View [¶](https://www.tinybird.co/docs/about:blank#id6) curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/pipes/:pipe/nodes/:node/materialization?datasource=my_data_source_name&populate=true"| Key | Type | Description | | --- | --- | --- | | token | String | Auth token. 
Ensure it has the `PIPE:CREATE` scope on it | | datasource | String | Required. Specifies the name of the destination Data Source where the Materialized View schema is defined. If the Data Source does not exist, it's created automatically with the default settings. | | override_datasource | Boolean | Optional. Default `false` . When the target Data Source of the Materialized View already exists in the Workspace, it's overridden by the `datasource` specified in the request. | | populate | Boolean | Optional. Default `false` . When `true` , a job is triggered to populate the destination Data Source. | | populate_subset | Float | Optional. Populate with a subset percent of the data (limited to a maximum of 2M rows); this is useful to quickly test a materialized node with some data. The subset must be greater than 0 and lower than 0.1. A subset of 0.1 means 10 percent of the data in the source Data Source will be used to populate the Materialized View. Use it together with `populate=true` ; it takes precedence over `populate_condition` | | populate_condition | String | Optional. Populate with a SQL condition to be applied to the trigger Data Source of the Materialized View. For instance, `populate_condition='date == toYYYYMM(now())'` populates using all the rows from the trigger Data Source whose `date` is in the current month. Use it together with `populate=true` . `populate_condition` is not taken into account if the `populate_subset` param is present. Including in the `populate_condition` any column present in the Data Source `engine_sorting_key` will make the populate job process less data. | | unlink_on_populate_error | String | Optional. Default is `false` . If the populate job fails, the Materialized View is unlinked and new data won't be ingested into the Materialized View. | | engine | String | Optional. Engine for the destination Materialized View. If the Data Source already exists, the settings are not overridden. | | engine_* | String | Optional. Engine parameters and options. Requires the `engine` parameter. If the Data Source already exists, the settings are not overridden.[ Check Engine Parameters and Options for more details](https://www.tinybird.co/docs/docs/api-reference/datasource-api) | The SQL query for the materialized node must be sent in the body, encoded in UTF-8 | Code | Description | | --- | --- | | 200 | No error | | 400 | Node already being materialized | | 403 | Forbidden. Provided token doesn’t have permissions to append a node to the pipe, it needs `ADMIN` or `PIPE:CREATE` | | 404 | Pipe not found, Node not found | | 409 | The Materialized View already exists or `override_datasource` cannot be performed | DELETE /v0/pipes/(.+)/nodes/(.+)/materialization [¶](https://www.tinybird.co/docs/about:blank#delete--v0-pipes-(.+)-nodes-(.+)-materialization) Removes a Materialized View. Removing a Materialized View deletes neither the Data Source nor the Node. The Data Source will still be present, but it will stop receiving data from the Node. Removing a Materialized View [¶](https://www.tinybird.co/docs/about:blank#id9) curl -H "Authorization: Bearer " \ -X DELETE "https://api.tinybird.co/v0/pipes/:pipe/nodes/:node/materialization"| Code | Description | | --- | --- | | 204 | No error, Materialized View removed | | 403 | Forbidden.
Provided token doesn’t have permissions to append a node to the pipe, it needs `ADMIN` or `PIPE:CREATE` | | 404 | Pipe not found, Node not found | --- URL: https://www.tinybird.co/docs/api-reference/pipe-api/overview Last update: 2024-10-28T11:06:14.000Z Content: --- title: "Pipes API Reference · Tinybird Docs" theme-color: "#171612" description: "The Pipe API enables you to manage your Pipes. With Pipes you can transform data via SQL queries and publish the results of those queries as API Endpoints." --- GET /v0/pipes/? [¶](https://www.tinybird.co/docs/about:blank#get--v0-pipes-?) Get a list of pipes in your account. getting a list of your pipes [¶](https://www.tinybird.co/docs/about:blank#id1) curl -X GET \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/pipes" Pipes in the response will be the ones that are accessible using a particular token with read permissions for them. Successful response [¶](https://www.tinybird.co/docs/about:blank#id2) { "pipes": [{ "id": "t_55c39255e6b548dd98cb6da4b7d62c1c", "name": "my_pipe", "description": "This is a description", "endpoint": "t_h65c788b42ce4095a4789c0d6b0156c3", "created_at": "2022-11-10 12:39:38.106380", "updated_at": "2022-11-29 13:33:40.850186", "parent": null, "nodes": [{ "id": "t_h65c788b42ce4095a4789c0d6b0156c3", "name": "my_node", "sql": "SELECT col_a, col_b FROM my_data_source", "description": null, "materialized": null, "cluster": null, "tags": {}, "created_at": "2022-11-10 12:39:47.852303", "updated_at": "2022-11-10 12:46:54.066133", "version": 0, "project": null, "result": null, "ignore_sql_errors": false "node_type": "default" }], "url": "https://api.tinybird.co/v0/pipes/my_pipe.json" }] }| Key | Type | Description | | --- | --- | --- | | dependencies | boolean | The response will include the nodes dependent data sources and pipes, default is `false` | | attrs | String | comma separated list of the pipe attributes to return in the response. Example: `attrs=name,description` | | node_attrs | String | comma separated list of the node attributes to return in the response. Example `node_attrs=id,name` | Pipes id’s are immutable so you can always refer to them in your 3rd party applications to make them compatible with Pipes once they are renamed. For lighter JSON responses consider using the `attrs` and `node_attrs` params to return exactly the attributes you need to consume. POST /v0/pipes/? [¶](https://www.tinybird.co/docs/about:blank#post--v0-pipes-?) Creates a new Pipe. There are 3 ways to create a Pipe Creating a Pipe providing full JSON [¶](https://www.tinybird.co/docs/about:blank#id4) curl -X POST \ -H "Authorization: Bearer " \ -H "Content-Type: application/json" \ "https://api.tinybird.co/v0/pipes" \ -d '{ "name":"pipe_name", "description": "my first pipe", "nodes": [ {"sql": "select * from my_datasource limit 10", "name": "node_00", "description": "sampled data" }, {"sql": "select count() from node_00", "name": "node_01" } ] }' If you prefer to create the minimum Pipe, and then append your transformation nodes you can set your name and first transformation node’s SQL in your POST request Creating a pipe with a name and a SQL query [¶](https://www.tinybird.co/docs/about:blank#id5) curl -X POST \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/pipes?name=pipename&sql=select%20*%20from%20events" Pipes can be also created as copies of other Pipes. 
Just use the `from` argument: Creating a pipe from another pipe [¶](https://www.tinybird.co/docs/about:blank#id6) curl -X POST \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/pipes?name=pipename&from=src_pipe" Bear in mind, if you use this method to overwrite an existing Pipe, the endpoint will only be maintained if the node name is the same. POST /v0/pipes/(.+)/nodes [¶](https://www.tinybird.co/docs/about:blank#post--v0-pipes-(.+)-nodes) Appends a new node to a Pipe. adding a new node to a pipe [¶](https://www.tinybird.co/docs/about:blank#id7) curl \ -H "Authorization: Bearer " \ -d 'select * from node_0' "https://api.tinybird.co/v0/pipes/:name/nodes?name=node_name&description=explanation" Appends a new node that creates a Materialized View adding a Materialized View using a materialized node [¶](https://www.tinybird.co/docs/about:blank#id8) curl \ -H "Authorization: Bearer " \ -d 'select id, sum(amount) as amount, date from my_datasource' "https://api.tinybird.co/v0/pipes/:name/nodes?name=node_name&description=explanation&type=materialized&datasource=new_datasource&engine=AggregatingMergeTree"| Key | Type | Description | | --- | --- | --- | | name | String | The referenceable name for the node. | | description | String | Use it to store a more detailed explanation of the node. | | token | String | Auth token. Ensure it has the `PIPE:CREATE` scope on it | | type | String | Optional. Available options are { `standard` (default), `materialized` , `endpoint` }. Use `materialized` to create a Materialized View from your new node. | | datasource | String | Required with `type=materialized` . Specifies the name of the destination Data Source where the Materialized View schema is defined. | | override_datasource | Boolean | Optional. Default `false` When the target Data Source of the Materialized View exists in the Workspace it’ll be overriden by the `datasource` specified in the request. | | populate | Boolean | Optional. Default `false` . When `true` , a job is triggered to populate the destination Data Source. | | populate_subset | Float | Optional. Populate with a subset percent of the data (limited to a maximum of 2M rows), this is useful to quickly test a materialized node with some data. The subset must be greater than 0 and lower than 0.1. A subset of 0.1 means a 10 percent of the data in the source Data Source will be used to populate the Materialized View. Use it together with `populate=true` , it has precedence over `populate_condition` | | populate_condition | String | Optional. Populate with a SQL condition to be applied to the trigger Data Source of the Materialized View. For instance, `populate_condition='date == toYYYYMM(now())'` it’ll populate taking all the rows from the trigger Data Source which `date` is the current month. Use it together with `populate=true` . `populate_condition` is not taken into account if the `populate_subset` param is present. Including in the `populate_condition` any column present in the Data Source `engine_sorting_key` will make the populate job process less data. | | unlink_on_populate_error | String | Optional. Default is `false` . If the populate job fails the Materialized View is unlinked and new data won’t be ingested in the Materialized View. | | engine | String | Optional. Engine for destination Materialized View. Requires the `type` parameter as `materialized` . | | engine_* | String | Optional. Engine parameters and options. 
Requires the `type` parameter as `materialized` and the `engine` parameter.[ Check Engine Parameters and Options for more details](https://www.tinybird.co/docs/docs/api-reference/datasource-api) | The SQL query for the transformation node must be sent in the body, encoded in UTF-8 | Code | Description | | --- | --- | | 200 | No error | | 400 | Empty or wrong SQL or API param value | | 403 | Forbidden. Provided token doesn’t have permissions to append a node to the pipe, it needs `ADMIN` or `PIPE:CREATE` | | 404 | Pipe not found | | 409 | There’s another resource with the same name, names must be unique | The Materialized View already exists | `override_datasource` cannot be performed | DELETE /v0/pipes/(.+)/nodes/(.+) [¶](https://www.tinybird.co/docs/about:blank#delete--v0-pipes-(.+)-nodes-(.+)) Drops a particular transformation node in the Pipe. It does not remove related nodes, so this could leave the Pipe in an inconsistent state. For security reasons, enabled nodes can’t be removed. removing a node from a pipe [¶](https://www.tinybird.co/docs/about:blank#id11) curl -X DELETE "https://api.tinybird.co/v0/pipes/:name/nodes/:node_id"| Code | Description | | --- | --- | | 204 | No error, removed node | | 400 | The node is published. Published nodes can’t be removed | | 403 | Forbidden. Provided token doesn’t have permissions to change the last node of the pipe, it needs ADMIN or IMPORT | | 404 | Pipe not found | PUT /v0/pipes/(.+)/nodes/(.+) [¶](https://www.tinybird.co/docs/about:blank#put--v0-pipes-(.+)-nodes-(.+)) Changes a particular transformation node in the Pipe. Editing a Pipe’s transformation node [¶](https://www.tinybird.co/docs/about:blank#id13) curl -X PUT \ -H "Authorization: Bearer " \ -d 'select * from node_0' "https://api.tinybird.co/v0/pipes/:name/nodes/:node_id?name=new_name&description=updated_explanation"| Key | Type | Description | | --- | --- | --- | | name | String | New name for the node | | description | String | New description for the node | | token | String | Auth token. Ensure it has the `PIPE:CREATE` scope on it | Note that the desired SQL query should be sent in the body, encoded in UTF-8. | Code | Description | | --- | --- | | 200 | No error | | 400 | Empty or wrong SQL | | 403 | Forbidden. Provided token doesn’t have permissions to change the last node of the pipe, it needs `ADMIN` or `PIPE:CREATE` | | 404 | Pipe not found | | 409 | There’s another resource with the same name, names must be unique | GET /v0/pipes/(.+)\.(json|csv|ndjson|parquet) [¶](https://www.tinybird.co/docs/about:blank#get--v0-pipes-(.+)%5C.(json%7Ccsv%7Cndjson%7Cparquet)) Returns the published node data in a particular format. Getting data for a pipe [¶](https://www.tinybird.co/docs/about:blank#pipe-get-data) curl -X GET \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/pipes/:name.format"| Key | Type | Description | | --- | --- | --- | | q | String | Optional. Query to execute, see the API Query endpoint | | output_format_json_quote_64bit_integers | int | (Optional) Controls quoting of 64-bit or bigger integers (like UInt64 or Int128) when they are output in a JSON format. Such integers are enclosed in quotes by default. This behavior is compatible with most JavaScript implementations. Possible values: 0 — Integers are output without quotes. 1 — Integers are enclosed in quotes.
Default value is 0 | | output_format_json_quote_denormals | int | (Optional) Controls representation of inf and nan on the UI instead of null e.g when dividing by 0 - inf and when there is no representation of a number in Javascript - nan. Possible values: 0 - disabled, 1 - enabled. Default value is 0 | | output_format_parquet_string_as_string | int | (Optional) Use Parquet String type instead of Binary for String columns. Possible values: 0 - disabled, 1 - enabled. Default value is 0 | The `q` parameter is a SQL query (see [Query API](https://www.tinybird.co/docs/docs/api-reference/query-api) ). When using this endpoint to query your Pipes, you can use the `_` shortcut, which refers to your Pipe name | format | Description | | --- | --- | | csv | CSV with header | | json | JSON including data, statistics and schema information | | ndjson | One JSON object per each row | | parquet | A Parquet file. Some libraries might not properly process `UInt*` data types, if that’s your case cast those columns to signed integers with `toInt*` functions. `String` columns are exported as `Binary` , take that into account when reading the resulting Parquet file, most libraries convert from Binary to String (e.g. Spark has this configuration param: `spark.sql.parquet.binaryAsString` ) | POST /v0/pipes/(.+)\.(json|csv|ndjson|parquet) [¶](https://www.tinybird.co/docs/about:blank#post--v0-pipes-(.+)%5C.(json%7Ccsv%7Cndjson%7Cparquet)) Returns the published node data in a particular format, passing the parameters in the request body. Use this endpoint when the query is too long to be passed as a query string parameter. When using the post endpoint, there are no traces of the parameters in the pipe_stats_rt Data Source. See the get endpoint for more information. GET /v0/pipes/(.+\.pipe) [¶](https://www.tinybird.co/docs/about:blank#get--v0-pipes-(.+%5C.pipe)) Get pipe information. Provided Auth Token must have read access to the Pipe. Getting information about a particular pipe [¶](https://www.tinybird.co/docs/about:blank#id16) curl -X GET \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/pipes/:name" `pipe_id` and `pipe_name` are two ways to refer to the pipe in SQL queries and API endpoints the only difference is `pipe_id` never changes so it’ll work even if you change the `pipe_name` (which is the name used to display the pipe). In general you can use `pipe_id` or `pipe_name` indistinctly: Successful response [¶](https://www.tinybird.co/docs/about:blank#id17) { "id": "t_bd1c62b5e67142bd9bf9a7f113a2b6ea", "name": "events_pipe", "pipeline": { "nodes": [{ "name": "events_ds_0" "sql": "select * from events_ds_log__raw", "materialized": false }, { "name": "events_ds", "sql": "select * from events_ds_0 where valid = 1", "materialized": false }] } } You can make your Pipe’s id more descriptive by prepending information such as `t_my_events_table.bd1c62b5e67142bd9bf9a7f113a2b6ea` DELETE /v0/pipes/(.+\.pipe) [¶](https://www.tinybird.co/docs/about:blank#delete--v0-pipes-(.+%5C.pipe)) Drops a Pipe from your account. Auth token in use must have the `DROP:NAME` scope. Dropping a pipe [¶](https://www.tinybird.co/docs/about:blank#id18) curl -X DELETE \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/pipes/:name" PUT /v0/pipes/(.+\.pipe) [¶](https://www.tinybird.co/docs/about:blank#put--v0-pipes-(.+%5C.pipe)) Changes Pipe’s metadata. When there is another Pipe with the same name an error is raised. 
editing a pipe [¶](https://www.tinybird.co/docs/about:blank#id19) curl -X PUT \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/pipes/:name?name=new_name"| Key | Type | Description | | --- | --- | --- | | name | String | new name for the pipe | | description | String | new Markdown description for the pipe | | token | String | Auth token. Ensure it has the `PIPE:CREATE` scope on it | GET /v0/pipes/(.+) [¶](https://www.tinybird.co/docs/about:blank#get--v0-pipes-(.+)) Get pipe information. Provided Auth Token must have read access to the Pipe. Getting information about a particular pipe [¶](https://www.tinybird.co/docs/about:blank#id21) curl -X GET \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/pipes/:name" `pipe_id` and `pipe_name` are two ways to refer to the pipe in SQL queries and API endpoints the only difference is `pipe_id` never changes so it’ll work even if you change the `pipe_name` (which is the name used to display the pipe). In general you can use `pipe_id` or `pipe_name` indistinctly: Successful response [¶](https://www.tinybird.co/docs/about:blank#id22) { "id": "t_bd1c62b5e67142bd9bf9a7f113a2b6ea", "name": "events_pipe", "pipeline": { "nodes": [{ "name": "events_ds_0" "sql": "select * from events_ds_log__raw", "materialized": false }, { "name": "events_ds", "sql": "select * from events_ds_0 where valid = 1", "materialized": false }] } } You can make your Pipe’s id more descriptive by prepending information such as `t_my_events_table.bd1c62b5e67142bd9bf9a7f113a2b6ea` DELETE /v0/pipes/(.+) [¶](https://www.tinybird.co/docs/about:blank#delete--v0-pipes-(.+)) Drops a Pipe from your account. Auth token in use must have the `DROP:NAME` scope. Dropping a pipe [¶](https://www.tinybird.co/docs/about:blank#id23) curl -X DELETE \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/pipes/:name" PUT /v0/pipes/(.+) [¶](https://www.tinybird.co/docs/about:blank#put--v0-pipes-(.+)) Changes Pipe’s metadata. When there is another Pipe with the same name an error is raised. editing a pipe [¶](https://www.tinybird.co/docs/about:blank#id24) curl -X PUT \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/pipes/:name?name=new_name"| Key | Type | Description | | --- | --- | --- | | name | String | new name for the pipe | | description | String | new Markdown description for the pipe | | token | String | Auth token. Ensure it has the `PIPE:CREATE` scope on it | --- URL: https://www.tinybird.co/docs/api-reference/query-api Last update: 2024-10-28T11:06:14.000Z Content: --- title: "Query API Reference · Tinybird Docs" theme-color: "#171612" description: "The Query API allows you to query your Pipes inside Tinybird as if you were running SQL statements against a regular database." --- GET /v0/sql [¶](https://www.tinybird.co/docs/about:blank#get--v0-sql) Executes a SQL query using the engine. Running sql queries against your data [¶](https://www.tinybird.co/docs/about:blank#id1) curl --data "SELECT * FROM " https://api.tinybird.co/v0/sql As a response, it gives you the query metadata, the resulting data and some performance statistics. 
Successful response [¶](https://www.tinybird.co/docs/about:blank#id2) { "meta": [ { "name": "VendorID", "type": "Int32" }, { "name": "tpep_pickup_datetime", "type": "DateTime" } ], "data": [ { "VendorID": 2, "tpep_pickup_datetime": "2001-01-05 11:45:23", "tpep_dropoff_datetime": "2001-01-05 11:52:05", "passenger_count": 5, "trip_distance": 1.53, "RatecodeID": 1, "store_and_fwd_flag": "N", "PULocationID": 71, "DOLocationID": 89, "payment_type": 2, "fare_amount": 7.5, "extra": 0.5, "mta_tax": 0.5, "tip_amount": 0, "tolls_amount": 0, "improvement_surcharge": 0.3, "total_amount": 8.8 }, { "VendorID": 2, "tpep_pickup_datetime": "2002-12-31 23:01:55", "tpep_dropoff_datetime": "2003-01-01 14:59:11" } ], "rows": 3, "rows_before_limit_at_least": 4, "statistics": { "elapsed": 0.00091042, "rows_read": 4, "bytes_read": 296 } } Data can be fetched in different formats. Just append `FORMAT ` to your SQL query: Requesting different formats with SQL [¶](https://www.tinybird.co/docs/about:blank#id3) SELECT count () from < pipe > FORMAT JSON| Key | Type | Description | | --- | --- | --- | | q | String | The SQL query | | pipeline | String | (Optional) The name of the pipe. It allows writing a query like ‘SELECT * FROM _’ where ‘_’ is a placeholder for the ‘pipeline’ parameter | | output_format_json_quote_64bit_integers | int | (Optional) Controls quoting of 64-bit or bigger integers (like UInt64 or Int128) when they are output in a JSON format. Such integers are enclosed in quotes by default. This behavior is compatible with most JavaScript implementations. Possible values: 0 — Integers are output without quotes. 1 — Integers are enclosed in quotes. Default value is 0 | | output_format_json_quote_denormals | int | (Optional) Controls representation of inf and nan on the UI instead of null e.g when dividing by 0 - inf and when there is no representation of a number in Javascript - nan. Possible values: 0 - disabled, 1 - enabled. Default value is 0 | | output_format_parquet_string_as_string | int | (Optional) Use Parquet String type instead of Binary for String columns. Possible values: 0 - disabled, 1 - enabled. Default value is 0 | | format | Description | | --- | --- | | CSV | CSV without header | | CSVWithNames | CSV with header | | JSON | JSON including data, statistics and schema information | | TSV | TSV without header | | TSVWithNames | TSV with header | | PrettyCompact | Formatted table | | JSONEachRow | Newline-delimited JSON values (NDJSON) | | Parquet | Apache Parquet | As you can see in the example above, timestamps do not include a time zone in their serialization. Let’s see how that relates to timestamps ingested from your original data: - If the original timestamp had no time zone associated, you’ll read back the same date and time verbatim. If you ingested the timestamp `2022-11-14 11:08:46` , for example, Tinybird sends `"2022-11-14 11:08:46"` back. This is so regardless of the time zone of the column in ClickHouse. - If the original timestamp had a time zone associated, you’ll read back the corresponding date and time in the time zone of the destination column in ClickHouse, which is UTC by default. If you ingested `2022-11-14 12:08:46.574295 +0100` , for instance, Tinybird sends `"2022-11-14 11:08:46"` back for a `DateTime` , and `"2022-11-14 06:08:46"` for a `DateTime('America/New_York')` . 
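To tie the GET variant together, here's a minimal sketch of a request against the Query API; the token value and the `my_pipe` Pipe name are placeholders, not part of the reference above. The SQL is URL-encoded in the `q` parameter and `FORMAT JSON` selects the output format from the table above.

```bash
# Minimal sketch: run a query through GET /v0/sql and ask for JSON output.
# <TOKEN> and my_pipe are placeholders; replace them with a real token and Pipe name.
curl -G "https://api.tinybird.co/v0/sql" \
  -H "Authorization: Bearer <TOKEN>" \
  --data-urlencode "q=SELECT count() FROM my_pipe FORMAT JSON"
```

Swapping `FORMAT JSON` for `FORMAT CSVWithNames` in the same request returns a CSV with a header row instead, as described in the format table.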
POST /v0/sql [¶](https://www.tinybird.co/docs/about:blank#post--v0-sql) Executes a SQL query using the engine, while providing a templated or non templated query string and the custom parameters that will be translated into the query. The custom parameters provided should not have the same name as the request parameters for this endpoint (outlined below), as they are reserved in order to get accurate results for your query. Running sql queries against your data [¶](https://www.tinybird.co/docs/about:blank#id6) For example: 1. Providing the value to the query via the POST body: curl -X POST \ -H "Authorization: Bearer " \ -H "Content-Type: application/json" \ "https://api.tinybird.co/v0/sql" -d \ '{ "q":"% SELECT * FROM where column_name = {{String(column_name)}}", "column_name": "column_name_value" }' 2. Providing a new value to the query from the one defined within the pipe in the POST body: curl -X POST \ -H "Authorization: Bearer " \ -H "Content-Type: application/json" \ "https://api.tinybird.co/v0/sql" -d \ '{ "q":"% SELECT * FROM where column_name = {{String(column_name, "column_name_value")}}", "column_name": "new_column_name_value" }' 3. Providing a non template query in the POST body: curl -X POST \ -H "Authorization: Bearer " \ -H "Content-Type: application/json" \ "https://api.tinybird.co/v0/sql" -d \ '{ "q":"SELECT * FROM " }' 4. Providing a non template query as a string in the POST body with a content type of "text/plain" : curl -X POST \ -H "Authorization: Bearer " \ -H "Content-Type: text/plain" \ "https://api.tinybird.co/v0/sql" -d "SELECT * FROM "| Key | Type | Description | | --- | --- | --- | | pipeline | String | (Optional) The name of the pipe. It allows writing a query like ‘SELECT * FROM _’ where ‘_’ is a placeholder for the ‘pipeline’ parameter | | output_format_json_quote_64bit_integers | int | (Optional) Controls quoting of 64-bit or bigger integers (like UInt64 or Int128) when they are output in a JSON format. Such integers are enclosed in quotes by default. This behavior is compatible with most JavaScript implementations. Possible values: 0 — Integers are output without quotes. 1 — Integers are enclosed in quotes. Default value is 0 | | output_format_json_quote_denormals | int | (Optional) Controls representation of inf and nan on the UI instead of null e.g when dividing by 0 - inf and when there is no representation of a number in Javascript - nan. Possible values: 0 - disabled, 1 - enabled. Default value is 0 | | output_format_parquet_string_as_string | int | (Optional) Use Parquet String type instead of Binary for String columns. Possible values: 0 - disabled, 1 - enabled. Default value is 0 | As a response, it gives you the query metadata, the resulting data and some performance statistics. 
Successful response [¶](https://www.tinybird.co/docs/about:blank#id8) { "meta": [ { "name": "VendorID", "type": "Int32" }, { "name": "tpep_pickup_datetime", "type": "DateTime" } ], "data": [ { "VendorID": 2, "tpep_pickup_datetime": "2001-01-05 11:45:23", "tpep_dropoff_datetime": "2001-01-05 11:52:05", "passenger_count": 5, "trip_distance": 1.53, "RatecodeID": 1, "store_and_fwd_flag": "N", "PULocationID": 71, "DOLocationID": 89, "payment_type": 2, "fare_amount": 7.5, "extra": 0.5, "mta_tax": 0.5, "tip_amount": 0, "tolls_amount": 0, "improvement_surcharge": 0.3, "total_amount": 8.8 }, { "VendorID": 2, "tpep_pickup_datetime": "2002-12-31 23:01:55", "tpep_dropoff_datetime": "2003-01-01 14:59:11" } ], "rows": 3, "rows_before_limit_at_least": 4, "statistics": { "elapsed": 0.00091042, "rows_read": 4, "bytes_read": 296 } } Data can be fetched in different formats. Just append `FORMAT ` to your SQL query: Requesting different formats with SQL [¶](https://www.tinybird.co/docs/about:blank#id9) SELECT count () from < pipe > FORMAT JSON| format | Description | | --- | --- | | CSV | CSV without header | | CSVWithNames | CSV with header | | JSON | JSON including data, statistics and schema information | | TSV | TSV without header | | TSVWithNames | TSV with header | | PrettyCompact | Formatted table | | JSONEachRow | Newline-delimited JSON values (NDJSON) | | Parquet | Apache Parquet | As you can see in the example above, timestamps do not include a time zone in their serialization. Let’s see how that relates to timestamps ingested from your original data: - If the original timestamp had no time zone associated, you’ll read back the same date and time verbatim. If you ingested the timestamp `2022-11-14 11:08:46` , for example, Tinybird sends `"2022-11-14 11:08:46"` back. This is so regardless of the time zone of the column in ClickHouse. - If the original timestamp had a time zone associated, you’ll read back the corresponding date and time in the time zone of the destination column in ClickHouse, which is UTC by default. If you ingested `2022-11-14 12:08:46.574295 +0100` , for instance, Tinybird sends `"2022-11-14 11:08:46"` back for a `DateTime` , and `"2022-11-14 06:08:46"` for a `DateTime('America/New_York')` . --- URL: https://www.tinybird.co/docs/api-reference/sink-pipes-api Last update: 2024-10-28T11:06:14.000Z Content: --- title: "Sink Pipes API Reference · Tinybird Docs" theme-color: "#171612" description: "The Sink Pipes API allows you to create, delete, schedule, and trigger Sink Pipes." --- # Sink Pipes API¶ The Sink Pipes API allows you to create, delete, schedule, and trigger Sink Pipes. ## POST /v0/pipes/\{pipe\_id\}/nodes/\{node\_id\}/sink¶ Set the Pipe as a Sink Pipe, optionally scheduled. Required token permission: PIPES:CREATE. ### Restrictions¶ - You can set only one schedule per Sink Pipe. - You can’t set a Sink Pipe if the Pipe is already materializing. You must unlink the Materialization first. - You can’t set a Sink Pipe if the Pipe is already an API Endpoint. You must unpublish the API Endpoint first. - You can’t set a Sink Pipe if the Pipe is already copying. You must unset the copy first. 
### Example¶ ##### Setting the Pipe as a Sink Pipe curl \ -X POST "https://api.tinybird.co/v0/pipes/:pipe/nodes/:node/sink" \ -H "Authorization: Bearer " \ -d "connection=my_connection_name" \ -d "path=s3://bucket-name/prefix" \ -d "file_template=exported_file_template" \ -d "format=csv" \ -d "compression=gz" \ -d "schedule_cron=0 */1 * * *" \ -d "write_strategy=new" ### Request parameters¶ | Key | Type | Description | | --- | --- | --- | | connection | String | Name of the connection to holding the credentials to run the sink | | path | String | Object store prefix into which the sink will write data | | file_template | String | File template string. See[ file template](https://www.tinybird.co/docs/docs/publish/s3-sink#file-template) for more details | | format | String | Optional. Format of the exported files. Default: CSV | | compression | String | Optional. Compression of the output files. Default: None | | schedule_cron | String | Optional. The sink’s execution schedule, in crontab format. | | write_strategy | String | Optional. Default: `new` . The sink's write strategy for filenames already existing in the bucket. Values: `new` , `truncate` ; `new` adds a new file with a suffix, while `truncate` replaces the existent file. | ### Successful response example¶ { "id": "t_529f46626c324674b3a84cd820ac2649", "name": "p_test", "description": null, "endpoint": null, "created_at": "2024-01-18 12:57:36.503834", "updated_at": "2024-01-18 13:01:21.435012", "parent": null, "type": "sink", "last_commit": { "content_sha": "", "path": "", "status": "changed" }, "sink_node": "t_6e8afdb8c691459b80e16541433f951b", "schedule": { "timezone": "Etc/UTC", "cron": "0 */1 * * *", "status": "running" }, "nodes": [ { "id": "t_6e8afdb8c691459b80e16541433f951b", "name": "p_test_0", "sql": "SELECT * FROM test", "description": null, "materialized": null, "cluster": null, "tags": {}, "created_at": "2024-01-18 12:57:36.503843", "updated_at": "2024-01-18 12:57:36.503843", "version": 0, "project": null, "result": null, "ignore_sql_errors": false, "node_type": "sink", "dependencies": [ "test" ], "params": [] } ] } ### Response codes¶ | Code | Description | | --- | --- | | 200 | OK | | 404 | Pipe, Node, or data connector not found, bucket doesn’t exist | | 403 | Limit reached, Query includes forbidden keywords, Pipe is already a Sink Pipe, cannot assume role | | 401 | Invalid credentials (from connection) | | 400 | Invalid or missing parameters, bad ARN role, invalid region name | ## DELETE /v0/pipes/\{pipe\_id\}/nodes/\{node\_id\}/sink¶ Removes the Sink from the Pipe. This does not delete the Pipe nor the Node, only the sink configuration and any associated settings. 
### Example¶ curl \ -X DELETE "https://api.tinybird.co/v0/pipes/$1/nodes/$2/sink" \ -H "Authorization: Bearer " Successful response example { "id": "t_529f46626c324674b3a84cd820ac2649", "name": "p_test", "description": null, "endpoint": null, "created_at": "2024-01-18 12:57:36.503834", "updated_at": "2024-01-19 09:27:12.069650", "parent": null, "type": "default", "last_commit": { "content_sha": "", "path": "", "status": "changed" }, "nodes": [ { "id": "t_6e8afdb8c691459b80e16541433f951b", "name": "p_test_0", "sql": "SELECT * FROM test", "description": null, "materialized": null, "cluster": null, "tags": {}, "created_at": "2024-01-18 12:57:36.503843", "updated_at": "2024-01-19 09:27:12.069649", "version": 0, "project": null, "result": null, "ignore_sql_errors": false, "node_type": "standard", "dependencies": [ "test" ], "params": [] } ], "url": "https://api.split.tinybird.co/v0/pipes/p_test.json" } ### Response codes¶ | Code | Description | | --- | --- | | 200 | OK | | 404 | Pipe, Node, or data connector not found | | 403 | Limit reached, Query includes forbidden keywords, Pipe is already a Sink Pipe | | 400 | Invalid or missing parameters, Pipe is not a Sink Pipe | ## POST /v0/pipes/\{pipe\_id\}/sink¶ Triggers the Sink Pipe, creating a sink job. Allows overriding some of the sink settings for this particular execution. ### Example¶ ##### Trigger a Sink Pipe with some overrides curl \ -X POST "https://api.tinybird.co/v0/pipes/p_test/sink" \ -H "Authorization: Bearer " \ -d "file_template=export_file" \ -d "format=csv" \ -d "compression=gz" \ -d "write_strategy=truncate" \ -d {key}={val} ### Request parameters¶ | Key | Type | Description | | --- | --- | --- | | connection | String | Name of the connection to holding the credentials to run the sink | | path | String | Object store prefix into which the sink will write data | | file_template | String | File template string. See[ file template](https://www.tinybird.co/docs/docs/publish/s3-sink#file-template) for more details | | format | String | Optional. Format of the exported files. Default: CSV | | compression | String | Optional. Compression of the output files. Default: None | | write_strategy | String | Optional. The sink's write strategy for filenames already existing in the bucket. Values: `new` , `truncate` ; `new` adds a new file with a suffix, while `truncate` replaces the existent file. | | {key} | String | Optional. Additional variables to be injected into the file template. 
See[ file template](https://www.tinybird.co/docs/docs/publish/s3-sink#file-template) for more details | ### Successful response example¶ { "id": "t_6e8afdb8c691459b80e16541433f951b", "name": "p_test_0", "sql": "SELECT * FROM test", "description": null, "materialized": null, "cluster": null, "tags": {}, "created_at": "2024-01-18 12:57:36.503843", "updated_at": "2024-01-19 09:27:12.069649", "version": 0, "project": null, "result": null, "ignore_sql_errors": false, "node_type": "sink", "dependencies": [ "test" ], "params": [], "job": { "id": "685e7395-3b08-492b-9fe8-2944859d6a06", "kind": "sink", "status": "waiting", "created_at": "2024-01-19 15:58:46.688525", "updated_at": "2024-01-19 15:58:46.688532", "is_cancellable": true, "job_url": "https://api.split.tinybird.co/v0/jobs/685e7395-3b08-492b-9fe8-2944859d6a06", "pipe": { "id": "t_529f46626c324674b3a84cd820ac2649", "name": "p_test" } } } ### Response codes¶ | Code | Description | | --- | --- | | 200 | OK | | 404 | Pipe, Node, or data connector not found | | 403 | Limit reached, Query includes forbidden keywords, Pipe is already a Sink Pipe | | 400 | Invalid or missing parameters, Pipe is not a Sink Pipe | ## GET /v0/integrations/s3/policies/trust-policy¶ Retrieves the trust policy to be attached to the IAM role that will be used for the connection. External IDs are different for each Workspace, but shared between Branches of the same Workspace to avoid having to change the trust policy for each Branch. ### Example¶ curl \ -X GET "https://$TB_HOST/v0/integrations/s3/policies/trust-policy" \ -H "Authorization: Bearer " Successful response example { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": "sts:AssumeRole", "Principal": { "AWS": "arn:aws:iam::123456789:root" }, "Condition": { "StringEquals": { "sts:ExternalId": "c6ee2795-aae3-4a55-a7a1-92d92fab0e41" } } } ] } ### Response codes¶ | Code | Description | | --- | --- | | 200 | OK | | 404 | S3 integration not supported in your region | ## GET /v0/integrations/s3/policies/write-access-policy¶ Retrieves the trust policy to be attached to the IAM Role that will be used for the connection. External IDs are different for each Workspace, but shared between branches of the same Workspace to avoid having to change the trust policy for each branch. ### Example¶ curl \ -X GET "https://$TB_HOST/v0/integrations/s3/policies/write-access-policy?bucket=test-bucket" \ -H "Authorization: Bearer " ### Request parameters¶ | Key | Type | Description | | --- | --- | --- | | bucket | Optional[String] | Bucket to use for rendering the template. If not provided the ‘’ placeholder is used | ### Successful response example¶ { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetBucketLocation", "s3:ListBucket" ], "Resource": "arn:aws:s3:::" }, { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:PutObject", "s3:PutObjectAcl" ], "Resource": "arn:aws:s3:::/*" } ] } ### Response codes¶ | Code | Description | | --- | --- | | 200 | OK | ## GET /v0/integrations/s3/settings¶ Retrieves the settings to be attached to the IAM role that will be used for the connection. External IDs are different for each Workspace, but shared between Branches of the same Workspace to avoid having to generate specific IAM roles for each of the Branches. 
### Example¶ curl \ -X GET "https://$TB_HOST/v0/integrations/s3/settings" \ -H "Authorization: Bearer " Successful response example { "principal": "arn:aws:iam:::root", "external_id": "" } ### Response codes¶ | Code | Description | | --- | --- | | 200 | OK | | 404 | S3 integration not supported in your region | ## GET /v0/datasources-bigquery-credentials¶ Retrieves the Workspace’s GCP service account to be authorized to write to the destination bucket. ### Example¶ curl \ -X POST "${TINYBIRD_HOST}/v0/connectors" \ -H "Authorization: Bearer " \ -d "service=gcs_service_account" \ -d "name=" ### Request parameters¶ None ### Successful response example¶ { "account": "cdk-E-d83f6d01-b5c1-40-43439d@development-353413.iam.gserviceaccount.com" } ### Response codes¶ | Code | Description | | --- | --- | | 200 | OK | | 503 | Feature not enabled in your region | --- URL: https://www.tinybird.co/docs/api-reference/token-api Last update: 2024-10-15T15:38:30.000Z Content: --- title: "Token API Reference · Tinybird Docs" theme-color: "#171612" description: "In order to read, append or import data into you Tinybird account, you'll need a Token with the right permissions." --- GET /v0/tokens/? [¶](https://www.tinybird.co/docs/about:blank#get--v0-tokens-?) Retrieves all workspace Static Tokens. Get all tokens [¶](https://www.tinybird.co/docs/about:blank#id2) curl -X GET \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/tokens" A list of your Static Tokens and their scopes will be sent in the response. Successful response [¶](https://www.tinybird.co/docs/about:blank#id3) { "tokens": [ { "name": "admin token", "description": "", "scopes": [ { "type": "ADMIN" } ], "token": "p.token" }, { "name": "import token", "description": "", "scopes": [ { "type": "DATASOURCES:CREATE" } ], "token": "p.token0" }, { "name": "token name 1", "description": "", "scopes": [ { "type": "DATASOURCES:READ", "resource": "table_name_1" }, { "type": "DATASOURCES:APPEND", "resource": "table_name_1" } ], "token": "p.token1" }, { "name": "token name 2", "description": "", "scopes": [ { "type": "PIPES:READ", "resource": "pipe_name_2" } ], "token": "p.token2" } ] } POST /v0/tokens/? [¶](https://www.tinybird.co/docs/about:blank#post--v0-tokens-?) Creates a new Token: Static or JWT Creating a new Static Token [¶](https://www.tinybird.co/docs/about:blank#id4) curl -X POST \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/tokens/" \ -d "name=test&scope=DATASOURCES:APPEND:table_name&scope=DATASOURCES:READ:table_name"| Key | Type | Description | | --- | --- | --- | | name | String | Name of the token | | description | String | Optional. Markdown text with a description of the token. | | scope | String | Scope(s) to set. Format is[ SCOPE:TYPE[:arg][:filter]](about:blank#id2) . This is only used for the Static Tokens | Successful response [¶](https://www.tinybird.co/docs/about:blank#id6) { "name": "token_name", "description": "", "scopes": [ { "type": "DATASOURCES:APPEND", "resource": "table_name" } { "type": "DATASOURCES:READ", "resource": "table_name", "filter": "department = 1" }, ], "token": "p.token" } When creating a token with `filter` whenever you use the token to read the table, it will be filtered. 
For example, if table is `events_table` and `filter` is `date > '2018-01-01' and type == 'foo'` a query like `select count(1) from events_table` will become `select count(1) from events_table where date > '2018-01-01' and type == 'foo'` Creating a new token with filter [¶](https://www.tinybird.co/docs/about:blank#id7) curl -X POST \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/tokens/" \ -d "name=test&scope=DATASOURCES:READ:table_name:column==1" If we provide an `expiration_time` in the URL, the token will be created as a JWT Token. Creating a new JWT Token [¶](https://www.tinybird.co/docs/about:blank#id8) curl -X POST \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/tokens?name=jwt_token&expiration_time=1710000000" \ -d '{"scopes": [{"type": "PIPES:READ", "resource": "requests_per_day", "fixed_params": {"user_id": 3}}]}' In multi-tenant applications, you can use this endpoint to create a JWT token for a specific tenant where each user has their own token with a fixed set of scopes and parameters POST /v0/tokens/(.+)/refresh [¶](https://www.tinybird.co/docs/about:blank#post--v0-tokens-(.+)-refresh) Refresh the Static Token without modifying name, scopes or any other attribute. Specially useful when a Token is leaked, or when you need to rotate a Token. Refreshing a Static Token [¶](https://www.tinybird.co/docs/about:blank#id9) curl -X POST \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/tokens/:token_name/refresh" When successfully refreshing a token, new information will be sent in the response Successful response [¶](https://www.tinybird.co/docs/about:blank#id10) { "name": "token name", "description": "", "scopes": [ { "type": "DATASOURCES:READ", "resource": "table_name" } ], "token": "NEW_TOKEN" }| Key | Type | Description | | --- | --- | --- | | auth_token | String | Token. Ensure it has the `TOKENS` scope on it | | Code | Description | | --- | --- | | 200 | No error | | 403 | Forbidden. Provided token doesn’t have permissions to drop the token. A token is not allowed to remove itself, it needs `ADMIN` or `TOKENS` scope | GET /v0/tokens/(.+) [¶](https://www.tinybird.co/docs/about:blank#get--v0-tokens-(.+)) Fetches information about a particular Static Token. Getting token info [¶](https://www.tinybird.co/docs/about:blank#id13) curl -X GET \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/tokens/:token" Returns a json with name and scopes. Successful response [¶](https://www.tinybird.co/docs/about:blank#id14) { "name": "token name", "description": "", "scopes": [ { "type": "DATASOURCES:READ", "resource": "table_name" } ], "token": "p.TOKEN" } DELETE /v0/tokens/(.+) [¶](https://www.tinybird.co/docs/about:blank#delete--v0-tokens-(.+)) Deletes a Static Token . Deleting a token [¶](https://www.tinybird.co/docs/about:blank#id15) curl -X DELETE \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/tokens/:token" PUT /v0/tokens/(.+) [¶](https://www.tinybird.co/docs/about:blank#put--v0-tokens-(.+)) Modifies a Static Token. More than one scope can be sent per request, all of them will be added as Token scopes. Every time a Token scope is modified, it overrides the existing one(s). editing a token [¶](https://www.tinybird.co/docs/about:blank#id16) curl -X PUT \ -H "Authorization: Bearer " \ "https://api.tinybird.co/v0/tokens/?" \ -d "name=test_new_name&description=this is a test token&scope=PIPES:READ:test_pipe&scope=DATASOURCES:CREATE"| Key | Type | Description | | --- | --- | --- | | token | String | Token. 
Ensure it has the `TOKENS` scope on it | | name | String | Optional. Name of the token. | | description | String | Optional. Markdown text with a description of the token. | | scope | String | Optional. Scope(s) to set. Format is[ SCOPE:TYPE[:arg][:filter]](about:blank#id2) . New scope(s) will override existing ones. | Successful response [¶](https://www.tinybird.co/docs/about:blank#id18) { "name": "test", "description": "this is a test token", "scopes": [ { "type": "PIPES:READ", "resource": "test_pipe" }, { "type": "DATASOURCES:CREATE" } ] } --- URL: https://www.tinybird.co/docs/architecture Last update: 2024-10-24T08:59:50.000Z Content: --- title: "Architecture · Tinybird Docs" theme-color: "#171612" description: "Frequently asked questions about Tinybird architecture" --- # Architecture¶ Tinybird is built using open source software. Tinybird loves open source and have dedicated teams that contribute to all the projects it uses. ## What's ClickHouse®?¶ An open source, high performance, columnar OLAP database. It's a lightning-fast database that solves use cases requiring rapid ingestion, low latency queries, and high concurrency. Tinybird gives you the speed of ClickHouse with 10x the DevEx. ## Where does Tinybird fit in your tech stack?¶ Tinybird is the analytical backend for your applications: it consumes data from any source and exposes it through API Endpoints. Tinybird can sit parallel to the Data Warehouse or in front of it. The Data Warehouse allows to explore use cases like BI and data science, while Tinybird unlocks action use cases like operational applications, embedded analytics, and user-facing analytics. Read the guide ["Team integration and data governance"](https://www.tinybird.co/docs/docs/guides/integrations/team-integration-governance) to learn more about implementing Tinybird in your existing team. ## Is data stored in Tinybird?¶ The data you ingest into Tinybird is stored in ClickHouse. Depending on the use case, you can add a TTL to control how long data is stored. ## Cloud environment vs on-premises¶ Tinybird is a managed SaaS solution. Tinybird doesn't provide a "Bring Your Own Cloud" (BYOC) or on-premises deployments. If you are interested in a BYOC deployment, join the [Tinybird BYOC Waitlist](https://faster.tinybird.co/byoc-waitlist). ## Next steps¶ - Explore[ Tinybird's Customer Stories](https://www.tinybird.co/customer-stories) and see what people have built on Tinybird. - Start building using the[ quick start](https://www.tinybird.co/docs/docs/quick-start) . - Read the guide[ "Team integration and data governance"](https://www.tinybird.co/docs/docs/guides/integrations/team-integration-governance) to learn more about implementing Tinybird in your existing team. Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/changelog Content: --- title: "Changelog · Tinybird" theme-color: "#171612" description: "Tinybird helps data teams build real-time Data Products at scale through SQL-based API endpoints." --- # Fix and create queries with AI-powered suggestions¶ Your query doesn't work? Would you like to create a query from pseudoSQL? Now you can do both things! When your query throws an error, select **Suggest a fix** to get a suggestion. Look at the diff, accept the change, and then run the fixed query. ![Video]() Controls: true This feature is currently in private preview. 
If you're interested in testing it, contact us at [support@tinybird.co](mailto:support@tinybird.co). ## Pause/Resume button for S3 Data Sources¶ You can now pause or resume scheduled S3 data ingest using the **Pause sync** button. ![Pause button](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fassets%2Fchangelog%2Fsyncbutton.png&w=3840&q=75) ## Improvements and bug fixes¶ - You can now use the `KAFKA_SASL_MECHANISM` in .datasource files to define the type of SASL security for Kafka connections. See [Kafka, Confluent, Redpanda](https://www.tinybird.co/docs/cli/datafiles/datasource-files#kafka-confluent-redpanda). - When a user opens a Workspace they lack access to, they now get a 404. Previously, they were redirected to their default Workspace. - Fixed an issue where stacked charts showed gaps and other inconsistencies when the X axis was a time value. - Fixed the hover action for copying a token when using Safari 17.6. --- URL: https://www.tinybird.co/docs/cli/advanced-templates Last update: 2024-11-13T15:42:17.000Z Content: --- title: "Advanced templates · Tinybird Docs" theme-color: "#171612" description: "This document shows an advanced usage of our datafile system when using the CLI. It's also related to how to work with query parameters." --- # Advanced templates¶ This section covers advanced usage of our datafile system when using the [Tinybird CLI](https://www.tinybird.co/docs/docs/cli/quick-start). Before reading this page, you should be familiar with [query parameters](https://www.tinybird.co/docs/docs/query/query-parameters). ## Reusing templates¶ When developing multiple use cases, it's very common to want to reuse certain parts or steps of an analysis, such as data filters or similar table operations. We're going to use the following repository for this purpose: ##### Clone demo git clone https://github.com/tinybirdco/ecommerce_data_project_advanced.git cd ecommerce_data_project_advanced ##### File structure ecommerce_data_project/ datasources/ events.datasource mv_top_per_day.datasource products.datasource fixtures/ events.csv products.csv endpoints/ sales.pipe top_products_between_dates.pipe top_products_last_week.pipe includes/ only_buy_events.incl top_products.incl pipes/ top_product_per_day.pipe First, let's take a look at the `sales.pipe` API Endpoint and the `top_product_per_day.pipe` Pipe that materializes to a `mv_top_per_day` Data Source. They both make use of the same Node: `only_buy_events`: ##### includes/only\_buy\_events.incl NODE only_buy_events SQL > SELECT toDate(timestamp) date, product, joinGet('products_join_by_id', 'color', product) as color, JSONExtractFloat(json, 'price') as price FROM events where action = 'buy' ##### endpoints/sales.pipe INCLUDE "../includes/only_buy_events.incl" NODE endpoint DESCRIPTION > return sales for a product with color filter SQL > % select date, sum(price) total_sales from only_buy_events where color in {{Array(colors, 'black')}} group by date ##### pipes/top\_product\_per\_day.pipe INCLUDE "../includes/only_buy_events.incl" NODE top_per_day SQL > SELECT date, topKState(10)(product) top_10, sumState(price) total_sales from only_buy_events group by date TYPE materialized DATASOURCE mv_top_per_day ENGINE AggregatingMergeTree ENGINE_SORTING_KEY date When using INCLUDE files to reuse logic in .datasource files, the extension of the file must be `.datasource.incl` . This is used by CLI commands such as `tb fmt` to identify the type of file and apply the correct formatting.
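The same mechanism can be used to share column definitions between Data Sources. The following is a minimal sketch, assuming the include behaves as plain text inclusion like the `.incl` Pipe examples above; the file and column names are hypothetical and not part of the example project:

##### includes/events\_schema.datasource.incl

SCHEMA >
    `timestamp` DateTime,
    `product` String,
    `user_id` String

##### datasources/events\_landing.datasource

INCLUDE "../includes/events_schema.datasource.incl"

ENGINE "MergeTree"
ENGINE_SORTING_KEY "timestamp"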
## Include variables¶ ### Using variables¶ It is possible to include variables in a Node template. The main reason to do that is to have a very similar Node or Nodes that can be reused with slight differences. For instance, in our example, we want to have two API Endpoints to display the 10 top products, each filtered by different date intervals: ##### includes/top\_products.incl NODE endpoint DESCRIPTION > returns top 10 products for the last week SQL > select date, topKMerge(10)(top_10) as top_10 from top_product_per_day {% if '$DATE_FILTER' = 'last_week' %} where date > today() - interval 7 day {% else %} where date between {{Date(start)}} and {{Date(end)}} {% end %} group by date ##### endpoints/top\_products\_last\_week.pipe INCLUDE "../includes/top_products.incl" "DATE_FILTER=last_week" ##### endpoints/top\_products\_between\_dates.pipe INCLUDE "../includes/top_products.incl" "DATE_FILTER=between_dates" As you can see, the variable `DATE_FILTER` is sent to the `top_products` include, where the variable content is retrieved using the `$` prefix with the `DATE_FILTER` reference. It is also possible to assign an array of values to an include variable. To do this, the variable needs to be parsed properly using function templates, as explained in the following section. ### Variables vs parameters¶ Note the difference between variables and parameters. Parameters are indeed variables whose value can be changed by the user through the API Endpoint request parameters. Variables only live in the template and can be set when declaring the `INCLUDE` or with the `set` template syntax: ##### Using 'set' to declare a variable {% set my_var = 'default' %} By default, variables will be interpreted as parameters. In order to prevent variables or private parameters from appearing in the auto-generated API Endpoint documentation, they need to start with `_` . Example: ##### Define private variables % SELECT date FROM my_table WHERE a > 10 {% if defined(_private_param) %} and b = {{Int32(_private_param)}} {% end %} This is also needed when using variables in template functions. ## Template functions¶ This is the list of the available functions that can be used in a template: #### defined¶ `defined(param)` : check if a variable is defined ##### defined function % SELECT date FROM my_table {% if defined(param) %} WHERE ... {% end %} #### column¶ `column(name)` : get the column by its name from a variable ##### column function % {% set var_1 = 'name' %} SELECT {{column(var_1)}} FROM my_table #### columns¶ `columns(names)` : get columns by their name from a variable ##### columns function % {% set var_1 = 'name,age,address' %} SELECT {{columns(var_1)}} FROM my_table #### date\_diff\_in\_seconds¶ `date_diff_in_seconds(date_1, date_2, [date_format], [backup_date_format], [none_if_error])` : gets the abs value of the difference in seconds between two datetimes. `date_format` is optional and defaults to `'%Y-%m-%d %H:%M:%S'` , so you can pass DateTimes as `YYYY-MM-DD hh:mm:ss` when calling the function: date_diff_in_seconds('2022-12-19 18:42:22', '2022-12-19 19:42:34') Other formats are supported and need to be explicitly passed, like date_diff_in_seconds('2022-12-19T18:42:23.521Z', '2022-12-19T18:42:23.531Z', date_format='%Y-%m-%dT%H:%M:%S.%fz') For questions regarding the format, check [python strftime-and-strptime-format-codes](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes). 
For a deep dive into timestamps, time zones and Tinybird, see the docs on [working with time](https://www.tinybird.co/docs/docs/guides/querying-data/working-with-time). ##### date\_diff\_in\_seconds function % SELECT date, events {% if date_diff_in_seconds(date_end, date_start, date_format="%Y-%m-%dT%H:%M:%Sz") < 3600 %} FROM my_table_raw {% else %} FROM my_table_hourly_agg {% end %} WHERE date BETWEEN parseDateTimeBestEffort({{String(date_start,'2023-01-11T12:24:04Z')}}) AND parseDateTimeBestEffort({{String(date_end,'2023-01-11T12:24:05Z')}}) `backup_date_format` is optional and it allows to specify a secondary format as a backup when the provided date does not match the primary format. This is useful when your default input format is a datetime ( `2022-12-19 18:42:22` ) but you receive a date ( `2022-12-19` ). date_diff_in_seconds('2022-12-19 18:42:22', '2022-12-19', backup_date_format='%Y-%m-%d') `none_if_error` is optional and defaults to `False` . If set to `True` , the function will return `None` if the provided date does not match any of the provided formats. This is useful to provide an alternate logic in case any of the dates are specified in a different format. ##### date\_diff\_in\_seconds function using none\_if\_error % SELECT * FROM employees {% if date_diff_in_seconds(date_start, date_end, none_if_error=True) is None %} WHERE starting_date BETWEEN now() - interval 4 year AND now() {% else %} WHERE starting_date BETWEEN parseDateTimeBestEffort({{String(date_start, '2023-12-01')}}) AND parseDateTimeBestEffort({{String(date_end, '2023-12-02')}}) {% end %} #### date\_diff\_in\_minutes¶ Same behavior as `date_diff_in_seconds` with returning the difference in minutes. #### date\_diff\_in\_hours¶ Same behavior as `date_diff_in_seconds` with returning the difference in hours. #### date\_diff\_in\_days¶ `date_diff_in_days(date_1, date_2, [date_format])` : gets the absolute value of the difference in seconds between two dates or datetimes. ##### date\_diff\_in\_days function % SELECT date FROM my_table {% if date_diff_in_days(date_end, date_start) < 7 %} WHERE ... {% end %} `date_format` is optional and defaults to `'%Y-%m-%d` so you can pass DateTimes as `YYYY-MM-DD` when calling the function. As with `date_diff_in_seconds`, `date_diff_in_minutes` , and `date_diff_in_hours` , other [date_formats](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes) are supported. #### split\_to\_array¶ `split_to_array(arr, default, separator=',')` : splits comma separated values into an array ##### split\_to\_array function % SELECT arrayJoin(arrayMap(x -> toInt32(x), {{split_to_array(code, '')}})) as codes FROM my_table ##### split\_to\_array with a custom separator function % SELECT {{split_to_array(String(param, 'hi, how are you|fine thanks'), separator='|')}} #### enumerate\_with\_last¶ `enumerate_with_last(arr, default)` : creates an iterable array, returning a boolean value that allows to check if the current element is the last element in the array. It can be used along with the `split_to_array` function. #### symbol¶ `symbol(x, quote)` : get the value of a variable ##### enumerate\_with\_last function % SELECT {% for _last, _x in enumerate_with_last(split_to_array(attr, 'amount')) %} sum({{symbol(_x)}}) as {{symbol(_x)}} {% if not _last %}, {% end %} {% end %} FROM my_table #### sql\_and¶ `sql_and(__= [, ...] )` : creates a list of "WHERE" clauses, along with "AND" separated filters, that checks if a field () is or isn't () in a list/tuple (). 
- The parameter is any column in the table. - is one of: `in` , `not_in` , `gt` (>), `lt` (<), `gte` (>=), `lte` (<=) - is any of the transform type functions ( `Array(param, 'Int8')` , `String(param)` , etc.). If one parameter is not specified, then the filter is ignored. ##### sql\_and function % SELECT * FROM my_table WHERE 1 {% if defined(param) or defined(param2_not_in) %} AND {{sql_and( param__in=Array(param, 'Int32', defined=False), param2__not_in=Array(param2_not_in, 'String', defined=False))}} {% end %} If this is queried with `param=1,2` and `param2_not_in=ab,bc,cd` , then it translates to: ##### sql\_and function - generated sql SELECT * FROM my_table WHERE 1 AND param IN [1,2] AND param2 NOT IN ['ab','bc','cd'] If this is queried just with `param=1,2` , but `param2_not_in` is not specified, then it translates to: ##### sql\_and function - generated sql param missing SELECT * FROM my_table WHERE 1 AND param IN [1,2] ### Transform types functions¶ - `Boolean(x)` - `DateTime64(x)` - `DateTime(x)` - `Date(x)` - `Float32(x)` - `Float64(x)` - `Int8(x)` - `Int16(x)` - `Int32(x)` - `Int64(x)` - `Int128(x)` - `Int256(x)` - `UInt8(x)` - `UInt16(x)` - `UInt32(x)` - `UInt64(x)` - `UInt128(x)` - `UInt256(x)` - `String(x)` - `Array(x)` --- URL: https://www.tinybird.co/docs/cli/command-ref Last update: 2024-11-12T11:45:41.000Z Content: --- title: "Tinybird CLI command reference · Tinybird Docs" theme-color: "#171612" description: "The Tinybird CLI allows you to use all the Tinybird functionality directly from the command line. Get to know the command reference." --- # CLI command reference¶ The following list shows all available commands in the Tinybird command-line interface, their options, and their arguments. For examples on how to use them, see the [Quick start guide](https://www.tinybird.co/docs/docs/cli/quick-start), [Data projects](https://www.tinybird.co/docs/docs/cli/data-projects) , and [Common use cases](https://www.tinybird.co/docs/docs/cli/common-use-cases). ## tb auth¶ Configure your Tinybird authentication. **auth commands** | Command | Description | | --- | --- | | info OPTIONS | Get information about the authentication that is currently being used | | ls OPTIONS | List available regions to authenticate | | use OPTIONS REGION_NAME_OR_HOST_OR_ID | Switch to a different region. You can pass the region name, the region host url, or the region index after listing available regions with `tb auth ls` | The previous commands accept the following options: - `--token INTEGER` : Use auth Token, defaults to TB_TOKEN envvar, then to the .tinyb file - `--host TEXT` : Set custom host if it's different than https://api.tinybird.co. Check[ this page](https://www.tinybird.co/docs/api-reference/overview#regions-and-endpoints) for the available list of regions - `--region TEXT` : Set region. Run 'tb auth ls' to show available regions - `--connector [bigquery|snowflake]` : Set credentials for one of the supported connectors - `--interactive,-i` : Show available regions and select where to authenticate to ## tb branch¶ Manage your Workspace branches. **Branch commands** | Command | Description | Options | | | --- | --- | --- | | | create BRANCH_NAME | Create a new Branch in the current 'main' Workspace | `--last-partition` : Attach the last modified partition from 'main' to the new Branch `-i, --ignore-datasource DATA_SOURCE_NAME` : Ignore specified Data Source partitions `--wait / --no-wait` : Wait for Branch jobs to finish, showing a progress bar. 
Disabled by default | | | current | Show the Branch you're currently authenticated to | | | | data | Perform a data branch operation to bring data into the current Branch | `--last-partition` : Attach the last modified partition from 'main' to the new Branch `-i, --ignore-datasource DATA_SOURCE_NAME` : Ignore specified Data Source partitions `--wait / --no-wait` : Wait for Branch jobs to finish, showing a progress bar. Disabled by default | | | datasource copy DATA_SOURCE_NAME | Copy data source from Main | `--sql SQL` : Freeform SQL query to select what is copied from Main into the Environment Data Source `--sql-from-main` : SQL query selecting all from the same Data Source in Main `--wait / --no-wait` : Wait for copy job to finish. Disabled by default | | | ls | List all the Branches available | `--sort / --no-sort` : Sort the list of Branches by name. Disabled by default | | | regression-tests | Regression test commands | `-f, --filename PATH` : The yaml file with the regression-tests definition `--skip-regression-tests / --no-skip-regression-tests` : Flag to skip execution of regression tests. This is handy for CI Branches where regression might be flaky `--main` : Run regression tests in the main Branch. For this flag to work all the resources in the Branch Pipe Endpoints need to exist in the main Branch. `--wait / --no-wait` : Wait for regression job to finish, showing a progress bar. Disabled by default | | | regression-tests coverage PIPE_NAME | Run regression tests using coverage requests for Branch vs Main Workspace. It creates a regression-tests job. The argument supports regular expressions. Using '.*' if no Pipe name is provided | `--assert-result / --no-assert-result` : Whether to perform an assertion on the results returned by the Endpoint. Enabled by default. Use `--no-assert-result` if you expect the endpoint output is different from current version `--assert-result-no-error / --no-assert-result-no-error` : Whether to verify that the Endpoint does not return errors. Enabled by default. Use `--no-assert-result-no-error` if you expect errors from the endpoint `--assert-result-rows-count / --no-assert-result-rows-count` : Whether to verify that the correct number of elements are returned in the results. Enabled by default. Use `--no-assert-result-rows-count` if you expect the numbers of elements in the endpoint output is different from current version `--assert-result-ignore-order / --no-assert-result-ignore-order` : Whether to ignore the order of the elements in the results. Disabled by default. Use `--assert-result-ignore-order` if you expect the endpoint output is returning same elements but in different order `--assert-time-increase-percentage INTEGER` : Allowed percentage increase in Endpoint response time. Default value is 25%. Use -1 to disable assert `--assert-bytes-read-increase-percentage INTEGER` : Allowed percentage increase in the amount of bytes read by the endpoint. Default value is 25%. Use -1 to disable assert `--assert-max-time FLOAT` : Max time allowed for the endpoint response time. If the response time is lower than this value then the `--assert-time-increase-percentage` is not taken into account `--ff, --failfast` : When set, the checker will exit as soon one test fails `--wait` : Waits for regression job to finish, showing a progress bar. Disabled by default `--skip-regression-tests / --no-skip-regression-tests` : Flag to skip execution of regression tests. 
This is handy for CI environments where regression might be flaky `--main` : Run regression tests in the main Branch. For this flag to work all the resources in the Branch Pipe Endpoints need to exist in the main Branch | | | regression-tests last PIPE_NAME | Run regression tests using coverage requests for Branch vs Main Workspace. It creates a regression-tests job. The argument supports regular expressions. Using '.*' if no Pipe name is provided | `--assert-result / --no-assert-result` : Whether to perform an assertion on the results returned by the Endpoint. Enabled by default. Use `--no-assert-result` if you expect the endpoint output is different from current version `--assert-result-no-error / --no-assert-result-no-error` : Whether to verify that the Endpoint does not return errors. Enabled by default. Use `--no-assert-result-no-error` if you expect errors from the endpoint `--assert-result-rows-count / --no-assert-result-rows-count` : Whether to verify that the correct number of elements are returned in the results. Enabled by default. Use `--no-assert-result-rows-count` if you expect the numbers of elements in the endpoint output is different from current version `--assert-result-ignore-order / --no-assert-result-ignore-order` : Whether to ignore the order of the elements in the results. Disabled by default. Use `--assert-result-ignore-order` if you expect the endpoint output is returning same elements but in different order `--assert-time-increase-percentage INTEGER` : Allowed percentage increase in Endpoint response time. Default value is 25%. Use -1 to disable assert `--assert-bytes-read-increase-percentage INTEGER` : Allowed percentage increase in the amount of bytes read by the endpoint. Default value is 25%. Use -1 to disable assert `--assert-max-time FLOAT` : Max time allowed for the endpoint response time. If the response time is lower than this value then the `--assert-time-increase-percentage` is not taken into account `--ff, --failfast` : When set, the checker will exit as soon one test fails `--wait` : Waits for regression job to finish, showing a progress bar. Disabled by default `--skip-regression-tests / --no-skip-regression-tests` : Flag to skip execution of regression tests. This is handy for CI environments where regression might be flaky | | | regression-tests manual PIPE_NAME | Run regression tests using coverage requests for Branch vs Main Workspace. It creates a regression-tests job. The argument supports regular expressions. Using '.*' if no Pipe name is provided | `--assert-result / --no-assert-result` : Whether to perform an assertion on the results returned by the Endpoint. Enabled by default. Use `--no-assert-result` if you expect the endpoint output is different from current version `--assert-result-no-error / --no-assert-result-no-error` : Whether to verify that the Endpoint does not return errors. Enabled by default. Use `--no-assert-result-no-error` if you expect errors from the endpoint `--assert-result-rows-count / --no-assert-result-rows-count` : Whether to verify that the correct number of elements are returned in the results. Enabled by default. Use `--no-assert-result-rows-count` if you expect the numbers of elements in the endpoint output is different from current version `--assert-result-ignore-order / --no-assert-result-ignore-order` : Whether to ignore the order of the elements in the results. Disabled by default. 
Use `--assert-result-ignore-order` if you expect the endpoint output is returning same elements but in different order `--assert-time-increase-percentage INTEGER` : Allowed percentage increase in Endpoint response time. Default value is 25%. Use -1 to disable assert `--assert-bytes-read-increase-percentage INTEGER` : Allowed percentage increase in the amount of bytes read by the endpoint. Default value is 25%. Use -1 to disable assert `--assert-max-time FLOAT` : Max time allowed for the endpoint response time. If the response time is lower than this value then the `--assert-time-increase-percentage` is not taken into account `--ff, --failfast` : When set, the checker will exit as soon one test fails `--wait` : Waits for regression job to finish, showing a progress bar. Disabled by default `--skip-regression-tests / --no-skip-regression-tests` : Flag to skip execution of regression tests. This is handy for CI Branches where regression might be flaky | | | rm [BRANCH_NAME_OR_ID] | Removes a Branch from the Workspace (not Main). It can't be recovered | `--yes` : Do not ask for confirmation | | | use [BRANCH_NAME_OR_ID] | Switch to another Branch | | | ## tb check¶ Check file syntax. It only allows one option, `--debug` , which prints the internal representation. ## tb datasource¶ Data Sources commands. | Command | Description | Options | | --- | --- | --- | | analyze OPTIONS URL_OR_FILE | Analyze a URL or a file before creating a new data source | | | append OPTIONS DATASOURCE_NAME URL | Appends data to an existing Data Source from URL, local file or a connector | | | connect OPTIONS CONNECTION DATASOURCE_NAME | Create a new Data Source from an existing connection | `--kafka-topic TEXT` : For Kafka connections: topic `--kafka-group TEXT` : For Kafka connections: group ID `--kafka-auto-offset-reset [latest|earliest]` : Kafka auto.offset.reset config. Valid values are: ["latest", "earliest"] `--kafka-sasl-mechanism [PLAIN|SCRAM-SHA-256|SCRAM-SHA-512]` : Kafka SASL mechanism. Valid values are: ["PLAIN", "SCRAM-SHA-256", "SCRAM-SHA-512"]. Default: "PLAIN". | | copy OPTIONS DATASOURCE_NAME | Copy data source from Main | `--sql TEXT` : Freeform SQL query to select what is copied from Main into the Branch Data Source `--sql-from-main` : SQL query selecting * from the same Data Source in Main `--wait` : Wait for copy job to finish, disabled by default | | delete OPTIONS DATASOURCE_NAME | Delete rows from a Data Source | `--yes` : Do not ask for confirmation `--wait` : Wait for delete job to finish, disabled by default `--dry-run` : Run the command without deleting anything `--sql-condition` : Delete rows with SQL condition | | generate OPTIONS FILENAMES | Generate a Data Source file based on a sample CSV file from local disk or URL | `--force` : Override existing files | | ls OPTIONS | List Data Sources | `--match TEXT` : Retrieve any resources matching the pattern. 
eg `--match _test` `--format [json]` : Force a type of the output `--dry-run` : Run the command without deleting anything | | replace OPTIONS DATASOURCE_NAME URL | Replaces the data in a Data Source from a URL, local file or a connector | `--sql` : The SQL to extract from `--connector` : Connector name `--sql-condition` : Delete rows with SQL condition | | rm OPTIONS DATASOURCE_NAME | Delete a Data Source | `--yes` : Do not ask for confirmation | | share OPTIONS DATASOURCE_NAME WORKSPACE_NAME_OR_ID | Share a Data Source | `--user_token TEXT` : When passed, we won't prompt asking for it `--yes` : Do not ask for confirmation | | sync OPTIONS DATASOURCE_NAME | Sync from connector defined in .datasource file | | | truncate OPTIONS DATASOURCE_NAME | Truncate a Data Source | `--yes` : Do not ask for confirmation `--cascade` : Truncate dependent Data Source attached in cascade to the given Data Source | | unshare OPTIONS DATASOURCE_NAME WORKSPACE_NAME_OR_ID | Unshare a Data Source | `--user_token TEXT` : When passed, we won't prompt asking for it `--yes` : Do not ask for confirmation | | scheduling resume DATASOURCE_NAME | Resume the scheduling of a Data Source | | | scheduling pause DATASOURCE_NAME | Pause the scheduling of a Data Source | | | scheduling status DATASOURCE_NAME | Get the scheduling status of a Data Source (paused or running) | | ## tb dependencies¶ Print all Data Sources dependencies. Its options: - `--no-deps` : Print only Data Sources with no Pipes using them - `--match TEXT` : Retrieve any resource matching the pattern - `--pipe TEXT` : Retrieve any resource used by Pipe - `--datasource TEXT` : Retrieve resources depending on this Data Source - `--check-for-partial-replace` : Retrieve dependant Data Sources that will have their data replaced if a partial replace is executed in the Data Source selected - `--recursive` : Calculate recursive dependencies ## tb deploy¶ Deploy in Tinybird pushing resources changed from previous release using Git. These are the options available for the `deploy` command: - `--dry-run` : Run the command with static checks, without creating resources on the Tinybird account or any side effect. Doesn't check for runtime errors. - `-f, --force` : Override Pipes when they already exist. - `--override-datasource` : When pushing a Pipe with a materialized Node if the target Data Source exists it will try to override it. - `--populate` : Populate materialized Nodes when pushing them. - `--subset FLOAT` : Populate with a subset percent of the data (limited to a maximum of 2M rows), this is useful to quickly test a materialized Node with some data. The subset must be greater than 0 and lower than 0.1. A subset of 0.1 means a 10% of the data in the source Data Source will be used to populate the Materialized View. Use it together with `--populate` , it has precedence over `--sql-condition` . - `--sql-condition TEXT` : Populate with a SQL condition to be applied to the trigger Data Source of the Materialized View. For instance, `--sql-condition='date == toYYYYMM(now())'` it'll populate taking all the rows from the trigger Data Source which `date` is the current month. Use it together with `--populate` . `--sql-condition` is not taken into account if the `--subset` param is present. Including in the `sql_condition` any column present in the Data Source `engine_sorting_key` will make the populate job process less data. - `--unlink-on-populate-error` : If the populate job fails the Materialized View is unlinked and new data won't be ingested there. 
First time a populate job fails, the Materialized View is always unlinked. - `--wait` : To be used along with `--populate` command. Waits for populate jobs to finish, showing a progress bar. Disabled by default. - `--yes` : Do not ask for confirmation. - `--workspace_map TEXT..., --workspace TEXT...` : Adds a Workspace path to the list of external Workspaces, usage: `--workspace name path/to/folder` . - `--timeout FLOAT` : Timeout you want to use for the job populate. - `--user_token TOKEN` : The user Token is required for sharing a Data Source that contains the SHARED_WITH entry. ## tb diff¶ Diffs local datafiles to the corresponding remote files in the Workspace. It works as a regular `diff` command, useful to know if the remote resources have been changed. Some caveats: - Resources in the Workspace might mismatch due to having slightly different SQL syntax, for instance: A parenthesis mismatch, `INTERVAL` expressions or changes in the schema definitions. - If you didn't specify an `ENGINE_PARTITION_KEY` and `ENGINE_SORTING_KEY` , resources in the Workspace might have default ones. The recommendation in these cases is use `tb pull` to keep your local files in sync. Remote files are downloaded and stored locally in a `.diff_tmp` directory, if working with git you can add it to `.gitignore`. The options for this command: - `--fmt / --no-fmt` : Format files before doing the diff, default is True so both files match the format - `--no-color` : Don't colorize diff - `--no-verbose` : List the resources changed not the content of the diff ## tb fmt¶ Formats a .datasource, .pipe or .incl file. Implementation is based in the ClickHouse® dialect of [shandy-sqlfmt](https://pypi.org/project/shandy-sqlfmt/) adapted to Tinybird datafiles. These are the options available for the `fmt` command: - `--line-length INTEGER` : A number indicating the maximum characters per line in the Node SQL, lines will be split based on the SQL syntax and the number of characters passed as a parameter - `--dry-run` : Don't ask to override the local file - `--yes` : Do not ask for confirmation to overwrite the local file - `--diff` : Formats local file, prints the diff and exits 1 if different, 0 if equal This command removes comments starting with # from the file, so use DESCRIPTION or a comment block instead: ##### Example comment block % {% comment this is a comment and fmt will keep it %} SELECT {% comment this is another comment and fmt will keep it %} count() c FROM stock_prices_1m You can add `tb fmt` to your git `pre-commit` hook to have your files properly formatted. If the SQL formatting results are not the ones expected to you, you can disable it just for the blocks needed. Read [how to disable fmt](https://docs.sqlfmt.com/getting-started/disabling-sqlfmt). ## tb init¶ Initializes folder layout. It comes with these options: - `--generate-datasources` : Generate Data Sources based on CSV, NDJSON and Parquet files in this folder - `--folder DIRECTORY` : Folder where datafiles will be placed - `-f, --force` : Overrides existing files - `-ir, --ignore-remote` : Ignores remote files not present in the local data project on `tb init --git` - `--git` : Init Workspace with Git commits. - `--override-commit TEXT` : Use this option to manually override the reference commit of your Workspace. This is useful if a commit is not recognized in your Git log, such as after a force push ( `git push -f` ). 
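As a quick illustration of how these commands fit together, the following sketch bootstraps a project folder, checks datafile formatting (for example in a pre-commit hook or CI job), and lists which resources differ from the Workspace. The folder and file names are hypothetical:

##### Example workflow

# Create the folder layout and generate Data Sources from local CSV, NDJSON, or Parquet files
tb init --folder tinybird --generate-datasources

# Fail if a datafile is not properly formatted (exits 1 when the formatted output differs)
tb fmt tinybird/datasources/events.datasource --diff

# List the resources that changed compared to the Workspace, without printing the full diff
tb diff --no-verbose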
## tb materialize¶ Analyzes the `node_name` SQL query to generate the .datasource and .pipe files needed to push a new materialize view. This command guides you to generate the Materialized View with name TARGET_DATASOURCE, the only requirement is having a valid Pipe datafile locally. Use `tb pull` to download resources from your Workspace when needed. It allows to use these options: - `--push-deps` : Push dependencies, disabled by default - `--workspace TEXT...` : Add a Workspace path to the list of external Workspaces, usage: `--workspace name path/to/folder` - `--no-versions` : When set, resource dependency versions are not used, it pushes the dependencies as-is - `--verbose` : Prints more log - `--unlink-on-populate-error` : If the populate job fails the Materialized View is unlinked and new data won't be ingested in the Materialized View. First time a populate job fails, the Materialized View is always unlinked. ## tb pipe¶ Use the following commands to manage Pipes. | Command | Description | Options | | --- | --- | --- | | append OPTIONS PIPE_NAME_OR_UID SQL | Append a Node to a Pipe | | | copy pause OPTIONS PIPE_NAME_OR_UID | Pause a running Copy Pipe | | | copy resume OPTIONS PIPE_NAME_OR_UID | Resume a paused Copy Pipe | | | copy run OPTIONS PIPE_NAME_OR_UID | Run an on-demand copy job | `--wait` : Wait for the copy job to finish `--yes` : Do not ask for confirmation `--param TEXT` : Key and value of the params you want the Copy Pipe to be called with. For example: `tb pipe copy run --param foo=bar` | | data OPTIONS PIPE_NAME_OR_UID PARAMETERS | Print data returned by a Pipe. You can pass query parameters to the command, for example `--param_name value` . | `--query TEXT` : Run SQL over Pipe results `--format [json|csv]` : Return format (CSV, JSON) `-- value` : Query parameter. You can define multiple parameters and their value. For example, `--paramOne value --paramTwo value2` . | | generate OPTIONS NAME QUERY | Generates a Pipe file based on a sql query. Example: `tb pipe generate my_pipe 'select * from existing_datasource'` | `--force` : Override existing files | | ls OPTIONS | List Pipes | `--match TEXT` : Retrieve any resourcing matching the pattern. For example `--match _test` `--format [json|csv]` : Force a type of the output | | populate OPTIONS PIPE_NAME | Populate the result of a Materialized Node into the target Materialized View | `--node TEXT` : Name of the materialized Node. Required `--sql-condition TEXT` : Populate with a SQL condition to be applied to the trigger Data Source of the Materialized View. For instance, `--sql-condition='date == toYYYYMM(now())'` it'll populate taking all the rows from the trigger Data Source which `date` is the current month. Use it together with `--populate` . `--sql-condition` is not taken into account if the `--subset` param is present. Including in the `sql_condition` any column present in the Data Source `engine_sorting_key` will make the populate job process less data. `--truncate` : Truncates the materialized Data Source before populating it. `--unlink-on-populate-error` : If the populate job fails the Materialized View is unlinked and new data won't be ingested in the Materialized View. First time a populate job fails, the Materialized View is always unlinked. `--wait` : Waits for populate jobs to finish, showing a progress bar. Disabled by default. 
| | publish OPTIONS PIPE_NAME_OR_ID NODE_UID | Change the published Node of a Pipe | | | regression-test OPTIONS FILENAMES | Run regression tests using last requests | `--debug` : Prints internal representation, can be combined with any command to get more information. `--only-response-times` : Checks only response times `--workspace_map TEXT..., --workspace TEXT...` : Add a Workspace path to the list of external Workspaces, usage: `--workspace name path/to/folder` `--no-versions` : When set, resource dependency versions are not used, it pushes the dependencies as-is `-l, --limit INTEGER RANGE` : Number of requests to validate [0<=x<=100] `--sample-by-params INTEGER RANGE` : When set, we will aggregate the pipe_stats_rt requests by `extractURLParameterNames(assumeNotNull(url))` and for each combination we will take a sample of N requests [1<=x<=100] `-m, --match TEXT` : Filter the checker requests by specific parameter. You can pass multiple parameters -m foo -m bar `-ff, --failfast` : When set, the checker will exit as soon one test fails `--ignore-order` : When set, the checker will ignore the order of list properties `--validate-processed-bytes` : When set, the checker will validate that the new version doesn't process more than 25% than the current version `--relative-change FLOAT` : When set, the checker will validate the new version has less than this distance with the current version | | rm OPTIONS PIPE_NAME_OR_ID | Delete a Pipe. PIPE_NAME_OR_ID can be either a Pipe name or id in the Workspace or a local path to a .pipe file | `--yes` : Do not ask for confirmation | | set_endpoint OPTIONS PIPE_NAME_OR_ID NODE_UID | Same as 'publish', change the published Node of a Pipe | | | sink run OPTIONS PIPE_NAME_OR_UID | Run an on-demand sink job | `--wait` : Wait for the sink job to finish `--yes` : Do not ask for confirmation `--dry-run` : Run the command without executing the sink job `--param TEXT` : Key and value of the params you want the Sink Pipe to be called with. For example: `tb pipe sink run --param foo=bar` | | stats OPTIONS PIPES | Print Pipe stats for the last 7 days | `--format [json]` : Force a type of the output. To parse the output, keep in mind to use `tb --no-version-warning pipe stats` option | | token_read OPTIONS PIPE_NAME | Retrieve a Token to read a Pipe | | | unlink OPTIONS PIPE_NAME NODE_UID | Unlink the output of a Pipe, whatever its type: Materialized Views, Copy Pipes, or Sinks. | | | unpublish OPTIONS PIPE_NAME NODE_UID | Unpublish the endpoint of a Pipe | | ## tb pull¶ Retrieve the latest version for project files from your Workspace. With these options: - `--folder DIRECTORY` : Folder where files will be placed - `--auto / --no-auto` : Saves datafiles automatically into their default directories (/datasources or /pipes). Default is True - `--match TEXT` : Retrieve any resourcing matching the pattern. eg `--match _test` - `-f, --force` : Override existing files - `--fmt` : Format files, following the same format as `tb fmt` ## tb push¶ Push files to your Workspace. You can use this command with these options: - `--dry-run` : Run the command with static checks, without creating resources on the Tinybird account or any side effect. Doesn't check for runtime errors. 
- `--check / --no-check` : Enable/Disable output checking, enabled by default - `--push-deps` : Push dependencies, disabled by default - `--only-changes` : Push only the resources that have changed compared to the destination Workspace - `--debug` : Prints internal representation, can be combined with any command to get more information - `-f, --force` : Override Pipes when they already exist - `--override-datasource` : When pushing a Pipe with a materialized Node if the target Data Source exists it will try to override it. - `--populate` : Populate materialized Nodes when pushing them - `--subset FLOAT` : Populate with a subset percent of the data (limited to a maximum of 2M rows), this is useful to quickly test a materialized Node with some data. The subset must be greater than 0 and lower than 0.1. A subset of 0.1 means a 10 percent of the data in the source Data Source will be used to populate the Materialized View. Use it together with `--populate` , it has precedence over `--sql-condition` - `--sql-condition TEXT` : Populate with a SQL condition to be applied to the trigger Data Source of the Materialized View. For instance, `--sql-condition='date == toYYYYMM(now())'` it'll populate taking all the rows from the trigger Data Source which `date` is the current month. Use it together with `--populate` . `--sql-condition` is not taken into account if the `--subset` param is present. Including in the `sql_condition` any column present in the Data Source `engine_sorting_key` will make the populate job process less data - `--unlink-on-populate-error` : If the populate job fails the Materialized View is unlinked and new data won't be ingested in the Materialized View. First time a populate job fails, the Materialized View is always unlinked - `--fixtures` : Append fixtures to Data Sources - `--wait` : To be used along with `--populate` command. Waits for populate jobs to finish, showing a progress bar. Disabled by default - `--yes` : Do not ask for confirmation - `--only-response-times` : Checks only response times, when --force push a Pipe - `--workspace TEXT..., --workspace_map TEXT...` : Add a Workspace path to the list of external Workspaces, usage: `--workspace name path/to/folder` - `--no-versions` : When set, resource dependency versions are not used, it pushes the dependencies as-is - `--timeout FLOAT` : Timeout you want to use for the populate job - `-l, --limit INTEGER RANGE` : Number of requests to validate [0<=x<=100] - `--sample-by-params INTEGER RANGE` : When set, we will aggregate the `pipe_stats_rt` requests by `extractURLParameterNames(assumeNotNull(url))` and for each combination we will take a sample of N requests [1<=x<=100] - `-ff, --failfast` : When set, the checker will exit as soon one test fails - `--ignore-order` : When set, the checker will ignore the order of list properties - `--validate-processed-bytes` : When set, the checker will validate that the new version doesn't process more than 25% than the current version - `--user_token TEXT` : The User Token is required for sharing a Data Source that contains the SHARED_WITH entry ## tb sql¶ Run SQL query over Data Sources and Pipes. - `--rows_limit INTEGER` : Max number of rows retrieved - `--pipeline TEXT` : The name of the Pipe to run the SQL Query - `--pipe TEXT` : The path to the .pipe file to run the SQL Query of a specific NODE - `--node TEXT` : The NODE name - `--format [json|csv|human]` : Output format - `--stats / --no-stats` : Show query stats ## tb token¶ Manage your Workspace Tokens. 
| Command | Description | Options | | --- | --- | --- | | copy OPTIONS TOKEN_ID | Copy a Token | | | ls OPTIONS | List Tokens | `--match TEXT` : Retrieve any Token matching the pattern. eg `--match _test` | | refresh OPTIONS TOKEN_ID | Refresh a Token | `--yes` : Do not ask for confirmation | | rm OPTIONS TOKEN_ID | Remove a Token | `--yes` : Do not ask for confirmation | | scopes OPTIONS TOKEN_ID | List Token scopes | | | create static OPTIONS TOKEN_NAME | Create a static Token that doesn't expire. | `--scope` : Scope for the Token (e.g., `DATASOURCES:READ` ). Required. `--resource` : Resource you want to associate the scope with. `--filter` : SQL condition used to filter the values when calling with this token (eg. `--filter=value > 0` ) | | create jwt OPTIONS TOKEN_NAME | Create a JWT Token with a fixed expiration time. | `--ttl` : Time to live (e.g., '1h', '30min', '1d'). Required. `--scope` : Scope for the token (only `PIPES:READ` is allowed for JWT tokens). Required. `--resource` : Resource associated with the scope. Required. `--fixed-params` : Fixed parameters in key=value format, multiple values separated by commas | ## tb workspace¶ Manage your Workspaces. | Command | Description | Options | | --- | --- | --- | | clear OPTIONS | Drop all the resources inside a project. This command is dangerous because it removes everything, use with care. | `--yes` : Do not ask for confirmation `--dry-run` : Run the command without removing anything | | create OPTIONS WORKSPACE_NAME | Create a new Workspace for your Tinybird user | `--starter_kit TEXT` : Use a Tinybird starter kit as a template `--user_token TEXT` : When passed, we won't prompt asking for it `--fork` : When enabled, we will share all Data Sources from the current Workspace to the newly created one | | current OPTIONS | Show the Workspace you're currently authenticated to | | | delete OPTIONS WORKSPACE_NAME_OR_ID | Delete a Workspace where you are an admin | `--user_token TEXT` : When passed, we won't prompt asking for it `--yes` : Do not ask for confirmation | | ls OPTIONS | List all the Workspaces you have access to in the account you're currently authenticated to | | | members add OPTIONS MEMBERS_EMAILS | Adds members to the current Workspace | `--user_token TEXT` : When passed, we won't prompt asking for it | | members ls OPTIONS | List members in the current Workspace | | | members rm OPTIONS | Removes members from the current Workspace | `--user_token TEXT` : When passed, we won't prompt asking for it | | members set-role OPTIONS [guest|viewer|admin] MEMBERS_EMAILS | Sets the role for existing Workspace members | `--user_token TEXT` : When passed, we won't prompt asking for it | | use OPTIONS WORKSPACE_NAME_OR_ID | Switch to another Workspace. Use 'tb workspace ls' to list the Workspaces you have access to | | ## tb tag¶ Manage your Workspace tags. | Command | Description | Options | | --- | --- | --- | | create TAG_NAME | Creates a tag in the current Workspace. | | | ls | List all the tags of the current Workspace. | | | ls TAG_NAME | List all the resources tagged with the given tag. | | | rm TAG_NAME | Removes a tag from the current Workspace. The tag is removed from all resources that were tagged with it. | `--yes` : Do not ask for confirmation | Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc.
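For example, using the options listed above, you could create a static Token scoped to a single Data Source and a JWT Token restricted to one endpoint. The Token, Data Source, and Pipe names below are illustrative:

##### Create Tokens from the CLI

# Static Token that can only read the events Data Source
tb token create static events_read_token --scope DATASOURCES:READ --resource events

# JWT Token for a single endpoint, valid for one hour, with a fixed parameter
tb token create jwt dashboard_token --ttl 1h --scope PIPES:READ --resource top_products --fixed-params user_id=3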
--- URL: https://www.tinybird.co/docs/cli/common-use-cases Last update: 2024-11-12T11:45:41.000Z Content: --- title: "CLI common use cases · Tinybird Docs" theme-color: "#171612" description: "This document shows some common use cases where the Command Line Interface (CLI) can help you on your day to day workflow." --- # Common use cases¶ The following uses cases illustrate how Tinybird CLI solve common situations using available commands. ## Download Pipes and data sources from your account¶ There are two ways you can start working with the CLI. You can either [start a new data project](https://www.tinybird.co/docs/docs/cli/quick-start) from scratch, or if you already have some data and API Endpoints in your Tinybird account, pull it to your local disk to continue working from there. For this second option, use the `--match` flag to filter Pipes or data sources containing the string passed as parameter. For instance, to pull all the files named `project`: ##### Pull all the project files tb pull --match project [D] writing project.datasource(demo) [D] writing project_geoindex.datasource(demo) [D] writing project_geoindex_pipe.pipe(demo) [D] writing project_agg.pipe(demo) [D] writing project_agg_API_endpoint_request_log_pipe_3379.pipe(demo) [D] writing project_exploration.pipe(demo) [D] writing project_moving_avg.pipe(demo) The pull command does not preserve the directory structure, so all your datafiles are downloaded to your current directory. Once the files are pulled, you can `diff` or `push` the changes to your source control repository and continue working from the command line. When you pull data sources or Pipes, your data is not downloaded, just the data source schemas and Pipes definition, so they can be replicated easily. ## Push the entire data project¶ ##### Push the whole project tb push --push-deps ## Push a Pipe with all its dependencies¶ ##### Push dependencies tb push pipes/mypipe.pipe --push-deps ## Adding a new column to a Data Source¶ Data Source schemas are mostly immutable, but you have the possibility to append new columns at the end of an existing Data Source with an Engine from the MergeTree Family or Null Engine. If you want to change columns, add columns in other positions, or modify the engine, you must first create a new version of the Data Source with the modified schema. Then ingest the data and finally point the Pipes to this new API Endpoint. To force a Pipe replacement use the `--force` flag when pushing it. ### Append new columns to an existing Data Source¶ As an example, imagine you have the following Data Source defined, and it has been already pushed to Tinybird: ##### Appending a new column to a Data Source SCHEMA > `test` Int16, `local_date` Date, `test3` Int64 If you want to append a new column, you must change the `*.datasource` file to add the new column `new_column` . You can append as many columns as you need at the same time: ##### Appending a new column to a Data Source SCHEMA > `test` Int16, `local_date` Date, `test3` Int64, `new_column` Int64 Remember that when **appending or deleting columns to an existing Data Source** , the engine of that Data Source must be of the **MergeTree** family. After appending the new column, execute `tb push my_datasource.datasource --force` and confirm the addition of the column(s). The `--force` parameter is required for this kind of operation. Existing imports will continue working once the new columns are added, even if those imports don't carry values for the added columns. 
In those cases, the new columns contain empty values like `0` for numeric values or `''` for Strings, or if defined, the default values in the schema. ### Create a new version of the Data Source to make additional add/change column operations¶ To create a new version of a Data Source, create a separate datafile with a different name. You can choose a helpful naming convention such as adding a `_version` suffix (e.g. `my_ds_1.datasource` ). ## Debug mode¶ When you work with Pipes that use several versions of different data sources, you might need to double check which version of which Data Source the Pipe is pointing at before you push it to your Tinybird account. To do so, use the `--dry-run --debug` flags like this: ##### Debug mode tb push my_pipe.pipe --dry-run --debug After you've validated the content of the Pipe, push your Pipe as normal. ## Automatic regression tests for your API Endpoints¶ Any time you `--force` push a Pipe which has a public API Endpoint that has received requests, some automatic regression tests are executed. If the previous version of the API Endpoint returns the same data as the version you are pushing, the CLI checks for the top ten requests. This can help you validate whether you are introducing a regression in your API. Other times, you are consciously `--force` pushing a new version which returns different data. In that case you can avoid the regression tests with the `--no-check` flag: ##### Avoid regression tests tb push my_may_view_pipe.pipe --force --no-check When pushing a Pipe with a public API Endpoint, the API Endpoint will be maintained based on the Node name. If the existing API Endpoint Node is renamed, the last Node of the Pipe will be recreated as an API Endpoint. The latter option is not an atomic operation: The API Endpoint will be down for a few moments while the new API Endpoint is created. Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/cli/data-projects Last update: 2024-11-13T15:42:17.000Z Content: --- title: "Organize files in data projects · Tinybird Docs" theme-color: "#171612" description: "Learn how to best organize your Tinybird CLI files in versioned projects." --- # Organize files in data projects¶ A data project is a set of files that describes how your data must be stored, processed, and exposed through APIs. In the same way you maintain source code files in a repository, use a CI, make deployments, run tests, and so on, Tinybird provides tools to work following a similar pattern but with data pipelines. The source code of your project are the [datafiles](https://www.tinybird.co/docs/docs/cli/datafiles/overview) in Tinybird. With a data project you can: - Define how the data should flow, from schemas to API Endpoints. - Manage your datafiles using version control. - Branch your datafiles. - Run tests. - Deploy data projects. ## Ecommerce site example¶ Consider an ecommerce site where you have events from users and a list of products with their attributes. Your goal is to expose several API endpoints to return sales per day and top product per day. 
The data project file structure would look like the following: ecommerce_data_project/ datasources/ events.datasource products.datasource fixtures/ events.csv products.csv pipes/ top_product_per_day.pipe endpoints/ sales.pipe top_products.pipe To follow this tutorial, download and open the example using the following commands: ##### Clone demo git clone https://github.com/tinybirdco/ecommerce_data_project.git cd ecommerce_data_project ### Upload the project¶ You can push the whole project to your Tinybird account to check everything is fine. The `tb push` command uploads the project files to Tinybird, first checking the project dependencies and the SQL syntax. In this case, use the `--push-deps` flag to push everything: ##### Push dependencies tb push --push-deps After the upload completes, the endpoints defined in our project, `sales` and `top_products` , are available and you can start pushing data to the different Data Sources. ### Define Data Sources¶ Data Sources define how your data is ingested and stored. You can add data to Data Sources using the [Data Sources API](https://www.tinybird.co/docs/docs/api-reference/datasource-api). Each Data Source is defined by a schema and other properties. See [Datasource files](https://www.tinybird.co/docs/docs/cli/datafiles/datasource-files). The following snippet shows the content of the `events.datasource` file from the ecommerce example: DESCRIPTION > # Events from users This contains all the events produced by Kafka, there are 4 fixed columns, plus a `json` column which contains the rest of the data for that event. See [documentation](https://www.tinybird.co/docs/url_for_docs) for the different events. SCHEMA > timestamp DateTime, product String, user_id String, action String, json String ENGINE MergeTree ENGINE_SORTING_KEY timestamp The file describes the schema and how the data is sorted. In this case, the access pattern is most of the time by the `timestamp` column. If no `SORTING_KEY` is set, Tinybird picks one by default, date or datetime columns in most cases. To push the Data Source, run: ##### Push the events Data Source tb push datasources/events.datasource You can't override Data Sources. If you try to push a Data Source that already exists in your account you get an error. To override a Data Source, remove it or upload a new one with a different name. ### Define data Pipes¶ The content of the `pipes/top_product_per_day.pipe` file creates a data Pipe that transforms the data as it's inserted: NODE only_buy_events DESCRIPTION > filters all the buy events SQL > SELECT toDate(timestamp) date, product, JSONExtractFloat(json, 'price') AS price FROM events WHERE action = 'buy' NODE top_per_day SQL > SELECT date, topKState(10)(product) top_10, sumState(price) total_sales FROM only_buy_events GROUP BY date TYPE materialized DATASOURCE top_per_day_mv ENGINE AggregatingMergeTree ENGINE_SORTING_KEY date Each Pipe can have one or more Nodes. The previous Pipe defines two Nodes, `only_buy_events` and `top_per_day`. - The first Node filters `buy` events and extracts some data from the `json` column. - The second Node runs the aggregation. In general, use `NODE` to start a new Node and then use `SQL >` to define the SQL for that Node. You can use other Nodes inside the SQL. In this case, the second Node uses the first one, `only_buy_events`. To push the Pipe, run: ##### Populate tb push pipes/top_product_per_day.pipe --populate If you want to populate the Materialized View with the existing data in the `events` table, use the `--populate` flag.
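For example, to push the Pipe, backfill the Materialized View with the existing data, and wait for the populate job to finish in a single command (the `--wait` flag only makes the CLI block until the job completes):

##### Populate and wait

tb push pipes/top_product_per_day.pipe --populate --wait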
When using the `--populate` flag you get a job URL, so you can check the status of the job at the URL provided. See [Populate and copy data](https://www.tinybird.co/docs/docs/production/populate-data) for more information on how populate jobs work. ### Define API Endpoints¶ API Endpoints are the way you expose the data to be consumed. The following snippet shows the content of the `endpoints/top_products.pipe` file: NODE endpoint DESCRIPTION > returns top 10 products for the last week SQL > SELECT date, topKMerge(10)(top_10) AS top_10 FROM top_per_day WHERE date > today() - interval 7 day GROUP BY date The syntax is the same as in the data transformation Pipes, and you can access the results through the `{% user("apiHost") %}/v0/top_products.json?token=TOKEN` endpoint. When you push an endpoint, a Token with `PIPE:READ` permissions is automatically created. You can see it from the [Tokens UI](https://app.tinybird.co/tokens) or directly from the CLI with the command `tb pipe token_read `. Alternatively, you can use the `TOKEN token_name READ` instruction to automatically create a Token named `token_name` with `READ` permissions over the endpoint, or to add `READ` permissions over the endpoint to the existing `token_name`. For example: TOKEN public_read_token READ NODE endpoint DESCRIPTION > returns top 10 products for the last week SQL > SELECT date, topKMerge(10)(top_10) AS top_10 FROM top_per_day WHERE date > today() - interval 7 day GROUP BY date To push the endpoint, run: ##### Push the top products Pipe tb push endpoints/top_products.pipe The Token `public_read_token` was created automatically and it's provided in the test URL. You can add parameters to any endpoint. For example, parametrize the dates to be able to filter the data between two dates: NODE endpoint DESCRIPTION > returns top 10 products for the last week SQL > % SELECT date, topKMerge(10)(top_10) AS top_10 FROM top_per_day WHERE date between {{Date(start)}} AND {{Date(end)}} GROUP BY date Now, the endpoint can receive `start` and `end` parameters: `{% user("apiHost") %}/v0/top_products.json?start=2018-09-07&end=2018-09-17&token=TOKEN`. You can print the results from the CLI using the `pipe data` command. For instance, for the previous example: ##### Print the results of the top products endpoint tb pipe data top_products --start '2018-09-07' --end '2018-09-17' --format CSV For parameter templating to work, you need to start your NODE SQL definition with the `%` character. ### Override an endpoint or a data Pipe¶ When working on a project, you might need to push several versions of the same file. You can override a Pipe that has already been pushed using the `--force` flag. For example: ##### Override the Pipe tb push endpoints/top_products_params.pipe --force If the endpoint has been called before, it runs regression tests with the most frequent requests. If the new version doesn't return the same data, then it's not pushed. The output shows how to run all the requests that were tested. You can force the push without running the checks using the `--no-check` flag if needed. For example: ##### Force override tb push endpoints/top_products_params.pipe --force --no-check ### Downloading datafiles from Tinybird¶ You can download datafiles using the `pull` command. For example: ##### Pull a specific file tb pull --match endpoint_im_working_on The previous command downloads the `endpoint_im_working_on.pipe` file to the current folder.
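Once the project is pushed, you can load the example fixtures to try the endpoints end to end. The following is a minimal sketch that assumes the fixture files from the project layout above and uses the `tb datasource append` command shown elsewhere in these docs:
##### Append the fixture data
tb datasource append events fixtures/events.csv
tb datasource append products fixtures/products.csv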
--- URL: https://www.tinybird.co/docs/cli/datafiles/datasource-files Last update: 2024-11-15T11:01:39.000Z Content: --- title: "Datasource files · Tinybird Docs" theme-color: "#171612" description: "Datasource files describe your Data Sources. Define the schema, engine, and other settings." --- # Datasource files (.datasource)¶ Datasource files describe your Data Sources. You can use .datasource files to define the schema, engine, and other settings of your Data Sources. See [Data Sources](https://www.tinybird.co/docs/docs/concepts/data-sources). ## Available instructions¶ The following instructions are available for .datasource files. | Declaration | Required | Description | | --- | --- | --- | | `SCHEMA ` | Yes | Defines a block for a Data Source schema. The block must be indented. | | `DESCRIPTION ` | No | Description of the Data Source. | | `TOKEN APPEND` | No | Grants append access to a Data Source to the Token with name `` . If the token doesn't exist, it's automatically created. | | `TAGS ` | No | Comma-separated list of tags. Tags are used to[ organize your data project](https://www.tinybird.co/docs/docs/production/organizing-resources) . | | `ENGINE ` | No | Sets the ClickHouse® Engine for Data Source. Default value is `MergeTree` . | | `ENGINE_SORTING_KEY ` | No | Sets the `ORDER BY` expression for the Data Source. If unset, it defaults to DateTime, numeric, or String columns, in that order. | | `ENGINE_PARTITION_KEY ` | No | Sets the `PARTITION` expression for the Data Source. | | `ENGINE_TTL ` | No | Sets the `TTL` expression for the Data Source. | | `ENGINE_VER ` | No | Column with the version of the object state. Required when using `ENGINE ReplacingMergeTree` . | | `ENGINE_SIGN ` | No | Column to compute the state. Required when using `ENGINE CollapsingMergeTree` or `ENGINE VersionedCollapsingMergeTree` . | | `ENGINE_VERSION ` | No | Column with the version of the object state. Required when `ENGINE VersionedCollapsingMergeTree` . | | `ENGINE_SETTINGS ` | No | Comma-separated list of key-value pairs that describe ClickHouse® engine settings for the Data Source. | | `SHARED_WITH ` | No | Shares the Data Source with one or more Workspaces. Use in combination with `--user_token` with admin rights in the origin Workspace. | The following example shows a typical .datasource file: ##### tinybird/datasources/example.datasource # A comment TOKEN tracker APPEND DESCRIPTION > Analytics events **landing data source** TAGS stock, recommendations SCHEMA > `timestamp` DateTime `json:$.timestamp`, `session_id` String `json:$.session_id`, `action` LowCardinality(String) `json:$.action`, `version` LowCardinality(String) `json:$.version`, `payload` String `json:$.payload` ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYYYYMM(timestamp)" ENGINE_SORTING_KEY "timestamp" ENGINE_TTL "timestamp + toIntervalDay(60)" ENGINE_SETTINGS "index_granularity=8192" SHARED_WITH > analytics_production analytics_staging ### SCHEMA¶ A `SCHEMA` declaration is a newline, comma-separated list of columns definitions. For example: ##### Example SCHEMA declaration SCHEMA > `timestamp` DateTime `json:$.timestamp`, `session_id` String `json:$.session_id`, `action` LowCardinality(String) `json:$.action`, `version` LowCardinality(String) `json:$.version`, `payload` String `json:$.payload` Each column in a `SCHEMA` declaration is in the format ` ` , where: - `` is the name of the column in the Data Source. - `` is one of the supported[ Data Types](https://www.tinybird.co/docs/docs/concepts/data-sources#supported-data-types) . 
- `` is optional and only required for NDJSON Data Sources. See[ JSONpaths](https://www.tinybird.co/docs/docs/guides/ingesting-data/ingest-ndjson-data#jsonpaths) . - `` sets a default value to the column when it's null. A common use case is to set a default date to a column, like `updated_at DateTime DEFAULT now()` . To change or update JSONPaths or other default values in the schema, push a new version of the schema using `tb push --force` or use the [alter endpoint on the Data Sources API](https://www.tinybird.co/docs/docs/api-reference/datasource-api#post--v0-datasources-(.+)-alter). ### JSONPath expressions¶ `SCHEMA` definitions support JSONPath expressions. For example: ##### Schema syntax with jsonpath DESCRIPTION Generated from /Users/username/tmp/sample.ndjson SCHEMA > `d` DateTime `json:$.d`, `total` Int32 `json:$.total`, `from_novoa` Int16 `json:$.from_novoa` See [JSONPaths](https://www.tinybird.co/docs/docs/guides/ingesting-data/ingest-ndjson-data#jsonpaths) for more information. ### ENGINE settings¶ `ENGINE` declares the ClickHouse® engine used for the Data Source. The default value is `MergeTree`. The supported values for `ENGINE` are the following: - `MergeTree` - `ReplacingMergeTree` - `SummingMergeTree` - `AggregatingMergeTree` - `CollapsingMergeTree` - `VersionedCollapsingMergeTree` - `Null` Read [Supported engine and settings](https://www.tinybird.co/docs/docs/concepts/data-sources#supported-engines-settings) for more information. ## Connectors¶ Connectors settings are part of the .datasource content. You can use include files to reuse connection settings and credentials. ### Kafka, Confluent, RedPanda¶ The Kafka, Confluent, and RedPanda connectors use the following settings: | Instruction | Required | Description | | --- | --- | --- | | `KAFKA_CONNECTION_NAME` | Yes | The name of the configured Kafka connection in Tinybird. | | `KAFKA_BOOTSTRAP_SERVERS` | Yes | Comma-separated list of one or more Kafka brokers, including Port numbers. | | `KAFKA_KEY` | Yes | Key used to authenticate with Kafka. Sometimes called Key, Client Key, or Username, depending on the Kafka distribution. | | `KAFKA_SECRET` | Yes | Secret used to authenticate with Kafka. Sometimes called Secret, Secret Key, or Password, depending on the Kafka distribution. | | `KAFKA_TOPIC` | Yes | Name of the Kafka topic to consume from. | | `KAFKA_GROUP_ID` | Yes | Consumer Group ID to use when consuming from Kafka. | | `KAFKA_AUTO_OFFSET_RESET` | No | Offset to use when no previous offset can be found, for example when creating a new consumer. Supported values are `latest` , `earliest` . Default: `latest` . | | `KAFKA_STORE_HEADERS` | No | Store Kafka headers as field `__headers` for later processing. Default value is `'False'` . | | `KAFKA_STORE_BINARY_HEADERS` | No | Stores all Kafka headers as binary data in field `__headers` as a binary map of type `Map(String, String)` . To access the header `'key'` run: `__headers['key']` . Default value is `'True'` . This field only applies if `KAFKA_STORE_HEADERS` is set to `True` . | | `KAFKA_STORE_RAW_VALUE` | No | Stores the raw message in its entirety as an additional column. Supported values are `'True'` , `'False'` . Default: `'False'` . | | `KAFKA_SCHEMA_REGISTRY_URL` | No | URL of the Kafka schema registry. | | `KAFKA_TARGET_PARTITIONS` | No | Target partitions to place the messages. | | `KAFKA_KEY_AVRO_DESERIALIZATION` | No | Key for decoding Avro messages. | | `KAFKA_SSL_CA_PEM` | No | CA certificate in PEM format for SSL connections. 
| | `KAFKA_SASL_MECHANISM` | No | SASL mechanism to use for authentication. Supported values are `'PLAIN'` , `'SCRAM-SHA-256'` , `'SCRAM-SHA-512'` . Default values is `'PLAIN'` . | The following example defines a Data Source with a new Kafka, Confluent, or RedPanda connection in a .datasource file: ##### Data Source with a new Kafka/Confluent/RedPanda connection SCHEMA > `value` String, `topic` LowCardinality(String), `partition` Int16, `offset` Int64, `timestamp` DateTime, `key` String ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYYYYMM(timestamp)" ENGINE_SORTING_KEY "timestamp" KAFKA_CONNECTION_NAME my_connection_name KAFKA_BOOTSTRAP_SERVERS my_server:9092 KAFKA_KEY my_username KAFKA_SECRET my_password KAFKA_TOPIC my_topic KAFKA_GROUP_ID my_group_id The following example defines a Data Source that uses an existing Kafka, Confluent, or RedPanda connection: ##### Data Source with an existing Kafka/Confluent/RedPanda connection SCHEMA > `value` String, `topic` LowCardinality(String), `partition` Int16, `offset` Int64, `timestamp` DateTime, `key` String ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYYYYMM(timestamp)" ENGINE_SORTING_KEY "timestamp" KAFKA_CONNECTION_NAME my_connection_name KAFKA_TOPIC my_topic KAFKA_GROUP_ID my_group_id Refer to the [Kafka Connector](https://www.tinybird.co/docs/docs/ingest/kafka), [Confluent Connector](https://www.tinybird.co/docs/docs/ingest/confluent) , or [RedPanda Connector](https://www.tinybird.co/docs/docs/ingest/redpanda) documentation for more details. ### BigQuery and Snowflake¶ The BigQuery and Snowflake connectors use the following settings: | Instruction | Required | Description | | --- | --- | --- | | `IMPORT_SERVICE` | Yes | Name of the import service to use. Use `bigquery` or `snowflake` . | | `IMPORT_SCHEDULE` | Yes | Cron expression, in UTC time, with the frequency to run imports. Must be higher than 5 minutes. For example, `*/5 * * * *` . Use `@auto` to sync once per minute when using `s3` , or `@on-demand` to only run manually. | | `IMPORT_CONNECTION_NAME` | Yes | Name given to the connection inside Tinybird. For example, `'my_connection'` . | | `IMPORT_STRATEGY` | Yes | Strategy to use when inserting data. Use `REPLACE` for BigQuery and Snowflake. | | `IMPORT_EXTERNAL_DATASOURCE` | No | Fully qualified name of the source table in BigQuery and Snowflake. For example, `project.dataset.table` . | | `IMPORT_QUERY` | No | The `SELECT` query to retrieve your data from BigQuery or Snowflake when you don't need all the columns or want to make a transformation before ingest. The `FROM` clause must reference a table using the full scope. For example, `project.dataset.table` . | See [BigQuery Connector](https://www.tinybird.co/docs/docs/ingest/bigquery) or [Snowflake Connector](https://www.tinybird.co/docs/docs/ingest/snowflake) for more details. 
#### BigQuery example¶ The following example shows a BigQuery Data Source described in a .datasource file: ##### Data Source with a BigQuery connection DESCRIPTION > bigquery demo data source SCHEMA > `timestamp` DateTime `json:$.timestamp`, `id` Integer `json:$.id`, `orderid` LowCardinality(String) `json:$.orderid`, `status` LowCardinality(String) `json:$.status`, `amount` Integer `json:$.amount` ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYYYYMM(timestamp)" ENGINE_SORTING_KEY "timestamp" ENGINE_TTL "timestamp + toIntervalDay(60)" IMPORT_SERVICE bigquery IMPORT_SCHEDULE */5 * * * * IMPORT_EXTERNAL_DATASOURCE mydb.raw.events IMPORT_STRATEGY REPLACE IMPORT_QUERY > select timestamp, id, orderid, status, amount from mydb.raw.events #### Snowflake example¶ The following example shows a Snowflake Data Source described in a .datasource file: ##### tinybird/datasources/snowflake.datasource - Data Source with a Snowflake connection DESCRIPTION > Snowflake demo data source SCHEMA > `timestamp` DateTime `json:$.timestamp`, `id` Integer `json:$.id`, `orderid` LowCardinality(String) `json:$.orderid`, `status` LowCardinality(String) `json:$.status`, `amount` Integer `json:$.amount` ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYYYYMM(timestamp)" ENGINE_SORTING_KEY "timestamp" ENGINE_TTL "timestamp + toIntervalDay(60)" IMPORT_SERVICE snowflake IMPORT_CONNECTION_NAME my_snowflake_connection IMPORT_EXTERNAL_DATASOURCE mydb.raw.events IMPORT_SCHEDULE */5 * * * * IMPORT_STRATEGY REPLACE IMPORT_QUERY > select timestamp, id, orderid, status, amount from mydb.raw.events ### S3¶ The S3 connector uses the following settings: | Instruction | Required | Description | | --- | --- | --- | | `IMPORT_SERVICE` | Yes | Name of the import service to use. Use `s3` for S3 connections. | | `IMPORT_CONNECTION_NAME` | Yes | Name given to the connection inside Tinybird. For example, `'my_connection'` . | | `IMPORT_STRATEGY` | Yes | Strategy to use when inserting data. Use `APPEND` for S3 connections. | | `IMPORT_BUCKET_URI` | Yes | Full bucket path, including the `s3://` protocol, bucket name, object path, and an optional pattern to match against object keys. For example, `s3://my-bucket/my-path` discovers all files in the bucket `my-bucket` under the prefix `/my-path` . You can use patterns in the path to filter objects, for example, ending the path with `*.csv` matches all objects that end with the `.csv` suffix. | | `IMPORT_FROM_DATETIME` | No | Sets the date and time from which to start ingesting files on an S3 bucket. The format is `YYYY-MM-DDTHH:MM:SSZ` . | See [S3 Connector](https://www.tinybird.co/docs/docs/ingest/s3) for more details. #### S3 example¶ The following example shows an S3 Data Source described in a .datasource file: ##### tinybird/datasources/s3.datasource - Data Source with an S3 connection DESCRIPTION > Analytics events landing data source SCHEMA > `timestamp` DateTime `json:$.timestamp`, `session_id` String `json:$.session_id`, `action` LowCardinality(String) `json:$.action`, `version` LowCardinality(String) `json:$.version`, `payload` String `json:$.payload` ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYYYYMM(timestamp)" ENGINE_SORTING_KEY "timestamp" ENGINE_TTL "timestamp + toIntervalDay(60)" IMPORT_SERVICE s3 IMPORT_CONNECTION_NAME connection_name IMPORT_BUCKET_URI s3://my-bucket/*.csv IMPORT_SCHEDULE @auto IMPORT_STRATEGY APPEND Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. 
--- URL: https://www.tinybird.co/docs/cli/datafiles/include-files Last update: 2024-11-16T15:39:34.000Z Content: --- title: "Include files · Tinybird Docs" theme-color: "#171612" description: "Include files help you organize settings so that you can reuse them across .datasource and .pipe files." --- # Include files (.incl)¶ Include files (.incl) help separate connector settings and reuse them across multiple .datasource files or .pipe templates. Include files are referenced using `INCLUDE` instruction. ## Connector settings¶ Use .incl files to separate connector settings from .datasource files. For example, the following .incl file contains Kafka Connector settings: ##### tinybird/datasources/connections/kafka\_connection.incl KAFKA_CONNECTION_NAME my_connection_name KAFKA_BOOTSTRAP_SERVERS my_server:9092 KAFKA_KEY my_username KAFKA_SECRET my_password While the .datasource file only contains a reference to the .incl file using `INCLUDE`: ##### tinybird/datasources/kafka\_ds.datasource SCHEMA > `value` String, `topic` LowCardinality(String), `partition` Int16, `offset` Int64, `timestamp` DateTime, `key` String ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYYYYMM(timestamp)" ENGINE_SORTING_KEY "timestamp" INCLUDE "connections/kafka_connection.incl" KAFKA_TOPIC my_topic KAFKA_GROUP_ID my_group_id ### Pipe nodes¶ You can use .incl datafiles to [reuse node templates](https://www.tinybird.co/docs/docs/cli/advanced-templates#reusing-templates). For example, the following .incl file contains a node template: ##### tinybird/includes/only\_buy\_events.incl NODE only_buy_events SQL > SELECT toDate(timestamp) date, product, color, JSONExtractFloat(json, 'price') as price FROM events where action = 'buy' The .pipe file starts with the `INCLUDE` reference to the template: ##### tinybird/endpoints/sales.pipe INCLUDE "../includes/only_buy_events.incl" NODE endpoint DESCRIPTION > return sales for a product with color filter SQL > % select date, sum(price) total_sales from only_buy_events where color in {{Array(colors, 'black')}} group by date A different .pipe file can reuse the sample template: ##### tinybird/pipes/top\_per\_day.pipe INCLUDE "../includes/only_buy_events.incl" NODE top_per_day SQL > SELECT date, topKState(10)(product) top_10, sumState(price) total_sales from only_buy_events group by date TYPE MATERIALIZED DATASOURCE mv_top_per_day ### Include with variables¶ You can templatize .incl files. For instance you can reuse the same .incl template with different variable values: ##### tinybird/includes/top\_products.incl NODE endpoint DESCRIPTION > returns top 10 products for the last week SQL > % select date, topKMerge(10)(top_10) as top_10 from top_product_per_day {% if '$DATE_FILTER' == 'last_week' %} where date > today() - interval 7 day {% else %} where date between {{Date(start)}} and {{Date(end)}} {% end %} group by date The `$DATE_FILTER` parameter is a variable in the .incl file. The following examples show how to create two separate endpoints by injecting a value for the `DATE_FILTER` variable. 
The following .pipe file references the template using a `last_week` value for `DATE_FILTER`: ##### tinybird/endpoints/top\_products\_last\_week.pipe INCLUDE "../includes/top_products.incl" "DATE_FILTER=last_week" Whereas the following .pipe file references the template using a `between_dates` value for `DATE_FILTER`: ##### tinybird/endpoints/top\_products\_between\_dates.pipe INCLUDE "../includes/top_products.incl" "DATE_FILTER=between_dates" ### Include with environment variables¶ Because you can expand `INCLUDE` files using the Tinybird CLI, you can use environment variables. For example, if you have configured the `KAFKA_BOOTSTRAP_SERVERS`, `KAFKA_KEY` , and `KAFKA_SECRET` environment variables, you can create an .incl file as follows: ##### tinybird/datasources/connections/kafka\_connection.incl KAFKA_CONNECTION_NAME my_connection_name KAFKA_BOOTSTRAP_SERVERS ${KAFKA_BOOTSTRAP_SERVERS} KAFKA_KEY ${KAFKA_KEY} KAFKA_SECRET ${KAFKA_SECRET} You can then use the values in your .datasource datafiles: ##### tinybird/datasources/kafka\_ds.datasource SCHEMA > `value` String, `topic` LowCardinality(String), `partition` Int16, `offset` Int64, `timestamp` DateTime, `key` String ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYYYYMM(timestamp)" ENGINE_SORTING_KEY "timestamp" INCLUDE "connections/kafka_connection.incl" KAFKA_TOPIC my_topic KAFKA_GROUP_ID my_group_id Alternatively, you can create separate .incl files per environment variable: ##### tinybird/datasources/connections/kafka\_connection\_prod.incl KAFKA_CONNECTION_NAME my_connection_name KAFKA_BOOTSTRAP_SERVERS production_servers KAFKA_KEY the_kafka_key KAFKA_SECRET ${KAFKA_SECRET} ##### tinybird/datasources/connections/kafka\_connection\_stg.incl KAFKA_CONNECTION_NAME my_connection_name KAFKA_BOOTSTRAP_SERVERS staging_servers KAFKA_KEY the_kafka_key KAFKA_SECRET ${KAFKA_SECRET} And then include both depending on the environment: ##### tinybird/datasources/kafka\_ds.datasource SCHEMA > `value` String, `topic` LowCardinality(String), `partition` Int16, `offset` Int64, `timestamp` DateTime, `key` String ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYYYYMM(timestamp)" ENGINE_SORTING_KEY "timestamp" INCLUDE "connections/kafka_connection_${TB_ENV}.incl" KAFKA_TOPIC my_topic KAFKA_GROUP_ID my_group_id Where `$TB_ENV` is one of `stg` or `prod`. See [deploy to staging and production environments](https://www.tinybird.co/docs/docs/production/staging-and-production-workspaces) to learn how to leverage environment variables. --- URL: https://www.tinybird.co/docs/cli/datafiles/overview Last update: 2024-11-13T15:42:17.000Z Content: --- title: "Datafiles · Tinybird Docs" theme-color: "#171612" description: "Datafiles describe your Tinybird resources: Data Sources, Pipes, and so on. They're the source code of your project." --- # Datafiles¶ Datafiles describe your Tinybird resources, like Data Sources, Pipes, and so on. They're the source code of your project. You can use datafiles to manage your projects as source code and take advantage of version control. Tinybird CLI helps you produce and push datafiles to the Tinybird platform. ## Types of datafiles¶ Tinybird uses the following types of datafiles: - Datasource files (.datasource) represent Data Sources. See[ Datasource files](https://www.tinybird.co/docs/docs/cli/datafiles/datasource-files) . - Pipe files (.pipe) represent Pipes of various types. See[ Pipe files](https://www.tinybird.co/docs/docs/cli/datafiles/pipe-files) . 
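As a sketch of how this can look in practice, assuming the CLI expands the environment variables available in your shell as described above, you could select the staging settings before pushing:
##### Push using the staging connection settings
export TB_ENV=stg
export KAFKA_SECRET=your_staging_secret
tb push datasources/kafka_ds.datasource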
- Include files (.incl) are reusable fragments you can include in .datasource or .pipe files. See[ Include files](https://www.tinybird.co/docs/docs/cli/datafiles/include-files) . ## Syntactic conventions¶ Datafiles follow the same syntactic conventions. ### Casing¶ Instructions always appear at the beginning of a line in upper case. For example: ##### Basic syntax COMMAND value ANOTHER_INSTR "Value with multiple words" ### Multiple lines¶ Instructions can span multiples lines. For example: ##### Multiline syntax SCHEMA > `d` DateTime, `total` Int32, `from_novoa` Int16 ## File structure¶ The following example shows a typical `tinybird` project directory that includes subdirectories for supported types: ##### Example file structure tinybird ├── datasources/ │ └── connections/ │ └── my_connector_name.incl │ └── my_datasource.datasource ├── endpoints/ ├── includes/ ├── pipes/ ## Next steps¶ - Understand[ CI/CD processes on Tinybird](https://www.tinybird.co/docs/docs/production/continuous-integration) . - Read about[ implementing test strategies](https://www.tinybird.co/docs/docs/production/implementing-test-strategies) . --- URL: https://www.tinybird.co/docs/cli/datafiles/pipe-files Last update: 2024-11-13T15:42:17.000Z Content: --- title: "Pipe files · Tinybird Docs" theme-color: "#171612" description: "Pipe files describe your Tinybird Pipes. Define the type, Data Source, and other settings." --- # Pipe files (.pipe)¶ Pipe files describe your Pipes. You can use .pipe files to define the type, starting node, Data Source, and other settings of your Pipes. See [Data Sources](https://www.tinybird.co/docs/docs/concepts/pipes). ## Available instructions¶ The following instructions are available for .pipe files. | Instruction | Required | Description | | --- | --- | --- | | `%` | No | Use as the first character of a node to indicate the node uses the[ templating system](https://www.tinybird.co/docs/docs/cli/advanced-templates#template-functions) . | | `DESCRIPTION ` | No | Sets the description for a node or the complete file. | | `TAGS ` | No | Comma-separated list of tags. Tags are used to[ organize your data project](https://www.tinybird.co/docs/docs/production/organizing-resources) . | | `NODE ` | Yes | Starts the definition of a new node. All the instructions until a new `NODE` instruction or the end of the file are related to this node. | | `SQL ` | Yes | Defines a block for the SQL of a node. The block must be indented. | | `INCLUDE ` | No | Includes are pieces of a Pipe that you can reuse in multiple .pipe files. | | `TYPE ` | No | Sets the type of the node. Valid values are `ENDPOINT` , `MATERIALIZED` , `COPY` , or `SINK` . | | `DATASOURCE ` | Yes | Required when `TYPE` is `MATERIALIZED` . Sets the destination Data Source for materialized nodes. | | `TARGET_DATASOURCE ` | Yes | Required when `TYPE` is `COPY` . Sets the destination Data Source for copy nodes. | | `TOKEN READ` | No | Grants read access to a Pipe or Endpoint to the Token with name `` . If the Token doesn't exist it's created automatically. | | `COPY_SCHEDULE` | No | Cron expression with the frequency to run copy jobs. Must be higher than 5 minutes. For example, `*/5 * * * *` . If undefined, it defaults to `@on-demand` . | | `COPY_MODE` | No | Strategy to ingest data for copy jobs. One of `append` or `replace` . If empty, the default strategy is `append` . | ## Materialized Pipe¶ In a .pipe file you can define how to materialize each row ingested in the earliest Data Source in the Pipe query to a materialized Data Source. 
Materialization happens at ingest. The following example shows how to describe a Materialized Pipe. See [Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/overview). ##### tinybird/pipes/sales\_by\_hour\_mv.pipe DESCRIPTION Materialized Pipe to aggregate sales per hour in the sales_by_hour Data Source NODE daily_sales SQL > SELECT toStartOfDay(starting_date) day, country, sum(sales) as total_sales FROM teams GROUP BY day, country TYPE MATERIALIZED DATASOURCE sales_by_hour ## Copy Pipe¶ In a .pipe file you can define how to export the result of a Pipe to a Data Source, optionally with a schedule. The following example shows how to describe a Copy Pipe. See [Copy Pipes](https://www.tinybird.co/docs/docs/publish/copy-pipes). ##### tinybird/pipes/sales\_by\_hour\_cp.pipe DESCRIPTION Copy Pipe to export sales by hour every hour to the sales_hour_copy Data Source NODE daily_sales SQL > % SELECT toStartOfDay(starting_date) day, country, sum(sales) as total_sales FROM teams WHERE day BETWEEN toStartOfDay(now()) - interval 1 day AND toStartOfDay(now()) and country = {{ String(country, 'US')}} GROUP BY day, country TYPE COPY TARGET_DATASOURCE sales_hour_copy COPY_SCHEDULE 0 * * * * ## API Endpoint Pipe¶ In a .pipe file you can define how to export the result of a Pipe as an HTTP endpoint. The following example shows how to describe an API Endpoint Pipe. See [API Endpoints](https://www.tinybird.co/docs/docs/publish/api-endpoints/overview). ##### tinybird/pipes/sales\_by\_hour\_endpoint.pipe TOKEN dashboard READ DESCRIPTION endpoint to get sales by hour filtering by date and country TAGS sales NODE daily_sales SQL > % SELECT day, country, sum(total_sales) as total_sales FROM sales_by_hour WHERE day BETWEEN toStartOfDay(now()) - interval 1 day AND toStartOfDay(now()) and country = {{ String(country, 'US')}} GROUP BY day, country NODE result SQL > % SELECT * FROM daily_sales LIMIT {{Int32(page_size, 100)}} OFFSET {{Int32(page, 0) * Int32(page_size, 100)}} ## Sink Pipe¶ The following parameters are available when defining Sink Pipes: | Instruction | Required | Description | | --- | --- | --- | | `EXPORT_SERVICE` | Yes | One of `gcs_hmac` , `s3` , `s3_iamrole` , or `kafka` . | | `EXPORT_CONNECTION_NAME` | Yes | The name of the export connection. | | `EXPORT_SCHEDULE` | No | Cron expression, in UTC time. Must be higher than 5 minutes. For example, `*/5 * * * *` . | ### Blob storage Sink¶ When setting `EXPORT_SERVICE` as one of `gcs_hmac` , `s3` , or `s3_iamrole` , you can use the following instructions: | Instruction | Required | Description | | --- | --- | --- | | `EXPORT_BUCKET_URI` | Yes | The desired bucket path for the exported file. Path must not include the filename and extension. | | `EXPORT_FILE_TEMPLATE` | Yes | Template string that specifies the naming convention for exported files. The template can include dynamic attributes between curly braces based on columns' data that will be replaced with real values when exporting. For example: `export_{category}{date,'%Y'}{2}` . | | `EXPORT_FORMAT` | Yes | Format in which the data is exported. Supported output formats are listed[ in the ClickHouse® documentation](https://clickhouse.com/docs/en/interfaces/formats#formats) . The default value is `csv` . | | `EXPORT_COMPRESSION` | No | Compression file type. Accepted values are `none` , `gz` for gzip, `br` for brotli, `xz` for LZMA, `zst` for zstd. Default value is `none` . | | `EXPORT_STRATEGY` | Yes | One of the available strategies. The default is `@new`.
| ### Kafka Sink¶ Kafka Sinks are currently in private beta. If you have any feedback or suggestions, contact Tinybird at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) or in the [Community Slack](https://www.tinybird.co/docs/docs/community). When setting `EXPORT_SERVICE` as `kafka` , you can use the following instructions: | Instruction | Required | Description | | --- | --- | --- | | `EXPORT_KAFKA_TOPIC` | Yes | The desired topic for the export data. | Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/cli/install Last update: 2024-11-15T08:19:14.000Z Content: --- title: "Install Tinybird CLI · Tinybird Docs" theme-color: "#171612" description: "Install the Tinybird CLI on Linux or macOS, or use the prebuilt Docker image." --- # Install the Tinybird CLI¶ You can install Tinybird on your local machine or use a prebuilt Docker image. Read on to learn how to install and configure Tinybird CLI for use. ## Installation¶ Install Tinybird CLI locally to use it on your machine. ### Prerequisites¶ Tinybird CLI supports Linux and macOS 10.14 and higher. Supported Python versions are 3.8, 3.9, 3.10, 3.11, 3.12. ### Install tinybird-cli¶ Create a virtual environment before installing the `tinybird-cli` package: ##### Creating a virtual environment for Python 3 python3 -m venv .venv source .venv/bin/activate Then, install `tinybird-cli`: ##### Install tinybird-cli pip install tinybird-cli ## Docker image¶ The official `tinybird-cli-docker` image provides a Tinybird CLI executable ready to use in your projects and pipelines. To run Tinybird CLI using Docker from the terminal, run the following commands: ##### Build local image setting your project path # Assuming a projects/data path docker run -v ~/projects/data:/mnt/data -it tinybirdco/tinybird-cli-docker cd mnt/data ## Authentication¶ Before you start using Tinybird CLI, check that you can authenticate by running `tb auth`: ##### Authenticate tb auth -i A list of available regions appears. Select your Tinybird region, then provide your admin Token. See [Tokens](https://www.tinybird.co/docs/docs/concepts/auth-tokens). You can also pass the Token directly with the `--token` flag. For example: ##### Authenticate tb auth --token See the API Reference docs for the [list of supported regions](https://www.tinybird.co/docs/docs/api-reference/overview#regions-and-endpoints) . You can also get the list using `tb auth ls`. The `tb auth` command saves your credentials in a .tinyb file in your current directory. Add it to your .gitignore file to avoid leaking credentials. ## Integrated help¶ After you've installed the Tinybird CLI you can access the integrated help by using the `--help` flag: ##### Integrated help tb --help You can do the same for every available command. For example: ##### Integrated command help tb datasource --help ## Telemetry¶ Starting from version 1.0.0b272, the Tinybird CLI collects telemetry on the use of the CLI commands and information about exceptions and crashes and sends it to Tinybird. Telemetry helps Tinybird improve the command-line experience. On each `tb` execution, the CLI collects information about your system, Python environment, the CLI version installed and the command you ran. All data is completely anonymous. To opt out of the telemetry feature, set the `TB_CLI_TELEMETRY_OPTOUT` environment variable to `1` or `true`. 
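For example, to opt out in your current shell session or in a CI job, export the variable before running any `tb` command:
##### Opt out of CLI telemetry
export TB_CLI_TELEMETRY_OPTOUT=1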
## Configure your shell prompt¶ You can extract the current Tinybird Workspace and region from your .tinyb file and show it in your zsh or bash shell prompt. To extract the information programmatically, paste the following function to your shell config file: ##### Parse the .tinyb file to use the output in the PROMPT prompt_tb() { if [ -e ".tinyb" ]; then TB_CHAR=$'\U1F423' branch_name=`grep '"name":' .tinyb | cut -d : -f 2 | cut -d '"' -f 2` region=`grep '"host":' .tinyb | cut -d / -f 3 | cut -d . -f 2 | cut -d : -f 1` if [ "$region" = "tinybird" ]; then region=`grep '"host":' .tinyb | cut -d / -f 3 | cut -d . -f 1` fi TB_BRANCH="${TB_CHAR}tb:${region}=>${branch_name}" else TB_BRANCH='' fi echo $TB_BRANCH } When the function is available, you need to make the output visible on the prompt of your shell. The following example shows how to do this for zsh: ##### Include Tinybird information in the zsh prompt echo 'export PROMPT="' $PS1 ' $(prompt_tb)"' >> ~/.zshrc Restart your shell and go to the root of your project to see the Tinybird region and Workspace in your prompt. --- URL: https://www.tinybird.co/docs/cli/overview Last update: 2024-11-12T11:45:41.000Z Content: --- title: "Tinybird CLI · Tinybird Docs" theme-color: "#171612" description: "Use the Tinybird CLI to access all the Tinybird features from the command line." --- # Tinybird command-line interface (CLI)¶ Use the Tinybird CLI to access all the Tinybird features directly from the command line. You can test and run commands from the terminal or integrate the CLI in your pipelines. - Read the Quick start guide. See[ Quick start](https://www.tinybird.co/docs/docs/cli/quick-start) . - Install Tinybird CLI on your machine. See[ Install](https://www.tinybird.co/docs/docs/cli/install) . - Learn about Tinybird datafiles and their format. See[ Datafiles](https://www.tinybird.co/docs/docs/cli/datafiles/overview) . - Organize your CLI projects. See[ Data projects](https://www.tinybird.co/docs/docs/cli/data-projects) . --- URL: https://www.tinybird.co/docs/cli/quick-start Last update: 2024-11-12T11:45:41.000Z Content: --- title: "Quick start · Tinybird Docs" theme-color: "#171612" description: "Get started with Tinybird CLI as quickly as possible. Ingest, query, and publish data in minutes." --- # Quick start for Tinybird command-line interface¶ With Tinybird, you can ingest data from anywhere, query and transform it using SQL, and publish your data as high-concurrency, low-latency REST API endpoints. After you've [familiarized yourself with Tinybird](https://www.tinybird.co/docs/docs/quick-start) , you're ready to start automating and scripting the management of your Workspace using the Tinybird command-line interface (CLI). The Tinybird CLI is essential for all [CI/CD workflows](https://www.tinybird.co/docs/docs/production/continuous-integration). Read on to learn how to download and configure the Tinybird CLI, create a Workspace, ingest data, create a query, publish an API, and confirm your setup works properly. ## Step 1: Create your Tinybird account¶ [Create a Tinybird account](https://www.tinybird.co/signup) . It's free and no credit card is required. See [Tinybird pricing plans](https://www.tinybird.co/docs/docs/support/billing) for more information. [Sign up for Tinybird](https://www.tinybird.co/signup) ## Step 2: Download and install the Tinybird CLI¶ [Follow the instructions](https://www.tinybird.co/docs/docs/cli/install) to download and install the Tinybird command-line interface (CLI). 
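If you're installing locally, this boils down to the same commands covered in the install section earlier in this document, recapped here:
##### Install the CLI in a virtual environment
python3 -m venv .venv
source .venv/bin/activate
pip install tinybird-cli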
Complete the setup and authenticate with your Tinybird account in the cloud and region you prefer. ## Step 3: Create your Workspace¶ A [Workspace](https://www.tinybird.co/docs/docs/concepts/workspaces) is an area that contains a set of Tinybird resources, including Data Sources, Pipes, Nodes, API Endpoints, and Tokens. Create a Workspace named `customer_rewards` . Use a unique name. tb workspace create customer_rewards ## Step 4: Download and ingest sample data¶ Download the following sample data from a fictitious online coffee shop: [Download data file](https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-05.parquet) The following Tinybird CLI commands infer the schema from the datafile, generate and push a .datasource file and ingest the data. tb datasource generate orders.ndjson # Infer the schema tb push orders.datasource # Upload the datasource file tb datasource append orders orders.ndjson # Ingest the data ## Step 5: Query data using a Pipe and Publish it as an API¶ In Tinybird, you can create [Pipes](https://www.tinybird.co/docs/docs/concepts/pipes) to query your data using SQL. The following commands create a Pipe with an SQL instruction that returns the number of records Tinybird has ingested from the data file: tb pipe generate rewards 'select count() from orders' tb push rewards.pipe When you push a Pipe, Tinybird publishes it automatically as a high-concurrency, low-latency API Endpoint. ## Step 6: Call the API Endpoint¶ You can test your API Endpoint using a curl command. First, create and obtain the read Token for the API Endpoint. tb token create static rewards_read_token --scope PIPES:READ --resource rewards tb token copy rewards_read_token Copy the read Token and insert it into a curl command. curl --compressed -H 'Authorization: Bearer your_read_token_here' https://api.us-east.aws.tinybird.co/v0/pipes/rewards.json You have now created your first API Endpoint in Tinybird using the CLI. ## Next steps¶ - Learn about datafiles and their format. See[ Datafiles](https://www.tinybird.co/docs/docs/cli/datafiles/overview) . - Learn how advanced templates can help you. See[ Advanced templates](https://www.tinybird.co/docs/docs/cli/advanced-templates) . - Browse the full CLI reference. See[ Command reference](https://www.tinybird.co/docs/docs/cli/command-ref) . --- URL: https://www.tinybird.co/docs/cli/workspaces Last update: 2024-11-12T11:45:41.000Z Content: --- title: "Manage Workspaces using the CLI · Tinybird Docs" theme-color: "#171612" description: "Learn how to switch between different Tinybird Workspaces and how to manage members using the CLI." --- # Manage Workspaces using the CLI¶ If you are a member of different Workspaces, you might need to frequently switch between Workspaces when working on a project using Tinybird CLI. This requires to authenticate and select the right Workspace. ## Authenticate¶ Authenticate using the admin Token. For example: ##### Authenticate tb auth --token ## List Workspaces¶ List the Workspaces you have access to, and the one that you're currently authenticated to: ##### List Workspaces tb workspace ls ## Create a Workspace¶ You can create new empty Workspaces or create a Workspace from a template. To create Workspaces using Tinybird CLI, you need [your user Token](https://www.tinybird.co/docs/docs/concepts/auth-tokens#your-user-token). 
Run the following command to create a Workspace following instructions: ##### Authenticate tb workspace create You can create a Workspace directly by defining the user Token using the `--user_token` flag: ##### Authenticate tb workspace create workspace_name --user_token ## Switch to another Workspace¶ You can switch to another Workspace using `--use` . For example: ##### Switch to a Workspace using the Workspace id or the Workspace name # Use the Workspace ID tb workspace use 841717b1-2472-44f9-9a81-42f1263cabe7 # Use the Workspace name tb workspace use Production To find out the IDs and names of available Workspaces, run `tb workspace ls`: tb workspace ls You can also check which Workspace you're currently in: ##### Show current Workspace tb workspace current ## Manage Workspace members¶ You can manage Workspace members using the `workspace members` commands. ### List members¶ To list members, run `tb workspace members ls` . For example: ##### Listing current Workspace members tb workspace members ls ### Add members¶ To add members, run `tb workspace members add` . For example: ##### Adding users to the current Workspace tb workspace members add "user1@example.com,user2@example.com,user3@example.com" ### Remove members¶ To remove members, run `tb workspace members rm` . For example: ##### Removing members from current Workspace tb workspace members rm user3@example.com You can also manage roles. For example, to set a user as admin: ##### Add admin role to user tb workspace members set-role admin user@example.com --- URL: https://www.tinybird.co/docs/compliance Last update: 2024-10-24T13:35:00.000Z Content: --- title: "Compliance and certifications · Tinybird Docs" theme-color: "#171612" description: "Tinybird is committed to the highest data security and safety. See what compliance certifications are available." --- # Compliance and certifications¶ Data security and privacy are paramount in today's digital landscape. Tinybird's commitment to protecting your sensitive information is backed by the following compliance certifications, which ensure that we meet rigorous industry standards for data security, privacy, and operational excellence. ## SOC 2 Type II¶ Tinybird has obtained a SOC 2 Type II certification, in accordance with attestation standards established by the American Institute of Certified Public Accountants (AICPA), that are relevant to security, availability, processing integrity, confidentiality, and privacy for Tinybird's real-time platform for user-facing analytics. Compliance is monitored continually—with reports published annually—to confirm the robustness of Tinybird's data security. This independent assessment provides Tinybird users with assurance that their sensitive information is being handled responsibly and securely. ## HIPAA¶ Tinybird supports its customers’ Health Insurance Portability and Accountability Act (HIPAA) compliance efforts by offering Business Associate Agreements (BAAs). Additionally, Tinybird’s offering allows customers to process their data constituting personal health information (PHI) in AWS, Azure, or Google Cloud—entities which themselves have entered into BAAs with Tinybird. ## Trust center¶ To learn more about Tinybird security controls and certifications, visit the [Tinybird Trust Center](https://trust.tinybird.co/). 
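Your user Token also lets you delete a Workspace where you're an admin. The following is a minimal sketch that assumes a `tb workspace delete` subcommand following the same pattern as `create`:
##### Delete a Workspace
tb workspace delete workspace_name --user_token <your_user_token>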
--- URL: https://www.tinybird.co/docs/concepts/auth-tokens Last update: 2024-10-28T11:06:14.000Z Content: --- title: "Tokens · Tinybird Docs" theme-color: "#171612" description: "Tokens allow you to secure your Tinybird resources, providing fine-grained access control to your users. Learn all about Tokens here!" --- # Tokens¶ ## What is a Token?¶ Tokens protect access to your Tinybird resources. Any operations to manage your Tinybird resources via the CLI or REST API require a valid Token with the necessary permissions. Access to the APIs you publish in Tinybird are also protected with Tokens. Tokens can have different scopes. This means you can limit which operations a specific Token can do. You can create Tokens that are, for example, only able to do admin operations on Tinybird resources, or only have `READ` permission for a specific Data Source. Tinybird represents tokens using the icon. ## What should I use Tokens for?¶ There are **two types** of Tokens. - When performing operations on your account (like importing data, creating Data Sources, or publishing APIs via the CLI or REST API) you need a Token. For this purpose,** use a Static[ Token](https://www.tinybird.co/docs/about:blank#tokens)** . - When publishing an API that exposes your data to an application, you need a Token to successfully interact with the API. For this purpose,** use a[ JWT](https://www.tinybird.co/docs/about:blank#json-web-tokens-jwts) .** ### Monitor Token usage¶ You can monitor Token usage using Tinybird's Service Data Sources. See the guide ["Monitor API Performance"](https://www.tinybird.co/docs/docs/guides/monitoring/analyze-endpoints-performance#example-4-monitor-usage-of-tokens) for more. ## Tokens¶ Tinybird Tokens, also known as "Static Tokens", are permanent and meant to be long-term. They are stored inside Tinybird and don't have an expiration date or time. They are useful for backend-to-backend integrations (where you call Tinybird as another service). ### Token scopes¶ When a Token is created, you have the choice to give it a set of zero or more scopes that define which tables can be accessed by that Token, and which methods can be used to access them. A `READ` Token can be augmented with a SQL filter. This allows you to further restrict what data a Token grants access to. Using this, you can [implement row-level security](https://www.tinybird.co/blog-posts/row-level-security-in-tinybird) on a `READ` -scoped Token. **Available scopes syntax** | Value | Description | | --- | --- | | `DATASOURCES:CREATE` | Enables your Token to create and append data to Data Sources. | | `DATASOURCES:APPEND:datasource_name` | Allows your Token to append data to the defined Data Sources. | | `DATASOURCES:DROP:datasource_name` | Allows your Token to delete the specified Data Sources. | | `DATASOURCES:READ:datasource_name` | Gives your Token read permissions for the specified Data Sources. Also gives read for the quarantine Data Source. | | `DATASOURCES:READ:datasource_name:sql_filter` | Gives your Token read permissions for the specified table with the `sql_filter` applied. | | `PIPES:CREATE` | Allows your Token to create new Pipes and manipulate existing ones. | | `PIPES:DROP:pipe_name` | Allows your Token to delete the specified Pipe. | | `PIPES:READ:pipe_name` | Gives your Token read permissions for the specified Pipe. | | `PIPES:READ:pipe_name:sql_filter` | Gives your Token read permissions for the specified Pipe with the `sql_filter` applied. | | `TOKENS` | Gives your Token the capacity of managing Tokens. 
| | `ADMIN` | All permissions will be granted. You should not use this Token except in really specific cases. Use it carefully! | When adding the `DATASOURCES:READ` scope to a Token, it automatically gives read permissions to the [quarantine Data Source](https://www.tinybird.co/docs/docs/concepts/data-sources#the-quarantine-data-source) associated with it. Applying Tokens with filters to queries that use the FINAL clause is not supported. If you need to apply auth filters to deduplications, use an alternative strategy (see [deduplication strategies](https://www.tinybird.co/docs/docs/guides/querying-data/deduplication-strategies#different-alternatives-based-on-your-requirements) ). ### Default Workspace Tokens¶ All Workspaces are created with a set of basic Tokens that you can add to by creating additional Tokens: - `admin token` for that Workspace (for signing JWTs). - `admin token` for that Workspace that belongs specifically to your user account (for using with Tinybird CLI). - `create datasource token` for creating Data Sources in that Workspace. - `user token` for creating new Workspaces or deleting ones where are an admin (see below). Some Tokens are created automatically by Tinybird during certain operations like scheduled copies, and [these Tokens can be updated](https://www.tinybird.co/docs/docs/publish/copy-pipes#change-copy-pipe-token-reference). ### Your User Token¶ Your User Token is specific to your user account. It's a permanent Token that enables you to perform operations that are not limited to a single Workspace, such as creating new Workspaces. You can only obtain your User Token from your Workspace in the Tinybird UI > Tokens > `user token`. ### Create a Token¶ In the Tinybird UI, navigate to Tokens > Plus (+) icon. Descriptively rename the new Token and update its scopes using the table above as a guide. ## JSON Web Tokens (JWTs) BETA¶ JWTs are currently in public beta. They are not feature-complete and may change in the future. If you have any feedback or suggestions, contact Tinybird at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) or in the [Community Slack](https://www.tinybird.co/docs/docs/community). JWTs are signed tokens that allow you to securely authorize and share data between your application and Tinybird. If you want to read more about JWTs, check out the [JWT.io](https://jwt.io/introduction) website. Importantly, JWTs are **not stored in Tinybird** . They are created by you, inside your application, and signed with a shared secret between your application and Tinybird. Tinybird validates the signature of the JWT, using the shared secret, to ensure it's authentic. ### When to use JWTs¶ The primary purpose for JWTs is to allow your application to call Tinybird API Endpoints from the frontend without proxying via your backend. If you are building an application where a frontend component needs data from a Tinybird API Endpoint, you can use JWTs to authorize the request directly from the frontend. The typical pattern looks like this: 1. A user starts a session in your application. 2. The frontend requests a JWT from your backend. 3. Your backend generates a new JWT, signed with the Tinybird shared secret, and returns to the frontend. 4. The frontend uses the JWT to call the Tinybird API Endpoints directly. 
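As a sketch of step 4, the frontend passes the JWT in the same way as any other Token when calling the API Endpoint, for example as a `token` query parameter. The host and the `requests_per_day` Pipe name below are placeholders taken from the payload examples that follow:
##### Call an API Endpoint with a JWT
curl "https://api.tinybird.co/v0/pipes/requests_per_day.json?token=$TINYBIRD_JWT"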
### JWT payload¶ The payload of a JWT is a JSON object that contains the following fields: | Key | Example Value | Required | Description | | --- | --- | --- | --- | | workspace_id | workspaces_id | Yes | The UUID of your Tinybird Workspace, found in the Workspace list (locate the Workspace name in the nav bar > use `Copy ID` button), the Settings section (the three dots), or by using `tb workspace current` from the CLI. | | name | frontend_jwt | Yes | Used to identify the token in the `tinybird.pipe_stats_rt` table, useful for analytics, does not need to be unique. | | exp | 123123123123 | Yes | The Unix timestamp (UTC) showing the expiry date & time. Once a token has expired, Tinybird returns a 403 HTTP status code. | | scopes | [{"type": "PIPES:READ", "resource": "requests_per_day", "fixed_params": {"org_id": "testing"}}] | Yes | Used to pass data to Tinybird, including the Tinybird scope, resources and fixed parameters. | | scopes.type | PIPES:READ | Yes | The type of scope, e.g., READ. See[ JWT scopes](https://www.tinybird.co/docs/about:blank#jwt-scopes) for supported scopes. | | scopes.resource | t_b9427fe2bcd543d1a8923d18c094e8c1 or top_airlines | Yes | The ID or name of the Pipe that the scope applies to, i.e., which API Endpoint the token can access. | | scopes.fixed_params | {"org_id": "testing"} | No | Pass arbitrary fixed values to the API Endpoint. These values can be accessed by Pipe templates to supply dynamic values at query time. | | limits | {"rps": 10} | No | You can limit the number of requests per second (rps) the JWT can perform. Learn more in the[ JWT rate limit section](https://www.tinybird.co/docs/about:blank#rate-limits-for-jwt-tokens) . | Check out the [JWT example](https://www.tinybird.co/docs/about:blank#jwt-example) below to see what a complete payload looks like. ### JWT algorithm¶ Tinybird always uses HS256 as the algorithm for JWTs and does not read the `alg` field in the JWT header. You can skip the `alg` field in the header. ### JWT scopes¶ | Value | Description | | --- | --- | | `PIPES:READ:pipe_name` | Gives your Token read permissions for the specified Pipe | ### JWT expiration¶ JWTs can have an expiration time that gives each Token a finite lifespan. Setting the `exp` field in the JWT payload is mandatory, and not setting it will result in a 403 HTTP status code from Tinybird when requesting the API Endpoint. Tinybird validates that a JWT has not expired before allowing access to the API Endpoint. If a Token has expired, Tinybird returns a 403 HTTP status code. ### JWT fixed parameters¶ Fixed parameters allow you to pass arbitrary values to the API Endpoint. These values can be accessed by Pipe templates to supply dynamic values at query time. For example, consider the following fixed parameter: ##### Example fixed parameters { "fixed_params": { "org_id": "testing" } } This passes a parameter called `org_id` with the value `testing` to the API Endpoint. You can then use this value in your SQL queries: ##### Example SQL query SELECT fieldA, fieldB FROM my_pipe WHERE org_id = '{{ String(org_id) }}' This is particularly useful when you want to pass dynamic values to an API Endpoint that are set by your backend and must be safe from user tampering. A good example is multi-tenant applications that require row-level security, where you need to filter data based on a user or tenant ID. The value `org_id` will always be the one specified in the `fixed_params` . 
Even if you specify a new value in the URL when requesting the endpoint, Tinybird will always use the one specified in the JWT. You can use JWT fixed parameters in combination with Pipe [dynamic parameters](https://www.tinybird.co/docs/docs/query/query-parameters). ### JWT example¶ For example, take the following payload with all [required and optional fields](https://www.tinybird.co/docs/about:blank#jwt-payload): ##### Example payload { "workspace_id": "workspaces_id", "name": "frontend_jwt", "exp": 123123123123, "scopes": [ { "type": "PIPES:READ", "resource": "requests_per_day", "fixed_params": { "org_id": "testing" } } ], "limits": { "rps": 10 } } Then use the Admin Token from your Workspace to sign the payload, for example: ##### Example Workspace Admin Token p.eyJ1IjogIjA1ZDhiYmI0LTdlYjctNDAzZS05NGEyLWM0MzFhNDBkMWFjZSIsICJpZCI6ICI3NzUxMDUzMC0xZjE4LTRkNzMtOTNmNS0zM2MxM2NjMDUxNTUiLCAiaG9zdCI6ICJldV9zaGFyZWQifQ.Xzh4Qjz0FMRDXFuFIWPI-3DWEC6y-RFBfm_wE3_Qp2M With the payload and Admin Token above, the signed JWT payload would look like this: ##### Example JWT eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ3b3Jrc3BhY2VfaWQiOiIzMTA0OGI3Ni01MmU4LTQ5N2ItOTBhNC0wYzZhNTUxMzkyMGQiLCJuYW1lIjoiZnJvbnRlbmRfand0IiwiZXhwIjoxMjMxMjMxMjMxMjMsInNjb3BlcyI6W3sidHlwZSI6IlBJUEVTOlJFQUQiLCJyZXNvdXJjZSI6ImVhNDdmZDlkLWJjNDgtNDIwZC1hNmY2LTk1NDgxZmJiM2Y3YyIsImZpeGVkX3BhcmFtcyI6eyJvcmdfaWQiOiJ0ZXN0aW5nIn19XSwiaWF0IjoxNzE3MDYzNzQwfQ.t-9BRLI6MrhOAuvt1mBSTBTU7TOdJFunBjr78TuqpVg You can use the [JWT.io debugger](https://jwt.io/#debugger-io) to verify the above example. ### JWT limitations¶ - You cannot refresh JWTs individually from inside Tinybird as they are not stored in Tinybird. You must do this from your application, or you can globally invalidate all JWTs by refreshing your Admin Token. - If you refresh your Admin Token, all the tokens will be invalidated. - If your token is expired or invalidated, you'll get a 403 HTTP status code from Tinybird when requesting the API Endpoint. ### Create a JWT¶ #### Create a JWT in production¶ There is wide support for creating JWTs in many programming languages and frameworks. Any library that supports JWTs should work with Tinybird. A common library to use with Python is [PyJWT](https://github.com/jpadilla/pyjwt/tree/master) . Common libraries for JavaScript are [jsonwebtoken](https://github.com/auth0/node-jsonwebtoken#readme) and [jose](https://github.com/panva/jose). - JavaScript (Next.js) - Python ##### Create a JWT in Python using pyjwt import jwt import datetime import os TINYBIRD_SIGNING_KEY = os.getenv('TINYBIRD_SIGNING_KEY') def generate_jwt(): expiration_time = datetime.datetime.utcnow() + datetime.timedelta(hours=3) payload = { "workspace_id": "workspaces_id", "name": "frontend_jwt", "exp": expiration_time, "scopes": [ { "type": "PIPES:READ", "resource": "requests_per_day", "fixed_params": { "org_id": "testing" } } ] } return jwt.encode(payload, TINYBIRD_SIGNING_KEY, algorithm='HS256') #### Create a JWT using the CLI or the API¶ If for any reason you don't want to generate a JWT on your own, Tinybird provides an API and a CLI utility to create JWTs. - API - CLI ##### Create a JWT with the Tinybird CLI tb token create jwt my_jwt --ttl 1h --scope PIPES:READ --resource my_pipe --filters "column_name=value" ### Error handling¶ There are many reasons why a request might return a `403` status code. When a `403` is received, check the following: 1. Confirm the JWT is valid and has not expired (the expiration time is in the `exp` field in the JWT's payload). 2. 
The generated JWTs can only read Tinybird API Endpoints. Confirm you're not trying to use the JWT to access other APIs. 3. Confirm the JWT has a scope to read the endpoint you are trying to read. Check the payload of the JWT at[ https://jwt.io/](https://jwt.io/) . 4. If you generated the JWT outside of Tinybird (without using the API or the CLI), make sure you are using the** Workspace** `admin token` , not your personal one. ### Rate limits for JWTs¶ Check the [limits page](https://www.tinybird.co/docs/docs/support/limits) for limits on ingestion, queries, API Endpoints, and more. When you specify a `limits.rps` field in the payload of the JWT, Tinybird will use the name specified in the payload of the JWT to track the number of requests being done. If the number of requests goes above the limit, Tinybird starts rejecting new requests and returns an "HTTP 429 Too Many Requests" error. Check the [limits docs](https://www.tinybird.co/docs/docs/support/limits) for more information. The following example shows the tracking of all requests done by `frontend_jwt` . Once you reach 10 requests per second, Tinybird would start rejecting requests: ##### Example payload with global rate limit { "workspace_id": "workspaces_id", "name": "frontend_jwt", "exp": 123123123123, "scopes": [ { "type": "PIPES:READ", "resource": "requests_per_day", "fixed_params": { "org_id": "testing" } } ], "limits": { "rps": 10 } } If `rps <= 0` , Tinybird ignores the limit and assumes there is no limit. As the `name` field does not have to be unique, all the tokens generated using the `name=frontend_jwt` would be under the same umbrella. This can be very useful if you want to have a global limit in one of your apps or components. If you want to just limit for each specific user, you could generate a JWT using the following payload. In this case, you would specify a unique name so the limits only apply to each user: ##### Example of a payload with isolated rate limit { "workspace_id": "workspaces_id", "name": "frontend_jwt_user_", "exp": 123123123123, "scopes": [ { "type": "PIPES:READ", "resource": "requests_per_day", "fixed_params": { "org_id": "testing" } } ], "limits": { "rps": 10 } } ## Next steps¶ - Follow a walkthrough guide:[ How to consume APIs in a Next.js frontend with JWTs](https://www.tinybird.co/docs/docs/guides/integrations/consume-apis-nextjs) . - Read about the Tinybird[ Tokens API](https://www.tinybird.co/docs/docs/api-reference/token-api) . - Understand[ Branches](https://www.tinybird.co/docs/docs/concepts/branches) in Tinybird. --- URL: https://www.tinybird.co/docs/concepts/branches Last update: 2024-11-12T11:45:41.000Z Content: --- title: "Branches · Tinybird Docs" theme-color: "#171612" description: "Branches allow you to create an isolated copy of your Tinybird project, make changes and develop new features, then merge those changes back into the Workspace." --- # Branches¶ ## What is a Branch?¶ Branches are isolated, temporary copies of your Tinybird project and the resources it contains (including [Data Sources](https://www.tinybird.co/docs/docs/concepts/data-sources), [Pipes](https://www.tinybird.co/docs/docs/concepts/pipes), [API Endpoints](https://www.tinybird.co/docs/docs/publish/api-endpoints/overview) ). They are inspired by Git branches, allowing you to make changes and develop new features, then merge those changes back into the Workspace. 
Branches are only available in Tinybird Workspaces [connected to a remote Git provider](https://www.tinybird.co/docs/docs/production/working-with-version-control) (for example, GitHub). Tinybird represents branches using the icon. ## What should I use Branches for?¶ Use Branches to develop new features without affecting production or other developers. Tinybird Branches mirror the intent behind Git branches, and should be used for the same reasons. Branches are intended to be temporary, rather than long lived, so they are ideal for building new features and prototyping. Branches should be removed when their work is complete (either merged or deleted). Common uses for Branches include: - Designing new Data Sources or Pipes. - Iterating the schema of a Data Source. - Testing the backfill strategy of a new Data Source. - Changing the output of a Pipe. - Modifying API Endpoint query parameters. Branches are temporary. If you need a long-lived non-production environment, take a look at the docs on [staging and production Workspaces](https://www.tinybird.co/docs/docs/production/staging-and-production-workspaces). ## Data in Branches¶ Creating a Branch copies all the **resources** from the Workspace to the new Branch, but it does not copy all the **production data**. When creating a Branch, you can choose to copy a segment of data from production into the new Branch. This will duplicate the latest active partition (up to a maximum of 50GB) of data from all production Data Sources, and attach the copies to the new Branch Data Sources. This data is a copy (or clone), and is physically separate from the production data. This means you can ingest or drop data from the Branch and it will not affect production data. If you choose not to copy data, the new Branch will contain no data, and you will need to ingest data into the Branch Data Sources. When a Branch is deleted, the Branch data is also deleted. There is a soft limit on the number of Branches you can create per Workspace (3). Contact us at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) if you need to increase this limit. ## Branch users¶ When a new Branch is created, all Workspace users are added to the new Branch with their current roles. Since the new Branch is completely independent, user management is also independent. You can update the roles of users in the new Branch without affecting the main Workspace. Users with the Viewer role are restricted from making changes in the Workspace, but not in Branches. These users will have the same permissions in a Branch as standard users. ## Branch Tokens¶ When a new Branch is created, all [Tokens](https://www.tinybird.co/docs/docs/concepts/auth-tokens) from the Workspace are mirrored in the new Branch. These are **new** Tokens that are completely independent of the production Tokens, but with the same names and scopes. This means you can delete or update the scopes of the Tokens in the new Branch without affecting production. When you are on a branch in the CLI using the command `tb branch use BRANCH_NAME_OR_ID` the admin token of the branch is on the [.tinyb file](https://www.tinybird.co/docs/docs/cli/install#authentication) . If you need this token (usually on CI/CD processes) you can extract it from this auth file. ## Create a Branch¶ A Branch can be created via the Tinybird UI, Tinybird CLI, or automatically as part of your CI/CD process. 
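For CI/CD scripts that need the Branch admin token described in the Branch Tokens section above, the following is a minimal sketch. It assumes the `.tinyb` auth file is JSON with a `token` field; check the contents of your own file, since its exact layout isn't covered here.

##### Read the Branch admin token from the .tinyb file (sketch)

import json

# After `tb branch use BRANCH_NAME_OR_ID`, the CLI stores the Branch
# credentials in the .tinyb auth file (assumed here to be JSON).
with open(".tinyb") as auth_file:
    auth = json.load(auth_file)

branch_admin_token = auth["token"]  # assumed field name
print(branch_admin_token)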
You can only create Branches in Workspaces [connected to a remote Git provider](https://www.tinybird.co/docs/docs/production/working-with-version-control) (for example, GitHub). ### Using the Tinybird UI¶ To create a new Branch with the Tinybird UI, click the `Branch` dropdown in the top bar, and click on the `Branch` tab. Click the `Create Branch` button. ### Using the Tinybird CLI¶ To create a new Branch with the Tinybird CLI, use the `tb branch create` command. See the [tb branch Command Reference](https://www.tinybird.co/docs/docs/cli/command-ref#tb-branch) for more information. ### Using a Pull Request¶ To create a new Branch with a Pull Request (PR), push your local feature branch to your Git provider and open a new PR. The CI/CD actions should sync the branch from your PR to Tinybird. ## Delete a Branch¶ A Branch can be deleted via the Tinybird UI, Tinybird CLI, or automatically as part of your CI/CD process. You can only delete Branches in Workspaces [connected to a remote Git provider](https://www.tinybird.co/docs/docs/production/working-with-version-control) (for example, GitHub). ### Using the Tinybird UI¶ To delete a Branch with the Tinybird UI, click the Branch dropdown in the top bar, and click on the Branch tab. Find your Branch in the list, and click the trash button to the right. ### Using the Tinybird CLI¶ To delete a Branch with the Tinybird CLI, use the `tb branch rm` command. See the [tb branch Command Reference](https://www.tinybird.co/docs/docs/cli/command-ref#tb-branch) for more information. ### Using a Pull Request¶ To delete a Branch from a Pull Request (PR), close or merge the associated PR in your Git provider. The automated CI/CD actions delete the Branch from Tinybird. ## Limitations¶ **JOIN Engine Data Sources** Data Sources using the JOIN Engine are not fully supported in Branches. You can create a Branch with JOIN Engine Data Sources, but the data will not be copied into the new Branch. The JOIN Engine has been deprecated and will be fully removed in a future release. You should not create new JOIN Engine Data Sources. If you need support, contact Tinybird at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) or in the [Community Slack](https://www.tinybird.co/docs/docs/community). --- URL: https://www.tinybird.co/docs/concepts/data-sources Last update: 2024-11-15T16:02:17.000Z Content: --- title: "Data Sources · Tinybird Docs" theme-color: "#171612" description: "Data Sources contain all the data you bring into Tinybird, acting like tables in a database." --- # Data Sources¶ When you get data into Tinybird, it's stored in a Data Source. You then write SQL queries to explore the data from the Data Source. Tinybird represents Data Sources using the icon. For example, if your event data lives in a Kafka topic, you can create a Data Source that connects directly to Kafka and writes the events to Tinybird. You can then [create a Pipe](https://www.tinybird.co/docs/docs/concepts/pipes#creating-pipes-in-the-ui) to query fresh event data. A Data Source can also be the result of materializing a SQL query through a [Pipe](https://www.tinybird.co/docs/docs/concepts/pipes#creating-pipes-in-the-ui). ## Create Data Sources¶ You can use Tinybird's UI, CLI, and API to create Data Sources. ### Using the UI¶ Follow these steps to create a new Data Source: 1. In your Workspace, go to** Data Sources** . 2. Select** +** to add a new Data Source. ### Using the CLI¶ You can create Data Source using the `tb datasource` command. 
See [tb datasource](https://www.tinybird.co/docs/docs/cli/command-ref#tb-datasource) in the CLI reference. ## Set the Data Source TTL¶ You can apply a TTL (Time To Live) to a Data Source in Tinybird. Use a TTL to define how long you want to store data. For example, you can define a TTL of 7 Days, which means that any data older than 7 Days should be deleted. Data older than the defined TTL is deleted automatically. You must define the TTL at the time of creating the Data Source and your data must have a column with a type that represents a date. Valid types are any of the `Date` or `Int` types. ### Using the UI¶ If you are using the Tinybird Events API and want to use a TTL, create the Data Source with a TTL first before sending data. Follow these steps to set a TTL using the Tinybird UI: 1. Select** Advanced Settings** . 2. Open the** TTL** menu. 3. Select a column that represents a date. 4. Define the TTL period in days. If you need to apply transformations to the date column, or want to use more complex logic, select the **Code editor** tab and enter SQL code to define your TTL. ### Using the CLI¶ Follow these steps to set a TTL using the Tinybird CLI: 1. Create a new Data Source and .datasource file using the `tb datasource` command. 2. Edit the .datasource file you've created. 3. Go to the Engine settings. 4. Add a new setting called `ENGINE_TTL` and enter your TTL string enclosed in double quotes. 5. Save the file. The following example shows a .datasource file with TTL defined: SCHEMA > `date` DateTime, `product_id` String, `user_id` Int64, `event` String, `extra_data` String ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYear(date)" ENGINE_SORTING_KEY "date, user_id, event, extra_data" ENGINE_TTL "date + toIntervalDay(90)" ## Change Data Source TTL¶ You can modify the TTL of an existing Data Source, either by adding a new TTL or by updating an existing TTL. ### Using in the UI¶ Follow these steps to modify a TTL using the Tinybird UI: 1. Go to the Data Source details page by clicking on the Data Source with the TTL you wish to change. 2. Select the** Schema** tab. 3. Select the TTL text. 4. A dialog opens. Select the menu. 5. Select the field to use for the TTL. 6. Change the TTL interval. 7. Select** Save** . The updated TTL value appears in the Data Source's schema page. ### Using the CLI¶ Follow these steps to modify a TTL using the Tinybird CLI: 1. Open the .datasource file. 2. Go to the Engine settings. 3. If `ENGINE_TTL` doesn't exist, add it and enter your TTL enclosed in double quotes. 4. If a TTL is already defined, modify the existing setting. The following is an example TTL setting: ENGINE_TTL "date + toIntervalDay(90)" When ready, save the .datasource file and push the changes to Tinybird using the CLI: tb push DATA_SOURCE_FILE -f ## Supported engines and settings¶ Don't change the following settings unless you are familiar with ClickHouse® and understand their impact. If you're unsure, contact Tinybird at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) or in the [Community Slack](https://www.tinybird.co/docs/docs/community). Tinybird uses ClickHouse® as the underlying storage engine. ClickHouse features different strategies to store data, which define where and how the data is stored and also what kind of data access, queries, and availability your data has. In ClickHouse terms, a Tinybird Data Source uses a [Table Engine](https://clickhouse.tech/docs/en/engines/table_engines/) that determines those factors. 
With Tinybird you can select the Table Engine for your Data Source. Tinybird supports the following engines: - `MergeTree` - `ReplacingMergeTree` - `SummingMergeTree` - `AggregatingMergeTree` - `CollapsingMergeTree` - `VersionedCollapsingMergeTree` - `Null` If you need to use any other Table Engine, contact Tinybird at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) or in the [Community Slack](https://www.tinybird.co/docs/docs/community). You can use the `engine` parameter in the [Data Sources API](https://www.tinybird.co/docs/docs/api-reference/datasource-api) to specify the name of any of the available engines, for example `engine=ReplacingMergeTree` . To set the engine parameters and the engine options, use as many `engine_*` request parameters as needed. You can also configure settings in .datasource files. The supported parameters for each engine are: | ENGINE | SIGNATURE | PARAMETER | DESCRIPTION | | --- | --- | --- | --- | | [ ReplacingMergeTree](https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/replacingmergetree/#creating-a-table) | `(ver, is_deleted)` | `engine_ver` | Optional. The column with the version. If not set, the last row is kept during a merge. If set, the maximum version is kept. | | | | `engine_is_deleted` | Active only when `ver` is used. The name of the column used to determine whether the row is to be deleted; `1` is a `deleted` row, `0` is a `state` row. | | [ SummingMergeTree](https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/summingmergetree/#creating-a-table) | `([columns])` | `engine_columns` | Optional. The names of columns where values are summarized. | | [ CollapsingMergeTree](https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/collapsingmergetree/#creating-a-table) | `(sign)` | `engine_sign` | Name of the column for computing the state. | | [ VersionedCollapsingMergeTree](https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/versionedcollapsingmergetree/#creating-a-table) | `(sign, version)` | `engine_sign` | Name of the column for computing the state. | | | | `engine_version` | Name of the column with the version of the object state. | The engine options, in particular the MergeTree engine options, match ClickHouse terminology: `engine_partition_key`, `engine_sorting_key`, `engine_primary_key`, `engine_sampling_key`, `engine_ttl` and `engine_settings`. If `engine_partition_key` is empty or not passed as a parameter, the underlying Data Source doesn't have any partition unless there's a `Date` column. In that case the Data Source is partitioned by year. If you want to create a Data Source with no partitions, send `engine_partition_key=tuple()`. `engine_settings` allows for fine-grained control over the parameters of the underlying Table Engine. In general, we do not recommend changing these settings unless you are absolutely sure about their impact. The supported engine settings are: - `index_granularity` - `merge_with_ttl_timeout` - `ttl_only_drop_parts` - `min_bytes_for_wide_part` - `min_rows_for_wide_part` See the [Data Sources API](https://www.tinybird.co/docs/docs/api-reference/datasource-api) for examples of creating Data Sources with custom engine settings using the Tinybird REST API. 
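As an illustrative sketch of passing these `engine_*` parameters through the Data Sources API (the Data Source name, schema, and `TINYBIRD_TOKEN` environment variable below are placeholders):

##### Create a Data Source with engine parameters (sketch)

import os
import requests

TOKEN = os.environ["TINYBIRD_TOKEN"]  # token with permission to create Data Sources

# Hypothetical deduplicating Data Source using ReplacingMergeTree.
response = requests.post(
    "https://api.tinybird.co/v0/datasources",
    headers={"Authorization": f"Bearer {TOKEN}"},
    data={
        "mode": "create",
        "name": "events_dedup",
        "schema": "id UInt64, updated_at DateTime, value String",
        "engine": "ReplacingMergeTree",
        "engine_sorting_key": "id",
        "engine_ver": "updated_at",
    },
)
print(response.status_code, response.json())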
## Supported data types¶ The supported [ClickHouse data types](https://clickhouse.com/docs/en/sql-reference/data-types/) are: - `Int8` , `Int16` , `Int32` , `Int64` , `Int128` , `Int256` - `UInt8` , `UInt16` , `UInt32` , `UInt64` , `UInt128` , `UInt256` - `Float32` , `Float64` - `Decimal` , `Decimal(P, S)` , `Decimal32(S)` , `Decimal64(S)` , `Decimal128(S)` , `Decimal256(S)` - `String` - `FixedString(N)` - `UUID` - `Date` , `Date32` - `DateTime([TZ])` , `DateTime64(P, [TZ])` - `Bool` - `Array(T)` - `Map(K, V)` - `Tuple(K, V)` - `SimpleAggregateFunction` , `AggregateFunction` - `LowCardinality` - `Nullable` - `Nothing` If you are ingesting using the NDJSON format and would like to store `Decimal` values containing 15 or more digits, send the values as strings instead of numbers to avoid precision issues. In the following example, the first value has a high chance of losing accuracy during ingestion, while the second one is stored correctly: {"decimal_value": 1234567890.123456789} # Last digits might change during ingestion {"decimal_value": "1234567890.123456789"} # Will be stored correctly ### Set a different codec¶ Tinybird applies compression codecs to data types to optimize performance. You can override the default compression codecs by adding the `CODEC()` statement after the type declarations in your .datasource schema. For example: SCHEMA > `product_id` Int32 `json:$.product_id`, `timestamp` DateTime64(3) `json:$.timestamp` CODEC(DoubleDelta, ZSTD(1)), For a list of available codecs, see [Compression](https://clickhouse.com/docs/en/data-compression/compression-in-clickhouse#choosing-the-right-column-compression-codec) in the ClickHouse documentation. ## Supported file types and compression formats for ingest¶ The following file types and compression formats are supported at ingest time: | File type | Method | Accepted extensions | Compression formats supported | | --- | --- | --- | --- | | CSV | File upload, URL | `.csv` , `.csv.gz` | `gzip` | | NDJSON | File upload, URL, Events API | `.ndjson` , `.ndjson.gz` | `gzip` | | Parquet | File upload, URL | `.parquet` , `.parquet.gz` | `gzip` | | Avro | Kafka | | `gzip` | ## Quarantine Data Sources¶ Every Data Source you create in your Workspace has an associated quarantine Data Source that stores data that doesn't fit the schema. If you send rows that don't fit the Data Source schema, they're automatically sent to the quarantine table so that the ingest process doesn't fail. By convention, a quarantine Data Source is named `{datasource_name}_quarantine` . You can review quarantined rows at any time or perform operations on them using Pipes. This is a useful source of information when fixing issues in the origin source or applying changes during ingest. The quarantine Data Source schema contains the columns of the original row and the following columns with information about the issues that caused the quarantine: - `c__error_column` Array(String) contains an array of all the columns that contain an invalid value. - `c__error` Array(String) contains an array of all the errors that caused the ingestion to fail and led to the values being stored in quarantine. Together with `c__error_column` , this column makes it easy to identify which columns have problems and what the errors are. - `c__import_id` Nullable(String) contains the job's identifier if the row was imported through a job. - `insertion_date` (DateTime) contains the timestamp at which the ingestion happened.
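For a quick programmatic look at quarantined rows, here's a sketch that assumes a Data Source named `events` and uses Tinybird's Query API at `/v0/sql`; reviewing the rows in the UI or in a Pipe works just as well:

##### Inspect quarantined rows (sketch)

import os
import requests

TOKEN = os.environ["TINYBIRD_TOKEN"]  # token with read access to the Data Source

# By convention, the quarantine Data Source is named <datasource_name>_quarantine.
query = (
    "SELECT insertion_date, c__error_column, c__error "
    "FROM events_quarantine ORDER BY insertion_date DESC LIMIT 10 FORMAT JSON"
)

response = requests.get(
    "https://api.tinybird.co/v0/sql",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"q": query},
)
print(response.json())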
See the [Quarantine guide](https://www.tinybird.co/docs/docs/guides/ingesting-data/recover-from-quarantine) for practical examples on using the quarantine Data Source. ## Partitioning¶ Use partitions for data manipulation. Partitioning isn't intended to speed up `SELECT` queries: experiment with more efficient sorting keys ( `ENGINE_SORTING_KEY` ) for that. A bad partition key, or creating too many partitions, can negatively impact query performance. Partitioning is configured using the `ENGINE_PARTITION_KEY` setting. When choosing a partition key: - Leave the `ENGINE_PARTITION_KEY` key empty. If the table is small or you aren't sure what the best partition key should be, leave it empty: the data is placed in a single partition. - Use a date column. Depending on the filter, you can opt for more or less granularity based on your needs. `toYYYYMM(date_column)` or `toYear(date_column)` are valid default choices. If you have questions about choosing a partition key, contact Tinybird at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) or in the [Community Slack](https://www.tinybird.co/docs/docs/community). ### Examples¶ The following examples show how to define partitions. ##### Using an empty tuple to create a single partition ENGINE_PARTITION_KEY "tuple()" ##### Using a Date column to create monthly partitions ENGINE_PARTITION_KEY "toYYYYMM(date_column)" ##### Using a column to partition by event types ENGINE_PARTITION_KEY "event_type % 8" ## Upserts and deletes¶ See [this guide](https://www.tinybird.co/docs/docs/guides/ingesting-data/replace-and-delete-data) . Depending on the frequency needed, you might want to convert upserts and deletes into an append operation that you can solve through [deduplication](https://www.tinybird.co/docs/docs/guides/querying-data/deduplication-strategies). ## Limits¶ There is a limit of 100 Data Sources per Workspace. Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/concepts/pipes Last update: 2024-11-08T11:23:54.000Z Content: --- title: "Pipes · Tinybird Docs" theme-color: "#171612" description: "Pipes help you to bring your SQL queries together to achieve a purpose, like publishing an API Endpoint. Learn all about Pipes here!" --- # Pipes¶ ## What is a Pipe?¶ A Pipe is a collection of one or more SQL queries (each query is called a [Node](https://www.tinybird.co/docs/about:blank#nodes) ). Tinybird represents Pipes using the icon. ## What should I use Pipes for?¶ Use Pipes to build features over your data. Write SQL that joins, aggregates, or otherwise transforms your data and publish the result. You have three options to publish the result of a Pipe: [API Endpoints](https://www.tinybird.co/docs/docs/publish/api-endpoints/overview), [Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/overview) , and [Copy Pipes](https://www.tinybird.co/docs/docs/publish/copy-pipes). A Pipe can only have a single output at one time. This means that you cannot create a Materialized View and an API Endpoint from the same Pipe, at the same time. Once your Pipe is published as an API Endpoint, you can create [Tinybird Charts](https://www.tinybird.co/docs/docs/publish/charts) : Interactive, customizable charts of your data. ## Creating Pipes¶ You can create as many Pipes in your Workspace as needed. 
Top tip: press `⌘+K` or `CTRL+K` at any time in the Tinybird UI to open the Command Bar and view all your Workspace resources. ### Creating Pipes in the UI¶ To add a Pipe, click the Plus (+) icon in the left side navigation bar next to the Pipes section (see Mark 1 below). <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fconcepts-pipes-creating-a-pipe-ui-1.png&w=3840&q=75) At the top of your new Pipe, you can change the name & description. **Click on the name or description** to start entering text (see Mark 1 below). <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fconcepts-pipes-creating-a-pipe-ui-2.png&w=3840&q=75) In time, you might end up with a lot of Pipes. Tinybird doesn't yet offer a way to organize your Pipes into folders, but a quick and easy alternative is to group your Pipe names by name - like `mktg-` so all your marketing Pipes are together. Pipes are ordered alphabetically, and must always start with a letter (but you can use numbers, dots, and underscores in the rest of the name). ### Creating Pipes in the CLI¶ #### tb pipe generate¶ You can use `tb pipe generate` to generate a .pipe file. You must provide a name for the Pipe & a single SQL statement. This command will generate the necessary syntax for a single-Node Pipe inside the file. You can open the file in any text editor to change the name, description, query and add more Nodes. Defining your Pipes in files allows you to version control your Tinybird resources with git. The SQL statement must be wrapped in quotes or the command will fail. tb pipe generate my_pipe_name "SELECT 1" Generating the .pipe file does **not** create the Pipe in Tinybird. When you are finished editing the file, you must push the Pipe to Tinybird. tb push my_pipe_name.pipe If you list your Pipes, you'll see that your Pipe exists in Tinybird. tb pipe ls ** Pipes: -------------------------------------------------------------------- | version | name | published date | nodes | -------------------------------------------------------------------- | | my_pipe_name | 2022-11-30 21:44:55 | 1 | -------------------------------------------------------------------- ## Format descriptions using Markdown¶ It is possible to use Markdown syntax in Pipe description fields so you can add richer formatting. Here's an example using headings, bold, and external links: ## This is my first Pipe! You can use **Markdown** in descriptions. Add [links](https://www.tinybird.co) to other bits of info! This will be rendered in the UI like this: <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fconcepts-pipes-markdown-in-pipes-desc-1.png&w=3840&q=75) ## Nodes¶ ### What is a Node?¶ A Node is a container for a single SQL `SELECT` statement. Nodes live within Pipes, and you can have many sequential Nodes inside the same Pipe. A query in a Node can read data from a [Data Source](https://www.tinybird.co/docs/docs/concepts/data-sources) , other Nodes inside the same Pipe, or from [API Endpoint](https://www.tinybird.co/docs/docs/publish/api-endpoints/overview) Nodes in other Pipes. ### What should I use Nodes for?¶ Nodes allow you to break your query logic down into multiple smaller queries. You can then chain Nodes together to build the logic incrementally. Each Node can be developed & tested individually. This makes it much easier to build complex query logic in Tinybird as you avoid creating large monolithic queries with many sub-queries. 
--- URL: https://www.tinybird.co/docs/concepts/workspaces Last update: 2024-11-08T11:23:54.000Z Content: --- title: "Workspaces · Tinybird Docs" theme-color: "#171612" description: "Workspaces are containers for all of your Tinybird resources. Learn all about Workspaces here!" --- # Workspaces¶ ## What is a Workspace?¶ A Workspace is an area that contains a set of Tinybird resources, including Data Sources, Pipes, Nodes, API Endpoints, and Tokens. Tinybird represents Workspaces using the icon. ## What should I use Workspaces for?¶ Workspaces allow you to separate different projects, use cases, and dev/staging/production environments for the things you work on in Tinybird. You can choose exactly how you organize your Workspaces, but the two common ways are a Workspace per project, or per team. ## Create a Workspace¶ ### Create a Workspace in the UI¶ To create a new Workspace, select the name of the existing Workspace. In the menu, select **Create Workspace (+)**. Complete the dialog with the details of your new Workspace, and select "Create Workspace". Workspaces must have unique names within a region. ### Create a Workspace in the CLI¶ To create a new Workspace from the CLI, you can use the following command: tb workspace create You can use this command interactively or by providing the required inputs with flags. To use it interactively, run the command without any flags. For example, `tb workspace create` ). You will see several prompts that you need to complete. Supply [your user Token](https://www.tinybird.co/docs/docs/concepts/auth-tokens#your-user-token) by pasting it into the prompt. In order to create a new workspace we need your user token. Copy it from https://app.tinybird.co/tokens and paste it here: You can select whether you want to use a Starter Kit for your new Workspace. Using the `blank` option gives you an empty Workspace. ------------------------------------------------------------------------------------ | Idx | Id | Description | ------------------------------------------------------------------------------------ | 1 | blank | Empty Workspace | | 2 | web-analytics | Starting Workspace ready to track visits and custom events | ------------------------------------------------------------------------------------ [0] to cancel Use starter kit [1]: 1 Next, supply a name for your Workspace. Workspaces must have unique names within a region. Workspace name [new_workspace_9479]: internal_creating_new_workspaces_example A successful creation will give you the following output: ** Workspace 'internal_creating_new_workspaces_example' has been created If you are using the CLI in an automated system, you probably don't want to use the interactive command. You can instead provide these options as flags: tb workspace create --user_token --starter-kit 1 internal_creating_new_workspaces_example ## Delete a Workspace¶ Deleting a Workspace deletes all resources within the Workspace, including Data Sources, any ingested data, Pipes and published APIs. Deleted Workspaces cannot be recovered. Be careful with this operation. ### Delete a Workspace in the UI¶ To delete a Workspace in the UI, select the Cog icon at top right of the UI. In the modal that appears, select the "Advanced Settings" tab and select "Delete Workspace". You will be required to type "delete workspace" to confirm. 
### Delete a Workspace in the CLI¶ To delete a Workspace in the CLI, you can use the following command: tb workspace delete You will need to provide the name of the Workspace and [your user Token](https://www.tinybird.co/docs/docs/concepts/auth-tokens#your-user-token) , for example: tb workspace delete my_workspace --user_token ## Manage Workspace members¶ Workspace users are referred to as "members". Their access to resources in Tinybird is managed at the Workspace level. You can invite as many members to a Workspace as you want, and a member can belong to multiple Workspaces. Member capabilities are controlled with roles. A user's role can be changed at any time. There are three roles in Tinybird: **Roles and capabilities** | Role | Manage resources | Manage users | Access to billing information | Create a branch | | --- | --- | --- | --- | --- | | `Admin` | Yes | Yes | Yes | Yes | | `Guest` | Yes | No | No | Yes | | `Viewer` | No | No | No | Yes | ### Manage Workspace members in the UI¶ In the top right corner of the Tinybird UI, select the Cog icon. In the modal, navigate to "Members" to review any members already part of your Workspace. Add a new member by entering their email address and confirming their role from the dropdown options. You can invite multiple users at the same time by adding multiple email addresses separated by a comma. The users you invite will get an email notifying them that they have been invited. If they don't already have a Tinybird account, they will be prompted to create one to accept your invite. Invited users appear in the user management modal and by default have the **Guest** role. If the user loses their invite link, you can resend it here too, or copy the link to your clipboard. You can also remove members from here using the "..." menu and selecting "Remove". ### Adding Workspace users in the CLI¶ To add new users, use the following command: tb workspace members add You will need to supply your [user Token](https://www.tinybird.co/docs/docs/concepts/auth-tokens#your-user-token) with the `--user_token` flag. Add the email address of the user you want to invite as the final argument to the command. tb workspace members add --user_token my_new_team_member@example.com A successful invite returns the following output: ** User my_new_team_member@example.com added to workspace 'internal_getting_started_guide' ## Share Data Sources between Workspaces¶ While it's useful to separate your work, sometimes you want to share resources between different teams or projects. For example, you might have a Data Source of events data that multiple teams want to use to build features with. Rather than duplicating this data, you can share it between multiple Workspaces. Let's say you have a Workspace called `Ingest` . In this Workspace, you have your `events` Data Source. You have two teams and each team has their own Workspace to work in, named `Team_A` and `Team_B` . Both teams want access to the `events` Data Source. From the `Ingest` Workspace you can share the `events` Data Source to both the `Team_A` and `Team_B` Workspaces, giving them access to the `events` data. This means that while both teams can access the data, the `Team_A` and `Team_B` Workspaces have no other connection with the `Ingest` Workspace. Modifications of the `Ingest` Data Sources can not be made from the other Workspaces. If a member of the `Team_A` or `Team_B` Workspaces needs to manage the `Ingest` Data Sources, they can be added as a member of the `Ingest` Workspace. 
You cannot share Data Sources to Workspaces in different regions. ### Sharing Data Sources in the UI¶ This section describes sharing a Data Source between your Workspaces. On the left side navigation, find the Data Source that you want to share. When you hover over the Data Source, you will see a 3-dot icon for the Data Source Actions menu, select the 3 dots (see Mark 1 below). <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fconcepts-workspaces-sharing-ds-workspaces-1.png&w=3840&q=75) In the context menu that appears, select the **Share** action (see Mark 1 below). <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fconcepts-workspaces-sharing-ds-workspaces-2.png&w=3840&q=75) This will open a dialog window. In the **Search** box, you can type the name of the Workspace that you want to share with (see Mark 1 below). As you type, the search box will suggest Workspaces that you are a member of. You can only share Data Sources with Workspaces that you are a member of. <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fconcepts-workspaces-sharing-ds-workspaces-3.png&w=3840&q=75) You can select from the suggestions, or type the name and press Enter. When you have selected the Workspace, you will see the Workspace in the **Shared** section (see Mark 1 below). Select **Done** to close the dialog (see Mark 2 below). <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fconcepts-workspaces-sharing-ds-workspaces-4.png&w=3840&q=75) ### Sharing Data Sources in the CLI¶ This section describes how to share Data Sources from one Workspace to another Workspace in the Tinybird CLI. To share a Data Source, you can use the following command: tb datasource share You will need to supply [your user Token](https://www.tinybird.co/docs/docs/concepts/auth-tokens#your-user-token) with the `--user_token` flag. You then need to supply the Data Source name and target Workspace name as the following arguments. tb datasource share --user_token shopping_data my_second_workspace The Data Source that you want to share must exist in the Workspace that your Tinybird CLI is authenticated against. To check which Workspace your CLI is currently authenticated with, you can use: tb auth info In addition, you could also do `tb push --user_token ` to push a Data Source with a `SHARED_WITH` parameter to share it with another Workspace. ## Regions¶ A Workspace belongs to one region. Tinybird has three public regions: `EU` and `US East` in Google Cloud Platform and `US East` in AWS. Additional regions for GCP, AWS and Azure are available for Enterprise customers. The following table lists the available regions and their corresponding API base URLs: **Current Tinybird regions** | Region | Provider | Provider region | API base URL | | --- | --- | --- | --- | | Europe | GCP | europe-west3 | [ https://api.tinybird.co](https://api.tinybird.co/) | | US East | GCP | us-east4 | [ https://api.us-east.tinybird.co](https://api.us-east.tinybird.co/) | | Europe | AWS | eu-central-1 | [ https://api.eu-central-1.aws.tinybird.co](https://api.eu-central-1.aws.tinybird.co/) | | US East | AWS | us-east-1 | [ https://api.us-east.aws.tinybird.co](https://api.us-east.aws.tinybird.co/) | | US West | AWS | us-west-2 | [ https://api.us-west-2.aws.tinybird.co](https://api.us-west-2.aws.tinybird.co/) | Tinybird documentation uses `https://api.tinybird.co` as the default example API base URL. 
If you are not using the Europe GCP region, you will need to replace this with the API base URL for your region. If you would like to request a new region, you can do so inside the Tinybird UI from the region selector, or contact Tinybird at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) or in the [Community Slack](https://www.tinybird.co/docs/docs/community). ## Single Sign-On (SSO)¶ Out of the box, Tinybird has email & OAuth providers for logging in to the platform. However, if you have a requirement to integrate Tinybird with your own SSO provider, such as Azure Entra ID (formerly Azure Active Directory or AD), this is available to Enterprise customers. If you have a requirement for SSO integration, contact Tinybird at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) or in the [Community Slack](https://www.tinybird.co/docs/docs/community). ## Secure cloud connections¶ Tinybird supports TLS across all ingest connectors, providing encryption on the wire for incoming data. However, if you have a requirement to connect to Tinybird via a secure cloud gateway, such as AWS PrivateLink or Google Private Service Connect, this is available to Enterprise customers. If you have a requirement for secure cloud connections, contact Tinybird at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) or in the [Community Slack](https://www.tinybird.co/docs/docs/community). --- URL: https://www.tinybird.co/docs/core-concepts Last update: 2024-11-12T11:45:41.000Z Content: --- title: "Core concepts · Tinybird Docs" theme-color: "#171612" description: "Find Tinybird-related core terms and their definitions." --- # Core concepts¶ Familiarize yourself with Tinybird's core concepts and terminology to get a better understanding of how Tinybird works and how you can make the most of its features. ## Workspaces¶ Workspaces help you organize and collaborate on your Tinybird projects. You can have more than one Workspace. A Workspace contains the project resources, data, and state. You can share resources, such as Pipes or Data Sources, between Workspaces. You can also invite users to your Workspaces and define their role and permissions. A typical usage of Workspaces is to provide a team or project with a space to work in. User roles: - ** Admins** can do everything in the Workspace. - ** Guests** can do most things, but they can't delete Workspaces, invite or remove users, or share Data Sources across Workspaces. - ** Viewers** can't edit anything in the main Workspace[ Branch](https://www.tinybird.co/docs/docs/concepts/branches) , but they can use[ Playgrounds](https://www.tinybird.co/docs/docs/query/overview#use-the-playground) to query the data, as well as create or edit Branches. [Read more about Workspaces.](https://www.tinybird.co/docs/docs/concepts/workspaces) ## Data Sources¶ Data Sources are how you ingest and store data in Tinybird. All your data lives inside a Data Source, and you write SQL queries against Data Sources. You can bulk upload or stream data into a Data Source, and they support several different incoming data formats, such as CSV, JSON, and Parquet. [Read more about Data Sources.](https://www.tinybird.co/docs/docs/concepts/data-sources) ## Pipes¶ Pipes are how you write SQL logic in Tinybird. Pipes are a collection of one or more SQL queries chained together and compiled into a single query. Pipes let you break larger queries down into smaller queries that are easier to read. 
You can publish Pipes as API Endpoints, copy them, and create Materialized Views. [Read more about Pipes.](https://www.tinybird.co/docs/docs/concepts/pipes) ## Nodes¶ A Node is a single SQL `SELECT` statement that selects data from a Data Source or another Node or API Endpoint. Nodes live within [Pipes.](https://www.tinybird.co/docs/docs/concepts/pipes) ## API Endpoints¶ You can build your SQL logic inside a Pipe and then publish the result of your query as an HTTP API Endpoint. [Read more about API Endpoints.](https://www.tinybird.co/docs/docs/publish/api-endpoints/overview) ## Charts¶ Charts visualize your data. You can create and publish Charts in Tinybird from your published API Endpoints. [Read more about Charts.](https://www.tinybird.co/docs/docs/publish/charts) ## Tokens¶ Tokens authorize requests. Tokens can be static for back-end integrations, or custom JWTs for front-end applications. [Read more about Tokens.](https://www.tinybird.co/docs/docs/concepts/auth-tokens) ## Branches¶ Branches let you create a copy of your Workspace where you can make changes, run tests, and develop new features. You can then merge the changes back into the original Workspace. [Read more about Branches.](https://www.tinybird.co/docs/docs/concepts/branches) ## CLI¶ Use the Tinybird command line interface (CLI) to interact with Tinybird from the terminal. You can install it on your local machine or embed it into your CI/CD pipelines. [Read more about the Tinybird CLI.](https://www.tinybird.co/docs/docs/cli/quick-start) ## ClickHouse®¶ ClickHouse is an open source OLAP database that serves as Tinybird's real-time analytics database and SQL engine. The SQL queries that you write inside Tinybird use the ClickHouse SQL dialect. [Read more about ClickHouse.](https://clickhouse.com/docs/en/intro) ## Next steps¶ - Understand Tinybird's[ underlying architecture](https://www.tinybird.co/docs/docs/architecture) . - Check out the[ Tinybird Quick Start](https://www.tinybird.co/docs/docs/quick-start) . Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/guides/ingesting-data/ingest-from-aws-kinesis Last update: 2024-11-13T15:42:17.000Z Content: --- title: "Stream from AWS Kinesis · Tinybird Docs" theme-color: "#171612" description: "In this guide, you'll learn how to send data from AWS Kinesis to Tinybird." --- # Stream from AWS Kinesis¶ In this guide, you'll learn how to send data from AWS Kinesis to Tinybird. If you have a [Kinesis Data Stream](https://aws.amazon.com/kinesis/data-streams/) that you want to send to Tinybird, it should be pretty quick thanks to [Kinesis Firehose](https://aws.amazon.com/kinesis/data-firehose/) . This page explains how to integrate Kinesis with Tinybird using Firehose. ## 1. Push messages From Kinesis To Tinybird¶ ### Create a Token with the right scope¶ In your Workspace, create a Token with the `Create new Data Sources or append data to existing ones` scope: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fingest-from-aws-kinesis-1.png&w=3840&q=75) <-figcaption-> Create a Token with the right scope ### Create a new Data Stream¶ Start by creating a new Data Stream in AWS Kinesis (see the [AWS documentation](https://docs.aws.amazon.com/streams/latest/dev/working-with-streams.html) for more information). 
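If you prefer to script this step instead of using the AWS console, the following is a minimal boto3 sketch (the region, stream name, and shard count are placeholders; refer to the AWS documentation for the full set of options):

##### Create a Kinesis Data Stream with boto3 (sketch)

import boto3

kinesis = boto3.client("kinesis", region_name="eu-central-1")  # use your AWS region

# Creates a provisioned-mode stream; adjust the name and shard count as needed.
kinesis.create_stream(StreamName="tinybird-demo-stream", ShardCount=1)

# Wait until the stream is ACTIVE before attaching Firehose to it.
kinesis.get_waiter("stream_exists").wait(StreamName="tinybird-demo-stream")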
<-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fingest-from-aws-kinesis-2.png&w=3840&q=75) <-figcaption-> Create a Kinesis Data Stream ### Create a Firehose Delivery Stream¶ Next, [create a Kinesis Data Firehose Delivery Stream](https://docs.aws.amazon.com/firehose/latest/dev/basic-create.html). Set the **Source** to **Amazon Kinesis Data Streams** and the **Destination** to **HTTP Endpoint**. In the **Destination Settings** , set **HTTP Endpoint URL** to point to the [Tinybird Events API](https://www.tinybird.co/docs/docs/guides/ingesting-data/ingest-from-the-events-api). https://api.tinybird.co/v0/events?name=&wait=true&token= This example is for Workspaces in the `GCP` --> `europe-west3` region. If necessary, replace with the [correct region for your Workspace](https://www.tinybird.co/docs/docs/api-reference/overview#regions-and-endpoints) . Additionally, note the `wait=true` parameter. Learn more about it [in the Events API docs](https://www.tinybird.co/docs/docs/guides/ingesting-data/ingest-from-the-events-api#wait-for-acknowledgement). You don't need to create the Data Source in advance; it will automatically be created for you. ### Send sample messages and check that they arrive in Tinybird¶ If you don't have an active data stream, use [this Python script](https://gist.github.com/GnzJgo/f1a80186a301cd8770a946d02343bafd) to generate dummy data. Back in Tinybird, you should see 3 columns filled with data in your Data Source. `timestamp` and `requestId` are self explanatory, and your messages are in `records__data` : <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fingest-from-aws-kinesis-3.png&w=3840&q=75) <-figcaption-> Firehose Data Source ## 2. Decode message data¶ The `records__data` column contains an array of encoded messages. To get one row per element of the array, use the [ARRAY JOIN Clause](https://clickhouse.com/docs/en/sql-reference/statements/select/array-join/) . You'll also need to decode the messages with the [base64Decode() function](https://clickhouse.com/docs/en/sql-reference/functions/string-functions/#base64decodes). Now that the raw JSON is in a column, you can use [JSONExtract functions](https://clickhouse.com/docs/en/sql-reference/functions/json-functions/) to extract the desired fields: ##### Decoding messages NODE decode_messages SQL > SELECT base64Decode(encoded_m) message, fromUnixTimestamp64Milli(timestamp) kinesis_ts FROM firehose ARRAY JOIN records__data as encoded_m NODE extract_message_fields SQL > SELECT kinesis_ts, toDateTime64(JSONExtractString(message, 'datetime'), 3) datetime, JSONExtractString(message, 'event') event, JSONExtractString(message, 'product') product FROM decode_messages <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fingest-from-aws-kinesis-4.png&w=3840&q=75) <-figcaption-> Decoding messages ## Performance optimizations¶ It is highly recommended to persist the decoded and unrolled result in a different Data Source. You can do it with a [Materialized View](https://www.tinybird.co/docs/docs/publish/materialized-views/overview) : a combination of a Pipe and a Data Source that writes the transformed data into the destination Data Source as soon as new data arrives in the Firehose Data Source. Don't store what you won't need. In this example, some of the extra columns could be skipped. [Add a TTL](https://www.tinybird.co/docs/docs/concepts/data-sources#setting-data-source-ttl) to the Firehose Data Source to prevent keeping more data than you need.
Another alternative is to create the Firehose Data Source with a Null Engine. This way, data ingested there can be transformed to fill the destination Data Source without being persisted in the Null Engine Data Source. ## Next steps¶ - Ingest from other sources - see the[ Overview page](https://www.tinybird.co/docs/docs/ingest/overview) and explore. - Build your first[ Tinybird Pipe](https://www.tinybird.co/docs/docs/concepts/pipes) . Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/guides/ingesting-data/ingest-from-csv-files Last update: 2024-07-31T16:54:46.000Z Content: --- title: "Ingest CSV files · Tinybird Docs" theme-color: "#171612" description: "In this guide, you'll learn how to ingest data into Tinybird using CSV (comma-separated values) files." --- # Ingest CSV files¶ CSV (comma-separated values) is one of the most widely used formats out there. However, it's used in different ways: some people don't use commas, others escape values differently or are unsure about using headers. The Tinybird platform is smart enough to handle many scenarios. If your data doesn't comply with format and syntax best practices, Tinybird will still aim to understand your file and ingest it, but following certain best practices can speed up your CSV processing by up to 10x. ## Syntax best practices¶ By default, Tinybird processes your CSV file assuming the file follows the most common standard ( [RFC4180](https://datatracker.ietf.org/doc/html/rfc4180#section-2) ). Key points: - Separate values with commas. - Each record is a line (with CRLF as the line break). The last line may or may not have a line break. - First line as a header is optional (though not using one is faster in Tinybird). - Double quotes are optional but using them means you can escape values (for example, if your content has commas or line breaks). Example: Instead of using the backslash `\` as an escape character, like this: 1234567890,0,0,0,0,2021-01-01 10:00:00,"{\"authorId\":\"123456\",\"handle\":\"aaa\"}" Use two double quotes: ##### More performant 1234567890,0,0,0,0,2021-01-01 10:00:00,"{""authorId"":""123456"",""handle"":""aaa""}" - Fields containing line breaks, double quotes, and commas should be enclosed in double quotes. - Double quotes can also be escaped by using another double quote (""aaa"",""b""""bb"",""ccc"") In addition to the previous points, it's also recommended to: 1. Format `DateTime` columns as `YYYY-MM-DD HH:MM:SS` and `Date` columns as `YYYY-MM-DD` . 2. Send the encoding in the `charset` part of the `content-type` header, if it's different from UTF-8. The expectation is UTF-8, so it should look like this: `Content-Type: text/html; charset=utf-8` . 3. You can set values as `null` in different ways, for example `""[]""` , `""""` (empty space), `N` and `"N"` . 4. If you use a delimiter other than a comma, explicitly define it with the API parameter `dialect_delimiter` . 5. If you use an escape character other than a double quote ( `"` ), explicitly define it with the API parameter `dialect_escapechar` . 6. If you have no option but to use a different line break character, explicitly define it with the API parameter `dialect_new_line` . For more information, check the [Data Sources API docs](https://www.tinybird.co/docs/docs/api-reference/datasource-api).
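To make the dialect parameters above concrete, here's a sketch that appends a semicolon-delimited CSV to an existing Data Source (the Data Source name, file URL, and `TINYBIRD_TOKEN` environment variable are placeholders):

##### Append a semicolon-delimited CSV via the Data Sources API (sketch)

import os
import requests

TOKEN = os.environ["TINYBIRD_TOKEN"]  # token with append permissions

response = requests.post(
    "https://api.tinybird.co/v0/datasources",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={
        "mode": "append",
        "name": "events",
        "url": "https://example.com/events.csv",
        "dialect_delimiter": ";",
    },
)
print(response.status_code, response.json())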
## Append data¶ Once the Data Source schema has been created, you can optimize your performance by not including the header. Just keep the data in the same order. However, if the header is included and it contains all the names present in the Data Source schema the ingestion will still work (even if the columns follow a different order to the initial creation). ## Next steps¶ - Got your schema sorted and ready to make some queries? Understand[ how to work with time](https://www.tinybird.co/docs/docs/guides/querying-data/working-with-time) . - Learn how to[ monitor your ingestion](https://www.tinybird.co/docs/docs/guides/monitoring/monitor-your-ingestion) . --- URL: https://www.tinybird.co/docs/guides/ingesting-data/ingest-from-dynamodb-single-table-design Last update: 2024-11-13T15:42:17.000Z Content: --- title: "Working with DynamoDB Single-Table Design · Tinybird Docs" theme-color: "#171612" description: "In this guide, you'll learn how to work with data that follows DynamoDB Single-Table Design." --- # Working with DynamoDB Single-Table Design¶ Single-Table Design is a common pattern [recommended by AWS](https://aws.amazon.com/blogs/compute/creating-a-single-table-design-with-amazon-dynamodb/) in which different table schemas are stored in the same table. Single-table design makes it easier to support many-to-many relationships and avoid the need for JOINs, which DynamoDB doesn't support. Single-Table Design is a good pattern for DynamoDB, but it's not optimal for analytics. To achieve higher performance in Tinybird, normalize data from DynamoDB into multiple tables that support the access patterns of your analytical queries. The normalization process is achieved entirely within Tinybird by ingesting the raw DynamoDB data into a landing Data Source and then creating Materialized Views to extract items into separate tables. This guide assumes you're familiar with DynamoDB, Tinybird, creating DynamoDB Data Sources in Tinybird, and Materialized Views. ## Example DynamoDB Table¶ For example, if Tinybird metadata were stored in DynamoDB using Single-Table Design, the table might look like this: - ** Partition Key** : `Org#Org_name` , example values:** Org#AWS** or** Org#Tinybird** . - ** Sort Key** : `Item_type#Id` , example values:** USER#1** or** WS#2** . - ** Attributes** : the information stored for each kind of item, like user email or Workspace cores. ## Create the DynamoDB Data Source¶ Use the [DynamoDB Connector](https://www.tinybird.co/docs/docs/ingest/dynamodb) to ingest your DynamoDB table into a Data Source. Rather than defining all columns in this landing Data Source, set only the Partition Key (PK) and Sort Key (SK) columns. The rest of the attributes are stored in the `_record` column as JSON. You don't need to define the `_record` column in the schema, as it's created automatically. SCHEMA > `PK` String `json:$.Org#Org_name`, `SK` String `json:$.Item_type#Id` ENGINE "ReplacingMergeTree" ENGINE_SORTING_KEY "PK, SK" ENGINE_VER "_timestamp" ENGINE_IS_DELETED "_is_deleted" IMPORT_SERVICE 'dynamodb' IMPORT_CONNECTION_NAME IMPORT_TABLE_ARN IMPORT_EXPORT_BUCKET The following image shows how data looks. 
The DynamoDB Connector creates some additional columns, such as `_timestamp` , that aren't in the .datasource file: <-figure-> ![DynamoDB Table storing users and workspaces information](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-ddb-std-2.png&w=3840&q=75) <-figcaption-> DynamoDB Table storing users and workspaces information ## Use a Pipe to filter and extract items¶ Data is now available in your landing Data Source. However, you need to use the `JSONExtract` function to access attributes from the `_record` column. To optimize performance, use [Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/overview) to extract and store item types in separate Data Sources with their own schemas. Create a Pipe, use the PK and SK columns as needed to filter for a particular item type, and parse the attributes from the JSON in the `_record` column. The example table has User and Workspace items, requiring a total of two Materialized Views, one for each item type. <-figure-> ![Workspace Data Flow showing std connection, landing DS and users and workspaces Materialized Views](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-ddb-std-4.png&w=3840&q=75) <-figcaption-> Two Materialized Views from landing DS To extract the Workspace items, the Pipe uses the SK to filter for Workspace items, and parses the attributes from the JSON in the `_record` column. For example: SELECT toLowCardinality(splitByChar('#', PK)[2]) org, toUInt32(splitByChar('#', SK)[2]) workspace_id, JSONExtractString(_record,'ws_name') ws_name, toUInt16(JSONExtractUInt(_record,'cores')) cores, JSONExtractUInt(_record,'storage_tb') storage_tb, _record, _old_record, _timestamp, _is_deleted FROM dynamodb_ds_std WHERE splitByChar('#', SK)[1] = 'WS' ## Create the Materialized Views¶ Create a Materialized View from the Pipe to store the extracted data in a new Data Source. The Materialized View must use the ReplacingMergeTree engine to handle the deduplication of rows, supporting updates and deletes from DynamoDB. Use the following engine settings and configure them as needed for your table: - `ENGINE "ReplacingMergeTree"` : the ReplacingMergeTree engine is used to deduplicate rows. - `ENGINE_SORTING_KEY "key1, key2"` : the columns used to identify unique items; can be one or more columns, typically the part of the PK and SK that doesn't identify the item type. - `ENGINE_VER "_timestamp"` : the column used to identify the most recent row for each key. - `ENGINE_IS_DELETED "_is_deleted"` : the column used to identify if a row has been deleted. For example, the Materialized View for the Workspace items uses the following schema and engine settings: SCHEMA > `org` LowCardinality(String), `workspace_id` UInt32, `ws_name` String, `cores` UInt16, `storage_tb` UInt64, `_record` String, `_old_record` Nullable(String), `_timestamp` DateTime64(3), `_is_deleted` UInt8 ENGINE "ReplacingMergeTree" ENGINE_SORTING_KEY "org, workspace_id" ENGINE_VER "_timestamp" ENGINE_IS_DELETED "_is_deleted" Repeat the same process for each item type. <-figure-> ![Materialized View for extracting Users attributes](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-ddb-std-3.png&w=3840&q=75) <-figcaption-> Materialized View for extracting Users attributes You now have your Data Sources with the extracted columns ready to be queried. ## Review performance gains¶ This process offers significant performance gains over querying the landing Data Source.
To demonstrate this, you can use a Playground to compare the performance of querying the raw data vs the extracted data. For the example table, the following queries aggregate the total number of users, workspaces, cores, and storage per organization using the unoptimized raw data and the optimized extracted data. The query over raw data took 335 ms, while the query over the extracted data took 144 ms, for a 2.3x improvement. NODE users_stats SQL > SELECT org, count() total_users FROM ddb_users_mv FINAL GROUP BY org NODE ws_stats SQL > SELECT org, count() total_workspaces, sum(cores) total_cores, sum(storage_tb) total_storage_tb FROM ddb_workspaces_mv FINAL GROUP BY org NODE users_stats_raw SQL > SELECT toLowCardinality(splitByChar('#', PK)[2]) org, count() total_users FROM dynamodb_ds_std FINAL WHERE splitByChar('#', SK)[1] = 'USER' GROUP BY org NODE ws_stats_raw SQL > SELECT toLowCardinality(splitByChar('#', PK)[2]) org, count() total_ws, sum(toUInt16(JSONExtractUInt(_record,'cores'))) total_cores, sum(JSONExtractUInt(_record,'storage_tb')) total_storage_tb FROM dynamodb_ds_std FINAL WHERE splitByChar('#', SK)[1] = 'WS' GROUP BY org NODE org_stats SQL > SELECT * FROM users_stats JOIN ws_stats using org NODE org_stats_raw SQL > SELECT * FROM users_stats_raw JOIN ws_stats_raw using org This is how the outcome looks in Tinybird: <-figure-> ![Comparison of same query](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-ddb-std-5.png&w=3840&q=75) <-figcaption-> Same info, faster and more efficient from Materialized Views --- URL: https://www.tinybird.co/docs/guides/ingesting-data/ingest-from-google-gcs Last update: 2024-10-28T11:06:14.000Z Content: --- title: "Ingest from Google Cloud Storage · Tinybird Docs" theme-color: "#171612" description: "In this guide, you'll learn how to automatically synchronize all the CSV files in a Google GCS bucket to a Tinybird Data Source." --- # Ingest from Google Cloud Storage¶ In this guide, you'll learn how to automatically synchronize all the CSV files in a Google GCS bucket to a Tinybird Data Source. ## Prerequisites¶ This guide assumes you have familiarity with [Google GCS buckets](https://cloud.google.com/storage/docs/buckets) and the basics of [ingesting data into Tinybird](https://www.tinybird.co/docs/docs/ingest/overview). ## Perform a one-off load¶ When building on Tinybird, people often want to load historical data that comes from another system (called 'seeding' or 'backfilling'). A very common pattern is exporting historical data by creating a dump of CSV files into a Google GCS bucket, then ingesting these CSV files into Tinybird. You can append these files to a Data Source in Tinybird using the Data Sources API. 
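The rest of this guide assumes the destination Data Source already exists. If it doesn't, one way to create it up front is the Data Sources API's `create` mode. The following is a minimal sketch: the Data Source name ( `events` ) matches the rest of this guide, while the schema columns are purely illustrative, so adjust them to match your CSV files.

##### Example: create the destination Data Source before appending
# Hypothetical schema: replace the column names and types with the ones in your CSV files
curl \
  -H "Authorization: Bearer $TB_TOKEN" \
  -X POST "https://api.tinybird.co/v0/datasources" \
  -d "mode=create" \
  -d "name=events" \
  --data-urlencode "schema=date DateTime, user_id String, event String"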
Let's assume you have a set of CSV files in your GCS bucket: ##### List of events files tinybird-assets/datasets/guides/events/events_0.csv tinybird-assets/datasets/guides/events/events_1.csv tinybird-assets/datasets/guides/events/events_10.csv tinybird-assets/datasets/guides/events/events_11.csv tinybird-assets/datasets/guides/events/events_12.csv tinybird-assets/datasets/guides/events/events_13.csv tinybird-assets/datasets/guides/events/events_14.csv tinybird-assets/datasets/guides/events/events_15.csv tinybird-assets/datasets/guides/events/events_16.csv tinybird-assets/datasets/guides/events/events_17.csv tinybird-assets/datasets/guides/events/events_18.csv tinybird-assets/datasets/guides/events/events_19.csv tinybird-assets/datasets/guides/events/events_2.csv tinybird-assets/datasets/guides/events/events_20.csv tinybird-assets/datasets/guides/events/events_21.csv tinybird-assets/datasets/guides/events/events_22.csv tinybird-assets/datasets/guides/events/events_23.csv tinybird-assets/datasets/guides/events/events_24.csv tinybird-assets/datasets/guides/events/events_25.csv tinybird-assets/datasets/guides/events/events_26.csv tinybird-assets/datasets/guides/events/events_27.csv tinybird-assets/datasets/guides/events/events_28.csv tinybird-assets/datasets/guides/events/events_29.csv tinybird-assets/datasets/guides/events/events_3.csv tinybird-assets/datasets/guides/events/events_30.csv tinybird-assets/datasets/guides/events/events_31.csv tinybird-assets/datasets/guides/events/events_32.csv tinybird-assets/datasets/guides/events/events_33.csv tinybird-assets/datasets/guides/events/events_34.csv tinybird-assets/datasets/guides/events/events_35.csv tinybird-assets/datasets/guides/events/events_36.csv tinybird-assets/datasets/guides/events/events_37.csv tinybird-assets/datasets/guides/events/events_38.csv tinybird-assets/datasets/guides/events/events_39.csv tinybird-assets/datasets/guides/events/events_4.csv tinybird-assets/datasets/guides/events/events_40.csv tinybird-assets/datasets/guides/events/events_41.csv tinybird-assets/datasets/guides/events/events_42.csv tinybird-assets/datasets/guides/events/events_43.csv tinybird-assets/datasets/guides/events/events_44.csv tinybird-assets/datasets/guides/events/events_45.csv tinybird-assets/datasets/guides/events/events_46.csv tinybird-assets/datasets/guides/events/events_47.csv tinybird-assets/datasets/guides/events/events_48.csv tinybird-assets/datasets/guides/events/events_49.csv tinybird-assets/datasets/guides/events/events_5.csv tinybird-assets/datasets/guides/events/events_6.csv tinybird-assets/datasets/guides/events/events_7.csv tinybird-assets/datasets/guides/events/events_8.csv tinybird-assets/datasets/guides/events/events_9.csv ### Ingest a single file¶ To ingest a single file, [generate a signed URL in GCP](https://cloud.google.com/storage/docs/access-control/signed-urls) , and send the URL to the Data Sources API using the `append` mode flag: ##### Example POST request with append mode flag curl -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources?name=&mode=append" \ --data-urlencode "url=" ### Ingest multiple files¶ If you want to ingest multiple files, you probably don't want to manually write each cURL. Instead, create a script to iterate over the files in the bucket and generate the cURL commands automatically. The following script example requires the [gsutil tool](https://cloud.google.com/storage/docs/gsutil) and assumes you have already created your Tinybird Data Source. 
You can use the `gsutil` tool to list the files in the bucket, extract the name of each CSV file, and create a signed URL. Then, generate a cURL command to send the signed URL to Tinybird. To avoid hitting [API rate limits](https://www.tinybird.co/docs/docs/api-reference/overview#limits), wait 15 seconds between requests. Here's an example script in bash: ##### Ingest CSV files from a Google Cloud Storage Bucket to Tinybird TB_HOST= TB_TOKEN= BUCKET=gs:// DESTINATION_DATA_SOURCE= GOOGLE_APPLICATION_CREDENTIALS= REGION= for url in $(gsutil ls $BUCKET | grep csv) do echo $url SIGNED=`gsutil signurl -r $REGION $GOOGLE_APPLICATION_CREDENTIALS $url | tail -n 1 | python3 -c "import sys; print(sys.stdin.read().split('\t')[-1])"` curl -H "Authorization: Bearer $TB_TOKEN" \ -X POST "$TB_HOST/v0/datasources?name=$DESTINATION_DATA_SOURCE&mode=append" \ --data-urlencode "url=$SIGNED" echo sleep 15 done The script uses the following variables: - `TB_HOST` as the corresponding URL for[ your region](https://www.tinybird.co/docs/docs/api-reference/overview#regions-and-endpoints) . - `TB_TOKEN` as a Tinybird[ Token](https://www.tinybird.co/docs/docs/concepts/auth-tokens) with `DATASOURCES:CREATE` or `DATASOURCES:APPEND` scope. See the[ Tokens API](https://www.tinybird.co/docs/docs/api-reference/token-api) for more information. - `BUCKET` as the GCS URI of the bucket containing the events CSV files. - `DESTINATION_DATA_SOURCE` as the name of the Data Source in Tinybird, in this case `events` . - `GOOGLE_APPLICATION_CREDENTIALS` as the local path of a Google Cloud service account JSON file. - `REGION` as the Google Cloud region name. ## Automatically sync files with Google Cloud Functions¶ The previous scenario covered a one-off dump of CSV files in a bucket to Tinybird. A slightly more complex scenario is appending to a Tinybird Data Source each time a new CSV file is dropped into a GCS bucket, which can be done using Google Cloud Functions. That way you can have your ETL process exporting data from your Data Warehouse (such as Snowflake or BigQuery) or any other origin, and you don't have to think about manually synchronizing those files to Tinybird. Imagine you have a GCS bucket named `gs://automatic-ingestion-poc/` and each time you put a CSV there you want to sync it automatically to an `events` Data Source previously created in Tinybird: 1. Clone this GitHub repository ( `gcs-cloud-function` ) . 2. Install and configure the `gcloud` command line tool. 3. Run `cp .env.yaml.sample .env.yaml` and set the `TB_HOST` and `TB_TOKEN` variables. 4.
Run: ##### Syncing from GCS to Tinybird with Google Cloud Functions # set some environment variables before deploying PROJECT_NAME= SERVICE_ACCOUNT_NAME= BUCKET_NAME= REGION= TB_FUNCTION_NAME= # grant permissions to deploy the cloud function and read from storage to the service account gcloud projects add-iam-policy-binding $PROJECT_NAME --member serviceAccount:$SERVICE_ACCOUNT_NAME --role roles/storage.admin gcloud projects add-iam-policy-binding $PROJECT_NAME --member serviceAccount:$SERVICE_ACCOUNT_NAME --role roles/iam.serviceAccountTokenCreator gcloud projects add-iam-policy-binding $PROJECT_NAME --member serviceAccount:$SERVICE_ACCOUNT_NAME --role roles/editor # deploy the cloud function gcloud functions deploy $TB_FUNCTION_NAME \ --runtime python38 \ --trigger-resource $BUCKET_NAME \ --trigger-event google.storage.object.finalize \ --region $REGION \ --env-vars-file .env.yaml \ --service-account $SERVICE_ACCOUNT_NAME It deploys a Google Cloud Function with name `TB_FUNCTION_NAME` to your Google Cloud account, which listens for new files in the `BUCKET_NAME` provided (in this case `automatic-ingestion-poc` ), and automatically appends them to the Tinybird Data Source described by the `FILE_REGEXP` environment variable. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fsyncing-data-from-s3-or-gcs-buckets-3.png&w=3840&q=75) <-figcaption-> Cloud function to sync a GCS bucket to Tinybird Now you can drop CSV files into the configured bucket: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fsyncing-data-from-s3-or-gcs-buckets-4.gif&w=3840&q=75) <-figcaption-> Drop files to a GCS bucket and check the datasources_ops_log A recommended pattern is naming the CSV files in the format `datasourcename_YYYYMMDDHHMMSS.csv` so they are automatically appended to `datasourcename` in Tinybird. For instance, `events_20210125000000.csv` will be appended to the `events` Data Source. ## Next steps¶ - Got your schema sorted and ready to make some queries? Understand[ how to work with time](https://www.tinybird.co/docs/docs/guides/querying-data/working-with-time) . - Learn how to[ monitor your ingestion](https://www.tinybird.co/docs/docs/guides/monitoring/monitor-your-ingestion) . --- URL: https://www.tinybird.co/docs/guides/ingesting-data/ingest-from-google-pubsub Last update: 2024-10-28T11:06:14.000Z Content: --- title: "Ingest from Google Pub/Sub · Tinybird Docs" theme-color: "#171612" description: "In this guide you'll learn how to send data from Google Pub/Sub to Tinybird." --- # Stream from Google Pub/Sub¶ In this guide you'll learn how to send data from Google Pub/Sub to Tinybird. ## Overview¶ Tinybird is a Google Cloud partner & supports integrating with Google Cloud services. [Google Pub/Sub](https://cloud.google.com/pubsub) is often used as a messaging middleware that decouples event stream sources from the end destination. Pub/Sub streams are usually consumed by Google's DataFlow which can send events on to destinations such as BigQuery, BigTable, or Google Cloud Storage. This DataFlow pattern works with Tinybird too, however, Pub/Sub also has a feature called [Push subscriptions](https://cloud.google.com/pubsub/docs/push) which can forward messages directly from Pub/Sub to Tinybird. The following guide steps use the subscription approach. ## Push messages from Pub/Sub to Tinybird¶ ### 1. Create a Pub/Sub topic¶ Start by creating a topic in Google Pub/Sub following the [Google Pub/Sub documentation](https://cloud.google.com/pubsub/docs/admin#create_a_topic). ### 2. 
Create a push subscription¶ Next, [create a Push subscription in Pub/Sub](https://cloud.google.com/pubsub/docs/create-subscription#push_subscription). Set the **Delivery Type** to **Push**. In the **Endpoint URL** field, use the following snippet (which uses the [Tinybird Events API](https://www.tinybird.co/docs/docs/guides/ingesting-data/ingest-from-the-events-api) ) and pass your own Token, which you can find in your Workspace > Tokens: ##### Endpoint URL https://api.tinybird.co/v0/events?wait=true&name=&token= Replace the Tinybird API hostname/region with the [right API URL region](https://www.tinybird.co/docs/docs/api-reference/overview#regions-and-endpoints) that matches your Workspace. Your Token lives in the Workspace under "Tokens". If you are sending single-line JSON payloads through Pub/Sub, tick the **Enable payload unwrapping** option to enable unwrapping. This means that data is not base64 encoded before it is sent to Tinybird. If you are sending any other format via Pub/Sub, leave this unchecked (you'll need to follow the decoding steps at the bottom of this guide). Set **Retry policy** to **Retry after exponential backoff delay** . Set the **Minimum backoff** to **1** and **Maximum backoff** to **60**. You don't need to create the Data Source in advance; it will be created automatically for you. This snippet also includes the `wait=true` parameter, which is explained in the [Events API docs](https://www.tinybird.co/docs/docs/guides/ingesting-data/ingest-from-the-events-api#wait-for-acknowledgement). ### 3. Send sample messages¶ Generate and send some sample messages to test your connection. If you don't have your own messages to test, use [this script](https://gist.github.com/alejandromav/dec8e092ef62d879e6821da06f6459c2). ### 4. Check the Data Source¶ Pub/Sub will start to push data to Tinybird. Check the Tinybird UI to see that the Data Source has been created and events are arriving. ### (Optional) Decode the payload¶ If you enabled the **Enable payload unwrapping** option, there is nothing else to do. However, if you are not sending single-line JSON payloads (NDJSON, JSONL) through Pub/Sub, you'll need to continue to base64 encode the data before sending it to Tinybird. When the data arrives in Tinybird, you can decode it using the `base64Decode` function, like this: SELECT message_message_id as message_id, message_publish_time, base64Decode(message_data) as message_data FROM events_demo ## Next steps¶ - Explore other Google <> Tinybird integrations like[ how to query Google Sheets with SQL](https://www.tinybird.co/blog-posts/query-google-sheets-with-sql-in-real-time) . - Ready to start querying your data? Make sure you're familiar with[ how to work with time](https://www.tinybird.co/docs/docs/guides/querying-data/working-with-time) . --- URL: https://www.tinybird.co/docs/guides/ingesting-data/ingest-from-mongodb Last update: 2024-11-06T09:36:14.000Z Content: --- title: "Ingest data from MongoDB · Tinybird Docs" theme-color: "#171612" description: "In this guide, you'll learn how to ingest data into Tinybird from MongoDB." --- # Connect MongoDB to Tinybird¶ In this guide, you'll learn how to ingest data into Tinybird from MongoDB. You'll use: - MongoDB Atlas as the source MongoDB database. - Confluent Cloud's MongoDB Atlas Source connector to capture change events from MongoDB Atlas and push them to Kafka. - The Tinybird Confluent Cloud connector to ingest the data from Kafka. This guide uses Confluent Cloud as a managed Kafka service, and MongoDB Atlas as a managed MongoDB service.
You can use any Kafka service and MongoDB instance, but the setup steps may vary. ## Prerequisites¶ This guide assumes you have: - An existing Tinybird account & Workspace - An existing Confluent Cloud account - An existing MongoDB Atlas account & collection ## 1. Create Confluent Cloud MongoDB Atlas Source¶ [Create a new MongoDB Atlas Source in Confluent Cloud](https://docs.confluent.io/cloud/current/connectors/cc-mongo-db-source.html#get-started-with-the-mongodb-atlas-source-connector-for-ccloud) . Use the following template to configure the Source: { "name": "", "config": { "name": "", "connection.host": "", "connection.user": "", "connection.password": "", "database": "", "collection": "", "cloud.provider": "", "cloud.environment": "", "kafka.region": "", "kafka.auth.mode": "KAFKA_API_KEY", "kafka.api.key": "", "kafka.api.secret": "", "kafka.endpoint": "", "topic.prefix": "", "errors.deadletterqueue.topic.name": "", "startup.mode": "copy_existing", "copy.existing": "true", "copy.existing.max.threads": "1", "copy.existing.queue.size": "16000", "poll.await.time.ms": "5000", "poll.max.batch.size": "1000", "heartbeat.interval.ms": "10000", "errors.tolerance": "all", "max.batch.size": "100", "connector.class": "MongoDbAtlasSource", "output.data.format": "JSON", "output.json.format": "SimplifiedJson", "json.output.decimal.format": "NUMERIC", "change.stream.full.document": "updateLookup", "change.stream.full.document.before.change": "whenAvailable", "tasks.max": "1" } } When the Source is created, you should see a new Kafka topic in your Confluent Cloud account. This topic will contain the change events from your MongoDB collection. ## 2. Create Tinybird Data Source (CLI)¶ Using the Tinybird CLI, create a new Kafka connection `tb connection create kafka` The CLI will prompt you to enter the connection details to your Kafka service. You'll also provide a name for the connection, which is used by Tinybird to reference the connection, and you'll need it below. Next, create a new file called `kafka_ds.datasource` (you can use any name you want, just use the .datasource extension). Add the following content to the file: SCHEMA > `_id` String `json:$.documentKey._id` DEFAULT JSONExtractString(__value, '_id._id'), `operation_type` LowCardinality(String) `json:$.operationType`, `database` LowCardinality(String) `json:$.ns.db`, `collection` LowCardinality(String) `json:$.ns.coll` ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYYYYMM(__timestamp)" ENGINE_SORTING_KEY "__timestamp, _id" KAFKA_CONNECTION_NAME '' KAFKA_TOPIC '' KAFKA_GROUP_ID '' KAFKA_AUTO_OFFSET_RESET 'earliest' KAFKA_STORE_RAW_VALUE 'True' KAFKA_STORE_HEADERS 'False' KAFKA_STORE_BINARY_HEADERS 'True' KAFKA_TARGET_PARTITIONS 'auto' KAFKA_KEY_AVRO_DESERIALIZATION '' Now push the Data Source to Tinybird using: tb push kafka_ds.datasource ## 3. Validate the Data Source¶ Go to the Tinybird UI and validate that a Data Source has been created. As changes occur in MongoDB, you should see the data being ingested into Tinybird. Note that this is an append log of all changes, so you will see multiple records for the same document as it is updated. ## 4. Deduplicate with ReplacingMergeTree¶ To deduplicate the data, you can use a `ReplacingMergeTree` engine on a Materialized View. This is explained in more detail in the [deduplication guide](https://www.tinybird.co/docs/docs/guides/querying-data/deduplication-strategies#use-the-replacingmergetree-engine). 
We will create a new Data Source using the ReplacingMergeTree engine to store the deduplicated data, and a Pipe to process the data from the original Data Source and write to the new Data Source. First, create a new Data Source to store the deduplicated data. Create a new file called `deduped_ds.datasource` and add the following content: SCHEMA > `fullDocument` String, `_id` String, `database` LowCardinality(String), `collection` LowCardinality(String), `k_timestamp` DateTime, `is_deleted` UInt8 ENGINE "ReplacingMergeTree" ENGINE_SORTING_KEY "_id" ENGINE_VER "k_timestamp" ENGINE_IS_DELETED "is_deleted" Now push the Data Source to Tinybird using: tb push deduped_ds.datasource Then, create a new file called `dedupe_mongo.pipe` and add the following content: NODE mv SQL > SELECT JSONExtractRaw(__value, 'fullDocument') as fullDocument, _id, database, collection, __timestamp as k_timestamp, if(operation_type = 'delete', 1, 0) as is_deleted FROM TYPE materialized DATASOURCE Now push the Pipe to Tinybird using: tb push dedupe_mongo.pipe As new data arrives via Kafka, it will be processed automatically through the Materialized View, writing it into the `ReplacingMergeTree` Data Source. Query this new Data Source to access the deduplicated data: SELECT * FROM deduped_ds FINAL --- URL: https://www.tinybird.co/docs/guides/ingesting-data/ingest-from-rudderstack Last update: 2024-11-12T11:45:41.000Z Content: --- title: "Stream from RudderStack · Tinybird Docs" theme-color: "#171612" description: "In this guide, you'll learn two different methods to send events from RudderStack to Tinybird." --- # Stream from RudderStack¶ In this guide, you'll learn two different methods to send events from RudderStack to Tinybird. To better understand the behavior of their customers, companies need to unify timestamped data coming from a wide variety of products and platforms. Typical events to track would be 'sign up', 'login', 'page view' or 'item purchased'. A customer data platform can be used to capture complete customer data like this from wherever your customers interact with your brand. It defines events, collects them from different platforms and products, and routes them to where they need to be consumed. [RudderStack](https://www.rudderstack.com/) is an open-source customer data pipeline tool. It collects, processes and routes data from your websites, apps, cloud tools, and data warehouse. By using Tinybird's event ingestion endpoint for [high-frequency ingestion](https://www.tinybird.co/docs/docs/guides/ingesting-data/ingest-from-the-events-api) as a Webhook in RudderStack, you can stream customer data in real time to Data Sources. ## Option 1: A separate Data Source for each event type¶ This is the preferred approach. It sends each type of event to a corresponding Data Source. This [2-minute video](https://www.youtube.com/watch?v=z3TkPvo5CRQ) shows you how to set up high-frequency ingestion through RudderStack using these steps. The advantages of this method are: - Your data is well organized from the start. - Different event types can have different attributes (columns in their Data Source). - Whenever new attributes are added to an event type you will be prompted to add new columns. - New event types will get a new Data Source. Start by generating a Token in the UI to allow RudderStack to write to Tinybird. ### Create a Tinybird Token¶ Go to the Workspace in Tinybird where you want to receive data and select "Tokens" in the side panel. Create a new Token by selecting "Create Token" (top right). 
Give your Token a descriptive name. In the section "DATA SOURCES SCOPES", mark the "Data Sources management" checkbox (Enabled) to give your Token permission to create Data Sources. Select "Save changes". ### Create a RudderStack Destination¶ In RudderStack, select "Destinations" in the side panel and then "New destination" (top right). Select Webhook: 1. Give the destination a descriptive name. 2. Connect your source(s); you can test with the Rudderstack Sample HTTP Source. 3. Input the following Connection Settings: - Webhook URL: https://api.tinybird.co/v0/events - URL Method: POST - Headers Key: Authorization - Headers Value: Bearer TINYBIRD_AUTH_TOKEN <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fstreaming-via-rudderstack-1.png&w=3840&q=75) <-figcaption-> Webhook connection settings for high-frequency ingestion On the next page, select "Create new transformation". You can code a function in the box to apply to events when this transformation is active, using the example snippet below (feel free to update it to suit your needs). In this function, you can dynamically append the target Data Source to the target URL of the Webhook. Give your transformation a descriptive name and a helpful description. ##### Transformation code export function transformEvent(event, metadata){ event.appendPath="?name=rudderstack_"+event.event.toLowerCase().replace(/[\s\.]/g, '_') return event; } This example snippet uses the prefix `rudderstack_` followed by the name of the event in lower case, with its words separated by underscores (for instance, a "Product purchased" event would go to a Data Source named `rudderstack_product_purchased` ). Save the transformation. Your destination has been created successfully! ### Test Ingestion¶ In RudderStack, select Sources --> Rudderstack Sample HTTP --> Live events (top right) --> "Send test event" and paste the provided curl command into your terminal. The event will appear on the screen and be sent to Tinybird. If, after sending some events through RudderStack, you see that your Data Source in Tinybird exists but is empty (0 rows after sending a few events), you will need to authorize the Token that you created to **append** data to the Data Source. In the UI, navigate to "Tokens", select the Token you created, select "Data Sources management" --> "Add Data Source scope", and choose the name of the Data Source that you want to write to. Mark the "Append" checkbox and save the changes. ## Option 2: All events in the same Data Source¶ This alternative approach consists of sending all events into a single Data Source and then splitting them using Tinybird. By pre-configuring the Data Source, any events that RudderStack sends will be ingested with the JSON object in full as a String in a single column. This is very useful when you have complex JSON objects, as explained in the [ingesting NDJSON docs](https://www.tinybird.co/docs/docs/guides/ingesting-data/ingest-ndjson-data#jsonpaths) , but be aware that using JSONExtract to parse data from the JSON object after ingestion has an impact on performance. New columns from parsing the data will be detected and you will be asked if you want to save them. You can adjust the inferred data types before saving any new columns. Pipes can be used to filter the Data Source by different events. The following example assumes you have already installed the Tinybird CLI. If you're not familiar with how to use or install it, [read the CLI docs](https://www.tinybird.co/docs/docs/cli/install).
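If you just want the quick version: the CLI is typically installed from PyPI (a minimal sketch; check the [CLI docs](https://www.tinybird.co/docs/docs/cli/install) for the current, recommended installation method for your platform). Authentication with `tb auth` is covered in the next step.

##### Install the Tinybird CLI (sketch)
# Assumes a working Python 3 environment; see the CLI docs if the package name or method changes
pip install tinybird-cli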
### Pre-configure a Data Source¶ Authenticate to your Workspace by typing **tb auth** and entering your Token for the Workspace into which you want to ingest data from RudderStack. Create a new file in your local Workspace, named `rudderstack_events.datasource` for example, to configure the empty Data Source. ##### Data Source schema SCHEMA > `value` String `json:$` ENGINE "MergeTree" ENGINE_SORTING_KEY "value" Push the file to your Workspace using `tb push rudderstack_events.datasource`. Note that this pre-configured Data Source is only required if you need a column containing the JSON object in full as a String. Otherwise, just skip this step and let Tinybird infer the columns and data types when you send the first event. You will then be able to select which columns you wish to save and adjust their data types. Create the Token as in method 1. ### Create a Tinybird Token¶ Go to the Workspace in Tinybird where you want to receive data and select "Tokens" in the side panel. Create a new Token by selecting "Create Token" (top right). Give your Token a descriptive name. In the section "DATA SOURCES SCOPES", select "Add Data Source scope", choose the name of the Data Source that you just created, and mark the "Append" checkbox. Select "Save changes". ### Create a RudderStack Destination¶ In RudderStack, select "Destinations" in the side panel and then "New destination" (top right). Select Webhook: 1. Give the destination a descriptive name. 2. Connect your source(s); you can test with the Rudderstack Sample HTTP Source. 3. Input the following Connection Settings: - Webhook URL: https://api.tinybird.co/v0/events?name=rudderstack_events - URL Method: POST - Headers Key: Authorization - Headers Value: Bearer TINYBIRD_AUTH_TOKEN <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fstreaming-via-rudderstack-3.png&w=3840&q=75) <-figcaption-> Webhook connection settings with Data Source name Select 'No transformation needed' and save. Your destination has been created successfully! ### Test Ingestion¶ Select Sources --> Rudderstack Sample HTTP --> "Live events" (top right) --> "Send test event" and paste the provided curl command into your terminal. The event will appear on the screen and be sent to Tinybird. The `value` column contains the full JSON object. You will also have the option of having the data parsed into columns. When viewing the new columns, you can select which ones to save and adjust their data types. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fstreaming-via-rudderstack-4.png&w=3840&q=75) <-figcaption-> New columns detected not in schema Whenever new columns are detected in the stream of events, you will be asked if you want to save them. ## Next steps¶ - Need to[ iterate a Data Source, including the schema](https://www.tinybird.co/docs/docs/guides/ingesting-data/iterate-a-data-source) ? Read how here. - Want to schedule your data ingestion? Read the docs on[ cron and GitHub Actions](https://www.tinybird.co/docs/docs/guides/ingesting-data/scheduling-with-github-actions-and-cron) . --- URL: https://www.tinybird.co/docs/guides/ingesting-data/ingest-from-snowflake-via-unloading Last update: 2024-11-13T15:42:17.000Z Content: --- title: "Ingest from Snowflake via unloading · Tinybird Docs" theme-color: "#171612" description: "In this guide you'll learn how to send data from Snowflake to Tinybird via unloading."
--- # Ingest from Snowflake via unloading¶ In this guide you'll learn how to send data from Snowflake to Tinybird, for scenarios where the [native connector](https://www.tinybird.co/docs/docs/ingest/snowflake) can't be used, such as anything beyond a one-off load or periodic full replaces of the table, or where [limits](https://www.tinybird.co/docs/docs/ingest/snowflake#limits) apply. This process relies on [unloading](https://docs.snowflake.com/en/user-guide/data-unload-overview) (aka bulk exporting) data as gzipped CSVs and then ingesting it via the [Data Sources API](https://www.tinybird.co/docs/docs/ingest/datasource-api). This guide explains the process using Azure Blob Storage, but it's easy to replicate using Amazon S3, Google Cloud Storage, or any storage service where you can unload data from Snowflake and share presigned URLs to access the files. This guide is a walkthrough of the most common, basic process: Unload the table from Snowflake, then ingest this export into Tinybird. ## Prerequisites¶ This guide assumes you have a Tinybird account and that you are familiar with creating a Tinybird Workspace and pushing resources to it. You will also need access to Snowflake, and permissions to create SAS Tokens for Azure Blob Storage or its equivalents in AWS S3 and Google Cloud Storage. ## 1. Unload the Snowflake table¶ Snowflake makes it really easy to [unload](https://docs.snowflake.com/en/user-guide/data-unload-overview) query results to flat files in an external storage service. COPY INTO 'azure://myaccount.blob.core.windows.net/unload/' FROM mytable CREDENTIALS = ( AZURE_SAS_TOKEN='****' ) FILE_FORMAT = ( TYPE = CSV COMPRESSION = GZIP ) HEADER = FALSE; The most basic implementation is [unloading directly](https://docs.snowflake.com/en/sql-reference/sql/copy-into-location#unloading-data-from-a-table-directly-to-files-in-an-external-location) , but for production use cases consider adding a [named stage](https://docs.snowflake.com/en/user-guide/data-unload-azure#unloading-data-into-an-external-stage) as suggested in the docs. Stages will give you more fine-grained control over access rights. ## 2. Create a SAS token for the file¶ Using the [Azure CLI](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli) , generate a [shared access signature (SAS) token](https://learn.microsoft.com/en-us/azure/ai-services/translator/document-translation/how-to-guides/create-sas-tokens?tabs=blobs) so Tinybird can read the file: az storage blob generate-sas \ --account-name myaccount \ --account-key '****' \ --container-name unload \ --name data.csv.gz \ --permissions r \ --expiry \ --https-only \ --output tsv \ --full-uri > 'https://myaccount.blob.core.windows.net/unload/data.csv.gz?se=2024-05-31T10%3A57%3A41Z&sp=r&spr=https&sv=2022-11-02&sr=b&sig=PMC%2E9ZvOFtKATczsBQgFSsH1%2BNkuJvO9dDPkTpxXH0g%5D' Follow the same approach in S3 and GCS to generate presigned URLs. ## 3. Ingest into Tinybird¶ Take that generated URL and make a call to Tinybird. You'll need a [Token](https://www.tinybird.co/docs/concepts/auth-tokens#tokens) with `DATASOURCES:CREATE` permissions: curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources?name=my_datasource_name" \ -d url='https://myaccount.blob.core.windows.net/unload/data.csv.gz?se=2024-05-31T10%3A57%3A41Z&sp=r&spr=https&sv=2022-11-02&sr=b&sig=PMC%2E9ZvOFtKATczsBQgFSsH1%2BNkuJvO9dDPkTpxXH0g%5D' You should now have your Snowflake table in Tinybird.
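To double-check the import, you can run a quick query against the new Data Source using the Query API. This is a minimal sketch: `my_datasource_name` matches the example above, and the Token you use needs read access to that Data Source.

##### Example: verify the import with the Query API
# Returns the row count of the freshly created Data Source
curl \
  -G -H "Authorization: Bearer $TOKEN" \
  "https://api.tinybird.co/v0/sql" \
  --data-urlencode "q=SELECT count() FROM my_datasource_name"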
## Automation¶ To adapt to more "real-life" scenarios (like appending data on a regular basis, replacing data that has been updated in Snowflake, etc.), you may need to define scheduled actions to move the data. You can see examples in the [Ingest from Google Cloud Storage guide](https://www.tinybird.co/docs/docs/guides/ingesting-data/ingest-from-google-gcs#automatically-sync-files-with-google-cloud-functions) and in the [Schedule data ingestion with cron and GitHub Actions guide](https://www.tinybird.co/docs/docs/guides/ingesting-data/scheduling-with-github-actions-and-cron). ## Limits¶ You will be using the Data Sources API, so its [limits](https://www.tinybird.co/docs/docs/api-reference/overview#limits) apply: | Description | Limit | | --- | --- | | Append/Replace data to Data Source | 5 times per minute | | Max file size (uncompressed), Build plan | 10 GB | | Max file size (uncompressed), Pro and Enterprise plans | 32 GB | As a result of these limits, you may need to adjust your [COPY INTO](https://docs.snowflake.com/en/sql-reference/sql/copy-into-location) expression by adding `PARTITION BY` or `MAX_FILE_SIZE = 5000000000`. COPY INTO 'azure://myaccount.blob.core.windows.net/unload/' FROM mytable CREDENTIALS=( AZURE_SAS_TOKEN='****') FILE_FORMAT = ( TYPE = CSV COMPRESSION = GZIP ) HEADER = FALSE MAX_FILE_SIZE = 5000000000; ## Next steps¶ These resources may be useful: - [ Tinybird Snowflake Connector](https://www.tinybird.co/docs/ingest/snowflake) - [ Tinybird S3 Connector](https://www.tinybird.co/docs/ingest/s3) - [ Guide: Ingest from Google Cloud Storage](https://www.tinybird.co/docs/guides/ingesting-data/ingest-from-google-gcs) --- URL: https://www.tinybird.co/docs/guides/ingesting-data/ingest-from-the-events-api Last update: 2024-11-06T08:19:40.000Z Content: --- title: "Stream with HTTP Requests · Tinybird Docs" theme-color: "#171612" description: "In this guide, you'll learn how to use the Tinybird Events API to ingest thousands of JSON messages per second." --- # Stream with HTTP Requests¶ In this guide, you'll learn how to use the Tinybird Events API to ingest thousands of JSON messages per second with HTTP Requests. For more details about the Events API endpoint, read the [Events API](https://www.tinybird.co/docs/docs/api-reference/events-api) docs. ## Setup: Create the target Data Source¶ First, you need to create an NDJSON Data Source. You can use the [API](https://www.tinybird.co/docs/docs/api-reference/datasource-api) , or simply drag & drop a file on the UI. Even though you can add new columns later on, you have to upload an initial file. The Data Source will be created and ordered based upon those initial values. As an example, upload this NDJSON file: {"date": "2020-04-05 00:05:38", "city": "New York"} ## Ingest from the browser: JavaScript¶ Ingesting from the browser requires making a standard POST request; see below for an example. Input your own Token and change the name of the target Data Source to the one you created. Check that your URL ( `const url` ) is the corresponding [URL for your region](https://www.tinybird.co/docs/docs/api-reference/overview#regions-and-endpoints).
##### Browser High-Frequency Ingest async function sendEvents(events){ const date = new Date(); events.forEach(ev => { ev.date = date.toISOString() }); const headers = { 'Authorization': 'Bearer TOKEN_HERE', }; const url = 'https://api.tinybird.co/' // you may be on a different host const rawResponse = await fetch(`${url}v0/events?name=hfi_multiple_events_js`, { method: 'POST', body: events.map(ev => JSON.stringify(ev)).join('\n'), headers: headers, }); const content = await rawResponse.json(); console.log(content); } sendEvents([ { 'city': 'Jamaica', 'action': 'view'}, { 'city': 'Jamaica', 'action': 'click'}, ]); Remember: Publishing your Admin Token on a public website is a security vulnerability. It is **highly recommended** that you [create a new Token](https://www.tinybird.co/docs/docs/concepts/auth-tokens#create-a-token) with a more restricted scope. ## Ingest from the backend: Python¶ Ingesting from the backend is a similar process to ingesting from the browser. Use the following Python snippet and replace the Auth Token and Data Source name, as in the example above. ##### Python High-Frequency Ingest import requests import json import datetime def send_events(events): params = { 'name': 'hfi_multiple_events_py', 'token': 'TOKEN_HERE', } for ev in events: ev['date'] = datetime.datetime.now().isoformat() data = '\n'.join([json.dumps(ev) for ev in events]) r = requests.post('https://api.tinybird.co/v0/events', params=params, data=data) print(r.status_code) print(r.text) send_events([ {'city': 'Pretoria', 'action': 'view'}, {'city': 'Pretoria', 'action': 'click'}, ]) ## Ingest from the command line: curl¶ The following curl snippet sends two events in the same request: ##### curl High-Frequency Ingest curl -i -d $'{"date": "2020-04-05 00:05:38", "city": "Chicago"}\n{"date": "2020-04-05 00:07:22", "city": "Madrid"}\n' -H "Authorization: Bearer $TOKEN" 'https://api.tinybird.co/v0/events?name=hfi_test' ## Add new columns from the UI¶ As you add extra information in the form of new JSON fields, the UI will prompt you to include those new columns on the Data Source. For instance, if you send a new event with an extra field: ##### curl High-Frequency Ingest curl -i -d '{"date": "2020-04-05 00:05:38", "city": "Chicago", "country": "US"}' -H "Authorization: Bearer $TOKEN" 'https://api.tinybird.co/v0/events?name=hfi_test' and then navigate to the UI's Data Source screen, you'll be asked if you want to add the new column: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fhigh-frequency-ingestion-1.png&w=3840&q=75) Here, you'll be able to select the desired columns and adjust the types: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fhigh-frequency-ingestion-2.png&w=3840&q=75) After you confirm the addition of the column, it will be populated by new events: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fhigh-frequency-ingestion-3.png&w=3840&q=75) ## Error handling and retries¶ Read more about the possible [responses returned by the Events API](https://www.tinybird.co/docs/docs/api-reference/events-api). When using the Events API to send data to Tinybird, you can choose to 'fire and forget' by sending a POST request and ignoring the response. This is a common choice for non-critical data, such as tracking page hits if you're building Web Analytics, where some level of loss is acceptable. However, if you're sending data where you cannot tolerate events being missed, you must implement some error handling & retry logic in your application.
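The sections below explain when and how to retry. As a rough illustration of that pattern (a sketch, not an official snippet), curl's built-in retry behavior already approximates it: `--retry` only retries transient failures (timeouts and HTTP 408, 429, 500, 502, 503, and 504 responses) and doubles the wait between attempts, which matches the exponential backoff described below. It reuses the `hfi_test` Data Source from the examples above and adds `wait=true`, explained in the next section.

##### Example: curl with retries and exponential backoff
# Retries up to 5 times on transient errors, waiting 1s, 2s, 4s, ... between attempts,
# and gives up once 120 seconds of retrying have passed
curl \
  --retry 5 \
  --retry-max-time 120 \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"date": "2020-04-05 00:05:38", "city": "Chicago"}' \
  'https://api.tinybird.co/v0/events?name=hfi_test&wait=true'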
### Wait for acknowledgement¶ When you send data to the Events API, you'll usually receive a `HTTP202` response, which indicates that the request was successful. However, it's important to note that this response only indicates that the Events API successfully **accepted** your HTTP request. It does not confirm that the data has been **committed** into the underlying database. Using the `wait` parameter with your request will ask the Events API to wait for acknowledgement that the data you sent has been committed into the underlying database. If you use the `wait` parameter, you will receive a `HTTP200` response that confirms data has been committed. To use this, your Events API request should include `wait` as a query parameter, with a value of `true`. For example: https://api.tinybird.co/v0/events?wait=true It is good practice to log your requests to, and responses from, the Events API. This will help give you visibility into any failures for reporting or recovery. ### When to retry¶ Failures are indicated by a `HTTP4xx` or `HTTP5xx` response. It's recommended to only implement automatic retries for `HTTP5xx` responses, which indicate that a retry might be successful. `HTTP4xx` responses should be logged and investigated, as they often indicate issues that cannot be resolved by simply retrying with the same request. For HTTP2 clients, you may receive the `0x07 GOAWAY` error. This indicates that there are too many alive connections. It is safe to recreate the connection and retry these errors. ### How to retry¶ You should aim to retry any requests that fail with a `HTTP5xx` response. In general, you should retry these requests 3-5 times. If the failure persists beyond these retries, log the failure, and attempt to store the data in a buffer to resend later (for example, in Kafka, or a file in S3). It's recommended to use an exponential backoff between retries. This means that, after a retry fails, you should increase the amount of time you wait before sending the next retry. If the issue causing the failure is transient, this gives you a better chance of a successful retry. Be careful when calculating backoff timings, so that you do not run into memory limits on your application. ## Next steps¶ - Learn more about[ the schema](https://www.tinybird.co/docs/docs/ingest/overview#create-your-schema) and why it's important. - Ingested your data and ready to go? Start[ querying your Data Sources](https://www.tinybird.co/docs/docs/query/overview) and build some Pipes! --- URL: https://www.tinybird.co/docs/guides/ingesting-data/ingest-ndjson-data Last update: 2024-11-04T12:15:35.000Z Content: --- title: "Ingest NDJSON data · Tinybird Docs" theme-color: "#171612" description: "In this guide you'll learn how to ingest unstructured data, like NDJSON to Tinybird." --- # Ingest NDJSON data¶ In this guide you'll learn how to ingest unstructured NDJSON data into Tinybird. ## Overview¶ A common scenario is having a document-based database, using nested records on your data warehouse or generated events in JSON format from a web application. For cases like this, the process used to be: Export the `JSON` objects as if they were a `String` in a CSV file, ingest them to Tinybird, and then use the built-in `JSON` functions to prepare the data for real-time analytics as it was being ingested. But this is not needed anymore, as Tinybird now accepts JSON imports by default! 
Although Tinybird allows you to ingest `.json` and `.ndjson` files, it only accepts the [Newline Delimited JSON](https://github.com/ndjson/ndjson-spec) as content. Each line must be a valid JSON object and every line has to end with `\n` . The API will return an error if each line isn't a valid JSON value. ## Ingest to Tinybird¶ This guide will use an example scenario including this [100k rows NDJSON file](https://storage.googleapis.com/tinybird-assets/datasets/guides/how-to-ingest-ndjson-data/events_100k.ndjson) , which contains events from an ecommerce website with different properties. ### With the API¶ Ingesting NDJSON files using the API is similar to the CSV process. There are only two differences to be managed in the query parameters: - ** format** : It has to be "ndjson" - ** schema** : Usually, the name and the type are provided for every column but in this case it needs an additional property, called the `jsonpath` (see the[ JSONPath syntax](https://www.tinybird.co/docs/docs/guides/ingesting-data/ingest-ndjson-data#jsonpaths) ). Example:* "schema=event_name String `json:$.event.name`"* You can guess the `schema` by first calling the [Analyze API](https://www.tinybird.co/docs/docs/api-reference/analyze-api) . It's a very handy way to not have to remember the `schema` and `jsonpath` syntax: Just send a sample of your file and the Analyze API will describe what's inside (columns, types, schema, a preview, etc.). ##### Analyze API request curl \ -H "Authorization: Bearer $TOKEN" \ -G -X POST "https://api.tinybird.co/v0/analyze" \ --data-urlencode "url=https://storage.googleapis.com/tinybird-assets/datasets/guides/how-to-ingest-ndjson-data/events_100k.ndjson" Take the `schema` attribute in the response and either use it right away in the next API request to create the Data Source, or modify as you wish: Column names, types, remove any columns, etc. ##### Analyze API response excerpt { "analysis": { "columns": [ { "path": "$.date", "recommended_type": "DateTime", "present_pct": 1, "name": "date" }, ... ... ... "schema": "date DateTime `json:$.date`, event LowCardinality(String) `json:$.event`, extra_data_city LowCardinality(String) `json:$.extra_data.city`, product_id String `json:$.product_id`, user_id Int32 `json:$.user_id`, extra_data_term Nullable(String) `json:$.extra_data.term`, extra_data_price Nullable(Float64) `json:$.extra_data.price`" }, "preview": { "meta": [ { "name": "date", "type": "DateTime" }, ... ... ... } Now you've analyzed the file, create the Data Source. In the example below, you will ingest the 100k rows NDJSON file only taking 3 columns from it: date, event, and product_id. The `jsonpath` allows Tinybird to match the Data Source column with the JSON property path: ##### Ingest NDJSON to Tinybird TOKEN= curl \ -H "Authorization: Bearer $TOKEN" \ -X POST "https://api.tinybird.co/v0/datasources" \ -G --data-urlencode "name=events_example" \ -G --data-urlencode "mode=create" \ -G --data-urlencode "format=ndjson" \ -G --data-urlencode "schema=date DateTime \`json:$.date\`, event String \`json:$.event\`, product_id String \`json:$.product_id\`" \ -G --data-urlencode "url=https://storage.googleapis.com/tinybird-assets/datasets/guides/how-to-ingest-ndjson-data/events_100k.ndjson" ### With the Command Line Interface¶ There are no changes in the CLI in order to ingest an NDJSON file. 
Just run the command you are used to with CSV: ##### Generate Data Source schema from NDJSON tb datasource generate https://storage.googleapis.com/tinybird-assets/datasets/guides/how-to-ingest-ndjson-data/events_100k.ndjson Once it's finished, it automatically generates a .datasource file with all the columns, their proper types, and `jsonpaths` . For example: ##### Generated Data Source schema DESCRIPTION generated from https://storage.googleapis.com/tinybird-assets/datasets/guides/how-to-ingest-ndjson-data/events_100k.ndjson SCHEMA > date DateTime `json:$.date`, event String `json:$.event`, extra_data_city String `json:$.extra_data.city`, product_id String `json:$.product_id`, user_id Int32 `json:$.user_id`, extra_data_price Nullable(Float64) `json:$.extra_data.price`, extra_data_term Nullable(String) `json:$.extra_data.term` You can then push that .datasource file to Tinybird and start using it in your Pipes or append new data to it: ##### Push Data Source to Tinybird and append new data tb push events_100k.datasource tb datasource append events_100k https://storage.googleapis.com/tinybird-assets/datasets/guides/how-to-ingest-ndjson-data/events_100k.ndjson ### With the UI¶ To create a new Data Source from an NDJSON file, navigate to your Workspace and select the **Add Data Source** button. In the modal, select "File upload" and upload the NDJSON/JSON file or drag and drop it onto the modal. You can also provide a URL, such as the one used in this guide. Confirm you're happy with the schema and data, and select "Create Data Source". Once your data is imported, you will have a Data Source with your JSON data structured in columns, which are easy to transform and consume in any Pipe. Ingest just the columns you need. After exploring your data, always remember to create a Data Source that only has the columns needed for your analyses. That will help make your ingestion, materialization, and your real-time data project faster. ## When new JSON fields are added¶ Tinybird can automatically detect when a new JSON property is added as new data is ingested. Using the Data Source import example from the previous paragraph, you can include a new property to know the origin country of the event, complementing the city. Append new JSON data with the extra property ( [using this example file](https://storage.googleapis.com/tinybird-assets/datasets/guides/how-to-ingest-ndjson-data/events_with_country.ndjson) ). After finishing the import, open the Data Source modal and confirm that a new blue banner appears, warning you about the new properties detected in the last ingestion: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fhow-to-ingest-ndjson-data-4.png&w=3840&q=75) <-figcaption-> Automatically suggesting new columns Once you accept viewing those new columns, the application will allow you to add them and change the column types and names, as it did in the preview step during the import: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fhow-to-ingest-ndjson-data-5.png&w=3840&q=75) <-figcaption-> Accepting new columns From now on, whenever you append new data where the new column is defined and has a value, it will appear in the Data Source and will be available to be consumed from your Pipes: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fhow-to-ingest-ndjson-data-6.png&w=3840&q=75) <-figcaption-> New column receiving data Tinybird automatically detects if there are new columns available.
If you ingest data periodically into your NDJSON Data Source (from a file or a Kafka connection) and new columns are coming in, you will see a blue dot in the Data Source icon that appears in the sidebar (see Mark 1 below). Click on the Data Source to view the new columns and add them to the schema, following the steps above. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fhow-to-ingest-ndjson-data-7.png&w=3840&q=75) <-figcaption-> New columns detected, notified by a blue dot ## JSONPaths¶ This section applies to both NDJSON **and** Parquet data. When creating a Data Source using NDJSON/Parquet data, for each column in the `schema` you have to provide a JSONPath using the [JSONPath syntax](https://goessner.net/articles/JsonPath). This is easy for simple schemas, but it can get complex if you have nested fields and arrays. For example, given this NDJSON object: { "field": "test", "nested": { "nested_field": "bla" }, "an_array": [1, 2, 3], "a_nested_array": { "nested_array": [1, 2, 3] } } The schema would be something like this: ##### schema with jsonpath a_nested_array_nested_array Array(Int16) `json:$.a_nested_array.nested_array[:]`, an_array Array(Int16) `json:$.an_array[:]`, field String `json:$.field`, nested_nested_field String `json:$.nested.nested_field` Tinybird's JSONPath syntax support has some limitations: it supports nested objects at multiple levels, but it supports nested arrays only at the first level, as in the example above. To ingest and transform more complex JSON objects, use the root object JSONPath syntax as described in the next section. ### JSONPaths and the root object¶ Defining a column as "column_name String `json:$` " in the Data Source schema will ingest each line in the NDJSON file as a String in the `column_name` column. This is very useful in some scenarios, for example when you have nested arrays, such as polygons: ##### Nested arrays { "id": 49518, "polygon": [ [ [30.471785843000134, -1.066836591999916], [30.463855835000118, -1.075127054999925], [30.456156047000093, -1.086082457999908], [30.453003785000135, -1.097347919999962], [30.456311076000134, -1.108096617999891], [30.471785843000134, -1.066836591999916] ] ] } You can parse the `id` and then add the whole JSON string to the root column to extract the polygon with [JSON functions](https://clickhouse.com/docs/en/sql-reference/functions/json-functions/). ##### schema definition id String `json:$.id`, root String `json:$` When you have complex objects: ##### Complex JSON objects { "elem": { "payments": [ { "users": [ { "user_id": "Admin_XXXXXXXXX", "value": 4 } ] } ] } } Or if you have variable schema ("schemaless") events: ##### Schemaless events { "user_id": "1", "data": { "whatever": "bla", "whatever2": "bla" } } { "user_id": "1", "data": [1, 2, 3] } You can simply put the whole event in the root column and parse it as needed: ##### schema definition root String `json:$` ## Next steps¶ - Learn more about the[ Data Sources API](https://www.tinybird.co/docs/docs/ingest/datasource-api) . - Want to schedule your data ingestion? Read the docs on[ cron and GitHub Actions](https://www.tinybird.co/docs/docs/guides/ingesting-data/scheduling-with-github-actions-and-cron) . Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc.
--- URL: https://www.tinybird.co/docs/guides/ingesting-data/ingest-with-estuary Last update: 2024-08-22T15:53:45.000Z Content: --- title: "Ingest with Estuary · Tinybird Docs" theme-color: "#171612" description: "In this guide, you'll learn how to use Estuary to push data streams to Tinybird." --- # Ingest with Estuary¶ In this guide, you'll learn how to use Estuary to push data streams to Tinybird. [Estuary](https://estuary.dev/) is a real-time ETL tool that allows you capture data from a range of source, and push it to a range of destinations. Using Estuary's Dekaf, you can connect Tinybird to Estuary as if it was a Kafka broker - meaning you can use Tinybird's native Kafka Connector to consume data from Estuary. [Read more about Estuary Dekaf.](https://docs.estuary.dev/guides/dekaf_reading_collections_from_kafka/#connection-details) ## Prerequisites¶ - An Estuary account & collection - A Tinybird account & Workspace ## Connecting to Estuary¶ In Estuary, create a new token to use for the Tinybird connection. You can do this from the Estuary Admin Dashboard. In your Tinybird Workspace, create a new Data Source and use the [Kafka Connector](https://www.tinybird.co/docs/docs/ingest/kafka). To configure the connection details, use the following settings (these can also be found in the [Estuary Dekaf docs](https://docs.estuary.dev/guides/dekaf_reading_collections_from_kafka/#connection-details) ). - Bootstrap servers: `dekaf.estuary.dev` - SASL Mechanism: `PLAIN` - SASL Username: `{}` - SASL Password: Estuary Refresh Token (Generate your token in the Estuary Admin Dashboard) Tick the `Decode Avro messages with Schema Register` box, and use the following settings: - URL: `https://dekaf.estuary.dev` - Username: `{}` - Password: The same Estuary Refresh Token as above Click **Next** and you will see a list of topics. These topics are the collections you have in Estuary. Select the collection you want to ingest into Tinybird, and click **Next**. Configure your consumer group as needed. Finally, you will see a preview of the Data Source schema. Feel free to make any modifications as required, then click **Create Data Source**. This will complete the connection with Estuary, and new data from the Estuary collection will arrive in your Tinybird Data Source in real-time. --- URL: https://www.tinybird.co/docs/guides/ingesting-data/iterate-a-data-source Last update: 2024-11-13T15:42:17.000Z Content: --- title: "Iterate a Data Source · Tinybird Docs" theme-color: "#171612" description: "In this guide, you'll learn how to change the schema of a Data Source without using version control." --- # Iterating a Data Source (change or update schema)¶ Creating a Data Source for the first time is really straightforward. However, when iterating data projects, sometimes you need to edit the Data Source schema. This can be challenging when the data is already in production, and there are a few different scenarios. With Tinybird you can easily add more columns, but other operations (such as changing the sorting key or changing a column type) require you to fully recreate the Data Source. This guide is for Workspaces that **aren't** using version control. If your Workspace is linked using the Git<>Tinybird integration, see the repo of [common use cases for iterating when using version control](https://github.com/tinybirdco/use-case-examples). ## Overview¶ This guide walks through the iteration process for 4 different scenarios. 
Pick the one that's most relevant for you: - ** Scenario 1: I'm not in production** - ** Scenario 2: I can stop/pause data ingestion** - ** Scenario 3: I need to change a Materialized View & I can't stop data ingest** - ** Scenario 4: It's too complex and I can't figure it out** ## Prerequisites¶ You'll need to be familiar with the Tinybird CLI to follow along with this guide. Never used it before? [Read the docs here](https://www.tinybird.co/docs/docs/cli/quick-start). All of the guide examples have the same setup - a Data Source with a `nullable(Int64)` column that the user wants to change to a `Int64` for performance reasons. This requires editing the schema and, to keep the existing data, replacing any occurrences of `NULL` with a number, like `0`. ## Scenario 1: I'm not in production¶ This scenario assumes that you are not in production and can accept losing any data you have already ingested. If you are not in production, and you can accept losing data, use the Tinybird CLI to pull your Data Source down to a file, modify it, and push it back into Tinybird. Begin with `tb pull` to pull your Tinybird resources down to files. Then, modify the .datasource file for the Data Source you want to change. When you're finished modifying the Data Source, delete the existing Data Source from Tinybird, either in the CLI with `tb datasource rm` or through the UI. Finally, push the new Data Source to Tinybird with `tb push`. See a screencast example: [https://www.youtube.com/watch?v=gzpuQfk3Byg](https://www.youtube.com/watch?v=gzpuQfk3Byg). ## Scenario 2: I can stop data ingestion¶ This scenario assumes that you have stopped all ingestion into the affected Data Sources. ### 1. Use the CLI to pull your Tinybird resources down into files¶ Use `tb pull --auto` to pull your Tinybird resources down into files. The `--auto` flag will organize the resources into directories, with your Data Sources being places into a `datasources` directory. ### 2. Create the new Data Source¶ Create a copy of the Data Source file that you want to modify and rename it. For example, `datasources/original_ds.datasource` -> `datasources/new_ds.datasource`. Modify the new Data Source schema in the file to make the changes you need. Now push the new Data Source to Tinybird with `tb push datasources/new_ds.datasource`. ### 3. Backfill the new Data Source with existing data¶ If you want to move the existing data from the original Data Source to the new Data Source, use a Copy Pipe or a Pipe that materializes data into the new Data Source. #### 3.1 Recommended option: Copy Pipe¶ A [Copy Pipe](https://www.tinybird.co/docs/docs/publish/copy-pipes) is a Pipe used to copy data from one Data Source to another Data Source. This method is useful for one-time moves of data or scheduled executions. Move your data using the following Copy Pipe, paying particular attention to the `TYPE`, `TARGET_DATASOURCE` and `COPY_SCHEDULE` configs at the end: NODE copy_node SQL > SELECT * EXCEPT (my_nullable_column), toInt64(coalesce(my_nullable_column,0)) as my_column -- adjust query to your changes FROM original_ds TYPE COPY TARGET_DATASOURCE new_ds COPY_SCHEDULE @on-demand Push it to the Workspace: tb push pipes/temp_copy.pipe And run the Copy: tb pipe copy run temp_copy When it's done, remove the Pipe: tb pipe rm temp_copy #### 3.2 Alternative option: A Populate¶ Alternatively, you can create a Materialized View Pipe and run a Populate to transform data from the original schema into the modified schema of the new Data Source. 
Do this using the following Pipe, paying particular attention to the `TYPE` and `DATASOURCE` configs at the end: NODE temp_populate SQL > SELECT * EXCEPT (my_nullable_column), toInt64(coalesce(my_nullable_column,0)) as my_column FROM original_ds TYPE materialized DATASOURCE new_ds Then push the Pipe to Tinybird, passing the `--populate` flag to force it to immediately start processing data: tb push pipes/temp.pipe --populate --wait When it's done, remove the Pipe: tb pipe rm temp At this point, review your new Data Source and ensure that everything is as expected. ### 4. Delete the original Data Source and rename the new Data Source¶ You can now go to the UI, delete the original Data Source, and rename the new Data Source to use the name of the original Data Source. By renaming the new Data Source to use the same name as the original Data Source, any SQL in your Pipes or Endpoints that referred to the original Data Source will continue to work. If you have a Materialized View that depends on the Data Source, you must unlink the Pipe that is materializing data before removing the Data Source. You can modify and reconnect your Pipe after completing the steps above. ## Scenario 3: I need to change a Materialized View & I can't interrupt service¶ This scenario assumes you want to modify a Materialized View that is actively receiving data and serving API Endpoints, *and* you want to avoid service downtime. ### Before you begin¶ Because this is a complex scenario, let's introduce some names for the example resources to make it a bit easier to follow along. Let's assume that you have a Data Source that is actively receiving data; let's call this the `Landing Data Source` . From the `Landing Data Source` , you have a Pipe that is writing to a Materialized View; let's call these the `Materializing Pipe` and `Materialized View Data Source` respectively. ### 1. Use the CLI to pull your Tinybird resources down into files¶ Use `tb pull --auto` to pull your Tinybird resources down into files. The `--auto` flag organizes the resources into directories, with your Data Sources being places into a `datasources` directory. ### 2. Duplicate the Materializing Pipe & Materialized View Data Source¶ Duplicate the `Materializing Pipe` & `Materialized View Data Source`. For example: pipes/original_materializing_pipe.pipe -> pipes/new_materializing_pipe.pipe datasources/original_materialized_view_data_source.datasource -> datasources/new_materialized_view_data_source.datasource Modify the new files to change the schema as needed. Lastly, you'll need to add a `WHERE` clause to the new `Materializing Pipe` . This clause is going to filter out old rows, so that the `Materializing Pipe` is only materializing rows newer than a specific time. For the purpose of this guide, let's call this the `Future Timestamp` . Do **not** use variable time functions for this timestamp (e.g. `now()` ). Pick a static time that is in the near future; five to fifteen minutes should be enough. The condition should be `>` , for example: WHERE … AND my_timestamp > "2024-04-12 13:15:00" ### 3. Push the Materializing Pipe & Materialized View Data Source¶ Push the `Materializing Pipe` & `Materialized View Data Source` to Tinybird: tb push datasources/new_materialized_view_data_source.datasource tb push pipes/new_materializing_pipe.pipe ### 4. Create a new Pipe to transform & materialize the old schema to the new schema¶ You now have two Materialized Views: the one with the original schema, and the new one with the new schema. 
You need to take the data from the original Materialized View, transform it into the new schema, and write it into the new Materialized View. To do this, create a new Pipe. In this guide, it's called the `Transform Pipe` . In your `Transform Pipe` create the SQL `SELECT` logic that transforms the old schema to the new schema. Lastly, your `Transform Pipe` should have a `WHERE` clause that only selects rows that are **older** than our `Future Timestamp` . The condition should be `<=` , for example: WHERE … AND my_timestamp <= "2024-01-12 13:00:00" ### 5. Wait until after the Future Timestamp, then push & populate with the Transform Pipe¶ Now, to avoid any potential for creating duplicates or missing rows, wait until after the `Future Timestamp` time has passed. This means that there should no longer be any rows arriving that have a timestamp that is **older** than the `Future Timestamp`. Then, push the `Transform Pipe` and force a populate: tb push pipes/new_materializing_pipe.pipe --populate --wait ### 6. Wait for the populate to finish, then change your API Endpoint to read from the new Materialized View Data Source¶ Wait until the previous command has completed to ensure that all data from the original Materialized View has been written to the new `Materialized View Data Source`. When it is complete, modify the API Endpoint that is querying the old Materialized View to query from the new `Materialized View Data Source`. For example: SELECT * from original_materialized_view_data_source Would become: SELECT * from new_materialized_view_data_source ### 7. Test, then clean up old resources¶ Test that your API Endpoint is serving the correct data. If everything looks good, you can tidy up your Workspace by deleting the original Materialized View & the new `Transform Pipe`. ## Scenario 4: It's too complex and I can't figure it out¶ If you are dealing with a very complex scenario, don't worry! Contact Tinybird support ( [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) ). ## Next steps¶ - Got your schema sorted and ready to make some queries? Understand[ how to work with time](https://www.tinybird.co/docs/docs/guides/querying-data/working-with-time) . - Learn how to[ monitor your ingestion](https://www.tinybird.co/docs/docs/guides/monitoring/monitor-your-ingestion) . --- URL: https://www.tinybird.co/docs/guides/ingesting-data/recover-from-quarantine Last update: 2024-11-07T09:52:34.000Z Content: --- title: "Recover data in quarantine · Tinybird Docs" theme-color: "#171612" description: "Learn how to recover data from quarantine, and how to fix common errors that cause data to be sent to quarantine." --- # Recover data from quarantine¶ In this guide you'll learn about the quarantine Data Source, and how to use it to detect and fix errors on your Data Sources. The quarantine Data Source is named `{datasource_name}_quarantine` and can be queried using Pipes like a regular Data Source. ## Prerequisites¶ This guide assumes you're familiar with the concept of the [quarantine Data Source](https://www.tinybird.co/docs/docs/concepts/data-sources#the-quarantine-data-source). ## Example scenario¶ This guide uses the Tinybird CLI, but all steps can be performed in the UI as well. ### Setup¶ This example uses an NDJSON Data Source that looks like this: { "store_id": 1, "purchase": { "product_name": "shoes", "datetime": "2022-01-05 12:13:14" } } But you could use any ingestion method. 
Let's say you generate a Data Source file from this JSON snippet, push the Data Source to Tinybird, and ingest the JSON as a single row: ##### Push the NDJSON_DS Data Source echo '{"store_id":1,"purchase":{"product_name":"shoes","datetime":"2022-01-05 12:13:14"}}' > ndjson_ds.ndjson tb datasource generate ndjson_ds.ndjson tb push --fixtures datasources/ndjson_ds.datasource tb sql "select * from ndjson_ds" The schema generated from the JSON will look like this: ##### NDJSON_DS.DATASOURCE DESCRIPTION > Generated from ndjson_ds.ndjson SCHEMA > purchase_datetime DateTime `json:$.purchase.datetime`, purchase_product_name String `json:$.purchase.product_name`, store_id Int16 `json:$.store_id` At this point, you can check in the UI and confirm that your Data Source has been created and the row ingested. Hooray! <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-quarantine-1.png&w=3840&q=75) <-figcaption-> Data Source details can be accessed from your Sidebar ### Add data that doesn't match the schema¶ Now, if you append some rows that don't match the Data Source schema, these rows will end up in the quarantine Data Source. ##### Append rows with wrong schema echo '{"store_id":2,"purchase":{"datetime":"2022-01-05 12:13:14"}}\n{"store_id":"3","purchase":{"product_name":"shirt","datetime":"2022-01-05 12:13:14"}}' > ndjson_quarantine.ndjson tb datasource append ndjson_ds ndjson_quarantine.ndjson tb sql "select * from ndjson_ds_quarantine" This time, if you check in the UI, you'll see a notification warning you about quarantined rows: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-quarantine-2.png&w=3840&q=75) <-figcaption-> The quarantine Data Source is always accessible (if it contains any rows) from the Data Source modal window In the Data Source view you'll find the Log tab, which shows you details about all operations performed on a Data Source. If you're following the steps of this guide, you should see a row with `event_type` as **append** and `written_rows_quarantine` as **2**. From the quarantine warning notification, navigate to the quarantine Data Source page, and review the problematic rows: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-quarantine-3.png&w=3840&q=75) <-figcaption-> Within the quarantine view you can see both a summary of errors and the rows that have failed The **Errors** view shows you a summary of all the errors and the number of occurrences for each of them, so you can prioritize fixing the most common ones. The **Rows** view shows you all the rows that have failed, so you can investigate further. ## Fix quarantine errors¶ There are generally three ways of fixing quarantine errors: ### 1. Modify your data producer¶ Usually, the best solution is to fix the problem at the source. This means updating the applications or systems that are producing the data, before they send it to Tinybird. The benefit of this is that you don't need to do additional processing to normalize the data after it has been ingested, which helps to save cost and reduce overall latency. However, it can come at the cost of having to push changes into a production application, which can be complex or have side effects on other systems. ### 2. Modify the Data Source schema¶ Often, the issue that causes a row to end up in quarantine is a mismatch of data types. A simple solution is to [modify the Data Source schema](https://www.tinybird.co/docs/docs/guides/ingesting-data/iterate-a-data-source) to accept the new type.
For example, if an application is starting to send integers that are too large for `Int8` , you might update the schema to use `Int16`. Avoid Nullable columns, as they can have significantly worse performance. Instead, send alternative values like `0` for any `Int` type, or an empty string for a `String` type. ### 3. Transform data with Pipes and Materialized Views¶ This is one of the most powerful capabilities of Tinybird. If you are not able to modify the data producer, you can apply a transformation to the erroring columns at ingestion time and materialize the result into another Data Source. You can read more about this in the [Materialized Views docs](https://www.tinybird.co/docs/docs/publish/materialized-views/overview). ## Recover rows from quarantine¶ The quickest way to recover rows from quarantine is to fix the cause of the errors and then re-ingest the data. However, that is not always possible. You can recover rows from the quarantine using a recovery Pipe and the Tinybird API: ### Create a recovery Pipe¶ You can create a Pipe to select the rows from the quarantine Data Source and transform them into the appropriate schema. The previous example showed rows where the `purchase_product_name` contained `null` or the `store_id` contained a `String` rather than an `Int16`: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-quarantine-5.png&w=3840&q=75) <-figcaption-> Remember that quarantined columns are Nullable(String) All columns in a quarantine Data Source are `Nullable()` , which means that you must use the [coalesce()](https://clickhouse.com/docs/en/sql-reference/functions/functions-for-nulls/#coalesce) function if you want to transform them into a non-nullable type. This example uses coalesce to set a default value of `DateTime(0)`, `''` , or `0` for `DateTime`, `String` and `Int16` types respectively. Additionally, all columns in a quarantine Data Source are stored as `String` . This means that you must specifically transform any non-String column into its desired type as part of the recovery Pipe. This example transforms the `purchase_datetime` and `store_id` columns to `DateTime` and `Int16` types respectively. The quarantine Data Source contains additional meta-columns `c__error_column`, `c__error`, `c__import_id` , and `insertion_date` with information about the errors and the rows, so you should not use `SELECT *` to recover rows from quarantine. The following SQL transforms the quarantined rows from this example into the original Data Source schema: SELECT coalesce( parseDateTimeBestEffortOrNull( purchase_datetime ), toDateTime(0) ) as purchase_datetime, coalesce( purchase_product_name, '' ) as purchase_product_name, coalesce( coalesce( toInt16(store_id), toInt16(store_id) ), 0 ) as store_id FROM ndjson_ds_quarantine <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-quarantine-6.png&w=3840&q=75) <-figcaption-> Recover endpoint Just as with any other Pipe, you can publish the results of this recovery Pipe as an API Endpoint. 
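If you manage your Tinybird resources as files, the same query could be saved as a Pipe file and pushed with the CLI; a minimal sketch follows (the Node name is illustrative, and the Pipe name matches the `quarantine_recover` Endpoint used in the next step): ##### pipes/quarantine_recover.pipe (sketch)
NODE recover_rows
SQL >
    SELECT
        coalesce(parseDateTimeBestEffortOrNull(purchase_datetime), toDateTime(0)) AS purchase_datetime,
        coalesce(purchase_product_name, '') AS purchase_product_name,
        coalesce(toInt16(store_id), 0) AS store_id
    FROM ndjson_ds_quarantine
Push it with `tb push pipes/quarantine_recover.pipe` and publish the Node as an API Endpoint, just as you would with any other Pipe.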
### Ingest the fixed rows and truncate quarantine¶ You can then use the Tinybird CLI to append the fixed data back into the original Data Source, by hitting the API Endpoint published from the recovery Pipe: tb datasource append To avoid dealing with JSONPaths, you can hit the recovery Pipe's CSV endpoint: tb datasource append ndjson_ds https://api.tinybird.co/v0/pipes/quarantine_recover.csv?token= Check that your Data Source now has the fixed rows, either in the UI, or from the CLI using: tb sql "select * from ndjson_ds" Finally, truncate the quarantine Data Source to clear out the recovered rows, either in the UI, or from the CLI using: tb datasource truncate ndjson_ds_quarantine --yes You should see that your Data Source now has all of the rows, and the quarantine notification has disappeared. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-quarantine-7.png&w=3840&q=75) <-figcaption-> Data Source with the recovered rows and truncated quarantine If your quarantine has too many rows, you may need to add pagination based on the `insertion_date` and/or `c__import_id` columns. If you're using a Kafka Data Source, remember to add the Kafka metadata columns. ## Recover rows from quarantine with CI/CD¶ When you connect your Workspace to Git and it becomes read-only you want all your workflows to go through CI/CD. This is how you recover rows from quarantine in your data project using Git and automating the workflow. ### Prototype the process in a Branch¶ This step is optional, but it's good practice. When you need to perform a change to your data project and it's read-only, you can create a new Branch and prototype the changes there, then later bring them to Git. To test this process: 1. Create a Branch 2. Ingest a file that creates rows in quarantine 3. Prototype a Copy Pipe 4. Run it 5. Validate data is recovered ### A practical example with Git¶ There is an additional guide showing how to [recover quarantine rows from Git using CI/CD](https://github.com/tinybirdco/use-case-examples/tree/main/recover_data_from_quarantine) , where the data project is the [Web Analytics Starter Kit](https://github.com/tinybirdco/web-analytics-starter-kit). When your rows end up in quarantine, you receive an e-mail like this: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fgit-quarantine.jpg&w=3840&q=75) In this additional example, the issue is the `timestamp` column - instead of being a DateTime, it's String Unix time, so the rows can't be properly ingested. 
{"timestamp":"1697393030","session_id":"b7b1965c-620a-402a-afe5-2d0eea0f9a34","action":"page_hit","version":"1","payload":"{ \"user-agent\":\"Mozilla\/5.0 (Linux; Android 13; SM-A102U) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/106.0.5249.118 Mobile Safari\/537.36\", \"locale\":\"en-US\", \"location\":\"FR\", \"referrer\":\"https:\/\/www.github.com\", \"pathname\":\"\/pricing\", \"href\":\"https:\/\/www.tinybird.co\/pricing\"}"} To convert the `timestamp` values in quarantine to a `DateTime` , you'd build a Copy Pipe like this: NODE copy_quarantine SQL > SELECT toDateTime(fromUnixTimestamp64Milli(toUInt64(assumeNotNull(timestamp)) * 1000)) timestamp, assumeNotNull(session_id) session_id, assumeNotNull(action) action, assumeNotNull(version) version, assumeNotNull(payload) payload FROM analytics_events_quarantine TYPE COPY TARGET_DATASOURCE analytics_events To test the changes, you'd need to do a custom deployment: #!/bin/bash # use set -e to raise errors for any of the commands below and make the CI pipeline to fail set -e tb datasource append analytics_events datasources/fixtures/analytics_events_errors.ndjson tb deploy tb pipe copy run analytics_events_quarantine_to_final --wait --yes sleep 10 First append a sample of the quarantined rows, then deploy the Copy Pipe, and finally run the copy operation. Once changes have been deployed in a test Branch, you can write data quality tests to validate the rows are effectively being copied: - analytics_events_quarantine: max_bytes_read: null max_time: null sql: | SELECT count() as c FROM analytics_events_quarantine HAVING c <= 0 - copy_is_executed: max_bytes_read: null max_time: null sql: | SELECT count() c, sum(rows) rows FROM tinybird.datasources_ops_log WHERE datasource_name = 'analytics_events' AND event_type = 'copy' HAVING rows != 74 and c = 1 `analytics_events_quarantine` checks that effectively some of the rows are in quarantine while `copy_is_executed` tests that the rows in quarantine have been copied to the `analytics_events` Data Source. Lastly, you need to deploy the Branch: # use set -e to raise errors for any of the commands below and make the CI pipeline to fail set -e tb deploy tb pipe copy run analytics_events_quarantine_to_final --wait You can now merge the Pull Request, the Copy Pipe will be deployed to the Workspace and the copy operation will be executed ingesting all rows in quarantine. After that you can optionally truncate the quarantine Data Source using `tb datasource truncate analytics_events_quarantine`. This is a [working Pull Request](https://github.com/tinybirdco/use-case-examples/pull/6) with all the steps mentioned above. ## Next steps¶ - Make sure you're familiar with the[ challenges of backfilling real-time data](https://www.tinybird.co/docs/docs/production/backfill-strategies#the-challenge-of-backfilling-real-time-data) - Learn how to[ monitor your ingestion](https://www.tinybird.co/docs/docs/guides/monitoring/monitor-your-ingestion) . Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/guides/ingesting-data/replace-and-delete-data Last update: 2024-10-28T11:06:14.000Z Content: --- title: "Replace and delete data · Tinybird Docs" theme-color: "#171612" description: "Update & delete operations are common in transactional databases over operational data, but sometimes you also need to make these changes on your analytical data in Tinybird." 
--- # Replace and delete data in your Tinybird Data Sources¶ Update & delete operations are common in transactional databases over operational data, but sometimes you also need to make these changes on your analytical data in Tinybird. Sometimes, you need to delete or replace some of your data in Tinybird. Perhaps there was a bug in your application, a transient error in your operational database, or simply an evolution of requirements due to product or regulatory changes. It is **not safe** to replace data in the partitions where you are actively ingesting data. You may lose the data inserted during the process. While real-time analytical databases like Tinybird are optimized for SELECTs and INSERTs, Tinybird fully supports replacing & deleting data. The the tricky complexities of data replication, partition management and mutations rewriting are abstracted away, allowing you to focus on your data engineering flows and not the internals of real-time analytical databases. This guide will show you, with different examples, how to selectively delete or update data in Tinybird using the REST API. You can then adapt these processes for your own needs. All operations on this page require a Token with the correct scope. In the code snippets, replace `` by a Token whose [scope](https://www.tinybird.co/docs/docs/api-reference/token-api) is `DATASOURCES:CREATE` or `ADMIN`. ## Delete data selectively¶ To delete data that is within a condition, send a POST request to the [Data Sources /delete API](https://www.tinybird.co/docs/docs/api-reference/datasource-api#post--v0-datasources-(.+)-delete) , providing the name of one of your Data Sources in Tinybird and a `delete_condition` parameter, which is an SQL expression filter. Delete operations do not automatically cascade to downstream Materialized Views. You will need to perform separate delete operations on Materialized Views. Imagine you have a Data Source called `events` and you want to remove all the transactions for November 2019. You'd send a POST request like this: - CLI - API ##### Delete data selectively tb datasource delete events --sql-condition "toDate(date) >= '2019-11-01' and toDate(date) <= '2019-11-30'" Once you make the request, you'll see that the `POST` request to the delete API Endpoint is asynchronous. It returns a [job response](https://www.tinybird.co/docs/docs/api-reference/jobs-api#jobs-api-getting-information-about-jobs) , indicating an ID for the job, the status of the job, the `delete_condition` , and some other metadata. Although the delete operation runs asynchronously (hence the job response), the operation waits synchronously for all the mutations to be rewritten and data replicas to be deleted. { "id": "64e5f541-xxxx-xxxx-xxxx-00524051861b", "job_id": "64e5f541-xxxx-xxxx-xxxx-00524051861b", "job_url": "https://api.tinybird.co/v0/jobs/64e5f541-xxxx-xxxx-xxxx-00524051861b", "job": { "kind": "delete_data", "id": "64e5f541-xxxx-xxxx-xxxx-00524051861b", "job_id": "64e5f541-xxxx-xxxx-xxxx-00524051861b", "status": "waiting", "created_at": "2023-04-11 13:52:32.423207", "updated_at": "2023-04-11 13:52:32.423213", "started_at": null, "is_cancellable": true, "datasource": { "id": "t_c45d5ae6781b41278fcee365f5bxxxxx", "name": "shopping_data" }, "delete_condition": "event = 'search'" }, "status": "waiting", "delete_id": "64e5f541-xxxx-xxxx-xxxx-00524051861b" } You can periodically poll the `job_url` with the given ID to check the status of the deletion process. 
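For example, a minimal sketch of checking on the job using the `job_url` returned above (`$TOKEN` stands for the same Token used for the delete request): ##### Poll the status of the delete job
curl \
  -H "Authorization: Bearer $TOKEN" \
  "https://api.tinybird.co/v0/jobs/64e5f541-xxxx-xxxx-xxxx-00524051861b"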
When it's `done` it means the data matching the SQL expression filter has been removed and all your Pipes and API Endpoints will continue running with the remaining data in the Data Source. ### Truncate a Data Source¶ Sometimes you just want to delete all data contained in a Data Source. Most of the time starting from zero. You can perform this action from the UI and API. Using the API, the [truncate](https://www.tinybird.co/docs/docs/api-reference/datasource-api#post--v0-datasources-(.+)-truncate) endpoint will delete all rows in a Data Source and can be done as follows: - CLI - API ##### Truncate a Data Source tb datasource truncate You can also truncate a Data Source directly from the UI: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Freplacing-and-deleting-data-1.png&w=3840&q=75) <-figcaption-> Deleting selectively is only available via API, but truncating it to delete all of its data can be done via the UI. ## Replace data selectively¶ The ability to update data is often not the top priority when designing analytical databases, but there are always scenarios where you need to update or replace your analytical data. For example, you might have reconciliation processes over your transactions that affect your original data. Or maybe your ingestion process was simply faulty, and you ingested inaccurate data for a period of time. In Tinybird, you can specify a condition under which only a part of the data is replaced during the ingestion process. For instance, let's say you want to reingest a CSV with the data for November 2019 and update your Data Source accordingly. In order to update the data, you'll need to pass the `replace_condition` parameter with the `toDate(date) >= '2019-11-01' and toDate(date) <= '2019-11-30'` condition. - CLI - API ##### Replace data selectively tb datasource replace events \ https://storage.googleapis.com/tinybird-assets/datasets/guides/events_1M_november2019_1.csv \ --sql-condition "toDate(date) >= '2019-11-01' and toDate(date) <= '2019-11-30'" The response to the previous API call looks like this: ##### Response after replacing data { "id": "a83fcb35-8d01-47b9-842c-a288d87679d0", "job_id": "a83fcb35-8d01-47b9-842c-a288d87679d0", "job_url": "https://api.tinybird.co/v0/jobs/a83fcb35-8d01-47b9-842c-a288d87679d0", "job": { "kind": "import", "id": "a83fcb35-8d01-47b9-842c-a288d87679d0", "job_id": "a83fcb35-8d01-47b9-842c-a288d87679d0", "import_id": "a83fcb35-8d01-47b9-842c-a288d87679d0", "status": "waiting", "statistics": null, "datasource": { ... }, "quarantine_rows": 0, "invalid_lines": 0 }, "status": "waiting", "import_id": "a83fcb35-8d01-47b9-842c-a288d87679d0" } As in the case of the selective deletion, selective replacement also runs as an asynchronous request, so it's recommended to [check the status of the job](https://www.tinybird.co/docs/docs/api-reference/jobs-api#jobs-api-getting-information-about-jobs) periodically. You can see the status of the job by going to the `job_url` returned in the previous response. ### About the replace condition¶ Conditional replaces are applied over partitions. Partitions are selected for replaces based on the rows that match the condition in the new data. The partitions involved are the ones where these remaining rows would be stored. The replace condition is applied to filter the new data that's going to be appended, meaning rows not matching the condition won't be inserted. 
The condition is also applied for the selected partitions in the Data Source, so rows that don't match the condition in these partitions will be removed. But rows that don't match the condition and may be present in other partitions won't be deleted. If you are trying to delete rows in the target data source in your workflow, have a look at the "Replace data removing non-matching rows from the data source" example below. ### Linked Materialized Views¶ If you have several connected Materialized Views, then selective replaces are done in cascade. For example, if Data Source A materializes data in a cascade to Data Source B and from there to Data Source C, then when you replace data in Data Source A, Data Sources B and C will automatically be updated accordingly. All three Data Sources need to have compatible partition keys since replaces are done by partition. The command `tb dependencies --datasource the_data_source --check-for-partial-replace` returns the dependencies that would be recalculated, both for Data Sources and Materialized Views, and raises an error if any of the dependencies have incompatible partition keys. Remember: The provided Token must have the `DATASOURCES:CREATE` [scope](https://www.tinybird.co/docs/docs/api-reference/token-api). ### Example¶ For this example, consider this Data Source: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Freplacing-example-1.jpeg&w=3840&q=75) Its partition key is `ENGINE_PARTITION_KEY "profession"` . If you wanted to replace the last two rows with new data, you can send this request with the replace condition `replace_condition=(profession='Jedi')`: - CLI - API ##### Replace with partition in condition echo "50,Mace Windu,Jedi" > jedi.csv tb datasource replace characters jedi.csv --sql-condition "profession='Jedi'" Since the replace condition column matches the partition key, the result is: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Freplacing-example-2.jpeg&w=3840&q=75) However, consider what happens if you create the Data Source with `ENGINE_PARTITION_KEY "name"`: ##### characters.datasource SCHEMA > `age` Int16, `name` String, `profession` String ENGINE "MergeTree" ENGINE_SORTING_KEY "age, name, profession" ENGINE_PARTITION_KEY "name" If you were to run the same replace request, the result probably doesn't make sense: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Freplacing-example-3.jpeg&w=3840&q=75) Why were the existed rows not removed? Because the `replace` process uses the payload rows to identify which partitions to work on. The Data Source is now partitioned by name (not profession), so the process didn't delete the other "Jedi" rows. They're in different partitions because they have different names. The rule of thumb is this: **Always make sure the replace condition uses the partition key as the filter field**. ## Replace a Data Source completely¶ To replace a complete Data Source, make an API call similar to the previous example, without providing a `replace_condition`: - CLI - API ##### Replace Data Source completely tb datasource replace events https://storage.googleapis.com/tinybird-assets/datasets/guides/events_1M_november2019_1.csv The request above is replacing a Data Source with the data found in a given URL pointing to a CSV file. You can also replace a Data Source in the Tinybird UI: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Freplacing-and-deleting-data-2.png&w=3840&q=75) <-figcaption-> Replacing a Data Source completely can also be done through the User Interface Schemas must be identical. 
When replacing data (either selectively or entirely) the schema of the new inbound data must match that of the original Data Source. Rows not containing the same schema will go to quarantine. ## Next steps¶ - Learn how[ to get rows out of quarantine](https://www.tinybird.co/docs/docs/concepts/data-sources#the-quarantine-data-source) . - Need to[ iterate a Data Source, including the schema](https://www.tinybird.co/docs/docs/guides/ingesting-data/iterate-a-data-source) ? Read how here. --- URL: https://www.tinybird.co/docs/guides/ingesting-data/scheduling-with-github-actions-and-cron Last update: 2024-10-28T11:06:14.000Z Content: --- title: "Schedule data ingestion · Tinybird Docs" theme-color: "#171612" description: "Cronjobs are the universal way of scheduling tasks. In this guide, you'll learn how to keep your data in sync with cron jobs or GitHub Actions and the Tinybird API." --- # Schedule data ingestion with cron and GitHub Actions¶ Cronjobs are the universal way of scheduling tasks. In this guide, you'll learn how to keep your data in sync with cronjobs or GitHub actions and the Tinybird REST API. ## Overview¶ For this example, let's assume you've already imported a Data Source to your Tinybird account and that you have properly defined its schema and partition key. Once everything is set, you can easily perform some operations using the [Data Sources API](https://www.tinybird.co/docs/docs/api-reference/datasource-api) to **periodically append to or replace data** in your Data Sources. This guide shows you some examples. ## About crontab¶ Crontab is a native Unix tool that schedules command execution at a specified time or time interval. It works by defining the schedule, and the command to execute, in a text file. This can be achieved using `sudo crontab -e` . You can learn more about using crontab using many online resources like [crontab.guru](https://crontab.guru/crontab.5.html) and [the man page for crontab](https://man7.org/linux/man-pages/man5/crontab.5.html). ### The cron table format¶ Cron follows a table format like the following (note that you can also use [external tools like crontab.guru](https://crontab.guru/) to help you define the cron job schedule): ##### Cron syntax explanation * * * * * Command_to_execute | | | | | | | | | Day of the Week ( 0 - 6 ) ( Sunday = 0 ) | | | | | | | Month ( 1 - 12 ) | | | | | Day of Month ( 1 - 31 ) | | | Hour ( 0 - 23 ) | Min ( 0 - 59 ) Using this format, the following would be typical cron schedules to execute commands at different times: - Every five minutes: `0/5 \* \* \* \*` - Every day at midnight: `0 0 \* \* \*` - Every first day of month: `\* \* 1 \* \*` - Every Sunday at midnight: `0 0 \* \* 0` Be sure you save your scripts in the right location. Save your shell scripts in the `/opt/cronjobs/` folder. ## Append data periodically¶ It's very common to have a Data Source that grows over time. There is often is also an ETL process extracting this data from the transactional database and generating CSV files with the last X hours or days of data, therefore you might want to append those recently-generated rows to your Tinybird Data Source. For this example, imagine you generate new CSV files every day at 00:00 that you want to append to Tinybird everyday at 00:10. 
### Option 1: With a shell script¶ First, you need to create a shell script file containing the Tinybird API request operation: ##### Contents of append.sh #!/bin/bash TOKEN=your_token CSV_URL="http://your_url.com" curl \ -H "Authorization: Bearer $TOKEN" \ -X POST \ -d url=$CSV_URL \ -d mode='append' \ -d name='events' \ https://api.tinybird.co/v0/datasources Then, add a new line to your crontab file (using `sudo crontab -e` ): 10 0 * * * sh -c /opt/cronjobs/append.sh ### Option 2: Using GitHub Actions¶ If your project is hosted on GitHub, you can also use GitHub Actions to schedule periodic jobs. Create a new file called `.github/workflows/append.yml` with the following code to append data from a CSV given its URL every day at 00:10. ##### Contents of .github/workflows/append.yml name: Append data every day at 00:10 on: push: workflow_dispatch: schedule: - cron: '10 0 * * *' jobs: scheduled: runs-on: ubuntu-latest steps: - name: Check out this repo uses: actions/checkout@v2 - name: Append new data run: |- curl \ -H "Authorization: Bearer $TOKEN" \ -X POST \ -d url=$CSV_URL \ -d mode='append' \ -d name='events' \ https://api.tinybird.co/v0/datasources ## Replace data periodically¶ Let's use another example. With this new fictional Data Source, imagine a scenario where you want to replace the whole Data Source with a CSV file sitting at a publicly accessible URL every first day of the month. ### Option 1: With a shell script¶ ##### Contents of replace.sh #!/bin/bash TOKEN=your_token CSV_URL="http://your_url.com" curl \ -H "Authorization: Bearer $TOKEN" \ -X POST \ -d url=$CSV_URL \ -d mode='replace' \ -d name='events' \ https://api.tinybird.co/v0/datasources Then edit the crontab file which takes care of periodically executing your script. Run `sudo crontab -e`: ##### Setting up a crontab to run a script periodically 0 0 1 * * sh -c /opt/cronjobs/replace.sh ### Option 2: With GitHub Actions¶ Create a new file called `.github/workflows/replace.yml` with the following code to replace all your data with the CSV at a given URL every day at 00:10. ##### Contents of .github/workflows/replace.yml name: Replace all data every day at 00:10 on: push: workflow_dispatch: schedule: - cron: '10 0 * * *' jobs: scheduled: runs-on: ubuntu-latest steps: - name: Check out this repo uses: actions/checkout@v2 - name: Replace all data run: |- curl \ -H "Authorization: Bearer $TOKEN" \ -X POST \ -d url=$CSV_URL \ -d mode='replace' \ -d name='events' \ https://api.tinybird.co/v0/datasources ## Replace just one month of data¶ Having your API call inside a shell script allows you to script more complex ingestion processes. For example, imagine you want to replace the last month of events data, every day. Then each day, you would export a CSV file to a publicly accessible URL and name it something like `events_YYYY-MM-DD.csv`.
### Option 1: With a shell script¶ You could script a process that would do a conditional data replacement as follows: ##### Script to replace data selectively on Tinybird #!/bin/bash TODAY=`date +"%Y-%m-%d"` ONE_MONTH_AGO=`date -v -1m +%Y-%m-%d` TOKEN=your_token DATASOURCE=events CSV_URL="http://your_url.com" curl \ -H "Authorization: Bearer $TOKEN" \ -X POST \ -d url=$CSV_URL \ -d mode='replace' \ -d "replace_condition=(created_at+BETWEEN+'${ONE_MONTH_AGO}'+AND+'${TODAY}')" \ -d name=$DATASOURCE \ https://api.tinybird.co/v0/datasources Then, after saving that file to `/opt/cronjobs/daily_replace.sh` , add the following line to `crontab` to run it every day at midnight: ##### Setting up a crontab to run a script periodically 0 0 * * * sh -c /opt/cronjobs/daily_replace.sh ### Option 2: With GitHub Actions¶ Create a new file called `.github/workflows/replace_last_month.yml` with the following code to replace all the data for the last month every day at 00:10. ##### Contents of .github/workflows/replace_last_month.yml name: Replace last month of data every day at 00:10 on: push: workflow_dispatch: schedule: - cron: '10 0 * * *' jobs: scheduled: runs-on: ubuntu-latest steps: - name: Check out this repo uses: actions/checkout@v2 - name: Replace last month of data run: |- TODAY=`date +"%Y-%m-%d"` ONE_MONTH_AGO=`date -v -1m +%Y-%m-%d` DATASOURCE=events # could also be set via github secrets CSV_URL="http://your_url.com" # could also be set via github secrets curl \ -H "Authorization: Bearer $TOKEN" \ -X POST \ -d url=$CSV_URL \ -d mode='replace' \ -d "replace_condition=(created_at+BETWEEN+'${ONE_MONTH_AGO}'+AND+'${TODAY}')" \ -d name=$DATASOURCE \ https://api.tinybird.co/v0/datasources Use GitHub secrets: Store `TOKEN` as an [encrypted secret](https://docs.github.com/en/actions/reference/encrypted-secrets) to avoid hardcoding secret keys in your repositories, and replace `DATASOURCE` and `CSV_URL` by their values or save them as secrets as well. ## Next steps¶ - Learn more about[ GitHub Actions and CI/CD processes on Tinybird](https://www.tinybird.co/docs/docs/production/continuous-integration) . - Understand how to[ work with time](https://www.tinybird.co/docs/docs/guides/querying-data/working-with-time) . --- URL: https://www.tinybird.co/docs/guides/integrations/add-charts-to-nextjs Last update: 2024-11-12T11:45:41.000Z Content: --- title: "Add Tinybird Charts to a Next.js frontend · Tinybird Docs" theme-color: "#171612" description: "Tinybird Charts make it easy to create interactive charts. In this guide, we'll show you how to add Tinybird Charts to a Next.js frontend." --- # Add Tinybird Charts to a Next.js frontend¶ In this guide, you'll learn how to create Tinybird Charts from the UI, and add them to your Next.js frontend. Tinybird Charts make it easy to visualize your data and create interactive charts. You can create a chart from the UI, and then embed it in your frontend application. This guide will show you how to add Tinybird Charts to a Next.js frontend. You can view the [live demo](https://guide-tinybird-charts.vercel.app/) or browse the [GitHub repo (guide-tinybird-charts)](https://github.com/tinybirdco/guide-tinybird-charts). <-figure-> ![Tinybird charts demo](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fcharts-demo.png&w=3840&q=75) ## Prerequisites¶ This guide assumes that you have a Tinybird account, and you are familiar with creating a Tinybird Workspace and pushing resources to it. You'll need a working familiarity with JavaScript and Next.js.
## Run the demo¶ These steps cover running the GitHub demo locally. [Skip to the next section](https://www.tinybird.co/docs/about:blank#build-from-scratch) to build the demo from scratch. ### 1. Clone the GitHub repo¶ Clone the [GitHub repo (guide-tinybird-charts)](https://github.com/tinybirdco/guide-tinybird-charts) to your local machine. ### 2. Push Tinybird resources¶ The repo contains a `tinybird` folder which includes sample Tinybird resources: - `events.datasource` : The Data Source for incoming events. - `airline_market_share.pipe` : An API Endpoint giving a count of bookings per airline. - `bookings_over_time.pipe` : An API Endpoint giving a time series of booking volume over time. - `bookings_over_time_by_airline.pipe` : An API Endpoint giving a time series of booking volume over time with an `airline` filter. - `meal_choice_distribution.pipe` : An API Endpoint giving a count of meal choices across bookings. - `top_airlines.pipe` : An API Endpoint giving a list of the top airlines by booking volume. Make a new Tinybird Workspace in the region of your choice. Then, configure the [Tinybird CLI](https://www.tinybird.co/docs/docs/cli/install) (install and authenticate) and `tb push` the resources to your Workspace. Alternatively, you can drag and drop the files onto the UI to upload them. ### 3. Generate some fake data¶ Use [Mockingbird](https://tbrd.co/mockingbird-tinybird-charts-guide) to generate fake data for the `events` Data Source. Using this link ^ provides a pre-configured schema, and you'll just need to enter your Workspace Admin Token and select the Host region that matches your Workspace. When configured, select `Save` , then scroll down and select `Start Generating!`. In the Tinybird UI, confirm that the `events` Data Source is successfully receiving data. ### 4. Install dependencies¶ In the cloned repo, navigate to `/app` and install the dependencies with `npm install`. ### 5. Configure .env¶ First create a new file `.env.local` ##### Create the .env.local file in /app cp .env.example .env.local From the Tinybird UI, copy the read Token for the Charts (if you deployed the resources from this repo, it will be called `CHART_READ_TOKEN` ). Paste the Token into the `.env.local` file in your directory: ##### In the .env.local file NEXT_PUBLIC_TINYBIRD_STATIC_READ_TOKEN="STATIC READ TOKEN" ### Run the demo app¶ Run it locally: npm run dev Then open [http://localhost:3000](http://localhost:3000/) with your browser. ## Build from scratch¶ This section will take you from a fresh Tinybird Workspace to a Next.js app with a Tinybird Chart. ### 1. Set up a Workspace¶ Create a new Workspace. This guide uses the `EU GCP` region, but you can use any region. Save [this .datasource file](https://github.com/tinybirdco/guide-tinybird-charts/blob/main/tinybird/datasources/events.datasource) locally, and upload it to the Tinybird UI - you can either drag and drop, or use **Create new (+)** to add a new Data Source. You now have a Workspace with an `events` Data Source and specified schema. Time to generate some data to fill the Data Source! ### 2. Generate some fake data¶ Use [Mockingbird](https://tbrd.co/mockingbird-tinybird-charts-guide) to generate fake data for the `events` Data Source. Using this link ^ provides a pre-configured schema, and you'll just need to enter your Workspace Admin Token and select the Host region that matches your Workspace. When configured, select `Save` , then scroll down and select `Start Generating!`. 
In the Tinybird UI, confirm that the `events` Data Source is successfully receiving data. ### 3. Create and publish an API Endpoint¶ In the Tinybird UI, select the `events` Data Source and then select `Create Pipe` in the top right. In the new Pipe, change the name to `top_airlines`. In the first SQL Node,paste the following SQL: SELECT airline, count() as bookings FROM events GROUP BY airline ORDER BY bookings DESC LIMIT 5 Name this Node `endpoint` and select `Run`. Now, publish the Pipe by selecting `Create API Endpoint` and selecting the `endpoint` Node. Congratulations! You have a published API Endpoint. ### 4. Create a Chart¶ Publishing the API Endpoint takes you to the API Endpoint overview page. Scroll down to the `Output` section and select the `Charts` tab. Select `Create Chart`. On the `General` tab, set the name to `Top Airlines` then choose `Bar List` as the Chart type. On the `Data` tab, choose the `airline` column for the `Index` and check the `bookings` box for the `Categories`. Select Save. ### 5. View the Chart component code¶ After saving your Chart, you'll be returned to the API Endpoint overview page and you'll see your Chart in the `Output` section. To the view the component code for the Chart, select the code symbol ( `<>` ) above it. You'll see the command to install the `tinybird-charts` library as well as the React component code. ### 6. Create a new Next.js app¶ On your local machine, create a new working directory and navigate to it. For this example, we'll call it `myapp`. ##### Make a new working directory for your Next.js frontend app mkdir myapp cd myapp In the `myapp` dir, create a new Next.js app with the following command: ##### Initialize a new Next.js app npx create-next-app@latest You will see some prompts to configure the app. For this guide, we'll use the following settings: ##### Example new Next.js app settings ✔ What is your project named? … tinybird-demo ✔ Would you like to use TypeScript? … No / [Yes] ✔ Would you like to use ESLint? … No / [Yes] ✔ Would you like to use Tailwind CSS? … No / [Yes] ✔ Would you like to use `src/` directory? … No / [Yes] ✔ Would you like to use App Router? (recommended) … No / [Yes] ✔ Would you like to customize the default import alias (@/*)? … [No] / Yes After the app is created, navigate to the app directory (this will be the same as the project name you entered, in this example, `tinybird-demo` ). cd tinybird-demo ### 7. Add the Tinybird Charts library¶ Add the Tinybird Charts library to your project npm install @tinybirdco/charts ### 8. Add the Chart component¶ Create a new subfolder and file `src/app/components/Chart.tsx` . This will contain the component code for the Chart. [Copy the component from the Tinybird UI](https://www.tinybird.co/docs/about:blank#5-view-the-chart-component-code) and paste it here. It should look like this: ##### Example Chart.tsx code copied from Tinybird UI Chart 'use client' import { BarList } from '@tinybirdco/charts' export function TopAirlines() { return ( ) } Save the file. ### 9. Add the Chart to your page¶ In your `src/app/page.tsx` file, delete the default contents so you have an empty file. Then, import the `TopAirlines` component and add it to the page: ##### src/app/page.tsx import { TopAirlines } from "./components/Chart"; export default function Home() { return (
) } ### 10. Run the app¶ Run the app with `npm run dev` and open [http://localhost:3000](http://localhost:3000/) in your browser. ### 11. You're done\!¶ You've successfully added a Tinybird Chart to a Next.js frontend. Your Next.js frontend should now show a single bar line Chart. See the [live demo](https://guide-tinybird-charts.vercel.app/) and browse the [GitHub repo (guide-tinybird-charts)](https://github.com/tinybirdco/guide-tinybird-charts) for inspiration on how to combine more Chart components to make a full dashboard. ## Next steps¶ - Interested in dashboards? Explore Tinybird's many applications in the[ Use Case Hub](https://www.tinybird.co/docs/docs/use-cases) . --- URL: https://www.tinybird.co/docs/guides/integrations/charts-using-iframes-and-jwt-tokens Last update: 2024-10-28T11:06:14.000Z Content: --- title: "Build charts with iframes and JWTs. · Tinybird Docs" theme-color: "#171612" description: "Tinybird Charts make it easy to create interactive charts. In this guide, you'll learn how build them using iframes and JWTs." --- # Build charts with iframes and JWTs¶ In this guide, you'll learn how to build Tinybird Charts using inline frames (iframes) and secure them using JSON Web Tokens (JWTs) [Tinybird Charts](https://www.tinybird.co/docs/docs/publish/charts) make it easy to visualize your data and create interactive charts. As soon as you've published an API Endpoint, you can create a Chart from the Tinybird UI, then immediately embed it in your frontend application. Check out the [live demo](https://guide-tinybird-charts.vercel.app/) to see an example of Charts in action. [JWTs](https://www.tinybird.co/docs/concepts/auth-tokens#json-web-tokens-jwts) are signed tokens that allow you to securely authorize and share data between your application and Tinybird. ## Prerequisites¶ This guide assumes that you have a Tinybird Workspace with active data and one or more published API Endpoints. You'll need a basic familiarity with JavaScript and Python. ## 1. Create a Chart¶ [Create a new Chart](https://www.tinybird.co/docs/docs/publish/charts) based on any of your API Endpoints. ## 2. View the Chart component code¶ After saving your Chart, you're on the API Endpoint "Overview" page. Your Chart should be visible in the "Output" section. To the view the component code for the Chart, select the code symbol ( `<>` ) above it. Select the dropdown and select the iframe example, instead of the default React one. Now select the JWT so the Static Token shown in the iframe's URL is replaced by a `` placeholder. <-figure-> ![Get the iframe code](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-iframes-jwts__get-iframe-code.png&w=3840&q=75) ## 3. Insert your iframe into a new page¶ Now you have your Chart code, create a new `index.html` file to paste the code into: ##### index.html Tinybird Charts In the next step, you'll generate a JWT to replace the `` placeholder. ## 4. Create a JWT¶ ### Understanding the token exchange¶ <-figure-> ![Generate a new token](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-iframes-jwts__token-exchange.png&w=3840&q=75) For each user session (or any other approach you want to follow), your frontend application will send a request with a JWT to your backend. It can be a new or an existing one. Your backend will self-sign and return the token. From that point onwards, you can use this token for any Chart or API call made to Tinybird, directly from your frontend application. 
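As a rough sketch of the frontend side of that exchange (the `/generate-token` route matches the backend endpoint defined in the next section; the iframe element id and the `<JWT>` placeholder string are illustrative): ##### Fetch a JWT from your backend and load the Chart iframe (sketch)
const CHART_IFRAME_URL = 'PASTE_THE_IFRAME_URL_COPIED_FROM_THE_TINYBIRD_UI';

async function loadChart() {
  // Ask your backend for a fresh, self-signed JWT
  const response = await fetch('/generate-token');
  const { token } = await response.json();
  // Swap the token placeholder in the copied iframe URL for the real JWT
  document.getElementById('tinybird-chart').src =
    CHART_IFRAME_URL.replace('<JWT>', token);
}

loadChart();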
JWTs support TTL and includes multi-tenancy capabilities, which makes them safe to use without creating any complex middleware. ### Create a JWT endpoint¶ Create a new endpoint that your frontend will use to retrieve your token. Remember to set your `TINYBIRD_SIGNING_KEY`: ##### Generate a new token from flask import Flask, jsonify, render_template import jwt import datetime import os app = Flask(__name__) # Get your Tinybird admin Token TINYBIRD_SIGNING_KEY= # Use your admin Token as the signing key, or use process.env.TB_TOKEN / similar if you have it set locally # Generate Token function for a specific pipe_id def generate_jwt(): expiration_time = datetime.datetime.utcnow() + datetime.timedelta(hours=48) workspace_id = "1f484a32-6966-4f63-9312-aadad64d3e12" token_name = "charts_token" pipe_id = "t_b9427fe2bcd543d1a8923d18c094e8c1" payload = { "workspace_id": workspace_id, "name": token_name, "exp": expiration_time, "scopes": [ { "type": "PIPES:READ", "resource": pipe_id }, ], } return jwt.encode(payload, TINYBIRD_SIGNING_KEY, algorithm='HS256') @app.route('/generate-token', methods=['GET']) def get_token(): token = generate_jwt() return jsonify({"token": token}) if __name__ == '__main__': app.run(host='0.0.0.0', port=5151) ## 5. Use the JWT in your iframe¶ Edit your `index.html` file using JavaScript to retrieve a JWT from your API Endpoint, and include this token in your iframe. ##### Update the index.html file Tinybird Charts ## Next steps¶ - Learn more about[ JSON Web Tokens (JWTs)](https://www.tinybird.co/docs/concepts/auth-tokens#json-web-tokens-jwts) - Learn more about[ Tinybird Charts](https://www.tinybird.co/docs/publish/charts) - [ Consume APIs in a Next.js frontend using JWTs](https://www.tinybird.co/docs/guides/integrations/consume-apis-nextjs) - [ Add Tinybird Charts to a Next.js frontend](https://www.tinybird.co/docs/guides/integrations/add-charts-to-nextjs) --- URL: https://www.tinybird.co/docs/guides/integrations/consume-api-endpoints-in-grafana Last update: 2024-09-05T16:19:00.000Z Content: --- title: "Consume API Endpoints in Grafana · Tinybird Docs" theme-color: "#171612" description: "Grafana is an awesome open source analytics & monitoring tool. In this guide, you'll learn how to create dashboards consuming Tinybird API Endpoints." --- # Consume API Endpoints in Grafana¶ [Grafana](https://grafana.com/grafana/) is an awesome open source analytics & monitoring tool. In this guide, you'll learn how to create Grafana Dashboards and Alerts consuming Tinybird API Endpoints. ## Prerequisites¶ This guide assumes you have a Tinybird Workspace with an active Data Source, Pipes, and at least one API Endpoint. You'll also need a Grafana account. ## 1. Install the Infinity plugin¶ Follow the steps in [Grafana Infinity plugin installation](https://grafana.com/grafana/plugins/yesoreyeram-infinity-datasource/?tab=installation). ## 2. Create a Grafana data source¶ Create a new Data Source using the Infinity plugin. Connections > Data sources > Add new data source > Infinity. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fgrafana-ds-infinity.png&w=3840&q=75) Edit the name and complete the basic setup: **Authentication** : choose Bearer Token. Pick a token with access to all the needed endpoints or create different data sources per token. Feel free to also add a restrictive list of allowed hosts. 
<-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fgrafana-ds-auth.png&w=3840&q=75) That's basically it, but it is good practice to add a **Health check** with an endpoint that the Token has access to, so you can verify connectivity is OK. ## 3. Configure the Query in Grafana¶ Create a new Dashboard, edit the suggested Panel, and use the Data Source you just created. For this example we'll consume the endpoint shown in this picture: a time series of sensor temperature and humidity readings. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fgrafana-endpoint-sample.png&w=3840&q=75) The configuration in the Query editor is:
- Type: JSON
- Parser: Backend (needed for alerting and JSON parsing)
- Source: URL
- Format: Table
- Method: GET
- URL: `https://api.eu-central-1.aws.tinybird.co/v0/pipes/api_readings.json`
- Parsing options & Result fields > Rows/root: `data` . Needed because the Tinybird response includes metadata, data, statistics...
- Parsing options & Result fields > Columns: add the needed fields, select types, and adjust time formats.
<-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fgrafana-query.png&w=3840&q=75) It is important to use the Backend Parser for the Alerts to work and for compatibility with the root and field selectors, as mentioned in the [plugin docs](https://grafana.com/docs/plugins/yesoreyeram-infinity-datasource/latest/query/backend/#root-selector--field-selector). For the Time Formats you can use the plugin options. By default it uses *Default ISO*, so you can simply add `formatDateTime(t,'%FT%TZ') as t` to your Tinybird API Endpoint and don't need to configure the option in Grafana. Save the dashboard and you should now be able to see a chart. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fgrafana-dashboard.png&w=3840&q=75) ## 4. Using time ranges¶ When you have millions of rows, it is better to filter time ranges in Tinybird than to retrieve all the data and filter later when showing the chart in Grafana. You will get faster responses and more efficient use of resources. Edit the Tinybird Pipe to accept `start_ts` and `end_ts` parameters:
%
SELECT formatDateTime(timestamp,'%FT%TZ') t, temperature, humidity
FROM readings
{% if defined(start_ts) and defined(end_ts) %}
WHERE timestamp BETWEEN parseDateTimeBestEffort({{String(start_ts)}}) AND parseDateTimeBestEffort({{String(end_ts)}})
{% end %}
ORDER BY t ASC
In the Query editor, next to URL, click on Headers, Request params and fill in the URL Query Params. Use Grafana's global variables [$__from and $__to](https://grafana.com/docs/grafana/v9.0/variables/variable-types/global-variables/#__from-and-__to) defined by the time range selector at the top right of the dashboard. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fgrafana-time-ranges.png&w=3840&q=75) As mentioned, filtering makes dashboards more efficient; here you can see how the scan size decreases when filters are applied. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fgrafana-stats-filter.png&w=3840&q=75) ## 5. Dashboard variables¶
You can also define dashboard variables and use them in the query: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fgrafana-dashboard-variable.png&w=3840&q=75) That will make it interactive: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fgrafana-variables.gif&w=3840&q=75) Note that you have to edit the Pipe:
%
SELECT formatDateTime(timestamp,'%FT%TZ') t,
{% if defined(magnitude) and magnitude != 'all' %}
    {{column(magnitude)}}
{% else %}
    temperature, humidity
{% end %}
FROM readings
{% if defined(start_ts) and defined(end_ts) %}
WHERE timestamp BETWEEN parseDateTimeBestEffort({{String(start_ts)}}) AND parseDateTimeBestEffort({{String(end_ts)}})
{% end %}
ORDER BY t ASC
## 6. Alerts¶ The Infinity plugin supports alerting, so you can create your own [rules](https://grafana.com/docs/grafana/latest/alerting/alerting-rules/) . In this example, the alert triggers when the temperature goes outside a defined range. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fgrafana-alert-definition.png&w=3840&q=75) And this is what you will see in the dashboard. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fgrafana-alert-trigger.png&w=3840&q=75) Be sure to correctly set up [notifications](https://grafana.com/docs/grafana/latest/alerting/configure-notifications/) if needed. ## Note¶ Note: a previous version of this guide referred to the [JSON API plugin](https://grafana.com/grafana/plugins/marcusolsson-json-datasource/) but it was migrated to use the [Infinity plugin](https://grafana.com/grafana/plugins/yesoreyeram-infinity-datasource/) since it is now the [default supported](https://grafana.com/blog/2024/02/05/infinity-plugin-for-grafana-grafana-labs-will-now-maintain-the-versatile-data-source-plugin/) one. --- URL: https://www.tinybird.co/docs/guides/integrations/consume-apis-in-a-notebook Last update: 2024-07-31T16:54:46.000Z Content: --- title: "Consume APIs in a Notebook · Tinybird Docs" theme-color: "#171612" description: "Notebooks are a great resource for exploring data and generating plots. In this guide, you'll learn how to consume Tinybird APIs in a notebook." --- # Consume APIs in a notebook¶ Notebooks are a great resource for exploring data and generating plots. In this guide, you'll learn how to consume Tinybird APIs in a notebook. ## Prerequisites¶ This [Colab notebook](https://github.com/tinybirdco/examples/blob/master/notebook/consume_from_apis.ipynb) uses a Data Source of updates to Wikipedia to show how to consume data from queries. There are two options: using the [Query API](https://www.tinybird.co/docs/docs/api-reference/query-api) , and using API Endpoints via the [Pipes API](https://www.tinybird.co/docs/docs/api-reference/pipe-api/overview) with parameters. The full code for every example in this guide can be found in the notebook. This guide assumes some familiarity with Python. ## Setup¶ Follow the setup steps in the [notebook file](https://github.com/tinybirdco/examples/blob/master/notebook/consume_from_apis.ipynb) and use the linked CSV file of Wikipedia updates to create a new Data Source in your Workspace. For less than 100 MB of data, you can fetch all the data in one call. For calls returning more than 100 MB of data, you need to do it sequentially, with not more than 100 MB per API call. The solution is to get batches using Data Source sorting keys. Selecting the data by columns used in the sorting key keeps it fast. In this example, the Data Source is sorted on the `timestamp` column, so you can use batches of a fixed amount of time. In general, time is a good way to batch.
The functions `fetch_table_streaming_query` and `fetch_table_streaming_endpoint` in the notebook work as generators. They should always be used in a `for` loop or as the input for another generator. You should process each batch as it arrives and discard unwanted fetched data. Only fetch the data you need for the processing. The idea here is not to recreate a Data Source in the notebook, but to process each batch as it arrives and write less data to your DataFrame. ## Fetch data with the Query API¶ This guide uses the [requests library for Python](https://pypi.org/project/requests/) . The SQL query pulls in an hour less of data than the full Data Source. A DataFrame is created from the text part of the response. ##### DataFrame from the query API
# Imports used throughout these examples; `token` is set in the notebook's setup steps
import requests
import pandas as pd
from io import StringIO
from urllib.parse import urlencode

table_name = 'wiki'
host = 'api.tinybird.co'
format = 'CSVWithNames'
time_column = 'toDateTime(timestamp)'
date_end = 'toDateTime(1644754546)'

s = requests.Session()
s.headers['Authorization'] = f'Bearer {token}'
URL = f'https://{host}/v0/sql'
sql = f'select * from {table_name} where {time_column} <= {date_end}'
params = {'q': sql + f" FORMAT {format}"}
r = s.get(f"{URL}?{urlencode(params)}")
df = pd.read_csv(StringIO(r.text))
Replace the Tinybird API hostname/region with the [right API URL region](https://www.tinybird.co/docs/docs/api-reference/overview#regions-and-endpoints) that matches your Workspace. Your Token lives in the Workspace under "Tokens". ## Fetch data from an API Endpoint & parameters¶ This Endpoint Node in the Pipe `endpoint_wiki` selects from the Data Source within a range of dates, using the parameters `date_start` and `date_end`. ##### Endpoint wiki
%
SELECT *
FROM wiki
WHERE timestamp
    BETWEEN toInt64(toDateTime({{String(date_start, '2022-02-13 10:30:00')}}))
    AND toInt64(toDateTime({{String(date_end, '2022-02-13 11:00:00')}}))
These parameters are passed in the call to the API Endpoint to select only the data within the range. A DataFrame is created from the text part of the response. ##### DataFrame from API Endpoint
host = 'api.tinybird.co'
api_endpoint = 'endpoint_wiki'
format = 'csv'
date_start = '2022-02-13 10:30:00'
date_end = '2022-02-13 11:30:00'

s = requests.Session()
s.headers['Authorization'] = f'Bearer {token}'
URL = f'https://{host}/v0/pipes/{api_endpoint}.{format}'
params = {'date_start': date_start, 'date_end': date_end}
r = s.get(f"{URL}?{urlencode(params)}")
df = pd.read_csv(StringIO(r.text))
Replace the Tinybird API hostname/region with the [right API URL region](https://www.tinybird.co/docs/docs/api-reference/overview#regions-and-endpoints) that matches your Workspace. Your Token lives in the Workspace under "Tokens". ## Fetch batches of data using the Query API¶ The function `fetch_table_streaming_query` in the notebook accepts more complex queries than a date range. Here you choose what you filter and sort by. This example reads in batches of 5 minutes to create a small DataFrame, which should then be processed, with the results of the processing appended to the final DataFrame.
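The notebook's helpers aren't reproduced in this guide, so as a rough illustration only (not the notebook's actual implementation), a time-batched generator over the Query API could look something like the sketch below. It assumes a numeric Unix-seconds `timestamp` column, as in the `wiki` Data Source used here.

##### Example: a simplified time-batched generator (illustrative sketch)
from urllib.parse import urlencode
import requests

def query_api_batches(token, table, batch_seconds, start_ts, end_ts,
                      time_column='timestamp', host='api.tinybird.co'):
    """Yield one CSV chunk per fixed-size time window via the Query API."""
    s = requests.Session()
    s.headers['Authorization'] = f'Bearer {token}'
    url = f'https://{host}/v0/sql'
    for batch_start in range(start_ts, end_ts, batch_seconds):
        batch_end = min(batch_start + batch_seconds, end_ts)
        sql = (f"SELECT * FROM {table} "
               f"WHERE {time_column} > {batch_start} AND {time_column} <= {batch_end} "
               "FORMAT CSVWithNames")
        r = s.get(f"{url}?{urlencode({'q': sql})}")
        yield r.text

Each yielded chunk can be read with `pd.read_csv(StringIO(chunk))`, processed, and discarded, which is exactly how the notebook's own functions are used in the examples below.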
<-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fconsume-apis-in-a-notebook-1.png&w=3840&q=75) <-figcaption-> 5-minute batches of data using the index ##### DataFrames from batches returned by the Query API tinybird_stream = fetch_table_streaming_query(token, 'wiki', 60*5, 1644747337, 1644758146, sorting='timestamp', filters="type IN ['edit','new']", time_column="timestamp", host='api.tinybird.co') df_all=pd.DataFrame() for x in tinybird_stream: df_batch = pd.read_csv(StringIO(x)) # TO DO: process batch and discard fetched data df_proc=process_dataframe(df_batch) df_all = df_all.append(df_proc) # Careful: appending dfs means keeping a lot of data in memory ## Fetch batches of data from an API Endpoint & parameters¶ The function `fetch_table_streaming_endpoint` in the notebook sends a call to the API with parameters for the `batch size`, `start` and `end` dates, and, optionally, filters on the `bot` and `server_name` columns. This example reads in batches of 5 minutes to create a small DataFrame, which should then be processed, with the results of the processing appended to the final DataFrame. ‍The API Endpoint `wiki_stream_example` first selects data for the range of dates, then for the batch, and then applies the filters on column values. ##### API Endpoint wiki\_stream\_example % SELECT * from wiki --DATE RANGE WHERE timestamp BETWEEN toUInt64(toDateTime({{String(date_start, '2022-02-13 10:30:00', description="start")}})) AND toUInt64(toDateTime({{String(date_end, '2022-02-13 10:35:00', description="end")}})) --BATCH BEGIN AND timestamp BETWEEN toUInt64(toDateTime({{String(date_start, '2022-02-13 10:30:00', description="start")}}) + interval {{Int16(batch_no, 1, description="batch number")}} * {{Int16(batch_size, 10, description="size of the batch")}} second) --BATCH END AND toUInt64(toDateTime({{String(date_start, '2022-02-13 10:30:00', description="start")}}) + interval ({{Int16(batch_no, 1, description="batch number")}} + 1) * {{Int16(batch_size, 10, description="size of the batch")}} second) --FILTERS {% if defined(bot) %} AND bot = {{String(bot, description="is a bot")}} {% end %} {% if defined(server_name) %} AND server_name = {{String(server_name, description="server")}} {% end %} These parameters are passed in the call to the API Endpoint to select only the data for the batch. A DataFrame is created from the text part of the response. ##### DataFrames from batches from the API Endpoint tinybird_stream = fetch_table_streaming_endpoint(token, 'csv', 60*5, '2022-02-13 10:15:00', '2022-02-13 13:15:00', bot = False, server_name='en.wikipedia.org' ) df_all=pd.DataFrame() for x in tinybird_stream: df_batch = pd.read_csv(StringIO(x)) # TO DO: process batch and discard fetched data df_proc=process_dataframe(df_batch) df_all = df_all.append(df_proc) # Careful: appending dfs means keeping a lot of data in memory ## Next steps¶ - Explore more use cases for Tinybird in the[ Use Case Hub](https://www.tinybird.co/docs/docs/use-cases) . - Looking for other ways to integrate? Try[ consume APIs in a Next.js frontend](https://www.tinybird.co/docs/docs/guides/integrations/consume-apis-nextjs) . --- URL: https://www.tinybird.co/docs/guides/integrations/consume-apis-nextjs Last update: 2024-11-12T11:45:41.000Z Content: --- title: "Consume APIs in a Next.js frontend with JWTs · Tinybird Docs" theme-color: "#171612" description: "In this guide, you'll learn how to generate self-signed JWTs from your backend, and call Tinybird APIs directly from your frontend, using Next.js." 
--- # Consume APIs in a Next.js frontend with JWTs¶ In this guide, you'll learn how to generate self-signed JWTs from your backend, and call Tinybird APIs directly from your frontend, using Next.js. JWTs are signed tokens that allow you to securely authorize and share data between your application and Tinybird. If you want to read more about JWTs, check out the [JWT.io](https://jwt.io/) website. You can view the [live demo](https://guide-nextjs-jwt-auth.vercel.app/) or browse the [GitHub repo (guide-nextjs-jwt-auth)](https://github.com/tinybirdco/guide-nextjs-jwt-auth). ## Prerequisites¶ This guide assumes that you have a Tinybird account, and you are familiar with creating a Tinybird Workspace and pushing resources to it. Make sure you understand the concept of Tinybird's [Static Tokens](https://www.tinybird.co/docs/docs/concepts/auth-tokens#what-should-i-use-tokens-for). You'll need a working familiarity with JWTs, JavaScript, and Next.js. ## Run the demo¶ These steps cover running the GitHub demo locally. [Skip to the next section](https://www.tinybird.co/docs/about:blank#understand-the-code) for a breakdown of the code. ### 1. Clone the GitHub repo¶ Clone the [GitHub repo (guide-nextjs-jwt-auth)](https://github.com/tinybirdco/guide-nextjs-jwt-auth) to your local machine. ### 2. Push Tinybird resources¶ The repo includes two sample Tinybird resources: - `events.datasource` : The Data Source for incoming events. - `top_airlines.pipe` : An API Endpoint giving a list of top 10 airlines by booking volume. Configure the [Tinybird CLI](https://www.tinybird.co/docs/docs/cli/install) and `tb push` the resources to your Workspace. Alternatively, you can drag and drop the files onto the UI to upload them. ### 3. Generate some fake data¶ Use [Mockingbird](https://tbrd.co/mockingbird-nextjs-jwt-demo) to generate fake data for the `events` Data Source. Using this link ^ provides a pre-configured schema, but you will need to enter your Workspace admin Token and Host. When configured, scroll down and select `Start Generating!`. In the Tinybird UI, confirm that the `events` Data Source is successfully receiving data. ### 4. Install dependencies¶ Navigate to the cloned repo and install the dependencies with `npm install`. ### 5. Configure .env¶ First create a new file `.env.local` cp .env.example .env.local Copy your [Tinybird host](https://www.tinybird.co/docs/docs/api-reference/overview#regions-and-endpoints) and admin Token (used as the `TINYBIRD_SIGNING_TOKEN` ) to the `.env.local` file: TINYBIRD_SIGNING_TOKEN="TINYBIRD_SIGNING_TOKEN>" # Use your Admin Token as the signing Token TINYBIRD_WORKSPACE="YOUR_WORKSPACE_ID" # The UUID of your Workspace NEXT_PUBLIC_TINYBIRD_HOST="YOUR_TINYBIRD_API_REGION e.g. https://api.tinybird.co" # Your regional API host Replace the Tinybird API hostname/region with the [right API URL region](https://www.tinybird.co/docs/docs/api-reference/overview#regions-and-endpoints) that matches your Workspace. Your Token lives in the Workspace under "Tokens". ### Run the demo app¶ Run it locally: npm run dev Then open [http://localhost:3000](http://localhost:3000/) with your browser. ## Understand the code¶ This section breaks down the key parts of code from the example. ### .env¶ The `.env` file contains the environment variables used in the application. ##### .env file TINYBIRD_SIGNING_TOKEN="YOUR SIGNING TOKEN" TINYBIRD_WORKSPACE="YOUR WORKSPACE ID" NEXT_PUBLIC_TINYBIRD_HOST="YOUR API HOST e.g. 
https://api.tinybird.co" #### TINYBIRD\_SIGNING\_TOKEN¶ `TINYBIRD_SIGNING_TOKEN` is the token used to sign JWTs. **You must use your admin Token** . It is a shared secret between your application and Tinybird. Your application uses this Token to sign JWTs, and Tinybird uses it to verify the JWTs. It should be kept secret, as exposing it could allow unauthorized access to your Tinybird resources. It is best practice to store this in an environment variable instead of hardcoding it in your application. #### TINYBIRD\_WORKSPACE¶ `TINYBIRD_WORKSPACE` is the ID of your Workspace. It is used to identify the Workspace that the JWT is generated for. The Workspace ID is included inside the JWT payload. Workspace IDs are UUIDs and can be found using the CLI `tb workspace current` command or from the Tinybird UI. #### NEXT\_PUBLIC\_TINYBIRD\_HOST¶ `NEXT_PUBLIC_TINYBIRD_HOST` is the base URL of the Tinybird API. It is used to construct the URL for the Tinybird API Endpoints. You must use the correct URL for [your Tinybird region](https://www.tinybird.co/docs/docs/api-reference/overview#regions-and-endpoints) . The `NEXT_PUBLIC_` prefix is required for Next.js to expose the variable to the client side. ### token.ts¶ The `token.ts` file contains the logic to generate and sign JWTs. It uses the `jsonwebtoken` library to create the Token. ##### token.ts "use server"; import jwt from "jsonwebtoken"; const TINYBIRD_SIGNING_TOKEN = process.env.TINYBIRD_SIGNING_TOKEN ?? ""; const WORKSPACE_ID = process.env.TINYBIRD_WORKSPACE ?? ""; const PIPE_ID = "top_airlines"; export async function generateJWT() { const next10minutes = new Date(); next10minutes.setTime(next10minutes.getTime() + 1000 * 60 * 10); const payload = { workspace_id: WORKSPACE_ID, name: "my_demo_jwt", exp: Math.floor(next10minutes.getTime() / 1000), scopes: [ { type: "PIPES:READ", resource: PIPE_ID, }, ], }; return jwt.sign(payload, TINYBIRD_SIGNING_TOKEN, {noTimestamp: true}); } This code runs on the backend to generate JWTs without exposing secrets to the user. It pulls in the `TINYBIRD_SIGNING_TOKEN` and `WORKSPACE_ID` from the environment variables. As this example only exposes a single API Endpoint ( `top_airlines.pipe` ), the `PIPE_ID` is hardcoded to its deployed ID. If you had multiple API Endpoints, you would need to create an item in the `scopes` array for each one. The `generateJWT` function handles creation of the JWT. A JWT has various [required fields](https://www.tinybird.co/docs/docs/concepts/auth-tokens#jwt-payload). The `exp` field sets the expiration time of the JWT in the form a UTC timestamp. In this case, it is set to 10 minutes in the future. You can adjust this value to suit your needs. The `name` field is a human-readable name for the JWT. This value is only used for logging. The `scopes` field defines what the JWT can access. This is an array, which allows you create one JWT that can access multiple API Endpoints. In this case, we only have one API Endpoint. Under `scopes` , the `type` field is always `PIPES:READ` for reading data from a Pipe. The `resource` field is the ID or name of the Pipe you want to access. If required, you can also add `fixed_parameters` here to supply parameters to the API Endpoint. Finally, the payload is signed using the `jsonwebtoken` library and the `TINYBIRD_SIGNING_TOKEN`. ### useFetch.tsx¶ The `useFetch.tsx` file contains a custom React hook that fetches data from the Tinybird API using a JWT. It also handles refreshing the token if it expires. 
##### useFetch.tsx import { generateJWT } from "@/server/token"; import { useState } from "react"; export function useFetcher() { const [token, setToken] = useState(""); const refreshToken = async () => { const newToken = await generateJWT(); setToken(newToken); return newToken; }; return async (url: string) => { let currentToken = token; if (!currentToken) { currentToken = await refreshToken(); } const response = await fetch(url + "?token=" + currentToken); if (response.status === 200) { return response.json(); } if (response.status === 403) { const newToken = await refreshToken(); return fetch(url + "?token=" + newToken).then((res) => res.json()); } }; } This code runs on the client side and is used to fetch data from the Tinybird API. It uses the `generateJWT` function from the `token.ts` file to get a JWT. The JWT is stored in the `token` state. Most importantly, it uses the standard `fetch` API to make requests to the Tinybird API. The JWT is passed as a `token` query parameter in the URL. If the request returns a `403` status code, the hook then calls `refreshToken` to get a new JWT and retries the request. However, note that this is a simple implementation and there are other reasons why a request might fail with a `403` status code (e.g., the JWT is invalid, the API Endpoint has been removed, etc.). ### page.tsx¶ The `page.tsx` file contains the main logic for the Next.js page. It is responsible for initiating the call to the Tinybird API Endpoints and rendering the data into a chart. ##### page.tsx "use client"; import { BarChart, Card, Subtitle, Text, Title } from "@tremor/react"; import useSWR from "swr"; import { getEndpointUrl } from "@/utils"; import { useFetcher } from "@/hooks/useFetch"; const REFRESH_INTERVAL_IN_MILLISECONDS = 5000; // five seconds export default function Dashboard() { const endpointUrl = getEndpointUrl(); const fetcher = useFetcher(); let top_airline, latency, errorMessage; const { data } = useSWR(endpointUrl, fetcher, { refreshInterval: REFRESH_INTERVAL_IN_MILLISECONDS, onError: (error) => (errorMessage = error), }); if (!data) return; if (data?.error) { errorMessage = data.error; return; } top_airline = data.data; latency = data.statistics?.elapsed; return ( Top airlines by bookings Ranked from highest to lowest {top_airline && ( )} {latency && Latency: {latency * 1000} ms} {errorMessage && (

Oops, something happens: {errorMessage}

Check your console for more information

)}
); } It uses [SWR](https://swr.vercel.app/) and the `useFetcher` hook from [useFetch.tsx](https://www.tinybird.co/docs/about:blank#usefetch-tsx) to fetch data from the Tinybird API. When the API Endpoint returns data, it is rendered as bar chart using the `BarChart` component from the `@tremor/react` library. ## Next steps¶ - Read the[ blog post on JWTs](https://www.tinybird.co/blog-posts/jwt-api-endpoints-public-beta) . - Explore more use cases that use this approach, like[ building a real-time, user-facing dashboard](https://www.tinybird.co/docs/docs/use-cases/user-facing-dashboards) . --- URL: https://www.tinybird.co/docs/guides/integrations/integrating-vercel Last update: 2024-07-31T16:54:46.000Z Content: --- title: "Vercel Integration · Tinybird Docs" theme-color: "#171612" description: "In this guide you'll learn how to integrate your Vercel Project with a Tinybird Workspace and sync your Tokens." --- # Integrating Vercel with Tinybird¶ This integration will allow you to link your Tinybird Workspaces with your Vercel projects to sync Tinybird [Static Tokens](https://www.tinybird.co/docs/docs/concepts/auth-tokens#what-should-i-use-tokens-for) into Vercel Environment Variables. This integration makes it easy to use Tinybird as a purpose-built analytics backend from within the Vercel Marketplace. Build any kind of analytics, be it web analytics, structured logging, telemetry, anomaly detection, personalization or anything else you can think of - and never have to worry about infrastructure, scale or security. ## Add the Tinybird integration¶ There's basically two stages to this process - adding Tinybird to your Vercel account to set the overall scope of the integration, and then connecting specific Vercel Projects to Tinybird Workspaces. To get started, you can either search for Tinybird in the Vercel integration marketplace, which looks like this: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-integrating-vercel-1.png&w=3840&q=75) Or you can just use this [link](https://vercel.com/integrations/tinybird/new) to go straight to creating a new Tinybird integration. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-integrating-vercel-2.png&w=3840&q=75) Either way, click **Add integration** to get started and then select your target Vercel account. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-integrating-vercel-3.png&w=3840&q=75) Next, select the scope of which Projects the Tinybird integration will be added to - this controls which Projects will be presented as options to connect to a Tinybird Workspace. We're selecting All Projects here, but you can limit it to a specific one if you want. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-integrating-vercel-4.png&w=3840&q=75) Finally in this section, review and approve the required permissions to allow this integration to function. We need to be able to read Project and Account information, and manage the environment variables in the projects in order to provide Tokens and URLS for connectivity. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-integrating-vercel-5.png&w=3840&q=75) Great! The integration workflow will launch straight into connecting a Project to a Workspace, but you can come back to it later if you want via the Integrations dashboard in your Vercel Account ## Integrate a Vercel project¶ You can set up a new Workspace integration from the Vercel Account integrations control panel by selecting the Tinybird integration, and then the **Configure** button. 
Note that you'll already be in this workflow the first time you add the integration to the account. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-integrating-vercel-6.png&w=3840&q=75) Once in this control panel, you'll be able to review any integrations present, for now let's add our first one. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-integrating-vercel-7.png&w=3840&q=75) First in this workflow, select which Tinybird region you want to connect to <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-integrating-vercel-8.png&w=3840&q=75) Next, select at least one Vercel Project to include - they'll all get the Environment variables added <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-integrating-vercel-9.png&w=3840&q=75) Now we'll need a Tinybird Workspace to integrate with - you can either pick one you already have, or we can create a shiny new one by popping out into the Tinybird UI. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-integrating-vercel-10.png&w=3840&q=75) Now to select which Static Tokens you want to have sync'd into the Vercel Projects as Environment Variables. If you don't see the Token you want in the list, you can pop out into the Tinybird UI and create one with the scope you want. If this is your first time, you can just use the default admin Token to get started with and reduce the scope later. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-integrating-vercel-11.png&w=3840&q=75) And that's it! Now your Project(s) will have these Tokens available ready to use. Note that the Environment variables follow a convenient naming format `TB__` In your Tinybird UI, at the bottom of the main dropdown menu, you'll see a new option for Integrations - You can review your various integrations here. If you want to update and Integration, rerun the workflow by clicking the **Add project** button. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-integrating-vercel-14.png&w=3840&q=75) And if you head to your Tokens page within the Tinybird UI, you'll see a handy notification if that Token is being actively synchronized. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-integrating-vercel-12.png&w=3840&q=75) ## Removing the integration¶ It's cool, we can just be friends. To remove Tinybird from your Vercel Account, go to the [Integrations management page](https://vercel.com/dashboard/integrations) . At the bottom you'll see the Remove integration button. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-integrating-vercel-13.png&w=3840&q=75) This will remove the Integration and environment variables, but note that your Tinybird Workspaces and Tokens will not be removed. ## BYO Web Analytics¶ Well now you might be thinking, what can I do with this shiny new integration? You're probably familiar with the built-in Vercel analytics - we've made a starter kit to get you going on expanding into your own custom analysis of your site traffic. Give a go [here!](https://www.tinybird.co/docs/docs/starter-kits/web-analytics) --- URL: https://www.tinybird.co/docs/guides/integrations/team-integration-governance Last update: 2024-07-31T16:54:46.000Z Content: --- title: "Team integration and data governance · Tinybird Guides · Tinybird Docs" theme-color: "#171612" description: "In this guide you'll learn about how different teams work with Tinybird, and how we support your team to manage data." 
--- # Team integration and data governance¶ In this guide you'll learn about how different teams in a single organization usually work with Tinybird, and how data is best managed and governed. Tinybird supports a wide range of industries and products - and this spectrum expands every day. Our customers organize themselves and their businesses in different ways, but there are over-arching principles you can adopt and adapt. Knowing how to integrate your team with Tinybird (and vice versa) is important to getting the most value out of the platform. ## Foundational concepts¶ ### What you'll need for this guide¶ You don't need an active Workspace, you just to be familiar with the following [Tinybird concepts](https://www.tinybird.co/docs/docs/core-concepts): - Data Source: Where data is ingested and stored - Pipe: How data is transformed - Workspace: How data projects are organized, containing Data Sources and Pipes - Shared Data Source: A Data Source shared between Workspaces - Roles: Each Workspace has "Admin", "Guest", "Viewer" roles - Organizations: Tinybird Enterprise customers with multiple Workspaces can view/monitor/manage them in their Organization Bringing it all together: An Organization has multiple Workspaces. Each Workspace ingests data from a Data Source/Sources, and each Data Source can provide data to multiple Workspaces. Within a Workspace, after the data is ingested it gets transformed by Pipes using SQL logic. Individual members of each Workspace are assigned roles, managed at the Organization level, that give them different levels of access to the data. ### What Tinybird is (and isn't)¶ Tinybird is not a data governance tool, but - on top of all the other cool things it does - it provides a way to **manage who uses the data** . This also helps ensure data quality and enables visibility about everything that’s happening at the Organization level. ## Roles and responsibilities¶ It's not possible to provide a single, clear-cut "Data Engineers do X, and Software Devs do Y" - it really depends on the company structure, preferences, and Tinybird use case. There is no single solution, and sometimes it can be a controversial topic. The good news, though, is that there are a lot of possibilities and your use case is likely to be fairly straightforward. We know, because we have customers who do it every day! (If it turns out to be unusual, don't worry: We have an amazing Customer Support team to help you iron out any issues and get you going *fast* 🚀). ### Responsibilities: Who does what?¶ It's not really about job roles or professional personas: It's about what the **different responsibilities** are. Let's step back for a moment. Tinybird allows you to **share Data Sources across Workspaces** . This means you can create Workspaces that map your organization, and not have to duplicate the Data Sources. In general, most Tinybird users have an ingestion Workspace where the owners of the data ingest the data, clean the data, and prepare it for onward consumption. They then share these ready-to-use Data Sources with other internal teams using the Tinybird feature [Sharing Data Sources](https://www.tinybird.co/docs/docs/concepts/workspaces#sharing-data-sources-between-workspaces). You can have as many ingestion Workspaces as you need; bigger organizations group their Workspaces by domain, some organizations group them by team. 
For instance, in a bank, you'd find different teams managing their own data and therefore several ingestion Workspaces where data is curated and exposed to other teams. In this case, each team maps to a domain: <-figure-> ![Diagram showing each ingestion team mapping to a specific domain](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fguide-team-integration-governance-team-1.png&w=3840&q=75) However, in other companies where data is centralized and managed by "the data platform", you'd find just one ingestion Workspace. Here, all the data is ingested and shared with other onward Workspaces where specific domain users build their own use cases: <-figure-> ![Diagram showing each ingestion team mapping to a specific domain](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fguide-team-integration-governance-team-2.png&w=3840&q=75) Some organizations rely on a hybrid solution, where certain data is provided by the data platform but each domain group also ingest their own data: <-figure-> ![Diagram showing each ingestion team mapping to a specific domain](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fguide-team-integration-governance-team-3.png&w=3840&q=75) Whatever your approach, it's an established pattern to have an ingestion or "data-platform" Workspace or team who own ingestion and data preparation, and share the desired Data Sources with other teams. These downstream, domain-specific teams then create the Pipe logic specific to their own area (usually in a Workspace specifically for that domain). That way, the responsibilities reflect a manageable, clear separation of concerns. ## Enforcing data governance¶ No matter which industry you're in, good data governance is essential. Tinybird isn't a data governance tool, but is built to support you in the following ways: - ** Availability** is assured by the platform's uptime SLA (see[ Terms and conditions](https://www.tinybird.co/terms-and-conditions) ). Tinybird wants all your teams to be able to access all the data they need, any time they need. You can monitor availability using our monitoring tools (see[ Monitoring](https://www.tinybird.co/docs/about:blank#monitoring) below), which includes but is not limited to monitoring ingestion, API Endpoints, and quarantined rows. Compared to other tools, Tinybird offers a really straightforward way to reingest quarantined rows and maintain Materialized Views to automatically reingest data, which we know is a problem for users wanting to maintain maximum data availability. - ** Control** over data access is managed through a single Organization page in Tinybird. You can enforce the principle of least privilege by assigning different roles to Workspace members, and easily check data quality and consumption using Tinybird's Service Data Sources. Tinybird also supports schema evolution and you can[ keep multiple schema versions running](https://www.tinybird.co/docs/docs/guides/ingesting-data/iterate-a-data-source) at the same time so consumers can adjust at their own pace. - ** Usability** is maximized by having ingestion Workspaces. They allow you to share cleaned, curated data, with specific and adjusted schemas giving consumers precisely what they need. Workspace members have the flexibility to create as many Workspaces as they need, and[ use the Playground feature](https://www.tinybird.co/docs/docs/query/overview#use-the-playground) to sandbox new ideas. 
- ** Consistency** is a similar topic: Data owners have responsibility over what they want to share with others. We cannot block teams from ingesting in their own Workspaces but the Organization has the tools to monitor who (which Workspace) is ingesting data. - ** Data integrity/quality** , especially at scale and at speed, is simply essential. Just like availability, it's a perfect use case for leveraging Tinybird's monitoring capabilities (see[ Additional ecosystem tools](https://www.tinybird.co/docs/about:blank#additional-ecosystem-tools) below). Ingestion teams can build Pipes to monitor everything about their inbound data and create alerts. These alerts can be technical or business-related - see the[ Monitor your ingestion](https://www.tinybird.co/docs/docs/guides/monitoring/monitor-your-ingestion) guide for an example. - ** Data security** : This information is available at the top-level Organizations page and also in individual Workspaces. ## Additional ecosystem tools¶ Check the [limits page](https://www.tinybird.co/docs/docs/support/limits) for limits on ingestion, queries, API Endpoints, and more. Tinybird is built around the idea of data that changes or grows continuously. We provide additional tools (as standard) as part of Tinybird - more "process governance" tools than data governance, but still very useful. They help you get insights on and monitor your Workspaces, data, and resources. ### Operations log¶ The [Operations log](https://www.tinybird.co/docs/docs/monitoring/health-checks#operations-log) shows information on each **individual Data Source** : Its size, the number of rows, the number of rows in the quarantine Data Source (if any) and when it was last updated. The Operations log contains details of the events for the Data Source, which are displayed as the results of the query. It allows you to see every single call made to the system (API call, ingestion, jobs). This is really helpful if you're concerned about one specific Data Source, and you want to see under the hood. ### Monitoring¶ Enterprise customers can also use the [Organizations UI](https://www.tinybird.co/docs/docs/monitoring/organizations) for managing Workspaces and Members, and monitoring their entire consolidated Tinybird consumption in one place. For example, you can track costs and usage for each individual Workspace. ### Testing¶ To ensure that all production load is efficient and accurate, all Tinybird API Endpoints that you create in your Workspaces can be tested before going to Production. You can do this by [using version control](https://www.tinybird.co/docs/docs/production/overview) - just like how you manage your codebase. ### Alerts and health checks¶ To ensure everything is working as expected once you're in production, any team can create alerts and health checks on top of Tinybird's [Service Data Sources](https://www.tinybird.co/docs/docs/monitoring/service-datasources). ## Next steps¶ - Watch this video to understand how Factorial sets "domain boundaries" and[ organizes their teams](https://youtu.be/8rctUKRXcdw?t=574) . - Build something fast and fun - follow the[ quick start tutorial](https://www.tinybird.co/docs/docs/quick-start) . --- URL: https://www.tinybird.co/docs/guides/migrations/migrate-from-doublecloud Last update: 2024-11-08T07:59:24.000Z Content: --- title: "Migrate from DoubleCloud · Tinybird Docs" theme-color: "#171612" description: "In this guide, you'll learn how to migrate from DoubleCloud to Tinybird, and the overview of how to quickly & safely recreate your setup." 
--- # Migrate from DoubleCloud¶ In this guide, you'll learn how to migrate from DoubleCloud to Tinybird, with an overview of how to quickly and safely recreate your setup. DoubleCloud, a managed data services platform that offers ClickHouse® as a service, is [shutting down operations](https://double.cloud/blog/posts/2024/10/doublecloud-final-update/) . As of October 1, 2024, you cannot create new DoubleCloud accounts, and all existing DoubleCloud services must be migrated by March 1, 2025. Tinybird offers a [managed ClickHouse](https://www.tinybird.co/clickhouse) solution that can be a suitable alternative for existing users of DoubleCloud's ClickHouse service. Follow this guide to learn two approaches for migrating data from your DoubleCloud instance to Tinybird: 1. Option 1: Use the S3 table function to export data from DoubleCloud Managed ClickHouse to Amazon S3, then use the Tinybird S3 Connector to import data from S3. 2. Option 2: Export your ClickHouse tables locally, then import the files into Tinybird using the Datasources API. Wondering how to create a Tinybird account? It's free! [Start here](https://www.tinybird.co/signup) . Need DoubleCloud migration assistance? Please [contact us](https://www.tinybird.co/doublecloud). ## Prerequisites¶ You don't need an active Tinybird Workspace to read through this guide, but it's a good idea to understand the foundational concepts and how Tinybird integrates with your team. If you're new to Tinybird, read the [team integration guide](https://www.tinybird.co/docs/guides/integrations/team-integration-governance). ## At a high level¶ Tinybird is a great alternative to DoubleCloud's managed ClickHouse implementation. Tinybird is a data platform built on top of ClickHouse for data and engineering teams to solve complex real-time, operational, and user-facing analytics use cases at any scale, with end-to-end latency in milliseconds for streaming ingest and high QPS workloads. It offers the same or comparable ClickHouse performance as DoubleCloud, with additional features such as native, managed ingest connectors, multi-node SQL notebooks, and scalable REST APIs for public use or secured with JWTs. Tinybird is a managed platform that scales transparently, requiring no cluster operations, shard management, or worrying about replicas. See how Tinybird is used by industry-leading companies today in the [Customer Stories](https://www.tinybird.co/customer-stories). ## Migrate from DoubleCloud to Tinybird using Amazon S3¶ In this approach, you'll use the `s3` table function in ClickHouse to export tables to an Amazon S3 bucket, and then import them into Tinybird with the S3 Connector. This guide assumes that you already have IAM Roles with the necessary permissions to write to (from DoubleCloud) and read from (to Tinybird) the S3 bucket. ### Export your table to Amazon S3¶ In this guide, we're using a table on our DoubleCloud ClickHouse Cluster called `timeseriesdata` . The data has 3 columns and 1M rows. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fmigrate-from-doublecloud-1.png&w=3840&q=75) <-figcaption-> Example timeseries data table in DoubleCloud You can export data in your DoubleCloud ClickHouse tables to Amazon S3 with the `s3` table function.
Note: If you don't want to expose your AWS credentials in the query, use a [named collection](https://double.cloud/docs/en/managed-clickhouse/integrations/s3#use-the-s3-table-function).
INSERT INTO FUNCTION s3(
    'https://tmp-doublecloud-migration.s3.us-east-1.amazonaws.com/exports/timeseriesdata.csv',
    'AWS_ACCESS_KEY_ID',
    'AWS_SECRET_ACCESS_KEY',
    'CSV'
)
SELECT * FROM timeseriesdata
SETTINGS s3_create_new_file_on_insert = 1
### Import to Tinybird with the S3 Connector¶ Once your table is exported to Amazon S3, import it to Tinybird using the [Amazon S3 Connector](https://www.tinybird.co/docs/docs/ingest/s3). <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fmigrate-from-doublecloud-2.png&w=3840&q=75) <-figcaption-> Select the S3 Connector in the Tinybird UI. The basic steps for using the S3 Connector are: 1. Define an S3 Connection with an IAM Policy and Role that allow Tinybird to read from S3. Tinybird will automatically generate the JSON for these policies. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fmigrate-from-doublecloud-3.png&w=3840&q=75) <-figcaption-> Create an S3 Connection with automatically generated IAM policies 2. Supply the file URI (with wildcards as necessary) to define the file(s) containing the contents of your ClickHouse table(s). <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fmigrate-from-doublecloud-4.png&w=3840&q=75) <-figcaption-> Specify the file URI for the files containing your ClickHouse tables 3. Create an On Demand (one-time) sync. 4. Define the schema of the resulting table in Tinybird. You can do this within the S3 Connector UI… <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fmigrate-from-doublecloud-5.png&w=3840&q=75) <-figcaption-> Define the schema of your tables within the Tinybird UI ...or by creating a .datasource file and pushing it to Tinybird. An example .datasource file for the `timeseriesdata` table to match the DoubleCloud schema and create the import job from the existing S3 Connection would look like this:
SCHEMA >
    `tank_id` String,
    `volume` Float32,
    `usage` Float32

ENGINE "MergeTree"
ENGINE_SORTING_KEY "tank_id"

IMPORT_SERVICE 's3_iamrole'
IMPORT_CONNECTION_NAME 'DoubleCloudS3'
IMPORT_BUCKET_URI 's3://tmp-doublecloud-migration/timeseriesdata.csv'
IMPORT_STRATEGY 'append'
IMPORT_SCHEDULE '@on-demand'
5. Tinybird will then create and run a batch import job to ingest the data from Amazon S3 and create a new ClickHouse table that matches your table in DoubleCloud. You can monitor the job from the `datasource_ops_log` Service Data Source. ## Migrate from DoubleCloud to Tinybird using local exports¶ Depending on the size of your tables, you might be able to simply export your tables to a local file using `clickhouse-client` and ingest them to Tinybird directly. ### Export your tables from DoubleCloud using clickhouse-client¶ First, use `clickhouse-client` to export your tables into local files. Depending on the size of your data, you can choose to compress as necessary. Tinybird can ingest CSV (including Gzipped CSV), NDJSON, and Parquet files.
./clickhouse client --host your_doublecloud_host --port 9440 --secure --user your_doublecloud_user --password your_doublecloud_password --query "SELECT * FROM timeseriesdata" --format CSV > timeseriesdata.csv ### Import your files to Tinybird¶ You can drag and drop files into the Tinybird UI… <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fmigrate-from-doublecloud-6.png&w=3840&q=75) <-figcaption-> Drag and drop a file into the Tinybird UI to create a file-based Data Source or upload them using the Tinybird CLI: tb datasource generate timeseriesdata.csv tb push datasources/timeseriesdata.datasource tb datasource append timeseriesdata timeseriesdata.csv Note that Tinybird will automatically infer the appropriate schema from the supplied file, but you may need to change the column names, data types, table engine, and sorting key to match your table in DoubleCloud. ## Migration support¶ If your migration is more complex, involving many or very large tables, materialized views + populates, or other complex logic, please [contact us](https://www.tinybird.co/doublecloud) and we will assist with your migration. ## Tinybird Pricing vs DoubleCloud¶ Tinybird's Build plan is free, with no time limit or credit card required. The Build Plan includes 10 GB of data storage (compressed) and 1,000 published API requests per day. Tinybird's paid plans are available with both infrastructure-based pricing and usage-based pricing. DoubleCloud customers will likely be more familiar with infrastructure-based pricing. For more information about infrastructure-based pricing and to get a quote based on your existing DoubleCloud cluster, please [contact us](https://www.tinybird.co/doublecloud). If you are interested in usage-based pricing, you can learn more about [usage-based billing here](https://www.tinybird.co/docs/docs/support/billing). ### ClickHouse Limits¶ Note that Tinybird takes a different approach to ClickHouse deployment than DoubleCloud. Rather than provide a full interface to a hosted ClickHouse cluster, Tinybird provides a serverless ClickHouse implementation and abstracts the database interface via our [APIs](https://www.tinybird.co/docs/docs/api-reference/overview) , UI, and CLI, only exposing the SQL Editor within our [Pipes](https://www.tinybird.co/docs/docs/concepts/pipes) interface. Additionally, not all ClickHouse SQL functions, data types, and table engines are supported out of the box. You can find a full list of [supported engines and settings here](https://www.tinybird.co/docs/docs/concepts/data-sources#supported-engines-settings) . If your use case requires engines or settings that are not listed, please [contact us](https://www.tinybird.co/doublecloud). ## Useful resources¶ Migrating to a new tool, especially at speed, can be challenging. Here are some helpful resources to get started on Tinybird: - [ Read how Tinybird compares to ClickHouse (especially ClickHouse Cloud)](https://www.tinybird.co/blog-posts/tinybird-vs-clickhouse) . - [ Read how Tinybird compares to other Managed ClickHouse offerings](https://www.tinybird.co/blog-posts/managed-clickhouse-options) . - Join our[ Slack Community](https://www.tinybird.co/community) for help understanding Tinybird concepts. - [ Contact us](https://www.tinybird.co/doublecloud) for migration assistance. ## Next steps¶ If you’d like assistance with your migration, please [contact us](https://www.tinybird.co/doublecloud). - Set up a free Tinybird account and build a working prototype:[ Sign up here](https://www.tinybird.co/signup) . 
- Run through a quick example with your free account: Tinybird[ quick start](https://www.tinybird.co/docs/docs/quick-start) . - Read the[ billing docs](https://www.tinybird.co/docs/docs/support/billing) to understand plans and pricing on Tinybird. Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/guides/migrations/migrate-from-postgres Last update: 2024-11-13T15:42:17.000Z Content: --- title: "Migrate from Postgres · Tinybird Docs" theme-color: "#171612" description: "In this guide, you'll learn how to migrate events from Postgres to Tinybird so that you can begin building performant, real-time analytics over your event data." --- # Migrate from Postgres¶ In this guide, you'll learn how to migrate events from Postgres to Tinybird so that you can begin building performant, real-time analytics over your event data. Need to create a Tinybird account? It's free! [Start here](https://www.tinybird.co/signup). ## Prerequisites¶ You'll need a [free Tinybird account](https://www.tinybird.co/signup) and a Workspace. ## At a high level¶ Postgres is an incredible general purpose database, and it can even be extended to support columnar functionality for analytics. That said, Tinybird can be a great alternative to Postgres extensions for a few reasons: - It uses ClickHouse® as its underlying database, which is one of the fastest real-time analytics databases in the world. - It provides additional services on top of the database - like an integrated API backend, ingestion load balancing, and native connectors - that will keep you from having to spin up additional services and infrastructure for your analytics service. Tinybird is a data platform for data and engineering teams to solve complex real-time, operational, and user-facing analytics use cases at any scale, with end-to-end latency in milliseconds for streaming ingest and high QPS workloads. It's a SQL-first analytics engine, purpose-built for the cloud, with real-time data ingest and full JOIN support. Native, managed ingest connectors make it easy to ingest data from a variety of sources. SQL queries can be published as production-grade, scalable REST APIs for public use or secured with JWTs. Tinybird is a managed platform that scales transparently, requiring no cluster operations, shard management, or worrying about replicas. See how Tinybird is used by industry-leading companies today in the [Customer Stories](https://www.tinybird.co/customer-stories) hub. ## Follow these steps to migrate from Postgres to Tinybird¶ Below you'll find an example walkthrough migrating 100M rows of events data from Postgres to Tinybird. You can apply the same workflow to your existing Postgres instance. If at any point you get stuck and would like assistance with your migration, contact Tinybird at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) or in the [Slack Community](https://www.tinybird.co/docs/docs/community). 
### The Postgres table¶ Suppose you have a table in Postgres that looks like this:
postgres=# CREATE TABLE events (
    id SERIAL PRIMARY KEY,
    timestamp TIMESTAMPTZ NOT NULL,
    user_id TEXT NOT NULL,
    session_id TEXT NOT NULL,
    action TEXT NOT NULL,
    version TEXT NOT NULL,
    payload TEXT NOT NULL
);
The table contains 100 million rows totalling about 15 GB of data:
postgres=# SELECT pg_size_pretty(pg_relation_size('events')) AS size;
 size
-------
 15 GB
(1 row)
The table stores website click events, including an unstructured JSON `payload` column. ### Setup¶ Within your Postgres, create a user with read-only permissions over the table (or tables) you need to export:
postgres=# CREATE USER tb_read_user WITH PASSWORD '';
postgres=# GRANT CONNECT ON DATABASE test_db TO tb_read_user;
postgres=# GRANT USAGE ON SCHEMA public TO tb_read_user;
postgres=# GRANT SELECT ON TABLE events TO tb_read_user;
#### Limits¶ To perform this migration, we'll be running a series of [Copy Jobs](https://www.tinybird.co/docs/publish/copy-pipes) to incrementally migrate the events from Postgres to Tinybird. We break it up into chunks so as to remain under the limits of both Tinybird and Postgres. There are two limits to take into account: 1. [ Copy Pipe limits](https://www.tinybird.co/docs/support/limits#copy-pipe-limits) : Copy Pipes have a default max execution time of 20s for Build plans, 30s for Pro plans, and 30m for Enterprise plans. If you're on a Free or Pro plan and need to temporarily extend your limits to perform the migration, please reach out to us at[ support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) . 2. The max execution time of queries in Postgres. This is controlled by the `statement_timeout` setting. We recommend that you set this value in Postgres equal to (or similar to) the max execution time of the Copy Pipe in Tinybird. For this example, we'll use three minutes:
postgres=# ALTER ROLE tb_read_user SET statement_timeout = '180000'; -- 3 minutes
#### Create a local Tinybird project¶ Install [Tinybird CLI](https://www.tinybird.co/docs/docs/cli/install) , then create a new Data Project:
export TB_ADMIN_TOKEN=
export TB_HOST=https://api.us-east.aws.tinybird.co #replace with your host
tb auth --host $TB_HOST --token $TB_ADMIN_TOKEN
tb init
Create the target Data Source in Tinybird:
touch datasources/events.datasource
Define a schema that matches your Postgres schema, keeping in mind that Tinybird may use different data types. For our example:
# datasources/events.datasource
SCHEMA >
    `id` Int32,
    `timestamp` DateTime64(6),
    `user_id` String,
    `session_id` String,
    `action` String,
    `version` String,
    `payload` String

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYear(timestamp)"
ENGINE_SORTING_KEY "timestamp, session_id, user_id"
Push the Data Source to the Tinybird server:
tb push datasources/events.datasource
### Backfilling your existing Postgres data¶ We're going to create a parameterized Copy Pipe to perform the initial backfill in chunks. We'll use a script to run the Copy Job on demand. #### Storing secrets in Tinybird¶ Start by adding two secrets to Tinybird using the [Environment Variables API](https://www.tinybird.co/docs/docs/api-reference/environment-variables-api) . This will prevent hard-coded credentials in your Copy Pipe.
Create one for your Postgres username: curl \ -X POST "${TB_HOST}/v0/variables" \ -H "Authorization: Bearer ${TB_ADMIN_TOKEN}" \ -d "type=secret" \ -d "name=tb_read_user" \ -d "value=tb_read_user" And one for the password: curl \ -X POST "${TB_HOST}/v0/variables" \ -H "Authorization: Bearer ${TB_ADMIN_TOKEN}" \ -d "type=secret" \ -d "name=tb_read_password" \ -d "value=" #### Define the Copy Pipe¶ Create a new Pipe: touch pipes/backfill_postgres.pipe And paste the following code, changing the url/port, name, and table name of your Postgres based on your specific setup: NODE migrate SQL > % SELECT * FROM postgresql( 'https://your.postgres.url::port', 'your_postgres_instance_name', 'your_postgres_table name', {{tb_secret('tb_read_user')}}, {{tb_secret('tb_read_password')}}, 'public' ) WHERE timestamp > {{DateTime(from_date, '2020-01-01 00:00:00')}} --adjust based on your data AND timestamp <= {{DateTime(to_date, '2020-01-01 00:00:01')}} --use a small default range TYPE COPY TARGET_DATASOURCE events This uses the [PostgreSQL Table Function](https://www.tinybird.co/docs/docs/ingest/postgresql) to select data from the remote Postgres table. It pushes the timestamp filters down to Postgres, incrementally querying your Postgres table and copying them into your `events` Data Source in Tinybird. Push this Pipe to the server: tb push pipes/backfill_postgres.pipe ### Backfill in one go¶ Depending on the size of your Postgres table, you may be able to perform the migration in a single Copy Job. For example, get the minimum timestamp from Postgres (and the current datetime): postgres=# SELECT min(timestamp) FROM events; min ------------------------ 2023-01-01 00:00:00+00 (1 row) ❯ date -u +"%Y-%m-%d %H:%M:%S" 2024-08-29 10:20:57 And run the Copy Job with those parameters: tb pipe copy run migrate_pg_to_events --param from_date="2023-01-01 00:00:00" --param to_date="2024-08-29 10:20:57" --wait --yes If it succeeds, you'll see something like this: ** Running migrate_pg_to_events ** Copy to 'events' job created: https://api.us-east.aws.tinybird.co/v0/jobs/4dd482f9-168b-44f7-a4c9-d1b64fc9665d ** Copying data [████████████████████████████████████] 100% ** Data copied to 'events' And you'll be able to query the resulting Data Source: tb sql "select count() from events" ------------- | count() | ------------- | 100000000 | ------------- tb sql "select count() as c, action from events group by action order by c asc" --stats ** Query took 0.228730096 seconds ** Rows read: 100,000,000 ** Bytes read: 1.48 GB ----------------------- | c | action | ----------------------- | 19996881 | logout | | 19997421 | signup | | 20000982 | purchase | | 20001649 | view | | 20003067 | click | ----------------------- Note that Copy operations in Tinybird are atomic, so a bulk backfill will either succeed or fail completely with some error. For instance, if the `statement_timeout` in Postgres is not large enough to export the table with a single query, you'll get an error like this: ** Copy to 'copy_migrate_events_from_pg' job created: https://api.us-east.aws.tinybird.co/v0/jobs/ec58749a-f4c3-4302-9236-f8036f0cb67b ** Copying data Error: ** Failed creating copy job: ** Error while running job: There was a problem while copying data: [Error] Query cancelled due to statement timeout in postgres. Make sure you use a user with a proper statement timeout to run this type of query. In this case you can try to increaste the `statement_timeout` or try the backfilling in chunks. 
As a reference, copying 100M rows from Postgres to Tinybird takes about 150s if Postgres and Tinybird are in the same cloud and region. The Tinybird PostgreSQL Table Function internally uses a PostgreSQL `COPY TO` statement. You can tweak some other settings in Postgres if necessary, but it's usually not needed; refer to your Postgres provider or admin if in doubt. ### Backfilling in chunks¶ If you find that you're hitting the limits of either your Postgres or Tinybird's Copy Pipes, you can backfill in chunks. First of all, make sure your Postgres table is indexed by the column you are filtering on, in this case `timestamp`: postgres=# CREATE INDEX idx_events_timestamp ON events (timestamp); postgres=# VACUUM ANALYZE events; And make sure a query like the one sent from Tinybird will use the indexes (see the Index Scan below): postgres=# explain select * from events where timestamp > '2024-01-01 00:00:00' and timestamp <= '2024-01-02 00:00:00'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------ Index Scan using idx_events_timestamp on events (cost=0.57..607150.89 rows=151690 width=115) Index Cond: (("timestamp" > '2024-01-01 00:00:00+00'::timestamp with time zone) AND ("timestamp" <= '2024-01-02 00:00:00+00'::timestamp with time zone)) JIT: Functions: 2 Options: Inlining true, Optimization true, Expressions true, Deforming true (5 rows) Then run multiple Copy Jobs, adjusting the amount of data copied to stay within your Postgres statement timeout and Tinybird max execution time. This is a trial-and-error process that depends on the granularity of your data. For example, here's a migration script that first tries a full backfill, and if it fails uses daily chunks:

#!/bin/bash

HOST="YOUR_TB_HOST"
TOKEN="YOUR_TB_TOKEN"
PIPE_NAME="backfill_postgres"
FROM_DATE="2023-01-01 00:00:00"
TO_DATE="2024-08-31 00:00:00"
LOG_FILE="pipe_copy.log"

run_command() {
    local from_date="$1"
    local to_date="$2"
    echo "Copying from $from_date to $to_date" | tee -a $LOG_FILE
    if output=$(tb --host $HOST --token $TOKEN pipe copy run $PIPE_NAME --param from_date="$from_date" --param to_date="$to_date" --wait --yes 2>&1); then
        echo "Success $from_date - $to_date" | tee -a $LOG_FILE
        return 0
    else
        echo "Error $from_date - $to_date" | tee -a $LOG_FILE
        echo "Error detail: $output" | tee -a $LOG_FILE
        return 1
    fi
}

iterate_chunks() {
    local from_date="$1"
    local to_date="$2"
    local current_from="$from_date"
    local next_to=""

    while [[ "$(date -d "$current_from" +"%s")" -lt "$(date -d "$to_date" +"%s")" ]]; do
        # End of current day (23:59:59)
        next_to=$(date -d "$current_from +1 day -1 second" +"%Y-%m-%d")" 23:59:59"

        # Adjust next_to if it's bigger than to_date
        if [[ "$(date -d "$next_to" +"%s")" -ge "$(date -d "$to_date" +"%s")" ]]; then
            next_to="$to_date"
        fi

        # Create copy job for one single day
        if ! run_command "$current_from" "$next_to"; then
            echo "Error processing $current_from to $next_to"
            return 1
        fi

        # Go to next day (starting at 00:00:00)
        current_from=$(date -d "$(date -d "$current_from" +'%Y-%m-%d') +1 day $(date -d "$current_from" +'%H:%M:%S')" +'%Y-%m-%d %H:%M:%S')
    done
}

# Step 1: Try full backfill
echo "Running full backfill..." | tee -a $LOG_FILE
if ! run_command "$FROM_DATE" "$TO_DATE"; then
    echo "Full backfill failed, iterating in daily chunks..." | tee -a $LOG_FILE
    iterate_chunks "$FROM_DATE" "$TO_DATE"
fi
echo "Process completed." | tee -a $LOG_FILE
Using either a full backfill or backfilling in chunks, you can successfully migrate your data from Postgres to Tinybird. ### Syncing new events from Postgres to Tinybird¶ The next step is keeping your Tinybird Data Source in sync with events in your Postgres as new events arrive. The steps below show you how to use Tinybird's PostgreSQL Table Function and scheduled Copy Jobs to continually sync data from Postgres to Tinybird. However, you should consider sending future events to Tinybird directly using either the [Events API](https://www.tinybird.co/docs/docs/ingest/events-api) or another streaming Data Source connector, as this will be more resource-efficient (and more real-time). #### Create the incremental Copy Pipe¶ Create another Copy Pipe to perform the incremental syncs: touch pipes/sync_events_from_pg.pipe Paste in this code, again updating your Postgres details as well as the desired schedule to sync. Note that the Copy Pipe limits apply here. NODE sync_from_pg SQL > % SELECT * FROM postgresql( 'your.postgres.url:port', 'your_postgres_instance_name', 'your_postgres_table_name', {{tb_secret('tb_read_user')}}, {{tb_secret('tb_read_password')}}, 'public' ) WHERE timestamp > (SELECT max(timestamp) FROM events) TYPE COPY TARGET_DATASOURCE events COPY_SCHEDULE */5 * * * * Push this to the Tinybird server: tb push pipes/sync_events_from_pg.pipe It's important to first complete the backfill operation before pushing the sync Pipe. The sync Pipe uses the latest timestamp in the Tinybird Data Source to perform a filtered select from Postgres. Failure to backfill will result in a full scan of your Postgres table on your configured schedule. Once you've pushed this Pipe, Tinybird will sync with your Postgres updates based on the schedule you set. ## Next steps¶ If you’d like assistance with your migration, contact Tinybird at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) or in the [Community Slack](https://www.tinybird.co/docs/docs/community). - Set up a free Tinybird account and build a working prototype:[ Sign up here](https://www.tinybird.co/signup) . - Run through a quick example with your free account: Tinybird[ quick start](https://www.tinybird.co/docs/docs/quick-start) . - Read the[ billing docs](https://www.tinybird.co/docs/docs/support/billing) to understand plans and pricing on Tinybird. Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/guides/migrations/migrate-from-rockset Last update: 2024-11-13T15:42:17.000Z Content: --- title: "Migrate from Rockset · Tinybird Docs" theme-color: "#171612" description: "In this guide, you'll learn how to migrate from Rockset to Tinybird, and the overview of how to quickly & safely recreate your setup." --- # Migrate from Rockset¶ In this guide, you'll learn how to migrate from Rockset to Tinybird, and get an overview of how to quickly and safely recreate your setup. Rockset will [no longer be active](https://docs.rockset.com/documentation/docs/faq) after September 30th, 2024. This guide explains the parallels between Rockset and Tinybird features, and how to migrate to Tinybird. Wondering how to create an account? It's free! [Start here](https://www.tinybird.co/signup). ## Prerequisites¶ You don't need an active Tinybird Workspace to read through this guide, but it's a good idea to understand the foundational concepts and how Tinybird integrates with your team.
If you're new to Tinybird, read the [team integration guide](https://www.tinybird.co/docs/guides/integrations/team-integration-governance). ## At a high level¶ Tinybird is a great alternative to Rockset's analytical capabilities. Tinybird is a data platform for data and engineering teams to solve complex real-time, operational, and user-facing analytics use cases at any scale, with end-to-end latency in milliseconds for streaming ingest and high QPS workloads. It's a SQL-first analytics engine, purpose-built for the cloud, with real-time data ingest and full JOIN support. Native, managed ingest connectors make it easy to ingest data from a variety of sources. SQL queries can be published as production-grade, scalable REST APIs for public use or secured with JWTs. Tinybird is a managed platform that scales transparently, requiring no cluster operations, shard management or worrying about replicas. See how Tinybird is used by industry-leading companies today in the [Customer Stories](https://www.tinybird.co/customer-stories) hub. ## Concepts¶ A lot of concepts are the same between Rockset and Tinybird, and there are a handful of others that have a 1:1 mapping. In Tinybird: - Data Source: Where data is ingested and stored - Pipe: How data is transformed - Workspace: How data projects are organized, containing Data Sources and Pipes - Shared Data Source: A Data Source shared between Workspaces - Roles: Each Workspace has "Admin", "Guest", "Viewer" roles - Organizations: Tinybird Enterprise customers with multiple Workspaces can view/monitor/manage them in their Organization Bringing it all together: An Organization has multiple Workspaces. Each Workspace ingests data from a Data Source/Sources, and each Data Source can provide data to multiple Workspaces. Within a Workspace, after the data is ingested it gets transformed by Pipes using SQL logic. Individual members of each Workspace are assigned roles, managed at the Organization level, that give them different levels of access to the data. ### Key concept comparison¶ #### Data Sources¶ Super similar. Rockset and Tinybird both support ingesting data from many types of data sources. You ingest into Tinybird and create a Tinybird **Data Source** that you then have control over - you can iterate the schema, monitor your ingestion, and more. See the [Data Sources docs](https://www.tinybird.co/docs/docs/concepts/data-sources). #### Workspaces¶ Again, very similar. In Rockset, Workspaces contain resources like Collections, Aliases, Views, and Query Lambdas. In Tinybird, **Workspaces** serve the same purpose (holding resources), and you can also share Data Sources between *multiple* Workspaces. Enterprise users monitor and manage Workspaces using the [Organizations feature](https://www.tinybird.co/docs/docs/monitoring/organizations) . See the [Workspace docs](https://www.tinybird.co/docs/docs/concepts/workspaces#what-is-a-workspace). #### Ingest Transformations¶ These are analogous to Tinybird's **Pipes** . It's where you transform your data. The difference is that Rockset does this on initial load (on raw data), whereas Tinybird lets you create and manage a Data Source first, then transform it however you need. See the [Pipes docs](https://www.tinybird.co/docs/docs/concepts/pipes). #### Views¶ Similar to Tinybird's **Nodes** - the modular, chainable "bricks" of SQL queries that compose a Pipe. Like Views, Nodes can reference resources like other Nodes, Pipes, Data Sources, and more. 
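To make the parallel concrete, here's a minimal sketch of two chained Nodes in a .pipe file (hypothetical Pipe, Node, and Data Source names, not taken from this guide):

NODE filter_purchases
SQL >
    SELECT user_id, timestamp
    FROM events
    WHERE action = 'purchase'

NODE daily_purchases
SQL >
    SELECT toDate(timestamp) AS day, count() AS purchases
    FROM filter_purchases
    GROUP BY day

The second Node queries the first by name, much like a View can build on another View.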
See the [Pipes > Nodes docs](https://www.tinybird.co/docs/docs/concepts/pipes#nodes). #### Rollups¶ The Tinybird equivalent of rollups is **Materialized Views** . Materialized Views give you a way to pre-aggregate and pre-filter large Data Sources incrementally, adding simple logic using SQL to produce a more relevant Data Source with significantly fewer rows. Put simply, Materialized Views shift computational load from query time to ingestion time, so your API Endpoints stay fast. See the [Materialized Views docs](https://www.tinybird.co/docs/docs/publish/materialized-views/overview). #### Query Lambdas¶ The Tinybird equivalent of Query Lambdas is **API Endpoints** . You can publish the result of any SQL query in your Tinybird Workspace as an HTTP API Endpoint. See the [API Endpoint docs](https://www.tinybird.co/docs/docs/publish/api-endpoints/overview). ### Schemaless ingestion¶ You can do schemaless/variable schema event ingestion on Tinybird by storing the whole JSON in a column. Use the following schema in your Data Source definition and use [JSONExtract functions](https://clickhouse.com/docs/en/sql-reference/functions/json-functions#jsonextract-functions) to parse the result afterwards. ##### schemaless.datasource SCHEMA > `root` String `json:$` ENGINE "MergeTree" If your data has some common fields, be sure to extract them and add them to the sorting key. It's definitely possible to do schemaless, but having a defined schema is a great idea. Tinybird provides you with an easy way to manage your schema [using .datasource schema files](https://www.tinybird.co/docs/docs/ingest/overview#create-your-schema). Read the docs on using the [JSONPath syntax in Tinybird](https://www.tinybird.co/docs/guides/ingesting-data/ingest-ndjson-data#jsonpaths-and-the-root-object) for more information. ## Ingest data and build a POC¶ Tinybird allows you to ingest your data from a variety of sources, then create Tinybird Data Sources in your Workspace that can be queried, published, materialized, and more. Just like Rockset, Tinybird supports ingestion from: - Data streams (Kafka, Kinesis). - OLTP databases (DynamoDB, MongoDB, MySQL, PostgreSQL). - Data lakes (S3, GCS). A popular option is connecting DynamoDB to Tinybird. Follow [the guide here](https://www.tinybird.co/docs/docs/ingest/dynamodb) or pick another source from the side nav under "Ingest". ## Useful resources¶ Migrating to a new tool, especially at speed, can be challenging. Here are some helpful resources to get started on Tinybird: - Set up a[ DynamoDB Data Source](https://www.tinybird.co/docs/docs/ingest/dynamodb) to start streaming data today. - Read the blog post[ "Migrating from Rockset? See how Tinybird features compare"](https://www.tinybird.co/blog-posts/migrating-from-rockset-feature-comparison) . - Read the blog post[ "A practical guide to real-time CDC with MongoDB"](https://www.tinybird.co/blog-posts/mongodb-cdc) . ## Billing and limits¶ Read the [billing docs](https://www.tinybird.co/docs/docs/support/billing) to understand how Tinybird charges for different data operations.
Remember, [UI usage is free](https://www.tinybird.co/docs/docs/support/billing#exceptions) (Pipes, Playgrounds, Time Series - anywhere you can hit a "Run" button) as is anything on a [Build plan](https://www.tinybird.co/docs/docs/plans) so get started today for free and iterate ***fast***. Check the [limits page](https://www.tinybird.co/docs/docs/support/limits) for limits on ingestion, queries, API Endpoints, and more. ## Next steps¶ If you’d like assistance with your migration, contact Tinybird at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) or in the [Community Slack](https://www.tinybird.co/docs/docs/community). - Set up a free Tinybird account and build a working prototype:[ Sign up here](https://www.tinybird.co/signup) . - Run through a quick example with your free account: Tinybird[ quick start](https://www.tinybird.co/docs/docs/quick-start) . - Read the[ billing docs](https://www.tinybird.co/docs/docs/support/billing) to understand plans and pricing on Tinybird. Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/guides/monitoring/analyze-endpoints-performance Last update: 2024-11-18T21:34:21.000Z Content: --- title: "Analyze the performance of your API Endpoints · Tinybird Docs" theme-color: "#171612" description: "Learn more about how to measure the performance of your API Endpoints" --- # Analyze the performance of your API Endpoints¶ This guide explains how to use `pipe_stats` and `pipe_stats_rt` , giving several practical examples that show what you can do with these Service Data Sources. Tinybird is all about speed. It gives you tools to make real-time queries really quickly, and then even more tools to optimize those queries to make your API Endpoints faster. Of course, before you optimize, you need to know *what* to optimize. That's where the Tinybird `pipe_stats` and `pipe_stats_rt` Data Sources come in. Whether you're trying to speed up your API Endpoints, track error rates, or reduce scan size and subsequent usage costs, `pipe_stats` and `pipe_stats_rt` let you see how your API Endpoints are performing, so you can find performance offenders and get them up to speed. These Service Data Sources provide performance data and consumption data for every single request, plus you can filter and sort results by Tokens to see who is accessing your API Endpoints and how often. Confused about the difference between `pipe_stats_rt` and `pipe_stats`? `pipe_stats` provides **aggregate stats** - like average request duration and total read bytes - per day, whereas `pipe_stats_rt` offers the same information but without aggregation. Every single request is stored in `pipe_stats_rt` . The examples in this guide use `pipe_stats_rt` , but you can use the same logic with `pipe_stats` if you need more than 7 days of lookback. ## Prerequisites¶ You need a high-level understanding of Tinybird's [Service Data Sources](https://www.tinybird.co/docs/docs/monitoring/service-datasources). ### Understand the core stats¶ In particular, this guide focuses on the following fields in the `pipe_stats_rt` Service Data Source: - `pipe_name` (String): Pipe name as returned in Pipes API. - `duration` (Float): the duration in seconds of each specific request. - `read_bytes` (UInt64): How much data was scanned for this particular request. - `read_rows` (UInt64): How many rows were scanned. 
- `token_name` (String): The name of the Token used in a particular request. - `status_code` (Int32): The HTTP status code returned for this particular request. You can find the full schema for `pipe_stats_rt` in the [API docs](https://www.tinybird.co/docs/docs/monitoring/service-datasources#tinybird-pipe-stats-rt). When a request is made through the Query API, the value of `pipe_name` in the event is "query_api". The following section covers how to monitor query performance when using the Query API. ### Using the Query API with metadata parameters¶ If you are using the [Query API](https://www.tinybird.co/docs/docs/api-reference/query-api) to run queries in Tinybird, you can still track query performance using the `pipe_stats_rt` Service Data Source. You add metadata related to the query as request parameters in addition to any existing parameters already used in your query. For example, when running a query against the Query API you can leverage a parameter called `app_name` to track all queries from the "explorer" application. Here's an example using curl: ##### Using the metadata parameters with the Query API curl -X POST \ -H "Authorization: Bearer " \ --data "% SELECT * FROM events LIMIT {{Int8(my_limit, 10)}}" \ "https://api.tinybird.co/v0/sql?my_limit=10&app_name=explorer" In your monitoring queries, you can then use the `parameters` attribute to access those queries where `app_name` equals "explorer": ##### Simple Parameterized Query SELECT * FROM tinybird.pipe_stats_rt WHERE parameters['app_name'] = 'explorer' ## Example 1: Detect errors in your API Endpoints¶ If you want to monitor the number of errors per Endpoint over the last hour, you could do the following: ##### Errors in the last hour SELECT pipe_name, status_code, count() as error_count FROM tinybird.pipe_stats_rt WHERE status_code >= 400 AND start_datetime > now() - INTERVAL 1 HOUR GROUP BY pipe_name, status_code ORDER BY status_code desc If you have errors, this would return something like: ##### OUTPUT Pipe_a | 404 | 127 Pipe_b | 403 | 32 With one query, you can see in real time if your API Endpoints are experiencing errors, and investigate further if so. ## Example 2: Analyze the performance of API Endpoints over time¶ You can also use `pipe_stats_rt` to track how long API calls take using the `duration` field, and see how that changes over time. **API performance is directly related to how much data you are reading per request** , so if your API Endpoint is dynamic, request duration varies. For instance, it might receive start and end date parameters that alter how long a period is being read. ##### API Endpoint performance over time SELECT toStartOfMinute(start_datetime) t, pipe_name, avg(duration) avg_duration, quantile(.95)(duration) p95_duration, count() requests FROM tinybird.pipe_stats_rt WHERE start_datetime >= {{DateTime(start_date_time, '2022-05-01 00:00:00', description="Start date time")}} AND start_datetime < {{DateTime(end_date_time, '2022-05-25 00:00:00', description="End date time")}} GROUP BY t, pipe_name ORDER BY t desc, pipe_name <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fanalyze-endpoints-performance-1.png&w=3840&q=75) ## Example 3: Find the Endpoints that process the most data¶ You might want to find Endpoints that repeatedly scan large amounts of data. These are your best candidates for optimization to reduce time and spend.
Here's an example of using `pipe_stats_rt` to find the API Endpoints that have processed the most data as a percentage of all processed data in the last 24 hours: ##### Most processed data last 24 hours WITH ( SELECT sum(read_bytes) FROM tinybird.pipe_stats_rt WHERE start_datetime >= now() - INTERVAL 24 HOUR ) as total, sum(read_bytes) as processed_byte SELECT pipe_id, quantile(0.9)(duration) as p90, formatReadableSize(processed_byte) AS processed_formatted, processed_byte*100/total as percentage FROM tinybird.pipe_stats_rt WHERE start_datetime >= now() - INTERVAL 24 HOUR GROUP BY pipe_id ORDER BY percentage DESC <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fanalyze-endpoints-performance-2.png&w=3840&q=75) ### Modifying to include consumption of the Query API¶ If you use Tinybird's Query API to query your Data Sources directly, you probably want to include in your analysis which queries are consuming the most. Whenever you use the Query API, the field `pipe_name` contains the value `query_api` . The actual query is included as part of the `q` parameter in the `url` field. You can modify the query in the previous section to extract the actual SQL query that's processing the data. ##### Using the Query API WITH ( SELECT sum(read_bytes) FROM tinybird.pipe_stats_rt WHERE start_datetime >= now() - INTERVAL 24 HOUR ) as total, sum(read_bytes) as processed_byte SELECT if(pipe_name = 'query_api', normalizeQuery(extractURLParameter(decodeURLComponent(url), 'q')),pipe_name) as pipe_name, quantile(0.9)(duration) as p90, formatReadableSize(processed_byte) AS processed_formatted, processed_byte*100/total as percentage FROM tinybird.pipe_stats_rt WHERE start_datetime >= now() - INTERVAL 24 HOUR GROUP BY pipe_name ORDER BY percentage DESC <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fanalyze-endpoints-performance-3.png&w=3840&q=75) ## Example 4: Monitor usage of Tokens¶ If you use your API Endpoint with different Tokens, for example to allow different customers to check their own data, you can track and control which Tokens are being used to access these Endpoints. Here's an example that shows, for the last 24 hours, the number and size of requests per Token: ##### Token usage last 24 hours SELECT count() requests, formatReadableSize(sum(read_bytes)) as total_read_bytes, token_name FROM tinybird.pipe_stats_rt WHERE start_datetime >= now() - INTERVAL 24 HOUR GROUP BY token_name ORDER BY requests DESC <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fanalyze-endpoints-performance-4.png&w=3840&q=75) To obtain this information, you can request the Token name ( `token_name` column) or id ( `token` column). Check the [limits page](https://www.tinybird.co/docs/docs/support/limits) for limits on ingestion, queries, API Endpoints, and more. ## Next steps¶ - Want to optimize further? Read[ Monitor your ingestion](https://www.tinybird.co/docs/docs/guides/monitoring/monitor-your-ingestion) . - Learn how to[ monitor jobs in your Workspace](https://www.tinybird.co/docs/docs/monitoring/jobs) . - Monitor the[ latency of your API Endpoints](https://www.tinybird.co/docs/docs/monitoring/latency) . - Learn how to[ build Charts of your data](https://www.tinybird.co/docs/docs/publish/charts) . --- URL: https://www.tinybird.co/docs/guides/monitoring/monitor-your-ingestion Last update: 2024-11-06T08:19:40.000Z Content: --- title: "Monitor ingestion · Tinybird Docs" theme-color: "#171612" description: "In this guide you'll learn more about how to monitor your data source ingestion in Tinybird."
--- # Monitor your ingestion¶ In this guide, you can learn the basics of how to monitor your Data Source ingestion. By being aware of your ingestion pipeline and leveraging Tinybird's features, you can monitor for any issues with the [Data Flow Graph](https://www.tinybird.co/docs/docs/query/overview#data-flow). Remember: Every Tinybird use case is slightly different. This guide provides guidelines and an example scenario. If you have questions or want to explore more complicated ingestion monitoring scenarios, for instance looking for outliers by using the z-score or other anomaly detection processes, contact Tinybird at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) or in the [Community Slack](https://www.tinybird.co/docs/docs/community). ## Prerequisites¶ You don't need an active Workspace to follow this guide, only an awareness of the [core Tinybird concepts](https://www.tinybird.co/docs/docs/core-concepts). ## Key takeaways¶ 1. Understand and visualize your data pipeline. 2. Leverage the Tinybird platform and tools. 3. Be proactive: Build alerts. ### Understand your data pipeline and flow¶ The first step to monitoring your ingestion to Tinybird is to understand **what** you're monitoring at a high level. As a data team, the most common complaint you may get from your stakeholders is “the data is outdated”, closely followed by "my dashboard is broken" (but that's another matter). When stakeholders complain about outdated data, you and your data engineers start investigating, putting on the intellectual diving suit and checking the data pipelines upstream until you find the problem. Understanding how data flows through those pipelines from the origin to the end is essential, and you should always know what your data flow "landscape" looks like. ### Use the tools¶ Tinybird provides several tools to help you: - The[ Data Flow Graph](https://www.tinybird.co/docs/docs/query/overview#data-flow) is Tinybird’s data lineage diagram. It visualizes how data flows within your project. It shows all the levels of dependencies, so you can see how all your Pipes, Data Sources, and Materialized Views connect. - [ Service Data Sources](https://www.tinybird.co/docs/docs/monitoring/service-datasources) are logs that allow you to keep track of almost everything happening data-wise within your system. - Use[ Time Series](https://www.tinybird.co/docs/docs/query/overview#time-series) in combination with Service Data Sources to visualize data ingestion trends and issues over time. ### Build alerts¶ Lastly, you can create a personalized alert system by integrating your Pipes and Endpoints that point to certain key Service Data Sources with third-party services. ## Example scenario: From spotting birds to spotting errors¶ ### Overview¶ In this example, a user with a passion for ornithology (the study of birds 🤓) has built a Workspace called `bird_spotter` ( [GitHub repository here](https://github.com/tinybirdco/bird-spotter-project) ). They're using it to analyze the number of birds they spot in their garden and when out on hikes. It uses Tinybird’s high frequency ingestion (Events API) and an updated legacy table in BigQuery, so the Data Sources are as follows: 1.
`bird_records` : A dataset containing bird viewings describing the time and bird details, which the[ Events API](https://www.tinybird.co/docs/docs/ingest/events-api) populates every day: <-figure-> ![Screenshot showing a dataset populated by ingesting from the Events API](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-monitor-ingestion-dataset-1.png&w=3840&q=75) 1. `birds_by_hour_and_country_from_copy` : An aggregated dataset of the bird views per hour and country, which a[ Copy Pipe](https://www.tinybird.co/docs/docs/publish/copy-pipes) populates every hour: <-figure-> ![Screenshot showing a dataset populated by ingesting from a Copy Pipe](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-monitor-ingestion-dataset-2.png&w=3840&q=75) 1. `tiny_bird_records` : A dataset with a list of tiny birds (e.g. hummingbirds), which Tinybird's[ BigQuery Connector](https://www.tinybird.co/docs/docs/ingest/bigquery) replaces every day: <-figure-> ![Screenshot showing a dataset populated by ingesting a BigQuery connector](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-monitor-ingestion-dataset-3.png&w=3840&q=75) As you can see, the three Data Sources rely on three different methods of ingestion: Appending data using the high frequency API, aggregating and copying, and syncing from BigQuery. To make sure that each of these processes is happening at the scheduled time, and without errors, this user needs to implement some monitoring. ### Monitoring ingestion and spotting errors¶ Remember all those tools Tinybird offers? Here's how this user fits them together: You can filter the **Service Data Source** called `datasources_ops_log` by Data Source and ingestion method. By building a quick **Time Series** , they can immediately see the "shape" of their ingestion: <-figure-> ![Screenshot showing a Time Series](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-monitor-ingestion-time-series.png&w=3840&q=75) It shows yellow bars (High-Frequency Ingest) and green bars (BigQuery sync) every day, and blue bars (copy operation) every hour. Now, the user can build a robust system for monitoring. Instead of only focusing on the ingestion method, they can create 3 different Pipes that have specific logic, and expose each Pipe as a queryable Endpoint. Each Endpoint aggregates key information about each ingestion method, and counts and flags errors.
#### Endpoint 1: Check append-hfi operations in bird\_records¶ SELECT toDate(timestamp) as date, sum(if(result = 'error', 1, 0)) as error_count, count() as append_count, if(append_count > 0, 1, 0) as append_flag FROM tinybird.datasources_ops_log WHERE datasource_name = 'bird_records' AND event_type = 'append-hfi' GROUP BY date ORDER BY date DESC <-figure-> ![Screenshot showing SQL Pipe logic](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-monitor-ingestion-endpoint-1.png&w=3840&q=75) #### Endpoint 2: Check copy operations in birds\_by\_hour\_and\_country\_from\_copy¶ SELECT toDate(timestamp) as date, sum(if(result = 'error', 1, 0)) as error_count, count() as copy_count, if(copy_count >= 24, 1, 0) as copy_flag FROM tinybird.datasources_ops_log WHERE datasource_name = 'birds_by_hour_and_country_from_copy' AND event_type = 'copy' GROUP BY date ORDER BY date DESC <-figure-> ![Screenshot showing SQL Pipe logic](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-monitor-ingestion-endpoint-2.png&w=3840&q=75) #### Endpoint 3: Check replace operations in tiny\_bird\_records¶ SELECT toDate(timestamp) as date, sum(if(result = 'error', 1, 0)) as error_count, count() as replace_count, if(replace_count > 0, 1, 0) as replace_flag FROM tinybird.datasources_ops_log WHERE datasource_name = 'tiny_bird_records' AND event_type = 'replace' GROUP BY date ORDER BY date DESC <-figure-> ![Screenshot showing SQL Pipe logic](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-monitor-ingestion-endpoint-3.png&w=3840&q=75) ### Using the output¶ Because these Pipes expose API Endpoints, they can be consumed by any third-party app to build real-time alerts. This could be something like DataDog by following this [helpful integration guide](https://www.tinybird.co/blog-posts/how-to-monitor-tinybird-using-datadog-with-vector-dev) , Grafana using [this plugin](https://www.tinybird.co/docs/docs/guides/integrations/consume-api-endpoints-in-grafana) , PagerDuty, Uptime Robot, or GitHub Actions with a cron job system checking for errors. Check the [limits page](https://www.tinybird.co/docs/docs/support/limits) for limits on ingestion, queries, API Endpoints, and more. ### Example GitHub Actions implementation¶ In the `bird_spotter` example repo, you can see the `scripts` and `workflows` that the user has built: - `ingest.py` and `monitor.py` are Python scripts that run daily. The first ingests data (in this case from a sample CSV) and the second checks whether the append, copy, and sync operations have happened and are error-free. Because this guide is an example scenario, there's a function that randomly chooses not to ingest, so there's always an error present. - `ingest.yml` and `monitor.yml` are YAML files that schedule those daily runs. The output of a daily check would look something like this: INFO:__main__:Alert! Ingestion operation missing. Last ingestion date is not today: 2024-04-16 INFO:__main__:Last copy_count count is equal to 9. All fine! INFO:__main__:Last replace_count count is equal to 1. All fine! INFO:__main__:Alerts summary: INFO:__main__:Append error count: 1 INFO:__main__:Copy error count: 0 INFO:__main__:Replace error count: 0 In this instance, the ingestion script has randomly failed to append new data, triggering an alert that the user can action. In contrast, copy operations and replace counts have run as expected: 9 copies and 1 BigQuery sync occurred since 00:00.
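If you'd rather expose a single Endpoint than three, a variation on the queries above (a sketch reusing the same `datasources_ops_log` fields; the per-method flag logic is omitted here) can check all three Data Sources at once:

##### Check all three ingestion methods at once
SELECT
    datasource_name,
    toDate(timestamp) AS date,
    countIf(result = 'error') AS error_count,
    count() AS operation_count
FROM tinybird.datasources_ops_log
WHERE datasource_name IN ('bird_records', 'birds_by_hour_and_country_from_copy', 'tiny_bird_records')
  AND event_type IN ('append-hfi', 'copy', 'replace')
GROUP BY datasource_name, date
ORDER BY date DESC, datasource_name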
## Example scenario: Detect out-of-sync Data Sources¶ ### Overview¶ Some Tinybird Connectors like BigQuery or Snowflake use **async jobs** to keep your Data Sources up to date. These jobs produce records with the result sent to the `datasources_ops_log` Service Data Source, both for successful and failed runs. The following example configures a new Tinybird Endpoint that reports Data Sources that are out of sync. It's then possible to leverage that data in your monitoring tool of choice (Grafana, Datadog, UptimeRobot, etc.). #### Endpoint: Get out of sync Data Sources using datasources\_ops\_log¶ To get the Data Sources that haven't been successfully updated in the last hour, check their sync jobs results in the `datasources_ops_log`: select datasource_id, argMax(datasource_name, timestamp) as datasource_name, max(case when result = 'ok' then timestamp end) as last_successful_sync from tinybird.datasources_ops_log where arrayExists(x -> x in ('bigquery','snowflake'), Options.Values) and toDate(timestamp) >= today() - interval 30 days and result = 'ok' group by datasource_id having max(event_type = 'delete') = false and last_successful_sync < now() - interval 1 hour ## Next steps¶ - Read the in-depth docs on Tinybird's[ Service Data Sources](https://www.tinybird.co/docs/docs/monitoring/service-datasources) . - Learn how to[ Optimize your data project](https://www.tinybird.co/docs/docs/guides/optimizations/overview) . - Learn about the difference between log* analytics* and log* analysis* in the blog[ "Log Analytics: how to identify trends and correlations that Log Analysis tools cannot"](https://www.tinybird.co/blog-posts/log-analytics-how-to-identify-trends-and-correlations-that-log-analysis-tools-cannot) . --- URL: https://www.tinybird.co/docs/guides/optimizations/opt101-detect-inefficiencies Last update: 2024-11-07T09:31:09.000Z Content: --- title: "How to detect inefficient resources · Tinybird Docs" theme-color: "#171612" description: "In this guide, you'll learn where to look when looking to optimize your Tinybird project." --- # Optimizations 101: Detect inefficient resources¶ This guide shows you how to find your most inefficient resources - typically the slowest API Endpoints, and the ones processing the most data. ## Prerequisites¶ You don't need an active Workspace to understand this guide, but it can be useful to have one open to compare and apply the ideas to your own situation. ## 1. Orient yourself with the Overview¶ First, navigate to the Overview page of your Workspace: <-figure-> ![Overview of a Workspace showing stats about Pipes performance](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Foptimiz101-overview.png&w=3840&q=75) <-figcaption-> Workspace Overview page To find the slowest API Endpoints: **Set the period to show performance information using the top right dropdown:** - Try to cover a large period for a good average. - If you've made changes recently, try to only include the period for the current version, to make sure you're analyzing the performance of the new version. **Then use the different Consumption summaries to examine your usage:** - BI Connector (if you're using it). - Query API for direct API queries. - A separate section for Pipes. **Check the dedicated Pipes section:** - In Pipes, you get a view of the total number of requests over the period, the average latency, the total data processed for that Pipe, and the average data processed. - Sort by `Processed` to bring the Pipes processing the most data over the period to the top. 
This effectively shows which Pipe is the "most expensive" and gives you a list of where to start your investigations. Additionally, you can sort by `avg. latency` if you're interested in finding the slowest Endpoints. The `requests` column can also help you decide which Pipes to optimize; for example, you could ignore Pipes with few requests. - Finally, select the name of a Pipe to open it in the UI for further investigation. ## 2. Analyze a Pipe Endpoint¶ When you open up a Pipe in the UI, you have a few different options to gain insights into its performance: <-figure-> ![Pipe page with a spotlight on View API and performance metrics](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Foptimiz101-pipe-analysis-spotlight.png&w=3840&q=75) <-figcaption-> Pipe page: View API and performance metrics The **View API** button (top right corner) takes you to the Pipe's performance and testing page. Under each Node in the Pipe, there's **performance information** about the query that Node is running (processed bytes, rows, columns, time). Note that these Node performance metrics are from executing the query with a `LIMIT 20` , so they may give you smaller results unless the query reads all the rows for processing. You can modify the query in the Node to force it to process all the data, such as by hashing a column or forcing a count of all rows. By doing this, you'll have a clear indication of data processed and execution time. If you select **Explain** under any Node, you'll see how the Nodes are integrated into the query that is run through the engine: <-figure-> ![Pipe page](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Foptimiz101-pipe-analysis-explain.png&w=3840&q=75) <-figcaption-> Pipe page This can be a very good way to spot if you're making a common mistake, such as not filtering your data before you aggregate it. These common mistakes will be explained later. The Node with the tick icon is the one published as the Endpoint: The performance metrics under this Node are still shown with `LIMIT 20` . Select the **View API** button (or just select the tick icon) to bring up the Endpoint performance page for further analysis. ## 3. Use the View API Performance page¶ Selecting the **View API** button from the Pipe page takes you to the Endpoint page. <-figure-> ![Endpoint page](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Foptimiz101-endpoint-page.png&w=3840&q=75) <-figcaption-> Endpoint page Here you can see the specific performance metrics of this Pipe as it is called by your application. This is a good place to see if there’s a lot of variance in requests (in other words, to check if the performance of the Endpoint changes with time for any reason). Towards the bottom of that page, you see the sample snippets: <-figure-> ![Endpoint snippet](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Foptimiz101-endpoint-snippet.png&w=3840&q=75) <-figcaption-> Endpoint snippet You can copy this sample code and run it in a new browser tab. This executes the query exactly as your application would, and the performance metrics will be recorded and reported to you on the Endpoint page. This allows you to see the exact impact of any changes you make to your Pipe. ## 4.
Use params to find common Pipe usage patterns¶ You'll need to be familiar with [dynamic parameters](https://www.tinybird.co/docs/docs/query/query-parameters) and [Service Data Sources](https://www.tinybird.co/docs/docs/monitoring/service-datasources) for the next 2 sections. Once you have a dynamic API Endpoint with parameters in use, it can be really useful to observe common parameter patterns from the application (for instance, which parameters and values force the Endpoints to process more data). You can then use these observations to drive new optimizations on the Pipe, such as ensuring that a column that is commonly used as a parameter is in the sorting key. Use the `pipe_stats` and `pipe_stats_rt` [Service Data Sources](https://www.tinybird.co/docs/docs/monitoring/service-datasources) . You can also explore some common Pipe usage queries in the [Monitor API performance guide](https://www.tinybird.co/docs/docs/guides/monitoring/analyze-endpoints-performance). Specifically, you can execute the following query to get the number of requests, average processed data, and average duration of your API Endpoints and parameters in the last 24 hours: ##### check pipe\_stats\_rt SELECT pipe_name, parameters, count() AS total_requests, formatReadableSize(avg(read_bytes)) AS avg_read, avg(duration) AS avg_duration FROM tinybird.pipe_stats_rt WHERE start_datetime > now() - INTERVAL 1 DAY GROUP BY pipe_name, parameters ORDER BY total_requests DESC ## 5. Measure Materialized Views and Copy Pipes performance¶ You should also check the Pipes that move data from one Data Source to another, like Materialized Views and Copy Pipes. You can track them in `tinybird.datasources_ops_log` and in the Materialized View page or Copy page. <-figure-> ![Materialized View page](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Foptimiz101-mv-log.png&w=3840&q=75) <-figcaption-> Materialized View page ##### check processed data for data source operations SELECT datasource_name, formatReadableSize(sum(read_bytes) + sum(written_bytes)) AS f_processed_data, formatReadableSize(sum(read_bytes)) AS f_read_bytes, formatReadableSize(sum(written_bytes)) AS f_written_bytes, round(sum(read_bytes) / sum(written_bytes), 2) AS f_ratio FROM tinybird.datasources_ops_log WHERE timestamp > now() - INTERVAL 1 DAY AND pipe_name != '' GROUP BY datasource_name ORDER BY sum(read_bytes) + sum(written_bytes) DESC Pay special attention to **Materialized Views with JOINs** , since they are prone to scanning more data than needed if your SQL is not optimized. Basically, JOINs should use subqueries of pre-filtered data. ## Next steps¶ - Read[ Optimizations 201: Fix common mistakes](https://www.tinybird.co/docs/docs/guides/optimizations/opt201-fix-mistakes) . - Check out the Monitoring docs and guides for more tips, like[ using Time Series to analyze patterns](https://www.tinybird.co/docs/docs/monitoring/latency#time-series) . - Explore[ this example repo](https://github.com/tinybirdco/tb-usage-cost-metrics/tree/main) to analyze Processed Data. It may not be 100% accurate to[ billing](https://www.tinybird.co/docs/docs/support/billing) , as Tinybird tracks certain operations differently in Service Data Sources, but it's a great proxy. --- URL: https://www.tinybird.co/docs/guides/optimizations/opt201-fix-mistakes Last update: 2024-11-15T09:17:57.000Z Content: --- title: "How to fix common mistakes · Tinybird Docs" theme-color: "#171612" description: "In this guide, you'll learn more tips on how to optimize your Tinybird project."
--- # Optimizations 201: Fix common mistakes¶ In this guide, you'll learn the top 5 questions to ask yourself to fix common pitfalls in your data project. ## Prerequisites¶ You'll need to read [Optimizations 101: Detect inefficiencies](https://www.tinybird.co/docs/opt101-detect-inefficiencies) first. It'll give you an idea of which Pipe is the worst performing, and the particular characteristics of that performance, so you can start to look into common causes. It's also a really good idea to read [Best practices for faster SQL](https://www.tinybird.co/docs/docs/query/sql-best-practices) and the [Thinking in Tinybird](https://www.tinybird.co/blog-posts/thinking-in-tinybird) blog post. This guide walks through 5 common questions (the "usual suspects"). If you find a badly-performing Pipe, ask yourself these 5 questions first before exploring other, more nuanced problem areas. ## 1. Are you aggregating or transforming data at query time?¶ Calculating the `count()`, `sum()` , or `avg()` , or casting to another data type is common in a published API Endpoint. **As your data scales, you may be processing more data than necessary** , as you run the same query each time there is a request to that API Endpoint. If this is the case, you should create a [Materialized View](https://www.tinybird.co/docs/docs/publish/materialized-views/overview). In a traditional database, you have to schedule Materialized Views to run on a regular cadence. Although this helps pre-process large amounts of data, the batch nature renders your data stale. In Tinybird, **Materialized Views let you incrementally pre-aggregate, transform, and filter large Data Sources upon ingestion** . By shifting the computational load from query time to ingestion time, you'll scan less data and keep your API Endpoints fast. Read the docs to [create a Materialized View](https://www.tinybird.co/docs/docs/publish/materialized-views#creating-a-materialized-view-in-the-tinybird-ui). ## 2. Are you filtering by the fields in the sorting key?¶ The sorting key is important. **It determines the way data is indexed and stored in your Data Source** , and is crucial for the performance of your API Endpoints. Setting the right sorting key allows you to have all the data you need, for any given query, as close as possible. In all databases (including Tinybird), indexing lets you skip reading the data that you don’t need, which speeds up some operations (like filtering) hugely. The goal of sorting keys (SKs) is to reduce scan size and discard as much data as possible when examining the `WHERE` clauses in the queries. In short, a good Sort Key is what will help you [avoid expensive, time-consuming full scans](https://www.tinybird.co/docs/docs/query/sql-best-practices#avoid-full-scans). Some good rules of thumb for setting Sorting Keys: - Order matters: Data will be stored based on the Sort Key order. - Between 3 and 5 columns is good enough. More will probably penalize performance. - `timestamp` is often a** bad candidate for being the first element** of the SK. - If you have a multi-tenant app, `customer_id` is a** great candidate for being the first element** of the SK. One common mistake is to use the [partition key](https://www.tinybird.co/docs/docs/concepts/data-sources#partitioning) for filtering. Use the sorting key, not the partition key. ## 3. Are you using the best data types?¶ If you do need to read data, you should try to use the smallest types that can get the job done. A common example is timestamps. Do you really need millisecond precision?
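As a quick illustration (a sketch using ClickHouse's `byteSize` function, not part of the original guide; these are uncompressed in-memory sizes, and on-disk data is also compressed), you can compare the per-value cost of each level of precision:

##### Per-value size of date and time types
SELECT
    byteSize(now64(6)) AS datetime64_bytes, -- 8 bytes per value
    byteSize(now())    AS datetime_bytes,   -- 4 bytes per value
    byteSize(today())  AS date_bytes        -- 2 bytes per value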
Often when users start doing data analytics, they aren't sure what the data will look like or how to query it in their application. But now that you have your API Endpoint or Pipe, you can (and should!) go back and see if your schema best supports your resulting use-case. It's common to use *simple* types to begin, like `String`, `Int` , and `DateTime` , but as you go further in the application implementation, you should review the data types you selected at the beginning. When reviewing your data types, focus on the following points: - ** Downsizing types** , selecting a different data type with a smaller size. For instance, UUID fields can use the UUID type instead of a string type, you can use unsigned integers (UInt) instead of signed integers (Int) where there aren’t negative values, or you could use a Date instead of DateTime. - Examine string** cardinality** to perhaps use `LowCardinality()` if there are fewer than 100k unique values. - ** Nullable** columns are** bigger and slower** and can’t be sorting keys, so use `coalesce()` . **Sorting key** and **data type** changes are done by changing your schema, which means [iterating the Data Source](https://www.tinybird.co/docs/docs/guides/ingesting-data/iterate-a-data-source) . See an example of these types of changes in the `thinking-in-tinybird` demo repo. ## 4. Are you doing complex operations early in the processing pipeline?¶ Operations such as joins or aggregations get increasingly expensive as your data grows. Filter your data first to reduce the number of rows, then **perform the more complex operations later in the pipeline**. Follow this example: [Rule 5 for faster SQL](https://www.tinybird.co/docs/docs/query/sql-best-practices#perform-complex-operations-later-in-the-processing-pipeline). ## 5. Are you joining two or more data sources?¶ It's a common scenario to want to **enrich your events with some dimension tables** , so you end up materializing a JOIN. This kind of approach can process more data than you need, so here are some tips to reduce it: - Try to switch out JOINs and replace them with a** subquery** : `WHERE column IN (SELECT column FROM dimensions)` . - If the join is needed, try to** filter the right table first** (better if you can use a field in the sorting key). - Remember that the** Materialization is only triggered when you ingest data in the left Data Source** (the one you use to do a `SELECT … FROM datasource` ). So, if you need to recalculate data from the past, creating a Materialized View is not the right approach. Instead, check this guide about[ Copy Pipes](https://www.tinybird.co/docs/docs/publish/copy-pipes) . ### Understanding the Materialized JOIN issue¶ #### The issue¶ There’s a [common pitfall](https://www.tinybird.co/docs/docs/publish/materialized-views#limitations) when working with Materialized Views: *Materialized Views generated using JOIN clauses are tricky. The resulting Data Source will be only automatically updated if and when a new operation is performed over the Data Source in the FROM.* Since Materialized Views work with the result of a SQL query, you can use JOINs and any other SQL feature. But JOINs should be used with caution. SELECT a.id, a.value, b.value FROM a LEFT JOIN b USING id If you insert data in `a` (LEFT SIDE), data will be processed as expected... but what happens if you add data to `b` (RIGHT SIDE)? It will not be processed. This is because a Materialized View is only triggered when its source table receives inserts.
It's just a trigger on the source table and knows nothing about the joined table. Note that this doesn't only apply to JOIN queries, and is relevant when introducing **any** table external to the Materialized View's SELECT statement, e.g. using an `IN (SELECT ...)` clause. It can become more complex if you need to deal with [stream joins](https://github.com/tinybirdco/streaming_join_demo) . However, this guide focuses on the basic setup, as doing JOINs has implications most people don't realize. These JOINs can be very expensive because you’re reading the small number of rows being ingested (LEFT SIDE) plus performing a full scan of the joined table (RIGHT SIDE), as you don’t know anything about it. #### The optimization¶ Sometimes, to easily detect these cases, it’s useful to review the `read_bytes` / `written_bytes` ratio. If you're reading way more than writing, most likely you’re doing some JOINs within the MV. You can easily change this by **adding a filter on the right side** , rewriting the previous query as follows: SELECT a.id, a.value, b.value FROM a LEFT JOIN ( SELECT id, value FROM b WHERE b.id IN (SELECT id FROM a) ) b USING id This might sound counter-intuitive when writing a query the first time because you’re basically reading `a` twice. However, keep in mind that `a` is usually smaller than `b` because you’re only reading the block of rows you’re ingesting. To see a real improvement, you will need the fields you’re using to filter to be in the sorting key of `b` . Most of the time you'll use the "joining key", but you can use any other potential field that allows you to hit the index in `b` and filter the right side of the JOIN. ## Next steps¶ - Check out the Monitoring docs and guides for more tips, like[ using Time Series to analyze patterns](https://www.tinybird.co/docs/docs/monitoring/latency#time-series) . - Explore[ this example repo](https://github.com/tinybirdco/tb-usage-cost-metrics/tree/main) to analyze Processed Data. It may not be 100% accurate to[ billing](https://www.tinybird.co/docs/docs/support/billing) , as Tinybird tracks certain operations differently in Service Data Sources, but it's a great proxy. --- URL: https://www.tinybird.co/docs/guides/optimizations/overview Last update: 2024-10-14T12:07:26.000Z Content: --- title: "Optimizations guides · Tinybird Docs" theme-color: "#171612" description: "In this set of guides, you'll learn where to look when looking to optimize your Tinybird project, what to edit, and how to monitor changes." --- # Optimizations¶ This compilation of guides assumes some familiarity with using Tinybird (ingesting data, building query Pipes, publishing API Endpoints), particularly with using [Service Data Sources](https://www.tinybird.co/docs/docs/monitoring/service-datasources). The good news is that Tinybird is so fast that even for unoptimized projects, response times are excellent. Many projects, especially younger or smaller ones, won't necessarily or immediately *need* to optimize right now. However, it's never too soon, especially because data moves fast. It's better to have 30ms latency than 300ms, and better to process less data so your [bills are smaller](https://www.tinybird.co/docs/docs/support/billing#reduce-your-bill).
## About this section¶ The guides in this Optimizations section curate the best applied knowledge across Tinybird's docs ( [Best practices for faster SQL](https://www.tinybird.co/docs/docs/query/sql-best-practices) ), videos ( [Materialization process saves $40K](https://youtu.be/rfckIMbLWBg?si=dY1dP_ctx5tOSyhq), [Tips & Tricks to Keep Your Queries under 100ms](https://www.youtube.com/watch?v=MN2M6HAoO64) ), blog posts ( [Thinking in Tinybird](https://www.tinybird.co/blog-posts/thinking-in-tinybird) ), and the deep expertise of our Data Engineering and Customer Support teams. It gives you both practical examples *and* a framework of questions you can ask in your own unique scenario. This combination should empower you with the tools, tips, tricks, and approach to build the best-optimized projects. So, if you want to feel like [Marc](https://x.com/mfts0/status/1797651962692767801) or [Thibault](https://x.com/thibaultleouay/status/1699492486488498270) , start digging into the fascinating world of optimizing Tinybird projects. ## Optimizations mantra¶ Tinybird gives you the platform to manage your real-time data. Measure what matters, detect inefficiencies, fix and eliminate common (or unusual!) mistakes, and move faster. Remember, speed wins! ## Next steps¶ - Dive in with[ Optimizations 101: Detecting inefficient resources](https://www.tinybird.co/docs/opt101-detect-inefficiencies) . - Improve the efficiency of your project with[ Optimizations 201: Fixing common mistakes](https://www.tinybird.co/docs/opt201-fix-mistakes) . - Understand how to[ monitor your ingestion](https://www.tinybird.co/docs/docs/guides/monitoring/monitor-your-ingestion) . --- URL: https://www.tinybird.co/docs/guides/overview Last update: 2024-07-31T16:54:46.000Z Content: --- title: "Overview · Tinybird Docs" theme-color: "#171612" description: "Follow the Tinybird Guides to get hands-on experience of different processes." --- # Tinybird guides¶ ## Docs vs guides¶ Tinybird offers you multiple ways to learn: written information, [screencasts and videos](https://www.tinybird.co/screencasts) , a [Slack community](https://www.tinybird.co/docs/docs/community) to ask questions in, [starter kit repos](https://www.tinybird.co/docs/docs/starter-kits) , live coding webinars, and a [Use Case Hub](https://www.tinybird.co/docs/docs/use-cases) to explore different industry applications. Within the written information, knowledge is split up as follows: 1. The** docs** give you information about the Tinybird platform, its features, and capabilities. 2. The** guides** give you real-life, real-time examples in an easy-to-follow tutorial or "How To..." format. For instance: The docs tell you about Materialized Views on Tinybird as a concept or capability. The associated guides tell you how to make your own Materialized View. There's also a "Tutorials" subsection in Guides that offers full, end-to-end walkthroughs of specific use cases and example solutions. ## Where to start¶ Don't be put off if you're new. Pick a guide that interests you, and read through it as many times as you need. You're bound to pick it up and learn as you go! Remember, you can always contact Tinybird at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) or in the [Community Slack](https://www.tinybird.co/docs/docs/community). ## Next steps¶ Let's go! Pick your next guide from the navigation on the left.
--- URL: https://www.tinybird.co/docs/guides/publishing-data/advanced-dynamic-endpoints-functions Last update: 2024-11-19T13:41:56.000Z Content: --- title: "Advanced template functions for dynamic API Endpoint · Tinybird Docs" theme-color: "#171612" description: "Learn more about creating dynamic API Endpoint using advanced templates." --- # Advanced template functions for dynamic API Endpoint¶ The [Template functions section](https://www.tinybird.co/docs/docs/cli/advanced-templates#template-functions) of the [Advanced templates docs](https://www.tinybird.co/docs/docs/cli/advanced-templates) explains functions that help you create more advanced dynamic templates. On this page, you'll learn how these templates can be used to create dynamic API Endpoints with Tinybird. ## Prerequisites¶ Before continuing, make sure you're familiar with [template functions](https://www.tinybird.co/docs/docs/cli/advanced-templates#template-functions) and [query parameters](https://www.tinybird.co/docs/docs/query/query-parameters). ## The data¶ This guide uses the ecommerce events data enriched with products. The data looks like this: ##### Events and products data SELECT *, price, city, day FROM events_mat ANY LEFT JOIN products_join_sku ON product_id = sku 17.84MB, 131.07k x 12 ( 9.10ms ) ## Tips and tricks¶ When the complexity of Pipes and API Endpoints grows, developing them and knowing what's going on when you need to debug problems can become challenging. Here are some tricks that we use ourselves and with our clients, and that you might find useful as well: ### WHERE 1=1¶ When you filter by different criteria, given by dynamic parameters that can be omitted, you'll need a `WHERE` clause, but if none of the parameters are present there's nothing to put in it. The trick is to add a `WHERE` statement with a dummy condition (like `1=1` ) that's always true, and then add the other filter statements dynamically if the parameters are defined, like we do in the [defined](https://www.tinybird.co/docs/about:blank#defined) example of this guide. ### set¶ The [set](https://www.tinybird.co/docs/docs/cli/advanced-templates#variables-vs-parameters) function present in the previous snippet lets you set the value of a parameter in a Node, so that you can check the output of a query depending on the value of the parameters it takes. Otherwise, you'd have to publish an API Endpoint and make requests to it with different parameters. Using `set` , you don't have to exit the Tinybird UI while creating an API Endpoint and the whole process is faster, without needing to go back and forth between your browser or IDE and Postman (or whatever you use to make requests). Another example of its usage: ##### Using set to try out different parameter values % {% set select_cols = 'date,user_id,event,city' %} SELECT {{columns(select_cols)}} FROM events_mat 2.39MB, 49.15k x 4 ( 3.96ms ) You can use more than one `set` statement. Just put each one on a separate line at the beginning of a Node. `set` is also a way to set defaults for parameters. If you used `set` statements to test your API Endpoint while developing, remember to remove them before publishing your code, because otherwise the `set` will override any incoming parameter. ### The default argument¶ Another way to set default values for parameters is using the `default` argument that most Tinybird template functions accept.
The previous code could be rewritten as follows: ##### Using the default argument % SELECT {{columns(select_cols, 'date,user_id,event,city')}} FROM events_mat 3.58MB, 73.73k x 4 ( 3.87ms ) Keep in mind that defining the same parameter in more than one place in your code in different ways can lead to inconsistent behavior. Here's a solution to avoid that: ### Using WITH statements to avoid duplicating code¶ If you are going to use the same dynamic parameters more than once in a Node of a Pipe, it's good to define them in one place only to avoid duplicating code. It also makes it clearer knowing which parameters are going to appear in the Node. This can be done with one or more statements at the beginning of a Node, using the `WITH` clause. The [WITH](https://clickhouse.tech/docs/en/sql-reference/statements/select/with/) clause in ClickHouse® supports CTEs. They're preprocessed before executing the query, and they can only return one row (this is different to other databases such as Postgres). This is better seen with a live example: ##### DRY with the with clause % {% set terms='orchid' %} WITH {{split_to_array(terms, '1,2,3')}} AS needles SELECT *, joinGet(products_join_sku, 'color', product_id) color, joinGet(products_join_sku, 'title', product_id) title FROM events WHERE multiMatchAny(lower(color), needles) OR multiMatchAny(lower(title), needles) 4.61MB, 40.96k x 7 ( 14.43ms ) ### Documenting your API Endpoints¶ Tinybird creates auto-generated documentation for all your published API Endpoints, taking the information from the dynamic parameters found in the Pipe. It's best practice to set default values and descriptions for every parameter in one place (also because some functions don't accept a description, for example). We normally do that in the final Node, with `WITH` statements at the beginning. See how we'd do it in the [last section](https://www.tinybird.co/docs/about:blank#putting-it-all-together) of this guide. ### Hidden parameters¶ If you use some functions like `enumerate_with_last` in the example [below](https://www.tinybird.co/docs/about:blank#enumerate-with-last) , you'll end up with some variables (called `x`, `last` in that code snippet) that Tinybird will interpret as if they were parameters that you can set, and they will appear in the auto-generated documentation page. To avoid that, add a leading underscore to their name, renaming `x` to `_x` and `last` to `_last`. ### Debugging any query¶ We have an experimental feature that lets you see how the actual SQL code that will be run on ClickHouse for any published API Endpoint looks, interpolating the query string parameters that you pass in the request URL. If you have a complex query and you'd like to know what is the SQL that will be run, [let us know](https://www.tinybird.co/docs/mailto:support@tinybird.co) and we'll give you access to this feature to debug a query. Now let's explore some of the Tinybird advanced template functions, what they allow you to do, and some tricks that will improve your experience creating dynamic API Endpoints on Tinybird. ## Advanced functions¶ Most of these functions also appear in the [Advanced templates](https://www.tinybird.co/docs/docs/cli/advanced-templates?#template-functions) section of our docs. Here we'll provide practical examples of their usage so that it's easier for you to understand how to use them. ### defined¶ The `defined` function lets you check if a query string parameter exists in the request URL or not. 
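For example, a minimal check looks like the following sketch, where the `category` parameter name is made up for illustration and `events` is the Data Source used throughout this guide:

##### Checking a parameter with defined (sketch)
%
SELECT count() AS total
FROM events
-- the filter is only added when the hypothetical "category" parameter is passed in the request URL
{% if defined(category) %}
WHERE event = {{String(category)}}
{% end %}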
Imagine you want to filter events with a price within a minimum or a maximum price, set by two dynamic parameters that could be omitted. A way to define the API Endpoint would be like this: ##### filter by price % {% set min_price=20 %} {% set max_price=50 %} SELECT *, price FROM events_mat WHERE 1 = 1 {% if defined(min_price) %} AND price >= {{Float32(min_price)}} {% end %} {% if defined(max_price) %} AND price <= {{Float32(max_price)}} {% end %} 2.23MB, 16.38k x 8 ( 8.05ms ) To see the effect of having a parameter not defined, use `set` to set its value to `None` like this: ##### filter by price, price not defined % {% set min_price=None %} {% set max_price=None %} SELECT *, price FROM events_mat WHERE 1 = 1 {% if defined(min_price) %} AND price >= {{Float32(min_price)}} {% end %} {% if defined(max_price) %} AND price <= {{Float32(max_price)}} {% end %} 4.46MB, 32.77k x 8 ( 6.36ms ) You could also provide some smart defaults to avoid needing to use the `defined` function at all: ##### filter by price with default values % SELECT *, price FROM events_mat_cols WHERE price >= {{Float32(min_price, 0)}} AND price <= {{Float32(max_price, 999999999)}} ### Array(variable\_name, 'type', \[default\])¶ Transforms a comma-separated list of values into a Tuple. You can provide a default value for it or not: % SELECT {{Array(code, 'UInt32', default='13412,1234123,4123')}} AS codes_1, {{Array(code, 'UInt32', '13412,1234123,4123')}} AS codes_2, {{Array(code, 'UInt32')}} AS codes_3 To filter events whose type belongs to the ones provided in a dynamic parameter, separated by commas, you'd define the API Endpoint like this: ##### Filter by list of elements % SELECT * FROM events WHERE event IN {{Array(event_types, 'String', default='buy,view')}} 1.84MB, 16.38k x 5 ( 5.84ms ) And then the URL of the API Endpoint would be something like `{% user("apiHost") %}/v0/pipes/your_pipe_name.json?event_types=buy,view` ### sql\_and¶ `sql_and` lets you create a filter with `AND` operators and several expressions dynamically, taking into account whether the dynamic parameters used in the template are present in the request URL. It's not possible to use ClickHouse functions inside the `{{ }}` brackets in templates. `sql_and` can only be used with the `{column_name}__{operand}` syntax. This function does the same as what you saw in the previous query: it filters a column by the values present in a tuple generated by `Array(...)` when the operand is `in` , or by values greater than (with the `gt` operand) or less than (with the `lt` operand) a given value. Let's see an example to make it clearer: - Endpoint template code - Generated SQL ##### SQL\_AND AND COLUMN\_\_IN % SELECT *, joinGet(products_join_sku, 'section_id', product_id) section_id FROM events WHERE {{sql_and(event__in=Array(event_types, 'String', default='buy,view'), section_id__in=Array(sections, 'Int16', default='1,2'))}} 11.06MB, 98.30k x 6 ( 8.47ms ) You don't have to provide default values. If you set the `defined` argument of `Array` to `False` , when that parameter is not provided, no SQL expression will be generated.
You can see this in the next code snippet: - Endpoint template code - Generated SQL ##### defined=False % SELECT *, joinGet(products_join_sku, 'section_id', product_id) section_id FROM events WHERE {{sql_and(event__in=Array(event_types, 'String', default='buy,view'), section_id__in=Array(sections, 'Int16', defined=False))}} 2.76MB, 24.58k x 6 ( 5.94ms ) ### split\_to\_array(name, \[default\])¶ This works similarly to `Array` , but it returns an Array of Strings (instead of a tuple). You'll have to cast the result to the type you want after. As you can see here too, they behave in a similar way: ##### array and split\_to\_array % SELECT {{Array(code, 'UInt32', default='1,2,3')}}, {{split_to_array(code, '1,2,3')}}, arrayMap(x->toInt32(x), {{split_to_array(code, '1,2,3')}}), 1 in {{Array(code, 'UInt32', default='1,2,3')}}, '1' in {{split_to_array(code, '1,2,3')}} 1.00B, 1.00 x 5 ( 4.04ms ) One thing that you'll want to keep in mind is that you can't pass non-constant values (arrays, for example) to operations that require them. For example, this would fail: ##### using a non-constant expression where one is required % SELECT 1 IN arrayMap(x->toInt32(x), {{split_to_array(code, '1,2,3')}}) [Error] Element of set in IN, VALUES, or LIMIT, or aggregate function parameter, or a table function argument is not a constant expression (result column not found): arrayMap(lambda(tuple(x), toInt32(x)), ['1', '2', '3']): While processing 1 IN arrayMap(x -> toInt32(x), ['1', '2', '3']). (BAD_ARGUMENTS) If you find an error like this, you should use a Tuple instead (remember that `{{Array(...)}}` returns a tuple). This will work: ##### Use a tuple instead % SELECT 1 IN {{Array(code, 'Int32', default='1,2,3')}} 1.00B, 1.00 x 1 ( 1.63ms ) `split_to_array` is often used with [enumerate_with_last](https://www.tinybird.co/docs/about:blank#enumerate-with-last). ### column and columns¶ They let you select one or several columns from a Data Source or Pipe, given their name. You can also provide a default value. ##### columns % SELECT {{columns(cols, 'date,user_id,event')}} FROM events 1.27MB, 40.96k x 3 ( 4.66ms ) ##### column % SELECT date, {{column(user, 'user_id')}} FROM events 1.57MB, 130.82k x 2 ( 4.03ms ) ### enumerate\_with\_last¶ As the docs say, it creates an iterable array, returning a Boolean value that allows checking if the current element is the last element in the array. Its most common usage is to select several columns, or compute some function over them. We can see an example of `columns` and `enumerate_with_last` here: - Endpoint template code - Generated SQL ##### enumerate\_with\_last \+ columns % SELECT {% if defined(group_by) %} {{columns(group_by)}}, {% end %} sum(price) AS revenue, {% for last, x in enumerate_with_last(split_to_array(count_unique_vals_columns, 'section_id,city')) %} uniq({{symbol(x)}}) as {{symbol(x)}} {% if not last %},{% end %} {% end %} FROM events_enriched {% if defined(group_by) %} GROUP BY {{columns(group_by)}} ORDER BY {{columns(group_by)}} {% end %} If you use the `defined` function around a parameter it doesn't make sense to give it a default value because if it's not provided, that line will never be run. ### error and custom\_error¶ They let you return customized error responses. 
With `error` you can customize the error message: ##### error % {% if not defined(event_types) %} {{error('You need to provide a value for event_types')}} {% end %} SELECT *, joinGet(products_join_sku, 'section_id', product_id) section_id FROM events WHERE event IN {{Array(event_types, 'String')}} ##### error response using error {"error": "You need to provide a value for event_types"} And with `custom_error` you can also customize the response code: ##### custom\_error % {% if not defined(event_types) %} {{custom_error({'error': 'You need to provide a value for event_types', 'code': 400})}} {% end %} SELECT *, joinGet(products_join_sku, 'section_id', product_id) section_id FROM events WHERE event IN {{Array(event_types, 'String')}} ##### error response using custom\_error {"error": "You need to provide a value for event_types", "code": 400} **Note:** `error` and `custom_error` have to be placed at the start of a Node or they won't work. The order should be: 1. `set` lines, to give some parameter a default value (optional) 2. Parameter validation functions: `error` and `custom_error` definitions 3. The SQL query itself ## Putting it all together¶ We've created a Pipe where we use most of these advanced techniques to filter ecommerce events. You can see its live documentation page [here](https://app.tinybird.co/gcp/europe-west3/endpoints/t_e06de80c854d45298d566b93f50840d9?token=p.eyJ1IjogIjdmOTIwMmMzLWM1ZjctNDU4Ni1hZDUxLTdmYzUzNTRlMTk5YSIsICJpZCI6ICI0NDI5OWRkZi1lY2JmLTRkZGItYmM5MS1mMWNmZjNlMjdiNDgifQ.tZ5aOMy9Vp2L2R5qCZpiwysHp9v6bnQBW9aApl1Z3F8) and play with it on Swagger [here.](https://app.tinybird.co/gcp/europe-west3/openapi?url=https%253A%252F%252Fapi.tinybird.co%252Fv0%252Fpipes%252Fopenapi.json%253Ftoken%253Dp.eyJ1IjogIjdmOTIwMmMzLWM1ZjctNDU4Ni1hZDUxLTdmYzUzNTRlMTk5YSIsICJpZCI6ICI0NDI5OWRkZi1lY2JmLTRkZGItYmM5MS1mMWNmZjNlMjdiNDgifQ.tZ5aOMy9Vp2L2R5qCZpiwysHp9v6bnQBW9aApl1Z3F8) This is its code: ##### advanced\_dynamic\_endpoints.pipe NODE events_enriched SQL > SELECT *, price, city, day FROM events_mat_cols ANY LEFT JOIN products_join_sku ON product_id = sku NODE filter_by_price SQL > % SELECT * FROM events_enriched WHERE 1 = 1 {% if defined(min_price) %} AND price >= {{Float32(min_price)}} {% end %} {% if defined(max_price) %} AND price <= {{Float32(max_price)}} {% end %} NODE filter_by_event_type_and_section_id SQL > % SELECT * FROM filter_by_price {% if defined(event_types) or defined(section_ids) %} ... WHERE {{sql_and(event__in=Array(event_types, 'String', defined=False, enum=['remove_item_from_cart','view','search','buy','add_item_to_cart']), section_id__in=Array(section_ids, 'Int32', defined=False))}} {% end %} NODE filter_by_title_or_color SQL > % SELECT * FROM filter_by_event_type_and_section_id {% if defined(search_terms) %} WHERE multiMatchAny(lower(color), {{split_to_array(search_terms)}}) OR multiMatchAny(lower(title), {{split_to_array(search_terms)}}) {% end %} NODE group_by_or_not SQL > % SELECT {% if defined(group_by) %} {{columns(group_by)}}, sum(price) AS revenue, {% for _last, _x in enumerate_with_last(split_to_array(count_unique_vals_columns)) %} uniq({{symbol(_x)}}) as {{symbol(_x)}} {% if not _last %},{% end %} {% end %} {% else %} * {% end %} FROM filter_by_title_or_color {% if defined(group_by) %} GROUP BY {{columns(group_by)}} ORDER BY {{columns(group_by)}} {% end %} NODE pagination SQL > % WITH {{Array(group_by, 'String', '', description='Comma-separated name of columns. If defined, group by and order the results by these columns. 
The sum of revenue will be returned')}}, {{Array(count_unique_vals_columns, 'String', '', description='Comma-separated name of columns. If both group_by and count_unique_vals_columns are defined, the number of unique values in the columns given in count_unique_vals_columns will be returned as well')}}, {{Array(search_terms, 'String', '', description='Comma-separated list of search terms present in the color or title of products')}}, {{Array(event_types, 'String', '', description="Comma-separated list of event name types", enum=['remove_item_from_cart','view','search','buy','add_item_to_cart'])}}, {{Array(section_ids, 'String', '', description="Comma-separated list of section IDs. The minimum value for an ID is 0 and the max is 50.")}} SELECT * FROM group_by_or_not LIMIT {{Int32(page_size, 100)}} OFFSET {{Int32(page, 0) * Int32(page_size, 100)}} To replicate it in your account, copy the previous code to a new file called `advanced_dynamic_endpoints.pipe` locally and run `tb push pipes/advanced_dynamic_endpoints.pipe` with our [CLI](https://www.tinybird.co/docs/docs/cli/install) to push it to your Tinybird account. Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/guides/publishing-data/serverless-analytics-api Last update: 2024-10-28T11:06:14.000Z Content: --- title: "Handling data privacy in serverless analytics APIs · Tinybird Docs" theme-color: "#171612" description: "Creating an Analytics Dashboard where each user is able to access only certain parts of the data is really easy with Tinybird. You don't need to build anything specific from scratch. Tinybird is able to provide dynamic API Endpoints, including specific security requirements per-user." --- # Handling data privacy in serverless analytics APIs¶ Creating an analytics dashboard where each user is able to access only certain parts of the data is really easy with Tinybird. You don't need to build anything specific from scratch. Tinybird provides dynamic API Endpoints, including specific security requirements per-user. ## The serverless approach to real-time analytics¶ Let's assume you have just two components - the simplest possible stack: - ** A frontend application:** Code that runs in the browser. - ** A backend application:** Code that runs in the server and manages both the user authentication and the authorization. Very probably, the backend will also expose an API from where the frontend fetches the information needed. This guide covers the different workflows that will handle each user operation with the right permissions, by integrating your backend with [Static Tokens](https://www.tinybird.co/docs/docs/concepts/auth-tokens#what-should-i-use-tokens-for) in a very simple way. ## Create Tokens on user sign-up¶ The only thing you need (to ensure that your users have the right permissions on your data) is a created Tinybird [Static Token](https://www.tinybird.co/docs/docs/concepts/auth-tokens#what-should-i-use-tokens-for) every time you create a new user in your backend. ##### Creating a Token with filter scope TOKEN= curl -H "Authorization: Bearer $TOKEN" \ -d "name=user_692851_token" \ -d "scope=PIPES:READ:ecommerce_example" \ -d "scope=DATASOURCES:READ:events:user_id=692851" \ https://api.tinybird.co/v0/tokens/ Use a Token with the right scope. Replace `` with a Token whose [scope](https://www.tinybird.co/docs/docs/api-reference/token-api) is `TOKENS` or `ADMIN`. 
This Token will let a given user query their own transactions stored in an `events` Data Source and exposed in an `ecommerce_example` API Endpoint. Some other noteworthy things you can see here: - You can give a `name` to every Token you create. In this case, the name contains the `user_id` , so that it's easier to see what Token is assigned to each user. - You can assign as many scopes to each Token as you want, and `DATASOURCES:READ:datasource_name` and `PIPES:READ:pipe_name` can take an optional SQL filter (like this example does) to restrict the rows that queries authenticated with the Token will have access to. If everything runs successfully, your call will return JSON containing a Token with the specified scopes: ##### Creating a Token with filter scope: Response { "token": "p.eyJ1IjogImI2Yjc1MDExLWNkNGYtNGM5Ny1hMzQxLThhNDY0ZDUxMWYzNSIsICJpZCI6ICI0YTYzZDExZC0zNjg2LTQwN2EtOWY2My0wMzU2ZGE2NmU5YzQifQ.2QP1BRN6fNfgS8EMxqkbfKasDUD1tqzQoJXBafa5dWs", "scopes": [ { "type": "PIPES:READ", "resource": "ecommerce_example", "filter": "" }, { "type": "DATASOURCES:READ", "resource": "events", "filter": "user_id=692851" } ], "name": "user_692851_token" } All the Tokens you create are also visible in your [Workspace > Tokens page](https://app.tinybird.co/tokens) in the UI, where you can create, update and delete them. ## Modify Tokens when user permissions are changed¶ Imagine one of your users is removed from a group, which makes them lose some permissions on the data they can consume. Once that is reflected in your backend, you can [update the user admin Token](https://www.tinybird.co/docs/docs/api-reference/token-api#put--v0-tokens-(.+)) accordingly as follows: ##### Modify an existing Token TOKEN= USER_TOKEN= curl -X PUT \ -H "Authorization: Bearer $TOKEN" \ -d "name=user_692851_token" \ -d "scope=PIPES:READ:ecommerce_example" \ -d "scope=DATASOURCES:READ:events:user_id=692851 and event in ('buy', 'add_item_to_cart')" \ https://api.tinybird.co/v0/tokens/$USER_TOKEN Pass the Token you previously created as a path parameter. Replace `` by the value of `token` from the previous response, or [copy it from the UI](https://app.tinybird.co/tokens). In this example you'd be restricting the SQL filter of the `DATASOURCES:READ:events` scope to restrict the type of events the user will be able to read from the `events` Data Source. This is the response you'd see from the API: ##### Modify an existing Token: Response { "token": "p.eyJ1IjogImI2Yjc1MDExLWNkNGYtNGM5Ny1hMzQxLThhNDY0ZDUxMWYzNSIsICJpZCI6ICI0YTYzZDExZC0zNjg2LTQwN2EtOWY2My0wMzU2ZGE2NmU5YzQifQ.2QP1BRN6fNfgS8EMxqkbfKasDUD1tqzQoJXBafa5dWs", "scopes": [ { "type": "PIPES:READ", "resource": "ecommerce_example", "filter": "" }, { "type": "DATASOURCES:READ", "resource": "events", "filter": "user_id=692851 and event in ('buy', 'add_item_to_cart')" } ], "name": "user_692851_token" } ## Delete Tokens after user deletion¶ Whenever a user is removed from your system, you should also [remove the Token from Tinybird](https://www.tinybird.co/docs/docs/api-reference/token-api#delete--v0-tokens-(.+)) . That will make things easier for you in the future. ##### Remove a token TOKEN= USER_TOKEN= curl -X DELETE \ -H "Authorization: Bearer $TOKEN" \ https://api.tinybird.co/v0/tokens/$USER_TOKEN If the Token is successfully deleted, this request will respond with no content and a 204 status code. ## Refresh Tokens¶ It's a good practice to change Tokens from time to time, so you can automate this in your backend as well. 
Refreshing a Token requires executing this request for every one of your users: ##### Refresh a token TOKEN= USER_TOKEN= curl -X POST \ -H "Authorization: Bearer $TOKEN" \ https://api.tinybird.co/v0/tokens/$USER_TOKEN/refresh ## Next steps¶ - Learn more about the[ Tokens API](https://www.tinybird.co/docs/docs/api-reference/token-api) . - Understand the concept of[ Static Tokens](https://www.tinybird.co/docs/docs/concepts/auth-tokens#what-should-i-use-tokens-for) . --- URL: https://www.tinybird.co/docs/guides/publishing-data/share-endpoint-documentation Last update: 2024-07-31T16:54:46.000Z Content: --- title: "Share API Endpoints documentation · Tinybird Docs" theme-color: "#171612" description: "In this guide you'll learn how to share your Tinybird API Endpoint documentation with development teams." --- # Share Tinybird API Endpoint documentation¶ In this guide, you'll learn how to share your Tinybird API Endpoint documentation with development teams. ## The Tinybird API Endpoint page¶ When you publish an API Endpoint, Tinybird generates a documentation page for you that is ready to share and OpenAPI-compatible (v3.0). It contains your API Endpoint description, information about the dynamic parameters you can use when querying this Endpoint, and code snippets for quickly integrating your API in 3rd party applications. To share your published API Endpoint, navigate to the "Create Chart" button (top right of the UI) > "Share this API Endpoint" modal. ## Use Static Tokens to define API Endpoint subsets¶ Tinybird authentication is based on [Tokens](https://www.tinybird.co/docs/docs/concepts/auth-tokens) which contain different scopes for specific resources. For example, a Token lets you read from one or many API Endpoints, or get write permissions for a particular Data Source. If you take a closer look at the URLs generated for sharing a public API Endpoint page, you'll see that after the Endpoint ID, it includes a Token parameter. This means that this page is only accessible if the Token provided in the URL has read permissions for it: https://api.tinybird.co/endpoint/t_bdcad2252e794c6573e21e7e?token= For security, Tinybird automatically generates a read-only Token when sharing a public API Endpoint page for the first time. If you don't explicitly use it, your Admin Token won't ever get exposed. ### The API Endpoints list page¶ Tinybird also allows you to render the API Endpoints information for a given Token. https://app.tinybird.co///endpoints?token= Enter the URL above (with your Token and the provider and region where the API Endpoint is published) into the browser, and it'll return a list that shows all API Endpoints that this Token can read from. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fsharing-endpoints-documentation-with-development-teams-2.png&w=3840&q=75) <-figcaption-> The API Endpoints list page is extremely useful for sharing your API Endpoint documentation with development teams When integrating your API Endpoint in your applications, it is highly recommended that you manage dedicated Tokens. The easiest way is to create a Token for every application environment, so that you can also track the different requests to your API Endpoints by application, and choose which API Endpoints are accessible for them. Once you do that, you can share auto-generated documentation with ease, without compromising your data privacy and security. API Endpoint docs pages include a read Token by default.
In the "Share this API Endpoint" modal, you can also see public URLs for every Token with read permissions for your Pipe. ## Browse your docs in Swagger 😎¶ As mentioned above, all Tinybird's documentation is compatible with OpenAPI 3.0 and accessible via API. A quick way of generating documentation in Swagger is navigating to the "Create Chart" button > "Share this API Endpoint" modal > "OpenAPI 3.0" tab, copying the "Shareable link" URL, and using it in your preferred Swagger installation. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fsharing-endpoints-documentation-with-development-teams-3.png&w=3840&q=75) <-figcaption-> You can generate as many URLs as you need by using different Tokens If you use a Token with permissions for more than one API Endpoint, the Swagger documentation will contain information about all the API Endpoints at once. ## Next steps¶ - You've got Endpoints, now make them pretty: Use[ Tinybird Charts](https://www.tinybird.co/docs/docs/publish/charts) . - Learn how to[ monitor and analyze your API performance](https://www.tinybird.co/docs/docs/guides/monitoring/analyze-endpoints-performance) . --- URL: https://www.tinybird.co/docs/guides/querying-data/adapt-postgres-queries Last update: 2024-11-19T13:41:56.000Z Content: --- title: "Adapt Postgres queries · Tinybird Docs" theme-color: "#171612" description: "Postgres is great as an OLTP database, but if you need real-time queries on hundreds of millions of rows, it’s not going to be fast enough. ClickHouse® is. Most queries from Postgres will look very similar on ClickHouse. Here you'll learn how to adapt most of the Postgres queries to run on ClickHouse." --- # Adapt Postgres queries¶ In this guide, you'll learn how to adapt Postgres queries to run on ClickHouse®, looking at a specific example. Postgres is great as an OLTP database, but if you need real-time queries on hundreds of millions of rows, it's not going to be fast enough. ClickHouse is. Most queries from Postgres will look very similar on ClickHouse. [Haki Benita](https://twitter.com/be_haki) has a [great blog post](https://hakibenita.com/sql-for-data-analysis) on how to do lots of operations like pivot tables, subtotals, linear regression, binning, or interpolation on Postgres, coming from a Pandas background. In this guide, you'll see how to adapt most of the Postgres queries from Haki's post to run on ClickHouse. ## Prerequisites¶ You don't need an active Tinybird Workspace to read through this guide, but it's a good idea to read Haki's post first so you're familiar with the examples. In addition, a working knowledge of ClickHouse and SQL is required. ## Common table expressions¶ ClickHouse [supports CTEs](https://clickhouse.tech/docs/en/sql-reference/statements/select/with/) . Both the `WITH AS ` as well as the `WITH AS ` syntaxes are supported. WITH emails AS ( SELECT 'ME@hakibenita.com' AS email ) SELECT * FROM emails ; WITH emails AS ( SELECT 'ME@hakibenita.com' AS email ) SELECT * FROM emails Query id: 6b234b03-6dc4-4ddf-8454-b03d34b75b60 ┌─email─────────────┐ │ ME@hakibenita.com │ └───────────────────┘ 1 rows in set. Elapsed: 0.014 sec. 
##### A CHAINED CTE ON POSTGRES WITH emails AS ( SELECT 'ME@hakibenita.com' AS email ), normalized_emails AS ( SELECT lower(email) AS email FROM emails ) SELECT * FROM normalized_emails; WITH emails AS ( SELECT 'ME@hakibenita.com' AS email ), normalized_emails AS ( SELECT lower(email) AS email FROM emails ) SELECT * FROM normalized_emails Query id: c511a113-1852-4a9f-90bf-99d33eba8254 ┌─email─────────────┐ │ me@hakibenita.com │ └───────────────────┘ 1 rows in set. Elapsed: 0.127 sec. ### On Tinybird¶ For now, we only support the `WITH <expression> AS <identifier>` syntax. So the previous queries would have to be rewritten like this: ##### CTES ON TINYBIRD WITH (SELECT 'ME@hakibenita.com') AS email SELECT email 2.00B, 2.00 x 1 ( 1.35ms ) There's a difference between CTEs on Postgres and on ClickHouse. In Postgres, as the original post says, "CTEs are a great way to split a big query into smaller chunks, perform recursive queries and even to cache intermediate results". On ClickHouse, CTEs can only return one row, so those intermediate results can't have multiple rows. For a similar result, on ClickHouse you have to use subqueries. A common pattern is returning a tuple of groupArrays in the CTE, so you can return more than one row in the form of arrays, and then consuming the results in the main query, for instance with transform or arrayJoin. In Tinybird, Pipes act like notebooks where each Node is a subquery and you can refer to the results of one Node in another Node. It's great to see intermediate results and reduce the complexity of your queries. If you'd like to try it out, [sign up here](https://www.tinybird.co/signup?utm_source=blog). ## Generating data¶ As you see in the original article, [in Postgres](https://hakibenita.com/sql-for-data-analysis#generating-data) there are several ways to generate data: ## Union all¶ This works the same in ClickHouse as in Postgres: ##### UNION ALL ALSO WORKS WITH dt AS ( SELECT 1 AS id, 'haki' AS name UNION ALL SELECT 2, 'benita' ) SELECT * FROM dt; WITH dt AS ( SELECT 1 AS id, 'haki' AS name UNION ALL SELECT 2, 'benita' ) SELECT * FROM dt Query id: e755e5a5-5e5b-4e8a-a262-935f9946d45d ┌─id─┬─name───┐ │ 2 │ benita │ └────┴────────┘ ┌─id─┬─name─┐ │ 1 │ haki │ └────┴──────┘ 2 rows in set. Elapsed: 0.051 sec. The VALUES keyword won't work on ClickHouse to select data, only to insert it. ## Joining data¶ The join syntax from Postgres will work on ClickHouse, but typically the kinds of analytical data that you'll store on ClickHouse will be orders of magnitude bigger than what you'd store on Postgres, and this would make your joins slow. There are ways to make JOINs faster; check the [best practices for writing faster SQL queries](https://www.tinybird.co/docs/docs/query/sql-best-practices) or contact us for guidance. ## Unnest - arrayJoin¶ [arrayJoin](https://clickhouse.tech/docs/en/sql-reference/functions/array-join/) is the ClickHouse equivalent of unnest on Postgres. So this Postgres query: ##### ON POSTGRES, UNNEST EXPANDS AN ARRAY INTO ROWS WITH dt AS ( SELECT unnest(array[1, 2]) AS n ) SELECT * FROM dt; Would be rewritten on ClickHouse like this: ##### ON CLICKHOUSE, ARRAYJOIN EXPANDS AN ARRAY INTO ROWS SELECT arrayJoin([1, 2]) AS dt 1.00B, 1.00 x 1 ( 0.79ms ) ## Generating series of data¶ ### Generate\_series¶ The generate_series function doesn't exist on ClickHouse, but with the [numbers](https://www.tinybird.co/docs/docs/) function we can accomplish a lot as well.
This is its basic usage: ##### NUMBERS PRODUCES ROWS SELECT * FROM numbers(10) 80.00B, 10.00 x 1 ( 0.71ms ) A similar result can be obtained with the [range](https://clickhouse.tech/docs/en/sql-reference/functions/array-functions/#rangeend-rangestart-end-step) function, which returns arrays. If you only provide one argument, it behaves like numbers, and with range you can also specify a start, end and step: ##### RANGE PRODUCES ARRAYS SELECT range(10), range(0, 10, 2) 1.00B, 1.00 x 2 ( 0.82ms ) This, combined with arrayJoin, lets us do the same as generate_series: ##### RANGE OF INTEGERS USING START, END AND STEP SELECT arrayJoin(range(0, 10, 2)) AS number 1.00B, 1.00 x 1 ( 0.98ms ) ## Generating time series¶ generate_series can produce results with types other than integers, while range only outputs integers. But with some smart logic we can achieve the same results. For example, on Postgres you'd generate a series with a datetime for each hour in a day this way, as in the original [post](https://hakibenita.com/sql-for-data-analysis#generating-data): ##### THE GENERATE\_SERIES FUNCTION OF POSTGRES WITH daterange AS ( SELECT * FROM generate_series( '2021-01-01 UTC'::timestamptz, -- start '2021-01-02 UTC'::timestamptz, -- stop interval '1 hour' -- step ) AS t(hh) ) SELECT * FROM daterange; hh ──────────────────────── 2021-01-01 00:00:00+00 2021-01-01 01:00:00+00 2021-01-01 02:00:00+00 ... 2021-01-01 22:00:00+00 2021-01-01 23:00:00+00 2021-01-02 00:00:00+00 ### Generate a time series specifying the start date and the number of intervals¶ On ClickHouse, you can achieve the same this way: ##### TIME SERIES ON CLICKHOUSE GIVEN START DATE, NUMBER OF INTERVALS AND INTERVAL SIZE WITH toDate('2021-01-01') as start SELECT addHours(toDate(start), number) AS hh FROM ( SELECT arrayJoin(range(0, 24)) AS number ) 1.00B, 1.00 x 1 ( 1.45ms ) ### Generate a time series specifying the start and end date and the step¶ Another way of doing the same thing: ##### TIME SERIES ON CLICKHOUSE GIVEN THE START AND END DATE AND THE STEP SIZE WITH toStartOfDay(toDate('2021-01-01')) AS start, toStartOfDay(toDate('2021-01-02')) AS end SELECT arrayJoin(arrayMap(x -> toDateTime(x), range(toUInt32(start), toUInt32(end), 3600))) as hh 1.00B, 1.00 x 1 ( 1.78ms ) ### Generate a time series using timeSlots¶ Using the [timeSlots](https://clickhouse.tech/docs/en/sql-reference/functions/date-time-functions/#timeslotsstarttime-duration-size) function, we can specify the start (DateTime), duration (seconds) and step (seconds) and it generates an array of DateTime values. ##### TIME SERIES ON CLICKHOUSE USING THE TIMESLOTS FUNCTION WITH toDateTime('2021-01-01 00:00:00') AS start SELECT arrayJoin(timeSlots(start, toUInt32(24 * 3600), 3600)) AS hh 1.00B, 1.00 x 1 ( 1.25ms ) ## Generating a random value¶ The [rand](https://clickhouse.tech/docs/en/sql-reference/functions/random-functions/#rand) function in ClickHouse is akin to random in Postgres, with the difference that rand returns a random UInt32 number between 0 and 4294967295 (2^32 - 1). So to get random floats between 0 and 1 like random, you have to divide the result by 4294967295.
##### GENERATING A RANDOM VALUE ON CLICKHOUSE WITH THE RAND FUNCTION SELECT rand() random_int, random_int / 4294967295 random_float 1.00B, 1.00 x 2 ( 0.75ms ) To get more than one row, you'd simply do: ##### GENERATING SEVERAL RANDOM VALUES ON CLICKHOUSE SELECT rand() random_int, random_int / 4294967295 random_float FROM numbers(100) 800.00B, 100.00 x 2 ( 1.37ms ) ## Generating random integers within a range¶ You'd apply the floor or ceil function (not round, for the reasons explained [here](https://hakibenita.com/sql-for-data-analysis#random) ) to the result of rand, normalized and multiplied by the max of the range of integers you want to generate, like this: ##### GENERATING SEVERAL RANDOM INTEGERS IN A GIVEN RANGE ON CLICKHOUSE SELECT ceil(rand() / 4294967295 * 3) AS n FROM numbers(10) 80.00B, 10.00 x 1 ( 1.38ms ) And here you can see that the distribution is uniform (this wouldn't happen if you had used round): ##### THE DISTRIBUTION IS UNIFORM SELECT ceil(rand() / 4294967295 * 3) AS n, count(*) FROM numbers(10000) GROUP BY n ORDER BY n 80.00KB, 10.00k x 2 ( 3.07ms ) ## Sampling data from a list¶ This is how you'd take samples with replacement from a list in Postgres: ##### RANDOM CHOICE IN POSTGRES SELECT (array['red', 'green', 'blue'])[ceil(random() * 3)] AS color FROM generate_series(1, 5); In ClickHouse, this is how you'd do it: ##### RANDOM CHOICE IN CLICKHOUSE SELECT ['red', 'green', 'blue'][toInt32(ceil(rand() / 4294967295 * 3))] AS color FROM numbers(5) 40.00B, 5.00 x 1 ( 2.26ms ) To get only one value, you'd remove the FROM numbers(5) part. Note that to define an array on ClickHouse you can either call array('red', 'green', 'blue') or use ['red', 'green', 'blue'] like in the code snippet. ## Sampling data from a table¶ Sorting data by rand() can be used to get a random sample, like here: ##### SAMPLE DATA USING RAND SELECT * FROM events_mat ORDER BY rand() LIMIT 100 6.80GB, 50.00m x 8 ( 398.46ms ) But this is slow, as a full scan of the table has to be run here. A more efficient way to do it is using the [SAMPLE](https://clickhouse.tech/docs/en/sql-reference/statements/select/sample/) clause. You can pass an integer to it (it should be large enough, typically above 1000000): ##### THE SAMPLE CLAUSE, PASSING AN INTEGER SELECT * FROM events_mat SAMPLE 1000000 2.23MB, 16.38k x 8 ( 5.46ms ) And you can also pass a float between 0 and 1, to indicate the fraction of the data that will be sampled. ##### THE SAMPLE CLAUSE, PASSING A FLOAT SELECT * FROM events_mat SAMPLE 0.01 53.50MB, 393.22k x 8 ( 43.28ms ) ## Descriptive statistics on a numeric series¶ ClickHouse also comes with lots of statistical functions, like Postgres does (see [this section](https://hakibenita.com/sql-for-data-analysis#describing-a-series) of the original post).
The first query, written on Postgres this way: ##### DESCRIPTIVE STATISTICS ON POSTGRES WITH s AS ( SELECT * FROM (VALUES (1), (2), (3)) AS t(n) ) SELECT count(*), avg(n), stddev(n), min(n), percentile_cont(array[0.25, 0.5, 0.75]) WITHIN GROUP (ORDER BY n), max(n) FROM s; count │ avg │ stddev │ min │ percentile_cont │ max ──────┼────────┼───────────┼─────┼─────────────────┼───── 3 │ 2.0000 │ 1.0000000 │ 1 │ {1.5,2,2.5} │ 3 Would be done on ClickHouse like this: ##### DESCRIPTIVE STATISTICS ON CLICKHOUSE SELECT count(*), avg(n), stddevSamp(n), min(n), quantiles(0.25, 0.5, 0.75)(n), max(n) FROM (SELECT arrayJoin([1,2,3]) AS n) 1.00B, 1.00 x 6 ( 3.35ms ) ## Descriptive statistics on categorical series¶ ClickHouse can also be used to get some statistics from discrete values. While on Postgres you'd do this: ##### DESCRIPTIVE STATISTICS OF CATEGORICAL VALUES ON POSTGRES WITH s AS (SELECT unnest(array['a', 'a', 'b', 'c']) AS v) SELECT count(*), count(DISTINCT V) AS unique, mode() WITHIN GROUP (ORDER BY V) AS top FROM s; count │ unique │ top ───────┼────────┼───── 4 │ 3 │ a On ClickHouse you'd do: ##### DESCRIPTIVE STATISTICS OF CATEGORICAL VALUES ON CLICKHOUSE SELECT count(*) AS count, uniq(v) AS unique, topK(1)(v) top FROM (SELECT arrayJoin(['a', 'b', 'c', 'd']) AS v) 1.00B, 1.00 x 3 ( 3.17ms ) [uniq](https://clickhouse.tech/docs/en/sql-reference/aggregate-functions/reference/uniq/) will provide approximate results when your data is very big. If you need exact results you can use [uniqExact](https://clickhouse.tech/docs/en/sql-reference/aggregate-functions/reference/uniqexact/) , but be aware that uniq will generally be faster than uniqExact. Check out the [topK](https://clickhouse.tech/docs/en/sql-reference/aggregate-functions/reference/topk/) docs as well if you're interested. As a side note, if you have categorical columns, most likely you'll get [better performance](https://altinity.com/blog/2019/3/27/low-cardinality) and lower storage cost with the LowCardinality data type. The performance of using LowCardinality will be better than using the base data types even on columns with more than a few million different values. This is what Instana found out as well - read [their full post here](https://www.instana.com/blog/reducing-clickhouse-storage-cost-with-the-low-cardinality-type-lessons-from-an-instana-engineer/): *"When we came across the LowCardinality data type the first time, it seemed like nothing we could use. We assumed that our data is just not homogeneous enough to be able to use it. But when looking at it recently again, it turns out we were very wrong. The name LowCardinality is slightly misleading. It actually can be understood as a dictionary. And according to our tests, it still performs better and is faster even when the column contains millions of different values"* ## Subtotals and aggregations¶ The same operations done in [this section](https://hakibenita.com/sql-for-data-analysis#subtotals) of Haki's post can be done with ClickHouse. Given a table that contains this data: ##### EMPLOYEES SELECT * FROM employees 261.00B, 6.00 x 3 ( 2.36ms ) Finding the number of employees with each role is straightforward, with the same syntax as on Postgres: ##### EMPLOYEES PER ROLE SELECT department, role, count(*) count FROM employees GROUP BY department, role 182.00B, 6.00 x 3 ( 3.02ms ) ### Using rollup and cube¶ The ROLLUP modifier is also available on ClickHouse, although the syntax is slightly different than on Postgres.
This query on Postgres: ##### GROUP BY WITH ROLLUP ON POSTGRES SELECT department, role, COUNT(*) FROM employees GROUP BY ROLLUP(department, role); would be written on ClickHouse like this: ##### GROUP BY WITH ROLLUP ON CLICKHOUSE SELECT department, role, COUNT(*) FROM employees GROUP BY department, role WITH ROLLUP 182.00B, 6.00 x 3 ( 2.79ms ) ROLLUP gives you additional subtotals (but not all of them). To have all the subtotals for all the possible combinations of grouping keys, you need to use the [CUBE](https://clickhouse.tech/docs/en/sql-reference/statements/select/group-by/#with-cube-modifier) modifier: ##### GROUP BY WITH CUBE ON CLICKHOUSE SELECT department, role, COUNT(*) FROM employees GROUP BY department, role WITH CUBE 182.00B, 6.00 x 3 ( 3.45ms ) ## Pivot tables and conditional expressions¶ Pivot tables let you reshape data: typically you have a column with keys, a column with categories and a column with values, and you want to aggregate those values and use the categories as the columns of a new table. On Postgres you could do it this way: ##### PIVOT TABLE CREATED MANUALLY ON POSTGRES SELECT role, COUNT(*) FILTER (WHERE department = 'R&D') as "R&D", COUNT(*) FILTER (WHERE department = 'Sales') as "Sales" FROM employees GROUP BY role; role │ R&D │ Sales ───────────┼─────┼─────── Manager │ 1 │ 1 Developer │ 2 │ 2 On ClickHouse, you could do the same this way: ##### PIVOT TABLE CREATED MANUALLY ON CLICKHOUSE SELECT role, countIf(department = 'R&D') as "R&D", countIf(department = 'Sales') as "Sales" FROM employees GROUP BY role 182.00B, 6.00 x 3 ( 3.43ms ) The original [post](https://hakibenita.com/sql-for-data-analysis#pivot-tables) doesn't mention this, but Postgres has a very convenient [crosstab function](https://www.postgresql.org/docs/9.1/tablefunc.html) that lets you do what we've done here in one line. If the number of categories to pivot is large, you can imagine how long this query could become if done manually and how handy the crosstab function can be. Something like this is not available yet on ClickHouse, unfortunately. ## Running and Cumulative Aggregations¶ Aggregations over sliding windows are a common need. This can be done with window functions, or with the [groupArrayMovingSum](https://clickhouse.tech/docs/en/sql-reference/aggregate-functions/reference/grouparraymovingsum/) and [groupArrayMovingAvg](https://clickhouse.tech/docs/en/sql-reference/aggregate-functions/reference/grouparraymovingavg/) functions, which have been available in stable releases for a long time. This is an example of their usage. Given this dataset: ##### GOOGLE TRENDS DATA FOR THE TERM 'AMAZON' SELECT date, amazon as value FROM amazon_trends 8.42KB, 1.40k x 2 ( 1.79ms ) We could compute a 7-day moving average of value like this: ##### 7-DAY MOVING AVERAGE SELECT * FROM (SELECT groupArray(date) as date_arr, groupArray(value) as value_arr, groupArrayMovingAvg(7)(value) mov_avg FROM (SELECT date, amazon as value FROM amazon_trends)) ARRAY JOIN * 8.42KB, 1.40k x 3 ( 11.75ms ) The periods parameter is optional. If you omit it, all the previous rows are used for the aggregation.
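On ClickHouse versions with window function support, the same 7-day moving average can also be written with an `OVER` clause. This is a sketch over the same `amazon_trends` data; the frame bounds are the only assumption here:

##### 7-DAY MOVING AVERAGE WITH A WINDOW FUNCTION (SKETCH)
SELECT
    date,
    value,
    -- average of the current row and the 6 preceding rows, ordered by date
    avg(value) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS mov_avg
FROM (SELECT date, amazon AS value FROM amazon_trends)
ORDER BY date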
## Linear regression¶ Given this data ##### DATA TO PERFORM LINEAR REGRESSION SELECT arrayJoin([[1.2, 1], [2, 1.8], [3.1, 2.9]])[1] x, arrayJoin([[1.2, 1], [2, 1.8], [3.1, 2.9]])[2] y 1.00B, 1.00 x 2 ( 2.71ms ) On Postgres, we can see in the [original post](https://hakibenita.com/sql-for-data-analysis#linear-regression) that you'd do linear regression like this: ##### LINEAR REGRESSION ON POSTGRES WITH t AS (SELECT * FROM (VALUES (1.2, 1.0), (2.0, 1.8), (3.1, 2.9) ) AS t(x, y)) SELECT regr_slope(y, x) AS slope, regr_intercept(y, x) AS intercept, sqrt(regr_r2(y, x)) AS r FROM t; slope │ intercept │ r ────────────────────┼──────────────────────┼─── 1.0000000000000002 │ -0.20000000000000048 │ 1 On ClickHouse, the simpleLinearRegression aggregate function gives you the slope and the intercept. There's no function like regr_r2 that gives you the R2 coefficient, but it wouldn't be hard to calculate it yourself as the formula is [simple](https://www.google.com/search?q=r2+formula). ## Filling null values.¶ This part is called " [interpolation](https://hakibenita.com/sql-for-data-analysis#interpolation) " in Haki's post. Filling null values with Pandas is a one-liner. Imagine we have this table ##### A TABLE WITH AN INT AND A STRING COLUMN, WITH SOME NULL VALUES SELECT * FROM num_str 88.00B, 7.00 x 2 ( 2.62ms ) ## Fill null values with a constant value¶ The way to replace all the null values with a constant value is the [coalesce](https://clickhouse.tech/docs/en/sql-reference/functions/functions-for-nulls/#coalesce) function, which works in ClickHouse the same way it does in Postgres: ##### FILLING NULL VALUES WITH A CONSTANT SELECT n, coalesce(v, 'X') AS v FROM num_str 88.00B, 7.00 x 2 ( 3.86ms ) ## Back and forward filling data¶ This is also a one-liner in Pandas. In the corresponding [section](https://hakibenita.com/sql-for-data-analysis#back-and-forward-fill) of the original post, the author does this using correlated subqueries, but those [aren't supported yet](https://github.com/ClickHouse/ClickHouse/issues/6697) on ClickHouse. Fortunately, ClickHouse comes with a lot of powerful array functions like [groupArray](https://clickhouse.tech/docs/en/sql-reference/aggregate-functions/reference/grouparray/), [arrayFill](https://clickhouse.tech/docs/en/sql-reference/functions/array-functions/#array-fill) and [arrayReverseFill](https://clickhouse.tech/docs/en/sql-reference/functions/array-functions/#array-reverse-fill). groupArray, like the other [aggregate functions](https://clickhouse.tech/docs/en/sql-reference/aggregate-functions/) on ClickHouse, skips null values. So the solution involves replacing them by another value (make sure that the new value doesn't already appear in the column). This is done with the ifNull function. Add some array magic in, and this is how you'd do it: ##### BACK AND FORWARD FILLING VALUES SELECT values.1 n, values.2 v, values.3 v_ffill, values.4 v_bfill FROM (SELECT arrayJoin( arrayZip( groupArray(n) AS n, arrayMap(x -> x != 'wadus' ? x : null, groupArray(v_nulls_replaced)) AS v, arrayFill(x -> x != 'wadus', groupArray(v_nulls_replaced)) AS v_ffill, arrayReverseFill(x -> x != 'wadus', groupArray(v_nulls_replaced)) AS v_bfill ) ) values FROM (SELECT *, ifNull(v, 'wadus') v_nulls_replaced FROM num_str ORDER BY n ASC) ) 88.00B, 7.00 x 4 ( 12.43ms ) To understand what's going on here, import [this Pipe](https://www.tinybird.co/docs/docs/assets/pipes/back_and_forward_fill_on_clickhouse.pipe) with a step-by-step explanation and the results of the transformations that are taking place.
Tinybird lets you run each subquery in a Node of a notebook-like UI (we call them Pipes). This lets you build and debug complex queries in a cleaner way. If you'd like to use it, sign up [here](https://www.tinybird.co/signup). ## Filling gaps in time-series, reshaping indexes.¶ Sometimes, you'll group a time-series by a Date or DateTime column, and it can happen that the intervals between rows are not always the same because there were no values found for some dates or datetimes. In Pandas, the solution would be creating a new date_range index and then re-indexing the original Series/DataFrame with that index, as explained [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html). On ClickHouse, the same can be accomplished with the [WITH FILL modifier](https://clickhouse.tech/docs/en/sql-reference/statements/select/order-by/#orderby-with-fill) . Here's a simple example of it: ##### FILLING GAPS IN TIME ON CLICKHOUSE SELECT toDate((number * 2) * 86400) AS d1, 'string_value' as string_col, toInt32(rand() / exp2(32) * 100) as n FROM numbers(10) ORDER BY d1 WITH FILL STEP 1 80.00B, 10.00 x 3 ( 11.19ms ) The `STEP 1` part is not necessary here as it's the default, but know that you can set a different value than 1. ## Linear interpolation¶ Imagine you have a table like this, containing a time-series with some rows missing. We could fill those missing gaps with the WITH FILL expression previously shown, but that way we'd just get zeroes when there's a missing value, while the actual missing value is probably closer to the previous and the next values than to zero. ##### TIME-SERIES WITH MISSING ROWS SELECT *, bar(value, 0, 100, 20) FROM trends_with_gaps ORDER BY date WITH FILL STEP 1 5.68KB, 947.00 x 3 ( 2.72ms ) Linear interpolation is the simplest way to fill those missing values. In it, missing values are replaced by the average of the previous and the next known values. On Postgres, Haki's post explains how to do it [here](https://hakibenita.com/sql-for-data-analysis#linear-interpolation). On ClickHouse, this can be done with arrays. Check out [this](https://altinity.com/blog/harnessing-the-power-of-clickhouse-arrays-part-3) great post by Altinity for an in-depth explanation. This is how it'd be done with this dataset: ##### INTERPOLATING VALUES SELECT date, value, value_interpolated, bar(value_interpolated, 0, 100, 20) AS value_interpolated_bar FROM ( SELECT groupArray(date) AS dt_arr, groupArray(value) AS value_arr, arrayFill(x -> ((x.1) > 0), arrayZip(value_arr, dt_arr)) AS value_lower, arrayReverseFill(x -> ((x.1) > 0), arrayZip(value_arr, dt_arr)) AS value_upper, arrayMap((l, u, v, dt) -> if(v > 0, v, (l.1) + ((((u.1) - (l.1)) / ((u.2) - (l.2))) * (dt - (l.2)))), value_lower, value_upper, value_arr, dt_arr) AS value_interpolated FROM ( SELECT * FROM trends_with_gaps ORDER BY date WITH FILL STEP 1 ) ) ARRAY JOIN dt_arr AS date, value_interpolated, value_arr AS value 5.68KB, 947.00 x 4 ( 10.23ms ) For a step-by-step explanation of how this works, and to see how you could construct this query iteratively with a notebook-like interface on Tinybird, import [this Pipe](https://www.tinybird.co/docs/docs/assets/pipes/interpolating_gaps.pipe). ## Binning and histograms¶ The original post talks about custom binning, equal-width and equal-height binning. The way to do custom binning is very similar on ClickHouse, which also supports CASE statements. 
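For instance, this is a sketch of custom binning over the `trends_with_gaps` table used above; the bucket edges are arbitrary assumptions:

##### CUSTOM BINNING WITH A CASE EXPRESSION (SKETCH)
SELECT
    -- hand-picked, arbitrary bucket edges
    CASE
        WHEN value < 25 THEN 'low'
        WHEN value < 75 THEN 'medium'
        ELSE 'high'
    END AS bucket,
    count() AS rows_in_bucket
FROM trends_with_gaps
GROUP BY bucket
ORDER BY bucket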
Equal height binning could be achieved with the [quantiles](https://clickhouse.tech/docs/en/sql-reference/aggregate-functions/reference/quantiles/) function, already described in the descriptive statistics section. The most interesting use-case of equal-width binning is creating histograms, which is very easy on ClickHouse. It even comes with a [histogram](https://clickhouse.tech/docs/en/sql-reference/aggregate-functions/parametric-functions/#histogram) function, which receives a number of bins and the data, and returns a list of tuples containing the lower and upper bounds of each bucket, as well as its height: ##### THE HISTOGRAM FUNCTION SELECT histogram(10)(value) AS values FROM trends_with_gaps 3.79KB, 947.00 x 1 ( 2.29ms ) You can then extract the values from each tuple and even build a visual representation of the data, for example with the bar function used in the interpolation examples above. ## Running ClickHouse without worrying about it¶ [Tinybird](https://www.tinybird.co/) lets you do real-time analytics on huge amounts of data, powered by ClickHouse, without having to worry about scalability, hosting or maintaining any ClickHouse clusters. With it, you can ingest huge datasets and streaming data, analyze it with SQL and publish dynamic API Endpoints on top of those queries in a couple of clicks. Our [product](https://www.tinybird.co/product) is already being used by some big companies and we've recently been featured on [Techcrunch](https://techcrunch.com/2021/07/05/tinybird-turns-raw-data-into-realtime-api-at-scale) . To use Tinybird, [sign up here](https://www.tinybird.co/signup?utm_source=blog) . Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/guides/querying-data/deduplication-strategies Last update: 2024-11-13T15:42:17.000Z Content: --- title: "Deduplicate data in your Data Source · Tinybird Docs" theme-color: "#171612" description: "Learn several strategies for deduplicating data in Tinybird." --- # Deduplicate data in your Data Source¶ Sometimes you might need to deduplicate data, for example when receiving updates or data from a transactional database through CDC. You might want to retrieve only the latest data point, or keep a historic record of the evolution of the attributes of an object over time. Because ClickHouse® doesn't enforce uniqueness for primary keys when inserting rows, you need to follow different strategies to deduplicate data with minimal side effects. ## Deduplication strategies¶ You can use one of the following strategies to deduplicate your data. | Method | When to use | | --- | --- | | [ Deduplicate at query time](https://www.tinybird.co/docs/about:blank#deduplicate-at-query-time) | Deduplicate data at query time if you are still prototyping or the Data Source is small. | | [ Use ReplacingMergeTree](https://www.tinybird.co/docs/about:blank#use-the-replacingmergetree-engine) | Use `ReplacingMergeTree` or `AggregatingMergeTree` for greater performance. | | [ Snapshot based deduplication](https://www.tinybird.co/docs/about:blank#snapshot-based-deduplication) | If data freshness isn't required, generate periodic snapshots of the data and take advantage of subsequent Materialized Views for rollups. | | [ Hybrid approach using Lambda architecture](https://www.tinybird.co/docs/about:blank#hybrid-approach-using-lambda-architecture) | When you need to overcome engine approach limitations while preserving freshness, combine approaches in a Lambda architecture.
| For dimensional and small tables, a periodical full replace is usually the best option. ## Example case¶ Consider a dataset from a social media analytics company that wants to track some data content over time. You receive an event with the latest info for each post, identified by `post_id` . The three fields, `views`, `likes`, `tags` , vary from event to event. For example: ##### post.ndjson { "timestamp": "2024-07-02T02:22:17", "post_id": 956, "views": 856875, "likes": 2321, "tags": "Sports" } ## Deduplicate at query time¶ Imagine you're only interested in the latest value of views for each post. In that case, you can deduplicate data on `post_id` and get the latest value with these strategies: - Get the max date for each post in a subquery and then filter by its results. - Group data by `post_id` and use the `argMax` function. - Use the `LIMIT BY` clause. Select `Subquery`, `argMax` , or `LIMIT BY` to see the example queries for each. - Subquery - argMax - LIMIT BY ##### Deduplicating data on post\_id using Subquery SELECT * FROM posts_info WHERE (post_id, timestamp) IN ( SELECT post_id, max(timestamp) FROM posts_info GROUP BY post_id ) Depending on your data and how you define the sorting keys in your Data Sources to store it on disk, one approach is faster than the others. In general, deduplicating at query time is fine if the size of your data is small. If you have lots of data, use a specific Engine that takes care of deduplication for you. ## Use the ReplacingMergeTree engine¶ If you've lots of data and you're interested in the latest insertion for each unique key, use the ReplacingMergeTree engine with the following options: `ENGINE_SORTING_KEY`, `ENGINE_VER` , and `ENGINE_IS_DELETED`. - Rows with the same `ENGINE_SORTING_KEY` are deduplicated. You can select one or more columns. - If you specify a type for `ENGINE_VER` , the row with the highest `ENGINE_VER` for each unique `ENGINE_SORTING_KEY` is kept, for example a timestamp. - `ENGINE_IS_DELETED` is only active if you use `ENGINE_VER` . This column determines whether the row represents the state or is to be deleted; `1` is a deleted row, `0` is a state row. The type must be `UInt8` . - You can omit `ENGINE_VER` , so that the last inserted row for each unique `ENGINE_SORTING_KEY` is kept. Aggregation or rollups in Materialized Views built on top of ReplacingMergeTree queries always contain duplicated data. ### Define a Data Source¶ Define a Data Source like the following: ##### post\_views\_rmt.datasource DESCRIPTION > Data Source to save post info. ReplacingMergeTree Engine. SCHEMA > `post_id` Int32 `json:$.post_id`, `views` Int32 `json:$.views`, `likes` Int32 `json:$.likes`, `tag` String `json:$.tag`, `timestamp` DateTime `json:$.timestamp`, `_is_deleted` UInt8 `json:$._is_deleted` ENGINE "ReplacingMergeTree" ENGINE_PARTITION_KEY "" ENGINE_SORTING_KEY "post_id" ENGINE_VER "timestamp" ENGINE_IS_DELETED "_is_deleted" ReplacingMergeTree deduplicates during a merge, and merges can't be controlled. Consider adding the `FINAL` clause, or an alternative deduplication method, to apply the merge at query time. Note also that rows are masked, not removed, when using `FINAL`. - FINAL - Subquery - argMax - LIMIT BY ##### Deduplicating data on post\_id using FINAL SELECT * FROM posts_info_rmt FINAL You can define the `posts_info_rmt` as the landing Data Source, the one you send events to, or as a Materialized View from `posts_info` . 
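If you'd rather avoid `FINAL` , the same deduplication can be expressed at query time with `argMax` over `posts_info_rmt` . This is a sketch that ignores `_is_deleted` handling for brevity:

##### Deduplicating posts_info_rmt at query time with argMax (sketch)
-- rows flagged with _is_deleted = 1 would still need to be filtered out
SELECT
    post_id,
    argMax(views, timestamp) AS views,
    argMax(likes, timestamp) AS likes,
    argMax(tag, timestamp) AS tag,
    max(timestamp) AS timestamp
FROM posts_info_rmt
GROUP BY post_id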
You can also create a Data Source with an `AggregatingMergeTree` Engine using `maxState(ts)` and `argMaxState(field,ts)`. ## Snapshot based deduplication¶ Use [Copy Pipes](https://www.tinybird.co/docs/docs/publish/copy-pipes) to take a query result and write it to a new Data Source in the following situations: - You need other Sorting Keys that might change with updates. - You need to do rollups and want to use Materialized Views. - Response times are too long with a `ReplacingMergeTree` . The following is an example snapshot: ##### post\_generate\_snapshot.pipe NODE gen_snapshot SQL > SELECT post_id, argMax(views, timestamp) views, argMax(likes, timestamp) likes, argMax(tag, timestamp) tag, max(timestamp) as ts, toStartOfMinute(now()) - INTERVAL 1 MINUTE as snapshot_ts FROM posts_info WHERE timestamp <= toStartOfMinute(now()) - INTERVAL 1 MINUTE GROUP BY post_id TYPE copy TARGET_DATASOURCE post_snapshot COPY_MODE replace COPY_SCHEDULE 0 * * * * Because the `TARGET_DATASOURCE` engine is a MergeTree, you can use fields that you expect to be updated as sorting keys, something that isn't possible with the ReplacingMergeTree. ##### post\_snapshot.datasource SCHEMA > `post_id` Int32, `views` Int32, `likes` Int32, `tag` String, `ts` DateTime, `snapshot_ts` DateTime ENGINE "MergeTree" ENGINE_PARTITION_KEY "" ENGINE_SORTING_KEY "tag, post_id" ## Hybrid approach using Lambda architecture¶ Snapshots might decrease data freshness, and running Copy Pipes too frequently might be more expensive than Materialized Views. A way to mitigate these issues is to combine batch and real-time processing, reading the latest snapshot and incorporating the changes that happened since then. This pattern is described in the [Lambda architecture](https://www.tinybird.co/docs/docs/guides/querying-data/lambda-architecture) guide. See a practical example in the [CDC using Lambda](https://www.tinybird.co/docs/docs/guides/querying-data/lambda-example-cdc) guide. Using the `post_snapshot` Data Source created before, the real-time Pipe would look like the following: ##### latest\_values.pipe NODE get_latest_changes SQL > SELECT max(timestamp) last_ts, post_id, argMax(views, timestamp) views, argMax(likes, timestamp) likes, argMax(tag, timestamp) tag FROM posts_info_rmt WHERE timestamp > (SELECT max(snapshot_ts) FROM post_snapshot) GROUP BY post_id NODE get_snapshot SQL > SELECT ts AS last_ts, post_id, views, likes, tag FROM post_snapshot WHERE snapshot_ts = (SELECT max(snapshot_ts) FROM post_snapshot) AND post_id NOT IN (SELECT post_id FROM get_latest_changes) NODE combine_both SQL > SELECT * FROM get_snapshot UNION ALL SELECT * FROM get_latest_changes ## Next steps¶ - Read the[ Materialized Views docs](https://www.tinybird.co/docs/docs/publish/materialized-views#creating-a-materialized-view-in-the-tinybird-ui) . - Read the[ Lambda architecture guide](https://www.tinybird.co/docs/docs/guides/querying-data/lambda-architecture) . - Visualize your data using[ Tinybird Charts](https://www.tinybird.co/docs/docs/publish/charts) . Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/guides/querying-data/dynamic-aggregation Last update: 2024-10-28T11:06:14.000Z Content: --- title: "Rollup aggregations with query parameters · Tinybird Docs" theme-color: "#171612" description: "Learn how to dynamically aggregate time series data by different time intervals (rollups) to optimize frontend performance." 
--- # Rollup aggregations with query parameters¶ In this guide, you'll learn how to dynamically aggregate time series data by different time intervals (rollups) to optimize frontend performance. These days, it is not uncommon to have datasets with a per-second resolution for a few years' worth of data. This creates some demands at the storage and the query layers, and it also presents some challenges for the data consumers. Aggregating this data dynamically and on-the-fly is key for resolving the specific demands at scale. In this guide, you'll create an API Endpoint that aggregates the data in different time-frames depending on the amount of data, so only the needed rows are sent to the frontend, thereby gaining performance and speed. When preparing the API Endpoint you'll focus on 3 things: 1. Keep the API Endpoint interface extremely simple: "I want events data from this** start date** to this** end date** ". 2. Make the API Endpoint return enough data with adequate resolution for the selected date range. 3. Don't add logic to the frontend to do the aggregation of the returned data. You simply request the desired date range knowing that you will receive an amount of data that won't swamp your rendering pipeline. ## Prerequisites¶ You'll need to have at least read through the [quick start guide](https://www.tinybird.co/docs/docs/quick-start) to be familiar with the scenario. The following guide uses a Data Source called `events` with 100M rows, for a ~5-year timespan. ## 1. Build the Pipe¶ In this step, you'll learn how to use Tinybird's [templating language](https://www.tinybird.co/docs/docs/query/query-parameters) to add more logic to your API Endpoints. In addition to the main docs on [using the templating language to pass query parameters](https://www.tinybird.co/docs/docs/query/query-parameters) to Endpoints, you can also learn about variables and the functions that are available within templates in the [CLI > Advanced Templates](https://www.tinybird.co/docs/docs/cli/advanced-templates#template-functions) docs. The Endpoint created in the [quick start guide](https://www.tinybird.co/docs/docs/quick-start) returns **sales per day** of an ecommerce store, and takes a start and end date. The problem of having a fixed period of one day to aggregate data is that the amount of data transferred will vary a lot, depending on the selected dates, as well as the work that the client will have to do on the frontend to render that data. Also, if the start and end dates are close to each other, the grouping window will be too big and the backend won't return data with a high enough level of detail. Fortunately, you can add **conditional logic** when defining API Endpoints in Tinybird and, depending on the range of dates selected, the Endpoint will return data grouped by one of these periods: - Weekly - Daily - Every 4 hours - Every 1 hour - Every 15 minutes To do this, you can create a new Pipe named `ecommerce_events_dynamic` . In the first Node, add the following code to 1) keep only the `buy` events, and 2) cast the `price` column to `Float32` changing its name to `buy_events_with_price`: ##### Filtering raw events Data Source to keep only BUY events SELECT date, product_id, user_id, toFloat32(JSONExtractFloat(extra_data, 'price')) price FROM events WHERE event='buy' And then, add another transformation Node containing the following query. 
Name this Node `buy_events_dynamic_agg`: ##### Using dynamic parameters to aggregate data depending on the time range % SELECT {% set days_interval = day_diff(Date(start_date, '2018-01-01'), Date(end_date, '2018-01-31')) %} {% if days_interval > 180 %} toStartOfWeek(date) {% elif days_interval > 31 %} toStartOfDay(date) {% elif days_interval > 7 %} toStartOfInterval(date, INTERVAL 4 HOUR) {% elif days_interval > 2 %} toStartOfHour(date) {% else %} toStartOfFifteenMinutes(date) {% end %} AS t, round(sum(price), 2) AS sales FROM buy_events_with_price WHERE date BETWEEN toDateTime(toDate({{Date(start_date, '2018-10-01')}})) AND toDateTime(toDate({{Date(end_date, '2020-11-01')}}) + 1) GROUP BY t ORDER BY t This query makes use of Tinybird's templating language: - It defines a couple of Date parameters and adds some default values to be able to test the Pipe while you build it ( `{{Date(start_date, '2018-01-01')}}` and `{{Date(end_date, '2018-12-31')}}` ). - It computes the number of days in the interval defined by `start_date` and `end_date` : `days_interval = day_diff(...)` . - It uses the `days_interval` variable to decide the best granularity for the data. Using the templating language additions might look a little complicated at a first glance, but you'll quickly become familiar with it! ## 2. Publish your API Endpoint¶ Selecting the `Publish` button > `buy_events_dynamic_agg` Node makes your API accessible immediately. Once it's published you can directly test it using the snippets available in the API Endpoint page or using Tinybird's REST API. Just change the `start_date` and the `end_date` parameters to see how the aggregation window changes dynamically. ##### Testing the API Endpoint using cURL TOKEN= curl -s "https://api.tinybird.co/v0/pipes/ecommerce_events_dynamic.json?start_date=2018-10-01&end_date=2018-11-01&token=$TOKEN" \ | jq '.data[:2]' [ { "t": "2018-10-01 00:00:00", "sales": 66687.38 }, { "t": "2018-10-01 04:00:00", "sales": 50821.24 } ] Use a Token with the right scope. Replace `` with a Token whose [scope](https://www.tinybird.co/docs/docs/api-reference/token-api) is `READ` or higher. To sum up: With Tinybird, you can dynamically return different responses from your analytics API Endpoints depending on the request's parameters. This can give you more granular control over the data you send to the client, either for performance or privacy reasons. ## Next steps¶ - Learn more about the[ using the templating language to pass query parameters](https://www.tinybird.co/docs/docs/query/query-parameters) . - Need to deduplicate your data? Read the[ deduplication strategies guide](https://www.tinybird.co/docs/docs/guides/querying-data/deduplication-strategies) . --- URL: https://www.tinybird.co/docs/guides/querying-data/lambda-architecture Last update: 2024-11-07T09:52:34.000Z Content: --- title: "Build a lambda architecture in Tinybird · Tinybird Docs" theme-color: "#171612" description: "In this guide, you'll learn a useful alternative processing pattern for when the typical Tinybird flow doesn't fit." --- # Build a lambda architecture in Tinybird¶ In this guide, you'll learn a useful alternative processing pattern for when the typical Tinybird flow doesn't fit. This page introduces a useful data processing pattern for when the typical Tinybird flow (Data Source --> incremental transformation through Materialized Views --> and API Endpoint publication) does not fit. 
Sometimes, the way Materialized Views work means you need to use **Copy Pipes** to create the intermediate Data Sources that will keep your API Endpoints performant. ## The ideal Tinybird flow¶ You ingest data (usually streamed in, but can also be in batch), transform it using SQL, and serve the results of the queries via [parameterizable](https://www.tinybird.co/docs/docs/query/query-parameters) API Endpoints. Tinybird provides freshness, low latency, and high concurrency: Your data is ready to be queried as soon as it arrives. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-lambda-1.png&w=3840&q=75) <-figcaption-> Data flow with Data Source and API Endpoint Sometimes, transforming the data at query time is not ideal. Some operations - doing aggregations, or extracting fields from JSON - are better if done at ingest time, then you can query that prepared data. [Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/overview) are perfect for this kind of situation. They're triggered at ingest time and create intermediate tables (Data Sources in Tinybird lingo) to keep your API Endpoints performance super efficient. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-lambda-2.png&w=3840&q=75) <-figcaption-> Data flow with Data Source, MV, and API Endpoint The best practice for this approach is usually having a Materialized View (MV) per use case: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-lambda-3.png&w=3840&q=75) <-figcaption-> Materialized Views for different use cases If your use case fits in these first two paragraphs, stop reading. No need to over-engineer it. ## When the ideal flow isn't enough¶ There are some cases where you may need intermediate Data Sources (tables) and Materialized Views do not fit. - Most common: Things like Window Functions where you need to check the whole table to make calculations. - Fairly common: Needing an Aggregation MV over a deduplication table (ReplacingMergeTree). - Scenarios where Materialized Views fit but are not super efficient (hey `uniqState` ). - And lastly, one of the hardest problems in syncing OLTP and OLAP databases: Change data capture (CDC). Want to know more about *why* Materialized Views don't work in these cases? [Read the docs.](https://www.tinybird.co/docs/docs/publish/materialized-views/overview#what-shouldnt-i-use-materialized-views-for) As an example, let's look at the *Aggregation Materialized Views over deduplication DS* scenario. Deduplication in ClickHouse® happens asynchronously, during merges, which you cannot force in Tinybird. That's why you always have to add `FINAL` or the `-Merge` combinator when querying. Plus, Materialized Views only see the block of data that is being processed at the time, so when materializing an aggregation, it will process any new row, no matter if it was a new id or a duplicated id. That's why this pattern fails. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-lambda-4.png&w=3840&q=75) <-figcaption-> Aggregating MV over deduplication DS will not work as expected ## Solution: Use an alternative architecture with Copy Pipes¶ Tinybird has another kind of Pipe that will help here: [Copy Pipes](https://www.tinybird.co/docs/docs/publish/copy-pipes). At a high level, they're a helpful `INSERT INTO SELECT` , and they can be set to execute following a cron expression. You write your query, and (either on a recurring basis or on demand), the Copy Pipe appends the result in a different table. 
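To make that concrete, here is a minimal sketch of the shape of a Copy Pipe (the Data Source and column names are hypothetical, not part of this guide): it is just a query, a target Data Source, and a schedule.

##### snapshot\_example.pipe (hypothetical sketch)

NODE deduplicate_snapshot
SQL >
    -- keep the latest version of each id at the time the copy runs
    SELECT
        id,
        argMax(value, updated_at) AS value,
        max(updated_at) AS updated_at
    FROM my_cdc_events -- hypothetical Data Source
    GROUP BY id

TYPE copy
TARGET_DATASOURCE my_snapshot -- hypothetical target Data Source
COPY_SCHEDULE 0 * * * * -- hourly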
So, in this example, you can have a clean, deduplicated snapshot of your data, with the correct Sorting Keys, and can use it to materialize: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-lambda-5.png&w=3840&q=75) <-figcaption-> Copy Pipes to the rescue ## Avoid loss of freshness¶ *"But if you recreate the snapshot every hour/day/whatever... Aren’t you losing freshness?"* Yes - you're right. That's when the lambda architecture comes into play: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-lambda-6.png&w=3840&q=75) <-figcaption-> Lambda/Kappa Architecture You'll be combining the already-prepared data with the same operations over the fresh data being ingested at that moment. This means you end up with higher performance despite quite complex logic over both fresh and old data. ## Next steps¶ - Check an[ example implementation of a CDC use case](https://www.tinybird.co/docs/docs/guides/querying-data/lambda-example-cdc) with this architecture. - Read more about[ Copy Pipes](https://www.tinybird.co/docs/docs/publish/copy-pipes) . - Read more about[ Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/overview) . Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/guides/querying-data/lambda-example-cdc Last update: 2024-11-13T15:42:17.000Z Content: --- title: "Change Data Capture project with lambda architecture · Tinybird Docs" theme-color: "#171612" description: "In this guide, you'll learn a useful implementation to consume CDC streams from Kafka and get an updated view of the changes." --- # Lambda CDC processing with Tinybird¶ This guide outlines a practical implementation of CDC processing with Tinybird using a [lambda](https://www.tinybird.co/docs/docs/guides/querying-data/lambda-architecture) approach. It produces an API that returns the freshest deduplicated view of the data by combining a scheduled batch job with new rows since the last batch. This is more complex than a simple deduplication query or Materialized View, recommended as an optimization where the dataset or processing SLAs demand it. ## Prerequisites¶ This is a read-through guide, explaining an example, so you don't need an active Workspace to try it out in. Use the concepts and apply them to your own scenario. To understand the guide, you'll need familiarity with Change Data Capture concepts and Tinybird [deduplication strategies](https://www.tinybird.co/docs/docs/guides/querying-data/deduplication-strategies). ## Data characteristics¶ This implementation focuses on fast filtering and aggregation of slowly-changing dimensions over a long history with high cardinality. The test dataset is a Postgres Debezium CDC to Kafka with an event history of tens of millions of updates into ~5M active records, receiving up to 75k new events per hour. Tinybird provides low-latency, high-concurrency responses with real-time data freshness. In this example, CDC Source is configured as *partial* mode, i.e. only new and changed records are sent as data and deletes as a null. In *full* CDC you would get the old and new data in each change, which can be very helpful in OLAP processing. The test dataset exhibits high cardinality over many years, optimized for ElasticSearch with nested JSON arrays. Updates are sparse over data dimensions and time, leading to specific decisions in this implementation. 
It is also worth noting that the JSON is up to 100kB per document, but for the analysis only a small part is needed. Any given primary key of the upstream data source can be deleted or the Kafka topic compacted or reloaded, resulting in many ‘null’ records to handle. ## Solution features¶ - Lambda processing architecture - Split data + deletes table processing - Null events as deletes by Kafka partition key - Batch + speed layer CDC upsert - Full data history table - Full delete history table - Batch table with good sorting keys - Latest data as reusable API ## Solution technical commentary¶ This implementation doesn't use `AggregatingMergeTree` or `ReplacingMergeTree` due to sorting key limitations. Instead, it uses a `MergeTree` table with subquery deduplication. The data history and delete tables are split to avoid bloat and null processing, improving performance. It focuses on the Kafka event timestamp and partition key for deduplication. Various ClickHouse® functions are used for JSON extraction, avoiding nulls to speed up processing. ## Data pipeline lineage¶ 1. ** Raw Kafka Table ``** 2. ** Initial Materialized Views** - Data History extraction `` - Delete History extraction `` 3. ** Historical Data Sources** - All insert/update events `` - All delete events `` 4. ** Batching Copy Pipe ``** 5. ** Batches Data Source ``** 6. ** Lambda ‘Upsert’ API ``** <-figure-> ![An overview of the Data Flow, with a Kafka Data source, two Materialized Views to keep track of changes and deletes, a Copy Pipe to deduplicate in batches, and a Pipe to combine all Data Sources](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-lambda-cdc-overview.png&w=3840&q=75) <-figcaption-> Lambda CDC overview ## Landing Data Source¶ Where the CDC events from the Kafka topic are consumed: ##### raw\_kafka\_ds.datasource SCHEMA > `__value` String --`__topic` LowCardinality(String), --`__partition` Int16, --`__offset` Int64, --`__timestamp` DateTime, --`__key` String ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYYYYMM(__timestamp)" ENGINE_SORTING_KEY "__offset" KAFKA_CONNECTION_NAME '' KAFKA_TOPIC '' KAFKA_GROUP_ID '' KAFKA_AUTO_OFFSET_RESET 'earliest' KAFKA_STORE_RAW_VALUE 'True' KAFKA_STORE_HEADERS 'True' KAFKA_STORE_BINARY_HEADERS 'True' KAFKA_TARGET_PARTITIONS 'auto' KAFKA_KEY_AVRO_DESERIALIZATION '' Helpful notes and top tips for your own implementation: - The other Kafka metadata fields (commented above) like `__timestamp` etc. are automatically added by Tinybird's Kafka connector. - Always increment the `KAFKA_GROUP_ID` if you reprocess the topic! - The `__value` may be `null` in the case of a `DELETE` for many CDC setups, so do not parse the JSON values in the raw table. - You can get the operation ( `INSERT` , `UPDATE` , `DELETE` ...) from the `KAFKA_STORE_HEADERS` for many CDC sources and read them in the `__headers` field, though we don’t need it for this implementation as `INSERT` and `UPDATE` are equivalent for our purposes, and `DELETE` is always the `null` record. - The sorting key should definitely be `__offset` or `__partition` . CDC Data can often have high density bursts of activity, which results in a lot of changes being written in a short time window. For this reason it is often better to partition and sort `raw_kafka_ds` data by `__partition` and /or `__offset` to avoid the skew of using `__timestamp` . - Remember that you have to pair the `__key` with the `__offset` to get a unique pairing, as each partition has its own offsets. 
This is why `__timestamp` is a good boundary for multi-partition topics. - This implementation does not set a TTL as it only partially processes the `__value` schema for the given use case. If you want to create other tables out of it you'd need the data source. This also allows reprocessing if you decide you need something else out of the raw JSON. - You could optionally run a delete operation on every `__offset` before the `__offset` of the first `__value` that isn’t a `null` in each Partition, which would effectively truncate the table of all old compactions. This can be done in the CLI with a `tb datasource truncate ` command. Remember to[ filter by partitions with no active ingest](https://www.tinybird.co/docs/docs/guides/ingesting-data/replace-and-delete-data) . ## Full data history¶ This section contains all the `INSERT` and `UPDATE` operations. To generate the Materialized View you'd first need a Pipe that will result in the `data_history_mv` Data Source: ##### mat\_data.pipe NODE mv SQL > SELECT toInt64(__key) as k_key, __offset as k_offset, __timestamp as k_timestamp, __partition as k_partition, __value as k_value, FROM raw_kafka_ds WHERE __value != 'null' TYPE materialized DATASOURCE data_history_mv Notes: - This example treats `__key` and `__timestamp` as the primary concern, and then parses out all the various fields the customer wants. - You want this table to have the same extracted columns as the batch table, as the Lambda process UNIONs them. - Use the various ClickHouse functions for JSON extraction, and avoid Nullable fields as it slows processing and bloats tables. Here is the Data Source definition of the MV: ##### data\_history\_mv.datasource SCHEMA > `k_key` Int64, `k_offset` Int64, `k_timestamp` DateTime, `k_partition` Int16, `k_value` String ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYYYYMM(k_timestamp)" ENGINE_SORTING_KEY "k_key, k_timestamp" Helpful notes: - `__timestamp` and `__key` are critical for lambda processing, so while they are not low cardinality they are the permanent filters in later queries. - Because this example mostly cares about quick access to the most recent events, it uses `k_timestamp` for partitioning. - This example keeps the raw JSON of the rest of the record in `k_value` against some field being wanted for later indexing. You can ignore the column in daily processing, and use it for backfill without reprocessing the entire raw topic again if you need to later. You can obviously partially extract from this for stable column indexing. - Additional sorting keys for customer queries are not retained because you need offset and key here, however other approaches could be considered if necessary. - All other columns are based on fields extracted from the kafka `__value` JSON. - If the sorting key columns for customer queries were limited to columns that did not change during a CDC update, then a `ReplacingMergeTree` may work here. However customer updates are often over required columns including date fields making it impractical. ## Deletes history¶ This section contains the `DELETE` operations. As above, you need a Pipe to get the deletes and materialize them: ##### mat\_deletes.datasource NODE mv SQL > SELECT toInt64(__key) as k_key, -- key used as deduplication identity __timestamp as k_timestamp -- ts used for deduplication incrementor FROM raw_kafka_ds WHERE __value = 'null' TYPE materialized DATASOURCE deletes_history_mv Helpful notes: - Nothing fancy - just parses out the null records by tracking the `__key` . 
- Note converting to Int64 instead of String for better performance, as you know the key and offset are auto-incrementing integers from Postgres and Kafka respectively. This may not always be true for other CDC sources. It results in a table with all the deletes: ##### deletes\_history\_mv.datasource SCHEMA > `k_key` Int64, `k_timestamp` DateTime ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYYYYMM(k_timestamp)" ENGINE_SORTING_KEY "k_timestamp, k_key" Notes: - Sorting key order is deliberate. You want to be able to fetch only the `__timestamp` s since the last batch was run for lambda processing, and then you want the `__key` s that are to be deleted from the dataset as a list for delete processing. - You do not need to have a separate deletes table for an efficient implementation, however if you have a lot of deletes (such as in compacted and reloaded Kafka Topics for CDC) then you may find this more efficient at the cost of a little more implementation complexity. A recommended approach is to start with a single history table including both data and deletes and then optimize later if necessary. ## Consolidation batches¶ Using Copy Pipes, you can generate snapshots of the state of the original table at several points in time. This will help you speed up compute later, only having to account for changes that arrived since the last snapshot/batch happened. ##### copy\_batches.pipe NODE get_ts_boundary SQL > WITH ( (SELECT max(k_timestamp) FROM data_history_mv) AS max_history, (SELECT max(k_timestamp) FROM deletes_history_mv) AS max_deletes ) SELECT least(max_history, max_deletes, now()) as batch_ts_ NODE dedupe_and_delete_history SQL > WITH latest_rows AS ( SELECT k_key, max(k_timestamp) AS latest_ts FROM data_history_mv GROUP BY k_key ), ts_boundary AS ( SELECT batch_ts_ AS batch_ts FROM get_ts_boundary ) SELECT f.*, assumeNotNull((SELECT batch_ts FROM ts_boundary)) AS batch_ts FROM data_history_mv f INNER JOIN latest_rows lo ON f.k_key = lo.k_key AND f.k_timestamp = lo.latest_ts WHERE f.k_key NOT IN ( SELECT k_key FROM deletes_history_mv WHERE k_timestamp <= (SELECT batch_ts FROM ts_boundary) ) AND f.k_timestamp <= (SELECT batch_ts FROM ts_boundary) TYPE copy TARGET_DATASOURCE batches_ds COPY_SCHEDULE 0,30 * * * * Notes: - You use the slowest data processing stream as the batch timestamp boundary, in case one of the Kafka partitions is lagging and other typical stream processing challenges. - This example uses a subquery and self-join to deduplicate because testing showed it performing the best over the dataset used. Each dataset will have unique characteristics that may drive a different approach such as `LIMIT 1 BY` etc. - Note the deduplication method works fine in batch, and it's the same one used later in the lambda Pipe. - The schedule should be adjusted to match the customer cadence requirements. - Note that this example uses `<=` in the row selection here, and `>` in the selection later to ensure it doesn't duplicate the boundary row. ##### batches\_ds.datasource SCHEMA > `k_key` Int64, `k_offset` Int64, `k_timestamp` DateTime, `k_partition` Int16, `k_value` String, `batch_ts` DateTime ENGINE "MergeTree" ENGINE_PARTITION_KEY "toDate(batch_ts)" ENGINE_SORTING_KEY "batch_ids, " ENGINE_TTL "batch_ts + toIntervalDay(1)" Notes: - You cannot use a TTL to simply keep the last 3 versions, so you must pick a date and monitor that batches are running as expected (ClickHouse won’t consider the whole table in a TTL query, just that row). 
- This Table forms the bulk of the rows used for the actual query process, so it’s important that the sorting keys are optimized for data results. - The `batch_id` remains at the head of the sorting key so you can quickly select the latest batch for use. - The `batch_id` is a simple timestamp boundary of all rows across all Partitions included in the batch, including all Deletes already applied to the batch. This is important when understanding the logic of the Lambda processing later. - Partition key is by day on the `batch_ts` so you can read as few rows as possible, but all sequentially. - Analysis of the customer sorting keys may yield good optimization information, such as a need for controlling index granularity if they handle a lot of multi-tenant data, for example. ## Lambda Pipe¶ Lastly, you get the latest snapshot plus all the changes since then, and consolidate. This API Endpoint can also be used as a "view" so that other Pipes query it. ##### latest\_values.pipe NODE get_batch_info SQL > SELECT max(batch_ts) AS batch_ts FROM batches_ds NODE new_deletes SQL > select k_key FROM deletes_history_mv WHERE k_timestamp > (SELECT batch_ts from get_batch_info) -- only rows since last batch NODE filter_new_rows SQL > % SELECT {{columns(cols, 'k_key, k_offset, k_timestamp, ')}} FROM data_history_mv WHERE 1 AND k_timestamp > (SELECT batch_ts from get_batch_info) -- only rows since last batch AND k_key not in (select k_key from new_deletes) -- remove newly deleted rows from new rows AND NODE dedup_new_rows_by_subquery SQL > WITH latest_rows AS ( SELECT k_key, max(k_timestamp) AS latest_ts FROM filter_new_rows GROUP BY k_key ) SELECT f.* FROM filter_new_rows f INNER JOIN latest_rows lo ON f.k_key = lo.k_key AND f.k_timestamp = lo.latest_ts NODE get_and_filter_batch SQL > % SELECT {{columns(cols, 'k_key, k_offset, k_timestamp, ')}} FROM batches_ds PREWHERE batch_ts = (SELECT batch_ts from get_batch_info) -- get latest batch WHERE 1 AND k_key not in (select k_key from new_deletes) -- filter by new deletes since last batch AND k_key not in (select k_key from dedup_new_rows_by_subquery) -- omit rows already updated since batch NODE batch_and_latest SQL > SELECT * FROM get_and_filter_batch UNION ALL SELECT * FROM dedup_new_rows_by_subquery Notes: - This is a longer Pipe which is published as an API. - It starts by determining which batch to use, which also gives you the boundary timestamp. It then fetches and processes all new rows since the selected batch, including deletes processing and deduplication. It then backfills this with all other rows from the batch, and UNIONs the results. - It uses the same deduplication strategy as the batch processing Pipe for consistency of results. - Note the use of the `columns` Parameter. This then defaults to returning all columns, but a user can specify a subset to reduce the data fetched and processed. - This API can then be called as a Data Source by other Pipes, which can also use the same parameter names to pass through filters like columns or other customer filters that may be required. ## Conclusion¶ This image explains, in detail, the full overview of this approach: A Kafka Data Source, two Materialized Views to keep track of changes and deletes, a Copy Pipe to deduplicate in batches, and a Pipe to combine all Data Sources. 
<-figure-> ![An overview of the data flow, with a Kafka Data Source, two Materialized Views to keep track of changes and deletes, a Copy Pipe to deduplicate in batches, and a Pipe to combine all Data Sources](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-lambda-cdc-detail.png&w=3840&q=75) <-figcaption-> Lambda CDC overview ## Possible improvements¶ This example deliberately kept the full history, but you could speed up and store less data if the history Materialized Views are `ReplacingMergeTree` , or if you add a TTL long enough to be sure the changes have been incorporated into the batch Data Source. ## Alternatives¶ ### A simpler way of achieving the latest view¶ This example is solving for many peculiarities of the test dataset, like not having a simple key for deduplication, and a large number of delete operations bloating the resulting tables. As a comparison, here's a solution you can use when your CDC case is very simple. It's possibly less performant (since you need to extract fields at query time, filter deletes over the whole data source...) but easier and perfectly functional if volumes are not too big. Just define the Kafka Data Source as a `ReplacingMergeTree` (or create a MV from raw), query with FINAL, and exclude deletes: ##### raw\_kafka\_ds\_rmt.datasource SCHEMA > `__value` String --`__topic` LowCardinality(String), --`__partition` Int16, --`__offset` Int64, --`__timestamp` DateTime, --`__key` String ENGINE "ReplacingMergeTree" ENGINE_PARTITION_KEY "" ENGINE_SORTING_KEY "__key" ENGINE_VER "__offset" --or "__timestamp" An example query to consolidate latest updates and exclude duplicates: ##### Querying raw\_kafka\_ds\_rmt with FINAL SELECT * , JSONExtract(__value, 'field', type) as field --this for every field FROM raw_kafka_ds_rmt FINAL WHERE __value!='null' ## Next steps¶ - Read more about using a[ lambda approach](https://www.tinybird.co/docs/docs/guides/querying-data/lambda-architecture) in your Tinybird architecture. - Understand Tinybird[ deduplication strategies](https://www.tinybird.co/docs/docs/guides/querying-data/deduplication-strategies) . Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/guides/querying-data/working-with-time Last update: 2024-11-07T09:52:34.000Z Content: --- title: "Working with time · Tinybird Docs" theme-color: "#171612" description: "Learn how to work with time in Tinybird." --- # Working with time¶ ## Overview¶ With real-time data, nothing is more important to understand than the impact of time, including time zones and Daylight Saving Time (DST). This page explains the different functions for working with time in ClickHouse® (and therefore Tinybird) when using, storing, filtering, and querying data. It also explains functions that may behave in ways you don't expect when coming from other databases, and includes example queries and some test data to play with. Read Tinybird's ["10 Commandments for Working With Time" blog post](https://www.tinybird.co/blog-posts/database-timestamps-timezones) to understand best practices and top tips for working with time. ### Resources¶ If you want to follow along, you can find all the sample queries & data in the `timezone_analytics_guide` repo. 
The repo contains several examples that this page walks through: <-figure-> ![Data flow showing different examples](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fworking-with-time-2.png&w=3840&q=75) ## Time-related types in ClickHouse¶ In the companion Workspace, this section is in the Pipe `simple_types_and_transforms`. There are basically two main Types, with several sub-types, when dealing with time in ClickHouse - the first for **dates** , and the second for **specific time instants**. For dates, you can choose one type or another depending on the range of dates you need to store. For time instants, it will depend not only on the range, but also on the precision. The default records seconds, and you can also work with micro or nanoseconds. **Date and time types** | Type | Range of values | Parameters | | --- | --- | --- | | Date | [1970-01-01, 2149-06-06] | | | Date32 | [1900-01-01, 2299-12-31] | | | DateTime | [1970-01-01 00:00:00, 2106-02-07 06:28:15] | Time zone: More about that later. | | DateTime64 | [1900-01-01 00:00:00, 2299-12-31 23:59:59.99999999] | Precision: [0-9]. Usually: 3(milliseconds), 6(microseconds), 9(nanoseconds). Time zone: More about that later. | Note that this standard of appending `-32` or `-64` carries through to most other functions dealing with `Date` and `DateTime` , as you'll soon see. The main [ClickHouse docs](https://clickhouse.com/docs/en/sql-reference/functions/date-time-functions) have an exhaustive listing of the different types and functions available - but it can be tricky to figure out which one you should use, and how you should use it. This Tinybird guide should help make it clearer. ## Transform your data into these types¶ ### Date from a String¶ Various ways of transforming a `String` into a `Date`: SELECT '2023-04-05' AS date, toDate(date), toDate32(date), DATE(date), CAST(date, 'Date')| time | toDate(time) | toDate32(time) | DATE(time) | CAST(time, 'Date') | | --- | --- | --- | --- | --- | | 2023-04-05 | 2023-04-05 | 2023-04-05 | 2023-04-05 | 2023-04-05 | ### Store a date before 1970¶ You can use the `toDate32` function, which has a range of [1900-01-01, 2299-12-31]. If you try to use the normal `toDate` function, you'll get the boundary of the available range - this might trip up users coming from other databases who would expect to get an Error. select toDate('1903-04-15'), toDate32('1903-04-16')| toDate('1903-04-15') | toDate32('1903-04-16') | | --- | --- | | 1970-01-01 | 1903-04-16 | ### Parse other formats into DateTime¶ You can use the `parseDateTimeBestEffortOrNull` function to parse a string into a `DateTime` , and if it fails for some reason, it will return a `NULL` value. This is your best default option, and there are several others which give different outcome behaviors, like `parseDateTimeBestEffort` (gives an error on a bad input), or `parseDateTimeBestEffortOrZero` (returns 0 on bad input). You can always check the main [ClickHouse documentation](https://clickhouse.com/docs/en/sql-reference/functions/type-conversion-functions#type_conversion_functions-parseDateTime) for functions mapping to specific edge cases. Also remember to use the `-32` or `-64` suffixes for the functions that return a `DateTime` , depending on the range of dates you need to store. Tinybird's Data Engineers know these to generally work with typical output strings experienced in the wild, such as those produced by most JavaScript libraries, Python logging, the default formats produced by AWS and GCP, and many others. 
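For instance, a quick sanity check over a handful of common input formats (a sketch; swap in samples from your own data):

##### Checking parseDateTimeBestEffortOrNull against common formats

SELECT
    parseDateTimeBestEffortOrNull('2023-04-05T09:00:00Z') AS iso_8601,
    parseDateTimeBestEffortOrNull('2023-04-05 09:00:00') AS sql_style,
    parseDateTimeBestEffortOrNull('Sat, 18 Aug 2018 07:22:16 GMT') AS rfc_1123,
    parseDateTimeBestEffortOrNull('1680685200') AS unix_seconds,
    parseDateTimeBestEffortOrNull('not a date') AS bad_input -- returns NULL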
But they can't be guaranteed to work, so test your data! And if something doesn't work as expected, contact Tinybird at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) or in the [Community Slack](https://www.tinybird.co/docs/docs/community). ### How soon is Now()?¶ In the companion Workspace, this section is in a Pipe called `what_is_now`. The function `now` gives a seconds-accuracy current timestamp. You can use the `now64` function to get precision down to nanoseconds if necessary. They will default to the time zone of the server, which is UTC in Tinybird. You can use the `toDateTime` function to convert to a different time zone, or `toDate` to convert to a date. Be mindful of using `toDate` and ensuring you are picking the calendar day relative to the time zone you are thinking about, otherwise you will end up with the UTC day instead by default. You might want to use the convenience functions `today` and `yesterday` instead. SELECT now(), toDate(now()), toDateTime(now()), toDateTime64(now(),3), now64(), toDateTime64(now64(),3), toUnixTimestamp64Milli(toDateTime64(now64(),3)), today(), yesterday()| now() | toDate(now()) | toDateTime(now()) | toDateTime64(now(),3) | now64() | toDateTime64(now64(),3) | toUnixTimestamp64Milli(toDateTime64(now64(),3)) | today() | yesterday() | | --- | --- | --- | --- | --- | --- | --- | --- | --- | | 2021-03-19 19:15:00 | 2021-03-19 | 2021-03-19 19:15:00 | 2021-03-19 19:15:00.000 | 2021-03-19 19:15:00.478 | 2021-03-19 19:15:00.478 | 1616181300 | 2021-03-19 | 2021-03-18 | ## Filter by date in ClickHouse¶ In the companion Workspace, this section is in a Pipe called `filter_by_date`. To filter by date in ClickHouse, you need to make sure that you are comparing dates of the same type on both sides of the condition. In the example Workspace, the timestamps are stored in a column named `timestamp_utc`: 2023-03-19 19:15:00 2023-03-19 19:17:23 2023-03-20 00:02:59 It is extremely common that you want to filter your data by a given day. It is quite simple, but there are a few gotchas to be aware of. Let's examine them step by step. The first thing you could imagine is that ClickHouse is clever enough to understand what you want to do if we pass a `String` containing a date. As you can see, this query: SELECT timestamp_utc FROM sales_data_raw WHERE timestamp_utc = '2023-03-19' Produces no results. This actually makes sense when you realize you're comparing different types, a `DateTime` and a `String`. What if you force your parameter to be a proper date? You can easily do it: SELECT timestamp_utc FROM sales_data_raw WHERE timestamp_utc = toDate('2023-03-19') Nope. You may realize that you are comparing a `Date` and a `DateTime`. Ok, so what if you convert the `Date` into a `DateTime`? SELECT timestamp_utc FROM sales_data_raw WHERE timestamp_utc = toDateTime('2023-03-19') No results. This is because the `DateTime` produced has a time component of midnight, which is still just a single moment in time, not a range of time - which is exactly what a `Date` is - all the moments between the midnights on a given calendar day. So what you really have to do is **make sure you're comparing the same kind of dates in both sides** . 
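An equivalent approach that avoids wrapping the stored column in a function is to compare against an explicit `DateTime` range covering the whole day; a sketch over the same table:

##### Filtering a full day with a DateTime range

SELECT timestamp_utc
FROM sales_data_raw
WHERE timestamp_utc >= toDateTime('2023-03-19 00:00:00')
  AND timestamp_utc < toDateTime('2023-03-20 00:00:00')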
But it will be instructive to see the results of all the transformations so far: SELECT timestamp_utc, toDate(timestamp_utc), toDate('2023-03-19'), toDateTime('2023-03-19') FROM sales_data_raw WHERE toDate(timestamp_utc) = toDate('2023-03-19') You can see that: | timestamp_utc | toDate(timestamp_utc) | toDate('2023-03-19') | toDateTime('2023-03-19') | | --- | --- | --- | --- | | 2023-03-19 19:15:00 | 2023-03-19 | 2023-03-19 | 2023-03-19 00:00:00 | In order to select the day you want, you need to convert both sides of the condition to `Date` . In this example, you do it in the `WHERE` clause. Remember that, if filtering by date, you must have a date in both sides of the condition. ## Example of a good timestamp schema in ClickHouse¶ Here's a few tips for the Types to use when storing timestamps in ClickHouse: 1. It's a good idea to include the expected time zone, or any modifications like normalization, in the `timestamp` column name. This will help you avoid confusion when you're working with data from different time zones, particularly when joining multiple data sources. 2. Your time zone names are low cardinality strings, so you can use the `LowCardinality(String)` modifier to save space. Same goes for the offset, which is an `Int32` . 3. You can use the `String` type to store the local timestamp in a format that is easy to parse, but isn't going to be eagerly converted by ClickHouse unexpectedly. The length of the string can vary by the precision you need, but it's a good idea to** keep it consistent across all your Data Sources** . There is also the `FixedString` type, which is a fixed length string, but while it's more efficient to process it's also not universally supported by all ClickHouse drivers, so `String` is a better default. Here's an example: timestamp_utc [DateTime('UTC')], timezone [lowCardinalityString], offset [Int32], timestamp_local [String], ## The details of ClickHouse DateTime functions¶ In the companion Workspace, this section is in a Pipe called `NittyGritty_DateTime_Operations`. This uses more of the test data, specifically the recording of metadata about Store Hours in Tokyo, to see what happens with the various functions you would expect to use, and what the outcomes are. ### Things that work as expected¶ SELECT -- these are simple timezone, store_open_time_utc, store_open_time_local, -- This next one will be converted to the ClickHouse server's default time zone parseDateTimeBestEffort(store_open_time_local) as parse_naive, -- This will be in the specified time zone, but it must be a Constant - you can't look it up! parseDateTimeBestEffort(store_open_time_local, 'Asia/Tokyo') as parse_tz_lookup, -- this next one stringify's the DateTime so you can pick it in UTC without TZ conversion parseDateTimeBestEffort(substring(store_open_time_local, 1, 19)) as parse_notz, -- These next ones work, but you must get the offsets right for each DST change! 
store_open_time_utc + store_open_timezone_offset as w_plus, date_add(SECOND, store_open_timezone_offset, store_open_time_utc) as w_date_add, addSeconds(store_open_time_utc, store_open_timezone_offset) as w_add_seconds, --- This gives you nice UTC formatting as you'd expect formatDateTime(store_open_time_utc, '%FT%TZ'), -- Because BestEffort will convert your string to UTC, you're going to get UTC displayed again here formatDateTime(parseDateTimeBestEffort(store_open_time_local), '%FT%T') FROM store_hours_raw where store = 'Tokyo' Limit 1| timezone | store_open_time_utc | store_open_time_local | parse_naive | parse_tz_lookup | w_plus | w_date_add | w_add_seconds | formatDateTime(store_open_time_utc, '%FT%TZ') | formatDateTime(parseDateTimeBestEffort(store_open_time_local), '%FT%T') | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | Asia/Tokyo | 2023-03-20 00:00:00 | 2023-03-20T09:00:00+09:00 | 2023-03-20 00:00:00 | 2023-03-20 09:00:00 | 2023-03-20 09:00:00 | 2023-03-20 09:00:00 | 2023-03-20 09:00:00 | 2023-03-20T00:00:00Z | 2023-03-20T00:00:00 | ### Things that might surprise you¶ Here you have a column with a time in UTC, and another column with the local time zone. ClickHouse has an opinion that a `DateTime` column must always be one specific time zone. It also has the opinion that you must always supply the time zone as a `String` , and cannot lookup from another column on a per-row basis. There's also a [bug](https://github.com/ClickHouse/ClickHouse/issues/33382) in older ClickHouse versions where `toTimezone` will silently give the wrong answer if you give it a lookup instead of a `String` , so use `toDateTime` instead. SELECT -- toString(store_open_time_utc, timezone) as w_tostr_lookup, << This causes an error -- toDateTime(store_open_time_utc, timezone), << This also causes an error toString(store_open_time_utc, 'Asia/Tokyo') as w_tostr_const, -- This is correct. toTimezone(store_open_time_utc, timezone) as w_totz_lookup, -- This silently gives the wrong answer due to the bug! toDateTime(store_open_time_utc, 'Asia/Tokyo') as todt_const, -- This is the best method toTimezone(store_open_time_utc, 'Asia/Tokyo') as w_totz_const, -- this works, but toTimezone can fail if misused, so better to use toDateTime -- Because BestEffort will convert your string to UTC, you might think you'll get the local time -- But as the ClickHouse column is UTC, that is what you're going to get formatDateTime(parseDateTimeBestEffort(store_open_time_local), '%FT%T'), -- formatDateTime(store_open_time_utc, '%FT%T%z') -- << Supported in ClickHouse 22.9 toDate32('1903-04-15'), -- This gives the expected date toDate('1903-04-15') -- This silently hits the date boundary at 1970-01-01, so be careful ## Understand ClickHouse time zone handling¶ ClickHouse DateTimes always have a time zone. Here are the key points to note: - A DateTime is stored in ClickHouse as a Unix timestamp that represents the number of seconds in UTC since 1970-01-01. - ClickHouse stores a single time zone associated with a column and uses it to handle transformations during representation/export or other calculations. - ClickHouse enforces the single time zone per column rule in the behavior of native functions it provides for manipulating DateTime data. - ClickHouse assumes that any time zone offset or DST change is a multiple of 15 minutes. This is true for most modern time zones, but not for various historical corner cases. Let's explore some areas to be aware of. 
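A small illustration of the first point above: attaching a different time zone changes only how a value is rendered, not the underlying Unix timestamp. A minimal sketch, with no test data required:

##### Same instant, different rendering

SELECT
    toDateTime('2023-03-20 00:00:00', 'UTC') AS utc_value,
    toDateTime(utc_value, 'Asia/Tokyo') AS tokyo_value, -- rendered as 2023-03-20 09:00:00
    toUnixTimestamp(utc_value) AS unix_utc,
    toUnixTimestamp(tokyo_value) AS unix_tokyo -- identical to unix_utc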
### How a time zone is selected for the Column¶ Here are some general rules to observe when selecting a time zone for a column: - If no time zone is specified, the column will represent the data in the ClickHouse server's time zone, which is UTC in Tinybird. - If a time zone is specified as a string, the column will represent the data as that time zone. - If multiple time zones are in a query that produces a column without already creating the column with a time zone, the first time zone in the query wins (e.g. a CASE statement to pick different time zones). - If the time zone is not represented as a constant (e.g. by lookup to another column or table), you should get an error message. ### DateTime Operations without time zone conversion¶ ClickHouse provides some native operations that work with DateTime without handling time zone conversion. Mostly these are for adding or subtracting some unit of time. These operations behave exactly as you would expect. Remember: You, as the programmer, are responsible for correctly selecting the amount to add or subtract for a given timestamp. Many incorrect datasets are produced here by incorrect chaining of time zone translations or handling DST incorrectly. ### DateTime to String with a time zone¶ In older versions of ClickHouse, you cannot convert a DateTime to an ISO8601 String with time zone information. However, the `%z` operator was introduced in version 22.9, allowing you to use `formatDateTime(timestamp_utc, '%FT%T%z')`. ## How to Normalize your Data in ClickHouse¶ In the companion Workspace, this section is in a Pipe called `sales_normalized`. The process of converting timestamps on the data so that they are all in the same time zone is called "time zone normalization" or "time zone conversion". This is a common practice in data analytics to ensure that data from multiple sources, which may be in different time zones, can be accurately compared and analyzed. By converting all timestamps to a common time zone, it is possible to compare data points from different sources and derive meaningful insights from the data. Consider a sample dataset containing sales data from multiple stores which are each in different time zones. Each store is open from 0900 to 1700 local time, and you're going to shift them all to UTC: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fworking-with-time-1.png&w=3840&q=75) The simplest way to do this in ClickHouse is to chain standard conversion functions - first, convert from your canonical stored UTC timestamp to the store-local time zone, then extract a `String` representation of that local time, then convert that back into a `DateTime` object. ClickHouse will by default present this in the server time zone, so we have to know that this dataset has been normalized and name it appropriately. In this example you are using a CASE statement to handle each store, and the result is presented as a Materialized View in Tinybird. This has the benefit of pre-calculating the time zone shifts, you could also pre-aggregate to some time interval like 5 or 15 mins, resulting in a highly efficient dataset for further analysis. 
In your schema you have a `String` identifying the `store` , an `Int32` for the `sale_amount` , and a `DateTime` in UTC for `timestamp_utc` of the transaction: SELECT store, sale_amount, CASE WHEN store = 'New York' THEN toDateTime(toString(toTimezone(timestamp_utc, 'America/New_York'))) WHEN store = 'Tokyo' THEN toDateTime(toString(toTimezone(timestamp_utc, 'Asia/Tokyo'))) WHEN store = 'London' THEN toDateTime(toString(toTimezone(timestamp_utc,'Europe/London'))) WHEN store = 'Madrid' THEN toDateTime(toString(toTimezone(timestamp_utc, 'Europe/Madrid'))) WHEN store = 'Auckland' THEN toDateTime(toString(toTimezone(timestamp_utc, 'Pacific/Auckland'))) WHEN store = 'Chatham' THEN toDateTime(toString(toTimezone(timestamp_utc, 'Pacific/Chatham'))) else toDateTime('1970-01-01 00:00:00') END AS timestamp_normalized FROM sales_data_raw Note that the time zones used are all input as Strings, as ClickHouse requires this. We use an else statement so that ClickHouse doesn't mark the `timestamp_normalized` column as Nullable, which drastically impacts performance. ## Good time zone test data¶ Our users do *all sorts* of things with their data. Issues with time-based aggregations, particularly when time zone conversions are involved, are one of the most common gotchas. The Tinybird Data Engineers amalgamated the these into a test data set, which is comprised of: 1. A facts table listing generated transactions for a set of retail stores across the world. 2. Some additional columns to aid in checking processing correctness. 3. A dimension table giving details about the store hours and time zones, again with additional information for correctness checking. In the companion Workspace, these are the `sales_data_raw` and `store_hours_raw` tables respectively. The dataset is generated by [this Notebook](https://github.com/tinybirdco/timezone_analytics_guide/blob/main/sample_data_generator.ipynb) in the repo, and [pregenerated fixtures](https://github.com/tinybirdco/timezone_analytics_guide/tree/main/tinybird/datasources/fixtures) from it are also in the repo. ### The tricky use case¶ There is also one extra store in what is deliberately the most complex case. Let's say you have a store in the Chatham Islands (New Zealand), which typically has a +1345hr time zone. This store records the start of the new business day at 0900 each day instead of 0000 (midnight) due to some 'local regulations'. Let's also say that this store remains open 24hrs a day, and you are recording sales through a DST change. The Chatham Islands DST changes back one hour at 0345 local time on April 2nd, so the April 2nd calendar day will have 25hrs elapsed, however due to the store business day changeover at 0900, this will actually occur during the April 1st store business day (open at 0900 Apr1, close at 0859 Apr2). Confusing? Yes, exactly. In order to more easily observe this behavior, our test data records a fixed price transaction at a fixed interval of every 144 seconds throughout the business day, so your tests can have patterns to observe. Why 144 seconds is a useful number here is left as an exercise for the reader. ### Why this is a good test scenario¶ - When viewed in UTC, the DST change is at 2015hrs on April 1st, and the day has 25hrs elapsed, which may confuse an inexperienced developer. - You also have a requirement that the business day windowing does not map to the calendar day (the 'local regulations' above), which mimics a typical unusual requirement in business analysis. 
- You can also see that the offset changes from +1345 to +1245 between Apr1 and Apr2 - this is what trips up naive addSeconds time zone conversions twice a year, as they would only be correct for part of the day if not carefully bounded. - In addition to this, the Chatham Islands time zone has a delightful offset in seconds which looks a lot like a typo when it changes from 49500 to 45900. - It is also a place where almost nobody has a store as the local population is only 800 people, so great for something that can be filtered from real data. This is what makes it an excellent time zone for test data. ### Test data schema¶ Let's take a look at the Schema, and why the various fields are useful. Note that we have used simple Types here to make it easier to understand, but you could also use the FixedString and other more complex types we described earlier in your real use cases. **Store Hours:** `store` String `json:$.store` , `store_close_time_local` String `json:$.store_close_time_local` , `store_close_time_utc` DateTime `json:$.store_close_time_utc` , `store_close_timezone_offset` Int32 `json:$.store_close_timezone_offset` , `store_date` Date `json:$.store_date` , `store_open_seconds_elapsed` Int32 `json:$.store_open_seconds_elapsed` , `store_open_time_local` String `json:$.store_open_time_local` , `store_open_time_utc` DateTime `json:$.store_open_time_utc` , `store_open_timezone_offset` Int32 `json:$.store_open_timezone_offset` , `timezone` String `json:$.timezone` , **Sales Data:** `store` String `json:$.store` , `sale_amount` Int32 `json:$.sale_amount` , `timestamp_local` String `json:$.timestamp_local` , `timestamp_utc` DateTime `json:$.timestamp_utc` , `timestamp_offset` Int32 `json:$.timestamp_offset` ### Schema notes¶ The tables are joined on the `store` column, in this case a `String` for readability but likely to be a more efficient `LowCardinalityString` or an `Int32` in a real use case. Local times are stored as Strings so that the underlying database is never tempted to change them to some other time representation. Having a static value to check your computations against is very useful in test data. In the Sales Data, note that the `String` of the local timestamp in the facts table is kept. In a large dataset this would bloat the table and probably not be necessary. Strictly speaking, the offset is also unnecessary as you should be able to recalculate what it was given the timestamp in UTC and the exact time zone of the event producing service. Practically speaking however, if you are going to need this information a lot for your analysis, you are simply trading off later compute time against storage. Note that a lot of metadata about the store business day in the Store Hours table is also kept - this helps to ensure that not only the analytic calculations are correct, but that the generated data is correct (see "Test data validation" below). This is a surprisingly common issue where test data is produced - it looks good to the known edge cases, but doesn't end up correct to the unknown edge cases. It's in exactly these scenarios that you want to know more about how it was produced, so you can fix it. ### Test data validation¶ In the [Data Generator Notebook](https://github.com/tinybirdco/timezone_analytics_guide/blob/main/sample_data_generator.ipynb) , you can also inspect the generated data to ensure that it conforms to the changes expected over the difficult time window of 1st April to 3rd April in the Chatham Islands. 
Review the following output and consider what you've read so far about the changes over these few days, and then take a look at the timestamps and calculations to see how each plays out: Day: 2023-04-01 opening time UTC: 2023-03-31T19:15:00+00:00 opening time local: 2023-04-01T09:00:00+13:45 timezone offset at store open: 49500 total hours elapsed during store business day: 25 total count of sales during business day: 625 and sales amount in cents: 31250 total hours elapsed during calendar day: 24 total count of sales during calendar day: 600 and sales amount in cents: 30000 closing time UTC: 2023-04-01T20:14:59+00:00 closing time local: 2023-04-02T08:59:59+12:45 timezone offset at store close: 45900 Day: 2023-04-02 opening time UTC: 2023-04-01T20:15:00+00:00 opening time local: 2023-04-02T09:00:00+12:45 timezone offset at store open: 45900 total hours elapsed during store business day: 24 total count of sales during business day: 600 and sales amount in cents: 30000 total hours elapsed during calendar day: 25 total count of sales during calendar day: 625 and sales amount in cents: 31250 closing time UTC: 2023-04-02T20:14:59+00:00 closing time local: 2023-04-03T08:59:59+12:45 timezone offset at store close: 45900 Day: 2023-04-03 opening time UTC: 2023-04-02T20:15:00+00:00 opening time local: 2023-04-03T09:00:00+12:45 timezone offset at store open: 45900 total hours elapsed during store business day: 24 total count of sales during business day: 600 and sales amount in cents: 30000 total hours elapsed during calendar day: 24 total count of sales during calendar day: 600 and sales amount in cents: 30000 closing time UTC: 2023-04-03T20:14:59+00:00 closing time local: 2023-04-04T08:59:59+12:45 timezone offset at store close: 45900 This test data generator has been through several iterations to capture the specifics of various scenarios that Tinybird customers have raised; hopefully you will also find it helpful. ## Query correctly with time zones and DST¶ Now that you've have looked at the test data, let's work through understanding some examples of the correct way to query it, and where it can go wrong. In the companion Workspace, this section is in the Pipe `Aggregate_Querying`. ### Naively querying by UTC¶ If you naively query by UTC here, you get an answer that looks correct - sales start at 9am and stop at 5pm when London is on winter time, which matches UTC. SELECT store, toStartOfFifteenMinutes(timestamp_utc) AS period_utc, sum(sale_amount) AS sale_amount, count() as sale_count FROM sales_data_raw where store = 'London' AND toDate(period_utc) = toDate('2023-03-24') GROUP BY store, period_utc ORDER BY period_utc ASC| store | period_utc | sale_amount | sale_count | | --- | --- | --- | --- | | London | 2023-03-24 09:00:00 | 88418 | 7 | | London | 2023-03-24 09:15:00 | 29159 | 3 | | London | 2023-03-24 09:30:00 | 35509 | 3 | The essential mistake here is blindly converting the UTC timestamp in `period_utc` to a date, however as the time zone aligns with UTC, it has no impact on these results. ### Naively querying a split time zone¶ However if we naively query Auckland applying a calendar day to the UTC periods, we get two split blocks at the start and end of day. This is because you're actually getting the end of the day you want, and the start of the next day, because of the ~+13 UTC offset. 
select * from sales_15min_utc_mv where store = 'Auckland' AND toDate(period_utc) = toDate('2023-03-24') ORDER BY period_utc ASC| store | period_utc | sale_amount | sale_count | | --- | --- | --- | --- | | Auckland | 2023-03-24 00:00:00 | 71538 | 6 | | Auckland | 2023-03-24 00:15:00 | 51795 | 5 | | Auckland | 2023-03-24 00:30:00 | 67106 | 7 | The same mistake is made as above, with the blind conversion of the UTC timestamp to a date, however in this case it's more obvious that it's wrong as very few time zones map 0900 local exactly to midnight. Try the same query again with the Tokyo store, and see how the mistake is easily hidden. ### Aggregating by timestamp over a period¶ A convenience of modern time zone and DST definitions is that they can all be expressed in 15 minute increments (in fact, underlying ClickHouse assumes this is true to drive processing efficiency), so **if you aren't sure what aggregation period is best for you, start with 15mins and see how your data set grows** , as you can query any larger period as groups of 15min periods. To remind you again of the test data: You have a table containing sales data for a set of stores. Your data has columns for a `store` , a `timestamp` in UTC, and a `sale_amount` . In Tinybird, you can quickly and efficiently pre-calculate the 15min periods as a Materialized View with the following query: SELECT store, toStartOfFifteenMinutes(timestamp_utc) AS period_utc, sumSimpleState(sale_amount) AS sale_amount, countState() AS sale_count FROM sales_data_raw GROUP BY store, period_utc ORDER BY period_utc ASC Using this, you can then produce any larger period for a given window of that period. See the next example for how to break it into local days. ### Querying by day and time zone¶ In the example Workspace, see the [sales_15min_utc_mv Materialized View](https://github.com/tinybirdco/timezone_analytics_guide/blob/main/tinybird/datasources/sales_15min_utc_mv.datasource). Building on the previous example, this query produces a correct aggregate by the local time zone day for a given `store`, `timezone`, `start_day` and `end_day` (inclusive). We also feature use of Tinybird parameters, such as you would use in an API Endpoint. Note that you're querying the tricky Chatham store here, and over the days with the DST change, so you can see the pattern of transactions emerge. This example uses Tinybird parameters for the likely user-submitted values of `timezone`, `start_day` and `end_day` , and you are selecting the data from the `sales_15min_utc_mv` Materialized View described above. % Select store, toDate(toTimezone(period_utc, {{String(timezone, 'Pacific/Chatham')}})) as period_local, sum(sale_amount) as sale_amount, countMerge(sale_count) as sale_count from sales_15min_utc_mv where store = {{String(store, 'Chatham')}} and period_local >= toDate({{String(start_day, '20230331')}}) and period_local <= toDate({{String(end_day, '20230403')}}) group by store, period_local order by period_local asc| store | period_local | sale_amount | sale_count | | --- | --- | --- | --- | | Chatham | 2023-03-31 | 30000 | 600 | | Chatham | 2023-04-01 | 30000 | 600 | | Chatham | 2023-04-02 | 31250 | 625 | | Chatham | 2023-04-03 | 30000 | 600 | Note that the `DateTime` has been converted to the target time zone **before** reducing to the local calendar day using `toDate` , and that the user input is converted to the same `Date` type for comparison, and the column has been renamed to reflect this change. 
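To make the ordering explicit, here is a minimal side-by-side sketch (reusing `sales_15min_utc_mv` from above and dropping the parameters); only the second query buckets rows by the store's local day:

-- Naive: toDate() on the raw UTC timestamp groups by the UTC calendar day
SELECT store, toDate(period_utc) AS period_day, sum(sale_amount) AS sale_amount
FROM sales_15min_utc_mv
WHERE store = 'Chatham'
GROUP BY store, period_day
ORDER BY period_day ASC

-- Correct: convert to the local time zone first, then reduce to a Date
SELECT store, toDate(toTimezone(period_utc, 'Pacific/Chatham')) AS period_local, sum(sale_amount) AS sale_amount
FROM sales_15min_utc_mv
WHERE store = 'Chatham'
GROUP BY store, period_local
ORDER BY period_local ASC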
Also note that the time zone parameter is a `String` , which is required by ClickHouse. It's up to you to ensure you supply the right time zone string and Store filters here. Because the Data Source in this case is a Materialized View, you should also be careful to use the correct `CountMerge` function to get the final results of the incremental field. ### Validating the results¶ You can validate the results by comparing them to the same query over the raw data, and see that the results are identical. select store, toDate(substring(timestamp_local, 1, 10)) as period_local, sum(sale_amount) as sale_amount, count() as sale_count from sales_data_raw where store = 'Chatham' and toDate(substring(timestamp_local, 1, 10)) >= toDate('20230331') and toDate(substring(timestamp_local, 1, 10)) <= toDate('20230403') group by store, period_local order by period_local asc| store | period_local | sale_amount | sale_count | | --- | --- | --- | --- | | Chatham | 2023-03-31 | 30000 | 600 | | Chatham | 2023-04-01 | 30000 | 600 | | Chatham | 2023-04-02 | 31250 | 625 | | Chatham | 2023-04-03 | 30000 | 600 | ### Querying with a lookup table¶ In our test data, the store day does not match the local calendar day - if you recall it was specified that it starts at 0900 local time. Fortunately, we can use a lookup table to map the store day to our UTC timestamps, and then use that to query the data. This is often a good idea if you have a lot of data, as it can be more efficient than using a function to calculate the mapping each time. select store, store_date, sum(sale_amount), countMerge(sale_count) as sale_count from sales_15min_utc_mv join store_hours_raw using store where period_utc >= store_hours_raw.store_open_time_utc and period_utc < store_hours_raw.store_close_time_utc and store_date >= toDate('2023-03-31') and store_date <= toDate('2023-04-03') and store = 'Chatham' group by store, store_date order by store_date asc| store | period_local | sale_amount | sale_count | | --- | --- | --- | --- | | Chatham | 2023-03-31 | 30000 | 600 | | Chatham | 2023-04-01 | 31250 | 625 | | Chatham | 2023-04-02 | 30000 | 600 | | Chatham | 2023-04-03 | 30000 | 600 | Note that the extra hour of business moves to the 1st of April, as that store day doesn't close until 0859 on the 2nd of April, which is after the DST change. ## Use Materialized Views to pre-calculate the data¶ Several of these examples have pointed to a [Materialized View](https://www.tinybird.co/docs/docs/publish/materialized-views/overview) , which is a pre-calculated table that is updated incrementally as new data arrives. These are also prepared for you in the companion Workspace, specifically for the UTC 15min rollup, and the normalized 15min rollup. ## Next steps¶ - Read Tinybird's[ "10 Commandments for Working With Time" blog post](https://www.tinybird.co/blog-posts/database-timestamps-timezones) to understand best practices and top tips for working with time. - Understand[ Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/overview) . - Learn how to dynamically aggregate time series data by[ different time intervals (rollups)](https://www.tinybird.co/docs/docs/guides/querying-data/dynamic-aggregation) . - Got a tricky use case that you want help with? Connect with us in the[ Tinybird Slack community](https://www.tinybird.co/docs/docs/community) . Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. 
--- URL: https://www.tinybird.co/docs/guides/tutorials/analytics-with-confluent Last update: 2024-10-28T11:06:14.000Z Content: --- title: "Build user-facing analytics apps with Confluent · Tinybird Docs" theme-color: "#171612" description: "Learn how to build a user-facing web analytics application with Confluent and Tinybird." --- # Build a user-facing web analytics application with Confluent and Tinybird¶ In this guide you'll learn how to take data from Kafka and build a user-facing web analytics dashboard using Confluent and Tinybird. [GitHub Repository](https://github.com/tinybirdco/demo_confluent_charts/tree/main) <-figure-> ![Tinybird Charts showing e-commerce events](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Ftutorial-confluent-chart-1.png&w=3840&q=75) In this tutorial, you will learn how to: 1. Connect Tinybird to a Kafka topic. 2. Build and publish Tinybird API Endpoints using SQL. 3. Create 2 Charts without having to code from scratch. ## Prerequisites¶ To complete this tutorial, you'll need: 1. A[ free Tinybird account](https://www.tinybird.co/signup) 2. An empty Tinybird Workspace 3. A Confluent account 4. Node.js >=20.11 This tutorial includes a [Next.js](https://nextjs.org/) app for frontend visualization, but you don't need working familiarity with TypeScript - just copy & paste the code snippets. ## 1. Setup¶ Clone the `demo_confluent_charts` repo. ## 2. Create your data¶ ### Option 1: Use your own existing data¶ In Confluent, create a Kafka topic with simulated e-commerce events data. Check [this file](https://github.com/tinybirdco/demo_confluent_charts/blob/main/tinybird/datasources/ecomm_events.datasource) for the schema outline to follow. ### Option 2: Mock the data¶ Use Tinybird's [Mockingbird](https://mockingbird.tinybird.co/docs) , an open source mock data stream generator, to stream mock web events instead. In the repo, navigate to `/datagen` and run `npm i` to install the dependencies. Create an `.env` and replace the default Confluent variables: cp .env.example .env Run the mock generator script: node mockConfluent.js ## 3. Connect Confluent <> Tinybird¶ In your Tinybird Workspace, create a new [Data Source](https://www.tinybird.co/docs/docs/concepts/data-sources) using the native [Confluent connector](https://www.tinybird.co/docs/docs/ingest/confluent) . Paste in the bootstrap server, rename the connection to `tb_confluent` , then paste in your API key and secret. Select "Next". Search for and select your topic, and select "Next". Ingest from the earliest offset, then under "Advanced settings" > "Sorting key" select `timestamp`. Rename the Data Source to `ecomm_events` and select "Create". Your Data Source is now ready, and you've connected Confluent to Tinybird! You now have something that's like a database table and a Kafka consumer ***combined*** . Neat. ## 4. Transform your data¶ ### Query your data stream¶ Your data should now be streaming in, so let's do something with it. In Tinybird, you can transform data using straightforward SQL in chained Nodes that form a Pipe. Create a new [Pipe](https://www.tinybird.co/docs/docs/concepts/pipes) and rename it `sales_trend` . In the first Node space, paste the following SQL: SELECT timestamp, sales FROM ecomm_events WHERE timestamp >= now() - interval 7 day This gets the timestamp and sales from just the last 7 days. Run the query and rename the Node `filter_data`. 
In the second Node space, paste the following: SELECT toDate(timestamp) AS ts, sum(sales) AS total_sales FROM filter_data GROUP BY ts ORDER BY ts This casts the timestamp to a date as `ts` , and sums up the sales - meaning you can get a trend of sales by day. Run the query and rename the Node `endpoint`. ### Publish your transformed data¶ In the top right of the screen, select "Create API Endpoint" and select the `endpoint` Node. Congratulations! It's published and ready to be consumed. ## 5. Create a Tinybird Chart¶ In the top right of the screen, select "Create Chart". Rename the chart "Sales Trend" and select an Area Chart. Under the "Data" tab, select `ts` as the index and `total_sales` as the category. You should see a Chart magically appear! In the top right of the screen, select "Save". ## 6. Run an app locally¶ View the component code for the Chart by selecting the code symbol ( `<>` ) above it. Copy this code and paste it into a new file in the `components` folder called `SalesTrend.tsx`. In `page.tsx` , replace `
Chart 1
` with your new Chart `` . Save and view in the browser with `npm run dev` . You should see your Chart! ### Create a second Pipe --> Chart¶ Create a second Pipe in Tinybird called `utm_sales`: SELECT utm_source, sum(sales) AS total_sales FROM ecomm_events WHERE timestamp >= now() - interval 7 day GROUP BY utm_source ORDER BY total_sales DESC This gets sales by utm over the last 7 days. Run the query and rename the Node `endpoint` .... Then, you guessed it! Publish it as an Endpoint, create a Chart, and get the code. This time, create a donut Chart called "UTM Sales" with `utm_source` as the index and `total_sales` as the category. Check the "Legend" box and play around with the colors to create clear differentiators. Create a new component file called `UTMSales.tsx` and import in `page.tsx` replacing Chart 2. You did it! <-figure-> ![Tinybird Charts showing e-commerce events](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Ftutorial-confluent-chart-1.png&w=3840&q=75) ## Next steps¶ - Read more about[ Tinybird Charts](https://www.tinybird.co/docs/docs/publish/charts) . - Use Charts internally to[ monitor latency](https://www.tinybird.co/docs/docs/monitoring/latency#how-to-visualize-latency) in your own Workspace. --- URL: https://www.tinybird.co/docs/guides/tutorials/bigquery-dashboard Last update: 2024-11-07T09:52:34.000Z Content: --- title: "Build user-facing dashboard with BigQuery · Tinybird Docs" theme-color: "#171612" description: "Learn how to build a user-facing dashboard using Tinybird and BigQuery." --- # Build a user-facing analytics dashboard with BigQuery and Tinybird¶ In this guide you'll learn how to take data from BigQuery and build a user-facing analytics dashboard using Tinybird, Next.js, and Tremor components. You'll end up with a dashboard and enough familiarity with Tremor to adjust the frontend & data visualization for your own projects in the future. Google BigQuery is a serverless data warehouse, offering powerful online analytical processing (OLAP) computations over large data sets with a familiar SQL interface. Since its launch in 2010, it’s been widely adopted by Google Cloud users to handle long-running analytics queries to support strategic decision-making through business intelligence (BI) visualizations. Sometimes, however, you want to extend the functionality of your BigQuery data beyond business intelligence: For instance, real-time data visualizations that can be integrated into user-facing applications. As outlined in [the Tinybird blog post on BigQuery dashboard options](https://www.tinybird.co/blog-posts/bigquery-real-time-dashboard) , you can build Looker Studio dashboards over BigQuery data, but they'll struggle to support user-facing applications that require high concurrency, fresh data, and low-latency API responses. Tinybird is the smart option for fast and real-time. Let's get building! [GitHub Repository](https://github.com/tinybirdco/bigquery-dashboard) <-figure-> ![Analytics dashboard build with BigQuery data, Tinybird Endpoints, and Tremor components in a Next.js app](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Ftutorial-bigquery-dashboard.png&w=3840&q=75) Imagine you're a ***huge*** baseball fan. You want to build a real-time dashboard that aggregates up-to-the-moment accurate baseball stats from teams around the world, and gives you the scoop on all your favorite players. This tutorial explains how to build a really nice-looking prototype version. In this tutorial, you'll learn how to: 1. 
Ingest your existing BigQuery data into Tinybird. 2. Process and transform that data with accessible SQL. 3. Publish the transformations as real-time APIs. 4. Use Tremor components in a Next.js app to build a clean, responsive, real-time dashboard that consumes those API Endpoints. ## Prerequisites¶ To complete this tutorial, you'll need: 1. A[ free Tinybird account](https://www.tinybird.co/signup) 2. A BigQuery account 3. Node.js >=20.11 This tutorial includes a [Next.js](https://nextjs.org/) app and [Tremor](https://www.tremor.so/components) components for frontend visualization, but you don't need working familiarity with TypeScript or JavaScript - just copy & paste the code snippets. ## 1. Create a Tinybird Workspace¶ Navigate to the Tinybird web UI ( [app.tinybird.co](https://app.tinybird.co/) ) and create an empty Tinybird Workspace (no starter kit) called `bigquery_dashboard` in your preferred region. ## 2. Connect your BigQuery dataset to Tinybird¶ To get your BigQuery data into Tinybird, you’ll use the [Tinybird BigQuery Connector](https://www.tinybird.co/docs/docs/ingest/bigquery). Download [this sample dataset](https://github.com/tinybirdco/bigquery-dashboard/blob/main/baseball_stats.csv) that contains 20,000 rows of fake baseball stats. Upload it to your BigQuery project as a new CSV dataset. Next, follow the [steps in the documentation](https://www.tinybird.co/docs/docs/ingest/bigquery#load-a-bigquery-table-in-the-ui) to authorize Tinybird to view your BigQuery tables, select the table you want to sync, and set a sync schedule. Call the Data Source `baseball_game_stats`. Tinybird will copy the contents of your BigQuery table into a Tinybird Data Source and ensure the Data Source stays in sync with your BigQuery table. Tinybird can sync BigQuery tables as often as every 5 minutes. If you need fresher data in your real-time dashboards, consider sending data to Tinybird via alternative sources such as [Apache Kafka](https://www.tinybird.co/docs/docs/ingest/kafka), [Confluent Cloud](https://www.tinybird.co/docs/docs/ingest/confluent), [Google Pub/Sub](https://www.tinybird.co/docs/docs/guides/ingesting-data/ingest-from-google-pubsub) , or Tinybird’s native [HTTP streaming endpoint](https://www.tinybird.co/docs/docs/guides/ingesting-data/ingest-from-the-events-api). ## 3. Create some Pipes¶ In Tinybird, a [Pipe](https://www.tinybird.co/docs/docs/concepts/pipes) is a transformation definition comprised of a series of SQL statements. You can build metrics through a series of short, composable [Nodes](https://www.tinybird.co/docs/docs/concepts/pipes#nodes) of SQL. Think of Pipes as a way to build SQL queries without always needing to write common table expressions or subqueries, as these can be split out into reusable, independent Nodes. For example, here's a simple single-Node Pipe definition that calculates the season batting average for each player: ##### player\_batting\_percentages.pipe SELECT player_name AS "Player Name", sum(stat_hits)/sum(stat_at_bats) AS "Batting Percentage" FROM baseball_game_stats GROUP BY "Player Name" ORDER BY "Batting Percentage" DESC Create your first Pipe from your newly-created BigQuery Data Source by selecting “Create Pipe” in the top right corner of the Tinybird UI. Paste in the SQL above and run the query. Rename the Pipe `player_batting_percentages`. Naming your Pipe something descriptive is important, as the Pipe name will be used as the URL slug for your API Endpoint later on. ## 4. 
Extend Pipes with Query Parameters¶ Every good dashboard is interactive. You can make your Tinybird queries interactive using Tinybird’s templating language to [generate query parameters](https://www.tinybird.co/docs/docs/query/query-parameters) . In Tinybird, you add query parameters using `{{(,}}` , defining the data type of the parameter, its name, and an optional default value. For example, you can extend the SQL query in the previous step to dynamically change the number of results returned from the Pipe, by using a `limit` parameter and a default value of 10: ##### player\_batting\_percentages.pipe plus query parameters SELECT player_name AS "Player Name", sum(stat_hits)/sum(stat_at_bats) AS "Batting Percentage" FROM baseball_game_stats GROUP BY "Player Name" ORDER BY "Batting Percentage" DESC LIMIT {{UInt16(limit, 10, description="The number of results to display")}} Replace the SQL in your Pipe with this code snippet. Run the query and rename the Node `endpoint`. The `%` character at the start of a Tinybird SQL query shows there's a [dynamic query parameter](https://www.tinybird.co/docs/docs/query/query-parameters#define-dynamic-parameters) coming up. ## 5. Publish your Pipes as APIs¶ The magic of Tinybird is that you can instantly publish your Pipes as fully-documented, scalable REST APIs instantly. From the Pipe definition in the Tinybird UI, select “Create API Endpoint” in the top right corner, select the `endpoint` Node. Congratulations! You just ingested BigQuery data, transformed it, and published it as a Tinybird API Endpoint! ### Create additional Pipes¶ Create these additional 5 Pipes (they can also be found in the [project repository](https://github.com/tinybirdco/bigquery-dashboard/tree/main/data-project/pipes) ). Rename them as they are titled in each snippet, and call each Node `endpoint` . Read through the SQL to get a sense of what each query does, then run and publish each one as its own API Endpoint: ##### batting\_percentage\_over\_time % SELECT game_date AS "Game Date", sum(stat_hits)/sum(stat_at_bats) AS "Batting Percentage" FROM baseball_game_stats WHERE player_team = {{String(team_name, 'BOS', required=True)}} GROUP BY "Game Date" ORDER BY "Game Date" ASC ##### most\_hits\_by\_type % SELECT player_name AS name, sum({{ column(hit_type, 'stat_hits') }}) AS value FROM baseball_game_stats GROUP BY name ORDER BY value DESC LIMIT 7 ##### opponent\_batting\_percentages % SELECT game_opponent AS "Team", sum(stat_hits) / sum(stat_at_bats) AS "Opponent Batting Percentage" FROM baseball_game_stats GROUP BY "Team" ORDER BY "Opponent Batting Percentage" ASC LIMIT {{ UInt16(limit, 10) }} ##### player\_batting\_percentages % SELECT player_name AS "Player Name", sum(stat_hits)/sum(stat_at_bats) AS "Batting Percentage" FROM baseball_game_stats GROUP BY "Player Name" ORDER BY "Batting Percentage" DESC LIMIT {{UInt16(limit, 10)}} ##### team\_batting\_percentages % SELECT player_team AS "Team", sum(stat_hits) / sum(stat_at_bats) AS "Batting Percentage" FROM baseball_game_stats GROUP BY "Team" ORDER BY "Batting Percentage" DESC LIMIT {{ UInt16(limit, 10) }} 1 Data Source, 6 Pipes: Perfect. Onto the next step. ## 6. Create a Next.js app¶ This tutorial uses Next.js, but you can visualize Tinybird APIs just about anywhere, for example with an app-building tool like [Retool](https://www.tinybird.co/blog-posts/service-data-sources-and-retool) or a monitoring platform like [Grafana](https://www.tinybird.co/blog-posts/tinybird-grafana-plugin-launch). 
In your terminal, create a project folder and inside it create your Next.js app, using all the default options: ##### Create a Next app mkdir bigquery-tinybird-dashboard cd bigquery-tinybird-dashboard npx create-next-app Tinybird APIs are accessible via [Tokens](https://www.tinybird.co/docs/docs/concepts/auth-tokens) . In order to run your dashboard locally, you'll need to create a `.env.local` file at the root of your new project: ##### Create .env.local at root of my-app touch .env.local And include the following: ##### Set up environment variables NEXT_PUBLIC_TINYBIRD_HOST="YOUR TINYBIRD API HOST" # Your regional API host e.g. https://api.tinybird.co NEXT_PUBLIC_TINYBIRD_TOKEN="YOUR SIGNING TOKEN" # Use your Admin Token as the signing token Replace the Tinybird API hostname/region with the [right API URL region](https://www.tinybird.co/docs/docs/api-reference/overview#regions-and-endpoints) that matches your Workspace. Your Token lives in the Workspace under "Tokens". ## 7. Define your APIs in code¶ To support the dashboard components you’re about to build, it's a great idea to create a helper file that contains all your Tinybird API references. In the project repo, that’s called `tinybird.js` and it looks like this: ##### tinybird.js helper file const playerBattingPercentagesURL = (host, token, limit) => `https://${host}/v0/pipes/player_batting_percentages.json?limit=${limit}&token=${token}` const teamBattingPercentagesURL = (host, token, limit) => `https://${host}/v0/pipes/team_batting_percentages.json?limit=${limit}&token=${token}` const opponentBattingPercentagesURL = (host, token, limit) => `https://${host}/v0/pipes/opponent_batting_percentages.json?limit=${limit}&token=${token}` const battingPercentageOverTimeURL = (host, token, team_name) => `https://${host}/v0/pipes/batting_percentage_over_time.json?team_name=${team_name}&token=${token}` const hitsByTypeURL = (host, token, hit_type) => `https://${host}/v0/pipes/most_hits_by_type.json?hit_type=${hit_type}&token=${token}` const fetchTinybirdUrl = async (fetchUrl, setData, setLatency) => { const data = await fetch(fetchUrl) const jsonData = await data.json(); setData(jsonData.data); setLatency(jsonData.statistics.elapsed) } export { fetchTinybirdUrl, playerBattingPercentagesURL, teamBattingPercentagesURL, opponentBattingPercentagesURL, battingPercentageOverTimeURL, hitsByTypeURL } Inside `/src/app` , create a new subfolder called `/services` and paste the snippet into a new `tinybird.js` helper file. ## 8. Build your dashboard components¶ This tutorial uses the [Tremor React library](https://tremor.so/) because it provides a clean UI out of the box with very little code. You could easily use [ECharts](https://echarts.apache.org/en/index.html) or something similar if you prefer. ### Add Tremor to your Next.js app¶ You're going to use Tremor to create a simple bar chart that displays the signature count for each organization. Tremor provides stylish React chart components that you can deploy easily and customize as needed. Inside your app folder, install Tremor with the CLI: ##### Install Tremor npx @tremor/cli@latest init Select Next as your framework and allow Tremor to overwrite your existing `tailwind.config.js`. ### Create dashboard component files¶ Your final dashboard contains 3 Bar Charts, 1 Area Chart, and 1 Bar List. You’ll use Tremor Cards to display these components, and each one will have an interactive input. 
In addition, you'll show the API response latency underneath the Chart (just so you can show off about how "real-timey" the dashboard is). Here's the code for the Player Batting Percentages component ( `playerBattingPercentages.js` ). It sets up the file, defines the limit parameter, then renders the Chart components:

"use client";
import { Card, Title, Subtitle, BarChart, Text, NumberInput, Flex } from '@tremor/react'; // Tremor components
import React, { useState, useEffect } from 'react';
import { fetchTinybirdUrl, playerBattingPercentagesURL } from '../services/tinybird.js' // Tinybird API

// utilize useState/useEffect to get data from Tinybird APIs on change
const PlayerBattingPercentages = ({host, token}) => {
  const [player_batting_percentages, setData] = useState([{
    "Player Name": "",
    "Batting Percentage": 0,
  }]);
  // set latency from the API response
  const [latency, setLatency] = useState(0);
  // set limit parameter when the component input is changed
  const [limit, setLimit] = useState(10);
  // format the numbers on the component
  const valueFormatter = (number) => `${new Intl.NumberFormat("us").format(number).toString()}`;
  // set the Tinybird API URL with query parameters
  let player_batting_percentages_url = playerBattingPercentagesURL(host, token, limit)
  useEffect(() => {
    fetchTinybirdUrl(player_batting_percentages_url, setData, setLatency)
  }, [player_batting_percentages_url]);
  // build the Tremor component
  return (
    <Card>
      <Title>Player Batting Percentages</Title>
      <Subtitle>All Players</Subtitle>
      <Flex>
        <Text># of Results</Text>
        <NumberInput onValueChange={(value) => setLimit(value)} />
      </Flex>
// Build the bar chart with the data received from the Tinybird API Latency: {latency*1000} ms // Add the latency metric ); }; export default PlayerBattingPercentages; In the project repo, you’ll find the 5 dashboard components you need, inside the `src/app/components` directory. Each one renders a dashboard component to display the data received by one of the Tinybird APIs. It's time to build them out. For this tutorial, just recreate the same files in your app, pasting in the JavaScript (or downloading the files and dropping them in to your app directory). When building your own dashboard in future, use this as a template and build to fit your needs! ## 9. Compile components into a dashboard¶ Final step! Update your `page.tsx` file to render a nicely-organized dashboard with your 5 components: Replace the contents of `page.tsx` with [this file](https://github.com/tinybirdco/bigquery-dashboard/blob/main/src/app/page.js). The logic in this page gets your Tinybird Token from your local environment variables to be able to access the Tinybird APIs, then renders the 5 components you just built in a Tremor [Grid](https://blocks.tremor.so/blocks/grid-lists). To visualize your dashboard, run it locally with `npm run dev` and open http://localhost:3000. You’ll see your complete real-time dashboard! <-figure-> ![Analytics dashboard build with BigQuery data, Tinybird Endpoints, and Tremor components in a Next.js app](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Ftutorial-bigquery-dashboard.png&w=3840&q=75) Notice the latencies in each dashboard component. This is the Tinybird API request latency. This is not using any sort of cache or query optimization; each request is directly querying the 20,000 rows in the table and returning a response. As you interact with the dashboard and change inputs, the APIs respond. In this case, that’s happening in just a few milliseconds. Now ***that’s*** a fast dashboard. ### Optional: Expand your dashboard¶ You've got the basics: An active Workspace and Data Source, knowledge of how to build Pipes, and access to the [Tremor docs](https://www.tremor.so/docs/getting-started/installation) . Build out some more Pipes, API Endpoints, and visualizations! You can also spend some time [optimizing your data project](https://www.tinybird.co/docs/docs/query/sql-best-practices) for faster responses and minimal data processing using fine-tuned indexes, [Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/overview) , and more. ## Next steps¶ - Investigate the[ GitHub repository for this project](https://github.com/tinybirdco/bigquery-dashboard) in more depth. - Understand today's real-time analytics landscape with[ Tinybird's definitive guide](https://www.tinybird.co/blog-posts/real-time-analytics-a-definitive-guide) . - Learn how to implement[ multi-tenant security](https://www.tinybird.co/blog-posts/multi-tenant-saas-options) in your user-facing analytics. --- URL: https://www.tinybird.co/docs/guides/tutorials/leaderboard Last update: 2024-11-13T15:42:17.000Z Content: --- title: "Build a real-time game leaderboard · Tinybird Docs" theme-color: "#171612" description: "Learn how to build a real-time leaderboard using Tinybird." --- # Build a real-time game leaderboard¶ In this guide you'll learn how to build a real-time leaderboard using Tinybird. Leaderboards are a visual representation that ranks things by one or more attributes. 
For gaming use cases, commonly-displayed attributes include total points scored, high game scores, and number of games played. But leaderboards are used for far more than games. For example, app developers use leaderboards to display miles biked, donations raised, documentation pages visited most often, and countless other examples - basically, anywhere there is some user attribute that can be ranked to compare results. This tutorial is a great starting point for building your own leaderboard. [GitHub Repository](https://github.com/tinybirdco/demo-user-facing-leaderboard) <-figure-> ![A fast, fun leaderboard built on Tinybird](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fleaderboard-tutorial-2.png&w=3840&q=75) In this tutorial, you'll add a gaming leaderboard to the [Flappybird game](https://flappy.tinybird.co/) . Not only is this game fun to play, it's also a vehicle for demonstrating user-facing features. For example, the game features a leaderboard so you can see how your most recent score compares to other top players. You will: 1. Generate a mock game event stream that mimics a high-intensity Flappybird global tournament. 2. Post these mock events to your Tinybird Workspace using the Events API. 3. Transform (rank) this data using Tinybird Pipes and SQL. 4. Optimize your data handling with a Materialized View. 5. Publish the results as a Tinybird API Endpoint. 6. Generate a leaderboard that makes calls to your API Endpoint securely and directly from the browser. Each time a leaderboard request is made, up-to-the-second results are returned for the leaderboard app to render. When embedded in the game, a `leaderboard` API Endpoint is requested when a game ends. For this tutorial, the app will make requests on a specified interval and have a button for ad-hoc requests. Game events consist of three values: - `core` - Generated when a point is scored. - `game_over` - Generated when a game ends. - `purchase` - Generated when a ‘make-the-game-easier’ coupon is redeemed. Each event object has the following JSON structure: ##### Example JSON event object { "session_id": "1f2c8bcf-8a5b-4eb1-90bf-8726e63d81b7", "name": "Marley", "timestamp": "2024-06-20T19:06:15.373Z", "type": "game_over", "event": "Mockingbird" } Here's how it all fits together: <-figure-> ![Events and process of the leaderboard](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fleaderboard-tutorial-1.png&w=3840&q=75) ## Prerequisites¶ To complete this tutorial, you'll need the following: 1. A[ free Tinybird account](https://www.tinybird.co/signup) 2. An empty Tinybird Workspace 3. Node.js >=20.11 4. Python >=3.8 ## 1. Create a Tinybird Workspace¶ Navigate to the Tinybird web UI ( [app.tinybird.co](https://app.tinybird.co/) ) and create an empty Tinybird Workspace (no starter kit) called `tiny_leaderboard` in your preferred region. ### Create a Data Source for events¶ The first step with any Tinybird project is to create Data Sources to work with. For this tutorial, you have two options. The first is to create a Data Source based on a schema that you define. The alternative is to rely on the [Mockingbird](https://mockingbird.tinybird.co/docs) tool used to stream mock data to create the Data Source for you. While the Mockingbird method is faster, building your own Data Source gives you more control and introduces some fundamental concepts along the way. #### Option 1: Create a Data Source using a written schema¶ In the Tinybird UI, add a new Data Source and use the `Write schema` option. 
In the schema editor, use the [following schema](https://github.com/tinybirdco/demo-user-facing-leaderboard/blob/main/tinybird/datasources/game_events.datasource): ##### Data Source schema SCHEMA > `name` String `json:$.name`, `session_id` String `json:$.session_id`, `timestamp` DateTime64(3) `json:$.timestamp`, `type` LowCardinality(String) `json:$.type`, `event` String `json:$.event` ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYear(timestamp)" ENGINE_SORTING_KEY "event, name, timestamp" Name the Data Source `game_events` and select `Create Data Source`. This schema definition shows how the incoming JSON events are parsed and assigned to each of schema fields. The definition also defines [database table ‘engine’ details](https://www.tinybird.co/docs/concepts/data-sources#supported-engines-settings) of the underlying ClickHouse® instance. Tinybird projects are made of Data Source and Pipe definition files like this example, and they can be managed like any other code project using Git. #### Option 2: Create a Data Source with Mockingbird¶ As part of the Mockingbird configuration (see below), you'll provide the name of the Data Source to write the events to. If that Data Source does not already exist, a new Data Source with that name will be created with an automatically-generated schema. This auto-inferred schema may match your expectations, but it may lack important features. For example, automatically-generated schema will not apply the `LowCardinality` operator, a commonly-used operator that can make data lookups more efficient. Having Mockingbird auto-create your exploratory Data Source is a great way to explore Tinybird. As you begin to prototype and design production systems, you should anticipate creating new Data Sources by providing a schema design. Now you've created your main Data Source, it's ready to receive events! ## 2. Create a mock data stream¶ In a real-life scenario, you'd stream your game events into the `game_events` Data Source. For this tutorial, you'll use [Mockingbird](https://mockingbird.tinybird.co/docs) , an open source mock data stream generator, to stream mock events instead. Mockingbird generates a JSON payload based on a predefined schema and posts it to the [Tinybird Events API](https://www.tinybird.co/docs/docs/ingest/events-api) , which then writes the data to your Data Source. ### Generate fake data¶ Use [this Mockingbird link](https://mockingbird.tinybird.co/?host=eu_gcp&datasource=game_events&eps=10&withLimit=on&generator=Tinybird&endpoint=eu_gcp&limit=-1&generatorName=Tinybird&template=Flappybird&schema=Preset) to generate fake data for the `game_events` Data Source. Using this link ^ provides a pre-configured schema. Enter your Workspace admin Token and select the Host region that matches your Workspace region. Select `Save` , then scroll down and select `Start Generating!`. Replace the Tinybird API hostname/region with the [right API URL region](https://www.tinybird.co/docs/docs/api-reference/overview#regions-and-endpoints) that matches your Workspace. Your Token lives in the Workspace under "Tokens". In the Tinybird UI, confirm that the `game_events` Data Source is successfully receiving data. Leaderboards typically leverage a concise data schema with just a user/item name, the ranked attribute, and a timestamp. 
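As a quick illustration of that shape, a bare-bones ranking over the `game_events` Data Source you just created is a single aggregation; this is only a sketch, and the `leaderboard` Pipe you build in step 3 refines it per session and adds query parameters:

SELECT name AS player_id, count() AS score
FROM game_events
WHERE type = 'score'
GROUP BY player_id
ORDER BY score DESC
LIMIT 10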
This tutorial is based on this schema: - `name` String - `session_id` String - `timestamp` DateTime64(3) - `type` LowCardinality(String) - `event` String Ranking algorithms can be based on a single score, time-based metrics, or weighted combinations of factors. ## 3. Transform and publish your data¶ Your Data Source is collecting events, so now it's time to create some [Pipes](https://www.tinybird.co/docs/docs/concepts/pipes) . Pipes are made up of chained, reusable SQL [Nodes](https://www.tinybird.co/docs/docs/concepts/pipes#nodes) and form the logic that will rank the results. You'll start by creating a `leaderboard` Pipe with two Nodes. The first Node will return all ‘score’ events. The second Node will take those results and count these events by player and session (which defines a single game), and return the top 10 results. ### Create a Pipe¶ In the Tinybird UI, create a new Pipe called `leaderboard` . To begin with, you'll use some basic SQL that isn't fully optimized, and that's ok! You'll optimize it later. Paste in the following SQL and rename the first Node `get_all_scores`: ##### get\_all\_scores Node % SELECT name AS player_id, timestamp, session_id, event FROM game_events WHERE type = 'score' AND event == {{ String(event_param, 'Mockingbird', description="Event to filter on") }} This query returns all events where the type is `score`. Note that this Node creates a query parameter named `event_param` using the [Tinybird templating syntax](https://www.tinybird.co/docs/query/query-parameters) . This instance of Flappybird supports an ‘event’ attribute that supports organizing players, games, and events into separate groups. As shown above, incoming Mockingbird game events have a `"event": "Mockingbird"` attribute. Select "Run" and add a new Node underneath, called `endpoint` . Paste in: ##### endpoint Node SELECT player_id, session_id, event, count() AS score FROM get_all_scores GROUP BY player_id, session_id, event ORDER BY score DESC LIMIT 10 Select "Run", then select "Create API Endpoint". Congrats! Your data is now ranked, published, and available for consuming. ## 4. Optimize with Materialized Views¶ Before you run the frontend for your leaderboard, there are a few optimizations to make. Even with small datasets, it's a great habit to get into. [Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/overview) are updated as data is ingested, and create intermediate states that are merged with already-processed data. This ability to keep track of already-processed data, and combine it with recently arrived data, helps keep your API Endpoints performance super efficient. The Materialized View (MV) continuously re-evaluates queries as new events are inserted, reducing both latency and processed-data-per-query. In this case, the MV you create will pre-calculate the top scores, and merge those with recently-received events. This significantly improves query performance by reducing the amount of data that needs to be processed for each leaderboard request. To create a new Materialized View, begin by adding a new Pipe and call it `user_stats_mv` . 
Then paste the following SQL into the first Node: SELECT event, name AS player_id, session_id, countIfState(type = 'score') AS scores, countIfState(type = 'game_over') AS games, countIfState(type = 'purchase') AS purchases, minState(timestamp) AS start_ts, maxState(timestamp) AS end_ts FROM game_events GROUP BY event, player_id, session_id This query relies on the `countIfState` function, which includes the `-State` operator to maintain intermediate states containing recent data. When triggered by a `-Merge` operator (see below), these intermediate states are combined with the pre-calculated data. The `countIfState` function is used to maintain counts of each type of game event. Name this Node `populate_mv` , then [publish it as a Materialized View](https://www.tinybird.co/docs/publish/materialized-views/overview) . Name your Materialized View `user_stats`. You now have a new Data Source called `user_stats` , which is a Materialized View that is continuously updated with the latest game events. As you will see next, the `-State` modifier that maintains intermediate states as new data arrives will be paired with a `-Merge` modifier in Pipes that pull from the `user_stats` Data Source.
## 5. Update leaderboard Pipe¶ Now that `user_stats` is available, you can now rebuild the `leaderboard` Pipe to take advantage of this more efficient Data Source. This step will help prepare your leaderboard feature to handle massive amounts of game events while serving requests to thousands of users. The updated leaderboard Pipe will consist of three Nodes: - `rank_games` - Applies the countMerge(scores) function to get the current total from the `user_stats` Data Source. - `last_game` - Retrieves the score from the player's most recent game and determines the player's rank. - `endpoint` - Combines the results of these two Nodes and ranks by score. Note that the `last_game` Node introduces the *user-facing* aspect of the leaderboard. As seen below, this Node retrieves a specific user's data and blends it into the leaderboard results. To get started, update the `leaderboard` Pipe to use the `user_stats` Materialized View. Return to the `leaderboard` Pipe and un-publish it. Now, change the name of the first Node to `rank_games` and update the SQL to: ##### rank\_games Node % SELECT ROW_NUMBER() OVER (ORDER BY total_score DESC, t) AS rank, player_id, session_id, countMerge(scores) AS total_score, maxMerge(end_ts) AS t FROM user_stats GROUP BY player_id, session_id ORDER BY rank A few things to notice here: 1. The `rank_games` Node now uses the `user_stats` Materialized View instead of the `game_events` Data Source. 2. The use of the `countMerge(scores)` function. The `-Merge` operator triggers the MV-based `user_stats` Data Source to combine any intermediate states with the pre-calculated data and return the results. 3. The use of the `ROW_NUMBER()` window function that returns a ranking of top scores. These rankings are based on the merged scores (aliased as `total_score` ) retrieved from the `user_stats` Data Source. Next, change the name of the second Node to `last_game` and update the SQL to: ##### last\_game Node % SELECT argMax(rank, t) AS rank, player_id, argMax(session_id, t) AS session_id, argMax(total_score, t) AS total_score FROM rank_games WHERE player_id = {{ String(player_id, 'Jim', description="Player to filter on", required=True) }} GROUP BY player_id This query returns the highest rank of a specified player and introduces a `player_id` query parameter.
To combine these results, add a new Node called `endpoint` and paste the following SQL: ##### endpoint Node SELECT * FROM ( SELECT rank, player_id, session_id, total_score FROM rank_games WHERE (player_id, session_id) NOT IN (SELECT player_id, session_id FROM last_game) LIMIT 10 UNION ALL SELECT rank, player_id, session_id, total_score FROM last_game ) ORDER BY rank ASC This query applies the `UNION ALL` statement to combine the two result sets. Note that the selected attribute data types must match to be combined. This completes the `leaderboard` Pipe. Publish it as an API Endpoint. Now that the final version of the 'leaderboard' Endpoint has been published, create one last Pipe in the UI. This one gets the overall stats for the leaderboard, the number of players and completed games. Name the Pipe `get_stats` and create a single Node named `endpoint`: ##### endpoint Node in the get\_stats Pipe WITH player_count AS ( SELECT COUNT(DISTINCT player_id) AS players FROM user_stats ), game_count AS ( SELECT COUNT(*) AS games FROM game_events WHERE type == 'game_over' ) SELECT players, games FROM player_count, game_count Publish this Node as an API Endpoint. You're ready to get it all running! ## 6. Run your app¶ Clone the `demo-user-facing-leaderboard` repo locally. Install the app dependencies by running this command from the `app` dir of the cloned repo: npm install ### Add your Tinybird settings as environment variables¶ Create a new `.env.local` file: touch .env.local Copy your Tinybird Admin Token, Workspace UUID (Workspace > Settings > Advanced settings > `...` ), and API host url from your Tinybird Workspace into the new `.env.local`: TINYBIRD_SIGNING_TOKEN="YOUR SIGNING TOKEN" # Use your Admin Token as the signing token TINYBIRD_WORKSPACE="YOUR WORKSPACE ID" # The UUID of your Workspace NEXT_PUBLIC_TINYBIRD_HOST="YOUR TINYBIRD API HOST e.g. https://api.tinybird.co" # Your regional API host Replace the Tinybird API hostname/region with the [right API URL region](https://www.tinybird.co/docs/docs/api-reference/overview#regions-and-endpoints) that matches your Workspace. Your Token lives in the Workspace under "Tokens". ### Run your app¶ Run your app locally and navigate to [http://localhost:3000](http://localhost:3000/): npm run dev <-figure-> ![A fast, fun leaderboard built on Tinybird](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fleaderboard-tutorial-2.png&w=3840&q=75) Congrats! You now have an optimized gaming leaderboard ingesting real-time data! Have a think about how you'd adapt or extend it for your own use case. ## Next steps¶ - Read the in-depth blog post on[ building a real-time leaderboard](https://www.tinybird.co/blog-posts/building-real-time-leaderboards-with-tinybird) . - Understand today's real-time analytics landscape with[ Tinybird's definitive guide](https://www.tinybird.co/blog-posts/real-time-analytics-a-definitive-guide) . - Learn how to implement[ multi-tenant security](https://www.tinybird.co/blog-posts/multi-tenant-saas-options) in your user-facing analytics. Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/guides/tutorials/real-time-dashboard Last update: 2024-11-12T11:45:41.000Z Content: --- title: "Build a real-time dashboard with Tremor & Next.js · Tinybird Docs" theme-color: "#171612" description: "Learn how to build a user-facing web analytics dashboard using Tinybird, Tremor, and Next.js." 
--- # Build a real-time dashboard¶ In this guide you'll learn how to build a real-time analytics dashboard from scratch, for free, using just 3 tools: Tinybird, Tremor, and Next.js. You'll end up with a dashboard and enough familiarity with Tremor to adjust the frontend & data visualization for your own projects in the future. [GitHub Repository](https://github.com/tinybirdco/demo-user-facing-saas-dashboard-signatures) Imagine you’re a [DocuSign](https://www.docusign.com/) competitor. You’re building a SaaS to disrupt the document signature space, and as a part of that, you want to give your users a real-time data analytics dashboard so they can monitor how, when, where, and what is happening with their documents in real time. In this tutorial, you'll learn how to: 1. Use Tinybird to capture events (like a document being sent, signed, or received) using the Tinybird Events API. 2. Process them with SQL. 3. Publish the transformations as real-time APIs. 4. Use Tremor components in a Next.js app to build a clean, responsive, real-time dashboard. Here's how it all fits together: <-figure-> ![Diagram showing the data flow from Tinybird --> Next.js --> Tremor](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Ftutorial-real-time-dashboard-data-flow.png&w=3840&q=75) ## Prerequisites¶ To complete this tutorial, you'll need the following: 1. A[ free Tinybird account](https://www.tinybird.co/signup) 2. Node.js >=18 3. Python >=3.8 4. Working familiarity with JavaScript This tutorial uses both the Tinybird web UI and the Tinybird CLI. If you're not familiar with the Tinybird CLI, [read the CLI docs](https://www.tinybird.co/docs/docs/cli/install) or just give it a go! You can copy and paste every code snippet and command in this tutorial - each step is clearly explained. ## 1. Create a Tinybird Workspace¶ Navigate to the Tinybird web UI ( [app.tinybird.co](https://app.tinybird.co/) ) and create an empty Tinybird Workspace (no starter kit) called `signatures_dashboard` in your preferred region. ## 2. Create the folder structure¶ In your terminal, create a folder called `tinybird-signatures-dashboard` . This folder is going to contain all your code. Inside it, create a bunch of folders to keep things organized: ##### Create the folder structure mkdir tinybird-signatures-dashboard && cd tinybird-signatures-dashboard mkdir datagen datagen/utils app tinybird The final structure will be: ##### Folder structure └── tinybird-signatures-dashboard ├── app ├── datagen │ └── utils └── tinybird ## 3. Install the Tinybird CLI¶ The Tinybird CLI is a command-line tool that allows you to interact with Tinybird’s API. You will use it to create and manage the data project resources that underpin your real-time dashboard. Run the following commands to prepare the virtual environment, install the CLI, and authenticate (the `-i` flag is for "interactive"): ##### Install the Tinybird CLI python -m venv .venv source .venv/bin/activate pip install tinybird-cli tb auth -i Choose the region that matches your Workspace region (if you're not sure which region you chose, don't worry: In the Tinybird UI, select the same of the Workspace (top left) and it will say the region under your email address). You’ll then be prompted for your [user admin Token](https://www.tinybird.co/docs/docs/concepts/auth-tokens) , which lives in the Tinybird UI under "Tokens". Paste it into the CLI and press enter. 
You’re now authenticated to your Workspace from the CLI, and your auth details are saved in a `.tinyb` file in the current working directory. Your user admin Token has full read/write privileges for your Workspace. Don't share it or publish it in your application. You can find more detailed info about Static Tokens [in the Tokens docs](https://www.tinybird.co/docs/docs/concepts/auth-tokens). Ensure that the `.tinyb` file and the `.venv` folder are not publicly exposed by creating a `.gitignore` file and adding it: ##### Housekeeping: Hide your Token\! touch .gitignore echo ".tinyb" >> .gitignore echo ".venv" >> .gitignore ## 4. Create a mock data stream¶ Now download the [mockDataGenerator.js](https://github.com/tinybirdco/demo-user-facing-saas-dashboard-signatures/blob/main/datagen/mockDataGenerator.js) file and place it in the `datagen` folder. ##### Mock data generator cd datagen curl -O https://raw.githubusercontent.com/tinybirdco/demo-user-facing-saas-dashboard-signatures/refs/heads/main/datagen/mockDataGenerator.js ### What this file does¶ The `mockDataGenerator.js` script generates mock user accounts, with fields like `account_id`, `organization`, `phone_number` , and various certification statuses related to the account’s means of identification: ##### Create fake account data const generateAccountPayload = () => { const status = ["active", "inactive", "pending"]; const id = faker.number.int({ min: 10000, max: 99999 }); account_id_list.push(id); return { account_id: id, organization: faker.company.name(), status: status[faker.number.int({ min: 0, max: 2 })], role: faker.person.jobTitle(), certified_SMS: faker.datatype.boolean(), phone: faker.phone.number(), email: faker.internet.email(), person: faker.person.fullName(), certified_email: faker.datatype.boolean(), photo_id_certified: faker.datatype.boolean(), created_on: (faker.date.between({ from: '2020-01-01', to: '2023-12-31' })).toISOString().substring(0, 10), timestamp: Date.now(), } } In addition, the code generates mock data events about the document signature process, with variable status values such as `in_queue`, `signing`, `expired`, `error` , and more: const generateSignaturePayload = (account_id, status, signatureType, signature_id, since, until, created_on) => { return { signature_id, account_id, status, signatureType, since: since.toISOString().substring(0, 10), until: until.toISOString().substring(0, 10), created_on: created_on.toISOString().substring(0, 10), timestamp: Date.now(), uuid: faker.string.uuid(), } } Lastly, the generator creates and sends a final status for the signature using weighted values: const finalStatus = faker.helpers.weightedArrayElement([ { weight: 7.5, value: 'completed' }, { weight: 1, value: 'expired' }, { weight: 0.5, value: 'canceled' }, { weight: 0.5, value: 'declined' }, { weight: 0.5, value: 'error' }, ]) // 7.5/10 chance of being completed, 1/10 chance of being expired, 0.5/10 chance of being canceled, declined or error ### Download the helper functions¶ This script also utilizes a couple of helper functions to access your Tinybird Token and send the data to Tinybird with an HTTP request using the Tinybird Events API. These helper functions are located in the `tinybird.js` file in the repo. [Download that file](https://github.com/tinybirdco/demo-user-facing-saas-dashboard-signatures/blob/main/datagen/utils/tinybird.js) and add it to the `datagen/utils` directory. 
##### Helper functions cd datagen/utils curl -O https://raw.githubusercontent.com/tinybirdco/demo-user-facing-saas-dashboard-signatures/refs/heads/main/datagen/utils/tinybird.js The Tinybird Events API is useful for two reasons: 1. It allows for the flexible and efficient ingestion of data, representing various stages of signatures, directly into the Tinybird platform without needing complex streaming infrastructure. 2. It allows you to stream events directly from your application instead of relying on batch ETLs or change data capture, which require the events to first be logged in a transactional database and can add lag to the data pipeline. ### Install the Faker library¶ Run this command: ##### Install Faker cd datagen npm init --yes npm install @faker-js/faker To run this file and start sending mock data to Tinybird, you need to add a custom script to the `package.json` file that npm generated inside the `datagen` folder. Open up that file and add the following to the `scripts` section: ##### Add seed npm script "seed": "node ./mockDataGenerator.js" Note that since your code is using ES modules, you’ll need to add `"type": "module"` to the `package.json` file to be able to run the script and access the modules. For more information on why, [read this helpful post](https://www.codeconcisely.com/posts/nextjs-esm/). Your package.json should now look something like this: ##### package.json { "name": "datagen", "version": "1.0.0", "description": "", "main": "index.js", "type": "module", "scripts": { "seed": "node ./mockDataGenerator.js" }, "dependencies": { "@faker-js/faker": "^8.4.1" }, "license": "ISC", "author": "" } Okay: You're ready to start sending mock data to Tinybird. Open up a new terminal tab or window in this local project directory and, from the `datagen` folder, run: ##### Generate mock data! npm run seed Congratulations! You should see the seed output in your terminal. Let this run in the background so you have some data for the next steps. Return to your original terminal tab or window and move on to the next steps. ### Verify your mock data stream¶ To verify that the data is flowing properly into Tinybird, inspect the Tinybird Data Sources. In the Tinybird UI, navigate to the `signatures` and `accounts` [Data Sources](https://www.tinybird.co/docs/docs/concepts/data-sources) to confirm that the data has been received. The latest records should be visible. You can also confirm using the CLI, by running a SQL command on your Data Source: tb sql "select count() from signatures" If you run this a few times, and your mock data stream is still running, you'll see this number increase. Neat. This project uses mock data streams to simulate data generated by a hypothetical document signatures app. If you have your own app that’s generating data, you don’t need to do this! You can just add the helper functions to your codebase and call them to send data directly from your app to Tinybird. ## 5. Build dashboard metrics with SQL¶ You now have a Data Source with events streaming into Tinybird, which ensures your real-time dashboard has access to fresh data. The next step is to build real-time metrics using [Tinybird Pipes](https://www.tinybird.co/docs/docs/concepts/pipes). A Pipe is a set of chained, composable Nodes of SQL that process, transform, and enrich data in your Data Sources. Create a new Pipe in the Tinybird UI by selecting the + icon in the left-hand nav bar and selecting "Pipe". Rename your new Pipe `ranking_of_top_organizations_creating_signatures`. Next, it's time to make your first Node!
Remove the placeholder text from the Node, and paste the following SQL in: % SELECT account_id, {% if defined(completed) %} countIf(status = 'completed') total {% else %} count() total {% end %} FROM signatures WHERE fromUnixTimestamp64Milli(timestamp) BETWEEN {{ Date( date_from, '2023-01-01', description="Initial date", required=True, ) }} AND {{ Date( date_to, '2024-01-01', description="End date", required=True ) }} GROUP BY account_id HAVING total > 0 ORDER BY total DESC Key points to understand in this snippet: 1. As well as standard SQL, it uses the Tinybird [templating language and query parameters](https://www.tinybird.co/docs/docs/query/query-parameters) - you can tell when query params are used, because the `%` symbol appears at the top of the query. This makes the query *dynamic*, so instead of hardcoding the date range, the user can select a range and have the results refresh in real time. 2. It has an `if defined` statement. In this case, if a boolean query parameter called `completed` is passed, the Pipe calculates the number of completed signatures. Otherwise, it calculates all signatures. Select "Run" to run and save this Node, then rename it `retrieve_signatures` . Below this Node, create a second one. Remove the placeholder text and paste the following SQL in: ##### Second Node SELECT organization, sum(total) AS org_total FROM retrieve_signatures LEFT JOIN accounts ON accounts.account_id = retrieve_signatures.account_id GROUP BY organization ORDER BY org_total DESC LIMIT {{Int8(limit, 10, description="The number of accounts to retrieve", required=False)}} Name this Node `endpoint` and select "Run" to save it. You now have a 2-Node Pipe that gets the top `limit` organizations by number of signatures within a date range, either completed or total depending on whether a `completed` query parameter is passed or not. ## 6. Publish metrics as APIs¶ You're now ready to build a low-latency, high-concurrency REST API Endpoint from your Pipe - with just 2 clicks! Select the "Create API Endpoint" button at top right, then select the `endpoint` Node. You’ll be greeted with an API page that contains a usage monitoring chart, parameter documentation, and sample usage. In addition, the API has been secured through an automatically-generated, read-only Token. ### Test your API¶ Copy the HTTP API Endpoint from the "Sample usage" box and paste it directly into a new browser tab to see the response. In the URL, you can manually adjust the `date_from` and `date_to` parameters and see the different responses. You can also adjust the `limit` parameter, which controls how many rows are returned. If you request the data in JSON format (the default behavior), you’ll also receive some metadata about the response, including statistics about the query latency: ##### Example Tinybird API statistics "statistics": { "elapsed": 0.001110996, "rows_read": 4738, "bytes_read": 101594 } You'll notice that the API response in this example took barely 1 millisecond (which is... pretty fast), so your dashboards are in good hands when it comes to being ultra responsive. When building out your own projects in the future, use this metadata [and Tinybird's other tools](https://www.tinybird.co/docs/docs/monitoring/health-checks) to monitor and optimize your dashboard query performance. ### Optional: Pull the Tinybird resources into your local directory¶ At this point, you've created a bunch of Tinybird resources: A Workspace, a Data Source, Pipes, and an API Endpoint.
You can pull these resources down locally, so that you can manage this project with Git. In your terminal, start by pulling the Tinybird data project: ##### In the root directory tb pull --auto You’ll see a confirmation that 3 resources ( `signatures.datasource`, `accounts.datasource` , and `ranking_of_top_organizations_creating_signatures.pipe` ) were written into two subfolders, `datasources` and `pipes` , which were created by using the `--auto` flag. Move them into the `tinybird` directory: ##### Move to /tinybird directory mv datasources pipes tinybird/ As you add additional resources in the Tinybird UI, use the `tb pull --auto` command to pull files from Tinybird. You can then add them to your Git commits and push them to your remote repository. If you create data project resources locally using the CLI, you can push them to the Tinybird server with `tb push` . For more information on managing Tinybird data projects in the CLI, check out [this CLI overview](https://www.tinybird.co/docs/docs/cli/quick-start). ## 7. Create real-time dashboard¶ Now that you have a low-latency API with real-time dashboard metrics, you're ready to create the visualization layer using Next.js and Tremor. These two tools provide a scalable and responsive interface that integrates with Tinybird's APIs to display data dynamically. Plus, they look great. ### Initialize your Next.js project¶ In your terminal, navigate into the `app` folder you created earlier and create your Next.js app with this command. In this tutorial you'll use plain JavaScript files and Tailwind CSS: ##### Create a Next app cd app npx create-next-app . --js --tailwind --eslint --src-dir --app --import-alias "@/*" ### Add Tremor to your Next.js app¶ You're going to use Tremor to create a simple bar chart that displays the signature count for each organization. Tremor provides stylish React chart components that you can deploy easily and customize as needed. Install Tremor with the CLI: ##### Install Tremor npx @tremor/cli@latest init Select Next as your framework and allow Tremor to overwrite your existing `tailwind.config.js`. ### Add SWR to your Next.js app¶ You're going to use [SWR](https://swr.vercel.app/) to handle the API Endpoint data and refresh it every 5 seconds. SWR is a great React library to avoid dealing with data caching and revalidation complexity on your own. Plus, you can define what refresh policy you want to follow. Take a look at [its docs](https://swr.vercel.app/docs/revalidation) to learn about the different revalidation strategies. ##### Install SWR npm i swr ### Set up environment variables¶ Next, you need to add your Tinybird host and user admin Token as environment variables so you can run the project locally. Create a `.env.local` file in the root of your Next.js app (the `app` folder) and add the following: ##### Set up environment variables NEXT_PUBLIC_TINYBIRD_HOST="YOUR TINYBIRD API HOST" # Your regional API host e.g. https://api.tinybird.co NEXT_PUBLIC_TINYBIRD_TOKEN="YOUR SIGNING TOKEN" # Use your Admin Token as the signing token Replace the Tinybird API hostname/region with the [right API URL region](https://www.tinybird.co/docs/docs/api-reference/overview#regions-and-endpoints) that matches your Workspace. Your Token lives in the Workspace under "Tokens". ### Set up your page.js¶ Next.js created a `page.js` as part of the bootstrap process. Open it in your preferred code editor and clear the contents.
Paste in the snippets in order from the following sections, understanding what each one does: ### Import UI libraries¶ To build your dashboard component, you will need to import various UI elements and functionality from these libraries at the beginning of your file. Note the use of the `"use client"` directive to render the components on the client side. For more details on this, check out the [Next.js docs](https://nextjs.org/docs/app/building-your-application/rendering#network-boundary). ##### Start building page.js "use client"; import { BarChart, Card, Subtitle, Text, Title } from "@tremor/react"; import React from "react"; import useSWR from "swr"; ### Define constants¶ Below the imports, define the constants required for this specific component: ##### Add environment variables and constants // Get your Tinybird host and Token from the .env file const TINYBIRD_HOST = process.env.NEXT_PUBLIC_TINYBIRD_HOST; // The host URL for the Tinybird API const TINYBIRD_TOKEN = process.env.NEXT_PUBLIC_TINYBIRD_TOKEN; // The access Token for authentication with the Tinybird API const REFRESH_INTERVAL_IN_MILLISECONDS = 5000; // five seconds ### Connect your dashboard to your Tinybird API¶ You’ll need to write a function to fetch data from Tinybird. Note that for the sake of brevity, this snippet hardcodes the dates and uses the default limit in the Tinybird API. You could set up a Tremor datepicker and/or number input if you wanted to dynamically update the dashboard components from within the UI. ##### Define query parameters and Tinybird fetch function export default function Dashboard() { // Define date range for the query const today = new Date(); // Get today's date const dateFrom = new Date(today.setMonth(today.getMonth() - 1)); // Set the query's dateFrom to one month before today const dateTo = new Date(today.setMonth(today.getMonth() + 1)); // Roll the mutated date forward again, so the query's dateTo is today // Format for passing as a query parameter const dateFromFormatted = dateFrom.toISOString().substring(0, 10); const dateToFormatted = dateTo.toISOString().substring(0, 10); // Construct the URL for fetching data, including host, token, and date range const endpointUrl = new URL( "/v0/pipes/ranking_of_top_organizations_creating_signatures.json", TINYBIRD_HOST ); endpointUrl.searchParams.set("token", TINYBIRD_TOKEN); endpointUrl.searchParams.set("date_from", dateFromFormatted); endpointUrl.searchParams.set("date_to", dateToFormatted); // Initialize variables for storing data let ranking_of_top_organizations_creating_signatures, latency, errorMessage; try { // Function to fetch data from Tinybird URL and parse JSON response const fetcher = (url) => fetch(url).then((r) => r.json()); // Use the SWR hook to handle state and refresh the result every five seconds const { data, error } = useSWR(endpointUrl.toString(), fetcher, { refreshInterval: REFRESH_INTERVAL_IN_MILLISECONDS, }); if (error) { errorMessage = error; return; } if (!data) return; if (data?.error) { errorMessage = data.error; return; } ranking_of_top_organizations_creating_signatures = data.data; // Store the fetched data latency = data.statistics?.elapsed; // Store the query latency reported by Tinybird } catch (e) { console.error(e); errorMessage = e; } ### Render the Component¶ Finally, include the rendering code to display the "Ranking of the top organizations creating signatures" in the component's return statement: ##### Render the dashboard component return (
    // Reconstructed markup (the original was lost in rendering); exact elements/props may differ from the repo.
    <Card>
      <Title>Top Organizations Creating Signatures</Title>
      <Subtitle>Ranked from highest to lowest</Subtitle>
      {ranking_of_top_organizations_creating_signatures && (
        <BarChart
          data={ranking_of_top_organizations_creating_signatures}
          index="organization"
          categories={["org_total"]}
        />
      )}
      {latency && <Text>Latency: {latency * 1000} ms</Text>}
      {errorMessage && (
        <div>
          <Text>Oops, something happened: {errorMessage}</Text>
          <Text>Check your console for more information</Text>
        </div>
      )}
    </Card>
); } ### View your dashboard\!¶ It's time! Run `npm run dev` and navigate to `http://localhost:3000/` in your browser. You should see something like this: <-figure-> ![Diagram showing the data flow from Tinybird --> Next.js --> Tremor](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Ftutorial-real-time-dashboard-data-flow.png&w=3840&q=75) Congratulations! You’ve created a real-time dashboard component using Tinybird, Tremor, and Next.js. You’ll notice the dashboard is rendering very quickly by taking a peek at the latency number below the component. In this example case, Tinybird returned the data for the dashboard in a little over 40 milliseconds aggregating over about a million rows. Not too bad for a relatively un-optimized query! ### Optional: Expand your dashboard¶ You've got the basics: An active Workspace and Data Source, knowledge of how to build Pipes, and access to the [Tremor docs](https://www.tremor.so/docs/getting-started/installation) . Build out some more Pipes, API Endpoints, and visualizations! <-figure-> ![Dashboard showing more visualizations](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Ftutorial-real-time-dashboard-further-examples.png&w=3840&q=75) You can also spend some time [optimizing your data project](https://www.tinybird.co/docs/docs/query/sql-best-practices) for faster responses and minimal data processing using fine-tuned indexes, [Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/overview) , and more. ## Next steps¶ - Investigate the[ GitHub repository for this project](https://github.com/tinybirdco/demo-user-facing-saas-dashboard-signatures) in more depth. - Understand today's real-time analytics landscape with[ Tinybird's definitive guide](https://www.tinybird.co/blog-posts/real-time-analytics-a-definitive-guide) . - Learn how to implement[ multi-tenant security](https://www.tinybird.co/blog-posts/multi-tenant-saas-options) in your user-facing analytics. --- URL: https://www.tinybird.co/docs/guides/tutorials/tinybird-101-tutorial Last update: 2024-10-28T11:06:14.000Z Content: --- title: "Tinybird 101 Tutorial · Tinybird Docs" theme-color: "#171612" description: "Tinybird provides you with an easy way to ingest and query large amounts of data with low-latency, and instantly create API Endpoints to consume those queries. This makes it extremely easy to build fast and scalable applications that query your data; no backend needed!" --- # Tinybird 101¶ Tinybird provides you with a simple way to ingest and query large amounts of data with low latency, and instantly create API Endpoints to consume those queries. This means you can easily build fast and scalable applications that query your data; no backend needed! ## Example use case: ecommerce¶ This walkthrough demonstrates how to build an API Endpoint that returns the top 10 most searched products in an ecommerce website. It follows the process of "ingest > query > publish". 1. First, you'll** ingest** a set of ecommerce events based on user actions, such as viewing an item, adding items to their cart, or going through the checkout. This data is available as a CSV file with 50 million rows. 2. Next, you'll** write queries** to filter, aggregate, and transform the data into the top 10 list. 3. Finally, you'll** publish** that top 10 result as an HTTP Tinybird API Endpoint. ## Prerequisites¶ There are no specific prerequisites - you can start from the very beginning, without a Tinybird account. 
If you need help, reach out to us on [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co). This quick start walkthrough uses the **web UI** (not the Tinybird CLI), and **doesn't use version control features** . Both are great additions to your next Tinybird project, but not necessary to get up and running. Here's why: **Using version control:** There are two distinct workflows you can use when building Tinybird projects: with version control and without version control. Tinybird has an extensive feature set that supports Git-based, version-controlled development, including creating feature [Branches](https://www.tinybird.co/docs/docs/core-concepts#branches) , managing Pull Requests with integrated CI/CD scripts. As your Tinybird projects get more complex and your team of collaborators grows, the version control workflow becomes more critical. However, this walkthrough focuses on understanding the platform and its capabilities, so it's fine to build **without** the version control features. **Using the web UI versus the CLI:** In addition to the Tinybird UI, there is also a Tinybird [command line interface (CLI)](https://www.tinybird.co/docs/docs/core-concepts#cli) for building and managing projects. If you're using the version control workflow, the CLI is the best option, and comes complete with methods for creating and navigating [Branches](https://www.tinybird.co/docs/docs/core-concepts#branches). This example uses the web UI. ## Your first Workspace¶ Wondering how to create an account? It's free! [Start here](https://www.tinybird.co/signup). After [creating your account](https://www.tinybird.co/signup) , select a region, and name your Workspace. You can call the Workspace whatever you want; generally, people name their Workspace after the project they are working on. As you learn Tinybird, you are likely to create multiple Workspaces, so descriptive names are helpful. You can always rename a Workspace in the future. Leave the Starter Kit selection dropdown blank. Welcome to your new Workspace! ## 1. Create a Data Source¶ Tinybird can import data from many different sources, but let's start off simple with a CSV file that Tinybird has posted online for you. ### Add the new Data Source¶ In your Workspace, find the Data Sources section at the bottom of the left side navigation. **Select the Plus (+) icon** to add a new Data Source (see Mark 1 below). <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstart-creating-data-source-1.png&w=3840&q=75) In the modal that opens, **select the Remote URL connector** (see Mark 1 below). <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstart-remote-url-1.png&w=3840&q=75) In the next window, ensure that `csv` is selected (see Mark 1 below), and then **paste the following URL into the text box** (see Mark 2 below). https://storage.googleapis.com/tinybird-assets/datasets/guides/events_50M_1.csv Finally, **select the Add button** to finish (see Mark 3 below). <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstart-remote-url-2.png&w=3840&q=75) On the next screen, give the Data Source a name and description (see Mark 1 below). Tinybird also shows you a preview of the schema and data (see Mark 2 below). Change the name to something more descriptive. **Select the name field** and enter the name `shopping_data`. 
<-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstart-schema-preview-1.png&w=3840&q=75) ### Start the data import¶ After setting the name of your first Data Source, **select Create Data Source** to start importing the data (see Mark 1 below). <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstart-complete-datasource-1.png&w=3840&q=75) Your Data Source has now been created in your Workspace. If it doesn't show up, refresh the window. Loading the data does not take very long, and you do not need to wait for it to finish. Your Workspace should look like this: <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstart-review-datasource-1.png&w=3840&q=75) OK: You've **ingested** your data. Now, you can move on to creating your first Pipe! ## 2. Create a Pipe¶ In Tinybird, SQL queries are written inside *Pipes* . One Pipe can be made up of many individual SQL queries called *Nodes* ; each Node is a single SQL SELECT statement. A Node can query the output of another Node in the same Pipe. This means that you can break large queries down into a multiple smaller, more modular, queries and chain them together - making it much easier to build and maintain in the future. ### Add the new Pipe¶ Add a new Pipe by **selecting the Plus (+) icon** next to the Pipes category in the left side navigation (see Mark 1 below). <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstart-create-pipe-1.png&w=3840&q=75) This adds a new Pipe with an auto-generated default name. Just like with a Data Source, you can select the name and description to change it (see Mark 1 below). Call this Pipe `top_10_searched_products`. <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstart-rename-pipe-1.png&w=3840&q=75) ### Filter the data¶ At the top of your new Pipe is the first Node, which is pre-populated with a simple SELECT over the data in your Data Source (see Mark 1 below). Before you start modifying the query in the Node, **select the Run button** (see Mark 2 below). <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstart-first-node-1.png&w=3840&q=75) Hitting Run executes the query in the Node, and shows us a preview of the query result (see Mark 1 below). You can individually execute any Node in your Pipe to see the result, so it is a great way to check that your queries are generating the output you expect. <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstart-first-pipe-run-1.png&w=3840&q=75) Now, you can modify the query to do something more interesting. In this Pipe, you want to create a list of the top 10 most searched products. If you take a look at the data, you will notice an `event` column, which describes what kind of event happened. This column has various values, including `view`, `search` , and `buy` . You are only interested in rows where the `event` is `search` , so modify the query to filter the rows. Replace the Node SQL with the following query: SELECT * FROM shopping_data WHERE event == 'search' **Select Run** again when you are done updating the Node. This Node is now applying a filter to the data, so you only see the rows of interest. 
<-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstart-first-node-query-1.png&w=3840&q=75) At the top of the Node, you will notice that it has been named after the Pipe. Just as before, you can **rename the Node** and give it a description, to help us remember what the Node is doing (see Mark 1 below). Call this Node `search_events`. <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstart-rename-first-node-1.png&w=3840&q=75) ### Aggregate the data¶ Next, you want to work out how many times each individual product has been searched for. This means you are going to want to do a count and aggregate by the *product id* . To keep your queries more simple, create a second Node to do this aggregation. Scroll down a little, and you will see a second Node is suggested at the bottom of the page (see Mark 1 below). <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstart-find-second-node-1.png&w=3840&q=75) The really cool thing here is that this second Node can query the result of the `search_events` Node you just created. This means you do not need to duplicate the WHERE filter in this next query, as you are already querying the filtered rows in the previous Node. Use the following query for the next Node: SELECT product_id, count() as total FROM search_events GROUP BY product_id ORDER BY total DESC **Select Run** again to see the results of the query. Do not forget to **rename this Node** ... Your future self will thank you! Call it `aggregate_by_product_id`. <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstart-second-node-query-1.png&w=3840&q=75) ### Transform the result¶ Finally, create the last Node that you'll use to publish as a *Tinybird API Endpoint* and limit the results to the top 10 products. Create a third Node and use the following query: SELECT product_id, total FROM aggregate_by_product_id LIMIT 10 Just as before, you could theoretically name this Node whatever you want. However, as you'll publish it as an API Endpoint, follow the common convention to name this Node `endpoint` . When you get to the "Publish API Endpoint" step next, all your Nodes are listed and the `endpoint` name makes it clear which one to select. **Select Run** to preview the results. <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstart-endpoint-node-1.png&w=3840&q=75) With that, you've built the logic required to **query your ingested data** and return the top 10 most searched products. Nice! Top tip: Now you've got some Data Sources and Pipes, press `⌘+K` or `CTRL+K` at any time in the Tinybird UI to open the Command Bar and view all those Workspace resources. No matter how many you build out in the future, you'll always have a quick way to navigate to them! ## 3. Publish and use your API Endpoint¶ Now, let's say you have an application that wants to be able to access and use this top 10 result. The magic of Tinybird is that you can choose any query and instantly publish the results as a RESTful API Endpoint. Applications can simply hit the API Endpoint and get the very latest results. ### Publish the API Endpoint¶ At this point you encounter a core Tinybird object, the *API Endpoint* . Tinybird API Endpoints are published directly from selected Nodes. API Endpoints come with an extensive feature set, including support for dynamic query parameters and auto-generated docs complete with code samples. 
To publish a Node as an API Endpoint, **select the Create API Endpoint button** in the top right corner of the Workspace (see Mark 1 below). Then, select the **Create API Endpoint** option. You will then see a list of your Nodes - **select the endpoint Node** (see Mark 2 below). <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstart-publish-endpoint-1.png&w=3840&q=75) And now you have your very first API Endpoint! This page shows all the details about your new API Endpoint including observability charts, links to auto-generated API Endpoint documentation, and code snippets to help you integrate it with your application. ### Test the API Endpoint¶ On your API overview page, scroll down to the Sample usage section, and **copy the HTTP URL** from the snippet box. Open this URL in a new tab in your browser. Hitting the API Endpoint triggers your Pipe to execute, and you get a JSON formatted response with the results. <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstart-view-api-browser-1.png&w=3840&q=75) ## 4. Build Charts showing your data¶ On your API overview page, select **Create Chart** (using the dropdown, you can also share the docs for this API Endpoint, too - see Mark 1 below): <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstart-publish-build-charts-1.png&w=3840&q=75) You can also build Charts to embed in your own application. See the [Charts documentation](https://www.tinybird.co/docs/docs/publish/charts) for more. ## Celebrate¶ Congrats! You have finished creating your first API Endpoint in Tinybird! While working through these steps, you have accomplished a lot in just a few minutes. You have imported 50 million events, built a variety of queries with latencies measured in milliseconds, and stood up an API Endpoint that can serve thousands of concurrent requests. ## Next steps¶ ### Import and build with your own data¶ Ready to start building with your own data? If you have a variety of data stores and sources, Tinybird is a great platform to unify them. Tinybird provides built-in connectors to easily ingest data from Kafka, Confluent Cloud, Big Query, Amazon S3, and Snowflake. If you want to stream data over HTTP, you can send data directly to Tinybird's [Events API](https://www.tinybird.co/docs/docs/ingest/events-api) with no additional infrastructure. ### Using version control¶ This walkthrough covered the basics of Tinybird without integrating version control into the process. You're ready to take it to the next level! Head over to the [version control overview](https://www.tinybird.co/docs/docs/production/overview). ### Build Charts¶ Tinybird Charts make it easy to create interactive, customizable charts of your data. Read more about them in the [Charts docs](https://www.tinybird.co/docs/docs/publish/charts). --- URL: https://www.tinybird.co/docs/guides/tutorials/user-facing-web-analytics Last update: 2024-11-13T15:42:17.000Z Content: --- title: "Build a user-facing web analytics dashboard · Tinybird Docs" theme-color: "#171612" description: "Learn how to build a user-facing web analytics dashboard using Tinybird for real-time, user-facing analytics." --- # Build a user-facing web analytics dashboard¶ In this guide you'll learn how to build a user-facing web analytics dashboard. You'll use Tinybird to capture web clickstream events, process the data in real-time, and expose metrics as APIs. You'll then deploy a Next.js app to visualize your metrics. 
[GitHub Repository](https://github.com/tinybirdco/demo-user-facing-web-analytics) <-figure-> ![A user-facing web analytics dashboard built with Tinybird and Next.js](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Ftutorial-user-facing-web-analytics-screenshot.png&w=3840&q=75) In this tutorial, you will learn how to: 1. Stream unstructured events data to Tinybird with the[ Events API](https://www.tinybird.co/docs/docs/ingest/events-api) . 2. Parse those events with a global SQL Node that you can reuse in all your subsequent[ Tinybird Pipes](https://www.tinybird.co/docs/docs/concepts/pipes) . 3. Build performant queries to calculate user-facing analytics metrics. 4. Optimize query performance with[ Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/overview) . 5. Publish your metrics as[ API Endpoints](https://www.tinybird.co/docs/docs/publish/api-endpoints/overview) and integrate them into a user-facing Next.js app. ## Prerequisites¶ To complete this tutorial, you'll need the following: 1. A[ free Tinybird account](https://www.tinybird.co/signup) 2. An empty Tinybird Workspace 3. Node.js >=20.11 4. Python >=3.8 This tutorial includes a [Next.js](https://nextjs.org/) app for frontend visualization. For more information about how the Next.js app is designed and deployed, read the [repository README](https://github.com/tinybirdco/demo-user-facing-web-analytics/tree/main/app/README.md). The steps in this tutorial are completed using the Tinybird Command Line Interface (CLI). If you're not familiar with it, [read the CLI docs](https://www.tinybird.co/docs/docs/cli/install) or just give it a go! You can copy and paste every code snippet and command in this tutorial. ## 1. Create a Tinybird Data Source to store your events¶ First, you need to create a Tinybird Data Source to store your web clickstream events. Create a new directory called `tinybird` in your project folder and install the Tinybird CLI: ##### Install the Tinybird CLI mkdir tinybird cd tinybird python -m venv .venv source .venv/bin/activate pip install tinybird-cli [Copy the user admin Token](https://www.tinybird.co/docs/docs/concepts/auth-tokens) and authenticate the CLI: ##### Authenticate the Tinybird CLI tb auth --token Initialize an empty Tinybird project and navigate to the `/datasources` directory, then create a new file called `analytics_events.datasource`: ##### Create a Data Source tb init cd datasources touch analytics_events.datasource Open the file in your preferred code editor and paste the following contents: ##### analytics\_events.datasource DESCRIPTION > Analytics events landing data source SCHEMA > `timestamp` DateTime `json:$.timestamp`, `session_id` String `json:$.session_id`, `action` LowCardinality(String) `json:$.action`, `version` LowCardinality(String) `json:$.version`, `payload` String `json:$.payload` ENGINE MergeTree ENGINE_PARTITION_KEY toYYYYMM(timestamp) ENGINE_SORTING_KEY timestamp ENGINE_TTL timestamp + toIntervalDay(60) If you pass a non-existent Data Source name to the Events API (as you're about to do), Tinybird automatically creates a new Data Source of that name with an [inferred schema](https://www.tinybird.co/docs/docs/ingest/overview#get-started) . By creating the Data Source ahead of time in this file, you have more control over the schema definition, including column types and sorting keys. For more information about creating Tinybird Data Sources, read the [Data Source docs](https://www.tinybird.co/docs/docs/concepts/data-sources). 
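For context, the rows that land in this Data Source are just JSON events posted to the Tinybird Events API, usually from a small tracking snippet on your site. As a rough illustration of how a single `page_hit` event maps onto the schema above, here's a hedged browser-side sketch (the Web Analytics Starter Kit ships a more complete tracker; the token placeholder and helper name here are illustrative):

##### Sketch: sending one page_hit event from the browser

// Illustrative sketch only; field names follow the analytics_events schema above.
const TINYBIRD_HOST = "https://api.tinybird.co"; // use your Workspace's regional API host
const TINYBIRD_TRACKER_TOKEN = "<a Token with append scope>"; // illustrative placeholder

function trackPageHit() {
  const event = {
    timestamp: new Date().toISOString().replace("T", " ").substring(0, 19), // "YYYY-MM-DD HH:MM:SS"
    session_id: crypto.randomUUID(), // a real tracker would reuse one id per session
    action: "page_hit",
    version: "1",
    // Anything page-specific stays inside the unstructured payload column
    payload: JSON.stringify({
      "user-agent": navigator.userAgent,
      locale: navigator.language,
      referrer: document.referrer,
      pathname: window.location.pathname,
      href: window.location.href,
    }),
  };
  return fetch(
    `${TINYBIRD_HOST}/v0/events?name=analytics_events&token=${TINYBIRD_TRACKER_TOKEN}`,
    { method: "POST", body: JSON.stringify(event) } // one event per JSON line
  );
}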
In the `/tinybird` directory, save and push the file to Tinybird: ##### Push the Data Source to Tinybird cd .. tb push datasources/analytics_events.datasource Confirm that you have a new Data Source: tb datasource ls You should see `analytics_events` in the result. Congrats, you have a Tinybird Data Source! ## 2. Stream mock data to your Data Source¶ This tutorial uses [Mockingbird](https://mockingbird.tinybird.co/docs) , an open source mock data stream generator, to stream mock web clickstream events to your Data Source. Mockingbird generates a JSON payload based on a predefined schema and posts it to the [Tinybird Events API](https://www.tinybird.co/docs/docs/ingest/events-api) , which then writes the data to your Data Source. You can explore the [Mockingbird web UI](https://mockingbird.tinybird.co/) , or follow the steps below to complete the same actions using the Mockingbird CLI. In a separate terminal window (outside your current virtual environment) install the Mockingbird CLI: ##### Install Mockingbird npm install -g @tinybirdco/mockingbird-cli Run the following command to stream 50,000 mock web clickstream events to your `analytics_events` Data Source at 50 events per second via the Events API. This command uses the predefined "Web Analytics Starter Kit" Mockingbird template schema to generate mock web clickstream events. Copy your User Admin Token to the clipboard with `tb token copy dashboard` , and use it in the following command (and change the `endpoint` argument depending on your Workspace region if required): ##### Stream to Tinybird with a template mockingbird-cli tinybird --template "Web Analytics Starter Kit" --eps 50 --limit 50000 --datasource analytics_events --token --endpoint gcp_europe_west3 Confirm that events are being written to the `analytics_events` Data Source by running the following command a few times: tb sql 'select count() from analytics_events' You should see the count incrementing up by 50 every second or so. Congratulations, you're ready to start processing your events data! ## 3. Parse the raw JSON events¶ The `analytics_events` Data Source has a `payload` column which stores a string of JSON data. To begin building your analytics metrics, you need to parse this JSON data using a Tinybird [Pipe](https://www.tinybird.co/docs/docs/concepts/pipes). When you're dealing with unstructured data that's likely to change in the future, it's a wise design pattern to retain the unstructured data as a JSON string in a single column. This gives you flexibility to change your upstream producers without breaking ingestion. You can then parse (and materialize) this data downstream. Navigate to the `/pipes` directory and create a new file called `analytics_hits.pipe`: ##### Create a Pipe touch analytics_hits.pipe Open the file and paste the following contents: ##### analytics\_hits.pipe DESCRIPTION > Parsed `page_hit` events, implementing `browser` and `device` detection logic. 
TOKEN "dashboard" READ NODE parsed_hits DESCRIPTION > Parse raw page_hit events SQL > SELECT timestamp, action, version, coalesce(session_id, '0') as session_id, JSONExtractString(payload, 'locale') as locale, JSONExtractString(payload, 'location') as location, JSONExtractString(payload, 'referrer') as referrer, JSONExtractString(payload, 'pathname') as pathname, JSONExtractString(payload, 'href') as href, lower(JSONExtractString(payload, 'user-agent')) as user_agent FROM analytics_events where action = 'page_hit' NODE endpoint SQL > SELECT timestamp, action, version, session_id, location, referrer, pathname, href, case when match(user_agent, 'wget|ahrefsbot|curl|urllib|bitdiscovery|\+https://|googlebot') then 'bot' when match(user_agent, 'android') then 'mobile-android' when match(user_agent, 'ipad|iphone|ipod') then 'mobile-ios' else 'desktop' END as device, case when match(user_agent, 'firefox') then 'firefox' when match(user_agent, 'chrome|crios') then 'chrome' when match(user_agent, 'opera') then 'opera' when match(user_agent, 'msie|trident') then 'ie' when match(user_agent, 'iphone|ipad|safari') then 'safari' else 'Unknown' END as browser FROM parsed_hits This Pipe contains two [Nodes](https://www.tinybird.co/docs/docs/concepts/pipes#nodes) . The first Node, called `parsed_hits` , extracts relevant information from the JSON `payload` using the `JSONExtractString()` ClickHouse® function and filters to only include `page_hit` actions. The second Node, called `endpoint` , selects from the `parsed_hits` Node and further parses the `user_agent` to get the `device` and `browser` for each event. Additionally, this code gives the Pipe a description, and creates a [Token](https://www.tinybird.co/docs/docs/concepts/auth-tokens) called `dashboard` with `READ` scope for this Pipe. Navigate back up to the `/tinybird` directory and push the Pipe to Tinybird: ##### Push the Pipe to Tinybird tb push pipes/analytics_hits.pipe When you push a Pipe file, Tinybird automatically publishes the last Node as an API Endpoint unless you specify the Pipe as something else (more on that below), so it's best practice to call your final Node "endpoint". You can unpublish an API Endpoint at any time using `tb pipe unpublish `. You now have a public REST API that returns the results of the `analytics_hits` Pipe! Get your Admin Token again with `tb token copy dashboard` and test your API with the command: curl "https://api.tinybird.co/v0/pipes/analytics_hits.json?token=" Replace the Tinybird API hostname/region with the [right API URL region](https://www.tinybird.co/docs/docs/api-reference/overview#regions-and-endpoints) that matches your Workspace. Your Token lives in the Workspace under "Tokens". 
You should see a JSON response that looks something like this: ##### Example API response { "meta": [ { "name": "timestamp", "type": "DateTime" }, { "name": "action", "type": "LowCardinality(String)" }, { "name": "version", "type": "LowCardinality(String)" }, { "name": "session_id", "type": "String" }, { "name": "location", "type": "String" }, { "name": "referrer", "type": "String" }, { "name": "pathname", "type": "String" }, { "name": "href", "type": "String" }, { "name": "device", "type": "String" }, { "name": "browser", "type": "String" } ], "data": [ { "timestamp": "2024-04-24 18:24:21", "action": "page_hit", "version": "1", "session_id": "713355c6-6b98-4c7a-82a9-e19a7ace81fe", "location": "", "referrer": "https:\/\/www.kike.io", "pathname": "\/blog-posts\/data-market-whitebox-replaces-4-data-stack-tools-with-tinybird", "href": "https:\/\/www.tinybird.co\/blog-posts\/data-market-whitebox-replaces-4-data-stack-tools-with-tinybird", "device": "bot", "browser": "chrome" }, ... ] "rows": 150, "statistics": { "elapsed": 0.006203411, "rows_read": 150, "bytes_read": 53609 } } In Tinybird, you can `SELECT FROM` any API Endpoint published in your Workspace in any Pipe. Tinybird won't call the API Endpoint directly, rather it will treat the Endpoint as additional Node(s) and construct a final query. In this tutorial, you'll query the `analytics_hits` API Endpoint in subsequent Pipes. ## 4. Calculate aggregates for pageviews, sessions, and sources¶ Next, you'll create three [Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/overview) to store aggregates for the following: 1. pageviews 2. sessions 3. sources Later on, you'll query from the Materialized Views that you're creating here. From the `/datasources` directory in the Tinybird project, create three new Data Source files: touch analytics_pages_mv.datasource analytics_sessions_mv.datasource analytics_sources_mv.datasource Open the `analytics_pages_mv.datasource` file and paste in the following contents: ##### analytics\_pages\_mv.datasource SCHEMA > `date` Date, `device` String, `browser` String, `location` String, `pathname` String, `visits` AggregateFunction(uniq, String), `hits` AggregateFunction(count) ENGINE AggregatingMergeTree ENGINE_PARTITION_KEY toYYYYMM(date) ENGINE_SORTING_KEY date, device, browser, location, pathname Do the same for `analytics_sessions_mv.datasource` and `analytics_sources_mv.datasource` , copying the code from the [GitHub repository](https://github.com/tinybirdco/demo-user-facing-web-analytics/tree/main/tinybird/datasources) for this tutorial. These Materialized View Data Sources use an [AggregatingMergeTree](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/aggregatingmergetree) Engine. If you're new to Tinybird, don't worry about this for now. To learn more about Table Engines in ClickHouse, [read this](https://clickhouse.com/docs/en/engines/table-engines). Next, you'll create three Pipes that calculate the aggregates and store the data in the Materialized View Data Sources you just created. 
From the `/pipes` directory, create three new Pipe files: touch analytics_pages.pipe analytics_sessions.pipe analytics_sources.pipe Open `analytics_pages.pipe` and paste the following: ##### analytics\_pages.pipe NODE analytics_pages_1 DESCRIPTION > Aggregate by pathname and calculate session and hits SQL > SELECT toDate(timestamp) AS date, device, browser, location, pathname, uniqState(session_id) AS visits, countState() AS hits FROM analytics_hits GROUP BY date, device, browser, location, pathname TYPE MATERIALIZED DATASOURCE analytics_pages_mv This code calculates aggregates for pageviews, and designates the Pipe as a Materialized View with `analytics_pages_mv` as the target Data Source. Note that you're using the `-State` modifier on your aggregate functions in this Pipe. Tinybird stores intermediate aggregate states in the Materialized View, which you will merge at query time. For more information on how this process works, [read this guide on Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/best-practices). Do this for the remaining two Pipes, copying the code from the [GitHub repository](https://github.com/tinybirdco/demo-user-facing-web-analytics/tree/main/tinybird/pipes). Back in the `/tinybird` directory, push these new Pipes and Data Sources to Tinybird. This populates the Materialized Views with your Mockingbird data: ##### Push to Tinybird tb push pipes --push-deps --populate Now, as new events arrive in the `analytics_events` Data Source, these Pipes will process the data and update the aggregate states in your Materialized Views as new data arrives. ## 6. Generate session count trend for the last 30 minutes¶ In this step, you'll use Tinybird Pipes to calculate user-facing analytics metrics and publish them as REST APIs. The first Pipe you create, called `trend` , will calculate the number of sessions over the last 30 minutes, grouped by 1 minute intervals. From the `/pipes` directory, create a file called `trend.pipe`: ##### Create trend.pipe touch trend.pipe Open this file and paste the following: ##### trend.pipe DESCRIPTION > Visits trend over time for the last 30 minutes, filling in the blanks. TOKEN "dashboard" READ NODE timeseries DESCRIPTION > Generate a timeseries for the last 30 minutes, so we call fill empty data points SQL > with (now() - interval 30 minute) as start select addMinutes(toStartOfMinute(start), number) as t from (select arrayJoin(range(1, 31)) as number) NODE hits DESCRIPTION > Get last 30 minutes metrics grouped by minute SQL > select toStartOfMinute(timestamp) as t, uniq(session_id) as visits from analytics_hits where timestamp >= (now() - interval 30 minute) group by toStartOfMinute(timestamp) order by toStartOfMinute(timestamp) NODE endpoint DESCRIPTION > Join and generate timeseries with metrics for the last 30 minutes SQL > select a.t, b.visits from timeseries a left join hits b on a.t = b.t order by a.t This Pipe contains three Nodes: 1. The first Node, called `timeseries` , generates a simple result set with 1-minute intervals for the last 30 minutes. 2. The second Node, called `hits` , calculates total sessions over the last 30 minutes, grouped by 1-minute intervals. 3. The third Node, called `endpoint` , performs a left join between the first two Nodes, retaining all of the 1-minute intervals from the `timeseries` Node. ## 7. Calculate the top pages visited¶ Next, you'll create a Pipe called `top_pages` to calculate a sorted list of the top pages visited over a specified time range. 
This Pipe will query the `analytics_pages_mv` Data Source you created in the prior steps, and it will use Tinybird's templating language to define [query parameters](https://www.tinybird.co/docs/docs/query/query-parameters) that you can use to dynamically select a time range and implement pagination in the response. From the `/pipes` directory, create the `top_pages.pipe` file: ##### Create top\_pages.pipe touch top_pages.pipe Open the file and paste the following: DESCRIPTION > Most visited pages for a given period. Accepts `date_from` and `date_to` date filter. Defaults to last 7 days. Also `skip` and `limit` parameters for pagination. TOKEN "dashboard" READ NODE endpoint DESCRIPTION > Group by pagepath and calculate hits and visits SQL > % select pathname, uniqMerge(visits) as visits, countMerge(hits) as hits from analytics_pages_mv where {% if defined(date_from) %} date >= {{ Date(date_from, description="Starting day for filtering a date range", required=False) }} {% else %} date >= timestampAdd(today(), interval -7 day) {% end %} {% if defined(date_to) %} and date <= {{ Date(date_to, description="Finishing day for filtering a date range", required=False) }} {% else %} and date <= today() {% end %} group by pathname order by visits desc limit {{ Int32(skip, 0) }},{{ Int32(limit, 50) }} Note the use of the `-Merge` modifiers on the end of the aggregate function. This modifier is used to perform a final merge on the aggregate states in the Materialized View. [Read this Guide](https://www.tinybird.co/docs/docs/publish/materialized-views/best-practices) for more details. ## 8. Create the remaining API Endpoints¶ In the [GitHub repository](https://github.com/tinybirdco/demo-user-facing-web-analytics/tree/main/tinybird/pipes) , you'll find five additional Pipe files that calculate other various user-facing metrics: - `kpis.pipe` - `top_browsers.pipe` - `top_devices.pipe` - `top_locations.pipe` - `top_sources.pipe` Create those into your `/pipes` directory: touch kpis.pipe top_browsers.pipe top_devices.pipe top_locations.pipe top_sources.pipe And copy the file contents from the GitHub examples into your files. Finally, in the `/tinybird` directory, push all these new Pipes to Tinybird: tb push pipes Congrats! You now have seven API Endpoints that you will integrate into your Next.js app to provide data to your dashboard components. ## 9. Deploy the Next.js app¶ You can easily deploy the accompanying Next.js app to Vercel by clicking in this button (you'll need a Vercel account): [Deploy with Vercel](https://vercel.com/new/clone?repository-url=https%253A%252F%252Fgithub.com%252Ftinybirdco%252Fdemo-user-facing-web-analytics%252Ftree%252Fmain%252Fapp&env=NEXT_PUBLIC_TINYBIRD_AUTH_TOKEN,NEXT_PUBLIC_TINYBIRD_HOST,NEXT_PUBLIC_BASE_URL&envDescription=Tinybird%2520configuration&project-name=user-facing-web-analytics&repository-name=user-facing-web-analytics) First, select the Git provider where you'll clone the Git repository: ![Select Git provider](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Ftutorial-user-facing-web-analytics-deploy-1.png&w=3840&q=75) Next, set the following environment variables: - `NEXT_PUBLIC_TINYBIRD_AUTH_TOKEN` : your[ Tinybird Token](https://www.tinybird.co/docs/docs/concepts/auth-tokens) - `NEXT_PUBLIC_TINYBIRD_HOST` : your Tinybird Region (e.g. `https://api.tinybird.co` ) - `NEXT_PUBLIC_BASE_URL` : The URL where you will publish your app (e.g. 
`https://my-analytics.com` ) ![Set env variables](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Ftutorial-user-facing-web-analytics-deploy-2.png&w=3840&q=75) Click "Deploy" and you're done! Explore your dashboard and have a think about how you'd like to adapt or extend it in the future. ## Next steps¶ - Understand today's real-time analytics landscape with[ Tinybird's definitive guide](https://www.tinybird.co/blog-posts/real-time-analytics-a-definitive-guide) . - Learn how to implement[ multi-tenant security](https://www.tinybird.co/blog-posts/multi-tenant-saas-options) in your user-facing analytics. Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/guides/tutorials/vector-search-recommendation Last update: 2024-10-28T11:06:14.000Z Content: --- title: "Build a content recommendation API using vector search · Tinybird Docs" theme-color: "#171612" description: "Learn how to compute embeddings in Python and use vector search SQL functions in Tinybird to build a content recommendation API." --- # Build a content recommendation API using vector search¶ In this guide you'll learn how to calculate vector embeddings using HuggingFace models and use Tinybird to perform vector search to find similar content based on vector distances [GitHub Repository](https://github.com/tinybirdco/demo_vector_search_recommendation/tree/main) <-figure-> ![Tinybird blog related posts uses vector search recommendation algorithm.](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Ftutorial-vector-search-recommendation-1.png&w=3840&q=75) In this tutorial, you will learn how to: 1. Use Python to fetch content from an RSS feed 2. Calculate vector embeddings on long form content (blog posts) using SentenceTransformers in Python 3. Post vector embeddings to a Tinybird Data Source using the Tinybird Events API 4. Write a dynamic SQL query to calculate the closest content matches to a given blog post based on vector distances 5. Publish your query as an API and integrate it into a frontend application ## Prerequisites¶ To complete this tutorial, you'll need: 1. A[ free Tinybird account](https://www.tinybird.co/signup) 2. An empty Tinybird Workspace 3. Python >= 3.8 This tutorial does not include a frontend, but we provide an example snippet below on how you might integrate the published API into a React frontend. ## 1. Setup¶ Clone the `demo_vector_search_recommendation` repo . We'll use the repository as reference throughout this tutorial. Authenticate the Tinybird CLI using your user admin token from your Tinybird Workspace: cd tinybird tb auth --token $USER_ADMIN_TOKEN ## 2. Fetch content and calculate embeddings¶ In this tutorial we fetch blog posts from the Tinybird Blog using the [Tinybird Blog RSS feed](https://www.tinybird.co/blog-posts/rss.xml) . You can use any `rss.xml` feed to fetch blog posts and calculate embeddings from their content. You can fetch and parse the RSS feed using the `feedparser` library in Python, get a list of posts, and then fetch each post and parse the content with the `BeautifulSoup` library. Once you've fetched each post, you can calculate an embedding using the HuggingFace `sentence_transformers` library. In this demo, we use the `all-MiniLM-L6-v2` model, which maps sentences & paragraphs to a 384 dimensional dense vector space. You can browse other models [here](https://huggingface.co/models). 
You can achieve this and the following step (fetch posts, calculate embeddings, and send them to Tinybird) by running `load.py` from the code repository. We walk through what that script does below so you can understand how it works. from bs4 import BeautifulSoup from sentence_transformers import SentenceTransformer import datetime import feedparser import requests import json timestamp = datetime.datetime.now().isoformat() url = "https://www.tinybird.co/blog-posts/rss.xml" # Update to your preferred RSS feed feed = feedparser.parse(url) model = SentenceTransformer("all-MiniLM-L6-v2") posts = [] for entry in feed.entries: doc = BeautifulSoup(requests.get(entry.link).content, features="html.parser") if (content := doc.find(id="content")): embedding = model.encode([content.get_text()]) posts.append(json.dumps({ "timestamp": timestamp, "title": entry.title, "url": entry.link, "embedding": embedding.mean(axis=0).tolist() })) ## 3. Post content metadata and embeddings to Tinybird¶ Once you've calculated the embeddings, you can push them along with the content metadata to Tinybird using the [Events API](https://www.tinybird.co/docs/docs/ingest/events-api). First, set up some environment variables for your Tinybird host and a token with `DATASOURCES:WRITE` scope (these names match the script below): export TB_HOST=your_tinybird_host export TB_APPEND_TOKEN=your_tinybird_append_token Next, you'll need to set up a Tinybird Data Source to receive your data. Note that if the Events API doesn't find a Tinybird Data Source by the supplied name, it will create one. But since we want control over our schema, we're going to create an empty Data Source first. In the `tinybird/datasources` folder of the repository, you'll find a `posts.datasource` file that looks like this: SCHEMA > `timestamp` DateTime `json:$.timestamp`, `title` String `json:$.title`, `url` String `json:$.url`, `embedding` Array(Float32) `json:$.embedding[:]` ENGINE ReplacingMergeTree ENGINE_PARTITION_KEY "" ENGINE_SORTING_KEY title, url ENGINE_VER timestamp This Data Source will receive the updated post metadata and calculated embeddings and deduplicate based on the most up-to-date retrieval. The `ReplacingMergeTree` is used to deduplicate, relying on the `ENGINE_VER` setting, which in this case is set to the `timestamp` column. This tells the engine that the versioning of each entry is based on the `timestamp` column, and only the entry with the latest timestamp will be kept in the Data Source. The Data Source has the `title` column as its primary sorting key, because we will be filtering by title to retrieve the embedding for the current post. Having `title` as the primary sorting key makes that filter more performant. Push this Data Source to Tinybird: cd tinybird tb push datasources/posts.datasource Then, you can use a Python script to push the post metadata and embeddings to the Data Source using the Events API: import os import requests TB_APPEND_TOKEN=os.getenv("TB_APPEND_TOKEN") TB_HOST=os.getenv("TB_HOST") def send_posts(posts): params = { "name": "posts", "token": TB_APPEND_TOKEN } data = "\n".join(posts) # ndjson r = requests.post(f"{TB_HOST}/v0/events", params=params, data=data) print(r.status_code) send_posts(posts) To keep embeddings up to date, you should retrieve new content on a schedule and push it to Tinybird.
In the repository, you'll find a GitHub Action called [tinybird_recommendations.yml](https://github.com/tinybirdco/demo_vector_search_recommendation/blob/main/.github/workflows/tinybird_recommendations.yml) that fetches new content from the Tinybird blog every 12 hours and pushes it to Tinybird. The Tinybird Data Source in this project uses a ReplacingMergeTree to deduplicate blog post metadata and embeddings as new data arrives.

## 4. Calculate distances in SQL using Tinybird Pipes¶

If you've completed the steps above, you should have a `posts` Data Source in your Tinybird Workspace containing the last fetched timestamp, title, url, and embedding for each blog post fetched from your RSS feed. You can verify that you have data from the Tinybird CLI with:

tb sql 'SELECT * FROM posts'

This tutorial includes a single-node SQL Pipe to calculate the vector distance of each post to a specific post supplied as a query parameter. The Pipe config is contained in the `similar_posts.pipe` file in the `tinybird/pipes` folder, and the SQL is copied below for reference and explanation.

%
WITH (
    SELECT embedding FROM (
        SELECT 0 AS id, embedding
        FROM posts
        WHERE title = {{ String(title) }}
        ORDER BY timestamp DESC
        LIMIT 1
        UNION ALL
        SELECT 999 AS id, arrayWithConstant(384, 0.0) embedding
    )
    ORDER BY id
    LIMIT 1
) AS post_embedding
SELECT title, url, L2Distance(embedding, post_embedding) similarity
FROM posts FINAL
WHERE title <> {{ String(title) }}
ORDER BY similarity ASC
LIMIT 10

This query first fetches the embedding of the requested post, and returns an array of 0s in the event an embedding can't be fetched. It then calculates the Euclidean vector distance between each additional post and the specified post using the `L2Distance()` function, sorts them by ascending distance, and limits to the top 10 results.

You can push this Pipe to your Tinybird server with:

cd tinybird
tb push pipes/similar_posts.pipe

When you push it, Tinybird will automatically publish it as a scalable, dynamic REST API Endpoint that accepts a `title` query parameter.

You can test your API Endpoint with cURL. First, create an environment variable with a token that has `PIPES:READ` scope for your Pipe. You can get this token from your Workspace UI or in the CLI with `tb token` [commands](https://www.tinybird.co/docs/docs/cli/command-ref#tb-token).

export TB_READ_TOKEN=your_read_token

Then request your Endpoint:

curl --compressed -H "Authorization: Bearer $TB_READ_TOKEN" https://api.tinybird.co/v0/pipes/similar_posts.json?title='Some blog post title'

You will get a JSON object containing the 10 most similar posts to the post whose title you supplied in the request.

## 5. Integrate into the frontend¶

Integrating your vector search API into the frontend is relatively straightforward, as it's just a RESTful Endpoint. Here's an example implementation (pulled from the actual code used to fetch related posts in the Tinybird Blog):

export async function getRelatedPosts(title: string) {
  const recommendationsUrl = `${host}/v0/pipes/similar_posts.json?token=${token}&title=${title}`;
  const recommendationsResponse = await fetch(recommendationsUrl).then(
    function (response) {
      return response.json();
    }
  );
  if (!recommendationsResponse.data) return;
  return Promise.all(
    recommendationsResponse.data.map(async ({ url }) => {
      const slug = url.split("/").pop();
      return await getPost(slug);
    })
  ).then((data) => data.filter(Boolean));
}

## 6. See it in action¶

You can see how this looks by checking out any blog post in the [Tinybird Blog](https://www.tinybird.co/blog).
At the bottom of each post, you'll find a Related Posts section that's powered by a real Tinybird API using the method described here! <-figure-> ![Tinybird blog related posts uses vector search recommendation algorithm.](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Ftutorial-vector-search-recommendation-1.png&w=3840&q=75) ## Next steps¶ - Read more about[ vector search](https://www.tinybird.co/docs/docs/use-cases/vector-search) and[ content recommendation](https://www.tinybird.co/docs/docs/use-cases/content-recommendation) use cases. - Join the[ Tinybird Slack Community](https://www.tinybird.co/community) for additional support. --- URL: https://www.tinybird.co/docs/index Last update: 2024-11-19T08:08:52.000Z Content: --- title: "Overview of Tinybird · Tinybird Docs" theme-color: "#171612" description: "Tinybird provides you with an easy way to ingest and query large amounts of data with low-latency, and automatically create APIs to consume those queries. This makes it extremely easy to build fast and scalable applications that query your data; no backend needed!" --- # Welcome to Tinybird¶ Tinybird is a data platform for data and engineering teams to solve complex real-time, operational, and user-facing analytics use cases at any scale. Tinybird makes it easy to import data from a variety of sources, use SQL to filter, aggregate, and join that data, and publish low-latency, high-concurrency RESTful API Endpoints. You can build and manage your Tinybird data projects as you would any software, using [version control](https://www.tinybird.co/docs/docs/production/overview). ## Create an account¶ Tinybird has a time-unlimited free tier, so you can start building today and learn at your own pace. [Create an account](https://www.tinybird.co/signup) using a Google, Microsoft, or GitHub account, or your email address. After you create your account, pick the cloud region that works best for you, then set your [Workspace](https://www.tinybird.co/docs/docs/concepts/workspaces) name. ## Try out your Workspace¶ The [Quick start](https://www.tinybird.co/docs/docs/quick-start) gets you building immediately. Learn the basics of ingesting data into Tinybird, writing SQL, and publishing APIs, starting from an empty Workspace and walking through building an example use case. You can also use one of the [ready-to-deploy Starter Kits](https://www.tinybird.co/docs/docs/starter-kits) . The template in each Starter Kit includes pre-built Data Sources, Pipes, and API Endpoints ready to serve data. ## Watch the videos¶ Watch the following videos to get familiar with Tinybird's user interface and CLI. - [ The Tinybird basics in 3 minutes (UI)](https://www.youtube.com/watch?v=cvay_LW685w) - [ Get started with the CLI](https://www.youtube.com/watch?v=OOEe84ly7Cs) - [ Ingest data from a file (UI vs CLI)](https://www.youtube.com/watch?v=1R0G1EolSEM) ## Next steps¶ - Browse the[ Use Case Hub](https://www.tinybird.co/docs/docs/use-cases) and see how to boost your project with real-time, user-facing analytics. - Learn the Tinybird[ core concepts](https://www.tinybird.co/docs/docs/core-concepts) . - Start building using the[ quick start](https://www.tinybird.co/docs/docs/quick-start) . - Read how to build a[ user-facing web analytics dashboard](https://www.tinybird.co/docs/docs/guides/tutorials/user-facing-web-analytics) . 
--- URL: https://www.tinybird.co/docs/ingest/bigquery Last update: 2024-11-12T11:45:41.000Z Content: --- title: "BigQuery Connector · Tinybird Docs" theme-color: "#171612" description: "Documentation for how to use the Tinybird BigQuery Connector" --- # BigQuery Connector¶ Use the BigQuery Connector to load data from BigQuery into Tinybird so that you can quickly turn it into high-concurrency, low-latency API Endpoints. You can load full tables or the result of an SQL query. The BigQuery Connector is fully managed and requires no additional tooling. You can define a sync schedule inside Tinybird and execution is taken care of for you. With the BigQuery Connector you can: - Connect to your BigQuery database with a handful of clicks. Select which tables to sync and set the schedule. - Use an SQL query to get the data you need from BigQuery and then run SQL queries on that data in Tinybird. - Use authentication tokens to control access to API endpoints. Implement access policies as you need, with support for row-level security. Check the [use case examples](https://github.com/tinybirdco/use-case-examples) repository for examples of BigQuery Data Sources iteration using Git integration. The BigQuery Connector can't access BigQuery external tables, like connected Google Sheets. If you need this functionality, reach out to [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co). ## Prerequisites¶ - Tinybird CLI. See[ the Tinybird CLI quick start](https://www.tinybird.co/docs/docs/cli/quick-start) . - Tinybird CLI[ authenticated with the desired Workspace](https://www.tinybird.co/docs/docs/cli/install) . You can switch the Tinybird CLI to the correct Workspace using `tb workspace use `. To use version control, connect your Tinybird Workspace with [your repository](https://www.tinybird.co/docs/production/working-with-version-control#connect-your-workspace-to-git-from-the-cli) , and set the [CI/CD configuration](https://www.tinybird.co/docs/production/continuous-integration) . For testing purposes, use a different connection than in the main branches or Workspaces. For instance to create the connections in the main branch or Workspace using the CLI: tb auth # Use the main Workspace admin Token tb connection create bigquery # Prompts are interactive and ask you to insert the necessary information You can only create connections in the main Workspace. Even when creating the connection in the branch or as part of a Data Source creation flow, it's created in the main workspace and from there it's available for every branch. ## Load a BigQuery table¶ ### Load a BigQuery table in the UI¶ Open the [Tinybird UI](https://app.tinybird.co/) and add a new Data Source by clicking **Create new (+)** next to the Data Sources section on the left hand side navigation bar (see Mark 1 below). <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-bigquery-connector-sync-first-table-ui-1.png&w=3840&q=75) In the modal, select the BigQuery option from the list of Data Sources. The next modal screen shows the **Connection details** . Follow the instructions and configure access to your BigQuery. Access the GCP IAM Dashboard by selecting the **IAM & Admin** link, and use the provided **principal** name from this modal. In the GCP IAM Dashboard, click the **Grant Access** button (see Mark 1 below). 
<-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-bigquery-connector-sync-first-table-ui-4.png&w=3840&q=75) In the box that appears on the right-hand side, paste the **principal** name you just copied into the **New principals** box (see Mark 1 below). Next, in the **Role** box, find and select the role **BigQuery Data Viewer** (see Mark 2 below). <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-bigquery-connector-sync-first-table-ui-5.png&w=3840&q=75) Click **Save** to complete. The principal should now be listed in the **View By Principals** list (see Mark 1 below). <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-bigquery-connector-sync-first-table-ui-7.png&w=3840&q=75) OK! Now return to the Tinybird UI. In the modal, click **Next** (see Mark 1 below). Note: It can take a few seconds for the GCP permissions to apply. The next screen allows you to browse the tables available in BigQuery, and select the table you wish to load. Start by selecting the **project** that the table belongs to (see Mark 1 below), then the **dataset** (see Mark 2 below) and finally the **table** (see Mark 3 below). Finish by clicking **Next** (see Mark 4 below). <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-bigquery-connector-sync-first-table-ui-9.png&w=3840&q=75) Note: the maximum allowed table size is 50 million rows, the result will be truncated if it exceeds that limit. You can now configure the schedule on which you wish to load data. You can configure a schedule in minutes, hours, or days by using the drop down selector, and set the value for the schedule in the text field (see Mark 1 below). The screenshot below shows a schedule of 10 minutes. Next, you can configure the **Import Strategy** . The strategy **Replace data** is selected by default (see Mark 2 below). Finish by clicking **Next** (see Mark 3 below). <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-bigquery-connector-sync-first-table-ui-10.png&w=3840&q=75) Note: the maximum allowed frequency is 5 minutes. The final screen of the modal shows you the interpreted schema of the table, which you can edit as needed. You can also modify what the Data Source in Tinybird will be called by changing the name at the top (see Mark 1 below). To finish, click **Create Data Source** (see Mark 2 below). <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-bigquery-connector-sync-first-table-ui-11.png&w=3840&q=75) You are now on the Data Source data page, where you can view the data that has been loaded (see Mark 1 below) and a status chart showing executions of the loading schedule (see Mark 2 below). <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-bigquery-connector-sync-first-table-ui-12.png&w=3840&q=75) ### Load a BigQuery table in the CLI¶ You need to create a connection before you can load a BigQuery table into Tinybird using the CLI. Creating a connection grants your Tinybird Workspace the appropriate permissions to view data from BigQuery. [Authenticate your CLI](https://www.tinybird.co/docs/docs/cli/install#authentication) and switch to the desired Workspace. Then run: tb connection create bigquery The output of this command includes instructions to configure a GCP principal with read only access to your data in BigQuery. The instructions include the URL to access the appropriate page in GCP's IAM Dashboard. Copy the **principal name** shown in the output. In the GCP IAM Dashboard, select the **Grant Access** button (see Mark 1 below). 
<-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-bigquery-connector-sync-first-table-ui-4.png&w=3840&q=75) In the box that appears on the right-hand side, paste the **principal** name you just copied into the **New principals** box (see Mark 1 below). Next, in the **Role** box, find and select the role **BigQuery Data Viewer** (see Mark 2 below). <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-bigquery-connector-sync-first-table-ui-5.png&w=3840&q=75) Click **Save** to complete. The principal should now be listed in the **View By Principals** list (see Mark 1 below). <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-bigquery-connector-sync-first-table-ui-7.png&w=3840&q=75) Note: It can take a few seconds for the GCP permissions to apply. Once done, select **yes** (y) to create the connection. A new `bigquery.connection` file is created in your project files. Note: At the moment, the `.connection` file is not used and cannot be pushed to Tinybird. It is safe to delete this file. A future release will allow you to push this file to Tinybird to automate creation of connections, similar to Kafka connection. Now that your connection is created, you can create a Data Source and configure the schedule to import data from BigQuery. The BigQuery import is configured using the following options, which can be added at the end of your .datasource file: - `IMPORT_SERVICE` : name of the import service to use, in this case, `bigquery` - `IMPORT_SCHEDULE` : a cron expression (UTC) with the frequency to run imports, must be higher than 5 minutes, e.g. `*/5 * * * *` - `IMPORT_STRATEGY` : the strategy to use when inserting data, either `REPLACE` or `APPEND` - `IMPORT_EXTERNAL_DATASOURCE` : (optional) the fully qualified name of the source table in BigQuery e.g. `project.dataset.table` - `IMPORT_QUERY` : (optional) the SELECT query to extract your data from BigQuery when you don't need all the columns or want to make a transformation before ingestion. The FROM must reference a table using the full scope: `project.dataset.table` Both `IMPORT_EXTERNAL_DATASOURCE` and `IMPORT_QUERY` are optional, but you must provide one of them for the connector to work. Note: For `IMPORT_STRATEGY` only `REPLACE` is supported today. The `APPEND` strategy will be enabled in a future release. For example: ##### bigquery.datasource file DESCRIPTION > bigquery demo data source SCHEMA > `timestamp` DateTime `json:$.timestamp`, `id` Integer `json:$.id`, `orderid` LowCardinality(String) `json:$.orderid`, `status` LowCardinality(String) `json:$.status`, `amount` Integer `json:$.amount` ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYYYYMM(timestamp)" ENGINE_SORTING_KEY "timestamp" ENGINE_TTL "timestamp + toIntervalDay(60)" IMPORT_SERVICE bigquery IMPORT_SCHEDULE */5 * * * * IMPORT_EXTERNAL_DATASOURCE mydb.raw.events IMPORT_STRATEGY REPLACE IMPORT_QUERY > select timestamp, id, orderid, status, amount from mydb.raw.events The columns you select in the `IMPORT_QUERY` must match the columns defined in the Data Source schema. For example, if your Data Source has the columns `ColumnA, ColumnB` then your `IMPORT_QUERY` must contain `SELECT ColumnA, ColumnB FROM ...` . A mismatch of columns causes data to arrive in the [quarantine Data Source](https://www.tinybird.co/docs/docs/guides/ingesting-data/recover-from-quarantine). With your connection created and Data Source defined, you can now push your project to Tinybird using: tb push The first run of the import will begin on the next lapse of the CRON expression. 
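Once the Data Source is pushed, you may want to confirm that the scheduled imports are actually running. The Logs section later on this page shows the underlying query; the following is a minimal Python sketch of the same check using the Query API, assuming a `TB_TOKEN` environment variable with permission to read the `datasources_ops_log` Service Data Source and your own Data Source id in place of `t_1234`:

##### Check recent BigQuery import executions (sketch)

import os
import requests

TB_HOST = "https://api.tinybird.co"  # adjust to your region's API host
TB_TOKEN = os.getenv("TB_TOKEN")     # assumed: a token that can read Service Data Sources

query = """
SELECT timestamp, event_type, result, error, job_id
FROM tinybird.datasources_ops_log
WHERE datasource_id = 't_1234'
ORDER BY timestamp DESC
LIMIT 5
FORMAT JSON
"""

r = requests.get(
    f"{TB_HOST}/v0/sql",
    params={"q": query},
    headers={"Authorization": f"Bearer {TB_TOKEN}"},
)
print(r.json().get("data", []))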
## Configure granular permissions¶ If you need to configure more granular permissions for BigQuery, you can always grant access at dataset or individual object level. The first step is creating a new role in your [IAM & Admin Console in GCP](https://console.cloud.google.com/iam-admin/roles/create) , and assigning the `resourcemanager.projects.get` permission. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-bigquery-connector-custom-role-1.png&w=3840&q=75) The Connector needs this permission to list the available projects the generated Service Account has access to, so you can explore the BigQuery tables and views in the Tinybird UI. After that, you can grant permissions to specific datasets to the Service Account by clicking on **Sharing** > **Permissions**: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-bigquery-connector-custom-role-2.png&w=3840&q=75) Then **ADD PRINCIPAL**: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-bigquery-connector-custom-role-3.png&w=3840&q=75) And finally paste the **principal** name copied earlier into the **New principals** box. Next, in the **Role** box, find and select the role **BigQuery Data Viewer**: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fguides-bigquery-connector-custom-role-4.png&w=3840&q=75) Now the Tinybird Connector UI only shows the specific resources you've granted permissions to. ## Schema evolution¶ The BigQuery Connector supports backwards compatible changes made in the source table. This means that, if you add a new column in BigQuery, the next sync job will automatically add it to the Tinybird Data Source. Non-backwards compatible changes, such as dropping or renaming columns, are not supported and will cause the next sync to fail. ## Iterate a BigQuery Data Source¶ To iterate a BigQuery Data Source, use the Tinybird CLI and the version control integration to handle your resources. You can only create connections in the main Workspace. When creating the connection in a Branch, it's created in the main Workspace and from there is available to every Branch. ### Add a new BigQuery Data Source¶ You can add a new Data Source directly with the UI or the CLI tool, following [the load of a BigQuery table section](https://www.tinybird.co/docs/about:blank#load-a-BigQuery-table). When adding a Data Source in a Tinybird Branch, it will work for testing purposes, but won't have any connection details internally. You must add the connection and BigQuery configuration in the .datasource Datafile when moving to production. To add a new Data Source using the recommended version control workflow check the instructions in the [examples repository](https://github.com/tinybirdco/use-case-examples). ### Update a Data Source¶ - BigQuery Data Sources can't be modified directly from UI - When you create a new Tinybird Branch, the existing BigQuery Data Sources won't be connected. You need to re-create them in the Branch. - In Branches, it's usually useful to work with[ fixtures](https://www.tinybird.co/docs/production/implementing-test-strategies#fixture-tests) , as they'll be applied as part of the CI/CD, allowing the full process to be deterministic in every iteration and avoiding quota consume from external services. 
BigQuery Data Sources can be modified from the CLI tool: tb auth # modify the .datasource Datafile with your editor tb push --force {datafile} # check the command output for errors To update it using the recommended version control workflow check the instructions in the [examples repository](https://github.com/tinybirdco/use-case-examples). ### Delete a Data Source¶ BigQuery Data Sources can be deleted directly from UI or CLI like any other Data Source. To delete it using the recommended version control workflow check the instructions in the [examples repository](https://github.com/tinybirdco/use-case-examples). ## Logs¶ Job executions are logged in the `datasources_ops_log` [Service Data Source](https://www.tinybird.co/docs/docs/monitoring/service-datasources) . This log can be checked directly in the Data Source view page in the UI. Filter by `datasource_id` to monitor ingestion through the BigQuery Connector from the `datasources_ops_log`: SELECT timestamp, event_type, result, error, job_id FROM tinybird.datasources_ops_log WHERE datasource_id = 't_1234' AND event_type = 'replace' ORDER BY timestamp DESC ## Limits¶ See [BigQuery Connector limits](https://www.tinybird.co/docs/docs/support/limits#bigquery-connector-limits). --- URL: https://www.tinybird.co/docs/ingest/confluent Last update: 2024-11-08T11:23:54.000Z Content: --- title: "Confluent Connector · Tinybird Docs" theme-color: "#171612" description: "Use the Confluent Connector to bring data from Confluent to Tinybird." --- # Confluent Connector¶ Use the Confluent Connector to bring data from your existing Confluent Cloud cluster into Tinybird so that you can quickly turn them into high-concurrency, low-latency REST API Endpoints and query using SQL. The Confluent Connector is fully managed and requires no additional tooling. Connect Tinybird to your Confluent Cloud cluster, select a topic, and Tinybird automatically begins consuming messages from Confluent Cloud. ## Prerequisites¶ You need to grant `READ` permissions to both the Topic and the Consumer Group to ingest data from Confluent into Tinybird. The Confluent Cloud Schema Registry is only supported for decoding Avro messages. When using Confluent Schema Registry, the Schema name must match the Topic name. For example, if you're ingesting the Kafka Topic `my-kafka-topic` using a Connector with Schema Registry enabled, it expects to find a Schema named `my-kafka-topic-value`. ## Create the Data Source using the UI¶ To connect Tinybird to your Confluent Cloud cluster, select **Create new (+)** next to the data project section, select **Data Source** , and then select **Confluent** from the list of available Data Sources. Enter the following details: - ** Connection name** : A name for the Confluent Cloud connection in Tinybird. - ** Bootstrap Server** : The comma-separated list of bootstrap servers, including port numbers. - ** Key** : The key component of the Confluent Cloud API Key. - ** Secret** : The secret component of the Confluent Cloud API Key. - ** Decode Avro messages with schema registry** : (Optional) Turn on Schema Registry support to decode Avro messages. Enter the Schema Registry URL, username, and password. After you've entered the details, select **Connect** . This creates the connection between Tinybird and Confluent Cloud. A list of your existing topics appears and you can select the topic to consume from. Tinybird creates a **Group ID** that specifies the name of the consumer group that this Kafka consumer belongs to. 
You can customize the Group ID, but ensure that your Group ID has **Read** permissions to the topic. After you've chosen a topic, you can select the starting offset to consume from. You can consume from the earliest offset or the latest offset: - If you consume from the earliest offset, Tinybird consumes all messages from the beginning of the topic. - If you consume from the latest offset, Tinybird only consumes messages that are produced after the connection is created. After selecting the offset, select **Next** . Tinybird consumes a sample of messages from the topic and displays the schema. You can adjust the schema and Data Source settings as needed, then select **Create Data Source**. Tinybird begins consuming messages from the topic and loading them into the Data Source. ## Configure the connector using .datasource files¶ If you are managing your Tinybird resources in files, there are several settings available to configure the Confluent Connector in .datasource files. See the [datafiles docs](https://www.tinybird.co/docs/docs/cli/datafiles/datasource-files#kafka-confluent-redpanda) for more information. The following is an example of Kafka .datasource file for an already existing connection: ##### Example data source for Confluent Connector SCHEMA > `__value` String, `__topic` LowCardinality(String), `__partition` Int16, `__offset` Int64, `__timestamp` DateTime, `__key` String `__headers` Map(String,String) ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYYYYMM(timestamp)" ENGINE_SORTING_KEY "timestamp" # Connection is already available. If you # need to create one, add the required fields # on an include file with the details. KAFKA_CONNECTION_NAME my_connection_name KAFKA_TOPIC my_topic KAFKA_GROUP_ID my_group_id KAFKA_STORE_HEADERS true ### Columns of the Data Source¶ When you connect a Kafka producer to Tinybird, Tinybird consumes optional metadata columns from that Kafka record and writes them to the Data Source. The following fields represent the raw data received from Kafka: - `__value` : A String representing the entire unparsed Kafka record inserted. - `__topic` : The Kafka topic that the message belongs to. - `__partition` : The kafka partition that the message belongs to. - `__offset` : The Kafka offset of the message. - `__timestamp` : The timestamp stored in the Kafka message received by Tinybird. - `__key` : The key of the kafka message. - `__headers` : Headers parsed from the incoming topic messages. See[ Using custom Kafka headers for advanced message processing](https://www.tinybird.co/blog-posts/using-custom-kafka-headers) . Metadata fields are optional. Omit the fields you don't need to reduce your data storage. ### Use INCLUDE to store connection settings¶ To avoid configuring the same connection settings across many files, or to prevent leaking sensitive information, you can store connection details in an external file and use `INCLUDE` to import them into one or more .datasource files. You can find more information about `INCLUDE` in the [Advanced Templates](https://www.tinybird.co/docs/docs/cli/advanced-templates) documentation. For example, you might have two Confluent Cloud .datasource files, which re-use the same Confluent Cloud connection. You can create an include file which stores the Confluent Cloud connection details. 
The Tinybird project might use the following structure: ##### Tinybird data project file structure ecommerce_data_project/ datasources/ connections/ my_connector_name.incl my_confluent_datasource.datasource another_datasource.datasource endpoints/ pipes/ Where the file `my_connector_name.incl` has the following content: ##### Include file containing Confluent Cloud connection details KAFKA_CONNECTION_NAME my_connection_name KAFKA_BOOTSTRAP_SERVERS my_server:9092 KAFKA_KEY my_username KAFKA_SECRET my_password And the Confluent Cloud .datasource files look like the following: ##### Data Source using includes for Confluent Cloud connection details SCHEMA > `value` String, `topic` LowCardinality(String), `partition` Int16, `offset` Int64, `timestamp` DateTime, `key` String ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYYYYMM(timestamp)" ENGINE_SORTING_KEY "timestamp" INCLUDE "connections/my_connection_name.incl" KAFKA_TOPIC my_topic KAFKA_GROUP_ID my_group_id When using `tb pull` to pull a Confluent Cloud Data Source using the CLI, the `KAFKA_KEY` and `KAFKA_SECRET` settings are not included in the file to avoid exposing credentials. ### Internal fields¶ The `__` fields stored in the Kafka datasource represent the raw data received from Kafka: - `__value` : A String representing the whole Kafka record inserted. - `__topic` : The Kafka topic that the message belongs to. - `__partition` : The kafka partition that the message belongs to. - `__offset` : The Kafka offset of the message. - `__timestamp` : The timestamp stored in the Kafka message received by Tinybird. - `__key` : The key of the kafka message. ## Compressed messages¶ Tinybird can consume from Kafka topics where Kafka compression is enabled, as decompressing the message is a standard function of the Kafka Consumer. However, if you compressed the message before passing it through the Kafka Producer, then Tinybird can't do post-Consumer processing to decompress the message. For example, if you compressed a JSON message through gzip and produced it to a Kafka topic as a `bytes` message, it's ingested by Tinybird as `bytes` . If you produced a JSON message to a Kafka topic with the Kafka Producer setting `compression.type=gzip` , while it's stored in Kafka as compressed bytes, it's decoded on ingestion and arrive to Tinybird as JSON. --- URL: https://www.tinybird.co/docs/ingest/datasource-api Last update: 2024-11-07T15:40:07.000Z Content: --- title: "Data Sources API · Tinybird Docs" theme-color: "#171612" description: "Use the Data Sources API to create and manage your Data Sources as well as importing data into them." --- # Data Sources API¶ Use Tinybird's Data Sources API to import files into your Tinybird Data Sources. With the Data Sources API you can use files to create new Data Sources, and append data to, or replace data from, an existing Data Source. See [Data Sources](https://www.tinybird.co/docs/docs/concepts/data-sources). The following examples show how to use the Data Sources API to perform various tasks. See the [Data Sources API Reference](https://www.tinybird.co/docs/docs/api-reference/datasource-api) for more information. ## Import a file into a new Data Source¶ Tinybird can create a Data Source from a file. This operation supports CSV, NDJSON, and Parquet files. You can create a Data Source from local or remote files. Automatic schema inference is supported for CSV files, but isn't supported for NDJSON or Parquet files. 
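For NDJSON and Parquet files, where automatic inference isn't available, you can generate the schema with the Analyze API, described in more detail at the end of this page. As a quick illustration, here is a hedged Python sketch (the file name and the `TB_TOKEN` environment variable are assumptions):

##### Generate an NDJSON schema with the Analyze API (sketch)

import os
import requests

TB_HOST = "https://api.tinybird.co"  # adjust to your region's API host
TB_TOKEN = os.getenv("TB_TOKEN")     # assumed: a token that can create Data Sources

# Send a local NDJSON file to the Analyze API
with open("local_file.ndjson", "rb") as f:
    r = requests.post(
        f"{TB_HOST}/v0/analyze",
        headers={"Authorization": f"Bearer {TB_TOKEN}"},
        files={"ndjson": f},
    )

# The guessed schema can be passed to the Data Sources API as the `schema` parameter
print(r.json()["analysis"]["schema"])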
### CSV files¶ CSV files must follow these requirements: - One line per row - Comma-separated Tinybird supports Gzip compressed CSV files with .csv.gz extension. The Data Sources API automatically detects and optimizes your column types, so you don't need to manually define a schema. You can use the `type_guessing=false` parameter to force Tinybird to use `String` for every column. CSV headers are optional. When creating a Data Source from a CSV file, if your file contains a header row, Tinybird uses the header to name your columns. If no header is present, your columns receive default names with an incrementing number. When appending a CSV file to an existing Data Source, if your file has a header, Tinybird uses the headers to identify the columns. If no header is present, Tinybird uses the order of columns. If the order of columns in the CSV file is always the same, you can omit the header line. For example, to create a new Data Source from a local file using cURL: ##### Creating a Data Source from a local CSV file curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources?name=my_datasource_name" \ -F csv=@local_file.csv From a remote file: ##### Creating a Data Source from a remote CSV file curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources?name=my_datasource_name" \ -d url='https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-12.csv' When importing a remote file from a URL, the response contains the details of an import Job. To see the status of the import, use the [Jobs API](https://www.tinybird.co/docs/docs/api-reference/jobs-api). ### NDJSON and Parquet files¶ The Data Sources API doesn't support automatic schema inference for NDJSON and Parquet files. You must specify the `schema` parameter with a valid schema to parse the files. The schema for both NDJSON and Parquet files uses [JSONPaths](https://www.tinybird.co/docs/docs/guides/ingesting-data/ingest-ndjson-data#jsonpaths) to identify columns in the data. You can add default values to the schema. Tinybird supports compressed NDJSON and Parquet files with .ndjson.gz and .parquet.gz extensions. You can use the [Analyze API](https://www.tinybird.co/docs/about:blank#generate-schemas-with-the-analyze-api) to automatically generate a schema definition from a file. For example, assume your NDJSON or Parquet data looks like this: ##### Simple NDJSON data example { "id": 123, "name": "Al Brown"} Your schema definition must provide the JSONPath expressions to identify the columns `id` and `name`: ##### Simple NDJSON schema definition id Int32 `json:$.id`, name String `json:$.name` To create a new Data Source from a local file using cURL, you must URL encode the Schema as a query parameter. The following examples use NDJSON. 
To use Parquet, adjust the `format` parameter to `format=parquet`:

##### Creating a Data Source from a local NDJSON file

curl \
    -H "Authorization: Bearer " \
    -X POST "https://api.tinybird.co/v0/datasources?name=events&mode=create&format=ndjson&schema=id%20Int32%20%60json%3A%24.id%60%2C%20name%20String%20%60json%3A%24.name%60" \
    -F ndjson=@local_file.ndjson

From a remote file:

##### Creating a Data Source from a remote NDJSON file

curl \
    -H "Authorization: Bearer " \
    -X POST "https://api.tinybird.co/v0/datasources?name=events&mode=create&format=ndjson" \
    --data-urlencode "schema=id Int32 \`json:$.id\`, name String \`json:$.name\`" \
    -d url='http://example.com/file.json'

Note the escape characters in this example are only required due to backticks in cURL.

When importing a remote file from a URL, the response contains the details of an import Job. To see the status of the import, you must use the [Jobs API](https://www.tinybird.co/docs/docs/api-reference/jobs-api).

To add default values to the schema, use the `DEFAULT` parameter after the JSONPath expressions. For example:

##### Simple NDJSON schema definition with default values

id Int32 `json:$.id` DEFAULT 1,
name String `json:$.name` DEFAULT 'Unknown'

## Append a file into an existing Data Source¶

If you already have a Data Source, you can append the contents of a file to the existing data. This operation supports CSV, NDJSON, and Parquet files. You can append data from local or remote files.

When appending CSV files, you can improve performance by excluding the CSV Header line. However, in this case, you must ensure the CSV columns are ordered. If you can't guarantee the order of columns in your CSV, include the CSV Header.

For example, to append data into an existing Data Source from a local file using cURL:

##### Appending data to a Data Source from a local CSV file

curl \
    -H "Authorization: Bearer " \
    -X POST "https://api.tinybird.co/v0/datasources?mode=append&name=my_datasource_name" \
    -F csv=@local_file.csv

From a remote file:

##### Appending data to a Data Source from a remote CSV file

curl \
    -H "Authorization: Bearer " \
    -X POST "https://api.tinybird.co/v0/datasources?mode=append&name=my_datasource_name" \
    -d url='https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-12.csv'

If the Data Source has dependent Materialized Views, data is appended in cascade.

## Replace data in an existing Data Source with a file¶

If you already have a Data Source, you can replace existing data with the contents of a file. You can replace all data or a selection of data. This operation supports CSV, NDJSON, and Parquet files. You can replace with data from local or remote files.

For example, to replace all the data in a Data Source with data from a local file using cURL:

##### Replacing a Data Source from a local CSV file

curl \
    -H "Authorization: Bearer " \
    -X POST "https://api.tinybird.co/v0/datasources?mode=replace&name=data_source_name&format=csv" \
    -F csv=@local_file.csv

From a remote file:

##### Replacing a Data Source from a URL

curl \
    -H "Authorization: Bearer " \
    -X POST "https://api.tinybird.co/v0/datasources?mode=replace&name=data_source_name&format=csv" \
    --data-urlencode "url=http://example.com/file.csv"

Rather than replacing all data, you can also replace specific partitions of data. This operation is atomic. To do this, use the `replace_condition` parameter. This parameter defines the filter that's applied, where all matching rows are deleted before finally ingesting the new file. Only the rows matching the condition are ingested.
If the source file contains rows that don't match the filter, the rows are ignored. Replacements are made by partition, so make sure that the `replace_condition` filters on the partition key of the Data Source. To replace filtered data in a Data Source with data from a local file using cURL, you must URL encode the `replace_condition` as a query parameter. For example: ##### Replace filtered data in a Data Source with data from a local file curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources?mode=replace&name=data_source_name&format=csv&replace_condition=my_partition_key%20%3E%20123" \ -F csv=@local_file.csv From a remote file: ##### Replace filtered data in a Data Source with data from a remote file curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources?mode=replace&name=data_source_name&format=csv" \ -d replace_condition='my_partition_key > 123' \ --data-urlencode "url=http://example.com/file.csv" All the dependencies of the Data Source, for example Materialized Views, are recalculated so that your data is consistent after the replacement. If you have n-level dependencies, they're also updated by this operation. Taking the example `A --> B --> C` , if you replace data in A, Data Sources B and C are automatically updated. The Partition Key of Data Source C must also be compatible with Data Source A. You can find more examples in the [Replace and delete data](https://www.tinybird.co/docs/docs/guides/ingesting-data/replace-and-delete-data#replace-data-selectively) guide. Although replacements are atomic, Tinybird can't assure data consistency if you continue appending data to any related Data Source at the same time the replacement takes place. The new incoming data is discarded. ## Creating an empty Data Source from a schema¶ When you want to have more granular control about the Data Source schema, you can manually create the Data Source with a specified schema. For example, to create an empty Data Source with a set schema using cURL: ##### Create an empty Data Source with a set schema curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/datasources?name=stocks" \ -d "schema=symbol String, date Date, close Float32" To create an empty Data Source, you must pass a `schema` with your desired column names and types and leave the `url` parameter empty. ## Generate schemas with the Analyze API¶ The Analyze API can analyze a given NDJSON or Parquet file to produce a valid schema. The column names, types, and JSONPaths are inferred from the file. For example, to analyze a local NDJSON file using cURL: ##### analyze a NDJSON file to get a valid schema curl \ -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/analyze" \ -F "ndjson=@local_file_path" The response contains a `schema` field that can be used to create your Data Source. 
For example: ##### Successful analyze response { "analysis": { "columns": [{ "path": "$.a_nested_array.nested_array[:]", "recommended_type": "Array(Int16)", "present_pct": 3, "name": "a_nested_array_nested_array" }, { "path": "$.an_array[:]", "recommended_type": "Array(Int16)", "present_pct": 3, "name": "an_array" }, { "path": "$.field", "recommended_type": "String", "present_pct": 1, "name": "field" }, { "path": "$.nested.nested_field", "recommended_type": "String", "present_pct": 1, "name": "nested_nested_field" } ], "schema": "a_nested_array_nested_array Array(Int16) `json:$.a_nested_array.nested_array[:]`, an_array Array(Int16) `json:$.an_array[:]`, field String `json:$.field`, nested_nested_field String `json:$.nested.nested_field`" }, "preview": { "meta": [{ "name": "a_nested_array_nested_array", "type": "Array(Int16)" }, { "name": "an_array", "type": "Array(Int16)" }, { "name": "field", "type": "String" }, { "name": "nested_nested_field", "type": "String" } ], "data": [{ "a_nested_array_nested_array": [ 1, 2, 3 ], "an_array": [ 1, 2, 3 ], "field": "test", "nested_nested_field": "bla" }], "rows": 1, "statistics": { "elapsed": 0.00032175, "rows_read": 2, "bytes_read": 142 } } } ## Error handling¶ Most errors return an HTTP Error code, for example `HTTP 4xx` or `HTTP 5xx`. However, if the imported file is valid, but some rows failed to ingest due to an incompatible schema, you might still receive an `HTTP 200` . In this case, the Response body contains two keys, `invalid_lines` and `quarantine_rows` , which tell you how many rows failed to ingest. Additionally, an `error` key is present with an error message. ##### Successful ingestion with errors { "import_id": "e9ae235f-f139-43a6-7ad5-a1e17c0071c2", "datasource": { "id": "t_0ab7a11969fa4f67985cec481f71a5c2", "name": "your_datasource_name", "cluster": null, "tags": {}, "created_at": "2019-03-12 17:45:04", "updated_at": "2019-03-12 17:45:04", "statistics": { "bytes": 1397, "row_count": 4 }, "replicated": false, "version": 0, "project": null, "used_by": [] }, "error": "There was an error with file contents: 2 rows in quarantine and 2 invalid lines", "quarantine_rows": 2, "invalid_lines": 2 } --- URL: https://www.tinybird.co/docs/ingest/dynamodb Last update: 2024-11-13T15:42:17.000Z Content: --- title: "DynamoDB Connector · Tinybird Docs" theme-color: "#171612" description: "Bring your DynamoDB data to Tinybird using the DynamoDB Connector." --- # DynamoDB Connector¶ Use the DynamoDB Connector to ingest historical and change stream data from Amazon DynamoDB to Tinybird. The DynamoDB Connector is fully managed and requires no additional tooling. Connect Tinybird to DynamoDB, select your tables, and Tinybird keeps in sync with DynamoDB. With the DynamoDB Connector you can: - Connect to your DynamoDB tables and start ingesting data in minutes. - Query your DynamoDB data using SQL and enrich it with dimensions from your streaming data, warehouse, or files. - Use Auth tokens to control access to API endpoints. Implement access policies as you need. Support for row-level security. DynamoDB Connector only works with Workspaces created in AWS Regions. ## Prerequisites¶ - Tinybird CLI version 5.3.0 or higher. See[ the Tinybird CLI quick start](https://www.tinybird.co/docs/docs/cli/quick-start) . - Tinybird CLI[ authenticated with the desired Workspace](https://www.tinybird.co/docs/docs/cli/install) . - DynamoDB Streams is active on the target DynamoDB tables with `NEW_IMAGE` or `NEW_AND_OLD_IMAGE` type. 
- Point-in-time recovery (PITR) is active on the target DynamoDB table.

You can switch the Tinybird CLI to the correct Workspace using `tb workspace use `.

Supported characters for column names are letters, numbers, underscores, and dashes. Tinybird automatically sanitizes invalid characters like dots or dollar signs.

## Required permissions¶

The DynamoDB Connector requires certain permissions to access your tables. The IAM Role needs the following permissions:

- `dynamodb:Scan`
- `dynamodb:DescribeStream`
- `dynamodb:DescribeExport`
- `dynamodb:GetRecords`
- `dynamodb:GetShardIterator`
- `dynamodb:DescribeTable`
- `dynamodb:DescribeContinuousBackups`
- `dynamodb:ExportTableToPointInTime`
- `dynamodb:UpdateTable`
- `dynamodb:UpdateContinuousBackups`

The following is an example of an AWS Access Policy. When configuring the connector, the UI, CLI and API all provide the necessary policy templates.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "dynamodb:Scan",
                "dynamodb:DescribeStream",
                "dynamodb:DescribeExport",
                "dynamodb:GetRecords",
                "dynamodb:GetShardIterator",
                "dynamodb:DescribeTable",
                "dynamodb:DescribeContinuousBackups",
                "dynamodb:ExportTableToPointInTime",
                "dynamodb:UpdateTable",
                "dynamodb:UpdateContinuousBackups"
            ],
            "Resource": [
                "arn:aws:dynamodb:*:*:table/",
                "arn:aws:dynamodb:*:*:table//stream/*",
                "arn:aws:dynamodb:*:*:table//export/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::",
                "arn:aws:s3:::/*"
            ]
        }
    ]
}

The following is an example trust policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Principal": {
                "AWS": "arn:aws:iam::473819111111111:root"
            },
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "ab3caaaa-01aa-4b95-bad3-fff9b2ac789f8a9"
                }
            }
        }
    ]
}

## Load a table using the CLI¶

To load a DynamoDB table into Tinybird using the CLI, create a connection and then a Data Source. The connection grants your Tinybird Workspace the necessary permissions to access AWS and your tables in DynamoDB. The Data Source then maps a table in DynamoDB to a table in Tinybird and manages the historical and continuous sync.

### Create the DynamoDB connection¶

The connection grants your Tinybird Workspace the necessary permissions to access AWS and your tables in DynamoDB. To connect, run the following command:

tb connection create dynamodb

This command initiates the process of creating a connection. When prompted, type `y` to proceed.

### Create a new IAM Policy in AWS¶

The Tinybird CLI provides a policy template.

1. Replace `` with the name of your DynamoDB table.
2. Replace `` with the name of the S3 bucket you want to use for the initial load.
3. In AWS, go to **IAM**, **Policies**, **Create Policy**.
4. Select the **JSON** tab and paste the modified policy text.
5. Save and create the policy.

### Create a new IAM Role in AWS¶

1. Return to the Tinybird CLI to get a trust policy template.
2. In AWS, go to **IAM**, **Roles**, **Create Role**.
3. Select **Custom Trust Policy** and paste the trust policy copied from the CLI.
4. In the **Permissions** tab, attach the policy created in the previous step.
5. Complete the role creation process.

### Complete the connection¶

In the AWS IAM console, find the role you've created. Copy its Amazon Resource Name (ARN), which looks like `arn:aws:iam::111111111111:role/my-awesome-role`.
Provide the following information to the Tinybird CLI:

- The Role ARN
- AWS region of your DynamoDB tables
- Connection name

Tinybird uses the connection name to identify the connection. The name can only contain alphanumeric characters `a-zA-Z` and underscores `_`, and must start with a letter.

When the CLI prompts are completed, Tinybird creates the connection. The CLI will generate a `.connection` file in your project directory. This file is not used and is safe to delete. A future release will allow you to push this file to Tinybird to automate the creation of connections, similar to Kafka connections.

### Create a DynamoDB Data Source file¶

The Data Source maps a table in DynamoDB to a table in Tinybird and manages the historical and continuous sync. [Data Source files](https://www.tinybird.co/docs/docs/cli/datafiles/datasource-files) contain the table schema, and specific DynamoDB properties to target the table that Tinybird imports.

Create a Data Source file called `mytable.datasource`. There are two approaches to defining the schema for a DynamoDB Data Source:

1. Define the Partition Key and Sort Key from your DynamoDB table, and access other properties from JSON at query time.
2. Define all DynamoDB item properties as columns.

The Partition Key and Sort Key, if any, from your DynamoDB table must be defined in the Data Source schema. These are the only properties that are mandatory to define, as they're used for deduplication of records (upserts and deletes).

#### Approach 1: Define only the Partition Key and Sort Key¶

If you don't want to map all properties from your DynamoDB table, you can define only the Partition Key and Sort Key. The entire DynamoDB item is stored as JSON in a `_record` column, and you can extract properties using `JSONExtract*` functions.

For example, if you have a DynamoDB table with `transaction_id` as the Partition Key, you can define your Data Source schema like this:

##### mytable.datasource

SCHEMA >
    transaction_id String `json:$.transaction_id`

IMPORT_SERVICE "dynamodb"
IMPORT_CONNECTION_NAME
IMPORT_TABLE_ARN
IMPORT_EXPORT_BUCKET

Replace the `` with the name of the connection created in the first step. Replace `` with the ARN of the table you'd like to import. Replace `` with the name of the S3 bucket you want to use for the initial sync.

#### Approach 2: Define all DynamoDB item properties as columns¶

If you want to strictly define all your properties and their types, you can map them into your Data Source as columns. You can map properties to [any of the supported types in Tinybird](https://www.tinybird.co/docs/docs/concepts/data-sources#supported-data-types). Properties can also be arrays of the previously mentioned types, and nullable. Use the nullable type when there are properties that might not have a value in every item within your DynamoDB table.
For example, if you have a DynamoDB with items like this: { "timestamp": "2024-07-25T10:46:37.380Z", "transaction_id": "399361d5-10fc-4777-8187-88aaa4623569", "name": "Chris Donnelly", "passport_number": 4904040, "flight_from": "Burien", "flight_to": "Sanford", "airline": "BrianAir" } Where `transaction_id` is the partition key, you can define your Data Source schema like this: ##### mytable.datasource SCHEMA > `timestamp` DateTime64(3) `json:$.timestamp`, `transaction_id` String `json:$.transaction_id`, `name` String `json:$.name`, `passport_number` Int64 `json:$.passport_number`, `flight_from` String `json:$.flight_from`, `flight_to` String `json:$.flight_to`, `airline` String `json:$.airline` IMPORT_SERVICE "dynamodb" IMPORT_CONNECTION_NAME IMPORT_TABLE_ARN IMPORT_EXPORT_BUCKET Replace `` with the name of the connection created in the first step. Replace `` with the ARN of the table you'd like to import. Replace `` with the name of the S3 bucket you want to use for the initial sync. You can map properties with basic types (String, Number, Boolean, Binary, String Set, Number Set) at the root item level. Follow this schema definition pattern: `json:$.` - `PropertyName` is the name of the column within your Tinybird Data Source. - `PropertyType` is the type of the column within your Tinybird Data Source. It must match the type in the DynamoDB Data Source: - Strings correspond to `String` columns. - All `Int` , `UInt` , or `Float` variants correspond to `Number` columns. - `Array(String)` corresponds to `String Set` columns. - `Array(UInt)` and all numeric variants correspond to `Number Set` columns. - `PropertyNameInDDB` is the name of the property in your DynamoDB table. It must match the letter casing. Map properties within complex types, like `Maps` , using JSONPaths. For example, you can map a property at the first level in your Data Source schema like: MyString String `json:$..`. For `Lists` , standalone column mapping isn't supported. Those properties require extraction using `JSONExtract*` functions or consumed after a transformation with a Materialized View. ### Push the Data Source¶ With your connection created and Data Source defined, push your Data Source to Tinybird using `tb push`. For example, if your Data Source file is `mytable.datasource` , run: tb push mytable.datasource Due to how Point-in-time recovery works, data might take some minutes before it appears in Tinybird. This delay only happens the first time Tinybird retrieves the table. ## Load a table using the UI¶ To load a DynamoDB table into Tinybird using the UI, select the DynamoDB option in the Data Source dialog. You need an existing connection to your DynamoDB table. The UI guides you through the process of creating a connection and finally creating a Data Source that imports the data from your DynamoDB table. ### Create a DynamoDB connection¶ When you create a connection, provide the following information: - AWS region of your DynamoDB tables - ARN of the table you want to import - Name of the S3 bucket you want to use for the initial sync. In the next step, provide the ARN of the IAM Role you created in AWS. This role must have the necessary permissions to access your DynamoDB tables and S3 bucket. ### Create a Data Source¶ After you've created the connection, a preview of the imported data appears. You can change the schema columns, the sorting key, or the TTL. Due to the schemaless nature of DynamoDB, the preview might not show all the columns in your table. 
You can manually add columns to the schema in the **Code Editor** tab. When you're ready, select **Create Data Source**.

Due to how Point-in-time recovery works, data might take some minutes before it appears in Tinybird. This delay only happens the first time Tinybird retrieves the table.

## Columns added by Tinybird¶

When loading a DynamoDB table, Tinybird automatically adds the following columns:

| Column | Type | Description |
| --- | --- | --- |
| `_record` | `Nullable(String)` | Contents of the event, in JSON format. Added to `NEW_IMAGE` and `NEW_AND_OLD_IMAGES` streams. |
| `_old_record` | `Nullable(String)` | Stores the previous state of the record. Added to `NEW_AND_OLD_IMAGES` streams. |
| `_timestamp` | `DateTime64(3)` | Date and time of the event. |
| `_event_type` | `LowCardinality(String)` | Type of the event. |
| `_is_deleted` | `UInt8` | Whether the record has been deleted. |

If an existing table with stream type `NEW_AND_OLD_IMAGES` is missing the `_old_record` column, add it manually with the following configuration: `_old_record` Nullable(String) `json:$.OldImage`.

## Iterate a Data Source¶

To iterate a DynamoDB Data Source, use the Tinybird CLI and the [version control integration](https://www.tinybird.co/docs/docs/production/working-with-version-control) to handle your resources.

You can only create connections in the main Workspace. When creating the connection in a Branch, it's created in the main Workspace and from there is available to every Branch. DynamoDB Data Sources created in a Branch aren't connected to your source. AWS DynamoDB documentation discourages [reading the same Stream from various processes](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html#Streams.Processing), because it can result in throttling. This can affect the ingestion in the main Branch.

Browse the [use case examples](https://github.com/tinybirdco/use-case-examples) repository to find basic instructions and examples to handle DynamoDB Data Sources iteration using git integration.

### Add a new DynamoDB Data Source¶

You can add a new Data Source directly with the Tinybird CLI. See [load of a DynamoDB table](https://www.tinybird.co/docs/about:blank#load-a-table-using-the-cli). To add a new Data Source using the recommended version control workflow, see the [examples repository](https://github.com/tinybirdco/use-case-examples/tree/main/iterate_dynamodb).

When you add a Data Source to a Tinybird Branch, it doesn't have any connection details. You must add the Connection and DynamoDB configuration in the .datasource Datafile when moving to a production environment or Branch.

### Update a Data Source¶

You can modify DynamoDB Data Sources using the Tinybird CLI. For example:

tb auth
# modify the .datasource Datafile with your editor
tb push --force {datafile}
# check the command output for errors

When updating an existing DynamoDB Data Source, the first sync isn't repeated; only the new item modifications are synchronized by the CDC process. To update a Data Source using the recommended version control workflow, see the [examples repository](https://github.com/tinybirdco/use-case-examples/tree/main/iterate_dynamodb).

In Branches, work with [fixtures](https://www.tinybird.co/docs/docs/production/implementing-test-strategies#fixture-tests), as they're applied as part of the CI/CD, allowing the full process to be deterministic in every iteration and avoiding quota usage from external services.
### Delete a Data Source¶

You can delete DynamoDB Data Sources like any other Data Source. To delete it using the recommended version control workflow, see the [examples repository](https://github.com/tinybirdco/use-case-examples/tree/main/iterate_dynamodb).

## Retrieve logs¶

You can find DynamoDB logs in the `datasources_ops_log` Service Data Source. You can check logs directly in the Data Source screen. Filter by `datasource_id` to select the correct datasource, and use `event_type` to select between initial synchronization logs ( `sync-dynamodb` ) or update logs ( `append-dynamodb` ).

To select all DynamoDB related logs in the last day, run the following query:

SELECT *
FROM tinybird.datasources_ops_log
WHERE
    datasource_id = 't_1234'
    AND event_type in ['sync-dynamodb', 'append-dynamodb']
    AND timestamp > now() - INTERVAL 1 day
ORDER BY timestamp DESC

## Connector architecture¶

AWS provides two free, default functions for DynamoDB:

- DynamoDB Streams captures change events for a given DynamoDB table and provides an API to access events as a stream. This allows CDC-like access to the table for continuous updates.
- You can use Point-in-time recovery (PITR) to take snapshots of your DynamoDB table and save the export to S3. This allows historical access to table data for batch uploads.

The DynamoDB Connector uses the following functions to send DynamoDB data to Tinybird:

<-figure->

![Connecting DynamoDB to Tinybird architecture](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fassets%2Fguides%2Fingest-from-dynamodb%2Ftinybird-dynamodb-connector-arch.png&w=3840&q=75)

<-figcaption->

Connecting DynamoDB to Tinybird architecture

## Schema evolution¶

The DynamoDB Connector supports backwards compatible changes made in the source table. This means that, if you add a new column in DynamoDB, the next sync job automatically adds it to the Tinybird Data Source. Non-backwards compatible changes, such as dropping or renaming columns, aren't supported by default and might cause the next sync to fail.

## Considerations on queries¶

The DynamoDB Connector uses the ReplacingMergeTree engine to remove duplicate entries with the same sorting key. Deduplication occurs during a merge, which happens at an unknown time in the background. Doing `SELECT * FROM ddb_ds` might yield duplicated rows after an insertion. To account for this, force the merge at query time by adding `FINAL` to the query. For example, `SELECT * FROM ddb_ds FINAL`. Adding `FINAL` also filters out the rows where `_is_deleted = 1`.

## Override sort and partition keys¶

The DynamoDB Connector automatically sets values for the Sorting Key and the Partition Key properties based on the source DynamoDB table. You might want to override the default values to fit your needs.

To override Sorting and Partition key values, open your .datasource file and edit the values for `ENGINE_PARTITION_KEY` and `ENGINE_SORTING_KEY`. For the Sorting key, you must append the additional columns and leave `pk` and `sk` in place. For example:

ENGINE "ReplacingMergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(toDateTime64(_timestamp, 3))"
ENGINE_SORTING_KEY "pk, sk, "
ENGINE_VER "_timestamp"

The Sorting key is used for deduplication of data. When adding columns to `ENGINE_SORTING_KEY`, make sure they contain the same value across record changes.

You can then push the new .datasource configuration using `tb push`:

tb push updated-ddb.datasource

Don't edit the values for `ENGINE` or `ENGINE_VER`.
The DynamoDB Connector requires the ReplacingMergeTree engine and a version based on the timestamp. ## Limits¶ See [DynamoDB Connector limits](https://www.tinybird.co/docs/docs/support/limits#dynamodb-connector-limits). Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/ingest/events-api Last update: 2024-10-28T11:06:14.000Z Content: --- title: "Events API · Tinybird Docs" theme-color: "#171612" description: "Documentation for the Tinybird Events API" --- # Events API¶ The Events API enables high-throughput streaming ingestion into Tinybird from an easy-to-use HTTP API. This page gives examples of how to use the Events API to perform various tasks. For more information, read the [Events API Reference](https://www.tinybird.co/docs/docs/api-reference/events-api) docs. ## Send individual JSON events¶ You can send individual JSON events to the Events API by including the JSON event in the Request body. For example, to send an individual JSON event using cURL: ##### Sending individual JSON events curl \ -H "Authorization: Bearer " \ -d '{"date": "2020-04-05 00:05:38", "city": "Chicago"}' \ 'https://api.tinybird.co/v0/events?name=events_test' The `name` parameter defines the name of the Data Source in which to insert events. If the Data Source does not exist, Tinybird creates the Data Source by inferring the schema of the JSON. The Token used to send data to the Events API needs the appropriate scopes. To append data to an existing Data Source, the `DATASOURCE:APPEND` scope is required. If the Data Source does not already exist, the `DATASOURCE:CREATE` scope is required to create the new Data Source. ### Define the schema¶ Defining your schema allows you to set data types, sorting key, TTL and more. Read the [schema definition docs here](https://www.tinybird.co/docs/docs/ingest/overview#define-the-schema-yourself). ## Send batches of JSON events¶ Sending batches of events enables you to achieve much higher total throughput than sending individual events. You can send batches of JSON events to the Events API by formatting the events as NDJSON (newline delimited JSON). Each individual JSON event should be separated by a newline ( `\n` ) character. ##### Sending batches of JSON events curl \ -H "Authorization: Bearer " \ -d $'{"date": "2020-04-05 00:05:38", "city": "Chicago"}\n{"date": "2020-04-05 00:07:22", "city": "Madrid"}\n' \ 'https://api.tinybird.co/v0/events?name=events_test' The `name` parameter defines the name of the Data Source in which to insert events. If the Data Source does not exist, Tinybird creates the Data Source by inferring the schema of the JSON. The Token used to send data to the Events API must have the appropriate scopes. To append data to an existing Data Source, the `DATASOURCE:APPEND` scope is required. If the Data Source does not already exist, the `DATASOURCE:CREATE` scope is required to create the new Data Source. ## Limits¶ The Events API delivers a default capacity of: - Up to 1000 requests/second per Data Source - Up to 20MB/s per Data Source - Up to 10MB per request per Data Source Throughput beyond these limits is offered as best-effort. The Events API is able to scale beyond these limits. If you are reaching these limits, contact [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co). 
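If you're backfilling from a large local NDJSON file, one way to stay under the per-request size limit is to split the file into smaller batches and send each one as a separate Events API request. The following is a minimal sketch; the file name, chunk size, and Data Source name are illustrative placeholders.

# split a local NDJSON file into 10,000-line chunks and send each chunk as a
# single Events API request
split -l 10000 events.ndjson chunk_
for f in chunk_*; do
  curl -X POST "https://api.tinybird.co/v0/events?name=events_test" \
    -H "Authorization: Bearer $TB_TOKEN" \
    --data-binary "@$f"
done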
**Rate limit headers** | Header Name | Description | | --- | --- | | `X-RateLimit-Limit` | The maximum number of requests you're permitted to make in the current limit window. | | `X-RateLimit-Remaining` | The number of requests remaining in the current rate limit window. | | `X-RateLimit-Reset` | The time in seconds after the current rate limit window resets. | | `Retry-After` | The time to wait before making another request. Only present on 429 responses. | The Events API is a high-throughput, distributed streaming ingestion system, so the values in these headers are offered as best-effort. ## Compression¶ NDJSON events sent to the Events API can be compressed with Gzip. However, it is only recommended to do this when necessary, such as when you have big events that are grouped into large batches. Compressing events adds overhead to the ingestion process, which can introduce latency, although it is typically minimal. Here is an example of sending a JSON event compressed with Gzip from the command line: echo '{"timestamp":"2022-10-27T11:43:02.099Z","transaction_id":"8d1e1533-6071-4b10-9cda-b8429c1c7a67","name":"Bobby Drake","email":"bobby.drake@pressure.io","age":42,"passport_number":3847665,"flight_from":"Barcelona","flight_to":"London","extra_bags":1,"flight_class":"economy","priority_boarding":false,"meal_choice":"vegetarian","seat_number":"15D","airline":"Red Balloon"}' | gzip > body.gz curl \ -X POST 'https://api.tinybird.co/v0/events?name=gzip_events_example' \ -H "Authorization: Bearer " \ -H "Content-Encoding: gzip" \ --data-binary @body.gz ## Write acknowledgements¶ When you send data to the Events API, you usually receive an `HTTP202` response, which indicates that the request was successful; however, it does not confirm that the data has been committed into the underlying database. This is useful when guarantees on writes are not strictly necessary. Typically, it should take under 2 seconds to receive a response from the Events API in this case. curl \ -X POST 'https://api.tinybird.co/v0/events?name=events_example' \ -H "Authorization: Bearer " \ -d $'{"timestamp":"2022-10-27T11:43:02.099Z"}' < HTTP/2 202 < content-type: application/json < content-length: 42 < {"successful_rows":2,"quarantined_rows":0} However, if your use case requires absolute guarantees that data is committed, use the `wait` parameter. The `wait` parameter is a boolean that accepts a value of `true` or `false` . A value of `false` is the default behavior, equivalent to omitting the parameter entirely. Using `wait=true` with your request will ask the Events API to wait for acknowledgement that the data you sent has been committed into the underlying database. You will receive an `HTTP200` response that confirms data has been committed. Note that adding `wait=true` to your request can result in a slower response time, and we recommend having a time-out of at least 10 seconds when waiting for the response. For example: curl \ -X POST 'https://api.tinybird.co/v0/events?name=events_example&wait=true' \ -H "Authorization: Bearer " \ -d $'{"timestamp":"2022-10-27T11:43:02.099Z"}' < HTTP/2 200 < content-type: application/json < content-length: 42 < {"successful_rows":2,"quarantined_rows":0} It is good practice to log your requests to, and responses from, the Events API. This helps give you visibility into any failures for reporting or recovery.
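The following is a minimal sketch of that logging pattern from a shell, combined with a back-off on the `Retry-After` header when the request is rate limited. The Data Source name, file names, and single-retry policy are illustrative placeholders.

# send a batch with wait=true, log the status and body, and back off using the
# Retry-After header when the request returns HTTP 429
status=$(curl -s -o response.json -D headers.txt -w '%{http_code}' \
  -X POST "https://api.tinybird.co/v0/events?name=events_example&wait=true" \
  -H "Authorization: Bearer $TB_TOKEN" \
  --data-binary @batch.ndjson)
echo "$(date -u +%FT%TZ) status=$status body=$(cat response.json)" >> events_api.log
if [ "$status" = "429" ]; then
  wait_for=$(awk 'tolower($1)=="retry-after:" {print $2}' headers.txt | tr -d '\r')
  sleep "${wait_for:-1}"   # then resend the same request
fi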
--- URL: https://www.tinybird.co/docs/ingest/kafka Last update: 2024-11-11T09:57:52.000Z Content: --- title: "Kafka Connector · Tinybird Docs" theme-color: "#171612" description: "Documentation for the Tinybird Kafka Connector" --- # Kafka Connector¶ Use the Kafka Connector to ingest data streams from your Kafka cluster into Tinybird so that you can quickly turn them into high-concurrency, low-latency REST APIs. The Kafka Connector is fully managed and requires no additional tooling. Connect Tinybird to your Kafka cluster, select a topic, and Tinybird automatically begins consuming messages from Kafka. You can transform or enrich your Kafka topics with JOINs using serverless Data Pipes. Auth tokens control access to API endpoints. Secure connections through AWS PrivateLink or Multi-VPC for MSK are available for Enterprise customers on a Dedicated infrastructure plan. Reach out to support@tinybird.co for more information. ## Prerequisites¶ Grant `READ` permissions to both the Topic and the Consumer Group to ingest data from Kafka into Tinybird. You must secure your Kafka brokers with SSL/TLS and SASL. Tinybird uses `SASL_SSL` as the security protocol for the Kafka consumer. Connections are rejected if the brokers only support `PLAINTEXT` or `SASL_PLAINTEXT`. Kafka Schema Registry is supported only for decoding Avro messages. ## Add a Kafka connection¶ You can create a connection to Kafka using the Tinybird CLI or the UI. ### Using the CLI¶ Run the following commands to add a Kafka connection: ##### Adding a Kafka connection in the main Workspace tb auth # Use the main Workspace admin Token tb connection create kafka --bootstrap-servers --key --secret --connection-name --ssl-ca-pem ### Using the UI¶ Follow these steps to add a new connection using the UI: 1.Go to **Data Project**. 2. Select the **+** icon, then select **Data Source**. 3. Select **Kafka**. 4. Follow the steps to configure the connection. ### Add a CA certificate¶ You can add a CA certificate in PEM format when configuring your Kafka connection from the UI. Tinybird checks the certificate for issues before creating the connection. To add a CA certificate using the Tinybird CLI, pass the `--ssl-ca-pem ` argument to `tb connection create` , where `` is the location or value of the CA certificate. CA certificates don't work with Kafka Sinks and Streaming Queries. #### Aiven Kafka¶ Aiven for Apache Kafka service instances expose multiple SASL ports with two different kinds of SASL certificates: Private CA (self-signed) and Public CA, signed by Let's Encrypt. If you are using the Public CA port, you can connect to Aiven Kafka without any additional configuration. However, if you are using the Private CA port, you need to provide the CA certificate by pointing to the path of the CA certificate file using the `KAFKA_SSL_CA_PEM` setting. ## Update a Kafka connection¶ You can update your credentials or cluster details only from the Tinybird UI. Follow these steps: 1. Go to** Data Project** , select the** +** icon, then select** Data Source** . 2. Select** Kafka** and then the connection you want to edit or delete using the three-dots menu. Any Data Source that depends on this connection is affected by updates. ## Use .datasource files¶ You can configure the Kafka Connector using .datasource files. See the [datafiles documentation](https://www.tinybird.co/docs/docs/cli/datafiles/datasource-files#kafka-confluent-redpanda). 
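The .datasource example below assumes the Kafka connection already exists. If you still need to create it, the following is a minimal sketch of the CLI command described earlier; every value passed to the flags is an illustrative placeholder.

tb auth   # use the main Workspace admin Token
tb connection create kafka \
  --bootstrap-servers my-broker-1:9092,my-broker-2:9092 \
  --key my_sasl_username \
  --secret my_sasl_password \
  --connection-name my_connection_name \
  --ssl-ca-pem ./ca.pem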
The following is an example of a Kafka .datasource file for an already existing connection: ##### Example data source for Kafka Connector SCHEMA > `__value` String, `__topic` LowCardinality(String), `__partition` Int16, `__offset` Int64, `__timestamp` DateTime, `__key` String, `__headers` Map(String,String) ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYYYYMM(__timestamp)" ENGINE_SORTING_KEY "__timestamp" # Connection is already available. If you # need to create one, add the required fields # on an include file with the details. KAFKA_CONNECTION_NAME my_connection_name KAFKA_TOPIC my_topic KAFKA_GROUP_ID my_group_id KAFKA_STORE_HEADERS true To add connection details in an INCLUDE file, see [Use INCLUDE to store connection settings](https://www.tinybird.co/docs/about:blank#use-include-to-store-connection-settings). ### Columns of the Data Source¶ When you connect a Kafka producer to Tinybird, Tinybird consumes optional metadata columns from that Kafka record and writes them to the Data Source. The following fields represent the raw data received from Kafka: - `__value` : A String representing the entire unparsed Kafka record inserted. - `__topic` : The Kafka topic that the message belongs to. - `__partition` : The Kafka partition that the message belongs to. - `__offset` : The Kafka offset of the message. - `__timestamp` : The timestamp stored in the Kafka message received by Tinybird. - `__key` : The key of the Kafka message. - `__headers` : Headers parsed from the incoming topic messages. See [Using custom Kafka headers for advanced message processing](https://www.tinybird.co/blog-posts/using-custom-kafka-headers). Metadata fields are optional. Omit the fields you don't need to reduce your data storage. ### Use INCLUDE to store connection settings¶ To avoid configuring the same connection settings across many files, or to prevent leaking sensitive information, you can store connection details in an external file and use `INCLUDE` to import them into one or more .datasource files. You can find more information about `INCLUDE` in the [Advanced Templates](https://www.tinybird.co/docs/docs/cli/advanced-templates) documentation. For example, you might have two Kafka .datasource files that reuse the same Kafka connection. You can create an include file that stores the Kafka connection details.
The Tinybird project would use the following structure: ##### Tinybird data project file structure ecommerce_data_project/ datasources/ connections/ my_connector_name.incl ca.pem # CA certificate (optional) my_kafka_datasource.datasource another_datasource.datasource endpoints/ pipes/ Where the file `my_connector_name.incl` has the following content: ##### Include file containing Kafka connection details KAFKA_CONNECTION_NAME my_connection_name KAFKA_BOOTSTRAP_SERVERS my_server:9092 KAFKA_KEY my_username KAFKA_SECRET my_password KAFKA_SSL_CA_PEM ca.pem # CA certificate (optional) And the Kafka .datasource files look like the following: ##### Data Source using includes for Kafka connection details SCHEMA > `__value` String, `__topic` LowCardinality(String), `__partition` Int16, `__offset` Int64, `__timestamp` DateTime, `__key` String ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYYYYMM(__timestamp)" ENGINE_SORTING_KEY "__timestamp" INCLUDE "connections/my_connection_name.incl" KAFKA_TOPIC my_topic KAFKA_GROUP_ID my_group_id When using `tb pull` to pull a Kafka Data Source using the CLI, the `KAFKA_KEY`, `KAFKA_SECRET`, `KAFKA_SASL_MECHANISM` and `KAFKA_SSL_CA_PEM` settings aren't included in the file to avoid exposing credentials. ## Iterate a Kafka Data Source¶ The following instructions use Branches. Be sure you're familiar with the behavior of Branches in Tinybird when using the Kafka Connector. See [Prerequisites](https://www.tinybird.co/docs/about:blank#prerequisites). Use Branches to test different Kafka connections and settings. See [Branches](https://www.tinybird.co/docs/docs/concepts/branches). Connections created using the UI are created in the main Workspace, so if you create a new Branch from a Workspace with existing Kafka Data Sources, the Branch Data Sources don't receive that streaming data automatically. Use the CLI to recreate the Kafka Data Source. ### Update a Kafka Data Source¶ When you create a Branch that has existing Kafka Data Sources, the Data Sources in the Branch aren't connected to Kafka. Therefore, if you want to update the schema, you need to recreate the Kafka Data Source in the Branch. In Branches, Tinybird automatically appends `_{BRANCH}` to the Kafka group ID to prevent collisions. It also forces the consumers in Branches to always consume the `latest` messages, to reduce the performance impact. ### Add a new Kafka Data Source¶ To create and test a Kafka Data Source in a Branch, start by using an existing connection. You can create and use existing connections from the Branch using the UI: these connections are always created in the main Workspace. You can create a Kafka Data Source in a Branch as in production. This Data Source doesn't have any connection details internally, so it's useful for testing purposes. Define the connection and the Kafka parameters that are used in production in the .datafile. To move the Data Source to production, include the connection settings in the Data Source .datafile, as explained in the [.datafiles documentation](https://www.tinybird.co/docs/docs/cli/datafiles/datasource-files#kafka-confluent-redpanda). ### Delete a Kafka Data Source¶ If you've created a Data Source in a Branch, the Data Source is active until it's removed from the Branch or the entire Branch is removed. If you delete an existing Kafka Data Source in a Branch, it isn't deleted in the main Workspace. To delete a Kafka Data Source, do it against the main Workspace. You can also use the CLI and include it in the CI/CD workflows as necessary.
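With the include file and the .datasource files above in place, the following is a minimal sketch of pushing them from the project root. The paths follow the example structure, and the same flow applies when recreating a Kafka Data Source in a Branch after pointing the CLI at that Branch.

tb auth                                              # authenticate
tb push datasources/my_kafka_datasource.datasource   # the INCLUDE is resolved when the file is pushed
tb push datasources/another_datasource.datasource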
## Limits¶ The limits for the Kafka connector are: - Minimum flush time: 4 seconds - Throughput (uncompressed): 20MB/s - Up to 3 connections per Workspace If you're regularly hitting these limits, contact [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) for support. ## Troubleshooting¶ ### If you aren't receiving data¶ When Kafka commits an offset for a topic and a group id, it always sends data from the latest committed offset. In Tinybird, each Kafka Data Source receives data from a topic and uses a group id. The combination of `topic` and `group id` must be unique. If you remove a Kafka Data Source and recreate it with the same settings after having received data, you'll only get data from the latest committed offset, even if `KAFKA_AUTO_OFFSET_RESET` is set to `earliest`. This happens both in the main Workspace and in Branches, if you're using them, because connections are always created in the main Workspace and are shared across Branches. Recommended next steps: - Always use a different group id when testing Kafka Data Sources. - Check the `tinybird.kafka_ops_log` [Service Data Source](https://www.tinybird.co/docs/docs/monitoring/service-datasources) to see if you've already used a group id to ingest data from a topic. ### Compressed messages¶ Tinybird can consume from Kafka topics where Kafka compression is enabled, as decompressing the message is a standard function of the Kafka Consumer. If you compressed the message before passing it through the Kafka Producer, Tinybird can't do post-Consumer processing to decompress the message. For example, if you compressed a JSON message through gzip and produced it to a Kafka topic as a `bytes` message, it would be ingested by Tinybird as `bytes` . If you produced a JSON message to a Kafka topic with the Kafka Producer setting `compression.type=gzip` , while it would be stored in Kafka as compressed bytes, it would be decoded on ingestion and arrive in Tinybird as JSON. --- URL: https://www.tinybird.co/docs/ingest/overview Last update: 2024-11-12T11:45:41.000Z Content: --- title: "Ingest data · Tinybird Docs" theme-color: "#171612" description: "Tinybird allows you to ingest your data from a variety of sources, then create Tinybird Data Sources that can be queried, published, materialized, and more." --- ## Ingest your data into Tinybird¶ Tinybird allows you to ingest your data from a variety of sources, then create Tinybird Data Sources that can be queried, published, materialized, and more. ## Get started¶ Pick your Data Source, get connected, and start ingesting *fast*! ## Create your schema¶ Schemas are defined using a specific file type: .datasource files. Read more about them in the [.datasource file docs](https://www.tinybird.co/docs/docs/cli/datafiles/datasource-files). You can choose to define the schema first or send data straight away and let Tinybird infer the schema. ### Define the schema yourself¶ If you want to **define your schema** first, use one of the following methods: #### Option 1: Create an empty Data Source in the UI¶ On the left-hand nav, select **Create new (+)** then "Data Source". Select the "Write schema" option and review the .datasource file that's presented. Update the generated parameters to match your desired schema. Make sure you understand the definition and syntax for the schema, JSONPaths, engine, partition key, sorting key(s), and TTL - these are all essential for efficient data operations later down the line.
The input here is responsive, so if you change the engine, the UI automatically updates with any other definitions you need to provide. Select "Next", review your column names and types, rename the Data Source, and when you're happy with it, select "Create Data Source". #### Option 2: Upload a .datasource file with the desired schema¶ Alternatively, you can define your schema locally in a .datasource file. Drag and drop the file onto the UI and Tinybird will prompt you to import the resource. You can also select the + icon followed by "Write schema", drag and drop the file onto the modal, and continue editing in the UI. #### Option 3: Use the Data Sources API¶ #### Option 4: Use the Tinybird CLI¶ ### Send data and use inferred schema¶ If you want to **send data straight away** and let Tinybird infer the schema, then use one of the many Connectors, upload a CSV, NDJSON, or Parquet file, or send data to the [Events API](https://www.tinybird.co/docs/docs/ingest/events-api#send-individual-json-events). To use the Events API to infer the schema, send a single event over the API and allow Tinybird to create the schema for you automatically. If the schema is incorrect, go to the Data Source schema page in the UI and download the schema as a .datasource file. Edit the file to adjust the schema if required, then drag & drop the file back on to the UI. You cannot override an existing Data Source, so you will have to delete the existing one or rename the new Data Source. ## Supported data types, file types, and compression formats¶ See [Concepts > Data Sources](https://www.tinybird.co/docs/docs/concepts/data-sources#supported-data-types) for more information on supported types and formats. ## Update your schema¶ Realized you have the wrong schema for your Data Source? If your data does not match the schema, it will end up in the [quarantine Data Source](https://www.tinybird.co/docs/docs/guides/ingesting-data/recover-from-quarantine) . You may also want to change your schema for optimization purposes. In both cases, read the [Iterate a Data Source](https://www.tinybird.co/docs/docs/guides/ingesting-data/iterate-a-data-source) guide. ## Limits¶ Check the [limits page](https://www.tinybird.co/docs/docs/support/limits) for limits on ingestion, queries, API Endpoints, and more. ## Looking for inspiration?¶ If you're new to Tinybird and looking to learn a simple flow of ingest data > query it > publish an API Endpoint, check out our [quick start](https://www.tinybird.co/docs/docs/quick-start)! --- URL: https://www.tinybird.co/docs/ingest/postgresql Last update: 2024-11-13T15:42:17.000Z Content: --- title: "PostgreSQL table function · Tinybird Docs" theme-color: "#171612" description: "Documentation for the Tinybird PostgreSQL table function" --- # PostgreSQL table function BETA¶ The Tinybird `postgresql` table function is currently in public beta. The Tinybird `postgresql()` table function allows you to read data from your existing PostgreSQL database into Tinybird, then schedule a regular Copy Pipe to orchestrate synchronization. You can load full tables, and every run performs a full replace on the Data Source. Based on ClickHouse® `postgresql` table function , the Tinybird table function uses all the same syntax, requiring no additional tooling. To use it, define a Node using standard SQL and the `postgresql` function keyword, then publish the Node as a Copy Pipe that does a sync on every run. 
## Set up¶ ### Prerequisites¶ Your postgres database needs to be open and public (exposed to the internet, with publicly-signed certs), so you can connect it to Tinybird via the hostname and port using your username and password. You'll also need familiarity with making cURL requests to [manage your secrets](https://www.tinybird.co/docs/about:blank#about-secrets). ### Type support and inference¶ Since this table functions is based on ClickHouse's `postgresql` table function , Tinybird inherits the same types support and inference. Here's a detailed conversion table: | PostgreSQL Data Type | ClickHouse Data Type | | --- | --- | | BOOLEAN | UInt8 or Bool | | SMALLINT | Int16 | | INTEGER | Int32 | | BIGINT | Int64 | | REAL | Float32 | | DOUBLE PRECISION | Float64 | | NUMERIC or DECIMAL | Decimal(p, s) | | CHAR(n) | FixedString(n) | | VARCHAR (n) | String | | TEXT | String | | BYTEA | String | | TIMESTAMP | DateTime | | TIMESTAMP WITH TIME ZONE | DateTime (with appropriate timezone handling) | | DATE | Date | | TIME | String (since there is no direct TIME type) | | TIME WITH TIME ZONE | String | | INTERVAL | String | | UUID | UUID | | ARRAY | Array(T) where T is the array element type | | JSON | String or JSON (ClickHouse's JSON type for some versions) | | JSONB | String | | INET | String | | CIDR | String | | MACADDR | String | | ENUM | Enum8 or Enum16 | | GEOMETRY | String | Notes: - ClickHouse does not support all PostgreSQL types directly, so some types are mapped to String in ClickHouse, which is the most flexible type for arbitrary data. - For the NUMERIC and DECIMAL types, Decimal(p, s) in ClickHouse requires specifying precision (p) and scale (s). - Time zone support in ClickHouse's DateTime can be managed via additional functions or by ensuring consistent storage and retrieval time zones. - Some types like INTERVAL do not have a direct equivalent in ClickHouse and are usually stored as String or decomposed into separate fields. ## About secrets¶ The Environment Variables API is currently only accessible at API level. UI support will be released in the near future. Pasting your credentials into a Pipe Node or `.datafile` as plain text is a security risk. Instead, use the Environment Variables API to [create two new secrets](https://www.tinybird.co/docs/docs/api-reference/environment-variables-api#post-v0secrets) for your postgres username and password. In the next step, you'll then be ready to interpolate your new secrets using the `tb_secret` function: {{tb_secret('pg_username')}} {{tb_secret('pg_password')}} ## Load a PostgreSQL table¶ In the Tinybird UI, create a new Pipe Node. Call the `postgresql` table function and pass the hostname & port, database, table, user, and password: ##### Example Node logic with actual values SELECT * FROM postgresql( 'aws-0-eu-central-1.TODO.com:3866', 'postgres', 'orders', {{tb_secret('pg_username')}}, {{tb_secret('pg_password')}}, ) Publish this Node as a Copy Pipe, thereby running the query manually. You can choose to append only new data, or replace all data. ### Alternative: Use datafiles¶ As well as using the UI, you can also define Node logic in Pipe `.datafile` files . 
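Before pushing datafiles like the example that follows, make sure the two secrets they interpolate exist in your Workspace. The following is a minimal sketch using the Environment Variables API referenced above; the `name` and `value` form fields are assumptions here, so check that API reference for the exact parameters, and the credential values are placeholders.

# create the two secrets interpolated by the Pipes in this guide (field names
# are an assumption; see the Environment Variables API reference)
curl -X POST "https://api.tinybird.co/v0/secrets" \
  -H "Authorization: Bearer $TB_ADMIN_TOKEN" \
  -d "name=pg_username" -d "value=my_pg_user"
curl -X POST "https://api.tinybird.co/v0/secrets" \
  -H "Authorization: Bearer $TB_ADMIN_TOKEN" \
  -d "name=pg_password" -d "value=my_pg_password"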
An example for an ecommerce `orders_backfill` scenario, with a Node called `all_orders` , would be: NODE all_orders SQL > % SELECT * FROM postgresql( 'aws-0-eu-central-1.TODO.com:3866', 'postgres', 'orders', {{tb_secret('pg_username')}}, {{tb_secret('pg_password')}}, ) TYPE copy TARGET_DATASOURCE orders COPY_SCHEDULE @on-demand COPY_MODE replace ## Include filters¶ You can use a source column in postgres and filter by a value in Tinybird, for example: ##### Example Copy Pipe with postgresql function and filters SELECT * FROM postgresql( 'aws-0-eu-central-1.TODO.com:3866', 'postgres', 'orders', {{tb_secret('pg_username')}}, {{tb_secret('pg_password')}}, ) WHERE orderDate > (select max(orderDate) from orders) ## Schedule runs¶ When publishing as a Copy Pipe, most users set it to run at a frequent interval using a cron expression. It's also possible to trigger it manually: curl -H "Authorization: Bearer " \ -X POST "https://api.tinybird.co/v0/pipes//run" Having manual Pipes in your Workspace is helpful, as you can run a full sync manually any time you need it; sometimes delta updates are not 100% accurate. Some users also leverage them for weekly full syncs. ## Synchronization strategies¶ When copying data from PostgreSQL to Tinybird you can use one of the following strategies: - Use `COPY_MODE replace` to synchronize small dimension tables, up to a few million rows, on a frequent schedule (1 to 5 minutes). - Use `COPY_MODE append` to do incremental appends. For example, you can append events data tagged with a timestamp. Combine it with `COPY_SCHEDULE` and filters in the Copy Pipe SQL to sync the new events. ### Timeouts¶ When synchronizing dimension tables with `COPY_MODE replace` and a 1 minute schedule, the copy job might time out because it can't ingest the whole table in the defined schedule. Timeouts depend on several factors: - The `statement_timeout` configured in PostgreSQL. - The PostgreSQL database load. - Network connectivity, for example when copying data from different cloud regions. Follow these steps to avoid timeouts using incremental appends: 1. Make sure your PostgreSQL dimension rows are tagged with an updated timestamp. Use the column to filter the copy Pipe SQL. In the following example, the column is `updated_at`: CREATE TABLE users ( created_at TIMESTAMPTZ(6) NOT NULL, updated_at TIMESTAMPTZ(6) NOT NULL, name TEXT, user_id TEXT PRIMARY KEY ); 1. Create the target Data Source as a [ReplacingMergeTree](https://www.tinybird.co/docs/docs/guides/querying-data/deduplication-strategies#use-the-replacingmergetree-engine) using a unique or primary key in the Postgres table as the `ENGINE_SORTING_KEY` . Rows with the same `ENGINE_SORTING_KEY` are deduplicated. SCHEMA > `created_at` DateTime64(6), `updated_at` DateTime64(6), `name` String, `user_id` String ENGINE "ReplacingMergeTree" ENGINE_SORTING_KEY "user_id" 1. Configure the Copy Pipe with an incremental append strategy and a 1 minute schedule. That way you make sure only new records in the last minute are ingested, thus optimizing the copy job duration.
NODE copy_pg_users_rmt_0 SQL > % SELECT * FROM postgresql( 'aws-0-eu-central-1.TODO.com:6543', 'postgres', 'users', {{ tb_secret('pg_username') }}, {{ tb_secret('pg_password') }} ) WHERE updated_at > (SELECT max(updated_at) FROM pg_users_rmt)::String TYPE copy TARGET_DATASOURCE pg_users_rmt COPY_MODE append COPY_SCHEDULE * * * * * Optionally, you can create an index in the PostgreSQL table to speed up filtering: -- Create an index on updated_at for faster queries CREATE INDEX idx_updated_at ON users (updated_at); 1. A Data Source with `ReplacingMergeTree` engine deduplicates records based on the sorting key in batch mode. As you can't ensure when deduplication is going to happen, use the `FINAL` keyword when querying the Data Source to force deduplication at query time. SELECT * FROM pg_users FINAL 1. You can combine this approach with an hourly or daily replacement to get rid of deleted rows. Learn about[ how to handle deleted rows](https://www.tinybird.co/docs/docs/guides/querying-data/deduplication-strategies#use-the-replacingmergetree-engine) when using `ReplacingMergeTree` . Learn more about [how to migrate from Postgres to Tinybird](https://www.tinybird.co/docs/docs/guides/migrations/migrate-from-postgres). ## Observability¶ Job executions are logged in the `datasources_ops_log` [Service Data Source](https://www.tinybird.co/docs/docs/monitoring/service-datasources) . This log can be checked directly in the Data Source view page in the UI. Filter by `datasource_id` to monitor ingestion through the PostgreSQL table function from the `datasources_ops_log`: ##### Example query to the datasources\_ops\_log Service Data Source SELECT timestamp, event_type, result, error, job_id FROM tinybird.datasources_ops_log WHERE datasource_id = 't_1234' AND event_type = 'copy' ORDER BY timestamp DESC ## Limits¶ The table function inherits all the [limits of Copy Pipes](https://www.tinybird.co/docs/docs/support/limits#copy-pipe-limits). Secrets are created at a Workspace level, so you will be able to connect one PostgreSQL database per Tinybird Workspace. Check the [limits page](https://www.tinybird.co/docs/docs/support/limits) for limits on ingestion, queries, API Endpoints, and more. ## Billing¶ When set up, this functionality is a Copy Pipe with a query (processed data). There are no additional or specific costs for the table function itself. See the [billing docs](https://www.tinybird.co/docs/docs/support/billing) for more information on data operations and how they're charged. Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/ingest/redpanda Last update: 2024-11-08T11:23:54.000Z Content: --- title: "Redpanda Connector · Tinybird Docs" theme-color: "#171612" description: "Documentation for the Tinybird Redpanda Connector" --- # Redpanda Connector¶ The Redpanda Connector allows you to ingest data from your existing Redpanda cluster and load it into Tinybird so that you can quickly turn them into high-concurrency, low-latency REST APIs. The Redpanda Connector is fully managed and requires no additional tooling. Connect Tinybird to your Redpanda cluster, choose a topic, and Tinybird will automatically begin consuming messages from Redpanda. The Redpanda Connector is: - ** Easy to use** . Connect to your Redpanda cluster in seconds. Choose your topics, define your schema, and ingest millions of events per second into a fully-managed OLAP. - ** SQL-based** . 
Using nothing but SQL, query your Redpanda data and enrich it with dimensions from your database, warehouse, or files. - ** Secure** . Use Auth tokens to control access to API endpoints. Implement access policies as you need. Support for row-level security. Note that you need to grant READ permissions to both the Topic and the Consumer Group to ingest data from Redpanda into Tinybird. ## Using the UI¶ To connect Tinybird to your Redpanda cluster, click the `+` icon next to the data project section on the left navigation menu, select **Data Source** , and select **Redpanda** from the list of available Data Sources. Enter the following details: - ** Connection name** : A name for the Redpanda connection in Tinybird. - ** Bootstrap Server** : The comma-separated list of bootstrap servers (including Port numbers). - ** Key** : The** Key** component of the Redpanda API Key. - ** Secret** : The** Secret** component of the Redpanda API Key. - ** Decode Avro messages with schema registry** : Optionally, you can enable Schema Registry support to decode Avro messages. You will be prompted to enter the Schema Registry URL, username and password. Once you have entered the details, select **Connect** . This creates the connection between Tinybird and Redpanda. You will then see a list of your existing topics and can select the topic to consume from. Tinybird will create a **Group ID** that specifies the name of the consumer group this consumer belongs to. You can customize the Group ID, but ensure that your Group ID has **read** permissions to the topic. Once you have chosen a topic, you can select the starting offset to consume from. You can choose to consume from the **latest** offset or the **earliest** offset. If you choose to consume from the earliest offset, Tinybird will consume all messages from the beginning of the topic. If you choose to consume from the latest offset, Tinybird will only consume messages that are produced after the connection is created. Select the offset, and click **Next**. Tinybird will then consume a sample of messages from the topic and display the schema. You can adjust the schema and Data Source settings as needed, then click **Create Data Source** to create the Data Source. Tinybird will now begin consuming messages from the topic and loading them into the Data Source. ## Using .datasource files¶ If you are managing your Tinybird resources in files, there are several settings available to configure the Redpanda Connector in .datasource files. See the [datafiles docs](https://www.tinybird.co/docs/docs/cli/datafiles/datasource-files#kafka-confluent-redpanda) for more information. The following is an example of a Kafka .datasource file for an already existing connection: ##### Example data source for Redpanda Connector SCHEMA > `__value` String, `__topic` LowCardinality(String), `__partition` Int16, `__offset` Int64, `__timestamp` DateTime, `__key` String, `__headers` Map(String,String) ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYYYYMM(__timestamp)" ENGINE_SORTING_KEY "__timestamp" # Connection is already available. If you # need to create one, add the required fields # on an include file with the details. KAFKA_CONNECTION_NAME my_connection_name KAFKA_TOPIC my_topic KAFKA_GROUP_ID my_group_id KAFKA_STORE_HEADERS true ### Columns of the Data Source¶ When you connect a Kafka producer to Tinybird, Tinybird consumes optional metadata columns from that Kafka record and writes them to the Data Source.
The following fields represent the raw data received from Kafka: - `__value` : A String representing the entire unparsed Kafka record inserted. - `__topic` : The Kafka topic that the message belongs to. - `__partition` : The kafka partition that the message belongs to. - `__offset` : The Kafka offset of the message. - `__timestamp` : The timestamp stored in the Kafka message received by Tinybird. - `__key` : The key of the kafka message. - `__headers` : Headers parsed from the incoming topic messages. See[ Using custom Kafka headers for advanced message processing](https://www.tinybird.co/blog-posts/using-custom-kafka-headers) . Metadata fields are optional. Omit the fields you don't need to reduce your data storage. ### Using INCLUDE to store connection settings¶ To avoid configuring the same connection settings across many files, or to prevent leaking sensitive information, you can store connection details in an external file and use `INCLUDE` to import them into one or more .datasource files. You can find more information about `INCLUDE` in the [Advanced Templates](https://www.tinybird.co/docs/docs/cli/advanced-templates) documentation. As an example, you may have two Redpanda .datasource files, which re-use the same Redpanda connection. You can create an INCLUDE file that stores the Redpanda connection details. The Tinybird project may use the following structure: ##### Tinybird data project file structure ecommerce_data_project/ ├── datasources/ │ └── connections/ │ └── my_connector_name.incl │ └── my_kafka_datasource.datasource │ └── another_datasource.datasource ├── endpoints/ ├── pipes/ Where the file `my_connector_name.incl` has the following content: ##### Include file containing Redpanda connection details KAFKA_CONNECTION_NAME my_connection_name KAFKA_BOOTSTRAP_SERVERS my_server:9092 KAFKA_KEY my_username KAFKA_SECRET my_password And the Redpanda .datasource files look like the following: ##### Data Source using includes for Redpanda connection details SCHEMA > `value` String, `topic` LowCardinality(String), `partition` Int16, `offset` Int64, `timestamp` DateTime, `key` String ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYYYYMM(timestamp)" ENGINE_SORTING_KEY "timestamp" INCLUDE "connections/my_connection_name.incl" KAFKA_TOPIC my_topic KAFKA_GROUP_ID my_group_id When using `tb pull` to pull a Redpanda Data Source using the CLI, the `KAFKA_KEY` and `KAFKA_SECRET` settings will **not** be included in the file to avoid exposing credentials. --- URL: https://www.tinybird.co/docs/ingest/s3 Last update: 2024-11-12T11:45:41.000Z Content: --- title: "S3 Connector · Tinybird Docs" theme-color: "#171612" description: "Bring your S3 data to Tinybird using the S3 Connector." --- # S3 Connector¶ Use the S3 Connector to ingest files from your Amazon S3 buckets into Tinybird so that you can turn them into high-concurrency, low-latency REST APIs. You can load a full bucket or load files that match a pattern. In both cases you can also set an update date from which the files are loaded. With the S3 Connector you can load your CSV, NDJSON, or Parquet files into your S3 buckets and turn them into APIs. Tinybird detects new files in your buckets and ingests them automatically. You can then run serverless transformations using Data Pipes or implement auth tokens in your API Endpoints. ## Prerequisites¶ The S3 Connector requires permissions to access objects in your Amazon S3 bucket. 
The IAM Role needs the following permissions: - `s3:GetObject` - `s3:ListBucket` - `s3:ListAllMyBuckets` The following is an example of AWS Access Policy: When configuring the connector, the UI, CLI and API all provide the necessary policy templates. { "Version": "2012-10-17", "Statement": [ { "Action": [ "s3:GetObject", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::", "arn:aws:s3:::/*" ], "Effect": "Allow" }, { "Sid": "Statement1", "Effect": "Allow", "Action": [ "s3:ListAllMyBuckets" ], "Resource": [ "*" ] } ] } The following is an example trust policy: { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": "sts:AssumeRole", "Principal": { "AWS": "arn:aws:iam::473819111111111:root" }, "Condition": { "StringEquals": { "sts:ExternalId": "ab3caaaa-01aa-4b95-bad3-fff9b2ac789f8a9" } } } ] } ## Supported file types¶ The S3 Connector supports the following file types: | File type | Accepted extensions | Compression formats supported | | --- | --- | --- | | CSV | `.csv` , `.csv.gz` | `gzip` | | NDJSON | `.ndjson` , `.ndjson.gz` | `gzip` | | | `.jsonl` , `.jsonl.gz` | | | | `.json` , `.json.gz` | | | Parquet | `.parquet` , `.parquet.gz` | `snappy` , `gzip` , `lzo` , `brotli` , `lz4` , `zstd` | You can upload files with .json extension, provided they follow the Newline Delimited JSON (NDJSON) format. Each line must be a valid JSON object and every line has to end with a `\n` character. Parquet schemas use the same format as NDJSON schemas, using [JSONPath](https://www.tinybird.co/docs/docs/guides/ingesting-data/ingest-ndjson-data#jsonpaths) syntax. ## S3 file URI¶ Use the full S3 File URI and wildcards to select files. The file extension is required. The S3 Connector supports the following wildcard patterns: | File path | S3 File URI with wildcard | Will match? | | --- | --- | --- | | example.ndjson | `s3://bucket-name/*.ndjson` | Yes | | example.ndjson.gz | `s3://bucket-name/**/*.ndjson.gz` | No | | example.ndjson | `s3://bucket-name/example.ndjson` | Yes | | pending/example.ndjson | `s3://bucket-name/*.ndjson` | Yes | | pending/example.ndjson | `s3://bucket-name/**/*.ndjson` | Yes | | pending/example.ndjson | `s3://bucket-name/pending/example.ndjson` | Yes | | pending/example.ndjson | `s3://bucket-name/pending/*.ndjson` | Yes | | pending/example.ndjson | `s3://bucket-name/pending/**/*.ndjson` | No | | pending/example.ndjson | `s3://bucket-name/**/pending/example.ndjson` | No | | pending/example.ndjson | `s3://bucket-name/customers/pending/example.ndjson` | No | | pending/example.ndjson | `s3://bucket-name/other/example.ndjson` | No | | pending/example.ndjson.gz | `s3://bucket-name/*.csv.gz` | No | | pending/o/inner/example.ndjson | `s3://bucket-name/*.ndjson` | Yes | | pending/o/inner/example.ndjson | `s3://bucket-name/**/*.ndjson` | Yes | | pending/o/inner/example.ndjson | `s3://bucket-name/**/inner/example.ndjson` | Yes | | pending/o/inner/example.ndjson | `s3://bucket-name/**/ex*.ndjson` | Yes | | pending/o/inner/example.ndjson | `s3://bucket-name/**/**/*.ndjson` | Yes | | pending/o/inner/example.ndjson | `s3://bucket-name/pending/**/*.ndjson` | Yes | | pending/o/inner/example.ndjson | `s3://bucket-name/inner/example.ndjson` | No | | pending/o/inner/example.ndjson | `s3://bucket-name/pending/example.ndjson` | No | | pending/o/inner/example.ndjson.gz | `s3://bucket-name/pending/*.ndjson.gz` | No | | pending/o/inner/example.ndjson.gz | `s3://bucket-name/other/example.ndjson.gz` | No | Use the **Preview** step in the UI to see an excerpt of the files. 
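If you prefer to create the IAM resources shown above from a terminal instead of the AWS console, the following is a sketch using the AWS CLI. The policy and role names, the JSON file paths, and the account ID are illustrative placeholders; use the exact policies and values Tinybird provides when you configure the connector.

# save the access policy and trust policy shown above as JSON files, then:
aws iam create-policy \
  --policy-name tinybird-s3-read \
  --policy-document file://access-policy.json
aws iam create-role \
  --role-name tinybird-s3-role \
  --assume-role-policy-document file://trust-policy.json
aws iam attach-role-policy \
  --role-name tinybird-s3-role \
  --policy-arn "arn:aws:iam::<your-account-id>:policy/tinybird-s3-read"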
## Set up the connection¶ You can set up an S3 connection using the UI or the CLI. The steps are as follows: 1. Create a new Data Source in Tinybird. 2. Create the AWS S3 connection. 3. Configure the scheduling options and path/file names. 4. Start ingesting the data. ## Load files using the CLI¶ Before you can load files from Amazon S3 into Tinybird using the CLI, you must create a connection. Creating a connection grants your Tinybird Workspace the appropriate permissions to view files in Amazon S3. To create a connection, you need to use the Tinybird CLI version 3.8.3 or higher. [Authenticate your CLI](https://www.tinybird.co/docs/docs/cli/install#authentication) and switch to the desired Workspace. Follow these steps to create a connection: 1. Run `tb connection create s3_iamrole --policy read` command and press `y` to confirm. 2. Copy the suggested policy and replace the bucket placeholder `` with your bucket name. 3. In AWS, create a new policy in** IAM** ,** Policies (JSON)** using the edited policy. 4. Return to the Tinybird CLI, press `y` , and copy the next policy. 5. In AWS, go to** IAM** ,** Roles** and copy the new custom trust policy. Attach the policy you edited in the previous step. 6. Return to the CLI, press `y` , and paste the ARN of the role you've created in the previous step. 7. Enter the region of the bucket. For example, `us-east-1` . 8. Provide a name for your connection in Tinybird. The `--policy` flag allows to switch between write (sink) and read (ingest) policies. Now that you've created a connection, you can add a Data Source to configure the import of files from Amazon S3. Configure the Amazon S3 import using the following options in your .datasource file: - `IMPORT_SERVICE` : name of the import service to use, in this case, `s3_iamrole` . - `IMPORT_SCHEDULE` : either `@auto` to sync once per minute, or `@on-demand` to only execute manually (UTC). - `IMPORT_STRATEGY` : the strategy used to import data. Only `APPEND` is supported. - `IMPORT_BUCKET_URI` : a full bucket path, including the `s3://` protocol , bucket name, object path and an optional pattern to match against object keys. You can use patterns in the path to filter objects. For example, ending the path with `*.csv` matches all objects that end with the `.csv` suffix. - `IMPORT_CONNECTION_NAME` : name of the S3 connection to use. - `IMPORT_FROM_TIMESTAMP` : (optional) set the date and time from which to start ingesting files. Format is `YYYY-MM-DDTHH:MM:SSZ` . When Tinybird discovers new files, it appends the data to the existing data in the Data Source. Replacing data isn't supported. The following is an example of a .datasource file for S3: ##### s3.datasource file DESCRIPTION > Analytics events landing data source SCHEMA > `timestamp` DateTime `json:$.timestamp`, `session_id` String `json:$.session_id`, `action` LowCardinality(String) `json:$.action`, `version` LowCardinality(String) `json:$.version`, `payload` String `json:$.payload` ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYYYYMM(timestamp)" ENGINE_SORTING_KEY "timestamp" ENGINE_TTL "timestamp + toIntervalDay(60)" IMPORT_SERVICE s3_iamrole IMPORT_CONNECTION_NAME connection_name IMPORT_BUCKET_URI s3://bucket-name/*.csv IMPORT_SCHEDULE @auto IMPORT_STRATEGY APPEND With your connection created and Data Source defined, you can now push your project to Tinybird using: tb push ## Load files using the UI¶ ### 1. Create a new Data Source¶ In Tinybird, go to **Data Sources** and select **Create Data Source**. 
Select **Amazon S3** and enter the bucket name and region, then select **Continue**. ### 2. Create the AWS S3 connection¶ Follow these steps to create the connection: 1. Open the AWS console and navigate to IAM. 2. Create and name the policy using the provided copyable option. 3. Create and name the role with the trust policy using the provided copyable option. 4. Select** Connect** . 5. Paste the connection name and ARN. ### 3. Select the data¶ Select the data you want to ingest by providing the [S3 File URI](https://www.tinybird.co/docs/about:blank#s3-file-uri) and selecting **Preview**. You can also set the ingestion to start from a specific date and time, so that the ingestion process ignores all files added or updated before the set date and time: 1. Select** Ingest since ISO date and time** . 2. Write the desired date or datetime in the input, following the format `YYYY-MM-DDTHH:MM:SSZ` . ### 4. Preview and create¶ The next screen shows a preview of the incoming data. You can review and modify any of the incoming columns, adjust their names, change their types, or delete them. You can also configure the name of the Data Source. After reviewing your incoming data, select **Create Data Source** . On the Data Source details page, you can see the sync history in the tracker chart and the current status of the connection. ## Schema evolution¶ The S3 Connector supports adding new columns to the schema of the Data Source using the CLI. Non-backwards compatible changes, such as dropping, renaming, or changing the type of columns, aren't supported. Any rows from these files are sent to the [quarantine Data Source](https://www.tinybird.co/docs/docs/concepts/data-sources#the-quarantine-data-source). ## Iterate an S3 Data Source¶ To iterate an S3 Data Source, use the Tinybird CLI and the [version control integration](https://www.tinybird.co/docs/docs/production/working-with-version-control) to handle your resources. Create a connection using the CLI: tb auth # use the main Workspace admin Token tb connection create s3_iamrole To iterate an S3 Data Source through a Branch, create the Data Source using a connector that already exists. The S3 Connector doesn't ingest any data, as it isn't configured to work in Branches. To test it on CI, you can directly append the files to the Data Source. After you've merged it and are running CD checks, run `tb datasource sync ` to force the sync in the main Workspace. ## Limits¶ The following limits apply to the S3 Connector: - When using the `auto` mode, execution of imports runs once every minute. - Tinybird ingests a maximum of 5 files per minute. This is a Workspace-level limit, so it's shared across all Data Sources. The following limits apply to maximum file size per type: | File type | Max file size | | --- | --- | | CSV | 10 GB for the Free plan, 32 GB for Pro and Enterprise | | NDJSON | 10 GB for the Free plan, 32 GB for Pro and Enterprise | | Parquet | 1 GB | Check the [limits page](https://www.tinybird.co/docs/docs/support/limits) for limits on ingestion, queries, API Endpoints, and more. To adjust these limits, contact Tinybird at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) or in the [Community Slack](https://www.tinybird.co/docs/docs/community). ## Monitoring¶ You can follow the standard recommended practices for monitoring Data Sources as explained in our [ingestion monitoring guide](https://www.tinybird.co/docs/docs/guides/monitoring/monitor-your-ingestion) . There are specific metrics for the S3 Connector. 
If a sync finishes unsuccessfully, Tinybird adds a new event to `datasources_ops_log`: - If all the files in the sync failed, the event has the `result` field set to `error` . - If some files failed and some succeeded, the event has the `result` field set to `partial-ok` . Failures in syncs are atomic, meaning that if one file fails, no data from that file is ingested. A JSON object with the list of files that failed is included in the `error` field. Some errors can happen before the file list can be retrieved (for instance, an AWS connection failure), in which case there are no files in the `error` field. Instead, the `error` field contains the error message and the files to be retried in the next execution. In scheduled runs, Tinybird retries all failed files in the next executions, so that rate limits or temporary issues don't cause data loss. In on-demand runs, since there is no next execution, truncate the Data Source and sync again. You can distinguish between individual failed files and failed syncs by looking at the `error` field: - If the `error` field contains a JSON object, the sync failed and the object contains the error message with the list of files that failed. - If the `error` field contains a string, a file failed to ingest and the string contains the error message. You can see the file that failed by looking at the `Options.Values` field. For example, you can use the following query to see the sync error messages for the last day: SELECT JSONExtractString(error, 'message') message, * FROM tinybird.datasources_ops_log WHERE datasource_id = '' AND timestamp > now() - INTERVAL 1 day AND message IS NOT NULL ORDER BY timestamp DESC --- URL: https://www.tinybird.co/docs/ingest/snowflake Last update: 2024-11-12T11:45:41.000Z Content: --- title: "Snowflake Connector · Tinybird Docs" theme-color: "#171612" description: "Documentation for how to use the Tinybird Snowflake Connector" --- # Snowflake Connector¶ Use the Snowflake Connector to load data from your existing Snowflake account into Tinybird so that you can quickly turn them into high-concurrency, low-latency REST APIs. The Snowflake Connector is fully managed and requires no additional tooling. You can define a sync schedule inside Tinybird and execution is taken care of for you. With the Snowflake Connector you can: - Start ingesting data instantly from Snowflake using SQL. - Use SQL to query, shape, and join your Snowflake data with other sources. - Use Auth tokens to control access to API endpoints. Implement access policies as you need. Support for row-level security. Snowflake IP filtering isn't supported by the Snowflake Connector. If you need to filter IPs, use the GCS/S3 Connector. ## Load a Snowflake table¶ ### Load a Snowflake table in the UI¶ To add a Snowflake table as a Data Source, follow these steps. #### Create a connection¶ Create a new Data Source using the Snowflake Connector dialog: 1. Open Tinybird and add a new Data Source by selecting** Create new (+)** next to the** Data Sources** section. 2. In the Data Sources dialog, select the Snowflake connector. 3. Enter your Snowflake Account Identifier. To find this, log into Snowflake, find the account info and then copy the Account Identifier. 4. In Tinybird, in the** Connection details** dialog, configure authentication with your Snowflake account. Enter your user password and Account Identifier. 5. Select the role and warehouse to access your data. 6. Copy the SQL snippet from the text box. 
The snippet creates a new Snowflake Storage Integration linking your Snowflake account with a Tinybird staging area for your Workspace. It also grants permission to the given role to create new Stages to unload data from your Snowflake Account into Tinybird. 7. With the SQL query copied, open a new SQL Worksheet inside Snowflake. Paste the SQL into the Worksheet query editor. You must edit the query and replace the `` fragment with the name of your Snowflake database. 8. Select** Run** . The statement must be executed with a Snowflake `ACCOUNTADMIN` role, since Snowflake Integrations operate at Account level and usually need administrator permissions. #### Select the database, table, and schema¶ After running the query, the `Statement executed successfully` message appears. Return to your Tinybird tab to resume the configuration of the Snowflake connector. Set a name for the Snowflake connection and complete this step by selecting **Next**. The Snowflake Connector now has enough permissions to inspect your Snowflake objects available to the given role. Browse the tables available in Snowflake and select the table you wish to load. Start by selecting the database to which the table belongs, then the schema, and the table. Finish by selecting **Next**. Maximum allowed table size is 50 million rows. The result is truncated if it exceeds that limit. #### Configure the schedule¶ You can configure the schedule on which you wish to load data. By default, the frequency is set to **One-off** , which performs a one-time sync of the table. You can change this by selecting a different option from the menu. To configure a schedule that runs a regular sync, select the **Interval** option. You can configure a schedule in minutes, hours, or days by using the menu, and set the value for the schedule in the text field. You can also select whether the sync should run immediately, or if it should wait until the first scheduled sync. The **Replace data** import strategy is selected by default. Finish by selecting **Next**. Maximum allowed frequency is 5 minutes. #### Complete the configuration¶ The final screen of the dialog shows the interpreted schema of the table, which you can change as needed. You can also modify the name of the Data Source in Tinybird. Select **Create Data Source** to complete the process. After you've created the Data Source, a status chart appears showing executions of the loading schedule. The Data Source takes a moment to create the resources required to perform the first sync. When the first sync has completed, a green bar appears indicating the status. Details about the data, such as storage size and number of rows, are shown. You can also see a preview of the data. ### Load a Snowflake table in the CLI¶ To add a Snowflake table as a Data Source using the Tinybird CLI, follow these steps. #### Create a connection¶ You need to create a connection before you can load a table from Snowflake into Tinybird using the CLI. Creating a connection grants your Tinybird Workspace the appropriate permissions to view data from Snowflake. [Authenticate your CLI](https://www.tinybird.co/docs/docs/cli/install#authentication) and switch to the desired Workspace. Then run: tb connection create snowflake The output includes instructions to configure read-only access to your data in Snowflake. Enter your user, password, account identifier, role, warehouse, and a name for the connection. After entering the required information, copy the SQL block that appears.
** Creating a new Snowflake connection at the xxxx workspace. User (must have create stage and create integration in Snowflake): Password: Account identifier: Role (optional): Warehouse (optional): Connection name (optional, current xxxx): Enter this SQL statement in Snowflake using your admin account to create the connection: ------ create storage integration if not exists "tinybird_integration_role" type = external_stage storage_provider = 'GCS' enabled = true comment = 'Tinybird Snowflake Connector Integration' storage_allowed_locations = ('gcs://tinybird-cdk-production-europe-west3/id'); grant create stage on all schemas in database to role ACCOUNTADMIN; grant ownership on integration "tinybird_integration_ACCOUNTADMIN" to role ACCOUNTADMIN; ------ Ready? (y, N): ** Validating connection... ** xxxx.connection created successfully! Connection details saved into the .env file and referenced automatically in your connection file. With the SQL query copied, open a new SQL Worksheet inside Snowflake. Paste the SQL into the Worksheet query editor. You must edit the query and replace the `` fragment with the name of your Snowflake database. Select **Run** . This statement must be executed with a Snowflake `ACCOUNTADMIN` role, since Snowflake Integrations operate at Account level and usually need administrator permissions. The `Statement executed successfully` message appears. Return to your terminal, select **yes** (y) and the connection is created.A new `snowflake.connection` file appears in your project files. The `.connection` file can be safely deleted. #### Create the Data Source¶ After you've created the connection, you can create a Data Source and configure the schedule to import data from Snowflake. The Snowflake import is configured using the following options, which you can add at the end of your .datasource file: - `IMPORT_SERVICE` : Name of the import service to use. In this case, `snowflake` . - `IMPORT_CONNECTION_NAME` : The name given to the Snowflake connection inside Tinybird. For example, `'my_connection'` . - `IMPORT_EXTERNAL_DATASOURCE` : The fully qualified name of the source table in Snowflake. For example, `database.schema.table` . - `IMPORT_SCHEDULE` : A cron expression (UTC) with the frequency to run imports. Must be higher than 5 minutes. For example, `*/5 * * * *` . - `IMPORT_STRATEGY` : The strategy to use when inserting data, either `REPLACE` or `APPEND` . - `IMPORT_QUERY` : (Optional) The SELECT query to extract your data from Snowflake when you don't need all the columns or want to make a transformation before ingestion. The FROM must reference a table using the full scope: `database.schema.table` . Note: For `IMPORT_STRATEGY` only `REPLACE` is supported today. The `APPEND` strategy will be enabled in a future release. 
The following example shows a configured .datasource file for a Snowflake Data Source: ##### snowflake.datasource file DESCRIPTION > Snowflake demo data source SCHEMA > `timestamp` DateTime `json:$.timestamp`, `id` Integer `json:$.id`, `orderid` LowCardinality(String) `json:$.orderid`, `status` LowCardinality(String) `json:$.status`, `amount` Integer `json:$.amount` ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYYYYMM(timestamp)" ENGINE_SORTING_KEY "timestamp" ENGINE_TTL "timestamp + toIntervalDay(60)" IMPORT_SERVICE snowflake IMPORT_CONNECTION_NAME my_snowflake_connection IMPORT_EXTERNAL_DATASOURCE mydb.raw.events IMPORT_SCHEDULE */5 * * * * IMPORT_STRATEGY REPLACE IMPORT_QUERY > select timestamp, id, orderid, status, amount from mydb.raw.events The columns you select in the `IMPORT_QUERY` must match the columns defined in the Data Source schema. For example, if your Data Source has columns `ColumnA, ColumnB` then your `IMPORT_QUERY` must contain `SELECT ColumnA, ColumnB FROM ...` . A mismatch of columns causes data to arrive in the [quarantine Data Source](https://www.tinybird.co/docs/docs/guides/ingesting-data/recover-from-quarantine). #### Push the configuration to Tinybird¶ With your connection created and Data Source defined, you can now push your project to Tinybird using: tb push The first run of the import begins at the next occurrence of the CRON expression. ## Iterate a Snowflake Data Source¶ ### Prerequisites¶ Use the CLI and the version control integration to handle your resources. To use the advantages of version control, connect your Workspace with [your repository](https://www.tinybird.co/docs/production/working-with-version-control#connect-your-workspace-to-git-from-the-cli) , and set the [CI/CD configuration](https://www.tinybird.co/docs/production/continuous-integration). Check the [use case examples](https://github.com/tinybirdco/use-case-examples) repository where you can find basic instructions and examples to handle Snowflake Data Sources iteration using git integration, under the `iterate_snowflake` section. To use the [Tinybird CLI](https://www.tinybird.co/docs/docs/cli/quick-start), check its documentation. For instance, to create the connections in the main-branch Workspace using the CLI: tb auth # use the main Workspace admin Token tb connection create snowflake # these prompts are interactive and will ask you to insert the necessary information You can only create connections in the main Workspace. When creating the connection in a Branch, it's created in the main Workspace and from there is available to every Branch. For testing purposes, use connections that are different from the ones in the main Workspace. ### Add a new Snowflake Data Source¶ You can add a new Data Source directly with the UI or the CLI tool, following [the load of a Snowflake table section](https://www.tinybird.co/docs/about:blank#load-a-snowflake-table). This works for testing purposes, but doesn't carry any connection details. You must add the connection and Snowflake configuration in the .datasource file when moving to production. To add a new Data Source using the recommended version control workflow check the instructions in the [examples repository](https://github.com/tinybirdco/use-case-examples/tree/main/iterate_snowflake). ### Update a Data Source¶ - Snowflake Data Sources can't be modified directly from the UI - When you create a new Tinybird Branch, the existing Snowflake Data Sources won't be connected. You need to re-create them in the Branch.
- In Branches, it's usually useful to work with[ fixtures](https://www.tinybird.co/docs/production/implementing-test-strategies#fixture-tests) , as they'll be applied as part of the CI/CD, allowing the full process to be deterministic in every iteration and avoiding quota consume from external services. Snowflake Data Sources can be modified from the CLI tool: tb auth # modify the .datasource Datafile with your editor tb push --force {datafile} # check the command output for errors To update it using the recommended version control workflow check the instructions in the [examples repository](https://github.com/tinybirdco/use-case-examples/tree/main/iterate_snowflake). ### Delete a Data Source¶ Snowflake Data Sources can be deleted directly from UI or CLI like any other Data Source. To delete it using the recommended version control workflow check the instructions in the [examples repository](https://github.com/tinybirdco/use-case-examples/tree/main/iterate_snowflake). ## Logs¶ Job executions are logged in the `datasources_ops_log` [Service Data Source](https://www.tinybird.co/docs/docs/monitoring/service-datasources) . You can check this log directly in the Data Source view page in the UI. Filter by `datasource_id` to monitor ingestion through the Snowflake Connector from the `datasources_ops_log`: SELECT timestamp, event_type, result, error, job_id FROM tinybird.datasources_ops_log WHERE datasource_id = 't_1234' AND event_type = 'replace' ORDER BY timestamp DESC ## Schema evolution¶ The Snowflake Connector supports backwards compatible changes made in the source table. This means that, if you add a new column in Snowflake, the next sync job automatically adds it to the Tinybird Data Source. Non-backwards compatible changes, such as dropping or renaming columns, aren't supported and might cause the next sync to fail. ## Limits¶ See [Snowflake Connector limits](https://www.tinybird.co/docs/docs/support/limits#snowflake-connector-limits). --- URL: https://www.tinybird.co/docs/integrations Last update: 2024-10-10T10:03:16.000Z Content: --- title: "Integrations · Tinybird Docs" theme-color: "#171612" description: "Tinybird allows you to create Data Sources that ingest data from many integrations." --- # Integrations¶ You can create Data Sources that ingest data into Tinybird from many different integrations. You can also send data to Tinybird from your application using the [Events API](https://www.tinybird.co/docs/docs/ingest/events-api). Looking for an integration and don’t see it? [Contact us](https://www.tinybird.co/contact-us) or join our [Slack community](https://www.tinybird.co/docs/docs/community) to provide feedback. 
## Core Integrations¶ ## Streaming Data Sources - ### Apache Kafka [ Kafka Connector](https://www.tinybird.co/docs/ingest/kafka) ,[ Screencast](https://www.tinybird.co/docs/screencasts?video=create-rest-apis-from-kafka-streams-in-minutes) ,[ Free Training](https://www.tinybird.co/docs/live/from-kafka-to-analytics-april-2024) - ### Confluent Cloud [ Confluent Connector](https://www.tinybird.co/docs/ingest/confluent) ,[ Screencast](https://www.tinybird.co/docs/screencasts?video=publish-apis-over-confluent-streams) ,[ Screencast: User-Facing Apps](https://www.tinybird.co/docs/screencasts?video=build-real-time-user-facing-applications-with-tinybird-and-confluent) ,[ Tutorial: User-Facing Dashboards](https://www.tinybird.co/docs/use-cases/user-facing-dashboards) ,[ Tutorial: Web Analytics](https://www.tinybird.co/docs/use-cases/web-analytics) - ### Google Pub/Sub [ Guide](https://www.tinybird.co/docs/guides/ingesting-data/ingest-from-google-pubsub) - ### AWS Kinesis [ Guide](https://www.tinybird.co/docs/guides/ingesting-data/ingest-from-aws-kinesis) - ### Redpanda [ Redpanda Connector](https://www.tinybird.co/docs/ingest/redpanda) ## Data Warehouses and Data Lakes - ### Google BigQuery [ BigQuery Connector](https://www.tinybird.co/docs/ingest/bigquery) ,[ Screencast](https://www.tinybird.co/docs/screencasts?video=sync-bigquery-tables-with-the-bigquery-connector) ,[ Tutorial: User-Facing Dashboards](https://www.tinybird.co/docs/use-cases/user-facing-dashboards) ,[ Tutorial: Web Analytics](https://www.tinybird.co/docs/use-cases/web-analytics) - ### Snowflake [ Snowflake Connector](https://www.tinybird.co/docs/ingest/snowflake) ,[ Guide](https://www.tinybird.co/docs/guides/ingesting-data/ingest-from-snowflake-via-unloading) ,[ Screencast](https://www.tinybird.co/docs/screencasts?video=publish-real-time-apis-over-snowflake-data) ## Relational Databases - ### Postgres Table Function [ Postgres Table Function](https://www.tinybird.co/docs/ingest/postgresql) ,[ Launch Blog Post](https://www.tinybird.co/blog-posts/postgresql-table-function-announcement) ## Document Databases - ### Amazon DynamoDB [ DynamoDB Connector](https://www.tinybird.co/docs/ingest/dynamodb) ,[ Guide](https://www.tinybird.co/docs/guides/ingesting-data/ingest-from-dynamodb) - ### MongoDB [ Guide](https://www.tinybird.co/docs/guides/ingesting-data/ingest-from-mongodb) ## Files and Object Storage - ### Amazon S3 [ Amazon S3 Connector](https://www.tinybird.co/docs/ingest/s3) ,[ Amazon S3 Sink](https://www.tinybird.co/docs/publish/s3-sink) - ### Google Cloud Storage [ Guide](https://www.tinybird.co/docs/guides/ingesting-data/ingest-from-google-gcs) - ### CSV Files [ Guide](https://www.tinybird.co/docs/guides/ingesting-data/ingest-from-csv-files) --- URL: https://www.tinybird.co/docs/monitoring/health-checks Last update: 2024-11-13T14:29:47.000Z Content: --- title: "Health checks · Tinybird Docs" theme-color: "#171612" description: "Tinybird is built around the idea of data that changes or grows continuously. Use the built-in Tinybird tools to monitor your data ingestion and API Endpoint processes." --- # Health checks¶ Tinybird is built around the idea of data that changes or grows continuously. Use the built-in Tinybird tools to monitor your data ingestion and API Endpoint processes. ## Data Source health¶ Once you have fixed all the possible errors in your source files, matched the Data Source schema to your needs and done the on-the-fly transformations (if needed), at some point, you'll start ingesting data periodically. 
Knowing the status of your ingestion processes will be key. ### Data Sources Log¶ From the 'Data Sources log' in your Dashboard, you can check whether there are new rows in quarantine, if jobs are failing or if there is any other problem. ![The Data Sources log shows you the operations performed on your data](/docs/_next/image?url=%2Fdocs%2Fimg%2Fon-the-fly-health-checks-1.png&w=3840&q=75) In addition to the tools we provide in our User Interface (UI), there are powerful tools that you can use for advanced monitoring. ### Operations Log¶ By clicking on an individual Data Source in the left-hand panel you can see the size of the Data Source, the number of rows, the number of rows in the [quarantine Data Source](https://www.tinybird.co/docs/docs/guides/ingesting-data/recover-from-quarantine) (if any) and when it was last updated. The Operations log contains details of the events for the Data Source, which are displayed as the results of the query. ## Service Data Sources for continuous monitoring¶ [Service Data Sources](https://www.tinybird.co/docs/docs/monitoring/service-datasources) can help you with ingestion health checks. They can be used like any other Data Source in your Workspace, which means you can create API Endpoints to monitor your ingestion processes. Querying the 'tinybird.datasources_ops_log' directly, you can, for example, list your ingest processes during the last week: ##### LISTING INGESTIONS IN THE LAST 7 DAYS SELECT * FROM tinybird.datasources_ops_log WHERE toDate(timestamp) > now() - INTERVAL 7 DAY ORDER BY timestamp DESC This query calculates the percentage of failed operations and of quarantined rows for a given period of time: ##### CALCULATE % OF ROWS THAT WENT TO QUARANTINE SELECT countIf(result != 'ok') / count() * 100 percentage_failed, sum(rows_quarantine) / sum(rows) * 100 quarantined_rows FROM tinybird.datasources_ops_log This query monitors the average duration of your periodic ingestion processes for a given Data Source: ##### CALCULATING AVERAGE INGEST DURATION SELECT avg(elapsed_time) avg_duration FROM tinybird.datasources_ops_log WHERE datasource_id = 't_8417d5126ed84802aa0addce7d1664f2' If you want to configure or build an external service that monitors these metrics, you just need to create an API Endpoint and raise an alert when passing a threshold. When you receive an alert, you can check the quarantine Data Source or the Operations log to see what's going on and fix your source files or ingestion processes. ## Monitoring API Endpoints¶ You can use the [Service Data Sources](https://www.tinybird.co/docs/docs/monitoring/service-datasources) 'pipe_stats' and 'pipe_stats_rt' to monitor the performance of your API Endpoints. Every request to a Pipe is logged to 'tinybird.pipe_stats_rt' and kept in this Data Source for the last week. This example API Endpoint aggregates the statistics for each hour for the selected Pipe. ##### PIPE\_STATS\_RT\_BY\_HR SELECT toStartOfHour(start_datetime) as hour, count() as view_count, round(avg(duration), 2) as avg_time, arrayElement(quantiles(0.50)(duration),1) as quantile_50, arrayElement(quantiles(0.95)(duration),1) as quantile_95, arrayElement(quantiles(0.99)(duration),1) as quantile_99 FROM tinybird.pipe_stats_rt WHERE pipe_id = 'PIPE_ID' GROUP BY hour ORDER BY hour 'pipe_stats' contains statistics about your Pipe Endpoints' API calls aggregated per day using intermediate states.
##### PIPE\_STATS\_BY\_DATE SELECT date, sum(view_count) view_count, sum(error_count) error_count, avgMerge(avg_duration_state) avg_time, quantilesTimingMerge(0.9, 0.95, 0.99)(quantile_timing_state) quantiles_timing_in_millis_array FROM tinybird.pipe_stats WHERE pipe_id = 'PIPE_ID' GROUP BY date ORDER BY date API Endpoints such as these can be used to raise alerts for further investigation whenever statistics pass certain thresholds. To see how Pipes and Data Sources health can be monitored in a dashboard have a look at the blog [Operational Analytics in Real Time with Tinybird and Retool](https://www.tinybird.co/blog-posts/service-data-sources-and-retool). --- URL: https://www.tinybird.co/docs/monitoring/jobs Last update: 2024-07-31T16:54:46.000Z Content: --- title: "Monitor jobs in your Workspace · Tinybird Docs" theme-color: "#171612" description: "Many of the operations you can run in your Workspace are executed using jobs. jobs_log provides you with an overview of all your jobs." --- # Monitor jobs in your Workspace¶ ## What are jobs?¶ Many operations in your Tinybird Workspace, like Imports, Copy Jobs, Sinks, and Populates, are executed as **background jobs** within the platform. When you trigger these operations via the Tinybird API, they are queued and processed asynchronously on Tinybird's infrastructure. This means the API request itself completes quickly, while the actual operation runs in the background and finishes slightly later. This approach ensures that the system can handle a large volume of requests efficiently without causing timeouts or delays in your workflow. Monitoring and managing these jobs (for instance, querying job statuses, types, and execution details) is essential for maintaining a healthy Workspace. The two mechanisms for generic job monitoring are the [Jobs API](https://www.tinybird.co/docs/docs/api-reference/jobs-api) and the `jobs_log` Data Source. The Jobs API and the `jobs_log` return identical information about job execution. However, the Jobs API has some limitations: It reports only on a single Workspace, returns only 100 records, from the last 48 hours. If you want to monitor jobs outside these parameters, use the `jobs_log` Data Source. You can also track more specific things using dedicated Service Data Sources, such as `datasources_ops_log` for import, replaces, or copy, or `sinks_ops_log` for sink operations, or tracking jobs across [Organizations](https://www.tinybird.co/docs/docs/monitoring/organizations) with `organization.jobs_log` . Read the [Service Data Sources docs](https://www.tinybird.co/docs/docs/monitoring/service-datasources) for more. ## Track a specific job¶ ### Jobs API¶ The Jobs API is a convenient way to programmatically check the status of a job. By sending an HTTP GET request, you can retrieve detailed information about a specific job. This method is particularly useful for integration into scripts or applications. curl \ -X GET "https://$TB_HOST/v0/jobs/{job_id}" \ -H "Authorization: Bearer $TOKEN" Replace `{job_id}` with the actual job ID. Replace the Tinybird API hostname/region with the [right API URL region](https://www.tinybird.co/docs/docs/api-reference/overview#regions-and-endpoints) that matches your Workspace. Your Token lives in the Workspace under "Tokens". ### SQL¶ Alternatively, you can use SQL to query the `jobs_log` Data Source from directly within a Tinybird Pipe. 
This method is ideal for users who are comfortable with SQL and prefer to run queries directly against the data, and then expose them with an Endpoint or perform any other actions with it. SELECT * FROM tinybird.jobs_log WHERE job_id='{job_id}' Replace `{job_id}` with the actual job ID. This query retrieves all columns for the specified job, providing comprehensive details about its execution. ## Track specific job types¶ Tracking jobs by type allows you to monitor and analyze all jobs of a certain category, such as all `copy` jobs. This can help you understand the performance and status of specific job types across your entire Workspace. ### Jobs API¶ Using the Jobs API, fetch all jobs of a specific type by making an HTTP GET request: curl \ -X GET "https://$TB_HOST/v0/jobs?kind=copy" \ -H "Authorization: Bearer $TOKEN" Replace `copy` with the type of job you want to track. Ensure you have set your Tinybird host ( `$TB_HOST` ) and authorization token ( `$TOKEN` ) correctly. ### SQL¶ Alternatively, run an SQL query to fetch all jobs of a specific type from the `jobs_log` Data Source: SELECT * FROM tinybird.jobs_log WHERE job_type='copy' Replace `copy` with the desired job type. This query retrieves all columns for jobs of the specified type. ## Track ongoing jobs¶ To keep track of jobs that are currently running, you can query the status of jobs in progress. This helps in monitoring the real-time workload and managing system performance. ### Jobs API¶ By making an HTTP GET request to the Jobs API, you can fetch all jobs that are currently in the `working` status: curl \ -X GET "https://$TB_HOST/v0/jobs?status=working" \ -H "Authorization: Bearer $TOKEN" This command retrieves jobs that are actively running. Ensure you have set your Tinybird host ( `$TB_HOST` ) and authorization token ( `$TOKEN` ) correctly. ### SQL¶ You can also use an SQL query to fetch currently running jobs from the `jobs_log` Data Source: SELECT * FROM tinybird.jobs_log WHERE status='working' This query retrieves all columns for jobs with the status `working` , allowing you to monitor ongoing operations. ## Track errored jobs¶ Tracking errored jobs is crucial for identifying and resolving issues that may arise during job execution. Jobs API and/or SQL queries to `jobs_log` will help you to efficiently monitor jobs that errored during the execution. ### Jobs API¶ The Jobs API allows you to programmatically fetch details of jobs that have ended in error. Use the following `curl` command to retrieve all jobs that have a status of `error`: curl \ -X GET "https://$TB_HOST/v0/jobs?status=error" \ -H "Authorization: Bearer $TOKEN" This command fetches a list of jobs that are currently in an errored state, providing details that can be used for further analysis or debugging. Ensure you've set your Tinybird host ( `$TB_HOST` ) and authorization token ( `$TOKEN` ) correctly. ### SQL¶ Alternatively, you can use SQL to query the `jobs_log` Data Source directly. 
Use the following SQL query to fetch job IDs, job types, and error messages for jobs that have encountered errors in the past day: SELECT job_id, job_type, error FROM tinybird.jobs_log WHERE status='error' AND created_at > now() - INTERVAL 1 DAY ### Track success rate¶ Extrapolating from errored jobs, you can also use `jobs_log` to calculate the success rate of your Workspace jobs: SELECT job_type, pipe_id, countIf(status='done') AS job_success, countIf(status='error') AS job_error, job_success / (job_success + job_error) as success_rate FROM tinybird.jobs_log WHERE created_at > now() - INTERVAL 1 DAY GROUP BY job_type, pipe_id ## Get job execution metadata¶ In the `jobs_log` Data Source, there is a property called `job_metadata` that contains metadata related to job executions. This includes the execution type (manual or scheduled) for Copy and Sink jobs, or the count of quarantined rows for Append operations, along with many other properties. You can extract and analyze this metadata using JSON functions within SQL queries. This allows you to gain valuable information about job executions directly from the `jobs_log` Data Source. The following SQL query is an example of how to extract specific metadata fields from the `job_metadata` property, such as the import mode and counts of quarantined rows and invalid lines, and how to aggregate this data for analysis: SELECT job_type, JSONExtractString(job_metadata, 'mode') AS import_mode, sum(simpleJSONExtractUInt(job_metadata, 'quarantine_rows')) AS quarantine_rows, sum(simpleJSONExtractUInt(job_metadata, 'invalid_lines')) AS invalid_lines FROM tinybird.jobs_log WHERE job_type='import' AND created_at >= toStartOfDay(now()) GROUP BY job_type, import_mode There are many other use cases you can put together with the properties in the `job_metadata` ; see below. ## Advanced use cases¶ Beyond basic tracking, you can leverage the `jobs_log` Data Source for more advanced use cases, such as gathering statistics and performance metrics. This can help you optimize job scheduling and resource allocation. ### Get queue status¶ The following SQL query returns the number of jobs that are waiting to be executed, the number of jobs that are in progress, and how many of them are done already: SELECT job_type, countIf(status='waiting') AS jobs_in_queue, countIf(status='working') AS jobs_in_progress, countIf(status='done') AS jobs_succeeded, countIf(status='error') AS jobs_errored FROM tinybird.jobs_log WHERE created_at > now() - INTERVAL 1 DAY GROUP BY job_type ### Get statistics on run time grouped by type of job¶ The following SQL query calculates the maximum, minimum, median, and p95 running time (in seconds) grouped by type of job (e.g. import, copy, sinks) over the past day. This helps in understanding the efficiency of different job types: SELECT job_type, max(date_diff('s', started_at, updated_at)) as max_run_time_in_secs, min(date_diff('s', started_at, updated_at)) as min_run_time_in_secs, median(date_diff('s', started_at, updated_at)) as median_run_time_in_secs, quantile(0.95)(date_diff('s', started_at, updated_at)) as p95_run_time_in_secs FROM tinybird.jobs_log WHERE created_at > now() - INTERVAL 1 DAY GROUP BY job_type ### Get statistics on queue time by type of job¶ The following SQL query calculates the average queue time (in seconds) for a specific type of job (e.g., copy) over the past day. 
This can help in identifying bottlenecks in job scheduling: SELECT job_type, max(date_diff('s', created_at, started_at)) as max_queue_time_in_secs, min(date_diff('s', created_at, started_at)) as min_queue_time_in_secs, median(date_diff('s', created_at, started_at)) as median_queue_time_in_secs, quantile(0.95)(date_diff('s', created_at, started_at)) as p95_queue_time_in_secs FROM tinybird.jobs_log WHERE created_at > now() - INTERVAL 1 DAY GROUP BY job_type ### Get statistics on job completion rate¶ The following SQL query calculates the success rate by type of job (e.g., copy) and Pipe over the past day. This can help you assess the reliability and efficiency of your workflows by measuring the completion rate of the jobs, and find potential issues and areas for improvement: SELECT job_type, pipe_id, countIf(status='done') AS job_success, countIf(status='error') AS job_error, job_success / (job_success + job_error) as success_rate FROM tinybird.jobs_log WHERE created_at > now() - INTERVAL 1 DAY GROUP BY job_type, pipe_id ### Get statistics on the amount of manual vs. scheduled job runs¶ The following SQL query compares the number of manual and scheduled job runs. Understanding the distribution of manually executed jobs versus scheduled jobs can help you spot on-demand runs triggered outside the regular schedule: SELECT job_type, countIf(JSONExtractString(job_metadata, 'execution_type')='manual') AS job_manual, countIf(JSONExtractString(job_metadata, 'execution_type')='scheduled') AS job_scheduled FROM tinybird.jobs_log WHERE job_type='copy' AND created_at > now() - INTERVAL 1 DAY GROUP BY job_type ## Next steps¶ - Read up on the `jobs_log` Service Data Source specification. - Learn how to[ monitor your Workspace ingestion](https://www.tinybird.co/docs/docs/guides/monitoring/monitor-your-ingestion) . --- URL: https://www.tinybird.co/docs/monitoring/latency Last update: 2024-11-06T17:38:37.000Z Content: --- title: "How to measure API Endpoint latency · Tinybird Docs" theme-color: "#171612" description: "Latency is an essential metric to monitor in real-time applications. In this guide, you'll learn how to measure and monitor the latency of your API Endpoints in Tinybird." --- # Latency¶ Latency is an essential metric to monitor in real-time applications. This page explains how latency is measured in Tinybird, and how to monitor and visualize the latency of your [API Endpoints](https://www.tinybird.co/docs/publish/api-endpoints/overview) when data is being retrieved. ## What is latency?¶ Latency is the time it takes for a request to travel from the client to the server and back; the time it takes for a request to be sent and received. Latency is usually measured in seconds or milliseconds (ms). The lower the latency, the faster the response time. ## How latency is measured¶ When measuring latency in an end-to-end application, you consider data ingestion, data transformation, and data retrieval. In Tinybird, thanks to the exceptional speed of ClickHouse® ingestion and real-time Materialized Views, the data freshness is guaranteed. Putting this all together: In Tinybird, latency is measured as the time it takes for a request to be sent, processed in ClickHouse (very fast), and the response to be sent back to the client. When calling an API Endpoint, you can check this metric defined as `elapsed` in the `statistics` object of the response: ##### Statistics object within an example Tinybird API Endpoint call { "meta": [ ... ], "data": [ ...
], "rows": 10, "statistics": { "elapsed": 0.001706275, "rows_read": 10, "bytes_read": 180 } } ## How to monitor latency¶ To monitor the latency of your API Endpoints, use the `pipe_stats_rt` and `pipe_stats` [Service Data Sources](https://www.tinybird.co/docs/docs/monitoring/service-datasources): - `pipe_stats_rt` consists of the real-time statistics of your API Endpoints, and has a `duration` field that encapsulates the latency time in seconds. - `pipe_stats` contains the** aggregated** statistics of your API Endpoints by date, and presents a `avg_duration_state` field which is the average duration of the API Endpoint by day in seconds. Because the `avg_duration_state` field is an intermediate state, you'd need to merge it when querying the Data Source using something like `avgMerge`. More for details on building Pipes and Endpoints that monitor the performance of your API Endpoints using the `pipe_stats_rt` and `pipe_stats` Data Sources, follow the [API Endpoint performance guide](https://www.tinybird.co/docs/docs/guides/monitoring/analyze-endpoints-performance#example-2-analyzing-the-performance-of-api-endpoints-over-time). ## How to visualize latency¶ Visualizing the latency of your API Endpoints can make it easier to see the at-a-glance overview. Tinybird has built-in tools to help you do this: Time Series is a quick option for output that lives internally in your Workspace, and Charts give you the option to embed in an external application. ### Time Series¶ In your Workspace, you can create [a Time Series](https://www.tinybird.co/docs/docs/query/overview#time-series) to visualize the latency of your API Endpoints over time. You just need to point to `pipe_stats_rt` and select `duration` and `start_datetime` , or point to or `pipe_stats` and select `avgMerge(avg_duration_state)` and `date`. <-figure-> ![Example Tinybird Time Series showing pipe\_stats\_analysis](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Ftime-series-latency.png&w=3840&q=75) ### Charts¶ If you want to expose your latency metrics in your own application, you can use a Tinybird-generated [Chart](https://www.tinybird.co/docs/docs/publish/charts) to expose the results of an Endpoint that queries the `pipe_stats_rt` or `pipe_stats` Data Source. Then, you can embed the Chart into your application by using the `iframe` code. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fcharts-latency.png&w=3840&q=75) ## Next steps¶ - Optimize even further by[ monitoring your ingestion](https://www.tinybird.co/docs/docs/guides/monitoring/monitor-your-ingestion) . - Read this blog on[ Monitoring global API latency](https://www.tinybird.co/blog-posts/dev-qa-global-api-latency-chronark) . Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/monitoring/organizations Last update: 2024-11-13T11:01:06.000Z Content: --- title: "Organizations · Tinybird Docs" theme-color: "#171612" description: "Tinybird Organizations provide enterprise customers with a single pane of glass to monitor usage across multiple Workspaces." --- # Monitor your organization in Tinybird¶ The Tinybird organizations feature is only available to Enterprise or Dedicated plan customers (see ["Tinybird plans"](https://www.tinybird.co/docs/docs/plans) ). Use the organizations section to monitor their consolidated Tinybird usage in one place. 
The section provides a single dashboard for the overview of all Workspaces within an organization, and unlocks specific Workspace-level actions. It includes dashboards for monitoring consumption of storage and processed data, Workspaces, and organization members. Usage data is also available through an HTTP API. The organizations section consists of the following areas: - Overview - Workspaces - Members - Monitoring ## Access the organizations section¶ To access the organizations UI, log in as an administrator and select your Workspace name, then select your organization name. To add another user as an organization administrator, follow these steps: 1. Navigate to the** Your organization** page. 2. Go to the** Members** section. 3. Locate the user you want to make an administrator. 4. Select** Organization Admin** next to their name. This grants administrator access to the selected users. ## Usage overview¶ The **Usage** page shows details about your platform usage against your billing plan commitment followed by a detailed breakdown of your consumption. Only billable Workspaces are in this view. Find non-billable Workspaces in the **Workspaces** tab. ### Processed data¶ The first metric shows an aggregated summary of your processed data. This is aggregated across all billable Workspaces included in your plan. Processed data is cumulative over the plan's billing period. ### Storage¶ The second metric shows an aggregated summary of your current storage. This is aggregated across all billable Workspaces included in your plan. Storage is the maximum storage used in the past day. ### Contract¶ The third metric shows the details of your current contract, including the plan type and start/end dates of your plan period. If your plan includes usage limits, for example commitments on an Enterprise plan, your commitment details are also shown here. For both the **Processed data** and **Storage** metrics, the summary covers the current billing period. For **Enterprise** plans this covers the term of your current contract. For monthly plans, it's the current month. After the summary, the page shows a breakdown of Processed data and Storage per Workspace and Data Source. The calculation of these metrics is the same as previously explained for the summary section, but on an individual basis. ### Organization Service Data Sources¶ The Charts in the Overview page get their data from Organization Service Data Sources. The complete list of available ones is: - `organization.workspaces` : lists all Organization Workspaces and related information (name, IDs, databases, plan, when it was created, and whether it has been soft-deleted). - `organization.processed_data` : information related to all processed data per day per workspace. - `organization.datasources_storage` : equivalent to tinybird.datasources_storage but with data for all Organization Workspaces. - `organization.pipe_stats` : equivalent to tinybird.pipe_stats but with data for all Organization Workspaces. - `organization.pipe_stats_rt` : equivalent to tinybird.pipe_stats_rt but with data for all Organization Workspaces. - `organization.datasources_ops_log` : equivalent to tinybird.datasources_ops_log but with data for all Organization Workspaces. - `organization.data_transfer` : equivalent to tinybird.data_transfer but with data for all Organization Workspaces. - `organization.jobs_log` : equivalent to tinybird.jobs_log but with data for all Organization Workspaces. 
- `organization.sinks_ops_log` : equivalent to tinybird.sinks_ops_log but with data for all Organization Workspaces. - `organization.bi_stats` : equivalent to tinybird.bi_stats but with data for all Organization Workspaces. - `organization.bi_stats_rt` : equivalent to tinybird.bi_stats_rt but with data for all Organization Workspaces. Only Organization Admins are able to run these queries. To query these Organization Service Data Sources, go to any Workspace that belongs to the Organization, and use these as you would a regular Service Data Source from the Playground or within Pipes. Use the admin `user@domain` Token of an Organization Admin. You can also copy your user admin Token and make queries using your preferred method, like `tb sql`. ## Workspaces¶ This page displays details of all your Workspaces, their consumption, and whether they're billable or not. Using the date range selector at the top of the page, you can adjust the time of the data displayed in the table. The table shows the following information: - ** Workspace name** - ** Processed data** : Processed data is cumulative over the selected time range. - ** Storage** : Storage is the maximum storage used on the last day of the selected time range. - ** Plan type** : Billable or free. Usage in a billable Workspace counts towards your billing plan. New Workspaces that are created by a user with an email domain linked to (or matching) an Organization are automatically added to that Organization. The new Workspace then automatically shows up here in your Organization's Consumption metrics and listed Workspaces. If you encounter any challenges with creating a new Workspace in your Organization, contact [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co). To delete Workspace, select the checkbox of a Workspace name, followed by the **Delete** button. You don't need to be a user in that Workspace to delete it. ## Members¶ **Members** shows details of your Organization members, the Workspaces they belong to, and their roles. User roles: - ** Admins** can do everything in the Workspace. - ** Guests** can do most things, but they can't delete Workspaces, invite or remove users, or share Data Sources across Workspaces. - ** Viewers** can't edit anything in the main Workspace[ Branch](https://www.tinybird.co/docs/docs/concepts/branches) , but they can use[ Playgrounds](https://www.tinybird.co/docs/docs/query/overview#use-the-playground) to query the data, as well as create or edit Branches. The table shows the following information: - Email - Workspaces and roles To view the detail of a member’s Workspaces and roles, select the arrow next to the Workspace count. A menu shows all the Workspaces that user is part of, plus their role in each Workspace. To change a user’s role or remove them from a Workspace, hover over the Workspace name and follow the arrow. Select a new role from **Admin**, **Guest** , or **Viewer** , or remove them from the Workspace. You don't need to be a user in that Workspace to make changes to its users. As mentioned, you can also make a user an organization admin from this page. To remove a user from the organization, select **Remove member** in the menu. You can see if there are Workspaces where that user is the only admin and if the Token associated to the email has had activity in the last 7 days. ## Monitoring endpoints¶ To monitor the usage of your Organization use the [Organization Service Data Sources](https://www.tinybird.co/docs/about:blank#organization-service-data-sources). 
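For example, the following sketch queries `organization.pipe_stats_rt` from the Playground of any Workspace in the Organization, using an Organization Admin Token, to summarize API Endpoint traffic across all Workspaces over the last day. It assumes the Organization-level Data Source exposes the same columns as `tinybird.pipe_stats_rt` (such as `start_datetime`, `pipe_name`, `duration`, `error`, and `read_bytes`), as described in the Service Data Sources docs; adapt it to your own needs:

##### ENDPOINT TRAFFIC ACROSS ALL WORKSPACES (LAST 24 HOURS)

-- Assumes organization.pipe_stats_rt mirrors the tinybird.pipe_stats_rt schema
SELECT
    pipe_name,
    count() AS requests,
    countIf(error = 1) AS error_count,
    round(avg(duration), 3) AS avg_duration_s,
    sum(read_bytes) AS total_read_bytes
FROM organization.pipe_stats_rt
WHERE start_datetime > now() - INTERVAL 1 DAY
GROUP BY pipe_name
ORDER BY requests DESC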
The endpoints page shows details about the APIs that allow you to export, or integrate external tools with, your usage data. There are two APIs available: ### Processed data¶ The first API shows a daily aggregated summary of your processed data per Workspaces. ### Storage¶ The last API shows a daily aggregated summary of your current storage per Workspaces. You can select the time by editing the parameters in the URL. **Processed data** | Field | Type | Description | | --- | --- | --- | | day | DateTime | Day of the record | | workspace_id | String | ID of the Workspace. | | read_bytes | UInt64 | Bytes read in the Workspace that day | | written_bytes | UInt64 | Bytes written in the Workspace that day | **Storage** | Field | Type | Description | | --- | --- | --- | | day | DateTime | Day of the record | | workspace_id | String | ID of the Workspace. | | bytes | UInt64 | Maximum Bytes stored in the Workspace that day | | bytes_quarantine | UInt64 | Maximum Bytes stored in the Workspace quarantine that day | ## Dedicated infrastructure monitoring¶ The following features are in public beta and may change without notice. If you have feedback or suggestions, contact Tinybird at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) or in the [Community Slack](https://www.tinybird.co/docs/docs/community). If your organization is on an infrastructure commitment plan, Tinybird offers two ways of monitoring the state of your dedicated clusters: using the `organization.metrics_logs` service Data Source, or through the Prometheus endpoint `/v0/metrics` , which you can integrate with the observability platform of your choice. ### Billing dashboard¶ You can track your credits usage from the **Billing** section under **Your organization** . The dashboard shows your cumulative credits usage and estimated trend against the total, and warns you if you're about to run out of credits. For more details, you can access your customer portal using the direct link. <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fcredits-usage.png&w=3840&q=75) ### Cluster load chart¶ You can check the current load of your clusters using the chart under **Your organization**, **Usage** . Select a cluster in the menu to see all its hosts, then select a time. Each line represents the CPU usage of the host. When you select a host, the dotted line represents the total amount of CPUs available to the cluster. <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fcluster-load-chart.png&w=3840&q=75) ### metrics\_logs Service Data Source¶ The `metrics_logs` Service Data Source is available in all the organization's workspaces. As with the rest of Organization Service Data Sources, it's only available to Organization administrators. New records for each of the metrics monitored are added every minute with the following schema: | Field | Type | Description | | --- | --- | --- | | timestamp | DateTime | Timestamp of the metric | | cluster | LowCardinality(String) | Name of the cluster | | host | LowCardinality(String) | Name of the host | | metric | LowCardinality(String) | Name of the metric | | value | String | Value of the metric | | description | LowCardinality(String) | Description of the metric | | organization_id | String | ID of your organization | The available metrics are the following: | Metric | Description | | --- | --- | | MemoryTracking | Total amount of memory, in bytes, allocated by the server. 
| | OSMemoryTotal | The total amount of memory on the host system, in bytes. | | InstanceType | Instance type of the host . | | Query | Number of executing queries. | | NumberCPU | Number of CPUs. | | LoadAverage1 | The whole system load, averaged with exponential smoothing over 1 minute. The load represents the number of threads across all the processes (the scheduling entities of the OS kernel), that are currently running by CPU or waiting for IO, or ready to run but not being scheduled at this point of time. This number includes all the processes, not only clickhouse-server. The number can be greater than the number of CPU cores, if the system is overloaded, and many processes are ready to run but waiting for CPU or IO. | | LoadAverage15 | The whole system load, averaged with exponential smoothing over 15 minutes. The load represents the number of threads across all the processes (the scheduling entities of the OS kernel), that are currently running by CPU or waiting for IO, or ready to run but not being scheduled at this point of time. This number includes all the processes, not only clickhouse-server. The number can be greater than the number of CPU cores, if the system is overloaded, and many processes are ready to run but waiting for CPU or IO. | | CPUUsage | The ratio of time the CPU core was running OS kernel (system) code or userspace code. This is a system-wide metric, it includes all the processes on the host machine, not just clickhouse-server. This includes also the time when the CPU was under-utilized due to the reasons internal to the CPU (memory loads, pipeline stalls, branch mispredictions, running another SMT core). | ### Prometheus metrics endpoint¶ The Prometheus endpoint is available at `/v0/metrics` . Use your organization's observability Token, which you can find in the **Prometheus** tab under **Monitoring**, **Your organization**. Tinybird reports all the metrics available in the service Data Source, except for `InstanceType` , which is reported as a metric label. #### Refreshing your organization's Observability Token¶ If your organization's observability Token gets compromised or is lost, refresh it using the following endpoint: `/v0/organizations//tokens/Observability%20%28builtin%29/refresh?token=` You must use your `user token` for this call, which you can copy from any of your workspaces. Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/monitoring/service-datasources Last update: 2024-11-18T21:34:21.000Z Content: --- title: "Service Data Sources · Tinybird Docs" theme-color: "#171612" description: "In addition to the Data Sources you upload, Tinybird provides other "Service Data Sources" that allow you to inspect what's going on in your account." --- # Service Data Sources¶ Tinybird provides Service Data Sources that you can use to inspect what's going on in your Tinybird account, diagnose issues, monitor usage, and so on. For example, you can get real time stats about API calls or a log of every operation over your Data Sources. This is similar to using system tables in a database, although Service Data Sources contain information about the usage of the service itself. Queries made to Service Data Sources are free of charge and don't count towards your usage. However, calls to API Endpoints that use Service Data Sources do count towards API rate limits. [Read the billing docs](https://www.tinybird.co/docs/docs/support/billing). 
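As a quick illustration, the following minimal query lists the most recent operations on your Data Sources using `tinybird.datasources_ops_log` (all of its columns are described later on this page). You can run it from a Pipe or the Playground, or publish it as an API Endpoint:

##### LAST OPERATIONS ON YOUR DATA SOURCES

-- Most recent operations first; columns are documented in the tinybird.datasources_ops_log section below
SELECT
    timestamp,
    event_type,
    datasource_name,
    result,
    elapsed_time,
    error
FROM tinybird.datasources_ops_log
ORDER BY timestamp DESC
LIMIT 20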
## Considerations¶ - You can't use Service Data Sources in Materialized View queries. - Pass dynamic[ query parameters](https://www.tinybird.co/docs/docs/query/query-parameters#leverage-dynamic-parameters) to API Endpoints to then query Service Data Sources. - You can only query Organization-level Service Data Sources if you're an administrator. See[ Consumption overview](https://www.tinybird.co/docs/docs/monitoring/organizations#consumption-overview) . ## Service Data Sources¶ The following Service Data Sources are available. ### tinybird.pipe\_stats\_rt¶ Contains information about all requests made to your [API Endpoints](https://www.tinybird.co/docs/docs/publish/api-endpoints/overview) in real time. This Data Source has a TTL of 7 days. If you need to query data older than 7 days you must use the aggregated by day data available at [tinybird.pipe_stats](https://www.tinybird.co/docs/about:blank#tinybird-pipe-stats). | Field | Type | Description | | --- | --- | --- | | `start_datetime` | `DateTime` | API call start date and time. | | `pipe_id` | `String` | Pipe Id as returned in our[ Pipes API](https://www.tinybird.co/docs/docs/api-reference/pipe-api/overview) ( `query_api` in case it is a Query API request). | | `pipe_name` | `String` | Pipe name as returned in our[ Pipes API](https://www.tinybird.co/docs/docs/api-reference/pipe-api/overview) ( `query_api` in case it is a Query API request). | | `duration` | `Float` | API call duration in seconds. | | `read_bytes` | `UInt64` | API call read data in bytes. | | `read_rows` | `UInt64` | API call rows read. | | `result_rows` | `UInt64` | Rows returned by the API call. | | `url` | `String` | URL ( `token` param is removed for security reasons). | | `error` | `UInt8` | `1` if query returned error, else `0` . | | `request_id` | `String` | API call identifier returned in `x-request-id` header. Format is ULID string. | | `token` | `String` | API call token identifier used. | | `token_name` | `String` | API call token name used. | | `status_code` | `Int32` | API call returned status code. | | `method` | `String` | API call method POST or GET. | | `parameters` | `Map(String, String)` | API call parameters used. | | `release` | `String` | Semantic version of the release (deprecated). | | `user_agent` | `Nullable(String)` | User Agent HTTP header from the request. | | `resource_tags` | `Array(String)` | Tags associated with the Pipe when the request was made. | ### tinybird.pipe\_stats¶ Aggregates the request stats in [tinybird.pipe_stats_rt](https://www.tinybird.co/docs/about:blank#tinybird-pipe-stats-rt) by day. | Field | Type | Description | | --- | --- | --- | | `date` | `Date` | Request date and time. | | `pipe_id` | `String` | Pipe Id as returned in our[ Pipes API](https://www.tinybird.co/docs/docs/api-reference/pipe-api/overview) . | | `pipe_name` | `String` | Name of the Pipe. | | `view_count` | `UInt64` | Request count. | | `error_count` | `UInt64` | Number of requests with error. | | `avg_duration_state` | `AggregateFunction(avg, Float32)` | Average duration state in seconds (see[ Querying _state columns](https://www.tinybird.co/docs/about:blank#querying-state-columns) ). | | `quantile_timing_state` | `AggregateFunction(quantilesTiming(0.9, 0.95, 0.99), Float64)` | 0.9, 0.95 and 0.99 quantiles state. Time in milliseconds (see[ Querying _state columns](https://www.tinybird.co/docs/about:blank#querying-state-columns) ). | | `read_bytes_sum` | `UInt64` | Total bytes read. | | `read_rows_sum` | `UInt64` | Total rows read. 
| | `resource_tags` | `Array(String)` | All the tags associated with the resource when the aggregated requests were made. | ### tinybird.bi\_stats\_rt¶ Contains information about all requests to your [BI Connector interface](https://www.tinybird.co/docs/docs/query/bi-connector) in real time. This Data Source has a TTL of 7 days. If you need to query data older than 7 days you must use the aggregated by day data available at [tinybird.bi_stats](https://www.tinybird.co/docs/about:blank#tinybird-bi-stats). | Field | Type | Description | | --- | --- | --- | | `start_datetime` | `DateTime` | Query start timestamp. | | `query` | `String` | Executed query. | | `query_normalized` | `String` | Normalized executed query. This is the pattern of the query, without literals. Useful to analyze usage patterns. | | `error_code` | `Int32` | Error code, if any. `0` on normal execution. | | `error` | `String` | Error description, if any. Empty otherwise. | | `duration` | `UInt64` | Query duration in milliseconds. | | `read_rows` | `UInt64` | Read rows. | | `read_bytes` | `UInt64` | Read bytes. | | `result_rows` | `UInt64` | Total rows returned. | | `result_bytes` | `UInt64` | Total bytes returned. | ### tinybird.bi\_stats¶ Aggregates the stats in [tinybird.bi_stats_rt](https://www.tinybird.co/docs/about:blank#tinybird-bi-stats-rt) by day. | Field | Type | Description | | --- | --- | --- | | `date` | `Date` | Stats date. | | `database` | `String` | Database identifier. | | `query_normalized` | `String` | Normalized executed query. This is the pattern of the query, without literals. Useful to analyze usage patterns. | | `view_count` | `UInt64` | Requests count. | | `error_count` | `UInt64` | Error count. | | `avg_duration_state` | `AggregateFunction(avg, Float32)` | Average duration state in milliseconds (see[ Querying _state columns](https://www.tinybird.co/docs/about:blank#querying-state-columns) ). | | `quantile_timing_state` | `AggregateFunction(quantilesTiming(0.9, 0.95, 0.99), Float64)` | 0.9, 0.95 and 0.99 quantiles state. Time in milliseconds (see[ Querying _state columns](https://www.tinybird.co/docs/about:blank#querying-state-columns) ). | | `read_bytes_sum` | `UInt64` | Total bytes read. | | `read_rows_sum` | `UInt64` | Total rows read. | | `avg_result_rows_state` | `AggregateFunction(avg, Float32)` | Total bytes returned state (see[ Querying _state columns](https://www.tinybird.co/docs/about:blank#querying-state-columns) ). | | `avg_result_bytes_state` | `AggregateFunction(avg, Float32)` | Total rows returned state (see[ Querying _state columns](https://www.tinybird.co/docs/about:blank#querying-state-columns) ). | ### tinybird.block\_log¶ The Data Source contains details about how Tinybird ingests data into your Data Sources. You can use this Service Data Source to spot problematic parts of your data. | Field | Type | Description | | --- | --- | --- | | `timestamp` | `DateTime` | Date and time of the block ingestion. | | `import_id` | `String` | Id of the import operation. | | `job_id` | `Nullable(String)` | Id of the job that ingested the block of data, if it was ingested by URL. In this case, `import_id` and `job_id` must have the same value. | | `request_id` | `String` | Id of the request that performed the operation. In this case, `import_id` and `job_id` must have the same value. Format is ULID string. | | `source` | `String` | Either the URL or `stream` or `body` keywords. | | `block_id` | `String` | Block identifier. 
You can cross this with the `blocks_ids` column from the[ tinybird.datasources_ops_log](https://www.tinybird.co/docs/about:blank#tinybird-datasources-ops-log) Service Data Source. | | `status` | `String` | `done` | `error` . | | `datasource_id` | `String` | Data Source consistent id. | | `datasource_name` | `String` | Data Source name when the block was ingested. | | `start_offset` | `Nullable(Int64)` | The starting byte of the block, if the ingestion was split, where this block started. | | `end_offset` | `Nullable(Int64)` | If split, the ending byte of the block. | | `rows` | `Nullable(Int32)` | How many rows it ingested. | | `parser` | `Nullable(String)` | Whether the native block parser or falling back to row by row parsing is used. | | `quarantine_lines` | `Nullable(UInt32)` | If any, how many rows went into the quarantine Data Source. | | `empty_lines` | `Nullable(UInt32)` | If any, how many empty lines were skipped. | | `bytes` | `Nullable(UInt32)` | How many bytes the block had. | | `processing_time` | `Nullable(Float32)` | How long it took in seconds. | | `processing_error` | `Nullable(String)` | Detailed message in case of error. | When Tinybird ingests data from a URL, it splits the download in several requests, resulting in different ingestion blocks. The same happens when the data upload happens with a multipart request. ### tinybird.datasources\_ops\_log¶ Contains all operations performed to your Data Sources. Tinybird tracks the following operations: | Event | Description | | | --- | --- | | | `create` | A Data Source is created. | | | `sync-dynamodb` | Initial synchronization from a DynamoDB table when using the[ DynamoDB Connector](https://www.tinybird.co/docs/docs/ingest/dynamodb) | | | `append` | Append operation. | | | `append-hfi` | Append operation using the[ High-frequency Ingestion API](https://www.tinybird.co/docs/docs/guides/ingesting-data/ingest-from-the-events-api) . | | | `append-kafka` | Append operation using the[ Kafka Connector](https://www.tinybird.co/docs/docs/ingest/kafka) . | | | `append-dynamodb` | Append operation using the[ DynamoDB Connector](https://www.tinybird.co/docs/docs/ingest/dynamodb) | | | `replace` | A replace operation took place in the Data Source. | | | `delete` | A delete operation took place in the Data Source. | | | `truncate` | A truncate operation took place in the Data Source. | | | `rename` | The Data Source was renamed. | | | `populateview-queued` | A populate operation was queued for execution. | | | `populateview` | A finished populate operation (up to 8 hours after it started). | | | `copy` | A copy operation took place in the Data Source. | | | `alter` | An alter operation took place in the Data Source. | | | `resource_tags` | `Array(String)` | Tags associated with the Pipe when the request was made. | Materializations are logged with same `event_type` and `operation_id` as the operation that triggers them. You can track the materialization Pipe with `pipe_id` and `pipe_name`. Tinybird logs all operations with the following information in this Data Source: | Field | Type | Description | | --- | --- | --- | | `timestamp` | `DateTime` | Date and time when the operation started. | | `event_type` | `String` | Operation being logged. | | `operation_id` | `String` | Groups rows affected by the same operation. Useful for checking materializations triggered by an append operation. | | `datasource_id` | `String` | Id of your Data Source. The Data Source id is consistent after renaming operations. 
You should use the id when you want to track name changes. | | `datasource_name` | `String` | Name of your Data Source when the operation happened. | | `result` | `String` | `ok` | `error` | | `elapsed_time` | `Float32` | How much time the operation took in seconds. | | `error` | `Nullable(String)` | Detailed error message if the result was error. | | `import_id` | `Nullable(String)` | Id of the import operation, if data has been ingested using one of the following operations: `create` , `append` or `replace` | | `job_id` | `Nullable(String)` | Id of the job that performed the operation, if any. If data has been ingested, `import_id` and `job_id` must have the same value. | | `request_id` | `String` | Id of the request that performed the operation. If data has been ingested, `import_id` and `request_id` must have the same value. Format is ULID string. | | `rows` | `Nullable(UInt64)` | How many rows the operations affected. This depends on `event_type` : for the `append` event, how many rows got inserted; for `delete` or `truncate` events, how many rows the Data Source had; for `replace` , how many rows the Data Source has after the operation. | | `rows_quarantine` | `Nullable(UInt64)` | How many rows went into the quarantine Data Source, if any. | | `blocks_ids` | `Array(String)` | List of blocks ids used for the operation. See the[ tinybird.block_log](https://www.tinybird.co/docs/about:blank#tinybird-block-log) Service Data Source for more details. | | `options` | `Nested(Names String, Values String)` | Tinybird stores key-value pairs with extra information for some operations. For the `replace` event, Tinybird uses the `rows_before_replace` key to track how many rows the Data Source had before the replacement happened, the `replace_condition` key shows what condition was used. For `append` and `replace` events, Tinybird stores the data `source` , for example the URL, or body/stream keywords. For `rename` event, `old_name` and `new_name` . For `populateview` you can find there the whole populate `job` metadata as a JSON string. For `alter` events, Tinybird stores `operations` , and dependent pipes as `dependencies` if they exist. | | `read_bytes` | `UInt64` | Read bytes in the operation. | | `read_rows` | `UInt64` | Read rows in the operation. | | `written_rows` | `UInt64` | Written rows in the operation. | | `written_bytes` | `UInt64` | Written bytes in the operation. | | `written_rows_quarantine` | `UInt64` | Quarantined rows in the operation. | | `written_bytes_quarantine` | `UInt64` | Quarantined bytes in the operation. | | `pipe_id` | `String` | If present, materialization Pipe id as returned in our[ Pipes API](https://www.tinybird.co/docs/docs/api-reference/pipe-api/overview) . | | `pipe_name` | `String` | If present, materialization Pipe name as returned in our[ Pipes API](https://www.tinybird.co/docs/docs/api-reference/pipe-api/overview) . | | `release` | `String` | Semantic version of the release (deprecated). | ### tinybird.datasource\_ops\_stats¶ Data from `datasource_ops_log` , aggregated by day. | Field | Type | Description | | | | | --- | --- | --- | | | | | `event_date` | `Date` | Date of the event. | | | | | `workspace_id` | `String` | Unique identifier for the Workspace. | | | | | `event_type` | `String` | Name of your Data Source. | | | | | `pipe_id` | `String` | Identifier of the Pipe. | | | | | `pipe_name` | `String` | Name of the Pipe. | | | | | `error_count` | `UInt64` | Number of requests with an error. | | | | | `executions` | `UInt64` | Number of executions. 
| `avg_elapsed_time_state` | `Float32` | Average elapsed time state (see[ Querying _state columns](https://www.tinybird.co/docs/about:blank#querying-state-columns) ). | | `quantiles_state` | `Float32` | 0.9, 0.95 and 0.99 quantiles state. Time in milliseconds (see[ Querying _state columns](https://www.tinybird.co/docs/about:blank#querying-state-columns) ). | | `read_bytes` | `UInt64` | Read bytes in the operation. | | `read_rows` | `UInt64` | Read rows in the operation. | | `written_rows` | `UInt64` | Written rows in the operation. | | `written_bytes` | `UInt64` | Written bytes in the operation. | | `written_rows_quarantine` | `UInt64` | Quarantined rows in the operation. | | `written_bytes_quarantine` | `UInt64` | Quarantined bytes in the operation. | | `resource_tags` | `Array(String)` | Tags associated with the Pipe when the request was made. | ### tinybird.endpoint\_errors¶ Contains the errors of your published endpoints for the last 30 days. Tinybird logs all errors with additional information in this Data Source. | Field | Type | Description | | --- | --- | --- | | `start_datetime` | `DateTime` | Date and time when the API call started. | | `request_id` | `String` | The id of the request that performed the operation. Format is ULID string. | | `pipe_id` | `String` | If present, Pipe id as returned in our[ Pipes API](https://www.tinybird.co/docs/docs/api-reference/pipe-api/overview) . | | `pipe_name` | `String` | If present, Pipe name as returned in our[ Pipes API](https://www.tinybird.co/docs/docs/api-reference/pipe-api/overview) . | | `params` | `Nullable(String)` | URL query params included in the request. | | `url` | `Nullable(String)` | URL pathname. | | `status_code` | `Nullable(Int32)` | HTTP error code. | | `error` | `Nullable(String)` | Error message. | | `resource_tags` | `Array(String)` | Tags associated with the Pipe when the request was made. | ### tinybird.kafka\_ops\_log¶ Contains all operations performed to your [Kafka Data Sources](https://www.tinybird.co/integrations/kafka-data) during the last 30 days. | Field | Type | Description | | --- | --- | --- | | `timestamp` | `DateTime` | Date and time when the operation took place. | | `datasource_id` | `String` | Id of your Data Source. The Data Source id is consistent after renaming operations. You should use the id when you want to track name changes. | | `topic` | `String` | Kafka topic. | | `partition` | `Int16` | Partition number, or `-1` for all partitions. | | `msg_type` | `String` | 'info' for regular messages, 'warning' for issues related to the user's Kafka cluster, deserialization or Materialized Views, and 'error' for other issues. | | `lag` | `Int64` | Number of messages behind for the partition. This is the difference between the high-water mark and the last commit offset. | | `processed_messages` | `Int32` | Messages processed for a topic and partition. | | `processed_bytes` | `Int32` | Amount of bytes processed. | | `committed_messages` | `Int32` | Messages ingested for a topic and partition. | | `msg` | `String` | Information in the case of warnings or errors. Empty otherwise. | ### tinybird.datasources\_storage¶ Contains stats about your Data Sources storage. Tinybird logs maximum values per hour, the same as when it calculates storage consumption. | Field | Type | Description | | --- | --- | --- | | `datasource_id` | `String` | Id of your Data Source. The Data Source id is consistent after renaming operations.
You should use the id when you want to track name changes. | | `datasource_name` | `String` | Name of your Data Source. | | `timestamp` | `DateTime` | When storage was tracked. By hour. | | `bytes` | `UInt64` | Max number of bytes the Data Source has, not including quarantine. | | `rows` | `UInt64` | Max number of rows the Data Source has, not including quarantine. | | `bytes_quarantine` | `UInt64` | Max number of bytes the Data Source has in quarantine. | | `rows_quarantine` | `UInt64` | Max number of rows the Data Source has in quarantine. | ### tinybird.releases\_log (deprecated)¶ Contains operations performed to your releases. Tinybird tracks the following operations: | Event | Description | | --- | --- | | `init` | First Release is created on Git sync. | | `override` | Release commit is overridden. `tb init --override-commit {{commit}}` . | | `deploy` | Resources from a commit are deployed to a Release. | | `preview` | Release status is changed to preview. | | `promote` | Release status is changed to live. | | `post` | Resources from a commit are deployed to the live Release. | | `rollback` | Rollback is done a previous Release is now live. | | `delete` | Release is deleted. | Tinybird logs all operations with additional information in this Data Source. | Field | Type | Description | | `timestamp` | `DateTime64` | Date and time when the operation took place. | | `event_type` | `String` | Name of your Data Source. | | `semver` | `String` | Semantic version identifies a release. | | `commit` | `String` | Git sha commit related to the operation. | | `token` | `String` | API call token identifier used. | | `token_name` | `String` | API call token name used. | | `result` | `String` | `ok` | `error` | | `error` | `String` | Detailed error message | ### tinybird.sinks\_ops\_log¶ Contains all operations performed to your Sink Pipes. | Field | Type | Description | | `timestamp` | `DateTime64` | Date and time when the operation took place. | | `service` | `LowCardinality(String)` | Type of Sink (GCS, S3, and so on) | | `pipe_id` | `String` | The ID of the Sink Pipe. | | `pipe_name` | `String` | the name of the Sink Pipe. | | `token_name` | `String` | Token name used. | | `result` | `LowCardinality(String)` | `ok` | `error` | | `error` | `Nullable(String)` | Detailed error message | | `elapsed_time` | `Float64` | The duration of the operation in seconds. | | `job_id` | `Nullable(String)` | ID of the job that performed the operation, if any. | | `read_rows` | `UInt64` | Read rows in the Sink operation. | | `written_rows` | `UInt64` | Written rows in the Sink operation. | | `read_bytes` | `UInt64` | Read bytes in the operation. | | `written_bytes` | `UInt64` | Written bytes in the operation. | | `output` | `Array(String)` | The outputs of the operation. In the case of writing to a bucket, the name of the written files. | | `parameters` | `Map(String, String)` | The parameters used. Useful to debug the parameter query values. | | `options` | `Map(String, String)` | Extra information. You can access the values with `options['key']` where key is one of: file_template, file_format, file_compression, bucket_path, execution_type. | ### tinybird.data\_transfer¶ Stats of data transferred per hour by a Workspace. | Field | Type | Description | | `timestamp` | `DateTime` | Date and time data transferred is tracked. By hour. | | `event` | `LowCardinality(String)` | Type of operation generated the data (ie. `sink` ) | | `origin_provider` | `LowCardinality(String)` | Provider data was transferred from. 
| | `origin_region` | `LowCardinality(String)` | Region data was transferred from. | | `destination_provider` | `LowCardinality(String)` | Provider data was transferred to. | | `destination_region` | `LowCardinality(String)` | Region data was transferred to. | | `kind` | `LowCardinality(String)` | `intra` | `inter` depending if the data moves within or outside the region. | ### tinybird.jobs\_log¶ Contains all job executions performed in your Workspace. Tinybird logs all jobs with extra information in this Data Source: | Field | Type | Description | | --- | --- | --- | | `job_id` | `String` | Unique identifier for the job. | | `job_type` | `LowCardinality(String)` | Type of job execution. `delete_data` , `import` , `populateview` , `query` , `copy` , `copy_from_main` , `copy_from_branch` , `data_branch` , `deploy_branch` , `regression_tests` , `sink` , `sink_from_branch` . | | `workspace_id` | `String` | Unique identifier for the Workspace. | | `pipe_id` | `String` | Unique identifier for the Pipe. | | `pipe_name` | `String` | Name of the Pipe. | | `created_at` | `DateTime` | Timestamp when the job was created. | | `updated_at` | `DateTime` | Timestamp when the job was last updated. | | `started_at` | `DateTime` | Timestamp when the job execution started. | | `status` | `LowCardinality(String)` | Current status of the job. `waiting` , `working` , `done` , `error` , `cancelled` . | | `error` | `Nullable(String)` | Detailed error message if the result was error. | | `job_metadata` | `JSON String` | Additional metadata related to the job execution. | Learn more about how to track background jobs execution in the [Jobs monitoring guide](https://www.tinybird.co/docs/docs/monitoring/jobs). ## Use resource\_tags to better track usage¶ You can use tags that you've added to your resources, like Pipes or Data Sources, to analyze usage and cost attribution across your organization. For example, you can add tags for projects, environments, or versions and compare usage in later queries to Service Data Sources such as [tinybird.datasources_ops_stats](https://www.tinybird.co/docs/about:blank#tinybird-datasources-ops-stats) , which aggregates operations data by day. The following Service Data Sources support `resource_tags`: - `pipe_stats_rt` - `pipe_stats` - `endpoint_errors` - `organization.pipe_stats_rt` - `organization.pipe_stats` - `datasources_ops_log` - `datasources_ops_stats` - `organization.datasources_ops_log` - `organization.datasources_ops_stats` To add tags to resources, see [Organizing resources in Workspaces](https://www.tinybird.co/docs/docs/production/organizing-resources). ## Query \_state columns¶ Several of the Service Data Sources include columns suffixed with `_state` . This suffix identifies columns with values that are in an intermediate aggregated state. When reading these columns, merge the intermediate states to get the final value. To merge intermediate states, wrap the column in the original aggregation function and apply the `-Merge` combinator. For example, to finalize the value of the `avg_duration_state` column, you use the `avgMerge` function: ##### finalize the value for the avg\_duration\_state column SELECT date, avgMerge(avg_duration_state) avg_time, quantilesTimingMerge(0.9, 0.95, 0.99)(quantile_timing_state) quantiles_timing_in_ms_array FROM tinybird.pipe_stats where pipe_id = 'PIPE_ID' group by date Learn more about the `-Merge` combinator in the [ClickHouse® documentation](https://clickhouse.com/docs/en/sql-reference/aggregate-functions/combinators#-merge). 
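As a complementary example, the following is a minimal sketch of how you could combine `resource_tags` with one of the Service Data Sources above to attribute issues to tagged resources. It only uses fields documented for `tinybird.endpoint_errors`; the tag values depend on what you have defined in your Workspace, and untagged resources won't match any tag.

```bash
# Sketch: count endpoint errors per tag and Pipe over the last day.
# Assumes you have already added tags to your Pipes.
tb sql "
  SELECT arrayJoin(resource_tags) AS tag, pipe_name, count() AS errors
  FROM tinybird.endpoint_errors
  WHERE start_datetime >= now() - INTERVAL 1 DAY
  GROUP BY tag, pipe_name
  ORDER BY errors DESC
"
```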
Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/plans Last update: 2024-11-13T14:37:53.000Z Content: --- title: "Plans · Tinybird Docs" theme-color: "#171612" description: "The three Tinybird plans explained in one place. Get building today!" --- # Tinybird plans¶ Tinybird has the following plan options: Build, Professional, and Enterprise. You can [upgrade your plan](https://www.tinybird.co/docs/about:blank#upgrade-your-plan) at any time. ## Build¶ The Build plan is free. It provides you with a full-featured, production-grade instance of the Tinybird platform, including all managed ingest connectors, real time querying, and managed API Endpoints. There is no time limit to the Build plan, meaning you can develop using this plan for as long as you want. There are no limits on the number of team seats, Data Sources, or API Endpoints. Support is available through the [Community Slack](https://www.tinybird.co/docs/docs/community) , which is monitored by the Tinybird team. Build plan usage limits: - ** Up to 10 GB of compressed data storage.** This is the total amount of compressed data you're storing, including Data Sources and Materialized Views. - ** Up to 1,000 requests per day to your API Endpoints.** This limit applies to the[ API Endpoints](https://www.tinybird.co/docs/docs/publish/api-endpoints/overview) that you publish from your SQL queries, and queries executed using the[ Query API](https://www.tinybird.co/docs/docs/api-reference/query-api) . The limit doesn't apply to the[ Tinybird REST API](https://www.tinybird.co/docs/docs/api-reference/overview) or[ Events API](https://www.tinybird.co/docs/docs/api-reference/events-api) . The Build plan is suited for development and experimentation. Many Professional and Enterprise customers use Build plan Workspaces to develop and test new use cases before deploying to their production billed Workspaces. See the [billing docs](https://www.tinybird.co/docs/docs/support/billing) for more information. ## Professional¶ The Professional plan is a usage-based plan that scales with you as you grow. When your application is ready for production, you can upgrade any Workspace on the Build plan to the Professional plan. The Professional plan includes all the Tinybird product features as the Build plan, and removes the usage limits for data storage, processed bytes, and API Endpoint requests. This means that you can store as much data, handle as many API requests, and process as much data as you need with no artificial limits. In addition to the [Community Slack](https://www.tinybird.co/docs/docs/community) , Professional customers can also contact the Tinybird support team through email at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co). The Professional plan requires a valid payment method, such as a credit card. Billing is as follows: - ** Data storage is billed at US$0.34 per GB** , with no limit on the amount of data storage. - ** Processed data is billed at US$0.07 per GB** , with no limit on the amount of processed data. - ** Transferred data is billed at US$0.01 - $0.10 per GB** , depending on cloud provider or region, with no limit on the amount of transferred data. See the [billing docs](https://www.tinybird.co/docs/docs/support/billing) for more information. 
### Upgrade your plan¶ As you approach the usage limits of the Build plan, you might receive emails and see dashboard banners about upgrading. As a Workspace admin: 1. View your usage indicators, like monthly processed and stored data, by selecting the cog icon in the navigation pane and selecting the** Usage** tab. 2. Select the** Upgrade to pro** button to enter your card details and upgrade your plan to Professional. The following screenshot shows to access the **Usage** tab and the location of the **Upgrade to pro** button: <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fplans-usage-and-upgrade.png&w=3840&q=75) ## Enterprise¶ As the scale of your Tinybird storage and processing grows, you can customize an Enterprise plan to meet your needs. Enterprise plans can include volume discounts, service-level agreements (SLA), dedicated infrastructure, and a direct Slack-connect support channel. If you're interested in discussing the Enterprise plan, [contact the Tinybird Sales team](https://www.tinybird.co/contact-us) for more information. ## Dedicated¶ The Dedicated plan provides you with a Tinybird cluster with at least two database servers. On a Dedicated plan, you're the only customer on your cluster. Your queries and outputs are more performant as a result. ### Understand billing¶ Dedicated plans are billed every month according to the amount of credits you've used. Credits are a way of tracking your usage of Tinybird's infrastructure and features. The following table shows how Tinybird calculates credits usage for each resource: | Resource | Explanation | | --- | --- | | Clusters | Cluster size, tracked every 15 minutes. Cluster size changes are detected automatically and billed accordingly. | | Storage | Compressed disk storage of all your data. Calculated daily, in terabytes, using the maximum value of the day. | | Data transfer | When using[ Sinks](https://www.tinybird.co/docs/docs/api-reference/sink-pipes-api) , usage is billed depending on the destination, which can be the same cloud provider and region as your Tinybird cluster, or a different one. | | Support | Premier or Enterprise monthly support fee. | | Private Link | Billed monthly. | ### Rate limiter¶ In Dedicated plans, the rate limiter monitors the status of the cluster and limits the number of concurrent requests to prevent the cluster from crashing due to insufficient memory. This allows the cluster to continue working, albeit with a rate limit. The rate limiter activates when the following situation occurs: - When total memory usage in the cluster is over 60%. - Percentage of 408 Timeout Exceeded and 500 Internal Server Error due to memory limits for a Pipe endpoint exceeds 10% of the total requests. If both conditions are met, the maximum number of concurrent requests to the Pipe endpoint is limited proportionally to the percentage of errors. Workspace administrators receive an email indicating the affected Pipe endpoints and the concurrency limit. For example, if a Pipe endpoint is receiving 10 requests per second and 5 failed during a high memory usage scenario due to a timeout or memory error, the number of concurrent queries is limited to a half, that is, 5 concurrent requests for that specific Pipe endpoint. ### Track invoices¶ In Dedicated plans, invoices are issued upon credits purchase, which can happen when signing the contract or when purchasing additional credits. You can check your invoices from the customer portal. 
### Monitor usage¶ You can monitor credits usage, including remaining credits, cluster usage, and current commitment through your organization's dashboard. See [Dedicated infrastructure monitoring](https://www.tinybird.co/docs/docs/monitoring/organizations#dedicated-infrastructure-monitoring) . You can also check usage using the monthly usage receipts. ## Next steps¶ - Explore[ Tinybird's Customer Stories](https://www.tinybird.co/customer-stories) and see what people have built on Tinybird. - Start building now using the[ quick start](https://www.tinybird.co/docs/docs/quick-start) . - Read the[ billing docs](https://www.tinybird.co/docs/docs/support/billing) to understand which data operations count towards your bill, and how to optimize your usage. --- URL: https://www.tinybird.co/docs/production/backfill-strategies Last update: 2024-11-15T11:01:39.000Z Content: --- title: "Backfill strategies · Tinybird Docs" theme-color: "#171612" description: "When iterating Data Sources or Materialized Views, you will often need to backfill data from a Tinybird Data Source to another. This guide will help you understand the different strategies to do it in a safe way." --- # Backfill strategies¶ Backfilling data is the process of filling a new or changed resource with the historical data it's missing. Whether you're changing data types, changing the sorting key, or redefining whole views, at some point you may need to run a backfill from the previous version of your Data Source or Materialized View to the new one. This page introduces the key challenges of backfilling real-time data, and covers the different strategies to run a backfill when you are iterating a Data Source or Materialized View. Before you start iterating and making critical changes to your Data Sources, Materialized Views, and Pipes, it's crucial to read the [Deployment Strategies](https://www.tinybird.co/docs/docs/production/deployment-strategies) docs. ## The challenge of backfilling real-time data¶ Iterating Data Sources or Materialized Views often requires a careful approach to backfilling data. This process becomes critical, especially when you create a new version of a Data Source or Materialized View, which results in a new, empty Data Source or Materialized View. The main challenge lies in migrating historical data while continuously ingesting new real-time data. See the detailed explanation in [Best practices for Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/best-practices). ## Use case¶ Imagine you have the following Data Source deployed in your main Workspace: ##### analytics\_events.datasource SCHEMA > `timestamp` DateTime `json:$.timestamp`, `session_id` String `json:$.session_id`, `action` LowCardinality(String) `json:$.action`, `version` LowCardinality(String) `json:$.version`, `payload` String `json:$.payload` ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYYYYMM(timestamp)" ENGINE_SORTING_KEY "timestamp" You want to modify the sorting key from `timestamp` to `action, timestamp` . This change requires creating a new Data Source (for example, `analytics_events_1.datasource` ). When merging the Pull Request, you will have `analytics_events` and `analytics_events_1` in your main Workspace (and also in the branch while in CI). This means you will have two Data Sources, and the one representing the new version you want to deploy is empty, since it's newly created. So, how can you sync the data between the two Data Sources? This is when you use a backfill.
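For illustration, a minimal sketch of what the new Data Source file could look like: the schema is identical to `analytics_events.datasource` and only the sorting key changes. The `datasources/` path assumes the default data project layout.

```bash
# Sketch: write the new Data Source file with the updated sorting key.
# Everything except ENGINE_SORTING_KEY is copied from analytics_events.datasource.
cat > datasources/analytics_events_1.datasource << 'EOF'
SCHEMA >
    `timestamp` DateTime `json:$.timestamp`,
    `session_id` String `json:$.session_id`,
    `action` LowCardinality(String) `json:$.action`,
    `version` LowCardinality(String) `json:$.version`,
    `payload` String `json:$.payload`

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "action, timestamp"
EOF
```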
## How to move data in Tinybird¶ Reminder: "Running a backfill" just means copying all the data from one Data Source to another Data Source. There are different ways to move data in Tinybird: ### Using Copy Pipes¶ A Copy Pipe is a Pipe used to copy data from one Data Source to another Data Source. This method is useful for one-time moves of data or scheduled executions (for example, every day at 00:00), but it's not recommended if you want to keep the data in sync between two Data Sources. In the context of a backfill, you could use the following Pipe to copy the data from one Data Source to another. (Later, we will explain why you need the `timestamp BETWEEN {{DateTime(start_backfill_timestamp)}} AND {{DateTime(end_backfill_timestamp)}}` condition). ##### backfill\_data.pipe file NODE node SQL > % SELECT * FROM analytics_events WHERE timestamp BETWEEN {{DateTime(start_backfill_timestamp)}} AND {{DateTime(end_backfill_timestamp)}} TYPE COPY TARGET_DATASOURCE analytics_events_1 Once deployed, you would need to run the following command to execute the copy: ##### Command to run the Copy Pipe with the backfill\_timestamp parameter tb pipe copy run backfill_data --param start_backfill_timestamp='1970-01-01 00:00:00' --param end_backfill_timestamp='2024-01-31 00:00:00' --wait --yes You can read more about it in our [Copy Pipes](https://www.tinybird.co/docs/docs/publish/copy-pipes) docs. ### Using Materialized Views¶ A Materialized View is a Pipe that materializes the data from one Data Source to another Data Source. This method is useful to keep the data in sync between two Data Sources. ##### sync\_data.pipe file NODE node SQL > % SELECT * FROM analytics_events WHERE timestamp > '2024-01-31 00:00:00' TYPE materialized DATASOURCE analytics_events_1 By default, a Materialized View only materializes the new incoming data; it won't process the old data. You can force it with the `tb pipe populate` CLI command, but be careful, as this can lead to duplicated or lost data, as explained in the previous section. By combining both methods, you can keep both Data Sources in sync while backfilling the historical data. ## Scenarios for backfill strategies¶ Depending on your use case and ingestion pattern, there are different recommended strategies for backfilling data in Tinybird. The complexity of this migration depends on several factors, notably the presence of streaming ingestion. The most common scenarios are: We are actively improving this workflow in Tinybird. Reach out to Tinybird support ( [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) ) if you have any questions. - ** Scenario 1: I'm not in Production** . - ** Scenario 2: Full replacement every few hours** . - ** Scenario 3: Streaming ingestion WITH incremental timestamp column** . - ** Scenario 4: Streaming ingestion WITH NOT incremental timestamp column** . ### Scenario 1: I'm not in Production¶ If you are not in production, or the data from that Data Source is not being used and **you can accept losing data** , you can opt to create a new Data Source and start using it right away. Alternatively, you can remove and re-create the original Data Source using a custom deployment. Once you start appending data to the Data Source, you will start seeing data in the new Data Source. ### Scenario 2: Full replacement every few hours¶ If you are running a full replacement every few hours, you can create a Materialized View between the two Data Sources.
To sync the data between the two Data Sources, you will use a Materialized Pipe (MV) that will materialize the data from the old Data Source to the new one. Something like this: ##### Materialize data from old to new Data Source NODE migration_node SQL > SELECT * FROM analytics_events TYPE materialized DATASOURCE analytics_events_1 You would deploy this new Pipe along with the modified Data Source. Once you deploy the Pull Request, you will have this Materialized Pipe along with the new Data Source and the Pipe will materialize the data from the old Data Source. At this point, you would just need to wait until the full replacement is executed, the new Data Source will have all the data, after that you can create a new Pull Request to connect the new Data Source to the rest of your Pipe Endpoints. ### Scenario 3: Streaming ingestion WITH incremental timestamp column¶ If you have streaming ingestion using the [events API](https://www.tinybird.co/docs/docs/ingest/events-api) with a huge ingest rate, you can use the following strategy to not be impacted by [the backfilling challenge with real-time data](https://www.tinybird.co/docs/about:blank#the-challenge-of-backfilling-real-time-data). To use this strategy successfully, you must have an incremental timestamp column with the same time zone in your Data Source. In our example, you have the `timestamp` column. First, create a new Pipe that will materialize the data from old Data Source to the new one, **but filter by a future timestamp** . For example, if you are deploying the Pull Request on `2024-02-02 13:00:00` , you can use `timestamp > '2024-02-02 13:30:00'`. ##### sync\_data.pipe file NODE node SQL > SELECT * FROM analytics_events WHERE timestamp > '2024-02-02 13:30:00' TYPE materialized DATASOURCE analytics_events_1 We are using the `timestamp > '2024-02-02 13:30:00'` condition to only materialize data that is newer than the `2024-02-02 13:30:00` timestamp. Then, create a Copy Pipe with the same SQL statement, but instead of filtering by a specific future timestamp, you will use two parameters to filter by a timestamp range. This allows us to have better control of the backfilling process. For example, if you are moving very large amounts of data, split the backfilling process into different batches to avoid overloading the system. ##### backfill\_data.pipe file NODE node SQL > % SELECT * FROM analytics_events WHERE timestamp BETWEEN {{DateTime(start_backfill_timestamp)}} AND {{DateTime(end_backfill_timestamp)}} TYPE COPY TARGET_DATASOURCE analytics_events_1 Once you have these changes in code, create a Pull Request and the CI Workflow will generate a new Branch. #### CI workflow¶ Once the CI Workflow has finished successfully, a new Branch will be created. For the following steps, use the CLI. If you don't have it installed, you can follow the [CLI installation docs](https://www.tinybird.co/docs/docs/cli/install). First, you should be able to authenticate in the Branch by copying the Token from the Branch or using these commands: ##### Authenticate in the Branch # You can use `tb auth -i` to authenticate in the branch tb auth -i # Or you can switch to the branch if you are already authenticated tb branch ls # By default, the CI Workflow will create a branch following the pattern `tmp_ci-`. tb branch use **We recommend you to run the Copy Pipe outside of the CI Workflow** . You can automate it using a custom deployment, but most of the times it's just not worth it. 
As you do not have continuous ingestion in the Branch, don't wait for the future filter timestamp. Instead, run the Copy Pipe directly to backfill the data with the following command: ##### Run the Copy Pipe tb pipe copy run backfill_data --param start_backfill_timestamp='1970-01-01 00:00:00' --param end_backfill_timestamp='2024-02-02 13:30:00' --wait --yes Once the Copy Pipe has finished, you will have all the data in the new Data Source. You can compare the number of rows from both Data Sources with the following commands: ##### Compare the number of rows tb sql "SELECT count() FROM analytics_events" tb sql "SELECT count() FROM analytics_events_1" #### CD workflow¶ Now that you have tested the backfilling process in the Branch, you can merge the Pull Request and run the CD Workflow. The operation is exactly the same as in the Branch: first deploy the resources, then run the data operations, either manually (recommended) or automated with a custom deployment. **You should verify that you have deployed the new Data Source before the timestamp you have used in the Materialized Pipe. Otherwise, you will be missing data in the new Data Source**. For example, if you have used `timestamp > '2024-02-02 13:30:00'` in the Materialized Pipe, you should verify that you have deployed before `2024-02-02 13:30:00`. If you have deployed after `2024-02-02 13:30:00` , you will need to remove the Data Source and start the process again using a different timestamp. At `2024-02-02 13:30:00` , you will start seeing data in the new Data Source; that's when you execute the same process you followed in the CI Workflow to backfill the data. First, authenticate in the Workspace by running the following command: ##### Authenticate in the Workspace tb auth -i Then, run the Copy Pipe to backfill the data by running the following command: ##### Run the Copy Pipe tb pipe copy run backfill_data --param start_backfill_timestamp='1970-01-01 00:00:00' --param end_backfill_timestamp='2024-02-02 13:30:00' --wait --yes If the Copy Pipe fails, you can re-run the same command without duplicating data. **The Copy Pipe will only copy the data if the process is successful.** If you get an error like `MEMORY LIMIT` , you can also run the Copy Pipe in batches. For example, you could run the Copy Pipe with a timestamp range of 1 hour, 1 day, or 1 week, depending on the amount of data you are moving (see the sketch at the end of this section). Once the Copy Pipe has finished, you will have all the data in the new Data Source. You can compare the number of rows by running the following commands: ##### Compare the number of rows tb sql "SELECT count() FROM analytics_events" tb sql "SELECT count() FROM analytics_events_1" At this point, you should have the same number of rows in both places and you can connect the new Data Source with the rest of the Dataflow. Finally, you have the Data Source with the new schema and all the data migrated from the previous one. The Data Source is receiving real-time data directly, and the next step is to remove the Materialized Pipe and Copy Pipe you used to backfill the data. To do that, create a new Pull Request and remove ( `git rm` ) the Materialized Pipe and Copy Pipe you used to backfill the data. Once the Pull Request is merged, the resources are automatically removed; you can double-check this operation while in CI.
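Here is a rough sketch of the batching idea mentioned above. The window boundaries are illustrative; the last window should end at the timestamp used in the Materialized Pipe filter, and, as noted above, a failed run can be repeated without duplicating data.

```bash
# Sketch: split the backfill into smaller windows to avoid memory errors.
# Adjust the window boundaries to the volume of data you are moving.
for range in \
  "1970-01-01 00:00:00|2024-01-01 00:00:00" \
  "2024-01-01 00:00:00|2024-02-01 00:00:00" \
  "2024-02-01 00:00:00|2024-02-02 13:30:00"; do
  tb pipe copy run backfill_data \
    --param start_backfill_timestamp="${range%%|*}" \
    --param end_backfill_timestamp="${range##*|}" \
    --wait --yes
done
```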
### Scenario 4: Streaming ingestion WITH NOT incremental timestamp column¶ If you have streaming ingestion, but you do not have an incremental timestamp column, you can use one of the following strategies to backfill data in Tinybird. Reach out to Tinybird support ( [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) ) if you have any questions or you aren't sure how to proceed. - ** Strategy 1** : Run a populate, but be aware that you may be impacted by[ the previously-mentioned challenges of backfilling real-time data](https://www.tinybird.co/docs/about:blank#the-challenge-of-backfilling-real-time-data) . - ** Strategy 2** : Move ingestion to the new Data Source until you finish backfilling data,** but the data in your old Data Source will be outdated until the new Data Source is fully in sync** . #### Strategy 1: Run a populate¶ Before following this strategy, you should be aware of the [backfilling challenge with real-time data](https://www.tinybird.co/docs/about:blank#the-challenge-of-backfilling-real-time-data). Let's consider that the use case is the same as the previous one, but you do not have an incremental timestamp column. We can not rely on the `timestamp` column to filter the data as it's not incremental. First, create a Materialized Pipe that will materialize the data from the old Data Source to the new one. ##### backfill\_data.pipe file NODE migrating_node SQL > SELECT * FROM analytics_events TYPE materialized DATASOURCE analytics_events_1 To run the backfill, you will use the `tb pipe populate` command. This command will materialize the data from the old Data Source to the new one and as you don't need to wait until a future timestamp, you can run it inside the CI/CD Workflow. You can create a custom deployment using `VERSION=1.0.0` in the `.tinyenv` file and placing the custom deployment script in the `deploy/1.0.0` folder: ##### Scripts generated inside the deploy folder deploy/1.0.0 ├── deploy.sh ## This is the script that will be executed during the deployment You will need to modify the `deploy.sh` script to run the `tb pipe populate` command: ##### deploy.sh script #!/bin/bash # This script will be executed after the deployment # You can use it to run any command after the deployment # Run the populate Pipe tb pipe populate backfill_data --node migrating_node --wait Once you have these changes in the code, you will create a Pull Request and the CI Workflow will generate a new Branch with the new Data Source. Now, you should verify that everything is working as expected as you did in the previous section. # You can use `tb auth -i` to authenticate in the branch tb auth -i # Or you can switch to the branch if you are already authenticated tb branch ls # By default, the CI Workflow will create a branch following the pattern `tmp_ci-`. tb branch use # Also, you could compare the number of rowsby running the following command: tb sql "SELECT count() FROM analytics_events" tb sql "SELECT count() FROM analytics_events_1" Once you have verified that everything is working as expected, merge the Pull Request and the CD Workflow will generate a new Data Source in the Main Branch. Once the CD Workflow has finished successfully, you verify the same way as you did in the Branch. #### Strategy 2: Move ingestion to the new Data Source¶ Let's consider that the use case is the same as the previous one. We do not have an incremental timestamp column. 
You can't rely on the `timestamp` column to filter the data, as it's not incremental, and you don't want to run a populate because you might be impacted by [the backfilling challenge with real-time data](https://www.tinybird.co/docs/about:blank#the-challenge-of-backfilling-real-time-data). In this case, you can move the ingestion to the new Data Source until you finish backfilling data. First, create a Copy Pipe that will copy the data from the old Data Source to the new one. ##### backfill\_data.pipe file NODE migrate_data SQL > SELECT * FROM analytics_events TYPE COPY TARGET_DATASOURCE analytics_events_1 You could also parameterize the Copy Pipe to filter by a parameter. This gives you better control of the backfilling process. Once you have these changes in your code, create a Pull Request and the CI Workflow will generate a new Branch with the new Data Source and the Copy Pipe. Now, run the Copy Pipe to backfill the data. To do that, authenticate in the Branch by copying the Token from the Branch or using these commands: ##### Authenticate in the branch # You can use `tb auth -i` to authenticate in the branch tb auth -i # Or you can switch to the branch if you are already authenticated tb branch ls # By default, the CI Workflow will create a branch following the pattern `tmp_ci-`. tb branch use # Once you have authenticated in the branch, you can run the Copy Pipe by running the following command: tb pipe copy run backfill_data --node migrate_data --wait --yes Once the Copy Pipe has finished, you will have all the data in the new Data Source. As you are likely not ingesting data into your Branch, both numbers should match. ##### Compare the number of rows tb sql "SELECT count() FROM analytics_events" tb sql "SELECT count() FROM analytics_events_1" Now, merge the Pull Request and the CD Workflow will generate the new resources in the main Workspace. At this point, you should modify the ingestion to start ingesting data into the new Data Source. **Keep in mind that once you are ingesting data into the new Data Source, you will stop ingesting data into the old Data Source**. Once all the ingestion is pointing to the new Data Source, you should verify that new data is being ingested into the new Data Source and nothing is being ingested into the old one. To do that, you could query the Data Source directly or use the Service Data Source [tinybird.datasources_ops_log](https://www.tinybird.co/docs/docs/monitoring/service-datasources). At this point, you can start backfilling data by running the Copy Pipe with the following command: ##### Run the Copy Pipe tb pipe copy run backfill_data --node migrate_data --wait --yes Sometimes the Data Source you are modifying has downstream dependencies. In that case, when creating the new version of the Data Source, make sure you also create new versions of all downstream dependencies. Otherwise, you would connect two different Data Sources receiving data to the same part of the Dataflow, and hence duplicate data. ## Next steps¶ If you're familiar with backfilling strategies, check out the [Deployment Strategies](https://www.tinybird.co/docs/docs/production/deployment-strategies) docs. --- URL: https://www.tinybird.co/docs/production/continuous-integration Last update: 2024-11-13T08:04:24.000Z Content: --- title: "Continuous Integration and Deployment (CI/CD) · Tinybird Docs" theme-color: "#171612" description: "How to implement Continuous Integration and Deployment workflows for your Tinybird data project."
--- # Continuous integration and continuous deployment (CI/CD)¶ Once you connect your data project and Workspace [through Git](https://www.tinybird.co/docs/docs/production/working-with-version-control) you can implement a Continuous Integration (CI) and Continuous Deployment (CD) workflow to automate interaction with Tinybird. This page covers how CI and CD work using a walkthrough example. CI/CD pipelines require the use of: - [ Datafiles](https://www.tinybird.co/docs/docs/cli/datafiles/overview) - [ CLI commands](https://www.tinybird.co/docs/docs/cli/command-ref) - [ Tinybird Branches](https://www.tinybird.co/docs/docs/concepts/branches) ## How continuous integration works¶ As you expand and iterate on your data projects, you can continuously validate your API Endpoints. In the same way that you write integration and acceptance tests for source code in a software project, you can write automated tests for your API Endpoints to run on each Pull or Merge request. Continuous Integration can help with: - Linting: Syntax and formatting on[ datafiles](https://www.tinybird.co/docs/docs/cli/datafiles/overview) . - Correctness: Making sure you can push your changes to a Tinybird Workspace. - Quality: Running fixture tests or data quality tests or both to validate the changes in the Pull Request. - Regression: Running automatic regression tests to validate endpoint performance and data quality. The following section uses the CI template, GitHub Actions, and the Tinybird CLI to demonstrate how to test your API Endpoints on any new commit to a Pull Request. Set these optional environment variables to adapt your CI/CD workflow: - `TB_VERSION_WARNING=0` : Don't print CLI version warning message if there's a new version available. - `TB_SKIP_REGRESSION=0` : Skip regression tests. ### Building the CI/CD pipeline¶ This section demonstrates automating CI/CD pipelines using GitHub as the provider with a GitHub Action, but you can use any suitable platform. The examples on this page use the Tinybird's CI and CD templates in [this repository](https://github.com/tinybirdco/ci) . You can find examples for Gitlab in that repository as well. You can those example templates or build your own pipelines inspired by them. That way you can adapt them to suit your data project needs and integrate them better with the CI/CD workflow you use for other parts of your toolset. These steps use the [Tinybird CLI commands](https://www.tinybird.co/docs/docs/cli/command-ref) so you can fully reproduce the pipeline locally. Remember to add a new secret with the Workspace administrator [Token](https://www.tinybird.co/docs/docs/concepts/auth-tokens) to the repository's settings to be able to run the needed commands from the CLI. #### 1. Trigger the CI workflow¶ ##### Run the CI workflow on each commit to a Pull Request, when labelling, or with other kinds of triggers name: Tinybird - CI Workflow on: workflow_dispatch: pull_request: branches: - main types: [opened, reopened, labeled, unlabeled, synchronize, closed] Key points: The CI workflow triggers when a new Pull Request opens, reopens, synchronizs or updates labels and the base branch has to be `main` . On closed, it deletes the Tinybird branch created for CI. #### 2. Configure the CI job¶ ##### Use the workflow configuration defined in the uses reference jobs: ci: # ci using Branches from Workspace 'web_analytics_starter_kit' uses: tinybirdco/ci/.github/workflows/ci.yml@main with: data_project_dir: . 
secrets: tb_admin_token: ${{ secrets.TB_ADMIN_TOKEN }} # set Workspace admin Token in GitHub secrets tb_host: https://api.tinybird.co You can combine the CI trigger and the CI job configuration into a single YAML workflow file and store it in `/my_repo/.github/workflows` . If you are not already familiar with GitHub actions you can checkout the [quickstart guide](https://docs.github.com/en/actions/writing-workflows/quickstart). If your data project directory isn't in the root of the Git repository, change the `data_project_dir` variable. About secrets: - `tb_host` : The URL of the region you want to use. - `tb_admin_token` : The Workspace admin Token. This grants all the permissions for a specific Workspace. You can find more information in the[ Tokens docs](https://www.tinybird.co/docs/docs/concepts/auth-tokens) . ### The CI workflow¶ A potential CI workflow could run the following steps: 1. Configuration: set up dependencies and installs the Tinybird CLI to run the required commands. 2. Check the data project syntax and the authentication. 3. Create a new ephemeral CI Tinybird Branch. 4. Push the changes to the Branch. 5. Run tests in the Branch. 6. Delete the Branch. #### 0. Workflow configuration¶ defaults: run: working-directory: ${{ inputs.data_project_dir }} if: ${{ github.event.action != 'closed' }} steps: - uses: actions/checkout@master with: fetch-depth: 300 ref: ${{ github.event.pull_request.head.sha }} - uses: actions/setup-python@v5 with: python-version: "3.11" architecture: "x64" cache: 'pip' - name: Validate input run: | [[ "${{ secrets.tb_admin_token }}" ]] || { echo "Go to the tokens section in your Workspace, copy the 'admin token' and set TB_ADMIN_TOKEN as a Secret in your Git repository"; exit 1; } - name: Set environment variables run: | _ENV_FLAGS="${ENV_FLAGS:=--last-partition --wait}" _NORMALIZED_BRANCH_NAME=$(echo $DATA_PROJECT_DIR | rev | cut -d "/" -f 1 | rev | tr '.-' '_') GIT_BRANCH=${GITHUB_HEAD_REF} echo "GIT_BRANCH=$GIT_BRANCH" >> $GITHUB_ENV echo "_ENV_FLAGS=$_ENV_FLAGS" >> $GITHUB_ENV echo "_NORMALIZED_BRANCH_NAME=$_NORMALIZED_BRANCH_NAME" >> $GITHUB_ENV Key points: This sets the default `working-directory` to the `data_project_dir` variable, check outs the `main` branch to get the head commit, checks the `TB_ADMIN_TOKEN` , and installs Python 3.11. #### 1. Install the Tinybird CLI¶ - name: Install Tinybird CLI run: | if [ -f "requirements.txt" ]; then pip install -r requirements.txt else pip install tinybird-cli fi - name: Tinybird version run: tb --version Workflow actions use the Tinybird CLI to interact with your Workspace, create a test Branch, and run the tests. You can use a `requirements.txt` file to pin a tinybird-cli version to avoid automatically install the latest version. You can run this workflow locally by having a local data project and the CLI authenticated to your Tinybird Workspace. #### 2. Check the data project syntax and the authentication¶ - name: Check all the datafiles syntax run: tb check - name: Check auth run: tb --host ${{ secrets.tb_host }} --token ${{ secrets.tb_admin_token }} auth info #### 3. Create a new Tinybird Branch to deploy changes and run the tests¶ A Branch is an isolated copy of the resources in your Workspace at a specific point in time. It's designed to be temporary and disposable so that you can develop and test changes before deploying them to your Workspace. Each CI job creates a Branch. 
In this example, the Tinybird Brand name uses `github.event.pull_request.number` as a unique identifier so multiple tests can run in parallel. If a Branch with the same name exist, it's removed and recreated again. The `tb branch create` command creates new Branches. Once you merge your changes with the Pull Request, the workflow deletes your Tinybird Branch. - name: Try to delete previous Branch run: | output=$(tb --host ${{ secrets.tb_host }} --token ${{ secrets.tb_admin_token }} branch ls) BRANCH_NAME="tmp_ci_${_NORMALIZED_BRANCH_NAME}_${{ github.event.pull_request.number }}" # Check if the branch name exists in the output if echo "$output" | grep -q "\b$BRANCH_NAME\b"; then tb \ --host ${{ secrets.tb_host }} \ --token ${{ secrets.tb_admin_token }} \ branch rm $BRANCH_NAME \ --yes else echo "Skipping clean up: The Branch '$BRANCH_NAME' does not exist." fi - name: Create new test Branch with data run: | tb \ --host ${{ secrets.tb_host }} \ --token ${{ secrets.tb_admin_token }} \ branch create tmp_ci_${_NORMALIZED_BRANCH_NAME}_${{ github.event.pull_request.number }} \ ${_ENV_FLAGS} Set the `_ENV_FLAGS` variable to `--last-partition --wait` to attach the most recently ingested data in the Workspace. This way, you can run the tests using the same data as in production. Alternatively, leave it empty and use fixtures. #### 4. Deploy changes to the Tinybird Branch¶ You can push the changes in your current Pull Request to the test Branch previously created in two ways: ##### Standard deployment Use `tb deploy` if you connected your data project and Workspace [through Git](https://www.tinybird.co/docs/docs/production/working-with-version-control) . This command pushes the file changes based on the result of `git diff` between the latest commit deployed to the Workspace and the current git branch HEAD commit. If your data project and Workspace aren't connected through Git, you can use `tb push --only-changes --force --yes` . This command pushes the file changes based on the result of `tb diff` between the local changes in the git branch and the remote changes in the Tinybird branch. Common `tb push` options: - `--only-changes` : Deploys the changed datafiles and its dependencies. - `--force` : Overrides any existing Pipe. - `--yes` : Confirms any alter to a Data Source. - `--no-check` : Avoid running regression tests when overwriting a Pipe Endpoint. ##### Custom deploy command Alternatively, for more complex changes, you can decide how to deploy the changes to the test Branch. This is convenient, for instance, if additionally to deploy the datafiles you want to automate some other data operation, such as a running a copy Pipe, truncate a Data Source, etc. For this to work, you have to place an executable shell script file in `deploy/$VERSION/deploy.sh` with the CLI commands to push the changes. `$VERSION` should be a global variable and unique to the current active Pull Request. You can find it in the `.tinyenv` file in the data project. - name: Deploy changes to the test Branch run: | DEPLOY_FILE=./deploy/${VERSION}/deploy.sh if [ ! -f "$DEPLOY_FILE" ]; then echo "$DEPLOY_FILE not found, running default tb deploy command" tb deploy fi - name: Custom deployment to the test Branch run: | DEPLOY_FILE=./deploy/${VERSION}/deploy.sh if [ -f "$DEPLOY_FILE" ]; then echo "$DEPLOY_FILE found" if ! [ -x "$DEPLOY_FILE" ]; then echo "Error: You do not have permission to execute '$DEPLOY_FILE'. Run:" echo "> chmod +x $DEPLOY_FILE" echo "and commit your changes" exit 1 else $DEPLOY_FILE fi fi #### 5. 
Run the tests¶ You can now run your test suite. This is an optional step but recommended if you want to make sure everything works as expected. Tinybird provides three type of tests by default, but you can include any test needed for your deployment pipeline: - Data fixture tests: These test specific business logic based on fixture data, see `datasources/fixtures` . - Data quality tests: These test precise data scenarios. - Regression tests: These test that requests to your API Endpoints are still working as expected. For these tests to work, you must attach production data using the `--last-partition` flag when creating the test Branch. To learn more about testing Tinybird data projects, refer to the [Implementing test strategies](https://www.tinybird.co/docs/docs/production/implementing-test-strategies) docs. - name: Get regression labels id: regression_labels uses: SamirMarin/get-labels-action@v0 with: github_token: ${{ secrets.GITHUB_TOKEN }} label_key: regression - name: Run Pipe regression tests run: | source .tinyenv echo ${{ steps.regression_labels.outputs.labels }} REGRESSION_LABELS=$(echo "${{ steps.regression_labels.outputs.labels }}" | awk -F, '{for (i=1; i<=NF; i++) if ($i ~ /^--/) print $i}' ORS=',' | sed 's/,$//') echo ${REGRESSION_LABELS} CONFIG_FILE=./tests/regression.yaml BASE_CMD="tb branch regression-tests" LABELS_CMD="$(echo ${REGRESSION_LABELS} | tr , ' ')" if [ -f ${CONFIG_FILE} ]; then echo "Config file found: ${CONFIG_FILE}" ${BASE_CMD} -f ${CONFIG_FILE} --wait ${LABELS_CMD} else echo "Config file not found at '${CONFIG_FILE}', running with default values" ${BASE_CMD} coverage --wait ${LABELS_CMD} fi - name: Append fixtures run: | if [ -f ./scripts/append_fixtures.sh ]; then echo "append_fixtures script found" ./scripts/append_fixtures.sh fi - name: Run fixture tests run: | if [ -f ./scripts/exec_test.sh ]; then ./scripts/exec_test.sh fi - name: Run data quality tests run: | tb test run -v -c 4 You can find the reference `append_fixtures` and `exec_test` scripts in [this repository](https://github.com/tinybirdco/ci/tree/main/scripts). #### 6. Delete the Branch¶ By default, the workflow doesn't delete Branches until it's merged into the main Workspace. The following step runs after the tests: - name: Try to delete previous Branch run: | output=$(tb --host ${{ secrets.tb_host }} --token ${{ secrets.tb_admin_token }} branch ls) BRANCH_NAME="tmp_ci_${_NORMALIZED_BRANCH_NAME}_${{ github.event.pull_request.number }}" # Check if the branch name exists in the output if echo "$output" | grep -q "\b$BRANCH_NAME\b"; then tb \ --host ${{ secrets.tb_host }} \ --token ${{ secrets.tb_admin_token }} \ branch rm $BRANCH_NAME \ --yes else echo "Skipping clean up: The Branch '$BRANCH_NAME' does not exist." fi You can have up to simultaneous 3 Branches per Workspace at any time. Contact Tinybird at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) if you need to increase this limit. ## How continuous deployment works¶ Once a Pull Request passes CI and a peer reviews and approves it, it's time to merge it to your main Git branch. Continuous Deployment or sometimes Continuous Delivery automatically deploys changes to the Workspace. While efficient, this workflow comes with several challenges, most of them related to handling the current state of your Tinybird Workspace. For instance: - As opposed to when you deploy a stateless app, deployments to a Workspace are incremental, based on the previous resources in the Workspace. 
- Resources or operations that run programatically and deal with handling state: populating operations or permission handling. - Performing deployments in the same Workspace; you need to be aware of this and implement a policy to avoid collisions from different Pull Requests deploying at the same time, or regressions. As deployments rely on Git commits to push resources, your Branches **must not** be out-of-date when merging. Use your Git provider to control branch freshness. The CD workflow explained here is a guide relevant to many of the most common use cases. However, some complex deployments may require additional knowledge and expertise from the team deploying the change. Continuous Deployment helps with: - Correctness: Ensuring you can push your changes to a Tinybird Workspace. - Deployment: Deploying the changes to the Workspace automatically. - Data Operations: Centralizing data operations required after pushing resources to the Workspace. The following section uses the generated CD template, GitHub Actions, and the Tinybird CLI to explain how to deploy Pull Request changes after merging. ### Configure the CD job¶ ##### CD workflow name: Tinybird - CD Workflow on: workflow_dispatch: push: branches: - main jobs: cd: # deploy changes to Workspace 'web analytics starter kit' uses: tinybirdco/ci/.github/workflows/cd.yml@main with: data_project_dir: . secrets: tb_admin_token: ${{ secrets.TB_ADMIN_TOKEN }} # set Workspace admin Token in GitHub secrets tb_host: https://api.tinybird.co You can use this YAML workflow file and store it in `/my_repo/.github/workflows` . This workflow deploys on merge to `main` to the Workspace defined by the `TB_ADMIN_TOKEN` set as secret in the GitHub repository's Settings. If your data project directory isn't in the root of the Git repository, change the `data_project_dir` variable. About secrets: - `tb_host` : The URL of the region you want to use. - `tb_admin_token` : The Workspace admin Token. This grants all the permissions for a specific Workspace. You can find more information in the[ Tokens docs](https://www.tinybird.co/docs/docs/concepts/auth-tokens) . ### The CD workflow¶ The CD pipeline deploys the changes to the main Workspace the same way the CI pipeline deploys them to the Tinybird Branch. Run CD workflow on merging a PR to keep your Workspace in sync with the git repository main branch HEAD commit. The CD workflow performs the following steps: 1. Configuration 2. Install the Tinybird CLI 3. Checks authentication 4. Pushes changes 5. Post-deployment #### 0. Workflow configuration¶ Same as the CI workflow. #### 1. Install the Tinybird CLI and check authentication¶ Worflow actions use the Tinybird CLI to interact with your Workspace. You can run this workflow locally by having a local data project and the CLI authenticated to your Tinybird Workspace. This step is equivalent to, but not identical to, the CI workflow step 1. - name: Install Tinybird CLI run: | if [ -f "requirements.txt" ]; then pip install -r requirements.txt else pip install tinybird-cli fi - name: Tinybird version run: tb --version - name: Check auth run: tb --host ${{ secrets.tb_host }} --token ${{ secrets.tb_admin_token }} auth info #### 2. Deploy changes¶ Use the same exact strategy that you used in CI. If you did automatic deployment through git, then use `tb deploy` , otherwise `tb push --only-changes --force`. If you did a custom deployment for this specific PR make sure the same exact script runs in CD. 
- name: Deploy changes to the main Workspace run: | DEPLOY_FILE=./deploy/${VERSION}/deploy.sh if [ ! -f "$DEPLOY_FILE" ]; then echo "$DEPLOY_FILE not found, running default tb deploy command" tb deploy fi - name: Custom deployment to the main Workspace run: | DEPLOY_FILE=./deploy/${VERSION}/deploy.sh if [ -f "$DEPLOY_FILE" ]; then echo "$DEPLOY_FILE found" if ! [ -x "$DEPLOY_FILE" ]; then echo "Error: You do not have permission to execute '$DEPLOY_FILE'. Run:" echo "> chmod +x $DEPLOY_FILE" echo "and commit your changes" exit 1 else $DEPLOY_FILE fi fi ## Other git providers¶ Most git vendors provide a way to run CI/CD pipelines. This example is a guide on how you can use the Tinybird CLI and git to build CI/CD pipelines with GitHub Actions, but you can create similar pipelines with other providers or adapt your own pipelines to support CI/CD to a Tinybird Workspace. You can find a similar workflow for GitLab [here](https://github.com/tinybirdco/ci/blob/main/.gitlab/README.md). --- URL: https://www.tinybird.co/docs/production/deployment-strategies Last update: 2024-11-08T11:17:52.000Z Content: --- title: "Deployment strategies · Tinybird Docs" theme-color: "#171612" description: "Things to take into account when building your Continuous Deployment pipeline." --- # Deployment strategies¶ This page explains deployment strategies when you are doing one of the following: - Adding, updating or deleting resources - Maintaining streaming ingestion - Handling user API requests It covers the default method for implementing Continuous Deployment (CD), how to bypass the default deployment strategy to create custom deployments, and finally, strategies to take into account when migrating data. ## How deployment works¶ With the Git integration, your project in Git is the real source of truth and you should expect your Workspace(s) to be a working version of the resources in Git. In the default [CI/CD workflow templates](https://github.com/tinybirdco/ci) the `tb deploy` command is used to deploy changes to the Workspace. This does the following: - Checks the current commit in the Workspace and validates that it is an ancestor of the commit in the Pull Request being deployed. If not, you usually have to `git rebase` your branch. - Performs a `git diff` from the current branch to the main branch so it can get a list of the datafiles that changed. - Deploys them, both in CI and CD. ## Alter strategy¶ Updates existing resources that have been changed. This is the default deployment strategy when using `tb deploy`. Use it to add a new column to a Data Source or change the TTL. Not all operations can be performed with this strategy. For instance, you cannot change the Sorting Key of a Data Source with this strategy. An example use case can be found here: [Add column to a Data Source](https://github.com/tinybirdco/use-case-examples/tree/main/add_nullable_column_to_landing_data_source). ## Versioning strategy¶ When you want to make a breaking change to some resource, we recommend creating a new version of that resource. This strategy is usually performed in several steps, each one corresponding to a Pull Request: - Create a new Branch with the new resource(s) (Pipe or Data Source) with a different name and deploy it. - Make sure any backfill operation is run over the new resources so data in the main Workspace is in sync. - Create a new Branch to connect the corresponding dependencies to the new resource(s) and deploy it. - Create a new Branch to `git rm` the old resources and deploy it.
This strategy allows for a staged and controlled deployment of breaking changes. You keep old and new versions of the resource(s) in the main Workspace until your end user application is rolled out. ## Custom deployments¶ The `tb deploy` command allows you to deploy a project with a single command. Under the hood, this command performs a series of actions that are common to most deployments. However, your project might have specific requirements that mean you need to customize these actions, or run additional actions after the deployment process. Use a custom deployment when you need to perform some operation that's not directly supported by `tb deploy`, or that requires several steps, like automating a backfill operation. ### Custom deployment actions¶ The `deploy.sh` file allows you to override the default deployment process, so instead of running the default `tb deploy` command you can run a custom shell script to deploy your changes. Use custom deployment actions if the default deployment process is not suitable for your project. For example, you might want to deploy resources in a specific order or handle errors differently. To create custom deployment actions: - If you are using the Tinybird CI/CD templates, export an environment variable in your CI/CD templates with the name `VERSION`. You can find it in the `.tinyenv` file in the data project. - Create a `deploy/$VERSION/deploy.sh` file and ensure it has execution permissions. For example, `chmod +x -R deploy/0.0.1/`. - In the `deploy.sh` file, write the Tinybird CLI commands you want to execute during the deployment process. - The CI/CD pipelines will find the `deploy.sh` file and execute it when deploying the matching version of the project. Custom deployments also run in CI, so use the CI run to validate that the custom deployment will work when merging the Pull Request. Once you merge a Pull Request with a custom deployment, make sure you update the `VERSION` environment variable, so this custom deployment does not run on the next Pull Request. After a custom deployment, if you did not run `tb deploy`, make sure your Workspace is synchronized with the Git main branch head commit by running `tb init --override-commit HEAD_COMMIT_ID`. ## Deploying common use cases¶ Check out [the Use Case repository](https://github.com/tinybirdco/use-case-examples) for common use cases and scenarios. ## Next steps¶ - Practice iterating on one of Tinybird's examples in the [Use Case repository](https://github.com/tinybirdco/use-case-examples). - Learn about [backfill strategies](https://www.tinybird.co/docs/docs/production/backfill-strategies). --- URL: https://www.tinybird.co/docs/production/implementing-test-strategies Last update: 2024-11-08T11:17:52.000Z Content: --- title: "Implementing test strategies · Tinybird Docs" theme-color: "#171612" description: "Learn about different strategies for testing your data project. You'll learn how to implement regression tests, data quality tests, and fixture tests." --- # Implementing test strategies¶ Tinybird provides you with a suite of tools to test your project. This means you can make changes and be confident that they won't break the API Endpoints you've deployed. ## Overview¶ This walkthrough is based on the [Web Analytics Starter Kit](https://github.com/tinybirdco/web-analytics-starter-kit). You can follow along using your existing Tinybird project, or by cloning the Web Analytics Starter Kit.
If you need to, create a new Workspace by clicking on the following button: <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fcreate-workspace-with-web-analytics.png&w=3840&q=75) ### Generate mock data¶ If you want to send fake data to your project, use Tinybird's [mockingbird](https://mockingbird.tinybird.co/docs/cli) CLI tool. You'll need to run the following command to start receiving dummy events: ##### Command to populate the Web Analytics Starter Kit with fake events mockingbird-cli tinybird \ --template "Web Analytics Starter Kit" \ --token \ --datasource "analytics_events" \ # The region should be "eu_gcp", "us_gcp" --endpoint "" \ --eps 100 \ --limit 1000 ## Testing strategies¶ You can implement three different testing strategies in a Tinybird data project: Regression tests, data quality tests, and fixture tests. All three of them are included as part of the [Continuous Integration workflow](https://www.tinybird.co/docs/docs/production/continuous-integration). **Regression tests** prevent you from breaking working APIs. They run automatically on each commit to a Pull Request, or when trying to overwrite a Pipe with an API Endpoint in a Workspace. They compare both the output and performance of your API Endpoints using the previous and current versions of the Pipe endpoints. **Data quality tests** warn you about anomalies in the data. As opposed to regression tests, you do have to write data quality tests to cover one or more criteria over your data: the presence of null values, duplicates, out-of-range values, etc. Data quality tests are usually scheduled by users to run over production data as well. **Fixture tests** are like "manual tests" for your API Endpoints. They check that a given call to a given Pipe endpoint with a set of parameters and a known set of data (fixtures) returns a deterministic response. They're useful for coverage testing and for when you are developing or debugging new business logic that requires very specific data scenarios. When creating fixture tests, you must provide both the test and any data fixtures. ## Regression tests¶ When one of your API Endpoints is integrated into a production environment (such as a web/mobile application or a dashboard), you want to make sure that any change in the Pipe doesn't change the output of the API endpoint. In other words, you want the same version of an API Endpoint to return the same data for the same requests. Tinybird provides you with automatic regression tests that run any time you push a new change to an API Endpoint. Here's an example you can follow along by reading - no need to clone anything. We'll go through the example, and then explain how regression tests work under the hood. Imagine you have a `top_browsers` Pipe: ##### Definition of the top\_browsers.pipe file DESCRIPTION > Top Browsers ordered by most visits. Accepts `date_from` and `date_to` date filter. Defaults to last 7 days. Also `skip` and `limit` parameters for pagination. 
TOKEN "dashboard" READ NODE endpoint DESCRIPTION > Group by browser and calculate hits and visits SQL > % SELECT browser, uniqMerge(visits) as visits, countMerge(hits) as hits FROM analytics_sources_mv WHERE {% if defined(date_from) %} date >= {{Date(date_from, description="Starting day for filtering a date range", required=False)}} {% else %} date >= timestampAdd(today(), interval -7 day) {% end %} {% if defined(date_to) %} and date <= {{Date(date_to, description="Finishing day for filtering a date range", required=False)}} {% else %} and date <= today() {% end %} GROUP BY browser ORDER BY visits desc LIMIT {{Int32(skip, 0)}},{{Int32(limit, 50)}} These requests are sent to the API Endpoint: ##### Let's run some requests to the API Endpoint to see the output curl https://api.tinybird.co/v0/pipes/top_browsers.json?token={TOKEN} curl https://api.tinybird.co/v0/pipes/top_browsers.json?token={TOKEN}&date_from=2020-04-23&date_to=2030-04-23 Now, it's possible to filter by `device` and modify the API Endpoint: ##### Definition of the top\_browsers.pipe file DESCRIPTION > Top Browsers ordered by most visits. Accepts `date_from` and `date_to` date filter. Defaults to last 7 days. Also `skip` and `limit` parameters for pagination. TOKEN "dashboard" READ NODE endpoint DESCRIPTION > Group by browser and calculate hits and visits SQL > % SELECT browser, uniqMerge(visits) as visits, countMerge(hits) as hits FROM analytics_sources_mv WHERE {% if defined(date_from) %} date >= {{Date(date_from, description="Starting day for filtering a date range", required=False)}} {% else %} date >= timestampAdd(today(), interval -7 day) {% end %} {% if defined(date_to) %} and date <= {{Date(date_to, description="Finishing day for filtering a date range", required=False)}} {% else %} and date <= today() {% end %} {% if defined(device) %} and device = {{String(device, description="Device to filter", required=False)}} {% end %} GROUP BY browser ORDER BY visits desc LIMIT {{Int32(skip, 0)}},{{Int32(limit, 50)}} At this point, you'd create a new Pull Request like [this example](https://github.com/tinybirdco/use-case-examples/pull/213) with the changes above, so the API Endpoint is tested for regressions. On the standard Tinybird [Continuous Integration pipeline](https://github.com/tinybirdco/use-case-examples/actions/runs/7616008955/job/20741856936?pr=213) , changes are deployed to a branch, and there's a `Run Pipe regression tests` step that runs the following command on that branch: ##### Run regression tests tb branch regression-tests coverage --wait The next step creates a Job that runs the coverage of the API Endpoints. The Job tests all combinations of parameters by running at least one request for each combination, and comparing the results of the new and old versions of the Pipe: ##### Run coverage regression tests ## In case the endpoints don't have any requests, we will show a warning so you can delete the endpoint if it's not needed or it's expected 🚨 No requests found for the endpoint analytics_hits - coverage (Skipping validation). 💡 See this guide for more info about the regression tests => https://www.tinybird.co/docs/production/implementing-test-strategies#testing-strategies 🚨 No requests found for the endpoint trend - coverage (Skipping validation). 💡 See this guide for more info about the regression tests => https://www.tinybird.co/docs/production/implementing-test-strategies#testing-strategies 🚨 No requests found for the endpoint top_locations - coverage (Skipping validation). 
💡 See this guide for more info about the regression tests => https://www.tinybird.co/docs/production/implementing-test-strategies#testing-strategies 🚨 No requests found for the endpoint kpis - coverage (Skipping validation). 💡 See this guide for more info about the regression tests => https://www.tinybird.co/docs/production/implementing-test-strategies#testing-strategies 🚨 No requests found for the endpoint top_sources - coverage (Skipping validation). 💡 See this guide for more info about the regression tests => https://www.tinybird.co/docs/production/implementing-test-strategies#testing-strategies 🚨 No requests found for the endpoint top_pages - coverage (Skipping validation). 💡 See this guide for more info about the regression tests => https://www.tinybird.co/docs/production/implementing-test-strategies#testing-strategies ## If the endpoint has been requested, we will run the regression tests and show the results ## This validation is running for each combination of parameters and comparing the results from the branch against the resources that we have copied from the production OK - top_browsers(coverage) - https://api.tinybird.co/v0/pipes/top_browsers.json?&pipe_checker=true - 0.318s (9.0%) 8.59 KB (0.0%) OK - top_browsers(coverage) - https://api.tinybird.co/v0/pipes/top_browsers.json?date_start=2020-04-23&date_end=2030-04-23&pipe_checker=true - 0.267s (-58.0%) 8.59 KB (0.0%) OK - top_browsers(coverage) - https://api.tinybird.co/v0/pipes/top_browsers.json?date_from=2020-04-23&date_to=2030-04-23&pipe_checker=true - 0.297s (-26.0%) 0 bytes (0%) 🚨 No requests found for the endpoint top_devices - coverage (Skipping validation). 💡 See this guide for more info about the regression tests => https://www.tinybird.co/docs/production/implementing-test-strategies#testing-strategies ==== Performance metrics ==== --------------------------------------------------------------------- | top_browsers(coverage) | Origin | Branch | Delta | --------------------------------------------------------------------- | min response time | 0.293 seconds | 0.267 seconds | -8.83 % | | max response time | 0.639 seconds | 0.318 seconds | -50.16 % | | mean response time | 0.445 seconds | 0.294 seconds | -33.87 % | | median response time | 0.402 seconds | 0.297 seconds | -26.23 % | | p90 response time | 0.639 seconds | 0.318 seconds | -50.16 % | | min read bytes | 0 bytes | 0 bytes | +0.0 % | | max read bytes | 8.59 KB | 8.59 KB | +0.0 % | | mean read bytes | 5.73 KB | 5.73 KB | +0.0 % | | median read bytes | 8.59 KB | 8.59 KB | +0.0 % | | p90 read bytes | 8.59 KB | 8.59 KB | +0.0 % | --------------------------------------------------------------------- ==== Results Summary ==== -------------------------------------------------------------------------------------------- | Endpoint | Test | Run | Passed | Failed | Mean response time | Mean read bytes | -------------------------------------------------------------------------------------------- | analytics_hits | coverage | 0 | 0 | 0 | +0.0 % | +0.0 % | | trend | coverage | 0 | 0 | 0 | +0.0 % | +0.0 % | | top_locations | coverage | 0 | 0 | 0 | +0.0 % | +0.0 % | | kpis | coverage | 0 | 0 | 0 | +0.0 % | +0.0 % | | top_sources | coverage | 0 | 0 | 0 | +0.0 % | +0.0 % | | top_pages | coverage | 0 | 0 | 0 | +0.0 % | +0.0 % | | top_browsers | coverage | 3 | 3 | 0 | -33.87 % | +0.0 % | | top_devices | coverage | 0 | 0 | 0 | +0.0 % | +0.0 % | -------------------------------------------------------------------------------------------- As you can see, regression tests are run for 
each combination of parameters and the results are compared against the Workspace. In this case, there are no regression issues since adding a new filter. Let's see what happens if you introduce a breaking change in the Pipe definition. First, you'd run some requests using the device filter: ##### Let's run some requests using the device filter curl https://api.tinybird.co/v0/pipes/top_browsers.json?token={TOKEN}&device=mobile-android curl https://api.tinybird.co/v0/pipes/top_browsers.json?token={TOKEN}&date_from=2020-04-23&date_to=2030-04-23&device=desktop Then, introduce a breaking change in the Pipe definition by removing the `device` filter that was added in the previous step: ##### Definition of the top\_browsers.pipe file DESCRIPTION > Top Browsers ordered by most visits. Accepts `date_from` and `date_to` date filter. Defaults to last 7 days. Also `skip` and `limit` parameters for pagination. TOKEN "dashboard" READ NODE endpoint DESCRIPTION > Group by browser and calculate hits and visits SQL > % SELECT browser, uniqMerge(visits) as visits, countMerge(hits) as hits FROM analytics_sources_mv WHERE {% if defined(date_from) %} date >= {{Date(date_from, description="Starting day for filtering a date range", required=False)}} {% else %} date >= timestampAdd(today(), interval -7 day) {% end %} {% if defined(date_to) %} and date <= {{Date(date_to, description="Finishing day for filtering a date range", required=False)}} {% else %} and date <= today() {% end %} GROUP BY browser ORDER BY visits desc LIMIT {{Int32(skip, 0)}},{{Int32(limit, 50)}} At this point, you'd create a Pull Request [like this example](https://github.com/tinybirdco/use-case-examples/pull/215) with the change above so the API Endpoint is tested for regressions. This time, the output is slightly different than before: ##### Run coverage regression tests 🚨 No requests found for the endpoint analytics_hits - coverage (Skipping validation). 💡 See this guide for more info about the regression tests => https://www.tinybird.co/docs/production/implementing-test-strategies#testing-strategies 🚨 No requests found for the endpoint trend - coverage (Skipping validation). 💡 See this guide for more info about the regression tests => https://www.tinybird.co/docs/production/implementing-test-strategies#testing-strategies 🚨 No requests found for the endpoint top_locations - coverage (Skipping validation). 💡 See this guide for more info about the regression tests => https://www.tinybird.co/docs/production/implementing-test-strategies#testing-strategies 🚨 No requests found for the endpoint kpis - coverage (Skipping validation). 💡 See this guide for more info about the regression tests => https://www.tinybird.co/docs/production/implementing-test-strategies#testing-strategies 🚨 No requests found for the endpoint top_sources - coverage (Skipping validation). 💡 See this guide for more info about the regression tests => https://www.tinybird.co/docs/production/implementing-test-strategies#testing-strategies 🚨 No requests found for the endpoint top_pages - coverage (Skipping validation). 
💡 See this guide for more info about the regression tests => https://www.tinybird.co/docs/production/implementing-test-strategies#testing-strategies ## The requests without using the device filter are still working, but the other will return a different number of rows, value OK - top_browsers(coverage) - https://api.tinybird.co/v0/pipes/top_browsers.json?&pipe_checker=true - 0.3s (-45.0%) 8.59 KB (0.0%) FAIL - top_browsers(coverage) - https://api.tinybird.co/v0/pipes/top_browsers.json?device=mobile-android&pipe_checker=true - 0.274s (-43.0%) 8.59 KB (-2.0%) OK - top_browsers(coverage) - https://api.tinybird.co/v0/pipes/top_browsers.json?date_start=2020-04-23&date_end=2020-04-23&pipe_checker=true - 0.341s (18.0%) 8.59 KB (0.0%) OK - top_browsers(coverage) - https://api.tinybird.co/v0/pipes/top_browsers.json?date_from=2020-04-23&date_to=2020-04-23&pipe_checker=true - 0.314s (-49.0%) 0 bytes (0%) FAIL - top_browsers(coverage) - https://api.tinybird.co/v0/pipes/top_browsers.json?date_from=2020-01-23&date_to=2030-04-23&device=desktop&pipe_checker=true - 0.218s (-58.0% skipped < 0.3) 8.59 KB (-2.0%) 🚨 No requests found for the endpoint top_devices - coverage (Skipping validation). 💡 See this guide for more info about the regression tests => https://www.tinybird.co/docs/production/implementing-test-strategies#testing-strategies ==== Failures Detail ==== ❌ top_browsers(coverage) - https://api.tinybird.co/v0/pipes/top_browsers.json?device=mobile-android&pipe_checker=true ** 1 != 4 : Unexpected number of result rows count, this might indicate regression. 💡 Hint: Use `--no-assert-result-rows-count` if it's expected and want to skip the assert. Origin Workspace: https://api.tinybird.co/v0/pipes/top_browsers.json?device=mobile-android&pipe_checker=true&__tb__semver=0.0.0 Test Branch: https://api.tinybird.co/v0/pipes/top_browsers.json?device=mobile-android&pipe_checker=true ❌ top_browsers(coverage) - https://api.tinybird.co/v0/pipes/top_browsers.json?date_from=2020-01-23&date_to=2030-04-23&device=desktop&pipe_checker=true ** 3 != 4 : Unexpected number of result rows count, this might indicate regression. 💡 Hint: Use `--no-assert-result-rows-count` if it's expected and want to skip the assert. Origin Workspace: https://api.tinybird.co/v0/pipes/top_browsers.json?date_from=2020-01-23&date_to=2030-04-23&device=desktop&pipe_checker=true&__tb__semver=0.0.0 Test Branch: https://api.tinybird.co/v0/pipes/top_browsers.json?date_from=2020-01-23&date_to=2030-04-23&device=desktop&pipe_checker=true ==== Performance metrics ==== Error: ** Check Failures Detail above for more information. If the results are expected, skip asserts or increase thresholds, see 💡 Hints above (note skip asserts flags are applied to all regression tests, so use them when it makes sense). If you are using the CI template for GitHub or GitLab you can add skip asserts flags as labels to the MR and they are automatically applied. 
Find available flags to skip asserts and thresholds here => https://www.tinybird.co/docs/production/implementing-test-strategies#testing-strategies --------------------------------------------------------------------- | top_browsers(coverage) | Origin | Branch | Delta | --------------------------------------------------------------------- | min response time | 0.29 seconds | 0.218 seconds | -24.61 % | | max response time | 0.612 seconds | 0.341 seconds | -44.28 % | | mean response time | 0.491 seconds | 0.289 seconds | -41.06 % | | median response time | 0.523 seconds | 0.3 seconds | -42.72 % | | p90 response time | 0.612 seconds | 0.341 seconds | -44.28 % | | min read bytes | 0 bytes | 0 bytes | +0.0 % | | max read bytes | 8.8 KB | 8.59 KB | -2.33 % | | mean read bytes | 6.95 KB | 6.87 KB | -1.18 % | | median read bytes | 8.59 KB | 8.59 KB | +0.0 % | | p90 read bytes | 8.8 KB | 8.59 KB | -2.33 % | --------------------------------------------------------------------- ==== Results Summary ==== -------------------------------------------------------------------------------------------- | Endpoint | Test | Run | Passed | Failed | Mean response time | Mean read bytes | -------------------------------------------------------------------------------------------- | analytics_hits | coverage | 0 | 0 | 0 | +0.0 % | +0.0 % | | trend | coverage | 0 | 0 | 0 | +0.0 % | +0.0 % | | top_locations | coverage | 0 | 0 | 0 | +0.0 % | +0.0 % | | kpis | coverage | 0 | 0 | 0 | +0.0 % | +0.0 % | | top_sources | coverage | 0 | 0 | 0 | +0.0 % | +0.0 % | | top_pages | coverage | 0 | 0 | 0 | +0.0 % | +0.0 % | | top_browsers | coverage | 5 | 3 | 2 | -41.06 % | -1.18 % | | top_devices | coverage | 0 | 0 | 0 | +0.0 % | +0.0 % | -------------------------------------------------------------------------------------------- ❌ FAILED top_browsers(coverage) - https://api.tinybird.co/v0/pipes/top_browsers.json?device=mobile-android&pipe_checker=true ❌ FAILED top_browsers(coverage) - https://api.tinybird.co/v0/pipes/top_browsers.json?date_from=2020-01-23&date_to=2030-04-23&device=desktop&pipe_checker=true Because the API Endpoint filter changed, the request response changed, and the regression testing warns you. If the change is expected, you can skip the assertion by adding the following labels to the Pull Request: - `--no-assert-result` : If you expect the API Endpoint output to be different from the current version - `--no-assert-result-no-error` : If you expect errors from the API Endpoint - `--no-assert-result-rows-count` : If you expect the number of elements in the API Endpoint output to be different than the current version - `--assert-result-ignore-order` : If you expect the API Endpoint output is returning the same elements but in a different order - `--assert-time-increase-percentage -1` : If you expect the API Endpoint execution time to increase more than 25% from the current version - `--assert-bytes-read-increase-percentage -1` : If you expect the API Endpoint bytes read to increase more than 25% from the current version - `--assert-max-time` : If you expect the API Endpoint execution time to vary but you don't want to assert the increase in time up to a certain threshold. For instance, if you want to allow your API Endpoints to respond in up to 1s and you don't want to assert any increase in time percentage use `--assert-max-time 1` . By default is 0.3s. These flags will be applied to ALL the regression tests and are advised to be used for one-time breaking changes. 
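In the CI workflow shown earlier, these labels are read from the Pull Request and appended to the `tb branch regression-tests` command. As a sketch, adding the `--no-assert-result-rows-count` label is roughly equivalent to running the following against the test Branch (this assumes the default `coverage` strategy used by the template):
##### Run coverage regression tests skipping the row-count assert
tb branch regression-tests coverage --wait --no-assert-result-rows-count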
Define a file in `tests/regression.yaml` to configure the behavior of the regression tests for each API Endpoint. Follow the docs in [Configure regression tests](https://www.tinybird.co/docs/about:blank#configure-regression-tests). For this example, you would add the label `--no-assert-result-rows-count` to the PR, as you'd expect the number of rows to be different, since removing the filter is intentional. <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fskip-assert-result-rows-count.png&w=3840&q=75) At this point, the regression tests would pass and the PR could be merged. ### How regression tests work¶ The regression test functionality is powered by `tinybird.pipe_stats_rt`, one of the [service Data Sources](https://www.tinybird.co/docs/docs/monitoring/service-datasources) that is available to all users by default. There are three options to run regression tests: `coverage`, `last` and `manual`. - The `coverage` option will run at least 1 request for each combination of parameters. - The `last` option will run the last N requests for each combination of parameters. - The `manual` option will run the requests you define in the configuration file `tests/regression.yaml`. When you run `tb branch regression-tests coverage --wait`, Tinybird generates a job that queries `tinybird.pipe_stats_rt` to gather all the possible combinations of queries done in the last 7 days for each API Endpoint. ##### Simplification of the query used to gather all the possible combinations SELECT ## Using this function we extract all the parameters used in each request extractURLParameterNames(assumeNotNull(url)) as params, ## According to the option `--sample-by-params`, we run one query for each combination of parameters or more groupArraySample({sample_by_params if sample_by_params > 0 else 1})(url) as endpoint_url FROM tinybird.pipe_stats_rt WHERE pipe_name = '{pipe_name}' ## According to the option `--match`, we will filter only the requests that contain that parameter ## This is especially useful when you want to validate a new parameter you want to introduce or you have optimized the endpoint in that specific case { " AND " + " AND ".join([f"has(params, '{match}')" for match in matches]) if matches and len(matches) > 0 else ''} GROUP BY params FORMAT JSON When you run `tb branch regression-tests last --wait`, Tinybird generates a job that queries `tinybird.pipe_stats_rt` to gather the last N requests for each API Endpoint. ##### Query to get the last N requests SELECT url FROM tinybird.pipe_stats_rt WHERE pipe_name = '{pipe_name}' ## According to the option `--match`, we will filter only the requests that contain that parameter ## This is especially useful when you want to validate a new parameter you want to introduce or you have optimized the endpoint in that specific case { " AND " + " AND ".join([f"has(params, '{match}')" for match in matches]) if matches and len(matches) > 0 else ''} ## According to the option `--limit`, by default 100 LIMIT {limit} FORMAT JSON ### Configure regression tests¶ By default, the CI Workflow uses the `coverage` option when running the regression tests: ##### Default command to run regression tests in the CI Workflow tb branch regression-tests coverage --wait But if it finds a `tests/regression.yaml` file in your project, it uses the configuration defined in that file.
##### Run regression tests for all endpoints in a test branch using a configuration file tb branch regression-tests -f tests/regression.yaml --wait If the file `tests/regression.yaml` is present, only the `--skip-regression-tests` and `--no-skip-regression-tests` labels in the Pull Request will take effect. The configuration file is a YAML file with the following structure: - pipe: '.*' # regular expression that selects all API Endpoints in the Workspace tests: # list of tests to run for this Pipe - [coverage|last|manual]: # testing strategy to use (coverage, last, or manual) config: # configuration options for this strategy assert_result: bool = True # whether to perform an assertion on the results returned by the endpoint assert_result_no_error: bool = True # whether to verify that the endpoint does not return errors assert_result_rows_count: bool = True # whether to verify that the correct number of elements are returned in the results assert_result_ignore_order: bool = False # whether to ignore the order of the elements in the results assert_time_increase_percentage: int = 25 # allowed percentage increase in endpoint response time. use -1 to disable assert assert_bytes_read_increase_percentage: int = 25 # allowed percentage increase in the amount of bytes read by the endpoint. use -1 to disable assert assert_max_time: float = 1 # Max time allowed for the endpoint response time. If the response time is lower than this value then the assert_time_increase_percentage is not taken into account. Default is 0.3 skip: bool = False # whether the test should be skipped, use it to skip individual Pipe tests failfast: bool = False # whether the test should stop at the first error encountered Note that the order of preference for the configuration options is from bottom to top, so the configuration options specified for a particular Pipe take precedence over the options specified earlier (higher up) in the configuration file. Here's an example YAML file with two entries matching Pipes by regular expression, where the second entry overrides the configuration of the first for the `top_pages` Pipe: - pipe: 'top_.*' tests: - coverage: config: # Default config but reducing thresholds from default 25 and expecting different order in the response payload assert_time_increase_percentage: 15 assert_bytes_read_increase_percentage: 15 assert_result_ignore_order: true - last: config: # default config but not asserting performance and failing at first occurrence assert_time_increase_percentage: -1 assert_bytes_read_increase_percentage: -1 failfast: true limit: 5 # Default value will be 10 - pipe: 'top_pages' tests: - coverage: - manual: params: - {param1: value1, param2: value2} - {param1: value3, param2: value4} config: failfast: true # Override config for top_pages executing coverage with default config and two manual requests ## Data quality tests¶ Data quality tests are meant to cover scenarios that shouldn't happen in your production data. For example, you can check that the data is not duplicated or that there are no out-of-range values. Data quality tests are run with the `tb test` command. You can use them in two different ways: - Run them periodically over your production data. - Use them as part of your test suite in Continuous Integration with a Branch or fixtures. Here's an example you can follow along with just by reading - no need to clone anything. In the Web Analytics example, you can use data quality tests to validate that there are no duplicate entries with the same session_id in the `analytics_events` Data Source.
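As a preview of what that duplicate check could look like, here is a minimal sketch. It assumes the `analytics_events` landing Data Source has `timestamp`, `session_id`, and `action` columns, and it appends a test entry to the `tests/default.yaml` file that `tb test init` (described next) generates:
##### Sketch: a duplicate-entries data quality test
cat >> tests/default.yaml << 'EOF'
# One way to express the "no duplicate entries" check: the same event
# (timestamp, session_id, action) should not appear more than once
- no_duplicate_events:
    max_bytes_read: null
    max_time: null
    pipe: null
    sql: |
      SELECT timestamp, session_id, action, count() AS c
      FROM analytics_events
      GROUP BY timestamp, session_id, action
      HAVING c > 1
EOF
The test passes only when the query returns no rows, following the same convention as the generated examples below.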
Run `tb test init` to generate a dummy test file in `tests/default.yaml`: ##### Example of tests generated by running tb test init # This test should always pass as numbers(5) returns values from 0 to 4 - this_test_should_pass: max_bytes_read: null max_time: null pipe: null sql: SELECT * FROM numbers(5) WHERE 0 # This test should always fail as numbers(5) returns values from 0 to 4. Therefore the SQL will return a value - this_test_should_fail: max_bytes_read: null max_time: null pipe: null sql: SELECT * FROM numbers(5) WHERE 1 # If max_time is specified, the test will show a warning if the query takes more than the threshold - this_test_should_pass_over_time: max_bytes_read: null max_time: 1.0e-07 pipe: null sql: SELECT * FROM numbers(5) WHERE 0 # if max_bytes_read is specified, the test will show a warning if the query reads more bytes than the threshold - this_test_should_pass_over_bytes: max_bytes_read: 5 max_time: null pipe: null sql: SELECT sum(number) AS total FROM numbers(5) HAVING total>1000 # The combination of both - this_test_should_pass_over_time_and_bytes: max_bytes_read: 5 max_time: 1.0e-07 pipe: null sql: SELECT sum(number) AS total FROM numbers(5) HAVING total>1000 These tests check that the SQL query returns an empty result. If the result is not empty, the test fails. Following the same pattern, the example Pull Request below adds a `no_duplicate_entries` test; this particular check verifies that no date has a negative aggregated `total_sales` in `top_products_view`: - no_duplicate_entries: max_bytes_read: null max_time: null sql: | SELECT date, sumMerge(total_sales) total_sales FROM top_products_view GROUP BY date HAVING total_sales < 0 You can follow along with the [example Pull Request](https://github.com/tinybirdco/use-case-examples/pull/216) that would be made, and see the [Run tests](https://github.com/tinybirdco/use-case-examples/actions/runs/7626991493/job/20774698416?pr=216) workflow output: ##### Output of data quality tests -------------------------- semver: 0.0.2 file: ./tests/default.yaml test: no_duplicate_entries status: Pass elapsed: 0.001710404 ms -------------------------- Totals: Total Pass: 1 If the test fails, the CI workflow fails, and the output shows the rows returned by the SQL query. You can also test the output of a Pipe. For instance, in the Web Analytics example, you could add a validation: 1. First, query the API Endpoint `top_products` with the parameters `date_start` and `date_end` specified in the test. 2. Then, run the SQL query over the result of the previous step. ##### Data quality test for a Pipe - products_by_date: max_bytes_read: null max_time: null sql: | SELECT 1 FROM top_products HAVING count() = 0 pipe: name: top_products params: date_start: '2020-01-01' date_end: '2020-12-31' ## Fixture tests¶ Regression tests confirm the backward compatibility of your API Endpoints when overwriting them, and data quality tests cover scenarios that shouldn't happen in test or production data. Sometimes, you need to cover a very specific scenario, or a use case that is being developed and you don't have production data for it. This is when to use fixture tests. To configure fixture testing you need: - A script to run fixture tests, like [this example](https://github.com/tinybirdco/ci/blob/main/scripts/exec_test.sh). The script is automatically created when you [connect your Workspace to Git](https://www.tinybird.co/docs/docs/production/working-with-version-control). - Data fixtures: These are datafiles placed in the `datasources/fixtures` folder of your project.
Their name must match the name of a Data Source. - Fixture tests in the `tests` folder. In the Continuous Integration job, a Branch is created. If fixture data exists in the `datasources/fixtures` folder, it is appended to the Data Sources in the Branch, and the fixture tests are run. To effectively use data fixtures you should: - Use data that does not collide with production data, to avoid unexpected results in regression testing. - Only use data fixtures for landing Data Sources, since Materialized Views are populated automatically. - Use future dates in the data fixtures to avoid problems with the TTL of the Data Sources. Fixture tests are placed inside the `tests` folder of your project. If you have a lot of tests, create subfolders to organize the tests by module or API Endpoint. Each fixture test requires two files: - One to indicate a request to an API Endpoint with the naming convention `.test` - One to indicate the exact response to the API Endpoint with the naming convention `.test.result` For instance, to test the output of the `top_browsers` endpoint, create a `simple_top_browsers.test` fixture test with this content: ##### simple\_top\_browsers.test tb pipe data top_browsers --date_from 2100-01-01 --date_to 2100-02-01 --format CSV The test makes a request to the `top_browsers` API Endpoint, passing the `date_from` and `date_to` parameters and requesting the response in `CSV` format. Next, create a `simple_top_browsers.test.result` file with the expected result given your data fixtures: ##### simple\_top\_browsers.test.result "browser","visits","hits" "chrome",1,1 With this approach, your tests should run like this [example Pull Request](https://github.com/tinybirdco/use-case-examples/pull/217/). Fixture tests are run as part of the Continuous Integration pipeline and the Job fails if the tests fail. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Ffixture-tests-CI-step.png&w=3840&q=75) ## Next steps¶ - Learn how to use [staging and production Workspaces](https://www.tinybird.co/docs/docs/production/staging-and-production-workspaces). - Check out an [example test Pull Request](https://github.com/tinybirdco/use-case-examples/pull/217/). --- URL: https://www.tinybird.co/docs/production/organizing-resources Last update: 2024-11-13T15:42:17.000Z Content: --- title: "Organizing resources in Workspaces · Tinybird Docs" theme-color: "#171612" description: "Tinybird projects can have many resources. Here's how to organize them." --- # Organizing resources in Workspaces¶ When working with Tinybird, projects can quickly accumulate a multitude of resources. Effective organization is essential for maintaining efficiency and clarity. While there’s no one-size-fits-all approach, we recommend structuring each Workspace around a specific use case. This practice keeps related resources together and simplifies collaboration and management. In the Tinybird UI, your Data Project is automatically divided into two main folders: Pipes and Data Sources. These folders are fundamental and cannot be deleted or removed. However, you have the flexibility to further organize your resources by assigning specific attributes using Tags. ## Organizing resources in the UI¶ ### Assigning Tags to Resources¶ Tags are a powerful way to categorize and manage your resources. You can add tags to both Pipes and Data Sources: - For Pipes: You’ll find the option to add tags directly under the title of the Pipe or in the Pipe list screen.
<-figure-> ![pipe tags](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Forganizing-resources-1.png&w=3840&q=75) - For Data Sources: Tags can be added in the metadata section of each Data Source or in the Data Source list screen. <-figure-> ![datasource tags](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Forganizing-resources-2.png&w=3840&q=75) When you create a tag, it becomes available to assign to multiple resources, allowing for consistent categorization across your project. <-figure-> ![creating tag](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Forganizing-resources-3.png&w=3840&q=75) If you didn’t assign tags when creating your resources, you can always add or modify them later from the general lists of Pipes and Data Sources. <-figure-> ![bulk tags](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Forganizing-resources-4.png&w=3840&q=75) ### Filtering Resources¶ One of the main advantages of using tags is the ability to filter your resources in the sidebar. <-figure-> ![tag filters](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Forganizing-resources-5.png&w=3840&q=75) For instance, if you have a tag called “latest” to denote the most recent versions of your resources, you can easily filter the sidebar to display only those tagged resources. This is especially helpful when your Workspace contains numerous resources, and you need to focus on specific versions or categories to streamline your workflow. By effectively using tags, you can enhance your productivity and maintain a well-organized Workspace, making it easier to navigate and manage your Tinybird projects. ### Renaming Tags¶ If you need to change the name of a tag, go to any resource that is tagged with the tag you want to rename. Click on the tag and choose **"Rename tag"**. <-figure-> ![renaming tags](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Frenaming-tags.gif&w=3840&q=75) ### Deleting Tags permanently¶ If you need to delete a tag from the whole Workspace, go to any resource that is tagged with the tag you want to delete. Click on the tag and choose **"Delete permanently"**. You'll see a list with all resources attached to the tag. Once you confirm, the tag will be deleted and the resources detached from the tag. <-figure-> ![deleting tags](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fdeleting-tags.gif&w=3840&q=75) ## Organizing resources using the CLI¶ Tags are supported in CLI versions 5.8.0 and above. ### Tags in datafiles¶ Tags are not restricted to working in the UI. If you use the CLI to manage your projects, you can also add tags to Data Sources and Pipes using their datafiles. The tags syntax is `TAGS` **followed by comma-separated values**, at the top section of the files. TAGS "stock_case, stats" To tag a Data Source, write the `TAGS` directive right before `SCHEMA`: ##### tinybird/datasources/example.datasource TOKEN tracker APPEND DESCRIPTION > Analytics events **landing data source** TAGS "stock_case, stats" SCHEMA > `timestamp` DateTime `json:$.timestamp`, `session_id` String `json:$.session_id`, `action` LowCardinality(String) `json:$.action`, ...
To tag a Pipe, write the `TAGS` directive right before the `NODE` s settings ##### tinybird/pipes/sales\_by\_hour\_mv.pipe DESCRIPTION materialized Pipe to aggregate sales per hour in the sales_by_hour Data Source TAGS "stock_case, stats" NODE daily_sales SQL > SELECT toStartOfDay(starting_date) day, country, sum(sales) as total_sales FROM teams GROUP BY day, country TYPE MATERIALIZED DATASOURCE sales_by_hour When you deploy your changes, the resource will get tagged. You can use the tags to filter the data project later in the UI. Tags get created automatically. You don't have to worry about creating a new one before tagging a resource. If the tag you've written doesn't exist, pushing that file will create it. To know more about publishing changes to your data project, read the [Deployment Strategies](https://www.tinybird.co/docs/docs/production/deployment-strategies) docs. To know more about the datafiles syntax, read the [datafiles](https://www.tinybird.co/docs/docs/cli/datafiles/overview) doc in the CLI section. ### tb tag¶ `tb tag` is a CLI command to manage your Workspace tags. You can create, list and delete tags in your workspace. Please refer to the [tb tag command reference](https://www.tinybird.co/docs/docs/cli/command-ref#tb-tag) for more details. Do you miss any functionality in our tags system? Would you like any other command in the CLI regarding tags? Do you have in mind any other feature that would benefit from our tagging system? We would love to hear your feedback. Please, contact Tinybird at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) or in the [Community Slack](https://www.tinybird.co/docs/docs/community). --- URL: https://www.tinybird.co/docs/production/overview Last update: 2024-07-31T16:54:46.000Z Content: --- title: "Continuous Integration Introduction · Tinybird Docs" theme-color: "#171612" description: "Learn about the importance of CI/CD in Tinybird, how it integrates with Git, and the advantages it offers for managing real-time data projects." --- # Prepare your data project for production¶ Tinybird's Git integration transforms data pipeline management, aligning it with established software development practices. This ensures each Tinybird Workspace action relates to a specific Git commit, offering a robust, version-controlled environment for your real-time data deployments. In a nutshell, it brings your workflow closer to industry best practices. ## Why use version control?¶ Version control, a standard in software development, is now integral to building real-time data products with Tinybird. If your Workspace uses Tinybird's integration with Git, it means you can build real-time data products like you build any software - not just benefiting from version control, but also isolated environments, CI/CD, and testing too. This approach allows you to treat and manage your real-time data in the **same way you manage your code** . You can take your existing software engineering knowledge and apply the same principles to real-time data products. When managing your Tinybird projects with version control, you can also: - Sync Tinybird actions with your Git commits. - Deploy semantically-versioned Data Sources, Pipes, and API Endpoints as code. - Use Branches to attach production data to non-production Branches, and test your data pipelines safely with real data. If you're familiar with version control already, it should make testing, merging, and deploying your Tinybird data projects a familiar process. 
Data teams can use Tinybird in the same way that software engineering teams work: enforcing standardized agreements around version control, code reviews, quality assurance, testing strategies, and continuous deployment. ## Next steps¶ - Get familiar with the core concepts: [Branches](https://www.tinybird.co/docs/docs/concepts/branches) and [Workspaces](https://www.tinybird.co/docs/docs/concepts/workspaces). - Follow the version control tutorial to connect your Tinybird Workspace to version control: [Working with version control](https://www.tinybird.co/docs/docs/production/working-with-version-control). - Explore our [repository of common use cases for iterating using version control](https://github.com/tinybirdco/use-case-examples) to try out how to iterate Tinybird projects with version control. --- URL: https://www.tinybird.co/docs/production/populate-data Last update: 2024-11-06T17:38:37.000Z Content: --- title: "Populate and copy data between Data Sources · Tinybird Docs" theme-color: "#171612" description: "Learn how populating data within Tinybird works, including the details of reingesting data and important considerations for maintaining data integrity." --- # Populate and copy data between Data Sources¶ You can use Tinybird to populate Data Sources using existing data through the Populate and Copy operations. Both are often used in similar scenarios, with the main distinction being that you can schedule Copy jobs, and that they have more restrictions. Read on to learn how populating data within Tinybird works, including the details of reingesting data and important considerations for maintaining data integrity. ## Populate by partition¶ Populating data by partition requires as many steps as there are partitions in the origin Data Source. You can use this approach for progress tracking and dynamic resource recalculation while the job is in progress. You can also retry steps in case of a memory limit error, which is safer and more reliable. The following diagram shows a populate scenario that involves two Data Sources. Given that Data Source A has 3 partitions: - The job processes the data partition by partition from the source, `Data Source A`. - After the data has been processed, it updates the data on the destination, `Data Source B`. ## Understanding the Data Flow¶ As a use case expands, it might develop a complex Data Flow. Here are three key points to consider: - Data is processed by partition from the origin: each step handles data from a single partition of the origin Data Source. - When more than one Materialized Pipe exists for the same Data Source, the execution order is not deterministic. - Destination Data Sources in the Data Flow only use the data from the specific partition being processed. The following examples illustrate the behavior of Populate jobs in different scenarios. ### Case 1: Joining data from a Data Source that isn't a destination in the Data Flow¶ When using a Data Source (Data Source C) in a Materialized Pipe query, if Data Source C isn't a destination Data Source for any other Materialized Pipe in the Data Flow, it uses all available data in Data Source C at that moment. ### Case 2: Joining data from a Data Source that is a destination in the same Materialized View¶ When using the destination Data Source `Data Source B` in the Materialized Pipe query, `Data Source B` doesn't join any data. This occurs because the data is processed by partition, and the required partition isn't available in the destination at that time.
### Case 3: Joining data from a Data Source that is a destination in another Materialized View¶ When using a Data Source (Data Source C) in a Materialized Pipe (Materialized Pipe 3) query that is the destination of another Materialized Pipe (Materialized Pipe 2) in the Data Flow, it retrieves the data ingested during the process. Whether Data Source C contains data before the view on Materialized Pipe 3 runs isn't deterministic. Because the order depends on the internal ID, you can't determine which Data Source is updated first. To control the order of the Data Flow, run each populate operation separately: 1. Run a populate over Materialized Pipe 1 to populate the data from Data Source A to Data Source B. To prevent automatic data propagation through the rest of the Materialized Views, either unlink the views or truncate the dependent Data Sources if they are repopulated. 2. Perform separate populate operations on Materialized Pipe 2 and Materialized Pipe 3, instead of a single operation on Materialized Pipe 1. ## Learn more¶ Before you use Populate and Copy operations for different [backfill strategies](https://www.tinybird.co/docs/docs/production/backfill-strategies), understand how they work within the Data Flow and its limitations. Read also the [Materialized Views guide](https://www.tinybird.co/docs/docs/publish/materialized-views/best-practices): populating data while continuing ingestion into the origin Data Source might lead to duplicated data. --- URL: https://www.tinybird.co/docs/production/staging-and-production-workspaces Last update: 2024-11-08T11:23:54.000Z Content: --- title: "Staging and production Workspaces · Tinybird Docs" theme-color: "#171612" description: "Tinybird projects can be deployed to multiple Workspaces. You can use this to create production and staging environments for your project." --- # Staging and production Workspaces¶ Tinybird projects can be deployed to multiple Workspaces that contain different data or different connection settings. You can use this to create production and staging environments for your project. Keep your Workspaces clear and distinct. While you can deploy a project to **multiple** Workspaces, you should avoid deploying multiple projects to the **same** Workspace. This page covers how to set up a CI/CD workflow with staging and production Workspaces. This setup includes: - A staging Workspace with staging data and connection settings, used to validate changes to your project before deploying to production. - A production Workspace with production data and connection settings, used to serve your data to real users. - A CI/CD workflow that will run CI over the staging Workspace, deploy manually to the staging Workspace, and finally automatically promote to the production Workspace upon merge. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fstaging-prod.png&w=3840&q=75) ## Example project setup¶ Create staging and production Workspaces and authenticate using your Workspace admin Token.
##### Create Workspaces tb workspace create staging_acme --user_token tb workspace create pro_acme --user_token Push the project to the production Workspace: ##### Recreating the project tb workspace use pro_acme tb push --push-deps --fixtures Finally, push the project to the staging Workspace: ##### Recreating the project tb workspace use staging_acme tb push --push-deps --fixtures Once you have the project deployed to both Workspaces, make sure you [connect them to Git](https://www.tinybird.co/docs/docs/production/working-with-version-control) by: 1. Running `tb auth` using the admin token of the Workspace. 2. Running `tb init --git` to connect the Workspace to your git main branch. You'll need to run both steps twice (once for each Workspace). When running `tb init --git` you need to make sure you end up with a setup like this one: - `tinybird_ci_stg.yml`: Runs CI over the staging Workspace. - `tinybird_cd_stg.yml`: Runs CD over the staging Workspace. You can run this job manually or follow any other strategy. - `tinybird_cd.yml`: Runs CD over the production Workspace when merging a branch into the main git branch. ## Configuring CI/CD¶ The following CI/CD jobs are based on the examples in the [Continuous Integration](https://www.tinybird.co/docs/docs/production/continuous-integration) section. A common pattern is to run CD in the staging Workspace before moving to production. Below is an example of a GitHub Actions workflow that will perform CI/CD on the staging Workspace. ##### Staging CI pipeline name: Tinybird - Staging CI Workflow on: workflow_dispatch: pull_request: paths: - './**' branches: - main types: [opened, reopened, labeled, unlabeled, synchronize, closed] concurrency: ${{ github.workflow }}-${{ github.event.pull_request.number }} jobs: staging_ci: # run CI to staging Workspace uses: tinybirdco/ci/.github/workflows/ci.yml@main with: data_project_dir: . tb_env: stg secrets: tb_admin_token: ${{ secrets.TB_ADMIN_TOKEN_STG }} # set the Workspace admin Token from the staging Workspace in a new secret tb_host: https://app.tinybird.co ##### Staging CD pipeline name: Tinybird - Staging CD Workflow on: workflow_dispatch: jobs: staging_cd: # deploy changes to staging Workspace uses: tinybirdco/ci/.github/workflows/cd.yml@main with: data_project_dir: . tb_env: stg secrets: tb_admin_token: ${{ secrets.TB_ADMIN_TOKEN_STG }} # set the Workspace admin Token from the staging Workspace in a new secret tb_host: https://app.tinybird.co The important part here is that we are using the admin token from the staging Workspace, stored in a secret named `TB_ADMIN_TOKEN_STG`. Additionally, we are setting the `tb_env` variable to `stg`; usage of the `tb_env` input variable is described in the next section. The CD GitHub Action for the production Workspace would look like this: ##### Production CD pipeline name: Tinybird - Production CD Workflow on: workflow_dispatch: push: paths: - './**' branches: - main jobs: production_cd: # deploy changes to production Workspace uses: tinybirdco/ci/.github/workflows/cd.yml@main with: data_project_dir: .
tb_env: prod secrets: tb_admin_token: ${{ secrets.TB_ADMIN_TOKEN }} # set the Workspace admin Token from the production Workspace in a new secret tb_host: https://app.tinybird.co In this case, the job will run on merge to the git `main` branch, using the admin token from the production Workspace and setting the `tb_env` variable to `prod`. ## Workspace connection credentials and environment variables¶ You'll likely use different credentials or connection parameters for each Workspace, for instance if you have a Kafka topic, S3 bucket, or similar for staging, and a different one for production. Use environment variables to manage these credentials. If you are using your own CI/CD pipelines, see the [docs on Include and environment variables](https://www.tinybird.co/docs/docs/cli/datafiles/include-files#include-with-environment-variables). If you are using the ones provided in this [GitHub repository](https://github.com/tinybirdco/ci), you can follow this strategy: - Set the `tb_env` input variable as shown in the previous section to a value depending on the environment to be deployed. - You then have a `$TB_ENV` environment variable available to configure connection credentials via `include` files, as described in [this section](https://www.tinybird.co/docs/docs/cli/datafiles/include-files#include-with-environment-variables). - Now you can create `connection_details_stg.incl` and `connection_details_prod.incl` and use them inside datafiles like this: `INCLUDE connection_details_${TB_ENV}.incl`. Since `$TB_ENV` has a different value depending on the pipeline for the staging or production environments, resources will be deployed using their environment-specific credentials. You can see a working example [here](https://github.com/tinybirdco/use-case-examples/pull/351/files). ## Staging Workspace vs Branch¶ Branches are intended to be ephemeral or short-lived. They are useful for testing changes while you are developing them. Typically, when you develop a new feature, you'll begin by creating a new Branch. Your development work takes place on the Branch, and you can test your changes as you go. On the other hand, Workspaces are intended to be permanent or long-lived. You'll use a Workspace to deploy your project into production, as a testing environment, or to experiment with new projects. Staging Workspaces are optional, and different teams might use them differently, for example: - You don't want to test with your production data, so you have a separate, well-known subset of data in staging. - You want to perform integration testing with the development version of your project before moving it to the production Workspace. - You want to test out a complex deployment or data migration before deploying to the production Workspace. ## Next steps¶ - Learn more about [working with version control](https://www.tinybird.co/docs/docs/production/working-with-version-control). - Learn how to integrate Workspaces with Git in a [Continuous Integration and Deployment (CI/CD)](https://www.tinybird.co/docs/docs/production/continuous-integration) pipeline. --- URL: https://www.tinybird.co/docs/production/working-with-version-control Last update: 2024-11-12T11:45:41.000Z Content: --- title: "Working with version control · Tinybird Docs" theme-color: "#171612" description: "With Tinybird, you can take your existing version control knowledge from software engineering, and apply it immediately to your real-time data products."
--- # Working with version control¶ With Tinybird, you can take your existing version control knowledge from software engineering, and apply it immediately to your real-time data products, using Git. ## How the Git integration works¶ Tinybird's Git integration creates a bi-directional link between your Tinybird Workspace and a remote Git repository. This means you can work on Tinybird resources locally, push to Git and then sync to Tinybird. In this way, Git becomes the ultimate source of truth for your project. You can then follow standard development patterns, such as working in feature branches, using pull requests, and running CI/CD pipelines to validate changes and deploy to production. When you connect your Workspace to a Git repository, we suggest following [CI/CD](https://www.tinybird.co/docs/docs/production/continuous-integration) to deploy to the Workspace. The CI/CD pipelines need to be executed outside of Tinybird in a runner such as GitHub Actions or GitLab Runner. You can connect your Workspace to Git using the [CLI](https://www.tinybird.co/docs/about:blank#connect-your-workspace-to-git-from-the-cli). You can connect both new and existing Tinybird Workspaces with Git at any time. If you're connecting an existing Workspace for the first time, your project syncs with the remote Git-based repository the moment you connect it. If you make changes in Tinybird after connecting with Git, you can create a Branch and merge the changes with a Pull Request. ## Project structure¶ A Tinybird project is represented by a collection of text files (called [datafiles](https://www.tinybird.co/docs/docs/cli/datafiles/overview)) that are organized in folders. You can initialize a new Tinybird project with automatic scaffolding using the `tb init` CLI command. This creates the following files and folders: - /datasources - /datasources/fixtures - /endpoints - /pipes - /tests - /scripts - /scripts/exec_test.sh - /scripts/append_fixtures.sh - /deploy If you have an existing project created in the UI, use the `tb pull --force` command to download all resources from your Workspace, creating the same structure as above. The purpose of these files and folders is as follows: - `datasources` : Where you put your .datasource files. - `datasources/fixtures` : Place any CSV or NDJSON files that should be pushed when using the default `./scripts/append_fixtures.sh` script. They need to share a name with a .datasource file. - `endpoints` : You can use this folder to create a logical separation between non-Endpoint Pipes and Endpoint Pipes, though it is not necessary. By default, all .pipe files will be placed in the `pipes/` directory when pulled from Tinybird. - `pipes` : Where you put your .pipe files. - `tests` : Where you put [data quality and fixture tests](https://www.tinybird.co/docs/docs/production/implementing-test-strategies). - `scripts` : Useful scripts for common operations like data migrations, fixture tests, etc. - `deploy` : Custom deployment shell scripts. ## Connect your Workspace to Git from the CLI¶ To connect your Workspace to Git, you will need a Tinybird [Workspace](https://www.tinybird.co/docs/docs/concepts/workspaces) and a Git repository. You can either connect a pre-existing repository, or create a new one as part of this process. If you do not already have the CLI installed, follow the instructions [here](https://www.tinybird.co/docs/docs/cli/install). To initialize your Workspace with Git, run the `tb init --git` command.
It performs the following actions: - Checks there are no differences between your local files and your Tinybird Workspace. - Saves a reference to the current Git repository commit in the Workspace. This commit reference is used later on to diff Workspace resources and resources in a branch, to ease deployment. ##### Initialize Tinybird with a Git repository tb init --git ** - /datasources already exists, skipping ** - /datasources/fixtures already exists, skipping ** - /endpoints already exists, skipping ** - /pipes already exists, skipping ** - /tests already exists, skipping ** - /scripts already exists, skipping ** - /deploy already exists, skipping ** - '.tinyenv' already exists, skipping ** Checking diffs between remote Workspace and local. Hint: use 'tb diff' to check if your data project and Workspace synced Pulling datasources [####################################] 100% Pulling pipes [####################################] 100% Pulling tokens [####################################] 100% ** No diffs detected for 'workspace' Once complete, create CI/CD actions to integrate Tinybird with your Git provider. [These actions](https://www.tinybird.co/docs/docs/production/continuous-integration) should be based on your development pipeline, and the CLI commands Tinybird provides offer an excellent basis upon which to validate changes and deploy to Tinybird safely from Git. Add the `.tinyb` file to your `.gitignore` to avoid pushing the Tinybird configuration files to your Git provider. ##### Pushing the CI/CD actions to your Git provider echo ".tinyb" >> .gitignore git add . git commit -m "Add Tinybird CI/CD actions" git push You must save your Workspace admin [Token](https://www.tinybird.co/docs/docs/concepts/auth-tokens) as a secret in your repository. For example, with GitHub, go to your repository settings, and under Secrets and Variables / Actions, add the Token value in a secret called `TB_ADMIN_TOKEN`. You can make your shell PROMPT print your current Workspace by following this [CLI guide](https://www.tinybird.co/docs/docs/cli/install#configure-your-shell-prompt). ## Protecting the main Workspace¶ Once you decide to use this version-controlled workflow, the Git repository becomes your single source of truth. You'll want to keep production protected so users can't modify resources and break the Git workflow. If you want to prevent users from making changes to the Workspace from the CLI or API, open the Tinybird UI and navigate to the Workspace settings menu, then the **Members** tab, and assign the **Viewer** role. Users with a Viewer role aren't able to create, edit, or delete resources or run data operations. They are allowed to create a new Branch and change resources there. ## Development workflow¶ This section explains how to safely develop new features using [Branches](https://www.tinybird.co/docs/docs/concepts/branches). ### Develop using the CLI¶ When prototyping a new API Endpoint or changing its logic, we recommend you use the UI. It is the easiest and fastest way to iterate and validate your changes, and you can see the results of your changes in real time. You can use a branch and then `tb pull` your changes to push them to Git (a minimal sketch of this flow follows below). But when making changes like data migrations or column type changes, you need to use the CLI and modify the datafiles. Make sure you're familiar with [the Tinybird CLI docs](https://www.tinybird.co/docs/docs/cli/quick-start).
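For instance, a rough sketch of that prototype-then-commit flow (the Git branch name and commit message here are hypothetical, and this assumes your CLI session is already authenticated with `tb auth` against the Workspace or Branch you prototyped in):

```bash
# Pull the datafiles for the changes you prototyped in the UI
tb pull --force

# Commit them to a Git feature branch and open a Pull Request to run CI
git checkout -b update-endpoint-logic
git add .
git commit -m "Update Endpoint logic prototyped in the UI"
git push -u origin update-endpoint-logic
```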
For changes like these, use the Git workflow: Create a new branch, make the changes in the datafiles, and create a Pull Request to validate the changes. For further guidance, read the [CI/CD docs](https://www.tinybird.co/docs/docs/production/continuous-integration). This image visualizes each process step when setting up and working with Git via the Tinybird CLI: <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fconnect-to-git.png&w=3840&q=75) ## Exploration workflow¶ ### The Playground¶ Once you have protected production from user modifications, you might not be able to create Pipes directly. To explore the data, you need to either create a new Branch, or use [the Playground](https://www.tinybird.co/docs/docs/query/overview#use-the-playground). Once you've prototyped your new Pipe, download the Pipe from the UI and commit the change to a new Git branch to follow the CI workflow. By default, Playground content is private to your Workspace view. However, you have the option to share your Playground with other Workspace members. ## Troubleshooting¶ This section covers some of the problems you might face when working with version control. ### Connect to more than one Workspace¶ You can have one Git repo connected to two Tinybird Workspaces; a common scenario is using this for [staging and production Workspaces](https://www.tinybird.co/docs/production/staging-and-production-workspaces). Just use a different `ADMIN_TOKEN` in the GitHub Actions. ### Initialization¶ When your Git repository and Tinybird Workspace have the same resources, `tb init --git` allows you to easily start up a CI/CD pipeline. However, when these are not in sync, you might experience one of these three scenarios: **Problem: There are resources in the Workspace that are not present in your Git repository.** Remove them from your Workspace (they are probably unnecessary) or run `tb pull --match ""` to download the resource(s) locally and push them to the Git repository before continuing. **Problem: There are resources in your Git repository that are not present in your Workspace.** Either remove them from the project, or `tb push` them to the Workspace and re-run the `init` command. **Problem: There are differences between the resources in the Git repository and the ones in your Workspace.** In this instance, you must decide which version is the source of truth. If it is the resources in the Workspace, run `tb pull --force` to overwrite your local files. If it is your local files, run `tb push --force` file by file so the Workspace is synchronized with your project in Git. Use `tb diff` to check if your project is synced. It diffs local [datafiles](https://www.tinybird.co/docs/docs/cli/datafiles/overview) to the corresponding resources in the Workspace. ### Git and Workspace no longer synced¶ When you deploy, the Git commit ID of that deployment is stored in the Workspace. This means you can use the `git diff` command to compare your Branch against the main one, and know which resources have been modified and need to be deployed. If you introduce a manual change in your Workspace or Git outside of the CI/CD workflow, the two will no longer be synced. To get both components back in sync with each other, use the command `tb init --override-commit ` . This command will override the Git commit ID from the Workspace. ## Common use cases¶ Version control allows you to incrementally change or iterate your data project.
It's ideal for managing changes like adding a column, changing data types, redefining whole views, and lots more. If you're new to using version control or want to be sure how to do it on Tinybird, there's an entire repository of use cases available: [tinybirdco/use-case-examples](https://github.com/tinybirdco/use-case-examples). ## Next steps¶ - Read about[ datafiles](https://www.tinybird.co/docs/docs/cli/datafiles/overview) , the text files that describe your Tinybird resources (Data Sources, Pipes). - Understand[ CI/CD processes on Tinybird](https://www.tinybird.co/docs/docs/production/continuous-integration) . --- URL: https://www.tinybird.co/docs/publish/api-endpoints/list-of-errors Last update: 2024-11-05T10:29:52.000Z Content: --- title: "List of API Endpoint database errors · Tinybird Docs" theme-color: "#171612" description: "The following list contains all internal database errors that an API Endpoint might return, and their numbers." --- # List of internal database errors¶ API Endpoint responses have an additional HTTP header, `X-DB-Exception-Code` , where you can check the internal database error, reported as a stringified number. The following list contains all internal database errors and their numbers: - `UNSUPPORTED_METHOD = "1"` - `UNSUPPORTED_PARAMETER = "2"` - `UNEXPECTED_END_OF_FILE = "3"` - `EXPECTED_END_OF_FILE = "4"` - `CANNOT_PARSE_TEXT = "6"` - `INCORRECT_NUMBER_OF_COLUMNS = "7"` - `THERE_IS_NO_COLUMN = "8"` - `SIZES_OF_COLUMNS_DOESNT_MATCH = "9"` - `NOT_FOUND_COLUMN_IN_BLOCK = "10"` - `POSITION_OUT_OF_BOUND = "11"` - `PARAMETER_OUT_OF_BOUND = "12"` - `SIZES_OF_COLUMNS_IN_TUPLE_DOESNT_MATCH = "13"` - `DUPLICATE_COLUMN = "15"` - `NO_SUCH_COLUMN_IN_TABLE = "16"` - `DELIMITER_IN_STRING_LITERAL_DOESNT_MATCH = "17"` - `CANNOT_INSERT_ELEMENT_INTO_CONSTANT_COLUMN = "18"` - `SIZE_OF_FIXED_STRING_DOESNT_MATCH = "19"` - `NUMBER_OF_COLUMNS_DOESNT_MATCH = "20"` - `CANNOT_READ_ALL_DATA_FROM_TAB_SEPARATED_INPUT = "21"` - `CANNOT_PARSE_ALL_VALUE_FROM_TAB_SEPARATED_INPUT = "22"` - `CANNOT_READ_FROM_ISTREAM = "23"` - `CANNOT_WRITE_TO_OSTREAM = "24"` - `CANNOT_PARSE_ESCAPE_SEQUENCE = "25"` - `CANNOT_PARSE_QUOTED_STRING = "26"` - `CANNOT_PARSE_INPUT_ASSERTION_FAILED = "27"` - `CANNOT_PRINT_FLOAT_OR_DOUBLE_NUMBER = "28"` - `CANNOT_PRINT_INTEGER = "29"` - `CANNOT_READ_SIZE_OF_COMPRESSED_CHUNK = "30"` - `CANNOT_READ_COMPRESSED_CHUNK = "31"` - `ATTEMPT_TO_READ_AFTER_EOF = "32"` - `CANNOT_READ_ALL_DATA = "33"` - `TOO_MANY_ARGUMENTS_FOR_FUNCTION = "34"` - `TOO_FEW_ARGUMENTS_FOR_FUNCTION = "35"` - `BAD_ARGUMENTS = "36"` - `UNKNOWN_ELEMENT_IN_AST = "37"` - `CANNOT_PARSE_DATE = "38"` - `TOO_LARGE_SIZE_COMPRESSED = "39"` - `CHECKSUM_DOESNT_MATCH = "40"` - `CANNOT_PARSE_DATETIME = "41"` - `NUMBER_OF_ARGUMENTS_DOESNT_MATCH = "42"` - `ILLEGAL_TYPE_OF_ARGUMENT = "43"` - `ILLEGAL_COLUMN = "44"` - `ILLEGAL_NUMBER_OF_RESULT_COLUMNS = "45"` - `UNKNOWN_FUNCTION = "46"` - `UNKNOWN_IDENTIFIER = "47"` - `NOT_IMPLEMENTED = "48"` - `LOGICAL_ERROR = "49"` - `UNKNOWN_TYPE = "50"` - `EMPTY_LIST_OF_COLUMNS_QUERIED = "51"` - `COLUMN_QUERIED_MORE_THAN_ONCE = "52"` - `TYPE_MISMATCH = "53"` - `STORAGE_DOESNT_ALLOW_PARAMETERS = "54"` - `STORAGE_REQUIRES_PARAMETER = "55"` - `UNKNOWN_STORAGE = "56"` - `TABLE_ALREADY_EXISTS = "57"` - `TABLE_METADATA_ALREADY_EXISTS = "58"` - `ILLEGAL_TYPE_OF_COLUMN_FOR_FILTER = "59"` - `UNKNOWN_TABLE = "60"` - `ONLY_FILTER_COLUMN_IN_BLOCK = "61"` - `SYNTAX_ERROR = "62"` - `UNKNOWN_AGGREGATE_FUNCTION = "63"` - `CANNOT_READ_AGGREGATE_FUNCTION_FROM_TEXT = "64"` - 
`CANNOT_WRITE_AGGREGATE_FUNCTION_AS_TEXT = "65"` - `NOT_A_COLUMN = "66"` - `ILLEGAL_KEY_OF_AGGREGATION = "67"` - `CANNOT_GET_SIZE_OF_FIELD = "68"` - `ARGUMENT_OUT_OF_BOUND = "69"` - `CANNOT_CONVERT_TYPE = "70"` - `CANNOT_WRITE_AFTER_END_OF_BUFFER = "71"` - `CANNOT_PARSE_NUMBER = "72"` - `UNKNOWN_FORMAT = "73"` - `CANNOT_READ_FROM_FILE_DESCRIPTOR = "74"` - `CANNOT_WRITE_TO_FILE_DESCRIPTOR = "75"` - `CANNOT_OPEN_FILE = "76"` - `CANNOT_CLOSE_FILE = "77"` - `UNKNOWN_TYPE_OF_QUERY = "78"` - `INCORRECT_FILE_NAME = "79"` - `INCORRECT_QUERY = "80"` - `UNKNOWN_DATABASE = "81"` - `DATABASE_ALREADY_EXISTS = "82"` - `DIRECTORY_DOESNT_EXIST = "83"` - `DIRECTORY_ALREADY_EXISTS = "84"` - `FORMAT_IS_NOT_SUITABLE_FOR_INPUT = "85"` - `RECEIVED_ERROR_FROM_REMOTE_IO_SERVER = "86"` - `CANNOT_SEEK_THROUGH_FILE = "87"` - `CANNOT_TRUNCATE_FILE = "88"` - `UNKNOWN_COMPRESSION_METHOD = "89"` - `EMPTY_LIST_OF_COLUMNS_PASSED = "90"` - `SIZES_OF_MARKS_FILES_ARE_INCONSISTENT = "91"` - `EMPTY_DATA_PASSED = "92"` - `UNKNOWN_AGGREGATED_DATA_VARIANT = "93"` - `CANNOT_MERGE_DIFFERENT_AGGREGATED_DATA_VARIANTS = "94"` - `CANNOT_READ_FROM_SOCKET = "95"` - `CANNOT_WRITE_TO_SOCKET = "96"` - `CANNOT_READ_ALL_DATA_FROM_CHUNKED_INPUT = "97"` - `CANNOT_WRITE_TO_EMPTY_BLOCK_OUTPUT_STREAM = "98"` - `UNKNOWN_PACKET_FROM_CLIENT = "99"` - `UNKNOWN_PACKET_FROM_SERVER = "100"` - `UNEXPECTED_PACKET_FROM_CLIENT = "101"` - `UNEXPECTED_PACKET_FROM_SERVER = "102"` - `RECEIVED_DATA_FOR_WRONG_QUERY_ID = "103"` - `TOO_SMALL_BUFFER_SIZE = "104"` - `CANNOT_READ_HISTORY = "105"` - `CANNOT_APPEND_HISTORY = "106"` - `FILE_DOESNT_EXIST = "107"` - `NO_DATA_TO_INSERT = "108"` - `CANNOT_BLOCK_SIGNAL = "109"` - `CANNOT_UNBLOCK_SIGNAL = "110"` - `CANNOT_MANIPULATE_SIGSET = "111"` - `CANNOT_WAIT_FOR_SIGNAL = "112"` - `THERE_IS_NO_SESSION = "113"` - `CANNOT_CLOCK_GETTIME = "114"` - `UNKNOWN_SETTING = "115"` - `THERE_IS_NO_DEFAULT_VALUE = "116"` - `INCORRECT_DATA = "117"` - `ENGINE_REQUIRED = "119"` - `CANNOT_INSERT_VALUE_OF_DIFFERENT_SIZE_INTO_TUPLE = "120"` - `UNSUPPORTED_JOIN_KEYS = "121"` - `INCOMPATIBLE_COLUMNS = "122"` - `UNKNOWN_TYPE_OF_AST_NODE = "123"` - `INCORRECT_ELEMENT_OF_SET = "124"` - `INCORRECT_RESULT_OF_SCALAR_SUBQUERY = "125"` - `CANNOT_GET_RETURN_TYPE = "126"` - `ILLEGAL_INDEX = "127"` - `TOO_LARGE_ARRAY_SIZE = "128"` - `FUNCTION_IS_SPECIAL = "129"` - `CANNOT_READ_ARRAY_FROM_TEXT = "130"` - `TOO_LARGE_STRING_SIZE = "131"` - `AGGREGATE_FUNCTION_DOESNT_ALLOW_PARAMETERS = "133"` - `PARAMETERS_TO_AGGREGATE_FUNCTIONS_MUST_BE_LITERALS = "134"` - `ZERO_ARRAY_OR_TUPLE_INDEX = "135"` - `UNKNOWN_ELEMENT_IN_CONFIG = "137"` - `EXCESSIVE_ELEMENT_IN_CONFIG = "138"` - `NO_ELEMENTS_IN_CONFIG = "139"` - `ALL_REQUESTED_COLUMNS_ARE_MISSING = "140"` - `SAMPLING_NOT_SUPPORTED = "141"` - `NOT_FOUND_NODE = "142"` - `FOUND_MORE_THAN_ONE_NODE = "143"` - `FIRST_DATE_IS_BIGGER_THAN_LAST_DATE = "144"` - `UNKNOWN_OVERFLOW_MODE = "145"` - `QUERY_SECTION_DOESNT_MAKE_SENSE = "146"` - `NOT_FOUND_FUNCTION_ELEMENT_FOR_AGGREGATE = "147"` - `NOT_FOUND_RELATION_ELEMENT_FOR_CONDITION = "148"` - `NOT_FOUND_RHS_ELEMENT_FOR_CONDITION = "149"` - `EMPTY_LIST_OF_ATTRIBUTES_PASSED = "150"` - `INDEX_OF_COLUMN_IN_SORT_CLAUSE_IS_OUT_OF_RANGE = "151"` - `UNKNOWN_DIRECTION_OF_SORTING = "152"` - `ILLEGAL_DIVISION = "153"` - `AGGREGATE_FUNCTION_NOT_APPLICABLE = "154"` - `UNKNOWN_RELATION = "155"` - `DICTIONARIES_WAS_NOT_LOADED = "156"` - `ILLEGAL_OVERFLOW_MODE = "157"` - `TOO_MANY_ROWS = "158"` - `TIMEOUT_EXCEEDED = "159"` - `TOO_SLOW = "160"` - `TOO_MANY_COLUMNS = "161"` - `TOO_DEEP_SUBQUERIES 
= "162"` - `TOO_DEEP_PIPELINE = "163"` - `READONLY = "164"` - `TOO_MANY_TEMPORARY_COLUMNS = "165"` - `TOO_MANY_TEMPORARY_NON_CONST_COLUMNS = "166"` - `TOO_DEEP_AST = "167"` - `TOO_BIG_AST = "168"` - `BAD_TYPE_OF_FIELD = "169"` - `BAD_GET = "170"` - `CANNOT_CREATE_DIRECTORY = "172"` - `CANNOT_ALLOCATE_MEMORY = "173"` - `CYCLIC_ALIASES = "174"` - `CHUNK_NOT_FOUND = "176"` - `DUPLICATE_CHUNK_NAME = "177"` - `MULTIPLE_ALIASES_FOR_EXPRESSION = "178"` - `MULTIPLE_EXPRESSIONS_FOR_ALIAS = "179"` - `THERE_IS_NO_PROFILE = "180"` - `ILLEGAL_FINAL = "181"` - `ILLEGAL_PREWHERE = "182"` - `UNEXPECTED_EXPRESSION = "183"` - `ILLEGAL_AGGREGATION = "184"` - `UNSUPPORTED_MYISAM_BLOCK_TYPE = "185"` - `UNSUPPORTED_COLLATION_LOCALE = "186"` - `COLLATION_COMPARISON_FAILED = "187"` - `UNKNOWN_ACTION = "188"` - `TABLE_MUST_NOT_BE_CREATED_MANUALLY = "189"` - `SIZES_OF_ARRAYS_DOESNT_MATCH = "190"` - `SET_SIZE_LIMIT_EXCEEDED = "191"` - `UNKNOWN_USER = "192"` - `WRONG_PASSWORD = "193"` - `REQUIRED_PASSWORD = "194"` - `IP_ADDRESS_NOT_ALLOWED = "195"` - `UNKNOWN_ADDRESS_PATTERN_TYPE = "196"` - `SERVER_REVISION_IS_TOO_OLD = "197"` - `DNS_ERROR = "198"` - `UNKNOWN_QUOTA = "199"` - `QUOTA_DOESNT_ALLOW_KEYS = "200"` - `QUOTA_EXCEEDED = "201"` - `TOO_MANY_SIMULTANEOUS_QUERIES = "202"` - `NO_FREE_CONNECTION = "203"` - `CANNOT_FSYNC = "204"` - `NESTED_TYPE_TOO_DEEP = "205"` - `ALIAS_REQUIRED = "206"` - `AMBIGUOUS_IDENTIFIER = "207"` - `EMPTY_NESTED_TABLE = "208"` - `SOCKET_TIMEOUT = "209"` - `NETWORK_ERROR = "210"` - `EMPTY_QUERY = "211"` - `UNKNOWN_LOAD_BALANCING = "212"` - `UNKNOWN_TOTALS_MODE = "213"` - `CANNOT_STATVFS = "214"` - `NOT_AN_AGGREGATE = "215"` - `QUERY_WITH_SAME_ID_IS_ALREADY_RUNNING = "216"` - `CLIENT_HAS_CONNECTED_TO_WRONG_PORT = "217"` - `TABLE_IS_DROPPED = "218"` - `DATABASE_NOT_EMPTY = "219"` - `DUPLICATE_INTERSERVER_IO_ENDPOINT = "220"` - `NO_SUCH_INTERSERVER_IO_ENDPOINT = "221"` - `ADDING_REPLICA_TO_NON_EMPTY_TABLE = "222"` - `UNEXPECTED_AST_STRUCTURE = "223"` - `REPLICA_IS_ALREADY_ACTIVE = "224"` - `NO_ZOOKEEPER = "225"` - `NO_FILE_IN_DATA_PART = "226"` - `UNEXPECTED_FILE_IN_DATA_PART = "227"` - `BAD_SIZE_OF_FILE_IN_DATA_PART = "228"` - `QUERY_IS_TOO_LARGE = "229"` - `NOT_FOUND_EXPECTED_DATA_PART = "230"` - `TOO_MANY_UNEXPECTED_DATA_PARTS = "231"` - `NO_SUCH_DATA_PART = "232"` - `BAD_DATA_PART_NAME = "233"` - `NO_REPLICA_HAS_PART = "234"` - `DUPLICATE_DATA_PART = "235"` - `ABORTED = "236"` - `NO_REPLICA_NAME_GIVEN = "237"` - `FORMAT_VERSION_TOO_OLD = "238"` - `CANNOT_MUNMAP = "239"` - `CANNOT_MREMAP = "240"` - `MEMORY_LIMIT_EXCEEDED = "241"` - `TABLE_IS_READ_ONLY = "242"` - `NOT_ENOUGH_SPACE = "243"` - `UNEXPECTED_ZOOKEEPER_ERROR = "244"` - `CORRUPTED_DATA = "246"` - `INCORRECT_MARK = "247"` - `INVALID_PARTITION_VALUE = "248"` - `NOT_ENOUGH_BLOCK_NUMBERS = "250"` - `NO_SUCH_REPLICA = "251"` - `TOO_MANY_PARTS = "252"` - `REPLICA_IS_ALREADY_EXIST = "253"` - `NO_ACTIVE_REPLICAS = "254"` - `TOO_MANY_RETRIES_TO_FETCH_PARTS = "255"` - `PARTITION_ALREADY_EXISTS = "256"` - `PARTITION_DOESNT_EXIST = "257"` - `UNION_ALL_RESULT_STRUCTURES_MISMATCH = "258"` - `CLIENT_OUTPUT_FORMAT_SPECIFIED = "260"` - `UNKNOWN_BLOCK_INFO_FIELD = "261"` - `BAD_COLLATION = "262"` - `CANNOT_COMPILE_CODE = "263"` - `INCOMPATIBLE_TYPE_OF_JOIN = "264"` - `NO_AVAILABLE_REPLICA = "265"` - `MISMATCH_REPLICAS_DATA_SOURCES = "266"` - `STORAGE_DOESNT_SUPPORT_PARALLEL_REPLICAS = "267"` - `CPUID_ERROR = "268"` - `INFINITE_LOOP = "269"` - `CANNOT_COMPRESS = "270"` - `CANNOT_DECOMPRESS = "271"` - `CANNOT_IO_SUBMIT = "272"` - `CANNOT_IO_GETEVENTS = 
"273"` - `AIO_READ_ERROR = "274"` - `AIO_WRITE_ERROR = "275"` - `INDEX_NOT_USED = "277"` - `ALL_CONNECTION_TRIES_FAILED = "279"` - `NO_AVAILABLE_DATA = "280"` - `DICTIONARY_IS_EMPTY = "281"` - `INCORRECT_INDEX = "282"` - `UNKNOWN_DISTRIBUTED_PRODUCT_MODE = "283"` - `WRONG_GLOBAL_SUBQUERY = "284"` - `TOO_FEW_LIVE_REPLICAS = "285"` - `UNSATISFIED_QUORUM_FOR_PREVIOUS_WRITE = "286"` - `UNKNOWN_FORMAT_VERSION = "287"` - `DISTRIBUTED_IN_JOIN_SUBQUERY_DENIED = "288"` - `REPLICA_IS_NOT_IN_QUORUM = "289"` - `LIMIT_EXCEEDED = "290"` - `DATABASE_ACCESS_DENIED = "291"` - `MONGODB_CANNOT_AUTHENTICATE = "293"` - `INVALID_BLOCK_EXTRA_INFO = "294"` - `RECEIVED_EMPTY_DATA = "295"` - `NO_REMOTE_SHARD_FOUND = "296"` - `SHARD_HAS_NO_CONNECTIONS = "297"` - `CANNOT_PIPE = "298"` - `CANNOT_FORK = "299"` - `CANNOT_DLSYM = "300"` - `CANNOT_CREATE_CHILD_PROCESS = "301"` - `CHILD_WAS_NOT_EXITED_NORMALLY = "302"` - `CANNOT_SELECT = "303"` - `CANNOT_WAITPID = "304"` - `TABLE_WAS_NOT_DROPPED = "305"` - `TOO_DEEP_RECURSION = "306"` - `TOO_MANY_BYTES = "307"` - `UNEXPECTED_NODE_IN_ZOOKEEPER = "308"` - `FUNCTION_CANNOT_HAVE_PARAMETERS = "309"` - `INVALID_SHARD_WEIGHT = "317"` - `INVALID_CONFIG_PARAMETER = "318"` - `UNKNOWN_STATUS_OF_INSERT = "319"` - `VALUE_IS_OUT_OF_RANGE_OF_DATA_TYPE = "321"` - `BARRIER_TIMEOUT = "335"` - `UNKNOWN_DATABASE_ENGINE = "336"` - `DDL_GUARD_IS_ACTIVE = "337"` - `UNFINISHED = "341"` - `METADATA_MISMATCH = "342"` - `SUPPORT_IS_DISABLED = "344"` - `TABLE_DIFFERS_TOO_MUCH = "345"` - `CANNOT_CONVERT_CHARSET = "346"` - `CANNOT_LOAD_CONFIG = "347"` - `CANNOT_INSERT_NULL_IN_ORDINARY_COLUMN = "349"` - `INCOMPATIBLE_SOURCE_TABLES = "350"` - `AMBIGUOUS_TABLE_NAME = "351"` - `AMBIGUOUS_COLUMN_NAME = "352"` - `INDEX_OF_POSITIONAL_ARGUMENT_IS_OUT_OF_RANGE = "353"` - `ZLIB_INFLATE_FAILED = "354"` - `ZLIB_DEFLATE_FAILED = "355"` - `BAD_LAMBDA = "356"` - `RESERVED_IDENTIFIER_NAME = "357"` - `INTO_OUTFILE_NOT_ALLOWED = "358"` - `TABLE_SIZE_EXCEEDS_MAX_DROP_SIZE_LIMIT = "359"` - `CANNOT_CREATE_CHARSET_CONVERTER = "360"` - `SEEK_POSITION_OUT_OF_BOUND = "361"` - `CURRENT_WRITE_BUFFER_IS_EXHAUSTED = "362"` - `CANNOT_CREATE_IO_BUFFER = "363"` - `RECEIVED_ERROR_TOO_MANY_REQUESTS = "364"` - `SIZES_OF_NESTED_COLUMNS_ARE_INCONSISTENT = "366"` - `TOO_MANY_FETCHES = "367"` - `ALL_REPLICAS_ARE_STALE = "369"` - `DATA_TYPE_CANNOT_BE_USED_IN_TABLES = "370"` - `INCONSISTENT_CLUSTER_DEFINITION = "371"` - `SESSION_NOT_FOUND = "372"` - `SESSION_IS_LOCKED = "373"` - `INVALID_SESSION_TIMEOUT = "374"` - `CANNOT_DLOPEN = "375"` - `CANNOT_PARSE_UUID = "376"` - `ILLEGAL_SYNTAX_FOR_DATA_TYPE = "377"` - `DATA_TYPE_CANNOT_HAVE_ARGUMENTS = "378"` - `UNKNOWN_STATUS_OF_DISTRIBUTED_DDL_TASK = "379"` - `CANNOT_KILL = "380"` - `HTTP_LENGTH_REQUIRED = "381"` - `CANNOT_LOAD_CATBOOST_MODEL = "382"` - `CANNOT_APPLY_CATBOOST_MODEL = "383"` - `PART_IS_TEMPORARILY_LOCKED = "384"` - `MULTIPLE_STREAMS_REQUIRED = "385"` - `NO_COMMON_TYPE = "386"` - `DICTIONARY_ALREADY_EXISTS = "387"` - `CANNOT_ASSIGN_OPTIMIZE = "388"` - `INSERT_WAS_DEDUPLICATED = "389"` - `CANNOT_GET_CREATE_TABLE_QUERY = "390"` - `EXTERNAL_LIBRARY_ERROR = "391"` - `QUERY_IS_PROHIBITED = "392"` - `THERE_IS_NO_QUERY = "393"` - `QUERY_WAS_CANCELLED = "394"` - `FUNCTION_THROW_IF_VALUE_IS_NON_ZERO = "395"` - `TOO_MANY_ROWS_OR_BYTES = "396"` - `QUERY_IS_NOT_SUPPORTED_IN_MATERIALIZED_VIEW = "397"` - `UNKNOWN_MUTATION_COMMAND = "398"` - `FORMAT_IS_NOT_SUITABLE_FOR_OUTPUT = "399"` - `CANNOT_STAT = "400"` - `FEATURE_IS_NOT_ENABLED_AT_BUILD_TIME = "401"` - `CANNOT_IOSETUP = "402"` - 
`INVALID_JOIN_ON_EXPRESSION = "403"` - `BAD_ODBC_CONNECTION_STRING = "404"` - `PARTITION_SIZE_EXCEEDS_MAX_DROP_SIZE_LIMIT = "405"` - `TOP_AND_LIMIT_TOGETHER = "406"` - `DECIMAL_OVERFLOW = "407"` - `BAD_REQUEST_PARAMETER = "408"` - `EXTERNAL_EXECUTABLE_NOT_FOUND = "409"` - `EXTERNAL_SERVER_IS_NOT_RESPONDING = "410"` - `PTHREAD_ERROR = "411"` - `NETLINK_ERROR = "412"` - `CANNOT_SET_SIGNAL_HANDLER = "413"` - `ALL_REPLICAS_LOST = "415"` - `REPLICA_STATUS_CHANGED = "416"` - `EXPECTED_ALL_OR_ANY = "417"` - `UNKNOWN_JOIN = "418"` - `MULTIPLE_ASSIGNMENTS_TO_COLUMN = "419"` - `CANNOT_UPDATE_COLUMN = "420"` - `CANNOT_ADD_DIFFERENT_AGGREGATE_STATES = "421"` - `UNSUPPORTED_URI_SCHEME = "422"` - `CANNOT_GETTIMEOFDAY = "423"` - `CANNOT_LINK = "424"` - `SYSTEM_ERROR = "425"` - `CANNOT_COMPILE_REGEXP = "427"` - `UNKNOWN_LOG_LEVEL = "428"` - `FAILED_TO_GETPWUID = "429"` - `MISMATCHING_USERS_FOR_PROCESS_AND_DATA = "430"` - `ILLEGAL_SYNTAX_FOR_CODEC_TYPE = "431"` - `UNKNOWN_CODEC = "432"` - `ILLEGAL_CODEC_PARAMETER = "433"` - `CANNOT_PARSE_PROTOBUF_SCHEMA = "434"` - `NO_COLUMN_SERIALIZED_TO_REQUIRED_PROTOBUF_FIELD = "435"` - `PROTOBUF_BAD_CAST = "436"` - `PROTOBUF_FIELD_NOT_REPEATED = "437"` - `DATA_TYPE_CANNOT_BE_PROMOTED = "438"` - `CANNOT_SCHEDULE_TASK = "439"` - `INVALID_LIMIT_EXPRESSION = "440"` - `CANNOT_PARSE_DOMAIN_VALUE_FROM_STRING = "441"` - `BAD_DATABASE_FOR_TEMPORARY_TABLE = "442"` - `NO_COLUMNS_SERIALIZED_TO_PROTOBUF_FIELDS = "443"` - `UNKNOWN_PROTOBUF_FORMAT = "444"` - `CANNOT_MPROTECT = "445"` - `FUNCTION_NOT_ALLOWED = "446"` - `HYPERSCAN_CANNOT_SCAN_TEXT = "447"` - `BROTLI_READ_FAILED = "448"` - `BROTLI_WRITE_FAILED = "449"` - `BAD_TTL_EXPRESSION = "450"` - `BAD_TTL_FILE = "451"` - `SETTING_CONSTRAINT_VIOLATION = "452"` - `MYSQL_CLIENT_INSUFFICIENT_CAPABILITIES = "453"` - `OPENSSL_ERROR = "454"` - `SUSPICIOUS_TYPE_FOR_LOW_CARDINALITY = "455"` - `UNKNOWN_QUERY_PARAMETER = "456"` - `BAD_QUERY_PARAMETER = "457"` - `CANNOT_UNLINK = "458"` - `CANNOT_SET_THREAD_PRIORITY = "459"` - `CANNOT_CREATE_TIMER = "460"` - `CANNOT_SET_TIMER_PERIOD = "461"` - `CANNOT_DELETE_TIMER = "462"` - `CANNOT_FCNTL = "463"` - `CANNOT_PARSE_ELF = "464"` - `CANNOT_PARSE_DWARF = "465"` - `INSECURE_PATH = "466"` - `CANNOT_PARSE_BOOL = "467"` - `CANNOT_PTHREAD_ATTR = "468"` - `VIOLATED_CONSTRAINT = "469"` - `QUERY_IS_NOT_SUPPORTED_IN_LIVE_VIEW = "470"` - `INVALID_SETTING_VALUE = "471"` - `READONLY_SETTING = "472"` - `DEADLOCK_AVOIDED = "473"` - `INVALID_TEMPLATE_FORMAT = "474"` - `INVALID_WITH_FILL_EXPRESSION = "475"` - `WITH_TIES_WITHOUT_ORDER_BY = "476"` - `INVALID_USAGE_OF_INPUT = "477"` - `UNKNOWN_POLICY = "478"` - `UNKNOWN_DISK = "479"` - `UNKNOWN_PROTOCOL = "480"` - `PATH_ACCESS_DENIED = "481"` - `DICTIONARY_ACCESS_DENIED = "482"` - `TOO_MANY_REDIRECTS = "483"` - `INTERNAL_REDIS_ERROR = "484"` - `SCALAR_ALREADY_EXISTS = "485"` - `CANNOT_GET_CREATE_DICTIONARY_QUERY = "487"` - `UNKNOWN_DICTIONARY = "488"` - `INCORRECT_DICTIONARY_DEFINITION = "489"` - `CANNOT_FORMAT_DATETIME = "490"` - `UNACCEPTABLE_URL = "491"` - `ACCESS_ENTITY_NOT_FOUND = "492"` - `ACCESS_ENTITY_ALREADY_EXISTS = "493"` - `ACCESS_ENTITY_FOUND_DUPLICATES = "494"` - `ACCESS_STORAGE_READONLY = "495"` - `QUOTA_REQUIRES_CLIENT_KEY = "496"` - `ACCESS_DENIED = "497"` - `LIMIT_BY_WITH_TIES_IS_NOT_SUPPORTED = "498"` - S3_ERROR = "499" - `AZURE_BLOB_STORAGE_ERROR = "500"` - `CANNOT_CREATE_DATABASE = "501"` - `CANNOT_SIGQUEUE = "502"` - `AGGREGATE_FUNCTION_THROW = "503"` - `FILE_ALREADY_EXISTS = "504"` - `CANNOT_DELETE_DIRECTORY = "505"` - `UNEXPECTED_ERROR_CODE = 
"506"` - `UNABLE_TO_SKIP_UNUSED_SHARDS = "507"` - `UNKNOWN_ACCESS_TYPE = "508"` - `INVALID_GRANT = "509"` - `CACHE_DICTIONARY_UPDATE_FAIL = "510"` - `UNKNOWN_ROLE = "511"` - `SET_NON_GRANTED_ROLE = "512"` - `UNKNOWN_PART_TYPE = "513"` - `ACCESS_STORAGE_FOR_INSERTION_NOT_FOUND = "514"` - `INCORRECT_ACCESS_ENTITY_DEFINITION = "515"` - `AUTHENTICATION_FAILED = "516"` - `CANNOT_ASSIGN_ALTER = "517"` - `CANNOT_COMMIT_OFFSET = "518"` - `NO_REMOTE_SHARD_AVAILABLE = "519"` - `CANNOT_DETACH_DICTIONARY_AS_TABLE = "520"` - `ATOMIC_RENAME_FAIL = "521"` - `UNKNOWN_ROW_POLICY = "523"` - `ALTER_OF_COLUMN_IS_FORBIDDEN = "524"` - `INCORRECT_DISK_INDEX = "525"` - `NO_SUITABLE_FUNCTION_IMPLEMENTATION = "527"` - `CASSANDRA_INTERNAL_ERROR = "528"` - `NOT_A_LEADER = "529"` - `CANNOT_CONNECT_RABBITMQ = "530"` - `CANNOT_FSTAT = "531"` - `LDAP_ERROR = "532"` - `INCONSISTENT_RESERVATIONS = "533"` - `NO_RESERVATIONS_PROVIDED = "534"` - `UNKNOWN_RAID_TYPE = "535"` - `CANNOT_RESTORE_FROM_FIELD_DUMP = "536"` - `ILLEGAL_MYSQL_VARIABLE = "537"` - `MYSQL_SYNTAX_ERROR = "538"` - `CANNOT_BIND_RABBITMQ_EXCHANGE = "539"` - `CANNOT_DECLARE_RABBITMQ_EXCHANGE = "540"` - `CANNOT_CREATE_RABBITMQ_QUEUE_BINDING = "541"` - `CANNOT_REMOVE_RABBITMQ_EXCHANGE = "542"` - `UNKNOWN_MYSQL_DATATYPES_SUPPORT_LEVEL = "543"` - `ROW_AND_ROWS_TOGETHER = "544"` - `FIRST_AND_NEXT_TOGETHER = "545"` - `NO_ROW_DELIMITER = "546"` - `INVALID_RAID_TYPE = "547"` - `UNKNOWN_VOLUME = "548"` - `DATA_TYPE_CANNOT_BE_USED_IN_KEY = "549"` - `CONDITIONAL_TREE_PARENT_NOT_FOUND = "550"` - `ILLEGAL_PROJECTION_MANIPULATOR = "551"` - `UNRECOGNIZED_ARGUMENTS = "552"` - `LZMA_STREAM_ENCODER_FAILED = "553"` - `LZMA_STREAM_DECODER_FAILED = "554"` - `ROCKSDB_ERROR = "555"` - `SYNC_MYSQL_USER_ACCESS_ERROR = "556"` - `UNKNOWN_UNION = "557"` - `EXPECTED_ALL_OR_DISTINCT = "558"` - `INVALID_GRPC_QUERY_INFO = "559"` - `ZSTD_ENCODER_FAILED = "560"` - `ZSTD_DECODER_FAILED = "561"` - `TLD_LIST_NOT_FOUND = "562"` - `CANNOT_READ_MAP_FROM_TEXT = "563"` - `INTERSERVER_SCHEME_DOESNT_MATCH = "564"` - `TOO_MANY_PARTITIONS = "565"` - `CANNOT_RMDIR = "566"` - `DUPLICATED_PART_UUIDS = "567"` - `RAFT_ERROR = "568"` - `MULTIPLE_COLUMNS_SERIALIZED_TO_SAME_PROTOBUF_FIELD = "569"` - `DATA_TYPE_INCOMPATIBLE_WITH_PROTOBUF_FIELD = "570"` - `DATABASE_REPLICATION_FAILED = "571"` - `TOO_MANY_QUERY_PLAN_OPTIMIZATIONS = "572"` - `EPOLL_ERROR = "573"` - `DISTRIBUTED_TOO_MANY_PENDING_BYTES = "574"` - `UNKNOWN_SNAPSHOT = "575"` - `KERBEROS_ERROR = "576"` - `INVALID_SHARD_ID = "577"` - `INVALID_FORMAT_INSERT_QUERY_WITH_DATA = "578"` - `INCORRECT_PART_TYPE = "579"` - `CANNOT_SET_ROUNDING_MODE = "580"` - `TOO_LARGE_DISTRIBUTED_DEPTH = "581"` - `NO_SUCH_PROJECTION_IN_TABLE = "582"` - `ILLEGAL_PROJECTION = "583"` - `PROJECTION_NOT_USED = "584"` - `CANNOT_PARSE_YAML = "585"` - `CANNOT_CREATE_FILE = "586"` - `CONCURRENT_ACCESS_NOT_SUPPORTED = "587"` - `DISTRIBUTED_BROKEN_BATCH_INFO = "588"` - `DISTRIBUTED_BROKEN_BATCH_FILES = "589"` - `CANNOT_SYSCONF = "590"` - `SQLITE_ENGINE_ERROR = "591"` - `DATA_ENCRYPTION_ERROR = "592"` - `ZERO_COPY_REPLICATION_ERROR = "593"` - BZIP2_STREAM_DECODER_FAILED = "594" - BZIP2_STREAM_ENCODER_FAILED = "595" - `INTERSECT_OR_EXCEPT_RESULT_STRUCTURES_MISMATCH = "596"` - `NO_SUCH_ERROR_CODE = "597"` - `BACKUP_ALREADY_EXISTS = "598"` - `BACKUP_NOT_FOUND = "599"` - `BACKUP_VERSION_NOT_SUPPORTED = "600"` - `BACKUP_DAMAGED = "601"` - `NO_BASE_BACKUP = "602"` - `WRONG_BASE_BACKUP = "603"` - `BACKUP_ENTRY_ALREADY_EXISTS = "604"` - `BACKUP_ENTRY_NOT_FOUND = "605"` - `BACKUP_IS_EMPTY = "606"` - 
`CANNOT_RESTORE_DATABASE = "607"` - `CANNOT_RESTORE_TABLE = "608"` - `FUNCTION_ALREADY_EXISTS = "609"` - `CANNOT_DROP_FUNCTION = "610"` - `CANNOT_CREATE_RECURSIVE_FUNCTION = "611"` - `OBJECT_ALREADY_STORED_ON_DISK = "612"` - `OBJECT_WAS_NOT_STORED_ON_DISK = "613"` - `POSTGRESQL_CONNECTION_FAILURE = "614"` - `CANNOT_ADVISE = "615"` - `UNKNOWN_READ_METHOD = "616"` - LZ4_ENCODER_FAILED = "617" - LZ4_DECODER_FAILED = "618" - `POSTGRESQL_REPLICATION_INTERNAL_ERROR = "619"` - `QUERY_NOT_ALLOWED = "620"` - `CANNOT_NORMALIZE_STRING = "621"` - `CANNOT_PARSE_CAPN_PROTO_SCHEMA = "622"` - `CAPN_PROTO_BAD_CAST = "623"` - `BAD_FILE_TYPE = "624"` - `IO_SETUP_ERROR = "625"` - `CANNOT_SKIP_UNKNOWN_FIELD = "626"` - `BACKUP_ENGINE_NOT_FOUND = "627"` - `OFFSET_FETCH_WITHOUT_ORDER_BY = "628"` - `HTTP_RANGE_NOT_SATISFIABLE = "629"` - `HAVE_DEPENDENT_OBJECTS = "630"` - `UNKNOWN_FILE_SIZE = "631"` - `UNEXPECTED_DATA_AFTER_PARSED_VALUE = "632"` - `QUERY_IS_NOT_SUPPORTED_IN_WINDOW_VIEW = "633"` - `MONGODB_ERROR = "634"` - `CANNOT_POLL = "635"` - `CANNOT_EXTRACT_TABLE_STRUCTURE = "636"` - `INVALID_TABLE_OVERRIDE = "637"` - `SNAPPY_UNCOMPRESS_FAILED = "638"` - `SNAPPY_COMPRESS_FAILED = "639"` - `NO_HIVEMETASTORE = "640"` - `CANNOT_APPEND_TO_FILE = "641"` - `CANNOT_PACK_ARCHIVE = "642"` - `CANNOT_UNPACK_ARCHIVE = "643"` - `REMOTE_FS_OBJECT_CACHE_ERROR = "644"` - `NUMBER_OF_DIMENSIONS_MISMATCHED = "645"` - `CANNOT_BACKUP_DATABASE = "646"` - `CANNOT_BACKUP_TABLE = "647"` - `WRONG_DDL_RENAMING_SETTINGS = "648"` - `INVALID_TRANSACTION = "649"` - `SERIALIZATION_ERROR = "650"` - `CAPN_PROTO_BAD_TYPE = "651"` - `ONLY_NULLS_WHILE_READING_SCHEMA = "652"` - `CANNOT_PARSE_BACKUP_SETTINGS = "653"` - `WRONG_BACKUP_SETTINGS = "654"` - `FAILED_TO_SYNC_BACKUP_OR_RESTORE = "655"` - `MEILISEARCH_EXCEPTION = "656"` - `UNSUPPORTED_MEILISEARCH_TYPE = "657"` - `MEILISEARCH_MISSING_SOME_COLUMNS = "658"` - `UNKNOWN_STATUS_OF_TRANSACTION = "659"` - `HDFS_ERROR = "660"` - `CANNOT_SEND_SIGNAL = "661"` - `FS_METADATA_ERROR = "662"` - `INCONSISTENT_METADATA_FOR_BACKUP = "663"` - `ACCESS_STORAGE_DOESNT_ALLOW_BACKUP = "664"` - `CANNOT_CONNECT_NATS = "665"` - `NOT_INITIALIZED = "667"` - `INVALID_STATE = "668"` - `NAMED_COLLECTION_DOESNT_EXIST = "669"` - `NAMED_COLLECTION_ALREADY_EXISTS = "670"` - `NAMED_COLLECTION_IS_IMMUTABLE = "671"` - `INVALID_SCHEDULER_NODE = "672"` - `RESOURCE_ACCESS_DENIED = "673"` - `RESOURCE_NOT_FOUND = "674"` - CANNOT_PARSE_IPV4 = "675" - CANNOT_PARSE_IPV6 = "676" - `THREAD_WAS_CANCELED = "677"` - `IO_URING_INIT_FAILED = "678"` - `IO_URING_SUBMIT_ERROR = "679"` - `MIXED_ACCESS_PARAMETER_TYPES = "690"` - `UNKNOWN_ELEMENT_OF_ENUM = "691"` - `TOO_MANY_MUTATIONS = "692"` - `AWS_ERROR = "693"` - `ASYNC_LOAD_CYCLE = "694"` - `ASYNC_LOAD_FAILED = "695"` - `ASYNC_LOAD_CANCELED = "696"` - `CANNOT_RESTORE_TO_NONENCRYPTED_DISK = "697"` - `INVALID_REDIS_STORAGE_TYPE = "698"` - `INVALID_REDIS_TABLE_STRUCTURE = "699"` - `USER_SESSION_LIMIT_EXCEEDED = "700"` - `CLUSTER_DOESNT_EXIST = "701"` - `CLIENT_INFO_DOES_NOT_MATCH = "702"` - `INVALID_IDENTIFIER = "703"` - `QUERY_CACHE_USED_WITH_NONDETERMINISTIC_FUNCTIONS = "704"` - `TABLE_NOT_EMPTY = "705"` - `LIBSSH_ERROR = "706"` - `GCP_ERROR = "707"` - `ILLEGAL_STATISTICS = "708"` - `CANNOT_GET_REPLICATED_DATABASE_SNAPSHOT = "709"` - `FAULT_INJECTED = "710"` - `FILECACHE_ACCESS_DENIED = "711"` - `TOO_MANY_MATERIALIZED_VIEWS = "712"` - `BROKEN_PROJECTION = "713"` - `UNEXPECTED_CLUSTER = "714"` - `CANNOT_DETECT_FORMAT = "715"` - `CANNOT_FORGET_PARTITION = "716"` - `EXPERIMENTAL_FEATURE_ERROR = 
"717"` - `TOO_SLOW_PARSING = "718"` - `QUERY_CACHE_USED_WITH_SYSTEM_TABLE = "719"` - `USER_EXPIRED = "720"` - `DEPRECATED_FUNCTION = "721"` - `ASYNC_LOAD_WAIT_FAILED = "722"` - `PARQUET_EXCEPTION = "723"` - `TOO_MANY_TABLES = "724"` - `TOO_MANY_DATABASES = "725"` - `DISTRIBUTED_CACHE_ERROR = "900"` - `CANNOT_USE_DISTRIBUTED_CACHE = "901"` - `KEEPER_EXCEPTION = "999"` - `POCO_EXCEPTION = "1000"` - `STD_EXCEPTION = "1001"` - `UNKNOWN_EXCEPTION = "1002"` --- URL: https://www.tinybird.co/docs/publish/api-endpoints/overview Last update: 2024-11-13T15:42:17.000Z Content: --- title: "API Endpoints · Tinybird Docs" theme-color: "#171612" description: "API Endpoints make it easy to use the results of your queries in applications." --- # API Endpoints¶ Tinybird can turn any Pipe into an API Endpoint that you can query. For example, you can ingest your data, build SQL logic inside a Pipe, and then publish the result of your query as an HTTP API Endpoint. You can then create interactive [Charts](https://www.tinybird.co/docs/docs/publish/charts) of your data. API Endpoints make it easy to use the results of your queries in applications. Any app that can run an HTTP GET can use Tinybird API Endpoints. Tinybird represents API Endpoints using the icon. ## Create an API Endpoint¶ To create an API Endpoint, you first need a [Pipe](https://www.tinybird.co/docs/docs/concepts/pipes) . You can publish any of the queries in your Pipes as an API Endpoint. ### Using the UI¶ First, [create a Pipe](https://www.tinybird.co/docs/docs/concepts/pipes#creating-pipes-in-the-ui) in the UI. In the Pipe, select **Create API Endpoint** , then select the Node that you want to publish. You can export a CSV file with the extracted data by selecting **Export CSV**. ### Using the CLI¶ First, [create a Pipe](https://www.tinybird.co/docs/docs/concepts/pipes#creating-pipes-in-the-cli) using the Tinybird CLI. Use the following command to publish an API Endpoint from the CLI. This automatically selects the final Node in the Pipe. tb pipe publish PIPE_NAME_OR_ID If you want to manually select a different Node to publish, supply the Node name as the final command argument: tb pipe publish PIPE_NAME_OR_ID NODE_NAME ## Secure your API Endpoints¶ Access to the APIs you publish in Tinybird are also protected with Tokens. You can limit which operations a specific Token can do through scopes. For example, you can create Tokens that are only able to do admin operations on Tinybird resources, or only have `READ` permission for a specific Data Source. See [Tokens](https://www.tinybird.co/docs/docs/concepts/auth-tokens) to understand how they work and see what types are available. ## API gateways¶ API gateways allow you to cloak or rebrand Tinybird API Endpoints while meeting additional security and compliance requirements. When you publish an [API Endpoint](https://www.tinybird.co/docs/docs/publish/api-endpoints/overview) in Tinybird, it's available through `api.tinybird.co` or the API Gateway URL that corresponds to your [Workspace](https://www.tinybird.co/docs/docs/concepts/workspaces) region. See [API Endpoint URLs](https://www.tinybird.co/docs/docs/api-reference/overview#regions-and-endpoints) . API Endpoints are secured using [Tokens](https://www.tinybird.co/docs/docs/concepts/auth-tokens) that are managed inside your Tinybird Workspace. Sometimes you might want to put the Tinybird API Endpoints behind an API Gateway. For example: - To present a unified brand experience to your users. - To avoid exposing Tokens and the underlying technology. 
- To comply with regulations around data privacy and security. - To add Tinybird to an existing API architecture. ### Alternative approaches¶ You can meet the requirements satisfied by an API gateway through other methods: - Use [JSON Web Tokens (JWTs)](https://www.tinybird.co/docs/docs/concepts/auth-tokens#json-web-tokens-jwts) to have your application call Tinybird API Endpoints from the frontend without proxying through your backend. - Appropriately scope the Token used inside your application. Exposing a read-only Token has limited security concerns as it can't be used to modify data. You can invalidate the Token at any time. - Use row-level security to ensure that a Token only provides access to the appropriate data. ### Amazon API Gateway¶ The steps to create a reverse proxy using Amazon API Gateway are as follows: 1. Access the API Gateway console. 2. Select **Create API**, then **HTTP API**. 3. Select **Add Integration** and then select **HTTP**. 4. Configure the integration with the method **GET** and the full URL to your Tinybird API with its Token. For example, `https://api.tinybird.co/v0/pipes/top-10-products.json?token=p.eyJ1Ijog...` 5. Set a name for the API and select **Next**. 6. On the **Routes** page, set the method to **GET** and configure the desired **Resource path**. For example, **/top-10-products**. 7. Go through the rest of the steps to create the API. You can find more information about applying a custom domain name in the [Amazon API Gateway documentation](https://docs.aws.amazon.com/apigateway/latest/developerguide/http-api-custom-domain-names.html). ### Google Cloud Apigee¶ The steps to create a reverse proxy using Apigee are as follows: 1. Access the Apigee console. 2. Add a new **Reverse Proxy**. 3. Add your **Base path**. For example, **/top-10-products**. 4. Add the **Target**. For example, `https://api.tinybird.co/v0/pipes/top-10-products.json?token=p.eyJ1Ijog...` 5. Select **Pass through** for security. 6. Select an environment to deploy the API to. 7. Deploy, and test the API. You can find more information about applying a custom domain name in the [Apigee documentation](https://cloud.google.com/apigee/docs/api-platform/publish/portal/custom-domain). ### Grafbase Edge Gateway¶ To create a new Grafbase Edge Gateway using the Grafbase CLI, follow these steps. Inside a new directory, run: npx grafbase init --template openapi-tinybird In Tinybird, open your API Endpoint page. Select **Create Chart**, **Share this API Endpoint**, then select **OpenAPI 3.0**. Copy the link that appears, including the full Token. Create an `.env` file using the following template and enter the required details. # TINYBIRD_API_URL is the URL for your published API Endpoint TINYBIRD_API_URL= # TINYBIRD_API_TOKEN is the Token with READ access to the API Endpoint TINYBIRD_API_TOKEN= # TINYBIRD_API_SCHEMA is the OpenAPI 3.0 spec URL copied from the API Endpoint docs page TINYBIRD_API_SCHEMA= You can now run the Grafbase Edge Gateway locally: npx grafbase dev Open the local Pathfinder at `http://127.0.0.1:4000` to test your Edge Gateway. Here is an example GraphQL query: query Tinybird { tinybird { topPages { data { action payload } rows } } } Make sure to replace `topPages` with the name of your API Endpoint. ### NGINX¶ The following is an example NGINX configuration file that handles a `GET` request and makes the request to Tinybird on your behalf. The Token is only accessed server-side and never exposed to the user.
worker_processes 1; events { worker_connections 1024; } http { server { listen 8080; server_name localhost; location /top-10-products { proxy_pass https://api.tinybird.co/v0/pipes/top-10-products.json?token=p.eyJ1Ijog...; } } } ## Query API and API Endpoints¶ Using the [Query API](https://www.tinybird.co/docs/docs/api-reference/query-api) is similar to running SQL statements against a normal database, rather than using Tinybird as your backend, and is useful for ad-hoc queries. Publish API Endpoints instead of using the Query API in the following situations: - You want to build and maintain all the logic in Tinybird and call the API Endpoint to fetch the result. - You want to use incremental Nodes in your Pipes to simplify the development and maintenance of your queries. - You need support for query parameters and more complex logic using the Tinybird [templating language](https://www.tinybird.co/docs/docs/cli/advanced-templates). - You need to incorporate changes to your query logic with little downstream impact. You can monitor performance of individual API Endpoints using [pipe_stats_rt](https://www.tinybird.co/docs/docs/monitoring/service-datasources#tinybird-pipe-stats-rt) and [pipe_stats](https://www.tinybird.co/docs/docs/monitoring/service-datasources#tinybird-pipe-stats), uncovering optimization opportunities. All requests to the Query API are grouped together, making it more difficult to monitor performance of a specific query. ## Errors and retries¶ API Endpoints return standard HTTP success or error codes. For errors, the response also includes extra information about what went wrong, encoded in the response as JSON. ### Error codes¶ API Endpoints might return the following HTTP error codes: | Code | Description | | --- | --- | | 400 | Bad request. An `HTTP 400` can be returned in several scenarios and typically represents a malformed request such as errors in your SQL queries or missing query parameters. | | 403 | Forbidden. The auth Token doesn't have the correct scopes. | | 404 | Not found. This usually occurs when the name of the API Endpoint is wrong or hasn't been published. | | 405 | HTTP Method not allowed. Requests to API Endpoints must use the `GET` method. | | 408 | Request timeout. This occurs when the query takes too long to complete; by default, this is 10 seconds. | | 414 | Request-URI Too Large. Not all APIs have the same limit but it's usually 2KB for GET requests. Reduce the URI length or use a POST request to avoid the limit. | | 429 | Too many requests. Usually occurs when an API Endpoint is hitting rate limits. | | 499 | Connection closed. This occurs if the client closes the connection after 1 second. If this is unexpected, increase the connection timeout on your end. | | 500 | Internal Server Error. Usually an unexpected transient service error. | Errors when running a query are usually reported as 400 Bad request or 500 Internal Server Error, depending on whether the error can be fixed by the caller or not. In those cases, the API response has an additional HTTP header, `X-DB-Exception-Code`, where you can check the internal database error, reported as a stringified number. For a full list of internal database errors, see [List of API Endpoint database errors](https://www.tinybird.co/docs/list-of-errors). ### Retries¶ When implementing an API Gateway, make sure to handle potential errors and implement retry strategies where appropriate.
Implement automatic retries for the following errors: - HTTP 429: Too many requests - HTTP 500: Internal Server Error Use exponential backoff when retrying requests that produce these errors. See [Exponential backoff](https://en.wikipedia.org/wiki/Exponential_backoff) in Wikipedia. ### Token limitations with API Gateways¶ When using an API Gateway or proxy between your application and Tinybird, your proxy uses a Token to authenticate requests to Tinybird. Treat the Token as a secret and don't expose it to the client. Use a service such as Unkey to add multi-tenant API keys, rate limiting, and token usage analytics to your app at scale. ## Build plan limits¶ The Build plan is the free tier of Tinybird. See ["Tinybird plans"](https://www.tinybird.co/docs/docs/plans). The Build plan has the following limits on the number of API requests per day that you can make against published API Endpoints, and on total storage. | Description | Limit and time window | | --- | --- | | API Endpoint | 1,000 requests per day | | Data Sources storage | 10 gigabytes in total | These limits don't apply to paid plans. To learn more about how Tinybird bills for different data operations, see [Billing](https://www.tinybird.co/docs/docs/support/billing). ## Next steps¶ - Read more about how to use Tokens in the [Tokens docs](https://www.tinybird.co/docs/docs/concepts/auth-tokens). - Read the guide: ["Consume APIs in a Next.js frontend with JWTs"](https://www.tinybird.co/docs/docs/guides/integrations/consume-apis-nextjs). - Understand [Tokens](https://www.tinybird.co/docs/docs/concepts/auth-tokens). --- URL: https://www.tinybird.co/docs/publish/charts Last update: 2024-11-05T10:32:54.000Z Content: --- title: "Charts · Tinybird Docs" theme-color: "#171612" description: "Create beautiful, fast charts of your Tinybird data." --- # Charts¶ Charts are a great way to visualize your data. You can create and publish easy, fast Charts in Tinybird from any of your published API Endpoints. <-figure-> ![Example Tinybird dashboard showing multiple chart types](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fcharts-dashboard.png&w=3840&q=75) <-figcaption-> Example Tinybird Charts dashboard Check out the [live demo](https://guide-tinybird-charts.vercel.app/) to see an example of Charts in action. ## Overview¶ When you publish an API Endpoint, you often want to visualize the data in a more user-friendly way. Charts are a great way to do this. Tinybird provides three options: - **No-code:** A fast, UI-based flow for creating Charts that live in your Tinybird Workspace UI (great for internal reference use, smaller projects, and getting started). - **Low code:** Using the Tinybird UI to create Charts and generate an iframe, which you can then embed in your own application. - **Code-strong**: Using the `@tinybirdco/charts` npm React library to build out exactly what you need, in your own application, using React components. Fully customizable and secured with JWTs. You can either generate initial Chart data in the Tinybird UI, or start using the library directly. Instead of coding your own charts and dashboards from scratch, use Tinybird's pre-built Chart components. You won't have to implement the frontend and backend architecture, or any security middleware. Use the library components and JWTs, manage the token exchange flow, and interact directly with any of your published Tinybird API Endpoints. To create a Chart, you need to have a published API Endpoint.
Learn how to [publish an API Endpoint here](https://www.tinybird.co/docs/docs/publish/api-endpoints/overview). ## The Tinybird Charts library BETA¶ All options are built on the Tinybird Charts library ( `@tinybirdco/charts` ) , a modular way to build fast visualizations of your data, which leverages [Apache ECharts](https://echarts.apache.org/en/index.html) . You can use Tinybird Charts with any level of customization, and also use your Tinybird data with any third party library. ### Components¶ The library provides the following components: - `AreaChart` - `BarChart` - `BarList` - `DonutChart` - `LineChart` - `PieChart` - `Table` All components share the same API, making it easy to switch between different types of charts. The Tinybird Charts library is currently in public beta. If you have any feedback or suggestions, contact Tinybird at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) or in the [Community Slack](https://www.tinybird.co/docs/docs/community). ## Create a Chart in the UI (no-code)¶ 1. In your Workspace, navigate to the Overview page of one of your API Endpoints. 2. Select the "Create Chart" button (top right). 3. Configure your Chart by selecting the name, the type of Chart, and the fields you want to visualize. Under "Data", select the index and category. 4. Once you're happy with your Chart, select "Save". <-figure-> ![Example Tinybird pie chart, showing configuration options](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fcharts-create-chart.png&w=3840&q=75) <-figcaption-> Example Tinybird pie Chart, showing configuration options Your Chart now lives in the API Endpoint Overview page. ## Create a Chart using an iframe (low code)¶ Tinybird users frequently want to take the data from their Tinybird API Endpoints, create charts, and embed them in their own dashboard application. A low-overhead option is to take the generated iframe and drop it into your application: 1. Create a Chart using the process described above. 2. In the API Endpoint Overview page, scroll down to the "Charts" tab and select your Chart. 3. Select the `<>` tab to access the code snippets. 4. Copy and paste the ready-to-use iframe code into your application. <-figure-> ![GIF showing a user creating a Pie Chart in the Tinybird UI and generating the iframe code](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fcharts-create-pie-chart.gif&w=3840&q=75) <-figcaption-> Creating a Pie Chart in the Tinybird UI and generating the iframe code ## Create a Chart using the React library (code-strong)¶ This option gives you the most customization flexibility. You'll need to be familiar with frontend development and styling. To create a Chart component and use the library, you can either create a Chart in the UI first, or use the library directly. Using the library directly means there will not be a Chart created in your Workspace, and no generated snippet, so skip to #2. ### 1. View your Chart code¶ 1. In your Workspace, navigate to the API Endpoint Overview page. 2. Scroll down to the "Charts" tab and select one of your Charts. 3. Select the `<>` tab to access the code snippets. 4. You now have the code for a ready-to-use React component. ### 2. Install the library¶ Install the `@tinybirdco/charts` library locally in your project: npm install @tinybirdco/charts ### 3. Create a JWT¶ Calls need to be secured with a token. 
To learn more about the token exchange, see [Understanding the token exchange](https://www.tinybird.co/docs/docs/guides/integrations/charts-using-iframes-and-jwt-tokens#understanding-the-token-exchange). In React code snippets, the Chart components are authenticated using the `token` prop, where you paste your [Tinybird Token](https://www.tinybird.co/docs/docs/concepts/auth-tokens). You can limit how often you or your users can fetch Tinybird APIs on a per-endpoint or per-user basis. See [Rate limits for JWTs](https://www.tinybird.co/docs/docs/concepts/auth-tokens#rate-limits-for-jwts). #### Create a JWT¶ There is wide support for creating JWTs in many programming languages and frameworks. See the [Tinybird JWT docs for popular options](https://www.tinybird.co/docs/concepts/auth-tokens#create-a-jwt-in-production). ### 4. Embed the Chart into your application¶ Copy and paste the Chart snippet (either generated by Tinybird, or constructed by you [using the same configuration](https://www.npmjs.com/package/@tinybirdco/charts#usage)) into your application. For example: ##### Example Line Chart component code snippet import React from 'react' import { LineChart } from '@tinybirdco/charts' function MyLineChart() { return ( <LineChart /> ) } ### 5. Fetch data¶ The most common approach for fetching and rendering data is to directly use a single Chart component (or group of individual components) by passing the required props. These are included by default within each generated code snippet, so for most use cases, you should be able to simply copy, paste, and have the Chart you want. The library offers [many additional props](https://www.npmjs.com/package/@tinybirdco/charts?activeTab=readme#api) for further customization, including many that focus specifically on fetching data. See [6. Customization](https://www.tinybird.co/docs/docs/publish/charts#6-customization) for more. #### Alternative approaches and integrations¶ Depending on your needs, you have additional options: - Wrapping components within `<ChartProvider>` to share styles, query configuration, and custom loading and error states [among several Chart components](https://www.npmjs.com/package/@tinybirdco/charts?activeTab=readme#reusing-styles-and-query-config-using-the-chartprovider). - Adding your own fetcher to the `ChartProvider` (or to a specific Chart component) using the `fetcher` prop. This can be useful for adding custom headers or dealing with JWT tokens. - Using the `useQuery` hook to [fetch data and pass it directly to the component](https://www.npmjs.com/package/@tinybirdco/charts?activeTab=readme#using-the-hook). It works with any custom component, or with any third-party library like [Tremor](https://www.tremor.so/) or [shadcn](https://ui.shadcn.com/). <-figure-> ![GIF showing how to use the library with Tinybird Charts & shadcn in the UI](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fcharts-tb-sd.gif&w=3840&q=75) <-figcaption-> Using the library with Tinybird Charts & shadcn ### 6. Customization¶ Tinybird supports customization: - **Standard customization**: Use the properties provided by the [@tinybirdco/charts library](https://www.npmjs.com/package/@tinybirdco/charts?activeTab=readme#api). - **Advanced customization**: Send a specific parameter to [customize anything within a Chart](https://www.npmjs.com/package/@tinybirdco/charts?activeTab=readme#extra-personalization-using-echarts-options), aligning with the [ECharts specification](https://echarts.apache.org/handbook/en/get-started/).
### 7. Filtering¶ Filtering is possible by using the endpoint parameters. Use the `params` data-fetching property to pass your parameters to a chart component, `<ChartProvider>`, or the `useQuery` hook. See the [example snippet](https://www.tinybird.co/docs/about:blank#4-embed-the-chart-into-your-application) for how `params` and filters are used. ### 8. Advanced configuration (optional)¶ #### Polling¶ Control how frequently your Chart polls for new data by setting the `refreshInterval` prop (interval in milliseconds). The npm library offers [a range of additional component props](https://www.npmjs.com/package/@tinybirdco/charts) specifically for data fetching, so be sure to review them and use a combination to build the perfect chart. Use of this feature may significantly increase your billing costs. The lower the `refreshInterval` prop (so the more frequently you're polling for fresh data), the more requests you're making to your Tinybird API Endpoints. Read [the billing docs](https://www.tinybird.co/docs/docs/support/billing) and understand the pricing of different operations. #### Global vs local settings¶ Each chart can have its own settings, or settings can be shared across a group of Chart components by wrapping them within `<ChartProvider>`. #### States¶ Chart components can be in one of a range of states: - Success - Error - Loading ## Example use cases¶ Two examples showing different ways to generate JWTs, set up a local project, and implement a Chart: 1. Guide: [Consume APIs in a Next.js frontend with JWTs](https://www.tinybird.co/docs/docs/guides/integrations/consume-apis-nextjs). 2. Guide: [Build charts with iframes and JWTs](https://www.tinybird.co/docs/docs/guides/integrations/charts-using-iframes-and-jwt-tokens). Interested in Charts but don't have any data to use? Run the demo in the [Add Tinybird Charts to a Next.js frontend](https://www.tinybird.co/docs/docs/guides/integrations/add-charts-to-nextjs) guide, which uses a bootstrapped Next.js app and fake data. ## Troubleshooting¶ ### Handle token errors¶ See the information on [JWT error handling](https://www.tinybird.co/docs/docs/concepts/auth-tokens#error-handling). ### Refresh token¶ See the information on [JWT refreshing and limitations](https://www.tinybird.co/docs/docs/concepts/auth-tokens#jwt-limitations). ## Next steps¶ - Check out the [Tinybird Charts library](https://www.npmjs.com/package/@tinybirdco/charts). - Understand how Tinybird [uses Tokens](https://www.tinybird.co/docs/docs/concepts/auth-tokens). --- URL: https://www.tinybird.co/docs/publish/copy-pipes Last update: 2024-11-13T15:42:17.000Z Content: --- title: "Copy Pipes · Tinybird Docs" theme-color: "#171612" description: "Copy Pipes allow you to capture the result of a Pipe at a moment in time, and write the result into a target Data Source. They can be run on a schedule, or executed on demand." --- # Copy Pipes¶ Copy Pipes are an extension of Tinybird's [Pipes](https://www.tinybird.co/docs/docs/concepts/pipes). Copy Pipes allow you to capture the result of a Pipe at a moment in time, and write the result into a target [Data Source](https://www.tinybird.co/docs/docs/concepts/data-sources). They can be run on a schedule, or executed on demand. Copy Pipes are great for use cases like: - Event-sourced snapshots, such as change data capture (CDC). - Copying data from Tinybird to another location in Tinybird to experiment. - De-duplicating with snapshots. Copy Pipes should not be confused with [Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/overview).
Materialized Views continuously re-evaluate a query as new events are inserted, while Copy Pipes create a single snapshot at a given point in time. Tinybird represents Copy Pipes using the icon. ## Best practices¶ A Copy Pipe executes the Pipe's query on each run to export the result. This means that the size of the copy operation is tied to the size of your data and the complexity of the Pipe's query. As Copy Pipes can run frequently, it's strongly recommended that you [follow the best practices for faster SQL](https://www.tinybird.co/docs/docs/query/sql-best-practices) to optimize your queries, and understand the following best practices for optimizing your Copy Pipes. ### 1. Round datetime filters to your schedule¶ Queries in Copy Pipes should always have a time window filter that aligns with the execution schedule. For example, a Copy Pipe that runs once a day typically has a filter for yesterday's data, and an hourly schedule usually means a filter to get results from the previous hour. Remember that a Copy Pipe job is not guaranteed to run exactly at the scheduled time. If a Copy Pipe is scheduled to run at 16:00:00, the job could run at 16:00:01 or even 16:00:05. To account for a potential delay, **round your time window filter to align with the schedule window**. For example, if your Copy Pipe is scheduled hourly, instead of writing: SELECT * FROM datasource WHERE datetime >= now() - interval 1 hour AND datetime < now() You should use `toStartOfHour()` to round the time filter to the hour: SELECT * FROM datasource WHERE datetime >= toStartOfHour(now()) - interval 1 hour AND datetime < toStartOfHour(now()) Doing this means that, even if the Copy Pipe's execution is delayed (perhaps being triggered at 16:00:01, 17:00:30, and 18:00:02), you still maintain consistent copies of data regardless of the delay, with no gaps or overlaps. ### 2. Account for late data in your schedule¶ There are many reasons why you might have late-arriving data: system downtime in your message queues, network outages, or something else. These are largely unavoidable and will occur at some point in your streaming journey. You should **account for potential delays ahead of time**. When using Copy Pipes, include some headroom in your schedule to allow for late data. How much headroom you give to your schedule is up to you, but some useful guidance is that you should consider the Service Level Agreements (SLAs) both up and downstream of Tinybird. For instance, if your streaming pipeline has 5 minute downtime SLAs, then most of your late data should be less than 5 minutes. Similarly, consider if you have any SLAs from data consumers who are expecting timely data in Tinybird. If you schedule a Copy Pipe every 5 minutes (16:00, 16:05, 16:10...), there could be events with a timestamp of 15:59:59 that do not arrive in Tinybird until 16:00:01 (2 seconds late!). If the Copy Pipe executes at exactly 16:00:00, these events could be lost. There are two ways to add headroom to your schedule: #### Option 1: Delay the execution¶ The first option is to simply delay the schedule. For example, if you want to create a Copy Pipe that creates 5 minute snapshots, you could delay the schedule by 1 minute, so that instead of running at 17:00, 17:05, 17:10, etc. it would instead run at 17:01, 17:06, 17:11. To achieve this, you could use the cron expression `1-59/5 * * * *`.
If you use this method, you must combine it with the advice from the first tip in this best practices guide to [Round datetime filters to your schedule](https://www.tinybird.co/docs/about:blank#1-round-datetime-filters-to-your-schedule) . For example: SELECT * FROM datasource WHERE datetime >= toStartOfFiveMinutes(now()) - interval 5 minute AND datetime < toStartOfFiveMinutes(now()) #### Option 2: Filter the result with headroom¶ Another strategy is to keep your schedule as desired (17:00, 17:05, 17:10, etc.) but apply a filter in the query to add some headroom. For example, you can move the copy window by 15 seconds: WITH (SELECT toStartOfFiveMinutes(now()) - interval 15 second) AS snapshot SELECT snapshot, * FROM datasource WHERE timestamp >= snapshot - interval 5 minute AND timestamp < snapshot With this method, a Copy Pipe that executes at 17:00 will copy data from 16:54:45 to 16:59:45. At 17:05, it would copy data from 16:59:45 to 17:04:45, and so on. It is worth noting that this can be confusing to data consumers who might notice that the data timestamps don't perfectly align with the schedule, so consider whether you'll need extra documentation. ### 3. Write a snapshot timestamp¶ There are many reasons why you might want to capture a snapshot timestamp, as it documents when a particular row was written. This helps you identify which execution of the Copy Pipe is responsible for which row, which is useful for debugging or auditing. For example: SELECT toStartOfHour(now()) as snapshot_id, * FROM datasource WHERE timestamp >= snapshot_id - interval 1 hour AND timestamp < snapshot_id In this example, you're adding a new column at the start of the result which contains the rounded timestamp of the execution time. By applying an alias to this column, you can re-use it in the query as your [rounded datetime filter](https://www.tinybird.co/docs/about:blank#1-round-datetime-filters-to-your-schedule) , saving you a bit of typing. ### 4. Use parameters in your Copy Pipe¶ Copy Pipes can be executed following a schedule or on-demand. All the previous best practices on this page are focused on scheduled executions. But what happens if you want to use the same Copy Pipe to do a backfill? For example, you want to execute the Copy Pipe only on data from last year, to fill in a gap behind your fresh data. To do this, you can parameterize the filters. When you run an on-demand Copy Pipe with parameters, you can modify the values of the parameters before execution. Scheduled Copy Pipes with parameters use the default value for any parameters. This means you can simply re-use the same Copy Pipe for your fresh, scheduled runs as well as any ad-hoc backfills. The following example creates a Pipe with two [Nodes](https://www.tinybird.co/docs/docs/concepts/pipes#nodes) that break up the Copy Pipe logic to be more readable. The first Node is called `date_params`: % {% if defined(snapshot_id) %} SELECT parseDateTimeBestEffort({{String(snapshot_id)}}) as snapshot_id, snapshot_id - interval 5 minute as start {% else %} SELECT toStartOfFiveMinutes(now()) as snapshot_id, snapshot_id - interval 5 minute as start {% end %} The `date_params` Node looks for a parameter called `snapshot_id` . If it encounters this parameter, it knows that this is an on-demand execution, because a scheduled Copy Pipe will not be passed any parameters by the scheduler. The scheduled execution of the Copy Pipe will create a time filter based on `now()` . 
An on-demand execution of the Copy Pipe will use this `snapshot_id` parameter to create a dynamic time filter. In both cases, the final result of this Node is a time filter called `snapshot_id`. In the second Node: SELECT (SELECT snapshot_id FROM date_params) as snapshot_id, * FROM datasource WHERE timestamp >= (SELECT start FROM date_params) AND timestamp < snapshot_id First, you select the result of the previous `date_params` Node, which is the `snapshot_id` time filter. You do not need to worry about whether this is a scheduled or on-demand execution at this point, as it has already been handled by the previous Node. This also retrieves the other side of the time filter from the first Node with `(SELECT start FROM date_params)` . This is not strictly needed but it's convenient so you don't have to write the `- interval 5 minute` in multiple places, making it easier to update in the future. With this, the same Copy Pipe can perform both functions: being executed on a regular schedule to keep up with fresh data, and being executed on-demand when needed. ## Configure a Copy Pipe (CLI)¶ To create a Copy Pipe from the CLI, you need to create a .pipe file. This file follows the same format as [any other .pipe file](https://www.tinybird.co/docs/docs/cli/datafiles/overview) , including defining Nodes that contain your SQL queries. In this file, define the queries that will filter and transform the data as needed. The final result of all queries should be the result that you want to write into a Data Source. You must define which Node contains the final result. To do this, include the following parameters at the end of the Node: TYPE COPY TARGET_DATASOURCE datasource_name COPY_SCHEDULE --(optional) a cron expression or @on-demand. If not defined, it defaults to @on-demand. COPY_MODE append --(Optional) The strategy to ingest data for copy jobs. One of `append` or `replace`; if empty, the default strategy is `append`. There can be only one copy Node per Pipe, and no other outputs, such as [Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/overview) or [API Endpoints](https://www.tinybird.co/docs/docs/publish/api-endpoints/overview). Copy Pipes can either be scheduled, or executed on-demand. This is configured using the `COPY_SCHEDULE` setting. To schedule a Copy Pipe, configure `COPY_SCHEDULE` with a cron expression. On-demand Copy Pipes are defined by configuring `COPY_SCHEDULE` with the value `@on-demand`. Note that all schedules are executed in the UTC time zone. If you are configuring a schedule that runs at a specific time, remember to convert the desired time from your local time zone into UTC. Here is an example of a Copy Pipe that is scheduled every hour and writes the results of a query into the `sales_hour_copy` Data Source: NODE daily_sales SQL > % SELECT toStartOfDay(starting_date) day, country, sum(sales) as total_sales FROM teams WHERE day BETWEEN toStartOfDay(now()) - interval 1 day AND toStartOfDay(now()) and country = {{ String(country, 'US') }} GROUP BY day, country TYPE COPY TARGET_DATASOURCE sales_hour_copy COPY_SCHEDULE 0 * * * * Before pushing the Copy Pipe to your Workspace, make sure that the target Data Source already exists and has a schema that matches the output of the query result. Data Sources will not be created automatically when a Copy Pipe runs. If you push the target Data Source and the Copy Pipe at the same time, be sure to use the `--push-deps` option in the CLI.
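For reference, the following is a minimal sketch of what a matching `sales_hour_copy` .datasource file could look like for the example above. The column types are assumptions (they treat `sales` as a Float64), so adjust them to whatever the query against your `teams` Data Source actually returns; as noted above, the schema has to match the Copy Pipe's output.

DESCRIPTION >
    Hourly snapshots written by the scheduled Copy Pipe above

SCHEMA >
    `day` DateTime,
    `country` String,
    `total_sales` Float64

ENGINE "MergeTree"
ENGINE_SORTING_KEY "day, country"

`toStartOfDay()` returns a DateTime, which is why `day` is declared as DateTime in this sketch.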
### Change Copy Pipe Token reference¶ Tinybird automatically creates a [Token](https://www.tinybird.co/docs/docs/concepts/auth-tokens) each time a scheduled copy is created, to read from the Pipe and copy the results. To change the Token strategy (for instance, to share the same one across each copy, rather than individual Tokens for each), update the [Token reference in the .pipe datafile](https://www.tinybird.co/docs/docs/cli/datafiles/pipe-files). ## Execute Copy Pipes (CLI)¶ Copy Pipes can either be scheduled, or executed on-demand. When a Copy Pipe is pushed with a schedule, it will automatically be executed as per the schedule you defined. If you need to pause the scheduler, you can run `tb pipe copy pause [pipe_name]` , and use `tb pipe copy resume [pipe_name]` to resume. Note that you cannot customize the values of dynamic parameters on a scheduled Copy Pipe. Any parameters will use their default values. When a Copy Pipe is pushed without a schedule, using the `@on-demand` directive, you can run `tb pipe copy run [pipe_name]` to trigger the Copy Pipe as needed. You can pass parameter values to the Copy Pipe by using the `--param` flag, e.g., `--param key=value`. You can run `tb job ls` to see any running jobs, as well as any jobs that have finished during the last 48 hours. If you remove a Copy Pipe from your Workspace, the schedule will automatically stop and no more copies will be executed. ## Configure a Copy Pipe (UI)¶ To create a Copy Pipe from the UI, [follow the process to create a standard Pipe](https://www.tinybird.co/docs/docs/concepts/pipes#creating-pipes-in-the-ui). After writing your queries: 1. Select the Node that contains the final result. 2. Select the actions button next to the Node. 3. Select** Create Copy Job** . To configure the frequency: 1. Select whether the Copy Pipe should be scheduled using a cron expression, or run on-demand, using the** Frequency** menu. 2. If you select a cron expression, configure the expression. 3. Select** Next** to continue. All schedules run in the UTC time zone. If you are configuring a schedule that runs at a specific time, you might need to convert the desired time from your local time zone into UTC. ### On-demand¶ If you selected **on-demand** as the frequency for the Copy Pipe, you can customize the values for any parameters of the Pipe. You can also configure whether the Copy Pipe should write results into a new, or existing, Data Source, using the radio buttons. If you use an existing Data Source, you can select which one to use from the menu of your Data Sources. Only Data Sources with a compatible schema are in the menu. If you create a new Data Source, Tinybird guides you through creating the new Data Source. Select **Next** to continue and go through the standard [Create Data Source wizard](https://www.tinybird.co/docs/docs/concepts/data-sources#creating-data-sources-in-the-ui). ### Scheduled¶ If you selected **cron expression** as the frequency for the Copy Pipe, a preview of the result appears. You can't configure parameter values for a scheduled Copy Pipe. Review the results and select **Next** to continue. Finally, you can configure whether the Copy Pipe should write results into a new, or existing, Data Source, using the radio buttons. If you use an existing Data Source, you can select which one to use from the menu of your Data Sources. Only Data Sources with a compatible schema are in the menu. If you create a new Data Source, Tinybird guides you through creating the new Data Source.
Select **Next** to continue and go through the standard [Create Data Source wizard](https://www.tinybird.co/docs/docs/concepts/data-sources#creating-data-sources-in-the-ui). ## Run Copy Pipes (UI)¶ To run a Copy Pipe in the UI, navigate to the Pipe, and select the **Copying** button. From the options, select **Run copy now**. You can't customize the values of dynamic parameters on a scheduled Copy Pipe. Any parameters use their default values. ## Iterating a Copy Pipe¶ Copy Pipes can be iterated using [version control](https://www.tinybird.co/docs/docs/production/overview) like any other resource in your Data Project. However, you need to understand how connections work in Branches and deployments to select the appropriate strategy for your desired changes. By default, Branches don't execute recurrent jobs on creation, such as the scheduled runs of Copy Pipes. To iterate a Copy Pipe, create a new one, or recreate the existing one, with the desired configuration. The new Copy Pipe starts executing from the Branch, without affecting the unchanged production resource. This means you can test the changes without mixing the test resource with your production exports. [This example](https://github.com/tinybirdco/use-case-examples/tree/main/change_copy_pipe_time_granularity) shows how to change the Copy Pipe time granularity, adding an extra step to backfill the old data. ## Monitoring¶ Tinybird provides a high-level metrics page for each Copy Pipe in the UI, as well as exposing low-level observability data through the [Service Data Sources](https://www.tinybird.co/docs/docs/monitoring/service-datasources). You can view high-level status and statistics about your Copy Pipes in the Tinybird UI from the Copy Pipe's details page. To access the details page, navigate to the Pipe, and select **View Copy Job**. The details page shows summaries of the Copy Pipe's current status and configuration, as well as Charts showing the performance of previous executions. You can also monitor your Copy Pipes using the [datasources_ops_log Service Data Source](https://www.tinybird.co/docs/docs/monitoring/service-datasources) . This Data Source contains data about all your operations in Tinybird. Logs that relate to Copy Pipes can be identified by a value of `copy` in the `event_type` column. For example, the following query aggregates the Processed Data from Copy Pipes, for the current month, for a given Data Source name. SELECT toStartOfMonth(timestamp) month, sum(read_bytes + written_bytes) processed_data FROM tinybird.datasources_ops_log WHERE datasource_name = '{YOUR_DATASOURCE_NAME}' AND event_type = 'copy' AND timestamp >= toStartOfMonth(now()) GROUP BY month Using this Data Source, you can also write queries to determine average job duration, number of errors, error messages, and more. ## Billing¶ Processed Data and Storage are the two metrics that Tinybird uses for [billing](https://www.tinybird.co/docs/docs/support/billing) Copy Pipes. A Copy Pipe executes the Pipe's query (Processed Data) and writes the result into a Data Source (Storage). Any processed data and storage incurred by a Copy Pipe is charged at the standard rate for your billing plan (see ["Tinybird plans"](https://www.tinybird.co/docs/docs/plans) ). See the [Monitoring section](https://www.tinybird.co/docs/about:blank#monitoring) for guidance on monitoring your usage of Copy Pipes. ## Limits¶ Check the [limits page](https://www.tinybird.co/docs/docs/support/limits) for limits on ingestion, queries, API Endpoints, and more.
### Build and Professional¶ The schedule applied to a Copy Pipe does not guarantee that the underlying job executes immediately at the configured time. The job is placed into a job queue when the configured time elapses. It is possible that, if the queue is busy, the job could be delayed and executed some time after the scheduled time. ### Enterprise¶ A maximum execution time of 50% of the scheduling period, 30 minutes max, means that if the Copy Pipe is scheduled to run every minute, the operation can take up to 30 seconds. If it is scheduled to run every 5 minutes, the job can last up to 2m30s, and so forth. This is to prevent overlapping jobs, which can impact results. The schedule applied to a Copy Pipe does not guarantee that the job executes immediately at the configured time. When the configured time elapses, the job is placed into a job queue. It is possible that, if the queue is busy, the job could be delayed and executed some time after the scheduled time. To reduce the chances of a busy queue affecting your Copy Pipe execution schedule, we recommend distributing the jobs over a wider period of time rather than grouped close together. For Enterprise customers, these settings can be customized. Reach out to your Customer Success team directly, or email us at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co). ## Next steps¶ - Understand how to use Tinybird's[ dynamic query parameters](https://www.tinybird.co/docs/docs/query/query-parameters) . - Read up on configuring, executing, and iterating[ Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/overview) . --- URL: https://www.tinybird.co/docs/publish/kafka-sink Last update: 2024-11-08T11:23:54.000Z Content: --- title: "Kafka Sink · Tinybird Docs" theme-color: "#171612" description: "Push events to Kafka on a batch-based schedule using Tinybird's fully managed Kafka Sink Connector." --- # Kafka Sink¶ Kafka Sinks are currently in private beta. If you have any feedback or suggestions, contact Tinybird at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) or in the [Community Slack](https://www.tinybird.co/docs/docs/community). Tinybird's Kafka Sink allows you to push the results of a query to a Kafka topic. Queries can be executed on a defined schedule or on-demand. Common uses for the Kafka Sink include: - Push events to Kafka as part of an event-driven architecture. - Exporting data to other systems that consume data from Kafka. - Hydrating a data lake or data warehouse with real-time data. Tinybird represents Sinks using the icon. ## Prerequisites¶ To use the Kafka Sink, you need to have a Kafka cluster that Tinybird can reach via the internet, or via private networking for Enterprise customers. ## Configure using the UI¶ ### 1. Create a Pipe and promote it to Sink Pipe¶ In the Tinybird UI, create a Pipe and write the query that produces the result you want to export. In the top right "Create API Endpoint" menu, select "Create Sink". In the modal, choose the destination (Kafka). ### 2. Choose the scheduling options¶ You can configure your Sink to run using a cron expression, so it runs automatically when needed. ### 3. Configure destination topic¶ Enter the Kafka topic where events are going to be pushed. ### 4. Preview and create¶ The final step is to check and confirm that the preview matches what you expect. Congratulations! You've created your first Sink. ## Configure using the CLI¶ ### 1. 
Create the Kafka Connection¶ Run the `tb connection create kafka` command, and follow the instructions. ### 2. Create Kafka Sink Pipe¶ To create a Sink Pipe, create a regular .pipe and filter the data you want to export to your Kafka topic in the SQL section as in any other Pipe. Then, specify the Pipe as a sink type and add the needed configuration. Your Pipe should have the following structure: NODE node_0 SQL > SELECT * FROM events WHERE time >= toStartOfMinute(now()) - interval 30 minute TYPE sink EXPORT_SERVICE kafka EXPORT_CONNECTION_NAME "test_kafka" EXPORT_KAFKA_TOPIC "test_kafka_topic" EXPORT_SCHEDULE "*/5 * * * *" **Pipe parameters** For this step, you will need to configure the following [Pipe parameters](https://www.tinybird.co/docs/docs/cli/datafiles/pipe-files#sink-pipe): | Key | Type | Description | | --- | --- | --- | | EXPORT_CONNECTION_NAME | string | Required. The connection name to the destination service. This is the connection created in Step 1. | | EXPORT_KAFKA_TOPIC | string | Required. The desired topic for the export data. | | EXPORT_SCHEDULE | string | A crontab expression that sets the frequency of the Sink operation or the `@on-demand` string. | Once ready, push the datafile to your Workspace using `tb push` (or `tb deploy` if you are using [version control](https://www.tinybird.co/docs/docs/production/overview) ) to create the Sink Pipe. ## Scheduling¶ The schedule applied doesn't guarantee that the underlying job executes immediately at the configured time. The job is placed into a job queue when the configured time elapses. It is possible that, if the queue is busy, the job could be delayed and executed after the scheduled time. To reduce the chances of a busy queue affecting your Sink Pipe execution schedule, we recommend distributing the jobs over a wider period of time rather than grouped close together. For Enterprise customers, these settings can be customized. Reach out to your Customer Success team or email us at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co). ## Query parameters¶ You can add [query parameters](https://www.tinybird.co/docs/docs/query/query-parameters) to your Sink, the same way you do in API Endpoints or Copy Pipes. For scheduled executions, the default values for the parameters will be used when the Sink runs. ## Iterating a Kafka Sink (Coming soon)¶ Iterating features for Kafka Sinks are not yet supported in the beta. They are documented here for future reference. Sinks can be iterated using [version control](https://www.tinybird.co/docs/docs/production/overview) , similar to other resources in your project. When you create a Branch, resources are cloned from the main Branch. However, there are two considerations for Kafka Sinks to understand: **1. Schedules** When you create a Branch with an existing Kafka Sink, the resource will be cloned into the new Branch. However, **it will not be scheduled** . This prevents Branches from running exports unintentionally and consuming resources, as it is common that development Branches do not need to export to external systems. If you want these queries to run in a Branch, you must recreate the Kafka Sink in the new Branch. **2. Connections** Connections are not cloned when you create a Branch. You need to create a new Kafka connection in the new Branch for the Kafka Sink. ## Observability¶ Kafka Sink operations are logged in the [tinybird.sinks_ops_log](https://www.tinybird.co/docs/docs/monitoring/service-datasources#tinybird-sinks-ops-log) Service Data Source.
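As a starting point, here is a minimal, hedged sketch for inspecting recent Sink executions; the `timestamp` column name is an assumption, so verify it against the `sinks_ops_log` schema in the Service Data Sources docs before relying on it:

-- List the most recent Kafka Sink executions from the last day
-- (the timestamp column is assumed; check the sinks_ops_log schema)
SELECT *
FROM tinybird.sinks_ops_log
WHERE timestamp >= now() - interval 1 day
ORDER BY timestamp DESC
LIMIT 20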
## Limits & quotas¶ Check the [limits page](https://www.tinybird.co/docs/docs/support/limits) for limits on ingestion, queries, API Endpoints, and more. ## Billing¶ Any Processed Data incurred by a Kafka Sink is charged at the standard rate for your account. The Processed Data is already included in your plan, and counts towards your commitment. If you're on an Enterprise plan, view your plan and commitment on the [Organizations](https://www.tinybird.co/docs/docs/monitoring/organizations) tab in the UI. ## Next steps¶ - Get familiar with the[ Service Data Source](https://www.tinybird.co/docs/docs/monitoring/service-datasources) and see what's going on in your account - Deep dive on Tinybird's[ Pipes concept](https://www.tinybird.co/docs/docs/concepts/pipes) --- URL: https://www.tinybird.co/docs/publish/materialized-views/best-practices Last update: 2024-11-15T10:10:55.000Z Content: --- title: "Best practices for Materialized Views · Tinybird Docs" theme-color: "#171612" description: "Learn how Materialized Views work and how to best use them." --- # Best practices for Materialized Views¶ Read on to learn how [Materialized Views](https://www.tinybird.co/docs/overview) work and how to best use them in your data projects. ## How data gets into a Materialized View¶ Tinybird ingests data into Materialized Views in blocks. This process is presented in the following diagram. Every time new data is ingested into the origin Data Source, the materialization process is triggered, which applies the transformation Pipe over the data ingested and saves the output of that Pipe, which is a partial result, in the Materialized View. <-figure-> ![Materialization process](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fmaster-materialized-views-1.png&w=3840&q=75) Data that was present in the origin Data Source prior to the Materialized View creation is inserted into the destination Materialized View through a populate operation. ### Regular materialization¶ Materialized Views in Tinybird are incremental and triggered upon ingestion. From the moment it's created, new rows are inserted into the Materialized View. If an insert is too big, it's processed in blocks. Because materializations are performed only over the new data being ingested, and not over the whole Data Source, avoid using the following operations: Window functions, `ORDER BY`, `Neighbor` , and `DISTINCT`. ### Populates¶ Populates move historical data from the origin Data Source into the Materialized View. There are two types: complete and partial. If you're populating from a Data Source with hundreds of millions of rows and doing complex transformations, you might face memory errors. In this type of situation, use partial populates. If you’re using the CLI, populates are triggered using `tb pipe populate` . You can add conditions using the `--sql-condition` flag, for example, `--sql-condition='date == toYYYYMM(now())'` . If your `sql_condition` includes any column present in the Data Source `engine_sorting_key` , the populate job should process less data. If you have constant ingest in your origin Data Source, see [Populates and streaming ingest](https://www.tinybird.co/docs/about:blank#populates-and-streaming-ingest). ## Aggregated Materialized Views¶ Sometimes a background process in Tinybird merges partial results saved in intermediate states during the ingestion process, compacting the results and reducing the number of rows. The following diagram illustrates this process in more detail through a simple example.
Let’s say an eCommerce store wants to materialize the count of units sold by product. It's ingesting a JSON object every minute, with a product represented by a capital letter and the quantity sold during the last minute. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fmaster-materialized-views-2.png&w=3840&q=75) The store could define in their Pipe some simple SQL to sum the count of units sold per minute as data is ingested. The Pipe is applied over each new block of appended data, and the output is immediately saved in intermediate states into the Materialized View. Every 8 or 10 minutes, the background process merges the intermediate states, completing the aggregation across the entire Data Source. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fmaster-materialized-views-3.png&w=3840&q=75) Because the store is working in real time, it can’t always wait for this background process to take place. When querying the Materialized View, they should use the proper merge combinator and `GROUP BY` clause in the query itself. ### Understanding State and Merge combinators for Aggregates¶ Tracking when the background process that merges aggregate results in a Materialized View has occurred isn't always practical. Because of this, you need to store intermediate states using the `-State` suffix. If you’re creating a Materialized View using the UI, this is done automatically. Here’s an example of using `-State` when defining the transformation Pipe to calculate these intermediate states: ##### USING the -State SUFFIX NODE Avg calculation SQL > SELECT day, city, avgState() avg FROM table GROUP BY day, city You also need to specifically define the appropriate schema for the Materialized View: ##### MV SCHEMA SCHEMA > day Date, city String, avg AggregateFunction(avg, Float64) ENGINE_SORTING_KEY day, city Finally, you need to retrieve the data using the `-Merge` suffix in your API Endpoint Node to make sure the merge process is completed for all data in the Materialized View: ##### USE MERGE SUFFIX IN ENDPOINT NODE NODE endpoint SQL > % SELECT day, city, avgMerge(avg) as avg FROM avg_table WHERE day > {{Date(start_date)}} GROUP BY day, city ## Understanding the Materialized View parameters¶ When you create a Materialized View in the UI, Tinybird automatically recommends the best parameters for most use cases. Still, it's useful to understand these parameters for more complex use cases. ### Sorting Key¶ The Sorting Key defines how data is sorted and is critical for great performance when filtering. Choose the order of your sorting keys depending on how you're going to query them. Here are a few examples for a simple Materialized View containing `day`, `city` , and `avg` columns: - You want to query the average for all cities on a particular day: the `day` column should be the first sorting key. - You want the average over the last month for a particular city: the `city` column should be the first sorting key. For Materialized Views containing aggregations, every column in the `GROUP BY` statement has to be in the sorting keys, and only those columns can be sorting keys. For non-aggregated Materialized Views, you can select other columns if they fit better for your use case, but we don't recommend adding too many. You get only a negligible performance boost after the fourth sorting key column. ### Partition by¶ A partition is a logical combination of records by a given criterion. Usually you don't need a partition key, or a partition by month is enough.
Tinybird guesses the best partition key if your materialization query has a Date or DateTime column. If there aren't any Date columns, Tinybird doesn't set a partition key. Having no partition is better than having the wrong partition. If you're comfortable with partitions and you want to group records by another criterion, you can switch to the advanced tab and add your custom partition code. ### Time To Live (TTL)¶ If you have certain lifetime requirements on the data in your Materialized Views, you can specify a Time To Live (TTL) parameter when creating a Materialized View. An example of a TTL requirement is satisfying GDPR regulations. TTLs can also be useful if you only intend to query a brief history of the data. For example, if you always and only query for data from within the last week, you can set a TTL of 7 days. When a TTL is set, all rows older than the TTL are removed from the Materialized View. ### Advanced tab (UI)¶ Most of the time, the defaults recommended by Tinybird are the best parameters to use. Occasionally, however, you might need to tweak these parameters. For this, you can use the **Advanced** tab in Tinybird, where you can write code that's passed to the **View** creation. You need ClickHouse® expertise to modify these parameters. If you aren't sure, stick with the defaults. ## Populates and streaming ingest¶ A populate is the operation of moving data that was present in the origin Data Source before the creation of the Materialized View. The following diagram illustrates the process. At `t1` , the Materialized View is created, so new rows arriving in the origin Data Source are processed into the Materialized View. To move the data from `t0` to `t1` , launch a populate, either manually or when defining the Materialized View, at time `t2`. All the data that arrives between `t1` and `t2` might be materialized twice: once due to the regular materialization process, at ingest time, and the other one due to the populate process. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fpopulate-duplicates-data.png&w=3840&q=75) When you don’t have streaming ingest in the origin Data Source, it's usually a safe operation, as long as no new data arrives while the populate is running. ### Backfill strategies¶ Consider one of the following strategies for backfilling data. #### Two Materialized View Pipes¶ Use a timestamp in the near future to split real-time ingest and backfill. Create the regular MV with a `WHERE` clause specifying that materialized data is newer than a certain timestamp in the future. For example, `WHERE timestamp >= '2024-01-31 00:00:00'`: ##### realtime materialized.pipe NODE node SQL > % SELECT (...) FROM origin_ds WHERE timestamp >= '2024-01-31 00:00:00' TYPE materialized DATASOURCE mv Wait until the desired timestamp has passed, and create the backfill Materialized View Pipe with a `WHERE` clause for data before the specified timestamp. No new data is processed, as the condition can't be met. ##### populate.pipe NODE node SQL > % SELECT (...) FROM origin_ds WHERE timestamp < '2024-01-31 00:00:00' TYPE materialized DATASOURCE mv Finally, because it's now safe, push the backfill Pipe with the `--populate` flag. #### Use Copy Pipes¶ Depending on the transformations, Copy Pipes can substitute the populate for historical data. See [Backfill strategies](https://www.tinybird.co/docs/docs/production/backfill-strategies). These tips only apply for streaming ingest. With batch ingest, or being able to pause ingest, populates are totally safe.
## Use the same alias in SELECT and GROUP BY¶ If you use an alias in the `SELECT` clause, you must reuse the same alias in the `GROUP BY`. Take the following query as an example: ##### Different alias in SELECT and GROUP BY SELECT key, multiIf(value = 0, 'isZero', 'isNotZero') as zero, sum(amount) as amount FROM ds GROUP BY key, value The previous query results in the following error: `Column 'value' is present in the GROUP BY but not in the SELECT clause` To fix this, use the same alias in the `GROUP BY`: ##### GOOD: same alias in SELECT and GROUP BY SELECT key, multiIf(value = 0, 'isZero', 'isNotZero') as zero, sum(amount) as amount FROM ds GROUP BY key, zero ## Don't use nested GROUP BYs¶ Don't use nested `GROUP BY` clauses in the Pipe that creates a Materialized View. While nested aggregations are possible, Materialized Views are processed in independent blocks, and this might yield unexpected results. Tinybird restricts these behaviors and throws an error when they're detected to avoid inaccurate results. Consider the following query with nested `GROUP BY` clauses: ##### Nested GROUP BY in Pipe SELECT product, count() as c FROM ( SELECT key, product, count() as orders FROM ds GROUP BY key, product ) GROUP BY product The previous query throws the following error: Columns 'key, product' are present in the GROUP BY but not in the SELECT clause To fix this, make sure you don't nest `GROUP BY` clauses: ##### Single GROUP BY SELECT key, product, count() as orders FROM ds GROUP BY key, product ## Avoid big scans¶ Avoid big scans in Materialized Views. When using JOINs, do them with a subquery to the Data Source you join to, not the whole Data Source. Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/publish/materialized-views/example-mv-cli Last update: 2024-11-13T15:42:17.000Z Content: --- title: "Example of Materialized View (CLI) · Tinybird Docs" theme-color: "#171612" description: "The following example shows how to create a Materialized View using Tinybird CLI." --- # Example of Materialized View (CLI)¶ Consider an `events` Data Source which, for each action performed in an ecommerce website, stores a timestamp, the user that performed the action, the product, which type of action - `buy`, `add to cart`, `view` , and so on - and a JSON column containing some metadata, such as the price. The `events` Data Source is expected to store billions of rows per month. Its data schema is as follows: ##### DEFINITION OF THE EVENTS.DATASOURCE FILE SCHEMA > `date` DateTime, `product_id` String, `user_id` Int64, `event` String, `extra_data` String ENGINE "MergeTree" ENGINE_PARTITION_KEY "toYear(date)" ENGINE_SORTING_KEY "date, cityHash64(extra_data)" ENGINE_SAMPLING_KEY "cityHash64(extra_data)" You want to publish an API Endpoint calculating the top 10 products in terms of sales for a date range ranked by total amount sold. Here's where Materialized Views can help you. ## Materialize the results¶ After doing the desired transformations, set the `TYPE` parameter to `materialized` and add the name of the Data Source, which materializes the results. 
##### DEFINITION OF THE TOP PRODUCT PER\_DAY.PIPE NODE only_buy_events DESCRIPTION > filters all the buy events SQL > SELECT toDate(date) AS date, product_id, JSONExtractFloat(extra_data, 'price') AS price FROM events WHERE event = 'buy' NODE top_per_day SQL > SELECT date, topKState(10)(product_id) AS top_10, sumState(price) AS total_sales FROM only_buy_events GROUP BY date TYPE materialized DATASOURCE top_products_view Do the rest in the Data Source schema definition for the Materialized View, named `top_products_view`: ##### DEFINITION OF THE TOP PRODUCTS VIEW.DATASOURCE FILE SCHEMA > `date` Date, `top_10` AggregateFunction(topK(10), String), `total_sales` AggregateFunction(sum, Float64) ENGINE "AggregatingMergeTree" ENGINE_SORTING_KEY "date" The destination Data Source uses an [AggregatingMergeTree](https://www.tinybird.co/docs/docs/) engine, which for each `date` stores the corresponding `AggregateFunction` for the top 10 products and the total sales. Having the data precalculated as it gets ingested makes the API Endpoint run in real time, no matter the number of rows in the `events` Data Source. As for the Pipe used to build the API Endpoint, `top_products_agg` , it's as follows: ##### DEFINITION OF THE TOP PRODUCTS PER DAY PIPE NODE top_products_day SQL > SELECT date, topKMerge(10)(top_10) AS top_10, sumMerge(total_sales) AS total_sales FROM dev__top_products_view GROUP BY date When preaggregating, the Aggregate Function uses the mode `State` , while when getting the calculation it makes use of `Merge`. ## Push to Tinybird¶ Once it's done, push everything to your Tinybird account: ##### PUSH YOUR PIPES AND DATA SOURCES USING THE CLI tb push datasources/top_products_view.datasource tb push pipes/top_product_per_day.pipe --populate tb push endpoints/top_products_endpoint.pipe When pushing the `top_product_per_day.pipe` , use the `--populate` flag. This causes the transformation to run in a job, and the Materialized View `top_products_view` to be populated. You can repopulate Materialized Views at any moment: ##### Command to force populate the materialized view tb push pipes/top_product_per_day.pipe --populate --force --- URL: https://www.tinybird.co/docs/publish/materialized-views/example-mv-ui Last update: 2024-11-06T17:38:37.000Z Content: --- title: "Example of Materialized View · Tinybird Docs" theme-color: "#171612" description: "The following example shows how to create a Materialized View in Tinybird." --- # Example of Materialized View¶ The following example shows how to create a Materialized View in Tinybird. For this example, the question to answer is: "What was the average purchase price of all products in each city on a particular day?" ## Get baseline performance¶ First, you need to check the baseline performance before creating an Endpoint on top of the original ingested Data Source. ### Filtering and extracting¶ Create a first node to filter the data to include only `buy` events. Simultaneously, normalize event timestamps to rounded days and extract `city` and `price` data from the `extra_data` column containing JSON. 
##### NODE 1: FILTERING & EXTRACTING SELECT toDate(date) day, JSONExtractString(extra_data, 'city') as city, JSONExtractFloat(extra_data, 'price') as price FROM events WHERE event = 'buy' ### Averaging¶ Next, create a node to aggregate data and get the average purchase price by city per day: ##### Node 2: Averaging SELECT day, city, avg(price) as avg FROM only_buy_events_per_city GROUP BY day, city ### Creating a parameterized API Endpoint¶ Finally, create a node that you can publish as an API Endpoint, adding a parameter to filter results to a particular day: ##### Node 3: Creating a parameterized API Endpoint % SELECT * FROM avg_buy_per_day_and_city WHERE day = {{Date(day, '2020-09-09')}} ## Create the Materialized View¶ A Materialized View is formed by a Pipe that ends in the creation of a new Data Source instead of an Endpoint. Start by duplicating the existing Pipe: 1. Duplicate the `events_pipe` so that you get a Pipe with the same Nodes as the baseline. 2. Rename the Pipe to `events_pipe_mv` . In this case you're going to materialize the second node in the Pipe, because it's the one performing the aggregation. The third node simply provides you with a filter to create a parameterized Endpoint. You can't use [query parameters](https://www.tinybird.co/docs/docs/query/query-parameters) in nodes that are published as Materialized Views. To create the Materialized View: 1. Select the node options. 2. Select** Create a Materialized View from this Node** . 3. Update the** View** settings as required. 4. Select** Create Materialized View** . <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fmaterialized-views-2.gif&w=3840&q=75) Your Materialized View has been created as a new Data Source. By default, the name that Tinybird gives the Data Source is the name of the materialized Pipe Node appended with `_mv` , in this case `avg_buy_per_day_and_city_mv`. Suffix the names of all your transformation Pipes and Materialized View Data Sources with `_mv` , or another common identifier. This example uses a new Data Source. You can also select an existing Data Source as the destination for the Materialized View, but it must have the same schema as the Materialized View output. If both schemas match, Tinybird offers that Data Source as an option that you can select when you're creating a Materialized View. ### Populating Existing Data¶ When Tinybird creates a Materialized View, it initially only populates a partial set of data from the original Data Source. This allows you to quickly validate the results of the Materialized View. Once you have validated with the partial dataset, you can populate the Materialized View with all the existing data in the original Data Source. To do so, select **Populate With All Data**. You now have a Materialized View Data Source that you can use to query against in your Pipes. ### Testing performance improvements¶ To test how the Materialized View has improved the performance of the API Endpoint, return to the original Pipe. In the original Pipe, do the following: 1. Copy the SQL from the original API Endpoint Node, `avg_buy` . 2. Create a new transformation Node, called `avg_buy_mv` . 3. Paste the query from the original API Endpoint Node into your new Node. 4. Update the query to select from your new Materialized View, `avg_buy_per_day_and_city_mv` , as shown in the sketch below.
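Assuming the Materialized View stores the aggregation in a column called `avg` (the `-State` intermediate state described in the best practices guide), the updated Node could look like the following sketch; the `avgMerge` rewrite it uses is explained below.

%
SELECT
    day,
    city,
    avgMerge(avg) AS avg -- reaggregates the intermediate states stored in the Materialized View
FROM avg_buy_per_day_and_city_mv
WHERE day = {{Date(day, '2020-09-09')}}
GROUP BY day, city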
<-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fmaterialized-views-3.gif&w=3840&q=75) Because this query is an aggregation, you need to rewrite it: data in Materialized Views in Tinybird exists in intermediate states. As new data is ingested, the data in the Materialized View gets appended in blocks of partial results. A background process periodically merges the appended partial results and saves them in the Materialized View. Because you are processing data in real time, you might not be able to wait for the background process to complete. To account for this, reaggregate in the API Endpoint query using the `-Merge` combinator. This example uses an `avg` aggregation, so you need to use `avgMerge` to compact the results in the Materialized View. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fmaterialized-views-4.gif&w=3840&q=75) When you run the modified query, you get the same results as you got when you ran the final node against the original Data Source. This time, however, the performance has improved significantly. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fmaterialized-views-5.png&w=3840&q=75) With the Materialized View, you get the same results, but you process less data, twice as fast, at a fraction of the cost. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fmaterialized-views-6.png&w=3840&q=75) ### Pointing the API Endpoint at the new node¶ Now that you've seen how much the performance of the API Endpoint query has improved by using a Materialized View, you can easily change which node the API Endpoint uses. Select the node dropdown, and then select the new node you created by querying the Materialized View. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fmaterialized-views-7.gif&w=3840&q=75) This way, you improve the API Endpoint performance while retaining the original URL, so applications which call that API Endpoint see an immediate performance boost. --- URL: https://www.tinybird.co/docs/publish/materialized-views/overview Last update: 2024-11-13T15:42:17.000Z Content: --- title: "Materialized Views · Tinybird Docs" theme-color: "#171612" description: "Materialized Views process your data at ingest time to increase the performance of your queries and endpoints." --- # Materialized Views¶ A Materialized View is the continuous, streaming result of a Pipe saved as a new Data Source. As new data is ingested into the origin Data Source, the transformed results from the Pipe are continually inserted in the new Materialized View, which you can query as any other Data Source. Tinybird represents Materialized Views using the icon. Preprocessing data at ingest time reduces latency and cost-per-query, and can significantly improve the performance of your API Endpoints. For example, you can transform the data through SQL queries, using calculations such as counts, sums, averages, or arrays, or transformations like string manipulations or joins. The resulting Materialized View acts as a Data Source you can query or publish. Typical use cases of Materialized Views include: - Aggregating, sorting, or filtering data at ingest time. - Improving the speed of a query that's taking too much time to run. - Simplifying query development by automating common filters and aggregations. - Reducing the amount of data processed by a single query. - Changing an existing schema for a different use case. You can create a new Materialized View and populate it with all existing data without any cost.
On-going incremental writes to Materialized Views count towards your Tinybird usage. ## Create Materialized Views¶ To create a Materialized View using the Tinybird UI, follow these steps: 1. Write your Pipe using as many Nodes as needed. 2. Select the downward arrow (▽) next to** Create API Endpoint** and select** Create Materialized View** . 3. Select the node you want to use as output. 4. Edit the name of the destination Data Source. 5. Adjust the Engine Type, Sorting Keys, and so on. For a detailed example, see [Example of Materialized View](https://www.tinybird.co/docs/example-mv-ui). ### Error messages¶ With any Materialized View, Tinybird runs a speed simulation to ensure that the Materialized View won't produce any lag. If you're getting an error in Tinybird that your query is not compatible with real-time ingestion, review your Materialized View query setup. Review [the 5 rules for faster queries](https://www.tinybird.co/docs/docs/query/sql-best-practices) and keep the following principles in mind: 1. Avoid doing huge joins without filtering. 2. Use `GROUP BY` before filtering. 3. Remember that array `JOIN`s are slow. 4. Filter the right side of JOINs to speed up Materialized Views. If you're batching, especially when ingesting from Kinesis, consider decreasing the amount of data you batch. ## Create Materialized Views (CLI)¶ Consider an `origin` data source, for example `my_origin.datasource` , like the following: ##### Origin data source SCHEMA > `id` Int16, `local_date` Date, `name` String, `count` Int64 You might want to create an optimized version of the Data Source that preaggregates `count` for each ID. To do this, create a new Data Source that uses a `SimpleAggregateFunction` as a Materialized View. First, define the `destination` data source, for example `my_destination.datasource`: ##### Destination data source SCHEMA > `id` Int16, `local_date` Date, `name` String, `total_count` SimpleAggregateFunction(sum, UInt64) ENGINE "AggregatingMergeTree" ENGINE_PARTITION_KEY "toYYYYMM(local_date)" ENGINE_SORTING_KEY "local_date,id" Write a transformation Pipe, for example `my_transformation.pipe`: ##### Transformation Pipe NODE transformation_node SQL > SELECT id, local_date, name, sum(count) as total_count FROM my_origin GROUP BY id, local_date, name TYPE materialized DATASOURCE my_destination Once you have the origin and destination Data Sources defined and the transformation Pipe, you can push them: ##### Push the Materialized Views tb push my_origin.datasource tb push my_destination.datasource tb push my_transformation.pipe --populate Any time you ingest data into `my_origin` , the data in `my_destination` is automatically updated. For a detailed example, see [Example of Materialized View (CLI)](https://www.tinybird.co/docs/example-mv-cli). ### Guided process using tb materialize¶ Alternatively, you can use the `tb materialize` command to generate the target .datasource file needed to push a new Materialized View. The goal of the command is to guide you through all the needed steps to create a Materialized View. Given a Pipe, `tb materialize`: 1. Asks you which Node of the Pipe you want to materialize. By default, it selects the last one in the Pipe. If there's only one, it's automatically selected without asking you. From the selected query, the command guesses the best parameters for the following steps. 2. It warns you of any errors in the query that prevent it from materializing. If everything is correct, it continues. 3.
It creates the target Data Source file that receives the results of the materialization, setting default engine parameters. If you are materializing an aggregation you should make sure the `ENGINE_SORTING_KEY` columns in the .datasource file are in the right order you are going to filter the table. 4. It modifies the query to set up the materialization settings and pushes the Pipe to create the materialization. You can skip the Pipe checks if needed as well. 5. It asks you if you want to populate the Materialized View with existing data. If you select to populate, it asks you if you want to use a subset of the data or fully populate with all existing data. 6. It creates a backup file of the Pipe adding the `_bak` suffix to the file extension. It completes the aggregate functions with `-State` combinators where needed and adds the target Data Source name. The backup file is preserved in case you want to recover the original query. The command generates and modifies the files involved in the materialization. If you run into an error or you need to modify something in the materialization, you can reuse the files as a better starting point. ### Force populate Materialized Views¶ Sometimes you might want to force populate a Materialized View, most likely because you changed the transformation in the Pipe and you want the data from the origin Data Source to be reingested. You can do this using `tb push` and the `--force` and `--populate` flags: ##### Populate a Materialized View tb push my_may_view_pipe.pipe --force --populate The response contains a [Jobs API](https://www.tinybird.co/docs/docs/api-reference/jobs-api) `job_url` that you can use to check progress and status of the job. ## Query Materialized Views¶ If the type of engine in your Data Source is `MergeTree` and you're not doing aggregations, you can query a Materialized View as a standard Data Source. If the engine is `AggregatingMergeTree`, `SummingMergeTree` or other special engine, and you have functions in your Pipe with the `-State` modifier, use the `-Merge` modifier, or `max()` and group by the Sorting Key columns. See [Best practices for Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/best-practices). For Deduplication use cases with `ReplacingMergeTree` , see [Deduplication strategies](https://www.tinybird.co/docs/docs/guides/querying-data/deduplication-strategies#use-the-replacingmergetree-engine). ## Limitations¶ Materialized Views work as insert triggers, which means a delete or truncate operation on your original Data Source doesn't affect the related Materialized Views. As transformation and ingestion in the Materialized View is done on each block of inserted data in the original Data Source, some operations such as `GROUP BY`, `ORDER BY`, `DISTINCT` and `LIMIT` might need a specific `engine` , such as `AggregatingMergeTree` or `SummingMergeTree` , which can handle data aggregations. The Data Source resulting from a Materialized View generated using `JOIN` is automatically updated only if and when a new operation is performed over the Data Source in the `FROM`. You can't create Materialized Views that depend on the `UNION` of several Data Sources. ## Considerations on populates¶ As described in [Best practices for Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/best-practices) , populates are the process to move the data that was already in the original Data Source through to the new Materialized View. 
If you have a continuous or streaming ingestion into the original Data Source, the populate might produce duplicated rows in the Materialized View. The populate runs as a separate job that goes partition by partition moving the existing data into the Materialized View. At the same time, the Materialized View is already automatically receiving new rows. There may be an overlap where the populate job moves a partition that includes rows that were already ingested into the Materialized View. To handle this scenario, see [Backfill strategies](https://www.tinybird.co/docs/docs/production/backfill-strategies). ## Next steps¶ - Learn how to make the most of Materialized Views. See[ Best Practices](https://www.tinybird.co/docs/best-practices) . - Review[ the 5 rules for faster queries](https://www.tinybird.co/docs/docs/query/sql-best-practices) . --- URL: https://www.tinybird.co/docs/publish/overview Last update: 2024-11-07T09:59:13.000Z Content: --- title: "Publish data · Tinybird Docs" theme-color: "#171612" description: "Overview of publishing data using Tinybird" --- # Publish your data¶ Whatever you need from your data, you can achieve it using Tinybird. Publish it as a queryable [API Endpoint](https://www.tinybird.co/docs/docs/publish/api-endpoints/overview) , a [Materialized View](https://www.tinybird.co/docs/docs/publish/materialized-views/overview) , or an advanced type of Tinybird Pipe (either a [Copy Pipe](https://www.tinybird.co/docs/docs/publish/copy-pipes) or a [Sink Pipe](https://www.tinybird.co/docs/docs/publish/s3-sink#sink-pipes) ). If you're new to Tinybird and looking to learn a simple flow of ingest data > query it > publish an API Endpoint, check out our [quick start](https://www.tinybird.co/docs/docs/quick-start)! --- URL: https://www.tinybird.co/docs/publish/s3-sink Last update: 2024-11-08T11:23:54.000Z Content: --- title: "S3 Sink · Tinybird Docs" theme-color: "#171612" description: "Offload data to S3 on a batch-based schedule using Tinybird's fully managed S3 Sink Connector." --- # S3 Sink¶ Tinybird's S3 Sink allows you to offload data to Amazon S3, either on a pre-defined schedule or on demand. It's good for a variety of different scenarios where Amazon S3 is the common ground, for example: - You're building a platform on top of Tinybird, and need to send data extracts to your clients on a regular basis. - You want to export new records to Amazon S3 every day, so you can load them into Snowflake to run ML recommendation jobs. - You need to share the data you have in Tinybird with other systems in your organization, in bulk. The Tinybird S3 Sink feature is available for Professional and Enterprise plans (see ["Tinybird plans"](https://www.tinybird.co/docs/docs/plans) ). If you are on a Build plan but want to access this feature, you can upgrade to Professional directly from your account Settings, or contact us at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co). Tinybird represents Sinks using the icon. ## About Tinybird's S3 Sink¶ ### How it works¶ Tinybird's S3 Sink is fully managed and requires no additional tooling. You create a new connection to an Amazon S3 bucket, then choose a Pipe whose result gets written to Amazon S3. Tinybird provides you with complete observability and control over the executions, resulting files, their size, data transfer, and more. ### Why S3?¶ Amazon S3 is very commonly used in data. Almost every service offers a way to import data from Amazon S3 (or S3-compatible storage). 
This Sink Connector enables you to use Tinybird's analytical capabilities to transform your data and provide it for onward use via Amazon S3 files. ### Sink Pipes¶ The Sink connector is built on Tinybird's Sink Pipes, an extension of the [Pipes](https://www.tinybird.co/docs/docs/concepts/pipes) concept, similar to [Copy Pipes](https://www.tinybird.co/docs/docs/publish/copy-pipes) . Sink Pipes allow you to capture the result of a Pipe at a moment in time, and store the output. Currently, Amazon S3 is the only service Tinybird's Sink Pipes support. Sink Pipes can be run on a schedule, or executed on demand. ### Supported regions¶ The Tinybird S3 Sink feature only supports exporting data to the following AWS regions: - `us-east-*` - `us-west-*` - `eu-central-*` - `eu-west-*` - `eu-south-*` - `eu-north-*` ### Prerequisites¶ To use the Tinybird S3 Sink feature, you should be familiar with Amazon S3 buckets and have the necessary permissions to set up a new policy and role in AWS. ### Scheduling considerations¶ The schedule applied to a [Sink Pipe](https://www.tinybird.co/docs/about:blank#sink-pipes) doesn't guarantee that the underlying job executes immediately at the configured time. The job is placed into a job queue when the configured time elapses. It is possible that, if the queue is busy, the job could be delayed and executed after the scheduled time. To reduce the chances of a busy queue affecting your Sink Pipe execution schedule, we recommend distributing the jobs over a wider period of time rather than grouped close together. For Enterprise customers, these settings can be customized. Reach out to your Customer Success team or email us at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co). ### Query parameters¶ You can add [query parameters](https://www.tinybird.co/docs/docs/query/query-parameters) to your Sink Pipes, the same way you do in API Endpoints or Copy Pipes. - For on-demand executions, you can set parameters when you trigger the Sink Pipe to whatever values you wish. - For scheduled executions, the default values for the parameters will be used when the Sink Pipe runs. ## Set up¶ The setup process involves configuring both Tinybird and AWS: 1. Create your Pipe and promote it to Sink Pipe 2. Create the AWS S3 connection 3. Choose the scheduling options 4. Configure destination path and file names 5. Preview and trigger your new Sink Pipe ### Using the UI¶ #### 1. Create a Pipe and promote it to Sink Pipe¶ In the Tinybird UI, create a Pipe and write the query that produces the result you want to export. In the top right "Create API Endpoint" menu, select "Create Sink". In the modal, choose the destination (Amazon S3), and enter the bucket name and region. Follow the step-by-step process on the modal to guide you through the AWS setup steps, or use the docs below. #### 2. Create the AWS S3 Connection¶ ##### 2.1. Create the S3 access policy First, create an IAM policy that grants the IAM role permissions to write to S3. Open the AWS console and navigate to IAM > Policies, then select “Create Policy”: <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fs3-sink-step-1-access-policy.png&w=3840&q=75) On the “Specify Permissions” screen, select “JSON” and paste the policy generated in the UI by clicking on the Copy icon. 
It'll look something like this: { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:PutObject", "s3:PutObjectAcl" ], "Resource": "arn:aws:s3::://*" }, { "Effect": "Allow", "Action": [ "s3:GetBucketLocation", "s3:ListBucket" ], "Resource": "arn:aws:s3:::" } ] } Select “Next”, add a memorable name in the following dialog box (you’ll need it later!), and select “Create Policy”. ##### 2.2. Create the IAM role In the AWS console, navigate to IAM > Roles and select “Create Role”: <-figure-> ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fs3-sink-step-3-roles.png&w=3840&q=75) On the “Select trusted entity” dialog box, select the "Custom trust policy” option. Copy the generated JSON and paste it into the Tinybird UI modal. Select "Next". On the “Add permissions” screen, find the policy for S3 access you just created and tick the checkbox to the left of it. Select "Next" and give it a meaningful name and description. Confirm that the trusted entities and permissions granted are the expected ones, and select "Create Role". You’ll need the role’s ARN (Amazon Resource Name) in order to create the connection in the next step. To save you having to come back and look for it, go to IAM > Roles and use the search box to find the role you just created. Select it to open more role details, including the role's ARN. Copy it down somewhere you can find it easily again. It'll look something like `arn:aws:iam::111111111111:role/my-awesome-role`. Return to Tinybird's UI and enter the role ARN and Connection name in the modal. The Connection to AWS S3 is now created in Tinybird, and can be reused in multiple Sinks. #### 3. Choose the scheduling options¶ You can configure your Sink to run "on demand" (meaning you'll need to manually trigger it) or using a cron expression, so it runs automatically when needed. #### 4. Configure destination path and file names¶ Enter the bucket URI where files will be generated (you can use subfolders), and the file name template. When generating multiple files, the Sink creates them using this template. You have multiple ways to configure this - see the [File template](https://www.tinybird.co/docs/about:blank#file-template) section below. #### 5. Preview and create¶ The final step is to check and confirm that the preview matches what you expect. Congratulations! You've created your first Sink. Trigger it manually using the "Run Sink now" option in the top right menu, or wait for the next scheduled execution. When triggering a Sink Pipe you have the option of overriding several of its settings, like format or compression. Refer to the [Sink Pipes API spec](https://www.tinybird.co/docs/docs/api-reference/sink-pipes-api) for the full list of parameters. Once the Sink Pipe is triggered, it creates a standard Tinybird job that can be followed via the `v0/jobs` API. ### Using the CLI¶ #### 1. Create the AWS S3 Connection¶ To create a connection for an S3 Sink Pipe you need to use a CLI version equal to or higher than 3.5.0. To start: 1. Run the `tb connection create s3_iamrole` command. 2. Copy the suggested policy and replace the two bucket placeholders with your bucket name. 3. Log into your AWS Console. 4. Create a new policy in AWS IAM > Policies using the copied text. In the next step, you'll need the role's ARN (Amazon Resource Name) to create the connection. Go to IAM > Roles and use the search box to find the role you just created. Select it to open more role details, including the role's ARN.
Copy it and paste it into the CLI when requested. It'll look something like `arn:aws:iam::111111111111:role/my-awesome-role`. Then, you will need to type the region where the bucket is located and choose a name to identify your connection within Tinybird. Once you have completed all these inputs, Tinybird will check access to the bucket and create the connection with the connection name you selected. #### 2. Create S3 Sink Pipe¶ To create a Sink Pipe, create a regular .pipe and filter the data you want to export to your bucket in the SQL section as in any other Pipe. Then, specify the Pipe as a sink type and add the needed configuration. Your Pipe should have the following structure: NODE node_0 SQL > SELECT * FROM events WHERE time >= toStartOfMinute(now()) - interval 30 minute TYPE sink EXPORT_SERVICE s3_iamrole EXPORT_CONNECTION_NAME "test_s3" EXPORT_BUCKET_URI "s3://tinybird-sinks" EXPORT_FILE_TEMPLATE "daily_prices" # Supports partitioning EXPORT_SCHEDULE "*/5 * * * *" # Optional EXPORT_FORMAT "csv" EXPORT_COMPRESSION "gz" # Optional **Sink Pipe parameters** See the [Sink Pipe parameter docs](https://www.tinybird.co/docs/docs/cli/datafiles/pipe-files#sink-pipe) for more information. For this step, your details will be: | Key | Type | Description | | --- | --- | --- | | EXPORT_CONNECTION_NAME | string | Required. The connection name to the destination service. This is the connection created in Step 1. | | EXPORT_BUCKET_URI | string | Required. The path to the destination bucket. Example: `s3://tinybird-export` | | EXPORT_FILE_TEMPLATE | string | Required. The target file name. Can use parameters to dynamically name and partition the files. See the File template section below. Example: `daily_prices_{customer_id}` | | EXPORT_FORMAT | string | Optional. The output format of the file. Values: CSV, NDJSON, Parquet. Default value: CSV | | EXPORT_COMPRESSION | string | Optional. Accepted values: `none` , `gz` for gzip, `br` for brotli, `xz` for LZMA, `zst` for zstd. Default: `none` | | EXPORT_SCHEDULE | string | A crontab expression that sets the frequency of the Sink operation, or the @on-demand string. | Once ready, push the datafile to your Workspace using `tb push` (or `tb deploy` if you are using [version control](https://www.tinybird.co/docs/docs/production/overview) ) to create the Sink Pipe. ## File template¶ The export process lets you partition the result into different files, so you can organize your data and get smaller files. The partitioning is defined in the file template and based on the values of columns in the result set. ### Partition by column¶ Add a template variable like `{COLUMN_NAME}` to the file name. For instance, let’s set the file template as `invoice_summary_{customer_id}.csv`.
Imagine your query schema and result for an export is like this: | customer_id | invoice_id | amount | | --- | --- | --- | | ACME | INV20230608 | 23.45 | | ACME | 12345INV | 12.3 | | GLOBEX | INV-ABC-789 | 35.34 | | OSCORP | INVOICE2023-06-08 | 57 | | ACME | INV-XYZ-98765 | 23.16 | | OSCORP | INV210608-001 | 62.23 | | GLOBEX | 987INV654 | 36.23 | With the given file template `invoice_summary_{customer_id}.csv` you’d get 3 files: `invoice_summary_ACME.csv` | customer_id | invoice_id | amount | | --- | --- | --- | | ACME | INV20230608 | 23.45 | | ACME | 12345INV | 12.3 | | ACME | INV-XYZ-98765 | 23.16 | `invoice_summary_OSCORP.csv` | customer_id | invoice_id | amount | | --- | --- | --- | | OSCORP | INVOICE2023-06-08 | 57 | | OSCORP | INV210608-001 | 62.23 | `invoice_summary_GLOBEX.csv` | customer_id | invoice_id | amount | | --- | --- | --- | | GLOBEX | INV-ABC-789 | 35.34 | | GLOBEX | 987INV654 | 36.23 | ### Values format¶ In the case of DateTime columns, it can be dangerous to partition just by the column. Why? Because you could end up with as many files as seconds, as they’re the different values for a DateTime column. In an hour, that’s potentially 3600 files. To help partition in a sensible way, you can add a format string to the column name using the following placeholders: | Placeholder | Description | Example | | --- | --- | --- | | %Y | Year | 2023 | | %m | Month as an integer number (01-12) | 06 | | %d | Day of the month, zero-padded (01-31) | 07 | | %H | Hour in 24h format (00-23) | 14 | | %i | Minute (00-59) | 45 | For instance, for a result like this: | timestamp | invoice_id | amount | | --- | --- | --- | | 2023-07-07 09:07:05 | INV20230608 | 23.45 | | 2023-07-07 09:07:01 | 12345INV | 12.3 | | 2023-07-07 09:06:45 | INV-ABC-789 | 35.34 | | 2023-07-07 09:05:35 | INVOICE2023-06-08 | 57 | | 2023-07-06 23:14:05 | INV-XYZ-98765 | 23.16 | | 2023-07-06 23:14:02 | INV210608-001 | 62.23 | | 2023-07-06 23:10:55 | 987INV654 | 36.23 | Note that all 7 events have different times in the column timestamp. Using a file template like `invoices_{timestamp}` would create 7 different files. If you were interested in writing one file per hour, you could use a file template like `invoices_{timestamp, ‘%Y%m%d-%H’}` . You'd then get only two files for that dataset: `invoices_20230707-09.csv` | timestamp | invoice_id | amount | | --- | --- | --- | | 2023-07-07 09:07:05 | INV20230608 | 23.45 | | 2023-07-07 09:07:01 | 12345INV | 12.3 | | 2023-07-07 09:06:45 | INV-ABC-789 | 35.34 | | 2023-07-07 09:05:35 | INVOICE2023-06-08 | 57 | `invoices_20230706-23.csv` | timestamp | invoice_id | amount | | --- | --- | --- | | 2023-07-06 23:14:05 | INV-XYZ-98765 | 23.16 | | 2023-07-06 23:14:02 | INV210608-001 | 62.23 | | 2023-07-06 23:10:55 | 987INV654 | 36.23 | ### By number of files¶ You also have the option to write the result into X files. Instead of using a column name, use an integer between brackets. Example: `invoice_summary.{8}.csv` This is convenient to reduce the file size of the result, especially when the files are meant to be consumed by other services, like Snowflake where uploading big files is discouraged. The results are written in random order. This means that the final result rows would be written in X files, but you can’t count the specific order of the result. There are a maximum of 16 files. ### Combining different partitions¶ It’s possible to add more than one partitioning parameter in the file template. 
This is useful, for instance, when you do a daily dump of data, but want to export one file per hour. Setting the file template as `invoices/dt={timestamp, ‘%Y-%m-%d’}/H{timestamp, ‘%H’}.csv` would create the following file structure in different days and executions: Invoices ├── dt=2023-07-07 │ └── H23.csv │ └── H22.csv │ └── H21.csv │ └── ... ├── dt=2023-07-06 │ └── H23.csv │ └── H22.csv You can also mix column names and number of files. For instance, setting the file template as `invoices/{customer_id}/dump_{4}.csv` would create the following file structure in different days and executions: Invoices ├── ACME │ └── dump_0.csv │ └── dump_1.csv │ └── dump_2.csv │ └── dump_3.csv ├── OSCORP │ └── dump_0.csv │ └── dump_1.csv │ └── dump_2.csv │ └── dump_3.csv Be careful with excessive partitioning. Take into consideration that the write process will create as many files as there are combinations of values of the partitioning columns for a given result set. ## Iterating a Sink Pipe¶ Sink Pipes can be iterated using [version control](https://www.tinybird.co/docs/docs/production/overview) just like any other resource in your Data Project. However, you need to understand how connections work in Branches and deployments in order to select the appropriate strategy for your desired changes. **By default, Branches don't execute recurrent jobs on creation** , like the scheduled ones for Sink Pipes (they continue executing in Production as usual). To iterate a Sink Pipe, create a new one (or recreate the existing one) with the desired configuration. The new Sink Pipe will start executing from the Branch too (without affecting the unchanged production resource). It will use the new configuration and export the new files into a **branch-specific folder** `my_bucket//branch_/` (the `` is optional). This means you can test the changes without mixing the test resource with your production exports. When you deploy that Branch, the specific folder in the path is automatically ignored and production continues to point to `my_bucket/prefix/` or the new path you changed. Take into account that, for now, while you can change the Sink Pipe configuration using version control, new connections to S3 must be created directly in the Workspace. There is an example of how to create a Sink Pipe to S3 with version control [here](https://github.com/tinybirdco/use-case-examples/tree/main/create_pipe_sink). ## Observability¶ Sink Pipe operations are logged in the [tinybird.sinks_ops_log](https://www.tinybird.co/docs/docs/monitoring/service-datasources#tinybird-sinks-ops-log) Service Data Source. Data Transfer incurred by Sink Pipes is tracked in the [tinybird.data_transfer](https://www.tinybird.co/docs/docs/monitoring/service-datasources#tinybird-data-transfer) Service Data Source. ## Limits & quotas¶ Check the [limits page](https://www.tinybird.co/docs/docs/support/limits) for limits on ingestion, queries, API Endpoints, and more. ## Billing¶ Tinybird uses two metrics for billing Sink Pipes: Processed Data and Data Transfer. A Sink Pipe executes the Pipe’s query (Processed Data), and writes the result into a Bucket (Data Transfer). If the resulting files are compressed, Tinybird accounts for the compressed size. ### Processed Data¶ Any Processed Data incurred by a Sink Pipe is charged at the standard rate for your account. The Processed Data is already included in your plan, and counts towards your commitment.
If you're on an Enterprise plan, view your plan and commitment on the [Organizations](https://www.tinybird.co/docs/docs/monitoring/organizations) tab in the UI. ### Data Transfer¶ Data Transfer depends on your environment. There are two scenarios: - The destination bucket is in the** same** cloud provider and region as your Tinybird Workspace: $0.01 / GB - The destination bucket is in a** different** cloud provider or region as your Tinybird Workspace: $0.10 / GB ### Enterprise customers¶ We're including 50 GB for free for every Enterprise customer, so you can test the feature and validate your use case. After that, we're happy to set up a meeting to understand your use case and adjust your contract accordingly, to accommodate the necessary Data Transfer. ## Next steps¶ - Get familiar with the[ Service Data Source](https://www.tinybird.co/docs/docs/monitoring/service-datasources) and see what's going on in your account - Deep dive on Tinybird's[ Pipes concept](https://www.tinybird.co/docs/docs/concepts/pipes) --- URL: https://www.tinybird.co/docs/query/bi-connector Last update: 2024-11-06T17:38:37.000Z Content: --- title: "BI Connector · Tinybird Docs" theme-color: "#171612" description: "Learn about the BI Connector in Tinybird, including how to connect and best practices." --- # BI Connector¶ The BI Connector is a PostgreSQL-compatible interface to data in Tinybird. You can connect many of your favorite tools to Tinybird, such as Tableau, Apache SuperSet, or Grafana. All Data Sources and published Pipes created in Tinybird are available as standard PostgreSQL tables. You can query using standard PostgreSQL syntax and use any tool that implements the PostgreSQL protocol. Tinybird uses TLS 1.2 and higher for BI Connector connections. The BI Connector isn't active by default. Contact Tinybird support ( [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) ) to discuss if your use case is supported. After we activate the connector, Tinybird sends you all the connection details. ## Compatibility¶ The BI Connector has been successfully tested on the following solutions: - DBeaver - PowerBI - Tableau - Grafana - Metabase - Superset - Klipfolio - Anodot ## Connect to Tinybird¶ You can connect and access your database as you would with regular PostgreSQL. You can use your command-line tool, any graphical interface such as DBbeaver, or any other SQL client such as, for example, your favorite BI tool. For example: psql -U --host bi.us-east.tinybird.co --port 5432 -d After connecting to your database, in the public schema, if you list all the available views you get all the available Data Sources and endpoints. If you're using the PostgreSQL command line you can type `\dv` to get the list of views. If you list all the tables, a table per Data Source appears. The name of the table is the identifier of the Data Source. Don't use those IDs. Instead, find your tables with their user-friendly names in the views space. ## Best practices¶ ### 1. Avoid queries on exposed Pipes¶ Both your Pipes and Data Sources are exposed in the BI Connector as tables. Query against Data Sources tables only, and avoid using any Pipes tables. In general, queries on Data Sources tables are much faster & easier to optimize. ### 2. Use Materialized Views¶ With Tinybird you can build Pipes that materialize the result as a [Materialized View](https://www.tinybird.co/docs/docs/publish/materialized-views/overview) . Materialized Views appear in the BI Connector as tables. 
Rather than querying the raw data via the BI Connector, build your transformations, filters, aggregations, or joins in standard Tinybird Pipes to create a Materialized View. You can then use the BI Connector to read the pre-prepared Materialized View. For example, if your dashboard shows events aggregated by day, create a Materialized View with pre-aggregated data by day, and have your BI chart read this. ### 3. Use dedicated Materialized Views for different charts¶ If you have multiple charts or widgets consuming data, try to use Materialized Views that are specifically built for that chart or widget. This helps to avoid additional transformation operations. ### 4. Use efficient sorting keys¶ Data stored in Tinybird is sorted using a sorting key. This helps to efficiently find data that is relevant to your query. Use the appropriate sorting keys for your data according to the needs of your dashboard. For example, if your dashboard is filtering by time, you should have time in the table's sorting key. ### 5. Avoid column mappings¶ Avoid using column mappings. Filtering by a mapped column can mean that the index (sorting key) is not being used. ### 6. Avoid JOINs¶ Don't do JOINs over the BI Connector where possible. Instead, do any JOINs inside a Pipe and materialize the result as a Materialized View, then query the already-JOINed data. ### 7. Monitor usage with bi\_stats\_rt¶ The [bi_stats_rt](https://www.tinybird.co/docs/docs/monitoring/service-datasources#tinybird-bi-stats-rt) table contains observability data about your usage of the BI Connector. ## Limits¶ The following limits apply: - Query timeout is 10 seconds. - Resources are limited to 1 thread per query. - Concurrent queries are limited to 100 per cluster. - Queries over the BI Connector are only allocated 1 CPU core per query. This only applies to the BI Connector, not to standard Tinybird Pipes. - Single queries mustn't return more than 1 GB of data at a time. - The total amount of data returned by all queries executed within a 5-minute period must not exceed 6 GB. This is to prevent overloading the system with too much data in a short period of time. Some settings can be adjusted on request. Contact Tinybird support ( [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) ) to discuss if we can support your use case. ## Limitations¶ The following limitations apply: - You can't use parameters in your endpoints. However, you can still filter your data using SQL `WHERE` clauses. - Some widgets of BI tools create SQL queries that aren't compatible with the BI connector. - Some JOIN operations run in Postgres and not ClickHouse®, which means they might run much, much more slowly. Avoid JOINs over the BI Connector. - Sometimes, PostgreSQL might rewrite the query so that filters might end up not using sorting keys. Sorting keys allow you to run faster queries, and therefore those resulting queries might be slower and more expensive since they would require full table scans. - Querying large datasets can result in a timeout. Large generally means many millions of rows, but this is also affected by how many columns and how large the values are. ## Next steps¶ - Read up on[ Materialized View](https://www.tinybird.co/docs/docs/publish/materialized-views/overview) and getting the most from your data. - Understand how to use and query Tinybird's[ Service Data Sources](https://www.tinybird.co/docs/docs/monitoring/service-datasources) . Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. 
ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/query/overview Last update: 2024-10-16T16:43:40.000Z Content: --- title: "Query overview · Tinybird Docs" theme-color: "#171612" description: "Explore and manipulate your data in Tinybird to make it more useful and relevant." --- # Query your ingested data¶ After you've brought your data to Tinybird, you can explore and manipulate it to make it more useful and relevant. ## Data Flow¶ Data Flow visualizes how your Data Sources, API Endpoints, Materialized Views, and Pipes connect and relate to each other. You can filter the view by keyword, resource type, or tag. Select an item to see its details in the side panel. <-figure-> ![Data Flow demonstration](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fdataflow.gif&w=3840&q=75) ## Playground¶ Playgrounds are sandbox environments where you can test your queries using ingested data. For example, you can use playgrounds to quickly query real-time production data, debug existing queries, or prototype new Pipes. You can download any playground by selecting **Download** . You can then add the .pipe file to your project. To share your Playground contents with other users of the Workspace, select **Share**. For more information on the statements, functions, and settings you can use in queries, see [Supported syntax](https://www.tinybird.co/docs/docs/support/syntax). <-figure-> ![Playground demonstration](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fplayground.png&w=3840&q=75) ## Time Series¶ Time series help analyze a sequence of data points collected over an interval of time. Use the Time Series feature to visualize any time series Data Source in your Workspace, including [Service Data Sources](https://www.tinybird.co/docs/docs/monitoring/service-datasources). When you create a new Time Series in your Workspace, your team can view it. You can also generate a public URL to share the visualization outside of your team. Viewers using the public URL can explore the chart but can't change the original data source, filters, or groupings. <-figure-> ![Time Series example](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Ftimeseries.png&w=3840&q=75) ## Next steps¶ - Learn how to[ use query parameters](https://www.tinybird.co/docs/docs/query/query-parameters) . - Read the[ SQL best practices](https://www.tinybird.co/docs/docs/query/sql-best-practices) . --- URL: https://www.tinybird.co/docs/query/query-parameters Last update: 2024-11-14T10:11:48.000Z Content: --- title: "Using query parameters · Tinybird Docs" theme-color: "#171612" description: "Query parameters are great for any value of the query that you might want control dynamically from your applications." --- # Using query parameters¶ Query parameters are great for any value of the query that you might want control dynamically from your applications. For example, you can get your API Endpoint to answer different questions by passing a different value as query parameter. Using dynamic parameters means you can do things like: - Filtering as part of a `WHERE` clause. - Changing the number of results as part of a `LIMIT` clause. - Sorting order as part of an `ORDER BY` clause. - Selecting specific columns for `ORDER BY` or `GROUP BY` clauses. ## Define dynamic parameters¶ To make a query dynamic, start the query with a `%` character. That signals the engine that it needs to parse potential parameters. 
Tinybird automatically inserts the `%` character in the first line when you add a parameter to a Node. After you have created a dynamic query, you can define parameters using the following pattern: `{{<data_type>(<name>, <default_value>, description=<"This is a description">, required=<True|False>)}}` . For example: ##### Simple select clause using dynamic parameters % SELECT * FROM TR LIMIT {{Int32(lim, 10, description="Limit the number of rows in the response", required=False)}} The previous query returns 10 results by default, or however many are specified in the `lim` parameter when requesting data from that API Endpoint. ## Use Pipes API Endpoints with dynamic parameters¶ When using a Pipe's API Endpoint that uses parameters, pass in the desired parameters. Using the previous example, where `lim` sets the maximum number of rows you want to get, the request would look like this: ##### Using a Pipes API Endpoint containing dynamic parameters curl "https://api.tinybird.co/v0/pipes/tr_pipe?lim=20&token=...." You can specify parameters in more than one Node in a Data Pipe. When invoking the API Endpoint through its URL, the passed parameters are included in the request. You can't use query parameters in Nodes that are published as [Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/overview). ## Leverage dynamic parameters¶ As well as using dynamic parameters in your API Endpoints, you can then leverage them further downstream for monitoring purposes. When you pass a parameter to your queries, you can build Pipes that reference the parameters and query the Service Data Sources with them, even if you don't use them in the API Endpoints themselves. Review the [Service Data Sources docs](https://www.tinybird.co/docs/docs/monitoring/service-datasources) to see the available options. For example, using the `user_agent` column on `pipe_stats_rt` shows which user agent made the request. Pass any additional values you need as parameters to improve visibility and avoid, or get insights into, incidents and Workspace performance. This process helps you forward details like the user agent from your app's requests all the way through to Tinybird, and track whether the request was made from the app and which device was used. ##### Example query to the pipe\_stats\_rt Service Data Source leveraging a passed 'referrer' parameter SELECT toStartOfMinute(start_datetime) as date, count(), parameters['referrer'] FROM tinybird.pipe_stats_rt WHERE ((pipe_id = '' and status_code != 429) or (pipe_name = '' and status_code != 429)) and start_datetime > now() - interval 1 hour GROUP BY date, parameters['referrer'] ORDER BY count() DESC, date DESC ## Available data types for dynamic parameters¶ You can use the following data types for dynamic parameters (a combined sketch follows the list): - `Boolean` : Accepts `True` and `False` as values, as well as strings like `'TRUE'` , `'FALSE'` , `'true'` , `'false'` , `'1'` , or `'0'` , or the integers `1` and `0` . - `String` : For any string values. - `DateTime64` , `DateTime` and `Date` : Accepts values like `YYYY-MM-DD HH:MM:SS.MMM` , `YYYY-MM-DD HH:MM:SS` and `YYYYMMDD` respectively. - `Float32` and `Float64` : Accepts floating point numbers of either 32 or 64 bit precision. - `Int` or `Integer` : Accepts integer numbers of any precision. - `Int8` , `Int16` , `Int32` , `Int64` , `Int128` , `Int256` and `UInt8` , `UInt16` , `UInt32` , `UInt64` , `UInt128` , `UInt256` : Accepts signed or unsigned integer numbers of the specified precision.
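As a rough sketch of how several of these types can be combined in a single Node, the following example filters a hypothetical `events` Data Source by a date range, a boolean flag, and a row limit (the Data Source, column, and parameter names are illustrative only):

##### Sketch: combining several parameter types in one Node
%
SELECT *
FROM events
-- Date range with defaults, so the Endpoint also works when no parameters are passed
WHERE event_date >= {{Date(start_date, '2024-01-01', description="Start of the date range", required=False)}}
  AND event_date <= {{Date(end_date, '2024-12-31')}}
  -- Boolean flag: accepts true/false, 'true'/'false', 1/0
  AND is_important = {{Boolean(only_important, False)}}
LIMIT {{Int32(lim, 100)}}

Calling the Endpoint without query parameters uses the defaults shown above; passing, for example, `?only_important=true&lim=10` overrides them.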
### Use column parameters¶ You can use `column` to pass along column names of a defined type as parameters, like: ##### Using column dynamic parameters % SELECT * FROM TR ORDER BY {{column(order_by, 'timestamp')}} LIMIT {{Int32(lim, 10)}} Always define the `column` function's second argument, the one for the default value. The alternative to not defining a default is to validate that the first argument is defined, but this only takes effect when the API Endpoint is executed; a placeholder is used during development of the Pipe. ##### Validate the column parameter when not defining a default value % SELECT * FROM TR {% if defined(order_by) %} ORDER BY {{column(order_by)}} {% end %} ### Pass arrays¶ You can pass along a list of values with the `Array` function for parameters, like so: ##### Passing arrays as dynamic parameters % SELECT * FROM TR WHERE access_type IN {{Array(access_numbers, 'Int32', default='101,102,110')}} ## Send stringified JSON as parameter¶ Consider the following stringified JSON: "filters": [ { "operand": "date", "operator": "equals", "value": "2018-01-02" }, { "operand": "high", "operator": "greater_than", "value": "100" }, { "operand": "symbol", "operator": "in_list", "value": "AAPL,AMZN" } ] You can use the `JSON()` function to use `filters` as a query parameter. The following example shows how to use the `filters` field from the JSON snippet with the stock_prices_1m sample dataset. % SELECT symbol, date, high FROM stock_prices_1m WHERE 1 {% if defined(filters) %} {% for item in JSON(filters, '[]') %} {% if item.get('operator', '') == 'equals' %} AND {{ column(item.get('operand', '')) }} == {{ item.get('value', '') }} {% elif item.get('operator') == 'greater_than' %} AND {{ column(item.get('operand', '')) }} > {{ item.get('value', '') }} {% elif item.get('operator') == 'in_list' %} AND {{ column(item.get('operand', '')) }} IN splitByChar(',',{{ item.get('value', '') }}) {% end %} {% end %} {% end %} When accessing the fields in a JSON object, use the following syntax: `item.get('Field', 'Default value to avoid SQL errors')` . ### Pagination¶ You paginate results by adding `LIMIT` and `OFFSET` clauses to your query. You can parameterize the values of these clauses, allowing you to pass pagination values as query parameters to your API Endpoint. Use the `LIMIT` clause to select only the first `n` rows of a query result. Use the `OFFSET` clause to skip `n` rows from the beginning of a query result. Together, you can dynamically chunk the results of a query up into pages. For example, the following query introduces two dynamic parameters, `page_size` and `page` , which let you control the pagination of a query result using query parameters on the URL of an API Endpoint. ##### Paging results using dynamic parameters % SELECT * FROM TR LIMIT {{Int32(page_size, 100)}} OFFSET {{Int32(page, 0) * Int32(page_size, 100)}} You can also use pages to perform calculations such as `count()` . The following example counts the total number of pages: ##### Operation on a paginated endpoint % SELECT count() as total_rows, ceil(total_rows/{{Int32(page_size, 100)}}) pages FROM endpoint_to_paginate To get consistent pagination results, add an `ORDER BY` clause to your paginated queries. ## Advanced templating using dynamic parameters¶ To build more complex queries, use flow control operators like `if`, `else` and `elif` in combination with the `defined()` function, which helps you check whether a parameter has been received and act accordingly.
Tinybird's templating system is based on the [Tornado Python framework](https://github.com/tornadoweb/tornado) , and uses Python syntax. You must enclose control statements in curly brackets with percentages `{%..%}` as in the following example: ##### Advanced templating using dynamic parameters % SELECT toDate(start_datetime) as day, countIf(status_code < 400) requests, countIf(status_code >= 400) errors, avg(duration) avg_duration FROM log_events WHERE endsWith(user_email, {{String(email, 'gmail.com')}}) AND start_datetime >= {{DateTime(start_date, '2019-09-20 00:00:00')}} AND start_datetime <= {{DateTime(end_date, '2019-10-10 00:00:00')}} {% if method != 'All' %} AND method = {{String(method,'POST')}} {% end %} GROUP BY day ORDER BY day DESC ### Validate presence of a parameter¶ ##### Validate if a param is in the query % select * from table {% if defined(my_filter) %} where attr > {{Int32(my_filter)}} {% end %} When you call the API Endpoint with `/v0/pipes/:PIPE.json?my_filter=20` it applies the filter. ### Default parameter values and placeholders¶ Following best practices, you should set default parameter values as follows: ##### Default parameter values % SELECT * FROM table WHERE attr > {{Int32(my_filter, 10)}} When you call the API Endpoint with `/v0/pipes/:PIPE.json` without setting any value to `my_filter` , it automatically applies the default value of 10. If you don't set a default value for a parameter, you should validate that the parameter is defined before using it in the query as explained previously. If you don't validate the parameter and it's not defined, the query might fail. Tinybird populates the parameter with a placeholder value based on the data type. For instance, numerical data types are populated with 0, strings with `__placeholder__` , and date and timestamps with `2019-01-01` and `2019-01-01 00:00:00` respectively. You could try yourself with a query like this: ##### Get placeholder values % SELECT {{String(param)}} as placeholder_string, {{Int32(param)}} as placeholder_num, {{Boolean(param)}} as placeholder_bool, {{Float32(param)}} as placeholder_float, {{Date(param)}} as placeholder_date, {{DateTime(param)}} as placeholder_ts, {{Array(param)}} as placeholder_array This returns the following values: { "placeholder_string": "__placeholder__", "placeholder_num": 0, "placeholder_bool": 0, "placeholder_float": 0, "placeholder_date": "2019-01-01", "placeholder_ts": "2019-01-01 00:00:00", "placeholder_array": ["__placeholder__0","__placeholder__1"] } ### Test dynamic parameters¶ Any dynamic parameters you create appears in the UI along with a "Test new values" button. Select this button to open a test dialog populated with the default value of your parameters. The test dialog is a convenient way to quickly test different Pipe values than the default ones without impacting production. Use the View API page to see API Endpoint metrics resulting from that specific combination of parameters. Close the dialog to bring the Pipe back to its default production state. ### Cascade parameters¶ Parameters with the same name in different Pipes are cascaded down the dependency chain. For example, if you publish Pipe A with the parameter `foo` , and then Pipe B which uses Pipe A as a Data Source also with the parameter `foo` , then when you call the API Endpoint of Pipe B with `foo=bar` , the value of `foo` will be `bar` in both Pipes. 
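As a minimal sketch of this cascading behavior (the Pipe names `pipe_a` and `pipe_b` and the `events` Data Source are hypothetical), both Pipes declare the same `foo` parameter:

##### Sketch: the same parameter cascading through two Pipes
-- Node in pipe_a, published and used as a source by pipe_b
%
SELECT * FROM events WHERE category = {{String(foo, 'bar')}}

-- Node in pipe_b, which reads from pipe_a and is published as the API Endpoint you call
%
SELECT count() AS total FROM pipe_a WHERE category = {{String(foo, 'bar')}}

Calling Pipe B's API Endpoint with `?foo=clicks` applies the value `clicks` in both queries, as described above.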
### Throw errors¶ The following example stops the API Endpoint processing and returns a 400 error: ##### Validate if a param is defined and throw an error if it's not defined % {% if not defined(my_filter) %} {{ error('my_filter (int32) query param is required') }} {% end %} select * from table where attr > {{Int32(my_filter)}} The `custom_error` function is an advanced version of `error` where you can customize the response and other aspects. The function gets an object as the first argument, which is sent as JSON, and the status_code as a second argument, which defaults to 400. ##### Validate if a param is defined and throw an error if it's not defined % {% if not defined(my_filter) %} {{ custom_error({'error_id': 10001, 'error': 'my_filter (int32) query param is required'}) }} {% end %} select * from table where attr > {{Int32(my_filter)}} ## Limits¶ You can't use query parameters in Nodes that are published as [Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/overview) , only as API Endpoints or in on-demand Copies or Sinks. You can use query parameters in scheduled Sinks and Copies, but must have a default. That default is used in the scheduled execution. The preview step fails if the default doesn't exist. ## Next steps¶ Thanks to the magic of dynamic parameters, you can create flexible API Endpoints with ease, so you don't need to manage or test dozens of Pipes. Be sure you're familiar with the [5 rules for faster SQL queries](https://www.tinybird.co/docs/docs/query/sql-best-practices). --- URL: https://www.tinybird.co/docs/query/sql-best-practices Last update: 2024-11-15T09:15:26.000Z Content: --- title: "SQL best practices · Tinybird Docs" theme-color: "#171612" description: "Learn the best practices when working with a huge amount of data." --- # Best practices for SQL queries¶ When you're trying to process significant amounts of data, following best practices help you create faster and more robust queries. Follow these principles when writing queries meant for Tinybird: 1. The best data is the data you don't write. 2. The second best data is the one you don't read. The less data you read, the better. 3. Sequential reads are much faster. 4. The less data you process after read, the better. 5. Perform complex operations later in the processing pipeline. The following sections analyze how performance improves after implementing each principle. To follow the examples, download the [NYC Taxi Trip](https://storage.googleapis.com/tinybird-demo/yellow_trip_data_2018/yellow_tripdata_2018-01.csv) and import it using a Data Source. See [Data Sources](https://www.tinybird.co/docs/docs/concepts/data-sources). ## The best data is the one you don't write¶ Don't save data that you don't need, as it impacts memory usage, causing queries to take more time. ## The second best data is the one you don't read¶ To avoid reading data that you don't need, apply filters as soon as possible. For example, consider a list of the trips whose distance is greater than 10 miles and that took place between `2017-01-31 14:00:00` and `2017-01-31 15:00:00` . Additionally, you want to retrieve the trips ordered by date. The following examples show the difference between applying the filters at the end or at the beginning of the Pipe. The first approach orders all the data by date: ##### node rule2\_data\_read\_NOT\_OK SELECT * FROM nyc_taxi ORDER BY tpep_pickup_datetime ASC 10.31MB, 139.26k x 17 ( 6.95ms ) After the data is sorted, you can filter. 
This approach takes around 30 to 60 ms after adding the time of both steps. <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fbest-practices-faster-sql-queries-1.png&w=3840&q=75) Compare the number of scanned rows (139.26k) and the size of data (10.31MB) to the number of scanned rows (24.58k) and the size of data (1.82MB): you only need to scan 24.58k rows. Both values directly impact the query execution time and also affect other queries you might be running at the same time. Bandwidth is also a factor you need to keep in mind. The following example shows what happens if the filter is applied before the sorting: ##### node rule2\_data\_read\_OK SELECT * FROM nyc_taxi WHERE (trip_distance > 10) AND ((tpep_pickup_datetime >= '2017-01-31 14:00:00') AND (tpep_pickup_datetime <= '2017-01-31 15:00:00')) ORDER BY tpep_pickup_datetime ASC 1.50MB, 24.58k x 17 ( 32.28ms ) If the filter is applied before the sorting, it takes only 1 to 10 ms. The size of the data read is 1.82 MB, while the number of rows read is 24.58k: they're much smaller figures than the ones in the first example. This significant difference happens because in the first approach you are sorting all the data available, even the data that you don't need for your query, while in the second approach you are sorting only the rows you need. As filtering is the fastest operation, always filter first. ## Sequential reads are much faster¶ To carry out sequential reads, define indexes correctly. Indexes should be defined based on the queries that are going to be run. The following example simulates a case by ordering the data based on the columns. For example, if you want to query the data by date, compare what happens when the data is sorted by date to when it's sorted by any other column. The first approach sorts the data by another column, `passenger_count`: ##### node rule3\_sequential\_read\_NOT\_OK SELECT * FROM nyc_taxi ORDER BY passenger_count ASC 718.55MB, 9.71m x 17 ( 132.17ms ) After you've sorted the data by `passenger_count` , filter it by date: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fbest-practices-faster-sql-queries-2.png&w=3840&q=75) This approach takes around 5-10 ms, the number of scanned rows is 26.73k, and the size of data is 1.98 MB. For the second approach, sort the data by date: ##### node rule3\_ordered\_by\_date\_OK SELECT * FROM nyc_taxi ORDER BY tpep_pickup_datetime ASC 10.31MB, 139.26k x 17 ( 27.80ms ) After it's sorted by date, filter it: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fbest-practices-faster-sql-queries-3.png&w=3840&q=75) When the data is sorted by date and the query uses date for filtering, it takes 1-2 ms, the number of scanned rows is 10.35k and the size of data is 765.53KB. The more data you have, the greater the difference between both approaches. When dealing with significant amounts of data, sequential reads can be much faster. Therefore, define the indexes taking into account the queries that you want to make. ## The less data you process after read, the better¶ If you only need two columns, only retrieve those. Consider a case where you only need the following: `vendorid`, `tpep_pickup_datetime` , and `trip_distance`. 
When retrieving all the columns instead of the previous three, you need around 140-180 ms and the size of data is 718.55MB: ##### NODE RULE4\_LESS\_DATA\_NOT\_OK SELECT * FROM ( SELECT * FROM nyc_taxi order by tpep_dropoff_datetime ) 718.55MB, 9.71m x 17 ( 137.93ms ) When retrieving only the columns you need, it takes around 35-60 ms: ##### node rule4\_less\_data\_OK SELECT * FROM ( SELECT vendorid, tpep_pickup_datetime, trip_distance FROM nyc_taxi order by tpep_dropoff_datetime ) 155.36MB, 9.71m x 3 ( 22.14ms ) With analytical databases, not retrieving unnecessary columns make queries much more performant and efficient. Process only the data you need. ## Perform complex operations later in the processing pipeline¶ Perform complex operations, such as joins or aggregations, as late as possible in the processing pipeline. As you filter all the data at the beginning, the number of rows at the end of the pipeline is lower and, therefore, the cost of executing complex operations is also lower. Using the example dataset, aggregate the data: ##### node rule5\_complex\_operation\_NOT\_OK SELECT vendorid, pulocationid, count(*) FROM nyc_taxi GROUP BY vendorid, pulocationid 77.68MB, 9.71m x 3 ( 19.35ms ) Apply the filter: <-figure-> ![](/docs/_next/image?url=%2Fdocs%2Fimg%2Fbest-practices-faster-sql-queries-4.png&w=3840&q=75) If you apply the filter after aggregating the data, it takes around 50-70 ms to retrieve the data, the number of scanned rows is 9.71m, and the size of data is 77.68 MB. If you filter before aggregating the data: ##### node rule5\_complex\_operation\_OK SELECT vendorid, pulocationid, count(*) FROM nyc_taxi WHERE vendorid < 10 GROUP BY vendorid, pulocationid 77.68MB, 9.71m x 3 ( 73.26ms ) The query takes only 20-40 ms, although the number of scanned rows and the size of data is the same as in the previous approach. ## Additional guidance¶ Follow these additional recommendations when creating SQL queries. ### Avoid full scans¶ The less data you read in your queries, the faster they are. There are different strategies you can follow to avoid reading all the data in a Data Source, or doing a full scan, in your queries: - Always filter first. - Use indices by setting a proper `ENGINE_SORTING_KEY` in the Data Source. - The column names present in the `ENGINE_SORTING_KEY` should be the ones you use for filtering in the `WHERE` clause. You don't need to sort by all the columns you use for filtering, only the ones to filter first. - The order of the columns in the `ENGINE_SORTING_KEY` is important: from left to right ordered by relevance. The columns that matter the most for filtering and have less cardinality should go first. 
Consider the following Data Source, which is sorted by `id` and `date`: ##### Data Source: data\_source\_sorted\_by\_date SCHEMA > `id` Int64, `amount` Int64, `date` DateTime ENGINE "MergeTree" ENGINE_SORTING_KEY "id, date" The following query is slower because it filters data using a column other than the ones defined in the `ENGINE_SORTING_KEY` instruction: ##### Not filtering by any column present in the ENGINE\_SORTING\_KEY SELECT * FROM data_source_sorted_by_date WHERE amount > 30 The following query is faster because it filters data using a column defined in the `ENGINE_SORTING_KEY` instruction: ##### Filtering first by columns present in the ENGINE\_SORTING\_KEY SELECT * FROM data_source_sorted_by_date WHERE id = 135246 AND date > now() - INTERVAL 3 DAY AND amount > 30 ## Avoid big joins¶ When doing a `JOIN` , the data in the Data Source on the right side loads into memory to perform the operation. `JOIN`s over tables of more than 1 million rows might lead to `MEMORY_LIMIT` errors when used in Materialized Views, affecting ingestion. Avoid joining big Data Sources by filtering the data in the Data Source on the right side. For example, the following pattern is less efficient because it's joining a Data Source with too many rows: ##### Doing a JOIN with a Data Source with too many rows SELECT left.id AS id, left.date AS day, right.response_id AS response_id FROM left_data_source AS left INNER JOIN big_right_data_source AS right ON left.id = right.id The following query is faster and more efficient because it prefilters the Data Source before the `JOIN`: ##### Prefilter the joined Data Source for better performance SELECT left.id AS id, left.date AS day, right.response_id AS response_id FROM left_data_source AS left INNER JOIN ( SELECT id, response_id FROM big_right_data_source WHERE id IN (SELECT id FROM left_data_source) ) AS right ON left.id = right.id ## Memory issues¶ Sometimes, you might reach the memory limit when running a query. This is usually because of one of the following reasons: - A lot of columns are used: try to reduce the number of columns used in the query. As this isn't always possible, try to change data types or merge some columns. - A cross `JOIN` or some other operation that generates a lot of rows: it might happen if the cross `JOIN` is done with two Data Sources with a large number of rows. Try to rewrite the query to avoid the cross `JOIN` . - A massive `GROUP BY` : try to filter out rows before executing the `GROUP BY` . If you are getting a memory error while populating a Materialized View, the solutions are the same. Consider that the populate process runs in chunks of 1 million rows, so if you hit memory limits, the cause might be one of the following: 1. There is a `JOIN` and the right table is large. 2. There is an `ARRAY JOIN` with a huge array that makes the number of rows increase significantly. To check if a populate process could break, create a Pipe with the same query as the Materialized View and replace the source table with a node that gets 1 million rows from the source table.
The following example shows an unoptimized Materialized View Pipe: ##### original Materialized View Pipe SQL NODE materialized SQL > select date, count() c from source_table group by date The following query shows the transformed Pipe: ##### Transformed Pipe to check how the Materialized View would process the data NODE limited SQL > select * from source_table limit 1000000 NODE materialized SQL > select date, count() c from limited group by date ## Nested aggregate functions¶ You can't nest aggregate functions or use an alias of an aggregate function that's used in another aggregate function. For example, the following query causes an error due to a nested aggregate function: ##### Error on using nested aggregate function SELECT max(avg(number)) as max_avg_number FROM my_datasource The following query causes an error due to a nested aggregate with alias: ##### Error on using nested aggregate function with alias SELECT avg(number) avg_number, max(avg_number) max_avg_number FROM my_datasource Instead of using nested aggregate functions, use a subquery. The following example shows how to use aggregate functions in a subquery: ##### Using aggregate functions in a subquery SELECT avg_number as number, max_number FROM ( SELECT avg(number) as avg_number, max(number) as max_number FROM numbers(10) ) The following example shows how to nest aggregate functions using a subquery: ##### Nesting aggregate functions using a subquery SELECT max(avg_number) as number FROM ( SELECT avg(number) as avg_number, max(number) as max_number FROM numbers(10) ) ## Merge aggregate functions¶ Columns with `AggregateFunction` types such as `count`, `avg` , and others precalculate their aggregated values using intermediate states. When you query those columns you have to add the `-Merge` combinator to the aggregate function to get the final aggregated results. Use `-Merge` aggregated states as late in the pipeline as possible. Intermediate states are stored in binary format, which explains why you might see special characters when selecting columns with the `AggregateFunction` type. For example, consider the following query: ##### Getting 'result' as aggregate function SELECT result FROM my_datasource The result contains special characters: | AggregateFunction(count) | | --- | | @33M@ | | �o�@ | When selecting columns with the `AggregateFunction` type use `-Merge` the intermediate states to get the aggregated result for that column. This operation might compute several rows, so use `-Merge` as late in the pipeline as possible. Consider the following query: ##### Getting 'result' as UInt64 -- Getting the 'result' column aggregated using countMerge. Values are UInt64 SELECT countMerge(result) as result FROM my_datasource The result is the following: | UInt64 | | --- | | 1646597 | --- URL: https://www.tinybird.co/docs/quick-start Last update: 2024-11-12T11:45:41.000Z Content: --- title: "Quick start · Tinybird Docs" theme-color: "#171612" description: "Get started with Tinybird as quickly as possible. Ingest, query, and publish data in minutes." --- # Quick start¶ With Tinybird, you can ingest data from anywhere, query and transform it using SQL, and publish it as high-concurrency, low-latency REST API endpoints. Read on to learn how to create a Workspace, ingest data, create a query, publish an API, and confirm your setup works properly using the Tinybird user interface. ## Step 1: Create your Tinybird account¶ [Create a Tinybird account](https://www.tinybird.co/signup) . It's free and no credit card is required. 
See [Tinybird pricing plans](https://www.tinybird.co/docs/docs/support/billing) for more information. [Sign up for Tinybird](https://www.tinybird.co/signup) ## Step 2: Select your cloud provider and region¶ When logging in to Tinybird, select the cloud provider and region you want to work in. ![Select your region](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstartui-region-select.png&w=3840&q=75) ## Step 3: Create your Workspace¶ A [Workspace](https://www.tinybird.co/docs/docs/concepts/workspaces) is an area that contains a set of Tinybird resources, including Data Sources, Pipes, Nodes, API Endpoints, and Tokens. Create a Workspace named `customer_rewards` . The name must be unique. ![Create a Workspace](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstartui-create-workspace.png&w=3840&q=75) ## Step 4: Download and ingest sample data¶ Download the following sample data from a fictitious online coffee shop: [Download data file](https://www.tinybird.co/docs/docs/assets/sample-data-files/orders.ndjson) Select **File Upload** and follow the instructions to load the file. ![Upload a file with data](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstartui-file-upload.png&w=3840&q=75) Select **Create Data Source** to automatically create the `orders` [Data Source](https://www.tinybird.co/docs/docs/concepts/data-sources). ![Create a Data Source](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstartui-create-data-source.png&w=3840&q=75) ## Step 5: Query data using a Pipe¶ You can create [Pipes](https://www.tinybird.co/docs/docs/concepts/pipes) to query your data using SQL. To create a Pipe, select **Pipes** and then **Create Pipe**. Name your Pipe `rewards` and add the following SQL: select count() from orders Select the Node name and change it to `rewards_count`. ![Create a Pipe](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstartui-create-pipe.png&w=3840&q=75) Select **Run** to preview the result of your Pipe. ## Step 6: Publish your query as an API¶ You can turn any Pipe into a high-concurrency, low-latency API Endpoint. Select **Create API Endpoint** and then select the `rewards_count` Node in the menu. ![Create an API Endpoint](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstartui-create-api.png&w=3840&q=75) ## Step 7: Call your API¶ You can test your API endpoint using a curl command. Go to the **Output** section of the API page and select the **cURL** tab. Copy the curl command into a Terminal window and run it. ![Test your API Endpoint](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fquickstartui-test-api.png&w=3840&q=75) Congratulations! You have created your first API Endpoint in Tinybird. ## Next steps¶ - Check the[ Tinybird CLI Quick start](https://www.tinybird.co/docs/docs/cli/quick-start) . - Learn more about[ User-Facing Analytics](https://www.tinybird.co/docs/docs/use-cases) in the Use Case Hub. - Learn about[ Tinybird Charts](https://www.tinybird.co/docs/docs/publish/charts) and build beautiful visualizations for your API endpoints. --- URL: https://www.tinybird.co/docs/starter-kits/log-analytics Content: --- title: "Log Analytics・Tinybird Starter Kit" theme-color: "#171612" description: "Analyze software logs, warnings, and errors in minutes with this language-agnostic Log Analytics Starter Kit." --- Click the button below to deploy the data project to Tinybird. 
[Deploy project](https://app.tinybird.co/workspaces/new?name=log_analytics_starter_kit&starter_kit=log-analytics-starter-kit)![Deploy the Tinybird data project](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fassets%2Fstarter-kits%2Flog-analytics-step-1.jpeg&w=3840&q=75) This will automatically set up all of the resources you need to collect, analyze, and publish your logs. --- URL: https://www.tinybird.co/docs/starter-kits/web-analytics Content: --- title: "Open Source Google Analytics Alternative・Tinybird Starter Kit" theme-color: "#171612" description: "Deploy a Google Analytics alternative with this open source Starter Kit. Everything you need to start tracking web traffic analytics in just a few minutes." --- Click the button below to instantly deploy the project to a Tinybird workspace. [Deploy project](https://app.tinybird.co/workspaces/new?name=web_analytics&starter_kit=web-analytics-starter-kit)![Deploy the project](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fassets%2Fstarter-kits%2Fweb-analytics-step-1.jpeg&w=3840&q=75) These resources will be created in a new workspace in your Tinybird account. --- URL: https://www.tinybird.co/docs/support/billing Last update: 2024-11-07T09:52:34.000Z Content: --- title: "Billing, plans, and pricing · Tinybird Docs" theme-color: "#171612" description: "Information about billing, what it is based on, as well as Tinybird pricing plans." --- # Billing¶ Tinybird billing is based on the pricing of different data operations, such as storage, processing, and transfer. If you are on a Professional plan, read on to learn how billing works. If you're an Enterprise customer, contact us at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) to reduce unit prices as part of volume discounts and Enterprise plan commitments. See [Tinybird plans](https://www.tinybird.co/docs/docs/plans). ## At a glance¶ - Data storage: US$0.34 per GB - Data processing (read or write): US$0.07 per GB - Data transfer (outbound): US$0.01 to US$0.10 per GB To see the full breakdown for each individual operation, skip to [billing breakdown](https://www.tinybird.co/docs/about:blank#billing-breakdown). ## Data storage¶ Data storage refers to the disk storage of all the data you keep in Tinybird. Data storage is priced at **US$0.34 per GB** , regardless of the region. Data storage is usually the smallest part of your Tinybird bill. Your [Data Sources](https://www.tinybird.co/docs/docs/concepts/data-sources) use the largest percentage of storage. ### Compression¶ Data storage pricing is based on the volume of storage used after compression, calculated on the last day of every month. The exact rate of compression varies depending on your data. You can expect a compression factor of between 3x and 10x. For example, with a compression factor of 3.5x, if you import 100 GB of uncompressed data, that translates to approximately 28.6 GB compressed. In that case, your bill would be based on the final 28.6 GB of stored data. ### Version control¶ If your Workspace uses the Tinybird Git integration, only data storage associated with the production Workspace, and not Branches, is included when determining the storage bill. Remove historical data to lower your storage bill. You can configure a [time-to-live (TTL)](https://www.tinybird.co/docs/docs/concepts/data-sources#setting-data-source-ttl) on any Data Source, which deletes data older than a given time. This gives you control over how much data is retained in a Data Source.
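To make the effect of a TTL concrete, here is a minimal sketch (the `orders` Data Source and its `timestamp` column are assumptions for illustration, not part of the billing docs). A TTL is an SQL expression over a column of the Data Source; rows are removed once the expression falls in the past:

```sql
-- Sketch only: preview how many rows a 90-day TTL would retain today.
-- With a TTL expression of `timestamp + toIntervalDay(90)`, a row is deleted
-- once that expression is earlier than the current time.
SELECT count() AS rows_retained
FROM orders
WHERE timestamp + toIntervalDay(90) > now()
```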
A common pattern is to ingest raw data, materialize it, and clear out the raw data with a TTL to reduce storage. ## Data processing¶ Data processing is split into write and read activities. All processed data is priced at **US$0.07 per GB**. ### Write activities¶ You write data whenever you ingest into Tinybird. When you create, append, delete, or replace data in a [Data Source](https://www.tinybird.co/docs/docs/concepts/data-sources) , or write data to a Materialized View, you are writing data. ### Read activities¶ You read data when you run queries against your Data Sources to generate responses to API Endpoint requests. You also read data when you make requests to the [Query API](https://www.tinybird.co/docs/docs/api-reference/query-api) . The only exception is when you're manually running a query. See [Exceptions](https://www.tinybird.co/docs/about:blank#exceptions) for more information. Read activities also include the amount of data fetched to generate API Endpoint responses. For example, if 10 MB of data is processed to generate one API Endpoint response, you would be billed for 10 MB. If the same API Endpoint is called 10 times, that would be 10 x 10 MB, and you would be billed for 100 MB of processed data in total. Even if there are no rows in a response, you could be billed for it, so create your queries with care. For example, if you read 1 billion rows but the query returns no rows because of the endpoint filters, you have still read 1 billion rows. Additionally, [ClickHouse® "sparse" indexing](https://clickhouse.com/docs/en/optimize/sparse-primary-indexes#an-index-design-for-massive-data-scales) means that even if a row isn't in the table, it still takes read activity to confirm it's not there. Failed (4xx) Copy Pipe, API Endpoint, and Query API requests, such as timeouts or memory usage errors, are also billed. You can check these errors using the `pipe_stats_rt`, `pipe_stats` and `datasources_ops_log` [Service Data Sources](https://www.tinybird.co/docs/docs/monitoring/service-datasources#tinybird-pipe-stats-rt). ### Materialized Views¶ [Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/overview) involve both read and write operations, plus data storage operations. Whenever you add new data to a Materialized View, you are writing to it. However, there is no charge when you first create and populate a Materialized View. Only incremental updates are billed. Because Materialized Views typically process and store only a fraction of the data that you ingest into Tinybird, the cost of Materialized Views is usually minimal. ### Compression¶ Your data processing bill might be impacted by compression. Depending on the operation being performed, data is handled in different ways and it isn't always possible to predict exact levels of read or written bytes in advance for all customers. The best option is to query the [Service Data Sources](https://www.tinybird.co/docs/docs/monitoring/service-datasources) and analyze your results. ### Version control¶ If your Workspace uses the Tinybird Git integration feature, only data processing associated with the production Workspace, and not Branches, is included when determining the amount of processed data. Typically, data processing is the largest percentage of your Tinybird bill.
This is why, as you scale, you should [optimize your queries](https://www.tinybird.co/docs/docs/query/sql-best-practices) , understand [Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/overview) , and [analyze the performance of your API Endpoints](https://www.tinybird.co/docs/docs/guides/monitoring/analyze-endpoints-performance). Tinybird works with customers on a daily basis to help optimize their queries and reduce data processing, sometimes reducing their processed data by over 10x. If you need support to optimize your use case, contact Tinybird at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) or through the [Community Slack](https://www.tinybird.co/docs/docs/community). ## Data transfer¶ Currently, the only service to incur data transfer costs is the Tinybird [AWS S3 Sink](https://www.tinybird.co/docs/docs/publish/s3-sink) . If you're not using this Sink, you aren't charged any data transfer costs. The Tinybird S3 Sink incurs both data transfer and data processing (read) costs. See [AWS S3 Sink Billing](https://www.tinybird.co/docs/docs/publish/s3-sink#billing). Data transfer depends on your environment. There are two possible scenarios: - Destination bucket is in the same cloud provider and region as your Tinybird Workspace: US$0.01 per GB - Destination bucket is in a different cloud provider or region than your Tinybird Workspace: US$0.10 per GB ## Exceptions¶ The following operations are free and don't count towards billing: - Anything on a Build plan. - Any operation that doesn't involve processing, storing, or transferring data: - API calls to the Tokens, Jobs, or Analyze Endpoints. - Management operations over resources like Sinks or Pipes (create, update, delete, get details), or Data Sources (create, get details; update & delete incur cost). - Populating a Materialized View with historical data (only inserting new data into an existing MV is billed). - Manual query executions made inside the UI (Pipes, Time Series, Playground). Anywhere you can press the "Run" button, that's free. - Queries to Service Data Sources. - Any time data is deleted as a result of TTL operations. ## Monitor your usage¶ Users on any plan can monitor their usage. To see an at-a-glance overview, select the cog icon in the navigation and select the **Usage** tab: ![image](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fimg%2Fbilling-plans-usage.png&w=3840&q=75) You can also check your usage by querying the data available in the [Service Data Sources](https://www.tinybird.co/docs/docs/monitoring/service-datasources) . These Data Sources contain all the internal data about your Tinybird usage, and you can query them using Pipes like any other Data Source. This means you can publish the results as an API Endpoint, and [build charts in Grafana](https://www.tinybird.co/docs/docs/guides/integrations/consume-api-endpoints-in-grafana), [export to DataDog](https://www.tinybird.co/blog-posts/how-to-monitor-tinybird-using-datadog-with-vector-dev) , and more. Queries made to Service Data Sources are free of charge and don't count towards your usage. However, calls to API Endpoints that use Service Data Sources do count towards API rate limits. Users on any plan can use the strategies outlined in the ["Monitor your ingestion"](https://www.tinybird.co/docs/docs/guides/monitoring/monitor-your-ingestion) guide.
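For example, here is a minimal sketch of a usage query against the `pipe_stats` Service Data Source. The column names (`date`, `pipe_name`, `view_count`, `read_bytes_sum`) are assumptions; check the Service Data Sources docs for the exact schema of each Data Source:

```sql
-- Sketch: approximate requests and data processed per Pipe so far this month.
-- Queries like this against Service Data Sources are free of charge.
SELECT
    pipe_name,
    sum(view_count) AS requests,
    sum(read_bytes_sum) AS bytes_processed,
    formatReadableSize(bytes_processed) AS data_processed
FROM tinybird.pipe_stats
WHERE date >= toStartOfMonth(today())
GROUP BY pipe_name
ORDER BY bytes_processed DESC
```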
If you're an Enterprise customer, check your [Consumption overview in the Organizations UI](https://www.tinybird.co/docs/docs/monitoring/organizations#consumption-overview). ## Reduce your bill¶ You reduce your overall Tinybird bill by reducing your stored, processed, and transferred data. | Type of data | How to reduce | | --- | --- | | Stored data | To reduce stored data, pick the right sorting keys based on your queries, and use [Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/overview) to process data on ingestion. | | Processed data | To reduce processed data, use Materialized Views and implement a[ TTL on raw data](https://www.tinybird.co/docs/docs/concepts/data-sources#setting-data-source-ttl) . | | Transferred data | To reduce **transferred data** costs, make sure you're transferring data in the same cloud region. | See the [Optimization guide](https://www.tinybird.co/docs/docs/guides/optimizations/overview) to learn how to optimize your projects and queries and reduce your bill. ## Billing breakdown¶ The following tables provide details on each operation, grouped by main user action. ### Data ingestion¶ | Service | Operation | Processing fee | Description | | --- | --- | --- | --- | | Data Sources API | Write | US$0.07 per GB | Low frequency: Append data to an existing Data Source (imports, backfilling, and so on). | | Events API | Write | US$0.07 per GB | High frequency: Insert events in real-time (individual or batched). | | Connectors | Write | US$0.07 per GB | Any connector that ingests data into Tinybird (Kafka, S3, GCS, BigQuery, and so on). | ### Data manipulation¶ | Service | Operation | Processing fee | Description | | --- | --- | --- | --- | | Pipes API | Read | US$0.07 per GB | Interactions with Pipes to retrieve data from Tinybird generate read operations. | | Query API | Read | US$0.07 per GB | Interactions with the Query API to retrieve data from Tinybird. | | Materialized Views (Populate) | Read/Write | Free | Executed as soon as you create the MV to populate it. Tinybird doesn't charge any processing fee. Data is written into a new or existing Data Source. | | Materialized Views (Append) | Read/Write | US$0.07 per GB | New data is read from an origin Data Source, filtered, and written to a destination Data Source. | | Copy Pipes | Read/Write | US$0.07 per GB | On-demand or scheduled operations. Data is read from the Data Source, filtered, and written to a destination Data Source. | | Replace | Read/Write | US$0.07 per GB | Replacing data entirely or selectively. | | Delete data | Read/Write | US$0.07 per GB | Selective data delete from a Data Source. | | Delete an entire Data Source | Read/Write | US$0.07 per GB | Delete all the data inside a Data Source. | | Truncate | Write | US$0.07 per GB | Delete all the data from a Data Source. | | Time-to-live (TTL) operations | Write | Free | Anytime Tinybird deletes data as a result of a TTL. | | BI Connector | Read | US$0.07 per GB | Data read from Tinybird using the BI connector. | ### Data transfer¶ | Service | Operation | Processing fee | Data transfer fee | Description | | --- | --- | --- | --- | --- | | S3 Sink | Read/Transfer (no write fees) | US$0.07 per GB | Same region: US$0.01 per GB. Different region: US$0.10 per GB | Data is read, filtered, and then transferred to the destination bucket. This is an on-demand or scheduled operation. Data transfer fees apply.
| ## Next steps¶ - [ Sign up for a free Tinybird account and follow the quick start](https://www.tinybird.co/docs/docs/quick-start) . - Get the most from your Workspace, for free: Learn more about[ using the Playground and Time Series](https://www.tinybird.co/docs/docs/query/overview#use-the-playground) . - Explore different[ Tinybird plans](https://www.tinybird.co/docs/docs/plans) and find the right one for you. Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/support/limits Last update: 2024-10-23T12:59:27.000Z Content: --- title: "Limits · Tinybird Docs" theme-color: "#171612" description: "Tinybird has limits on certain operations and processes to ensure the highest performance." --- # Limits¶ Tinybird has limits on certain operations and processes to ensure the highest performance. ## Workspace limits¶ | Description | Limit | | --- | --- | | Number of Workspaces | Default 90 (soft limit; ask to increase) | | Number of seats | Default 90 (soft limit; ask to increase) | | Number of Data Sources | Default 100 (soft limit; ask to increase) | | Number of Tokens | 100,000 (If you need more you should take a look at[ JWT tokens](https://www.tinybird.co/docs/docs/concepts/auth-tokens#json-web-tokens-jwts) ) | | Number of secrets | 100 | | Queries per second | Default 20 (Soft limit. Contact Tinybird Support to increase it.) | See [Rate limits for JWTs](https://www.tinybird.co/docs/docs/concepts/auth-tokens#json-web-tokens-jwts) for more detail specifically on JWT limits. ## Ingestion limits¶ | Description | Limit | | --- | --- | | Data Source max columns | 500 | | Full body upload | 8MB | | Multipart upload - CSV and NDJSON | 500MB | | Multipart upload - Parquet | 50MB | | Max file size - Parquet - Build plan | 1GB | | Max file size - Parquet - Pro and Enterprise plan | 5GB | | Max file size (uncompressed) - Build plan | 10GB | | Max file size (uncompressed) - Pro and Enterprise plan | 32GB | | Kafka topics | Default 5 (soft limit; ask to increase) | | Max parts created at once - NDJSON/Parquet jobs and Events API | 12 | ### Ingestion limits (API)¶ Tinybird throttles requests based on capacity. If your queries are using 100% of the available resources, you might not be able to run more queries until the running ones finish. | Description | Limit and time window | | --- | --- | | Request size - Events API | 10MB | | Response size | 100MB | | Create Data Source from schema | 25 times per minute | | Create Data Source from file or URL* | 5 times per minute | | Append data to Data Source* | 5 times per minute | | Append data to Data Source using v0/events | 1,000 times per second | | Replace data in a Data Source* | 5 times per minute | - The quota is shared at the Workspace level when creating, appending, or replacing data. For example, you can't do 5 requests of each type per minute, for a total of 15 requests. You can do at most a grand total of 5 requests of those types combined. The number of rows in append requests does not impact the ingestion limit; each request counts as a single ingestion. If you exceed your rate limit, your request will be throttled and you will receive *HTTP 429 Too Many Requests* response codes from the API. Each response contains a set of HTTP headers with your current rate limit status. | Header Name | Description | | --- | --- | | `X-RateLimit-Limit` | The maximum number of requests you're permitted to make in the current limit window.
| | `X-RateLimit-Remaining` | The number of requests remaining in the current rate limit window. | | `X-RateLimit-Reset` | The time in seconds until the current rate limit window resets. | | `Retry-After` | The time to wait before making another request. Only present on 429 responses. | ### BigQuery Connector limits¶ The import jobs run in a pool, with capacity for up to 2 concurrent jobs. If more scheduled jobs overlap, they're queued. | Description | Limit and time window | | --- | --- | | Maximum frequency for the scheduled jobs | 5 minutes | | Maximum rows per append or replace | 50 million rows. Exports that exceed this number of rows are truncated to this amount | You can't pause a Data Source with an ongoing import. You must wait for the import to finish before pausing the Data Source. ### DynamoDB Connector limits¶ | Description | Limit and time window | | --- | --- | | Storage | 500 GB | | Throughput | 250 Write Capacity Units (WCU), equivalent to 250 writes of at most 1 KB per second | ### Snowflake Connector limits¶ The import jobs run in a pool, with capacity for up to 2 concurrent jobs. If more scheduled jobs overlap, they're queued. | Description | Limit and time window | | --- | --- | | Maximum frequency for the scheduled jobs | 5 minutes | | Maximum rows per append or replace | 50 million rows. Exports that exceed this number of rows are truncated to this amount | You can't pause a Data Source with an ongoing import. You must wait for the import to finish before pausing the Data Source. ## Query limits¶ | Description | Limit | | --- | --- | | SQL length | 8KB | | Result length | 100 MB | | Query execution time | 10 seconds | If you exceed your rate limit, your request will be throttled and you will receive *HTTP 429 Too Many Requests* response codes from the API. Each response contains a set of HTTP headers with your current rate limit status. | Header Name | Description | | --- | --- | | `X-RateLimit-Limit` | The maximum number of requests you're permitted to make in the current limit window. | | `X-RateLimit-Remaining` | The number of requests remaining in the current rate limit window. | | `X-RateLimit-Reset` | The time in seconds until the current rate limit window resets. | | `Retry-After` | The time to wait before making another request. Only present on 429 responses. | ### Query timeouts¶ If query execution time exceeds the default limit of 10 seconds, an error message appears. Long execution times hint at issues that need to be fixed in the query or the Data Source schema. To avoid query timeouts, optimize your queries to remove inefficiencies and common mistakes. See [Optimizations](https://www.tinybird.co/docs/docs/guides/optimizations/overview) for advice on how to detect and solve issues in your queries that might cause timeouts. If you still need to increase the timeout limit, contact support. See [Get help](https://www.tinybird.co/docs/docs/support/overview#get-help). Only paid accounts can raise the timeout limit. ## Publishing limits¶ ### Materialized Views limits¶ There are no numerical limits, but certain operations are [inadvisable when using Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views#limitations).
### Sink limits¶ Sink Pipes have the following limits, depending on your billing plan: | Plan | Sink Pipes per Workspace | Execution time | Frequency | Memory usage per query | Active jobs (running or queued) | | --- | --- | --- | --- | --- | --- | | Pro | 3 | 30s | Up to every 10 min | 10 GB | 3 | | Enterprise | 10 | 300s | Up to every minute | 10 GB | 6 | ### Copy Pipe limits¶ Copy Pipes have the following limits, depending on your billing plan: | Plan | Copy Pipes per Workspace | Execution time | Frequency | Active jobs (running or queued) | | --- | --- | --- | --- | --- | | Build | 1 | 20s | Once an hour | 1 | | Pro | 3 | 30s | Up to every 10 minutes | 3 | | Enterprise | 10 | 50% of the scheduling period, 30 minutes max | Up to every minute | 6 | ## Delete limits¶ Delete jobs have the following limits, depending on your billing plan: | Plan | Active delete jobs per Workspace | | --- | --- | | Build | 1 | | Pro | 3 | | Enterprise | 6 | ## Next steps¶ - Understand how Tinybird[ plans and billing work](https://www.tinybird.co/docs/docs/support/billing) . - Explore popular use cases for user-facing analytics (like dashboards) in Tinybird's[ Use Case Hub](https://www.tinybird.co/docs/docs/use-cases) . --- URL: https://www.tinybird.co/docs/support/overview Last update: 2024-11-07T09:52:34.000Z Content: --- title: "Support · Tinybird Docs" theme-color: "#171612" description: "Tinybird is here to help." --- # Support¶ Read on to learn how to get help. ## Troubleshooting¶ Tinybird tries to give you direct feedback and notifications if it spots anything going wrong. Use Tinybird's [Service Data Sources](https://www.tinybird.co/docs/docs/monitoring/service-datasources) to get more details on what's going on under the hood in your data and queries. ## Recover deleted items¶ Tinybird creates and backs up daily snapshots and retains them for 7 days. If you deleted something by mistake and need to recover it, open a ticket and our support team will assist you. ## FAQs¶ There are a handful of common questions that get asked. In no particular order, these are: ### How do I... iterate a Data Source? (Change or update my schema)¶ If you're using version control to manage your Workspace, you iterate the Data Source using the version control process. We recommend you read the docs on [version control](https://www.tinybird.co/docs/docs/production/working-with-version-control) and then explore the GitHub repo with [examples for iterating different things](https://github.com/tinybirdco/use-case-examples) to understand the process. Be careful: Iterating Data Sources often needs a careful approach to [backfilling data](https://www.tinybird.co/docs/docs/production/backfill-strategies#the-challenge-of-backfilling-real-time-data). If you're **not** using version control to manage your Workspace, follow the [Iterating a Data Source](https://www.tinybird.co/docs/docs/guides/ingesting-data/iterate-a-data-source) Guide. ### How do I... change a sorting key or a type?¶ Both of these changes require you to iterate the Data Source (see above). ### How do I... copy/move data?¶ You have two options when it comes to moving or copying data. 1. [ Copy Pipes](https://www.tinybird.co/docs/docs/publish/copy-pipes) : the recommended option. Tinybird provides Copy Pipes purely to make copying data between Data Sources easier. 2. [ Materialized Views](https://www.tinybird.co/docs/docs/publish/materialized-views/overview) : the legacy way to copy data.
You can either create a materialization over a Data Source with a compatible schema, or you can create a new materialization. Once created and linked, a populate operation copies all the existing data from the origin to the destination, and each new ingest into the origin is also written to the destination. ### How do I... recover data from quarantine?¶ The quickest way to recover rows from quarantine is to fix the cause of the errors and then re-ingest the data. However, that is not always possible. Read the [docs on recovering data from quarantine](https://www.tinybird.co/docs/docs/guides/ingesting-data/recover-from-quarantine#recovering-rows-from-quarantine). You can also use the [Service Data Sources](https://www.tinybird.co/docs/docs/monitoring/service-datasources) , especially `datasources_ops_log` . There is also a `Log` tab on each of your Data Sources in the UI, which displays the information from `datasources_ops_log`. ### How do I... know if there was an error on an API Endpoint?¶ Your Tinybird API Endpoints return standard HTTP success or error codes. For errors, the response also includes extra information about what went wrong, encoded in the response as JSON. See the docs on [API errors](https://www.tinybird.co/docs/docs/publish/api-endpoints/overview#errors-and-retries) for more information. ### How do I... know how much I'm consuming and how much I'm going to pay?¶ Monitoring your usage is really helpful. Use the [Service Data Sources](https://www.tinybird.co/docs/docs/monitoring/service-datasources) and, if you're part of an Enterprise account, check your [Consumption overview in the Organizations UI](https://www.tinybird.co/docs/docs/monitoring/organizations#consumption-overview). ## Get help¶ If you haven't been able to solve the issue, or it looks like there is a problem on Tinybird's side, get in touch. You can always contact Tinybird at [support@tinybird.co](https://www.tinybird.co/docs/mailto:support@tinybird.co) or in the [Community Slack](https://www.tinybird.co/docs/docs/community). If you have an Enterprise account with Tinybird, contact us using your shared Slack channel. --- URL: https://www.tinybird.co/docs/support/syntax Last update: 2024-10-17T14:29:53.000Z Content: --- title: "Supported syntax · Tinybird Docs" theme-color: "#171612" description: "Tinybird supports the following statements and functions in queries." --- # Supported syntax¶ Tinybird supports the following statements and functions in queries. ## SQL statements¶ The only statement you can use in Tinybird's queries is `SELECT` . The SQL clauses for `SELECT` are fully supported. All other SQL statements are handled by Tinybird's features. ## ClickHouse® functions¶ You can use most functions from the latest version of ClickHouse. See [ClickHouse](https://www.tinybird.co/docs/docs/core-concepts#clickhouse).
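For instance, the following is a minimal sketch that mixes standard SQL with ClickHouse date and aggregate functions (the `orders` Data Source and its `timestamp` column are assumptions borrowed from the quick start sample, not a documented schema):

```sql
-- Sketch: ClickHouse functions such as toStartOfDay, toHour, and uniq can be
-- used directly in a Tinybird SELECT node.
SELECT
    toStartOfDay(timestamp) AS day,
    count() AS orders_count,
    uniq(toHour(timestamp)) AS active_hours
FROM orders
GROUP BY day
ORDER BY day DESC
LIMIT 7
```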
### Unsupported functions¶ The following functions aren't supported and don't work in Tinybird: - `FQDN` - `addressToLine` - `addressToLineWithInlines` - `addressToSymbol` - `azureBlobStorage` - `azureBlobStorageCluster` - `buildId` - `catboostEvaluate` - `cosn` - `currentDatabase` - `currentProfiles` - `currentRoles` - `currentSchemas` - `currentUser` - `current_database` - `current_schemas` - `current_user` - `database` - `defaultProfiles` - `defaultRoles` - `deltaLake` - `demangle` - `dictionary` - `displayName` - `enabledProfiles` - `enabledRoles` - `executable` - `file` - `fileCluster` - `filesystemAvailable` - `filesystemCapacity` - `filesystemFree` - `filesystemUnreserved` - `fullHostName` - `gcs` - `generateRandomStructure` - `getClientHTTPHeader` - `getMacro` - `getOSKernelVersion` - `getServerPort` - `getSetting` - `globalVariable` - `hasColumnInTable` - `hasThreadFuzzer` - `hdfs` - `hdfsCluster` - `hive` - `hostName` - `hostname` - `hudi` - `iceberg` - `indexHint` - `initialQueryID` - `initial_query_id` - `input` - `jdbc` - `JSONRemoveDynamoDBAnnotations` - `logTrace` - `loop` - `meiliMatch` - `meilisearch` - `merge` - `mergeTreeIndex` - `mongodb` - `odbc` - `oss` - `redis` - `remote` - `remoteSecure` - `reverseDNSQuery` - `revision` - `s3` - `s3Cluster` - `SCHEMA` - `serverUUID` - `shardCount` - `shardNum` - `showCertificate` - `sleep` - `sleepEachRow` - `sqlite` - `tcpPort` - `tid` - `unnestDynamoDBStructure` - `uptime` - `urlCluster` - `user` - `version` - `view` - `viewExplain` - `viewIfPermitted` - `zookeeperSessionUptime` ### Private beta¶ Tinybird supports the following ClickHouse table functions upon request: - `mysql` - `url` ## ClickHouse settings¶ Tinybird supports the following ClickHouse settings: - `aggregate_functions_null_for_empty` - `join_use_nulls` - `group_by_use_nulls` - `join_algorithm` - `date_time_output_format` Tinybird is not affiliated with, associated with, or sponsored by ClickHouse, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc. --- URL: https://www.tinybird.co/docs/use-cases Last update: 2024-10-03T12:47:27.000Z Content: --- title: "Use Cases · Tinybird Docs" theme-color: "#171612" description: "Build a user-facing analytics project with these end-to-end tutorials, videos, and blog posts." --- # Use Case Hub¶ Tinybird is the data platform for building real-time, user-facing analytics. This Use Case Hub provides centralized resources to help you understand and build specific solutions for your project. ## Examples¶ The top 5 most popular ways people are using Tinybird to build user-facing analytics features: - Want to provide a real-time, beautiful dashboard for your users? -->[ User-facing dashboards](https://www.tinybird.co/docs/docs/use-cases/user-facing-dashboards) . - Need to build a change data capture (CDC) system for your database? -->[ Real-time Change Data Capture (CDC)](https://www.tinybird.co/docs/docs/use-cases/realtime-cdc) . - Looking to level up your game with leaderboards, match-making, and personalized in-game ads? -->[ Gaming analytics](https://www.tinybird.co/docs/docs/use-cases/gaming-analytics) . - Want to build a way to monitor and analyze your web traffic? -->[ Web analytics](https://www.tinybird.co/docs/docs/use-cases/web-analytics) . - Already know that the secret to amazing UX in your app is personalization? -->[ Real-time personalization](https://www.tinybird.co/docs/docs/use-cases/realtime-personalization) . - Need a way for your content creators to analyze their content performance? 
-->[ User-generated content (UGC) analytics](https://www.tinybird.co/docs/docs/use-cases/ugc-analytics) . - Need to build a content recommendation system? -->[ Content recommendation systems](https://www.tinybird.co/docs/docs/use-cases/content-recommendation) . - Want to learn how to build a vector search system using real-time data? -->[ Vector search](https://www.tinybird.co/docs/docs/use-cases/vector-search) . Explore the nav for more. ## Not sure where to start?¶ Read the [Tinybird "Definitive Guide To Real-Time User-Facing Analytics" blog post](https://www.tinybird.co/blog-posts/user-facing-analytics). ## Customer stories¶ Learn how Tinybird customers have built user-facing analytics. - [ ](https://www.tinybird.co/case-studies/audiense) Audiense evolves its social media timeline with Tinybird - [ ](https://www.tinybird.co/case-studies/canva) Canva designs brilliant user experiences with Tinybird - [ ](https://www.tinybird.co/case-studies/dub) Dub shortens time to market for user-facing analytics with Tinybird - [ ](https://www.tinybird.co/case-studies/factorial-builds-real-time-data-products-with-tinybird) Factorial builds real-time data products with Tinybird - [ ](https://www.tinybird.co/case-studies/fanduel) FanDuel harnesses the power of real-time data using Tinybird - [ ](https://www.tinybird.co/case-studies/genially-uses-tinybird-to-display-realtime-metrics) Genially uses Tinybird to display real-time content interaction metrics for their customers - [ ](https://www.tinybird.co/case-studies/smartme-analytics) Smartme Analytics builds complete consumer insights with Tinybird - [ ](https://www.tinybird.co/case-studies/vercel-relies-on-tinybird-to-power-their-realtime-user-facing-analytics) Vercel uses Tinybird to help developers ship code faster --- URL: https://www.tinybird.co/docs/use-cases/content-recommendation Last update: 2024-09-30T12:19:50.000Z Content: --- title: "Content recommendation · Tinybird Docs" theme-color: "#171612" description: "Recommend useful or related content in real-time using SQL." --- # Content recommendation¶ Content recommendation systems can use real-time analytics or vector search approaches to show a web visitor or platform user content they're likely to enjoy based on similar content. Learn more about content recommendation systems and how to build them with Tinybird. ## Tutorials¶ - [ ](https://www.tinybird.co/docs/docs/guides/tutorials/vector-search-recommendation) Build a content recommendation API using vector search ## Blog posts¶ Learn more about content recommendation on the Tinybird Blog. --- URL: https://www.tinybird.co/docs/use-cases/gaming-analytics Last update: 2024-09-18T17:42:03.000Z Content: --- title: "Gaming analytics · Tinybird Docs" theme-color: "#171612" description: "Embed analytics in games." --- # Gaming analytics¶ From leaderboards, to match-making, to serving personalized in-game ads, data is at the core of modern gaming. Tinybird can help you ship user-facing analytics features in your games, faster. ## Tutorials¶ - [ ](https://www.tinybird.co/docs/docs/guides/tutorials/leaderboard) Build a real-time game leaderboard ## Demo¶ Want to see it in action? Check out our Tiny Flappybird demo: - [ Tiny Flappybird Demo](https://flappy.tinybird.co/) - [ Demo code](https://github.com/tinybirdco/flappy-tinybird) ## Videos¶ Watch screencasts and free training workshops that build user-facing analytics.
[](/docs/live/build-real-time-leaderboards-june-2024)![Free Workshop: Build Real-Time Leaderboards](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fassets%2Flive%2Fflappy-bird.jpeg&w=750&q=75) Free Workshop: Build Real-Time Leaderboards [](/docs/live/user-segmentation-at-1m-events-per-second)![User segmentation at 1M events per second](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fassets%2Flive%2Fuser-segmentation-at-1m-events-per-second.png&w=750&q=75) User segmentation at 1M events per second --- URL: https://www.tinybird.co/docs/use-cases/realtime-cdc Last update: 2024-10-03T12:50:26.000Z Content: --- title: "Real-time Change Data Capture · Tinybird Docs" theme-color: "#171612" description: "Build real-time CDC systems using fresh, accurate data from your favorite databases." --- # Change data capture¶ Change Data Capture (CDC) is a process used in databases to track and capture any changes made to data, such as inserts, updates, or deletes. It enables the identification and recording of modifications in real-time or near real-time, ensuring that downstream systems can be kept in sync with the primary database without having to constantly query the entire dataset. This is achieved by monitoring the transaction logs of the database, allowing CDC to capture only the changes rather than reprocessing all data. Tinybird helps you quickly build applications on top of your CDC data. When you send your change data to Tinybird, you are able to unify your CDC data with other data sources and transform it using SQL, and with one click or command line instruction, convert your SQL queries into high-concurrency, low-latency REST APIs. ## Tutorials¶ - [ ](https://www.tinybird.co/docs/docs/guides/querying-data/lambda-example-cdc) Change Data Capture project with lambda architecture - [ ](https://www.tinybird.co/docs/docs/guides/querying-data/deduplication-strategies) Deduplicate data in your Data Source ## Videos¶ Watch screencasts and free training workshops for building CDC systems. [](/docs/live/mongodb-cdc-confluent-tinybird-sep-2024)![Real-time MongoDB CDC with Confluent and Tinybird](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fassets%2Flive%2Fmongodb-cdc-confluent-tinybird-sep-2024.png&w=750&q=75) Real-time MongoDB CDC with Confluent and Tinybird [](/docs/live/dynamodb-cdc-july-2024)![Capturing DynamoDB Change Streams for Real-Time Analytics](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fassets%2Flive%2Fdynamodb-cdc-july-2024.png&w=750&q=75) Capturing DynamoDB Change Streams for Real-Time Analytics ## Blog posts¶ Learn more about CDC on the Tinybird Blog. - [ From CDC to real-time analytics with Tinybird and Estuary](https://www.tinybird.co/blog-posts/cdc-real-time-analytics-estuary-dekaf) - [ A practical guide to real-time CDC with MySQL](https://www.tinybird.co/blog-posts/mysql-cdc) - [ A practical guide to real-time CDC with Postgres](https://www.tinybird.co/blog-posts/postgres-cdc) - [ A practical guide to real-time CDC with MongoDB](https://www.tinybird.co/blog-posts/mongodb-cdc) --- URL: https://www.tinybird.co/docs/use-cases/realtime-personalization Last update: 2024-05-22T09:05:57.000Z Content: --- title: "Real-time personalization · Tinybird Docs" theme-color: "#171612" description: "Personalize your user experience in real time." --- # Real-time personalization¶ Real-time personalization is the pathway to better user experiences. 
This user-centered approach powers a vast number of modern customer experiences, and is fast becoming the standard of digital engagement. Not only can you provide meaningfully tailored recommendations, ads, or products, but you can also leverage real-time personalization to analyze transaction streams for digital fraud, and more. ## Blog posts¶ Learn more about user-facing analytics on the Tinybird Blog. - [ Using Tinybird for real-time marketing at Tinybird](https://www.tinybird.co/blog-posts/tinybird-for-real-time-marketing) - [ Real-time Personalization: Choosing the right tools](https://www.tinybird.co/blog-posts/real-time-personalization) - [ How to build a real-time fraud detection system](https://www.tinybird.co/blog-posts/how-to-build-a-real-time-fraud-detection-system) - [ Building real-time solutions with Snowflake at a fraction of the cost](https://www.tinybird.co/blog-posts/real-time-solutions-with-snowflake) - [ Designing a faster data model to personalize browsing in real time](https://www.tinybird.co/blog-posts/clickhouse-query-optimization) --- URL: https://www.tinybird.co/docs/use-cases/ugc-analytics Last update: 2024-05-22T09:34:43.000Z Content: --- title: "User-generated content · Tinybird Docs" theme-color: "#171612" description: "Provide your content creators with insights into their content performance." --- # User-generated content (UGC) analytics¶ For user-generated content platforms, creators are everything. Creators need to know how their content is performing, so they can make better decisions about what to create next. Tinybird can help you ship creator insights, faster. ## Demo¶ Want to see it in action? Check out our UGC analytics demo built with Mux: - [ Multi-tenant user-facing analytics with Tinybird and Mux](https://github.com/tinybirdco/demo-user-generated-content-analytics) ## Videos¶ Watch screencasts and free training workshops that build user-facing content analytics. [](/docs/live/ugc-analytics-mux-tinybird)![Building UGC Analytics with Mux and Tinybird](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fassets%2Flive%2Fugc-analytics-mux-tinybird.png&w=750&q=75) Building UGC Analytics with Mux and Tinybird ## Blog posts¶ - [ Upgrading short link analytics by 100x with Steven Tey](https://www.tinybird.co/blog-posts/upgrading-short-link-analytics-by-100x-with-steven-tey) - [ Low-code analytics with James Devonport of UserLoop](https://www.tinybird.co/blog-posts/low-code-analytics-with-james-devonport-of-userloop) --- URL: https://www.tinybird.co/docs/use-cases/user-facing-dashboards Last update: 2024-05-22T09:05:57.000Z Content: --- title: "User-facing dashboards · Tinybird Docs" theme-color: "#171612" description: "Build user-facing dashboards." --- # User-facing dashboards¶ Users love insights. Give your users a real-time data analytics dashboard so they can monitor what's happening (plus how, when, and where), as it happens. Tinybird can help you implement user-facing dashboards, faster. ## Tutorials¶ - [ ](https://www.tinybird.co/docs/docs/guides/tutorials/real-time-dashboard) Build a real-time dashboard with Tremor & Next.js - [ ](https://www.tinybird.co/docs/docs/guides/tutorials/bigquery-dashboard) Build user-facing dashboard with BigQuery - [ ](https://www.tinybird.co/docs/docs/guides/tutorials/analytics-with-confluent) Build user-facing analytics apps with Confluent ## Videos¶ Watch screencasts and free training workshops that build user-facing dashboards.
[](/docs/live/build-a-real-time-dashboard-apr-16)![Build User-Facing Analytics with Kafka, Tinybird, and React](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fassets%2Flive%2Fbuild-a-real-time-dashboard-apr-16.png&w=750&q=75) Build User-Facing Analytics with Kafka, Tinybird, and React [](/docs/live/kafka-real-time-dashboard)![Build a Real-Time Dashboard over Kafka](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fassets%2Flive%2Fkafka-real-time-dashboard.png&w=750&q=75) Build a Real-Time Dashboard over Kafka [](/docs/live/ugc-analytics-mux-tinybird)![Building UGC Analytics with Mux and Tinybird](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fassets%2Flive%2Fugc-analytics-mux-tinybird.png&w=750&q=75) Building UGC Analytics with Mux and Tinybird ## Blog posts¶ Learn more about user-facing analytics on the Tinybird Blog. - [ Building real-time leaderboards with Tinybird](https://www.tinybird.co/blog-posts/building-real-time-leaderboards-with-tinybird) - [ JWTs for API Endpoints now in public beta!](https://www.tinybird.co/blog-posts/jwt-api-endpoints-public-beta) - [ 7 tips to make your dashboards faster](https://www.tinybird.co/blog-posts/7-tips-to-make-your-dashboards-faster) - [ Build a real-time dashboard over BigQuery](https://www.tinybird.co/blog-posts/bigquery-real-time-dashboard) - [ Real-time dashboards: Are they worth it?](https://www.tinybird.co/blog-posts/real-time-dashboards-are-they-worth-it) - [ Real-time Data Visualization: How to build faster dashboards](https://www.tinybird.co/blog-posts/real-time-data-visualization) - [ How Typeform Built a Fully Functional User Dash With Tinybird](https://www.tinybird.co/blog-posts/typeform-utm-realtime-analytics) --- URL: https://www.tinybird.co/docs/use-cases/vector-search Last update: 2024-09-30T12:19:50.000Z Content: --- title: "Vector search · Tinybird Docs" theme-color: "#171612" description: "Use SQL functions to calculate vector distances and search large content tables." --- # Vector search¶ Vector search allows you to search through multi-modal content based on calculated embeddings. It is a popular approach to search when keyword matching is insufficient. Tinybird can help you search largescale vector embeddings in real-time and apply real-time analytics principles to make vector search more performant. ## Tutorials¶ - [ ](https://www.tinybird.co/docs/docs/guides/tutorials/vector-search-recommendation) Build a content recommendation API using vector search --- URL: https://www.tinybird.co/docs/use-cases/web-analytics Last update: 2024-05-13T16:10:11.000Z Content: --- title: "Web analytics · Tinybird Docs" theme-color: "#171612" description: "Monitor and analyze web traffic." --- # Web analytics¶ Get the data you need about your users, while satisfying data privacy laws, and without compromising on speed or scale. Track, analyze, and report on website visits, marketing conversions, ad-generated revenue, and more. Tinybird can help you monitor and analyze your web traffic, faster. 
## Tutorials¶ - [ ](https://www.tinybird.co/docs/docs/guides/tutorials/user-facing-web-analytics) Build a user-facing web analytics dashboard - [ ](https://www.tinybird.co/docs/docs/guides/tutorials/bigquery-dashboard) Build user-facing dashboard with BigQuery - [ ](https://www.tinybird.co/docs/docs/guides/tutorials/analytics-with-confluent) Build user-facing analytics apps with Confluent ## Videos¶ [](/docs/live/google-analytics-free)![Build a GDPR-compliant alternative to Google Analytics](https://www.tinybird.co/docs/docs/_next/image?url=%2Fdocs%2Fassets%2Flive%2Fgoogle-analytics-free.png&w=750&q=75) Build a GDPR-compliant alternative to Google Analytics ## Blog posts¶ - [ Multi-tenant analytics for SaaS applications](https://www.tinybird.co/blog-posts/multi-tenant-saas-options) - [ Building privacy-first native app telemetry with Guilherme Oenning](https://www.tinybird.co/blog-posts/dev-qa-guilherme-oenning) - [ Developer Q&A: Monitoring global API latency with chronark](https://www.tinybird.co/blog-posts/dev-qa-global-api-latency-chronark) - [ Developer Q&A with JR the Builder, co-creator of Beam Analytics](https://www.tinybird.co/blog-posts/qa-with-jr-the-builder-beam-analytics) - [ Looking for an open source Google Analytics alternative? Set one up in 3 minutes.](https://www.tinybird.co/blog-posts/google-analytics-alternative-in-3-minutes) - [ How I replaced Google Analytics with Retool and Tinybird, Part 2](https://www.tinybird.co/blog-posts/how-i-replaced-google-analytics-with-retool-and-tinybird-part-2) - [ How an eCommerce giant replaced Google Analytics for privacy and speed](https://www.tinybird.co/blog-posts/ecommerce-google-analytics-alternative) - [ How I replaced Google Analytics with Retool and Tinybird, Part 1](https://www.tinybird.co/blog-posts/how-i-replaced-google-analytics-with-retool-and-tinybird-part-1) - [ A privacy-first approach to building a Google Analytics alternative](https://www.tinybird.co/blog-posts/privacy-first-google-analytics-alternative) ---