Organize files in data projects
A data project is a set of files that describes how your data must be stored, processed, and exposed through APIs.
Just as you maintain source code files in a repository, use CI, make deployments, and run tests, Tinybird provides tools to follow the same workflow with data pipelines. In Tinybird, the source code of your project is its datafiles.
With a data project you can:
- Define how the data should flow, from schemas to API Endpoints.
- Manage your datafiles using version control.
- Branch your datafiles.
- Run tests.
- Deploy data projects.
Ecommerce site example
Consider an ecommerce site where you have events from users and a list of products with their attributes. Your goal is to expose several API endpoints to return sales per day and top product per day.
The data project file structure would look like the following:
ecommerce_data_project/
    datasources/
        events.datasource
        products.datasource
        fixtures/
            events.csv
            products.csv
    pipes/
        top_product_per_day.pipe
    endpoints/
        sales.pipe
        top_products.pipe
To follow this tutorial, download and open the example using the following commands:
Clone demo
git clone https://github.com/tinybirdco/ecommerce_data_project.git
cd ecommerce_data_project
Upload the project
You can push the whole project to your Tinybird account to check that everything is fine. The tb push command uploads the project to Tinybird, checking the project dependencies and the SQL syntax first. In this case, use the --push-deps flag to push everything:
Push dependencies
tb push --push-deps
After the upload completes, the endpoints defined in the project, sales and top_products, are available, and you can start pushing data to the different Data Sources.
Define Data Sources
Data Sources define how your data is ingested and stored. You can add data to Data Sources using the Data Sources API.
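For example, once the events Data Source exists in your account, the following is a minimal sketch of appending the CSV fixture bundled with the example through the Data Sources API, assuming your admin Token is stored in the TB_TOKEN environment variable and you use the default https://api.tinybird.co API host:

Append a CSV using the Data Sources API

curl \
    -H "Authorization: Bearer $TB_TOKEN" \
    -X POST "https://api.tinybird.co/v0/datasources?mode=append&name=events" \
    -F csv=@datasources/fixtures/events.csv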
Each Data Source is defined by a schema and other properties. See Datasource files.
The following snippet shows the content of the events.datasource file from the ecommerce example:
DESCRIPTION >
    # Events from users
    This contains all the events produced by Kafka, there are 4 fixed columns,
    plus a `json` column which contains the rest of the data for that event.
    See [documentation](url_for_docs) for the different events.

SCHEMA >
    timestamp DateTime,
    product String,
    user_id String,
    action String,
    json String

ENGINE MergeTree
ENGINE_SORTING_KEY timestamp
The file describes the schema and how the data is sorted. In this case, the data is accessed most of the time by the timestamp column. If no ENGINE_SORTING_KEY is set, Tinybird picks one by default: a date or datetime column in most cases.
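For instance, if your queries filtered mostly by product and then by time, a hypothetical variation of the example could declare the sorting key over both columns:

ENGINE MergeTree
ENGINE_SORTING_KEY product, timestamp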
To push the Data Source, run:
Push the events Data Source
tb push datasources/events.datasource
You can't override Data Sources: if you try to push a Data Source that already exists in your account, you get an error. To change a Data Source, remove it first, or push a new one with a different name.
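For example, a sketch of replacing the events Data Source with the classic CLI by removing it (which also deletes its data) and pushing the datafile again:

Replace the events Data Source

tb datasource rm events
tb push datasources/events.datasource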
Define data Pipes
The content of the pipes/top_product_per_day.pipe file creates a data Pipe that transforms the data as it's inserted:
NODE only_buy_events
DESCRIPTION >
    filters all the buy events

SQL >
    SELECT
        toDate(timestamp) date,
        product,
        JSONExtractFloat(json, 'price') AS price
    FROM events
    WHERE action = 'buy'

NODE top_per_day
SQL >
    SELECT
        date,
        topKState(10)(product) top_10,
        sumState(price) total_sales
    FROM only_buy_events
    GROUP BY date

TYPE materialized
DATASOURCE top_per_day_mv
ENGINE AggregatingMergeTree
ENGINE_SORTING_KEY date
Each Pipe can have one or more nodes. The previous Pipe defines two nodes, only_buy_events and top_per_day.
- The first node filters buy events and extracts some data from the json column.
- The second node runs the aggregation.
In general, use NODE to start a new node and then SQL > to define the SQL for that node. You can use other nodes inside the SQL. In this case, the second node uses the first one, only_buy_events.
To push the Pipe, run:
Populate
tb push pipes/top_product_per_day.pipe --populate
If you want to populate with the existing data in the events table, use the --populate flag.
When using the --populate flag you get a job URL, so you can check the status of the populate job at the URL provided. See Populate and copy data for more information on how populate jobs work.
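For instance, a sketch of checking the populate job through the Jobs API, assuming your Token is in the TB_TOKEN environment variable and <job_id> is the id shown in the URL returned by the CLI:

Check the populate job status

curl \
    -H "Authorization: Bearer $TB_TOKEN" \
    https://api.tinybird.co/v0/jobs/<job_id>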
Define API Endpoints
API Endpoints are the way you expose the data to be consumed.
The following snippet shows the content of the endpoints/top_products.pipe file:
NODE endpoint
DESCRIPTION >
    returns top 10 products for the last week

SQL >
    SELECT
        date,
        topKMerge(10)(top_10) AS top_10
    FROM top_per_day
    WHERE date > today() - interval 7 day
    GROUP BY date
The syntax is the same as in the data transformation Pipes, though you can access the results through the <api_host>/v0/pipes/top_products.json?token=TOKEN endpoint.
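For instance, a minimal sketch of calling the published endpoint with curl, assuming the default https://api.tinybird.co API host and a read Token stored in the TOKEN environment variable:

Call the top_products endpoint

curl "https://api.tinybird.co/v0/pipes/top_products.json?token=$TOKEN"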
When you push an endpoint, a Token with PIPE:READ permissions is automatically created. You can see it from the Tokens UI, or directly from the CLI with the command tb pipe token_read <endpoint_name>.
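For example, to get the read Token for the top_products endpoint:

Get the endpoint read Token

tb pipe token_read top_products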
Alternatively, you can add the TOKEN token_name READ directive to the Pipe to automatically create a Token named token_name with READ permissions over the endpoint, or to grant READ permissions over the endpoint to an existing token_name. For example:
TOKEN public_read_token READ

NODE endpoint
DESCRIPTION >
    returns top 10 products for the last week

SQL >
    SELECT
        date,
        topKMerge(10)(top_10) AS top_10
    FROM top_per_day
    WHERE date > today() - interval 7 day
    GROUP BY date
To push the endpoint, run:
Push the top products Pipe
tb push endpoints/top_products.pipe
The Token public_read_token was created automatically and it's provided in the test URL.
You can add parameters to any endpoint. For example, parametrize the dates to filter the data between two dates:
NODE endpoint
DESCRIPTION >
    returns top 10 products for the last week

SQL >
    %
    SELECT
        date,
        topKMerge(10)(top_10) AS top_10
    FROM top_per_day
    WHERE date BETWEEN {{Date(start)}} AND {{Date(end)}}
    GROUP BY date
Now, the endpoint can receive start and end parameters: <api_host>/v0/pipes/top_products.json?start=2018-09-07&end=2018-09-17&token=TOKEN.
You can print the results from the CLI using the pipe data command. For instance, for the previous example:
Print the results of the top products endpoint
tb pipe data top_products --start '2018-09-07' --end '2018-09-17' --format CSV
For the parameter templating to work, you need to start the node's SQL definition with the % character.
Override an endpoint or a data Pipe
When working on a project, you might need to push several versions of the same file. You can override a Pipe that has already been pushed using the --force flag. For example:
Override the Pipe
tb push endpoints/top_products_params.pipe --force
If the endpoint has been called before, Tinybird runs regression tests with the most frequent requests. If the new version doesn't return the same data, it isn't pushed. The command output shows how to run all the requests tested.
You can force the push without running the checks using the --no-check flag if needed. For example:
Force override
tb push endpoints/top_products_params.pipe --force --no-check
Download datafiles from Tinybird
You can download datafiles using the pull command. For example:
Pull a specific file
tb pull --match endpoint_im_working_on
The previous command downloads the endpoint_im_working_on.pipe file to the current folder.
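Without --match, tb pull downloads every datafile in your account. If your CLI version supports the --auto flag, it groups the downloaded files into datasources/ and pipes/ folders; a sketch:

Pull all datafiles

tb pull --auto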