Ingest from Snowflake via unloading¶
In this guide you'll learn how to send data from Snowflake to Tinybird, for scenarios where the native connector can't be used —things outside a one-off load or periodical full replaces of the table, or where limits apply—.
This process relies on unloading (aka bulk export) data as gzipped CSVs and then ingesting via Data Sources API.
This guide explains the process using Azure Blob Storage, but it's easy to replicate using Amazon S3, Google Cloud Storage, or any storage service where you can unload data from Snowflake and share presigned URLs to access the files.
This guide is a walkthrough of the most common, basic process: Unload the table from Snowflake, then ingest this export into Tinybird.
Prerequisites¶
This guide assumes you have a Tinybird account, you are familiar with creating a Tinybird Workspace and pushing resources to it.
You will also need access to Snowflake, and permissions to create SAS Tokens for Azure Blob Storage or its equivalents in AWS S3 and Google Cloud Storage.
1. Unload the Snowflake table¶
Snowflake makes it really easy to unload query results to flat files to and external storage service.
COPY INTO 'azure://myaccount.blob.core.windows.net/unload/' FROM mytable CREDENTIALS = ( AZURE_SAS_TOKEN='****' ) FILE_FORMAT = ( TYPE = CSV COMPRESSION = GZIP ) HEADER = FALSE;
The most basic implementation is unloading directly, but for production use cases consider adding a named stage as suggested in the docs. Stages will give you more fine-grained control to grant access rights.
2. Create a SAS token for the file¶
Using Azure CLI, generate a shared access signature (SAS) token so Tinybird can read the file:
az storage blob generate-sas \ --account-name myaccount \ --account-key '****' \ --container-name unload \ --name data.csv.gz \ --permissions r \ --expiry <expiry-ts> \ --https-only \ --output tsv \ --full-uri > 'https://myaccount.blob.core.windows.net/unload/data.csv.gz?se=2024-05-31T10%3A57%3A41Z&sp=r&spr=https&sv=2022-11-02&sr=b&sig=PMC%2E9ZvOFtKATczsBQgFSsH1%2BNkuJvO9dDPkTpxXH0g%5D'
Use the same behavior in S3 and GCS to generate pre-signed URLs.
3. Ingest into Tinybird¶
Take that generated URL and make a call to Tinybird. You'll need a Token with DATASOURCES:CREATE
permissions:
curl \ -H "Authorization: Bearer <DATASOURCES:CREATE token>" \ -X POST "https://api.tinybird.co/v0/datasources?name=my_datasource_name" \ -d url='https://myaccount.blob.core.windows.net/unload/data.csv.gz?se=2024-05-31T10%3A57%3A41Z&sp=r&spr=https&sv=2022-11-02&sr=b&sig=PMC%2E9ZvOFtKATczsBQgFSsH1%2BNkuJvO9dDPkTpxXH0g%5D'
You should now have your Snowflake Table in Tinybird.
Automation¶
To adapt to more "real-life" scenarios (like having to append data on a timely basis, replace data that has been updated in Snowflake, etc.), you may need to define scheduled actions to move the data. You can see examples in the Ingest from Google Cloud Storage guide and in Schedule data ingestion with cron and GitHub Actions guide.
Limits¶
You will be using Data Sources API, so its limits apply:
Description | Limit |
---|---|
Append/Replace data to Data Source | 5 times per minute |
Max file size (uncompressed) | build plan 10GB |
Max file size (uncompressed) | pro and enterprise 32GB |
As a result of these limits, you may need to adjust your COPY INTO <location> expression adding PARTITION
or MAX_FILE_SIZE = 5000000000
.
COPY INTO 'azure://myaccount.blob.core.windows.net/unload/' FROM mytable CREDENTIALS=( AZURE_SAS_TOKEN='****') FILE_FORMAT = ( TYPE = CSV COMPRESSION = GZIP ) HEADER = FALSE MAX_FILE_SIZE = 5000000000;
Next steps¶
These resources may be useful: