S3 connector¶
You can set up an S3 connector to load your CSV, NDJSON, or Parquet files into Tinybird from any S3 bucket. Tinybird detects new files in your buckets and ingests them automatically.
Setting up the S3 connector requires creating and enabling a data source and its connection as separate files.
Supported file types¶
The S3 connector supports the following file types:
File type | Accepted extensions | Compression formats supported |
---|---|---|
CSV | .csv , .csv.gz | gzip |
NDJSON | .ndjson , .ndjson.gz , .jsonl , .jsonl.gz , .json , .json.gz | gzip |
Parquet | .parquet , .parquet.gz | snappy , gzip , lzo , brotli , lz4 , zstd |
You can upload files with the `.json` extension, provided they follow the Newline Delimited JSON (NDJSON) format: each line must be a valid JSON object, and every line must end with a `\n` character.
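As an illustration, each line of an NDJSON file can be parsed independently. The following Python sketch validates a small NDJSON payload; the field names are hypothetical examples, not a required schema:

```python
import json

# Two NDJSON lines: each is a standalone JSON object ending in "\n".
# The field names here are hypothetical examples.
ndjson_content = (
    '{"timestamp": "2024-01-01T00:00:00Z", "action": "click"}\n'
    '{"timestamp": "2024-01-01T00:00:01Z", "action": "view"}\n'
)

# Validate: every non-empty line must parse as a JSON object.
records = [json.loads(line) for line in ndjson_content.splitlines() if line]
assert all(isinstance(r, dict) for r in records)
print(len(records))  # 2
```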
Parquet schemas use the same format as NDJSON schemas, with JSONPath syntax.
Set up the connector¶
To set up the S3 connector, follow these steps.
Prerequisites¶
The S3 connector requires permissions to access objects in your Amazon S3 bucket. The IAM Role needs the following permissions:
s3:GetObject
s3:ListBucket
s3:GetBucketNotification
s3:PutBucketNotification
s3:GetBucketLocation
s3:PutObject
s3:PutObjectAcl
Apply the required permissions¶
To connect to your S3 bucket, set up the required permissions in AWS. See Prerequisites.
The following is an example of an AWS access policy with the required permissions:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket",
                "s3:GetBucketNotification",
                "s3:PutBucketNotification",
                "s3:GetBucketLocation",
                "s3:PutObject",
                "s3:PutObjectAcl"
            ],
            "Resource": [
                "arn:aws:s3:::{bucket-name}",
                "arn:aws:s3:::{bucket-name}/*"
            ]
        }
    ]
}
```
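To avoid typos when substituting your bucket name into the policy, you can render the document programmatically. This is a minimal Python sketch, where the bucket name is a placeholder you supply:

```python
import json

# The permissions listed in the Prerequisites section.
REQUIRED_ACTIONS = [
    "s3:GetObject", "s3:ListBucket", "s3:GetBucketNotification",
    "s3:PutBucketNotification", "s3:GetBucketLocation",
    "s3:PutObject", "s3:PutObjectAcl",
]

def build_access_policy(bucket_name: str) -> str:
    """Render the IAM access policy JSON for a given bucket name."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": REQUIRED_ACTIONS,
            "Resource": [
                f"arn:aws:s3:::{bucket_name}",
                f"arn:aws:s3:::{bucket_name}/*",
            ],
        }],
    }
    return json.dumps(policy, indent=2)

print(build_access_policy("my-bucket"))
```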
Create an S3 connection¶
Before you create and configure an S3 data source, you need to set up a connection. Create a connection file with the required credentials stored in secrets. For example:

s3sample.connection

```
TYPE s3
S3_REGION "<S3_REGION>"
S3_ARN "<IAM_ROLE_ARN>"
```
You can obtain the IAM role ARN after applying the required permissions. See Apply the required permissions.
See Settings for a list of S3 connection settings.
Create an S3 data source¶
Create a .datasource file using `tb create --prompt` or manually.
The .datasource file must contain the desired schema and the required settings for S3, including the S3_CONNECTION_NAME
setting, which must match the name of the .connection file you created in the previous step.
For example:
s3sample.datasource

```
DESCRIPTION >
    Analytics events landing data source

SCHEMA >
    `timestamp` DateTime `json:$.timestamp`,
    `session_id` String `json:$.session_id`,
    `action` LowCardinality(String) `json:$.action`,
    `version` LowCardinality(String) `json:$.version`,
    `payload` String `json:$.payload`

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp"
ENGINE_TTL "timestamp + toIntervalDay(60)"

IMPORT_CONNECTION_NAME s3sample
IMPORT_BUCKET_URI s3://my-bucket/*.csv
IMPORT_SCHEDULE @auto
IMPORT_STRATEGY APPEND
```
Test and deploy¶
After you've defined your S3 data source and connection, you can test and deploy the changes as usual. See Test and deploy.
To check if the connection is active, run `tb connection ls`.
.datasource settings¶
The S3 connector uses the following settings in .datasource files:
Instruction | Required | Description |
---|---|---|
IMPORT_CONNECTION_NAME | Yes | Name given to the connection inside Tinybird. For example, 'my_connection' . |
IMPORT_STRATEGY | Yes | Strategy to use when inserting data. Use APPEND for S3 connections. |
IMPORT_BUCKET_URI | Yes | Full bucket path, including the s3:// protocol, bucket name, object path, and an optional pattern to match against object keys. For example, s3://my-bucket/my-path discovers all files in the bucket my-bucket under the prefix /my-path . You can use patterns in the path to filter objects, for example, ending the path with *.csv matches all objects that end with the .csv suffix. |
IMPORT_FROM_DATETIME | No | Sets the date and time from which to start ingesting files on an S3 bucket. The format is YYYY-MM-DDTHH:MM:SSZ . |
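The `IMPORT_FROM_DATETIME` timestamp format can be produced with a standard `strftime` call. A small Python sketch, using an arbitrary example date:

```python
from datetime import datetime, timezone

# Format a UTC datetime as YYYY-MM-DDTHH:MM:SSZ, the format expected
# by IMPORT_FROM_DATETIME. The date itself is an arbitrary example.
start = datetime(2024, 3, 1, 12, 30, 0, tzinfo=timezone.utc)
import_from = start.strftime("%Y-%m-%dT%H:%M:%SZ")
print(import_from)  # 2024-03-01T12:30:00Z
```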
.connection settings¶
The S3 connector uses the following settings in .connection files:
Instruction | Required | Description |
---|---|---|
S3_REGION | Yes | Region of the S3 bucket. |
S3_ARN | Yes | ARN of the IAM role with the required permissions. |
S3 file URI¶
The S3 connector supports the following wildcard patterns:

- Single asterisk (`*`): matches zero or more characters within a single directory level, excluding `/`. It doesn't cross directory boundaries. For example, `s3://bucket-name/*.ndjson` matches all `.ndjson` files in the root of your bucket but doesn't match files in subdirectories.
- Double asterisk (`**`): matches zero or more characters across multiple directory levels, including `/`. It can cross directory boundaries recursively. For example, `s3://bucket-name/**/*.ndjson` matches all `.ndjson` files in the bucket, regardless of their directory depth.
Use the full S3 file URI and wildcards to select multiple files. The file extension is required to accurately match the desired files in your pattern.
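To reason about which object keys a pattern selects, the matching rules described above can be approximated with a regular-expression translation. The following Python sketch illustrates the semantics (it is not Tinybird's implementation):

```python
import re

def s3_glob_to_regex(pattern: str) -> re.Pattern:
    """Translate an S3-style glob into a regex:
    '**' crosses '/' boundaries, '*' does not."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")  # zero or more whole directory levels
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")        # '**' matches across directories
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")     # '*' stays within one directory level
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(out) + "$")

def matches(pattern: str, key: str) -> bool:
    return bool(s3_glob_to_regex(pattern).match(key))

print(matches("*.ndjson", "example.ndjson"))                # True
print(matches("*.ndjson", "pending/example.ndjson"))        # False
print(matches("**/*.ndjson", "pending/o/example.ndjson"))   # True
```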
Due to a limitation in Amazon S3, you can't create different S3 data sources with path expressions that collide, for example, `s3://my_bucket/**/*.csv` and `s3://my_bucket/transactions/*.csv`.
Examples¶
The following are examples of patterns you can use and whether they'd match the example file path:
File path | S3 File URI | Will match? |
---|---|---|
example.ndjson | s3://bucket-name/*.ndjson | Yes. Matches files in the root directory with the .ndjson extension. |
example.ndjson.gz | s3://bucket-name/**/*.ndjson.gz | Yes. Recursively matches .ndjson.gz files anywhere in the bucket. |
example.ndjson | s3://bucket-name/example.ndjson | Yes. Exact match to the file path. |
pending/example.ndjson | s3://bucket-name/*.ndjson | No. * doesn't cross directory boundaries. |
pending/example.ndjson | s3://bucket-name/**/*.ndjson | Yes. Recursively matches .ndjson files in any subdirectory. |
pending/example.ndjson | s3://bucket-name/pending/example.ndjson | Yes. Exact match to the file path. |
pending/example.ndjson | s3://bucket-name/pending/*.ndjson | Yes. Matches .ndjson files within the pending directory. |
pending/example.ndjson | s3://bucket-name/pending/**/*.ndjson | Yes. Recursively matches .ndjson files within pending and all its subdirectories. |
pending/example.ndjson | s3://bucket-name/**/pending/example.ndjson | Yes. Matches the exact path to pending/example.ndjson within any preceding directories. |
pending/example.ndjson | s3://bucket-name/other/example.ndjson | No. Doesn't match because the path includes directories which aren't part of the file's actual path. |
pending/example.ndjson.gz | s3://bucket-name/pending/*.csv.gz | No. The file extension .ndjson.gz doesn't match .csv.gz. |
pending/o/inner/example.ndjson | s3://bucket-name/*.ndjson | No. * doesn't cross directory boundaries. |
pending/o/inner/example.ndjson | s3://bucket-name/**/*.ndjson | Yes. Recursively matches .ndjson files anywhere in the bucket. |
pending/o/inner/example.ndjson | s3://bucket-name/**/inner/example.ndjson | Yes. Matches the exact path to inner/example.ndjson within any preceding directories. |
pending/o/inner/example.ndjson | s3://bucket-name/**/ex*.ndjson | Yes. Recursively matches .ndjson files starting with ex at any depth. |
pending/o/inner/example.ndjson | s3://bucket-name/**/**/*.ndjson | Yes. Matches .ndjson files at any depth, even with multiple ** wildcards. |
pending/o/inner/example.ndjson | s3://bucket-name/pending/**/*.ndjson | Yes. Matches .ndjson files within pending and all its subdirectories. |
pending/o/inner/example.ndjson | s3://bucket-name/inner/example.ndjson | No. Doesn't match because the path includes directories which aren't part of the file's actual path. |
pending/o/inner/example.ndjson | s3://bucket-name/pending/example.ndjson | No. Doesn't match because the path includes directories which aren't part of the file's actual path. |
pending/o/inner/example.ndjson.gz | s3://bucket-name/pending/*.ndjson.gz | No. * doesn't cross directory boundaries. |
pending/o/inner/example.ndjson.gz | s3://bucket-name/other/example.ndjson.gz | No. Doesn't match because the path includes directories which aren't part of the file's actual path. |
Considerations¶
When using patterns:

- Use specific directory names or even specific file URIs to limit the scope of your search. The more specific your pattern, the narrower the search.
- Combine wildcards: you can combine `**` with other patterns to match files in subdirectories selectively. For example, `s3://bucket-name/**/logs/*.ndjson` matches `.ndjson` files within any `logs` directory at any depth.
- Avoid unintended matches: be cautious with `**`, as it can match many files, which might impact performance and return partial matches.