S3 connector¶
You can set up an S3 connector to load your CSV, NDJSON, or Parquet files into Tinybird from any S3 bucket. Tinybird can detect new files in your buckets and ingest them automatically.
Setting up the S3 connector requires:
- Configuring AWS permissions using IAM roles.
- Creating a connection file in Tinybird.
- Creating a data source that uses this connection.
Environment considerations¶
Before setting up the connector, understand how it works in different environments. See Environments.
Cloud environment¶
In the Tinybird Cloud environment, Tinybird uses its own AWS account to assume the IAM role you create, allowing it to access your S3 bucket.
Local environment¶
When using the S3 connector in the Tinybird Local environment, which runs in a container, you need to pass your local AWS credentials to the container. These credentials must have the permissions described in the AWS permissions section, including S3 operations such as GetObject and ListBucket. This allows Tinybird Local to assume the IAM role you specify in your connection.
To pass your AWS credentials, use the --use-aws-creds flag when starting Tinybird Local:
tb local start --use-aws-creds

» Starting Tinybird Local...
✓ AWS credentials found and will be passed to Tinybird Local (region: us-east-1)
* Waiting for Tinybird Local to be ready...
✓ Tinybird Local is ready!
If you're using a specific AWS profile, you can specify it using the AWS_PROFILE environment variable:
AWS_PROFILE=my-profile tb local start --use-aws-creds
When using the S3 connector in the --local environment, continuous file ingestion is limited. For continuous ingestion of new files, use the Cloud environment.
Set up the connector¶
Create an S3 connection¶
You can create an S3 connection in Tinybird using either the guided CLI process or by manually creating a connection file.
Option 1: Use the guided CLI process (Recommended)¶
The Tinybird CLI provides a guided process that helps you set up the required AWS permissions and creates the connection file automatically:
tb connection create s3
When prompted, you'll need to:
- Enter a name for your connection.
- Enter the S3 bucket name.
- Enter the AWS region where your bucket is located.
- Copy the displayed AWS IAM policy to your clipboard (you'll need this to set up permissions in AWS).
- Copy the displayed AWS IAM role trust policy for your Local environment, then enter the ARN of the role you create.
- Copy the displayed AWS IAM role trust policy for your Cloud environment, then enter the ARN of the role you create.
- The ARN values are stored securely using tb secret, which lets you use a different role for each environment (see the sketch after this list).
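The following is a minimal sketch of how a connection file can read the stored ARN from a secret, assuming a secret named s3_role_arn and the tb_secret template function described in Connection files (both names are illustrative; the guided process may use different ones):

my-s3.connection
TYPE s3
S3_REGION "us-east-1"
S3_ARN {{ tb_secret("s3_role_arn") }}

Because secrets are resolved per environment, the same connection file can point to your Local role when working locally and to your Cloud role in Tinybird Cloud.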
Option 2: Create a connection file manually¶
You can also set up a connection manually by creating a connection file with the required credentials:
s3sample.connection
TYPE s3
S3_REGION "<S3_REGION>"
S3_ARN "<IAM_ROLE_ARN>"
When creating your connection manually, you need to set up the required AWS IAM role with appropriate permissions. See the AWS permissions section for details on the required access policy and trust policy configurations.
See Connection files for more details on how to create a connection file and manage secrets.
You need to create a separate connection for each environment you're working with: Local and Cloud. See Environments.
For example, you can create the following connections (see the sketch after this list):
- my-s3-local for your Local environment
- my-s3-cloud for your Cloud environment
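Those two files might look like the following sketch; the regions and role ARNs are placeholders you'd replace with your own values:

my-s3-local.connection
TYPE s3
S3_REGION "us-east-1"
S3_ARN "arn:aws:iam::111111111111:role/tinybird-s3-local"

my-s3-cloud.connection
TYPE s3
S3_REGION "us-east-1"
S3_ARN "arn:aws:iam::111111111111:role/tinybird-s3-cloud"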
Create an S3 data source¶
After creating the connection, you need to create a data source that uses it.
Create a .datasource file using tb create --prompt or manually:
s3sample.datasource
DESCRIPTION >
    Analytics events landing data source

SCHEMA >
    `timestamp` DateTime `json:$.timestamp`,
    `session_id` String `json:$.session_id`,
    `action` LowCardinality(String) `json:$.action`,
    `version` LowCardinality(String) `json:$.version`,
    `payload` String `json:$.payload`

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp"
ENGINE_TTL "timestamp + toIntervalDay(60)"

IMPORT_CONNECTION_NAME s3sample
IMPORT_BUCKET_URI s3://my-bucket/*.csv
IMPORT_SCHEDULE @auto
The IMPORT_CONNECTION_NAME setting must match the name of the .connection file you created in the previous step.
Test the connection¶
After defining your S3 data source and connection, test it by running a deploy check:
tb --cloud deploy --check
This runs the connection locally and checks if the connection is valid. To see the connection details, run tb --cloud connection ls.
.connection settings¶
The S3 connector uses the following settings in .connection files:
Instruction | Required | Description |
---|---|---|
S3_REGION | Yes | Region of the S3 bucket. |
S3_ARN | Yes | ARN of the IAM role with the required permissions. |
Once a connection is used in a data source, you can't change the ARN account ID or region. To modify these values, you must:
- Remove the connection from the data source.
- Deploy the changes.
- Add the connection again with the new values.
.datasource settings¶
The S3 connector uses the following settings in .datasource files:
Instruction | Required | Description |
---|---|---|
IMPORT_SCHEDULE | Yes | Use @auto to ingest new files automatically, or @once to only execute manually. In the --local environment, even with @auto, only the initial sync of existing files is performed; the connector doesn't ingest new files automatically afterwards. |
IMPORT_CONNECTION_NAME | Yes | Name given to the connection inside Tinybird. For example, 'my_connection' . This is the name of the connection file you created in the previous step. |
IMPORT_BUCKET_URI | Yes | Full bucket path, including the s3:// protocol, bucket name, object path, and an optional pattern to match against object keys. For example, s3://my-bucket/my-path discovers all files in the bucket my-bucket under the prefix /my-path . You can use patterns in the path to filter objects, for example, ending the path with *.csv matches all objects that end with the .csv suffix. |
IMPORT_FROM_TIMESTAMP | No | Sets the date and time from which to start ingesting files from an S3 bucket. The format is YYYY-MM-DDTHH:MM:SSZ. See the example after this table. |
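For example, a hypothetical variation of the IMPORT settings from the earlier data source that only ingests files created on or after January 1, 2024 could look like this sketch (the connection name, bucket, and timestamp are placeholders):

IMPORT_CONNECTION_NAME s3sample
IMPORT_BUCKET_URI s3://my-bucket/*.csv
IMPORT_SCHEDULE @auto
IMPORT_FROM_TIMESTAMP 2024-01-01T00:00:00Z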
The only supported change is updating IMPORT_SCHEDULE from @once to @auto, which makes the connector ingest all files matching the bucket URI pattern since the last on-demand ingestion.
For any other parameter changes, follow these steps, sketched after this list:
- Remove the connection from the data source.
- Deploy the changes.
- Add the connection again with the new values.
- Deploy again.
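A minimal sketch of that workflow, assuming you edit the .datasource file by hand and that tb --cloud deploy applies the changes to the Cloud environment:

# 1. Remove the IMPORT_* settings (IMPORT_CONNECTION_NAME, IMPORT_BUCKET_URI,
#    IMPORT_SCHEDULE, ...) from the .datasource file, then deploy:
tb --cloud deploy

# 2. Re-add the IMPORT_* settings with the new values, then deploy again:
tb --cloud deploy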
S3 file URI¶
The S3 connector supports the following wildcard patterns:
- Single asterisk or *: matches zero or more characters within a single directory level, excluding /. It doesn't cross directory boundaries. For example, s3://bucket-name/*.ndjson matches all .ndjson files in the root of your bucket but doesn't match files in subdirectories.
- Double asterisk or **: matches zero or more characters across multiple directory levels, including /. It can cross directory boundaries recursively. For example, s3://bucket-name/**/*.ndjson matches all .ndjson files in the bucket, regardless of their directory depth.
Use the full S3 file URI and wildcards to select multiple files. The file extension is required to accurately match the desired files in your pattern.
Due to a limitation in Amazon S3, you can't create different S3 data sources with path expressions that collide. For example: s3://my_bucket/**/*.csv and s3://my_bucket/transactions/*.csv.
Examples¶
The following are examples of patterns you can use and whether they'd match the example file path:
File path | S3 File URI | Will match? |
---|---|---|
example.ndjson | s3://bucket-name/*.ndjson | Yes. Matches files in the root directory with the .ndjson extension. |
example.ndjson.gz | s3://bucket-name/**/*.ndjson.gz | Yes. Recursively matches .ndjson.gz files anywhere in the bucket. |
example.ndjson | s3://bucket-name/example.ndjson | Yes. Exact match to the file path. |
pending/example.ndjson | s3://bucket-name/*.ndjson | No. * doesn't cross directory boundaries. |
pending/example.ndjson | s3://bucket-name/**/*.ndjson | Yes. Recursively matches .ndjson files in any subdirectory. |
pending/example.ndjson | s3://bucket-name/pending/example.ndjson | Yes. Exact match to the file path. |
pending/example.ndjson | s3://bucket-name/pending/*.ndjson | Yes. Matches .ndjson files within the pending directory. |
pending/example.ndjson | s3://bucket-name/pending/**/*.ndjson | Yes. Recursively matches .ndjson files within pending and all its subdirectories. |
pending/example.ndjson | s3://bucket-name/**/pending/example.ndjson | Yes. Matches the exact path to pending/example.ndjson within any preceding directories. |
pending/example.ndjson | s3://bucket-name/other/example.ndjson | No. Doesn't match because the path includes directories which aren't part of the file's actual path. |
pending/example.ndjson.gz | s3://bucket-name/pending/*.csv.gz | No. The file extension .ndjson.gz doesn't match .csv.gz. |
pending/o/inner/example.ndjson | s3://bucket-name/*.ndjson | No. * doesn't cross directory boundaries. |
pending/o/inner/example.ndjson | s3://bucket-name/**/*.ndjson | Yes. Recursively matches .ndjson files anywhere in the bucket. |
pending/o/inner/example.ndjson | s3://bucket-name/**/inner/example.ndjson | Yes. Matches the exact path to inner/example.ndjson within any preceding directories. |
pending/o/inner/example.ndjson | s3://bucket-name/**/ex*.ndjson | Yes. Recursively matches .ndjson files starting with ex at any depth. |
pending/o/inner/example.ndjson | s3://bucket-name/**/**/*.ndjson | Yes. Matches .ndjson files at any depth, even with multiple ** wildcards. |
pending/o/inner/example.ndjson | s3://bucket-name/pending/**/*.ndjson | Yes. Matches .ndjson files within pending and all its subdirectories. |
pending/o/inner/example.ndjson | s3://bucket-name/inner/example.ndjson | No. Doesn't match because the path includes directories which aren't part of the file's actual path. |
pending/o/inner/example.ndjson | s3://bucket-name/pending/example.ndjson | No. Doesn't match because the path includes directories which aren't part of the file's actual path. |
pending/o/inner/example.ndjson.gz | s3://bucket-name/pending/*.ndjson.gz | No. * doesn't cross directory boundaries. |
pending/o/inner/example.ndjson.gz | s3://bucket-name/other/example.ndjson.gz | No. Doesn't match because the path includes directories which aren't part of the file's actual path. |
Considerations¶
When using patterns:
- Use specific directory names or even specific file URIs to limit the scope of your search. The more specific your pattern, the narrower the search.
- Combine wildcards: you can combine ** with other patterns to match files in subdirectories selectively. For example, s3://bucket-name/**/logs/*.ndjson matches .ndjson files within any logs directory at any depth.
- Avoid unintended matches: be cautious with **, as it can match many files, which might impact performance and return partial matches.
Supported file types¶
The S3 connector supports the following file types:
File type | Accepted extensions | Compression formats supported |
---|---|---|
CSV | .csv , .csv.gz | gzip |
NDJSON | .ndjson , .ndjson.gz , .jsonl , .jsonl.gz , .json , .json.gz | gzip |
Parquet | .parquet , .parquet.gz | snappy , gzip , lzo , brotli , lz4 , zstd |
You can upload files with the .json extension, provided they follow the Newline Delimited JSON (NDJSON) format: each line must be a valid JSON object and must end with a \n character.
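For example, a .json or .ndjson file matching the schema from the earlier data source could contain lines like these (the values are purely illustrative):

{"timestamp":"2024-01-01T10:00:00","session_id":"a1b2c3","action":"click","version":"1.0.0","payload":"{}"}
{"timestamp":"2024-01-01T10:00:05","session_id":"a1b2c3","action":"view","version":"1.0.0","payload":"{}"}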
Parquet schemas use the same format as NDJSON schemas, using JSONPath syntax.
AWS permissions¶
The S3 connector requires an IAM Role with specific permissions to access objects in your Amazon S3 bucket:
- s3:GetObject
- s3:ListBucket
- s3:GetBucketNotification
- s3:PutBucketNotification
- s3:GetBucketLocation
You need to create both an access policy and a trust policy in AWS. The access policy grants the required permissions on your bucket:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:ListBucket", "s3:GetBucketNotification", "s3:PutBucketNotification", "s3:GetBucketLocation" ], "Resource": [ "arn:aws:s3:::{bucket-name}", "arn:aws:s3:::{bucket-name}/*" ] } ] }