S3 connector¶
You can set up an S3 connector to load your CSV, NDJSON, or Parquet files into Tinybird from any S3 bucket. Tinybird can detect new files in your buckets and ingest them automatically.
Setting up the S3 connector requires:
- Configuring AWS permissions using IAM roles.
- Creating a connection file in Tinybird.
- Creating a data source that uses this connection.
Environment considerations¶
Before setting up the connector, understand how it works in different environments. See Environments.
Cloud environment¶
In the Tinybird Cloud environment, Tinybird uses its own AWS account to assume the IAM role you create, allowing it to access your S3 bucket.
Local environment¶
When using the S3 connector in the Tinybird Local environment, which runs in a container, you need to pass your local AWS credentials to the container. These credentials must have the permissions described in the AWS permissions section, including S3 operations such as GetObject and ListBucket. This allows Tinybird Local to assume the IAM role you specify in your connection.
To pass your AWS credentials, use the --use-aws-creds flag when starting Tinybird Local:
tb local start --use-aws-creds

» Starting Tinybird Local...
✓ AWS credentials found and will be passed to Tinybird Local (region: us-east-1)
* Waiting for Tinybird Local to be ready...
✓ Tinybird Local is ready!
If you're using a specific AWS profile, you can specify it using the AWS_PROFILE environment variable:
AWS_PROFILE=my-profile tb local start --use-aws-creds
When using the S3 connector in the --local environment, continuous file ingestion is limited. For continuous ingestion of new files, use the Cloud environment.
Set up the connector¶
Create an S3 connection¶
You can create an S3 connection in Tinybird using either the guided CLI process or by manually creating a connection file.
Option 1: Use the guided CLI process (Recommended)¶
The Tinybird CLI provides a guided process that helps you set up the required AWS permissions and creates the connection file automatically:
tb connection create s3
When prompted, you'll need to:
- Enter a name for your connection.
- Enter the S3 bucket name.
- Enter the AWS region where your bucket is located.
- Copy the displayed AWS IAM policy to your clipboard (you'll need this to set up permissions in AWS).
- Copy the displayed AWS IAM role trust policy for your Local environment, then enter the ARN of the role you create.
- Copy the displayed AWS IAM role trust policy for your Cloud environment, then enter the ARN of the role you create.
- The ARN values are stored securely using tb secret, which lets you use a different role for each environment (see the sketch after this list).
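The following is a minimal sketch of how a connection file can read the stored ARN from a secret, assuming a secret named s3_role_arn and the tb_secret template function described in Connection files (both names are illustrative; the guided process may use different ones):

my-s3.connection
TYPE s3
S3_REGION "us-east-1"
S3_ARN {{ tb_secret("s3_role_arn") }}

Because secrets are resolved per environment, the same connection file can point to your Local role when working locally and to your Cloud role in Tinybird Cloud.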
Option 2: Create a connection file manually¶
You can also set up a connection manually by creating a connection file with the required credentials:
s3sample.connection
TYPE s3
S3_REGION "<S3_REGION>"
S3_ARN "<IAM_ROLE_ARN>"
When creating your connection manually, you need to set up the required AWS IAM role with appropriate permissions. See the AWS permissions section for details on the required access policy and trust policy configurations.
See Connection files for more details on how to create a connection file and manage secrets.
You need to create a separate connection for each environment you're working with: Local and Cloud. See Environments.
For example, you can create the following connections (see the sketch after this list):
- my-s3-local for your Local environment
- my-s3-cloud for your Cloud environment
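Those two files might look like the following sketch; the regions and role ARNs are placeholders you'd replace with your own values:

my-s3-local.connection
TYPE s3
S3_REGION "us-east-1"
S3_ARN "arn:aws:iam::111111111111:role/tinybird-s3-local"

my-s3-cloud.connection
TYPE s3
S3_REGION "us-east-1"
S3_ARN "arn:aws:iam::111111111111:role/tinybird-s3-cloud"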
Create an S3 data source¶
After creating the connection, you need to create a data source that uses it.
Create a .datasource file using tb create --prompt or manually:
s3sample.datasource
DESCRIPTION >
    Analytics events landing data source

SCHEMA >
    `timestamp` DateTime `json:$.timestamp`,
    `session_id` String `json:$.session_id`,
    `action` LowCardinality(String) `json:$.action`,
    `version` LowCardinality(String) `json:$.version`,
    `payload` String `json:$.payload`

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp"
ENGINE_TTL "timestamp + toIntervalDay(60)"

IMPORT_CONNECTION_NAME s3sample
IMPORT_BUCKET_URI s3://my-bucket/*.csv
IMPORT_SCHEDULE @auto
The IMPORT_CONNECTION_NAME setting must match the name of the .connection file you created in the previous step.
Test the connection¶
After defining your S3 data source and connection, test it by running a deploy check:
tb --cloud deploy --check
This runs the connection locally and checks if the connection is valid. To see the connection details, run tb --cloud connection ls.
.connection settings¶
The S3 connector uses the following settings in .connection files:
Instruction | Required | Description |
---|---|---|
S3_REGION | Yes | Region of the S3 bucket. |
S3_ARN | Yes | ARN of the IAM role with the required permissions. |
Once a connection is used in a data source, you can't change the ARN account ID or region. To modify these values, you must:
- Remove the connection from the data source.
- Deploy the changes.
- Add the connection again with the new values.
.datasource settings¶
The S3 connector uses the following settings in .datasource files:
Instruction | Required | Description |
---|---|---|
IMPORT_SCHEDULE | Yes | Use @auto to ingest new files automatically, or @once to only execute manually. In the --local environment, even with @auto, only the initial sync of existing files is performed; the connector doesn't ingest new files automatically afterwards. |
IMPORT_CONNECTION_NAME | Yes | Name given to the connection inside Tinybird. For example, 'my_connection' . This is the name of the connection file you created in the previous step. |
IMPORT_BUCKET_URI | Yes | Full bucket path, including the s3:// protocol, bucket name, object path, and an optional pattern to match against object keys. For example, s3://my-bucket/my-path discovers all files in the bucket my-bucket under the prefix /my-path . You can use patterns in the path to filter objects, for example, ending the path with *.csv matches all objects that end with the .csv suffix. |
IMPORT_FROM_TIMESTAMP | No | Sets the date and time from which to start ingesting files from an S3 bucket. The format is YYYY-MM-DDTHH:MM:SSZ. See the example after this table. |
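For example, a hypothetical variation of the IMPORT settings from the earlier data source that only ingests files created on or after January 1, 2024 could look like this sketch (the connection name, bucket, and timestamp are placeholders):

IMPORT_CONNECTION_NAME s3sample
IMPORT_BUCKET_URI s3://my-bucket/*.csv
IMPORT_SCHEDULE @auto
IMPORT_FROM_TIMESTAMP 2024-01-01T00:00:00Z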
The only supported change is updating IMPORT_SCHEDULE from @once to @auto, which makes the connector ingest all files matching the bucket URI pattern since the last on-demand ingestion.
For any other parameter changes, follow these steps, sketched after this list:
- Remove the connection from the data source.
- Deploy the changes.
- Add the connection again with the new values.
- Deploy again.
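A minimal sketch of that workflow, assuming you edit the .datasource file by hand and that tb --cloud deploy applies the changes to the Cloud environment:

# 1. Remove the IMPORT_* settings (IMPORT_CONNECTION_NAME, IMPORT_BUCKET_URI,
#    IMPORT_SCHEDULE, ...) from the .datasource file, then deploy:
tb --cloud deploy

# 2. Re-add the IMPORT_* settings with the new values, then deploy again:
tb --cloud deploy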
S3 file URI¶
The S3 connector supports the following wildcard patterns:
- Single asterisk or *: matches zero or more characters within a single directory level, excluding /. It doesn't cross directory boundaries. For example, s3://bucket-name/*.ndjson matches all .ndjson files in the root of your bucket but doesn't match files in subdirectories.
- Double asterisk or **: matches zero or more characters across multiple directory levels, including /. It can cross directory boundaries recursively. For example, s3://bucket-name/**/*.ndjson matches all .ndjson files in the bucket, regardless of their directory depth.
Use the full S3 file URI and wildcards to select multiple files. The file extension is required to accurately match the desired files in your pattern.
Due to a limitation in Amazon S3, you can't create different S3 data sources with path expressions that collide. For example: s3://my_bucket/**/*.csv and s3://my_bucket/transactions/*.csv.
Examples¶
The following are examples of patterns you can use and whether they'd match the example file path:
File path | S3 File URI | Will match? |
---|---|---|
example.ndjson | s3://bucket-name/*.ndjson | Yes. Matches files in the root directory with the .ndjson extension. |
example.ndjson.gz | s3://bucket-name/**/*.ndjson.gz | Yes. Recursively matches .ndjson.gz files anywhere in the bucket. |
example.ndjson | s3://bucket-name/example.ndjson | Yes. Exact match to the file path. |
pending/example.ndjson | s3://bucket-name/*.ndjson | No. * doesn't cross directory boundaries. |
pending/example.ndjson | s3://bucket-name/**/*.ndjson | Yes. Recursively matches .ndjson files in any subdirectory. |
pending/example.ndjson | s3://bucket-name/pending/example.ndjson | Yes. Exact match to the file path. |
pending/example.ndjson | s3://bucket-name/pending/*.ndjson | Yes. Matches .ndjson files within the pending directory. |
pending/example.ndjson | s3://bucket-name/pending/**/*.ndjson | Yes. Recursively matches .ndjson files within pending and all its subdirectories. |
pending/example.ndjson | s3://bucket-name/**/pending/example.ndjson | Yes. Matches the exact path to pending/example.ndjson within any preceding directories. |
pending/example.ndjson | s3://bucket-name/other/example.ndjson | No. Doesn't match because the path includes directories which aren't part of the file's actual path. |
pending/example.ndjson.gz | s3://bucket-name/pending/*.csv.gz | No. The file extension .ndjson.gz doesn't match .csv.gz. |
pending/o/inner/example.ndjson | s3://bucket-name/*.ndjson | No. * doesn't cross directory boundaries. |
pending/o/inner/example.ndjson | s3://bucket-name/**/*.ndjson | Yes. Recursively matches .ndjson files anywhere in the bucket. |
pending/o/inner/example.ndjson | s3://bucket-name/**/inner/example.ndjson | Yes. Matches the exact path to inner/example.ndjson within any preceding directories. |
pending/o/inner/example.ndjson | s3://bucket-name/**/ex*.ndjson | Yes. Recursively matches .ndjson files starting with ex at any depth. |
pending/o/inner/example.ndjson | s3://bucket-name/**/**/*.ndjson | Yes. Matches .ndjson files at any depth, even with multiple ** wildcards. |
pending/o/inner/example.ndjson | s3://bucket-name/pending/**/*.ndjson | Yes. Matches .ndjson files within pending and all its subdirectories. |
pending/o/inner/example.ndjson | s3://bucket-name/inner/example.ndjson | No. Doesn't match because the path includes directories which aren't part of the file's actual path. |
pending/o/inner/example.ndjson | s3://bucket-name/pending/example.ndjson | No. Doesn't match because the path includes directories which aren't part of the file's actual path. |
pending/o/inner/example.ndjson.gz | s3://bucket-name/pending/*.ndjson.gz | No. * doesn't cross directory boundaries. |
pending/o/inner/example.ndjson.gz | s3://bucket-name/other/example.ndjson.gz | No. Doesn't match because the path includes directories which aren't part of the file's actual path. |
Considerations¶
When using patterns:
- Use specific directory names or even specific file URIs to limit the scope of your search. The more specific your pattern, the narrower the search.
- Combine wildcards: you can combine ** with other patterns to match files in subdirectories selectively. For example, s3://bucket-name/**/logs/*.ndjson matches .ndjson files within any logs directory at any depth.
- Avoid unintended matches: be cautious with **, as it can match many files, which might impact performance and return partial matches.
Supported file types¶
The S3 connector supports the following file types:
File type | Accepted extensions | Compression formats supported |
---|---|---|
CSV | .csv , .csv.gz | gzip |
NDJSON | .ndjson , .ndjson.gz , .jsonl , .jsonl.gz , .json , .json.gz | gzip |
Parquet | .parquet , .parquet.gz | snappy , gzip , lzo , brotli , lz4 , zstd |
You can upload files with the .json extension, provided they follow the Newline Delimited JSON (NDJSON) format: each line must be a valid JSON object and must end with a \n character.
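For example, a .json or .ndjson file matching the schema from the earlier data source could contain lines like these (the values are purely illustrative):

{"timestamp":"2024-01-01T10:00:00","session_id":"a1b2c3","action":"click","version":"1.0.0","payload":"{}"}
{"timestamp":"2024-01-01T10:00:05","session_id":"a1b2c3","action":"view","version":"1.0.0","payload":"{}"}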
Parquet schemas use the same format as NDJSON schemas, using JSONPath syntax.
AWS permissions¶
The S3 connector requires an IAM Role with specific permissions to access objects in your Amazon S3 bucket:
- s3:GetObject
- s3:ListBucket
- s3:GetBucketNotification
- s3:PutBucketNotification
- s3:GetBucketLocation
You need to create both an access policy and a trust policy in AWS. The access policy grants the required permissions on your bucket:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:ListBucket", "s3:GetBucketNotification", "s3:PutBucketNotification", "s3:GetBucketLocation" ], "Resource": [ "arn:aws:s3:::{bucket-name}", "arn:aws:s3:::{bucket-name}/*" ] } ] }