S3 connector

You can set up an S3 connector to load your CSV, NDJSON, or Parquet files into Tinybird from any S3 bucket. Tinybird detects new files in your buckets and ingests them automatically.

Setting up the S3 connector requires creating and enabling a data source and its connection as separate files.

Supported file types

The S3 connector supports the following file types:

| File type | Accepted extensions | Compression formats supported |
| --- | --- | --- |
| CSV | .csv, .csv.gz | gzip |
| NDJSON | .ndjson, .ndjson.gz, .jsonl, .jsonl.gz, .json, .json.gz | gzip |
| Parquet | .parquet, .parquet.gz | snappy, gzip, lzo, brotli, lz4, zstd |

You can upload files with the .json extension, provided they follow the Newline Delimited JSON (NDJSON) format: each line must be a valid JSON object, and every line must end with a \n character.
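The NDJSON constraint above can be checked with a few lines of Python (an illustrative sketch, not part of the Tinybird tooling):

```python
import json

def is_valid_ndjson(text: str) -> bool:
    # Every line must be a standalone JSON object, and the file
    # must end with a trailing newline.
    if not text.endswith("\n"):
        return False
    for line in text.splitlines():
        try:
            if not isinstance(json.loads(line), dict):
                return False
        except json.JSONDecodeError:
            return False
    return True

sample = '{"timestamp": "2024-01-01T00:00:00Z", "action": "view"}\n'
print(is_valid_ndjson(sample))  # True
```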

Parquet schemas use the same format as NDJSON schemas, using JSONPath syntax.
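Because Parquet columns map through the same JSONPath syntax, a schema for a Parquet file with two columns might look like the following (a sketch; the column names id and amount are illustrative):

```
SCHEMA >
    `id` Int64 `json:$.id`,
    `amount` Float64 `json:$.amount`
```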

Set up the connector

To set up the S3 connector, follow these steps.

Prerequisites

The S3 connector requires permissions to access objects in your Amazon S3 bucket. The IAM Role needs the following permissions:

  • s3:GetObject
  • s3:ListBucket
  • s3:GetBucketNotification
  • s3:PutBucketNotification
  • s3:GetBucketLocation
  • s3:PutObject
  • s3:PutObjectAcl
1. Apply the required permissions

To connect to your S3 bucket, set up the required permissions in AWS. See Prerequisites.

The following is an example of an AWS access policy:

{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "s3:GetObject",
          "s3:ListBucket",
          "s3:GetBucketNotification",
          "s3:PutBucketNotification",
          "s3:GetBucketLocation",
          "s3:PutObject",
          "s3:PutObjectAcl"
        ],
        "Resource": [
          "arn:aws:s3:::{bucket-name}",
          "arn:aws:s3:::{bucket-name}/*"
        ]
      }
    ]
}
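The role also needs a trust policy so it can be assumed on Tinybird's behalf. The following is a sketch of the usual shape of such a policy; {tinybird-principal-arn} and {external-id} are placeholders for the values provided during connection setup:

```json
{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "AWS": "{tinybird-principal-arn}"
        },
        "Action": "sts:AssumeRole",
        "Condition": {
          "StringEquals": {
            "sts:ExternalId": "{external-id}"
          }
        }
      }
    ]
}
```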
2. Create an S3 connection

Before you create and configure an S3 data source, you need to set up a connection. Create a connection file with the required credentials stored in secrets. For example:

s3sample.connection
TYPE s3
S3_REGION "<S3_REGION>"
S3_ARN "<IAM_ROLE_ARN>"

You can obtain the IAM role ARN after applying the required permissions. See Apply the required permissions.

See Settings for a list of S3 connection settings.

3. Create an S3 data source

Create a .datasource file using tb create --prompt or manually.

The .datasource file must contain the desired schema and the required settings for S3, including the S3_CONNECTION_NAME setting, which must match the name of the .connection file you created in the previous step.

For example:

s3sample.datasource
DESCRIPTION >
    Analytics events landing data source

SCHEMA >
    `timestamp` DateTime `json:$.timestamp`,
    `session_id` String `json:$.session_id`,
    `action` LowCardinality(String) `json:$.action`,
    `version` LowCardinality(String) `json:$.version`,
    `payload` String `json:$.payload`

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp"
ENGINE_TTL "timestamp + toIntervalDay(60)"

IMPORT_CONNECTION_NAME s3sample
IMPORT_BUCKET_URI s3://my-bucket/*.csv
IMPORT_SCHEDULE @auto
IMPORT_STRATEGY APPEND
4. Test and deploy

After you've defined your S3 data source and connection, you can test and deploy the changes as usual. See Test and deploy.

To check if the connection is active, run tb connection ls.

.datasource settings

The S3 connector uses the following settings in .datasource files:

| Instruction | Required | Description |
| --- | --- | --- |
| IMPORT_CONNECTION_NAME | Yes | Name given to the connection inside Tinybird. For example, 'my_connection'. |
| IMPORT_STRATEGY | Yes | Strategy to use when inserting data. Use APPEND for S3 connections. |
| IMPORT_BUCKET_URI | Yes | Full bucket path, including the s3:// protocol, bucket name, object path, and an optional pattern to match against object keys. For example, s3://my-bucket/my-path discovers all files in the bucket my-bucket under the prefix /my-path. You can use patterns in the path to filter objects. For example, ending the path with *.csv matches all objects that end with the .csv suffix. |
| IMPORT_FROM_DATETIME | No | Sets the date and time from which to start ingesting files on an S3 bucket. The format is YYYY-MM-DDTHH:MM:SSZ. |
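For example, to ingest only files created on or after a given instant, add IMPORT_FROM_DATETIME to the import settings (an illustrative fragment; the connection name and bucket URI are placeholders):

```
IMPORT_CONNECTION_NAME s3sample
IMPORT_BUCKET_URI s3://my-bucket/*.csv
IMPORT_SCHEDULE @auto
IMPORT_STRATEGY APPEND
IMPORT_FROM_DATETIME 2024-01-01T00:00:00Z
```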

.connection settings

The S3 connector uses the following settings in .connection files:

| Instruction | Required | Description |
| --- | --- | --- |
| S3_REGION | Yes | Region of the S3 bucket. |
| S3_ARN | Yes | ARN of the IAM role with the required permissions. |

S3 file URI

The S3 connector supports the following wildcard patterns:

  • Single asterisk or *: matches zero or more characters within a single directory level, excluding /. It doesn't cross directory boundaries. For example, s3://bucket-name/*.ndjson matches all .ndjson files in the root of your bucket but doesn't match files in subdirectories.
  • Double asterisk or **: matches zero or more characters across multiple directory levels, including /. It can cross directory boundaries recursively. For example: s3://bucket-name/**/*.ndjson matches all .ndjson files in the bucket, regardless of their directory depth.

Use the full S3 file URI and wildcards to select multiple files. The file extension is required to accurately match the desired files in your pattern.

Due to a limitation in Amazon S3, you can't create different S3 data sources with path expressions that collide. For example: s3://my_bucket/**/*.csv and s3://my_bucket/transactions/*.csv.

Examples

The following are examples of patterns you can use and whether they'd match the example file path:

| File path | S3 File URI | Will match? |
| --- | --- | --- |
| example.ndjson | s3://bucket-name/*.ndjson | Yes. Matches files in the root directory with the .ndjson extension. |
| example.ndjson.gz | s3://bucket-name/**/*.ndjson.gz | Yes. Recursively matches .ndjson.gz files anywhere in the bucket. |
| example.ndjson | s3://bucket-name/example.ndjson | Yes. Exact match to the file path. |
| pending/example.ndjson | s3://bucket-name/*.ndjson | No. * doesn't cross directory boundaries. |
| pending/example.ndjson | s3://bucket-name/**/*.ndjson | Yes. Recursively matches .ndjson files in any subdirectory. |
| pending/example.ndjson | s3://bucket-name/pending/example.ndjson | Yes. Exact match to the file path. |
| pending/example.ndjson | s3://bucket-name/pending/*.ndjson | Yes. Matches .ndjson files within the pending directory. |
| pending/example.ndjson | s3://bucket-name/pending/**/*.ndjson | Yes. Recursively matches .ndjson files within pending and all its subdirectories. |
| pending/example.ndjson | s3://bucket-name/**/pending/example.ndjson | Yes. Matches the exact path pending/example.ndjson after any preceding directories. |
| pending/example.ndjson | s3://bucket-name/other/example.ndjson | No. The path includes directories that aren't part of the file's actual path. |
| pending/example.ndjson.gz | s3://bucket-name/pending/*.csv.gz | No. The file extension .ndjson.gz doesn't match .csv.gz. |
| pending/o/inner/example.ndjson | s3://bucket-name/*.ndjson | No. * doesn't cross directory boundaries. |
| pending/o/inner/example.ndjson | s3://bucket-name/**/*.ndjson | Yes. Recursively matches .ndjson files anywhere in the bucket. |
| pending/o/inner/example.ndjson | s3://bucket-name/**/inner/example.ndjson | Yes. Matches the exact path inner/example.ndjson after any preceding directories. |
| pending/o/inner/example.ndjson | s3://bucket-name/**/ex*.ndjson | Yes. Recursively matches .ndjson files starting with ex at any depth. |
| pending/o/inner/example.ndjson | s3://bucket-name/**/**/*.ndjson | Yes. Matches .ndjson files at any depth, even with multiple ** wildcards. |
| pending/o/inner/example.ndjson | s3://bucket-name/pending/**/*.ndjson | Yes. Matches .ndjson files within pending and all its subdirectories. |
| pending/o/inner/example.ndjson | s3://bucket-name/inner/example.ndjson | No. The path includes directories that aren't part of the file's actual path. |
| pending/o/inner/example.ndjson | s3://bucket-name/pending/example.ndjson | No. The path includes directories that aren't part of the file's actual path. |
| pending/o/inner/example.ndjson.gz | s3://bucket-name/pending/*.ndjson.gz | No. * doesn't cross directory boundaries. |
| pending/o/inner/example.ndjson.gz | s3://bucket-name/other/example.ndjson.gz | No. The path includes directories that aren't part of the file's actual path. |

Considerations

When using patterns:

  • Use specific directory names or even specific file URIs to limit the scope of your search. The more specific your pattern, the narrower the search.
  • Combine wildcards: you can combine ** with other patterns to match files in subdirectories selectively. For example, s3://bucket-name/**/logs/*.ndjson matches .ndjson files within any logs directory at any depth.
  • Avoid unintended matches: be cautious with ** as it can match many files, which might impact performance and return partial matches.
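The * and ** semantics described above can be sketched as a translation to regular expressions (an illustrative model of the matching rules, not Tinybird's actual matcher):

```python
import re

def s3_pattern_to_regex(pattern: str) -> re.Pattern:
    # Translate the documented wildcard rules into a regex:
    #   *   matches within a single directory level (never '/')
    #   **/ matches zero or more whole directory levels
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")  # optional, so root-level files match too
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(out) + "$")

uri = s3_pattern_to_regex("s3://bucket-name/**/*.ndjson")
print(bool(uri.match("s3://bucket-name/pending/example.ndjson")))  # True
print(bool(uri.match("s3://bucket-name/example.ndjson")))          # True
```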