From day one, Tinybird supported CSV as the main ingestion format. CSV is natively supported by databases and applications as an exchange format, so with a one-time development effort Tinybird could interoperate “seamlessly” with other systems.
The main problem with CSV is that there is no universally followed standard specification (RFC 4180 is only informational), just a loose set of guidelines on how to serialize data into a text string.
Those guidelines make most aspects of reading and writing CSV files application-dependent, completely breaking the promise of interoperability. To name just a few:
- Mojibake encoding.
- No standard column separator; CSV might as well stand for “Character Separated Values”.
- Multiple character escaping.
- Header might or might not be present.
- Unwanted headers anywhere when you read a CSV file exported in chunks.
- New lines anywhere.
- Headers with a different number of elements than rows.
- Rows with different numbers of elements.
- Lines ended with a separator.
- Untyped: no standard way to store nested structures or to distinguish a boolean value from a string or an integer.
- Empty value or null value?
The reality is that CSV is so wrong that its name, “Comma Separated Values”, is not even accurate.
At Tinybird we’ve had to ingest thousands of different user-defined CSV files “seamlessly”. We make a best guess on at least all of the things mentioned above, and we can tell you it is far from a trivial task, especially when you are focused on doing it at scale in real-time scenarios.
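To make the idea of a "best guess" concrete, here is a minimal sketch using the `csv.Sniffer` class from Python's standard library. It is not Tinybird's implementation; the helper name `guess_csv_layout` and the sample data are hypothetical, and the heuristics are only as good as the sample they are given.

```python
import csv
import io


def guess_csv_layout(sample: str):
    """Guess the column separator and whether the first row is a header.

    Hypothetical helper: csv.Sniffer is heuristic and can guess wrong
    (or raise csv.Error) on small or irregular samples.
    """
    sniffer = csv.Sniffer()
    # Restrict the candidates to separators commonly seen in the wild.
    dialect = sniffer.sniff(sample, delimiters=",;\t|")
    return dialect.delimiter, sniffer.has_header(sample)


def read_rows(text: str, delimiter: str):
    """Parse the whole text into a list of rows with the given separator."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# A semicolon-separated file whose values happen to contain commas.
sample = (
    "id;name;price\n"
    "1;Widget, large;9.50\n"
    "2;Gadget;12.00\n"
    "3;Gizmo;7.25\n"
)

delimiter, has_header = guess_csv_layout(sample)
rows = read_rows(sample, delimiter)

# Assuming a comma instead silently produces rows of different lengths.
naive_rows = read_rows(sample, ",")
```

Reading the same text with the wrong separator does not fail; it just yields rows with inconsistent numbers of elements, which is why the guess has to happen before parsing.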
The era of JSON data analytics
JSON is the de facto standard for data communication on the web. IoT sensors, server and security logs, real-time advertising, click-stream apps, social media and more all operate with JSON data, and that’s one of the reasons we are supporting JSON natively: from a Kafka stream or from local or remote NDJSON files (and very soon in other flavours).
As opposed to CSV, JSON is a semi-structured standard format: less ambiguous, less application-dependent in its interpretation, better at representing data structures and nested data, typed, and relatively easy to parse.
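A small sketch of the difference, using only Python's standard `csv` and `json` modules. The record and its field names are made up for illustration.

```python
import csv
import io
import json

# The same logical record in both formats (field names are illustrative).
csv_text = "id,active,score,comment\n1,true,,\n"
json_text = (
    '{"id": 1, "active": true, "score": null, '
    '"comment": "", "tags": ["a", "b"]}'
)

csv_row = next(csv.DictReader(io.StringIO(csv_text)))
json_row = json.loads(json_text)

# CSV: every value comes back as a string, so the missing score and the
# empty comment are indistinguishable, and "true" is just text.
csv_types = {key: type(value).__name__ for key, value in csv_row.items()}

# JSON: numbers, booleans, null and nested arrays keep their identity.
json_types = {key: type(value).__name__ for key, value in json_row.items()}
```

In the CSV row both `score` and `comment` are the empty string, while the JSON row keeps `score` as `None` and `comment` as `""`, and carries the nested `tags` list without any extra convention.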
There are still some very valid criticisms of JSON. Mainly, if human readability is not an issue for your use case, there are more efficient alternatives, such as Apache Avro (which we do support) or Protobuf. Schemaless JSON is also a pain, but hey, when you’ve been able to ingest tweets embedded in CSV files, everything else is easy as pie.
While CSV is far from being dead and continues to be a very common and useful exchange format, this is the era of JSON data analytics and we are ready for it.
In our quest to build a delightful developer experience, identifying the critical patterns to design the best ingestion framework for our users matters more than the nuances of parsing CSV or JSON. We want a framework that is format- and transport-agnostic.
When working with JSON in Tinybird we use the same framework we designed for CSV, with some improvements for a better ingestion experience. It is an API-centric framework, integrated into our dashboard and CLI, that lets you:
- Get the best guess for attributes and data types when creating new Data Sources.
- Stream JSON events from a Kafka topic or from NDJSON files, faster than CSV even with the overhead of repeated JSON attribute names.
- Avoid broken ingestion processes thanks to quarantine.
- Monitor and trace ingestion with service data sources.
- Handle changes of the schema on read, so you can evolve your analyses as new data comes in.
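The ideas in this list (guessing types from a sample, then quarantining rows that do not fit instead of failing the whole ingestion) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Tinybird implements them: `infer_schema`, `ingest_ndjson` and the event fields are hypothetical names.

```python
import json


def infer_schema(rows):
    """Guess an attribute -> type-name mapping from a sample of parsed rows."""
    schema = {}
    for row in rows:
        for key, value in row.items():
            if value is None:
                continue  # a null says nothing about the attribute's type
            schema.setdefault(key, type(value).__name__)
    return schema


def ingest_ndjson(lines, schema):
    """Split NDJSON lines into accepted rows and quarantined ones.

    A malformed or mistyped line is set aside with an error message
    rather than aborting the whole ingestion.
    """
    accepted, quarantine = [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            quarantine.append({"line": number, "raw": line, "error": str(exc)})
            continue
        if not isinstance(row, dict):
            quarantine.append({"line": number, "raw": line, "error": "not an object"})
            continue
        mismatched = [
            key for key, expected in schema.items()
            if row.get(key) is not None and type(row[key]).__name__ != expected
        ]
        if mismatched:
            error = "type mismatch in " + ", ".join(mismatched)
            quarantine.append({"line": number, "raw": line, "error": error})
        else:
            accepted.append(row)  # attributes unknown to the schema are kept
    return accepted, quarantine


lines = [
    '{"ts": "2021-06-01T10:00:00Z", "user_id": 1, "event": "click"}',
    '{"ts": "2021-06-01T10:00:01Z", "user_id": 2, "event": "view", "country": "ES"}',
    '{"ts": "2021-06-01T10:00:02Z", "user_id": "oops", "event": "click"}',
    '{"ts": "2021-06-01T10:00:03Z", "user_id": 3, "event": ',
]

# Guess the schema from the first two events, then ingest everything.
schema = infer_schema(json.loads(line) for line in lines[:2])
accepted, quarantine = ingest_ndjson(lines, schema)
```

With this sample, the first two events are accepted, while the event with a string `user_id` and the truncated line end up in `quarantine` along with the reason, so they can be inspected and fixed later.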
While we always challenge our assumptions, this framework guides the way we ingest data at Tinybird, with a deep focus on simplicity, speed and developer experience.
What are your main challenges when dealing with large quantities of data? Tell us about them or sign up to Tinybird and get started on solving them right away.