When you start a digital product you usually put your data in a database. It does not matter if it is a simple text file, an Excel spreadsheet or a managed Postgres instance in the cloud: your data always lives somewhere. Data has been a key component of any digital product for a long time now, but it hasn't been until recently that we've started to devote significant resources to aggregating it and making use of it. Even if you think you are devoting enough energy to data within your organization, you most probably aren't.
If the product has minimal complexity, almost any decent database system might do the job. Developers tend to pick the easiest one to start with, and don't pay too much attention to whether it is the right database management system (DBMS) for the job. Fair enough: worrying about performance or other features at this point is secondary; validating the business model as fast as possible is priority number one.
The basics
The developer opens the IDE, picks the framework of their choice (with a pretty good ORM and schema migration system), runs everything on the same server to keep costs as low as possible, and everything goes smoothly.
That database is usually Postgres, MongoDB, MySQL or something else. Tons of developers start with MongoDB these days: you don't need to set up schemas beforehand, you just send data and it's stored, and schema changes aren't a problem either. But nothing comes for free, and all that flexibility turns into a problem after some time; after all, quoting Brandur, "Schemaless databases are not simple; they're easy".
"Schemaless databases are not simple; they’re easy"
Well, hopefully, the business is running but your data architecture is not mature yet. For example:
- What happens if the machine breaks? Do you have backups? (And if so, do they work? Are you testing that they do? See the sketch after this list.)
- When you look at the data, do you impact your production server? Can a query break the service?
- What if you need your business to run 24/7? Do you have the right architecture for high availability?
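On the backups question, here is a minimal sketch of the idea that a backup only counts if you restore it and check it. It uses SQLite purely because it ships with Python; the table and file names are made up for the example.

```python
import sqlite3

# Stand-in for the production database (hypothetical schema).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
source.execute("INSERT INTO users (email) VALUES ('ada@example.com')")
source.commit()

# Online copy of the live database (sqlite3.Connection.backup, Python 3.7+).
backup = sqlite3.connect("app-backup.db")
source.backup(backup)

# Sanity-check the copy, not the original: can we actually read the data back?
(count,) = backup.execute("SELECT count(*) FROM users").fetchone()
assert count == 1, "backup is missing data; fix it before you need it"

backup.close()
source.close()
```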
Things start to get harder
One day you look at the server and application metrics (most likely because a client is complaining about your application being "slow as hell") and the server is almost at 100% load.
The first thing you think about is setting up another server or upgrading the current one. That usually is not the best option: there is always something to do (at the beginning at least) that gives you a performance improvement of one order of magnitude:
- Adding some indices. There is always an index you forgot to add; you just need to review your slow queries log (see the sketch after this list).
- Removing unnecessary queries, or replacing several queries with a single one using a join.
- Doing some database cleaning: clustering tables, removing data you no longer need...
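As an illustration of the first two points, here is a minimal sketch that adds a missing index, checks that the query planner actually uses it, and then replaces an N+1 pattern with one join. The tables and columns are made up and it uses SQLite so it runs anywhere Python does; the same idea applies with EXPLAIN in Postgres or MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         user_id INTEGER NOT NULL,
                         amount  REAL    NOT NULL);
""")

# Before: filtering orders by user_id scans the whole table.
plan = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE user_id = 42").fetchall()
print(plan)  # ... SCAN orders

# The index you forgot to add.
conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")

# After: the planner searches the index instead of scanning the table.
plan = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE user_id = 42").fetchall()
print(plan)  # ... SEARCH orders USING INDEX idx_orders_user_id

# And one query with a join instead of N+1 queries from the application.
rows = conn.execute("""
    SELECT users.email, SUM(orders.amount)
    FROM users JOIN orders ON orders.user_id = users.id
    GROUP BY users.id
""").fetchall()
```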
All that will help, but there comes a point when you need to scale up your application (even if the database is fine). You set up a load balancer with many application servers throwing queries at a single database server.
After some time you start to see some weird things:
- Some columns have no data when they should.
- Some rows should have related data in other tables/collections but they do not.
- There are repeated rows where there should be just one.
Those are the most frequent data quality problems, caused most of the time by race conditions and application crashes (unhandled exceptions, for example).
There are two powerful tools that can save you hours of debugging your code and, of course, avoid some frustration for your customers (these issues often become bad customer-facing bugs):
- Transactions: all mature web frameworks have a way to use transactions, but they are not used by default. Transactions are the way to keep data consistent when many processes read and write at the same time, and every DBMS supports them. If you don't use them you will end up implementing something similar at the application level (and you will probably do it much worse, since database vendors have been working on that problem for years). Obviously there are large systems where you need to sacrifice consistency (from the CAP theorem), but you do that for a good reason.
- Constraints: foreign keys, column types and validations are key to having a healthy database. They prevent you from having hundreds of hacks spread over your codebase to check if column != null. Both tools are shown in the sketch below.
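A minimal sketch of both ideas, with made-up tables and using SQLite only because it ships with Python: constraints reject bad rows at the database level, and a transaction makes related writes succeed or fail together instead of leaving half the data behind after a crash.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this per connection

conn.executescript("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE            -- no more "is it null?" hacks
    );
    CREATE TABLE invoices (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        amount  REAL    NOT NULL CHECK (amount >= 0)
    );
""")

def create_user_with_first_invoice(email, amount):
    # "with conn:" wraps the block in a transaction: both INSERTs are
    # committed together, or rolled back together if anything raises.
    with conn:
        cur = conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
        conn.execute(
            "INSERT INTO invoices (user_id, amount) VALUES (?, ?)",
            (cur.lastrowid, amount),
        )

create_user_with_first_invoice("ada@example.com", 10.0)

# A constraint violation rolls the whole thing back: no orphan user is left.
try:
    create_user_with_first_invoice("grace@example.com", -5.0)
except sqlite3.IntegrityError:
    pass

print(conn.execute("SELECT count(*) FROM users").fetchone())  # (1,)
```

The failed call leaves nothing half-written behind, which is exactly the protection against race conditions and crashes described above.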
If you are interested in this topic, you should read this article on how to make robust applications.
These are not the only problems: database migrations without downtime, index creation without locking the whole system, more than one developer changing the schema at the same time (you should probably have a different kind of PR for data changes).
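One concrete example of the index-locking problem: Postgres can build an index without blocking writes using CREATE INDEX CONCURRENTLY, but that statement cannot run inside a transaction block, which is exactly the kind of detail migrations tend to trip over. A minimal sketch with psycopg2; the connection string, table and index names are placeholders.

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
# CREATE INDEX CONCURRENTLY refuses to run inside a transaction block,
# so switch the connection to autocommit first.
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_user_id "
        "ON orders (user_id)"
    )

conn.close()
```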
The consistency problem
Your application does not live alone, it's pretty likely you have to integrate it with a billing system, an online marketing tool (well, you are lucky if it’s just one) and so on. That leads to one of the hardest problems to solve: data inconsistency.
For example, banks have "reconciliation processes" that from time to time check that different data systems hold consistent data. You must have a strategy to integrate third-party tools (if your company grows, even internal tools will be 3rd party ones) and:
- Keep them in sync.
- Avoid being tied to their data schema or API.
- Keep working when those services are down (at least your core system); see the sketch after this list.
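A minimal sketch of that idea, where every class and function name is invented for the example: the application talks to a thin adapter that owns the mapping to the third party's schema, and failed deliveries go to an outbox that is retried later, so the core system keeps working while the service is down.

```python
import json
from collections import deque

class MarketingToolClient:
    """Hypothetical third-party client; in real life this is an HTTP call."""
    def upsert_contact(self, payload: dict) -> None:
        raise ConnectionError("marketing tool is down")

def to_marketing_contact(user: dict) -> dict:
    # The only place in the codebase that knows the third party's schema.
    return {"email": user["email"], "first_name": user.get("name", "")}

class MarketingSync:
    def __init__(self, client):
        self.client = client
        self.outbox = deque()  # pending payloads, retried after an outage

    def user_updated(self, user: dict) -> None:
        payload = to_marketing_contact(user)
        try:
            self.client.upsert_contact(payload)
        except ConnectionError:
            # The core system does not fail; the delivery is retried later.
            self.outbox.append(json.dumps(payload))

    def retry_pending(self) -> None:
        while self.outbox:
            payload = json.loads(self.outbox[0])
            self.client.upsert_contact(payload)  # raises again if still down
            self.outbox.popleft()

sync = MarketingSync(MarketingToolClient())
sync.user_updated({"email": "ada@example.com", "name": "Ada"})
print(len(sync.outbox))  # 1: queued, not lost, and the app kept working
```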
In other words, you will need something to deal with the "our email marketing tool is bouncing 10% of our emails" or "hey, we are sending spam to users who asked us to stop because of something called GDPR" moments.
"We need intelligence"
Sooner or later someone will need reports. It could be something easy like "number of registered users", "specific feature usage during the last couple of months" or "how much money did we make last week", or something fancier like "we want to predict the number of sales next month and we need training data".
The easiest way to solve this is to set up a BI tool and link it to your database. Although that will give you some quick wins, it can lead to problems if someone starts to run crazy queries and takes down your production database.
There are better ways to do this without devoting a massive amount of time and resources:
- Putting in place systems to avoid long and dangerous queries. Most database systems have quotas and timeouts for this; use them for analytics users (see the sketch after this list).
- Having a secondary or follower replica of the main database. Asynchronous replication is one of the best tools for the job: analysts query the replica, not production.
- Dumping the data from the previous day so other teams can load it into their systems. This approach needs a way to track what you export; otherwise you end up with multiple export processes for the same data. There is also a real-time problem: although data from the previous day is good enough for most cases, some need near real-time data. There is a nice write-up from Uber about how they moved from daily snapshots to one every 30 minutes.
Most large organizations work like this, and many SaaS services use this system to keep data in sync (banks used to use CSV files on FTP servers; SaaS products use CSV files in S3 buckets nowadays).
- Having a secondary system tailored just for analysis. Analytics databases have different properties, better suited to analyzing data (speed, analytical features). They lack most ACID guarantees, but you don't need those here: you want to focus on speed, to provide analytical feedback as soon as possible. This is the most advanced option and it comes with the "consistency" problem, but when you have a large amount of data there is no other way (well, maybe there is, but it's too expensive).
This normally ends up in something called a data lake, where all the data is eventually dumped so it can be used by any other system.
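On the first option, a minimal sketch of what a "quota" can look like in Postgres: a read-only role whose queries are cancelled after 30 seconds. The role name, password and connection string are placeholders.

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
conn.autocommit = True

with conn.cursor() as cur:
    # A dedicated, read-only role for analysts (placeholder credentials).
    cur.execute("CREATE ROLE analytics_ro LOGIN PASSWORD 'change-me'")
    cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA public TO analytics_ro")
    # Any query from this role running longer than 30 seconds is cancelled,
    # so an exploratory query cannot take down production.
    cur.execute("ALTER ROLE analytics_ro SET statement_timeout = '30s'")

conn.close()
```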
This is a mess
You have your system in place, new people join the company and start creating new apps and analyzing new things. This is when having a read-only user to access the operational database is not enough: people don't know what the heck that column is for, or how to get the number of active users during the last 15 days.
Having your data documented is what they call a "data catalog". Here you usually find (see the sketch after this list):
- Schema definition
- Technical and business information
- Data properties, for example, does a column contain sensitive information?
- Versioning
- Information source and tracking
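A minimal sketch of what one catalog entry could look like, with made-up fields and names; the real value is not the format but having every system produce entries like this automatically.

```python
from dataclasses import dataclass, field

@dataclass
class ColumnDoc:
    name: str
    type: str
    description: str
    sensitive: bool = False  # does it contain personal or sensitive data?

@dataclass
class DatasetDoc:
    name: str
    owner: str        # the team that answers questions about this dataset
    source: str       # where the data comes from (lineage)
    version: int
    columns: list = field(default_factory=list)

# Hypothetical entry for a "users" dataset.
users_doc = DatasetDoc(
    name="users",
    owner="growth-team",
    source="app database, replicated nightly",
    version=3,
    columns=[
        ColumnDoc("id", "integer", "primary key"),
        ColumnDoc("email", "text", "login email", sensitive=True),
        ColumnDoc("created_at", "timestamp", "signup time, UTC"),
    ],
)
```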
The challenge usually is not in having a data catalog; the challenge is in keeping it up to date. If you are thinking about keeping it in sync manually (or half manually), forget about that. Instead, think about how different systems can report the data they hold. This is one of the challenges of mid-size to large tech companies: eventually they are made of several departments building applications in parallel, so you need to provide them with internal tools or a framework to report what data they have and a way to expose it. If not, every team will use their own tools and data silos will appear.
Needless to say, it's way better to expose the data as an API than to give access to the raw data (or a derivative of it). That's why you create data teams, isn't it?
Data leaks
The words you never want to hear: "we have a data leak". Most of the time, data leaks come from the inside, not from bad hackers. That's why you use systems to control access to your data (who, when and what). This is called data governance.
In other words, not everybody should have access to your clients' information, those who do should be granted access by someone, and every access should be logged. That's why exposing aggregated and/or tokenized data through APIs is usually much better than giving access to the raw data.
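A minimal sketch of both ideas, with every name invented for the example: consumers get a stable token instead of the raw email, and every access is appended to an audit log so you can answer who looked at what, and when.

```python
import hashlib
import hmac
import json
import time

SECRET_KEY = b"rotate-me"  # placeholder; keep the real key out of the codebase

def tokenize_email(email: str) -> str:
    # Stable, non-reversible token: good enough to join datasets,
    # useless to whoever walks away with the export.
    return hmac.new(SECRET_KEY, email.lower().encode(), hashlib.sha256).hexdigest()

AUDIT_LOG = []

def get_client_record(requester: str, client: dict) -> dict:
    # Who accessed what, and when: the minimum any governance process needs.
    AUDIT_LOG.append({"who": requester, "what": client["id"], "when": time.time()})
    return {"id": client["id"], "email_token": tokenize_email(client["email"])}

record = get_client_record("analyst@company.com", {"id": 1, "email": "Ada@example.com"})
print(json.dumps(record, indent=2))
```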
How Tinybird is related to all of this
We have been working with data-intensive products and organizations for more than a decade now, and we have experienced all the problems described in this blog post. We have also experienced what it means to have a functional data architecture and a good framework to act based on data. What previously required thousands of dollars and very specific talent is now becoming cheaper and easier to implement. Tinybird was born to help organizations deliver and exploit real-time analytics, so we would love to know where you are in the process and to help you with it.