Apr 08, 2025

Best practices for downsampling billions of rows of data

Data downsampling can be an effective way to reduce compute resources, but it comes with tradeoffs.
Paco GonzĂĄlez
Data Engineer

For the last year, I've been working with a customer of Tinybird, one of the largest A/B testing platforms in the world. In a single day, this company processes almost a petabyte of data running A/B test analysis for its customers.

A/B testing can be very computationally expensive, especially when you're dealing with large data volumes, as this company is, and even more so considering their rigorous requirements for correctness, precision, and data freshness/real-time analytics.

Recently, I've been working with them to evaluate downsampling as a strategy to reduce query scan sizes and improve overall speed and cost-effectiveness of their A/B testing platform. Downsampling is an intriguing strategy; in theory, you can reduce compute resources by an order of magnitude, but you also need to approach it in a way that retains statistical rigor. This is especially important in A/B testing, but it applies to many data analysis problems.

Here, I want to share my experience and guidance on how to approach downsampling for very large data sets. To do that, I'll walk through a practical example based on a real Tinybird customer use case (>40B rows of data) to demonstrate several best practices for downsampling massive data sets to ease compute requirements without a loss in statistical significance.

Understanding downsampling

What is downsampling?

In simple terms, downsampling is just transforming raw data (often time series data) into a more compact form while still retaining the defining characteristics of the data. You keep only a fraction of the raw data, selected randomly (ish; more on this later), and discard the rest.

Why downsample?

Having less data in your sample means that you will process less data in your analysis, which makes the task lighter and cheaper. Good downsampling algorithms can reduce data processing costs by an order of magnitude.

However, it’s not all positive (of course, there must be downsides). Generally, when you are trying to estimate metrics from a sample, having fewer data points means a less precise estimation. By discarding data, you may miss valuable information that you would otherwise have captured, like a new trend among users from a certain country that wasn't big enough to register in your system because of the random discarding. In general terms, downsampling can keep your system from functioning at its full potential.

How do you downsample?

There are multiple ways to sample the data. The simplest algorithm isn't really an algorithm at all: you discard rows completely at random. More advanced strategies discard data based on known parameters of your data. For example, instead of discarding half of the data at random from the full population, you discard half of the data from each country, making sure that every country retains some degree of representation. More complex approaches may use machine learning or other optimization methods to choose an appropriate sample.
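
To make the stratified idea concrete, here is a minimal ClickHouse SQL sketch (the events table and country column are hypothetical, not from the use case discussed later) that keeps roughly half of the users from every country rather than half of the table overall:

    -- Hypothetical sketch: keep ~half of the users per country, so even small
    -- countries stay represented. Window functions need ClickHouse 21.x or later.
    SELECT *
    FROM events
    WHERE user_id IN (
        SELECT user_id
        FROM (
            SELECT
                country,
                user_id,
                row_number() OVER (PARTITION BY country ORDER BY cityHash64(user_id)) AS rn
            FROM (SELECT DISTINCT country, user_id FROM events)
        )
        WHERE rn % 2 = 0  -- keep every other user within each country
    )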

The art of downsampling

I've introduced the trade-off that you must balance with downsampling (no post shall be without its trade-off on my watch): precision vs. performance. Do you want your analysis done dirt cheap in real time (milliseconds), or 100% perfect in a few days?

As with most things in life, the answer lies somewhere in between. The “art” of downsampling exists in finding the sweet spot for your needs, where you are getting the most out of your resources and your data.

The point that I am trying to hammer home here is that generally, when we are talking about precision, there is a “good enough” point. A threshold. An asymptote.

That threshold will depend on many obscure factors, such as the underlying distribution and dimensionality of your data, the existence of majority classes, and other factors (which may change over time). But the threshold exists, and what's better, you can usually find it (estimate it) through experimentation and optimization. 

Maybe you do not need pinpoint-accurate video recommendations on your platform every single time, or maybe you do need to be really precise in order to quantify the effects of changes to your website.

Downsampling is not easy to implement at scale, but in the next sections I will talk about how we went about evaluating and implementing it for a real-world use case.

Get to know your data

Downsampling is all about knowing how your data is distributed, so that you can look at the fewest, most important data points. This allows you to get the benefits of downsampling without losing the insights that you need. For this, you need to figure out a strategy to sample your data in a way that allows you to squeeze the most information out of it.

At this point, you have exchanged a difficult problem (how do I scale data processing?) for another: how do I sample my data?

But the benefits are worth it. Downsampling opens a new avenue of experimentation, exploration, and optimization: you can improve the system on its own (e.g., reduce its CPU utilization) independently of other work, and you can still optimize your application's code for a further reduction in resource usage. These improvements compound on each other.

How do you study the underlying distribution of your data? Well, there is a whole field of study dedicated to this problem, but I am going to give you a basic exploration loop:

  1. Get the actual target results that your system produces with the full set of data
  2. Choose a downsampling algorithm or strategy, and test it. 
  3. Estimate the target results using the downsampled data
  4. Compare the results using the full population versus the estimated results using the sample
  5. If the results are "good", keep optimizing. Maybe you didn’t sample aggressively enough and left some gains on the table. Make your sampling strategy more aggressive, and go back to 2.
  6. If the results are "bad", you pushed too far and need to resample your data. Make your sampling strategy less aggressive, and go back to 2.

After a few loops, you should be able to determine an effective downsampling strategy (or that downsampling is not statistically viable).

Practical example: Downsampling 40B rows of A/B test data with Tinybird

Recently, I ran some tests with a client concerned about scalability with Tinybird for an A/B testing use case. They use Tinybird to build real-time analytics APIs that their customers use to evaluate the results of A/B tests very quickly, so scalability and query speed are very important.

One of the options we considered (and that inspired this post) was using data sampling to reduce the amount of data processed in their queries to free up resources and reduce query time.

I'll go over a few code examples inspired by our sampling implementations, using a data schema similar to what you would use in an A/B testing setting.

Step 1: Look at your data

Start by looking at the data distribution. In A/B testing, you can think of the data as a sequence of timestamped actions that must go together (action1, action2, action3, etc.).

Consider a classic A/B test: changing the color of the primary call-to-action (CTA) button on your homepage. You want to track, for each color variant, how many people viewed the page (action 1) and then clicked the button (action 2). Maybe you even want to track how many people hovered over the button before clicking ("action 1.5").

When downsampling, it's important to retain the sequence. If you discard action 1 for a user session but retain action 2 for the same session, you would significantly alter the chain of actions and thus poison the analysis. This is something you must avoid at all costs!

You need to be smarter with your sampling algorithm. In A/B tests, for example, preserving user-level data is critical, since you are most likely going to want to see which users completed a chain of actions. 

For our customer, instead of sampling at the event level, we sampled at the user level to preserve these sequences. Ideally, we wanted a reproducible way to determine whether a user was discarded or not, while still being random (or "sufficiently close to random"). For our use case, we hashed each user ID, split the hashes into 100 buckets using the modulus (the remainder after dividing), and kept a certain number of buckets: cityHash64(user_id) % 100.

Consider a data source called ab_test_events with the following columns:

  • timestamp: DateTime (event timestamp)
  • user_id: String (or UUID)
  • variant: String ('A' or 'B')
  • event_type: String (e.g., 'button_click', 'page_view', 'purchase')
  • ... other relevant event properties
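
Given that schema, the sampling filter looks roughly like this (a minimal sketch assuming a 10% sample, which the notes below explain piece by piece):

    -- Keep only the events of users whose hash lands in the first 10 of 100 buckets (~10% of users)
    SELECT *
    FROM ab_test_events
    WHERE cityHash64(user_id) % 100 < 10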

Some notes:

  • cityHash64(user_id): This computes a hash value from the user_id column. Why a hash? Hashes A) produce a pseudo-random result, and B) have consistent input/outputs (the same input always produces the same output). 
  • % 100: This is the modulo operator. It returns the remainder after dividing the hash value by 100, essentially, the two least significant decimal digits of the number.
  • WHERE cityHash64(user_id) % 100 < 10: This checks if the remainder is less than 10. Since the result of the hash is a random value, and we use the 2 least significant figures, this will sample approximately 10% of the user_id values, ensuring that all events for a given user are either kept or discarded.

Important: Use a good hash function. You want the distribution of the hash values to be as even as possible (i.e., the hash set they produce for your data should be as close as possible to a random uniform distribution) to avoid introducing any bias when sampling.

This resulted in an even distribution across all user IDs, so we deemed the strategy viable. This achieved the original purpose, reducing the amount of data to process, while maximizing the usefulness of the data we chose to keep.

Step 2: Downsample your data

Then, we set up a new table to store the downsampled data and created a materialized view to calculate the user bucket on data ingestion and push the downsampled data to the new table:
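
In ClickHouse SQL terms, that materialized view looks roughly like this (a minimal sketch; in Tinybird this is expressed as a materialized pipe, and the view name is illustrative):

    -- Filter at ingestion time: only events from users whose hash falls in the
    -- first 10 of 100 buckets are written to the downsampled table.
    CREATE MATERIALIZED VIEW ab_test_events_downsampled_mv
    TO ab_test_events_downsampled
    AS
    SELECT *
    FROM ab_test_events
    WHERE cityHash64(user_id) % 100 < 10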

This gave us two tables: the original ab_test_events with all our raw data (in case we wanted to resample), and ab_test_events_downsampled to hold the 10% sample. To improve performance, we simply direct queries at the downsampled table to reduce our scan sizes and CPU usage.

Note
If you're familiar with Tinybird and ClickHouse (the database that Tinybird uses under the hood), you may be wondering why we didn't just use the SAMPLE clause available. The reason is simple: we achieved significantly better performance results using our strategy (this may change in the future, but it was our experience on ClickHouse <25).

Step 3: Calculate your “true” target values

For this part, I can't share the full query, but rest assured, it’s not a toy example by any means. The target query produces a collection of numeric results that describe the distribution (~100 numbers, things like mean, variance, count, median, etc.).

I will use a simplified query representing an A/B test use case: calculating the conversion rate. The query calculates unique visitors that performed a certain action (in this case, purchased something) and divides by the total number of visitors. Results are grouped by variant of the site, so we can measure which version of our site has the highest conversion rate.
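
A sketch of that simplified query over the ab_test_events schema from above (the real target query computes ~100 such metrics):

    SELECT
        variant,
        uniqCombined64If(user_id, event_type = 'purchase') AS purchasing_users,  -- unique users who purchased
        uniqCombined64(user_id) AS total_users,                                  -- unique users overall
        purchasing_users / total_users AS conversion_rate
    FROM ab_test_events
    GROUP BY variant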

This query selects from the raw table, giving us a true measure of conversion rate against the entire population. We run the query ad hoc, so it doesn't matter how long it takes.

Note
For those less familiar with the ClickHouse SQL dialect, uniqCombined64(user_id) is an approximate method for calculating COUNT(DISTINCT column), and the -If suffix in uniqCombined64If(user_id, event_type = 'purchase') works as a filter, similar to COUNT(DISTINCT CASE WHEN event_type = 'purchase' THEN user_id END). Read this post on counting billions of unique IDs for more insight.

Step 4: Calculate your “estimated” target values

Now, for the downsampled form of this target query, we simply swap the table in the FROM clause:
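
Continuing the sketch above, only the FROM changes:

    SELECT
        variant,
        uniqCombined64If(user_id, event_type = 'purchase') AS purchasing_users,
        uniqCombined64(user_id) AS total_users,
        purchasing_users / total_users AS conversion_rate
    FROM ab_test_events_downsampled  -- the 10% user sample instead of the full table
    GROUP BY variant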

As this queries the downsampled table, we are making a trade-off: we process 10x less data and "pay" with precision (i.e., we give up some statistical power to achieve faster query times).

The important point here (and perhaps the most important point of this post) is this: While downsampling may lose us the ability to detect very small changes over the long run, it gives us the ability to detect bigger changes much earlier, and we can make critical decisions faster (e.g. if a variant is severely underperforming on our key metrics, we can raise the alarm earlier). This may seem trivial, but consider that our customer processes A/B test data for some very large businesses with millions of daily website visitors, so every second counts.

Step 5: Evaluate your estimations

In our real-world use case, I had to evaluate downsampling performance across ~100 dimensions, so I used the average relative difference (i.e., the difference normalized by the absolute value of the true result) across all dimensions as the key metric to determine the estimation quality when using downsampling. I did, of course, also use some good ol' fashioned eyeballing of the numbers, as some of these metrics (like quantiles) are not easily compared numerically.
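
To make that concrete with the simplified conversion-rate example, a comparison could look something like the sketch below (assuming the two tables defined earlier; the real evaluation covered ~100 metrics rather than one):

    WITH
        actual AS (
            SELECT variant,
                   uniqCombined64If(user_id, event_type = 'purchase')
                       / uniqCombined64(user_id) AS conversion_rate
            FROM ab_test_events
            GROUP BY variant
        ),
        estimated AS (
            SELECT variant,
                   uniqCombined64If(user_id, event_type = 'purchase')
                       / uniqCombined64(user_id) AS conversion_rate
            FROM ab_test_events_downsampled
            GROUP BY variant
        )
    -- Average relative difference between the downsampled estimate and the true value
    SELECT avg(abs(e.conversion_rate - a.conversion_rate) / abs(a.conversion_rate)) AS avg_relative_diff
    FROM actual AS a
    INNER JOIN estimated AS e USING (variant)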

In my case, however, I just needed to evaluate if downsampling showed enough promise to build it into the real-time system. For a more rigorous comparison, we might treat this as an A/A test (i.e., we look for statistically significant differences between the downsampled estimations and the actual values, and, ideally, find none).

In our case, downsampling by keeping data from only a randomly selected 10% of users, we found that:

  • Query times improved by 4x (that is, the query on sampled data ran in 1/4th of the original time, using the same infrastructure), and scanned ~10x less data overall (as expected from the sampling rate).
  • The results showed an average relative absolute error of 3%; that is, on average, the estimates deviated from the original values by 3%.

For reference (because downsampling 10% is completely different if you have 10 rows vs 10 billion rows), the original query scanned over 40 billion rows (like I said, not a toy example). Also, keep in mind that the downsampling did not affect the whole query, and the improvement in response time does not linearly follow the reduction in rows because there is other overhead not affected by the downsampling.

To put this into practice, we repeated this basic evaluation loop until we were satisfied with the results. In our case, based on the results we've seen, we could have likely further decreased the sampling rate and still achieved a significant result. For the next iteration, I might go to 5% or 2.5% of the original data.

Considerations and best practices for downsampling

  • The data: Pay attention to the distribution of your data. If the data is highly imbalanced or skewed, you may need to improve your sampling strategy or develop a more precise algorithm. Fortunately for Tinybird users, our latest Tinybird Forward release makes it easier than ever to iterate and experiment with data projects.
  • Precision goals: Set your precision goal, and benchmark your downsampling strategy to make sure you reach the required precision level.
  • Choosing your sampling rate: This is a critical decision, and there's no one-size-fits-all answer. You'll need to experiment and validate your results. As a suggestion, you can start with a very aggressive rate (like 1-2%), and then test if you still see stable, trustworthy insights. Keep in mind that the goal here is to push the sampling rate as far as possible without making the results so imprecise that they are no longer useful.
  • Test different sampling rates: evaluate the impact of different sampling rates on the accuracy of your key metrics (i.e., the important outputs of your system).
  • Test continuously: Don’t just run a single test and trust the results; ideally, test with as many different cases as possible and continue to run ad hoc tests against the whole population for validation. Your data may have different distributions depending on multiple factors, even changing depending on the day of the week or other seasonality factors.
  • Retain the full population: Even after you have implemented downsampling, it’s a good idea to keep the full dataset stored as a backup and run sanity checks often. This means re-running the tests that validated your strategy in the first place against the full data, in order to check that the results still hold true. An obvious follow-up is that you can automate parts of the downsampling process (or the whole thing!).

Build real-time analytics APIs with Tinybird

Random data downsampling, when applied strategically, is a powerful tool for scaling data analytics workloads such as A/B testing, as we've shown with some code examples. Tinybird makes it easy to implement and optimize.

Ready to build faster, more cost-effective data analytics APIs? Sign up for a free Tinybird account today.
