Mar 07, 2025

How to run load tests in real-time data systems

We have run hundreds of load tests for customers processing petabytes of data in real-time. Here's everything you need to know to plan, execute, and analyze a load test in a real-time data system.
Ana Guerrero
Data Engineer
Iago Enríquez
Data Engineer

Real-time data systems often process petabytes of data or more every day, serving requests to thousands or millions of concurrent users with the expectation of sub-second API response times. Infrastructure is provisioned to handle steady-state throughput, but increases in traffic or usage can lead to load spikes that can bring down a production system. In real-time systems, load testing becomes critical to ensure uptime even during surges.

In this resource, we share what we've learned from running hundreds of load tests at Tinybird, where our customers build real-time data APIs that serve billions of requests a day.

What is load testing?

Load testing evaluates how well an infrastructure handles expected traffic, measuring response times and stability as requests increase. In real-time data systems such as Tinybird, where low-latency API response times are critical for many customers, load testing helps predict system behavior during traffic spikes, ensures SLO compliance, and prevents unexpected downtime.

Why do you need to perform load tests?

There are some scenarios where load testing is essential:

  • Expected traffic surges during scheduled events: For example, marketing campaigns that lead to a significant increase in visits and queries, or Black Friday for e-commerce companies.
  • Creation of new endpoints: Before launching a new use case, it is crucial to validate its performance under different load levels.
  • Validation of existing infrastructure: Ensuring that the current infrastructure can handle the expected traffic and identifying potential bottlenecks.

Failing to conduct load tests can lead to significant risks and costs:

  • Risks: Service outages, high response latencies, or data loss can negatively impact user experience and product reputation.
  • Costs: Unnecessary overprovisioning increases operational costs, while underestimation can cause failures during traffic spikes.

Key considerations for planning a load test in real-time data systems

Defining the objectives and scope of a load test correctly is key, as is establishing the right metrics and indicators. It's also important to select representative examples of API calls, their distribution, and the expected volume. Making 10x more calls to an endpoint that retrieves data for the last week is not the same as making 10x more calls to the same endpoint to retrieve data for the entire last year. Understanding the distribution of API requests, as well as the type of query to be performed, is key in this context.

Another important consideration is the increase in ingestion load. This affects both the machine load and the volume of data that needs to be processed. 

Consider, for example, a scenario in Tinybird. You have an endpoint that calculates the average of some metric over the last 24 hours of data, and this endpoint directly queries the raw data source where you ingest your events. If you experience a 10x ingestion spike, your endpoint will have to read 10 times more data. This leads to higher endpoint latency and ties up I/O for longer, limiting the system's capacity to handle other requests. If the ingestion increase persists over time, your endpoints will keep reading more and more data, and the situation will progressively worsen. In the chart below, you can see an example of how an endpoint's latency increases in proportion to the increase in ingestion.

A chart showing increasing endpoint latencies as ingestion load increases.
Average latency for an endpoint during a big event. Latency increases as more data is ingested due to both higher data processing volume and increased I/O contention.

This highlights the importance of accounting for these factors when designing and running load tests.

Important variables for your load test

When performing a load test, it is generally assumed that a set of resources (machines with memory and CPU) will be available. These resources may need to be scaled up or down to support the expected load.

For example, if endpoints read large amounts of data, they can create an I/O bottleneck on the machine, reducing the number of queries that can be handled concurrently. To increase concurrency without adding resources, you would need to reduce the volume of data each query reads, and at some point you will hit a limit. Additionally, the more data that is read, the higher the latency you can expect.

Understanding how certain metrics change under peak loads or as the load increases is crucial for properly optimizing and scaling the system. The following variables are important to track during a load test:

Queries per second (QPS) 

Estimating the number of simultaneous calls to the API will help you understand how to allocate compute resources. 

Request latency 

Defining a maximum acceptable response time for end users creates a benchmark for successful or failed load tests.

Processed data volume 

Reducing the amount of data each query processes or returns will free up resources during load spikes. If large volumes of data must be processed while maintaining low latency and high concurrency, the infrastructure must be scaled accordingly, leading to additional costs.

Consider the example below. These queries return the same result, but with a marked difference in processed data and latency. The first query processes 70MB of data in 484ms. The second, which is better optimized, processes only 1KB in 3.5ms. This was achieved by pre-aggregating metrics with incremental materialized views, minimizing both the response time and the I/O resources needed to run the query. Unoptimized queries may go unnoticed at steady state, but extensive load testing will expose them. It is essential to evaluate whether optimizations like these can be applied before performing a load test to ensure proper resource allocation.

An unoptimized SQL query running a full table scan
This first query is unoptimized, aggregating with a full scan over the raw data source.
An optimized SQL query scanning a smaller, pre-aggregated data set
The optimized query uses pre-computed aggregates in a materialized view to massively reduce query-time resource use.
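The pattern looks roughly like this. This is a minimal sketch, not the exact queries from the screenshots above: the data source, materialized view, and column names are illustrative.

```sql
-- Unoptimized: aggregates at query time with a full scan over the raw events
SELECT
    product_id,
    count() AS views,
    avg(price) AS avg_price
FROM ecommerce_events
WHERE timestamp >= now() - INTERVAL 1 DAY
GROUP BY product_id

-- Optimized: reads the pre-aggregated state kept up to date by an incremental
-- materialized view (product_stats_mv), so only a handful of rows are scanned
SELECT
    product_id,
    countMerge(views_state) AS views,
    avgMerge(avg_price_state) AS avg_price
FROM product_stats_mv
WHERE date >= today() - 1
GROUP BY product_id
```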

Ingestion load increase 

It's common to generate more data during events like Black Friday, big sporting events, or a large marketing campaign. You should consider the increase in ingestion when performing the load test.

Designing the load test

The load test design process involves the following steps:

  1. Define the objective:
    - What is the purpose of the test?
    - Which endpoints will be evaluated?
    - What traffic volume is expected?
  2. Review current endpoint performance:
    - Ensure that the processed data volume and latency are reasonable.
    - If uncertain, consult an expert to validate acceptable values.
  3. Obtain test data:
    - Request distribution: It can be random, uniform, or based on specific usage patterns.
  4. Define success criteria:
    - Establish acceptable latency and stability thresholds, such as "latency < 100 ms for 99% of calls."
  5. Extract metrics and analyze the results of the test:
    - In Tinybird, for example, you can query service data sources to extract and monitor metrics. You can see an example query below.

Keep in mind that load tests are not static processes. Typically, multiple iterations are required to adjust configurations and evaluate different scenarios.

Importance of a representative sample

A crucial part of preparing a load test is selecting an appropriate sample of API calls. Endpoints typically include input parameters that allow filtering the queried data. The test sample must reflect the real distribution of requests made by end users.

For example, if an endpoint usually filters the last week of data but is now expected to extend queries to the last month, the test should account for this increase in volume. Otherwise, the test results will not accurately evaluate performance under the new scenario. 

When historical data is unavailable, estimates can be based on:

  • Parameters used in similar endpoints.
  • Expected query distribution based on the use case.

A practical load test example

To see how the considerations above are applied, let's walk through a practical load test example.

Imagine we're an electronics e-commerce company preparing for Black Friday. We've built API endpoints using Tinybird to serve real-time, personalized offers to our customers at scale. These endpoints are the backbone of our online store, enabling customers to find the products they need and discover the best deals quickly and efficiently. During our Black Friday sales, these endpoints will experience massive load increases, and we'll rely on Tinybird to allocate the necessary resources to manage that load.

Throughout the year, our endpoints experience stable traffic, but we anticipate a 10x increase in requests on Black Friday along with a 10x increase in data ingestion. We need to verify that our search and sales endpoints can manage this peak without any performance degradation.

Step 1: Identify Critical Endpoints

Before initiating any tests, we need to pinpoint the most critical endpoints for our business. In this case, we'll focus on:

  • /search: Used by customers to find products and get recommendations based on previous searches.
  • /sales: Used to display discounted products during Black Friday whenever a user accesses the page.

These endpoints will be our primary focus during the load testing.

Step 2: Gather Baseline Metrics

To effectively evaluate a service's performance during a load test, it’s key to gather baseline metrics. A load test typically sends hundreds or even thousands of requests per second over a prolonged period. By analyzing these requests, you can understand the latency distribution, identify bottlenecks, and evaluate how the service will perform under peak loads or in production environments.

The available statistics for all critical metrics include:

  • Mean: The average value of a dataset, calculated by summing all values and dividing by the number of data points. 
  • Median: The middle value when the data is ordered, with 50% of the values above and 50% below it.
  • Mode: The value that appears most frequently in the dataset.
  • Maximum: The highest value in the dataset.
  • Minimum: The lowest value in the dataset.
  • Percentiles: Divide the dataset into 100 equal parts. The 99th percentile, for example, shows the value below which 99% of the requests fall.

Which statistic should I use?

When evaluating load test results, it’s important to consider the mean, median, and percentiles. A successful test demonstrates that a service can handle most requests with latencies below a set threshold. The 99th percentile is particularly useful, as it indicates the latency under which 99% of requests fall, ensuring a good user experience when combined with a low error rate.

The mean can be significantly impacted by outliers, distorting the true performance and obscuring the response times for users affected by tail latency—slower responses that occur in the "long tail" of the distribution. These latencies are key for understanding the full range of user experiences, especially in high-traffic scenarios. In such cases, the median is a more reliable measure, as it is less influenced by extreme values.

The mode, meanwhile, helps reveal where most values are concentrated and highlights any skew in the distribution, providing a clearer picture of typical performance.

A chart showing various distributions of mean, median, and mode.
Utilizing various statistics can help you understand the performance distribution during your load test.

In Tinybird, you can easily capture these statistics by querying the pipe_stats_rt service data source, which contains real-time request logs for the API endpoints published on Tinybird's infrastructure.
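A sketch of what that query can look like follows. The 24-hour window and the category parameter are specific to this example, duration is reported in seconds, and column names may differ slightly in your workspace:

```sql
SELECT
    pipe_name,
    round(count() / dateDiff('second', min(start_datetime), max(start_datetime))) AS qps,
    round(avg(duration) * 1000) AS avg_latency_ms,
    round(quantile(0.99)(duration) * 1000) AS p99_latency_ms,
    countIf(error = 1) AS errors,
    count() AS total_requests,
    topK(3)(parameters['category']) AS top_categories
FROM tinybird.pipe_stats_rt
WHERE start_datetime >= now() - INTERVAL 1 DAY
  AND pipe_name IN ('search', 'sales')
GROUP BY pipe_name
```

Running it over the baseline period gives results like the following: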

| Pipe Name | QPS | Avg Latency (ms) | p99 Latency (ms) | Errors | Total Requests | Top Categories |
| --- | --- | --- | --- | --- | --- | --- |
| /search | 10 | 80 | 150 | 2 | 14,400 | smartphones, laptops, headphones |
| /sales | 5 | 120 | 250 | 1 | 7,200 | smartphones, laptops |

Analysis of baseline

  • Our /search endpoint handles 10 QPS steady state with an average latency of 80ms and a p99 latency of 150ms.
  • Our /sales endpoint handles 5 QPS steady state with an average latency of 120ms and a p99 latency of 250ms.
  • The error rate is negligible.
  • The most searched categories are smartphones, laptops, and headphones.

Step 3: Define Load Test Parameters

Our experience from previous events indicates that the behavior of our endpoints will be:

  • 10x increase in traffic, meaning /search should handle 100 QPS and /sales 50 QPS.
  • A 10x increase in ingestion load.
  • Our goal is to maintain the average latency below 200ms and the p99 latency below 400ms.
  • We'll set an error threshold of 0.1%.

Assessing request distribution

In Tinybird, we can easily extract a sample of endpoint calls that follows the production distribution by running a query over the pipe_stats_rt service data source.
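A minimal sketch of such a sampling query is shown below. The url column carries the full request path with its query-string parameters; the window and sample size are illustrative:

```sql
-- Random sample of recent production requests, preserving their relative frequency
SELECT url
FROM tinybird.pipe_stats_rt
WHERE start_datetime >= now() - INTERVAL 7 DAY
  AND pipe_name IN ('search', 'sales')
ORDER BY rand()
LIMIT 10000
```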

You can extract this sample of queries as a CSV file (e.g. query_distribution.csv) to seed your load test using your preferred tool.

Going a step further, a truly representative sample requires mirroring not only the distribution of endpoints, reflecting production load (e.g., 40% endpoint 1, 60% endpoint 2), but also the distribution of parameters within those endpoints. This ensures accurate simulation of real-world data processing. For instance, if endpoint 1 predominantly handles year-long date ranges, or if a recommendation endpoint processes data for high-interaction users, the sample must reflect these parameter distributions to accurately represent the production environment.

In the following example, you can see how to get the distribution of requests for each category. The query returns the number of requests grouped by query parameter values, which you can use to fine-tune your load test.
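A sketch of that query, assuming the endpoints expose a category parameter (in pipe_stats_rt, parameters is a map of the request's query-string parameters):

```sql
SELECT
    pipe_name,
    parameters['category'] AS category,
    count() AS requests
FROM tinybird.pipe_stats_rt
WHERE start_datetime >= now() - INTERVAL 7 DAY
  AND pipe_name IN ('search', 'sales')
GROUP BY pipe_name, category
ORDER BY pipe_name, requests DESC
```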

Step 4: Perform the load tests

There are several tools available for load testing, each with its own strengths and capabilities. Some popular options include JMeter, Gatling, Locust, and wrk, among others. Each tool allows you to simulate traffic, measure performance, and analyze the results based on different parameters.

In this particular example, we use wrk to execute the test, using the previously generated file and the following configuration:

  • -t12: specifies 12 execution threads.
  • -c400: defines 400 concurrent connections.
  • -d30s: sets a 30-second duration for the test.
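Putting it together, the invocation looks something like the sketch below. The base URL is a placeholder, and the Lua script is expected to read query_distribution.csv and build each request from it:

```bash
# Only the base host is passed here; multi-request-json.lua generates
# the individual request paths from query_distribution.csv.
wrk -t12 -c400 -d30s -s multi-request-json.lua https://api.tinybird.co
```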

Don’t forget to customize the multi-request-json.lua to point to your query_distribution.csv file.

Step 5: Analyze the result

Now that we've covered how to set up and run a load test in Tinybird, and which metrics to use for evaluation, let's discuss how to extract test results from Tinybird’s service tables. Full documentation on service data sources is available here: Tinybird Service Data Sources.

To analyze endpoint behavior during a load test, we use the pipe_stats_rt service data source, which contains detailed information about all endpoint requests made to our APIs in the last 7 days. The following SQL query retrieves the average latency, median, percentiles, and counts of successful and failed requests during a load test.

Since Tinybird records all the parameters you send to an endpoint, even if they are not used, you can use a placeholder parameter to identify the calls belonging to a load test. By passing test=test20250206 as a query parameter, for example, you can subsequently isolate the results of that specific load test.
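A sketch of that analysis query follows. The duration column is reported in seconds (hence the * 1000), and the test parameter value matches the tag used when generating the load:

```sql
SELECT
    pipe_name,
    round(count() / dateDiff('second', min(start_datetime), max(start_datetime))) AS qps,
    round(avg(duration) * 1000) AS avg_latency_ms,
    round(quantile(0.5)(duration) * 1000) AS median_latency_ms,
    round(quantile(0.75)(duration) * 1000) AS p75_latency_ms,
    round(quantile(0.9)(duration) * 1000) AS p90_latency_ms,
    round(quantile(0.99)(duration) * 1000) AS p99_latency_ms,
    count() AS total_requests,
    countIf(error = 0) AS successes,
    countIf(error = 1) AS errors
FROM tinybird.pipe_stats_rt
WHERE parameters['test'] = 'test20250206'
GROUP BY pipe_name
```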

This query extracts all the relevant metrics, making it sufficient for evaluating infrastructure and service performance during a load test.

It is also important to monitor CPU and memory usage during the load test. To prevent instance crashes, tests should be conducted incrementally, with continuous monitoring to avoid overwhelming the machines and causing service disruptions. The following queries help monitor CPU and memory status. 

You can use our template for monitoring to help you build a dashboard based on the metrics reported by Tinybird.

CPU Usage:
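The exact source for host metrics depends on your plan. As a hypothetical sketch, assuming an organization-level metrics_logs service data source that exposes per-host metric/value pairs (check the Tinybird docs for the exact source and column names available to you):

```sql
SELECT
    toStartOfMinute(timestamp) AS minute,
    host,
    round(avg(value), 1) AS avg_cpu_pct
FROM organization.metrics_logs
WHERE metric = 'cpu_usage'  -- hypothetical metric name
  AND timestamp >= now() - INTERVAL 1 HOUR
GROUP BY minute, host
ORDER BY minute, host
```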

Memory Usage:
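And the equivalent sketch for memory, under the same assumptions about the metrics source:

```sql
SELECT
    toStartOfMinute(timestamp) AS minute,
    host,
    round(avg(value) / 1e9, 2) AS avg_memory_gb  -- assumes value is reported in bytes
FROM organization.metrics_logs
WHERE metric = 'memory_usage'  -- hypothetical metric name
  AND timestamp >= now() - INTERVAL 1 HOUR
GROUP BY minute, host
ORDER BY minute, host
```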

If resource usage hits limits before reaching the expected load, you may need to resize the infrastructure or optimize endpoints to resolve bottlenecks. CPU usage is especially critical — exceeding 60% can cause latency spikes, so keeping it within 50-60% is ideal. If latency becomes unacceptable during the test, it’s often a sign of instance overload, requiring infrastructure adjustments.

Going back to our example, applying the queries mentioned above we get the following results:

| Endpoint | QPS | Avg Latency (ms) | Median (ms) | p75 (ms) | p90 (ms) | p99 (ms) | Total Requests | Successes | Errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| /search | 98 | 190 | 185 | 210 | 280 | 380 | 846,720 | 846,300 | 420 |
| /sales | 49 | 230 | 220 | 260 | 350 | 450 | 423,360 | 423,000 | 360 |

Interpreting the results:

/search endpoint: 

  • Handling an average of 98 queries per second.
  • Average latency is 190ms, with a median of 185ms. This suggests that the latency distribution is fairly symmetrical.
  • 75th percentile latency (p75) is 210ms, meaning 75% of requests are served within 210ms.
  • 99th percentile latency (p99) is 380ms, indicating that 1% of requests experience latencies higher than 380ms.
  • The error rate is very low (420 errors out of 846,720 requests).

/sales endpoint:

  • Handling 49 queries per second on average.
  • Average latency is 230ms, with a median of 220ms.
  • p99 latency is 450ms, slightly higher than /search.
  • The error rate is also low but slightly higher than /search (360 errors out of 423,360 requests).

The results suggest that both endpoints are performing reasonably well under load. However, /sales has slightly higher latencies and a higher error rate compared to /search. This indicates that /sales might be a bottleneck and could benefit from further optimization.

CPU Usage Query Results:

We can see the CPU usage across different nodes (node-1 and node-2) over time.

CPU usage is increasing gradually but remains within acceptable limits (below 60%). This indicates that the nodes are handling the load effectively.

Memory Usage Query Results:

Memory usage is also increasing but is well below the capacity of the nodes. This suggests that memory is not a bottleneck at this point. Our recommendation is to keep the memory usage always below 60%.

Important Notes:

  • These are just examples. Your actual results will vary depending on your infrastructure, load test parameters, and the specific queries you run.
  • It's crucial to monitor these metrics in real-time during your load tests to identify any potential issues early on.
  • If you see CPU usage approaching 60% or memory usage nearing capacity, you may need to consider scaling your infrastructure to handle the increased load.

Step 6: Adjustments and iterative testing

Based on the results, we can apply optimizations and rerun the tests. Some strategies include:

  • Optimize queries: We reviewed the /sales queries and found that a costly aggregation was causing the latency and errors. We optimized the query using pre-calculated aggregations.
  • Implement caching: We implemented caching to reduce the load on frequently used endpoints.
  • Ensure infrastructure can scale appropriately: We verified that our resources have sufficient CPU and memory to handle the load.

After applying these optimizations, we reran the load tests and verified that the results were within the defined thresholds.

Conclusion

Conducting load testing is essential for preparing for high-traffic events like Black Friday. Tinybird, along with load testing tools like wrk, enables you to assess the performance of your endpoints and optimize them to provide a seamless experience for your customers.

By following these steps, you can prepare for traffic spikes and ensure your app or service is ready to handle the load without interruptions even on the most important day of the year.
