This is the unmodified (well, I removed some references and names) message we sent to one of our clients (who uses Kafka heavily) via support after migrating them to our newest version of the Kafka connector. In this case the exchange took place through Slack, which is becoming the premium support channel for developers.
I wanted to share it because we spend quite a lot of time researching and it’s easy to forget how hard these things are and the amount of effort involved in providing outstanding support. Here it is:
Regarding performance, today’s migration includes optimizations that will allow us to sustain much higher loads with Kafka.
However, benchmarking with your topic using the tinybird-XXXXXX groupID has shown that the optimizations on our system won’t be able to deliver significant improvements right now, as your Kafka cluster reading throughput is the limiting factor.
Nevertheless, we have tested the behavior with all Kafka compression codecs (snappy, lz4, gzip and zstd) and with different compression levels, using a significant sample of your data. We have seen a massive improvement when setting the producer’s compression to zstd and high compression levels.
Max throughput:
- current: ~9 million records / minute
- zstd with compression level 12: ~19 million records / minute
- zstd with compression level 8: ~18 million records / minute
Of course, using level 12 would use a higher CPU load than 8. However, zstd is pretty fast, my laptop is able to produce (and compress) at a rate of 4.3 million records per minute using level 12. With level 8, the producing rate improves to 5.3.
On top of that, there are other configuration parameters that can affect Kafka cluster brokers by loading their broker’s CPU, and thus reducing the reading performance. We are able to test the maximum reading throughput, but during regular producing throughput. A real peak will imply a higher load at your Kafka cluster servers. Confluent has an optimization guide that could improve things: https://docs.confluent.io/cloud/current/client-apps/optimizing/throughput.html
Lastly, we keep seeing degraded performance on 3 of the 24 partitions, partitions number 3, 8, and 10. Although they are able to keep up during regular loads, they are not able to keep up under heavy loads, significantly lagging behind the other partitions.
In summary:
- We highly recommend to compress on producers using zstd with a level between 8 and 12.
- We recommend to change other configuration parameters following https://docs.confluent.io/cloud/current/client-apps/optimizing/throughput.html
- We should keep investigating the problem with the 3 partitions.