
Splitting CSV files at 3GB/s

Splitting CSV files is a must when dealing with large files, potentially larger than RAM. But how fast can it be?
Engineering Excellence
José Muñoz, Backend Developer

The problem

When dealing with large files, for example files that don’t fit in RAM, splitting the file into chunks is a must.

Some data formats like NDJSON can be trivially split, as all \n bytes are line separators. However, a CSV field can contain newlines inside quotes; such a newline is part of the field’s value and does not start a new row.

There are two kinds of solutions to the problem of finding a splitting point in a CSV file:

  • Read it. By reading the file from the start, we can distinguish the \n bytes that are inside quotes from the ones that aren’t; the latter are the points at which we can split the CSV. With encodings that preserve the ASCII codes of the special characters, like UTF-8, we can work directly on bytes and avoid extra state and code.
  • Statistical solutions. By jumping directly to the point at which we want to split the file and reading some bytes around it, we can guess a good splitting point. This solution is naturally faster, as it processes fewer bytes, but it can also return wrong splitting points.

We decided to go with the first approach to favor correctness over speed. Speed is still very important to us, though, so how fast can we make it?
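The "read it" approach can be sketched in a few lines of C. This is a minimal illustration, not the code described in this post: the function name `csv_split_point` and its `limit` parameter are hypothetical, and it assumes `"` is the quote character and that a literal quote inside a field is written as `""`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: scans buf[0..limit) from the start and returns the
 * offset just past the last '\n' that lies outside quotes, or 0 if there
 * is none. Bytes at or after `limit` are not inspected. */
size_t csv_split_point(const char *buf, size_t len, size_t limit)
{
    assert(buf != NULL);
    if (limit > len) {
        limit = len;
    }

    size_t split = 0;
    int in_quotes = 0;

    for (size_t i = 0; i < limit; i++) {
        char c = buf[i];
        if (c == '"') {
            /* An escaped quote ("") toggles twice, so the state is unchanged. */
            in_quotes = !in_quotes;
        } else if (c == '\n' && !in_quotes) {
            split = i + 1;  /* a row ends here: safe place to cut */
        }
    }
    return split;
}
```

Because the bytes 0x22 (`"`) and 0x0A (`\n`) never appear inside a multi-byte UTF-8 sequence, the loop can compare raw bytes without decoding characters.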

Speed wins

Our first implementation was a Python-based parser. We tested that it behaved correctly, but its performance made this approach a no-go.

Hello, C

Python performance can be very limiting, but the good thing about Python is that C code can be integrated easily for simple problems like this one.

We ported the Python implementation to C and called it using CFFI. This single optimization, without any other improvement, increased performance by two orders of magnitude, reaching the 1GB/s mark.

Simple is faster

We were quite happy about the 1GB/s initial implementation, but we knew we could do better. We started optimizing the algorithm. The first version was based on a single complex loop. The optimized version was simpler, and had a nested loop to deal with the quoted fields.
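One plausible reading of "a nested loop to deal with the quoted fields" is shown below. It is a sketch under the same assumptions as before (`"` as quote character, `""` as escape), and `csv_split_point_nested` is a hypothetical name, not the original function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: the outer loop handles bytes outside quotes, and an
 * inner loop skips over each quoted section. Returns the offset just past
 * the last unquoted '\n' in buf[0..len), or 0 if there is none. */
size_t csv_split_point_nested(const char *buf, size_t len)
{
    assert(buf != NULL);

    size_t split = 0;
    size_t i = 0;

    while (i < len) {
        char c = buf[i++];
        if (c == '\n') {
            split = i;
        } else if (c == '"') {
            /* Inside quotes: ignore everything up to the closing quote. */
            while (i < len && buf[i] != '"') {
                i++;
            }
            if (i == len) {
                break;  /* buffer ends inside a quoted field */
            }
            i++;  /* consume the closing quote */
        }
    }
    return split;
}
```

No quote-state flag is needed: an escaped quote (`""`) simply closes one quoted section and immediately opens another. For bytes outside quotes, each iteration costs a bounds check and two character comparisons.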

SIMD

This implementation was quite fast, but we still thought we could go a step further. This optimized version seemed very difficult to improve, as it just performed three quick comparisons per byte on the happy path.

Improving the time it took to process each byte seemed almost impossible, and actually, we didn’t do that. Instead, we processed multiple bytes per iteration.

All modern x86 CPUs include SIMD instructions like SSE and AVX. These instructions allow the processor to operate on multiple data elements packed into a single wide register in parallel, improving throughput.
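As a small illustration of processing several bytes per iteration, the sketch below counts how many of 16 bytes equal a given value using SSE2 intrinsics and a GCC/Clang builtin. The function name `count_byte_16` is hypothetical, and the code assumes an x86 target with SSE2.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: counts the bytes equal to `needle` among the 16
 * bytes starting at `block`. The caller must guarantee 16 readable bytes. */
int count_byte_16(const char *block, char needle)
{
    assert(block != NULL);

    __m128i data = _mm_loadu_si128((const __m128i *)block);  /* unaligned load */
    __m128i pattern = _mm_set1_epi8(needle);   /* 16 copies of needle */
    __m128i equal = _mm_cmpeq_epi8(data, pattern);  /* 0xFF where bytes match */
    int mask = _mm_movemask_epi8(equal);       /* one bit per byte */

    return __builtin_popcount((unsigned int)mask);
}
```

A single comparison instruction here replaces 16 byte-by-byte comparisons, which is where the throughput gain comes from.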

This is what our final version looks like:

Used compiler intrinsics:

  • __m128i _mm_set1_epi8(char a). Returns a 16-byte SSE register with all bytes set to a.
  • __m128i _mm_loadu_si128(__m128i const* mem_addr). Returns a 16-byte SSE register with bytes copied from mem_addr, which does not need to be aligned.
  • __m128i _mm_cmpeq_epi8 (__m128i a, __m128i b). Returns a 16-byte SSE register, comparing for equality the bytes of a and b. Each returned byte is set to 0xFF if the corresponding bytes of a and b are equal, 0 otherwise.
  • int _mm_movemask_epi8 (__m128i a). Compacts the 16 bytes of a by taking the most significant bit of each byte in a, returning a 16-bit mask.
  • int __builtin_popcount(unsigned int x). Counts the number of bits in x set to 1.
  • int __builtin_clz(unsigned int x). Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.
  • int __builtin_ctz (unsigned int x). Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined.
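To show how these intrinsics can fit together, here is a minimal sketch of an SSE-based split-point finder. It is not the original listing: the names `scan_scalar` and `csv_split_point_sse` are hypothetical, it assumes `"` as quote character with `""` as escape, a 32-bit `unsigned int`, and GCC or Clang on x86 with SSE2. Blocks that contain both quotes and newlines fall back to a byte-by-byte scan, which keeps the sketch simple at the cost of speed on heavily quoted data.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Byte-by-byte scan of buf[from..to), updating quote state and split point. */
static void scan_scalar(const char *buf, size_t from, size_t to,
                        int *in_quotes, size_t *split)
{
    for (size_t j = from; j < to; j++) {
        if (buf[j] == '"') {
            *in_quotes = !*in_quotes;
        } else if (buf[j] == '\n' && !*in_quotes) {
            *split = j + 1;
        }
    }
}

/* Hypothetical sketch: returns the offset just past the last unquoted '\n'
 * in buf[0..len), or 0 if there is none, examining 16 bytes per iteration. */
size_t csv_split_point_sse(const char *buf, size_t len)
{
    assert(buf != NULL);

    const __m128i quote = _mm_set1_epi8('"');
    const __m128i newline = _mm_set1_epi8('\n');
    size_t split = 0;
    int in_quotes = 0;
    size_t i = 0;

    for (; i + 16 <= len; i += 16) {
        __m128i data = _mm_loadu_si128((const __m128i *)(buf + i));
        /* Bit k of each mask is set when byte k of the block matches. */
        unsigned int q = (unsigned int)_mm_movemask_epi8(_mm_cmpeq_epi8(data, quote));
        unsigned int n = (unsigned int)_mm_movemask_epi8(_mm_cmpeq_epi8(data, newline));

        if (q == 0) {
            /* No quotes: the state is constant across the whole block. */
            if (n != 0 && !in_quotes) {
                /* Highest set bit = last newline in the block. */
                split = i + (size_t)(31 - __builtin_clz(n)) + 1;
            }
        } else if (n == 0) {
            /* Quotes but no newlines: only the parity of quotes matters. */
            in_quotes ^= __builtin_popcount(q) & 1;
        } else {
            /* Both present: fall back to the scalar scan for this block. */
            scan_scalar(buf, i, i + 16, &in_quotes, &split);
        }
    }

    /* Tail shorter than 16 bytes. */
    scan_scalar(buf, i, len, &in_quotes, &split);
    return split;
}
```

On data with few quoted fields, most blocks take the first branch, which needs one load, two comparisons and two mask extractions for 16 bytes.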

See also Intel’s and GCC’s docs.

This is certainly a more complex version, but thanks to SSE, it works at 3GB/s!

Want to help us solve these problems? Join us!

Copyright © 2025 Tinybird. All rights reserved

|

Terms & conditionsCookiesTrust CenterCompliance Helpline
Tinybird wordmark

Related posts

Engineering Excellence
Dec 09, 2021
Performance and Kafka compression
David Manzanares
David ManzanaresSoftware Engineer
1Performance and Kafka compression
Engineering Excellence
May 06, 2025
Building a conversational AI tool for real-time analytics
Rafa Moreno
Rafa MorenoFrontend Engineer
1Building a conversational AI tool for real-time analytics
Engineering Excellence
Mar 11, 2024
Iterating terabyte-sized ClickHouse®️ tables in production
Alberto Romeu
Alberto RomeuSoftware Engineer
1Iterating terabyte-sized ClickHouse®️ tables in production
Engineering Excellence
Apr 15, 2024
We rebuilt our docs from scratch. It was worth it.
Julia Vallina
Julia VallinaFrontend Web Developer
1We rebuilt our docs from scratch. It was worth it.
Engineering Excellence
Jan 09, 2020
How We Handle Technical Incidents and Service Disruptions
Jorge Sancha
Jorge SanchaCo-founder
1How We Handle Technical Incidents and Service Disruptions
Engineering Excellence
Jan 25, 2023
Horizontally scaling Kafka consumers with rendezvous hashing
David Manzanares
David ManzanaresSoftware Engineer
1Horizontally scaling Kafka consumers with rendezvous hashing
Engineering Excellence
Jun 08, 2023
Adding JOIN support for parallel replicas on ClickHouse®️
Javi Santana
Javi SantanaCo-founder
1Adding JOIN support for parallel replicas on ClickHouse®️
Engineering Excellence
Oct 30, 2023
Resolving a year-long ClickHouse®️ lock contention
Jordi Villar
Jordi VillarStaff Engineer
1Resolving a year-long ClickHouse®️ lock contention
Engineering Excellence
Apr 07, 2025
How to safely cancel a database query
Ivan Malagon
Ivan MalagonProduct Manager
1How to safely cancel a database query
Product updates
Dec 13, 2021
Changelog #17: Guided tour, Kafka ingestion improvements and more
Tinybird
TinybirdTeam
1Changelog #17: Guided tour, Kafka ingestion improvements and more