The problem ¶
When dealing with large files, for example files that don’t fit on RAM, splitting the file in chunks is a must.
Some data formats like NDJSON can be trivially split, as all \n bytes are line separators. However, CSV can have newlines within quotes, which don’t indicate a new row, but a newline in the row’s field.
There are two kinds of solutions to the problem of finding a splitting point in a CSV file:
- Read it. By simply reading the file, we can distinguish between \n bytes that are within quotes and the ones that aren’t, and that are points in which we can split the CSV. With encodings that preserve the ASCII code, like UTF-8, for the special characters we’ll be able to avoid extra state and code.
- Statistical solutions. By going directly to the point in which we want to split the file, and by reading some bytes around it, we can guess a good splitting point. This solution is naturally faster, as you need to process less bytes, but it can return wrong splitting points too.
We decided to go with the first approach to favor correctness over speed. However, speed is very important to us, how much faster can we make it?
Speed wins
Our first implementation was a python-based parser. We tested its correct behavior, but we certainly saw that performance with this approach was a no-go.
Hello, C
Python performance can be very limitting, but the good thing about Python is that C code can be integrated easily on simple problems like this one.
We ported the Python implementation to C and called it using CFFI. This single optimization, without any other improvement, increased performance by two orders of magnitude, reaching now the 1GB/s barrier.
Simple is faster
We were quite happy about the 1GB/s initial implementation, but we knew we could do better. We started optimizing the algorithm. The first version was based on a single complex loop. The optimized version was simpler, and had a nested loop to deal with the quoted fields.
SIMD
This implementation was quite fast, but we still thought we could go a step further. This optimized version seemed very difficult to improve, as it just performed three quick comparisons per byte on the happy path.
Improving the time it took to process each byte seemed almost impossible, and actually, we didn’t do that. Instead, we processesed multiple bytes per iteration.
All modern x86 CPUs include SIMD instructions like SSE and AVX. These instructions allow the processor to work on multiple registers in parallel, improving throughput.
This is how our last version looks like:
Used compiler intrinsics:
__m128i _mm_set1_epi8(unsigned char c)
. Returns a 16-byte SSE group of registers with all bytes filled with c.__m128i _mm_loadu_si128(__m128i const* mem_addr)
. Returns a 16-byte SSE group of registers with bytes copied frommem_addr
.__m128i _mm_cmpeq_epi8 (__m128i a, __m128i b)
. Returns a 16-byte SSE group of registers, comparing for equality the bytes ofa
andb
. Each returned byte is set to0xFF
if the corresponding bytes ofa
andb
are equal,0
otherwise.int _mm_movemask_epi8 (__m128i a)
. Compacts the 16 bytes ofa
by taking the most significand bit of each byte ina
, returning a 16-bit sized mask.int __builtin_popcount(unsigned int x)
. Counts the number of bits inx
set to1
.int __builtin_clz(unsigned int x)
. Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.int __builtin_ctz (unsigned int x)
. Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined.
See also Intel’s and gcc’s docs.
This is certainly a more complex version, but thanks to SSE, it works at 3GB/s!
Want to help us solve these problems? Join us!