Minimize processed bytes when filtering by high cardinality columns
By default, a MergeTree table is read in blocks (granules) of 8192 rows, and this block size can be tuned with the index_granularity setting.
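If you want to confirm the default on your server, one way (assuming you have access to the system tables) is to query system.merge_tree_settings:
-- Show the server-wide default value of index_granularity
SELECT name, value, description
FROM system.merge_tree_settings
WHERE name = 'index_granularity'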
As with any database, the less data you read, the better your query performance will generally be. To get there, we want to filter out and discard as many rows as possible before reading them. However, this can be tricky if you need to filter on a column with very high cardinality.
Changing the index_granularity can help in this situation, but remember that it is not a silver bullet, so take care to understand the impact it has on your use case.
Let's see an example. First, create a table of 100M rows with unique values (this simulates a very high cardinality data set).
CREATE TABLE default_index_granularity
(
`number` UInt64
)
ENGINE = MergeTree
PARTITION BY number % 10
ORDER BY number AS
SELECT number
FROM numbers(100000000)
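As a rough sanity check (exact numbers depend on how the data is split into parts), we can count the granules ClickHouse created for this table by summing the marks in system.parts. With the default index_granularity of 8192, 100M rows works out to roughly 12K granules:
-- One mark per granule; expect roughly 100M / 8192 ≈ 12K marks in total
SELECT
    sum(rows) AS rows,
    sum(marks) AS marks
FROM system.parts
WHERE table = 'default_index_granularity' AND active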
In our query we want to apply a filter that selects only 100K rows, but we're filtering on a column that has 100M unique values.
SELECT *
FROM default_index_granularity
WHERE number IN (
SELECT number * 10000
FROM numbers(100000)
)
FORMAT `Null`
0 rows in set. Elapsed: 1.211 sec. Processed 100.00 million rows, 800.00 MB (82.56 million rows/s., 660.45 MB/s.)
We can see in the output that to fulfill this query we had to read all 100M rows of data.
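To see why, we can ask ClickHouse which granules the primary index actually prunes (assuming a version recent enough to support EXPLAIN indexes = 1):
-- Shows, per part, how many granules the primary key index selects
EXPLAIN indexes = 1
SELECT *
FROM default_index_granularity
WHERE number IN (
    SELECT number * 10000
    FROM numbers(100000)
)
FORMAT `Null`
Because the matching values are spread evenly across the whole range, every 8192-row granule contains at least one match, so the index cannot skip any of them.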
Let's do that again, but with an index_granularity of 128:
CREATE TABLE reduced_index_granularity
(
`number` UInt64
)
ENGINE = MergeTree
PARTITION BY number % 10
ORDER BY number
SETTINGS index_granularity=128 AS
SELECT number
FROM numbers(100000000)
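The same system.parts check as before (again, exact figures vary with how parts are formed) shows why this should help: at a granularity of 128 the table is split into roughly 780K much smaller granules, so the primary index can discard ranges far more precisely.
-- Expect roughly 100M / 128 ≈ 780K marks in total
SELECT
    sum(rows) AS rows,
    sum(marks) AS marks
FROM system.parts
WHERE table = 'reduced_index_granularity' AND active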
We'll run the exact same query to read data as we did before:
SELECT *
FROM reduced_index_granularity
WHERE number IN (
SELECT number * 10000
FROM numbers(100000)
)
FORMAT `Null`
0 rows in set. Elapsed: 0.785 sec. Processed 12.84 million rows, 102.73 MB (16.35 million rows/s., 130.81 MB/s.)
Now we can see that to fulfill this query we only had to read 12.84 million rows, a nearly 8x reduction in processed data.
Remember, adjusting the index_granularity comes with some tradeoffs, such as heavier indexing (a larger primary index with many more marks) and slower insertions, but it's a good trick to keep in your back pocket.
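If you want to see the "heavier indexing" part of that tradeoff for yourself, system.parts also exposes the size of the primary index. Comparing the two tables (the figures will vary by version and by whether the index is currently loaded in memory) makes the cost of the extra marks visible:
-- Compare the number of marks and the in-memory size of the primary index
SELECT
    table,
    sum(marks) AS marks,
    formatReadableSize(sum(primary_key_bytes_in_memory)) AS primary_key_size
FROM system.parts
WHERE active AND table IN ('default_index_granularity', 'reduced_index_granularity')
GROUP BY table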