CollapsingMergeTree engine

The CollapsingMergeTree engine inherits from MergeTree and adds the logic of rows collapsing to data parts merge algorithm.

CollapsingMergeTree asynchronously deletes, or collapses, pairs of rows if all of the fields in a sorting key (ENGINE_SORTING_KEY) are equivalent except the particular field Sign, which can have 1 and -1 values. Rows without a pair are kept. The engine may significantly reduce the volume of storage and increase the efficiency of SELECT queries.

CollapsingMergeTree parameters

sign

ENGINE_SIGN - name of the column with the type of row: 1 is a “state” row, -1 is a “cancel” row.

The column data type should be Int8.

Usage example

cmt.datasource
SCHEMA >
    UserID UInt64 `json:$.UserID`,
    PageViews UInt8 `json:$.PageViews`,
    Duration UInt8 `json:$.Duration`,
    Sign Int8 `json:$.Sign`

ENGINE "CollapsingMergeTree"
ENGINE_SORTING_KEY "UserID"
ENGINE_SIGN "Sign"
tb push datasources/cmt.datasource 

TB_HOST=$(cat .tinyb | jq ".host" -r)
TB_TOKEN=$(cat .tinyb | jq ".token" -r)

curl \
    -H "Authorization: Bearer $TB_TOKEN" \
    -d '{"UserID": 4324182021466249494, "PageViews": 5, "Duration": 146, "Sign": 1}' \
    "$TB_HOST/v0/events?name=cmt"

tb sql "select * from cmt"

-----------------------------------------------------
|              UserID | PageViews | Duration | Sign |
-----------------------------------------------------
| 4324182021466249494 |         5 |      146 |    1 |
-----------------------------------------------------

Let's add another row with the same UserID and negate the previous row (same row with -1 as Sign value).
{"UserID": 4324182021466249494, "PageViews": 5, "Duration": 146, "Sign": -1}

curl \
    -H "Authorization: Bearer $TB_TOKEN" \
    -d $'{"UserID": 4324182021466249494, "PageViews": 6, "Duration": 185, "Sign": 1}\n{"UserID": 4324182021466249494, "PageViews": 5, "Duration": 146, "Sign": -1}\n' \
    "$TB_HOST/v0/events?name=cmt"

tb sql "select * from cmt"

-----------------------------------------------------
|              UserID | PageViews | Duration | Sign |
-----------------------------------------------------
| 4324182021466249494 |         5 |      146 | 1    |
| 4324182021466249494 |         6 |      185 | 1    |
| 4324182021466249494 |         5 |      146 | -1   |
-----------------------------------------------------

What happened here? Collapsing happens during merges, which are performed asynchronously and can't be controlled, so you should force it using FINALor query using aggregation.

tb sql "select * from cmt final"
-----------------------------------------------------
|              UserID | PageViews | Duration | Sign |
-----------------------------------------------------
| 4324182021466249494 |         6 |      185 | 1    |
-----------------------------------------------------

tb sql "SELECT
    UserID,
    sum(PageViews * Sign) AS PageViews,
    sum(Duration * Sign) AS Duration
FROM cmt
GROUP BY UserID
HAVING sum(Sign) > 0"
----------------------------------------------
|              UserID | PageViews | Duration |
----------------------------------------------
| 4324182021466249494 |         6 |      185 |
----------------------------------------------

A different approach would be to negate all the numeric columns and then sum them up.
{"UserID": 4324182021466249494, "PageViews": -5, "Duration": -146, "Sign": -1}

curl \
    -H "Authorization: Bearer $TB_TOKEN" \
    -d '{"UserID": 4324182021466249494, "PageViews": 5, "Duration": 146, "Sign": 1}' \
    "$TB_HOST/v0/events?name=cmt"

tb sql "select * from cmt"

-----------------------------------------------------
|              UserID | PageViews | Duration | Sign |
-----------------------------------------------------
| 4324182021466249494 |         5 |      146 |    1 |
-----------------------------------------------------

curl \
    -H "Authorization: Bearer $TB_TOKEN" \
    -d $'{"UserID": 4324182021466249494, "PageViews": 6, "Duration": 185, "Sign": 1}\n{"UserID": 4324182021466249494, "PageViews": -5, "Duration": -146, "Sign": -1}\n' \
    "$TB_HOST/v0/events?name=cmt"

tb sql "SELECT
    UserID,
    sum(PageViews) AS PageViews,
    sum(Duration) AS Duration
FROM UAct
GROUP BY UserID"

----------------------------------------------
|              UserID | PageViews | Duration |
----------------------------------------------
| 4324182021466249494 |         6 |      185 |
----------------------------------------------

Query clauses

When creating a CollapsingMergeTree table, the same query clauses are required as when creating a MergeTree table.

Settings

For a list of supported settings, see Engine settings.

Updated