ReplacingMergeTree engine¶
The engine differs from MergeTree in that it removes duplicate entries with the same sorting key value.
Data deduplication occurs only during a merge. Merging occurs in the background at an unknown time, so you can’t plan for it. Some data may remain unprocessed.
ReplacingMergeTree
is suitable for clearing out duplicate data in the background to save space, but it doesn't guarantee the absence of duplicates. Check Deduplication strategies for more information about how to handle duplicates.
Creating a ReplacingMergeTree Data Source¶
post_views_rmt.datasource
DESCRIPTION > Data Source to save latest version of post info. ReplacingMergeTree Engine. SCHEMA > `post_id` Int32 `json:$.post_id`, `views` Int32 `json:$.views`, `likes` Int32 `json:$.likes`, `tag` String `json:$.tag`, `timestamp` DateTime `json:$.timestamp`, `_is_deleted` UInt8 `json:$._is_deleted` DEFAULT 0 ENGINE "ReplacingMergeTree" ENGINE_PARTITION_KEY "" ENGINE_SORTING_KEY "post_id" ENGINE_VER "timestamp" ENGINE_IS_DELETED "_is_deleted"
ReplacingMergeTree parameters¶
ENGINE_VER¶
ENGINE_VER
is the column with the version number. Type UInt*
, Date
, DateTime
or DateTime64
. Optional parameter.
When merging, ReplacingMergeTree
from all the rows with the same ENGINE_SORTING_KEY
leaves only one:
- The last in the selection, if
ENGINE_VER
not set. A selection is a set of rows in a set of parts participating in the merge. The most recently created part, the last insert, is the last one in the selection. After deduplication, the last row from the most recent insert remains for each unique sorting key. - With the maximum version, if
ENGINE_VER
specified. IfENGINE_VER
is the same for several rows, then it uses "ifENGINE_VER
isn't specified" rule for them, i.e. the most recent inserted row remains.
ENGINE_IS_DELETED¶
ENGINE_IS_DELETED
is the name of the column used during a merge to determine whether the data in this row represents the state or is to be deleted. 1
is a "deleted" row, 0
is a "state" row.
ENGINE_IS_DELETED
can only be enabled when ENGINE_VER
is used. Note it's a mask, not an actual delete, so queries to the Data Source with FINAL
are equivalent to queries with WHERE <is_deleted_column> = 0
.
No matter the operation on the data, the version must be increased. If two inserted rows have the same version number, the last inserted row is the one kept.
Query time de-duplication and FINAL¶
At merge time, the ReplacingMergeTree identifies duplicate rows, using the values of the ENGINE_SORTING_KEY
columns as a unique identifier, and retains only the highest version. This, however, offers eventual correctness only - it doesn't guarantee rows are deduplicated. Queries can, therefore, produce incorrect answers due to update and delete rows being considered in queries.
To obtain correct answers, complement background merges with query time deduplication and deletion removal. You can do this by using the FINAL
operator.
Example:¶
echo '{ "timestamp": "2024-07-02T02:22:17", "post_id": 956, "views": 856875, "likes": 2321, "tag": "Sports" }' > post_views.ndjson echo '{ "timestamp": "2024-08-02T03:22:17", "post_id": 956, "views": 956875, "likes": 3321, "tag": "Sports" }' >> post_views.ndjson echo '{ "timestamp": "2024-08-01T00:00:00", "post_id": 1, "views": 56875, "likes": 321, "tag": "Music" }' >> post_views.ndjson echo '{ "timestamp": "2024-09-01T00:00:00", "post_id": 1, "views": 56875, "likes": 321, "tag": "Music", "_is_deleted": 1 }' >> post_views.ndjson tb datasource append post_views_rmt post_views.ndjson tb sql "select * from post_views_rmt final" ------------------------------------------------------------------------- | post_id | views | likes | tag | timestamp | _is_deleted | ------------------------------------------------------------------------- | 956 | 956875 | 3321 | Sports | 2024-08-02 03:22:17 | 0 | -------------------------------------------------------------------------
Settings¶
For a list of supported settings, see Engine settings.