Parametric Aggregate Functions¶
Some aggregate functions can accept not only argument columns (used for compression), but a set of parameters – constants for initialization. The syntax is two pairs of brackets instead of one. The first is for parameters, and the second is for arguments.
histogram¶
Calculates an adaptive histogram. It does not guarantee precise results.
histogram(number_of_bins)(values)
The functions uses A Streaming Parallel Decision Tree Algorithm. The borders of histogram bins are adjusted as new data enters a function. In common case, the widths of bins are not equal.
Arguments
values
: Expression resulting in input values.
Parameters
number_of_bins
: Upper limit for the number of bins in the histogram. The function automatically calculates the number of bins. It tries to reach the specified number of bins, but if it fails, it uses fewer bins.
Returned values
Array of Tuples of the following format:
[(lower_1, upper_1, height_1), ... (lower_N, upper_N, height_N)]
-
lower
: Lower bound of the bin. -upper
: Upper bound of the bin. -height
: Calculated height of the bin.
Example
SELECT histogram(5)(number + 1) FROM ( SELECT * FROM system.numbers LIMIT 20 )
┌─histogram(5)(plus(number, 1))───────────────────────────────────────────┐ │ [(1,4.5,4),(4.5,8.5,4),(8.5,12.75,4.125),(12.75,17,4.625),(17,20,3.25)] │ └─────────────────────────────────────────────────────────────────────────┘
You can visualize a histogram with the bar function, for example:
WITH histogram(5)(rand() % 100) AS hist SELECT arrayJoin(hist).3 AS height, bar(height, 0, 6, 5) AS bar FROM ( SELECT * FROM system.numbers LIMIT 20 )
┌─height─┬─bar───┐ │ 2.125 │ █▋ │ │ 3.25 │ ██▌ │ │ 5.625 │ ████▏ │ │ 5.625 │ ████▏ │ │ 3.375 │ ██▌ │ └────────┴───────┘
In this case, you should remember that you do not know the histogram bin borders.
sequenceMatch¶
Checks whether the sequence contains an event chain that matches the pattern.
Syntax
sequenceMatch(pattern)(timestamp, cond1, cond2, ...)
Events that occur at the same second may lay in the sequence in an undefined order affecting the result.
Arguments
timestamp
: Column considered to contain time data. Typical data types areDate
andDateTime
. You can also use any of the supported UInt data types.cond1
,cond2
: Conditions that describe the chain of events. Data type:UInt8
. You can pass up to 32 condition arguments. The function takes only the events described in these conditions into account. If the sequence contains data that isn’t described in a condition, the function skips them.
Parameters
pattern
: Pattern string. See Pattern syntax.
Returned values
- 1, if the pattern is matched.
- 0, if the pattern isn’t matched.
Type: UInt8
.
Pattern syntax¶
(?N)
: Matches the condition argument at positionN
. Conditions are numbered in the[1, 32]
range. For example,(?1)
matches the argument passed to thecond1
parameter..*
: Matches any number of events. You do not need conditional arguments to match this element of the pattern.(?t operator value)
: Sets the time in seconds that should separate two events. For example, pattern(?1)(?t>1800)(?2)
matches events that occur more than 1800 seconds from each other. An arbitrary number of any events can lay between these events. You can use the>=
,>
,<
,<=
,==
operators.
Examples
Consider data in the t
table:
┌─time─┬─number─┐ │ 1 │ 1 │ │ 2 │ 3 │ │ 3 │ 2 │ └──────┴────────┘
Perform the query:
SELECT sequenceMatch('(?1)(?2)')(time, number = 1, number = 2) FROM t
┌─sequenceMatch('(?1)(?2)')(time, equals(number, 1), equals(number, 2))─┐ │ 1 │ └───────────────────────────────────────────────────────────────────────┘
The function found the event chain where number 2 follows number 1. It skipped number 3 between them, because the number is not described as an event. If we want to take this number into account when searching for the event chain given in the example, we should make a condition for it.
SELECT sequenceMatch('(?1)(?2)')(time, number = 1, number = 2, number = 3) FROM t
┌─sequenceMatch('(?1)(?2)')(time, equals(number, 1), equals(number, 2), equals(number, 3))─┐ │ 0 │ └──────────────────────────────────────────────────────────────────────────────────────────┘
In this case, the function couldn’t find the event chain matching the pattern, because the event for number 3 occurred between 1 and 2. If in the same case we checked the condition for number 4, the sequence would match the pattern.
SELECT sequenceMatch('(?1)(?2)')(time, number = 1, number = 2, number = 4) FROM t
┌─sequenceMatch('(?1)(?2)')(time, equals(number, 1), equals(number, 2), equals(number, 4))─┐ │ 1 │ └──────────────────────────────────────────────────────────────────────────────────────────┘
sequenceCount¶
Counts the number of event chains that matched the pattern. The function searches event chains that do not overlap. It starts to search for the next chain after the current chain is matched.
Events that occur at the same second may lay in the sequence in an undefined order affecting the result.
Syntax
sequenceCount(pattern)(timestamp, cond1, cond2, ...)
Arguments
timestamp
: Column considered to contain time data. Typical data types areDate
andDateTime
. You can also use any of the supported UInt data types.cond1
,cond2
: Conditions that describe the chain of events. Data type:UInt8
. You can pass up to 32 condition arguments. The function takes only the events described in these conditions into account. If the sequence contains data that isn’t described in a condition, the function skips them.
Parameters
pattern
: Pattern string. See Pattern syntax.
Returned values
- Number of non-overlapping event chains that are matched.
Type: UInt64
.
Example
Consider data in the t
table:
┌─time─┬─number─┐ │ 1 │ 1 │ │ 2 │ 3 │ │ 3 │ 2 │ │ 4 │ 1 │ │ 5 │ 3 │ │ 6 │ 2 │ └──────┴────────┘
Count how many times the number 2 occurs after the number 1 with any amount of other numbers between them:
SELECT sequenceCount('(?1).*(?2)')(time, number = 1, number = 2) FROM t
┌─sequenceCount('(?1).*(?2)')(time, equals(number, 1), equals(number, 2))─┐ │ 2 │ └─────────────────────────────────────────────────────────────────────────┘
See Also
windowFunnel¶
Searches for event chains in a sliding time window and calculates the maximum number of events that occurred from the chain.
The function works according to the algorithm:
The function searches for data that triggers the first condition in the chain and sets the event counter to 1. This is the moment when the sliding window starts.
If events from the chain occur sequentially within the window, the counter is incremented. If the sequence of events is disrupted, the counter isn’t incremented.
If the data has multiple event chains at varying points of completion, the function will only output the size of the longest chain.
Syntax
windowFunnel(window, [mode, [mode, ... ]])(timestamp, cond1, cond2, ..., condN)
Arguments
timestamp
: Name of the column containing the timestamp. Data types supported: Date, DateTime and other unsigned integer types (note that even though timestamp supports theUInt64
type, it’s value can’t exceed the Int64 maximum, which is 2^63 - 1).cond
: Conditions or data describing the chain of events. UInt8.
Parameters
window
: Length of the sliding window, it is the time interval between the first and the last condition. The unit ofwindow
depends on thetimestamp
itself and varies. Determined using the expressiontimestamp of cond1 <= timestamp of cond2 <= ... <= timestamp of condN <= timestamp of cond1 + window
.mode
: It is an optional argument. One or more modes can be set.'strict_deduplication'
: If the same condition holds for the sequence of events, then such repeating event interrupts further processing. Note: it may work unexpectedly if several conditions hold for the same event.'strict_order'
: Don't allow interventions of other events. E.g. in the case ofA->B->D->C
, it stops findingA->B->C
at theD
and the max event level is 2.'strict_increase'
: Apply conditions only to events with strictly increasing timestamps.'strict_once'
: Count each event only once in the chain even if it meets the condition several times
Returned value
The maximum number of consecutive triggered conditions from the chain within the sliding time window. All the chains in the selection are analyzed.
Type: Integer
.
Example
Determine if a set period of time is enough for the user to select a phone and purchase it twice in the online store.
Set the following chain of events:
- The user logged in to their account on the store (
eventID = 1003
). - The user searches for a phone (
eventID = 1007, product = 'phone'
). - The user placed an order (
eventID = 1009
). - The user made the order again (
eventID = 1010
).
Input table:
SCHEMA > event_date Date `json:$.event_date`, user_id UInt32 `json:$.user_id`, timestamp DateTime `json:$.timestamp`, eventID UInt32 `json:$.eventID`, product String `json:$.product` ENGINE "MergeTree"
tb push datasources/trend.datasource echo ' {"event_date": "2019-01-28", "user_id": 1, "timestamp": "2019-01-29 10:00:00", "eventID": 1003, "product": "phone"} {"event_date": "2019-01-31", "user_id": 1, "timestamp": "2019-01-31 09:00:00", "eventID": 1007, "product": "phone"} {"event_date": "2019-01-30", "user_id": 1, "timestamp": "2019-01-30 08:00:00", "eventID": 1009, "product": "phone"} {"event_date": "2019-02-01", "user_id": 1, "timestamp": "2019-02-01 08:00:00", "eventID": 1010, "product": "phone"}' > trend.ndjson tb datasource append trend trend.ndjson
Find out how far the user user_id
could get through the chain in a period in January-February of 2019.
Query:
SELECT level, count() AS c FROM ( SELECT user_id, windowFunnel(6048000000000000)(timestamp, eventID = 1003, eventID = 1009, eventID = 1007, eventID = 1010) AS level FROM trend WHERE (event_date >= '2019-01-01') AND (event_date <= '2019-02-02') GROUP BY user_id ) GROUP BY level ORDER BY level ASC
Result:
┌─level─┬─c─┐ │ 4 │ 1 │ └───────┴───┘
retention¶
The function takes as arguments a set of conditions from 1 to 32 arguments of type UInt8
that indicate whether a certain condition was met for the event. Any condition can be specified as an argument (as in WHERE).
The conditions, except the first, apply in pairs: the result of the second will be true if the first and second are true, of the third if the first and third are true, etc.
Syntax
retention(cond1, cond2, ..., cond32)
Arguments
cond
: An expression that returns aUInt8
result (1 or 0).
Returned value
The array of 1 or 0.
- 1: Condition was met for the event.
- 0: Condition wasn’t met for the event.
Type: UInt8
.
Example
Let’s consider an example of calculating the retention
function to determine site traffic.
1. Create a table to illustrate an example.
Input table:
Query:
SELECT '2020-01-01' as date, number as uid FROM numbers(5) UNION ALL SELECT '2020-01-02' as date, number as uid FROM numbers(10) UNION ALL SELECT '2020-01-03' as date, number as uid FROM numbers(15)
Result:
┌───────date─┬─uid─┐ │ 2020-01-01 │ 0 │ │ 2020-01-01 │ 1 │ │ 2020-01-01 │ 2 │ │ 2020-01-01 │ 3 │ │ 2020-01-01 │ 4 │ └────────────┴─────┘ ┌───────date─┬─uid─┐ │ 2020-01-02 │ 0 │ │ 2020-01-02 │ 1 │ │ 2020-01-02 │ 2 │ │ 2020-01-02 │ 3 │ │ 2020-01-02 │ 4 │ │ 2020-01-02 │ 5 │ │ 2020-01-02 │ 6 │ │ 2020-01-02 │ 7 │ │ 2020-01-02 │ 8 │ │ 2020-01-02 │ 9 │ └────────────┴─────┘ ┌───────date─┬─uid─┐ │ 2020-01-03 │ 0 │ │ 2020-01-03 │ 1 │ │ 2020-01-03 │ 2 │ │ 2020-01-03 │ 3 │ │ 2020-01-03 │ 4 │ │ 2020-01-03 │ 5 │ │ 2020-01-03 │ 6 │ │ 2020-01-03 │ 7 │ │ 2020-01-03 │ 8 │ │ 2020-01-03 │ 9 │ │ 2020-01-03 │ 10 │ │ 2020-01-03 │ 11 │ │ 2020-01-03 │ 12 │ │ 2020-01-03 │ 13 │ │ 2020-01-03 │ 14 │ └────────────┴─────┘
2. Group users by unique ID uid
using the retention
function.
Query:
SELECT uid, retention(date = '2020-01-01', date = '2020-01-02', date = '2020-01-03') AS r FROM ( SELECT '2020-01-01' as date, number as uid FROM numbers(5) UNION ALL SELECT '2020-01-02' as date, number as uid FROM numbers(10) UNION ALL SELECT '2020-01-03' as date, number as uid FROM numbers(15)) WHERE date IN ('2020-01-01', '2020-01-02', '2020-01-03') GROUP BY uid ORDER BY uid ASC
Result:
┌─uid─┬─r───────┐ │ 0 │ [1,1,1] │ │ 1 │ [1,1,1] │ │ 2 │ [1,1,1] │ │ 3 │ [1,1,1] │ │ 4 │ [1,1,1] │ │ 5 │ [0,0,0] │ │ 6 │ [0,0,0] │ │ 7 │ [0,0,0] │ │ 8 │ [0,0,0] │ │ 9 │ [0,0,0] │ │ 10 │ [0,0,0] │ │ 11 │ [0,0,0] │ │ 12 │ [0,0,0] │ │ 13 │ [0,0,0] │ │ 14 │ [0,0,0] │ └─────┴─────────┘
3. Calculate the total number of site visits per day.
Query:
SELECT sum(r[1]) AS r1, sum(r[2]) AS r2, sum(r[3]) AS r3 FROM ( SELECT uid, retention(date = '2020-01-01', date = '2020-01-02', date = '2020-01-03') AS r FROM ( SELECT '2020-01-01' as date, number as uid FROM numbers(5) UNION ALL SELECT '2020-01-02' as date, number as uid FROM numbers(10) UNION ALL SELECT '2020-01-03' as date, number as uid FROM numbers(15)) WHERE date IN ('2020-01-01', '2020-01-02', '2020-01-03') GROUP BY uid )
Result:
┌─r1─┬─r2─┬─r3─┐ │ 5 │ 5 │ 5 │ └────┴────┴────┘
Where:
r1
- the number of unique visitors who visited the site during 2020-01-01 (thecond1
condition).r2
- the number of unique visitors who visited the site during a specific time period between 2020-01-01 and 2020-01-02 (cond1
andcond2
conditions).r3
- the number of unique visitors who visited the site during a specific time period on 2020-01-01 and 2020-01-03 (cond1
andcond3
conditions).
uniqUpTo(N)(x)¶
Calculates the number of different values of the argument up to a specified limit, N
. If the number of different argument values is greater than N
, this function returns N
+ 1, otherwise it calculates the exact value.
Recommended for use with small N
s, up to 10. The maximum value of N
is 100.
For the state of an aggregate function, this function uses the amount of memory equal to 1 + N
* the size of one value of bytes. When dealing with strings, this function stores a non-cryptographic hash of 8 bytes; the calculation is approximated for strings.
For example, if you had a table that logs every search query made by users on your website. Each row in the table represents a single search query, with columns for the user ID, the search query, and the timestamp of the query. You can use uniqUpTo
to generate a report that shows only the keywords that produced at least 5 unique users.
SELECT SearchPhrase FROM SearchLog GROUP BY SearchPhrase HAVING uniqUpTo(4)(UserID) >= 5
uniqUpTo(4)(UserID)
calculates the number of unique UserID
values for each SearchPhrase
, but it only counts up to 4 unique values. If there are more than 4 unique UserID
values for a SearchPhrase
, the function returns 5 (4 + 1). The HAVING
clause then filters out the SearchPhrase
values for which the number of unique UserID
values is less than 5. This will give you a list of search keywords that were used by at least 5 unique users.
sumMapFiltered¶
This function behaves the same as sumMap except that it also accepts an array of keys to filter with as a parameter. This can be especially useful when working with a high cardinality of keys.
Syntax
sumMapFiltered(keys_to_keep)(keys, values)
Parameters
keys_to_keep
: Array of keys to filter with.keys
: Array of keys.values
: Array of values.
Returned Value
- Returns a tuple of two arrays: keys in sorted order, and values summed for the corresponding keys.
Example
Query:
SELECT sumMapFiltered([1, 4, 8])(mapKeys(statusMap),mapValues(statusMap)) as summapfiltered FROM ( select c1::Date as date, c2::DateTime as timeslot, (c3::Array(UInt16), c4::Array(UInt64))::Map(UInt16, UInt64) as statusMap from values( ('2000-01-01', '2000-01-01 00:00:00', [1, 2, 3], [10, 10, 10]), ('2000-01-01', '2000-01-01 00:00:00', [3, 4, 5], [10, 10, 10]), ('2000-01-01', '2000-01-01 00:01:00', [4, 5, 6], [10, 10, 10]), ('2000-01-01', '2000-01-01 00:01:00', [6, 7, 8], [10, 10, 10]) ) )
Result:
┌─────summapfiltered───┐ │ ([1,4,8],[10,20,10]) │ └──────────────────────┘
sumMapFilteredWithOverflow¶
This function behaves the same as sumMap except that it also accepts an array of keys to filter with as a parameter. This can be especially useful when working with a high cardinality of keys. It differs from the sumMapFiltered function in that it does summation with overflow. For example, it returns the same data type for the summation as the argument data type.
Syntax
sumMapFilteredWithOverflow(keys_to_keep)(keys, values)
Parameters
keys_to_keep
: Array of keys to filter with.keys
: Array of keys.values
: Array of values.
Returned Value
- Returns a tuple of two arrays: keys in sorted order, and values summed for the corresponding keys.
Example
In this example we create a table sum_map
, insert some data into it and then use both sumMapFilteredWithOverflow
and sumMapFiltered
and the toTypeName
function for comparison of the result. Where requests
was of type UInt8
in the created table, sumMapFiltered
has promoted the type of the summed values to UInt64
to avoid overflow whereas sumMapFilteredWithOverflow
has kept the type as UInt8
which is not large enough to store the result - i.e. overflow has occurred.
Query:
SELECT sumMapFilteredWithOverflow([1, 4, 8])(mapKeys(statusMap),mapValues(statusMap)) as summap_overflow, toTypeName(summap_overflow) FROM ( select c1::Date as date, c2::DateTime as timeslot, (c3::Array(UInt8), c4::Array(UInt8))::Map(UInt8, UInt8) as statusMap from values( ('2000-01-01', '2000-01-01 00:00:00', [1, 2, 3], [10, 10, 10]), ('2000-01-01', '2000-01-01 00:00:00', [3, 4, 5], [10, 10, 10]), ('2000-01-01', '2000-01-01 00:01:00', [4, 5, 6], [10, 10, 10]), ('2000-01-01', '2000-01-01 00:01:00', [6, 7, 8], [10, 10, 10]) ) )
Result:
┌─sum──────────────────┬─toTypeName(sum)───────────────────┐ │ ([1,4,8],[10,20,10]) │ Tuple(Array(UInt8), Array(UInt8)) │ └──────────────────────┴───────────────────────────────────┘
Query:
SELECT sumMapFiltered([1, 4, 8])(mapKeys(statusMap),mapValues(statusMap)) as summap, toTypeName(summap) FROM ( select c1::Date as date, c2::DateTime as timeslot, (c3::Array(UInt8), c4::Array(UInt8))::Map(UInt8, UInt8) as statusMap from values( ('2000-01-01', '2000-01-01 00:00:00', [1, 2, 3], [10, 10, 10]), ('2000-01-01', '2000-01-01 00:00:00', [3, 4, 5], [10, 10, 10]), ('2000-01-01', '2000-01-01 00:01:00', [4, 5, 6], [10, 10, 10]), ('2000-01-01', '2000-01-01 00:01:00', [6, 7, 8], [10, 10, 10]) ) )
Result:
┌─summap───────────────┬─toTypeName(summap)─────────────────┐ 1. │ ([1,4,8],[10,20,10]) │ Tuple(Array(UInt8), Array(UInt64)) │ └──────────────────────┴────────────────────────────────────┘