Aggregate functions¶
Aggregate functions perform a calculation on a set of values and return a single value. They are commonly used in data analysis and reporting to summarize large datasets. These functions can be used to compute sums, averages, counts, and other statistical measures.
Aggregate Function Combinators¶
The name of an aggregate function can have a suffix appended to it. This changes the way the aggregate function works.
-If
The suffix -If can be appended to the name of any aggregate function. In this case, the aggregate function accepts an extra argument – a condition (UInt8 type). The aggregate function processes only the rows that trigger the condition. If the condition was not triggered even once, it returns a default value (usually zeros or empty strings).
Examples: sumIf(column, cond), countIf(cond), avgIf(x, cond), quantilesTimingIf(level1, level2)(x, cond), argMinIf(arg, val, cond), and so on.
With conditional aggregate functions, you can calculate aggregates for several conditions at once, without using subqueries and JOINs. For example, conditional aggregate functions can be used to implement the segment comparison functionality.
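As a sketch of the idea, the conditional variants let one query compute both the overall and the filtered aggregate in a single pass. The visits table and its revenue column below are hypothetical:

```sql
-- Aggregate all rows and only the rows matching a condition in one pass,
-- with no subqueries or JOINs. Table and column names are illustrative only.
SELECT
    count() AS total_visits,
    countIf(revenue > 0) AS paid_visits,
    avgIf(revenue, revenue > 0) AS avg_paid_revenue
FROM visits
```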
-Array
The -Array suffix can be appended to any aggregate function. In this case, the aggregate function takes arguments of the 'Array(T)' type (arrays) instead of 'T' type arguments. If the aggregate function accepts multiple arguments, these must be arrays of equal length. When processing arrays, the aggregate function works like the original aggregate function across all array elements.
Example 1: sumArray(arr) – totals all the elements of all 'arr' arrays. In this example, it could have been written more simply: sum(arraySum(arr)).
Example 2: uniqArray(arr) – counts the number of unique elements in all 'arr' arrays. This could be done an easier way: uniq(arrayJoin(arr)), but it's not always possible to add 'arrayJoin' to a query.
-If and -Array can be combined. However, 'Array' must come first, then 'If'. Examples: uniqArrayIf(arr, cond), quantilesTimingArrayIf(level1, level2)(arr, cond). Due to this order, the 'cond' argument won't be an array.
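A sketch of the combined combinators, with the data supplied inline via the values table function used elsewhere on this page; the column names are illustrative:

```sql
-- uniqArrayIf counts unique elements across arrays, but only for rows
-- where the (non-array) condition holds.
SELECT uniqArrayIf(arr, flag = 1)
FROM values('arr Array(UInt8), flag UInt8', ([1, 2], 1), ([2, 3], 1), ([7, 8], 0))
```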
-Map
The -Map suffix can be appended to any aggregate function. This creates an aggregate function which takes a Map type as an argument, and aggregates the values of each key of the map separately using the specified aggregate function. The result is also of Map type.
Example
WITH map_map AS ( SELECT c1::Date AS date, c2::DateTime AS timeslot, c3::Map(String, UInt16) AS status FROM values ( ('2000-01-01', '2000-01-01 00:00:00', (['a', 'b', 'c'], [10, 10, 10])), ('2000-01-01', '2000-01-01 00:00:00', (['c', 'd', 'e'], [10, 10, 10])), ('2000-01-01', '2000-01-01 00:01:00', (['d', 'e', 'f'], [10, 10, 10])), ('2000-01-01', '2000-01-01 00:01:00', (['f', 'g', 'g'], [10, 10, 10])) )) SELECT timeslot, sumMap(status), avgMap(status), minMap(status) FROM map_map GROUP BY timeslot ┌────────────timeslot─┬─sumMap(status)───────────────────────┬─avgMap(status)───────────────────────┬─minMap(status)───────────────────────┐ │ 2000-01-01 00:00:00 │ {'a':10,'b':10,'c':20,'d':10,'e':10} │ {'a':10,'b':10,'c':10,'d':10,'e':10} │ {'a':10,'b':10,'c':10,'d':10,'e':10} │ │ 2000-01-01 00:01:00 │ {'d':10,'e':10,'f':20,'g':20} │ {'d':10,'e':10,'f':10,'g':10} │ {'d':10,'e':10,'f':10,'g':10} │ └─────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┘
-SimpleState
If you apply this combinator, the aggregate function returns the same value but with a different type. This is a SimpleAggregateFunction(...) that can be stored in a table to work with AggregatingMergeTree tables.
Syntax
<aggFunction>SimpleState(x)
Arguments
x: Aggregate function parameters.
Returned values
The value of an aggregate function with the SimpleAggregateFunction(...) type.
Example
Query:
WITH anySimpleState(number) AS c SELECT toTypeName(c), c FROM numbers(1)
Result:
┌─toTypeName(c)────────────────────────┬─c─┐ │ SimpleAggregateFunction(any, UInt64) │ 0 │ └──────────────────────────────────────┴───┘
-State
If you apply this combinator, the aggregate function does not return the resulting value (such as the number of unique values for the uniq function), but an intermediate state of the aggregation (for uniq, this is the hash table for calculating the number of unique values). This is an AggregateFunction(...) that can be used for further processing or stored in a table to finish aggregating later.
Note that -MapState is not invariant for the same data, because the order of data in the intermediate state can change, though this does not affect ingestion of this data.
To work with these states, use:
- AggregatingMergeTree table engine.
- finalizeAggregation function.
- runningAccumulate function.
- -Merge combinator.
- -MergeState combinator.
-Merge
If you apply this combinator, the aggregate function takes the intermediate aggregation state as an argument, combines the states to finish aggregation, and returns the resulting value.
-MergeState
Merges the intermediate aggregation states in the same way as the -Merge combinator. However, it does not return the resulting value, but an intermediate aggregation state, similar to the -State combinator.
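A minimal illustration of the -State/-Merge round trip, not tied to any table engine:

```sql
-- uniqState produces an intermediate AggregateFunction state;
-- uniqMerge combines such states and finishes the aggregation.
SELECT uniqMerge(state)
FROM
(
    SELECT uniqState(number) AS state FROM numbers(10)
)
-- returns 10, the same as uniq(number) over the full input
```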
-ForEach
Converts an aggregate function for tables into an aggregate function for arrays that aggregates the corresponding array items and returns an array of results. For example, sumForEach for the arrays [1, 2], [3, 4, 5], and [6, 7] returns the result [10, 13, 5] after adding together the corresponding array items.
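The example above can be reproduced with the values table function:

```sql
SELECT sumForEach(arr)
FROM values('arr Array(UInt8)', [1, 2], [3, 4, 5], [6, 7])
-- [10, 13, 5]
```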
-Distinct
Every unique combination of arguments will be aggregated only once. Repeating values are ignored. Examples: sum(DISTINCT x) (or sumDistinct(x)), groupArray(DISTINCT x) (or groupArrayDistinct(x)), corrStable(DISTINCT x, y) (or corrStableDistinct(x, y)), and so on.
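A small sketch contrasting sum with its -Distinct variant:

```sql
SELECT sum(x), sumDistinct(x)
FROM values('x UInt8', 1, 1, 2, 3, 3)
-- sum(x) = 10, sumDistinct(x) = 6 (each unique value counted once)
```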
-OrDefault
Changes behavior of an aggregate function.
If an aggregate function does not have input values, with this combinator it returns the default value for its return data type. Applies to aggregate functions that can take empty input data.
-OrDefault can be used with other combinators.
Syntax
<aggFunction>OrDefault(x)
Arguments
x: Aggregate function parameters.
Returned values
Returns the default value of an aggregate function's return type if there is nothing to aggregate.
Type depends on the aggregate function used.
Example
Query:
SELECT avg(number), avgOrDefault(number) FROM numbers(0)
Result:
┌─avg(number)─┬─avgOrDefault(number)─┐ │ nan │ 0 │ └─────────────┴──────────────────────┘
-OrDefault can also be used with other combinators. This is useful when the aggregate function does not accept empty input.
Query:
SELECT avgOrDefaultIf(x, x > 10) FROM ( SELECT toDecimal32(1.23, 2) AS x )
Result:
┌─avgOrDefaultIf(x, greater(x, 10))─┐ │ 0.00 │ └───────────────────────────────────┘
-OrNull
Changes behavior of an aggregate function.
This combinator converts a result of an aggregate function to the Nullable data type. If the aggregate function does not have values to calculate it returns NULL.
-OrNull can be used with other combinators.
Syntax
<aggFunction>OrNull(x)
Arguments
x: Aggregate function parameters.
Returned values
- The result of the aggregate function, converted to the Nullable data type. NULL, if there is nothing to aggregate.
Type: Nullable(aggregate function return type).
Example
Add -OrNull to the end of the aggregate function.
Query:
SELECT sumOrNull(number), toTypeName(sumOrNull(number)) FROM numbers(10) WHERE number > 10
Result:
┌─sumOrNull(number)─┬─toTypeName(sumOrNull(number))─┐ │ ᴺᵁᴸᴸ │ Nullable(UInt64) │ └───────────────────┴───────────────────────────────┘
-OrNull can also be used with other combinators. This is useful when the aggregate function does not accept empty input.
Query:
SELECT avgOrNullIf(x, x > 10) FROM ( SELECT toDecimal32(1.23, 2) AS x )
Result:
┌─avgOrNullIf(x, greater(x, 10))─┐ │ ᴺᵁᴸᴸ │ └────────────────────────────────┘
-Resample
Lets you divide data into groups, and then separately aggregate the data in those groups. Groups are created by splitting the values from one column into intervals.
<aggFunction>Resample(start, end, step)(<aggFunction_params>, resampling_key)
Arguments
- start: Starting value of the whole required interval for resampling_key values.
- stop: Ending value of the whole required interval for resampling_key values. The whole interval does not include the stop value: [start, stop).
- step: Step for separating the whole interval into subintervals. The aggFunction is executed over each of those subintervals independently.
- resampling_key: Column whose values are used for separating data into intervals.
- aggFunction_params: aggFunction parameters.
Returned values
- Array of aggFunction results for each subinterval.
Example
Consider the people table with the following data:
┌─name───┬─age─┬─wage─┐ │ John │ 16 │ 10 │ │ Alice │ 30 │ 15 │ │ Mary │ 35 │ 8 │ │ Evelyn │ 48 │ 11.5 │ │ David │ 62 │ 9.9 │ │ Brian │ 60 │ 16 │ └────────┴─────┴──────┘
Let's get the names of the people whose age lies in the intervals of [30,60) and [60,75). Since we use integer representation for age, we get ages in the [30, 59] and [60, 74] intervals.
To aggregate names in an array, we use the groupArray aggregate function. It takes one argument. In our case, it's the name column. The groupArrayResample function should use the age column to aggregate names by age. To define the required intervals, we pass the 30, 75, 30 arguments into the groupArrayResample function.
SELECT groupArrayResample(30, 75, 30)(name, age) FROM people
┌─groupArrayResample(30, 75, 30)(name, age)─────┐ │ [['Alice','Mary','Evelyn'],['David','Brian']] │ └───────────────────────────────────────────────┘
Consider the results.
John is out of the sample because he's too young. Other people are distributed according to the specified age intervals.
Now let's count the total number of people and their average wage in the specified age intervals.
SELECT countResample(30, 75, 30)(name, age) AS amount, avgResample(30, 75, 30)(wage, age) AS avg_wage FROM people
┌─amount─┬─avg_wage──────────────────┐ │ [3,2] │ [11.5,12.949999809265137] │ └────────┴───────────────────────────┘
-ArgMin
The suffix -ArgMin can be appended to the name of any aggregate function. In this case, the aggregate function accepts an additional argument, which should be any comparable expression. The aggregate function processes only the rows that have the minimum value for the specified extra expression.
Examples: sumArgMin(column, expr), countArgMin(expr), avgArgMin(x, expr), and so on.
-ArgMax
Similar to the -ArgMin suffix, but processes only the rows that have the maximum value for the specified extra expression.
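A hedged sketch of -ArgMax: sum the values on the rows carrying the latest timestamp. The inline data and column names are illustrative:

```sql
-- sumArgMax sums `amount` only over the rows where `ts` is at its maximum.
SELECT sumArgMax(amount, ts)
FROM values('amount UInt32, ts UInt32', (1, 1), (2, 2), (3, 2))
```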
aggThrow¶
This function can be used for the purpose of testing exception safety. It will throw an exception on creation with the specified probability.
Syntax
aggThrow(throw_prob)
Arguments
throw_prob: Probability to throw on creation. Float64.
Returned value
- An exception: Code: 503. DB::Exception: Aggregate function aggThrow has thrown exception successfully.
Example
Query:
SELECT number % 2 AS even, aggThrow(number) FROM numbers(10) GROUP BY even
Result:
Received exception: Code: 503. DB::Exception: Aggregate function aggThrow has thrown exception successfully: While executing AggregatingTransform. (AGGREGATE_FUNCTION_THROW)
analysisOfVariance¶
Provides a statistical test for one-way analysis of variance (ANOVA test). It is a test over several groups of normally distributed observations to find out whether all groups have the same mean or not.
Syntax
analysisOfVariance(val, group_no)
Aliases: anova
Parameters
- val: Value.
- group_no: Group number that val belongs to.
Groups are enumerated starting from 0 and there should be at least two groups to perform a test. There should be at least one group with the number of observations greater than one.
Returned value
(f_statistic, p_value). Tuple(Float64, Float64).
Example
Query:
SELECT analysisOfVariance(number, number % 2) FROM numbers(1048575)
Result:
┌─analysisOfVariance(number, modulo(number, 2))─┐ │ (0,1) │ └───────────────────────────────────────────────┘
any¶
Selects the first encountered value of a column.
As a query can be executed in arbitrary order, the result of this function is non-deterministic. If you need an arbitrary but deterministic result, use the min or max functions.
By default, the function never returns NULL, i.e. it ignores NULL values in the input column. However, if the function is used with the RESPECT NULLS modifier, it returns the first value read, whether it is NULL or not.
Syntax
any(column) [RESPECT NULLS]
Aliases for any(column) (without RESPECT NULLS): any_value, first_value.
Aliases for any(column) RESPECT NULLS: any_respect_nulls, first_value_respect_nulls, any_value_respect_nulls.
Parameters
column: The column name.
Returned value
The first value encountered.
The return type of the function is the same as the input, except for LowCardinality, which is discarded. This means that given no rows as input it will return the default value of that type (0 for integers, or Null for a Nullable() column). You can use the -OrNull combinator to modify this behaviour.
Example
Query:
WITH cte AS (SELECT arrayJoin([NULL, 'Amsterdam', 'New York', 'Tokyo', 'Valencia', NULL]) as city) SELECT any(city), any_respect_nulls(city) FROM cte
┌─any(city)─┬─any_respect_nulls(city)─┐ │ Amsterdam │ ᴺᵁᴸᴸ │ └───────────┴─────────────────────────┘
anyHeavy¶
Selects a frequently occurring value using the heavy hitters algorithm. If there is a value that occurs more than in half the cases in each of the query's execution threads, this value is returned. Normally, the result is nondeterministic.
anyHeavy(column)
Arguments
column – The column name.
Example
Select a frequently occurring value in a small set of numbers:
WITH cte AS (SELECT arrayJoin([2,1,1,1,3,1,1,2,2]) as n) SELECT any(n), anyHeavy(n) FROM cte
┌─any(n)─┬─anyHeavy(n)─┐ │ 2 │ 1 │ └────────┴─────────────┘
anyLast¶
Selects the last encountered value of a column.
As a query can be executed in arbitrary order, the result of this function is non-deterministic. If you need an arbitrary but deterministic result, use the min or max functions.
By default, the function never returns NULL, i.e. it ignores NULL values in the input column. However, if the function is used with the RESPECT NULLS modifier, it returns the last value read, whether it is NULL or not.
Syntax
anyLast(column) [RESPECT NULLS]
Alias for anyLast(column) (without RESPECT NULLS): last_value.
Aliases for anyLast(column) RESPECT NULLS: anyLast_respect_nulls, last_value_respect_nulls.
Parameters
column: The column name.
Returned value
- The last value encountered.
Example
Query:
WITH cte AS (SELECT arrayJoin([NULL, 'Amsterdam', 'New York', 'Tokyo', 'Valencia', NULL]) as city) SELECT anyLast(city), anyLast_respect_nulls(city) FROM cte
┌─anyLast(city)─┬─anyLast_respect_nulls(city)─┐ │ Valencia │ ᴺᵁᴸᴸ │ └───────────────┴─────────────────────────────┘
approx_top_k¶
Returns an array of the approximately most frequent values and their counts in the specified column. The resulting array is sorted in descending order of approximate frequency of values (not by the values themselves).
approx_top_k(N)(column) approx_top_k(N, reserved)(column)
This function does not provide a guaranteed result. In certain situations, errors might occur and it might return frequent values that aren't the most frequent values.
We recommend using values of N < 10; performance is reduced with large N values. The maximum value of N is 65536.
Parameters
- N: The number of elements to return. Optional. Default value: 10.
- reserved: Defines how many cells are reserved for values. If uniq(column) > reserved, the result of the topK function will be approximate. Optional. Default value: N * 3.
Arguments
column: The value to calculate the frequency of.
Example
Query:
SELECT approx_top_k(2)(k) FROM values('k Char, w UInt64', ('y', 1), ('y', 1), ('x', 5), ('y', 1), ('z', 10))
Result:
┌─approx_top_k(2)(k)────┐ │ [('y',3,0),('x',1,0)] │ └───────────────────────┘
approx_top_count¶
Alias of the approx_top_k function.
approx_top_sum¶
Returns an array of the approximately most frequent values and their counts in the specified column. The resulting array is sorted in descending order of approximate frequency of values (not by the values themselves). Additionally, the weight of the value is taken into account.
approx_top_sum(N)(column, weight) approx_top_sum(N, reserved)(column, weight)
This function does not provide a guaranteed result. In certain situations, errors might occur and it might return frequent values that aren't the most frequent values.
We recommend using values of N < 10; performance is reduced with large N values. The maximum value of N is 65536.
Parameters
- N: The number of elements to return. Optional. Default value: 10.
- reserved: Defines how many cells are reserved for values. If uniq(column) > reserved, the result of the topK function will be approximate. Optional. Default value: N * 3.
Arguments
- column: The value to calculate the frequency of.
- weight: The weight. Every value is accounted weight times for frequency calculation. UInt64.
Example
Query:
SELECT approx_top_sum(2)(k, w) FROM values('k Char, w UInt64', ('y', 1), ('y', 1), ('x', 5), ('y', 1), ('z', 10))
Result:
┌─approx_top_sum(2)(k, w)─┐ │ [('z',10,0),('x',5,0)] │ └─────────────────────────┘
argMax¶
Calculates the arg value for a maximum val value. If there are multiple rows with equal val being the maximum, which of the associated arg values is returned is not deterministic. Both parts, arg and max, behave as aggregate functions; they both skip Null during processing and return not-Null values when not-Null values are available.
Syntax
argMax(arg, val)
Arguments
- arg: Argument.
- val: Value.
Returned value
arg value that corresponds to the maximum val value.
Type: matches arg type.
Example
Input table:
┌─user─────┬─salary─┐ │ director │ 5000 │ │ manager │ 3000 │ │ worker │ 1000 │ └──────────┴────────┘
Query:
SELECT argMax(user, salary) FROM salary
Result:
┌─argMax(user, salary)─┐ │ director │ └──────────────────────┘
argMin¶
Calculates the arg value for a minimum val value. If there are multiple rows with equal val being the minimum, which of the associated arg values is returned is not deterministic. Both parts, arg and min, behave as aggregate functions; they both skip Null during processing and return not-Null values when not-Null values are available.
Syntax
argMin(arg, val)
Arguments
- arg: Argument.
- val: Value.
Returned value
arg value that corresponds to the minimum val value.
Type: matches arg type.
Example
Input table:
┌─user─────┬─salary─┐ │ director │ 5000 │ │ manager │ 3000 │ │ worker │ 1000 │ └──────────┴────────┘
Query:
SELECT argMin(user, salary) FROM salary
Result:
┌─argMin(user, salary)─┐ │ worker │ └──────────────────────┘
array_concat_agg¶
- Alias of groupArrayArray. The function is case insensitive.
Example
SELECT * FROM t ┌─a───────┐ │ [1,2,3] │ │ [4,5] │ │ [6] │ └─────────┘
Query:
SELECT array_concat_agg(a) AS a FROM t ┌─a─────────────┐ │ [1,2,3,4,5,6] │ └───────────────┘
avg¶
Calculates the arithmetic mean.
Syntax
avg(x)
Arguments
x: Input values, must be Integer, Float, or Decimal.
Returned value
- The arithmetic mean, always as Float64.
- NaN if the input parameter x is empty.
Example
Query:
SELECT avg(x) FROM values('x Int8', 0, 1, 2, 3, 4, 5)
Result:
┌─avg(x)─┐ │ 2.5 │ └────────┘
avgWeighted¶
Calculates the weighted arithmetic mean.
Syntax
avgWeighted(x, weight)
Arguments
- x: Values.
- weight: Weights of the values.
x and weight must both be Integer or floating-point, but may have different types.
Returned value
- NaN if all the weights are equal to 0 or the supplied weights parameter is empty.
- Weighted mean otherwise.
Return type is always Float64.
Example
Query:
SELECT avgWeighted(x, w) FROM values('x Int8, w Int8', (4, 1), (1, 0), (10, 2))
Result:
┌─avgWeighted(x, weight)─┐ │ 8 │ └────────────────────────┘
Example
Query:
SELECT avgWeighted(x, w) FROM values('x Int8, w Float64', (4, 1), (1, 0), (10, 2))
Result:
┌─avgWeighted(x, weight)─┐ │ 8 │ └────────────────────────┘
Example
Query:
SELECT avgWeighted(x, w) FROM values('x Int8, w Int8', (0, 0), (1, 0), (10, 0))
Result:
┌─avgWeighted(x, weight)─┐ │ nan │ └────────────────────────┘
boundingRatio¶
Aggregate function that calculates the slope between the leftmost and rightmost points across a group of values.
Example
Sample data:
SELECT number, number * 1.5 FROM numbers(10)
┌─number─┬─multiply(number, 1.5)─┐ │ 0 │ 0 │ │ 1 │ 1.5 │ │ 2 │ 3 │ │ 3 │ 4.5 │ │ 4 │ 6 │ │ 5 │ 7.5 │ │ 6 │ 9 │ │ 7 │ 10.5 │ │ 8 │ 12 │ │ 9 │ 13.5 │ └────────┴───────────────────────┘
The boundingRatio() function returns the slope of the line between the leftmost and rightmost points; in the above data these points are (0,0) and (9,13.5).
SELECT boundingRatio(number, number * 1.5) FROM numbers(10)
┌─boundingRatio(number, multiply(number, 1.5))─┐ │ 1.5 │ └──────────────────────────────────────────────┘
categoricalInformationValue¶
Calculates the value of (P(tag = 1) - P(tag = 0))(log(P(tag = 1)) - log(P(tag = 0))) for each category.
categoricalInformationValue(category1, category2, ..., tag)
The result indicates how a discrete (categorical) feature [category1, category2, ...] contributes to a learning model that predicts the value of tag.
contingency¶
The contingency function calculates the contingency coefficient, a value that measures the association between two columns in a table. The computation is similar to the cramersV function but with a different denominator in the square root.
Syntax
contingency(column1, column2)
Arguments
column1 and column2 are the columns to be compared.
Returned value
- a value between 0 and 1. The larger the result, the closer the association of the two columns.
Return type is always Float64.
Example
The two columns being compared below have a small association with each other. We have included the result of cramersV as well (for comparison):
SELECT cramersV(a, b), contingency(a ,b) FROM ( SELECT number % 10 AS a, number % 4 AS b FROM numbers(150) )
Result:
┌──────cramersV(a, b)─┬───contingency(a, b)─┐ │ 0.41171788506213564 │ 0.05812725261759165 │ └─────────────────────┴─────────────────────┘
corr¶
Calculates the Pearson correlation coefficient: Σ((x - x̄)(y - ȳ)) / √(Σ(x - x̄)²*Σ(y - ȳ)²)
This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the corrStable function. It is slower but provides a more accurate result.
Syntax
corr(x, y)
Arguments
- x: first variable. (U)Int*, Float*, Decimal.
- y: second variable. (U)Int*, Float*, Decimal.
Returned value
- The Pearson correlation coefficient. Float64.
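For example, two exactly linearly related variables have a correlation coefficient of 1:

```sql
SELECT corr(x, y)
FROM values('x Int32, y Int32', (1, 2), (2, 4), (3, 6))
-- ≈ 1 (y is a perfect linear function of x)
```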
corrMatrix¶
Computes the correlation matrix over N variables.
Syntax
corrMatrix(x[, ...])
Arguments
x: a variable number of parameters. (U)Int*, Float*, Decimal.
Returned value
- Correlation matrix. Array(Array(Float64)).
corrStable¶
Calculates the Pearson correlation coefficient: Σ((x - x̄)(y - ȳ)) / √(Σ(x - x̄)²*Σ(y - ȳ)²)
Similar to the corr function, but uses a numerically stable algorithm. As a result, corrStable is slower than corr but produces a more accurate result.
Syntax
corrStable(x, y)
Arguments
- x: first variable. (U)Int*, Float*, Decimal.
- y: second variable. (U)Int*, Float*, Decimal.
Returned value
- The Pearson correlation coefficient. Float64.
count¶
Counts the number of rows or not-NULL values.
Use the following syntaxes for count:
- count(expr) or COUNT(DISTINCT expr).
- count() or COUNT(*).
Arguments
The function can take:
- Zero parameters.
- One expression.
Returned value
- If the function is called without parameters it counts the number of rows.
- If the expression is passed, then the function counts how many times this expression returned not null. If the expression returns a Nullable-type value, then the result of count stays not Nullable. The function returns 0 if the expression returned NULL for all the rows.
In both cases the type of the returned value is UInt64.
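The difference between the two forms can be seen with a Nullable column:

```sql
SELECT count(), count(x)
FROM values('x Nullable(UInt8)', 1, NULL, 3)
-- count() = 3 (all rows), count(x) = 2 (non-NULL values only)
```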
covarPop¶
Calculates the population covariance: Σ(x - x̄)(y - ȳ) / n
This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the covarPopStable function. It works slower but provides a lower computational error.
Syntax
covarPop(x, y)
Arguments
- x: first variable. (U)Int*, Float*, Decimal.
- y: second variable. (U)Int*, Float*, Decimal.
Returned value
- The population covariance between x and y. Float64.
covarPopMatrix¶
Returns the population covariance matrix over N variables.
Syntax
covarPopMatrix(x[, ...])
Arguments
x: a variable number of parameters. (U)Int*, Float*, Decimal.
Returned value
- Population covariance matrix. Array(Array(Float64)).
covarPopStable¶
Calculates the value of the population covariance: Σ(x - x̄)(y - ȳ) / n
It is similar to the covarPop function, but uses a numerically stable algorithm. As a result, covarPopStable is slower than covarPop but produces a more accurate result.
Syntax
covarPopStable(x, y)
Arguments
- x: first variable. (U)Int*, Float*, Decimal.
- y: second variable. (U)Int*, Float*, Decimal.
Returned value
- The population covariance between x and y. Float64.
covarSamp¶
Calculates the value of Σ((x - x̅)(y - y̅)) / (n - 1).
This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the covarSampStable function. It works slower but provides a lower computational error.
Syntax
covarSamp(x, y)
Arguments
- x: first variable. (U)Int*, Float*, Decimal.
- y: second variable. (U)Int*, Float*, Decimal.
Returned value
- The sample covariance between x and y. For n <= 1, nan is returned. Float64.
covarSampMatrix¶
Returns the sample covariance matrix over N variables.
Syntax
covarSampMatrix(x[, ...])
Arguments
x: a variable number of parameters. (U)Int*, Float*, Decimal.
Returned value
- Sample covariance matrix. Array(Array(Float64)).
covarSampStable¶
Calculates the value of Σ((x - x̅)(y - y̅)) / (n - 1). Similar to covarSamp, but works slower while providing a lower computational error.
Syntax
covarSampStable(x, y)
Arguments
- x: first variable. (U)Int*, Float*, Decimal.
- y: second variable. (U)Int*, Float*, Decimal.
Returned value
- The sample covariance between x and y. For n <= 1, inf is returned. Float64.
cramersV¶
Cramer's V (sometimes referred to as Cramer's phi) is a measure of association between two columns in a table. The result of the cramersV function ranges from 0 (corresponding to no association between the variables) to 1 and can reach 1 only when each value is completely determined by the other. It may be viewed as the association between two variables as a percentage of their maximum possible variation.
For a bias-corrected version of Cramer's V see: cramersVBiasCorrected.
Syntax
cramersV(column1, column2)
Parameters
- column1: first column to be compared.
- column2: second column to be compared.
Returned value
- a value between 0 (corresponding to no association between the columns' values) to 1 (complete association).
Type: always Float64.
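As a quick sketch, two fully dependent columns reach the maximum value:

```sql
-- b is completely determined by a, so Cramer's V is 1.
SELECT cramersV(a, b)
FROM (SELECT number % 2 AS a, number % 2 AS b FROM numbers(100))
```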
cramersVBiasCorrected¶
Cramer's V is a measure of association between two columns in a table. The result of the cramersV function ranges from 0 (corresponding to no association between the variables) to 1 and can reach 1 only when each value is completely determined by the other. The function can be heavily biased, so this version of Cramer's V uses the bias correction.
Syntax
cramersVBiasCorrected(column1, column2)
Parameters
- column1: first column to be compared.
- column2: second column to be compared.
Returned value
- a value between 0 (corresponding to no association between the columns' values) to 1 (complete association).
Type: always Float64.
deltaSum¶
Sums the arithmetic difference between consecutive rows. If the difference is negative, it is ignored.
The underlying data must be sorted for this function to work properly. If you would like to use this function in a materialized view, you most likely want to use the deltaSumTimestamp method instead.
Syntax
deltaSum(value)
Arguments
value: Input values, must be Integer or Float type.
Returned value
- The accumulated arithmetic difference, of the Integer or Float type.
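A short illustration using arrayJoin to feed an ordered sequence:

```sql
SELECT deltaSum(arrayJoin([1, 2, 3, 0, 3]))
-- 5: the steps +1, +1 and +3 are summed; the negative step 3 → 0 is ignored
```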
deltaSumTimestamp¶
Adds the difference between consecutive rows. If the difference is negative, it is ignored.
This function is primarily for materialized views that store data ordered by some time bucket-aligned timestamp, for example, a toStartOfMinute bucket. Because the rows in such a materialized view will all have the same timestamp, it is impossible for them to be merged in the correct order without storing the original, unrounded timestamp value. The deltaSumTimestamp function keeps track of the original timestamp of the values it has seen, so the values (states) of the function are correctly computed during merging of parts.
To calculate the delta sum across an ordered collection you can simply use the deltaSum function.
Syntax
deltaSumTimestamp(value, timestamp)
Arguments
- value: Input values, must be some Integer or Float type, or a Date or DateTime.
- timestamp: The parameter for ordering values, must be some Integer or Float type, or a Date or DateTime.
Returned value
- Accumulated differences between consecutive values, ordered by the timestamp parameter.
Type: Integer or Float or Date or DateTime.
distinctDynamicTypes¶
Calculates the list of distinct data types stored in a Dynamic column.
Syntax
distinctDynamicTypes(dynamic)
Arguments
dynamic: Dynamic column.
Returned value
- The sorted list of data type names. Array(String).
distinctJSONPaths¶
Calculates the list of distinct paths stored in a JSON column.
Syntax
distinctJSONPaths(json)
Arguments
json: JSON column.
Returned value
- The sorted list of paths. Array(String).
distinctJSONPathsAndTypes¶
Calculates the list of distinct paths and their types stored in a JSON column.
Syntax
distinctJSONPathsAndTypes(json)
Arguments
json: JSON column.
Returned value
- The sorted map of paths and types. Map(String, Array(String)).
entropy¶
Calculates the Shannon entropy of a column of values.
Syntax
entropy(val)
Arguments
val: Column of values of any type.
Returned value
- Shannon entropy.
Type: Float64.
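For instance, a column that takes two equally likely values carries one bit of entropy:

```sql
SELECT entropy(number % 2) FROM numbers(10)
-- 1: two values, each occurring with probability 0.5
```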
exponentialMovingAverage¶
Calculates the exponential moving average of values for the determined time.
Syntax
exponentialMovingAverage(x)(value, timeunit)
Each value corresponds to the determined timeunit. The half-life x is the time lag at which the exponential weights decay by one-half. The function returns a weighted average: the older the time point, the less weight the corresponding value carries.
Arguments
- value: Value. Integer, Float or Decimal.
- timeunit: Timeunit. Integer, Float or Decimal. The timeunit is not a timestamp (seconds); it is an index of the time interval. Can be calculated using intDiv.
Parameters
x: Half-life period. Integer, Float or Decimal.
Returned values
- Returns an exponentially smoothed moving average of the values for the past x time at the latest point of time.
Type: Float64.
Examples
Input table:
┌──temperature─┬─timestamp──┐ │ 95 │ 1 │ │ 95 │ 2 │ │ 95 │ 3 │ │ 96 │ 4 │ │ 96 │ 5 │ │ 96 │ 6 │ │ 96 │ 7 │ │ 97 │ 8 │ │ 97 │ 9 │ │ 97 │ 10 │ │ 97 │ 11 │ │ 98 │ 12 │ │ 98 │ 13 │ │ 98 │ 14 │ │ 98 │ 15 │ │ 99 │ 16 │ │ 99 │ 17 │ │ 99 │ 18 │ │ 100 │ 19 │ │ 100 │ 20 │ └──────────────┴────────────┘
Query:
SELECT exponentialMovingAverage(5)(temperature, timestamp)
Result:
┌──exponentialMovingAverage(5)(temperature, timestamp)──┐ │ 92.25779635374204 │ └───────────────────────────────────────────────────────┘
Query:
SELECT value, time, round(exp_smooth, 3), bar(exp_smooth, 0, 1, 50) AS bar FROM ( SELECT (number = 0) OR (number >= 25) AS value, number AS time, exponentialMovingAverage(10)(value, time) OVER (Rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_smooth FROM numbers(50) )
Result:
┌─value─┬─time─┬─round(exp_smooth, 3)─┬─bar────────────────────────────────────────┐ │ 1 │ 0 │ 0.067 │ ███▎ │ │ 0 │ 1 │ 0.062 │ ███ │ │ 0 │ 2 │ 0.058 │ ██▊ │ │ 0 │ 3 │ 0.054 │ ██▋ │ │ 0 │ 4 │ 0.051 │ ██▌ │ │ 0 │ 5 │ 0.047 │ ██▎ │ │ 0 │ 6 │ 0.044 │ ██▏ │ │ 0 │ 7 │ 0.041 │ ██ │ │ 0 │ 8 │ 0.038 │ █▊ │ │ 0 │ 9 │ 0.036 │ █▋ │ │ 0 │ 10 │ 0.033 │ █▋ │ │ 0 │ 11 │ 0.031 │ █▌ │ │ 0 │ 12 │ 0.029 │ █▍ │ │ 0 │ 13 │ 0.027 │ █▎ │ │ 0 │ 14 │ 0.025 │ █▎ │ │ 0 │ 15 │ 0.024 │ █▏ │ │ 0 │ 16 │ 0.022 │ █ │ │ 0 │ 17 │ 0.021 │ █ │ │ 0 │ 18 │ 0.019 │ ▊ │ │ 0 │ 19 │ 0.018 │ ▊ │ │ 0 │ 20 │ 0.017 │ ▋ │ │ 0 │ 21 │ 0.016 │ ▋ │ │ 0 │ 22 │ 0.015 │ ▋ │ │ 0 │ 23 │ 0.014 │ ▋ │ │ 0 │ 24 │ 0.013 │ ▋ │ │ 1 │ 25 │ 0.079 │ ███▊ │ │ 1 │ 26 │ 0.14 │ ███████ │ │ 1 │ 27 │ 0.198 │ █████████▊ │ │ 1 │ 28 │ 0.252 │ ████████████▌ │ │ 1 │ 29 │ 0.302 │ ███████████████ │ │ 1 │ 30 │ 0.349 │ █████████████████▍ │ │ 1 │ 31 │ 0.392 │ ███████████████████▌ │ │ 1 │ 32 │ 0.433 │ █████████████████████▋ │ │ 1 │ 33 │ 0.471 │ ███████████████████████▌ │ │ 1 │ 34 │ 0.506 │ █████████████████████████▎ │ │ 1 │ 35 │ 0.539 │ ██████████████████████████▊ │ │ 1 │ 36 │ 0.57 │ ████████████████████████████▌ │ │ 1 │ 37 │ 0.599 │ █████████████████████████████▊ │ │ 1 │ 38 │ 0.626 │ ███████████████████████████████▎ │ │ 1 │ 39 │ 0.651 │ ████████████████████████████████▌ │ │ 1 │ 40 │ 0.674 │ █████████████████████████████████▋ │ │ 1 │ 41 │ 0.696 │ ██████████████████████████████████▋ │ │ 1 │ 42 │ 0.716 │ ███████████████████████████████████▋ │ │ 1 │ 43 │ 0.735 │ ████████████████████████████████████▋ │ │ 1 │ 44 │ 0.753 │ █████████████████████████████████████▋ │ │ 1 │ 45 │ 0.77 │ ██████████████████████████████████████▍ │ │ 1 │ 46 │ 0.785 │ ███████████████████████████████████████▎ │ │ 1 │ 47 │ 0.8 │ ███████████████████████████████████████▊ │ │ 1 │ 48 │ 0.813 │ ████████████████████████████████████████▋ │ │ 1 │ 49 │ 0.825 │ █████████████████████████████████████████▎ │ └───────┴──────┴──────────────────────┴────────────────────────────────────────────┘
exponentialTimeDecayedAvg¶
Returns the exponentially smoothed weighted moving average of values of a time series at point t
in time.
Syntax
exponentialTimeDecayedAvg(x)(v, t)
Arguments
v
: Value. Integer, Float or Decimal.
t
: Time. Integer, Float, Decimal, DateTime or DateTime64.
Parameters
x
: Half-life period. Integer, Float or Decimal.
Returned values
- Returns an exponentially smoothed weighted moving average at index
t
in time. Float64.
Examples
Query:
SELECT value, time, round(exp_smooth, 3), bar(exp_smooth, 0, 5, 50) AS bar FROM ( SELECT (number = 0) OR (number >= 25) AS value, number AS time, exponentialTimeDecayedAvg(10)(value, time) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_smooth FROM numbers(50) )
Result:
┌─value─┬─time─┬─round(exp_smooth, 3)─┬─bar────────┐
 1. │ 1 │  0 │     1 │ ██████████ │
 2. │ 0 │  1 │ 0.475 │ ████▊ │
 3. │ 0 │  2 │ 0.301 │ ███ │
 4. │ 0 │  3 │ 0.214 │ ██▏ │
 5. │ 0 │  4 │ 0.162 │ █▌ │
 6. │ 0 │  5 │ 0.128 │ █▎ │
 7. │ 0 │  6 │ 0.104 │ █ │
 8. │ 0 │  7 │ 0.086 │ ▊ │
 9. │ 0 │  8 │ 0.072 │ ▋ │
10. │ 0 │  9 │ 0.061 │ ▌ │
11. │ 0 │ 10 │ 0.052 │ ▌ │
12. │ 0 │ 11 │ 0.045 │ ▍ │
13. │ 0 │ 12 │ 0.039 │ ▍ │
14. │ 0 │ 13 │ 0.034 │ ▎ │
15. │ 0 │ 14 │  0.03 │ ▎ │
16. │ 0 │ 15 │ 0.027 │ ▎ │
17. │ 0 │ 16 │ 0.024 │ ▏ │
18. │ 0 │ 17 │ 0.021 │ ▏ │
19. │ 0 │ 18 │ 0.018 │ ▏ │
20. │ 0 │ 19 │ 0.016 │ ▏ │
21. │ 0 │ 20 │ 0.015 │ ▏ │
22. │ 0 │ 21 │ 0.013 │ ▏ │
23. │ 0 │ 22 │ 0.012 │ │
24. │ 0 │ 23 │  0.01 │ │
25. │ 0 │ 24 │ 0.009 │ │
26. │ 1 │ 25 │ 0.111 │ █ │
27. │ 1 │ 26 │ 0.202 │ ██ │
28. │ 1 │ 27 │ 0.283 │ ██▊ │
29. │ 1 │ 28 │ 0.355 │ ███▌ │
30. │ 1 │ 29 │  0.42 │ ████▏ │
31. │ 1 │ 30 │ 0.477 │ ████▊ │
32. │ 1 │ 31 │ 0.529 │ █████▎ │
33. │ 1 │ 32 │ 0.576 │ █████▊ │
34. │ 1 │ 33 │ 0.618 │ ██████▏ │
35. │ 1 │ 34 │ 0.655 │ ██████▌ │
36. │ 1 │ 35 │ 0.689 │ ██████▉ │
37. │ 1 │ 36 │ 0.719 │ ███████▏ │
38. │ 1 │ 37 │ 0.747 │ ███████▍ │
39. │ 1 │ 38 │ 0.771 │ ███████▋ │
40. │ 1 │ 39 │ 0.793 │ ███████▉ │
41. │ 1 │ 40 │ 0.813 │ ████████▏ │
42. │ 1 │ 41 │ 0.831 │ ████████▎ │
43. │ 1 │ 42 │ 0.848 │ ████████▍ │
44. │ 1 │ 43 │ 0.862 │ ████████▌ │
45. │ 1 │ 44 │ 0.876 │ ████████▊ │
46. │ 1 │ 45 │ 0.888 │ ████████▉ │
47. │ 1 │ 46 │ 0.898 │ ████████▉ │
48. │ 1 │ 47 │ 0.908 │ █████████ │
49. │ 1 │ 48 │ 0.917 │ █████████▏ │
50. │ 1 │ 49 │ 0.925 │ █████████▏ │
└───────┴──────┴──────────────────────┴────────────┘
exponentialTimeDecayedCount¶
Returns the cumulative exponential decay over a time series at the index t
in time.
Syntax
exponentialTimeDecayedCount(x)(t)
Arguments
t
: Time. Integer, Float, Decimal, DateTime or DateTime64.
Parameters
x
: Half-life period. Integer, Float or Decimal.
Returned values
- Returns the cumulative exponential decay at the given point in time. Float64.
Example
Query:
SELECT value, time, round(exp_smooth, 3), bar(exp_smooth, 0, 20, 50) AS bar FROM ( SELECT (number % 5) = 0 AS value, number AS time, exponentialTimeDecayedCount(10)(time) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_smooth FROM numbers(50) )
Result:
┌─value─┬─time─┬─round(exp_smooth, 3)─┬─bar────────────────────────┐
 1. │ 1 │  0 │      1 │ ██▌ │
 2. │ 0 │  1 │  1.905 │ ████▊ │
 3. │ 0 │  2 │  2.724 │ ██████▊ │
 4. │ 0 │  3 │  3.464 │ ████████▋ │
 5. │ 0 │  4 │  4.135 │ ██████████▎ │
 6. │ 1 │  5 │  4.741 │ ███████████▊ │
 7. │ 0 │  6 │   5.29 │ █████████████▏ │
 8. │ 0 │  7 │  5.787 │ ██████████████▍ │
 9. │ 0 │  8 │  6.236 │ ███████████████▌ │
10. │ 0 │  9 │  6.643 │ ████████████████▌ │
11. │ 1 │ 10 │   7.01 │ █████████████████▌ │
12. │ 0 │ 11 │  7.343 │ ██████████████████▎ │
13. │ 0 │ 12 │  7.644 │ ███████████████████ │
14. │ 0 │ 13 │  7.917 │ ███████████████████▊ │
15. │ 0 │ 14 │  8.164 │ ████████████████████▍ │
16. │ 1 │ 15 │  8.387 │ ████████████████████▉ │
17. │ 0 │ 16 │  8.589 │ █████████████████████▍ │
18. │ 0 │ 17 │  8.771 │ █████████████████████▉ │
19. │ 0 │ 18 │  8.937 │ ██████████████████████▎ │
20. │ 0 │ 19 │  9.086 │ ██████████████████████▋ │
21. │ 1 │ 20 │  9.222 │ ███████████████████████ │
22. │ 0 │ 21 │  9.344 │ ███████████████████████▎ │
23. │ 0 │ 22 │  9.455 │ ███████████████████████▋ │
24. │ 0 │ 23 │  9.555 │ ███████████████████████▉ │
25. │ 0 │ 24 │  9.646 │ ████████████████████████ │
26. │ 1 │ 25 │  9.728 │ ████████████████████████▎ │
27. │ 0 │ 26 │  9.802 │ ████████████████████████▌ │
28. │ 0 │ 27 │  9.869 │ ████████████████████████▋ │
29. │ 0 │ 28 │   9.93 │ ████████████████████████▊ │
30. │ 0 │ 29 │  9.985 │ ████████████████████████▉ │
31. │ 1 │ 30 │ 10.035 │ █████████████████████████ │
32. │ 0 │ 31 │  10.08 │ █████████████████████████▏ │
33. │ 0 │ 32 │ 10.121 │ █████████████████████████▎ │
34. │ 0 │ 33 │ 10.158 │ █████████████████████████▍ │
35. │ 0 │ 34 │ 10.191 │ █████████████████████████▍ │
36. │ 1 │ 35 │ 10.221 │ █████████████████████████▌ │
37. │ 0 │ 36 │ 10.249 │ █████████████████████████▌ │
38. │ 0 │ 37 │ 10.273 │ █████████████████████████▋ │
39. │ 0 │ 38 │ 10.296 │ █████████████████████████▋ │
40. │ 0 │ 39 │ 10.316 │ █████████████████████████▊ │
41. │ 1 │ 40 │ 10.334 │ █████████████████████████▊ │
42. │ 0 │ 41 │ 10.351 │ █████████████████████████▉ │
43. │ 0 │ 42 │ 10.366 │ █████████████████████████▉ │
44. │ 0 │ 43 │ 10.379 │ █████████████████████████▉ │
45. │ 0 │ 44 │ 10.392 │ █████████████████████████▉ │
46. │ 1 │ 45 │ 10.403 │ ██████████████████████████ │
47. │ 0 │ 46 │ 10.413 │ ██████████████████████████ │
48. │ 0 │ 47 │ 10.422 │ ██████████████████████████ │
49. │ 0 │ 48 │  10.43 │ ██████████████████████████ │
50. │ 0 │ 49 │ 10.438 │ ██████████████████████████ │
└───────┴──────┴──────────────────────┴────────────────────────────┘
exponentialTimeDecayedMax¶
Returns the maximum of the exponentially smoothed moving average computed at index t in time and that computed at index t-1.
Syntax
exponentialTimeDecayedMax(x)(value, timeunit)
Arguments
value
: Value. Integer, Float or Decimal.
timeunit
: Time unit. Integer, Float, Decimal, DateTime or DateTime64.
Parameters
x
: Half-life period. Integer, Float or Decimal.
Returned values
- Returns the maximum of the exponentially smoothed weighted moving average at
t
andt-1
. Float64.
Example
Query:
SELECT value, time, round(exp_smooth, 3), bar(exp_smooth, 0, 5, 50) AS bar FROM ( SELECT (number = 0) OR (number >= 25) AS value, number AS time, exponentialTimeDecayedMax(10)(value, time) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_smooth FROM numbers(50) )
Result:
┌─value─┬─time─┬─round(exp_smooth, 3)─┬─bar────────┐ 1. │ 1 │ 0 │ 1 │ ██████████ │ 2. │ 0 │ 1 │ 0.905 │ █████████ │ 3. │ 0 │ 2 │ 0.819 │ ████████▏ │ 4. │ 0 │ 3 │ 0.741 │ ███████▍ │ 5. │ 0 │ 4 │ 0.67 │ ██████▋ │ 6. │ 0 │ 5 │ 0.607 │ ██████ │ 7. │ 0 │ 6 │ 0.549 │ █████▍ │ 8. │ 0 │ 7 │ 0.497 │ ████▉ │ 9. │ 0 │ 8 │ 0.449 │ ████▍ │ 10. │ 0 │ 9 │ 0.407 │ ████ │ 11. │ 0 │ 10 │ 0.368 │ ███▋ │ 12. │ 0 │ 11 │ 0.333 │ ███▎ │ 13. │ 0 │ 12 │ 0.301 │ ███ │ 14. │ 0 │ 13 │ 0.273 │ ██▋ │ 15. │ 0 │ 14 │ 0.247 │ ██▍ │ 16. │ 0 │ 15 │ 0.223 │ ██▏ │ 17. │ 0 │ 16 │ 0.202 │ ██ │ 18. │ 0 │ 17 │ 0.183 │ █▊ │ 19. │ 0 │ 18 │ 0.165 │ █▋ │ 20. │ 0 │ 19 │ 0.15 │ █▍ │ 21. │ 0 │ 20 │ 0.135 │ █▎ │ 22. │ 0 │ 21 │ 0.122 │ █▏ │ 23. │ 0 │ 22 │ 0.111 │ █ │ 24. │ 0 │ 23 │ 0.1 │ █ │ 25. │ 0 │ 24 │ 0.091 │ ▉ │ 26. │ 1 │ 25 │ 1 │ ██████████ │ 27. │ 1 │ 26 │ 1 │ ██████████ │ 28. │ 1 │ 27 │ 1 │ ██████████ │ 29. │ 1 │ 28 │ 1 │ ██████████ │ 30. │ 1 │ 29 │ 1 │ ██████████ │ 31. │ 1 │ 30 │ 1 │ ██████████ │ 32. │ 1 │ 31 │ 1 │ ██████████ │ 33. │ 1 │ 32 │ 1 │ ██████████ │ 34. │ 1 │ 33 │ 1 │ ██████████ │ 35. │ 1 │ 34 │ 1 │ ██████████ │ 36. │ 1 │ 35 │ 1 │ ██████████ │ 37. │ 1 │ 36 │ 1 │ ██████████ │ 38. │ 1 │ 37 │ 1 │ ██████████ │ 39. │ 1 │ 38 │ 1 │ ██████████ │ 40. │ 1 │ 39 │ 1 │ ██████████ │ 41. │ 1 │ 40 │ 1 │ ██████████ │ 42. │ 1 │ 41 │ 1 │ ██████████ │ 43. │ 1 │ 42 │ 1 │ ██████████ │ 44. │ 1 │ 43 │ 1 │ ██████████ │ 45. │ 1 │ 44 │ 1 │ ██████████ │ 46. │ 1 │ 45 │ 1 │ ██████████ │ 47. │ 1 │ 46 │ 1 │ ██████████ │ 48. │ 1 │ 47 │ 1 │ ██████████ │ 49. │ 1 │ 48 │ 1 │ ██████████ │ 50. │ 1 │ 49 │ 1 │ ██████████ │ └───────┴──────┴──────────────────────┴────────────┘
exponentialTimeDecayedSum¶
Returns the sum of exponentially smoothed moving average values of a time series at the index t
in time.
Syntax
exponentialTimeDecayedSum(x)(v, t)
Arguments
v
: Value. Integer, Float or Decimal.
t
: Time. Integer, Float, Decimal, DateTime or DateTime64.
Parameters
x
: Half-life period. Integer, Float or Decimal.
Returned values
- Returns the sum of exponentially smoothed moving average values at the given point in time. Float64.
Example
Query:
SELECT value, time, round(exp_smooth, 3), bar(exp_smooth, 0, 10, 50) AS bar FROM ( SELECT (number = 0) OR (number >= 25) AS value, number AS time, exponentialTimeDecayedSum(10)(value, time) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_smooth FROM numbers(50) )
Result:
┌─value─┬─time─┬─round(exp_smooth, 3)─┬─bar───────────────────────────────────────────────┐
 1. │ 1 │  0 │     1 │ █████ │
 2. │ 0 │  1 │ 0.905 │ ████▌ │
 3. │ 0 │  2 │ 0.819 │ ████ │
 4. │ 0 │  3 │ 0.741 │ ███▋ │
 5. │ 0 │  4 │  0.67 │ ███▎ │
 6. │ 0 │  5 │ 0.607 │ ███ │
 7. │ 0 │  6 │ 0.549 │ ██▋ │
 8. │ 0 │  7 │ 0.497 │ ██▍ │
 9. │ 0 │  8 │ 0.449 │ ██▏ │
10. │ 0 │  9 │ 0.407 │ ██ │
11. │ 0 │ 10 │ 0.368 │ █▊ │
12. │ 0 │ 11 │ 0.333 │ █▋ │
13. │ 0 │ 12 │ 0.301 │ █▌ │
14. │ 0 │ 13 │ 0.273 │ █▎ │
15. │ 0 │ 14 │ 0.247 │ █▏ │
16. │ 0 │ 15 │ 0.223 │ █ │
17. │ 0 │ 16 │ 0.202 │ █ │
18. │ 0 │ 17 │ 0.183 │ ▉ │
19. │ 0 │ 18 │ 0.165 │ ▊ │
20. │ 0 │ 19 │  0.15 │ ▋ │
21. │ 0 │ 20 │ 0.135 │ ▋ │
22. │ 0 │ 21 │ 0.122 │ ▌ │
23. │ 0 │ 22 │ 0.111 │ ▌ │
24. │ 0 │ 23 │   0.1 │ ▌ │
25. │ 0 │ 24 │ 0.091 │ ▍ │
26. │ 1 │ 25 │ 1.082 │ █████▍ │
27. │ 1 │ 26 │ 1.979 │ █████████▉ │
28. │ 1 │ 27 │ 2.791 │ █████████████▉ │
29. │ 1 │ 28 │ 3.525 │ █████████████████▋ │
30. │ 1 │ 29 │  4.19 │ ████████████████████▉ │
31. │ 1 │ 30 │ 4.791 │ ███████████████████████▉ │
32. │ 1 │ 31 │ 5.335 │ ██████████████████████████▋ │
33. │ 1 │ 32 │ 5.827 │ █████████████████████████████▏ │
34. │ 1 │ 33 │ 6.273 │ ███████████████████████████████▎ │
35. │ 1 │ 34 │ 6.676 │ █████████████████████████████████▍ │
36. │ 1 │ 35 │ 7.041 │ ███████████████████████████████████▏ │
37. │ 1 │ 36 │ 7.371 │ ████████████████████████████████████▊ │
38. │ 1 │ 37 │ 7.669 │ ██████████████████████████████████████▎ │
39. │ 1 │ 38 │ 7.939 │ ███████████████████████████████████████▋ │
40. │ 1 │ 39 │ 8.184 │ ████████████████████████████████████████▉ │
41. │ 1 │ 40 │ 8.405 │ ██████████████████████████████████████████ │
42. │ 1 │ 41 │ 8.605 │ ███████████████████████████████████████████ │
43. │ 1 │ 42 │ 8.786 │ ███████████████████████████████████████████▉ │
44. │ 1 │ 43 │  8.95 │ ████████████████████████████████████████████▊ │
45. │ 1 │ 44 │ 9.098 │ █████████████████████████████████████████████▍ │
46. │ 1 │ 45 │ 9.233 │ ██████████████████████████████████████████████▏ │
47. │ 1 │ 46 │ 9.354 │ ██████████████████████████████████████████████▊ │
48. │ 1 │ 47 │ 9.464 │ ███████████████████████████████████████████████▎ │
49. │ 1 │ 48 │ 9.563 │ ███████████████████████████████████████████████▊ │
50. │ 1 │ 49 │ 9.653 │ ████████████████████████████████████████████████▎ │
└───────┴──────┴──────────────────────┴───────────────────────────────────────────────────┘
first_value¶
It is an alias for any
but it was introduced for compatibility with Window Functions, where sometimes it's necessary to process NULL
values.
It supports declaring a modifier to respect nulls (RESPECT NULLS
), both under Window Functions and in normal aggregations.
As with any, when used without Window Functions the result will be random if the source stream is not ordered. The return type matches the input type (NULL is returned only if the input is Nullable or the -OrNull combinator is added).
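The NULL handling can be sketched with a window query. This is an illustration, not canonical documentation: it assumes your ClickHouse version supports the values table function and the RESPECT NULLS modifier.

```sql
-- By default first_value skips NULLs (like any); RESPECT NULLS keeps them.
SELECT
    first_value(x) OVER (ORDER BY t)               AS skip_nulls,
    first_value(x) RESPECT NULLS OVER (ORDER BY t) AS keep_nulls
FROM values('t UInt8, x Nullable(Int32)', (1, NULL), (2, 10), (3, 20));
```

With the leading NULL in the ordered stream, skip_nulls should be 10 while keep_nulls should be NULL.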
flameGraph¶
Aggregate function which builds a flamegraph using the list of stacktraces. Outputs an array of strings which can be used by flamegraph.pl utility to render an SVG of the flamegraph.
Syntax
flameGraph(traces, [size], [ptr])
Parameters
traces
: A stacktrace. Array(UInt64).
size
: An allocation size for memory profiling. Optional, default: 1. UInt64.
ptr
: An allocation address. Optional, default: 0. UInt64.
In the case where ptr != 0, flameGraph maps allocations (size > 0) to deallocations (size < 0) with the same size and ptr. Only allocations that were not freed are shown. Unmapped deallocations are ignored.
Returned value
- An array of strings for use with flamegraph.pl utility. Array(String).
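A typical usage sketch, assuming query profiling is enabled and system.trace_log contains samples (the query_id value is a placeholder):

```sql
-- Build a CPU flamegraph for one query; feed the returned lines to flamegraph.pl.
SELECT arrayJoin(flameGraph(arrayReverse(trace)))
FROM system.trace_log
WHERE trace_type = 'CPU' AND query_id = 'some-query-id';
```

Each returned string is a stack trace with a sample count, in the text format consumed by flamegraph.pl.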
groupArray¶
Syntax: groupArray(x)
or groupArray(max_size)(x)
Creates an array of argument values. Values can be added to the array in any (indeterminate) order.
The second version (with the max_size
parameter) limits the size of the resulting array to max_size
elements. For example, groupArray(1)(x)
is equivalent to [any(x)].
In some cases, you can still rely on the order of execution. This applies to cases when SELECT
comes from a subquery that uses ORDER BY
if the subquery result is small enough.
Example
SELECT * FROM default.ck

┌─id─┬─name─────┐
│  1 │ zhangsan │
│  1 │ ᴺᵁᴸᴸ     │
│  1 │ lisi     │
│  2 │ wangwu   │
└────┴──────────┘
Query:
select id, groupArray(10)(name) from default.ck group by id
Result:
┌─id─┬─groupArray(10)(name)─┐
│  1 │ ['zhangsan','lisi']  │
│  2 │ ['wangwu']           │
└────┴──────────────────────┘
As the result above shows, the groupArray function removes NULL values.
- Alias: array_agg.
groupArrayInsertAt¶
Inserts a value into the array at the specified position.
Syntax
groupArrayInsertAt(default_x, size)(x, pos)
If in one query several values are inserted into the same position, the function behaves in the following ways:
- If a query is executed in a single thread, the first one of the inserted values is used.
- If a query is executed in multiple threads, the resulting value is an undetermined one of the inserted values.
Arguments
x
: Value to be inserted. Expression resulting in one of the supported data types.
pos
: Position at which the specified element x is to be inserted. Index numbering in the array starts from zero. UInt32.
default_x
: Default value for substituting in empty positions. Optional parameter. Expression resulting in the data type configured for the x parameter. If default_x is not defined, the default values are used.
size
: Length of the resulting array. Optional parameter. When using this parameter, the default value default_x must be specified. UInt32.
Returned value
- Array with inserted values.
Type: Array.
Example
Query:
SELECT groupArrayInsertAt(toString(number), number * 2) FROM numbers(5)
Result:
┌─groupArrayInsertAt(toString(number), multiply(number, 2))─┐ │ ['0','','1','','2','','3','','4'] │ └───────────────────────────────────────────────────────────┘
Query:
SELECT groupArrayInsertAt('-')(toString(number), number * 2) FROM numbers(5)
Result:
┌─groupArrayInsertAt('-')(toString(number), multiply(number, 2))─┐ │ ['0','-','1','-','2','-','3','-','4'] │ └────────────────────────────────────────────────────────────────┘
Query:
SELECT groupArrayInsertAt('-', 5)(toString(number), number * 2) FROM numbers(5)
Result:
┌─groupArrayInsertAt('-', 5)(toString(number), multiply(number, 2))─┐ │ ['0','-','1','-','2'] │ └───────────────────────────────────────────────────────────────────┘
Multi-threaded insertion of elements into one position.
Query:
SELECT groupArrayInsertAt(number, 0) FROM numbers_mt(10) SETTINGS max_block_size = 1

As a result of this query you get a random integer in the [0,9] range. For example:

┌─groupArrayInsertAt(number, 0)─┐
│ [7]                           │
└───────────────────────────────┘
groupArrayIntersect¶
Returns an intersection of the given arrays (all items that are present in every given array).
Syntax
groupArrayIntersect(x)
Arguments
x
: Argument (column name or expression).
Returned values
- Array that contains elements that are in all arrays.
Type: Array.
Examples
Consider table numbers
:
┌─a──────────────┐
│ [1,2,4]        │
│ [1,5,2,8,-1,0] │
│ [1,5,7,5,8,2]  │
└────────────────┘
Query with column name as argument:
SELECT groupArrayIntersect(a) as intersection FROM numbers
Result:
┌─intersection──────┐ │ [1, 2] │ └───────────────────┘
groupArrayLast¶
Syntax: groupArrayLast(max_size)(x)
Creates an array of the last argument values. For example, groupArrayLast(1)(x) is equivalent to [anyLast(x)].
In some cases, you can still rely on the order of execution. This applies to cases when SELECT
comes from a subquery that uses ORDER BY
if the subquery result is small enough.
Example
Query:
select groupArrayLast(2)(number+1) numbers from numbers(10)
Result:
┌─numbers─┐ │ [9,10] │ └─────────┘
Compare with groupArray:
select groupArray(2)(number+1) numbers from numbers(10)
┌─numbers─┐ │ [1,2] │ └─────────┘
groupArrayMovingAvg¶
Calculates the moving average of input values.
groupArrayMovingAvg(numbers_for_summing) groupArrayMovingAvg(window_size)(numbers_for_summing)
The function can take the window size as a parameter. If left unspecified, the function takes the window size equal to the number of rows in the column.
Arguments
numbers_for_summing
: Expression resulting in a numeric data type value.
window_size
: Size of the calculation window.
Returned values
- Array of the same size and type as the input data.
The function uses rounding towards zero. It truncates the decimal places insignificant for the resulting data type.
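A minimal sketch with a window of size 2 over the integers 0..4; since the input is an integer type, the averages are truncated towards zero:

```sql
SELECT groupArrayMovingAvg(2)(number) FROM numbers(5);
-- With truncation, this should yield [0,0,1,2,3]
```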
groupArrayMovingSum¶
Calculates the moving sum of input values.
groupArrayMovingSum(numbers_for_summing) groupArrayMovingSum(window_size)(numbers_for_summing)
The function can take the window size as a parameter. If left unspecified, the function takes the window size equal to the number of rows in the column.
Arguments
numbers_for_summing
: Expression resulting in a numeric data type value.
window_size
: Size of the calculation window.
Returned values
- Array of the same size and type as the input data.
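A minimal sketch with a window of size 2 over the integers 0..4; each element is the sum of the current value and its predecessor:

```sql
SELECT groupArrayMovingSum(2)(number) FROM numbers(5);
-- Should yield [0,1,3,5,7]
```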
groupArraySample¶
Creates an array of sample argument values. The size of the resulting array is limited to max_size
elements. Argument values are selected and added to the array randomly.
Syntax
groupArraySample(max_size[, seed])(x)
Arguments
max_size
: Maximum size of the resulting array. UInt64.
seed
: Seed for the random number generator. Optional. UInt64. Default value: 123456.
x
: Argument (column name or expression).
Returned values
- Array of randomly selected x arguments.
Type: Array.
Examples
Consider table colors
:
┌─id─┬─color──┐
│  1 │ red    │
│  2 │ blue   │
│  3 │ green  │
│  4 │ white  │
│  5 │ orange │
└────┴────────┘
Query with column name as argument:
SELECT groupArraySample(3)(color) as newcolors FROM colors
Result:
┌─newcolors──────────────────┐ │ ['white','blue','green'] │ └────────────────────────────┘
Query with column name and different seed:
SELECT groupArraySample(3, 987654321)(color) as newcolors FROM colors
Result:
┌─newcolors──────────────────┐ │ ['red','orange','green'] │ └────────────────────────────┘
Query with expression as argument:
SELECT groupArraySample(3)(concat('light-', color)) as newcolors FROM colors
Result:
┌─newcolors───────────────────────────────────┐ │ ['light-blue','light-orange','light-green'] │ └─────────────────────────────────────────────┘
groupArraySorted¶
Returns an array with the first N items in ascending order.
groupArraySorted(N)(column)
Arguments
N
– The number of elements to return.
column
– The value (Integer, String, Float and other Generic types).
Example
Gets the first 10 numbers:
SELECT groupArraySorted(10)(number) FROM numbers(100)
┌─groupArraySorted(10)(number)─┐ │ [0,1,2,3,4,5,6,7,8,9] │ └──────────────────────────────┘
Gets the String representations of the first 5 numbers:
SELECT groupArraySorted(5)(str) FROM (SELECT toString(number) as str FROM numbers(5))
┌─groupArraySorted(5)(str)─┐ │ ['0','1','2','3','4'] │ └──────────────────────────┘
groupBitAnd¶
Applies bitwise AND to a series of numbers.
groupBitAnd(expr)
Arguments
expr
– An expression that results in UInt*
or Int*
type.
Return value
Value of the UInt*
or Int*
type.
Example
Test data:
binary     decimal
00101100 = 44
00011100 = 28
00001101 = 13
01010101 = 85
Query:
SELECT groupBitAnd(num) FROM t
Where num
is the column with the test data.
Result:
binary     decimal
00000100 = 4
groupBitmap¶
Performs bitmap (aggregate) calculations on an unsigned integer column and returns the cardinality as a UInt64 value. If the -State suffix is added, it returns a bitmap object.
groupBitmap(expr)
Arguments
expr
– An expression that results in UInt*
type.
Return value
Value of the UInt64
type.
Example
Test data:
UserID
1
1
2
3
Query:
SELECT groupBitmap(UserID) as num FROM t
Result:
num
3
groupBitmapAnd¶
Calculates the AND of a bitmap column and returns the cardinality as a UInt64 value. If the -State suffix is added, it returns a bitmap object.
groupBitmapAnd(expr)
Arguments
expr
– An expression that results in AggregateFunction(groupBitmap, UInt*)
type.
Return value
Value of the UInt64
type.
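Since the argument must be a bitmap state, a self-contained sketch can build two states on the fly; the bitmaps {1..5} and {3..7} intersect in {3,4,5}:

```sql
SELECT groupBitmapAnd(ids)
FROM
(
    SELECT groupBitmapState(toUInt32(number)) AS ids FROM numbers(1, 5) -- ids 1..5
    UNION ALL
    SELECT groupBitmapState(toUInt32(number)) FROM numbers(3, 5)        -- ids 3..7
);
-- Cardinality of the intersection {3,4,5}: should be 3
```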
groupBitmapOr¶
Calculates the OR of a bitmap column and returns the cardinality as a UInt64 value. If the -State suffix is added, it returns a bitmap object. This is equivalent to groupBitmapMerge.
groupBitmapOr(expr)
Arguments
expr
– An expression that results in AggregateFunction(groupBitmap, UInt*)
type.
Returned value
Value of the UInt64
type.
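A self-contained sketch built from two on-the-fly bitmap states; the union of {1..5} and {3..7} is {1..7}:

```sql
SELECT groupBitmapOr(ids)
FROM
(
    SELECT groupBitmapState(toUInt32(number)) AS ids FROM numbers(1, 5) -- ids 1..5
    UNION ALL
    SELECT groupBitmapState(toUInt32(number)) FROM numbers(3, 5)        -- ids 3..7
);
-- Cardinality of the union {1..7}: should be 7
```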
groupBitmapXor¶
Calculates the XOR of a bitmap column and returns the cardinality as a UInt64 value. If the -State suffix is added, it returns a bitmap object.
groupBitmapXor(expr)
Arguments
expr
– An expression that results in AggregateFunction(groupBitmap, UInt*)
type.
Returned value
Value of the UInt64
type.
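A self-contained sketch; the symmetric difference of {1..5} and {3..7} is {1,2,6,7}:

```sql
SELECT groupBitmapXor(ids)
FROM
(
    SELECT groupBitmapState(toUInt32(number)) AS ids FROM numbers(1, 5) -- ids 1..5
    UNION ALL
    SELECT groupBitmapState(toUInt32(number)) FROM numbers(3, 5)        -- ids 3..7
);
-- Cardinality of the symmetric difference {1,2,6,7}: should be 4
```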
groupBitOr¶
Applies bitwise OR to a series of numbers.
groupBitOr(expr)
Arguments
expr
– An expression that results in UInt*
or Int*
type.
Returned value
Value of the UInt*
or Int*
type.
Example
Test data:
binary     decimal
00101100 = 44
00011100 = 28
00001101 = 13
01010101 = 85
Query:
SELECT groupBitOr(num) FROM t
Where num
is the column with the test data.
Result:
binary     decimal
01111101 = 125
groupBitXor¶
Applies bitwise XOR to a series of numbers.
groupBitXor(expr)
Arguments
expr
– An expression that results in UInt*
or Int*
type.
Return value
Value of the UInt*
or Int*
type.
Example
Test data:
binary     decimal
00101100 = 44
00011100 = 28
00001101 = 13
01010101 = 85
Query:
SELECT groupBitXor(num) FROM t
Where num
is the column with the test data.
Result:
binary     decimal
01101000 = 104
groupConcat¶
Calculates a concatenated string from a group of strings, optionally separated by a delimiter, and optionally limited by a maximum number of elements.
Syntax
groupConcat[(delimiter [, limit])](expression)
Arguments
expression
: The expression or column name that outputs the strings to be concatenated.
delimiter
: A string used to separate concatenated values. This parameter is optional and defaults to an empty string if not specified.
limit
: A positive integer specifying the maximum number of elements to concatenate. If more elements are present, the excess elements are ignored. This parameter is optional.
If delimiter is specified without limit, it must be the first parameter. If both delimiter and limit are specified, delimiter must precede limit.
Returned value
- Returns a string consisting of the concatenated values of the column or expression. If the group has no elements, or only NULL elements, the result is a nullable string with a NULL value.
Examples
Input table:
┌─id─┬─name─┐
│  1 │ John │
│  2 │ Jane │
│  3 │ Bob  │
└────┴──────┘
- Basic usage without a delimiter:
Query:
SELECT groupConcat(Name) FROM Employees
Result:
JohnJaneBob
This concatenates all names into one continuous string without any separator.
- Using comma as a delimiter:
Query:
SELECT groupConcat(', ')(Name) FROM Employees
Result:
John, Jane, Bob
This output shows the names separated by a comma followed by a space.
- Limiting the number of concatenated elements
Query:
SELECT groupConcat(', ', 2)(Name) FROM Employees
Result:
John, Jane
This query limits the output to the first two names, even though there are more names in the table.
groupUniqArray¶
Syntax: groupUniqArray(x)
or groupUniqArray(max_size)(x)
Creates an array from different argument values. Memory consumption is the same as for the uniqExact function.
The second version (with the max_size
parameter) limits the size of the resulting array to max_size
elements. For example, groupUniqArray(1)(x)
is equivalent to [any(x)]
.
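For example (element order is indeterminate):

```sql
-- The three distinct remainders 0, 1, 2 are each returned exactly once.
SELECT groupUniqArray(number % 3) FROM numbers(10);
```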
intervalLengthSum¶
Calculates the total length of the union of all ranges (segments on the numeric axis).
Syntax
intervalLengthSum(start, end)
Arguments
start
: The starting value of the interval. Int32, Int64, UInt32, UInt64, Float32, Float64, DateTime or Date.
end
: The ending value of the interval. Int32, Int64, UInt32, UInt64, Float32, Float64, DateTime or Date.
Arguments must be of the same data type. Otherwise, an exception will be thrown.
Returned value
- Total length of union of all ranges (segments on numeric axis). Depending on the type of the argument, the return value may be UInt64 or Float64 type.
Examples
- Input table:
┌─id─┬─start─┬─end─┐
│ a  │   1.1 │ 2.9 │
│ a  │   2.5 │ 3.2 │
│ a  │     4 │   5 │
└────┴───────┴─────┘
In this example, the arguments of the Float32 type are used. The function returns a value of the Float64 type.
The result is the sum of the lengths of the intervals [1.1, 3.2] (the union of [1.1, 2.9] and [2.5, 3.2]) and [4, 5].
Query:
SELECT id, intervalLengthSum(start, end), toTypeName(intervalLengthSum(start, end)) FROM fl_interval GROUP BY id ORDER BY id
Result:
┌─id─┬─intervalLengthSum(start, end)─┬─toTypeName(intervalLengthSum(start, end))─┐ │ a │ 3.1 │ Float64 │ └────┴───────────────────────────────┴───────────────────────────────────────────┘
- Input table:
┌─id─┬───────────────start─┬─────────────────end─┐
│ a  │ 2020-01-01 01:12:30 │ 2020-01-01 02:10:10 │
│ a  │ 2020-01-01 02:05:30 │ 2020-01-01 02:50:31 │
│ a  │ 2020-01-01 03:11:22 │ 2020-01-01 03:23:31 │
└────┴─────────────────────┴─────────────────────┘
In this example, the arguments of the DateTime type are used. The function returns a value in seconds.
Query:
SELECT id, intervalLengthSum(start, end), toTypeName(intervalLengthSum(start, end)) FROM dt_interval GROUP BY id ORDER BY id
Result:
┌─id─┬─intervalLengthSum(start, end)─┬─toTypeName(intervalLengthSum(start, end))─┐ │ a │ 6610 │ UInt64 │ └────┴───────────────────────────────┴───────────────────────────────────────────┘
- Input table:
┌─id─┬──────start─┬────────end─┐
│ a  │ 2020-01-01 │ 2020-01-04 │
│ a  │ 2020-01-12 │ 2020-01-18 │
└────┴────────────┴────────────┘
In this example, the arguments of the Date type are used. The function returns a value in days.
Query:
SELECT id, intervalLengthSum(start, end), toTypeName(intervalLengthSum(start, end)) FROM date_interval GROUP BY id ORDER BY id
Result:
┌─id─┬─intervalLengthSum(start, end)─┬─toTypeName(intervalLengthSum(start, end))─┐ │ a │ 9 │ UInt64 │ └────┴───────────────────────────────┴───────────────────────────────────────────┘
kolmogorovSmirnovTest¶
Applies the Kolmogorov–Smirnov test to samples from two populations.
Syntax
kolmogorovSmirnovTest([alternative, computation_method])(sample_data, sample_index)
Values of both samples are in the sample_data column. If sample_index equals 0, then the value in that row belongs to the sample from the first population; otherwise it belongs to the sample from the second population. Samples must belong to continuous, one-dimensional probability distributions.
Arguments
sample_data
: Sample data. Integer, Float or Decimal.sample_index
: Sample index. Integer.
Parameters
alternative
: Alternative hypothesis. Optional, default: 'two-sided'. String. Let F(x) and G(x) be the CDFs of the first and second distributions respectively.
- 'two-sided' – The null hypothesis is that the samples come from the same distribution, i.e. F(x) = G(x) for all x. The alternative is that the distributions are not identical.
- 'greater' – The null hypothesis is that values in the first sample are stochastically smaller than those in the second one, i.e. the CDF of the first distribution lies above (and hence to the left of) that of the second one, which means F(x) >= G(x) for all x. The alternative is that F(x) < G(x) for at least one x.
- 'less' – The null hypothesis is that values in the first sample are stochastically greater than those in the second one, i.e. the CDF of the first distribution lies below (and hence to the right of) that of the second one, which means F(x) <= G(x) for all x. The alternative is that F(x) > G(x) for at least one x.
computation_method
: The method used to compute the p-value. Optional, default: 'auto'. String.
- 'exact' – The calculation is performed using the precise probability distribution of the test statistic. Compute-intensive and wasteful except for small samples.
- 'asymp' ('asymptotic') – The calculation is performed using an approximation. For large sample sizes, the exact and asymptotic p-values are very similar.
- 'auto' – The 'exact' method is used when the maximum number of samples is less than 10,000.
Returned values
Tuple with two elements:
- calculated statistic. Float64.
- calculated p-value. Float64.
Example
Query:
SELECT kolmogorovSmirnovTest('less', 'exact')(value, num) FROM ( SELECT randNormal(0, 10) AS value, 0 AS num FROM numbers(10000) UNION ALL SELECT randNormal(0, 10) AS value, 1 AS num FROM numbers(10000) )
Result:
┌─kolmogorovSmirnovTest('less', 'exact')(value, num)─┐ │ (0.009899999999999996,0.37528595205132287) │ └────────────────────────────────────────────────────┘
Note: P-value is bigger than 0.05 (for confidence level of 95%), so null hypothesis is not rejected.
Query:
SELECT kolmogorovSmirnovTest('two-sided', 'exact')(value, num) FROM ( SELECT randStudentT(10) AS value, 0 AS num FROM numbers(100) UNION ALL SELECT randNormal(0, 10) AS value, 1 AS num FROM numbers(100) )
Result:
┌─kolmogorovSmirnovTest('two-sided', 'exact')(value, num)─┐ │ (0.4100000000000002,6.61735760482795e-8) │ └─────────────────────────────────────────────────────────┘
Note: P-value is less than 0.05 (for confidence level of 95%), so null hypothesis is rejected.
kurtPop¶
Computes the kurtosis of a sequence.
kurtPop(expr)
Arguments
expr
: Expression returning a number.
Returned value
The kurtosis of the given distribution. Type: Float64.
Example
SELECT kurtPop(value) FROM series_with_value_column
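A self-contained sketch using generated data instead of the placeholder table:

```sql
-- Kurtosis of the integers 0..999 (an approximately uniform sample).
SELECT kurtPop(number) FROM numbers(1000);
```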
kurtSamp¶
Computes the sample kurtosis of a sequence.
It represents an unbiased estimate of the kurtosis of a random variable if passed values form its sample.
kurtSamp(expr)
Arguments
expr
: Expression returning a number.
Returned value
The kurtosis of the given distribution. Type: Float64. If n <= 1
(n
is a size of the sample), then the function returns nan
.
Example
SELECT kurtSamp(value) FROM series_with_value_column
largestTriangleThreeBuckets¶
Applies the Largest-Triangle-Three-Buckets algorithm to the input data. The algorithm is used for downsampling time series data for visualization. It is designed to operate on series sorted by the x coordinate. It works by dividing the sorted series into buckets and then finding the largest triangle in each bucket. The number of buckets is equal to the number of points in the resulting series. The function sorts the data by x and then applies the downsampling algorithm to the sorted data.
Syntax
largestTriangleThreeBuckets(n)(x, y)
Alias: lttb
.
Arguments
x
: x coordinate. Integer, Float, Decimal, Date, Date32, DateTime, DateTime64.
y
: y coordinate. Integer, Float, Decimal, Date, Date32, DateTime, DateTime64.
NaNs are ignored in the provided series, meaning that any NaN values will be excluded from the analysis. This ensures that the function operates only on valid numerical data.
Parameters
n
: number of points in the resulting series. UInt64.
Returned values
Array of Tuple with two elements: the x and y coordinates of the selected points.
Example
Input table:
┌─────x───────┬───────y──────┐ │ 1.000000000 │ 10.000000000 │ │ 2.000000000 │ 20.000000000 │ │ 3.000000000 │ 15.000000000 │ │ 8.000000000 │ 60.000000000 │ │ 9.000000000 │ 55.000000000 │ │ 10.00000000 │ 70.000000000 │ │ 4.000000000 │ 30.000000000 │ │ 5.000000000 │ 40.000000000 │ │ 6.000000000 │ 35.000000000 │ │ 7.000000000 │ 50.000000000 │ └─────────────┴──────────────┘
Query:
SELECT largestTriangleThreeBuckets(4)(x, y) FROM largestTriangleThreeBuckets_test
Result:
┌────────largestTriangleThreeBuckets(4)(x, y)───────────┐
│ [(1,10),(3,15),(9,55),(10,70)]                        │
└───────────────────────────────────────────────────────┘
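The bucketing logic can be sketched in Python. This is a simplified illustration of the algorithm, not ClickHouse's implementation; the exact bucket boundaries and tie handling of the real function may differ:

```python
def lttb(points, n_out):
    """Largest-Triangle-Three-Buckets downsampling (simplified sketch).

    Keeps the first and last points, splits the middle into n_out - 2
    buckets, and from each bucket keeps the point forming the largest
    triangle with the previously kept point and the average of the next
    bucket.
    """
    if n_out >= len(points) or n_out < 3:
        return list(points)
    pts = sorted(points)                        # the function sorts by x first
    kept = [pts[0]]
    bucket = (len(pts) - 2) / (n_out - 2)
    a = 0                                       # index of the last kept point
    for i in range(n_out - 2):
        lo, hi = int(i * bucket) + 1, int((i + 1) * bucket) + 1
        nxt = pts[hi:min(int((i + 2) * bucket) + 1, len(pts) - 1)] or [pts[-1]]
        avg_x = sum(p[0] for p in nxt) / len(nxt)
        avg_y = sum(p[1] for p in nxt) / len(nxt)
        ax, ay = pts[a]
        best, best_area = lo, -1.0
        for j in range(lo, hi):
            bx, by = pts[j]
            # twice the triangle area, via the cross product
            area = abs((ax - avg_x) * (by - ay) - (ax - bx) * (avg_y - ay))
            if area > best_area:
                best, best_area = j, area
        kept.append(pts[best])
        a = best
    kept.append(pts[-1])
    return kept
```

On the table above, `lttb(points, 4)` reproduces the documented result `[(1,10),(3,15),(9,55),(10,70)]`.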
last_value¶
Selects the last encountered value, similar to anyLast
, but can accept NULL. It is mostly intended for use with Window Functions. Without Window Functions the result will be random if the source stream is not ordered.
mannWhitneyUTest¶
Applies the Mann-Whitney rank test to samples from two populations.
Syntax
mannWhitneyUTest[(alternative[, continuity_correction])](sample_data, sample_index)
Values of both samples are in the sample_data
column. If sample_index
equals 0 then the value in that row belongs to the sample from the first population. Otherwise it belongs to the sample from the second population. The null hypothesis is that the two populations are stochastically equal. One-sided hypotheses can also be tested. This test does not assume that the data are normally distributed.
Arguments
sample_data
: sample data. Integer, Float or Decimal.sample_index
: sample index. Integer.
Parameters
alternative
: alternative hypothesis. (Optional, default: 'two-sided'.) String. Possible values: 'two-sided', 'greater', 'less'.
continuity_correction
: if not 0 then continuity correction in the normal approximation for the p-value is applied. (Optional, default: 1.) UInt64.
Returned values
Tuple with two elements:
- calculated U-statistic. Float64.
- calculated p-value. Float64.
Example
Input table:
┌─sample_data─┬─sample_index─┐
│          10 │            0 │
│          11 │            0 │
│          12 │            0 │
│           1 │            1 │
│           2 │            1 │
│           3 │            1 │
└─────────────┴──────────────┘
Query:
SELECT mannWhitneyUTest('greater')(sample_data, sample_index) FROM mww_ttest
Result:
┌─mannWhitneyUTest('greater')(sample_data, sample_index)─┐
│ (9,0.04042779918503192)                                │
└────────────────────────────────────────────────────────┘
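The U statistic itself is a rank-sum computation. The Python sketch below reproduces only the statistic, not the p-value (which ClickHouse derives from a normal approximation with an optional continuity correction):

```python
def mann_whitney_u(sample_data, sample_index):
    """U statistic for the first population (rows where sample_index == 0).

    Sketch of the statistic only; tied values receive averaged ranks.
    """
    first = [v for v, i in zip(sample_data, sample_index) if i == 0]
    ordered = sorted(sample_data)

    def avg_rank(v):
        lo = ordered.index(v) + 1               # first 1-based position of v
        return lo + (ordered.count(v) - 1) / 2  # midpoint of the tie run

    rank_sum = sum(avg_rank(v) for v in first)
    n1 = len(first)
    return rank_sum - n1 * (n1 + 1) / 2
```

On the example data above, the sketch returns 9, matching the first element of the documented result tuple.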
max¶
Aggregate function that calculates the maximum across a group of values.
Syntax:
SELECT max(salary) FROM employees
SELECT department, max(salary) FROM employees GROUP BY department
If you need a non-aggregate function to choose the maximum of two values, see greatest
:
SELECT greatest(a, b) FROM table
maxIntersections¶
Aggregate function that calculates the maximum number of times that a group of intervals intersects each other (if all the intervals intersect at least once).
The syntax is:
maxIntersections(start_column, end_column)
Arguments
start_column
– the numeric column that represents the start of each interval. Ifstart_column
isNULL
or 0 then the interval will be skipped.end_column
- the numeric column that represents the end of each interval. Ifend_column
isNULL
or 0 then the interval will be skipped.
Returned value
Returns the maximum number of intersected intervals.
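Conceptually this is a sweep-line computation. A Python sketch, assuming a shared endpoint does not count as an intersection (ClickHouse's tie handling may differ):

```python
def max_intersections(starts, ends):
    """Maximum number of simultaneously overlapping intervals (sweep line)."""
    events = []
    for s, e in zip(starts, ends):
        if not s or not e:          # skip NULL-like (None) or 0 endpoints
            continue
        events.append((s, 1))       # interval opens
        events.append((e, -1))      # interval closes
    # at equal coordinates, process closes before opens
    events.sort(key=lambda t: (t[0], t[1]))
    best = current = 0
    for _, delta in events:
        current += delta
        best = max(best, current)
    return best
```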
maxIntersectionsPosition¶
Aggregate function that calculates the positions of the occurrences of the maxIntersections
function.
The syntax is:
maxIntersectionsPosition(start_column, end_column)
Arguments
start_column
– the numeric column that represents the start of each interval. Ifstart_column
isNULL
or 0 then the interval will be skipped.end_column
- the numeric column that represents the end of each interval. Ifend_column
isNULL
or 0 then the interval will be skipped.
Returned value
Returns the start positions of the maximum number of intersected intervals.
maxMap¶
Calculates the maximum from value
array according to the keys specified in the key
array.
Syntax
maxMap(key, value)
or
maxMap(Tuple(key, value))
Alias: maxMappedArrays
- Passing a tuple of keys and value arrays is identical to passing two arrays of keys and values.
- The number of elements in
key
andvalue
must be the same for each row that is totaled.
Parameters
key
: Array of keys. Array.value
: Array of values. Array.
Returned value
- Returns a tuple of two arrays: keys in sorted order, and values calculated for the corresponding keys. Tuple(Array, Array).
Example
Query:
SELECT maxMap(a, b) FROM values('a Array(Char), b Array(Int64)', (['x', 'y'], [2, 2]), (['y', 'z'], [3, 1]))
Result:
┌─maxMap(a, b)───────────┐
│ [['x','y','z'],[2,3,1]]│
└────────────────────────┘
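The merge logic can be sketched in Python: values are aggregated per key across rows, and the keys come back sorted:

```python
def max_map(key_rows, value_rows):
    """Keep the maximum value per key across rows of (keys, values) arrays."""
    acc = {}
    for keys, values in zip(key_rows, value_rows):
        if len(keys) != len(values):
            raise ValueError("key and value arrays must have equal length")
        for k, v in zip(keys, values):
            acc[k] = max(acc[k], v) if k in acc else v
    ordered = sorted(acc)           # keys are returned in sorted order
    return ordered, [acc[k] for k in ordered]
```

On the example rows above it yields `(['x','y','z'], [2,3,1])`, matching the query result.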
meanZTest¶
Applies mean z-test to samples from two populations.
Syntax
meanZTest(population_variance_x, population_variance_y, confidence_level)(sample_data, sample_index)
Values of both samples are in the sample_data
column. If sample_index
equals to 0 then the value in that row belongs to the sample from the first population. Otherwise it belongs to the sample from the second population. The null hypothesis is that means of populations are equal. Normal distribution is assumed. Populations may have unequal variance and the variances are known.
Arguments
sample_data
: Sample data. Integer, Float or Decimal.sample_index
: Sample index. Integer.
Parameters
population_variance_x
: Variance for population x. Float.population_variance_y
: Variance for population y. Float.confidence_level
: Confidence level in order to calculate confidence intervals. Float.
Returned values
Tuple with four elements:
- calculated t-statistic. Float64.
- calculated p-value. Float64.
- calculated confidence-interval-low. Float64.
- calculated confidence-interval-high. Float64.
Example
Input table:
┌─sample_data─┬─sample_index─┐
│        20.3 │            0 │
│        21.9 │            0 │
│        22.1 │            0 │
│        18.9 │            1 │
│          19 │            1 │
│        20.3 │            1 │
└─────────────┴──────────────┘
Query:
SELECT meanZTest(0.7, 0.45, 0.95)(sample_data, sample_index) FROM mean_ztest
Result:
┌─meanZTest(0.7, 0.45, 0.95)(sample_data, sample_index)────────────────────────────┐
│ (3.2841296025548123,0.0010229786769086013,0.8198428246768334,3.2468238419898365) │
└──────────────────────────────────────────────────────────────────────────────────┘
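The z statistic is straightforward to compute by hand. A Python sketch of the statistic only (the p-value and confidence interval that ClickHouse also returns are omitted):

```python
import math

def mean_z_test(sample_data, sample_index, var_x, var_y):
    """Two-sample z statistic with known population variances."""
    x = [v for v, i in zip(sample_data, sample_index) if i == 0]
    y = [v for v, i in zip(sample_data, sample_index) if i != 0]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    # standard error of the difference of means under known variances
    se = math.sqrt(var_x / len(x) + var_y / len(y))
    return (mean_x - mean_y) / se
```

On the example table above it reproduces the first element of the documented result, ≈3.28413.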
median¶
The median*
functions are the aliases for the corresponding quantile*
functions. They calculate the median of a numeric data sample.
Functions:
- median: Alias for quantile.
- medianDeterministic: Alias for quantileDeterministic.
- medianExact: Alias for quantileExact.
- medianExactWeighted: Alias for quantileExactWeighted.
- medianTiming: Alias for quantileTiming.
- medianTimingWeighted: Alias for quantileTimingWeighted.
- medianTDigest: Alias for quantileTDigest.
- medianTDigestWeighted: Alias for quantileTDigestWeighted.
- medianBFloat16: Alias for quantileBFloat16.
- medianDD: Alias for quantileDD.
Example
Input table:
┌─val─┐
│   1 │
│   1 │
│   2 │
│   3 │
└─────┘
Query:
SELECT medianDeterministic(val, 1) FROM t
Result:
┌─medianDeterministic(val, 1)─┐
│                         1.5 │
└─────────────────────────────┘
min¶
Aggregate function that calculates the minimum across a group of values.
Syntax:
SELECT min(salary) FROM employees
SELECT department, min(salary) FROM employees GROUP BY department
If you need a non-aggregate function to choose the minimum of two values, see least
:
SELECT least(a, b) FROM table
minMap¶
Calculates the minimum from value
array according to the keys specified in the key
array.
Syntax
minMap(key, value)
or
minMap(Tuple(key, value))
Alias: minMappedArrays
- Passing a tuple of keys and value arrays is identical to passing an array of keys and an array of values.
- The number of elements in
key
andvalue
must be the same for each row that is totaled.
Parameters
key
: Array of keys. Array.value
: Array of values. Array.
Returned value
- Returns a tuple of two arrays: keys in sorted order, and values calculated for the corresponding keys. Tuple(Array, Array).
Example
Query:
SELECT minMap(a, b) FROM values('a Array(Int32), b Array(Int64)', ([1, 2], [2, 2]), ([2, 3], [1, 1]))
Result:
┌─minMap(a, b)──────┐
│ ([1,2,3],[2,1,1]) │
└───────────────────┘
quantile¶
Computes an approximate quantile of a numeric data sequence.
This function applies reservoir sampling with a reservoir size up to 8192 and a random number generator for sampling. The result is non-deterministic. To get an exact quantile, use the quantileExact
function.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
Note that for an empty numeric sequence, quantile
will return NaN, but its quantile*
variants will return either NaN or a default value for the sequence type, depending on the variant.
Syntax
quantile(level)(expr)
Alias: median
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using alevel
value in the range of[0.01, 0.99]
. Default value: 0.5. Atlevel=0.5
the function calculates median.expr
: Expression over the column values resulting in numeric data types, Date or DateTime.
Returned value
- Approximate quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
Input table:
┌─val─┐
│   1 │
│   1 │
│   2 │
│   3 │
└─────┘
Query:
SELECT quantile(val) FROM t
Result:
┌─quantile(val)─┐
│           1.5 │
└───────────────┘
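The idea behind the function can be sketched in Python with Algorithm R reservoir sampling. The interpolation step is an assumption for illustration; ClickHouse's exact interpolation may differ:

```python
import random

def reservoir_quantile(values, level=0.5, reservoir_size=8192, seed=None):
    """Approximate quantile over an Algorithm R reservoir sample."""
    rng = random.Random(seed)
    reservoir = []
    for i, v in enumerate(values):
        if i < reservoir_size:
            reservoir.append(v)
        else:
            j = rng.randrange(i + 1)       # replace with prob. size / (i + 1)
            if j < reservoir_size:
                reservoir[j] = v
    reservoir.sort()
    h = level * (len(reservoir) - 1)       # linear interpolation (assumed)
    lo = int(h)
    hi = min(lo + 1, len(reservoir) - 1)
    return reservoir[lo] + (h - lo) * (reservoir[hi] - reservoir[lo])
```

Once the input exceeds the reservoir size, the sampling step is what makes the result non-deterministic.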
quantileBFloat16¶
Computes an approximate quantile of a sample consisting of bfloat16 numbers. bfloat16
is a floating-point data type with 1 sign bit, 8 exponent bits and 7 fraction bits. The function converts input values to 32-bit floats and takes the most significant 16 bits. Then it calculates bfloat16
quantile value and converts the result to a 64-bit float by appending zero bits. The function is a fast quantile estimator with a relative error no more than 0.390625%.
Syntax
quantileBFloat16[(level)](expr)
Alias: medianBFloat16
Arguments
expr
: Column with numeric data. Integer, Float.
Parameters
level
: Level of quantile. Optional. Possible values are in the range from 0 to 1. Default value: 0.5. Float.
Returned value
- Approximate quantile of the specified level.
Type: Float64.
Example
The input table has an integer and a float column:
┌─a─┬─────b─┐
│ 1 │ 1.001 │
│ 2 │ 1.002 │
│ 3 │ 1.003 │
│ 4 │ 1.004 │
└───┴───────┘
Query to calculate 0.75-quantile (third quartile):
SELECT quantileBFloat16(0.75)(a), quantileBFloat16(0.75)(b) FROM example_table
Result:
┌─quantileBFloat16(0.75)(a)─┬─quantileBFloat16(0.75)(b)─┐
│                         3 │                         1 │
└───────────────────────────┴───────────────────────────┘
Note that all floating point values in the example are truncated to 1.0 when converting to bfloat16
.
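The truncation can be reproduced in Python by keeping only the top 16 bits of the float32 representation:

```python
import struct

def to_bfloat16(x):
    """Round-trip a float through bfloat16 by truncating float32 bits."""
    bits32 = struct.unpack('<I', struct.pack('<f', x))[0]
    high16 = bits32 >> 16          # 1 sign + 8 exponent + 7 fraction bits
    return struct.unpack('<f', struct.pack('<I', high16 << 16))[0]
```

For the values in the example, `to_bfloat16(1.001)` through `to_bfloat16(1.004)` all truncate to 1.0, which is why the second column of the result is 1.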
quantileBFloat16Weighted¶
Like quantileBFloat16
but takes into account the weight of each sequence member.
quantileDD¶
Computes an approximate quantile of a sample with relative-error guarantees. It works by building a DDSketch.
Syntax
quantileDD(relative_accuracy, [level])(expr)
Arguments
expr
: Column with numeric data. Integer, Float.
Parameters
relative_accuracy
: Relative accuracy of the quantile. Possible values are in the range from 0 to 1. Float. The size of the sketch depends on the range of the data and the relative accuracy. The larger the range and the smaller the relative accuracy, the larger the sketch. The rough memory size of the sketch is log(max_value/min_value)/relative_accuracy
. The recommended value is 0.001 or higher.level
: Level of quantile. Optional. Possible values are in the range from 0 to 1. Default value: 0.5. Float.
Returned value
- Approximate quantile of the specified level.
Type: Float64.
Example
The input table has an integer and a float column:
┌─a─┬─────b─┐
│ 1 │ 1.001 │
│ 2 │ 1.002 │
│ 3 │ 1.003 │
│ 4 │ 1.004 │
└───┴───────┘
Query to calculate 0.75-quantile (third quartile):
SELECT quantileDD(0.01, 0.75)(a), quantileDD(0.01, 0.75)(b) FROM example_table
Result:
┌─quantileDD(0.01, 0.75)(a)─┬─quantileDD(0.01, 0.75)(b)─┐
│         2.974233423476717 │                      1.01 │
└───────────────────────────┴───────────────────────────┘
quantileDeterministic¶
Computes an approximate quantile of a numeric data sequence.
This function applies reservoir sampling with a reservoir size up to 8192 and deterministic algorithm of sampling. The result is deterministic. To get an exact quantile, use the quantileExact
function.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
Syntax
quantileDeterministic(level)(expr, determinator)
Alias: medianDeterministic
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using alevel
value in the range of[0.01, 0.99]
. Default value: 0.5. Atlevel=0.5
the function calculates median.expr
: Expression over the column values resulting in numeric data types, Date or DateTime.determinator
: Number whose hash is used instead of a random number generator in the reservoir sampling algorithm to make the result of sampling deterministic. As a determinator you can use any deterministic positive number, for example, a user id or an event id. If the same determinator value occurs too often, the function works incorrectly.
Returned value
- Approximate quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
Input table:
┌─val─┐
│   1 │
│   1 │
│   2 │
│   3 │
└─────┘
Query:
SELECT quantileDeterministic(val, 1) FROM t
Result:
┌─quantileDeterministic(val, 1)─┐
│                           1.5 │
└───────────────────────────────┘
quantileExact¶
Exactly computes the quantile of a numeric data sequence.
To get exact value, all the passed values are combined into an array, which is then partially sorted. Therefore, the function consumes O(n)
memory, where n
is the number of values passed. However, for a small number of values, the function is very effective.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
Syntax
quantileExact(level)(expr)
Alias: medianExact
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using alevel
value in the range of[0.01, 0.99]
. Default value: 0.5. Atlevel=0.5
the function calculates median.expr
: Expression over the column values resulting in numeric data types, Date or DateTime.
Returned value
- Quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
Query:
SELECT quantileExact(number) FROM numbers(10)
Result:
┌─quantileExact(number)─┐
│                     5 │
└───────────────────────┘
quantileExactLow¶
Similar to quantileExact
, this computes the exact quantile of a numeric data sequence.
To get the exact value, all the passed values are combined into an array, which is then fully sorted. The sorting algorithm's complexity is O(N·log(N))
, where N = std::distance(first, last)
comparisons.
The return value depends on the quantile level and the number of elements in the selection: if the level is 0.5, then the function returns the lower median value for an even number of elements and the middle median value for an odd number of elements. The median is calculated similarly to the median_low implementation used in Python.
For all other levels, the element at the index corresponding to the value of level * size_of_array
is returned. For example:
SELECT quantileExactLow(0.1)(number) FROM numbers(10)

┌─quantileExactLow(0.1)(number)─┐
│                             1 │
└───────────────────────────────┘
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
Syntax
quantileExactLow(level)(expr)
Alias: medianExactLow
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using alevel
value in the range of[0.01, 0.99]
. Default value: 0.5. Atlevel=0.5
the function calculates median.expr
: Expression over the column values resulting in numeric data types, Date or DateTime.
Returned value
- Quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
Query:
SELECT quantileExactLow(number) FROM numbers(10)
Result:
┌─quantileExactLow(number)─┐
│                        4 │
└──────────────────────────┘
quantileExactHigh¶
Similar to quantileExact
, this computes the exact quantile of a numeric data sequence.
All the passed values are combined into an array, which is then fully sorted, to get the exact value. The sorting algorithm's complexity is O(N·log(N))
, where N = std::distance(first, last)
comparisons.
The return value depends on the quantile level and the number of elements in the selection: if the level is 0.5, then the function returns the higher median value for an even number of elements and the middle median value for an odd number of elements. The median is calculated similarly to the median_high implementation used in Python. For all other levels, the element at the index corresponding to the value of level * size_of_array
is returned.
This implementation behaves exactly like the current quantileExact
implementation.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
Syntax
quantileExactHigh(level)(expr)
Alias: medianExactHigh
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using alevel
value in the range of[0.01, 0.99]
. Default value: 0.5. Atlevel=0.5
the function calculates median.expr
: Expression over the column values resulting in numeric data types, Date or DateTime.
Returned value
- Quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
Query:
SELECT quantileExactHigh(number) FROM numbers(10)
Result:
┌─quantileExactHigh(number)─┐
│                         5 │
└───────────────────────────┘
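At level 0.5, both variants match Python's statistics module on the same data:

```python
from statistics import median_low, median_high

# quantileExactLow / quantileExactHigh at level 0.5 behave like Python's
# median_low and median_high on the same sample.
values = list(range(10))           # numbers(10): an even-sized sample 0..9
assert median_low(values) == 4     # lower of the two middle elements
assert median_high(values) == 5    # higher of the two middle elements
```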
quantileExactExclusive¶
Exactly computes the quantile of a numeric data sequence.
To get exact value, all the passed values are combined into an array, which is then partially sorted. Therefore, the function consumes O(n)
memory, where n
is the number of values passed. However, for a small number of values, the function is very effective.
This function is equivalent to the PERCENTILE.EXC Excel function (type R6).
When using multiple quantileExactExclusive
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantilesExactExclusive
function.
Syntax
quantileExactExclusive(level)(expr)
Arguments
expr
: Expression over the column values resulting in numeric data types, Date or DateTime.
Parameters
level
: Level of quantile. Optional. Possible values: (0, 1): bounds not included. Default value: 0.5. Atlevel=0.5
the function calculates median. Float.
Returned value
- Quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
quantileExactInclusive¶
Exactly computes the quantile of a numeric data sequence.
To get exact value, all the passed values are combined into an array, which is then partially sorted. Therefore, the function consumes O(n)
memory, where n
is the number of values passed. However, for a small number of values, the function is very effective.
This function is equivalent to the PERCENTILE.INC Excel function (type R7).
When using multiple quantileExactInclusive
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantilesExactInclusive
function.
Syntax
quantileExactInclusive(level)(expr)
Arguments
expr
: Expression over the column values resulting in numeric data types, Date or DateTime.
Parameters
level
: Level of quantile. Optional. Possible values: [0, 1]: bounds included. Default value: 0.5. Atlevel=0.5
the function calculates median. Float.
Returned value
- Quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
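The exclusive (R6) and inclusive (R7) schemes differ only in how the fractional rank is computed. A Python sketch for illustration (the clamping in the exclusive variant is an assumption; ClickHouse restricts levels to the open interval):

```python
def quantile_exclusive(xs, level):
    """PERCENTILE.EXC / type R6: fractional rank h = level * (n + 1)."""
    xs = sorted(xs)
    n = len(xs)
    h = min(max(level * (n + 1) - 1, 0.0), n - 1)   # 0-based, clamped
    lo = int(h)
    hi = min(lo + 1, n - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

def quantile_inclusive(xs, level):
    """PERCENTILE.INC / type R7: fractional rank h = level * (n - 1)."""
    xs = sorted(xs)
    n = len(xs)
    h = level * (n - 1)                              # 0-based
    lo = int(h)
    hi = min(lo + 1, n - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])
```

For the data 0..9 at level 0.25, the exclusive variant interpolates to 1.75 while the inclusive variant gives 2.25; at level 0.5 both give 4.5.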
quantileExactWeighted¶
Exactly computes the quantile of a numeric data sequence, taking into account the weight of each element.
To get exact value, all the passed values are combined into an array, which is then partially sorted. Each value is counted with its weight, as if it were present weight
times. A hash table is used in the algorithm. Because of this, if the passed values are frequently repeated, the function consumes less RAM than quantileExact. You can use this function instead of quantileExact
and specify a weight of 1.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
Syntax
quantileExactWeighted(level)(expr, weight)
Alias: medianExactWeighted
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using alevel
value in the range of[0.01, 0.99]
. Default value: 0.5. Atlevel=0.5
the function calculates median.expr
: Expression over the column values resulting in numeric data types, Date or DateTime.weight
: Column with weights of sequence members. Weight is a number of value occurrences with Unsigned integer types.
Returned value
- Quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
Input table:
┌─n─┬─val─┐
│ 0 │   3 │
│ 1 │   2 │
│ 2 │   1 │
│ 5 │   4 │
└───┴─────┘
Query:
SELECT quantileExactWeighted(n, val) FROM t
Result:
┌─quantileExactWeighted(n, val)─┐
│                             1 │
└───────────────────────────────┘
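The weighted algorithm can be sketched in Python; tie handling exactly at the threshold is an assumption for illustration:

```python
def quantile_exact_weighted(values, weights, level=0.5):
    """First value whose cumulative weight reaches level * total weight."""
    acc = {}
    for v, w in zip(values, weights):
        acc[v] = acc.get(v, 0) + w   # repeated values share one hash entry
    threshold = level * sum(acc.values())
    cumulative = 0
    for v in sorted(acc):
        cumulative += acc[v]
        if cumulative >= threshold:
            return v
    return max(acc)
```

On the example table (values n = 0, 1, 2, 5 with weights 3, 2, 1, 4), the half-total threshold is 5 and the cumulative weight reaches it at value 1, matching the query result.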
quantileExactWeightedInterpolated¶
Computes quantile of a numeric data sequence using linear interpolation, taking into account the weight of each element.
To get the interpolated value, all the passed values are combined into an array, which are then sorted by their corresponding weights. Quantile interpolation is then performed using the weighted percentile method by building a cumulative distribution based on weights and then a linear interpolation is performed using the weights and the values to compute the quantiles.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
We strongly recommend using quantileExactWeightedInterpolated
instead of quantileInterpolatedWeighted
because quantileExactWeightedInterpolated
is more accurate than quantileInterpolatedWeighted
. Here is an example:
SELECT quantileExactWeightedInterpolated(0.99)(number, 1), quantile(0.99)(number), quantileInterpolatedWeighted(0.99)(number, 1) FROM numbers(9)

┌─quantileExactWeightedInterpolated(0.99)(number, 1)─┬─quantile(0.99)(number)─┬─quantileInterpolatedWeighted(0.99)(number, 1)─┐
│                                               7.92 │                   7.92 │                                             8 │
└────────────────────────────────────────────────────┴────────────────────────┴───────────────────────────────────────────────┘
Syntax
quantileExactWeightedInterpolated(level)(expr, weight)
Alias: medianExactWeightedInterpolated
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using alevel
value in the range of[0.01, 0.99]
. Default value: 0.5. Atlevel=0.5
the function calculates median.expr
: Expression over the column values resulting in numeric data types, Date or DateTime.weight
: Column with weights of sequence members. Weight is a number of value occurrences with Unsigned integer types.
Returned value
- Quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
Input table:
┌─n─┬─val─┐
│ 0 │   3 │
│ 1 │   2 │
│ 2 │   1 │
│ 5 │   4 │
└───┴─────┘
Query:
SELECT quantileExactWeightedInterpolated(n, val) FROM t
Result:
┌─quantileExactWeightedInterpolated(n, val)─┐
│                                       1.5 │
└───────────────────────────────────────────┘
quantileGK¶
Computes the quantile of a numeric data sequence using the Greenwald-Khanna algorithm. The Greenwald-Khanna algorithm is an algorithm used to compute quantiles on a stream of data in a highly efficient manner. It was introduced by Michael Greenwald and Sanjeev Khanna in 2001. It is widely used in databases and big data systems where computing accurate quantiles on a large stream of data in real-time is necessary. The algorithm is highly efficient, taking only O(log n) space and O(log log n) time per item (where n is the size of the input). It is also highly accurate, providing an approximate quantile value with high probability.
quantileGK
is different from other quantile functions because it enables the user to control the accuracy of the approximate quantile result.
Syntax
quantileGK(accuracy, level)(expr)
Alias: medianGK
.
Arguments
accuracy
: Accuracy of quantile. Constant positive integer. Larger accuracy value means less error. For example, if the accuracy argument is set to 100, the computed quantile will have an error no greater than 1% with high probability. There is a trade-off between the accuracy of the computed quantiles and the computational complexity of the algorithm. A larger accuracy requires more memory and computational resources to compute the quantile accurately, while a smaller accuracy argument allows for a faster and more memory-efficient computation but with a slightly lower accuracy.level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. Default value: 0.5. Atlevel=0.5
the function calculates median.expr
: Expression over the column values resulting in numeric data types, Date or DateTime.
Returned value
- Quantile of the specified level and accuracy.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
SELECT quantileGK(1, 0.25)(number + 1) FROM numbers(1000)

┌─quantileGK(1, 0.25)(plus(number, 1))─┐
│                                    1 │
└──────────────────────────────────────┘

SELECT quantileGK(10, 0.25)(number + 1) FROM numbers(1000)

┌─quantileGK(10, 0.25)(plus(number, 1))─┐
│                                   156 │
└───────────────────────────────────────┘

SELECT quantileGK(100, 0.25)(number + 1) FROM numbers(1000)

┌─quantileGK(100, 0.25)(plus(number, 1))─┐
│                                    251 │
└────────────────────────────────────────┘

SELECT quantileGK(1000, 0.25)(number + 1) FROM numbers(1000)

┌─quantileGK(1000, 0.25)(plus(number, 1))─┐
│                                     249 │
└─────────────────────────────────────────┘
quantileInterpolatedWeighted¶
Computes quantile of a numeric data sequence using linear interpolation, taking into account the weight of each element.
To get the interpolated value, all the passed values are combined into an array, which are then sorted by their corresponding weights. Quantile interpolation is then performed using the weighted percentile method by building a cumulative distribution based on weights and then a linear interpolation is performed using the weights and the values to compute the quantiles.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
Syntax
quantileInterpolatedWeighted(level)(expr, weight)
Alias: medianInterpolatedWeighted
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using alevel
value in the range of[0.01, 0.99]
. Default value: 0.5. Atlevel=0.5
the function calculates median.expr
: Expression over the column values resulting in numeric data types, Date or DateTime.weight
: Column with weights of sequence members. Weight is a number of value occurrences.
Returned value
- Quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
Input table:
┌─n─┬─val─┐
│ 0 │   3 │
│ 1 │   2 │
│ 2 │   1 │
│ 5 │   4 │
└───┴─────┘
Query:
SELECT quantileInterpolatedWeighted(n, val) FROM t
Result:
┌─quantileInterpolatedWeighted(n, val)─┐
│                                    1 │
└──────────────────────────────────────┘
quantiles¶
Syntax: quantiles(level1, level2, ...)(x)
All the quantile functions also have corresponding quantiles functions: quantiles
, quantilesDeterministic
, quantilesTiming
, quantilesTimingWeighted
, quantilesExact
, quantilesExactWeighted
, quantilesExactWeightedInterpolated
, quantilesInterpolatedWeighted
, quantilesTDigest
, quantilesBFloat16
, quantilesDD
. These functions calculate all the quantiles of the listed levels in one pass, and return an array of the resulting values.
quantilesExactExclusive¶
Exactly computes the quantiles of a numeric data sequence.
To get exact value, all the passed values are combined into an array, which is then partially sorted. Therefore, the function consumes O(n)
memory, where n
is the number of values passed. However, for a small number of values, the function is very effective.
This function is equivalent to the PERCENTILE.EXC Excel function (type R6).
Works more efficiently with sets of levels than quantileExactExclusive
.
Syntax
quantilesExactExclusive(level1, level2, ...)(expr)
Arguments
expr
: Expression over the column values resulting in numeric data types, Date or DateTime.
Parameters
level
: Levels of quantiles. Possible values: (0, 1): bounds not included. Float.
Returned value
- Array of quantiles of the specified levels.
Type of array values:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
quantilesExactInclusive¶
Exactly computes the quantiles of a numeric data sequence.
To get exact value, all the passed values are combined into an array, which is then partially sorted. Therefore, the function consumes O(n)
memory, where n
is the number of values passed. However, for a small number of values, the function is very effective.
This function is equivalent to the PERCENTILE.INC Excel function (type R7).
Works more efficiently with sets of levels than quantileExactInclusive
.
Syntax
quantilesExactInclusive(level1, level2, ...)(expr)
Arguments
expr
: Expression over the column values resulting in numeric data types, Date or DateTime.
Parameters
level
: Levels of quantiles. Possible values: [0, 1]: bounds included. Float.
Returned value
- Array of quantiles of the specified levels.
Type of array values:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
quantilesGK¶
quantilesGK
works similarly to quantileGK
but allows calculating quantiles at several levels simultaneously and returns an array.
Syntax
quantilesGK(accuracy, level1, level2, ...)(expr)
Returned value
- Array of quantiles of the specified levels.
Type of array values:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
Query:
SELECT quantilesGK(1, 0.25, 0.5, 0.75)(number + 1) FROM numbers(1000)

┌─quantilesGK(1, 0.25, 0.5, 0.75)(plus(number, 1))─┐
│ [1,1,1]                                          │
└──────────────────────────────────────────────────┘

SELECT quantilesGK(10, 0.25, 0.5, 0.75)(number + 1) FROM numbers(1000)

┌─quantilesGK(10, 0.25, 0.5, 0.75)(plus(number, 1))─┐
│ [156,413,659]                                     │
└───────────────────────────────────────────────────┘

SELECT quantilesGK(100, 0.25, 0.5, 0.75)(number + 1) FROM numbers(1000)

┌─quantilesGK(100, 0.25, 0.5, 0.75)(plus(number, 1))─┐
│ [251,498,741]                                      │
└────────────────────────────────────────────────────┘

SELECT quantilesGK(1000, 0.25, 0.5, 0.75)(number + 1) FROM numbers(1000)

┌─quantilesGK(1000, 0.25, 0.5, 0.75)(plus(number, 1))─┐
│ [249,499,749]                                       │
└─────────────────────────────────────────────────────┘
quantileTDigest¶
Computes an approximate quantile of a numeric data sequence using the t-digest algorithm.
Memory consumption is log(n)
, where n
is the number of values. The result depends on the order of running the query and is nondeterministic.
The performance of the function is lower than that of quantile
or quantileTiming
. In terms of the ratio of State size to precision, this function is much better than quantile
.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
Syntax
quantileTDigest(level)(expr)
Alias: medianTDigest
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates the median.
expr
: Expression over the column values resulting in numeric data types, Date or DateTime.
Returned value
- Approximate quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
Query:
SELECT quantileTDigest(number) FROM numbers(10)
Result:
┌─quantileTDigest(number)─┐ │ 4.5 │ └─────────────────────────┘
quantileTDigestWeighted¶
Computes an approximate quantile of a numeric data sequence using the t-digest algorithm. The function takes into account the weight of each sequence member. The maximum error is 1%. Memory consumption is log(n)
, where n
is the number of values.
The performance of the function is lower than that of quantile
or quantileTiming
. In terms of the ratio of State size to precision, this function is much better than quantile
.
The result depends on the order of running the query, and is nondeterministic.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
Syntax
quantileTDigestWeighted(level)(expr, weight)
Alias: medianTDigestWeighted
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates the median.
expr
: Expression over the column values resulting in numeric data types, Date or DateTime.
weight
: Column with weights of sequence elements. Weight is a number of value occurrences.
Returned value
- Approximate quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
Query:
SELECT quantileTDigestWeighted(number, 1) FROM numbers(10)
Result:
┌─quantileTDigestWeighted(number, 1)─┐ │ 4.5 │ └────────────────────────────────────┘
quantileTiming¶
Computes the quantile of a numeric data sequence with determined precision.
The result is deterministic (it does not depend on the query processing order). The function is optimized for working with sequences which describe distributions like web page loading times or backend response times.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
Syntax
quantileTiming(level)(expr)
Alias: medianTiming
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates the median.
expr
: Expression over column values returning a Float* type number.
- If negative values are passed to the function, the behavior is undefined.
- If the value is greater than 30,000 (a page loading time of more than 30 seconds), it is assumed to be 30,000.
Accuracy
The calculation is accurate if:
- Total number of values does not exceed 5670.
- Total number of values exceeds 5670, but the page loading time is less than 1024ms.
Otherwise, the result of the calculation is rounded to the nearest multiple of 16 ms.
For calculating page loading time quantiles, this function is more effective and accurate than quantile
.
Returned value
- Quantile of the specified level.
Type: Float32
.
If no values are passed to the function (when using quantileTimingIf
), NaN is returned. The purpose of this is to differentiate these cases from cases that result in zero.
Example
Input table:
┌─response_time─┐ │ 72 │ │ 112 │ │ 126 │ │ 145 │ │ 104 │ │ 242 │ │ 313 │ │ 168 │ │ 108 │ └───────────────┘
Query:
SELECT quantileTiming(response_time) FROM t
Result:
┌─quantileTiming(response_time)─┐ │ 126 │ └───────────────────────────────┘
quantileTimingWeighted¶
Computes the quantile of a numeric data sequence with determined precision, according to the weight of each sequence member.
The result is deterministic (it does not depend on the query processing order). The function is optimized for working with sequences which describe distributions like web page loading times or backend response times.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
Syntax
quantileTimingWeighted(level)(expr, weight)
Alias: medianTimingWeighted
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates the median.
expr
: Expression over column values returning a Float* type number.
- If negative values are passed to the function, the behavior is undefined.
- If the value is greater than 30,000 (a page loading time of more than 30 seconds), it is assumed to be 30,000.
weight
: Column with weights of sequence elements. Weight is a number of value occurrences.
Accuracy
The calculation is accurate if:
- Total number of values does not exceed 5670.
- Total number of values exceeds 5670, but the page loading time is less than 1024ms.
Otherwise, the result of the calculation is rounded to the nearest multiple of 16 ms.
For calculating page loading time quantiles, this function is more effective and accurate than quantile
.
Returned value
- Quantile of the specified level.
Type: Float32
.
If no values are passed to the function (when using quantileTimingIf
), NaN is returned. The purpose of this is to differentiate these cases from cases that result in zero.
Example
Input table:
┌─response_time─┬─weight─┐ │ 68 │ 1 │ │ 104 │ 2 │ │ 112 │ 3 │ │ 126 │ 2 │ │ 138 │ 1 │ │ 162 │ 1 │ └───────────────┴────────┘
Query:
SELECT quantileTimingWeighted(response_time, weight) FROM t
Result:
┌─quantileTimingWeighted(response_time, weight)─┐ │ 112 │ └───────────────────────────────────────────────┘
quantilesTimingWeighted¶
Same as quantileTimingWeighted
, but accepts multiple parameters with quantile levels and returns an array filled with quantiles of those levels.
Example
Input table:
┌─response_time─┬─weight─┐ │ 68 │ 1 │ │ 104 │ 2 │ │ 112 │ 3 │ │ 126 │ 2 │ │ 138 │ 1 │ │ 162 │ 1 │ └───────────────┴────────┘
Query:
SELECT quantilesTimingWeighted(0.5, 0.99)(response_time, weight) FROM t
Result:
┌─quantilesTimingWeighted(0.5, 0.99)(response_time, weight)─┐ │ [112,162] │ └───────────────────────────────────────────────────────────┘
rankCorr¶
Computes a rank correlation coefficient.
Syntax
rankCorr(x, y)
Arguments
x
: Arbitrary value. Float32 or Float64.
y
: Arbitrary value. Float32 or Float64.
Returned value(s)
- Returns a rank correlation coefficient of the ranks of x and y. The value of the correlation coefficient ranges from -1 to +1. If fewer than two arguments are passed, the function throws an exception. A value close to +1 denotes a high monotonic relationship: as one random variable increases, the second random variable also increases. A value close to -1 denotes a high inverse monotonic relationship: as one random variable increases, the second random variable decreases. A value close or equal to 0 denotes no relationship between the two random variables.
Type: Float64.
Example
Query:
SELECT rankCorr(number, number) FROM numbers(100)
Result:
┌─rankCorr(number, number)─┐ │ 1 │ └──────────────────────────┘
Query:
SELECT roundBankers(rankCorr(exp(number), sin(number)), 3) FROM numbers(100)
Result:
┌─roundBankers(rankCorr(exp(number), sin(number)), 3)─┐ │ -0.037 │ └─────────────────────────────────────────────────────┘
simpleLinearRegression¶
Performs simple (unidimensional) linear regression.
simpleLinearRegression(x, y)
Parameters:
x
: Column with explanatory variable values.
y
: Column with dependent variable values.
Returned values:
Constants (k, b)
of the resulting line y = k*x + b
.
Examples
SELECT arrayReduce('simpleLinearRegression', [0, 1, 2, 3], [0, 1, 2, 3])
┌─arrayReduce('simpleLinearRegression', [0, 1, 2, 3], [0, 1, 2, 3])─┐ │ (1,0) │ └───────────────────────────────────────────────────────────────────┘
SELECT arrayReduce('simpleLinearRegression', [0, 1, 2, 3], [3, 4, 5, 6])
┌─arrayReduce('simpleLinearRegression', [0, 1, 2, 3], [3, 4, 5, 6])─┐ │ (1,3) │ └───────────────────────────────────────────────────────────────────┘
singleValueOrNull¶
The aggregate function singleValueOrNull
is used to implement subquery operators, such as x = ALL (SELECT ...)
. It checks if there is only one unique non-NULL value in the data. If there is only one unique value, it returns it. If there are zero or at least two distinct values, it returns NULL.
Syntax
singleValueOrNull(x)
Parameters
x
: Column of any data type (except Map, Array or Tuple which cannot be of type Nullable).
Returned values
- The unique value, if there is only one unique non-NULL value in
x
. NULL
, if there are zero or at least two distinct values.
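Example
A minimal sketch: with a single distinct non-NULL value the function returns that value, and with two distinct values it returns NULL:
Query:
SELECT singleValueOrNull(x) FROM (SELECT arrayJoin([1, 1, 1]) AS x)
Result:
┌─singleValueOrNull(x)─┐ │ 1 │ └──────────────────────┘
Query:
SELECT singleValueOrNull(x) FROM (SELECT arrayJoin([1, 2]) AS x)
Result:
┌─singleValueOrNull(x)─┐ │ ᴺᵁᴸᴸ │ └──────────────────────┘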
skewPop¶
Computes the skewness of a sequence.
skewPop(expr)
Arguments
expr
: Expression returning a number.
Returned value
The skewness of the given distribution. Type: Float64.
Example
SELECT skewPop(value) FROM series_with_value_column
skewSamp¶
Computes the sample skewness of a sequence.
It represents an unbiased estimate of the skewness of a random variable if passed values form its sample.
skewSamp(expr)
Arguments
expr
: Expression returning a number.
Returned value
The skewness of the given distribution. Type: Float64. If n <= 1
(n
is the size of the sample), then the function returns nan
.
Example
SELECT skewSamp(value) FROM series_with_value_column
sparkbar¶
The function plots a frequency histogram for values x
and the repetition rate y
of these values over the interval [min_x, max_x]
. Repetitions for all x
falling into the same bucket are averaged, so data should be pre-aggregated. Negative repetitions are ignored.
If no interval is specified, then the minimum x
is used as the interval start, and the maximum x
as the interval end. Otherwise, values outside the interval are ignored.
Syntax
sparkbar(buckets[, min_x, max_x])(x, y)
Parameters
buckets
: The number of segments. Type: Integer.
min_x
: The interval start. Optional parameter.
max_x
: The interval end. Optional parameter.
Arguments
x
: The field with values.
y
: The field with the frequency of values.
Returned value
- The frequency histogram.
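Example
As an illustrative sketch, the query below buckets the values 0–9 into 9 segments, using each value itself as its frequency; the result is a short unicode bar string whose bars grow from left to right (the exact characters depend on the rendering, so no result is shown here):
Query:
SELECT sparkbar(9)(number, number) FROM numbers(10)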
stddevPop¶
The result is equal to the square root of varPop
.
Aliases: STD
, STDDEV_POP
.
This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the stddevPopStable
function. It works slower but provides a lower computational error.
Syntax
stddevPop(x)
Parameters
x
: Population of values to find the standard deviation of. (U)Int*, Float*, Decimal*.
Returned value
- Square root of the population variance of
x
(i.e. the standard deviation). Float64.
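Example
For instance, for the two values 0 and 1 the population variance is 0.25, so the standard deviation is 0.5:
Query:
SELECT stddevPop(number) FROM numbers(2)
Result:
┌─stddevPop(number)─┐ │ 0.5 │ └───────────────────┘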
stddevPopStable¶
The result is equal to the square root of varPop
. Unlike stddevPop
, this function uses a numerically stable algorithm. It works slower but provides a lower computational error.
Syntax
stddevPopStable(x)
Parameters
x
: Population of values to find the standard deviation of. (U)Int*, Float*, Decimal*.
Returned value
Square root of the population variance of x
(i.e. the standard deviation). Float64.
stddevSamp¶
The result is equal to the square root of varSamp
.
Alias: STDDEV_SAMP
.
This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the stddevSampStable
function. It works slower but provides a lower computational error.
Syntax
stddevSamp(x)
Parameters
x
: Values for which to find the square root of sample variance. (U)Int*, Float*, Decimal*.
Returned value
Square root of sample variance of x
. Float64.
stddevSampStable¶
The result is equal to the square root of varSamp
. Unlike stddevSamp
, this function uses a numerically stable algorithm. It works slower but provides a lower computational error.
Syntax
stddevSampStable(x)
Parameters
x
: Values for which to find the square root of sample variance. (U)Int*, Float*, Decimal*.
Returned value
Square root of sample variance of x
. Float64.
stochasticLinearRegression {#agg_functions_stochasticlinearregression_parameters}¶
This function implements stochastic linear regression. It supports custom parameters for learning rate, L2 regularization coefficient, mini-batch size, and has a few methods for updating weight: Adam (used by default), simple SGD, Momentum, and Nesterov.
Parameters
There are 4 customizable parameters. They are passed to the function sequentially, but there is no need to pass all four - default values will be used. However, a good model requires some parameter tuning.
stochasticLinearRegression(0.00001, 0.1, 15, 'Adam')
learning rate
is the coefficient on step length when the gradient descent step is performed. A learning rate that is too big may cause infinite weights of the model. Default is 0.00001.
l2 regularization coefficient
, which may help to prevent overfitting. Default is 0.1.
mini-batch size
sets the number of elements whose gradients will be computed and summed to perform one step of gradient descent. Pure stochastic descent uses one element; however, having small batches (about 10 elements) makes gradient steps more stable. Default is 15.
method for updating weights
: Adam (by default), SGD, Momentum, and Nesterov. Momentum and Nesterov require a little more computation and memory, but they happen to be useful in terms of speed of convergence and stability of stochastic gradient methods.
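A typical usage pattern, sketched below with hypothetical train_data and test_data tables (first column the target, the remaining columns the features), is to store the trained state via the -State combinator and then apply it with the evalMLMethod function:
CREATE TABLE your_model ENGINE = Memory AS SELECT stochasticLinearRegressionState(0.1, 0.0, 5, 'SGD')(target, param1, param2) AS state FROM train_data
WITH (SELECT state FROM your_model) AS model SELECT evalMLMethod(model, param1, param2) FROM test_data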
stochasticLogisticRegression¶
This function implements stochastic logistic regression. It can be used for binary classification problems, supports the same custom parameters as stochasticLinearRegression, and works the same way.
Parameters
Parameters are exactly the same as in stochasticLinearRegression: learning rate
, l2 regularization coefficient
, mini-batch size
, method for updating weights
. For more information see parameters.
stochasticLogisticRegression(1.0, 1.0, 10, 'SGD')
studentTTest¶
Applies Student's t-test to samples from two populations.
Syntax
studentTTest([confidence_level])(sample_data, sample_index)
Values of both samples are in the sample_data
column. If sample_index
equals 0, the value in that row belongs to the sample from the first population; otherwise it belongs to the sample from the second population. The null hypothesis is that the means of the populations are equal. A normal distribution with equal variances is assumed.
Arguments
sample_data
: Sample data. Integer, Float or Decimal.
sample_index
: Sample index. Integer.
Parameters
confidence_level
: Confidence level in order to calculate confidence intervals. Float.
Returned values
Tuple with two or four elements (if the optional confidence_level
is specified):
- calculated t-statistic. Float64.
- calculated p-value. Float64.
- calculated confidence-interval-low. Float64.
- calculated confidence-interval-high. Float64.
Example
Input table:
┌─sample_data─┬─sample_index─┐ │ 20.3 │ 0 │ │ 21.1 │ 0 │ │ 21.9 │ 1 │ │ 21.7 │ 0 │ │ 19.9 │ 1 │ │ 21.8 │ 1 │ └─────────────┴──────────────┘
Query:
SELECT studentTTest(sample_data, sample_index) FROM student_ttest
Result:
┌─studentTTest(sample_data, sample_index)───┐ │ (-0.21739130434783777,0.8385421208415731) │ └───────────────────────────────────────────┘
sum¶
Calculates the sum. Only works for numbers.
Syntax
sum(num)
Parameters
num
: Column of numeric values. (U)Int*, Float*, Decimal*.
Returned value
- The sum of the values. (U)Int*, Float*, Decimal*.
sumCount¶
Calculates the sum of the numbers and counts the number of rows at the same time.
Syntax
sumCount(x)
Arguments
x
: Input value, must be Integer, Float, or Decimal.
Returned value
- Tuple (sum, count), where sum is the sum of numbers and count is the number of rows with non-NULL values.
Type: Tuple.
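Example
For instance, over numbers(5) (the values 0–4) the function returns the sum 10 and the row count 5 as one tuple:
Query:
SELECT sumCount(number) FROM numbers(5)
Result:
┌─sumCount(number)─┐ │ (10,5) │ └──────────────────┘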
sumKahan¶
Calculates the sum of the numbers with the Kahan compensated summation algorithm. Slower than the sum
function. The compensation works only for Float types.
Syntax
sumKahan(x)
Arguments
x
: Input value, must be Integer, Float, or Decimal.
Returned value
- The sum of numbers, with type Integer, Float, or Decimal depending on the type of the input arguments.
Example
Query:
SELECT sum(0.1), sumKahan(0.1) FROM numbers(10)
Result:
┌───────────sum(0.1)─┬─sumKahan(0.1)─┐ │ 0.9999999999999999 │ 1 │ └────────────────────┴───────────────┘
sumMap¶
Totals a value
array according to the keys specified in the key
array. Returns a tuple of two arrays: keys in sorted order, and values summed for the corresponding keys without overflow.
Syntax
sumMap(key <Array>, value <Array>)
for the Array type.
sumMap(Tuple(key <Array>, value <Array>))
for the Tuple type.
Alias: sumMappedArrays
.
Arguments
key
: Array of keys.
value
: Array of values.
Passing a tuple of key and value arrays is a synonym to passing separately an array of keys and an array of values.
The number of elements in key
and value
must be the same for each row that is totaled.
Returned Value
- Returns a tuple of two arrays: keys in sorted order, and values summed for the corresponding keys.
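Example
A minimal sketch in the VALUES style used elsewhere on this page: key 2 occurs in both rows, so its values are summed, while keys 1 and 3 occur once each:
Query:
SELECT sumMap(k, v) FROM VALUES('k Array(UInt16), v Array(UInt64)', ([1, 2], [10, 10]), ([2, 3], [10, 10]))
Result:
┌─sumMap(k, v)─────────┐ │ ([1,2,3],[10,20,10]) │ └──────────────────────┘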
sumMapWithOverflow¶
Totals a value
array according to the keys specified in the key
array. Returns a tuple of two arrays: keys in sorted order, and values summed for the corresponding keys. It differs from the sumMap
function in that it does summation with overflow - i.e. returns the same data type for the summation as the argument data type.
Syntax
sumMapWithOverflow(key <Array>, value <Array>)
for the Array type.
sumMapWithOverflow(Tuple(key <Array>, value <Array>))
for the Tuple type.
Arguments
key
: Array of keys.
value
: Array of values.
Passing a tuple of key and value arrays is a synonym to passing separately an array of keys and an array of values.
The number of elements in key
and value
must be the same for each row that is totaled.
Returned Value
- Returns a tuple of two arrays: keys in sorted order, and values summed for the corresponding keys.
sumWithOverflow¶
Computes the sum of the numbers, using the same data type for the result as for the input parameters. If the sum exceeds the maximum value for this data type, it is calculated with overflow.
Only works for numbers.
Syntax
sumWithOverflow(num)
Parameters
num
: Column of numeric values. (U)Int*, Float*, Decimal*.
Returned value
- The sum of the values. (U)Int*, Float*, Decimal*.
theilsU¶
The theilsU
function calculates the Theil's U uncertainty coefficient, a value that measures the association between two columns in a table. Its values range from −1.0 (100% negative association, or perfect inversion) to +1.0 (100% positive association, or perfect agreement). A value of 0.0 indicates the absence of association.
Syntax
theilsU(column1, column2)
Arguments
column1
and column2
are the columns to be compared.
Returned value
- a value between -1 and 1
Return type is always Float64.
Example
The two columns compared below have a small association with each other, so the value of theilsU
is negative:
SELECT theilsU(a ,b) FROM ( SELECT number % 10 AS a, number % 4 AS b FROM numbers(150) )
Result:
┌────────theilsU(a, b)─┐ │ -0.30195720557678846 │ └──────────────────────┘
topK¶
Returns an array of the approximately most frequent values in the specified column. The resulting array is sorted in descending order of approximate frequency of values (not by the values themselves).
Implements the Filtered Space-Saving algorithm for analyzing TopK, based on the reduce-and-combine algorithm from Parallel Space Saving.
topK(N)(column) topK(N, load_factor)(column) topK(N, load_factor, 'counts')(column)
This function does not provide a guaranteed result. In certain situations, errors might occur and it might return frequent values that aren’t the most frequent values.
We recommend using N < 10
; performance is reduced with large N
values. The maximum value of N is 65536
.
Parameters
N
: The number of elements to return. Optional. Default value: 10.
load_factor
: Defines how many cells are reserved for values. If uniq(column) > N * load_factor, the result of the topK function will be approximate. Optional. Default value: 3.
counts
: Defines whether the result should contain the approximate count and error value.
Arguments
column
: The value to calculate frequency for.
Example
Take the OnTime data set and select the three most frequently occurring values in the AirlineID
column.
SELECT topK(3)(AirlineID) AS res FROM ontime
┌─res─────────────────┐ │ [19393,19790,19805] │ └─────────────────────┘
topKWeighted¶
Returns an array of the approximately most frequent values in the specified column. The resulting array is sorted in descending order of approximate frequency of values (not by the values themselves). Additionally, the weight of the value is taken into account.
Syntax
topKWeighted(N)(column, weight) topKWeighted(N, load_factor)(column, weight) topKWeighted(N, load_factor, 'counts')(column, weight)
Parameters
N
: The number of elements to return. Optional. Default value: 10.
load_factor
: Defines how many cells are reserved for values. If uniq(column) > N * load_factor, the result of the topK function will be approximate. Optional. Default value: 3.
counts
: Defines whether the result should contain the approximate count and error value.
Arguments
column
: The value.
weight
: The weight. Every value is accounted weight times for frequency calculation. UInt64.
Returned value
Returns an array of the values with maximum approximate sum of weights.
Example
Query:
SELECT topKWeighted(2)(k, w) FROM VALUES('k Char, w UInt64', ('y', 1), ('y', 1), ('x', 5), ('y', 1), ('z', 10))
Result:
┌─topKWeighted(2)(k, w)──┐ │ ['z','x'] │ └────────────────────────┘
Query:
SELECT topKWeighted(2, 10, 'counts')(k, w) FROM VALUES('k Char, w UInt64', ('y', 1), ('y', 1), ('x', 5), ('y', 1), ('z', 10))
Result:
┌─topKWeighted(2, 10, 'counts')(k, w)─┐ │ [('z',10,0),('x',5,0)] │ └─────────────────────────────────────┘
uniq¶
Calculates the approximate number of different values of the argument.
uniq(x[, ...])
Arguments
The function takes a variable number of parameters. Parameters can be Tuple
, Array
, Date
, DateTime
, String
, or numeric types.
Returned value
- A UInt64-type number.
Implementation details
Function:
Calculates a hash for all parameters in the aggregate, then uses it in calculations.
Uses an adaptive sampling algorithm. For the calculation state, the function uses a sample of element hash values up to 65536. This algorithm is very accurate and very efficient on the CPU. When the query contains several of these functions, using uniq
is almost as fast as using other aggregate functions.
Provides the result deterministically (it does not depend on the query processing order).
We recommend using this function in almost all scenarios.
uniqCombined¶
Calculates the approximate number of different argument values.
uniqCombined(HLL_precision)(x[, ...])
The uniqCombined
function is a good choice for calculating the number of different values.
Arguments
HLL_precision
: The base-2 logarithm of the number of cells in HyperLogLog. Optional, you can use the function as uniqCombined(x[, ...]). The default value for HLL_precision is 17, which is effectively 96 KiB of space (2^17 cells, 6 bits each).
x
: A variable number of parameters. Parameters can be Tuple, Array, Date, DateTime, String, or numeric types.
Returned value
- A UInt64-type number.
Implementation details
The uniqCombined
function:
- Calculates a hash (64-bit hash for
String
and 32-bit otherwise) for all parameters in the aggregate, then uses it in calculations. - Uses a combination of three algorithms: array, hash table, and HyperLogLog with an error correction table.
- For a small number of distinct elements, an array is used.
- When the set size is larger, a hash table is used.
- For a larger number of elements, HyperLogLog is used, which will occupy a fixed amount of memory.
- Provides the result deterministically (it does not depend on the query processing order).
Since it uses a 32-bit hash for non-String
types, the result will have very high error for cardinalities significantly larger than UINT_MAX
(the error rises quickly after a few tens of billions of distinct values), so in this case you should use uniqCombined64.
Compared to the uniq function, the uniqCombined function:
- Consumes several times less memory.
- Calculates with several times higher accuracy.
- Usually has slightly lower performance. In some scenarios, uniqCombined can perform better than uniq, for example, with distributed queries that transmit a large number of aggregation states over the network.
Example
Query:
SELECT uniqCombined(number) FROM numbers(1e6)
Result:
┌─uniqCombined(number)─┐ │ 1001148 │ -- 1.00 million └──────────────────────┘
See the example section of uniqCombined64 for an example of the difference between uniqCombined and uniqCombined64 for much larger inputs.
uniqCombined64¶
Calculates the approximate number of different argument values. It is the same as uniqCombined, but uses a 64-bit hash for all data types rather than just for the String data type.
uniqCombined64(HLL_precision)(x[, ...])
Parameters
HLL_precision
: The base-2 logarithm of the number of cells in HyperLogLog. Optionally, you can use the function as uniqCombined64(x[, ...]). The default value for HLL_precision is 17, which is effectively 96 KiB of space (2^17 cells, 6 bits each).
x
: A variable number of parameters. Parameters can be Tuple, Array, Date, DateTime, String, or numeric types.
Returned value
- A UInt64-type number.
Implementation details
The uniqCombined64
function:
- Calculates a hash (64-bit hash for all data types) for all parameters in the aggregate, then uses it in calculations.
- Uses a combination of three algorithms: array, hash table, and HyperLogLog with an error correction table.
- For a small number of distinct elements, an array is used.
- When the set size is larger, a hash table is used.
- For a larger number of elements, HyperLogLog is used, which will occupy a fixed amount of memory.
- Provides the result deterministically (it does not depend on the query processing order).
Since it uses a 64-bit hash for all types, the result does not suffer from very high error for cardinalities significantly larger than UINT_MAX
like uniqCombined does, which uses a 32-bit hash for non-String
types.
Compared to the uniq function, the uniqCombined64 function:
- Consumes several times less memory.
- Calculates with several times higher accuracy.
Example
In the example below uniqCombined64
is run on 1e10
different numbers returning a very close approximation of the number of different argument values.
Query:
SELECT uniqCombined64(number) FROM numbers(1e10)
Result:
┌─uniqCombined64(number)─┐ │ 9998568925 │ -- 10.00 billion └────────────────────────┘
By comparison the uniqCombined
function returns a rather poor approximation for an input this size.
Query:
SELECT uniqCombined(number) FROM numbers(1e10)
Result:
┌─uniqCombined(number)─┐ │ 5545308725 │ -- 5.55 billion └──────────────────────┘
uniqExact¶
Calculates the exact number of different argument values.
uniqExact(x[, ...])
Use the uniqExact
function if you absolutely need an exact result. Otherwise use the uniq function.
The uniqExact
function uses more memory than uniq
, because the size of the state has unbounded growth as the number of different values increases.
Arguments
The function takes a variable number of parameters. Parameters can be Tuple
, Array
, Date
, DateTime
, String
, or numeric types.
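Example
For instance, counting the distinct values of numbers(1000) yields exactly 1000, whereas the approximate uniq variants may deviate slightly on larger inputs:
Query:
SELECT uniqExact(number) FROM numbers(1000)
Result:
┌─uniqExact(number)─┐ │ 1000 │ └───────────────────┘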
uniqHLL12¶
Calculates the approximate number of different argument values, using the HyperLogLog algorithm.
uniqHLL12(x[, ...])
Arguments
The function takes a variable number of parameters. Parameters can be Tuple
, Array
, Date
, DateTime
, String
, or numeric types.
Returned value
- A UInt64-type number.
Implementation details
Function:
Calculates a hash for all parameters in the aggregate, then uses it in calculations.
Uses the HyperLogLog algorithm to approximate the number of different argument values.
2^12 5-bit cells are used. The size of the state is slightly more than 2.5 KB. The result is not very accurate (up to ~10% error) for small data sets (<10K elements). However, the result is fairly accurate for high-cardinality data sets (10K-100M), with a maximum error of ~1.6%. Starting from 100M, the estimation error increases, and the function will return very inaccurate results for data sets with extremely high cardinality (1B+ elements).
Provides a deterministic result (it does not depend on the query processing order).
We do not recommend using this function. In most cases, use the uniq or uniqCombined function.
uniqTheta¶
Calculates the approximate number of different argument values, using the Theta Sketch Framework.
uniqTheta(x[, ...])
Arguments
The function takes a variable number of parameters. Parameters can be Tuple
, Array
, Date
, DateTime
, String
, or numeric types.
Returned value
- A UInt64-type number.
Implementation details
Function:
Calculates a hash for all parameters in the aggregate, then uses it in calculations.
Uses the KMV algorithm to approximate the number of different argument values.
4096 (2^12) 64-bit sketches are used. The size of the state is about 41 KB.
The relative error is 3.125% (95% confidence), see the relative error table for detail.
varPop¶
Calculates the population variance.
Syntax
varPop(x)
Alias: VAR_POP
.
Parameters
x
: Population of values to find the population variance of. (U)Int*, Float*, Decimal*.
Returned value
- Returns the population variance of
x
. Float64.
varPopStable¶
Returns the population variance. Unlike varPop, this function uses a numerically stable algorithm. It works slower but provides a lower computational error.
Syntax
varPopStable(x)
Alias: VAR_POP_STABLE
.
Parameters
x
: Population of values to find the population variance of. (U)Int*, Float*, Decimal*.
Returned value
- Returns the population variance of
x
. Float64.
varSamp¶
Calculates the sample variance of a data set.
Syntax
varSamp(x)
Alias: VAR_SAMP
.
Parameters
x
: The population for which you want to calculate the sample variance. (U)Int*, Float*, Decimal*.
Returned value
- Returns the sample variance of the input data set
x
. Float64.
Implementation details
The varSamp
function calculates the sample variance using the following formula:
$$ \sum\frac{(x - \text{mean}(x))^2}{(n - 1)} $$
Where:
x
is each individual data point in the data set.
mean(x)
is the arithmetic mean of the data set.
n
is the number of data points in the data set.
The function assumes that the input data set represents a sample from a larger population. If you want to calculate the variance of the entire population (when you have the complete data set), you should use varPop
instead.
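For instance, plugging the three values 0, 1 and 2 into the formula (mean 1, squared deviations summing to 2, divided by n − 1 = 2) gives a sample variance of 1:
Query:
SELECT varSamp(number) FROM numbers(3)
Result:
┌─varSamp(number)─┐ │ 1 │ └─────────────────┘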
varSampStable¶
Calculates the sample variance of a data set. Unlike varSamp, this function uses a numerically stable algorithm. It works slower but provides a lower computational error.
Syntax
varSampStable(x)
Alias: VAR_SAMP_STABLE
Parameters
x
: The population for which you want to calculate the sample variance. (U)Int*, Float*, Decimal*.
Returned value
- Returns the sample variance of the input data set. Float64.
Implementation details
The varSampStable
function calculates the sample variance using the same formula as varSamp
:
$$ \sum\frac{(x - \text{mean}(x))^2}{(n - 1)} $$
Where:
x
is each individual data point in the data set.
mean(x)
is the arithmetic mean of the data set.
n
is the number of data points in the data set.
welchTTest¶
Applies Welch's t-test to samples from two populations.
Syntax
welchTTest([confidence_level])(sample_data, sample_index)
Values of both samples are in the sample_data
column. If sample_index
equals 0, the value in that row belongs to the sample from the first population; otherwise it belongs to the sample from the second population. The null hypothesis is that the means of the populations are equal. A normal distribution is assumed. Populations may have unequal variance.
Arguments
sample_data
: Sample data. Integer, Float or Decimal.
sample_index
: Sample index. Integer.
Parameters
confidence_level
: Confidence level in order to calculate confidence intervals. Float.
Returned values
Tuple with two or four elements (if the optional confidence_level
is specified):
- calculated t-statistic. Float64.
- calculated p-value. Float64.
- calculated confidence-interval-low. Float64.
- calculated confidence-interval-high. Float64.
Example
Input table:
┌─sample_data─┬─sample_index─┐ │ 20.3 │ 0 │ │ 22.1 │ 0 │ │ 21.9 │ 0 │ │ 18.9 │ 1 │ │ 20.3 │ 1 │ │ 19 │ 1 │ └─────────────┴──────────────┘
Query:
SELECT welchTTest(sample_data, sample_index) FROM welch_ttest
Result:
┌─welchTTest(sample_data, sample_index)─────┐ │ (2.7988719532211235,0.051807360348581945) │ └───────────────────────────────────────────┘