Aggregate functions¶
Aggregate functions perform a calculation on a set of values and return a single value. They are commonly used in data analysis and reporting to summarize large datasets. These functions can be used to compute sums, averages, counts, and other statistical measures.
Aggregate Function Combinators¶
The name of an aggregate function can have a suffix appended to it. This changes the way the aggregate function works.
-If
The suffix -If can be appended to the name of any aggregate function. In this case, the aggregate function accepts an extra argument – a condition (UInt8 type). The aggregate function processes only the rows that trigger the condition. If the condition was not triggered even once, it returns a default value (usually zeros or empty strings).
Examples: sumIf(column, cond), countIf(cond), avgIf(x, cond), quantilesTimingIf(level1, level2)(x, cond), argMinIf(arg, val, cond), and so on.
With conditional aggregate functions, you can calculate aggregates for several conditions at once, without using subqueries and JOINs. For example, conditional aggregate functions can be used to implement the segment comparison functionality.
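As a sketch of the idea, the conditional variants let one query compute both the overall and the filtered aggregate in a single pass. The visits table and its revenue column below are hypothetical:

```sql
-- Aggregate all rows and only the rows matching a condition in one pass,
-- with no subqueries or JOINs. Table and column names are illustrative only.
SELECT
    count() AS total_visits,
    countIf(revenue > 0) AS paid_visits,
    avgIf(revenue, revenue > 0) AS avg_paid_revenue
FROM visits
```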
-Array
The -Array suffix can be appended to any aggregate function. In this case, the aggregate function takes arguments of the 'Array(T)' type (arrays) instead of 'T' type arguments. If the aggregate function accepts multiple arguments, these must be arrays of equal length. When processing arrays, the aggregate function works like the original aggregate function across all array elements.
Example 1: sumArray(arr) – totals all the elements of all 'arr' arrays. In this example, it could have been written more simply: sum(arraySum(arr)).
Example 2: uniqArray(arr) – counts the number of unique elements in all 'arr' arrays. This could be done an easier way: uniq(arrayJoin(arr)), but it's not always possible to add 'arrayJoin' to a query.
-If and -Array can be combined. However, 'Array' must come first, then 'If'. Examples: uniqArrayIf(arr, cond), quantilesTimingArrayIf(level1, level2)(arr, cond). Due to this order, the 'cond' argument won't be an array.
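A sketch of the combined combinators, with the data supplied inline via the values table function used elsewhere on this page; the column names are illustrative:

```sql
-- uniqArrayIf counts unique elements across arrays, but only for rows
-- where the (non-array) condition holds.
SELECT uniqArrayIf(arr, flag = 1)
FROM values('arr Array(UInt8), flag UInt8', ([1, 2], 1), ([2, 3], 1), ([7, 8], 0))
```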
-Map
The -Map suffix can be appended to any aggregate function. This creates an aggregate function which takes a Map type as an argument, and aggregates the values of each key of the map separately using the specified aggregate function. The result is also of Map type.
Example
WITH map_map AS ( SELECT c1::Date AS date, c2::DateTime AS timeslot, c3::Map(String, UInt16) AS status FROM values ( ('2000-01-01', '2000-01-01 00:00:00', (['a', 'b', 'c'], [10, 10, 10])), ('2000-01-01', '2000-01-01 00:00:00', (['c', 'd', 'e'], [10, 10, 10])), ('2000-01-01', '2000-01-01 00:01:00', (['d', 'e', 'f'], [10, 10, 10])), ('2000-01-01', '2000-01-01 00:01:00', (['f', 'g', 'g'], [10, 10, 10])) )) SELECT timeslot, sumMap(status), avgMap(status), minMap(status) FROM map_map GROUP BY timeslot ┌────────────timeslot─┬─sumMap(status)───────────────────────┬─avgMap(status)───────────────────────┬─minMap(status)───────────────────────┐ │ 2000-01-01 00:00:00 │ {'a':10,'b':10,'c':20,'d':10,'e':10} │ {'a':10,'b':10,'c':10,'d':10,'e':10} │ {'a':10,'b':10,'c':10,'d':10,'e':10} │ │ 2000-01-01 00:01:00 │ {'d':10,'e':10,'f':20,'g':20} │ {'d':10,'e':10,'f':10,'g':10} │ {'d':10,'e':10,'f':10,'g':10} │ └─────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┘
-SimpleState
If you apply this combinator, the aggregate function returns the same value but with a different type. This is a SimpleAggregateFunction(...) that can be stored in a table to work with AggregatingMergeTree tables.
Syntax
<aggFunction>SimpleState(x)
Arguments
x: Aggregate function parameters.
Returned values
The value of an aggregate function with the SimpleAggregateFunction(...) type.
Example
Query:
WITH anySimpleState(number) AS c SELECT toTypeName(c), c FROM numbers(1)
Result:
┌─toTypeName(c)────────────────────────┬─c─┐ │ SimpleAggregateFunction(any, UInt64) │ 0 │ └──────────────────────────────────────┴───┘
-State
If you apply this combinator, the aggregate function does not return the resulting value (such as the number of unique values for the uniq function), but an intermediate state of the aggregation (for uniq, this is the hash table for calculating the number of unique values). This is an AggregateFunction(...) that can be used for further processing or stored in a table to finish aggregating later.
Note that -MapState is not invariant for the same data, because the order of data in the intermediate state can change, though this does not affect ingestion of this data.
To work with these states, use:
- AggregatingMergeTree table engine.
- finalizeAggregation function.
- runningAccumulate function.
- -Merge combinator.
- -MergeState combinator.
-Merge
If you apply this combinator, the aggregate function takes the intermediate aggregation state as an argument, combines the states to finish aggregation, and returns the resulting value.
-MergeState
Merges the intermediate aggregation states in the same way as the -Merge combinator. However, it does not return the resulting value, but an intermediate aggregation state, similar to the -State combinator.
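A minimal illustration of the -State/-Merge round trip, not tied to any table engine:

```sql
-- uniqState produces an intermediate AggregateFunction state;
-- uniqMerge combines such states and finishes the aggregation.
SELECT uniqMerge(state)
FROM
(
    SELECT uniqState(number) AS state FROM numbers(10)
)
-- returns 10, the same as uniq(number) over the full input
```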
-ForEach
Converts an aggregate function for tables into an aggregate function for arrays that aggregates the corresponding array items and returns an array of results. For example, sumForEach for the arrays [1, 2], [3, 4, 5], and [6, 7] returns the result [10, 13, 5] after adding together the corresponding array items.
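The example above can be reproduced with the values table function:

```sql
SELECT sumForEach(arr)
FROM values('arr Array(UInt8)', [1, 2], [3, 4, 5], [6, 7])
-- [10, 13, 5]
```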
-Distinct
Every unique combination of arguments will be aggregated only once. Repeating values are ignored. Examples: sum(DISTINCT x) (or sumDistinct(x)), groupArray(DISTINCT x) (or groupArrayDistinct(x)), corrStable(DISTINCT x, y) (or corrStableDistinct(x, y)), and so on.
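A small sketch contrasting sum with its -Distinct variant:

```sql
SELECT sum(x), sumDistinct(x)
FROM values('x UInt8', 1, 1, 2, 3, 3)
-- sum(x) = 10, sumDistinct(x) = 6 (each unique value counted once)
```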
-OrDefault
Changes behavior of an aggregate function.
If an aggregate function does not have input values, with this combinator it returns the default value for its return data type. Applies to aggregate functions that can take empty input data.
-OrDefault can be used with other combinators.
Syntax
<aggFunction>OrDefault(x)
Arguments
x: Aggregate function parameters.
Returned values
Returns the default value of an aggregate function's return type if there is nothing to aggregate.
Type depends on the aggregate function used.
Example
Query:
SELECT avg(number), avgOrDefault(number) FROM numbers(0)
Result:
┌─avg(number)─┬─avgOrDefault(number)─┐ │ nan │ 0 │ └─────────────┴──────────────────────┘
-OrDefault can also be used with other combinators. This is useful when the aggregate function does not accept empty input.
Query:
SELECT avgOrDefaultIf(x, x > 10) FROM ( SELECT toDecimal32(1.23, 2) AS x )
Result:
┌─avgOrDefaultIf(x, greater(x, 10))─┐ │ 0.00 │ └───────────────────────────────────┘
-OrNull
Changes behavior of an aggregate function.
This combinator converts a result of an aggregate function to the Nullable data type. If the aggregate function does not have values to calculate it returns NULL.
-OrNull can be used with other combinators.
Syntax
<aggFunction>OrNull(x)
Arguments
x: Aggregate function parameters.
Returned values
- The result of the aggregate function, converted to the Nullable data type. NULL, if there is nothing to aggregate.
Type: Nullable(aggregate function return type).
Example
Add -OrNull to the end of the aggregate function.
Query:
SELECT sumOrNull(number), toTypeName(sumOrNull(number)) FROM numbers(10) WHERE number > 10
Result:
┌─sumOrNull(number)─┬─toTypeName(sumOrNull(number))─┐ │ ᴺᵁᴸᴸ │ Nullable(UInt64) │ └───────────────────┴───────────────────────────────┘
-OrNull can also be used with other combinators. This is useful when the aggregate function does not accept empty input.
Query:
SELECT avgOrNullIf(x, x > 10) FROM ( SELECT toDecimal32(1.23, 2) AS x )
Result:
┌─avgOrNullIf(x, greater(x, 10))─┐ │ ᴺᵁᴸᴸ │ └────────────────────────────────┘
-Resample
Lets you divide data into groups, and then separately aggregate the data in those groups. Groups are created by splitting the values from one column into intervals.
<aggFunction>Resample(start, end, step)(<aggFunction_params>, resampling_key)
Arguments
- start: Starting value of the whole required interval for resampling_key values.
- stop: Ending value of the whole required interval for resampling_key values. The whole interval does not include the stop value: [start, stop).
- step: Step for separating the whole interval into subintervals. The aggFunction is executed over each of those subintervals independently.
- resampling_key: Column whose values are used for separating data into intervals.
- aggFunction_params: aggFunction parameters.
Returned values
- Array of aggFunction results for each subinterval.
Example
Consider the people table with the following data:
┌─name───┬─age─┬─wage─┐ │ John │ 16 │ 10 │ │ Alice │ 30 │ 15 │ │ Mary │ 35 │ 8 │ │ Evelyn │ 48 │ 11.5 │ │ David │ 62 │ 9.9 │ │ Brian │ 60 │ 16 │ └────────┴─────┴──────┘
Let's get the names of the people whose age lies in the intervals of [30,60) and [60,75). Since we use integer representation for age, we get ages in the [30, 59] and [60, 74] intervals.
To aggregate names in an array, we use the groupArray aggregate function. It takes one argument. In our case, it's the name column. The groupArrayResample function should use the age column to aggregate names by age. To define the required intervals, we pass the 30, 75, 30 arguments into the groupArrayResample function.
SELECT groupArrayResample(30, 75, 30)(name, age) FROM people
┌─groupArrayResample(30, 75, 30)(name, age)─────┐ │ [['Alice','Mary','Evelyn'],['David','Brian']] │ └───────────────────────────────────────────────┘
Consider the results.
John is out of the sample because he's too young. Other people are distributed according to the specified age intervals.
Now let's count the total number of people and their average wage in the specified age intervals.
SELECT countResample(30, 75, 30)(name, age) AS amount, avgResample(30, 75, 30)(wage, age) AS avg_wage FROM people
┌─amount─┬─avg_wage──────────────────┐ │ [3,2] │ [11.5,12.949999809265137] │ └────────┴───────────────────────────┘
-ArgMin
The suffix -ArgMin can be appended to the name of any aggregate function. In this case, the aggregate function accepts an additional argument, which should be any comparable expression. The aggregate function processes only the rows that have the minimum value for the specified extra expression.
Examples: sumArgMin(column, expr), countArgMin(expr), avgArgMin(x, expr), and so on.
-ArgMax
Similar to the -ArgMin suffix, but processes only the rows that have the maximum value for the specified extra expression.
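A hedged sketch of -ArgMax: sum the values on the rows carrying the latest timestamp. The inline data and column names are illustrative:

```sql
-- sumArgMax sums `amount` only over the rows where `ts` is at its maximum.
SELECT sumArgMax(amount, ts)
FROM values('amount UInt32, ts UInt32', (1, 1), (2, 2), (3, 2))
```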
aggThrow¶
This function can be used for the purpose of testing exception safety. It will throw an exception on creation with the specified probability.
Syntax
aggThrow(throw_prob)
Arguments
throw_prob: Probability to throw on creation. Float64.
Returned value
- An exception: Code: 503. DB::Exception: Aggregate function aggThrow has thrown exception successfully.
Example
Query:
SELECT number % 2 AS even, aggThrow(number) FROM numbers(10) GROUP BY even
Result:
Received exception: Code: 503. DB::Exception: Aggregate function aggThrow has thrown exception successfully: While executing AggregatingTransform. (AGGREGATE_FUNCTION_THROW)
analysisOfVariance¶
Provides a statistical test for one-way analysis of variance (ANOVA test). It is a test over several groups of normally distributed observations to find out whether all groups have the same mean or not.
Syntax
analysisOfVariance(val, group_no)
Aliases: anova
Parameters
- val: Value.
- group_no: Group number that val belongs to.
Groups are enumerated starting from 0 and there should be at least two groups to perform a test. There should be at least one group with the number of observations greater than one.
Returned value
(f_statistic, p_value). Tuple(Float64, Float64).
Example
Query:
SELECT analysisOfVariance(number, number % 2) FROM numbers(1048575)
Result:
┌─analysisOfVariance(number, modulo(number, 2))─┐ │ (0,1) │ └───────────────────────────────────────────────┘
any¶
Selects the first encountered value of a column.
As a query can be executed in arbitrary order, the result of this function is non-deterministic. If you need an arbitrary but deterministic result, use the min or max functions.
By default, the function never returns NULL, i.e. it ignores NULL values in the input column. However, if the function is used with the RESPECT NULLS modifier, it returns the first value read, whether it is NULL or not.
Syntax
any(column) [RESPECT NULLS]
Aliases for any(column) (without RESPECT NULLS): any_value, first_value.
Aliases for any(column) RESPECT NULLS: any_respect_nulls, first_value_respect_nulls, any_value_respect_nulls.
Parameters
column: The column name.
Returned value
The first value encountered.
The return type of the function is the same as the input, except for LowCardinality, which is discarded. This means that given no rows as input it will return the default value of that type (0 for integers, or Null for a Nullable() column). You can use the -OrNull combinator to modify this behaviour.
Example
Query:
WITH cte AS (SELECT arrayJoin([NULL, 'Amsterdam', 'New York', 'Tokyo', 'Valencia', NULL]) as city) SELECT any(city), any_respect_nulls(city) FROM cte
┌─any(city)─┬─any_respect_nulls(city)─┐ │ Amsterdam │ ᴺᵁᴸᴸ │ └───────────┴─────────────────────────┘
anyHeavy¶
Selects a frequently occurring value using the heavy hitters algorithm. If there is a value that occurs more than in half the cases in each of the query's execution threads, this value is returned. Normally, the result is nondeterministic.
anyHeavy(column)
Arguments
column – The column name.
Example
Select a frequently occurring value in a small set of numbers:
WITH cte AS (SELECT arrayJoin([2,1,1,1,3,1,1,2,2]) as n) SELECT any(n), anyHeavy(n) FROM cte
┌─any(n)─┬─anyHeavy(n)─┐ │ 2 │ 1 │ └────────┴─────────────┘
anyLast¶
Selects the last encountered value of a column.
As a query can be executed in arbitrary order, the result of this function is non-deterministic. If you need an arbitrary but deterministic result, use the min or max functions.
By default, the function never returns NULL, i.e. it ignores NULL values in the input column. However, if the function is used with the RESPECT NULLS modifier, it returns the last value read, whether it is NULL or not.
Syntax
anyLast(column) [RESPECT NULLS]
Alias for anyLast(column) (without RESPECT NULLS): last_value.
Aliases for anyLast(column) RESPECT NULLS: anyLast_respect_nulls, last_value_respect_nulls.
Parameters
column: The column name.
Returned value
- The last value encountered.
Example
Query:
WITH cte AS (SELECT arrayJoin([NULL, 'Amsterdam', 'New York', 'Tokyo', 'Valencia', NULL]) as city) SELECT anyLast(city), anyLast_respect_nulls(city) FROM cte
┌─anyLast(city)─┬─anyLast_respect_nulls(city)─┐ │ Valencia │ ᴺᵁᴸᴸ │ └───────────────┴─────────────────────────────┘
approx_top_k¶
Returns an array of the approximately most frequent values and their counts in the specified column. The resulting array is sorted in descending order of approximate frequency of values (not by the values themselves).
approx_top_k(N)(column) approx_top_k(N, reserved)(column)
This function does not provide a guaranteed result. In certain situations, errors might occur and it might return frequent values that aren't the most frequent values.
We recommend using values of N < 10; performance is reduced with large N values. The maximum value of N is 65536.
Parameters
- N: The number of elements to return. Optional. Default value: 10.
- reserved: Defines how many cells are reserved for values. If uniq(column) > reserved, the result of the topK function will be approximate. Optional. Default value: N * 3.
Arguments
column: The value to calculate the frequency of.
Example
Query:
SELECT approx_top_k(2)(k) FROM values('k Char, w UInt64', ('y', 1), ('y', 1), ('x', 5), ('y', 1), ('z', 10))
Result:
┌─approx_top_k(2)(k)────┐ │ [('y',3,0),('x',1,0)] │ └───────────────────────┘
approx_top_count¶
Alias of the approx_top_k function.
approx_top_sum¶
Returns an array of the approximately most frequent values and their counts in the specified column. The resulting array is sorted in descending order of approximate frequency of values (not by the values themselves). Additionally, the weight of the value is taken into account.
approx_top_sum(N)(column, weight) approx_top_sum(N, reserved)(column, weight)
This function does not provide a guaranteed result. In certain situations, errors might occur and it might return frequent values that aren't the most frequent values.
We recommend using values of N < 10; performance is reduced with large N values. The maximum value of N is 65536.
Parameters
- N: The number of elements to return. Optional. Default value: 10.
- reserved: Defines how many cells are reserved for values. If uniq(column) > reserved, the result of the topK function will be approximate. Optional. Default value: N * 3.
Arguments
- column: The value to calculate the frequency of.
- weight: The weight. Every value is accounted weight times for frequency calculation. UInt64.
Example
Query:
SELECT approx_top_sum(2)(k, w) FROM values('k Char, w UInt64', ('y', 1), ('y', 1), ('x', 5), ('y', 1), ('z', 10))
Result:
┌─approx_top_sum(2)(k, w)─┐ │ [('z',10,0),('x',5,0)] │ └─────────────────────────┘
argMax¶
Calculates the arg value for a maximum val value. If there are multiple rows with equal val being the maximum, which of the associated arg values is returned is not deterministic. Both parts, arg and max, behave as aggregate functions; they both skip Null during processing and return not-Null values when not-Null values are available.
Syntax
argMax(arg, val)
Arguments
- arg: Argument.
- val: Value.
Returned value
arg value that corresponds to the maximum val value.
Type: matches arg type.
Example
Input table:
┌─user─────┬─salary─┐ │ director │ 5000 │ │ manager │ 3000 │ │ worker │ 1000 │ └──────────┴────────┘
Query:
SELECT argMax(user, salary) FROM salary
Result:
┌─argMax(user, salary)─┐ │ director │ └──────────────────────┘
argMin¶
Calculates the arg value for a minimum val value. If there are multiple rows with equal val being the minimum, which of the associated arg values is returned is not deterministic. Both parts, arg and min, behave as aggregate functions; they both skip Null during processing and return not-Null values when not-Null values are available.
Syntax
argMin(arg, val)
Arguments
- arg: Argument.
- val: Value.
Returned value
arg value that corresponds to the minimum val value.
Type: matches arg type.
Example
Input table:
┌─user─────┬─salary─┐ │ director │ 5000 │ │ manager │ 3000 │ │ worker │ 1000 │ └──────────┴────────┘
Query:
SELECT argMin(user, salary) FROM salary
Result:
┌─argMin(user, salary)─┐ │ worker │ └──────────────────────┘
array_concat_agg¶
- Alias of groupArrayArray. The function is case insensitive.
Example
SELECT * FROM t ┌─a───────┐ │ [1,2,3] │ │ [4,5] │ │ [6] │ └─────────┘
Query:
SELECT array_concat_agg(a) AS a FROM t ┌─a─────────────┐ │ [1,2,3,4,5,6] │ └───────────────┘
avg¶
Calculates the arithmetic mean.
Syntax
avg(x)
Arguments
x: Input values, must be Integer, Float, or Decimal.
Returned value
- The arithmetic mean, always as Float64.
- NaN if the input parameter x is empty.
Example
Query:
SELECT avg(x) FROM values('x Int8', 0, 1, 2, 3, 4, 5)
Result:
┌─avg(x)─┐ │ 2.5 │ └────────┘
avgWeighted¶
Calculates the weighted arithmetic mean.
Syntax
avgWeighted(x, weight)
Arguments
- x: Values.
- weight: Weights of the values.
x and weight must both be Integer or floating-point, but may have different types.
Returned value
- NaN if all the weights are equal to 0 or the supplied weights parameter is empty.
- Weighted mean otherwise.
Return type is always Float64.
Example
Query:
SELECT avgWeighted(x, w) FROM values('x Int8, w Int8', (4, 1), (1, 0), (10, 2))
Result:
┌─avgWeighted(x, weight)─┐ │ 8 │ └────────────────────────┘
Example
Query:
SELECT avgWeighted(x, w) FROM values('x Int8, w Float64', (4, 1), (1, 0), (10, 2))
Result:
┌─avgWeighted(x, weight)─┐ │ 8 │ └────────────────────────┘
Example
Query:
SELECT avgWeighted(x, w) FROM values('x Int8, w Int8', (0, 0), (1, 0), (10, 0))
Result:
┌─avgWeighted(x, weight)─┐ │ nan │ └────────────────────────┘
boundingRatio¶
Aggregate function that calculates the slope between the leftmost and rightmost points across a group of values.
Example
Sample data:
SELECT number, number * 1.5 FROM numbers(10)
┌─number─┬─multiply(number, 1.5)─┐ │ 0 │ 0 │ │ 1 │ 1.5 │ │ 2 │ 3 │ │ 3 │ 4.5 │ │ 4 │ 6 │ │ 5 │ 7.5 │ │ 6 │ 9 │ │ 7 │ 10.5 │ │ 8 │ 12 │ │ 9 │ 13.5 │ └────────┴───────────────────────┘
The boundingRatio() function returns the slope of the line between the leftmost and rightmost points; in the above data these points are (0,0) and (9,13.5).
SELECT boundingRatio(number, number * 1.5) FROM numbers(10)
┌─boundingRatio(number, multiply(number, 1.5))─┐ │ 1.5 │ └──────────────────────────────────────────────┘
categoricalInformationValue¶
Calculates the value of (P(tag = 1) - P(tag = 0))(log(P(tag = 1)) - log(P(tag = 0))) for each category.
categoricalInformationValue(category1, category2, ..., tag)
The result indicates how a discrete (categorical) feature [category1, category2, ...] contributes to a learning model that predicts the value of tag.
contingency¶
The contingency function calculates the contingency coefficient, a value that measures the association between two columns in a table. The computation is similar to the cramersV function but with a different denominator in the square root.
Syntax
contingency(column1, column2)
Arguments
column1 and column2 are the columns to be compared.
Returned value
- a value between 0 and 1. The larger the result, the closer the association of the two columns.
Return type is always Float64.
Example
The two columns being compared below have a small association with each other. We have included the result of cramersV as well (for comparison):
SELECT cramersV(a, b), contingency(a ,b) FROM ( SELECT number % 10 AS a, number % 4 AS b FROM numbers(150) )
Result:
┌──────cramersV(a, b)─┬───contingency(a, b)─┐ │ 0.41171788506213564 │ 0.05812725261759165 │ └─────────────────────┴─────────────────────┘
corr¶
Calculates the Pearson correlation coefficient: Σ((x - x̄)(y - ȳ)) / √(Σ(x - x̄)²*Σ(y - ȳ)²)
This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the corrStable function. It is slower but provides a more accurate result.
Syntax
corr(x, y)
Arguments
- x: first variable. (U)Int*, Float*, Decimal.
- y: second variable. (U)Int*, Float*, Decimal.
Returned value
- The Pearson correlation coefficient. Float64.
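For example, two exactly linearly related variables have a correlation coefficient of 1:

```sql
SELECT corr(x, y)
FROM values('x Int32, y Int32', (1, 2), (2, 4), (3, 6))
-- ≈ 1 (y is a perfect linear function of x)
```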
corrMatrix¶
Computes the correlation matrix over N variables.
Syntax
corrMatrix(x[, ...])
Arguments
x: a variable number of parameters. (U)Int*, Float*, Decimal.
Returned value
- Correlation matrix. Array(Array(Float64)).
corrStable¶
Calculates the Pearson correlation coefficient: Σ((x - x̄)(y - ȳ)) / √(Σ(x - x̄)²*Σ(y - ȳ)²)
Similar to the corr function, but uses a numerically stable algorithm. As a result, corrStable is slower than corr but produces a more accurate result.
Syntax
corrStable(x, y)
Arguments
- x: first variable. (U)Int*, Float*, Decimal.
- y: second variable. (U)Int*, Float*, Decimal.
Returned value
- The Pearson correlation coefficient. Float64.
count¶
Counts the number of rows or not-NULL values.
Use the following syntaxes for count:
- count(expr) or COUNT(DISTINCT expr).
- count() or COUNT(*).
Arguments
The function can take:
- Zero parameters.
- One expression.
Returned value
- If the function is called without parameters it counts the number of rows.
- If the expression is passed, then the function counts how many times this expression returned not null. If the expression returns a Nullable-type value, then the result of count stays not Nullable. The function returns 0 if the expression returned NULL for all the rows.
In both cases the type of the returned value is UInt64.
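The difference between the two forms can be seen with a Nullable column:

```sql
SELECT count(), count(x)
FROM values('x Nullable(UInt8)', 1, NULL, 3)
-- count() = 3 (all rows), count(x) = 2 (non-NULL values only)
```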
covarPop¶
Calculates the population covariance: Σ(x - x̄)(y - ȳ) / n
This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the covarPopStable function. It works slower but provides a lower computational error.
Syntax
covarPop(x, y)
Arguments
- x: first variable. (U)Int*, Float*, Decimal.
- y: second variable. (U)Int*, Float*, Decimal.
Returned value
- The population covariance between x and y. Float64.
covarPopMatrix¶
Returns the population covariance matrix over N variables.
Syntax
covarPopMatrix(x[, ...])
Arguments
x: a variable number of parameters. (U)Int*, Float*, Decimal.
Returned value
- Population covariance matrix. Array(Array(Float64)).
covarPopStable¶
Calculates the value of the population covariance: Σ(x - x̄)(y - ȳ) / n
It is similar to the covarPop function, but uses a numerically stable algorithm. As a result, covarPopStable is slower than covarPop but produces a more accurate result.
Syntax
covarPopStable(x, y)
Arguments
- x: first variable. (U)Int*, Float*, Decimal.
- y: second variable. (U)Int*, Float*, Decimal.
Returned value
- The population covariance between x and y. Float64.
covarSamp¶
Calculates the value of Σ((x - x̅)(y - y̅)) / (n - 1).
This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the covarSampStable function. It works slower but provides a lower computational error.
Syntax
covarSamp(x, y)
Arguments
- x: first variable. (U)Int*, Float*, Decimal.
- y: second variable. (U)Int*, Float*, Decimal.
Returned value
- The sample covariance between x and y. For n <= 1, nan is returned. Float64.
covarSampMatrix¶
Returns the sample covariance matrix over N variables.
Syntax
covarSampMatrix(x[, ...])
Arguments
x: a variable number of parameters. (U)Int*, Float*, Decimal.
Returned value
- Sample covariance matrix. Array(Array(Float64)).
covarSampStable¶
Calculates the value of Σ((x - x̅)(y - y̅)) / (n - 1). Similar to covarSamp, but works slower while providing a lower computational error.
Syntax
covarSampStable(x, y)
Arguments
- x: first variable. (U)Int*, Float*, Decimal.
- y: second variable. (U)Int*, Float*, Decimal.
Returned value
- The sample covariance between x and y. For n <= 1, inf is returned. Float64.
cramersV¶
Cramer's V (sometimes referred to as Cramer's phi) is a measure of association between two columns in a table. The result of the cramersV function ranges from 0 (corresponding to no association between the variables) to 1 and can reach 1 only when each value is completely determined by the other. It may be viewed as the association between two variables as a percentage of their maximum possible variation.
For a bias-corrected version of Cramer's V see: cramersVBiasCorrected.
Syntax
cramersV(column1, column2)
Parameters
- column1: first column to be compared.
- column2: second column to be compared.
Returned value
- a value between 0 (corresponding to no association between the columns' values) to 1 (complete association).
Type: always Float64.
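As a quick sketch, two fully dependent columns reach the maximum value:

```sql
-- b is completely determined by a, so Cramer's V is 1.
SELECT cramersV(a, b)
FROM (SELECT number % 2 AS a, number % 2 AS b FROM numbers(100))
```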
cramersVBiasCorrected¶
Cramer's V is a measure of association between two columns in a table. The result of the cramersV function ranges from 0 (corresponding to no association between the variables) to 1 and can reach 1 only when each value is completely determined by the other. The function can be heavily biased, so this version of Cramer's V uses the bias correction.
Syntax
cramersVBiasCorrected(column1, column2)
Parameters
- column1: first column to be compared.
- column2: second column to be compared.
Returned value
- a value between 0 (corresponding to no association between the columns' values) to 1 (complete association).
Type: always Float64.
deltaSum¶
Sums the arithmetic difference between consecutive rows. If the difference is negative, it is ignored.
The underlying data must be sorted for this function to work properly. If you would like to use this function in a materialized view, you most likely want to use the deltaSumTimestamp method instead.
Syntax
deltaSum(value)
Arguments
value: Input values, must be Integer or Float type.
Returned value
- The accumulated arithmetic difference, of the Integer or Float type.
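A short illustration using arrayJoin to feed an ordered sequence:

```sql
SELECT deltaSum(arrayJoin([1, 2, 3, 0, 3]))
-- 5: the steps +1, +1 and +3 are summed; the negative step 3 → 0 is ignored
```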
deltaSumTimestamp¶
Adds the difference between consecutive rows. If the difference is negative, it is ignored.
This function is primarily for materialized views that store data ordered by some time bucket-aligned timestamp, for example, a toStartOfMinute bucket. Because the rows in such a materialized view will all have the same timestamp, it is impossible for them to be merged in the correct order without storing the original, unrounded timestamp value. The deltaSumTimestamp function keeps track of the original timestamp of the values it has seen, so the values (states) of the function are correctly computed during merging of parts.
To calculate the delta sum across an ordered collection you can simply use the deltaSum function.
Syntax
deltaSumTimestamp(value, timestamp)
Arguments
- value: Input values, must be some Integer or Float type, or a Date or DateTime.
- timestamp: The parameter for ordering values, must be some Integer or Float type, or a Date or DateTime.
Returned value
- Accumulated differences between consecutive values, ordered by the timestamp parameter.
Type: Integer or Float or Date or DateTime.
distinctDynamicTypes¶
Calculates the list of distinct data types stored in a Dynamic column.
Syntax
distinctDynamicTypes(dynamic)
Arguments
dynamic: Dynamic column.
Returned value
- The sorted list of data type names. Array(String).
distinctJSONPaths¶
Calculates the list of distinct paths stored in a JSON column.
Syntax
distinctJSONPaths(json)
Arguments
json: JSON column.
Returned value
- The sorted list of paths. Array(String).
distinctJSONPathsAndTypes¶
Calculates the list of distinct paths and their types stored in a JSON column.
Syntax
distinctJSONPathsAndTypes(json)
Arguments
json: JSON column.
Returned value
- The sorted map of paths and types. Map(String, Array(String)).
entropy¶
Calculates the Shannon entropy of a column of values.
Syntax
entropy(val)
Arguments
val: Column of values of any type.
Returned value
- Shannon entropy.
Type: Float64.
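For instance, a column that takes two equally likely values carries one bit of entropy:

```sql
SELECT entropy(number % 2) FROM numbers(10)
-- 1: two values, each occurring with probability 0.5
```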
exponentialMovingAverage¶
Calculates the exponential moving average of values for the determined time.
Syntax
exponentialMovingAverage(x)(value, timeunit)
Each value corresponds to the determined timeunit. The half-life x is the time lag at which the exponential weights decay by one-half. The function returns a weighted average: the older the time point, the less weight the corresponding value carries.
Arguments
- value: Value. Integer, Float or Decimal.
- timeunit: Timeunit. Integer, Float or Decimal. The timeunit is not a timestamp (seconds); it is an index of the time interval. Can be calculated using intDiv.
Parameters
x: Half-life period. Integer, Float or Decimal.
Returned values
- Returns an exponentially smoothed moving average of the values for the past x time at the latest point of time.
Type: Float64.
Examples
Input table:
┌──temperature─┬─timestamp──┐ │ 95 │ 1 │ │ 95 │ 2 │ │ 95 │ 3 │ │ 96 │ 4 │ │ 96 │ 5 │ │ 96 │ 6 │ │ 96 │ 7 │ │ 97 │ 8 │ │ 97 │ 9 │ │ 97 │ 10 │ │ 97 │ 11 │ │ 98 │ 12 │ │ 98 │ 13 │ │ 98 │ 14 │ │ 98 │ 15 │ │ 99 │ 16 │ │ 99 │ 17 │ │ 99 │ 18 │ │ 100 │ 19 │ │ 100 │ 20 │ └──────────────┴────────────┘
Query:
SELECT exponentialMovingAverage(5)(temperature, timestamp)
Result:
┌──exponentialMovingAverage(5)(temperature, timestamp)──┐ │ 92.25779635374204 │ └───────────────────────────────────────────────────────┘
Query:
SELECT value, time, round(exp_smooth, 3), bar(exp_smooth, 0, 1, 50) AS bar FROM ( SELECT (number = 0) OR (number >= 25) AS value, number AS time, exponentialMovingAverage(10)(value, time) OVER (Rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_smooth FROM numbers(50) )
Result:
┌─value─┬─time─┬─round(exp_smooth, 3)─┬─bar────────────────────────────────────────┐ │ 1 │ 0 │ 0.067 │ ███▎ │ │ 0 │ 1 │ 0.062 │ ███ │ │ 0 │ 2 │ 0.058 │ ██▊ │ │ 0 │ 3 │ 0.054 │ ██▋ │ │ 0 │ 4 │ 0.051 │ ██▌ │ │ 0 │ 5 │ 0.047 │ ██▎ │ │ 0 │ 6 │ 0.044 │ ██▏ │ │ 0 │ 7 │ 0.041 │ ██ │ │ 0 │ 8 │ 0.038 │ █▊ │ │ 0 │ 9 │ 0.036 │ █▋ │ │ 0 │ 10 │ 0.033 │ █▋ │ │ 0 │ 11 │ 0.031 │ █▌ │ │ 0 │ 12 │ 0.029 │ █▍ │ │ 0 │ 13 │ 0.027 │ █▎ │ │ 0 │ 14 │ 0.025 │ █▎ │ │ 0 │ 15 │ 0.024 │ █▏ │ │ 0 │ 16 │ 0.022 │ █ │ │ 0 │ 17 │ 0.021 │ █ │ │ 0 │ 18 │ 0.019 │ ▊ │ │ 0 │ 19 │ 0.018 │ ▊ │ │ 0 │ 20 │ 0.017 │ ▋ │ │ 0 │ 21 │ 0.016 │ ▋ │ │ 0 │ 22 │ 0.015 │ ▋ │ │ 0 │ 23 │ 0.014 │ ▋ │ │ 0 │ 24 │ 0.013 │ ▋ │ │ 1 │ 25 │ 0.079 │ ███▊ │ │ 1 │ 26 │ 0.14 │ ███████ │ │ 1 │ 27 │ 0.198 │ █████████▊ │ │ 1 │ 28 │ 0.252 │ ████████████▌ │ │ 1 │ 29 │ 0.302 │ ███████████████ │ │ 1 │ 30 │ 0.349 │ █████████████████▍ │ │ 1 │ 31 │ 0.392 │ ███████████████████▌ │ │ 1 │ 32 │ 0.433 │ █████████████████████▋ │ │ 1 │ 33 │ 0.471 │ ███████████████████████▌ │ │ 1 │ 34 │ 0.506 │ █████████████████████████▎ │ │ 1 │ 35 │ 0.539 │ ██████████████████████████▊ │ │ 1 │ 36 │ 0.57 │ ████████████████████████████▌ │ │ 1 │ 37 │ 0.599 │ █████████████████████████████▊ │ │ 1 │ 38 │ 0.626 │ ███████████████████████████████▎ │ │ 1 │ 39 │ 0.651 │ ████████████████████████████████▌ │ │ 1 │ 40 │ 0.674 │ █████████████████████████████████▋ │ │ 1 │ 41 │ 0.696 │ ██████████████████████████████████▋ │ │ 1 │ 42 │ 0.716 │ ███████████████████████████████████▋ │ │ 1 │ 43 │ 0.735 │ ████████████████████████████████████▋ │ │ 1 │ 44 │ 0.753 │ █████████████████████████████████████▋ │ │ 1 │ 45 │ 0.77 │ ██████████████████████████████████████▍ │ │ 1 │ 46 │ 0.785 │ ███████████████████████████████████████▎ │ │ 1 │ 47 │ 0.8 │ ███████████████████████████████████████▊ │ │ 1 │ 48 │ 0.813 │ ████████████████████████████████████████▋ │ │ 1 │ 49 │ 0.825 │ █████████████████████████████████████████▎ │ └───────┴──────┴──────────────────────┴────────────────────────────────────────────┘
exponentialTimeDecayedAvg¶
Returns the exponentially smoothed weighted moving average of values of a time series at point t
in time.
Syntax
exponentialTimeDecayedAvg(x)(v, t)
Arguments
v
: Value. Integer, Float or Decimal.
t
: Time. Integer, Float, Decimal, DateTime or DateTime64.
Parameters
x
: Half-life period. Integer, Float or Decimal.
Returned values
- Returns an exponentially smoothed weighted moving average at index
t
in time. Float64.
Examples
Query:
SELECT value, time, round(exp_smooth, 3), bar(exp_smooth, 0, 5, 50) AS bar FROM ( SELECT (number = 0) OR (number >= 25) AS value, number AS time, exponentialTimeDecayedAvg(10)(value, time) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_smooth FROM numbers(50) )
Result:
┌─value─┬─time─┬─round(exp_smooth, 3)─┬─bar────────┐
 1. │ 1 │  0 │     1 │ ██████████ │
 2. │ 0 │  1 │ 0.475 │ ████▊ │
 3. │ 0 │  2 │ 0.301 │ ███ │
 4. │ 0 │  3 │ 0.214 │ ██▏ │
 5. │ 0 │  4 │ 0.162 │ █▌ │
 6. │ 0 │  5 │ 0.128 │ █▎ │
 7. │ 0 │  6 │ 0.104 │ █ │
 8. │ 0 │  7 │ 0.086 │ ▊ │
 9. │ 0 │  8 │ 0.072 │ ▋ │
10. │ 0 │  9 │ 0.061 │ ▌ │
11. │ 0 │ 10 │ 0.052 │ ▌ │
12. │ 0 │ 11 │ 0.045 │ ▍ │
13. │ 0 │ 12 │ 0.039 │ ▍ │
14. │ 0 │ 13 │ 0.034 │ ▎ │
15. │ 0 │ 14 │  0.03 │ ▎ │
16. │ 0 │ 15 │ 0.027 │ ▎ │
17. │ 0 │ 16 │ 0.024 │ ▏ │
18. │ 0 │ 17 │ 0.021 │ ▏ │
19. │ 0 │ 18 │ 0.018 │ ▏ │
20. │ 0 │ 19 │ 0.016 │ ▏ │
21. │ 0 │ 20 │ 0.015 │ ▏ │
22. │ 0 │ 21 │ 0.013 │ ▏ │
23. │ 0 │ 22 │ 0.012 │ │
24. │ 0 │ 23 │  0.01 │ │
25. │ 0 │ 24 │ 0.009 │ │
26. │ 1 │ 25 │ 0.111 │ █ │
27. │ 1 │ 26 │ 0.202 │ ██ │
28. │ 1 │ 27 │ 0.283 │ ██▊ │
29. │ 1 │ 28 │ 0.355 │ ███▌ │
30. │ 1 │ 29 │  0.42 │ ████▏ │
31. │ 1 │ 30 │ 0.477 │ ████▊ │
32. │ 1 │ 31 │ 0.529 │ █████▎ │
33. │ 1 │ 32 │ 0.576 │ █████▊ │
34. │ 1 │ 33 │ 0.618 │ ██████▏ │
35. │ 1 │ 34 │ 0.655 │ ██████▌ │
36. │ 1 │ 35 │ 0.689 │ ██████▉ │
37. │ 1 │ 36 │ 0.719 │ ███████▏ │
38. │ 1 │ 37 │ 0.747 │ ███████▍ │
39. │ 1 │ 38 │ 0.771 │ ███████▋ │
40. │ 1 │ 39 │ 0.793 │ ███████▉ │
41. │ 1 │ 40 │ 0.813 │ ████████▏ │
42. │ 1 │ 41 │ 0.831 │ ████████▎ │
43. │ 1 │ 42 │ 0.848 │ ████████▍ │
44. │ 1 │ 43 │ 0.862 │ ████████▌ │
45. │ 1 │ 44 │ 0.876 │ ████████▊ │
46. │ 1 │ 45 │ 0.888 │ ████████▉ │
47. │ 1 │ 46 │ 0.898 │ ████████▉ │
48. │ 1 │ 47 │ 0.908 │ █████████ │
49. │ 1 │ 48 │ 0.917 │ █████████▏ │
50. │ 1 │ 49 │ 0.925 │ █████████▏ │
└───────┴──────┴──────────────────────┴────────────┘
exponentialTimeDecayedCount¶
Returns the cumulative exponential decay over a time series at the index t
in time.
Syntax
exponentialTimeDecayedCount(x)(t)
Arguments
t
: Time. Integer, Float, Decimal, DateTime or DateTime64.
Parameters
x
: Half-life period. Integer, Float or Decimal.
Returned values
- Returns the cumulative exponential decay at the given point in time. Float64.
Example
Query:
SELECT value, time, round(exp_smooth, 3), bar(exp_smooth, 0, 20, 50) AS bar FROM ( SELECT (number % 5) = 0 AS value, number AS time, exponentialTimeDecayedCount(10)(time) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_smooth FROM numbers(50) )
Result:
┌─value─┬─time─┬─round(exp_smooth, 3)─┬─bar────────────────────────┐
 1. │ 1 │  0 │      1 │ ██▌ │
 2. │ 0 │  1 │  1.905 │ ████▊ │
 3. │ 0 │  2 │  2.724 │ ██████▊ │
 4. │ 0 │  3 │  3.464 │ ████████▋ │
 5. │ 0 │  4 │  4.135 │ ██████████▎ │
 6. │ 1 │  5 │  4.741 │ ███████████▊ │
 7. │ 0 │  6 │   5.29 │ █████████████▏ │
 8. │ 0 │  7 │  5.787 │ ██████████████▍ │
 9. │ 0 │  8 │  6.236 │ ███████████████▌ │
10. │ 0 │  9 │  6.643 │ ████████████████▌ │
11. │ 1 │ 10 │   7.01 │ █████████████████▌ │
12. │ 0 │ 11 │  7.343 │ ██████████████████▎ │
13. │ 0 │ 12 │  7.644 │ ███████████████████ │
14. │ 0 │ 13 │  7.917 │ ███████████████████▊ │
15. │ 0 │ 14 │  8.164 │ ████████████████████▍ │
16. │ 1 │ 15 │  8.387 │ ████████████████████▉ │
17. │ 0 │ 16 │  8.589 │ █████████████████████▍ │
18. │ 0 │ 17 │  8.771 │ █████████████████████▉ │
19. │ 0 │ 18 │  8.937 │ ██████████████████████▎ │
20. │ 0 │ 19 │  9.086 │ ██████████████████████▋ │
21. │ 1 │ 20 │  9.222 │ ███████████████████████ │
22. │ 0 │ 21 │  9.344 │ ███████████████████████▎ │
23. │ 0 │ 22 │  9.455 │ ███████████████████████▋ │
24. │ 0 │ 23 │  9.555 │ ███████████████████████▉ │
25. │ 0 │ 24 │  9.646 │ ████████████████████████ │
26. │ 1 │ 25 │  9.728 │ ████████████████████████▎ │
27. │ 0 │ 26 │  9.802 │ ████████████████████████▌ │
28. │ 0 │ 27 │  9.869 │ ████████████████████████▋ │
29. │ 0 │ 28 │   9.93 │ ████████████████████████▊ │
30. │ 0 │ 29 │  9.985 │ ████████████████████████▉ │
31. │ 1 │ 30 │ 10.035 │ █████████████████████████ │
32. │ 0 │ 31 │  10.08 │ █████████████████████████▏ │
33. │ 0 │ 32 │ 10.121 │ █████████████████████████▎ │
34. │ 0 │ 33 │ 10.158 │ █████████████████████████▍ │
35. │ 0 │ 34 │ 10.191 │ █████████████████████████▍ │
36. │ 1 │ 35 │ 10.221 │ █████████████████████████▌ │
37. │ 0 │ 36 │ 10.249 │ █████████████████████████▌ │
38. │ 0 │ 37 │ 10.273 │ █████████████████████████▋ │
39. │ 0 │ 38 │ 10.296 │ █████████████████████████▋ │
40. │ 0 │ 39 │ 10.316 │ █████████████████████████▊ │
41. │ 1 │ 40 │ 10.334 │ █████████████████████████▊ │
42. │ 0 │ 41 │ 10.351 │ █████████████████████████▉ │
43. │ 0 │ 42 │ 10.366 │ █████████████████████████▉ │
44. │ 0 │ 43 │ 10.379 │ █████████████████████████▉ │
45. │ 0 │ 44 │ 10.392 │ █████████████████████████▉ │
46. │ 1 │ 45 │ 10.403 │ ██████████████████████████ │
47. │ 0 │ 46 │ 10.413 │ ██████████████████████████ │
48. │ 0 │ 47 │ 10.422 │ ██████████████████████████ │
49. │ 0 │ 48 │  10.43 │ ██████████████████████████ │
50. │ 0 │ 49 │ 10.438 │ ██████████████████████████ │
└───────┴──────┴──────────────────────┴────────────────────────────┘
exponentialTimeDecayedMax¶
Returns the maximum of the exponentially smoothed moving average computed at index t in time and that computed at index t-1.
Syntax
exponentialTimeDecayedMax(x)(value, timeunit)
Arguments
value
: Value. Integer, Float or Decimal.
timeunit
: Time unit. Integer, Float, Decimal, DateTime or DateTime64.
Parameters
x
: Half-life period. Integer, Float or Decimal.
Returned values
- Returns the maximum of the exponentially smoothed weighted moving average at
t
andt-1
. Float64.
Example
Query:
SELECT value, time, round(exp_smooth, 3), bar(exp_smooth, 0, 5, 50) AS bar FROM ( SELECT (number = 0) OR (number >= 25) AS value, number AS time, exponentialTimeDecayedMax(10)(value, time) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_smooth FROM numbers(50) )
Result:
┌─value─┬─time─┬─round(exp_smooth, 3)─┬─bar────────┐ 1. │ 1 │ 0 │ 1 │ ██████████ │ 2. │ 0 │ 1 │ 0.905 │ █████████ │ 3. │ 0 │ 2 │ 0.819 │ ████████▏ │ 4. │ 0 │ 3 │ 0.741 │ ███████▍ │ 5. │ 0 │ 4 │ 0.67 │ ██████▋ │ 6. │ 0 │ 5 │ 0.607 │ ██████ │ 7. │ 0 │ 6 │ 0.549 │ █████▍ │ 8. │ 0 │ 7 │ 0.497 │ ████▉ │ 9. │ 0 │ 8 │ 0.449 │ ████▍ │ 10. │ 0 │ 9 │ 0.407 │ ████ │ 11. │ 0 │ 10 │ 0.368 │ ███▋ │ 12. │ 0 │ 11 │ 0.333 │ ███▎ │ 13. │ 0 │ 12 │ 0.301 │ ███ │ 14. │ 0 │ 13 │ 0.273 │ ██▋ │ 15. │ 0 │ 14 │ 0.247 │ ██▍ │ 16. │ 0 │ 15 │ 0.223 │ ██▏ │ 17. │ 0 │ 16 │ 0.202 │ ██ │ 18. │ 0 │ 17 │ 0.183 │ █▊ │ 19. │ 0 │ 18 │ 0.165 │ █▋ │ 20. │ 0 │ 19 │ 0.15 │ █▍ │ 21. │ 0 │ 20 │ 0.135 │ █▎ │ 22. │ 0 │ 21 │ 0.122 │ █▏ │ 23. │ 0 │ 22 │ 0.111 │ █ │ 24. │ 0 │ 23 │ 0.1 │ █ │ 25. │ 0 │ 24 │ 0.091 │ ▉ │ 26. │ 1 │ 25 │ 1 │ ██████████ │ 27. │ 1 │ 26 │ 1 │ ██████████ │ 28. │ 1 │ 27 │ 1 │ ██████████ │ 29. │ 1 │ 28 │ 1 │ ██████████ │ 30. │ 1 │ 29 │ 1 │ ██████████ │ 31. │ 1 │ 30 │ 1 │ ██████████ │ 32. │ 1 │ 31 │ 1 │ ██████████ │ 33. │ 1 │ 32 │ 1 │ ██████████ │ 34. │ 1 │ 33 │ 1 │ ██████████ │ 35. │ 1 │ 34 │ 1 │ ██████████ │ 36. │ 1 │ 35 │ 1 │ ██████████ │ 37. │ 1 │ 36 │ 1 │ ██████████ │ 38. │ 1 │ 37 │ 1 │ ██████████ │ 39. │ 1 │ 38 │ 1 │ ██████████ │ 40. │ 1 │ 39 │ 1 │ ██████████ │ 41. │ 1 │ 40 │ 1 │ ██████████ │ 42. │ 1 │ 41 │ 1 │ ██████████ │ 43. │ 1 │ 42 │ 1 │ ██████████ │ 44. │ 1 │ 43 │ 1 │ ██████████ │ 45. │ 1 │ 44 │ 1 │ ██████████ │ 46. │ 1 │ 45 │ 1 │ ██████████ │ 47. │ 1 │ 46 │ 1 │ ██████████ │ 48. │ 1 │ 47 │ 1 │ ██████████ │ 49. │ 1 │ 48 │ 1 │ ██████████ │ 50. │ 1 │ 49 │ 1 │ ██████████ │ └───────┴──────┴──────────────────────┴────────────┘
exponentialTimeDecayedSum¶
Returns the sum of exponentially smoothed moving average values of a time series at the index t
in time.
Syntax
exponentialTimeDecayedSum(x)(v, t)
Arguments
v
: Value. Integer, Float or Decimal.
t
: Time. Integer, Float, Decimal, DateTime or DateTime64.
Parameters
x
: Half-life period. Integer, Float or Decimal.
Returned values
- Returns the sum of exponentially smoothed moving average values at the given point in time. Float64.
Example
Query:
SELECT value, time, round(exp_smooth, 3), bar(exp_smooth, 0, 10, 50) AS bar FROM ( SELECT (number = 0) OR (number >= 25) AS value, number AS time, exponentialTimeDecayedSum(10)(value, time) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_smooth FROM numbers(50) )
Result:
┌─value─┬─time─┬─round(exp_smooth, 3)─┬─bar───────────────────────────────────────────────┐
 1. │ 1 │  0 │     1 │ █████ │
 2. │ 0 │  1 │ 0.905 │ ████▌ │
 3. │ 0 │  2 │ 0.819 │ ████ │
 4. │ 0 │  3 │ 0.741 │ ███▋ │
 5. │ 0 │  4 │  0.67 │ ███▎ │
 6. │ 0 │  5 │ 0.607 │ ███ │
 7. │ 0 │  6 │ 0.549 │ ██▋ │
 8. │ 0 │  7 │ 0.497 │ ██▍ │
 9. │ 0 │  8 │ 0.449 │ ██▏ │
10. │ 0 │  9 │ 0.407 │ ██ │
11. │ 0 │ 10 │ 0.368 │ █▊ │
12. │ 0 │ 11 │ 0.333 │ █▋ │
13. │ 0 │ 12 │ 0.301 │ █▌ │
14. │ 0 │ 13 │ 0.273 │ █▎ │
15. │ 0 │ 14 │ 0.247 │ █▏ │
16. │ 0 │ 15 │ 0.223 │ █ │
17. │ 0 │ 16 │ 0.202 │ █ │
18. │ 0 │ 17 │ 0.183 │ ▉ │
19. │ 0 │ 18 │ 0.165 │ ▊ │
20. │ 0 │ 19 │  0.15 │ ▋ │
21. │ 0 │ 20 │ 0.135 │ ▋ │
22. │ 0 │ 21 │ 0.122 │ ▌ │
23. │ 0 │ 22 │ 0.111 │ ▌ │
24. │ 0 │ 23 │   0.1 │ ▌ │
25. │ 0 │ 24 │ 0.091 │ ▍ │
26. │ 1 │ 25 │ 1.082 │ █████▍ │
27. │ 1 │ 26 │ 1.979 │ █████████▉ │
28. │ 1 │ 27 │ 2.791 │ █████████████▉ │
29. │ 1 │ 28 │ 3.525 │ █████████████████▋ │
30. │ 1 │ 29 │  4.19 │ ████████████████████▉ │
31. │ 1 │ 30 │ 4.791 │ ███████████████████████▉ │
32. │ 1 │ 31 │ 5.335 │ ██████████████████████████▋ │
33. │ 1 │ 32 │ 5.827 │ █████████████████████████████▏ │
34. │ 1 │ 33 │ 6.273 │ ███████████████████████████████▎ │
35. │ 1 │ 34 │ 6.676 │ █████████████████████████████████▍ │
36. │ 1 │ 35 │ 7.041 │ ███████████████████████████████████▏ │
37. │ 1 │ 36 │ 7.371 │ ████████████████████████████████████▊ │
38. │ 1 │ 37 │ 7.669 │ ██████████████████████████████████████▎ │
39. │ 1 │ 38 │ 7.939 │ ███████████████████████████████████████▋ │
40. │ 1 │ 39 │ 8.184 │ ████████████████████████████████████████▉ │
41. │ 1 │ 40 │ 8.405 │ ██████████████████████████████████████████ │
42. │ 1 │ 41 │ 8.605 │ ███████████████████████████████████████████ │
43. │ 1 │ 42 │ 8.786 │ ███████████████████████████████████████████▉ │
44. │ 1 │ 43 │  8.95 │ ████████████████████████████████████████████▊ │
45. │ 1 │ 44 │ 9.098 │ █████████████████████████████████████████████▍ │
46. │ 1 │ 45 │ 9.233 │ ██████████████████████████████████████████████▏ │
47. │ 1 │ 46 │ 9.354 │ ██████████████████████████████████████████████▊ │
48. │ 1 │ 47 │ 9.464 │ ███████████████████████████████████████████████▎ │
49. │ 1 │ 48 │ 9.563 │ ███████████████████████████████████████████████▊ │
50. │ 1 │ 49 │ 9.653 │ ████████████████████████████████████████████████▎ │
└───────┴──────┴──────────────────────┴───────────────────────────────────────────────────┘
first_value¶
It is an alias for any
but it was introduced for compatibility with Window Functions, where sometimes it's necessary to process NULL
values.
It supports declaring a modifier to respect nulls (RESPECT NULLS
), both under Window Functions and in normal aggregations.
As with any, when used without Window Functions the result will be random if the source stream is not ordered. The return type matches the input type (NULL is returned only if the input is Nullable or the -OrNull combinator is added).
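The NULL handling can be sketched with a window query. This is an illustration, not canonical documentation: it assumes your ClickHouse version supports the values table function and the RESPECT NULLS modifier.

```sql
-- By default first_value skips NULLs (like any); RESPECT NULLS keeps them.
SELECT
    first_value(x) OVER (ORDER BY t)               AS skip_nulls,
    first_value(x) RESPECT NULLS OVER (ORDER BY t) AS keep_nulls
FROM values('t UInt8, x Nullable(Int32)', (1, NULL), (2, 10), (3, 20));
```

With the leading NULL in the ordered stream, skip_nulls should be 10 while keep_nulls should be NULL.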
flameGraph¶
Aggregate function which builds a flamegraph using the list of stacktraces. Outputs an array of strings which can be used by flamegraph.pl utility to render an SVG of the flamegraph.
Syntax
flameGraph(traces, [size], [ptr])
Parameters
traces
: A stacktrace. Array(UInt64).
size
: An allocation size for memory profiling. Optional, default: 1. UInt64.
ptr
: An allocation address. Optional, default: 0. UInt64.
In the case where ptr != 0, flameGraph maps allocations (size > 0) to deallocations (size < 0) with the same size and ptr. Only allocations that were not freed are shown. Unmapped deallocations are ignored.
Returned value
- An array of strings for use with flamegraph.pl utility. Array(String).
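A typical usage sketch, assuming query profiling is enabled and system.trace_log contains samples (the query_id value is a placeholder):

```sql
-- Build a CPU flamegraph for one query; feed the returned lines to flamegraph.pl.
SELECT arrayJoin(flameGraph(arrayReverse(trace)))
FROM system.trace_log
WHERE trace_type = 'CPU' AND query_id = 'some-query-id';
```

Each returned string is a stack trace with a sample count, in the text format consumed by flamegraph.pl.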
groupArray¶
Syntax: groupArray(x)
or groupArray(max_size)(x)
Creates an array of argument values. Values can be added to the array in any (indeterminate) order.
The second version (with the max_size
parameter) limits the size of the resulting array to max_size
elements. For example, groupArray(1)(x)
is equivalent to [any(x)].
In some cases, you can still rely on the order of execution. This applies to cases when SELECT
comes from a subquery that uses ORDER BY
if the subquery result is small enough.
Example
SELECT * FROM default.ck

┌─id─┬─name─────┐
│  1 │ zhangsan │
│  1 │ ᴺᵁᴸᴸ     │
│  1 │ lisi     │
│  2 │ wangwu   │
└────┴──────────┘
Query:
select id, groupArray(10)(name) from default.ck group by id
Result:
┌─id─┬─groupArray(10)(name)─┐
│  1 │ ['zhangsan','lisi']  │
│  2 │ ['wangwu']           │
└────┴──────────────────────┘
As the result above shows, the groupArray function removes NULL values.
- Alias: array_agg.
groupArrayInsertAt¶
Inserts a value into the array at the specified position.
Syntax
groupArrayInsertAt(default_x, size)(x, pos)
If in one query several values are inserted into the same position, the function behaves in the following ways:
- If a query is executed in a single thread, the first one of the inserted values is used.
- If a query is executed in multiple threads, the resulting value is an undetermined one of the inserted values.
Arguments
x
: Value to be inserted. Expression resulting in one of the supported data types.
pos
: Position at which the specified element x is to be inserted. Index numbering in the array starts from zero. UInt32.
default_x
: Default value for substituting in empty positions. Optional parameter. Expression resulting in the data type configured for the x parameter. If default_x is not defined, the default values are used.
size
: Length of the resulting array. Optional parameter. When using this parameter, the default value default_x must be specified. UInt32.
Returned value
- Array with inserted values.
Type: Array.
Example
Query:
SELECT groupArrayInsertAt(toString(number), number * 2) FROM numbers(5)
Result:
┌─groupArrayInsertAt(toString(number), multiply(number, 2))─┐ │ ['0','','1','','2','','3','','4'] │ └───────────────────────────────────────────────────────────┘
Query:
SELECT groupArrayInsertAt('-')(toString(number), number * 2) FROM numbers(5)
Result:
┌─groupArrayInsertAt('-')(toString(number), multiply(number, 2))─┐ │ ['0','-','1','-','2','-','3','-','4'] │ └────────────────────────────────────────────────────────────────┘
Query:
SELECT groupArrayInsertAt('-', 5)(toString(number), number * 2) FROM numbers(5)
Result:
┌─groupArrayInsertAt('-', 5)(toString(number), multiply(number, 2))─┐ │ ['0','-','1','-','2'] │ └───────────────────────────────────────────────────────────────────┘
Multi-threaded insertion of elements into one position.
Query:
SELECT groupArrayInsertAt(number, 0) FROM numbers_mt(10) SETTINGS max_block_size = 1

As a result of this query you get a random integer in the [0,9] range. For example:

┌─groupArrayInsertAt(number, 0)─┐
│ [7]                           │
└───────────────────────────────┘
groupArrayIntersect¶
Returns an intersection of the given arrays (all items that are present in every given array).
Syntax
groupArrayIntersect(x)
Arguments
x
: Argument (column name or expression).
Returned values
- Array that contains elements that are in all arrays.
Type: Array.
Examples
Consider table numbers
:
┌─a──────────────┐
│ [1,2,4]        │
│ [1,5,2,8,-1,0] │
│ [1,5,7,5,8,2]  │
└────────────────┘
Query with column name as argument:
SELECT groupArrayIntersect(a) as intersection FROM numbers
Result:
┌─intersection──────┐ │ [1, 2] │ └───────────────────┘
groupArrayLast¶
Syntax: groupArrayLast(max_size)(x)
Creates an array of the last argument values. For example, groupArrayLast(1)(x) is equivalent to [anyLast(x)].
In some cases, you can still rely on the order of execution. This applies to cases when SELECT
comes from a subquery that uses ORDER BY
if the subquery result is small enough.
Example
Query:
select groupArrayLast(2)(number+1) numbers from numbers(10)
Result:
┌─numbers─┐ │ [9,10] │ └─────────┘
Compare with groupArray:
select groupArray(2)(number+1) numbers from numbers(10)
┌─numbers─┐ │ [1,2] │ └─────────┘
groupArrayMovingAvg¶
Calculates the moving average of input values.
groupArrayMovingAvg(numbers_for_summing) groupArrayMovingAvg(window_size)(numbers_for_summing)
The function can take the window size as a parameter. If left unspecified, the function takes the window size equal to the number of rows in the column.
Arguments
numbers_for_summing
: Expression resulting in a numeric data type value.
window_size
: Size of the calculation window.
Returned values
- Array of the same size and type as the input data.
The function uses rounding towards zero. It truncates the decimal places insignificant for the resulting data type.
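A minimal sketch with a window of size 2 over the integers 0..4; since the input is an integer type, the averages are truncated towards zero:

```sql
SELECT groupArrayMovingAvg(2)(number) FROM numbers(5);
-- With truncation, this should yield [0,0,1,2,3]
```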
groupArrayMovingSum¶
Calculates the moving sum of input values.
groupArrayMovingSum(numbers_for_summing) groupArrayMovingSum(window_size)(numbers_for_summing)
The function can take the window size as a parameter. If left unspecified, the function takes the window size equal to the number of rows in the column.
Arguments
numbers_for_summing
: Expression resulting in a numeric data type value.
window_size
: Size of the calculation window.
Returned values
- Array of the same size and type as the input data.
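A minimal sketch with a window of size 2 over the integers 0..4; each element is the sum of the current value and its predecessor:

```sql
SELECT groupArrayMovingSum(2)(number) FROM numbers(5);
-- Should yield [0,1,3,5,7]
```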
groupArraySample¶
Creates an array of sample argument values. The size of the resulting array is limited to max_size
elements. Argument values are selected and added to the array randomly.
Syntax
groupArraySample(max_size[, seed])(x)
Arguments
max_size
: Maximum size of the resulting array. UInt64.
seed
: Seed for the random number generator. Optional. UInt64. Default value: 123456.
x
: Argument (column name or expression).
Returned values
- Array of randomly selected x arguments.
Type: Array.
Examples
Consider table colors
:
┌─id─┬─color──┐
│  1 │ red    │
│  2 │ blue   │
│  3 │ green  │
│  4 │ white  │
│  5 │ orange │
└────┴────────┘
Query with column name as argument:
SELECT groupArraySample(3)(color) as newcolors FROM colors
Result:
┌─newcolors──────────────────┐ │ ['white','blue','green'] │ └────────────────────────────┘
Query with column name and different seed:
SELECT groupArraySample(3, 987654321)(color) as newcolors FROM colors
Result:
┌─newcolors──────────────────┐ │ ['red','orange','green'] │ └────────────────────────────┘
Query with expression as argument:
SELECT groupArraySample(3)(concat('light-', color)) as newcolors FROM colors
Result:
┌─newcolors───────────────────────────────────┐ │ ['light-blue','light-orange','light-green'] │ └─────────────────────────────────────────────┘
groupArraySorted¶
Returns an array with the first N items in ascending order.
groupArraySorted(N)(column)
Arguments
N
– The number of elements to return.
column
– The value (Integer, String, Float and other Generic types).
Example
Gets the first 10 numbers:
SELECT groupArraySorted(10)(number) FROM numbers(100)
┌─groupArraySorted(10)(number)─┐ │ [0,1,2,3,4,5,6,7,8,9] │ └──────────────────────────────┘
Gets the String representations of the first 5 numbers:
SELECT groupArraySorted(5)(str) FROM (SELECT toString(number) as str FROM numbers(5))
┌─groupArraySorted(5)(str)─┐ │ ['0','1','2','3','4'] │ └──────────────────────────┘
groupBitAnd¶
Applies bitwise AND to a series of numbers.
groupBitAnd(expr)
Arguments
expr
– An expression that results in UInt*
or Int*
type.
Return value
Value of the UInt*
or Int*
type.
Example
Test data:
binary     decimal
00101100 = 44
00011100 = 28
00001101 = 13
01010101 = 85
Query:
SELECT groupBitAnd(num) FROM t
Where num
is the column with the test data.
Result:
binary     decimal
00000100 = 4
groupBitmap¶
Performs bitmap (aggregate) calculations on an unsigned integer column and returns the cardinality as a UInt64 value. If the -State suffix is added, it returns a bitmap object.
groupBitmap(expr)
Arguments
expr
– An expression that results in UInt*
type.
Return value
Value of the UInt64
type.
Example
Test data:
UserID
1
1
2
3
Query:
SELECT groupBitmap(UserID) as num FROM t
Result:
num
3
groupBitmapAnd¶
Calculates the AND of a bitmap column and returns the cardinality as a UInt64 value. If the -State suffix is added, it returns a bitmap object.
groupBitmapAnd(expr)
Arguments
expr
– An expression that results in AggregateFunction(groupBitmap, UInt*)
type.
Return value
Value of the UInt64
type.
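Since the argument must be a bitmap state, a self-contained sketch can build two states on the fly; the bitmaps {1..5} and {3..7} intersect in {3,4,5}:

```sql
SELECT groupBitmapAnd(ids)
FROM
(
    SELECT groupBitmapState(toUInt32(number)) AS ids FROM numbers(1, 5) -- ids 1..5
    UNION ALL
    SELECT groupBitmapState(toUInt32(number)) FROM numbers(3, 5)        -- ids 3..7
);
-- Cardinality of the intersection {3,4,5}: should be 3
```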
groupBitmapOr¶
Calculates the OR of a bitmap column and returns the cardinality as a UInt64 value. If the -State suffix is added, it returns a bitmap object. This is equivalent to groupBitmapMerge.
groupBitmapOr(expr)
Arguments
expr
– An expression that results in AggregateFunction(groupBitmap, UInt*)
type.
Returned value
Value of the UInt64
type.
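A self-contained sketch built from two on-the-fly bitmap states; the union of {1..5} and {3..7} is {1..7}:

```sql
SELECT groupBitmapOr(ids)
FROM
(
    SELECT groupBitmapState(toUInt32(number)) AS ids FROM numbers(1, 5) -- ids 1..5
    UNION ALL
    SELECT groupBitmapState(toUInt32(number)) FROM numbers(3, 5)        -- ids 3..7
);
-- Cardinality of the union {1..7}: should be 7
```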
groupBitmapXor¶
Calculates the XOR of a bitmap column and returns the cardinality as a UInt64 value. If the -State suffix is added, it returns a bitmap object.
groupBitmapXor(expr)
Arguments
expr
– An expression that results in AggregateFunction(groupBitmap, UInt*)
type.
Returned value
Value of the UInt64
type.
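A self-contained sketch; the symmetric difference of {1..5} and {3..7} is {1,2,6,7}:

```sql
SELECT groupBitmapXor(ids)
FROM
(
    SELECT groupBitmapState(toUInt32(number)) AS ids FROM numbers(1, 5) -- ids 1..5
    UNION ALL
    SELECT groupBitmapState(toUInt32(number)) FROM numbers(3, 5)        -- ids 3..7
);
-- Cardinality of the symmetric difference {1,2,6,7}: should be 4
```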
groupBitOr¶
Applies bitwise OR to a series of numbers.
groupBitOr(expr)
Arguments
expr
– An expression that results in UInt*
or Int*
type.
Returned value
Value of the UInt*
or Int*
type.
Example
Test data:
binary     decimal
00101100 = 44
00011100 = 28
00001101 = 13
01010101 = 85
Query:
SELECT groupBitOr(num) FROM t
Where num
is the column with the test data.
Result:
binary     decimal
01111101 = 125
groupBitXor¶
Applies bitwise XOR to a series of numbers.
groupBitXor(expr)
Arguments
expr
– An expression that results in UInt*
or Int*
type.
Return value
Value of the UInt*
or Int*
type.
Example
Test data:
binary     decimal
00101100 = 44
00011100 = 28
00001101 = 13
01010101 = 85
Query:
SELECT groupBitXor(num) FROM t
Where num
is the column with the test data.
Result:
binary     decimal
01101000 = 104
groupConcat¶
Calculates a concatenated string from a group of strings, optionally separated by a delimiter, and optionally limited by a maximum number of elements.
Syntax
groupConcat[(delimiter [, limit])](expression)
Arguments
expression
: The expression or column name that outputs the strings to be concatenated.
delimiter
: A string used to separate concatenated values. This parameter is optional and defaults to an empty string if not specified.
limit
: A positive integer specifying the maximum number of elements to concatenate. If more elements are present, the excess elements are ignored. This parameter is optional.
If delimiter is specified without limit, it must be the first parameter. If both delimiter and limit are specified, delimiter must precede limit.
Returned value
- Returns a string consisting of the concatenated values of the column or expression. If the group has no elements, or only NULL elements, the result is a nullable string with a NULL value.
Examples
Input table:
┌─id─┬─name─┐
│  1 │ John │
│  2 │ Jane │
│  3 │ Bob  │
└────┴──────┘
- Basic usage without a delimiter:
Query:
SELECT groupConcat(Name) FROM Employees
Result:
JohnJaneBob
This concatenates all names into one continuous string without any separator.
- Using comma as a delimiter:
Query:
SELECT groupConcat(', ')(Name) FROM Employees
Result:
John, Jane, Bob
This output shows the names separated by a comma followed by a space.
- Limiting the number of concatenated elements
Query:
SELECT groupConcat(', ', 2)(Name) FROM Employees
Result:
John, Jane
This query limits the output to the first two names, even though there are more names in the table.
groupUniqArray¶
Syntax: groupUniqArray(x)
or groupUniqArray(max_size)(x)
Creates an array from different argument values. Memory consumption is the same as for the uniqExact function.
The second version (with the max_size
parameter) limits the size of the resulting array to max_size
elements. For example, groupUniqArray(1)(x)
is equivalent to [any(x)]
.
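For example (element order is indeterminate):

```sql
-- The three distinct remainders 0, 1, 2 are each returned exactly once.
SELECT groupUniqArray(number % 3) FROM numbers(10);
```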
intervalLengthSum¶
Calculates the total length of the union of all ranges (segments on the numeric axis).
Syntax
intervalLengthSum(start, end)
Arguments
start
: The starting value of the interval. Int32, Int64, UInt32, UInt64, Float32, Float64, DateTime or Date.
end
: The ending value of the interval. Int32, Int64, UInt32, UInt64, Float32, Float64, DateTime or Date.
Arguments must be of the same data type. Otherwise, an exception will be thrown.
Returned value
- Total length of union of all ranges (segments on numeric axis). Depending on the type of the argument, the return value may be UInt64 or Float64 type.
Examples
- Input table:
┌─id─┬─start─┬─end─┐
│ a  │   1.1 │ 2.9 │
│ a  │   2.5 │ 3.2 │
│ a  │     4 │   5 │
└────┴───────┴─────┘
In this example, the arguments of the Float32 type are used. The function returns a value of the Float64 type.
The result is the sum of the lengths of the intervals [1.1, 3.2] (the union of [1.1, 2.9] and [2.5, 3.2]) and [4, 5].
Query:
SELECT id, intervalLengthSum(start, end), toTypeName(intervalLengthSum(start, end)) FROM fl_interval GROUP BY id ORDER BY id
Result:
┌─id─┬─intervalLengthSum(start, end)─┬─toTypeName(intervalLengthSum(start, end))─┐ │ a │ 3.1 │ Float64 │ └────┴───────────────────────────────┴───────────────────────────────────────────┘
- Input table:
┌─id─┬───────────────start─┬─────────────────end─┐
│ a  │ 2020-01-01 01:12:30 │ 2020-01-01 02:10:10 │
│ a  │ 2020-01-01 02:05:30 │ 2020-01-01 02:50:31 │
│ a  │ 2020-01-01 03:11:22 │ 2020-01-01 03:23:31 │
└────┴─────────────────────┴─────────────────────┘
In this example, the arguments of the DateTime type are used. The function returns a value in seconds.
Query:
SELECT id, intervalLengthSum(start, end), toTypeName(intervalLengthSum(start, end)) FROM dt_interval GROUP BY id ORDER BY id
Result:
┌─id─┬─intervalLengthSum(start, end)─┬─toTypeName(intervalLengthSum(start, end))─┐ │ a │ 6610 │ UInt64 │ └────┴───────────────────────────────┴───────────────────────────────────────────┘
- Input table:
┌─id─┬──────start─┬────────end─┐
│ a  │ 2020-01-01 │ 2020-01-04 │
│ a  │ 2020-01-12 │ 2020-01-18 │
└────┴────────────┴────────────┘
In this example, the arguments of the Date type are used. The function returns a value in days.
Query:
SELECT id, intervalLengthSum(start, end), toTypeName(intervalLengthSum(start, end)) FROM date_interval GROUP BY id ORDER BY id
Result:
┌─id─┬─intervalLengthSum(start, end)─┬─toTypeName(intervalLengthSum(start, end))─┐ │ a │ 9 │ UInt64 │ └────┴───────────────────────────────┴───────────────────────────────────────────┘
kolmogorovSmirnovTest¶
Applies the Kolmogorov–Smirnov test to samples from two populations.
Syntax
kolmogorovSmirnovTest([alternative, computation_method])(sample_data, sample_index)
Values of both samples are in the sample_data column. If sample_index equals 0, then the value in that row belongs to the sample from the first population; otherwise it belongs to the sample from the second population. Samples must belong to continuous, one-dimensional probability distributions.
Arguments
sample_data
: Sample data. Integer, Float or Decimal.sample_index
: Sample index. Integer.
Parameters
alternative
: Alternative hypothesis. Optional, default: 'two-sided'. String. Let F(x) and G(x) be the CDFs of the first and second distributions respectively.
- 'two-sided' – The null hypothesis is that the samples come from the same distribution, i.e. F(x) = G(x) for all x. The alternative is that the distributions are not identical.
- 'greater' – The null hypothesis is that values in the first sample are stochastically smaller than those in the second one, i.e. the CDF of the first distribution lies above (and hence to the left of) that of the second one, which means F(x) >= G(x) for all x. The alternative is that F(x) < G(x) for at least one x.
- 'less' – The null hypothesis is that values in the first sample are stochastically greater than those in the second one, i.e. the CDF of the first distribution lies below (and hence to the right of) that of the second one, which means F(x) <= G(x) for all x. The alternative is that F(x) > G(x) for at least one x.
computation_method
: The method used to compute the p-value. Optional, default: 'auto'. String.
- 'exact' – The calculation is performed using the precise probability distribution of the test statistic. Compute-intensive and wasteful except for small samples.
- 'asymp' ('asymptotic') – The calculation is performed using an approximation. For large sample sizes, the exact and asymptotic p-values are very similar.
- 'auto' – The 'exact' method is used when the maximum number of samples is less than 10,000.
Returned values
Tuple with two elements:
- calculated statistic. Float64.
- calculated p-value. Float64.
Example
Query:
SELECT kolmogorovSmirnovTest('less', 'exact')(value, num) FROM ( SELECT randNormal(0, 10) AS value, 0 AS num FROM numbers(10000) UNION ALL SELECT randNormal(0, 10) AS value, 1 AS num FROM numbers(10000) )
Result:
┌─kolmogorovSmirnovTest('less', 'exact')(value, num)─┐ │ (0.009899999999999996,0.37528595205132287) │ └────────────────────────────────────────────────────┘
Note: P-value is bigger than 0.05 (for confidence level of 95%), so null hypothesis is not rejected.
Query:
SELECT kolmogorovSmirnovTest('two-sided', 'exact')(value, num) FROM ( SELECT randStudentT(10) AS value, 0 AS num FROM numbers(100) UNION ALL SELECT randNormal(0, 10) AS value, 1 AS num FROM numbers(100) )
Result:
┌─kolmogorovSmirnovTest('two-sided', 'exact')(value, num)─┐ │ (0.4100000000000002,6.61735760482795e-8) │ └─────────────────────────────────────────────────────────┘
Note: P-value is less than 0.05 (for confidence level of 95%), so null hypothesis is rejected.
kurtPop¶
Computes the kurtosis of a sequence.
kurtPop(expr)
Arguments
expr
: Expression returning a number.
Returned value
The kurtosis of the given distribution. Type: Float64.
Example
SELECT kurtPop(value) FROM series_with_value_column
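A self-contained sketch using generated data instead of the placeholder table:

```sql
-- Kurtosis of the integers 0..999 (an approximately uniform sample).
SELECT kurtPop(number) FROM numbers(1000);
```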
kurtSamp¶
Computes the sample kurtosis of a sequence.
It represents an unbiased estimate of the kurtosis of a random variable if passed values form its sample.
kurtSamp(expr)
Arguments
expr
: Expression returning a number.
Returned value
The kurtosis of the given distribution. Type: Float64. If n <= 1
(n
is a size of the sample), then the function returns nan
.
Example
SELECT kurtSamp(value) FROM series_with_value_column
largestTriangleThreeBuckets¶
Applies the Largest-Triangle-Three-Buckets algorithm to the input data. The algorithm is used for downsampling time series data for visualization. It is designed to operate on series sorted by the x coordinate. It works by dividing the sorted series into buckets and then finding the largest triangle in each bucket. The number of buckets is equal to the number of points in the resulting series. The function sorts the data by x and then applies the downsampling algorithm to the sorted data.
Syntax
largestTriangleThreeBuckets(n)(x, y)
Alias: lttb
.
Arguments
x
: x coordinate. Integer, Float, Decimal, Date, Date32, DateTime, DateTime64.
y
: y coordinate. Integer, Float, Decimal, Date, Date32, DateTime, DateTime64.
NaNs are ignored in the provided series, meaning that any NaN values will be excluded from the analysis. This ensures that the function operates only on valid numerical data.
Parameters
n
: number of points in the resulting series. UInt64.
Returned values
Array of Tuple with two elements: the x and y coordinates of the selected points.
Example
Input table:
┌─────x───────┬───────y──────┐ │ 1.000000000 │ 10.000000000 │ │ 2.000000000 │ 20.000000000 │ │ 3.000000000 │ 15.000000000 │ │ 8.000000000 │ 60.000000000 │ │ 9.000000000 │ 55.000000000 │ │ 10.00000000 │ 70.000000000 │ │ 4.000000000 │ 30.000000000 │ │ 5.000000000 │ 40.000000000 │ │ 6.000000000 │ 35.000000000 │ │ 7.000000000 │ 50.000000000 │ └─────────────┴──────────────┘
Query:
SELECT largestTriangleThreeBuckets(4)(x, y) FROM largestTriangleThreeBuckets_test
Result:
┌────────largestTriangleThreeBuckets(4)(x, y)───────────┐
│ [(1,10),(3,15),(9,55),(10,70)]                        │
└───────────────────────────────────────────────────────┘
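The bucketing logic can be sketched in Python. This is a simplified illustration of the algorithm, not ClickHouse's implementation; the exact bucket boundaries and tie handling of the real function may differ:

```python
def lttb(points, n_out):
    """Largest-Triangle-Three-Buckets downsampling (simplified sketch).

    Keeps the first and last points, splits the middle into n_out - 2
    buckets, and from each bucket keeps the point forming the largest
    triangle with the previously kept point and the average of the next
    bucket.
    """
    if n_out >= len(points) or n_out < 3:
        return list(points)
    pts = sorted(points)                        # the function sorts by x first
    kept = [pts[0]]
    bucket = (len(pts) - 2) / (n_out - 2)
    a = 0                                       # index of the last kept point
    for i in range(n_out - 2):
        lo, hi = int(i * bucket) + 1, int((i + 1) * bucket) + 1
        nxt = pts[hi:min(int((i + 2) * bucket) + 1, len(pts) - 1)] or [pts[-1]]
        avg_x = sum(p[0] for p in nxt) / len(nxt)
        avg_y = sum(p[1] for p in nxt) / len(nxt)
        ax, ay = pts[a]
        best, best_area = lo, -1.0
        for j in range(lo, hi):
            bx, by = pts[j]
            # twice the triangle area, via the cross product
            area = abs((ax - avg_x) * (by - ay) - (ax - bx) * (avg_y - ay))
            if area > best_area:
                best, best_area = j, area
        kept.append(pts[best])
        a = best
    kept.append(pts[-1])
    return kept
```

On the table above, `lttb(points, 4)` reproduces the documented result `[(1,10),(3,15),(9,55),(10,70)]`.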
last_value¶
Selects the last encountered value, similar to anyLast
, but can accept NULL. It is mostly intended for use with Window Functions. Without Window Functions the result will be random if the source stream is not ordered.
mannWhitneyUTest¶
Applies the Mann-Whitney rank test to samples from two populations.
Syntax
mannWhitneyUTest[(alternative[, continuity_correction])](sample_data, sample_index)
Values of both samples are in the sample_data
column. If sample_index
equals 0 then the value in that row belongs to the sample from the first population. Otherwise it belongs to the sample from the second population. The null hypothesis is that the two populations are stochastically equal. One-sided hypotheses can also be tested. This test does not assume that the data are normally distributed.
Arguments
sample_data
: sample data. Integer, Float or Decimal.sample_index
: sample index. Integer.
Parameters
alternative
: alternative hypothesis. (Optional, default: 'two-sided'.) String. Possible values: 'two-sided', 'greater', 'less'.
continuity_correction
: if not 0 then continuity correction in the normal approximation for the p-value is applied. (Optional, default: 1.) UInt64.
Returned values
Tuple with two elements:
- calculated U-statistic. Float64.
- calculated p-value. Float64.
Example
Input table:
┌─sample_data─┬─sample_index─┐
│          10 │            0 │
│          11 │            0 │
│          12 │            0 │
│           1 │            1 │
│           2 │            1 │
│           3 │            1 │
└─────────────┴──────────────┘
Query:
SELECT mannWhitneyUTest('greater')(sample_data, sample_index) FROM mww_ttest
Result:
┌─mannWhitneyUTest('greater')(sample_data, sample_index)─┐
│ (9,0.04042779918503192)                                │
└────────────────────────────────────────────────────────┘
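The U statistic itself is a rank-sum computation. The Python sketch below reproduces only the statistic, not the p-value (which ClickHouse derives from a normal approximation with an optional continuity correction):

```python
def mann_whitney_u(sample_data, sample_index):
    """U statistic for the first population (rows where sample_index == 0).

    Sketch of the statistic only; tied values receive averaged ranks.
    """
    first = [v for v, i in zip(sample_data, sample_index) if i == 0]
    ordered = sorted(sample_data)

    def avg_rank(v):
        lo = ordered.index(v) + 1               # first 1-based position of v
        return lo + (ordered.count(v) - 1) / 2  # midpoint of the tie run

    rank_sum = sum(avg_rank(v) for v in first)
    n1 = len(first)
    return rank_sum - n1 * (n1 + 1) / 2
```

On the example data above, the sketch returns 9, matching the first element of the documented result tuple.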
max¶
Aggregate function that calculates the maximum across a group of values.
Syntax:
SELECT max(salary) FROM employees
SELECT department, max(salary) FROM employees GROUP BY department
If you need a non-aggregate function to choose the maximum of two values, see greatest
:
SELECT greatest(a, b) FROM table
maxIntersections¶
Aggregate function that calculates the maximum number of times that a group of intervals intersects each other (if all the intervals intersect at least once).
The syntax is:
maxIntersections(start_column, end_column)
Arguments
start_column
– the numeric column that represents the start of each interval. Ifstart_column
isNULL
or 0 then the interval will be skipped.end_column
- the numeric column that represents the end of each interval. Ifend_column
isNULL
or 0 then the interval will be skipped.
Returned value
Returns the maximum number of intersected intervals.
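Conceptually this is a sweep-line computation. A Python sketch, assuming a shared endpoint does not count as an intersection (ClickHouse's tie handling may differ):

```python
def max_intersections(starts, ends):
    """Maximum number of simultaneously overlapping intervals (sweep line)."""
    events = []
    for s, e in zip(starts, ends):
        if not s or not e:          # skip NULL-like (None) or 0 endpoints
            continue
        events.append((s, 1))       # interval opens
        events.append((e, -1))      # interval closes
    # at equal coordinates, process closes before opens
    events.sort(key=lambda t: (t[0], t[1]))
    best = current = 0
    for _, delta in events:
        current += delta
        best = max(best, current)
    return best
```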
maxIntersectionsPosition¶
Aggregate function that calculates the positions of the occurrences of the maxIntersections
function.
The syntax is:
maxIntersectionsPosition(start_column, end_column)
Arguments
start_column
– the numeric column that represents the start of each interval. Ifstart_column
isNULL
or 0 then the interval will be skipped.end_column
- the numeric column that represents the end of each interval. Ifend_column
isNULL
or 0 then the interval will be skipped.
Returned value
Returns the start positions of the maximum number of intersected intervals.
maxMap¶
Calculates the maximum from value
array according to the keys specified in the key
array.
Syntax
maxMap(key, value)
or
maxMap(Tuple(key, value))
Alias: maxMappedArrays
- Passing a tuple of keys and value arrays is identical to passing two arrays of keys and values.
- The number of elements in
key
andvalue
must be the same for each row that is totaled.
Parameters
key
: Array of keys. Array.value
: Array of values. Array.
Returned value
- Returns a tuple of two arrays: keys in sorted order, and values calculated for the corresponding keys. Tuple(Array, Array).
Example
Query:
SELECT maxMap(a, b) FROM values('a Array(Char), b Array(Int64)', (['x', 'y'], [2, 2]), (['y', 'z'], [3, 1]))
Result:
┌─maxMap(a, b)───────────┐
│ [['x','y','z'],[2,3,1]]│
└────────────────────────┘
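The merge logic can be sketched in Python: values are aggregated per key across rows, and the keys come back sorted:

```python
def max_map(key_rows, value_rows):
    """Keep the maximum value per key across rows of (keys, values) arrays."""
    acc = {}
    for keys, values in zip(key_rows, value_rows):
        if len(keys) != len(values):
            raise ValueError("key and value arrays must have equal length")
        for k, v in zip(keys, values):
            acc[k] = max(acc[k], v) if k in acc else v
    ordered = sorted(acc)           # keys are returned in sorted order
    return ordered, [acc[k] for k in ordered]
```

On the example rows above it yields `(['x','y','z'], [2,3,1])`, matching the query result.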
meanZTest¶
Applies mean z-test to samples from two populations.
Syntax
meanZTest(population_variance_x, population_variance_y, confidence_level)(sample_data, sample_index)
Values of both samples are in the sample_data
column. If sample_index
equals to 0 then the value in that row belongs to the sample from the first population. Otherwise it belongs to the sample from the second population. The null hypothesis is that means of populations are equal. Normal distribution is assumed. Populations may have unequal variance and the variances are known.
Arguments
sample_data
: Sample data. Integer, Float or Decimal.sample_index
: Sample index. Integer.
Parameters
population_variance_x
: Variance for population x. Float.population_variance_y
: Variance for population y. Float.confidence_level
: Confidence level in order to calculate confidence intervals. Float.
Returned values
Tuple with four elements:
- calculated t-statistic. Float64.
- calculated p-value. Float64.
- calculated confidence-interval-low. Float64.
- calculated confidence-interval-high. Float64.
Example
Input table:
┌─sample_data─┬─sample_index─┐
│        20.3 │            0 │
│        21.9 │            0 │
│        22.1 │            0 │
│        18.9 │            1 │
│          19 │            1 │
│        20.3 │            1 │
└─────────────┴──────────────┘
Query:
SELECT meanZTest(0.7, 0.45, 0.95)(sample_data, sample_index) FROM mean_ztest
Result:
┌─meanZTest(0.7, 0.45, 0.95)(sample_data, sample_index)────────────────────────────┐
│ (3.2841296025548123,0.0010229786769086013,0.8198428246768334,3.2468238419898365) │
└──────────────────────────────────────────────────────────────────────────────────┘
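The z statistic is straightforward to compute by hand. A Python sketch of the statistic only (the p-value and confidence interval that ClickHouse also returns are omitted):

```python
import math

def mean_z_test(sample_data, sample_index, var_x, var_y):
    """Two-sample z statistic with known population variances."""
    x = [v for v, i in zip(sample_data, sample_index) if i == 0]
    y = [v for v, i in zip(sample_data, sample_index) if i != 0]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    # standard error of the difference of means under known variances
    se = math.sqrt(var_x / len(x) + var_y / len(y))
    return (mean_x - mean_y) / se
```

On the example table above it reproduces the first element of the documented result, ≈3.28413.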
median¶
The median*
functions are the aliases for the corresponding quantile*
functions. They calculate the median of a numeric data sample.
Functions:
- median: Alias for quantile.
- medianDeterministic: Alias for quantileDeterministic.
- medianExact: Alias for quantileExact.
- medianExactWeighted: Alias for quantileExactWeighted.
- medianTiming: Alias for quantileTiming.
- medianTimingWeighted: Alias for quantileTimingWeighted.
- medianTDigest: Alias for quantileTDigest.
- medianTDigestWeighted: Alias for quantileTDigestWeighted.
- medianBFloat16: Alias for quantileBFloat16.
- medianDD: Alias for quantileDD.
Example
Input table:
┌─val─┐
│   1 │
│   1 │
│   2 │
│   3 │
└─────┘
Query:
SELECT medianDeterministic(val, 1) FROM t
Result:
┌─medianDeterministic(val, 1)─┐
│                         1.5 │
└─────────────────────────────┘
min¶
Aggregate function that calculates the minimum across a group of values.
Syntax:
SELECT min(salary) FROM employees
SELECT department, min(salary) FROM employees GROUP BY department
If you need a non-aggregate function to choose the minimum of two values, see least
:
SELECT least(a, b) FROM table
minMap¶
Calculates the minimum from value
array according to the keys specified in the key
array.
Syntax
minMap(key, value)
or
minMap(Tuple(key, value))
Alias: minMappedArrays
- Passing a tuple of keys and value arrays is identical to passing an array of keys and an array of values.
- The number of elements in
key
andvalue
must be the same for each row that is totaled.
Parameters
key
: Array of keys. Array.value
: Array of values. Array.
Returned value
- Returns a tuple of two arrays: keys in sorted order, and values calculated for the corresponding keys. Tuple(Array, Array).
Example
Query:
SELECT minMap(a, b) FROM values('a Array(Int32), b Array(Int64)', ([1, 2], [2, 2]), ([2, 3], [1, 1]))
Result:
┌─minMap(a, b)──────┐
│ ([1,2,3],[2,1,1]) │
└───────────────────┘
quantile¶
Computes an approximate quantile of a numeric data sequence.
This function applies reservoir sampling with a reservoir size up to 8192 and a random number generator for sampling. The result is non-deterministic. To get an exact quantile, use the quantileExact
function.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
Note that for an empty numeric sequence, quantile
will return NaN, but its quantile*
variants will return either NaN or a default value for the sequence type, depending on the variant.
Syntax
quantile(level)(expr)
Alias: median
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using alevel
value in the range of[0.01, 0.99]
. Default value: 0.5. Atlevel=0.5
the function calculates median.expr
: Expression over the column values resulting in numeric data types, Date or DateTime.
Returned value
- Approximate quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
Input table:
┌─val─┐
│   1 │
│   1 │
│   2 │
│   3 │
└─────┘
Query:
SELECT quantile(val) FROM t
Result:
┌─quantile(val)─┐
│           1.5 │
└───────────────┘
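The idea behind the function can be sketched in Python with Algorithm R reservoir sampling. The interpolation step is an assumption for illustration; ClickHouse's exact interpolation may differ:

```python
import random

def reservoir_quantile(values, level=0.5, reservoir_size=8192, seed=None):
    """Approximate quantile over an Algorithm R reservoir sample."""
    rng = random.Random(seed)
    reservoir = []
    for i, v in enumerate(values):
        if i < reservoir_size:
            reservoir.append(v)
        else:
            j = rng.randrange(i + 1)       # replace with prob. size / (i + 1)
            if j < reservoir_size:
                reservoir[j] = v
    reservoir.sort()
    h = level * (len(reservoir) - 1)       # linear interpolation (assumed)
    lo = int(h)
    hi = min(lo + 1, len(reservoir) - 1)
    return reservoir[lo] + (h - lo) * (reservoir[hi] - reservoir[lo])
```

Once the input exceeds the reservoir size, the sampling step is what makes the result non-deterministic.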
quantileBFloat16¶
Computes an approximate quantile of a sample consisting of bfloat16 numbers. bfloat16
is a floating-point data type with 1 sign bit, 8 exponent bits and 7 fraction bits. The function converts input values to 32-bit floats and takes the most significant 16 bits. Then it calculates bfloat16
quantile value and converts the result to a 64-bit float by appending zero bits. The function is a fast quantile estimator with a relative error no more than 0.390625%.
Syntax
quantileBFloat16[(level)](expr)
Alias: medianBFloat16
Arguments
expr
: Column with numeric data. Integer, Float.
Parameters
level
: Level of quantile. Optional. Possible values are in the range from 0 to 1. Default value: 0.5. Float.
Returned value
- Approximate quantile of the specified level.
Type: Float64.
Example
The input table has an integer and a float column:
┌─a─┬─────b─┐
│ 1 │ 1.001 │
│ 2 │ 1.002 │
│ 3 │ 1.003 │
│ 4 │ 1.004 │
└───┴───────┘
Query to calculate 0.75-quantile (third quartile):
SELECT quantileBFloat16(0.75)(a), quantileBFloat16(0.75)(b) FROM example_table
Result:
┌─quantileBFloat16(0.75)(a)─┬─quantileBFloat16(0.75)(b)─┐
│                         3 │                         1 │
└───────────────────────────┴───────────────────────────┘
Note that all floating point values in the example are truncated to 1.0 when converting to bfloat16
.
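The truncation can be reproduced in Python by keeping only the top 16 bits of the float32 representation:

```python
import struct

def to_bfloat16(x):
    """Round-trip a float through bfloat16 by truncating float32 bits."""
    bits32 = struct.unpack('<I', struct.pack('<f', x))[0]
    high16 = bits32 >> 16          # 1 sign + 8 exponent + 7 fraction bits
    return struct.unpack('<f', struct.pack('<I', high16 << 16))[0]
```

For the values in the example, `to_bfloat16(1.001)` through `to_bfloat16(1.004)` all truncate to 1.0, which is why the second column of the result is 1.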
quantileBFloat16Weighted¶
Like quantileBFloat16
but takes into account the weight of each sequence member.
quantileDD¶
Computes an approximate quantile of a sample with relative-error guarantees. It works by building a DDSketch.
Syntax
quantileDD(relative_accuracy, [level])(expr)
Arguments
expr
: Column with numeric data. Integer, Float.
Parameters
relative_accuracy
: Relative accuracy of the quantile. Possible values are in the range from 0 to 1. Float. The size of the sketch depends on the range of the data and the relative accuracy. The larger the range and the smaller the relative accuracy, the larger the sketch. The rough memory size of the sketch is log(max_value/min_value)/relative_accuracy
. The recommended value is 0.001 or higher.level
: Level of quantile. Optional. Possible values are in the range from 0 to 1. Default value: 0.5. Float.
Returned value
- Approximate quantile of the specified level.
Type: Float64.
Example
The input table has an integer and a float column:
┌─a─┬─────b─┐
│ 1 │ 1.001 │
│ 2 │ 1.002 │
│ 3 │ 1.003 │
│ 4 │ 1.004 │
└───┴───────┘
Query to calculate 0.75-quantile (third quartile):
SELECT quantileDD(0.01, 0.75)(a), quantileDD(0.01, 0.75)(b) FROM example_table
Result:
┌─quantileDD(0.01, 0.75)(a)─┬─quantileDD(0.01, 0.75)(b)─┐
│         2.974233423476717 │                      1.01 │
└───────────────────────────┴───────────────────────────┘
quantileDeterministic¶
Computes an approximate quantile of a numeric data sequence.
This function applies reservoir sampling with a reservoir size up to 8192 and deterministic algorithm of sampling. The result is deterministic. To get an exact quantile, use the quantileExact
function.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
Syntax
quantileDeterministic(level)(expr, determinator)
Alias: medianDeterministic
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using alevel
value in the range of[0.01, 0.99]
. Default value: 0.5. Atlevel=0.5
the function calculates median.expr
: Expression over the column values resulting in numeric data types, Date or DateTime.determinator
: Number whose hash is used instead of a random number generator in the reservoir sampling algorithm to make the result of sampling deterministic. As a determinator you can use any deterministic positive number, for example, a user id or an event id. If the same determinator value occurs too often, the function works incorrectly.
Returned value
- Approximate quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
Input table:
┌─val─┐
│   1 │
│   1 │
│   2 │
│   3 │
└─────┘
Query:
SELECT quantileDeterministic(val, 1) FROM t
Result:
┌─quantileDeterministic(val, 1)─┐
│                           1.5 │
└───────────────────────────────┘
quantileExact¶
Exactly computes the quantile of a numeric data sequence.
To get exact value, all the passed values are combined into an array, which is then partially sorted. Therefore, the function consumes O(n)
memory, where n
is the number of values passed. However, for a small number of values, the function is very effective.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
Syntax
quantileExact(level)(expr)
Alias: medianExact
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using alevel
value in the range of[0.01, 0.99]
. Default value: 0.5. Atlevel=0.5
the function calculates median.expr
: Expression over the column values resulting in numeric data types, Date or DateTime.
Returned value
- Quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
Query:
SELECT quantileExact(number) FROM numbers(10)
Result:
┌─quantileExact(number)─┐
│                     5 │
└───────────────────────┘
quantileExactLow¶
Similar to quantileExact
, this computes the exact quantile of a numeric data sequence.
To get the exact value, all the passed values are combined into an array, which is then fully sorted. The sorting algorithm's complexity is O(N·log(N))
, where N = std::distance(first, last)
comparisons.
The return value depends on the quantile level and the number of elements in the selection: if the level is 0.5, then the function returns the lower median value for an even number of elements and the middle median value for an odd number of elements. The median is calculated similarly to the median_low implementation used in Python.
For all other levels, the element at the index corresponding to the value of level * size_of_array
is returned. For example:
SELECT quantileExactLow(0.1)(number) FROM numbers(10)

┌─quantileExactLow(0.1)(number)─┐
│                             1 │
└───────────────────────────────┘
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
Syntax
quantileExactLow(level)(expr)
Alias: medianExactLow
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using alevel
value in the range of[0.01, 0.99]
. Default value: 0.5. Atlevel=0.5
the function calculates median.expr
: Expression over the column values resulting in numeric data types, Date or DateTime.
Returned value
- Quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
Query:
SELECT quantileExactLow(number) FROM numbers(10)
Result:
┌─quantileExactLow(number)─┐
│                        4 │
└──────────────────────────┘
quantileExactHigh¶
Similar to quantileExact
, this computes the exact quantile of a numeric data sequence.
All the passed values are combined into an array, which is then fully sorted, to get the exact value. The sorting algorithm's complexity is O(N·log(N))
, where N = std::distance(first, last)
comparisons.
The return value depends on the quantile level and the number of elements in the selection: if the level is 0.5, then the function returns the higher median value for an even number of elements and the middle median value for an odd number of elements. The median is calculated similarly to the median_high implementation used in Python. For all other levels, the element at the index corresponding to the value of level * size_of_array
is returned.
This implementation behaves exactly like the current quantileExact
implementation.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
Syntax
quantileExactHigh(level)(expr)
Alias: medianExactHigh
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using alevel
value in the range of[0.01, 0.99]
. Default value: 0.5. Atlevel=0.5
the function calculates median.expr
: Expression over the column values resulting in numeric data types, Date or DateTime.
Returned value
- Quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
Query:
SELECT quantileExactHigh(number) FROM numbers(10)
Result:
┌─quantileExactHigh(number)─┐
│                         5 │
└───────────────────────────┘
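At level 0.5, both variants match Python's statistics module on the same data:

```python
from statistics import median_low, median_high

# quantileExactLow / quantileExactHigh at level 0.5 behave like Python's
# median_low and median_high on the same sample.
values = list(range(10))           # numbers(10): an even-sized sample 0..9
assert median_low(values) == 4     # lower of the two middle elements
assert median_high(values) == 5    # higher of the two middle elements
```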
quantileExactExclusive¶
Exactly computes the quantile of a numeric data sequence.
To get exact value, all the passed values are combined into an array, which is then partially sorted. Therefore, the function consumes O(n)
memory, where n
is the number of values passed. However, for a small number of values, the function is very effective.
This function is equivalent to the PERCENTILE.EXC Excel function (type R6).
When using multiple quantileExactExclusive
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantilesExactExclusive
function.
Syntax
quantileExactExclusive(level)(expr)
Arguments
expr
: Expression over the column values resulting in numeric data types, Date or DateTime.
Parameters
level
: Level of quantile. Optional. Possible values: (0, 1): bounds not included. Default value: 0.5. Atlevel=0.5
the function calculates median. Float.
Returned value
- Quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
quantileExactInclusive¶
Exactly computes the quantile of a numeric data sequence.
To get exact value, all the passed values are combined into an array, which is then partially sorted. Therefore, the function consumes O(n)
memory, where n
is the number of values passed. However, for a small number of values, the function is very effective.
This function is equivalent to the PERCENTILE.INC Excel function (type R7).
When using multiple quantileExactInclusive
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantilesExactInclusive
function.
Syntax
quantileExactInclusive(level)(expr)
Arguments
expr
: Expression over the column values resulting in numeric data types, Date or DateTime.
Parameters
level
: Level of quantile. Optional. Possible values: [0, 1]: bounds included. Default value: 0.5. Atlevel=0.5
the function calculates median. Float.
Returned value
- Quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
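The exclusive (R6) and inclusive (R7) schemes differ only in how the fractional rank is computed. A Python sketch for illustration (the clamping in the exclusive variant is an assumption; ClickHouse restricts levels to the open interval):

```python
def quantile_exclusive(xs, level):
    """PERCENTILE.EXC / type R6: fractional rank h = level * (n + 1)."""
    xs = sorted(xs)
    n = len(xs)
    h = min(max(level * (n + 1) - 1, 0.0), n - 1)   # 0-based, clamped
    lo = int(h)
    hi = min(lo + 1, n - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

def quantile_inclusive(xs, level):
    """PERCENTILE.INC / type R7: fractional rank h = level * (n - 1)."""
    xs = sorted(xs)
    n = len(xs)
    h = level * (n - 1)                              # 0-based
    lo = int(h)
    hi = min(lo + 1, n - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])
```

For the data 0..9 at level 0.25, the exclusive variant interpolates to 1.75 while the inclusive variant gives 2.25; at level 0.5 both give 4.5.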
quantileExactWeighted¶
Exactly computes the quantile of a numeric data sequence, taking into account the weight of each element.
To get exact value, all the passed values are combined into an array, which is then partially sorted. Each value is counted with its weight, as if it were present weight
times. A hash table is used in the algorithm. Because of this, if the passed values are frequently repeated, the function consumes less RAM than quantileExact. You can use this function instead of quantileExact
and specify a weight of 1.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
Syntax
quantileExactWeighted(level)(expr, weight)
Alias: medianExactWeighted
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using alevel
value in the range of[0.01, 0.99]
. Default value: 0.5. Atlevel=0.5
the function calculates median.expr
: Expression over the column values resulting in numeric data types, Date or DateTime.weight
: Column with weights of sequence members. Weight is a number of value occurrences with Unsigned integer types.
Returned value
- Quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
Input table:
┌─n─┬─val─┐
│ 0 │   3 │
│ 1 │   2 │
│ 2 │   1 │
│ 5 │   4 │
└───┴─────┘
Query:
SELECT quantileExactWeighted(n, val) FROM t
Result:
┌─quantileExactWeighted(n, val)─┐
│                             1 │
└───────────────────────────────┘
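The weighted algorithm can be sketched in Python; tie handling exactly at the threshold is an assumption for illustration:

```python
def quantile_exact_weighted(values, weights, level=0.5):
    """First value whose cumulative weight reaches level * total weight."""
    acc = {}
    for v, w in zip(values, weights):
        acc[v] = acc.get(v, 0) + w   # repeated values share one hash entry
    threshold = level * sum(acc.values())
    cumulative = 0
    for v in sorted(acc):
        cumulative += acc[v]
        if cumulative >= threshold:
            return v
    return max(acc)
```

On the example table (values n = 0, 1, 2, 5 with weights 3, 2, 1, 4), the half-total threshold is 5 and the cumulative weight reaches it at value 1, matching the query result.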
quantileExactWeightedInterpolated¶
Computes quantile of a numeric data sequence using linear interpolation, taking into account the weight of each element.
To get the interpolated value, all the passed values are combined into an array, which are then sorted by their corresponding weights. Quantile interpolation is then performed using the weighted percentile method by building a cumulative distribution based on weights and then a linear interpolation is performed using the weights and the values to compute the quantiles.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
We strongly recommend using quantileExactWeightedInterpolated
instead of quantileInterpolatedWeighted
because quantileExactWeightedInterpolated
is more accurate than quantileInterpolatedWeighted
. Here is an example:
SELECT quantileExactWeightedInterpolated(0.99)(number, 1), quantile(0.99)(number), quantileInterpolatedWeighted(0.99)(number, 1) FROM numbers(9)

┌─quantileExactWeightedInterpolated(0.99)(number, 1)─┬─quantile(0.99)(number)─┬─quantileInterpolatedWeighted(0.99)(number, 1)─┐
│                                               7.92 │                   7.92 │                                             8 │
└────────────────────────────────────────────────────┴────────────────────────┴───────────────────────────────────────────────┘
Syntax
quantileExactWeightedInterpolated(level)(expr, weight)
Alias: medianExactWeightedInterpolated
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using alevel
value in the range of[0.01, 0.99]
. Default value: 0.5. Atlevel=0.5
the function calculates median.expr
: Expression over the column values resulting in numeric data types, Date or DateTime.weight
: Column with weights of sequence members. Weight is a number of value occurrences with Unsigned integer types.
Returned value
- Quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
Input table:
┌─n─┬─val─┐
│ 0 │   3 │
│ 1 │   2 │
│ 2 │   1 │
│ 5 │   4 │
└───┴─────┘
Query:
SELECT quantileExactWeightedInterpolated(n, val) FROM t
Result:
┌─quantileExactWeightedInterpolated(n, val)─┐
│                                       1.5 │
└───────────────────────────────────────────┘
quantileGK¶
Computes the quantile of a numeric data sequence using the Greenwald-Khanna algorithm. The Greenwald-Khanna algorithm is an algorithm used to compute quantiles on a stream of data in a highly efficient manner. It was introduced by Michael Greenwald and Sanjeev Khanna in 2001. It is widely used in databases and big data systems where computing accurate quantiles on a large stream of data in real-time is necessary. The algorithm is highly efficient, taking only O(log n) space and O(log log n) time per item (where n is the size of the input). It is also highly accurate, providing an approximate quantile value with high probability.
quantileGK
is different from other quantile functions because it enables the user to control the accuracy of the approximate quantile result.
Syntax
quantileGK(accuracy, level)(expr)
Alias: medianGK
.
Arguments
accuracy
: Accuracy of quantile. Constant positive integer. Larger accuracy value means less error. For example, if the accuracy argument is set to 100, the computed quantile will have an error no greater than 1% with high probability. There is a trade-off between the accuracy of the computed quantiles and the computational complexity of the algorithm. A larger accuracy requires more memory and computational resources to compute the quantile accurately, while a smaller accuracy argument allows for a faster and more memory-efficient computation but with a slightly lower accuracy.level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. Default value: 0.5. Atlevel=0.5
the function calculates median.expr
: Expression over the column values resulting in numeric data types, Date or DateTime.
Returned value
- Quantile of the specified level and accuracy.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
SELECT quantileGK(1, 0.25)(number + 1) FROM numbers(1000)

┌─quantileGK(1, 0.25)(plus(number, 1))─┐
│                                    1 │
└──────────────────────────────────────┘

SELECT quantileGK(10, 0.25)(number + 1) FROM numbers(1000)

┌─quantileGK(10, 0.25)(plus(number, 1))─┐
│                                   156 │
└───────────────────────────────────────┘

SELECT quantileGK(100, 0.25)(number + 1) FROM numbers(1000)

┌─quantileGK(100, 0.25)(plus(number, 1))─┐
│                                    251 │
└────────────────────────────────────────┘

SELECT quantileGK(1000, 0.25)(number + 1) FROM numbers(1000)

┌─quantileGK(1000, 0.25)(plus(number, 1))─┐
│                                     249 │
└─────────────────────────────────────────┘
quantileInterpolatedWeighted¶
Computes quantile of a numeric data sequence using linear interpolation, taking into account the weight of each element.
To get the interpolated value, all the passed values are combined into an array, which are then sorted by their corresponding weights. Quantile interpolation is then performed using the weighted percentile method by building a cumulative distribution based on weights and then a linear interpolation is performed using the weights and the values to compute the quantiles.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
Syntax
quantileInterpolatedWeighted(level)(expr, weight)
Alias: medianInterpolatedWeighted
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using alevel
value in the range of[0.01, 0.99]
. Default value: 0.5. Atlevel=0.5
the function calculates median.expr
: Expression over the column values resulting in numeric data types, Date or DateTime.weight
: Column with weights of sequence members. Weight is a number of value occurrences.
Returned value
- Quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
Input table:
┌─n─┬─val─┐
│ 0 │   3 │
│ 1 │   2 │
│ 2 │   1 │
│ 5 │   4 │
└───┴─────┘
Query:
SELECT quantileInterpolatedWeighted(n, val) FROM t
Result:
┌─quantileInterpolatedWeighted(n, val)─┐
│                                    1 │
└──────────────────────────────────────┘
quantiles¶
Syntax: quantiles(level1, level2, ...)(x)
All the quantile functions also have corresponding quantiles functions: quantiles
, quantilesDeterministic
, quantilesTiming
, quantilesTimingWeighted
, quantilesExact
, quantilesExactWeighted
, quantilesExactWeightedInterpolated
, quantilesInterpolatedWeighted
, quantilesTDigest
, quantilesBFloat16
, quantilesDD
. These functions calculate all the quantiles of the listed levels in one pass, and return an array of the resulting values.
quantilesExactExclusive¶
Exactly computes the quantiles of a numeric data sequence.
To get exact value, all the passed values are combined into an array, which is then partially sorted. Therefore, the function consumes O(n)
memory, where n
is the number of values passed. However, for a small number of values, the function is very effective.
This function is equivalent to the PERCENTILE.EXC Excel function (type R6).
Works more efficiently with sets of levels than quantileExactExclusive
.
Syntax
quantilesExactExclusive(level1, level2, ...)(expr)
Arguments
expr
: Expression over the column values resulting in numeric data types, Date or DateTime.
Parameters
level
: Levels of quantiles. Possible values: (0, 1): bounds not included. Float.
Returned value
- Array of quantiles of the specified levels.
Type of array values:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
quantilesExactInclusive¶
Exactly computes the quantiles of a numeric data sequence.
To get exact value, all the passed values are combined into an array, which is then partially sorted. Therefore, the function consumes O(n)
memory, where n
is the number of values passed. However, for a small number of values, the function is very effective.
This function is equivalent to the PERCENTILE.INC Excel function (type R7).
Works more efficiently with sets of levels than quantileExactInclusive
.
Syntax
quantilesExactInclusive(level1, level2, ...)(expr)
Arguments
expr
: Expression over the column values resulting in numeric data types, Date or DateTime.
Parameters
level
: Levels of quantiles. Possible values: [0, 1]: bounds included. Float.
Returned value
- Array of quantiles of the specified levels.
Type of array values:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
quantilesGK¶
quantilesGK
works similarly to quantileGK
but allows calculating quantiles at several levels simultaneously and returns an array.
Syntax
quantilesGK(accuracy, level1, level2, ...)(expr)
Returned value
- Array of quantiles of the specified levels.
Type of array values:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
Query:
SELECT quantilesGK(1, 0.25, 0.5, 0.75)(number + 1) FROM numbers(1000)

┌─quantilesGK(1, 0.25, 0.5, 0.75)(plus(number, 1))─┐
│ [1,1,1]                                          │
└──────────────────────────────────────────────────┘

SELECT quantilesGK(10, 0.25, 0.5, 0.75)(number + 1) FROM numbers(1000)

┌─quantilesGK(10, 0.25, 0.5, 0.75)(plus(number, 1))─┐
│ [156,413,659]                                     │
└───────────────────────────────────────────────────┘

SELECT quantilesGK(100, 0.25, 0.5, 0.75)(number + 1) FROM numbers(1000)

┌─quantilesGK(100, 0.25, 0.5, 0.75)(plus(number, 1))─┐
│ [251,498,741]                                      │
└────────────────────────────────────────────────────┘

SELECT quantilesGK(1000, 0.25, 0.5, 0.75)(number + 1) FROM numbers(1000)

┌─quantilesGK(1000, 0.25, 0.5, 0.75)(plus(number, 1))─┐
│ [249,499,749]                                       │
└─────────────────────────────────────────────────────┘
quantileTDigest¶
Computes an approximate quantile of a numeric data sequence using the t-digest algorithm.
Memory consumption is log(n)
, where n
is the number of values. The result depends on the order of running the query and is nondeterministic.
The performance of the function is lower than that of quantile
or quantileTiming
. In terms of the ratio of State size to precision, this function is much better than quantile
.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
Syntax
quantileTDigest(level)(expr)
Alias: medianTDigest
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates the median.
expr
: Expression over the column values resulting in numeric data types, Date or DateTime.
Returned value
- Approximate quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
Query:
SELECT quantileTDigest(number) FROM numbers(10)
Result:
┌─quantileTDigest(number)─┐ │ 4.5 │ └─────────────────────────┘
quantileTDigestWeighted¶
Computes an approximate quantile of a numeric data sequence using the t-digest algorithm. The function takes into account the weight of each sequence member. The maximum error is 1%. Memory consumption is log(n)
, where n
is the number of values.
The performance of the function is lower than that of quantile
or quantileTiming
. In terms of the ratio of State size to precision, this function is much better than quantile
.
The result depends on the order of running the query, and is nondeterministic.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
Syntax
quantileTDigestWeighted(level)(expr, weight)
Alias: medianTDigestWeighted
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates the median.
expr
: Expression over the column values resulting in numeric data types, Date or DateTime.
weight
: Column with weights of sequence elements. Weight is a number of value occurrences.
Returned value
- Approximate quantile of the specified level.
Type:
- Float64 for numeric data type input.
- Date if input values have the Date type.
- DateTime if input values have the DateTime type.
Example
Query:
SELECT quantileTDigestWeighted(number, 1) FROM numbers(10)
Result:
┌─quantileTDigestWeighted(number, 1)─┐ │ 4.5 │ └────────────────────────────────────┘
quantileTiming¶
Computes the quantile of a numeric data sequence with determined precision.
The result is deterministic (it does not depend on the query processing order). The function is optimized for working with sequences which describe distributions like web page loading times or backend response times.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
Syntax
quantileTiming(level)(expr)
Alias: medianTiming
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates the median.
expr
: Expression over column values returning a Float* type number.
- If negative values are passed to the function, the behavior is undefined.
- If the value is greater than 30,000 (a page loading time of more than 30 seconds), it is assumed to be 30,000.
Accuracy
The calculation is accurate if:
- Total number of values does not exceed 5670.
- Total number of values exceeds 5670, but the page loading time is less than 1024ms.
Otherwise, the result of the calculation is rounded to the nearest multiple of 16 ms.
For calculating page loading time quantiles, this function is more effective and accurate than quantile
.
Returned value
- Quantile of the specified level.
Type: Float32
.
If no values are passed to the function (when using quantileTimingIf
), NaN is returned. The purpose of this is to differentiate these cases from cases that result in zero.
Example
Input table:
┌─response_time─┐ │ 72 │ │ 112 │ │ 126 │ │ 145 │ │ 104 │ │ 242 │ │ 313 │ │ 168 │ │ 108 │ └───────────────┘
Query:
SELECT quantileTiming(response_time) FROM t
Result:
┌─quantileTiming(response_time)─┐ │ 126 │ └───────────────────────────────┘
quantileTimingWeighted¶
Computes the quantile of a numeric data sequence with determined precision, according to the weight of each sequence member.
The result is deterministic (it does not depend on the query processing order). The function is optimized for working with sequences which describe distributions like web page loading times or backend response times.
When using multiple quantile*
functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles
function.
Syntax
quantileTimingWeighted(level)(expr, weight)
Alias: medianTimingWeighted
.
Arguments
level
: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates the median.
expr
: Expression over column values returning a Float* type number.
- If negative values are passed to the function, the behavior is undefined.
- If the value is greater than 30,000 (a page loading time of more than 30 seconds), it is assumed to be 30,000.
weight
: Column with weights of sequence elements. Weight is a number of value occurrences.
Accuracy
The calculation is accurate if:
- Total number of values does not exceed 5670.
- Total number of values exceeds 5670, but the page loading time is less than 1024ms.
Otherwise, the result of the calculation is rounded to the nearest multiple of 16 ms.
For calculating page loading time quantiles, this function is more effective and accurate than quantile
.
Returned value
- Quantile of the specified level.
Type: Float32
.
If no values are passed to the function (when using quantileTimingIf
), NaN is returned. The purpose of this is to differentiate these cases from cases that result in zero.
Example
Input table:
┌─response_time─┬─weight─┐ │ 68 │ 1 │ │ 104 │ 2 │ │ 112 │ 3 │ │ 126 │ 2 │ │ 138 │ 1 │ │ 162 │ 1 │ └───────────────┴────────┘
Query:
SELECT quantileTimingWeighted(response_time, weight) FROM t
Result:
┌─quantileTimingWeighted(response_time, weight)─┐ │ 112 │ └───────────────────────────────────────────────┘
quantilesTimingWeighted¶
Same as quantileTimingWeighted
, but accepts multiple parameters with quantile levels and returns an array filled with quantiles of those levels.
Example
Input table:
┌─response_time─┬─weight─┐ │ 68 │ 1 │ │ 104 │ 2 │ │ 112 │ 3 │ │ 126 │ 2 │ │ 138 │ 1 │ │ 162 │ 1 │ └───────────────┴────────┘
Query:
SELECT quantilesTimingWeighted(0.5, 0.99)(response_time, weight) FROM t
Result:
┌─quantilesTimingWeighted(0.5, 0.99)(response_time, weight)─┐ │ [112,162] │ └───────────────────────────────────────────────────────────┘
rankCorr¶
Computes a rank correlation coefficient.
Syntax
rankCorr(x, y)
Arguments
x
: Arbitrary value. Float32 or Float64.
y
: Arbitrary value. Float32 or Float64.
Returned value(s)
- Returns a rank correlation coefficient of the ranks of x and y. The value of the correlation coefficient ranges from -1 to +1. If fewer than two arguments are passed, the function throws an exception. A value close to +1 denotes a high monotonic relationship: as one random variable increases, the second random variable also increases. A value close to -1 denotes a high inverse monotonic relationship: as one random variable increases, the second random variable decreases. A value close or equal to 0 denotes no relationship between the two random variables.
Type: Float64.
Example
Query:
SELECT rankCorr(number, number) FROM numbers(100)
Result:
┌─rankCorr(number, number)─┐ │ 1 │ └──────────────────────────┘
Query:
SELECT roundBankers(rankCorr(exp(number), sin(number)), 3) FROM numbers(100)
Result:
┌─roundBankers(rankCorr(exp(number), sin(number)), 3)─┐ │ -0.037 │ └─────────────────────────────────────────────────────┘
simpleLinearRegression¶
Performs simple (unidimensional) linear regression.
simpleLinearRegression(x, y)
Parameters:
x
: Column with explanatory variable values.
y
: Column with dependent variable values.
Returned values:
Constants (k, b)
of the resulting line y = k*x + b
.
Examples
SELECT arrayReduce('simpleLinearRegression', [0, 1, 2, 3], [0, 1, 2, 3])
┌─arrayReduce('simpleLinearRegression', [0, 1, 2, 3], [0, 1, 2, 3])─┐ │ (1,0) │ └───────────────────────────────────────────────────────────────────┘
SELECT arrayReduce('simpleLinearRegression', [0, 1, 2, 3], [3, 4, 5, 6])
┌─arrayReduce('simpleLinearRegression', [0, 1, 2, 3], [3, 4, 5, 6])─┐ │ (1,3) │ └───────────────────────────────────────────────────────────────────┘
singleValueOrNull¶
The aggregate function singleValueOrNull
is used to implement subquery operators, such as x = ALL (SELECT ...)
. It checks if there is only one unique non-NULL value in the data. If there is only one unique value, it returns it. If there are zero or at least two distinct values, it returns NULL.
Syntax
singleValueOrNull(x)
Parameters
x
: Column of any data type (except Map, Array or Tuple which cannot be of type Nullable).
Returned values
- The unique value, if there is only one unique non-NULL value in
x
. NULL
, if there are zero or at least two distinct values.
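Example
A minimal sketch: with a single distinct non-NULL value the function returns that value, and with two distinct values it returns NULL:
Query:
SELECT singleValueOrNull(x) FROM (SELECT arrayJoin([1, 1, 1]) AS x)
Result:
┌─singleValueOrNull(x)─┐ │ 1 │ └──────────────────────┘
Query:
SELECT singleValueOrNull(x) FROM (SELECT arrayJoin([1, 2]) AS x)
Result:
┌─singleValueOrNull(x)─┐ │ ᴺᵁᴸᴸ │ └──────────────────────┘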
skewPop¶
Computes the skewness of a sequence.
skewPop(expr)
Arguments
expr
: Expression returning a number.
Returned value
The skewness of the given distribution. Type: Float64.
Example
SELECT skewPop(value) FROM series_with_value_column
skewSamp¶
Computes the sample skewness of a sequence.
It represents an unbiased estimate of the skewness of a random variable if passed values form its sample.
skewSamp(expr)
Arguments
expr
: Expression returning a number.
Returned value
The skewness of the given distribution. Type: Float64. If n <= 1
(n
is the size of the sample), then the function returns nan
.
Example
SELECT skewSamp(value) FROM series_with_value_column
sparkbar¶
The function plots a frequency histogram for values x
and the repetition rate y
of these values over the interval [min_x, max_x]
. Repetitions for all x
falling into the same bucket are averaged, so data should be pre-aggregated. Negative repetitions are ignored.
If no interval is specified, then the minimum x
is used as the interval start, and the maximum x
as the interval end. Otherwise, values outside the interval are ignored.
Syntax
sparkbar(buckets[, min_x, max_x])(x, y)
Parameters
buckets
: The number of segments. Type: Integer.
min_x
: The interval start. Optional parameter.
max_x
: The interval end. Optional parameter.
Arguments
x
: The field with values.
y
: The field with the frequency of values.
Returned value
- The frequency histogram.
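Example
As an illustrative sketch, the query below buckets the values 0–9 into 9 segments, using each value itself as its frequency; the result is a short unicode bar string whose bars grow from left to right (the exact characters depend on the rendering, so no result is shown here):
Query:
SELECT sparkbar(9)(number, number) FROM numbers(10)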
stddevPop¶
The result is equal to the square root of varPop
.
Aliases: STD
, STDDEV_POP
.
This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the stddevPopStable
function. It works slower but provides a lower computational error.
Syntax
stddevPop(x)
Parameters
x
: Population of values to find the standard deviation of. (U)Int*, Float*, Decimal*.
Returned value
- Square root of the population variance of
x
(i.e. the standard deviation). Float64.
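Example
For instance, for the two values 0 and 1 the population variance is 0.25, so the standard deviation is 0.5:
Query:
SELECT stddevPop(number) FROM numbers(2)
Result:
┌─stddevPop(number)─┐ │ 0.5 │ └───────────────────┘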
stddevPopStable¶
The result is equal to the square root of varPop
. Unlike stddevPop
, this function uses a numerically stable algorithm. It works slower but provides a lower computational error.
Syntax
stddevPopStable(x)
Parameters
x
: Population of values to find the standard deviation of. (U)Int*, Float*, Decimal*.
Returned value
Square root of the population variance of x
(i.e. the standard deviation). Float64.
stddevSamp¶
The result is equal to the square root of varSamp
.
Alias: STDDEV_SAMP
.
This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the stddevSampStable
function. It works slower but provides a lower computational error.
Syntax
stddevSamp(x)
Parameters
x
: Values for which to find the square root of sample variance. (U)Int*, Float*, Decimal*.
Returned value
Square root of sample variance of x
. Float64.
stddevSampStable¶
The result is equal to the square root of varSamp
. Unlike stddevSamp
, this function uses a numerically stable algorithm. It works slower but provides a lower computational error.
Syntax
stddevSampStable(x)
Parameters
x
: Values for which to find the square root of sample variance. (U)Int*, Float*, Decimal*.
Returned value
Square root of sample variance of x
. Float64.
stochasticLinearRegression {#agg_functions_stochasticlinearregression_parameters}¶
This function implements stochastic linear regression. It supports custom parameters for learning rate, L2 regularization coefficient, mini-batch size, and has a few methods for updating weight: Adam (used by default), simple SGD, Momentum, and Nesterov.
Parameters
There are 4 customizable parameters. They are passed to the function sequentially, but there is no need to pass all four - default values will be used. However, a good model requires some parameter tuning.
stochasticLinearRegression(0.00001, 0.1, 15, 'Adam')
learning rate
is the coefficient on step length when the gradient descent step is performed. A learning rate that is too big may cause infinite weights of the model. Default is 0.00001.
l2 regularization coefficient
, which may help to prevent overfitting. Default is 0.1.
mini-batch size
sets the number of elements whose gradients will be computed and summed to perform one step of gradient descent. Pure stochastic descent uses one element; however, having small batches (about 10 elements) makes gradient steps more stable. Default is 15.
method for updating weights
: Adam (by default), SGD, Momentum, and Nesterov. Momentum and Nesterov require a little more computation and memory, but they happen to be useful in terms of speed of convergence and stability of stochastic gradient methods.
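A typical usage pattern, sketched below with hypothetical train_data and test_data tables (first column the target, the remaining columns the features), is to store the trained state via the -State combinator and then apply it with the evalMLMethod function:
CREATE TABLE your_model ENGINE = Memory AS SELECT stochasticLinearRegressionState(0.1, 0.0, 5, 'SGD')(target, param1, param2) AS state FROM train_data
WITH (SELECT state FROM your_model) AS model SELECT evalMLMethod(model, param1, param2) FROM test_data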
stochasticLogisticRegression¶
This function implements stochastic logistic regression. It can be used for binary classification problems, supports the same custom parameters as stochasticLinearRegression, and works the same way.
Parameters
Parameters are exactly the same as in stochasticLinearRegression: learning rate
, l2 regularization coefficient
, mini-batch size
, method for updating weights
. For more information see parameters.
stochasticLogisticRegression(1.0, 1.0, 10, 'SGD')
studentTTest¶
Applies Student's t-test to samples from two populations.
Syntax
studentTTest([confidence_level])(sample_data, sample_index)
Values of both samples are in the sample_data
column. If sample_index
equals 0, the value in that row belongs to the sample from the first population; otherwise it belongs to the sample from the second population. The null hypothesis is that the means of the populations are equal. A normal distribution with equal variances is assumed.
Arguments
sample_data
: Sample data. Integer, Float or Decimal.
sample_index
: Sample index. Integer.
Parameters
confidence_level
: Confidence level in order to calculate confidence intervals. Float.
Returned values
Tuple with two or four elements (if the optional confidence_level
is specified):
- calculated t-statistic. Float64.
- calculated p-value. Float64.
- calculated confidence-interval-low. Float64.
- calculated confidence-interval-high. Float64.
Example
Input table:
┌─sample_data─┬─sample_index─┐ │ 20.3 │ 0 │ │ 21.1 │ 0 │ │ 21.9 │ 1 │ │ 21.7 │ 0 │ │ 19.9 │ 1 │ │ 21.8 │ 1 │ └─────────────┴──────────────┘
Query:
SELECT studentTTest(sample_data, sample_index) FROM student_ttest
Result:
┌─studentTTest(sample_data, sample_index)───┐ │ (-0.21739130434783777,0.8385421208415731) │ └───────────────────────────────────────────┘
sum¶
Calculates the sum. Only works for numbers.
Syntax
sum(num)
Parameters
num
: Column of numeric values. (U)Int*, Float*, Decimal*.
Returned value
- The sum of the values. (U)Int*, Float*, Decimal*.
sumCount¶
Calculates the sum of the numbers and counts the number of rows at the same time.
Syntax
sumCount(x)
Arguments
x
: Input value, must be Integer, Float, or Decimal.
Returned value
- Tuple (sum, count), where sum is the sum of numbers and count is the number of rows with non-NULL values.
Type: Tuple.
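Example
For instance, over numbers(5) (the values 0–4) the function returns the sum 10 and the row count 5 as one tuple:
Query:
SELECT sumCount(number) FROM numbers(5)
Result:
┌─sumCount(number)─┐ │ (10,5) │ └──────────────────┘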
sumKahan¶
Calculates the sum of the numbers with the Kahan compensated summation algorithm. Slower than the sum
function. The compensation works only for Float types.
Syntax
sumKahan(x)
Arguments
x
: Input value, must be Integer, Float, or Decimal.
Returned value
- The sum of numbers, with type Integer, Float, or Decimal depending on the type of the input arguments.
Example
Query:
SELECT sum(0.1), sumKahan(0.1) FROM numbers(10)
Result:
┌───────────sum(0.1)─┬─sumKahan(0.1)─┐ │ 0.9999999999999999 │ 1 │ └────────────────────┴───────────────┘
sumMap¶
Totals a value
array according to the keys specified in the key
array. Returns a tuple of two arrays: keys in sorted order, and values summed for the corresponding keys without overflow.
Syntax
sumMap(key <Array>, value <Array>)
for the Array type.
sumMap(Tuple(key <Array>, value <Array>))
for the Tuple type.
Alias: sumMappedArrays
.
Arguments
key
: Array of keys.
value
: Array of values.
Passing a tuple of key and value arrays is a synonym to passing separately an array of keys and an array of values.
The number of elements in key
and value
must be the same for each row that is totaled.
Returned Value
- Returns a tuple of two arrays: keys in sorted order, and values summed for the corresponding keys.
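Example
A minimal sketch in the VALUES style used elsewhere on this page: key 2 occurs in both rows, so its values are summed, while keys 1 and 3 occur once each:
Query:
SELECT sumMap(k, v) FROM VALUES('k Array(UInt16), v Array(UInt64)', ([1, 2], [10, 10]), ([2, 3], [10, 10]))
Result:
┌─sumMap(k, v)─────────┐ │ ([1,2,3],[10,20,10]) │ └──────────────────────┘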
sumMapWithOverflow¶
Totals a value
array according to the keys specified in the key
array. Returns a tuple of two arrays: keys in sorted order, and values summed for the corresponding keys. It differs from the sumMap
function in that it does summation with overflow - i.e. returns the same data type for the summation as the argument data type.
Syntax
sumMapWithOverflow(key <Array>, value <Array>)
for the Array type.
sumMapWithOverflow(Tuple(key <Array>, value <Array>))
for the Tuple type.
Arguments
key
: Array of keys.
value
: Array of values.
Passing a tuple of key and value arrays is a synonym to passing separately an array of keys and an array of values.
The number of elements in key
and value
must be the same for each row that is totaled.
Returned Value
- Returns a tuple of two arrays: keys in sorted order, and values summed for the corresponding keys.
sumWithOverflow¶
Computes the sum of the numbers, using the same data type for the result as for the input parameters. If the sum exceeds the maximum value for this data type, it is calculated with overflow.
Only works for numbers.
Syntax
sumWithOverflow(num)
Parameters
num
: Column of numeric values. (U)Int*, Float*, Decimal*.
Returned value
- The sum of the values. (U)Int*, Float*, Decimal*.
theilsU¶
The theilsU
function calculates the Theil's U uncertainty coefficient, a value that measures the association between two columns in a table. Its values range from −1.0 (100% negative association, or perfect inversion) to +1.0 (100% positive association, or perfect agreement). A value of 0.0 indicates the absence of association.
Syntax
theilsU(column1, column2)
Arguments
column1
and column2
are the columns to be compared.
Returned value
- a value between -1 and 1
Return type is always Float64.
Example
The two columns compared below have a small association with each other, so the value of theilsU
is negative:
SELECT theilsU(a ,b) FROM ( SELECT number % 10 AS a, number % 4 AS b FROM numbers(150) )
Result:
┌────────theilsU(a, b)─┐ │ -0.30195720557678846 │ └──────────────────────┘
topK¶
Returns an array of the approximately most frequent values in the specified column. The resulting array is sorted in descending order of approximate frequency of values (not by the values themselves).
Implements the Filtered Space-Saving algorithm for analyzing TopK, based on the reduce-and-combine algorithm from Parallel Space Saving.
topK(N)(column) topK(N, load_factor)(column) topK(N, load_factor, 'counts')(column)
This function does not provide a guaranteed result. In certain situations, errors might occur and it might return frequent values that aren’t the most frequent values.
We recommend using N < 10
; performance is reduced with large N
values. The maximum value of N is 65536
.
Parameters
N
: The number of elements to return. Optional. Default value: 10.
load_factor
: Defines how many cells are reserved for values. If uniq(column) > N * load_factor, the result of the topK function will be approximate. Optional. Default value: 3.
counts
: Defines whether the result should contain the approximate count and error value.
Arguments
column
: The value to calculate frequency for.
Example
Take the OnTime data set and select the three most frequently occurring values in the AirlineID
column.
SELECT topK(3)(AirlineID) AS res FROM ontime
┌─res─────────────────┐ │ [19393,19790,19805] │ └─────────────────────┘
topKWeighted¶
Returns an array of the approximately most frequent values in the specified column. The resulting array is sorted in descending order of approximate frequency of values (not by the values themselves). Additionally, the weight of the value is taken into account.
Syntax
topKWeighted(N)(column, weight) topKWeighted(N, load_factor)(column, weight) topKWeighted(N, load_factor, 'counts')(column, weight)
Parameters
N
: The number of elements to return. Optional. Default value: 10.
load_factor
: Defines how many cells are reserved for values. If uniq(column) > N * load_factor, the result of the topK function will be approximate. Optional. Default value: 3.
counts
: Defines whether the result should contain the approximate count and error value.
Arguments
column
: The value.
weight
: The weight. Every value is accounted weight times for frequency calculation. UInt64.
Returned value
Returns an array of the values with maximum approximate sum of weights.
Example
Query:
SELECT topKWeighted(2)(k, w) FROM VALUES('k Char, w UInt64', ('y', 1), ('y', 1), ('x', 5), ('y', 1), ('z', 10))
Result:
┌─topKWeighted(2)(k, w)──┐ │ ['z','x'] │ └────────────────────────┘
Query:
SELECT topKWeighted(2, 10, 'counts')(k, w) FROM VALUES('k Char, w UInt64', ('y', 1), ('y', 1), ('x', 5), ('y', 1), ('z', 10))
Result:
┌─topKWeighted(2, 10, 'counts')(k, w)─┐ │ [('z',10,0),('x',5,0)] │ └─────────────────────────────────────┘
uniq¶
Calculates the approximate number of different values of the argument.
uniq(x[, ...])
Arguments
The function takes a variable number of parameters. Parameters can be Tuple
, Array
, Date
, DateTime
, String
, or numeric types.
Returned value
- A UInt64-type number.
Implementation details
Function:
Calculates a hash for all parameters in the aggregate, then uses it in calculations.
Uses an adaptive sampling algorithm. For the calculation state, the function uses a sample of element hash values up to 65536. This algorithm is very accurate and very efficient on the CPU. When the query contains several of these functions, using uniq
is almost as fast as using other aggregate functions.
Provides the result deterministically (it does not depend on the query processing order).
We recommend using this function in almost all scenarios.
uniqCombined¶
Calculates the approximate number of different argument values.
uniqCombined(HLL_precision)(x[, ...])
The uniqCombined
function is a good choice for calculating the number of different values.
Arguments
HLL_precision
: The base-2 logarithm of the number of cells in HyperLogLog. Optional, you can use the function as uniqCombined(x[, ...]). The default value for HLL_precision is 17, which is effectively 96 KiB of space (2^17 cells, 6 bits each).
x
: A variable number of parameters. Parameters can be Tuple, Array, Date, DateTime, String, or numeric types.
Returned value
- A UInt64-type number.
Implementation details
The uniqCombined
function:
- Calculates a hash (64-bit hash for
String
and 32-bit otherwise) for all parameters in the aggregate, then uses it in calculations. - Uses a combination of three algorithms: array, hash table, and HyperLogLog with an error correction table.
- For a small number of distinct elements, an array is used.
- When the set size is larger, a hash table is used.
- For a larger number of elements, HyperLogLog is used, which will occupy a fixed amount of memory.
- Provides the result deterministically (it does not depend on the query processing order).
Since it uses a 32-bit hash for non-String
types, the result will have very high error for cardinalities significantly larger than UINT_MAX
(the error rises quickly after a few tens of billions of distinct values), so in this case you should use uniqCombined64.
Compared to the uniq function, the uniqCombined function:
- Consumes several times less memory.
- Calculates with several times higher accuracy.
- Usually has slightly lower performance. In some scenarios, uniqCombined can perform better than uniq, for example, with distributed queries that transmit a large number of aggregation states over the network.
Example
Query:
SELECT uniqCombined(number) FROM numbers(1e6)
Result:
┌─uniqCombined(number)─┐ │ 1001148 │ -- 1.00 million └──────────────────────┘
See the example section of uniqCombined64 for an example of the difference between uniqCombined and uniqCombined64 for much larger inputs.
uniqCombined64¶
Calculates the approximate number of different argument values. It is the same as uniqCombined, but uses a 64-bit hash for all data types rather than just for the String data type.
uniqCombined64(HLL_precision)(x[, ...])
Parameters
HLL_precision
: The base-2 logarithm of the number of cells in HyperLogLog. Optionally, you can use the function as uniqCombined64(x[, ...]). The default value for HLL_precision is 17, which is effectively 96 KiB of space (2^17 cells, 6 bits each).
x
: A variable number of parameters. Parameters can be Tuple, Array, Date, DateTime, String, or numeric types.
Returned value
- A UInt64-type number.
Implementation details
The uniqCombined64
function:
- Calculates a hash (64-bit hash for all data types) for all parameters in the aggregate, then uses it in calculations.
- Uses a combination of three algorithms: array, hash table, and HyperLogLog with an error correction table.
- For a small number of distinct elements, an array is used.
- When the set size is larger, a hash table is used.
- For a larger number of elements, HyperLogLog is used, which will occupy a fixed amount of memory.
- Provides the result deterministically (it does not depend on the query processing order).
Since it uses a 64-bit hash for all types, the result does not suffer from very high error for cardinalities significantly larger than UINT_MAX
like uniqCombined does, which uses a 32-bit hash for non-String
types.
Compared to the uniq function, the uniqCombined64 function:
- Consumes several times less memory.
- Calculates with several times higher accuracy.
Example
In the example below uniqCombined64
is run on 1e10
different numbers returning a very close approximation of the number of different argument values.
Query:
SELECT uniqCombined64(number) FROM numbers(1e10)
Result:
┌─uniqCombined64(number)─┐ │ 9998568925 │ -- 10.00 billion └────────────────────────┘
By comparison the uniqCombined
function returns a rather poor approximation for an input this size.
Query:
SELECT uniqCombined(number) FROM numbers(1e10)
Result:
┌─uniqCombined(number)─┐ │ 5545308725 │ -- 5.55 billion └──────────────────────┘
uniqExact¶
Calculates the exact number of different argument values.
uniqExact(x[, ...])
Use the uniqExact
function if you absolutely need an exact result. Otherwise use the uniq function.
The uniqExact
function uses more memory than uniq
, because the size of the state has unbounded growth as the number of different values increases.
Arguments
The function takes a variable number of parameters. Parameters can be Tuple
, Array
, Date
, DateTime
, String
, or numeric types.
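Example
For instance, counting the distinct values of numbers(1000) yields exactly 1000, whereas the approximate uniq variants may deviate slightly on larger inputs:
Query:
SELECT uniqExact(number) FROM numbers(1000)
Result:
┌─uniqExact(number)─┐ │ 1000 │ └───────────────────┘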
uniqHLL12¶
Calculates the approximate number of different argument values, using the HyperLogLog algorithm.
uniqHLL12(x[, ...])
Arguments
The function takes a variable number of parameters. Parameters can be Tuple
, Array
, Date
, DateTime
, String
, or numeric types.
Returned value
- A UInt64-type number.
Implementation details
Function:
Calculates a hash for all parameters in the aggregate, then uses it in calculations.
Uses the HyperLogLog algorithm to approximate the number of different argument values.
2^12 5-bit cells are used. The size of the state is slightly more than 2.5 KB. The result is not very accurate (up to ~10% error) for small data sets (<10K elements). However, the result is fairly accurate for high-cardinality data sets (10K-100M), with a maximum error of ~1.6%. Starting from 100M, the estimation error increases, and the function will return very inaccurate results for data sets with extremely high cardinality (1B+ elements).
Provides a deterministic result (it does not depend on the query processing order).
We do not recommend using this function. In most cases, use the uniq or uniqCombined function.
uniqTheta¶
Calculates the approximate number of different argument values, using the Theta Sketch Framework.
uniqTheta(x[, ...])
Arguments
The function takes a variable number of parameters. Parameters can be Tuple
, Array
, Date
, DateTime
, String
, or numeric types.
Returned value
- A UInt64-type number.
Implementation details
Function:
Calculates a hash for all parameters in the aggregate, then uses it in calculations.
Uses the KMV algorithm to approximate the number of different argument values.
4096 (2^12) 64-bit sketches are used. The size of the state is about 41 KB.
The relative error is 3.125% (95% confidence), see the relative error table for detail.
varPop¶
Calculates the population variance.
Syntax
varPop(x)
Alias: VAR_POP
.
Parameters
x
: Population of values to find the population variance of. (U)Int*, Float*, Decimal*.
Returned value
- Returns the population variance of
x
. Float64.
varPopStable¶
Returns the population variance. Unlike varPop, this function uses a numerically stable algorithm. It works slower but provides a lower computational error.
Syntax
varPopStable(x)
Alias: VAR_POP_STABLE
.
Parameters
x
: Population of values to find the population variance of. (U)Int*, Float*, Decimal*.
Returned value
- Returns the population variance of
x
. Float64.
varSamp¶
Calculates the sample variance of a data set.
Syntax
varSamp(x)
Alias: VAR_SAMP
.
Parameters
x
: The population for which you want to calculate the sample variance. (U)Int*, Float*, Decimal*.
Returned value
- Returns the sample variance of the input data set
x
. Float64.
Implementation details
The varSamp
function calculates the sample variance using the following formula:
$$ \sum\frac{(x - \text{mean}(x))^2}{(n - 1)} $$
Where:
x
is each individual data point in the data set.
mean(x)
is the arithmetic mean of the data set.
n
is the number of data points in the data set.
The function assumes that the input data set represents a sample from a larger population. If you want to calculate the variance of the entire population (when you have the complete data set), you should use varPop
instead.
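For instance, plugging the three values 0, 1 and 2 into the formula (mean 1, squared deviations summing to 2, divided by n − 1 = 2) gives a sample variance of 1:
Query:
SELECT varSamp(number) FROM numbers(3)
Result:
┌─varSamp(number)─┐ │ 1 │ └─────────────────┘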
varSampStable¶
Calculates the sample variance of a data set. Unlike varSamp, this function uses a numerically stable algorithm. It works slower but provides a lower computational error.
Syntax
varSampStable(x)
Alias: VAR_SAMP_STABLE
Parameters
x
: The population for which you want to calculate the sample variance. (U)Int*, Float*, Decimal*.
Returned value
- Returns the sample variance of the input data set. Float64.
Implementation details
The varSampStable
function calculates the sample variance using the same formula as varSamp
:
$$ \sum\frac{(x - \text{mean}(x))^2}{(n - 1)} $$
Where:
x
is each individual data point in the data set.
mean(x)
is the arithmetic mean of the data set.
n
is the number of data points in the data set.
welchTTest¶
Applies Welch's t-test to samples from two populations.
Syntax
welchTTest([confidence_level])(sample_data, sample_index)
Values of both samples are in the sample_data
column. If sample_index
equals 0, the value in that row belongs to the sample from the first population; otherwise it belongs to the sample from the second population. The null hypothesis is that the means of the populations are equal. A normal distribution is assumed. Populations may have unequal variance.
Arguments
sample_data
: Sample data. Integer, Float or Decimal.
sample_index
: Sample index. Integer.
Parameters
confidence_level
: Confidence level in order to calculate confidence intervals. Float.
Returned values
Tuple with two or four elements (if the optional confidence_level
is specified):
- calculated t-statistic. Float64.
- calculated p-value. Float64.
- calculated confidence-interval-low. Float64.
- calculated confidence-interval-high. Float64.
Example
Input table:
┌─sample_data─┬─sample_index─┐ │ 20.3 │ 0 │ │ 22.1 │ 0 │ │ 21.9 │ 0 │ │ 18.9 │ 1 │ │ 20.3 │ 1 │ │ 19 │ 1 │ └─────────────┴──────────────┘
Query:
SELECT welchTTest(sample_data, sample_index) FROM welch_ttest
Result:
┌─welchTTest(sample_data, sample_index)─────┐ │ (2.7988719532211235,0.051807360348581945) │ └───────────────────────────────────────────┘