Aggregate functions

Aggregate functions perform a calculation on a set of values and return a single value. They are commonly used in data analysis and reporting to summarize large datasets. These functions can be used to compute sums, averages, counts, and other statistical measures.

Aggregate Function Combinators

The name of an aggregate function can have a suffix appended to it. This changes the way the aggregate function works.

-If

The suffix -If can be appended to the name of any aggregate function. In this case, the aggregate function accepts an extra argument – a condition (UInt8 type). The aggregate function processes only the rows that trigger the condition. If the condition was not triggered even once, it returns a default value (usually zeros or empty strings).

Examples: sumIf(column, cond), countIf(cond), avgIf(x, cond), quantilesTimingIf(level1, level2)(x, cond), argMinIf(arg, val, cond) and so on.

With conditional aggregate functions, you can calculate aggregates for several conditions at once, without using subqueries and JOINs. For example, conditional aggregate functions can be used to implement the segment comparison functionality.
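For example, several conditions can be aggregated in a single pass (a minimal sketch):

SELECT
    countIf(number < 5) AS low,
    countIf(number >= 5) AS high
FROM numbers(10)

Both counts are computed in one scan of the data; the result is 5 and 5.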

-Array

The -Array suffix can be appended to any aggregate function. In this case, the aggregate function takes arguments of the 'Array(T)' type (arrays) instead of 'T' type arguments. If the aggregate function accepts multiple arguments, these must be arrays of equal length. When processing arrays, the aggregate function works like the original aggregate function across all array elements.

Example 1: sumArray(arr) - Totals all the elements of all 'arr' arrays. In this example, it could have been written more simply: sum(arraySum(arr)).

Example 2: uniqArray(arr) – Counts the number of unique elements in all 'arr' arrays. This could be done an easier way: uniq(arrayJoin(arr)), but it's not always possible to add 'arrayJoin' to a query.

-If and -Array can be combined. However, 'Array' must come first, then 'If'. Examples: uniqArrayIf(arr, cond), quantilesTimingArrayIf(level1, level2)(arr, cond). Due to this order, the 'cond' argument won't be an array.

-Map

The -Map suffix can be appended to any aggregate function. This creates an aggregate function which takes a Map type argument and aggregates the values of each key of the map separately, using the specified aggregate function. The result is also of Map type.

Example

WITH map_map AS (
  SELECT c1::Date AS date, c2::DateTime AS timeslot, c3::Map(String, UInt16) AS status FROM values (
    ('2000-01-01', '2000-01-01 00:00:00', (['a', 'b', 'c'], [10, 10, 10])),
    ('2000-01-01', '2000-01-01 00:00:00', (['c', 'd', 'e'], [10, 10, 10])),
    ('2000-01-01', '2000-01-01 00:01:00', (['d', 'e', 'f'], [10, 10, 10])),
    ('2000-01-01', '2000-01-01 00:01:00', (['f', 'g', 'g'], [10, 10, 10]))
))
SELECT
    timeslot,
    sumMap(status),
    avgMap(status),
    minMap(status)
FROM map_map
GROUP BY timeslot

┌────────────timeslot─┬─sumMap(status)───────────────────────┬─avgMap(status)───────────────────────┬─minMap(status)───────────────────────┐
│ 2000-01-01 00:00:00 │ {'a':10,'b':10,'c':20,'d':10,'e':10} │ {'a':10,'b':10,'c':10,'d':10,'e':10} │ {'a':10,'b':10,'c':10,'d':10,'e':10} │
│ 2000-01-01 00:01:00 │ {'d':10,'e':10,'f':20,'g':20}        │ {'d':10,'e':10,'f':10,'g':10}        │ {'d':10,'e':10,'f':10,'g':10}        │
└─────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┘

-SimpleState

If you apply this combinator, the aggregate function returns the same value but with a different type. This is a SimpleAggregateFunction(...) that can be stored in a table to work with AggregatingMergeTree tables.

Syntax

<aggFunction>SimpleState(x)

Arguments

  • x: Aggregate function parameters.

Returned values

The value of an aggregate function with the SimpleAggregateFunction(...) type.

Example

Query:

WITH anySimpleState(number) AS c SELECT toTypeName(c), c FROM numbers(1)

Result:

┌─toTypeName(c)────────────────────────┬─c─┐
│ SimpleAggregateFunction(any, UInt64) │ 0 │
└──────────────────────────────────────┴───┘

-State

If you apply this combinator, the aggregate function does not return the resulting value (such as the number of unique values for the uniq function), but an intermediate state of the aggregation (for uniq, this is the hash table for calculating the number of unique values). This is an AggregateFunction(...) that can be used for further processing or stored in a table to finish aggregating later.

Note that -MapState is not invariant for the same data, because the order of data in the intermediate state can change; this does not affect ingestion of that data.

To work with these states, use:

  • AggregatingMergeTree table engine.
  • finalizeAggregation function.
  • runningAccumulate function.
  • -Merge combinator.
  • -MergeState combinator.

-Merge

If you apply this combinator, the aggregate function takes the intermediate aggregation state as an argument, combines the states to finish aggregation, and returns the resulting value.
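For example, a state produced by the -State combinator can be finished with -Merge (a minimal sketch):

SELECT uniqMerge(state)
FROM (SELECT uniqState(number) AS state FROM numbers(10))

This returns 10, the same value that uniq(number) would return directly.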

-MergeState

Merges the intermediate aggregation states in the same way as the -Merge combinator. However, it does not return the resulting value, but an intermediate aggregation state, similar to the -State combinator.

-ForEach

Converts an aggregate function for tables into an aggregate function for arrays that aggregates the corresponding array items and returns an array of results. For example, sumForEach for the arrays [1, 2], [3, 4, 5] and [6, 7] returns the result [10, 13, 5] after adding together the corresponding array items.
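The example above can be reproduced with ad-hoc values (a minimal sketch):

SELECT sumForEach(arr)
FROM values('arr Array(UInt8)', [1, 2], [3, 4, 5], [6, 7])

Result: [10,13,5].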

-Distinct

Every unique combination of arguments will be aggregated only once. Repeating values are ignored. Examples: sum(DISTINCT x) (or sumDistinct(x)), groupArray(DISTINCT x) (or groupArrayDistinct(x)), corrStable(DISTINCT x, y) (or corrStableDistinct(x, y)) and so on.
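For example (a minimal sketch):

SELECT sumDistinct(x)
FROM values('x UInt8', 1, 1, 2, 3, 3)

Repeated values are aggregated only once, so the result is 6.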

-OrDefault

Changes behavior of an aggregate function.

If an aggregate function does not have input values, with this combinator it returns the default value for its return data type. Applies to the aggregate functions that can take empty input data.

-OrDefault can be used with other combinators.

Syntax

<aggFunction>OrDefault(x)

Arguments

  • x: Aggregate function parameters.

Returned values

Returns the default value of an aggregate function's return type if there is nothing to aggregate.

Type depends on the aggregate function used.

Example

Query:

SELECT avg(number), avgOrDefault(number) FROM numbers(0)

Result:

┌─avg(number)─┬─avgOrDefault(number)─┐
│         nan │                    0 │
└─────────────┴──────────────────────┘

-OrDefault can also be used with other combinators. This is useful when the aggregate function does not accept empty input.

Query:

SELECT avgOrDefaultIf(x, x > 10)
FROM
(
    SELECT toDecimal32(1.23, 2) AS x
)

Result:

┌─avgOrDefaultIf(x, greater(x, 10))─┐
│                              0.00 │
└───────────────────────────────────┘

-OrNull

Changes behavior of an aggregate function.

This combinator converts a result of an aggregate function to the Nullable data type. If the aggregate function does not have values to calculate it returns NULL.

-OrNull can be used with other combinators.

Syntax

<aggFunction>OrNull(x)

Arguments

  • x: Aggregate function parameters.

Returned values

  • The result of the aggregate function, converted to the Nullable data type.
  • NULL, if there is nothing to aggregate.

Type: Nullable(aggregate function return type).

Example

Add -OrNull to the end of an aggregate function.

Query:

SELECT sumOrNull(number), toTypeName(sumOrNull(number)) FROM numbers(10) WHERE number > 10

Result:

┌─sumOrNull(number)─┬─toTypeName(sumOrNull(number))─┐
│              ᴺᵁᴸᴸ │ Nullable(UInt64)              │
└───────────────────┴───────────────────────────────┘

-OrNull can also be used with other combinators. This is useful when the aggregate function does not accept empty input.

Query:

SELECT avgOrNullIf(x, x > 10)
FROM
(
    SELECT toDecimal32(1.23, 2) AS x
)

Result:

┌─avgOrNullIf(x, greater(x, 10))─┐
│                           ᴺᵁᴸᴸ │
└────────────────────────────────┘

-Resample

Lets you divide data into groups and then aggregate the data in those groups separately. Groups are created by splitting the values from one column into intervals.

<aggFunction>Resample(start, stop, step)(<aggFunction_params>, resampling_key)

Arguments

  • start: Starting value of the whole required interval for resampling_key values.
  • stop: Ending value of the whole required interval for resampling_key values. The whole interval does not include the stop value [start, stop).
  • step: Step for separating the whole interval into subintervals. The aggFunction is executed over each of those subintervals independently.
  • resampling_key: Column whose values are used for separating data into intervals.
  • aggFunction_params: aggFunction parameters.

Returned values

  • Array of aggFunction results for each subinterval.

Example

Consider the people table with the following data:

┌─name───┬─age─┬─wage─┐
│ John   │  16 │   10 │
│ Alice  │  30 │   15 │
│ Mary   │  35 │    8 │
│ Evelyn │  48 │ 11.5 │
│ David  │  62 │  9.9 │
│ Brian  │  60 │   16 │
└────────┴─────┴──────┘

Let's get the names of the people whose age lies in the intervals of [30,60) and [60,75). Since we use integer representation for age, we get ages in the [30, 59] and [60,74] intervals.

To aggregate names in an array, we use the groupArray aggregate function. It takes one argument. In our case, it's the name column. The groupArrayResample function should use the age column to aggregate names by age. To define the required intervals, we pass the 30, 75, 30 arguments into the groupArrayResample function.

SELECT groupArrayResample(30, 75, 30)(name, age) FROM people
┌─groupArrayResample(30, 75, 30)(name, age)─────┐
│ [['Alice','Mary','Evelyn'],['David','Brian']] │
└───────────────────────────────────────────────┘

Consider the results.

John is out of the sample because he's too young. Other people are distributed according to the specified age intervals.

Now let's count the total number of people and their average wage in the specified age intervals.

SELECT
    countResample(30, 75, 30)(name, age) AS amount,
    avgResample(30, 75, 30)(wage, age) AS avg_wage
FROM people
┌─amount─┬─avg_wage──────────────────┐
│ [3,2]  │ [11.5,12.949999809265137] │
└────────┴───────────────────────────┘

-ArgMin

The suffix -ArgMin can be appended to the name of any aggregate function. In this case, the aggregate function accepts an additional argument, which should be any comparable expression. The aggregate function processes only the rows that have the minimum value for the specified extra expression.

Examples: sumArgMin(column, expr), countArgMin(expr), avgArgMin(x, expr) and so on.

-ArgMax

Similar to suffix -ArgMin but processes only the rows that have the maximum value for the specified extra expression.
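For example, to sum values only over the rows holding the maximum key (a sketch; assumes a ClickHouse version that supports the -ArgMax combinator):

SELECT sumArgMax(v, k)
FROM values('k UInt8, v UInt8', (1, 10), (2, 20), (2, 30))

Only the rows with the maximum k (here, 2) are processed, giving 50.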

aggThrow

This function can be used for the purpose of testing exception safety. It will throw an exception on creation with the specified probability.

Syntax

aggThrow(throw_prob)

Arguments

  • throw_prob: Probability to throw on creation. Float64.

Returned value

  • An exception: Code: 503. DB::Exception: Aggregate function aggThrow has thrown exception successfully.

Example

Query:

SELECT number % 2 AS even, aggThrow(number) FROM numbers(10) GROUP BY even

Result:

Received exception:
Code: 503. DB::Exception: Aggregate function aggThrow has thrown exception successfully: While executing AggregatingTransform. (AGGREGATE_FUNCTION_THROW)

analysisOfVariance

Provides a statistical test for one-way analysis of variance (ANOVA test). It is a test over several groups of normally distributed observations to find out whether all groups have the same mean or not.

Syntax

analysisOfVariance(val, group_no)

Aliases: anova

Parameters

  • val: value.
  • group_no: group number that val belongs to.

Groups are enumerated starting from 0 and there should be at least two groups to perform a test. There should be at least one group with the number of observations greater than one.

Returned value

  • (f_statistic, p_value). Tuple(Float64, Float64).

Example

Query:

SELECT analysisOfVariance(number, number % 2) FROM numbers(1048575)

Result:

┌─analysisOfVariance(number, modulo(number, 2))─┐
│ (0,1)                                         │
└───────────────────────────────────────────────┘

any

Selects the first encountered value of a column.

As a query can be executed in arbitrary order, the result of this function is non-deterministic. If you need an arbitrary but deterministic result, use functions min or max.

By default, the function never returns NULL, i.e. it ignores NULL values in the input column. However, if the function is used with the RESPECT NULLS modifier, it returns the first value read, whether it is NULL or not.

Syntax

any(column) [RESPECT NULLS]

Aliases for any(column) (without RESPECT NULLS)

  • any_value, first_value.

Aliases for any(column) RESPECT NULLS

  • any_respect_nulls, first_value_respect_nulls, any_value_respect_nulls

Parameters

  • column: The column name.

Returned value

The first value encountered.

The return type of the function is the same as the input, except for LowCardinality, which is discarded. This means that, given no rows as input, it returns the default value of that type (0 for integers, or Null for a Nullable() column). You can use the -OrNull combinator to modify this behaviour.

Example

Query:

WITH cte AS (SELECT arrayJoin([NULL, 'Amsterdam', 'New York', 'Tokyo', 'Valencia', NULL]) as city)
SELECT any(city), any_respect_nulls(city) FROM cte
┌─any(city)─┬─any_respect_nulls(city)─┐
│ Amsterdam │ ᴺᵁᴸᴸ                    │
└───────────┴─────────────────────────┘

anyHeavy

Selects a frequently occurring value using the heavy hitters algorithm. If there is a value that occurs more than in half the cases in each of the query's execution threads, this value is returned. Normally, the result is nondeterministic.

anyHeavy(column)

Arguments

  • column – The column name.

Example

Take a column of values and select a frequently occurring value:

WITH cte AS (SELECT arrayJoin([2,1,1,1,3,1,1,2,2]) as n)
SELECT any(n), anyHeavy(n) FROM cte
┌─any(n)─┬─anyHeavy(n)─┐
│      2 │           1 │
└────────┴─────────────┘

anyLast

Selects the last encountered value of a column.

As a query can be executed in arbitrary order, the result of this function is non-deterministic. If you need an arbitrary but deterministic result, use functions min or max.

By default, the function never returns NULL, i.e. it ignores NULL values in the input column. However, if the function is used with the RESPECT NULLS modifier, it returns the last value read, whether it is NULL or not.

Syntax

anyLast(column) [RESPECT NULLS]

Alias for anyLast(column) (without RESPECT NULLS)

  • last_value.

Aliases for anyLast(column) RESPECT NULLS

  • anyLast_respect_nulls, last_value_respect_nulls

Parameters

  • column: The column name.

Returned value

  • The last value encountered.

Example

Query:

WITH cte AS (SELECT arrayJoin([NULL, 'Amsterdam', 'New York', 'Tokyo', 'Valencia', NULL]) as city)
SELECT anyLast(city), anyLast_respect_nulls(city) FROM cte
┌─anyLast(city)─┬─anyLast_respect_nulls(city)─┐
│ Valencia      │ ᴺᵁᴸᴸ                        │
└───────────────┴─────────────────────────────┘

approx_top_k

Returns an array of the approximately most frequent values and their counts in the specified column. The resulting array is sorted in descending order of approximate frequency of values (not by the values themselves).

approx_top_k(N)(column)
approx_top_k(N, reserved)(column)

This function does not provide a guaranteed result. In certain situations, errors might occur and it might return frequent values that aren't the most frequent values.

We recommend using N < 10; performance is reduced with large N values. The maximum value of N is 65536.

Parameters

  • N: The number of elements to return. Optional. Default value: 10.
  • reserved: Defines how many cells are reserved for values. If uniq(column) > reserved, the result of the topK function will be approximate. Optional. Default value: N * 3.

Arguments

  • column: The column for which to calculate the frequency of values.

Example

Query:

SELECT approx_top_k(2)(k)
FROM values('k Char, w UInt64', ('y', 1), ('y', 1), ('x', 5), ('y', 1), ('z', 10))

Result:

┌─approx_top_k(2)(k)────┐
│ [('y',3,0),('x',1,0)] │
└───────────────────────┘

approx_top_count

Alias of the approx_top_k function.

approx_top_sum

Returns an array of the approximately most frequent values and their counts in the specified column. The resulting array is sorted in descending order of approximate frequency of values (not by the values themselves). Additionally, the weight of the value is taken into account.

approx_top_sum(N)(column, weight)
approx_top_sum(N, reserved)(column, weight)

This function does not provide a guaranteed result. In certain situations, errors might occur and it might return frequent values that aren't the most frequent values.

We recommend using N < 10; performance is reduced with large N values. The maximum value of N is 65536.

Parameters

  • N: The number of elements to return. Optional. Default value: 10.
  • reserved: Defines how many cells are reserved for values. If uniq(column) > reserved, the result of the topK function will be approximate. Optional. Default value: N * 3.

Arguments

  • column: The column for which to calculate the frequency of values.
  • weight: The weight. Every value is accounted weight times for frequency calculation. UInt64.

Example

Query:

SELECT approx_top_sum(2)(k, w)
FROM values('k Char, w UInt64', ('y', 1), ('y', 1), ('x', 5), ('y', 1), ('z', 10))

Result:

┌─approx_top_sum(2)(k, w)─┐
│ [('z',10,0),('x',5,0)]  │
└─────────────────────────┘

argMax

Calculates the arg value for a maximum val value. If there are multiple rows with equal val being the maximum, which of the associated arg values is returned is not deterministic. Both parts, the arg and the max, behave as aggregate functions; they both skip Null during processing and return non-Null values if non-Null values are available.

Syntax

argMax(arg, val)

Arguments

  • arg: Argument.
  • val: Value.

Returned value

  • arg value that corresponds to maximum val value.

Type: matches arg type.

Example

Input table:

┌─user─────┬─salary─┐
│ director │   5000 │
│ manager  │   3000 │
│ worker   │   1000 │
└──────────┴────────┘

Query:

SELECT argMax(user, salary) FROM salary

Result:

┌─argMax(user, salary)─┐
│ director             │
└──────────────────────┘

argMin

Calculates the arg value for a minimum val value. If there are multiple rows with equal val being the minimum, which of the associated arg values is returned is not deterministic. Both parts, the arg and the min, behave as aggregate functions; they both skip Null during processing and return non-Null values if non-Null values are available.

Syntax

argMin(arg, val)

Arguments

  • arg: Argument.
  • val: Value.

Returned value

  • arg value that corresponds to minimum val value.

Type: matches arg type.

Example

Input table:

┌─user─────┬─salary─┐
│ director │   5000 │
│ manager  │   3000 │
│ worker   │   1000 │
└──────────┴────────┘

Query:

SELECT argMin(user, salary) FROM salary

Result:

┌─argMin(user, salary)─┐
│ worker               │
└──────────────────────┘

array_concat_agg

  • Alias of groupArrayArray. The function is case insensitive.

Example

SELECT *
FROM t

┌─a───────┐
│ [1,2,3] │
│ [4,5]   │
│ [6]     │
└─────────┘

Query:

SELECT array_concat_agg(a) AS a
FROM t

┌─a─────────────┐
│ [1,2,3,4,5,6] │
└───────────────┘

avg

Calculates the arithmetic mean.

Syntax

avg(x)

Arguments

  • x: input values, must be Integer, Float, or Decimal.

Returned value

  • The arithmetic mean, always as Float64.
  • NaN if the input parameter x is empty.

Example

Query:

SELECT avg(x) FROM values('x Int8', 0, 1, 2, 3, 4, 5)

Result:

┌─avg(x)─┐
│    2.5 │
└────────┘

avgWeighted

Calculates the weighted arithmetic mean.

Syntax

avgWeighted(x, weight)

Arguments

  • x: Values.
  • weight: Weights of the values.

x and weight must both be Integer or floating-point, but may have different types.

Returned value

  • NaN if all the weights are equal to 0 or the supplied weights parameter is empty.
  • Weighted mean otherwise.

Return type is always Float64.

Example

Query:

SELECT avgWeighted(x, w)
FROM values('x Int8, w Int8', (4, 1), (1, 0), (10, 2))

Result:

┌─avgWeighted(x, w)─┐
│                 8 │
└───────────────────┘

Example

Query:

SELECT avgWeighted(x, w)
FROM values('x Int8, w Float64', (4, 1), (1, 0), (10, 2))

Result:

┌─avgWeighted(x, w)─┐
│                 8 │
└───────────────────┘

Example

Query:

SELECT avgWeighted(x, w)
FROM values('x Int8, w Int8', (0, 0), (1, 0), (10, 0))

Result:

┌─avgWeighted(x, w)─┐
│               nan │
└───────────────────┘

boundingRatio

Aggregate function that calculates the slope between the leftmost and rightmost points across a group of values.

Example

Sample data:

SELECT
    number,
    number * 1.5
FROM numbers(10)
┌─number─┬─multiply(number, 1.5)─┐
│      0 │                     0 │
│      1 │                   1.5 │
│      2 │                     3 │
│      3 │                   4.5 │
│      4 │                     6 │
│      5 │                   7.5 │
│      6 │                     9 │
│      7 │                  10.5 │
│      8 │                    12 │
│      9 │                  13.5 │
└────────┴───────────────────────┘

The boundingRatio() function returns the slope of the line between the leftmost and rightmost points; in the above data, these points are (0,0) and (9,13.5).

SELECT boundingRatio(number, number * 1.5)
FROM numbers(10)
┌─boundingRatio(number, multiply(number, 1.5))─┐
│                                          1.5 │
└──────────────────────────────────────────────┘

categoricalInformationValue

Calculates the value of (P(tag = 1) - P(tag = 0))(log(P(tag = 1)) - log(P(tag = 0))) for each category.

categoricalInformationValue(category1, category2, ..., tag)

The result indicates how a discrete (categorical) feature [category1, category2, ...] contributes to a learning model that predicts the value of tag.

contingency

The contingency function calculates the contingency coefficient, a value that measures the association between two columns in a table. The computation is similar to the cramersV function but with a different denominator in the square root.

Syntax

contingency(column1, column2)

Arguments

  • column1 and column2 are the columns to be compared

Returned value

  • a value between 0 and 1. The larger the result, the closer the association of the two columns.

Return type is always Float64.

Example

The two columns being compared below have a small association with each other. We have included the result of cramersV also (as a comparison):

SELECT
    cramersV(a, b),
    contingency(a ,b)
FROM
    (
        SELECT
            number % 10 AS a,
            number % 4 AS b
        FROM
            numbers(150)
    )

Result:

┌──────cramersV(a, b)─┬───contingency(a, b)─┐
│ 0.41171788506213564 │ 0.05812725261759165 │
└─────────────────────┴─────────────────────┘

corr

Calculates the Pearson correlation coefficient: Σ((x - x̄)(y - ȳ)) / √(Σ(x - x̄)²*Σ(y - ȳ)²)


This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the corrStable function. It is slower but provides a more accurate result.

Syntax

corr(x, y)

Arguments

  • x: first variable. (U)Int*, Float*, Decimal.
  • y: second variable. (U)Int*, Float*, Decimal.

Returned value

  • The Pearson correlation coefficient. Float64.
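Example (a minimal sketch):

SELECT corr(x, y)
FROM values('x Int32, y Int32', (1, 2), (2, 4), (3, 6))

Since y is an exact linear function of x, the coefficient is 1.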

corrMatrix

Computes the correlation matrix over N variables.

Syntax

corrMatrix(x[, ...])

Arguments

  • x: a variable number of parameters. (U)Int*, Float*, Decimal.

Returned value

  • Correlation matrix. Array(Array(Float64)).

corrStable

Calculates the Pearson correlation coefficient: Σ((x - x̄)(y - ȳ)) / √(Σ(x - x̄)²*Σ(y - ȳ)²)


Similar to the corr function, but uses a numerically stable algorithm. As a result, corrStable is slower than corr but produces a more accurate result.

Syntax

corrStable(x, y)

Arguments

  • x: first variable. (U)Int*, Float*, Decimal.
  • y: second variable. (U)Int*, Float*, Decimal.

Returned value

  • The Pearson correlation coefficient. Float64.

count

Counts the number of rows or non-NULL values.

Use the following syntaxes for count:

  • count(expr) or COUNT(DISTINCT expr).
  • count() or COUNT(*).

Arguments

The function can take:

  • Zero parameters.
  • One expression.

Returned value

  • If the function is called without parameters, it counts the number of rows.
  • If the expression is passed, then the function counts how many times this expression returned not NULL. If the expression returns a Nullable-type value, then the result of count stays not Nullable. The function returns 0 if the expression returned NULL for all the rows.

In both cases the type of the returned value is UInt64.
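Example (a minimal sketch):

SELECT count(), count(x)
FROM values('x Nullable(UInt8)', 1, NULL, 2)

count() returns 3 (all rows), while count(x) returns 2, because the NULL value is skipped.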

covarPop

Calculates the population covariance: Σ(x - x̄)(y - ȳ) / n


This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the covarPopStable function. It works slower but provides a lower computational error.

Syntax

covarPop(x, y)

Arguments

  • x: first variable. (U)Int*, Float*, Decimal.
  • y: second variable. (U)Int*, Float*, Decimal.

Returned value

  • The population covariance between x and y. Float64.
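Example (a minimal sketch):

SELECT covarPop(x, y)
FROM values('x Int32, y Int32', (1, 1), (2, 2), (3, 3))

Both means equal 2 and Σ(x - x̄)(y - ȳ) = 2, so the result is 2 / 3 ≈ 0.667.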

covarPopMatrix

Returns the population covariance matrix over N variables.

Syntax

covarPopMatrix(x[, ...])

Arguments

  • x: a variable number of parameters. (U)Int*, Float*, Decimal.

Returned value

  • Population covariance matrix. Array(Array(Float64)).

covarPopStable

Calculates the value of the population covariance: Σ(x - x̄)(y - ȳ) / n


It is similar to the covarPop function, but uses a numerically stable algorithm. As a result, covarPopStable is slower than covarPop but produces a more accurate result.

Syntax

covarPopStable(x, y)

Arguments

  • x: first variable. (U)Int*, Float*, Decimal.
  • y: second variable. (U)Int*, Float*, Decimal.

Returned value

  • The population covariance between x and y. Float64.

covarSamp

Calculates the value of Σ((x - x̅)(y - y̅)) / (n - 1).

This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the covarSampStable function. It works slower but provides a lower computational error.

Syntax

covarSamp(x, y)

Arguments

  • x: first variable. (U)Int*, Float*, Decimal.
  • y: second variable. (U)Int*, Float*, Decimal.

Returned value

  • The sample covariance between x and y. For n <= 1, nan is returned. Float64.

covarSampMatrix

Returns the sample covariance matrix over N variables.

Syntax

covarSampMatrix(x[, ...])

Arguments

  • x: a variable number of parameters. (U)Int*, Float*, Decimal.

Returned value

  • Sample covariance matrix. Array(Array(Float64)).

covarSampStable

Calculates the value of Σ((x - x̅)(y - y̅)) / (n - 1). Similar to covarSamp but works slower while providing a lower computational error.

Syntax

covarSampStable(x, y)

Arguments

  • x: first variable. (U)Int*, Float*, Decimal.
  • y: second variable. (U)Int*, Float*, Decimal.

Returned value

  • The sample covariance between x and y. For n <= 1, inf is returned. Float64.

cramersV

Cramer's V (sometimes referred to as Cramer's phi) is a measure of association between two columns in a table. The result of the cramersV function ranges from 0 (corresponding to no association between the variables) to 1 and can reach 1 only when each value is completely determined by the other. It may be viewed as the association between two variables as a percentage of their maximum possible variation.

For a bias corrected version of Cramer's V see: cramersVBiasCorrected

Syntax

cramersV(column1, column2)

Parameters

  • column1: first column to be compared.
  • column2: second column to be compared.

Returned value

  • a value between 0 (corresponding to no association between the columns' values) and 1 (complete association).

Type: always Float64.

cramersVBiasCorrected

Cramer's V is a measure of association between two columns in a table. The result of the cramersV function ranges from 0 (corresponding to no association between the variables) to 1 and can reach 1 only when each value is completely determined by the other. The function can be heavily biased, so this version of Cramer's V uses the bias correction.

Syntax

cramersVBiasCorrected(column1, column2)

Parameters

  • column1: first column to be compared.
  • column2: second column to be compared.

Returned value

  • a value between 0 (corresponding to no association between the columns' values) and 1 (complete association).

Type: always Float64.

deltaSum

Sums the arithmetic difference between consecutive rows. If the difference is negative, it is ignored.

The underlying data must be sorted for this function to work properly. If you would like to use this function in a materialized view, you most likely want to use the deltaSumTimestamp method instead.

Syntax

deltaSum(value)

Arguments

  • value: Input values, must be Integer or Float type.

Returned value

  • The accumulated arithmetic difference, of the Integer or Float type.
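Example (a minimal sketch):

SELECT deltaSum(arrayJoin([1, 2, 3, 0, 3, 4, 2, 3]))

The positive differences are 1 + 1 + 3 + 1 + 1 = 7; the negative steps (3→0 and 4→2) are ignored.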

deltaSumTimestamp

Adds the difference between consecutive rows. If the difference is negative, it is ignored.

This function is primarily for materialized views that store data ordered by some time bucket-aligned timestamp, for example, a toStartOfMinute bucket. Because the rows in such a materialized view will all have the same timestamp, it is impossible for them to be merged in the correct order, without storing the original, unrounded timestamp value. The deltaSumTimestamp function keeps track of the original timestamp of the values it's seen, so the values (states) of the function are correctly computed during merging of parts.

To calculate the delta sum across an ordered collection you can simply use the deltaSum function.

Syntax

deltaSumTimestamp(value, timestamp)

Arguments

  • value: Input values, must be an Integer, Float, Date, or DateTime type.
  • timestamp: The parameter for ordering values, must be an Integer, Float, Date, or DateTime type.

Returned value

  • Accumulated differences between consecutive values, ordered by the timestamp parameter.

Type: Integer or Float or Date or DateTime.
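Example (a minimal sketch; values are ordered by their original timestamp):

SELECT deltaSumTimestamp(value, timestamp)
FROM (SELECT number AS timestamp, [0, 4, 8, 3, 0, 0, 0, 1, 3, 5][number] AS value FROM numbers(1, 10))

Ordered by timestamp, the positive differences sum to 4 + 4 + 1 + 2 + 2 = 13.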

distinctDynamicTypes

Calculates the list of distinct data types stored in a Dynamic column.

Syntax

distinctDynamicTypes(dynamic)

Arguments

  • dynamic: Dynamic column.

Returned value

  • The sorted list of data type names. Array(String).

distinctJSONPaths

Calculates the list of distinct paths stored in a JSON column.

Syntax

distinctJSONPaths(json)

Arguments

  • json: JSON column.

Returned value

  • The sorted list of paths. Array(String).

distinctJSONPathsAndTypes

Calculates the list of distinct paths and their types stored in a JSON column.

Syntax

distinctJSONPathsAndTypes(json)

Arguments

  • json: JSON column.

Returned value

  • The sorted map of paths and types. Map(String, Array(String)).

entropy

Calculates the Shannon entropy of a column of values.

Syntax

entropy(val)

Arguments

  • val: Column of values of any type.

Returned value

  • Shannon entropy.

Type: Float64.
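Shannon entropy is defined over the frequencies of the distinct values in the column, H = -Σ p·log₂(p). A minimal reference computation of that definition:

```python
from collections import Counter
import math

def shannon_entropy(values):
    # H = -sum(p * log2(p)) over the frequency p of each distinct value.
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

For example, four equally frequent values give 2 bits of entropy, while a constant column gives 0.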

exponentialMovingAverage

Calculates the exponential moving average of values for the determined time.

Syntax

exponentialMovingAverage(x)(value, timeunit)

Each value corresponds to the specified timeunit. The half-life x is the time lag at which the exponential weights decay by one-half. The function returns a weighted average: the older the time point, the less weight the corresponding value carries.

Arguments

  • value: Value. Integer, Float or Decimal.
  • timeunit: Timeunit. Integer, Float or Decimal. The timeunit is not a timestamp in seconds, but an index of the time interval. It can be calculated using intDiv.

Parameters

  • x: Half-life period. Integer, Float or Decimal.

Returned values

  • Returns an exponentially smoothed moving average of the values for the past x time at the latest point of time.

Type: Float64.
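A sketch of the recurrence these numbers follow, assuming a smoothed state that starts at zero and one unit time step before the first point. This illustrates the observed behavior of the examples below; it is not the server implementation.

```python
def exponential_moving_average(x, pairs):
    # The smoothed state decays by 2 ** (-dt / x) between points,
    # and each new value contributes with weight 1 - 2 ** (-dt / x).
    ema, prev_t = 0.0, None
    for v, t in sorted(pairs, key=lambda p: p[1]):
        # Assume one unit step before the first point (consistent
        # with the values in the windowed example below).
        dt = 1.0 if prev_t is None else t - prev_t
        beta = 2.0 ** (-dt / x)
        ema = v * (1.0 - beta) + ema * beta
        prev_t = t
    return ema
```

With half-life 10, a single value of 1 at time 0 yields 1 - 2^(-1/10) ≈ 0.067, matching the first row of the windowed example below.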

Examples

Input table:

┌──temperature─┬─timestamp──┐
│          95  │         1  │
│          95  │         2  │
│          95  │         3  │
│          96  │         4  │
│          96  │         5  │
│          96  │         6  │
│          96  │         7  │
│          97  │         8  │
│          97  │         9  │
│          97  │        10  │
│          97  │        11  │
│          98  │        12  │
│          98  │        13  │
│          98  │        14  │
│          98  │        15  │
│          99  │        16  │
│          99  │        17  │
│          99  │        18  │
│         100  │        19  │
│         100  │        20  │
└──────────────┴────────────┘

Query:

SELECT exponentialMovingAverage(5)(temperature, timestamp)

Result:

┌──exponentialMovingAverage(5)(temperature, timestamp)──┐
│                                    92.25779635374204  │
└───────────────────────────────────────────────────────┘

Query:

SELECT
    value,
    time,
    round(exp_smooth, 3),
    bar(exp_smooth, 0, 1, 50) AS bar
FROM
(
    SELECT
        (number = 0) OR (number >= 25) AS value,
        number AS time,
        exponentialMovingAverage(10)(value, time) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_smooth
    FROM numbers(50)
)

Result:

┌─value─┬─time─┬─round(exp_smooth, 3)─┬─bar────────────────────────────────────────┐
│     1 │    0 │                0.067 │ ███▎                                       │
│     0 │    1 │                0.062 │ ███                                        │
│     0 │    2 │                0.058 │ ██▊                                        │
│     0 │    3 │                0.054 │ ██▋                                        │
│     0 │    4 │                0.051 │ ██▌                                        │
│     0 │    5 │                0.047 │ ██▎                                        │
│     0 │    6 │                0.044 │ ██▏                                        │
│     0 │    7 │                0.041 │ ██                                         │
│     0 │    8 │                0.038 │ █▊                                         │
│     0 │    9 │                0.036 │ █▋                                         │
│     0 │   10 │                0.033 │ █▋                                         │
│     0 │   11 │                0.031 │ █▌                                         │
│     0 │   12 │                0.029 │ █▍                                         │
│     0 │   13 │                0.027 │ █▎                                         │
│     0 │   14 │                0.025 │ █▎                                         │
│     0 │   15 │                0.024 │ █▏                                         │
│     0 │   16 │                0.022 │ █                                          │
│     0 │   17 │                0.021 │ █                                          │
│     0 │   18 │                0.019 │ ▊                                          │
│     0 │   19 │                0.018 │ ▊                                          │
│     0 │   20 │                0.017 │ ▋                                          │
│     0 │   21 │                0.016 │ ▋                                          │
│     0 │   22 │                0.015 │ ▋                                          │
│     0 │   23 │                0.014 │ ▋                                          │
│     0 │   24 │                0.013 │ ▋                                          │
│     1 │   25 │                0.079 │ ███▊                                       │
│     1 │   26 │                 0.14 │ ███████                                    │
│     1 │   27 │                0.198 │ █████████▊                                 │
│     1 │   28 │                0.252 │ ████████████▌                              │
│     1 │   29 │                0.302 │ ███████████████                            │
│     1 │   30 │                0.349 │ █████████████████▍                         │
│     1 │   31 │                0.392 │ ███████████████████▌                       │
│     1 │   32 │                0.433 │ █████████████████████▋                     │
│     1 │   33 │                0.471 │ ███████████████████████▌                   │
│     1 │   34 │                0.506 │ █████████████████████████▎                 │
│     1 │   35 │                0.539 │ ██████████████████████████▊                │
│     1 │   36 │                 0.57 │ ████████████████████████████▌              │
│     1 │   37 │                0.599 │ █████████████████████████████▊             │
│     1 │   38 │                0.626 │ ███████████████████████████████▎           │
│     1 │   39 │                0.651 │ ████████████████████████████████▌          │
│     1 │   40 │                0.674 │ █████████████████████████████████▋         │
│     1 │   41 │                0.696 │ ██████████████████████████████████▋        │
│     1 │   42 │                0.716 │ ███████████████████████████████████▋       │
│     1 │   43 │                0.735 │ ████████████████████████████████████▋      │
│     1 │   44 │                0.753 │ █████████████████████████████████████▋     │
│     1 │   45 │                 0.77 │ ██████████████████████████████████████▍    │
│     1 │   46 │                0.785 │ ███████████████████████████████████████▎   │
│     1 │   47 │                  0.8 │ ███████████████████████████████████████▊   │  
│     1 │   48 │                0.813 │ ████████████████████████████████████████▋  │
│     1 │   49 │                0.825 │ █████████████████████████████████████████▎ │
└───────┴──────┴──────────────────────┴────────────────────────────────────────────┘

exponentialTimeDecayedAvg

Returns the exponentially smoothed weighted moving average of values of a time series at point t in time.

Syntax

exponentialTimeDecayedAvg(x)(v, t)

Arguments

  • v: Value. Integer, Float or Decimal.
  • t: Time. Integer, Float or Decimal, DateTime, DateTime64.

Parameters

  • x: Half-life period. Integer, Float or Decimal.

Returned values

  • Returns an exponentially smoothed weighted moving average at index t in time. Float64.
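The numbers in the example below are consistent with weighting each (value, time) pair by exp((tᵢ - t_latest) / x) and dividing the weighted sum by the sum of weights. A Python sketch of that interpretation (note that the observed decay is an e-folding over x; this is a sketch, not the server implementation):

```python
import math

def exp_time_decayed_avg(x, points):
    # Weighted average over the rows in the window frame, where each
    # (v, t) pair is weighted by exp((t - t_latest) / x).
    t_latest = max(t for _v, t in points)
    num = sum(v * math.exp((t - t_latest) / x) for v, t in points)
    den = sum(math.exp((t - t_latest) / x) for _v, t in points)
    return num / den
```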

Examples

Query:

SELECT
    value,
    time,
    round(exp_smooth, 3),
    bar(exp_smooth, 0, 5, 50) AS bar
FROM
(
    SELECT
        (number = 0) OR (number >= 25) AS value,
        number AS time,
        exponentialTimeDecayedAvg(10)(value, time) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_smooth
    FROM numbers(50)
)

Result:

    ┌─value─┬─time─┬─round(exp_smooth, 3)─┬─bar────────┐
 1. │     1 │    0 │                    1 │ ██████████ │
 2. │     0 │    1 │                0.475 │ ████▊      │
 3. │     0 │    2 │                0.301 │ ███        │
 4. │     0 │    3 │                0.214 │ ██▏        │
 5. │     0 │    4 │                0.162 │ █▌         │
 6. │     0 │    5 │                0.128 │ █▎         │
 7. │     0 │    6 │                0.104 │ █          │
 8. │     0 │    7 │                0.086 │ ▊          │
 9. │     0 │    8 │                0.072 │ ▋          │
10. │     0 │    9 │                0.061 │ ▌          │
11. │     0 │   10 │                0.052 │ ▌          │
12. │     0 │   11 │                0.045 │ ▍          │
13. │     0 │   12 │                0.039 │ ▍          │
14. │     0 │   13 │                0.034 │ ▎          │
15. │     0 │   14 │                 0.03 │ ▎          │
16. │     0 │   15 │                0.027 │ ▎          │
17. │     0 │   16 │                0.024 │ ▏          │
18. │     0 │   17 │                0.021 │ ▏          │
19. │     0 │   18 │                0.018 │ ▏          │
20. │     0 │   19 │                0.016 │ ▏          │
21. │     0 │   20 │                0.015 │ ▏          │
22. │     0 │   21 │                0.013 │ ▏          │
23. │     0 │   22 │                0.012 │            │
24. │     0 │   23 │                 0.01 │            │
25. │     0 │   24 │                0.009 │            │
26. │     1 │   25 │                0.111 │ █          │
27. │     1 │   26 │                0.202 │ ██         │
28. │     1 │   27 │                0.283 │ ██▊        │
29. │     1 │   28 │                0.355 │ ███▌       │
30. │     1 │   29 │                 0.42 │ ████▏      │
31. │     1 │   30 │                0.477 │ ████▊      │
32. │     1 │   31 │                0.529 │ █████▎     │
33. │     1 │   32 │                0.576 │ █████▊     │
34. │     1 │   33 │                0.618 │ ██████▏    │
35. │     1 │   34 │                0.655 │ ██████▌    │
36. │     1 │   35 │                0.689 │ ██████▉    │
37. │     1 │   36 │                0.719 │ ███████▏   │
38. │     1 │   37 │                0.747 │ ███████▍   │
39. │     1 │   38 │                0.771 │ ███████▋   │
40. │     1 │   39 │                0.793 │ ███████▉   │
41. │     1 │   40 │                0.813 │ ████████▏  │
42. │     1 │   41 │                0.831 │ ████████▎  │
43. │     1 │   42 │                0.848 │ ████████▍  │
44. │     1 │   43 │                0.862 │ ████████▌  │
45. │     1 │   44 │                0.876 │ ████████▊  │
46. │     1 │   45 │                0.888 │ ████████▉  │
47. │     1 │   46 │                0.898 │ ████████▉  │
48. │     1 │   47 │                0.908 │ █████████  │
49. │     1 │   48 │                0.917 │ █████████▏ │
50. │     1 │   49 │                0.925 │ █████████▏ │
    └───────┴──────┴──────────────────────┴────────────┘

exponentialTimeDecayedCount

Returns the cumulative exponential decay over a time series at the index t in time.

Syntax

exponentialTimeDecayedCount(x)(t)

Arguments

  • t: Time. Integer, Float or Decimal, DateTime, DateTime64.

Parameters

  • x: Half-life period. Integer, Float or Decimal.

Returned values

  • Returns the cumulative exponential decay at the given point in time. Float64.
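The example values below are consistent with summing one decayed unit weight per row in the frame, each weighted by exp((tᵢ - t_latest) / x). A sketch under that interpretation (not the server implementation):

```python
import math

def exp_time_decayed_count(x, times):
    # Sum of decayed unit weights, one per row in the window frame.
    t_latest = max(times)
    return sum(math.exp((t - t_latest) / x) for t in times)
```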

Example

Query:

SELECT
    value,
    time,
    round(exp_smooth, 3),
    bar(exp_smooth, 0, 20, 50) AS bar
FROM
(
    SELECT
        (number % 5) = 0 AS value,
        number AS time,
        exponentialTimeDecayedCount(10)(time) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_smooth
    FROM numbers(50)
)

Result:

    ┌─value─┬─time─┬─round(exp_smooth, 3)─┬─bar────────────────────────┐
 1. │     1 │    0 │                    1 │ ██▌                        │
 2. │     0 │    1 │                1.905 │ ████▊                      │
 3. │     0 │    2 │                2.724 │ ██████▊                    │
 4. │     0 │    3 │                3.464 │ ████████▋                  │
 5. │     0 │    4 │                4.135 │ ██████████▎                │
 6. │     1 │    5 │                4.741 │ ███████████▊               │
 7. │     0 │    6 │                 5.29 │ █████████████▏             │
 8. │     0 │    7 │                5.787 │ ██████████████▍            │
 9. │     0 │    8 │                6.236 │ ███████████████▌           │
10. │     0 │    9 │                6.643 │ ████████████████▌          │
11. │     1 │   10 │                 7.01 │ █████████████████▌         │
12. │     0 │   11 │                7.343 │ ██████████████████▎        │
13. │     0 │   12 │                7.644 │ ███████████████████        │
14. │     0 │   13 │                7.917 │ ███████████████████▊       │
15. │     0 │   14 │                8.164 │ ████████████████████▍      │
16. │     1 │   15 │                8.387 │ ████████████████████▉      │
17. │     0 │   16 │                8.589 │ █████████████████████▍     │
18. │     0 │   17 │                8.771 │ █████████████████████▉     │
19. │     0 │   18 │                8.937 │ ██████████████████████▎    │
20. │     0 │   19 │                9.086 │ ██████████████████████▋    │
21. │     1 │   20 │                9.222 │ ███████████████████████    │
22. │     0 │   21 │                9.344 │ ███████████████████████▎   │
23. │     0 │   22 │                9.455 │ ███████████████████████▋   │
24. │     0 │   23 │                9.555 │ ███████████████████████▉   │
25. │     0 │   24 │                9.646 │ ████████████████████████   │
26. │     1 │   25 │                9.728 │ ████████████████████████▎  │
27. │     0 │   26 │                9.802 │ ████████████████████████▌  │
28. │     0 │   27 │                9.869 │ ████████████████████████▋  │
29. │     0 │   28 │                 9.93 │ ████████████████████████▊  │
30. │     0 │   29 │                9.985 │ ████████████████████████▉  │
31. │     1 │   30 │               10.035 │ █████████████████████████  │
32. │     0 │   31 │                10.08 │ █████████████████████████▏ │
33. │     0 │   32 │               10.121 │ █████████████████████████▎ │
34. │     0 │   33 │               10.158 │ █████████████████████████▍ │
35. │     0 │   34 │               10.191 │ █████████████████████████▍ │
36. │     1 │   35 │               10.221 │ █████████████████████████▌ │
37. │     0 │   36 │               10.249 │ █████████████████████████▌ │
38. │     0 │   37 │               10.273 │ █████████████████████████▋ │
39. │     0 │   38 │               10.296 │ █████████████████████████▋ │
40. │     0 │   39 │               10.316 │ █████████████████████████▊ │
41. │     1 │   40 │               10.334 │ █████████████████████████▊ │
42. │     0 │   41 │               10.351 │ █████████████████████████▉ │
43. │     0 │   42 │               10.366 │ █████████████████████████▉ │
44. │     0 │   43 │               10.379 │ █████████████████████████▉ │
45. │     0 │   44 │               10.392 │ █████████████████████████▉ │
46. │     1 │   45 │               10.403 │ ██████████████████████████ │
47. │     0 │   46 │               10.413 │ ██████████████████████████ │
48. │     0 │   47 │               10.422 │ ██████████████████████████ │
49. │     0 │   48 │                10.43 │ ██████████████████████████ │
50. │     0 │   49 │               10.438 │ ██████████████████████████ │
    └───────┴──────┴──────────────────────┴────────────────────────────┘

exponentialTimeDecayedMax

Returns the maximum of the computed exponentially smoothed moving average at index t in time with that at t-1.

Syntax

exponentialTimeDecayedMax(x)(value, timeunit)

Arguments

  • value: Value. Integer, Float or Decimal.
  • timeunit: Timeunit. Integer, Float or Decimal, DateTime, DateTime64.

Parameters

  • x: Half-life period. Integer, Float or Decimal.

Returned values

  • Returns the maximum of the exponentially smoothed weighted moving average at t and t-1. Float64.
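The example output below is consistent with taking, at each point, the maximum over all values in the frame after decaying each one to the latest time by exp((tᵢ - t_latest) / x). A sketch under that interpretation:

```python
import math

def exp_time_decayed_max(x, points):
    # Maximum over the rows in the frame, each value decayed to the
    # latest time.  An undecayed current value of 1 therefore always
    # dominates, as in the second half of the example below.
    t_latest = max(t for _v, t in points)
    return max(v * math.exp((t - t_latest) / x) for v, t in points)
```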

Example

Query:

SELECT
    value,
    time,
    round(exp_smooth, 3),
    bar(exp_smooth, 0, 5, 50) AS bar
FROM
(
    SELECT
        (number = 0) OR (number >= 25) AS value,
        number AS time,
        exponentialTimeDecayedMax(10)(value, time) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_smooth
    FROM numbers(50)
)

Result:

    ┌─value─┬─time─┬─round(exp_smooth, 3)─┬─bar────────┐
 1. │     1 │    0 │                    1 │ ██████████ │
 2. │     0 │    1 │                0.905 │ █████████  │
 3. │     0 │    2 │                0.819 │ ████████▏  │
 4. │     0 │    3 │                0.741 │ ███████▍   │
 5. │     0 │    4 │                 0.67 │ ██████▋    │
 6. │     0 │    5 │                0.607 │ ██████     │
 7. │     0 │    6 │                0.549 │ █████▍     │
 8. │     0 │    7 │                0.497 │ ████▉      │
 9. │     0 │    8 │                0.449 │ ████▍      │
10. │     0 │    9 │                0.407 │ ████       │
11. │     0 │   10 │                0.368 │ ███▋       │
12. │     0 │   11 │                0.333 │ ███▎       │
13. │     0 │   12 │                0.301 │ ███        │
14. │     0 │   13 │                0.273 │ ██▋        │
15. │     0 │   14 │                0.247 │ ██▍        │
16. │     0 │   15 │                0.223 │ ██▏        │
17. │     0 │   16 │                0.202 │ ██         │
18. │     0 │   17 │                0.183 │ █▊         │
19. │     0 │   18 │                0.165 │ █▋         │
20. │     0 │   19 │                 0.15 │ █▍         │
21. │     0 │   20 │                0.135 │ █▎         │
22. │     0 │   21 │                0.122 │ █▏         │
23. │     0 │   22 │                0.111 │ █          │
24. │     0 │   23 │                  0.1 │ █          │
25. │     0 │   24 │                0.091 │ ▉          │
26. │     1 │   25 │                    1 │ ██████████ │
27. │     1 │   26 │                    1 │ ██████████ │
28. │     1 │   27 │                    1 │ ██████████ │
29. │     1 │   28 │                    1 │ ██████████ │
30. │     1 │   29 │                    1 │ ██████████ │
31. │     1 │   30 │                    1 │ ██████████ │
32. │     1 │   31 │                    1 │ ██████████ │
33. │     1 │   32 │                    1 │ ██████████ │
34. │     1 │   33 │                    1 │ ██████████ │
35. │     1 │   34 │                    1 │ ██████████ │
36. │     1 │   35 │                    1 │ ██████████ │
37. │     1 │   36 │                    1 │ ██████████ │
38. │     1 │   37 │                    1 │ ██████████ │
39. │     1 │   38 │                    1 │ ██████████ │
40. │     1 │   39 │                    1 │ ██████████ │
41. │     1 │   40 │                    1 │ ██████████ │
42. │     1 │   41 │                    1 │ ██████████ │
43. │     1 │   42 │                    1 │ ██████████ │
44. │     1 │   43 │                    1 │ ██████████ │
45. │     1 │   44 │                    1 │ ██████████ │
46. │     1 │   45 │                    1 │ ██████████ │
47. │     1 │   46 │                    1 │ ██████████ │
48. │     1 │   47 │                    1 │ ██████████ │
49. │     1 │   48 │                    1 │ ██████████ │
50. │     1 │   49 │                    1 │ ██████████ │
    └───────┴──────┴──────────────────────┴────────────┘

exponentialTimeDecayedSum

Returns the sum of exponentially smoothed moving average values of a time series at the index t in time.

Syntax

exponentialTimeDecayedSum(x)(v, t)

Arguments

  • v: Value. Integer, Float or Decimal.
  • t: Time. Integer, Float or Decimal, DateTime, DateTime64.

Parameters

  • x: Half-life period. Integer, Float or Decimal.

Returned values

  • Returns the sum of exponentially smoothed moving average values at the given point in time. Float64.
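The example values below are consistent with summing every value in the frame after decaying it to the latest time by exp((tᵢ - t_latest) / x). A sketch under that interpretation (not the server implementation):

```python
import math

def exp_time_decayed_sum(x, points):
    # Sum over the rows in the frame, each value decayed to the
    # latest time.
    t_latest = max(t for _v, t in points)
    return sum(v * math.exp((t - t_latest) / x) for v, t in points)
```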

Example

Query:

SELECT
    value,
    time,
    round(exp_smooth, 3),
    bar(exp_smooth, 0, 10, 50) AS bar
FROM
(
    SELECT
        (number = 0) OR (number >= 25) AS value,
        number AS time,
        exponentialTimeDecayedSum(10)(value, time) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_smooth
    FROM numbers(50)
)

Result:

    ┌─value─┬─time─┬─round(exp_smooth, 3)─┬─bar───────────────────────────────────────────────┐
 1. │     1 │    0 │                    1 │ █████                                             │
 2. │     0 │    1 │                0.905 │ ████▌                                             │
 3. │     0 │    2 │                0.819 │ ████                                              │
 4. │     0 │    3 │                0.741 │ ███▋                                              │
 5. │     0 │    4 │                 0.67 │ ███▎                                              │
 6. │     0 │    5 │                0.607 │ ███                                               │
 7. │     0 │    6 │                0.549 │ ██▋                                               │
 8. │     0 │    7 │                0.497 │ ██▍                                               │
 9. │     0 │    8 │                0.449 │ ██▏                                               │
10. │     0 │    9 │                0.407 │ ██                                                │
11. │     0 │   10 │                0.368 │ █▊                                                │
12. │     0 │   11 │                0.333 │ █▋                                                │
13. │     0 │   12 │                0.301 │ █▌                                                │
14. │     0 │   13 │                0.273 │ █▎                                                │
15. │     0 │   14 │                0.247 │ █▏                                                │
16. │     0 │   15 │                0.223 │ █                                                 │
17. │     0 │   16 │                0.202 │ █                                                 │
18. │     0 │   17 │                0.183 │ ▉                                                 │
19. │     0 │   18 │                0.165 │ ▊                                                 │
20. │     0 │   19 │                 0.15 │ ▋                                                 │
21. │     0 │   20 │                0.135 │ ▋                                                 │
22. │     0 │   21 │                0.122 │ ▌                                                 │
23. │     0 │   22 │                0.111 │ ▌                                                 │
24. │     0 │   23 │                  0.1 │ ▌                                                 │
25. │     0 │   24 │                0.091 │ ▍                                                 │
26. │     1 │   25 │                1.082 │ █████▍                                            │
27. │     1 │   26 │                1.979 │ █████████▉                                        │
28. │     1 │   27 │                2.791 │ █████████████▉                                    │
29. │     1 │   28 │                3.525 │ █████████████████▋                                │
30. │     1 │   29 │                 4.19 │ ████████████████████▉                             │
31. │     1 │   30 │                4.791 │ ███████████████████████▉                          │
32. │     1 │   31 │                5.335 │ ██████████████████████████▋                       │
33. │     1 │   32 │                5.827 │ █████████████████████████████▏                    │
34. │     1 │   33 │                6.273 │ ███████████████████████████████▎                  │
35. │     1 │   34 │                6.676 │ █████████████████████████████████▍                │
36. │     1 │   35 │                7.041 │ ███████████████████████████████████▏              │
37. │     1 │   36 │                7.371 │ ████████████████████████████████████▊             │
38. │     1 │   37 │                7.669 │ ██████████████████████████████████████▎           │
39. │     1 │   38 │                7.939 │ ███████████████████████████████████████▋          │
40. │     1 │   39 │                8.184 │ ████████████████████████████████████████▉         │
41. │     1 │   40 │                8.405 │ ██████████████████████████████████████████        │
42. │     1 │   41 │                8.605 │ ███████████████████████████████████████████       │
43. │     1 │   42 │                8.786 │ ███████████████████████████████████████████▉      │
44. │     1 │   43 │                 8.95 │ ████████████████████████████████████████████▊     │
45. │     1 │   44 │                9.098 │ █████████████████████████████████████████████▍    │
46. │     1 │   45 │                9.233 │ ██████████████████████████████████████████████▏   │
47. │     1 │   46 │                9.354 │ ██████████████████████████████████████████████▊   │
48. │     1 │   47 │                9.464 │ ███████████████████████████████████████████████▎  │
49. │     1 │   48 │                9.563 │ ███████████████████████████████████████████████▊  │
50. │     1 │   49 │                9.653 │ ████████████████████████████████████████████████▎ │
    └───────┴──────┴──────────────────────┴───────────────────────────────────────────────────┘

first_value

It is an alias for any but it was introduced for compatibility with Window Functions, where sometimes it's necessary to process NULL values.

It supports declaring a modifier to respect nulls (RESPECT NULLS), both under Window Functions and in normal aggregations.

As with any, without Window Functions the result will be random if the source stream is not ordered. The return type matches the input type (NULL is returned only if the input is Nullable or the -OrNull combinator is added).

flameGraph

Aggregate function which builds a flamegraph using the list of stacktraces. Outputs an array of strings which can be used by flamegraph.pl utility to render an SVG of the flamegraph.

Syntax

flameGraph(traces, [size], [ptr])

Parameters

  • traces: a stacktrace. Array(UInt64).
  • size: an allocation size for memory profiling. (optional - default 1). UInt64.
  • ptr: an allocation address. (optional - default 0). UInt64.

In the case where ptr != 0, flameGraph will map allocations (size > 0) and deallocations (size < 0) with the same size and ptr. Only allocations that were not freed are shown. Unmapped deallocations are ignored.

Returned value

  • An array of strings that can be used by the flamegraph.pl utility. Array(String).

groupArray

Syntax: groupArray(x) or groupArray(max_size)(x)

Creates an array of argument values. Values can be added to the array in any (indeterminate) order.

The second version (with the max_size parameter) limits the size of the resulting array to max_size elements. For example, groupArray(1)(x) is equivalent to [any(x)].

In some cases, you can still rely on the order of execution. This applies to cases when SELECT comes from a subquery that uses ORDER BY if the subquery result is small enough.

Example

SELECT * FROM default.ck
┌─id─┬─name─────┐
│  1 │ zhangsan │
│  1 │ ᴺᵁᴸᴸ     │
│  1 │ lisi     │
│  2 │ wangwu   │
└────┴──────────┘

Query:

select id, groupArray(10)(name) from default.ck group by id

Result:

┌─id─┬─groupArray(10)(name)─┐
│  1 │ ['zhangsan','lisi']  │
│  2 │ ['wangwu']           │
└────┴──────────────────────┘

As the results above show, the groupArray function removes NULL values.

  • Alias: array_agg.

groupArrayInsertAt

Inserts a value into the array at the specified position.

Syntax

groupArrayInsertAt(default_x, size)(x, pos)

If in one query several values are inserted into the same position, the function behaves in the following ways:

  • If a query is executed in a single thread, the first one of the inserted values is used.
  • If a query is executed in multiple threads, the resulting value is an undetermined one of the inserted values.

Arguments

  • x: Value to be inserted. Expression resulting in one of the supported data types.
  • pos: Position at which the specified element x is to be inserted. Index numbering in the array starts from zero. UInt32.
  • default_x: Default value for substituting in empty positions. Optional parameter. Expression resulting in the data type configured for the x parameter. If default_x is not defined, the default values are used.
  • size: Length of the resulting array. Optional parameter. When using this parameter, the default value default_x must be specified. UInt32.

Returned value

  • Array with inserted values.

Type: Array.
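Assuming single-threaded execution and distinct positions, the semantics of the examples below can be sketched as: place each value at its position, fill gaps with the default, and truncate or pad to size when it is given.

```python
def group_array_insert_at(pairs, default_x='', size=None):
    # pairs: iterable of (x, pos); gaps are filled with default_x
    # (ClickHouse uses the type's default, e.g. '' or 0, when no
    # default_x parameter is given).
    result = []
    for x, pos in pairs:
        while len(result) <= pos:
            result.append(default_x)
        result[pos] = x
    if size is not None:
        result = (result + [default_x] * size)[:size]
    return result
```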

Example

Query:

SELECT groupArrayInsertAt(toString(number), number * 2) FROM numbers(5)

Result:

┌─groupArrayInsertAt(toString(number), multiply(number, 2))─┐
│ ['0','','1','','2','','3','','4']                         │
└───────────────────────────────────────────────────────────┘

Query:

SELECT groupArrayInsertAt('-')(toString(number), number * 2) FROM numbers(5)

Result:

┌─groupArrayInsertAt('-')(toString(number), multiply(number, 2))─┐
│ ['0','-','1','-','2','-','3','-','4']                          │
└────────────────────────────────────────────────────────────────┘

Query:

SELECT groupArrayInsertAt('-', 5)(toString(number), number * 2) FROM numbers(5)

Result:

┌─groupArrayInsertAt('-', 5)(toString(number), multiply(number, 2))─┐
│ ['0','-','1','-','2']                                             │
└───────────────────────────────────────────────────────────────────┘

Multi-threaded insertion of elements into one position.

Query:

SELECT groupArrayInsertAt(number, 0) FROM numbers_mt(10) SETTINGS max_block_size = 1

As a result of this query you get a random integer in the [0,9] range. For example:

┌─groupArrayInsertAt(number, 0)─┐
│ [7]                           │
└───────────────────────────────┘

groupArrayIntersect

Returns the intersection of the given arrays (all items that are present in every given array).

Syntax

groupArrayIntersect(x)

Arguments

  • x: Argument (column name or expression).

Returned values

  • Array that contains elements that are in all arrays.

Type: Array.
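The semantics amount to a set intersection folded over the input arrays; the order of elements in the returned array is not specified. A sketch (sorted here for determinism):

```python
def group_array_intersect(arrays):
    # Fold set intersection over all input arrays.
    it = iter(arrays)
    acc = set(next(it))
    for arr in it:
        acc &= set(arr)
    return sorted(acc)
```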

Examples

Consider table numbers:

┌─a──────────────┐
│ [1,2,4]        │
│ [1,5,2,8,-1,0] │
│ [1,5,7,5,8,2]  │
└────────────────┘

Query with column name as argument:

SELECT groupArrayIntersect(a) as intersection FROM numbers

Result:

┌─intersection──────┐
│ [1, 2]            │
└───────────────────┘

groupArrayLast

Syntax: groupArrayLast(max_size)(x)

Creates an array of the last argument values. For example, groupArrayLast(1)(x) is equivalent to [anyLast(x)].

In some cases, you can still rely on the order of execution. This applies to cases when SELECT comes from a subquery that uses ORDER BY if the subquery result is small enough.
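Keeping only the last max_size values can be sketched with a bounded deque (a sketch of the semantics, not the server implementation):

```python
from collections import deque

def group_array_last(max_size, values):
    # A deque with maxlen keeps only the most recent max_size items.
    return list(deque(values, maxlen=max_size))
```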

Example

Query:

select groupArrayLast(2)(number+1) numbers from numbers(10)

Result:

┌─numbers─┐
│ [9,10]  │
└─────────┘

Compared to groupArray:

select groupArray(2)(number+1) numbers from numbers(10)
┌─numbers─┐
│ [1,2]   │
└─────────┘

groupArrayMovingAvg

Calculates the moving average of input values.

groupArrayMovingAvg(numbers_for_summing)
groupArrayMovingAvg(window_size)(numbers_for_summing)

The function can take the window size as a parameter. If left unspecified, the function takes the window size equal to the number of rows in the column.

Arguments

  • numbers_for_summing: Expression resulting in a numeric data type value.
  • window_size: Size of the calculation window.

Returned values

  • Array of the same size and type as the input data.

The function uses rounding towards zero. It truncates the decimal places insignificant for the resulting data type.
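A sketch of the computation: a running moving sum divided by the window size at each row, with integer inputs truncated toward zero as described above (mirrored here with int()). Illustrative only; the server computes this over the aggregation stream.

```python
def group_array_moving_avg(values, window_size=None):
    # Running sum over a sliding window, divided by the window size;
    # int() truncates toward zero like the integer result types do.
    if window_size is None:
        window_size = len(values)
    result, running = [], 0
    for i, v in enumerate(values):
        running += v
        if i >= window_size:
            running -= values[i - window_size]
        result.append(int(running / window_size))
    return result
```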

groupArrayMovingSum

Calculates the moving sum of input values.

groupArrayMovingSum(numbers_for_summing)
groupArrayMovingSum(window_size)(numbers_for_summing)

The function can take the window size as a parameter. If left unspecified, the function takes the window size equal to the number of rows in the column.

Arguments

  • numbers_for_summing: Expression resulting in a numeric data type value.
  • window_size: Size of the calculation window.

Returned values

  • Array of the same size and type as the input data.
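The computation can be sketched as a running sum over a sliding window, one output element per input row (a sketch of the semantics, not the server implementation):

```python
def group_array_moving_sum(values, window_size=None):
    # Running sum; once the window is full, the element leaving the
    # window is subtracted.
    if window_size is None:
        window_size = len(values)
    result, running = [], 0
    for i, v in enumerate(values):
        running += v
        if i >= window_size:
            running -= values[i - window_size]
        result.append(running)
    return result
```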

groupArraySample

Creates an array of sample argument values. The size of the resulting array is limited to max_size elements. Argument values are selected and added to the array randomly.

Syntax

groupArraySample(max_size[, seed])(x)

Arguments

  • max_size: Maximum size of the resulting array. UInt64.
  • seed: Seed for the random number generator. Optional. UInt64. Default value: 123456.
  • x: Argument (column name or expression).

Returned values

  • Array of randomly selected x arguments.

Type: Array.

Examples

Consider table colors:

┌─id─┬─color──┐
│  1 │ red    │
│  2 │ blue   │
│  3 │ green  │
│  4 │ white  │
│  5 │ orange │
└────┴────────┘

Query with column name as argument:

SELECT groupArraySample(3)(color) as newcolors FROM colors

Result:

┌─newcolors──────────────────┐
│ ['white','blue','green']   │
└────────────────────────────┘

Query with column name and different seed:

SELECT groupArraySample(3, 987654321)(color) as newcolors FROM colors

Result:

┌─newcolors──────────────────┐
│ ['red','orange','green']   │
└────────────────────────────┘

Query with expression as argument:

SELECT groupArraySample(3)(concat('light-', color)) as newcolors FROM colors

Result:

┌─newcolors───────────────────────────────────┐
│ ['light-blue','light-orange','light-green'] │
└─────────────────────────────────────────────┘

groupArraySorted

Returns an array with the first N items in ascending order.

groupArraySorted(N)(column)

Arguments

  • N – The number of elements to return.

  • column – The value (Integer, String, Float and other Generic types).

Example

Gets the first 10 numbers:

SELECT groupArraySorted(10)(number) FROM numbers(100)
┌─groupArraySorted(10)(number)─┐
│ [0,1,2,3,4,5,6,7,8,9]        │
└──────────────────────────────┘

Gets the String representations of all numbers in the column:

SELECT groupArraySorted(5)(str) FROM (SELECT toString(number) as str FROM numbers(5))
┌─groupArraySorted(5)(str)─┐
│ ['0','1','2','3','4']    │
└──────────────────────────┘

groupBitAnd

Applies bit-wise AND for series of numbers.

groupBitAnd(expr)

Arguments

expr – An expression that results in UInt* or Int* type.

Return value

Value of the UInt* or Int* type.

Example

Test data:

binary     decimal
00101100 = 44
00011100 = 28
00001101 = 13
01010101 = 85

Query:

SELECT groupBitAnd(num) FROM t

Where num is the column with the test data.

Result:

binary     decimal
00000100 = 4

groupBitmap

Performs bitmap or aggregate calculations on an unsigned integer column and returns the cardinality as a UInt64 value. With the -State suffix, it returns a bitmap object.

groupBitmap(expr)

Arguments

expr – An expression that results in UInt* type.

Return value

Value of the UInt64 type.

Example

Test data:

UserID
1
1
2
3

Query:

SELECT groupBitmap(UserID) as num FROM t

Result:

num
3

groupBitmapAnd

Calculates the AND of a bitmap column and returns the cardinality as a UInt64 value. With the -State suffix, it returns a bitmap object.

groupBitmapAnd(expr)

Arguments

expr – An expression that results in AggregateFunction(groupBitmap, UInt*) type.

Return value

Value of the UInt64 type.
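
Example

groupBitmapAnd consumes bitmap states rather than raw integers. A minimal sketch (the tag and uid columns fed through the values table function are hypothetical) builds two bitmap states with groupBitmapState and intersects them:

```sql
SELECT groupBitmapAnd(bm) AS common_ids
FROM
(
    SELECT groupBitmapState(uid) AS bm
    FROM values('tag String, uid UInt32',
        ('a', 1), ('a', 2), ('a', 3),
        ('b', 2), ('b', 3), ('b', 4))
    GROUP BY tag
)
```

The two bitmaps contain {1, 2, 3} and {2, 3, 4}, so the cardinality of the intersection should be 2.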

groupBitmapOr

Calculates the OR of a bitmap column and returns the cardinality as a UInt64 value. With the -State suffix, it returns a bitmap object. This is equivalent to groupBitmapMerge.

groupBitmapOr(expr)

Arguments

expr – An expression that results in AggregateFunction(groupBitmap, UInt*) type.

Returned value

Value of the UInt64 type.

groupBitmapXor

Calculates the XOR of a bitmap column and returns the cardinality as a UInt64 value. With the -State suffix, it returns a bitmap object.

groupBitmapXor(expr)

Arguments

expr – An expression that results in AggregateFunction(groupBitmap, UInt*) type.

Returned value

Value of the UInt64 type.

groupBitOr

Applies bit-wise OR for series of numbers.

groupBitOr(expr)

Arguments

expr – An expression that results in UInt* or Int* type.

Returned value

Value of the UInt* or Int* type.

Example

Test data:

binary     decimal
00101100 = 44
00011100 = 28
00001101 = 13
01010101 = 85

Query:

SELECT groupBitOr(num) FROM t

Where num is the column with the test data.

Result:

binary     decimal
01111101 = 125

groupBitXor

Applies bit-wise XOR for series of numbers.

groupBitXor(expr)

Arguments

expr – An expression that results in UInt* or Int* type.

Return value

Value of the UInt* or Int* type.

Example

Test data:

binary     decimal
00101100 = 44
00011100 = 28
00001101 = 13
01010101 = 85

Query:

SELECT groupBitXor(num) FROM t

Where num is the column with the test data.

Result:

binary     decimal
01101000 = 104

groupConcat

Calculates a concatenated string from a group of strings, optionally separated by a delimiter, and optionally limited by a maximum number of elements.

Syntax

groupConcat[(delimiter [, limit])](expression)

Arguments

  • expression: The expression or column name that outputs strings to be concatenated.
  • delimiter: A string that will be used to separate concatenated values. This parameter is optional and defaults to an empty string if not specified.
  • limit: A positive integer specifying the maximum number of elements to concatenate. If more elements are present, excess elements are ignored. This parameter is optional.

If delimiter is specified without limit, it must be the first parameter. If both delimiter and limit are specified, delimiter must precede limit.

Returned value

  • Returns a string consisting of the concatenated values of the column or expression. If the group has no elements, or contains only NULL elements, and the function does not specify handling for NULL-only values, the result is a nullable string with a NULL value.

Examples

Input table:

┌─id─┬─name─┐
│ 1  │  John│
│ 2  │  Jane│
│ 3  │   Bob│
└────┴──────┘
  1. Basic usage without a delimiter:

Query:

SELECT groupConcat(Name) FROM Employees

Result:

JohnJaneBob

This concatenates all names into one continuous string without any separator.

  2. Using comma as a delimiter:

Query:

SELECT groupConcat(', ')(Name)  FROM Employees

Result:

John, Jane, Bob

This output shows the names separated by a comma followed by a space.

  3. Limiting the number of concatenated elements

Query:

SELECT groupConcat(', ', 2)(Name) FROM Employees

Result:

John, Jane

This query limits the output to the first two names, even though there are more names in the table.

groupUniqArray

Syntax: groupUniqArray(x) or groupUniqArray(max_size)(x)

Creates an array from different argument values. Memory consumption is the same as for the uniqExact function.

The second version (with the max_size parameter) limits the size of the resulting array to max_size elements. For example, groupUniqArray(1)(x) is equivalent to [any(x)].
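
Example

A minimal sketch using the numbers table function; note that the order of elements in the resulting array is not guaranteed:

```sql
SELECT
    groupUniqArray(number % 3) AS all_distinct,
    groupUniqArray(2)(number % 3) AS at_most_two
FROM numbers(10)
```

all_distinct contains the three distinct values 0, 1 and 2 in some order; at_most_two keeps at most two of them.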

intervalLengthSum

Calculates the total length of union of all ranges (segments on numeric axis).

Syntax

intervalLengthSum(start, end)

Arguments

  • start: The starting value of the interval. Int32, Int64, UInt32, UInt64, Float32, Float64, DateTime or Date.
  • end: The ending value of the interval. Int32, Int64, UInt32, UInt64, Float32, Float64, DateTime or Date.

Arguments must be of the same data type. Otherwise, an exception will be thrown.

Returned value

  • Total length of union of all ranges (segments on numeric axis). Depending on the type of the argument, the return value may be UInt64 or Float64 type.

Examples

  1. Input table:
┌─id─┬─start─┬─end─┐
│ a  │   1.1 │ 2.9 │
│ a  │   2.5 │ 3.2 │
│ a  │     4 │   5 │
└────┴───────┴─────┘

In this example, the arguments of the Float32 type are used. The function returns a value of the Float64 type.

The result is the sum of the lengths of the intervals [1.1, 3.2] (the union of [1.1, 2.9] and [2.5, 3.2]) and [4, 5].

Query:

SELECT id, intervalLengthSum(start, end), toTypeName(intervalLengthSum(start, end)) FROM fl_interval GROUP BY id ORDER BY id

Result:

┌─id─┬─intervalLengthSum(start, end)─┬─toTypeName(intervalLengthSum(start, end))─┐
│ a  │                           3.1 │ Float64                                   │
└────┴───────────────────────────────┴───────────────────────────────────────────┘
  2. Input table:
┌─id─┬───────────────start─┬─────────────────end─┐
│ a  │ 2020-01-01 01:12:30 │ 2020-01-01 02:10:10 │
│ a  │ 2020-01-01 02:05:30 │ 2020-01-01 02:50:31 │
│ a  │ 2020-01-01 03:11:22 │ 2020-01-01 03:23:31 │
└────┴─────────────────────┴─────────────────────┘

In this example, the arguments of the DateTime type are used. The function returns a value in seconds.

Query:

SELECT id, intervalLengthSum(start, end), toTypeName(intervalLengthSum(start, end)) FROM dt_interval GROUP BY id ORDER BY id

Result:

┌─id─┬─intervalLengthSum(start, end)─┬─toTypeName(intervalLengthSum(start, end))─┐
│ a  │                          6610 │ UInt64                                    │
└────┴───────────────────────────────┴───────────────────────────────────────────┘
  3. Input table:
┌─id─┬──────start─┬────────end─┐
│ a  │ 2020-01-01 │ 2020-01-04 │
│ a  │ 2020-01-12 │ 2020-01-18 │
└────┴────────────┴────────────┘

In this example, the arguments of the Date type are used. The function returns a value in days.

Query:

SELECT id, intervalLengthSum(start, end), toTypeName(intervalLengthSum(start, end)) FROM date_interval GROUP BY id ORDER BY id

Result:

┌─id─┬─intervalLengthSum(start, end)─┬─toTypeName(intervalLengthSum(start, end))─┐
│ a  │                             9 │ UInt64                                    │
└────┴───────────────────────────────┴───────────────────────────────────────────┘

kolmogorovSmirnovTest

Applies Kolmogorov-Smirnov's test to samples from two populations.

Syntax

kolmogorovSmirnovTest([alternative, computation_method])(sample_data, sample_index)

Values of both samples are in the sample_data column. If sample_index equals 0, then the value in that row belongs to the sample from the first population; otherwise, it belongs to the sample from the second population. Samples must belong to continuous, one-dimensional probability distributions.

Arguments

  • sample_data: Sample data. Integer, Float or Decimal.
  • sample_index: Sample index. Integer.

Parameters

  • alternative: alternative hypothesis. (Optional, default: 'two-sided'.) String. Let F(x) and G(x) be the CDFs of the first and second distributions respectively.
    • 'two-sided': The null hypothesis is that the samples come from the same distribution, i.e. F(x) = G(x) for all x. The alternative is that the distributions are not identical.
    • 'greater': The null hypothesis is that values in the first sample are stochastically smaller than those in the second one, i.e. the CDF of the first distribution lies above, and hence to the left of, that of the second one. This means that F(x) >= G(x) for all x, and the alternative is that F(x) < G(x) for at least one x.
    • 'less': The null hypothesis is that values in the first sample are stochastically greater than those in the second one, i.e. the CDF of the first distribution lies below, and hence to the right of, that of the second one. This means that F(x) <= G(x) for all x, and the alternative is that F(x) > G(x) for at least one x.
  • computation_method: the method used to compute p-value. (Optional, default: 'auto'.) String.
    • 'exact' - calculation is performed using precise probability distribution of the test statistics. Compute intensive and wasteful except for small samples.
    • 'asymp' ('asymptotic') - calculation is performed using an approximation. For large sample sizes, the exact and asymptotic p-values are very similar.
    • 'auto' - the 'exact' method is used when the maximum number of samples is less than 10,000.

Returned values

Tuple with two elements:

  • calculated statistic. Float64.
  • calculated p-value. Float64.

Example

Query:

SELECT kolmogorovSmirnovTest('less', 'exact')(value, num)
FROM
(
    SELECT
        randNormal(0, 10) AS value,
        0 AS num
    FROM numbers(10000)
    UNION ALL
    SELECT
        randNormal(0, 10) AS value,
        1 AS num
    FROM numbers(10000)
)

Result:

┌─kolmogorovSmirnovTest('less', 'exact')(value, num)─┐
│ (0.009899999999999996,0.37528595205132287)         │
└────────────────────────────────────────────────────┘

Note: P-value is bigger than 0.05 (for confidence level of 95%), so null hypothesis is not rejected.

Query:

SELECT kolmogorovSmirnovTest('two-sided', 'exact')(value, num)
FROM
(
    SELECT
        randStudentT(10) AS value,
        0 AS num
    FROM numbers(100)
    UNION ALL
    SELECT
        randNormal(0, 10) AS value,
        1 AS num
    FROM numbers(100)
)

Result:

┌─kolmogorovSmirnovTest('two-sided', 'exact')(value, num)─┐
│ (0.4100000000000002,6.61735760482795e-8)                │
└─────────────────────────────────────────────────────────┘

Note: P-value is less than 0.05 (for confidence level of 95%), so null hypothesis is rejected.

kurtPop

Computes the kurtosis of a sequence.

kurtPop(expr)

Arguments

expr: Expression returning a number.

Returned value

The kurtosis of the given distribution. Type: Float64.

Example

SELECT kurtPop(value) FROM series_with_value_column

kurtSamp

Computes the sample kurtosis of a sequence.

It represents an unbiased estimate of the kurtosis of a random variable if passed values form its sample.

kurtSamp(expr)

Arguments

expr: Expression returning a number.

Returned value

The kurtosis of the given distribution. Type: Float64. If n <= 1 (n is a size of the sample), then the function returns nan.

Example

SELECT kurtSamp(value) FROM series_with_value_column

largestTriangleThreeBuckets

Applies the Largest-Triangle-Three-Buckets algorithm to the input data. The algorithm is used for downsampling time series data for visualization. It works by dividing the series, sorted by x coordinate, into buckets and then finding the largest triangle in each bucket. The number of buckets is equal to the number of points in the resulting series. The function sorts the data by x and then applies the downsampling algorithm to the sorted data.

Syntax

largestTriangleThreeBuckets(n)(x, y)

Alias: lttb.

Arguments

  • x: x coordinate. Integer, Float, Decimal, Date, Date32, DateTime, DateTime64.
  • y: y coordinate. Integer, Float, Decimal, Date, Date32, DateTime, DateTime64.

NaNs are ignored in the provided series, meaning that any NaN values will be excluded from the analysis. This ensures that the function operates only on valid numerical data.

Parameters

  • n: number of points in the resulting series. UInt64.

Returned values

Array of Tuple with two elements: the x and y coordinates of the selected points.

Example

Input table:

┌─────x───────┬───────y──────┐
│ 1.000000000 │ 10.000000000 │
│ 2.000000000 │ 20.000000000 │
│ 3.000000000 │ 15.000000000 │
│ 8.000000000 │ 60.000000000 │
│ 9.000000000 │ 55.000000000 │
│ 10.00000000 │ 70.000000000 │
│ 4.000000000 │ 30.000000000 │
│ 5.000000000 │ 40.000000000 │
│ 6.000000000 │ 35.000000000 │
│ 7.000000000 │ 50.000000000 │
└─────────────┴──────────────┘

Query:

SELECT largestTriangleThreeBuckets(4)(x, y) FROM largestTriangleThreeBuckets_test

Result:

┌────────largestTriangleThreeBuckets(4)(x, y)───────────┐
│           [(1,10),(3,15),(9,55),(10,70)]              │
└───────────────────────────────────────────────────────┘

last_value

Selects the last encountered value, similar to anyLast, but can accept NULL. It should mostly be used with window functions; without them, the result is random if the source stream is not ordered.
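
Example

A minimal sketch of the intended usage with a window function (data comes from the numbers table function); the explicit full frame makes every row see the last value of the ordered window:

```sql
SELECT
    number,
    last_value(number) OVER (
        ORDER BY number
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS last
FROM numbers(3)
```

Each row should report last = 2, the final value in the ordered frame.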

mannWhitneyUTest

Applies the Mann-Whitney rank test to samples from two populations.

Syntax

mannWhitneyUTest[(alternative[, continuity_correction])](sample_data, sample_index)

Values of both samples are in the sample_data column. If sample_index equals 0, then the value in that row belongs to the sample from the first population; otherwise, it belongs to the sample from the second population. The null hypothesis is that the two populations are stochastically equal. One-sided hypotheses can also be tested. This test does not assume that the data have a normal distribution.

Arguments

  • sample_data: sample data. Integer, Float or Decimal.
  • sample_index: sample index. Integer.

Parameters

  • alternative: alternative hypothesis. (Optional, default: 'two-sided'.) String.
    • 'two-sided', 'greater', or 'less'.
  • continuity_correction: if not 0 then continuity correction in the normal approximation for the p-value is applied. (Optional, default: 1.) UInt64.

Returned values

Tuple with two elements:

  • calculated U-statistic. Float64.
  • calculated p-value. Float64.

Example

Input table:

┌─sample_data─┬─sample_index─┐
│          10 │            0 │
│          11 │            0 │
│          12 │            0 │
│           1 │            1 │
│           2 │            1 │
│           3 │            1 │
└─────────────┴──────────────┘

Query:

SELECT mannWhitneyUTest('greater')(sample_data, sample_index) FROM mww_ttest

Result:

┌─mannWhitneyUTest('greater')(sample_data, sample_index)─┐
│ (9,0.04042779918503192)                                │
└────────────────────────────────────────────────────────┘

max

Aggregate function that calculates the maximum across a group of values.

Syntax:

SELECT max(salary) FROM employees
SELECT department, max(salary) FROM employees GROUP BY department

If you need a non-aggregate function to choose the maximum of two values, see greatest:

SELECT greatest(a, b) FROM table

maxIntersections

Aggregate function that calculates the maximum number of times that a group of intervals intersects each other (if all the intervals intersect at least once).

The syntax is:

maxIntersections(start_column, end_column)

Arguments

  • start_column – the numeric column that represents the start of each interval. If start_column is NULL or 0 then the interval will be skipped.

  • end_column - the numeric column that represents the end of each interval. If end_column is NULL or 0 then the interval will be skipped.

Returned value

Returns the maximum number of intersected intervals.
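
Example

A minimal sketch with inline intervals supplied through the values table function (hypothetical data):

```sql
SELECT maxIntersections(start, end)
FROM values('start UInt32, end UInt32', (1, 6), (2, 5), (3, 7), (8, 9))
```

The intervals [1, 6], [2, 5] and [3, 7] all overlap between 3 and 5, while [8, 9] stands alone, so the expected result is 3.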

maxIntersectionsPosition

Aggregate function that calculates the positions of the occurrences of the maxIntersections function.

The syntax is:

maxIntersectionsPosition(start_column, end_column)

Arguments

  • start_column – the numeric column that represents the start of each interval. If start_column is NULL or 0 then the interval will be skipped.

  • end_column - the numeric column that represents the end of each interval. If end_column is NULL or 0 then the interval will be skipped.

Returned value

Returns the start positions of the maximum number of intersected intervals.

maxMap

Calculates the maximum from value array according to the keys specified in the key array.

Syntax

maxMap(key, value)

or

maxMap(Tuple(key, value))

Alias: maxMappedArrays

  • Passing a tuple of keys and value arrays is identical to passing two arrays of keys and values.
  • The number of elements in key and value must be the same for each row that is totaled.

Parameters

  • key: Array of keys. Array.
  • value: Array of values. Array.

Returned value

  • Returns a tuple of two arrays: keys in sorted order, and values calculated for the corresponding keys. Tuple(Array, Array).

Example

Query:

SELECT maxMap(a, b)
FROM values('a Array(Char), b Array(Int64)', (['x', 'y'], [2, 2]), (['y', 'z'], [3, 1]))

Result:

┌─maxMap(a, b)───────────┐
│ [['x','y','z'],[2,3,1]]│
└────────────────────────┘

meanZTest

Applies mean z-test to samples from two populations.

Syntax

meanZTest(population_variance_x, population_variance_y, confidence_level)(sample_data, sample_index)

Values of both samples are in the sample_data column. If sample_index equals 0, then the value in that row belongs to the sample from the first population; otherwise, it belongs to the sample from the second population. The null hypothesis is that the means of the populations are equal. A normal distribution is assumed. The populations may have unequal variance, and the variances are known.

Arguments

  • sample_data: Sample data. Integer, Float or Decimal.
  • sample_index: Sample index. Integer.

Parameters

  • population_variance_x: Variance for population x. Float.
  • population_variance_y: Variance for population y. Float.
  • confidence_level: Confidence level in order to calculate confidence intervals. Float.

Returned values

Tuple with four elements:

  • calculated t-statistic. Float64.
  • calculated p-value. Float64.
  • calculated confidence-interval-low. Float64.
  • calculated confidence-interval-high. Float64.

Example

Input table:

┌─sample_data─┬─sample_index─┐
│        20.3 │            0 │
│        21.9 │            0 │
│        22.1 │            0 │
│        18.9 │            1 │
│          19 │            1 │
│        20.3 │            1 │
└─────────────┴──────────────┘

Query:

SELECT meanZTest(0.7, 0.45, 0.95)(sample_data, sample_index) FROM mean_ztest

Result:

┌─meanZTest(0.7, 0.45, 0.95)(sample_data, sample_index)────────────────────────────┐
│ (3.2841296025548123,0.0010229786769086013,0.8198428246768334,3.2468238419898365) │
└──────────────────────────────────────────────────────────────────────────────────┘

median

The median* functions are aliases for the corresponding quantile* functions. They calculate the median of a numeric data sample.

Functions:

  • median: Alias for quantile.
  • medianDeterministic: Alias for quantileDeterministic.
  • medianExact: Alias for quantileExact.
  • medianExactWeighted: Alias for quantileExactWeighted.
  • medianTiming: Alias for quantileTiming.
  • medianTimingWeighted: Alias for quantileTimingWeighted.
  • medianTDigest: Alias for quantileTDigest.
  • medianTDigestWeighted: Alias for quantileTDigestWeighted.
  • medianBFloat16: Alias for quantileBFloat16.
  • medianDD: Alias for quantileDD.

Example

Input table:

┌─val─┐
│   1 │
│   1 │
│   2 │
│   3 │
└─────┘

Query:

SELECT medianDeterministic(val, 1) FROM t

Result:

┌─medianDeterministic(val, 1)─┐
│                         1.5 │
└─────────────────────────────┘

min

Aggregate function that calculates the minimum across a group of values.

Syntax:

SELECT min(salary) FROM employees
SELECT department, min(salary) FROM employees GROUP BY department

If you need a non-aggregate function to choose the minimum of two values, see least:

SELECT least(a, b) FROM table

minMap

Calculates the minimum from value array according to the keys specified in the key array.

Syntax

minMap(key, value)

or

minMap(Tuple(key, value))

Alias: minMappedArrays

  • Passing a tuple of keys and value arrays is identical to passing an array of keys and an array of values.
  • The number of elements in key and value must be the same for each row that is totaled.

Parameters

  • key: Array of keys. Array.
  • value: Array of values. Array.

Returned value

  • Returns a tuple of two arrays: keys in sorted order, and values calculated for the corresponding keys. Tuple(Array, Array).

Example

Query:

SELECT minMap(a, b)
FROM values('a Array(Int32), b Array(Int64)', ([1, 2], [2, 2]), ([2, 3], [1, 1]))

Result:

┌─minMap(a, b)──────┐
│ ([1,2,3],[2,1,1]) │
└───────────────────┘

quantile

Computes an approximate quantile of a numeric data sequence.

This function applies reservoir sampling with a reservoir size up to 8192 and a random number generator for sampling. The result is non-deterministic. To get an exact quantile, use the quantileExact function.

When using multiple quantile* functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles function.

Note that for an empty numeric sequence, quantile will return NaN, but its quantile* variants will return either NaN or a default value for the sequence type, depending on the variant.

Syntax

quantile(level)(expr)

Alias: median.

Arguments

  • level: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates median.
  • expr: Expression over the column values resulting in numeric data types, Date or DateTime.

Returned value

  • Approximate quantile of the specified level.

Type:

  • Float64 for numeric data type input.
  • Date if input values have the Date type.
  • DateTime if input values have the DateTime type.

Example

Input table:

┌─val─┐
│   1 │
│   1 │
│   2 │
│   3 │
└─────┘

Query:

SELECT quantile(val) FROM t

Result:

┌─quantile(val)─┐
│           1.5 │
└───────────────┘

quantileBFloat16

Computes an approximate quantile of a sample consisting of bfloat16 numbers. bfloat16 is a floating-point data type with 1 sign bit, 8 exponent bits and 7 fraction bits. The function converts input values to 32-bit floats and takes the most significant 16 bits. Then it calculates bfloat16 quantile value and converts the result to a 64-bit float by appending zero bits. The function is a fast quantile estimator with a relative error no more than 0.390625%.

Syntax

quantileBFloat16[(level)](expr)

Alias: medianBFloat16

Arguments

  • expr: Column with numeric data. Integer, Float.

Parameters

  • level: Level of quantile. Optional. Possible values are in the range from 0 to 1. Default value: 0.5. Float.

Returned value

  • Approximate quantile of the specified level.

Type: Float64.

Example

The input table has an integer column and a float column:

┌─a─┬─────b─┐
│ 1 │ 1.001 │
│ 2 │ 1.002 │
│ 3 │ 1.003 │
│ 4 │ 1.004 │
└───┴───────┘

Query to calculate 0.75-quantile (third quartile):

SELECT quantileBFloat16(0.75)(a), quantileBFloat16(0.75)(b) FROM example_table

Result:

┌─quantileBFloat16(0.75)(a)─┬─quantileBFloat16(0.75)(b)─┐
│                         3 │                         1 │
└───────────────────────────┴───────────────────────────┘

Note that all floating point values in the example are truncated to 1.0 when converting to bfloat16.

quantileBFloat16Weighted

Like quantileBFloat16 but takes into account the weight of each sequence member.
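
A minimal sketch (hypothetical data via the values table function); each value is counted with its weight, so the value 3 with weight 3 dominates the upper part of the distribution:

```sql
SELECT quantileBFloat16Weighted(0.75)(a, w)
FROM values('a UInt32, w UInt32', (1, 1), (2, 1), (3, 3), (4, 1))
```

The weighted sequence is effectively 1, 2, 3, 3, 3, 4, so the 0.75-quantile should be close to 3.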

quantileDD

Computes an approximate quantile of a sample with relative-error guarantees. It works by building a DD sketch.

Syntax

quantileDD(relative_accuracy, [level])(expr)

Arguments

  • expr: Column with numeric data. Integer, Float.

Parameters

  • relative_accuracy: Relative accuracy of the quantile. Possible values are in the range from 0 to 1. Float. The size of the sketch depends on the range of the data and the relative accuracy: the larger the range and the smaller the relative accuracy, the larger the sketch. The rough memory size of the sketch is log(max_value/min_value)/relative_accuracy. The recommended value is 0.001 or higher.

  • level: Level of quantile. Optional. Possible values are in the range from 0 to 1. Default value: 0.5. Float.

Returned value

  • Approximate quantile of the specified level.

Type: Float64.

Example

The input table has an integer column and a float column:

┌─a─┬─────b─┐
│ 1 │ 1.001 │
│ 2 │ 1.002 │
│ 3 │ 1.003 │
│ 4 │ 1.004 │
└───┴───────┘

Query to calculate 0.75-quantile (third quartile):

SELECT quantileDD(0.01, 0.75)(a), quantileDD(0.01, 0.75)(b) FROM example_table

Result:

┌─quantileDD(0.01, 0.75)(a)─┬─quantileDD(0.01, 0.75)(b)─┐
│         2.974233423476717 │                      1.01 │
└───────────────────────────┴───────────────────────────┘

quantileDeterministic

Computes an approximate quantile of a numeric data sequence.

This function applies reservoir sampling with a reservoir size up to 8192 and deterministic algorithm of sampling. The result is deterministic. To get an exact quantile, use the quantileExact function.

When using multiple quantile* functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles function.

Syntax

quantileDeterministic(level)(expr, determinator)

Alias: medianDeterministic.

Arguments

  • level: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates median.
  • expr: Expression over the column values resulting in numeric data types, Date or DateTime.
  • determinator: Number whose hash is used instead of a random number generator in the reservoir sampling algorithm to make the result of sampling deterministic. As a determinator you can use any deterministic positive number, for example, a user id or an event id. If the same determinator value occurs too often, the function works incorrectly.

Returned value

  • Approximate quantile of the specified level.

Type:

  • Float64 for numeric data type input.
  • Date if input values have the Date type.
  • DateTime if input values have the DateTime type.

Example

Input table:

┌─val─┐
│   1 │
│   1 │
│   2 │
│   3 │
└─────┘

Query:

SELECT quantileDeterministic(val, 1) FROM t

Result:

┌─quantileDeterministic(val, 1)─┐
│                           1.5 │
└───────────────────────────────┘

quantileExact

Exactly computes the quantile of a numeric data sequence.

To get the exact value, all the passed values are combined into an array, which is then partially sorted. Therefore, the function consumes O(n) memory, where n is the number of values that were passed. However, for a small number of values, the function is very effective.

When using multiple quantile* functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles function.

Syntax

quantileExact(level)(expr)

Alias: medianExact.

Arguments

  • level: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates median.
  • expr: Expression over the column values resulting in numeric data types, Date or DateTime.

Returned value

  • Quantile of the specified level.

Type:

  • Float64 for numeric data type input.
  • Date if input values have the Date type.
  • DateTime if input values have the DateTime type.

Example

Query:

SELECT quantileExact(number) FROM numbers(10)

Result:

┌─quantileExact(number)─┐
│                     5 │
└───────────────────────┘

quantileExactLow

Similar to quantileExact, this computes the exact quantile of a numeric data sequence.

To get the exact value, all the passed values are combined into an array, which is then fully sorted. Sorting requires O(N·log(N)) comparisons, where N = std::distance(first, last).

The return value depends on the quantile level and the number of elements in the selection: if the level is 0.5, then the function returns the lower median value for an even number of elements and the middle median value for an odd number of elements. The median is calculated similarly to the median_low implementation used in Python.

For all other levels, the element at the index corresponding to the value of level * size_of_array is returned. For example:

SELECT quantileExactLow(0.1)(number) FROM numbers(10)

┌─quantileExactLow(0.1)(number)─┐
│                             1 │
└───────────────────────────────┘

When using multiple quantile* functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles function.

Syntax

quantileExactLow(level)(expr)

Alias: medianExactLow.

Arguments

  • level: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates median.
  • expr: Expression over the column values resulting in numeric data types, Date or DateTime.

Returned value

  • Quantile of the specified level.

Type:

  • Float64 for numeric data type input.
  • Date if input values have the Date type.
  • DateTime if input values have the DateTime type.

Example

Query:

SELECT quantileExactLow(number) FROM numbers(10)

Result:

┌─quantileExactLow(number)─┐
│                        4 │
└──────────────────────────┘

quantileExactHigh

Similar to quantileExact, this computes the exact quantile of a numeric data sequence.

All the passed values are combined into an array, which is then fully sorted, to get the exact value. Sorting requires O(N·log(N)) comparisons, where N = std::distance(first, last).

The return value depends on the quantile level and the number of elements in the selection: if the level is 0.5, then the function returns the higher median value for an even number of elements and the middle median value for an odd number of elements. The median is calculated similarly to the median_high implementation used in Python. For all other levels, the element at the index corresponding to the value of level * size_of_array is returned.

This implementation behaves exactly like the current quantileExact implementation.

When using multiple quantile* functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles function.

Syntax

quantileExactHigh(level)(expr)

Alias: medianExactHigh.

Arguments

  • level: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates median.
  • expr: Expression over the column values resulting in numeric data types, Date or DateTime.

Returned value

  • Quantile of the specified level.

Type:

  • Float64 for numeric data type input.
  • Date if input values have the Date type.
  • DateTime if input values have the DateTime type.

Example

Query:

SELECT quantileExactHigh(number) FROM numbers(10)

Result:

┌─quantileExactHigh(number)─┐
│                         5 │
└───────────────────────────┘
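The higher-median rule mirrors Python's median_high, which makes the semantics easy to sketch (names here are illustrative, not the real implementation):

```python
from statistics import median_high

def quantile_exact_high(values, level=0.5):
    """Sketch of quantileExactHigh: fully sort, return the higher
    median at level 0.5, or the element at index level * n."""
    data = sorted(values)
    if level == 0.5:
        return median_high(data)
    return data[min(int(level * len(data)), len(data) - 1)]
```

For numbers(10) this yields 5, matching the example above.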

quantileExactExclusive

Exactly computes the quantile of a numeric data sequence.

To get the exact value, all the passed values are combined into an array, which is then partially sorted. Therefore, the function consumes O(n) memory, where n is the number of values passed. However, for a small number of values, the function is very effective.

This function is equivalent to PERCENTILE.EXC Excel function, (type R6).

When using multiple quantileExactExclusive functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantilesExactExclusive function.

Syntax

quantileExactExclusive(level)(expr)

Arguments

  • expr: Expression over the column values resulting in numeric data types, Date or DateTime.

Parameters

  • level: Level of quantile. Optional. Possible values: (0, 1): bounds not included. Default value: 0.5. At level=0.5 the function calculates median. Float.

Returned value

  • Quantile of the specified level.

Type:

  • Float64 for numeric data type input.
  • Date if input values have the Date type.
  • DateTime if input values have the DateTime type.
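The PERCENTILE.EXC / R6 convention positions the quantile at h = level · (n + 1) in the 1-based sorted sample and interpolates linearly. A Python sketch of that convention (an interpretation of the Excel-equivalence claim, not ClickHouse's code):

```python
def quantile_exact_exclusive(values, level=0.5):
    """R6 / PERCENTILE.EXC: 1-based position h = level * (n + 1),
    linearly interpolated between the neighboring sorted values."""
    data = sorted(values)
    n = len(data)
    h = level * (n + 1)
    if h < 1:
        return data[0]
    if h >= n:
        return data[-1]
    k = int(h)           # floor of the 1-based position
    frac = h - k
    return data[k - 1] + frac * (data[k] - data[k - 1])
```

For the sample 1..10 this gives 5.5 at level 0.5 and 2.75 at level 0.25, matching Excel's PERCENTILE.EXC.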

quantileExactInclusive

Exactly computes the quantile of a numeric data sequence.

To get the exact value, all the passed values are combined into an array, which is then partially sorted. Therefore, the function consumes O(n) memory, where n is the number of values passed. However, for a small number of values, the function is very effective.

This function is equivalent to PERCENTILE.INC Excel function, (type R7).

When using multiple quantileExactInclusive functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantilesExactInclusive function.

Syntax

quantileExactInclusive(level)(expr)

Arguments

  • expr: Expression over the column values resulting in numeric data types, Date or DateTime.

Parameters

  • level: Level of quantile. Optional. Possible values: [0, 1]: bounds included. Default value: 0.5. At level=0.5 the function calculates median. Float.

Returned value

  • Quantile of the specified level.

Type:

  • Float64 for numeric data type input.
  • Date if input values have the Date type.
  • DateTime if input values have the DateTime type.
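The PERCENTILE.INC / R7 convention differs from R6 only in the position formula: h = level · (n − 1) in the 0-based sorted sample. A Python sketch of that convention (again an interpretation of the Excel equivalence, not the actual implementation):

```python
def quantile_exact_inclusive(values, level=0.5):
    """R7 / PERCENTILE.INC: 0-based position h = level * (n - 1),
    linearly interpolated between the neighboring sorted values."""
    data = sorted(values)
    h = level * (len(data) - 1)
    k = int(h)           # floor of the 0-based position
    frac = h - k
    if k + 1 < len(data):
        return data[k] + frac * (data[k + 1] - data[k])
    return data[k]
```

For the sample 1..10 this gives 5.5 at level 0.5 and 3.25 at level 0.25, matching Excel's PERCENTILE.INC.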

quantileExactWeighted

Exactly computes the quantile of a numeric data sequence, taking into account the weight of each element.

To get the exact value, all the passed values are combined into an array, which is then partially sorted. Each value is counted with its weight, as if it were present weight times. A hash table is used in the algorithm, so if the passed values are frequently repeated, the function consumes less RAM than quantileExact. You can use this function instead of quantileExact by specifying a weight of 1.

When using multiple quantile* functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles function.

Syntax

quantileExactWeighted(level)(expr, weight)

Alias: medianExactWeighted.

Arguments

  • level: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates median.
  • expr: Expression over the column values resulting in numeric data types, Date or DateTime.
  • weight: Column with weights of sequence members. Weight is a number of value occurrences with Unsigned integer types.

Returned value

  • Quantile of the specified level.

Type:

  • Float64 for numeric data type input.
  • Date if input values have the Date type.
  • DateTime if input values have the DateTime type.

Example

Input table:

┌─n─┬─val─┐
│ 0 │   3 │
│ 1 │   2 │
│ 2 │   1 │
│ 5 │   4 │
└───┴─────┘

Query:

SELECT quantileExactWeighted(n, val) FROM t

Result:

┌─quantileExactWeighted(n, val)─┐
│                             1 │
└───────────────────────────────┘
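The example can be reproduced with a short Python sketch of the weighted rule: each value counts weight times, and the quantile is the first value whose cumulative weight reaches level · total_weight (the tie rule here is inferred from the example output):

```python
def quantile_exact_weighted(pairs, level=0.5):
    """Sketch of quantileExactWeighted over (value, weight) pairs:
    find the first value whose cumulative weight reaches the target."""
    pairs = sorted(pairs)
    total = sum(w for _, w in pairs)
    threshold = level * total
    acc = 0
    for value, w in pairs:
        acc += w
        if acc >= threshold:
            return value
    return pairs[-1][0]
```

For the input table above — pairs (0,3), (1,2), (2,1), (5,4) — this returns 1, matching the documented result.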

quantileExactWeightedInterpolated

Computes quantile of a numeric data sequence using linear interpolation, taking into account the weight of each element.

To get the interpolated value, all the passed values are combined into an array, which is then sorted together with the corresponding weights. Quantile interpolation is then performed using the weighted percentile method: a cumulative distribution is built from the weights, and a linear interpolation over the weights and values computes the quantiles.

When using multiple quantile* functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles function.

We strongly recommend using quantileExactWeightedInterpolated instead of quantileInterpolatedWeighted because quantileExactWeightedInterpolated is more accurate than quantileInterpolatedWeighted. Here is an example:

SELECT
    quantileExactWeightedInterpolated(0.99)(number, 1),
    quantile(0.99)(number),
    quantileInterpolatedWeighted(0.99)(number, 1)
FROM numbers(9)


┌─quantileExactWeightedInterpolated(0.99)(number, 1)─┬─quantile(0.99)(number)─┬─quantileInterpolatedWeighted(0.99)(number, 1)─┐
│                                               7.92 │                   7.92 │                                             8 │
└────────────────────────────────────────────────────┴────────────────────────┴───────────────────────────────────────────────┘

Syntax

quantileExactWeightedInterpolated(level)(expr, weight)

Alias: medianExactWeightedInterpolated.

Arguments

  • level: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates median.
  • expr: Expression over the column values resulting in numeric data types, Date or DateTime.
  • weight: Column with weights of sequence members. Weight is a number of value occurrences with Unsigned integer types.

Returned value

  • Quantile of the specified level.

Type:

  • Float64 for numeric data type input.
  • Date if input values have the Date type.
  • DateTime if input values have the DateTime type.

Example

Input table:

┌─n─┬─val─┐
│ 0 │   3 │
│ 1 │   2 │
│ 2 │   1 │
│ 5 │   4 │
└───┴─────┘

Query:

SELECT quantileExactWeightedInterpolated(n, val) FROM t

Result:

┌─quantileExactWeightedInterpolated(n, val)─┐
│                                       1.5 │
└───────────────────────────────────────────┘
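The documented result (1.5 for the table above) is consistent with linear interpolation at position level · (Σweights − 1) over the multiset in which each value is repeated weight times. The sketch below is that interpretation in Python; the real implementation builds a cumulative distribution rather than materializing the expanded array:

```python
def quantile_exact_weighted_interpolated(pairs, level=0.5):
    """Sketch: R7-style linear interpolation over the multiset where
    each value appears weight times (illustrative interpretation)."""
    expanded = sorted(v for v, w in pairs for _ in range(w))
    h = level * (len(expanded) - 1)
    k = int(h)
    frac = h - k
    if k + 1 < len(expanded):
        return expanded[k] + frac * (expanded[k + 1] - expanded[k])
    return expanded[k]
```

For pairs (0,3), (1,2), (2,1), (5,4) this yields 1.5, matching the example.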

quantileGK

Computes the quantile of a numeric data sequence using the Greenwald-Khanna algorithm. The Greenwald-Khanna algorithm is an algorithm used to compute quantiles on a stream of data in a highly efficient manner. It was introduced by Michael Greenwald and Sanjeev Khanna in 2001. It is widely used in databases and big data systems where computing accurate quantiles on a large stream of data in real-time is necessary. The algorithm is highly efficient, taking only O(log n) space and O(log log n) time per item (where n is the size of the input). It is also highly accurate, providing an approximate quantile value with high probability.

quantileGK differs from other quantile functions in that it enables the user to control the accuracy of the approximate quantile result.

Syntax

quantileGK(accuracy, level)(expr)

Alias: medianGK.

Arguments

  • accuracy: Accuracy of quantile. Constant positive integer. Larger accuracy value means less error. For example, if the accuracy argument is set to 100, the computed quantile will have an error no greater than 1% with high probability. There is a trade-off between the accuracy of the computed quantiles and the computational complexity of the algorithm. A larger accuracy requires more memory and computational resources to compute the quantile accurately, while a smaller accuracy argument allows for a faster and more memory-efficient computation but with a slightly lower accuracy.

  • level: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. Default value: 0.5. At level=0.5 the function calculates median.

  • expr: Expression over the column values resulting in numeric data types, Date or DateTime.

Returned value

  • Quantile of the specified level and accuracy.

Type:

  • Float64 for numeric data type input.
  • Date if input values have the Date type.
  • DateTime if input values have the DateTime type.

Example

SELECT quantileGK(1, 0.25)(number + 1)
FROM numbers(1000)

┌─quantileGK(1, 0.25)(plus(number, 1))─┐
│                                    1 │
└──────────────────────────────────────┘

SELECT quantileGK(10, 0.25)(number + 1)
FROM numbers(1000)

┌─quantileGK(10, 0.25)(plus(number, 1))─┐
│                                   156 │
└───────────────────────────────────────┘

SELECT quantileGK(100, 0.25)(number + 1)
FROM numbers(1000)

┌─quantileGK(100, 0.25)(plus(number, 1))─┐
│                                    251 │
└────────────────────────────────────────┘

SELECT quantileGK(1000, 0.25)(number + 1)
FROM numbers(1000)

┌─quantileGK(1000, 0.25)(plus(number, 1))─┐
│                                     249 │
└─────────────────────────────────────────┘

quantileInterpolatedWeighted

Computes quantile of a numeric data sequence using linear interpolation, taking into account the weight of each element.

To get the interpolated value, all the passed values are combined into an array, which is then sorted together with the corresponding weights. Quantile interpolation is then performed using the weighted percentile method: a cumulative distribution is built from the weights, and a linear interpolation over the weights and values computes the quantiles.

When using multiple quantile* functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles function.

Syntax

quantileInterpolatedWeighted(level)(expr, weight)

Alias: medianInterpolatedWeighted.

Arguments

  • level: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates median.
  • expr: Expression over the column values resulting in numeric data types, Date or DateTime.
  • weight: Column with weights of sequence members. Weight is a number of value occurrences.

Returned value

  • Quantile of the specified level.

Type:

  • Float64 for numeric data type input.
  • Date if input values have the Date type.
  • DateTime if input values have the DateTime type.

Example

Input table:

┌─n─┬─val─┐
│ 0 │   3 │
│ 1 │   2 │
│ 2 │   1 │
│ 5 │   4 │
└───┴─────┘

Query:

SELECT quantileInterpolatedWeighted(n, val) FROM t

Result:

┌─quantileInterpolatedWeighted(n, val)─┐
│                                    1 │
└──────────────────────────────────────┘

quantiles

Syntax: quantiles(level1, level2, ...)(x)

All the quantile functions also have corresponding quantiles functions: quantiles, quantilesDeterministic, quantilesTiming, quantilesTimingWeighted, quantilesExact, quantilesExactWeighted, quantilesExactWeightedInterpolated, quantilesInterpolatedWeighted, quantilesTDigest, quantilesBFloat16, quantilesDD. These functions calculate all the quantiles of the listed levels in one pass and return an array of the resulting values.
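The one-pass idea is simple to sketch: build the state (here, a sorted array) once and read every requested level off it, instead of re-aggregating per level. This Python sketch assumes the quantileExact index rule (element at index level · n); other variants share the state but differ in the read-off step:

```python
def quantiles(values, *levels):
    """One-pass sketch: sort once, then answer every level from the
    same sorted array."""
    data = sorted(values)
    n = len(data)
    return [data[min(int(level * n), n - 1)] for level in levels]
```

For numbers(10), quantiles(range(10), 0.1, 0.5, 0.9) gives [1, 5, 9] from a single sort.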

quantilesExactExclusive

Exactly computes the quantiles of a numeric data sequence.

To get the exact value, all the passed values are combined into an array, which is then partially sorted. Therefore, the function consumes O(n) memory, where n is the number of values passed. However, for a small number of values, the function is very effective.

This function is equivalent to PERCENTILE.EXC Excel function, (type R6).

Works more efficiently with sets of levels than quantileExactExclusive.

Syntax

quantilesExactExclusive(level1, level2, ...)(expr)

Arguments

  • expr: Expression over the column values resulting in numeric data types, Date or DateTime.

Parameters

  • level: Levels of quantiles. Possible values: (0, 1): bounds not included. Float.

Returned value

  • Array of quantiles of the specified levels.

Type of array values:

  • Float64 for numeric data type input.
  • Date if input values have the Date type.
  • DateTime if input values have the DateTime type.

quantilesExactInclusive

Exactly computes the quantiles of a numeric data sequence.

To get the exact value, all the passed values are combined into an array, which is then partially sorted. Therefore, the function consumes O(n) memory, where n is the number of values passed. However, for a small number of values, the function is very effective.

This function is equivalent to PERCENTILE.INC Excel function, (type R7).

Works more efficiently with sets of levels than quantileExactInclusive.

Syntax

quantilesExactInclusive(level1, level2, ...)(expr)

Arguments

  • expr: Expression over the column values resulting in numeric data types, Date or DateTime.

Parameters

  • level: Levels of quantiles. Possible values: [0, 1]: bounds included. Float.

Returned value

  • Array of quantiles of the specified levels.

Type of array values:

  • Float64 for numeric data type input.
  • Date if input values have the Date type.
  • DateTime if input values have the DateTime type.

quantilesGK

quantilesGK works similarly to quantileGK but allows calculating quantiles at several levels simultaneously and returns an array.

Syntax

quantilesGK(accuracy, level1, level2, ...)(expr)

Returned value

  • Array of quantiles of the specified levels.

Type of array values:

  • Float64 for numeric data type input.
  • Date if input values have the Date type.
  • DateTime if input values have the DateTime type.

Example

Query:

SELECT quantilesGK(1, 0.25, 0.5, 0.75)(number + 1)
FROM numbers(1000)

┌─quantilesGK(1, 0.25, 0.5, 0.75)(plus(number, 1))─┐
│ [1,1,1]                                          │
└──────────────────────────────────────────────────┘

SELECT quantilesGK(10, 0.25, 0.5, 0.75)(number + 1)
FROM numbers(1000)

┌─quantilesGK(10, 0.25, 0.5, 0.75)(plus(number, 1))─┐
│ [156,413,659]                                     │
└───────────────────────────────────────────────────┘


SELECT quantilesGK(100, 0.25, 0.5, 0.75)(number + 1)
FROM numbers(1000)

┌─quantilesGK(100, 0.25, 0.5, 0.75)(plus(number, 1))─┐
│ [251,498,741]                                      │
└────────────────────────────────────────────────────┘

SELECT quantilesGK(1000, 0.25, 0.5, 0.75)(number + 1)
FROM numbers(1000)

┌─quantilesGK(1000, 0.25, 0.5, 0.75)(plus(number, 1))─┐
│ [249,499,749]                                       │
└─────────────────────────────────────────────────────┘

quantileTDigest

Computes an approximate quantile of a numeric data sequence using the t-digest algorithm.

Memory consumption is log(n), where n is the number of values. The result depends on the order of running the query, and is nondeterministic.

The performance of the function is lower than that of quantile or quantileTiming. In terms of the ratio of state size to precision, this function is much better than quantile.

When using multiple quantile* functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles function.

Syntax

quantileTDigest(level)(expr)

Alias: medianTDigest.

Arguments

  • level: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates median.
  • expr: Expression over the column values resulting in numeric data types, Date or DateTime.

Returned value

  • Approximate quantile of the specified level.

Type:

  • Float64 for numeric data type input.
  • Date if input values have the Date type.
  • DateTime if input values have the DateTime type.

Example

Query:

SELECT quantileTDigest(number) FROM numbers(10)

Result:

┌─quantileTDigest(number)─┐
│                     4.5 │
└─────────────────────────┘

quantileTDigestWeighted

Computes an approximate quantile of a numeric data sequence using the t-digest algorithm. The function takes into account the weight of each sequence member. The maximum error is 1%. Memory consumption is log(n), where n is the number of values.

The performance of the function is lower than that of quantile or quantileTiming. In terms of the ratio of state size to precision, this function is much better than quantile.

The result depends on the order of running the query, and is nondeterministic.

When using multiple quantile* functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles function.

Syntax

quantileTDigestWeighted(level)(expr, weight)

Alias: medianTDigestWeighted.

Arguments

  • level: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates median.
  • expr: Expression over the column values resulting in numeric data types, Date or DateTime.
  • weight: Column with weights of sequence elements. Weight is a number of value occurrences.

Returned value

  • Approximate quantile of the specified level.

Type:

  • Float64 for numeric data type input.
  • Date if input values have the Date type.
  • DateTime if input values have the DateTime type.

Example

Query:

SELECT quantileTDigestWeighted(number, 1) FROM numbers(10)

Result:

┌─quantileTDigestWeighted(number, 1)─┐
│                                4.5 │
└────────────────────────────────────┘

quantileTiming

Computes the quantile of a numeric data sequence with determined precision.

The result is deterministic (it does not depend on the query processing order). The function is optimized for working with sequences which describe distributions like loading web pages times or backend response times.

When using multiple quantile* functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles function.

Syntax

quantileTiming(level)(expr)

Alias: medianTiming.

Arguments

  • level: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates median.

  • expr: Expression over a column values returning a Float* type number.

    • If negative values are passed to the function, the behavior is undefined.
    • If the value is greater than 30,000 (a page loading time of more than 30 seconds), it is assumed to be 30,000.

Accuracy

The calculation is accurate if:

  • Total number of values does not exceed 5670.
  • Total number of values exceeds 5670, but the page loading time is less than 1024ms.

Otherwise, the result of the calculation is rounded to the nearest multiple of 16 ms.

For calculating page loading time quantiles, this function is more effective and accurate than quantile.
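One way to read the accuracy rules above is as a bucketing scheme: values below 1024 ms are stored exactly, larger values are rounded to the nearest multiple of 16 ms, and everything above 30,000 ms is clamped. The sketch below is that interpretation only — the actual internal data structure is not documented here:

```python
def timing_bucket(ms):
    """Interpretation of quantileTiming's value handling: clamp at
    30,000 ms, keep values under 1024 ms exact, otherwise round to
    the nearest multiple of 16 ms."""
    ms = min(ms, 30_000)
    if ms < 1024:
        return ms
    return round(ms / 16) * 16
```

Under this reading, 500 ms stays exact, 2005 ms becomes 2000 ms, and 45,000 ms is clamped to 30,000 ms.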

Returned value

  • Quantile of the specified level.

Type: Float32.

If no values are passed to the function (when using quantileTimingIf), NaN is returned. The purpose of this is to differentiate these cases from cases that result in zero.

Example

Input table:

┌─response_time─┐
│            72 │
│           112 │
│           126 │
│           145 │
│           104 │
│           242 │
│           313 │
│           168 │
│           108 │
└───────────────┘

Query:

SELECT quantileTiming(response_time) FROM t

Result:

┌─quantileTiming(response_time)─┐
│                           126 │
└───────────────────────────────┘

quantileTimingWeighted

Computes the quantile of a numeric data sequence with determined precision, according to the weight of each sequence member.

The result is deterministic (it does not depend on the query processing order). The function is optimized for working with sequences which describe distributions like loading web pages times or backend response times.

When using multiple quantile* functions with different levels in a query, the internal states are not combined (that is, the query works less efficiently than it could). In this case, use the quantiles function.

Syntax

quantileTimingWeighted(level)(expr, weight)

Alias: medianTimingWeighted.

Arguments

  • level: Level of quantile. Optional parameter. Constant floating-point number from 0 to 1. We recommend using a level value in the range of [0.01, 0.99]. Default value: 0.5. At level=0.5 the function calculates median.

  • expr: Expression over a column values returning a Float* type number.

    • If negative values are passed to the function, the behavior is undefined.
    • If the value is greater than 30,000 (a page loading time of more than 30 seconds), it is assumed to be 30,000.

  • weight: Column with weights of sequence elements. Weight is a number of value occurrences.

Accuracy

The calculation is accurate if:

  • Total number of values does not exceed 5670.
  • Total number of values exceeds 5670, but the page loading time is less than 1024ms.

Otherwise, the result of the calculation is rounded to the nearest multiple of 16 ms.

For calculating page loading time quantiles, this function is more effective and accurate than quantile.

Returned value

  • Quantile of the specified level.

Type: Float32.

If no values are passed to the function (when using quantileTimingIf), NaN is returned. The purpose of this is to differentiate these cases from cases that result in zero.

Example

Input table:

┌─response_time─┬─weight─┐
│            68 │      1 │
│           104 │      2 │
│           112 │      3 │
│           126 │      2 │
│           138 │      1 │
│           162 │      1 │
└───────────────┴────────┘

Query:

SELECT quantileTimingWeighted(response_time, weight) FROM t

Result:

┌─quantileTimingWeighted(response_time, weight)─┐
│                                           112 │
└───────────────────────────────────────────────┘

quantilesTimingWeighted

Same as quantileTimingWeighted, but accepts multiple parameters with quantile levels and returns an Array filled with the quantiles at those levels.

Example

Input table:

┌─response_time─┬─weight─┐
│            68 │      1 │
│           104 │      2 │
│           112 │      3 │
│           126 │      2 │
│           138 │      1 │
│           162 │      1 │
└───────────────┴────────┘

Query:

SELECT quantilesTimingWeighted(0.5, 0.99)(response_time, weight) FROM t

Result:

┌─quantilesTimingWeighted(0.5, 0.99)(response_time, weight)─┐
│ [112,162]                                                 │
└───────────────────────────────────────────────────────────┘

rankCorr

Computes a rank correlation coefficient.

Syntax

rankCorr(x, y)

Arguments

  • x: Arbitrary value. Float32 or Float64.
  • y: Arbitrary value. Float32 or Float64.

Returned value(s)

  • Returns the rank correlation coefficient of the ranks of x and y. The value of the correlation coefficient ranges from -1 to +1. If fewer than two arguments are passed, the function raises an exception. A value close to +1 indicates a strong monotonic relationship: as one random variable increases, the second random variable also increases. A value close to -1 indicates a strong inverse relationship: as one random variable increases, the second random variable decreases. A value close or equal to 0 indicates no relationship between the two random variables.

Type: Float64.

Example

Query:

SELECT rankCorr(number, number) FROM numbers(100)

Result:

┌─rankCorr(number, number)─┐
│                        1 │
└──────────────────────────┘

Query:

SELECT roundBankers(rankCorr(exp(number), sin(number)), 3) FROM numbers(100)

Result:

┌─roundBankers(rankCorr(exp(number), sin(number)), 3)─┐
│                                              -0.037 │
└─────────────────────────────────────────────────────┘
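A rank correlation can be sketched as Pearson correlation computed over ranks instead of raw values. The Python sketch below follows Spearman's rho without tie handling (whether rankCorr adjusts for ties is not stated here, so this is an illustration, not the exact algorithm):

```python
def rank_corr(xs, ys):
    """Spearman-style rank correlation (no tie handling): rank both
    sequences, then take the Pearson correlation of the ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

As in the first example above, a perfectly monotone pair gives 1, and a reversed pair gives -1.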

simpleLinearRegression

Performs simple (unidimensional) linear regression.

simpleLinearRegression(x, y)

Parameters:

  • x: Column with explanatory variable values.
  • y: Column with dependent variable values.

Returned values:

Constants (k, b) of the resulting line y = k*x + b.

Examples

SELECT arrayReduce('simpleLinearRegression', [0, 1, 2, 3], [0, 1, 2, 3])
┌─arrayReduce('simpleLinearRegression', [0, 1, 2, 3], [0, 1, 2, 3])─┐
│ (1,0)                                                             │
└───────────────────────────────────────────────────────────────────┘
SELECT arrayReduce('simpleLinearRegression', [0, 1, 2, 3], [3, 4, 5, 6])
┌─arrayReduce('simpleLinearRegression', [0, 1, 2, 3], [3, 4, 5, 6])─┐
│ (1,3)                                                             │
└───────────────────────────────────────────────────────────────────┘
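Simple linear regression has a closed form: k is the covariance of x and y over the variance of x, and b = mean(y) − k·mean(x). A Python sketch of that formula reproduces both examples above:

```python
def simple_linear_regression(xs, ys):
    """Least-squares fit of y = k*x + b via the normal equations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    k = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - k * mean_x
    return k, b
```

It returns (1, 0) for x = y = [0, 1, 2, 3] and (1, 3) when y is shifted by 3, matching the arrayReduce examples.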

singleValueOrNull

The aggregate function singleValueOrNull is used to implement subquery operators, such as x = ALL (SELECT ...). It checks if there is only one unique non-NULL value in the data. If there is only one unique value, it returns it. If there are zero or at least two distinct values, it returns NULL.

Syntax

singleValueOrNull(x)

Parameters

  • x: Column of any data type (except Map, Array or Tuple which cannot be of type Nullable).

Returned values

  • The unique value, if there is only one unique non-NULL value in x.
  • NULL, if there are zero or at least two distinct values.
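The rule is easy to state in Python, with None standing in for NULL (an illustrative sketch of the semantics, not the aggregate-state implementation):

```python
def single_value_or_null(values):
    """Return the value if exactly one distinct non-NULL value exists,
    otherwise None (standing in for NULL)."""
    distinct = {v for v in values if v is not None}
    return distinct.pop() if len(distinct) == 1 else None
```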

skewPop

Computes the skewness of a sequence.

skewPop(expr)

Arguments

expr: Expression returning a number.

Returned value

The skewness of the given distribution. Type: Float64.

Example

SELECT skewPop(value) FROM series_with_value_column
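Population skewness is the third central moment divided by the 3/2 power of the second. A Python sketch of that definition (illustrative; skewSamp below applies a sample correction on top of the same moments):

```python
def skew_pop(xs):
    """Population skewness: m3 / m2**1.5, with central moments
    computed over all n values."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5
```

A symmetric sample gives 0; a sample with a long right tail gives a positive value.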

skewSamp

Computes the sample skewness of a sequence.

It represents an unbiased estimate of the skewness of a random variable if passed values form its sample.

skewSamp(expr)

Arguments

  • expr: Expression returning a number.

Returned value

The skewness of the given distribution. Type: Float64. If n <= 1 (n is the size of the sample), then the function returns nan.

Example

SELECT skewSamp(value) FROM series_with_value_column

sparkbar

The function plots a frequency histogram for values x and the repetition rate y of these values over the interval [min_x, max_x]. Repetitions for all x falling into the same bucket are averaged, so data should be pre-aggregated. Negative repetitions are ignored.

If no interval is specified, the minimum x is used as the interval start and the maximum x as the interval end. Otherwise, values outside the interval are ignored.

Syntax

sparkbar(buckets[, min_x, max_x])(x, y)

Parameters

  • buckets: The number of segments. Type: Integer.
  • min_x: The interval start. Optional parameter.
  • max_x: The interval end. Optional parameter.

Arguments

  • x: The field with values.
  • y: The field with the frequency of values.

Returned value

  • The frequency histogram.
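The bucketing can be sketched as follows. This Python version sums y per equal-width bucket and scales to eight bar glyphs; it is a simplification (the documentation says repetitions within a bucket are averaged, and the exact glyph scaling is not specified here):

```python
BARS = ' ▁▂▃▄▅▆▇█'

def sparkbar(points, buckets, min_x=None, max_x=None):
    """Sketch: split [min_x, max_x] into equal segments, accumulate y
    per segment (dropping out-of-range x and negative y), then scale
    each segment to one of 8 bar glyphs."""
    xs = [x for x, _ in points]
    lo = min(xs) if min_x is None else min_x
    hi = max(xs) if max_x is None else max_x
    width = (hi - lo) / buckets or 1
    sums = [0.0] * buckets
    for x, y in points:
        if lo <= x <= hi and y > 0:
            i = min(int((x - lo) / width), buckets - 1)
            sums[i] += y
    top = max(sums) or 1
    return ''.join(BARS[round(8 * s / top)] for s in sums)
```

For three points with frequencies 1, 2, 8 and three buckets, the tallest bar lands in the last bucket.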

stddevPop

The result is equal to the square root of varPop.

Aliases: STD, STDDEV_POP.

This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the stddevPopStable function. It works slower but provides a lower computational error.

Syntax

stddevPop(x)

Parameters

  • x: Population of values to find the standard deviation of. (U)Int*, Float*, Decimal*.

Returned value

  • Standard deviation of x, i.e. the square root of varPop(x). Float64.

stddevPopStable

The result is equal to the square root of varPop. Unlike stddevPop, this function uses a numerically stable algorithm. It works slower but provides a lower computational error.

Syntax

stddevPopStable(x)

Parameters

  • x: Population of values to find the standard deviation of. (U)Int*, Float*, Decimal*.

Returned value

Standard deviation of x, i.e. the square root of varPop(x). Float64.

stddevSamp

The result is equal to the square root of varSamp.

Alias: STDDEV_SAMP.

This function uses a numerically unstable algorithm. If you need numerical stability in calculations, use the stddevSampStable function. It works slower but provides a lower computational error.

Syntax

stddevSamp(x)

Parameters

  • x: Values for which to find the square root of sample variance. (U)Int*, Float*, Decimal*.

Returned value

Square root of sample variance of x. Float64.

stddevSampStable

The result is equal to the square root of varSamp. Unlike stddevSamp, this function uses a numerically stable algorithm. It works slower but provides a lower computational error.

Syntax

stddevSampStable(x)

Parameters

  • x: Values for which to find the square root of sample variance. (U)Int*, Float*, Decimal*.

Returned value

Square root of sample variance of x. Float64.
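The stable/unstable distinction above comes down to how the running variance is accumulated. Welford's online algorithm is the classic numerically stable update; the sketch below illustrates it for the population case (an illustration of the kind of algorithm the *Stable variants use, not ClickHouse's exact code):

```python
def stddev_pop_stable(xs):
    """Welford's online algorithm: update mean and the sum of squared
    deviations (m2) one value at a time, avoiding the catastrophic
    cancellation of the naive sum-of-squares formula."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return (m2 / n) ** 0.5
```

For [2, 4, 4, 4, 5, 5, 7, 9] the population standard deviation is 2.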

stochasticLinearRegression

This function implements stochastic linear regression. It supports custom parameters for learning rate, L2 regularization coefficient, mini-batch size, and has a few methods for updating weight: Adam (used by default), simple SGD, Momentum, and Nesterov.

Parameters

There are 4 customizable parameters. They are passed to the function sequentially, but there is no need to pass all four: default values will be used. However, a good model requires some parameter tuning.

stochasticLinearRegression(0.00001, 0.1, 15, 'Adam')
  1. learning rate is the coefficient on step length when the gradient descent step is performed. A learning rate that is too big may cause infinite weights of the model. Default is 0.00001.
  2. l2 regularization coefficient, which may help to prevent overfitting. Default is 0.1.
  3. mini-batch size sets the number of elements whose gradients will be computed and summed to perform one step of gradient descent. Pure stochastic descent uses one element; however, having small batches (about 10 elements) makes gradient steps more stable. Default is 15.
  4. method for updating weights: Adam (by default), SGD, Momentum, or Nesterov. Momentum and Nesterov require somewhat more computation and memory, but they are useful in terms of speed of convergence and stability of stochastic gradient methods.
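The four parameters map directly onto a mini-batch gradient-descent loop. The sketch below uses the plain SGD update for a single feature (the other methods — Adam, Momentum, Nesterov — change only how the computed gradient is applied to the weights); parameter defaults here are chosen for the toy example, not ClickHouse's defaults:

```python
def sgd_linear_regression(data, lr=0.05, l2=0.0, batch=2, epochs=2000):
    """Mini-batch SGD for y = w*x + b with optional L2 regularization.
    Illustrative sketch of stochasticLinearRegression's training loop."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for i in range(0, len(data), batch):
            chunk = data[i:i + batch]
            gw = gb = 0.0
            for x, y in chunk:
                err = (w * x + b) - y          # prediction error
                gw += err * x + l2 * w         # gradient wrt w (+ L2)
                gb += err                      # gradient wrt b
            w -= lr * gw / len(chunk)
            b -= lr * gb / len(chunk)
    return w, b
```

Trained on points from y = 2x + 1, the loop recovers weights close to (2, 1).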

stochasticLogisticRegression

This function implements stochastic logistic regression. It can be used for binary classification problems, supports the same custom parameters as stochasticLinearRegression, and works the same way.

Parameters

Parameters are exactly the same as in stochasticLinearRegression: learning rate, l2 regularization coefficient, mini-batch size, method for updating weights. For more information see parameters.

stochasticLogisticRegression(1.0, 1.0, 10, 'SGD')

studentTTest

Applies Student's t-test to samples from two populations.

Syntax

studentTTest([confidence_level])(sample_data, sample_index)

Values of both samples are in the sample_data column. If sample_index equals 0, the value in that row belongs to the sample from the first population; otherwise it belongs to the sample from the second population. The null hypothesis is that the means of the populations are equal. A normal distribution with equal variances is assumed.

Arguments

  • sample_data: Sample data. Integer, Float or Decimal.
  • sample_index: Sample index. Integer.

Parameters

  • confidence_level: Confidence level in order to calculate confidence intervals. Float.

Returned values

Tuple with two or four elements (if the optional confidence_level is specified):

  • calculated t-statistic. Float64.
  • calculated p-value. Float64.
  • calculated confidence-interval-low. Float64.
  • calculated confidence-interval-high. Float64.

Example

Input table:

┌─sample_data─┬─sample_index─┐
│        20.3 │            0 │
│        21.1 │            0 │
│        21.9 │            1 │
│        21.7 │            0 │
│        19.9 │            1 │
│        21.8 │            1 │
└─────────────┴──────────────┘

Query:

SELECT studentTTest(sample_data, sample_index) FROM student_ttest

Result:

┌─studentTTest(sample_data, sample_index)───┐
│ (-0.21739130434783777,0.8385421208415731) │
└───────────────────────────────────────────┘

sum

Calculates the sum. Only works for numbers.

Syntax

sum(num)

Parameters

  • num: Column of numeric values. (U)Int*, Float*, Decimal*.

Returned value

  • The sum of the values. (U)Int*, Float*, Decimal*.
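Example

Summing the first ten non-negative integers (0 through 9):

Query:

SELECT sum(number) FROM numbers(10)

Result:

┌─sum(number)─┐
│          45 │
└─────────────┘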

sumCount

Calculates the sum of the numbers and counts the number of rows at the same time.

Syntax

sumCount(x)

Arguments

  • x: Input value, must be Integer, Float, or Decimal.

Returned value

  • Tuple (sum, count), where sum is the sum of the numbers and count is the number of rows with non-NULL values.

Type: Tuple.
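Example

Over the first five non-negative integers (0 + 1 + 2 + 3 + 4 = 10, across 5 rows):

Query:

SELECT sumCount(number) FROM numbers(5)

Result:

┌─sumCount(number)─┐
│ (10,5)           │
└──────────────────┘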

sumKahan

Calculates the sum of the numbers with the Kahan compensated summation algorithm. It is slower than the sum function. The compensation works only for Float types.

Syntax

sumKahan(x)

Arguments

  • x: Input value, must be Integer, Float, or Decimal.

Returned value

  • The sum of the numbers. The type (Integer, Float, or Decimal) depends on the type of the input arguments.

Example

Query:

SELECT sum(0.1), sumKahan(0.1) FROM numbers(10)

Result:

┌───────────sum(0.1)─┬─sumKahan(0.1)─┐
│ 0.9999999999999999 │             1 │
└────────────────────┴───────────────┘

sumMap

Totals a value array according to the keys specified in the key array. Returns a tuple of two arrays: keys in sorted order, and values summed for the corresponding keys without overflow.

Syntax

  • sumMap(key <Array>, value <Array>) Array type.
  • sumMap(Tuple(key <Array>, value <Array>)) Tuple type.

Alias: sumMappedArrays.

Arguments

  • key: Array of keys.
  • value: Array of values.

Passing a tuple of key and value arrays is a synonym to passing separately an array of keys and an array of values.

The number of elements in key and value must be the same for each row that is totaled.

Returned Value

  • Returns a tuple of two arrays: keys in sorted order, and values summed for the corresponding keys.
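Example

Keys that appear in several rows are merged and their values summed:

Query:

SELECT sumMap(a, b)
FROM VALUES('a Array(UInt8), b Array(UInt8)', ([1, 2], [10, 10]), ([2, 3], [10, 10]))

Result:

┌─sumMap(a, b)─────────┐
│ ([1,2,3],[10,20,10]) │
└──────────────────────┘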

sumMapWithOverflow

Totals a value array according to the keys specified in the key array. Returns a tuple of two arrays: keys in sorted order, and values summed for the corresponding keys. It differs from the sumMap function in that it does summation with overflow, i.e. it returns the same data type for the summation as the argument data type.

Syntax

  • sumMapWithOverflow(key <Array>, value <Array>) Array type.
  • sumMapWithOverflow(Tuple(key <Array>, value <Array>)) Tuple type.

Arguments

  • key: Array of keys.
  • value: Array of values.

Passing a tuple of key and value arrays is a synonym to passing separately an array of keys and an array of values.

The number of elements in key and value must be the same for each row that is totaled.

Returned Value

  • Returns a tuple of two arrays: keys in sorted order, and values summed for the corresponding keys.

sumWithOverflow

Computes the sum of the numbers, using the same data type for the result as for the input parameters. If the sum exceeds the maximum value for this data type, it is calculated with overflow.

Only works for numbers.

Syntax

sumWithOverflow(num)

Parameters

  • num: Column of numeric values. (U)Int*, Float*, Decimal*.

Returned value

  • The sum of the values. (U)Int*, Float*, Decimal*.
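Example

The difference from sum: sum promotes the result type (UInt8 is summed into UInt64), while sumWithOverflow keeps the argument type, so a UInt8 sum of 300 should wrap around to 44 (300 − 256):

Query:

SELECT sum(x), sumWithOverflow(x)
FROM (SELECT toUInt8(100) AS x FROM numbers(3))

Result:

┌─sum(x)─┬─sumWithOverflow(x)─┐
│    300 │                 44 │
└────────┴────────────────────┘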

theilsU

The theilsU function calculates the Theil's U uncertainty coefficient, a value that measures the association between two columns in a table. Its values range from −1.0 (100% negative association, or perfect inversion) to +1.0 (100% positive association, or perfect agreement). A value of 0.0 indicates the absence of association.

Syntax

theilsU(column1, column2)

Arguments

  • column1 and column2 are the columns to be compared

Returned value

  • a value between -1 and 1

Return type is always Float64.

Example

The two columns compared below have a small association with each other, so the value of theilsU is negative:

SELECT
    theilsU(a ,b)
FROM
    (
        SELECT
            number % 10 AS a,
            number % 4 AS b
        FROM
            numbers(150)
    )

Result:

┌────────theilsU(a, b)─┐
│ -0.30195720557678846 │
└──────────────────────┘

topK

Returns an array of the approximately most frequent values in the specified column. The resulting array is sorted in descending order of approximate frequency of values (not by the values themselves).

Implements the Filtered Space-Saving algorithm for analyzing TopK, based on the reduce-and-combine algorithm from Parallel Space Saving.

topK(N)(column)
topK(N, load_factor)(column)
topK(N, load_factor, 'counts')(column)

This function does not provide a guaranteed result. In certain situations, errors might occur and it might return frequent values that aren’t the most frequent values.

We recommend using values of N < 10; performance is reduced with large N values. The maximum value of N is 65536.

Parameters

  • N: The number of elements to return. Optional. Default value: 10.
  • load_factor: Defines how many cells are reserved for values. If uniq(column) > N * load_factor, the result of the topK function is approximate. Optional. Default value: 3.
  • counts: Defines whether the result should contain the approximate count and error value.

Arguments

  • column: The value whose frequency is to be calculated.

Example

Take the OnTime data set and select the three most frequently occurring values in the AirlineID column.

SELECT topK(3)(AirlineID) AS res
FROM ontime
┌─res─────────────────┐
│ [19393,19790,19805] │
└─────────────────────┘

topKWeighted

Returns an array of the approximately most frequent values in the specified column. The resulting array is sorted in descending order of approximate frequency of values (not by the values themselves). Additionally, the weight of the value is taken into account.

Syntax

topKWeighted(N)(column, weight)
topKWeighted(N, load_factor)(column, weight)
topKWeighted(N, load_factor, 'counts')(column, weight)

Parameters

  • N: The number of elements to return. Optional. Default value: 10.
  • load_factor: Defines how many cells are reserved for values. If uniq(column) > N * load_factor, the result of the topKWeighted function is approximate. Optional. Default value: 3.
  • counts: Defines whether the result should contain the approximate count and error value.

Arguments

  • column: The value.
  • weight: The weight. Every value is counted weight times in the frequency calculation. UInt64.

Returned value

Returns an array of the values with maximum approximate sum of weights.

Example

Query:

SELECT topKWeighted(2)(k, w) FROM
VALUES('k Char, w UInt64', ('y', 1), ('y', 1), ('x', 5), ('y', 1), ('z', 10))

Result:

┌─topKWeighted(2)(k, w)──┐
│ ['z','x']              │
└────────────────────────┘

Query:

SELECT topKWeighted(2, 10, 'counts')(k, w)
FROM VALUES('k Char, w UInt64', ('y', 1), ('y', 1), ('x', 5), ('y', 1), ('z', 10))

Result:

┌─topKWeighted(2, 10, 'counts')(k, w)─┐
│ [('z',10,0),('x',5,0)]              │
└─────────────────────────────────────┘

uniq

Calculates the approximate number of different values of the argument.

uniq(x[, ...])

Arguments

The function takes a variable number of parameters. Parameters can be Tuple, Array, Date, DateTime, String, or numeric types.

Returned value

  • A UInt64-type number.

Implementation details

Function:

  • Calculates a hash for all parameters in the aggregate, then uses it in calculations.

  • Uses an adaptive sampling algorithm. For the calculation state, the function uses a sample of element hash values up to 65536. This algorithm is very accurate and very efficient on the CPU. When the query contains several of these functions, using uniq is almost as fast as using other aggregate functions.

  • Provides the result deterministically (it does not depend on the query processing order).

We recommend using this function in almost all scenarios.
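Example

For small cardinalities, the estimate typically matches the exact count:

Query:

SELECT uniq(number) FROM numbers(1000)

Result:

┌─uniq(number)─┐
│         1000 │
└──────────────┘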

uniqCombined

Calculates the approximate number of different argument values.

uniqCombined(HLL_precision)(x[, ...])

The uniqCombined function is a good choice for calculating the number of different values.

Arguments

  • HLL_precision: The base-2 logarithm of the number of cells in HyperLogLog. Optional, you can use the function as uniqCombined(x[, ...]). The default value for HLL_precision is 17, which is effectively 96 KiB of space (2^17 cells, 6 bits each).
  • x: A variable number of parameters. Parameters can be Tuple, Array, Date, DateTime, String, or numeric types.

Returned value

  • A UInt64-type number.

Implementation details

The uniqCombined function:

  • Calculates a hash (64-bit hash for String and 32-bit otherwise) for all parameters in the aggregate, then uses it in calculations.
  • Uses a combination of three algorithms: array, hash table, and HyperLogLog with an error correction table.
    • For a small number of distinct elements, an array is used.
    • When the set size is larger, a hash table is used.
    • For a larger number of elements, HyperLogLog is used, which will occupy a fixed amount of memory.
  • Provides the result deterministically (it does not depend on the query processing order).

Since it uses a 32-bit hash for non-String types, the result has very high error for cardinalities significantly larger than UINT_MAX (the error rises quickly after a few tens of billions of distinct values); in this case you should use uniqCombined64.

Compared to the uniq function, the uniqCombined function:

  • Consumes several times less memory.
  • Calculates with several times higher accuracy.
  • Usually has slightly lower performance. In some scenarios, uniqCombined can perform better than uniq, for example, with distributed queries that transmit a large number of aggregation states over the network.

Example

Query:

SELECT uniqCombined(number) FROM numbers(1e6)

Result:

┌─uniqCombined(number)─┐
│              1001148 │ -- 1.00 million
└──────────────────────┘

See the example section of uniqCombined64 for an example of the difference between uniqCombined and uniqCombined64 for much larger inputs.

uniqCombined64

Calculates the approximate number of different argument values. It is the same as uniqCombined, but uses a 64-bit hash for all data types rather than just for the String data type.

uniqCombined64(HLL_precision)(x[, ...])

Parameters

  • HLL_precision: The base-2 logarithm of the number of cells in HyperLogLog. Optionally, you can use the function as uniqCombined64(x[, ...]). The default value for HLL_precision is 17, which is effectively 96 KiB of space (2^17 cells, 6 bits each).
  • x: A variable number of parameters. Parameters can be Tuple, Array, Date, DateTime, String, or numeric types.

Returned value

  • A UInt64-type number.

Implementation details

The uniqCombined64 function:

  • Calculates a hash (64-bit hash for all data types) for all parameters in the aggregate, then uses it in calculations.
  • Uses a combination of three algorithms: array, hash table, and HyperLogLog with an error correction table.
    • For a small number of distinct elements, an array is used.
    • When the set size is larger, a hash table is used.
    • For a larger number of elements, HyperLogLog is used, which will occupy a fixed amount of memory.
  • Provides the result deterministically (it does not depend on the query processing order).

Since it uses 64-bit hash for all types, the result does not suffer from very high error for cardinalities significantly larger than UINT_MAX like uniqCombined does, which uses a 32-bit hash for non-String types.

Compared to the uniq function, the uniqCombined64 function:

  • Consumes several times less memory.
  • Calculates with several times higher accuracy.

Example

In the example below uniqCombined64 is run on 1e10 different numbers returning a very close approximation of the number of different argument values.

Query:

SELECT uniqCombined64(number) FROM numbers(1e10)

Result:

┌─uniqCombined64(number)─┐
│             9998568925 │ -- 10.00 billion
└────────────────────────┘

By comparison the uniqCombined function returns a rather poor approximation for an input this size.

Query:

SELECT uniqCombined(number) FROM numbers(1e10)

Result:

┌─uniqCombined(number)─┐
│           5545308725 │ -- 5.55 billion
└──────────────────────┘

uniqExact

Calculates the exact number of different argument values.

uniqExact(x[, ...])

Use the uniqExact function if you absolutely need an exact result. Otherwise use the uniq function.

The uniqExact function uses more memory than uniq, because the size of the state has unbounded growth as the number of different values increases.

Arguments

The function takes a variable number of parameters. Parameters can be Tuple, Array, Date, DateTime, String, or numeric types.
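Example

Query:

SELECT uniqExact(number) FROM numbers(1000)

Result:

┌─uniqExact(number)─┐
│              1000 │
└───────────────────┘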

uniqHLL12

Calculates the approximate number of different argument values, using the HyperLogLog algorithm.

uniqHLL12(x[, ...])

Arguments

The function takes a variable number of parameters. Parameters can be Tuple, Array, Date, DateTime, String, or numeric types.

Returned value

  • A UInt64-type number.

Implementation details

Function:

  • Calculates a hash for all parameters in the aggregate, then uses it in calculations.

  • Uses the HyperLogLog algorithm to approximate the number of different argument values.

    2^12 5-bit cells are used. The size of the state is slightly more than 2.5 KB. The result is not very accurate (up to ~10% error) for small data sets (<10K elements). However, the result is fairly accurate for high-cardinality data sets (10K-100M), with a maximum error of ~1.6%. Starting from 100M, the estimation error increases, and the function will return very inaccurate results for data sets with extremely high cardinality (1B+ elements).

  • Provides the result deterministically (it does not depend on the query processing order).

We do not recommend using this function. In most cases, use the uniq or uniqCombined function.

uniqTheta

Calculates the approximate number of different argument values, using the Theta Sketch Framework.

uniqTheta(x[, ...])

Arguments

The function takes a variable number of parameters. Parameters can be Tuple, Array, Date, DateTime, String, or numeric types.

Returned value

  • A UInt64-type number.

Implementation details

Function:

  • Calculates a hash for all parameters in the aggregate, then uses it in calculations.

  • Uses the KMV algorithm to approximate the number of different argument values.

    4096 (2^12) 64-bit sketches are used. The size of the state is about 41 KB.

  • The relative error is 3.125% (95% confidence), see the relative error table for detail.

varPop

Calculates the population variance.

Syntax

varPop(x)

Alias: VAR_POP.

Parameters

  • x: Population of values to find the population variance of. (U)Int*, Float*, Decimal*.

Returned value

  • Returns the population variance of x. Float64.
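Example

For the values 1 through 5, the mean is 3, the squared deviations sum to 10, and dividing by n = 5 gives a population variance of 2:

Query:

SELECT varPop(x) FROM (SELECT arrayJoin([1, 2, 3, 4, 5]) AS x)

Result:

┌─varPop(x)─┐
│         2 │
└───────────┘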

varPopStable

Returns the population variance. Unlike varPop, this function uses a numerically stable algorithm. It works slower but provides a lower computational error.

Syntax

varPopStable(x)

Alias: VAR_POP_STABLE.

Parameters

  • x: Population of values to find the population variance of. (U)Int*, Float*, Decimal*.

Returned value

  • Returns the population variance of x. Float64.

varSamp

Calculates the sample variance of a data set.

Syntax

varSamp(x)

Alias: VAR_SAMP.

Parameters

  • x: The population for which you want to calculate the sample variance. (U)Int*, Float*, Decimal*.

Returned value

  • Returns the sample variance of the input data set x. Float64.

Implementation details

The varSamp function calculates the sample variance using the following formula:

$$ \sum\frac{(x - \text{mean}(x))^2}{(n - 1)} $$

Where:

  • x is each individual data point in the data set.
  • mean(x) is the arithmetic mean of the data set.
  • n is the number of data points in the data set.

The function assumes that the input data set represents a sample from a larger population. If you want to calculate the variance of the entire population (when you have the complete data set), you should use varPop instead.
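Example

Using the same values 1 through 5 as for varPop, the squared deviations sum to 10 but the divisor is n − 1 = 4, so the sample variance is 2.5:

Query:

SELECT varSamp(x) FROM (SELECT arrayJoin([1, 2, 3, 4, 5]) AS x)

Result:

┌─varSamp(x)─┐
│        2.5 │
└────────────┘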

varSampStable

Calculates the sample variance of a data set. Unlike varSamp, this function uses a numerically stable algorithm. It works slower but provides a lower computational error.

Syntax

varSampStable(x)

Alias: VAR_SAMP_STABLE.

Parameters

  • x: The population for which you want to calculate the sample variance. (U)Int*, Float*, Decimal*.

Returned value

  • Returns the sample variance of the input data set. Float64.

Implementation details

The varSampStable function calculates the sample variance using the same formula as the varSamp:

$$ \sum\frac{(x - \text{mean}(x))^2}{(n - 1)} $$

Where:

  • x is each individual data point in the data set.
  • mean(x) is the arithmetic mean of the data set.
  • n is the number of data points in the data set.

welchTTest

Applies Welch's t-test to samples from two populations.

Syntax

welchTTest([confidence_level])(sample_data, sample_index)

Values of both samples are in the sample_data column. If sample_index equals 0, the value in that row belongs to the sample from the first population; otherwise it belongs to the sample from the second population. The null hypothesis is that the means of the populations are equal. A normal distribution is assumed. The populations may have unequal variance.

Arguments

  • sample_data: Sample data. Integer, Float or Decimal.
  • sample_index: Sample index. Integer.

Parameters

  • confidence_level: Confidence level in order to calculate confidence intervals. Float.

Returned values

Tuple with two or four elements (if the optional confidence_level is specified):

  • calculated t-statistic. Float64.
  • calculated p-value. Float64.
  • calculated confidence-interval-low. Float64.
  • calculated confidence-interval-high. Float64.

Example

Input table:

┌─sample_data─┬─sample_index─┐
│        20.3 │            0 │
│        22.1 │            0 │
│        21.9 │            0 │
│        18.9 │            1 │
│        20.3 │            1 │
│          19 │            1 │
└─────────────┴──────────────┘

Query:

SELECT welchTTest(sample_data, sample_index) FROM welch_ttest

Result:

┌─welchTTest(sample_data, sample_index)─────┐
│ (2.7988719532211235,0.051807360348581945) │
└───────────────────────────────────────────┘