Hash functions¶
Hash functions can be used for the deterministic pseudo-random shuffling of elements.
Simhash is a hash function, which returns close hash values for close (similar) arguments.
halfMD5¶
Interprets all the input parameters as strings and calculates the MD5 hash value for each of them. Then combines hashes, takes the first 8 bytes of the hash of the resulting string, and interprets them as UInt64
in big-endian byte order.
halfMD5(par1, ...)
The function is relatively slow (5 million short strings per second per processor core). Consider using the sipHash64 function instead.
Arguments
The function takes a variable number of input parameters. Arguments can be any of the supported data types. For some data types calculated value of hash function may be the same for the same values even if types of arguments differ (integers of different size, named and unnamed Tuple
with the same data, Map
and the corresponding Array(Tuple(key, value))
type with the same data).
Returned value
A UInt64 data type hash value.
Example
SELECT halfMD5(array('e','x','a'), 'mple', 10, toDateTime('2019-06-15 23:00:00')) AS halfMD5hash, toTypeName(halfMD5hash) AS type
┌────────halfMD5hash─┬─type───┐ │ 186182704141653334 │ UInt64 │ └────────────────────┴────────┘
MD4¶
Calculates the MD4 from a string and returns the resulting set of bytes as FixedString(16).
MD5¶
Calculates the MD5 from a string and returns the resulting set of bytes as FixedString(16). If you don't need MD5 in particular, but you need a decent cryptographic 128-bit hash, use the ‘sipHash128’ function instead. If you want to get the same result as output by the md5sum utility, use lower(hex(MD5(s))).
RIPEMD160¶
Produces RIPEMD-160 hash value.
Syntax
RIPEMD160(input)
Parameters
input
: Input string. String
Returned value
- A 160-bit
RIPEMD-160
hash value of type FixedString(20).
Example
Use the hex function to represent the result as a hex-encoded string.
Query:
SELECT HEX(RIPEMD160('The quick brown fox jumps over the lazy dog'))
┌─HEX(RIPEMD160('The quick brown fox jumps over the lazy dog'))─┐ │ 37F332F68DB77BD9D7EDD4969571AD671CF9DD3B │ └───────────────────────────────────────────────────────────────┘
sipHash64¶
Produces a 64-bit SipHash hash value.
sipHash64(par1,...)
This is a cryptographic hash function. It works at least three times faster than the MD5 hash function.
The function interprets all the input parameters as strings and calculates the hash value for each of them. It then combines the hashes by the following algorithm:
- The first and the second hash value are concatenated to an array which is hashed.
- The previously calculated hash value and the hash of the third input parameter are hashed in a similar way.
- This calculation is repeated for all remaining hash values of the original input.
Arguments
The function takes a variable number of input parameters of any of the supported data types.
Returned value
A UInt64 data type hash value.
Note that the calculated hash values may be equal for the same input values of different argument types. This affects for example integer types of different size, named and unnamed Tuple
with the same data, Map
and the corresponding Array(Tuple(key, value))
type with the same data.
Example
SELECT sipHash64(array('e','x','a'), 'mple', 10, toDateTime('2019-06-15 23:00:00')) AS SipHash, toTypeName(SipHash) AS type
┌──────────────SipHash─┬─type───┐ │ 11400366955626497465 │ UInt64 │ └──────────────────────┴────────┘
sipHash64Keyed¶
Same as sipHash64 but additionally takes an explicit key argument instead of using a fixed key.
Syntax
sipHash64Keyed((k0, k1), par1,...)
Arguments
Same as sipHash64, but the first argument is a tuple of two UInt64 values representing the key.
Returned value
A UInt64 data type hash value.
Example
Query:
SELECT sipHash64Keyed((506097522914230528, 1084818905618843912), array('e','x','a'), 'mple', 10, toDateTime('2019-06-15 23:00:00')) AS SipHash, toTypeName(SipHash) AS type
┌─────────────SipHash─┬─type───┐ │ 8017656310194184311 │ UInt64 │ └─────────────────────┴────────┘
sipHash128¶
Like sipHash64 but produces a 128-bit hash value, i.e. the final xor-folding state is done up to 128 bits.
This 128-bit variant differs from the reference implementation and it's weaker. This version exists because, when it was written, there was no official 128-bit extension for SipHash. New projects should probably use sipHash128Reference.
Syntax
sipHash128(par1,...)
Arguments
Same as for sipHash64.
Returned value
A 128-bit SipHash
hash value of type FixedString(16).
Example
Query:
SELECT hex(sipHash128('foo', '\x01', 3))
Result:
┌─hex(sipHash128('foo', '', 3))────┐ │ 9DE516A64A414D4B1B609415E4523F24 │ └──────────────────────────────────┘
sipHash128Keyed¶
Same as sipHash128 but additionally takes an explicit key argument instead of using a fixed key.
This 128-bit variant differs from the reference implementation and it's weaker. This version exists because, when it was written, there was no official 128-bit extension for SipHash. New projects should probably use sipHash128ReferenceKeyed.
Syntax
sipHash128Keyed((k0, k1), par1,...)
Arguments
Same as sipHash128, but the first argument is a tuple of two UInt64 values representing the key.
Returned value
A 128-bit SipHash
hash value of type FixedString(16).
Example
Query:
SELECT hex(sipHash128Keyed((506097522914230528, 1084818905618843912),'foo', '\x01', 3))
Result:
┌─hex(sipHash128Keyed((506097522914230528, 1084818905618843912), 'foo', '', 3))─┐ │ B8467F65C8B4CFD9A5F8BD733917D9BF │ └───────────────────────────────────────────────────────────────────────────────┘
sipHash128Reference¶
Like sipHash128 but implements the 128-bit algorithm from the original authors of SipHash.
Syntax
sipHash128Reference(par1,...)
Arguments
Same as for sipHash128.
Returned value
A 128-bit SipHash
hash value of type FixedString(16).
Example
Query:
SELECT hex(sipHash128Reference('foo', '\x01', 3))
Result:
┌─hex(sipHash128Reference('foo', '', 3))─┐ │ 4D1BE1A22D7F5933C0873E1698426260 │ └────────────────────────────────────────┘
sipHash128ReferenceKeyed¶
Same as sipHash128Reference but additionally takes an explicit key argument instead of using a fixed key.
Syntax
sipHash128ReferenceKeyed((k0, k1), par1,...)
Arguments
Same as sipHash128Reference, but the first argument is a tuple of two UInt64 values representing the key.
Returned value
A 128-bit SipHash
hash value of type FixedString(16).
Example
Query:
SELECT hex(sipHash128ReferenceKeyed((506097522914230528, 1084818905618843912),'foo', '\x01', 3))
Result:
┌─hex(sipHash128ReferenceKeyed((506097522914230528, 1084818905618843912), 'foo', '', 3))─┐ │ 630133C9722DC08646156B8130C4CDC8 │ └────────────────────────────────────────────────────────────────────────────────────────┘
cityHash64¶
Produces a 64-bit CityHash hash value.
cityHash64(par1,...)
This is a fast non-cryptographic hash function. It uses the CityHash algorithm for string parameters and implementation-specific fast non-cryptographic hash function for parameters with other data types. The function uses the CityHash combinator to get the final results.
Arguments
The function takes a variable number of input parameters. Arguments can be any of the supported data types. For some data types calculated value of hash function may be the same for the same values even if types of arguments differ (integers of different size, named and unnamed Tuple
with the same data, Map
and the corresponding Array(Tuple(key, value))
type with the same data).
Returned value
A UInt64 data type hash value.
Examples
Call example:
SELECT cityHash64(array('e','x','a'), 'mple', 10, toDateTime('2019-06-15 23:00:00')) AS CityHash, toTypeName(CityHash) AS type
┌─────────────CityHash─┬─type───┐ │ 12072650598913549138 │ UInt64 │ └──────────────────────┴────────┘
The following example shows how to compute the checksum of the entire table with accuracy up to the row order:
SELECT groupBitXor(cityHash64(*)) FROM table
intHash32¶
Calculates a 32-bit hash code from any type of integer. This is a relatively fast non-cryptographic hash function of average quality for numbers.
Syntax
intHash32(int)
Arguments
int
: Integer to hash. (U)Int*.
Returned value
- 32-bit hash code. UInt32.
Example
Query:
SELECT intHash32(42)
Result:
┌─intHash32(42)─┐ │ 1228623923 │ └───────────────┘
intHash64¶
Calculates a 64-bit hash code from any type of integer. This is a relatively fast non-cryptographic hash function of average quality for numbers. It works faster than intHash32.
Syntax
intHash64(int)
Arguments
int
: Integer to hash. (U)Int*.
Returned value
- 64-bit hash code. UInt64.
Example
Query:
SELECT intHash64(42)
Result:
┌────────intHash64(42)─┐ │ 11490350930367293593 │ └──────────────────────┘
SHA1, SHA224, SHA256, SHA512, SHA512_256¶
Calculates SHA-1, SHA-224, SHA-256, SHA-512, SHA-512-256 hash from a string and returns the resulting set of bytes as FixedString.
Syntax
SHA1('s') ... SHA512('s')
Arguments
s
: Input string for SHA hash calculation. String.
Returned value
- SHA hash as a hex-unencoded FixedString. SHA-1 returns as FixedString(20), SHA-224 as FixedString(28), SHA-256: FixedString(32), SHA-512: FixedString(64). FixedString.
Example
Use the hex function to represent the result as a hex-encoded string.
Query:
SELECT hex(SHA1('abc'))
Result:
┌─hex(SHA1('abc'))─────────────────────────┐ │ A9993E364706816ABA3E25717850C26C9CD0D89D │ └──────────────────────────────────────────┘
BLAKE3¶
Calculates BLAKE3 hash string and returns the resulting set of bytes as FixedString.
Syntax
BLAKE3('s')
This cryptographic hash-function is integrated with the BLAKE3 Rust library. The function is rather fast and shows approximately two times faster performance compared to SHA-2, while generating hashes of the same length as SHA-256.
Arguments
- s - input string for BLAKE3 hash calculation. String.
Return value
- BLAKE3 hash as a byte array with type FixedString(32). FixedString.
Example
Use function hex to represent the result as a hex-encoded string.
Query:
SELECT hex(BLAKE3('ABC'))
Result:
┌─hex(BLAKE3('ABC'))───────────────────────────────────────────────┐ │ D1717274597CF0289694F75D96D444B992A096F1AFD8E7BBFA6EBB1D360FEDFC │ └──────────────────────────────────────────────────────────────────┘
URLHash(url[, N])¶
A fast, decent-quality non-cryptographic hash function for a string obtained from a URL using some type of normalization. URLHash(s)
– Calculates a hash from a string without one of the trailing symbols /
,?
or #
at the end, if present. URLHash(s, N)
– Calculates a hash from a string up to the N level in the URL hierarchy, without one of the trailing symbols /
,?
or #
at the end, if present. Levels are the same as in URLHierarchy.
farmFingerprint64¶
farmHash64¶
Produces a 64-bit FarmHash or Fingerprint value. farmFingerprint64
is preferred for a stable and portable value.
farmFingerprint64(par1, ...) farmHash64(par1, ...)
These functions use the Fingerprint64
and Hash64
methods respectively from all available methods.
Arguments
The function takes a variable number of input parameters. Arguments can be any of the supported data types. For some data types calculated value of hash function may be the same for the same values even if types of arguments differ (integers of different size, named and unnamed Tuple
with the same data, Map
and the corresponding Array(Tuple(key, value))
type with the same data).
Returned value
A UInt64 data type hash value.
Example
SELECT farmHash64(array('e','x','a'), 'mple', 10, toDateTime('2019-06-15 23:00:00')) AS FarmHash, toTypeName(FarmHash) AS type
┌─────────────FarmHash─┬─type───┐ │ 17790458267262532859 │ UInt64 │ └──────────────────────┴────────┘
javaHash¶
Calculates JavaHash from a string, Byte, Short, Integer, Long. This hash function is neither fast nor having a good quality. The only reason to use it's when this algorithm is already used in another system and you have to calculate exactly the same result.
Syntax
SELECT javaHash('')
Returned value
A Int32
data type hash value.
Example
Query:
SELECT javaHash(toInt32(123))
Result:
┌─javaHash(toInt32(123))─┐ │ 123 │ └────────────────────────┘
Query:
SELECT javaHash('Hello, world!')
Result:
┌─javaHash('Hello, world!')─┐ │ -1880044555 │ └───────────────────────────┘
javaHashUTF16LE¶
Calculates JavaHash from a string, assuming it contains bytes representing a string in UTF-16LE encoding.
Syntax
javaHashUTF16LE(stringUtf16le)
Arguments
stringUtf16le
: a string in UTF-16LE encoding.
Returned value
A Int32
data type hash value.
Example
Correct query with UTF-16LE encoded string.
Query:
SELECT javaHashUTF16LE(convertCharset('test', 'utf-8', 'utf-16le'))
Result:
┌─javaHashUTF16LE(convertCharset('test', 'utf-8', 'utf-16le'))─┐ │ 3556498 │ └──────────────────────────────────────────────────────────────┘
hiveHash¶
Calculates HiveHash
from a string.
SELECT hiveHash('')
This is just JavaHash with zeroed out sign bit. This function is used in Apache Hive for versions before 3.0. This hash function is neither fast nor having a good quality. The only reason to use it's when this algorithm is already used in another system and you have to calculate exactly the same result.
Returned value
hiveHash
hash value. Int32.
Example
Query:
SELECT hiveHash('Hello, world!')
Result:
┌─hiveHash('Hello, world!')─┐ │ 267439093 │ └───────────────────────────┘
metroHash64¶
Produces a 64-bit MetroHash hash value.
metroHash64(par1, ...)
Arguments
The function takes a variable number of input parameters. Arguments can be any of the supported data types. For some data types calculated value of hash function may be the same for the same values even if types of arguments differ (integers of different size, named and unnamed Tuple
with the same data, Map
and the corresponding Array(Tuple(key, value))
type with the same data).
Returned value
A UInt64 data type hash value.
Example
SELECT metroHash64(array('e','x','a'), 'mple', 10, toDateTime('2019-06-15 23:00:00')) AS MetroHash, toTypeName(MetroHash) AS type
┌────────────MetroHash─┬─type───┐ │ 14235658766382344533 │ UInt64 │ └──────────────────────┴────────┘
jumpConsistentHash¶
Calculates JumpConsistentHash form a UInt64. Accepts two arguments: a UInt64-type key and the number of buckets. Returns Int32. For more information, see the link: JumpConsistentHash
kostikConsistentHash¶
An O(1) time and space consistent hash algorithm by Konstantin 'kostik' Oblakov. Previously yandexConsistentHash
.
Syntax
kostikConsistentHash(input, n)
Alias: yandexConsistentHash
(left for backwards compatibility sake).
Parameters
input
: A UInt64-type key UInt64.n
: Number of buckets. UInt16.
Returned value
- A UInt16 data type hash value.
Implementation details
It is efficient only if n <= 32768.
Example
Query:
SELECT kostikConsistentHash(16045690984833335023, 2)
┌─kostikConsistentHash(16045690984833335023, 2)─┐ │ 1 │ └───────────────────────────────────────────────┘
murmurHash2_32, murmurHash2_64¶
Produces a MurmurHash2 hash value.
murmurHash2_32(par1, ...) murmurHash2_64(par1, ...)
Arguments
Both functions take a variable number of input parameters. Arguments can be any of the supported data types. For some data types calculated value of hash function may be the same for the same values even if types of arguments differ (integers of different size, named and unnamed Tuple
with the same data, Map
and the corresponding Array(Tuple(key, value))
type with the same data).
Returned value
- The
murmurHash2_32
function returns hash value having the UInt32 data type. - The
murmurHash2_64
function returns hash value having the UInt64 data type.
Example
SELECT murmurHash2_64(array('e','x','a'), 'mple', 10, toDateTime('2019-06-15 23:00:00')) AS MurmurHash2, toTypeName(MurmurHash2) AS type
┌──────────MurmurHash2─┬─type───┐ │ 11832096901709403633 │ UInt64 │ └──────────────────────┴────────┘
gccMurmurHash¶
Calculates a 64-bit MurmurHash2 hash value using the same hash seed as gcc. It is portable between Clang and GCC builds.
Syntax
gccMurmurHash(par1, ...)
Arguments
par1, ...
: A variable number of parameters that can be any of the supported data types.
Returned value
- Calculated hash value. UInt64.
Example
Query:
SELECT gccMurmurHash(1, 2, 3) AS res1, gccMurmurHash(('a', [1, 2, 3], 4, (4, ['foo', 'bar'], 1, (1, 2)))) AS res2
Result:
┌─────────────────res1─┬────────────────res2─┐ │ 12384823029245979431 │ 1188926775431157506 │ └──────────────────────┴─────────────────────┘
kafkaMurmurHash¶
Calculates a 32-bit MurmurHash2 hash value using the same hash seed as Kafka and without the highest bit to be compatible with Default Partitioner.
Syntax
MurmurHash(par1, ...)
Arguments
par1, ...
: A variable number of parameters that can be any of the supported data types.
Returned value
- Calculated hash value. UInt32.
Example
Query:
SELECT kafkaMurmurHash('foobar') AS res1, kafkaMurmurHash(array('e','x','a'), 'mple', 10, toDateTime('2019-06-15 23:00:00')) AS res2
Result:
┌───────res1─┬─────res2─┐ │ 1357151166 │ 85479775 │ └────────────┴──────────┘
murmurHash3_32, murmurHash3_64¶
Produces a MurmurHash3 hash value.
murmurHash3_32(par1, ...) murmurHash3_64(par1, ...)
Arguments
Both functions take a variable number of input parameters. Arguments can be any of the supported data types. For some data types calculated value of hash function may be the same for the same values even if types of arguments differ (integers of different size, named and unnamed Tuple
with the same data, Map
and the corresponding Array(Tuple(key, value))
type with the same data).
Returned value
- The
murmurHash3_32
function returns a UInt32 data type hash value. - The
murmurHash3_64
function returns a UInt64 data type hash value.
Example
SELECT murmurHash3_32(array('e','x','a'), 'mple', 10, toDateTime('2019-06-15 23:00:00')) AS MurmurHash3, toTypeName(MurmurHash3) AS type
┌─MurmurHash3─┬─type───┐ │ 2152717 │ UInt32 │ └─────────────┴────────┘
murmurHash3_128¶
Produces a 128-bit MurmurHash3 hash value.
Syntax
murmurHash3_128(expr)
Arguments
expr
: A list of expressions. String.
Returned value
A 128-bit MurmurHash3
hash value. FixedString(16).
Example
Query:
SELECT hex(murmurHash3_128('foo', 'foo', 'foo'))
Result:
┌─hex(murmurHash3_128('foo', 'foo', 'foo'))─┐ │ F8F7AD9B6CD4CF117A71E277E2EC2931 │ └───────────────────────────────────────────┘
xxh3¶
Produces a 64-bit xxh3 hash value.
Syntax
xxh3(expr)
Arguments
expr
: A list of expressions of any data type.
Returned value
A 64-bit xxh3
hash value. UInt64.
Example
Query:
SELECT xxh3('Hello', 'world')
Result:
┌─xxh3('Hello', 'world')─┐ │ 5607458076371731292 │ └────────────────────────┘
xxHash32, xxHash64¶
Calculates xxHash
from a string. It is proposed in two flavors, 32 and 64 bits.
SELECT xxHash32('') OR SELECT xxHash64('')
Returned value
- Hash value. UInt32/64.
The return type will be UInt32
for xxHash32
and UInt64
for xxHash64
.
Example
Query:
SELECT xxHash32('Hello, world!')
Result:
┌─xxHash32('Hello, world!')─┐ │ 834093149 │ └───────────────────────────┘
ngramSimHash¶
Splits a ASCII string into n-grams of ngramsize
symbols and returns the n-gram simhash
. Is case sensitive.
Can be used for detection of semi-duplicate strings with bitHammingDistance. The smaller is the Hamming Distance of the calculated simhashes
of two strings, the more likely these strings are the same.
Syntax
ngramSimHash(string[, ngramsize])
Arguments
string
: String. String.ngramsize
: The size of an n-gram. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.
Returned value
- Hash value. UInt64.
Example
Query:
SELECT ngramSimHash('Bazinga') AS Hash
Result:
┌───────Hash─┐ │ 1627567969 │ └────────────┘
ngramSimHashCaseInsensitive¶
Splits a ASCII string into n-grams of ngramsize
symbols and returns the n-gram simhash
. Is case insensitive.
Can be used for detection of semi-duplicate strings with bitHammingDistance. The smaller is the Hamming Distance of the calculated simhashes
of two strings, the more likely these strings are the same.
Syntax
ngramSimHashCaseInsensitive(string[, ngramsize])
Arguments
string
: String. String.ngramsize
: The size of an n-gram. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.
Returned value
- Hash value. UInt64.
Example
Query:
SELECT ngramSimHashCaseInsensitive('Bazinga') AS Hash
Result:
┌──────Hash─┐ │ 562180645 │ └───────────┘
ngramSimHashUTF8¶
Splits a UTF-8 string into n-grams of ngramsize
symbols and returns the n-gram simhash
. Is case sensitive.
Can be used for detection of semi-duplicate strings with bitHammingDistance. The smaller is the Hamming Distance of the calculated simhashes
of two strings, the more likely these strings are the same.
Syntax
ngramSimHashUTF8(string[, ngramsize])
Arguments
string
: String. String.ngramsize
: The size of an n-gram. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.
Returned value
- Hash value. UInt64.
Example
Query:
SELECT ngramSimHashUTF8('Bazinga') AS Hash
Result:
┌───────Hash─┐ │ 1628157797 │ └────────────┘
ngramSimHashCaseInsensitiveUTF8¶
Splits a UTF-8 string into n-grams of ngramsize
symbols and returns the n-gram simhash
. Is case insensitive.
Can be used for detection of semi-duplicate strings with bitHammingDistance. The smaller is the Hamming Distance of the calculated simhashes
of two strings, the more likely these strings are the same.
Syntax
ngramSimHashCaseInsensitiveUTF8(string[, ngramsize])
Arguments
string
: String. String.ngramsize
: The size of an n-gram. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.
Returned value
- Hash value. UInt64.
Example
Query:
SELECT ngramSimHashCaseInsensitiveUTF8('Bazinga') AS Hash
Result:
┌───────Hash─┐ │ 1636742693 │ └────────────┘
wordShingleSimHash¶
Splits a ASCII string into parts (shingles) of shinglesize
words and returns the word shingle simhash
. Is case sensitive.
Can be used for detection of semi-duplicate strings with bitHammingDistance. The smaller is the Hamming Distance of the calculated simhashes
of two strings, the more likely these strings are the same.
Syntax
wordShingleSimHash(string[, shinglesize])
Arguments
string
: String. String.shinglesize
: The size of a word shingle. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.
Returned value
- Hash value. UInt64.
Example
Query:
SELECT wordShingleSimHash('Bazinga® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).') AS Hash
Result:
┌───────Hash─┐ │ 2328277067 │ └────────────┘
wordShingleSimHashCaseInsensitive¶
Splits a ASCII string into parts (shingles) of shinglesize
words and returns the word shingle simhash
. Is case insensitive.
Can be used for detection of semi-duplicate strings with bitHammingDistance. The smaller is the Hamming Distance of the calculated simhashes
of two strings, the more likely these strings are the same.
Syntax
wordShingleSimHashCaseInsensitive(string[, shinglesize])
Arguments
string
: String. String.shinglesize
: The size of a word shingle. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.
Returned value
- Hash value. UInt64.
Example
Query:
SELECT wordShingleSimHashCaseInsensitive('Bazinga® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).') AS Hash
Result:
┌───────Hash─┐ │ 2194812424 │ └────────────┘
wordShingleSimHashUTF8¶
Splits a UTF-8 string into parts (shingles) of shinglesize
words and returns the word shingle simhash
. Is case sensitive.
Can be used for detection of semi-duplicate strings with bitHammingDistance. The smaller is the Hamming Distance of the calculated simhashes
of two strings, the more likely these strings are the same.
Syntax
wordShingleSimHashUTF8(string[, shinglesize])
Arguments
string
: String. String.shinglesize
: The size of a word shingle. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.
Returned value
- Hash value. UInt64.
Example
Query:
SELECT wordShingleSimHashUTF8('Bazinga® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).') AS Hash
Result:
┌───────Hash─┐ │ 2328277067 │ └────────────┘
wordShingleSimHashCaseInsensitiveUTF8¶
Splits a UTF-8 string into parts (shingles) of shinglesize
words and returns the word shingle simhash
. Is case insensitive.
Can be used for detection of semi-duplicate strings with bitHammingDistance. The smaller is the Hamming Distance of the calculated simhashes
of two strings, the more likely these strings are the same.
Syntax
wordShingleSimHashCaseInsensitiveUTF8(string[, shinglesize])
Arguments
string
: String. String.shinglesize
: The size of a word shingle. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.
Returned value
- Hash value. UInt64.
Example
Query:
SELECT wordShingleSimHashCaseInsensitiveUTF8('Bazinga® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).') AS Hash
Result:
┌───────Hash─┐ │ 2194812424 │ └────────────┘
wyHash64¶
Produces a 64-bit wyHash64 hash value.
Syntax
wyHash64(string)
Arguments
string
: String. String.
Returned value
- Hash value. UInt64.
Example
Query:
SELECT wyHash64('Bazinga') AS Hash
Result:
┌─────────────────Hash─┐ │ 12336419557878201794 │ └──────────────────────┘
ngramMinHash¶
Splits a ASCII string into n-grams of ngramsize
symbols and calculates hash values for each n-gram. Uses hashnum
minimum hashes to calculate the minimum hash and hashnum
maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. Is case sensitive.
Can be used for detection of semi-duplicate strings with tupleHammingDistance. For two strings: if one of the returned hashes is the same for both strings, Tinybird thinks that those strings are the same.
Syntax
ngramMinHash(string[, ngramsize, hashnum])
Arguments
string
: String. String.ngramsize
: The size of an n-gram. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.hashnum
: The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from1
to25
. Default value:6
. UInt8.
Returned value
- Tuple with two hashes: the minimum and the maximum. Tuple(UInt64, UInt64).
Example
Query:
SELECT ngramMinHash('Bazinga') AS Tuple
Result:
┌─Tuple──────────────────────────────────────┐ │ (18333312859352735453,9054248444481805918) │ └────────────────────────────────────────────┘
ngramMinHashCaseInsensitive¶
Splits a ASCII string into n-grams of ngramsize
symbols and calculates hash values for each n-gram. Uses hashnum
minimum hashes to calculate the minimum hash and hashnum
maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. Is case insensitive.
Can be used for detection of semi-duplicate strings with tupleHammingDistance. For two strings: if one of the returned hashes is the same for both strings, Tinybird thinks that those strings are the same.
Syntax
ngramMinHashCaseInsensitive(string[, ngramsize, hashnum])
Arguments
string
: String. String.ngramsize
: The size of an n-gram. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.hashnum
: The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from1
to25
. Default value:6
. UInt8.
Returned value
- Tuple with two hashes: the minimum and the maximum. Tuple(UInt64, UInt64).
Example
Query:
SELECT ngramMinHashCaseInsensitive('Bazinga') AS Tuple
Result:
┌─Tuple──────────────────────────────────────┐ │ (2106263556442004574,13203602793651726206) │ └────────────────────────────────────────────┘
ngramMinHashUTF8¶
Splits a UTF-8 string into n-grams of ngramsize
symbols and calculates hash values for each n-gram. Uses hashnum
minimum hashes to calculate the minimum hash and hashnum
maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. Is case sensitive.
Can be used for detection of semi-duplicate strings with tupleHammingDistance. For two strings: if one of the returned hashes is the same for both strings, Tinybird thinks that those strings are the same.
Syntax
ngramMinHashUTF8(string[, ngramsize, hashnum])
Arguments
string
: String. String.ngramsize
: The size of an n-gram. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.hashnum
: The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from1
to25
. Default value:6
. UInt8.
Returned value
- Tuple with two hashes: the minimum and the maximum. Tuple(UInt64, UInt64).
Example
Query:
SELECT ngramMinHashUTF8('Bazinga') AS Tuple
Result:
┌─Tuple──────────────────────────────────────┐ │ (18333312859352735453,6742163577938632877) │ └────────────────────────────────────────────┘
ngramMinHashCaseInsensitiveUTF8¶
Splits a UTF-8 string into n-grams of ngramsize
symbols and calculates hash values for each n-gram. Uses hashnum
minimum hashes to calculate the minimum hash and hashnum
maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. Is case insensitive.
Can be used for detection of semi-duplicate strings with tupleHammingDistance. For two strings: if one of the returned hashes is the same for both strings, Tinybird thinks that those strings are the same.
Syntax
ngramMinHashCaseInsensitiveUTF8(string [, ngramsize, hashnum])
Arguments
string
: String. String.ngramsize
: The size of an n-gram. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.hashnum
: The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from1
to25
. Default value:6
. UInt8.
Returned value
- Tuple with two hashes: the minimum and the maximum. Tuple(UInt64, UInt64).
Example
Query:
SELECT ngramMinHashCaseInsensitiveUTF8('Bazinga') AS Tuple
Result:
┌─Tuple───────────────────────────────────────┐ │ (12493625717655877135,13203602793651726206) │ └─────────────────────────────────────────────┘
ngramMinHashArg¶
Splits a ASCII string into n-grams of ngramsize
symbols and returns the n-grams with minimum and maximum hashes, calculated by the ngramMinHash function with the same input. Is case sensitive.
Syntax
ngramMinHashArg(string[, ngramsize, hashnum])
Arguments
string
: String. String.ngramsize
: The size of an n-gram. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.hashnum
: The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from1
to25
. Default value:6
. UInt8.
Returned value
- Tuple with two tuples with
hashnum
n-grams each. Tuple(Tuple(String), Tuple(String)).
Example
Query:
SELECT ngramMinHashArg('Bazinga') AS Tuple
Result:
┌─Tuple─────────────────────────────────────────────────────────────────────────┐ │ (('ous','ick','lic','Hou','kHo','use'),('Hou','lic','ick','ous','ckH','Cli')) │ └───────────────────────────────────────────────────────────────────────────────┘
ngramMinHashArgCaseInsensitive¶
Splits a ASCII string into n-grams of ngramsize
symbols and returns the n-grams with minimum and maximum hashes, calculated by the ngramMinHashCaseInsensitive function with the same input. Is case insensitive.
Syntax
ngramMinHashArgCaseInsensitive(string[, ngramsize, hashnum])
Arguments
string
: String. String.ngramsize
: The size of an n-gram. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.hashnum
: The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from1
to25
. Default value:6
. UInt8.
Returned value
- Tuple with two tuples with
hashnum
n-grams each. Tuple(Tuple(String), Tuple(String)).
Example
Query:
SELECT ngramMinHashArgCaseInsensitive('Bazinga') AS Tuple
Result:
┌─Tuple─────────────────────────────────────────────────────────────────────────┐ │ (('ous','ick','lic','kHo','use','Cli'),('kHo','lic','ick','ous','ckH','Hou')) │ └───────────────────────────────────────────────────────────────────────────────┘
ngramMinHashArgUTF8¶
Splits a UTF-8 string into n-grams of ngramsize
symbols and returns the n-grams with minimum and maximum hashes, calculated by the ngramMinHashUTF8 function with the same input. Is case sensitive.
Syntax
ngramMinHashArgUTF8(string[, ngramsize, hashnum])
Arguments
string
: String. String.ngramsize
: The size of an n-gram. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.hashnum
: The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from1
to25
. Default value:6
. UInt8.
Returned value
- Tuple with two tuples with
hashnum
n-grams each. Tuple(Tuple(String), Tuple(String)).
Example
Query:
SELECT ngramMinHashArgUTF8('Bazinga') AS Tuple
Result:
┌─Tuple─────────────────────────────────────────────────────────────────────────┐ │ (('ous','ick','lic','Hou','kHo','use'),('kHo','Hou','lic','ick','ous','ckH')) │ └───────────────────────────────────────────────────────────────────────────────┘
ngramMinHashArgCaseInsensitiveUTF8¶
Splits a UTF-8 string into n-grams of ngramsize
symbols and returns the n-grams with minimum and maximum hashes, calculated by the ngramMinHashCaseInsensitiveUTF8 function with the same input. Is case insensitive.
Syntax
ngramMinHashArgCaseInsensitiveUTF8(string[, ngramsize, hashnum])
Arguments
string
: String. String.ngramsize
: The size of an n-gram. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.hashnum
: The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from1
to25
. Default value:6
. UInt8.
Returned value
- Tuple with two tuples with
hashnum
n-grams each. Tuple(Tuple(String), Tuple(String)).
Example
Query:
SELECT ngramMinHashArgCaseInsensitiveUTF8('Bazinga') AS Tuple
Result:
┌─Tuple─────────────────────────────────────────────────────────────────────────┐ │ (('ckH','ous','ick','lic','kHo','use'),('kHo','lic','ick','ous','ckH','Hou')) │ └───────────────────────────────────────────────────────────────────────────────┘
wordShingleMinHash¶
Splits a ASCII string into parts (shingles) of shinglesize
words and calculates hash values for each word shingle. Uses hashnum
minimum hashes to calculate the minimum hash and hashnum
maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. Is case sensitive.
Can be used for detection of semi-duplicate strings with tupleHammingDistance. For two strings: if one of the returned hashes is the same for both strings, Tinybird thinks that those strings are the same.
Syntax
wordShingleMinHash(string[, shinglesize, hashnum])
Arguments
string
: String. String.shinglesize
: The size of a word shingle. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.hashnum
: The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from1
to25
. Default value:6
. UInt8.
Returned value
- Tuple with two hashes: the minimum and the maximum. Tuple(UInt64, UInt64).
Example
Query:
SELECT wordShingleMinHash('Bazinga® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).') AS Tuple
Result:
┌─Tuple──────────────────────────────────────┐ │ (16452112859864147620,5844417301642981317) │ └────────────────────────────────────────────┘
wordShingleMinHashCaseInsensitive¶
Splits a ASCII string into parts (shingles) of shinglesize
words and calculates hash values for each word shingle. Uses hashnum
minimum hashes to calculate the minimum hash and hashnum
maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. Is case insensitive.
Can be used for detection of semi-duplicate strings with tupleHammingDistance. For two strings: if one of the returned hashes is the same for both strings, Tinybird thinks that those strings are the same.
Syntax
wordShingleMinHashCaseInsensitive(string[, shinglesize, hashnum])
Arguments
string
: String. String.shinglesize
: The size of a word shingle. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.hashnum
: The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from1
to25
. Default value:6
. UInt8.
Returned value
- Tuple with two hashes: the minimum and the maximum. Tuple(UInt64, UInt64).
Example
Query:
SELECT wordShingleMinHashCaseInsensitive('Bazinga® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).') AS Tuple
Result:
┌─Tuple─────────────────────────────────────┐ │ (3065874883688416519,1634050779997673240) │ └───────────────────────────────────────────┘
wordShingleMinHashUTF8¶
Splits a UTF-8 string into parts (shingles) of shinglesize
words and calculates hash values for each word shingle. Uses hashnum
minimum hashes to calculate the minimum hash and hashnum
maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. Is case sensitive.
Can be used for detection of semi-duplicate strings with tupleHammingDistance. For two strings: if one of the returned hashes is the same for both strings, Tinybird thinks that those strings are the same.
Syntax
wordShingleMinHashUTF8(string[, shinglesize, hashnum])
Arguments
string
: String. String.shinglesize
: The size of a word shingle. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.hashnum
: The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from1
to25
. Default value:6
. UInt8.
Returned value
- Tuple with two hashes: the minimum and the maximum. Tuple(UInt64, UInt64).
Example
Query:
SELECT wordShingleMinHashUTF8('Bazinga® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).') AS Tuple
Result:
┌─Tuple──────────────────────────────────────┐ │ (16452112859864147620,5844417301642981317) │ └────────────────────────────────────────────┘
wordShingleMinHashCaseInsensitiveUTF8¶
Splits a UTF-8 string into parts (shingles) of shinglesize
words and calculates hash values for each word shingle. Uses hashnum
minimum hashes to calculate the minimum hash and hashnum
maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. Is case insensitive.
Can be used for detection of semi-duplicate strings with tupleHammingDistance. For two strings: if one of the returned hashes is the same for both strings, Tinybird thinks that those strings are the same.
Syntax
wordShingleMinHashCaseInsensitiveUTF8(string[, shinglesize, hashnum])
Arguments
string
: String. String.shinglesize
: The size of a word shingle. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.hashnum
: The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from1
to25
. Default value:6
. UInt8.
Returned value
- Tuple with two hashes: the minimum and the maximum. Tuple(UInt64, UInt64).
Example
Query:
SELECT wordShingleMinHashCaseInsensitiveUTF8('Bazinga® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).') AS Tuple
Result:
┌─Tuple─────────────────────────────────────┐ │ (3065874883688416519,1634050779997673240) │ └───────────────────────────────────────────┘
wordShingleMinHashArg¶
Splits a ASCII string into parts (shingles) of shinglesize
words each and returns the shingles with minimum and maximum word hashes, calculated by the wordshingleMinHash function with the same input. Is case sensitive.
Syntax
wordShingleMinHashArg(string[, shinglesize, hashnum])
Arguments
string
: String. String.shinglesize
: The size of a word shingle. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.hashnum
: The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from1
to25
. Default value:6
. UInt8.
Returned value
- Tuple with two tuples with
hashnum
word shingles each. Tuple(Tuple(String), Tuple(String)).
Example
Query:
SELECT wordShingleMinHashArg('Bazinga® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).', 1, 3) AS Tuple
Result:
┌─Tuple─────────────────────────────────────────────────────────────────┐ │ (('OLAP','database','analytical'),('online','oriented','processing')) │ └───────────────────────────────────────────────────────────────────────┘
wordShingleMinHashArgCaseInsensitive¶
Splits a ASCII string into parts (shingles) of shinglesize
words each and returns the shingles with minimum and maximum word hashes, calculated by the wordShingleMinHashCaseInsensitive function with the same input. Is case insensitive.
Syntax
wordShingleMinHashArgCaseInsensitive(string[, shinglesize, hashnum])
Arguments
string
: String. String.shinglesize
: The size of a word shingle. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.hashnum
: The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from1
to25
. Default value:6
. UInt8.
Returned value
- Tuple with two tuples with
hashnum
word shingles each. Tuple(Tuple(String), Tuple(String)).
Example
Query:
SELECT wordShingleMinHashArgCaseInsensitive('Bazinga® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).', 1, 3) AS Tuple
Result:
┌─Tuple──────────────────────────────────────────────────────────────────┐ │ (('queries','database','analytical'),('oriented','processing','DBMS')) │ └────────────────────────────────────────────────────────────────────────┘
wordShingleMinHashArgUTF8¶
Splits a UTF-8 string into parts (shingles) of shinglesize
words each and returns the shingles with minimum and maximum word hashes, calculated by the wordShingleMinHashUTF8 function with the same input. Is case sensitive.
Syntax
wordShingleMinHashArgUTF8(string[, shinglesize, hashnum])
Arguments
string
: String. String.shinglesize
: The size of a word shingle. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.hashnum
: The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from1
to25
. Default value:6
. UInt8.
Returned value
- Tuple with two tuples with
hashnum
word shingles each. Tuple(Tuple(String), Tuple(String)).
Example
Query:
SELECT wordShingleMinHashArgUTF8('Bazinga® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).', 1, 3) AS Tuple
Result:
┌─Tuple─────────────────────────────────────────────────────────────────┐ │ (('OLAP','database','analytical'),('online','oriented','processing')) │ └───────────────────────────────────────────────────────────────────────┘
wordShingleMinHashArgCaseInsensitiveUTF8¶
Splits a UTF-8 string into parts (shingles) of shinglesize
words each and returns the shingles with minimum and maximum word hashes, calculated by the wordShingleMinHashCaseInsensitiveUTF8 function with the same input. Is case insensitive.
Syntax
wordShingleMinHashArgCaseInsensitiveUTF8(string[, shinglesize, hashnum])
Arguments
string
: String. String.shinglesize
: The size of a word shingle. Optional. Possible values: any number from1
to25
. Default value:3
. UInt8.hashnum
: The number of minimum and maximum hashes used to calculate the result. Optional. Possible values: any number from1
to25
. Default value:6
. UInt8.
Returned value
- Tuple with two tuples with
hashnum
word shingles each. Tuple(Tuple(String), Tuple(String)).
Example
Query:
SELECT wordShingleMinHashArgCaseInsensitiveUTF8('Bazinga® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP).', 1, 3) AS Tuple
Result:
┌─Tuple──────────────────────────────────────────────────────────────────┐ │ (('queries','database','analytical'),('oriented','processing','DBMS')) │ └────────────────────────────────────────────────────────────────────────┘
sqidEncode¶
Encodes numbers as a Sqid which is a YouTube-like ID string. The output alphabet is abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
. Do not use this function for hashing - the generated IDs can be decoded back into the original numbers.
Syntax
sqidEncode(number1, ...)
Alias: sqid
Arguments
- A variable number of UInt8, UInt16, UInt32 or UInt64 numbers.
Returned value
A sqid String.
Example
SELECT sqidEncode(1, 2, 3, 4, 5)
┌─sqidEncode(1, 2, 3, 4, 5)─┐ │ gXHfJ1C6dN │ └───────────────────────────┘
sqidDecode¶
Decodes a Sqid back into its original numbers. Returns an empty array in case the input string isn't a valid sqid.
Syntax
sqidDecode(sqid)
Arguments
- A sqid - String
Returned value
The sqid transformed to numbers Array(UInt64).
Example
SELECT sqidDecode('gXHfJ1C6dN')
┌─sqidDecode('gXHfJ1C6dN')─┐ │ [1,2,3,4,5] │ └──────────────────────────┘