Functions for Splitting Strings¶
splitByChar¶
Splits a string into substrings separated by a specified character. Uses a constant string separator
which consists of exactly one character. Returns an array of selected substrings. Empty substrings may be selected if the separator occurs at the beginning or end of the string, or if there are multiple consecutive separators.
Syntax
splitByChar(separator, s[, max_substrings]))
Arguments
separator
: The separator which should contain exactly one character. String.s
: The string to split. String.max_substrings
: An optionalInt64
defaulting to 0. Ifmax_substrings
> 0, the returned array will contain at mostmax_substrings
substrings, otherwise the function will return as many substrings as possible.
Returned values
- An array of selected substrings. Array(String).
Empty substrings may be selected when:
- A separator occurs at the beginning or end of the string
- There are multiple consecutive separators
- The original string
s
is empty.
Example
SELECT splitByChar(',', '1,2,3,abcde')
Result:
┌─splitByChar(',', '1,2,3,abcde')─┐ │ ['1','2','3','abcde'] │ └─────────────────────────────────┘
splitByString¶
Splits a string into substrings separated by a string. It uses a constant string separator
of multiple characters as the separator. If the string separator
is empty, it will split the string s
into an array of single characters.
Syntax
splitByString(separator, s[, max_substrings]))
Arguments
separator
: The separator. String.s
: The string to split. String.max_substrings
: An optionalInt64
defaulting to 0. Whenmax_substrings
> 0, the returned substrings will be no more thanmax_substrings
, otherwise the function will return as many substrings as possible.
Returned values
- An array of selected substrings. Array(String).
Empty substrings may be selected when:
- A non-empty separator occurs at the beginning or end of the string
- There are multiple consecutive non-empty separators
- The original string
s
is empty while the separator is not empty.
Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings
> 0.
Example
SELECT splitByString(', ', '1, 2 3, 4,5, abcde')
Result:
┌─splitByString(', ', '1, 2 3, 4,5, abcde')─┐ │ ['1','2 3','4,5','abcde'] │ └───────────────────────────────────────────┘
SELECT splitByString('', 'abcde')
Result:
┌─splitByString('', 'abcde')─┐ │ ['a','b','c','d','e'] │ └────────────────────────────┘
splitByRegexp¶
Splits a string into substrings separated by a regular expression. It uses a regular expression string regexp
as the separator. If the regexp
is empty, it will split the string s
into an array of single characters. If no match is found for this regular expression, the string s
won't be split.
Syntax
splitByRegexp(regexp, s[, max_substrings]))
Arguments
regexp
: Regular expression. Constant. String or FixedString.s
: The string to split. String.max_substrings
: An optionalInt64
defaulting to 0. Whenmax_substrings
> 0, the returned substrings will be no more thanmax_substrings
, otherwise the function will return as many substrings as possible.
Returned values
- An array of selected substrings. Array(String).
Empty substrings may be selected when:
- A non-empty regular expression match occurs at the beginning or end of the string
- There are multiple consecutive non-empty regular expression matches
- The original string
s
is empty while the regular expression is not empty.
Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings
> 0.
Example
SELECT splitByRegexp('\\d+', 'a12bc23de345f')
Result:
┌─splitByRegexp('\\d+', 'a12bc23de345f')─┐ │ ['a','bc','de','f'] │ └────────────────────────────────────────┘
SELECT splitByRegexp('', 'abcde')
Result:
┌─splitByRegexp('', 'abcde')─┐ │ ['a','b','c','d','e'] │ └────────────────────────────┘
splitByWhitespace¶
Splits a string into substrings separated by whitespace characters. Returns an array of selected substrings.
Syntax
splitByWhitespace(s[, max_substrings]))
Arguments
s
: The string to split. String.max_substrings
: An optionalInt64
defaulting to 0. Whenmax_substrings
> 0, the returned substrings will be no more thanmax_substrings
, otherwise the function will return as many substrings as possible.
Returned values
- An array of selected substrings. Array(String).
Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings
> 0.
Example
SELECT splitByWhitespace(' 1! a, b. ')
Result:
┌─splitByWhitespace(' 1! a, b. ')─┐ │ ['1!','a,','b.'] │ └─────────────────────────────────────┘
splitByNonAlpha¶
Splits a string into substrings separated by whitespace and punctuation characters. Returns an array of selected substrings.
Syntax
splitByNonAlpha(s[, max_substrings]))
Arguments
s
: The string to split. String.max_substrings
: An optionalInt64
defaulting to 0. Whenmax_substrings
> 0, the returned substrings will be no more thanmax_substrings
, otherwise the function will return as many substrings as possible.
Returned values
- An array of selected substrings. Array(String).
Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings
> 0.
Example
SELECT splitByNonAlpha(' 1! a, b. ')
┌─splitByNonAlpha(' 1! a, b. ')─┐ │ ['1','a','b'] │ └───────────────────────────────────┘
arrayStringConcat¶
Concatenates string representations of values listed in the array with the separator. separator
is an optional parameter: a constant string, set to an empty string by default. Returns the string.
Syntax
arrayStringConcat(arr\[, separator\])
Example
SELECT arrayStringConcat(['12/05/2021', '12:50:00'], ' ') AS DateString
Result:
┌─DateString──────────┐ │ 12/05/2021 12:50:00 │ └─────────────────────┘
alphaTokens¶
Selects substrings of consecutive bytes from the ranges a-z and A-Z.Returns an array of substrings.
Syntax
alphaTokens(s[, max_substrings]))
Alias: splitByAlpha
Arguments
s
: The string to split. String.max_substrings
: An optionalInt64
defaulting to 0. Whenmax_substrings
> 0, the returned substrings will be no more thanmax_substrings
, otherwise the function will return as many substrings as possible.
Returned values
- An array of selected substrings. Array(String).
Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings
> 0.
Example
SELECT alphaTokens('abca1abc')
┌─alphaTokens('abca1abc')─┐ │ ['abca','abc'] │ └─────────────────────────┘
extractAllGroups¶
Extracts all groups from non-overlapping substrings matched by a regular expression.
Syntax
extractAllGroups(text, regexp)
Arguments
text
: String or FixedString.regexp
: Regular expression. Constant. String or FixedString.
Returned values
- If the function finds at least one matching group, it returns
Array(Array(String))
column, clustered by group_id (1 to N, where N is number of capturing groups inregexp
). If there is no matching group, it returns an empty array. Array.
Example
SELECT extractAllGroups('abc=123, 8="hkl"', '("[^"]+"|\\w+)=("[^"]+"|\\w+)')
Result:
┌─extractAllGroups('abc=123, 8="hkl"', '("[^"]+"|\\w+)=("[^"]+"|\\w+)')─┐ │ [['abc','123'],['8','"hkl"']] │ └───────────────────────────────────────────────────────────────────────┘
ngrams¶
Splits a UTF-8 string into n-grams of ngramsize
symbols.
**Syntax **
ngrams(string, ngramsize)
Arguments
string
: String. String or FixedString.ngramsize
: The size of an n-gram. UInt.
Returned values
- Array with n-grams. Array(String).
Example
SELECT ngrams('Tinybird', 3)
Result:
┌─ngrams('Tinybird', 3)─────────────────┐ │ ['Tin','iny','nyb','ybi','bir','ird'] │ └───────────────────────────────────────┘
tokens¶
Splits a string into tokens using non-alphanumeric ASCII characters as separators.
Arguments
input_string
: Any set of bytes represented as the String data type object.
Returned value
- The resulting array of tokens from input string. Array.
Example
SELECT tokens('test1,;\\ test2,;\\ test3,;\\ test4') AS tokens
Result:
┌─tokens────────────────────────────┐ │ ['test1','test2','test3','test4'] │ └───────────────────────────────────┘