JSON data type
BETA
Stores JavaScript Object Notation (JSON) documents in a single column.
To declare a column of JSON type, use the following syntax:
<column_name> JSON([max_dynamic_paths=N, max_dynamic_types=M, some.path TypeName, SKIP path.to.skip, SKIP REGEXP 'paths_regexp']) [jsonPath]
Where:

- some.path TypeName is an optional type hint for a particular path in the JSON. Such paths are always stored as subcolumns with the specified type.
- SKIP path.to.skip is an optional hint for a particular path that should be skipped during JSON parsing. Such paths are never stored in the JSON column. If the specified path is a nested JSON object, the whole nested object is skipped.
- SKIP REGEXP 'paths_regexp' is an optional hint with a regular expression used to skip paths during JSON parsing. All paths matching this regular expression are never stored in the JSON column.
- max_dynamic_paths is an optional parameter indicating how many paths can be stored separately as subcolumns across a single block of data that is stored separately (for example, across a single data part of a MergeTree table). If this limit is exceeded, all other paths are stored together in a single structure. The default value of max_dynamic_paths is 1024.
- max_dynamic_types is an optional parameter between 1 and 255 indicating how many different data types can be stored inside a single path column with type Dynamic across a single block of data that is stored separately (for example, across a single data part of a MergeTree table). If this limit is exceeded, all new types are converted to type String. The default value of max_dynamic_types is 32.
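For example, a declaration combining these hints could look like the following. This is an illustrative sketch: the path names, limits, and regular expression are hypothetical, not defaults.

-- Pins a.b to UInt32, skips a.e and any path starting with "tmp.", and lowers both limits
`json` JSON(max_dynamic_paths=512, max_dynamic_types=16, a.b UInt32, SKIP a.e, SKIP REGEXP '^tmp\\..*') `json:$`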
Creating a Data Source with JSON columns
You can create a data source with JSON columns by using the JSON data type.
test.datasource
SCHEMA >
    `json` JSON `json:$`

ENGINE "MergeTree"
ENGINE_SORTING_KEY "tuple()"
tb push datasources/test.datasource
echo '{"a" : {"b" : 42}, "c" : [1, 2, 3]}\n{"f" : "Hello, World!"}\n{"a" : {"b" : 43, "e" : 10}, "c" : [4, 5, 6]}' > test.ndjson
tb datasource append test test.ndjson
tb sql 'select * from test'

---------------------------------------------
| json                                      |
---------------------------------------------
| {'a': {'b': 42}, 'c': [1, 2, 3]}          |
| {'f': 'Hello, World!'}                    |
| {'a': {'b': 43, 'e': 10}, 'c': [4, 5, 6]} |
---------------------------------------------
Specifying paths in the type declaration forces the data source to contain them:
test.datasource
SCHEMA >
    `json` JSON(a.b UInt32, SKIP a.e) `json:$`

ENGINE "MergeTree"
ENGINE_SORTING_KEY "tuple()"
tb push datasources/test.datasource
echo '{"a" : {"b" : 42}, "c" : [1, 2, 3]}\n{"f" : "Hello, World!"}\n{"a" : {"b" : 43, "e" : 10}, "c" : [4, 5, 6]}' > test.ndjson
tb datasource append test test.ndjson
tb sql 'select * from test'

-----------------------------------------
| json                                  |
-----------------------------------------
| {'a': {'b': 42}, 'c': [1, 2, 3]}      |
| {'a': {'b': 0}, 'f': 'Hello, World!'} |
| {'a': {'b': 43}, 'c': [4, 5, 6]}      |
-----------------------------------------
For more details, check Ingesting NDJSON data.
Creating JSON in a Query
Using CAST from String:
SELECT '{"a" : {"b" : 42},"c" : [1, 2, 3], "d" : "Hello, World!"}'::JSON AS json
┌─json───────────────────────────────────────────┐
│ {"a":{"b":42},"c":[1,2,3],"d":"Hello, World!"} │
└────────────────────────────────────────────────┘
To CAST from Tuple or Map to JSON, first CAST the column into a String column containing JSON objects, then deserialize it back into a JSON type column.
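For example, a minimal sketch that serializes a named Tuple with the standard toJSONString function and casts the resulting String back to JSON:

-- Serialize a named Tuple to a JSON String, then cast the String to JSON.
-- Should return {"b":42,"d":"Hello, World!"}
SELECT toJSONString(CAST((42, 'Hello, World!'), 'Tuple(b UInt32, d String)'))::JSON AS json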
Reading JSON paths as subcolumns
The JSON type supports reading every path as a separate subcolumn. If the type of the requested path was not specified in the JSON type declaration, the subcolumn of the path always has type Dynamic.
test.datasource
SCHEMA >
    `json` JSON(a.b UInt32, SKIP a.e) `json:$`

ENGINE "MergeTree"
ENGINE_SORTING_KEY "tuple()"
tb push datasources/test.datasource
echo '{"a" : {"b" : 42, "g" : 42.42}, "c" : [1, 2, 3], "d" : "2020-01-01"}\n{"f" : "Hello, World!", "d" : "2020-01-02"}\n{"a" : {"b" : 43, "e" : 10, "g" : 43.43}, "c" : [4, 5, 6]}' > test.ndjson
tb datasource append test test.ndjson
For example:
SELECT json.a.b, json.a.g, json.c, json.d FROM test
┌─json.a.b─┬─json.a.g─┬─json.c──┬─json.d─────┐
│       42 │ 42.42    │ [1,2,3] │ 2020-01-01 │
│        0 │ ᴺᵁᴸᴸ     │ ᴺᵁᴸᴸ    │ 2020-01-02 │
│       43 │ 43.43    │ [4,5,6] │ ᴺᵁᴸᴸ       │
└──────────┴──────────┴─────────┴────────────┘
If you ever find any issue with dot notation, try the getSubcolumn(json, 'path') syntax:
SELECT getSubcolumn(json, 'a.b'), getSubcolumn(json, 'a.g'), getSubcolumn(json, 'c'), getSubcolumn(json, 'd') FROM test
┌─getSubcolumn(json, 'a.b')─┬─getSubcolumn(json, 'a.g')─┬─getSubcolumn(json, 'c')─┬─getSubcolumn(json, 'd')─────┐
│                        42 │ 42.42                     │ [1,2,3]                 │ 2020-01-01                  │
│                         0 │ ᴺᵁᴸᴸ                      │ ᴺᵁᴸᴸ                    │ 2020-01-02                  │
│                        43 │ 43.43                     │ [4,5,6]                 │ ᴺᵁᴸᴸ                        │
└───────────────────────────┴───────────────────────────┴─────────────────────────┴─────────────────────────────┘
If the requested path wasn't found in the data, it's filled with NULL values:
SELECT json.non.existing.path FROM test
┌─json.non.existing.path─┐
│ ᴺᵁᴸᴸ                   │
│ ᴺᵁᴸᴸ                   │
│ ᴺᵁᴸᴸ                   │
└────────────────────────┘
You can check the data types of returned subcolumns:
SELECT toTypeName(json.a.b), toTypeName(json.a.g), toTypeName(json.c), toTypeName(json.d) FROM test
┌─toTypeName(json.a.b)─┬─toTypeName(json.a.g)─┬─toTypeName(json.c)─┬─toTypeName(json.d)─┐
│ UInt32               │ Dynamic              │ Dynamic            │ Dynamic            │
│ UInt32               │ Dynamic              │ Dynamic            │ Dynamic            │
│ UInt32               │ Dynamic              │ Dynamic            │ Dynamic            │
└──────────────────────┴──────────────────────┴────────────────────┴────────────────────┘
For a.b the type is UInt32, as we specified in the JSON type declaration, and for all other subcolumns the type is Dynamic.
You can also read subcolumns of a Dynamic type using the special syntax json.some.path.:TypeName:
SELECT json.a.g.:Float64, dynamicType(json.a.g), json.d.:Date, dynamicType(json.d) FROM test

┌─json.a.g.:`Float64`─┬─dynamicType(json.a.g)─┬─json.d.:`Date`─┬─dynamicType(json.d)─┐
│               42.42 │ Float64               │     2020-01-01 │ Date                │
│                ᴺᵁᴸᴸ │ None                  │     2020-01-02 │ Date                │
│               43.43 │ Float64               │           ᴺᵁᴸᴸ │ None                │
└─────────────────────┴───────────────────────┴────────────────┴─────────────────────┘
Dynamic subcolumns can be cast to any data type. An exception is thrown if the internal type inside Dynamic cannot be cast to the requested type:
SELECT json.a.g::UInt64 AS uint FROM test

┌─uint─┐
│   42 │
│    0 │
│   43 │
└──────┘
SELECT json.a.g::UUID AS uuid FROM test

[Error] Conversion between numeric types and UUID is not supported. Probably the passed UUID is unquoted: while executing 'FUNCTION CAST(json.a.g :: 0, 'UUID' :: 1) -> CAST(json.a.g, 'UUID') UUID : 2'. (NOT_IMPLEMENTED)
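To avoid the exception instead of handling it, one option is to filter on dynamicType, shown earlier, so that only rows holding a castable type are converted. A sketch:

-- Cast only the rows where the Dynamic value actually holds a Float64.
SELECT json.a.g::Float64 AS g FROM test WHERE dynamicType(json.a.g) = 'Float64'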
Reading JSON sub-objects as subcolumns
The JSON type supports reading nested objects as subcolumns with type JSON using the special syntax json.^some.path:
test.datasource
SCHEMA >
    `json` JSON `json:$`

ENGINE "MergeTree"
ENGINE_SORTING_KEY "tuple()"
tb push datasources/test.datasource
echo '{"a" : {"b" : {"c" : 42, "g" : 42.42}}, "c" : [1, 2, 3], "d" : {"e" : {"f" : {"g" : "Hello, World", "h" : [1, 2, 3]}}}}\n{"f" : "Hello, World!", "d" : {"e" : {"f" : {"h" : [4, 5, 6]}}}}\n{"a" : {"b" : {"c" : 43, "e" : 10, "g" : 43.43}}, "c" : [4, 5, 6]}' > test.ndjson
tb datasource append test test.ndjson
For example:
SELECT json.^a.b, json.^d.e.f FROM test
┌─json.^`a`.b───────────────┬─json.^`d`.e.f────────────────────┐
│ {"c":42,"g":42.42}        │ {"g":"Hello, World","h":[1,2,3]} │
│ {}                        │ {"h":[4,5,6]}                    │
│ {"c":43,"e":10,"g":43.43} │ {}                               │
└───────────────────────────┴──────────────────────────────────┘
Reading sub-objects as subcolumns may be inefficient, as it can require an almost full scan of the JSON data.
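If you only need a few leaf fields, reading them directly as subcolumns avoids materializing the whole sub-object. A sketch against the data source above:

-- Reads two concrete paths instead of the full ^-prefixed sub-object.
SELECT json.a.b.c, json.a.b.g FROM test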
Handling arrays of JSON objects
JSON paths that contain an array of objects are parsed as type Array(JSON) and inserted into the Dynamic column for that path. To read an array of objects, you can extract it from the Dynamic column as a subcolumn:
test.datasource
SCHEMA >
    `json` JSON `json:$`

ENGINE "MergeTree"
ENGINE_SORTING_KEY "tuple()"
tb push datasources/test.datasource
echo '{"a" : {"b" : [{"c" : 42, "d" : "Hello", "f" : [[{"g" : 42.42}]], "k" : {"j" : 1000}}, {"c" : 43}, {"e" : [1, 2, 3], "d" : "My", "f" : [[{"g" : 43.43, "h" : "2020-01-01"}]], "k" : {"j" : 2000}}]}}\n{"a" : {"b" : [1, 2, 3]}}\n{"a" : {"b" : [{"c" : 44, "f" : [[{"h" : "2020-01-02"}]]}, {"e" : [4, 5, 6], "d" : "World", "f" : [[{"g" : 44.44}]], "k" : {"j" : 3000}}]}}' > test.ndjson
tb datasource append test test.ndjson
For example:
SELECT json.a.b, dynamicType(json.a.b) FROM test
┌─json.a.b──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─dynamicType(json.a.b)────────────────────────────────────┐
│ ['{"c":"42","d":"Hello","f":[[{"g":42.42}]],"k":{"j":"1000"}}','{"c":"43"}','{"d":"My","e":["1","2","3"],"f":[[{"g":43.43,"h":"2020-01-01"}]],"k":{"j":"2000"}}'] │ Array(JSON(max_dynamic_types=16, max_dynamic_paths=256)) │
│ [1,2,3]                                                                                                                                                           │ Array(Nullable(Int64))                                   │
│ ['{"c":"44","f":[[{"h":"2020-01-02"}]]}','{"d":"World","e":["4","5","6"],"f":[[{"g":44.44}]],"k":{"j":"3000"}}']                                                  │ Array(JSON(max_dynamic_types=16, max_dynamic_paths=256)) │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────┘
The max_dynamic_types/max_dynamic_paths parameters of the nested JSON type are reduced compared to the default values. This prevents the number of subcolumns from growing uncontrollably on nested arrays of JSON objects.
When reading subcolumns from a nested JSON column:
SELECT json.a.b.:`Array(JSON)`.c, json.a.b.:`Array(JSON)`.f, json.a.b.:`Array(JSON)`.d FROM test
┌─json.a.b.:`Array(JSON)`.c─┬─json.a.b.:`Array(JSON)`.f───────────────────────────────────┬─json.a.b.:`Array(JSON)`.d─┐
│ [42,43,NULL]              │ [[['{"g":42.42}']],NULL,[['{"g":43.43,"h":"2020-01-01"}']]] │ ['Hello',NULL,'My']       │
│ []                        │ []                                                          │ []                        │
│ [44,NULL]                 │ [[['{"h":"2020-01-02"}']],[['{"g":44.44}']]]                │ [NULL,'World']            │
└───────────────────────────┴─────────────────────────────────────────────────────────────┴───────────────────────────┘
You can avoid writing the Array(JSON) subcolumn name using the following syntax:
SELECT json.a.b[].c, json.a.b[].f, json.a.b[].d FROM test
┌─json.a.b.:`Array(JSON)`.c─┬─json.a.b.:`Array(JSON)`.f───────────────────────────────────┬─json.a.b.:`Array(JSON)`.d─┐
│ [42,43,NULL]              │ [[['{"g":42.42}']],NULL,[['{"g":43.43,"h":"2020-01-01"}']]] │ ['Hello',NULL,'My']       │
│ []                        │ []                                                          │ []                        │
│ [44,NULL]                 │ [[['{"h":"2020-01-02"}']],[['{"g":44.44}']]]                │ [NULL,'World']            │
└───────────────────────────┴─────────────────────────────────────────────────────────────┴───────────────────────────┘
The number of [] after the path indicates the array level: json.path[][] is transformed to json.path.:Array(Array(JSON)).
To read subcolumns from an Array(JSON) column:
SELECT json.a.b[].c.:Int64, json.a.b[].f[][].g.:Float64, json.a.b[].f[][].h.:Date FROM test
┌─json.a.b.:`Array(JSON)`.c.:`Int64`─┬─json.a.b.:`Array(JSON)`.f.:`Array(Array(JSON))`.g.:`Float64`─┬─json.a.b.:`Array(JSON)`.f.:`Array(Array(JSON))`.h.:`Date`─┐
│ [42,43,NULL]                       │ [[[42.42]],[],[[43.43]]]                                     │ [[[NULL]],[],[['2020-01-01']]]                            │
│ []                                 │ []                                                           │ []                                                        │
│ [44,NULL]                          │ [[[NULL]],[[44.44]]]                                         │ [[['2020-01-02']],[[NULL]]]                               │
└────────────────────────────────────┴──────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘
You can also read sub-object subcolumns from a nested JSON column:
SELECT json.a.b[].^k FROM test
┌─json.a.b.:`Array(JSON)`.^`k`─────────┐
│ ['{"j":"1000"}','{}','{"j":"2000"}'] │
│ []                                   │
│ ['{}','{"j":"3000"}']                │
└──────────────────────────────────────┘