JSON data type

BETA

Stores JavaScript Object Notation (JSON) documents in a single column.

To declare a column of JSON type, use the following syntax:

<column_name> JSON([max_dynamic_paths=N, max_dynamic_types=M, some.path TypeName, SKIP path.to.skip, SKIP REGEXP 'paths_regexp']) [jsonPath]

Where:

  • some.path TypeName is an optional type hint for a particular path in the JSON. Such paths are always stored as subcolumns with the specified type.
  • SKIP path.to.skip is an optional hint for a particular path that should be skipped during JSON parsing. Such paths are never stored in the JSON column. If the specified path is a nested JSON object, the whole nested object is skipped.
  • SKIP REGEXP 'path_regexp' is an optional hint with a regular expression used to skip paths during JSON parsing. Paths that match this regular expression are never stored in the JSON column.
  • max_dynamic_paths is an optional parameter indicating how many paths can be stored separately as subcolumns across a single block of data that is stored separately (for example, across a single data part of a MergeTree table). If this limit is exceeded, all other paths are stored together in a single structure. The default value of max_dynamic_paths is 1024.
  • max_dynamic_types is an optional parameter between 1 and 255 indicating how many different data types can be stored inside a single path column with type Dynamic across a single block of data that is stored separately (for example, across a single data part of a MergeTree table). If this limit is exceeded, all new types are converted to type String. The default value of max_dynamic_types is 32.
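As an illustration, several hints and parameters can be combined in a single declaration. The column and path names below (user.id, user.password, the tmp prefix) are hypothetical, and the datasource layout follows the examples later in this page:

```
example.datasource
SCHEMA >
    `json` JSON(max_dynamic_paths=512, max_dynamic_types=16, user.id UInt64, SKIP user.password, SKIP REGEXP '^tmp\\..*') `json:$`

ENGINE "MergeTree"
ENGINE_SORTING_KEY "tuple()"
```

Here user.id is always stored as a UInt64 subcolumn, user.password and every path matching ^tmp\..* are discarded at parse time, and the dynamic-path and dynamic-type limits are lowered from their defaults.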

Creating a Data Source with JSON columns

You can create a data source with JSON columns by using the JSON data type.

test.datasource
SCHEMA >
    `json` JSON `json:$`

ENGINE "MergeTree"
ENGINE_SORTING_KEY "tuple()"
tb push datasources/test.datasource
echo -e '{"a" : {"b" : 42}, "c" : [1, 2, 3]}\n{"f" : "Hello, World!"}\n{"a" : {"b" : 43, "e" : 10}, "c" : [4, 5, 6]}' > test.ndjson
tb datasource append test test.ndjson
tb sql 'select * from test'
---------------------------------------------
| json                                      |
---------------------------------------------
| {'a': {'b': 42}, 'c': [1, 2, 3]}          |
| {'f': 'Hello, World!'}                    |
| {'a': {'b': 43, 'e': 10}, 'c': [4, 5, 6]} |
---------------------------------------------

Specifying paths in the JSON type declaration forces the data source to include them:

test.datasource
SCHEMA >
    `json` JSON(a.b UInt32, SKIP a.e) `json:$`

ENGINE "MergeTree"
ENGINE_SORTING_KEY "tuple()"
tb push datasources/test.datasource
echo -e '{"a" : {"b" : 42}, "c" : [1, 2, 3]}\n{"f" : "Hello, World!"}\n{"a" : {"b" : 43, "e" : 10}, "c" : [4, 5, 6]}' > test.ndjson
tb datasource append test test.ndjson
tb sql 'select * from test'
-----------------------------------------
| json                                  |
-----------------------------------------
| {'a': {'b': 42}, 'c': [1, 2, 3]}      |
| {'a': {'b': 0}, 'f': 'Hello, World!'} |
| {'a': {'b': 43}, 'c': [4, 5, 6]}      |
-----------------------------------------

For more details, check Ingesting NDJSON data.

Creating JSON in a Query

Using CAST from String:

SELECT '{"a" : {"b" : 42},"c" : [1, 2, 3], "d" : "Hello, World!"}'::JSON AS json
┌─json───────────────────────────────────────────┐
│ {"a":{"b":42},"c":[1,2,3],"d":"Hello, World!"} │
└────────────────────────────────────────────────┘

To CAST from Tuple or Map to JSON, first CAST the column into a String column containing JSON objects, then deserialize it back into a JSON type column.
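For example, a Map can be serialized with toJSONString and the resulting String cast to JSON. The keys and values here are illustrative:

```
SELECT toJSONString(map('a', 42, 'b', 43))::JSON AS json
```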

Reading JSON paths as subcolumns

JSON type supports reading every path as a separate subcolumn. If the type of the requested path was not specified in the JSON type declaration, the subcolumn of that path always has type Dynamic.

test.datasource
SCHEMA >
    `json` JSON(a.b UInt32, SKIP a.e) `json:$`

ENGINE "MergeTree"
ENGINE_SORTING_KEY "tuple()"
tb push datasources/test.datasource
echo -e '{"a" : {"b" : 42, "g" : 42.42}, "c" : [1, 2, 3], "d" : "2020-01-01"}\n{"f" : "Hello, World!", "d" : "2020-01-02"}\n{"a" : {"b" : 43, "e" : 10, "g" : 43.43}, "c" : [4, 5, 6]}' > test.ndjson
tb datasource append test test.ndjson

For example:

SELECT json.a.b, json.a.g, json.c, json.d FROM test
┌─json.a.b─┬─json.a.g─┬─json.c──┬─json.d─────┐
│       42 │ 42.42    │ [1,2,3] │ 2020-01-01 │
│        0 │ ᴺᵁᴸᴸ     │ ᴺᵁᴸᴸ    │ 2020-01-02 │
│       43 │ 43.43    │ [4,5,6] │ ᴺᵁᴸᴸ       │
└──────────┴──────────┴─────────┴────────────┘

If you run into issues with dot notation, try the getSubcolumn(json, 'path') syntax instead:

SELECT getSubcolumn(json, 'a.b'), getSubcolumn(json, 'a.g'), getSubcolumn(json, 'c'), getSubcolumn(json, 'd') FROM test
┌─getSubcolumn(json, 'a.b')─┬─getSubcolumn(json, 'a.g')─┬─getSubcolumn(json, 'c')─┬─getSubcolumn(json, 'd')─────┐
│                        42 │                     42.42 │                 [1,2,3] │                  2020-01-01 │
│                         0 │                  ᴺᵁᴸᴸ     │                 ᴺᵁᴸᴸ    │                  2020-01-02 │
│                        43 │                  43.43    │                 [4,5,6] │                  ᴺᵁᴸᴸ       │
└───────────────────────────┴───────────────────────────┴─────────────────────────┴─────────────────────────────┘

If the requested path wasn't found in the data, it's filled with NULL values:

SELECT json.non.existing.path FROM test
┌─json.non.existing.path─┐
│ ᴺᵁᴸᴸ                   │
│ ᴺᵁᴸᴸ                   │
│ ᴺᵁᴸᴸ                   │
└────────────────────────┘

You can check the data types of returned subcolumns:

SELECT toTypeName(json.a.b), toTypeName(json.a.g), toTypeName(json.c), toTypeName(json.d) FROM test
┌─toTypeName(json.a.b)─┬─toTypeName(json.a.g)─┬─toTypeName(json.c)─┬─toTypeName(json.d)─┐
│ UInt32               │ Dynamic              │ Dynamic            │ Dynamic            │
│ UInt32               │ Dynamic              │ Dynamic            │ Dynamic            │
│ UInt32               │ Dynamic              │ Dynamic            │ Dynamic            │
└──────────────────────┴──────────────────────┴────────────────────┴────────────────────┘

For a.b the type is UInt32, as specified in the JSON type declaration; for all other subcolumns the type is Dynamic.

You can also read subcolumns of a Dynamic type using the special syntax json.some.path.:TypeName:

SELECT json.a.g.:Float64, dynamicType(json.a.g), json.d.:Date, dynamicType(json.d) FROM test
┌─json.a.g.:`Float64`─┬─dynamicType(json.a.g)─┬─json.d.:`Date`─┬─dynamicType(json.d)─┐
│               42.42 │ Float64               │     2020-01-01 │ Date                │
│                ᴺᵁᴸᴸ │ None                  │     2020-01-02 │ Date                │
│               43.43 │ Float64               │           ᴺᵁᴸᴸ │ None                │
└─────────────────────┴───────────────────────┴────────────────┴─────────────────────┘

Dynamic subcolumns can be cast to any data type. An exception is thrown if the internal type inside Dynamic cannot be cast to the requested type:

SELECT json.a.g::UInt64 AS uint FROM test
┌─uint─┐
│   42 │
│    0 │
│   43 │
└──────┘
SELECT json.a.g::UUID AS uuid FROM test
[Error] Conversion between numeric types and UUID is not supported. Probably the passed UUID is unquoted: while executing 'FUNCTION CAST(json.a.g :: 0, 'UUID' :: 1) -> CAST(json.a.g, 'UUID') UUID : 2'. (NOT_IMPLEMENTED)

Reading JSON sub-objects as subcolumns

JSON type supports reading nested objects as subcolumns with type JSON using the special syntax json.^some.path:

test.datasource
SCHEMA >
    `json` JSON `json:$`

ENGINE "MergeTree"
ENGINE_SORTING_KEY "tuple()"
tb push datasources/test.datasource
echo -e '{"a" : {"b" : {"c" : 42, "g" : 42.42}}, "c" : [1, 2, 3], "d" : {"e" : {"f" : {"g" : "Hello, World", "h" : [1, 2, 3]}}}}\n{"f" : "Hello, World!", "d" : {"e" : {"f" : {"h" : [4, 5, 6]}}}}\n{"a" : {"b" : {"c" : 43, "e" : 10, "g" : 43.43}}, "c" : [4, 5, 6]}' > test.ndjson
tb datasource append test test.ndjson

For example:

SELECT json.^a.b, json.^d.e.f FROM test
┌─json.^`a`.b───────────────┬─json.^`d`.e.f────────────────────┐
│ {"c":42,"g":42.42}        │ {"g":"Hello, World","h":[1,2,3]} │
│ {}                        │ {"h":[4,5,6]}                    │
│ {"c":43,"e":10,"g":43.43} │ {}                               │
└───────────────────────────┴──────────────────────────────────┘

Reading sub-objects as subcolumns may be inefficient, as it can require an almost full scan of the JSON data.

Handling arrays of JSON objects

JSON paths that contain an array of objects are parsed as type Array(JSON) and inserted into the Dynamic column for that path. To read an array of objects, extract it from the Dynamic column as a subcolumn:

test.datasource
SCHEMA >
    `json` JSON `json:$`

ENGINE "MergeTree"
ENGINE_SORTING_KEY "tuple()"
tb push datasources/test.datasource
echo -e '{"a" : {"b" : [{"c" : 42, "d" : "Hello", "f" : [[{"g" : 42.42}]], "k" : {"j" : 1000}}, {"c" : 43}, {"e" : [1, 2, 3], "d" : "My", "f" : [[{"g" : 43.43, "h" : "2020-01-01"}]],  "k" : {"j" : 2000}}]}}\n{"a" : {"b" : [1, 2, 3]}}\n{"a" : {"b" : [{"c" : 44, "f" : [[{"h" : "2020-01-02"}]]}, {"e" : [4, 5, 6], "d" : "World", "f" : [[{"g" : 44.44}]],  "k" : {"j" : 3000}}]}}' > test.ndjson
tb datasource append test test.ndjson

For example:

SELECT json.a.b, dynamicType(json.a.b) FROM test
┌─json.a.b──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─dynamicType(json.a.b)────────────────────────────────────┐
│ ['{"c":"42","d":"Hello","f":[[{"g":42.42}]],"k":{"j":"1000"}}','{"c":"43"}','{"d":"My","e":["1","2","3"],"f":[[{"g":43.43,"h":"2020-01-01"}]],"k":{"j":"2000"}}'] │ Array(JSON(max_dynamic_types=16, max_dynamic_paths=256)) │
│ [1,2,3]                                                                                                                                                           │ Array(Nullable(Int64))                                   │
│ ['{"c":"44","f":[[{"h":"2020-01-02"}]]}','{"d":"World","e":["4","5","6"],"f":[[{"g":44.44}]],"k":{"j":"3000"}}']                                                  │ Array(JSON(max_dynamic_types=16, max_dynamic_paths=256)) │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────┘

The max_dynamic_types/max_dynamic_paths parameters of the nested JSON type are reduced compared to the default values. This prevents the number of subcolumns from growing uncontrollably in nested arrays of JSON objects.

When reading subcolumns from a nested JSON column:

SELECT json.a.b.:`Array(JSON)`.c, json.a.b.:`Array(JSON)`.f, json.a.b.:`Array(JSON)`.d FROM test
┌─json.a.b.:`Array(JSON)`.c─┬─json.a.b.:`Array(JSON)`.f───────────────────────────────────┬─json.a.b.:`Array(JSON)`.d─┐
│ [42,43,NULL]              │ [[['{"g":42.42}']],NULL,[['{"g":43.43,"h":"2020-01-01"}']]] │ ['Hello',NULL,'My']       │
│ []                        │ []                                                          │ []                        │
│ [44,NULL]                 │ [[['{"h":"2020-01-02"}']],[['{"g":44.44}']]]                │ [NULL,'World']            │
└───────────────────────────┴─────────────────────────────────────────────────────────────┴───────────────────────────┘

You can avoid writing the Array(JSON) subcolumn name using the following syntax:

SELECT json.a.b[].c, json.a.b[].f, json.a.b[].d FROM test
┌─json.a.b.:`Array(JSON)`.c─┬─json.a.b.:`Array(JSON)`.f───────────────────────────────────┬─json.a.b.:`Array(JSON)`.d─┐
│ [42,43,NULL]              │ [[['{"g":42.42}']],NULL,[['{"g":43.43,"h":"2020-01-01"}']]] │ ['Hello',NULL,'My']       │
│ []                        │ []                                                          │ []                        │
│ [44,NULL]                 │ [[['{"h":"2020-01-02"}']],[['{"g":44.44}']]]                │ [NULL,'World']            │
└───────────────────────────┴─────────────────────────────────────────────────────────────┴───────────────────────────┘

The number of [] after the path indicates the array nesting level. For example, json.path[][] is transformed to json.path.:Array(Array(JSON)).

To read subcolumns from an Array(JSON) column:

SELECT json.a.b[].c.:Int64, json.a.b[].f[][].g.:Float64, json.a.b[].f[][].h.:Date FROM test
┌─json.a.b.:`Array(JSON)`.c.:`Int64`─┬─json.a.b.:`Array(JSON)`.f.:`Array(Array(JSON))`.g.:`Float64`─┬─json.a.b.:`Array(JSON)`.f.:`Array(Array(JSON))`.h.:`Date`─┐
│ [42,43,NULL]                       │ [[[42.42]],[],[[43.43]]]                                     │ [[[NULL]],[],[['2020-01-01']]]                            │
│ []                                 │ []                                                           │ []                                                        │
│ [44,NULL]                          │ [[[NULL]],[[44.44]]]                                         │ [[['2020-01-02']],[[NULL]]]                               │
└────────────────────────────────────┴──────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘

You can also read sub-object subcolumns from nested JSON column:

SELECT json.a.b[].^k FROM test
┌─json.a.b.:`Array(JSON)`.^`k`─────────┐
│ ['{"j":"1000"}','{}','{"j":"2000"}'] │
│ []                                   │
│ ['{}','{"j":"3000"}']                │
└──────────────────────────────────────┘