Apache Parquet JSON integration
This project is a spin-off of the parquet-mr project.
We propose to implement a converter to write JsonNode objects to parquet directly without
intermediately format. To do so, this project implements the WriteSupport
interface for Jackson
JsonNode
objects, and relies on a OpenAPI based schema definition.
This project is mostly based on the ProtocolBuffer and Avro converters implementations.
OpenAPI Type | OpenAPI format | Parquet | Comment |
---|---|---|---|
integer | int16 | int16 | not a valid OAPI type |
integer | int32 | int32 | |
integer | int64 | int64 | |
integer | - | int32 | default format int32 |
number | float | float | |
number | double | double | |
number | - | float | default format float |
string | - | String | logical type |
string | password | String | logical type |
string | String | logical type | |
string | UUID | String | to be improved |
string | byte | String | base64 encoded bytes string |
string | binary | binary | not supported |
string | date | date | logical type |
string | date-time | timestamp | MILLIS precision |
boolean | - | boolean | |
arrays | - | list | logical type, array of maps not implemented |
object | - | GroupType | |
oneOf | - | Union | not implemented |
allOf | - | not supported | |
map | - | map | keys as string only, "free form" objects and "Fixed Keys" not supported |
enum | - | enum | only string type supported |
Given for example a schema definition in a file openapi.yaml
as:
openapi: 3.0.1
info:
title: Some schemas
description: Some schemas for parquet-json usage example
version: 1.0.0
servers:
- url: 'https://getyourguide.com'
paths: {}
components:
schemas:
MyObject:
title: MyObject
type: object
properties:
key_string:
type: string
nullable: false
default: 'a string'
key_int32:
type: integer
format: int32
nullable: true
default: 1
is_true:
type: boolean
nullable: true
default: true
the converter can be used to write a parquet file on the local FS with:
Configuration conf = new Configuration();
conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
OpenAPI openAPI = new OpenAPIV3Parser().read("openapi.yaml");
ObjectSchema schema = (ObjectSchema) openAPI.getComponents().getSchemas().get("MyObject");
ObjectMapper mapper = new ObjectMapper();
String output = "./example.parquet";
Path path = new Path(output);
ParquetWriter<JsonNode> writer =
JsonParquetWriter.Builder(path)
.withSchema(schema)
.withConf(conf)
.withCompressionCodec(CompressionCodecName.SNAPPY)
.withDictionaryEncoding(true)
.withPageSize(1024 * 1024)
.build();
String json =
"{\"key_string\":\"hello\",\"key_int32\":32,\"is_true\":true}";
JsonNode payload = mapper.readTree(json);
writer.write(payload);
writer.close();
- Currently works only with schemas of type
OpenAPI
(https://github.com/swagger-api/swagger-parser/) and data payload of typeJsonNode
(Jackson library). - The schema must be fully resolved (no internal or external
ref
) - Union types (
oneOf
) not implemented yet - Readers (from Parquet to JsonNode/OpenAPI) are not implemented (we don't need this part here at GetYourGuide)
We welcome pull requests; if you are planning to perform bigger changes then it makes sense to file an issue first.
For sensitive security matters please contact [email protected].
Copyright 2020 GetYourGuide GmbH.
parquet-json
is licensed under the Apache License, Version 2.0. See LICENSE for the full text.