Valentin Kuznetsov edited this page Mar 23, 2017 · 1 revision

Test doc against WMArchive schema

Sometimes it is necessary to test an existing JSON document against the WMArchive schema. To do so, perform the following steps:

  • Obtain a document in JSON format, e.g. file.json
  • Obtain the WMArchive schema file. It can be fetched from HDFS
hadoop fs -get /cms/wmarchive/avro/fwjr_prod.avsc ./schema.avsc

or generated directly from the FWJRProduction.py file

# set up the WMArchive environment and run the following commands
# both fwjrschema and json2avsc are part of the WMArchive distribution
fwjrschema --fout=schema.json
json2avsc --fin=schema.json --fout=schema.avsc

The schema.avsc file is an Avro schema for WMArchive. It has the following format:

{
    "namespace": "ns12",
    "type": "record",
    "name": "name12",
    "fields": [
        {
            "type": {
                "items": [
                    "string",
                    "null"
                ],
                "type": "array"
            },
            "name": "PFNArrayRef"
        },
...
}

which describes the valid keys and the data types of their values.
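
To get a quick overview of which keys a schema expects, its fields can be listed with a few lines of Python. This is only a minimal sketch using the standard json module: the function name list_fields and the inline sample schema (modelled on the excerpt above, with the elided fields left out) are illustrative assumptions and not part of the WMArchive distribution.

```python
import json

# Small sample modelled on the excerpt above; the real schema has many more fields.
SAMPLE_SCHEMA = """
{
    "namespace": "ns12",
    "type": "record",
    "name": "name12",
    "fields": [
        {"type": {"items": ["string", "null"], "type": "array"}, "name": "PFNArrayRef"}
    ]
}
"""


def list_fields(schema):
    """Return (field name, type description) pairs for a record schema."""
    result = []
    for field in schema.get("fields", []):
        ftype = field["type"]
        if isinstance(ftype, dict):
            # Complex type such as an array or a nested record.
            desc = ftype.get("type", "unknown")
            if desc == "array":
                desc = "array of %s" % json.dumps(ftype.get("items"))
        elif isinstance(ftype, list):
            # Union type, written as a JSON list of alternatives.
            desc = json.dumps(ftype)
        else:
            # Primitive type given as a plain string.
            desc = str(ftype)
        result.append((field["name"], desc))
    return result


if __name__ == "__main__":
    # For a real file, load it with json.load(open("schema.avsc")) instead.
    schema = json.loads(SAMPLE_SCHEMA)
    for name, desc in list_fields(schema):
        print(name, "->", desc)
```

Only the top-level fields are listed here; nested records would need a recursive walk.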

  • Generate an Avro file from the existing JSON and WMArchive schema files
json2avro --fin=file.json --schema=schema.avsc --fout=file.avro
  • Read back the Avro file using the Java Avro library
# you may look up avro-tools-1.7.7.jar in your local Java installation area or on the internet
java -jar avro-tools-1.7.7.jar tojson file.avro > avro2file.json
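
One possible way to check the round trip is to compare the top-level keys of the original document with those of the document read back from Avro. The sketch below is an assumption about how such a check could look and is not a WMArchive tool; the names compare_keys and load_first_record are hypothetical. It assumes that file.json holds a single JSON object and that avro2file.json holds one record per line, which is how avro-tools tojson writes its output by default. Values are deliberately not compared, because Avro's JSON encoding wraps non-null union values in an extra object (for example {"string": "abc"}), so a plain equality test on values could report differences that are not real.

```python
import json
import sys


def load_first_record(path):
    """Load the first non-empty line of a file as a JSON record."""
    with open(path) as stream:
        for line in stream:
            if line.strip():
                return json.loads(line)
    return None


def compare_keys(original, roundtrip):
    """Return (missing, extra) top-level keys of roundtrip relative to original."""
    orig_keys, new_keys = set(original), set(roundtrip)
    return sorted(orig_keys - new_keys), sorted(new_keys - orig_keys)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical script name): python compare_keys.py file.json avro2file.json
    with open(sys.argv[1]) as stream:
        original = json.load(stream)
    roundtrip = load_first_record(sys.argv[2])
    if roundtrip is None:
        print("no records found in", sys.argv[2])
    else:
        missing, extra = compare_keys(original, roundtrip)
        print("keys missing after round trip:", missing)
        print("keys added after round trip:", extra)
```

Two empty lists mean that the top-level keys survived the conversion; it does not prove that every nested value is identical.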