
Record constraints

The WMArchive service is quite flexible in storing your favorite JSON records, but they should follow a few constraints which we impose due to limitations (and the complexity) of the underlying Avro binary format used on HDFS. Here they are:

  • It is allowed to store nested records, e.g. a dict of dicts or a dict with a list of dicts, but complex data types should have an identical structure for all their elements. For instance, if your record contains a list, all elements of the list should be of the same type, e.g. integers, floats or dicts. If a list contains another complex data structure, e.g. dictionaries, all dictionaries should have an identical structure (see the first sketch after this list).

  • It is allowed to expand the schema, but it is prohibited to shrink it. For example, if the original document had the structure {"foo": "some_string"}, you can expand it to {"foo": "some_string", "bla": 1}, i.e. it is allowed to add attributes to an existing schema. But you cannot do the reverse operation (see the Avro sketch after this list).

  • There are a few limitations on the choice of attribute names for your JSON record: they should not start with an underscore, e.g. _id; instead, use the Python way to name your attributes, e.g. uid or unique_id.

  • The timestamp attribute is a reserved word for HDFS, therefore try to avoid it in your JSON record.

  • If you plan to search for values which contain URIs, e.g. root://path/file.root, you need to replace them with a linked value, e.g. root://path/file.root<root://path/file.root>. The tag brackets will provide the necessary link for such values.
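
To make these constraints concrete, below is a minimal sketch in Python. Both the sample record and the validate helper are hypothetical illustrations, not part of the WMArchive API; they only restate the rules above in code.

```python
# Hypothetical sketch: a record that satisfies the constraints above,
# plus a small helper that rejects records violating them.

RESERVED = {"timestamp"}  # reserved for HDFS, per the note above

good_record = {
    "uid": "123-abc",       # plain attribute name, no leading underscore
    "tstamp": 1450656000,   # avoids the reserved word "timestamp"
    "steps": [              # a list of dicts with identical structure
        {"name": "cmsRun1", "status": 0},
        {"name": "cmsRun2", "status": 1},
    ],
    # a searchable URI wrapped as a linked value with tag brackets
    "lfn": "root://path/file.root<root://path/file.root>",
}

def validate(record):
    """Check a JSON-like record against the constraints listed above."""
    for key, val in record.items():
        if key.startswith("_"):
            raise ValueError("attribute '%s' starts with an underscore" % key)
        if key in RESERVED:
            raise ValueError("attribute '%s' is a reserved word" % key)
        if isinstance(val, dict):
            validate(val)  # nested records are allowed, check them recursively
        elif isinstance(val, list) and val:
            # all list elements must share the same type
            first = type(val[0])
            if any(not isinstance(item, first) for item in val[1:]):
                raise ValueError("list '%s' mixes element types" % key)
            if first is dict:
                # dicts inside a list must have an identical structure
                keys = set(val[0])
                if any(set(item) != keys for item in val[1:]):
                    raise ValueError("dicts in '%s' differ in structure" % key)

validate(good_record)  # passes; a record with "_id" or a mixed list would not
```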
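
The schema evolution rule can also be shown with Avro itself. The sketch below uses the fastavro library purely for illustration (WMArchive may use a different Avro binding): data written with an old schema stays readable after the schema is expanded, provided the new attribute carries a default.

```python
import io
from fastavro import writer, reader, parse_schema

# Original schema: {"foo": "some_string"}
schema_v1 = parse_schema({
    "type": "record",
    "name": "Doc",
    "fields": [{"name": "foo", "type": "string"}],
})

# Expanded schema: adds "bla" with a default, which is the safe direction
schema_v2 = parse_schema({
    "type": "record",
    "name": "Doc",
    "fields": [
        {"name": "foo", "type": "string"},
        {"name": "bla", "type": "int", "default": 0},
    ],
})

buf = io.BytesIO()
writer(buf, schema_v1, [{"foo": "some_string"}])
buf.seek(0)

# Old data read with the expanded schema: "bla" is filled from its default.
# Shrinking the schema instead (dropping "foo") is what WMArchive prohibits.
for rec in reader(buf, schema_v2):
    print(rec)  # {'foo': 'some_string', 'bla': 0}
```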