Avro data format benchmark

Within the WMArchive project we benchmarked the Avro data format. The WMArchive code contains an Avro storage layer; here we present our benchmarks of that data format.

Data access on a single node

We used the json2avsc and json2avro tools to translate CMS FrameWork JobReport (FWJR) JSON documents into Avro format. We generated an Avro file containing 50k FWJR records; its total size in Avro format was about 190MB, and the bzip'ed Avro file shrank to 26MB. We used the myspark and myspark.py scripts to run a simple map-reduce job over the set of Avro files. In CMS we expect about 200k records a day, so we put 4 Avro files into a single directory, then cloned that directory to accumulate 2 months of CMS data, 12M records in total. We ran the myspark job on vocms013 and measured the time spent finding individual records matching a provided LFN pattern. The first Spark job, run in local mode over 1 day of data, finished in about a minute. Processing all 12M records took about 1 hour.
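For illustration, the sketch below shows what such a map-reduce job over Avro files can look like in PySpark. This is not the actual myspark.py implementation; the LFN pattern, the record field name (LFNArray) and the Avro input-format/converter classes are assumptions.

```python
# Minimal PySpark sketch of an LFN-pattern scan over Avro files.
# NOT the actual myspark.py: the pattern, the "LFNArray" field name
# and the converter class below are assumptions.
import re
from pyspark import SparkContext

sc = SparkContext(appName="avro-lfn-scan")
pattern = re.compile(r".*/store/data/.*\.root$")  # hypothetical LFN pattern

# Read Avro records as Python dicts via the Avro Hadoop input format
# (requires the Avro converter jar on the Spark classpath).
rdd = sc.newAPIHadoopFile(
    "hdfs:///cms/wmarchive/test/avro/2016/01/*",
    "org.apache.avro.mapreduce.AvroKeyInputFormat",
    "org.apache.avro.mapred.AvroKey",
    "org.apache.hadoop.io.NullWritable",
    keyConverter="org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter")

records = rdd.map(lambda kv: kv[0])  # AvroKey payload -> Python dict
matches = records.filter(
    lambda rec: any(pattern.match(lfn) for lfn in rec.get("LFNArray", [])))
print(matches.count())
```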

Reference benchmark results (Yarn mode)

To understand the best achievable performance we need to compare our custom solution with standard methods. In particular, we need to determine whether there is a significant overhead in the Java/Scala/Python inter-process communication (due to PySpark) and to compare our record-parsing library with equivalent libraries available for Spark. This can be done by launching native Spark jobs directly inside the cluster (pure yarn-cluster* mode) and using spark-avro to read our files.

(*) yarn-cluster: all Spark execution actors (driver and executors) run inside the cluster. This is the mode used to run these first tests. yarn-client: the Spark driver is spawned on the local node while the executors run inside the cluster. This is the preferred mode for running Spark jobs from vocms013.
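The spark-avro read itself is the same DataFrame call regardless of language; a minimal PySpark sketch of it (Spark 1.x API; the package coordinates and path are assumptions, and the native reference tests themselves ran inside the cluster rather than through PySpark) would be:

```python
# Minimal sketch of reading the benchmark files with spark-avro.
# Submit with the spark-avro package on the classpath, e.g.
#   spark-submit --master yarn-cluster \
#       --packages com.databricks:spark-avro_2.10:2.0.1 read_avro.py
# (the package version is an assumption).
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="spark-avro-reference")
sqlContext = SQLContext(sc)

# spark-avro exposes the Avro schema as a Spark DataFrame
df = sqlContext.read.format("com.databricks.spark.avro") \
                    .load("/cms/wmarchive/test/avro/2016/01/*")
df.printSchema()
print(df.count())
```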

Test description
  • FWJR input data: /cms/wmarchive/test/avro/
  • Spark parameters: 4 executors, 4G memory and 4 cores each
  • Test query: ^/AbcCde_Task_Data_test_99[0-9]+/RECO$
  • Output: {'nrecords': x, 'result': [...]}, where result is an array of matching task names (x in total); a sketch of this query is shown after the listing below
lmeniche@analytix$ hadoop fs -du -s -h /cms/wmarchive/test/avro/2016/*
22.2 G  66.6 G  /cms/wmarchive/test/avro/2016/01
21.5 G  64.4 G  /cms/wmarchive/test/avro/2016/02
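The query logic amounts to the following sketch, reusing the spark-avro DataFrame df from the example above; the "task" column name is an assumption about the FWJR schema.

```python
# Sketch of the test query: count and collect task names matching the
# regular expression. "task" as the FWJR field holding the task name
# is an assumption.
import re

pattern = re.compile(r"^/AbcCde_Task_Data_test_99[0-9]+/RECO$")

tasks = df.rdd.map(lambda row: row.task) \
              .filter(lambda name: pattern.match(name) is not None) \
              .cache()  # cached because we trigger two actions below

output = {"nrecords": tasks.count(), "result": tasks.collect()}
```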
  1. Since the data are partitioned horizontally (year/month/day), we can run a test extracting the matching task names for the month of January using /cms/wmarchive/test/avro/2016/01/*. This extracts 65880 matching task names out of 6M total entries; the time needed for counting and extraction is 85 seconds. The same query over a single day takes 12 seconds, so it scales!

  2. Considering two months of data (January and February), 131760 records are extracted out of 12M in total. Time: 150 seconds.

These numbers can change depending on many factors (for instance, the data-locality principle cannot always be respected), so each execution may take about 10 seconds more or less. Moreover, these times do not include the time spent waiting for the Spark application to be accepted by Yarn (which depends on how busy the cluster is) nor the initialization time (which depends on how many executors we request).