WMArchive architecture

The WMArchive architecture is shown in the figure below:

[Figure: WMArchive architecture]

It consists of a core framework, a REST server, and a series of cronjob scripts. Data are pushed to WMArchive via the REST APIs and routed to the internal Short-Term Storage (STS), currently MongoDB. A separate daemon reads data from the STS, merges records together, converts JSON to Avro, and writes Avro files to the local file system. A migration script (cronjob) then pushes the data into Long-Term Storage (LTS), currently HDFS. Once data is written to the LTS, yet another script (cronjob) aggregates the data for hourly/daily jobs and pushes the aggregated data to both the STS and the CERN MONIT systems.
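For illustration, a minimal sketch of the STS-to-local-Avro step described above, assuming the fastavro and pymongo libraries; the database/collection names and the helper itself are hypothetical, not the actual WMArchive daemon code:

```python
# Sketch of the STS -> local Avro step (illustrative only; db/collection names are assumptions).
import json
from fastavro import parse_schema, writer
from pymongo import MongoClient

def dump_to_avro(mongo_uri, schema_file, out_file, batch_size=1000):
    """Read FWJR documents from Short-Term Storage (MongoDB), merge them
    into a single batch and write them out as one Avro file."""
    coll = MongoClient(mongo_uri)['wmarchive']['fwjr']   # hypothetical db/collection names
    with open(schema_file) as fd:
        schema = parse_schema(json.load(fd))             # Avro schema, e.g. fwjr.avsc
    records = list(coll.find({'stype': 'mongodb'}).limit(batch_size))
    for rec in records:
        rec.pop('_id', None)                             # drop Mongo-internal id before conversion
    with open(out_file, 'wb') as out:
        writer(out, schema, records)                     # JSON docs -> Avro file on local FS
```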

Components

WMArchive is based on WMCore REST framework and consists of the following components:

[Figure: WMArchive components]

Currently, we choose MongoDB as the short-term storage technology (others are available as well, e.g. FileIO or AvroIO, and new ones can be easily added). It serves as a buffer for incoming requests. The data are stored via the MongoIO storage layer. Every document carries wmaid (unique WMArchive id) and stype (storage type) attributes. The latter is used as a status attribute to indicate where the data is located, e.g. in mongo or hdfs.
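As an illustration, a single STS record carries the bookkeeping attributes described above alongside the FWJR fields; the non-bookkeeping fields below are abridged and invented for the example:

```python
# Illustrative STS document layout (payload abridged, values invented).
doc = {
    'wmaid': 'f1d2d2f924e986ac86fdf7b36c94bcdf',  # unique WMArchive id of the record
    'stype': 'mongodb',                            # storage type; flips to 'hdfsio' once migrated to LTS
    'task': '/a/b/c',                              # ...remaining FWJR fields (abridged, invented here)
    'steps': [],
}
```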

The choice of the short-term storage engine is defined in the WMArchive configuration. Each storage is defined via its URI:

  • fileio for FileIO, e.g. fileio:/path/storage
  • avroio for AvroIO, e.g. avroio:/path/storage/schema.avsc
  • mongodb for MongoIO, e.g. mongodb://localhost:27017
  • hdfsio for HdfsIO, e.g. hdfsio:/test/data.avsc

The avroio and hdfsio storages must specify a schema file in their path. The schema file is used to convert FWJR JSON documents into the Avro format used by the particular storage.
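A minimal sketch of how such a URI could be split into a backend name and its endpoint; this is an illustration only, the real WMArchive factory and module layout may differ:

```python
# Illustrative parsing of a storage URI into a backend name (a sketch, not the actual WMArchive code).
def storage_backend(uri):
    """Split a storage URI into (backend, endpoint) as used in the configuration."""
    schemes = {
        'fileio':  'FileIO',    # fileio:/path/storage
        'avroio':  'AvroIO',    # avroio:/path/storage/schema.avsc (schema file required)
        'mongodb': 'MongoIO',   # mongodb://localhost:27017
        'hdfsio':  'HdfsIO',    # hdfsio:/test/data.avsc (schema file required)
    }
    scheme, _, endpoint = uri.partition(':')
    if scheme not in schemes:
        raise ValueError('Unsupported storage URI: %s' % uri)
    return schemes[scheme], endpoint

print(storage_backend('avroio:/path/storage/schema.avsc'))
# ('AvroIO', '/path/storage/schema.avsc')
```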

We complement the WMArchive service with two cronjobs: one continuously moves data from the STS to the LTS, the other cleans up the short-term storage. When the former succeeds, it changes the stype attribute of the document from one storage value to another, e.g. mongodb->hdfsio. The latter regulates the STS capacity; for example, the clean-up job will remove only migrated documents, which expire after 3 months.
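A hedged sketch of what the clean-up cronjob could look like with pymongo, assuming documents carry a timestamp field (here called wmats, a name chosen for the example) and that migrated records are marked with stype hdfsio:

```python
# Illustrative clean-up job for the Short-Term Storage (a sketch; field and
# collection names other than wmaid/stype are assumptions for the example).
import time
from pymongo import MongoClient

THREE_MONTHS = 90 * 24 * 60 * 60  # retention period in seconds

def cleanup_sts(mongo_uri='mongodb://localhost:27017'):
    """Remove documents that were already migrated to LTS and have expired."""
    coll = MongoClient(mongo_uri)['wmarchive']['fwjr']       # hypothetical db/collection names
    cutoff = time.time() - THREE_MONTHS
    spec = {'stype': 'hdfsio', 'wmats': {'$lt': cutoff}}     # only migrated, expired docs
    result = coll.delete_many(spec)
    print('removed %s expired documents' % result.deleted_count)
```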

We use the StompAMQ WMCore module to feed aggregated data from WMArchive to the CERN MONIT system. This is done via a stand-alone cronjob which invokes a Spark job over a given file on HDFS, collects statistics for a single day, and pushes the data into the STS and CERN MONIT systems (the latter via an AMQ request).

WMArchive supports the following HTTP requests:

  • GET request to fetch a given document from the archive, e.g. /wmarchive/data/UID, where UID is the unique id of the document. This method is used when a user places a query which is routed to the HDFS/Spark platform and executed on the user's behalf on the analytics cluster. The results of the job are stored in the Short-Term Storage and the user is provided with the UID of the job. Using this UID, the user can fetch the results later via a GET request.
  • POST requests
    • post the data to the service, e.g. {"data":[list_of_docs]}
    • fetch certain fields from documents for given spec conditions, e.g. {"spec":{condition_dict}, "fields":[list_of_attributes_to_retrieve]}
    • post job results, e.g. {"job": {"results":{results_of_job}, "wmaid":wma_unique_id}}
  • PUT request to update given document (to be implemented)
  • DELETE request to delete given document (to be implemented)
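For illustration, a sketch of the POST and GET APIs above using the Python requests library; the host name is a placeholder, real deployments require CMS authentication, and the spec keys and timerange format are illustrative:

```python
# Illustrative client calls for the WMArchive REST APIs (a sketch; host and payloads are placeholders).
import requests

URL = 'https://wmarchive.example.com/wmarchive/data'   # placeholder host

# post a list of FWJR documents to the service
docs = {'data': [{'task': '/a/b/c', 'steps': []}]}     # abridged, invented document
requests.post(URL, json=docs, headers={'Accept': 'application/json'})

# place a query: fetch selected fields for documents matching the spec
query = {'spec': {'task': '/a/b/c', 'timerange': [20170301, 20170308]},
         'fields': ['wmaid', 'stype']}
resp = requests.post(URL, json=query, headers={'Accept': 'application/json'})
print(resp.json())

# fetch the results of a long-running query later via its UID
# resp = requests.get(URL + '/<UID>')   # UID comes back from the query POST above
```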

Query execution

Users communicate with the WMArchive service via the REST interface. A user may place a query request by providing spec/fields attributes in a POST request. Each request contains a timerange attribute which specifies the time boundary of the query. If both bounds of the timerange are within a certain threshold (to be determined by the STS capacity), the query is executed in real time on the STS; otherwise a job is submitted to the LTS. The LTS job, e.g. a Spark job on HDFS, is routed to the CERN analytics cluster and the user is given a UID of the job for later look-up. The job stores its results back into the STS (in MongoDB this would be a separate collection; we are also discussing storing results on HDFS or HBase). The user can later fetch the results from the STS via a GET /wmarchive/data/UID request. A web interface will be provided to look up the status of running jobs.
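A minimal sketch of the routing decision described above; the threshold value and the sts/lts helper objects are placeholders, not the real WMArchive logic:

```python
# Illustrative routing of a query between STS and LTS (a sketch; helpers and
# the threshold are assumptions for the example).
def route_query(spec, fields, sts, lts, sts_lower_bound):
    """Run the query on STS if its timerange falls inside the STS window,
    otherwise submit a Spark job to LTS and hand back its UID."""
    tmin, tmax = spec['timerange']
    if tmin >= sts_lower_bound and tmax >= sts_lower_bound:
        return {'results': sts.query(spec, fields)}   # real-time look-up in MongoDB
    uid = lts.submit(spec, fields)                    # asynchronous Spark job on HDFS
    return {'uid': uid}  # results land back in STS; client polls GET /wmarchive/data/<uid>
```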