Skip to content

Faunus and Hadoop

okram edited this page Aug 20, 2012 · 29 revisions

Faunus has two particular use cases: graph derivations and simple graph statistics. Faunus is not a general-purpose graph processing/algorithms package for Hadoop. Instead, due to the constraints imposed by the Hadoop computing model (MapReduce), Faunus is leveraged for what Hadoop is best at: parallel processing a massive number of atomic records. With Faunus, these atomic records are the vertices of a graph and the processing is global graph derivations and/or global graph statistics.

  • Graph derivation: Given an input graph, derive a new graph based upon the input graph’s structure and semantics. Other terms include graph rewriting and graph transformations.
    • deriving a cousin-graph from father and brother edges.
    • deriving a youth friend-graph by removing all vertices with age greater than 30 and keeping only friend edges.
  • Graph statistic: A graph is a complex structure that can best be understood when the nature of its structure is distilled down to a manageable set of numbers. An analogous term is graph/network analysis.
    • counting the number of vertices and edges in the graph.
    • determining the degree distribution of the graph.
    • determining the distribution of edges labels in the graph.
    • determine the assortative mixing in the graph.

Faunus’ statistics library is for simple global graph computations. Hadoop is not optimized for iterative algorithms such as the popular centrality algorithms and/or community detection algorithms.

Introduction to Hadoop

Hadoop is a distributed storage and processing system that greatly simplifies the creation of distributed computing jobs. The default storage layer for Hadoop is HDFS. HDFS is similar to any other file system in that it can be used to store arbitrarily formatted files (e.g. binary, text, etc.). However, HDFS allows for the storage of files so large that they can not be represented on a single machine. As such, HDFS is a distributed file system. Given files distributed over HDFS, it is possible to process these files using Hadoop’s distributed processing framework, MapReduce. MapReduce represents a computation as a series of parallel/atomic key/value pair computations. There are two steps to MapReduce:

  • Map: For every <key1,value1> input, yield <key2,value2> outputs.
  • Reduce: For every <key2,list<value2>> input, yield <key3, value3> outputs.

In this way, the map-step can be seen as a parallel analysis of all the <key1,value1> pairs represented in the source file/location. The reduce-step aggregates all the values (value2) emitted by the previous map-step that have the same key (key2). Some algorithm is evaluated over those values (value2) to ultimately write <key3,value3> data to some sink file/location.

Introduction to Graphs

A graph is a data structure composed of vertices (dots,things) and edges (lines,links). A vertex has a set of incoming edges and a set of outgoing edges. There are numerous types of graphs and the one used by Faunus is the property graph model of the Blueprints API. A property graph is multi-relational in that edges are labeled to denote different types of relationships between vertices. Moreover, every vertex/edge can have an arbitrary number of key/value pairs associated with it (note that these key/value pairs should not be confused with the key/value pairs of Hadoop). Graphs are typically processed using traversals. A traversal is a algorithmic walk over the graph in order to arrive at a particular destination (e.g. search or derivation) or to yield some side-effect in the process (e.g. a ranking or recommendation or statistic).

Faunus and Hadoop

A graph can be represented by an adjacency list. Each “row” represents a vertex along with its incident edges. For property graphs, the vertex and edge properties are stored in the row as well. Faunus interprets a graph from this perspective. Every key/value in Hadoop is a single vertex along with its properties and its incoming and outgoing edges. As such, a Faunus MapReduce job operates on each vertex in parallel.

There are two types of derivation operations in Faunus.

  • Map-Only: Filters, Mutations
    • Map: For every <null,vertex> input, yield a <null,vertex> output.
  • MapReduce: Traversals, Vertex Filters
    • Map: For every <null,vertex> input, yield a <id,tagged_element> output.
    • Reduce: For every <id,list<tagged_element>> input, yield a <null,vertex> output.

For every vertex in a map-only step, the vertex is either allowed or filtered from the next processing step or the vertex is mutated in some way (e.g. properties removed/added, edges transposed, etc.). For every vertex in a map-reduce step, the vertex’s edges are analyzed and messages are sent to the adjacent vertices. These messages are then aggregated at the reduce step and the message receiving vertex is updated in some way.

Given that the input and output of both map-only and map-reduce steps is <null,vertex>, it is possible to chain these steps together to yield a more complex derivation.

Ultimately, the original graph or the derivation can then be subjected to some graph statistic.

References

A theory of multi-relational graph derivations for the purpose of applying single-relational graph algorithms is described in the article below and forms the primary inspiration for Faunus.

The following articles provided numerous insights into various theoretical and technical issues encountered during Faunus’ design and development.

Clone this wiki locally