
Performance Tuning

okram edited this page Sep 22, 2012 · 29 revisions


— Courtesy of Ricky Ho’s Pragmatic Programmer blog.

Hadoop is a complex piece of software with a variety of components, including a distributed file system, a distributed computing framework with job trackers, data nodes, and numerous simultaneously running JVM instances. As with any complex software environment, there are tunings that can be employed to ensure efficient use of both space (network bandwidth, hard drive, memory, etc.) and time (object creation, combiners, in-memory combiners, etc.). This section presents various Hadoop/Faunus tricks that can be used to tune Faunus job sequences and Faunus MapReduce extensions.

Faunus Specific Tunings

  • Use sequence files for repeated analyses: The Hadoop sequence file is the most efficient file format for Faunus. If repeated analyses are going to be run on a graph, then it is beneficial to first generate a sequence file representation of that graph in HDFS. This file can then serve as the input for each subsequent analysis. Generating a sequence file is as simple as running the identity step g.V() with the following setting in faunus.properties.
faunus.graph.output.format=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
  • Avoid text-based representations of graphs: The GraphSON representation of a graph is easy to read and write, but very inefficient. DBpedia is 23 GB as a GraphSON file and 11 GB as a sequence file. If possible, avoid using verbose text-based formats.
  • Reduce the size of the graph early in a job chain: A Faunus graph is typically multi-relational in that there are numerous types of edges in the graph. In many situations, not all of that information is necessary for the graph derivation or statistic. As such, use filtering steps early in the expression to reduce the graph to only the information needed for the computation. Below, because only battled and father edges are used for the traversal, all other edges are filtered out prior to doing the traversal.
g.E.has('label','battled','father').keep.V.as('x').in('battled').in('father').linkOut('x','enemy-father')
  • Make use of repeated map steps: When a chain of map steps is issued, Faunus processes that chain in memory, so there is no I/O within the chain. As such, it is best to organize the traversal (if possible) so that the map steps are grouped together at the front of the traversal/query.
  • A vertex must be storable in main memory: The atomic unit of Faunus is a vertex. A FaunusVertex is composed of the vertex’s properties, its incoming edges, its outgoing edges, the properties of those edges, and the path information of the traversal up to the elements. A vertex (and any working data structure) must be able to fit in main memory. As such, this is the limit of what can be processed by Faunus. Be sure to make use of mapred.child.java.opts if more memory is required for the largest vertex being processed (e.g. -Xmx512m).
  • Avoid long, non-filtering path-enabled traversals: When the history of a traversal is not needed, the only information propagated between elements is a counter at the current element (stored as a long). When history is required, an array of arrays of element ids is propagated instead. With complex graph structures and the breadth-first nature of Gremlin/Faunus, this can lead to a combinatorial explosion.
  • If no graph output is needed, use NoOpOutputFormat: If a graph derivation is not required, but only some side-effect computation, then set the final output format of the graph to NoOpOutputFormat. If no output is needed at all, use Hadoop’s native NullOutputFormat.
faunus.graph.output.format=com.thinkaurelius.faunus.formats.noop.NoOpOutputFormat
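
The warning above about path-enabled traversals can be made concrete with a toy model. The sketch below is not Faunus code: it assumes a hypothetical layered graph in which every vertex of one layer links to every vertex of the next, and the function names propagate_counts and propagate_paths are invented for illustration. It compares what each vertex has to hold when only a counter is propagated with what it has to hold when full path histories are propagated.

```python
# Toy model (NOT Faunus code): breadth-first propagation over a hypothetical
# layered graph where each vertex in a layer links to every vertex in the next.

def propagate_counts(width, depth):
    """Path-disabled mode: each vertex keeps a single integer counter."""
    counts = [1] * width  # one traverser starts at each vertex of layer 0
    for _ in range(depth):
        total = sum(counts)       # each vertex receives from all predecessors
        counts = [total] * width  # still exactly one number per vertex
    return counts

def propagate_paths(width, depth):
    """Path-enabled mode: each vertex keeps a list of paths (lists of ids)."""
    paths = [[[v]] for v in range(width)]  # layer 0: each vertex has path [v]
    for step in range(1, depth + 1):
        # Flatten every path held by every vertex of the previous layer.
        incoming = [p for vertex_paths in paths for p in vertex_paths]
        # Each vertex of the new layer extends all incoming paths with its id.
        paths = [[p + [step * width + v] for p in incoming]
                 for v in range(width)]
    return paths

if __name__ == "__main__":
    width, depth = 4, 5
    counts = propagate_counts(width, depth)
    paths = propagate_paths(width, depth)
    ids_stored = sum(len(p) for vertex_paths in paths for p in vertex_paths)
    print("numbers stored without history:", len(counts))
    print("paths held per vertex with history:", len(paths[0]))
    print("element ids stored with history:", ids_stored)
```

In this model, 4 vertices per layer and 5 steps leave each vertex holding one counter (with value 1024) when history is off, and 1024 paths of 6 ids each when history is on. The per-vertex storage grows as width^depth in the path-enabled case, which is why filtering early or keeping such traversals short matters.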

Hadoop Specific Tunings

There are numerous Hadoop specific tunings. Below is a collection of blog posts that discuss tips and tricks for Hadoop.
