
Performance Tuning

okram edited this page Aug 3, 2012 · 29 revisions

Hadoop is a complex piece of software with a variety of components, including a distributed file system and a distributed computing framework with job trackers, data nodes, and numerous simultaneously running JVM instances. As with any complex software environment, there are tunings that can be employed to ensure efficient use of both space (network bandwidth, hard drive, memory, etc.) and time (object creation, combiners, in-memory combiners, etc.). This section presents various Hadoop/Faunus tricks that can be used to tune Faunus jobs and Faunus MapReduce extensions.

Faunus Specific Tunings

  • Use sequence files for repeated analyses: The Hadoop sequence file is the optimal file format for Faunus. If repeated analyses are going to be run on a graph, then it is beneficial to generate a sequence file representation of that graph in HDFS. This file can then be the input for each subsequent analysis. Doing so is as simple as running the Faunus script g.V._() with the following faunus.properties:
faunus.graph.output.format.class=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
faunus.data.output.location=graph.dat
  • Avoid text-based representations of graphs: The GraphSON representation of a graph is easy to read and write, but it is very inefficient. For example, DBpedia is 23 GB as a GraphSON file and XX GB as a sequence file. If possible, avoid using verbose text-based formats.
  • Reduce the size of the graph early in a job sequence: A Faunus graph is typically multi-relational in that there are numerous types of edges in the graph. In many situations, not all of that information is necessary for the graph derivation or statistic. As such, use filtering steps early in the expression to reduce the graph to the information required for the computation.
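
The point about text-based formats in the list above can be illustrated with a small, self-contained Python sketch. It is not Faunus code and does not reproduce the GraphSON or Hadoop sequence file layouts; the toy edge list, the label names, and the two encoders are all hypothetical. It only shows why a line-per-record JSON encoding, which repeats field names and label strings in every record, tends to be larger than a compact binary encoding of the same data.

```python
import json
import struct

# Hypothetical toy graph: (source vertex id, edge label id, target vertex id).
EDGES = [(i, i % 3, (i * 7 + 1) % 1000) for i in range(1000)]
# Hypothetical label names, spelled out in full in the text encoding.
LABELS = {0: "knows", 1: "created", 2: "likes"}

# Fixed-width binary record: 8-byte id, 1-byte label id, 8-byte id (17 bytes).
RECORD = struct.Struct(">qBq")


def encode_text(edges):
    """One JSON object per line, repeating field names and label strings."""
    lines = [
        json.dumps({"outV": src, "label": LABELS[label], "inV": dst})
        for src, label, dst in edges
    ]
    return "\n".join(lines).encode("utf-8")


def encode_binary(edges):
    """Concatenate fixed-width binary records, one per edge."""
    return b"".join(RECORD.pack(src, label, dst) for src, label, dst in edges)


def decode_binary(data):
    """Recover the edge tuples from the binary encoding."""
    return list(RECORD.iter_unpack(data))


text_size = len(encode_text(EDGES))
binary_size = len(encode_binary(EDGES))
print("text bytes:", text_size, "binary bytes:", binary_size)
```

The gap in this sketch comes entirely from per-record overhead, so the ratio it prints says nothing about the actual GraphSON-to-sequence-file ratio for a real graph such as DBpedia; that has to be measured on the data itself.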

Useful Blog Posts

Below is a collection of blog posts that discuss tips and tricks for Hadoop.
