Performance Tuning

okram edited this page Aug 3, 2012 · 29 revisions

Hadoop is a complex piece of software with a variety of components, including a distributed file system and a distributed computing framework with job trackers, data nodes, and numerous simultaneously running JVM instances. As with any complex software environment, there are tunings that can be employed to ensure efficient use of both space (network bandwidth, hard drive, memory, etc.) and time (object creation, combiners, in-memory combiners, etc.). This section presents various Hadoop and Faunus tricks that can be used to tune Faunus jobs and Faunus MapReduce extensions.

Faunus Specific Tunings

  • Use sequence files for repeated analyses: The Hadoop sequence file is the most efficient file format for Faunus. If a graph is going to be analyzed repeatedly, it is beneficial to first generate a sequence file representation of that graph in HDFS. That file can then serve as the input for each subsequent analysis. Generating it is as simple as running the Faunus script g.V._() with the following settings in faunus.properties:
faunus.graph.output.format.class=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
faunus.data.output.location=graph.dat
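
If a faunus.properties file already exists with other settings, the two output lines above can be merged into it rather than typed by hand. The following is a minimal Python sketch of that idea, not part of Faunus itself: the helper name apply_overrides is hypothetical, and the parsing assumes simple key=value lines with # comments (no line continuations or escaped characters).

```python
# The two output settings quoted above, kept verbatim from this page.
SEQUENCE_FILE_OUTPUT = {
    "faunus.graph.output.format.class":
        "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat",
    "faunus.data.output.location": "graph.dat",
}


def apply_overrides(properties_text, overrides):
    """Return properties_text with every key in `overrides` set to its new value.

    Hypothetical helper: existing lines for an overridden key are rewritten in
    place, keys that were missing are appended, and all other lines (including
    comments and blank lines) are kept unchanged.
    """
    seen = set()
    out = []
    for line in properties_text.splitlines():
        stripped = line.strip()
        is_setting = bool(stripped) and not stripped.startswith("#") and "=" in stripped
        key = stripped.split("=", 1)[0].strip() if is_setting else None
        if key in overrides:
            out.append(f"{key}={overrides[key]}")
            seen.add(key)
        else:
            out.append(line)
    # Append any override that did not already appear in the file.
    out.extend(f"{k}={v}" for k, v in overrides.items() if k not in seen)
    return "\n".join(out) + "\n"
```

For example, apply_overrides(open("faunus.properties").read(), SEQUENCE_FILE_OUTPUT) returns the updated file contents as a string, which can then be written back before running g.V._(). Input-side settings for later runs are left untouched, since their exact names depend on the Faunus version in use.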

Useful Blog Posts

Below is a collection of blog posts that discuss tips and tricks for Hadoop.