Performance Tuning
okram edited this page Aug 3, 2012
Hadoop is a complex piece of software with a variety of components, including a distributed file system and a distributed computing framework with job trackers, data nodes, and numerous simultaneously running JVM instances. As with any complex software environment, there are tunings that can be employed to ensure efficient use of both space (network bandwidth, hard drive, memory, etc.) and time (object creation, combiners, in-memory combiners, etc.). This section presents various Hadoop/Faunus tricks that can be used to tune Faunus job sequences and Faunus MapReduce extensions.
- Use sequence files for repeated analyses: The Hadoop sequence file is the most efficient file format for Faunus. If a graph is going to be analyzed repeatedly, it is beneficial to first generate a sequence file representation of that graph in HDFS. This file can then serve as the input for repeated analyses. Generating a sequence file is as simple as running the identity step g.V._() with the following faunus.properties:
faunus.graph.output.format.class=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
faunus.data.output.location=graph.dat
- Avoid text-based representations of graphs: The GraphSON representation of a graph is easy to read and write, but very inefficient. DBpedia is 23 GB as a GraphSON file and XX GB as a sequence file. If possible, avoid verbose text-based formats.
- Reduce the size of the graph early in a job sequence: A Faunus graph is typically multi-relational in that there are numerous types of edges in the graph. In many situations, not all of that information is necessary for the graph derivation or statistic being computed. As such, use filtering steps early in the expression to reduce the graph to only the information the computation requires. Below, because only battled and father edges are used for the traversal, all other edges are filtered out before the traversal runs. Finally, once the enemy-father edge has been generated, the battled and father edges are dropped.
g.V.edgeLabelFilter(KEEP,"battled","father").traverse(IN,"battled",IN,"father","enemy-father",DROP)
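The effect of filtering early can be illustrated outside of Hadoop with a small in-memory sketch. The Python below is not Faunus code and does not use any Faunus API. The edge data, the function names keep_labels and derive_enemy_father, and the edge directions chosen for the derived edge are all assumptions made for illustration. It only mimics the shape of the expression above: keep two edge labels, derive a new edge from two-step paths, and emit only the derived edges.

```python
from collections import defaultdict

# Toy edge list of (source, label, target) triples.
# The data is illustrative only; it is not read from any Faunus graph.
EDGES = [
    ("hercules", "battled", "nemean"),
    ("hercules", "battled", "hydra"),
    ("hercules", "father", "jupiter"),
    ("hercules", "mother", "alcmene"),
    ("jupiter", "lives", "sky"),
    ("jupiter", "brother", "neptune"),
]


def keep_labels(edges, labels):
    """Keep only edges whose label is in `labels` (the early filter)."""
    return [edge for edge in edges if edge[1] in labels]


def derive_enemy_father(edges):
    """Join battled and father edges on their shared source vertex.

    For hero -battled-> monster and hero -father-> father, emit
    monster -enemy-father-> father. Only the derived edges are returned,
    so the battled and father edges are effectively dropped.
    The direction convention here is an assumption of this sketch.
    """
    battled = defaultdict(list)
    fathers = defaultdict(list)
    for source, label, target in edges:
        if label == "battled":
            battled[source].append(target)
        elif label == "father":
            fathers[source].append(target)

    derived = []
    for hero, monsters in battled.items():
        for monster in monsters:
            for father in fathers.get(hero, []):
                derived.append((monster, "enemy-father", father))
    return derived


# Filter first, so later stages only ever see the edges they need.
filtered = keep_labels(EDGES, {"battled", "father"})
result = derive_enemy_father(filtered)

print(len(EDGES), len(filtered), len(result))
for edge in result:
    print(edge)
```

In this toy graph the filter halves the number of edges before the join runs. In a real job sequence the equivalent saving applies to the data that each subsequent MapReduce job has to read, shuffle, and write.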
Below is a collection of blog posts that discuss tips and tricks for Hadoop.