Faunus Graph

Dan LaRocque edited this page Sep 5, 2014 · 13 revisions
This is the documentation for Faunus 0.4.
Faunus was merged into Titan and renamed Titan-Hadoop in version 0.5.
Documentation for the latest Titan version is available at http://s3.thinkaurelius.com/docs/titan/current.

The source of any Faunus job is a FaunusGraph. A FaunusGraph is simply a wrapper around a collection of Hadoop- and Faunus-specific configurations. Most importantly, it captures the location and type of the input graph and of the output graph. A FaunusGraph is typically created using one of the FaunusFactory.open() methods.



FaunusGraph Construction

A Faunus configuration file is used to construct a FaunusGraph. Assume a file named bin/faunus.properties with the following contents.

# input graph parameters
faunus.graph.input.format=com.thinkaurelius.faunus.formats.graphson.GraphSONInputFormat
faunus.input.location=graph-of-the-gods.json
# output data parameters
faunus.graph.output.format=com.thinkaurelius.faunus.formats.graphson.GraphSONOutputFormat
faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
faunus.output.location=output
faunus.output.location.overwrite=true

With FaunusFactory, a configuration file is turned into a FaunusGraph. The toString() of the FaunusGraph denotes the input and output formats of the graph. For instance, as seen below, the input is a graph of type GraphSON and the output is also a graph of type GraphSON.

gremlin> g = FaunusFactory.open('bin/faunus.properties')
==>faunusgraph[graphsoninputformat->graphsonoutputformat]
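
To make the relationship between the properties file and the printed label concrete, here is a minimal sketch that uses only the JDK and does not touch the Faunus API. It parses the same keys with java.util.Properties and rebuilds a label of the same shape. The class FaunusConfigSketch and its helpers are hypothetical names, and the rule "lower-cased simple class name" is an assumption read off the console output above, not taken from the Faunus source.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.Properties;

class FaunusConfigSketch {
    // Same keys as bin/faunus.properties above, inlined so the sketch is self-contained.
    static final String CONFIG = String.join("\n",
        "faunus.graph.input.format=com.thinkaurelius.faunus.formats.graphson.GraphSONInputFormat",
        "faunus.input.location=graph-of-the-gods.json",
        "faunus.graph.output.format=com.thinkaurelius.faunus.formats.graphson.GraphSONOutputFormat",
        "faunus.output.location=output");

    // Parses properties-file text into a Properties object.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Hypothetical helper: drops the package and lower-cases the class name.
    static String simpleName(String className) {
        return className.substring(className.lastIndexOf('.') + 1).toLowerCase(Locale.ROOT);
    }

    // Hypothetical helper: builds a label shaped like the one the console prints (assumed rule).
    static String label(Properties props) {
        return "faunusgraph["
            + simpleName(props.getProperty("faunus.graph.input.format"))
            + "->"
            + simpleName(props.getProperty("faunus.graph.output.format"))
            + "]";
    }

    public static void main(String[] args) {
        System.out.println(label(load(CONFIG)));
    }
}
```

A check like this can be handy for validating a configuration file before handing it to FaunusFactory.open(), since a misspelled format key shows up immediately as a missing property.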

Hadoop-Specific Configurations

A FaunusGraph is loaded with Hadoop-specific configuration information that is percolated from the master cluster configuration (e.g. set up during cluster construction) down to the various job-level configurations.

gremlin> g.getConf()    
==>keep.failed.task.files=false
==>io.seqfile.compress.blocksize=1000000
==>dfs.df.interval=60000
==>dfs.datanode.failed.volumes.tolerated=0
==>mapreduce.reduce.input.limit=-1
==>mapred.task.tracker.http.address=0.0.0.0:50060
==>mapred.userlog.retain.hours=24
==>dfs.max.objects=0
==>dfs.https.client.keystore.resource=ssl-client.xml
==>mapred.local.dir.minspacestart=0
...

Note that FaunusGraph.getConf(String prefix) restricts the listing to the properties that match the given prefix.

gremlin> g.getConf('mapred')
==>mapred.disk.healthChecker.interval=60000
==>mapred.task.tracker.http.address=0.0.0.0:50060
==>mapred.userlog.retain.hours=24
==>mapred.local.dir.minspacestart=0
==>mapred.cluster.reduce.memory.mb=-1
==>mapred.reduce.parallel.copies=5
...
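
The idea behind prefix filtering can be sketched in plain Java over a java.util.Properties object. This is an illustration only, not the Faunus implementation: PrefixFilterSketch, withPrefix and sample are hypothetical names, and matching on whole dotted segments (so that mapred does not select mapreduce.reduce.input.limit) is an assumption that is consistent with the listing above but not confirmed by it.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

class PrefixFilterSketch {
    // Returns the entries whose key equals the prefix or continues it with a dot,
    // sorted by key. Matching on segment boundaries is an assumption of this sketch.
    static Map<String, String> withPrefix(Properties conf, String prefix) {
        Map<String, String> matches = new TreeMap<>();
        for (String key : conf.stringPropertyNames()) {
            if (key.equals(prefix) || key.startsWith(prefix + ".")) {
                matches.put(key, conf.getProperty(key));
            }
        }
        return matches;
    }

    // A few keys copied from the listings above, used as sample data.
    static Properties sample() {
        Properties conf = new Properties();
        conf.setProperty("mapred.userlog.retain.hours", "24");
        conf.setProperty("mapred.reduce.parallel.copies", "5");
        conf.setProperty("mapreduce.reduce.input.limit", "-1");
        conf.setProperty("dfs.df.interval", "60000");
        return conf;
    }

    public static void main(String[] args) {
        withPrefix(sample(), "mapred").forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

If plain string-prefix matching is wanted instead, the condition reduces to key.startsWith(prefix); the TreeMap is used only to give a stable, sorted listing.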

Faunus Properties

Within the global configuration, there are Faunus-specific configurations. These properties can be isolated with FaunusGraph.getConf('faunus'). In general, any prefix string can be provided (e.g. mapred or mapred.map).

gremlin> g.getConf('faunus')        
==>faunus.graph.input.format=com.thinkaurelius.faunus.formats.graphson.GraphSONInputFormat
==>faunus.input.location=graph-of-the-gods.json
==>faunus.graph.output.format=com.thinkaurelius.faunus.formats.graphson.GraphSONOutputFormat
==>faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
==>faunus.output.location=output
==>faunus.output.location.overwrite=true

Moreover, FaunusGraph provides getters and setters for reading and changing the most commonly used properties.

gremlin> g.setGraphOutputFormat(NoOpOutputFormat.class)
==>null
gremlin> g
==>faunusgraph[graphsoninputformat->noopoutputformat]
gremlin> g.getGraphOutputFormat()
==>class com.thinkaurelius.faunus.formats.noop.NoOpOutputFormat
gremlin> g.getProperties()       
==>faunus.graph.input.format=com.thinkaurelius.faunus.formats.graphson.GraphSONInputFormat
==>faunus.input.location=graph-of-the-gods.json
==>faunus.graph.output.format=com.thinkaurelius.faunus.formats.noop.NoOpOutputFormat
==>faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
==>faunus.output.location=output
==>faunus.output.location.overwrite=true

Chaining Graphs

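To conclude, a useful FaunusGraph method is getNextGraph(). This method generates a new FaunusGraph that is the “inverse” of the current one: its input format and input location are reconfigured to read what the current graph writes, and its output location is changed, which makes it easy to chain graphs. In the example below, h reads from output/job-1 and writes to output_.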

gremlin> g = FaunusFactory.open('bin/faunus.properties')
==>faunusgraph[graphsoninputformat->graphsonoutputformat]
gremlin> h = g.getNextGraph()
==>faunusgraph[graphsoninputformat->graphsonoutputformat]
gremlin> h.getConf('faunus')
==>faunus.graph.input.format=com.thinkaurelius.faunus.formats.graphson.GraphSONInputFormat
==>faunus.input.location=output/job-1
==>faunus.graph.output.format=com.thinkaurelius.faunus.formats.graphson.GraphSONOutputFormat
==>faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
==>faunus.output.location=output_
==>faunus.output.location.overwrite=true
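
The property rewrite shown above can be modelled in a few lines of plain Java. This is a sketch under stated assumptions rather than the actual getNextGraph() logic: NextGraphSketch and next are hypothetical names, the job directory (job-1 in the listing) is passed in explicitly instead of being derived, and the input format is found by swapping the OutputFormat suffix for InputFormat, which only works for formats that have such a counterpart.

```java
import java.util.Properties;

class NextGraphSketch {
    static final String IN_FORMAT = "faunus.graph.input.format";
    static final String IN_LOCATION = "faunus.input.location";
    static final String OUT_FORMAT = "faunus.graph.output.format";
    static final String OUT_LOCATION = "faunus.output.location";

    // Builds the configuration of the "next" graph from the current one.
    // lastJobDir is supplied by the caller; how Faunus picks it is not modelled here.
    static Properties next(Properties current, String lastJobDir) {
        Properties next = new Properties();
        next.putAll(current);
        String outFormat = current.getProperty(OUT_FORMAT);
        String outLocation = current.getProperty(OUT_LOCATION);
        // Assumption: the matching input format differs only in its class-name suffix.
        next.setProperty(IN_FORMAT, outFormat.replace("OutputFormat", "InputFormat"));
        // The next graph reads what the current graph wrote...
        next.setProperty(IN_LOCATION, outLocation + "/" + lastJobDir);
        // ...and writes to a fresh location, mirroring output -> output_ in the listing.
        next.setProperty(OUT_LOCATION, outLocation + "_");
        return next;
    }

    // The Faunus properties of g from the example above.
    static Properties sample() {
        Properties g = new Properties();
        g.setProperty(IN_FORMAT, "com.thinkaurelius.faunus.formats.graphson.GraphSONInputFormat");
        g.setProperty(IN_LOCATION, "graph-of-the-gods.json");
        g.setProperty(OUT_FORMAT, "com.thinkaurelius.faunus.formats.graphson.GraphSONOutputFormat");
        g.setProperty(OUT_LOCATION, "output");
        return g;
    }

    public static void main(String[] args) {
        Properties h = next(sample(), "job-1");
        System.out.println(IN_LOCATION + "=" + h.getProperty(IN_LOCATION));
        System.out.println(OUT_LOCATION + "=" + h.getProperty(OUT_LOCATION));
    }
}
```

Copying the current properties first means that every setting not involved in the swap, such as the output format, carries over to the next graph unchanged, as it does in the listing.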