Script Format
This is the documentation for Faunus 0.4.
Faunus was merged into Titan and renamed Titan-Hadoop in version 0.5.
Documentation for the latest Titan version is available at http://s3.thinkaurelius.com/docs/titan/current.
- InputFormat: com.thinkaurelius.faunus.formats.script.ScriptInputFormat
- OutputFormat: com.thinkaurelius.faunus.formats.script.ScriptOutputFormat
ScriptInputFormat and ScriptOutputFormat take an arbitrary Gremlin script and use that script to read or write FaunusVertex objects, respectively. This can be considered the most general InputFormat/OutputFormat possible, in that Faunus uses the user-provided script for all reading and writing.
The data below is an adjacency list representation of an untyped, directed graph. The first line reads, “vertex 0 has no outgoing edges.” The second line reads, “vertex 1 has outgoing edges to vertices 4, 3, 2, and 0.”
0:
1:4,3,2,0
2:5,3,1
3:11,6,1,2
4:
5:
6:
7:8,9,10,11,1
8:
9:
10:
11:6
There is no corresponding InputFormat that can parse this particular file (or some adjacency list variant of it). As such, ScriptInputFormat can be used. With ScriptInputFormat, a Gremlin-Groovy script is stored in HDFS and leveraged by each mapper in the Faunus job. The Gremlin-Groovy script must have the following method defined:
def boolean read(FaunusVertex vertex, String line) { ... }
An appropriate read() for the above adjacency list file is:
def boolean read(FaunusVertex vertex, String line) {
    parts = line.split(':');
    // reuse the previously allocated vertex, wiping its prior state
    vertex.reuse(Long.valueOf(parts[0]))
    if (parts.length == 2) {
        // each comma-separated id becomes an outgoing 'linkedTo' edge
        parts[1].split(',').each {
            vertex.addEdge(Direction.OUT, 'linkedTo', Long.valueOf(it));
        }
    }
    return true;
}
Note that, to avoid object creation overhead, the previous vertex is provided to the next parse. The vertex can be “reused” with the FaunusVertex.reuse(long id) method, which wipes all previous data. The resultant boolean denotes whether the parsed line yielded a valid FaunusVertex. As such, if the line is not valid (e.g. a comment line, a line to skip, etc.), simply return false.
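For example, a read() that tolerates blank lines and comment lines might look like the following sketch. (The '#' comment convention is hypothetical and not part of the shipped ScriptInput.groovy; it only illustrates the return-false contract.)
def boolean read(FaunusVertex vertex, String line) {
    // hypothetical convention: skip blank lines and '#' comment lines by returning false
    if (line.trim().isEmpty() || line.startsWith('#'))
        return false;
    parts = line.split(':');
    vertex.reuse(Long.valueOf(parts[0]))
    if (parts.length == 2) {
        parts[1].split(',').each {
            vertex.addEdge(Direction.OUT, 'linkedTo', Long.valueOf(it));
        }
    }
    return true;
}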
The above files are provided with the Faunus distribution and can be used from the Gremlin REPL.
gremlin> hdfs.copyFromLocal('data/ScriptInput.groovy','ScriptInput.groovy')
==>null
gremlin> hdfs.copyFromLocal('data/graph-of-the-gods.id','graph-of-the-gods.id')
==>null
gremlin> hdfs.ls()
==>rw-r--r-- marko supergroup 868 ScriptInput.groovy
==>rw-r--r-- marko supergroup 69 graph-of-the-gods.id
gremlin> g = FaunusFactory.open('bin/script-input.properties')
==>faunusgraph[scriptinputformat->graphsonoutputformat]
gremlin> g.getConf('faunus')
==>faunus.output.location.overwrite=true
==>faunus.graph.output.format=com.thinkaurelius.faunus.formats.graphson.GraphSONOutputFormat
==>faunus.input.location=graph-of-the-gods.id
==>faunus.graph.input.script.file=ScriptInput.groovy
==>faunus.graph.input.format=com.thinkaurelius.faunus.formats.script.ScriptInputFormat
==>faunus.output.location=output
==>faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
==>faunus.graph.input.edge-copy.direction=OUT
gremlin> g._
13/04/16 12:35:06 INFO mapreduce.FaunusCompiler: Compiled to 2 MapReduce job(s)
13/04/16 12:35:06 INFO mapreduce.FaunusCompiler: Executing job 1 out of 2: MapSequence[com.thinkaurelius.faunus.formats.EdgeCopyMapReduce.Map, com.thinkaurelius.faunus.formats.EdgeCopyMapReduce.Reduce]
...
gremlin> hdfs.head('output')
==>{"_id":0,"_inE":[{"_label":"linkedTo","_id":-1,"_outV":1}]}
==>{"_id":1,"_outE":[{"_label":"linkedTo","_id":-1,"_inV":4},{"_label":"linkedTo","_id":-1,"_inV":3},{"_label":"linkedTo","_id":-1,"_inV":2},{"_label":"linkedTo","_id":-1,"_inV":0}],"_inE":[{"_label":"linkedTo","_id":-1,"_outV":3},{"_label":"linkedTo","_id":-1,"_outV":7},{"_label":"linkedTo","_id":-1,"_outV":2}]}
==>{"_id":2,"_outE":[{"_label":"linkedTo","_id":-1,"_inV":5},{"_label":"linkedTo","_id":-1,"_inV":3},{"_label":"linkedTo","_id":-1,"_inV":1}],"_inE":[{"_label":"linkedTo","_id":-1,"_outV":3},{"_label":"linkedTo","_id":-1,"_outV":1}]}
==>{"_id":3,"_outE":[{"_label":"linkedTo","_id":-1,"_inV":11},{"_label":"linkedTo","_id":-1,"_inV":6},{"_label":"linkedTo","_id":-1,"_inV":1},{"_label":"linkedTo","_id":-1,"_inV":2}],"_inE":[{"_label":"linkedTo","_id":-1,"_outV":1},{"_label":"linkedTo","_id":-1,"_outV":2}]}
==>{"_id":4,"_inE":[{"_label":"linkedTo","_id":-1,"_outV":1}]}
==>{"_id":5,"_inE":[{"_label":"linkedTo","_id":-1,"_outV":2}]}
==>{"_id":6,"_inE":[{"_label":"linkedTo","_id":-1,"_outV":11},{"_label":"linkedTo","_id":-1,"_outV":3}]}
==>{"_id":7,"_outE":[{"_label":"linkedTo","_id":-1,"_inV":8},{"_label":"linkedTo","_id":-1,"_inV":9},{"_label":"linkedTo","_id":-1,"_inV":10},{"_label":"linkedTo","_id":-1,"_inV":11},{"_label":"linkedTo","_id":-1,"_inV":1}]}
==>{"_id":8,"_inE":[{"_label":"linkedTo","_id":-1,"_outV":7}]}
==>{"_id":9,"_inE":[{"_label":"linkedTo","_id":-1,"_outV":7}]}
==>{"_id":10,"_inE":[{"_label":"linkedTo","_id":-1,"_outV":7}]}
==>{"_id":11,"_outE":[{"_label":"linkedTo","_id":-1,"_inV":6}],"_inE":[{"_label":"linkedTo","_id":-1,"_outV":3},{"_label":"linkedTo","_id":-1,"_outV":7}]}
Note that the resultant graph has both incoming and outgoing edges even though the input dataset was an adjacency list denoting only outgoing edges. The EdgeCopyMapReduce step was activated to overlay the graph transpose appropriately (see the faunus.graph.input.edge-copy.direction property).
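For reference, the keys printed by g.getConf('faunus') above correspond to a plain Java properties file. A sketch of what bin/script-input.properties contains, reconstructed from that output (the actual file shipped with Faunus may differ in ordering or comments), is:
faunus.graph.input.format=com.thinkaurelius.faunus.formats.script.ScriptInputFormat
faunus.graph.input.script.file=ScriptInput.groovy
faunus.input.location=graph-of-the-gods.id
faunus.graph.input.edge-copy.direction=OUT
faunus.graph.output.format=com.thinkaurelius.faunus.formats.graphson.GraphSONOutputFormat
faunus.output.location=output
faunus.output.location.overwrite=true
faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat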
The principle above can also be used to write a <NullWritable,FaunusVertex> stream back to a file in HDFS. This is the role of ScriptOutputFormat. ScriptOutputFormat requires that the provided script maintain a method with the following signature:
def void write(FaunusVertex vertex, DataOutput output) { ... }
Note the following ScriptOutput.groovy file distributed with Faunus:
def void write(FaunusVertex vertex, DataOutput output) {
    // emit the vertex id followed by ':'
    output.writeUTF(vertex.getId().toString() + ':');
    Iterator<Edge> itty = vertex.getEdges(OUT).iterator()
    while (itty.hasNext()) {
        // emit the id of each adjacent out-vertex, comma-separated
        output.writeUTF(itty.next().getVertex(IN).getId().toString());
        if (itty.hasNext())
            output.writeUTF(',');
    }
    output.writeUTF('\n');
}
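As with ScriptInput.groovy, the script is copied into HDFS and referenced from the job configuration via the faunus.graph.output.script.file property, as shown in the session below.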
gremlin> hdfs.copyFromLocal('data/graph-of-the-gods.json','graph-of-the-gods.json')
==>null
gremlin> hdfs.copyFromLocal('data/ScriptOutput.groovy','ScriptOutput.groovy')
==>null
gremlin> g = FaunusFactory.open('bin/script-output.properties')
==>faunusgraph[graphsoninputformat->scriptoutputformat]
gremlin> g.getConf('faunus')
==>faunus.output.location.overwrite=true
==>faunus.graph.output.format=com.thinkaurelius.faunus.formats.script.ScriptOutputFormat
==>faunus.graph.output.script.file=ScriptOutput.groovy
==>faunus.input.location=graph-of-the-gods.json
==>faunus.graph.input.format=com.thinkaurelius.faunus.formats.graphson.GraphSONInputFormat
==>faunus.output.location=output
==>faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
gremlin> g._
13/04/16 14:40:06 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
13/04/16 14:40:06 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.IdentityMap.Map]
...
gremlin> hdfs.head('output')
==>0:
==>1:4,3,2,0
==>2:5,3,1
==>3:11,6,1,2
==>4:
==>5:
==>6:
==>7:8,9,10,11,1
==>8:
==>9:
==>10:
==>11:6