Gotchas and Limitations

okram edited this page Dec 16, 2012 · 21 revisions

This section presents a list of outstanding issues and likely problems that users should be aware of. A design limitation is inherent to the foundation of Faunus and, as such, will not be rectified in a future release. Temporary limitations, by contrast, are incurred by the current implementation, and future versions should provide solutions.

Design Limitations

Real-time graph processing

Faunus is built atop Hadoop. Hadoop is not a real-time processing framework. All Hadoop jobs require a costly setup (even for small input data) that takes around 15 seconds to initiate. For real-time graph processing, use Titan.

Only primitive element property values supported

The Blueprints API states that an element (vertex/edge) can have any Object as a property value (e.g. Vertex.setProperty(String,Object)). Faunus only supports integer, float, double, long, string, and boolean property values.
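As a rough illustration of this restriction, the following sketch checks a value against the six listed types before it would be set as a property. The class SupportedPropertyTypes and its method are hypothetical helpers written for this page, not part of Faunus or Blueprints, and Faunus itself may enforce the restriction differently.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Faunus): mirrors the list of
// property value types named in the paragraph above.
final class SupportedPropertyTypes {
    private static final List<Class<?>> SUPPORTED = Arrays.<Class<?>>asList(
            Integer.class, Float.class, Double.class,
            Long.class, String.class, Boolean.class);

    // Returns true only for an exact match on one of the listed types.
    // null and all other objects (collections, dates, shorts, ...) are rejected.
    static boolean isSupported(Object value) {
        return value != null && SUPPORTED.contains(value.getClass());
    }

    public static void main(String[] args) {
        System.out.println("int:    " + isSupported(42));
        System.out.println("string: " + isSupported("marko"));
        System.out.println("list:   " + isSupported(Arrays.asList(1, 2, 3)));
    }
}
```

A check of this kind can be run over data before loading, so that unsupported values are found early rather than during a Hadoop job.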

Titan/Cassandra and the Thrift frame size exceeded exception

When using Titan/Cassandra as a data source, if there are vertices with a large number of edges (i.e. a very wide row in Cassandra), an innocuous exception may occur warning that the Thrift frame size has been exceeded. While cassandra.yaml can be updated, the easiest way to solve this is typically to add the following property to the FaunusGraph being worked with: cassandra.input.split.size=512 (see bin/titan-cassandra.properties). The value 512 is the input split size in kilobytes, and it can be adjusted higher or lower to ensure performant, exception-free behavior.

Temporary Limitations

Gremlin closures must be strings

There is no easy way to serialize a Groovy closure and thus propagate it to the Hadoop jobs running on different machines. As such, until a solution is found, a closure must be provided as a String. For example: filter('{it.degree > 10}') instead of filter{it.degree > 10}.
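The trade-off can be illustrated with an analogy in plain Java. This is only an analogy: it uses a Java lambda rather than a Groovy closure, the class ClosureShipping is hypothetical, and Groovy closures differ in detail. The sketch shows that compiled behavior is not automatically shippable through Java serialization, whereas the source text of that behavior always is.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.util.function.Predicate;

// Hypothetical illustration (Java analogy, not Faunus or Groovy code).
final class ClosureShipping {
    // Returns true if standard Java serialization accepts the object.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out =
                     new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Compiled behavior: a plain lambda is not Serializable.
        Predicate<Integer> compiled = degree -> degree > 10;
        // Source text: a String can always be sent to another machine
        // and compiled there.
        String source = "{it.degree > 10}";

        System.out.println("lambda serializable: " + canSerialize(compiled));
        System.out.println("string serializable: " + canSerialize(source));
    }
}
```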

A vertex and its incident edges must fit in memory

A single vertex, together with its incident edges, must fit within the -Xmx upper bound of memory. This means that a graph containing a vertex with tens of millions of edges might not fit on a reasonable machine. In the future, “vertex splitting” or streaming is a potential solution to this problem.
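A back-of-the-envelope check can help judge whether a high-degree vertex is likely to fit. The sketch below is a minimal estimate under stated assumptions: the class VertexMemoryEstimate is hypothetical, and the bytes-per-edge figure is a placeholder that has to be measured for the actual data, since this page does not give the in-memory size of an edge.

```java
// Hypothetical estimate (not part of Faunus): compares a rough size for
// one vertex and its incident edges against the JVM heap limit.
final class VertexMemoryEstimate {
    // bytesPerEdge is an assumed, user-measured average, not a Faunus constant.
    static long estimateBytes(long edgeCount, long bytesPerEdge) {
        return Math.multiplyExact(edgeCount, bytesPerEdge);
    }

    static boolean fits(long edgeCount, long bytesPerEdge, long maxHeapBytes) {
        return estimateBytes(edgeCount, bytesPerEdge) <= maxHeapBytes;
    }

    public static void main(String[] args) {
        // maxMemory() roughly reflects the -Xmx setting of this JVM.
        long maxHeap = Runtime.getRuntime().maxMemory();
        long edges = 20_000_000L;   // "tens of millions" of edges
        long assumedBytesPerEdge = 100L; // placeholder value for illustration

        System.out.println("estimated bytes: "
                + estimateBytes(edges, assumedBytesPerEdge));
        System.out.println("fits in heap:    "
                + fits(edges, assumedBytesPerEdge, maxHeap));
    }
}
```

The estimate ignores the rest of the heap (Hadoop buffers, other objects), so a vertex that only barely fits by this measure should still be treated as a risk.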

GraphSON file format is overly expensive

The current implementation of the GraphSON InputFormat is excessively inefficient. As it stands, the full String representation of a vertex is held in memory, then its JSON Map representation, and finally its FaunusVertex representation. This can be fixed with a smarter, streaming parser in the future.

Titan and Rexster do not have OutputFormats (sinks)

Titan and Rexster can only be the source of graph data, not the sink of graph data. A near future release will provide OutputFormats to support writing to Titan and Rexster.

Not a 1-to-1 mapping with Gremlin/Pipes

The Gremlin implementation that is currently distributed with Faunus is not identical to Gremlin/Pipes. Besides the fact that not all steps are implemented, the general rule is that once “the graph is left” (e.g. by traversing to data that is not vertices or edges), the traversal ends. This ending is represented as a pipeline lock in the Gremlin/Faunus compiler.

SequenceFileOutputFormat and SequenceFileInputFormat contain other metadata

The binary sequence files supported by Hadoop are the primary means by which graph and traversal data is moved between MapReduce jobs in a Faunus chain. If a sequence file is saved to disk, be conscious of the traversal metadata it contains (e.g. path calculations).

Rexster Operation

Faunus will work with Rexster 2.1.0, but with some limitations. Due to some default and non-configurable settings in Rexster 2.1.0, it will only allow a maximum of four Faunus map tasks to connect to it. If more are configured, then Faunus will throw a SocketException and fail. To use more than four map tasks, consider building Rexster from source and using 2.2.0-SNAPSHOT. Several upgrades and optimizations have been made to Rexster in light of Faunus and Rexster interoperability, making 2.2.0-SNAPSHOT much more efficient.

The most important of these improvements, in relation to Faunus, is the ability to configure the thread pool. The Faunus Kibble streams data over REST back to Faunus. This streaming puts Rexster in a situation where it must deal with multiple long-running HTTP requests. The thread pool should be configured such that there are enough threads to service each of the expected long-running requests. Without appropriate configuration, one would expect SocketException errors like the one described above with version 2.1.0.

Striking a careful balance among the number of map tasks, Rexster memory availability, and Rexster thread pool size will largely determine the speed at which Faunus operates. It will likely take some experimentation to achieve the most efficient configuration.
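As a starting point for that experimentation, the thread pool guidance above can be written as a simple check. The sketch assumes that each map task holds exactly one long-running request open against Rexster, which this page implies but does not state outright. The class RexsterThreadBudget and the reservedThreads parameter (headroom for other Rexster clients) are hypothetical.

```java
// Hypothetical sizing check (not part of Faunus or Rexster).
// Assumption: one long-running HTTP request per Faunus map task.
final class RexsterThreadBudget {
    // Threads needed: one per map task plus headroom for other clients.
    static int requiredThreads(int mapTasks, int reservedThreads) {
        if (mapTasks < 0 || reservedThreads < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        return Math.addExact(mapTasks, reservedThreads);
    }

    static boolean hasEnoughThreads(int mapTasks, int reservedThreads,
                                    int threadPoolSize) {
        return threadPoolSize >= requiredThreads(mapTasks, reservedThreads);
    }

    public static void main(String[] args) {
        int mapTasks = 8;
        int reserved = 2;
        int poolSize = 9;
        System.out.println("required threads: "
                + requiredThreads(mapTasks, reserved));
        System.out.println("pool is sufficient: "
                + hasEnoughThreads(mapTasks, reserved, poolSize));
    }
}
```

This check covers only the thread count. Rexster memory availability still has to be balanced against the number of map tasks by experiment.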
