Skip to content

Cluster Metadata

Doug Rohrer edited this page Sep 20, 2016 · 3 revisions

Cluster Metadata Usage

The following is intended for those wishing to use the Cluster Metadata subsystem inside of Riak or another riak_core application. It is not intended for those wishing to support, debug or contribute to the subsystem, although it may be helpful. For those wishing to develop a deeper understanding of how Cluster Metadata works, please see Cluster Metadata Internals.

1. Overview

Cluster Metadata is intended to be used by riak_core applications wishing to work with information stored cluster-wide. It is useful for storing application metadata or any information that needs to be read without blocking on communication over the network.

1.1 Data Model

Cluster Metadata is a key-value store. It treats values as opaque Erlang terms that are fully addressed by their "Full Prefix" and "Key". A Full Prefix is a {atom() | binary(), atom() | binary()}, while a Key is any Erlang term. The first element of the Full Prefix is referred to as the "Prefix" and the second as the "Sub-Prefix".

1.2 Storage

Values are stored on-disk and a full copy is also maintained in-memory. This allows reads to be performed only from memory, while writes are affected in both mediums.

1.3 Consistency

Updates in Cluster Metadata are eventually consistent. Writing a value only requires acknowledgment from a single node and as previously mentioned, reads return values from the local node, only.

1.4 Replication

Updates are replicated to every node in the cluster, including nodes that join the cluster after the update has already reached all nodes in the previous set of members.

2. API

The interface to Cluster Metadata is provided by the riak_core_metadata module. The module's documentation is the official source for information about the API, but some details are re-iterated here.

2.1 Reading and Writing

Reading the local value for a key can be done with the get/2,3 functions. Like most riak_core_metadata functions, the higher arity version takes a set of possible options, while the lower arity function uses the defaults.

Updating a key is done using put/3.4. Performing a put only blocks until the write is affected locally. The broadcast of the update is done asynchronously.

2.1.1 Deleting Keys

Deletion of keys is logical and tombstones are not reaped. delete/2,3 act the same as put/3,4 with respect to blocking and broadcast.

2.2 Iterators

Along with reading individual keys, the API also allows Full Prefixes to be iterated over. Iterators can visit both keys and values. They are not ordered, nor are they read-isolated. However, they do guarantee that each key is seen at most once for the lifetime of an iterator.

See iterator/2 and the itr_* functions.

2.3 Conflict Resolution

Conflict resolution can be done on read or write.

On read, if the conflict is resolved, an option, allow_put, passed to get/3 or iterator/2 controls whether or not the resolved value will be written back to local storage and broadcast asynchronously.

On write, conflicts are resolved by passing a function instead of a new value to put/3,4. The function is passed the list of existing values and can use this and values captured within the closure to produce a new value to store.

2.2 Detecting Changes in Groups of Keys

The prefix_hash/1 function can be polled to determined when groups of keys, by Prefix or Full Prefix, have changed.

3. Common Pitfalls & Other Notes

The following is by no means a complete list of things to keep in mind when developing on top of Cluster Metadata.

3.1 Storage Limitations

Cluster Metadata use dets for on-disk storage. There is a dets table per Full Prefix, which limits the amount of data stored under each Full Prefix to 2GB. This size includes the overhead of information stored alongside values, such as the logical clock and key.

Since a full-copy of the data is kept in-memory, its usage must also be considered.

3.2 Replication Limitations

Cluster Metadata uses disterl for message delivery, like most Erlang applications. Standard caveats and issues with large and/or too frequent messages still apply.

3.3 Last-Write Wins

The default conflict resolution strategy on read is last-write-wins. The usual caveats about the dangers of this method apply.

3.4 "Pathological Eventual Consistency"

The extremely frequent writing back of resolved values after read in an eventually consistent store where acknowledgment is only required from one node for both types of operations can lead to an interesting pathological case where siblings continue to be produce (although the set does not grow unbounded). A very rough exploration of this can be found here.

If a riak_core application is likely to have concurrent writes and wishes to read extremely frequently, e.g. in the Riak request path, it may be advisable to use {allow_put, false} with get/3.

4. Configuration Details

Needs Content

4.1 Data Directories

Needs Content

See riak_Core_metadata_manager:start_link/1.

4.2 Broadcast Timers

Needs Content