History
It all started out with the original components from the graphite project. We used a couple of machines to store the metrics (carbon-cache.py), a bunch of our central machines to relay the graphite metrics to those stores (carbon-relay.py), and Diamond on all of our machines to generate per-minute stats.
It quickly became clear that this infrastructure had some problems: missing values, and cpu cores being pegged while plenty of other cores sat idle. Hence we figured we needed to use more cpu cores to allow more metrics to be processed.
A working solution put in place was haproxy on each of the relays and stores, pointing to a number of backends closely matching the number of cores available (relay) or the disk speed (cache) of the machines. This gave us a massive improvement in terms of cpu usage, as well as more complete metrics. However, as we continued to grow, the relays started to become a (cpu) bottleneck. In addition, we were also interested in having a redundant copy of the metric data in another location, and for that we had to chain relays: one to forward to two others, which in turn did the work of sending the metrics to the clusters in the different locations. Obviously, this all became very complex configuration-wise, not that flexible, and it kept burning away cpu cores.

On top of that, we started to see problems with submitted metrics. As developers discovered how easily they could add their own metrics to graphite, new code emerged to do so. But as we tried to balance the data using the consistent-hash strategy of carbon-relay.py, all these new metrics made the disks run low on available space, and hence we had to invest in tools to shuffle the metrics to their new destinations. There we noticed that metrics like `a..b.c` were on disk called `a.b.c`, whereas the consistent-hash algorithm in carbon-relay.py had considered the original, double-dot metric name. This gave us consistent disagreement between where the relay would put new points for a metric and where our scripts thought the metric should belong.
So, there we were, looking at a bunch of machines all burning away their cpus. And I couldn't help thinking: "how hard can it be?" Famous last words. The end result, after a hackathon, was a very first proof of concept, lazily called the C-version of carbon-relay, and not long after coined carbon-c-relay. This product didn't do much more than relay incoming metrics to another destination, but while doing so it cleansed the metric, basically stripping the double dots that were haunting us. Since this sort of worked, but was close to useless, the foundations for what carbon-c-relay is today were quickly built: a config file with definitions for clusters, and matches describing what to send to those clusters. This was because I wanted to solve the problem of multiple cluster locations in one go. At that time I also wanted to prepare for using multiple conceptual clusters, since the immense growth we were facing begged for (risk) isolation.
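To give an idea of what such a config looks like, here is a minimal sketch in carbon-c-relay's config language; the cluster name, addresses and ports are made up for illustration:

```
# a storage cluster of three carbon-cache instances, keyed with the
# carbon-relay.py compatible consistent-hash (carbon_ch)
cluster graphite
    carbon_ch
        10.0.0.1:2003=a
        10.0.0.2:2003=b
        10.0.0.3:2003=c
    ;

# route every incoming metric to that cluster
match *
    send to graphite
    ;
```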
Over time, carbon-c-relay grew in features and functionality. After it could successfully send incoming metrics to multiple clusters using the consistent-hash algorithm as used by carbon-relay.py, it gained two more hashing strategies for better load distribution, as well as fail-over-like targets, suitable for larger environments where the unavailability of machines shouldn't hurt the global flow.
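In today's config syntax those later additions show up as additional cluster types; a rough sketch, with invented names and addresses:

```
# fnv1a_ch: an alternative consistent-hash giving a more even spread
cluster store_fnv
    fnv1a_ch replication 2
        10.0.1.1:2003
        10.0.1.2:2003
        10.0.1.3:2003
    ;

# failover: always send to the first destination that is still reachable
cluster backup_route
    failover
        10.0.2.1:2003
        10.0.2.2:2003
    ;
```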
Next to these, aggregations made their introduction into the product. Initially only static aggregation rules were supported, with an optimiser to efficiently handle the tens of thousands of aggregation rules we produced (via a puppet template). Later, dynamic rules with back-matches were implemented, making the aggregator a true replacement for carbon-aggregator.py.
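A dynamic aggregation rule with a back-match could look roughly like this; the metric names and intervals are invented for the example:

```
# collapse per-server response times into per-cluster aggregates,
# using back-match \1 (the cluster name captured in the expression)
aggregate
        ^servers\.([^.]+)\.[^.]+\.response_time$
    every 60 seconds
    expire after 75 seconds
    compute average write to
        clusters.\1.response_time.avg
    compute max write to
        clusters.\1.response_time.max
    ;
```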
Rewrites were also introduced, to perform modifications to the input metrics, including case changes.
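A sketch of such a rewrite rule, with a made-up pattern and the assumption that \_1 denotes a lowercased capture group (the README documents the exact modifiers):

```
# lowercase the hostname part of incoming metrics, e.g.
# servers.WEB01.cpu.idle -> servers.web01.cpu.idle
# (assumed: \_1 writes capture group 1 lowercased, \^1 would uppercase it)
rewrite ^servers\.([^.]+)\.
    into servers.\_1.
    ;
```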