Is the couchbase adapter gone? #1

Open
0xgeert opened this issue Sep 30, 2015 · 12 comments
@0xgeert

0xgeert commented Sep 30, 2015

If I remember correctly Telepat used to have adapters for both Couchbase and Elasticsearch. Where can I find the Couchbase adapter?

@Mayhem93
Member

Telepat started with Couchbase (Redis came in afterwards, but Redis is irreplaceable at the moment and is only used for storing more volatile things). When I started working on the database adapters we picked ElasticSearch as it was better suited to our needs (more specifically, complex queries). We used to run ES and CB at the same time (with XDCR), using ES for the more complex subscription filters, but the XDCR plugin from ES wasn't doing so great and we ultimately decided to stick with ES for a while, replacing CB.

At the moment we have only developed the ElasticSearch adapter. Couchbase has the "query" part we need, but at the time when we started Telepat it was still in Developer Preview (or RC, I don't remember). We will eventually take a look at it.

@andreimarinescu
Member

Hi @gebrits,

Just a little more info on this. Starting with 0.2.4, we've implemented the adapter model, where Elasticsearch can be swapped out for any other DB, depending on needs and workload type. Previously, both Couchbase and Elasticsearch were used with overlapping functionality. This was both tough to manage and confusing.

We chose to create the Elasticsearch adapter first, as we felt it was the best multipurpose system to support. In time, we plan to add Couchbase, MongoDB and other adapters. Any thoughts / contributions on this are really appreciated!

@0xgeert
Author

0xgeert commented Oct 1, 2015

@andreimarinescu If I understand correctly, the idea is to have 1 adapter or the other? Could I persuade you guys to consider having multiple adapters at the same time?

IMO one of the big positives of a CQRS setup that Telepat uses, involving Kafka as an event broker, etc. is the ability to have multiple Kafka consumers (i.e.: the stores/adapters) be updated from a single stream of updates served by Kafka. Essentially, a polyglot architecture, where you as a user could choose to go with, say:

  • Couchbase (get/multiget/range)
  • Elasticsearch (more involved queries, facets, etc. but only for tenants paying the Premium package)
  • Druid.io (aggregates, stats)
  • Redis (counters)

Sure, with the above config there's overlap in the types of queries that can be served, but all have a different cost/performance tradeoff. E.g.: multiGet on Elasticsearch is 1 or 2 orders of magnitude slower than on Couchbase, etc.

Then on the query side, you would need to have some configurable reverse proxy / level of indirection to declare that all (multi)gets go to Couchbase, while other queries go to Elasticsearch, etc. This doesn't seem too difficult though.
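
To make the "level of indirection" idea concrete, here is a minimal sketch of a routing table that maps a query type to the store that should serve it. Everything in it is an assumption for illustration: `QueryRouter`, the query type names and the handler functions are hypothetical and are not part of Telepat or of any database client.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class QueryRouter:
    """Maps a query type to the name of the store that should serve it."""

    routes: Dict[str, str]   # query type -> store name (hypothetical names)
    default_store: str       # store used when no explicit route exists
    stores: Dict[str, Callable[[str, dict], str]] = field(default_factory=dict)

    def register(self, name: str, handler: Callable[[str, dict], str]) -> None:
        """Attach the function that actually talks to the given store."""
        self.stores[name] = handler

    def store_for(self, query_type: str) -> str:
        """Look up the configured store, falling back to the default."""
        return self.routes.get(query_type, self.default_store)

    def execute(self, query_type: str, params: dict) -> str:
        """Dispatch the query to whichever store the routing table names."""
        handler = self.stores[self.store_for(query_type)]
        return handler(query_type, params)


# Routing declared as plain configuration: (multi)gets go to Couchbase,
# aggregates to Druid, counters to Redis, everything else to Elasticsearch.
router = QueryRouter(
    routes={
        "get": "couchbase",
        "multiget": "couchbase",
        "aggregate": "druid",
        "counter": "redis",
    },
    default_store="elasticsearch",
)

# Stand-in handlers; a real setup would call the respective database clients.
for store_name in ("couchbase", "elasticsearch", "druid", "redis"):
    router.register(
        store_name,
        lambda query_type, params, store_name=store_name: f"{store_name} served {query_type}",
    )

print(router.execute("multiget", {"keys": ["a", "b"]}))
print(router.execute("facet", {"field": "tags"}))
```

Because the routes are plain data, moving a query type from one store to another is a config change rather than a code change.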

Please tell me I'm not chasing a pipedream here :)

P.s.: As I understand it Redis in the current architecture isn't an adapter in the above sense but is treated as special in some way. Care to elaborate on the why?

@Mayhem93
Member

Mayhem93 commented Oct 1, 2015

@gebrits I'll answer some of your questions/suggestions, the rest will be covered by @andreimarinescu.

Redis in the current architecture is used for "a part" of our database, mainly the database that's the most volatile: user subscriptions, user devices and operation deltas (objects that represent changes in the persistent database, written by aggregation workers, read/removed by write workers). We need this component to be as fast as possible and we thought keeping it on Redis would suffice, instead of allowing too much liberty in choosing another NoSQL which could potentially degrade performance since this component is very sensitive to changes.

Another reason why we chose Redis is because it has data structures which are extremely useful for us and very suited for our data model.

There are 3 types of databases used in Telepat:

  • Main Database (free to choose/implement your own) where application objects are stored and queried,
  • State (or Volatile) Database where the state of user subscriptions, devices and deltas is stored (locked to Redis)
  • Caching database where application objects will be cached: this is not implemented yet nor discussed in more detail (free to choose/implement your own, but this is still a matter of discussion)

@0xgeert
Author

0xgeert commented Oct 1, 2015

@Mayhem93 Thanks for clearing up Redis' role. Makes sense.

@0xgeert
Author

0xgeert commented Oct 2, 2015

@andreimarinescu : friendly ping

@andreimarinescu
Member

Actually, this kind of setup is exactly what we wanted to do in the first place, @gebrits, it's really nice to see that we're not alone in looking to build such an architecture. One of the main issues that we've encountered in this approach is keeping everything in sync. The thing I liked about Couchbase was its XDCR capabilities, that should have, in theory, supported things like replicating from CB to Elasticsearch. In practice, we've found the link to be very buggy and unstable and this was the driver that brought us to this single-database adapter approach (while keeping the possibility of switching out an adapter for another).

The only way I can think of right now to keep all persistence engines in sync, irrespective of the actual underlying system (Elasticsearch, MongoDB, Couchbase or any other DB), is to do writes on all adapters and do reads selectively depending on purpose.
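
As a rough illustration of "write to all adapters, read selectively", here is a minimal sketch. The `InMemoryAdapter` and `FanOutStore` classes, the `write`/`read` method names and the purpose labels are all assumptions made up for this example; they do not reflect the actual Telepat adapter interface.

```python
from typing import Any, Dict, List, Optional


class InMemoryAdapter:
    """Hypothetical stand-in for a real database adapter."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, Any] = {}

    def write(self, key: str, value: Any) -> None:
        self.data[key] = value

    def read(self, key: str) -> Optional[Any]:
        return self.data.get(key)


class FanOutStore:
    """Writes go to every adapter; reads go to one adapter chosen by purpose."""

    def __init__(self, adapters: Dict[str, InMemoryAdapter], read_from: Dict[str, str]) -> None:
        self.adapters = adapters    # adapter name -> adapter
        self.read_from = read_from  # read purpose -> adapter name

    def write(self, key: str, value: Any) -> List[str]:
        """Write to all adapters and return the names of those that failed."""
        failed = []
        for name, adapter in self.adapters.items():
            try:
                adapter.write(key, value)
            except Exception:
                # A failed write leaves this adapter behind the others,
                # so the caller needs to retry or repair it.
                failed.append(name)
        return failed

    def read(self, purpose: str, key: str) -> Optional[Any]:
        """Read from the adapter configured for this purpose."""
        return self.adapters[self.read_from[purpose]].read(key)


store = FanOutStore(
    adapters={name: InMemoryAdapter(name) for name in ("elasticsearch", "couchbase", "mongodb")},
    read_from={"lookup": "couchbase", "search": "elasticsearch"},
)

failed = store.write("user:1", {"name": "alice"})
print(failed)
print(store.read("lookup", "user:1"))
```

One caveat with this approach is that a partial failure leaves the adapters out of sync until the failed write is retried, which is why the sketch reports the failed adapter names back to the caller.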

What are your thoughts on this?

@0xgeert
Author

0xgeert commented Oct 2, 2015

@andreimarinescu: great to know this is still on the radar!

I don't believe syncing the various databases (ES, Couchbase, etc.) is a big problem. After all, in a distributed ES or CB setup results are only eventually consistent anyhow (both choose A and P over C in CAP). Therefore, being 'in-sync' can never beat that upper threshold. In other words, the total system as a whole will be eventually consistent, with a 'consistency lag' that's equal to the ability of the slowest adapter (being either ES or CB) to get in a consistent state.

For the sake of clarity let me try to describe how I see the different roles of the system:

First a diagram by Martin Kleppmann (as part of reference 6, explained more later) that says it all:

[Figure: logs-35, the diagram from Martin Kleppmann's post (reference 6)]

  1. A KV-store with MVCC capability (or at least optimistic versioning) is used as the primary datastore. CB comes to mind, which supports optimistic versioning natively.
    • Clients should be kept in sync against this primary database
    • this includes (native) clientside databases to support offline clients, resyncing when they come back online. CB is wonderful here, since it supports syncing clientside databases through either Couchbase Mobile (1) or even PouchDB (2) which gets you even more clientside db support for free.
  2. Changes to the primary datastore, once committed and agreed (given optimistic versioning), are sent as events to Kafka. The CB -> Kafka connector has recently become part of CB core. 7)
  3. Multiple secondary datastores connect as consumers to Kafka and update their indexes accordingly. Examples of secondary indexes: ES, Druid, Redis (for some really volatile stats/counters perhaps), or as described in the diagram: HDFS, monitoring, Complex Event Processing systems (Storm, Spark streaming, Samza), etc. In short go wild..
    • Beauty here is that these secondary datastores in a way act only as advanced caches to the primary datastore. After all, all the secondary datastores can be regenerated from scratch based on the primary datastore when needed.
    • What's more, secondary datastores can be added after the fact. I.e.: You can add ES later and have the data in the primary datastore be replayed through Kafka to the new consumer. 3)
    • BUT, these are all caches with special abilities. ES allows for crazy complex queries, while Druid allows for really high-performance aggregate queries (aggregates are eagerly computed)
    • What's paramount here is that the secondary datastores (caches) are used for reads only
    • Writes to the secondary datastores only ever go through Kafka.
    • This guarantees that all Secondary Datastores are eventually consistent with the primary datastore, and thus the system as a whole is eventually consistent as well
    • Another way to look at the Secondary datastores is to treat them as separate views of the same data, all with different querying capabilities and performance characteristics.
    • Since these Secondary datastores are used for reads only, and update themselves eagerly with new updates (as opposed to ordinary caches which are lazy most of the time) these are sometimes called (in this setup) Eager Read Derivations. 4)
  4. A clientfacing API for reads would distribute queries based on type / characteristics to any of the Secondary datastores (and possibly the primary datastore, not really sure if there's a drawback to that)
  5. A clientfacing API for writes would go to the primary datastore.
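
The roles above can be sketched in a few lines. This is only a toy model under stated assumptions: `EventLog` is an in-memory list standing in for a Kafka topic that retains every event, `PrimaryStore` imitates optimistic versioning with a simple version counter, and `SecondaryStore` is a read-only view that is updated only by consuming the log. None of these names come from Telepat, Couchbase or Kafka.

```python
from typing import Any, Callable, Dict, List, Tuple


class VersionConflict(Exception):
    """Raised when a write is based on a stale version of the document."""


class EventLog:
    """Append-only list standing in for a Kafka topic (assumed to keep all events)."""

    def __init__(self) -> None:
        self.events: List[dict] = []

    def append(self, event: dict) -> int:
        self.events.append(event)
        return len(self.events) - 1  # offset of the new event


class PrimaryStore:
    """KV store with optimistic versioning; committed writes are emitted to the log."""

    def __init__(self, log: EventLog) -> None:
        self.log = log
        self.docs: Dict[str, Tuple[int, Any]] = {}  # key -> (version, value)

    def put(self, key: str, value: Any, expected_version: int) -> int:
        current_version = self.docs.get(key, (0, None))[0]
        if expected_version != current_version:
            raise VersionConflict(f"{key}: expected {expected_version}, found {current_version}")
        new_version = current_version + 1
        self.docs[key] = (new_version, value)
        self.log.append({"key": key, "value": value, "version": new_version})
        return new_version


class SecondaryStore:
    """Read-only derived view; its only write path is consuming the log."""

    def __init__(self, name: str, project: Callable[[Any], Any]) -> None:
        self.name = name
        self.project = project  # turns a document into this store's view of it
        self.view: Dict[str, Any] = {}
        self.offset = 0         # next log position to consume

    def catch_up(self, log: EventLog) -> int:
        """Apply all unseen events in order and return how many were applied."""
        applied = 0
        while self.offset < len(log.events):
            event = log.events[self.offset]
            self.view[event["key"]] = self.project(event["value"])
            self.offset += 1
            applied += 1
        return applied

    def lag(self, log: EventLog) -> int:
        """Number of events this store still has to consume."""
        return len(log.events) - self.offset


log = EventLog()
primary = PrimaryStore(log)
search = SecondaryStore("search", lambda doc: doc["title"].lower())

v1 = primary.put("post:1", {"title": "Hello", "likes": 0}, expected_version=0)
search.catch_up(log)
v2 = primary.put("post:1", {"title": "Hello World", "likes": 3}, expected_version=v1)

# A write based on the old version is rejected by the primary store.
stale_rejected = False
try:
    primary.put("post:1", {"title": "Stale", "likes": 0}, expected_version=v1)
except VersionConflict:
    stale_rejected = True

# A store added after the fact starts at offset 0 and replays the whole log.
stats = SecondaryStore("stats", lambda doc: doc["likes"])
stats.catch_up(log)
search.catch_up(log)
```

In this model the "consistency lag" of the whole system at any moment is simply the largest `lag()` across the secondary stores.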

I hope this post didn't come off as too pedantic. I'm just pretty excited about the possibilities. I really believe this is a great architecture. I've just never gotten around to building it myself. I'd be over the moon if you guys would consider picking this up.

More than happy to discuss more.

  1. http://www.couchbase.com/nosql-databases/couchbase-mobile
  2. http://blog.couchbase.com/first-steps-with-pouchdb--sync-gateway-todomvc-todolite
  3. This is probably for Telepat 2.0 :) but if you want to go full-blown event sourcing, replaying the state of the primary datastore is not enough. Instead you want to keep all events/mutations that have ever happened to the system. Cool thing here is that Kafka can actually keep all events ever recorded. By default, however, Kafka discards events once they are older than a retention period. This can be changed though. I once asked whether Kafka could be used as an event store and whether there are any caveats, to which one of its founders (Jay Kreps) replied positively. See 5) In fact, the architecture I describe is also excellently explained by Martin Kleppmann 6), who works at Confluent, which Kreps and others started to offer commercialized versions of Kafka. I have the feeling they want to start marketing Kafka as an event store as well...
  4. http://martinfowler.com/bliki/EagerReadDerivation.html
  5. http://stackoverflow.com/questions/17708489/using-kafka-as-a-cqrs-eventstore-good-idea
  6. https://martin.kleppmann.com/2015/05/27/logs-for-data-infrastructure.html
  7. http://blog.couchbase.com/introducing-the-couchbase-kafka-connector

@0xgeert
Author

0xgeert commented Oct 6, 2015

Hi @andreimarinescu. Just checking in to see whether you've seen the above. Thanks

@andreimarinescu
Member

Hi, Geert,

Apologies for the delayed responses. The reason for this is that we've been
running a series of benchmarks these days, in order to assess the general
viability of the project so far (it's already looking interesting). We'll
have a blog post on this as soon as we're done.

I've been delaying this discussion until then, as the results of our work
so far will give us an idea of what and how to prioritize next.

First of all, a big thank you for your detailed post. These kinds of
discussions make me feel so happy for working on an open-source project.
What I can tell you so far is that we're generally on the same page on how
things should look; it's a question of putting it all together. I'll
post a reply on this with my ideas going further as soon as we're done with
benchmarks and bugfixes. It's important for us to also start rolling out
well tested and solid releases even at these relatively early stages as we
seem to have attracted some attention with this.

Thanks again!
On Oct 6, 2015 9:26 PM, "Geert-Jan Brits" [email protected] wrote:

hi @andreimarinescu https://github.com/andreimarinescu. Just checking
in if you've seen above. Thanks



@0xgeert
Author

0xgeert commented Oct 6, 2015

Great to hear, @andreimarinescu. Looking forward to said blog post (and the later reply to this issue).

This could become awesome!

@0xgeert
Author

0xgeert commented Dec 15, 2015

I'll post a reply on this with my ideas going further as soon as we're done with
benchmarks and bugfixes.

hi @andreimarinescu: any updates already?

Mayhem93 added a commit that referenced this issue Dec 13, 2016