-
I don't see from this design what happens when the storage goes down entirely, e.g. when the net.box connection cannot be established at all. Will the router consider such a storage 'disabled'?
-
During a rolling update, if one of the functions reports that it is not ready (for example, the expected version is lower), this does not mean that functions from other modules are not ready either. In this regard, I believe that you need to separate the backoff of the whole storage connection from the backoff of individual functions (per connection).
-
If all storages are in a backoff state, then any request must be rejected immediately; it is pointless to spend the router's resources waiting for readiness. At the very least you need to limit the maximum number of requests that can wait. In addition, in this case you should use the circuit breaker pattern: closed -> open -> half-open -> closed or open. Upd. Or the router should wait for the "ready" state through the internal API, in which case the half-open state is not required.
-
The related issue is #298. The discussion starts with a description of the task as I understand it. Then I provide my vision of the API and behaviour, some insight into the internals, frequent questions, and alternatives.
Problems with how it works now
Storage boot and recovery can be long. While they are in progress, the storage can already have its listening port active and accept clients.
The clients are usually routers. They try to call `vshard.storage.*` functions, but that is not safe until the recovery/boot ends:
- During recovery the functions can access partially recovered data. If the WAL is long, that also means old data.
- Some of the `vshard.storage.*` functions might not be recovered yet. Routers will get failures like `box.error.NO_SUCH_FUNCTION`. What is worse, they will continue sending requests to this not-yet-ready storage.
How it should work
VShard storage should be able to tell the router to back off when the storage is not ready yet. The router then needs to stop sending new requests to this storage for a while and try again later.
Storage readiness is a complicated property that consists of multiple factors, including application-specific ones. These conditions are considered application-agnostic signs that the storage is not ready:
- `vshard.storage.cfg()` didn't end yet;
- `box.info.status == 'loading'` - recovery/boot is not done yet.
After `vshard.storage.cfg()` and recovery/boot are done, the user's application might want to make more changes and only then start accepting requests. This means the storage needs to be able to enable/disable requests.
From the router's point of view the matter is a little more complicated, because it can't call `box.info` on a storage if the storage does not yet have any vshard functions. Or the functions might be there, but the access rights to them might not be recovered yet.
To work around that, the router will need to track certain error codes for `vshard.storage.*` functions and how they are raised. For instance, if a call to `vshard.storage.call(user_func)` throws an exception with code `box.error.NO_SUCH_FUNCTION`, it means `vshard.storage.call` itself does not exist. But if the same error is returned gracefully, it means `vshard.storage.call` worked fine, but `user_func` or something inside of it did not.
API and behaviour
Storage
Disable makes the storage stop accepting new vshard API requests. If the storage functions are already available, they will return a special error code, `vshard.error.code.STORAGE_DISABLED`.
By default the storage is enabled, for backward compatibility. To leverage the feature a user would need to do this:
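A minimal sketch of that sequence, assuming the `vshard.storage.disable()` and `vshard.storage.enable()` calls proposed here. The `start_storage` helper, its `vshard` parameter (the loaded module is passed in so the order of calls is easy to see), and the `app_init` callback are hypothetical names used only for illustration:

```lua
-- Hypothetical helper: boot a storage while keeping it closed for routers.
-- `vshard` is the loaded vshard module; `cfg` and `instance_uuid` are whatever
-- the application already passes to vshard.storage.cfg(); `app_init` is the
-- application's own preparation step (an assumption, not part of vshard).
local function start_storage(vshard, cfg, instance_uuid, app_init)
    -- Proposed API: reject vshard requests until the application is ready.
    vshard.storage.disable()
    vshard.storage.cfg(cfg, instance_uuid)
    -- Application-specific work: create functions, grant access, and so on.
    app_init()
    -- Proposed API: start accepting requests from routers.
    vshard.storage.enable()
end
```

On a real instance this might be invoked as `start_storage(require('vshard'), cfg, instance_uuid, app_init)`.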
Later `vshard.storage.disable()` can be called again at any time, for instance if the storage enters a broken state such as a very outdated orphan replica.
Neither `enable` nor `disable` does any waiting or even yielding. They simply switch some flags/meta-tables inside the storage immediately.
Router
The router sends requests as usual and checks errors for special codes, but only when it talks to the `vshard.storage.*` API. This means that `replicaset:call()` functions won't support the disable, because they call the user's functions directly. But `vshard.router.call()` will support it, because it calls user functions via `vshard.storage.call()`.
If a call to a `vshard.storage.*` function raises an exception with code `box.error.NO_SUCH_FUNCTION`, `box.error.ACCESS_DENIED`, or `vshard.error.code.STORAGE_DISABLED`, then the faulty replica is put into a backoff queue for some time (on the order of seconds). The router won't send new requests to it until the timeout expires.
The failed request is retried transparently via another instance if there is a suitable one (not always the case: if an RW request fails, then it fails for good because there is just one master).
If there is no replica to retry the request on and the request's timeout hasn't expired yet, then the router will wait for the backoff timeout and try this request on the same replica again. This is repeated until the request's timeout expires, or it succeeds, or it gets a critical error (network, luajit, etc).
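The retry loop described above can be modelled in plain Lua. This is only a sketch under stated assumptions, not the router's real code: `call_with_backoff`, the replica tables, the 5-second `BACKOFF_TIMEOUT`, the `'TIMEOUT'` code, and the string error codes (stand-ins for `box.error.NO_SUCH_FUNCTION`, `box.error.ACCESS_DENIED`, and `vshard.error.code.STORAGE_DISABLED`) are all hypothetical. The clock and the sleep function are passed in so the logic can be followed without a running instance:

```lua
-- Stand-ins for the real error codes that put a replica into backoff.
local BACKOFF_CODES = {
    NO_SUCH_FUNCTION = true,
    ACCESS_DENIED = true,
    STORAGE_DISABLED = true,
}
local BACKOFF_TIMEOUT = 5 -- seconds; an assumed value ("order of seconds")

-- replicas: array of {backoff_until = 0, call = function() ... end}.
-- A replica's call() returns a result or raises a table {code = ...}.
-- now() returns the current time, sleep(sec) waits; both are injected.
local function call_with_backoff(replicas, timeout, now, sleep)
    local deadline = now() + timeout
    while true do
        local soonest = nil
        for _, r in ipairs(replicas) do
            if r.backoff_until <= now() then
                local ok, res = pcall(r.call)
                if ok then
                    return res
                end
                if type(res) ~= 'table' or not BACKOFF_CODES[res.code] then
                    -- Critical error: give up without retrying.
                    return nil, res
                end
                -- The storage is not ready: skip it for a while.
                r.backoff_until = now() + BACKOFF_TIMEOUT
            end
            if soonest == nil or r.backoff_until < soonest then
                soonest = r.backoff_until
            end
        end
        if soonest == nil or soonest > deadline then
            -- No replica leaves backoff before the request's timeout expires.
            return nil, {code = 'TIMEOUT'}
        end
        -- Wait until the earliest backoff ends, then try again.
        sleep(math.max(0, soonest - now()))
    end
end
```

With a single replica the loop degenerates into "wait for the backoff timeout and try the same replica again", which matches the last case described above.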
It is safe to assume that if `vshard.storage.call(user_func)` raised an exception with one of the codes above, then `user_func` wasn't invoked, because `user_func` is called using `pcall()`, so its error is always converted into the `nil, err` format.
That in turn allows the router to retry these raised exceptions safely and transparently.
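The difference between a raised error and a gracefully returned one can be sketched in plain Lua. This is a simplified model, not vshard's actual code: `storage_call` stands in for `vshard.storage.call`, `classify` is a hypothetical router-side helper, and the string codes stand in for the real `box.error.*` and `vshard.error.code.*` values:

```lua
-- Stand-ins for the real error codes that are safe to retry when raised.
local RETRYABLE = {
    NO_SUCH_FUNCTION = true,
    ACCESS_DENIED = true,
    STORAGE_DISABLED = true,
}

-- Simplified model of the storage side: the user function runs under pcall(),
-- so any error it raises comes back in the `nil, err` format.
local function storage_call(user_func, ...)
    local ok, res = pcall(user_func, ...)
    if not ok then
        return nil, res
    end
    return res
end

-- Hypothetical router-side helper. `remote_call` models the network call to
-- the storage: it raises if the storage function itself is unavailable.
local function classify(remote_call)
    local ok, res, err = pcall(remote_call)
    if not ok then
        if type(res) == 'table' and RETRYABLE[res.code] then
            -- Raised with a known code: the user function never started.
            return 'retry', res
        end
        return 'fail', res
    end
    if res == nil and err ~= nil then
        -- Returned gracefully: the storage function ran, user code failed.
        return 'user_error', err
    end
    return 'ok', res
end
```

Only the `'retry'` outcome is resent automatically in this model, since it is the only one that implies the user function was never invoked.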
FAQ
Is it possible to make the storage disable/enable self fully automatically?
No, because the user's application needs to create its own functions to call them from the router. For example, a function `customer_lookup(id)` must be created in `_func` in order to be able to do `vshard.router.call('customer_lookup')`. Otherwise the application could grant access to all functions to the needed user, but that is also a schema change which is likely to be done after `vshard.storage.cfg`.
That means that even if `vshard.storage.cfg` is done, the storage is not necessarily ready. And there is no way vshard could see automatically when upper-level code is ready.
Alternatives
Callback on each storage request to allow it or not
It was possible to allow specifying a Lua callback which would be called on each `vshard.storage` request and would return `true/false` depending on whether the request is allowed for execution.
The main problems of this approach are:
- Need to implement this function instead of calling existing API;
- Unclear what to do for internal requests such as `vshard.storage._call()`, automatic bucket discovery done by routers via `vshard.storage.buckets_discovery()`, and more. Should the user's callback also be called for these? The user might be surprised to see incoming requests even though they do not make any themselves.
- Performance would be lower. The `enable/disable` approach can be done practically for free: `vshard.storage` functions are substituted by pointer. Disabled versions simply raise an error immediately, and enabled versions work exactly like now - no additional calls, no `if`s.
It was decided to go with the `enable/disable()` API.
Make box.info.status "orphan" = disabled
It is not safe, because an orphan replica could still be up to date and fully consistent and simply not have enough connections to other replicas. Then it can still handle RO requests just fine, and it would be a mistake to disable it.
With `enable/disable()` as the API, though, users can disable orphans manually if they want.