Looks great, but it seems that requiring `bucket_id` in the request on the client is a big problem. Why not use a `sharding_keys` array, which would make it unnecessary to calculate the `bucket_id` on the client? According to the roadmap, DDL and CRUD will support custom sharding functions out of the box in the near future.
---
I did some benchmarks. Locally on a Mac and on a Linux server, the results are almost the same (comparatively, not in absolute timing). Overall, it doesn't look like a killer feature, TBH. Here is the source code: https://github.com/tarantool/vshard/tree/gerold103/benchmark/bench. I run it like this: …
Then observe the results. Time is measured for getting the responses; their sending is not counted. To try various parameters, change … The bench tried to compare normal calls against raw ones.

### Test 1 - a single number
Quite a bad result. Here …

### Test 2 - several big numbers are sent and received
Well, more than twice as fast now! The arguments are an array of 10 items. Each is …

### Test 3 - big complex multi-level objects, JSON, sent and received
The big JSON object is a big Lua table with many fields. See the source code of the bench. An even bigger difference. Obviously, the more decode/encode work the normal call does, the faster the raw one is in comparison.

### Test 4 - big JSON, not many of them per call, but more calls
An expected result. Fewer arguments = less difference. But still a good improvement.

### Test 5 - empty arguments, many numbers in result
Select does not take any arguments but returns an array of tuples. Each has 5 numbers, each number wrapped into 5 arrays. The difference is much less notable now. Only one part of the data stream competes for TX time on the router.

### Test 6 - empty arguments, big JSONs in result
When the select results get bigger, the difference grows again in favor of the raw calls.

### Test 7 - a lot of flat integers in result
Router uses … I've also collected … The reason is that … In … In plain usage, one can see that the CPU time is spread all over Lua, not just in the GC: string allocation (`lj_str_new`), encode/decode (…). In both traces one can see that some time is spent in syscalls. This is because the netbox sockets live in the TX thread.

### Summary

The outcome of this, IMO, is that the router can't be truly optimized while it is written in Lua. And maybe not even while it works in the Tarantool runtime with just one thread. Even if you scale the iproto threads, the netbox connections to the storages are still in TX. Moreover, iproto threads suffer from connections being pinned to threads, so that would make the router's load on its threads imbalanced.
---
The related issue is #312. The discussion starts with a description of how the task looks in my understanding. Then I provide my vision of the API and behaviour, some insights into the internals, frequent questions, and alternatives.
## Problems with how it works now
The router is sometimes used as a proxy, not as a client itself. In that case it does a lot of unnecessary MessagePack decoding in Lua. For instance, when `vshard.router.call()` is called by a remote client, firstly the router will decode the arguments of `vshard.router.call()` itself. Secondly, when the call is executed on the storage, the router will decode the storage's response. The decoding itself is cheap, but the results are pushed to Lua, and that is very expensive. It puts a lot of load on the Lua GC.
But most of that decoding work is not needed. Indeed, the signature of `vshard.router.call()` is `bucket_id, opts, func, args, netbox_opts`. The heaviest part is `args`, which holds the storage function's arguments, and the router doesn't need them. When the function's result is returned from the storage, the router doesn't need it either: it is forwarded back to the client as is.
## How it should work
When used as a proxy, the router must not decode unnecessary data. It should only decode the few light arguments of `vshard.router.call()` and leave the rest untouched as a binary buffer. Since 2.10.0-beta2, Tarantool supports 2 APIs which make the idea possible:
- `box.schema.func` with `takes_raw_args = true`. Then the function takes a single argument in Lua: an `msgpack` object which internally stores the array of arguments in a plain MessagePack buffer, received from the outside as is.
- The netbox `return_raw = true` option. Then the result is an `msgpack` object, regardless of what the remotely called function returned.

It is proposed to utilise these new features as follows.
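As a quick illustration, here is a sketch of the two APIs together; the function name `echo`, the URI, and the schema setup are invented for the example:

```lua
-- Sketch of the two Tarantool >= 2.10.0-beta2 APIs described above.
-- The function name 'echo' and the URI are illustrative only.
local netbox = require('net.box')

-- 1. takes_raw_args: the function receives one 'msgpack' object
--    instead of decoded Lua values; nothing is pushed to Lua.
box.schema.func.create('echo', {takes_raw_args = true})
rawset(_G, 'echo', function(raw_args)
    return raw_args -- returned as is, still a raw MessagePack buffer
end)

-- 2. return_raw: the call result stays an 'msgpack' object on the
--    caller's side until it is explicitly decoded.
local conn = netbox.connect('localhost:3301')
local res = conn:call('echo', {1, 2, 3}, {return_raw = true})
-- res:decode() would materialize the Lua values only when needed.
```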
## API and behaviour
### Storage
Nothing changes. The storage needs to decode the user's function arguments anyway.
### Router
#### Part 1
Firstly, the router will need to support the `return_raw` option in all `vshard.router.call*` functions. It is going to be needed regardless of what happens next. When it has that option, the router will forward it to the `conn:call('vshard.storage.call', ...)` invocations inside and will decode only the first small part of the result to find out whether it is an error, whether it needs a retry, etc.

#### Part 2
Support for the `return_raw` option alone is not enough. The arguments of `vshard.router.call()` itself are still decoded when it is called by a remote instance via iproto. It is proposed to add a new function: `vshard.router.callraw(raw_args)`.

The `raw_args` is expected to be a Lua object of type `msgpack`. It should be a MessagePack array with the same arguments as `vshard.router.call()`.

The function should be registered in `box.schema.func` by the user if they want to utilise it. The router can't do it itself because it shouldn't depend on the schema in any way. Internally, that new function will perform something like this:
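As a sketch, assuming Tarantool's `msgpack` object iterator API (`:iterator()`, `:decode_array_header()`, `:decode()`, `:take()`); the exact argument handling here is my guess, not the final code:

```lua
-- Hypothetical sketch of vshard.router.callraw() internals.
local function callraw(raw_args)
    -- raw_args is an 'msgpack' object: a MessagePack array with the
    -- same elements as vshard.router.call()'s arguments.
    local iter = raw_args:iterator()
    iter:decode_array_header()
    -- The light arguments are cheap to decode into Lua.
    local bucket_id = iter:decode()
    local opts = iter:decode()
    local func = iter:decode()
    -- The heavy storage args are taken as a new 'msgpack' object
    -- referencing the same buffer - nothing is pushed to Lua.
    local storage_args = iter:take()
    -- (Handling of optional trailing arguments is omitted here.)
    local netbox_opts = iter:decode() or {}
    -- Ask netbox to keep the storage's response raw as well.
    netbox_opts.return_raw = true
    return vshard.router.call(bucket_id, opts, func, storage_args,
                              netbox_opts)
end
```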
This is how the usage would look on the client:
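As a hedged sketch (the URI, bucket id, function name, and arguments are all invented for the example):

```lua
-- Hypothetical client-side usage. Thanks to takes_raw_args on the
-- router, these arguments are never decoded into Lua there.
local netbox = require('net.box')

local conn = netbox.connect('router_host:3301')
local bucket_id = 1234
local res = conn:call('vshard.router.callraw',
                      {bucket_id, {timeout = 1}, 'customer_add',
                       {{customer_id = 1, name = 'John'}}})
```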
The value of `args` is decoded only once, on the target storage, and nowhere else.

## FAQ
### Wouldn't it be faster to put the storage args at the end of the `vshard.router.callraw()` signature?

This could be asked by someone who thinks that `local storage_args = iterator:take()` won't be needed if `storage_args` is the last argument of `vshard.router.callraw()`.

It won't help: `iterator:take()` is inevitable, because `iterator:decode()` does not change the original `msgpack` object. The latter still contains the full array of all `callraw` arguments. However, it is not a big deal, since `iterator:take()` won't push anything to Lua. It will only call `mp_next()` inside and create a new `msgpack` object from the result. The data won't even be copied; it will reference the original buffer.

### How much faster is it to use `vshard.router.callraw` instead of `vshard.router.call` from a client?

I have no idea. It isn't implemented yet. But old benchmarks show that a lot of time was spent pushing data onto the Lua stack on the router, and it got worse as `storage_args` got bigger. Once I have this feature implemented in any form, I will make a benchmark and update this RFC.
## Alternatives
### Make `vshard.router.call()` able to accept `msgpack` args

The idea is to allow calling `vshard.router.call()` in 2 ways at the same time:

- `vshard.router.call(msgpack_args)`
- `vshard.router.call(bucket_id, router_opts, func, args, netbox_opts)`

It could make the API more compact, but on the other hand it would complicate the most used function of the router and slow it down a bit, as it would need to branch depending on its arguments' type. Hence it was decided to go with a new router function.
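For illustration, the per-call branch the combined API would force into the hot path could look roughly like this; `call_impl` and `callraw_impl` are hypothetical helpers:

```lua
-- Hypothetical: the dispatch vshard.router.call() would need if it
-- accepted both signatures. This branch runs on every single call.
local msgpack = require('msgpack')

local function call(first, ...)
    if msgpack.is_object(first) then
        -- New variant: call(msgpack_args).
        return callraw_impl(first)
    end
    -- Classic variant:
    -- call(bucket_id, router_opts, func, args, netbox_opts).
    return call_impl(first, ...)
end
```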
### Call the new function `vshard.router.callproxy` instead of `vshard.router.callraw`

It could look more explicit on the client. But on the other hand, it can happen that the arguments are very light, so their decoding is actually faster than carrying them around in a Lua `msgpack` object. Then calling `vshard.router.call` would be just fine, and the name `callproxy()` would raise the question of why clients shouldn't always use it.

Thus it was decided to use `callraw`.

### Maybe drop `callraw` entirely if it just calls a few `:decode()` and one `:take()`?

Its implementation is provided above, and it is indeed quite simple. I am actually thinking about not introducing it at all, so as to keep the router's API simpler. Users will need to call `box.schema.func.create()` anyway; in the same place they could implement it on their own.