Criticism of the benchmarkings #44
I've done further testing in response to your assertions. First, I added a binary mode to the websocketpp, uwebsocket, and Go (golang.org/x/net/websocket) servers and updated the benchmark client to support it. Performance of all servers increased roughly proportionally to the message size reduction (the same message encoded in binary is about 75% the size of the message encoded in JSON), but not more. As you mention, a single broadcast from the server will cause thousands of JSON decodes in the client, but this test indicates that, because multiple client machines run in parallel, that was not a limiting factor.

Next, I tested using C++ instead of Go for the benchmark client. In addition, the C++ benchmark does not use a full websocket implementation; it works directly with libuv. I used your throughput benchmark as the starting point and updated it to run the same test as the Go tool's binary broadcast test. It was indeed able to get higher results than a single Go client, but running the Go tool on multiple machines in parallel produces higher results.

None of the above tests found the dramatic differences your assertions would lead me to expect. However, I think I found something that explains the substantially different results we observe. I noticed the benchmarks for uwebsockets are hard-coded to connect to 127.0.0.1. This could confound the results in two ways. First, the client and server are running on the same machine, so any resources taken by the benchmark client have a direct negative effect on the server. This explains getting a substantially different result from a very low overhead C++ client versus a more heavyweight Go client. Second, using a loopback interface instead of an actual network incurs far less overhead, which allows much higher numbers than are possible on a real network.
I do not see the fact that most implementations are within 50% of each other as a flaw; I see it as a valid data point that, for this particular workload, the choice of language and library should probably not be decided on throughput alone. For other workloads, results may be substantially different. The raw results are here: https://github.com/hashrocket/websocket-shootout/blob/master/results/round-02-binary.md. The C++ benchmark is here: https://github.com/hashrocket/websocket-shootout/tree/master/cpp/bench.
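For what it's worth, removing that confound mostly means taking the target address from the command line instead of hard-coding 127.0.0.1, so the same benchmark binary can be pointed at a server on a separate machine. A minimal sketch (the argument order and defaults here are illustrative, not taken from either repo):

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

int main(int argc, char *argv[]) {
    // Fall back to loopback only when no address is given; a fair run should
    // pass the server's LAN address so client overhead stays off the server box.
    std::string host = argc > 1 ? argv[1] : "127.0.0.1";
    int port = argc > 2 ? std::atoi(argv[2]) : 3000;

    std::cout << "benchmarking ws://" << host << ":" << port << std::endl;
    // ... connect the benchmark client to host:port here instead of a
    // hard-coded 127.0.0.1 ...
    return 0;
}
```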
To validate Chapter 6 of my first post, and to really show you how flawed your "benchmark of websocket libraries" is, I made my own server with uWS and it performs hundreds of percent better than the one you wrote (using the very same uWS):
Just like Chapter 6 states, a broadcast is ultimately going to end up being a loop of syscalls (which is a constant workload for all servers). That's why it is important to know what you are doing when implementing things like pub/sub and similar things (like this very benchmark of yours). You cannot use your grandmother as a test subject when testing how fast a sports car is and then conclude, based on the fact that your grandmother didn't go any faster, that "all cars are the same speed". What you benchmark in that case is your grandmother, not the car.

By implementing a very simple server based on my own recommendations from this repo: https://github.com/alexhultman/High-performance-pub-sub I was able to get results on your own benchmark close to 5x different from those you came up with. You need to stop tainting the benchmark with your own shortcomings. You cannot conclude that uWS is "about the same" as other low-perf implementations when the issue is what you put on top of the library. A server will not just magically be fast because you swapped to uWS; it requires that you know how to use it and the surrounding low-level matters.

Stick with the echo tests, they are standard in this industry: they benchmark receiving performance (parsing + memory management) as well as sending performance (framing and memory management). Everything else is up to the user; it's not part of the websocket library. Node.js, Apache, h2o, NGINX and all those HTTP servers measure performance in requests per second, aka echo, simply because that is the only way to show (without tainting the server with user code) the performance of the server and only the server. For reference, this is the result I get with the server you wrote in uWS:
As you can see, the difference is major. Yet the very same websocket library has been utilized. I hope this will get you to realize how flawed this benchmark is. This is yet again validating my very first post, "Chapter 6".
Can you share the code for this?
Yes I can post it, but it would be very unfair if you used it, since the other servers would be using a different broadcasting algorithm. This is what I have currently; it depends on a new function which is not fully decided on yet, but should land some time soon (I have discussed this function for a while with other people doing pub/sub):

```cpp
#include <uWS/uWS.h>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

// One pending broadcast: the raw message and the socket that sent it.
struct Sender {
    std::string data;
    uWS::WebSocket<uWS::SERVER> ws;
};

std::vector<Sender> senders;
uWS::Hub hub;
bool newThisIteration, inBatch;

int main(int argc, char *argv[]) {
    // Dummy timer: keeps the loop from blocking in the poll phase while a
    // batch is pending, so the check handler below runs again promptly.
    uv_timer_t timer;
    uv_timer_init(hub.getLoop(), &timer);

    uv_prepare_t prepare;
    prepare.data = &timer;
    uv_prepare_init(hub.getLoop(), &prepare);
    uv_prepare_start(&prepare, [](uv_prepare_t *prepare) {
        if (inBatch) {
            uv_timer_start((uv_timer_t *) prepare->data, [](uv_timer_t *t) {}, 1, 0);
            newThisIteration = false;
        }
    });

    // Check handler runs after I/O polling: if no new broadcast arrived this
    // iteration, flush the whole batch as one prepared (pre-framed) message.
    uv_check_t checker;
    uv_check_init(hub.getLoop(), &checker);
    uv_check_start(&checker, [](uv_check_t *checker) {
        if (inBatch && !newThisIteration) {
            std::vector<std::string> messages;
            std::vector<int> excludes;
            for (Sender &s : senders) {
                messages.push_back(s.data);
            }
            if (messages.size()) {
                // Frame all batched messages once, send the shared prepared
                // buffer to every connection, then release it.
                uWS::WebSocket<uWS::SERVER>::PreparedMessage *prepared = uWS::WebSocket<uWS::SERVER>::prepareMessageBatch(messages, excludes, uWS::OpCode::BINARY, false, nullptr);
                hub.getDefaultGroup<uWS::SERVER>().forEach([&prepared](uWS::WebSocket<uWS::SERVER> ws) {
                    ws.sendPrepared(prepared, nullptr);
                });
                uWS::WebSocket<uWS::SERVER>::finalizeMessage(prepared);
            }
            // Acknowledge each original sender with an 'r' (result) message.
            for (Sender &s : senders) {
                s.data[0] = 'r';
                s.ws.send(s.data.data(), s.data.length(), uWS::OpCode::BINARY);
            }
            senders.clear();
            inBatch = false;
        }
    });

    hub.onMessage([](uWS::WebSocket<uWS::SERVER> ws, char *message, size_t length, uWS::OpCode opCode) {
        switch (message[0]) {
            case 'b': // broadcast request: queue it for the batched flush
                senders.push_back({std::string(message, length), ws});
                newThisIteration = true;
                inBatch = true;
                break;
            case 'e': // echo request: send straight back
                ws.send(message, length, opCode);
        }
    });

    hub.listen(3000);
    hub.run();
}
```

I landed the initial commit here: uNetworking/uWebSockets@e4b7584
I love the fact that you've put together a nice set of socket implementations in various languages (especially Elixir!). I would very much like to see a more optimized version of the Node implementation, though. If it took advantage of inline caching and V8 Crankshaft's optimizer, it could be doing dramatically better, I think. Most.js does an amazing job at that: https://github.com/cujojs/most/tree/master/test/perf
Good write-up. I also wonder why the
Chapter 1: You are benchmarking the client, not the server
Let's look at the client you are using to "benchmark" these servers:
Golden rule of benchmarking: benchmark the server, NOT the client. You are benchmarking a high performance C++ server with a low performance golang client. Every JSON receive server side (this whole JSON-tainting story is a chapter of its own) will result in many receives client side. In fact, if you look at µWS as an example, the only thing happening user-side of the server is:
So what are you benchmarking server side here? Well, you are benchmarking the receive of one WebSocket frame followed by one JSON parse and one WebSocket frame formatting -> the rest is 100% the operating system (aka, there is no theoretical way to make it any more efficient).
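To make that concrete, here is a minimal sketch (not taken from the shootout repo; the RapidJSON usage and message handling are assumptions) of roughly all the user-space work such a broadcast server does per incoming message before the operating system takes over:

```cpp
#include <uWS/uWS.h>
#include <rapidjson/document.h>
#include <rapidjson/stringbuffer.h>
#include <rapidjson/writer.h>
#include <string>

int main() {
    uWS::Hub hub;

    hub.onMessage([&hub](uWS::WebSocket<uWS::SERVER> ws, char *message, size_t length, uWS::OpCode opCode) {
        // The library has already parsed exactly one WebSocket frame by the
        // time this callback runs.

        // One JSON parse (the dominant user-space cost for small frames).
        std::string json(message, length);
        rapidjson::Document d;
        d.Parse(json.c_str());
        if (d.HasParseError()) {
            return;
        }

        // One JSON serialization of the reply payload.
        rapidjson::StringBuffer out;
        rapidjson::Writer<rapidjson::StringBuffer> writer(out);
        d.Accept(writer);

        // One WebSocket frame formatting + send per recipient; from here on
        // the work is framing, copying and syscalls, i.e. the operating system.
        hub.getDefaultGroup<uWS::SERVER>().forEach([&out](uWS::WebSocket<uWS::SERVER> recipient) {
            recipient.send(out.GetString(), out.GetSize(), uWS::OpCode::TEXT);
        });
    });

    hub.listen(3000);
    hub.run();
}
```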
Now, let's look at the client side: since you are using a low performance golang client with a full WebSocket implementation, every broadcast will result in thousands of WebSocket frame parsings client side. Are you starting to get what I'm pointing at now? You are benchmarking 1 WebSocket frame parsing + 1 JSON parse server side, followed by thousands of WebSocket frame parsings client side, and you are parsing these in golang!
I immediately saw a HUGE tainting factor client-side when I started benchmarking WebSocket servers. So what did I do about it? I wrote the client in low-level TCP in C++ and made sure the server was stressed 100% all the time. This dramatically increased the gap between the slow WebSocket servers and the fast ones (as you can see in my benchmark, WebSocket++ is many tens of times faster than ws).
If you are going to act like you are benchmarking a high performance server, you had better write a client that is capable of outperforming the server; otherwise you are not benchmarking anything other than the client. No matter how many client instances you have, there is still a massive difference between having many slow clients in a cluster and having one ultra fast one. You are completely tainting any kind of result by having this client.
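For illustration, a minimal sketch of that idea (the address, port and payload below are placeholders, and the HTTP Upgrade handshake plus draining of the server's replies are omitted): pre-frame one masked message once, then write it over a raw TCP socket in a tight loop, so the client adds essentially no per-message work.

```cpp
#include <arpa/inet.h>
#include <sys/socket.h>
#include <unistd.h>
#include <string>
#include <vector>

// Build one masked client->server WebSocket frame (FIN, binary opcode,
// payload shorter than 126 bytes). Done once, outside the hot send loop.
static std::vector<unsigned char> buildFrame(const std::string &payload) {
    unsigned char mask[4] = {0x12, 0x34, 0x56, 0x78};
    std::vector<unsigned char> frame;
    frame.push_back(0x82);                                   // FIN + binary opcode
    frame.push_back(0x80 | (unsigned char) payload.size());  // MASK bit + length
    frame.insert(frame.end(), mask, mask + 4);
    for (size_t i = 0; i < payload.size(); i++) {
        frame.push_back(payload[i] ^ mask[i % 4]);           // masked payload
    }
    return frame;
}

int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(3000);
    inet_pton(AF_INET, "192.168.1.10", &addr.sin_addr);  // a real NIC, not loopback
    connect(fd, (sockaddr *) &addr, sizeof(addr));
    // ... perform the WebSocket upgrade handshake here ...

    std::vector<unsigned char> frame = buildFrame("echo me");

    // Hot loop: no per-message framing, parsing or allocation on the client,
    // so the server is the component under stress. A real benchmark client
    // also has to read back and discard the echoed frames (omitted).
    for (int i = 0; i < 1000000; i++) {
        write(fd, frame.data(), frame.size());
    }
    close(fd);
}
```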
Chapter 2: the broadcasting benchmark in general
You told me that you were not able to see any difference when doing an echo test, so instead you made this broadcast test. That statement alone solidifies my criticism: your client is so slow that it doesn't make any difference whether you have a fast server or a slow one, while in my tests I can see dramatic differences in server performance even when only dealing with 1 single echo message! I can see a 6x difference in performance between ws and µWS with 1 single echo message, and up to 150x when doing thousands of echoes per TCP chunk. But my point is not the 150x; my point is that it is absolutely possible to showcase a massive difference in performance when doing simple echo benchmarks! But like I said: it requires that your client is able to stress the server, and that means you cannot possibly write it in golang with the standard golang bullshit WebSocket client implementation.
Chapter 3: the JSON tainting
Like you have already heard, the fact that you benchmark 1 WebSocket frame parsing together with 1 JSON parsing, where the JSON parsing is majorly dominant, is simply unacceptable. And you pass this off as a WebSocket benchmark! Parsing JSON is extremely slow compared to parsing a WebSocket frame: every single byte of the JSON data has to be scanned for a matching end token (if you are inside of a string, it has to check EVERY BYTE for the end token). Compare this to the WebSocket format, where the length of the whole message is given in the header, which makes the parsing O(1) while the JSON parsing is AT LEAST O(n).
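To illustrate the O(1) claim, here is a sketch of how the payload length is read straight out of a WebSocket frame header (server side, ignoring masking and error handling; this follows the RFC 6455 wire format and is not code from any of the libraries discussed):

```cpp
#include <cstdint>

// Extract the payload length from the start of a WebSocket frame header.
// A fixed number of header bytes is inspected regardless of payload size,
// which is why frame parsing is O(1); JSON has no such length prefix and
// must scan the whole payload.
uint64_t payloadLength(const unsigned char *header) {
    uint64_t len = header[1] & 0x7f;  // low 7 bits of the second byte
    if (len == 126) {
        // 16-bit extended length, network byte order
        len = ((uint64_t) header[2] << 8) | header[3];
    } else if (len == 127) {
        // 64-bit extended length, network byte order
        len = 0;
        for (int i = 0; i < 8; i++) {
            len = (len << 8) | header[2 + i];
        }
    }
    return len;
}
```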
Chapter 4: the threading and other random variance
Some servers are threaded, some are not. Some servers are implemented with hash tables, some are not. Some servers use RapidJSON, some have other JSON implementations. You simply have WAY too many variables going all random to give any kind of stable result. Comparing a server utilizing 8 CPU cores with a server restricted to 1 is just mindblowingly invalid. It's not just a bunch of "threads" you can toss in to get a speed-up; you also need to take into account the efficient and the inefficient ways of using threading. That varies with implementation.
Chapter 5: gold comments
Chapter 6: low-level primitives vs. high level algorithms
A WebSocket library exposes some very fundamental and low-level functionalities that you as an app developer can use to construct more complex algorithms, like for instance, efficient broadcasting. What this benchmark is trying to simulate is very close to a pub/sub server: you get an event and you push this to all the connected sockets.
Now, as you might know, broadcasting can be implemented with a simple for-loop and a call to the WebSocket send function. This is what you are doing in this "benchmark". The problem with this is that that kind of algorithm for distributing 100 events to X connections is very far from efficient, and it does not reflect the underlying low-level library as much as it reflects your own abstract interpretation of "pub/sub".
As an example, I work for a company where pub/sub is part of the problem to optimize. This pub/sub was implemented with a for-loop and a call to send for each socket. I changed this into a far more efficient algorithm that merges the broadcasts and prepares the WebSocket frames for them in an efficient way. This resulted in a 270x speed-up and far outperforms the most common pub/sub implementations out there. Had I used a slow server as the low-level implementation, this speed-up would not have been even remotely possible. Yet it still required me to design the algorithm efficiently.
My point is, you cannot benchmark the low-level fundamentals of a library by benchmarking your own inefficient for-loop that pretty much just calls into the kernel and leaves no room for the user-space server to shine.
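To make the contrast concrete, here is a sketch of the two approaches side by side. The merged variant reuses the prepareMessageBatch / sendPrepared / finalizeMessage calls from the server posted earlier in this thread (a function its author says has not fully landed yet); everything else is illustrative.

```cpp
#include <uWS/uWS.h>
#include <string>
#include <vector>

// Naive broadcast, as in the shootout servers: every recipient gets its own
// frame formatting, buffer copy and send call.
void naiveBroadcast(uWS::Hub &hub, const std::string &msg) {
    hub.getDefaultGroup<uWS::SERVER>().forEach([&msg](uWS::WebSocket<uWS::SERVER> ws) {
        ws.send(msg.data(), msg.length(), uWS::OpCode::BINARY);
    });
}

// Merged broadcast in the spirit of the server posted above: frame all pending
// messages once, hand the same prepared buffer to every recipient, release it.
void mergedBroadcast(uWS::Hub &hub, const std::vector<std::string> &pending) {
    std::vector<std::string> messages = pending;
    std::vector<int> excludes;
    uWS::WebSocket<uWS::SERVER>::PreparedMessage *prepared =
        uWS::WebSocket<uWS::SERVER>::prepareMessageBatch(messages, excludes, uWS::OpCode::BINARY, false, nullptr);
    hub.getDefaultGroup<uWS::SERVER>().forEach([&prepared](uWS::WebSocket<uWS::SERVER> ws) {
        ws.sendPrepared(prepared, nullptr);
    });
    uWS::WebSocket<uWS::SERVER>::finalizeMessage(prepared);
}
```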
End notes
This benchmark is completely flawed and does not in any way show the real personalities of the underlying WebSocket servers. I know for a fact that WebSocket++ far outperforms most other servers, and that needs to be properly displayed here. The point of a good benchmark is to maximize the result difference between the test subjects. You want to show differences in multiples (x), not in minor percentages.