Disconnection from testit client can crash server #3

Open
matthewcurry opened this issue Feb 4, 2016 · 4 comments

@matthewcurry
Collaborator

After disconnect, server displays this message, then segfaults:

[Wrn|src/conn/ssmptcp_conn_read.c: 13] couldn't read
[ 7736 ]  error   @server/asg_exec.ae:780
unrecognized op: 2013266084
Segmentation fault

Expected output varies, but one example of reasonable output is:

[ 22692 ] error @server/asg_exec:476 error: Link has been severed
[ 22692 ] error @server/asg_exec:320 error: Link has been severed
[ 22692 ] error @server/asg_exec:345 error: Link has been severed

This is extremely intermittent.

@matthewcurry
Collaborator Author

This behavior is due to improper error handling in the server-side implementation of the ASG API. With the superficial bug fixed, so that the error code from the failed xm_get is actually picked up, the cleanup code then causes further issues. Output below.

[ 5235 ]  error   @server/asg_exec.ae:323       error: Link has been severed
  [Wrn|src/conn/ssmptcp_conn_init.c: 23] can't connect to addr addr->host="127.0.0.1", addr->port=4391, (*__errno_location ())=111, strerror((*__errno_location ()))="Connection refused"
[ 5235 ]  error   @server/asg_exec.ae:348       error: Link has been severed
  [Wrn|src/conn/ssmptcp_conn_init.c: 23] can't connect to addr addr->host="127.0.0.1", addr->port=4391, (*__errno_location ())=111, strerror((*__errno_location ()))="Connection refused"
  [Wrn|src/conn/ssmptcp_conn_read.c: 13] couldn't read
[ 5235 ]  error   @server/asg_exec.ae:72        (By the way, this was a bad get, and would have been missed.)
[ 5235 ]  error   @server/asg_exec.ae:75        Get of buffer failed: Link has been severed
*** Error in `server/server': double free or corruption (fasttop): 0x00007f70d8003700 ***

@GeoffDanielson
Contributor

How did you fix the superficial bug?

@matthewcurry
Collaborator Author

It's not committed (because of the "btw" comment), but this is what I did first.

diff --git a/server/asg_exec.ae b/server/asg_exec.ae
index dab006f..e2ce49a 100644
--- a/server/asg_exec.ae
+++ b/server/asg_exec.ae
@@ -67,7 +67,10 @@ get_buf(struct iovec *iov, match1 bits, struct peer_record *p) {
                 m.iovcnt = 1;

                 rc = sync_xm_op(xm_get, p->xmx, &m);
-                if (rc != 0) {
+                if (rc != 0 || (rc = -m.error)) {
+                       if (rc == -m.error) {
+                               ERR("(By the way, this was a bad get, and would have been missed.)\n");
+                       }
                         ERR("Get of buffer failed: %s\n",
                             strerror(-rc));
                         break;

@GeoffDanielson
Contributor

It appears that removing aesop threadpools (which were not useful in any event) from batches has fixed the issue. Unrecognized ops now abort the batch they are included in without segfaulting the server.
