Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow client to reconnect if other side closes connection #478

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

cybernetlab
Copy link

It would be nice to allow framed client to reconnect if other side closed connection because of client inactivity for example. In my personal case thrift library used as a client for hbase and this server closes inactive connections in about 10 minutes. Also I'm using poolboy to pool connections and without reconnection ability I'll need to do some unnecessary logic to monitor status of tcp connections and push out clients with closed connection from the pool.

@cybernetlab
Copy link
Author

cybernetlab commented Sep 16, 2019

Ah, sorry - I've forgotten to resend data after reconnection - some additional fixes needed

@cybernetlab
Copy link
Author

cybernetlab commented Sep 16, 2019

Now all works fine in my environment. What should I do with test coverage? ) Where are no any tests related to Elixir.Connection behaviour in client_test.exs.. Should I implement some additional test cases?

Copy link
Member

@fishcakez fishcakez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now all works fine in my environment. What should I do with test coverage? ) Where are no any tests related to Elixir.Connection behaviour in client_test.exs.. Should I implement some additional test cases?

We have some more integration style tests in server_test.exs. It would be good to get at test for new behavior.

However would using :inet.setopts(sock, active: :once) in the client process allow us to handle idle disconnect earlier and not have to retry a request? We should usually get a close notification message, and can reconnect proactively? Then instead of :gen_tcp.recv we could do receive do {:tcp, ^socket, data} -> etc.

Also its not always the case that all connection closes can be retried, right now this could double the request rate to a badly behaving service :(.

lib/thrift/binary/framed/client.ex Outdated Show resolved Hide resolved
lib/thrift/binary/framed/client.ex Outdated Show resolved Hide resolved
lib/thrift/binary/framed/client.ex Outdated Show resolved Hide resolved
@cybernetlab
Copy link
Author

However would using :inet.setopts(sock, active: :once) in the client process allow us to handle idle disconnect earlier and not have to retry a request? We should usually get a close notification message, and can reconnect proactively? Then instead of :gen_tcp.recv we could do receive do {:tcp, ^socket, data} -> etc.

Agree. The better solution is to handle disconnects at the moment of disconnect. I think that we can change active option to false right before :gen_tcp.send and switch it back to :once after :gen_tcp.recv. Is it reasonable approach?

Also its not always the case that all connection closes can be retried, right now this could double the request rate to a badly behaving service :(.

Why it doubles request rate? If connection is closed at the moment of sending then where are no data sent at all and it will sent once after reconnect

@cybernetlab
Copy link
Author

cybernetlab commented Sep 17, 2019

@fishcakez I'm ready for your re-review )

lib/thrift/binary/framed/client.ex Outdated Show resolved Hide resolved
lib/thrift/binary/framed/client.ex Outdated Show resolved Hide resolved
lib/thrift/binary/framed/client.ex Show resolved Hide resolved
@fishcakez
Copy link
Member

Agree. The better solution is to handle disconnects at the moment of disconnect. I think that we can change active option to false right before :gen_tcp.send and switch it back to :once after :gen_tcp.recv. Is it reasonable approach?

I like what you did with active: true, left comment about always handling all socket messages. Outside a request any message is an issue and we could close.

Why it doubles request rate? If connection is closed at the moment of sending then where are no data sent at all and it will sent once after reconnect.

Unfortunately with TCP we can not tell why the socket closed. There are three main scenarios:

  1. The socket closed before we sent the request, and we'd like to retry as we never go to send the request.

  2. The socket closed at some point after we sent the request, and the server never received the full request so didn't try to handle it.

  3. The socket closed after the server received the request. This is the case where we could increase the load on the server because the server could receive the request multiple times but the socket is closed before it can send a response. Most likely there is a bug in the server that causes the socket to get closed. A problematic example would be with the thrift server in this repo, and using Task.async inside the handler function. If the task processes crashes then the server processes will get an exit signal and exit - leading to the socket being closed without sending a response. As the client does not get a response it would resend the request, doubling the qps. This double qps occurence is quite concern to me.

I wonder if handling the disconnect immediately (active: true with message handling) is enough for the issue you're seeing as we would catch scenario 1) most of the time but never retry on 2) or 3). Perhaps we can split this PR in two, and follow up on reconnect retry in a second one if its still a problem.

@cybernetlab
Copy link
Author

I wonder if handling the disconnect immediately (active: true with message handling) is enough for the issue you're seeing as we would catch scenario 1) most of the time but never retry on 2) or 3). Perhaps we can split this PR in two, and follow up on reconnect retry in a second one if its still a problem.

I'm writing this PR not only for my case but for any possible similar cases. But I agree that in case of 2) and 3) we should not resend data

@cybernetlab
Copy link
Author

@fishcakez I've kept active: true and remove resending logic. Ready to review

@cybernetlab
Copy link
Author

cybernetlab commented Oct 15, 2019

Hi, guys! I need a SASL authentication for hbase. I can prepare another PR with auth backend support, but I need this PR to be included in my code too (In my mix.exs I've github dependency that points to my repo with this PR). How can I achive this? Add my changes in this PR? As I understand, the better would be to create a new PR after merging this one... Thanks!

Copy link
Member

@fishcakez fishcakez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @cybernetlab, I messed up my github notifications temporarily so missed the your last change. One comment to handle and then we can merge this change.

{:disconnect, :reconnect, s}
end

def handle_info(_, s) do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should handle the {:tcp_error, sock, _}, {:ssl_error, sock, _}, {:tcp, sock, _}, {;ssl, sock, _} here but ignore messages from old sockets (or clean them up in disconnect/2).

@cybernetlab
Copy link
Author

@fishcakez ready to review )

Copy link
Collaborator

@jparise jparise left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! My only request is that we document the new :reconnect behavior somewhere (in a @moduledoc and/or in the README.md).

@cybernetlab
Copy link
Author

cybernetlab commented Oct 15, 2019

@jparise is it enough to add:

The :reconnect option if setted to true forces client to reopen tcp connection whenever it closed.

as a last paragraph to the end of @doc for start_link/3?

@jparise
Copy link
Collaborator

jparise commented Oct 15, 2019

@jparise is it enough to add:

The :reconnect option if setted to true forces client to reopen tcp connection whenever it closed.

s/setted/set

as a last paragraph to the end of @doc for start_link/3 ?

Yes, that would be fine.

end

def handle_info({:ssl_error, sock, _error}, %{reconnect: true, sock: {:ssl, sock}} = s) do
{:disconnect, :reconnect, s}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should always disconnect and make the decision on reconnect there. If reconnect: false we want to disconnect too.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case many tests fails. If I disconnect on any TCP or SSL message with reconnect: false following output produced by mix test:

19:52 $ mix test
Compiling 1 file (.ex)
Excluding tags: [pending: true]

............................................................................................................................................

  1) test clients exit if they try to use a closed client (Thrift.Generator.ServiceTest)
     test/thrift/generator/service_test.exs:387
     ** (exit) exited in: :gen_server.call(#PID<0.1755.0>, {:call, "friend_ids_of", [<<10, 0, 1, 0, 0, 0, 0, 0, 0, 4, 210>> | <<0>>], []}, 5000)
         ** (EXIT) {:error, :closed}
     code: assert Client.friend_ids_of(client, 1234) == {:error, :closed}
     stacktrace:
       (stdlib) gen_server.erl:223: :gen_server.call/3
       (thrift) lib/thrift/binary/framed/client.ex:221: Thrift.Binary.Framed.Client.call/5
       test/thrift/generator/service_test.exs:392: (test)

     The following output was logged:
     
     19:52:36.114 [error] Connection closed
     
     19:52:36.115 [error] GenServer #PID<0.1755.0> terminating
     ** (stop) {:error, :closed}
     Last message: {:tcp_closed, #Port<0.19>}
     State: nil
     
..............

  2) test clients exit on connection timeout (Thrift.Generator.ServiceTest)
     test/thrift/generator/service_test.exs:373
     ** (exit) exited in: :gen_server.call(#PID<0.1958.0>, {:call, "friend_ids_of", [<<10, 0, 1, 0, 0, 0, 0, 0, 0, 4, 210>> | <<0>>], []}, 5000)
         ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
     code: assert {:error, closed} = Client.friend_ids_of(client, 1234)
     stacktrace:
       (stdlib) gen_server.erl:223: :gen_server.call/3
       (thrift) lib/thrift/binary/framed/client.ex:221: Thrift.Binary.Framed.Client.call/5
       test/thrift/generator/service_test.exs:382: (test)

     The following output was logged:
     
     19:52:36.155 [error] Connection closed
     
     19:52:36.155 [error] GenServer #PID<0.1958.0> terminating
     ** (stop) {:error, :closed}
     Last message: {:tcp_closed, #Port<0.59>}
     State: nil
     
..

  3) test it handles void oneway functions (Thrift.Generator.ServiceTest)
     test/thrift/generator/service_test.exs:309
     ** (EXIT from #PID<0.1990.0>) {:error, :closed}

     The following output was logged:
     
     19:52:36.188 [error] Connection closed
     
     19:52:36.188 [error] GenServer #PID<0.1996.0> terminating
     ** (stop) {:error, :closed}
     Last message: {:tcp, #Port<0.68>, <<0>>}
     State: nil
     
     19:52:36.189 [error] GenServer #PID<0.1991.0> terminating
     ** (stop) {:error, :closed}
     Last message: {:EXIT, #PID<0.1990.0>, {:error, :closed}}
     State: {:state, {#PID<0.1991.0>, Supervisor.Default}, :one_for_one, {[ranch_listener_sup: Thrift.Generator.ServiceTest.ServerSpy], %{{:ranch_listener_sup, Thrift.Generator.ServiceTest.ServerSpy} => {:child, #PID<0.1992.0>, {:ranch_listener_sup, Thrift.Generator.ServiceTest.ServerSpy}, {:ranch_listener_sup, :start_link, [Thrift.Generator.ServiceTest.ServerSpy, :ranch_tcp, %{num_acceptors: 1, socket_opts: [port: 0]}, Thrift.Binary.Framed.ProtocolHandler, {Thrift.Generator.ServiceTest.SimpleService.Binary.Framed.Server, Thrift.Generator.ServiceTest.ServerSpy, [], []}]}, :permanent, :infinity, :supervisor, [:ranch_listener_sup]}}}, :undefined, 0, 5, [], 0, Supervisor.Default, {:ok, {%{intensity: 0, period: 5, strategy: :one_for_one}, [{{:ranch_listener_sup, Thrift.Generator.ServiceTest.ServerSpy}, {:ranch_listener_sup, :start_link, [Thrift.Generator.ServiceTest.ServerSpy, :ranch_tcp, %{num_acceptors: 1, socket_opts: [port: 0]}, Thrift.Binary.Framed.ProtocolHandler, {Thrift.Generator.ServiceTest.SimpleService.Binary.Framed.Server, Thrift.Generator.ServiceTest.ServerSpy, [], []}]}, :permanent, :infinity, :supervisor, [:ranch_listener_sup]}]}}}
     
.................

  4) test it can handle oneway messages (Servers.Binary.Framed.IntegrationTest)
     test/thrift/binary/framed/server_test.exs:158
     ** (EXIT from #PID<0.2192.0>) {:error, :closed}

     The following output was logged:
     
     19:52:36.324 [error] Connection closed
     
     19:52:36.324 [error] GenServer :"test it can handle oneway messages" terminating
     ** (stop) {:error, :closed}
     Last message: {:tcp, #Port<0.114>, <<0>>}
     State: nil
     
..................................................

Finished in 6.3 seconds
227 tests, 4 failures

Randomized with seed 350387

Copy link
Author

@cybernetlab cybernetlab Oct 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure - should I correct these test cases?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, any comments about my last question?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to think about we handle these cases, and handle them.

For example we could do

try do
  GenServer.call(...)
catch
  :exit, {{:error, _} = error, _} ->
    error
  :exit, {:noproc, _} ->
    {:error, :closed}
end

Copy link
Author

@cybernetlab cybernetlab Oct 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My opinion is not to handle these cases and let users of library to handle it. But in this case some tests should be rewritten

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants