Guidance on implementing async timeout #64

tboyko · 2018-04-02T03:55:35Z

I would like to include handling for timeout situations when using .push_async and would appreciate some guidance on how best to accomplish this.

To provide some context, we first discovered timeout-related issues in our own project upon using Apnotic's connection.join. The join was never returning and there were also notifications that had been queued via .push_async that had not run their on(:response) yet. Our knee jerk solution is to wrap .join in a timeout block and run .close shortly after, whether a timeout occurs or not.

This got me to thinking... if two notifications are sent with the async method, are they always delivered in sequence? If there is some sort of delay with sending the first one, does the second one stay stuck? My guess is that it depends upon how the http2 streams multiplex the notifications. Is this accurate?

Ultimately, I'm hoping to find a robust way to time out these problems and retry but would greatly benefit from some insights on the best way to approach it with this library.

The text was updated successfully, but these errors were encountered:

maciejkowalski · 2018-07-17T00:27:56Z

I am also experiencing hangouts inside Sidekiq worker.

I was able to get stacktrace for current threads via

kill -TTIN [worker_pid]

src: https://github.com/mperham/sidekiq/wiki/Signals

Stacktrace:

2018-07-16T23:55:05.127Z 7571 TID-bubnn WARN: Thread TID-wd31v processor
2018-07-16T23:55:05.127Z 7571 TID-bubnn WARN: /home/ubuntu/.rvm/gems/ruby-2.4.0/gems/apnotic-1.3.1/lib/apnotic/connection.rb:83:in `sleep'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/apnotic-1.3.1/lib/apnotic/connection.rb:83:in `delayed_push_async'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/apnotic-1.3.1/lib/apnotic/connection.rb:49:in `push_async'
/home/ubuntu/abokifx_com/app/workers/abstract_send_mobile_notification_worker.rb:46:in `block in send_push_iphone'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:65:in `block (2 levels) in with'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:64:in `handle_interrupt'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:64:in `block in with'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:61:in `handle_interrupt'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:61:in `with'
/home/ubuntu/abokifx_com/app/workers/abstract_send_mobile_notification_worker.rb:23:in `send_push_iphone'

It points to:
https://github.com/ostinelli/apnotic/blob/1.3.1/lib/apnotic/connection.rb#L83

ostinelli · 2018-07-17T09:13:39Z

Can you please check if this clarifies it for you?
#65

maciejkowalski · 2018-07-18T01:00:48Z

Hey @ostinelli

TL;DR;
Unfortunately not.

LONG STORY:
We are sending a lot of notifications (~30000 iphone users * 10-30 notifications daily).

We have 1 worker for 1 "batch notification".

Code looks like:

# connection pool - declared as CONSTANT inside class (not method) in AbstractWorker
  APNOTIC_POOL = ConnectionPool.new(size: 5) do
    connection = Apnotic::Connection.new(
      cert_path: Rails.root.join(ENV.fetch('IOS_PUSH_APN_PEM_CERT'))
    )
    connection.on(:error) do |exception|
      Sidekiq.logger.error "Exception has been raised on APNS socket: #{exception.inspect}"
    end
    connection
  end

# taken from SendMarketRateNotificationWorker#perform
      ActiveRecord::Base.uncached do
        user_scope(type, rate.currency_name).find_in_batches.with_index do |group, index|
          device_tokens = group.pluck(:device_token)

          device_tokens.each do |device_token|
            send_push_iphone(device_token, notification_message, type)
          end
        end
      end
# [...]

  def send_push_iphone(device_token, notification_message, type)
    APNOTIC_POOL.with do |connection|
      notification = Apnotic::Notification.new(device_token)

      notification.alert = notification_message
      notification.sound = 'default'
      notification.topic = # ...
      notification.custom_payload = {
        others: {
          device_token: device_token,
          push_type: type
        }
      }

      push = connection.prepare_push(notification)
      connection.push_async(push)
    end

{ We don't aim at 100% accuracy, I am happy if 99.9% notifications are sent, so no retry for rare failed notifications }

In my experience, if connection is broken e.g.

Socket was remotely closed
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/net-http2-0.18.0/lib/net-http2/client.rb:130:in `callback_or_raise'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/net-http2-0.18.0/lib/net-http2/client.rb:115:in `rescue in block (2 levels) in ensure_open'/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/net-http2-0.18.0/lib/net-http2/client.rb:109:in `block (2 levels) in ensure_open'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/activesupport-5.0.1/lib/active_support/xml_mini.rb:51: warning: constant ::Fixnum is deprecated
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/activesupport-5.0.1/lib/active_support/xml_mini.rb:52: warning: constant ::Bignum is deprecated
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/activesupport-5.0.1/lib/active_support/core_ext/numeric/conversions.rb:138: warning: constant ::Fixnum is deprecated
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/activejob-5.0.1/lib/active_job/arguments.rb:38: warning: constant ::Fixnum is deprecated
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/activejob-5.0.1/lib/active_job/arguments.rb:38: warning: constant ::Bignum is deprecated

This connection is still inside ConnectionPool, but it is corrupted and if worker pickes-up this connection, then it hangs-out

{ 6 godzin temu means == 6 hours ago -> this job is stuck, iphone worker takes 2-5mins }

This result in worker waiting forever (sleep)
{ I posted stacktrace before }

2018-07-16T23:55:05.127Z 7571 TID-bubnn WARN: Thread TID-wd31v processor
2018-07-16T23:55:05.127Z 7571 TID-bubnn WARN: /home/ubuntu/.rvm/gems/ruby-2.4.0/gems/apnotic-1.3.1/lib/apnotic/connection.rb:83:in `sleep'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/apnotic-1.3.1/lib/apnotic/connection.rb:83:in `delayed_push_async'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/apnotic-1.3.1/lib/apnotic/connection.rb:49:in `push_async'
/home/ubuntu/abokifx_com/app/workers/abstract_send_mobile_notification_worker.rb:46:in `block in send_push_iphone'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:65:in `block (2 levels) in with'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:64:in `handle_interrupt'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:64:in `block in with'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:61:in `handle_interrupt'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:61:in `with'
/home/ubuntu/abokifx_com/app/workers/abstract_send_mobile_notification_worker.rb:23:in `send_push_iphone'

ostinelli · 2018-07-18T13:51:21Z

I don't understand how you can get this error being raised if you define a connection callback:

Socket was remotely closed
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/net-http2-0.18.0/lib/net-http2/client.rb:130:in `callback_or_raise'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/net-http2-0.18.0/lib/net-http2/client.rb:115:in `rescue in block (2 levels) in ensure_open'/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/net-http2-0.18.0/lib/net-http2/client.rb:109:in `block (2 levels) in ensure_open'

This should take care of not raising an error. Any ideas why you are seeing this?

On a side note, instead of your current custom way of declaring a connection you can now use:

APNOTIC_POOL = Apnotic::ConnectionPool.new({
  cert_path: Rails.root.join("config", "certs", "apns_certificate.pem"),
  cert_pass: "mypass"
}, size: 5) do |connection|
  connection.on(:error) { |exception| puts "Exception has been raised: #{exception}" }
end

maciejkowalski · 2018-07-18T19:49:47Z

Hey @ostinelli
I pasted old stacktrace, I already have connection callback before:

  # https://github.com/ostinelli/apnotic/issues/48#issuecomment-399776823
  APNOTIC_POOL = ConnectionPool.new(size: 5) do
    connection = Apnotic::Connection.new(
      cert_path: Rails.root.join(ENV.fetch('IOS_PUSH_APN_PEM_CERT'))
    )
    connection.on(:error) do |exception|
      Sidekiq.logger.error "Exception has been raised on APNS socket: #{exception.inspect}"
    end
    connection
  end

Right now I switched from ConnectionPool to 1 connection per worker and it works.

I switched from ConnectionPool to 1 connection per worker (create new connection, use it to send notifiactions, close it).

It seems to me like after connection is broken
see #73
then it doesn't reinitialize/repair/heal the connection inside pool and when worker tries to use it, it hangs forever.

ostinelli · 2018-07-18T20:24:50Z

#73 has nothing to do with this. What do you mean "I pasted old stacktrace"?

maciejkowalski · 2018-07-19T00:18:48Z

What do you mean "I pasted old stacktrace"?

That means this stacktrace:

Socket was remotely closed
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/net-http2-0.18.0/lib/net-http2/client.rb:130:in `callback_or_raise'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/net-http2-0.18.0/lib/net-http2/client.rb:115:in `rescue in block (2 levels) in ensure_open'/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/net-http2-0.18.0/lib/net-http2/client.rb:109:in `block (2 levels) in ensure_open'
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/activesupport-5.0.1/lib/active_support/xml_mini.rb:51: warning: constant ::Fixnum is deprecated
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/activesupport-5.0.1/lib/active_support/xml_mini.rb:52: warning: constant ::Bignum is deprecated
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/activesupport-5.0.1/lib/active_support/core_ext/numeric/conversions.rb:138: warning: constant ::Fixnum is deprecated
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/activejob-5.0.1/lib/active_job/arguments.rb:38: warning: constant ::Fixnum is deprecated
/home/ubuntu/.rvm/gems/ruby-2.4.0/gems/activejob-5.0.1/lib/active_job/arguments.rb:38: warning: constant ::Bignum is deprecated

was when, I had ConnectionPool without callback (and Sidekiq worker was crashing)

class AbstractSendMobileNotificationWorker
  include Sidekiq::Worker
  sidekiq_options retry: 0

  APNOTIC_POOL = ConnectionPool.new(size: 5) do
    connection = Apnotic::Connection.new(
      cert_path: Rails.root.join(ENV.fetch('IOS_PUSH_APN_PEM_CERT'))
    )
    connection
  end
# [...]

After that (found info here) I added callback:

class AbstractSendMobileNotificationWorker
  include Sidekiq::Worker
  sidekiq_options retry: 0

  # https://github.com/ostinelli/apnotic/issues/48#issuecomment-399776823
  APNOTIC_POOL = ConnectionPool.new(size: 5) do
    connection = Apnotic::Connection.new(
      cert_path: Rails.root.join(ENV.fetch('IOS_PUSH_APN_PEM_CERT'))
    )
    connection.on(:error) do |exception|
      Sidekiq.logger.error "Exception has been raised on APNS socket: #{exception.inspect}"
    end
    connection
  end
# [...]

Sidekiq stopped crashing when connection was closed, but I experienced issues with hanging sidekiq jobs.

#73 has nothing to do with this.

In my experience (YMMV) if you got 2 SocketErrors, 2 sidekiq workers were executing forever.
So I suspect it has to do with corrupted connection inside ConnectionPool

ostinelli · 2019-08-23T11:23:18Z

Please see if latest 1.6.0 solves your issue.

wogg · 2019-10-14T21:41:00Z

I'm posting this here, because it seems the same as the original poster's issue, however I am in running using a single connection in a rake task. It happens inconsistently. This is the stack that is dumped, when I interrupt a stuck process using sig TERM:

versions: (ruby 2.3.8)

   apnotic (1.6.0)
      connection_pool (~> 2)
      net-http2 (>= 0.18, < 2)

trace reported when interrupted:

.../vendor/ruby/2.3.0/gems/net-http2-0.18.2/lib/net-http2/client.rb:59:in `sleep'
.../vendor/ruby/2.3.0/gems/net-http2-0.18.2/lib/net-http2/client.rb:59:in `join'
.../vendor/ruby/2.3.0/gems/apnotic-1.6.0/lib/apnotic/connection.rb:76:in `join'
.../lib/tasks/apple_notifications.rake:73:in `block (3 levels) in <top (required)>'
.../lib/tasks/apple_notifications.rake:89:in `block in device_selector'
.../vendor/ruby/2.3.0/gems/activerecord-4.2.5.1/lib/active_record/relation/batches.rb:124:in `find_in_batches'
.../lib/tasks/apple_notifications.rake:89:in `device_selector'
.../lib/tasks/apple_notifications.rake:43:in `block (2 levels) in <top (required)>'
.../vendor/ruby/2.3.0/gems/rake-12.3.1/exe/rake:27:in `<top (required)>'
.../vendor/ruby/2.3.0/bin/ruby_executable_hooks:24:in `eval'
.../vendor/ruby/2.3.0/bin/ruby_executable_hooks:24:in `<main>'

it's a basic async_push:

      pushme.on(:response) do |result|
          begin 
            handle_result_for_device(device.id, result, true)
          rescue Exception => e
            puts("unknown exception %s" % e)
            puts(e.backtrace)
          end
      end

# (...) and here is the async push ("selected_pusher" is an apnotic connection)

        begin
          selected_pusher.push_async(pushme)
        rescue Errno::ECONNRESET, HTTP2::Error, Errno::EPIPE => e
          puts("connection error device %s skipping and trying next: %s" % [
            device.id, e.inspect])
          next
        rescue Exception => e
          puts("unknown exception %s" % e)
          puts(e.backtrace)
        end

In the attempt to troubleshoot, I have reduced a batch of around 30k pushes on a connection, to doing batches of 200, and re-building the connection for each batch. However, I still run into this "sleep forever" behavior on .join in about 1/4 of the runs of the job.

pbhogan · 2019-10-29T16:48:11Z

I'm having a similar hang on sends (not async, and using individual connections, not connection pool, although the problem happens with the pool also). It just seems to get stuck on the send forever.

So far, adding an explicit timeout to the send call seems to be working, but it remains to be seen if this is a long term fix.

notification = Apnotic::Notification.new token
# ...
response = connection.push notification, timeout: 5
connection.close

Does the internal HTTP2 call without a specified timeout just wait indefinitely and that's what's causing these hangs?

apnotic/lib/apnotic/connection.rb

Lines 45 to 49 in b71af58

    
           response = @client.call(:post, request.path, 
        
             body:    request.body, 
        
             headers: request.headers, 
        
             timeout: options[:timeout] 
        
           )

moofkit mentioned this issue Apr 23, 2018

update docs example for async case #65

Closed

brlo mentioned this issue Aug 24, 2018

push_async executing forever #75

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Guidance on implementing async timeout #64

Guidance on implementing async timeout #64

tboyko commented Apr 2, 2018

maciejkowalski commented Jul 17, 2018 •

edited

Loading

ostinelli commented Jul 17, 2018

maciejkowalski commented Jul 18, 2018 •

edited

Loading

ostinelli commented Jul 18, 2018 •

edited

Loading

maciejkowalski commented Jul 18, 2018 •

edited

Loading

ostinelli commented Jul 18, 2018

maciejkowalski commented Jul 19, 2018

ostinelli commented Aug 23, 2019

wogg commented Oct 14, 2019 •

edited

Loading

pbhogan commented Oct 29, 2019

Guidance on implementing async timeout #64

Guidance on implementing async timeout #64

Comments

tboyko commented Apr 2, 2018

maciejkowalski commented Jul 17, 2018 • edited Loading

ostinelli commented Jul 17, 2018

maciejkowalski commented Jul 18, 2018 • edited Loading

ostinelli commented Jul 18, 2018 • edited Loading

maciejkowalski commented Jul 18, 2018 • edited Loading

ostinelli commented Jul 18, 2018

maciejkowalski commented Jul 19, 2018

ostinelli commented Aug 23, 2019

wogg commented Oct 14, 2019 • edited Loading

pbhogan commented Oct 29, 2019

maciejkowalski commented Jul 17, 2018 •

edited

Loading

maciejkowalski commented Jul 18, 2018 •

edited

Loading

ostinelli commented Jul 18, 2018 •

edited

Loading

maciejkowalski commented Jul 18, 2018 •

edited

Loading

wogg commented Oct 14, 2019 •

edited

Loading