-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
outgoing_webhooks: automatically disable an endpoint after multiple failures #24
base: main
Are you sure you want to change the base?
outgoing_webhooks: automatically disable an endpoint after multiple failures #24
Conversation
@bhumi1102 Can we make this so it doesn't disable the webhook after just a single webhook failing, but after maybe a day of webhooks not being successfully delivered? We could track it at the endpoint level, and if it's been failing for more than 24 hours, then disable the endpoint. |
Sure, will take a look at this next! |
Hi @andrewculver - to understand the 24 hours thing, you're imagining adding another attempt after 24 hours to double check that it still fails before disabling? OR adding a background job or something that checks periodically for up to 24 hours (more involved). Also to clarify, with this change we are not disabling after a single webhook failing but after the sequence of 5 attempts, the longest one being 1 hours.
So the options I see are 1. add a 6th attempt 24 hours later or 2. add some background job to run in 24 hours or something more involved or 3. leave it as is if we're okay with concluding that if the endpoint is still not responding after 5 attempts separated by an hours it's safe to disable. |
Update: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems good to me. @andrewculver do you have any other concerns about this before we merge it?
last_successful_delivery = endpoint.deliveries.where.not(delivered_at: nil).maximum(:delivered_at) | ||
if last_successful_delivery.blank? || last_successful_delivery < 24.hours.ago | ||
endpoint.disabled_from_failure = true | ||
endpoint.save | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is maximum
a potentially costly query? Will the database have to do a full table scan or do we have an index on the column?
I wonder if it would be clearer too if we rephrased the logic here a bit? Date math is always something I'm terrible with, so this logic might not be exactly right, it would need to be verified.
last_successful_delivery = endpoint.deliveries.where.not(delivered_at: nil).maximum(:delivered_at) | |
if last_successful_delivery.blank? || last_successful_delivery < 24.hours.ago | |
endpoint.disabled_from_failure = true | |
endpoint.save | |
end | |
no_successful_delivery_within_past_24_hours = endpoint.deliveries.where(delivered_at: 24.hours.ago..).none? | |
if no_successful_delivery_within_past_24_hours | |
endpoint.disabled_from_failure = true | |
endpoint.save | |
end |
last_successful_delivery.before? 24.hours.ago
or 24.hours.ago.after? last_successful_delivery
may be clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This. I had quite a few instances in the recent past where sorting in the Ruby domain was faster than against an unoptimized table.
But: wouldn’t this then be cleaner by adding an index?
also, currently evades me but wasn’t there a shorthand for .where.not(…: nil)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding an index seems reasonable to me since we definitely care about being able to find recent records quickly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, currently evades me but wasn’t there a shorthand for .where.not(…: nil)?
@julianrubisch maybe you're thinking of where.missing
or where.associated
but that's for associations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where.not(attribute_name: nil)
is it.
For instance:
User.where.not(created_at: nil).count
D, [2023-08-03T11:18:09.830819 #22549] DEBUG -- : User Count (0.7ms) SELECT COUNT(*) FROM "users" WHERE "users"."created_at" IS NOT NULL
=> 22
Though I think I'm missing how that would be applicable to this situation.
@andrewculver asked me not to merge anything related to outgoing webhooks, so I'm going to leave this for him to review and merge. Going to convert it to a draft PR for now. |
The goal is to automatically mark an endpoint as disabled if a success response is not received after multiple attempts (currently 5 attempts at increasing re-attempts delays)
Paired with bullet-train-co/bullet_train#565
Testing Done:
Add a new endpoint, set the url to fail with something like webhook.site
Create the resource to trigger the outgoing webhook
Watch the logs and the DB tables
outgoing_webhooks_*
for the 5 delivery attempts failuresConfirm that the endpoint's
disabled_from_failure
field is set totrue
Create a resource that would trigger the endpoint. This time while disabled.
Confirm from the logs that no outgoing webhooks are queued for delivery
Note: user can add a new endpoint with the desired config to re-enable
closes bullet-train-co/bullet_train-outgoing_webhooks#11