fix(healthcheck): replace checker callback timer with ngx.thread to lower timer usage #155
Conversation
iirc there is a limit of 100 threads per request-context. Do we risk hitting that in large installations?

@Tieske Hmm, I haven't heard of such a limitation; timers also rely on light threads (coroutines), so they should be somewhat similar. Which limitation are you referring to? Maybe this: https://github.com/openresty/lua-nginx-module?tab=readme-ov-file#lua_worker_thread_vm_pool_size?

Not sure it's that one, since it refers to VMs. Anyway, I cannot reproduce it, so it seems the limitation is no longer present.
```lua
})
if timer == nil then
  self:log(ERR, "failed to create timer to check ", health_mode)
local thread = ngx.thread.spawn(function(mode, list)
```
I do not agree with this change. `checker_callback()` was called by the `active_check_timer` timer function; that function's for loop iterates over all active health check instances and submits each active health check task via a timer — that is our existing implementation. After this change, the health check task moves from a timer to `ngx.thread`, which means from an asynchronous thread to a synchronous one. In async mode, we submit the health check task and do not wait for its return status, but with a synchronous thread we need to wait for the task to return. So every iteration of the loop over active health check instances has to wait for all health check tasks to finish. That is not acceptable for us when many upstream nodes need active health checking: if some upstream nodes cannot be connected to, the timeout logic is triggered, and the whole for loop's execution time becomes the sum of every node's connection timeout.
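The contrast described above can be sketched as follows. This is a hypothetical illustration, not the actual library code; `targets` and `run_single_check` are assumed names, and the code assumes an OpenResty timer context:

```lua
-- Before: fire-and-forget. ngx.timer.at returns immediately after
-- queueing the callback; the loop never waits for any check to finish.
for _, target in ipairs(targets) do
  local ok, err = ngx.timer.at(0, function(premature)
    if premature then return end
    run_single_check(target)  -- hypothetical helper
  end)
  if not ok then
    ngx.log(ngx.ERR, "failed to create timer: ", err)
  end
end

-- After: ngx.thread.spawn runs the check as a light thread in the
-- current context. The enclosing callback cannot finish until its
-- light threads do, so it effectively waits on them; if one target's
-- connect times out, the loop's total runtime grows by that timeout.
local threads = {}
for _, target in ipairs(targets) do
  threads[#threads + 1] = ngx.thread.spawn(run_single_check, target)
end
for _, co in ipairs(threads) do
  ngx.thread.wait(co)
end
```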
A possible solution to this problem: when we create an active health check task, we check the completion status of the previous task on this instance. If it is still in progress, we skip submitting the current task. However, this may be a breaking change in the worst case.
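A minimal sketch of that guard, assuming a hypothetical per-instance `checking` flag and `run_once` method (neither is part of the actual library API):

```lua
-- Skip submitting a new check round while the previous one is still
-- in progress, instead of letting rounds pile up.
local function submit_check(checker)
  if checker.checking then
    -- previous task still running: drop this round rather than queue it
    return nil, "previous check still in progress"
  end
  checker.checking = true
  return ngx.thread.spawn(function()
    local ok, err = pcall(checker.run_once, checker)  -- hypothetical method
    checker.checking = false                          -- always clear the flag
    if not ok then
      ngx.log(ngx.ERR, "health check failed: ", err)
    end
  end)
end
```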
@oowl I created an alternative solution for this: #156. To keep the behaviour as similar as possible to before, I check the *running* status of the previous task on this instance, rather than its *completion* status. Checking the completion status would behave just like the current PR, which waits for the coroutine to finish.
Closed in favor of #156
This PR modifies the way that `checker_callback` spawns the background checker for targets. Previously, the checker spawned timers with a 0 delay. When the user has defined a huge number of upstream targets with small check intervals, the number of running timers may hit the OpenResty limit. FTI-5847 reports a similar case: due to the nature of timer-ng, this results in memory "leaking", because an unbounded number of timers is spawned every interval and the rate at which timer jobs are consumed cannot keep up with the rate at which they are created.
The PR modifies the checker to spawn a light thread directly via `ngx.thread.spawn`, since there is no need to spawn timers (which are also light threads by nature), keeping behaviour as close to before as possible.

Alleviates FTI-5847
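The substitution this PR describes can be sketched as below. This is a hedged illustration built from the description above, not the PR's actual diff; `checker_callback` and `health_mode` are names taken from this page, the rest is assumed:

```lua
-- Before: one zero-delay timer per check round, which adds an entry to
-- the worker's shared timer pool every interval.
local timer, err = ngx.timer.at(0, function(premature, mode)
  if premature then return end
  checker_callback(mode)
end, health_mode)
if timer == nil then
  ngx.log(ngx.ERR, "failed to create timer to check ", health_mode, ": ", err)
end

-- After: a light thread in the current context; no timer-pool entry,
-- but the spawning context now waits for the check to finish.
local thread = ngx.thread.spawn(function(mode)
  checker_callback(mode)
end, health_mode)
```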