
fix(healthcheck): replace checker callback timer with ngx.thread to lower timer usage #155

Closed

Conversation

windmgc
Member

@windmgc windmgc commented May 6, 2024

This PR modifies the way checker_callback spawns the background checker for targets.

Previously the checker spawned timers with a 0 delay. When the user has defined a very large number of upstream targets with small check intervals, the number of running timers can hit the OpenResty limit. FTI-5847 describes a similar case where, due to the nature of timer-ng, this results in memory "leaking": an unbounded number of timers is spawned every interval, and timer jobs cannot be consumed as fast as they are created.

This PR changes the checker to spawn a light thread directly via ngx.thread.spawn, since there is no need to spawn timers (which are themselves light threads by nature). The goal is to keep the behaviour as close to the previous implementation as possible.

Alleviates FTI-5847
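Roughly, the change has this shape (a simplified sketch, not the exact diff; `active_check_targets`, `mode`, and `list` are placeholders for the checker's real callback and arguments, and the logging follows the module's style):

```lua
-- Before: every interval, a zero-delay timer is registered for the check round.
-- Each pending timer counts against OpenResty's timer limits until it has run.
local ok, err = ngx.timer.at(0, function(premature)
  if premature then return end
  self:active_check_targets(mode, list)
end)
if not ok then
  self:log(ERR, "failed to create timer to check ", mode, ": ", err)
end

-- After: spawn a light thread in the current context instead, so no extra
-- timer is registered; the check still runs concurrently with its parent.
local thread = ngx.thread.spawn(function(mode, list)
  self:active_check_targets(mode, list)
end, mode, list)
if not thread then
  self:log(ERR, "failed to spawn thread to check ", mode)
end
```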

@windmgc windmgc changed the base branch from master to release/1.6.x May 6, 2024 09:33

@windmgc windmgc changed the title from "Fix replace hc timer with thread" to "fix(healthcheck): replace checker callback timer with ngx.thread to lower timer usage" May 6, 2024
@windmgc windmgc marked this pull request as ready for review May 8, 2024 08:03
@windmgc
Member Author

windmgc commented May 8, 2024

@oowl @locao Could you please check whether this is a meaningful change to you? I will make the same change against master if it is.

(Although it seems there is little point in writing a test for it.)

@windmgc windmgc requested review from oowl and locao May 8, 2024 08:04
@Tieske
Member

Tieske commented May 8, 2024

iirc there is a limit of 100 threads per request-context. Do we risk hitting that in large installations?

@windmgc
Member Author

windmgc commented May 8, 2024

@Tieske Hmm, I haven't heard of such a limitation; timers also rely on light threads (coroutines), so they should behave similarly.

Which limitation are you referring to? Maybe this https://github.com/openresty/lua-nginx-module?tab=readme-ov-file#lua_worker_thread_vm_pool_size?

@Tieske
Member

Tieske commented May 8, 2024

Not sure it's that one, since it refers to VMs. Anyway, I cannot reproduce it, so it seems the limitation is no longer present.

The following review thread is on this excerpt of the diff:

    })
    if timer == nil then
      self:log(ERR, "failed to create timer to check ", health_mode)
    local thread = ngx.thread.spawn(function(mode, list)
oowl (Member) commented:

I do not agree with this change.

checker_callback() is called from the active_check_timer timer function; that function's for loop iterates over all active health check instances and submits each active health check task via a timer. That is the existing implementation. After this change the health check task moves from a timer to ngx.thread, which means going from an asynchronous task to a synchronous one: in async mode we submit the health check task and do not wait for its result, but with ngx.thread we have to wait for the task to finish. So the loop over active healthcheck instances ends up waiting for all health check tasks to complete, which is not acceptable for us when there are many upstream nodes to actively health-check. For example, if some upstream nodes cannot be connected to, the timeout logic is triggered and the whole loop's execution time becomes the sum of all those nodes' connection timeouts.
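To illustrate the difference being described (a hedged sketch, not the actual checker code; `instances` and `check_one_instance` are placeholders): a zero-delay timer is fire-and-forget, while light threads spawned inside a timer handler keep that handler alive until they have all finished.

```lua
-- Timer-based (existing behaviour): the loop returns immediately; each check
-- runs in its own timer context, so a slow target only delays its own check.
for _, instance in ipairs(instances) do
  ngx.timer.at(0, function(premature)
    if premature then return end
    check_one_instance(instance)  -- may block on connect timeouts
  end)
end

-- Thread-based (this PR): the loop itself still returns quickly, but the
-- enclosing timer handler cannot exit until every spawned light thread has
-- finished, so it has to wait for the whole round of checks to complete.
local threads = {}
for _, instance in ipairs(instances) do
  threads[#threads + 1] = ngx.thread.spawn(check_one_instance, instance)
end
for _, th in ipairs(threads) do
  ngx.thread.wait(th)  -- explicit wait; omitting it only moves the wait to handler exit
end
```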

oowl (Member) commented:

A possible solution to this problem: when we create an active health check task, check the completion status of the previous task on this instance, and if it is still in progress, skip submitting the current task. But that could be a breaking change in the worst case.

@windmgc
Member Author

windmgc commented May 9, 2024

@oowl I created an alternative solution for this: #156

To keep the behaviour as close to before as possible, I check the running status of the previous task on this instance, instead of its completion status. Checking the completion status would behave just like the current PR, which waits for the coroutine to finish.
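A minimal sketch of that approach (not the actual #156 code; the `checking` flag and other names are illustrative, and logging follows the module's style): record whether the previous round for this checker is still running and skip the new round rather than waiting for it.

```lua
local function checker_callback(self, mode, list)
  if self.checking then
    -- the previous round has not finished yet; skip this interval instead of
    -- blocking on it or stacking another task on top of it
    self:log(WARN, "previous ", mode, " check still running, skipping")
    return
  end

  self.checking = true
  local ok, err = ngx.timer.at(0, function(premature)
    if premature then
      self.checking = false
      return
    end
    self:active_check_targets(mode, list)  -- placeholder for the real check
    self.checking = false
  end)
  if not ok then
    self.checking = false
    self:log(ERR, "failed to create timer to check ", mode, ": ", err)
  end
end
```

A production version would also reset the flag if the check raises an error (for example by wrapping it in pcall).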

@windmgc
Member Author

windmgc commented May 10, 2024

Closing in favor of #156.

@windmgc windmgc closed this May 10, 2024