未知问题有导致某些节点target实际为不存在的节点 #13146

zhou-cl · 2024-06-03T09:59:55Z

zhou-cl
Jun 3, 2024

背景：
1、kong的版本为1.5.1
2、给upstream配置的healthcheck为
{
"created_at": 1607580903,
"hash_on": "none",
"id": "02b1062d-448f-4037-9698-da83e7d790a1",
"algorithm": "round-robin",
"name": "client-shopping-cart",
"tags": [
"k8s-1701624154669388227"
],
"hash_fallback_header": null,
"hash_fallback": "none",
"hash_on_cookie": null,
"host_header": null,
"hash_on_cookie_path": "/",
"healthchecks": {
"active": {
"unhealthy": {
"http_statuses": [
429,
404,
500,
501,
502,
503,
504,
505
],
"tcp_failures": 0,
"timeouts": 0,
"http_failures": 0,
"interval": 2
},
"type": "http",
"http_path": "/health",
"timeout": 1,
"healthy": {
"successes": 1,
"interval": 0,
"http_statuses": [
200,
302
]
},
"https_sni": null,
"https_verify_certificate": true,
"concurrency": 10
},
"passive": {
"unhealthy": {
"http_failures": 0,
"http_statuses": [
429,
500,
503
],
"tcp_failures": 1,
"timeouts": 5
},
"healthy": {
"http_statuses": [
200,
201,
202,
203,
204,
205,
206,
207,
208,
226,
300,
301,
302,
303,
304,
305,
306,
307,
308
],
"successes": 0
},
"type": "tcp"
}
},
"hash_on_header": null,
"slots": 10000
}
3、kong error日志有
[lua] events.lua:273: post(): worker-events: failed posting event "healthy" by "lua-resty-healthcheck [client-shopping-cart]"; no memory, context: ngx.timer
[lua] events.lua:273: post(): worker-events: failed posting event "healthy" by "lua-resty-healthcheck [client-shopping-cart]"; no memory, context: ngx.timer
[lua] healthcheck.lua:1068: log(): [healthcheck] (client-shopping-cart) event: trying to remove an unknown target '...(...:7083)', context: ngx.timer
[lua] targets.lua:65: clean_history(): [Target DAO] Starting cleanup of target table for upstream -448f-4037-9698-da83e7d790a1, client: 127.0.0.1, server: kong_admin, request: "DELETE /upstreams/client-shopping-cart/targets/-1f04-424f-8c91-b5afb800297e HTTP/1.1", host: "localhost:8001"
[lua] events.lua:194: do_handlerlist(): worker-events: event callback failed; source=lua-resty-healthcheck [client-shopping-cart], event=healthy, pid=38 error='/usr/local/share/lua/5.1/resty/healthcheck.lua:247: attempt to index field 'targets' (a nil value)
[lua] balancer.lua:810: do_upstream_event(): failed recreating balancer for client-shopping-cart: timeout waiting for balancer for ***-448f-4037-9698-da83e7d790a1, context: ngx.timer
[lua] events.lua:155: post_event(): worker-events: could not write to shm after 6 tries (no memory), it is either fragmented or cannot allocate more memory, consider increasing 'opts.shm_retries' or increasing the shm size, context: ngx.timer
[lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10639087, context: ngx.timer
[lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10200660, context: ngx.timer
4、kong的配置

lua_shared_dict kong                5m;
lua_shared_dict kong_db_cache       ${{MEM_CACHE_SIZE}};
> if database == "off" then
lua_shared_dict kong_db_cache_2     ${{MEM_CACHE_SIZE}};
> end
lua_shared_dict kong_db_cache_miss   12m;
> if database == "off" then
lua_shared_dict kong_db_cache_miss_2 12m;
> end
lua_shared_dict kong_locks          ${{MEM_CACHE_SIZE}};
lua_shared_dict kong_process_events ${{KONG_PROCESS_EVENTS}};
lua_shared_dict kong_cluster_events ${{MEM_CACHE_SIZE}};
lua_shared_dict kong_healthchecks   ${{MEM_CACHE_SIZE}};

nginx_user = nobody nobody
nginx_worker_processes = auto
nginx_optimizations = on
nginx_daemon = on
mem_cache_size = 128m
kong_process_events = 128m

5、现象为k8s的pod重启漂移ip变更、短时间内几十个节点更新、kong对应的upstream开启了健康监测、kong就出现了某几个节点出现了服务pod为不存在的节点
6、问题
我想咨询下是什么触发的此问题？有什么解决办法？shm的大小是否会自动回收？这个shm的监控需要添加吗怎么添加呢？这个健康监测是否配置错误导致的？重试次数opts.shm_retries能否调整？是哪个shm导致的问题呢？

chronolaw · 2024-06-06T08:24:56Z

chronolaw
Jun 6, 2024
Collaborator

Kong 1.5.1 is EOL ( end of life), would you mind to upgrade to latest version (3.7) and try again?

2 replies

zhou-cl Jun 6, 2024
Author

Thank you for your response. There is a plan to upgrade to version 3.7, but version 1.5.1 still needs to be in production for a long time. Currently, production has been affected by this issue multiple times, but it has not been resolved. Is there any way to temporarily solve it or prevent it from happening again

chronolaw Jun 6, 2024
Collaborator

@bungle, do you have some idea about this issue? I am not familiar with kong 1.5 and can not answer it.

chronolaw · 2024-06-06T08:35:27Z

chronolaw
Jun 6, 2024
Collaborator

Could you change your questions to English? then we can help you easily.

1 reply

zhou-cl Jun 6, 2024
Author

1、The version of Kong is 1.5.1
2、The healthcheck configured for Upstream is
{
"created_at": 1607580903,
"hash_on": "none",
"id": "02b1062d-448f-4037-9698-da83e7d790a1",
"algorithm": "round-robin",
"name": "client-shopping-cart",
"tags": [
"k8s-1701624154669388227"
],
"hash_fallback_header": null,
"hash_fallback": "none",
"hash_on_cookie": null,
"host_header": null,
"hash_on_cookie_path": "/",
"healthchecks": {
"active": {
"unhealthy": {
"http_statuses": [
429,
404,
500,
501,
502,
503,
504,
505
],
"tcp_failures": 0,
"timeouts": 0,
"http_failures": 0,
"interval": 2
},
"type": "http",
"http_path": "/health",
"timeout": 1,
"healthy": {
"successes": 1,
"interval": 0,
"http_statuses": [
200,
302
]
},
"https_sni": null,
"https_verify_certificate": true,
"concurrency": 10
},
"passive": {
"unhealthy": {
"http_failures": 0,
"http_statuses": [
429,
500,
503
],
"tcp_failures": 1,
"timeouts": 5
},
"healthy": {
"http_statuses": [
200,
201,
202,
203,
204,
205,
206,
207,
208,
226,
300,
301,
302,
303,
304,
305,
306,
307,
308
],
"successes": 0
},
"type": "tcp"
}
},
"hash_on_header": null,
"slots": 10000
}
3、kong error log （Partial logs）
[lua] events.lua:273: post(): worker-events: failed posting event "healthy" by "lua-resty-healthcheck [client-shopping-cart]"; no memory, context: ngx.timer
[lua] events.lua:273: post(): worker-events: failed posting event "healthy" by "lua-resty-healthcheck [client-shopping-cart]"; no memory, context: ngx.timer
[lua] healthcheck.lua:1068: log(): [healthcheck] (client-shopping-cart) event: trying to remove an unknown target '...(...:7083)', context: ngx.timer
[lua] targets.lua:65: clean_history(): [Target DAO] Starting cleanup of target table for upstream -448f-4037-9698-da83e7d790a1, client: 127.0.0.1, server: kong_admin, request: "DELETE /upstreams/client-shopping-cart/targets/-1f04-424f-8c91-b5afb800297e HTTP/1.1", host: "localhost:8001"
[lua] events.lua:194: do_handlerlist(): worker-events: event callback failed; source=lua-resty-healthcheck [client-shopping-cart], event=healthy, pid=38 error='/usr/local/share/lua/5.1/resty/healthcheck.lua:247: attempt to index field 'targets' (a nil value)
[lua] balancer.lua:810: do_upstream_event(): failed recreating balancer for client-shopping-cart: timeout waiting for balancer for ***-448f-4037-9698-da83e7d790a1, context: ngx.timer
[lua] events.lua:155: post_event(): worker-events: could not write to shm after 6 tries (no memory), it is either fragmented or cannot allocate more memory, consider increasing 'opts.shm_retries' or increasing the shm size, context: ngx.timer
[lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10639087, context: ngx.timer
[lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10200660, context: ngx.timer
4、nginx_kong.lua kong.conf

lua_shared_dict kong 5m;
lua_shared_dict kong_db_cache ${{MEM_CACHE_SIZE}};

if database == "off" then
lua_shared_dict kong_db_cache_2 ${{MEM_CACHE_SIZE}};
end
lua_shared_dict kong_db_cache_miss 12m;
if database == "off" then
lua_shared_dict kong_db_cache_miss_2 12m;
end
lua_shared_dict kong_locks ${{MEM_CACHE_SIZE}};
lua_shared_dict kong_process_events ${{KONG_PROCESS_EVENTS}};
lua_shared_dict kong_cluster_events ${{MEM_CACHE_SIZE}};
lua_shared_dict kong_healthchecks ${{MEM_CACHE_SIZE}};
nginx_user = nobody nobody
nginx_worker_processes = auto
nginx_optimizations = on
nginx_daemon = on
mem_cache_size = 128m
kong_process_events = 128m

4、The phenomenon is that in Kubernetes, pod restarts lead to IP changes, and within a short period of time, there are updates to dozens of nodes. Kong's corresponding upstream has enabled health checks, and Kong encounters the issue where some nodes point to non-existent pod services.
5、Questions:
I would like to inquire what triggered this issue? What are the solutions? Does the shm size get automatically recycled? Should monitoring of shm be added? How do I add it? Is this health check caused by a misconfiguration? Can the retry count opts.shm_retries be adjusted? Which shm is causing the problem? kong_healthchecks？
The fundamental problem is that some nodes whose target is actually non-existent are constantly being attempted to be pulled up and accessed, and the IP address cannot be seen in the target list but is still being accessed

Maybe I didn't describe clearly. Could you please let me know if there are any unclear areas? Thank you

zhou-cl · 2024-06-06T15:18:07Z

zhou-cl
Jun 6, 2024
Author

When the IP is scaled down at 7 o'clock, and a new pod IP is generated by the node deployment service at 9 o'clock, Kong may redirect the traffic to the IP that was scaled down at 7 o'clock. Sometimes it will recover within 10 seconds, and sometimes it will loop endlessly, requiring a node restart

0 replies

zhou-cl · 2024-06-07T08:36:30Z

zhou-cl
Jun 7, 2024
Author

Hi All

We face an issue where post new deployments (Basically when new pods come in) Kong continues to try to send the requests to old upstream IPs.

On analysing further we also see the target table in postgres is updated with new IPs and does not contain the stale IPs. We are guessing the cache in Kong is not refreshed and hence the issue.

As a temp solution we have restarted the kong pods.

What could be a permanent fix?
We are thinking of incresing the postgres max connections config. Is there anything we could do from kong side?Will increasing the maximum database connections be effective?

2 replies

chronolaw Jun 7, 2024
Collaborator

Does this issue still belong to Kong 1.5?
Since Kong has go forward a lot after 1.5, we suggest you take a test of Kong 3.x (3.4 LTS or latest 3.7), I think that it will solve much of problems.

zhou-cl Jun 7, 2024
Author

I have been having this issue on version 1.5.1, as most of the production environment clusters are using this version and cannot be upgraded immediately in a short period of time. The upgrade plan might be arranged within a year, but we would like to address this issue first. Currently, we are restarting nodes frequently, which greatly affects the stability of production.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

未知问题有导致某些节点target实际为不存在的节点 #13146

{{title}}

Replies: 4 comments 5 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

未知问题有导致某些节点target实际为不存在的节点 #13146

zhou-cl Jun 3, 2024

Replies: 4 comments · 5 replies

chronolaw Jun 6, 2024 Collaborator

zhou-cl Jun 6, 2024 Author

chronolaw Jun 6, 2024 Collaborator

chronolaw Jun 6, 2024 Collaborator

zhou-cl Jun 6, 2024 Author

zhou-cl Jun 6, 2024 Author

zhou-cl Jun 7, 2024 Author

chronolaw Jun 7, 2024 Collaborator

zhou-cl Jun 7, 2024 Author

zhou-cl
Jun 3, 2024

Replies: 4 comments 5 replies

chronolaw
Jun 6, 2024
Collaborator

zhou-cl Jun 6, 2024
Author

chronolaw Jun 6, 2024
Collaborator

chronolaw
Jun 6, 2024
Collaborator

zhou-cl Jun 6, 2024
Author

zhou-cl
Jun 6, 2024
Author

zhou-cl
Jun 7, 2024
Author

chronolaw Jun 7, 2024
Collaborator

zhou-cl Jun 7, 2024
Author