Kilo pods high CPU usage #304
Hi @faraonc, that is very high indeed and not normal for Kilo! If you'd like to give it a shot, please open a PR :) Otherwise I'll do it this afternoon. In the meantime, could you share logs from one of the high-CPU-usage Kilo pods?
When we've seen high CPU issues in the past, it has tended to be either constant reconciliation, because some configuration doesn't match or Kilo and another process are fighting over it, or the iptables controller going crazy and spawning lots of iptables processes because some rules don't match what Kilo expects. We haven't seen the iptables issue in a year, so hopefully it's not that.
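To make the first failure mode concrete, here is a minimal, hypothetical Go sketch (not Kilo's actual code) of a level-triggered reconcile loop: when the observed state can never match the desired state, for example because another process keeps rewriting it, the loop re-applies its configuration on every tick and burns CPU doing so. All the names and the 30-second interval are made up for illustration.

```go
package main

import (
	"log"
	"reflect"
	"time"
)

// state stands in for whatever Kilo reconciles on a node
// (WireGuard config, iptables rules, routes); it is hypothetical.
type state struct {
	Peers int
}

func observe() state { return state{Peers: 3} } // what is actually on the node
func desired() state { return state{Peers: 4} } // what the controller wants

func apply(s state) { log.Printf("re-applying %+v", s) } // the expensive step

func main() {
	for range time.Tick(30 * time.Second) {
		// If the two views never converge, e.g. another agent keeps
		// undoing the change, this branch runs on every tick: constant
		// churn, lots of re-applied rules and routes, high CPU.
		if !reflect.DeepEqual(observe(), desired()) {
			apply(desired())
		}
	}
}
```

The second failure mode looks different from the outside: the node ends up with many short-lived iptables processes as the controller keeps re-issuing rules that never match what it expects.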
Thanks for pointing out where the change is needed. I might just build and test it on my deployments. I might do a PR next week if you have not done so.
Hi @faraonc, I made the PR earlier today: #305. Something funny is going on in that cluster for sure. The WireGuard configuration keeps changing really, really fast. There are constantly new, unexpected changes to the number of peers. Are you connecting WireGuard peers to the cluster besides the nodes?
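If it would help to watch that churn from outside Kilo, here is a hedged sketch built on the wgctrl Go library, which reads WireGuard device state directly: it polls the interface and logs whenever the peer count changes. The interface name kilo0 and the one-second poll interval are assumptions; adjust them for your cluster.

```go
package main

import (
	"log"
	"time"

	"golang.zx2c4.com/wireguard/wgctrl"
)

func main() {
	client, err := wgctrl.New()
	if err != nil {
		log.Fatalf("open wgctrl: %v", err)
	}
	defer client.Close()

	const iface = "kilo0" // assumption: the name of Kilo's WireGuard interface
	last := -1
	for range time.Tick(time.Second) {
		dev, err := client.Device(iface)
		if err != nil {
			log.Printf("read %s: %v", iface, err)
			continue
		}
		// Log only on changes; rapid flapping here matches the
		// "configuration keeps changing" symptom described above.
		if n := len(dev.Peers); n != last {
			log.Printf("%s now has %d peers (was %d)", iface, n, last)
			last = n
		}
	}
}
```

Run on a node with the same privileges `wg show` needs, this makes it easy to see whether peers are being added and removed in a tight loop.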
@faraonc the PR just merged, so you should be able to use the …
Thanks a lot. I will deploy this on Monday and check what is going on. I will share the outcome.
As I mentioned earlier, I strongly suspect that Kilo is getting confused with the WireGuard configuration. Any info you could share about the peers in the cluster / what other processes might be interacting with the cluster would be very helpful.
Hi @faraonc, any update? |
Totally understand 👍 |
We started seeing Kilo pods with high CPU usage. This is degrading the network and causing our apps to time out. To work around the issue, we kill the affected pod whenever it happens. We tried upgrading the DaemonSet to use image `0.4.1`, but that did not resolve it. Is there already a built-in way to profile the CPU? Does anybody have a process for investigating the root cause?
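On the built-in profiling question: Kilo is written in Go, and the usual way to profile a Go daemon is to expose the standard net/http/pprof handlers and pull a CPU profile over HTTP. The sketch below is a generic illustration of that pattern, not Kilo's actual flags or endpoints (the comments above suggest support was being added around PR #305, but confirm against the project); the localhost:6060 address is an assumption.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Assumption: a debug-only listener bound to localhost; a real daemon
	// would keep this off any publicly reachable address.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the daemon's real work would run here ...
	select {}
}
```

With something like that in place, `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30` collects a 30-second CPU profile and shows where the busy pod is spending its time.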