k8s-dqlite spiking cpu core to 100% #3227

Open
WilliamG-LORA opened this issue Jun 8, 2022 · 193 comments

@WilliamG-LORA commented Jun 8, 2022

Summary

I've set up a 4-node microk8s cluster on bare-metal machines. Every now and then the /snap/microk8s/3204/bin/k8s-dqlite process will spike one of the cores on one of my nodes to 100% usage, sending my fans into overdrive.

I can see that all the other cores are running at <6% usage and that RAM is hardly used:
[htop screenshot]

The specs of the machines are as follows:

  • Node 1:
    • CPU: AMD Threadripper 1950X
    • RAM: 64GB
  • Node 2:
    • CPU: i7-7820X
    • RAM: 64GB
  • Node 3:
    • CPU: i7-9700
    • RAM: 32GB
  • Node 4:
    • CPU: i7-9700K
    • RAM: 64GB

The cluster has the metallb, dns, rbac, and storage addons enabled.
I've also deployed Rook-Ceph on the cluster.

What Should Happen Instead?

It shouldn't be maxing out a CPU core at 100% usage.

Reproduction Steps

  1. Create a microk8s cluster
  2. Deploy Rook-Ceph
  3. Wait a bit.
  4. I'm not sure how to properly reproduce this issue...

Introspection Report

inspection-report-20220608_143601.tar.gz

@maximemoreillon commented Jun 14, 2022

I am facing the same issue on multiple AWS EC2 instances, each running a single-node MicroK8s install.

Microk8s version: 1.23 classic

Enabled addons:

  • DNS
  • Storage
  • RBAC

For example, here is a screenshot of htop on an EC2.XLarge (16GB memory):

[htop screenshot]

Microk8s was running smoothly until this week.

On the other hand, instances running microk8s version 1.18 were not affected.

@bc185174

We found similar results: the dqlite service on the leader node was hitting 100% usage. As mentioned in a few other issues, dqlite is sensitive to slow disk performance. In other scenarios, such as a node drain, it took a while to write to the database. You should see something similar to microk8s.daemon-kubelite[3802920]: Trace[736557743]: ---"Object stored in database" 7755ms in the logs. On our cluster it took over ~18000ms to write to the datastore and dqlite could not cope with it. As a result, it led to leader election failures and kubelite service panics.
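
A quick way to check for these slow writes yourself (a sketch, assuming the default MicroK8s systemd unit name) is to grep the kubelite journal for those trace lines:

# Show apiserver traces that report slow datastore writes; the search strings
# come from the trace messages quoted in this thread.
sudo journalctl -u snap.microk8s.daemon-kubelite --since "1 hour ago" \
  | grep -E 'Object stored in database|Txn call completed'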

We monitored the CPU and RAM utilisation for dqlite and compared it to etcd under the same workload and conditions.

Dqlite Idle

[CPU/RAM utilisation graph]

Etcd Idle

[CPU/RAM utilisation graph]

@benben commented Dec 1, 2022

I can confirm this on Proxmox-virtualized VMs. There is constant high load on a 3-node cluster:

[screenshot]

@djjudas21

I'm seeing something similar. I was running a 4-node HA cluster but it failed (see #3735), so I removed 2 nodes to disable HA mode and hopefully restore quorum; I'm now running 2 nodes, 1 of which is master. The master has a dqlite process rammed at 100% CPU. Running iotop shows aggregate disk transfer of only about a hundred KB/s. The dqlite log shows various transaction traces; mostly they complete in 500-700ms, but occasionally I get a much slower one.

Feb 08 09:08:51 kube05 microk8s.daemon-kubelite[1456693]: Trace[1976283013]: ["GuaranteedUpdate etcd3" audit-id:70abc60c-f365-4744-96fc-a404d34de11b,key:/leases/kube-system/kube-apiserver-b3nhikmrwakntwutkwiesxox4e,type:*coordination.Lease,resource:leases.coordination.k8s.io 7005ms (09:08:44.693)
Feb 08 09:08:51 kube05 microk8s.daemon-kubelite[1456693]: Trace[1976283013]:  ---"Txn call completed" 7005ms (09:08:51.699)]
Feb 08 09:08:51 kube05 microk8s.daemon-kubelite[1456693]: Trace[1976283013]: [7.005829845s] [7.005829845s] END

The hardware isn't exactly a rocketship, but all my nodes are i5-6500T with 4 cores, 16GB memory and a 256GB SSD, which should be adequate. Most of my workloads are not running at the moment, either.

@cole-miller (Contributor) commented Feb 9, 2023

Hi all. I work on dqlite, and I'm going to try to figure out what's causing these CPU usage spikes. If you're experiencing this issue on a continuing basis and are in a position to collect some diagnostic info, I could use your help! Short of a reproducer that I can run myself (which I realize is difficult for this kind of complex system), the data I'd find most helpful would be a sampling profiler report showing where the k8s-dqlite process is spending its time during one of these spikes. A separate report for the same process and workload during a period of nominal CPU usage would also be great, so I can compare the two and see if anything stands out. You can gather this information as follows:

  1. Install the perf command-line tool on all machines in your cluster. On Ubuntu this is part of the linux-tools package (you'll have to pick a "flavor" like linux-tools-generic).
  2. Collect a profile by ssh-ing into the affected node and running perf record -F 99 --call-graph dwarf -p <pid>, where <pid> is the PID of the k8s-dqlite process. That command will keep running and collecting samples until you kill it with Ctrl-C.
  3. Upload the generated perf.data file to somewhere I can access. (It doesn't contain a core dump or anything else that might be sensitive, just backtraces.) Please also share the version of microk8s that you're using.

If the spikes last long enough that you can notice one happening and have time to ssh in and gather data before it ends, do that. Otherwise, since it's probably not feasible to just leave perf running (perf.data would get too huge), you could have a script like

# Rotate between two capture files, 60 seconds each, so the recording never
# grows unbounded and the previous window survives while the next one is
# being recorded; replace <pid> with the PID of the k8s-dqlite process.
i=0
while true; do
        i=$(( ( i + 1 ) % 2 ))
        rm -f perf.data.$i
        perf record -F 99 --call-graph dwarf -o perf.data.$i -p <pid> sleep 60
done

Or with a longer timeout, etc. Then if you notice after the fact that a spike has occurred, you hopefully still have the perf.data file for that time period around.
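
If you want to inspect a capture yourself before uploading it, a minimal sketch for turning perf.data into a flamegraph (assuming you clone Brendan Gregg's FlameGraph scripts) is:

# Fold the recorded stacks and render an SVG; the FlameGraph checkout path
# is an assumption, adjust it to wherever you clone the repository.
git clone https://github.com/brendangregg/FlameGraph
perf script -i perf.data \
  | ./FlameGraph/stackcollapse-perf.pl \
  | ./FlameGraph/flamegraph.pl > k8s-dqlite.svg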

Thanks for your reports -- let's get this issue fixed!

@djjudas21

Hey @cole-miller, thanks for looking at this.

I've done a perf capture for you, but it is worth noting a couple of things:

  1. Mine aren't transient CPU spikes, dqlite just hammers the CPU at 100% from the moment it is started
  2. In a separate issue I'm investigating with @ktsakalozos whether this high CPU is caused by corruption of the dqlite database

With those caveats, here's my perf.data, captured over about a minute on MicroK8s v1.26.1 (rev 4595). Hope it is useful to you.

perf.tar.gz

@doctorpangloss

@cole-miller likewise, k8s-dqlite doesn't spike, it's just at 100% all the time.

perf.data.zip

@doctorpangloss

It seems like a lot of people are affected by this issue.

@djjudas21

As well as my prod cluster being affected by this, last week I quickly threw together a 3-node MicroK8s cluster on v1.26 in VirtualBox to test something. No workloads. Initially it worked normally, but then I shut down all 3 VMs. When I booted them up again later, I had the dqlite 100% CPU problem. I didn't have time to look into it as I was working on something else, but it does show that it can happen on a new cluster that hasn't been "messed with".

@djjudas21 commented Mar 2, 2023

I understand that MicroK8s is free software, no guarantees, etc., but it is run by a prominent company like Canonical, so it is surprising/disappointing that there are quite a few serious, long-standing issues, affecting multiple users, that don't appear to be getting much attention from the maintainers (for example this one, #3735 and #3204).

I have been running MicroK8s since v1.17, I think, and it has generally been rock solid and only broken when I've fiddled with it. Since v1.25/v1.26 there seem to be chronic issues affecting its stability. I have personally lost data due to this dqlite CPU problem (long story short, I lost dqlite quorum, which broke the kube api-server, but I was using OpenEBS/cStor clustered storage, which depends on kube quorum for its own quorum. When it lost quorum and the kube api-server became silently read-only, the storage controller got itself into a bad state and volumes could not be mounted).

A lot of people are talking about switching to k3s, and I don't want to be that guy who rants about switching, but it is something I will consider doing at next refresh. I note that k3s ditched dqlite in favour of etcd in v1.19. I don't know what their reasons were, but it was probably a good move.

@AlexsJones (Contributor)

(Quoting @djjudas21's comment above.)

Hi Jonathan,

I lead Kubernetes at Canonical and I wanted to firstly offer both an apology and a thank you for your help thus far. We know that there are situations where people like yourself are using MicroK8s and suddenly unbeknownst to you something goes wrong - not being able to easily solve that only compounds the problem.

Our ambition for MicroK8s is to keep it as simple as possible, both in day-to-day operation and in upgrading. I wanted to thank you again for taking the time to engage with us and help us try and improve our projects. Projects plural, as DQLite is also something we build and maintain, hence our investment in building it as a low-ops K8s database backend. (That said, there is the option to run etcd with MicroK8s should you desire to.)

This resilience issue is being taken extremely seriously. We are configuring machines to reproduce your environment, as we understand it, to the best of our abilities, and working with our DQLite team counterparts to resolve any performance issues. (Please do let us know what your storage configuration is - localpath, etc.)

I believe one thing that really sets MicroK8s apart from alternatives is that we have no secret agenda here.
We are here to serve and help our community grow, and with that comes a promise of working together to make sure our end users and community members are assisted as much as humanly possible. We will not rest until any potential issues have been exhaustively analysed, independently of whether this is a quirk of setup or environment.

All that said, I will do the following immediately:

  • Ensure our team has picked this up with the DQLite team to analyze the performance results
  • Run a benchmark and sanity test against our own lab equipment
  • Test scenarios like those mentioned (e.g. node drain with a workload like Ceph on dqlite)
  • Keep you abreast of updates in this thread.

@djjudas21

Thanks @AlexsJones, I really appreciate the detailed response. It's good to hear that there is more going on behind the scenes than I was aware of. I'll be happy to help with testing and providing inspection reports, etc. I've generally had really good experiences interacting with the Canonical team on previous issues (@ktsakalozos in particular has been really helpful).

My specific environment is 4 identical hardware nodes, each with a SATA SSD which holds the OS (Ubuntu 22.04 LTS) and the snap data, so that will include the dqlite database. Each node also has an M.2 NVMe which is claimed by OpenEBS/cStor for use as clustered storage.

I don't use any localpath storage. I do also have an off-cluster TrueNAS server which provides NFS volumes via a RWX storage class.

I'm actually part-way through a series of blog posts and the first part covers my architecture. The third part was going to be about OpenEBS/cStor but then it all went wrong, so I'm holding off on writing that!

@AlexsJones (Contributor)

Thanks for the additional detail. I've set up a bare-metal cluster (albeit at a lower scale than yours) and will look to install OpenEBS/cStor along with Rook-Ceph. We will then conduct a soak test and a variety of interrupts to generate data.

@djjudas21

Thanks. I installed my cStor from Helm directly, rather than using Rook. I've just made a gist with my values file so you can create a similar environment if you need to, although the root cause of my problem was with dqlite rather than cStor.

@AlexsJones (Contributor)

(Quoting @djjudas21's reply above.)

It's still worth investigating as there might be a disk activity correlation - Will set this up now

@AlexsJones (Contributor)

(Quoting @djjudas21's environment description above.)

Are you using CephFS or RBD? If so, how's that interacting with the cStor SC?

@djjudas21

I'm not using CephFS or RBD, only OpenEBS & cStor

@AlexsJones (Contributor)

  • Create a microk8s cluster
  • Deploy Rook-Ceph

Okay, I saw it was mentioned at the top of the thread, thanks!

@djjudas21

Okay, I saw it was mentioned at the top of the thread, thanks!

No worries, this isn't my thread originally, but I am affected by the same dqlite CPU spike

@cole-miller (Contributor) commented Mar 2, 2023

Hi @djjudas21, @doctorpangloss -- thanks very much for uploading the perf files, they're quite useful for narrowing down the root of the problem. Here are the resulting flamegraphs:

  1. For @djjudas21's data: djjudas21.svg
  2. For @doctorpangloss's data:
    doctorpangloss.svg

As you can see, CPU usage is dominated by calls to sqlite3_step. Looking at just the children of that function, the big contributions are from SQLite code, with some contributions also from dqlite's custom VFS (about 14% of the grand total), much of which boils down to calls to memcpy (9%). So my preliminary conclusion is that most of the CPU cycles are spent in SQLite, in which case it's likely that the route to fixing this problem lies in optimizing the requests sent by microk8s to dqlite (via the kine bridge). But I'll continue to investigate whether any parts of dqlite stand out as causing excessive CPU usage.

@cole-miller (Contributor) commented Mar 2, 2023

One possible issue: dqlite runs sqlite3_step in the libuv main thread, so if calls to sqlite3_step are taking quite a long time then we're effectively blocking the event loop -- which could have bad downstream consequences like Raft requests timing out and causing leadership churn. @djjudas21, @doctorpangloss, and anyone else who's experiencing this issue, it'd be very helpful if you could follow these steps to generate some data about the distribution of time spent in sqlite3_step:

  1. On an affected node, install bpftrace: sudo apt install bpftrace
  2. Find the PID of k8s-dqlite, then run
    $ sudo bpftrace -p $k8s_dqlite_pid -e 'uprobe:libsqlite3:sqlite3_step { @start[tid] = nsecs; } uretprobe:libsqlite3:sqlite3_step { @times = hist(nsecs - @start[tid]); delete(@start[tid]); }'
    
    That will keep running and gathering data until you kill it with Ctrl-C, and print an ASCII art histogram when it exits, which you can post in this issue thread.

Thanks again for your willingness to help debug this issue!

@sbidoul commented Mar 2, 2023

@cole-miller I could capture the requested histogram during about 3 minutes of such an event:

[512, 1K)          11070 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K)            3758 |@@@@@@@@@@@@@@@@@                                   |
[2K, 4K)            4186 |@@@@@@@@@@@@@@@@@@@                                 |
[4K, 8K)            3882 |@@@@@@@@@@@@@@@@@@                                  |
[8K, 16K)            882 |@@@@                                                |
[16K, 32K)           966 |@@@@                                                |
[32K, 64K)          1048 |@@@@                                                |
[64K, 128K)          494 |@@                                                  |
[128K, 256K)         428 |@@                                                  |
[256K, 512K)          81 |                                                    |
[512K, 1M)            18 |                                                    |
[1M, 2M)               8 |                                                    |
[2M, 4M)            2208 |@@@@@@@@@@                                          |
[4M, 8M)            1271 |@@@@@                                               |
[8M, 16M)            267 |@                                                   |
[16M, 32M)            50 |                                                    |
[32M, 64M)            18 |                                                    |
[64M, 128M)            1 |                                                    |
[128M, 256M)           0 |                                                    |
[256M, 512M)           0 |                                                    |
[512M, 1G)             0 |                                                    |
[1G, 2G)               0 |                                                    |
[2G, 4G)              10 |                                                    |

This is a 3-node cluster with two 12-CPU nodes and one 2-CPU node. The particularity (?) is that there is relatively high (~10ms) latency between nodes 1 and 2 and node 4. Below is the CPU load measurement during the event. This event was on the little node, but it can equally happen on the big nodes, sending CPU load to 300% (apparently due to iowait, though I'm not sure).

[CPU load graph]

Here is a perf capture (which is probably not good quality due to missing /proc/kallsyms?).

@sbidoul commented Mar 2, 2023

Here is another view of the same 15-minute event, obtained with Python's psutil.Process(pid).cpu_times().

[CPU time graph]

@sbidoul commented Mar 2, 2023

As a side note, I have always been wondering what dqlite is doing to consume 0.2 CPU when the cluster is otherwise idle. Although I don't want to divert this thread if this is unrelated.

@doctorpangloss

sudo bpftrace -p $(pidof k8s-dqlite) -e 'uprobe:libsqlite3:sqlite3_step { @start[tid] = nsecs; } uretprobe:libsqlite3:sqlite3_step { @times = hist(nsecs - @start[tid]); delete(@start[tid]); }'
Attaching 2 probes...
^C

@start[13719]: 503234295412429
@times: 
[1K, 2K)            6297 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2K, 4K)            5871 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    |
[4K, 8K)            1597 |@@@@@@@@@@@@@                                       |
[8K, 16K)           4113 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                   |
[16K, 32K)           816 |@@@@@@                                              |
[32K, 64K)           542 |@@@@                                                |
[64K, 128K)          397 |@@@                                                 |
[128K, 256K)         500 |@@@@                                                |
[256K, 512K)          59 |                                                    |
[512K, 1M)            17 |                                                    |
[1M, 2M)              13 |                                                    |
[2M, 4M)               0 |                                                    |
[4M, 8M)               0 |                                                    |
[8M, 16M)              0 |                                                    |
[16M, 32M)             0 |                                                    |
[32M, 64M)          3078 |@@@@@@@@@@@@@@@@@@@@@@@@@                           |
[64M, 128M)           71 |                                                    |
[128M, 256M)           0 |                                                    |
[256M, 512M)           0 |                                                    |
[512M, 1G)             0 |                                                    |
[1G, 2G)               0 |                                                    |
[2G, 4G)              16 |                                                    |

@cole-miller (Contributor) commented Mar 2, 2023

@sbidoul, @doctorpangloss -- thanks! It does look like a substantial number of our calls to sqlite3_step are taking milliseconds to tens of milliseconds, which is not outrageous but is long enough that we should probably try to move those calls out of the main thread. (We will want to do this in any case for the experimental disk mode.) I will try to work up a patch that does that and we'll make it available for you to try out when it's ready.

(I don't know what to think about the small number of calls that took entire seconds to complete.)

@doctorpangloss

Would it be helpful to have a reproducible environment? I can give you access to a hardware cluster reproduction of what is executing here.

@cole-miller (Contributor)

@doctorpangloss Yes, that would be very helpful! My SSH keys are listed at https://github.com/cole-miller.keys, and you can contact me at [email protected] to share any private details.

@cole-miller (Contributor) commented Mar 3, 2023

@sbidoul I had some trouble generating a flamegraph or perf report from your uploaded data -- if you get the chance, could you try following these steps (first source block, with your own perf.data file) on the machine in question and uploading the SVG? It seems like some debug symbols may be missing from my repro environment, or perhaps we have different builds of SQLite.

Re: your second graph, I'm a little confused because it looks like CPU usage for dqlite goes down during the spike event, and it's the kubelite process that's responsible for the spike. Am I misinterpreting?

As a side note, I have always been wondering what dqlite is doing to consume 0.2 CPU when the cluster is otherwise idle. Although I don't want to divert this thread if this is unrelated.

The main thing that dqlite has to do even in the steady state where it's not receiving any client requests is to exchange Raft "heartbeat" messages with other nodes, so that they don't think it has crashed. If you can gather perf data for one of those idle periods I'd be happy to try to interpret the results (it would be educational for me too, and we might uncover something unexpected).

@r2DoesInc

Has anyone tried v1.30 of m8s to see if the mentioned improvement of dqlite has made a difference?

I am using 1.30/stable currently and saw no difference.

@maximemoreillon commented Jun 2, 2024

Same problem here: dqlite is using massive amounts of CPU on a 3-node Microk8s v1.30 cluster.
The nodes are AWS EC2 instances running Ubuntu Server 20.04 LTS.
Running sudo systemctl restart snap.microk8s.daemon-k8s-dqlite.service helps mitigate the problem.

@r2DoesInc commented Jun 2, 2024 via email

@EmilCarpenter commented Jun 18, 2024

microk8s-k8s-dqlite-debug-github.com-issue-3227-A.txt
microk8s-k8s-dqlite-debug-github.com-issue-3227-B.txt

@ktsakalozos log files with --debug according to your instructions from a comment above.

No pods created beyond the standard ones.

microceph is installed, too.
microceph has a bug that fills the disk with logs at a rate of about 1G/h.
Even if the log level is set to 'warning', it resets to the default 'debug' after a reboot.

The high CPU usage by k8s-dqlite, kubelite and rsyslog started on most nodes when the disks filled up with logs.

The attached log files are from worker1, which has more than 1G of disk space available and which goes to 100% CPU usage maybe 5-10 minutes after a reboot.

There are lots of entries with the IP 10.0.2.15, which maybe shouldn't be there at all. That is the VirtualBox NAT IP. There is another host-only network, 192.168.60.0/24, which was used to add the worker nodes. I should probably configure the microk8s cluster for the 192 network at install time; I will look into that.


Edit 1:
Without this issue, total CPU usage is 2%-15%.

Edit 2:
The 'workers' are just named that; they were joined without the '--worker' flag.

Edit 3: Possible reason

Time sync was not implemented/working (due to me not installing Guest Additions on each VirtualBox VM).
Observations showed the time being off by several days between the servers/VMs.
Not tested yet with time in sync.


4 x Ubuntu 24.04 server VirtualBox VMs on an Ubuntu desktop host.
Master: 2 CPUs, 2G RAM
Workers x 3: 1 CPU, 1G RAM
microk8s version
MicroK8s v1.29.4 revision 6809

microceph --version
ceph-version: 18.2.0-0ubuntu3~cloud0; microceph-git: cba31e8c75

# Edit:
#  /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml
- ID: 3297041220608546238
  Address: 192.168.60.11:19001
  Role: 0
- ID: 1760248264364118143
  Address: 192.168.60.21:19001
  Role: 1
- ID: 12809918262664895569
  Address: 192.168.60.22:19001
  Role: 0
- ID: 6417727055371815105
  Address: 192.168.60.23:19001
  Role: 0
kubemaster@kmaster:~$ sudo microk8s status
microk8s is running
high-availability: yes
  datastore master nodes: 192.168.60.11:19001 192.168.60.22:19001 192.168.60.23:19001
  datastore standby nodes: 192.168.60.21:19001
addons:
  enabled:
    dashboard            # (core) The Kubernetes dashboard
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm                 # (core) Helm - the package manager for Kubernetes
    helm3                # (core) Helm 3 - the package manager for Kubernetes
    metallb              # (core) Loadbalancer for your Kubernetes cluster
    metrics-server       # (core) K8s Metrics Server for API access to service metrics
  disabled:
    ...

[Screenshot from 2024-06-19 00-50-19]

Hope this helps in troubleshooting.

@bananu7 commented Jun 27, 2024

I've just upgraded my cluster from 1.27 to 1.30. I run a small homelab with 3 nodes, with 8, 8, and 4GB of RAM and 2 cores each. After the upgrade I noticed a lack of stability. The 4GB node got hit especially badly. Even though I drained the nodes during the update, the added load caused them to misbehave. Right now the 4GB node is cordoned and sitting at 90% CPU. I tried disabling and enabling the control plane, and it didn't help. That's when I found this thread.

I'm perplexed as to why this issue can be open so long without a mitigation. I understand that m8s is an open-source project, but if it's meant to be production-ready, I should at least have a way of fixing my cluster in a hotfix way. I've read through all 100+ comments and it seems the best way is to back up my Longhorn, pack everything in boxes and move to a fresh k3s install.

@r2DoesInc commented Jun 27, 2024 via email

@bananu7 commented Jun 27, 2024

It seems that a downgrade of the offending node to 1.29.4 did help a bit (silver is 4GB):

[CPU usage graph]

I'll downgrade the other ones and see what happens. I'd really like to avoid going for 1 control plane node if I can.

@doctorpangloss

I'm perplexed as to why this issue can be open so long without a mitigation.

Beats me. I would have started the migration to etcd away from dqlite 2 years ago. k3s used to have the same problems and stopped using dqlite in 2022.

@Dunge (Contributor) commented Jun 27, 2024

I don't believe version changes have much impact on triggering the issue or not. Nearly 2 years ago I was on 1.26 and got it sporadically; sometimes rebooting one node would fix it. I tried to upgrade to 1.27 and never managed to get it working; it was always stuck in this mode. I downgraded back to 1.26 and it's been running flawlessly ever since (nearly two years). I do believe it just requires one heavy change of node structure to trigger it, and it's the same issue that has lingered for years. So no, @bananu7, switching from 1.30 to 1.29 fixing it was probably dumb luck.

I somehow also believe it's related to disk usage. I have issues related to Longhorn, and other comments here mention Ceph and microceph struggling, and low remaining disk space.

Suggesting disabling HA is also completely ridiculous; HA is the only reason why we use Kubernetes in the first place.

@fmiqbal commented Jun 28, 2024

Got this problem on my 3 masters, 8c/16G (my mistake that I didn't isolate the master nodes from the worker nodes on my 500-pod cluster, but nonetheless).

One of the 3 nodes became unresponsive, hogging 100% CPU and 100% memory, and because swap is disabled as per the recommendation, it had nowhere to go, so I tried to force-reboot it. But then another node became like that, so I ended up doing a recovery, downgrading to a single-node cluster, and having to leave-rejoin the worker nodes, which also broke my cluster's Longhorn state and faulted 40 volumes (also my mistake).

During the migration to standalone master nodes (which will solely host microk8s, without any other k8s workload, with 3 nodes at 4c/8g because I think that will be enough), I found a few things:

  • a single-node master can start flapping: the kubelet times out because dqlite becomes slower, showing "Context Deadline Exceeded" with Txn calls completing in 10000ms
  • when the third node joins to form the HA cluster, after dqlite receives and inflates the data it grows in memory, draining all free memory, swapping, and maxing out the CPU. This only happens on the third node. The "solution" is just to wait it out; adding swap helps a bit, but generally you want more memory (my 8g couldn't cope)
  • after I waited it out, memory and CPU usage dropped and it became kind of stable, but then it started flapping again because of slow database calls, and something got killed. The weird thing is that when I do kubectl describe node, it says "kubelet : OOM, victim dqlite" or something like that; I couldn't quite capture the log because I needed it working ASAP
  • the main cause of the slow database calls, I think, is the gaps: I have 60k total records in kine, and 40k of them are gaps. How the gaps appeared in the first place I don't know, but after I deleted them it became more stable (a query sketch follows below)
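
A rough way to count those gap rows (a sketch, assuming kine names its filler rows gap-<id> and assuming the default MicroK8s paths) is:

# Count kine "gap" filler rows via the dqlite shell; the cert paths and the
# gap-<id> naming convention are assumptions based on the defaults above.
sudo -E /snap/microk8s/current/bin/dqlite \
  -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt \
  -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key \
  -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml \
  k8s "SELECT COUNT(*) FROM kine WHERE name LIKE 'gap-%'"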

What does help me with debugging is enabling dqlite debug with a flag in /var/snap/microk8s/current/args/k8s-dqlite-env, checking which node is the leader with

sudo -E /snap/microk8s/current/bin/dqlite -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml k8s ".leader"

accessing the dqlite database with

/snap/microk8s/current/bin/dqlite -s 10.10.8.62:19001 -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key k8s

and checking the kubelite Txn duration with

sudo journalctl -f | grep Txn

The questions that remain are:

  • why is HA-cluster creation on the third node so resource-heavy on dqlite?
  • and if it's because of the clutter of gaps, how can it be mitigated?

What I currently use is throw_more_ram™, although I think we need a benchmark on how many k8s resources dqlite can realistically handle, relating how big the storage used under /var/snap/microk8s/current/var/kubernetes/backend/ is to how much memory is needed. Currently I have 20K kine entries, a 700MB snapshot size, and RAM usage sitting at 11 GB (dqlite and kubelet).
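
For a rough baseline of those numbers, something like this gives the on-disk size and the kine row count (default MicroK8s paths assumed, same dqlite invocation as above):

# Size of the dqlite backend directory (snapshots, segments, database)
sudo du -sh /var/snap/microk8s/current/var/kubernetes/backend/
# Total number of kine rows, via the dqlite shell
sudo -E /snap/microk8s/current/bin/dqlite \
  -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt \
  -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key \
  -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml \
  k8s "SELECT COUNT(*) FROM kine"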

On the other hand (or rather, back to the main topic of this thread), there is still occasionally constant 100% dqlite usage, but with my 8 cores I don't worry about it too much.

@lucian1094

I'm also facing this issue. I want to install an external etcd cluster and restore the data from a dqlite backup. Has anyone tried this? Can it be done?

@doctorpangloss

  • why is HA-cluster creation on the third node so resource-heavy on dqlite?

because dqlite is buggy

and if it's because of the clutter of gaps, how can it be mitigated?

By adopting etcd

I'm also facing this issue. I want to install an external etcd cluster and restore the data from a dqlite backup. Has anyone tried this? Can it be done?

For your sake, I'd hope so.

@benben commented Sep 13, 2024

After a bit more than a year, my microk8s finally gave up. I had been running it in single-node mode because of this bug, but now it ended up freezing my whole server and eating all the CPU. Restarting helped, but I never got it back to a consistent state regarding running pods etc.

I finally gave up, uninstalled microk8s and snap, and moved to k3s. It is sad to see that dqlite is fundamentally broken and Canonical has no interest in fixing it. I can only recommend that everyone move to k3s. It was extremely straightforward to set up, and it's actually much more flexible to install, since you are not bound to snap.

Good luck everyone here! ❤️

@claudiubelu (Contributor) commented Oct 15, 2024

Hello there!

I see this has been quite the thread, and it took a while to catch up on everything that's been happening here. I am a bit late, but I'm going to touch on a few of the things which have been mentioned here and share some of my thoughts, ramblings, knowledge, experiences, and findings with you as well. I'll try to keep the following as impartial as possible. Some are general tips that may or may not apply to you, whether you use microk8s or not.

I see that quite a few people are trying to use Rook-Ceph and/or kubevirt. Actually, I see a lot of people here have mentioned they're also using something storage-related, like Longhorn, minio, etc., which is interesting. In fact, these seem to be quite storage-intensive; I wonder if it all leads to some sort of storage-related resource starvation, which also impacts k8s-dqlite. Has anyone tried separating the Kubernetes control plane (+ k8s-dqlite) from the storage plane and putting them on separate nodes? It is common practice to separate the control plane from the rest of the workloads, and it might also be useful in this scenario. I understand that it's still microk8s after all, and there are probably not a lot of nodes available, but at least a partial separation may still be useful (e.g. have at least one node fully dedicated to the k8s control plane + the k8s-dqlite master node, etcd, or whatever other database you're using).

Anyway, in regards to Rook-Ceph and/or kubevirt (this may also be relevant for RedPanda, not sure, or any other applications/operators that have to watch a lot of files), here is an important tip for you, though I don't think it will solve the issue observed in this thread with k8s-dqlite. Please check the following fields:

cat /proc/sys/fs/inotify/max_user_instances
cat /proc/sys/fs/inotify/max_user_watches

By default in Ubuntu (I've verified this on 22.04 and 24.04), they're set to 128 and 45,674 respectively. fs.inotify.max_user_instances being set to 128 is especially low, considering that a freshly installed Kubernetes node may easily have somewhere between 35 and 100 instances already in use, and that basically every Pod consumes 2 more instances, as you can see below (credit for the inotify-info executable and details goes to https://stackoverflow.com/questions/13758877/how-do-i-find-out-what-inotify-watches-have-been-registered):

ubuntu@microk8s-03:~/inotify-info$ sudo ./_release/inotify-info
------------------------------------------------------------------------------
INotify Limits:
  max_queued_events    16,384
  max_user_instances   128
  max_user_watches     45,674
------------------------------------------------------------------------------
       Pid Uid        App                      Watches  Instances
    180833 0          kubelite                     287         12
         1 0          systemd                      113          5
       782 0          polkitd                       18          2
    193917 0          udevadm                       14          1
       448 0          udevadm                       14          1
    174351 1000       systemd                        7          3
    188787 0          containerd-shim-runc-v2        4          2
    187202 0          containerd-shim-runc-v2        4          2
    188712 0          containerd-shim-runc-v2        4          2
    188519 0          containerd-shim-runc-v2        4          2
    188369 0          containerd-shim-runc-v2        4          2
    188257 0          containerd-shim-runc-v2        4          2
    188128 0          containerd-shim-runc-v2        4          2
    187930 0          containerd-shim-runc-v2        4          2
...
    190950 0          containerd-shim-runc-v2        4          2
    186828 0          containerd-shim-runc-v2        4          2
       773 103        dbus-daemon                    4          1
    178685 1000       dbus-daemon                    4          1
    181207 0          containerd-shim-runc-v2        4          2
    182200 0          containerd-shim-runc-v2        4          2
...
    185321 0          containerd-shim-runc-v2        4          2
    185042 0          containerd-shim-runc-v2        4          2
       801 0          udisksd                        3          2
    191028 0          containerd-shim-runc-v2        2          1
    180711 0          containerd                     1          1
       799 0          systemd-logind                 1          1
       684 102        systemd-resolved               1          1
       651 104        systemd-timesyncd              1          1
       445 0          multipathd                     1          1
------------------------------------------------------------------------------
Total inotify Watches:   675
Total inotify Instances: 136
------------------------------------------------------------------------------

We have had some issues with these settings in some bare-metal deployments that also used Rook and Kubevirt (note that it may be possible for some Kubernetes operators to overwrite these values as well), and we had to update these values in order for them to function properly. This has also been noted in a few other places.

The inotify values have been updated in microk8s as well since v1.28 (#4094), and microk8s inspect should also raise some warnings regarding this (#4136). To set these values, you can run:

echo fs.inotify.max_user_instances=1024 | sudo tee -a /etc/sysctl.conf
echo fs.inotify.max_user_watches=1048576 | sudo tee -a /etc/sysctl.conf
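
To apply the new limits without rebooting (standard sysctl behaviour, nothing MicroK8s-specific):

# Reload /etc/sysctl.conf so the new inotify limits take effect immediately
sudo sysctl -p /etc/sysctl.conf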

As a personal anecdote, when I was trying to replicate the scenarios listed here and deployed Rook-Ceph on microk8s plus a Ceph cluster on top of it, it got stuck for quite a while after spawning the first ceph-mon, and immediately got unstuck and spawned the other 2 ceph-mons after I increased these values. It might help with the other Rook-Ceph-related issues I've seen here.

After writing this, I see that @djjudas21 also mentioned these settings. Oh well.

I've talked about these settings since I assumed they may have had something to do with this issue, but I haven't observed an impact, or I haven't done sufficient testing to notice one.

A biiiit unrelated to the k8s-dqlite issue, but still kinda' related (will mention a bit later how):

I have been running MicroK8s since v1.17 I think and it has generally been rock solid and only broken when I've fiddled with it. Since v1.25/v1.26 there seem to be chronic issues affecting the stability. I have personally lost data due to this dqlite CPU problem (long story short, I lost dqlite quorum which broke the kube api-server, but I was using OpenEBS/cStor clustered storage, which depends on kube quorum for its own quorum. When it lost quorum and the kube api-server become silently read-only, the storage controller got itself into a bad state and volumes could not be mounted).

A colleague pointed me at this post as this is very similar behavior to what I have been experiencing and have been able to reliably replicate.

I run v1.26 3 nodes HA with an argo workflows instance.

I have a workflow that scales out ~2000 jobs to download json data from a http endpoint, then dumps it into a mongodb instance. Each job should take no more than a couple seconds to complete.

When the workflow's parallel setting is set to run 100 simultaneous download jobs and left to run for a bit, it nukes the cluster like you describe and requires a rebuild. Dialing back to 50 or less jobs in parallel does not cause the issue.

and:

Hello. All my clues, without proof.

Since April, only one of our two microk8s clusters has been failing every month, around the 4th/5th. Always the same one. The workloads are the same. All of them are single-node.

We often found the cluster failed on a Monday morning, after 2 days with no deployments and no traffic (the two clusters are test environments). Often the onset of the outage begins with multiple pending Pods, some having tried to redeploy and blown up the number of Pods (as some try to reprint and fill the printing pool): more than 110 pods, counting pending and failed ones. But even after cleaning, the state does not recover.

The other cluster being on a channel that no longer evolves (1.25), I suspected the auto-updates of the snaps. I recreated another cluster and tried to abuse it: overload, unexpected restarts, snap updates... of course it didn't fail.

So I recreated it with microk8s updates blocked and put it into "internal production"; it's still working (hey, happy 1st month!).

I noticed recently that the PowerEdge which hosts the cluster has very slow disk access (~30MB/s), because of a RAID1 on a PERC 350, which has no cache (it's not really a PERC). This sometimes causes significant IO delays.

In my opinion, maybe the lack of power causes the failure. But then something keeps the load on the database looping, and creates a broken state.

I understand that until you reproduce the problem...

So, you might have another issue here when it comes to those numbers. By default, Kubernetes supports only 110 Pods per node, as you can see here: https://kubernetes.io/docs/setup/best-practices/cluster-large/. If you indeed want to scale to such numbers, you should probably have a few more nodes in order to handle that kind of influx of Jobs/Pods, or adjust your workflow/pipeline to throttle that number a bit. If you do have bulky nodes that could potentially handle those kinds of workloads, then you should also adjust that number (see the sketch below). I'm saying this because someone had 110+ Pods on a single node, in which case any additional Pod would end up stuck in a Pending state or even in an Error state. Of course, I think it's OK to go a bit over the limit for short-lived Pods (Jobs), but not for regular, long-lived Pods.
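
If you do decide to raise the per-node limit, a sketch of what that looks like on MicroK8s (the kubelet args file follows the same /var/snap/microk8s/current/args/ pattern as the other snap files mentioned in this thread; 250 is only an illustrative value):

# Raise kubelet's pod limit on a MicroK8s node, then restart kubelite.
# The value 250 is an example; size it to what the node can actually handle.
echo '--max-pods=250' | sudo tee -a /var/snap/microk8s/current/args/kubelet
sudo snap restart microk8s.daemon-kubelite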

Now, I said that it is slightly related. Regarding how Kubernetes works, it really does store a lot of data, or, more precisely, it can cause data updates very frequently. Every Kubernetes Object has a resourceVersion, which is used to track / watch changes to that object (https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes). For example:

kubectl get -o yaml pod/test-745ebc7294
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: 3019ade39b75d6791372973fda626886f41c6b0ebb75bb0ef51c351b4565d4b8
    cni.projectcalico.org/podIP: ""
    cni.projectcalico.org/podIPs: ""
  creationTimestamp: "2024-10-14T16:41:45Z"
  labels:
    run: test-745ebc7294
  name: test-745ebc7294
  namespace: default
  resourceVersion: "377085"
  uid: 912abcf4-a8f5-4de6-b3bd-222e9a58d18b
spec:
  containers:
  - args:
    - echo
    - hello there
    image: busybox
...
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-10-14T16:41:50Z"
    status: "False"
    type: PodReadyToStartContainers
...
  podIPs:
  - ip: 10.1.63.138
  qosClass: BestEffort
  startTime: "2024-10-14T16:41:45Z"

As you can see, the resourceVersion is 377085. And that's on a 5-day-old 3-node cluster that basically did nothing the entire time but exist, and which probably doesn't even have 50% uptime, since it's been on my laptop. From what we've observed, that's a global counter for the entire cluster; it doesn't mean that particular resource was updated that many times. This basically means there have been 377085 DB updates, which have also been synced across 3 replicas.

There are some things which constantly cause that resourceVersion to increase, like leases (which always have to be refreshed, and apparently having lots of leases expiring could lead to instability: etcd-io/etcd#9360), serviceaccount tokens (each Pod has one, and it's automatically renewed by Kubernetes), and events (for example, when spawning a Pod, at least 5 events are generated, depending on the number of container images, containers, and probes it has; other resources may also have their own events, like PVCs and other CustomResources) (and maybe tokens are updated as well?). In fact, the number of events may be large enough that it's apparently recommended as a best practice for large clusters to store Event objects in a separate dedicated etcd instance: https://kubernetes.io/docs/setup/best-practices/cluster-large/#etcd-storage. This issue becomes worse and worse based on the number of nodes, operators, users, workloads, and resources present in the cluster. Interestingly, the counter has also been known to wrap around: kubernetes-client/javascript#516
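
As a rough illustration, you can watch that counter climb in near-real time on an otherwise idle cluster just by watching the node leases (plain kubectl, no cluster-specific assumptions):

# Every lease renewal bumps resourceVersion, so even an idle cluster
# produces a steady stream of rows here.
kubectl get leases -n kube-node-lease --watch \
  -o custom-columns=NAME:.metadata.name,RESOURCE_VERSION:.metadata.resourceVersion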

Anyways, this basically means that Kubernetes does need a database that's able to keep up with this (a bit absurd, IMO) number of updates, in addition to it being fault tolerant. That is a bit difficult, IMO, and I'm not exactly sure how etcd or k8s-dqlite handles this sort of thing. Though, if I may throw in an unwarranted opinion, there are a few things which could probably help a lot on "scaling" these database updates:

  • I'm not sure if these are written in the database, but I'm thinking about Kubernetes Events: there are a lot of them, and, according to the Kubernetes documentation, "Event consumers should not rely on the timing of an event with a given Reason reflecting a consistent underlying trigger, or the continued existence of events with that Reason. Events should be treated as informative, best-effort, supplemental data." So, being best-effort, I wouldn't even store them in the database; I'd only store them (and sync them across nodes) in memory, and they're meant to expire in about 5-6 minutes anyway. I'm not too worried about losing them in an HA scenario; if all 3 instances are down, you have bigger issues anyway. This also means that k8s-dqlite would have less work to do when it has to optimize / clean the database. resourceVersion should still be incremented and saved, though, I think. This should also speed up Kubernetes event queries, as they're in memory (plus, Kubernetes events look a bit like a time series, and I'm not sure how good sqlite is at querying such data). It may not be everyone's cup of tea, and it may not fit everyone's wishes, so it could be an optional flag for k8s-dqlite? That's assuming it can be done in the first place.
  • When a Kubernetes object is created, it typically changes its state a couple of times in a short window of time. Perhaps we could group those updates together (in, let's say, a timespan of 1 second?) and persist them to the database together for the same object. I thought about only doing best-effort persistence for a Kubernetes resource's status, but no, you need that data to be in line with reality.

Other than that, as a general note, I would take a look at what exactly is generating a lot of noise / churn in the Kubernetes cluster itself, starting with the Kubernetes Events. For example, flip-flopping resources can cause quite a few events and thus resource status updates, which translates into several database updates which have to be synced across multiple nodes:

kubectl get events -A
NAMESPACE   LAST SEEN   TYPE      REASON      OBJECT              MESSAGE
default     2m16s       Warning   BackOff     pod/liveness-exec   Back-off restarting failed container liveness in pod liveness-exec_default(01f7a73e-e694-4834-b415-8c2dd9b25e3c)
default     88s         Normal    Scheduled   pod/liveness-exec   Successfully assigned default/liveness-exec to microk8s-02
default     13s         Normal    Pulling     pod/liveness-exec   Pulling image "registry.k8s.io/busybox"
default     85s         Normal    Pulled      pod/liveness-exec   Successfully pulled image "registry.k8s.io/busybox" in 2.28s (2.28s including waiting). Image size: 1144547 bytes.
default     12s         Normal    Created     pod/liveness-exec   Created container liveness
default     11s         Normal    Started     pod/liveness-exec   Started container liveness
default     43s         Warning   Unhealthy   pod/liveness-exec   Liveness probe failed: cat: can't open '/tmp/healthy': No such file or directory
default     43s         Normal    Killing     pod/liveness-exec   Container liveness failed liveness probe, will be restarted
default     12s         Normal    Pulled      pod/liveness-exec   Successfully pulled image "registry.k8s.io/busybox" in 1.218s (1.218s including waiting). Image size: 1144547 bytes.
...

As a note, I think restarting a node would generate a lot of events / status updates for all the resources currently present on that node (I haven't checked). But there could also be other cluster-related health issues or warnings which could generate a lot of events / status updates, including misconfigurations, node health warnings (disk, memory, PID pressure), etc. For example, I've tried a v1.26 cluster, which apparently many people had issues with, and I've seen the following events:

1s          Normal    Scheduled                 pod/test-e31e464e95   Successfully assigned default/test-e31e464e95 to microk8s-01
1s          Warning   MissingClusterDNS         pod/test-e31e464e95   pod: "test-e31e464e95_default(a47a23d1-ba71-4c3e-9cd9-b687ac064b44)". kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
1s          Normal    Pulling                   pod/test-e31e464e95   Pulling image "busybox"
1s          Warning   MissingClusterDNS         pod/test-9a865292ac   pod: "test-9a865292ac_default(bee826e9-dca9-4067-ae48-3ecbd6d8852e)". kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
1s          Warning   MissingClusterDNS         pod/test-cc0993a671   pod: "test-cc0993a671_default(15327d39-de6a-48c2-8685-0454faf246ed)". kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.

So, a lot of events have been generated simply because kubelet was not configured with a ClusterDNS IP. Great.

As a side note, I have always been wondering what dqlite is doing to consume 0.2 CPU when the cluster is otherwise idle. Although I don't want to divert this thread if this is unrelated.

An idle cluster still has a lot of updates to do, apparently, from what we've seen above. It has gotten better since then, though, as you'll see below. But tl;dr, I tried a v1.31.1 microk8s 3-node cluster which was busy spawning 10K pods, and the master node was orbiting around 10-20% CPU usage in a VM.

It looks like I finally solved this problem. First I stopped kubelite: snap stop microk8s.daemon-kubelite. (The workloads run just fine; even better, kubelite isn't shutting them down.) Then I set debug=true for k8s-dqlite and restarted it using snap restart microk8s.daemon-k8s-dqlite. Then, in journalctl -f -u snap.microk8s.daemon-k8s-dqlite, I saw it was deleting rows, so I assumed it was doing compaction. In my version, kine is most likely 0.4.1, which does row-by-row deletes. This version is used by all newer k8s-dqlite versions except master, which includes kine in-tree. But it was stuck deleting the entries of which there are circa 30,000, namely /registry/masterleases/.... It is only a hypothesis that having a lot of operations on the same key is slower, or that the SQL used in compaction is non-optimal in that case; I don't have data or tests to prove it, but I imagine it can play a role. So I deleted those records manually, using the dqlite client connected to the database. Then, after a few restarts and minutes, CPU is around 50% and the log also contains different messages, not just QUERY/DELETE with '/registry/masterleases/...'. Sorry for not graphing the data, but after that the compaction progress was also faster.

This is a summary of the most common rows:

* /registry/apiregistration.k8s.io/apiservices/v1beta1.metrics.k8s.io|5068

* /registry/endpointslices/default/kubernetes|5624

* /registry/services/endpoints/default/kubernetes|6130

* /registry/leases/kube-node-lease/|10418

* /registry/leases/kube-node-lease/|12247

* /registry/leases/kube-node-lease/|15378

* /registry/leases/kube-system/kube-controller-manager|23421

* /registry/masterleases/|28249

* /registry/masterleases/|30344

* /registry/masterleases/|35107

* /registry/leases/kube-system/kube-scheduler|35253

This was probably related to flapping that occurred when there was some infrastructure error.

I tried microk8s v1.26 as well and bombarded it with loads of resources; k8s-dqlite did get to 100% CPU usage and was spending a lot of time compacting, but it wasn't unresponsive. After I deleted the trash resources created for the stress test, k8s-dqlite finished its compacting and resumed regular functionality.

... but those numbers of masterleases updates are so high. They make up the majority of the updates, and there are only 3 of them. And I assume that's also after some compacting. What if there are more masterleases? From what I can tell, they're from / related to EndpointSlices (which became stable / GA in v1.21; that would align somewhat with people's upgrade stories), as there's a concept of management / ownership there, and thus leasing and updating the lease, and I assume the leasing would be per IP in the EndpointSlice (I came to this conclusion thanks to @ole1986's post). The masterlease is not an exposed Kubernetes API resource, so I can't really see how often it has to be updated. It may be a bit short. For example, for regular leases, you'd see something like this:

kubectl get -o yaml -n kube-node-lease lease microk8s-01 
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  creationTimestamp: "2024-10-14T19:18:23Z"
  name: microk8s-01
  namespace: kube-node-lease
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: microk8s-01
    uid: ed59f34b-7b31-44b2-8fe1-121288d35338
  resourceVersion: "29778"
  uid: 55e17baf-4f8d-4bcf-853d-e428925fdc49
spec:
  holderIdentity: microk8s-01
  leaseDurationSeconds: 40
  renewTime: "2024-10-14T23:38:31.074558Z"

I wonder if anyone else has a higher number of Endpoints / EndpointSlices and could share their number of masterleases entries, and how stable their deployment is. I also wonder if we can adjust the lease duration, so it wouldn't be so noisy.

-- SQL Statement
SELECT name, COUNT(*) FROM kine WHERE name LIKE '/registry/masterleases/%' GROUP BY name;
-- RESULT
name	                                COUNT(*)
/registry/masterleases/10.251.6.106	3417
/registry/masterleases/10.251.6.91	1055
/registry/masterleases/10.251.6.92	172
/registry/masterleases/10.251.6.93	922
/registry/masterleases/10.251.6.94	4015
/registry/masterleases/10.251.6.95	1425

It is interesting how the number of entries varies so much. I wonder if it's due to compacting, or some sort of lease duration.

-- SQL Statement
SELECT name,COUNT(*)from kine WHERE name LIKE '/registry/leases/%' GROUP BY name;
-- RESULT
name	                                                                COUNT(*)
/registry/leases/default/external-attacher-leader-rbd-csi-ceph-com	    842
/registry/leases/default/external-resizer-rbd-csi-ceph-com	            1046
/registry/leases/default/external-snapshotter-leader-rbd-csi-ceph-com	783
/registry/leases/default/rbd-csi-ceph-com	                            977
/registry/leases/default/rbd.csi.ceph.com-default	                    820
/registry/leases/kube-node-lease/xxxxx01.xxxxx.local	                830
/registry/leases/kube-node-lease/xxxxx02.xxxx.local	                    118
/registry/leases/kube-node-lease/xxxxx03.xxxx.local	                    442
/registry/leases/kube-node-lease/xxxxx04.xxxx.local	                    2955
/registry/leases/kube-node-lease/xxxxx05.xxxx.local	                    954
/registry/leases/kube-node-lease/xxxxx06.xxxx.local	                    2364
/registry/leases/kube-system/apiserver-4hmltwx2ygxotcrsm42ilrqwje	    926
/registry/leases/kube-system/apiserver-auwbsyd25mw2gwj6qfwwlx7gja	    1401
/registry/leases/kube-system/apiserver-cyvthvhws355b6yxwgf5iwn3dm	    2810
/registry/leases/kube-system/apiserver-gxnzhfuf6xkdgdo54typdcalna	    1030
/registry/leases/kube-system/apiserver-nmgvfj7tz3ojxk35fkkajlj7ga	    176
/registry/leases/kube-system/apiserver-v43ksimizpq43irz6v7ilm5niq	    4825
/registry/leases/kube-system/kube-controller-manager	                6171
/registry/leases/kube-system/kube-scheduler	                            6929
/registry/leases/kube-system/nfs-csi-k8s-io	                            500

Hmm, I see you have 6 API servers running, based on the number of apiserver leases. I assume you joined all of them as "master" nodes. That also probably means that all the nodes run k8s-dqlite, and 6 of them is too many for such a cluster; 3 should be enough. Having more just creates more noise and makes it harder to sync data across all the nodes. Even for etcd: "The recommended etcd cluster size is 3, 5 or 7, which is decided by the fault tolerance requirement. A 7-member cluster can provide enough fault tolerance in most cases. While larger cluster provides better fault tolerance the write performance reduces since data needs to be replicated to more machines." (https://etcd.io/docs/v2.3/admin_guide/#optimal-cluster-size).

**PS: You can add nodes as workers to a microk8s cluster if you add the --worker flag. This piece of information is also mentioned when you run microk8s add-node. For example:**

# Use the '--worker' flag to join a node as a worker not running the control plane, eg:
microk8s join ip:port/token/token2 --worker
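
And, for what it's worth, converting one of the extra control-plane nodes into a worker would look roughly like this (only a sketch: drain the node first, remove one node at a time, and re-check dqlite quorum in between):

# on the node being demoted: leave the cluster (this wipes its local dqlite state)
microk8s leave

# on one of the remaining control-plane nodes: drop the departed member
microk8s remove-node <node-name>

# still on a control-plane node: print a fresh join token
microk8s add-node

# back on the demoted node: rejoin as a worker with the printed token
microk8s join ip:port/token/token2 --worker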

I have tried to replicate some of the scenarios as well. I might have to do some additional digging in light of some of the discoveries above, but that will have to wait. I will detail my setup for comparison. I ran these experiments in 3 Hyper-V Generation 2 VMs (on a Windows 11 host), created on an SSD, each VM with a 100GB dynamically expanding differencing disk, 2 vCPUs, and 5GB of memory (one of them has 6GB though). This setup is a bit weaker compared to some of the specs that have been shared here. (A small FYI: a Kubernetes node should have at least 2 (v)CPUs, otherwise it'll probably freeze up; learned that the hard way.)

As for the VMs, they're all identical:

cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

I've tried out microk8s v1.31.1, and the nodes have been clustered up:

ubuntu@microk8s-01:~$ kubectl get nodes
NAME          STATUS   ROLES    AGE     VERSION
microk8s-01   Ready    <none>   11m     v1.31.1
microk8s-02   Ready    <none>   10m     v1.31.1
microk8s-03   Ready    <none>   9m46s   v1.31.1

I have generated quite a bit of load on the cluster generating Pods like this:

microk8s kubectl run --restart Never --image busybox test-$(cat /dev/urandom | tr -cd "a-f0-9" | head -c 10) -- sh -c "echo hello there && sleep 1200"
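
(For reference, getting to thousands of Pods just means wrapping that in a loop, e.g. a throwaway sketch like:)

for i in $(seq 1 2000); do
  microk8s kubectl run --restart Never --image busybox \
    "test-$(cat /dev/urandom | tr -cd 'a-f0-9' | head -c 10)" \
    -- sh -c "echo hello there && sleep 1200"
done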

There are quite a few of them:

ubuntu@microk8s-01:~$ kubectl get pods | wc -l
10831

Of course, not all of them can be Running, given the per-node Pod limit, but it should generate a fair bit of churn, also considering all the events, service account tokens, other leases, etc. that come with them. I've followed the CPU consumption of k8s-dqlite on the master node. The CPU usage has been between 10-20% most of the time during this small experiment (and it has been lower on the other nodes):

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 179940 root      20   0 1916800 614240  10080 R  14.1  10.1  42:03.07 k8s-dqlite
 179940 root      20   0 1917240 614628  10080 S  14.5  10.1  42:07.19 k8s-dqlite
 179940 root      20   0 1917144 614532  10080 S  17.5  10.1  42:09.24 k8s-dqlite
 179940 root      20   0 1928332 622612  10076 S  23.4  10.3  42:10.73 k8s-dqlite
 179940 root      20   0 1902256 600616  10080 S   8.2   9.9  42:13.79 k8s-dqlite

I will have to check the logs later as well, to see if k8s-dqlite has started compacting, and run some more experiments.
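
(Something along these lines should do for that check; grepping for "compact" is just my guess at the log wording:)

sudo journalctl -u snap.microk8s.daemon-k8s-dqlite --since "2 hours ago" | grep -iE 'compact|delete'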

Best regards,

Claudiu

@claudiubelu
Contributor

claudiubelu commented Oct 15, 2024

PS: A few other quick updates.

... but those numbers of masterlease updates... are so high. They make up the majority of those updates. [...] The masterlease is not an exposed Kubernetes API resource, so I can't really see how often it has to be updated; its renewal period may be a bit short.

The masterleases can be found here: https://github.com/kubernetes/kubernetes/blob/55b83c92b3b69cd53d5bf22b8ccff859a005241a/pkg/controlplane/instance.go#L223 and they are used by an EndpointReconciler. I didn't find any other place in which they are being used (though I did find other types of leases, as well as config options, including endpointsleases and configmapsleases). The default TTL for such a lease is 15 seconds (https://github.com/kubernetes/kubernetes/blob/55b83c92b3b69cd53d5bf22b8ccff859a005241a/pkg/controlplane/instance.go#L163), and the default renew interval is 2/3 of that, so every 10 seconds. This looks like a configurable option, though at a quick glance I didn't find anything related to it in the Kubernetes documentation.

PS: But I did find that there are multiple endpoint reconciler types: https://github.com/kubernetes/kubernetes/blob/55b83c92b3b69cd53d5bf22b8ccff859a005241a/pkg/controlplane/reconcilers/reconcilers.go#L53-L58 . This can be set through the kube-apiserver flag --endpoint-reconciler-type string Default: "lease". The current default value, lease, basically means that the lease will get stored / updated in the database's storage. The old one, master-count (it used to be the default, I think), was deprecated in v1.24 (https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.24.md#deprecation-3). But being a kube-apiserver flag also got me thinking for a bit: does that mean that ALL kube-apiserver instances will update ALL these masterleases independently of one another? Hmm...
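
If it really is one masterlease write per apiserver every ~10 s (an assumption on my part, based on the defaults above), a quick back-of-envelope estimate for a 6-apiserver cluster like the one in the query results above:

6 apiservers x 1 write / 10 s = 0.6 writes/s
0.6 writes/s x 86,400 s/day ≈ 51,840 masterlease revisions/day (before compaction)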

@miro-balaz thank you. We already gave up on our cluster. But I could get into the database through the uncompressed snapshot file, and I also noticed redundant registry entries where often only a single byte in the value/old_value column changes for the same name.

Indeed, I found this part as well. It's the endpoint's masterlease Generation count being incremented: https://github.com/kubernetes/kubernetes/blob/55b83c92b3b69cd53d5bf22b8ccff859a005241a/pkg/controlplane/reconcilers/lease.go#L115 . This apparently is done to ensure the lease's TTL gets refreshed, keeping in mind that if nothing else actually changes, the database record is not actually saved (and the resourceVersion doesn't increment either). Tbh, I'm not exactly sure how this helps; I would have expected renewTime to be updated with a new value instead, not Generation. It looks like renewTime remains the same? And if renewTime doesn't get updated, how else does it know the lease expired in the first place?

Speaking of guaranteeing database updates, as seen in a comment in the link above, I wonder if there's ever a case in which k8s-dqlite would have to "update" a Kubernetes object that has literally zero changes (excluding the resourceVersion field), and whether it makes this sort of did-anything-actually-change check. If there are such cases, k8s-dqlite should probably make those checks and skip such no-op database operations. Note that you can't do this sort of thing as a user through the Kubernetes API by patching resources with zero changes; it won't allow you. But there are still some internal Kubernetes objects which are being updated (including masterleases), and I wonder if there are any other internal objects that effectively do no-op database updates.

Another quick thing I found interesting: among the kube-apiserver command-line arguments (https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/) there is an option, --lease-reuse-duration-seconds int Default: 60, with the following explanation: "The time in seconds that each lease is reused. A lower value could avoid large number of objects reusing the same lease. Notice that a too small value may cause performance problems at storage layer." So, I think we can acknowledge that short leases could generate a lot of churn for the K8s database? PS: It's not related to the masterleases lease duration, though.

It is interesting how the number of entries varies so much. I wonder if it's due to compacting, or some sort of lease duration.

It can also be because they haven't been around for as long as the others. Didn't think about that at the time. 3AM may not be a good moment to think about things.

PS: It seems to me that there are a lot of operations related to leases. I'm not sure an on-disk database is a good choice for leases, keeping in mind that they have to be constantly updated, otherwise they expire. In fact, it is quite well known as a mechanism to store distributed locks / leases in in-memory databases like Redis or etcd (they've even come up with their own name for it, "Redis lock"). This could be the reason people haven't seen these issues with etcd-backed K8s nodes. In this sense, if I may amend the list of k8s-dqlite suggestions: also treat leases like the Kubernetes events I've mentioned above, keep them only in memory, and don't store them in the SQLite database (maybe only the first instance? idk). If the k8s-dqlite instance goes down, that's fine, you still have 1-2 other instances. If all of them go down, it's fine, new leases will be created / updated anyway by whoever needs them.

@bananu7

bananu7 commented Oct 15, 2024

Since I participated here - my cluster worked for a while with mismatched versions, but finally became completely unusable. The 4GB node stopped working altogether (even SSH didn't work), so I removed everything and finally moved to K3s.

On exactly the same hardware, I now see 0.6% cpu use instead of 60% with an empty cluster.

@Azbesciak

Any progress on this? Every 2h I get unreasonable spikes. I already needed to wipe everything once, and the problem came back, and it increases every day.

@ktsakalozos
Member

Hi @Azbesciak could you share a microk8s inspect tarball produced right after one such spike appears? Also, are you aware of any process in your system that runs every two hours?

@Azbesciak

Hi @ktsakalozos - no, I am not aware of any process like this. When I got the alert I quickly checked top, and the top positions were kubelite, k8s-dqlite and prometheus, swapping places with each other.

I will provide one when I am able to catch the spike. Thank you for the answer.

@Azbesciak

Azbesciak commented Nov 9, 2024

@ktsakalozos
I was a bit late to the party, because I thought it happened only on one node, which turned out not to be true. It happens on both (I have only 2 nodes), but there is a 10 min shift between them in that load spike:
[load graphs for both nodes]
I have a tarball from 10 min after that spike (attaching 2, because I was also a bit late 2h ago); not sure if it shows anything - at least I did not notice anything.
inspection-report-20241109_100506.tar.gz
inspection-report-20241109_082602.tar.gz

Also, the load for the last 24h:
[24h load graph]

@Azbesciak

I was able to catch the spike - not right at the top but a bit after; some load still remained though.
[screenshot]

inspection-report-20241109_151312.tar.gz

@louiseschmidtgen
Contributor

Hello @Azbesciak,

would you be willing to upgrade your microk8s snap to v1.31? This release has the most recent performance improvements for k8s-dqlite.

To ensure that no other processes are causing the spike, I would recommend looking at the metrics for the k8s-dqlite process using pidstat.
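
For example, something along these lines (pidstat comes from the sysstat package; the sample interval is arbitrary):

# CPU, memory and disk I/O for the k8s-dqlite process, sampled every 5 seconds
pidstat -u -r -d -p "$(pgrep -f k8s-dqlite | head -n1)" 5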

Thanks for sharing your metrics with us!

@Azbesciak

Will do, thank you

@hulucc

hulucc commented Jan 16, 2025

v1.31 is not helping; the master still gets 100% CPU, 100% memory, and massive disk I/O.

[screenshot]

@r2DoesInc

r2DoesInc commented Jan 16, 2025 via email

@realG

realG commented Jan 16, 2025

Not sure if this type of comment has any value for this community/Canonical devs so I'll keep it short. I'm also moving on, just completed migration of all our deployments to k3s.

The sole reason for us abandoning microk8s is dqlite, specifically this cpu issue, and the unreasonably large amount of writes killing our flash storage on nodes (#3064). Our use case is single node relatively low-power edge compute, and microk8s just isn't stable enough for anything resembling production use in our experience.

Good luck to everyone sticking with microk8s, hope these issues do get fixed eventually!

@miro-balaz
Contributor

miro-balaz commented Jan 19, 2025 via email

@djjudas21

Well, maybe you haven't noticed that you can migrate microk8s to use etcd.

Can you expand on this, @miro-balaz? I was only able to find documentation for running microk8s with an external etcd, but not an internal one.

@r2DoesInc

Well, maybe you haven't noticed that you can migrate microk8s to use etcd.


We shouldn't have to go tweaking the base install to make it viable. At this point they know dqlite is a failure, and them not moving to an external - or internal - etcd implementation is enough for me to realize they don't really care about this software any longer.

@doctorpangloss

[screenshot]
