snap auto-refresh breaks cluster #1022

Open
eug48 opened this issue Mar 10, 2020 · 100 comments

Comments

@eug48

eug48 commented Mar 10, 2020

This morning a close-to-production cluster fell over after snap's auto-refresh "feature" failed on 3 of 4 worker nodes. It looks like it hung at the Copy snap "microk8s" data step. microk8s could be restarted after aborting the auto-refresh, but this only worked after manually killing snapd. For a production-ready Kubernetes distribution I really think this is far from an acceptable default. Perhaps, until snapd allows disabling auto-refreshes, the microk8s scripts could recommend running sudo snap set system refresh.hold=2050-01-01T15:04:05Z or similar. A kubernetes-native integration with snapd refreshes could also be considered (e.g. a prometheus/grafana dashboard/alert) to prompt manual updates, presumably one node at a time to begin with.

Otherwise microk8s is working rather well so thank you very much.

More details about the outage:

kubectl get nodes
NAME           STATUS     ROLES    AGE   VERSION
10.aa.aa.aaa   Ready      <none>   38d   v1.17.3
10.aa.aa.aaa   NotReady   <none>   18d   v1.17.2
10.aa.aa.aaa   NotReady   <none>   38d   v1.17.2
10.aa.aa.aaa   NotReady   <none>   18d   v1.17.2
aaa-master     Ready      <none>   59d   v1.17.3

microk8s is disabled..

root@wk3:/home# snap list
Name      Version    Rev   Tracking  Publisher   Notes
core      16-2.43.3  8689  stable    canonical✓  core
kubectl   1.17.3     1424  1.17      canonical✓  classic
microk8s  v1.17.2    1176  1.17      canonical✓  disabled,classic
root@wk3:/home# snap changes microk8s
ID   Status  Spawn                Ready  Summary
20   Doing   today at 09:56 AEDT  -      Auto-refresh snap "microk8s"

Data copy appears hung

root@wk3:/home# snap tasks --last=auto-refresh
Status  Spawn                Ready                Summary
Done    today at 09:56 AEDT  today at 09:56 AEDT  Ensure prerequisites for "microk8s" are available
Done    today at 09:56 AEDT  today at 09:56 AEDT  Download snap "microk8s" (1254) from channel "1.17/stable"
Done    today at 09:56 AEDT  today at 09:56 AEDT  Fetch and check assertions for snap "microk8s" (1254)
Done    today at 09:56 AEDT  today at 09:56 AEDT  Mount snap "microk8s" (1254)
Done    today at 09:56 AEDT  today at 09:56 AEDT  Run pre-refresh hook of "microk8s" snap if present
Done    today at 09:56 AEDT  today at 09:57 AEDT  Stop snap "microk8s" services
Done    today at 09:56 AEDT  today at 09:57 AEDT  Remove aliases for snap "microk8s"
Done    today at 09:56 AEDT  today at 09:57 AEDT  Make current revision for snap "microk8s" unavailable
Doing   today at 09:56 AEDT  -                    Copy snap "microk8s" data
Do      today at 09:56 AEDT  -                    Setup snap "microk8s" (1254) security profiles
Do      today at 09:56 AEDT  -                    Make snap "microk8s" (1254) available to the system
Do      today at 09:56 AEDT  -                    Automatically connect eligible plugs and slots of snap "microk8s"
Do      today at 09:56 AEDT  -                    Set automatic aliases for snap "microk8s"
Do      today at 09:56 AEDT  -                    Setup snap "microk8s" aliases
Do      today at 09:56 AEDT  -                    Run post-refresh hook of "microk8s" snap if present
Do      today at 09:56 AEDT  -                    Start snap "microk8s" (1254) services
Do      today at 09:56 AEDT  -                    Clean up "microk8s" (1254) install
Do      today at 09:56 AEDT  -                    Run configure hook of "microk8s" snap if present
Do      today at 09:56 AEDT  -                    Run health check of "microk8s" snap
Doing   today at 09:56 AEDT  -                    Consider re-refresh of "microk8s"
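
The task table above is easy to read by eye, but the same check could be scripted when several nodes are affected. A minimal sketch, assuming `snap tasks` keeps the column layout shown above (columns separated by two or more spaces); `find_stuck_tasks` is a hypothetical helper, not part of snapd:

```python
import re

# A few rows copied from the `snap tasks --last=auto-refresh` output above.
SAMPLE = '''\
Status  Spawn                Ready                Summary
Done    today at 09:56 AEDT  today at 09:57 AEDT  Stop snap "microk8s" services
Doing   today at 09:56 AEDT  -                    Copy snap "microk8s" data
Do      today at 09:56 AEDT  -                    Setup snap "microk8s" (1254) security profiles
'''


def find_stuck_tasks(tasks_output):
    """Return the summaries of all tasks whose status is 'Doing'.

    Assumes columns are separated by at least two spaces, as in the
    transcript above; other snapd versions may format this differently.
    """
    stuck = []
    for line in tasks_output.splitlines()[1:]:  # skip the header row
        fields = re.split(r"\s{2,}", line.strip(), maxsplit=3)
        if len(fields) == 4 and fields[0] == "Doing":
            stuck.append(fields[3])
    return stuck


print(find_stuck_tasks(SAMPLE))
```

On the full output above this would also report the trailing "Consider re-refresh" task, which is in the Doing state as well, so the first entry returned is the interesting one.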

There doesn't seem to be much to copy anyway:

root@wk3 /v/l/snapd# du -sh /var/lib/snapd/ /var/snap/ /snap
527M	/var/lib/snapd/
74G	/var/snap/
2.0G	/snap

root@wk3 /s/microk8s# du -sh /snap/microk8s/*
737M	/snap/microk8s/1176
737M	/snap/microk8s/1254

root@wk3 /s/microk8s# du -sh /var/snap/microk8s/*
232K	/var/snap/microk8s/1176
74G	/var/snap/microk8s/common

Starting microk8s fails

user@wk3 /s/m/1254> sudo snap start microk8s
error: snap "microk8s" has "auto-refresh" change in progress

root@wk3:/home# snap enable microk8s
error: snap "microk8s" has "auto-refresh" change in progress

Fails to abort..

root@wk3:/home# snap abort 20
root@wk3:/home# snap changes
ID   Status  Spawn                Ready  Summary
20   Abort   today at 09:56 AEDT  -      Auto-refresh snap "microk8s"

user@wk3 /s/m/1254> sudo snap start microk8s
error: snap "microk8s" has "auto-refresh" change in progress

root@wk3:/home# snap enable microk8s
error: snap "microk8s" has "auto-refresh" change in progress

snapd service hangs when trying to stop it...

root@wk2 ~# systemctl stop snapd.service
(hangs)

have to resort to manually stopping the process

killall snapd

finally change is undone..

root@wk3:/home# snap changes
ID   Status  Spawn                Ready                Summary
20   Undone  today at 09:56 AEDT  today at 10:41 AEDT  Auto-refresh snap "microk8s"

root@wk3:/home# snap tasks --last=auto-refresh
Status  Spawn                Ready                Summary
Done    today at 09:56 AEDT  today at 10:41 AEDT  Ensure prerequisites for "microk8s" are available
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Download snap "microk8s" (1254) from channel "1.17/stable"
Done    today at 09:56 AEDT  today at 10:41 AEDT  Fetch and check assertions for snap "microk8s" (1254)
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Mount snap "microk8s" (1254)
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Run pre-refresh hook of "microk8s" snap if present
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Stop snap "microk8s" services
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Remove aliases for snap "microk8s"
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Make current revision for snap "microk8s" unavailable
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Copy snap "microk8s" data
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Setup snap "microk8s" (1254) security profiles
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Make snap "microk8s" (1254) available to the system
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Automatically connect eligible plugs and slots of snap "microk8s"
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Set automatic aliases for snap "microk8s"
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Setup snap "microk8s" aliases
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Run post-refresh hook of "microk8s" snap if present
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Start snap "microk8s" (1254) services
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Clean up "microk8s" (1254) install
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Run configure hook of "microk8s" snap if present
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Run health check of "microk8s" snap
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Consider re-refresh of "microk8s"

root@wk3:/home# snap list
Name      Version    Rev   Tracking  Publisher   Notes
core      16-2.43.3  8689  stable    canonical✓  core
kubectl   1.17.3     1424  1.17      canonical✓  classic
microk8s  v1.17.2    1176  1.17      canonical✓  classic

Nothing much in snapd logs except for a polkit error - unsure if related:

root@wk3:/home# journalctl -b -u snapd.service

...
Mar 09 06:11:34 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 09 16:11:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl", "microk8s"
Mar 09 16:11:31 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 09 19:06:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl", "microk8s"
Mar 09 19:06:31 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 10 02:51:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl", "microk8s"
Mar 10 02:51:31 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 10 09:56:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl"
Mar 10 10:12:18 wk3 snapd[15182]: daemon.go:208: polkit error: Authorization requires interaction
Mar 10 10:39:24 wk3 systemd[1]: Stopping Snappy daemon...
Mar 10 10:39:24 wk3 snapd[15182]: main.go:155: Exiting on terminated signal.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: State 'stop-sigterm' timed out. Killing.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Killing process 15182 (snapd) with signal SIGKILL.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Main process exited, code=killed, status=9/KILL
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Failed with result 'timeout'.
Mar 10 10:40:54 wk3 systemd[1]: Stopped Snappy daemon.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Triggering OnFailure= dependencies.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Found left-over process 16729 (sync) in control group while starting unit. Ignoring.
Mar 10 10:40:54 wk3 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Mar 10 10:40:54 wk3 systemd[1]: Starting Snappy daemon...
Mar 10 10:40:54 wk3 snapd[18170]: AppArmor status: apparmor is enabled and all features are available
Mar 10 10:40:54 wk3 snapd[18170]: AppArmor status: apparmor is enabled and all features are available
Mar 10 10:40:54 wk3 snapd[18170]: daemon.go:346: started snapd/2.43.3 (series 16; classic) ubuntu/18.04 (amd64) linux/4.15.0-88-generic.
Mar 10 10:40:54 wk3 snapd[18170]: daemon.go:439: adjusting startup timeout by 45s (pessimistic estimate of 30s plus 5s per snap)
Mar 10 10:40:54 wk3 systemd[1]: Started Snappy daemon.
@eug48
Author

eug48 commented Mar 10, 2020

I've left one worker node in the stuck state in case that's useful for troubleshooting, and have now come across a well-known issue in that pods running on that NotReady node are stuck as Terminating. With prometheus this is a problem because the StatefulSet will not start another instance until that pod is manually force-deleted. Just mentioning here as this is another reason why I think microk8s is not ready for auto-refreshes.

@ktsakalozos
Member

Thank you for reporting this @eug48. I opened an issue/topic with the snap team at [1].

One note here is that you cannot hold snap refreshes forever (sudo snap set system refresh.hold=2050-01-01T15:04:05Z does not work). You can defer refreshes for up to 90 days, I think. If you want to block refreshes you need to set up a snap store proxy [2].

[1] https://forum.snapcraft.io/t/snap-refresh-breaks-microk8s-cluster/15906
[2] https://docs.ubuntu.com/snap-store-proxy/en/
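
To illustrate the limit mentioned above, here is a sketch that builds the `snap set system refresh.hold=...` command with the timestamp capped at a maximum deferral. The 90-day figure is the approximate value quoted in this comment, not a verified snapd constant, and `build_hold_command` is a hypothetical helper:

```python
from datetime import datetime, timedelta, timezone

# Assumption: the "up to 90 days" figure quoted in this thread.
MAX_HOLD_DAYS = 90


def build_hold_command(requested, now=None):
    """Build argv for `snap set system refresh.hold=<timestamp>`.

    The requested time is capped at now + MAX_HOLD_DAYS and rendered
    in the same UTC format used earlier in this thread.
    """
    now = now or datetime.now(timezone.utc)
    latest = now + timedelta(days=MAX_HOLD_DAYS)
    hold_until = min(requested, latest).astimezone(timezone.utc)
    stamp = hold_until.strftime("%Y-%m-%dT%H:%M:%SZ")
    return ["snap", "set", "system", f"refresh.hold={stamp}"]


now = datetime(2020, 3, 10, tzinfo=timezone.utc)
wanted = datetime(2050, 1, 1, 15, 4, 5, tzinfo=timezone.utc)
print(build_hold_command(wanted, now))
```

The returned list could be passed to `subprocess.run` with sudo, and the command would have to be re-run periodically (for example from cron) to keep pushing the hold forward.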

@eug48
Author

eug48 commented Mar 11, 2020

Thanks very much for raising that @ktsakalozos and correcting my incorrect assumption that refresh.hold would work long-term. I was misled by the lead sentence "Use refresh.hold to delay snap refreshes until a defined time and date." in snap docs (1) and by the command running without an error/warning. Another lesson to not skim documentation.

@mvo5

mvo5 commented Mar 11, 2020

@eug48 sorry for the trouble and thanks for the report. The fact that it hangs during the copy-data phase is curious. I think you mentioned you have one node in the bad state?

It looks like the data in /var/snap/microk8s/ did not even start to get copied, i.e. the new snap's data dir was not even created. Is this correct?

@eug48
Author

eug48 commented Mar 12, 2020

@mvo5 yes, /snap/microk8s/1254 got created but in /var/snap/microk8s/ there is only 1176.

Upon further investigation I've probably found the cause. I've been trying out rook-ceph and there is still a volume mounted with it:

/dev/rbd0 on /var/snap/microk8s/common/var/lib/kubelet/pods/4dbf852e-f740-4a9f-b72d-de1b50120983/volumes/kubernetes.io~csi/pvc-149ca422-8e37-48f4-b087-98cd31d06c43/mount type ext4 (rw,relatime,stripe=1024,data=ordered)

However trying to ls some sub-directories within it hangs forever and dmesg is full of errors like libceph: mon0 10.152.183.195:6789 socket error on read. So ceph is failing to connect to its service, which is running inside the cluster, but flanneld has been stopped.

sync also hangs forever and sure enough it looks like snapd has launched a sync on which it is presumably waiting.

So this is already a complex and therefore rather brittle set-up, and I think having snap auto-refreshes added to the mix makes failure much more likely. Having an option for it to be turned off permanently so that users can upgrade manually and fix these kinds of problems would be great for production use.

@ShadowJonathan

ShadowJonathan commented May 20, 2020

For anyone reading this with the same issue: snapd currently doesn't allow disabling auto-updates indefinitely, other than through this suggestion from a forum thread that hosts a bigger umbrella discussion about the issue: https://forum.snapcraft.io/t/disabling-automatic-refresh-for-snap-from-store/707/268

TL;DR: snap download foo ; snap install foo.snap --dangerous, replace foo with application in question.

Personally, I think it would be of great help if snapd allowed more comprehensive and cooperative options, like (web)hooks on a new snap version, so that an external system could handle manual refreshes one-by-one (for example by draining a node first before refreshing it, running some canary checks on it, then doing the same one-by-one for the rest, reverting entirely on first error (+ autoemail describing failed upgrade)). That would even still be within the ethos of keeping snaps up-to-date, but the system to trust sysadmins to make it so needs to be there.
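
The one-node-at-a-time flow described above could look roughly like this. It is a sketch only: `drain`, `refresh`, `canary_ok`, `revert`, and `uncordon` are hypothetical callables the operator would supply (for example wrappers around kubectl and snap); nothing here is an existing snapd or MicroK8s interface.

```python
def rolling_refresh(nodes, drain, refresh, canary_ok, revert, uncordon):
    """Refresh nodes one by one; on the first failure revert every node
    touched so far and stop.

    Returns (refreshed_nodes, failed_node); failed_node is None on success.
    """
    done = []
    for node in nodes:
        drain(node)
        try:
            refresh(node)
            healthy = canary_ok(node)
        except Exception:
            healthy = False
        if not healthy:
            # Revert the failing node first, then earlier ones in reverse order.
            for touched in [node] + done[::-1]:
                revert(touched)
                uncordon(touched)
            return [], node
        uncordon(node)
        done.append(node)
    return done, None


# Dry run with fake callables: the third node fails its canary check.
log = []
result = rolling_refresh(
    ["wk1", "wk2", "wk3"],
    drain=lambda n: log.append(("drain", n)),
    refresh=lambda n: log.append(("refresh", n)),
    canary_ok=lambda n: n != "wk3",
    revert=lambda n: log.append(("revert", n)),
    uncordon=lambda n: log.append(("uncordon", n)),
)
print(result)
```

For brevity the sketch reverts already-refreshed nodes without draining them again; a real tool would probably drain before each revert and send the failure notification at the point where it returns the failed node.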

@skobow

skobow commented Oct 14, 2020

I also have problems with my microk8s cluster that may be related to this issue. I am experiencing regular service failures that start almost exactly at 2am, at an interval I have not yet worked out (some days). The affected service is a VerneMQ MQTT server that uses MariaDB for authentication. The result is that authentication does not work after that (unknown) event has happened. This event could relate to snap activities, as I discovered microk8s being restarted by snap around that time.
I also had this behavior with other services. My assumption is that the failure may be related to persistent network connections that fail at the k8s level without the applications noticing. After rescheduling the corresponding pods everything works fine again.

I would also like to disable auto-refresh for microk8s to investigate the problem further and to prove my assumption.
Does anyone have any other ideas?

@ktsakalozos
Member

Hi @skobow, is it possible you were following the latest/edge channel? What do you get from snap list | grep microk8s?

@skobow

skobow commented Oct 14, 2020

Hi @ktsakalozos, I am using 1.19/stable channel which currently installs v1.19.2

@ktsakalozos
Member

Nothing got released on 1.19/stable. You could attach the microk8s.inspect tarball so we can take a look.

@skobow

skobow commented Oct 14, 2020

Find the tarball attached.
The reason for my assumption is the output of snap changes that shows:

ID   Status  Spawn                Ready                Summary
166  Done    today at 02:51 CEST  today at 02:51 CEST  Running service command for snap "microk8s"
167  Done    today at 02:51 CEST  today at 02:52 CEST  Running service command for snap "microk8s"
168  Done    today at 02:52 CEST  today at 02:52 CEST  Running service command for snap "microk8s"
169  Done    today at 02:52 CEST  today at 02:52 CEST  Running service command for snap "microk8s"
170  Done    today at 02:52 CEST  today at 02:52 CEST  Running service command for snap "microk8s"
171  Done    today at 04:35 CEST  today at 04:35 CEST  Auto-refresh snaps "core", "snapd"
172  Done    today at 04:51 CEST  today at 04:51 CEST  Running service command for snap "microk8s"
173  Done    today at 04:51 CEST  today at 04:52 CEST  Running service command for snap "microk8s"
174  Done    today at 04:52 CEST  today at 04:52 CEST  Running service command for snap "microk8s"
175  Done    today at 04:52 CEST  today at 04:52 CEST  Running service command for snap "microk8s"
176  Done    today at 04:52 CEST  today at 04:52 CEST  Running service command for snap "microk8s"
177  Done    today at 05:08 CEST  today at 05:08 CEST  Running service command for snap "microk8s"
178  Done    today at 05:08 CEST  today at 05:08 CEST  Running service command for snap "microk8s"
179  Done    today at 05:08 CEST  today at 05:08 CEST  Running service command for snap "microk8s"
180  Done    today at 05:08 CEST  today at 05:08 CEST  Running service command for snap "microk8s"
181  Done    today at 05:08 CEST  today at 05:08 CEST  Running service command for snap "microk8s"
182  Error   today at 10:27 CEST  today at 10:27 CEST  Change configuration of "core" snap
183  Done    today at 10:29 CEST  today at 10:29 CEST  Change configuration of "core" snap

snap change 166 then shows:

Status  Spawn                Ready                Summary
Done    today at 02:51 CEST  today at 02:51 CEST  restart of [microk8s.daemon-etcd]

The timestamps match when the service stops working. Even though there might not be any updates, something happens anyway. Could that be related?
inspection-report-20201014_103630.tar.gz

@skobow

skobow commented Oct 15, 2020

Hi! Fyi: exactly the same happened tonight at the same time. @ktsakalozos what are these service commands and why are they run?

@ktsakalozos
Member

I am not sure why snapd decides to restart MicroK8s. Could you attach the snapd log journalctl -u snapd -n 3000. If we do not see anything there we may need to ask over at https://forum.snapcraft.io/

@skobow

skobow commented Oct 16, 2020

@ktsakalozos find the log attached!

snapd.log

@skobow

skobow commented Oct 22, 2020

@ktsakalozos Any news on this topic?

@ktsakalozos
Member

@skobow in the snapd.log I see these failures:

Oct 16 13:04:32 k8s-master snapd[803]: storehelpers.go:551: cannot refresh: snap has no updates available: "core", "core18", "lxd", "microk8s", "snapd"
Oct 16 13:04:32 k8s-master snapd[803]: stateengine.go:150: state ensure error: cannot sections: got unexpected HTTP status code 403 via GET to "https://api.snapcraft.io/api/v1/snaps/sections"

If you do not know what might be causing this we will go to https://forum.snapcraft.io/ and ask there.

@pw10n

pw10n commented Mar 17, 2021

Hello. I believe I'm running into the same issue here as well.

$ snap changes
ID   Status  Spawn               Ready  Summary
198  Doing   today at 16:59 UTC  -      Auto-refresh snap "microk8s"
$ snap tasks 198
Status  Spawn               Ready               Summary
Done    today at 16:59 UTC  today at 16:59 UTC  Ensure prerequisites for "microk8s" are available
Done    today at 16:59 UTC  today at 17:04 UTC  Download snap "microk8s" (2074) from channel "1.20/stable"
Done    today at 16:59 UTC  today at 17:04 UTC  Fetch and check assertions for snap "microk8s" (2074)
Done    today at 16:59 UTC  today at 17:04 UTC  Mount snap "microk8s" (2074)
Done    today at 16:59 UTC  today at 17:04 UTC  Run pre-refresh hook of "microk8s" snap if present
Done    today at 16:59 UTC  today at 17:06 UTC  Stop snap "microk8s" services
Done    today at 16:59 UTC  today at 17:06 UTC  Remove aliases for snap "microk8s"
Done    today at 16:59 UTC  today at 17:07 UTC  Make current revision for snap "microk8s" unavailable
Doing   today at 16:59 UTC  -                   Copy snap "microk8s" data
Do      today at 16:59 UTC  -                   Setup snap "microk8s" (2074) security profiles
Do      today at 16:59 UTC  -                   Make snap "microk8s" (2074) available to the system
Do      today at 16:59 UTC  -                   Automatically connect eligible plugs and slots of snap "microk8s"
Do      today at 16:59 UTC  -                   Set automatic aliases for snap "microk8s"
Do      today at 16:59 UTC  -                   Setup snap "microk8s" aliases
Do      today at 16:59 UTC  -                   Run post-refresh hook of "microk8s" snap if present
Do      today at 16:59 UTC  -                   Start snap "microk8s" (2074) services
Do      today at 16:59 UTC  -                   Remove data for snap "microk8s" (1910)
Do      today at 16:59 UTC  -                   Remove snap "microk8s" (1910) from the system
Do      today at 16:59 UTC  -                   Clean up "microk8s" (2074) install
Do      today at 16:59 UTC  -                   Run configure hook of "microk8s" snap if present
Do      today at 16:59 UTC  -                   Run health check of "microk8s" snap
Doing   today at 16:59 UTC  -                   Consider re-refresh of "microk8s"

It appears whenever snap decides to auto-refresh, microk8s hangs on the copy step and never completes (taking the cluster down).

The only things that seemed to be effective were rebooting the machine or, as I recently discovered, aborting the auto-refresh:

$ sudo snap abort 198

... wait ...

$ sudo snap start microk8s
error: snap "microk8s" has "auto-refresh" change in progress

$ snap changes
ID   Status  Spawn               Ready  Summary
198  Abort   today at 16:59 UTC  -      Auto-refresh snap "microk8s"

$ snap list
Name      Version   Rev    Tracking         Publisher       Notes
core      16-2.49   10859  latest/stable    canonical✓      core
core18    20210128  1988   latest/stable    canonical✓      base
docker    19.03.13  796    latest/stable    canonical✓      -
helm3     3.1.2     5      latest/stable    terraform-snap  -
lxd       4.12      19766  latest/stable/…  canonical✓      -
microk8s  v1.20.2   2035   1.20/stable      canonical✓      disabled,classic
snapd     2.49      11107  latest/stable    canonical✓      snapd

$ sudo killall snapd

However, eventually the auto-refresh happens again...

~$ snap changes
ID   Status  Spawn               Ready               Summary
198  Undone  today at 16:59 UTC  today at 19:09 UTC  Auto-refresh snap "microk8s"
199  Doing   today at 19:14 UTC  -                   Auto-refresh snap "microk8s"

Reading this thread gave me the idea to look for unusual lingering mounts. While I wasn't able to find references to libceph, I did see that the nfs mounts from the nfs provisioner running in my cluster were erroring out.

dmesg
[2465660.413978] nfs: server 10.152.183.19 not responding, timed out
[2465667.326132] nfs: server 10.152.183.19 not responding, timed out
[2465677.686357] nfs: server 10.152.183.19 not responding, timed out

mounts (partial)
10.152.183.19:/export/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/data on /var/snap/microk8s/common/var/lib/kubelet/pods/0af536f4-3349-4860-a73c-1c62ce6fede7/volume-subpaths/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/unifi/0 type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.5.0.10,local_lock=none,addr=10.152.183.19)
10.152.183.19:/export/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/log on /var/snap/microk8s/common/var/lib/kubelet/pods/0af536f4-3349-4860-a73c-1c62ce6fede7/volume-subpaths/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/unifi/1 type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.5.0.10,local_lock=none,addr=10.152.183.19)
10.152.183.19:/export/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/cert on /var/snap/microk8s/common/var/lib/kubelet/pods/0af536f4-3349-4860-a73c-1c62ce6fede7/volume-subpaths/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/unifi/2 type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.5.0.10,local_lock=none,addr=10.152.183.19)
10.152.183.19:/export/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/init.d on /var/snap/microk8s/common/var/lib/kubelet/pods/0af536f4-3349-4860-a73c-1c62ce6fede7/volume-subpaths/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/unifi/3 type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.5.0.10,local_lock=none,addr=10.152.183.19)

Not sure if this is the actual cause but thought I'd share in case it was helpful to anyone. Did anyone find a resolution to this problem?
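
Both reports in this thread involve a network-backed volume still mounted under /var/snap/microk8s/common while the cluster network was down. A sketch for listing such mounts from `mount` output, assuming the `<source> on <target> type <fstype> (<options>)` line format shown above. The list of filesystem types and the `/dev/rbd` prefix check are illustrative choices, not an exhaustive rule, and the sample paths are shortened placeholders:

```python
import re

MOUNT_LINE = re.compile(
    r"^(?P<source>\S+) on (?P<target>\S+) type (?P<fstype>\S+) \((?P<options>[^)]*)\)?$"
)
KUBELET_PREFIX = "/var/snap/microk8s/common/var/lib/kubelet/"
NETWORK_FSTYPES = ("nfs", "nfs4", "ceph")  # assumption: extend as needed

# Shortened stand-ins for the mount lines quoted in this thread.
SAMPLE = """\
/dev/rbd0 on /var/snap/microk8s/common/var/lib/kubelet/pods/POD/volumes/kubernetes.io~csi/PVC/mount type ext4 (rw,relatime,stripe=1024,data=ordered)
10.152.183.19:/export/PVC/data on /var/snap/microk8s/common/var/lib/kubelet/pods/POD/volume-subpaths/PVC/unifi/0 type nfs4 (rw,relatime,vers=4.1,hard,proto=tcp)
/dev/sda1 on / type ext4 (rw,relatime)
"""


def network_backed_mounts(mount_output):
    """Return (source, fstype, is_hard_mount) for kubelet mounts that
    depend on the network (NFS-like types or Ceph RBD block devices)."""
    found = []
    for line in mount_output.splitlines():
        m = MOUNT_LINE.match(line.strip())
        if not m or not m["target"].startswith(KUBELET_PREFIX):
            continue
        if m["fstype"] in NETWORK_FSTYPES or m["source"].startswith("/dev/rbd"):
            is_hard = "hard" in m["options"].split(",")
            found.append((m["source"], m["fstype"], is_hard))
    return found


for entry in network_backed_mounts(SAMPLE):
    print(entry)
```

Running something like this before a manual refresh would at least show whether any such mounts are present; whether they are the actual cause of the hang is still only a hypothesis in this thread.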

@lpellegr

lpellegr commented Mar 17, 2021

I experienced the same issue. Pods remain in terminating states. New ones are created but fail to run due to a connectivity issue that I can't reproduce outside of the pod. Removing deployments and services before recreating them does not help. I had to reinstall the whole cluster after resetting all nodes.

Name      Version   Rev    Tracking       Publisher   Notes
core      16-2.49   10859  latest/stable  canonical✓  core
core18    20210128  1988   latest/stable  canonical✓  base
lxd       4.0.5     19188  4.0/stable/…   canonical✓  -
microk8s  v1.20.4   2074   latest/stable  canonical✓  classic
snapd     2.49      11107  latest/stable  canonical✓  snapd
Mar 17 11:30:12 api-bhs snapd[47810]: stateengine.go:150: state ensure error: cannot sections: got unexpected HTTP status code 403 via GET to "https://api.sna>
Mar 17 11:30:24 api-bhs snapd[47810]: main.go:155: Exiting on terminated signal.

@t-o-o-m

t-o-o-m commented Apr 17, 2021

Came across a probably related problem - snap refreshed microk8s and took the cluster down - all pods are in "sandbox changed" state then. Same goes for node reboots, btw.
Microk8s restart usually helps, but sometimes I have to start from scratch to get it to work again.

  1. one can't decide when updates should be performed (I know about the possibility to postpone or schedule them periodically, but I'd rather set a single point in time for potential downtime). Might be possible with --devmode, but then I'd have to switch to a dev/edge channel
  2. a simple update (and a reboot in any case) takes the cluster down, with manual work needed to get it up and running again

It would be great to tackle one of those two. Happy to provide any log, as I can easily reproduce the sandbox-issue.

I'd be careful with "reliable production-ready Kubernetes distribution" (from https://ubuntu.com/blog/introduction-to-microk8s-part-1-2) until then :)

@andrew-landsverk-win

I also came across this issue today, also running rook-ceph in a 3 node cluster. The Rook/Ceph cluster works perfectly fine otherwise.

@lfdominguez

Great, thanks snap autorefresh, you have crashed my entire cluster with this:
/snap/microk8s/2338/bin/dqlite: symbol lookup error: /snap/microk8s/2338/bin/dqlite: undefined symbol: sqlite3_system_errno

@pw10n

pw10n commented Jul 29, 2021

In case it's helpful to anyone here, I was able to permanently disable the auto-refresh by disabling the snapd service.

sudo systemctl stop snapd.service then sudo systemctl mask snapd.service to disable
sudo systemctl unmask snapd.service and sudo systemctl start snapd.service to reenable

Since doing this, I haven't had any stability issues with my cluster at all. This is my temporary fix until I have time to migrate my cluster to k3s or something that actually works.

@ShadowJonathan

None of these are permanent fixes; snap will never implement disabling auto-updates, and this will always be a problem.

I just suggest not touching microk8s at all, only use it for development purposes, and ban it for all production purposes.

@ktsakalozos
Member

The Kubernetes project ships a few releases every month [1]. These releases include security, bug and regression fixes. Every production grade Kubernetes distribution should have a mechanism to release such fixes even before they are released from upstream. For MicroK8s this mechanism is the snaps. Snaps allow us to keep your Kubernetes infrastructure up to date not only with fresh Kubernetes binaries but also update/fix integrations with underlying system and the Kubernetes ecosystem.

If you do not want to take the risk of automated refreshes you have at least two options:

  • use a snapstore proxy [2] to test revisions as they come and block those you do not want,
  • set the snap refresh window to a time that is convenient for you [3]. This will not completely stop refreshes but will allow you to postpone them.

[1] https://github.com/kubernetes/kubernetes/releases
[2] https://docs.ubuntu.com/snap-store-proxy/en/
[3] https://snapcraft.io/docs/keeping-snaps-up-to-date
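
As an illustration of the refresh-window option, here is a sketch that builds a `snap set system refresh.timer=...` command for a single weekly window. It assumes the `<weekday>,<start>-<end>` value syntax described in the snap documentation linked above; `build_timer_command` is a hypothetical helper, and the chosen window is only an example:

```python
VALID_DAYS = ("mon", "tue", "wed", "thu", "fri", "sat", "sun")


def build_timer_command(day, start, end):
    """Build argv for `snap set system refresh.timer=<day>,<start>-<end>`.

    `start` and `end` are "HH:MM" strings; basic range checks only.
    """
    if day not in VALID_DAYS:
        raise ValueError(f"unknown weekday: {day!r}")
    for value in (start, end):
        hours, minutes = value.split(":")
        if not (0 <= int(hours) <= 23 and 0 <= int(minutes) <= 59):
            raise ValueError(f"bad time: {value!r}")
    return ["snap", "set", "system", f"refresh.timer={day},{start}-{end}"]


print(build_timer_command("sat", "02:00", "04:00"))
```

As noted above, this only narrows when refreshes may happen; it does not stop them.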

@ShadowJonathan

@ktsakalozos the point of "security" is pretty moot if it breaks everything while updating it, it's defeating its own purpose.

@ktsakalozos
Member

@ShadowJonathan, I am not sure why you mention only security and in quotes. Any update that breaks the cluster is defeating its own purpose.

For anyone that wants to contribute back to this project we would be grateful if you could run non-production clusters with the candidate channels of the track you follow, for example 1.20/candidate. Normally, the stable channel gets updated with what is on candidate after about a week. Having candidate releases well tested in a large variety of setups would be great.

@ShadowJonathan

> I am not sure why you mention only security and in quotes.

Your point was that security is paramount and absolute, and that it should be the excuse that makes this problem okay. It's not; it's an excuse that only exacerbates this problem, and that of snap for servers in general.

Snaps are fine for user apps; those can deal with being restarted, crashing, shutting down, again and again. Server apps need more delicacy, planning, and oversight. No admin/operator wants the developer to control when, how, and why something updates. They want complete control over their systems, and the snaps auto-updating feature is a complete insult to that.

> Any update that breaks the cluster is defeating its own purpose.

I'm glad you agree, then? I'd rather have a cluster which is outdated and vulnerable, and possibly get hacked, if it's about my own oversight and my own fault (at least then I can tune it to my own schedule and my own system). With auto-update, and even the update window, that control is taken away from me, as now I have to scramble to make sure the eventual update will not fuck with my system, and then do it manually, safely, and in a controlled way to make sure it does not fuck over the data (which it did for me: 1.2TB of scraping data, all corrupted because docker didn't want to close within 30 seconds, after which it got SIGKILLed).

As a sysadmin, I control a developer's software, when, where, and how. The developer doesn't control my system, unless I tell it to. And even then, only on my own conditions.

Snaps violated this principle, and that's why I'm incredibly displeased with them.

@lfdominguez

> @ktsakalozos the point of "security" is pretty moot if it breaks everything while updating it, it's defeating its own purpose.

I think the same... but they say it is production ready... really?

@ShadowJonathan

@lfdominguez branding

@lfdominguez

But if microk8s moved away from snap, or used another method such as a self-contained executable (like k3s or k0s), I think that would be better: you would get away from the insane snap auto-refresh.

@evilnick
Contributor

@a-hahn Hi. With respect to the documentation: no matter how many warnings or comments are added to things, sadly many people don't read them. There is no hidden agenda; this is simply a matter of responsibility. If you buy a car and want to disconnect all the warning lights, that is also up to you, but you wouldn't expect to find the instructions to do so printed in the owner's manual. Please add your method as a comment on discourse if you like.

@ShadowJonathan

Please add your method as a comment on discourse if you like.

That'll be nothing more than black-holing the concern.

@evilnick
Contributor

Please add your method as a comment on discourse if you like.

That'll be nothing more than black-holing the concern.

What is the actual concern?

@a-hahn

a-hahn commented Feb 6, 2022

@ktsakalozos So much time spent on arguing. This simple one-liner, snap download microk8s && snap install microk8s --dangerous, is a well-known alternative install for the interns at Canonical, as this blog post proves. It would have saved lots of hours of troubleshooting for the people looking for help on this issue, including @jobh @ShadowJonathan @Dart-Alex @joes @skobow @eug48. Thanks again @ktsakalozos @evilnick for your 'helpful' educational support on this topic.

@evildmp

evildmp commented Feb 6, 2022

@a-hahn Hello Andreas, thank you for your engagement on this, and for making a contribution to the documentation for this project.

I'm a Director of Engineering at Canonical, and my responsibility is documentation - everything to do with documentation and the way we produce it, for all Canonical products and projects (if you'd like to learn more about what my work is, I invite you to read The future of documentation at Canonical).

Needless to say, I think that documentation is extremely important. I think that's generally true, but particularly so for an open-source software organisation like Canonical, and for open-source software projects themselves. That's because documentation is part of the contract with an open-source community. Documentation is one of the most important ways of sealing the relationship between a product and its community.

Community members understandably feel strongly about documentation. You put time and effort into making an improvement to the MicroK8s documentation, and it was declined - for reasons that you don't agree with - by one of the maintainers. I can see that you are angry and frustrated that the contribution you made was reversed, and your reasoning about it also not accepted.

Can I ask, are you upset because you think @evilnick made a wrong technical decision about what should or should not be documented? Or would you say you feel more upset because you need to feel a different relationship to exist between you (as a community member), and the project, and Canonical?

Unfortunately I am not in a position to comment on the technical aspect of this. I am not an expert in MicroK8s security. However, one of the things I would like to achieve in my work is improved community engagement through documentation and improved experiences for documentation contributors, so I'd be very happy if you would like to talk more about that.

Either way, one thing I would like to say in the meantime is that I do know that Nick is a very community-minded person. Before making this decision, we discussed it together (nor was I the only person it was raised with), because it's a hard thing to do to turn down someone's contribution. It was not done lightly. So it's also a hard thing to be in that position, and to receive angry criticism for it, or to be accused of not respecting the code of conduct. Personally, I would be upset by that myself.

@ShadowJonathan

ShadowJonathan commented Feb 6, 2022

@evildmp if I were to guess, a large share of the frustration is not personal, but rather aimed at snap's packaging in and of itself, which is a main root cause of this problem.

The solution offered is a hack, an explicit circumvention of the problem. It does not offer a satisfying resolution, nor does it lighten the burden that the problem caused; it only cripples the effectiveness of the platform. A better solution is available from snap's side, but they do not wish to give developers those tools, for political and ideological reasons that explicitly tear control away from users in a paternalistic fashion, in the sense that the developers would like to think they know their users' systems better than the users do. (Which, imo, may be true for normal application users, but becomes far less true for developers, and is very much not true for system administrators; snaps take the same attitude towards all of them.)

I don't want to perpetuate the cycle here, but at the very least know this: it wasn't personal. The frustration is high, and this issue is just one part of the knot where the pressure became too high.

@vazir

vazir commented Feb 7, 2022

The approach of the MicroK8s team is not professional. The sole purpose of Kubernetes is to provide a platform for running fault-tolerant services, and that is completely ruined by packaging it with the totally unsuitable and fundamentally broken "snap" tool. It's hard to say more. I personally got rid of all microk8s on my servers and migrated to k3s. Next is to replace Ubuntu with plain Debian again.

@evildmp

evildmp commented Feb 7, 2022

@vazir

The approach of the MicroK8s team is not professional.

This is not an issue, or criticism of code, or even criticism of someone's behaviour. It is an abusive remark. There is nothing constructive that anyone can do in response to this comment.

Sometimes people get angry about open-source software projects they are participating in, which is OK. It's also OK to express anger sometimes.

It is not OK, and it is explicitly against the Ubuntu code of conduct adopted by the MicroK8s project, to make abusive comments.

I politely request that you delete that part of your comment. Thank you.

@a-hahn

a-hahn commented Feb 8, 2022

@evildmp Do you really think you gave a satisfying, professional response to my statement? You have missed a chance to clarify:

'As a Director of Engineering at Canonical, I'd like to assure you that you can expect professional and complete documentation for our products. This also includes controversial fixes, or alternative or internal usage instructions for our products from our staff or our users, even if we don't recommend those to our customers. We encourage and enforce transparency. We are committed to leaving the choice to our users and customers to use and deploy our products in the way that best suits their needs, even if we don't agree with it or consider it harmful. Of course we will flag those with a big fat warning label.'

As long as you don't say that, the best answer to this issue still comes from your former employee [see: Bypassing store refresh]. As long as interns don't have the courage, or the company's permission, to give some really helpful background information, I'm afraid it's not cynical to hope that more people leave the company so they can finally make up their own minds. That would be a really sad conclusion for the friends of Canonical, Ubuntu and MicroK8s.

@lfdominguez

I think some of the last comments are off-topic relative to the title of the issue... We have a problem: snap's auto-refresh system breaks microk8s, and we need an option to disable auto-refresh... only that...

@ShadowJonathan

@lfdominguez you cannot disable autorefresh. You can download and install microk8s manually, but then updating it is manual as well.
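
For anyone who wants to try the manual route, here is a rough sketch of what it could look like. It is not an official procedure. The helper names `run` and `install_unmanaged`, the `DRY_RUN` switch, and the example channel are all made up for illustration; the `snap download` and `snap install --dangerous --classic` commands themselves are regular snap CLI calls. The sketch assumes the current directory holds only the one downloaded MicroK8s file, and it defaults to printing the commands rather than running them.

```shell
# Sketch only: install MicroK8s from a downloaded file so the store does not
# refresh it. DRY_RUN=1 (the default here) prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# install_unmanaged CHANNEL
# CHANNEL is an example value chosen by the operator, e.g. 1.29/stable.
install_unmanaged() {
  channel="$1"
  # Writes microk8s_<revision>.snap (plus an .assert file) to the current directory.
  run snap download microk8s --channel="$channel"
  # Assumption: only one downloaded microk8s_*.snap file is present here.
  set -- ./microk8s_*.snap
  # --dangerous installs the local file without store assertions;
  # --classic is needed because MicroK8s uses classic confinement.
  run sudo snap install "$1" --dangerous --classic
}
```

Calling `install_unmanaged 1.29/stable` prints the two commands; with `DRY_RUN=0` it runs them. As noted above, every later update then has to be done by hand the same way, node by node.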

@lfdominguez

lfdominguez commented Feb 11, 2022

Yes, I understand that... so if it is Canonical and snap policy not to change that (I really think it is a wrong idea not to let the user disable it)... why then waste time on this issue??? The snap team is not listening to users. Without a workaround like the ones discussed in this issue, if you follow the official MicroK8s documentation and install it on a production cluster, everything will get messed up when auto-refresh changes something... From my sysadmin point of view, that is faaaaaaaaaar from a production-grade system.

@vazir

vazir commented Feb 12, 2022

@evildmp,
It has nothing to do with open source etc. I am not criticizing the code, but the packaging approach itself, which breaks the WHOLE idea of having microk8s, and I am using plain words for that. Once again I will try to explain what is wrong; it is of course useless, but anyway. Look at the meaning of the messages from other users in this thread: everyone says the same thing in different words. Everyone says "It's broken, it's unacceptable, change it, do something". If there is no other way to distribute microk8s across platforms, just state in the docs, up front, very visibly and understandably: "DO NOT USE IT FOR PRODUCTION WHEN INSTALLED VIA SNAP BECAUSE YOUR SERVICE MAY BREAK RANDOMLY, UNEXPECTEDLY". That way, no one will have any issues. We will just look for another k8s from the start and use microk8s (probably, not really) for dev only, and we will not experience problems randomly and unexpectedly in production because, according to the front page, it is rock-solid reliable. Stating somewhere deep, deep in the docs "install manually if you do not like snap" changes nothing; it just makes users even more disappointed.

@vazir
...

@lpellegr

Said differently, why is there so much reluctance to improve the user experience with a feature so many people request? Does it have something to do with data collected from auto-updates?

@ShadowJonathan

@vazir I'd also like to note that the current Ubuntu server installer, when you reach the "additional software" screen, installs these packages through snap.

So, if you've installed docker on that screen, it'll go ahead and install that in a snap container, including any other software, such as microk8s.

@vazir

vazir commented Feb 13, 2022

@ShadowJonathan - this mindless practice is effectively and rapidly pushing Ubuntu as a distribution off servers. I do not believe they fail to understand that. So there is only one conclusion: someone is slowly killing Ubuntu from the inside. The Nokia way.

@stale

stale bot commented Jan 9, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the inactive label Jan 9, 2023
@neoaggelos
Contributor

A note that is probably useful for people that have commented in the issue:

Starting with snapd 2.58, it is possible to indefinitely hold MicroK8s (and any other installed snap packages) from updating with the following command:

sudo snap refresh --hold microk8s

See also the "Control updates" section in the snapd documentation
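
With the hold in place, updates become something you schedule yourself. The following is only a sketch of how a one-node-at-a-time update could be scripted, not an official procedure. The helper names `run`, `hold_microk8s` and `upgrade_node`, the `DRY_RUN` switch, and the example node and channel values are all hypothetical; the underlying `snap refresh`, `microk8s status --wait-ready` and `kubectl drain`/`uncordon` commands are standard. It assumes that naming the snap explicitly in `snap refresh microk8s` still refreshes it while the hold is active; if your snapd version behaves differently, run `sudo snap refresh --unhold microk8s` first and re-apply the hold afterwards.

```shell
# Sketch only: hold automatic refreshes, then update one node at a time by hand.
# DRY_RUN=1 (the default here) prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hold MicroK8s indefinitely (requires snapd 2.58 or later).
hold_microk8s() {
  run sudo snap refresh --hold microk8s
}

# upgrade_node NODE CHANNEL
# Run on the node being updated. NODE and CHANNEL are example values.
upgrade_node() {
  node="$1"
  channel="$2"
  # Move workloads off the node; workloads using emptyDir volumes may need
  # additional drain flags.
  run microk8s kubectl drain "$node" --ignore-daemonsets
  # Explicit refresh of the named snap (assumed to proceed despite the hold).
  run sudo snap refresh microk8s --channel="$channel"
  # Block until the local MicroK8s services report ready again.
  run microk8s status --wait-ready
  # Allow workloads to be scheduled on the node again.
  run microk8s kubectl uncordon "$node"
}
```

For example, `hold_microk8s` once per node, then `upgrade_node wk3 1.29/stable` on each node in turn, checking `kubectl get nodes` before moving on to the next one.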

Not stale

@vazir

vazir commented Jan 9, 2023

I can only say "finally"... But I dropped microk8s and switched to k3s, and there is no UBUNTU anywhere any more, on servers or desktops. Back to Debian. It is hard to express how ANNOYING it is to get those "Pending update of 'BLA BLA SNAP' close the app to avoid disruption" messages. Ubuntu is trying to mimic the worst parts of the damned Windows...

@stale stale bot removed the inactive label Jan 9, 2023
@yuryskaletskiy

I had already switched to Rancher/k3s. No more being bothered by unexpected auto-updates.

@lfdominguez

Changed to Rancher RKE2 (very stable); no more snap, I hope.


stale bot commented Dec 5, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the inactive label Dec 5, 2023
@ShadowJonathan

This isn't stale, this is still a problem

@stale stale bot removed the inactive label Dec 6, 2023
@JWilson45

this just happened to me, failed on 2/3 nodes

@fmiqbal

fmiqbal commented Dec 20, 2024

Gosh darn it, I now really suspect that snap auto-refresh is why my cluster is somewhat unstable at specific intervals. Today snap auto-refreshed to 1.29.11, and 3 of 9 worker nodes didn't restart properly.

The fix is easy: stop and start microk8s. Why!?

Moreover, because we use Longhorn, when the API server reports nodes as "NotReady", the volumes are rebalanced automatically, which sends a lot of requests to the API server and makes it hard to query (I hate you, dqlite). For some time we used three master nodes with 16 GB each, and when this happens (the update, I suppose), the cluster can go down completely: nothing is running, dqlite is choked, and the kine count (sudo -E /snap/microk8s/current/bin/dqlite -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml k8s "select count(*) from kine") grows fast, in a vicious feedback loop that affects everything in the cluster. In a matter of hours it can bring the whole cluster down, and we need to restore from a "cold boot".

This time around we suspected 16 GB was not enough and increased it to 32 GB. At least the kine count is now manageable (it still grows over time, but much more slowly than before), and we restored (literally just restarted the nodes) in time before everything went to sh*t for the nth time over a 6-month period.
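
In case it helps anyone watching for the same symptom, the kine count query above can be wrapped in a small script and sampled over time. This is only a sketch: the helper names `run`, `kine_count` and `watch_kine`, the `BACKEND` variable and the `DRY_RUN` switch are made up for illustration, while the dqlite command line is the one quoted in this comment. It defaults to printing the command rather than running it.

```shell
# Sketch only: sample the number of rows in the kine table.
# DRY_RUN=1 (the default here) prints the command instead of running it.
DRY_RUN="${DRY_RUN:-1}"
BACKEND=/var/snap/microk8s/current/var/kubernetes/backend

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Run the row-count query against the MicroK8s dqlite datastore.
kine_count() {
  run sudo -E /snap/microk8s/current/bin/dqlite \
    -c "$BACKEND/cluster.crt" \
    -k "$BACKEND/cluster.key" \
    -s "file://$BACKEND/cluster.yaml" \
    k8s "select count(*) from kine"
}

# watch_kine INTERVAL_SECONDS SAMPLES
# Prints a UTC timestamp, then the query result, once per sample.
watch_kine() {
  i=0
  while [ "$i" -lt "$2" ]; do
    date -u +%Y-%m-%dT%H:%M:%SZ
    kine_count
    i=$((i + 1))
    if [ "$i" -lt "$2" ]; then
      sleep "$1"
    fi
  done
}
```

For example, `DRY_RUN=0 watch_kine 300 12` would take one sample every five minutes for an hour; a count that keeps climbing between samples matches the behaviour described above.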
