Lately, our Docker server OVH1 has been seeing frequent periods of 1-2 hours of general unresponsiveness (all containers unusable, proxy not working, no SSH access to troubleshoot, no metrics reported to Datadog).
We suspect that it's because of very high RAM usage, causing the system to swap a lot (extremely slow on a rotational disk).
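To check this suspicion the next time the server is reachable, something like the following could help. It is only a sketch: it parses `/proc/meminfo` and reports how much of the swap is in use. The helper names (`parse_meminfo`, `swap_used_fraction`, `read_meminfo`) are made up for this example, and high swap usage alone is a hint rather than proof of thrashing.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def swap_used_fraction(info):
    """Fraction of swap currently in use, or 0.0 when no swap is configured."""
    total = info.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - info.get("SwapFree", 0)) / total


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live meminfo file (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

On the server, `swap_used_fraction(read_meminfo())` close to 1.0 together with a low `MemAvailable` would be consistent with the diagnosis above.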
The mitigations we've used before are:
- wait for the server to finish its RAM-heavy workload (e.g. chrome build) and for responsiveness to come back (can take a few hours)
- just reboot the server
Looking for better solutions to this problem, I found https://superuser.com/a/1142197 which seems to suggest disabling swap (or having a smaller swap to trigger OOM-killer faster) and/or preventing some critical processes from swapping.
Maybe we could try this:
- disable swap (since we have a rotational disk, swap is too slow for our needs)
- prevent node, docker and ssh from swapping, to guarantee their continued responsiveness (this only works well if they don't use much memory to begin with)
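Before locking anything in RAM, it might be worth measuring how much of these critical processes is actually swapped out. Here is a rough sketch that reads the `VmSwap` field from `/proc/<pid>/status`. The function names are hypothetical, and the process names in the usage note (`node`, `dockerd`, `sshd`) are my guess at what the processes are called on OVH1.

```python
import os


def vmswap_kb(status_text):
    """Return the VmSwap value (kB) from /proc/<pid>/status text, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmSwap:"):
            return int(line.split()[1])
    return None  # kernel threads have no VmSwap line


def process_name(status_text):
    """Return the Name field from /proc/<pid>/status text, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            return line.split(":", 1)[1].strip()
    return None


def swapped_processes(names, proc_root="/proc"):
    """Yield (pid, name, swapped_kb) for processes whose name is in `names`."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                text = f.read()
        except OSError:
            continue  # process exited or is not readable
        name = process_name(text)
        swapped = vmswap_kb(text)
        if name in names and swapped is not None:
            yield int(entry), name, swapped
```

For example, `list(swapped_processes({"node", "dockerd", "sshd"}))` would list how many kB each of them currently has in swap.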
@ishitatsuyuki @beaufortfrancois @etiennewan do you agree with my diagnosis? Would trying the ideas above really help us? Do you have other ideas to guarantee consistent responsiveness of our service?
15:42:25 ishitatsuyuki> janx: disabling swap is a good way, although SSD doesn't help much
15:42:57 ishitatsuyuki> In our case, we probably can just remove that trivial 1GB and disable overcommit
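
One thing to keep in mind if we remove the swap and disable overcommit (`vm.overcommit_memory=2`): as far as I understand the kernel docs, the commit limit is then RAM times `vm.overcommit_ratio` (default 50) plus swap, so with no swap and the default ratio, allocations would start failing at about half of the RAM. A quick sketch of the arithmetic follows. `commit_limit_kb` is a made-up helper, and the 16 GB RAM figure in the usage note is only an example, not OVH1's actual size.

```python
def commit_limit_kb(mem_total_kb, swap_total_kb, overcommit_ratio=50, hugetlb_kb=0):
    """Approximate CommitLimit (kB) under vm.overcommit_memory=2.

    Follows the formula in the kernel's /proc/meminfo documentation:
    (RAM - huge pages) * overcommit_ratio / 100 + swap.
    """
    return (mem_total_kb - hugetlb_kb) * overcommit_ratio // 100 + swap_total_kb
```

For a hypothetical 16 GB machine, `commit_limit_kb(16_000_000, 1_048_576)` gives 9048576 kB with the current 1 GB swap, and `commit_limit_kb(16_000_000, 0, overcommit_ratio=100)` gives 16000000 kB without swap, so we would probably want to raise `vm.overcommit_ratio` at the same time.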