We have two clusters managed by the same operator running on Kubernetes. Daily backups are set up for both; they work for one cluster but fail for the other. All backup settings are the same, and both clusters use the same S3 bucket.
More about the problem
Error message on the CRD: some of pbm-agents were lost during the backup
State of the backup is error
Checking the logs of the backup-agent container in one of the pods, I see that it writes the collections and then stops with the following error message:
2024-04-10T11:47:19.097+0000 Mux close namespace XXXXX
2024-04-10T11:47:19.097+0000 done dumping XXXX (0 documents)
2024-04-10T11:47:19.098+0000 writing XXXXX to archive on stdout
2024/04/10 11:47:21 [entrypoint] `pbm-agent` exited with code -1
2024/04/10 11:47:21 [entrypoint] restart in 5 sec
2024/04/10 11:47:26 [entrypoint] starting `pbm-agent`
We did make a change on this cluster around the time it stopped working, but it was only to increase the instance size from c5a.large to c5a.4xlarge. At first I thought the backup agent might be getting OOMKilled now that it sees plenty more resources available, so I decreased the instance size to c5a.xlarge (we no longer need the larger one), but the issue is still the same.
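One way to confirm or rule out the OOMKilled theory is to inspect the container's last termination state directly. A minimal sketch; the pod name is an assumption based on a typical Percona operator deployment, and the container is assumed to be named backup-agent:

```shell
# Print the last termination reason of the backup-agent container.
# An OOM kill shows up here as "OOMKilled"; assumed pod name "my-cluster-rs0-0".
kubectl get pod my-cluster-rs0-0 \
  -o jsonpath='{.status.containerStatuses[?(@.name=="backup-agent")].lastState.terminated.reason}'

# Pod events and restart counts also record kills and crash loops:
kubectl describe pod my-cluster-rs0-0
```

If nothing reports OOMKilled, the exit code -1 in the entrypoint log points at the agent process dying for some other reason.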
I was not able to enable debug logging on the backup-agent; maybe it's not even possible. How could I get more details on the error?
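One option worth trying is the `pbm logs` command, which can be run from inside any backup-agent container and reads the agents' log entries stored in MongoDB; it accepts a severity filter, so debug-level entries can be pulled without reconfiguring the agent. A sketch, assuming the pod name "my-cluster-rs0-0" and that the installed PBM version supports these flags:

```shell
# Fetch recent agent log entries at debug (D) severity from all agents
kubectl exec my-cluster-rs0-0 -c backup-agent -- pbm logs --tail 100 --severity D

# Show agent health and the state of recent backups
kubectl exec my-cluster-rs0-0 -c backup-agent -- pbm status
```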
Steps to reproduce
Versions
Anything else?
I also tried restarting the whole cluster, but the problem persists.
We haven't changed the resources of the other cluster, and its backups are still working fine.