Episode idea: Handling Disasters #114

Open
ckdarby opened this issue Aug 20, 2020 · 0 comments

ckdarby commented Aug 20, 2020

Describe a topic you want to learn (required)

Properly handling disasters that go beyond the simple case where Kubernetes restarted the pod and everything worked.

Handling Pulsar disk fill
Even though Pulsar has a disk quota check, we've had situations where our ingestion rate was so high that we filled the disk before the check ran and could protect Pulsar. The cluster itself became unresponsive and we had to destroy it.
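
To illustrate the kind of protection I mean, here is a minimal sketch of an external watchdog that checks how full a volume is and decides whether producers should be throttled before the bookie's own check gets a chance to run. The function names, the 0.85 threshold, and the idea of running this outside Pulsar are my own assumptions, not anything Pulsar ships. As far as I know, the bookie's built-in check is controlled by `diskUsageThreshold` and `diskCheckInterval` in the bookie configuration.

```python
import shutil


def disk_fill_fraction(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def should_throttle(fill_fraction: float, threshold: float = 0.85) -> bool:
    """Hypothetical policy: throttle ingestion once usage reaches the threshold.

    The 0.85 default is an assumption chosen to leave headroom below the
    bookie's own limit; tune it to the ingestion rate and check interval.
    """
    return fill_fraction >= threshold


if __name__ == "__main__":
    # "." stands in for the bookie ledger directory, whose real path
    # depends on the deployment.
    fraction = disk_fill_fraction(".")
    print(f"disk usage: {fraction:.1%}, throttle: {should_throttle(fraction)}")
```

The faster the ingestion rate, the lower the threshold (or the shorter the polling interval) would need to be for a check like this to fire in time.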

Handling removing a bookie gracefully or by force from the cluster
We had a situation where we were trying to move Pulsar from one AWS AZ to another and ran into issues with EBS volumes. We tried deleting the PV/PVC from Kubernetes, thinking that when we spun up the pod again it would just be issued a new PV/PVC and the autorecovery process would take over.

We were greeted with an error message saying that it was not a new bookie and that the folder shouldn't be empty. I think we were supposed to decommission the bookie cleanly, but we didn't know about this.

After we deleted the volume, I jumped onto another bookie and tried to use the ./bookkeeper shell to list the bookie ID of the deleted bookie, and got a null pointer. We had to nuke the cluster.

To this day I still don't know the right approach for removing a bookie, or what to do if a bookie has to be removed uncleanly.
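
For context, my reading of the clean decommission procedure in the doc linked under Reference is roughly the sequence below. This is only a sketch that builds the commands as argument lists rather than running them. The helper name and the example bookie ID are hypothetical, and the relative `bin/bookkeeper` path assumes the commands are run from the Pulsar install directory on a bookie. It does not cover the forced-removal case, which is exactly the part I'd like explained.

```python
from typing import List


def decommission_plan(bookie_id: str) -> List[List[str]]:
    """Build the BookKeeper shell commands for a clean decommission, in order.

    Assumption: the target bookie process is stopped after step 1 and
    before step 2, as the linked doc describes.
    """
    bk = ["bin/bookkeeper", "shell"]
    return [
        # 1. Confirm there are no under-replicated ledgers before starting.
        bk + ["listunderreplicated"],
        # 2. With the bookie stopped, re-replicate its ledgers elsewhere.
        bk + ["decommissionbookie", "-bookieid", bookie_id],
        # 3. Verify that no ledgers still reference the decommissioned bookie.
        bk + ["listledgers", "-bookieid", bookie_id],
    ]


if __name__ == "__main__":
    # Hypothetical bookie ID in host:port form.
    for step in decommission_plan("bookie-0.example:3181"):
        print(" ".join(step))
```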

Checking on Auto Recovery status
Aside from checking the logs in the autorecovery pod, I don't know how to properly see when replication is below the expected level, what the recovery process is doing, or which partitions are affected.
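
The closest thing I'm aware of is the `listunderreplicated` command in the BookKeeper shell, which can optionally be filtered to ledgers missing a replica on a given bookie. A minimal sketch of wrapping it follows. The function names are hypothetical, the `bin/bookkeeper` path assumes the Pulsar install directory, and the output is returned as raw text because I'm not sure of its exact format. Note that this reports ledgers, not Pulsar partitions; mapping ledgers back to topics would need something extra, possibly `pulsar-admin topics stats-internal`.

```python
import subprocess
from typing import List, Optional


def underreplicated_cmd(missing_replica: Optional[str] = None) -> List[str]:
    """Build the BookKeeper shell command that lists under-replicated ledgers."""
    cmd = ["bin/bookkeeper", "shell", "listunderreplicated"]
    if missing_replica:
        # Restrict the listing to ledgers missing a replica on this bookie.
        cmd += ["-missingreplica", missing_replica]
    return cmd


def list_underreplicated(missing_replica: Optional[str] = None) -> str:
    """Run the command and return its raw stdout (format not assumed here)."""
    result = subprocess.run(
        underreplicated_cmd(missing_replica),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```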

Why do you want to learn this topic? (required)

Understanding these scenarios is important for having enough faith in a cluster to keep using it. When bad things happen, you need to understand and know how to gracefully get yourself out of the situation.

Reference (optional)

https://pulsar.apache.org/docs/en/administration-zk-bk/#decommissioning-bookies-cleanly
