Episode idea: Handling Disasters #114

ckdarby · 2020-08-20T18:41:55Z

Describe a topic you want to learn (required)

Properly handling disasters that go beyond the simple case where Kubernetes restarted the pod and everything worked.

Handling Pulsar disk fill
Even though Pulsar has a disk quota check, we've had situations where our ingestion rate was high enough that we filled the disk before the check ran to protect Pulsar. The cluster itself became unresponsive and we had to destroy it.
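
As a rough illustration of how a periodic check can be outrun, here is a minimal sketch. The numbers and function names are hypothetical and are not taken from our cluster or from Pulsar's actual configuration; it only shows the arithmetic of "time until the disk is full" versus "time until the next check".

```python
def seconds_until_full(free_bytes: int, ingest_bytes_per_sec: float) -> float:
    """Time until the disk fills at a constant ingest rate (hypothetical helper)."""
    if ingest_bytes_per_sec <= 0:
        return float("inf")  # nothing is being written, so the disk never fills
    return free_bytes / ingest_bytes_per_sec


def check_can_be_outrun(free_bytes: int,
                        ingest_bytes_per_sec: float,
                        check_interval_sec: float) -> bool:
    """True if the disk could fill before the next periodic check fires."""
    return seconds_until_full(free_bytes, ingest_bytes_per_sec) < check_interval_sec


if __name__ == "__main__":
    GiB = 1024 ** 3
    MiB = 1024 ** 2
    # Assumed values for illustration only: 5 GiB free at the last check,
    # 200 MiB/s landing on the disk, and a 60 s gap between checks.
    print(seconds_until_full(5 * GiB, 200 * MiB))
    print(check_can_be_outrun(5 * GiB, 200 * MiB, 60))
```

With those assumed values the disk would fill well inside a single check interval, which matches what we seemed to hit.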

Removing a bookie from the cluster, gracefully or by force
We had a situation where we were trying to move Pulsar from one AWS AZ to another and ran into issues with EBS volumes. We tried deleting the PV/PVC from Kubernetes, thinking that when we spun up the pod again it would just be issued a new PV/PVC and the autorecovery process would take over.

We were greeted with an error message saying that it was not a new bookie and that the folder shouldn't be empty. I think we were supposed to decommission the bookie cleanly, but we didn't know about this.

After we deleted the volume, I jumped onto another bookie and tried to use the ./bookkeeper shell to list the bookie ID of the deleted bookie, and got a null pointer. We had to nuke the cluster.

To this day I still don't know what the approach is for removing a bookie, or what to do if a bookie has to be removed uncleanly.

Checking on Auto Recovery status
Aside from checking the logs in the autorecovery pod, I don't know how to properly see when replication is below the expected amount, what the process is, or which partitions are affected.

Why do you want to learn this topic? (required)

Understanding these is important for having faith in ever using a cluster again. When bad things happen, you need to understand and know how to gracefully get yourself out of the situation.

Reference (optional)

https://pulsar.apache.org/docs/en/administration-zk-bk/#decommissioning-bookies-cleanly

ckdarby added the episode-idea label Aug 20, 2020
