This repository has been archived by the owner on Jan 19, 2024. It is now read-only.

Merge pull request #33 from neo4j-contrib/rolling
Guidance on Performing Rolling Updates
moxious authored Jun 29, 2020
2 parents 3068225 + 3bc8ff8 commit 215b5f5
Showing 4 changed files with 195 additions and 0 deletions.
167 changes: 167 additions & 0 deletions tools/rolling-upgrade/README-ROLLING-UPGRADE.md
@@ -0,0 +1,167 @@
# Rolling Upgrades: Neo4j in Kubernetes

This document expands on the Neo4j Operations Manual entry
[Upgrade a Causal Cluster](https://neo4j.com/docs/operations-manual/current/upgrade/causal-cluster/) with guidance on performing rolling upgrades in Kubernetes.

**It is strongly recommended that you read all of that documentation before using this method**.

Not all relevant concepts will be described in this document. Familiarity with the page above will be assumed.

**It is recommended to perform a test upgrade on a production-like environment to get information on the duration of the downtime, if any, that may be necessary.**

## When is this needed?

* When you have a Neo4j Causal Cluster (standalone does not apply)
* When you need to upgrade to a new minor or patch version of Neo4j
* When you must keep the cluster online with both read and write capabilities throughout the upgrade process

## What This Approach Doesn't Cover

Moving between major versions of Neo4j (for example, 3.5 and 4.0).
This requires more planning, due to these factors:

* Substantial configuration changes needed on the pods between 3.5 and 4.0
* Major changes in product features which impact what clients can rely upon.
* Changes in the helm charts used, and their structure; if you're using an old
3.5 helm chart, the differences between what you're using and this repo may be substantial.
* The need for a store upgrade operation to change the format of Neo4j data on disk

Neo4j 4.0 is not backwards compatible with 3.5. Additional planning and offline upgrade is recommended.

If you are in the situation of migrating from Neo4j 3.5 -> 4.0, please consult
[the Neo4j Migration Guide](https://neo4j.com/docs/migration-guide/current/).

If a rolling upgrade won't work within your constraints, see "Alternatives to Rolling Upgrades"
at the very bottom of this document.

## High-Level Approach

1. Take a backup
2. Scale the core statefulset up, to maintain high availability.
3. Choose and apply your UpdateStrategy.
4. Patch the statefulset to apply the new Neo4j version
5. Monitor the process
6. (Optional/If Applicable) Apply the above process to the read replica StatefulSet as well.
7. Scale back down on success to the original size.

We will now describe each step, how to do it, and why.

## Take a Backup

Before doing any major system maintenance operation, it's crucial to have an up-to-date backup, ensuring that if anything goes wrong, there is a point in time to return to for the database's state.

In addition, all operations should be tested on a staging or a production-like environment as a "dry run" before attempting this on application-critical systems. Performing backups is covered in the user guide in this repository.

## Scale the Core Statefulset Up

If you normally run 3 core members in your statefulset, they provide
[high availability and a quorum](https://neo4j.com/docs/operations-manual/current/clustering/introduction/#causal-clustering-introduction-operational).

In a rolling upgrade operation, we are going to take each server *down* in its turn. While one is stopping/restarting, we're (temporarily) weakening the HA characteristics of the
cluster, reducing its ability to serve queries. To mitigate this, before doing
a rolling upgrade we scale the cluster *up*, from say 3 cores to 5. We will then roll
changes through, so at any given moment 4 of 5 cores remain available. A 5-member cluster needs 3 members for a majority, so it can lose one more member during the roll and still keep quorum.

Given a cluster deployment named "mygraph", you can scale it to 5 cores like so:

```
kubectl scale statefulsets mygraph-neo4j-core --replicas=5
```

This should immediately schedule 2 new pods with the same configuration (and the *old* version of Neo4j) to start up and join the cluster. It is recommended that you scale by 2, not by 1,
to ensure that the number of cores in the cluster is always odd.

> **REMEMBER** when new members join the cluster, before the cluster is stable, they
> need to pull current transactional state. Having members restore from a recent
> backup first is strongly recommended, to minimize the load of the
> [catch-up process](https://neo4j.com/docs/operations-manual/current/clustering-advanced/lifecycle/#causal-clustering-catchup-protocol).
> If you do not restore from backups on each pod, it will still work, but the new members may take
> substantial time to catch up, particularly if you have a lot of data or
> transactional history.

Consult [scaling statefulsets](https://kubernetes.io/docs/tasks/run-application/scale-stateful-set/#scaling-statefulsets) in the Kubernetes documentation for more information.
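
Before moving on, it helps to confirm that all 5 cores report ready. The following is a minimal sketch, assuming the `mygraph-neo4j-core` name used above. The function name `wait_for_ready_replicas` and its timeout and interval defaults are illustrative and not part of this repository.

```shell
#!/bin/bash
# Hypothetical helper: poll a StatefulSet until the wanted number of replicas is ready.
# Arguments: <statefulset> <wanted-count> [timeout-seconds] [poll-interval-seconds]
wait_for_ready_replicas() {
  local sts="$1" want="$2" timeout="${3:-600}" interval="${4:-10}"
  local waited=0 ready
  while true; do
    # .status.readyReplicas is omitted from the output while no replica is ready.
    ready=$(kubectl get statefulset "$sts" -o jsonpath='{.status.readyReplicas}')
    ready=${ready:-0}
    if [ "$ready" -ge "$want" ]; then
      echo "$sts: $ready/$want replicas ready"
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "$sts: timed out with $ready/$want replicas ready" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Example: wait_for_ready_replicas mygraph-neo4j-core 5
```

Note that a pod being ready in the Kubernetes sense only means its readiness probe passes; whether the new members have caught up is still worth checking in the Neo4j logs.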

## Choose and Apply your Update Strategy

For more details, consult the Kubernetes [Update strategies for statefulsets](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#update-strategies).

In the next steps, we're going to tell the statefulset to change the version of Neo4j that it is
running. Before we do, we need to give it a strategy for how the rolling upgrade should proceed.
There are 2 options, `RollingUpdate` and `OnDelete`. This document assumes the
`RollingUpdate` strategy, under which Kubernetes does the following:

* Starting from the end (the highest core number) it shuts a pod down, and restarts it. Because
we're going to patch the version, when that pod restarts, it will come up with the new version.
* Kubernetes waits for the pod to get to the Ready state.
* The rolling update proceeds with the next lowest index.
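
The strategy can be set with a patch, as the `rolling-upgrade.sh` script in this commit does. A minimal sketch, assuming the `mygraph-neo4j-core` statefulset name used above; the wrapper function `set_update_strategy` is illustrative only:

```shell
#!/bin/bash
# Hypothetical wrapper around the patch used in rolling-upgrade.sh.
# Arguments: <statefulset> [RollingUpdate|OnDelete]
set_update_strategy() {
  local sts="$1" strategy="${2:-RollingUpdate}"
  kubectl patch statefulset "$sts" \
    -p "{\"spec\":{\"updateStrategy\":{\"type\":\"$strategy\"}}}"
}

# Example: set_update_strategy mygraph-neo4j-core RollingUpdate
# To read back the strategy currently in effect:
#   kubectl get statefulset mygraph-neo4j-core -o jsonpath='{.spec.updateStrategy.type}'
```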

### Criticality of Readiness / Liveness Checks

Kubernetes doesn't know much about Neo4j. It needs to be provided a definition of what it
means for Neo4j to be "ready". If the new pod isn't ready before the next one rolls, you
could get into a situation where the cluster doesn't have time to stabilize before losing
more members.

The readiness checks described in the user manual cover this purpose. Review them for
adequacy and appropriateness for your deployment, and revise if necessary *before* performing
a rolling upgrade.
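
One way to review what is currently configured is to print the probes from the statefulset's pod template. This is a sketch under assumptions: it presumes the Neo4j container is the first container in the pod (index 0, the same assumption the image patch in this document makes), and the function name `show_probes` is illustrative.

```shell
#!/bin/bash
# Hypothetical helper: print the readiness and liveness probes of container 0.
# Arguments: <statefulset>
show_probes() {
  local sts="$1" probe
  for probe in readinessProbe livenessProbe; do
    echo "$probe: $(kubectl get statefulset "$sts" \
      -o jsonpath="{.spec.template.spec.containers[0].$probe}")"
  done
}

# Example: show_probes mygraph-neo4j-core
```

An empty value after a probe name means that probe is not defined on the container.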

## Patch the StatefulSet

Up until now, everything we've done was just preparation, but this step actually makes the
change. If you run this, then subject to the `UpdateStrategy` you specified above, Kubernetes
will start killing & restarting pods.

```
NEW_VERSION=neo4j:4.1.0-enterprise
kubectl patch statefulset mygraph-neo4j-core --type='json' \
-p='[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value":"'$NEW_VERSION'"}]'
```
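
Since this approach does not cover moves between major versions, a pre-flight check on the image tags can catch a mistake before any pod restarts. The sketch below is illustrative: the function names are hypothetical, and it assumes image references of the form `name:MAJOR.MINOR.PATCH-edition`, as in the example above.

```shell
#!/bin/bash
# Hypothetical helper: extract the major version from an image reference
# such as neo4j:4.1.0-enterprise. Assumes the image carries a version tag.
image_major() {
  local tag="${1##*:}"   # drop everything up to the last colon -> 4.1.0-enterprise
  echo "${tag%%.*}"      # keep the text before the first dot   -> 4
}

# Hypothetical guard: succeed only when both images share a major version.
# Arguments: <current-image> <new-image>
check_same_major() {
  local current="$1" new="$2"
  if [ "$(image_major "$current")" = "$(image_major "$new")" ]; then
    echo "ok: $current -> $new stays within one major version"
  else
    echo "refusing: $current -> $new crosses a major version" >&2
    return 1
  fi
}

# Example, reading the current image from the statefulset:
#   CURRENT=$(kubectl get statefulset mygraph-neo4j-core \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')
#   check_same_major "$CURRENT" "$NEW_VERSION" || exit 1
```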

## Monitor the Process

What we're looking for:
* Only one pod at a time stopping/restarting
* Consistent ability to maintain a connection to the cluster, with both read & write ability throughout.
* Monitor the `debug.log` file and container output to make sure the process is proceeding correctly
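
To see how far the roll has progressed, you can list the image each core pod is running. This is a minimal sketch, assuming pods are named `<deployment>-neo4j-core-<index>` as in the scripts in this commit and that Neo4j is container 0; the function name `report_core_images` is illustrative.

```shell
#!/bin/bash
# Hypothetical helper: print "<pod> <image>" for each core pod, lowest index first.
# Arguments: <deployment-name> <number-of-cores>
report_core_images() {
  local deploy="$1" count="$2" idx=0 pod
  while [ "$idx" -lt "$count" ]; do
    pod="$deploy-neo4j-core-$idx"
    echo "$pod $(kubectl get pod "$pod" -o jsonpath='{.spec.containers[0].image}')"
    idx=$((idx + 1))
  done
}

# Example: report_core_images mygraph 5
# With the RollingUpdate strategy, this command blocks until the roll finishes:
#   kubectl rollout status statefulset/mygraph-neo4j-core
```

Because the roll proceeds from the highest index down, pods at the end of the list should show the new image first.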

## (Optional) Apply the Process to the Read Replicas

Read replicas are handled in a second StatefulSet. The process of applying an update
strategy, patching the statefulset, and monitoring is the same though.

**Recommended:** make sure you have a fully upgraded and healthy set of core members
before working with read replicas. Do not roll both sets at once.
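
The image patch for the replicas has the same shape as the one for the cores. The sketch below assumes the replica statefulset is named `mygraph-neo4j-replica`, following the `$DEPLOY-neo4j-replica` naming in `rolling-upgrade.sh`, and that Neo4j is container 0; the function name `patch_image` is illustrative.

```shell
#!/bin/bash
# Hypothetical wrapper around the JSON patch shown earlier in this document.
# Arguments: <statefulset> <image>
patch_image() {
  local sts="$1" image="$2"
  kubectl patch statefulset "$sts" --type='json' \
    -p="[{\"op\": \"replace\", \"path\": \"/spec/template/spec/containers/0/image\", \"value\": \"$image\"}]"
}

# Example: patch_image mygraph-neo4j-replica neo4j:4.1.0-enterprise
```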

## Scale Back Down

If everything was completed correctly, you should end up with a 5 member cluster with
the new version, 100% online. After success, the extra members are no longer needed,
and you may scale back down to the original size of the cluster.

```
kubectl scale statefulsets mygraph-neo4j-core --replicas=3
```

## Alternatives to Rolling Upgrades

If you can tolerate a period of write-unavailability while maintaining full read-availability,
there is an alternative to a rolling upgrade. This document focuses on
how to do rolling upgrades, but as a sketch of the main alternative:

1. Configure your current cluster to be backed by a single DNS record (mycluster.company.com)
2. Take a backup of your current cluster.
3. Launch a second cluster running the new version of Neo4j (mycluster-updated). Restore
this cluster from the last backup of `mycluster`.
4. Hot swap the DNS record to point to the second cluster (mycluster-updated)
5. Shut down the original cluster (mycluster)

**Important:** if you adopt this approach, you will need a maintenance window during which you are
not accepting writes, starting at the point of backup. The period of write unavailability runs
from step 2 until the DNS change in step 4 has taken effect. If writes reach the original cluster
during this period, they will be missing from the updated cluster.

This approach should maintain read availability throughout, and it is reasonably safe; i.e.
if the new cluster fails to migrate properly or there is a data issue, this does not compromise
the availability of the running production system.
8 changes: 8 additions & 0 deletions tools/rolling-upgrade/exfil-logs.sh
@@ -0,0 +1,8 @@
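#!/bin/bash
# Copy debug.log from each core pod into a local file named <deploy>-<index>.log.

DEPLOY=r2

for idx in 0 1 2 ; do
    # No -it: allocating a TTY would add carriage returns to the redirected output.
    kubectl exec "$DEPLOY-neo4j-core-$idx" -- /bin/cat /var/lib/neo4j/logs/debug.log > "$DEPLOY-$idx.log"
done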
16 changes: 16 additions & 0 deletions tools/rolling-upgrade/rolling-upgrade.sh
@@ -0,0 +1,16 @@
#!/bin/bash
# Sets the update strategy on the core and replica statefulsets, then patches
# the core statefulset to the new Neo4j image.

export DEPLOY=r2
export NEW_VERSION=neo4j:4.1.0-enterprise

# Update Strategies: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#update-strategies
STRATEGY=RollingUpdate
# STRATEGY=OnDelete

for set in core replica ; do
    kubectl patch statefulset "$DEPLOY-neo4j-$set" -p '{"spec":{"updateStrategy":{"type":"'"$STRATEGY"'"}}}'
done

# Only the core statefulset image is patched here; roll the replicas separately.
kubectl patch statefulset "$DEPLOY-neo4j-core" --type='json' \
    -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value":"'"$NEW_VERSION"'"}]'
4 changes: 4 additions & 0 deletions user-guide/USER-GUIDE.md
@@ -179,6 +179,10 @@ table above because they are documented in the operational manual, but here are
* To adjust CSV metrics (enabled by default) use `metrics.csv.enabled` and `metrics.csv.interval`.
* To disable JMX metrics (enabled by default) use `metrics.jmx.enabled=false`.

## Rolling Upgrades

See [the documentation on the rolling upgrade process](../tools/rolling-upgrade/README-ROLLING-UPGRADE.md).

## Data Persistence

The most important data is kept in the `/data` volume attached to each of the core cluster members. These in turn