Commit
Merge pull request #35 from timebertt/docs
Update, rework, and extend docs
timebertt authored Nov 27, 2023
2 parents 7ef6eec + 9814e03 commit b00eb19
Showing 32 changed files with 1,738 additions and 436 deletions.
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
docs/assets/*.jpg filter=lfs diff=lfs merge=lfs -text
2 changes: 1 addition & 1 deletion .github/workflows/controller-sharding.yaml
@@ -133,7 +133,7 @@ jobs:
# prepare .ko.yaml to inject build settings into all images
entrypoints=(
./cmd/sharder
./hack/cmd/shard
./cmd/shard
./hack/cmd/janitor
)
4 changes: 2 additions & 2 deletions .run/shard (kind).run.xml
@@ -2,12 +2,12 @@
<configuration default="false" name="shard (kind)" type="GoApplicationRunConfiguration" factoryName="Go Application">
<module name="kubernetes-controller-sharding" />
<working_directory value="$PROJECT_DIR$" />
<parameters value="--zap-log-level=debug --shard=shard-host" />
<parameters value="--zap-log-level=debug --shard=shard-host --lease-namespace=default" />
<envs>
<env name="KUBECONFIG" value="$PROJECT_DIR$/hack/kind_kubeconfig.yaml" />
</envs>
<kind value="PACKAGE" />
<package value="github.com/timebertt/kubernetes-controller-sharding/hack/cmd/shard" />
<package value="github.com/timebertt/kubernetes-controller-sharding/cmd/shard" />
<directory value="$PROJECT_DIR$" />
<filePath value="$PROJECT_DIR$/webhosting-operator/cmd/experiment/main.go" />
<method v="2" />
12 changes: 8 additions & 4 deletions Makefile
@@ -4,6 +4,8 @@ PROJECT_DIR := $(shell dirname $(abspath $(lastword $(MAKEFILE_LIST))))
TAG ?= latest
GHCR_REPO ?= ghcr.io/timebertt/kubernetes-controller-sharding
SHARDER_IMG ?= $(GHCR_REPO)/sharder:$(TAG)
SHARD_IMG ?= $(GHCR_REPO)/shard:$(TAG)
JANITOR_IMG ?= $(GHCR_REPO)/janitor:$(TAG)

# ENVTEST_K8S_VERSION refers to the version of kubebuilder assets to be downloaded by envtest binary.
ENVTEST_K8S_VERSION = 1.27
@@ -111,10 +113,12 @@ run: $(KUBECTL) generate-fast ## Run the sharder from your host and deploy prere
$(KUBECTL) apply --server-side --force-conflicts -k hack/config/certificates/host
go run ./cmd/sharder --config=hack/config/sharder/host/config.yaml --zap-log-level=debug

SHARD_NAME ?= shard-$(shell tr -dc bcdfghjklmnpqrstvwxz2456789 </dev/urandom | head -c 8)

.PHONY: run-shard
run-shard: $(KUBECTL) ## Run a shard from your host and deploy prerequisites.
$(KUBECTL) apply --server-side --force-conflicts -k hack/config/shard/clusterring
go run ./hack/cmd/shard --zap-log-level=debug
go run ./cmd/shard --shard=$(SHARD_NAME) --lease-namespace=default --zap-log-level=debug

PUSH ?= false
images: export KO_DOCKER_REPO = $(GHCR_REPO)
@@ -139,15 +143,15 @@ kind-up: $(KIND) $(KUBECTL) ## Launch a kind cluster for local development and t
kind-down: $(KIND) ## Tear down the kind testing cluster.
$(KIND) delete cluster --name sharding

# use dedicated ghcr repo for dev images to prevent spamming the "production" image repo
export SKAFFOLD_DEFAULT_REPO ?= ghcr.io/timebertt/dev-images
export SKAFFOLD_FILENAME = hack/config/skaffold.yaml
# use static label for skaffold to prevent rolling all components on every skaffold invocation
deploy up dev down: export SKAFFOLD_LABEL = skaffold.dev/run-id=sharding
# use dedicated ghcr repo for dev images to prevent spamming the "production" image repo
up dev: export SKAFFOLD_DEFAULT_REPO ?= ghcr.io/timebertt/dev-images

.PHONY: deploy
deploy: $(SKAFFOLD) $(KUBECTL) $(YQ) ## Build all images and deploy everything to K8s cluster specified in $KUBECONFIG.
$(SKAFFOLD) deploy -i $(SHARDER_IMG)
$(SKAFFOLD) deploy -i $(SHARDER_IMG) -i $(SHARD_IMG) -i $(JANITOR_IMG)

.PHONY: up
up: $(SKAFFOLD) $(KUBECTL) $(YQ) ## Build all images, deploy everything to K8s cluster specified in $KUBECONFIG, start port-forward and tail logs.
114 changes: 59 additions & 55 deletions README.md
@@ -1,75 +1,79 @@
# Kubernetes Controller Sharding

_Towards Horizontally Scalable Kubernetes Controllers_
_Horizontally Scalable Kubernetes Controllers_ 🚀

## About
## TL;DR 📖

This study project is part of my master's studies in Computer Science at the [DHBW Center for Advanced Studies](https://www.cas.dhbw.de/) (CAS).
Make Kubernetes controllers horizontally scalable by distributing reconciliation of API objects across multiple controller instances.
Remove the limitation of having only a single active replica (leader) per controller.

You can download and read the full thesis belonging to this implementation here: [thesis-controller-sharding](https://github.com/timebertt/thesis-controller-sharding).
This repository contains the practical part of the thesis: a sample operator using the sharding implementation, a full monitoring setup, and some tools for demonstration and evaluation purposes.
See [Getting Started With Controller Sharding](docs/getting-started.md) for a quick start with this project.

The controller sharding implementation itself is done generically in [controller-runtime](https://github.com/kubernetes-sigs/controller-runtime).
It is currently located in the `sharding` branch of my fork: https://github.com/timebertt/controller-runtime/tree/sharding.
## About ℹ️

## TL;DR
I started this project as part of my Master's studies in Computer Science at the [DHBW Center for Advanced Studies](https://www.cas.dhbw.de/) (CAS).
I completed a study project ("half-time thesis") about this topic. I'm currently working on my Master's thesis to evolve the project based on the first iteration.

Distribute reconciliation of Kubernetes objects across multiple controller instances.
Remove the limitation to have only one active replica (leader) per controller.
- Download and read the study project (first paper) here: [thesis-controller-sharding](https://github.com/timebertt/thesis-controller-sharding)
- Download and read the Master's thesis (second paper) here (WIP): [masters-thesis-controller-sharding](https://github.com/timebertt/masters-thesis-controller-sharding)

## Motivation
This repository contains the implementation belonging to the scientific work: the actual sharding implementation, a sample operator using controller sharding, a monitoring and continuous profiling setup, and some tools for development and evaluation purposes.

## Motivation 💡

Typically, [Kubernetes controllers](https://kubernetes.io/docs/concepts/architecture/controller/) use a leader election mechanism to determine a *single* active controller instance (leader).
When deploying multiple instances of the same controller, there will only be one active instance at any given time; the other instances will be on standby.
This is done to prevent controllers from performing uncoordinated and conflicting actions (reconciliations).
This is done to prevent multiple controller instances from performing uncoordinated and conflicting actions (reconciliations) on a single object concurrently.

If the current leader goes down and loses leadership (e.g., network failure, rolling update), another instance takes over leadership and becomes the active instance.
Such a setup can be described as an "active-passive HA setup". It minimizes "controller downtime" and facilitates fast failovers.
Such a setup can be described as an "active-passive HA setup". It minimizes "controller downtime" and facilitates fast fail-overs.
However, it cannot be considered as "horizontal scaling" as work is not distributed among multiple instances.

This restriction imposes scalability limitations on Kubernetes controllers.
That is, the rate of reconciliations, the number of objects, etc. are limited by the size of the machine the active controller runs on and the network bandwidth it can use.
In contrast to typical stateless applications, the throughput of the system cannot be increased by adding more instances (scaling horizontally) but only by using bigger instances (scaling vertically).

This study project presents a design that allows distributing reconciliation of Kubernetes objects across multiple controller instances.
It applies proven sharding mechanisms used in distributed databases to Kubernetes controllers to overcome the restriction of having only one active replica per controller.
The sharding design is implemented in a generic way in my [fork](https://github.com/timebertt/controller-runtime/tree/sharding) of the Kubernetes [controller-runtime](https://github.com/kubernetes-sigs/controller-runtime) project.
The [webhosting-operator](#webhosting-operator) is implemented as an example operator that uses the sharding implementation for demonstration and evaluation.
These are the first steps toward horizontally scalable Kubernetes controllers.

## Sharding Design

![Sharding Architecture](assets/architecture.svg)

High-level summary of the sharding design:

- multiple controller instances are deployed
- one controller instance is elected to be the sharder via the usual leader election
- all instances maintain individual shard leases for announcing themselves to the sharder (membership and failure detection)
- the sharder watches all objects (metadata-only) and the shard leases
- the sharder assigns individual objects to shards (using consistent hashing) by labeling them with the `shard` label
- the shards use a label selector to restrict the cache and controller to the set of objects assigned to them
- for moving objects (e.g. during rebalancing on scale-out), the sharder drains the object from the old shard by adding the `drain` label; after the shard has acknowledged the drain operation by removing both labels, the sharder assigns the object to the new shard
- when a shard releases its shard lease (voluntary disruption) the sharder assigns the objects to another active instance
- when a shard loses its shard lease the sharder acquires the shard lease (for ensuring the API server's reachability/functionality) and forcefully reassigns the objects


Read chapter 4 of the full [thesis](https://github.com/timebertt/thesis-controller-sharding) for a detailed explanation of the sharding design.
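
To illustrate the consistent hashing step from the summary above, here is a minimal hash ring sketch in Go. This is not the sharder's actual implementation; the hash function, key format, and number of virtual nodes per shard are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashRing is a minimal consistent hash ring. Each shard is mapped to multiple
// points ("virtual nodes") on the ring to even out the distribution of objects.
type hashRing struct {
	tokens []uint64          // sorted hash values on the ring
	shards map[uint64]string // hash value -> shard name
}

func hashKey(key string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(key))
	return h.Sum64()
}

func newHashRing(shardNames []string, tokensPerShard int) *hashRing {
	r := &hashRing{shards: map[uint64]string{}}
	for _, shard := range shardNames {
		for i := 0; i < tokensPerShard; i++ {
			t := hashKey(fmt.Sprintf("%s-%d", shard, i))
			r.tokens = append(r.tokens, t)
			r.shards[t] = shard
		}
	}
	sort.Slice(r.tokens, func(i, j int) bool { return r.tokens[i] < r.tokens[j] })
	return r
}

// assign returns the shard responsible for the given object key (e.g. "namespace/name"):
// the owner of the first token on the ring at or after the key's hash.
func (r *hashRing) assign(objectKey string) string {
	h := hashKey(objectKey)
	i := sort.Search(len(r.tokens), func(i int) bool { return r.tokens[i] >= h })
	if i == len(r.tokens) {
		i = 0 // wrap around to the beginning of the ring
	}
	return r.shards[r.tokens[i]]
}

func main() {
	ring := newHashRing([]string{"shard-a", "shard-b", "shard-c"}, 100)
	fmt.Println(ring.assign("default/my-website")) // value written into the shard label
}
```

The sharder would write the returned shard name into the `shard` label of the object; when a shard joins or leaves the ring, only the objects whose hashes fall onto the affected tokens are reassigned, which keeps rebalancing bounded.
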

## Contents of This Repository

- [docs](docs):
- [getting started with controller sharding](docs/getting-started.md)
- [sharder](cmd/sharder): the main sharding component orchestrating sharding
- [shard](hack/cmd/shard): a simple dummy controller that implements the requirements for sharded controllers (used for development and testing purposes)
- [webhosting-operator](webhosting-operator): a sample operator for demonstrating and evaluating the implemented sharding design for Kubernetes controllers
- [samples-generator](webhosting-operator/cmd/samples-generator): a tool for generating a given number of random `Website` objects
- [monitoring setup](hack/config/monitoring): a setup for monitoring and measuring load test experiments for the sample operator
- includes [kube-prometheus](https://github.com/prometheus-operator/kube-prometheus)
- [sharding-exporter](config/monitoring/sharding-exporter): (based on the [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics) [custom resource metrics feature](https://github.com/kubernetes/kube-state-metrics/blob/main/docs/customresourcestate-metrics.md)) for metrics on the state of shards
- [webhosting-exporter](webhosting-operator/config/monitoring/webhosting-exporter) for metrics on the state of the webhosting-operator's API objects (similar to above)
- [grafana](https://github.com/grafana/grafana) along with some dashboards for [controller-runtime](hack/config/monitoring/default/dashboards) and [webhosting-operator and sharding](webhosting-operator/config/monitoring/default/dashboards)
- [experiment](webhosting-operator/cmd/experiment): a tool (based on controller-runtime) for executing load test scenarios for the webhosting-operator
- [measure](webhosting-operator/cmd/measure): a tool for retrieving configurable measurements from prometheus and storing them in csv-formatted files for further analysis (with `numpy`) and visualization (with `matplotlib`)
- a few [kyverno](https://github.com/kyverno/kyverno) policies for [scheduling](webhosting-operator/config/policy) and the [control plane](hack/config/policy) for more stable load test results
- a simple [parca](https://github.com/parca-dev/parca) setup for [profiling](hack/config/policy) the sharding components and webhosting-operator during load tests
## Introduction 🚀

This project allows scaling Kubernetes controllers horizontally by removing the restriction of having only one active replica per controller (enabling active-active setups).
It distributes reconciliation of Kubernetes objects across multiple controller instances, while still ensuring that only a single controller instance acts on a single object at any given time.
For this, the project applies proven sharding mechanisms used in distributed databases to Kubernetes controllers.

The project introduces a `sharder` component that implements sharding in a generic way and can be applied to any Kubernetes controller, independent of the programming language and controller framework used.
The `sharder` component is installed into the cluster along with a `ClusterRing` custom resource.
A `ClusterRing` declares a virtual ring of sharded controller instances and specifies API resources that should be distributed across shards in the ring.
It configures sharding at the cluster scope (i.e., for objects in all namespaces), hence the name `ClusterRing`.

The watch cache is an expensive part of a controller in terms of network transfer, CPU (decoding), and memory (local copy of all objects).
When running multiple instances of a controller, each instance must therefore watch only the subset of objects it is responsible for.
Otherwise, the setup would only multiply the resource consumption.
The sharder assigns objects to instances via the shard label.
Each shard then uses a label selector with its own instance name to watch only the objects that are assigned to it.
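
As a sketch of how a shard could restrict its watches with controller-runtime (assuming a recent controller-runtime version that provides `cache.Options.DefaultLabelSelector`): the label key and shard name below are placeholders, the actual key is ring-specific and set by the sharder.

```go
package main

import (
	"os"

	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/log"
)

func main() {
	// Placeholders: the actual label key depends on the ClusterRing, and the
	// shard name must match the name of this instance's shard Lease.
	shardLabel := "shard.example.com/my-clusterring" // assumption, not the real key
	shardName := os.Getenv("SHARD_NAME")

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			// Watch, cache, and reconcile only objects assigned to this shard.
			DefaultLabelSelector: labels.SelectorFromSet(labels.Set{shardLabel: shardName}),
		},
	})
	if err != nil {
		log.Log.Error(err, "unable to create manager")
		os.Exit(1)
	}

	// ... set up controllers and start the manager as usual ...
	_ = mgr
}
```

With such a selector in place, the shard's cache, watches, and reconciliations are limited to the objects currently assigned to it.
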

Alongside the actual sharding implementation, this project contains a setup for simple [development, testing](docs/development.md), and [evaluation](docs/evaluation.md) of the sharding mechanism.
This includes an example operator that uses controller sharding ([webhosting-operator](webhosting-operator)).
See [Getting Started With Controller Sharding](docs/getting-started.md) for more details.

To support sharding in your Kubernetes controller, only three aspects need to be implemented:

- announce ring membership and shard health: maintain individual shard `Leases` instead of performing leader election on a single `Lease`
- only watch, cache, and reconcile objects assigned to the respective shard: add a shard-specific label selector to watches
- acknowledge object movements during rebalancing: remove the drain and shard labels when the drain label is set and stop reconciling the object (see the sketch below)

See [Implement Sharding in Your Controller](docs/implement-sharding.md) for more information and examples.
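
As a rough sketch of the last point, a reconciler could acknowledge drain operations like this. Label keys, the object type, and names are placeholders for illustration, not the project's actual constants.

```go
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Placeholder label keys; the real keys are defined per ClusterRing.
const (
	shardLabel = "shard.example.com/my-clusterring"
	drainLabel = "drain.example.com/my-clusterring"
)

type Reconciler struct {
	Client client.Client
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	obj := &corev1.ConfigMap{} // stand-in for the actual sharded resource
	if err := r.Client.Get(ctx, req.NamespacedName, obj); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Drain requested: acknowledge by removing both labels and stop reconciling.
	// The sharder then assigns the object to another shard.
	if _, drain := obj.Labels[drainLabel]; drain {
		patch := client.MergeFrom(obj.DeepCopy())
		delete(obj.Labels, drainLabel)
		delete(obj.Labels, shardLabel)
		return ctrl.Result{}, r.Client.Patch(ctx, obj, patch)
	}

	// ... regular reconciliation for objects assigned to this shard ...
	return ctrl.Result{}, nil
}
```

Because the shard's cache is filtered by the shard label, the object disappears from its cache as soon as the labels are removed.
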

## Design 📐

![Sharding Architecture](docs/assets/architecture.svg)

See [Design](docs/design.md) for more details on the sharding architecture and design decisions.

## Discussion 💬

Feel free to contact me on the [Kubernetes Slack](https://kubernetes.slack.com/) ([get an invitation](https://slack.k8s.io/)): [@timebertt](https://kubernetes.slack.com/team/UF8C35Z0D).

## TODO 🧑‍💻

- [ ] implement more tests: unit, integration, e2e tests
- [ ] add `ClusterRing` API validation
- [ ] implement a custom generic sharding-exporter
4 changes: 0 additions & 4 deletions assets/architecture.svg

This file was deleted.
