- Why would you need this monkey?
- 3 Modes
- Pod Targeting
- Randomness
- Running/Testing on local machine
- Installation
- Debugging
- CLI Options
TLDR: khaos-monkey is a simple chaos monkey built for Kubernetes. It terminates grouped pods at random, applying the same rules/modes to all workloads in the selected namespaces. The project focuses on simplicity, being lightweight, and streamlining chaos applied to all workloads.
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. In 2011, Netflix created a project called Chaos Monkey, which kickstarted the Chaos Engineering discipline.
khaos-monkey is a simple chaos monkey built for Kubernetes. All it does is randomly terminate pods following specific rules. The project focuses on simplicity, being lightweight, and streamlining chaos applied to the workload. khaos-monkey is built in Rust with kube-rs.
I created this tool because the tools I found didn't fit my use case very well.
In my experience, if you run a huge Kubernetes cluster, install a chaos monkey, and leave it up to the developers to remember to add specific labels to their workloads, they will forget about it. Close to no one will remember to opt in, and therefore you can't have confidence that the services can tolerate random failures/crashes. Targeting whole namespaces kind of "forces" the developers to make an active decision: either opt out, or make sure that the system/service can survive occasional crashes. If getting targeted by chaos is not the default, no one will remember to opt in and no resilience will be ensured.
This is kind of how they did it at Netflix: not forcing their "engineers to architect their code in any specific way", but instead having a chaos monkey that indirectly forces their engineers to build their systems resilient enough to survive incidents.
kube-monkey: Tools like kube-monkey compel you to add labels to every single resource you want the chaos monkey to target. khaos-monkey takes another approach and focuses on having equal rules for all pods in a given namespace. The main use case of khaos-monkey is to target whole namespaces. khaos-monkey is built on the philosophy that all systems/services should be resilient enough that a few "crashes" do not result in downtime.
chaoskube: chaoskube is a great tool and probably the most similar tool to khaos-monkey. It has more ways of selecting pods, which is nice. khaos-monkey differentiates itself by having multiple modes, whereas chaoskube is fixed in the mode I call "fixed 1". chaoskube sees all pods as equal: it doesn't matter if there are 100 replicas of a given deployment and only 1 of another, the likelihood of hitting each pod is the same. khaos-monkey tries to group pods by deployment, so the number of deleted pods depends on the replica count of each deployment.
litmus: Another great tool is litmus (which I am a huge fan of). Litmus is much more advanced and better suited for big, mature infrastructure, but it can be cumbersome to install and may be overkill for smaller experimental clusters, and running it on a local kind or minikube cluster can be resource-intensive. khaos-monkey is simple to install and very lightweight.
In percentage mode, the monkey kills a given percentage of the pods in each targeted group. The number is rounded down.
Example: if you run the monkey with
./khaos-monkey percentage 55
and your ReplicaSet has 4 pods, the monkey will kill 2 random pods on every attack (55% of 4 is 2.2, which is rounded down to 2).
If set to fixed, the monkey will kill a fixed number of pods in each ReplicaSet.
Example: if you run the monkey with
./khaos-monkey fixed 3
and your ReplicaSet has 5 pods, the monkey will kill 3 random pods on every attack.
If set to fixed-left, the monkey will kill pods of each pod type until a fixed number of pods is left.
Example: if you run the monkey with
./khaos-monkey fixed-left 3
and your ReplicaSet has 5 pods, the monkey will kill pods until there are 3 left. In this case, it would kill 2 pods.
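The arithmetic behind the three modes can be sketched in a few lines of Rust. This is a minimal illustration of the rules described above, not the monkey's actual source: the `Mode` enum, the `kill_count` function, and the choice to cap `fixed` at the group size are assumptions made for the example.

```rust
/// The three kill modes (illustrative names, not taken from the khaos-monkey source).
#[derive(Debug, Clone, Copy)]
enum Mode {
    /// Kill this percentage of the pods in a group, rounded down.
    Percentage(u32),
    /// Kill this many pods in a group.
    Fixed(usize),
    /// Kill pods until this many are left in a group.
    FixedLeft(usize),
}

/// Number of pods to delete in a group that currently has `pods` pods.
fn kill_count(mode: Mode, pods: usize) -> usize {
    match mode {
        // Integer division rounds down: 55% of 4 pods is 2.2, so 2 pods.
        Mode::Percentage(p) => pods * (p.min(100) as usize) / 100,
        // Assumption: never ask for more pods than the group has.
        Mode::Fixed(n) => n.min(pods),
        // Nothing is killed if the group is already at or below the target.
        Mode::FixedLeft(n) => pods.saturating_sub(n),
    }
}

fn main() {
    println!("{}", kill_count(Mode::Percentage(55), 4)); // prints 2
    println!("{}", kill_count(Mode::Fixed(3), 5)); // prints 3
    println!("{}", kill_count(Mode::FixedLeft(3), 5)); // prints 2
}
```

The three calls in `main` correspond to the three examples above.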
The monkey can either target individual pods or whole namespaces. If the option --target-namespaces is set to "namespaceA, namespaceB", the monkey will target all pods in those two namespaces. This means that the monkey may kill any pod in those namespaces unless the pod opts out.
Example: If you run the monkey with
--target-namespaces="namespaceA,namespaceB"
it will target all pods in namespaceA and namespaceB.
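Both spellings of the option value above ("namespaceA, namespaceB" and "namespaceA,namespaceB") name the same two namespaces. One way such a value could be turned into a set is sketched below. The function name `parse_namespaces` is hypothetical, and the trimming and empty-entry handling are assumptions rather than the monkey's documented behavior.

```rust
use std::collections::BTreeSet;

/// Splits a comma-separated namespace list into a set.
/// Assumption: surrounding whitespace and empty entries are ignored.
fn parse_namespaces(raw: &str) -> BTreeSet<String> {
    raw.split(',')
        .map(str::trim)
        .filter(|ns| !ns.is_empty())
        .map(String::from)
        .collect()
}

fn main() {
    let with_space = parse_namespaces("namespaceA, namespaceB");
    let without_space = parse_namespaces("namespaceA,namespaceB");
    // Both spellings yield the same two namespaces.
    assert_eq!(with_space, without_space);
    println!("{:?}", with_space);
}
```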
Opt-in: you can make the monkey target an individual pod in any namespace by adding the label khaos-enabled: "true" to the pod. If this label exists on a pod, it doesn't matter whether the pod is inside one of the namespaces specified by --target-namespaces or not.
Opt-in deployment example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: whatever-deployment
spec:
  template:
    metadata:
      labels:
        khaos-enabled: "true"
    spec:
      containers:
        - name: whatever-container
          image: whatever-image
Opt-out: you can make the monkey target all pods in specific namespaces and choose which pods to exclude in those namespaces. All pods with the label khaos-enabled: "false" opt out and are excluded from the monkey's pod targeting.
Example: The monkey is targeting namespaceA, and podA runs inside namespaceA. If the pod has the label khaos-enabled: "false", it will be ignored by the monkey and not killed. If it does not have that label, it will be targeted by the monkey (and eventually be killed).
Opt-out example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: whatever-deployment
spec:
  template:
    metadata:
      labels:
        khaos-enabled: "false"
    spec:
      containers:
        - name: whatever-container
          image: whatever-image
By default, all pods in a ReplicaSet are grouped. So if a Deployment has 4 replicas, the monkey may kill x pods of those replicas.
You can make a custom group by adding a label to the pods, e.g. khaos-group=my-group. This would make the monkey treat your custom group the same way it treats a ReplicaSet.
Example: Let's say that deployment depA has 2 pods/replicas, deployment depB has 1 pod/replica, and all 3 pods have the label khaos-group=my-group. The monkey is run with ./khaos-monkey fixed 2. In this case, the monkey will kill either both of depA's pods or 1 pod from each deployment, since all 3 pods are treated as being in the same group.
Deployment example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: whatever-deployment
spec:
  replicas: 4
  template:
    metadata:
      labels:
        khaos-enabled: "true"
        khaos-group: my-group
    spec:
      containers:
        - name: whatever-container
          image: whatever-image
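The grouping rule can be sketched as a lookup on pod labels. The sketch below assumes that a khaos-group label takes precedence and that pods otherwise fall back to the pod-template-hash label Kubernetes puts on ReplicaSet pods (the monkey's logs print groups in the form pod-template-hash=...). The helper names and the decision to skip pods with neither label are assumptions for illustration, not the monkey's actual code.

```rust
use std::collections::BTreeMap;

type Labels = BTreeMap<String, String>;

/// Builds a label map from string pairs (helper for this example only).
fn labels(pairs: &[(&str, &str)]) -> Labels {
    pairs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect()
}

/// Group key for a pod: the custom group label if present, else the ReplicaSet hash.
fn group_key(pod_labels: &Labels) -> Option<String> {
    ["khaos-group", "pod-template-hash"]
        .iter()
        .find_map(|key| pod_labels.get(*key).map(|value| format!("{}={}", key, value)))
}

/// Buckets pod names by group key.
/// Assumption: pods with neither label are left out of every group.
fn group_pods(pods: &[(&str, Labels)]) -> BTreeMap<String, Vec<String>> {
    let mut groups: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for (name, pod_labels) in pods {
        if let Some(key) = group_key(pod_labels) {
            groups.entry(key).or_default().push(name.to_string());
        }
    }
    groups
}

fn main() {
    // depA has 2 pods, depB has 1 pod, and all three share khaos-group=my-group.
    let pods = vec![
        ("depA-1", labels(&[("khaos-group", "my-group"), ("pod-template-hash", "aaa")])),
        ("depA-2", labels(&[("khaos-group", "my-group"), ("pod-template-hash", "aaa")])),
        ("depB-1", labels(&[("khaos-group", "my-group"), ("pod-template-hash", "bbb")])),
    ];
    let groups = group_pods(&pods);
    println!("{:?}", groups); // a single group holding all three pods
}
```

With fixed 2, the monkey would then pick 2 of the 3 pods in that single group, which matches the depA/depB example above.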
Blacklisted namespaces: you can specify namespaces where it is not possible to opt in.
Example: If you run the monkey with
--blacklisted-namespaces="whatever"
it is not possible for pods in namespace "whatever" to opt in.
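Taken together, the targeting rules (target namespaces, opt-in, opt-out, blacklist) amount to a small decision function. The sketch below is one reading of those rules, not the monkey's source: the `Pod` struct and helper functions are hypothetical, and checking the blacklist before anything else is an assumption based on the CLI help stating that pods in blacklisted namespaces can't be targeted.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Minimal stand-in for a Kubernetes pod (hypothetical type for this sketch).
struct Pod {
    namespace: String,
    labels: BTreeMap<String, String>,
}

/// Builds a pod with an optional `khaos-enabled` label value.
fn pod(namespace: &str, khaos_enabled: Option<&str>) -> Pod {
    let mut labels = BTreeMap::new();
    if let Some(value) = khaos_enabled {
        labels.insert("khaos-enabled".to_string(), value.to_string());
    }
    Pod { namespace: namespace.to_string(), labels }
}

/// Builds a namespace set from string literals.
fn set(items: &[&str]) -> BTreeSet<String> {
    items.iter().map(|s| s.to_string()).collect()
}

/// Decides whether the monkey may kill `pod`.
fn is_targeted(pod: &Pod, targets: &BTreeSet<String>, blacklist: &BTreeSet<String>) -> bool {
    // Assumption: the blacklist is checked first, so opting in has no effect there.
    if blacklist.contains(pod.namespace.as_str()) {
        return false;
    }
    match pod.labels.get("khaos-enabled").map(String::as_str) {
        Some("true") => true,   // opt-in works in any non-blacklisted namespace
        Some("false") => false, // opt-out works even inside a targeted namespace
        _ => targets.contains(pod.namespace.as_str()),
    }
}

fn main() {
    let targets = set(&["namespaceA"]);
    let blacklist = set(&["whatever"]);
    println!("{}", is_targeted(&pod("namespaceA", None), &targets, &blacklist)); // prints true
    println!("{}", is_targeted(&pod("namespaceA", Some("false")), &targets, &blacklist)); // prints false
    println!("{}", is_targeted(&pod("other", Some("true")), &targets, &blacklist)); // prints true
    println!("{}", is_targeted(&pod("whatever", Some("true")), &targets, &blacklist)); // prints false
}
```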
You can randomize how often the attacks happen and how many pods are killed in each attack.
You can set random-kill-count to true if you want the monkey to kill a random number of pods between 0 and the specified value for that mode.
Example: If the monkey runs with
./khaos-monkey --random-kill-count=true percentage 50
then the monkey will kill between 0 and 50 percent of the pods in each ReplicaSet.
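According to the CLI help, random-kill-count multiplies the number of pods to kill by a number between 0 and 1. A minimal sketch of that scaling follows. The function name is hypothetical, the random factor is passed in as a parameter so the example stays deterministic (a real implementation would draw it from a random number generator), and rounding down is an assumption.

```rust
/// Scales the mode's kill count by a factor in [0, 1].
/// Assumption: the result is rounded down, like the percentage mode.
fn randomized_kill_count(base_count: usize, random_factor: f64) -> usize {
    let factor = random_factor.clamp(0.0, 1.0);
    (base_count as f64 * factor).floor() as usize
}

fn main() {
    // With `percentage 50` and 8 pods the base count is 4;
    // the random factor then picks a value between 0 and 4.
    println!("{}", randomized_kill_count(4, 0.0)); // prints 0
    println!("{}", randomized_kill_count(4, 0.5)); // prints 2
    println!("{}", randomized_kill_count(4, 1.0)); // prints 4
}
```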
You can set random-extra-time-between-chaos to e.g. 5m if you want to add additional random time between each attack.
Example: If the monkey runs with
--min-time-between-chaos=1m --random-extra-time-between-chaos=1m
the attacks will happen with a random time interval between 1 and 2 minutes.
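The interval rule is the minimum time plus a random share of the extra time. A minimal sketch, assuming the random share is a fraction in [0, 1] that is supplied by the caller (the function name `next_delay` is hypothetical, and a real implementation would draw the fraction from a random number generator):

```rust
use std::time::Duration;

/// Delay until the next attack: `min` plus a fraction of `extra`.
/// `fraction` stands in for a random draw in [0, 1].
fn next_delay(min: Duration, extra: Duration, fraction: f64) -> Duration {
    min + extra.mul_f64(fraction.clamp(0.0, 1.0))
}

fn main() {
    let one_minute = Duration::from_secs(60);
    // With both options set to 1m the delay always lands between 60s and 120s.
    println!("{:?}", next_delay(one_minute, one_minute, 0.0)); // prints 60s
    println!("{:?}", next_delay(one_minute, one_minute, 0.5)); // prints 90s
    println!("{:?}", next_delay(one_minute, one_minute, 1.0)); // prints 120s
}
```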
You can test the monkey on your local machine before putting it on Kubernetes. If you have your kube-config in ~/.kube/config and have cargo installed, you can just pull the repo and run
$ cargo run -- --target-namespaces="my-namespace" fixed 1
in the repo root. If your config and permissions are correct, the monkey will start killing pods in the namespace "my-namespace" on the current kubectl context.
$ kubectl create namespace khaos-monkey
Copy-paste and run this in your terminal:
cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: khaos-monkey-cluster-role
rules:
  - apiGroups: ["*"]
    resources: ["pods", "namespaces"]
    verbs: ["list", "delete"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: khaos-monkey-cluster-role-binding
subjects:
  - kind: ServiceAccount
    name: default
    namespace: khaos-monkey
roleRef:
  kind: ClusterRole
  name: khaos-monkey-cluster-role
  apiGroup: ""
EOF
Copy-paste and run this in your terminal:
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: khaos-monkey
  name: khaos-monkey
spec:
  replicas: 1
  selector:
    matchLabels:
      app: khaos-monkey
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        khaos-enabled: "false"
        app: khaos-monkey
    spec:
      containers:
        - name: khaos-monkey
          image: dagandersen/khaos-monkey:latest
          args: ["fixed", "1" ]
          env:
            - name: TARGET_NAMESPACES
              value: "default"
EOF
Now the monkey will start killing 1 pod of a random ReplicaSet (or custom khaos-group) in the default namespace every 1-2 minutes.
Feel free to tune the numbers yourself. Remember that the monkey may kill itself if it exists inside a targeted namespace and does not opt-out. It is possible to run multiple instances of the monkey with different settings.
Run the following command to verify that the monkey works as expected.
$ kubectl wait -A --for=condition=ready pod -l "app=khaos-monkey" && kubectl logs -l app=khaos-monkey -n khaos-monkey --follow=true --tail=100
The command will print something like this
target_namespaces from args/env: {"default"}
blacklisted_namespaces from args/env: {"kube-system", "kube-public", "kube-node-lease"}
Namespaces found in cluster: {"kube-public", "kube-node-lease", "default", "kube-system", "local-path-storage"}
Monkey will target namespace: {"default"}
###################
### Chaos Beginning
# Deleting: 1/4 running pods in Khaos Group: pod-template-hash=5b8c759b68
Deleting Pod: "my-magic-pod-5b8c759b68-2pzxh"
### Chaos over
### Time until next Chaos: 1m 32s
###################
...
If you are having trouble figuring out which pods are grouped or why the monkey is not targeting certain pods, you can run the monkey with the environment variable RUST_LOG=info:
...
      containers:
        - name: khaos-monkey
          image: dagandersen/khaos-monkey:latest
          args: ["fixed", "1" ]
          env:
            - name: TARGET_NAMESPACES
              value: "default"
            - name: RUST_LOG
              value: "info"
It will print something like
[2021-08-23T17:12:46Z INFO khaos_monkey] ## All pods found:
[2021-08-23T17:12:47Z INFO khaos_monkey] - echo-client-5b8c759b68-5kk6q
[2021-08-23T17:12:47Z INFO khaos_monkey] - echo-server-5bfd4768db-6tdjc
[2021-08-23T17:12:47Z INFO khaos_monkey] - echo-server-5bfd4768db-g7vwv
[2021-08-23T17:12:47Z INFO khaos_monkey] - echo-server-5bfd4768db-khcz5
[2021-08-23T17:12:47Z INFO khaos_monkey] - echo-server-5bfd4768db-qkdzx
...
[2021-08-23T17:12:47Z INFO khaos_monkey] ## All targeted groups:
[2021-08-23T17:12:47Z INFO khaos_monkey] - pod-template-hash=5b8c759b68 with 1 pods:
[2021-08-23T17:12:47Z INFO khaos_monkey] - echo-client-5b8c759b68-5kk6q
[2021-08-23T17:12:47Z INFO khaos_monkey] - pod-template-hash=5bfd4768db with 4 pods:
[2021-08-23T17:12:47Z INFO khaos_monkey] - echo-server-5bfd4768db-qkdzx
[2021-08-23T17:12:47Z INFO khaos_monkey] - echo-server-5bfd4768db-khcz5
[2021-08-23T17:12:47Z INFO khaos_monkey] - echo-server-5bfd4768db-g7vwv
[2021-08-23T17:12:47Z INFO khaos_monkey] - echo-server-5bfd4768db-6tdjc
...
khaos-monkey 0.1.0

USAGE:
    khaos-monkey [OPTIONS] <SUBCOMMAND>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
        --attacks-per-interval <attacks-per-interval>
            Number of pod-types that can be deleted at a time. No limit if value is -1. Example: if set to "2" it may
            attack two ReplicaSets
            [env: ATTACKS_PER_INTERVAL=] [default: 1]
        --blacklisted-namespaces <blacklisted-namespaces>
            namespaces you want the monkey to ignore. Pods running in these namespaces can't be target
            [env: BLACKLISTED_NAMESPACES=] [default: kube-system, kube-public, kube-node-lease]
        --min-time-between-chaos <min-time-between-chaos>
            Minimum time between chaos attacks
            [env: MIN_TIME_BETWEEN_CHAOS=] [default: 1m]
        --random-extra-time-between-chaos <random-extra-time-between-chaos>
            This specifies a random time interval that will be added to `min-time-between-chaos` each attack. Example:
            If both options are sat to `1m` the attacks will happen with a random time interval between 1 and 2 minutes
            [env: RANDOM_EXTRA_TIME_BETWEEN_CHAOS=] [default: 1m]
        --random-kill-count <random-kill-count>
            If "true" a number between 0 and 1 is multiplied with number of pods to kill
            [env: RANDOM_KILL_COUNT=] [default: false]
        --target-namespaces <target-namespaces>
            namespaces you want the monkey to target. Example: "namespace1, namespace2". The monkey will target all pods
            in these namespace unless they opt-out
            [env: TARGET_NAMESPACES=] [default: ""]

SUBCOMMANDS:
    fixed         Kill a fixed number of each pod group
    fixed-left    Kill pods until a fixed number of each pod group is alive
    percentage    Kill a percentage of each pod group
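
The attacks-per-interval option caps how many pod groups are attacked in one round, with -1 meaning no limit. A rough sketch of that cap is shown below. The function name is hypothetical, and treating every negative value like -1 is an assumption, since the help text only defines -1.

```rust
/// How many pod groups to attack in one round.
/// Assumption: any negative value means "no limit" (the help text only defines -1).
fn groups_to_attack(eligible_groups: usize, attacks_per_interval: i64) -> usize {
    if attacks_per_interval < 0 {
        eligible_groups
    } else {
        eligible_groups.min(attacks_per_interval as usize)
    }
}

fn main() {
    println!("{}", groups_to_attack(5, 1)); // prints 1 (the default)
    println!("{}", groups_to_attack(5, 2)); // prints 2
    println!("{}", groups_to_attack(5, -1)); // prints 5 (no limit)
}
```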