The master provides an API to orchestrate fault injections on the chaos bots. Through the chaos API you have access to a number of fault injections and to an automatic failure recovery mechanism.
- Start by defining a ‘steady state’.
- Hypothesize that this steady state will continue in both the control group and the experimental group.
- Inject failures that reflect real world events.
- Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
At this point we should add one more stage:
- Recover fast to the ‘steady state’.
The master and bots focus on two of the stages of chaos: injection of failures and recovery to a steady state.
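The stages above can be read as a small control loop. The following is a minimal sketch, not the master's implementation: `run_experiment` and its three callables are hypothetical names, and it simplifies the control group versus experimental group comparison to a steady-state check before and after the injection.

```python
from typing import Callable, Dict, Optional


def run_experiment(
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    recover: Callable[[], None],
) -> Dict[str, Optional[bool]]:
    """Run one experiment and report whether the steady-state hypothesis held."""
    # Only start from a known steady state; otherwise nothing is injected.
    if not steady_state():
        return {"injected": False, "hypothesis_held": None, "recovered": None}

    # Inject a failure that reflects a real-world event.
    inject()
    try:
        # Try to disprove the hypothesis: is the steady state still observed?
        held = steady_state()
    finally:
        # Recover fast to the steady state, even if the check itself fails.
        recover()

    return {"injected": True, "hypothesis_held": held, "recovered": steady_state()}


# Toy system used only for illustration: a single health flag.
system = {"healthy": True}
result = run_experiment(
    steady_state=lambda: system["healthy"],
    inject=lambda: system.update(healthy=False),
    recover=lambda: system.update(healthy=True),
)
```

In this toy run the injection breaks the steady state and the recovery restores it, which corresponds to the two stages the master and bots cover.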
tar -xzf chaos-master-0.0.2.linux-amd64.tar.gz
./chaos-master --config.file=path/to/config.yml
See examples of the configuration file in config/example
# Contains the configuration for the port and scheme of the API.
# The default values are port: 8080 and scheme: http
port: 8090
scheme: http
# Contains the definition of all enabled failures.
# Each failure injection needs to be defined in a job together with the targets that are in scope
# The unique name of the job. The character ',' is not allowed
- job_name: "docker failure injection"
# The type of the failure. Can be [Docker, Service, CPU, Server, Network]
type: "Docker"
# The name of the target component. Only applicable to Docker and Service failure types
component_name: "nginx"
# The list of targets to which this failure can be applied
targets: ['host1:8081', 'host2:8081']
- job_name: "network injection"
type: "Network"
targets: ['host1:8081', 'host3:8081']
# Contains the tls configuration for the communication with the bots.
# If not specified will default to http
# If specified the traffic to the bots will be https
# You can only provide a peer token if the traffic is https
# CA certificate
ca_cert: "config/test/certs/ca-cert.pem"
# The pub cert for the connection with the bot
public_cert: "config/test/certs/server-cert.pem"
# peer token for authorization with the bot. A public cert needs to also be provided
peer_token: 30028dd6-a641-4ac3-91d8-1e214ac5e6f6
# Contains the configuration for the healthcheck towards the bots
# If set to active the master will send a healthcheck request to the bots every 1 minute
active: false
# If set to active the status of the healthcheck will be reported in application log (stderr)
report: false
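The rules stated in the job comments can be checked before the master is started. The sketch below is an assumption-laden illustration, not the master's own validation: `validate_jobs` is a hypothetical helper, it takes the job definitions as an already-parsed list of dictionaries (the enclosing configuration keys are not assumed here), and the real master may accept or reject configurations differently.

```python
from typing import Dict, List

# Failure types listed in the example configuration comments.
ALLOWED_TYPES = {"Docker", "Service", "CPU", "Server", "Network"}
# component_name is only applicable to these failure types.
COMPONENT_TYPES = {"Docker", "Service"}


def validate_jobs(jobs: List[Dict]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no problem found."""
    errors = []
    seen_names = set()
    for job in jobs:
        name = job.get("job_name", "")
        if not name:
            errors.append("a job has no job_name")
        if "," in name:
            errors.append(f"job_name {name!r} contains ','")
        if name in seen_names:
            errors.append(f"job_name {name!r} is not unique")
        seen_names.add(name)

        job_type = job.get("type")
        if job_type not in ALLOWED_TYPES:
            errors.append(f"job {name!r} has unknown type {job_type!r}")
        if "component_name" in job and job_type not in COMPONENT_TYPES:
            errors.append(f"job {name!r} sets component_name for type {job_type!r}")
        if not job.get("targets"):
            errors.append(f"job {name!r} has no targets")
    return errors


# The two jobs from the example configuration above.
EXAMPLE_JOBS = [
    {
        "job_name": "docker failure injection",
        "type": "Docker",
        "component_name": "nginx",
        "targets": ["host1:8081", "host2:8081"],
    },
    {
        "job_name": "network injection",
        "type": "Network",
        "targets": ["host1:8081", "host3:8081"],
    },
]
problems = validate_jobs(EXAMPLE_JOBS)
```

For the example jobs no problems are reported; a job name containing ',' or an unlisted failure type would each add one entry to the returned list.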
See the API specification after starting the master at <host>/chaos/api/v1/swagger/index.html
- Define the scope of your experiments. Failure types are scoped to specific targets and components.
  - For the example config above
    - The Docker failure is scoped to the nginx containers in the targets 'host1:8081', 'host2:8081'
    - The Network failure is scoped to the targets 'host1:8081', 'host3:8081'
- Start a chaos bot in each target specified in your jobs
  - For the example config above we would have to start 3 bots: one on host1, one on host2 and one on host3, all on port 8081
- [Optional] Ensure that you have monitoring and alerting in place. Add the recover endpoint as a webhook in case of an alert, to quickly revert all running failures
- Make the first API call to inject a failure
  - For the example config above
    curl -ss -X POST "" \
      -H "Content-Type: application/json" \
      -d '{"job": "docker failure injection", "containerName": "nginx", "target": "host1:8081"}'
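The steps above can be derived from the job definitions. The following is a minimal sketch under stated assumptions: `bot_targets` and `injection_body` are hypothetical helpers, the jobs are given as parsed dictionaries, and the request body only mirrors the fields of the curl example (`job`, `containerName`, `target`). The field used for other failure types is not shown in this document, so the sketch adds `containerName` for Docker jobs only and sends nothing over the network.

```python
import json
from typing import Dict, List

# The two jobs from the example configuration.
EXAMPLE_JOBS = [
    {
        "job_name": "docker failure injection",
        "type": "Docker",
        "component_name": "nginx",
        "targets": ["host1:8081", "host2:8081"],
    },
    {
        "job_name": "network injection",
        "type": "Network",
        "targets": ["host1:8081", "host3:8081"],
    },
]


def bot_targets(jobs: List[Dict]) -> List[str]:
    """Every distinct target across all jobs; each one needs a running bot."""
    return sorted({target for job in jobs for target in job["targets"]})


def injection_body(job: Dict, target: str) -> str:
    """Build a JSON body shaped like the curl example for one job and target."""
    # A failure is scoped to the targets listed in its job.
    if target not in job["targets"]:
        raise ValueError(f"{target!r} is out of scope for job {job['job_name']!r}")
    body = {"job": job["job_name"], "target": target}
    # Assumption: only the Docker field name is known from the curl example.
    if job["type"] == "Docker" and "component_name" in job:
        body["containerName"] = job["component_name"]
    return json.dumps(body)


targets = bot_targets(EXAMPLE_JOBS)
docker_body = injection_body(EXAMPLE_JOBS[0], "host1:8081")
```

Here `targets` lists the three hosts that need a bot, and `docker_body` can be passed to curl with `-d` against the injection endpoint listed in the API specification.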
Chaos master | Chaos mesh | Chaos toolkit | Gremlin | |
Run experiments as API calls | x | x | ||
Run experiments as json | x | x | ||
Automatic recovery | x | x | ||
Steady state definition | x | |||
Status checks | x | x | ||
Pluggable failures | x | |||
Kubernetes failures | x | x | ||
Container failures | x | x | x | |
Service failures | x | |||
Server failures | x | x | ||
netem chaos | x | x | x | |
CPU burn | x | x | x | |
IO chaos | x | x | ||
Memory burn | x | x | ||
Kernel chaos | x | |||
dns chaos | x | x | ||
Experiment results | x | |||
Open source | x | x | x | |
Free | x | x | x |