Chaos Monkey provides a convenient way to disrupt CDAP and Hadoop services on a cluster.
Disruptions can be scheduled, randomized, or issued on command.
To start the Chaos Monkey daemon and HTTP server, set the configurations in chaos-monkey-site.xml and run ChaosMonkeyMain.
Disruptions setup
By default, the following disruptions will be available to each service:
- start
- restart
- stop
- terminate
- kill
- rolling-restart
Custom disruptions can be added by extending the Disruption class and associating them with a service. A custom disruption is started by calling ClusterDisruptor.disrupt(serviceName, disruptionName, actionArguments), where disruptionName is the value returned by Disruption.getName(). Each disruption receives a collection of RemoteProcess objects selected according to actionArguments, which it can use to execute commands over SSH. To add custom disruptions to a service, set the following property:
- {service}.disruptions - Class paths of custom disruptions, separated by commas
Initialize a service for Chaos Monkey
Any configured service can be controlled through ClusterDisruptor or the REST endpoints. To configure a service for Chaos Monkey, provide either custom disruptions or a pid file for the default disruptions:
- {service}.pidFile - Path to the .pid file of the service
Configurations for scheduled disruptions
These additional properties can be set for a given service to schedule disruptions:
- {service}.interval - Number of seconds between each disruption
- {service}.killProbability - Number between 0 and 1 representing the probability that a kill occurs in each iteration
- {service}.stopProbability - Number between 0 and 1 representing the probability that a stop occurs in each iteration
- {service}.restartProbability - Number between 0 and 1 representing the probability that a restart occurs in each iteration
- {service}.minNodesPerIteration - Minimum number of nodes affected each iteration.
- {service}.maxNodesPerIteration - Maximum number of nodes affected each iteration.
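As a rough illustration of how the scheduling properties fit together, the sketch below assembles them for one service and applies basic sanity checks. Only the property names come from the list above. The helper name, the example service name, and the check that the minimum node count does not exceed the maximum are assumptions made for this example, not documented Chaos Monkey behavior.

```python
def scheduled_disruption_properties(service, interval, kill=0.0, stop=0.0,
                                    restart=0.0, min_nodes=1, max_nodes=1):
    """Build the scheduled-disruption properties for one service.

    Hypothetical helper: it only assembles property names and values.
    Writing them into chaos-monkey-site.xml is left to the reader.
    """
    probabilities = {"kill": kill, "stop": stop, "restart": restart}
    for name, value in probabilities.items():
        # The property list describes each probability as a number from 0 to 1.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}Probability must be between 0 and 1, got {value}")
    # Assumption: a minimum larger than the maximum is treated as a mistake.
    if min_nodes > max_nodes:
        raise ValueError("minNodesPerIteration exceeds maxNodesPerIteration")
    return {
        f"{service}.interval": str(interval),
        f"{service}.killProbability": str(kill),
        f"{service}.stopProbability": str(stop),
        f"{service}.restartProbability": str(restart),
        f"{service}.minNodesPerIteration": str(min_nodes),
        f"{service}.maxNodesPerIteration": str(max_nodes),
    }
```

For example, scheduled_disruption_properties("hbase-master", 30, kill=0.1, max_nodes=2) returns six entries whose keys all start with "hbase-master.", where "hbase-master" is only a placeholder service name.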
Cluster information collector
By default, Chaos Monkey retrieves cluster information from Coopr.
To get cluster information from Coopr, the following configurations need to be set:
- cluster.info.collector.coopr.clusterId
- cluster.info.collector.coopr.tenantId
- cluster.info.collector.coopr.server.uri
To get cluster information from another source, provide a plugin that implements ClusterInfoCollector and set the following configuration:
- cluster.info.collector.class - Class path of the ClusterInfoCollector implementation
Additional properties can be passed to the ClusterInfoCollector implementation. Setting the property cluster.info.collector.{propertyName} in the configuration makes {propertyName} available in the properties map passed to the initialize method.
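The prefix convention can be pictured with a short sketch. This is not the Chaos Monkey implementation, only a Python illustration of the described mapping from cluster.info.collector.{propertyName} to {propertyName}. The function name and the sample keys are hypothetical, and the sketch does not special-case any key, so cluster.info.collector.class would also be mapped (to "class"); whether the real code does the same is not stated here.

```python
PREFIX = "cluster.info.collector."


def collector_properties(configuration):
    """Return the entries under PREFIX with the prefix stripped.

    Illustrative only: mirrors the documented naming convention and
    ignores every key that does not start with the prefix.
    """
    return {
        key[len(PREFIX):]: value
        for key, value in configuration.items()
        if key.startswith(PREFIX)
    }
```

Given a configuration containing a hypothetical key cluster.info.collector.endpoint, the resulting map would hold that value under the key endpoint.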
SSH configurations
- username - Username of the SSH profile (if different from the system user)
- keyPassphrase - Passphrase for the private key, if applicable
- privateKey - Path to the private key (default locations are checked unless specified)
The HTTP server listens on port 11020 and exposes the following endpoints:
POST /v1/services/{service}/{action}
{action} is one of stop, kill, terminate, start, restart, or rolling-restart.
By default, the action is performed on all nodes configured with the service. To specify the affected nodes, include one of the following request bodies:
{ nodes:[<nodeAddress1>,<nodeAddress2>...] }
{ percentage:<numberFrom0To100> }
{ count:<numberOfNodes> }
In addition to the above request bodies, rolling-restart can also be configured with:
{ restartTime:<restartTimeSeconds>, delay:<delaySeconds> }
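A minimal client-side sketch of the POST endpoint, using only the Python standard library. The host name, the helper name, the Content-Type header, and the assumption that the server accepts standard JSON with quoted keys are not taken from this document; only the port, the path, and the body field names are.

```python
import json
import urllib.request

# Assumption: the server runs on the local machine; only port 11020 is documented.
BASE_URL = "http://localhost:11020"


def build_disruption_request(service, action, body=None, base_url=BASE_URL):
    """Build, but do not send, a POST request for /v1/services/{service}/{action}.

    `body` is an optional dict such as {"percentage": 50} or {"count": 2}.
    When it is omitted, no request body is sent, which the text above
    describes as acting on all nodes configured with the service.
    """
    url = f"{base_url}/v1/services/{service}/{action}"
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(url, data=data, method="POST")
    if data is not None:
        # Assumption: the server expects a JSON content type.
        request.add_header("Content-Type", "application/json")
    return request
```

A request built this way can be sent with urllib.request.urlopen(request). For a rolling restart, the body might look like {"restartTime": 30, "delay": 10}, where the numbers are placeholders.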
GET /v1/nodes/{ip}/status
Gets the status of all configured services on the given address.
GET /v1/status
Gets the status of all configured services on every node of the cluster.
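The two status endpoints can be queried with a small helper such as the one below. It is a sketch under assumptions: the host name and function names are hypothetical, and because the response format is not described here, the body is returned as raw text rather than parsed.

```python
import urllib.request

# Assumption: the server runs on the local machine; only port 11020 is documented.
BASE_URL = "http://localhost:11020"


def status_url(ip=None, base_url=BASE_URL):
    """Return the cluster-wide status URL, or the per-node URL when `ip` is given."""
    if ip is None:
        return f"{base_url}/v1/status"
    return f"{base_url}/v1/nodes/{ip}/status"


def fetch_status(ip=None, base_url=BASE_URL):
    """Send the GET request and return the response body as text."""
    with urllib.request.urlopen(status_url(ip, base_url)) as response:
        return response.read().decode("utf-8")
```

Calling fetch_status() requires a running Chaos Monkey server; status_url() alone can be used to check the paths.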