AWS Systems Manager (SSM) integration brings Maintenance Window support and in-instance event notifications.
Important: By default, SSM support is disabled and must be explicitly enabled with ssm.enable.
Even when activated globally with ssm.enable, all features of the CloneSquad SSM module are disabled by default. Each feature must be activated explicitly with its own feature toggle.
Feature toggle: ssm.feature.maintenance_window
AWS SSM allows the definition of up to 50 Maintenance Windows (MW) per account and region. These MWs are scheduled periods of time dedicated to maintenance actions (patch management, backup, etc.) on fleets of EC2 instances.
CloneSquad extends native SSM Maintenance Window capabilities by looking at them as a source of scaling decisions and fleet behavior triggers.
During a Maintenance Window period, the following statements are true:
- No instance can be put in the draining state or shut down by CloneSquad.
- As a consequence, any instance started during the Maintenance Window period (manually through the console or by the auto-scaler) remains up until the end of it. More specifically, a Spot instance interruption is the only reason for a shutdown during a MW period, and the interrupted instance is replaced by starting a new instance as needed.
- Unhealthy instances entering a MW will be drained after it.
- Drained instances entering a MW will be stopped only after it.
- By default, all managed instances (Main fleet instances, including LightHouse ones, or subfleet instances) are started. In this default temporary configuration (i.e. with ec2.schedule.min_instance_count and ec2.schedule.desired_instance_count set to 100%), no unavailable instance replacement is performed, leading to full fleet stability.
  - If the user overrides the active Maintenance Window settings (see below) with an ec2.schedule.min_instance_count different from 100%, ec2.desired_instance_count is left at the user setting value, meaning that (auto)scalers continue to behave as expected but only upscale (i.e. can only start instances) during a MW.
By default, CloneSquad follows directions derived from SSM Maintenance Window (MW) objects named by convention.
Default SSM Maintenance Window naming convention:
- CS-GlobalDefaultMaintenanceWindow: The MW that will influence all CloneSquad deployments in an account/region,
- CS-{GroupName}: A MW affecting all instances managed by the CS deployment with clonesquad:group-name=={GroupName},
- CS-{GroupName}-Mainfleet: A MW affecting only Main fleet instances,
- CS-{GroupName}-Subfleet.__all__: A MW affecting all subfleet instances,
- CS-{GroupName}-Subfleet.{SubfleetName}: A MW affecting a specific subfleet.
IMPORTANT: SSM Maintenance Window objects MUST be tagged with clonesquad:group-name:{GroupName} to be usable by CloneSquad. This constraint does not apply to the Maintenance Window named CS-GlobalDefaultMaintenanceWindow.
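The naming convention and the tagging rule above can be sketched as follows. This is a minimal illustration, not CloneSquad's actual implementation: the function names `candidate_window_names` and `is_usable` are hypothetical, and the sketch assumes that `subfleet=None` denotes a Main fleet instance.

```python
from typing import Dict, List, Optional

GLOBAL_DEFAULT = "CS-GlobalDefaultMaintenanceWindow"


def candidate_window_names(group_name: str, subfleet: Optional[str] = None) -> List[str]:
    """Return the MW names that apply to an instance, by naming convention.

    Assumption: subfleet=None means the instance belongs to the Main fleet.
    """
    names = [GLOBAL_DEFAULT, f"CS-{group_name}"]
    if subfleet is None:
        names.append(f"CS-{group_name}-Mainfleet")
    else:
        names.append(f"CS-{group_name}-Subfleet.__all__")
        names.append(f"CS-{group_name}-Subfleet.{subfleet}")
    return names


def is_usable(window_name: str, tags: Dict[str, str], group_name: str) -> bool:
    """Apply the tagging rule: only the global default MW may be untagged."""
    if window_name == GLOBAL_DEFAULT:
        return True
    return tags.get("clonesquad:group-name") == group_name
```

For example, `candidate_window_names("prod", "batch")` lists the global default MW, `CS-prod`, `CS-prod-Subfleet.__all__` and `CS-prod-Subfleet.batch`.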
If multiple MWs match, they are cumulative (meaning the effective maintenance window periods are the union of all matching MWs).
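To illustrate what a union of periods means in practice, here is a small sketch that merges overlapping periods. The `Period` type and `union_of_periods` name are hypothetical, and CloneSquad may compute the union differently.

```python
from datetime import datetime
from typing import List, Tuple

Period = Tuple[datetime, datetime]  # (start, end) of one MW occurrence


def union_of_periods(periods: List[Period]) -> List[Period]:
    """Merge overlapping or touching periods into a minimal sorted list."""
    merged: List[Period] = []
    for start, end in sorted(periods):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous period: extend it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

With one MW running from 02:00 to 03:00 and another from 02:30 to 04:00, the effective period is 02:00 to 04:00.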
By default, CloneSquad starts instances 15 minutes (see ssm.feature.maintenance_window.start_ahead) before the next MW period to ensure that instances are ready and stable when the SSM MW period effectively begins. The CloneSquad MW decisions are technically implemented by generating a temporary set of overriding settings (that can be seen by the user through the API GW). At the end of a MW period, these temporary scaling settings are removed and all user settings defined in the CloneSquad configuration take full effect again.
Note: To avoid a burst of instance starts, a jitter is applied to the value defined in ssm.feature.maintenance_window.start_ahead. This jitter is deterministic and based on a hash function of the clonesquad:group-name tag value and the subfleet name.
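A deterministic, hash-based jitter of this kind could look like the sketch below. Everything specific here is an assumption made for illustration: the SHA-256 hash, the 300-second maximum jitter, the use of seconds as the unit, and the choice to add the jitter to the start-ahead value are not taken from CloneSquad.

```python
import hashlib


def start_ahead_with_jitter(
    start_ahead_s: int,
    group_name: str,
    subfleet: str = "",
    max_jitter_s: int = 300,  # assumed bound, not a CloneSquad setting
) -> int:
    """Return start_ahead_s plus a jitter derived from group and subfleet names.

    The same inputs always yield the same result, so each fleet keeps a
    stable offset while different fleets are spread over time.
    """
    key = f"{group_name}:{subfleet}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    jitter = int.from_bytes(digest[:4], "big") % (max_jitter_s + 1)
    return start_ahead_s + jitter
```

Because the jitter depends only on the names, repeated evaluations for the same fleet never shift its start time.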
One can change the default behaviors implied by a Maintenance Window period.
Temporary MW settings can be modified through tags on the MW objects: all tags starting with the string clonesquad:config: will be considered as overriding directives.
By default, entering a MW period means that the ec2.schedule.min_instance_count and ec2.schedule.desired_instance_count configuration settings are both temporarily overridden with the string value 100%: this makes all instances start (including LightHouse ones). This default behavior can be disabled by setting the tag clonesquad:config:disable-default-config to True on SSM Maintenance Window objects.
Tagging the MW object may be used to change these default settings (as well as any other valid setting).
Examples of tags and values to put on a MW object:
clonesquad:config:ec2.min_instance_count: 20
clonesquad:config:ec2.desired_instance_count: -1
clonesquad:config:subfleet.__all__.state: running
clonesquad:config:subfleet.ASubfleetName.ec2.schedule.min_instance_count: 66p
IMPORTANT: Due to tag value constraints, you cannot use the % character to express a percentage. Please use the letter p as a replacement (Ex: 100p means 100%).
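The tag-to-setting mapping described above can be sketched as follows. This is an illustration under assumptions: the `overrides_from_tags` name is hypothetical, and the sketch assumes that only values made of a number followed by `p` are treated as percentages.

```python
import re
from typing import Dict

PREFIX = "clonesquad:config:"
# Assumption: a percentage is a number immediately followed by the letter 'p'.
_PERCENT = re.compile(r"^(\d+(?:\.\d+)?)p$")


def overrides_from_tags(tags: Dict[str, str]) -> Dict[str, str]:
    """Extract overriding settings from the tags of a MW object."""
    overrides: Dict[str, str] = {}
    for key, value in tags.items():
        if not key.startswith(PREFIX):
            continue  # not an overriding directive
        match = _PERCENT.match(value)
        setting = key[len(PREFIX):]
        overrides[setting] = f"{match.group(1)}%" if match else value
    return overrides
```

A tag `clonesquad:config:subfleet.ASubfleetName.ec2.schedule.min_instance_count` with value `66p` yields the setting `subfleet.ASubfleetName.ec2.schedule.min_instance_count` with value `66%`, while a value such as `running` is passed through unchanged.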
CloneSquad is able to launch event scripts hosted in managed instances running an SSM agent that successfully registered to AWS SSM.
CloneSquad uses the AWS SSM RunCommand feature to upload the Linux helper script in memory and launch scripts with expected names and locations in the instance filesystem.
Note: Sending events to Windows instances is currently not supported.
These event scripts allow the user to react to some critical CloneSquad events to make operations smooth and reliable. These scripts are not meant to perform long-running tasks but to inform and probe about an event and the associated return status if required. As a general rule of thumb, if a user script returns a zero code, the event is assumed to be successfully taken into account by the instance. If the user script returns a non-zero code, the event is repeated until the event-specific timeout expires or a zero status code is received.
IMPORTANT: All launched user scripts MUST execute in less than 30 seconds or they will be forcibly terminated by the AWS SSM agent running in the EC2 instance.
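The repeat-until-acknowledged rule described above can be pictured with the following sketch. It is not CloneSquad code: the `send_until_acknowledged` name, the retry interval and the injectable `clock` and `sleep` parameters are assumptions made for illustration.

```python
import time
from typing import Callable


def send_until_acknowledged(
    run_script: Callable[[], int],
    event_timeout_s: float,
    retry_interval_s: float = 10.0,  # assumed value
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Repeat an event until the script returns 0 or the timeout expires.

    Returns True if the event was acknowledged with a zero return code.
    """
    deadline = clock() + event_timeout_s
    while True:
        if run_script() == 0:
            return True
        if clock() + retry_interval_s >= deadline:
            return False  # the next attempt would be past the timeout
        sleep(retry_interval_s)
```

The practical consequence for script authors is that a script may be called several times for the same event, so it should be safe to run repeatedly.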
Feature toggle: ssm.feature.events.ec2.maintenance_window_period
This event notifies an instance that it is entering/exiting a Maintenance Window period.
Scripts called depending on the event type:
- /etc/cs-ssm/enter-maintenance-window-period: Called when an instance enters a maintenance window period.
- /etc/cs-ssm/exit-maintenance-window-period: Called when an instance exits a maintenance window period.
Note: A just-started instance always receives this event as soon as possible to inform it of the current period type (i.e. this event is not only sent at the very moment of entering or exiting the maintenance window period).
Feature toggle: ssm.feature.events.ec2.scaling_state_changes
This event is sent when an instance enters the 'draining' state: the event warns that the instance has been selected to be shut down soon.
If it exists on the instance, the script /etc/cs-ssm/instance-scaling-state-change-draining is executed with the previous state as its first argument. If this script returns a non-zero code, it will be repeated.
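As an illustration, the core of such a draining script could be written as below, assuming that the script may be any executable file (here a Python program). The marker path `/run/myapp/draining` and the idea of having the application watch it are hypothetical.

```python
import os
from typing import List


def on_draining(argv: List[str], marker_path: str = "/run/myapp/draining") -> int:
    """Record that the instance is draining; return the script exit code.

    argv[1] is the previous state passed by CloneSquad as first argument.
    The marker file is a hypothetical signal that an application could watch.
    """
    previous_state = argv[1] if len(argv) > 1 else "unknown"
    try:
        os.makedirs(os.path.dirname(marker_path), exist_ok=True)
        with open(marker_path, "w", encoding="utf-8") as marker:
            marker.write(f"previous_state={previous_state}\n")
    except OSError:
        return 1  # non-zero: the event will be repeated
    return 0  # zero: the event is taken into account
```

In a real script, the result would be passed to `sys.exit(on_draining(sys.argv))`. The work done is deliberately small so that the script finishes well within the 30-second limit.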
The TCP port blocked list subfeature uses the SSM RunCommand to install in-memory iptables rules on an instance entering the draining state. These rules generate 'Connection refused' messages for new TCP connection attempts targeting the ports defined in ssm.feature.events.ec2.scaling_state_changes.draining.connection_refused_tcp_ports.
This subfeature is especially useful to fail the healthchecks of external load balancers while allowing currently active connections to finish. This feature is not useful if managed instances are served by ELB(s) under CloneSquad management (i.e. CloneSquad automatically unregisters drained instances from ELB(s)).
If the file /etc/cs-ssm/blocked-connections/extra-iptables-parameters.txt exists on the drained instance, it is read and its content is added to the iptables command line. In particular, it can be used to restrict the rule to some instance network interfaces (ex: -i eth0). See the Linux helper script for implementation details.
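To give an idea of the kind of rule involved, the sketch below builds iptables command lines that reject new TCP connections on given ports while leaving established ones untouched. It is an assumption about what such rules may look like, not a copy of the Linux helper script: the `build_reject_commands` name, the use of the INPUT chain, the `--syn` match and the position of the extra parameters are all illustrative choices.

```python
import shlex
from typing import List


def build_reject_commands(ports: List[int], extra_parameters: str = "") -> List[List[str]]:
    """Build one iptables command per port (the commands are not executed).

    extra_parameters is the content of the optional extra parameters file,
    for example "-i eth0".
    """
    extra = shlex.split(extra_parameters)
    commands = []
    for port in ports:
        commands.append(
            ["iptables", "-I", "INPUT", *extra,
             "-p", "tcp", "--syn", "--dport", str(port),
             # A TCP reset makes clients see 'Connection refused'.
             "-j", "REJECT", "--reject-with", "tcp-reset"]
        )
    return commands
```

Matching only SYN packets is what lets already-established connections continue: their packets do not carry a bare SYN flag and therefore never hit the rule.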
See also ssm.feature.events.ec2.instance_ready_for_shutdown to control the amount of time spent in the draining state.
Feature toggle: ssm.feature.events.ec2.instance_ready_for_shutdown
This event is sent as soon as an instance enters the 'draining' state. A zero return code is expected from the user script /etc/cs-ssm/instance-ready-for-shutdown as a prerequisite to shutting down the instance. CloneSquad will wait for up to one hour (see ssm.feature.events.ec2.instance_ready_for_shutdown.max_shutdown_delay). After this delay, the instance is forcibly shut down.
A typical use-case for this event is to perform housekeeping tasks and allow the instance to shut down gracefully. Example tasks range from breaking the lifeline of load balancer healthchecks, to waiting for all active connections to terminate, to backing up the machine.
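As one possible shape for the "wait for all active connections to terminate" case, the sketch below counts established IPv4 TCP connections on a local port by parsing lines in the Linux `/proc/net/tcp` format. The function names and the choice of port are hypothetical, and IPv6 connections (`/proc/net/tcp6`) are ignored for brevity.

```python
from typing import Iterable

ESTABLISHED = "01"  # state code used by the Linux kernel in /proc/net/tcp


def count_established(proc_net_tcp_lines: Iterable[str], local_port: int) -> int:
    """Count ESTABLISHED connections whose local port is local_port."""
    count = 0
    for line in proc_net_tcp_lines:
        fields = line.split()
        if len(fields) < 4 or ":" not in fields[1]:
            continue  # header line or malformed entry
        try:
            port = int(fields[1].rsplit(":", 1)[1], 16)  # port is hexadecimal
        except ValueError:
            continue
        if port == local_port and fields[3] == ESTABLISHED:
            count += 1
    return count


def ready_for_shutdown(proc_net_tcp_lines: Iterable[str], local_port: int) -> int:
    """Return 0 (ready) when no connection is left, 1 (not yet) otherwise."""
    return 0 if count_established(proc_net_tcp_lines, local_port) == 0 else 1
```

A real script would read the lines with `open("/proc/net/tcp")` and exit with the returned code. Since a non-zero code makes the event repeat, the script only needs to check the current situation quickly rather than wait for connections to close.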
Note: 15 mins before the end of the wait time (so, after 45 mins with default settings), the instance is placed in the 'unuseable' state. When an instance enters the 'unuseable' state, a User Notification of type new_instances_marked_as_unuseable is sent: the user may log and react to this event to detect instances needing servicing.
Feature toggle: ssm.feature.events.ec2.instance_ready_for_operation
This event is sent to probe whether a just-started instance is ready and can exit the 'initializing' state. If the user script /etc/cs-ssm/instance-ready-for-operation returns a zero code, CloneSquad assumes readiness and the instance is placed in the 'running' state.
When in the 'initializing' state, an instance will never be stopped by CloneSquad. As a typical use-case, this event can be leveraged to ensure that an instance is assumed 'ready' only once it has completed its boot sequence. By using this event, you can avoid CloneSquad prematurely shutting down an instance with a very long boot time.
By default, CloneSquad waits up to one hour (see ssm.feature.events.ec2.instance_ready_for_operation.max_initializing_time) to receive a zero return code. After this delay, the instance is set to the 'unuseable' state and will be forcibly shut down after 15 mins.
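A readiness probe can be as simple as checking for a file that marks the end of the boot sequence. The sketch below assumes the instance uses cloud-init, which creates `/var/lib/cloud/instance/boot-finished` when it completes; any other marker written by your own boot scripts would work the same way. The `ready_for_operation` name is hypothetical.

```python
import os


def ready_for_operation(marker_path: str = "/var/lib/cloud/instance/boot-finished") -> int:
    """Return 0 when the boot marker exists, 1 otherwise.

    Assumption: the marker file is created once the boot sequence is done.
    """
    return 0 if os.path.exists(marker_path) else 1
```

A real script would exit with the returned code. A return code of 1 simply causes the probe to be repeated, up to the maximum initializing time.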
Note: When an instance enters the 'unuseable' state, a User Notification of type new_instances_marked_as_unuseable is sent: the user may log and react to this event to detect instances needing servicing.