How to Resolve "EC2 Failing Instances" Alerts
Brandon Cruz edited this page Oct 14, 2023 · 11 revisions
| Term | Definition |
|---|---|
| EC2 Status Checks | See AWS documentation |
| Established Environment | Any of `prod`, `prod-sbx`, or `test` |
- Access to the CMS VPN
- Access to AWS
- An installation of the AWS CLI that is configured properly for access to the BFD/CMS AWS account
- An installation of `jq`
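
The prerequisites above can be checked before starting. The following is a minimal sketch, not part of the official runbook: the function names `missing_tools` and `check_prerequisites` are hypothetical, and it assumes that a successful `aws sts get-caller-identity` call is an adequate signal that the AWS CLI is configured. It does not verify VPN access or that the configured account is the BFD/CMS account.

```shell
#!/usr/bin/env bash

# Print the name of every listed tool that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Hypothetical helper: succeeds only if aws and jq are installed and the
# AWS CLI can authenticate with the currently configured credentials.
check_prerequisites() {
  local missing
  missing="$(missing_tools aws jq)"
  if [ -n "$missing" ]; then
    printf 'Missing required tools:\n%s\n' "$missing" >&2
    return 1
  fi
  # Fails if credentials are absent or expired (assumed to be a
  # sufficient configuration check for this runbook).
  aws sts get-caller-identity >/dev/null
}
```

After sourcing the snippet, running `check_prerequisites && echo ready` should print `ready` only when both tools are present and the credentials are valid.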
- Run the following command in `bash` or `zsh`:

      failed_instances="$(aws ec2 describe-instance-status --query 'InstanceStatuses[?SystemStatus.Status==`impaired` || InstanceStatus.Status==`impaired`]')"; [[ -n "${failed_instances#\[\]}" ]] && echo $failed_instances && aws ec2 describe-instances --no-cli-pager --query 'Reservations[].Instances[].[InstanceId,Tags[?Key==`Name`]| [0].Value]' --output table --instance-ids $(jq -n -r '$in.[].InstanceId' --argjson in "$failed_instances")

This command will output a list, in JSON, of all of the EC2 instances that have failed their EC2 Status Checks. Additionally, a table of Instance IDs to friendly names will be printed at the end.
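
For readers who find the one-liner hard to follow, its steps can be sketched as separate functions. This is an illustrative breakdown under a few assumptions, not a replacement for the command above: the function names are hypothetical, `--output json` is added on the assumption that the empty-list check needs JSON output regardless of the CLI's configured default, and the instance IDs are extracted by piping the JSON to `jq -r '.[].InstanceId'` rather than passing it via `--argjson`.

```shell
#!/usr/bin/env bash

# Succeeds when the given JSON text is anything other than an empty list.
# Mirrors the "${failed_instances#\[\]}" test in the one-liner.
has_failures() {
  [ -n "${1#\[\]}" ]
}

# Print one instance ID per line from a JSON list of instance statuses.
failed_instance_ids() {
  printf '%s' "$1" | jq -r '.[].InstanceId'
}

# Hypothetical wrapper: list impaired instances, then map IDs to Name tags.
report_failed_instances() {
  local failed ids
  failed="$(aws ec2 describe-instance-status --no-cli-pager \
    --query 'InstanceStatuses[?SystemStatus.Status==`impaired` || InstanceStatus.Status==`impaired`]' \
    --output json)" || return 1
  if ! has_failures "$failed"; then
    echo "No instances are failing their EC2 Status Checks."
    return 0
  fi
  printf '%s\n' "$failed"
  ids="$(failed_instance_ids "$failed")" || return 1
  # Word splitting of $ids is intentional (bash): one argument per ID.
  # shellcheck disable=SC2086
  aws ec2 describe-instances --no-cli-pager \
    --query 'Reservations[].Instances[].[InstanceId,Tags[?Key==`Name`]| [0].Value]' \
    --output table --instance-ids $ids
}
```

Nothing runs until `report_failed_instances` is called, so the snippet can be sourced safely into a `bash` session first.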
In all likelihood, the failing instances reported by the CLI command above will be some variant of `pipeline`.
One or more of the following cases will apply to the failing instances, so resolving this alert may require following the steps for several cases:
- If any of the failing instances are a variant of `pipeline`, `server`, and/or `migrator`, and those instances exist in our established environments:
  - The latest release must be re-deployed using the BFD Deployment Pipeline, specifying that AMIs are to be built again
  - OR, you must:
    - Re-build AMIs using the `bfd-build-apps` Jenkins Pipeline on `master`
    - For each failing instance type, re-`apply` its corresponding Terraservice in all established environments
      - For example, if one of the failing instances was a variant of `pipeline`, the `pipeline` Terraservice must be re-applied manually by running `terraform apply` in each established environment's corresponding Terraform workspace
- `server-load` instances:
  - The corresponding `bfd-run-server-load` Jenkins Pipeline run must be aborted via the "Abort or Proceed" choice in Jenkins
    - This will resolve the error in the short term, as the offending instance will be destroyed. However, the instance AMI will likely still need to be re-built
  - The `server-load` AMI should then be re-built using the `bfd-build-apps` Jenkins Pipeline
  - A new run of the `bfd-run-server-load` Jenkins Pipeline should then be started
- `server` instances that have been detached from their ASG in an established environment:
  - Should be destroyed, and a new instance detached from the ASG
- Instances of any variety deployed within an ephemeral environment:
  - The corresponding AMIs must be re-built and the ephemeral environment re-deployed
- Other instances that do not use our typical AMIs or use the Platinum AMI:
  - Should first be restarted to see whether that resolves the failing status checks
  - Otherwise, the corresponding AMI should be re-built and the instance re-deployed
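
The manual re-apply option from the first case above (running `terraform apply` for a Terraservice in each established environment) can be sketched as a loop. This is an illustration under stated assumptions rather than a documented BFD procedure: the function name `reapply_terraservice` and its directory argument are hypothetical, each Terraform workspace is assumed to be named after its environment, and the directory is assumed to have already been initialized with `terraform init`.

```shell
#!/usr/bin/env bash

# Established environments, per the definitions table at the top of the page.
ESTABLISHED_ENVS="prod prod-sbx test"

# Re-apply one Terraservice in every established environment.
# $1: path to the Terraservice's Terraform directory (hypothetical layout).
# Assumes each Terraform workspace is named after its environment.
reapply_terraservice() {
  local service_dir="$1" env
  # Word splitting of ESTABLISHED_ENVS is intentional (bash).
  for env in $ESTABLISHED_ENVS; do
    echo "Applying ${service_dir} in workspace ${env}"
    (
      cd "$service_dir" &&
      terraform workspace select "$env" &&
      terraform apply
    ) || return 1
  done
}
```

`terraform apply` is left interactive on purpose, so each environment's plan can still be reviewed before it is approved. The loop stops at the first environment whose apply fails.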