How to Resolve "EC2 Failing Instances" Alerts

Glossary

| Term | Definition |
| --- | --- |
| EC2 Status Checks | See AWS documentation |
| Established Environment | Any of `prod`, `prod-sbx`, or `test` |

Prerequisites

Access to the CMS VPN
Access to AWS
An installation of the AWS CLI that is configured properly for access to the BFD/CMS AWS account
An installation of jq
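
As a quick way to confirm the tooling prerequisites before starting, a preflight check along the following lines may help. This is a minimal sketch, not part of the official procedure: the function names `missing_tools` and `preflight` are hypothetical, and it assumes that a successful `aws sts get-caller-identity` call is a reasonable proxy for a correctly configured CLI. It cannot verify VPN access.

```shell
#!/usr/bin/env bash
# Preflight sketch (hypothetical helper names, not part of BFD tooling).

# Print the name of each given tool that is not on PATH, one per line.
# Prints nothing when every tool is present.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Return 0 only when aws and jq are installed and the AWS CLI can authenticate.
preflight() {
  local missing
  missing="$(missing_tools aws jq)"
  if [[ -n "$missing" ]]; then
    echo "Missing required tools:" >&2
    echo "$missing" >&2
    return 1
  fi
  # ASSUMPTION: this call failing means credentials are absent or expired.
  aws sts get-caller-identity --query Account --output text >/dev/null
}
```

After sourcing the file, running `preflight && echo "ready"` prints `ready` only when both tools are found and the credential check succeeds.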

Instructions

Determine which Instances are Failing

Run the following command in bash or zsh:

 failed_instances="$(aws ec2 describe-instance-status --output json --query 'InstanceStatuses[?SystemStatus.Status==`impaired` || InstanceStatus.Status==`impaired`]')"
 if [[ -n "${failed_instances#\[\]}" ]]; then
   printf '%s\n' "$failed_instances"
   aws ec2 describe-instances --no-cli-pager --query 'Reservations[].Instances[].[InstanceId,Tags[?Key==`Name`]| [0].Value]' --output table --instance-ids $(jq -n -r '$in[].InstanceId' --argjson in "$failed_instances")
 fi

This command outputs a JSON list of all EC2 instances that have failed their EC2 Status Checks, followed by a table mapping each instance ID to its `Name` tag. If no instances are failing, it prints nothing.
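
The JSON output distinguishes two checks: the system status check, which reflects the underlying AWS infrastructure, and the instance status check, which reflects the instance's own software and network configuration. To see at a glance which check each instance failed, a filter like the one below could be applied to the JSON held in `failed_instances`. This is a sketch only; the function name `summarize_failed_checks` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: summarize which status check(s) each impaired instance failed.
# Reads the JSON array returned by the describe-instance-status query on stdin
# and prints one tab-separated line per instance:
#   <instance-id>  system=<status>  instance=<status>
summarize_failed_checks() {
  jq -r '.[] | [.InstanceId,
                "system=\(.SystemStatus.Status)",
                "instance=\(.InstanceStatus.Status)"] | @tsv'
}
```

For example: `printf '%s\n' "$failed_instances" | summarize_failed_checks`.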

In all likelihood, the failing instances reported by the CLI command above will be some variant of pipeline.
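
Because the resolution steps depend on the instance type, it can help to map each `Name` tag from the table to one of the types this page discusses. The sketch below assumes the `Name` tag contains the type as a substring; that naming scheme is an assumption, and both the patterns and the function name `classify_instance` are hypothetical and should be adjusted to the real names.

```shell
#!/usr/bin/env bash
# Sketch: map an instance's Name tag to an instance type used in this runbook.
# ASSUMPTION: the Name tag contains the type as a substring (e.g. "pipeline").
# The patterns are hypothetical; adjust them to the actual naming scheme.
classify_instance() {
  case "$1" in
    *server-load*) echo "server-load" ;;  # must be tested before *server*
    *pipeline*)    echo "pipeline" ;;
    *migrator*)    echo "migrator" ;;
    *server*)      echo "server" ;;
    *)             echo "other" ;;
  esac
}
```

Anything classified as `other` would fall under the last case in the resolution steps.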

Resolve the Failing Instances

One or more of the following cases will apply to the failing instances, so resolving this alert may require following the steps for several cases:

If any of the failing instances are a variant of pipeline, server, and/or migrator, and those instances exist in our established environments:
- Re-deploy the latest release using the BFD Deployment Pipeline, specifying that AMIs be built again
- OR:
  1. Re-build the AMIs using the bfd-build-apps Jenkins Pipeline on master
  2. For each failing instance type, re-apply its corresponding Terraservice in all established environments
    - For example, if one of the failing instances was a variant of pipeline, the pipeline Terraservice must be re-applied manually by running terraform apply in each established environment's corresponding Terraform workspace
If any of the failing instances are server-load instances:
1. Abort the corresponding bfd-run-server-load Jenkins Pipeline run via the "Abort or Proceed" choice in Jenkins
  - This resolves the error in the short term, as the offending instance will be destroyed. However, the instance AMI will likely still need to be re-built
2. Re-build the server-load AMI using the bfd-build-apps Jenkins Pipeline
3. Start a new run of the bfd-run-server-load Jenkins Pipeline
If any of the failing instances are server instances that have been detached from their ASG in an established environment:
- Destroy the detached instance, then detach a new instance from the ASG
If any of the failing instances, of any variety, are deployed within an ephemeral environment:
- Re-build the corresponding AMIs and re-deploy the ephemeral environment
If any of the failing instances do not use our typical AMIs, or use the Platinum AMI:
- First, restart the instance to see whether a restart resolves the failure
- If it does not, re-build the corresponding AMI and re-deploy the instance
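
Two of the manual steps above map directly onto CLI commands: re-applying a Terraservice in each established environment, and restarting an instance. The sketch below illustrates both under stated assumptions and is not the team's official tooling. The function names are hypothetical, each Terraform workspace is assumed to be named after its environment, the Terraservice directory is passed in by the caller, and "restart" is interpreted as an EC2 reboot. Commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Sketch of two manual remediation steps (hypothetical helper names).
# Commands are only printed unless DRY_RUN=0 is set in the environment.
DRY_RUN="${DRY_RUN:-1}"

# Print the command; execute it only when DRY_RUN=0.
run() {
  printf '+ %s\n' "$*"
  if [[ "$DRY_RUN" == "0" ]]; then "$@"; fi
}

# Re-apply one Terraservice in every established environment.
# $1: path to the Terraservice's Terraform directory (supplied by the caller).
# ASSUMPTION: each Terraform workspace is named after its environment.
# ASSUMPTION: test is applied first and prod last; reorder as appropriate.
reapply_terraservice() {
  local dir="$1" env
  for env in test prod-sbx prod; do
    run terraform -chdir="$dir" workspace select "$env" || return 1
    run terraform -chdir="$dir" apply || return 1
  done
}

# Reboot one instance, then wait until both EC2 status checks report "ok".
# ASSUMPTION: "restart" means an in-place reboot rather than a stop and start.
restart_instance() {
  local id="$1"
  run aws ec2 reboot-instances --instance-ids "$id" || return 1
  run aws ec2 wait instance-status-ok --instance-ids "$id"
}
```

Running `reapply_terraservice <path-to-terraservice>` in the default dry-run mode prints the six Terraform commands without executing them, which allows the sequence to be reviewed before setting `DRY_RUN=0`. Any variables that a given Terraservice requires at apply time are not covered here.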
