Skip to content

How to Resolve "EC2 Failing Instances" Alerts

Brandon Cruz edited this page Oct 14, 2023 · 11 revisions

How to Resolve "EC2 Failing Instances" Alerts

Glossary

Term Definition
EC2 Status Checks See AWS documentation
Established Environment Any of prod, prod-sbx, or test

Prerequisites

  • Access to the CMS VPN
  • Access to AWS
  • An installation of the AWS CLI that is configured properly for access to the BFD/CMS AWS account
  • An installation of jq

Instructions

Determine which Instances are Failing

  1. Run the following command in bash or zsh:

     failed_instances="$(aws ec2 describe-instance-status --query 'InstanceStatuses[?SystemStatus.Status==`impaired` || InstanceStatus.Status==`impaired`]')"; [[ -n "${failed_instances#\[\]}" ]] && echo $failed_instances && aws ec2 describe-instances --no-cli-pager --query 'Reservations[].Instances[].[InstanceId,Tags[?Key==`Name`]| [0].Value]' --output table --instance-ids $(jq -n -r '$in.[].InstanceId' --argjson in "$failed_instances")

    This command will output a list, in JSON, of all of the EC2 instances that have failed their EC2 Status Checks. Additionally, a table of Instance IDs to friendly names will be printed at the end.

    In all likelihood, the failing instances reported by the CLI command above will be some variant of pipeline.

Resolve the Failing Instances

One-or-more of the following cases will apply to the failing instances, so resolving this Alarm may require following multiple steps:

  • If any of the failing instances are a variant of pipeline, server, and/or migrator, and those instances exist in our established environments:
    • The latest release must be re-deployed using the BFD Deployment Pipeline specifying AMIs to be built again
    • OR, you must:
      1. Re-build AMIs using the bfd-build-apps Jenkins Pipeline on master
      2. For each failing instance type, re-apply its corresponding Terraservice in all established environments
        • For example, if the one of failing instances was a variant of pipeline, the pipeline Terraservice must be re-apply'd manually by terraform applying it in each established environment's corresponding Terraform workspace
  • server-load instances:
    1. The corresponding bfd-run-server-load Jenkins Pipeline run must be aborted via the "Abort or Proceed" choice in Jenkins
      • This will resolve the error in the short-term, as the offending instance will be destroyed. However, the instance AMI likely will still need to be re-built
    2. The server-load AMI re-built using the bfd-build-apps Jenkins Pipeline
    3. A new run of the bfd-run-server-load Jenkins Pipeline should then be started
  • server instances that have been detached from their ASG in an established environment:
    • Should be destroyed, and a new instance detached
  • Instances of any variety deployed within an ephemeral environment:
    • The corresponding AMIs must be re-built and the ephemeral environment re-deployed
  • Other instances that do not use our typical AMIs or use the Platinum AMI:
    • Should be first restarted to see if that resolves it
    • Or, the corresponding AMI re-built and the instance re-deployed
Clone this wiki locally