Added quick version checks to risky cluster-* and config-* cmds (#160)
* Added quick version checks to risky cluster-* and config-* cmds

Signed-off-by: Chris Helma <[email protected]>

* README Update; fix for cluster-destroy

Signed-off-by: Chris Helma <[email protected]>

---------

Signed-off-by: Chris Helma <[email protected]>
chelma authored Jan 24, 2024
1 parent 6c70ed4 commit 136f70a
Showing 14 changed files with 401 additions and 73 deletions.
31 changes: 31 additions & 0 deletions README.md
@@ -21,6 +21,7 @@ The CDK is used to perform infrastructure specification, setup, management, and
- [How to shell into the ECS containers](#how-to-shell-into-the-ecs-containers)
- [Setting Up Demo Traffic Generation](#setting-up-demo-traffic-generation)
- [Account Limits, Scaling, and Other Concerns](#account-limits-scaling-and-other-concerns)
- [Troubleshooting](#troubleshooting)
- [Generally useful NPM/CDK commands](#generally-useful-npmcdk-commands)
- [Contribute](#contribute)
- [Maintainers](#maintainers)
@@ -376,6 +377,36 @@ Here are some account limits you'll want to watch out for:

## Troubleshooting

### AWS AIO version mismatch

The AWS AIO project contains many components that must operate together, and each component embeds assumptions about how the others will behave. We use a concept called the "AWS AIO Version" to determine whether the various components of the solution should be able to operate together successfully.

Most importantly, the version of the CLI currently installed must be compatible with the version of the Arkime Cluster it is operating against. If the CLI and Arkime Cluster are on the same AWS AIO major version (e.g. v7.x.x), they should be interoperable. If they are not on the same major version, it is possible (or even likely) that performing CLI operations against the Arkime Cluster is unsafe, and those operations should be avoided. To help protect deployed Arkime Clusters, the CLI compares its own AWS AIO version with that of the Arkime Cluster before sensitive operations and aborts if it detects a mismatch or cannot determine whether one exists (which is itself likely a sign of a mismatch).
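
For reference, the risky `cluster-*` and `config-*` commands gate themselves on this check before doing any work. Below is a minimal sketch of that pattern; the wrapper function name is illustrative only, while the real call sites live in `manage_arkime/commands/`:

```
import logging

import core.versioning as ver

logger = logging.getLogger(__name__)

def run_risky_command(cluster_name, aws_provider):
    # Bail out before touching the Cluster if the CLI and Cluster AWS AIO versions don't match,
    # or if the Cluster's version can't be determined at all.
    try:
        ver.confirm_aws_aio_version_compatibility(cluster_name, aws_provider)
    except (ver.CliClusterVersionMismatch, ver.CaptureViewerVersionMismatch, ver.UnableToRetrieveClusterVersion) as e:
        logger.error(e)
        logger.warning("Aborting...")
        return
    # ...proceed with the actual cluster-*/config-* operation...
```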

If you discover your installed CLI is not compatible with your Arkime Cluster, check out the latest CLI version whose major version matches the Cluster's AWS AIO version. You can find the version of your installed CLI using git tags like so:

```
git describe --tags
```

You can retrieve a listing of CLI versions using git tags as well:

```
git ls-remote --tags [email protected]:arkime/aws-aio.git
```
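
To narrow that listing down to a single major version (v2 here, purely as an example), you can filter it:

```
git ls-remote --tags [email protected]:arkime/aws-aio.git | grep "refs/tags/v2\."
```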

If the CLI detects a version mismatch, it should inform you of the AWS AIO version of the Arkime Cluster you tried to operate against. However, you can also find the AWS AIO version of deployed Arkime Clusters in your account/region using the `clusters-list` command:

```
./manage_arkime.py clusters-list
```

Once you determine the correct major version to use with your Arkime Cluster, you can check out the latest minor/patch version under it using git and operate against your Arkime Cluster as planned:

```
git checkout v2.2.0
```

### "This CDK CLI is not compatible with the CDK library"
This error is caused by a mismatch between the Node packages `aws-cdk` (the CLI) and `aws-cdk-lib` (the CDK library), which can occasionally happen if one package is upgraded without the other. You'll see an error message like the following in the manage_arkime.log file:

25 changes: 8 additions & 17 deletions manage_arkime/cdk_interactions/cdk_context.py
@@ -4,9 +4,7 @@
from typing import Dict, List

import core.constants as constants
from core.capacity_planning import (CaptureNodesPlan, ViewerNodesPlan, VpcPlan, ClusterPlan, DataNodesPlan, EcsSysResourcePlan,
MasterNodesPlan, OSDomainPlan, S3Plan, DEFAULT_S3_STORAGE_CLASS, DEFAULT_VPC_CIDR,
DEFAULT_CAPTURE_PUBLIC_MASK)
from core.capacity_planning import ClusterPlan
from core.user_config import UserConfig

@dataclass
@@ -52,27 +50,20 @@ def generate_cluster_create_context(name: str, viewer_cert_arn: str, cluster_pla
create_context[constants.CDK_CONTEXT_CMD_VAR] = constants.CMD_cluster_create
return create_context

def generate_cluster_destroy_context(name: str, stack_names: ClusterStackNames, has_viewer_vpc: bool) -> Dict[str, str]:
# Hardcode these value because it saves us some implementation headaches and it doesn't matter what it is. Since
# we're tearing down the Cfn stack in which it would be used, the operation either succeeds they are irrelevant
# or it fails/rolls back they are irrelevant.
def generate_cluster_destroy_context(name: str, stack_names: ClusterStackNames, cluster_plan: ClusterPlan) -> Dict[str, str]:
# Hardcode most of these values because it saves us some implementation headaches, and their contents don't matter.
# Since we're tearing down the Cfn stack in which they would be used, they are irrelevant whether the operation
# succeeds or fails/rolls back.
#
# We have to pass the Cluster Plan or else the CDK will fail to start up properly
fake_arn = "N/A"
fake_cluster_plan = ClusterPlan(
CaptureNodesPlan("m5.xlarge", 1, 2, 1),
VpcPlan(DEFAULT_VPC_CIDR, ["us-fake-1"], DEFAULT_CAPTURE_PUBLIC_MASK),
EcsSysResourcePlan(1, 1),
OSDomainPlan(DataNodesPlan(2, "t3.small.search", 100), MasterNodesPlan(3, "m6g.large.search")),
S3Plan(DEFAULT_S3_STORAGE_CLASS, 1),
ViewerNodesPlan(4, 2),
VpcPlan(DEFAULT_VPC_CIDR, ["us-fake-1"], DEFAULT_CAPTURE_PUBLIC_MASK) if has_viewer_vpc else None,
)
fake_user_config = UserConfig(1, 1, 1, 1, 1)
fake_bucket_name = ""

destroy_context = _generate_cluster_context(
name,
fake_arn,
fake_cluster_plan,
cluster_plan,
fake_user_config,
fake_bucket_name,
stack_names
16 changes: 12 additions & 4 deletions manage_arkime/commands/cluster_create.py
@@ -25,7 +25,7 @@
S3Plan, DEFAULT_S3_STORAGE_CLASS, DEFAULT_S3_STORAGE_DAYS, DEFAULT_HISTORY_DAYS,
CaptureNodesPlan, ViewerNodesPlan, DataNodesPlan, EcsSysResourcePlan, MasterNodesPlan, OSDomainPlan,
get_viewer_vpc_plan)
from core.versioning import get_version_info
import core.versioning as ver
from core.user_config import UserConfig

logger = logging.getLogger(__name__)
@@ -38,14 +38,22 @@ def cmd_cluster_create(profile: str, region: str, name: str, expected_traffic: f
aws_env = aws_provider.get_aws_env()
cdk_client = CdkClient(aws_env)

# Confirm the CLI and Cluster versions are compatible
is_initial_invocation = _is_initial_invocation(name, aws_provider)
if not is_initial_invocation:
try:
ver.confirm_aws_aio_version_compatibility(name, aws_provider)
except (ver.CliClusterVersionMismatch, ver.CaptureViewerVersionMismatch, ver.UnableToRetrieveClusterVersion) as e:
logger.error(e)
logger.warning("Aborting...")
return

# Generate our capacity plan, then confirm it's what the user expected and it's safe to proceed with the operation
previous_user_config = _get_previous_user_config(name, aws_provider)
next_user_config = _get_next_user_config(name, expected_traffic, spi_days, history_days, replicas, pcap_days, aws_provider)
previous_capacity_plan = _get_previous_capacity_plan(name, aws_provider)
next_capacity_plan = _get_next_capacity_plan(next_user_config, previous_capacity_plan, capture_cidr, viewer_cidr, aws_provider)

is_initial_invocation = _is_initial_invocation(name, aws_provider)

if not _should_proceed_with_operation(is_initial_invocation, previous_capacity_plan, next_capacity_plan, previous_user_config,
next_user_config, preconfirm_usage, capture_cidr, viewer_cidr):
return
@@ -250,7 +258,7 @@ def _upload_arkime_config_if_necessary(cluster_name: str, bucket_name: str, s3_k
# Generate its metadata
next_metadata = config_wrangling.ConfigDetails(
s3=config_wrangling.S3Details(bucket_name, s3_key),
version=get_version_info(archive)
version=ver.get_version_info(archive)
)

# Upload the archive to S3
8 changes: 8 additions & 0 deletions manage_arkime/commands/cluster_deregister_vpc.py
@@ -6,6 +6,7 @@
import aws_interactions.ssm_operations as ssm_ops
import core.constants as constants
from core.cross_account_wrangling import CrossAccountAssociation, remove_vpce_permissions
import core.versioning as ver

logger = logging.getLogger(__name__)

@@ -15,6 +16,13 @@ def cmd_cluster_deregister_vpc(profile: str, region: str, cluster_name: str, vpc
logger.info("Deregistering the VPC with the Cluster...")
aws_provider = AwsClientProvider(aws_profile=profile, aws_region=region)

try:
ver.confirm_aws_aio_version_compatibility(cluster_name, aws_provider)
except (ver.CliClusterVersionMismatch, ver.CaptureViewerVersionMismatch, ver.UnableToRetrieveClusterVersion) as e:
logger.error(e)
logger.warning("Aborting...")
return

# Confirm the cross-account link exists
try:
ssm_param_name = constants.get_cluster_vpc_cross_account_ssm_param_name(cluster_name, vpc_id)
18 changes: 11 additions & 7 deletions manage_arkime/commands/cluster_destroy.py
@@ -9,6 +9,7 @@
from cdk_interactions.cdk_client import CdkClient
from core.capacity_planning import ClusterPlan
import core.constants as constants
import core.versioning as ver
import cdk_interactions.cdk_context as context

logger = logging.getLogger(__name__)
@@ -25,12 +26,15 @@ def cmd_cluster_destroy(profile: str, region: str, name: str, destroy_everything
cdk_client = CdkClient(aws_provider.get_aws_env())

try:
cluster_plan_str = get_ssm_param_json_value(constants.get_cluster_ssm_param_name(name), "capacityPlan", aws_provider)
cluster_plan = ClusterPlan.from_dict(cluster_plan_str)
except ParamDoesNotExist:
logger.warning(f"The Cluster {name} does not appear to exist; aborting...")
ver.confirm_aws_aio_version_compatibility(name, aws_provider)
except (ver.CliClusterVersionMismatch, ver.CaptureViewerVersionMismatch, ver.UnableToRetrieveClusterVersion) as e:
logger.error(e)
logger.warning("Aborting...")
return

cluster_plan_str = get_ssm_param_json_value(constants.get_cluster_ssm_param_name(name), "capacityPlan", aws_provider)
cluster_plan = ClusterPlan.from_dict(cluster_plan_str)

vpcs_search_path = f"{constants.get_cluster_ssm_param_name(name)}/vpcs"
monitored_vpcs = get_ssm_names_by_path(vpcs_search_path, aws_provider)
if monitored_vpcs:
Expand All @@ -53,7 +57,7 @@ def cmd_cluster_destroy(profile: str, region: str, name: str, destroy_everything

has_viewer_vpc = cluster_plan.viewerVpc is not None
stacks_to_destroy = _get_stacks_to_destroy(name, destroy_everything, has_viewer_vpc)
destroy_context = _get_cdk_context(name, has_viewer_vpc)
destroy_context = _get_cdk_context(name, cluster_plan)

cdk_client.destroy(stacks_to_destroy, context=destroy_context)

@@ -131,7 +135,7 @@ def _get_stacks_to_destroy(cluster_name: str, destroy_everything: bool, has_view

return stacks

def _get_cdk_context(cluster_name: str, has_viewer_vpc: bool) -> Dict[str, any]:
def _get_cdk_context(cluster_name: str, cluster_plan: ClusterPlan) -> Dict[str, any]:
stack_names = context.ClusterStackNames(
captureBucket=constants.get_capture_bucket_stack_name(cluster_name),
captureNodes=constants.get_capture_nodes_stack_name(cluster_name),
@@ -141,4 +145,4 @@ def _get_cdk_context(cluster_name: str, has_viewer_vpc: bool) -> Dict[str, any]:
viewerNodes=constants.get_viewer_nodes_stack_name(cluster_name),
viewerVpc=constants.get_viewer_vpc_stack_name(cluster_name),
)
return context.generate_cluster_destroy_context(cluster_name, stack_names, has_viewer_vpc)
return context.generate_cluster_destroy_context(cluster_name, stack_names, cluster_plan)
18 changes: 10 additions & 8 deletions manage_arkime/commands/cluster_register_vpc.py
@@ -5,6 +5,7 @@
import aws_interactions.ssm_operations as ssm_ops
import core.constants as constants
from core.cross_account_wrangling import CrossAccountAssociation, ensure_cross_account_role_exists, add_vpce_permissions
import core.versioning as ver

logger = logging.getLogger(__name__)

@@ -15,18 +16,19 @@ def cmd_cluster_register_vpc(profile: str, region: str, cluster_name: str, vpc_a
aws_provider = AwsClientProvider(aws_profile=profile, aws_region=region)
aws_env = aws_provider.get_aws_env()

# Confirm the cluster exists
try:
vpce_service_id = ssm_ops.get_ssm_param_json_value(
constants.get_cluster_ssm_param_name(cluster_name),
"vpceServiceId",
aws_provider
)
except ssm_ops.ParamDoesNotExist:
logger.error(f"The cluster {cluster_name} does not exist; try using the clusters-list command to see the clusters you have created.")
ver.confirm_aws_aio_version_compatibility(cluster_name, aws_provider)
except (ver.CliClusterVersionMismatch, ver.CaptureViewerVersionMismatch, ver.UnableToRetrieveClusterVersion) as e:
logger.error(e)
logger.warning("Aborting...")
return

vpce_service_id = ssm_ops.get_ssm_param_json_value(
constants.get_cluster_ssm_param_name(cluster_name),
"vpceServiceId",
aws_provider
)

# Create the cross account IAM role for the VPC account to access the Cluster account
role_name = ensure_cross_account_role_exists(cluster_name, vpc_account_id, vpc_id, aws_provider, aws_env)

21 changes: 15 additions & 6 deletions manage_arkime/commands/config_update.py
@@ -11,7 +11,7 @@
import aws_interactions.ssm_operations as ssm_ops
import core.constants as constants
from core.local_file import LocalFile, S3File
from core.versioning import get_version_info
import core.versioning as ver

logger = logging.getLogger(__name__)

@@ -23,13 +23,22 @@ def cmd_config_update(profile: str, region: str, cluster_name: str, capture: boo
no_component_specified = not (capture or viewer)
if config_version and (not one_component_specified):
logger.error("If you specify a specific config version to deploy, you must indicate whether to deploy it to"
+ " either the Capture or Viewer nodes. Aborting...")
+ " either the Capture or Viewer nodes.")
logger.warning("Aborting...")
exit(1)

# Update Capture/Viewer config in the cloud, if there's a new version locally. Bounce the associated ECS Tasks
# if we updated the configuration so that they pick it up.
aws_provider = AwsClientProvider(aws_profile=profile, aws_region=region)
aws_env = aws_provider.get_aws_env()

try:
ver.confirm_aws_aio_version_compatibility(cluster_name, aws_provider)
except (ver.CliClusterVersionMismatch, ver.CaptureViewerVersionMismatch, ver.UnableToRetrieveClusterVersion) as e:
logger.error(e)
logger.warning("Aborting...")
return

# Update Capture/Viewer config in the cloud, if there's a new version locally. Bounce the associated ECS Tasks
# if we updated the configuration so that they pick it up.
bucket_name = constants.get_config_bucket_name(aws_env.aws_account, aws_env.aws_region, cluster_name)

logger.info("Updating Arkime config for Capture Nodes, if necessary...")
@@ -92,7 +101,7 @@ def _update_config_if_necessary(cluster_name: str, bucket_name: str, s3_key_prov
# Create the local config archive and its metadata
aws_env = aws_provider.get_aws_env()
archive = archive_provider(cluster_name, aws_env)
archive_md5 = get_version_info(archive).md5_version
archive_md5 = ver.get_version_info(archive).md5_version

# Confirm the requested version exists, if specified
if switch_to_version:
@@ -149,7 +158,7 @@ def _update_config_if_necessary(cluster_name: str, bucket_name: str, s3_key_prov
if switch_to_version
else config_wrangling.ConfigDetails(
s3=config_wrangling.S3Details(bucket_name, s3_key_provider(next_config_version)),
version=get_version_info(archive, config_version=next_config_version),
version=ver.get_version_info(archive, config_version=next_config_version),
previous=cloud_config_details
)
)
59 changes: 59 additions & 0 deletions manage_arkime/core/versioning.py
@@ -1,11 +1,19 @@
from dataclasses import dataclass
from datetime import datetime, timezone
import hashlib
import json
import logging
from typing import Dict

import arkime_interactions.config_wrangling as config_wrangling
from aws_interactions.aws_client_provider import AwsClientProvider
import aws_interactions.ssm_operations as ssm_ops
import core.constants as constants
from core.local_file import LocalFile
from core.shell_interactions import call_shell_command

logger = logging.getLogger(__name__)

"""
Manually updated/managed version number. Increment if/when a backwards incompatible change is made.
"""
@@ -74,4 +82,55 @@ def get_version_info(config_file: LocalFile, config_version: str = None) -> Vers
datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M:%S')
)

class UnableToRetrieveClusterVersion(Exception):
def __init__(self, cluster_name: str, cli_version: int):
super().__init__(f"It appears the cluster {cluster_name} does not exist. There's also a chance the AWS AIO version"
+ f" of the CLI ({cli_version}) is incompatible with your Cluster. If you're confident the Cluster"
+ " exists, you can try checking the AWS AIO version of your cluster using the clusters-list"
+ " command. The CLI and Cluster versions must match.")

class CaptureViewerVersionMismatch(Exception):
def __init__(self, capture_version: int, viewer_version: int):
super().__init__(f"The AWS AIO versions of your Capture ({capture_version}) and Viewer ({viewer_version})"
+ " components do not match. This is unexpected and should not happen. Please cut us a"
+ " ticket at: https://github.com/arkime/aws-aio/issues/new")

class CliClusterVersionMismatch(Exception):
def __init__(self, cli_version: int, cluster_version: int):
super().__init__(f"The AWS AIO versions of your CLI ({cli_version}) and Cluster ({cluster_version}) do not"
+ " match. This is likely to result in unexpected behavior. Please change your CLI to the"
+ f" latest minor version under the major version ({cluster_version}). Check out the"
+ " following README section for more details:"
+ " https://github.com/arkime/aws-aio#aws-aio-version-mismatch")

def confirm_aws_aio_version_compatibility(cluster_name: str, aws_provider: AwsClientProvider,
cli_version: int = AWS_AIO_VERSION):
# Unfortunately, it currently appears impossible to distinguish between the scenarios where the cluster doesn't
# exist and the cluster exists but is a different version. In either case, we could get the ParamDoesNotExist
# exception.
try:
raw_capture_details_val = ssm_ops.get_ssm_param_value(
constants.get_capture_config_details_ssm_param_name(cluster_name),
aws_provider
)
capture_config_details = config_wrangling.ConfigDetails.from_dict(json.loads(raw_capture_details_val))

raw_viewer_details_val = ssm_ops.get_ssm_param_value(
constants.get_viewer_config_details_ssm_param_name(cluster_name),
aws_provider
)
viewer_config_details = config_wrangling.ConfigDetails.from_dict(json.loads(raw_viewer_details_val))
except ssm_ops.ParamDoesNotExist:
raise UnableToRetrieveClusterVersion(cluster_name, cli_version)

capture_version = int(capture_config_details.version.aws_aio_version)
viewer_version = int(viewer_config_details.version.aws_aio_version)

if capture_version != viewer_version:
raise CaptureViewerVersionMismatch(capture_version, viewer_version)

if capture_version != cli_version:
raise CliClusterVersionMismatch(cli_version, capture_version)

# Everything matches, we're good to go
return
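
A quick way to exercise the new mismatch check in isolation is a small unit test that stubs out the SSM and config lookups. The sketch below is illustrative only: the test module, mock targets, and version values are assumptions, not part of this commit.

```
import unittest
from unittest import mock

import core.versioning as ver


class TestVersionCompatibility(unittest.TestCase):
    @mock.patch("core.versioning.config_wrangling.ConfigDetails.from_dict")
    @mock.patch("core.versioning.ssm_ops.get_ssm_param_value")
    def test_cli_cluster_mismatch_raises(self, mock_get_param, mock_from_dict):
        # Both the Capture and Viewer config details report AWS AIO version 1...
        mock_get_param.return_value = "{}"
        fake_details = mock.Mock()
        fake_details.version.aws_aio_version = "1"
        mock_from_dict.return_value = fake_details

        # ...so a CLI claiming version 2 should refuse to operate against the Cluster
        with self.assertRaises(ver.CliClusterVersionMismatch):
            ver.confirm_aws_aio_version_compatibility("MyCluster", mock.Mock(), cli_version=2)


if __name__ == "__main__":
    unittest.main()
```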