
OpenSearch Domain capacity now configurable #58

Merged
merged 2 commits into from
Jun 2, 2023
Conversation

chelma
Collaborator

@chelma chelma commented Jun 1, 2023

Description

  • Made the OpenSearch Domain used for SPI data configurable via the command line.
  • By default, the user gets a minimal Capture Node ASG and OpenSearch Domain. If they specify non-default values, they get a cluster sized to service the indicated traffic load. If they re-deploy after setting non-default values, the previously configured values are reused unless they provide new ones.
  • The user can change individual capacity options, and the code will gracefully pull the other, unspecified values from the stored configuration if they exist.
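A minimal sketch of the fallback behavior described above. The names (`UserConfig`, `merge_config`) and the defaults are illustrative, not the actual aws-aio API: CLI-supplied values win, then previously stored values, then minimal defaults.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UserConfig:
    """Capacity inputs; defaults represent the minimal cluster."""
    expected_traffic: float = 0.01  # gigabits-per-second
    spi_days: int = 30
    replicas: int = 1

def merge_config(previous: Optional[UserConfig],
                 expected_traffic: Optional[float],
                 spi_days: Optional[int],
                 replicas: Optional[int]) -> UserConfig:
    """CLI values take precedence; unspecified options fall back to the
    previously stored config if one exists, else to the minimal defaults."""
    base = previous if previous is not None else UserConfig()
    return UserConfig(
        expected_traffic=expected_traffic if expected_traffic is not None else base.expected_traffic,
        spi_days=spi_days if spi_days is not None else base.spi_days,
        replicas=replicas if replicas is not None else base.replicas,
    )
```

For example, re-deploying with only `--spi-days 60` would keep the stored traffic and replica settings while updating retention.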

Issues

Testing

  • Added Unit Tests
  • Tested the behavior manually via the CLI. As a supplement to the extensive unit testing, confirmed it correctly followed the constraints imposed by our capacity plan across a few scenarios. The CLI output and post-deployment cluster capacity configuration are pasted below.
```
(.venv) chelma@3c22fba4e266 aws-aio % ./manage_arkime.py create-cluster --name MyCluster3 --expected-traffic 0.01 --spi-days 30 --replicas 1
2023-06-01 16:55:46 - Debug-level logs save to file: /Users/chelma/workspace/Arkime/aws-aio/manage_arkime.log
2023-06-01 16:55:46 - Using AWS Credential Profile: default
2023-06-01 16:55:46 - Using AWS Region: default from AWS Config settings
2023-06-01 16:55:47 - Executing command: deploy MyCluster3-CaptureBucket MyCluster3-CaptureNodes MyCluster3-CaptureVPC MyCluster3-OSDomain MyCluster3-ViewerNodes
2023-06-01 16:55:47 - NOTE: This operation can take a while.  You can 'tail -f' the logfile to track the status.
2023-06-01 17:15:28 - Deployment succeeded

-----------------------------------------------------------

{"busArn":"arn:aws:events:us-east-2:968674222892:event-bus/MyCluster3CaptureNodesClusterBus224EB36F","busName":"MyCluster3CaptureNodesClusterBus224EB36F","clusterName":"MyCluster3","vpceServiceId":"vpce-svc-0d6b05027bae05312","capacityPlan":{"captureNodes":{"instanceType":"m5.xlarge","desiredCount":1,"maxCount":2,"minCount":1},"captureVpc":{"numAzs":2},"ecsResources":{"cpu":3584,"memory":15360},"osDomain":{"dataNodes":{"count":2,"instanceType":"t3.small.search","volumeSize":100},"masterNodes":{"count":3,"instanceType":"m5.large.search"}}},"userConfig":{"expectedTraffic":0.01,"spiDays":30,"replicas":1}}

(.venv) chelma@3c22fba4e266 aws-aio % ./manage_arkime.py create-cluster --name MyCluster3 --expected-traffic 0.1
2023-06-01 17:16:35 - Debug-level logs save to file: /Users/chelma/workspace/Arkime/aws-aio/manage_arkime.log
2023-06-01 17:16:35 - Using AWS Credential Profile: default
2023-06-01 17:16:35 - Using AWS Region: default from AWS Config settings
2023-06-01 17:16:36 - Executing command: deploy MyCluster3-CaptureBucket MyCluster3-CaptureNodes MyCluster3-CaptureVPC MyCluster3-OSDomain MyCluster3-ViewerNodes
2023-06-01 17:16:36 - NOTE: This operation can take a while.  You can 'tail -f' the logfile to track the status.
2023-06-01 17:45:43 - Deployment succeeded

-----------------------------------------------------------

{"busArn":"arn:aws:events:us-east-2:968674222892:event-bus/MyCluster3CaptureNodesClusterBus224EB36F","busName":"MyCluster3CaptureNodesClusterBus224EB36F","clusterName":"MyCluster3","vpceServiceId":"vpce-svc-0d6b05027bae05312","capacityPlan":{"captureNodes":{"instanceType":"m5.xlarge","desiredCount":1,"maxCount":2,"minCount":1},"captureVpc":{"numAzs":2},"ecsResources":{"cpu":3584,"memory":15360},"osDomain":{"dataNodes":{"count":2,"instanceType":"r6g.large.search","volumeSize":1024},"masterNodes":{"count":3,"instanceType":"m6g.large.search"}}},"userConfig":{"expectedTraffic":0.1,"spiDays":30,"replicas":1}}

(.venv) chelma@3c22fba4e266 aws-aio % ./manage_arkime.py create-cluster --name MyCluster3
2023-06-01 17:46:55 - Debug-level logs save to file: /Users/chelma/workspace/Arkime/aws-aio/manage_arkime.log
2023-06-01 17:46:55 - Using AWS Credential Profile: default
2023-06-01 17:46:55 - Using AWS Region: default from AWS Config settings
2023-06-01 17:46:56 - Executing command: deploy MyCluster3-CaptureBucket MyCluster3-CaptureNodes MyCluster3-CaptureVPC MyCluster3-OSDomain MyCluster3-ViewerNodes
2023-06-01 17:46:56 - NOTE: This operation can take a while.  You can 'tail -f' the logfile to track the status.
2023-06-01 17:47:24 - Deployment succeeded

-----------------------------------------------------------

{"busArn":"arn:aws:events:us-east-2:968674222892:event-bus/MyCluster3CaptureNodesClusterBus224EB36F","busName":"MyCluster3CaptureNodesClusterBus224EB36F","clusterName":"MyCluster3","vpceServiceId":"vpce-svc-0d6b05027bae05312","capacityPlan":{"captureNodes":{"instanceType":"m5.xlarge","desiredCount":1,"maxCount":2,"minCount":1},"captureVpc":{"numAzs":2},"ecsResources":{"cpu":3584,"memory":15360},"osDomain":{"dataNodes":{"count":2,"instanceType":"r6g.large.search","volumeSize":1024},"masterNodes":{"count":3,"instanceType":"m6g.large.search"}}},"userConfig":{"expectedTraffic":0.1,"spiDays":30,"replicas":1}}
```
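The trailing line of each run above is a single JSON blob describing the deployed cluster. As a small illustration, the OpenSearch data-node settings can be pulled out of it like so (the blob below is abbreviated from the output above; field names match the actual output):

```python
import json

# Abbreviated copy of the deployment output's final JSON line.
output = (
    '{"clusterName":"MyCluster3",'
    '"capacityPlan":{"osDomain":{"dataNodes":'
    '{"count":2,"instanceType":"r6g.large.search","volumeSize":1024}}},'
    '"userConfig":{"expectedTraffic":0.1,"spiDays":30,"replicas":1}}'
)

plan = json.loads(output)
data_nodes = plan["capacityPlan"]["osDomain"]["dataNodes"]
print(data_nodes["instanceType"], data_nodes["volumeSize"])  # → r6g.large.search 1024
```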

License

I confirm that this contribution is made under an Apache 2.0 license and that I have the authority necessary to make this contribution on behalf of its copyright owner.

@chelma chelma added the Capture Configurability Work to make capture configurable and handle changes in configuration label Jun 1, 2023
@chelma chelma requested review from awick and 31453 June 1, 2023 16:57
README.md Outdated

```
./manage_arkime.py create-cluster --name MyCluster --expected-traffic 10
./manage_arkime.py create-cluster --name MyCluster --expected-traffic 1 --spi-days 30 --replicas 2
```

Contributor

I'm afraid of copy-paste; let's make this replicas 1

Collaborator Author

Will do

manage_arkime.py Outdated
```
@@ -58,15 +58,27 @@ def destroy_demo_traffic(ctx):
@click.option(
    "--expected-traffic",
    help=("The amount of traffic, in gigabits-per-second, you expect your Arkime Cluster to receive."
```
Contributor

to be more correct should be something like "The average amount of traffic, ..."

Collaborator Author

Will change

```
def cmd_create_cluster(profile: str, region: str, name: str, expected_traffic: float):
    ...

class MustProvideAllParams(Exception):
    def __init__(self):
        super().__init__("If you specify one of the optional capacity parameters, you must specify all of them.")
```
Contributor

Why? Is this because of create vs. update, which we will have later?

If keeping this, 3 changes:

  1. Remove the stack trace; we shouldn't show stack traces when it's a command-line argument issue, it confuses people.
  2. The error message should include ALL the required parameters OR the ones missing. (Guessing ALL is easier.)
  3. The "default" printout in the help is confusing, since there aren't defaults, I guess, or only defaults when none are set?

Collaborator Author

It has to do with how we're storing the application state in SSM for the cluster and its capacity. It's definitely possible to loosen this constraint, but it's just more code that needs to be written.

The short version is that I want our capacity plan to be stable between deployments, even if the underlying capacity-plan code changes, until the user specifically indicates they want to update the capacity (users should not be "surprised" by an infrastructure change). This means we need to store the calculated capacity plan rather than just its inputs (expected traffic, SPI days, replicas). In that scenario, we don't actually need to store those inputs as long as the user provides all three of them whenever they want a change. This is the approach I took for the first pass.

If we store the inputs in SSM as well, it will enable us to accept a single one of the parameters and generate a capacity plan using the existing values of the others, but it makes the code more complex.

Let's talk it through.
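A sketch of the "store the inputs as well" idea: persist both the computed capacity plan and the user inputs in SSM Parameter Store so either can be reused on a later deploy. The parameter naming scheme and payload shape here are assumptions, not the project's actual layout.

```python
import json

def serialize_cluster_state(capacity_plan: dict, user_config: dict) -> str:
    """Bundle the *computed* plan together with its inputs, so a later
    deploy can reuse stored inputs without forcing a plan recomputation."""
    return json.dumps({"capacityPlan": capacity_plan, "userConfig": user_config})

def store_cluster_state(cluster_name: str, capacity_plan: dict, user_config: dict) -> None:
    """Write the bundled state to SSM (hypothetical parameter path)."""
    import boto3  # deferred import so serialization stays testable offline
    ssm = boto3.client("ssm")
    ssm.put_parameter(
        Name=f"/arkime/clusters/{cluster_name}",  # assumed naming scheme
        Value=serialize_cluster_state(capacity_plan, user_config),
        Type="String",
        Overwrite=True,
    )
```

With the inputs stored, a later `create-cluster --spi-days 60` could read this parameter back, substitute only the changed value, and regenerate the plan.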

Collaborator Author

@awick and I talked it through and agreed the path forward is to store the inputs as well and accept the additional complexity. Will update.

Contributor

@awick awick left a comment

LGTM - Still deleting and redeploying everything.

At some point we should put delete protection on OS and S3

@chelma chelma merged commit 77a8e5d into main Jun 2, 2023
@chelma chelma deleted the capture-cap-3 branch June 2, 2023 13:37
@chelma
Collaborator Author

chelma commented Jun 2, 2023

At some point we should put delete protection on OS and S3

Good idea; made a new issue for that: #59
