
[RFC] Revamp of How Container Configuration Is Updated #81

Closed
chelma opened this issue Jul 6, 2023 · 8 comments

chelma commented Jul 6, 2023

Proposal

We will provide a no-code mechanism for light/medium modification of the Capture/Viewer container contents and behavior. We will try not to block users that need heavy modification of the containers.

update-config

We will create a new CLI command (name TBD) that updates the running container configuration without requiring a rebuild of the underlying Docker images or a CloudFormation Update. The command will take user-specified configuration and deploy it to the Capture/Viewer containers in a gradual or otherwise "safe" manner. We will provide a way to roll back bad configuration, though this may be manual initially.

create-cluster

We will update the create-cluster CLI command to use the same distribution method as the new CLI command to set up initial container configuration. By default, if there is an existing configuration, re-running create-cluster will not update it. The user can specify a flag to overwrite pre-existing configuration; we require an explicit flag because overwriting can create situations where it may be impossible to roll back automatically.

Arkime config.ini

We will expose the capture and viewer .INI files to the user in plain text. The user is free to modify them as desired (including changing Auth and other settings). We will deliver them without additional modification to the containers as part of create-cluster and update-config. We will perform client-side vetting to ensure the files meet the Arkime spec and alert the user pre-deployment if the files are obviously malformed.
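
As a sketch of what that client-side vetting might look like, assuming the checks live in the Python CLI and that Python's built-in configparser is close enough to Arkime's INI dialect (the specific section check is illustrative):

import configparser
import sys

def vet_arkime_ini(path: str) -> list:
    """Best-effort vetting: confirm the file parses as INI and contains the
    sections we expect. The 'default' section check is illustrative only."""
    parser = configparser.ConfigParser(interpolation=None, allow_no_value=True)
    problems = []
    try:
        with open(path) as ini_file:
            parser.read_file(ini_file)
    except (OSError, configparser.Error) as error:
        return [f"{path}: not parseable as INI: {error}"]

    if "default" not in parser:
        problems.append(f"{path}: missing [default] section")
    return problems

if __name__ == "__main__":
    issues = vet_arkime_ini(sys.argv[1])
    if issues:
        print("\n".join(issues))
        sys.exit(1)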

Startup Behavior

We will update the startup behavior of the containers to be as generic as possible. We will move configuration-specific behavior to a separate script which will be delivered as part of container startup rather than baked into the images.

Startup Customization

We will expose one bash script each for the capture and viewer containers, containing any special startup commands required to configure the container at launch beyond starting the Arkime processes and other generic actions. These scripts will contain most of the behavior currently in the run_capture_node.sh and run_viewer_node.sh scripts, such as the sed commands that update entries in the Arkime config.ini files. The user will be free to modify the behavior in these scripts. We will deliver the scripts to the container and execute them during startup.

Additional Files

We will expose one manifest file each for the capture and viewer containers, containing mappings between the absolute or relative on-disk path of a client-side file and the absolute path where it should live in-container. We will package the files specified in the manifest and deliver them to the containers as part of create-cluster and update-config. An example of a file to be delivered this way is the default.rules file we're currently embedding in the Capture container's image. The user is free to add any additional files they want delivered to the Capture and Viewer containers to their respective manifests.

Suggested Implementation

Location of configuration files in-Repo

. # repo root
./config/capture/manifest.json     # The Capture Nodes' mapping file
./config/capture/startup.sh     # The Capture Nodes' customizable startup script
./config/capture/files/config.ini     # The Capture Nodes' INI file
./config/capture/files/default.rules     # The default rules file, currently embedded in-image
./config/capture/files/*     # Convenient place to stick files to be placed in-Container
./config/viewer/manifest.json     # The Viewer Nodes' mapping file
./config/viewer/startup.sh     # The Viewer Nodes' customizable startup script
./config/viewer/files/config.ini     # The Viewer Nodes' INI file
./config/viewer/files/*     # Convenient, optional place to stick files to be placed in-Container

Updates to create-cluster

  • Create an S3 bucket to hold the configuration for the Capture and Viewer nodes if it doesn't exist. Do this in-Python because we need our configuration in place before our CDK-based CloudFormation deployment is triggered.
  • Check Parameter Store to see if there's a currently-deployed configuration. If not, upload the files in the Capture and Viewer config directory trees on-disk to a versioned location in S3, get a hash of each directory tree, and update Parameter Store with the currently-deployed version, hash, and timestamp (a sketch of this flow follows the list). Alternatively, accept a command-line flag to behave similarly to update-config for this step.
  • Update Capture/Viewer Containers startup behavior to check the "current" config version via Parameter Store and copy the files from S3 to the container. After all files are copied locally, source the startup.sh.
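
A rough sketch of that flow in Python/boto3 (the Parameter Store name, S3 key scheme, and hashing approach below are assumptions, not a settled design):

import hashlib
import json
import time
from pathlib import Path

import boto3
from botocore.exceptions import ClientError

def hash_config_tree(root: str) -> str:
    """Deterministic hash of a config directory tree: relative paths plus contents."""
    digest = hashlib.sha256()
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest.update(str(path.relative_to(root)).encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()

def ensure_initial_config(cluster: str, component: str, local_dir: str, bucket: str):
    """Upload the on-disk config tree and record version/hash/timestamp in
    Parameter Store, but only if nothing is deployed yet (the default
    create-cluster behavior). Parameter and key naming here is hypothetical."""
    ssm = boto3.client("ssm")
    s3 = boto3.client("s3")
    param_name = f"/arkime/{cluster}/{component}/config"

    try:
        ssm.get_parameter(Name=param_name)
        return  # A config is already deployed; leave it alone by default
    except ClientError as error:
        if error.response["Error"]["Code"] != "ParameterNotFound":
            raise

    version = 1
    for path in sorted(Path(local_dir).rglob("*")):
        if path.is_file():
            key = f"{component}/v{version}/{path.relative_to(local_dir)}"
            s3.upload_file(str(path), bucket, key)

    ssm.put_parameter(
        Name=param_name,
        Type="String",
        Value=json.dumps({
            "version": version,
            "hash": hash_config_tree(local_dir),
            "time": int(time.time()),
        }),
    )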

Updates to list-clusters

  • Display the current Capture and Viewer config version and creation timestamp

Behavior of update-config

  • Get hashes of the Capture and Viewer files and compare them to the current hashes tracked in Parameter Store. If they differ, increment the config version number, upload the configuration to a new versioned path in S3, and update the current config version/hash in Parameter Store along with a timestamp. Store the old version number/timestamp/hash in the same parameter as the "previous" (this will be used for rollback).
  • Use the ECS UpdateService API's force-new-deployment option to bounce the containers in the Capture and/or Viewer fleets according to a deployment plan (sketched below).
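
The container bounce maps to the ECS UpdateService API's forceNewDeployment option; a minimal sketch (cluster/service identifiers are placeholders):

import boto3

def bounce_fleet(cluster_arn: str, service_arn: str) -> str:
    """Force ECS to replace the service's tasks so they pick up the new config
    version on startup; ECS rolls tasks according to the service's deployment
    configuration (minimum/maximum healthy percent)."""
    ecs = boto3.client("ecs")
    response = ecs.update_service(
        cluster=cluster_arn,
        service=service_arn,
        forceNewDeployment=True,
    )
    # The PRIMARY (newest) deployment's ID can be used to monitor the rollout
    return response["service"]["deployments"][0]["id"]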

How Config Rollbacks Will Work

Assumptions:

  • We will assume there are no out-of-band changes to the configuration in S3 or the Parameter Store

Approach:

  • We will keep all versions of the container configuration in separate versioned paths in S3. This will allow us to easily move between versions.
  • We will keep track of the most recent and previous successfully deployed configuration version, its identifying hash, and deployment timestamp in Parameter Store.
  • If, during an invocation of create-cluster, the containers fail to stabilize, this will eventually result in a CloudFormation failure and rollback. This rollback is guaranteed to succeed because (by default) create-cluster only uploads the latest config if no configuration currently exists, which means we know the previous config version worked with the previous CloudFormation template. If there is no previous config, it's the first deployment of the stack(s) and a deployment failure/rollback doesn't matter because the cluster is not in use.
  • If, during an invocation of update-config, the containers fail to stabilize, we will notice by monitoring the deployment for failedTasks using the deployment ID and the DescribeServices API. If we see multiple failedTasks, we will update Parameter Store to point at the "previous" config version. This should allow the next attempt to start the containers to succeed, as we know that configuration worked previously (otherwise it wouldn't be recorded in Parameter Store). See the sketch below.
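
A sketch of that monitoring/rollback loop, assuming the current and previous config versions are stored together in one JSON-valued parameter as described under "Behavior of update-config" (the parameter layout and failure threshold are placeholders):

import json
import time

import boto3

def monitor_and_rollback(cluster_arn: str, service_arn: str, deployment_id: str,
                         param_name: str, max_failed_tasks: int = 2) -> str:
    """Watch the in-flight deployment; if too many tasks fail to stabilize, flip
    Parameter Store back to the previously recorded config version."""
    ecs = boto3.client("ecs")
    ssm = boto3.client("ssm")

    while True:
        described = ecs.describe_services(cluster=cluster_arn, services=[service_arn])
        deployment = next(d for d in described["services"][0]["deployments"]
                          if d["id"] == deployment_id)

        if deployment.get("failedTasks", 0) >= max_failed_tasks:
            state = json.loads(ssm.get_parameter(Name=param_name)["Parameter"]["Value"])
            state["current"] = state["previous"]  # revert to last known-good version
            ssm.put_parameter(Name=param_name, Type="String",
                              Value=json.dumps(state), Overwrite=True)
            return "rolled-back"

        if deployment.get("rolloutState") == "COMPLETED":
            return "deployed"
        time.sleep(30)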

FAQs

Why not use AWS CodePipeline to manage deployments?

CodePipeline appears to be a viable alternative for managing the update of both the underlying Docker image and the deployed files/configuration. The advantage of CodePipeline is that it provides pre-made mechanisms for deployment safety (especially automated rollbacks). The disadvantage is that it would require a substantial rework of the existing repo to leverage properly and would add quite a few additional components to our solution. If we run into problems with the suggested approach above, we should consider this a plausible backup plan.

Why not use AWS AppConfig?

AWS AppConfig isn't designed to perform this type of configuration. Its sweet spot is distributing things like feature flags, allow lists, permissions, etc., that on-host processes can then pull to modify their behavior. It is not designed to distribute files and perform on-disk actions like running scripts.

Why not use a pre-built configuration management tool (SaltStack, Ansible, Chef, etc)?

For the project's current scope, these tools require a substantial upfront investment in infrastructure setup and impose a steep learning curve on users, which seems disproportionate to the additional benefit they would provide.

How can I invoke additional executables during startup?

You can either put the commands you want to execute in the startup.sh script we execute on container start, or include additional executables in the files transferred to the container and invoke them from the startup.sh.

Related Tasks


awick commented Jul 6, 2023

Great write up

Arkime config.ini:
We will deliver them without additional modification to the containers as part of create-cluster and update-config.

I assume we will still do var subs, so this should be wordsmithed a little. For example, I wouldn't want the OS password sitting in plain text on my control computer. A P2 would be: if I want to stick my OIDC password in AWS Service X, it would be nice to do some kind of var sub on it. Example: client_secret={CLIENT_SECRET_VAR}, and there should be an easy way to set that.

Startup Customization section

Would like to discuss this, too much to type, not sure totally on board yet.

Additional files

Not totally sure of the purpose of the manifest.json if you just copy everything in the files directory.

Other:
Something I always struggle with is where the provided versions of all these files live. So, for example, when I first get this project, where is the sample config.ini for capture, and how does it get to ./config/capture/files/config.ini? If that's where it lives in git/rpm, I will forever have merge issues if the user and the developer make changes to the same file and the user updates. One solution I see for this is a new command to copy the shipped versions over to the staged area, but there are other solutions too.


chelma commented Jul 6, 2023

Thanks for the comments!

I assume we will still do var subs so this should be word smithed a little.

Yeah, I see your point. What I mean is that we will deliver the config.ini to the container exactly as the user has it locally (no special magic client-side). However, we also run the startup.sh after it's in the container, where things like seds can happen (and will happen). As an example, we'll still pull our OpenSearch Domain password from Secrets Manager and stick it in the config.ini; that logic will be in startup.sh, which is invoked during container startup.

Would like to discuss this, too much to type, not sure totally on board yet.

Let's discuss; happy to change course if there's a better option

Not totally sure the purpose of the manifest.json

We're not making assumptions that the files the user wants in the container are located in any specific location client-side. Additionally, we want the user to be able to specify any location for a file to be delivered in-container. The bigger issue, however, is: how do you keep track of which client-side file ends up where on the container side? This was my simplified solution.

where do the provided versions of all these files live

I was thinking the repo should be runnable "as-is", meaning the configuration files for the default behavior are already in the "correct" location to just "go". I haven't spent too many cycles on how folks version control their changes, but yeah - probably something to think through and open to your thoughts on the topic. I think the manifest.json may help us here since the files don't technically even need to be in this repo as long as they're available client-side.


chelma commented Jul 7, 2023

I think the sequence of operations we perform on container start could be filled in/clarified a bit more. Here's what I'm thinking:

  1. ECS kicks off the container's entry-point command (the run_capture_node.sh/run_viewer_node.sh). The process should be the same for both viewer and capture nodes so we'll just refer to one of the two scripts.
  2. run_capture_node.sh prints the env variables, as it currently does, and maybe other "config agnostic" activities
  3. run_capture_node.sh invokes another script embedded in the container image (bootstrap_config.sh, or something like that).
  4. bootstrap_config.sh pulls the config and other files the user originally specified in their manifest.json from s3/wherever, moves them to the correct location on disk in the container, and then invokes the configure.sh script that the user defined client side (originally called startup.sh above).
    • We make bootstrap_config.sh a separate script so that later we could do things like use AWS System Manager RunCommand to pull updated config outside of the container startup process.
    • This needs to be embedded in the container image because we need something within the image to pull our files from S3 or wherever initially (a rough sketch of this flow follows the list).
  5. configure.sh executes all user-specified actions. By default, this would be things like pulling the OpenSearch Domain admin password from AWS SecretsManager and using sed to stick it in the config.ini file. The user is free to add whatever logic they want to it, including invoking other executables pulled from S3 to disk in the previous step.
  6. run_capture_node.sh starts the Arkime process
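
The in-image bootstrap would most likely be a bash script, but as a sketch of the same flow (the parameter name, bucket layout, and paths are hypothetical):

import json
import subprocess
import tarfile

import boto3

def bootstrap_config(cluster: str, component: str, bucket: str, dest: str = "/arkime_config"):
    """Read the current config version from Parameter Store, pull the matching
    bundle from S3, unpack it, and hand off to the user's configure.sh."""
    ssm = boto3.client("ssm")
    s3 = boto3.client("s3")

    param_name = f"/arkime/{cluster}/{component}/config"
    state = json.loads(ssm.get_parameter(Name=param_name)["Parameter"]["Value"])
    key = f"{component}/v{state['current']}/config.tgz"

    s3.download_file(bucket, key, "/tmp/config.tgz")
    with tarfile.open("/tmp/config.tgz") as bundle:
        bundle.extractall(dest)

    # configure.sh performs all user-specified actions (sed'ing secrets into
    # config.ini, relocating files, etc.) before Arkime itself is started
    subprocess.run(["/bin/bash", f"{dest}/configure.sh"], check=True)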


chelma commented Jul 7, 2023

Regarding where user will stick their own configuration, I think the manifest.json will help a bunch here. First, we should probably talk about what will be inside this file. Maybe a format like:

{
    "configure_script_path": "./local/relative/path/configure.sh",
    "files_map": [
        {
            "local_path": "./relative/path/config.ini",
            "container_path": "/opt/arkime/etc/config.ini"
        },
        {
            "local_path": "./relative/path/default.rules",
            "container_path": "/opt/arkime/etc/default.rules"
        },
        {
            "local_path": "/absolute/path/my_script.sh",
            "container_path": "/usr/bin/my_script.sh"
        }
    ]
}

We keep track of the configure.sh script separately because it's really the only thing the bootstrap_config.sh needs to know the specific location of during the config installation process so it can kick it off. When we create our tarball locally we can stick it in a standard location so we'll know where it ends up in-container when we unpack the tarball.

Anyways, this enables us to do something like what's described above and stick all our "default" files in the repo so it will just "go" when the user runs create-cluster:

. # repo root
./default_config/capture/manifest.json     # The Capture Nodes' mapping file
./default_config/capture/configure.sh     # The Capture Nodes' customizable configuration script
./default_config/capture/files/config.ini     # The Capture Nodes' INI file
./default_config/capture/files/default.rules     # The default rules file, currently embedded in-image
./default_config/viewer/manifest.json     # The Viewer Nodes' mapping file
./default_config/viewer/configure.sh     # The Viewer Nodes' customizable configuration script
./default_config/viewer/files/config.ini     # The Viewer Nodes' INI file

By default, create-cluster and update-config can point to the ./default_config/capture/manifest.json and ./default_config/viewer/manifest.json manifest files, but we can add command-line args to let the user specify paths to different manifest file locations. This should enable them to put both the manifests and any/all files they want to end up on the containers wherever they want.
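
A sketch of how client-side packaging could consume a manifest like the one above, placing the configure script at a fixed name inside the tarball so bootstrap_config.sh knows where to find it (the bundle layout is illustrative, not a settled format):

import json
import tarfile
from pathlib import Path

def package_from_manifest(manifest_path: str, out_path: str = "config.tgz") -> str:
    """Build a config bundle from a manifest.json: the configure script goes to a
    standard name, and the manifest rides along so the in-container bootstrap
    knows each file's container_path after unpacking."""
    manifest = json.loads(Path(manifest_path).read_text())

    with tarfile.open(out_path, "w:gz") as bundle:
        bundle.add(manifest["configure_script_path"], arcname="configure.sh")
        bundle.add(manifest_path, arcname="manifest.json")
        for entry in manifest["files_map"]:
            # Store each file under its local name; it gets moved to
            # entry["container_path"] once unpacked in-container
            bundle.add(entry["local_path"], arcname=Path(entry["local_path"]).name)
    return out_path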


chelma commented Jul 7, 2023

Had a great convo w/ @awick about the proposal. Some takeaways:

  • We can generate the config directory with default values for config.ini, configure.sh, etc. the first time the user invokes create-cluster. We'll want to make these directories tied to the specific cluster, though, so the user can manage multiple clusters simultaneously. That way, we're guaranteed that even if those files are checked into a user's fork, they will never present a merge conflict when pulling from the arkime/aws-aio main repo.
  • We can probably ditch the manifest file if we use convention/relative paths to track the location of the configure.sh file and rely on mkdir/mv commands in the configure.sh to reposition stuff from the unpacked tarball in-container to the correct location on-disk.
  • There's a need for some additional commands for config management
    • config-list: get a listing of config bundles for a given cluster (e.g. what's in S3?)
    • config-pull: pull a specified config bundle from S3 to the client
    • config-update --version: a way to deploy a specific, historical version of the config to the cluster
  • We should consider moving to "Reverse Polish" naming (create-cluster -> cluster-create, list-clusters -> cluster-list) now that we're accumulating more commands
  • We should embed the AWS AIO version into the s3 objects; should be easy to do with S3 tags


chelma commented Jul 7, 2023

Thought about things a bit more and think it makes sense to ditch the manifest.json for now. We'll have updated logic as follows.

  • We can stick our "default" configuration files in our source directory tree to get them out of the way and not confuse users. Since it's Python that will be managing the lifecycle of these files, this seems pretty reasonable:
. # repo root
./manage_arkime/arkime_config/capture/configure.sh
./manage_arkime/arkime_config/capture/config.ini
./manage_arkime/arkime_config/capture/default.rules
./manage_arkime/arkime_config/viewer/configure.sh
./manage_arkime/arkime_config/viewer/config.ini
  • When the user invokes cluster-create, we copy them into a cluster-specific directory if that directory doesn't exist already:
. # repo root
./config_<CLUSTER NAME>/capture/configure.sh
./config_<CLUSTER NAME>/capture/config.ini
./config_<CLUSTER NAME>/capture/default.rules
./config_<CLUSTER NAME>/viewer/configure.sh
./config_<CLUSTER NAME>/viewer/config.ini
  • We tar the ./config_<CLUSTER NAME>/capture/ and ./config_<CLUSTER NAME>/viewer/ directories and stick them in S3 (sketched below). We confirm configure.sh exists in both the Capture and Viewer trees, since that's the one touchpoint we're directly referencing in the code embedded in the container image (via bootstrap_config.sh).
  • If files need to be moved to specific locations in the container, we'll do that in configure.sh when it's invoked in-container.
  • If the user has new files they need in the container, they can stick them in those directories and relocate, as needed, with commands in the configure.sh files.
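
A sketch of what that tarball/upload step could look like in the Python CLI, including the aws-aio version tag on the S3 object mentioned earlier (the key scheme and tag name are assumptions):

import tarfile
from pathlib import Path

import boto3

def upload_config_bundle(cluster: str, component: str, bucket: str,
                         version: int, aws_aio_version: str) -> str:
    """Tar a cluster-specific config directory and push it to S3, tagging the
    object with the aws-aio version."""
    config_dir = Path(f"./config_{cluster}/{component}")

    # configure.sh is the one file the in-container bootstrap references directly
    if not (config_dir / "configure.sh").is_file():
        raise FileNotFoundError(f"{config_dir}/configure.sh is required")

    archive = f"/tmp/{cluster}_{component}_v{version}.tgz"
    with tarfile.open(archive, "w:gz") as bundle:
        bundle.add(str(config_dir), arcname=".")

    key = f"{component}/v{version}/config.tgz"
    boto3.client("s3").upload_file(
        archive, bucket, key,
        ExtraArgs={"Tagging": f"aws_aio_version={aws_aio_version}"},
    )
    return key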


awick commented Jul 7, 2023

LGTM


chelma commented Jul 10, 2023

Plan looks something like this:

* Create S3 bucket on `create-cluster`; destroy S3 bucket on `destroy-cluster`
* Create user config directory from master on `create-cluster`
* Code to tarball config/upload to s3/update param store; tag S3 and Param Store with `aws-aio` version(s)
* Update `create-cluster` to perform tarball upload if no existing config
* Update container config/logic to use tarball
* README update
* Create `config-update` command to update the existing config with new changes
* Implement/test rollback behavior
* README update
* Change command names to reverse-polish notation
* README update
* Add new config-related commands and options:
    * `config-list`: get a listing of config bundles for a given cluster (e.g. what's in S3?)
    * `config-pull`: pull a specified config bundle from S3 to the client
    * `config-update --version`: a way to deploy a specific, historical version of the config to the cluster
* README update
