
[RFC] Revamp of How Container Configuration Is Updated #81

Closed
chelma opened this issue Jul 6, 2023 · 8 comments

chelma commented Jul 6, 2023

Proposal

We will provide a no-code mechanism for light/medium modification of the Capture/Viewer container contents and behavior. We will try not to block users that need heavy modification of the containers.

update-config

We will create a new CLI command (name TBD) that updates the running container configuration without requiring a rebuild of the underlying Docker images or a CloudFormation Update. The command will take user-specified configuration and deploy it to the Capture/Viewer containers in a gradual or otherwise "safe" manner. We will provide a way to roll back bad configuration, though this may be manual initially.

create-cluster

We will update the create-cluster CLI command to use the same distribution method as the new CLI command to set up initial container configuration. By default, if there is an existing configuration, re-running create-cluster will not update it. The user can specify a flag to overwrite pre-existing configuration; we require an explicit flag because overwriting can create situations where it may be impossible to roll back automatically.

Arkime config.ini

We will expose the capture and viewer .INI files to the user in plain text. The user is free to modify them as desired (including changing Auth and other settings). We will deliver them without additional modification to the containers as part of create-cluster and update-config. We will perform client-side vetting to ensure the files meet the Arkime spec and alert the user pre-deployment if the files are obviously malformed.
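
As a sketch of what that client-side vetting might look like, assuming the checks live in the Python CLI and that Python's built-in configparser is close enough to Arkime's INI dialect (the specific section check is illustrative):

import configparser
import sys

def vet_arkime_ini(path: str) -> list:
    """Best-effort vetting: confirm the file parses as INI and contains the
    sections we expect. The 'default' section check is illustrative only."""
    parser = configparser.ConfigParser(interpolation=None, allow_no_value=True)
    problems = []
    try:
        with open(path) as ini_file:
            parser.read_file(ini_file)
    except (OSError, configparser.Error) as error:
        return [f"{path}: not parseable as INI: {error}"]

    if "default" not in parser:
        problems.append(f"{path}: missing [default] section")
    return problems

if __name__ == "__main__":
    issues = vet_arkime_ini(sys.argv[1])
    if issues:
        print("\n".join(issues))
        sys.exit(1)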

Startup Behavior

We will update the startup behavior of the containers to be as generic as possible. We will move configuration-specific behavior to a separate script which will be delivered as part of container startup rather than baked into the images.

Startup Customization

We will expose one bash script each for the capture and viewer containers, containing any special startup commands required to configure the container at launch beyond starting the Arkime processes and other generic actions. These scripts will contain most of the behavior currently in the run_capture_node.sh and run_viewer_node.sh scripts, such as the sed commands that update entries in the Arkime config.ini files. The user will be free to modify the behavior in these scripts. We will deliver the scripts to the container and execute them during startup.

Additional Files

We will expose one manifest file each for the capture and viewer containers, containing mappings between the absolute or relative on-disk path of a client-side file and the absolute path where it should live in-container. We will package the files specified in the manifest and deliver them to the containers as part of create-cluster and update-config. An example of a file to be delivered this way is the default.rules file we're currently embedding in the Capture container's image. The user is free to add any additional files they want delivered to the Capture and Viewer containers to their respective manifests.

Suggested Implementation

Location of configuration files in-Repo

. # repo root
./config/capture/manifest.json     # The Capture Nodes' mapping file
./config/capture/startup.sh     # The Capture Nodes' customizable startup script
./config/capture/files/config.ini     # The Capture Nodes' INI file
./config/capture/files/default.rules     # The default rules file, currently embedded in-image
./config/capture/files/*     # Convenient place to stick files to be placed in-Container
./config/viewer/manifest.json     # The Viewer Nodes' mapping file
./config/viewer/startup.sh     # The Viewer Nodes' customizable startup script
./config/viewer/files/config.ini     # The Viewer Nodes' INI file
./config/viewer/files/*     # Convenient, optional place to stick files to be placed in-Container

Updates to create-cluster

  • Create an S3 bucket to hold the configuration for the Capture and Viewer nodes if it doesn't exist. Do this in-Python because we need our configuration in place before our CDK-based CloudFormation deployment is triggered.
  • Check Parameter Store to see if there's a currently-deployed configuration. If not, upload the files in the Capture and Viewer config directory trees on-disk to a versioned location in S3, get a hash of each directory tree, and update Parameter Store with the currently-deployed version, hash, and timestamp (a sketch of this flow follows the list). Alternatively, accept a command-line flag to behave similarly to update-config for this step.
  • Update Capture/Viewer Containers startup behavior to check the "current" config version via Parameter Store and copy the files from S3 to the container. After all files are copied locally, source the startup.sh.
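
A rough sketch of that flow in Python/boto3 (the Parameter Store name, S3 key scheme, and hashing approach below are assumptions, not a settled design):

import hashlib
import json
import time
from pathlib import Path

import boto3
from botocore.exceptions import ClientError

def hash_config_tree(root: str) -> str:
    """Deterministic hash of a config directory tree: relative paths plus contents."""
    digest = hashlib.sha256()
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest.update(str(path.relative_to(root)).encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()

def ensure_initial_config(cluster: str, component: str, local_dir: str, bucket: str):
    """Upload the on-disk config tree and record version/hash/timestamp in
    Parameter Store, but only if nothing is deployed yet (the default
    create-cluster behavior). Parameter and key naming here is hypothetical."""
    ssm = boto3.client("ssm")
    s3 = boto3.client("s3")
    param_name = f"/arkime/{cluster}/{component}/config"

    try:
        ssm.get_parameter(Name=param_name)
        return  # A config is already deployed; leave it alone by default
    except ClientError as error:
        if error.response["Error"]["Code"] != "ParameterNotFound":
            raise

    version = 1
    for path in sorted(Path(local_dir).rglob("*")):
        if path.is_file():
            key = f"{component}/v{version}/{path.relative_to(local_dir)}"
            s3.upload_file(str(path), bucket, key)

    ssm.put_parameter(
        Name=param_name,
        Type="String",
        Value=json.dumps({
            "version": version,
            "hash": hash_config_tree(local_dir),
            "time": int(time.time()),
        }),
    )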

Updates to list-clusters

  • Display the current Capture and Viewer config version and creation timestamp

Behavior of update-config

  • Get hashes of the Capture and Viewer files and compare them to the current hashes tracked in Parameter Store. If they differ, increment the config version number, upload the configuration to a new versioned path in S3, and update the current config version/hash in Parameter Store along with a timestamp. Store the old version number/timestamp/hash in the same parameter as the "previous" (this will be used for rollback).
  • Use the ECS UpdateService API's force-new-deployment option to bounce the containers in the Capture and/or Viewer fleets according to a deployment plan (sketched below).
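
The container bounce maps to the ECS UpdateService API's forceNewDeployment option; a minimal sketch (cluster/service identifiers are placeholders):

import boto3

def bounce_fleet(cluster_arn: str, service_arn: str) -> str:
    """Force ECS to replace the service's tasks so they pick up the new config
    version on startup; ECS rolls tasks according to the service's deployment
    configuration (minimum/maximum healthy percent)."""
    ecs = boto3.client("ecs")
    response = ecs.update_service(
        cluster=cluster_arn,
        service=service_arn,
        forceNewDeployment=True,
    )
    # The PRIMARY (newest) deployment's ID can be used to monitor the rollout
    return response["service"]["deployments"][0]["id"]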

How Config Rollbacks Will Work

Assumptions:

  • We will assume there are no out-of-band changes to the configuration in S3 or the Parameter Store

Approach:

  • We will keep all versions of the container configuration in separate versioned paths in S3. This will allow us to easily move between versions.
  • We will keep track of the most recent and previous successfully deployed configuration version, its identifying hash, and deployment timestamp in Parameter Store.
  • If, during an invocation of create-cluster, the containers fail to stabilize, this will eventually result in a CloudFormation failure and rollback. This rollback is guaranteed to succeed because (by default) create-cluster only uploads the latest config if no configuration currently exists, which means we know the previous config version worked with the previous CloudFormation template. If there is no previous config, it's the first deployment of the stack(s) and a deployment failure/rollback doesn't matter because the cluster is not in use.
  • If, during an invocation of update-config, the containers fail to stabilize, we will notice by monitoring the deployment for failedTasks using the deployment ID and the DescribeServices API. If we see multiple failedTasks, we will update Parameter Store to point at the "previous" config version. This should allow the next attempt to start the containers to succeed, as we know that configuration worked previously (otherwise it wouldn't be recorded in Parameter Store). See the sketch below.
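
A sketch of that monitoring/rollback loop, assuming the current and previous config versions are stored together in one JSON-valued parameter as described under "Behavior of update-config" (the parameter layout and failure threshold are placeholders):

import json
import time

import boto3

def monitor_and_rollback(cluster_arn: str, service_arn: str, deployment_id: str,
                         param_name: str, max_failed_tasks: int = 2) -> str:
    """Watch the in-flight deployment; if too many tasks fail to stabilize, flip
    Parameter Store back to the previously recorded config version."""
    ecs = boto3.client("ecs")
    ssm = boto3.client("ssm")

    while True:
        described = ecs.describe_services(cluster=cluster_arn, services=[service_arn])
        deployment = next(d for d in described["services"][0]["deployments"]
                          if d["id"] == deployment_id)

        if deployment.get("failedTasks", 0) >= max_failed_tasks:
            state = json.loads(ssm.get_parameter(Name=param_name)["Parameter"]["Value"])
            state["current"] = state["previous"]  # revert to last known-good version
            ssm.put_parameter(Name=param_name, Type="String",
                              Value=json.dumps(state), Overwrite=True)
            return "rolled-back"

        if deployment.get("rolloutState") == "COMPLETED":
            return "deployed"
        time.sleep(30)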

FAQs

Why not use AWS CodePipeline to manage deployments?

CodePipeline appears to be a viable alternative for managing the update of both the underlying Docker image and the deployed files/configuration. The advantage of CodePipeline is that it provides pre-made mechanisms for deployment safety (especially automated rollbacks). The disadvantage is that it would require a substantial rework of the existing repo to leverage properly and would add quite a few additional components to our solution. If we run into problems with the suggested approach above, we should consider this a plausible backup plan.

Why not use AWS AppConfig?

AWS AppConfig isn't designed to perform this type of configuration. Its sweet spot is distributing things like feature flags, allow lists, permissions, etc., that on-host processes can then pull to modify their behavior. It is not designed to distribute files and perform on-disk actions like running scripts.

Why not use a pre-built configuration management tool (SaltStack, Ansible, Chef, etc)?

For the project's current scope, these tools require a substantial upfront investment in infrastructure setup and impose a steep learning curve on users, which seems disproportionate to the additional benefit they would provide.

How can I invoke additional executables during startup?

You can either put the commands you want to execute in the startup.sh script we execute on container start, or include additional executables in the files transferred to the container and invoke them from the startup.sh.

Related Tasks


awick commented Jul 6, 2023

Great write up

Arkime config.ini:
We will deliver them without additional modification to the containers as part of create-cluster and update-config.

I assume we will still do var subs, so this should be wordsmithed a little. For example, I wouldn't want the OS password sitting in plain text on my control computer. A P2 would be: if I want to stick my OIDC password in AWS Service X, it would be nice to do some kind of var sub on it. Example: client_secret={CLIENT_SECRET_VAR}, and there should be an easy way to set that.

Startup Customization section

Would like to discuss this, too much to type, not sure totally on board yet.

Additional files

Not totally sure of the purpose of the manifest.json if you just copy everything in the files directory.

Other:
Something I always struggle with is where the provided versions of all these files live. So, for example, when I first get this project, where is the sample config.ini for capture, and how does it get to ./config/capture/files/config.ini? If that's where it lives in git/rpm, I will forever have merge issues if the user and the developer make changes to the same file and the user updates. One solution I see for this is a new command to copy the shipped versions over to the staged area, but there are other solutions too.


chelma commented Jul 6, 2023

Thanks for the comments!

I assume we will still do var subs so this should be word smithed a little.

Yeah, I see your point. What I mean is that we will deliver the config.ini to the container exactly as the user has it locally (no special magic client-side). However, we also run the startup.sh after it's in the container, where things like seds can happen (and will happen). As an example, we'll still pull our OpenSearch Domain password from Secrets Manager and stick it in the config.ini; that logic will be in startup.sh, which is invoked during container startup.

Would like to discuss this, too much to type, not sure totally on board yet.

Let's discuss; happy to change course if there's a better option

Not totally sure the purpose of the manifest.json

We're not making assumptions that the files the user wants in the container are located in any specific location client-side. Additionally, we want the user to be able to specify any location for a file to be delivered in-container. The bigger issue, however, is: how do you keep track of which client-side file ends up where on the container side? This was my simplified solution.

where do the provided versions of all these files live

I was thinking the repo should be runnable "as-is", meaning the configuration files for the default behavior are already in the "correct" location to just "go". I haven't spent too many cycles on how folks version control their changes, but yeah - probably something to think through and open to your thoughts on the topic. I think the manifest.json may help us here since the files don't technically even need to be in this repo as long as they're available client-side.


chelma commented Jul 7, 2023

I think the sequence of operations we perform on container start could be filled in/clarified a bit more. Here's what I'm thinking:

  1. ECS kicks off the container's entry-point command (the run_capture_node.sh/run_viewer_node.sh). The process should be the same for both viewer and capture nodes so we'll just refer to one of the two scripts.
  2. run_capture_node.sh prints the env variables, as it currently does, and maybe other "config agnostic" activities
  3. run_capture_node.sh invokes another script embedded in the container image (bootstrap_config.sh, or something like that).
  4. bootstrap_config.sh pulls the config and other files the user originally specified in their manifest.json from s3/wherever, moves them to the correct location on disk in the container, and then invokes the configure.sh script that the user defined client side (originally called startup.sh above).
    • We make bootstrap_config.sh a separate script so that later we could do things like use AWS System Manager RunCommand to pull updated config outside of the container startup process.
    • This needs to be embedded in the container image because we need something within the image to pull our files from S3 or wherever initially (a rough sketch of this flow follows the list).
  5. configure.sh executes all user-specified actions. By default, this would be things like pulling the OpenSearch Domain admin password from AWS SecretsManager and using sed to stick it in the config.ini file. The user is free to add whatever logic they want to it, including invoking other executables pulled from S3 to disk in the previous step.
  6. run_capture_node.sh starts the Arkime process
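
The in-image bootstrap would most likely be a bash script, but as a sketch of the same flow (the parameter name, bucket layout, and paths are hypothetical):

import json
import subprocess
import tarfile

import boto3

def bootstrap_config(cluster: str, component: str, bucket: str, dest: str = "/arkime_config"):
    """Read the current config version from Parameter Store, pull the matching
    bundle from S3, unpack it, and hand off to the user's configure.sh."""
    ssm = boto3.client("ssm")
    s3 = boto3.client("s3")

    param_name = f"/arkime/{cluster}/{component}/config"
    state = json.loads(ssm.get_parameter(Name=param_name)["Parameter"]["Value"])
    key = f"{component}/v{state['current']}/config.tgz"

    s3.download_file(bucket, key, "/tmp/config.tgz")
    with tarfile.open("/tmp/config.tgz") as bundle:
        bundle.extractall(dest)

    # configure.sh performs all user-specified actions (sed'ing secrets into
    # config.ini, relocating files, etc.) before Arkime itself is started
    subprocess.run(["/bin/bash", f"{dest}/configure.sh"], check=True)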


chelma commented Jul 7, 2023

Regarding where user will stick their own configuration, I think the manifest.json will help a bunch here. First, we should probably talk about what will be inside this file. Maybe a format like:

{
    "configure_script_path": "./local/relative/path/configure.sh",
    "files_map": [
        {
            "local_path": "./relative/path/config.ini",
            "container_path": "/opt/arkime/etc/config.ini"
        },
        {
            "local_path": "./relative/path/default.rules",
            "container_path": "/opt/arkime/etc/default.rules"
        },
        {
            "local_path": "/absolute/path/my_script.sh",
            "container_path": "/usr/bin/my_script.sh"
        }
    ]
}

We keep track of the configure.sh script separately because it's really the only thing the bootstrap_config.sh needs to know the specific location of during the config installation process so it can kick it off. When we create our tarball locally we can stick it in a standard location so we'll know where it ends up in-container when we unpack the tarball.

Anyways, this enables us to do something like what's described above and stick all our "default" files in the repo so it will just "go" when the user runs create-cluster:

. # repo root
./default_config/capture/manifest.json     # The Capture Nodes' mapping file
./default_config/capture/configure.sh     # The Capture Nodes' customizable configuration script
./default_config/capture/files/config.ini     # The Capture Nodes' INI file
./default_config/capture/files/default.rules     # The default rules file, currently embedded in-image
./default_config/viewer/manifest.json     # The Viewer Nodes' mapping file
./default_config/viewer/configure.sh     # The Viewer Nodes' customizable configuration script
./default_config/viewer/files/config.ini     # The Viewer Nodes' INI file

By default, create-cluster and update-config can point to the ./default_config/capture/manifest.json and ./default_config/viewer/manifest.json manifest files, but we can add command-line args to let the user specify paths to different manifest file locations. This should enable them to put both the manifests and any/all files they want to end up on the containers wherever they want.
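
A sketch of how client-side packaging could consume a manifest like the one above, placing the configure script at a fixed name inside the tarball so bootstrap_config.sh knows where to find it (the bundle layout is illustrative, not a settled format):

import json
import tarfile
from pathlib import Path

def package_from_manifest(manifest_path: str, out_path: str = "config.tgz") -> str:
    """Build a config bundle from a manifest.json: the configure script goes to a
    standard name, and the manifest rides along so the in-container bootstrap
    knows each file's container_path after unpacking."""
    manifest = json.loads(Path(manifest_path).read_text())

    with tarfile.open(out_path, "w:gz") as bundle:
        bundle.add(manifest["configure_script_path"], arcname="configure.sh")
        bundle.add(manifest_path, arcname="manifest.json")
        for entry in manifest["files_map"]:
            # Store each file under its local name; it gets moved to
            # entry["container_path"] once unpacked in-container
            bundle.add(entry["local_path"], arcname=Path(entry["local_path"]).name)
    return out_path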


chelma commented Jul 7, 2023

Had a great convo w/ @awick about the proposal. Some takeaways:

  • We can generate the config directory with default values for config.ini, configure.sh, etc. the first time the user invokes create-cluster. We'll want to make these directories tied to the specific cluster, though, so the user can manage multiple clusters simultaneously. That way, we're guaranteed that even if those files are checked into a user's fork, they will never present a merge conflict when pulling from the arkime/aws-aio main repo.
  • We can probably ditch the manifest file if we use convention/relative paths to track the location of the configure.sh file and rely on mkdir/mv commands in the configure.sh to reposition stuff from the unpacked tarball in-container to the correct location on-disk.
  • There's a need for some additional commands for config management
    • config-list: get a listing of config bundles for a given cluster (e.g. what's in S3?)
    • config-pull: pull a specified config bundle from S3 to the client
    • config-update --version: a way to deploy a specific, historical version of the config to the cluster
  • We should consider moving to "Reverse Polish" naming (create-cluster -> cluster-create, list-clusters -> cluster-list) now that we're accumulating more commands
  • We should embed the AWS AIO version into the s3 objects; should be easy to do with S3 tags


chelma commented Jul 7, 2023

Thought about things a bit more and think it makes sense to ditch the manifest.json for now. We'll have updated logic as follows.

  • We can stick our "default" configuration files in our source directory tree to get them out of the way and not confuse users. Since it's Python that will be managing the lifecycle of these files, this seems pretty reasonable:
. # repo root
./manage_arkime/arkime_config/capture/configure.sh
./manage_arkime/arkime_config/capture/config.ini
./manage_arkime/arkime_config/capture/default.rules
./manage_arkime/arkime_config/viewer/configure.sh
./manage_arkime/arkime_config/viewer/config.ini
  • When the user invokes cluster-create, we copy them into a cluster-specific directory if that directory doesn't exist already:
. # repo root
./config_<CLUSTER NAME>/capture/configure.sh
./config_<CLUSTER NAME>/capture/config.ini
./config_<CLUSTER NAME>/capture/default.rules
./config_<CLUSTER NAME>/viewer/configure.sh
./config_<CLUSTER NAME>/viewer/config.ini
  • We tar the ./config_<CLUSTER NAME>/capture/ and ./config_<CLUSTER NAME>/viewer/ directories and stick them in S3 (sketched below). We confirm configure.sh exists in both the Capture and Viewer trees, since that's the one touchpoint we're directly referencing in the code embedded in the container image (via bootstrap_config.sh).
  • If files need to be moved to specific locations in the container, we'll do that in configure.sh when it's invoked in-container.
  • If the user has new files they need in the container, they can stick them in those directories and relocate, as needed, with commands in the configure.sh files.
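
A sketch of what that tarball/upload step could look like in the Python CLI, including the aws-aio version tag on the S3 object mentioned earlier (the key scheme and tag name are assumptions):

import tarfile
from pathlib import Path

import boto3

def upload_config_bundle(cluster: str, component: str, bucket: str,
                         version: int, aws_aio_version: str) -> str:
    """Tar a cluster-specific config directory and push it to S3, tagging the
    object with the aws-aio version."""
    config_dir = Path(f"./config_{cluster}/{component}")

    # configure.sh is the one file the in-container bootstrap references directly
    if not (config_dir / "configure.sh").is_file():
        raise FileNotFoundError(f"{config_dir}/configure.sh is required")

    archive = f"/tmp/{cluster}_{component}_v{version}.tgz"
    with tarfile.open(archive, "w:gz") as bundle:
        bundle.add(str(config_dir), arcname=".")

    key = f"{component}/v{version}/config.tgz"
    boto3.client("s3").upload_file(
        archive, bucket, key,
        ExtraArgs={"Tagging": f"aws_aio_version={aws_aio_version}"},
    )
    return key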


awick commented Jul 7, 2023

LGTM


chelma commented Jul 10, 2023

Plan looks something like this:

* Create S3 bucket on `create-cluster`; destroy S3 bucket on `destroy-cluster`
* Create user config directory from master on `create-cluster`
* Code to tarball config/upload to s3/update param store; tag S3 and Param Store with `aws-aio` version(s)
* Update `create-cluster` to perform tarball upload if no existing config
* Update container config/logic to use tarball
* README update
* Create `config-update` command to update the existing config with new changes
* Implement/test rollback behavior
* README update
* Change command names to reverse-polish notation
* README update
* Add new config-related commands and options:
    * `config-list`: get a listing of config bundles for a given cluster (e.g. what's in S3?)
    * `config-pull`: pull a specified config bundle from S3 to the client
    * `config-update --version`: a way to deploy a specific, historical version of the config to the cluster
* README update
