Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Undocumented retries and inability to remove that functionality #15

Open
mschultz-aofl opened this issue Jul 21, 2022 · 5 comments
Open

Comments

@mschultz-aofl
Copy link

By default, the ec2-macos-init application will retry execution of failed sections. This doesn't appear documented anywhere, and can cause unexpected side effects. As an example, I have a script that configures MacOS certificates and registers the instance as a runner. As the line of my shell script failed came after registration, the machine registered itself with gitlab 101 times before finally stopping. This behavior needs to be modifiable and well documented, as it can cause serious side effects if not known.

Example TOML added:

  Name = "Execute_Startup"
  PriorityGroup = 3 # Fourth group
  RunOnce = true # Run once, ever
  FatalOnError = false # Stop running Init if there is an error 
  [Module.Command]
    Cmd = ["/usr/local/aws/ec2-macos-init/startup.sh"] # A simple command
    RunAsUser = "ec2-user" # Run as ec2-user

And the shell script:

#!/bin/bash

curl -L https://www.apple.com/certificateauthority/AppleWWDRCAG3.cer > g3.cer

sudo security authorizationdb write com.apple.trust-settings.admin allow
sudo security add-trusted-cert -d -r trustAsRoot -k /Library/Keychains/System.keychain g3.cer

security set-key-partition-list -S apple-tool:,apple: -s -k "${password}" login.keychain-db

/usr/local/opt/gitlab-runner/bin/gitlab-runner register --non-interactive --url 'REDACTED'--executor 'shell' --run-untagged='false' --tag-list 'mac' --registration-token REDACTED

git lfs install

Note that git lfs install failed due to a missing $HOME env var, causing it to import the certificate many times and register itself with gitlab 100 times.

My recommended fix would be to accept a 'RetryCount' option in the TOML to make this configurable, and explicitly set the default within the TOML, removing the const variable.

@mieliespoor
Copy link

mieliespoor commented Jul 25, 2022

@mschultz-aofl a question unrelated to the issue here, but related to your script. The ${password} you use there for the login keychain, would this be the password set by macos-init or where do you get this from? Asking as I have some credentials and certificates that need to be added to the login keychain, but need the password for this to work, so am a bit stuck on that point.

@mschultz-aofl
Copy link
Author

I use packer to generate the shell script and toss it in that path, with the password in there. It's a restricted path, and the password is set per machine so it's not really sensitive. Not that the Packer I use resets the password to a known password - it doesn't use the reset script that's a part of this tool.

So the setup is, within packer:

  1. Set password
  2. Reboot
  3. Create /usr/local/aws/ec2-macos-init/startup.sh with the password hard-coded on it
  4. Modify /usr/local/aws/ec2-macos-init/init.toml to remove password reset and enable the script above
  5. Create AMI
  6. Launch ami via Terraform

@mattcataws
Copy link
Contributor

mattcataws commented Aug 3, 2022

Hey @mschultz-aofl, thanks for opening this issue to track this. My first guess at what's happening is the logic behind ec2-macos-init's RetryOnce option. When this is present in a module, ec2-macos-init will keep attempting to run the module (once per run) until it marks a successful run in the history. See this block in Module.ShouldRun():

if m.RunOnce {
for _, instance := range history {
// Check every module history for that instance
for _, moduleHistory := range instance.ModuleHistories {
if key == moduleHistory.Key && moduleHistory.Success {
// If there is a matching key and it completed successfully, it doesn't need to be run
return false
}
}
}
// If no instances match the instance history, run the module
return true
}

I like your suggestion of adding a new config value to set the maximum number of attempts a module is run unsuccessfully. We'll need to do a bit more investigation into that addition as well as this issue as a whole.

@mschultz-aofl
Copy link
Author

So I don't think it's there. If you look here:

func (c *InitConfig) RetriesExceeded() (exceeded bool, err error) {
// Check for the existence of the temporary file and get the current fatal count
err = c.FatalCounts.readFatalCount()
if err != nil {
return false, fmt.Errorf("ec2macosinit: unable to read fatal counts: %s", err)
}
// If there have been more than the limit of fatal exits, return true
if c.FatalCounts.Count > PerBootFatalLimit {
return true, nil
}
// Otherwise, continue
return false, nil
}

You can see it retries 100 times. I think just modifying this would be the easiest thing to do. However, I'm not too familiar with Go - but as to why it does this, I think either (a) history isn't re-read/modified in memory, so the running process doesn't pick up the failure, or (b), the 100 retry logic bypasses ShouldRun.

@mattcataws
Copy link
Contributor

I don't that's the case for the module that you've provided as it sets FatalOnError = false. When the module fails, ec2-maocs-init will continue to run and that InitConfig.FatalCount type will not be incremented.

You can double check this by seeing what's written in the log file /var/log/amazon/ec2/ec2-macos-init.log. If there's a fatal call due to the Execute_Startup module you have defined, there should be a log line like the following:

Exiting after <runtime> due to failure in module [Execute_Startup] with FatalOnError set

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants