Skip to content
This repository has been archived by the owner on Jul 25, 2023. It is now read-only.

Aegis Safe v0.18.0 appears to be broken #476

Closed
v0lkan opened this issue Jul 9, 2023 · 7 comments
Closed

Aegis Safe v0.18.0 appears to be broken #476

v0lkan opened this issue Jul 9, 2023 · 7 comments
Assignees
Labels
bug Something isn't working good first issue Good for newcomers

Comments

@v0lkan
Copy link
Contributor

v0lkan commented Jul 9, 2023

Readiness probe failing; could be related to an environment configuration.

                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  34s               default-scheduler  Successfully assigned aegis-system/aegis-safe-6b4bc89c78-jdbjs to minikube
  Normal   Pulled     3s (x2 over 33s)  kubelet            Container image "localhost:5000/aegis-ist-safe:0.18.0" already present on machine
  Normal   Created    3s (x2 over 33s)  kubelet            Created container main
  Normal   Started    3s (x2 over 33s)  kubelet            Started container main
  Warning  Unhealthy  1s (x7 over 32s)  kubelet            Readiness probe failed: Get "http://10.244.0.9:8082/": dial tcp 10.244.0.9:8082: connect: connection refused

@v0lkan v0lkan added this to Aegis Jul 9, 2023
@v0lkan v0lkan converted this from a draft issue Jul 9, 2023
@v0lkan v0lkan added the bug Something isn't working label Jul 9, 2023
@v0lkan v0lkan moved this from 🏗 In progress to 📋 Backlog in Aegis Jul 9, 2023
@v0lkan
Copy link
Contributor Author

v0lkan commented Jul 9, 2023

Bug confirmed.

There seeems to be something with the bootstrapping sequence:

~/WORKSPACE/aegis (main) 🐢⚡️ k logs aegis-safe-6b4bc89c78-7gpl5 -n aegis-system
2023/07/09 20:15:12 [INFO ] AEGISAFE Acquiring identity…
2023/07/09 20:15:12 [INFO ] AEGISAFE Secret has been set in the cluster, will reuse it
2023/07/09 20:15:12 [INFO ] AEGISAFE Updated age key.
2023/07/09 20:15:12 [INFO ] AEGISAFE remaining: 2

@v0lkan
Copy link
Contributor Author

v0lkan commented Jul 9, 2023

more; spire server acting funky:

~/WORKSPACE/aegis (main) 🐢⚡️ k logs  spire-server-6fb4f57c8-6dcpc -n spire-system
Defaulted container "spire-server" out of: spire-server, spire-controller-manager
time="2023-07-09T20:14:32Z" level=warning msg="Current umask 0022 is too permissive; setting umask 0027"
time="2023-07-09T20:14:32Z" level=info msg=Configured admin_ids="[]" data_dir=/run/spire/server/data
time="2023-07-09T20:14:32Z" level=info msg="Opening SQL database" db_type=sqlite3 subsystem_name=sql
time="2023-07-09T20:14:32Z" level=info msg="Initializing new database" subsystem_name=sql
time="2023-07-09T20:14:32Z" level=info msg="Connected to SQL database" read_only=false subsystem_name=sql type=sqlite3 version=3.39.4
time="2023-07-09T20:14:32Z" level=info msg="Plugin loaded" external=false plugin_name=k8s_psat plugin_type=NodeAttestor subsystem_name=catalog
time="2023-07-09T20:14:32Z" level=info msg="Plugin loaded" external=false plugin_name=disk plugin_type=KeyManager subsystem_name=catalog
time="2023-07-09T20:14:32Z" level=info msg="Plugin loaded" external=false plugin_name=k8sbundle plugin_type=Notifier subsystem_name=catalog
time="2023-07-09T20:14:32Z" level=debug msg="Loading journal" path=/run/spire/server/data/journal.pem subsystem_name=ca_manager
time="2023-07-09T20:14:32Z" level=info msg="Journal loaded" jwt_keys=0 subsystem_name=ca_manager x509_cas=0
time="2023-07-09T20:14:32Z" level=debug msg="Preparing X509 CA" slot=A subsystem_name=ca_manager
time="2023-07-09T20:14:32Z" level=info msg="X509 CA prepared" expiration="2023-07-10 20:14:32 +0000 UTC" issued_at="2023-07-09 20:14:32.460960018 +0000 UTC" self_signed=true slot=A subsystem_name=ca_manager
time="2023-07-09T20:14:32Z" level=info msg="X509 CA activated" expiration="2023-07-10 20:14:32 +0000 UTC" issued_at="2023-07-09 20:14:32.460960018 +0000 UTC" slot=A subsystem_name=ca_manager
time="2023-07-09T20:14:32Z" level=debug msg="Successfully rotated X.509 CA" subsystem_name=ca_manager trust_domain_id="spiffe://aegis.ist" ttl=86399.531431212
time="2023-07-09T20:14:32Z" level=debug msg="Preparing JWT key" slot=A subsystem_name=ca_manager
time="2023-07-09T20:14:32Z" level=info msg="JWT key prepared" expiration="2023-07-10 20:14:32.468688879 +0000 UTC" issued_at="2023-07-09 20:14:32.468688879 +0000 UTC" slot=A subsystem_name=ca_manager
time="2023-07-09T20:14:32Z" level=info msg="JWT key activated" expiration="2023-07-10 20:14:32.468688879 +0000 UTC" issued_at="2023-07-09 20:14:32.468688879 +0000 UTC" slot=A subsystem_name=ca_manager
time="2023-07-09T20:14:32Z" level=debug msg="Rotating server SVID" subsystem_name=svid_rotator
time="2023-07-09T20:14:32Z" level=debug msg="Signed X509 SVID" expiration="2023-07-09T21:14:32Z" spiffe_id="spiffe://aegis.ist/spire/server" subsystem_name=svid_rotator
time="2023-07-09T20:14:32Z" level=info msg="Building in-memory entry cache" subsystem_name=endpoints
time="2023-07-09T20:14:32Z" level=info msg="Completed building in-memory entry cache" subsystem_name=endpoints
time="2023-07-09T20:14:32Z" level=info msg="Serving bundle endpoint" addr="0.0.0.0:8443" subsystem_name=endpoints
time="2023-07-09T20:14:32Z" level=debug msg="Initializing API endpoints" subsystem_name=endpoints
time="2023-07-09T20:14:32Z" level=info msg="Starting Server APIs" address="[::]:8081" network=tcp subsystem_name=endpoints
time="2023-07-09T20:14:32Z" level=info msg="Starting Server APIs" address=/tmp/spire-server/private/api.sock network=unix subsystem_name=endpoints
time="2023-07-09T20:14:32Z" level=debug msg="Initializing health checkers" subsystem_name=health
time="2023-07-09T20:14:32Z" level=info msg="Serving health checks" address="0.0.0.0:8080" subsystem_name=health
time="2023-07-09T20:14:32Z" level=debug msg="Notifier handled event" event="bundle loaded" notifier=k8sbundle subsystem_name=ca_manager
time="2023-07-09T20:14:35Z" level=debug msg="Signed X509 SVID" authorized_as=local authorized_via=transport dns_name=spire-controller-manager-webhook-service.spire-system.svc expiration="2023-07-10T20:14:32Z" method=MintX509SVID request_id=cc9eede1-70de-4a89-9ce4-45fd31fe762e serial_num=220647600972895169213478345007152105879 service=svid.v1.SVID spiffe_id="spiffe://aegis.ist/spire-controller-manager-webhook" subject= subsystem_name=api
time="2023-07-09T20:14:38Z" level=info msg="Agent attestation request completed" address="10.244.0.1:24714" agent_id="spiffe://aegis.ist/spire/agent/k8s_psat/aegis-cluster/58ec9779-b69b-467f-8daa-6d425168e242" authorized_as=nobody authorized_via= caller_addr="10.244.0.1:24714" method=AttestAgent node_attestor_type=k8s_psat request_id=2213b68f-889d-42f5-b93e-47d7f71f60e6 service=agent.v1.Agent subsystem_name=api
time="2023-07-09T20:16:36Z" level=debug msg="Signed X509 SVID" authorized_as=local authorized_via=transport dns_name=spire-controller-manager-webhook-service.spire-system.svc expiration="2023-07-10T20:14:32Z" method=MintX509SVID request_id=5ae39f53-be75-4331-b0a9-c6424eff2f67 serial_num=134583869554082741372914276644777735286 service=svid.v1.SVID spiffe_id="spiffe://aegis.ist/spire-controller-manager-webhook" subject= subsystem_name=api
time="2023-07-09T20:19:09Z" level=debug msg="Signed X509 SVID" authorized_as=local authorized_via=transport dns_name=spire-controller-manager-webhook-service.spire-system.svc expiration="2023-07-10T20:14:32Z" method=MintX509SVID request_id=f6b507ca-e407-41b1-a50c-72d03c0e5a86 serial_num=274273112340796017457525047861804262516 service=svid.v1.SVID spiffe_id="spiffe://aegis.ist/spire-controller-manager-webhook" subject= subsystem_name=api
time="2023-07-09T20:21:51Z" level=debug msg="Signed X509 SVID" authorized_as=local authorized_via=transport dns_name=spire-controller-manager-webhook-service.spire-system.svc expiration="2023-07-10T20:14:32Z" method=MintX509SVID request_id=006bbfc9-29c3-40a4-86cf-0f9b619a062f serial_num=161114409812572787252179367528088440062 service=svid.v1.SVID spiffe_id="spiffe://aegis.ist/spire-controller-manager-webhook" subject= subsystem_name=api
time="2023-07-09T20:25:00Z" level=debug msg="Signed X509 SVID" authorized_as=local authorized_via=transport dns_name=spire-controller-manager-webhook-service.spire-system.svc expiration="2023-07-10T20:14:32Z" method=MintX509SVID request_id=6b1e06f4-bc86-4cca-8a8b-e26813420e50 serial_num=308959287486341623984535810580801604004 service=svid.v1.SVID spiffe_id="spiffe://aegis.ist/spire-controller-manager-webhook" subject= subsystem_name=api
time="2023-07-09T20:28:45Z" level=debug msg="Signed X509 SVID" authorized_as=local authorized_via=transport dns_name=spire-controller-manager-webhook-service.spire-system.svc expiration="2023-07-10T20:14:32Z" method=MintX509SVID request_id=b9f57687-f86f-4187-b0dd-5ade78ad4718 serial_num=209113354354863514649845506184842233746 service=svid.v1.SVID spiffe_id="spiffe://aegis.ist/spire-controller-manager-webhook" subject= subsystem_name=api
time="2023-07-09T20:33:43Z" level=debug msg="Signed X509 SVID" authorized_as=local authorized_via=transport dns_name=spire-controller-manager-webhook-service.spire-system.svc expiration="2023-07-10T20:14:32Z" method=MintX509SVID request_id=7585fd84-888b-4ea0-8009-66d554bf682b serial_num=73387018949878136766962404638821297169 service=svid.v1.SVID spiffe_id="spiffe://aegis.ist/spire-controller-manager-webhook" subject= subsystem_name=api
time="2023-07-09T20:41:08Z" level=debug msg="Signed X509 SVID" authorized_as=local authorized_via=transport dns_name=spire-controller-manager-webhook-service.spire-system.svc expiration="2023-07-10T20:14:32Z" method=MintX509SVID request_id=76b317a5-9da4-401d-9b0f-dbd1893dfc5d serial_num=226485357373726696769043468056246421737 service=svid.v1.SVID spiffe_id="spiffe://aegis.ist/spire-controller-manager-webhook" subject= subsystem_name=api

That might also be the reason for safe bootstrap failing.

more:

Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  36m                 default-scheduler  Successfully assigned spire-system/spire-server-6fb4f57c8-6dcpc to minikube
  Normal   Pulling    36m                 kubelet            Pulling image "ghcr.io/spiffe/spire-server:1.6.3"
  Normal   Pulled     36m                 kubelet            Successfully pulled image "ghcr.io/spiffe/spire-server:1.6.3" in 2.192539709s (3.846152012s including waiting)
  Normal   Created    36m                 kubelet            Created container spire-server
  Normal   Started    36m                 kubelet            Started container spire-server
  Normal   Pulling    36m                 kubelet            Pulling image "ghcr.io/spiffe/spire-controller-manager:nightly"
  Normal   Pulled     36m                 kubelet            Successfully pulled image "ghcr.io/spiffe/spire-controller-manager:nightly" in 2.192111448s (2.963491192s including waiting)
  Normal   Created    26m (x5 over 36m)   kubelet            Created container spire-controller-manager
  Normal   Started    26m (x5 over 36m)   kubelet            Started container spire-controller-manager
  Normal   Pulled     26m (x4 over 34m)   kubelet            Container image "ghcr.io/spiffe/spire-controller-manager:nightly" already present on machine
  Warning  BackOff    23s (x75 over 32m)  kubelet            Back-off restarting failed container spire-controller-manager in pod spire-server-6fb4f57c8-6dcpc_spire-system(ed1688e0-1e49-4beb-9585-dbcedebd4af3)

can be related to spire-controller-manager too.

@v0lkan
Copy link
Contributor Author

v0lkan commented Jul 9, 2023

I’ll check if there is an update version for SPIRE server.

@v0lkan
Copy link
Contributor Author

v0lkan commented Jul 9, 2023

FYI @abhishek44sharma this is related to the issue that you are having:

spiffe/spire-controller-manager#177

To fix this,

  1. we need to add the CRD to the yaml files
  2. and we need to fix the version of the controller manager to something other than nightly so that similar breaking changes do not impact Aegis Safe.

@v0lkan v0lkan added the good first issue Good for newcomers label Jul 9, 2023
@abhishek44sharma
Copy link
Contributor

/assign

@abhishek44sharma
Copy link
Contributor

This issue is fixed.
For aegis-safe readiness probe failure event logged, since initially it takes time to acquire svid and update secret. Caused delay in readiness probe creation. Eventually readiness probe is coming up fine now.
To avoid readiness probe failure event getting logged we can increase initialDelaySeconds time in aegis-safe deployment to avoid this event.

@github-project-automation github-project-automation bot moved this from 📋 Backlog to ✅ Done in Aegis Jul 22, 2023
@abhishek44sharma
Copy link
Contributor

abhishek44sharma commented Jul 22, 2023

This issue is fixed.
For aegis-safe readiness probe failure event is logged, since initially it takes time to acquire svid and update secret. Causes delay in readiness probe creation. Eventually readiness probe is coming up fine now.
To avoid readiness probe failure event getting logged, we can increase initialDelaySeconds time for readiness probe in aegis-safe deployment.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working good first issue Good for newcomers
Projects
Status: Done
Development

No branches or pull requests

2 participants