
Agents do not survive Jenkins restart #691

Closed
613andred opened this issue Dec 20, 2021 · 14 comments
Labels
bug (Something isn't working), stale

Comments

@613andred

Describe the bug

Agents are currently unable to reconnect to Jenkins after it is restarted, which is problematic for any long-running pipeline.

This functionality is critical to our use of Jenkins.

This issue has previously been raised in #542.

There are several problems causing this issue:

  • The backup restore happens too late in the Jenkins lifecycle
  • Missing Jenkins state for agent re-connection

To Reproduce

Create a long-running pipeline, e.g. a pipeline with the step sh 'sleep 300' (a minimal example is sketched below).
Restart Jenkins by deleting the pod.
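For reference, a minimal pipeline that keeps an agent busy long enough to restart the controller; the java-11 label is only an illustration (borrowed from the agent name in the logs below), any Kubernetes pod-template label will do:

node('java-11') {
    stage('Long sleep') {
        // Keep the agent connected while the Jenkins pod is deleted
        sh 'sleep 300'
    }
}

While the step is running, delete the Jenkins master pod; once the new pod is up, the agent fails to reconnect with the errors shown below.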

Additional information

Agent logs:

Dec 03, 2021 3:43:49 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
INFO: Failed to connect to the master. Will try again: java.net.ConnectException Connection refused (Connection refused)
Dec 03, 2021 3:44:00 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Performing onReconnect operation.
Dec 03, 2021 3:44:00 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: onReconnect operation failed.
Dec 03, 2021 3:44:00 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among [http://jenkins-operator-http-jenkins.jenkins-operator-dev.svc.cluster.local:8080/]
Dec 03, 2021 3:44:00 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
Dec 03, 2021 3:44:00 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
INFO: Remoting TCP connection tunneling is enabled. Skipping the TCP Agent Listener Port availability check
Dec 03, 2021 3:44:00 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Agent discovery successful
  Agent address: jenkins-operator-slave-jenkins.jenkins-operator-dev.svc.cluster.local
  Agent port:    50000
  Identity:      53:e7:06:b5:b0:1b:2d:13:e9:22:f6:5a:79:4f:ef:69
Dec 03, 2021 3:44:00 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
Dec 03, 2021 3:44:00 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to jenkins-operator-slave-jenkins.jenkins-operator-dev.svc.cluster.local:50000
Dec 03, 2021 3:44:00 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Trying protocol: JNLP4-connect
Dec 03, 2021 3:44:00 PM org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader run
INFO: Waiting for ProtocolStack to start.
Dec 03, 2021 3:44:00 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Remote identity confirmed: 53:e7:06:b5:b0:1b:2d:13:e9:22:f6:5a:79:4f:ef:69
Dec 03, 2021 3:44:00 PM org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer onRecv
INFO: [JNLP4-connect connection to jenkins-operator-slave-jenkins.jenkins-operator-dev.svc.cluster.local/10.0.56.118:50000] Local headers refused by remote: Unknown client name: img-agent-java-11-fdqph
Dec 03, 2021 3:44:00 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Protocol JNLP4-connect encountered an unexpected exception
java.util.concurrent.ExecutionException: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: Unknown client name: img-agent-java-11-fdqph
        at org.jenkinsci.remoting.util.SettableFuture.get(SettableFuture.java:223)
        at hudson.remoting.Engine.innerRun(Engine.java:778)
        at hudson.remoting.Engine.run(Engine.java:540)
Caused by: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: Unknown client name: img-agent-java-11-fdqph
        at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.newAbortCause(ConnectionHeadersFilterLayer.java:378)
        at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.onRecvClosed(ConnectionHeadersFilterLayer.java:433)
        at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:825)
        at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:288)
        at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:170)
        at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:825)
        at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
        at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$1500(BIONetworkLayer.java:49)
        at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:255)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:122)
        at java.base/java.lang.Thread.run(Thread.java:829)
        Suppressed: java.nio.channels.ClosedChannelException
                ... 7 more

Dec 03, 2021 3:44:00 PM hudson.remoting.jnlp.Main$CuiListener error
SEVERE: The server rejected the connection: None of the protocols were accepted
java.lang.Exception: The server rejected the connection: None of the protocols were accepted
        at hudson.remoting.Engine.onConnectionRejected(Engine.java:864)
        at hudson.remoting.Engine.innerRun(Engine.java:804)
        at hudson.remoting.Engine.run(Engine.java:540)

Jenkins logs:

2021-12-03 15:44:00.039+0000 [id=87]    INFO    h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #1 from /10.163.64.241:53150
2021-12-03 15:44:00.122+0000 [id=89]    INFO    o.j.r.p.i.ConnectionHeadersFilterLayer#onRecv: [JNLP4-connect connection from 10.163.64.241/10.163.64.241:53150] Refusing headers from remote: Unknown client name: img-agent-java-11-fdqph

Workaround

This workaround is not great, as it breaks the principles of the Jenkins operator.

  • Disable backup
  • Create PVC for
    • /var/lib/jenkins/nodes
    • /var/lib/jenkins/secrets
    • /var/lib/jenkins/jobs

Proposed fix

  • Make the restore step happen earlier in the Jenkins lifecycle
    • Use an init container (anything after Jenkins startup is too late, as there would be a small window in which an agent could try to reconnect before the state is restored)
    • Remove the restore from the operator reconcile loop
  • Add nodes and secrets state to the backup

If you accept the proposal, I can provide a PR to resolve this issue.

I believe this solution would also address #607 and #679.

@613andred 613andred added the bug label Dec 20, 2021
@613andred 613andred changed the title Agents do not survive restart of the Jenkins Agents do not survive Jenkins restart Dec 20, 2021
@thecooldrop

Thanks for raising this. We have some longer-running pipelines ourselves, and we are migrating Jenkins into our cluster. I was wondering about this just today.

@mortenbirkelund

@613andred Sounds like a great proposal. Is there anything we can do to help out?

@stale

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this issue is still affecting you, just comment with any updates and we'll keep it open. Thank you for your contributions.

@stale stale bot added the stale label Apr 16, 2022
@613andred
Author

Unfortunately I have been very busy as of late, so I have not been able to put together a PR.

Ideally I would like to get input from the maintainers before I go ahead.

@stale stale bot removed the stale label Apr 20, 2022
@rosscdh

rosscdh commented Jun 24, 2022

@613andred this would be a great addition; we suffer from this issue as well, and getting remote agents to reconnect with new tokens is not a great strategy.

Any input from the maintainers would go a long way here.

@brokenpip3
Collaborator

brokenpip3 commented Mar 7, 2023

Make the restore step happen earlier in the Jenkins lifecycle
Use init container (anything after the Jenkins startup is too late as there would be a small window for an agent to reconnect before state is restored)
Remove from the operator reconcile loop

I like this proposal, in particular the concept of performing the restore before Jenkins starts. If you want to try to make a PR to address this, that would be great; at the moment the priority is to fix the current code and release a new version.

Add nodes and secrets state to backup

I do not understand this part: are we talking about Kubernetes jobs here, or about VM/EC2/static Jenkins nodes?

@brokenpip3
Collaborator

Coming back to this issue, I believe we should do something about it: apart from Kubernetes or SSH agents, it is not possible to create any permanent inbound agent, because after a restart the inbound agent secrets are regenerated (this is also what causes the Kubernetes agents' inability to reconnect).

From what I understood after investigating, when you create a Jenkins node the node secret is stored in the DefaultConfidentialStore, which is not transparent to end users and stores secrets as files under $JENKINS_HOME/secrets (correct me if I'm wrong). So after a Jenkins restart that secret is lost forever. I believe there is no way to specify the node secret using "user" credentials (excluding SSH agents, such as EC2 instances from the EC2 plugin, which read an SSH key or username/password from the credentials store) or to tell Jenkins to avoid storing the node secret in the DefaultConfidentialStore.

I tried a couple of things. The first one was to add a persistent layer mounted at /var/lib/jenkins/secrets; the node secret survived a reboot/recreation of the Jenkins master, and my test inbound agent was able to reconnect correctly (via websocket).

These are the options that come to mind to address this issue (in my personal order of preference; other opinions and options are appreciated):

1. Find a way (Groovy/API script) to write to the DefaultConfidentialStore (a read-only sketch is included right after this list). The pro of this approach is that it follows the spirit of this project (no persistent layer at all, everything as code); the con is that it would not rescue the Kubernetes job agents, only the permanent non-SSH nodes (which is OK to me; IMHO a k8s job agent should be born and die while the master is up).

2. This issue's proposal: add an init container that restores the backup before the Jenkins Java process starts, and add the secrets directory to the backup script. The con of this is that each time you want a proper restore you need to restart the Jenkins master.

3. Add an (optional) persistent volume for $JENKINS_HOME/secrets; this would be the easiest to implement but also the furthest from the spirit of this project.
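A rough script-console sketch of what option 1 could build on; it only reads each node's inbound secret through the public getJnlpMac() API so the values can be captured or compared across restarts, while actually writing the underlying key back into the DefaultConfidentialStore is the part that still needs to be worked out:

import hudson.slaves.SlaveComputer
import jenkins.model.Jenkins

// Print the inbound (JNLP) secret of every static node so it can be
// compared before and after a controller restart, or exported "as code".
Jenkins.get().computers.each { c ->
    if (c instanceof SlaveComputer) {
        println "${c.name}: ${c.jnlpMac}"
    }
}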

Any opinion, thought, criticism?

Thanks

@brokenpip3
Collaborator

@brokenpip3 Thanks for the response. Just a quick question: I tried mounting a PV at /var/lib/jenkins/secrets to preserve the secrets and allow agent reconnection after the master comes back up once it's down, but the agent still wasn't able to connect to the master. This is in relation to #691. Any idea how to tackle this?

What exactly did you try? And when you refer to agents, which kind of agents are we talking about?

@dashashutosh24

dashashutosh24 commented May 3, 2023

Screenshot 2023-05-03 at 1 10 06 PM

@brokenpip3 I am referring to the k8s pods that spin up dynamically when a job is run. I added a sleep command to a job and ran it, which brought up a k8s pod as a slave agent, then I killed the master pod. Once the master came back up, the Jenkins slave pod spun up by the job stayed in a NotReady state indefinitely, trying to connect to the master, and I had to delete the pod manually. I tried to mount a PV at the /var/lib/jenkins/secrets path via a PVC, but the slave pod was still unable to connect back to the master once the master came back up.

@brokenpip3
Collaborator

Ok, but did you do this by scaling down the operator, saving the Jenkins pod as code, modifying it, and deleting/applying it again?
About the k8s agents: I never tried that because, like I said, I believe k8s jobs should be tied to the master lifecycle; anyway, I can try to understand whether we can support them.
If you want to make another test, you can do exactly what you did by mounting the PVC for the secrets directory, run a job with a long sleep, and check the agent secret from the console like this:

jenkins.model.Jenkins.getInstance().getComputer("<agentname>").getJnlpMac()

then restart Jenkins and check the secret again.

@dashashutosh24

dashashutosh24 commented May 3, 2023

@brokenpip3 Yes, I modified the Jenkins master and recreated it. I tested it the way you mentioned: it seems the value returned by jenkins.model.Jenkins.getInstance().getComputer("<agentname>").getJnlpMac() is gone after the Jenkins master restart. That is why the slave pod is unable to communicate with the master.
Screenshot 2023-05-03 at 1 48 21 PM
Screenshot 2023-05-03 at 1 48 43 PM
Screenshot 2023-05-03 at 1 53 18 PM
Screenshot 2023-05-03 at 1 53 53 PM

The slave pod logs:
Screenshot 2023-05-03 at 1 55 28 PM
Screenshot 2023-05-03 at 2 00 21 PM

@brokenpip3
Collaborator

So the issue is different in your tests: as you can see, the second Groovy query failed because the node does not exist in the Jenkins configuration, not because the persistent layer does not have the right secret.
I tried instead with a permanent inbound Jenkins agent plus the proper Jenkins CasC, which automatically re-creates the node each time, and the secret (which is a file) persists across pod restarts. Obviously, for Kubernetes on-demand agents it would be hard to write the Jenkins CasC to the ConfigMap each time, because, like I said, IMHO the Kubernetes Jenkins agents should be treated as short-lived agents that last only as long as the master is up.
If you want a permanent Kubernetes agent, you can achieve that with an inbound agent run as a Deployment, and in that case the CasC + /secrets persistent layer will do the job (a rough script-style equivalent is sketched below).
Sadly the secrets PVC is not yet implemented in the code; like I said a few comments ago, I would prefer other solutions. Let's see. :)
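Not the CasC syntax itself, but as a rough illustration of the same idea in script form (for example from an init.groovy.d hook): re-creating a permanent inbound node on every start could look roughly like the sketch below; the node name, remote FS path and label are made up.

import hudson.slaves.DumbSlave
import hudson.slaves.JNLPLauncher
import jenkins.model.Jenkins

def j = Jenkins.get()
// Hypothetical permanent inbound agent, re-created on every start; with the
// /secrets directory persisted, the existing agent can reconnect with its old secret.
if (j.getNode('permanent-inbound-1') == null) {
    def agent = new DumbSlave('permanent-inbound-1', '/home/jenkins/agent', new JNLPLauncher(true))
    agent.numExecutors = 1
    agent.labelString = 'permanent-inbound'
    j.addNode(agent)
}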

@dashashutosh24

@brokenpip3 Thanks for looking into this! I understand your point about the lifecycle of the slave pod being tied to the master; that makes sense. But the issue is that once the master comes back up, the existing slave pods stay in a NotReady state indefinitely. It would be great if the operator would check for orphaned pods and terminate them once the master comes back up. Otherwise it becomes a manual step.

@brokenpip3
Collaborator

Yep, that makes 100% sense.
We can do something around that, but not in 0.8; maybe after the first 0.9 release.

@github-actions github-actions bot added the stale label Jul 3, 2023
@github-actions github-actions bot closed this as not planned Jul 14, 2023