
Agents do not survive Jenkins restart #691

Closed
613andred opened this issue Dec 20, 2021 · 14 comments
Labels
bug (Something isn't working), stale

Comments

@613andred

Describe the bug

Agents are currently unable to reconnect to Jenkins after it is restarted, which is problematic for any long-running pipeline.

This functionality is critical to our use of Jenkins.

This issue has previously been raised in #542.

There are several problems causing this issue:

  • The backup restore happens too late in the Jenkins lifecycle
  • Missing Jenkins state for agent re-connection

To Reproduce

Create a long-running pipeline, e.g. a pipeline with the step sh 'sleep 300' (a minimal example is sketched below).
Restart Jenkins by deleting the pod.
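For reference, a minimal pipeline that keeps an agent busy long enough to restart the controller; the java-11 label is only an illustration (borrowed from the agent name in the logs below), any Kubernetes pod-template label will do:

node('java-11') {
    stage('Long sleep') {
        // Keep the agent connected while the Jenkins pod is deleted
        sh 'sleep 300'
    }
}

While the step is running, delete the Jenkins master pod; once the new pod is up, the agent fails to reconnect with the errors shown below.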

Additional information

Agent logs:

Dec 03, 2021 3:43:49 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
INFO: Failed to connect to the master. Will try again: java.net.ConnectException Connection refused (Connection refused)
Dec 03, 2021 3:44:00 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Performing onReconnect operation.
Dec 03, 2021 3:44:00 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: onReconnect operation failed.
Dec 03, 2021 3:44:00 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among [http://jenkins-operator-http-jenkins.jenkins-operator-dev.svc.cluster.local:8080/]
Dec 03, 2021 3:44:00 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
Dec 03, 2021 3:44:00 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
INFO: Remoting TCP connection tunneling is enabled. Skipping the TCP Agent Listener Port availability check
Dec 03, 2021 3:44:00 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Agent discovery successful
  Agent address: jenkins-operator-slave-jenkins.jenkins-operator-dev.svc.cluster.local
  Agent port:    50000
  Identity:      53:e7:06:b5:b0:1b:2d:13:e9:22:f6:5a:79:4f:ef:69
Dec 03, 2021 3:44:00 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
Dec 03, 2021 3:44:00 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to jenkins-operator-slave-jenkins.jenkins-operator-dev.svc.cluster.local:50000
Dec 03, 2021 3:44:00 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Trying protocol: JNLP4-connect
Dec 03, 2021 3:44:00 PM org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader run
INFO: Waiting for ProtocolStack to start.
Dec 03, 2021 3:44:00 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Remote identity confirmed: 53:e7:06:b5:b0:1b:2d:13:e9:22:f6:5a:79:4f:ef:69
Dec 03, 2021 3:44:00 PM org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer onRecv
INFO: [JNLP4-connect connection to jenkins-operator-slave-jenkins.jenkins-operator-dev.svc.cluster.local/10.0.56.118:50000] Local headers refused by remote: Unknown client name: img-agent-java-11-fdqph
Dec 03, 2021 3:44:00 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Protocol JNLP4-connect encountered an unexpected exception
java.util.concurrent.ExecutionException: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: Unknown client name: img-agent-java-11-fdqph
        at org.jenkinsci.remoting.util.SettableFuture.get(SettableFuture.java:223)
        at hudson.remoting.Engine.innerRun(Engine.java:778)
        at hudson.remoting.Engine.run(Engine.java:540)
Caused by: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: Unknown client name: img-agent-java-11-fdqph
        at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.newAbortCause(ConnectionHeadersFilterLayer.java:378)
        at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.onRecvClosed(ConnectionHeadersFilterLayer.java:433)
        at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:825)
        at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:288)
        at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:170)
        at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:825)
        at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
        at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$1500(BIONetworkLayer.java:49)
        at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:255)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:122)
        at java.base/java.lang.Thread.run(Thread.java:829)
        Suppressed: java.nio.channels.ClosedChannelException
                ... 7 more

Dec 03, 2021 3:44:00 PM hudson.remoting.jnlp.Main$CuiListener error
SEVERE: The server rejected the connection: None of the protocols were accepted
java.lang.Exception: The server rejected the connection: None of the protocols were accepted
        at hudson.remoting.Engine.onConnectionRejected(Engine.java:864)
        at hudson.remoting.Engine.innerRun(Engine.java:804)
        at hudson.remoting.Engine.run(Engine.java:540)

Jenkins logs:

2021-12-03 15:44:00.039+0000 [id=87]    INFO    h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #1 from /10.163.64.241:53150
2021-12-03 15:44:00.122+0000 [id=89]    INFO    o.j.r.p.i.ConnectionHeadersFilterLayer#onRecv: [JNLP4-connect connection from 10.163.64.241/10.163.64.241:53150] Refusing headers from remote: Unknown client name: img-agent-java-11-fdqph

Workaround

This workaround is not great, as it breaks the principles of the Jenkins operator.

  • Disable backup
  • Create PVC for
    • /var/lib/jenkins/nodes
    • /var/lib/jenkins/secrets
    • /var/lib/jenkins/jobs

Proposed fix

  • Make the restore step happen earlier in the Jenkins lifecycle
    • Use an init container (anything after Jenkins startup is too late, as there would be a small window in which an agent could try to reconnect before the state is restored)
    • Remove the restore from the operator reconcile loop
  • Add nodes and secrets state to the backup

If you accept the proposal, I can provide a PR to resolve this issue.

I believe this solution would also address #607 and #679.

@613andred 613andred added the bug label Dec 20, 2021
@613andred 613andred changed the title Agents do not survive restart of the Jenkins Agents do not survive Jenkins restart Dec 20, 2021
@thecooldrop

Thanks for raising this. We have some longer-running pipelines ourselves, and we are migrating Jenkins into our cluster. I was wondering about this just today.

@mortenbirkelund

@613andred Sounds like a great proposal. Is there anything we can do to help out?

@stale

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this issue is still affecting you, just comment with any updates and we'll keep it open. Thank you for your contributions.

@stale stale bot added the stale label Apr 16, 2022
@613andred
Author

Unfortunately I have been very busy as of late, so I have not been able to put together a PR.

Ideally I would like to get input from the maintainers before I go ahead.

@stale stale bot removed the stale label Apr 20, 2022
@rosscdh

rosscdh commented Jun 24, 2022

@613andred this would be a great addition; we suffer from this issue as well, and getting remote agents to reconnect with new tokens is not a great strategy.

Any input from the maintainers would go a long way here.

@brokenpip3
Collaborator

brokenpip3 commented Mar 7, 2023

Make the restore step happen earlier in the Jenkins lifecycle
Use init container (anything after the Jenkins startup is too late as there would be a small window for an agent to reconnect before state is restored)
Remove from the operator reconcile loop

I like this proposal, in particular the concept of performing the restore before Jenkins starts. If you want to try to make a PR to address this, that would be great; at the moment the priority is to fix the current code and release a new version.

Add nodes and secrets state to backup

I do not understand this part: are we talking about Kubernetes jobs here, or about VM/EC2/static Jenkins nodes?

@brokenpip3
Collaborator

Coming back to this issue, I believe we should do something about it: apart from Kubernetes or SSH agents, it is not possible to create any permanent inbound agent, because after a restart the inbound agent secrets are regenerated (this is also what causes the Kubernetes agents' inability to reconnect).

From what I understood after investigating, when you create a Jenkins node the node secret is stored in the DefaultConfidentialStore, which is not transparent to end users and stores secrets as files under $JENKINS_HOME/secrets (correct me if I'm wrong). So after a Jenkins restart that secret is lost forever. I believe there is no way to specify the node secret using "user" credentials (excluding SSH agents, such as EC2 instances from the EC2 plugin, which read an SSH key or username/password from the credentials store) or to tell Jenkins to avoid storing the node secret in the DefaultConfidentialStore.

I tried a couple of things. The first one was to add a persistent layer mounted at /var/lib/jenkins/secrets; the node secret survived a reboot/recreation of the Jenkins master, and my test inbound agent was able to reconnect correctly (via websocket).

These are the options that come to mind to address this issue (in my personal order of preference; other opinions and options are appreciated):

1. Find a way (Groovy/API script) to write to the DefaultConfidentialStore (a read-only sketch is included right after this list). The pro of this approach is that it follows the spirit of this project (no persistent layer at all, everything as code); the con is that it would not rescue the Kubernetes job agents, only the permanent non-SSH nodes (which is OK to me; IMHO a k8s job agent should be born and die while the master is up).

2. This issue's proposal: add an init container that restores the backup before the Jenkins Java process starts, and add the secrets directory to the backup script. The con of this is that each time you want a proper restore you need to restart the Jenkins master.

3. Add an (optional) persistent volume for $JENKINS_HOME/secrets; this would be the easiest to implement but also the furthest from the spirit of this project.
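A rough script-console sketch of what option 1 could build on; it only reads each node's inbound secret through the public getJnlpMac() API so the values can be captured or compared across restarts, while actually writing the underlying key back into the DefaultConfidentialStore is the part that still needs to be worked out:

import hudson.slaves.SlaveComputer
import jenkins.model.Jenkins

// Print the inbound (JNLP) secret of every static node so it can be
// compared before and after a controller restart, or exported "as code".
Jenkins.get().computers.each { c ->
    if (c instanceof SlaveComputer) {
        println "${c.name}: ${c.jnlpMac}"
    }
}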

Any opinion, thought, criticism?

Thanks

@brokenpip3
Collaborator

@brokenpip3 Thanks for the response. Just a quick question: I tried mounting a PV at /var/lib/jenkins/secrets to preserve the secrets and allow agent reconnection after the master comes back up once it's down, but the agent still wasn't able to connect to the master. This is in relation to #691. Any idea how to tackle this?

What exactly did you try? And when you refer to agents, which kind of agents are we talking about?

@dashashutosh24

dashashutosh24 commented May 3, 2023

Screenshot 2023-05-03 at 1 10 06 PM

@brokenpip3 I am referring to the k8s pods that spin up dynamically when a job is run. I added a sleep command to a job and ran it, which brought up a k8s pod as a slave agent, then I killed the master pod. Once the master came back up, the Jenkins slave pod spun up by the job stayed in a NotReady state indefinitely, trying to connect to the master, and I had to delete the pod manually. I tried to mount a PV at the /var/lib/jenkins/secrets path via a PVC, but the slave pod was still unable to connect back to the master once the master came back up.

@brokenpip3
Collaborator

Ok, but did you do this by scaling down the operator, saving the Jenkins pod as code, modifying it, and deleting/applying it again?
About the k8s agents: I never tried that because, like I said, I believe k8s jobs should be tied to the master lifecycle; anyway, I can try to understand whether we can support them.
If you want to make another test, you can do exactly what you did by mounting the PVC for the secrets directory, run a job with a long sleep, and check the agent secret from the console like this:

jenkins.model.Jenkins.getInstance().getComputer("<agentname>").getJnlpMac()

then restart Jenkins and check the secret again.

@dashashutosh24

dashashutosh24 commented May 3, 2023

@brokenpip3 Yes, I modified the Jenkins master and recreated it. I tested it the way you mentioned: it seems the value returned by jenkins.model.Jenkins.getInstance().getComputer("<agentname>").getJnlpMac() is gone after the Jenkins master restart. That is why the slave pod is unable to communicate with the master.
Screenshot 2023-05-03 at 1 48 21 PM
Screenshot 2023-05-03 at 1 48 43 PM
Screenshot 2023-05-03 at 1 53 18 PM
Screenshot 2023-05-03 at 1 53 53 PM

The slave pod logs:
Screenshot 2023-05-03 at 1 55 28 PM
Screenshot 2023-05-03 at 2 00 21 PM

@brokenpip3
Collaborator

So the issue is different in your tests: as you can see, the second Groovy query failed because the node does not exist in the Jenkins configuration, not because the persistent layer does not have the right secret.
I tried instead with a permanent inbound Jenkins agent plus the proper Jenkins CasC, which automatically re-creates the node each time, and the secret (which is a file) persists across pod restarts. Obviously, for Kubernetes on-demand agents it would be hard to write the Jenkins CasC to the ConfigMap each time, because, like I said, IMHO the Kubernetes Jenkins agents should be treated as short-lived agents that last only as long as the master is up.
If you want a permanent Kubernetes agent, you can achieve that with an inbound agent run as a Deployment, and in that case the CasC + /secrets persistent layer will do the job (a rough script-style equivalent is sketched below).
Sadly the secrets PVC is not yet implemented in the code; like I said a few comments ago, I would prefer other solutions. Let's see. :)
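Not the CasC syntax itself, but as a rough illustration of the same idea in script form (for example from an init.groovy.d hook): re-creating a permanent inbound node on every start could look roughly like the sketch below; the node name, remote FS path and label are made up.

import hudson.slaves.DumbSlave
import hudson.slaves.JNLPLauncher
import jenkins.model.Jenkins

def j = Jenkins.get()
// Hypothetical permanent inbound agent, re-created on every start; with the
// /secrets directory persisted, the existing agent can reconnect with its old secret.
if (j.getNode('permanent-inbound-1') == null) {
    def agent = new DumbSlave('permanent-inbound-1', '/home/jenkins/agent', new JNLPLauncher(true))
    agent.numExecutors = 1
    agent.labelString = 'permanent-inbound'
    j.addNode(agent)
}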

@dashashutosh24

@brokenpip3 Thanks for looking into this! I understand your point about the lifecycle of the slave pod being tied to the master; that makes sense. But the issue is that once the master comes back up, the existing slave pods stay in a NotReady state indefinitely. It would be great if the operator would check for orphaned pods and terminate them once the master comes back up. Otherwise it becomes a manual step.

@brokenpip3
Collaborator

Yep, that makes 100% sense.
We can do something around that, but not in 0.8; maybe after the first 0.9 release.

@github-actions github-actions bot added the stale label Jul 3, 2023
@github-actions github-actions bot closed this as not planned Jul 14, 2023