Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Occasional slow startup times during Booting builder step #283

Open
3 tasks done
niodice opened this issue Oct 25, 2023 · 9 comments
Open
3 tasks done

Occasional slow startup times during Booting builder step #283

niodice opened this issue Oct 25, 2023 · 9 comments

Comments

@niodice
Copy link

niodice commented Oct 25, 2023

Contributing guidelines

I've found a bug, and:

  • The documentation does not mention anything about my problem
  • There are no open or closed issues that are related to my problem

Description

I see that, about 25% of the time, that the Booting builder step can take quite some time to execute. I am running on Amazon EKS and we are using self hosted runners deployed using https://github.com/actions/actions-runner-controller. Before my runner becomes available to pick up a workflow, I see that the docker daemon has started successfully:

time="2023-10-25T21:08:30.983564648Z" level=info msg="Daemon has completed initialization"
time="2023-10-25T21:08:31.036285627Z" level=info msg="API listen on /var/run/docker.sock"

Expected behaviour

Consistent startup time

Actual behaviour

Startup time is very slow at time.

Repository URL

No response

Workflow run URL

No response

YAML workflow

The relevant parts of the workflow are:


jobs:
  publish-docker:
    name: publish docker image
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v3
      - name: setup docker buildx
        uses: docker/setup-buildx-action@v3

Workflow logs

Wed, 25 Oct 2023 21:08:49 GMT
  /usr/local/bin/docker buildx inspect --bootstrap --builder builder-9ff3bc3c-4aa4-43d7-b3b0-9566551cacb9
Wed, 25 Oct 2023 21:08:49 GMT
  #1 [internal] booting buildkit
Wed, 25 Oct 2023 21:08:49 GMT
  #1 pulling image moby/buildkit:buildx-stable-1
Wed, 25 Oct 2023 21:08:57 GMT
  #1 pulling image moby/buildkit:buildx-stable-1 7.5s done
Wed, 25 Oct 2023 21:08:57 GMT
  #1 creating container buildx_buildkit_builder-9ff3bc3c-4aa4-43d7-b3b0-9566551cacb90
Wed, 25 Oct 2023 21:11:12 GMT
  #1 143.0 error: failed to list workers: Unavailable: connection error: desc = "transport: error while dialing: dial unix /run/buildkit/buildkitd.sock: connect: no such file or directory"
Wed, 25 Oct 2023 21:11:12 GMT
  #1 creating container buildx_buildkit_builder-9ff3bc3c-4aa4-43d7-b3b0-9566551cacb90 135.5s done
Wed, 25 Oct 2023 21:11:12 GMT
  #1 ERROR: exit code 1
Wed, 25 Oct 2023 21:11:12 GMT
  ------
Wed, 25 Oct 2023 21:11:12 GMT
   > [internal] booting buildkit:
Wed, 25 Oct 2023 21:11:12 GMT
  143.0 error: failed to list workers: Unavailable: connection error: desc = "transport: error while dialing: dial unix /run/buildkit/buildkitd.sock: connect: no such file or directory"
Wed, 25 Oct 2023 21:11:12 GMT
  ------
Wed, 25 Oct 2023 21:11:12 GMT
  Name:          builder-9ff3bc3c-4aa4-43d7-b3b0-9566551cacb9
Wed, 25 Oct 2023 21:11:12 GMT
  Driver:        docker-container
Wed, 25 Oct 2023 21:11:12 GMT
  Last Activity: 2023-10-25 21:08:49 +0000 UTC
Wed, 25 Oct 2023 21:11:12 GMT
  
Wed, 25 Oct 2023 21:11:12 GMT
  Nodes:
Wed, 25 Oct 2023 21:11:12 GMT
  Name:     builder-9ff3bc3c-4aa4-43d7-b3b0-9566551cacb90
Wed, 25 Oct 2023 21:11:12 GMT
  Endpoint: unix:///var/run/docker.sock
Wed, 25 Oct 2023 21:11:12 GMT
  Error:    Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/buildx_buildkit_builder-9ff3bc3c-4aa4-43d7-b3b0-9566551cacb90/json": context deadline exceeded

The delay seems to be in the step that produces this log line:

  Error:    Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/buildx_buildkit_builder-9ff3bc3c-4aa4-43d7-b3b0-9566551cacb90/json": context deadline exceeded

BuildKit logs

No response

Additional info

No response

@deanb-everc
Copy link

bump, also experiencing this, and this fails my workflows..

@zpiazza-combocurve
Copy link

Same here

@ChrisPage-AT
Copy link

I'm also seeing this but it just fails the workflows instead of being slow. It doesn't happen with every build but it does force us to rebuild when it fails. My logs are exactly the same as above, and show up under the "Booting builder" dropdown of the docker/setup-buildx-action@v3 action.

@niodice
Copy link
Author

niodice commented Jan 26, 2024

@crazy-max I wonder if you might have some expertise in how docker and/or the buildx plugin interact, and what we might look for in terms of debugging the slowness.

@ivadimko
Copy link

bump, we also fase issues with docker/setup-buildx-action@v3

affair on v2 it worked great so we had to rollback :(

@LucasAgin
Copy link

LucasAgin commented Feb 16, 2024

Also seeing this failing on our self-hosted runners, this is after running for ~6-7 minutes. Also using action runner controller on AWS EKS.

It is periodic, some builds eventually work, some do not.

Run docker/setup-buildx-action@v3
Docker info
Buildx version
Creating a new builder instance
Booting builder
  /usr/bin/docker buildx inspect --bootstrap --builder builder-e9dc3275-XXXX-XXXX-95e9-2c1fcb2a5c6a
  #1 [internal] booting buildkit
  #1 pulling image moby/buildkit:buildx-stable-1
  #1 pulling image moby/buildkit:buildx-stable-1 2.3s done
  #1 creating container buildx_buildkit_builder-e9dc3275-XXXX-XXXX-95e9-2c1fcb2a5c6a0
  #1 467.0 error: failed to list workers: Unavailable: connection error: desc = "transport: error while dialing: dial unix /run/buildkit/buildkitd.sock: connect: no such file or directory"
  #1 creating container buildx_buildkit_builder-e9dc3275-XXXX-XXXX-95e9-2c1fcb2a5c6a0 464.7s done
  #1 ERROR: exit code 1
  ------
   > [internal] booting buildkit:
  467.0 error: failed to list workers: Unavailable: connection error: desc = "transport: error while dialing: dial unix /run/buildkit/buildkitd.sock: connect: no such file or directory"
  ------
  ERROR: exit code 1
Error: The process '/usr/bin/docker' failed with exit code 1

@devonjones
Copy link

We had this problem, and we found the solution: you must set limits and requests for resources for the arc runners. There's no default requests/limits, which means that kube is free to schedule pods anywhere. This is why the error is happening at random. If a build comes in that uses lots of containers at once, kube is probably killing containers to maintain health. You have to have requests and limits configured to get kube to do intelligent scheduling.

The bug here is that arc should set a sensible default, or it should error. In their helm chart they comment that they don't set a default because the user should be setting it. That's a reasonable impulse, but instead of having no requests/limits, the helm chart should fail if the user isn't setting them.

https://github.com/actions/actions-runner-controller/blob/master/charts/actions-runner-controller/values.yaml#L127

@gete76
Copy link

gete76 commented Apr 8, 2024

@devonjones, those errors do pop up when a node's drive is slowed down by runner overload, but I don't think that's the root cause in this issue since this does seem to be similar to the issue that just mentioned this ticket ^. I've also seen this error on our github hosted runners.

@tschwaerzl
Copy link

I just got the exact same error on a GitHub hosted runner. It worked without any issues the last days.

Docker info

  /usr/bin/docker version
  Client: Docker Engine - Community
   Version:           26.1.3
   API version:       1.45
   Go version:        go1.21.10
   Git commit:        b72abbb
   Built:             Thu May 16 08:33:29 2024
   OS/Arch:           linux/amd64
   Context:           default
  
  Server: Docker Engine - Community
   Engine:
    Version:          26.1.3
    API version:      1.45 (minimum version 1.24)
    Go version:       go1.21.10
    Git commit:       8e96db1
    Built:            Thu May 16 08:33:29 2024
    OS/Arch:          linux/amd64
    Experimental:     false
   containerd:
    Version:          1.7.18
    GitCommit:        ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
   runc:
    Version:          1.7.18
    GitCommit:        v1.1.13-0-g58aa920
   docker-init:
    Version:          0.19.0
    GitCommit:        de40ad0
  /usr/bin/docker info
  Client: Docker Engine - Community
   Version:    26.1.3
   Context:    default
   Debug Mode: false
   Plugins:
    buildx: Docker Buildx (Docker Inc.)
      Version:  v0.15.1
      Path:     /usr/libexec/docker/cli-plugins/docker-buildx
    compose: Docker Compose (Docker Inc.)
      Version:  v2.27.1
      Path:     /usr/libexec/docker/cli-plugins/docker-compose
  
  Server:
   Containers: 0
    Running: 0
    Paused: 0
    Stopped: 0
   Images: 15
   Server Version: 26.1.3
   Storage Driver: overlay2
    Backing Filesystem: extfs
    Supports d_type: true
    Using metacopy: false
    Native Overlay Diff: false
    userxattr: false
   Logging Driver: json-file
   Cgroup Driver: cgroupfs
   Cgroup Version: 2
   Plugins:
    Volume: local
    Network: bridge host ipvlan macvlan null overlay
    Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
   Swarm: inactive
   Runtimes: io.containerd.runc.v2 runc
   Default Runtime: runc
   Init Binary: docker-init
   containerd version: ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
   runc version: v1.1.13-0-g58aa920
   init version: de40ad0
   Security Options:
    apparmor
    seccomp
     Profile: builtin
    cgroupns
   Kernel Version: 6.5.0-1023-azure
   Operating System: Ubuntu 22.04.4 LTS
   OSType: linux
   Architecture: x86_64
   CPUs: 2
   Total Memory: 7.744GiB
   Name: fv-az1661-692
   ID: e47e7669-35e6-4532-94ca-a2b577895d60
   Docker Root Dir: /var/lib/docker
   Debug Mode: false
   Username: githubactions
   Experimental: false
   Insecure Registries:
    127.0.0.0/8
   Live Restore Enabled: false

Buildx version

  /usr/bin/docker buildx version
  github.com/docker/buildx v0.15.1 1c1dbb2e4cb5363110f42102744a08d034c2300d

Booting builder

  /usr/bin/docker buildx inspect --bootstrap --builder builder-0dec8ecb-3d6f-409f-959a-b8d0905fba2a
  #1 [internal] booting buildkit
  #1 pulling image moby/buildkit:buildx-stable-1
  #1 pulling image moby/buildkit:buildx-stable-1 0.4s done
  #1 creating container buildx_buildkit_builder-0dec8ecb-3d6f-409f-959a-b8d0905fba2a0
  #1 18.86 error: failed to list workers: Unavailable: connection error: desc = "transport: Error while dialing: dial unix /run/buildkit/buildkitd.sock: connect: no such file or directory"
  #1 creating container buildx_buildkit_builder-0dec8ecb-3d6f-409f-959a-b8d0905fba2a0 18.5s done
  #1 ERROR: exit code 1
  ------
   > [internal] booting buildkit:
  18.86 error: failed to list workers: Unavailable: connection error: desc = "transport: Error while dialing: dial unix /run/buildkit/buildkitd.sock: connect: no such file or directory"
  ------
  ERROR: exit code 1

@devonjones: Is there a way to configure/provide the limits within the GitHub Actions file?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

10 participants