Random DNS issue when using Github Actions #107

Open
saarw-opti opened this issue Jan 24, 2024 · 31 comments

Comments

@saarw-opti

saarw-opti commented Jan 24, 2024

Screenshot 2024-01-24 at 10 47 32

I'm trying to use the Tailscale GitHub Action with the latest version of Tailscale (I also tested other versions) and I'm getting these DNS issues.
It also happens when I install Tailscale manually on the machine while it is running.
I've tried injecting the nameserver and search entries into /etc/resolv.conf,
but it doesn't help in this case.
In the admin console, I've defined the machine as ephemeral and pre-approved.
It happens only on GitHub Actions machines.
The issue is intermittent: sometimes it happens and sometimes it doesn't.

Thanks.

@saarw-opti saarw-opti changed the title DNS issue when using Github Actions TailsSale Random DNS issue when using Github Actions TailsSale Jan 24, 2024
@saarw-opti saarw-opti changed the title Random DNS issue when using Github Actions TailsSale Random DNS issue when using Github Actions Jan 24, 2024
@bradfitz
Member

Are you using a Dockerfile runner or the Tailscale-supplied action.yml?

What's your GitHub runner type/version?

@saarw-opti
Author

saarw-opti commented Jan 30, 2024

I tried both the GitHub Action and the manual installation, running on ubuntu-latest and 20.04 (which seems more stable).
I tried Tailscale 1.58.0 and 1.56.0.

@matthewjthomas

I was running into a lot of transient DNS resolution failures. I followed this recommendation and it seems to be working a lot better: #51 (comment)

@sylr

sylr commented Apr 5, 2024

I too encounter a lot of transient DNS errors, my deployment pipelines randomly fail like this:

> Run helm package ./deploy/chart \
Successfully packaged chart and saved it to: /home/runner/work/.../..../......tgz
Error: Kubernetes cluster unreachable: Get "https://xxxxx.gr7.eu-central-1.eks.amazonaws.com/version": dial tcp: lookup xxxxx.gr7.eu-central-1.eks.amazonaws.com on 127.0.0.53:53: read udp 127.0.0.1:40699->127.0.0.53:53: i/o timeout

It was working fine a few weeks ago, now I have to restart my deployment pipelines a lot.

@dgivens

dgivens commented May 2, 2024

I saw this a while back, but it seemed to go away for a while, then it became a problem again about a week ago. We are using the standard hosted runner and the following action. When it started causing us problems last week, we added the Tailscale version based on the same issue @matthewjthomas referenced, #51. It has not made a difference.

This is our action.

name: 'connect_tailscale'
description: 'Connects to Tailscale'
inputs:
    ts_oauth_client_id:
        description: 'TS_OAUTH_CLIENT_ID'
        required: true
    ts_oauth_secret:
        description: 'TS_OAUTH_SECRET'
        required: true
runs:
    using: 'composite'
    steps:
        - name: Tailscale
          uses: tailscale/github-action@v2
          with:
              version: 1.64.0
              oauth-client-id: ${{ inputs.TS_OAUTH_CLIENT_ID }}
              oauth-secret: ${{ inputs.TS_OAUTH_SECRET }}
              tags: tag:github
              args: --accept-routes --accept-dns

@KlausVii

We are also experiencing DNS timeouts with Tailscale in our CI. Our setup:

      - name: Tailscale
        uses: tailscale/github-action@v2
        with:
          oauth-client-id: ${{ env.TS_OAUTH_CLIENT_ID }}
          oauth-secret: ${{ env.TS_OAUTH_SECRET }}
          tags: tag:ci
          version: 1.64.0

@arnecls

arnecls commented Jun 24, 2024

We found that the tailscale action is "reporting ready" too quickly.
It waits for tailscale status to return ok, but it takes another ~10s until DNS becomes available. So sleeping for 10s after the connect step usually solves the issue.

I'd like to have a more consistent way of waiting for DNS to become ready though.
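
One possible shape for such a wait is a bounded poll instead of a fixed sleep. The sketch below is an assumption and not part of the Tailscale action: `wait_until` is a hypothetical helper name, and the probe in the usage comment (`getent hosts` against a placeholder hostname) is only one possible readiness check.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it exits 0 or the deadline passes.
wait_until() {
  local timeout_s=$1
  local interval_s=$2
  shift 2
  local deadline=$(( $(date +%s) + timeout_s ))
  # The probe's own output is discarded; only its exit status matters.
  until "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out after ${timeout_s}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval_s"
  done
  return 0
}

# Example (placeholder hostname): fail the step if the name does not
# resolve through the system resolver within 2 minutes.
# wait_until 120 2 getent hosts my-service.example.internal
```

The step returns as soon as the probe succeeds, so a fast run does not pay the full sleep, and a slow run fails with an explicit timeout message rather than a later DNS error.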

@sylr

sylr commented Dec 4, 2024

> We found that the tailscale action is "reporting ready" too quickly. It waits for tailscale status to return ok, but it takes another ~10s until DNS becomes available. So sleeping for 10s after the connect step usually solves the issue.
>
> I'd like to have a more consistent way of waiting for DNS to become ready though.

I've just hit this problem (again) and it took several minutes for tailscale network to be in a working state (I put a sleep 600 and tried to ssh into the github runner, it took at least 3 minutes before my ssh went through). I'm wondering if the problem could be caused by an overloaded github network.

In any case, I agree with @arnecls, it would be nice to have a tailscale command that could wait until magic dns is in working order.

@lukeramsden

Also seeing this at the moment, trying a sleep 10 as we speak but yes ideally there would be a way to explicitly wait for the propagation to happen.

@sylr

sylr commented Dec 4, 2024

@lukeramsden I'm glad someone is having the issue at the same time as me.

My hypotheses:

  • flaky github network.
  • nearest github actions DERP servers overloaded.

@lukasmrtvy

@sylr same for us, especially today

@lukeramsden

sleep 10 seemed to work as a one-off for me just now but sounds like there can be quite a lot of variance

@sylr

sylr commented Dec 4, 2024

It's a bit hackish but less dumb than a sleep: sylr@338b779

@sylr

sylr commented Dec 9, 2024

I've forked the github action and added this at the end:

  timeout --verbose --kill-after=1s ${TIMEOUT} sudo -E bash -c 'while tailscale dns query google.com. a | grep "failed to query DNS"; do sleep 1; done'

Currently seeing this:


Run if [ X64 = "ARM64" ]; then
Downloading https://pkgs.tailscale.com/stable/tailscale_1.76.6_amd64.tgz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    81  100    81    0     0    362      0 --:--:-- --:--:-- --:--:--   363

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 27.6M  100 27.6M    0     0  42.5M      0 --:--:-- --:--:-- --:--:-- 80.4M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    64  100    64    0     0    346      0 --:--:-- --:--:-- --:--:--   347
Expected sha256: 08f2377b78f7b9e411caa28f231a9c4cd0887209c142b49b815bcc7042ff61f7
Actual sha256: 08f2377b78f7b9e411caa28f231a9c4cd0887209c142b49b815bcc7042ff61f7  tailscale.tgz
tailscale.tgz: OK
Run if [ "$STATEDIR" == "" ]; then
Run if [ -z "${HOSTNAME}" ]; then
  if [ -z "${HOSTNAME}" ]; then
    HOSTNAME="github-$(cat /etc/hostname)"
  fi
  if [ -n "***" ]; then
    TAILSCALE_AUTHKEY="***?preauthorized=true&ephemeral=true"
    TAGS_ARG="--advertise-tags=tag:github-actions"
  fi
  timeout --verbose --kill-after=1s ${TIMEOUT} sudo -E tailscale up ${TAGS_ARG} --authkey=${TAILSCALE_AUTHKEY} --hostname=${HOSTNAME} --accept-routes ${ADDITIONAL_ARGS}
  timeout --verbose --kill-after=1s ${TIMEOUT} sudo -E bash -c 'while tailscale dns query google.com. a | grep "failed to query DNS"; do sleep 1; done'
  shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
  env:
    AWS_DEFAULT_REGION: eu-central-1
    AWS_REGION: eu-central-1
    AWS_ACCESS_KEY_ID: ***
    AWS_SECRET_ACCESS_KEY: ***
    AWS_SESSION_TOKEN: ***
    ADDITIONAL_ARGS: 
    HOSTNAME: 
    TAILSCALE_AUTHKEY: 
    TIMEOUT: 2m
    TS_EXPERIMENT_OAUTH_AUTHKEY: true
failed to query DNS: 500 Internal Server Error: waiting for response or error from [100.68.130.112 100.122.168.44]: context deadline exceeded
failed to query DNS: 500 Internal Server Error: waiting for response or error from [100.68.130.112 100.122.168.44]: context deadline exceeded
failed to query DNS: 500 Internal Server Error: waiting for response or error from [100.68.130.112 100.122.168.44]: context deadline exceeded
failed to query DNS: 500 Internal Server Error: waiting for response or error from [100.68.130.112 100.122.168.44]: context deadline exceeded
failed to query DNS: 500 Internal Server Error: waiting for response or error from [100.68.130.112 100.122.168.44]: context deadline exceeded
failed to query DNS: 500 Internal Server Error: waiting for response or error from [100.68.130.112 100.122.168.44]: context deadline exceeded
failed to query DNS: 500 Internal Server Error: waiting for response or error from [100.68.130.112 100.122.168.44]: context deadline exceeded
failed to query DNS: 500 Internal Server Error: waiting for response or error from [100.68.130.112 100.122.168.44]: context deadline exceeded
failed to query DNS: 500 Internal Server Error: waiting for response or error from [100.68.130.112 100.122.168.44]: context deadline exceeded
failed to query DNS: 500 Internal Server Error: waiting for response or error from [100.68.130.112 100.122.168.44]: context deadline exceeded
timeout: sending signal TERM to command ‘sudo’
failed to query DNS: 500 Internal Server Error: waiting for response or error from [100.68.130.112 100.122.168.44]: context deadline exceeded

@lukasmrtvy

>   • flaky github network.
>   • nearest github actions DERP servers overloaded.

It makes sense, maybe the us-east is running workflows just before lunch :D

@bithavoc

bithavoc commented Dec 9, 2024

we're seeing this with github actions:

  • hosted runners (ubuntu 22.04 and 24.04)
  • tailscale 1.78.1
read udp 127.0.0.1:48903->127.0.0.53:53: i/o timeout

@bithavoc

bithavoc commented Dec 9, 2024

this is failing 100% of the time now when SplitDNS is used on Github Action hosted runners:

timeout --verbose --kill-after=1s 2m sudo -E bash -c 'while tailscale dns query MY-CLUSTER-BEHIND-TAILSCALE-SPLIT-DNS.eks.amazonaws.com. a | grep "failed to query DNS"; do sleep 1; done'
  
failed to query DNS: 500 Internal Server Error: waiting for response or error from [172.31.0.2]: context deadline exceeded

@lukeramsden

Yep, all my deploys are failing at the moment. It seems to correlate with what other people are seeing. I'm also hosting my runners on https://blacksmith.sh/ rather than using GitHub Actions hosted runners, so maybe it's on Tailscale's end?

@bithavoc

bithavoc commented Dec 9, 2024

The Tailscale status page is now showing degraded performance for the Coordinator server: https://status.tailscale.com/

@aaomidi

aaomidi commented Dec 9, 2024

I've made a feature request for this: #146

On this note, even before this outage we'd sometimes still get timeouts because of what I assume is coordination latency - is this something other folks have experienced?

@bithavoc

bithavoc commented Dec 9, 2024

it's working now, it seems like it was the coordinator server after all

@KlausVii

We are still encountering lots of timeouts. Is there another incident ongoing? The status page is reporting all green.

@lukeramsden

I'm seeing it take 15 minutes for DNS to propagate

@sylr

sylr commented Dec 16, 2024

Has anyone else tried to contact Tailscale support about this? I've sent 2 emails without a response.

@bithavoc

it's not great, they acknowledged it on their status page today but it's been like this since Saturday, it's pretty bad for Split DNS.

@lukeramsden

> Has anyone else tried to ring tailscale support about this? I've sent 2 mails without response.

not heard anything either

@nicolasbriere1

We faced the same issue, also without a response from support :/

@bithavoc

bithavoc commented Jan 6, 2025

we're seeing this issue again, it's timing out not just in Github actions but also our on-premise ubuntu machines with split DNS

@lukasmrtvy

yes, same for us

@zenire

zenire commented Jan 13, 2025

> we're seeing this with github actions:
>
>   • hosted runners (ubuntu 22.04 and 24.04)
>   • tailscale 1.78.1
> read udp 127.0.0.1:48903->127.0.0.53:53: i/o timeout

Experiencing exactly this behavior too. It became worse last quarter.

@aaomidi

aaomidi commented Jan 16, 2025

We primarily use Tailscale to access subnets hosted on AWS. To work around this problem, I've written a script that does the following:

    - name: Wait for propagation
      uses: Wandalen/wretry.action@master
      with:
        command: |
          check_dns() {
            local server=$1
            if dig "@$server" example.com +short +timeout=5; then
              echo "DNS check against $server successful"
              return 0
            else
              echo "DNS check against $server failed"
              return 1
            fi
          }

          # https://docs.aws.amazon.com/vpc/latest/userguide/AmazonDNS-concepts.html
          # These are our VPC base addresses + 2 (the Amazon-provided DNS resolver for each VPC)
          servers=(
            "10.152.0.2"
            "10.160.0.2"
            "10.168.0.2"
            "10.176.0.2"
          )

          for server in "${servers[@]}"; do
            if ! check_dns "$server"; then
              echo "Failed checking DNS server $server"
              exit 1
            fi
          done

          echo "All DNS checks successful!"
        attempt_limit: 20
        attempt_delay: 5000  # 5 seconds in milliseconds

This is done after connecting to tailscale in the GH action, and has effectively worked around this problem. I imagine that you can find another IP to ping/dig to verify that the connection has in fact fully propagated to the entire tailnet.
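
A variant of this check that may help when debugging: rather than exiting at the first failing resolver, it probes every resolver and lists each one that failed. This is a sketch under assumptions: `dig_probe` and `check_all_resolvers` are hypothetical names, and the addresses in the usage comment are copied from the workflow above purely as placeholders.

```shell
# dig_probe SERVER: one DNS query against SERVER.
# dig exits non-zero when it gets no reply from the server.
dig_probe() {
  dig "@$1" example.com +short +timeout=5 +tries=1 >/dev/null
}

# check_all_resolvers PROBE SERVER...
# Runs PROBE once per server, reports every failing server on stderr,
# and returns non-zero if any of them failed.
check_all_resolvers() {
  local probe=$1
  shift
  local failed=0
  local server
  for server in "$@"; do
    if "$probe" "$server"; then
      echo "DNS check against $server successful"
    else
      echo "DNS check against $server failed" >&2
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}

# Example: check_all_resolvers dig_probe 10.152.0.2 10.160.0.2
```

Passing the probe as an argument keeps the loop independent of `dig`, so the same function can be pointed at a different check if a resolver query is not the right readiness signal for a given tailnet.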
