Weird behavior when running tests #145

Open
cattuz opened this issue Mar 29, 2024 · 1 comment · May be fixed by #146
Comments

@cattuz

cattuz commented Mar 29, 2024

I'm running the tests from "chitchat-test" locally on my Windows machine with the following PowerShell script (trying to mimic the bash version as closely as I can):

Get-Process "chitchat-test" | Stop-Process

cargo build --release

for ($i = 10000; $i -lt 10100; $i++)
{
    $listen_addr = "127.0.0.1:$i";
    Write-Host $listen_addr;

    Start-Process -NoNewWindow "cargo" -ArgumentList "run --release -- --listen_addr $listen_addr --seed 127.0.0.1:10002 --node_id node_$i"
}

Read-Host

Are the following results expected behavior, or am I running into a Windows-specific bug?

  1. The services started at 127.0.0.1:10000 and 127.0.0.1:10001 (before the seed address port) never receive any peers, and are stuck with a rather bare "state", regardless of how long I leave it running:
{
  "cluster_id": "testing",
  "cluster_state": {
    "node_state_snapshots": [
      {
        "chitchat_id": {
          "node_id": "node_10000",
          "generation_id": 1711736008,
          "gossip_advertise_addr": "127.0.0.1:10000"
        },
        "node_state": {
          "chitchat_id": {
            "node_id": "node_10000",
            "generation_id": 1711736008,
            "gossip_advertise_addr": "127.0.0.1:10000"
          },
          "heartbeat": 2,
          "key_values": {},
          "max_version": 0,
          "last_gc_version": 0
        }
      }
    ],
    "seed_addrs": [
      "127.0.0.1:10002"
    ]
  },
  "live_nodes": [
    {
      "node_id": "node_10000",
      "generation_id": 1711736008,
      "gossip_advertise_addr": "127.0.0.1:10000"
    }
  ],
  "dead_nodes": []
}

The other services on ports greater than the seed start up fine.

  2. If I select a random process and kill it, all heartbeats stop (even for the services which received no peers). The state is still readable as JSON with a GET request to 127.0.0.1:10XXX, but the heartbeat does not increment, and no new peers are registered if I add them after the stoppage.

Is any of this expected behavior, or could it be Windows-related?

@cattuz cattuz linked a pull request Mar 30, 2024 that will close this issue
@cattuz
Author

cattuz commented Mar 30, 2024

I think I found the issue. The following line in server.rs stops the server whenever a node is no longer reachable:

https://github.com/quickwit-oss/chitchat/blob/b08423fdcb4022f31a8cc37af018af38cc35193b/chitchat/src/server.rs#L223C1-L223C49

The error is:

Error {
    context: "Error while receiving UDP message",
    source: Os {
        code: 10054,
        kind: ConnectionReset,
        message: "An existing connection was forcibly closed by the remote host.",
    },
}

Is this something you are meant to recover from in client code when setting up the server?

I've created a draft PR #146 to show what fixed the issue for me.
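
For illustration, here is a minimal sketch of the kind of recovery I mean. It is not necessarily what PR #146 does. As far as I understand, on Windows an ICMP "port unreachable" reply to an earlier UDP send is reported on a later receive as error 10054, so a single dead peer can surface as a receive error on every node that gossips to it. The names `is_transient_recv_error` and `run_recv_loop` are hypothetical and not part of chitchat, and the receive call is passed in as a closure so the loop can be tried without a real socket:

```rust
use std::io;

/// Hypothetical helper (not part of chitchat): decides whether a UDP receive
/// error should be logged and skipped instead of stopping the server.
fn is_transient_recv_error(err: &io::Error) -> bool {
    // Windows reports os error 10054 (WSAECONNRESET) on a UDP receive after a
    // previous send hit an unreachable port; Rust maps it to ConnectionReset.
    err.kind() == io::ErrorKind::ConnectionReset
}

/// Hypothetical receive loop, generic over the receive call. It keeps running
/// across transient errors and returns the number of datagrams handled
/// together with the first non-transient error.
fn run_recv_loop<F>(mut recv: F) -> (usize, io::Error)
where
    F: FnMut() -> io::Result<Vec<u8>>,
{
    let mut handled = 0;
    loop {
        match recv() {
            // A real server would decode and process the gossip message here.
            Ok(_datagram) => handled += 1,
            // An unreachable peer should not take the whole node down.
            Err(err) if is_transient_recv_error(&err) => {
                eprintln!("ignoring transient UDP receive error: {}", err);
            }
            // Anything else is still treated as fatal.
            Err(err) => return (handled, err),
        }
    }
}
```

In a real server the closure would wrap the socket's receive call. Whether other error kinds should also be treated as transient is a judgment call I have not made here.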
