error: Can't bind transport socket /crtools-fd-3-0: Address in use #2499

Open
fntlnz opened this issue Oct 21, 2024 · 1 comment · May be fixed by #2500


fntlnz commented Oct 21, 2024

Description
Hey team!

When restoring a dump in a new mount + pid namespace, in an environment where multiple restores sharing the same network namespace
can run simultaneously, CRIU was observed to create an abstract unix socket named crtools-fd-3-0 in each of them.

Note: I already prepared a patch with what I think would fix this, submitting it right after opening this.

When this happens, restore.log contains a line like this:

1606:(01.379765)      3: Error (criu/files.c:1695): Can't bind transport socket /crtools-fd-3-0: Address in use

The pattern crtools-fd-%d-%d is defined in files.c.

In this case 3 is always the restore PID, even when multiple restores are running, because the same process is being restored in a pid namespace with PID 3. However, I was surprised to see that criu_run_id was always 0.

At that point, we investigated a bit and noticed that the last element of that format string is the id of the pid namespace, which is obviously not zero in our case, so something was off.

It looks like that value is set in util_init. However, in crtools.c util_init is called after cr_service_work is started, so the service worker will never have it set.

Steps to reproduce the issue:

Start a new pid + mount namespace with a process to dump inside of it

unshare --pid --mount --fork /bin/bash
top # <--- this is the process we want to dump

Now nsenter that namespace and dump top

nsenter -t <bash pid> -p -m
criu dump -t $TOP_PID  -D /tmp/criu-dump -v4 --shell-job

Now try to restore that process multiple times simultaneously in two different dump folders

cp -r /tmp/criu-dump /tmp/criu-dump-1
cp -r /tmp/criu-dump /tmp/criu-dump-2

For example, in two different terminals, start a new pid namespace again and restore the process.

unshare --pid --mount --fork --mount-proc criu restore -d -D /tmp/criu-dump-1 -v4 --shell-job

and

unshare --pid --mount --fork --mount-proc criu restore -d -D /tmp/criu-dump-2 -v4 --shell-job

Now, if the timing is right and both restores run at the same moment, you'll see an error in your log saying that the socket address is already in use.

Describe the results you received:

The restore fails with this error

1606:(01.379765)      3: Error (criu/files.c:1695): Can't bind transport socket /crtools-fd-3-0: Address in use

Describe the results you expected:

The restore is successful and the crtools-fd path is something like /crtools-fd-3-f000058200000002

Additional information you deem important (e.g. issue happens only occasionally):

CRIU logs and information:

CRIU full dump/restore logs:

(01.452241)      3: Restore on-core sigactions for 3
(01.452300)      3: Error (criu/files.c:1695): Can't bind transport socket /crtools-fd-3-0: Address in use
(01.452376) Error (criu/cr-restore.c:2313): Restoring FAILED.

Output of `criu --version`:

I'm using the current criu-dev, which I would've expected to say 4.0, but it still reports 3.18. Either way, this is the output:

Version: 3.18
GitID: v3.18-320-gdfb56eed6

Output of `criu check --all`:

Looks good.

Additional environment details:

Please let me know if you need more details or something isn't clear and thanks for the hard work y'all put in making this happen!

fntlnz commented Oct 21, 2024

looks like the malware bots are hitting this repo, reported

@checkpoint-restore checkpoint-restore deleted a comment Oct 21, 2024
fntlnz added a commit to fntlnz/criu that referenced this issue Oct 23, 2024
When restoring dumps in new mount + pid namespaces where multiple dumps
share the same network namespace, CRIU may fail due to conflicting
unix socket names. This happens because the service worker creates
sockets using a pattern that includes criu_run_id, but util_init()
is called after cr_service_work() starts.

The socket naming pattern "crtools-fd-%d-%d" uses the restore PID
and criu_run_id, however criu_run_id is always 0 when not initialized,
leading to conflicts when multiple restores run simultaneously either
in the same CRIU process or because of multiple CRIU processes
doing the same operation in different PID namespaces.

Fix this by:

- Moving util_init() before cr_service_work() starts
- Adding a second util_init() call in the service worker fork
to ensure unique IDs across multiple worker runs
- Making sure that dump and restore operations have util_init() called
early to generate unique socket names

With this fix, socket names always include the namespace ID, preventing
conflicts when multiple processes with the same pid share a network
namespace.

Fixes checkpoint-restore#2499

Signed-off-by: Lorenzo Fontana <[email protected]>