Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestUpdateDevices fails on i386 #4594

Open
kolyshkin opened this issue Jan 16, 2025 · 3 comments · May be fixed by #4610
Open

TestUpdateDevices fails on i386 #4594

kolyshkin opened this issue Jan 16, 2025 · 3 comments · May be fixed by #4610

Comments

@kolyshkin
Copy link
Contributor

kolyshkin commented Jan 16, 2025

Description

We have a CI job to compile runc and run its unit tests to make sure the code is 32-bit clean.

Recently (about a month ago) TestUpdateDevices and/or TestUpdateDevicesSystemd start to fail randomly.

The failure is happening in one of two ways:

  1. invalid character 'ÿ' looking for beginning of object key string
Log
=== RUN   TestUpdateDevicesSystemd
.......
time="2024-12-23T02:28:15Z" level=info msg="found more than one filter (2) attached to a cgroup -- removing extra filters!"
time="2024-12-23T02:28:15Z" level=info msg="removing old filter 0 from cgroup" id=1028 name= run_count=0 runtime=0s tag=531db05b114e9af3 type=CGroupDevice
time="2024-12-23T02:28:15Z" level=info msg="removing old filter 1 from cgroup" id=1029 name= run_count=0 runtime=0s tag=a04f5eef06a7f555 type=CGroupDevice
time="2024-12-23T02:28:15Z" level=info msg="found more than one filter (2) attached to a cgroup -- removing extra filters!"
time="2024-12-23T02:28:15Z" level=info msg="removing old filter 0 from cgroup" id=1030 name= run_count=0 runtime=0s tag=fb6cb1c301453333 type=CGroupDevice
time="2024-12-23T02:28:15Z" level=info msg="removing old filter 1 from cgroup" id=1031 name= run_count=0 runtime=0s tag=3b0b81b071f088cd type=CGroupDevice
    update_test.go:67: unexpected error: unable to start container process: invalid character 'ÿ' looking for beginning of object key string
--- FAIL: TestUpdateDevicesSystemd (3.03s)
  1. panic in strings.Contains
Log
....
time="2025-01-16T04:13:05Z" level=info msg="removing old filter 0 from cgroup" id=824 name= run_count=0 runtime=0s tag=531db05b114e9af3 type=CGroupDevice
time="2025-01-16T04:13:05Z" level=info msg="removing old filter 1 from cgroup" id=825 name= run_count=0 runtime=0s tag=a04f5eef06a7f555 type=CGroupDevice
time="2025-01-16T04:13:05Z" level=info msg="found more than one filter (2) attached to a cgroup -- removing extra filters!"
time="2025-01-16T04:13:05Z" level=info msg="removing old filter 0 from cgroup" id=826 name= run_count=0 runtime=0s tag=fb6cb1c301453333 type=CGroupDevice
time="2025-01-16T04:13:05Z" level=info msg="removing old filter 1 from cgroup" id=827 name= run_count=0 runtime=0s tag=3b0b81b071f088cd type=CGroupDevice
unexpected fault address 0xffffffff
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0xffffffff pc=0x80cdf14]

goroutine 1310 gp=0x8c077a8 m=90 mp=0x908e008 [running]:
runtime.throw({0x85897db, 0x5})
	/opt/hostedtoolcache/go/1.23.4/x64/src/runtime/panic.go:1067 +0x35 fp=0x8cd3b80 sp=0x8cd3b6c pc=0x80c2d55
runtime.sigpanic()
	/opt/hostedtoolcache/go/1.23.4/x64/src/runtime/signal_unix.go:931 +0x2af fp=0x8cd3bac sp=0x8cd3b80 pc=0x80c499f
internal/stringslite.Index({0x8c25740, 0x29}, {0xffffffff, 0x24})
	/opt/hostedtoolcache/go/1.23.4/x64/src/internal/stringslite/strings.go:78 +0x54 fp=0x8cd3bf4 sp=0x8cd3bac pc=0x80cdf14
strings.Index(...)
	/opt/hostedtoolcache/go/1.23.4/x64/src/strings/strings.go:1221
strings.Contains(...)
	/opt/hostedtoolcache/go/1.23.4/x64/src/strings/strings.go:62
github.com/opencontainers/runc/libcontainer/integration.testUpdateDevices(0x8cd8808, 0x1)
	/home/runner/work/runc/runc/libcontainer/integration/update_test.go:71 +0x8cf fp=0x8cd3f68 sp=0x8cd3bf4 pc=0x84bae5f
github.com/opencontainers/runc/libcontainer/integration.TestUpdateDevicesSystemd(0x8cd8808)
	/home/runner/work/runc/runc/libcontainer/integration/update_test.go:97 +0x83 fp=0x8cd3f84 sp=0x8cd3f68 pc=0x84bb203
testing.tRunner(0x8cd8808, 0x85b26c0)
	/opt/hostedtoolcache/go/1.23.4/x64/src/testing/testing.go:1690 +0x119 fp=0x8cd3fe4 sp=0x8cd3f84 pc=0x818fb29
testing.(*T).Run.gowrap1()
	/opt/hostedtoolcache/go/1.23.4/x64/src/testing/testing.go:1743 +0x2e fp=0x8cd3ff0 sp=0x8cd3fe4 pc=0x8190b4e
runtime.goexit({})
	/opt/hostedtoolcache/go/1.23.4/x64/src/runtime/asm_386.s:1393 +0x1 fp=0x8cd3ff4 sp=0x8cd3ff0 pc=0x80c9961
created by testing.(*T).Run in goroutine 1
	/opt/hostedtoolcache/go/1.23.4/x64/src/testing/testing.go:1743 +0x3d1
...

FAIL	github.com/opencontainers/runc/libcontainer/integration	18.695s

Both cases look like some kind of memory corruption.

Steps to reproduce the issue

  • Create a PR
  • See ci / cross-i386 fails

Describe the results you received and expected

no failures

What version of runc are you using?

git HEAD

Host OS information

No response

Host kernel information

No response

@kolyshkin
Copy link
Contributor Author

OK, thanks to @rata (who reproduced it locally) I did some more testing and was also able to reproduce this.

Turns out it's quite reproducible, and the issue is, after some time of running UpdateDevicesSystemd, Go's reflect (used by json package) somehow becomes broken, resulting in the first four bytes of a random JSON field name replaced with 0xff bytes.

This can happen in any JSON, not just the one sent when passing initConfig to a child.

To catch it earlier, I added this check to utils.WriteJSON:

diff --git a/libcontainer/utils/utils.go b/libcontainer/utils/utils.go
index db420ea6..58d00d38 100644
--- a/libcontainer/utils/utils.go
+++ b/libcontainer/utils/utils.go
@@ -1,7 +1,9 @@
 package utils
 
 import (
+       "bytes"
        "encoding/json"
+       "fmt"
        "io"
        "os"
        "path/filepath"
@@ -32,6 +34,10 @@ func WriteJSON(w io.Writer, v interface{}) error {
        if err != nil {
                return err
        }
+       if bad := bytes.IndexByte(data, 0xff); bad != -1 {
+                excerpt := data[bad-16:bad+20]
+               panic(fmt.Errorf("WriteJSON: bad data at pos %d; excerpt: %s %v; original data: %+v", bad, excerpt, excerpt, v))
+        }
        _, err = w.Write(data)
        return err
 }

Then you only need to run TestUpdateDevicesSystemd:

vagrant@vagrant:~/git/runc$ cd libcontainer/integration/
vagrant@vagrant:~/git/runc/libcontainer/integration$ GOARCH=386 CGO_ENABLED=1 go test -c .
vagrant@vagrant:~/git/runc/libcontainer/integration$ time sudo -E PATH=$PATH ./integration.test -test.run UpdateDevicesSystemd -test.count 20

and now it fails like this:

	panic: WriteJSON: bad data at pos 6227; excerpt: 0,"spec_state":{����Version":"1.2.0" [48 44 34 115 112 101 99 95 115 116 97 116 101 34 58 123 255 255 255 255 86 101 114 115 105 111 110 34 58 34 49 46 50 46 48 34]; original data: &{Args:[cat] Env:[HOME=/root PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOSTNAME=integration TERM=xterm] Cwd:/ Capabilities:<nil> ProcessLabel: AppArmorProfile: NoNewPrivileges:false User: AdditionalGroups:[] Config:0x99e97a8 Networks:[0x9b6eb40] PassedFilesCount:0 ContainerID:TestUpdateDevices-if4m6d Rlimits:[{Type:7 Hard:1025 Soft:1025}] CreateConsole:false ConsoleWidth:0 ConsoleHeight:0 RootlessEUID:false RootlessCgroups:false SpecState:0x98c6d50 Cgroup2Path:/sys/fs/cgroup/test/integration}

(and so on -- in different places).

Also, runc v1.2.0 is fine, while the git main HEAD is buggy.

git bisect points to commit e809db8, which is PR #4563.

So, apparently, something in cilium/ebpf v0.17.0 messes up with Go reflection.

@kolyshkin
Copy link
Contributor Author

Latest cilium/ebpf release (v0.17.2) is still buggy. Guess I have to bisect it now.

kolyshkin added a commit to kolyshkin/ebpf that referenced this issue Jan 31, 2025
Commit 78074c5 ("info: expose more prog jited info"), which made its
way into v0.17.0, resulted in random runc CI failures on i386 (see [1]).
In some cases it manifested in a panic or SIGSEGV, and in others we saw
a slightly broken JSON, in which the first 4 bytes of a key were
replaced with 0xff byte.

Changing uintptr (which is 32 bit) back to uint64 fixes the issue for
runc. It changes the public API but I see no way around it (and the
uintptr cast of uint64 which was there before does not look correct
either).

Alas, I don't have a good reproducer, nor a unit test. For a rather
complicated one, see [1].

[1]: opencontainers/runc#4594

Signed-off-by: Kir Kolyshkin <[email protected]>
@kolyshkin
Copy link
Contributor Author

kolyshkin commented Jan 31, 2025

It's cilium/ebpf#1598 (specifically, cilium/ebpf@a34ec46) which changed ksym from uint64 to uintptr, with the last one being 32-bit on i386, but since the kernel is x86_64 and they use unsafe.Pointer and friends, this results in writing to some data adjacent to theirs. Oh well.

Proposed fix: cilium/ebpf#1660

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant