ccl/backupccl: TestRestoreDatabaseVersusTable failed #134020

Closed
cockroach-teamcity opened this issue Oct 31, 2024 · 8 comments
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-1 Issues/test failures with a fix SLA of 1 month T-storage Storage Team

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Oct 31, 2024

ccl/backupccl.TestRestoreDatabaseVersusTable failed with artifacts on release-24.3 @ c077ebf6e98bcd579481b93c83f14184ab94f2e6:

goroutine 30230 gp=0x4011b7ce00 m=nil [select]:
runtime.gopark(0x400839def8?, 0x2?, 0x27?, 0x0?, 0x400839dec4?)
	GOROOT/src/runtime/proc.go:402 +0xc8 fp=0x400ecb7d70 sp=0x400ecb7d50 pc=0x453eb8
runtime.selectgo(0x400ecb7ef8, 0x400839dec0, 0x400d7524e0?, 0x0, 0x400c5c7a98?, 0x1)
	GOROOT/src/runtime/select.go:327 +0x614 fp=0x400ecb7e80 sp=0x400ecb7d70 pc=0x467584
google.golang.org/grpc/internal/transport.(*controlBuffer).get(0x4011ef7c70, 0x1)
	external/org_golang_google_grpc/internal/transport/controlbuf.go:418 +0x14c fp=0x400ecb7f20 sp=0x400ecb7e80 pc=0xb5e21c
google.golang.org/grpc/internal/transport.(*loopyWriter).run(0x400be4ad20)
	external/org_golang_google_grpc/internal/transport/controlbuf.go:552 +0x7c fp=0x400ecb7f80 sp=0x400ecb7f20 pc=0xb5ea8c
google.golang.org/grpc/internal/transport.NewServerTransport.func2()
	external/org_golang_google_grpc/internal/transport/http2_server.go:336 +0xd8 fp=0x400ecb7fd0 sp=0x400ecb7f80 pc=0xb73688
runtime.goexit({})
	src/runtime/asm_arm64.s:1222 +0x4 fp=0x400ecb7fd0 sp=0x400ecb7fd0 pc=0x48e8a4
created by google.golang.org/grpc/internal/transport.NewServerTransport in goroutine 30227
	external/org_golang_google_grpc/internal/transport/http2_server.go:333 +0x14d4

r0      0xffff4c916b08
r1      0x400c792000
r2      0xffff4c916b08
r3      0x40000
r4      0x0
r5      0x0
r6      0x400d59d818
r7      0x40000daf08
r8      0x95
r9      0x400
r10     0x0
r11     0x5
r12     0x1
r13     0x0
r14     0x0
r15     0xffffffffffffffff
r16     0xffff4ebfd5d0
r17     0xffff4f3fcd50
r18     0x971b80
r19     0x1
r20     0xffff4f3fcb18
r21     0xffff4f3fcbd0
r22     0x1
r23     0x6364
r24     0x7a61
r25     0x40000dc8f0
r26     0xffffffffffffffff
r27     0xffffffffffffff80
r28     0x400c2121c0
r29     0xffff4f3fcca8
lr      0x437fd4
sp      0xffff4f3fccb0
pc      0x42b81c
fault   0x20
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-43874

@cockroach-teamcity cockroach-teamcity added branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery labels Oct 31, 2024
@msbutler
Collaborator

msbutler commented Nov 1, 2024

This looks like a segfault in the Go runtime? I took a quick look at the stack dump and was unable to trace the segfault back to CRDB code.

Some notes:

I doubt this has to do with backupccl code:

  • no stacks in the dump pass through production backupccl code
  • this test code is quite simple -- no concurrency
  • this test hasn't failed in months

=== RUN   TestRestoreDatabaseVersusTable
    test_log_scope.go:165: test logs captured to: /artifacts/tmp/_tmp/47447a7ed84475b6aaa4b9399a882ce0/logTestRestoreDatabaseVersusTable2978839770
    test_log_scope.go:76: use -show-logs to present logs inline
    test_server_shim.go:152: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
=== RUN   TestRestoreDatabaseVersusTable/incomplete-db
    test_server_shim.go:152: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
SIGSEGV: segmentation violation
PC=0x42b81c m=19 sigcode=1 addr=0x20

goroutine 0 gp=0x400c2121c0 m=19 mp=0x400c210008 [idle]:
runtime.(*mspan).typePointersOfUnchecked(0x40168850e0?, 0x4015086c00?)
  GOROOT/src/runtime/mbitmap_allocheaders.go:202 +0x3c fp=0xffff4f3fccd0 sp=0xffff4f3fccb0 pc=0x42b81c
runtime.scanobject(0x400c792000, 0x40000dc168)
  GOROOT/src/runtime/mgcmark.go:1441 +0x1c4 fp=0xffff4f3fcd60 sp=0xffff4f3fccd0 pc=0x437fd4
runtime.gcDrain(0x40000dc168, 0x2)
  GOROOT/src/runtime/mgcmark.go:1242 +0x1d4 fp=0xffff4f3fcdd0 sp=0xffff4f3fcd60 pc=0x437774
runtime.gcDrainMarkWorkerDedicated(...)
  GOROOT/src/runtime/mgcmark.go:1124
runtime.gcBgMarkWorker.func2()
  GOROOT/src/runtime/mgc.go:1402 +0x154 fp=0xffff4f3fce20 sp=0xffff4f3fcdd0 pc=0x433a34
runtime.systemstack(0x0)
  src/runtime/asm_arm64.s:243 +0x6c fp=0xffff4f3fce30 sp=0xffff4f3fce20 pc=0x48c3fc

goroutine 38 gp=0x4000a80a80 m=19 mp=0x400c210008 [GC worker (active)]:
runtime.systemstack_switch()
  src/runtime/asm_arm64.s:200 +0x8 fp=0x4000a88730 sp=0x4000a88720 pc=0x48c378
runtime.gcBgMarkWorker()
  GOROOT/src/runtime/mgc.go:1370 +0x204 fp=0x4000a887d0 sp=0x4000a88730 pc=0x433614
runtime.goexit({})
  src/runtime/asm_arm64.s:1222 +0x4 fp=0x4000a887d0 sp=0x4000a887d0 pc=0x48e8a4
created by runtime.gcBgMarkStartWorkers in goroutine 1
  GOROOT/src/runtime/mgc.go:1234 +0x28
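
For readers unfamiliar with the fault header in the dump above, here is a minimal Go sketch that decodes a line of the form `PC=0x42b81c m=19 sigcode=1 addr=0x20`. The names `faultInfo`, `parseFaultHeader`, and `nearNil` are hypothetical helpers invented for illustration, and the 4 KiB threshold is an assumption, not something the runtime reports. On Linux, sigcode 1 for SIGSEGV is SEGV_MAPERR (address not mapped), and a fault address as small as 0x20 is consistent with, though not proof of, a field read through a nil or corrupted pointer.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// faultInfo holds the fields of a Go runtime fault header line such as
// "PC=0x42b81c m=19 sigcode=1 addr=0x20". (Hypothetical type for illustration.)
type faultInfo struct {
	PC      uint64 // program counter at the time of the fault
	Addr    uint64 // faulting memory address
	M       int    // ID of the OS thread (M) that faulted
	Sigcode int    // si_code delivered with the signal
}

// parseFaultHeader splits the line into key=value fields and decodes the
// ones it knows about; unknown keys are ignored.
func parseFaultHeader(line string) (faultInfo, error) {
	var fi faultInfo
	for _, field := range strings.Fields(line) {
		key, val, ok := strings.Cut(field, "=")
		if !ok {
			return fi, fmt.Errorf("malformed field %q", field)
		}
		switch key {
		case "PC", "addr":
			// Base 0 lets ParseUint honor the "0x" prefix.
			n, err := strconv.ParseUint(val, 0, 64)
			if err != nil {
				return fi, fmt.Errorf("field %q: %w", field, err)
			}
			if key == "PC" {
				fi.PC = n
			} else {
				fi.Addr = n
			}
		case "m", "sigcode":
			n, err := strconv.Atoi(val)
			if err != nil {
				return fi, fmt.Errorf("field %q: %w", field, err)
			}
			if key == "m" {
				fi.M = n
			} else {
				fi.Sigcode = n
			}
		}
	}
	return fi, nil
}

// nearNil reports whether addr lies within the first 4 KiB of the address
// space. The threshold is an assumption chosen for this sketch.
func nearNil(addr uint64) bool {
	return addr < 4096
}

func main() {
	fi, err := parseFaultHeader("PC=0x42b81c m=19 sigcode=1 addr=0x20")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("pc=%#x addr=%#x m=%d sigcode=%d nearNil=%t\n",
		fi.PC, fi.Addr, fi.M, fi.Sigcode, nearNil(fi.Addr))
}
```

The PC value matches the `pc=0x42b81c` frame for `runtime.(*mspan).typePointersOfUnchecked` in the goroutine 0 stack, which is why the fault is attributed to the GC mark worker and not to test code.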

@benbardin
Collaborator

What's a good next step here? Should this (retroactively) block the beta?

@msbutler
Collaborator

msbutler commented Nov 1, 2024

I don't think so, but I can ask around.

@exalate-issue-sync exalate-issue-sync bot added the P-1 Issues/test failures with a fix SLA of 1 month label Nov 4, 2024
@benbardin benbardin assigned RaduBerinde and unassigned benbardin Nov 4, 2024
@benbardin benbardin added A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team and removed A-disaster-recovery T-disaster-recovery labels Nov 4, 2024
@benbardin
Collaborator

Provisionally assigning to Storage, on the hypothesis that this could be a bug with unsafe memory usage and they would be best equipped to track it down further. Thank you!

@RaduBerinde
Member

I have been trying to repro on an ARM AWS node (same machine type as the failed test) with no luck so far. Whatever this is, it must be extremely rare. I filed #134312 to upgrade Go to 1.22.8, which includes a fix that may, in principle, be relevant.
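
One way to confirm which toolchain a test binary was actually built with is to compare `runtime.Version()` against the target release. The sketch below assumes release-style version strings such as `go1.22.8`; `parseGoVersion` and `atLeast` are hypothetical helpers, not part of the CockroachDB tree, and development or suffixed builds are simply reported as not matching.

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
	"strings"
)

// parseGoVersion extracts major, minor, and patch from a release-style
// string such as "go1.22.8" or "go1.22". It returns ok=false for anything
// else (for example development builds).
func parseGoVersion(v string) (parts [3]int, ok bool) {
	if !strings.HasPrefix(v, "go") {
		return parts, false
	}
	fields := strings.Split(strings.TrimPrefix(v, "go"), ".")
	if len(fields) < 2 || len(fields) > 3 {
		return parts, false
	}
	for i, f := range fields {
		n, err := strconv.Atoi(f)
		if err != nil {
			return parts, false
		}
		parts[i] = n
	}
	return parts, true
}

// atLeast reports whether version string v is a release at or above want,
// comparing major, then minor, then patch.
func atLeast(v string, want [3]int) bool {
	got, ok := parseGoVersion(v)
	if !ok {
		return false
	}
	for i := range got {
		if got[i] != want[i] {
			return got[i] > want[i]
		}
	}
	return true
}

func main() {
	v := runtime.Version()
	fmt.Printf("built with %s; at least go1.22.8: %t\n", v, atLeast(v, [3]int{1, 22, 8}))
}
```

A check like this only shows which Go release produced the binary; it says nothing about whether the crash itself is fixed.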

@benbardin
Collaborator

Makes sense to me. Thank you very much, Radu!

@itsbilal itsbilal moved this from Incoming to Investigations in [Deprecated] Storage Nov 5, 2024
@RaduBerinde
Member

Still no luck reproducing. I am removing the release-blocker label, since the only likely course of action here is to upgrade Go (and that issue is itself marked as a blocker).

@RaduBerinde RaduBerinde removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Nov 5, 2024
@RaduBerinde
Member

Go was upgraded, which will hopefully address this. I was unable to reproduce the crash, so there is not much more we can do here.

@github-project-automation github-project-automation bot moved this from Investigations to Done in [Deprecated] Storage Nov 12, 2024