Investigate and potentially fix the multi-threading issue with decoding using the klauspost leopard implementation #123

Open
evan-forbes opened this issue Sep 15, 2022 · 3 comments
Labels
bug Something isn't working

Comments

@evan-forbes
Member

evan-forbes commented Sep 15, 2022

Parallelising the crossword loop works fine with infectious, but errors with go-leopard and klauspost leopard. This looks like a potential issue with running decodes in parallel. With infectious, parallelising gave a 4x+ performance increase for 128x128 blocks.

Branch with parallel decoding (klauspost): https://github.com/celestiaorg/rsmt2d/blob/parallelisation_klauspost/extendeddatacrossword.go#L73

klauspost leopard:

goos: linux
goarch: amd64
pkg: github.com/celestiaorg/rsmt2d
cpu: 11th Gen Intel(R) Core(TM) i9-11900H @ 2.50GHz
BenchmarkRepair/RSGF8_4x4x256_ODS-16         	--- FAIL: BenchmarkRepair/RSGF8_4x4x256_ODS-16
    extendeddatacrossword_test.go:217: byzantine row: 3
BenchmarkRepair/LeopardFF8_4x4x256_ODS-16    	     518	   2273441 ns/op
BenchmarkRepair/LeopardFF16_4x4x256_ODS-16   	     499	   2318601 ns/op
BenchmarkRepair/RSGF8_8x8x256_ODS-16         	--- FAIL: BenchmarkRepair/RSGF8_8x8x256_ODS-16
    extendeddatacrossword_test.go:217: byzantine row: 7
BenchmarkRepair/LeopardFF8_8x8x256_ODS-16    	     300	   3929318 ns/op
BenchmarkRepair/LeopardFF16_8x8x256_ODS-16   	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x5781d4]

goroutine 436131 [running]:
github.com/klauspost/reedsolomon.mulgf16_avx2({0xc0092c6d00, 0x100, 0x100}, {0x0, 0x100, 0x100}, 0xc0048ccc80)
	/home/mus/go/pkg/mod/github.com/klauspost/[email protected]/galois_gen_amd64.s:63559 +0x54
github.com/klauspost/reedsolomon.mulgf16({0xc0092c6d00?, 0x10?, 0x10?}, {0x0?, 0x100?, 0xc000280000?}, 0xffff?, 0x0?)
	/home/mus/go/pkg/mod/github.com/klauspost/[email protected]/galois_amd64.go:336 +0x6d
github.com/klauspost/reedsolomon.(*leopardFF16).reconstruct(0xc00011c000, {0xc006300300, 0x10?, 0x10?}, 0x1)
	/home/mus/go/pkg/mod/github.com/klauspost/[email protected]/leopard.go:456 +0x7fa
github.com/klauspost/reedsolomon.(*leopardFF16).Reconstruct(0x203001?, {0xc006300300?, 0x203001?, 0xc006300300?})
	/home/mus/go/pkg/mod/github.com/klauspost/[email protected]/leopard.go:315 +0x25
github.com/celestiaorg/rsmt2d.decode({0xc006300300, 0x10, 0x10})
	/home/mus/Code/rsmt2d/leopard.go:62 +0xf8
github.com/celestiaorg/rsmt2d.leoRSFF16Codec.Decode(...)
	/home/mus/Code/rsmt2d/leopard.go:81
github.com/celestiaorg/rsmt2d.(*ExtendedDataSquare).rebuildShares(0xc000078000, 0x1, {0xc006300300?, 0x10, 0x10})
	/home/mus/Code/rsmt2d/extendeddatacrossword.go:259 +0x58
github.com/celestiaorg/rsmt2d.(*ExtendedDataSquare).solveCrosswordCol(0xc000078000, 0x2, {0xc006300600, 0x10, 0x10}, {0xc006300780, 0x10, 0x10})
	/home/mus/Code/rsmt2d/extendeddatacrossword.go:218 +0x1f3
github.com/celestiaorg/rsmt2d.(*ExtendedDataSquare).solveCrossword.func2()
	/home/mus/Code/rsmt2d/extendeddatacrossword.go:102 +0x4f
golang.org/x/sync/errgroup.(*Group).Go.func1()
	/home/mus/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
	/home/mus/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:72 +0xa5
exit status 2
FAIL	github.com/celestiaorg/rsmt2d	6.506s

go-leopard:

goos: linux
goarch: amd64
pkg: github.com/celestiaorg/rsmt2d
cpu: 11th Gen Intel(R) Core(TM) i9-11900H @ 2.50GHz
BenchmarkRepair/Repairing_16x16_ODS_using_LeopardFF16-16         	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x46ae98]

goroutine 6166 [running]:
github.com/celestiaorg/go-leopard._Cfunc_CBytes(...)
	_cgo_gotypes.go:46
github.com/celestiaorg/go-leopard.copyByteBuffer.func1({0x0, 0x100, 0xc000500001?})
	/home/mus/go/pkg/mod/github.com/celestiaorg/[email protected]/wrapper.go:215 +0x85
github.com/celestiaorg/go-leopard.copyByteBuffer({0x0?, 0xc00022dbc8?, 0x44f9d2?})
	/home/mus/go/pkg/mod/github.com/celestiaorg/[email protected]/wrapper.go:215 +0x25
github.com/celestiaorg/go-leopard.copyToCmallocedPtrs({0xc0003c0600, 0x10, 0x0?})
	/home/mus/go/pkg/mod/github.com/celestiaorg/[email protected]/wrapper.go:208 +0x85
github.com/celestiaorg/go-leopard.Recover({0xc0003c0600?, 0x10, 0x20}, {0xc0003c0780?, 0x10, 0x10})
	/home/mus/go/pkg/mod/github.com/celestiaorg/[email protected]/wrapper.go:127 +0x36a
github.com/celestiaorg/go-leopard.Decode({0xc0003c0600?, 0x10, 0x4205e0?}, {0xc0003c0780?, 0x10, 0xc0006ef670?})
	/home/mus/go/pkg/mod/github.com/celestiaorg/[email protected]/wrapper.go:160 +0x3b
github.com/celestiaorg/rsmt2d.leoRSFF16Codec.Decode({}, {0xc0003c0600?, 0x300?, 0x0?})
	/home/mus/Code/rsmt2d/leopard.go:45 +0x54
github.com/celestiaorg/rsmt2d.(*ExtendedDataSquare).rebuildShares(0xc0000e8060, 0x1, {0xc0003c0600?, 0x20, 0x20})
	/home/mus/Code/rsmt2d/extendeddatacrossword.go:259 +0x58
github.com/celestiaorg/rsmt2d.(*ExtendedDataSquare).solveCrosswordRow(0xc0000e8060, 0x1a, {0xc000582000, 0x20, 0x20}, {0xc000582300, 0x20, 0x20})
	/home/mus/Code/rsmt2d/extendeddatacrossword.go:156 +0x1d4
github.com/celestiaorg/rsmt2d.(*ExtendedDataSquare).solveCrossword.func1()
	/home/mus/Code/rsmt2d/extendeddatacrossword.go:91 +0x4f
golang.org/x/sync/errgroup.(*Group).Go.func1()
	/home/mus/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
	/home/mus/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:72 +0xa5
exit status 2
FAIL	github.com/celestiaorg/rsmt2d	0.101s

Originally posted by @musalbas in #116 (comment)

We should investigate why the klauspost leopard implementation breaks when used to decode data in parallel, so that we can use it to repair the square in a parallelized way. Depending on the reason why it isn't working, we might also want to fix it.
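
One possible first step for the investigation is to check whether a codec stays correct when every goroutine decodes its own private copy of the data. If that passes (ideally under `go test -race`) while the parallel crossword loop still fails, the problem is more likely in how the caller shares slices than in the codec. The sketch below is only an illustration under stated assumptions: `Codec`, `xorCodec`, `makeShards`, and `stressDecode` are hypothetical names, and `xorCodec` is a toy single-parity code standing in for a real codec, not Leopard.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// Codec is a hypothetical interface: Decode fills in nil shards.
type Codec interface {
	Decode(shards [][]byte) ([][]byte, error)
}

// xorCodec is a toy stand-in: k data shards plus one XOR parity shard.
// It can rebuild at most one missing shard and assumes equal shard sizes.
type xorCodec struct{}

func (xorCodec) Decode(shards [][]byte) ([][]byte, error) {
	missing, size := -1, 0
	for i, s := range shards {
		if s == nil {
			if missing >= 0 {
				return nil, fmt.Errorf("xorCodec: more than one missing shard")
			}
			missing = i
		} else {
			size = len(s)
		}
	}
	if missing < 0 {
		return shards, nil
	}
	rebuilt := make([]byte, size)
	for i, s := range shards {
		if i == missing {
			continue
		}
		for j := range s {
			rebuilt[j] ^= s[j]
		}
	}
	shards[missing] = rebuilt
	return shards, nil
}

// makeShards builds k deterministic data shards and one XOR parity shard.
func makeShards(k, size int) [][]byte {
	shards := make([][]byte, k+1)
	parity := make([]byte, size)
	for i := 0; i < k; i++ {
		shards[i] = make([]byte, size)
		for j := range shards[i] {
			shards[i][j] = byte(i*31 + j)
			parity[j] ^= shards[i][j]
		}
	}
	shards[k] = parity
	return shards
}

// stressDecode runs `workers` concurrent decodes, each on a private deep
// copy of original with one shard erased, and checks the rebuilt shard.
func stressDecode(c Codec, workers int, original [][]byte) error {
	var wg sync.WaitGroup
	errs := make([]error, workers)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			shards := make([][]byte, len(original))
			for i, s := range original {
				shards[i] = append([]byte(nil), s...) // no shared backing array
			}
			erased := w % len(shards)
			shards[erased] = nil
			out, err := c.Decode(shards)
			if err != nil {
				errs[w] = err
				return
			}
			if !bytes.Equal(out[erased], original[erased]) {
				errs[w] = fmt.Errorf("worker %d: shard %d mismatch", w, erased)
			}
		}(w)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}
```

Swapping `xorCodec` for an adapter around the codec under test would exercise the same pattern. The deep copy is the important part, since it rules out the caller as the source of any shared state.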

@evan-forbes evan-forbes moved this to TODO in Celestia Node Sep 15, 2022
@evan-forbes evan-forbes changed the title from "Investigate and potentially fix the multi-threading issue with decoding using the klauspost leopard implementation.\" to "Investigate and potentially fix the multi-threading issue with decoding using the klauspost leopard implementation" Sep 15, 2022
@musalbas
Member

cc @elias-orijtech

@rahulghangas
Contributor

It seems there are multiple race conditions within the parallel implementation as well.

@musalbas
Member

Where? Can you open an issue describing the race conditions?

@adlerjohn adlerjohn linked a pull request Sep 28, 2022 that will close this issue
1 task
@rootulp rootulp removed this from Celestia Node Mar 14, 2023
@rootulp rootulp added the bug Something isn't working label Mar 14, 2023
4 participants