Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issues that caused gamescope to abort at exit #1335

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

sharkautarch
Copy link

@sharkautarch sharkautarch commented May 24, 2024

The first commit fixes issue #1305
Note: I used a user event instead of the sdl quit event, to ensure that the sdl thread is only closed from inside steamcompmgr_exit()

After fixing the aforementioned issue, I noticed that gamescope would still sometimes abort/coredump at exit.

So I found a separate issue, wherein the present_wait thread, when running GetVBlankTimer().MarkVBlank( vblanktime, true );, would segfault upon trying to access a backend while steamcompmgr_exit() was deleting said backend.
The second commit fixes this w/ an awkward dance between the two threads, where steamcompmgr_exit() asks the present_wait thread to exit, and then waits for the present_wait thread to respond back.
The reason why I did that weird stuff in the second commit is that I wanted to minimize the overhead being added to present_wait_thread_func(): for most of the runtime, this commit will not add any delay to present_wait_thread_func()

update: Slight edit to the second commit: in steamcompmgr_exit(), I moved the atomic store to g_presentThreadShouldExit so that it happens before the atomic store to g_currentPresentWaitId, just to cover any rare edge case w/ atomic load/store timing
EDIT:
I just revised this PR w.r.t. the second issue, and instead of trying to insert checks inside the present wait thread, I changed how backend creation and deletion worked, so that it is now safe to do SetBackend() while other threads are accessing the backend via GetBackend().
The added synchronization overhead should be pretty low, since the only change effecting GetBackend() and IBackend::Get() is that s_pBackend is now an atomic ptr, and the plain atomic load (even w/ the default seq_cst ordering) from s_pBackend gets compiled down to a plain move instruction on x86_64 and on aarch64, it's only a load-acquire atomic instruction (LDAR instruction, which still allows non-atomic accesses to be reordered around it)

I also fixed an additional issue that I noticed, where, at least when using SDL backend, gamescope would hang at exit (tho this isn't only able to be seen after fixing the two previous issue that would otherwise coredump gamescope at exit before it could hang).

For whatever reason, it seems like gamescope could hang at exit if the xwayland_server_guard lock was still locked when calling pthread_exit(), so this PR addresses that issue by simply unlocking that lock before calling pthread_exit().

Also disclaimer that when testing this PR w/ UBSAN, I noticed some warnings from UBSAN about vptr issues w.r.t. CVBlankTimer which only appeared while gamescope was exiting, and which don't seem to appear in upstream gamescope.
Not sure if that's just a false positive, and I couldn't find anything wrong besides that.

@sharkautarch sharkautarch force-pushed the graceful_exit branch 4 times, most recently from 41000c4 to 61392a9 Compare May 30, 2024 18:02
@sharkautarch
Copy link
Author

sharkautarch commented May 30, 2024

Last Update: for some reason, the present thread needs to run g_presentThreadShouldExit.notify_all() when exiting, to ensure that steamcompmgr_exit() doesn't wait forever. So I added that in

@matte-schwartz
Copy link

matte-schwartz commented Jul 8, 2024

just as a heads up @sharkautarch this PR causes both a gamescope and an xwayland coredump once the "Switch to Desktop" shortcut is used within gamescope-session on my 7900XTX on Fedora 40. this results in a 20-40 second freeze before the session switch executes properly and returns me to desktop.

gamescope-coredump.log
xwayland-coredump.log
gamescope-stdout-xwayland-coredump.log

edit: also, the same issue happens when applied against upstream gamescope master so you can disregard the gamescope-plus logging in gamescope-stdout

@sharkautarch
Copy link
Author

@matte-schwartz
Hmm so something is causing Xwayland to abort…
First, make sure you have asan and ubsan installed:
Make sure you have the subpackages of gcc: libasan, libubsan installed

Then, try running gamescope with these environment variables to run w/ address sanitizer:
ASAN_OPTIONS= dump_instruction_bytes=1:halt_on_error=0:fast_unwind_on_malloc=0:detect_leaks=0:leak_check_at_exit=0:handle_abort=1:intercept_tls_get_addr:1 LD_PRELOAD="$(gcc -print-file-name=libasan.so) $(gcc -print-file-name=libubsan.so)"

Note that when running w/ address sanitizer, coredumping is disabled, because otherwise the coredumps would be terabytes in size…

I’m not sure if you might need to actually recompile gamescope w/ address sanitizer instrumentation in order to catch whatever issue is going on…
Here’s how to compile with address sanitize:
just add these flags to meson setup:
-Db_sanitize=address,undefined -Dc_args="-fsanitize-recover=all -w -Wno-error=calloc-transposed-args -Wno-error=stringop-overflow" -Dc_link_args="-fsanitize-recover=all -w -Wno-error=calloc-transposed-args -Wno-error=stringop-overflow" -Dcpp_args="-fsanitize-recover=all -w -Wno-error=calloc-transposed-args -Wno-error=stringop-overflow" -Dcpp_link_args="-fsanitize-recover=all -w -Wno-error=calloc-transposed-args -Wno-error=stringop-overflow" --buildtype=debugoptimized

@matte-schwartz
Copy link

Then, try running gamescope with these environment variables to run w/ address sanitizer: ASAN_OPTIONS= dump_instruction_bytes=1:halt_on_error=0:fast_unwind_on_malloc=0:detect_leaks=0:leak_check_at_exit=0:handle_abort=1:intercept_tls_get_addr:1 LD_PRELOAD="$(gcc -print-file-name=libasan.so) $(gcc -print-file-name=libubsan.so)"

I gave this a try but keep getting an error about needing to link asan at runtime, even though I built gamescope following your instructions below. Let me read up a bit on how these tools are supposed to work so I can try and figure out what's going wrong.

I’m not sure if you might need to actually recompile gamescope w/ address sanitizer instrumentation in order to catch whatever issue is going on… Here’s how to compile with address sanitize: just add these flags to meson setup: -Db_sanitize=address,undefined -Dc_args="-fsanitize-recover=all -w -Wno-error=calloc-transposed-args -Wno-error=stringop-overflow" -Dc_link_args="-fsanitize-recover=all -w -Wno-error=calloc-transposed-args -Wno-error=stringop-overflow" -Dcpp_args="-fsanitize-recover=all -w -Wno-error=calloc-transposed-args -Wno-error=stringop-overflow" -Dcpp_link_args="-fsanitize-recover=all -w -Wno-error=calloc-transposed-args -Wno-error=stringop-overflow" --buildtype=debugoptimized

@sharkautarch
Copy link
Author

sharkautarch commented Jul 9, 2024

@matte-schwartz
Is the message saying something like: ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD ?

If so, try running ASAN again, but this time with the verify_asan_link_order=0 option:
ASAN_OPTIONS=verify_asan_link_order=0:dump_instruction_bytes=1:halt_on_error=0:fast_unwind_on_malloc=0:detect_leaks=0:leak_check_at_exit=0:handle_abort=1:intercept_tls_get_addr:1 LD_PRELOAD="$(gcc -print-file-name=libasan.so) $(gcc -print-file-name=libubsan.so)"

@sharkautarch sharkautarch force-pushed the graceful_exit branch 4 times, most recently from 9ef7f05 to 9cda7ad Compare October 24, 2024 17:03
@sharkautarch
Copy link
Author

sharkautarch commented Oct 24, 2024

@matte-schwartz I revised this PR, so hopefully it no longer causes xwayland to coredump when trying to switch to desktop in gamescope-session, tho I haven't tested that specific case, so you'll have to test that yourself.

I also ended up also having to address another issue I found in the process of revising this PR, where it appeared that gamescope may have sometimes been hanging onto a lock (xwayland_server_guard) while calling pthread_exit().
It seems like the issues caused by holding onto that lock for too long are more noticable when using the SDL backend, because, when I fixed all the issues that'd cause gamescope to coredump at exit, I found that holding onto that lock would consistently result in gamescope hanging indefinitely at exit.
With the fix to that issue that's included with this PR, where xwayland_server_guard is unlocked right before calling pthread_exit(), the aforementioned hang no longer happens!

@sharkautarch sharkautarch force-pushed the graceful_exit branch 13 times, most recently from 476a116 to a8c9bc8 Compare October 25, 2024 00:50
@sharkautarch sharkautarch force-pushed the graceful_exit branch 11 times, most recently from b59ae95 to de8186f Compare October 25, 2024 01:52
also release the xwayland_server_guard lock before calling pthread_exit() to prevent gamescope from hanging at exit
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants