Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

resolve additional deadlock during drop database :: pglogical_apply.c :: CHECK_FOR_INTERRUPTS() #427

Open
wants to merge 1 commit into
base: REL2_x_STABLE
Choose a base branch
from

Conversation

Andrei-Pozolotin
Copy link

@Andrei-Pozolotin Andrei-Pozolotin commented Jun 2, 2023

this was tested on top of: PostgreSQL-15.3 + pglogical-2.4.3

resolve additional deadlock during drop database:
fix #418
fix #399

solution for the deadlock in pglogical_apply_main()
is identical to the solution for the deadlock in pglogical_supervisor_main():
#403

@Andrei-Pozolotin
Copy link
Author

note: there are still few more "deadlock candidates" in the code, for example, search for:

while (!got_SIGTERM)

and review (loops are not using CHECK_FOR_INTERRUPTS()):

pglogical_sync.c

pglogical_sync_worker_cleanup(PGLogicalSubscription *sub)
wait_for_sync_status_change(Oid subid, const char *nspname, const char *relname,

@Andrei-Pozolotin
Copy link
Author

@eulerto please review

@nmisch
Copy link
Contributor

nmisch commented Oct 2, 2023

I've reviewed this. Note that the linked issues say this began with PostgreSQL
15 commit postgres/postgres@4eb2176318d, but it was Windows-only until
postgres/postgres@e2f65f4255. I do recommend merging this PR as-is, but it
would be even better with these improvements:

  1. Changing pglogical_sync_worker_cleanup and wait_for_sync_status_change
    the same way, even though they're not expected to loop indefinitely like
    apply_work

  2. Normal order of operations is:
    WaitLatch
    ResetLatch
    check WL_POSTMASTER_DEATH
    CHECK_FOR_INTERRUPTS()
    normal loop work

    See
    https://github.com/postgres/postgres/blob/b1a8dc846da4d96d903dcb5733f68a1e02d82a23/src/include/storage/latch.h#L41
    for some background and
    https://github.com/postgres/postgres/blob/d392e9bdea957964e1fa6a5481e5adb5904d759a/src/test/modules/worker_spi/worker_spi.c#L238
    for an example. This PR is instead putting CHECK_FOR_INTERRUPTS() just
    before WaitLatch. This is largely harmless, and order does vary within
    pglogical code. Still, I recommend moving the call to just after the
    WL_POSTMASTER_DEATH check.

  3. Add test coverage, perhaps just:

    --- a/sql/parallel.sql
    +++ b/sql/parallel.sql
    @@ -34,6 +34,10 @@ SELECT sync_kind, sync_subid, sync_nspname, sync_relname, sync_status IN ('y', '
     
     SELECT * FROM pglogical.show_subscription_status();
     
    +-- apply worker shall not block DROP of a different database
    +CREATE DATABASE to_be_dropped;
    +DROP DATABASE to_be_dropped;
    +
     -- Make sure we see the slot and active connection
     \c :provider_dsn
     SELECT plugin, slot_type, database, active FROM pg_replication_slots;
    
  4. Change leading whitespace to match surrounding lines (spaces vs. tabs).

@nmisch
Copy link
Contributor

nmisch commented Jan 24, 2024

This achieves a lot for a one-line change. @eulerto (since no other person has
merged a pglogical commit in the last eight months), would you merge this? If
not, what would you like to see first?

@nmisch
Copy link
Contributor

nmisch commented Mar 3, 2024

@petere I see you are the other of the two people who have committed to
pglogical in the last 2y. This pull request achieves a lot for a one-line
change. Would you merge this? If not, what would you like to see first?

@nmisch
Copy link
Contributor

nmisch commented May 11, 2024

@eulerto @petere @mwanner This is just adding a CHECK_FOR_INTERRUPTS() so DROP
DATABASE does not hang. For 11 months, the pull request has been waiting
attention from someone with repository write access. Would you grant me write
access so I can take responsibility for merging this? If not, how should we
get this process-hang bug resolved?

@eulerto
Copy link
Contributor

eulerto commented May 22, 2024

@nmisch I will check all pending PRs (including this one) in the next few days.

@nmisch
Copy link
Contributor

nmisch commented Aug 23, 2024

@eulerto This is still just adding one CHECK_FOR_INTERRUPTS(), and it's been
waiting 14 months for someone with repository write access. Adding a
CHECK_FOR_INTERRUPTS() is one of the safest changes doable in PostgreSQL code.
If merging this creates the need for a followup fix, you can count on me to
write that followup fix. What is one obstacle to merging it over the next
month?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants