Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

killing primary results in it not being able to rejoin #259

Open
yee379 opened this issue Mar 23, 2020 · 3 comments
Open

killing primary results in it not being able to rejoin #259

yee379 opened this issue Mar 23, 2020 · 3 comments
Labels

Comments

@yee379
Copy link

yee379 commented Mar 23, 2020

i have postdock spun up on kubernetes. its working great! however, the first test i did was to kill the first statefulset replica by running kubectl delete pod/postgres-0 (yeah, i renamed everything).

so before the killing of the primary:

postgres@postgres-2:/$ repmgr cluster show
 ID | Name       | Role    | Status    | Upstream   | Location | Connection string
----+------------+---------+-----------+------------+----------+--------------------------------------------------------------------------------------------
 1000 | postgres-0 | primary | * running |            | default  | user=repl password=replicate host=postgres-0 dbname=replica_db port=5432 connect_timeout=2
 1001 | postgres-1 | standby |   running | postgres-0 | default  | user=repl password=replicate host=postgres-1 dbname=replica_db port=5432 connect_timeout=2
 1002 | postgres-2 | standby |   running | postgres-0 | default  | user=repl password=replicate host=postgres-2 dbname=replica_db port=5432 connect_timeout=2

and then after i get

postgres@postgres-2:/$ repmgr cluster show
 ID | Name       | Role    | Status    | Upstream   | Location | Connection string
----+------------+---------+-----------+------------+----------+--------------------------------------------------------------------------------------------
 1001 | postgres-1 | primary | * running |            | default  | user=repl password=replicate host=postgres-1 dbname=replica_db port=5432 connect_timeout=2
 1002 | postgres-2 | standby |   running | postgres-1 | default  | user=repl password=replicate host=postgres-2 dbname=replica_db port=5432 connect_timeout=2

postres-1 took over primary... good! but postgres-0 is completely missing... the logs from postgres-0

❯ k logs postgres-0 -f
>>> Setting up STOP handlers...
>>> STARTING SSH (if required)...
>>> SSH is not enabled!
>>> STARTING POSTGRES...
>>> SETTING UP POLYMORPHIC VARIABLES (repmgr=3+postgres=9 | repmgr=4, postgres=10)...
>>> TUNING UP POSTGRES...
>>> Configuring /var/lib/postgresql/data/postgresql.conf
>>>>>> Will add configs to the exists file
>>>>>> Adding config 'wal_keep_segments'='250'
>>>>>> Adding config 'shared_buffers'='300MB'
>>>>>> Adding config 'archive_command'=''/bin/true''
>>>>>> Adding config 'shared_preload_libraries'=''repmgr''
>>> Check all partner nodes for common upstream node...
>>>>>> Checking NODE=postgres-0.postgres...
psql: could not connect to server: Connection refused
	Is the server running on host "postgres-0.postgres" (10.40.0.13) and accepting
	TCP/IP connections on port 5432?
>>>>>> Skipping: failed to get master from the node!
>>>>>> Checking NODE=postgres-1.postgres...
>>>>>>>>> Pretending master role node - postgres-0.postgres
>>>>>> Checking NODE=postgres-2.postgres...
>>>>>>>>> Pretending master role node - postgres-0.postgres
>>> Auto-detected master name: 'postgres-0.postgres'
>>> Setting up repmgr...
>>> Setting up repmgr config file '/etc/repmgr.conf'...
>>> Setting up upstream node...
cat: /var/lib/postgresql/data/standby.lock: No such file or directory
>>> Previously Locked standby upstream node LOCKED_STANDBY=''
>>> Waiting for upstream postgres server...
>>> Wait schema replica_db.repmgr on postgres-0.postgres:5432(user: repl,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
psql: could not connect to server: Connection refused
	Is the server running on host "postgres-0.postgres" (10.40.0.13) and accepting
	TCP/IP connections on port 5432?
[snip!]
>>>>>> Host postgres-0.postgres:5432 is not accessible (will try 2 times more)
psql: could not connect to server: Connection refused
	Is the server running on host "postgres-0.postgres" (10.40.0.13) and accepting
	TCP/IP connections on port 5432?
>>>>>> Host postgres-0.postgres:5432 is not accessible (will try 1 times more)
>>> Schema replica_db.repmgr is not accessible, even after 30 tries!

Then it crashes and upon restart (ad infinitum...)

❯ k logs postgres-0 -f
>>> Setting up STOP handlers...
>>> STARTING SSH (if required)...
>>> SSH is not enabled!
>>> STARTING POSTGRES...
>>> SETTING UP POLYMORPHIC VARIABLES (repmgr=3+postgres=9 | repmgr=4, postgres=10)...
>>> TUNING UP POSTGRES...
>>> Configuring /var/lib/postgresql/data/postgresql.conf
>>>>>> Will add configs to the exists file
>>>>>> Adding config 'wal_keep_segments'='250'
>>>>>> Adding config 'shared_buffers'='300MB'
>>>>>> Adding config 'archive_command'=''/bin/true''
>>>>>> Adding config 'shared_preload_libraries'=''repmgr''
>>> Check all partner nodes for common upstream node...
>>>>>> Checking NODE=postgres-0.postgres...
psql: could not connect to server: Connection refused
	Is the server running on host "postgres-0.postgres" (10.40.0.13) and accepting
	TCP/IP connections on port 5432?
>>>>>> Skipping: failed to get master from the node!
>>>>>> Checking NODE=postgres-1.postgres...
>>>>>>>>> Pretending master role node - postgres-1.postgres
>>>>>> Checking NODE=postgres-2.postgres...
>>>>>>>>> Pretending master role node - postgres-1.postgres
>>> Auto-detected master name: 'postgres-1.postgres'
>>> Setting up repmgr...
>>> Setting up repmgr config file '/etc/repmgr.conf'...
>>> Setting up upstream node...
cat: /var/lib/postgresql/data/standby.lock: No such file or directory
>>> Previously Locked standby upstream node LOCKED_STANDBY=''
>>> Waiting for upstream postgres server...
>>> Wait schema replica_db.repmgr on postgres-1.postgres:5432(user: repl,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
>>>>>> Schema replica_db.repmgr exists on host postgres-1.postgres:5432!
>>> Waiting for upstream node id replica_db on postgres-1.postgres:5432(user: repl,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
>>>>>> Upstream node id 1001 received from host postgres-1.postgres:5432!
>>> REPLICATION_UPSTREAM_NODE_ID=1001
>>> Sending in background postgres start...
>>> Waiting for upstream postgres server...
>>> Wait schema replica_db.repmgr on postgres-1.postgres:5432(user: repl,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
>>>>>> Schema replica_db.repmgr exists on host postgres-1.postgres:5432!
>>> Data folder is not empty /var/lib/postgresql/data:
total 136
drwx------ 19 postgres tape      4096 Mar 23 07:31 .
drwxr-xr-x  1 postgres postgres  4096 Mar 23 07:31 ..
-rwx------  1 postgres postgres   175 Mar 23 07:29 backup_label.old
drwx------  7 postgres postgres  4096 Mar 23 07:17 base
drwx------  2 postgres postgres  4096 Mar 23 07:30 global
drwx------  2 postgres postgres  4096 Mar 23 07:17 pg_commit_ts
drwx------  2 postgres postgres  4096 Mar 23 07:17 pg_dynshmem
-rwx------  1 postgres postgres  4571 Mar 23 07:29 pg_hba.conf
-rwx------  1 postgres postgres  1636 Mar 23 07:29 pg_ident.conf
drwx------  4 postgres postgres  4096 Mar 23 07:30 pg_logical
drwx------  4 postgres postgres  4096 Mar 23 07:17 pg_multixact
drwx------  2 postgres postgres  4096 Mar 23 07:30 pg_notify
drwx------  2 postgres postgres  4096 Mar 23 07:29 pg_replslot
drwx------  2 postgres postgres  4096 Mar 23 07:17 pg_serial
drwx------  2 postgres postgres  4096 Mar 23 07:17 pg_snapshots
drwx------  2 postgres postgres  4096 Mar 23 07:30 pg_stat
drwx------  2 postgres postgres  4096 Mar 23 07:30 pg_stat_tmp
drwx------  2 postgres postgres  4096 Mar 23 07:30 pg_subtrans
drwx------  2 postgres postgres  4096 Mar 23 07:17 pg_tblspc
drwx------  2 postgres postgres  4096 Mar 23 07:17 pg_twophase
-rwx------  1 postgres postgres     3 Mar 23 07:17 PG_VERSION
drwx------  3 postgres postgres  4096 Mar 23 07:29 pg_wal
drwx------  2 postgres postgres  4096 Mar 23 07:17 pg_xact
-rwx------  1 postgres postgres    88 Mar 23 07:29 postgresql.auto.conf
-rwx------  1 postgres postgres 21646 Mar 23 07:31 postgresql.conf
-rwx------  1 postgres postgres    36 Mar 23 07:30 postmaster.opts
-rwx------  1 postgres postgres    96 Mar 23 07:30 postmaster.pid
-rwx------  1 postgres postgres   232 Mar 23 07:29 recovery.done
>>> Starting standby node...
>>> Instance has been set up already.
>>> Rewinding to the latest state
>>>>>> Archiving configs
The following command line errors were encountered:
  unknown repmgr action 'standby archive-config'
Try "repmgr --help" for more information.
>>>>>> Start server to be able to rewind (weird hack to avoid dirty shutdown issue)
waiting for server to start....2020-03-23 07:31:21.567 UTC [189] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2020-03-23 07:31:21.567 UTC [189] LOG:  could not bind IPv6 address "::1": Cannot assign requested address
2020-03-23 07:31:21.567 UTC [189] HINT:  Is another postmaster already running on port 5432? If not, wait a few seconds and retry.
2020-03-23 07:31:21.568 UTC [189] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-03-23 07:31:21.597 UTC [190] LOG:  database system was interrupted; last known up at 2020-03-23 07:30:19 UTC
2020-03-23 07:31:21.626 UTC [190] LOG:  database system was not properly shut down; automatic recovery in progress
2020-03-23 07:31:21.627 UTC [190] LOG:  redo starts at 0/5026958
2020-03-23 07:31:21.628 UTC [190] LOG:  redo done at 0/5026990
2020-03-23 07:31:21.638 UTC [189] LOG:  database system is ready to accept connections
 done
server started
>>>>>> Removing unactive replication slots of partners
>>> Getting slots from master node
repmgr_slot_1002
>>> Closing slots which exist on master node
>>>>>> Closing slot repmgr_slot_1002
2020-03-23 07:31:21.809 UTC [226] ERROR:  replication slot "repmgr_slot_1002" does not exist
2020-03-23 07:31:21.809 UTC [226] STATEMENT:  SELECT pg_drop_replication_slot('repmgr_slot_1002')
ERROR:  replication slot "repmgr_slot_1002" does not exist
>>>>>>>>> Can't close the slot!
>>> Closing slot which current master used on the node
>>>>>> Closing slot repmgr_slot_1001
2020-03-23 07:31:21.989 UTC [257] ERROR:  replication slot "repmgr_slot_1001" does not exist
2020-03-23 07:31:21.989 UTC [257] STATEMENT:  SELECT pg_drop_replication_slot('repmgr_slot_1001');
ERROR:  replication slot "repmgr_slot_1001" does not exist
>>>>>>>>> Can't close the slot!
>>>>>> Stop server
2020-03-23 07:31:21.996 UTC [189] LOG:  received fast shutdown request
waiting for server to shut down....2020-03-23 07:31:21.997 UTC [189] LOG:  aborting any active transactions
2020-03-23 07:31:22.000 UTC [189] LOG:  background worker "logical replication launcher" (PID 196) exited with exit code 1
2020-03-23 07:31:22.000 UTC [191] LOG:  shutting down
2020-03-23 07:31:22.016 UTC [189] LOG:  database system is shut down
 done
server stopped
>>>>>> Creating replication slot for the standby in case it's not there
cat: /var/lib/postgresql/data/recovery.conf: No such file or directory
>>>>>> Rewind to the host postgres-1.postgres
source and target cluster are on the same timeline
no rewind required
>>>>>> Restoring configs
The following command line errors were encountered:
  unknown repmgr action 'standby restore-config'
Try "repmgr --help" for more information.
>>>>>> Tell repmgr to follow upstream for the node
DEBUG: do_standby_follow()
DEBUG: connecting to: "user=repl password=replicate connect_timeout=2 dbname=replica_db host=postgres-0.postgres port=5432 fallback_application_name=repmgr"
ERROR: connection to database failed:
  could not connect to server: Connection refused
	Is the server running on host "postgres-0.postgres" (10.40.0.13) and accepting
	TCP/IP connections on port 5432?

DETAIL: attempted to connect using:
  user=repl password=replicate connect_timeout=2 dbname=replica_db host=postgres-0.postgres port=5432 fallback_application_name=repmgr
HINT: use "repmgr node rejoin" to re-add an inactive node to the replication cluster
INFO: looking for configuration file in /etc
INFO: configuration file found at: "/etc/repmgr.conf"
>>>>>> Stop server
pg_ctl: PID file "/var/lib/postgresql/data/postmaster.pid" does not exist
Is server running?
>>> Configuring /var/lib/postgresql/data/postgresql.conf
>>>>>> Will add configs to the exists file
>>>>>> Adding config 'wal_keep_segments'='250'
>>>>>> Adding config 'shared_buffers'='300MB'
>>>>>> Adding config 'archive_command'=''/bin/true''
>>>>>> Adding config 'shared_preload_libraries'=''repmgr''
>>> Starting postgres...
>>> Waiting for local postgres server recovery if any in progress:LAUNCH_RECOVERY_CHECK_INTERVAL=30
>>> Recovery is in progress:
2020-03-23 07:31:22.308 UTC [309] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2020-03-23 07:31:22.308 UTC [309] LOG:  listening on IPv6 address "::", port 5432
2020-03-23 07:31:22.309 UTC [309] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-03-23 07:31:22.339 UTC [320] LOG:  database system was shut down at 2020-03-23 07:31:22 UTC
2020-03-23 07:31:22.345 UTC [309] LOG:  database system is ready to accept connections
>>>>>> RECOVERY_WAL_ID is empty!
>>> Not in recovery state (anymore)
>>> Waiting for local postgres server start...
>>> Wait schema replica_db.public on postgres-0.postgres:5432(user: repl,password: *******), will try 60 times with delay 10 seconds (TIMEOUT=600)
>>>>>> Schema replica_db.public exists on host postgres-0.postgres:5432!
>>> Unregister the node if it was done before
DELETE 0
>>> Registering node with role standby
INFO: connecting to local node "postgres-0" (ID: 1000)
ERROR: this node should be a standby (user=repl password=replicate host=postgres-0.postgres dbname=replica_db port=5432 connect_timeout=2)
>>> Starting repmgr daemon...
INFO: looking for configuration file in /etc
INFO: configuration file found at: "/etc/repmgr.conf"
[2020-03-23 07:31:52] [NOTICE] repmgrd (repmgr 4.0.6) starting up
[2020-03-23 07:31:52] [INFO] connecting to database "user=repl password=replicate host=postgres-0.postgres dbname=replica_db port=5432 connect_timeout=2"
[2020-03-23 07:31:52] [ERROR] this node is marked as inactive and cannot be used as a failover target
[2020-03-23 07:31:52] [HINT] Check that "repmgr (primary|standby) register" was executed for this node
[2020-03-23 07:31:52] [INFO] executing notification command for event "repmgrd_shutdown"
[2020-03-23 07:31:52] [DETAIL] command is:
  /usr/local/bin/cluster/repmgr/events/router.sh 1000 repmgrd_shutdown 0 "2020-03-23 07:31:52+0000" "node is inactive and cannot be used as a failover target"
[2020-03-23 07:31:52] [INFO] repmgrd terminating...
@webdobe
Copy link

webdobe commented Feb 24, 2021

I am in the same boat....

I am running 2 servers 1 server has the master and a slave. The second server has 2 more slaves and the pool and backup.

When I shutdown server 1. Master goes down and a slave on the second server get's promoted... But I also lose one of the slaves because it is pointing at the old master...

I spool up server 1 Master cannot join because it is no longer master. The slave wont' start because pgmaster won't start...

So now server 1 has no running databases. Server 2 has only 1 running.

How do we rejoin the original master? How do we ensure we have all 4 nodes up?

What am I missing here? All the the test cases in your docs show to the point when a node is promoted... But nothing for the steps to take when putting things back up.

Any information surrounding this is greatly appreciated! I will keep hacking at it...

@smokes2345
Copy link

smokes2345 commented Jul 8, 2021

I am also experiencing this issue. I don't think i have anything to add that @yee379 hasn't already covered. @yee379 its been over a year since you opened this issue, did you ever find a solution or work-around?

EDIT: For anyone else experiencing this issue, i've found removing the contents of the failed master's pg-data dir allows the failed master to startup and join the cluster again.

@smokes2345
Copy link

@paunin it looks like PR #261 might fix this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants