Differences

This shows you the differences between two versions of the page.

--- postgresql_repmgr_installation [2021/09/17 06:25] – andonovj
+++ postgresql_repmgr_installation [2021/09/28 11:30] (current) – andonovj
@@ Line 9: / Line 9: @@
   * 2nd Quandrant - For Rempgr
+{{:1_g-5e4zhde9lojvmbdsiptg.png?600|}}
 =====Virtual Environment Config=====
@@ Line 533: / Line 535: @@
 ====Automatic Failover====
 To test that, we can shutdown the primary and see if it is failover automatically (as per our settings)
+Now, I changed a little bit the config, so please check Appendix A.
+===Modify Postgres user===
+Firstly, we need to add the postgres user to the sudo group:
+<Code:bash|Add user to Sudoers group>
+### Server 1
+root@server2:/var/log/postgresql# usermod -aG sudo postgres
+root@server2:/var/log/postgresql#
+### Server 2
+root@server2:/var/log/postgresql# usermod -aG sudo postgres
+root@server2:/var/log/postgresql#
+</Code>
+===Test the failover===
+Now, we can finally test the failover.
 <Code:bash|Test Failover>
+postgres@server1:/var/log/postgresql$ repmgr cluster show
+ ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
+----+---------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------
+  | server1 | standby | * running |          | default  | 100      | 3        | host=server1 user=repmgr dbname=repmgr
+  | server2 | primary |   running | server1  | default  | 100      | 3        | host=server2 user=repmgr dbname=repmgr
+</Code>
+We can simply shutdown the prod:
+<Code:bash|Shutdown production>
+root@server2:/var/log/postgresql# service postgresql restart
 </Code>
+Once, this is done, we can see the log file from server1:
+<Code:bash|Check the log on standby>
+[2021-09-28 10:46:45] [INFO] primary and this node have the same location ("default")
+[2021-09-28 10:46:45] [INFO] no other sibling nodes - we win by default
+[2021-09-28 10:46:45] [DEBUG] election result: WON
+[2021-09-28 10:46:45] [NOTICE] this node is the only available candidate and will now promote itself
+[2021-09-28 10:46:45] [INFO] promote_command is:
+  "repmgr standby promote -f /etc/repmgr.conf --log-to-file"
+[2021-09-28 10:46:45] [NOTICE] redirecting logging output to "/var/log/postgresql/repmgr.log"
+[2021-09-28 10:46:45] [DEBUG] connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=server1 fallback_application_name=repmgr"
+[2021-09-28 10:46:45] [DEBUG] connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=server2 fallback_application_name=repmgr"
+[2021-09-28 10:46:45] [DEBUG] connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=server1 fallback_application_name=repmgr"
+[2021-09-28 10:46:45] [NOTICE] promoting standby to primary
+[2021-09-28 10:46:45] [DETAIL] promoting server "server1" (ID: 1) using pg_promote()
+[2021-09-28 10:46:45] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
+[2021-09-28 10:46:46] [DEBUG] setting node 1 as primary and marking existing primary as failed
+[2021-09-28 10:46:46] [NOTICE] STANDBY PROMOTE successful
+[2021-09-28 10:46:46] [DETAIL] server "server1" (ID: 1) was successfully promoted to primary
+[2021-09-28 10:46:46] [INFO] 0 followers to notify
+[2021-09-28 10:46:46] [INFO] switching to primary monitoring mode
+[2021-09-28 10:46:46] [NOTICE] monitoring cluster primary "server1" (ID: 1)
+</Code>
+As we can see, server1 has been promoted to primary, that however leaves us to deal with the aftermath.
+===Dealing with the aftermath===
+Now, we have one primary and 1 failed primary, we need to re-establish connection.
+<Code:bash|Re-establish the connection>
+postgres@server1:/var/log/postgresql$ repmgr cluster show
+DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=server1 fallback_application_name=repmgr"
+DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=server1 fallback_application_name=repmgr"
+DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=server2 fallback_application_name=repmgr"
+ ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
+----+---------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------
+  | server1 | primary | * running |          | default  | 100      | 3        | host=server1 user=repmgr dbname=repmgr connect_timeout=2
+  | server2 | primary | - failed  |          | default  | 100      | ?        | host=server2 user=repmgr dbname=repmgr connect_timeout=2
+WARNING: following issues were detected
+  - unable to connect to node "server2" (ID: 2)
+HINT: execute with --verbose option to see connection error messages
+postgres@server1:
+</Code>
+We can re-establish the connection as follows:
+. Clear the old path and Resnap:
+<Code:bash|Resnap failed server's data>
+root@server2:/var/log/postgresql# rm -rf /var/lib/postgresql/12/main/
+root@server2:/var/log/postgresql# su - postgres
+postgres@server2:~$ repmgr -h server1 -U repmgr -d repmgr standby clone
+WARNING: following problems with command line parameters detected:
+  "config_directory" set in repmgr.conf, but --copy-external-config-files not provided
+NOTICE: destination directory "/var/lib/postgresql/12/main" provided
+INFO: connecting to source node
+DETAIL: connection string is: host=server1 user=repmgr dbname=repmgr
+DETAIL: current installation size is 31 MB
+DEBUG: 2 node records returned by source node
+DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=server1 fallback_application_name=repmgr"
+DEBUG: upstream_node_id determined as 1
+NOTICE: checking for available walsenders on the source node (2 required)
+NOTICE: checking replication connections can be made to the source server (2 required)
+INFO: creating directory "/var/lib/postgresql/12/main"...
+DEBUG: create_replication_slot(): creating slot "repmgr_slot_2" on upstream
+NOTICE: starting backup (using pg_basebackup)...
+HINT: this may take some time; consider using the -c/--fast-checkpoint option
+INFO: executing:
+  /usr/lib/postgresql/12/bin/pg_basebackup -l "repmgr base backup"  -D /var/lib/postgresql/12/main -h server1 -p 5432 -U repmgr -X stream -S repmgr_slot_2
+DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=server1 fallback_application_name=repmgr"
+NOTICE: standby clone (using pg_basebackup) complete
+NOTICE: you can now start your PostgreSQL server
+HINT: for example: sudo systemctl start postgresql
+HINT: after starting the server, you need to re-register this standby with "repmgr standby register --force" to update the existing node record
+postgres@server2:~$
+</Code>
+. Startup the postgresql
+. Re-register
+<Code:bash|Startup PostgreSQL and Re-register>
+postgres@server2:~$ logout
+root@server2:/var/log/postgresql# service postgresql start
+root@server2:/var/log/postgresql# su - postgres
+postgres@server2:~$ repmgr standby register --force
+INFO: connecting to local node "server2" (ID: 2)
+DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=server2 fallback_application_name=repmgr"
+INFO: connecting to primary database
+DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=server1 fallback_application_name=repmgr"
+INFO: standby registration complete
+postgres@server2:~$
+</Code>
+Now we can check our status as follows:
+<Code:bash|Check the newly recovered cluster>
+postgres@server2:~$ repmgr cluster show
+ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
+----+---------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------
+  | server1 | primary | * running |          | default  | 100      | 3        | host=server1 user=repmgr dbname=repmgr
+  | server2 | standby |   running | server1  | default  | 100      | 3        | host=server2 user=repmgr dbname=repmgr
+postgres@server2:~$
+</Code>
+For that configuration I have used the following appendix:
+===Appendix===
+<Code:bash|repmgrd.conf>
+node_id='2'                    # A unique integer greater than zero (1/2 depends on the node)
+node_name='server2'                       # An arbitrary (but unique) string; we recommend. (unique number for the cluster)
+conninfo='host=server1 user=repmgr dbname=repmgr connect_timeout=2'         # Database connection information as a conninfo string.
+data_directory='/var/lib/postgresql/12/main'             # The node's data directory. This is needed by repmgr
+config_directory='/etc/postgresql/12/main'               # If configuration files are located outside the data
+replication_user='repmgr'        # User to make replication connections with, if not set
+replication_type='physical'      # Must be one of "physical" or "bdr".
+location='default'               # An arbitrary string defining the location of the node; this
+use_replication_slots='yes'      # whether to use physical replication slots
+witness_sync_interval=15         # interval (in seconds) to synchronise node records
+log_level='DEBUG'                 # Log level: possible values are DEBUG, INFO, NOTICE,
+log_facility='STDERR'            # Logging facility: possible values are STDERR, or for
+log_file='/var/log/postgresql/repmgr.log'
+log_status_interval=300  # interval (in seconds) for repmgrd to log a status message
+pg_bindir='/usr/lib/postgresql/12/bin'                          # Path to PostgreSQL binary directory (location
+repmgr_bindir='/usr/lib/postgresql/12/bin'                      # Path to repmgr binary directory (location of the repmgr
+use_primary_conninfo_password=false     # explicitly set "password" in "primary_conninfo"
+passfile='/var/lib/postgresql/12/main/.pgpass'       # path to .pgpass file to include in "primary_conninfo"
+ssh_options='-o StrictHostKeyChecking=no'       # Options to append to "ssh"
+promote_check_timeout=60                # The length of time (in seconds) to wait
+promote_check_interval=1                # The interval (in seconds) to check whether
+primary_follow_timeout=60               # The max length of time (in seconds) to wait
+standby_follow_timeout=15               # The max length of time (in seconds) to wait
+shutdown_check_timeout=60               # The max length of time (in seconds) to wait for the demotion
+standby_reconnect_timeout=60            # The max length of time (in seconds) to wait
+node_rejoin_timeout=60          # The maximum length of time (in seconds) to wait for
+failover='automatic'                    # one of 'automatic', 'manual'.
+priority=100                            # indicates a preferred priority for promoting nodes;
+reconnect_attempts=6                    # Number of attempts which will be made to reconnect to an unreachable
+reconnect_interval=10                   # Interval between attempts to reconnect to an unreachable
+promote_command='repmgr standby promote -f /etc/repmgr.conf --log-to-file'  # command repmgrd executes when promoting a new primary; use something like:
+follow_command='repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'  # command repmgrd executes when instructing a standby to follow a new primary;
+primary_notification_timeout=60                 # Interval (in seconds) which repmgrd on a standby
+repmgrd_standby_startup_timeout=60              # Interval (in seconds) which repmgrd on a standby will wait
+monitoring_history=no           # Whether to write monitoring data to the "montoring_history" table
+monitor_interval_secs=2                 # Interval (in seconds) at which to write monitoring data
+degraded_monitoring_timeout=-1          # Interval (in seconds) after which repmgrd will terminate if the
+async_query_timeout=60          # Interval (in seconds) which repmgrd will wait before
+standby_disconnect_on_failover=false    # If "true", in a failover situation wait for all standbys to
+primary_visibility_consensus=false      # If "true", only continue with failover if no standbys have seen
+service_start_command ='sudo systemctl start postgresql'
+service_stop_command ='sudo systemctl stop postgresql'
+service_restart_command ='sudo systemctl restart postgresql'
+service_reload_command ='sudo systemctl reload postgresql'
+archive_ready_warning=16                # repmgr node check --archive-ready
+archive_ready_critical=128              #
+replication_lag_warning=300             # repmgr node check --replication-lag
+replication_lag_critical=600            #
+bdr_local_monitoring_only=false         # Only monitor the local node; no checks will be
+bdr_recovery_timeout=30                 # If a BDR node was offline and has become available
+</Code>
 ====Manual Failover====