BDR Replication lag
Job details
| Name: | BDR Replication lag |
| Platform: | Postgres |
| Category: | Cluster |
| Description: | Checks for BDR replication lag |
| Long description: | Task analyses BDR replication lag |
| Version: | 0.3 |
| Default schedule: | 0,10,20,30,40,50 * * * |
| Requires engine install: | Yes |
| Compatibility tag: | .[type='instance' & databasetype='postgres']/.[hasengine='YES' & replication_used > 0] |
Parameters
| Name | Default value | Description |
| keep data for | 7 | The number of days to keep the data for. |
| max lag bytes warning | 1000000 | Replication lag, in bytes, above which a warning is raised. |
| max lag bytes alarm | 2000000 | Replication lag, in bytes, above which an alarm is raised. |
Job Summary
- Purpose: This job monitors replication lag in PostgreSQL clusters that use Bi-Directional Replication (BDR).
- Why: Monitoring BDR replication lag is critical for maintaining the health, performance, and consistency of data across replicated database clusters. It lets replication problems be detected and mitigated before they degrade system performance or lead to data discrepancies.
- Manual checking: You can check the BDR replication lag manually by executing the following SQL command on the database. The function and column names are those of PostgreSQL 9.x; from PostgreSQL 10 onwards they are named pg_wal_lsn_diff, pg_current_wal_insert_lsn() and flush_lsn.
    SELECT pg_xlog_location_diff(pg_current_xlog_insert_location(), flush_location) AS lag_bytes,
           pid, application_name, usesysid, client_addr, client_port,
           backend_start, state, sync_state,
           CURRENT_TIMESTAMP AS currtime
    FROM pg_stat_replication;
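To make the lag_bytes value concrete, the following minimal Python sketch reproduces the arithmetic behind the location difference: a WAL location is printed as two hexadecimal numbers separated by a slash, and the lag is the difference between the two resulting byte positions. The helper names lsn_to_int and lag_bytes and the sample locations are hypothetical and used only for illustration; they are not part of the job.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual WAL location such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    # The number before the slash holds the upper 32 bits,
    # the number after the slash holds the lower 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(master_insert_location: str, replica_flush_location: str) -> int:
    """Bytes of WAL written on the master that the replica has not yet flushed."""
    return lsn_to_int(master_insert_location) - lsn_to_int(replica_flush_location)


# Hypothetical locations, chosen only to show the calculation.
print(lag_bytes("0/3000060", "0/3000000"))  # 0x60 = 96 bytes
```

A result of 0 means the replica has flushed everything the master has written so far.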
Detailed Description
The job tracks the difference, in bytes, between the master's current write-ahead log insert position and the position each replica has flushed. It compares this lag with the configured warning and alarm thresholds; a value above either threshold can indicate a problem in the replication process.
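The threshold evaluation described above can be sketched as follows. This is an illustration under assumptions, not the job's actual code: the defaults come from the parameter table, the status labels "OK", "WARNING" and "ALARM" and the function names are hypothetical, and a strict "greater than" comparison is assumed because the text speaks of lag that exceeds a threshold.

```python
DEFAULT_WARNING_BYTES = 1_000_000  # default of "max lag bytes warning"
DEFAULT_ALARM_BYTES = 2_000_000    # default of "max lag bytes alarm"


def classify_lag(lag_bytes: int,
                 warning: int = DEFAULT_WARNING_BYTES,
                 alarm: int = DEFAULT_ALARM_BYTES) -> str:
    """Return a hypothetical status label for a single replica's lag."""
    # Check the alarm threshold first so that the more severe status wins.
    if lag_bytes > alarm:
        return "ALARM"
    if lag_bytes > warning:
        return "WARNING"
    return "OK"


def worst_status(lags: list[int]) -> str:
    """Return the most severe status across all replicas."""
    severity = {"OK": 0, "WARNING": 1, "ALARM": 2}
    statuses = [classify_lag(lag) for lag in lags]
    # With no replicas reported there is nothing to flag.
    return max(statuses, key=severity.get, default="OK")
```

For example, worst_status([500, 1_500_000]) yields "WARNING", because one replica is above the warning threshold but none is above the alarm threshold.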
Implementation Details
- Stored Procedure: A PL/pgSQL function “dbw_bdr_replication_lag” is created. On each run it fetches the replication data, evaluates it against the configured thresholds, and stores the results in the historical table “dbw_bdr_replication_lag_histr” for audit and further analysis.
- Automation: The function is scheduled to run at predefined intervals (every 10 minutes with the default schedule). It also manages data retention by deleting historical entries older than the number of days set in the “keep data for” parameter (7 by default).
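The retention rule can be illustrated with a small Python sketch. It assumes each historical entry carries the sample timestamp under the name currtime, as in the manual query; the function name purge_old_rows and the dictionary layout are hypothetical, and the real cleanup is done inside the PL/pgSQL function.

```python
from datetime import datetime, timedelta


def purge_old_rows(rows: list[dict], now: datetime, keep_days: int = 7) -> list[dict]:
    """Keep only entries sampled within the last keep_days days."""
    cutoff = now - timedelta(days=keep_days)
    # Entries strictly older than the cutoff are dropped.
    return [row for row in rows if row["currtime"] >= cutoff]
```

With the default of 7 days, an entry sampled 8 days ago is removed on the next run, while one sampled 6 days ago is kept.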
Dependencies
The job relies on two primary components:
- A function defined in PostgreSQL that executes the core logic.
- The historical table “dbw_bdr_replication_lag_histr”, in which the function stores its results.
Reporting and Alerts
- The job uses a report template “BDR replication lag statistics” to visually present the data stored in “dbw_bdr_replication_lag_histr”.
- It includes graphs and tables displaying various attributes such as replication lag, process identifier, application name, and more, helping in quick visual analysis of the replication state over time.
- Alerts are generated based on the thresholds defined for warning and alarm conditions. The job updates the monitoring system with appropriate status and details regarding any potential issues detected during the check.
Impact and Importance
- This job helps database administrators (DBAs) ensure that the BDR mechanism works without significant lag that could affect the operation of the entire database cluster.
- It automates the check and provides up-to-date insight at every scheduled run, which reduces the manual monitoring effort and lets DBAs focus on strategic activities rather than routine checks.