BDR Replication lag

Job details

Name:	BDR Replication lag
Platform:	Postgres
Category:	Cluster
Description:	Checks for BDR replication lag
Long description:	Task analyses BDR replication lag
Version:	0.3
Default schedule:	0,10,20,30,40,50 * * *
Requires engine install:	Yes
Compatibility tag:	.[type='instance' & databasetype='postgres']/.[hasengine='YES' & replication_used > 0]

Parameters

Name	Default value	Description
keep data for	7	The number of days to keep the data for.
max lag bytes warning	1000000	Maximum lag bytes for warning
max lag bytes alarm	2000000	Maximum lag bytes for alarm

Job Summary

Purpose: This job monitors replication lag in PostgreSQL clusters that use Bi-Directional Replication (BDR).
Why: Monitoring BDR replication lag is critical for maintaining the health, performance, and consistency of data across replicated database clusters. It allows issues to be detected and mitigated before they degrade system performance or cause data discrepancies.
Manual checking: You can check the BDR replication lag manually by running the following SQL query in the database:

SELECT pg_xlog_location_diff(pg_current_xlog_insert_location(), flush_location) AS lag_bytes,
       pid, application_name, usesysid, client_addr, client_port,
       backend_start, state, sync_state,
       CURRENT_TIMESTAMP AS currtime
FROM pg_stat_replication;
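The function and column names in the query above are the ones used before PostgreSQL 10. PostgreSQL 10 renamed them (pg_xlog_location_diff became pg_wal_lsn_diff, pg_current_xlog_insert_location became pg_current_wal_insert_lsn, and flush_location became flush_lsn). The following is a minimal sketch of choosing the matching query text by server version. The helper name lag_query is hypothetical, and whether the job itself performs such a switch is not stated here.

```python
def lag_query(server_version_num: int) -> str:
    """Return the manual lag query using the names valid for the given server.

    server_version_num is the integer reported by SHOW server_version_num,
    for example 90410 for 9.4.10 or 100004 for 10.4.
    """
    if server_version_num >= 100000:
        # PostgreSQL 10 and later: "xlog"/"location" became "wal"/"lsn".
        diff_fn = "pg_wal_lsn_diff"
        current_fn = "pg_current_wal_insert_lsn"
        flush_col = "flush_lsn"
    else:
        # Names used by the query in this document (pre-10 servers).
        diff_fn = "pg_xlog_location_diff"
        current_fn = "pg_current_xlog_insert_location"
        flush_col = "flush_location"
    return (
        f"SELECT {diff_fn}({current_fn}(), {flush_col}) AS lag_bytes, "
        "pid, application_name, usesysid, client_addr, client_port, "
        "backend_start, state, sync_state, CURRENT_TIMESTAMP AS currtime "
        "FROM pg_stat_replication;"
    )


if __name__ == "__main__":
    print(lag_query(100004))
```

The returned string can be pasted into psql or passed to any PostgreSQL client library.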

Detailed Description

The job tracks the difference in log positions between the master and each replica, expressed as lag in bytes. It compares this value against the configured warning and alarm thresholds (“max lag bytes warning” and “max lag bytes alarm”); a lag above either threshold could indicate a problem in the replication process.
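As an illustration of the arithmetic described above, the sketch below converts two log positions (LSNs, written as two hexadecimal halves such as 16/B374D848) to byte offsets, subtracts them, and compares the result with the default thresholds from the Parameters table. The function names, the status labels OK/WARNING/ALARM, and the use of a strict "greater than" comparison are assumptions for illustration; the job's actual PL/pgSQL logic may differ.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert an LSN like '16/B374D848' to an absolute byte position.

    The part before the slash is the high 32 bits, the part after it
    is the low 32 bits, both in hexadecimal.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(master_lsn: str, replica_flush_lsn: str) -> int:
    """Byte distance between the master position and the replica's flush position."""
    return lsn_to_bytes(master_lsn) - lsn_to_bytes(replica_flush_lsn)


def classify(lag: int, warning: int = 1_000_000, alarm: int = 2_000_000) -> str:
    """Map a lag value to a status label (labels are hypothetical)."""
    if lag > alarm:
        return "ALARM"
    if lag > warning:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    lag = lag_bytes("16/B374D848", "16/B3656606")
    print(lag, classify(lag))
```

Checking the alarm threshold first ensures that a lag above both limits is reported at the higher severity.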

Implementation Details

Stored Procedure: The job creates a PL/pgSQL function “dbw_bdr_replication_lag” that fetches replication data, evaluates it against the configured thresholds, and stores the results in the history table “dbw_bdr_replication_lag_histr” for audit and further analysis.
Automation: This function is scheduled to run at predefined intervals (every 10 minutes as per the default schedule). It also manages data retention by deleting history entries older than the number of days set in the “keep data for” parameter (7 by default).
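The retention rule can be pictured with the following minimal sketch, which keeps only samples newer than the cutoff derived from the “keep data for” parameter. The row layout and the field name sampled_at are hypothetical (the columns of “dbw_bdr_replication_lag_histr” are not listed here), and the real job performs this cleanup inside the database rather than in Python.

```python
from datetime import datetime, timedelta


def retention_cutoff(now: datetime, keep_days: int = 7) -> datetime:
    """Oldest timestamp that is still retained."""
    return now - timedelta(days=keep_days)


def prune(rows: list, now: datetime, keep_days: int = 7) -> list:
    """Return the rows whose sample time is at or after the cutoff.

    Each row is a dict with a hypothetical 'sampled_at' datetime field.
    """
    cutoff = retention_cutoff(now, keep_days)
    return [row for row in rows if row["sampled_at"] >= cutoff]


if __name__ == "__main__":
    now = datetime(2024, 1, 15, 12, 0)
    history = [
        {"sampled_at": datetime(2024, 1, 7, 12, 0), "lag_bytes": 500},
        {"sampled_at": datetime(2024, 1, 14, 12, 0), "lag_bytes": 1200},
    ]
    print(prune(history, now))
```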

Dependencies

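The job relies on two primary components:
- A PL/pgSQL function, “dbw_bdr_replication_lag”, defined in the PostgreSQL database, which executes the core logic.
- A history table, “dbw_bdr_replication_lag_histr”, which stores the results of each run.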

Reporting and Alerts

The job uses a report template “BDR replication lag statistics” to visually present the data stored in “dbw_bdr_replication_lag_histr”.
It includes graphs and tables displaying various attributes such as replication lag, process identifier, application name, and more, helping in quick visual analysis of the replication state over time.
Alerts are generated based on the thresholds defined for warning and alarm conditions. The job updates the monitoring system with appropriate status and details regarding any potential issues detected during the check.

Impact and Importance

This job helps database administrators (DBAs) ensure that BDR replication runs without significant lag that could affect the operation of the entire database cluster.
It automates the check and provides near-real-time insight (every 10 minutes by default), reducing manual monitoring effort and allowing DBAs to focus on strategic activities rather than routine checks.