Replication slave client alert

Job details

Name:	Replication slave client alert
Platform:	Postgres
Category:	Cluster
Description:	Checks for BDR replication lag
Long description:
Version:	1
Default schedule:	15m
Requires engine install:	No
Compatibility tag:	.[type='instance' & databasetype='postgres']/.[newer_than_ninetwo = '1' & maj_version = '9']

Parameters

Name	Default value	Description
warning_threshold	1000	Replication lag, in bytes, above which a warning is generated
alarm_threshold	5000	Replication lag, in bytes, above which an alarm is generated

Job Summary

Purpose: This job monitors the replication lag, in bytes, between the master and each slave client in a PostgreSQL replication setup.
Why: Replication lag that grows beyond acceptable limits means the slave is serving stale data and will be missing recent changes if it has to take over from the master. Tracking the lag helps catch such problems before they lead to data inconsistency or downtime.
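Manual checking: You can check the replication lag manually by running the following SQL query on the master. The query uses the sent_lsn and replay_lsn columns, which exist from PostgreSQL 10 onwards; on 9.x servers the equivalent columns are named sent_location and replay_location.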

SELECT client_hostname,
       client_addr,
       sent_offset - (replay_offset - (sent_xlog - replay_xlog) * 255 * 16 ^ 6) AS byte_lag
FROM (
    SELECT client_hostname,
           client_addr,
           ('x' || lpad(split_part(sent_lsn::text, '/', 1), 8, '0'))::bit(32)::bigint AS sent_xlog,
           ('x' || lpad(split_part(replay_lsn::text, '/', 1), 8, '0'))::bit(32)::bigint AS replay_xlog,
           ('x' || lpad(split_part(sent_lsn::text, '/', 2), 8, '0'))::bit(32)::bigint AS sent_offset,
           ('x' || lpad(split_part(replay_lsn::text, '/', 2), 8, '0'))::bit(32)::bigint AS replay_offset
    FROM pg_stat_replication
) AS s;
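
To make the arithmetic in the query easier to follow, here is a minimal JavaScript sketch of the same calculation. It assumes LSNs arrive as text in the usual "high/low" hexadecimal form; the helper names lsnToParts and byteLag are hypothetical and are not part of dbWatch or PostgreSQL.

```javascript
// Sketch only: mirrors the arithmetic of the SQL query above using BigInt.
// lsnToParts and byteLag are hypothetical helper names.

// Split an LSN such as "16/B374D848" into its high (xlog) and low (offset) parts.
function lsnToParts(lsn) {
  const [hi, lo] = lsn.split('/');
  return { xlog: BigInt('0x' + hi), offset: BigInt('0x' + lo) };
}

// Same formula as the query: the offset difference, plus the xlog difference
// scaled by 255 * 16^6 bytes, which is the factor the query itself uses.
function byteLag(sentLsn, replayLsn) {
  const sent = lsnToParts(sentLsn);
  const replay = lsnToParts(replayLsn);
  const XLOG_SPAN = 255n * 16n ** 6n;
  return sent.offset - (replay.offset - (sent.xlog - replay.xlog) * XLOG_SPAN);
}
```

For example, byteLag('0/3000148', '0/3000060') returns 232n, the difference between the two offsets when both LSNs share the same high part.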

Job Details

default-schedule: Every 15 minutes
valid-for: 15 minutes after execution
version: 1
company: dbwatch.com
artifactid: postgres_replication_slave_client

Replication Lag Monitoring Logic

The job uses SQL to retrieve data from the PostgreSQL “pg_stat_replication” view, which provides the client hostname, the client address, and the log sequence numbers (LSNs) from which the replication lag is calculated. It then uses JavaScript to process the results. The JavaScript logic performs the following checks:

Retrieves the client hostname, client address, and byte lag for each replication connection.
Compares the byte lag against the defined thresholds to set a warning or alarm status:
If the byte lag exceeds the “warning_threshold”, sets the status to 1 (Warning).
If the byte lag exceeds the “alarm_threshold”, sets the status to Alarm.
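
The threshold comparison described above can be sketched roughly as follows. This is an illustration, not the job's actual source: the function names are hypothetical, only the Warning code (1) comes from the text, and the codes used here for OK (0) and Alarm (2), as well as the choice to let the worst connection decide the overall status, are assumptions.

```javascript
// Sketch only: classifyLag, overallStatus and the constants are illustrative names.
const STATUS_OK = 0;       // assumption: code for "no problem"
const STATUS_WARNING = 1;  // stated in the text above
const STATUS_ALARM = 2;    // assumption: the source does not give the alarm code

// Classify one connection's lag; "exceeds" is read as strictly greater than.
// Defaults match the job's documented parameter defaults.
function classifyLag(byteLag, warningThreshold = 1000, alarmThreshold = 5000) {
  if (byteLag > alarmThreshold) return STATUS_ALARM;
  if (byteLag > warningThreshold) return STATUS_WARNING;
  return STATUS_OK;
}

// Assumption: when several replication connections exist, the worst status wins.
function overallStatus(rows, warningThreshold, alarmThreshold) {
  return rows.reduce(
    (worst, row) =>
      Math.max(worst, classifyLag(row.byte_lag, warningThreshold, alarmThreshold)),
    STATUS_OK
  );
}
```

Checking the alarm threshold first keeps the two comparisons independent, so a lag above both thresholds is reported as an alarm rather than a warning.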

Replication Monitoring Report Generation

Under the dbwatch-report-template, a report named “Replication info” is prepared. It lists, for each replication client, the username, application name, client address, client hostname, backend start time, current state, and synchronization state. The resulting report has the following tabular structure:

Username	Application name	Client address	Client hostname	Backend start	State	Sync state
data	data	data	data	data	data	data
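
As a rough illustration of how such a table could be assembled, the sketch below pairs each report header with a pg_stat_replication column and renders rows as tab-separated lines. The pairing is an assumption based on the column names of that view, and REPORT_COLUMNS and renderReport are hypothetical names, not part of the report template.

```javascript
// Sketch only: report headers paired with the pg_stat_replication columns
// they presumably come from (the pairing is an assumption).
const REPORT_COLUMNS = [
  ['Username', 'usename'],
  ['Application name', 'application_name'],
  ['Client address', 'client_addr'],
  ['Client hostname', 'client_hostname'],
  ['Backend start', 'backend_start'],
  ['State', 'state'],
  ['Sync state', 'sync_state'],
];

// Render rows (plain objects keyed by column name) as tab-separated lines,
// with the header line first. Missing values become empty cells.
function renderReport(rows) {
  const header = REPORT_COLUMNS.map(([title]) => title).join('\t');
  const body = rows.map((row) =>
    REPORT_COLUMNS.map(([, column]) => String(row[column] ?? '')).join('\t')
  );
  return [header, ...body].join('\n');
}
```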

This report is generated hourly as specified in the ‘default-schedule’ of the report template, ensuring up-to-date replication monitoring information is available.