Replica delay

Job details

Name:	Replica delay
Platform:	MariaDB
Category:	Cluster and Replication
Description:	This job checks how “late” the replica is by measuring the time difference in seconds between the replication SQL (applier) thread and the replication I/O (receiver) thread.
Long description:
Version:	1.5
Default schedule:	30s
Requires engine install:	No
Compatibility tag:	.[type='instance' & databasetype='mariadb']/instance[is_mariadb_branch='1']

Parameters

Name	Default value	Description
warning_threshold	120	Maximum number of seconds between the replication SQL (applier) thread and the replication I/O (receiver) thread before a warning is triggered.
alarm_threshold	600	Maximum number of seconds between the replication SQL (applier) thread and the replication I/O (receiver) thread before an alarm is triggered.
return_status_when_Replica_IO_not_running	2	Return status value (ALARM – 2, WARNING – 1, or OK – 0) when the replication I/O (receiver) thread is not started and/or has not connected successfully to the source.
return_status_when_Replica_SQL_not_running	1	Return status value (ALARM – 2, WARNING – 1, or OK – 0) when the replication SQL (applier) thread is not started.

Job Summary

Purpose: The purpose of this monitoring job is to track and manage the delay in replication between a master and a replica in a MariaDB environment.
Why: This job is important to ensure the data integrity and synchronization between the master database and its replicas. Monitoring the replication delay helps in identifying potential issues that can affect database performance and availability.
Manual checking: You can check this manually by issuing the following SQL command on the replica and inspecting the Seconds_Behind_Master, Slave_IO_Running, and Slave_SQL_Running columns:

SHOW SLAVE STATUS;

Job Details

Name: Replica delay
Version: 1.5
Provider: dbwatch.no
Group: com.dbwatch.job
Artifact ID: mariadb_replica_delay
Category: Cluster and Replication
Compatibility: This job is compatible with MariaDB instances that employ replication features.

Monitoring Details

Description: This job checks how “late” the replica is by measuring the time difference in seconds between the replication SQL (applier) thread and the replication I/O (receiver) thread.
Default Schedule: Every 30 seconds

Status Calculation

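The status is evaluated from the delay between the replication SQL (applier) thread and the replication I/O (receiver) thread.
A warning is issued if the delay exceeds warning_threshold (default 120 seconds) without exceeding alarm_threshold (default 600 seconds).
An alarm is triggered if the delay exceeds alarm_threshold (default 600 seconds).
If the replication I/O thread is not running or not connected to the source, the status is set from return_status_when_Replica_IO_not_running (default 2, ALARM); if the replication SQL thread is not running, it is set from return_status_when_Replica_SQL_not_running (default 1, WARNING).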
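The rules above can be illustrated with a minimal Python sketch. It is not the job's actual implementation. The function name replica_status is hypothetical, the input is assumed to be one row of SHOW SLAVE STATUS output supplied as a dictionary, and combining the individual checks by taking the most severe status is an assumption made for illustration.

```python
OK, WARNING, ALARM = 0, 1, 2


def replica_status(row, warning_threshold=120, alarm_threshold=600,
                   status_when_io_not_running=ALARM,
                   status_when_sql_not_running=WARNING):
    """Return (status, message) for one SHOW SLAVE STATUS row (a dict).

    Hypothetical helper: the defaults mirror the job parameters, but the
    way the checks are combined (most severe wins) is an assumption.
    """
    status = OK
    messages = []

    # The I/O (receiver) thread reports "Yes" only when it is running
    # and connected to the source.
    if row.get("Slave_IO_Running") != "Yes":
        status = max(status, status_when_io_not_running)
        messages.append("I/O thread is not running or not connected")

    # The SQL (applier) thread reports "Yes" when it is running.
    if row.get("Slave_SQL_Running") != "Yes":
        status = max(status, status_when_sql_not_running)
        messages.append("SQL thread is not running")

    # Seconds_Behind_Master is NULL (None here) when the delay is unknown.
    delay = row.get("Seconds_Behind_Master")
    if delay is not None:
        delay = int(delay)
        if delay > alarm_threshold:
            status = max(status, ALARM)
        elif delay > warning_threshold:
            status = max(status, WARNING)
        messages.append(f"replica is {delay} seconds behind")

    return status, "; ".join(messages)
```

For example, a row with both threads reporting "Yes" and a Seconds_Behind_Master of 300 yields WARNING (1) under the default thresholds, while a row whose Slave_IO_Running is "Connecting" yields ALARM (2) regardless of the delay.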

Output and Reporting

Field	Description
Status	Indicates the overall status based on the configured thresholds and the running state of the replication threads (OK, WARNING, ALARM).
Details	Provides specifics about the replication delay, including how many seconds the replica is behind the master and the operational state of the replication threads.

Alerting Logic

The alerting logic evaluates several conditions based on the operational state of the replication I/O and SQL threads and their connection to the source.
Messages and statuses are constructed to reflect various faults like non-running threads or issues in connection to the replication source.
The monitoring job leverages a custom JavaScript engine to process and evaluate the replication status dynamically, based on real-time data from the database.

This monitoring job is crucial for database administrators to keep a close eye on the health and performance of their MariaDB replicas, ensuring data consistency and timely troubleshooting of replication issues.